It turns out that commit 3229c18c0d6b2 'Fixes to "Implement iomap for
block_map"' introduced another bug in gfs2_iomap_begin that can cause
gfs2_block_map to set bh->b_size of an actual buffer to 0. This can
lead to arbitrary incorrect behavior including crashes or disk
corruption. Revert the incorrect part of that commit.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
It turns out that commit 3974320ca6 "Implement iomap for block_map"
introduced a few bugs that trigger occasional failures with xfstest
generic/476:
In gfs2_iomap_begin, we jump to do_alloc when we determine that we are
beyond the end of the allocated metadata (height > ip->i_height).
There, we can end up calling hole_size with a metapath that doesn't
match the current metadata tree, which doesn't make sense. After
untangling the code at do_alloc, fix this by checking if the block we
are looking for is within the range of allocated metadata.
In addition, add a BUG() in case gfs2_iomap_begin is accidentally called
for reading stuffed files: this is handled separately. Make sure we
don't truncate iomap->length for reads beyond the end of the file; in
that case, the entire range counts as a hole.
Finally, revert to taking a bitmap write lock when doing allocations.
It's unclear why that change didn't lead to any failures during testing.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add a small inline function for computing the maximum size of a stuffed
inode instead of open coding that in several places throughout the code.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Implement the top-level bits of punching a hole into a file.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add an upper bound to the range of blocks to deallocate blocks to
function trunc_dealloc so that this function can be used for truncating
a file as well as for punching a hole into a file.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Pull the code for computing the range of metapointers to iterate out of
gfs2_metapath_ra (for readahead), sweep_bh_for_rgrps (for deallocating
metapointers within a block), and trunc_dealloc (for walking the
metadata tree).
In sweep_bh_for_rgrps, move the code for looking up the resource group
descriptor of the current resource group out of the inner loop. The
metatype check moves to trunc_dealloc.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Turn gfs2_block_truncate_page into a function that zeroes a range within
a block rather than only the end of a block. This will be used for
cleaning the end of the first partial block and the start of the last
partial block when punching a hole in a file.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
In rare cases, the current non-recursive delete algorithm doesn't
deallocate empty intermediary indirect blocks. This should have very
little practical effect, but deallocating all blocks correctly should
still be preferable as it is cleaner and easier to validate.
The fix consists of using the first block to deallocate to compute the
start marker of the truncate point instead of the last block that needs
to be kept. With that change, computing which indirect blocks are still
needed becomes relatively easy.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The metadata read-ahead algorithm broke when switching from recursive to
non-recursive delete: the current algorithm reads ahead blocks at height
N - 1 while deallocating the blocks at hight N. However, deallocating
the blocks at height N requires a complete walk of the metadata tree,
not only down to height N - 1. Consequently, all blocks below height
N - 1 will be accessed without read-ahead.
Fix this by issuing read-aheads as early as possible, after each
metapath lookup.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Split out the entire lookup loop from lookup_metapath and
fillup_metapath. Make both functions return the actual height in
mp->mp_aheight, and return 0 on success. Handle lookup errors properly
in trunc_dealloc.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
First, this function truncates the file in chunks. When the original
file size isn't block aligned, each chunk that is truncated will remain
be misaligned. This is inefficient.
Second, this function doesn't recognize where holes are, so it loops
through them. For each chunk of a hole, it creates a new transaction.
At least avoid creating another transactions whe the current one is
still empty. (An better fix would be to skip large holes, of course.)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Document when to use gfs2_blk2rgrpd for "inexact" resource group
matching. Based on that, fix an incorrect use of gfs2_blk2rgrpd in
sweep_bh_for_rgrps.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch implements iomap for block mapping, and switches the
block_map function to use it under the covers.
The additional IOMAP_F_BOUNDARY iomap flag indicates when iomap has
reached a "metadata boundary" and fetching the next mapping is likely to
incur an additional I/O. This flag is used for setting the bh buffer
boundary flag.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch eliminates height parameters from function gfs2_bmap_alloc.
Function find_metapath determines the metapath's "find height", also
known as the desired height. Function lookup_metapath determines the
metapath's "actual height", previously known as starting height or
sheight. Function gfs2_bmap_alloc now gets both height values from
the metapath. This simplification was done as a step toward switching
the block_map functions to using iomap. The bh_map responsibilities
are also removed from function gfs2_bmap_alloc for the same reason.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add a comment about the logical block size for directories. Rename
"bsize" in gfs2_block_map to "factor". Fix a typo in the description of
metaptr1.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch if you truncated a file to a smaller size it
wasn't freeing all the blocks properly. There are two reasons.
First, the metapath comparison was not comparing previous heights.
I added a function, mp_eq_to_hgt, which checks the metapath at
all heights prior to the target height.
Second, in function find_nonnull_ptr, it needed to zero out all
pointers for heights following the target height. Translated into
decimal integer terms, this way a number like 299, when incremented,
becomes 300, not 399. The 2 gets incremented to 3, and the following
digits need to be reset.
These two things allow the truncate state machine to properly find
the blocks it needs to delete.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When gfs2 does metadata I/O, only REQ_META is used as a metadata hint of
the bio. But flag REQ_META is just a hint for block trace, not for block
layer code to handle a bio as metadata request.
For some of metadata I/Os of gfs2, A REQ_PRIO flag on the metadata bio
would be very informative to block layer code. For example, if bcache is
used as a I/O cache for gfs2, it will be possible for bcache code to get
the hint and cache the pre-fetched metadata blocks on cache device. This
behavior may be helpful to improve metadata I/O performance if the
following requests hit the cache.
Here are the locations in gfs2 code where a REQ_PRIO flag should be added,
- All places where REQ_READAHEAD is used, gfs2 code uses this flag for
metadata read ahead.
- In gfs2_meta_rq() where the first metadata block is read in.
- In gfs2_write_buf_to_page(), read in quota metadata blocks to have them
up to date.
These metadata blocks are probably to be accessed again in future, adding
a REQ_PRIO flag may have bcache to keep such metadata in fast cache
device. For system without a cache layer, REQ_PRIO can still provide hint
to block layer to handle metadata requests more properly.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
1. Andreas Gruenbacher has four patches related to cleaning up the GFS2
inode evict process. This is about half of his patches designed to
fix a long-standing GFS2 hang related to the inode shrinker.
(Shrinker calls gfs2 evict, evict calls DLM, DLM requires memory
and blocks on the shrinker.) These 4 patches have been well tested.
His second set of patches are still being tested, so I plan to hold
them until the next merge window, after we have more weeks of testing.
The first patch eliminates the flush_delayed_work, which can block.
2. Andreas's second patch protects setting of gl_object for rgrps with
a spin_lock to prevent proven races.
3. His third patch introduces a centralized mechanism for queueing glock
work with better reference counting, to prevent more races.
4. His fourth patch retains a reference to inode glocks when an error
occurs while creating an inode. This keeps the subsequent evict from
needing to reacquire the glock, which might call into DLM and block
in low memory conditions.
5. Arvind Yadav has a patch to add const to attribute_group structures.
6. I have a patch to detect directory entry inconsistencies and withdraw
the file system if any are found. Better that than silent corruption.
7. I have a patch to remove a vestigial variable from glock structures,
saving some slab space.
8. I have another patch to remove a vestigial variable from the GFS2
in-core superblock structure.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZXOIfAAoJENeLYdPf93o7RVcH/jLEK3hmZOd94pDTYg3Damuo
KI3xjyutDgQT83uwg8p5UBPwRYCDnyiOLwOWGBJJvjPEI1S4syrXq/FzOmxmX6cV
nE28ARL/OXCoFEXBMUVHvHL3nK+zEUr8rO6Xz51B1ifVq7GV8iVK+ZgxzRhx0PWP
f+0SVHiQtU0HKyxR5y9p43oygtHZaGbjy4WL0YbmFZM59y5q9A8rBHFACn2JyPBm
/zXN6gF/Orao+BDXLT6OM3vNXZcOQ7FUPWwctguHsAO/bLzWiISyfJxLWJsHvSdW
tzFTN1DByjXvqAhs4HTSuh9JfBDAyxcXkmczXJyATBkCTEJv42Iev+ILmre+wwQ=
=YTwn
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.13.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"We've got eight GFS2 patches for this merge window:
- Andreas Gruenbacher has four patches related to cleaning up the
GFS2 inode evict process. This is about half of his patches
designed to fix a long-standing GFS2 hang related to the inode
shrinker: Shrinker calls gfs2 evict, evict calls DLM, DLM requires
memory and blocks on the shrinker.
These four patches have been well tested. His second set of patches
are still being tested, so I plan to hold them until the next merge
window, after we have more weeks of testing. The first patch
eliminates the flush_delayed_work, which can block.
- Andreas's second patch protects setting of gl_object for rgrps with
a spin_lock to prevent proven races.
- His third patch introduces a centralized mechanism for queueing
glock work with better reference counting, to prevent more races.
-His fourth patch retains a reference to inode glocks when an error
occurs while creating an inode. This keeps the subsequent evict
from needing to reacquire the glock, which might call into DLM and
block in low memory conditions.
- Arvind Yadav has a patch to add const to attribute_group
structures.
- I have a patch to detect directory entry inconsistencies and
withdraw the file system if any are found. Better that than silent
corruption.
- I have a patch to remove a vestigial variable from glock
structures, saving some slab space.
- I have another patch to remove a vestigial variable from the GFS2
in-core superblock structure"
* tag 'gfs2-4.13.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: constify attribute_group structures.
gfs2: gfs2_create_inode: Keep glock across iput
gfs2: Clean up glock work enqueuing
gfs2: Protect gl->gl_object by spin lock
gfs2: Get rid of flush_delayed_work in gfs2_evict_inode
GFS2: Eliminate vestigial sd_log_flush_wrapped
GFS2: Remove gl_list from glock structure
GFS2: Withdraw when directory entry inconsistencies are detected
Put all remaining accesses to gl->gl_object under the
gl->gl_lockref.lock spinlock to prevent races.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Implement truncate/delete as a non-recursive algorithm. The older
algorithm was implemented with recursion to strip off each layer
at a time (going by height, starting with the maximum height.
This version tries to do the same thing but without recursion,
and without needing to allocate new structures or lists in memory.
For example, say you want to truncate a very large file to 1 byte,
and its end-of-file metapath is: 0.505.463.428. The starting
metapath would be 0.0.0.0. Since it's a truncate to non-zero, it
needs to preserve that byte, and all metadata pointing to it.
So it would start at 0.0.0.0, look up all its metadata buffers,
then free all data blocks pointed to at the highest level.
After that buffer is "swept", it moves on to 0.0.0.1, then
0.0.0.2, etc., reading in buffers and sweeping them clean.
When it gets to the end of the 0.0.0 metadata buffer (for 4K
blocks the last valid one is 0.0.0.508), it backs up to the
previous height and starts working on 0.0.1.0, then 0.0.1.1,
and so forth. After it reaches the end and sweeps 0.0.1.508,
it continues with 0.0.2.0, and so on. When that height is
exhausted, and it reaches 0.0.508.508 it backs up another level,
to 0.1.0.0, then 0.1.0.1, through 0.1.0.508. So it has to keep
marching backwards and forwards through the metadata until it's
all swept clean. Once it has all the data blocks freed, it
lowers the strip height, and begins the process all over again,
but with one less height. This time it sweeps 0.0.0 through
0.505.463. When that's clean, it lowers the strip height again
and works to free 0.505. Eventually it strips the lowest height, 0.
For a delete or truncate to 0, all metadata for all heights of
0.0.0.0 would be freed. For a truncate to 1 byte, 0.0.0.0 would
be preserved.
This isn't much different from normal integer incrementing,
where an integer gets incremented from 0000 (0.0.0.0) to 3021
(3.0.2.1). So 0000 gets increments to 0001, 0002, up to 0009,
then on to 0010, 0011 up to 0099, then 0100 and so forth. It's
just that each "digit" goes from 0 to 508 (for a total of 509
pointers) rather than from 0 to 9.
Note that the dinode will only have 483 pointers due to the
dinode structure itself.
Also note: this is just an example. These numbers (509 and 483)
are based on a standard 4K block size. Smaller block sizes will
yield smaller numbers of indirect pointers accordingly.
The truncation process is accomplished with the help of two
major functions and a few helper functions.
Functions do_strip and recursive_scan are obsolete, so removed.
New function sweep_bh_for_rgrps cleans a buffer_head pointed to
by the given metapath and height. By cleaning, I mean it frees
all blocks starting at the offset passed in metapath. It starts
at the first block in the buffer pointed to by the metapath and
identifies its resource group (rgrp). From there it frees all
subsequent block pointers that lie within that rgrp. If it's
already inside a transaction, it stays within it as long as it
can. In other words, it doesn't close a transaction until it knows
it's freed what it can from the resource group. In this way,
multiple buffers may be cleaned in a single transaction, as long
as those blocks in the buffer all lie within the same rgrp.
If it's not in a transaction, it starts one. If the buffer_head
has references to blocks within multiple rgrps, it frees all the
blocks inside the first rgrp it finds, then closes the
transaction. Then it repeats the cycle: identifies the next
unfreed block, uses it to find its rgrp, then starts a new
transaction for that set. It repeats this process repeatedly
until the buffer_head contains no more references to any blocks
past the given metapath.
Function trunc_dealloc has been reworked into a finite state
automaton. It has basically 3 active states:
DEALLOC_MP_FULL, DEALLOC_MP_LOWER, and DEALLOC_FILL_MP:
The DEALLOC_MP_FULL state implies the metapath has a full set
of buffers out to the "shrink height", and therefore, it can
call function sweep_bh_for_rgrps to free the blocks within the
highest height of the metapath. If it's just swept the lowest
level (or an error has occurred) the state machine is ended.
Otherwise it proceeds to the DEALLOC_MP_LOWER state.
The DEALLOC_MP_LOWER state implies we are finished with a given
buffer_head, which may now be released, and therefore we are
then missing some buffer information from the metapath. So we
need to find more buffers to read in. In most cases, this is
just a matter of releasing the buffer_head and moving to the
next pointer from the previous height, so it may be read in and
swept as well. If it can't find another non-null pointer to
process, it checks whether it's reached the end of a height
and needs to lower the strip height, or whether it still needs
move forward through the previous height's metadata. In this
state, all zero-pointers are skipped. From this state, it can
only loop around (once more backing up another height) or,
once a valid metapath is found (one that has non-zero
pointers), proceed to state DEALLOC_FILL_MP.
The DEALLOC_FILL_MP state implies that we have a metapath
but not all its buffers are read in. So we must proceed to read
in buffer_heads until the metapath has a valid buffer for every
height. If the previous state backed us up 3 heights, we may
need to read in a buffer, increment the height, then repeat the
process until buffers have been read in for all required heights.
If it's successful reading a buffer, and it's at the highest
height we need, it proceeds back to the DEALLOC_MP_FULL state.
If it's unable to fill in a buffer, (encounters a hole, etc.)
it tries to find another non-zero block pointer. If they're all
zero, it lowers the height and returns to the DEALLOC_MP_LOWER
state. If it finds a good non-null pointer, it loops around and
reads it in, while keeping the metapath in lock-step with the
pointers it examines.
The state machine runs until the truncation request is
satisfied. Then any transactions are ended, the quota and
statfs data are updated, and the function is complete.
Helper function metaptr1 was introduced to be an easy way to
determine the start of a buffer_head's indirect pointers.
Helper function lookup_mp_height was introduced to find a
metapath index and read in the buffer that corresponds to it.
In this way, function lookup_metapath becomes a simple loop to
call it for every height.
Helper function fillup_metapath is similar to lookup_metapath
except it can do partial lookups. If the state machine
backed up multiple levels (like 2999 wrapping to 3000) it
needs to find out the next starting point and start issuing
metadata reads at that point.
Helper function hptrs is a shortcut to determine how many
pointers should be expected in a buffer. Height 0 is the dinode
which has fewer pointers than the others.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
1. Andy Price submitted a patch to make gfs2_write_full_page a
static function.
2. Dan Carpenter submitted a patch to fix a ERR_PTR thinko.
I've also got a few patches, three of which fix bugs related to
deleting very large files, which cause GFS2 to run out of
journal space:
3. The first one prevents GFS2 delete operation from requesting too
much journal space.
4. The second one fixes a problem whereby GFS2 can hang because it
wasn't taking journal space demand into its calculations.
5. The third one wakes up IO waiters when a flush is done to restart
processes stuck waiting for journal space to become available.
The other three patches are a performance improvement related to
spin_lock contention between multiple writers:
6. The "tr_touched" variable was switched to a flag to be more atomic
and eliminate the possibility of some races.
7. Function meta_lo_add was moved inline with its only caller to make
the code more readable and efficient.
8. Contention on the gfs2_log_lock spinlock was greatly reduced by
avoiding the lock altogether in cases where we don't really need
it: buffers that already appear in the appropriate metadata list
for the journal. Many thanks to Steve Whitehouse for the ideas and
principles behind these patches.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJYrEEEAAoJENeLYdPf93o7bjoIAIqPG/EAzi+idgMWDPQa9Eit
53dPy16snkrbWwtaK6spSWlH6bGYuHeanXORYon9bvtVjKYaa4NQclGihN2IE6uB
O8zT+MGwP45LDhNplVJpumaOALZ9ZDqQSe+3tHeNK3FhNirLyiIjSqrHt/7Yi1qi
fPLlT4Jx0TBo5rhvEGa7Yg01WhWVtnmVSMqJXj/7ZtC50s1aPyDUikdNIDfDCN2X
LxfKGDXuk6p63VQ6JKqYSBVATCR0/bbKfkuk/kBUTYLoHoapImxB8d0HgIdsh1Mv
9PlbZnnNW8k5oapuhVxjl0T5G0JsQgCkPb/wlte+ryOCjBoc2L2fCUV5qc0QxWc=
=xQyl
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Robert Peterson:
"We've got eight GFS2 patches for this merge window:
- Andy Price submitted a patch to make gfs2_write_full_page a static
function.
- Dan Carpenter submitted a patch to fix a ERR_PTR thinko.
Three patches fix bugs related to deleting very large files, which
cause GFS2 to run out of journal space:
- The first one prevents GFS2 delete operation from requesting too
much journal space.
- The second one fixes a problem whereby GFS2 can hang because it
wasn't taking journal space demand into its calculations.
- The third one wakes up IO waiters when a flush is done to restart
processes stuck waiting for journal space to become available.
The final three patches are a performance improvement related to
spin_lock contention between multiple writers:
- The "tr_touched" variable was switched to a flag to be more atomic
and eliminate the possibility of some races.
- Function meta_lo_add was moved inline with its only caller to make
the code more readable and efficient.
- Contention on the gfs2_log_lock spinlock was greatly reduced by
avoiding the lock altogether in cases where we don't really need
it: buffers that already appear in the appropriate metadata list
for the journal. Many thanks to Steve Whitehouse for the ideas and
principles behind these patches"
* tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Make gfs2_write_full_page static
GFS2: Reduce contention on gfs2_log_lock
GFS2: Inline function meta_lo_add
GFS2: Switch tr_touched to flag in transaction
GFS2: Wake up io waiters whenever a flush is done
GFS2: Made logd daemon take into account log demand
GFS2: Limit number of transaction blocks requested for truncates
GFS2: Fix reference to ERR_PTR in gfs2_glock_iter_next
This patch limits the number of transaction blocks requested during
file truncates. If we have very large multi-terabyte files, and want
to delete or truncate them, they might span so many resource groups
that we overflow the journal blocks, and cause an assert failure.
By limiting the number of blocks in the transaction, we prevent this
overflow and give other running processes time to do transactions.
The limiting factor I chose is sd_log_thresh2 which is currently
set to 4/5ths of the journal. This same ratio is used in function
gfs2_ail_flush_reqd to determine when a log flush is required.
If we make the maximum value less than this, we can get into a
infinite hang whereby the log stops moving because the number of
used blocks is less than the threshold and the iterative loop
needs more, but since we're under the threshold, the log daemon
never starts any IO on the log.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.
CURRENT_TIME is also not y2038 safe.
This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.
Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Replace 1 << value shift by more explicit BIT() macro
Also fixes two bare unsigned definitions:
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
+ unsigned hsize = BIT(ip->i_depth);
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces. For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense. Any check for READA is replaced with an
explicit check for REQ_RAHEAD. Also remove the READA alias for
REQ_RAHEAD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has ll_rw_block users pass in the operation and flags separately,
so ll_rw_block can setup the bio op and bi_rw flags on the bio that
is submitted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has submit_bh users pass in the operation and flags separately,
so submit_bh_wbc can setup the bio op and bi_rw flags on the bio that
is submitted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before this patch, multi-block reservation structures were allocated
from a special slab. This patch folds the structure into the gfs2_inode
structure. The disadvantage is that the gfs2_inode needs more memory,
even when a file is opened read-only. The advantages are: (a) we don't
need the special slab and the extra time it takes to allocate and
deallocate from it. (b) we no longer need to worry that the structure
exists for things like quota management. (c) This also allows us to
remove the calls to get_write_access and put_write_access since we
know the structure will exist.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch basically reverts the majority of patch 5407e24.
That patch eliminated the gfs2_qadata structure in favor of just
using the reservations structure. The problem with doing that is that
it increases the size of the reservations structure. That is not an
issue until it comes time to fold the reservations structure into the
inode in memory so we know it's always there. By separating out the
quota structure again, we aren't punishing the non-quota users by
making all the inodes bigger, requiring more slab space. This patch
creates a new slab area to allocate the quota stuff so it's managed
a little more sanely.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Use struct gfs2_alloc_parms as an argument to gfs2_quota_check()
and gfs2_quota_lock_check() to check for quota violations while
accounting for the new blocks requested by the current operation
in ap->target.
Previously, the number of new blocks requested during an operation
were not accounted for during quota_check and would allow these
operations to exceed quota. This was not very apparent since most
operations allocated only 1 block at a time and quotas would get
violated in the next operation. i.e. quota excess would only be by
1 block or so. With fallocate, (where we allocate a bunch of blocks
at once) the quota excess is non-trivial and is addressed by this
patch.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
This patch changes some variables (especially maxlen in function
gfs2_block_map) from unsigned int to size_t. We need 64-bit arithmetic
for very large files (e.g. 1PB) where the variables otherwise get
shifted to all 0's.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Fix 2 typos and move one definition which was between function
comments and function definition (yet another kernel-doc warning)
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes a long standing issue in mapping the journal
extents. Most journals will consist of only a single extent,
and although the cache took account of that by merging extents,
it did not actually map large extents, but instead was doing a
block by block mapping. Since the journal was only being mapped
on mount, this was not normally noticeable.
With the updated code, it is now possible to use the same extent
mapping system during journal recovery (which will be added in a
later patch). This will allow checking of the integrity of the
journal before any reply of the journal content is attempted. For
this reason the code is moving to bmap.c, since it will be used
more widely in due course.
An exercise left for the reader is to compare the new function
gfs2_map_journal_extents() with gfs2_write_alloc_required()
Additionally, should there be a failure, the error reporting is
also updated to show more detail about what went wrong.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch adds a structure to contain allocation parameters with
the intention of future expansion of this structure. The idea is
that we should be able to add more information about the allocation
in the future in order to allow the allocator to make a better job
of placing the requests on-disk.
There is no functional difference from applying this patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The reservation for an inode should be cleared when it is truncated so
that we can start again at a different offset for future allocations.
We could try and do better than that, by resetting the search based on
where the truncation started from, but this is only a first step.
In addition, there are three callers of gfs2_rs_delete() but only one
of those should really be testing the value of i_writecount. While
we get away with that in the other cases currently, I think it would
be better if we made that test specific to the one case which
requires it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
truncate_pagecache() doesn't care about old size since commit
cedabed49b ("vfs: Fix vmtruncate() regression"). Let's drop it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a GFS2 file system is mounted with quotas and a file is grown
in such a way that its free blocks for the allocation are represented
in a secondary bitmap, GFS2 ran out of blocks in the transaction.
That resulted in "fatal: assertion "tr->tr_num_buf <= tr->tr_blocks".
This patch reserves extra blocks for the quota change so the
transaction has enough space.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch calls get_write_access in a few functions. This
merely increases inode->i_writecount for the duration of the function.
That will ensure that any file closes won't delete the inode's
multi-block reservation while the function is running.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The functions that delete block reservations from the rgrp block
reservations rbtree no longer use the ip parameter. This patch
eliminates the parameter.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Pull user namespace and namespace infrastructure changes from Eric W Biederman:
"This set of changes starts with a few small enhnacements to the user
namespace. reboot support, allowing more arbitrary mappings, and
support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
user namespace root.
I do my best to document that if you care about limiting your
unprivileged users that when you have the user namespace support
enabled you will need to enable memory control groups.
There is a minor bug fix to prevent overflowing the stack if someone
creates way too many user namespaces.
The bulk of the changes are a continuation of the kuid/kgid push down
work through the filesystems. These changes make using uids and gids
typesafe which ensures that these filesystems are safe to use when
multiple user namespaces are in use. The filesystems converted for
3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
changes for these filesystems were a little more involved so I split
the changes into smaller hopefully obviously correct changes.
XFS is the only filesystem that remains. I was hoping I could get
that in this release so that user namespace support would be enabled
with an allyesconfig or an allmodconfig but it looks like the xfs
changes need another couple of days before it they are ready."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
cifs: Enable building with user namespaces enabled.
cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
cifs: Convert struct cifs_sb_info to use kuids and kgids
cifs: Modify struct smb_vol to use kuids and kgids
cifs: Convert struct cifsFileInfo to use a kuid
cifs: Convert struct cifs_fattr to use kuid and kgids
cifs: Convert struct tcon_link to use a kuid.
cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
cifs: Convert from a kuid before printing current_fsuid
cifs: Use kuids and kgids SID to uid/gid mapping
cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
cifs: Override unmappable incoming uids and gids
nfsd: Enable building with user namespaces enabled.
nfsd: Properly compare and initialize kuids and kgids
nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
nfsd: Modify nfsd4_cb_sec to use kuids and kgids
nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
nfsd: Convert nfsxdr to use kuids and kgids
nfsd: Convert nfs3xdr to use kuids and kgids
...
Split NO_QUOTA_CHANGE into NO_UID_QUTOA_CHANGE and NO_GID_QUTOA_CHANGE
so the constants may be well typed.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This patch allocates a block reservation structure before growing
or shrinking a file. Without this structure, the grow or shink code
can reference the bad pointer.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Instead of using a list of buffers to write ahead of the journal
flush, this now uses a list of inodes and calls ->writepages
via filemap_fdatawrite() in order to achieve the same thing. For
most use cases this results in a shorter ordered write list,
as well as much larger i/os being issued.
The ordered write list is sorted by inode number before writing
in order to retain the disk block ordering between inodes as
per the previous code.
The previous ordered write code used to conflict in its assumptions
about how to write out the disk blocks with mpage_writepages()
so that with this updated version we can also use mpage_writepages()
for GFS2's ordered write, writepages implementation. So we will
also send larger i/os from writeback too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There is little common content in gfs2_trans_add_bh() between the data
and meta classes by the time that the functions which it calls are
taken into account. The intent here is to split this into two
separate functions. Stage one is to introduce gfs2_trans_add_data()
and gfs2_trans_add_meta() and update the callers accordingly.
Later patches will then pull in the content of gfs2_trans_add_bh()
and its dependent functions in order to clean up the code in this
area.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes an issue relating to not having enough revokes
available when truncating journaled data files. In order to ensure
that we do no run out, the truncation is broken into separate pieces
if it is large enough.
Tested using fsx on a journaled data file.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Just like ext3, this works on the root directory and any directory
with the +T flag set. Also, just like ext3, any subdirectory created
in one of the just mentioned cases will be allocated to a random
resource group (GFS2 equivalent of a block group).
If you are creating a set of directories, each of which will contain a
job running on a different node, then by setting +T on the parent
directory before creating the subdirectories, each will land up in a
different resource group, and thus resource group contention between
nodes will be kept to a minimum.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch introduces a new structure, gfs2_rbm, which is a
tuple of a resource group, a bitmap within the resource group
and an offset within that bitmap. This is designed to make
manipulating these sets of variables easier. There is also a
new helper function which converts this representation back
to a disk block address.
In addition, the rbtree nodes which are used for the reservations
were not being correctly initialised, which is now fixed. Also,
the tracing was not passing through the inode where it should
have been. That is mostly fixed aside from one corner case. This
needs to be revisited since there can also be a NULL rgrp in
some cases which results in the device being incorrect in the
trace.
This is intended to be the first step towards cleaning up some
of the allocation code, and some further bug fixes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch reduces GFS2 file fragmentation by pre-reserving blocks. The
resulting improved on disk layout greatly speeds up operations in cases
which would have resulted in interlaced allocation of blocks previously.
A typical example of this is 10 parallel dd processes, each writing to a
file in a common dirctory.
The implementation uses an rbtree of reservations attached to each
resource group (and each inode).
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch moves the ancillary quota data structures into the
block reservations structure. This saves GFS2 some time and
effort in allocating and deallocating the qadata structure.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
It turns out that the "new" parameter to function gfs2_meta_indirect_buffer
was always being passed in as zero. Therefore, this patch eliminates it
and simplifies the function.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In the future, the qadata structure will be eliminated and merged
back in with the block reservation structure, after we extend the
lifespan of that. This patch is a step forward in eliminating the
qadata structure. It adds a variable to the do_grow function to
determine when unstuffing is necessary, and has been done.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch removes the call from gfs2_blk2rgrd to function
gfs2_rindex_update and replaces it with individual calls.
The former way turned out to be too problematic.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch changes the page allocation in gfs2_block_truncate_page
and two others to GFP_NOFS to avoid deadlock in low-memory conditions.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch separates the code pertaining to allocations into two
parts: quota-related information and block reservations.
This patch also moves all the block reservation structure allocations to
function gfs2_inplace_reserve to simplify the code, and moves
the frees to function gfs2_inplace_release.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch is a revision of the one I previously posted.
I tried to integrate all the suggestions Steve gave.
The purpose of the patch is to change function gfs2_alloc_block
(allocate either a dinode block or an extent of data blocks)
to a more generic gfs2_alloc_blocks function that can
allocate both a dinode _and_ an extent of data blocks in the
same call. This will ultimately help us create a multi-block
reservation scheme to reduce file fragmentation.
This patch moves more toward a generic multi-block allocator that
takes a pointer to the number of data blocks to allocate, plus whether
or not to allocate a dinode. In theory, it could be called to allocate
(1) a single dinode block, (2) a group of one or more data blocks, or
(3) a dinode plus several data blocks.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
GFS2 functions gfs2_alloc_block and gfs2_alloc_di do basically
the same things, with a few exceptions. This patch combines
the two functions into a slightly more generic gfs2_alloc_block.
Having one centralized block allocation function will reduce
code redundancy and make it easier to implement multi-block
reservations to reduce file fragmentation in the future.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A potentially uninitialised variable, some unreachable code,
and the main part of this, fixing the error path in the
unlink function.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Move the recently added readahead of the indirect pointer
tree during deallocation into its own function in order
that we can use it elsewhere in the future. Also this
fixes the resetting of the "first" variable in the
original patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
GFS2's fallocate code currently goes through the page cache. Since it's only
writing to the end of the file or to holes in it, it doesn't need to, and it
was causing issues on low memory environments. This patch pulls in some of
Steve's block allocation work, and uses it to simply allocate the blocks for
the file, and zero them out at allocation time. It provides a slight
performance increase, and it dramatically simplifies the code.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch improves the performance of delete/unlink
operations in a GFS2 file system where the files are large
by adding a layer of metadata read-ahead for indirect blocks.
Mileage will vary, but on my system, deleting an 8.6G file
dropped from 22 seconds to about 4.5 seconds.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Each block which is deallocated, requires a call to gfs2_rlist_add()
and each of those calls was calling gfs2_blk2rgrpd() in order to
figure out which rgrp the block belonged in. This can be speeded up
by making use of the rgrp cached in the inode. We also reset this
cached rgrp in case the block has changed rgrp. This should provide
a big reduction in gfs2_blk2rgrpd() calls during deallocation.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The recursive_scan() function only ever takes a single "bc"
argument, so we might as well just call do_strip() directly
from resource_scan() rather than pass it in as an argument.
Also the "data" argument is always a struct strip_mine, so
we can pass that in, rather than using a void pointer.
This also moves do_strip() ahead of recursive_scan() so that
we don't need to add a prototype.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since we have ruled out supporting online filesystem shrink,
it is possible to make the resource group list append only
during the life of a super block. This gives several benefits:
Firstly, we only need to read new rindex elements as they are added
rather than needing to reread the whole rindex file each time one
element is added.
Secondly, the rindex glock can be held for much shorter periods of
time, and is completely removed from the fast path for allocations.
The lock is taken in shared mode only when updating the resource
groups when the first allocation occurs, and after a grow has
taken place.
Thirdly, this results in a reduction in code size, and everything
gets a lot simpler to understand in this area.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
isofs: Remove global fs lock
jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
mm/truncate.c: fix build for CONFIG_BLOCK not enabled
fs:update the NOTE of the file_operations structure
Remove dead code in dget_parent()
AFS: Fix silly characters in a comment
switch d_add_ci() to d_splice_alias() in "found negative" case as well
simplify gfs2_lookup()
jfs_lookup(): don't bother with . or ..
get rid of useless dget_parent() in btrfs rename() and link()
get rid of useless dget_parent() in fs/btrfs/ioctl.c
fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
drivers: fix up various ->llseek() implementations
fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
Ext4: handle SEEK_HOLE/SEEK_DATA generically
Btrfs: implement our own ->llseek
fs: add SEEK_HOLE and SEEK_DATA flags
reiserfs: make reiserfs default to barrier=flush
...
Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
shrinker callout for the inode cache, that clashed with the xfs code to
start the periodic workers later.
Let filesystems handle waiting for direct I/O requests themselves instead
of doing it beforehand. This means filesystem-specific locks to prevent
new dio referenes from appearing can be held. This is important to allow
generalizing i_dio_count to non-DIO_LOCKING filesystems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__gfs2_free_data and __gfs2_free_meta are almost identical, and
can be trivially combined.
[This is as per Eric's original patch minus gfs2_free_data() which had
no callers left and plus the conversion of the bmap.c calls to these
functions. All in all, a nice clean up]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The deallocation code for directories in GFS2 is largely divided into
two parts. The first part deallocates any directory leaf blocks and
marks the directory as being a regular file when that is complete. The
second stage was identical to deallocating regular files.
Regular files have their data blocks in a different
address space to directories, and thus what would have been normal data
blocks in a regular file (the hash table in a GFS2 directory) were
deallocated correctly. However, a reference to these blocks was left in the
journal (assuming of course that some previous activity had resulted in
those blocks being in the journal or ail list).
This patch uses the i_depth as a test of whether the inode is an
exhash directory (we cannot test the inode type as that has already
been changed to a regular file at this stage in deallocation)
The original issue was reported by Chris Hertel as an issue he encountered
running bonnie++
Reported-by: Christopher R. Hertel <crh@samba.org>
Cc: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch is a performance improvement to GFS2's dealloc code.
Rather than update the quota file and statfs file for every
single block that's stripped off in unlink function do_strip,
this patch keeps track and updates them once for every layer
that's stripped. This is done entirely inside the existing
transaction, so there should be no risk of corruption.
The other functions that deallocate blocks will be unaffected
because they are using wrapper functions that do the same
thing that they do today.
I tested this code on my roth cluster by creating 200
files in a directory, each of which is 100MB, then on
four nodes, I simultaneously deleted the files, thus competing
for GFS2 resources (but different files). The commands
I used were:
[root@roth-01]# time for i in `seq 1 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
[root@roth-02]# time for i in `seq 2 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
[root@roth-03]# time for i in `seq 3 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
[root@roth-05]# time for i in `seq 4 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
The performance increase was significant:
roth-01 roth-02 roth-03 roth-05
--------- --------- --------- ---------
old: real 0m34.027 0m25.021s 0m23.906s 0m35.646s
new: real 0m22.379s 0m24.362s 0m24.133s 0m18.562s
Total time spent deleting:
old: 118.6s
new: 89.4
For this particular case, this showed a 25% performance increase for
GFS2 unlinks.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When you truncate the rindex file, you need to avoid calling gfs2_rindex_hold,
since you already hold it. However, if you haven't already read in the
resource groups, you need to do that.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Some of the functions in GFS2 were not reserving space in the transaction for
the resource group header and the resource groups bitblocks that get added
when you do allocation. GFS2 now makes sure to reserve space for the
resource group header and either all the bitblocks in the resource group, or
one for each block that it may allocate, whichever is smaller using the new
gfs2_rg_blocks() inline function.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
With the update of the truncate code, ip->i_disksize and
inode->i_size are merely copies of each other. This means
we can remove ip->i_disksize and use inode->i_size exclusively
reducing the size of a GFS2 inode by 8 bytes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This updates GFS2's truncate code to use the new truncate
sequence correctly. This is a stepping stone to being
able to remove ip->i_disksize in favour of using i_size
everywhere now that the two sizes are always identical.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
trunc_start() in bmap.c incorrectly uses sizeof(struct gfs2_inode) instead of
sizeof(struct gfs2_dinode).
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Function gfs2_write_alloc_required always returned zero as its
return code. Therefore, it doesn't need to return a return code
at all. Given that, we can use the return value to return whether
or not the dinode needs block allocations rather than passing
that value in, which in turn simplifies a bunch of error checking.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch replaces a statement that got dropped out by accident.
Without the patch, truncates on stuffed (very small) files cause
those files to have an unpredictable size.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
If the inode size was corrupt for stuffed files, it was possible
for the copying of data to overrun the block and/or page. This patch
checks for that condition so that this is no longer possible.
This is also preparation for the new truncate sequence patch which
requires the ability to have stuffed files with larger sizes than
(disk block size - sizeof(on disk inode)) with the restriction that
only the initial part of the file may be non-zero.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch solves a corner case during allocation which occurs if both
metadata (indirect) and data blocks are required but there is an
obstacle in the filesystem (e.g. a resource group header or another
allocated block) such that when the allocation is requested only
enough blocks for the metadata are returned.
By changing the exit condition of this loop, we ensure that a
minimum of one data block will always be returned.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch adds the ability to trace various aspects of the GFS2
filesystem. The trace points are divided into three groups,
glocks, logging and bmap. These points have been chosen because
they allow inspection of the major internal functions of GFS2
and they are also generic enough that they are unlikely to need
any major changes as the filesystem evolves.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If a page was partially zeroed as the result of a truncate, then it was
not being correctly marked dirty. This resulted in the deleted data
reappearing if the file was read back via direct I/O.
Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch renames the ops_*.c files which have no counterpart
without the ops_ prefix in order to shorten the name and make
it more readable. In addition, ops_address.h (which was very
small) is moved into inode.h and inode.h is cleaned up by
adding extern where required.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch improves the error handling in the case where we
discover that the summary information in the resource group
doesn't match the bitmap information while in the process of
allocating blocks. Originally this resulted in a kernel bug,
but this patch changes that so that we return -EIO and print
some messages explaining what went wrong, and how to fix it.
We also remember locally not to try and allocate from the
same rgrp again, so that a subsequent allocation in a
different rgrp should succeed.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is the big patch that I've been working on for some time
now. There are many reasons for wanting to make this change
such as:
o Reducing overhead by eliminating duplicated fields between structures
o Simplifcation of the code (reduces the code size by a fair bit)
o The locking interface is now the DLM interface itself as proposed
some time ago.
o Fewer lookups of glocks when processing replies from the DLM
o Fewer memory allocations/deallocations for each glock
o Scope to do further optimisations in the future (but this patch is
more than big enough for now!)
Please note that (a) this patch relates to the lock_dlm module and
not the DLM itself, that is still a separate module; and (b) that
we retain the ability to build GFS2 as a standalone single node
filesystem with out requiring the DLM.
This patch needs a lot of testing, hence my keeping it I restarted
my -git tree after the last merge window. That way, this has the maximum
exposure before its merged. This is (modulo a few minor bug fixes) the
same patch that I've been posting on and off the the last three months
and its passed a number of different tests so far.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch removes some unused code, and make the calculation
of the number of blocks required conditional in order to reduce
the number of times this (potentially expensive) calculation
is done.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The final field in gfs2_dinode_host was the i_flags field. Thats
renamed to i_diskflags in order to avoid confusion with the existing
inode flags, and moved into the inode proper at a suitable location
to avoid creating a "hole".
At that point struct gfs2_dinode_host is no longer needed and as
promised (quite some time ago!) it can now be removed completely.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch moved the i_size field from the gfs2_dinode_host and
following the ext3 convention renames it i_disksize.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes bz 450641.
This patch changes the computation for zero_metapath_length(), which it
renames to metapath_branch_start(). When you are extending the metadata
tree, The indirect blocks that point to the new data block must either
diverge from the existing tree either at the inode, or at the first
indirect block. They can diverge at the first indirect block because the
inode has room for 483 pointers while the indirect blocks have room for
509 pointers, so when the tree is grown, there is some free space in the
first indirect block. What metapath_branch_start() now computes is the
height where the first indirect block for the new data block is located.
It can either be 1 (if the indirect block diverges from the inode) or 2
(if it diverges from the first indirect block).
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch streamlines the quota checking in the "no quota" case by
making the check inline in the calling function, thus reducing the
number of function calls. Eventually we might be able to remove the
checks from the gfs2_quota_lock() and gfs2_quota_check() functions, but
currently we can't as there are a very few places in the code which need
to call these functions directly still.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
gfs2_alloc_get may fail so we have to check it to prevent
NULL pointer dereference.
Signed-off-by: Cyrill Gorcunov <gorcunov@gamil.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We've supported mapping of extents when no block allocation is required
for some time. This patch extends that to mapping of extents when an
allocation has been requested. In that case we try to allocate as many
blocks as are requested, but we might return fewer in case there is
something preventing us from returning the complete amount (e.g. an
already allocated block is in the way).
Currently the only code path which can actually request multiple data
blocks in a single bmap call is the page_mkwrite path and even then it
only happens if there are multiple blocks per page. What this patch does
do however, is merge the allocation requests for metadata (growing the
metadata tree in either height or depth) with the allocation of the data
blocks in the case that both are needed. This results in lower overheads
even in the single block allocation case.
The one thing which we can't handle here at the moment is unstuffing. I
would like to be able to do that, but the problem which arises is that
in order to unstuff one has to get a locked page from the page cache
which results in locking problems in the (usual) case that the caller is
holding the page lock on the page it wishes to map. So that case will
have to be addressed in future patches.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In the case that we needed to grow the height of the metadata tree
we were looking up the inode buffer and then brelse()ing it despite
the fact that it is needed later in the block map process.
This patch ensures that we look up the inode's buffer once and only
once during the block map process.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The blocks counter is almost a duplicate of the i_blocks
field in the VFS inode. The only difference is that i_blocks
can be only 32bits long for 32bit arch without large single file
support. Since GFS2 doesn't handle the non-large single file
case (for 32 bit anyway) this adds a new config dependency on
64BIT || LSF. This has always been the case, however we've never
explicitly said so before.
Even if we do add support for the non-LSF case, we will still
not require this field to be duplicated since we will not be
able to access oversized files anyway.
So the net result of all this is that we shave 8 bytes from a gfs2_inode
and get our config deps correct.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>