This set includes some minor fixes and improvements.
The one large patch addresses the special "nodir" mode,
which has been a long neglected proof of concept, but
with these fixes seems to be quite usable. It allows
the resource master to be assigned statically instead of
dynamically, which can improve performance if there is
little locality and most resources are shared.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJPu/MlAAoJEDgbc8f8gGmq860P/0o+tYG2pAUz87WnKg92cGwm
ajaI78ydY6qOjndcEjbgdX6uWqVQ7f/OKo3drzVH8KFQ67eiaXC4wv2xTL3aymbX
2Ua55oiVsW+k9d9yK5Dzfa4qAlR5QPV1WEAnoVkiEDNoiGCGecjmVebhK1/Sb5Lu
1gaIJ3C+3L1ngfAzpfeB+7LwuVB36UlIyBrvPOj6yWiSDgpPaVbTrEU0NaDDDDIi
oo7tTiqivCZf/GH+ZcIjPE/LBen/lVqXSDU2YShiac/ErRfpRk9rnDFIUeN2nYPd
JwPjzutFWM+N6HIA2RCBXKo7FkK2rvYXw84/RVMvA4goEH/Qu8yDtBww20BmvFYY
3guU1udka0/NR7/ap98Btdqsvqco6R2X/rpzx8y1eD1jzUvb6El6yg3PM1Qvd8zQ
72aVzcdgAI4qtEAVziy5X4omNeQ6a55sUYXlCcvkiwZJQdPzkDuzntC28q3bgJva
QD0ugX7ltBpHuZZZb2tbBN9hfMqyo7gneaY2OoGVCTb1U9ibb5JgfZOswTC2gQsE
17vykdL5owQ8bbBj2tkRQiJ8dZoxn23hV+sZrvLm3TR8xF4oJtDqUdRs9K7iX8It
YxTTCL1LmxHRFG/0Cy2l7VhoqkIKsoVFdavW7pivFNkzp/yQNHk4r2iJWhR9YArV
qaE2HqIxJsev/B/lBPyo
=mHOh
-----END PGP SIGNATURE-----
Merge tag 'dlm-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes some minor fixes and improvements. The one large
patch addresses the special "nodir" mode, which has been a long
neglected proof of concept, but with these fixes seems to be quite
usable. It allows the resource master to be assigned statically
instead of dynamically, which can improve performance if there is
little locality and most resources are shared."
* tag 'dlm-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: NULL dereference on failure in kmem_cache_create()
gfs2: fix recovery during unmount
dlm: fixes for nodir mode
dlm: improve error and debug messages
dlm: avoid unnecessary search in search_rsb
dlm: limit rcom debug messages
dlm: fix waiter recovery
dlm: prevent connections during shutdown
Pull GFS2 changes from Steven Whitehouse.
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (24 commits)
GFS2: Fix quota adjustment return code
GFS2: Add rgrp information to block_alloc trace point
GFS2: Eliminate unused "new" parameter to gfs2_meta_indirect_buffer
GFS2: Update glock doc to add new stats info
GFS2: Update main gfs2 doc
GFS2: Remove redundant metadata block type check
GFS2: Fix sgid propagation when using ACLs
GFS2: eliminate log elements and simplify
GFS2: Eliminate vestigial sd_log_le_rg
GFS2: Eliminate needless parameter from function gfs2_setbit
GFS2: Log code fixes
GFS2: Remove unused argument from gfs2_internal_read
GFS2: Remove bd_list_tr
GFS2: Remove duplicate log code
GFS2: Clean up log write code path
GFS2: Use variable rather than qa to determine if unstuff necessary
GFS2: Change variable blk to biblk
GFS2: Fix function parameter comments in rgrp.c
GFS2: Eliminate offset parameter to gfs2_setbit
GFS2: Use slab for block reservation memory
...
This patch changes function gfs2_adjust_quota so that it properly
returns a good (zero) return code on the normal path through the code.
Without this, mounting GFS2 with -o quota=account periodically gave
this error message: GFS2: fsid=cluster:fs: gfs2_quotad: sync error -5
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is a second attempt at a patch that adds rgrp information to the
block allocation trace point for GFS2. As suggested, the patch was
modified to list the rgrp information _after_ the fields that exist today.
Again, the reason for this patch is to allow us to trace and debug
problems with the block reservations patch, which is still in the works.
We can debug problems with reservations if we can see what block allocations
result from the block reservations. It may also be handy in figuring out
if there are problems in rgrp free space accounting. In other words,
we can use it to track the rgrp and its free space along side the allocations
that are taking place.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
It turns out that the "new" parameter to function gfs2_meta_indirect_buffer
was always being passed in as zero. Therefore, this patch eliminates it
and simplifies the function.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This allows comparing hash and len in one operation on 64-bit
architectures. Right now only __d_lookup_rcu() takes advantage of this,
since that is the case we care most about.
The use of anonymous struct/unions hides the alternate 64-bit approach
from most users, the exception being a few cases where we initialize a
'struct qstr' with a static initializer. This makes the problematic
cases use a new QSTR_INIT() helper function for that (but initializing
just the name pointer with a "{ .name = xyzzy }" initializer remains
valid, as does just copying another qstr structure).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes a redundant metadata block check. See description below.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This cleans up the mode setting code when creating inodes. The
SGID bit was being reset by setattr_copy() when the user creating a
subdirectory was not in the owning group. When ACLs are in use this
SGID bit should have been propagated if the ACL allows creation of
a subdirectory. GFS2's behaviour now matches that of the other ACL
supporting filesystems in this regard.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Journal recovery from lock_dlm should not be ignored
if there is an unmount in progress. Ignoring it will
causes the recovery to get stuck. The recovery
process will correctly handle an in-progess unmount.
Signed-off-by: David Teigland <teigland@redhat.com>
The "nodir" mode (statically assign master nodes instead
of using the resource directory) has always been highly
experimental, and never seriously used. This commit
fixes a number of problems, making nodir much more usable.
- Major change to recovery: recover all locks and restart
all in-progress operations after recovery. In some
cases it's not possible to know which in-progess locks
to recover, so recover all. (Most require recovery
in nodir mode anyway since rehashing changes most
master nodes.)
- Change the way nodir mode is enabled, from a command
line mount arg passed through gfs2, into a sysfs
file managed by dlm_controld, consistent with the
other config settings.
- Allow recovering MSTCPY locks on an rsb that has not
yet been turned into a master copy.
- Ignore RCOM_LOCK and RCOM_LOCK_REPLY recovery messages
from a previous, aborted recovery cycle. Base this
on the local recovery status not being in the state
where any nodes should be sending LOCK messages for the
current recovery cycle.
- Hold rsb lock around dlm_purge_mstcpy_locks() because it
may run concurrently with dlm_recover_master_copy().
- Maintain highbast on process-copy lkb's (in addition to
the master as is usual), because the lkb can switch
back and forth between being a master and being a
process copy as the master node changes in recovery.
- When recovering MSTCPY locks, flag rsb's that have
non-empty convert or waiting queues for granting
at the end of recovery. (Rename flag from LOCKS_PURGED
to RECOVER_GRANT and similar for the recovery function,
because it's not only resources with purged locks
that need grant a grant attempt.)
- Replace a couple of unnecessary assertion panics with
error messages.
Signed-off-by: David Teigland <teigland@redhat.com>
This patch eliminates the gfs2_log_element data structure and
rolls its two components into the gfs2_bufdata. This makes the code
easier to understand and makes it easier to migrate to a rbtree
to keep the list sorted.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch eliminates gfs2 superblock variable sd_log_le_rg which
is no longer used.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch eliminates parameter "buf1" from function gfs2_setbit.
This is possible because it was always passed in as bi->bi_bh->b_data.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch removes a log lock from around atomic operation where
it is not needed, removes an unused variable, and also changes
a void pointer used incorrectly to a struct page pointer.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
gfs2_internal_read accepts an unused ra_state argument, left over from
when we did readahead on the rindex. Since there are currently no plans
to add back this readahead, this patch removes the ra_state parameter
and updates the functions which call gfs2_internal_read accordingly.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is another clean up in the logging code. This per-transaction
list was largely unused. Its main function was to ensure that the
number of buffers in a transaction was correct, however that counter
was only used to check the number of buffers in the bd_list_tr, plus
an assert at the end of each transaction. With the assert now changed
to use the calculated buffer counts, we can remove both bd_list_tr and
its associated counter.
This should make the code easier to understand as well as shrinking
a couple of structures.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The main part of this patch merges the two functions used to
write metadata and data buffers to the log. Most of the code
is common between the two functions, so this provides a nice
clean up, and makes the code more readable.
The gfs2_get_log_desc() function is also extended to take two more
arguments, and thus avoid having to set the length and data1
fields of this strucuture as a separate operation.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Prior to this patch, we have two ways of sending i/o to the log.
One of those is used when we need to allocate both the data
to be written itself and also a buffer head to submit it. This
is done via sb_getblk and friends. This is used mostly for writing
log headers.
The other method is used when writing blocks which have some
in-place counterpart. This is the case for all the metadata
blocks which are journalled, and when journaled data is in use,
for unescaped journalled data blocks.
This patch replaces both of those two methods, and about half
a dozen separate i/o submission points with a single i/o
submission function. We also go direct to bio rather than
using buffer heads, since this allows us to build i/o
requests of the maximum size for the block device in
question. It also reduces the memory required for flushing
the log, which can be very useful in low memory situations.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In the future, the qadata structure will be eliminated and merged
back in with the block reservation structure, after we extend the
lifespan of that. This patch is a step forward in eliminating the
qadata structure. It adds a variable to the do_grow function to
determine when unstuffing is necessary, and has been done.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In the resource group code, we have no less than three different
kinds of block references: block relative to the file system (u64),
block relative to the rgrp (u32), and block relative to the bitmap.
This is a small step to making the code more readable; it renames
variable blk to biblk to solidify in my mind that it's relative to
the bitmap and nothing else.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch just fixes a bunch of function parameter comments.
Slowly, over the years, the comments have gotten out of date
(mostly my fault, as I haven't been good at keeping them up to date).
This patch rectifies some of that.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch changes block reservations so it uses slab storage.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch makes function gfs2_page_add_databufs static.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch renames function gfs2_close to gfs2_release.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since we always write the buffer directly after this function
returns, we might as well merge it into here. This is a clean
up in preparation for some further updates to the log code
which are coming soon.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The "pull" argument to log_write_header() is only used
for debug purposes and it is not really needed any more. There
are other tests for this particular problem, so I think we can
dispose of it in order to simplify the code.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch instructs DLM to prevent an "in place" conversion, where the
lock just stays on the granted queue, and instead forces the conversion to
the back of the convert queue. This is done on upward conversions only.
This is useful in cases where, for example, a lock is frequently needed in
PR on one node, but another node needs it temporarily in EX to update it.
This may happen, for example, when the rindex is being updated by gfs2_grow.
The gfs2_grow needs to have the lock in EX, but the other nodes need to
re-read it to retrieve the updates. The glock is already granted in PR on
the non-growing nodes, so this prevents them from continually re-granting
the lock in PR, and forces the EX from gfs2_grow to go through.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Pull GFS2 fixes from Steven Whitehouse
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
GFS2: Allow caching of rindex glock
GFS2: Make sure rindex is uptodate before starting transactions
GFS2: use depends instead of select in kconfig
GFS2: put glock reference in error patch of read_rindex_entry
This patch allows caching of the rindex glock. We were previously
setting the GL_NOCACHE bit when the glock was released. That forced
the rindex inode to be invalidated, which caused us to re-read
rindex at the next access. However, it caused the glock to be
unnecessarily bounced around the cluster. This patch allows
the glock to remain cached, but it still causes the rindex to be
re-read once it has been written to by gfs2_grow.
Ben and I have tested single-node gfs2_grow cases and I've tested
clustered gfs2_grow cases on my four-node cluster.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch removes the call from gfs2_blk2rgrd to function
gfs2_rindex_update and replaces it with individual calls.
The former way turned out to be too problematic.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Avoids having to duplicate the dependencies of what is 'select'ed (and on
down...)
Those dependencies are currently incomplete, leading to broken builds with
GFS2_FS_LOCKING_DLM=y and IP_SCTP=n.
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes the error path of function read_rindex_entry
so that it correctly gives up its glock reference in cases where
there is a race to re-read the rindex after gfs2_grow.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Pull gfs2 changes from Steven Whitehouse.
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Change truncate page allocation to be GFP_NOFS
GFS2: call gfs2_write_alloc_required for each chunk
GFS2: Clean up log flush header writing
GFS2: Remove a __GFP_NOFAIL allocation
GFS2: Flush pending glock work when evicting an inode
GFS2: make sure rgrps are up to date in func gfs2_blk2rgrpd
GFS2: Eliminate sd_rindex_mutex
GFS2: Unlock rindex mutex on glock error
GFS2: Make bd_cmp() static
GFS2: Sort the ordered write list
GFS2: FITRIM ioctl support
GFS2: Move two functions from log.c to lops.c
GFS2: glock statistics gathering
Pull vfs pile 1 from Al Viro:
"This is _not_ all; in particular, Miklos' and Jan's stuff is not there
yet."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
ext4: initialization of ext4_li_mtx needs to be done earlier
debugfs-related mode_t whack-a-mole
hfsplus: add an ioctl to bless files
hfsplus: change finder_info to u32
hfsplus: initialise userflags
qnx4: new helper - try_extent()
qnx4: get rid of qnx4_bread/qnx4_getblk
take removal of PF_FORKNOEXEC to flush_old_exec()
trim includes in inode.c
um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
um: embed ->stub_pages[] into mmu_context
gadgetfs: list_for_each_safe() misuse
ocfs2: fix leaks on failure exits in module_init
ecryptfs: make register_filesystem() the last potential failure exit
ntfs: forgets to unregister sysctls on register_filesystem() failure
logfs: missing cleanup on register_filesystem() failure
jfs: mising cleanup on register_filesystem() failure
make configfs_pin_fs() return root dentry on success
configfs: configfs_create_dir() has parent dentry in dentry->d_parent
configfs: sanitize configfs_create()
...
This patch changes the page allocation in gfs2_block_truncate_page
and two others to GFP_NOFS to avoid deadlock in low-memory conditions.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
gfs2_fallocate was calling gfs2_write_alloc_required() once at the start of
the function. This caused problems since gfs2_write_alloc_required used a
long unsigned int for the len, but gfs2_fallocate could allocate a much
larger amount. This patch will move the call into the loop where the
chunks are actually allocated and zeroed out. This will keep the allocation
size under the limit, and also allow gfs2_fallocate to quickly skip over
sections of the file that are already completely allocated.
fallcate_chunk was also not correctly setting the file size. It was using the
len veriable to find the last block written to, but by the time it was setting
the size, the len variable had already been decremented to 0.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We already send both a pre and post flush to the block device
when writing a journal header. There is no need to wait for
the previous I/O specifically when we do this, unless we've
turned "barriers" off.
As a side effect, this also cleans up the code path for flushing
the journal and makes it more readable.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In order to ensure that we've got enough buffer heads for flushing
the journal, the orignal code used __GFP_NOFAIL when performing
this allocation. Here we dispense with that in favour of using a
mempool. This should improve efficiency in low memory conditions
since flushing the journal is a good way to get memory back, we
don't want to be spinning, waiting on memory allocations. The
buffers which are allocated via this mempool are fairly short lived,
so that we'll recycle them pretty quickly.
Although there are other memory allocations which occur during the
journal flush process, this is the one which can potentially require
the most memory, so the most important one to fix.
The amount of memory reserved is a fixed amount, and we should not need
to scale it when there are a greater number of filesystems in use.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This ensures that we will not try to access the inode thats
being flushed via the glock after it has been freed.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch adds a call to gfs2_rindex_update from function gfs2_blk2rgrpd
and removes calls to it that are made redundant by it. The problem is
that a gfs2_grow can add rgrps to the rindex, then put those rgrps into
use, thus rendering the rindex we read in at mount time incomplete.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Over time, we've slowly eliminated the use of sd_rindex_mutex.
Up to this point, it was only used in two places: function
gfs2_ri_total (which totals the file system size by reading
and parsing the rindex file) and function gfs2_rindex_update
which updates the rgrps in memory. Both of these functions have
the rindex glock to protect them, so the rindex is unnecessary.
Since gfs2_grow writes to the rindex via the meta_fs, the mutex
is in the wrong order according to the normal rules. This patch
eliminates the mutex entirely to avoid the problem.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes an error path in function gfs2_rindex_update
that leaves the rindex mutex held.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch sorts the ordered write list for GFS2 writes.
This increases the throughput for simultaneous writes.
For example, if you have ten processes, all doing:
dd if=/dev/zero of=/mnt/gfs2/fileX
on different files, the throughput will be much better.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The FITRIM ioctl provides an alternative way to send discard requests to
the underlying device. Using the discard mount option results in every
freed block generating a discard request to the block device. This can
be slow, since many block devices can only process discard requests of
larger sizes, and also such operations can be time consuming.
Rather than using the discard mount option, FITRIM allows a sweep of the
filesystem on an occasional basis, and also to optionally avoid sending
down discard requests for smaller regions.
In GFS2 FITRIM will work at resource group granularity. There is a flag
for each resource group which keeps track of which resource groups have
been trimmed. This flag is reset whenever a deallocation occurs in the
resource group, and set whenever a successful FITRIM of that resource
group has taken place. This helps to reduce repeated discard requests
for the same block ranges, again improving performance.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
gfs2_log_get_buf() and gfs2_log_fake_buf() are both used
only in lops.c, so move them next to their callers and they
can then become static.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>