unlinkat - Remove a directory entry
size[4] Tunlinkat tag[2] dirfid[4] name[s] flag[4]
size[4] Runlinkat tag[2]
older Tremove have the below request format
size[4] Tremove tag[2] fid[4]
The remove message is used to remove a directory entry either file or directory
The remove opreation is actually a directory opertation and should ideally have
dirfid, if not we cannot represent the fid on server with anything other than
name. We will have to derive the directory name from fid in the Tremove request.
NOTE: The operation doesn't clunk the unlink fid.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
renameat - change name of file or directory
size[4] Trenameat tag[2] olddirfid[4] oldname[s] newdirfid[4] newname[s]
size[4] Rrenameat tag[2]
older Trename have the below request format
size[4] Trename tag[2] fid[4] newdirfid[4] name[s]
The rename message is used to change the name of a file, possibly moving it
to a new directory. The rename opreation is actually a directory opertation
and should ideally have olddirfid, if not we cannot represent the fid on server
with anything other than name. We will have to derive the old directory name
from fid in the Trename request.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This make sure we don't end up reusing the unlinked inode object.
The ideal way is to use inode i_generation. But i_generation is
not available in userspace always.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Without this fix, if any invalid mount options/args are passed while mouting
the 9p fs, no error (-EINVAL) is returned and default arg value is assigned.
This fix returns -EINVAL when an invalid arguement is found while parsing
mount options.
Signed-off-by: Prem Karat <prem.karat@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This make sure we don't use wrong inode from the inode hash. The inode number
of the file deleted is reused by the next file system object created
and if we only use inode number for inode hash lookup we could end up
with wrong struct inode.
Also compare inode generation number. Not all Linux file system provide
st_gen in userspace. So it could be 0;
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Now that VFS does the right thing remove the work around.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
isofs: Remove global fs lock
jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
mm/truncate.c: fix build for CONFIG_BLOCK not enabled
fs:update the NOTE of the file_operations structure
Remove dead code in dget_parent()
AFS: Fix silly characters in a comment
switch d_add_ci() to d_splice_alias() in "found negative" case as well
simplify gfs2_lookup()
jfs_lookup(): don't bother with . or ..
get rid of useless dget_parent() in btrfs rename() and link()
get rid of useless dget_parent() in fs/btrfs/ioctl.c
fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
drivers: fix up various ->llseek() implementations
fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
Ext4: handle SEEK_HOLE/SEEK_DATA generically
Btrfs: implement our own ->llseek
fs: add SEEK_HOLE and SEEK_DATA flags
reiserfs: make reiserfs default to barrier=flush
...
Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
shrinker callout for the inode cache, that clashed with the xfs code to
start the periodic workers later.
When journalling data for an inode (either because it is a symlink or
because the filesystem is mounted in data=journal mode), ext3_evict_inode()
can discard unwritten data by calling truncate_inode_pages(). This is
because we don't mark the buffer / page dirty when journalling data but only
add the buffer to the running transaction and thus mm does not know there
are still unwritten data.
Fix the problem by carefully tracking transaction containing inode's data,
committing this transaction, and writing uncheckpointed buffers when inode
should be reaped.
Signed-off-by: Jan Kara <jack@suse.cz>
sbi->s_mutex isn't needed for isofs at all so we can just remove it. Generally,
since isofs is always mounted read-only, filesystem structure cannot change
under us. So buffer_head contents stays constant after it's filled in. That
leaves us with possible changes of global data structures. Superblock changes
only during filesystem mount (even remount does not change it), inodes are only
filled in during reading from disk. So there are no changes of these structures
to bother about.
Arguments why sbi->s_mutex can be removed at each place:
isofs_readdir: Accesses sb, inode, filp, local variables => s_mutex not needed
isofs_lookup: Protected by directory's i_mutex. Accesses sb, inode, dentry,
local variables => s_mutex not needed
rock_ridge_symlink_readpage: Protected by page lock. Accesses sb, inode,
local variables => s_mutex not needed.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We don't generate IN_DELETE_SELF on victim of overwriting rename() if
it happens to be a directory. Trivially fixed by doing to ->i_nlink
what we do ->pino_nlink a couple of lines later in jffs2_rename().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On ramfs and other simple_rename() users IN_DELETE_SELF is not generated
for victim of overwriting rename() if it's is a directory. Works on
most of the local filesystems and really trivial to fix...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
connector: add an event for monitoring process tracers
ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
ptrace_init_task: initialize child->jobctl explicitly
has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented()
ptrace: ptrace_reparented() should check same_thread_group()
redefine thread_group_leader() as exit_signal >= 0
do not change dead_task->exit_signal
kill task_detached()
reparent_leader: check EXIT_DEAD instead of task_detached()
make do_notify_parent() __must_check, update the callers
__ptrace_detach: avoid task_detached(), check do_notify_parent()
kill tracehook_notify_death()
make do_notify_parent() return bool
ptrace: s/tracehook_tracer_task()/ptrace_parent()/
...
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (49 commits)
xfs: add size update tracepoint to IO completion
xfs: convert AIL cursors to use struct list_head
xfs: remove confusing ail cursor wrapper
xfs: use a cursor for bulk AIL insertion
xfs: failure mapping nfs fh to inode should return ESTALE
xfs: Remove the second parameter to xfs_sb_count()
xfs: remove the dead XFS_DABUF_DEBUG code
xfs: remove leftovers of the old btree tracing code
xfs: remove the dead QUOTADEBUG code
xfs: remove the unused xfs_buf_delwri_sort function
xfs: remove wrappers around b_iodone
xfs: remove wrappers around b_fspriv
xfs: add a proper transaction pointer to struct xfs_buf
xfs: factor out xfs_da_grow_inode_int
xfs: factor out xfs_dir2_leaf_find_stale
xfs: cleanup struct xfs_dir2_free
xfs: reshuffle dir2 headers
xfs: start periodic workers later
Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc"
xfs: remove variables that serve no purpose in xfs_alloc_ag_vextent_exact()
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
dlm: don't limit active work items
dlm: use workqueue for callbacks
dlm: remove deadlock debug print
dlm: improve rsb searches
dlm: keep lkbs in idr
dlm: fix kmalloc args
dlm: don't do pointless NULL check, use kzalloc and fix order of arguments
dlm: dump address of unknown node
dlm: use vmalloc for hash tables
dlm: show addresses in configfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus:
hfsplus: ensure bio requests are not smaller than the hardware sectors
hfsplus: Add additional range check to handle on-disk corruptions
hfsplus: Add error propagation for hfsplus_ext_write_extent_locked
hfsplus: add error checking for hfs_find_init()
hfsplus: lift the 2TB size limit
hfsplus: fix overflow in hfsplus_read_wrapper
hfsplus: fix overflow in hfsplus_get_block
hfsplus: assignments inside `if' condition clean-up
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
GFS2: combine duplicated block freeing routines
GFS2: Add S_NOSEC support
GFS2: Automatically adjust glock min hold time
GFS2: Cache dir hash table in a contiguous buffer
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (32 commits)
MAINTAINERS: change e-mail of Adrian Hunter
UBIFS: fix master node recovery
UBIFS: improve power cut emulation testing
UBIFS: rename recovery testing variables
UBIFS: remove custom list of superblocks
UBIFS: stop re-defining UBI operations
UBIFS: switch to I/O helpers
UBIFS: switch to ubifs_leb_write
UBIFS: switch to ubifs_leb_read
UBIFS: introduce more I/O helpers
UBIFS: always print stacktrace when switching to R/O mode
UBIFS: remove unused and unneeded debugging function
UBIFS: add global debugfs knobs
UBIFS: introduce debugfs helpers
UBIFS: re-arrange debugging code a bit
UBIFS: be more informative in failure mode
UBIFS: switch self-check knobs to debugfs
UBIFS: lessen amount of debugging check types
UBIFS: introduce helper functions for debugging checks and tests
UBIFS: amend debugging inode size check function prototype
...
In ext2_xattr_get(), the code will acquire xattr_sem first, later checks
the length of xattr name_len > 255. It's unnecessarily time consuming and
also ext2_xattr_set() checks the length before other checks. So move the
check before acquiring xattr_sem to make these two functions consistent.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently all bio requests are 512 bytes, which may fail for media
whose physical sector size is larger than this. Ensure these
requests are not smaller than the block device logical block size.
BugLink: http://bugs.launchpad.net/bugs/734883
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
'recoff' is read from disk and used for an argument to memcpy, so if
the value read from disk is larger than the page size, it result to
"general protection fault". This patch add additional range check for
the value, so that disk fuzz won't cause such fault.
Signed-off-by: Naohiro Aota <naota@elisp.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Test-case:
void *tfunc(void *arg)
{
execvp("true", NULL);
return NULL;
}
int main(void)
{
int pid;
if (fork()) {
pthread_t t;
kill(getpid(), SIGSTOP);
pthread_create(&t, NULL, tfunc, NULL);
for (;;)
pause();
}
pid = getppid();
assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
while (wait(NULL) > 0)
ptrace(PTRACE_CONT, pid, 0,0);
return 0;
}
It is racy, exit_notify() does __wake_up_parent() too. But in the
likely case it triggers the problem: de_thread() does release_task()
and the old leader goes away without the notification, the tracer
sleeps in do_wait() without children/tracees.
Change de_thread() to do __wake_up_parent(traced_leader->parent).
Since it is already EXIT_DEAD we can do this without ptrace_unlink(),
EXIT_DEAD threads do not exist from do_wait's pov.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Squashfs now supports XZ and LZO compression in addition to ZLIB.
As such it no longer makes sense to always include ZLIB support.
In particular embedded systems may only use LZO or XZ compression, and
the ability to exclude ZLIB support will reduce kernel size.
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
It seems to hurt performance in real life. Yes, the inode will be used
later, but the conditional doesn't seem to predict all that well
(negative dentries are not uncommon) and it looks like the cost of
prefetching is simply higher than depending on the cache doing the right
thing.
As usual.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The compiler, at least for ix86 and m68k, validly warns that the
comparison:
next <= (loff_t)-1
is always true (and it's always true also for x86-64 and probably all
other arches - as long as pgoff_t isn't wider than loff_t). The
intention appears to be to avoid wrapping of "next", so rather than
eliminating the pointless comparison, fix the loop to indeed get exited
when "next" would otherwise wrap.
On m68k the following warning is observed:
fs/fscache/page.c: In function '__fscache_uncache_all_inode_pages':
fs/fscache/page.c:979: warning: comparison is always false due to limited range of data type
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Suresh Jayaraman <sjayaraman@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix silly characters in a comment in AFS code (some weird characters replaced
the word 'flag' some point way back).
Reported-by: viro@ZenIV.linux.org.uk
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
both callers there have dentry->d_parent stabilized by the fact that
their caller had obtained dentry from lookup_one_len() and had not
dropped ->i_mutex on parent since then.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
we just return -EINVAL, in others we do the normal generic thing, and in others
we're simply making sure that the properly due-dilligence is done. For example
in NFS/CIFS we need to make sure the file size is update properly for the
SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
that is all we have to do. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since Ext4 has its own lseek we need to make sure it handles
SEEK_HOLE/SEEK_DATA. For now just do the same thing that is done in the generic
case, somebody else can come along and make it do fancy things later. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
Basically for the normal SEEK_*'s we will just defer to the generic helper, and
for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
hole or data. Currently this helper doesn't check for delalloc bytes for
prealloc space, so for now treat prealloc as data until that is fixed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out
using fiemap in things like cp cause more problems than it solves, so lets try
and give userspace an interface that doesn't suck. We need to match solaris
here, and the definitions are
*o* If /whence/ is SEEK_HOLE, the offset of the start of the
next hole greater than or equal to the supplied offset
is returned. The definition of a hole is provided near
the end of the DESCRIPTION.
*o* If /whence/ is SEEK_DATA, the file pointer is set to the
start of the next non-hole file region greater than or
equal to the supplied offset.
So in the generic case the entire file is data and there is a virtual hole at
the end. That means we will just return i_size for SEEK_HOLE and will return
the same offset for SEEK_DATA. This is how Solaris does it so we have to do it
the same way.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change the default reiserfs mount option to barrier=flush. Based on a patch
from Jeff Mahoney in the SuSE tree.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch turns on barriers by default for ext3. mount -o barrier=0
will turn them off. Based on a patch from Chris Mason in the SuSE tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Acked-by: Ted Ts'o <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Moving the event counter into the dynamically allocated 'struc seq_file'
allows poll() support without the need to allocate its own tracking
structure.
All current users are switched over to use the new counter.
Requested-by: Andrew Morton akpm@linux-foundation.org
Acked-by: NeilBrown <neilb@suse.de>
Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For filesystems that delay their end_io processing we should keep our
i_dio_count until the the processing is done. Enable this by moving
the inode_dio_done call to the end_io handler if one exist. Note that
the actual move to the workqueue for ext4 and XFS is not done in
this patch yet, but left to the filesystem maintainers. At least
for XFS it's not needed yet either as XFS has an internal equivalent
to i_dio_count.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Simple filesystems always pass inode->i_sb_bdev as the block device
argument, and never need a end_io handler. Let's simply things for
them and for my grepping activity by dropping these arguments. The
only thing not falling into that scheme is ext4, which passes and
end_io handler without needing special flags (yet), but given how
messy the direct I/O code there is use of __blockdev_direct_IO
in one instead of two out of three cases isn't going to make a large
difference anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
This these filesystems to also protect truncate against direct I/O requests
by using common code. Right now the only non-DIO_LOCKING filesystem that
appears to do so is XFS, which uses an opencoded variant of the i_dio_count
scheme.
Behaviour doesn't change for filesystems never calling inode_dio_wait.
For ext4 behaviour changes when using the dioread_nonlock option, which
previously was missing any protection between truncate and direct I/O reads.
For ocfs2 that handcrafted i_dio_count manipulations are replaced with
the common code now enable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Let filesystems handle waiting for direct I/O requests themselves instead
of doing it beforehand. This means filesystem-specific locks to prevent
new dio referenes from appearing can be held. This is important to allow
generalizing i_dio_count to non-DIO_LOCKING filesystems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
i_alloc_sem is a rather special rw_semaphore. It's the last one that may
be released by a non-owner, and it's write side is always mirrored by
real exclusion. It's intended use it to wait for all pending direct I/O
requests to finish before starting a truncate.
Replace it with a hand-grown construct:
- exclusion for truncates is already guaranteed by i_mutex, so it can
simply fall way
- the reader side is replaced by an i_dio_count member in struct inode
that counts the number of pending direct I/O requests. Truncate can't
proceed as long as it's non-zero
- when i_dio_count reaches non-zero we wake up a pending truncate using
wake_up_bit on a new bit in i_flags
- new references to i_dio_count can't appear while we are waiting for
it to read zero because the direct I/O count always needs i_mutex
(or an equivalent like XFS's i_iolock) for starting a new operation.
This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
system).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reject zero sized reads as soon as we know our I/O length, and don't
borther with locks or allocations that might have to be cleaned up
otherwise.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rewrite ext4_page_mkwrite() to use __block_page_mkwrite() helper. This
removes the need of using i_alloc_sem to avoid races with truncate which
seems to be the wrong locking order according to lock ordering documented in
mm/rmap.c. Also calling ext4_da_write_begin() as used by the old code seems to
be problematic because we can decide to flush delay-allocated blocks which
will acquire s_umount semaphore - again creating unpleasant lock dependency
if not directly a deadlock.
Also add a check for frozen filesystem so that we don't busyloop in page fault
when the filesystem is frozen.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a new rw_semaphore to protect bmap against truncate. Previous
i_alloc_sem was abused for this, but it's going away in this series.
Note that we can't simply use i_mutex, given that the swapon code
calls ->bmap under it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The flags parameter went away in
d749519b444db985e40b897f73ce1898b11f997e
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Convert the inode reclaim shrinker to use the new per-sb shrinker
operations. This allows much bigger reclaim batches to be used, and
allows the XFS inode cache to be shrunk in proportion with the VFS
dentry and inode caches. This avoids the problem of the VFS caches
being shrunk significantly before the XFS inode cache is shrunk
resulting in imbalances in the caches during reclaim.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that the per-sb shrinker is responsible for shrinking 2 or more
caches, increase the batch size to keep econmies of scale for
shrinking each cache. Increase the shrinker batch size to 1024
objects.
To allow for a large increase in batch size, add a conditional
reschedule to prune_icache_sb() so that we don't hold the LRU spin
lock for too long. This mirrors the behaviour of the
__shrink_dcache_sb(), and allows us to increase the batch size
without needing to worry about problems caused by long lock hold
times.
To ensure that filesystems using the per-sb shrinker callouts don't
cause problems, document that the object freeing method must
reschedule appropriately inside loops.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now we have a per-superblock shrinker implementation, we can add a
filesystem specific callout to it to allow filesystem internal
caches to be shrunk by the superblock shrinker.
Rather than perpetuate the multipurpose shrinker callback API (i.e.
nr_to_scan == 0 meaning "tell me how many objects freeable in the
cache), two operations will be added. The first will return the
number of objects that are freeable, the second is the actual
shrinker call.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that we have per-sb shrinkers with a lifecycle that is a subset
of the superblock lifecycle and can reliably detect a filesystem
being unmounted, there is not longer any race condition for the
iprune_sem to protect against. Hence we can remove it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With context based shrinkers, we can implement a per-superblock
shrinker that shrinks the caches attached to the superblock. We
currently have global shrinkers for the inode and dentry caches that
split up into per-superblock operations via a coarse proportioning
method that does not batch very well. The global shrinkers also
have a dependency - dentries pin inodes - so we have to be very
careful about how we register the global shrinkers so that the
implicit call order is always correct.
With a per-sb shrinker callout, we can encode this dependency
directly into the per-sb shrinker, hence avoiding the need for
strictly ordering shrinker registrations. We also have no need for
any proportioning code for the shrinker subsystem already provides
this functionality across all shrinkers. Allowing the shrinker to
operate on a single superblock at a time means that we do less
superblock list traversals and locking and reclaim should batch more
effectively. This should result in less CPU overhead for reclaim and
potentially faster reclaim of items from each filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Both the filesystem and the lock manager can associate operations with a
lock. Confusingly, one of them (fl_release_private) actually has the
same name in both operation structures.
It would save some confusion to give the lock-manager ops different
names.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
For improving insight into IO completion behaviour.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The list of active AIL cursors uses a roll-your-own linked list with
special casing for the AIL push cursor. Simplify this code by
replacing the list with standard struct list_head lists, and use a
separate list_head to track the active cursors. This allows us to
treat the AIL push cursor as a generic cursor rather than as a
special case, further simplifying the code.
Further, fix the duplicate push cursor initialisation that the
special case handling was hiding, and clean up all the comments
around the active cursor list handling.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
xfs_trans_ail_cursor_set() doesn't set the cursor to the current log
item, it sets it to the next item. There is already a function for
doing this - xfs_trans_ail_cursor_next() - and the _set function is
simply a two line wrapper. Remove it and open code the setting of
the cursor in the two locations that call it to remove the
confusion.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Delayed logging can insert tens of thousands of log items into the
AIL at the same LSN. When the committing of log commit records
occur, we can get insertions occurring at an LSN that is not at the
end of the AIL. If there are thousands of items in the AIL on the
tail LSN, each insertion has to walk the AIL to find the correct
place to insert the new item into the AIL. This can consume large
amounts of CPU time and block other operations from occurring while
the traversals are in progress.
To avoid this repeated walk, use a AIL cursor to record
where we should be inserting the new items into the AIL without
having to repeat the walk. The cursor infrastructure already
provides this functionality for push walks, so is a simple extension
of existing code. While this will not avoid the initial walk, it
will avoid repeating it tens of thousands of times during a single
checkpoint commit.
This version includes logic improvements from Christoph Hellwig.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
On xfs exports, nfsd is incorrectly returning ENOENT instead of
ESTALE on attempts to use a filehandle of a deleted file (spotted
with pynfs test PUTFH3). The ENOENT was coming from xfs_iget.
(It's tempting to wonder whether we should just map all xfs_iget
errors to ESTALE, but I don't believe so--xfs_iget can also return
ENOMEM at least, which we wouldn't want mapped to ESTALE.)
While we're at it, the other return of ENOENT in xfs_nfs_get_inode()
also looks wrong.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove the second parameter to xfs_sb_count() since all callers of
the function set them.
Also, fix the header comment regarding it being called periodically.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Compilation of ext3/namei.c brought up an error and warning messages
when compiled with -DDX_DEBUG.
Signed-off-by: Bernd Schubert<bernd.schubert@itwm.fraunhofer.de>
Signed-off-by: Jan Kara <jack@suse.cz>
The per-sb shrinker has the same requirement as the writeback
threads of ensuring that the superblock is usable and pinned for the
time it takes to run the work. Both need to take a passive reference
to the sb, take a read lock on the s_umount lock and then only
continue if an unmount is not in progress.
pin_sb_for_writeback() does this exactly, so move it to fs/super.c
and rename it to grab_super_passive() and exporting it via
fs/internal.h for all the VFS code to be able to use.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With the inode LRUs moving to per-sb structures, there is no longer
a need for a global inode_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesytsems completely from each other.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The inode unused list is currently a global LRU. This does not match
the other global filesystem cache - the dentry cache - which uses
per-superblock LRU lists. Hence we have related filesystem object
types using different LRU reclaimation schemes.
To enable a per-superblock filesystem cache shrinker, both of these
caches need to have per-sb unused object LRU lists. Hence this patch
converts the global inode LRU to per-sb LRUs.
The patch only does rudimentary per-sb propotioning in the shrinker
infrastructure, as this gets removed when the per-sb shrinker
callouts are introduced later on.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Before we split up the inode_lru_lock, the unused inode counter
needs to be made independent of the global inode_lru_lock. Convert
it to per-cpu counters to do this.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
d_splice_alias(NULL, dentry) is equivalent to d_add(dentry, NULL), NULL
so no need for that if (inode) ... in there (or ERR_PTR(0), for that
matter)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
Allocates dentry, sets its ->d_sb to given superblock and sets
->d_op accordingly. Old d_alloc(NULL, name) callers are converted
to that (all of them know what superblock they want). d_alloc()
itself is left only for parent != NULl case; uses __d_alloc(),
inserts result into the list of parent's children.
Note that now ->d_sb is assign-once and never NULL *and*
->d_parent is never NULL either.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
combination of kern_path_parent() and lookup_create(). Does *not*
expose struct nameidata to caller. Syscalls converted to that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of playing with removal of LOOKUP_OPEN, mangling (and
restoring) nd->path, just pass NULL to vfs_create(). The whole
point of what's being done there is to suppress any attempts
to open file by underlying fs, which is what nd == NULL indicates.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a) check the right flags in ->create() (LOOKUP_OPEN, not LOOKUP_CREATE)
b) default (!LOOKUP_OPEN) open_flags is O_CREAT|O_EXCL|FMODE_READ, not 0
c) lookup_instantiate_filp() should be done only with LOOKUP_OPEN;
otherwise we need to issue CLOSE, lest we leak stateid on server.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and get rid of a bogus typecast, while we are at it; it's not
just that we want a function returning int and not void, but cast
to pointer to function taking void * and returning void would be
(void (*)(void *)) and not (void *)(void *), TYVM...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pass mask instead; kill security_inode_exec_permission() since we can use
security_inode_permission() instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: would_dump(bprm, file). Checks if we are allowed to
read the file and if we are not - sets ENFORCE_NODUMP. Exported,
used in places that previously open-coded the same logics.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
capability overrides apply only to the default case; if fs has ->permission()
that does _not_ call generic_permission(), we have no business doing them.
Moreover, if it has ->permission() that does call generic_permission(), we
have no need to recheck capabilities.
Besides, the capability overrides should apply only if we got EACCES from
acl_permission_check(); any other value (-EIO, etc.) should be returned
to caller, capabilities or not capabilities.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Call the given function for all superblocks of given type. Function
gets a superblock (with s_umount locked shared) and (void *) argument
supplied by caller of iterator.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order
for readdir, but unfortunately in the case of anything that stats the files in
order that readdir spits back (like oh say ls) that means we still have to do
the normal lookup of the file, which means looking up our other index and then
looking up the inode. What I want is a way to create dummy dentries when we
find them in readdir so that when ls or anything else subsequently does a
stat(), we already have the location information in the dentry and can go
straight to the inode itself. The lookup stuff just assumes that if it finds a
dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP
flag so that the lookup code knows it still needs to run i_op->lookup() on the
parent to get the inode for the dentry. I have tested this with btrfs and I
went from something that looks like this
http://people.redhat.com/jwhiter/ls-noreada.png
To this
http://people.redhat.com/jwhiter/ls-good.png
Thats a savings of 1300 seconds, or 22 minutes. That is a significant savings.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Assume that /sys/kernel/debug/dummy64 is debugfs file created by
debugfs_create_x64().
# cd /sys/kernel/debug
# echo 0x1234567812345678 > dummy64
# cat dummy64
0x0000000012345678
# echo 0x80000000 > dummy64
# cat dummy64
0xffffffff80000000
A value larger than INT_MAX cannot be written to the debugfs file created
by debugfs_create_u64 or debugfs_create_x64 on 32bit machine. Because
simple_attr_write() uses simple_strtol() for the conversion.
To fix this, use simple_strtoll() instead.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: fix race in rcu lookup of pruned dentry
Fix cifs_get_root()
[ Edited the last commit to get rid of a 'unused variable "seq"'
warning due to Al editing the patch. - Linus ]
Don't update *inode in __follow_mount_rcu() until we'd verified that
there is mountpoint there. Kudos to Hugh Dickins for catching that
one in the first place and eventually figuring out the solution (and
catching a braino in the earlier version of patch).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Allow multiple workqueue items (locks with callbacks) to be
processed concurrently. There should be no reason not to
take advantage of this workqueue feature.
Signed-off-by: David Teigland <teigland@redhat.com>
Add missing ->i_mutex, convert to lookup_one_len() instead of
(broken) open-coded analog, cope with getting something like
a//b as relative pathname. Simplify the hell out of it, while
we are there...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
It's sort of ridiculous that we've never had a working reply cache for
NFSv4.
On the other hand, we may still not: our current reply cache is likely
not very good, especially in the TCP case (which is the only case that
matters for v4). What we really need here is some serious testing.
Anyway, here's a start.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
... and it's getting it wrong, too - missing ->d_revalidate() calls when
it's dealing with filesystem (procfs) that has non-trivial ->d_revalidate()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] update cifs to version 1.74
[CIFS] update limit for snprintf in cifs_construct_tcon
cifs: Fix signing failure when server mandates signing for NTLMSSP
deal with d_move() races properly; rename_lock read-retry loop,
rcu_read_lock() held while walking to root, d_lock held over
subtraction from namelen and copying the component to stabilize
->d_name.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Embed the necessary alias into the module rather than waiting for
someone to add it to /etc/modprobe.conf
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Somebody working on this code asked what the deal was with NFSv4, since
this comment notes that it's v2/v3's statelessness that requires
sillyrename. Shouldn't hurt to document the answer.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Before nfs41 client's RECLAIM_COMPLETE done, nfs server should deny any
new locks or opens.
rfc5661:
" Whenever a client establishes a new client ID and before it does
the first non-reclaim operation that obtains a lock, it MUST send a
RECLAIM_COMPLETE with rca_one_fs set to FALSE, even if there are no
locks to reclaim. If non-reclaim locking operations are done before
the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned. "
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
From: Miklos Szeredi <mszeredi@suse.cz>
Remove SLAB initialization entirely, as suggested by Bruce and Linus.
Allocate with __GFP_ZERO instead and only initialize list heads.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Check in SEQUENCE that the request doesn't exceed maxreq_sz for the
given session.
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
According to RFC5661, 18.36.3,
"if the client selects a value for ca_maxresponsesize such that
a replier on a channel could never send a response,the server
SHOULD return NFS4ERR_TOOSMALL in the CREATE_SESSION reply."
So, error out when the client sets a maxreq_sz less than the minimum
possible SEQUENCE request size, or sets a maxresp_sz less than the
minimum possible SEQUENCE reply size.
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Stateid's hold a read reference for a read open, a write reference for a
write open, and an additional one of each for each read+write open. The
latter wasn't getting put on a downgrade, so something like:
open RW
open R
downgrade to R
was resulting in a file leak.
Also fix an imbalance in an error path.
Regression from 7d94784293 "nfsd4: fix
downgrade/lock logic".
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Without this, for example,
open read
open read+write
close
will result in a struct file leak.
Regression from 7d94784293 "nfsd4: fix
downgrade/lock logic".
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This operation is used by the client to check the validity of a list of
stateids.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This operation is used by the client to tell the server to free a
stateid.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
As promised in feature-removal-schedule.txt it is time to
remove the nfsctl system call.
Userspace has perferred to not use this call throughout 2.6 and it has been
excluded in the default configuration since 2.6.36 (9 months ago).
So this patch removes all the code that was being compiled out.
There are still references to sys_nfsctl in various arch systemcall tables
and related code. These should be cleaned out too, probably in the next
merge window.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
DESTROY_CLIENTID MAY be preceded with a SEQUENCE operation as long as
the client ID derived from the session ID of SEQUENCE is not the same
as the client ID to be destroyed. If the client IDs are the same,
then the server MUST return NFS4ERR_CLIENTID_BUSY.
(that's not implemented yet)
If DESTROY_CLIENTID is not prefixed by SEQUENCE, it MUST be the only
operation in the COMPOUND request (otherwise, the server MUST return
NFS4ERR_NOT_ONLY_OP).
This fixes the error return; before, we returned
NFS4ERR_OP_NOT_IN_SESSION; after this patch, we return NFS4ERR_NOTSUPP.
Signed-off-by: Benny Halevy <benny@tonian.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Instead of creating our own kthread (dlm_astd) to deliver
callbacks for all lockspaces, use a per-lockspace workqueue
to deliver the callbacks. This eliminates complications and
slowdowns from many lockspaces sharing the same thread.
Signed-off-by: David Teigland <teigland@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
fix loop checks in d_materialise_unique()
Fix ->d_lock locking order in unlazy_walk()
Change explicit references to CONFIG_NFS_V4_1 to implicit ones
Get rid of the unnecessary defines in backchannel_rqst.c and
bc_svc.c: the Makefile takes care of those dependency.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use nfs_pageio_reset_read_mds and nfs_pageio_reset_write_mds instead of
completely reinitialising the struct nfs_pageio_descriptor.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
...and ensure that we recoalese to take into account differences in
differences in block sizes when falling back to write through the MDS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
...and ensure that we recoalese to take into account differences in
block sizes when falling back to read through the MDS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If an attempt to do pNFS fails, and we have to fall back to writing through
the MDS, then we may want to re-coalesce the requests that we already have
since the block size for the MDS read/writes may be different to that of
the DS read/writes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of looking up the rsize and wsize, the routines that generate the
RPC requests should really be using the pg_bsize, since that is what we
use when deciding whether or not to coalesce write requests...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
__gfs2_free_data and __gfs2_free_meta are almost identical, and
can be trivially combined.
[This is as per Eric's original patch minus gfs2_free_data() which had
no callers left and plus the conversion of the bmap.c calls to these
functions. All in all, a nice clean up]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds S_NOSEC support to GFS2. We set/reset the flag either when
a user calls setattr or when we have just regained the glock
from another node. The flag is only set if there are no xattrs
on the inode and there is no suid bit set.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
This patch is a performance improvement for GFS2 in a clustered
environment. It makes the glock hold time self-adjusting.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch adds a cache for the hash table to the directory code
in order to help simplify the way in which the hash table is
accessed. This is intended to be a first step towards introducing
some performance improvements in the directory code.
There are two follow ups that I'm hoping to see fairly shortly. One
is to simplify the hash table reading code now that we always read the
complete hash table, whether we want one entry or all of them. The
other is to introduce readahead on the heads of the hash chains
which are referred to from the table.
The hash table is a maximum of 128k in size, so it is not worth trying
to read it in small chunks.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Both __d_unalias() and __d_materialise_dentry() need loop prevention.
Grab rename_lock in caller, check for loops there...
As a side benefit, we have dentry_lock_for_move() called only under
rename_lock, which seriously reduces deadlock potential of the
execrable "locking order" used for ->d_lock.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
GFS2: Resolve inode eviction and ail list interaction bug
GFS2: Fix race during filesystem mount
GFS2: force a log flush when invalidating the rindex glock
This patch contains a few misc fixes which resolve a recently
reported issue. This patch has been a real team effort and has
received a lot of testing.
The first issue is that the ail lock needs to be held over a few
more operations. The lock thats added into gfs2_releasepage() may
possibly be a candidate for replacing with RCU at some future
point, but at this stage we've gone for the obvious fix.
The second issue is that gfs2_write_inode() can end up calling
a glock recursively when called from gfs2_evict_inode() via the
syncing code, so it needs a guard added.
The third issue is that we either need to not truncate the metadata
pages of inodes which have zero link count, but which we cannot
deallocate due to them still being in use by other nodes, or we need
to ensure that those pages have all made it through the journal and
ail lists first. This patch takes the former approach, but the
latter has also been tested and there is nothing to choose between
them performance-wise. So again, we could revise that decision
in the future.
Also, the inode eviction process is now better documented.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Tested-by: Abhijith Das <adas@redhat.com>
Reported-by: Barry J. Marson <bmarson@redhat.com>
Reported-by: David Teigland <teigland@redhat.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
SUNRPC: Fix use of static variable in rpcb_getport_async
NFSv4.1: update nfs4_fattr_bitmap_maxsz
SUNRPC: Fix a race between work-queue and rpc_killall_tasks
pnfs: write: Set mds_offset in the generic layer - it is needed by all LDs
Remove various bits left over from the old kdb-only btree tracing code, but
leave the actual trace point stubs in place to ease adding new event based
btree tracing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Remove the dead hash table test rid which has been rotting away under
QUOTADEBUG, including some code that was compiled for normal debug
builds, but not actually called without QUOTADEBUG, and enable a few
cheap debug checks that were hidden under QUOTADEBUG for normal
debug builds.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Replace the typeless b_fspriv2 and the ugly macros around it with a properly
typed transaction pointer. As a fallout the log buffer state debug checks
are also removed. We could have kept them using casts, but as they do
not have a real purpose we can as well just remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xfs_da_grow_inode and xfs_dir2_grow_inode are mostly duplicate code. Factor
the meat of those two functions into a new common helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Change the bests array to be a proper variable sized entry. This is done
easily as no one relies on the size of the structure. Also change
XFS_DIR2_MAX_FREE_BESTS to an inline function while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Replace the current mess of dir2 headers with just three that have a clear
purpose:
- xfs_dir2_format.h for all format definitions, including the inline helpers
to access our variable size structures
- xfs_dir2_priv.h for all prototypes that are internal to the dir2 code
and not needed by anything outside of the directory code. For this
purpose xfs_da_btree.c, and phase6.c in xfs_repair are considered part
of the directory code.
- xfs_dir2.h for the public interface to the directory code
In addition to the reshuffle I have also update the comments to not only
match the new file structure, but also to describe the directory format
better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Start the periodic sync workers only after we have finished xfs_mountfs
and thus fully set up the filesystem structures. Without this we can
call into xfs_qm_sync before the quotainfo strucute is set up if the
mount takes unusually long, and probably hit other incomplete states
as well.
Also clean up the xfs_fs_fill_super error path by using consistent
label names, and removing an impossible to reach case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Reviewed-by: Alex Elder <aelder@sgi.com>
Make sure that child is still a child of parent before nested locking
of child->d_lock in unlazy_walk(); otherwise we are risking a violation
of locking order and deadlocks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
By pre-allocating rsb structs before searching the hash
table, they can be inserted immediately. This avoids
always having to repeat the search when adding the struct
to hash list.
This also adds space to the rsb struct for a max resource
name, so an rsb allocation can be used by any request.
The constant size also allows us to finally use a slab
for the rsb structs.
Signed-off-by: David Teigland <teigland@redhat.com>
In 34c87901e1 "Shrink stack space usage in cifs_construct_tcon" we
change the size of the username name buffer from MAX_USERNAME_SIZE
(256) to 28. This call to snprintf() needs to be updated as well.
Reported by Dan Carpenter.
Reviewed-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When using NTLMSSP authentication mechanism, if server mandates
signing, keep the flags in type 3 messages of the NTLMSSP exchange
same as in type 1 messages (i.e. keep the indicated capabilities same).
Some of the servers such as Samba, expect the flags such as
Negotiate_Key_Exchange in type 3 message of NTLMSSP exchange as well.
Some servers like Windows do not.
https://bugzilla.samba.org/show_bug.cgi?id=8212
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Split them up into two parts: one which sets up the struct nfs_read/write_data,
the other which sets up the actual RPC call or pNFS call.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we're writing back data, and the FLUSH_STABLE flag is set, then we
always want to use NFS_FILE_SYNC, since we're always in a situation where
we're doing page reclaim, and so we want to free up the page as quickly
as possible.
If we're in the FLUSH_COND_STABLE case, then we either want to use another
unstable write (if we have to do a commit anyway) or again, we want to
use NFS_FILE_SYNC because we know that we have no more pages to write
out.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Mark all deviceids established under an expired MDS clientid as invalid.
Stop all new i/o through DS and send through the MDS.
Don't use any new LAYOUTGETs that use the invalid deviceid. Purge all layouts
established under the expired MDS clientid.
Remove the MDS clientid deviceid and data servers reference
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since we take a reference to it, we really ought to pass the a pointer to
the layout header in the arguments instead of assuming that
NFS_I(inode)->layout will forever point to the correct object.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ask for whole file layouts. Until support for layout segments is fully
supported in the file layout code, discard non-whole file layouts.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The fact that the global device id cache holds a reference to the
nfs4_deviceid_node until it is invisible to rcu lookups implies that
we can always assume that the reference count is non-zero in
_find_get_deviceid.
Also clean up nfs4_put_deviceid_node and the removal of the device id
from the cache.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We need to ensure that the layouts are set up before we can decide to
coalesce requests. To do so, we want to further split up the struct
nfs_pageio_descriptor operations into an initialisation callback, a
coalescing test callback, and a 'do i/o' callback.
This patch cleans up the existing callback methods before adding the
'initialisation' callback.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When recovering open files and locks, the stateid should be tested
against the server and freed if it is invalid. This patch adds new
recovery functions for NFS v4.1.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
FREE_STATEID is used to tell the server that we want to free a stateid
that no longer has any locks associated with it. This allows the client
to reclaim locks without encountering edge conditions documented in
section 8.4.3 of RFC 5661.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds in the xdr for doing a TEST_STATEID call with a single
stateid. RFC 5661 allows multiple stateids to be tested in a single
call, but only testing one keeps things simpler for now.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the client is using NFS v4.1, then we can use SECINFO_NO_NAME to find
the secflavor for the initial mount. If the server doesn't support
SECINFO_NO_NAME then I fall back on the "guess and check" method used
for v4.0 mounts.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Layouts should be tracked per nfs_server (aka superblock)
instead of per struct nfs_client, which may have multiple FSIDs associated
with it.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>