Commit Graph

19954 Commits

Author SHA1 Message Date
Yehuda Sadeh
ae1533b62b ceph-rbd: osdc support for osd call and rollback operations
This will be used for rbd snapshots administration.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20 15:37:25 -07:00
Yehuda Sadeh
68b4476b0b ceph: messenger and osdc changes for rbd
Allow the messenger to send/receive data in a bio.  This is added
so that we wouldn't need to copy the data into pages or some other buffer
when doing IO for an rbd block device.

We can now have trailing variable sized data for osd
ops.  Also osd ops encoding is more modular.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:37:18 -07:00
Yehuda Sadeh
3499e8a5d4 ceph: refactor osdc requests creation functions
The osd requests creation are being decoupled from the
vino parameter, allowing clients using the osd to use
other arbitrary object names that are not necessarily
vino based. Also, calc_raw_layout now takes a snap id.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:36:01 -07:00
Yehuda Sadeh
7669a2c95e ceph: lookup pool in osdmap by name
Implement a pool lookup by name.  This will be used by rbd.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20 15:35:36 -07:00
Trond Myklebust
6eaa61496f NFSv4: Don't call nfs4_reclaim_complete() on receiving NFS4ERR_STALE_CLIENTID
If the server sends us an NFS4ERR_STALE_CLIENTID while the state management
thread is busy reclaiming state, we do want to treat all state that wasn't
reclaimed before the STALE_CLIENTID as if a network partition occurred (see
the edge conditions described in RFC3530 and RFC5661).
What we do not want to do is to send an nfs4_reclaim_complete(), since we
haven't yet even started reclaiming state after the server rebooted.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-10-19 19:42:53 -04:00
Trond Myklebust
ae1007d37e NFSv4: Don't call nfs4_state_mark_reclaim_reboot() from error handlers
In the case of a server reboot, the state recovery thread starts by calling
nfs4_state_end_reclaim_reboot() in order to avoid edge conditions when
the server reboots while the client is in the middle of recovery.

However, if the client has already marked the nfs4_state as requiring
reboot recovery, then the above behaviour will cause the recovery thread to
treat the open as if it was part of such an edge condition: the open will
be recovered as if it was part of a lease expiration (and all the locks
will be lost).
Fix is to remove the call to nfs4_state_mark_reclaim_reboot from
nfs4_async_handle_error(), and nfs4_handle_exception(). Instead we leave it
to the recovery thread to do this for us.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-10-19 19:42:33 -04:00
Trond Myklebust
b0ed9dbc24 NFSv4: Fix open recovery
NFSv4 open recovery is currently broken: since we do not clear the
state->flags states before attempting recovery, we end up with the
'can_open_cached()' function triggering. This again leads to no OPEN call
being put on the wire.

Reported-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-10-19 19:41:55 -04:00
Trond Myklebust
bc4866b6e0 NFS: Don't SIGBUS if nfs_vm_page_mkwrite races with a cache invalidation
In the case where we lock the page, and then find out that the page has
been thrown out of the page cache, we should just return VM_FAULT_NOPAGE.
This is what block_page_mkwrite() does in these situations.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-10-19 19:37:54 -04:00
Russell King
809b4e00ba Merge branch 'devel-stable' into devel 2010-10-19 22:06:36 +01:00
Tejun Heo
3e24e13287 cifs: cancel_delayed_work() + flush_scheduled_work() -> cancel_delayed_work_sync()
flush_scheduled_work() is going away.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-19 18:58:36 +00:00
Shirish Pargaonkar
89f150f401 Clean up two declarations of blob_len
- Eliminate double declaration of variable blob_len
- Modify function build_ntlmssp_auth_blob to return error code
  as well as length of the blob.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-19 18:56:42 +00:00
Al Viro
c4a0472725 fix rawctl compat ioctls breakage on amd64 and itanic
RAW_SETBIND and RAW_GETBIND 32bit versions are fscked in interesting ways.

1) fs/compat_ioctl.c has COMPATIBLE_IOCTL(RAW_SETBIND) followed by
HANDLE_IOCTL(RAW_SETBIND, raw_ioctl).  The latter is ignored.

2) on amd64 (and itanic) the damn thing is broken - we have int + u64 + u64
and layouts on i386 and amd64 are _not_ the same.  raw_ioctl() would
work there, but it's never called due to (1).  As it is, i386 /sbin/raw
definitely doesn't work on amd64 boxen.

3) switching to raw_ioctl() as is would *not* work on e.g. sparc64 and ppc64,
which would be rather sad, seeing that normal userland there is 32bit.
The thing is, slapping __packed on the struct in question does not DTRT -
it eliminates *all* padding.  The real solution is to use compat_u64.

4) of course, all that stuff has no business being outside of raw.c in the
first place - there should be ->compat_ioctl() for /dev/rawctl instead of
messing with compat_ioctl.c.

[akpm@linux-foundation.org: coding-style fixes]
[arnd@arndb.de: port to 2.6.36]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-19 11:29:54 +02:00
Jens Axboe
fa251f8990 Merge branch 'v2.6.36-rc8' into for-2.6.37/barrier
Conflicts:
	block/blk-core.c
	drivers/block/loop.c
	mm/swapfile.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-19 09:13:04 +02:00
Yasuaki Ishimatsu
7681bfeecc block: fix accounting bug on cross partition merges
/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
   step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
   from sda2 region to sda1 region. However the two partition's
   hd_struct->in_flight are not changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda2's hd_struct->in_flight, not a sda1's one, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

When reloading partition tables, quiesce IO to ensure that no
request references to the partition struct exists. When it is safe
to free the partition table, the IO for that device is restarted
again.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-19 09:07:02 +02:00
Thomas Gleixner
a731cd116c xfs: semaphore cleanup
Get rid of init_MUTEX[_LOCKED]() and use sema_init() instead.

(Ported to current XFS code by <aelder@sgi.com>.)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:09:09 -05:00
Arkadiusz Mi?kiewicz
6743099ce5 xfs: Extend project quotas to support 32bit project ids
This patch adds support for 32bit project quota identifiers.

On disk format is backward compatible with 16bit projid numbers. projid
on disk is now kept in two 16bit values - di_projid_lo (which holds the
same position as old 16bit projid value) and new di_projid_hi (takes
existing padding) and converts from/to 32bit value on the fly.

xfs_admin (for existing fs), mkfs.xfs (for new fs) needs to be used
to enable PROJID32BIT support.

Signed-off-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:08 -05:00
Christoph Hellwig
1a1a3e97ba xfs: remove xfs_buf wrappers
Stop having two different names for many buffer functions and use
the more descriptive xfs_buf_* names directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:07 -05:00
Christoph Hellwig
6c77b0ea1b xfs: remove xfs_cred.h
We're not actually passing around credentials inside XFS for a while
now, so remove all xfs_cred.h with it's cred_t typedef and all
instances of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:06 -05:00
Christoph Hellwig
78a4b0961f xfs: remove xfs_globals.h
This header only provides one extern that isn't actually declared
anywhere, and shadowed by a macro.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:05 -05:00
Christoph Hellwig
668332e5fe xfs: remove xfs_version.h
It used to have a place when it contained an automatically generated
CVS version, but these days it's entirely superflous.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:04 -05:00
Christoph Hellwig
1ae4fe6dba xfs: remove xfs_refcache.h
This header has been completely unused for a couple of years.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:03 -05:00
Christoph Hellwig
4957a449a1 xfs: fix the xfs_trans_committed
Use the correct prototype for xfs_trans_committed instead of casting it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:02 -05:00
Christoph Hellwig
dfe188d428 xfs: remove unused t_callback field in struct xfs_trans
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:01 -05:00
Christoph Hellwig
d276734d93 xfs: fix bogus m_maxagi check in xfs_iget
These days inode64 should only control which AGs we allocate new
inodes from, while we still try to support reading all existing
inodes.  To make this actually work the check ontop of xfs_iget
needs to be relaxed to allow inodes in all allocation groups instead
of just those that we allow allocating inodes from.  Note that we
can't simply remove the check - it prevents us from accessing
invalid data when fed invalid inode numbers from NFS or bulkstat.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:01 -05:00
Christoph Hellwig
1b0407125f xfs: do not use xfs_mod_incore_sb_batch for per-cpu counters
Update the per-cpu counters manually in xfs_trans_unreserve_and_mod_sb
and remove support for per-cpu counters from xfs_mod_incore_sb_batch
to simplify it.  And added benefit is that we don't have to take
m_sb_lock for transactions that only modify per-cpu counters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:08:00 -05:00
Christoph Hellwig
96540c7858 xfs: do not use xfs_mod_incore_sb for per-cpu counters
Export xfs_icsb_modify_counters and always use it for modifying
the per-cpu counters.  Remove support for per-cpu counters from
xfs_mod_incore_sb to simplify it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:59 -05:00
Christoph Hellwig
61ba35dea0 xfs: remove XFS_MOUNT_NO_PERCPU_SB
Fail the mount if we can't allocate memory for the per-CPU counters.
This is consistent with how we handle everything else in the mount
path and makes the superblock counter modification a lot simpler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:58 -05:00
Dave Chinner
50f59e8eed xfs: pack xfs_buf structure more tightly
pahole reports the struct xfs_buf has quite a few holes in it, so
packing the structure better will reduce the size of it by 16 bytes.
Also, move all the fields used in cache lookups into the first
cacheline.

Before on x86_64:

        /* size: 320, cachelines: 5 */
	/* sum members: 298, holes: 6, sum holes: 22 */

After on x86_64:

        /* size: 304, cachelines: 5 */
	/* padding: 6 */
	/* last cacheline: 48 bytes */

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:57 -05:00
Dave Chinner
74f75a0cb7 xfs: convert buffer cache hash to rbtree
The buffer cache hash is showing typical hash scalability problems.
In large scale testing the number of cached items growing far larger
than the hash can efficiently handle. Hence we need to move to a
self-scaling cache indexing mechanism.

I have selected rbtrees for indexing becuse they can have O(log n)
search scalability, and insert and remove cost is not excessive,
even on large trees. Hence we should be able to cache large numbers
of buffers without incurring the excessive cache miss search
penalties that the hash is imposing on us.

To ensure we still have parallel access to the cache, we need
multiple trees. Rather than hashing the buffers by disk address to
select a tree, it seems more sensible to separate trees by typical
access patterns. Most operations use buffers from within a single AG
at a time, so rather than searching lots of different lists,
separate the buffer indexes out into per-AG rbtrees. This means that
searches during metadata operation have a much higher chance of
hitting cache resident nodes, and that updates of the tree are less
likely to disturb trees being accessed on other CPUs doing
independent operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:56 -05:00
Dave Chinner
69b491c214 xfs: serialise inode reclaim within an AG
Memory reclaim via shrinkers has a terrible habit of having N+M
concurrent shrinker executions (N = num CPUs, M = num kswapds) all
trying to shrink the same cache. When the cache they are all working
on is protected by a single spinlock, massive contention an
slowdowns occur.

Wrap the per-ag inode caches with a reclaim mutex to serialise
reclaim access to the AG. This will block concurrent reclaim in each
AG but still allow reclaim to scan multiple AGs concurrently. Allow
shrinkers to move on to the next AG if it can't get the lock, and if
we can't get any AG, then start blocking on locks.

To prevent reclaimers from continually scanning the same inodes in
each AG, add a cursor that tracks where the last reclaim got up to
and start from that point on the next reclaim. This should avoid
only ever scanning a small number of inodes at the satart of each AG
and not making progress. If we have a non-shrinker based reclaim
pass, ignore the cursor and reset it to zero once we are done.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:55 -05:00
Dave Chinner
e3a20c0b02 xfs: batch inode reclaim lookup
Batch and optimise the per-ag inode lookup for reclaim to minimise
scanning overhead. This involves gang lookups on the radix trees to
get multiple inodes during each tree walk, and tighter validation of
what inodes can be reclaimed without blocking befor we take any
locks.

This is based on ideas suggested in a proof-of-concept patch
posted by Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:54 -05:00
Dave Chinner
78ae525676 xfs: implement batched inode lookups for AG walking
With the reclaim code separated from the generic walking code, it is
simple to implement batched lookups for the generic walk code.
Separate out the inode validation from the execute operations and
modify the tree lookups to get a batch of inodes at a time.

Reclaim operations will be optimised separately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:53 -05:00
Dave Chinner
e13de955ca xfs: split out inode walk inode grabbing
When doing read side inode cache walks, the code to validate and
grab an inode is common to all callers. Split it out of the execute
callbacks in preparation for batching lookups. Similarly, split out
the inode reference dropping from the execute callbacks into the
main lookup look to be symmetric with the grab.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:52 -05:00
Dave Chinner
65d0f20533 xfs: split inode AG walking into separate code for reclaim
The reclaim walk requires different locking and has a slightly
different walk algorithm, so separate it out so that it can be
optimised separately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:52 -05:00
Dave Chinner
69d6cc76cf xfs: remove buftarg hash for external devices
For RT and external log devices, we never use hashed buffers on them
now.  Remove the buftarg hash tables that are set up for them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:51 -05:00
Dave Chinner
1922c949c5 xfs: use unhashed buffers for size checks
When we are checking we can access the last block of each device, we
do not need to use cached buffers as they will be tossed away
immediately. Use uncached buffers for size checks so that all IO
prior to full in-memory structure initialisation does not use the
buffer cache.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:50 -05:00
Dave Chinner
26af655233 xfs: kill XBF_FS_MANAGED buffers
Filesystem level managed buffers are buffers that have their
lifecycle controlled by the filesystem layer, not the buffer cache.
We currently cache these buffers, which makes cleanup and cache
walking somewhat troublesome. Convert the fs managed buffers to
uncached buffers obtained by via xfs_buf_get_uncached(), and remove
the XBF_FS_MANAGED special cases from the buffer cache.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:49 -05:00
Dave Chinner
ebad861b57 xfs: store xfs_mount in the buftarg instead of in the xfs_buf
Each buffer contains both a buftarg pointer and a mount pointer. If
we add a mount pointer into the buftarg, we can avoid needing the
b_mount field in every buffer and grab it from the buftarg when
needed instead. This shrinks the xfs_buf by 8 bytes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:48 -05:00
Dave Chinner
5adc94c247 xfs: introduced uncached buffer read primitve
To avoid the need to use cached buffers for single-shot or buffers
cached at the filesystem level, introduce a new buffer read
primitive that bypasses the cache an reads directly from disk.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:47 -05:00
Dave Chinner
686865f76e xfs: rename xfs_buf_get_nodaddr to be more appropriate
xfs_buf_get_nodaddr() is really used to allocate a buffer that is
uncached. While it is not directly assigned a disk address, the fact
that they are not cached is a more important distinction. With the
upcoming uncached buffer read primitive, we should be consistent
with this disctinction.

While there, make page allocation in xfs_buf_get_nodaddr() safe
against memory reclaim re-entrancy into the filesystem by allowing
a flags parameter to be passed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:46 -05:00
Dave Chinner
dcd79a1423 xfs: don't use vfs writeback for pure metadata modifications
Under heavy multi-way parallel create workloads, the VFS struggles
to write back all the inodes that have been changed in age order.
The bdi flusher thread becomes CPU bound, spending 85% of it's time
in the VFS code, mostly traversing the superblock dirty inode list
to separate dirty inodes old enough to flush.

We already keep an index of all metadata changes in age order - in
the AIL - and continued log pressure will do age ordered writeback
without any extra overhead at all. If there is no pressure on the
log, the xfssyncd will periodically write back metadata in ascending
disk address offset order so will be very efficient.

Hence we can stop marking VFS inodes dirty during transaction commit
or when changing timestamps during transactions. This will keep the
inodes in the superblock dirty list to those containing data or
unlogged metadata changes.

However, the timstamp changes are slightly more complex than this -
there are a couple of places that do unlogged updates of the
timestamps, and the VFS need to be informed of these. Hence add a
new function xfs_trans_ichgtime() for transactional changes,
and leave xfs_ichgtime() for the non-transactional changes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-10-18 15:07:45 -05:00
Dave Chinner
e176579e70 xfs: lockless per-ag lookups
When we start taking a reference to the per-ag for every cached
buffer in the system, kernel lockstat profiling on an 8-way create
workload shows the mp->m_perag_lock has higher acquisition rates
than the inode lock and has significantly more contention. That is,
it becomes the highest contended lock in the system.

The perag lookup is trivial to convert to lock-less RCU lookups
because perag structures never go away. Hence the only thing we need
to protect against is tree structure changes during a grow. This can
be done simply by replacing the locking in xfs_perag_get() with RCU
read locking. This removes the mp->m_perag_lock completely from this
path.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:44 -05:00
Dave Chinner
bd32d25a7c xfs: remove debug assert for per-ag reference counting
When we start taking references per cached buffer to the the perag
it is cached on, it will blow the current debug maximum reference
count assert out of the water. The assert has never caught a bug,
and we have tracing to track changes if there ever is a problem,
so just remove it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:43 -05:00
Dave Chinner
d1583a3833 xfs: reduce the number of CIL lock round trips during commit
When commiting a transaction, we do a lock CIL state lock round trip
on every single log vector we insert into the CIL. This is resulting
in the lock being as hot as the inode and dcache locks on 8-way
create workloads. Rework the insertion loops to bring the number
of lock round trips to one per transaction for log vectors, and one
more do the busy extents.

Also change the allocation of the log vector buffer not to zero it
as we copy over the entire allocated buffer anyway.

This patch also includes a structural cleanup to the CIL item
insertion provided by Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:42 -05:00
Poyo VL
9c169915ad xfs: eliminate some newly-reported gcc warnings
Ionut Gabriel Popescu <poyo_vl@yahoo.com> submitted a simple change
to eliminate some "may be used uninitialized" warnings when building
XFS.  The reported condition seems to be something that GCC did not
used to recognize or report.  The warnings were produced by:

    gcc version 4.5.0 20100604
    [gcc-4_5-branch revision 160292] (SUSE Linux)

Signed-off-by: Ionut Gabriel Popescu <poyo_vl@yahoo.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:39 -05:00
Christoph Hellwig
c0e59e1ac0 xfs: remove the ->kill_root btree operation
The implementation os ->kill_root only differ by either simply
zeroing out the now unused buffer in the btree cursor in the inode
allocation btree or using xfs_btree_setbuf in the allocation btree.

Initially both of them used xfs_btree_setbuf, but the use in the
ialloc btree was removed early on because it interacted badly with
xfs_trans_binval.

In addition to zeroing out the buffer in the cursor xfs_btree_setbuf
updates the bc_ra array in the btree cursor, and calls
xfs_trans_brelse on the buffer previous occupying the slot.

The bc_ra update should be done for the alloc btree updated too,
although the lack of it does not cause serious problems.  The
xfs_trans_brelse call on the other hand is effectively a no-op in
the end - it keeps decrementing the bli_recur refcount until it hits
zero, and then just skips out because the buffer will always be
dirty at this point.  So removing it for the allocation btree is
just fine.

So unify the code and move it to xfs_btree.c.  While we're at it
also replace the call to xfs_btree_setbuf with a NULL bp argument in
xfs_btree_del_cursor with a direct call to xfs_trans_brelse given
that the cursor is beeing freed just after this and the state
updates are superflous.  After this xfs_btree_setbuf is only used
with a non-NULL bp argument and can thus be simplified.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:38 -05:00
Christoph Hellwig
acecf1b5d8 xfs: stop using xfs_qm_dqtobp in xfs_qm_dqflush
In xfs_qm_dqflush we know that q_blkno must be initialized already from a
previous xfs_qm_dqread.  So instead of calling xfs_qm_dqtobp we can
simply read the quota buffer directly.  This also saves us from a duplicate
xfs_qm_dqcheck call check and allows xfs_qm_dqtobp to be simplified now
that it is always called for a newly initialized inode.  In addition to
that properly unwind all locks in xfs_qm_dqflush when xfs_qm_dqcheck
fails.

This mirrors a similar cleanup in the inode lookup done earlier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:37 -05:00
Christoph Hellwig
52fda11424 xfs: simplify xfs_qm_dqusage_adjust
There is no need to have the users and group/project quota locked at the
same time.  Get rid of xfs_qm_dqget_noattach and just do a xfs_qm_dqget
inside xfs_qm_quotacheck_dqadjust for the quota we are operating on
right now.  The new version of xfs_qm_quotacheck_dqadjust holds the
inode lock over it's operations, which is not a problem as it simply
increments counters and there is no concern about log contention
during mount time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18 15:07:36 -05:00
Dave Chinner
4472235205 xfs: Introduce XFS_IOC_ZERO_RANGE
XFS_IOC_ZERO_RANGE is the equivalent of an atomic XFS_IOC_UNRESVSP/
XFS_IOC_RESVSP call pair. It enabled ranges of written data to be
turned into zeroes without requiring IO or having to free and
reallocate the extents in the range given as would occur if we had
to punch and then preallocate them separately.  This enables
applications to zero parts of files very quickly without changing
the layout of the files in any way.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-10-18 15:07:25 -05:00
Dave Chinner
3ae4c9deb3 xfs: use range primitives for xfs page cache operations
While XFS passes ranges to operate on from the core code, the
functions being called ignore the either the entire range or the end
of the range. This is historical because when the function were
written linux didn't have the necessary range operations. Update the
functions to use the correct operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-10-18 15:07:24 -05:00
Boaz Harrosh
115e19c535 exofs: Set i_mapping->backing_dev_info anyway
Though it has been promised that inode->i_mapping->backing_dev_info
is not used and the supporting code is fine. Until the pointer
will default to NULL, I'd rather it points to the correct thing
regardless.

At least for future infrastructure coder it is a clear indication
of where are the key points that inodes are initialized.
I know because it took me time to find this out.

Signed-off-by: Boaz Harrosh <Boaz Harrosh bharrosh@panasas.com>
2010-10-18 20:16:02 +02:00
Boaz Harrosh
7aebf4106b exofs: Cleaup read path in regard with read_for_write
Last BUG fix added a flag to the the page_collect structure
to communicate with readpage_strip. This calls for a clean up
removing that flag's reincarnations in the read functions
parameters.

Signed-off-by: Boaz Harrosh <Boaz Harrosh bharrosh@panasas.com>
2010-10-18 20:16:02 +02:00
Ingo Molnar
f2f108eb45 Merge branch 'linus' into core/locking
Merge reason: Update to almost-final-.36

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-18 18:43:46 +02:00
Andrea Gelmini
33027af637 GFS2: fixed typo
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-10-18 14:38:07 +01:00
Justin P. Mattock
631dd1a885 Update broken web addresses in the kernel.
The patch below updates broken web addresses in the kernel

Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Finn Thain <fthain@telegraphics.com.au>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Dimitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Ben Pfaff <blp@cs.stanford.edu>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Reviewed-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-10-18 11:03:14 +02:00
Jeff Layton
b33879aa83 cifs: move cifsFileInfo_put to file.c
...and make it non-inlined in preparation for the move of most of
cifs_close to it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-18 01:32:05 +00:00
Jeff Layton
4477288a10 cifs: convert GlobalSMBSeslock from a rwlock to regular spinlock
Convert this lock to a regular spinlock

A rwlock_t offers little value here. It's more expensive than a regular
spinlock unless you have a fairly large section of code that runs under
the read lock and can benefit from the concurrency.

Additionally, we need to ensure that the refcounting for files isn't
racy and to do that we need to lock areas that can increment it for
write. That means that the areas that can actually use a read_lock are
very few and relatively infrequently used.

While we're at it, change the name to something easier to type, and fix
a bug in find_writable_file. cifsFileInfo_put can sleep and shouldn't be
called while holding the lock.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-18 01:32:01 +00:00
Steve French
7a16f1961a [CIFS] Fix minor checkpatch warning and update cifs version
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-18 01:09:48 +00:00
Jeff Layton
15ecb436c0 cifs: move cifs_new_fileinfo to file.c
It's currently in dir.c which makes little sense...

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-18 01:07:31 +00:00
Jeff Layton
2e396b83f6 cifs: eliminate pfile pointer from cifsFileInfo
All the remaining users of cifsFileInfo->pfile just use it to get
at the f_flags/f_mode. Now that we store that separately in the
cifsFileInfo, there's no need to consult the pfile at all from
a cifsFileInfo pointer.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-18 01:07:20 +00:00
Jeff Layton
7da4b49a0e cifs: cifs_write argument change and cleanup
Have cifs_write take a cifsFileInfo pointer instead of a filp. Since
cifsFileInfo holds references on the dentry, and that holds one to
the inode, we can eliminate some unneeded NULL pointer checks.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-18 01:04:23 +00:00
Jeff Layton
15886177e4 cifs: clean up cifs_reopen_file
Add a f_flags field that holds the f_flags field from the filp. We'll
need this info in case the filp ever goes away before the cifsFileInfo
does. Have cifs_reopen_file use that value instead of filp->f_flags
too and have it take a cifsFileInfo arg instead of a filp.

While we're at it, get rid of some bogus cargo-cult NULL pointer
checks in that function and reduce the level of indentation.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-18 01:04:19 +00:00
Jeff Layton
abfe1eedd6 cifs: eliminate the inode argument from cifs_new_fileinfo
It already takes a file pointer. The inode associated with that had damn
well better be the same one we're passing in anyway. Thus, there's no
need for a separate argument here.

Also, get rid of the bogus check for a null pCifsInode pointer. The
CIFS_I macro uses container_of(), and that will virtually never return a
NULL pointer anyway.

Finally, move the setting of the canCache* flags outside of the lock.
Other places in the code don't hold that lock when setting it, so I
assume it's not really needed here either.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-18 01:04:16 +00:00
Jeff Layton
f6a53460e2 cifs: eliminate oflags option from cifs_new_fileinfo
Eliminate the poor, misunderstood "oflags" option from cifs_new_fileinfo.
The callers mostly pass in the filp->f_flags here.

That's not correct however since we're checking that value for
the presence of FMODE_READ. Luckily that only affects how the f_list is
ordered. What it really wants here is the file->f_mode. Just use that
field from the filp to determine it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-18 00:34:35 +00:00
Jeff Layton
608712fe86 cifs: fix flags handling in cifs_posix_open
The way flags are passed and converted for cifs_posix_open is rather
non-sensical. Some callers call cifs_posix_convert_flags on the flags
before they pass them to cifs_posix_open, whereas some don't. Two flag
conversion steps is just confusing though.

Change the function instead to clearly expect input in f_flags format,
and fix the callers to pass that in. Then, have cifs_posix_open call
cifs_convert_posix_flags to do the conversion. Move cifs_posix_open to
file.c as well so we can keep cifs_convert_posix_flags as a static
function.

Fix it also to not ignore O_CREAT, O_EXCL and O_TRUNC, and instead have
cifs_reopen_file mask those bits off before calling cifs_posix_open.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-18 00:34:29 +00:00
Artem Bityutskiy
7d08ae3c92 UBIFS: add a commentary about log recovery
Add a commentary which elaborates that 'ubifs_recover_log_leb()' recovers only
the last log LEB, not any. Also remove some unneeded newlines.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2010-10-17 15:57:40 +03:00
Alan Stern
69d44ffbd7 sysfs: Add sysfs_merge_group() and sysfs_unmerge_group()
This patch (as1420) adds sysfs_merge_group() and sysfs_unmerge_group()
functions, allowing drivers easily to add and remove sets of
attributes to a pre-existing attribute group directory.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2010-10-17 01:57:44 +02:00
Jeff Liu
2decd65a26 ocfs2: Avoid to evaluate xattr block flags again.
It was evaludated to indexed before, check it is ok i think.

Signed-off-by: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-10-15 13:03:43 -07:00
Joel Becker
fc3718918f Merge branch 'globalheartbeat-2' of git://oss.oracle.com/git/smushran/linux-2.6 into ocfs2-merge-window
Conflicts:
	fs/ocfs2/ocfs2.h
2010-10-15 13:03:09 -07:00
Sunil Mushran
d4396eafe4 ocfs2/cluster: Release debugfs file elapsed_time_in_ms
An earlier commit forgot to remove a debugfs file, elapsed_time_in_ms.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-15 11:57:21 -07:00
Jeff Layton
2f4f26fcf3 cifs: eliminate cifs_posix_open_inode_helper
cifs: eliminate cifs_posix_open_inode_helper

This function is redundant. The only thing it does is set the canCache
flags, but those get set in cifs_new_fileinfo anyway.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-15 18:22:21 +00:00
Suresh Jayaraman
6221ddd0f5 cifs: handle FindFirst failure gracefully
FindFirst failure due to permission errors or any other errors are silently
ignored by cifs_readdir(). This could cause problem to applications that depend
on the error to do further processing.

Reproducer:
  - mount a cifs share
  - mkdir tdir;touch tdir/1 tdir/2 tdir/3
  - chmod -x tdir
  - ls tdir

Currently, we start calling filldir() for '.' and '..' before we know we
whether FindFirst could succeed or not. If FindFirst fails later, there is no
way to notify VFS by setting buf.error and so VFS won't be able to catch this.
Fix this by moving the call to initiate_cifs_search() before we start doing
filldir().

This fixes https://bugzilla.samba.org/show_bug.cgi?id=7535

Reported-by: Tom Dexter <digitalaudiorock@gmail.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-15 15:19:55 +00:00
Arnd Bergmann
776c163b1b vfs: make no_llseek the default
All file operations now have an explicit .llseek
operation pointer, so we can change the default
action for future code.

This makes changes the default from default_llseek
to no_llseek, which always returns -ESPIPE if
a user tries to seek on a file without a .llseek
operation.

The name of the default_llseek function remains
unchanged, if anyone thinks we should change it,
please speak up.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
2010-10-15 15:53:46 +02:00
Arnd Bergmann
ab91261f5c vfs: don't use BKL in default_llseek
There are currently 191 users of default_llseek.
Nine of these are in device drivers that use the
big kernel lock. None of these ever touch
file->f_pos outside of llseek or file_pos_write.

Consequently, we never rely on the BKL
in the default_llseek function and can
replace that with i_mutex, which is also
used in generic_file_llseek.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-15 15:53:34 +02:00
Arnd Bergmann
6038f373a3 llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.

The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.

New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time.  Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.

The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.

Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.

Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.

===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
//   but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}

@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}

@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
   *off = E
|
   *off += E
|
   func(..., off, ...)
|
   E = *off
)
...+>
}

@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}

@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
  *off = E
|
  *off += E
|
  func(..., off, ...)
|
  E = *off
)
...+>
}

@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}

@ fops0 @
identifier fops;
@@
struct file_operations fops = {
 ...
};

@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
 .llseek = llseek_f,
...
};

@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
 .read = read_f,
...
};

@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
 .write = write_f,
...
};

@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
 .open = open_f,
...
};

// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
...  .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};

@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
...  .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};

// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
...  .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};

// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};

// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};

@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+	.llseek = default_llseek, /* write accesses f_pos */
};

// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////

@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
 .write = write_f,
 .read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};

@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};

@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};

@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-10-15 15:53:27 +02:00
Christoph Hellwig
46bf36ecec hfsplus: fix getxattr return value
We need to support -EOPNOTSUPP for attributes that are not supported to
match other filesystems and allow userspace to detect if Posix ACLs
are supported or not.  setxattr already gets this right.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-10-15 05:45:00 -07:00
Linus Torvalds
8fd01d6cfb Export dump_{write,seek} to binary loader modules
If you build aout support as a module, you'll want these exported.

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-14 19:15:28 -07:00
Linus Torvalds
3aa0ce825a Un-inline the core-dump helper functions
Tony Luck reports that the addition of the access_ok() check in commit
0eead9ab41 ("Don't dump task struct in a.out core-dumps") broke the
ia64 compile due to missing the necessary header file includes.

Rather than add yet another include (<asm/unistd.h>) to make everything
happy, just uninline the silly core dump helper functions and move the
bodies to fs/exec.c where they make a lot more sense.

dump_seek() in particular was too big to be an inline function anyway,
and none of them are in any way performance-critical.  And we really
don't need to mess up our include file headers more than they already
are.

Reported-and-tested-by: Tony Luck <tony.luck@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-14 14:32:06 -07:00
Shirish Pargaonkar
5d0d28824c NTLM authentication and signing - Calculate auth response per smb session
Start calculation auth response within a session.  Move/Add pertinet
data structures like session key, server challenge and ntlmv2_hash in
a session structure.  We should do the calculations within a session
before copying session key and response over to server data
structures because a session setup can fail.

Only after a very first smb session succeeds, it copies/makes its
session key, session key of smb connection.  This key stays with
the smb connection throughout its life.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-14 18:05:19 +00:00
Linus Torvalds
0eead9ab41 Don't dump task struct in a.out core-dumps
akiphie points out that a.out core-dumps have that odd task struct
dumping that was never used and was never really a good idea (it goes
back into the mists of history, probably the original core-dumping
code).  Just remove it.

Also do the access_ok() check on dump_write().  It probably doesn't
matter (since normal filesystems all seem to do it anyway), but he
points out that it's normally done by the VFS layer, so ...

[ I suspect that we should possibly do "vfs_write()" instead of
  calling ->write directly.  That also does the whole fsnotify and write
  statistics thing, which may or may not be a good idea. ]

And just to be anal, do this all for the x86-64 32-bit a.out emulation
code too, even though it's not enabled (and won't currently even
compile)

Reported-by: akiphie <akiphie@lavabit.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-14 10:57:40 -07:00
Christoph Hellwig
32e39e19cc hfsplus: remove the unused hfsplus_kmap/hfsplus_kunmap helpers
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-10-14 09:54:43 -04:00
Christoph Hellwig
90e616905a hfsplus: create correct initial catalog entries for device files
Make sure the initial insertation of the catalog entry already contains
the device number by calling init_special_inode early and setting writing
out the dev field of the on-disk permission structure.  The latter is
facilitated by sharing the almost identical hfsplus_set_perms helpers
between initial catalog entry creating and ->write_inode.

Unless we crashed just after mknod this bug was harmless as the inode
is marked dirty at the end of hfsplus_mknod, and hfsplus_write_inode
will update the catalog entry to contain the correct value.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-10-14 09:54:39 -04:00
Christoph Hellwig
722c55d13e hfsplus: remove superflous rootflags field in hfsplus_inode_info
The rootflags field in hfsplus_inode_info only caches the immutable and
append-only flags in the VFS inode, so we can easily get rid of it.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-10-14 09:54:33 -04:00
Christoph Hellwig
f6089ff87d hfsplus: fix link corruption
HFS implements hardlink by using indirect catalog entries that refer to a hidden
directly.  The link target is cached in the dev field in the HFS+ specific
inode, which is also used for the device number for device files, and inside
for passing the nlink value of the indirect node from hfsplus_cat_write_inode
to a helper function.  Now if we happen to write out the indirect node while
hfsplus_link is creating the catalog entry we'll get a link pointing to the
linkid of the current nlink value.  This can easily be reproduced by a large
enough loop of local git-clone operations.

Stop abusing the dev field in the HFS+ inode for short term storage by
refactoring the way the permission structure in the catalog entry is
set up, and rename the dev field to linkid to avoid any confusion.

While we're at it also prevent creating hard links to special files, as
the HFS+ dev and linkid share the same space in the on-disk structure.

Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-10-14 09:54:28 -04:00
Christoph Hellwig
13571a6977 hfsplus: validate btree flags
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-10-14 09:54:23 -04:00
Eric Sandeen
9250f92597 hfsplus: handle more on-disk corruptions without oopsing
hfs seems prone to bad things when it encounters on disk corruption.  Many
values are read from disk, and used as lengths to memcpy, as an example.
This patch fixes up several of these problematic cases.

o sanity check the on-disk maximum key lengths on mount
  (these are set to a defined value at mkfs time and shouldn't differ)
o check on-disk node keylens against the maximum key length for each tree
o fix hfs_btree_open so that going out via free_tree: doesn't wind
  up in hfs_releasepage, which wants to follow the very pointer
  we were trying to set up:
	HFS_SB(sb)->cat_tree = hfs_btree_open()
    .
  failure gets to hfs_releasepage and tries to follow HFS_SB(sb)->cat_tree

Tested with the fsfuzzer; it survives more than it used to.

[hch: ported of commit cf05946250 from hfs]
[hch: added the fixes from 5581d018ed3493d226e7a4d645d9c8a5af6c36b]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-10-14 09:53:48 -04:00
Al Viro
b6b41424f0 hfsplus: hfs_bnode_find() can fail, resulting in hfs_bnode_split() breakage
oops and fs corruption; the latter can happen even on valid fs in case of oom.

[hch: port of commit 3d10a15d69 from hfs]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-10-14 09:53:42 -04:00
Jeff Mahoney
ee52716245 hfsplus: fix oops on mount with corrupted btree extent records
A particular fsfuzzer run caused an hfs file system to crash on mount. This
is due to a corrupted MDB extent record causing a miscalculation of
HFSPLUS_I(inode)->first_blocks for the extent tree. If the extent records
are zereod out, then it won't trigger the first_blocks special case and
instead falls through to the extent code, which we're in the middle
of initializing.

This patch catches the 0 size extent records, reports the corruption,
and fails the mount.

[hch: ported of commit 47f365eb57 from hfs]

Reported-by: Ramon de Carvalho Valle <rcvalle@linux.vnet.ibm.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Christoph Hellwig <hch@tuxera.com>
2010-10-14 09:53:37 -04:00
Linus Torvalds
8c35bf368c Merge branch 'for-2.6.36' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.36' of git://linux-nfs.org/~bfields/linux:
  nfsd: fix BUG at fs/nfsd/nfsfh.h:199 on unlink
2010-10-13 16:51:29 -07:00
J. Bruce Fields
b1e86db1de nfsd: fix BUG at fs/nfsd/nfsfh.h:199 on unlink
As of commit 43a9aa64a2 "NFSD:
Fill in WCC data for REMOVE, RMDIR, MKNOD, and MKDIR", we sometimes call
fh_unlock on a filehandle that isn't fully initialized.

We should fix up the callers, but as a quick fix it is also sufficient
just to remove this assertion.

Reported-by: Marius Tolzmann <tolzmann@molgen.mpg.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-10-13 15:48:55 -04:00
Jeff Layton
d7c86ff8cd cifs: don't use vfsmount to pin superblock for oplock breaks
Filesystems aren't really supposed to do anything with a vfsmount. It's
considered a layering violation since vfsmounts are entirely managed at
the VFS layer.

CIFS currently keeps an active reference to a vfsmount in order to
prevent the superblock vanishing before an oplock break has completed.
What we really want to do instead is to keep sb->s_active high until the
oplock break has completed. This patch borrows the scheme that NFS uses
for handling sillyrenames.

An atomic_t is added to the cifs_sb_info. When it transitions from 0 to
1, an extra reference to the superblock is taken (by bumping the
s_active value). When it transitions from 1 to 0, that reference is
dropped and a the superblock teardown may proceed if there are no more
references to it.

Also, the vfsmount pointer is removed from cifsFileInfo and from
cifs_new_fileinfo, and some bogus forward declarations are removed from
cifsfs.h.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-12 18:08:01 +00:00
Jeff Layton
a5e18bc36e cifs: keep dentry reference in cifsFileInfo instead of inode reference
cifsFileInfo is a bit problematic. It contains a reference back to the
struct file itself. This makes it difficult for a cifsFileInfo to exist
without a corresponding struct file.

It would be better instead of the cifsFileInfo just held info pertaining
to the open file on the server instead without any back refrences to the
struct file. This would allow it to exist after the filp to which it was
originally attached was closed.

Much of the use of the file pointer in this struct is to get at the
dentry.  Begin divorcing the cifsFileInfo from the struct file by
keeping a reference to the dentry. Since the dentry will have a
reference to the inode, we can eliminate the "pInode" field too and
convert the igrab/iput to dget/dput.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-12 18:06:42 +00:00
Jeff Layton
1c456013e9 cifs: on multiuser mount, set ownership to current_fsuid/current_fsgid (try #7)
commit 3aa1c8c290 made cifs_getattr set
the ownership of files to current_fsuid/current_fsgid when multiuser
mounts were in use and when mnt_uid/mnt_gid were non-zero.

It should have instead based that decision on the
CIFS_MOUNT_OVERR_UID/GID flags.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-12 15:43:53 +00:00
Thomas Gleixner
756b0322e5 affs: Use sema_init instead of init_MUTEX
Get rid of init_MUTE() and use sema_init() instead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
LKML-Reference: <20100907125056.511395595@linutronix.de>
2010-10-12 17:39:25 +02:00
Thomas Gleixner
4a94103554 hfs: Convert tree_lock to mutex
tree_lock is used as mutex so make it a mutex.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
LKML-Reference: <20100907125056.416332114@linutronix.de>
2010-10-12 17:36:11 +02:00
Shirish Pargaonkar
9daa42e220 CIFS ntlm authentication and signing - Build a proper av/ti pair blob for ntlmv2 without extended security authentication
Build an av pair blob as part of ntlmv2 (without extended security) auth
request.  Include netbios and dns names for domain and server and
a timestamp in the blob.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-12 15:14:06 +00:00
Eric Paris
7c5347733d fanotify: disable fanotify syscalls
This patch disables the fanotify syscalls by just not building them and
letting the cond_syscall() statements in kernel/sys_ni.c redirect them
to sys_ni_syscall().

It was pointed out by Tvrtko Ursulin that the fanotify interface did not
include an explicit prioritization between groups.  This is necessary
for fanotify to be usable for hierarchical storage management software,
as they must get first access to the file, before inotify-like notifiers
see the file.

This feature can be added in an ABI compatible way in the next release
(by using a number of bits in the flags field to carry the info) but it
was suggested by Alan that maybe we should just hold off and do it in
the next cycle, likely with an (new) explicit argument to the syscall.
I don't like this approach best as I know people are already starting to
use the current interface, but Alan is all wise and noone on list backed
me up with just using what we have.  I feel this is needlessly ripping
the rug out from under people at the last minute, but if others think it
needs to be a new argument it might be the best way forward.

Three choices:
Go with what we got (and implement the new feature next cycle).  Add a
new field right now (and implement the new feature next cycle).  Wait
till next cycle to release the ABI (and implement the new feature next
cycle).  This is number 3.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-11 18:15:28 -07:00
J. Bruce Fields
ecec6e34e1 nfsd4: expire clients more promptly
Expire clients more promptly, at the expense of possibly running the
laundromat thread more frequently.

Though it's not the default, I'd like it to be feasible to run with a
lease time of just a few seconds, at which point a minimum 10 second
wait between laundromat runs seems a little much.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-10-11 20:00:18 -04:00
Tristan Ye
7bdb0d18bf ocfs2: Add a mount option "coherency=*" to handle cluster coherency for O_DIRECT writes.
Currently, the default behavior of O_DIRECT writes was allowing
concurrent writing among nodes to the same file, with no cluster
coherency guaranteed (no EX lock held).  This can leave stale data in
the cache for buffered reads on other nodes.

The new mount option introduce a chance to choose two different
behaviors for O_DIRECT writes:

    * coherency=full, as the default value, will disallow
                      concurrent O_DIRECT writes by taking
                      EX locks.

    * coherency=buffered, allow concurrent O_DIRECT writes
                          without EX lock among nodes, which
                          gains high performance at risk of
                          getting stale data on other nodes.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-10-11 14:14:55 -07:00
Goldwyn Rodrigues
75d9bbc738 Initialize max_slots early
Functions such as ocfs2_recovery_init() make use of osb->max_slots.
Initialize osb->max_slots early so the functions may use the correct
value.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-10-11 13:56:32 -07:00
Poyo VL
f30d44f3e5 When I tried to compile I got the following warning:
fs/ocfs2/slot_map.c: In function ‘ocfs2_init_slot_info’:
fs/ocfs2/slot_map.c:360: warning: ‘bytes’ may be used uninitialized in this function
fs/ocfs2/slot_map.c:360: note: ‘bytes’ was declared here
Compiler: gcc version 4.4.3 (GCC) on Mandriva
I'm not sure why this warning occurs, I think compiler don't know that variable
"bytes" is initialized when it is sent by reference to
ocfs2_slot_map_physical_size and it throws that ugly warning.
However, a simple initialization of "bytes" variable with 0 will fix it.

Signed-off-by: Ionut Gabriel Popescu <poyo_vl@yahoo.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-10-11 13:45:52 -07:00
Srinivas Eeda
9b5cd10e4c ocfs2: validate bg_free_bits_count after update
This patch adds a safe check to ensure bg_free_bits_count doesn't exceed
bg_bits in a group descriptor. This is to avoid on disk corruption that was
seen recently.

debugfs: group <52803072>
       Group Chain: 179   Parent Inode: 11  Generation: 2959379682
       CRC32: 00000000   ECC: 0000
       ##   Block#            Total    Used     Free     Contig   Size
       0    52803072          32256    4294965350   34202    18207    4032
       ......

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-10-11 13:43:24 -07:00
Tejun Heo
6370a6ad3b workqueue: add and use WQ_MEM_RECLAIM flag
Add WQ_MEM_RECLAIM flag which currently maps to WQ_RESCUER, mark
WQ_RESCUER as internal and replace all external WQ_RESCUER usages to
WQ_MEM_RECLAIM.

This makes the API users express the intent of the workqueue instead
of indicating the internal mechanism used to guarantee forward
progress.  This is also to make it cleaner to add more semantics to
WQ_MEM_RECLAIM.  For example, if deemed necessary, memory reclaim
workqueues can be made highpri.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
2010-10-11 15:20:26 +02:00
Linus Torvalds
8dc54e49ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: update issue_seq on cap grant
  ceph: send cap release message early on failed revoke.
  ceph: Update max_len with minimum required size
  ceph: Fix return value of encode_fh function
  ceph: avoid null deref in osd request error path
  ceph: fix list_add usage on unsafe_writes list
2010-10-09 12:03:46 -07:00
Linus Torvalds
267aeb6c14 Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  exofs: Fix double page_unlock BUG in write_begin/end
2010-10-09 12:03:23 -07:00
Sunil Mushran
4d94aa1b1d ocfs2/cluster: Bump up dlm protocol to version 1.1
dlm protocol 1.1. activates messages DLM_QUERY_REGION and DLM_QUERY_NODEINFO
that are a must for global heartbeat.

It also activates o2hb_global_heartbeat_active().

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-09 10:27:04 -07:00
Jeff Layton
0dd12c2195 cifs: initialize tlink_tree_lock and tlink_tree
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-08 16:34:49 +00:00
Boaz Harrosh
f17b1f9f1a exofs: Fix double page_unlock BUG in write_begin/end
This BUG is there since the first submit of the code, but only triggered
in last Kernel. It's timing related do to the asynchronous object-creation
behaviour of exofs. (Which should be investigated farther)

The bug is obvious hence the fixed.

Signed-off-by: Boaz Harrosh <Boaz Harrosh bharrosh@panasas.com>
2010-10-08 11:26:54 -04:00
Nicolas Pitre
e4eab08d60 ARM: 6342/1: fix ASLR of PIE executables
Since commits 990cb8acf2 and cc92c28b2d, it is possible to have full
address space layout randomization (ASLR) on ARM.  Except that one small
change was missing for ASLR of PIE executables.

Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-10-08 10:02:53 +01:00
Naoya Horiguchi
290408d4a2 hugetlb: hugepage migration core
This patch extends page migration code to support hugepage migration.
One of the potential users of this feature is soft offlining which
is triggered by memory corrected errors (added by the next patch.)

Todo:
- there are other users of page migration such as memory policy,
  memory hotplug and memocy compaction.
  They are not ready for hugepage support for now.

ChangeLog since v4:
- define migrate_huge_pages()
- remove changes on isolation/putback_lru_page()

ChangeLog since v2:
- refactor isolate/putback_lru_page() to handle hugepage
- add comment about race on unmap_and_move_huge_page()

ChangeLog since v1:
- divide migration code path for hugepage
- define routine checking migration swap entry for hugetlb
- replace "goto" with "if/else" in remove_migration_pte()

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:45 +02:00
Hidetoshi Seto
b8aeec3417 HWPOISON/signalfd: add support for addr_lsb
Similar change as to signal delivery: copy out the si_addr_lsb field
to user space in signalfd

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08 09:32:15 +02:00
Steve French
6ea75952d7 Merge branch 'for-next' 2010-10-08 03:42:03 +00:00
Steve French
d244555613 [CIFS] Remove build warning
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-08 03:38:46 +00:00
Jeff Layton
ccc46a7402 cifs: fix module refcount leak in find_domain_name
find_domain_name() uses load_nls_default which takes a module reference
on the appropriate NLS module, but doesn't put it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-08 03:33:08 +00:00
Jeff Layton
2de970ff69 cifs: implement recurring workqueue job to prune old tcons
Create a workqueue job that cleans out unused tlinks. For now, it uses
a hardcoded expire time of 10 minutes. When it's done, the work rearms
itself. On umount, the work is cancelled before tearing down the tlink
tree.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-08 03:31:21 +00:00
Jeff Layton
3aa1c8c290 cifs: on multiuser mount, set ownership to current_fsuid/current_fsgid (try #5)
...when unix extensions aren't enabled. This makes everything on the
mount appear to be owned by the current user.

This version of the patch differs from previous versions however in that
the admin can still force the ownership of all files to appear as a
single user via the uid=/gid= options.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-08 03:26:28 +00:00
Bryan Schumaker
955a857e06 NFS: new idmapper
This patch creates a new idmapper system that uses the request-key function to
place a call into userspace to map user and group ids to names.  The old
idmapper was single threaded, which prevented more than one request from running
at a single time.  This means that a user would have to wait for an upcall to
finish before accessing a cached result.

The upcall result is stored on a keyring of type id_resolver.  See the file
Documentation/filesystems/nfs/idmapper.txt for instructions.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
[Trond: fix up the return value of nfs_idmap_lookup_name and clean up code]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-10-07 18:48:49 -04:00
Linus Torvalds
5710c2b275 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: properly account for reclaimed inodes
2010-10-07 13:45:26 -07:00
Steve French
13cd4b7f74 [CIFS] Various small checkpatch cleanups
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-07 18:46:32 +00:00
Jeff Layton
0eb8a132c4 cifs: add "multiuser" mount option
This allows someone to declare a mount as a multiuser mount.

Multiuser mounts also imply "noperm" since we want to allow the server
to handle permission checking. It also (for now) requires Kerberos
authentication. Eventually, we could expand this to other authtypes, but
that requires a scheme to allow per-user credential stashing in some
form.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-07 18:30:45 +00:00
Jeff Layton
9d002df492 cifs: add routines to build sessions and tcons on the fly
This patch is rather large, but it's a bit difficult to do piecemeal...

For non-multiuser mounts, everything will basically work as it does
today. A call to cifs_sb_tlink will return the "master" tcon link.

Turn the tcon pointer in the cifs_sb into a radix tree that uses the
fsuid of the process as a key. The value is a new "tcon_link" struct
that contains info about a tcon that's under construction.

When a new process needs a tcon, it'll call cifs_sb_tcon. That will
then look up the tcon_link in the radix tree. If it exists and is
valid, it's returned.

If it doesn't exist, then we stuff a new tcon_link into the tree and
mark it as pending and then go and try to build the session/tcon.
If that works, the tcon pointer in the tcon_link is updated and the
pending flag is cleared.

If the construction fails, then we set the tcon pointer to an ERR_PTR
and clear the pending flag.

If the radix tree is searched and the tcon_link is marked pending
then we go to sleep and wait for the pending flag to be cleared.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-07 18:18:00 +00:00
Sage Weil
d91f2438d8 ceph: update issue_seq on cap grant
We need to update the issue_seq on any grant operation, be it via an MDS
reply or a separate grant message.  The update in the grant path was
missing.  This broke cap release for inodes in which the MDS sent an
explicit grant message that was not soon after followed by a successful
MDS reply on the same inode.

Also fix the signedness on seq locals.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:01:50 -07:00
Greg Farnum
21b559de56 ceph: send cap release message early on failed revoke.
If an MDS tries to revoke caps that we don't have, we want to send
releases early since they probably contain the caps message the MDS
is looking for.

Previously, we only sent the messages if we didn't have the inode either. But
in a multi-mds system we can retain the inode after dropping all caps for
a single MDS.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:24 -07:00
Aneesh Kumar K.V
bba0cd0e3d ceph: Update max_len with minimum required size
encode_fh on error should update max_len with minimum required
size, so that caller can redo the call with the reallocated buffer.
This is required with open by handle patch series

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:24 -07:00
Aneesh Kumar K.V
92923dcbfc ceph: Fix return value of encode_fh function
encode_fh function should return 255 on error as done by other file
system to indicate EOVERFLOW. Also max_len is in sizeof(u32) units
and not in bytes.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:23 -07:00
Sage Weil
6bc18876ba ceph: avoid null deref in osd request error path
If we interrupt an osd request, we call __cancel_request, but it wasn't
verifying that req->r_osd was non-NULL before dereferencing it.  This could
cause a crash if osds were flapping and we aborted a request on said osd.

Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:23 -07:00
Henry C Chang
936aeb5c4a ceph: fix list_add usage on unsafe_writes list
Fix argument order.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-07 08:00:23 -07:00
Johannes Weiner
081003fff4 xfs: properly account for reclaimed inodes
When marking an inode reclaimable, a per-AG counter is increased, the
inode is tagged reclaimable in its per-AG tree, and, when this is the
first reclaimable inode in the AG, the AG entry in the per-mount tree
is also tagged.

When an inode is finally reclaimed, however, it is only deleted from
the per-AG tree.  Neither the counter is decreased, nor is the parent
tree's AG entry untagged properly.

Since the tags in the per-mount tree are not cleared, the inode
shrinker iterates over all AGs that have had reclaimable inodes at one
point in time.

The counters on the other hand signal an increasing amount of slab
objects to reclaim.  Since "70e60ce xfs: convert inode shrinker to
per-filesystem context" this is not a real issue anymore because the
shrinker bails out after one iteration.

But the problem was observable on a machine running v2.6.34, where the
reclaimable work increased and each process going into direct reclaim
eventually got stuck on the xfs inode shrinking path, trying to scan
several million objects.

Fix this by properly unwinding the reclaimable-state tracking of an
inode when it is reclaimed.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@kernel.org
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-06 22:35:48 -05:00
Sunil Mushran
43695d095d ocfs2/cluster: Show per region heartbeat elapsed time
This patch adds a per region debugfs file that shows the elapsed time
since the time the o2hb timer was last armed.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-06 17:55:09 -07:00
Sunil Mushran
d6aa1c7c9e ocfs2/cluster: Add mlogs for heartbeat up/down events
This patch adds mlogs for o2hb up and down events.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-06 18:50:50 -07:00
Sunil Mushran
1f28530537 ocfs2/cluster: Create debugfs dir/files for each region
This patch creates debugfs directory for each o2hb region and creates
files to expose the region number and the per region live node bitmap.
This information will be useful in debugging cluster issues.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-06 17:55:12 -07:00
Sunil Mushran
a6de013654 ocfs2/cluster: Create debugfs files for live, quorum and failed region bitmaps
This patch prints the bitmaps of live, quorum and failed regions. This
information will be useful in debugging cluster issues.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-06 17:55:13 -07:00
Sunil Mushran
b1c5ebfbe3 ocfs2/cluster: Maintain bitmap of failed regions
In global heartbeat mode, we track the bitmap of regions that have seen
heartbeat timeouts. We fence if the number of such regions is greater than
or equal to half the number of quorum regions.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-07 17:05:52 -07:00
Sunil Mushran
43182d2a79 ocfs2/cluster: Maintain bitmap of quorum regions
o2hb allows online adding of regions. However, a newly added region is not
used in quorum calculations unless it has been added on all nodes. This patch
tracks a bitmap of such quorum regions.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-06 17:55:16 -07:00
Sunil Mushran
e7d656baf6 ocfs2/cluster: Track bitmap of live heartbeat regions
A heartbeat region becomes live (or active) after a fixed number of (steady)
iterations.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-06 17:55:18 -07:00
Sunil Mushran
536f0741f3 ocfs2/cluster: Track number of global heartbeat regions
In global heartbeat mode, we have a upper limit for the number of active regions.
This patch adds the facility to track the number of active global heartbeat
regions and fails to start heartbeat if the number exceeds the maximum.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-07 17:03:07 -07:00
Sunil Mushran
823a637ae9 ocfs2/cluster: Maintain live node bitmap per heartbeat region
Currently we track a global livenode bitmap that keeps track of all nodes
that are heartbeating in all regions.

This patch adds the ability to track the livenode bitmap on a per region basis.
We will use this facility in a later patch to allow us to withstand the loss of
a minority number of regions.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-06 17:55:21 -07:00
Sunil Mushran
8ca8b0bbd8 ocfs2/cluster: Reorganize o2hb debugfs init
o2hb debugfs handling is reorganized to allow for easy expansion.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-07 17:01:27 -07:00
Sunil Mushran
0e105d37c2 ocfs2/cluster: Check slots for unconfigured live nodes
o2hb currently checks slots for configured nodes only. This patch makes
it check the slots for the live nodes too to take care of a race in which
a node is removed from the configuration but not from the live map.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-07 17:00:16 -07:00
Sunil Mushran
39a298563e ocfs2/cluster: Print messages when adding/removing nodes
Prints messages when the user adds or removes nodes.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-07 17:30:17 -07:00
Sunil Mushran
18c50cb0d3 ocfs2/cluster: Print messages when adding/removing heartbeat regions
Prints messages when the user adds or removes heartbeat regions in global
heartbeat mode. These messages are useful when debugging cluster related issues.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-06 18:26:59 -07:00
Sunil Mushran
18cfdf1b1a ocfs2/dlm: Add message DLM_QUERY_NODEINFO
Adds new dlm message DLM_QUERY_NODEINFO that sends the attributes of all
registered nodes. This message is sent if the negotiated dlm protocol is
1.1 or higher. If the information of the joining node does not match
that of any existing nodes, the join domain request is rejected.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-07 16:47:03 -07:00
Sunil Mushran
5f3c6d9c61 ocfs2: Print message if user mounts without starting global heartbeat
In global heartbeat mode, the heartbeat is started by the user. This patch
prints an error if the user attempts to mount a volume without starting the
heartbeat.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-06 17:55:29 -07:00
Sunil Mushran
ea2034416b ocfs2/dlm: Add message DLM_QUERY_REGION
Adds new dlm message DLM_QUERY_REGION that sends the names of all active
heartbeat regions. This message is only sent in the global heartbeat
mode. If the regions in the joining node do not fully match the ones in
the active nodes, the join domain request is rejected.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-09 10:26:23 -07:00
Sunil Mushran
b3c85c4cdf ocfs2/cluster: Get all heartbeat regions
Export function in o2hb to get a list of heartbeat regions. It also adds an
upper limit to the length of the heartbeat region name.

o2hb_global_heartbeat_active() currently disables global heartbeat. It will
be enabled in a later patch after all the code is added.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-07 14:31:06 -07:00
Sunil Mushran
b1365d0bd1 ocfs2/dlm: Expose dlm_protocol in dlm_state
Add dlm_protocol to the list of info shown by the debugfs file, dlm_state.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-06 17:55:34 -07:00
Sunil Mushran
2c442719e9 ocfs2: Add support for heartbeat=global mount option
Adds support for heartbeat=global mount option. It ensures that the heartbeat
mode passed matches the one enabled on disk.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-07 15:23:50 -07:00
Sunil Mushran
98f486f23b ocfs2: Add an incompat feature flag OCFS2_FEATURE_INCOMPAT_CLUSTERINFO
OCFS2_FEATURE_INCOMPAT_CLUSTERINFO allows us to use sb->s_cluster_info for
both userspace and o2cb cluster stacks. It also allows us to extend cluster
info to include stack flags.

This patch also adds stackflags to sb->s_clusterinfo. It also introduces a
clusterinfo flag OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT to denote the enabled
global heartbeat mode.

This incompat flag can be set/cleared using tunefs.ocfs2 --fs-features. The
clusterinfo flag is set/cleared using tunefs.ocfs2 --update-cluster-stack.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-09 10:24:46 -07:00
Sunil Mushran
54b5187b5a ocfs2/cluster: Add heartbeat mode configfs parameter
Add heartbeat mode parameter to the configfs tree. This will be used
to set/show the heartbeat mode. The user is free to toggle the mode
between local and global as long as there is no active heartbeat region.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
2010-10-07 15:26:08 -07:00
Linus Torvalds
089eed29b4 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  writeback: always use sb->s_bdi for writeback purposes
2010-10-06 11:11:18 -07:00
Linus Torvalds
8fe9793af0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: Initialize total_len in fuse_retrieve()
2010-10-06 09:50:41 -07:00
Shirish Pargaonkar
c9928f7040 ntlm authentication and signing - Correct response length for ntlmv2 authentication without extended security
Fix incorrect calculation of case sensitive response length in the
ntlmv2 (without extended security) response.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-06 16:13:19 +00:00
Jeff Layton
29e07c82a9 cifs: fix cifs_show_options to show "username=" or "multiuser"
...based on CIFS_MOUNT_MULTIUSER flag.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-06 16:13:11 +00:00
Jeff Layton
6508d904e6 cifs: have find_readable/writable_file filter by fsuid
When we implement multiuser mounts, we'll need to filter filehandles
by fsuid. Add a flag for multiuser mounts and code to filter by
fsuid when it's set.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-06 16:12:59 +00:00
Jeff Layton
13cfb7334e cifs: have cifsFileInfo hold a reference to a tlink rather than tcon pointer
cifsFileInfo needs a pointer to a tcon, but it doesn't currently hold a
reference to it. Change it to keep a pointer to a tcon_link instead and
hold a reference to it.

That will keep the tcon from being freed until the file is closed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-06 16:12:49 +00:00
Jeff Layton
7ffec37245 cifs: add refcounted and timestamped container for holding tcons
Eventually, we'll need to track the use of tcons on a per-sb basis, so that
we know when it's ok to tear them down. Begin this conversion by adding a
new "tcon_link" struct and accessors that get it. For now, the core data
structures are untouched -- cifs_sb still just points to a single tcon and
the pointers are just cast to deal with the accessor functions. A later
patch will flesh this out.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-06 16:12:44 +00:00
Steven Whitehouse
134669854e GFS2: Fix type mapping for demote_rq interface
Mostly the glock operations follow the type of the glock. The
one exception is the transaction glock, so we need to check for
that directly.

Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-10-06 09:58:44 +01:00
Felipe Contreras
5a44a73b90 autofs4: Only declare function when CONFIG_COMPAT is defined
The patch solves the following warnings message when CONFIG_COMPAT
is not defined:

fs/autofs4/root.c:31: warning: ‘autofs4_root_compat_ioctl’ declared ‘static’ but never defined

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
Cc: Ian Kent <raven@themaw.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-05 11:02:15 +02:00
Márton Németh
26daad8067 autofs: Only declare function when CONFIG_COMPAT is defined
The patch solves the following warnings message when CONFIG_COMPAT
is not defined:

fs/autofs/root.c:30: warning: ‘autofs_root_compat_ioctl’ declared ‘static’ but never defined

Signed-off-by: Márton Németh <nm127@freemail.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-05 11:02:15 +02:00
Petr Vandrovec
2a4df5d332 ncpfs: Lock socket in ncpfs while setting its callbacks
Otherwise partially updated pointers could be seen if
pointer update is not atomic.

Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-05 11:02:14 +02:00
Arnd Bergmann
b89f432133 fs/locks.c: prepare for BKL removal
This prepares the removal of the big kernel lock from the
file locking code. We still use the BKL as long as fs/lockd
uses it and ceph might sleep, but we can flip the definition
to a private spinlock as soon as that's done.
All users outside of fs/lockd get converted to use
lock_flocks() instead of lock_kernel() where appropriate.

Based on an earlier patch to use a spinlock from Matthew
Wilcox, who has attempted this a few times before, the
earliest patch from over 10 years ago turned it into
a semaphore, which ended up being slower than the BKL
and was subsequently reverted.

Someone should do some serious performance testing when
this becomes a spinlock, since this has caused problems
before. Using a spinlock should be at least as good
as the BKL in theory, but who knows...

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
2010-10-05 11:02:04 +02:00
Petr Vandrovec
2e54eb96e2 BKL: Remove BKL from ncpfs
Dozen of changes in ncpfs to provide some locking other than BKL.

In readdir cache unlock and mark complete first page as last operation,
so it can be used for synchronization, as code intended.

When updating dentry name on case insensitive filesystems do at least
some basic locking...

Hold i_mutex when updating inode fields.

Push some ncp_conn_is_valid down to ncp_request.  Connection can become
invalid at any moment, and fewer error code paths to test the better.

Use i_size_{read,write} to modify file size.

Set inode's backing_dev_info as ncpfs has its own special bdi.

In ioctl unbreak ioctls invoked on filesystem mounted 'ro' - tests are
for inode writeable or owner match, but were turned to filesystem
writeable and inode writeable or owner match.  Also collect all permission
checks in single place.

Add some locking, and remove comments saying that it would be cool to
add some locks to the code.

Constify some pointers.

Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:52 +02:00
Arnd Bergmann
6005679412 BKL: Remove BKL from OCFS2
The BKL in ocfs2/dlmfs is used in put_super, fill_super and remount_fs
that are all three protected by the superblocks s_umount rw_semaphore.

The use in ocfs2_control_open is evidently unrelated and the function
is protected by ocfs2_control_lock.

Therefore it is safe to remove the BKL entirely.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
2010-10-04 21:10:51 +02:00
Arnd Bergmann
3dbc4b32d0 BKL: Remove BKL from squashfs
The BKL is only used in put_super and fill_super, which are both protected
by the superblocks s_umount rw_semaphore. Therefore it is safe to remove
the BKL entirely.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Phillip Lougher <phillip@lougher.demon.co.uk>
2010-10-04 21:10:50 +02:00
Arnd Bergmann
1a028dd2dd BKL: Remove BKL from jffs2
The BKL is only used in put_super, fill_super and remount_fs that are all
three protected by the superblocks s_umount rw_semaphore. Therefore it is
safe to remove the BKL entirely.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw2@infradead.org>
2010-10-04 21:10:50 +02:00
Arnd Bergmann
18dfe89d7c BKL: Remove BKL from ecryptfs
The BKL is only used in fill_super, which is protected by the superblocks
s_umount rw_semaphorei, and in fasync, which does not do anything that
could require the BKL. Therefore it is safe to remove the BKL entirely.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
2010-10-04 21:10:48 +02:00
Arnd Bergmann
77f2fe036c BKL: Remove BKL from afs
The BKL is only used in put_super and fill_super, which are both protected
by the superblocks s_umount rw_semaphore. Therefore it is safe to remove
the BKL entirely.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: linux-afs@lists.infradead.org
Cc: David Howells <dhowells@redhat.com>
2010-10-04 21:10:48 +02:00
Arnd Bergmann
00e300e1b6 BKL: Remove BKL from autofs4
autofs4 uses the BKL only to guard its ioctl operations.
This can be trivially converted to use a mutex, as we have
done with most device drivers before.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ian Kent <raven@themaw.net>
2010-10-04 21:10:46 +02:00
Arnd Bergmann
4f819a7899 BKL: Remove BKL from isofs
As in other file systems, we can replace the big kernel lock
with a private mutex in isofs. This means we can now access
multiple file systems concurrently, but it also means that
we serialize readdir and lookup across sleeping operations
which previously released the big kernel lock. This should
not matter though, as these operations are in practice
serialized through the hardware access.

The isofs_get_blocks functions now does not take any lock
any more, it used to recursively get the BKL. After looking
at the code for hours, I convinced myself that it was never
needed here anyway, because it only reads constant fields
of the inode and writes to a buffer head array that is
at this time only visible to the caller.

The get_sb and fill_super operations do not need the locking
at all because they operate on a file system that is either
about to be created or to be destroyed but in either case
is not visible to other threads.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:45 +02:00
Arnd Bergmann
3768744cfe BKL: Remove BKL from fat
The lock_kernel in fat_put_super is not needed because
it only protects the super block itself and we know that
no other thread can reach it because we are about to
kfree the object.

In the two fill_super functions, this converts the locking
to use lock_super like elsewhere in the fat code. This
is probably not needed either, but is consistent and puts
us on the safe side.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Jan Blunck <jblunck@infradead.org>
2010-10-04 21:10:45 +02:00
Jan Blunck
3e44f9f1dc BKL: Remove BKL from ext2 filesystem
The BKL is still used in ext2_put_super(), ext2_fill_super(), ext2_sync_fs()
ext2_remount() and ext2_write_inode(). From these calls ext2_put_super(),
ext2_fill_super() and ext2_remount() are protected against each other by
the struct super_block s_umount rw semaphore. The call in ext2_write_inode()
could only protect the modification of the ext2_sb_info through
ext2_update_dynamic_rev() against concurrent ext2_sync_fs() or ext2_remount().
ext2_fill_super() and ext2_put_super() can be left out because you need a
valid filesystem reference in all three cases, which you do not have when
you are one of these functions.

If the BKL is only protecting the modification of the ext2_sb_info it can
safely be removed since this is protected by the struct ext2_sb_info s_lock.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:44 +02:00
Jan Blunck
6841c05021 BKL: Remove BKL from do_new_mount()
After pushing down the BKL to the get_sb/fill_super operations of the
filesystems that still make usage of the BKL it is safe to remove it from
do_new_mount().

I've read through all the code formerly covered by the BKL inside
do_kern_mount() and have satisfied myself that it doesn't need the BKL
any more.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:43 +02:00
Jan Blunck
efdffb5403 BKL: Remove BKL from NTFS
The BKL is only used in put_super, fill_super and remount_fs that are all
three protected by the superblocks s_umount rw_semaphore. Therefore it is
safe to remove the BKL entirely.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:41 +02:00
Jan Blunck
d6d4c19c5f BKL: Remove BKL from NILFS2
The BKL is only used in put_super, fill_super and remount_fs that are all
three protected by the superblocks s_umount rw_semaphore. Therefore it is
safe to remove the BKL entirely.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:41 +02:00
Jan Blunck
22b26db6f8 BKL: Remove BKL from JFS
The BKL is only used in put_super, fill_super and remount_fs that are all
three protected by the superblocks s_umount rw_semaphore. Therefore it is
safe to remove the BKL entirely.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:40 +02:00
Jan Blunck
8526fb37c9 BKL: Remove BKL from HFS
The BKL is only used in put_super and fill_super that are both protected by
the superblocks s_umount rw_semaphore. Therefore it is safe to remove the
BKL entirely.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:39 +02:00
Jan Blunck
f2143c4e2e BKL: Remove BKL from ext4 filesystem
The BKL is still used in ext4_put_super(), ext4_fill_super() and
ext4_remount(). All three calles are protected against concurrent calls by
the s_umount rw semaphore of struct super_block.

Therefore the BKL is protecting nothing in this case.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:38 +02:00
Jan Blunck
77b54a46a8 BKL: Remove BKL from ext3_put_super() and ext3_remount()
The BKL lock is protecting the remounting against a potential call to
ext3_put_super(). This could not happen, since this is protected by the
s_umount rw semaphore of struct super_block.

Therefore I think the BKL is protecting nothing here.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:37 +02:00
Jan Blunck
d646cf82e9 BKL: Remove BKL from ext3 fill_super()
The BKL is protecting nothing than two memory allocations here.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:36 +02:00
Jan Blunck
b0991aa324 BKL: Remove BKL from CifsFS
The BKL is only used in put_super and fill_super that are both protected by
the superblocks s_umount rw_semaphore. Therefore it is safe to remove the
BKL entirely.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:36 +02:00
Jan Blunck
ba13d597a6 BKL: Remove BKL from BFS
The BKL is only used in put_super and fill_super that are both protected by
the superblocks s_umount rw_semaphore. Therefore it is safe to remove the BKL
entirely.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:35 +02:00
Jan Blunck
74c41429ae BKL: Remove BKL from Amiga FFS
The BKL is only used in put_super, fill_super and remount_fs that are all
three protected by the superblocks s_umount rw_semaphore. Therefore it is
safe to remove the BKL entirely.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:34 +02:00
Jan Blunck
db71922217 BKL: Explicitly add BKL around get_sb/fill_super
This patch is a preparation necessary to remove the BKL from do_new_mount().
It explicitly adds calls to lock_kernel()/unlock_kernel() around
get_sb/fill_super operations for filesystems that still uses the BKL.

I've read through all the code formerly covered by the BKL inside
do_kern_mount() and have satisfied myself that it doesn't need the BKL
any more.

do_kern_mount() is already called without the BKL when mounting the rootfs
and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
afs_mntpt_follow_link(). Both later functions are actually the filesystems
follow_link inode operation. vfs_kern_mount() is calling the specified
get_sb function and lets the filesystem do its job by calling the given
fill_super function.

Therefore I think it is safe to push down the BKL from the VFS to the
low-level filesystems get_sb/fill_super operation.

[arnd: do not add the BKL to those file systems that already
       don't use it elsewhere]

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Christoph Hellwig <hch@infradead.org>
2010-10-04 21:10:10 +02:00
Christoph Hellwig
aaead25b95 writeback: always use sb->s_bdi for writeback purposes
We currently use struct backing_dev_info for various different purposes.
Originally it was introduced to describe a backing device which includes
an unplug and congestion function and various bits of readahead information
and VM-relevant flags.  We're also using for tracking dirty inodes for
writeback.

To make writeback properly find all inodes we need to only access the
per-filesystem backing_device pointed to by the superblock in ->s_bdi
inside the writeback code, and not the instances pointeded to by
inode->i_mapping->backing_dev which can be overriden by special devices
or might not be set at all by some filesystems.

Long term we should split out the writeback-relevant bits of struct
backing_device_info (which includes more than the current bdi_writeback)
and only point to it from the superblock while leaving the traditional
backing device as a separate structure that can be overriden by devices.

The one exception for now is the block device filesystem which really
wants different writeback contexts for it's different (internal) inodes
to handle the writeout more efficiently.  For now we do this with
a hack in fs-writeback.c because we're so late in the cycle, but in
the future I plan to replace this with a superblock method that allows
for multiple writeback contexts per filesystem.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-04 14:25:33 +02:00
Geert Uytterhoeven
0157443c56 fuse: Initialize total_len in fuse_retrieve()
fs/fuse/dev.c:1357: warning: ‘total_len’ may be used uninitialized in this
function

Initialize total_len to zero, else its value will be undefined.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-10-04 10:45:32 +02:00
J. Bruce Fields
3351514215 nfsd4: return expired on unfound stateid's
Commit 78155ed75f "nfsd4: distinguish
expired from stale stateids" attempted to distinguish expired and stale
stateid's using time information that may not have been completely
reliable, so I reverted it.

That was throwing out the baby with the bathwater; we still do want to
return expired, but let's do that using the simpler approach of just
assuming any stateid is expired if it looks like it was given out by the
current server instance, but we can't find it any more.

This may help clients that are recovering from network partitions.

Reported-by: Bian Naimeng <biannm@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-10-02 18:49:33 -04:00
J. Bruce Fields
328ead2872 nfsd4: add new connections to session
As long as we're not implementing any session security, we should just
automatically add any new connections that come along to the list of
sessions associated with the session.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-10-01 19:29:45 -04:00
J. Bruce Fields
db90681d6e nfsd4: refactor connection allocation
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-10-01 19:29:45 -04:00
J. Bruce Fields
19cf5c026f nfsd4: use callbacks on svc_xprt_deletion
Remove connections from the list when they go down.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-10-01 19:29:44 -04:00
J. Bruce Fields
c7662518c7 nfsd4: keep per-session list of connections
The spec requires us in various places to keep track of the connections
associated with each session.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-10-01 19:29:44 -04:00
J. Bruce Fields
5b6feee960 nfsd4: clean up session allocation
Changes:
	- make sure session memory reservation is released on failure
	  path.
	- use min_t()/min() for more compact code in several places.
	- break alloc_init_session into smaller pieces.
	- miscellaneous other cleanup.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-10-01 19:29:44 -04:00
J. Bruce Fields
dd93842457 nfsd4: fix alloc_init_session return type
This returns an nfs error, not -ERRNO.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-10-01 19:29:44 -04:00
J. Bruce Fields
c23753dac1 nfsd4: fix alloc_init_session BUILD_BUG_ON()
Note we're allocating an array of nfsd4_slot *'s, not nfsd4_slot's.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-10-01 19:29:44 -04:00
J. Bruce Fields
6ff8da0887 nfsd4: Move callback setup to callback queue
Instead of creating the new rpc client from a regular server thread,
set a flag, kick off a null call, and allow the null call to do the work
of setting up the client on the callback workqueue.

Use a spinlock to ensure the callback work gets a consistent view of the
callback parameters.

This allows, for example, changing the callback from contexts where
sleeping is not allowed.  I hope it will also keep the locking simple as
we add more session and trunking features, by serializing most of the
callback-specific work.

This also closes a small race where the the new cb_ident could be used
with an old connection (or vice-versa).

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-10-01 19:29:44 -04:00
J. Bruce Fields
fb00392326 nfsd4: remove separate cb_args struct
I don't see the point of the separate struct.  It seems to just be
getting in the way.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-10-01 19:29:43 -04:00
J. Bruce Fields
cee277d924 nfsd4: use generic callback code in null case
This will eventually allow us, for example, to kick off null callback
from contexts where we can't sleep.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-10-01 19:29:43 -04:00
J. Bruce Fields
5878453dbd nfsd4: generic callback code
Make the recall callback code more generic, so that other callbacks
will be able to use it too.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-10-01 19:29:43 -04:00
J. Bruce Fields
1c8556026e nfsd4: rename nfs4_rpc_args->nfsd4_cb_args
With apologies for the gratuitous rename, the new name seems more
helpful to me.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-10-01 19:29:43 -04:00
J. Bruce Fields
586f36735e nfsd4: combine nfs4_rpc_args and nfsd4_cb_sequence
These two structs don't really need to be distinct as far as I can tell.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-10-01 19:29:43 -04:00
J. Bruce Fields
07263f1efe nfsd4: minor variable renaming (cb -> conn)
Now that we have both nfsd4_callback and nfsd4_cb_conn structures, I get
confused if variables of both types are always named cb....

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-10-01 19:29:43 -04:00