Commit Graph

36891 Commits

Author SHA1 Message Date
NeilBrown
4fa2c54b51 NFS: nfs4_do_open should add negative results to the dcache.
If you have an NFSv4 mounted directory which does not container 'foo'
and:

  ls -l foo
  ssh $server touch foo
  cat foo

then the 'cat' will fail (usually, depending a bit on the various
cache ages).  This is correct as negative looks are cached by default.
However with the same initial conditions:

  cat foo
  ssh $server touch foo
  cat foo

will usually succeed.  This is because an "open" does not add a
negative dentry to the dcache, while a "lookup" does.

This can have negative performance effects.  When "gcc" searches for
an include file, it will try to "open" the file in every director in
the search path.  Without caching of negative "open" results, this
generates much more traffic to the server than it should (or than
NFSv3 does).

The root of the problem is that _nfs4_open_and_get_state() will call
d_add_unique() on a positive result, but not on a negative result.
Compare with nfs_lookup() which calls d_materialise_unique on both
a positive result and on ENOENT.

This patch adds a call d_add() in the ENOENT case for
_nfs4_open_and_get_state() and also calls nfs_set_verifier().

With it, many fewer "open" requests for known-non-existent files are
sent to the server.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:05:22 -04:00
Andrey Utkin
7a9e75a185 nfs3_list_one_acl(): check get_acl() result with IS_ERR_OR_NULL
There was a check for result being not NULL. But get_acl() may return
NULL, or ERR_PTR, or actual pointer.
The purpose of the function where current change is done is to "list
ACLs only when they are available", so any error condition of get_acl()
mustn't be elevated, and returning 0 there is still valid.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81111
Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 74adf83f5d (nfs: only show Posix ACLs in listxattr if actually...)
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:05:22 -04:00
Trond Myklebust
9806755c56 Merge branch 'nfs-rdma' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next
* 'nfs-rdma' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (916 commits)
  xprtrdma: Handle additional connection events
  xprtrdma: Remove RPCRDMA_PERSISTENT_REGISTRATION macro
  xprtrdma: Make rpcrdma_ep_disconnect() return void
  xprtrdma: Schedule reply tasklet once per upcall
  xprtrdma: Allocate each struct rpcrdma_mw separately
  xprtrdma: Rename frmr_wr
  xprtrdma: Disable completions for LOCAL_INV Work Requests
  xprtrdma: Disable completions for FAST_REG_MR Work Requests
  xprtrdma: Don't post a LOCAL_INV in rpcrdma_register_frmr_external()
  xprtrdma: Reset FRMRs after a flushed LOCAL_INV Work Request
  xprtrdma: Reset FRMRs when FAST_REG_MR is flushed by a disconnect
  xprtrdma: Properly handle exhaustion of the rb_mws list
  xprtrdma: Chain together all MWs in same buffer pool
  xprtrdma: Back off rkey when FAST_REG_MR fails
  xprtrdma: Unclutter struct rpcrdma_mr_seg
  xprtrdma: Don't invalidate FRMRs if registration fails
  xprtrdma: On disconnect, don't ignore pending CQEs
  xprtrdma: Update rkeys after transport reconnect
  xprtrdma: Limit data payload size for ALLPHYSICAL
  xprtrdma: Protect ia->ri_id when unmapping/invalidating MRs
  ...
2014-08-03 17:04:51 -04:00
Trond Myklebust
3a505845cd NFS: Enforce an upper limit on the number of cached access call
This may be used to limit the number of cached credentials building up
inside the access cache.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:03:22 -04:00
Linus Torvalds
da83fc6e0f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We have two more fixes in my for-linus branch.

  I was hoping to also include a fix for a btrfs deadlock with
  compression enabled, but we're still nailing that one down"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: test for valid bdev before kobj removal in btrfs_rm_device
  Btrfs: fix abnormal long waiting in fsync
2014-07-20 20:21:05 -07:00
Linus Torvalds
90d51d5606 NFS client fixes for Linux 3.16
Highlights include;
 - Stable fix for an NFSv3 posix ACL regression
 - Multiple fixes for regressions to the NFS generic read/write code
   - Fix page splitting bugs that come into play when a small rsize/wsize
     read/write needs to be sent again (due to error conditions or page
     redirty).
   - Fix nfs_wb_page_cancel, which is called by the "invalidatepage" method
 - Fix 2 compile warnings about unused variables.
 - Fix a performance issue affecting unstable writes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTys5JAAoJEGcL54qWCgDygakP/1JOfQZzyHNRml5aDVRLtfp/
 QQpn8ne3jjEov+0BhzSxqkpHP+8fCcF0UgD5asEnbM83ruoB8EUHErXdq7kRkAYf
 COpDZAww4sPXUszG/xGqIED503DKbf69Ds5pQMh8g71hfQpw04UmMQYbnRNBP2bI
 Z5eu0Xiwaf3MVRaxHMXVy9sl7in/cQBvrXnKLTIYOLA0U5bdCI9JWT5+qbOUCC/3
 YuYm5EjM+OMyhvEWyVDXFp9kmw0vtBS8FwAXYKBjwfJLNl8dGuERWKlFDPNOxdpZ
 QrOBhUH73d2tgvTHzUg/RRtpA6mKOfznU3SQK3muP188/2/sbb4y/RChwv+Ignat
 YqWFgbDTz1idKvjj+Vzcd7eL3HO9Kono/YkAG8i5xmOlSeenzDra1lDAbuB+zNzd
 oLVY1AJp8+13c74gaCDurxJ7fq6Fth97eqxdo8fdQH8Mn6m6qRm2V57rOo59eFQS
 bX6EN4ja8ashrKprCvlhXDGJmQKZWJMEGoTXJCj5w+HBklONV6jdbyzGFKT+Zmhg
 IP+USsaLJDswZNpdWq8Zb9RthfFAURKEQEemwV5eLQNtTa+xQ22ZOF2ycrbBqsED
 etCCm8gyNukl2RCAxGfLrtpBytlcNMmcRKGktKqO1SAAua3IP33wGD7zql8rwGCZ
 m9XMXUkVKiHoErl3jZ4T
 =qklz
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.16-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Apologies for the relative lateness of this pull request, however the
  commits fix some issues with the NFS read/write code updates in
  3.16-rc1 that can cause serious Oopsing when using small r/wsize.  The
  delay was mainly due to extra testing to make sure that the fixes
  behave correctly.

  Highlights include;
   - Stable fix for an NFSv3 posix ACL regression
   - Multiple fixes for regressions to the NFS generic read/write code:
     - Fix page splitting bugs that come into play when a small
       rsize/wsize read/write needs to be sent again (due to error
       conditions or page redirty)
     - Fix nfs_wb_page_cancel, which is called by the "invalidatepage"
       method
   - Fix 2 compile warnings about unused variables
   - Fix a performance issue affecting unstable writes"

* tag 'nfs-for-3.16-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Don't reset pg_moreio in __nfs_pageio_add_request
  NFS: Remove 2 unused variables
  nfs: handle multiple reqs in nfs_wb_page_cancel
  nfs: handle multiple reqs in nfs_page_async_flush
  nfs: change find_request to find_head_request
  nfs: nfs_page should take a ref on the head req
  nfs: mark nfs_page reqs with flag for extra ref
  nfs: only show Posix ACLs in listxattr if actually present
2014-07-20 19:55:44 -07:00
Eric Sandeen
0bfaa9c5cb btrfs: test for valid bdev before kobj removal in btrfs_rm_device
commit 99994cd btrfs: dev delete should remove sysfs entry
added a btrfs_kobj_rm_device, which dereferences device->bdev...
right after we check whether device->bdev might be NULL.

I don't honestly know if it's possible to have a NULL device->bdev
here, but assuming that it is (given the test), we need to move
the kobject removal to be under that test.

(Coverity spotted this)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-19 11:49:44 -07:00
Liu Bo
98ce2deda2 Btrfs: fix abnormal long waiting in fsync
xfstests generic/127 detected this problem.

With commit 7fc34a62ca, now fsync will only flush
data within the passed range.  This is the cause of the above problem,
-- btrfs's fsync has a stage called 'sync log' which will wait for all the
ordered extents it've recorded to finish.

In xfstests/generic/127, with mixed operations such as truncate, fallocate,
punch hole, and mapwrite, we get some pre-allocated extents, and mapwrite will
mmap, and then msync.  And I find that msync will wait for quite a long time
(about 20s in my case), thanks to ftrace, it turns out that the previous
fallocate calls 'btrfs_wait_ordered_range()' to flush dirty pages, but as the
range of dirty pages may be larger than 'btrfs_wait_ordered_range()' wants,
there can be some ordered extents created but not getting corresponding pages
flushed, then they're left in memory until we fsync which runs into the
stage 'sync log', and fsync will just wait for the system writeback thread
to flush those pages and get ordered extents finished, so the latency is
inevitable.

This adds a flush similar to btrfs_start_ordered_extent() in
btrfs_wait_logged_extents() to fix that.

Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-19 11:49:44 -07:00
Linus Torvalds
f839719122 This patch set contains two minor docs/spelling fixes, some fixes for
flock, a change to use GFP_NOFS to avoid recursion on a rarely used
 code path and a fix for a race relating to the glock lru.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJTyPZQAAoJEMrg3m4a/8jSFBEQAKSnJQUP9MSxVwNBrgOiybXW
 kQd8RYs7cdt33i97C3Im9xSVktPz4HKTvuwHyvNV1oyWScfWSyqCgC//cU+/zlYV
 wJDZWIASNoQheY6UfxR6TeBPZo9Hgq7RQRGj4h1ttag9+b8Zz9aV5TCxcoh28ULF
 629TyNwg4xdiEKX2xZusDwGCoHn5f5l9pAa5MyPrcyPzn1lOJP1lz++Lci2nqC4g
 DvA/KzQzDLQ2lKXdSd95avwQxnHqmeCTvClPmK9GgONrt66tqq6CcCLB1jPRE7/O
 J7f0VWy/PEeo8ot+9siiA380EvM6hWvJx5Fuen/Qb9dQ5sgsJMkvgbqlHK6zB/i3
 Je6Qq+aVPz3qktmXdyEagpXfZAQAxy0PUWezQBQH8HIlhwKMGC1QaFgMoAFIks1Y
 S38IBHCwlymytWYdVaRhyUOnlzzaSyeYROzs7hZoxRRUilge5rPkrqtv4HWLSRtZ
 rGFEid181+qTO2TyoiMRY2oR3U0PHfbE9Dhv5Pu9caTl55kj9eAGwvqnOn6IpyvF
 eiUoWOnDYFO+8sxFKPYFndglEZx0zBU6B/7axyQ3qam3BojTJwKh+2+4TqauM0zo
 4ehwJEzVmV21sbyMfUHCKTQEkW8OjQ+EkxAEmGhp4IODNwZ3vPfFBdhFi3fBipqO
 WhDmeDmOddb9cCoQG8WZ
 =VTve
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes

Pull gfs2 fixes from Steven Whitehouse:
 "This patch set contains two minor docs/spelling fixes, some fixes for
  flock, a change to use GFP_NOFS to avoid recursion on a rarely used
  code path and a fix for a race relating to the glock lru"

* tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
  GFS2: fs/gfs2/rgrp.c: kernel-doc warning fixes
  GFS2: memcontrol: Spelling s/invlidate/invalidate/
  GFS2: Allow caching of glocks for flock
  GFS2: Allow flocks to use normal glock dq rather than dq_wait
  GFS2: replace count*size kzalloc by kcalloc
  GFS2: Use GFP_NOFS when allocating glocks
  GFS2: Fix race in glock lru glock disposal
  GFS2: Only wait for demote when last holder is dequeued
2014-07-18 06:26:04 -10:00
Linus Torvalds
847f56eb0e xfs: fixes for 3.15-rc5
Fixes for low memory perforamnce regressions and a quota inode handling
 regression.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJTyFtkAAoJEK3oKUf0dfodAioP/0nsIq3be9zch0/Jjf9aduON
 1aMaUE/9p4g/zu0f96ld3GI5/guRVllZ/0qJFyYAnAgOGs1hGARQfOuNnHBjgdLZ
 D261JbT9z8d8inQ7BMSc3EBBQ2CAZsAwRmg6UbeaWBE4hjlJ03RGWBE/aMMh10Wh
 t2fWUoeYWZWLi+Gfa+lPpnTESH7nBK5cW+daC16I1fU9Z/RcXAl+pCN2s5Ls6h7G
 1mQNfhAuII+/ydB7UWXL6SQ7/sDvDwNvedMLljwg6oYu/riSG2kYPZKW6O9qwL6T
 z9onPg5lEQMjWlbl6qzNo6OT3pAs37vssG0zVw2ZS8GjZ84edRzw2PGctJAFu2Hh
 sWqmtYGNKBjtjnxJ4zlRfeBkwpHbZGlLyOwzKoDlyQ8j9KZ8v8lsKyEpJK6/XJNG
 1rMJZV5twu+xvZUwf0zkg0tuxoT/3T3kbIHsFkEaJmQW7jrxTvdybW/rp6KHSbcb
 rzCpuZ5Ghh9qss+EeCv3k2nnWjysDP4kSwQMZ0zCedzvDTmga2TMw//MUwaM+i7M
 D7Raq4Qcs2updrFk2j9OyML2hi49KuPTtEu2OC7ObfxvBsZgSbTvyw1Vq/rsiDM5
 FZMV/giKRoCFpRpp7xF+db0zkBC2xDU9tGz196dzGtg7rvp6Z5401mS8fAr9H/LJ
 D2Wf2OXx3oss9v4rrO7N
 =rnN8
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.16-rc5' of git://oss.sgi.com/xfs/xfs

Pull xfs fixes from Dave Chinner:
 "Fixes for low memory perforamnce regressions and a quota inode
  handling regression.

  These are regression fixes for issues recently introduced - the change
  in the stack switch location is fairly important, so I've held off
  sending this update until I was sure that it still addresses the stack
  usage problem the original solved.  So while the commits in the xfs
  tree are recent, it has been under tested for several weeks now"

* tag 'xfs-for-linus-3.16-rc5' of git://oss.sgi.com/xfs/xfs:
  xfs: null unused quota inodes when quota is on
  xfs: refine the allocation stack switch
  Revert "xfs: block allocation work needs to be kswapd aware"
2014-07-18 06:21:43 -10:00
Fabian Frederick
27ff6a0f7f GFS2: fs/gfs2/rgrp.c: kernel-doc warning fixes
Cc: cluster-devel@redhat.com
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:15:14 +01:00
Geert Uytterhoeven
6b49d1d9c3 GFS2: memcontrol: Spelling s/invlidate/invalidate/
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: cluster-devel@redhat.com
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:14:31 +01:00
Bob Peterson
97a4f1d765 GFS2: Allow caching of glocks for flock
This patch removes the GLF_NOCACHE flag from the glocks associated with
flocks. There should be no good reason not to cache glocks for flocks:
they only force the glock to be demoted before they can be reacquired,
which can slow down performance and even cause glock hangs, especially
in cases where the flocks are held in Shared (SH) mode.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:14:12 +01:00
Bob Peterson
5bef3e7cf1 GFS2: Allow flocks to use normal glock dq rather than dq_wait
This patch allows flock glocks to use a non-blocking dequeue rather
than dq_wait. It also reverts the previous patch I had posted regarding
dq_wait. The reverted patch isn't necessarily a bad idea, but I decided
this might avoid unforeseen side effects, and was therefore safer.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:13:56 +01:00
Fabian Frederick
6ec43b1838 GFS2: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Cc: cluster-devel@redhat.com
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:13:38 +01:00
Steven Whitehouse
fe0bbd2986 GFS2: Use GFP_NOFS when allocating glocks
Normally GFP_KERNEL is ok here, but there is now a rarely used code path
relating to deallocation of unlinked inodes (in certain corner cases)
which if hit at times of memory shortage can cause recursion while
trying to free memory.

One solution would be to try and move the gfs2_glock_get() call so
that it is no longer called while another glock is held, but that
doesn't look at all easy, so GFP_NOFS is the best solution for the
time being.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:13:12 +01:00
Steven Whitehouse
94a09a3999 GFS2: Fix race in glock lru glock disposal
We must not leave items on the LRU list with GLF_LOCK set, since
they can be removed if the glock is brought back into use, which
may then potentially result in a hang, waiting for GLF_LOCK to
clear.

It doesn't happen very often, since it requires a glock that has
not been used for a long time to be brought back into use at the
same moment that the shrinker is part way through disposing of
glocks.

The fix is to set GLF_LOCK at a later time, when we already know
that the other locks can be obtained. Also, we now only release
the lru_lock in case a resched is needed, rather than on every
iteration.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:12:51 +01:00
Bob Peterson
79272b3562 GFS2: Only wait for demote when last holder is dequeued
Function gfs2_glock_dq_wait is supposed to dequeue a glock and then
wait for the lock to be demoted. The problem is, if this is a shared
lock, its demote will depend on the other holders, which means you
might end up waiting forever because the other process is blocked.
This problem is especially apparent when dealing with nested flocks.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:12:14 +01:00
Linus Torvalds
c20ddc6499 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota fix from Jan Kara:
 "Fix locking of dquot shrinker"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: missing lock in dqcache_shrink_scan()
2014-07-15 17:47:42 -10:00
Niu Yawei
d68aab6b8f quota: missing lock in dqcache_shrink_scan()
Commit 1ab6c4997e (fs: convert fs shrinkers to new scan/count API)
accidentally removed locking from quota shrinker. Fix it -
dqcache_shrink_scan() should use dq_list_lock to protect the
scan on free_dquots list.

CC: stable@vger.kernel.org
Fixes: 1ab6c4997e
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-07-15 22:36:18 +02:00
Linus Torvalds
0b632204c7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "This contains miscellaneous fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: replace count*size kzalloc by kcalloc
  fuse: release temporary page if fuse_writepage_locked() failed
  fuse: restructure ->rename2()
  fuse: avoid scheduling while atomic
  fuse: handle large user and group ID
  fuse: inode: drop cast
  fuse: ignore entry-timeout on LOOKUP_REVAL
  fuse: timeout comparison fix
2014-07-15 08:57:17 -07:00
Dave Chinner
03e01349c6 xfs: null unused quota inodes when quota is on
When quota is on, it is expected that unused quota inodes have a
value of NULLFSINO. The changes to support a separate project quota
in 3.12 broken this rule for non-project quota inode enabled
filesystem, as the code now refuses to write the group quota inode
if neither group or project quotas are enabled. This regression was
introduced by commit d892d58 ("xfs: Start using pquotaino from the
superblock").

In this case, we should be writing NULLFSINO rather than nothing to
ensure that we leave the group quota inode in a valid state while
quotas are enabled.

Failure to do so doesn't cause a current kernel to break - the
separate project quota inodes introduced translation code to always
treat a zero inode as NULLFSINO. This was introduced by commit
0102629 ("xfs: Initialize all quota inodes to be NULLFSINO") with is
also in 3.12 but older kernels do not do this and hence taking a
filesystem back to an older kernel can result in quotas failing
initialisation at mount time. When that happens, we see this in
dmesg:

[ 1649.215390] XFS (sdb): Mounting Filesystem
[ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
[ 1649.316902] XFS (sdb): Ending clean mount

By ensuring that we write NULLFSINO to quota inodes that aren't
active, we avoid this problem. We have to be really careful when
determining if the quota inodes are active or not, because we don't
want to write a NULLFSINO if the quota inodes are active and we
simply aren't updating them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:28:41 +10:00
Dave Chinner
cf11da9c5d xfs: refine the allocation stack switch
The allocation stack switch at xfs_bmapi_allocate() has served it's
purpose, but is no longer a sufficient solution to the stack usage
problem we have in the XFS allocation path.

Whilst the kernel stack size is now 16k, that is not a valid reason
for undoing all our "keep stack usage down" modifications. What it
does allow us to do is have the freedom to refine and perfect the
modifications knowing that if we get it wrong it won't blow up in
our faces - we have a safety net now.

This is important because we still have the issue of older kernels
having smaller stacks and that they are still supported and are
demonstrating a wide range of different stack overflows.  Red Hat
has several open bugs for allocation based stack overflows from
directory modifications and direct IO block allocation and these
problems still need to be solved. If we can solve them upstream,
then distro's won't need to bake their own unique solutions.

To that end, I've observed that every allocation based stack
overflow report has had a specific characteristic - it has happened
during or directly after a bmap btree block split. That event
requires a new block to be allocated to the tree, and so we
effectively stack one allocation stack on top of another, and that's
when we get into trouble.

A further observation is that bmap btree block splits are much rarer
than writeback allocation - over a range of different workloads I've
observed the ratio of bmap btree inserts to splits ranges from 100:1
(xfstests run) to 10000:1 (local VM image server with sparse files
that range in the hundreds of thousands to millions of extents).
Either way, bmap btree split events are much, much rarer than
allocation events.

Finally, we have to move the kswapd state to the allocation workqueue
work when allocation is done on behalf of kswapd. This is proving to
cause significant perturbation in performance under memory pressure
and appears to be generating allocation deadlock warnings under some
workloads, so avoiding the use of a workqueue for the majority of
kswapd writeback allocation will minimise the impact of such
behaviour.

Hence it makes sense to move the stack switch to xfs_btree_split()
and only do it for bmap btree splits. Stack switches during
allocation will be much rarer, so there won't be significant
performacne overhead caused by switching stacks. The worse case
stack from all allocation paths will be split, not just writeback.
And the majority of memory allocations will be done in the correct
context (e.g. kswapd) without causing additional latency, and so we
simplify the memory reclaim interactions between processes,
workqueues and kswapd.

The worst stack I've been able to generate with this patch in place
is 5600 bytes deep. It's very revealing because we exit XFS at:

37)     1768      64   kmem_cache_alloc+0x13b/0x170

about 1800 bytes of stack consumed, and the remaining 3800 bytes
(and 36 functions) is memory reclaim, swap and the IO stack. And
this occurs in the inode allocation from an open(O_CREAT) syscall,
not writeback.

The amount of stack being used is much less than I've previously be
able to generate - fs_mark testing has been able to generate stack
usage of around 7k without too much trouble; with this patch it's
only just getting to 5.5k. This is primarily because the metadata
allocation paths (e.g. directory blocks) are no longer causing
double splits on the same stack, and hence now stack tracing is
showing swapping being the worst stack consumer rather than XFS.

Performance of fs_mark inode create workloads is unchanged.
Performance of fs_mark async fsync workloads is consistently good
with context switches reduced by around 150,000/s (30%).
Performance of dbench, streaming IO and postmark is unchanged.
Allocation deadlock warnings have not been seen on the workloads
that generated them since adding this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:08:24 +10:00
Dave Chinner
aa182e64f1 Revert "xfs: block allocation work needs to be kswapd aware"
This reverts commit 1f6d64829d.

This commit resulted in regressions in performance in low
memory situations where kswapd was doing writeback of delayed
allocation blocks. It resulted in significant parallelism of the
kswapd work and with the special kswapd flags meant that hundreds of
active allocation could dip into kswapd specific memory reserves and
avoid being throttled. This cause a large amount of performance
variation, as well as random OOM-killer invocations that didn't
previously exist.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:08:10 +10:00
Benjamin LaHaise
263782c1c9 aio: protect reqs_available updates from changes in interrupt handlers
As of commit f8567a3845 it is now possible to
have put_reqs_available() called from irq context.  While put_reqs_available()
is per cpu, it did not protect itself from interrupts on the same CPU.  This
lead to aio_complete() corrupting the available io requests count when run
under a heavy O_DIRECT workloads as reported by Robert Elliott.  Fix this by
disabling irq updates around the per cpu batch updates of reqs_available.

Many thanks to Robert and folks for testing and tracking this down.

Reported-by: Robert Elliot <Elliott@hp.com>
Tested-by: Robert Elliot <Elliott@hp.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org>
Cc: stable@vger.kenel.org
2014-07-14 13:05:26 -04:00
Fabian Frederick
f2b3455e47 fuse: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-14 16:30:25 +02:00
Maxim Patlasov
27f1b36326 fuse: release temporary page if fuse_writepage_locked() failed
tmp_page to be freed if fuse_write_file_get() returns NULL.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-14 16:17:57 +02:00
Linus Torvalds
18b34d9a7a More bug fixes for ext4 -- most importantly, a fix for a bug
(introduced in 3.15) that can end up triggering a file system
 corruption error after a journal replay.  (It shouldn't lead to any
 actual data corruption, but it is scary and can force file systems to
 be remounted read-only, etc.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJTwuY+AAoJENNvdpvBGATwgN8QAJ2/S5GFxQwHbglHmayXYuMQ
 fU411FwJ1wbqjYyYb+jyBoYcsgpsCKPTqA2JbPlHsFTm2Ec+BPzsybhtYw5ybdeW
 1qAfPTSgNxYXroNwpaqOamxgfXgOaV4iqwvZ4tYcLcrtPq0MOcC5rlSaKMdJuSA1
 6M2/8PijOTndUVJpS/GhSMdKlTAXjtfv9V6t/pfLuoo7cNadlggpJnwC8Qm9DNAA
 5ETVZK44q2+2YvGwrvY6LBb9BVBpL29YbWPNqqw/OXXY++ZFhBJV07osZO38MpsB
 QzUyfRaMTgm9/BdbkG8uxA7Zk6C0YBl5eC4aU79LWGWjGO225CLj95LoBOVjQw9f
 eh+RFGapwVvtyzScDF/a9pH6UwGco/s4kCq8rLr2ztljlO595N3LUwhQBHtiSGtm
 fr65NRDyJMXbqy8yLGrlOnP/4ll2VfTH+el2+tzr5smoTD29EASM155hKDDUOAG0
 TrDHtNrxG1MIROHjp+HSui424Op7NXTnfjwmuKzo+mGpPOcPclPSmAacFJpRGVBE
 220hnk+LrBf525nJzQYHifdCL+JAqbWv/S4YSRGizgppK3DlO/gYcu1zpWb0WWuo
 0VuvxUZDSIZY1aVpMEOQov74WtovB7YyG8RPHl7h2m5dJuLLFgJmLDMDTJR1LLNT
 +tHNJ6jERLQz9wqTvquh
 =OX7Z
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "More bug fixes for ext4 -- most importantly, a fix for a bug
  introduced in 3.15 that can end up triggering a file system corruption
  error after a journal replay.

  It shouldn't lead to any actual data corruption, but it is scary and
  can force file systems to be remounted read-only, etc"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix potential null pointer dereference in ext4_free_inode
  ext4: fix a potential deadlock in __ext4_es_shrink()
  ext4: revert commit which was causing fs corruption after journal replays
  ext4: disable synchronous transaction batching if max_batch_time==0
  ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()
  ext4: clarify error count warning messages
  ext4: fix unjournalled bg descriptor while initializing inode bitmap
2014-07-13 13:14:55 -07:00
Trond Myklebust
e655f945cd Merge branch 'bugfixes' into linux-next
* bugfixes:
  NFS: Don't reset pg_moreio in __nfs_pageio_add_request
  NFS: Remove 2 unused variables
  nfs: handle multiple reqs in nfs_wb_page_cancel
  nfs: handle multiple reqs in nfs_page_async_flush
  nfs: change find_request to find_head_request
  nfs: nfs_page should take a ref on the head req
  nfs: mark nfs_page reqs with flag for extra ref
  nfs: only show Posix ACLs in listxattr if actually present

Conflicts:
	fs/nfs/write.c
2014-07-13 15:22:02 -04:00
Trond Myklebust
f563b89b18 NFS: Don't reset pg_moreio in __nfs_pageio_add_request
Once we've started sending unstable NFS writes, we do not want to
clear pg_moreio, or we may end up sending the very last request as
a stable write if the commit lists are still empty.

Do, however, reset pg_moreio in the case where we end up having to
recoalesce the write if an attempt to use pNFS failed.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-13 15:18:44 -04:00
Fabian Frederick
002160269f NFS: use ARRAY_SIZE instead of sizeof/sizeof[0]
Use macro definition

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:43:58 -04:00
Himangi Saraogi
8ee2b78a44 NFSv4: Drop cast
This patch does away with the cast on void * as it is unnecessary.

The following Coccinelle semantic patch was used for making the change:

@r@
expression x;
void* e;
type T;
identifier f;
@@

(
  *((T *)e)
|
  ((T *)x)[...]
|
  ((T *)x)->f
|
- (T *)
  e
)

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:43:47 -04:00
Fabian Frederick
57b696fb1b fs/nfs_common/nfsacl.c: move EXPORT symbol after functions
Fix checkpatch warnings:

"WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable"

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:43:42 -04:00
Jeff Layton
f11b2a1cfb nfs4: copy acceptor name from context to nfs_client
The current CB_COMPOUND handling code tries to compare the principal
name of the request with the cl_hostname in the client. This is not
guaranteed to ever work, particularly if the client happened to mount
a CNAME of the server or a non-fqdn.

Fix this by instead comparing the cr_principal string with the acceptor
name that we get from gssd. In the event that gssd didn't send one
down (i.e. it was too old), then we fall back to trying to use the
cl_hostname as we do today.

Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:41:25 -04:00
Jeff Layton
f1cdae87fc nfs4: turn free_lock_state into a void return operation
Nothing checks its return value.

Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:36:37 -04:00
Jeff Layton
49a4bda22e nfs4: queue free_lock_state job submission to nfsiod
We got a report of the following warning in Fedora:

BUG: sleeping function called from invalid context at mm/slub.c:969
in_atomic(): 1, irqs_disabled(): 0, pid: 533, name: bash
3 locks held by bash/533:
 #0:  (&sp->so_delegreturn_mutex){+.+...}, at: [<ffffffffa033da62>] nfs4_proc_lock+0x262/0x910 [nfsv4]
 #1:  (&nfsi->rwsem){.+.+.+}, at: [<ffffffffa033da6a>] nfs4_proc_lock+0x26a/0x910 [nfsv4]
 #2:  (&sb->s_type->i_lock_key#23){+.+...}, at: [<ffffffff812998dc>] flock_lock_file_wait+0x8c/0x3a0
CPU: 0 PID: 533 Comm: bash Not tainted 3.15.0-0.rc1.git1.1.fc21.x86_64 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000000 00000000d664ff3c ffff880078b69a70 ffffffff817e82e0
 0000000000000000 ffff880078b69a98 ffffffff810cf1a4 0000000000000050
 0000000000000050 ffff88007cc01a00 ffff880078b69ad8 ffffffff8121449e
Call Trace:
 [<ffffffff817e82e0>] dump_stack+0x4d/0x66
 [<ffffffff810cf1a4>] __might_sleep+0x184/0x240
 [<ffffffff8121449e>] kmem_cache_alloc_trace+0x4e/0x330
 [<ffffffffa0331124>] ? nfs4_release_lockowner+0x74/0x110 [nfsv4]
 [<ffffffffa0331124>] nfs4_release_lockowner+0x74/0x110 [nfsv4]
 [<ffffffffa0352340>] nfs4_put_lock_state+0x90/0xb0 [nfsv4]
 [<ffffffffa0352375>] nfs4_fl_release_lock+0x15/0x20 [nfsv4]
 [<ffffffff81297515>] locks_free_lock+0x45/0x90
 [<ffffffff8129996c>] flock_lock_file_wait+0x11c/0x3a0
 [<ffffffffa033da6a>] ? nfs4_proc_lock+0x26a/0x910 [nfsv4]
 [<ffffffffa033301e>] do_vfs_lock+0x1e/0x30 [nfsv4]
 [<ffffffffa033da79>] nfs4_proc_lock+0x279/0x910 [nfsv4]
 [<ffffffff810dbb26>] ? local_clock+0x16/0x30
 [<ffffffff810f5a3f>] ? lock_release_holdtime.part.28+0xf/0x200
 [<ffffffffa02f820c>] do_unlk+0x8c/0xc0 [nfs]
 [<ffffffffa02f85c5>] nfs_flock+0xa5/0xf0 [nfs]
 [<ffffffff8129a6f6>] locks_remove_file+0xb6/0x1e0
 [<ffffffff812159d8>] ? kfree+0xd8/0x2d0
 [<ffffffff8123bc63>] __fput+0xd3/0x210
 [<ffffffff8123bdee>] ____fput+0xe/0x10
 [<ffffffff810bfb6d>] task_work_run+0xcd/0xf0
 [<ffffffff81019cd1>] do_notify_resume+0x61/0x90
 [<ffffffff817fbea2>] int_signal+0x12/0x17

The problem is that NFSv4 is trying to do an allocation from
fl_release_private (in order to send a RELEASE_LOCKOWNER call). That
function can be called while holding the inode->i_lock, and it's
currently set up to do __GFP_WAIT allocations. v4.1 code has a
similar problem.

This patch adds a work_struct to the nfs4_lock_state and has the code
queue the free_lock_state operation to nfsiod.

Reported-by: Josh Stone <jistone@redhat.com>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:36:35 -04:00
Jeff Layton
8003d3c4aa nfs4: treat lock owners as opaque values
Do the following set of ops with a file on a NFSv4 mount:

    exec 3>>/file/on/nfsv4
    flock -x 3
    exec 3>&-

You'll see the LOCK request go across the wire, but no LOCKU when the
file is closed.

What happens is that the fd is passed across a fork, and the final close
is done in a different process than the opener. That makes
__nfs4_find_lock_state miss finding the correct lock state because it
uses the fl_pid as a search key. A new one is created, and the locking
code treats it as a delegation stateid (because NFS_LOCK_INITIALIZED
isn't set).

The root cause of this breakage seems to be commit 77041ed9b4
(NFSv4: Ensure the lockowners are labelled using the fl_owner and/or
fl_pid).

That changed it so that flock lockowners are allocated based on the
fl_pid. I think this is incorrect. flock locks should be "owned" by the
struct file, and that is already accounted for in the fl_owner field of
the lock request when it comes through nfs_flock.

This patch basically reverts the above commit and with it, a LOCKU is
sent in the above reproducer.

Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:36:31 -04:00
Peng Tao
039b756a2d nfs41: layout return on close in delegation return
If file is not opened by anyone, we do layout return on close
in delegation return.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:23:17 -04:00
Peng Tao
fe08c54691 nfs41: return layout on last close
If client has valid delegation, do not return layout on close at all.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:23:04 -04:00
Peng Tao
15bb3afe90 nfs4: add nfs4_check_delegation
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:22:58 -04:00
Peng Tao
0b0bc6ea77 pnfs/filelayout: retry ds commit if nfs_commitdata_alloc fails
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:22:45 -04:00
Peng Tao
c8a3292d24 pnfs/filelayout: fix race between mark_request_commit and scan_commit_lists
We need to hold cinfo lock while setting bucket->wlseg and adding req to nwritten
list at the same time. Otherwise there might be a window where nwritten list
is empty yet we set bucket->wlseg, in which case ff_layout_scan_ds_commit_list()
may end up clearing bucket->wlseg incorrectly, casuing client to oops later on.

This was found when testing flexfile layout but filelayout has the same problem.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:22:41 -04:00
Trond Myklebust
f3792d63d2 NFSv4: Fix OPEN w/create access mode checking
POSIX states that open("foo", O_CREAT|O_RDONLY, 000) should succeed if
the file "foo" does not already exist. With the current NFS client,
it will fail with an EACCES error because of the permissions checks in
nfs4_opendata_access().

Fix is to turn that test off if the server says that we created the file.

Reported-by: "Frank S. Filz" <ffilzlnx@mindspring.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:20:55 -04:00
Trond Myklebust
aafe37504c NFS: Remove 2 unused variables
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 17:35:57 -04:00
Weston Andros Adamson
3e2170451e nfs: handle multiple reqs in nfs_wb_page_cancel
Use nfs_lock_and_join_requests to merge all subrequests into the head request -
this cancels and dereferences all subrequests.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 17:35:47 -04:00
Weston Andros Adamson
d458138353 nfs: handle multiple reqs in nfs_page_async_flush
Change nfs_find_and_lock_request so nfs_page_async_flush can handle multiple
requests in a page. There is only one request for a page the first time
nfs_page_async_flush is called, but if a write or commit fails, async_flush
is called again and there may be multiple requests associated with the page.
The solution is to merge all the requests in a page group into a single
request before calling nfs_pageio_add_request.

Rename nfs_find_and_lock_request to nfs_lock_and_join_requests and
change it to first lock all requests for the page, then cancel and merge
all subrequests into the head request.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 17:35:46 -04:00
Weston Andros Adamson
84d3a9a913 nfs: change find_request to find_head_request
nfs_page_find_request_locked* should find the head request for that page.
Rename the functions and add comments to make this clear, and fix a bug
that could return a subrequest when page_private isn't set on the page.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 16:51:41 -04:00
Weston Andros Adamson
85710a837c nfs: nfs_page should take a ref on the head req
nfs_pages that aren't the the head of a group must take a reference on the
head as long as ->wb_head is set to it. This stops the head from hitting
a refcount of 0 while there is still an active nfs_page for the page group.

This avoids kref warnings in the writeback code when the page group head
is found and referenced.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 16:51:41 -04:00
Weston Andros Adamson
17089a29a2 nfs: mark nfs_page reqs with flag for extra ref
Change the use of PG_INODE_REF - set it when taking extra reference on
subrequests and take care to only release once for each request.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 16:51:41 -04:00
Namjae Jeon
bf40c92635 ext4: fix potential null pointer dereference in ext4_free_inode
Fix potential null pointer dereferencing problem caused by e43bb4e612
("ext4: decrement free clusters/inodes counters when block group declared bad")

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-12 16:11:42 -04:00