Commit Graph

4973 Commits

Author SHA1 Message Date
Linus Torvalds
0f0d12728e Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount flag updates from Al Viro:
 "Another chunk of fmount preparations from dhowells; only trivial
  conflicts for that part. It separates MS_... bits (very grotty
  mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
  only a small subset of MS_... stuff).

  This does *not* convert the filesystems to new constants; only the
  infrastructure is done here. The next step in that series is where the
  conflicts would be; that's the conversion of filesystems. It's purely
  mechanical and it's better done after the merge, so if you could run
  something like

	list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

	sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
	        -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
	        -e 's/\<MS_NODEV\>/SB_NODEV/g' \
	        -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
	        -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
	        -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
	        -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
	        -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
	        -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
	        -e 's/\<MS_SILENT\>/SB_SILENT/g' \
	        -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
	        -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
	        -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
	        -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
	        $list

  and commit it with something along the lines of 'convert filesystems
  away from use of MS_... constants' as commit message, it would save a
  quite a bit of headache next cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Differentiate mount flags (MS_*) from internal superblock flags
  VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
  vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-14 18:54:01 -07:00
Linus Torvalds
8e7757d83d NFS client updates for Linux 4.14
Hightlights include:
 
 Stable bugfixes:
 - Fix mirror allocation in the writeback code to avoid a use after free
 - Fix the O_DSYNC writes to use the correct byte range
 - Fix 2 use after free issues in the I/O code
 
 Features:
 - Writeback fixes to split up the inode->i_lock in order to reduce contention
 - RPC client receive fixes to reduce the amount of time the
   xprt->transport_lock is held when receiving data from a socket into am
   XDR buffer.
 - Ditto fixes to reduce contention between call side users of the rdma
   rb_lock, and its use in rpcrdma_reply_handler.
 - Re-arrange rdma stats to reduce false cacheline sharing.
 - Various rdma cleanups and optimisations.
 - Refactor the NFSv4.1 exchange id code and clean up the code.
 - Const-ify all instances of struct rpc_xprt_ops
 
 Bugfixes:
 - Fix the NFSv2 'sec=' mount option.
 - NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys'
 - Fix the NFSv3 GRANT callback when the port changes on the server.
 - Fix livelock issues with COMMIT
 - NFSv4: Use correct inode in _nfs4_opendata_to_nfs4_state() when doing
   and NFSv4.1 open by filehandle.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZtbvIAAoJEGcL54qWCgDy/boP/jRuVk6B2VyhWnJkOgdQzIN3
 Q8PIR0oxkywH2MI7c9/G2k5b/HD9BK2iQrXzIoPxRuPrckKLwzqYclzG8PR4Niyg
 D3CCzrvGcEXZrv/nHQ+HDMD0ZuUyXFqhrYeyQwNSJ9p/oP0gaxnYwteennfJVa99
 mv6+LdoY+lzVYJI1gmMHVF2zOhN+rTe7xUVnjYnsVCpwMvL+u992oZl3qQJRFG6b
 HlXOy7h5JRFyue61P20PSgh9D1JUWWYD/V0EG+7cIvByAg5KxhvVgjqSsTTT7FXe
 Omn4fTv1MFzk8er9qYFRjpM2IoIdAejFMqX3/PxQVr2qOFNmHYrq+WsdWNQEr/Wu
 WREJu5Ac1Hboe2/scA+DtuVPFePPPyrolhwk533aNWrdDywg01e0XqBEDKR/atJd
 u5lvW20UfLQuCFLOpaxDpq2ngQSOg6t96N36tsydG0SAVpiydOPMLqkQi7Nb3aoB
 79xGpmtnijP5T6jnOI2/nexM08OMTI0BhMbXJC5v1+lnxIJKcKdnGlTM4UJyxUMq
 /3dFI4IQZLfkMEjIvZFoi+nKWx3DYhiUhkKhbBYwtB4P4q8Z2qKTPHFxORz9griZ
 Pa+8BPuDuodIWuDD97q1Dnw2NWjQim8Rx/ce4c8FHGzwMJLPkcVqk+guGsub5IdO
 7qF7Vvv02gJ48TAqTBDf
 =1Ssl
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Hightlights include:

  Stable bugfixes:
   - Fix mirror allocation in the writeback code to avoid a use after
     free
   - Fix the O_DSYNC writes to use the correct byte range
   - Fix 2 use after free issues in the I/O code

  Features:
   - Writeback fixes to split up the inode->i_lock in order to reduce
     contention
   - RPC client receive fixes to reduce the amount of time the
     xprt->transport_lock is held when receiving data from a socket into
     am XDR buffer.
   - Ditto fixes to reduce contention between call side users of the
     rdma rb_lock, and its use in rpcrdma_reply_handler.
   - Re-arrange rdma stats to reduce false cacheline sharing.
   - Various rdma cleanups and optimisations.
   - Refactor the NFSv4.1 exchange id code and clean up the code.
   - Const-ify all instances of struct rpc_xprt_ops

  Bugfixes:
   - Fix the NFSv2 'sec=' mount option.
   - NFSv4.1: don't use machine credentials for CLOSE when using
     'sec=sys'
   - Fix the NFSv3 GRANT callback when the port changes on the server.
   - Fix livelock issues with COMMIT
   - NFSv4: Use correct inode in _nfs4_opendata_to_nfs4_state() when
     doing and NFSv4.1 open by filehandle"

* tag 'nfs-for-4.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (69 commits)
  NFS: Count the bytes of skipped subrequests in nfs_lock_and_join_requests()
  NFS: Don't hold the group lock when calling nfs_release_request()
  NFS: Remove pnfs_generic_transfer_commit_list()
  NFS: nfs_lock_and_join_requests and nfs_scan_commit_list can deadlock
  NFS: Fix 2 use after free issues in the I/O code
  NFS: Sync the correct byte range during synchronous writes
  lockd: Delete an error message for a failed memory allocation in reclaimer()
  NFS: remove jiffies field from access cache
  NFS: flush data when locking a file to ensure cache coherence for mmap.
  SUNRPC: remove some dead code.
  NFS: don't expect errors from mempool_alloc().
  xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler
  xprtrdma: Re-arrange struct rx_stats
  NFS: Fix NFSv2 security settings
  NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys'
  SUNRPC: ECONNREFUSED should cause a rebind.
  NFS: Remove unused parameter gfp_flags from nfs_pageio_init()
  NFSv4: Fix up mirror allocation
  SUNRPC: Add a separate spinlock to protect the RPC request receive list
  SUNRPC: Cleanup xs_tcp_read_common()
  ...
2017-09-11 22:01:44 -07:00
NeilBrown
bf4b490597 NFS: various changes relating to reporting IO errors.
1/ remove 'start' and 'end' args from nfs_file_fsync_commit().
   They aren't used.

2/ Make nfs_context_set_write_error() a "static inline" in internal.h
   so we can...

3/ Use nfs_context_set_write_error() instead of mapping_set_error()
   if nfs_pageio_add_request() fails before sending any request.
   NFS generally keeps errors in the open_context, not the mapping,
   so this is more consistent.

4/ If filemap_write_and_write_range() reports any error, still
   check ctx->error.  The value in ctx->error is likely to be
   more useful.  As part of this, NFS_CONTEXT_ERROR_WRITE is
   cleared slightly earlier, before nfs_file_fsync_commit() is called,
   rather than at the start of that function.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-11 22:28:56 -04:00
Chuck Lever
8224b2734a NFS: Add static NFS I/O tracepoints
Tools like tcpdump and rpcdebug can be very useful. But there are
plenty of environments where they are difficult or impossible to
use. For example, we've had customers report I/O failures during
workloads so heavy that collecting network traffic or enabling
RPC debugging are themselves onerous.

The kernel's static tracepoints are lightweight (less likely to
introduce timing changes) and efficient (the trace data is compact).
They also work in scenarios where capturing network traffic is not
possible due to lack of hardware support (some InfiniBand HCAs) or
where data or network privacy is a concern.

Introduce tracepoints that show when an NFS READ, WRITE, or COMMIT
is initiated, and when it completes. Record the arguments and
results of each operation, which are not shown by existing sunrpc
module's tracepoints.

For instance, the recorded offset and count can be used to match an
"initiate" event to a "done" event. If an NFS READ result returns
fewer bytes than requested or zero, seeing the EOF flag can be
probative. Seeing an NFS4ERR_BAD_STATEID result is also indication
of a particular class of problems. The timing information attached
to each event record can often be useful as well.

Usage example:

[root@manet tmp]# trace-cmd record -e nfs:*initiate* -e nfs:*done
/sys/kernel/debug/tracing/events/nfs/*initiate*/filter
/sys/kernel/debug/tracing/events/nfs/*done/filter
Hit Ctrl^C to stop recording
^CKernel buffer statistics:
  Note: "entries" are the entries left in the kernel ring buffer and are not
        recorded in the trace data. They should all be zero.

CPU: 0
entries: 0
overrun: 0
commit overrun: 0
bytes: 3680
oldest event ts:    78.367422
now ts:   100.124419
dropped events: 0
read events: 74

... and so on.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-11 22:20:38 -04:00
Trond Myklebust
70d2f7b1ea pNFS: Use the standard I/O stateid when calling LAYOUTGET
Instead of having a private method for copying the open/delegation stateid,
use the same call that is used for standard I/O through the MDS.

Note that this means we transmit the stateid with a zero seqid, avoiding
issues with NFS4ERR_OLD_STATEID.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-11 22:19:00 -04:00
Trond Myklebust
1bd5d6d08e NFS: Count the bytes of skipped subrequests in nfs_lock_and_join_requests()
If we skip a subrequest due to a zero refcount, we should still count
the byte range that it covered so that we accurately reconstruct the
original request size.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09 16:43:09 -04:00
Linus Torvalds
ad9a19d003 More RDMA work and some op-structure constification from Chuck Lever,
and a small cleanup to our xdr encoding.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZst0LAAoJECebzXlCjuG+o30QALbchoIvs7BiDrUxYMfJ2nCa
 7UW69STwX79B3NZTg7RrScFTLPEFW9DMpb/Og7AYTH3/wdgGYQNM1UxGUYe7IxSN
 xemH7BSmQzJ7ryaxouO/jskUw5nvNRXhY0PMxJApjrCs837vTjduIVw9zUa8EDeH
 9toxpTM4k3z/1myj60PuHnuQF9EyLDL6W581loDF04nQB3pVRbAZOh1lUeqMgLUd
 7IF+CDECFcjL7oZSA3wDGpsVySLdZ+GYxloFIDO/d8kHEsZD3OaN2MdfRki8EOSQ
 qibTYO0284VeyNLUOIHjspqbDh0Lr2F7VolMmlM5GF1IuApih0/QYidqsH6/As3U
 JIAK53vgqZfK2qI0ud7dGGFEnT/vlE7pQiXiza36xI8YZu4Xz6uGbM41p38RU8jO
 3fr38xdPqqO7YE6F7ZUHYyrmW81Vi0lFdQkw1DBEipHV8UquuCmdtAeR9xgDsdQ/
 LsMVevM1mF+19krOIGbBnENq1GX78ecfHEYGxlTjf/MeO4JYl+8/x7Ow2e/ZbwSa
 7hpUeCiVuVmy1hqOEtraBl5caAG0hCE8PeGRrdr5dA6ZS9YTm0ANgtxndKabwDh2
 CjXF3gRnQNUGdFGCi/fmvfb89tVNj1tL52pbQqfgOb/VFrrL328vyNNg/1p2VY4Q
 qzmKtxZhi/XBewQjaSQl
 =E3UQ
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.14' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "More RDMA work and some op-structure constification from Chuck Lever,
  and a small cleanup to our xdr encoding"

* tag 'nfsd-4.14' of git://linux-nfs.org/~bfields/linux:
  svcrdma: Estimate Send Queue depth properly
  rdma core: Add rdma_rw_mr_payload()
  svcrdma: Limit RQ depth
  svcrdma: Populate tail iovec when receiving
  nfsd: Incoming xdr_bufs may have content in tail buffer
  svcrdma: Clean up svc_rdma_build_read_chunk()
  sunrpc: Const-ify struct sv_serv_ops
  nfsd: Const-ify NFSv4 encoding and decoding ops arrays
  sunrpc: Const-ify instances of struct svc_xprt_ops
  nfsd4: individual encoders no longer see error cases
  nfsd4: skip encoder in trivial error cases
  nfsd4: define ->op_release for compound ops
  nfsd4: opdesc will be useful outside nfs4proc.c
  nfsd4: move some nfsd4 op definitions to xdr4.h
2017-09-09 13:31:49 -07:00
Trond Myklebust
8b77484f2b NFS: Don't hold the group lock when calling nfs_release_request()
That can deadlock if this is the last reference since
nfs_page_group_destroy() calls nfs_page_group_sync_on_bit().
Note that even if the page was removed from the subpage list,
the req->wb_head could still be pointing to the old head.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09 15:36:40 -04:00
Trond Myklebust
5d2a9d9dac NFS: Remove pnfs_generic_transfer_commit_list()
It's pretty much a duplicate of nfs_scan_commit_list() that also
clears the PG_COMMIT_TO_DS flag.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09 12:51:40 -04:00
Trond Myklebust
137da553db NFS: nfs_lock_and_join_requests and nfs_scan_commit_list can deadlock
Since the commit list is not ordered, it is possible for nfs_scan_commit_list
to hold a request that nfs_lock_and_join_requests() is waiting for, while
at the same time trying to grab a request that nfs_lock_and_join_requests
already holds.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09 12:28:01 -04:00
Trond Myklebust
196639ebbe NFS: Fix 2 use after free issues in the I/O code
The writeback code wants to send a commit after processing the pages,
which is why we want to delay releasing the struct path until after
that's done.

Also, the layout code expects that we do not free the inode before
we've put the layout segments in pnfs_writehdr_free() and
pnfs_readhdr_free()

Fixes: 919e3bd9a8 ("NFS: Ensure we commit after writeback is complete")
Fixes: 4714fb51fd ("nfs: remove pgio_header refcount, related cleanup")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-08 22:07:52 -04:00
Linus Torvalds
a0725ab0c7 Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the first pull request for 4.14, containing most of the code
  changes. It's a quiet series this round, which I think we needed after
  the churn of the last few series. This contains:

   - Fix for a registration race in loop, from Anton Volkov.

   - Overflow complaint fix from Arnd for DAC960.

   - Series of drbd changes from the usual suspects.

   - Conversion of the stec/skd driver to blk-mq. From Bart.

   - A few BFQ improvements/fixes from Paolo.

   - CFQ improvement from Ritesh, allowing idling for group idle.

   - A few fixes found by Dan's smatch, courtesy of Dan.

   - A warning fixup for a race between changing the IO scheduler and
     device remova. From David Jeffery.

   - A few nbd fixes from Josef.

   - Support for cgroup info in blktrace, from Shaohua.

   - Also from Shaohua, new features in the null_blk driver to allow it
     to actually hold data, among other things.

   - Various corner cases and error handling fixes from Weiping Zhang.

   - Improvements to the IO stats tracking for blk-mq from me. Can
     drastically improve performance for fast devices and/or big
     machines.

   - Series from Christoph removing bi_bdev as being needed for IO
     submission, in preparation for nvme multipathing code.

   - Series from Bart, including various cleanups and fixes for switch
     fall through case complaints"

* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
  kernfs: checking for IS_ERR() instead of NULL
  drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
  drbd: Fix allyesconfig build, fix recent commit
  drbd: switch from kmalloc() to kmalloc_array()
  drbd: abort drbd_start_resync if there is no connection
  drbd: move global variables to drbd namespace and make some static
  drbd: rename "usermode_helper" to "drbd_usermode_helper"
  drbd: fix race between handshake and admin disconnect/down
  drbd: fix potential deadlock when trying to detach during handshake
  drbd: A single dot should be put into a sequence.
  drbd: fix rmmod cleanup, remove _all_ debugfs entries
  drbd: Use setup_timer() instead of init_timer() to simplify the code.
  drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
  drbd: new disk-option disable-write-same
  drbd: Fix resource role for newly created resources in events2
  drbd: mark symbols static where possible
  drbd: Send P_NEG_ACK upon write error in protocol != C
  drbd: add explicit plugging when submitting batches
  drbd: change list_for_each_safe to while(list_first_entry_or_null)
  drbd: introduce drbd_recv_header_maybe_unplug
  ...
2017-09-07 11:59:42 -07:00
tarangg@amazon.com
e973b1a599 NFS: Sync the correct byte range during synchronous writes
Since commit 18290650b1 ("NFS: Move buffered I/O locking into
nfs_file_write()") nfs_file_write() has not flushed the correct byte
range during synchronous writes.  generic_write_sync() expects that
iocb->ki_pos points to the right edge of the range rather than the
left edge.

To replicate the problem, open a file with O_DSYNC, have the client
write at increasing offsets, and then print the successful offsets.
Block port 2049 partway through that sequence, and observe that the
client application indicates successful writes in advance of what the
server received.

Fixes: 18290650b1 ("NFS: Move buffered I/O locking into nfs_file_write()")
Signed-off-by: Jacob Strauss <jsstraus@amazon.com>
Signed-off-by: Tarang Gupta <tarangg@amazon.com>
Tested-by: Tarang Gupta <tarangg@amazon.com>
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-07 11:07:13 -04:00
Jan Kara
26b433d0da fscache: remove unused ->now_uncached callback
Patch series "Ranged pagevec lookup", v2.

In this series I make pagevec_lookup() update the index (to be
consistent with pagevec_lookup_tag() and also as a preparation for
ranged lookups), provide ranged variant of pagevec_lookup() and use it
in places where it makes sense.  This not only removes some common code
but is also a measurable performance win for some use cases (see patch
4/10) where radix tree is sparse and searching & grabing of a page after
the end of the range has measurable overhead.

This patch (of 10):

The callback doesn't ever get called.  Remove it.

Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
NeilBrown
03c6f7d64a NFS: remove jiffies field from access cache
This field hasn't been used since commit 57b691819e ("NFS: Cache
access checks more aggressively").

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06 12:32:37 -04:00
NeilBrown
779eafab06 NFS: flush data when locking a file to ensure cache coherence for mmap.
When a byte range lock (or flock) is taken out on an NFS file, the
validity of the cached data is checked and the inode is marked
NFS_INODE_INVALID_DATA.  However the cached data isn't flushed from
the page cache.

This is sufficient for future read() requests or mmap() requests as
they call nfs_revalidate_mapping() which performs the flush if
necessary.

However an existing mapping is not affected.  Accessing data through
that mapping will continue to return old data even though the inode is
marked NFS_INODE_INVALID_DATA.

This can easily be confirmed using the 'nfs' tool in
  git://github.com/okirch/twopence-nfs.git
and running

   nfs coherence FILENAME
on one client, and
   nfs coherence -r FILENAME
on another client.

It appears that prior to Linux 2.6.0 this worked correctly.

However commit:

http://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/?id=ca9268fe3ddd075714005adecd4afbd7f9ab87d0

removed the call to inode_invalidate_pages() from nfs_zap_caches().  I
haven't tested this code, but inspection suggests that prior to this
commit, file locking would invalidate all inode pages.

This patch adds a call to nfs_revalidate_mapping() after a
successful SETLK so that invalid data is flushed.  With this patch the
above test passes.  To minimize impact (and possibly avoid a GETATTR
call) this only happens if the mapping might be mapped into
userspace.

Cc: Olaf Kirch <okir@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06 12:31:15 -04:00
NeilBrown
237f8306c3 NFS: don't expect errors from mempool_alloc().
Commit fbe77c30e9 ("NFS: move rw_mode to nfs_pageio_header")
reintroduced some pointless code that commit 518662e0fc ("NFS: fix
usage of mempools.") had recently removed.

Remove it again.

Cc: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06 12:31:15 -04:00
Chuck Lever
afea5657c2 sunrpc: Const-ify struct sv_serv_ops
Close an attack vector by moving the arrays of per-server methods to
read-only memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 22:13:50 -04:00
Christoph Hellwig
74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
Trond Myklebust
7af7a5963c Merge branch 'bugfixes' 2017-08-20 13:04:12 -04:00
Chuck Lever
53a75f22e7 NFS: Fix NFSv2 security settings
For a while now any NFSv2 mount where sec= is specified uses
AUTH_NULL. If sec= is not specified, the mount uses AUTH_UNIX.
Commit e68fd7c807 ("mount: use sec= that was specified on the
command line") attempted to address a very similar problem with
NFSv3, and should have fixed this too, but it has a bug.

The MNTv1 MNT procedure does not return a list of security flavors,
so our client makes up a list containing just AUTH_NULL. This should
enable nfs_verify_authflavors() to assign the sec= specified flavor,
but instead, it incorrectly sets it to AUTH_NULL.

I expect this would also be a problem for any NFSv3 server whose
MNTv3 MNT procedure returned a security flavor list containing only
AUTH_NULL.

Fixes: e68fd7c807 ("mount: use sec= that was specified on ... ")
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=310
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20 12:43:34 -04:00
NeilBrown
b79e87e070 NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys'
An NFSv4.1 client might close a file after the user who opened it has
logged off.  In this case the user's credentials may no longer be
valid, if they are e.g. kerberos credentials that have expired.

NFSv4.1 has a mechanism to allow the client to use machine credentials
to close a file.  However due to a short-coming in the RFC, a CLOSE
with those credentials may not be possible if the file in question
isn't exported to the same security flavor - the required PUTFH must
be rejected when this is the case.

Specifically if a server and client support kerberos in general and
have used it to form a machine credential, but the file is only
exported to "sec=sys", a PUTFH with the machine credentials will fail,
so CLOSE is not possible.

As RPC_AUTH_UNIX (used by sec=sys) credentials can never expire, there
is no value in using the machine credential in place of them.
So in that case, just use the users credentials for CLOSE etc, as you would
in NFSv4.0

Signed-off-by: Neil Brown <neilb@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20 12:43:17 -04:00
Trond Myklebust
3bde7afdab NFS: Remove unused parameter gfp_flags from nfs_pageio_init()
Now that the mirror allocation has been moved, the parameter can go.
Also remove the redundant symbol export.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20 11:35:33 -04:00
Trond Myklebust
14abcb0bf5 NFSv4: Fix up mirror allocation
There are a number of callers of nfs_pageio_complete() that want to
continue using the nfs_pageio_descriptor without needing to call
nfs_pageio_init() again. Examples include nfs_pageio_resend() and
nfs_pageio_cond_complete().

The problem is that nfs_pageio_complete() also calls
nfs_pageio_cleanup_mirroring(), which frees up the array of mirrors.
This can lead to writeback errors, in the next call to
nfs_pageio_setup_mirroring().

Fix by simply moving the allocation of the mirrors to
nfs_pageio_setup_mirroring().

Link: https://bugzilla.kernel.org/show_bug.cgi?id=196709
Reported-by: JianhongYin <yin-jianhong@163.com>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20 10:36:35 -04:00
Trond Myklebust
b7561e5186 Merge branch 'writeback' 2017-08-18 14:51:10 -04:00
Trond Myklebust
2ce209c42c NFS: Wait for requests that are locked on the commit list
If a request is on the commit list, but is locked, we will currently skip
it, which can lead to livelocking when the commit count doesn't reduce
to zero.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:48 -04:00
Trond Myklebust
8205b9ce03 NFSv4/pnfs: Replace pnfs_put_lseg_locked() with pnfs_put_lseg()
Now that we no longer hold the inode->i_lock when manipulating the
commit lists, it is safe to call pnfs_put_lseg() again.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:48 -04:00
Trond Myklebust
4b9bb25b36 NFS: Switch to using mapping->private_lock for page writeback lookups.
Switch from using the inode->i_lock for this to avoid contention with
other metadata manipulation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:48 -04:00
Trond Myklebust
5cb953d4b1 NFS: Use an atomic_long_t to count the number of commits
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:48 -04:00
Trond Myklebust
a6b6d5b85a NFS: Use an atomic_long_t to count the number of requests
Rather than forcing us to take the inode->i_lock just in order to bump
the number.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
e824f99ada NFSv4: Use a mutex to protect the per-inode commit lists
The commit lists can get very large, so using the inode->i_lock can
end up affecting general metadata performance.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
b30d2f04c3 NFS: Refactor nfs_page_find_head_request()
Split out the 2 cases so that we can treat the locking differently.
The issue is that the locking in the pageswapcache cache is highly
linked to the commit list locking.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
bd37d6fce1 NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()
Hide the locking from nfs_lock_and_join_requests() so that we can
separate out the requirements for swapcache pages.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
7e8a30f8b4 NFS: Fix up nfs_page_group_covers_page()
Fix up the test in nfs_page_group_covers_page(). The simplest implementation
is to check that we have a set of intersecting or contiguous subrequests
that connect page offset 0 to nfs_page_length(req->wb_page).

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
1344b7ea17 NFS: Remove unused parameter from nfs_page_group_lock()
nfs_page_group_lock() is now always called with the 'nonblock'
parameter set to 'false'.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
dee83046e7 NFS: Remove unuse function nfs_page_group_lock_wait()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
902a4c0046 NFS: Remove nfs_page_group_clear_bits()
At this point, we only expect ever to potentially see PG_REMOVE and
PG_TEARDOWN being set on the subrequests.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
5b2b5187fa NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases
Since nfs_page_group_destroy() does not take any locks on the requests
to be freed, we need to ensure that we don't inadvertently free the
request in nfs_destroy_unlinked_subrequests() while the last reference
is being released elsewhere.

Do this by:

1) Taking a reference to the request unless it is already being freed
2) Checking (under the page group lock) if PG_TEARDOWN is already set before
   freeing an unreferenced request in nfs_destroy_unlinked_subrequests()

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
74a6d4b5ae NFS: Further optimise nfs_lock_and_join_requests()
When locking the entire group in order to remove subrequests,
the locks are always taken in order, and with the page group
lock being taken after the page head is locked. The intention
is that:

1) The lock on the group head guarantees that requests may not
   be removed from the group (although new entries could be appended
   if we're not holding the group lock).
2) It is safe to drop and retake the page group lock while iterating
   through the list, in particular when waiting for a subrequest lock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
b5bab9bf91 NFS: Reduce inode->i_lock contention in nfs_lock_and_join_requests()
We should no longer need the inode->i_lock, now that we've
straightened out the request locking. The locking schema is now:

1) Lock page head request
2) Lock the page group
3) Lock the subrequests one by one

Note that there is a subtle race with nfs_inode_remove_request() due
to the fact that the latter does not lock the page head, when removing
it from the struct page. Only the last subrequest is locked, hence
we need to re-check that the PagePrivate(page) is still set after
we've locked all the subrequests.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
7e6cca6caf NFS: Remove page group limit in nfs_flush_incompatible()
nfs_try_to_update_request() should be able to cope now.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
f6032f216f NFS: Teach nfs_try_to_update_request() to deal with request page_groups
Simplify the code, and avoid some flushes to disk.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
b66aaa8dfe NFS: Fix the inode request accounting when pages have subrequests
Both nfs_destroy_unlinked_subrequests() and nfs_lock_and_join_requests()
manipulate the inode flags adjusting the NFS_I(inode)->nrequests.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:47 -04:00
Trond Myklebust
31a01f093e NFS: Don't unlock writebacks before declaring PG_WB_END
We don't want nfs_lock_and_join_requests() to start fiddling with
the request before the call to nfs_page_group_sync_on_bit().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
e14bebf6de NFS: Don't check request offset and size without holding a lock
Request offsets and sizes are not guaranteed to be stable unless you
are holding the request locked.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
a0e265bc78 NFS: Fix an ABBA issue in nfs_lock_and_join_requests()
All other callers of nfs_page_group_lock() appear to already hold the
page lock on the head page, so doing it in the opposite order here
is inefficient, although not deadlock prone since we roll back all
locks on contention.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
7cb9cd9aa2 NFS: Fix a reference and lock leak in nfs_lock_and_join_requests()
Yes, this is a situation that should never happen (hence the WARN_ON)
but we should still ensure that we free up the locks and references to
the faulty pages.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
08fead2ae5 NFS: Ensure we always dereference the page head last
This fixes a race with nfs_page_group_sync_on_bit() whereby the
call to wake_up_bit() in nfs_page_group_unlock() could occur after
the page header had been freed.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
1403390d83 NFS: Reduce lock contention in nfs_try_to_update_request()
Micro-optimisation to move the lockless check into the for(;;) loop.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
82749dd4ef NFS: Reduce lock contention in nfs_page_find_head_request()
Add a lockless check for whether or not the page might be carrying
an existing writeback before we grab the inode->i_lock.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
6d17d653c9 NFS: Simplify page writeback
We don't expect the page header lock to ever be held across I/O, so
it should always be safe to wait for it, even if we're doing nonblocking
writebacks.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 11:54:46 -04:00
Trond Myklebust
55cfcd1211 Merge branch 'open_state' 2017-08-15 11:54:13 -04:00
Trond Myklebust
75e8c48b9e NFSv4: Use the nfs4_state being recovered in _nfs4_opendata_to_nfs4_state()
If we're recovering a nfs4_state, then we should try to use that instead
of looking up a new stateid. Only do that if the inodes match, though.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-13 20:36:15 -04:00
Trond Myklebust
4e2fcac773 NFSv4: Use correct inode in _nfs4_opendata_to_nfs4_state()
When doing open by filehandle we don't really want to lookup a new inode,
but rather update the one we've got. Add a helper which does this for us.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-13 20:36:15 -04:00
Linus Torvalds
216e4a1def Some more NFS client bugfixes for 4.13
Stable fix:
 - Fix leaking nfs4_ff_ds_version array
 
 Other fixes:
 - Improve TEST_STATEID OLD_STATEID handling to prevent recovery loop
 - Require 64-bit sector_t for pNFS blocklayout to prevent 32-bit compile
 errors
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlmOFIIACgkQ18tUv7Cl
 QOsaUQ/9E7lAP6yYp8HfjIBayN1gcme0ZeGzmWVdP8R9isvqTE0MjrwoNxk7h61H
 La/qUcymE32bMX8qYlDs0mw+yhiTcR/UoP5lS/4FCSUZoQsE6BWXoh+O9QlqEcuE
 mFbA9SV52Pf5Mdc/bTNKyh7jgCjeqzlu2sRo5LUM+N7G/M2a5RPfJVGVNYpOmVs/
 ay30B5tHG/K3eeXECLjFTw3HeMorsS2coTaxtX6RghqPoVF6OFZarMUt69IX3zgg
 jBjokz7YfaPSeOEIOapGGRRARHRBAaPE8TvAtRd45R2pMk+Lr12cFWLjT72wRCCM
 nXrTpJc+q8feje9YpT5yoKtgRnW6etxKM8dtyYrXG1NO+dfZHNIe2Z1ARplhzhV3
 Rt8lBV0N0b7kHZfyMJjYINhAbUxvS8UghRpljuHm4+f1lkoV6cVhKoaat/7MQDwZ
 I55M2Edl+A6wPQA7hpFuIT++PVN6GDK7D1rZTKaDBfZ3OCTOQLx0g1kZwHYs/lmk
 gvvtkj82RmbIPoG1rbxHTJFoQdVrpVCYAWr4rbgqNvUrZCjxTRmwRmyMpC/M1cXI
 noyZ/F+VdVLa0mADKMUmiQJ6QkoHjRIAIqlJbLRRl2VFlWHfu7hUiXk7hqt5ocQW
 cpxwird0Fur8cbEKVriRcwNpqGBrDDO7bv1lyQkwEOeHWZ6Fv9o=
 =1/Ms
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "A few more NFS client bugfixes from me for rc5.

  Dros has a stable fix for flexfiles to prevent leaking the
  nfs4_ff_ds_version arrays when freeing a layout, Trond fixed a
  potential recovery loop situation with the TEST_STATEID operation, and
  Christoph fixed up the pNFS blocklayout Kconfig options to prevent
  unsafe use with kernels that don't have large block device support.
  Summary:

  Stable fix:
   - fix leaking nfs4_ff_ds_version array

  Other fixes:
   - improve TEST_STATEID OLD_STATEID handling to prevent recovery loop

   - require 64-bit sector_t for pNFS blocklayout to prevent 32-bit
     compile errors"

* tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  pnfs/blocklayout: require 64-bit sector_t
  NFSv4: Ignore NFS4ERR_OLD_STATEID in nfs41_check_open_stateid()
  nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays
2017-08-11 13:54:09 -07:00
Christoph Hellwig
8a9d6e964d pnfs/blocklayout: require 64-bit sector_t
The blocklayout code does not compile cleanly for a 32-bit sector_t,
and also has no reliable checks for devices sizes, which makes it
unsafe to use with a kernel that doesn't support large block devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5c83746a0c ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-11 14:10:13 -04:00
Trond Myklebust
c0ca0e5934 NFSv4: Ignore NFS4ERR_OLD_STATEID in nfs41_check_open_stateid()
If the call to TEST_STATEID returns NFS4ERR_OLD_STATEID, then it just
means we raced with other calls to OPEN.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-09 13:36:56 -04:00
Weston Andros Adamson
1feb26162b nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays
The client was freeing the nfs4_ff_layout_ds, but not the contained
nfs4_ff_ds_version array.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 17:18:10 -04:00
Trond Myklebust
bfab281721 NFSv4: Cleanup setting of the migration flags.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-07 09:32:55 -04:00
Trond Myklebust
937e3133cd NFSv4.1: Ensure we clear the SP4_MACH_CRED flags in nfs4_sp4_select_mode()
If the server changes, so that it no longer supports SP4_MACH_CRED, or
that it doesn't support the same set of SP4_MACH_CRED functionality,
then we want to ensure that we clear the unsupported flags.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-07 09:32:55 -04:00
Trond Myklebust
9c760d1fd5 NFSv4: Refactor _nfs4_proc_exchange_id()
Tease apart the functionality in nfs4_exchange_id_done() so that
it is easier to debug exchange id vs trunking issues by moving
all the processing out of nfs4_exchange_id_done() and into the
callers.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-07 09:32:55 -04:00
Linus Torvalds
19ec50a438 Two more NFS client bugfixes for 4.13
Stable fix:
 - Fix EXCHANGE_ID corrupt verifier issue
 
 Other fix:
 - Fix double frees in nfs4_test_session_trunk()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlmCNJAACgkQ18tUv7Cl
 QOvlKhAAjjKMtdwYDV7B3SJvDfa5KNH9nNaMjKoD/OzWavDL7jRVRUWayeJiy4sf
 wJAMsG280ztag1Mr/OnClESZThWNrHTnKjq/P8kuefJz+pPvWP+AaQ/8QSM/rulM
 Y7qfh+FZb2t7lK80VAZTXEi4bvQzlkIV2285a34VIUm3VTpwl3HvltkBMLFztDSb
 sdaQh+N48FOYdG9RMncxpzFfRvUILzP9GFMg6WAFAcbr1wSsT9MfLqd+ANgOidE8
 X2Ch9l0jN51m1dFGmMUZ5HMmuhUh2m50SXhmUOTpS7fj+k11OWsV2IIWHKPk/LjI
 5E0aUPDU2TOE9noLSfkguAyxvqAGeAHPDIE7QM9vTGgktzXWGzh++ODeSzkIDp+5
 iZ+cYLkme1lOg1nEqj1/SrIZXHP7ZCvK8iqNgoqT8rIp6zE1tLPnNAbRDmnwC/VH
 PrfINATV8llN/3vFojWJfD1enDe3+dALPINLVQOjwjafq6d5/hzRdCGqWpCDjmlU
 esDoCkoWGUjadIr6EZGtDAzSZgafsUx6d6QnGtkIUsz1d0FIXH6holB5EKaQpOLf
 dVb9iO5R2EcVP+WvB3KjA5y6MCNvoVqebxMvLBPCYsu3fI8fQ0sgSjgSJXi0fRzg
 G7DUsKcBGVKarDMXTUReL2G6lN7h0f5EsM1WnA9KF0TV9aah/ZE=
 =Ddq9
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "Two fixes from Trond this time, now that he's back from his vacation.
  The first is a stable fix for the EXCHANGE_ID issue on the mailing
  list, and the other fixes a double-free situation that he found at the
  same time.

  Stable fix:
   - Fix EXCHANGE_ID corrupt verifier issue

  Other fix:
   - Fix double frees in nfs4_test_session_trunk()"

* tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4: Fix double frees in nfs4_test_session_trunk()
  NFSv4: Fix EXCHANGE_ID corrupt verifier issue
2017-08-02 20:56:44 -07:00
Trond Myklebust
d9cb73300a NFSv4: Fix double frees in nfs4_test_session_trunk()
rpc_clnt_add_xprt() expects the callback function to be synchronous, and
expects to release the transport and switch references itself.

Fixes: 04fa2c6bb5 ("NFS pnfs data server multipath session trunking")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-02 09:45:55 -04:00
Trond Myklebust
fd40559c86 NFSv4: Fix EXCHANGE_ID corrupt verifier issue
The verifier is allocated on the stack, but the EXCHANGE_ID RPC call was
changed to be asynchronous by commit 8d89bd70bc. If we interrrupt
the call to rpc_wait_for_completion_task(), we can therefore end up
transmitting random stack contents in lieu of the verifier.

Fixes: 8d89bd70bc ("NFS setup async exchange_id")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-01 16:28:55 -04:00
Linus Torvalds
286ba844c5 More NFS client bugfixes for 4.13
Stable fixes:
 - Fix a race where CB_NOTIFY_LOCK fails to wake a waiter
 - Invalidate file size when taking a lock to prevent corruption
 
 Other fixes:
 - Don't excessively generate tiny writes with fallocate
 - Use the raw NFS access mask in nfs4_opendata_access()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAll7l0oACgkQ18tUv7Cl
 QOuL3w/+M5I5xKKrMOjg2cfzdFAn+syTmXYK0HFrxp76CiaNBcQFK5kG1ebkdrTM
 EDXWznamWRTCIbg0U7/3X/763yWUyZM8RSC0nJXyt7FNZitg1Hsvw/OawaM2Z8q4
 TQQPqelEAhAG7zbgyCCe+SuAoAOTq7HpX2wru8gK6POBOP6gmNEJtchBzqrlsq8d
 bSH8E7BLhbRIwC3htsPfTW0NZtqpp7u/wKeLtt/ZGIuM/+78iOa1wMwCESVJfd47
 2DpDmS0LxVtAjs8lCjMBKyrypNEvh+evgdbxeiXG/T6ykzWhBy96OSOm7ooIjqOr
 pkptxrKOBGb9/8rMnxjCKRIfjgVz77GfB2jD+/ILzRP1E0xipHXWCc5XqzBP999l
 zqUVDMPH3zrq8lxmC9FgoY1PJAcRrZ/aEIjozwkVcksTDYx+GJPYMR6Wks9/1cT0
 4jLTRsBgckj9b3FcjsCiyavHBweDChCEgzx5CLpEqH1KKCfT6MnLTb/WoJL/s1R8
 MLb0MC5PMpLP4OvCRR+mCg+dJD2nXF/Cz2E9r3SSlhZiDsNWmdBXk2A5XoAFzY5l
 pQeqkogBdiINu7p/G3n7837ThRUGV+04C9D9WDI7IF/dktOyYYO/4DNDVYEiFqKL
 9v8Hc4EyGwR2dY5iEKaSuNEk8zTxL1ZGv1H8WTCSDmNRQ+64Q3s=
 =OHnE
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.13-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "More NFS client bugfixes for 4.13.

  Most of these fix locking bugs that Ben and Neil noticed, but I also
  have a patch to fix one more access bug that was reported after last
  week.

  Stable fixes:
   - Fix a race where CB_NOTIFY_LOCK fails to wake a waiter
   - Invalidate file size when taking a lock to prevent corruption

  Other fixes:
   - Don't excessively generate tiny writes with fallocate
   - Use the raw NFS access mask in nfs4_opendata_access()"

* tag 'nfs-for-4.13-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter
  NFS: Optimize fallocate by refreshing mapping when needed.
  NFS: invalidate file size when taking a lock.
  NFS: Use raw NFS access mask in nfs4_opendata_access()
2017-07-28 14:44:56 -07:00
Benjamin Coddington
b7dbcc0e43 NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter
nfs4_retry_setlk() sets the task's state to TASK_INTERRUPTIBLE within the
same region protected by the wait_queue's lock after checking for a
notification from CB_NOTIFY_LOCK callback.  However, after releasing that
lock, a wakeup for that task may race in before the call to
freezable_schedule_timeout_interruptible() and set TASK_WAKING, then
freezable_schedule_timeout_interruptible() will set the state back to
TASK_INTERRUPTIBLE before the task will sleep.  The result is that the task
will sleep for the entire duration of the timeout.

Since we've already set TASK_INTERRUPTIBLE in the locked section, just use
freezable_schedule_timout() instead.

Fixes: a1d617d8f1 ("nfs: allow blocking locks to be awoken by lock callbacks")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-28 15:35:30 -04:00
NeilBrown
6ba80d4348 NFS: Optimize fallocate by refreshing mapping when needed.
posix_fallocate() will allocate space in an NFS file by considering
the last byte of every 4K block.  If it is before EOF, it will read
the byte and if it is zero, a zero is written out.  If it is after EOF,
the zero is unconditionally written.

For the blocks beyond EOF, if NFS believes its cache is valid, it will
expand these writes to write full pages, and then will merge the pages.
This results if (typically) 1MB writes.  If NFS believes its cache is
not valid (particularly if NFS_INO_INVALID_DATA or
NFS_INO_REVAL_PAGECACHE are set - see nfs_write_pageuptodate()), it will
send the individual 1-byte writes. This results in (typically) 256 times
as many RPC requests, and can be substantially slower.

Currently nfs_revalidate_mapping() is only used when reading a file or
mmapping a file, as these are times when the content needs to be
up-to-date.  Writes don't generally need the cache to be up-to-date, but
writes beyond EOF can benefit, particularly in the posix_fallocate()
case.

So this patch calls nfs_revalidate_mapping() when writing beyond EOF -
i.e. when there is a gap between the end of the file and the start of
the write.  If the cache is thought to be out of date (as happens after
taking a file lock), this will cause a GETATTR, and the two flags
mentioned above will be cleared.  With this, posix_fallocate() on a
newly locked file does not generate excessive tiny writes.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-27 11:22:42 -04:00
NeilBrown
442ce0499c NFS: invalidate file size when taking a lock.
Prior to commit ca0daa277a ("NFS: Cache aggressively when file is open
for writing"), NFS would revalidate, or invalidate, the file size when
taking a lock.  Since that commit it only invalidates the file content.

If the file size is changed on the server while wait for the lock, the
client will have an incorrect understanding of the file size and could
corrupt data.  This particularly happens when writing beyond the
(supposed) end of file and can be easily be demonstrated with
posix_fallocate().

If an application opens an empty file, waits for a write lock, and then
calls posix_fallocate(), glibc will determine that the underlying
filesystem doesn't support fallocate (assuming version 4.1 or earlier)
and will write out a '0' byte at the end of each 4K page in the region
being fallocated that is after the end of the file.
NFS will (usually) detect that these writes are beyond EOF and will
expand them to cover the whole page, and then will merge the pages.
Consequently, NFS will write out large blocks of zeroes beyond where it
thought EOF was.  If EOF had moved, the pre-existing part of the file
will be over-written.  Locking should have protected against this,
but it doesn't.

This patch restores the use of nfs_zap_caches() which invalidated the
cached attributes.  When posix_fallocate() asks for the file size, the
request will go to the server and get a correct answer.

cc: stable@vger.kernel.org (v4.8+)
Fixes: ca0daa277a ("NFS: Cache aggressively when file is open for writing")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-27 11:22:42 -04:00
Anna Schumaker
1e6f209515 NFS: Use raw NFS access mask in nfs4_opendata_access()
Commit bd8b244174 ("NFS: Store the raw NFS access mask in the inode's
access cache") changed how the access results are stored after an
access() call.  An NFS v4 OPEN might have access bits returned with the
opendata, so we should use the NFS4_ACCESS values when determining the
return value in nfs4_opendata_access().

Fixes: bd8b244174 ("NFS: Store the raw NFS access mask in the inode's
access cache")
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Tested-by: Takashi Iwai <tiwai@suse.de>
2017-07-26 16:53:57 -04:00
Linus Torvalds
505d5c1119 NFS client bugfixes for 4.13
Stable bugfixes:
 - Fix error reporting regression
 
 Bugfixes:
 - Fix setting filelayout ds address race
 - Fix subtle access bug when using ACLs
 - Fix setting mnt3_counts array size
 - Fix a couple of pNFS commit races
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAllyVYYACgkQ18tUv7Cl
 QOtTwBAA8ek0Ba5wIfwlQe4MvIX1t6v7q7Otrwdombhuw1a6nW410hFwu2SG8kGd
 fQWr1xnqRsMKwu83UuImtlYYvMa271TZgXxPNWx7KJpi/zocnigJRg5sVpkcRqza
 AjjPc245pHCooQoaTAlm5WrO9zDm0s7lRGuTB1cvPRsElnFVWT0jwhTASIDOU1Zy
 9K/hElAnXp/dZv/ydjSePTPPsVQPJWLbJPlSm6vAIQPyeXWUeCgAym0yf2FVvtsd
 AQeozaq9xq6tofVPZhsYWeBKswjTHs3FxL8vhDOF9gF3QMPm43mwsfrKVidWo4vW
 0UJnRZRCFgG0WIxxhA7l1Z9MovAsXlbWmFufgCa4Ev/bC5WuUT4ZEkjBGJw2vXD4
 0/lxkhD41PBhCl/LIod9OT6iJ8koifl50JUC4N67D2illFy9a7Btzx3EPxfDz9uG
 6jEek9x6B1xf7AC4HhxByN/E8gKX08N4Q/afxTFuAwrzKRKqI4Me3qbCyU86bp+T
 wiAxgVPVnmb/VBVLU68i7titdsA6U8ZO12FFqu9QOr9wHMXxa6108h2Nia9jVTdk
 EZhanXa6tJThQQ/QZicuR5hTtoM4BuikaaJSHZ4ODbgrAjsMAyqy4qsf4tf3LLAo
 tEDu5sJDmHJhhdtqzuz+OtDn5iJ2Nga6a4/fMwt9YU2ZQBm7X/8=
 =ZcQv
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.13-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "Stable bugfix:
   - Fix error reporting regression

  Bugfixes:
   - Fix setting filelayout ds address race
   - Fix subtle access bug when using ACLs
   - Fix setting mnt3_counts array size
   - Fix a couple of pNFS commit races"

* tag 'nfs-for-4.13-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS/filelayout: Fix racy setting of fl->dsaddr in filelayout_check_deviceid()
  NFS: Be more careful about mapping file permissions
  NFS: Store the raw NFS access mask in the inode's access cache
  NFSv3: Convert nfs3_proc_access() to use nfs_access_set_mask()
  NFS: Refactor NFS access to kernel access mask calculation
  net/sunrpc/xprt_sock: fix regression in connection error reporting.
  nfs: count correct array for mnt3_counts array size
  Revert commit 722f0b8911 ("pNFS: Don't send COMMITs to the DSes if...")
  pNFS/flexfiles: Handle expired layout segments in ff_layout_initiate_commit()
  NFS: Fix another COMMIT race in pNFS
  NFS: Fix a COMMIT race in pNFS
  mount: copy the port field into the cloned nfs_server structure.
  NFS: Don't run wake_up_bit() when nobody is waiting...
  nfs: add export operations
2017-07-21 16:26:01 -07:00
Trond Myklebust
1ebf980127 NFS/filelayout: Fix racy setting of fl->dsaddr in filelayout_check_deviceid()
We must set fl->dsaddr once, and once only, even if there are multiple
processes calling filelayout_check_deviceid() for the same layout
segment.

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 14:08:45 -04:00
Trond Myklebust
ecbb903c56 NFS: Be more careful about mapping file permissions
When mapping a directory, we want the MAY_WRITE permissions to reflect
whether or not we have permission to modify, add and delete the directory
entries. MAY_EXEC must map to lookup permissions.

On the other hand, for files, we want MAY_WRITE to reflect a permission
to modify and extend the file.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 11:51:19 -04:00
Trond Myklebust
bd8b244174 NFS: Store the raw NFS access mask in the inode's access cache
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 11:51:19 -04:00
Trond Myklebust
eda3e20847 NFSv3: Convert nfs3_proc_access() to use nfs_access_set_mask()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 11:51:19 -04:00
Trond Myklebust
15d4b73ac2 NFS: Refactor NFS access to kernel access mask calculation
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 11:51:19 -04:00
Eryu Guan
ecc7b435d2 nfs: count correct array for mnt3_counts array size
Array size of mnt3_counts should be the size of array
mnt3_procedures, not mnt_procedures, though they're same in size
right now. Found this by code inspection.

Fixes: 1c5876ddbd ("sunrpc: move p_count out of struct rpc_procinfo")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-21 08:49:57 -04:00
Trond Myklebust
213297369c Revert commit 722f0b8911 ("pNFS: Don't send COMMITs to the DSes if...")
Doing the test without taking any locks is racy, and so really it makes
more sense to do it in the flexfiles code (which is the only case that
cares).

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-19 15:28:21 -04:00
Trond Myklebust
4b75053e9b pNFS/flexfiles: Handle expired layout segments in ff_layout_initiate_commit()
If the layout has expired due to a fencing event, then we should not
attempt to commit to the DS.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-19 15:28:21 -04:00
Trond Myklebust
4118188645 NFS: Fix another COMMIT race in pNFS
We must make sure that cinfo->ds->ncommitting is in sync with the
commit list, since it is checked as part of pnfs_commit_list().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-19 15:28:21 -04:00
Trond Myklebust
e39928f942 NFS: Fix a COMMIT race in pNFS
We must make sure that cinfo->ds->nwritten is in sync with the
commit list, since it is checked as part of pnfs_scan_commit_lists().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-19 15:28:21 -04:00
Steve Dickson
89a6814d9b mount: copy the port field into the cloned nfs_server structure.
Doing this copy eliminates the "port=0" entry in
the /proc/mounts entries

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=69241

Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-19 15:28:21 -04:00
David Howells
bc98a42c1f VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
Firstly by applying the following with coccinelle's spatch:

	@@ expression SB; @@
	-SB->s_flags & MS_RDONLY
	+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

	@@ expression A, SB; @@
	(
	-(!sb_rdonly(SB)) && A
	+!sb_rdonly(SB) && A
	|
	-A != (sb_rdonly(SB))
	+A != sb_rdonly(SB)
	|
	-A == (sb_rdonly(SB))
	+A == sb_rdonly(SB)
	|
	-!(sb_rdonly(SB))
	+!sb_rdonly(SB)
	|
	-A && (sb_rdonly(SB))
	+A && sb_rdonly(SB)
	|
	-A || (sb_rdonly(SB))
	+A || sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) != A
	+sb_rdonly(SB) != A
	|
	-(sb_rdonly(SB)) == A
	+sb_rdonly(SB) == A
	|
	-(sb_rdonly(SB)) && A
	+sb_rdonly(SB) && A
	|
	-(sb_rdonly(SB)) || A
	+sb_rdonly(SB) || A
	)

	@@ expression A, B, SB; @@
	(
	-(sb_rdonly(SB)) ? 1 : 0
	+sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) ? A : B
	+sb_rdonly(SB) ? A : B
	)

to remove left over excess bracketage and finally by applying:

	@@ expression A, SB; @@
	(
	-(A & MS_RDONLY) != sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
	|
	-(A & MS_RDONLY) == sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
	)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:34 +01:00
Linus Torvalds
b86faee6d1 NFS client updates for Linux 4.13
Stable bugfixes:
 - Fix -EACCESS on commit to DS handling
 - Fix initialization of nfs_page_array->npages
 - Only invalidate dentries that are actually invalid
 
 Features:
 - Enable NFSoRDMA transparent state migration
 - Add support for lookup-by-filehandle
 - Add support for nfs re-exporting
 
 Other bugfixes and cleanups:
 - Christoph cleaned up the way we declare NFS operations
 - Clean up various internal structures
 - Various cleanups to commits
 - Various improvements to error handling
 - Set the dt_type of . and .. entries in NFS v4
 - Make slot allocation more reliable
 - Fix fscache stat printing
 - Fix uninitialized variable warnings
 - Fix potential list overrun in nfs_atomic_open()
 - Fix a race in NFSoRDMA RPC reply handler
 - Fix return size for nfs42_proc_copy()
 - Fix against MAC forgery timing attacks
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlln4jEACgkQ18tUv7Cl
 QOv2ZxAAwbQN9Dtx4rOZmPe0Xszua23sNN0ja891PodkCjIiZrRelZhLIBAf1rfP
 uSR+jTD8EsBHGt3bzTXg2DHz+o8cGDZuH+uuZX+wRWJPQcKA2pC7zElqnse8nmn5
 4Z1UUdzf42vE4NZ/G1ucqpEiAmOqGJ3s7pCRLLXPvOSSQXqOhiomNDAcGxX05FIv
 Ly4Kr6RIfg/O4oNOZBuuL/tZHodeyOj1vbyjt/4bDQ5MEXlUQfcjJZEsz/2EcNh6
 rAgbquxr1pGCD072pPBwYNH2vLGbgNN41KDDMGI0clp+8p6EhV6BOlgcEoGtZM86
 c0yro2oBOB2vPCv9nGr6JgTOHPKG6ksJ7vWVXrtQEjBGP82AbFfAawLgqZ6Ae8dP
 Sqpx55j4xdm4nyNglCuhq5PlPAogARq/eibR+RbY973Lhzr5bZb3XqlairCkNNEv
 4RbTlxbWjhgrKJ56jVf+KpUDJAVG5viKMD7YDx/bOfLtvPwALbozD7ONrunz5v43
 PgQEvWvVtnQAKp27pqHemTsLFhU6M6eGUEctRnAfB/0ogWZh1X8QXgulpDlqG3kb
 g12kr5hfA0pSfcB0aGXVzJNnHKfW3IY3WBWtxq4xaMY22YkHtuB+78+9/yk3jCAi
 dvimjT2Ko9fE9MnltJ/hC5BU+T+xUxg+1vfwWnKMvMH8SIqjyu4=
 =OpLj
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.13-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - Fix -EACCESS on commit to DS handling
   - Fix initialization of nfs_page_array->npages
   - Only invalidate dentries that are actually invalid

  Features:
   - Enable NFSoRDMA transparent state migration
   - Add support for lookup-by-filehandle
   - Add support for nfs re-exporting

  Other bugfixes and cleanups:
   - Christoph cleaned up the way we declare NFS operations
   - Clean up various internal structures
   - Various cleanups to commits
   - Various improvements to error handling
   - Set the dt_type of . and .. entries in NFS v4
   - Make slot allocation more reliable
   - Fix fscache stat printing
   - Fix uninitialized variable warnings
   - Fix potential list overrun in nfs_atomic_open()
   - Fix a race in NFSoRDMA RPC reply handler
   - Fix return size for nfs42_proc_copy()
   - Fix against MAC forgery timing attacks"

* tag 'nfs-for-4.13-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (68 commits)
  NFS: Don't run wake_up_bit() when nobody is waiting...
  nfs: add export operations
  nfs4: add NFSv4 LOOKUPP handlers
  nfs: add a nfs_ilookup helper
  nfs: replace d_add with d_splice_alias in atomic_open
  sunrpc: use constant time memory comparison for mac
  NFSv4.2 fix size storage for nfs42_proc_copy
  xprtrdma: Fix documenting comments in frwr_ops.c
  xprtrdma: Replace PAGE_MASK with offset_in_page()
  xprtrdma: FMR does not need list_del_init()
  xprtrdma: Demote "connect" log messages
  NFSv4.1: Use seqid returned by EXCHANGE_ID after state migration
  NFSv4.1: Handle EXCHGID4_FLAG_CONFIRMED_R during NFSv4.1 migration
  xprtrdma: Don't defer MR recovery if ro_map fails
  xprtrdma: Fix FRWR invalidation error recovery
  xprtrdma: Fix client lock-up after application signal fires
  xprtrdma: Rename rpcrdma_req::rl_free
  xprtrdma: Pass only the list of registered MRs to ro_unmap_sync
  xprtrdma: Pre-mark remotely invalidated MRs
  xprtrdma: On invalidation failure, remove MWs from rl_registered
  ...
2017-07-13 14:35:37 -07:00
Trond Myklebust
b4f937cffa NFS: Don't run wake_up_bit() when nobody is waiting...
"perf lock" shows fairly heavy contention for the bit waitqueue locks
when doing an I/O heavy workload.
Use a bit to tell whether or not there has been contention for a lock
so that we can optimise away the bit waitqueue options in those cases.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 17:12:07 -04:00
Peng Tao
20fa190272 nfs: add export operations
This support for opening files on NFS by file handle, both through the
open_by_handle syscall, and for re-exporting NFS (for example using a
different version).  The support is very basic for now, as each open by
handle will have to do an NFSv4 open operation on the wire.  In the
future this will hopefully be mitigated by an open file cache, as well
as various optimizations in NFS for this specific case.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
[hch: incorporated various changes, resplit the patches, new changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 17:12:04 -04:00
Trond Myklebust
301bfa4830 NFS: Don't run wake_up_bit() when nobody is waiting...
"perf lock" shows fairly heavy contention for the bit waitqueue locks
when doing an I/O heavy workload.
Use a bit to tell whether or not there has been contention for a lock
so that we can optimise away the bit waitqueue options in those cases.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:57:18 -04:00
Linus Torvalds
6240300597 Chuck's RDMA update overhauls the "call receive" side of the
RPC-over-RDMA transport to use the new rdma_rw API.
 
 Christoph cleaned the way nfs operations are declared, removing a bunch
 of function-pointer casts and declaring the operation vectors as const.
 
 Christoph's changes touch both client and server, and both client and
 server pulls this time around should be based on the same commits from
 Christoph.
 
 (Note: Anna and I initially didn't coordinate this well and we realized
 our pull requests were going to leave you with Christoph's 33 patches
 duplicated between our two trees.  We decided a last-minute rebase was
 the lesser of two evils, so her pull request will show that last-minute
 rebase.  Yell if that was the wrong choice, and we'll know better for
 next time....)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZZ80JAAoJECebzXlCjuG+PiMP/jmw4IbzY4qt/X8aldVTMPZ8
 TkEXuZSrc7FbmroqAR0XN/qJjzENKUcrnlYm7HKVe6iItTZUvJuVThtHQVGzZUZD
 wP2VRzgkky59aDs9cphfTPGKPKL1MtoC3qQdFmKd/8ZhBDHIq89A2pQJwl7PI4rA
 IHzvLmZtTKL+xWoypqZQxepONhEY2ZPrffGWL+5OVF/dPmWfJ6m/M6jRTb7zV/YD
 PZyRqWQ8UY/HwZTwRrxZDCCxUsmRUPZz195iFjM8wvBl7auWNetC22gyyITlvfzf
 1m0zJqw3qn09+v2xnAWs/ZVxypg6rsEiIcL2mf0JC/tQh+iIzabc4e/TwDEWqSq+
 ocQrvXJuZCjsrMqg4oaIuDFogaZCsGR5wxDAEyfYDS/8fMdiKq8xJzT7v31/2U37
 Bsr1hvgAmD4eZWaTrJg11V5RnTzDgns+EtNfISR8t4/k+wehDfyzav8A+j72sqvR
 JT+7iUEd0QcBwo+MCC7AOnLLsIX45QUjZKKrvZNAC1fmr8RyAF1zo5HHO+NNjLuP
 J2PUG2GbNxsQkm/JAFKDvyklLpEXZc6uyYAcEefirxYbh1x0GfuetzqtH58DtrQL
 /1e80MRG9Qgq5S8PvYyvp1bIQPDRaQ188chEvzZy+3QeNXydq2LzDh0bjlM+4A9I
 DZhP2pNGLh0ImaPtX0q+
 =mR/a
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.13' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Chuck's RDMA update overhauls the "call receive" side of the
  RPC-over-RDMA transport to use the new rdma_rw API.

  Christoph cleaned the way nfs operations are declared, removing a
  bunch of function-pointer casts and declaring the operation vectors as
  const.

  Christoph's changes touch both client and server, and both client and
  server pulls this time around should be based on the same commits from
  Christoph"

* tag 'nfsd-4.13' of git://linux-nfs.org/~bfields/linux: (53 commits)
  svcrdma: fix an incorrect check on -E2BIG and -EINVAL
  nfsd4: factor ctime into change attribute
  svcrdma: Remove svc_rdma_chunk_ctxt::cc_dir field
  svcrdma: use offset_in_page() macro
  svcrdma: Clean up after converting svc_rdma_recvfrom to rdma_rw API
  svcrdma: Clean-up svc_rdma_unmap_dma
  svcrdma: Remove frmr cache
  svcrdma: Remove unused Read completion handlers
  svcrdma: Properly compute .len and .buflen for received RPC Calls
  svcrdma: Use generic RDMA R/W API in RPC Call path
  svcrdma: Add recvfrom helpers to svc_rdma_rw.c
  sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
  svcrdma: Don't account for Receive queue "starvation"
  svcrdma: Improve Reply chunk sanity checking
  svcrdma: Improve Write chunk sanity checking
  svcrdma: Improve Read chunk sanity checking
  svcrdma: Remove svc_rdma_marshal.c
  svcrdma: Avoid Send Queue overflow
  svcrdma: Squelch disconnection messages
  sunrpc: Disable splice for krb5i
  ...
2017-07-13 13:56:24 -07:00
Jeff Layton
5b5faaf6df nfs4: add NFSv4 LOOKUPP handlers
This will be needed in order to implement the get_parent export op
for nfsd.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:15 -04:00
Peng Tao
00422483ad nfs: add export operations
This support for opening files on NFS by file handle, both through the
open_by_handle syscall, and for re-exporting NFS (for example using a
different version).  The support is very basic for now, as each open by
handle will have to do an NFSv4 open operation on the wire.  In the
future this will hopefully be mitigated by an open file cache, as well
as various optimizations in NFS for this specific case.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
[hch: incorporated various changes, resplit the patches, new changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 16:00:15 -04:00
Peng Tao
f174ff7a0a nfs: add a nfs_ilookup helper
This helper will allow to find an existing NFS inode by the file handle
and fattr.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:15 -04:00
Peng Tao
774d9513a3 nfs: replace d_add with d_splice_alias in atomic_open
It's a trival change but follows knfsd export document that asks
for d_splice_alias during lookup.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:14 -04:00
Olga Kornievskaia
1ee48bdd22 NFSv4.2 fix size storage for nfs42_proc_copy
Return size of COPY is u64 but it was assigned to an "int" status.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:14 -04:00
Chuck Lever
838edb9497 NFSv4.1: Use seqid returned by EXCHANGE_ID after state migration
Transparent State Migration copies a client's lease state from the
server where a filesystem used to reside to the server where it now
resides. When an NFSv4.1 client first contacts that destination
server, it uses EXCHANGE_ID to detect trunking relationships.

The lease that was copied there is returned to that client, but the
destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to
the client. This is because the lease was confirmed on the source
server (before it was copied).

When CONFIRMED_R is set, the client throws away the sequence ID
returned by the server. During a Transparent State Migration, however
there's no other way for the client to know what sequence ID to use
with a lease that's been migrated.

Therefore, the client must save and use the contrived slot sequence
value returned by the destination server even when CONFIRMED_R is
set.

Note that some servers always return a seqid of 1 after a migration.

Reported-by: Xuan Qi <xuan.qi@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Xuan Qi <xuan.qi@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:12 -04:00
Chuck Lever
8dcbec6d20 NFSv4.1: Handle EXCHGID4_FLAG_CONFIRMED_R during NFSv4.1 migration
Transparent State Migration copies a client's lease state from the
server where a filesystem used to reside to the server where it now
resides. When an NFSv4.1 client first contacts that destination
server, it uses EXCHANGE_ID to detect trunking relationships.

The lease that was copied there is returned to that client, but the
destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to
the client. This is because the lease was confirmed on the source
server (before it was copied).

Normally, when CONFIRMED_R is set, a client purges the lease and
creates a new one. However, that throws away the entire benefit of
Transparent State Migration.

Therefore, the client must not purge that lease when it is possible
that Transparent State Migration has occurred.

Reported-by: Xuan Qi <xuan.qi@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Xuan Qi <xuan.qi@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:12 -04:00
NeilBrown
26fde4dfcb NFS: check for nfs_refresh_inode() errors in nfs_fhget()
If an NFS server returns a filehandle that we have previously
seen, and reports a different type, then nfs_refresh_inode()
will log a warning and return an error.

nfs_fhget() does not check for this error and may return an
inode with a different type than the one that the server
reported.

This is likely to cause confusion, and is one way that
->open_context() could return a directory inode as discussed
in the previous patch.

So if nfs_refresh_inode() returns and error, return that error
from nfs_fhget() to avoid the confusion propagating.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:09 -04:00
NeilBrown
eaa2b82c3b NFS: guard against confused server in nfs_atomic_open()
A confused server could return a filehandle for an
NFSv4 OPEN request, which it previously returned for a directory.
So the inode returned by  ->open_context() in nfs_atomic_open()
could conceivably be a directory inode.

This has particular implications for the call to
nfs_file_set_open_context() in nfs_finish_open().
If that is called on a directory inode, then the nfs_open_context
that gets stored in the filp->private_data will be linked to
nfs_inode->open_files.

When the directory is closed, nfs_closedir() will (ultimately)
free the ->private_data, but not unlink it from nfs_inode->open_files
(because it doesn't expect an nfs_open_context there).

Subsequently the memory could get used for something else and eventually
if the ->open_files list is walked, the walker will fall off the end and
crash.

So: change nfs_finish_open() to only call nfs_file_set_open_context()
for regular-file inodes.

This failure mode has been seen in a production setting (unknown NFS
server implementation).  The kernel was v3.0 and the specific sequence
seen would not affect more recent kernels, but I think a risk is still
present, and caution is wise.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:08 -04:00
NeilBrown
cc89684c9a NFS: only invalidate dentrys that are clearly invalid.
Since commit bafc9b754f ("vfs: More precise tests in d_invalidate")
in v3.18, a return of '0' from ->d_revalidate() will cause the dentry
to be invalidated even if it has filesystems mounted on or it or on a
descendant.  The mounted filesystem is unmounted.

This means we need to be careful not to return 0 unless the directory
referred to truly is invalid.  So -ESTALE or -ENOENT should invalidate
the directory.  Other errors such a -EPERM or -ERESTARTSYS should be
returned from ->d_revalidate() so they are propagated to the caller.

A particular problem can be demonstrated by:

1/ mount an NFS filesystem using NFSv3 on /mnt
2/ mount any other filesystem on /mnt/foo
3/ ls /mnt/foo
4/ turn off network, or otherwise make the server unable to respond
5/ ls /mnt/foo &
6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack
7/ kill -9 $! # this results in -ERESTARTSYS being returned
8/ observe that /mnt/foo has been unmounted.

This patch changes nfs_lookup_revalidate() to only treat
  -ESTALE from nfs_lookup_verify_inode() and
  -ESTALE or -ENOENT from ->lookup()
as indicating an invalid inode.  Other errors are returned.

Also nfs_check_inode_attributes() is changed to return -ESTALE rather
than -EIO.  This is consistent with the error returned in similar
circumstances from nfs_update_inode().

As this bug allows any user to unmount a filesystem mounted on an NFS
filesystem, this fix is suitable for stable kernels.

Fixes: bafc9b754f ("vfs: More precise tests in d_invalidate")
Cc: stable@vger.kernel.org (v3.18+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:08 -04:00
Olga Kornievskaia
22368ff11d PNFS for stateid errors retry against MDS first
Upon receiving a stateid error such as BAD_STATEID, the client
should retry the operation against the MDS before deciding to
do stateid recovery.

Previously, the code would initiate state recovery and it could
lead to a race in a state manager that could chose an incorrect
recovery method which would lead to the EIO failure for the
application.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:08 -04:00
Olga Kornievskaia
a0bc01e0f1 PNFS fix EACCESS on commit to DS handling
Commit fabbbee0eb "PNFS fix fallback to MDS if got error on
commit to DS" moved the pnfs_set_lo_fail() to unhandled errors
which was not correct and lead to a kernel oops on umount.

Instead, fix the original EACCESS on commit to DS error by
getting the new layout and re-doing the IO.

Fixes: fabbbee0eb ("PNFS fix fallback to MDS if got error on commit to DS")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # v4.12
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:59:57 -04:00
Dan Carpenter
4cd1ec95bd NFS: silence a uninitialized variable warning
Static checkers have gotten clever enough to complain that "id_long" is
uninitialized on the failure path.  It's harmless, but simple to fix.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:28 -04:00