If you export a subdirectory of a filesystem, a READDIRPLUS on the root
of that export will return the filehandle of the parent with the ".."
entry.
The filehandle is optional, so let's just not return the filehandle for
".." if we're at the root of an export.
Note that once the client learns one filehandle outside of the export,
they can trivially access the rest of the export using further lookups.
However, it is also not very difficult to guess filehandles outside of
the export. So exporting a subdirectory of a filesystem should
considered equivalent to providing access to the entire filesystem. To
avoid confusion, we recommend only exporting entire filesystems.
Reported-by: Youjipeng <wangzhibei1999@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
While converting the NFSv4 decoder to use xdr_stream-based XDR
processing, I removed the old SAVEMEM() macro. This macro wrapped
a bit of logic that avoided a memory allocation by recognizing when
the decoded item resides in a linear section of the Receive buffer.
In that case, it returned a pointer into that buffer instead of
allocating a bounce buffer.
The bounce buffer is necessary only when xdr_inline_decode() has
placed the decoded item in the xdr_stream's scratch buffer, which
disappears the next time xdr_inline_decode() is called with that
xdr_stream. That happens only if the data item crosses a page
boundary in the receive buffer, an exceedingly rare occurrence.
Allocating a bounce buffer every time results in a minor performance
regression that was introduced by the recent NFSv4 decoder overhaul.
Let's restore the previous behavior. On average, it saves about 1.5
kmalloc() calls per COMPOUND.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
fs/nfsd/nfssvc.c:36:6: warning: symbol 'inter_copy_offload_enable' was not declared. Should it be static?
The parameter was added by commit ce0887ac96 ("NFSD add nfs4 inter
ssc to nfsd4_copy"). Relocate it into the source file that uses it,
and make it static. This approach is similar to the
nfs4_disable_idmapping, cltrack_prog, and cltrack_legacy_disable
module parameters.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If the READ_PLUS operation was truncated due to an error, then ensure we
clear the 'eof' flag.
Fixes: 9f0b5792f0 ("NFSD: Encode a full READ_PLUS reply")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Ensure that we encode the data payload + padding, and that we truncate
the preallocated buffer to the actual read size.
Fixes: 528b84934e ("NFSD: Add READ_PLUS data support")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl/bPRMACgkQnJ2qBz9k
QNmktwf7BE+H0PEgm3VfEs8uKUnmgr/TTBd9rhuKVa8NeYrT1YlX2ocCykawaLSW
ppyXkr2rWKwvRO5P9hZPUsMbjvp7ucz14imBHlhiQpPyfh8cqMazPJLySqbAI/M+
Eo8WIl74EqQ4VIgCGgfIVD073yjA4FWvO+5/CITYR44Pc2WzyCdU/1oKGBrs4+Cg
OZAsHvg+2uKiEVeaBwbII+X/jChCJwEfHEYry3A8oRL427HrDir7Jc9i3SNGTDnc
SE6DPj9X5HWOfoXjVrMratnaz654isvdRdP6GRAFKX8rJlNPGLMZbQ3DTzLGTYKL
7r9KylGD5nCkL1SXjUOLCqHgVRrgpg==
=xcC/
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"A few fsnotify fixes from Amir fixing fallout from big fsnotify
overhaul a few months back and an improvement of defaults limiting
maximum number of inotify watches from Waiman"
* tag 'fsnotify_for_v5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: fix events reported to watching parent and child
inotify: convert to handle_inode_event() interface
fsnotify: generalize handle_inode_event()
inotify: Increase default inotify.max_user_watches limit to 1048576
For the case of NFSv4, specify to the client that the pre/post-op
attributes were not recorded atomically with the main operation.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Don't set PF_LOCAL_THROTTLE on remote filesystems like NFS, since they
aren't expected to ever be subject to double buffering.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If the underlying filesystem times out, then we want knfsd to return
NFSERR_JUKEBOX/DELAY rather than NFSERR_STALE.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It's not uncommon for some workloads to do a bunch of I/O to a file and
delete it just afterward. If knfsd has a cached open file however, then
the file may still be open when the dentry is unlinked. If the
underlying filesystem is nfs, then that could trigger it to do a
sillyrename.
On a REMOVE or RENAME scan the nfsd_file cache for open files that
correspond to the inode, and proactively unhash and put their
references. This should prevent any delete-on-last-close activity from
occurring, solely due to knfsd's open file cache.
This must be done synchronously though so we use the variants that call
flush_delayed_fput. There are deadlock possibilities if you call
flush_delayed_fput while holding locks, however. In the case of
nfsd_rename, we don't even do the lookups of the dentries to be renamed
until we've locked for rename.
Once we've figured out what the target dentry is for a rename, check to
see whether there are cached open files associated with it. If there
are, then unwind all of the locking, close them all, and then reattempt
the rename.
None of this is really necessary for "typical" filesystems though. It's
mostly of use for NFS, so declare a new export op flag and use that to
determine whether to close the files beforehand.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When we start allowing NFS to be reexported, then we have some problems
when it comes to subtree checking. In principle, we could allow it, but
it would mean encoding parent info in the filehandles and there may not
be enough space for that in a NFSv3 filehandle.
To enforce this at export upcall time, we add a new export_ops flag
that declares the filesystem ineligible for subtree checking.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
With NFSv3 nfsd will always attempt to send along WCC data to the
client. This generally involves saving off the in-core inode information
prior to doing the operation on the given filehandle, and then issuing a
vfs_getattr to it after the op.
Some filesystems (particularly clustered or networked ones) have an
expensive ->getattr inode operation. Atomicity is also often difficult
or impossible to guarantee on such filesystems. For those, we're best
off not trying to provide WCC information to the client at all, and to
simply allow it to poll for that information as needed with a GETATTR
RPC.
This patch adds a new flags field to struct export_operations, and
defines a new EXPORT_OP_NOWCC flag that filesystems can use to indicate
that nfsd should not attempt to provide WCC info in NFSv3 replies. It
also adds a blurb about the new flags field and flag to the exporting
documentation.
The server will also now skip collecting this information for NFSv2 as
well, since that info is never used there anyway.
Note that this patch does not add this flag to any filesystem
export_operations structures. This was originally developed to allow
reexporting nfs via nfsd.
Other filesystems may want to consider enabling this flag too. It's hard
to tell however which ones have export operations to enable export via
knfsd and which ones mostly rely on them for open-by-filehandle support,
so I'm leaving that up to the individual maintainers to decide. I am
cc'ing the relevant lists for those filesystems that I think may want to
consider adding this though.
Cc: HPDD-discuss@lists.01.org
Cc: ceph-devel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: fuse-devel@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This reverts commit a85857633b.
We're still factoring ctime into our change attribute even in the
IS_I_VERSION case. If someone sets the system time backwards, a client
could see the change attribute go backwards. Maybe we can just say
"well, don't do that", but there's some question whether that's good
enough, or whether we need a better guarantee.
Also, the client still isn't actually using the attribute.
While we're still figuring this out, let's just stop returning this
attribute.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
inode_query_iversion() has side effects, and there's no point calling it
when we're not even going to use it.
We check whether we're currently processing a v4 request by checking
fh_maxsize, which is arguably a little hacky; we could add a flag to
svc_fh instead.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Minor cleanup, no change in behavior.
Also pull out a common helper that'll be useful elsewhere.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It doesn't make sense to carry all these extra fields around. Just
make everything into change attribute from the start.
This is just cleanup, there should be no change in behavior.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
inode_query_iversion() can modify i_version. Depending on the exported
filesystem, that may not be safe. For example, if you're re-exporting
NFS, NFS stores the server's change attribute in i_version and does not
expect it to be modified locally. This has been observed causing
unnecessary cache invalidations.
The way a filesystem indicates that it's OK to call
inode_query_iverson() is by setting SB_I_VERSION.
So, move the I_VERSION check out of encode_change(), where it's used
only in GETATTR responses, to nfsd4_change_attribute(), which is
also called for pre- and post- operation attributes.
(Note we could also pull the NFSEXP_V4ROOT case into
nfsd4_change_attribute() as well. That would actually be a no-op,
since pre/post attrs are only used for metadata-modifying operations,
and V4ROOT exports are read-only. But we might make the change in
the future just for simplicity.)
Reported-by: Daire Byrne <daire@dneg.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since commit b4868b44c5 ("NFSv4: Wait for stateid updates after
CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers 5
seconds delay regardless of the size of the copy. The delay is from
nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
fails because the seqid in both nfs4_state and nfs4_stateid are 0.
Fix by modifying nfs4_init_cp_state to return the stateid with seqid 1
instead of 0. This is also to conform with section 4.8 of RFC 7862.
Here is the relevant paragraph from section 4.8 of RFC 7862:
A copy offload stateid's seqid MUST NOT be zero. In the context of a
copy offload operation, it is inappropriate to indicate "the most
recent copy offload operation" using a stateid with a seqid of zero
(see Section 8.2.2 of [RFC5661]). It is inappropriate because the
stateid refers to internal state in the server and there may be
several asynchronous COPY operations being performed in parallel on
the same file by the server. Therefore, a copy offload stateid with
a seqid of zero MUST be considered invalid.
Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
linux/fs/nfsd/nfs4proc.c:1542:24: warning: incorrect type in assignment (different base types)
linux/fs/nfsd/nfs4proc.c:1542:24: expected restricted __be32 [assigned] [usertype] status
linux/fs/nfsd/nfs4proc.c:1542:24: got int
Clean-up: The dup_copy_fields() function returns only zero, so make
it return void for now, and get rid of the return code check.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The warning message from nfsd terminating normally
can confuse system adminstrators or monitoring software.
Though it's not exactly fair to pin-point a commit where it
originated, the current form in the current place started
to appear in:
Fixes: e096bbc648 ("knfsd: remove special handling for SIGHUP")
Signed-off-by: kazuo ito <kzpn200@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The handle_inode_event() interface was added as (quoting comment):
"a simple variant of handle_event() for groups that only have inode
marks and don't have ignore mask".
In other words, all backends except fanotify. The inotify backend
also falls under this category, but because it required extra arguments
it was left out of the initial pass of backends conversion to the
simple interface.
This results in code duplication between the generic helper
fsnotify_handle_event() and the inotify_handle_event() callback
which also happen to be buggy code.
Generalize the handle_inode_event() arguments and add the check for
FS_EXCL_UNLINK flag to the generic helper, so inotify backend could
be converted to use the simple interface.
Link: https://lore.kernel.org/r/20201202120713.702387-2-amir73il@gmail.com
CC: stable@vger.kernel.org
Fixes: b9a1b97725 ("fsnotify: create method handle_inode_event() in fsnotify_operations")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Now that all the NFSv4 decoder functions have been converted to
make direct calls to the xdr helpers, remove the unused C macros.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
And clean-up: Now that we have removed the DECODE_TAIL macro from
nfsd4_decode_compound(), we observe that there's no benefit for
nfsd4_decode_compound() to return nfs_ok or nfserr_bad_xdr only to
have its sole caller convert those values to one or zero,
respectively. Have nfsd4_decode_compound() return 1/0 instead.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor for clarity.
Also, remove a stale comment. Commit ed94164398 ("nfsd: implement
machine credential support for some operations") added support for
SP4_MACH_CRED, so state_protect_a is no longer completely ignored.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>