Commit Graph

6583 Commits

Author SHA1 Message Date
Benjamin Coddington
5b4a82a072 Revert "NFSv4: Retry LOCK on OLD_STATEID during delegation return"
Olga Kornievskaia reports that this patch breaks NFSv4.0 state recovery.
It also introduces additional complexity in the error paths for cases not
related to the original problem.  Let's revert it for now, and address the
original problem in another manner.

This reverts commit f5ea16137a.

Fixes: f5ea16137a ("NFSv4: Retry LOCK on OLD_STATEID during delegation return")
Reported-by: Kornievskaia, Olga <Olga.Kornievskaia@netapp.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-29 14:25:35 -04:00
Benjamin Coddington
e901f17b07 NFS: Don't cleanup sysfs superblock entry if uninitialized
Its possible to end up in nfs_free_server() before the server's superblock
sysfs entry has been initialized, in which case calling kobject_put() will
emit a WARNING.  Check if the kobject has been initialized before cleaning
it up.

Fixes: 1c7251187d ("NFS: add superblock sysfs entries")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-29 14:25:35 -04:00
Jeff Layton
cded49ba36 nfs: don't report STATX_BTIME in ->getattr
NFS doesn't properly support reporting the btime in getattr (yet), but
61a968b4f0 mistakenly added it to the request_mask. This causes statx
for STATX_BTIME to report a zeroed out btime instead of properly
clearing the flag.

Cc: stable@vger.kernel.org # v6.3+
Fixes: 61a968b4f0 ("nfs: report the inode version in getattr if requested")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2214134
Reported-by: Boyang Xue <bxue@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 17:16:43 -04:00
Olga Kornievskaia
c907e72f58 NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION
When the client received NFS4ERR_BADSESSION, it schedules recovery
and start the state manager thread which in turn freezes the
session table and does not allow for any new requests to use the
no-longer valid session. However, it is possible that before
the state manager thread runs, a new operation would use the
released slot that received BADSESSION and was therefore not
updated its sequence number. Such re-use of the slot can lead
the application errors.

Fixes: 5c441544f0 ("NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process()")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:10:57 -04:00
Qi Zheng
7f7ab33689 NFSv4.2: fix wrong shrinker_id
Currently, the list_lru::shrinker_id corresponding to the nfs4_xattr
shrinkers is wrong:

>>> prog["nfs4_xattr_cache_lru"].shrinker_id
(int)0
>>> prog["nfs4_xattr_entry_lru"].shrinker_id
(int)0
>>> prog["nfs4_xattr_large_entry_lru"].shrinker_id
(int)0
>>> prog["nfs4_xattr_cache_shrinker"].id
(int)18
>>> prog["nfs4_xattr_entry_shrinker"].id
(int)19
>>> prog["nfs4_xattr_large_entry_shrinker"].id
(int)20

This is not what we expect, which will cause these shrinkers
not to be found in shrink_slab_memcg().

We should assign shrinker::id before calling list_lru_init_memcg(),
so that the corresponding list_lru::shrinker_id will be assigned
the correct value like below:

>>> prog["nfs4_xattr_cache_lru"].shrinker_id
(int)16
>>> prog["nfs4_xattr_entry_lru"].shrinker_id
(int)17
>>> prog["nfs4_xattr_large_entry_lru"].shrinker_id
(int)18
>>> prog["nfs4_xattr_cache_shrinker"].id
(int)16
>>> prog["nfs4_xattr_entry_shrinker"].id
(int)17
>>> prog["nfs4_xattr_large_entry_shrinker"].id
(int)18

So just do it.

Fixes: 95ad37f90c ("NFSv4.2: add client side xattr caching.")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:10:45 -04:00
Benjamin Coddington
6ad477a69a NFSv4: Clean up some shutdown loops
If a SEQUENCE call receives -EIO for a shutdown client, it will retry the
RPC call.  Instead of doing that for a shutdown client, just bail out.

Likewise, if the state manager decides to perform recovery for a shutdown
client, it will continuously retry.  As above, just bail out.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:09:25 -04:00
Benjamin Coddington
7d3e26a054 NFS: Cancel all existing RPC tasks when shutdown
Walk existing RPC tasks and cancel them with -EIO when the client is
shutdown.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:08:46 -04:00
Benjamin Coddington
d9615d166c NFS: add sysfs shutdown knob
Within each nfs_server sysfs tree, add an entry named "shutdown".  Writing
1 to this file will set the cl_shutdown bit on the rpc_clnt structs
associated with that mount.  If cl_shutdown is set, the task scheduler
immediately returns -EIO for new tasks.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:08:12 -04:00
Benjamin Coddington
f4057ffd0e NFS: add a sysfs link to the acl rpc_client
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:06:40 -04:00
Benjamin Coddington
d97c058977 NFS: add a sysfs link to the lockd rpc_client
After lockd is started, add a symlink for lockd's rpc_client under
NFS' superblock sysfs.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:06:07 -04:00
Benjamin Coddington
e13b549319 NFS: Add sysfs links to sunrpc clients for nfs_clients
For the general and state management nfs_client under each mount, create
symlinks to their respective rpc_client sysfs entries.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:04:13 -04:00
Benjamin Coddington
1c7251187d NFS: add superblock sysfs entries
Create a sysfs directory for each mount that corresponds to the mount's
nfs_server struct.  As the mount is being constructed, use the name
"server-n", but rename it to the "MAJOR:MINOR" of the mount after assigning
a device_id. The rename approach allows us to populate the mount's directory
with links to the various rpc_client objects during the mount's
construction.  The naming convention (MAJOR:MINOR) can be used to reference
a particular NFS mount's sysfs tree.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 15:03:48 -04:00
Benjamin Coddington
e96f9268ee NFS: Make all of /sys/fs/nfs network-namespace unique
Expand the NFS network-namespaced sysfs from /sys/fs/nfs/net down one level
into /sys/fs/nfs by moving the "net" kobject onto struct
nfs_netns_client and setting it up during network namespace init.

This prepares the way for superblock kobjects within /sys/fs/nfs that will
only be visible to matching network namespaces.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 14:59:51 -04:00
Benjamin Coddington
943aef2dbc NFS: Open-code the nfs_kset kset_create_and_add()
In preparation to make objects below /sys/fs/nfs namespace aware, we need
to define our own kobj_type for the nfs kset so that we can add the
.child_ns_type member in a following patch.  No functional change here, only
the unrolling of kset_create_and_add().

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 14:59:07 -04:00
Benjamin Coddington
d5082ace6c NFS: rename nfs_client_kobj to nfs_net_kobj
Match the variable names to the sysfs structure.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 14:59:07 -04:00
Benjamin Coddington
8b18a2edec NFS: rename nfs_client_kset to nfs_kset
Be brief and match the subsystem name.  There's no need to distinguish this
kset variable from the server.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 14:56:45 -04:00
Chuck Lever
c8407f2e56 NFS: Add an "xprtsec=" NFS mount option
After some discussion, we decided that controlling transport layer
security policy should be separate from the setting for the user
authentication flavor. To accomplish this, add a new NFS mount
option to select a transport layer security policy for RPC
operations associated with the mount point.

  xprtsec=none     - Transport layer security is forced off.

  xprtsec=tls      - Establish an encryption-only TLS session. If
                     the initial handshake fails, the mount fails.
                     If TLS is not available on a reconnect, drop
                     the connection and try again.

  xprtsec=mtls     - Both sides authenticate and an encrypted
                     session is created. If the initial handshake
                     fails, the mount fails. If TLS is not available
                     on a reconnect, drop the connection and try
                     again.

To support client peer authentication (mtls), the handshake daemon
will have configurable default authentication material (certificate
or pre-shared key). In the future, mount options can be added that
can provide this material on a per-mount basis.

Updates to mount.nfs (to support xprtsec=auto) and nfs(5) will be
sent under separate cover.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:30:17 -04:00
Chuck Lever
6c0a8c5fcf NFS: Have struct nfs_client carry a TLS policy field
The new field is used to match struct nfs_clients that have the same
TLS policy setting.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:29:23 -04:00
Chuck Lever
9e8ab85a7e NFS: Improvements for fs_context-related tracepoints
Add some missing observability to the fs_context tracepoints
added by commit 33ce83ef0b ("NFS: Replace fs_context-related
dprintk() call sites with tracepoints").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:15:23 -04:00
Anna Schumaker
86e2e1f6d9 NFSv4.2: SETXATTR should update ctime
Otherwise, `stat` will report a stale value to users.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:10:48 -04:00
Anna Schumaker
64edd55d0f NFSv4.2: Clean up xattr size macros
Fold them into the other NFS v4.2 operations in the right spots and
adjust spacing to keep the same style.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:09:50 -04:00
Anna Schumaker
d594097367 NFSv4.2: Clean up nfs4_xdr_dec_*xattr() functions
I add commends above each function to match the style of the other
nfs4_xdr_dec_*() functions. I also remove the unnecessary #ifdef
CONFIG_NFS_V4_2 that was added around this code, since we are already in
a v4.2-only file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:09:07 -04:00
Anna Schumaker
31f1bd8f89 NFSv4.2: Clean up: Move nfs4_xdr_enc_*xattr() functions
They should be in the nfs4_xdr_enc_*() section, and not at the bottom of
the file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:08:40 -04:00
Anna Schumaker
04b4c9fb07 NFSv4.2: Clean up: move decode_*xattr() functions
Move them out of the encode_*() section and into the decode_*() section
where it belongs.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:08:20 -04:00
Anna Schumaker
fd42ba8223 NFSv4.2: Clean up: Move the encode_copy_commit() function
Move the function to be with the other encode_*() functions, instead of
in the middle of the nfs4_xdr_enc_*() section.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-06-19 12:07:54 -04:00
Anna Schumaker
43439d858b NFSv4.2: Fix a potential double free with READ_PLUS
kfree()-ing the scratch page isn't enough, we also need to set the pointer
back to NULL to avoid a double-free in the case of a resend.

Fixes: fbd2a05f29 (NFSv4.2: Rework scratch handling for READ_PLUS)
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-05-19 17:11:59 -04:00
Fabio M. De Francesco
4b71e2416e NFS: Convert kmap_atomic() to kmap_local_folio()
kmap_atomic() is deprecated in favor of kmap_local_{folio,page}().

Therefore, replace kmap_atomic() with kmap_local_folio() in
nfs_readdir_folio_array_append().

kmap_atomic() disables page-faults and preemption (the latter only for
!PREEMPT_RT kernels), However, the code within the mapping/un-mapping in
nfs_readdir_folio_array_append() does not depend on the above-mentioned
side effects.

Therefore, a mere replacement of the old API with the new one is all that
is required (i.e., there is no need to explicitly add any calls to
pagefault_disable() and/or preempt_disable()).

Tested with (x)fstests in a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.

Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Fixes: ec108d3cc7 ("NFS: Convert readdir page array functions to use a folio")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-05-19 16:50:05 -04:00
Linus Torvalds
16a8829130 nfs: fix another case of NULL/IS_ERR confusion wrt folio pointers
Dan has been improving on the smatch error pointer checks, and pointed
at another case where the __filemap_get_folio() conversion to error
pointers had been overlooked.  This time because it was hidden behind
the filemap_grab_folio() helper function that is a wrapper around it.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-09 10:22:13 -07:00
Linus Torvalds
994e2419f1 nfs: fix mis-merged __filemap_get_folio() error check
Fix another case of an incorrect check for the returned 'folio' value
from __filemap_get_folio().

The failure case used to return NULL, but was changed by commit
66dabbb65d ("mm: return an ERR_PTR from __filemap_get_folio").

But in the meantime, commit ec108d3cc7 ("NFS: Convert readdir page
array functions to use a folio") added a new user of that function.

And my merge of the two did not fix this up correctly.

The ext4 merge had the same issue, but that one had been caught in
linux-next and got properly fixed while merging.

Fixes: 0127f25b5d ("Merge tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs")
Cc: Anna Schumaker <anna@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-06 10:16:19 -07:00
Linus Torvalds
4e1c80ae5c NFSD 6.4 Release Notes
The big ticket item for this release is support for RPC-with-TLS
 [RFC 9289] has been added to the Linux NFS server. The goal is to
 provide a simple-to-deploy, low-overhead in-transit confidentiality
 and peer authentication mechanism. It can supplement NFS Kerberos
 and it can protect the use of legacy non-cryptographic user
 authentication flavors such as AUTH_SYS. The TLS Record protocol is
 handled entirely by kTLS, meaning it can use either software
 encryption or offload encryption to smart NICs.
 
 Work continues on improving NFSD's open file cache. Among the many
 clean-ups in that area is a patch to convert the rhashtable to use
 the list-hashing version of that data structure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRK/JMACgkQM2qzM29m
 f5cF5A/+JZFRSPlfSYt0YHzUQQSDdYn5n/IG9TwJQd62xheu083WuKRaCOYYoOhg
 06nZd6p7nuF1E0n2ZWOKSE6YkBSE6z4M6KrQlm6lCe/nmxYCR87IYfJCXuL+Yf0e
 /LdL4OTvDHzY5ec1DreERldPIUJ8GFzwChH8/z4XwbNDR7qJkF/gf8YxpFr+8K+j
 Cfyl8woZiEze/Nvxy1YtAqa7HMEpitt0aWJN55rHwTh9c3b0nmDzziYFcVqXgybJ
 /qUHfHBak66ll8RqhcQ3BMuyfszwASERbPsaZ2a5F/RaxLL5ZWfFyhgQwm+PZWT+
 J5DdSBwLEQYtKQGD41A1aorP6v/u2QelfWrl4S7/qjRpREp8Ba2IU4fYLjGb1499
 Imk68BA7NwFp87tdMi/7en1VVgina4U/S3X71aUYWe+C0g48BfTrVwq4SVbQSAo4
 1638vbZnrJbsJMr9OaaysKWfv4KZB020Ji1KVwuqmgy5F8kdfJCCQ2UR/fHuJ3DY
 R0Zrd1Ryjwr83viP+Xj0ERiW405gPdCT0RJqoA7rznRPCqT5M42tf5z65uO7iZeE
 C1udgDaoQOtioKlem6FcDXLkryf986slGA7V91lat/Jt8A5jLKQfjVe3Q+kaaqXP
 ka1DQnYelzMzILQQs39cqW5pShrH8e3tfRZ7JhdBgrpxVXz9ZZM=
 =lA2+
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "The big ticket item for this release is that support for RPC-with-TLS
  [RFC 9289] has been added to the Linux NFS server.

  The goal is to provide a simple-to-deploy, low-overhead in-transit
  confidentiality and peer authentication mechanism. It can supplement
  NFS Kerberos and it can protect the use of legacy non-cryptographic
  user authentication flavors such as AUTH_SYS. The TLS Record protocol
  is handled entirely by kTLS, meaning it can use either software
  encryption or offload encryption to smart NICs.

  Aside from that, work continues on improving NFSD's open file cache.
  Among the many clean-ups in that area is a patch to convert the
  rhashtable to use the list-hashing version of that data structure"

* tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
  NFSD: Handle new xprtsec= export option
  SUNRPC: Support TLS handshake in the server-side TCP socket code
  NFSD: Clean up xattr memory allocation flags
  NFSD: Fix problem of COMMIT and NFS4ERR_DELAY in infinite loop
  SUNRPC: Clear rq_xid when receiving a new RPC Call
  SUNRPC: Recognize control messages in server-side TCP socket code
  SUNRPC: Be even lazier about releasing pages
  SUNRPC: Convert svc_xprt_release() to the release_pages() API
  SUNRPC: Relocate svc_free_res_pages()
  nfsd: simplify the delayed disposal list code
  SUNRPC: Ignore return value of ->xpo_sendto
  SUNRPC: Ensure server-side sockets have a sock->file
  NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor()
  sunrpc: simplify two-level sysctl registration for svcrdma_parm_table
  SUNRPC: return proper error from get_expiry()
  lockd: add some client-side tracepoints
  nfs: move nfs_fhandle_hash to common include file
  lockd: server should unlock lock if client rejects the grant
  lockd: fix races in client GRANTED_MSG wait logic
  lockd: move struct nlm_wait to lockd.h
  ...
2023-04-29 11:04:14 -07:00
Linus Torvalds
0127f25b5d NFS Client Updates for Linux 6.4
New Features:
   * Convert the readdir path to use folios
   * Convert the NFS fscache code to use netfs
 
 Bugfixes and Cleanups:
   * Always send a RECLAIM_COMPLETE after establishing a lease
   * Simplify sysctl registrations and other cleanups
   * Handle out-of-order write replies on NFS v3
   * Have sunrpc call_bind_status use standard hard/soft task semantics
   * Other minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmRMI04ACgkQ18tUv7Cl
 QOuCNQ//SkQm8aOM4DkYFeDIObye6xMzgtWrB25grYNG4a/DcYqb5kNcbmI5l1tE
 Tus8KMZAWSpwa0m8ALctzp+pZQWQkY/svsqqHrKIGUHBI8F0OinVCqc2MzNN75WX
 m/1wELW6ek9RBL5BoJtAPt+Qu8/jP6KD64Zot7snBeUrzreaZDcz0HM+EcQhi7X7
 qd5XS0/cA2eLEBBQcQdFpRhHvgW12BMYM/zp3/ER5H52L2iAlZunGWw+Nqs8ueOR
 D7K2+CF1sV1k6hYbLWNoaF2J6PZr5dRpc6gSq4fLP4WUKjqQwmQp8cm9iLpf6jGa
 a+Y7t8aj7vup8jVCVGWYWZA2G2gi6jWmxxWudkJwfAa1E45t1B4/C0udwlxR20OO
 XI2Bhe5YwTURgSOvOS9QTZJpQN4qfpEL0NoAmAT5fAHBQ2CXDrMlSIxPS7U6LO9q
 YqwIHcAHvYVnbD45IUh2Zjbp65mRb1VkU6WzOyK1/sNHEyYpubIWXB/yLaA3oGge
 V3xUgvlTzLVzzyQfwiRfzAD1P5/USaXE/B36c4itfCr5rJnAfsiBP3gk0o9yq18J
 3Yb6olrmc9CzeA7PN88uEus4VZHbaE9OktRFIjJ22jlLQEY4xougdE5asY1XX8F+
 OKLLLeeCrsbvrANB9XcLVsLqdMYvsd0VaCX9HtN3UP+7Lod5T10=
 =gpBC
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:

   - Convert the readdir path to use folios

   - Convert the NFS fscache code to use netfs

  Bugfixes and Cleanups:

   - Always send a RECLAIM_COMPLETE after establishing a lease

   - Simplify sysctl registrations and other cleanups

   - Handle out-of-order write replies on NFS v3

   - Have sunrpc call_bind_status use standard hard/soft task semantics

   - Other minor cleanups"

* tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.2: Rework scratch handling for READ_PLUS
  NFS: Cleanup unused rpc_clnt variable
  NFS: set varaiable nfs_netfs_debug_id storage-class-specifier to static
  SUNRPC: remove the maximum number of retries in call_bind_status
  NFS: Convert readdir page array functions to use a folio
  NFS: Convert the readdir array-of-pages into an array-of-folios
  NFSv3: handle out-of-order write replies.
  NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit
  NFS: Remove all NFSIOS_FSCACHE counters due to conversion to netfs API
  NFS: Convert buffered read paths to use netfs when fscache is enabled
  NFS: Configure support for netfs when NFS fscache is configured
  NFS: Rename readpage_async_filler to nfs_read_add_folio
  sunrpc: simplify one-level sysctl registration for debug_table
  sunrpc: move sunrpc_table and proc routines above
  sunrpc: simplify one-level sysctl registration for xs_tunables_table
  sunrpc: simplify one-level sysctl registration for xr_tunables_table
  nfs: simplify two-level sysctl registration for nfs_cb_sysctls
  nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
  lockd: simplify two-level sysctl registration for nlm_sysctls
  NFSv4.1: Always send a RECLAIM_COMPLETE after establishing lease
2023-04-29 10:58:44 -07:00
Anna Schumaker
fbd2a05f29 NFSv4.2: Rework scratch handling for READ_PLUS
Instead of using a tiny, static scratch buffer, we should use a kmalloc()-ed
buffer that is allocated when checking for read plus usage. This lets us
use the buffer before decoding any part of the READ_PLUS operation
instead of setting it right before segment decoding, meaning it should
be a little more robust.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-28 15:48:45 -04:00
Linus Torvalds
33afd4b763 Mainly singleton patches all over the place. Series of note are:
- updates to scripts/gdb from Glenn Washburn
 
 - kexec cleanups from Bjorn Helgaas
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr+6wAKCRDdBJ7gKXxA
 jn4NAP4u/hj/kR2dxYehcVLuQqJspCRZZBZlAReFJyHNQO6voAEAk0NN9rtG2+/E
 r0G29CJhK+YL0W6mOs8O1yo9J1rZnAM=
 =2CUV
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Mainly singleton patches all over the place.

  Series of note are:

   - updates to scripts/gdb from Glenn Washburn

   - kexec cleanups from Bjorn Helgaas"

* tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (50 commits)
  mailmap: add entries for Paul Mackerras
  libgcc: add forward declarations for generic library routines
  mailmap: add entry for Oleksandr
  ocfs2: reduce ioctl stack usage
  fs/proc: add Kthread flag to /proc/$pid/status
  ia64: fix an addr to taddr in huge_pte_offset()
  checkpatch: introduce proper bindings license check
  epoll: rename global epmutex
  scripts/gdb: add GDB convenience functions $lx_dentry_name() and $lx_i_dentry()
  scripts/gdb: create linux/vfs.py for VFS related GDB helpers
  uapi/linux/const.h: prefer ISO-friendly __typeof__
  delayacct: track delays from IRQ/SOFTIRQ
  scripts/gdb: timerlist: convert int chunks to str
  scripts/gdb: print interrupts
  scripts/gdb: raise error with reduced debugging information
  scripts/gdb: add a Radix Tree Parser
  lib/rbtree: use '+' instead of '|' for setting color.
  proc/stat: remove arch_idle_time()
  checkpatch: check for misuse of the link tags
  checkpatch: allow Closes tags with links
  ...
2023-04-27 19:57:00 -07:00
Linus Torvalds
7fa8a8ee94 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
 
 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
 
 - zsmalloc performance improvements from Sergey Senozhatsky.
 
 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.
 
 - VFS rationalizations from Christoph Hellwig:
 
   - removal of most of the callers of write_one_page().
 
   - make __filemap_get_folio()'s return value more useful
 
 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing.  Use `mount -o noswap'.
 
 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.
 
 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).
 
 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.
 
 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
   than exclusive, and has fixed a bunch of errors which were caused by its
   unintuitive meaning.
 
 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.
 
 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.
 
 - Cleanups to do_fault_around() by Lorenzo Stoakes.
 
 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.
 
 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.
 
 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.
 
 - Lorenzo has also contributed further cleanups of vma_merge().
 
 - Chaitanya Prakash provides some fixes to the mmap selftesting code.
 
 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.
 
 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.
 
 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.
 
 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.
 
 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.
 
 - Yosry Ahmed has improved the performance of memcg statistics flushing.
 
 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.
 
 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.
 
 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.
 
 - Pankaj Raghav has made page_endio() unneeded and has removed it.
 
 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.
 
 - Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
 
 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.
 
 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.
 
 - Peter Xu fixes a few issues with uffd-wp at fork() time.
 
 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
 jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
 eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
 =J+Dj
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
   switching from a user process to a kernel thread.

 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
   Raghav.

 - zsmalloc performance improvements from Sergey Senozhatsky.

 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.

 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful

 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing. Use `mount -o noswap'.

 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.

 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).

 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.

 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
   rather than exclusive, and has fixed a bunch of errors which were
   caused by its unintuitive meaning.

 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.

 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.

 - Cleanups to do_fault_around() by Lorenzo Stoakes.

 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.

 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.

 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.

 - Lorenzo has also contributed further cleanups of vma_merge().

 - Chaitanya Prakash provides some fixes to the mmap selftesting code.

 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.

 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.

 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.

 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.

 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.

 - Yosry Ahmed has improved the performance of memcg statistics
   flushing.

 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.

 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.

 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.

 - Pankaj Raghav has made page_endio() unneeded and has removed it.

 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.

 - Yosry Ahmed has fixed an issue around memcg's page recalim
   accounting.

 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.

 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.

 - Peter Xu fixes a few issues with uffd-wp at fork() time.

 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
2023-04-27 19:42:02 -07:00
Linus Torvalds
888d3c9f7f sysctl-6.4-rc1
This pull request goes with only a few sysctl moves from the
 kernel/sysctl.c file, the rest of the work has been put towards
 deprecating two API calls which incur recursion and prevent us
 from simplifying the registration process / saving memory per
 move. Most of the changes have been soaking on linux-next since
 v6.3-rc3.
 
 I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's
 feedback that we should see if we could *save* memory with these
 moves instead of incurring more memory. We currently incur more
 memory since when we move a syctl from kernel/sysclt.c out to its
 own file we end up having to add a new empty sysctl used to register
 it. To achieve saving memory we want to allow syctls to be passed
 without requiring the end element being empty, and just have our
 registration process rely on ARRAY_SIZE(). Without this, supporting
 both styles of sysctls would make the sysctl registration pretty
 brittle, hard to read and maintain as can be seen from Meng Tang's
 efforts to do just this [0]. Fortunately, in order to use ARRAY_SIZE()
 for all sysctl registrations also implies doing the work to deprecate
 two API calls which use recursion in order to support sysctl
 declarations with subdirectories.
 
 And so during this development cycle quite a bit of effort went into
 this deprecation effort. I've annotated the following two APIs are
 deprecated and in few kernel releases we should be good to remove them:
 
   * register_sysctl_table()
   * register_sysctl_paths()
 
 During this merge window we should be able to deprecate and unexport
 register_sysctl_paths(), we can probably do that towards the end
 of this merge window.
 
 Deprecating register_sysctl_table() will take a bit more time but
 this pull request goes with a few example of how to do this.
 
 As it turns out each of the conversions to move away from either of
 these two API calls *also* saves memory. And so long term, all these
 changes *will* prove to have saved a bit of memory on boot.
 
 The way I see it then is if remove a user of one deprecated call, it
 gives us enough savings to move one kernel/sysctl.c out from the
 generic arrays as we end up with about the same amount of bytes.
 
 Since deprecating register_sysctl_table() and register_sysctl_paths()
 does not require maintainer coordination except the final unexport
 you'll see quite a bit of these changes from other pull requests, I've
 just kept the stragglers after rc3.
 
 Most of these changes have been soaking on linux-next since around rc3.
 
 [0] https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRHAjQSHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinTzgQAI/uKHKi0VlUR1l2Psl0XbseUVueuyj3
 ZDxSJpbVUmsoDf2MlLjzB8mYE3ricnNTDbLr7qOyA6pXdM1N0mY5LQmRVRu8/ffd
 2T1hQ5pl7YnJdWP5dPhcF9Y+jnu1tjX1MW5DS4fzllwK7FnD86HuIruGq52RAPS/
 /FH+BD9eodLWWXk6A/o2GFqoWxPKQI0GLxEYWa7Hg7yt8E/3PQL9QsRzn8i6U+HW
 BrN/+G3YD1VCCzXu0UAeXnm+i1Z7CdvqNdZuSkvE3DObiZ5WpOS+/i7FrDB7zdiu
 zAbHaifHnDPtcK3w2ZodbLAAwEWD/mG4iwIjE2kgIMVYxBv7TFDBRREXAWYAevIT
 UUuZnWDQsGaWdjywrebaUycEfd6dytKyan0fTXgMFkcoWRjejhitfdM2iZDdQROg
 q453p4HqOw4vTrhy4ov4zOX7J3EFiBzpZdl+SmLqcXk+jbLVb/Q9snUWz1AFtHBl
 gHoP5bS82uVktGG3MsObjgTzYYMQjO9YGIrVuW1VP9uWs8WaoWx6M9FQJIIhtwE+
 h6wG2s7CjuFWnS0/IxWmDOn91QyUn1w7ohiz9TuvYj/5GLSBpBDGCJHsNB5T2WS1
 qbQRaZ2Kg3j9TeyWfXxdlxBx7bt3ni+J/IXDY0zom2sTpGHKl8D2g5AzmEXJDTpl
 kd7Z3gsmwhDh
 =0U0W
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
 "This only does a few sysctl moves from the kernel/sysctl.c file, the
  rest of the work has been put towards deprecating two API calls which
  incur recursion and prevent us from simplifying the registration
  process / saving memory per move. Most of the changes have been
  soaking on linux-next since v6.3-rc3.

  I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's
  feedback that we should see if we could *save* memory with these moves
  instead of incurring more memory. We currently incur more memory since
  when we move a syctl from kernel/sysclt.c out to its own file we end
  up having to add a new empty sysctl used to register it. To achieve
  saving memory we want to allow syctls to be passed without requiring
  the end element being empty, and just have our registration process
  rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls
  would make the sysctl registration pretty brittle, hard to read and
  maintain as can be seen from Meng Tang's efforts to do just this [0].
  Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations
  also implies doing the work to deprecate two API calls which use
  recursion in order to support sysctl declarations with subdirectories.

  And so during this development cycle quite a bit of effort went into
  this deprecation effort. I've annotated the following two APIs are
  deprecated and in few kernel releases we should be good to remove
  them:

   - register_sysctl_table()
   - register_sysctl_paths()

  During this merge window we should be able to deprecate and unexport
  register_sysctl_paths(), we can probably do that towards the end of
  this merge window.

  Deprecating register_sysctl_table() will take a bit more time but this
  pull request goes with a few example of how to do this.

  As it turns out each of the conversions to move away from either of
  these two API calls *also* saves memory. And so long term, all these
  changes *will* prove to have saved a bit of memory on boot.

  The way I see it then is if remove a user of one deprecated call, it
  gives us enough savings to move one kernel/sysctl.c out from the
  generic arrays as we end up with about the same amount of bytes.

  Since deprecating register_sysctl_table() and register_sysctl_paths()
  does not require maintainer coordination except the final unexport
  you'll see quite a bit of these changes from other pull requests, I've
  just kept the stragglers after rc3"

Link: https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org [0]

* tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (29 commits)
  fs: fix sysctls.c built
  mm: compaction: remove incorrect #ifdef checks
  mm: compaction: move compaction sysctl to its own file
  mm: memory-failure: Move memory failure sysctls to its own file
  arm: simplify two-level sysctl registration for ctl_isa_vars
  ia64: simplify one-level sysctl registration for kdump_ctl_table
  utsname: simplify one-level sysctl registration for uts_kern_table
  ntfs: simplfy one-level sysctl registration for ntfs_sysctls
  coda: simplify one-level sysctl registration for coda_table
  fs/cachefiles: simplify one-level sysctl registration for cachefiles_sysctls
  xfs: simplify two-level sysctl registration for xfs_table
  nfs: simplify two-level sysctl registration for nfs_cb_sysctls
  nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
  lockd: simplify two-level sysctl registration for nlm_sysctls
  proc_sysctl: enhance documentation
  xen: simplify sysctl registration for balloon
  md: simplify sysctl registration
  hv: simplify sysctl registration
  scsi: simplify sysctl registration with register_sysctl()
  csky: simplify alignment sysctl registration
  ...
2023-04-27 16:52:33 -07:00
Linus Torvalds
0cfcde1faf There are a number of major cleanups in ext4 this cycle:
* The data=journal writepath has been significantly cleaned up and
   simplified, and reduces a large number of data=journal special cases
   by Jan Kara.
 
 * Ojaswin Muhoo has replaced linked list used to track extents that
   have been used for inode preallocation with a red-black tree in the
   multi-block allocator.  This improves performance for workloads
   which do a large number of random allocating writes.
 
 * Thanks to Kemeng Shi for a lot of cleanup and bug fixes in the
   multi-block allocator.
 
 * Matthew wilcox has converted the code paths for reading and writing
   ext4 pages to use folios.
 
 * Jason Yan has continued to factor out ext4_fill_super() into smaller
   functions for improve ease of maintenance and comprehension.
 
 * Josh Triplett has created an uapi header for ext4 userspace API's.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmRHS3IACgkQ8vlZVpUN
 gaNN7AgAnFiWfk4UqKpBsUL5iQKJgf2K4tjlNXgPd6ghNns0IdFEyeWSHhr6KLv/
 SQeoMMyiWaUcTvZs9DokD8U/9M1ELPUiE9W5c9GxJjM86SXp8BlLYSZTiRoNHzGJ
 noQpvikj4qTRviK0rA3q5ICTP2eh1ECHMFJy2wcsZQgwnBelUejQHsTGtOwSvFWF
 8wMdfuVtAFDZJjzOxzVKfHP22R5HVRWlAU7P1d97qKjBj4Se3+QchI+zdcIrmU9A
 tTmCXj57NpTDyLjS9dIDmLygtTv93lOzOmZS8glw0BFonPcd3ObI4RHVxR+V9xu1
 lN13YYgBrK6yfApn9L5XL/31PuLfbg==
 =VLBx
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "There are a number of major cleanups in ext4 this cycle:

   - The data=journal writepath has been significantly cleaned up and
     simplified, and reduces a large number of data=journal special
     cases by Jan Kara.

   - Ojaswin Muhoo has replaced linked list used to track extents that
     have been used for inode preallocation with a red-black tree in the
     multi-block allocator. This improves performance for workloads
     which do a large number of random allocating writes.

   - Thanks to Kemeng Shi for a lot of cleanup and bug fixes in the
     multi-block allocator.

   - Matthew wilcox has converted the code paths for reading and writing
     ext4 pages to use folios.

   - Jason Yan has continued to factor out ext4_fill_super() into
     smaller functions for improve ease of maintenance and
     comprehension.

   - Josh Triplett has created an uapi header for ext4 userspace API's"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (105 commits)
  ext4: Add a uapi header for ext4 userspace APIs
  ext4: remove useless conditional branch code
  ext4: remove unneeded check of nr_to_submit
  ext4: move dax and encrypt checking into ext4_check_feature_compatibility()
  ext4: factor out ext4_block_group_meta_init()
  ext4: move s_reserved_gdt_blocks and addressable checking into ext4_check_geometry()
  ext4: rename two functions with 'check'
  ext4: factor out ext4_flex_groups_free()
  ext4: use ext4_group_desc_free() in ext4_put_super() to save some duplicated code
  ext4: factor out ext4_percpu_param_init() and ext4_percpu_param_destroy()
  ext4: factor out ext4_hash_info_init()
  Revert "ext4: Fix warnings when freezing filesystem with journaled data"
  ext4: Update comment in mpage_prepare_extent_to_map()
  ext4: Simplify handling of journalled data in ext4_bmap()
  ext4: Drop special handling of journalled data from ext4_quota_on()
  ext4: Drop special handling of journalled data from ext4_evict_inode()
  ext4: Fix special handling of journalled data from extent zeroing
  ext4: Drop special handling of journalled data from extent shifting operations
  ext4: Drop special handling of journalled data from ext4_sync_file()
  ext4: Commit transaction before writing back pages in data=journal mode
  ...
2023-04-26 08:57:41 -07:00
Jeff Layton
e59fb6749e nfs: move nfs_fhandle_hash to common include file
lockd needs to be able to hash filehandles for tracepoints. Move the
nfs_fhandle_hash() helper to a common nfs include file.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:00 -04:00
Jeff Layton
dcb779fcd4 nfsd: allow reaping files still under writeback
On most filesystems, there is no reason to delay reaping an nfsd_file
just because its underlying inode is still under writeback. nfsd just
relies on client activity or the local flusher threads to do writeback.

The main exception is NFS, which flushes all of its dirty data on last
close. Add a new EXPORT_OP_FLUSH_ON_CLOSE flag to allow filesystems to
signal that they do this, and only skip closing files under writeback on
such filesystems.

Also, remove a redundant NULL file pointer check in
nfsd_file_check_writeback, and clean up nfs's export op flag
definitions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:59 -04:00
Linus Torvalds
e2eff52ce5 v6.4/vfs.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEJxiQAKCRCRxhvAZXjc
 oj4jAP4pt/SuHdZQzv9k2j+UA4hpDmwt7UUWhLxtycQrt9oEoQD/U9hCwnYY9Ksj
 5TSZiK/OGyoslmF1zfAC+1R2crgr3A0=
 =PpBS
 -----END PGP SIGNATURE-----

Merge tag 'v6.4/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains a pile of various smaller fixes. Most of them aren't
  very interesting so this just highlights things worth mentioning:

   - Various filesystems contained the same little helper to convert
     from the mode of a dentry to the DT_* type of that dentry.

     They have now all been switched to rely on the generic
     fs_umode_to_dtype() helper. All custom helpers are removed (Jeff)

   - Fsnotify now reports ACCESS and MODIFY events for splice
     (Chung-Chiang Cheng)

   - After converting timerfd a long time ago to rely on
     wait_event_interruptible_*() apis, convert eventfd as well. This
     removes the complex open-coded wait code (Wen Yang)

   - Simplify sysctl registration for devpts, avoiding the declaration
     of two tables. Instead, just use a prefixed path with
     register_sysctl() (Luis)

   - The setattr_should_drop_sgid() helper is now exported so NFS can
     use it. By switching NFS to this helper an NFS setgid inheritance
     bug is fixed (me)"

* tag 'v6.4/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: hfsplus: remove WARN_ON() from hfsplus_cat_{read,write}_inode()
  pnode: pass mountpoint directly
  eventfd: use wait_event_interruptible_locked_irq() helper
  splice: report related fsnotify events
  fs: consolidate duplicate dt_type helpers
  nfs: use vfs setgid helper
  Update relatime comments to include equality
  fs/buffer: Remove redundant assignment to err
  fs_context: drop the unused lsm_flags member
  fs/namespace: fnic: Switch to use %ptTd
  Documentation: update idmappings.rst
  devpts: simplify two-level sysctl registration for pty_kern_table
  eventpoll: align comment with nested epoll limitation
2023-04-24 13:39:58 -07:00
Linus Torvalds
7bcff5a396 v6.4/vfs.acl
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEhwgAKCRCRxhvAZXjc
 otwgAQDXHnKiPm/d76lITXbxdUNCtvZz+ig26EbOrD+vEszzIQEA81dru0QbCNCt
 ctoZdcsmtKbt2VaYQF1CDOhlnNg5VQM=
 =pER1
 -----END PGP SIGNATURE-----

Merge tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull acl updates from Christian Brauner:
 "After finishing the introduction of the new posix acl api last cycle
  the generic POSIX ACL xattr handlers are still around in the
  filesystems xattr handlers for two reasons:

   (1) Because a few filesystems rely on the ->list() method of the
       generic POSIX ACL xattr handlers in their ->listxattr() inode
       operation.

   (2) POSIX ACLs are only available if IOP_XATTR is raised. The
       IOP_XATTR flag is raised in inode_init_always() based on whether
       the sb->s_xattr pointer is non-NULL. IOW, the registered xattr
       handlers of the filesystem are used to raise IOP_XATTR. Removing
       the generic POSIX ACL xattr handlers from all filesystems would
       risk regressing filesystems that only implement POSIX ACL support
       and no other xattrs (nfs3 comes to mind).

  This contains the work to decouple POSIX ACLs from the IOP_XATTR flag
  as they don't depend on xattr handlers anymore. So it's now possible
  to remove the generic POSIX ACL xattr handlers from the sb->s_xattr
  list of all filesystems. This is a crucial step as the generic POSIX
  ACL xattr handlers aren't used for POSIX ACLs anymore and POSIX ACLs
  don't depend on the xattr infrastructure anymore.

  Adressing problem (1) will require more long-term work. It would be
  best to get rid of the ->list() method of xattr handlers completely at
  some point.

  For erofs, ext{2,4}, f2fs, jffs2, ocfs2, and reiserfs the nop POSIX
  ACL xattr handler is kept around so they can continue to use
  array-based xattr handler indexing.

  This update does simplify the ->listxattr() implementation of all
  these filesystems however"

* tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  acl: don't depend on IOP_XATTR
  ovl: check for ->listxattr() support
  reiserfs: rework priv inode handling
  fs: rename generic posix acl handlers
  reiserfs: rework ->listxattr() implementation
  fs: simplify ->listxattr() implementation
  fs: drop unused posix acl handlers
  xattr: remove unused argument
  xattr: add listxattr helper
  xattr: simplify listxattr helpers
2023-04-24 13:35:23 -07:00
Benjamin Coddington
e025f0a73f NFS: Cleanup unused rpc_clnt variable
The root rpc_clnt is not used here, clean it up.

Fixes: 4dc73c6791 ("NFSv4: keep state manager thread active if swap is enabled")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-21 17:00:55 -04:00
Tom Rix
c5733ae6dc NFS: set varaiable nfs_netfs_debug_id storage-class-specifier to static
smatch reports
fs/nfs/fscache.c:260:10: warning: symbol
  'nfs_netfs_debug_id' was not declared. Should it be static?

This variable is only used in its defining file, so it should be static

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-21 16:36:21 -04:00
Luis Chamberlain
a2189b7767 nfs: simplify two-level sysctl registration for nfs_cb_sysctls
There is no need to declare two tables to just create directories,
this can be easily be done with a prefix path with register_sysctl().

Simplify this registration.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-13 11:49:35 -07:00
Luis Chamberlain
d2235a705b nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
There is no need to declare two tables to just create directories,
this can be easily be done with a prefix path with register_sysctl().

Simplify this registration.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-13 11:49:35 -07:00
Anna Schumaker
ec108d3cc7 NFS: Convert readdir page array functions to use a folio
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-12 11:02:20 -04:00
Anna Schumaker
61f02e0ab8 NFS: Convert the readdir array-of-pages into an array-of-folios
This patch only converts the actual array, but doesn't touch the
individual nfs_cache_array pages and related functions (that will be
done in the next patch).

I also adjust the names of the fields in the nfs_readdir_descriptor to
say "folio" instead of "page".

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-12 11:02:14 -04:00
NeilBrown
3db63daabe NFSv3: handle out-of-order write replies.
NFSv3 includes pre/post wcc attributes which allow the client to
determine if all changes to the file have been made by the client
itself, or if any might have been made by some other client.

If there are gaps in the pre/post ctime sequence it must be assumed that
some other client changed the file in that gap and the local cache must
be suspect.  The next time the file is opened the cache should be
invalidated.

Since Commit 1c341b7775 ("NFS: Add deferred cache invalidation for
close-to-open consistency violations") in linux 5.3 the Linux client has
been triggering this invalidation.  The chunk in nfs_update_inode() in
particularly triggers.

Unfortunately Linux NFS assumes that all replies will be processed in
the order sent, and will arrive in the order processed.  This is not
true in general.  Consequently Linux NFS might ignore the wcc info in a
WRITE reply because the reply is in response to a WRITE that was sent
before some other request for which a reply has already been seen.  This
is detected by Linux using the gencount tests in nfs_inode_attr_cmp().

Also, when the gencount tests pass it is still possible that the request
were processed on the server in a different order, and a gap seen in
the ctime sequence might be filled in by a subsequent reply, so gaps
should not immediately trigger delayed invalidation.

The net result is that writing to a server and then reading the file
back can result in going to the server for the read rather than serving
it from cache - all because a couple of replies arrived out-of-order.
This is a performance regression over kernels before 5.3, though the
change in 5.3 is a correctness improvement.

This has been seen with Linux writing to a Netapp server which
occasionally re-orders requests.  In testing the majority of requests
were in-order, but a few (maybe 2 or three at a time) could be
re-ordered.

This patch addresses the problem by recording any gaps seen in the
pre/post ctime sequence and not triggering invalidation until either
there are too many gaps to fit in the table, or until there are no more
active writes and the remaining gaps cannot be resolved.

We allocate a table of 16 gaps on demand.  If the allocation fails we
revert to current behaviour which is of little cost as we are unlikely
to be able to cache the writes anyway.

In the table we store "start->end" pair when iversion is updated and
"end<-start" pairs pre/post pairs reported by the server.  Usually these
exactly cancel out and so nothing is stored.  When there are
out-of-order replies we do store gaps and these will eventually be
cancelled against later replies when this client is the only writer.

If the final write is out-of-order there may be one gap remaining when
the file is closed.  This will be noticed and if there is precisely on
gap and if the iversion can be advanced to match it, then we do so.

This patch makes no attempt to handle directories correctly.  The same
problem potentially exists in the out-of-order replies to create/unlink
requests can cause future lookup requires to be sent to the server
unnecessarily.  A similar scheme using the same primitives could be used
to notice and handle out-of-order replies.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-11 16:13:21 -04:00
Dave Wysochanski
03f5bd75a4 NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit
The NFS specific trace points are no longer needed as tracing is well
covered by netfs and fscache.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-11 13:08:27 -04:00
Dave Wysochanski
0631d5e02a NFS: Remove all NFSIOS_FSCACHE counters due to conversion to netfs API
The old NFSIOS_FSCACHE counters are no longer accurate or useful with
the conversion to the new netfs API.  The new API does not have a page
based interface, and so the counters in nfs_stat_fscachecounters are
no longer obtainable.  The new netfs the API has extensive statistics
inside /proc/fs/fscache/stats so we no longer need NFS specific fscache
stats.

Note this also removes the 'fsc:' line from /proc/self/mountstats so
it will be a user-visible change.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-11 13:08:26 -04:00
Dave Wysochanski
000dbe0bec NFS: Convert buffered read paths to use netfs when fscache is enabled
Convert the NFS buffered read code paths to corresponding netfs APIs,
but only when fscache is configured and enabled.

The netfs API defines struct netfs_request_ops which must be filled
in by the network filesystem.  For NFS, we only need to define 5 of
the functions, the main one being the issue_read() function.
The issue_read() function is called by the netfs layer when a read
cannot be fulfilled locally, and must be sent to the server (either
the cache is not active, or it is active but the data is not available).
Once the read from the server is complete, netfs requires a call to
netfs_subreq_terminated() which conveys either how many bytes were read
successfully, or an error.  Note that issue_read() is called with a
structure, netfs_io_subrequest, which defines the IO requested, and
contains a start and a length (both in bytes), and assumes the underlying
netfs will return a either an error on the whole region, or the number
of bytes successfully read.

The NFS IO path is page based and the main APIs are the pgio APIs defined
in pagelist.c.  For the pgio APIs, there is no way for the caller to
know how many RPCs will be sent and how the pages will be broken up
into underlying RPCs, each of which will have their own completion and
return code.  In contrast, netfs is subrequest based, a single
subrequest may contain multiple pages, and a single subrequest is
initiated with issue_read() and terminated with netfs_subreq_terminated().
Thus, to utilze the netfs APIs, NFS needs some way to accommodate
the netfs API requirement on the single response to the whole
subrequest, while also minimizing disruptive changes to the NFS
pgio layer.

The approach taken with this patch is to allocate a small structure
for each nfs_netfs_issue_read() call, store the final error and number
of bytes successfully transferred in the structure, and update these values
as each RPC completes.  The refcount on the structure is used as a marker
for the last RPC completion, is incremented in nfs_netfs_read_initiate(),
and decremented inside nfs_netfs_read_completion(), when a nfs_pgio_header
contains a valid pointer to the data.  On the final put (which signals
the final outstanding RPC is complete) in nfs_netfs_read_completion(),
call netfs_subreq_terminated() with either the final error value (if
one or more READs complete with an error) or the number of bytes
successfully transferred (if all RPCs complete successfully).  Note
that when all RPCs complete successfully, the number of bytes transferred
is capped to the length of the subrequest.  Capping the transferred length
to the subrequest length prevents "Subreq overread" warnings from netfs.
This is due to the "aligned_len" in nfs_pageio_add_page(), and the
corner case where NFS requests a full page at the end of the file,
even when i_size reflects only a partial page (NFS overread).

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-11 13:08:26 -04:00