Add some comments on AFS abort code handling in the rotation algorithm and
adjust the errors produced to match.
Reported-by: Jeffrey E Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
David Howells says:
(3) afs_check_validity().
(4) afs_getattr().
These are both pretty short, so your solution is probably good for them.
That said, afs_vnode_commit_status() can spend a long time under the
write lock - and pretty much every file RPC op returns a status update.
Change these functions to use read_seqbegin(). This simplifies the code
and doesn't change the current behaviour, the "seq" counter is always even
so read_seqbegin_or_lock() can never take the lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20231130115617.GA21584@redhat.com/
David Howells says:
(5) afs_find_server().
There could be a lot of servers in the list and each server can have
multiple addresses, so I think this would be better with an exclusive
second pass.
The server list isn't likely to change all that often, but when it does
change, there's a good chance several servers are going to be
added/removed one after the other. Further, this is only going to be
used for incoming cache management/callback requests from the server,
which hopefully aren't going to happen too often - but it is remotely
drivable.
(6) afs_find_server_by_uuid().
Similarly to (5), there could be a lot of servers to search through, but
they are in a tree not a flat list, so it should be faster to process.
Again, it's not likely to change that often and, again, when it does
change it's likely to involve multiple changes. This can be driven
remotely by an incoming cache management request but is mostly going to
be driven by setting up or reconfiguring a volume's server list -
something that also isn't likely to happen often.
Make the "seq" counter odd on the 2nd pass, otherwise read_seqbegin_or_lock()
never takes the lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20231130115614.GA21581@redhat.com/
David Howells says:
(2) afs_lookup_volume_rcu().
There can be a lot of volumes known by a system. A thousand would
require a 10-step walk and this is drivable by remote operation, so I
think this should probably take a lock on the second pass too.
Make the "seq" counter odd on the 2nd pass, otherwise read_seqbegin_or_lock()
never takes the lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20231130115606.GA21571@redhat.com/
Add three iov_iter structs:
(1) Add an iov_iter (->iter) to the I/O request to describe the
unencrypted-side buffer.
(2) Add an iov_iter (->io_iter) to the I/O request to describe the
encrypted-side I/O buffer. This may be a different size to the buffer
in (1).
(3) Add an iov_iter (->io_iter) to the I/O subrequest to describe the part
of the I/O buffer for that subrequest.
This will allow future patches to point to a bounce buffer instead for
purposes of handling oversize writes, decryption (where we want to save the
encrypted data to the cache) and decompression.
These iov_iters persist for the lifetime of the (sub)request, and so can be
accessed multiple times without worrying about them being deallocated upon
return to the caller.
The network filesystem must appropriately advance the iterator before
terminating the request.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Provide default invalidate_folio and release_folio calls. These will need
to interact with invalidation correctly at some point. They will be needed
if netfslib is to make use of folio->private for its own purposes.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
AFS currently uses folio->private to store the range of bytes within a
folio that have been modified - the idea being that if we have, say, a 2MiB
folio and someone writes a single byte, we only have to write back that
single page and not the whole 2MiB folio - thereby saving on network
bandwidth.
Remove this, at least for now, and accept the extra network load (which
doesn't matter in the common case of writing a whole file at a time from
beginning to end).
This makes folio->private available for netfslib to use.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Move the resource pinning-for-writeback from fscache code to netfslib code.
This is used to keep a cache backing object pinned whilst we have dirty
pages on the netfs inode in the pagecache such that VM writeback will be
able to reach it.
Whilst we're at it, switch the parameters of netfs_unpin_writeback() to
match ->write_inode() so that it can be used for that directly.
Note that this mechanism could be more generically useful than that for
network filesystems. Quite often they have to keep around other resources
(e.g. authentication tokens or network connections) until the writeback is
complete.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Remove ->begin_cache_operation() in favour of just calling fscache directly.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Christian Brauner <christian@brauner.io>
cc: linux-fsdevel@vger.kernel.org
cc: linux-cachefs@redhat.com
When an afs_volume struct is put, its refcount is reduced to 0 before
the cell->volume_lock is taken and the volume removed from the
cell->volumes tree.
Unfortunately, this means that the lookup code can race and see a volume
with a zero ref in the tree, resulting in a use-after-free:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 130782 at lib/refcount.c:25 refcount_warn_saturate+0x7a/0xda
...
RIP: 0010:refcount_warn_saturate+0x7a/0xda
...
Call Trace:
afs_get_volume+0x3d/0x55
afs_create_volume+0x126/0x1de
afs_validate_fc+0xfe/0x130
afs_get_tree+0x20/0x2e5
vfs_get_tree+0x1d/0xc9
do_new_mount+0x13b/0x22e
do_mount+0x5d/0x8a
__do_sys_mount+0x100/0x12a
do_syscall_64+0x3a/0x94
entry_SYSCALL_64_after_hwframe+0x62/0x6a
Fix this by:
(1) When putting, use a flag to indicate if the volume has been removed
from the tree and skip the rb_erase if it has.
(2) When looking up, use a conditional ref increment and if it fails
because the refcount is 0, replace the node in the tree and set the
removal flag.
Fixes: 20325960f8 ("afs: Reorganise volume and server trees to be rooted on the cell")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In afs_update_cell(), ret is the result of the DNS lookup and the errors
are to be handled by a switch - however, the value gets clobbered in
between by setting it to -ENOMEM in case afs_alloc_vlserver_list()
fails.
Fix this by moving the setting of -ENOMEM into the error handling for
OOM failure. Further, only do it if we don't have an alternative error
to return.
Found by Linux Verification Center (linuxtesting.org) with SVACE. Based
on a patch from Anastasia Belova [1].
Fixes: d5c32c89b2 ("afs: Fix cell DNS lookup")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Anastasia Belova <abelova@astralinux.ru>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: lvc-project@linuxtesting.org
Link: https://lore.kernel.org/r/20231221085849.1463-1-abelova@astralinux.ru/ [1]
Link: https://lore.kernel.org/r/1700862.1703168632@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the afs dynamic root directory, the ->lookup() function does a DNS check
on the cell being asked for and if the DNS upcall reports an error it will
report an error back to userspace (typically ENOENT).
However, if a failed DNS upcall returns a new-style result, it will return
a valid result, with the status field set appropriately to indicate the
type of failure - and in that case, dns_query() doesn't return an error and
we let stat() complete with no error - which can cause confusion in
userspace as subsequent calls that trigger d_automount then fail with
ENOENT.
Fix this by checking the status result from a valid dns_query() and
returning an error if it indicates a failure.
Fixes: bbb4c4323a ("dns: Allow the dns resolver to retrieve a server set")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216637
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Fix the afs dynamic root's d_delete function to always delete unused
dentries rather than only deleting them if they're positive. With things
as they stand upstream, negative dentries stemming from failed DNS lookups
stick around preventing retries.
Fixes: 66c7e1d319 ("afs: Split the dynroot stuff out and give it its own ops tables")
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
If an AFS cell that has an unreachable (eg. ENETUNREACH) server listed (VL
server or fileserver), an asynchronous probe to one of its addresses may
fail immediately because sendmsg() returns an error. When this happens, a
refcount underflow can happen if certain events hit a very small window.
The way this occurs is:
(1) There are two levels of "call" object, the afs_call and the
rxrpc_call. Each of them can be transitioned to a "completed" state
in the event of success or failure.
(2) Asynchronous afs_calls are self-referential whilst they are active to
prevent them from evaporating when they're not being processed. This
reference is disposed of when the afs_call is completed.
Note that an afs_call may only be completed once; once completed
completing it again will do nothing.
(3) When a call transmission is made, the app-side rxrpc code queues a Tx
buffer for the rxrpc I/O thread to transmit. The I/O thread invokes
sendmsg() to transmit it - and in the case of failure, it transitions
the rxrpc_call to the completed state.
(4) When an rxrpc_call is completed, the app layer is notified. In this
case, the app is kafs and it schedules a work item to process events
pertaining to an afs_call.
(5) When the afs_call event processor is run, it goes down through the
RPC-specific handler to afs_extract_data() to retrieve data from rxrpc
- and, in this case, it picks up the error from the rxrpc_call and
returns it.
The error is then propagated to the afs_call and that is completed
too. At this point the self-reference is released.
(6) If the rxrpc I/O thread manages to complete the rxrpc_call within the
window between rxrpc_send_data() queuing the request packet and
checking for call completion on the way out, then
rxrpc_kernel_send_data() will return the error from sendmsg() to the
app.
(7) Then afs_make_call() will see an error and will jump to the error
handling path which will attempt to clean up the afs_call.
(8) The problem comes when the error handling path in afs_make_call()
tries to unconditionally drop an async afs_call's self-reference.
This self-reference, however, may already have been dropped by
afs_extract_data() completing the afs_call
(9) The refcount underflows when we return to afs_do_probe_vlserver() and
that tries to drop its reference on the afs_call.
Fix this by making afs_make_call() attempt to complete the afs_call rather
than unconditionally putting it. That way, if afs_extract_data() manages
to complete the call first, afs_make_call() won't do anything.
The bug can be forced by making do_udp_sendmsg() return -ENETUNREACH and
sticking an msleep() in rxrpc_send_data() after the 'success:' label to
widen the race window.
The error message looks something like:
refcount_t: underflow; use-after-free.
WARNING: CPU: 3 PID: 720 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
...
RIP: 0010:refcount_warn_saturate+0xba/0x110
...
afs_put_call+0x1dc/0x1f0 [kafs]
afs_fs_get_capabilities+0x8b/0xe0 [kafs]
afs_fs_probe_fileserver+0x188/0x1e0 [kafs]
afs_lookup_server+0x3bf/0x3f0 [kafs]
afs_alloc_server_list+0x130/0x2e0 [kafs]
afs_create_volume+0x162/0x400 [kafs]
afs_get_tree+0x266/0x410 [kafs]
vfs_get_tree+0x25/0xc0
fc_mount+0xe/0x40
afs_d_automount+0x1b3/0x390 [kafs]
__traverse_mounts+0x8f/0x210
step_into+0x340/0x760
path_openat+0x13a/0x1260
do_filp_open+0xaf/0x160
do_sys_openat2+0xaf/0x170
or something like:
refcount_t: underflow; use-after-free.
...
RIP: 0010:refcount_warn_saturate+0x99/0xda
...
afs_put_call+0x4a/0x175
afs_send_vl_probes+0x108/0x172
afs_select_vlserver+0xd6/0x311
afs_do_cell_detect_alias+0x5e/0x1e9
afs_cell_detect_alias+0x44/0x92
afs_validate_fc+0x9d/0x134
afs_get_tree+0x20/0x2e6
vfs_get_tree+0x1d/0xc9
fc_mount+0xe/0x33
afs_d_automount+0x48/0x9d
__traverse_mounts+0xe0/0x166
step_into+0x140/0x274
open_last_lookups+0x1c1/0x1df
path_openat+0x138/0x1c3
do_filp_open+0x55/0xb4
do_sys_openat2+0x6c/0xb6
Fixes: 34fa47612b ("afs: Fix race in async call refcounting")
Reported-by: Bill MacAllister <bill@ca-zephyr.org>
Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1052304
Suggested-by: Jeffrey E Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/2633992.1702073229@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There were already assertions that we were not passing a tail page to
error_remove_page(), so make the compiler enforce that by converting
everything to pass and use a folio.
Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In preparation for removing the return value entirely, stop testing it
in afs.
Link: https://lkml.kernel.org/r/20231108204605.745109-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
While there, use struct_size() helper, instead of the open-coded
version, to calculate the size for the allocation of the whole
flexible structure, including of course, the flexible-array member.
This code was found with the help of Coccinelle, and audited and
fixed manually.
Signed-off-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/ZSVKwBmxQ1amv47E@work
Signed-off-by: Kees Cook <keescook@chromium.org>
Saves a pointer per struct dentry and actually makes the things less
clumsy. Cleaned the d_walk() and dcache_readdir() a bit by use
of hlist_for_... iterators.
A couple of new helpers - d_first_child() and d_next_sibling(),
to make the expressions less awful.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Mark a superblock that is for for an R/O or Backup volume as SB_RDONLY when
mounting it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
AFS doesn't really do locking on R/O volumes as fileservers don't maintain
state with each other and thus a lock on a R/O volume file on one
fileserver will not be be visible to someone looking at the same file on
another fileserver.
Further, the server may return an error if you try it.
Fix this by doing what other AFS clients do and handle filelocking on R/O
volume files entirely within the client and don't touch the server.
Fixes: 6c6c1d63c2 ("afs: Provide mount-time configurable byte-range file locking emulation")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Make AFS return error ENOENT if no cell SRV or AFSDB DNS record (or
cellservdb config file record) can be found rather than returning
EDESTADDRREQ.
Also add cell name lookup info to the cursor dump.
Fixes: d5c32c89b2 ("afs: Fix cell DNS lookup")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216637
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
When kafs tries to look up a cell in the DNS or the local config, it will
translate a lookup failure into EDESTADDRREQ whereas OpenAFS translates it
into ENOENT. Applications such as West expect the latter behaviour and
fail if they see the former.
This can be seen by trying to mount an unknown cell:
# mount -t afs %example.com:cell.root /mnt
mount: /mnt: mount(2) system call failed: Destination address required.
Fixes: 4d673da145 ("afs: Support the AFS dynamic root")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216637
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
afs_server_list is accessed with the rcu_read_lock() held from
volume->servers, so it needs to be cleaned up correctly.
Fix this by using kfree_rcu() instead of kfree().
Fixes: 8a070a9648 ("afs: Detect cell aliases 1 - Cells with root volumes")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
The ia64 architecture gets its well-earned retirement as planned,
now that there is one last (mostly) working release that will
be maintained as an LTS kernel.
The architecture specific system call tables are updated for
the added map_shadow_stack() syscall and to remove references
to the long-gone sys_lookup_dcookie() syscall.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmVC40IACgkQYKtH/8kJ
Uidhmw/9EX+aWSXGoObJ3fngaNSMw+PmrEuP8qEKBHxfKHcCdX3hc451Oh4GlhaQ
tru91pPwgNvN2/rfoKusxT+V4PemGIzfNni/04rp+P0kvmdw5otQ2yNhsQNsfVmq
XGWvkxF4P2GO6bkjjfR/1dDq7GtlyXtwwPDKeLbYb6TnJOZjtx+EAN27kkfSn1Ms
R4Sa3zJ+DfHUmHL5S9g+7UD/CZ5GfKNmIskI4Mz5GsfoUz/0iiU+Bge/9sdcdSJQ
kmbLy5YnVzfooLZ3TQmBFsO3iAMWb0s/mDdtyhqhTVmTUshLolkPYyKnPFvdupyv
shXcpEST2XJNeaDRnL2K4zSCdxdbnCZHDpjfl9wfioBg7I8NfhXKpf1jYZHH1de4
LXq8ndEFEOVQw/zSpYWfQq1sux8Jiqr+UK/ukbVeFWiGGIUs91gEWtPAf8T0AZo9
ujkJvaWGl98O1g5wmBu0/dAR6QcFJMDfVwbmlIFpU8O+MEaz6X8mM+O5/T0IyTcD
eMbAUjj4uYcU7ihKzHEv/0SS9Of38kzff67CLN5k8wOP/9NlaGZ78o1bVle9b52A
BdhrsAefFiWHp1jT6Y9Rg4HOO/TguQ9e6EWSKOYFulsiLH9LEFaB9RwZLeLytV0W
vlAgY9rUW77g1OJcb7DoNv33nRFuxsKqsnz3DEIXtgozo9CzbYI=
=H1vH
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull ia64 removal and asm-generic updates from Arnd Bergmann:
- The ia64 architecture gets its well-earned retirement as planned,
now that there is one last (mostly) working release that will be
maintained as an LTS kernel.
- The architecture specific system call tables are updated for the
added map_shadow_stack() syscall and to remove references to the
long-gone sys_lookup_dcookie() syscall.
* tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
hexagon: Remove unusable symbols from the ptrace.h uapi
asm-generic: Fix spelling of architecture
arch: Reserve map_shadow_stack() syscall number for all architectures
syscalls: Cleanup references to sys_lookup_dcookie()
Documentation: Drop or replace remaining mentions of IA64
lib/raid6: Drop IA64 support
Documentation: Drop IA64 from feature descriptions
kernel: Drop IA64 support from sig_fault handlers
arch: Remove Itanium (IA-64) architecture
- Add LKDTM test for stuck CPUs (Mark Rutland)
- Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
- Refactor more 1-element arrays into flexible arrays (Gustavo A. R. Silva)
- Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem Shaikh)
- Convert group_info.usage to refcount_t (Elena Reshetova)
- Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
- Add Kconfig fragment for basic hardening options (Kees Cook, Lukas Bulwahn)
- Fix randstruct GCC plugin performance mode to stay in groups (Kees Cook)
- Fix strtomem() compile-time check for small sources (Kees Cook)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmU/3cUWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsEoEACBGPSiOmfSWdH3TOnIG270PD24
jGjg8KFv7RC/JTOdYmpLl0okdlGT9LvjN/ToSSDEw3PIayxoXUdhkbYy0MYtiV3m
yz2ozDTzJuplQX/W2fPE+nXSzIwHao2zjPPFjHnT7lt8IIjhgjiOtLfZ2gGUkW99
Mdu2aWh3u0r4tC8OS23++yN5ibRc5l72efsjDWjZ0aPXnxE1bjmLMiIPiizpndIf
beasPuDBs98sJVYouemCwnsPXuXOPz3Q1Cpo/fTd+TMTJCLSemCQZCTuOBU0acI/
ZjLCgCaJU1yIYKBMtrIN4G9kITZniXX3/Nm4o6NQMVlcCqMeNaHuflomqWoqWfhE
UPbRo2eghZOaMNiCKLLvZDIqPrh1IcsiEl6Ef3W4hICc42GTK96IuGisIvDXwQ4N
/SzTOupJuN42noh3z1M3XuZy5RoXJ99IYDNY5CTKf9IdqvA0bbGkU3nb1gZH/xw9
BjTqKzR/7K1kTXuSgagDZ1Wceej9pZxhX7E3IHYsP8ZOvKug3EeL4yybVwQ3HRfq
Qnzcp/qPB9cOkLSQXveRTFTsj2mX28Gixct/iDuc1jIYwGQlY1gI6dcUcqby6ptM
BrQti7eR2NH2+T3aE2UVCIWsZVhx7NaSF+z8JxfAuu56jicc4xJVsi8zrNveWX5M
m2VXyBl3121BVtKi4w==
=0iVF
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"One of the more voluminous set of changes is for adding the new
__counted_by annotation[1] to gain run-time bounds checking of
dynamically sized arrays with UBSan.
- Add LKDTM test for stuck CPUs (Mark Rutland)
- Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
- Refactor more 1-element arrays into flexible arrays (Gustavo A. R.
Silva)
- Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem
Shaikh)
- Convert group_info.usage to refcount_t (Elena Reshetova)
- Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
- Add Kconfig fragment for basic hardening options (Kees Cook, Lukas
Bulwahn)
- Fix randstruct GCC plugin performance mode to stay in groups (Kees
Cook)
- Fix strtomem() compile-time check for small sources (Kees Cook)"
* tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (56 commits)
hwmon: (acpi_power_meter) replace open-coded kmemdup_nul
reset: Annotate struct reset_control_array with __counted_by
kexec: Annotate struct crash_mem with __counted_by
virtio_console: Annotate struct port_buffer with __counted_by
ima: Add __counted_by for struct modsig and use struct_size()
MAINTAINERS: Include stackleak paths in hardening entry
string: Adjust strtomem() logic to allow for smaller sources
hardening: x86: drop reference to removed config AMD_IOMMU_V2
randstruct: Fix gcc-plugin performance mode to stay in group
mailbox: zynqmp: Annotate struct zynqmp_ipi_pdata with __counted_by
drivers: thermal: tsens: Annotate struct tsens_priv with __counted_by
irqchip/imx-intmux: Annotate struct intmux_data with __counted_by
KVM: Annotate struct kvm_irq_routing_table with __counted_by
virt: acrn: Annotate struct vm_memory_region_batch with __counted_by
hwmon: Annotate struct gsc_hwmon_platform_data with __counted_by
sparc: Annotate struct cpuinfo_tree with __counted_by
isdn: kcapi: replace deprecated strncpy with strscpy_pad
isdn: replace deprecated strncpy with strscpy
NFS/flexfiles: Annotate struct nfs4_ff_layout_segment with __counted_by
nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
=m4pB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode time accessor updates from Christian Brauner:
"This finishes the conversion of all inode time fields to accessor
functions as discussed on list. Changing timestamps manually as we
used to do before is error prone. Using accessors function makes this
robust.
It does not contain the switch of the time fields to discrete 64 bit
integers to replace struct timespec and free up space in struct inode.
But after this, the switch can be trivially made and the patch should
only affect the vfs if we decide to do it"
* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
fs: rename inode i_atime and i_mtime fields
security: convert to new timestamp accessors
selinux: convert to new timestamp accessors
apparmor: convert to new timestamp accessors
sunrpc: convert to new timestamp accessors
mm: convert to new timestamp accessors
bpf: convert to new timestamp accessors
ipc: convert to new timestamp accessors
linux: convert to new timestamp accessors
zonefs: convert to new timestamp accessors
xfs: convert to new timestamp accessors
vboxsf: convert to new timestamp accessors
ufs: convert to new timestamp accessors
udf: convert to new timestamp accessors
ubifs: convert to new timestamp accessors
tracefs: convert to new timestamp accessors
sysv: convert to new timestamp accessors
squashfs: convert to new timestamp accessors
server: convert to new timestamp accessors
client: convert to new timestamp accessors
...
This makes it harder for accidental or malicious changes to
afs_xattr_handlers at runtime.
Cc: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: linux-afs@lists.infradead.org
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Link: https://lore.kernel.org/r/20230930050033.41174-5-wedsonaf@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct afs_addr_list.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: linux-afs@lists.infradead.org
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915201449.never.649-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct afs_permits.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: linux-afs@lists.infradead.org
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915201456.never.529-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
The Itanium architecture is obsolete, and an informal survey [0] reveals
that any residual use of Itanium hardware in production is mostly HP-UX
or OpenVMS based. The use of Linux on Itanium appears to be limited to
enthusiasts that occasionally boot a fresh Linux kernel to see whether
things are still working as intended, and perhaps to churn out some
distro packages that are rarely used in practice.
None of the original companies behind Itanium still produce or support
any hardware or software for the architecture, and it is listed as
'Orphaned' in the MAINTAINERS file, as apparently, none of the engineers
that contributed on behalf of those companies (nor anyone else, for that
matter) have been willing to support or maintain the architecture
upstream or even be responsible for applying the odd fix. The Intel
firmware team removed all IA-64 support from the Tianocore/EDK2
reference implementation of EFI in 2018. (Itanium is the original
architecture for which EFI was developed, and the way Linux supports it
deviates significantly from other architectures.) Some distros, such as
Debian and Gentoo, still maintain [unofficial] ia64 ports, but many have
dropped support years ago.
While the argument is being made [1] that there is a 'for the common
good' angle to being able to build and run existing projects such as the
Grid Community Toolkit [2] on Itanium for interoperability testing, the
fact remains that none of those projects are known to be deployed on
Linux/ia64, and very few people actually have access to such a system in
the first place. Even if there were ways imaginable in which Linux/ia64
could be put to good use today, what matters is whether anyone is
actually doing that, and this does not appear to be the case.
There are no emulators widely available, and so boot testing Itanium is
generally infeasible for ordinary contributors. GCC still supports IA-64
but its compile farm [3] no longer has any IA-64 machines. GLIBC would
like to get rid of IA-64 [4] too because it would permit some overdue
code cleanups. In summary, the benefits to the ecosystem of having IA-64
be part of it are mostly theoretical, whereas the maintenance overhead
of keeping it supported is real.
So let's rip off the band aid, and remove the IA-64 arch code entirely.
This follows the timeline proposed by the Debian/ia64 maintainer [5],
which removes support in a controlled manner, leaving IA-64 in a known
good state in the most recent LTS release. Other projects will follow
once the kernel support is removed.
[0] https://lore.kernel.org/all/CAMj1kXFCMh_578jniKpUtx_j8ByHnt=s7S+yQ+vGbKt9ud7+kQ@mail.gmail.com/
[1] https://lore.kernel.org/all/0075883c-7c51-00f5-2c2d-5119c1820410@web.de/
[2] https://gridcf.org/gct-docs/latest/index.html
[3] https://cfarm.tetaneutral.net/machines/list/
[4] https://lore.kernel.org/all/87bkiilpc4.fsf@mid.deneb.enyo.de/
[5] https://lore.kernel.org/all/ff58a3e76e5102c94bb5946d99187b358def688a.camel@physik.fu-berlin.de/
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support tracking
KSM-placed zero-pages").
- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the GENERIC_IOREMAP
ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take
GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency improvements
("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation, from
Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code ("Two
minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most
file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap
on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table range
API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM subsystem
documentation ("Improve mm documentation").
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO1JUQAKCRDdBJ7gKXxA
jrMwAP47r/fS8vAVT3zp/7fXmxaJYTK27CTAM881Gw1SDhFM/wEAv8o84mDenCg6
Nfio7afS1ncD+hPYT8947UnLxTgn+ww=
=Afws
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Some swap cleanups from Ma Wupeng ("fix WARN_ON in
add_to_avail_list")
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support
tracking KSM-placed zero-pages").
- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with
UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the
GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
architectures to take GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency
improvements ("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation,
from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code
("Two minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
most file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
memmap on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock
migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table
range API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM
subsystem documentation ("Improve mm documentation").
* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
maple_tree: shrink struct maple_tree
maple_tree: clean up mas_wr_append()
secretmem: convert page_is_secretmem() to folio_is_secretmem()
nios2: fix flush_dcache_page() for usage from irq context
hugetlb: add documentation for vma_kernel_pagesize()
mm: add orphaned kernel-doc to the rst files.
mm: fix clean_record_shared_mapping_range kernel-doc
mm: fix get_mctgt_type() kernel-doc
mm: fix kernel-doc warning from tlb_flush_rmaps()
mm: remove enum page_entry_size
mm: allow ->huge_fault() to be called without the mmap_lock held
mm: move PMD_ORDER to pgtable.h
mm: remove checks for pte_index
memcg: remove duplication detection for mem_cgroup_uncharge_swap
mm/huge_memory: work on folio->swap instead of page->private when splitting folio
mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
mm/swap: use dedicated entry for swap in folio
mm/swap: stop using page->private on tail pages for THP_SWAP
selftests/mm: fix WARNING comparing pointer to 0
selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
...
Fscache has an optimisation by which reads from the cache are skipped
until we know that (a) there's data there to be read and (b) that data
isn't entirely covered by pages resident in the netfs pagecache. This is
done with two flags manipulated by fscache_note_page_release():
if (...
test_bit(FSCACHE_COOKIE_HAVE_DATA, &cookie->flags) &&
test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags))
clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags);
where the NO_DATA_TO_READ flag causes cachefiles_prepare_read() to
indicate that netfslib should download from the server or clear the page
instead.
The fscache_note_page_release() function is intended to be called from
->releasepage() - but that only gets called if PG_private or PG_private_2
is set - and currently the former is at the discretion of the network
filesystem and the latter is only set whilst a page is being written to
the cache, so sometimes we miss clearing the optimisation.
Fix this by following Willy's suggestion[1] and adding an address_space
flag, AS_RELEASE_ALWAYS, that causes filemap_release_folio() to always call
->release_folio() if it's set, even if PG_private or PG_private_2 aren't
set.
Note that this would require folio_test_private() and page_has_private() to
become more complicated. To avoid that, in the places[*] where these are
used to conditionalise calls to filemap_release_folio() and
try_to_release_page(), the tests are removed the those functions just
jumped to unconditionally and the test is performed there.
[*] There are some exceptions in vmscan.c where the check guards more than
just a call to the releaser. I've added a function, folio_needs_release()
to wrap all the checks for that.
AS_RELEASE_ALWAYS should be set if a non-NULL cookie is obtained from
fscache and cleared in ->evict_inode() before truncate_inode_pages_final()
is called.
Additionally, the FSCACHE_COOKIE_NO_DATA_TO_READ flag needs to be cleared
and the optimisation cancelled if a cachefiles object already contains data
when we open it.
[dwysocha@redhat.com: call folio_mapping() inside folio_needs_release()]
Link: 902c990e31
Link: https://lkml.kernel.org/r/20230628104852.3391651-3-dhowells@redhat.com
Fixes: 1f67e6d0b1 ("fscache: Provide a function to note the release of a page")
Fixes: 047487c947 ("cachefiles: Implement the I/O routines")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Tested-by: SeongJae Park <sj@kernel.org>
Cc: Daire Byrne <daire.byrne@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.
Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230705190309.579783-22-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When an AFS FS.StoreData RPC call is made, amongst other things it is
given the resultant file size to be. On the server, this is processed
by truncating the file to new size and then writing the data.
Now, kafs has a lock (vnode->io_lock) that serves to serialise
operations against a specific vnode (ie. inode), but the parameters for
the op are set before the lock is taken. This allows two writebacks
(say sync and kswapd) to race - and if writes are ongoing the writeback
for a later write could occur before the writeback for an earlier one if
the latter gets interrupted.
Note that afs_writepages() cannot take i_mutex and only takes a shared
lock on vnode->validate_lock.
Also note that the server does the truncation and the write inside a
lock, so there's no problem at that end.
Fix this by moving the calculation for the proposed new i_size inside
the vnode->io_lock. Also reset the iterator (which we might have read
from) and update the mtime setting there.
Fixes: bd80d8a80e ("afs: Use ITER_XARRAY for writing")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/3526895.1687960024@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Yosry has also eliminated cgroup's atomic rstat flushing.
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability.
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning.
- Lorenzo Stoakes has done some maintanance work on the get_user_pages()
interface.
- Liam Howlett continues with cleanups and maintenance work to the maple
tree code. Peng Zhang also does some work on maple tree.
- Johannes Weiner has done some cleanup work on the compaction code.
- David Hildenbrand has contributed additional selftests for
get_user_pages().
- Thomas Gleixner has contributed some maintenance and optimization work
for the vmalloc code.
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code.
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting.
- Christoph Hellwig has some cleanups for the filemap/directio code.
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the provided
APIs rather than open-coding accesses.
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings.
- John Hubbard has a series of fixes to the MM selftesting code.
- ZhangPeng continues the folio conversion campaign.
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock.
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
128 to 8.
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management.
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code.
- Vishal Moola also has done some folio conversion work.
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
=B7yQ
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8QgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpupIEADKEZvpxDyaxHjYZFFeoSJRkh+AEJHe0Xtr
J5vUL8t8zmAV3F7i8XaoAEcR0dC0VQcoTc8fAOty71+5hsc7gvtyyNjqU/YWRVqK
Xr+VJuSJ+OGx3MzpRWEkepagfPyqP5cyyCOK6gqIgqzc3IwqkR/3QHVRc6oR8YbY
AQd7tqm2fQXK9WDHEy5hcaQeqb9uKZjQQoZejpPPerpJM+9RMgKxpCGtnLLIUhr/
sgl7KyLIQPBmveO2vfOR+dmsJBqsLqneqkXDKMAIfpeVEEkHHAlCH4E5Ne1XUS+s
ie4If+reuyn1Ktt5Ry1t7w2wr8cX1fcay3K28tgwjE2Bvremc5YnYgb3pyUDW38f
tXXkpg/eTXd/Pn0Crpagoa9zJ927tt5JXIO1/PagPEP1XOqUuthshDFsrVqfqbs+
36gqX2JWB4NJTg9B9KBHA3+iVCJyZLjUqOqws7hOJOvhQytZVm/IwkGBg1Slhe1a
J5WemBlqX8lTgXz0nM7cOhPYTZeKe6hazCcb5VwxTUTj9SGyYtsMfqqTwRJO9kiF
j1VzbOAgExDYe+GvfqOFPh9VqZho66+DyOD/Xtca4eH7oYyHSmP66o8nhRyPBPZA
maBxQhUkPQn4/V/0fL2TwIdWYKsbj8bUyINKPZ2L35YfeICiaYIctTwNJxtRmItB
M3VxWD3GZQ==
=KhW4
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux
Pull splice updates from Jens Axboe:
"This kills off ITER_PIPE to avoid a race between truncate,
iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio
with unpinned/unref'ed pages from an O_DIRECT splice read. This causes
memory corruption.
Instead, we either use (a) filemap_splice_read(), which invokes the
buffered file reading code and splices from the pagecache into the
pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads
into it and then pushes the filled pages into the pipe; or (c) handle
it in filesystem-specific code.
Summary:
- Rename direct_splice_read() to copy_splice_read()
- Simplify the calculations for the number of pages to be reclaimed
in copy_splice_read()
- Turn do_splice_to() into a helper, vfs_splice_read(), so that it
can be used by overlayfs and coda to perform the checks on the
lower fs
- Make vfs_splice_read() jump to copy_splice_read() to handle
direct-I/O and DAX
- Provide shmem with its own splice_read to handle non-existent pages
in the pagecache. We don't want a ->read_folio() as we don't want
to populate holes, but filemap_get_pages() requires it
- Provide overlayfs with its own splice_read to call down to a lower
layer as overlayfs doesn't provide ->read_folio()
- Provide coda with its own splice_read to call down to a lower layer
as coda doesn't provide ->read_folio()
- Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs
and random files as they just copy to the output buffer and don't
splice pages
- Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3,
ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation
- Make cifs use filemap_splice_read()
- Replace pointers to generic_file_splice_read() with pointers to
filemap_splice_read() as DIO and DAX are handled in the caller;
filesystems can still provide their own alternate ->splice_read()
op
- Remove generic_file_splice_read()
- Remove ITER_PIPE and its paraphernalia as generic_file_splice_read
was the only user"
* tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits)
splice: kdoc for filemap_splice_read() and copy_splice_read()
iov_iter: Kill ITER_PIPE
splice: Remove generic_file_splice_read()
splice: Use filemap_splice_read() instead of generic_file_splice_read()
cifs: Use filemap_splice_read()
trace: Convert trace/seq to use copy_splice_read()
zonefs: Provide a splice-read wrapper
xfs: Provide a splice-read wrapper
orangefs: Provide a splice-read wrapper
ocfs2: Provide a splice-read wrapper
ntfs3: Provide a splice-read wrapper
nfs: Provide a splice-read wrapper
f2fs: Provide a splice-read wrapper
ext4: Provide a splice-read wrapper
ecryptfs: Provide a splice-read wrapper
ceph: Provide a splice-read wrapper
afs: Provide a splice-read wrapper
9p: Add splice_read wrapper
net: Make sock_splice_read() use copy_splice_read() by default
tty, proc, kernfs, random: Use copy_splice_read()
...
Commit acc8d8588c converted afs_writepages_region() to write back a
folio batch. The function waits for writeback to a folio, but then
proceeds to the rest of the batch without trying to write that folio
again. This patch fixes has it attempt to write the folio again.
[DH: Also remove an 'else' that adding a goto makes redundant]
Fixes: acc8d8588c ("afs: convert afs_writepages_region() to use filemap_get_folios_tag()")
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20230607204120.89416-2-vishal.moola@gmail.com/
Commit acc8d8588c converted afs_writepages_region() to write back a
folio batch. If writeback needs rescheduling, the function exits without
dropping the references to the folios in fbatch. This patch fixes that.
[DH: Moved the added line before the _leave()]
Fixes: acc8d8588c ("afs: convert afs_writepages_region() to use filemap_get_folios_tag()")
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20230607204120.89416-1-vishal.moola@gmail.com/
In the same spirit as commit ca57f02295 ("afs: Fix fileserver probe
RTT handling"), don't rule out using a vlserver just because there
haven't been enough packets yet to calculate a real rtt. Always set the
server's probe rtt from the estimate provided by rxrpc_kernel_get_srtt,
which is capped at 1 second.
This could lead to EDESTADDRREQ errors when accessing a cell for the
first time, even though the vl servers are known and have responded to a
probe.
Fixes: 1d4adfaf65 ("rxrpc: Make rxrpc_kernel_get_srtt() indicate validity")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2023-June/006746.html
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kafs incorrectly passes a zero mtime (ie. 1st Jan 1970) to the server when
creating a file, dir or symlink because the mtime recorded in the
afs_operation struct gets passed to the server by the marshalling routines,
but the afs_mkdir(), afs_create() and afs_symlink() functions don't set it.
This gets masked if a file or directory is subsequently modified.
Fix this by filling in op->mtime before calling the create op.
Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
changes.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZFaSnAAKCRDdBJ7gKXxA
jskTAPwIkL6xVMIEdtWfQNIy751o0onzfOQAIWgM3dSL83cMJQEA5OXxXt4LjrJw
4q0zf4GLYXjMjhnfYr8zXF39y05iyQA=
=nIhV
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2023-05-06-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"Five hotfixes.
Three are cc:stable, two pertain to merge window changes"
* tag 'mm-hotfixes-stable-2023-05-06-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
afs: fix the afs_dir_get_folio return value
nilfs2: do not write dirty data after degenerating to read-only
mm: do not reclaim private data from pinned page
nilfs2: fix infinite loop in nilfs_mdt_get_block()
mm/mmap/vma_merge: always check invariants
Keep returning NULL on failure instead of letting an ERR_PTR escape to
callers that don't expect it.
Link: https://lkml.kernel.org/r/20230503154526.1223095-2-hch@lst.de
Fixes: 66dabbb65d ("mm: return an ERR_PTR from __filemap_get_folio")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
afs_read_dir fetches an amount of data that's based on what the inode
size is thought to be. If the file on the server is larger than what
was fetched, the code rechecks i_size and retries. If the local i_size
was not properly updated, this can lead to an endless loop of fetching
i_size from the server and noticing each time that the size is larger on
the server.
If it is known that the remote size is larger than i_size, bump up the
fetch size to that size.
Fixes: f3ddee8dc4 ("afs: Fix directory handling")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Fix afs_getattr() to report the server's idea of the file size of a
directory rather than the local size. The local size may differ as we edit
the local copy to avoid having to redownload it and we may end up with a
differently structured blob of a different size.
However, if the directory is discarded from the pagecache we then download
it again and the user may see the directory file size apparently change.
Fixes: 63a4681ff3 ("afs: Locally edit directory data for mkdir/create/unlink/...")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
If the data version returned from the server is larger than expected,
the local data is invalidated, but we may still want to note the remote
file size.
Since we're setting change_size, we have to also set data_changed
for the i_size to get updated.
Fixes: 3f4aa98181 ("afs: Fix EOF corruption")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
afs_make_call() calls rxrpc_kernel_begin_call() to begin a call (which may
get stalled in the background waiting for a connection to become
available); it then calls rxrpc_kernel_set_max_life() to set the timeouts -
but that starts the call timer so the call timer might then expire before
we get a connection assigned - leading to the following oops if the call
stalled:
BUG: kernel NULL pointer dereference, address: 0000000000000000
...
CPU: 1 PID: 5111 Comm: krxrpcio/0 Not tainted 6.3.0-rc7-build3+ #701
RIP: 0010:rxrpc_alloc_txbuf+0xc0/0x157
...
Call Trace:
<TASK>
rxrpc_send_ACK+0x50/0x13b
rxrpc_input_call_event+0x16a/0x67d
rxrpc_io_thread+0x1b6/0x45f
? _raw_spin_unlock_irqrestore+0x1f/0x35
? rxrpc_input_packet+0x519/0x519
kthread+0xe7/0xef
? kthread_complete_and_exit+0x1b/0x1b
ret_from_fork+0x22/0x30
Fix this by noting the timeouts in struct rxrpc_call when the call is
created. The timer will be started when the first packet is transmitted.
It shouldn't be possible to trigger this directly from userspace through
AF_RXRPC as sendmsg() will return EBUSY if the call is in the
waiting-for-conn state if it dropped out of the wait due to a signal.
Fixes: 9d35d880e0 ("rxrpc: Move client call connection to the I/O thread")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page().
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
than exclusive, and has fixed a bunch of errors which were caused by its
unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
=J+Dj
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
If the rxrpc call set up by afs_make_call() receives an error whilst it is
transmitting the request, there's the possibility that it may get to the
point the rxrpc call is ended (after the error_kill_call label) just as the
call is queued for async processing.
This could manifest itself as call->rxcall being seen as NULL in
afs_deliver_to_call() when it tries to lock the call.
Fix this by splitting rxrpc_kernel_end_call() into a function to shut down
an rxrpc call and a function to release the caller's reference and calling
the latter only when we get to afs_put_call().
Reported-by: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kafs-testing+fedora36_64checkkafs-build-306@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
For the map_pages() method, we need a test that does not sleep. The page
fault handler will continue to call the fault() method where we can sleep
and do the full revalidation there.
Link: https://lkml.kernel.org/r/20230327174515.1811532-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instead of returning NULL for all errors, distinguish between:
- no entry found and not asked to allocated (-ENOENT)
- failed to allocate memory (-ENOMEM)
- would block (-EAGAIN)
so that callers don't have to guess the error based on the passed in
flags.
Also pass through the error through the direct callers: filemap_get_folio,
filemap_lock_folio filemap_grab_folio and filemap_get_incore_folio.
[hch@lst.de: fix null-pointer deref]
Link: https://lkml.kernel.org/r/20230310070023.GA13563@lst.de
Link: https://lkml.kernel.org/r/20230310043137.GA1624890@u2004
Link: https://lkml.kernel.org/r/20230307143410.28031-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nilfs2]
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter". These filters provide users
with finer-grained control over DAMOS's actions. SeongJae has also done
some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series "mm:
support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with his
series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings. The previous BPF-based approach had
shortcomings. See "mm: In-kernel support for memory-deny-write-execute
(MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a per-node
basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage during
compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in ths
series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's series
"mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
"fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest of
the kernel in the series "Simplify the external interface for GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the series
"mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
=MlGs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".
These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.
The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPvfncQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpob2EADXJxcr2jjYHm/7cjKkyuVX8fr80dNdMeuY
JFdsjG1k6Uj73BVhQQWYTcs/PsrWBHWRsv6uz4WgOELj55eXmf5Q0kJszyUeJW33
/DjqLvtoppVcYf80xE13wKvCfn73BjwQo6xkGM0qAYn15eaXiD/Ax3xC6eJlsBeK
PEw7EJyhacbSxZa/1D2B6+mqII1jUQWProTCc3udZ4JHi3WvdWa3Rda0qCqHl4a1
+K2aP2YTFIRPxBzfMNa/CafWVIFubTdht+4Ds6R60RImzB9e0VUBfcsiUyW5Zg7L
Fwv7ptXuWrALwVNdW56Oz1QikBxn2pdRR2HMLwKJW1MD8kP9r8LMm2jV5Rhiwe0B
OQsGRYkOzBvw+bxeP5fvk0iPGVMz6ActH4gkraA5QdLqayDaFYOadlhqz0uRo5SH
Fb42Vl658K/MHDSIk8U58TNkmrsIJsBGohXI9DOGINPPvv3XOPi4Q1HmXkGRmii0
y+lNU/QEGh7xXXew29SPP76uQpQaYfC7NxXCMw/OpOMwehzjsjshmM2lpxi8zsgt
PJUmfHv5qxCplNmTJXmUpmX7sS7550HUdu9FJb13DM+gzKg8bk9jWVuLrzqrVlG5
1hKWEl1+heg1heRfaIuJVLbPI0au6Sb4uqhih/PHyrP9TWIoAruDbDJM65GKTxyE
2uEgcHzHQw==
=poRc
-----END PGP SIGNATURE-----
Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe updates via Christoph:
- Small improvements to the logging functionality (Amit Engel)
- Authentication cleanups (Hannes Reinecke)
- Cleanup and optimize the DMA mapping cod in the PCIe driver
(Keith Busch)
- Work around the command effects for Format NVM (Keith Busch)
- Misc cleanups (Keith Busch, Christoph Hellwig)
- Fix and cleanup freeing single sgl (Keith Busch)
- MD updates via Song:
- Fix a rare crash during the takeover process
- Don't update recovery_cp when curr_resync is ACTIVE
- Free writes_pending in md_stop
- Change active_io to percpu
- Updates to drbd, inching us closer to unifying the out-of-tree driver
with the in-tree one (Andreas, Christoph, Lars, Robert)
- BFQ update adding support for multi-actuator drives (Paolo, Federico,
Davide)
- Make brd compliant with REQ_NOWAIT (me)
- Fix for IOPOLL and queue entering, fixing stalled IO waiting on
timeouts (me)
- Fix for REQ_NOWAIT with multiple bios (me)
- Fix memory leak in blktrace cleanup (Greg)
- Clean up sbitmap and fix a potential hang (Kemeng)
- Clean up some bits in BFQ, and fix a bug in the request injection
(Kemeng)
- Clean up the request allocation and issue code, and fix some bugs
related to that (Kemeng)
- ublk updates and fixes:
- Add support for unprivileged ublk (Ming)
- Improve device deletion handling (Ming)
- Misc (Liu, Ziyang)
- s390 dasd fixes (Alexander, Qiheng)
- Improve utility of request caching and fixes (Anuj, Xiao)
- zoned cleanups (Pankaj)
- More constification for kobjs (Thomas)
- blk-iocost cleanups (Yu)
- Remove bio splitting from drivers that don't need it (Christoph)
- Switch blk-cgroups to use struct gendisk. Some of this is now
incomplete as select late reverts were done. (Christoph)
- Add bvec initialization helpers, and convert callers to use that
rather than open-coding it (Christoph)
- Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin,
Matthew, Ulf, Zhong)
* tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits)
brd: use radix_tree_maybe_preload instead of radix_tree_preload
block: use proper return value from bio_failfast()
block: bio-integrity: Copy flags when bio_integrity_payload is cloned
block: Fix io statistics for cgroup in throttle path
brd: mark as nowait compatible
brd: check for REQ_NOWAIT and set correct page allocation mask
brd: return 0/-error from brd_insert_page()
block: sync mixed merged request's failfast with 1st bio's
Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"
Revert "blk-cgroup: pass a gendisk to blkg_lookup"
Revert "blk-cgroup: delay blk-cgroup initialization until add_disk"
Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release"
Revert "blk-cgroup: move the cgroup information to struct gendisk"
nvme-pci: remove iod use_sgls
nvme-pci: fix freeing single sgl
block: ublk: check IO buffer based on flag need_get_data
s390/dasd: Fix potential memleak in dasd_eckd_init()
s390/dasd: sort out physical vs virtual pointers usage
block: Remove the ALLOC_CACHE_SLACK constant
block: make kobj_type structures constant
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc
orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz
Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ=
=+BG5
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull vfs idmapping updates from Christian Brauner:
- Last cycle we introduced the dedicated struct mnt_idmap type for
mount idmapping and the required infrastucture in 256c8aed2b ("fs:
introduce dedicated idmap type for mounts"). As promised in last
cycle's pull request message this converts everything to rely on
struct mnt_idmap.
Currently we still pass around the plain namespace that was attached
to a mount. This is in general pretty convenient but it makes it easy
to conflate namespaces that are relevant on the filesystem with
namespaces that are relevant on the mount level. Especially for
non-vfs developers without detailed knowledge in this area this was a
potential source for bugs.
This finishes the conversion. Instead of passing the plain namespace
around this updates all places that currently take a pointer to a
mnt_userns with a pointer to struct mnt_idmap.
Now that the conversion is done all helpers down to the really
low-level helpers only accept a struct mnt_idmap argument instead of
two namespace arguments.
Conflating mount and other idmappings will now cause the compiler to
complain loudly thus eliminating the possibility of any bugs. This
makes it impossible for filesystem developers to mix up mount and
filesystem idmappings as they are two distinct types and require
distinct helpers that cannot be used interchangeably.
Everything associated with struct mnt_idmap is moved into a single
separate file. With that change no code can poke around in struct
mnt_idmap. It can only be interacted with through dedicated helpers.
That means all filesystems are and all of the vfs is completely
oblivious to the actual implementation of idmappings.
We are now also able to extend struct mnt_idmap as we see fit. For
example, we can decouple it completely from namespaces for users that
don't require or don't want to use them at all. We can also extend
the concept of idmappings so we can cover filesystem specific
requirements.
In combination with the vfs{g,u}id_t work we finished in v6.2 this
makes this feature substantially more robust and thus difficult to
implement wrong by a given filesystem and also protects the vfs.
- Enable idmapped mounts for tmpfs and fulfill a longstanding request.
A long-standing request from users had been to make it possible to
create idmapped mounts for tmpfs. For example, to share the host's
tmpfs mount between multiple sandboxes. This is a prerequisite for
some advanced Kubernetes cases. Systemd also has a range of use-cases
to increase service isolation. And there are more users of this.
However, with all of the other work going on this was way down on the
priority list but luckily someone other than ourselves picked this
up.
As usual the patch is tiny as all the infrastructure work had been
done multiple kernel releases ago. In addition to all the tests that
we already have I requested that Rodrigo add a dedicated tmpfs
testsuite for idmapped mounts to xfstests. It is to be included into
xfstests during the v6.3 development cycle. This should add a slew of
additional tests.
* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
shmem: support idmapped mounts for tmpfs
fs: move mnt_idmap
fs: port vfs{g,u}id helpers to mnt_idmap
fs: port fs{g,u}id helpers to mnt_idmap
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
fs: port i_{g,u}id_{needs_}update() to mnt_idmap
quota: port to mnt_idmap
fs: port privilege checking helpers to mnt_idmap
fs: port inode_owner_or_capable() to mnt_idmap
fs: port inode_init_owner() to mnt_idmap
fs: port acl to mnt_idmap
fs: port xattr to mnt_idmap
fs: port ->permission() to pass mnt_idmap
fs: port ->fileattr_set() to pass mnt_idmap
fs: port ->set_acl() to pass mnt_idmap
fs: port ->get_acl() to pass mnt_idmap
fs: port ->tmpfile() to pass mnt_idmap
fs: port ->rename() to pass mnt_idmap
fs: port ->mknod() to pass mnt_idmap
fs: port ->mkdir() to pass mnt_idmap
...
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmPuC5kTHGpsYXl0b25A
a2VybmVsLm9yZwAKCRAADmhBGVaCFdXpEADFN/loTtcANLPQmLmgmJLDuZr2zKrf
aziMJRjqGMx6BLdwBgX8/XGBNwpG4tkVbI+zdRoVHkpcayMDpLq0dnrvi79a/dGU
fBrI72ZDMd/S9lzbodObHMziLqvgFthsPm9ldVAZ2Kt400KKNE+ozcveiC3yVGy0
n1k5BSt/78abzpqut5whVgJBooHtUMCh3XvBJPKwgOneHfAXCm+jqaXlKKpKlpZj
s2OUyn8BLfNkTgpAZ88L5Rkf0mftjziL6C8KOMy1hvOsyiP0IkwLuQ/kO+2H0Ate
p3tbOGvUT+n1gYpFYBDLnuWB4G8+CVPxfoO6KGhwT4OlCJpPlNCM8O+w/A/dKn4I
858spkpYPMy91lEkcrRLRkg/MARWGTgZ3k76fp3OWNnfruWd6ekMlYKx9n6CIy34
Aoc3Svy9KeA7oRrbRDltmw3UVmz53GcDo337ZL1J6Jph3s86dMG7AwGYvoDfKuKK
b0oNK5db5v50scBnRHX6UWejE5fSnnHvgC7pU57u08odCVEUALB+r8f04vmkxcVJ
Qed7lolQdFzn9ddaOXzpg5KeCe/cX3p4IPZSTHad7CPr8gswmC135DfXCr64x2hC
5jyNzKbe/x+7B2xCweHmEk4ojt8IU3UaYxLJoQkNeVr8rEGC9gqZgkSDe7BnTpOf
wT0ijzhy2u5RKg==
=Zhf3
-----END PGP SIGNATURE-----
Merge tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking updates from Jeff Layton:
"The main change here is that I've broken out most of the file locking
definitions into a new header file. I also went ahead and completed
the removal of locks_inode function"
* tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: remove locks_inode
filelock: move file locking definitions to separate header file
Use the bvec_set_folio helper to initialize a bvec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230203150634.3199647-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert to use folios throughout. This function is in preparation to
remove find_get_pages_range_tag().
Also modify this function to write the whole batch one at a time, rather
than calling for a new set every single write.
Link: https://lkml.kernel.org/r/20230104211448.4804-6-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
locks_inode was turned into a wrapper around file_inode in de2a4a501e
(Partially revert "locks: fix file locking on overlayfs"). Finish
replacing locks_inode invocations everywhere with file_inode.
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
The file locking definitions have lived in fs.h since the dawn of time,
but they are only used by a small subset of the source files that
include it.
Move the file locking definitions to a new header file, and add the
appropriate #include directives to the source files that need them. By
doing this we trim down fs.h a bit and limit the amount of rebuilding
that has to be done when we make changes to the file locking APIs.
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Steve French <stfrench@microsoft.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Move the call state changes that are made in rxrpc_recvmsg() to the I/O
thread. This means that, thenceforth, only the I/O thread does this and
the call state lock can be removed.
This requires the Rx phase to be ended when the last packet is received,
not when it is processed.
Since this now changes the rxrpc call state to SUCCEEDED before we've
consumed all the data from it, rxrpc_kernel_check_life() mustn't say the
call is dead until the recvmsg queue is empty (unless the call has failed).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Tidy up the abort generation infrastructure in the following ways:
(1) Create an enum and string mapping table to list the reasons an abort
might be generated in tracing.
(2) Replace the 3-char string with the values from (1) in the places that
use that to log the abort source. This gets rid of a memcpy() in the
tracepoint.
(3) Subsume the rxrpc_rx_eproto tracepoint with the rxrpc_abort tracepoint
and use values from (1) to indicate the trace reason.
(4) Always make a call to an abort function at the point of the abort
rather than stashing the values into variables and using goto to get
to a place where it reported. The C optimiser will collapse the calls
together as appropriate. The abort functions return a value that can
be returned directly if appropriate.
Note that this extends into afs also at the points where that generates an
abort. To aid with this, the afs sources need to #define
RXRPC_TRACE_ONLY_DEFINE_ENUMS before including the rxrpc tracing header
because they don't have access to the rxrpc internal structures that some
of the tracepoints make use of.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
We're trying to get rid of the ->writepage() hook[1]. Stop afs from using
it by unlocking the page and calling afs_writepages_region() rather than
folio_write_one().
A flag is passed to afs_writepages_region() to indicate that it should only
write a single region so that we don't flush the entire file in
->write_begin(), but do add other dirty data to the region being written to
try and reduce the number of RPC ops.
This requires ->migrate_folio() to be implemented, so point that at
filemap_migrate_folio() for files and also for symlinks and directories.
This can be tested by turning on the afs_folio_dirty tracepoint and then
doing something like:
xfs_io -c "w 2223 7000" -c "w 15000 22222" -c "w 23 7" /afs/my/test/foo
and then looking in the trace to see if the write at position 15000 gets
stored before page 0 gets dirtied for the write at position 23.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20221113162902.883850-1-hch@lst.de/ [1]
Link: https://lore.kernel.org/r/166876785552.222254.4403222906022558715.stgit@warthog.procyon.org.uk/ # v1
afs_zap_permits() has been removed since
commit be080a6f43 ("afs: Overhaul permit caching").
afs_cache_netfs has been removed since
commit 523d27cda1 ("afs: Convert afs to use the new fscache API").
so remove the declare for them from header file.
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20220909070353.1160228-1-cuigaosheng1@huawei.com/
The afs_fs_probe_dispatcher() work function is passed a count on
net->servers_outstanding when it is scheduled (which may come via its
timer). This is passed back to the work_item, passed to the timer or
dropped at the end of the dispatcher function.
But, at the top of the dispatcher function, there are two checks which
skip the rest of the function: if the network namespace is being destroyed
or if there are no fileservers to probe. These two return paths, however,
do not drop the count passed to the dispatcher, and so, sometimes, the
destruction of a network namespace, such as induced by rmmod of the kafs
module, may get stuck in afs_purge_servers(), waiting for
net->servers_outstanding to become zero.
Fix this by adding the missing decrements in afs_fs_probe_dispatcher().
Fixes: f6cbb368bc ("afs: Actively poll fileservers to maintain NAT or firewall openings")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/167164544917.2072364.3759519569649459359.stgit@warthog.procyon.org.uk/
direction misannotations and (hopefully) preventing
more of the same for the future.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzQAAKCRBZ7Krx/gZQ
65RZAP4nTkvOn0NZLVFkuGOx8pgJelXAvrteyAuecVL8V6CR4AD40qCVY51PJp8N
MzwiRTeqnGDxTTF7mgd//IB6hoatAA==
=bcvF
-----END PGP SIGNATURE-----
Merge tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
"iov_iter work; most of that is about getting rid of direction
misannotations and (hopefully) preventing more of the same for the
future"
* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use less confusing names for iov_iter direction initializers
iov_iter: saner checks for attempt to copy to/from iterator
[xen] fix "direction" argument of iov_iter_kvec()
[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
[target] fix iov_iter_bvec() "direction" argument
[s390] memcpy_real(): WRITE is "data source", not destination...
[s390] zcore: WRITE is "data source", not destination...
[infiniband] READ is "data destination", not source...
[fsi] WRITE is "data source", not destination...
[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
csum_and_copy_to_iter(): handle ITER_DISCARD
get rid of unlikely() on page_copy_sane() calls
The atomic_read was accidentally replaced with atomic_inc_return,
which prevents the server from getting cleaned up and causes rmmod
to hang with a warning:
Can't purge s=00000001
Fixes: 2757a4dc18 ("afs: Fix access after dec in put functions")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20221130174053.2665818-1-marc.dionne@auristor.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fileserver probing code attempts to work out the best fileserver to
use for a volume by retrieving the RTT calculated by AF_RXRPC for the
probe call sent to each server and comparing them. Sometimes, however,
no RTT estimate is available and rxrpc_kernel_get_srtt() returns false,
leading good fileservers to be given an RTT of UINT_MAX and thus causing
the rotation algorithm to ignore them.
Fix afs_select_fileserver() to ignore rxrpc_kernel_get_srtt()'s return
value and just take the estimated RTT it provides - which will be capped
at 1 second.
Fixes: 1d4adfaf65 ("rxrpc: Make rxrpc_kernel_get_srtt() indicate validity")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/166965503999.3392585.13954054113218099395.stgit@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
wireless and bluetooth subtrees
Current release - regressions:
- skb: export skb drop reaons to user by TRACE_DEFINE_ENUM
- bluetooth: fix regression preventing ACL packet transmission
Current release - new code bugs:
- dsa: microchip: fix kernel oops on ksz8 switches
- dsa: qca8k: fix NULL pointer dereference for of_device_get_match_data
Previous releases - regressions:
- netfilter: clean up hook list when offload flags check fails
- wifi: mt76: fix crash in chip reset fail
- rxrpc: fix ICMP/ICMP6 error handling
- ice: fix DMA mappings leak
- i40e: fix kernel crash during module removal
Previous releases - always broken:
- ipv6: sr: fix out-of-bounds read when setting HMAC data.
- tcp: TX zerocopy should not sense pfmemalloc status
- sch_sfb: don't assume the skb is still around after enqueueing to child
- netfilter: drop dst references before setting
- wifi: wilc1000: fix DMA on stack objects
- rxrpc: fix an insufficiently large sglist in rxkad_verify_packet_2()
- fec: use a spinlock to guard `fep->ptp_clk_on`
Misc:
- usb: qmi_wwan: add Quectel RM520N
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmMZwOMSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkCVMP/jUOqbRTzPOqV7+HhDM65Ww9Prb1WYKS
4o9Vi9DAL/DH8tXSDDtacE0GbtN6Mlr0kEYJE+lwn1hFsfGY1mzH2pcVjJqtZ7fh
6o2joaQMJ5lJe4dsr2k0TRtYhCzdeCrvzoTs2qLSeb5KYFOsxMtBikSF3kOJ1TZq
3bOK3OomiT/XO4FCnyPJvg8VBghkp69oNnefOavM8x7FzrMh6MY7VQem/IjnlIU2
9sHvMPLai81B1NXu2eybThEYSqutZHzM1PLuqIMhSMwdiSrVRlLFq7/BNP0FxCTR
/Mby+fnU4xIu+TK1bRoUM0fsipNceVEkXln4pQrRmxu1Yw62RFn5p+MGCM+Sc1UB
8HZzjtNRtlD+ZfRyQKs2m1+WJtbuESyzAeJaKoQ2eO2StmLqlnZLEYGl6oVQj6Uj
l3Bn1aBIOYrSM0T/qTNmAv7BTJYvpomlwqoQ5A1+5b0jKhn39Un3Yj3eU2aEv8eX
mJ80KAWk26Cc7ZA8WbL70Ac8wV76+TcpWCezVvrcZqrFoYAvicLrwEc92Fq0IiG5
bkA84zhd1TOFSuYVH7f79lSUPYHIs8FxFXIV8SZZGLbhj6nk607cwgbpA9IXizZT
CUajNoJclS1tELPG7VOv6DJS+VnpdsCm0+q3dnYTYiLlrEX71Bgc/ofgGOBKTeS3
iRyS2V0GLYoW
=DeAB
-----END PGP SIGNATURE-----
Merge tag 'net-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from rxrpc, netfilter, wireless and bluetooth
subtrees.
Current release - regressions:
- skb: export skb drop reaons to user by TRACE_DEFINE_ENUM
- bluetooth: fix regression preventing ACL packet transmission
Current release - new code bugs:
- dsa: microchip: fix kernel oops on ksz8 switches
- dsa: qca8k: fix NULL pointer dereference for
of_device_get_match_data
Previous releases - regressions:
- netfilter: clean up hook list when offload flags check fails
- wifi: mt76: fix crash in chip reset fail
- rxrpc: fix ICMP/ICMP6 error handling
- ice: fix DMA mappings leak
- i40e: fix kernel crash during module removal
Previous releases - always broken:
- ipv6: sr: fix out-of-bounds read when setting HMAC data.
- tcp: TX zerocopy should not sense pfmemalloc status
- sch_sfb: don't assume the skb is still around after
enqueueing to child
- netfilter: drop dst references before setting
- wifi: wilc1000: fix DMA on stack objects
- rxrpc: fix an insufficiently large sglist in
rxkad_verify_packet_2()
- fec: use a spinlock to guard `fep->ptp_clk_on`
Misc:
- usb: qmi_wwan: add Quectel RM520N"
* tag 'net-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
sch_sfb: Also store skb len before calling child enqueue
net: phy: lan87xx: change interrupt src of link_up to comm_ready
net/smc: Fix possible access to freed memory in link clear
net: ethernet: mtk_eth_soc: check max allowed hash in mtk_ppe_check_skb
net: skb: export skb drop reaons to user by TRACE_DEFINE_ENUM
net: ethernet: mtk_eth_soc: fix typo in __mtk_foe_entry_clear
net: dsa: felix: access QSYS_TAG_CONFIG under tas_lock in vsc9959_sched_speed_set
net: dsa: felix: disable cut-through forwarding for frames oversized for tc-taprio
net: dsa: felix: tc-taprio intervals smaller than MTU should send at least one packet
net: usb: qmi_wwan: add Quectel RM520N
net: dsa: qca8k: fix NULL pointer dereference for of_device_get_match_data
tcp: fix early ETIMEDOUT after spurious non-SACK RTO
stmmac: intel: Simplify intel_eth_pci_remove()
net: mvpp2: debugfs: fix memory leak when using debugfs_lookup()
ipv6: sr: fix out-of-bounds read when setting HMAC data.
bonding: accept unsolicited NA message
bonding: add all node mcast address when slave up
bonding: use unspecified address if no available link local address
wifi: use struct_group to copy addresses
wifi: mac80211_hwsim: check length for virtio packets
...
When trying to get a file lock on an AFS file, the server may return
UAEAGAIN to indicate that the lock is already held. This is currently
translated by the default path to -EREMOTEIO.
Translate it instead to -EAGAIN so that we know we can retry it.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey E Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/166075761334.3533338.2591992675160918098.stgit@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rxrpc and kafs between them try to use the receive timestamp on the first
data packet (ie. the one with sequence number 1) as a base from which to
calculate the time at which callback promise and lock expiration occurs.
However, we don't know how long it took for the server to send us the reply
from it having completed the basic part of the operation - it might then,
for instance, have to send a bunch of a callback breaks, depending on the
particular operation.
Fix this by using the time at which the operation is issued on the client
as a base instead. That should never be longer than the server's idea of
the expiry time.
Fixes: 781070551c ("afs: Fix calculation of callback expiry time")
Fixes: 2070a3e449 ("rxrpc: Allow the reply time to be obtained on a client call")
Suggested-by: Jeffrey E Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
filldir_t instances (directory iterators callbacks) used to return 0 for
"OK, keep going" or -E... for "stop". Note that it's *NOT* how the
error values are reported - the rules for those are callback-dependent
and ->iterate{,_shared}() instances only care about zero vs. non-zero
(look at emit_dir() and friends).
So let's just return bool ("should we keep going?") - it's less confusing
that way. The choice between "true means keep going" and "true means
stop" is bikesheddable; we have two groups of callbacks -
do something for everything in directory, until we run into problem
and
find an entry in directory and do something to it.
The former tended to use 0/-E... conventions - -E<something> on failure.
The latter tended to use 0/1, 1 being "stop, we are done".
The callers treated anything non-zero as "stop", ignoring which
non-zero value did they get.
"true means stop" would be more natural for the second group; "true
means keep going" - for the first one. I tried both variants and
the things like
if allocation failed
something = -ENOMEM;
return true;
just looked unnatural and asking for trouble.
[folded suggestion from Matthew Wilcox <willy@infradead.org>]
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Enable multipage folio support for the afs filesystem.
Support has already been implemented in netfslib, fscache and cachefiles
and in most of afs, but I've waited for Matthew Wilcox's latest folio
changes.
Note that it does require a change to afs_write_begin() to return the
correct subpage. This is a "temporary" change as we're working on
getting rid of the need for ->write_begin() and ->write_end()
completely, at least as far as network filesystems are concerned - but
it doesn't prevent afs from making use of the capability.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: kafs-testing@auristor.com
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/lkml/2274528.1645833226@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmLpXWwACgkQ+7dXa6fL
C2tpSg//WX4+1XvsehIZxykJYmQ7XjVjrg9uGIEaslUry2nqEECDeK7LppKLuCoD
1xMBc5GRn4h5YbBCGbMWnr+r6eezcXdwr5Fv+On4fD3l5AWTk5xTaR8v1So3G9bK
nAkWaoSrFapuifWgBkDGy9NhLW/k6d24BAKCqGhOlPcVzi4B95C4gQ5sIlMlH9xw
jUkHB+HSLZj/yBV47sWQUs0uy5fCiOgm3qgWYl2cTHD4qJ052i0ELO30uAv9Oykq
YjgOUryZlCP/vLSWzOeK8bgINjCLetd+2jGT96bV15BouQ4uoHSrgt7ACtp1I3NP
WAPGdzlzXZqaYSrqX2sPP/Tmq0/SipoKCRqK5e2yetmqs7zznKZe0ATwFhjJYp7L
IOULEaOAyrAeT9KQEwqtANXXoUxyEUqz2GKiybXWC8oUFJ0RAKpSf4LRi9cpoaRz
VDmi0F4/ZA+nMDOPPdbSZTtDCu+68Lz2l6Z7ONwIZ1qzyXJB4BC4wb+9rx7UftA9
QuX9TqGUtjlWiPcM0qvdFoPOIA3xOXbZVtlvqygFxzDDXGlRVZdUa3uqMjELKzK1
v1yRZIS00zqWM4lWFyaQNCZHwFhlnixRzkAmvQcioAb/NJL3eA4n3FbL+RBTSrWP
XEtgpI32ROobA3S+69LomYH4AisMgP9XDP04mOQqQfIxg9GFdEI=
=fXqy
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20220802' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"Fix AFS refcount handling.
The first patch converts afs to use refcount_t for its refcounts and
the second patch fixes afs_put_call() and afs_put_server() to save the
values they're going to log in the tracepoint before decrementing the
refcount"
* tag 'afs-fixes-20220802' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix access after dec in put functions
afs: Use refcount_t rather than atomic_t
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into their
own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmLpViQACgkQDpNsjXcp
gj5pBgf/f3+K7Hi3qw7aYQCYJQ7IA/bLyE/DLWI59kuiao6wDSve40B9YH9X++Ha
mRLp55bkQS+bwS2xa4jlqrIDJzAfNoWlXaXZHUXGL1C/52ChTF6jaH2cvO9PVlDS
7fLv1hy2LwiIdzpKJkUW7T+kcQGj3QLKqtQ4x8zD0LGMg055yvt/qndHSUi41nWT
/58+6W8Sk4vvRgkpeChFzF1lGLy00+FGT8y5V2kM9uRliFQ7XPCwqB2a3e5jbW6z
C1NXQmRnopCrnOT1TFIhK3DyX6MDIWV5qcikNAmCKFb9fQFPmjDLPt9iSoMGjw2M
Z+UVhJCaU3ISccd0DG5Ra/vzs9/O9Q==
=DgUi
-----END PGP SIGNATURE-----
Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into
their own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
fs: remove the NULL get_block case in mpage_writepages
fs: don't call ->writepage from __mpage_writepage
fs: remove the nobh helpers
jfs: stop using the nobh helper
ext2: remove nobh support
ntfs3: refactor ntfs_writepages
mm/folio-compat: Remove migration compatibility functions
fs: Remove aops->migratepage()
secretmem: Convert to migrate_folio
hugetlb: Convert to migrate_folio
aio: Convert to migrate_folio
f2fs: Convert to filemap_migrate_folio()
ubifs: Convert to filemap_migrate_folio()
btrfs: Convert btrfs_migratepage to migrate_folio
mm/migrate: Add filemap_migrate_folio()
mm/migrate: Convert migrate_page() to migrate_folio()
nfs: Convert to migrate_folio
btrfs: Convert btree_migratepage to migrate_folio
mm/migrate: Convert expected_page_refs() to folio_expected_refs()
mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
...
Reference-putting functions should not access the object being put after
decrementing the refcount unless they reduce the refcount to zero.
Fix a couple of instances of this in afs by copying the information to be
logged by tracepoint to local variables before doing the decrement.
[Fixed a bit in afs_put_server() that I'd missed but Marc caught]
Fixes: 341f741f04 ("afs: Refcount the afs_call struct")
Fixes: 4521819369 ("afs: Trace afs_server usage")
Fixes: 977e5f8ed0 ("afs: Split the usage count on struct afs_server")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/165911278430.3745403.16526310736054780645.stgit@warthog.procyon.org.uk/ # v1
check_write_begin() will unlock and put the folio when return
non-zero. So we should avoid unlocking and putting it twice in
netfs layer.
Change the way ->check_write_begin() works in the following two ways:
(1) Pass it a pointer to the folio pointer, allowing it to unlock and put
the folio prior to doing the stuff it wants to do, provided it clears
the folio pointer.
(2) Change the return values such that 0 with folio pointer set means
continue, 0 with folio pointer cleared means re-get and all error
codes indicating an error (no special treatment for -EAGAIN).
[ bagasdotme: use Sphinx code text syntax for *foliop pointer ]
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/56423
Link: https://lore.kernel.org/r/cf169f43-8ee7-8697-25da-0204d1b4343e@redhat.com
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If read_mapping_page() encounters an error, it returns an errno, not a
page with PageError set, so this is dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
The recent patch to make afs_getattr consult the server didn't account
for the pseudo-inodes employed by the dynamic root-type afs superblock
not having a volume or a server to access, and thus an oops occurs if
such a directory is stat'd.
Fix this by checking to see if the vnode->volume pointer actually points
anywhere before following it in afs_getattr().
This can be tested by stat'ing a directory in /afs. It may be
sufficient just to do "ls /afs" and the oops looks something like:
BUG: kernel NULL pointer dereference, address: 0000000000000020
...
RIP: 0010:afs_getattr+0x8b/0x14b
...
Call Trace:
<TASK>
vfs_statx+0x79/0xf5
vfs_fstatat+0x49/0x62
Fixes: 2aeb8c86d4 ("afs: Fix afs_getattr() to refetch file status if callback break occurred")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/165408450783.1031787.7941404776393751186.stgit@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The netfs_io_request cleanup op is now always in a position to be given a
pointer to a netfs_io_request struct, so this can be passed in instead of
the mapping and private data arguments (both of which are included in the
struct).
So rename the ->cleanup op to ->free_request (to match ->init_request) and
pass in the I/O pointer.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Change the signature of netfs helper functions to take a struct netfs_inode
pointer rather than a struct inode pointer where appropriate, thereby
relieving the need for the network filesystem to convert its internal inode
format down to the VFS inode only for netfslib to bounce it back up. For
type safety, it's better not to do that (and it's less typing too).
Give netfs_write_begin() an extra argument to pass in a pointer to the
netfs_inode struct rather than deriving it internally from the file
pointer. Note that the ->write_begin() and ->write_end() ops are intended
to be replaced in the future by netfslib code that manages this without the
need to call in twice for each page.
netfs_readpage() and similar are intended to be pointed at directly by the
address_space_operations table, so must stick to the signature dictated by
the function pointers there.
Changes
=======
- Updated the kerneldoc comments and documentation [DH].
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/CAHk-=wgkwKyNmNdKpQkqZ6DnmUL-x9hp0YBnUGjaPFEAdxDTbw@mail.gmail.com/
Remove an unused global variable and make another static as reported by
make C=1.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled. This was causing the
following complaint[1] from gcc v12:
In file included from include/linux/string.h:253,
from include/linux/ceph/ceph_debug.h:7,
from fs/ceph/inode.c:2:
In function 'fortify_memset_chk',
inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode). The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.
Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).
Most of the changes were done with:
perl -p -i -e 's/vfs_inode/netfs.inode/'g \
`git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`
Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.
Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].
Version #2:
- Fix a couple of missed name changes due to a disabled cifs option.
- Rename nfs_i_context to nfs_inode
- Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
structs.
[ This also undoes commit 507160f46c ("netfs: gcc-12: temporarily
disable '-Wattribute-warning' for now") that is no longer needed ]
Fixes: bc899ee1c8 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a pure band-aid so that I can continue merging stuff from people
while some of the gcc-12 fallout gets sorted out.
In particular, gcc-12 is very unhappy about the kinds of pointer
arithmetic tricks that netfs does, and that makes the fortify checks
trigger in afs and ceph:
In function ‘fortify_memset_chk’,
inlined from ‘netfs_i_context_init’ at include/linux/netfs.h:327:2,
inlined from ‘afs_set_netfs_context’ at fs/afs/inode.c:61:2,
inlined from ‘afs_root_iget’ at fs/afs/inode.c:543:2:
include/linux/fortify-string.h:258:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
258 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
and the reason is that netfs_i_context_init() is passed a 'struct inode'
pointer, and then it does
struct netfs_i_context *ctx = netfs_i_context(inode);
memset(ctx, 0, sizeof(*ctx));
where that netfs_i_context() function just does pointer arithmetic on
the inode pointer, knowing that the netfs_i_context is laid out
immediately after it in memory.
This is all truly disgusting, since the whole "netfs_i_context is laid
out immediately after it in memory" is not actually remotely true in
general, but is just made to be that way for afs and ceph.
See for example fs/cifs/cifsglob.h:
struct cifsInodeInfo {
struct {
/* These must be contiguous */
struct inode vfs_inode; /* the VFS's inode record */
struct netfs_i_context netfs_ctx; /* Netfslib context */
};
[...]
and realize that this is all entirely wrong, and the pointer arithmetic
that netfs_i_context() is doing is also very very wrong and wouldn't
give the right answer if netfs_ctx had different alignment rules from a
'struct inode', for example).
Anyway, that's just a long-winded way to say "the gcc-12 warning is
actually quite reasonable, and our code happens to work but is pretty
disgusting".
This is getting fixed properly, but for now I made the mistake of
thinking "the week right after the merge window tends to be calm for me
as people take a breather" and I did a sustem upgrade. And I got gcc-12
as a result, so to continue merging fixes from people and not have the
end result drown in warnings, I am fixing all these gcc-12 issues I hit.
Including with these kinds of temporary fixes.
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/all/AEEBCF5D-8402-441D-940B-105AA718C71F@chromium.org/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>