Commit Graph

545 Commits

Author SHA1 Message Date
Pavel Begunkov
03adabe81a io_uring: io_free_req() via tw
io_free_req() is not often used but nevertheless problematic as there is
no way to know the current context, it may be used from the submission
path or even by an irq handler. Push it to a fresh context using
task_work.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a92fe80bb068757e51aaa0b105cfbe8f5dfee9e.1680576071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-04 09:30:39 -06:00
Pavel Begunkov
2ad4c6d080 io_uring: don't put nodes under spinlocks
io_req_put_rsrc() doesn't need any locking, so move it out of
a spinlock section in __io_req_complete_post() and adjust helpers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d5b87a5f31270dade6805f7acafc4cc34b84b241.1680576071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-04 09:30:39 -06:00
Pavel Begunkov
8e15c0e71b io_uring/rsrc: keep cached refs per node
We cache refs of the current node (i.e. ctx->rsrc_node) in
ctx->rsrc_cached_refs. We'll be moving away from atomics, so move the
cached refs in struct io_rsrc_node for now. It's a prep patch and
shouldn't change anything in practise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9edc3669c1d71b06c2dca78b2b2b8bb9292738b9.1680576071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-04 09:30:39 -06:00
Pavel Begunkov
b8fb5b4fdd io_uring/rsrc: use non-pcpu refcounts for nodes
One problem with the current rsrc infra is that often updates will
generates lots of rsrc nodes, each carry pcpu refs. That takes quite a
lot of memory, especially if there is a stall, and takes lots of CPU
cycles. Only pcpu allocations takes >50 of CPU with a naive benchmark
updating files in a loop.

Replace pcpu refs with normal refcounting. There is already a hot path
avoiding atomics / refs, but following patches will further improve it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e9ed8a9457b331a26555ff9443afc64cdaab7247.1680576071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-04 09:30:39 -06:00
Jens Axboe
e3ef728ff0 io_uring: cap io_sqring_entries() at SQ ring size
We already do this manually for the !SQPOLL case, do it in general and
we can also dump the ugly min3() in io_submit_sqes().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:15 -06:00
Jens Axboe
2ad57931db io_uring: rename trace_io_uring_submit_sqe() tracepoint
It has nothing to do with the SQE at this point, it's a request
submission. While in there, get rid of the 'force_nonblock' argument
which is also dead, as we only pass in true.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:15 -06:00
Pavel Begunkov
a282967c84 io_uring: encapsulate task_work state
For task works we're passing around a bool pointer for whether the
current ring is locked or not, let's wrap it in a structure, that
will make it more opaque preventing abuse and will also help us
to pass more info in the future if needed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ecec9483d58696e248d1bfd52cf62b04442df1d.1679931367.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:15 -06:00
Pavel Begunkov
13bfa6f15d io_uring: remove extra tw trylocks
Before cond_resched()'ing in handle_tw_list() we also drop the current
ring context, and so the next loop iteration will need to pick/pin a new
context and do trylock.

The chunk removed by this patch was intended to be an optimisation
covering exactly this case, i.e. retaking the lock after reschedule, but
in reality it's skipped for the first iteration after resched as
described and will keep hammering the lock if it's contended.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ecec9483d58696e248d1bfd52cf62b04442df1d.1679931367.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:15 -06:00
Jens Axboe
07d99096e1 io_uring/io-wq: drop outdated comment
Since the move to PF_IO_WORKER, we don't juggle memory context manually
anymore. Remove that outdated part of the comment for __io_worker_idle().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:15 -06:00
Gabriel Krisman Bertazi
eb47943f22 io-wq: Drop struct io_wqe
Since commit 0654b05e7e65 ("io_uring: One wqe per wq"), we have just a
single io_wqe instance embedded per io_wq.  Drop the extra structure in
favor of accessing struct io_wq directly, cleaning up quite a bit of
dereferences and backpointers.

No functional changes intended.  Tested with liburing's testsuite
and mmtests performance microbenchmarks.  I didn't observe any
performance regressions.

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230322011628.23359-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:15 -06:00
Gabriel Krisman Bertazi
dfd63baf89 io-wq: Move wq accounting to io_wq
Since we now have a single io_wqe per io_wq instead of per-node, and in
preparation to its removal, move the accounting into the parent
structure.

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230322011628.23359-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:14 -06:00
Jens Axboe
fcb46c0ccc io_uring/kbuf: disallow mapping a badly aligned provided ring buffer
On at least parisc, we have strict requirements on how we virtually map
an address that is shared between the application and the kernel. On
these platforms, IOU_PBUF_RING_MMAP should be used when setting up a
shared ring buffer for provided buffers. If the application is mapping
these pages and asking the kernel to pin+map them as well, then we have
no control over what virtual address we get in the kernel.

For that case, do a sanity check if SHM_COLOUR is defined, and disallow
the mapping request. The application must fall back to using
IOU_PBUF_RING_MMAP for this case, and liburing will do that transparently
with the set of helpers that it has.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:14 -06:00
Breno Leitao
e1fe7ee885 io_uring: Add KASAN support for alloc_caches
Add support for KASAN in the alloc_caches (apoll and netmsg_cache).
Thus, if something touches the unused caches, it will raise a KASAN
warning/exception.

It poisons the object when the object is put to the cache, and unpoisons
it when the object is gotten or freed.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230223164353.2839177-2-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:14 -06:00
Breno Leitao
efba1a9e65 io_uring: Move from hlist to io_wq_work_node
Having cache entries linked using the hlist format brings no benefit, and
also requires an unnecessary extra pointer address per cache entry.

Use the internal io_wq_work_node single-linked list for the internal
alloc caches (async_msghdr and async_poll)

This is required to be able to use KASAN on cache entries, since we do
not need to touch unused (and poisoned) cache entries when adding more
entries to the list.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230223164353.2839177-2-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:16:12 -06:00
Breno Leitao
da64d6db3b io_uring: One wqe per wq
Right now io_wq allocates one io_wqe per NUMA node.  As io_wq is now
bound to a task, the task basically uses only the NUMA local io_wqe, and
almost never changes NUMA nodes, thus, the other wqes are mostly
unused.

Allocate just one io_wqe embedded into io_wq, and uses all possible cpus
(cpu_possible_mask) in the io_wqe->cpumask.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230310201107.4020580-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:14:21 -06:00
Jens Axboe
c56e022c0a io_uring: add support for user mapped provided buffer ring
The ring mapped provided buffer rings rely on the application allocating
the memory for the ring, and then the kernel will map it. This generally
works fine, but runs into issues on some architectures where we need
to be able to ensure that the kernel and application virtual address for
the ring play nicely together. This at least impacts architectures that
set SHM_COLOUR, but potentially also anyone setting SHMLBA.

To use this variant of ring provided buffers, the application need not
allocate any memory for the ring. Instead the kernel will do so, and
the allocation must subsequently call mmap(2) on the ring with the
offset set to:

	IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT)

to get a virtual address for the buffer ring. Normally the application
would allocate a suitable piece of memory (and correctly aligned) and
simply pass that in via io_uring_buf_reg.ring_addr and the kernel would
map it.

Outside of the setup differences, the kernel allocate + user mapped
provided buffer ring works exactly the same.

Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:14:21 -06:00
Jens Axboe
81cf17cd3a io_uring/kbuf: rename struct io_uring_buf_reg 'pad' to'flags'
In preparation for allowing flags to be set for registration, rename
the padding and use it for that.

Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:14:21 -06:00
Jens Axboe
25a2c188a0 io_uring/kbuf: add buffer_list->is_mapped member
Rather than rely on checking buffer_list->buf_pages or ->buf_nr_pages,
add a separate member that tracks if this is a ring mapped provided
buffer list or not.

Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:14:20 -06:00
Jens Axboe
ba56b63242 io_uring/kbuf: move pinning of provided buffer ring into helper
In preparation for allowing the kernel to allocate the provided buffer
rings and have the application mmap it instead, abstract out the
current method of pinning and mapping the user allocated ring.

No functional changes intended in this patch.

Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:14:20 -06:00
Helge Deller
d808459b2e io_uring: Adjust mapping wrt architecture aliasing requirements
Some architectures have memory cache aliasing requirements (e.g. parisc)
if memory is shared between userspace and kernel. This patch fixes the
kernel to return an aliased address when asked by userspace via mmap().

Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:14:20 -06:00
Jens Axboe
d4755e1538 io_uring: avoid hashing O_DIRECT writes if the filesystem doesn't need it
io_uring hashes writes to a given file/inode so that it can serialize
them. This is useful if the file system needs exclusive access to the
file to perform the write, as otherwise we end up with a ton of io-wq
threads trying to lock the inode at the same time. This can cause
excessive system time.

But if the file system has flagged that it supports parallel O_DIRECT
writes, then there's no need to serialize the writes. Check for that
through FMODE_DIO_PARALLEL_WRITE and don't hash it if we don't need to.

In a basic test of 8 threads writing to a file on XFS on a gen2 Optane,
with each thread writing in 4k chunks, it improves performance from
~1350K IOPS (or ~5290MiB/sec) to ~1410K IOPS (or ~5500MiB/sec).

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-03 07:14:20 -06:00
Linus Torvalds
f3fa7f026e io_uring-6.3-2023-03-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmQmRYAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpl0vEACPcohqWcbGvxMuOwhrX5yUeRMIk/rBUOY3
 /wmToHo9N7dHZJVyOYzGXcmTvedxXB53ea0iX/SYtTSU98Ku49UnCYDaypyo6Lag
 Qi7jRYsAVRMQBg893cSNwqNpq8IMM8Lu7GDh1W8z3MYmWmyixouNAlsLcrn8r5Yd
 3zBXuk+o5wS9Rz8RHyl3PeqiOACYB8aEYebvdjqcQS8pHokkXc9uQsNh+YFXdoB6
 vg8KeBgASEGNtnjU5bA5iB8E2vcyx1j1qeFFdyIjcRs2mKAjs9qCOENZ35XKXY3i
 J6FiWPg7uq5mk6XBjQ2P9vmeetRgcQgCHmEhaBd/H9VTDAOwK4KnlCQXIbxXdjkC
 6+fIv+msldLLE/i5lF5UnmrffurcvLRS479hpZeC5UFnhVwD0v+gUcm9Y2eWEOqD
 0Vp+YgkHowdo8ZsWXOWrv8maxo3mLyHGP1juXnTbw184hcLRyyiMN8cc2o4BJ1X/
 e60tkIbVhjxNoG8Boifs0Ep6UhGc+2muEv+BnWmtf3wDSsWkye2dsyXuQEWUK+yI
 8PGRPsttG+5X7UWtI/1AV9YV/NaRDpR8LIUPG2nKRkO1QwNUW2+DE098wW2GsCZs
 KA6KrozxiLc4H36ll7JLQjNFFzDy6sjLtDoQV6XA6eFA5bmYxhzI73OC7npLRKgO
 Jnc9Rvh7Zw==
 =5IP9
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.3-2023-03-30' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Fix a regression with the poll retry, introduced in this merge window
   (me)

 - Fix a regression with the alloc cache not decrementing the member
   count on removal. Also a regression from this merge window (Pavel)

 - Fix race around rsrc node grabbing (Pavel)

* tag 'io_uring-6.3-2023-03-30' of git://git.kernel.dk/linux:
  io_uring: fix poll/netmsg alloc caches
  io_uring/rsrc: fix rogue rsrc node grabbing
  io_uring/poll: clear single/double poll flags on poll arming
2023-03-31 12:30:13 -07:00
Pavel Begunkov
fd30d1cdcc io_uring: fix poll/netmsg alloc caches
We increase cache->nr_cached when we free into the cache but don't
decrease when we take from it, so in some time we'll get an empty
cache with cache->nr_cached larger than IO_ALLOC_CACHE_MAX, that fails
io_alloc_cache_put() and effectively disables caching.

Fixes: 9b797a37c4 ("io_uring: add abstraction around apoll cache")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-30 06:53:42 -06:00
Pavel Begunkov
4ff0b50de8 io_uring/rsrc: fix rogue rsrc node grabbing
We should not be looking at ctx->rsrc_node and anyhow modifying the node
without holding uring_lock, grabbing references in such a way is not
safe either.

Cc: stable@vger.kernel.org
Fixes: 5106dd6e74 ("io_uring: propagate issue_flags state down to file assignment")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1202ede2d7bb90136e3482b2b84aad9ed483e5d6.1680098433.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-29 09:23:46 -06:00
Jens Axboe
005308f7bd io_uring/poll: clear single/double poll flags on poll arming
Unless we have at least one entry queued, then don't call into
io_poll_remove_entries(). Normally this isn't possible, but if we
retry poll then we can have ->nr_entries cleared again as we're
setting it up. If this happens for a poll retry, then we'll still have
at least REQ_F_SINGLE_POLL set. io_poll_remove_entries() then thinks
it has entries to remove.

Clear REQ_F_SINGLE_POLL and REQ_F_DOUBLE_POLL unconditionally when
arming a poll request.

Fixes: c16bda3759 ("io_uring/poll: allow some retries for poll triggering spuriously")
Cc: stable@vger.kernel.org
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-28 07:09:01 -06:00
Linus Torvalds
83511470af block-6.3-2023-03-24
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmQdzYkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj+KEACpOBtrwaOUoMe683q/I9HYp5Te1/O9mOK3
 T+KmSxtDWiGTm/iQYUQQqGk+wXktAXjn31g+s1BZjxSbqG0OhR6Bd3XdgIVTqsCt
 aZiy3rQA+1sJ/rOoKNNzBbJKoIbVjGhxM8fXKZjsIyZBA1E0cFfsy/9pMIRoVJFL
 baonvAizJrw3nVkI3bFCuLaVQTl97Veg24rOkfr4YuSaac6wjzbsiTXH3EfJa3l/
 DumFkIE1fNK0FBQmt3+ky9M2R4yWlevenWoSZaVUdzKqG5DpYJmQrfaHn0QvTeXD
 Bxco8Pqie2k96sQlhVMiTps4HiWC9qdiUXmKeZmMfjzI1tfz6OGcKIAXUhJAYGJn
 ZngYSUGpLhvEduudDaVg6g1Gt2hNmQHtEion4IySMiurr2rHf+Pip4OfoAiVpWcq
 7ONE7wfQ1J8+NypKUJ6D9K0Z3ueMdrNv5AsjBSR2tEuMUUzVtDi5Uu0pvHXbclGj
 X5l9xln6J7+dj/6YOmKLXUvml+YEeslq2n2o3/H9zxIKXDbAHO5DKnAPEnMqhF9J
 ak3UIDYNhYjyYWNlQrEE/LpJ0wfVEGWXxowluwTy3gAyqh3Bfyu5C+Je/M6lMAev
 PghHsRWE6WPWsoh2qIC2aTXzJI6UjJdT5394kJYJ9aQm+gxObwE5+7UQANtGWlMC
 iOaDH9KpOw==
 =7EVo
 -----END PGP SIGNATURE-----

Merge tag 'block-6.3-2023-03-24' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
     - Send Identify with CNS 06h only to I/O controllers (Martin
       George)
     - Fix nvme_tcp_term_pdu to match spec (Caleb Sander)

 - Pass in issue_flags for uring_cmd, so the end_io handlers don't need
   to assume what the right context is (me)

 - Fix for ublk, marking it as LIVE before adding it to avoid races on
   the initial IO (Ming)

* tag 'block-6.3-2023-03-24' of git://git.kernel.dk/linux:
  nvme-tcp: fix nvme_tcp_term_pdu to match spec
  nvme: send Identify with CNS 06h only to I/O controllers
  block/io_uring: pass in issue_flags for uring_cmd task_work handling
  block: ublk_drv: mark device as LIVE before adding disk
2023-03-24 14:10:39 -07:00
Savino Dicanosa
02a4d923e4 io_uring/rsrc: fix null-ptr-deref in io_file_bitmap_get()
When fixed files are unregistered, file_alloc_end and alloc_hint
are not cleared. This can later cause a NULL pointer dereference in
io_file_bitmap_get() if auto index selection is enabled via
IORING_FILE_INDEX_ALLOC:

[    6.519129] BUG: kernel NULL pointer dereference, address: 0000000000000000
[...]
[    6.541468] RIP: 0010:_find_next_zero_bit+0x1a/0x70
[...]
[    6.560906] Call Trace:
[    6.561322]  <TASK>
[    6.561672]  io_file_bitmap_get+0x38/0x60
[    6.562281]  io_fixed_fd_install+0x63/0xb0
[    6.562851]  ? __pfx_io_socket+0x10/0x10
[    6.563396]  io_socket+0x93/0xf0
[    6.563855]  ? __pfx_io_socket+0x10/0x10
[    6.564411]  io_issue_sqe+0x5b/0x3d0
[    6.564914]  io_submit_sqes+0x1de/0x650
[    6.565452]  __do_sys_io_uring_enter+0x4fc/0xb20
[    6.566083]  ? __do_sys_io_uring_register+0x11e/0xd80
[    6.566779]  do_syscall_64+0x3c/0x90
[    6.567247]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[...]

To fix the issue, set file alloc range and alloc_hint to zero after
file tables are freed.

Cc: stable@vger.kernel.org
Fixes: 4278a0deb1 ("io_uring: defer alloc_hint update to io_file_bitmap_set()")
Signed-off-by: Savino Dicanosa <sd7.dev@pm.me>
[axboe: add explicit bitmap == NULL check as well]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-22 11:04:55 -06:00
Jens Axboe
74e2e17ee1 io_uring/net: avoid sending -ECONNABORTED on repeated connection requests
Since io_uring does nonblocking connect requests, if we do two repeated
ones without having a listener, the second will get -ECONNABORTED rather
than the expected -ECONNREFUSED. Treat -ECONNABORTED like a normal retry
condition if we're nonblocking, if we haven't already seen it.

Cc: stable@vger.kernel.org
Fixes: 3fb1bd6881 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT")
Link: https://github.com/axboe/liburing/issues/828
Reported-by: Hui, Chunyang <sanqian.hcy@antgroup.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-20 20:44:45 -06:00
Jens Axboe
9d2789ac9d block/io_uring: pass in issue_flags for uring_cmd task_work handling
io_uring_cmd_done() currently assumes that the uring_lock is held
when invoked, and while it generally is, this is not guaranteed.
Pass in the issue_flags associated with it, so that we have
IO_URING_F_UNLOCKED available to be able to lock the CQ ring
appropriately when completing events.

Cc: stable@vger.kernel.org
Fixes: ee692a21e9 ("fs,io_uring: add infrastructure for uring-cmd")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-20 20:01:25 -06:00
Pavel Begunkov
d2acf78908 io_uring/rsrc: fix folio accounting
| BUG: Bad page state in process kworker/u8:0  pfn:5c001
| page:00000000bfda61c8 refcount:0 mapcount:0 mapping:0000000000000000 index:0x20001 pfn:0x5c001
| head:0000000011409842 order:9 entire_mapcount:0 nr_pages_mapped:0 pincount:1
| anon flags: 0x3fffc00000b0004(uptodate|head|mappedtodisk|swapbacked|node=0|zone=0|lastcpupid=0xffff)
| raw: 03fffc0000000000 fffffc0000700001 ffffffff00700903 0000000100000000
| raw: 0000000000000200 0000000000000000 00000000ffffffff 0000000000000000
| head: 03fffc00000b0004 dead000000000100 dead000000000122 ffff00000a809dc1
| head: 0000000000020000 0000000000000000 00000000ffffffff 0000000000000000
| page dumped because: nonzero pincount
| CPU: 3 PID: 9 Comm: kworker/u8:0 Not tainted 6.3.0-rc2-00001-gc6811bf0cd87 #1
| Hardware name: linux,dummy-virt (DT)
| Workqueue: events_unbound io_ring_exit_work
| Call trace:
|  dump_backtrace+0x13c/0x208
|  show_stack+0x34/0x58
|  dump_stack_lvl+0x150/0x1a8
|  dump_stack+0x20/0x30
|  bad_page+0xec/0x238
|  free_tail_pages_check+0x280/0x350
|  free_pcp_prepare+0x60c/0x830
|  free_unref_page+0x50/0x498
|  free_compound_page+0xcc/0x100
|  free_transhuge_page+0x1f0/0x2b8
|  destroy_large_folio+0x80/0xc8
|  __folio_put+0xc4/0xf8
|  gup_put_folio+0xd0/0x250
|  unpin_user_page+0xcc/0x128
|  io_buffer_unmap+0xec/0x2c0
|  __io_sqe_buffers_unregister+0xa4/0x1e0
|  io_ring_exit_work+0x68c/0x1188
|  process_one_work+0x91c/0x1a58
|  worker_thread+0x48c/0xe30
|  kthread+0x278/0x2f0
|  ret_from_fork+0x10/0x20

Mark reports an issue with the recent patches coalescing compound pages
while registering them in io_uring. The reason is that we try to drop
excessive references with folio_put_refs(), but pages were acquired
with pin_user_pages(), which has extra accounting and so should be put
down with matching unpin_user_pages() or at least gup_put_folio().

As a fix unpin_user_pages() all but first page instead, and let's figure
out a better API after.

Fixes: 57bebf807e ("io_uring/rsrc: optimise registered huge pages")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/10efd5507d6d1f05ea0f3c601830e08767e189bd.1678980230.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-16 09:32:18 -06:00
Pavel Begunkov
5da28edd7b io_uring/msg_ring: let target know allocated index
msg_ring requests transferring files support auto index selection via
IORING_FILE_INDEX_ALLOC, however they don't return the selected index
to the target ring and there is no other good way for the userspace to
know where is the receieved file.

Return the index for allocated slots and 0 otherwise, which is
consistent with other fixed file installing requests.

Cc: stable@vger.kernel.org # v6.0+
Fixes: e6130eba8a ("io_uring: add support for passing fixed file descriptors")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://github.com/axboe/liburing/issues/809
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-16 07:16:56 -06:00
Li zeming
6acd352dfe io_uring: rsrc: Optimize return value variable 'ret'
The initialization assignment of the variable ret is changed to 0, only
in 'goto fail;' Use the ret variable as the function return value.

Signed-off-by: Li zeming <zeming@nfschina.com>
Link: https://lore.kernel.org/r/20230317182538.3027-1-zeming@nfschina.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-15 19:59:11 -06:00
Michal Koutný
a5fc1441af io_uring/sqpoll: Do not set PF_NO_SETAFFINITY on sqpoll threads
Users may specify a CPU where the sqpoll thread would run. This may
conflict with cpuset operations because of strict PF_NO_SETAFFINITY
requirement. That flag is unnecessary for polling "kernel" threads, see
the reasoning in commit 01e68ce08a ("io_uring/io-wq: stop setting
PF_NO_SETAFFINITY on io-wq workers"). Drop the flag on poll threads too.

Fixes: 01e68ce08a ("io_uring/io-wq: stop setting PF_NO_SETAFFINITY on io-wq workers")
Link: https://lore.kernel.org/all/20230314162559.pnyxdllzgw7jozgx@blackpad/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Link: https://lore.kernel.org/r/20230314183332.25834-1-mkoutny@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-15 06:50:59 -06:00
Jens Axboe
fa780334a8 io_uring: silence variable ‘prev’ set but not used warning
If io_uring.o is built with W=1, it triggers a warning:

io_uring/io_uring.c: In function ‘__io_submit_flush_completions’:
io_uring/io_uring.c:1502:40: warning: variable ‘prev’ set but not used [-Wunused-but-set-variable]
 1502 |         struct io_wq_work_node *node, *prev;
      |                                        ^~~~

which is due to the wq_list_for_each() iterator always keeping a 'prev'
variable. Most users need this to remove an entry from a list, for
example, but __io_submit_flush_completions() never does that.

Add a basic helper that doesn't track prev instead, and use that in
that function.

Reported-by: Vincenzo Palazzo <vincenzopalazzodev@gmail.com>
Reviewed-by: Vincenzo Palazzo <vincenzopalazzodev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-09 10:10:58 -07:00
Jens Axboe
03b3d6be73 io_uring/uring_cmd: ensure that device supports IOPOLL
It's possible for a file type to support uring commands, but not
pollable ones. Hence before issuing one of those, we should check
that it is supported and error out upfront if it isn't.

Cc: stable@vger.kernel.org
Fixes: 5756a3a7e7 ("io_uring: add iopoll infrastructure for io_uring_cmd")
Link: https://github.com/axboe/liburing/issues/816
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-09 09:23:59 -07:00
Jens Axboe
01e68ce08a io_uring/io-wq: stop setting PF_NO_SETAFFINITY on io-wq workers
Every now and then reports come in that are puzzled on why changing
affinity on the io-wq workers fails with EINVAL. This happens because they
set PF_NO_SETAFFINITY as part of their creation, as io-wq organizes
workers into groups based on what CPU they are running on.

However, this is purely an optimization and not a functional requirement.
We can allow setting affinity, and just lazily update our worker to wqe
mappings. If a given io-wq thread times out, it normally exits if there's
no more work to do. The exception is if it's the last worker available.
For the timeout case, check the affinity of the worker against group mask
and exit even if it's the last worker. New workers should be created with
the right mask and in the right location.

Reported-by:Daniel Dao <dqminh@cloudflare.com>
Link: https://lore.kernel.org/io-uring/CA+wXwBQwgxB3_UphSny-yAP5b26meeOu1W4TwYVcD_+5gOhvPw@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-08 08:48:13 -07:00
Linus Torvalds
53ae7e1176 io_uring-6.3-2023-03-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmQB58gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvz0EACk4hUVpReCU7mspo+eAnzo1+Oi5DHdm34C
 t4DU00lJ107Als5i86kP95zK2UHMXMsTt6N0fVwmDd/qQ8zhMwxH5sjyHVsknvGa
 sNLDCcA8yYwhfnhMnRIk024Rvn9p3U4CThSKF6G7g78absm5ULu/UL3MSiI4Q8aM
 gfK2MlD6aZLTZKLYefOI4fMBwLvJoZ3/mwpzdBfijJT/qUsYhrg8rJ9Pxo+JvtNf
 Oxny83DnT2i7ggVZraNouuD972EGf9Lm4TAmWV5l5nG4aKYZfRny+QwA48Eghs5+
 EpUNtpdvL5pguZIJLI77LSkNmJA/0oFsjh/qGOYcPZfS9erY3KOr8+jBh9FKdLC+
 5AZVt8pnSKqWT1EjKmpEThbaPgzXEhiFYq9kKO4gG5VhSgD5oxC1MXhwb8QJ0gOm
 9PR9pjzwPyYqMj7yAne4lY7I0OujHgYi9nrSpW2mLeeWjRJmkYXjB9NYxNnwiOLS
 H5SzbzHafn8HAuLacwp3StqrwqCE4bx+dTCYgLj/puN2IFEre/xtvy4S6sG2tPAF
 1hJwvV2cYqqK9Khx5fBHv2GAlMlozqe8RwQF84VGTuu+Jturo73CNtlkTGPT2Yds
 SJCTNA05uo0WsvNdiwlMWUW+nXzMFuzu0stb/Gg14KjAcS8CxYGKRRP8kFAoxrQY
 LtmaNYcukQ==
 =qWNf
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.3-2023-03-03' of git://git.kernel.dk/linux

Pull more io_uring updates from Jens Axboe:
 "Here's a set of fixes/changes that didn't make the first cut, either
  because they got queued before I sent the early merge request, or
  fixes that came in afterwards. In detail:

   - Don't set MSG_NOSIGNAL on recv/recvmsg opcodes, as AF_PACKET will
     error out (David)

   - Fix for spurious poll wakeups (me)

   - Fix for a file leak for buffered reads in certain conditions
     (Joseph)

   - Don't allow registered buffers of mixed types (Pavel)

   - Improve handling of huge pages for registered buffers (Pavel)

   - Provided buffer ring size calculation fix (Wojciech)

   - Minor cleanups (me)"

* tag 'io_uring-6.3-2023-03-03' of git://git.kernel.dk/linux:
  io_uring/poll: don't pass in wake func to io_init_poll_iocb()
  io_uring: fix fget leak when fs don't support nowait buffered read
  io_uring/poll: allow some retries for poll triggering spuriously
  io_uring: remove MSG_NOSIGNAL from recvmsg
  io_uring/rsrc: always initialize 'folio' to NULL
  io_uring/rsrc: optimise registered huge pages
  io_uring/rsrc: optimise single entry advance
  io_uring/rsrc: disallow multi-source reg buffers
  io_uring: remove unused wq_list_merge
  io_uring: fix size calculation when registering buf ring
  io_uring/rsrc: fix a comment in io_import_fixed()
  io_uring: rename 'in_idle' to 'in_cancel'
  io_uring: consolidate the put_ref-and-return section of adding work
2023-03-03 10:25:29 -08:00
Linus Torvalds
f122a08b19 capability: just use a 'u64' instead of a 'u32[2]' array
Back in 2008 we extended the capability bits from 32 to 64, and we did
it by extending the single 32-bit capability word from one word to an
array of two words.  It was then obfuscated by hiding the "2" behind two
macro expansions, with the reasoning being that maybe it gets extended
further some day.

That reasoning may have been valid at the time, but the last thing we
want to do is to extend the capability set any more.  And the array of
values not only causes source code oddities (with loops to deal with
it), but also results in worse code generation.  It's a lose-lose
situation.

So just change the 'u32[2]' into a 'u64' and be done with it.

We still have to deal with the fact that the user space interface is
designed around an array of these 32-bit values, but that was the case
before too, since the array layouts were different (ie user space
doesn't use an array of 32-bit values for individual capability masks,
but an array of 32-bit slices of multiple masks).

So that marshalling of data is actually simplified too, even if it does
remain somewhat obscure and odd.

This was all triggered by my reaction to the new "cap_isidentical()"
introduced recently.  By just using a saner data structure, it went from

	unsigned __capi;
	CAP_FOR_EACH_U32(__capi) {
		if (a.cap[__capi] != b.cap[__capi])
			return false;
	}
	return true;

to just being

	return a.val == b.val;

instead.  Which is rather more obvious both to humans and to compilers.

Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Moore <paul@paul-moore.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-01 10:01:22 -08:00
Jens Axboe
1947ddf9b3 io_uring/poll: don't pass in wake func to io_init_poll_iocb()
We only use one, and it's io_poll_wake(). Hardwire that in the initial
init, as well as in __io_queue_proc() if we're setting up for double
poll.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-01 10:06:53 -07:00
Joseph Qi
54aa7f2330 io_uring: fix fget leak when fs don't support nowait buffered read
Heming reported a BUG when using io_uring doing link-cp on ocfs2. [1]

Do the following steps can reproduce this BUG:
mount -t ocfs2 /dev/vdc /mnt/ocfs2
cp testfile /mnt/ocfs2/
./link-cp /mnt/ocfs2/testfile /mnt/ocfs2/testfile.1
umount /mnt/ocfs2

Then umount will fail, and it outputs:
umount: /mnt/ocfs2: target is busy.

While tracing umount, it blames mnt_get_count() not return as expected.
Do a deep investigation for fget()/fput() on related code flow, I've
finally found that fget() leaks since ocfs2 doesn't support nowait
buffered read.

io_issue_sqe
|-io_assign_file  // do fget() first
  |-io_read
  |-io_iter_do_read
    |-ocfs2_file_read_iter  // return -EOPNOTSUPP
  |-kiocb_done
    |-io_rw_done
      |-__io_complete_rw_common  // set REQ_F_REISSUE
    |-io_resubmit_prep
      |-io_req_prep_async  // override req->file, leak happens

This was introduced by commit a196c78b54 in v5.18. Fix it by don't
re-assign req->file if it has already been assigned.

[1] https://lore.kernel.org/ocfs2-devel/ab580a75-91c8-d68a-3455-40361be1bfa8@linux.alibaba.com/T/#t

Fixes: a196c78b54 ("io_uring: assign non-fixed early for async work")
Cc: <stable@vger.kernel.org>
Reported-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230228045459.13524-1-joseph.qi@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-28 05:58:42 -07:00
Jens Axboe
c16bda3759 io_uring/poll: allow some retries for poll triggering spuriously
If we get woken spuriously when polling and fail the operation with
-EAGAIN again, then we generally only allow polling again if data
had been transferred at some point. This is indicated with
REQ_F_PARTIAL_IO. However, if the spurious poll triggers when the socket
was originally empty, then we haven't transferred data yet and we will
fail the poll re-arm. This either punts the socket to io-wq if it's
blocking, or it fails the request with -EAGAIN if not. Neither condition
is desirable, as the former will slow things down, while the latter
will make the application confused.

We want to ensure that a repeated poll trigger doesn't lead to infinite
work making no progress, that's what the REQ_F_PARTIAL_IO check was
for. But it doesn't protect against a loop post the first receive, and
it's unnecessarily strict if we started out with an empty socket.

Add a somewhat random retry count, just to put an upper limit on the
potential number of retries that will be done. This should be high enough
that we won't really hit it in practice, unless something needs to be
aborted anyway.

Cc: stable@vger.kernel.org # v5.10+
Link: https://github.com/axboe/liburing/issues/364
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-25 20:10:13 -07:00
David Lamparter
7605c43d67 io_uring: remove MSG_NOSIGNAL from recvmsg
MSG_NOSIGNAL is not applicable for the receiving side, SIGPIPE is
generated when trying to write to a "broken pipe".  AF_PACKET's
packet_recvmsg() does enforce this, giving back EINVAL when MSG_NOSIGNAL
is set - making it unuseable in io_uring's recvmsg.

Remove MSG_NOSIGNAL from io_recvmsg_prep().

Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: David Lamparter <equinox@diac24.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230224150123.128346-1-equinox@diac24.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-24 12:59:02 -07:00
Jens Axboe
977bc87356 io_uring/rsrc: always initialize 'folio' to NULL
Smatch complains that:

smatch warnings:
io_uring/rsrc.c:1262 io_sqe_buffer_register() error: uninitialized symbol 'folio'.

'folio' may be used uninitialized, which can happen if we end up with a
single page mapped. Ensure that we clear folio to NULL at the top so
it's always set.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/r/202302241432.YML1CD5C-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-24 12:58:31 -07:00
Linus Torvalds
3822a7c409 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X bit.
 
 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.
 
 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes
 
 - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
   does perform some memcg maintenance and cleanup work.
 
 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".  These filters provide users
   with finer-grained control over DAMOS's actions.  SeongJae has also done
   some DAMON cleanup work.
 
 - Kairui Song adds a series ("Clean up and fixes for swap").
 
 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".
 
 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series.  It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.
 
 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".
 
 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".
 
 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".
 
 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".
 
 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series "mm:
   support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
   PTEs".
 
 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".
 
 - Sergey Senozhatsky has improved zsmalloc's memory utilization with his
   series "zsmalloc: make zspage chain size configurable".
 
 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.  The previous BPF-based approach had
   shortcomings.  See "mm: In-kernel support for memory-deny-write-execute
   (MDWE)".
 
 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
 
 - T.J.  Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".
 
 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a per-node
   basis.  See the series "Introduce per NUMA node memory error
   statistics".
 
 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage during
   compaction".
 
 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".
 
 - Christoph Hellwig has removed block_device_operations.rw_page() in ths
   series "remove ->rw_page".
 
 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".
 
 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier functions".
 
 - Some pagemap cleanup and generalization work in Mike Rapoport's series
   "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
   "fixups for generic implementation of pfn_valid()"
 
 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
 
 - Jason Gunthorpe rationalized the GUP system's interface to the rest of
   the kernel in the series "Simplify the external interface for GUP".
 
 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface.  To support this, we'll temporarily be
   printing warnings when people use the debugfs interface.  See the series
   "mm/damon: deprecate DAMON debugfs interface".
 
 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.
 
 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".
 
 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
 jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
 DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
 =MlGs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which does perform some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   ths series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()"

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-23 17:09:35 -08:00
Pavel Begunkov
57bebf807e io_uring/rsrc: optimise registered huge pages
When registering huge pages, internally io_uring will split them into
many PAGE_SIZE bvec entries. That's bad for performance as drivers need
to eventually dma-map the data and will do it individually for each bvec
entry. Coalesce huge pages into one large bvec.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:24 -07:00
Pavel Begunkov
b000ae0ec2 io_uring/rsrc: optimise single entry advance
Iterating within the first bvec entry should be essentially free, but we
use iov_iter_advance() for that, which shows up in benchmark profiles
taking up to 0.5% of CPU. Replace it with a hand coded version.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:23 -07:00
Pavel Begunkov
edd4782696 io_uring/rsrc: disallow multi-source reg buffers
If two or more mappings go back to back to each other they can be passed
into io_uring to be registered as a single registered buffer. That would
even work if mappings came from different sources, e.g. it's possible to
mix in this way anon pages and pages from shmem or hugetlb. That is not
a problem but it'd rather be less prone if we forbid such mixing.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:23 -07:00
Pavel Begunkov
9a1563d172 io_uring: remove unused wq_list_merge
There are no users of wq_list_merge, kill it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5f9ad0301949213230ad9000a8359d591aae615a.1677002255.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:23 -07:00
Wojciech Lukowicz
48ba08374e io_uring: fix size calculation when registering buf ring
Using struct_size() to calculate the size of io_uring_buf_ring will sum
the size of the struct and of the bufs array. However, the struct's fields
are overlaid with the array making the calculated size larger than it
should be.

When registering a ring with N * PAGE_SIZE / sizeof(struct io_uring_buf)
entries, i.e. with fully filled pages, the calculated size will span one
more page than it should and io_uring will try to pin the following page.
Depending on how the application allocated the ring, it might succeed
using an unrelated page or fail returning EFAULT.

The size of the ring should be the product of ring_entries and the size
of io_uring_buf, i.e. the size of the bufs array only.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Wojciech Lukowicz <wlukowicz01@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230218184141.70891-1-wlukowicz01@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:23 -07:00
Pavel Begunkov
6bf65a1b36 io_uring/rsrc: fix a comment in io_import_fixed()
io_import_fixed() supports offsets, but "may not" means the opposite.
Replace it with "might not" so the comments rather speaks about
possible cases.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/5b5f79958456caa6dc532f6205f75f224b232c81.1676902343.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-22 09:57:23 -07:00