We use io_rsrc_node_switch() coupled with io_rsrc_node_switch_start()
for a bunch of cases including initialising ctx->rsrc_node, i.e. by
passing NULL instead of rsrc_data. Leave it to only deal with actual
node changing.
For that, first remove it from io_uring_create() and add a function
allocating the first node. Then also remove all calls to
io_rsrc_node_switch() from files/buffers register as we already have a
node installed and it does essentially nothing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d146fe306ff98b1a5a60c997c252534f03d423d7.1681210788.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The kernel test robot complains about __io_remove_buffers().
io_uring/kbuf.c:221 __io_remove_buffers() warn: variable dereferenced
before check 'bl->buf_ring' (see line 219)
That check is not needed as ->buf_ring will always be set, so we can
remove it and so silence the warning.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a632bbf749d9d911e605255652ce08d18e7d2c6.1681210788.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring/io_uring.c:432 io_prep_async_work() error: we previously
assumed 'req->file' could be null (see line 425).
Even though it's a false positive as there will not be REQ_F_ISREG set
without a file, let's add a simple check to make the kernel test robot
happy. We don't care about performance here, but assumingly it'll be
optimised out by the compiler.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a6cfbe92c74b789c0b4f046f7f98d19b1ca2e5b7.1681210788.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We know now what the completion context is for the uring_cmd completion
handling, so use that to have io_req_task_complete() decide what the
best way to complete the request is. This allows batching of the posted
completions if we have multiple pending, rather than always doing them
one-by-one.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chains of memory accesses are never good for performance.
The req->task->io_uring->in_cancel in io_req_local_work_add() is there
so that when a task is exiting via io_uring_try_cancel_requests() and
starts waiting for completions, it gets woken up by every new task_work
item queued.
Do a little trick by announcing waiting in io_uring_try_cancel_requests(),
making io_req_local_work_add() wake us up. We also need to check for
deferred tw items after prepare_to_wait(TASK_INTERRUPTIBLE);
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fb11597e9bbcb365901824f8c5c2cf0d6ee100d0.1680782017.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Every task_work will try to wake the task to be executed, which causes
excessive scheduling and additional overhead. For some tw it's
justified, but others won't do much but post a single CQE.
When a task waits for multiple cqes, every such task_work will wake it
up. Instead, the task may give a hint about how many cqes it waits for,
io_req_local_work_add() will compare against it and skip wake ups
if #cqes + #tw is not enough to satisfy the waiting condition. Task_work
that uses the optimisation should be simple enough and never post more
than one CQE. It's also ignored for non DEFER_TASKRUN rings.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d2b77e99d1e86624d8a69f7037d764b739dcd225.1680782017.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently pin the ctx for io_req_local_work_add() with
percpu_ref_get/put, which implies two rcu_read_lock/unlock pairs and some
extra overhead on top in the fast path. Replace it with a pure rcu read
and let io_ring_exit_work() synchronise against it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cbdfcb6b232627f30e9e50ef91f13c4f05910247.1680782017.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than check this in the fast path issue, it makes more sense to
just assign the copy of the data when we're setting it up anyway. This
makes the code a bit cleaner, and removes the need for this check in
the issue path.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct delayed_work rsrc_put_work was previously used to offload node
freeing because io_rsrc_node_ref_zero() was previously called by RCU in
the IRQ context. Now, as percpu refcounting is gone, we can do it
eagerly at the spot without pushing it to a worker.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/13fb1aac1e8d068ad8fd4a0c6d0d157ab61b90c0.1680576071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Every io_rsrc_node keeps a list of items to put, and all entries are
kmalloc()'ed. However, it's quite often to queue up only one entry per
node, so let's add an inline entry there to avoid extra allocations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c482c1c652c45c85ac52e67c974bc758a50fed5f.1680576071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have too many "rsrc" around which makes the name of struct
io_rsrc_node::rsrc_list confusing. The field is responsible for keeping
a list of files or buffers, so call it item_list and add comments
around.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3e34d4dfc1fdbb6b520f904ee6187c2ccf680efe.1680576071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We use ->rsrc_ref_lock spinlock to protect ->rsrc_ref_list in
io_rsrc_node_ref_zero(). Now we removed pcpu refcounting, which means
io_rsrc_node_ref_zero() is not executed from the irq context as an RCU
callback anymore, and we also put it under ->uring_lock.
io_rsrc_node_switch(), which queues up nodes into the list, is also
protected by ->uring_lock, so we can safely get rid of ->rsrc_ref_lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b60af883c263551190b526a55ff2c9d5ae07141.1680576071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, for nodes we have an atomic counter and some cached
(non-atomic) refs protected by uring_lock. Let's put all ref
manipulations under uring_lock and get rid of the atomic part.
It's free as in all cases we care about we already hold the lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/25b142feed7d831008257d90c8b17c0115d4fc15.1680576071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_free_req() is not often used but nevertheless problematic as there is
no way to know the current context, it may be used from the submission
path or even by an irq handler. Push it to a fresh context using
task_work.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a92fe80bb068757e51aaa0b105cfbe8f5dfee9e.1680576071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We cache refs of the current node (i.e. ctx->rsrc_node) in
ctx->rsrc_cached_refs. We'll be moving away from atomics, so move the
cached refs in struct io_rsrc_node for now. It's a prep patch and
shouldn't change anything in practise.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9edc3669c1d71b06c2dca78b2b2b8bb9292738b9.1680576071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
One problem with the current rsrc infra is that often updates will
generates lots of rsrc nodes, each carry pcpu refs. That takes quite a
lot of memory, especially if there is a stall, and takes lots of CPU
cycles. Only pcpu allocations takes >50 of CPU with a naive benchmark
updating files in a loop.
Replace pcpu refs with normal refcounting. There is already a hot path
avoiding atomics / refs, but following patches will further improve it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e9ed8a9457b331a26555ff9443afc64cdaab7247.1680576071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We already do this manually for the !SQPOLL case, do it in general and
we can also dump the ugly min3() in io_submit_sqes().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It has nothing to do with the SQE at this point, it's a request
submission. While in there, get rid of the 'force_nonblock' argument
which is also dead, as we only pass in true.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For task works we're passing around a bool pointer for whether the
current ring is locked or not, let's wrap it in a structure, that
will make it more opaque preventing abuse and will also help us
to pass more info in the future if needed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ecec9483d58696e248d1bfd52cf62b04442df1d.1679931367.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Before cond_resched()'ing in handle_tw_list() we also drop the current
ring context, and so the next loop iteration will need to pick/pin a new
context and do trylock.
The chunk removed by this patch was intended to be an optimisation
covering exactly this case, i.e. retaking the lock after reschedule, but
in reality it's skipped for the first iteration after resched as
described and will keep hammering the lock if it's contended.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ecec9483d58696e248d1bfd52cf62b04442df1d.1679931367.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the move to PF_IO_WORKER, we don't juggle memory context manually
anymore. Remove that outdated part of the comment for __io_worker_idle().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 0654b05e7e65 ("io_uring: One wqe per wq"), we have just a
single io_wqe instance embedded per io_wq. Drop the extra structure in
favor of accessing struct io_wq directly, cleaning up quite a bit of
dereferences and backpointers.
No functional changes intended. Tested with liburing's testsuite
and mmtests performance microbenchmarks. I didn't observe any
performance regressions.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230322011628.23359-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since we now have a single io_wqe per io_wq instead of per-node, and in
preparation to its removal, move the accounting into the parent
structure.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230322011628.23359-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On at least parisc, we have strict requirements on how we virtually map
an address that is shared between the application and the kernel. On
these platforms, IOU_PBUF_RING_MMAP should be used when setting up a
shared ring buffer for provided buffers. If the application is mapping
these pages and asking the kernel to pin+map them as well, then we have
no control over what virtual address we get in the kernel.
For that case, do a sanity check if SHM_COLOUR is defined, and disallow
the mapping request. The application must fall back to using
IOU_PBUF_RING_MMAP for this case, and liburing will do that transparently
with the set of helpers that it has.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add support for KASAN in the alloc_caches (apoll and netmsg_cache).
Thus, if something touches the unused caches, it will raise a KASAN
warning/exception.
It poisons the object when the object is put to the cache, and unpoisons
it when the object is gotten or freed.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20230223164353.2839177-2-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Having cache entries linked using the hlist format brings no benefit, and
also requires an unnecessary extra pointer address per cache entry.
Use the internal io_wq_work_node single-linked list for the internal
alloc caches (async_msghdr and async_poll)
This is required to be able to use KASAN on cache entries, since we do
not need to touch unused (and poisoned) cache entries when adding more
entries to the list.
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230223164353.2839177-2-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Right now io_wq allocates one io_wqe per NUMA node. As io_wq is now
bound to a task, the task basically uses only the NUMA local io_wqe, and
almost never changes NUMA nodes, thus, the other wqes are mostly
unused.
Allocate just one io_wqe embedded into io_wq, and uses all possible cpus
(cpu_possible_mask) in the io_wqe->cpumask.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20230310201107.4020580-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The ring mapped provided buffer rings rely on the application allocating
the memory for the ring, and then the kernel will map it. This generally
works fine, but runs into issues on some architectures where we need
to be able to ensure that the kernel and application virtual address for
the ring play nicely together. This at least impacts architectures that
set SHM_COLOUR, but potentially also anyone setting SHMLBA.
To use this variant of ring provided buffers, the application need not
allocate any memory for the ring. Instead the kernel will do so, and
the allocation must subsequently call mmap(2) on the ring with the
offset set to:
IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT)
to get a virtual address for the buffer ring. Normally the application
would allocate a suitable piece of memory (and correctly aligned) and
simply pass that in via io_uring_buf_reg.ring_addr and the kernel would
map it.
Outside of the setup differences, the kernel allocate + user mapped
provided buffer ring works exactly the same.
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for allowing flags to be set for registration, rename
the padding and use it for that.
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than rely on checking buffer_list->buf_pages or ->buf_nr_pages,
add a separate member that tracks if this is a ring mapped provided
buffer list or not.
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for allowing the kernel to allocate the provided buffer
rings and have the application mmap it instead, abstract out the
current method of pinning and mapping the user allocated ring.
No functional changes intended in this patch.
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some architectures have memory cache aliasing requirements (e.g. parisc)
if memory is shared between userspace and kernel. This patch fixes the
kernel to return an aliased address when asked by userspace via mmap().
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring hashes writes to a given file/inode so that it can serialize
them. This is useful if the file system needs exclusive access to the
file to perform the write, as otherwise we end up with a ton of io-wq
threads trying to lock the inode at the same time. This can cause
excessive system time.
But if the file system has flagged that it supports parallel O_DIRECT
writes, then there's no need to serialize the writes. Check for that
through FMODE_DIO_PARALLEL_WRITE and don't hash it if we don't need to.
In a basic test of 8 threads writing to a file on XFS on a gen2 Optane,
with each thread writing in 4k chunks, it improves performance from
~1350K IOPS (or ~5290MiB/sec) to ~1410K IOPS (or ~5500MiB/sec).
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmQmRYAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpl0vEACPcohqWcbGvxMuOwhrX5yUeRMIk/rBUOY3
/wmToHo9N7dHZJVyOYzGXcmTvedxXB53ea0iX/SYtTSU98Ku49UnCYDaypyo6Lag
Qi7jRYsAVRMQBg893cSNwqNpq8IMM8Lu7GDh1W8z3MYmWmyixouNAlsLcrn8r5Yd
3zBXuk+o5wS9Rz8RHyl3PeqiOACYB8aEYebvdjqcQS8pHokkXc9uQsNh+YFXdoB6
vg8KeBgASEGNtnjU5bA5iB8E2vcyx1j1qeFFdyIjcRs2mKAjs9qCOENZ35XKXY3i
J6FiWPg7uq5mk6XBjQ2P9vmeetRgcQgCHmEhaBd/H9VTDAOwK4KnlCQXIbxXdjkC
6+fIv+msldLLE/i5lF5UnmrffurcvLRS479hpZeC5UFnhVwD0v+gUcm9Y2eWEOqD
0Vp+YgkHowdo8ZsWXOWrv8maxo3mLyHGP1juXnTbw184hcLRyyiMN8cc2o4BJ1X/
e60tkIbVhjxNoG8Boifs0Ep6UhGc+2muEv+BnWmtf3wDSsWkye2dsyXuQEWUK+yI
8PGRPsttG+5X7UWtI/1AV9YV/NaRDpR8LIUPG2nKRkO1QwNUW2+DE098wW2GsCZs
KA6KrozxiLc4H36ll7JLQjNFFzDy6sjLtDoQV6XA6eFA5bmYxhzI73OC7npLRKgO
Jnc9Rvh7Zw==
=5IP9
-----END PGP SIGNATURE-----
Merge tag 'io_uring-6.3-2023-03-30' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Fix a regression with the poll retry, introduced in this merge window
(me)
- Fix a regression with the alloc cache not decrementing the member
count on removal. Also a regression from this merge window (Pavel)
- Fix race around rsrc node grabbing (Pavel)
* tag 'io_uring-6.3-2023-03-30' of git://git.kernel.dk/linux:
io_uring: fix poll/netmsg alloc caches
io_uring/rsrc: fix rogue rsrc node grabbing
io_uring/poll: clear single/double poll flags on poll arming
We increase cache->nr_cached when we free into the cache but don't
decrease when we take from it, so in some time we'll get an empty
cache with cache->nr_cached larger than IO_ALLOC_CACHE_MAX, that fails
io_alloc_cache_put() and effectively disables caching.
Fixes: 9b797a37c4 ("io_uring: add abstraction around apoll cache")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>