Commit Graph

163 Commits

Author SHA1 Message Date
Dylan Yudaken
9bb66906f2 io_uring: support multishot in recvmsg
Similar to multishot recv, this will require provided buffers to be
used. However recvmsg is much more complex than recv as it has multiple
outputs. Specifically flags, name, and control messages.

Support this by introducing a new struct io_uring_recvmsg_out with 4
fields. namelen, controllen and flags match the similar out fields in
msghdr from standard recvmsg(2), payloadlen is the length of the payload
following the header.
This struct is placed at the start of the returned buffer. Based on what
the user specifies in struct msghdr, the next bytes of the buffer will be
name (the next msg_namelen bytes), and then control (the next
msg_controllen bytes). The payload will come at the end. The return value
in the CQE is the total used size of the provided buffer.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220714110258.1336200-4-dylany@fb.com
[axboe: style fixups, see link]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:18 -06:00
Dylan Yudaken
72c531f8ef net: copy from user before calling __get_compat_msghdr
this is in preparation for multishot receive from io_uring, where it needs
to have access to the original struct user_msghdr.

functionally this should be a no-op.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220714110258.1336200-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
7fa875b8e5 net: copy from user before calling __copy_msghdr
this is in preparation for multishot receive from io_uring, where it needs
to have access to the original struct user_msghdr.

functionally this should be a no-op.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220714110258.1336200-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
6d2f75a0cf io_uring: support 0 length iov in buffer select in compat
Match up work done in "io_uring: allow iov_len = 0 for recvmsg and buffer
select", but for compat code path.

Fixes: a68caad69ce5 ("io_uring: allow iov_len = 0 for recvmsg and buffer select")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220708181838.1495428-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
e2df2ccb75 io_uring: fix multishot ending when not polled
If multishot is not actually polling then return IOU_OK rather than the
result.
If the result was > 0 this will confuse things further up the callstack
which expect a return <= 0.

Fixes: 1300ebb20286 ("io_uring: multishot recv")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220708181838.1495428-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Jens Axboe
43e0bbbd0b io_uring: add netmsg cache
For recvmsg/sendmsg, if they don't complete inline, we currently need
to allocate a struct io_async_msghdr for each request. This is a
somewhat large struct.

Hook up sendmsg/recvmsg to use the io_alloc_cache. This reduces the
alloc + free overhead considerably, yielding 4-5% of extra performance
running netbench.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Jens Axboe
9731bc9855 io_uring: impose max limit on apoll cache
Caches like this tend to grow to the peak size, and then never get any
smaller. Impose a max limit on the size, to prevent it from growing too
big.

A somewhat randomly chosen 512 is the max size we'll allow the cache
to get. If a batch of frees come in and would bring it over that, we
simply start kfree'ing the surplus.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Jens Axboe
9b797a37c4 io_uring: add abstraction around apoll cache
In preparation for adding limits, and one more user, abstract out the
core bits of the allocation+free cache.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Jens Axboe
9da7471ed1 io_uring: move apoll cache to poll.c
This is where it's used, move the flush handler in there.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Pavel Begunkov
e8375e43ca io_uring: consolidate hash_locked io-wq handling
Don't duplicate code disabling REQ_F_HASH_LOCKED for IO_URING_F_UNLOCKED
(i.e. io-wq), move the handling into __io_arm_poll_handler().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0ff0ffdfaa65b3d536131535c3dad3c63d9b7bb0.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Pavel Begunkov
b21a51e26e io_uring: clear REQ_F_HASH_LOCKED on hash removal
Instead of clearing REQ_F_HASH_LOCKED while arming a poll, unset the bit
when we're removing the entry from the table in io_poll_tw_hash_eject().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/02e48bb88d6f1480c94ac2924c43ad1fbd48e92a.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Pavel Begunkov
ceff501790 io_uring: don't race double poll setting REQ_F_ASYNC_DATA
Just as with io_poll_double_prepare() setting REQ_F_DOUBLE_POLL, we can
race with the first poll entry when setting REQ_F_ASYNC_DATA. Move it
under io_poll_double_prepare().

Fixes: a18427bb2d ("io_uring: optimise submission side poll_refs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/df6920f509c11115aa2bce8b34dc5fdb0eb98920.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Pavel Begunkov
7a121ced6e io_uring: don't miss setting REQ_F_DOUBLE_POLL
When adding a second poll entry we should set REQ_F_DOUBLE_POLL
unconditionally. We might race with the first entry removal but that
doesn't change the rule.

Fixes: a18427bb2d ("io_uring: optimise submission side poll_refs")
Reported-and-tested-by: syzbot+49950ba66096b1f0209b@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8b680d83ded07424db83e8745585e7a6d72826ef.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
cf0dd9527e io_uring: disable multishot recvmsg
recvmsg has semantics that do not make it trivial to extend to
multishot. Specifically it has user pointers and returns data in the
original parameter. In order to make this API useful these will need to be
somehow included with the provided buffers.

For now remove multishot for recvmsg as it is not useful.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220704140106.200167-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
e0486f3f7c io_uring: only trace one of complete or overflow
In overflow we see a duplcate line in the trace, and in some cases 3
lines (if initial io_post_aux_cqe fails).
Instead just trace once for each CQE

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-13-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
b3fdea6ecb io_uring: multishot recv
Support multishot receive for io_uring.
Typical server applications will run a loop where for each recv CQE it
requeues another recv/recvmsg.

This can be simplified by using the existing multishot functionality
combined with io_uring's provided buffers.
The API is to add the IORING_RECV_MULTISHOT flag to the SQE. CQEs will
then be posted (with IORING_CQE_F_MORE flag set) when data is available
and is read. Once an error occurs or the socket ends, the multishot will
be removed and a completion without IORING_CQE_F_MORE will be posted.

The benefit to this is that the recv is much more performant.
 * Subsequent receives are queued up straight away without requiring the
   application to finish a processing loop.
 * If there are more data in the socket (sat the provided buffer size is
   smaller than the socket buffer) then the data is immediately
   returned, improving batching.
 * Poll is only armed once and reused, saving CPU cycles

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-11-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
cbd2574854 io_uring: fix multishot accept ordering
Similar to multishot poll, drop multishot accept when CQE overflow occurs.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-10-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
a2da676376 io_uring: fix multishot poll on overflow
On overflow, multishot poll can still complete with the IORING_CQE_F_MORE
flag set.
If in the meantime the user clears a CQE and a the poll was cancelled then
the poll will post a CQE without the IORING_CQE_F_MORE (and likely result
-ECANCELED).

However when processing the application will encounter the non-overflow
CQE which indicates that there will be no more events posted. Typical
userspace applications would free memory associated with the poll in this
case.
It will then subsequently receive the earlier CQE which has overflowed,
which breaks the contract given by the IORING_CQE_F_MORE flag.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-9-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
52120f0fad io_uring: add allow_overflow to io_post_aux_cqe
Some use cases of io_post_aux_cqe would not want to overflow as is, but
might want to change the flags/result. For example multishot receive
requires in order CQE, and so if there is an overflow it would need to
stop receiving until the overflow is taken care of.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
114eccdf0e io_uring: add IOU_STOP_MULTISHOT return code
For multishot we want a way to signal the caller that multishot has ended
but also this might not be an error return.

For example sockets return 0 when closed, which should end a multishot
recv, but still have a CQE with result 0

Introduce IOU_STOP_MULTISHOT which does this and indicates that the return
code is stored inside req->cqe

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-7-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
2ba69707d9 io_uring: clean up io_poll_check_events return values
The values returned are a bit confusing, where 0 and 1 have implied
meaning, so add some definitions for them.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Dylan Yudaken
d4e097dae2 io_uring: recycle buffers on error
Rather than passing an error back to the user with a buffer attached,
recycle the buffer immediately.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Dylan Yudaken
5702196e7d io_uring: allow iov_len = 0 for recvmsg and buffer select
When using BUFFER_SELECT there is no technical requirement that the user
actually provides iov, and this removes one copy_from_user call.

So allow iov_len to be 0.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Dylan Yudaken
32f3c434b1 io_uring: restore bgid in io_put_kbuf
Attempt to restore bgid. This is needed when recycling unused buffers as
the next time around it will want the correct bgid.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Dylan Yudaken
b8c015598c io_uring: allow 0 length for buffer select
If user gives 0 for length, we can set it from the available buffer size.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Pavel Begunkov
6e73dffbb9 io_uring: let to set a range for file slot allocation
From recently io_uring provides an option to allocate a file index for
operation registering fixed files. However, it's utterly unusable with
mixed approaches when for a part of files the userspace knows better
where to place it, as it may race and users don't have any sane way to
pick a slot and hoping it will not be taken.

Let the userspace to register a range of fixed file slots in which the
auto-allocation happens. The use case is splittting the fixed table in
two parts, where on of them is used for auto-allocation and another for
slot-specified operations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66ab0394e436f38437cf7c44676e1920d09687ad.1656154403.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Jens Axboe
e6130eba8a io_uring: add support for passing fixed file descriptors
With IORING_OP_MSG_RING, one ring can send a message to another ring.
Extend that support to also allow sending a fixed file descriptor to
that ring, enabling one ring to pass a registered descriptor to another
one.

Arguments are extended to pass in:

sqe->addr3	fixed file slot in source ring
sqe->file_index	fixed file slot in destination ring

IORING_OP_MSG_RING is extended to take a command argument in sqe->addr.
If set to zero (or IORING_MSG_DATA), it sends just a message like before.
If set to IORING_MSG_SEND_FD, a fixed file descriptor is sent according
to the above arguments.

Two common use cases for this are:

1) Server needs to be shutdown or restarted, pass file descriptors to
   another onei

2) Backend is split, and one accepts connections, while others then get
  the fd passed and handle the actual connection.

Both of those are classic SCM_RIGHTS use cases, and it's not possible to
support them with direct descriptors today.

By default, this will post a CQE to the target ring, similarly to how
IORING_MSG_DATA does it. If IORING_MSG_RING_CQE_SKIP is set, no message
is posted to the target ring. The issuer is expected to notify the
receiver side separately.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Jens Axboe
f110ed8498 io_uring: split out fixed file installation and removal
Put it with the filetable code, which is where it belongs. While doing
so, have the helpers take a ctx rather than an io_kiocb. It doesn't make
sense to use a request, as it's not an operation on the request itself.
It applies to the ring itself.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Pavel Begunkov
fbb8bb0291 io_uring: remove ctx->refs pinning on enter
io_uring_enter() takes ctx->refs, which was previously preventing racing
with register quiesce. However, as register now doesn't touch the refs,
we can freely kill extra ctx pinning and rely on the fact that we're
holding a file reference preventing the ring from being destroyed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a11c57ad33a1be53541fce90669c1b79cf4d8940.1656153286.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Pavel Begunkov
3273c4407a io_uring: don't check file ops of registered rings
Registered rings are per definitions io_uring files, so we don't need to
additionally verify them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/425cd64fd885b8e329a46c205ee811987691baaf.1656153286.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Pavel Begunkov
ad8b261d83 io_uring: remove extra TIF_NOTIFY_SIGNAL check
io_run_task_work() accounts for TIF_NOTIFY_SIGNAL, so no need to have an
second check in io_run_task_work_sig().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/52ce41a592ad904511697f432141e5690fd4b968.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Pavel Begunkov
3218e5d32d io_uring: fuse fallback_node and normal tw node
Now as both normal and fallback paths use llist, just keep one node head
in struct io_task_work and kill off ->fallback_node.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d04ebde409f7b162fe247b361b4486b193293e46.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Pavel Begunkov
37c7bd31b3 io_uring: improve io_fail_links()
io_fail_links() is called with ->completion_lock held and for that
reason we'd want to keep it as small as we can. Instead of doing
__io_req_complete_post() for each linked request under the lock, fail
them in a task_work handler under ->uring_lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a2f68708b970a21f4e84ddfa7b3abd67a8fffb27.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Jens Axboe
fe991a7688 io_uring: move POLLFREE handling to separate function
We really don't care about this at all in terms of performance. Outside
of having it already be marked unlikely(), shove it into a separate
__cold function.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Hao Xu
795bbbc8a9 io_uring: kbuf: inline io_kbuf_recycle_ring()
Make io_kbuf_recycle_ring() inline since it is the fast path of
provided buffer.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220623130126.179232-1-hao.xu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Pavel Begunkov
49f1c68e04 io_uring: optimise submission side poll_refs
The final poll_refs put in __io_arm_poll_handler() takes quite some
cycles. When we're arming from the original task context task_work won't
be run, so in this case we can assume that we won't race with task_works
and so not take the initial ownership ref.

One caveat is that after arming a poll we may race with it, so we have
to add a bunch of io_poll_get_ownership() hidden inside of
io_poll_can_finish_inline() whenever we want to complete arming inline.
For the same reason we can't just set REQ_F_DOUBLE_POLL in
__io_queue_proc() and so need to sync with the first poll entry by
taking its wq head lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8825315d7f5e182ac1578a031e546f79b1c97d01.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Pavel Begunkov
de08356f48 io_uring: refactor poll arm error handling
__io_arm_poll_handler() errors parsing is a horror, in case it failed it
returns 0 and the caller is expected to look at ipt.error, which already
led us to a number of problems before.

When it returns a valid mask, leave it as it's not, i.e. return 1 and
store the mask in ipt.result_mask. In case of a failure that can be
handled inline return an error code (negative value), and return 0 if
__io_arm_poll_handler() took ownership of the request and will complete
it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/018cacdaef5fe95d7dc56b32e85d752cab7607f6.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Pavel Begunkov
063a007996 io_uring: change arm poll return values
The rules for __io_arm_poll_handler()'s result parsing are complicated,
as the first step don't pass return a mask but pass back a positive
return code and fill ipt->result_mask.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/529e29e9f97f2e6e383ccd44234d8b576a83a921.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Pavel Begunkov
5204aa8c43 io_uring: add a helper for apoll alloc
Extract a helper function for apoll allocation, makes the code easier to
read.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f93282b47dd678e805dd0d7097f66968ced495c.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Pavel Begunkov
13a99017ff io_uring: remove events caching atavisms
Remove events argument from *io_poll_execute(), it's not needed and not
used.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/12efd4e15c6a90cf9e5b59807cfcb57852b51dc7.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:16 -06:00
Pavel Begunkov
0638cd7be2 io_uring: clean poll ->private flagging
We store a req pointer in wqe->private but also take one bit to mark
double poll entries. Replace macro helpers with inline functions for
better type checking and also name the double flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a61240555c64ac0b7a9b0eb59a9efeb638a35a4.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Jens Axboe
78a861b949 io_uring: add sync cancelation API through io_uring_register()
The io_uring cancelation API is async, like any other API that we expose
there. For the case of finding a request to cancel, or not finding one,
it is fully sync in that when submission returns, the CQE for both the
cancelation request and the targeted request have been posted to the
CQ ring.

However, if the targeted work is being executed by io-wq, the API can
only start the act of canceling it. This makes it difficult to use in
some circumstances, as the caller then has to wait for the CQEs to come
in and match on the same cancelation data there.

Provide a IORING_REGISTER_SYNC_CANCEL command for io_uring_register()
that does sync cancelations, always. For the io-wq case, it'll wait
for the cancelation to come in before returning. The only expected
returns from this API is:

0		Request found and canceled fine.
> 0		Requests found and canceled. Only happens if asked to
		cancel multiple requests, and if the work wasn't in
		progress.
-ENOENT		Request not found.
-ETIME		A timeout on the operation was requested, but the timeout
		expired before we could cancel.

and we won't get -EALREADY via this API.

If the timeout value passed in is -1 (tv_sec and tv_nsec), then that
means that no timeout is requested. Otherwise, the timespec passed in
is the amount of time the sync cancel will wait for a successful
cancelation.

Link: https://github.com/axboe/liburing/discussions/608
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Jens Axboe
7d8ca72501 io_uring: add IORING_ASYNC_CANCEL_FD_FIXED cancel flag
In preparation for not having a request to pass in that carries this
state, add a separate cancelation flag that allows the caller to ask
for a fixed file for cancelation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Jens Axboe
88f52eaad2 io_uring: have cancelation API accept io_uring_task directly
We just use the io_kiocb passed in to find the io_uring_task, and we
already pass in the ctx via cd->ctx anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Hao Xu
024b8fde33 io_uring: kbuf: kill __io_kbuf_recycle()
__io_kbuf_recycle() is only called in io_kbuf_recycle(). Kill it and
tweak the code so that the legacy pbuf and ring pbuf code become clear

Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220622055551.642370-1-hao.xu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Dylan Yudaken
c6dd763c24 io_uring: trace task_work_run
trace task_work_run to help provide stats on how often task work is run
and what batch sizes are coming through.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-9-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Dylan Yudaken
3a0c037b0e io_uring: batch task_work
Batching task work up is an important performance optimisation, as
task_work_add is expensive.

In order to keep the semantics replace the task_list with a fake node
while processing the old list, and then do a cmpxchg at the end to see if
there is more work.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Dylan Yudaken
923d159247 io_uring: introduce llist helpers
Introduce helpers to atomically switch llist.

Will later move this into common code

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Dylan Yudaken
f88262e60b io_uring: lockless task list
With networking use cases we see contention on the spinlock used to
protect the task_list when multiple threads try and add completions at once.
Instead we can use a lockless list, and assume that the first caller to
add to the list is responsible for kicking off task work.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Dylan Yudaken
c34398a8c0 io_uring: remove __io_req_task_work_add
this is no longer needed as there is only one caller

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00