linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-16 16:54:20 +08:00

Author	SHA1	Message	Date
Pavel Begunkov	6b2e11d782	io_uring: tighten task exit cancellations commit `f8b632e89a` upstream. io_uring_cancel_generic() should retry if any state changes like a request is completed, however in case of a task exit it only goes for another loop and avoids schedule() if any tracked (i.e. REQ_F_INFLIGHT) request got completed. Let's assume we have a non-tracked request executing in iowq and a tracked request linked to it. Let's also assume io_uring_cancel_generic() fails to find and cancel the request, i.e. via io_run_local_work(), which may happen as io-wq has gaps. Next, the request logically completes, io-wq still hold a ref but queues it for completion via tw, which happens in io_uring_try_cancel_requests(). After, right before prepare_to_wait() io-wq puts the request, grabs the linked one and tries executes it, e.g. arms polling. Finally the cancellation loop calls prepare_to_wait(), there are no tw to run, no tracked request was completed, so the tctx_inflight() check passes and the task is put to indefinite sleep. Cc: stable@vger.kernel.org Fixes: `3f48cf18f8` ("io_uring: unify files and task cancel") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/acac7311f4e02ce3c43293f8f1fda9c705d158f1.1721819383.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-19 05:45:14 +02:00
Ming Lei	88678723cb	io_uring: fail NOP if non-zero op flags is passed in commit `3d8f874bd6` upstream. The NOP op flags should have been checked from beginning like any other opcode, otherwise NOP may not be extended with the op flags. Given both liburing and Rust io-uring crate always zeros SQE op flags, just ignore users which play raw NOP uring interface without zeroing SQE, because NOP is just for test purpose. Then we can save one NOP2 opcode. Suggested-by: Jens Axboe <axboe@kernel.dk> Fixes: `2b188cc1bb` ("Add io_uring IO interface") Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240510035031.78874-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-06-16 13:39:53 +02:00
Jens Axboe	a180ca3369	io_uring: ensure '0' is returned on file registration success A previous backport mistakenly removed code that cleared 'ret' to zero, as the SCM logging was performed. Fix up the return value so we don't return an errant error on fixed file registration. Fixes: `d909d381c3` ("io_uring: drop any code related to SCM_RIGHTS") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-10 16:19:37 +02:00
Jens Axboe	deb5946255	io_uring: don't save/restore iowait state [ Upstream commit `6f0974eccb` ] This kind of state is per-syscall, and since we're doing the waiting off entering the io_uring_enter(2) syscall, there's no way that iowait can already be set for this case. Simplify it by setting it if we need to, and always clearing it to 0 when done. Fixes: `7b72d661f1` ("io_uring: gate iowait schedule on having pending requests") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-03-26 18:21:35 -04:00
Jens Axboe	d909d381c3	io_uring: drop any code related to SCM_RIGHTS Commit `6e5e6d2749` upstream. This is dead code after we dropped support for passing io_uring fds over SCM_RIGHTS, get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-03-26 18:21:11 -04:00
Jens Axboe	9dd3863e3f	io_uring/unix: drop usage of io_uring socket Commit `a4104821ad` upstream. Since we no longer allow sending io_uring fds over SCM_RIGHTS, move to using io_is_uring_fops() to detect whether this is a io_uring fd or not. With that done, kill off io_uring_get_socket() as nobody calls it anymore. This is in preparation to yanking out the rest of the core related to unix gc with io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-03-26 18:21:11 -04:00
Jens Axboe	79ea1c6f06	io_uring/rw: ensure io->bytes_done is always initialized commit `0a535eddbe` upstream. If IOSQE_ASYNC is set and we fail importing an iovec for a readv or writev request, then we leave ->bytes_done uninitialized and hence the eventual failure CQE posted can potentially have a random res value rather than the expected -EINVAL. Setup ->bytes_done before potentially failing, so we have a consistent value if we fail the request early. Cc: stable@vger.kernel.org Reported-by: xingwei lee <xrivendell7@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-01-25 14:52:48 -08:00
Pavel Begunkov	bcedd497b3	io_uring/af_unix: disable sending io_uring over sockets commit `705318a99a` upstream. File reference cycles have caused lots of problems for io_uring in the past, and it still doesn't work exactly right and races with unix_stream_read_generic(). The safest fix would be to completely disallow sending io_uring files via sockets via SCM_RIGHT, so there are no possible cycles invloving registered files and thus rendering SCM accounting on the io_uring side unnecessary. Cc: <stable@vger.kernel.org> Fixes: `0091bfc817` ("io_uring/af_unix: defer registered files gc to io_uring release") Reported-and-suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c716c88321939156909cfa1bd8b0faaf1c804103.1701868795.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-13 18:36:46 +01:00
Pavel Begunkov	ea3291cb90	io_uring: fix mutex_unlock with unreferenced ctx commit `f7b32e7850` upstream. Callers of mutex_unlock() have to make sure that the mutex stays alive for the whole duration of the function call. For io_uring that means that the following pattern is not valid unless we ensure that the context outlives the mutex_unlock() call. mutex_lock(&ctx->uring_lock); req_put(req); // typically via io_req_task_submit() mutex_unlock(&ctx->uring_lock); Most contexts are fine: io-wq pins requests, syscalls hold the file, task works are taking ctx references and so on. However, the task work fallback path doesn't follow the rule. Cc: <stable@vger.kernel.org> Fixes: `04fc6c802d` ("io_uring: save ctx put/get for task_work submit") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/io-uring/CAG48ez3xSoYb+45f1RLtktROJrpiDQ1otNvdR+YLQf7m+Krj5Q@mail.gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-13 18:36:42 +01:00
Keith Busch	313a34d1c0	io_uring: fix off-by one bvec index commit `d6fef34ee4` upstream. If the offset equals the bv_len of the first registered bvec, then the request does not include any of that first bvec. Skip it so that drivers don't have to deal with a zero length bvec, which was observed to break NVMe's PRP list creation. Cc: stable@vger.kernel.org Fixes: `bd11b3a391` ("io_uring: don't use iov_iter_advance() for fixed buffers") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231120221831.2646460-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-03 07:31:26 +01:00
Charles Mirabile	35b5d86e43	io_uring/fs: consider link->flags when getting path for LINKAT commit `8479063f1f` upstream. In order for `AT_EMPTY_PATH` to work as expected, the fact that the user wants that behavior needs to make it to `getname_flags` or it will return ENOENT. Fixes: `cf30da90bc` ("io_uring: add support for IORING_OP_LINKAT") Cc: <stable@vger.kernel.org> Link: https://github.com/axboe/liburing/issues/995 Signed-off-by: Charles Mirabile <cmirabil@redhat.com> Link: https://lore.kernel.org/r/20231120105545.1209530-1-cmirabil@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-03 07:31:24 +01:00
Jens Axboe	3d7912710e	io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid commit `7644b1a1c9` upstream. We could race with SQ thread exit, and if we do, we'll hit a NULL pointer dereference when the thread is cleared. Grab the SQPOLL data lock before attempting to get the task cpu and pid for fdinfo, this ensures we have a stable view of it. Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=218032 Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: He Gao <hegao@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-11-28 16:56:36 +00:00
Jens Axboe	a2b1d486fb	io_uring/fs: remove sqe->rw_flags checking from LINKAT commit `a52d4f6575` upstream. This is unionized with the actual link flags, so they can of course be set and they will be evaluated further down. If not we fail any LINKAT that has to set option flags. Fixes: `cf30da90bc` ("io_uring: add support for IORING_OP_LINKAT") Cc: stable@vger.kernel.org Reported-by: Thomas Leonard <talex5@gmail.com> Link: https://github.com/axboe/liburing/issues/955 Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-10-06 13:18:22 +02:00
Pavel Begunkov	c29848249f	io_uring: break iopolling on signal [ upstream commit `dc314886cb` ] Don't keep spinning iopoll with a signal set. It'll eventually return back, e.g. by virtue of need_resched(), but it's not a nice user experience. Cc: stable@vger.kernel.org Fixes: `def596e955` ("io_uring: support for IO polling") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/eeba551e82cad12af30c3220125eb6cb244cc94c.1691594339.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-09-19 12:22:54 +02:00
Pavel Begunkov	0def123f12	io_uring: break out of iowq iopoll on teardown [ upstream commit `45500dc4e0` ] io-wq will retry iopoll even when it failed with -EAGAIN. If that races with task exit, which sets TIF_NOTIFY_SIGNAL for all its workers, such workers might potentially infinitely spin retrying iopoll again and again and each time failing on some allocation / waiting / etc. Don't keep spinning if io-wq is dying. Fixes: `561fb04a6a` ("io_uring: replace workqueue usage with io-wq") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-09-19 12:22:54 +02:00
Pavel Begunkov	1a0aba2bf2	io_uring: always lock in io_apoll_task_func From: Dylan Yudaken <dylany@meta.com> [ upstream commit `c06c6c5d27` ] This is required for the failure case (io_req_complete_failed) and is missing. The alternative would be to only lock in the failure path, however all of the non-error paths in io_poll_check_events that do not do not return IOU_POLL_NO_ACTION end up locking anyway. The only extraneous lock would be for the multishot poll overflowing the CQE ring, however multishot poll would probably benefit from being locked as it will allow completions to be batched. So it seems reasonable to lock always. Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221124093559.3780686-3-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-09-19 12:22:53 +02:00
Pavel Begunkov	cc6b09671d	io_uring: fix drain stalls by invalid SQE [ Upstream commit `cfdbaa3a29` ] cq_extra is protected by ->completion_lock, which io_get_sqe() misses. The bug is harmless as it doesn't happen in real life, requires invalid SQ index array and racing with submission, and only messes up the userspace, i.e. stall requests execution but will be cleaned up on ring destruction. Fixes: `15641e4270` ("io_uring: don't cache number of dropped SQEs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66096d54651b1a60534bb2023f2947f09f50ef73.1691538547.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-09-19 12:22:39 +02:00
Aleksa Sarai	a7cedc2b76	io_uring: correct check for O_TMPFILE Commit `72dbde0f2a` upstream. O_TMPFILE is actually __O_TMPFILE\|O_DIRECTORY. This means that the old check for whether RESOLVE_CACHED can be used would incorrectly think that O_DIRECTORY could not be used with RESOLVE_CACHED. Cc: stable@vger.kernel.org # v5.12+ Fixes: `3a81fd0204` ("io_uring: enable LOOKUP_CACHED path resolution for filename lookups") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20230807-resolve_cached-o_tmpfile-v3-1-e49323e1ef6f@cyphar.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-16 18:21:59 +02:00
Jens Axboe	c12fa4ac89	io_uring: gate iowait schedule on having pending requests Commit `7b72d661f1` upstream. A previous commit made all cqring waits marked as iowait, as a way to improve performance for short schedules with pending IO. However, for use cases that have a special reaper thread that does nothing but wait on events on the ring, this causes a cosmetic issue where we know have one core marked as being "busy" with 100% iowait. While this isn't a grave issue, it is confusing to users. Rather than always mark us as being in iowait, gate setting of current->in_iowait to 1 by whether or not the waiting task has pending requests. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/CAMEGJJ2RxopfNQ7GNLhr7X9=bHXKo+G5OOe0LUq=+UgLXsv1Xg@mail.gmail.com/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=217699 Link: https://bugzilla.kernel.org/show_bug.cgi?id=217700 Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Phil Elwell <phil@raspberrypi.com> Tested-by: Andres Freund <andres@anarazel.de> Fixes: `8a796565ce` ("io_uring: Use io_schedule* in cqring wait") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-11 15:13:48 +02:00
Jens Axboe	3359fdf49d	io_uring: treat -EAGAIN for REQ_F_NOWAIT as final for io-wq commit `a9be202269` upstream. io-wq assumes that an issue is blocking, but it may not be if the request type has asked for a non-blocking attempt. If we get -EAGAIN for that case, then we need to treat it as a final result and not retry or arm poll for it. Cc: stable@vger.kernel.org # 5.10+ Link: https://github.com/axboe/liburing/issues/897 Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-03 10:22:47 +02:00
Ondrej Mosnacek	21d063d27b	io_uring: don't audit the capability check in io_uring_create() [ Upstream commit `6adc2272aa` ] The check being unconditional may lead to unwanted denials reported by LSMs when a process has the capability granted by DAC, but denied by an LSM. In the case of SELinux such denials are a problem, since they can't be effectively filtered out via the policy and when not silenced, they produce noise that may hide a true problem or an attack. Since not having the capability merely means that the created io_uring context will be accounted against the current user's RLIMIT_MEMLOCK limit, we can disable auditing of denials for this check by using ns_capable_noaudit() instead of capable(). Fixes: `2b188cc1bb` ("Add io_uring IO interface") Link: https://bugzilla.redhat.com/show_bug.cgi?id=2193317 Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/20230718115607.65652-1-omosnace@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-08-03 10:22:26 +02:00
Jens Axboe	d9e1cfae8d	io_uring: add reschedule point to handle_tw_list() Commit `f586800854` upstream. If CONFIG_PREEMPT_NONE is set and the task_work chains are long, we could be running into issues blocking others for too long. Add a reschedule check in handle_tw_list(), and flush the ctx if we need to reschedule. Cc: stable@vger.kernel.org # 5.10+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-07-23 13:47:38 +02:00
Andres Freund	f8307d862c	io_uring: Use io_schedule* in cqring wait Commit `8a796565ce` upstream. I observed poor performance of io_uring compared to synchronous IO. That turns out to be caused by deeper CPU idle states entered with io_uring, due to io_uring using plain schedule(), whereas synchronous IO uses io_schedule(). The losses due to this are substantial. On my cascade lake workstation, t/io_uring from the fio repository e.g. yields regressions between 20% and 40% with the following command: ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0 This is repeatable with different filesystems, using raw block devices and using different block devices. Use io_schedule_prepare() / io_schedule_finish() in io_cqring_wait_schedule() to address the difference. After that using io_uring is on par or surpassing synchronous IO (using registered files etc makes it reliably win, but arguably is a less fair comparison). There are other calls to schedule() in io_uring/, but none immediately jump out to be similarly situated, so I did not touch them. Similarly, it's possible that mutex_lock_io() should be used, but it's not clear if there are cases where that matters. Cc: stable@vger.kernel.org # 5.10+ Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Andres Freund <andres@anarazel.de> Link: https://lore.kernel.org/r/20230707162007.194068-1-andres@anarazel.de [axboe: minor style fixup] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-07-23 13:47:38 +02:00
Jens Axboe	8e29835366	io_uring: wait interruptibly for request completions on exit commit `4826c59453` upstream. WHen the ring exits, cleanup is done and the final cancelation and waiting on completions is done by io_ring_exit_work. That function is invoked by kworker, which doesn't take any signals. Because of that, it doesn't really matter if we wait for completions in TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state. However, it does matter to the hung task detection checker! Normally we expect cancelations and completions to happen rather quickly. Some test cases, however, will exit the ring and park the owning task stopped (eg via SIGSTOP). If the owning task needs to run task_work to complete requests, then io_ring_exit_work won't make any progress until the task is runnable again. Hence io_ring_exit_work can trigger the hung task detection, which is particularly problematic if panic-on-hung-task is enabled. As the ring exit doesn't take signals to begin with, have it wait interruptibly rather than uninterruptibly. io_uring has a separate stuck-exit warning that triggers independently anyway, so we're not really missing anything by making this switch. Cc: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/r/b0e4aaef-7088-56ce-244c-976edeac0e66@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-07-23 13:47:32 +02:00
Jens Axboe	fb348857e7	io_uring: ensure IOPOLL locks around deferred work No direct upstream commit exists for this issue. It was fixed in 5.18 as part of a larger rework of the completion side. io_commit_cqring() writes the CQ ring tail to make it visible, but it also kicks off any deferred work we have. A ring setup with IOPOLL does not need any locking around the CQ ring updates, as we're always under the ctx uring_lock. But if we have deferred work that needs processing, then io_queue_deferred() assumes that the completion_lock is held, as it is for !IOPOLL. Add a lockdep assertion to check and document this fact, and have io_iopoll_complete() check if we have deferred work and run that separately with the appropriate lock grabbed. Cc: stable@vger.kernel.org # 5.10, 5.15 Reported-by: dghost david <daviduniverse18@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-07-23 13:47:18 +02:00
Jens Axboe	96987c383c	io_uring/net: disable partial retries for recvmsg with cmsg Commit `78d0d2063b` upstream. We cannot sanely handle partial retries for recvmsg if we have cmsg attached. If we don't, then we'd just be overwriting the initial cmsg header on retries. Alternatively we could increment and handle this appropriately, but it doesn't seem worth the complication. Move the MSG_WAITALL check into the non-multishot case while at it, since MSG_WAITALL is explicitly disabled for multishot anyway. Link: https://lore.kernel.org/io-uring/0b0d4411-c8fd-4272-770b-e030af6919a0@kernel.dk/ Cc: stable@vger.kernel.org # 5.10+ Reported-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-06-28 10:29:44 +02:00
Jens Axboe	25a543ca30	io_uring/net: clear msg_controllen on partial sendmsg retry Commit `b1dc492087` upstream. If we have cmsg attached AND we transferred partial data at least, clear msg_controllen on retry so we don't attempt to send that again. Cc: stable@vger.kernel.org # 5.10+ Fixes: `cac9e4418f` ("io_uring/net: save msghdr->msg_control for retries") Reported-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-06-28 10:29:44 +02:00
Jens Axboe	34a7e5021a	io_uring/net: save msghdr->msg_control for retries Commit `cac9e4418f` upstream. If the application sets ->msg_control and we have to later retry this command, or if it got queued with IOSQE_ASYNC to begin with, then we need to retain the original msg_control value. This is due to the net stack overwriting this field with an in-kernel pointer, to copy it in. Hitting that path for the second time will now fail the copy from user, as it's attempting to copy from a non-user address. Cc: stable@vger.kernel.org # 5.10+ Link: https://github.com/axboe/liburing/issues/880 Reported-and-tested-by: Marek Majkowski <marek@cloudflare.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-06-28 10:29:44 +02:00
Jens Axboe	0e388fce7a	io_uring: hold uring mutex around poll removal Snipped from commit `9ca9fb24d5` upstream. While reworking the poll hashing in the v6.0 kernel, we ended up grabbing the ctx->uring_lock in poll update/removal. This also fixed a bug with linked timeouts racing with timeout expiry and poll removal. Bring back just the locking fix for that. Reported-and-tested-by: Querijn Voet <querijnqyn@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-06-21 15:59:14 +02:00
Fedor Pchelkin	f4ba55411c	io_uring: avoid null-ptr-deref in io_arm_poll_handler No upstream commit exists for this commit. The issue was introduced with backporting upstream commit `c16bda3759` ("io_uring/poll: allow some retries for poll triggering spuriously"). Memory allocation can possibly fail causing invalid pointer be dereferenced just before comparing it to NULL value. Move the pointer check in proper place (upstream has the similar location of the check). In case the request has REQ_F_POLLED flag up, apoll can't be NULL so no need to check there. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-22 13:31:37 +01:00
Jens Axboe	345fb368e5	io_uring/poll: allow some retries for poll triggering spuriously commit `c16bda3759` upstream. If we get woken spuriously when polling and fail the operation with -EAGAIN again, then we generally only allow polling again if data had been transferred at some point. This is indicated with REQ_F_PARTIAL_IO. However, if the spurious poll triggers when the socket was originally empty, then we haven't transferred data yet and we will fail the poll re-arm. This either punts the socket to io-wq if it's blocking, or it fails the request with -EAGAIN if not. Neither condition is desirable, as the former will slow things down, while the latter will make the application confused. We want to ensure that a repeated poll trigger doesn't lead to infinite work making no progress, that's what the REQ_F_PARTIAL_IO check was for. But it doesn't protect against a loop post the first receive, and it's unnecessarily strict if we started out with an empty socket. Add a somewhat random retry count, just to put an upper limit on the potential number of retries that will be done. This should be high enough that we won't really hit it in practice, unless something needs to be aborted anyway. Cc: stable@vger.kernel.org # v5.10+ Link: https://github.com/axboe/liburing/issues/364 Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-10 09:40:04 +01:00
David Lamparter	7e8cd208e9	io_uring: remove MSG_NOSIGNAL from recvmsg commit `7605c43d67` upstream. MSG_NOSIGNAL is not applicable for the receiving side, SIGPIPE is generated when trying to write to a "broken pipe". AF_PACKET's packet_recvmsg() does enforce this, giving back EINVAL when MSG_NOSIGNAL is set - making it unuseable in io_uring's recvmsg. Remove MSG_NOSIGNAL from io_recvmsg_prep(). Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: David Lamparter <equinox@diac24.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230224150123.128346-1-equinox@diac24.net Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-10 09:40:04 +01:00
Pavel Begunkov	dde0d0dfbd	io_uring/rsrc: disallow multi-source reg buffers commit `edd4782696` upstream. If two or more mappings go back to back to each other they can be passed into io_uring to be registered as a single registered buffer. That would even work if mappings came from different sources, e.g. it's possible to mix in this way anon pages and pages from shmem or hugetlb. That is not a problem but it'd rather be less prone if we forbid such mixing. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-10 09:40:04 +01:00
Jens Axboe	abd54d87da	io_uring: add a conditional reschedule to the IOPOLL cancelation loop commit `fcc926bb85` upstream. If the kernel is configured with CONFIG_PREEMPT_NONE, we could be sitting in a tight loop reaping events but not giving them a chance to finish. This results in a trace ala: rcu: INFO: rcu_sched self-detected stall on CPU rcu: 2-...!: (5249 ticks this GP) idle=935c/1/0x4000000000000000 softirq=4265/4274 fqs=1 (t=5251 jiffies g=465 q=4135 ncpus=4) rcu: rcu_sched kthread starved for 5249 jiffies! g465 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0 rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior. rcu: RCU grace-period kthread stack dump: task:rcu_sched state:R running task stack:0 pid:12 ppid:2 flags:0x00000008 Call trace: __switch_to+0xb0/0xc8 __schedule+0x43c/0x520 schedule+0x4c/0x98 schedule_timeout+0xbc/0xdc rcu_gp_fqs_loop+0x308/0x344 rcu_gp_kthread+0xd8/0xf0 kthread+0xb8/0xc8 ret_from_fork+0x10/0x20 rcu: Stack dump where RCU GP kthread last ran: Task dump for CPU 0: task:kworker/u8:10 state:R running task stack:0 pid:89 ppid:2 flags:0x0000000a Workqueue: events_unbound io_ring_exit_work Call trace: __switch_to+0xb0/0xc8 0xffff0000c8fefd28 CPU: 2 PID: 95 Comm: kworker/u8:13 Not tainted 6.2.0-rc5-00042-g40316e337c80-dirty #2759 Hardware name: linux,dummy-virt (DT) Workqueue: events_unbound io_ring_exit_work pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : io_do_iopoll+0x344/0x360 lr : io_do_iopoll+0xb8/0x360 sp : ffff800009bebc60 x29: ffff800009bebc60 x28: 0000000000000000 x27: 0000000000000000 x26: ffff0000c0f67d48 x25: ffff0000c0f67840 x24: ffff800008950024 x23: 0000000000000001 x22: 0000000000000000 x21: ffff0000c27d3200 x20: ffff0000c0f67840 x19: ffff0000c0f67800 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000001 x13: 0000000000000001 x12: 0000000000000000 x11: 0000000000000179 x10: 0000000000000870 x9 : ffff800009bebd60 x8 : ffff0000c27d3ad0 x7 : fefefefefefefeff x6 : 0000646e756f626e x5 : ffff0000c0f67840 x4 : 0000000000000000 x3 : ffff0000c2398000 x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: io_do_iopoll+0x344/0x360 io_uring_try_cancel_requests+0x21c/0x334 io_ring_exit_work+0x90/0x40c process_one_work+0x1a4/0x254 worker_thread+0x1ec/0x258 kthread+0xb8/0xc8 ret_from_fork+0x10/0x20 Add a cond_resched() in the cancelation IOPOLL loop to fix this. Cc: stable@vger.kernel.org # 5.10+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-10 09:40:04 +01:00
Jens Axboe	337eb887c7	io_uring: mark task TASK_RUNNING before handling resume/task work commit `2f2bb1ffc9` upstream. Just like for task_work, set the task mode to TASK_RUNNING before doing potential resume work. We're not holding any locks at this point, but we may have already set the task state to TASK_INTERRUPTIBLE in preparation for going to sleep waiting for events. Ensure that we set it back to TASK_RUNNING if we have work to process, to avoid warnings on calling blocking operations with !TASK_RUNNING. Fixes: `b5d3ae202f` ("io_uring: handle TIF_NOTIFY_RESUME when checking for task_work") Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202302062208.24d3e563-oliver.sang@intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-10 09:40:04 +01:00
Jens Axboe	54df6c5edf	io_uring: handle TIF_NOTIFY_RESUME when checking for task_work commit `b5d3ae202f` upstream. If TIF_NOTIFY_RESUME is set, then we need to call resume_user_mode_work() for PF_IO_WORKER threads. They never return to usermode, hence never get a chance to process any items that are marked by this flag. Most notably this includes the final put of files, but also any throttling markers set by block cgroups. Cc: stable@vger.kernel.org # 5.10+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-10 09:40:03 +01:00
Jens Axboe	937c15e27a	io_uring: ensure that io_init_req() passes in the right issue_flags We can't use 0 here, as io_init_req() is always invoked with the ctx uring_lock held. Newer kernels have IO_URING_F_UNLOCKED for this, but previously we used IO_URING_F_NONBLOCK to indicate this as well. Fixes: `cf7f9cd500` ("io_uring: add missing lock in io_get_file_fixed") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-03 15:14:08 +01:00
Bing-Jhong Billy Jheng	cf7f9cd500	io_uring: add missing lock in io_get_file_fixed io_get_file_fixed will access io_uring's context. Lock it if it is invoked unlocked (eg via io-wq) to avoid a race condition with fixed files getting unregistered. No single upstream patch exists for this issue, it was fixed as part of the file assignment changes that went into the 5.18 cycle. Signed-off-by: Jheng, Bing-Jhong Billy <billy@starlabs.sg> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-03 11:45:54 +01:00
Jens Axboe	4b6f8263e9	io_uring/rw: remove leftover debug statement commit `5c61795ea9` upstream. This debug statement was never meant to go into the upstream release, kill it off before it ends up in a release. It was just part of the testing for the initial version of the patch. Fixes: `2ec33a6c3c` ("io_uring/rw: ensure kiocb_end_write() is always called") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-01-24 07:22:49 +01:00
Jens Axboe	b10acfcd61	io_uring/rw: ensure kiocb_end_write() is always called commit `2ec33a6c3c` upstream. A previous commit moved the notifications and end-write handling, but it is now missing a few spots where we also want to call both of those. Without that, we can potentially be missing file notifications, and more importantly, have an imbalance in the super_block writers sem accounting. Fixes: `b000145e99` ("io_uring/rw: defer fsnotify calls to task context") Reported-by: Dave Chinner <david@fromorbit.com> Link: https://lore.kernel.org/all/20221010050319.GC2703033@dread.disaster.area/ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-01-24 07:22:49 +01:00
Pavel Begunkov	124fb13cc7	io_uring: fix double poll leak on repolling commit `c0737fa9a5` upstream. We have re-polling for partial IO, so a request can be polled twice. If it used two poll entries the first time then on the second io_arm_poll_handler() it will find the old apoll entry and NULL kmalloc()'ed second entry, i.e. apoll->double_poll, so leaking it. Fixes: `10c873334f` ("io_uring: allow re-poll if we made progress") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fee2452494222ecc7f1f88c8fb659baef971414a.1655852245.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-01-24 07:22:49 +01:00
Alviro Iskandar Setiawan	e944f1e37b	io_uring: Clean up a false-positive warning from GCC 9.3.0 commit `0d7c1153d9` upstream. In io_recv(), if import_single_range() fails, the @flags variable is uninitialized, then it will goto out_free. After the goto, the compiler doesn't know that (ret < min_ret) is always true, so it thinks the "if ((flags & MSG_WAITALL) ..." path could be taken. The complaint comes from gcc-9 (Debian 9.3.0-22) 9.3.0: ``` fs/io_uring.c:5238 io_recvfrom() error: uninitialized symbol 'flags' ``` Fix this by bypassing the @ret and @flags check when import_single_range() fails. Reasons: 1. import_single_range() only returns -EFAULT when it fails. 2. At that point, @flags is uninitialized and shouldn't be read. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: "Chen, Rong A" <rong.a.chen@intel.com> Link: https://lore.gnuweeb.org/timl/d33bb5a9-8173-f65b-f653-51fc0681c6d6@intel.com/ Cc: Pavel Begunkov <asml.silence@gmail.com> Suggested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Fixes: `7297ce3d59` ("io_uring: improve send/recv error handling") Signed-off-by: Alviro Iskandar Setiawan <alviro.iskandar@gmail.com> Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Link: https://lore.kernel.org/r/20220207140533.565411-1-ammarfaizi2@gnuweeb.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-01-24 07:22:49 +01:00
Stefan Metzmacher	2e4c95a404	io_uring/net: fix fast_iov assignment in io_setup_async_msg() commit `3e4cb6ebbb` upstream. I hit a very bad problem during my tests of SENDMSG_ZC. BUG(); in first_iovec_segment() triggered very easily. The problem was io_setup_async_msg() in the partial retry case, which seems to happen more often with _ZC. iov_iter_iovec_advance() may change i->iov in order to have i->iov_offset being only relative to the first element. Which means kmsg->msg.msg_iter.iov is no longer the same as kmsg->fast_iov. But this would rewind the copy to be the start of async_msg->fast_iov, which means the internal state of sync_msg->msg.msg_iter is inconsitent. I tested with 5 vectors with length like this 4, 0, 64, 20, 8388608 and got a short writes with: - ret=2675244 min_ret=8388692 => remaining 5713448 sr->done_io=2675244 - ret=-EAGAIN => io_uring_poll_arm - ret=4911225 min_ret=5713448 => remaining 802223 sr->done_io=7586469 - ret=-EAGAIN => io_uring_poll_arm - ret=802223 min_ret=802223 => res=8388692 While this was easily triggered with SENDMSG_ZC (queued for 6.1), it was a potential problem starting with `7ba89d2af1` in 5.18 for IORING_OP_RECVMSG. And also with `4c3c09439c` in 5.19 for IORING_OP_SENDMSG. However `257e84a537` introduced the critical code into io_setup_async_msg() in 5.11. Fixes: `7ba89d2af1` ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly") Fixes: `257e84a537` ("io_uring: refactor sendmsg/recvmsg iov managing") Cc: stable@vger.kernel.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b2e7be246e2fb173520862b0c7098e55767567a2.1664436949.git.metze@samba.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-01-24 07:22:48 +01:00
Jens Axboe	311b298a33	io_uring: io_kiocb_update_pos() should not touch file for non -1 offset commit `6f83ab22ad` upstream. -1 tells use to use the current position, but we check if the file is a stream regardless of that. Fix up io_kiocb_update_pos() to only dip into file if we need to. This is both more efficient and also drops 12 bytes of text on aarch64 and 64 bytes on x86-64. Fixes: `b4aec40015` ("io_uring: do not recalculate ppos unnecessarily") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-01-24 07:22:48 +01:00
Jens Axboe	89a410dbd0	io_uring/rw: defer fsnotify calls to task context commit `b000145e99` upstream. We can't call these off the kiocb completion as that might be off soft/hard irq context. Defer the calls to when we process the task_work for this request. That avoids valid complaints like: stack backtrace: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.0.0-rc6-syzkaller-00321-g105a36f3694e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_usage_bug kernel/locking/lockdep.c:3961 [inline] valid_state kernel/locking/lockdep.c:3973 [inline] mark_lock_irq kernel/locking/lockdep.c:4176 [inline] mark_lock.part.0.cold+0x18/0xd8 kernel/locking/lockdep.c:4632 mark_lock kernel/locking/lockdep.c:4596 [inline] mark_usage kernel/locking/lockdep.c:4527 [inline] __lock_acquire+0x11d9/0x56d0 kernel/locking/lockdep.c:5007 lock_acquire kernel/locking/lockdep.c:5666 [inline] lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631 __fs_reclaim_acquire mm/page_alloc.c:4674 [inline] fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4688 might_alloc include/linux/sched/mm.h:271 [inline] slab_pre_alloc_hook mm/slab.h:700 [inline] slab_alloc mm/slab.c:3278 [inline] __kmem_cache_alloc_lru mm/slab.c:3471 [inline] kmem_cache_alloc+0x39/0x520 mm/slab.c:3491 fanotify_alloc_fid_event fs/notify/fanotify/fanotify.c:580 [inline] fanotify_alloc_event fs/notify/fanotify/fanotify.c:813 [inline] fanotify_handle_event+0x1130/0x3f40 fs/notify/fanotify/fanotify.c:948 send_to_group fs/notify/fsnotify.c:360 [inline] fsnotify+0xafb/0x1680 fs/notify/fsnotify.c:570 __fsnotify_parent+0x62f/0xa60 fs/notify/fsnotify.c:230 fsnotify_parent include/linux/fsnotify.h:77 [inline] fsnotify_file include/linux/fsnotify.h:99 [inline] fsnotify_access include/linux/fsnotify.h:309 [inline] __io_complete_rw_common+0x485/0x720 io_uring/rw.c:195 io_complete_rw+0x1a/0x1f0 io_uring/rw.c:228 iomap_dio_complete_work fs/iomap/direct-io.c:144 [inline] iomap_dio_bio_end_io+0x438/0x5e0 fs/iomap/direct-io.c:178 bio_endio+0x5f9/0x780 block/bio.c:1564 req_bio_endio block/blk-mq.c:695 [inline] blk_update_request+0x3fc/0x1300 block/blk-mq.c:825 scsi_end_request+0x7a/0x9a0 drivers/scsi/scsi_lib.c:541 scsi_io_completion+0x173/0x1f70 drivers/scsi/scsi_lib.c:971 scsi_complete+0x122/0x3b0 drivers/scsi/scsi_lib.c:1438 blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1022 __do_softirq+0x1d3/0x9c6 kernel/softirq.c:571 invoke_softirq kernel/softirq.c:445 [inline] __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650 irq_exit_rcu+0x5/0x20 kernel/softirq.c:662 common_interrupt+0xa9/0xc0 arch/x86/kernel/irq.c:240 Fixes: `f63cf5192f` ("io_uring: ensure that fsnotify is always called") Link: https://lore.kernel.org/all/20220929135627.ykivmdks2w5vzrwg@quack3/ Reported-by: syzbot+dfcc5f4da15868df7d4d@syzkaller.appspotmail.com Reported-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-24 07:22:44 +01:00
Dylan Yudaken	05d69b372b	io_uring: do not recalculate ppos unnecessarily commit `b4aec40015` upstream. There is a slight optimisation to be had by calculating the correct pos pointer inside io_kiocb_update_pos and then using that later. It seems code size drops by a bit: 000000000000a1b0 0000000000000400 t io_read 000000000000a5b0 0000000000000319 t io_write vs 000000000000a1b0 00000000000003f6 t io_read 000000000000a5b0 0000000000000310 t io_write Signed-off-by: Dylan Yudaken <dylany@fb.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-24 07:22:44 +01:00
Dylan Yudaken	ff8a070253	io_uring: update kiocb->ki_pos at execution time commit `d34e1e5b39` upstream. Update kiocb->ki_pos at execution time rather than in io_prep_rw(). io_prep_rw() happens before the job is enqueued to a worker and so the offset might be read multiple times before being executed once. Ensures that the file position in a set of _linked_ SQEs will be only obtained after earlier SQEs have completed, and so will include their incremented file position. Signed-off-by: Dylan Yudaken <dylany@fb.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-24 07:22:44 +01:00
Dylan Yudaken	b7958caf41	io_uring: remove duplicated calls to io_kiocb_ppos commit `af9c45eceb` upstream. io_kiocb_ppos is called in both branches, and it seems that the compiler does not fuse this. Fusing removes a few bytes from loop_rw_iter. Before: $ nm -S fs/io_uring.o \| grep loop_rw_iter 0000000000002430 0000000000000124 t loop_rw_iter After: $ nm -S fs/io_uring.o \| grep loop_rw_iter 0000000000002430 000000000000010d t loop_rw_iter Signed-off-by: Dylan Yudaken <dylany@fb.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-24 07:22:44 +01:00
Jens Axboe	86e2d6901a	io_uring: ensure that cached task references are always put on exit commit `e775f93f2a` upstream. io_uring caches task references to avoid doing atomics for each of them per request. If a request is put from the same task that allocated it, then we can maintain a per-ctx cache of them. This obviously relies on io_uring always pruning caches in a reliable way, and there's currently a case off io_uring fd release where we can miss that. One example is a ring setup with IOPOLL, which relies on the task polling for completions, which will free them. However, if such a task submits a request and then exits or closes the ring without reaping the completion, then ring release will reap and put. If release happens from that very same task, the completed request task refs will get put back into the cache pool. This is problematic, as we're now beyond the point of pruning caches. Manually drop these caches after doing an IOPOLL reap. This releases references from the current task, which is enough. If another task happens to be doing the release, then the caching will not be triggered and there's no issue. Cc: stable@vger.kernel.org Fixes: `e98e49b2bb` ("io_uring: extend task put optimisations") Reported-by: Homin Rhee <hominlab@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-24 07:22:44 +01:00
Dylan Yudaken	30b9068934	io_uring: fix async accept on O_NONBLOCK sockets commit `a73825ba70` upstream. Do not set REQ_F_NOWAIT if the socket is non blocking. When enabled this causes the accept to immediately post a CQE with EAGAIN, which means you cannot perform an accept SQE on a NONBLOCK socket asynchronously. By removing the flag if there is no pending accept then poll is armed as usual and when a connection comes in the CQE is posted. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220324143435.2875844-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-01-24 07:22:44 +01:00

1 2

62 Commits