linux

korg/linux

mirror of https://mirrors.bfsu.edu.cn/git/linux.git synced 2024-11-18 01:34:14 +08:00

Author	SHA1	Message	Date
Linus Torvalds	9836e93c0a	for-5.19/io_uring-passthrough-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKovAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpv9oD/4qCs7k3bPZZWZ6xoWb4EObyyWOUifi26lp vpsJHFUbA67S/i4++LV9H18SazWJ7h08ac4bjgZ+NQz40/1WkTN8/Fa76jo+BnNK 7T10Wp4Ak6uwWVrKaA81pnT+G9+xmHlJ3X27aKxzLuT7BEPpShZ6ouFVjTkx9CzN LrLjuCDTOBBN+ZoaroWYfdLwTQX2VCAl9B15lOtQIlFvuuU8VlrvLboY+80K8TvY 1wvTA2HTjnXoYx+/cTTMIFZIwQH3r1hsbwEDD8/YJj1+ouhSRQ1b0p/nk2pA+3ws HF5r/YS/rLBjlPF094IzeOBaUyA433AN1VhZqnII8ek7ViT3W3x+BRrgE9O6ZkWT 0AjX1BXReI5rdFmxBmwsSdBnrSoGaJOf2GdsCCdubXBIi+F/RvyajrPf7PTB5zbW 9WEK/uy3xvZsRVkUGAzOb9QGdvjcllgMzwPJsDegDCw5PdcPdT3mzy6KGIWipFLp j8R+br7hRMpOJv/YpihJDMzSDkQ/r1/SCwR4fpLid/QdSHG/eRTQK6c4Su5bNYEy QDy2F6kQdBVtEJCQHcEOsbhXzSTNBcdB+ujUUM5653FkaHe6y4JbomLrsNx407Id i/4ROwA5K1dioJx503Eap+OhbI5rV+PFytJTwxvLrNyVGccwbH2YOVq80fsVBP2e cZbn6EX4Vg== =/peE -----END PGP SIGNATURE----- Merge tag 'for-5.19/io_uring-passthrough-2022-05-22' of git://git.kernel.dk/linux-block Pull io_uring NVMe command passthrough from Jens Axboe: "On top of everything else, this adds support for passthrough for io_uring. The initial feature for this is NVMe passthrough support, which allows non-filesystem based IO commands and admin commands. To support this, io_uring grows support for SQE and CQE members that are twice as big, allowing to pass in a full NVMe command without having to copy data around. And to complete with more than just a single 32-bit value as the output" * tag 'for-5.19/io_uring-passthrough-2022-05-22' of git://git.kernel.dk/linux-block: (22 commits) io_uring: cleanup handling of the two task_work lists nvme: enable uring-passthrough for admin commands nvme: helper for uring-passthrough checks blk-mq: fix passthrough plugging nvme: add vectored-io support for uring-cmd nvme: wire-up uring-cmd support for io-passthru on char-device. nvme: refactor nvme_submit_user_cmd() block: wire-up support for passthrough plugging fs,io_uring: add infrastructure for uring-cmd io_uring: support CQE32 for nop operation io_uring: enable CQE32 io_uring: support CQE32 in /proc info io_uring: add tracing for additional CQE32 fields io_uring: overflow processing for CQE32 io_uring: flush completions for CQE32 io_uring: modify io_get_cqe for CQE32 io_uring: add CQE32 completion processing io_uring: add CQE32 setup processing io_uring: change ring size calculation for CQE32 io_uring: store add. return values for CQE32 ...	2022-05-23 13:06:15 -07:00
Linus Torvalds	e1a8fde720	for-5.19/io_uring-net-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKotMQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmVwEACo7qBTjrrneZEwlYUWrSr45QtDNsQHPWjv aoK1dBLVH4ZjoZoOTI/aYcRgd5IJYo1P6I9tUrolM/+N3adM4UTEVC7i2PYDOaL3 WUm/YT2aSLiyHaHQON7SMyGSVU8kfM9YvJAGbj7ohalO9A2VVtHfUAmcAtBdgWqv Dl/Uu6vbogOl19xztAwN4nvwqljA+GUMnbHJ/oeASzrMzYMOdQ0q3UsQbEt+pTXt rBzv8fCsrKsT2uBc59Bi3eFKeBMM6ERzux/40TlqcOnXf3KUCK7nM4VaRgPbvXdt GOOYfYs+j9L8SSEedvdKyYNq4vVwWgYfTRAKMNB0FPiOaTGZuUthqkgRZGYY8AA9 +lJWxa+mzPmWEOmL+E44kt0OwtKDHX72ccEJUD7PHhTp0g87yKZfS6mXRNYLSxm7 IYt7N1x3cOp0lrwUTvLDnSPOTuYOSEiB2JZtfkf+y3SuI5SWowIcudKOuO5p7G1r IpAROsZrpHzMf/eniINoX3IrqBSqr254jzwq+9IgUaw/ky76oPYqM1dWP9BnVxCg PXgvfT5zj6xrU43TxTeIPU92JoAqhMeXi6dcyoiAAf9+8Vih+sbmLzAdJbYb5F2v G0ISy31+x/Goi43fQS59HzS/MNXJplcmy2mxKUYBT7/ZoJ2A26Q8SukTWD+U8sDn XIrV4HEOUQ== =PUw1 -----END PGP SIGNATURE----- Merge tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-block Pull io_uring 'more data in socket' support from Jens Axboe: "To be able to fully utilize the 'poll first' support in the core io_uring branch, it's advantageous knowing if the socket was empty after a receive. This adds support for that" * tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-block: io_uring: return hint on whether more data is available after receive tcp: pass back data left in socket after receive	2022-05-23 12:51:04 -07:00
Linus Torvalds	368da430d0	for-5.19/io_uring-socket-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKorgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpm0eEACdTzhm7h5cXn9KjIvWLkdocAb/NOL8GYPn Q1mY1SqKQFZvs/fyKHkkZEiIBPxhvN6snVFXMpb4LDmPYeeH4GTUlNomrGTIjvf/ j6SnZN4lCs9A2NlE+iDVWnFQOPQFALza2Y9BhC5xzay326qnKlO+0fQv3C1vXXrc /PNLqxQr7+GmO0a0PJnS6mGWGj6qF7nLqilB9apnKsTK6BKbJEec6ciKreqxU6ME WHaux11uIAbcf8rc6C/2myEK0k6jCOAue3vZ0lizygf+8klUCl2vMqV5BLwCBlXG /e7hBsUUrGr0CG0fryqhQQTUxsZLshioBbQH1vttSeZCli46mmWWAhPNy3/jb1ZU 72bazA84Fe9ney9uVZvZoMoBsG+6t6UOatqND13MeRFAXnkRr0jZRuau2iBxgqAr OINJW+IVPU7IrCD+S4lV1/LCdhLhYcob8/zfKmIrdHMQnWG/gLonVpYJIBCyLDAv 2jvHFIPJuSMUSGVjRKCb16LLNV6u7YG6VOWbKuippxfJxDdwA3TOtOhvTJIpYq0u TotPgpZ7bfcr4xDsGgD9mZS8E7jwsL/G0/MwsnixELykEXuhd++sgoTbr+RyUYdV 45Hm6DsxlytjzOb/5uQrqhwrso05eVt14K74XApPa3fWKL8aWCh1jGSdo3CSbIyW iHwss919Ag== =nb5i -----END PGP SIGNATURE----- Merge tag 'for-5.19/io_uring-socket-2022-05-22' of git://git.kernel.dk/linux-block Pull io_uring socket() support from Jens Axboe: "This adds support for socket(2) for io_uring. This is handy when using direct / registered file descriptors with io_uring. Outside of those two patches, a small series from Dylan on top that improves the tracing by providing a text representation of the opcode rather than needing to decode this by reading the header file every time. That sits in this branch as it was the last opcode added (until it wasn't...)" * tag 'for-5.19/io_uring-socket-2022-05-22' of git://git.kernel.dk/linux-block: io_uring: use the text representation of ops in trace io_uring: rename op -> opcode io_uring: add io_uring_get_opcode io_uring: add type to op enum io_uring: add socket(2) support net: add __sys_socket_file()	2022-05-23 12:42:33 -07:00
Linus Torvalds	09beaff75e	for-5.19/io_uring-xattr-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKopkQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpgakEACktFtUBQLrYOXbM/mVxMpR//ht4e29E8k0 j/DkqK0yDKn9VkvDryALguH+ixNSI9Z4N7xSELLb/meNQsbJ7YdprL3xJn3BoUgs 3zx44janE8J3Q5TsXvD2z2jPIMaT892t5+5aLFYZqP1g+KDXI8T1WpHsETMkKfRG ZPeerUrd0fhtnDpViaaYbRutIEt8V8tsPhh0XG/4GojWjUW0FTsRKBSGuQ0sQnUr aJDfF5VylOjOBzRGimGZ23vJIgtZ8UEpX0T2MxR5V6ffj4cI8bCFQOrphh7yHxF5 f09xte80zX6pow5AivIpultZShR6IoQG5DIvF59woNP16uXy5yUyVTQvdnt8RlyY RjLd8ro9Gt4wBQGqckJLyY/o1FGhaQ8S99wOixUlpb9qKAOGmQZI97FQKFENqx/1 Xe+bP6QmTt9uCXsYPIFBtZaaEv2u0yjHOyERFUSzKJQUuPTa5Rmen0EXYXRhe5/E p+sR3Qbk1wzlW7UHuCT2gcaI67SAFG+yDv1U6BAaVdcS71i0WCA+Q2a6AuB+NJzg ER4+JRoeOnjEXSP2UPvIUBL1Komdj4R2hnrOK4S80R3yQ3NaadrWywhBn5HNcniM wE2P6J0erzRFqyfBw9tyNLsZwR1iS7JqSD9/NuBLoWwb42O0l+WgqqwDTSxMsde4 egKBaidRqg== =CfhD -----END PGP SIGNATURE----- Merge tag 'for-5.19/io_uring-xattr-2022-05-22' of git://git.kernel.dk/linux-block Pull io_uring xattr support from Jens Axboe: "Support for the xattr variants" * tag 'for-5.19/io_uring-xattr-2022-05-22' of git://git.kernel.dk/linux-block: io_uring: cleanup error-handling around io_req_complete io_uring: fix trace for reduced sqe padding io_uring: add fgetxattr and getxattr support io_uring: add fsetxattr and setxattr support fs: split off do_getxattr from getxattr fs: split off setxattr_copy and do_setxattr function from setxattr	2022-05-23 12:30:30 -07:00
Linus Torvalds	3a166bdbf3	for-5.19/io_uring-2022-05-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKol0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpn+sEACbdEQqG6OoCOhJ0ZuxTdQqNMGxCImKBxjP 8Bqf+0hYNgwfG+80/UQvmc7olb+KxvZ6KtrgViC/ujhvMQmX0Xf/881kiiKG/iHJ XKoL9PdqIkenIGnlyEp1uRmnUbooYF+s4iT6Gj/pjnn29GbcKjsPzKV1CUNkt3GC R+wpdKczHQDaSwzDY5Ntyjf68QUQOyUznkHW+6JOcBeih3ET7NfapR/zsFS93RlL B9pQ9NiBBQfzCAUycVyQMC+p/rJbKWgidAiFk4fXKRm8/7iNwT4dB0+oUymlECxt xvalRVK6ER1s4RSdQcUTZoQA+SrzzOnK1DYja9cvcLT3wH+aojana6S0rOMDi8wp hoWT5jdMaZN09Vcm7J4sBN15i50m9aDITp21PKOVDZXSMVsebltCL9phaN5+9x/j AfF6Vki1WTB4gYaDHR8v6UkW+HcF1WOmMdq8GB9UMfnTya6EJqAooYT9lhQBP/rv jxkdj9Fu98O87dOfy1Av9AxH1UB8d7ypCJKkSEMAUPoWf0rC9HjYr0cRq/yppAj8 pI/0PwXaXRfQuoHPqZyETrPel77VQdBw+Hg+6TS0KlTd3WlVEJMZJPtXK466IFLp pYSRVnSI9PuhiClOpxriTCw0cppfRIv11IerCxRziqH9S1zijk0VBCN40//XDs1o JfvoA6htKQ== =S+Uf -----END PGP SIGNATURE----- Merge tag 'for-5.19/io_uring-2022-05-22' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: "Here are the main io_uring changes for 5.19. This contains: - Fixes for sparse type warnings (Christoph, Vasily) - Support for multi-shot accept (Hao) - Support for io_uring managed fixed files, rather than always needing the applicationt o manage the indices (me) - Fix for a spurious poll wakeup (Dylan) - CQE overflow fixes (Dylan) - Support more types of cancelations (me) - Support for co-operative task_work signaling, rather than always forcing an IPI (me) - Support for doing poll first when appropriate, rather than always attempting a transfer first (me) - Provided buffer cleanups and support for mapped buffers (me) - Improve how io_uring handles inflight SCM files (Pavel) - Speedups for registered files (Pavel, me) - Organize the completion data in a struct in io_kiocb rather than keep it in separate spots (Pavel) - task_work improvements (Pavel) - Cleanup and optimize the submission path, in general and for handling links (Pavel) - Speedups for registered resource handling (Pavel) - Support sparse buffers and file maps (Pavel, me) - Various fixes and cleanups (Almog, Pavel, me)" * tag 'for-5.19/io_uring-2022-05-22' of git://git.kernel.dk/linux-block: (111 commits) io_uring: fix incorrect __kernel_rwf_t cast io_uring: disallow mixed provided buffer group registrations io_uring: initialize io_buffer_list head when shared ring is unregistered io_uring: add fully sparse buffer registration io_uring: use rcu_dereference in io_close io_uring: consistently use the EPOLL* defines io_uring: make apoll_events a __poll_t io_uring: drop a spurious inline on a forward declaration io_uring: don't use ERR_PTR for user pointers io_uring: use a rwf_t for io_rw.flags io_uring: add support for ring mapped supplied buffers io_uring: add io_pin_pages() helper io_uring: add buffer selection support to IORING_OP_NOP io_uring: fix locking state for empty buffer group io_uring: implement multishot mode for accept io_uring: let fast poll support multishot io_uring: add REQ_F_APOLL_MULTISHOT for requests io_uring: add IORING_ACCEPT_MULTISHOT for accept io_uring: only wake when the correct events are set io_uring: avoid io-wq -EAGAIN looping for !IOPOLL ...	2022-05-23 12:22:49 -07:00
Jens Axboe	3fe07bcd80	io_uring: cleanup handling of the two task_work lists Rather than pass in a bool for whether or not this work item needs to go into the priority list or not, provide separate helpers for it. For most use cases, this also then gets rid of the branch for non-priority task work. While at it, rename the prior_task_list to prio_task_list. Prior is a confusing name for it, as it would seem to indicate that this is the previous task_work list. prio makes it clear that this is a priority task_work list. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-21 09:17:05 -06:00
Jens Axboe	2fcabce2d7	io_uring: disallow mixed provided buffer group registrations It's nonsensical to register a provided buffer ring, if a classic provided buffer group with the same ID exists. Depending on the order of which we decide what type to pick, the other type will never get used. Explicitly disallow it and return an error if this is attempted. Fixes: `c7fb19428d` ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 16:21:30 -06:00
Jens Axboe	1d0dbbfa28	io_uring: initialize io_buffer_list head when shared ring is unregistered We use ->buf_pages != 0 to tell if this is a shared buffer ring or a classic provided buffer group. If we unregister the shared ring and then attempt to use it, buf_pages is zero yet the classic list head isn't properly initialized. This causes io_buffer_select() to think that we have classic buffers available, but then we crash when we try and get one from the list. Just initialize the list if we unregister a shared buffer ring, leaving it in a sane state for either re-registration or for attempting to use it. And do the same for the initial setup from the classic path. Fixes: `c7fb19428d` ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 16:21:11 -06:00
Pavel Begunkov	0184f08e65	io_uring: add fully sparse buffer registration Honour IORING_RSRC_REGISTER_SPARSE not only for direct files but fixed buffers as well. It makes the rsrc API more consistent. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66f429e4912fe39fb3318217ff33a2853d4544be.1652879898.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 12:53:04 -06:00
Christoph Hellwig	0bf1dbee9b	io_uring: use rcu_dereference in io_close Accessing the file table needs a rcu_dereference_protected(). Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:19:05 -06:00
Christoph Hellwig	a294bef57c	io_uring: consistently use the EPOLL* defines POLL* are unannotated values for the userspace ABI, while everything in-kernel should use EPOLL* and the __poll_t type. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:19:05 -06:00
Christoph Hellwig	58f5c8d39e	io_uring: make apoll_events a __poll_t apoll_events is fed to vfs_poll and the poll tables, so it should be a __poll_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:19:05 -06:00
Christoph Hellwig	ee67ba3b20	io_uring: drop a spurious inline on a forward declaration io_file_get_normal isn't marked inline, so don't claim it as such in the forward declaration. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:19:05 -06:00
Christoph Hellwig	984824db84	io_uring: don't use ERR_PTR for user pointers ERR_PTR abuses the high bits of a pointer to transport error information. This is only safe for kernel pointers and not user pointers. Fix io_buffer_select and its helpers to just return NULL for failure and get rid of this abuse. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:18:56 -06:00
Christoph Hellwig	20cbd21d89	io_uring: use a rwf_t for io_rw.flags Use the proper type. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:17:52 -06:00
Jens Axboe	c7fb19428d	io_uring: add support for ring mapped supplied buffers Provided buffers allow an application to supply io_uring with buffers that can then be grabbed for a read/receive request, when the data source is ready to deliver data. The existing scheme relies on using IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use in real world applications. It's pretty efficient if the application is able to supply back batches of provided buffers when they have been consumed and the application is ready to recycle them, but if fragmentation occurs in the buffer space, it can become difficult to supply enough buffers at the time. This hurts efficiency. Add a register op, IORING_REGISTER_PBUF_RING, which allows an application to setup a shared queue for each buffer group of provided buffers. The application can then supply buffers simply by adding them to this ring, and the kernel can consume then just as easily. The ring shares the head with the application, the tail remains private in the kernel. Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the ring, they must use the mapped ring. Mapped provided buffer rings can co-exist with normal provided buffers, just not within the same group ID. To gauge overhead of the existing scheme and evaluate the mapped ring approach, a simple NOP benchmark was written. It uses a ring of 128 entries, and submits/completes 32 at the time. 'Replenish' is how many buffers are provided back at the time after they have been consumed: Test Replenish NOPs/sec ================================================================ No provided buffers NA ~30M Provided buffers 32 ~16M Provided buffers 1 ~10M Ring buffers 32 ~27M Ring buffers 1 ~27M The ring mapped buffers perform almost as well as not using provided buffers at all, and they don't care if you provided 1 or more back at the same time. This means application can just replenish as they go, rather than need to batch and compact, further reducing overhead in the application. The NOP benchmark above doesn't need to do any compaction, so that overhead isn't even reflected in the above test. Co-developed-by: Dylan Yudaken <dylany@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:12:42 -06:00
Jens Axboe	d8c2237d0a	io_uring: add io_pin_pages() helper Abstract this out from io_sqe_buffer_register() so we can use it elsewhere too without duplicating this code. No intended functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:12:42 -06:00
Jens Axboe	3d200242a6	io_uring: add buffer selection support to IORING_OP_NOP Obviously not really useful since it's not transferring data, but it is helpful in benchmarking overhead of provided buffers. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:12:42 -06:00
Jens Axboe	e7637a492b	io_uring: fix locking state for empty buffer group io_provided_buffer_select() must drop the submit lock, if needed, even in the error handling case. Failure to do so will leave us with the ctx->uring_lock held, causing spew like: ==================================== WARNING: iou-wrk-366/368 still has locks held! 5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted ------------------------------------ 1 lock held by iou-wrk-366/368: #0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48 stack backtrace: CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace.part.0+0xa4/0xd4 show_stack+0x14/0x5c dump_stack_lvl+0x88/0xb0 dump_stack+0x14/0x2c debug_check_no_locks_held+0x84/0x90 try_to_freeze.isra.0+0x18/0x44 get_signal+0x94/0x6ec io_wqe_worker+0x1d8/0x2b4 ret_from_fork+0x10/0x20 and triggering later hangs off get_signal() because we attempt to re-grab the lock. Reported-by: syzbot+987d7bb19195ae45208c@syzkaller.appspotmail.com Fixes: `149c69b04a` ("io_uring: abstract out provided buffer list selection") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:12:41 -06:00
Jens Axboe	aa184e8671	io_uring: don't attempt to IOPOLL for MSG_RING requests We gate whether to IOPOLL for a request on whether the opcode is allowed on a ring setup for IOPOLL and if it's got a file assigned. MSG_RING is the only one that allows a file yet isn't pollable, it's merely supported to allow communication on an IOPOLL ring, not because we can poll for completion of it. Put the assigned file early and clear it, so we don't attempt to poll for it. Reported-by: syzbot+1a0a53300ce782f8b3ad@syzkaller.appspotmail.com Fixes: `3f1d52abf0` ("io_uring: defer msg-ring file validity check until command issue") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-17 12:46:04 -06:00
Hao Xu	4e86a2c980	io_uring: implement multishot mode for accept Refactor io_accept() to support multishot mode. theoretical analysis: 1) when connections come in fast - singleshot: add accept sqe(userspace) --> accept inline ^ \| \|-----------------\| - multishot: add accept sqe(userspace) --> accept inline ^ \| \|----\| we do accept repeatedly in place until get EAGAIN 2) when connections come in at a low pressure similar thing like 1), we reduce a lot of userspace-kernel context switch and useless vfs_poll() tests: Did some tests, which goes in this way: server client(multiple) accept connect read write write read close close Basically, raise up a number of clients(on same machine with server) to connect to the server, and then write some data to it, the server will write those data back to the client after it receives them, and then close the connection after write return. Then the client will read the data and then close the connection. Here I test 10000 clients connect one server, data size 128 bytes. And each client has a go routine for it, so they come to the server in short time. test 20 times before/after this patchset, time spent:(unit cycle, which is the return value of clock()) before: 1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075 +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984 +1934226+1914385)/20.0 = 1927633.75 after: 1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112 +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506 +1871324+1940803)/20.0 = 1894750.45 (1927633.75 - 1894750.45) / 1927633.75 = 1.65% Signed-off-by: Hao Xu <howeyxu@tencent.com> Link: https://lore.kernel.org/r/20220514142046.58072-5-haoxu.linux@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-14 08:34:22 -06:00
Hao Xu	dbc2564cfe	io_uring: let fast poll support multishot For operations like accept, multishot is a useful feature, since we can reduce a number of accept sqe. Let's integrate it to fast poll, it may be good for other operations in the future. Signed-off-by: Hao Xu <howeyxu@tencent.com> Link: https://lore.kernel.org/r/20220514142046.58072-4-haoxu.linux@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-14 08:34:22 -06:00
Hao Xu	227685ebfa	io_uring: add REQ_F_APOLL_MULTISHOT for requests Add a flag to indicate multishot mode for fast poll. currently only accept use it, but there may be more operations leveraging it in the future. Also add a mask IO_APOLL_MULTI_POLLED which stands for REQ_F_APOLL_MULTI \| REQ_F_POLLED, to make the code short and cleaner. Signed-off-by: Hao Xu <howeyxu@tencent.com> Link: https://lore.kernel.org/r/20220514142046.58072-3-haoxu.linux@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-14 08:34:22 -06:00
Dylan Yudaken	1b1d7b4bf1	io_uring: only wake when the correct events are set The check for waking up a request compares the poll_t bits, however this will always contain some common flags so this always wakes up. For files with single wait queues such as sockets this can cause the request to be sent to the async worker unnecesarily. Further if it is non-blocking will complete the request with EAGAIN which is not desired. Here exclude these common events, making sure to not exclude POLLERR which might be important. Fixes: `d7718a9d25` ("io_uring: use poll driven retry for files that support it") Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220512091834.728610-3-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-13 14:37:50 -06:00
Pavel Begunkov	e0deb6a025	io_uring: avoid io-wq -EAGAIN looping for !IOPOLL If an opcode handler semi-reliably returns -EAGAIN, io_wq_submit_work() might continue busily hammer the same handler over and over again, which is not ideal. The -EAGAIN handling in question was put there only for IOPOLL, so restrict it to IOPOLL mode only where there is no other recourse than to retry as we cannot wait. Fixes: `def596e955` ("io_uring: support for IO polling") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f168b4f24181942f3614dd8ff648221736f572e6.1652433740.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-13 06:50:42 -06:00
Jens Axboe	a8da73a32b	io_uring: add flag for allocating a fully sparse direct descriptor space Currently to setup a fully sparse descriptor space upfront, the app needs to alloate an array of the full size and memset it to -1 and then pass that in. Make this a bit easier by allowing a flag that simply does this internally rather than needing to copy each slot separately. This works with IORING_REGISTER_FILES2 as the flag is set in struct io_uring_rsrc_register, and is only allow when the type is IORING_RSRC_FILE as this doesn't make sense for registered buffers. Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-13 06:28:50 -06:00
Jens Axboe	09893e15f1	io_uring: bump max direct descriptor count to 1M We currently limit these to 32K, but since we're now backing the table space with vmalloc when needed, there's no reason why we can't make it bigger. The total space is limited by RLIMIT_NOFILE as well. Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-13 06:28:46 -06:00
Jens Axboe	c30c3e00cb	io_uring: allow allocated fixed files for accept If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot, then that's a hint to allocate a fixed file descriptor rather than have one be passed in directly. This can be useful for having io_uring manage the direct descriptor space, and also allows multi-shot support to work with fixed files. Normal accept direct requests will complete with 0 for success, and < 0 in case of error. If io_uring is asked to allocated the direct descriptor, then the direct descriptor is returned in case of success. Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-13 06:28:43 -06:00
Jens Axboe	1339f24b33	io_uring: allow allocated fixed files for openat/openat2 If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot, then that's a hint to allocate a fixed file descriptor rather than have one be passed in directly. This can be useful for having io_uring manage the direct descriptor space. Normal open direct requests will complete with 0 for success, and < 0 in case of error. If io_uring is asked to allocated the direct descriptor, then the direct descriptor is returned in case of success. Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-13 06:28:42 -06:00
Jens Axboe	b70b8e3331	io_uring: add basic fixed file allocator Applications currently always pick where they want fixed files to go. In preparation for allowing these types of commands with multishot support, add a basic allocator in the fixed file table. Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-13 06:28:40 -06:00
Jens Axboe	d78bd8adfc	io_uring: track fixed files with a bitmap In preparation for adding a basic allocator for direct descriptors, add helpers that set/clear whether a file slot is used. Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-13 06:28:29 -06:00
Jens Axboe	ee692a21e9	fs,io_uring: add infrastructure for uring-cmd file_operations->uring_cmd is a file private handler. This is somewhat similar to ioctl but hopefully a lot more sane and useful as it can be used to enable many io_uring capabilities for the underlying operation. IORING_OP_URING_CMD is a file private kind of request. io_uring doesn't know what is in this command type, it's for the provider of ->uring_cmd() to deal with. Co-developed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220511054750.20432-2-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-11 07:40:47 -06:00
Stefan Roesch	2bb04df7c2	io_uring: support CQE32 for nop operation This adds support for filling the extra1 and extra2 fields for large CQE's. Co-developed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220426182134.136504-13-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:35:34 -06:00
Stefan Roesch	76c68fbf1a	io_uring: enable CQE32 This enables large CQE's in the uring setup. Co-developed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220426182134.136504-12-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:35:34 -06:00
Stefan Roesch	f9b3dfcc68	io_uring: support CQE32 in /proc info This exposes the extra1 and extra2 fields in the /proc output. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220426182134.136504-11-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:35:34 -06:00
Stefan Roesch	c4bb964fa0	io_uring: add tracing for additional CQE32 fields This adds tracing for the extra1 and extra2 fields. Co-developed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220426182134.136504-10-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:35:34 -06:00
Stefan Roesch	e45a3e0500	io_uring: overflow processing for CQE32 This adds the overflow processing for large CQE's. This adds two parameters to the io_cqring_event_overflow function and uses these fields to initialize the large CQE fields. Allocate enough space for large CQE's in the overflow structue. If no large CQE's are used, the size of the allocation is unchanged. The cqe field can have a different size depending if its a large CQE or not. To be able to allocate different sizes, the two fields in the structure are re-ordered. Co-developed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220426182134.136504-9-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:35:34 -06:00
Stefan Roesch	0e2e5c47fe	io_uring: flush completions for CQE32 This flushes the completions according to their CQE type: the same processing is done for the default CQE size, but for large CQE's the extra1 and extra2 fields are filled in. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220426182134.136504-8-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:35:34 -06:00
Stefan Roesch	2fee6bc640	io_uring: modify io_get_cqe for CQE32 Modify accesses to the CQE array to take large CQE's into account. The index needs to be shifted by one for large CQE's. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220426182134.136504-7-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:35:34 -06:00
Stefan Roesch	effcf8bdeb	io_uring: add CQE32 completion processing This adds the completion processing for the large CQE's and makes sure that the extra1 and extra2 fields are passed through. Co-developed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220426182134.136504-6-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:35:34 -06:00
Stefan Roesch	916587984f	io_uring: add CQE32 setup processing This adds two new function to setup and fill the CQE32 result structure. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220426182134.136504-5-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:35:34 -06:00
Stefan Roesch	baf9cb643b	io_uring: change ring size calculation for CQE32 This changes the function rings_size to take large CQE's into account. Co-developed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220426182134.136504-4-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:35:34 -06:00
Stefan Roesch	4e5bc0a9a1	io_uring: store add. return values for CQE32 This reuses the hash list node for the storage we need to hold the two 64-bit values that must be passed back. Co-developed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220426182134.136504-3-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:35:33 -06:00
Jens Axboe	ebdeb7c01d	io_uring: add support for 128-byte SQEs Normal SQEs are 64-bytes in length, which is fine for all the commands we support. However, in preparation for supporting passthrough IO, provide an option for setting up a ring with 128-byte SQEs. We continue to use the same type for io_uring_sqe, it's marked and commented with a zero sized array pad at the end. This provides up to 80 bytes of data for a passthrough command - 64 bytes for the extra added data, and 16 bytes available at the end of the existing SQE. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:35:33 -06:00
Jens Axboe	b5ba65df47	Merge branch 'for-5.19/io_uring-socket' into for-5.19/io_uring-passthrough * for-5.19/io_uring-socket: io_uring: use the text representation of ops in trace io_uring: rename op -> opcode io_uring: add io_uring_get_opcode io_uring: add type to op enum io_uring: add socket(2) support net: add __sys_socket_file() io_uring: fix trace for reduced sqe padding io_uring: add fgetxattr and getxattr support io_uring: add fsetxattr and setxattr support fs: split off do_getxattr from getxattr fs: split off setxattr_copy and do_setxattr function from setxattr	2022-05-09 06:35:28 -06:00
Jens Axboe	1308689906	Merge branch 'for-5.19/io_uring' into for-5.19/io_uring-passthrough * for-5.19/io_uring: (85 commits) io_uring: don't clear req->kbuf when buffer selection is done io_uring: eliminate the need to track provided buffer ID separately io_uring: move provided buffer state closer to submit state io_uring: move provided and fixed buffers into the same io_kiocb area io_uring: abstract out provided buffer list selection io_uring: never call io_buffer_select() for a buffer re-select io_uring: get rid of hashed provided buffer groups io_uring: always use req->buf_index for the provided buffer group io_uring: ignore ->buf_index if REQ_F_BUFFER_SELECT isn't set io_uring: kill io_rw_buffer_select() wrapper io_uring: make io_buffer_select() return the user address directly io_uring: kill io_recv_buffer_select() wrapper io_uring: use 'sr' vs 'req->sr_msg' consistently io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg io_uring: check IOPOLL/ioprio support upfront io_uring: replace smp_mb() with smp_mb__after_atomic() in io_sq_thread() io_uring: add IORING_SETUP_TASKRUN_FLAG io_uring: use TWA_SIGNAL_NO_IPI if IORING_SETUP_COOP_TASKRUN is used io_uring: set task_work notify method at init time io-wq: use __set_notify_signal() to wake workers ...	2022-05-09 06:35:11 -06:00
Jens Axboe	7ccba24d3b	io_uring: don't clear req->kbuf when buffer selection is done It's not needed as the REQ_F_BUFFER_SELECTED flag tracks the state of whether or not kbuf is valid, so just drop it. Suggested-by: Dylan Yudaken <dylany@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:29:06 -06:00
Jens Axboe	1dbd023eb0	io_uring: eliminate the need to track provided buffer ID separately We have io_kiocb->buf_index which is used for either fixed buffers, or for provided buffers. For the latter, it's used to hold the buffer group ID for buffer selection. Post selection, req->kbuf->bid is used to get the buffer ID. Store the buffer ID, when selected, in req->buf_index. If we do end up recycling the buffer, reset it back to the buffer group ID. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:29:06 -06:00
Jens Axboe	660cbfa234	io_uring: move provided buffer state closer to submit state The timeout and other items that follow are less hot, so let's move the provided buffer state above that. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:29:06 -06:00
Jens Axboe	a4f8d94cfb	io_uring: move provided and fixed buffers into the same io_kiocb area These are mutually exclusive - if you use provided buffers, then you cannot use fixed buffers and vice versa. Move them into the same spot in the io_kiocb, which is also advantageous for provided buffers as they get near the submit side hot cacheline. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-09 06:29:06 -06:00

1 2 3 4 5 ...

2001 Commits