Commit Graph

983793 Commits

Author SHA1 Message Date
Pavel Begunkov
c1d5a22468 io_uring: refactor scheduling in io_cqring_wait
schedule_timeout() with timeout=MAX_SCHEDULE_TIMEOUT is guaranteed to
work just as schedule(), so instead of hand-coding it based on arguments
always use the timeout version and simplify code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov
9936c7c2bc io_uring: deduplicate core cancellations sequence
Files and task cancellations go over same steps trying to cancel
requests in io-wq, poll, etc. Deduplicate it with a helper.

note: new io_uring_try_cancel_requests() is former
__io_uring_cancel_task_requests() with files passed as an agrument and
flushing overflowed requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-04 08:05:46 -07:00
Pavel Begunkov
57cd657b82 io_uring: simplify do_read return parsing
do_read() returning 0 bytes read (not -EAGAIN/etc.) is not an important
enough of a case to prioritise it. Fold it into ret < 0 check, so we get
rid of an extra if and make it a bit more readable.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 13:09:21 -07:00
Pavel Begunkov
ce3d5aae33 io_uring: deduplicate adding to REQ_F_INFLIGHT
We don't know for how long REQ_F_INFLIGHT is going to stay, cleaner to
extract a helper for marking requests as so.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 13:09:21 -07:00
Pavel Begunkov
e86d004729 io_uring: remove work flags after cleanup
Shouldn't be a problem now, but it's better to clean
REQ_F_WORK_INITIALIZED and work->flags only after relevant resources are
killed, so cancellation see them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 13:09:21 -07:00
Pavel Begunkov
34e08fed2c io_uring: inline io_req_drop_files()
req->files now have same lifetime as all other iowq-work resources,
inline io_req_drop_files() for consistency. Moreover, since
REQ_F_INFLIGHT is no more files specific, the function name became
very confusing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 13:09:21 -07:00
Pavel Begunkov
ba13e23f37 io_uring: kill not used needs_file_no_error
We have no request types left using needs_file_no_error, remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 13:09:21 -07:00
Pavel Begunkov
9ae1f8dd37 io_uring: fix inconsistent lock state
WARNING: inconsistent lock state

inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
syz-executor217/8450 [HC1[1]:SC0[0]:HE0:SE1] takes:
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_req_clean_work fs/io_uring.c:1398 [inline]
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&fs->lock);
  <Interrupt>
    lock(&fs->lock);

 *** DEADLOCK ***

1 lock held by syz-executor217/8450:
 #0: ffff88802417c3e8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x1071/0x1f30 fs/io_uring.c:9442

stack backtrace:
CPU: 1 PID: 8450 Comm: syz-executor217 Not tainted 5.11.0-rc5-next-20210129-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
[...]
 _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:354 [inline]
 io_req_clean_work fs/io_uring.c:1398 [inline]
 io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029
 __io_free_req+0x3d/0x2e0 fs/io_uring.c:2046
 io_free_req fs/io_uring.c:2269 [inline]
 io_double_put_req fs/io_uring.c:2392 [inline]
 io_put_req+0xf9/0x570 fs/io_uring.c:2388
 io_link_timeout_fn+0x30c/0x480 fs/io_uring.c:6497
 __run_hrtimer kernel/time/hrtimer.c:1519 [inline]
 __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583
 hrtimer_interrupt+0x334/0x940 kernel/time/hrtimer.c:1645
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1085 [inline]
 __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1102
 asm_call_irq_on_stack+0xf/0x20
 </IRQ>
 __run_sysvec_on_irqstack arch/x86/include/asm/irq_stack.h:37 [inline]
 run_sysvec_on_irqstack_cond arch/x86/include/asm/irq_stack.h:89 [inline]
 sysvec_apic_timer_interrupt+0xbd/0x100 arch/x86/kernel/apic/apic.c:1096
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:629
RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline]
RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:199
 spin_unlock_irq include/linux/spinlock.h:404 [inline]
 io_queue_linked_timeout+0x194/0x1f0 fs/io_uring.c:6525
 __io_queue_sqe+0x328/0x1290 fs/io_uring.c:6594
 io_queue_sqe+0x631/0x10d0 fs/io_uring.c:6639
 io_queue_link_head fs/io_uring.c:6650 [inline]
 io_submit_sqe fs/io_uring.c:6697 [inline]
 io_submit_sqes+0x19b5/0x2720 fs/io_uring.c:6960
 __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9443
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Don't free requests from under hrtimer context (softirq) as it may sleep
or take spinlocks improperly (e.g. non-irq versions).

Cc: stable@vger.kernel.org # 5.6+
Reported-by: syzbot+81d17233a2b02eafba33@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 13:09:21 -07:00
Dan Carpenter
13770a71ed io_uring: Fix NULL dereference in error in io_sqe_files_register()
If we hit a "goto out_free;" before the "ctx->file_data" pointer has
been assigned then it leads to a NULL derefence when we call:

	free_fixed_rsrc_data(ctx->file_data);

We can fix this by moving the assignment earlier.

Fixes: 1ad555c6ae ("io_uring: create common fixed_rsrc_data allocation routines")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 11:56:41 -07:00
Hao Xu
8b28fdf211 io_uring: check kthread parked flag before sqthread goes to sleep
Abaci reported this issue:

#[  605.170872] INFO: task kworker/u4:1:53 blocked for more than 143 seconds.
[  605.172123]       Not tainted 5.10.0+ #1
[  605.172811] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.173915] task:kworker/u4:1    state:D stack:    0 pid:   53 ppid:     2 flags:0x00004000
[  605.175130] Workqueue: events_unbound io_ring_exit_work
[  605.175931] Call Trace:
[  605.176334]  __schedule+0xe0e/0x25a0
[  605.176971]  ? firmware_map_remove+0x1a1/0x1a1
[  605.177631]  ? write_comp_data+0x2a/0x80
[  605.178272]  schedule+0xd0/0x270
[  605.178811]  schedule_timeout+0x6b6/0x940
[  605.179415]  ? mark_lock.part.0+0xca/0x1420
[  605.180062]  ? usleep_range+0x170/0x170
[  605.180684]  ? wait_for_completion+0x16d/0x280
[  605.181392]  ? mark_held_locks+0x9e/0xe0
[  605.182079]  ? rwlock_bug.part.0+0x90/0x90
[  605.182853]  ? lockdep_hardirqs_on_prepare+0x286/0x400
[  605.183817]  wait_for_completion+0x175/0x280
[  605.184713]  ? wait_for_completion_interruptible+0x340/0x340
[  605.185611]  ? _raw_spin_unlock_irq+0x24/0x30
[  605.186307]  ? migrate_swap_stop+0x9c0/0x9c0
[  605.187046]  kthread_park+0x127/0x1c0
[  605.187738]  io_sq_thread_stop+0xd5/0x530
[  605.188459]  io_ring_exit_work+0xb1/0x970
[  605.189207]  process_one_work+0x92c/0x1510
[  605.189947]  ? pwq_dec_nr_in_flight+0x360/0x360
[  605.190682]  ? rwlock_bug.part.0+0x90/0x90
[  605.191430]  ? write_comp_data+0x2a/0x80
[  605.192207]  worker_thread+0x9b/0xe20
[  605.192900]  ? process_one_work+0x1510/0x1510
[  605.193599]  kthread+0x353/0x460
[  605.194154]  ? _raw_spin_unlock_irq+0x24/0x30
[  605.194910]  ? kthread_create_on_node+0x100/0x100
[  605.195821]  ret_from_fork+0x1f/0x30
[  605.196605]
[  605.196605] Showing all locks held in the system:
[  605.197598] 1 lock held by khungtaskd/25:
[  605.198301]  #0: ffffffff8b5f76a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x30
[  605.199914] 3 locks held by kworker/u4:1/53:
[  605.200609]  #0: ffff888100109938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x82a/0x1510
[  605.202108]  #1: ffff888100e47dc0 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x85e/0x1510
[  605.203681]  #2: ffff888116931870 (&sqd->lock){+.+.}-{3:3}, at: io_sq_thread_park.part.0+0x19/0x50
[  605.205183] 3 locks held by systemd-journal/161:
[  605.206037] 1 lock held by syslog-ng/254:
[  605.206674] 2 locks held by agetty/311:
[  605.207292]  #0: ffff888101097098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80
[  605.208715]  #1: ffffc900000332e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0
[  605.210131] 2 locks held by bash/677:
[  605.210723]  #0: ffff88810419a098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80
[  605.212105]  #1: ffffc900000512e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0
[  605.213777]
[  605.214151] =============================================

I believe this is caused by the follow race:

(ctx_list is empty now)
=> io_put_sq_data               |
==> kthread_park(sqd->thread);  |
====> set KTHREAD_SHOULD_PARK	|
====> wake_up_process(k)        | sq thread is running
				|
				|
				| needs_sched is true since no ctx,
				| so TASK_INTERRUPTIBLE set and schedule
				| out then never wake up again
				|
====> wait_for_completion	|
	(stuck here)

So check if sqthread gets park flag right before schedule().
since ctx_list is always empty when this problem happens, here I put
kthread_should_park() before setting the wakeup flag(ctx_list is empty
so this for loop is fast), where is close enough to schedule(). The
problem doesn't show again in my repro testing after this fix.

Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:43 -07:00
Pavel Begunkov
090da7d52f MAINTAINERS: update io_uring section
Add the missing kernel io_uring header, add Pavel as a reviewer, and
exclude io_uring from the FILESYSTEMS section to avoid keep spamming Al
(mainly) with bug reports, patches, etc.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:43 -07:00
noah
4e0377a1c5 io_uring: Add skip option for __io_sqe_files_update
This patch adds support for skipping a file descriptor when using
IORING_REGISTER_FILES_UPDATE.  __io_sqe_files_update will skip fds set
to IORING_REGISTER_FILES_SKIP. IORING_REGISTER_FILES_SKIP is inturn
added as a #define in io_uring.h

Signed-off-by: noah <goldstein.w.n@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:43 -07:00
Pavel Begunkov
67973b933e io_uring: cleanup files_update looping
Replace a while with a simple for loop, that looks way more natural, and
enables us to use "continue" as indexes are no more updated by hand in
the end of the loop.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:43 -07:00
Pavel Begunkov
7c6607313f io_uring: consolidate putting reqs task
We grab a task for each request and while putting it it also have to do
extra work like inflight accounting and waking up that task. This
sequence is duplicated several time, it's good time to add a helper.
More to that, the helper generates better code due to better locality
and so not failing alias analysis.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:43 -07:00
Pavel Begunkov
ecfc849282 io_uring: ensure only sqo_task has file notes
For SQPOLL io_uring we want to have only one file note held by
sqo_task. Add a warning to make sure it holds. It's deep in
io_uring_add_task_file() out of hot path, so shouldn't hurt.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:43 -07:00
Yejune Deng
0bead8cd39 io_uring: simplify io_remove_personalities()
The function io_remove_personalities() is very similar to
io_unregister_personality(),so implement io_remove_personalities()
calling io_unregister_personality().

Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:43 -07:00
Jens Axboe
4014d943cb io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL
It's no longer used as IORING_OP_CLOSE got rid for the need of flagging
it as uncancelable, kill it of.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:43 -07:00
Jens Axboe
9eac1904d3 io_uring: get rid of intermediate IORING_OP_CLOSE stage
We currently split the close into two, in case we have a ->flush op
that we can't safely handle from non-blocking context. This requires
us to flag the op as uncancelable if we do need to punt it async, and
that means special handling for just this op type.

Use __close_fd_get_file() and grab the files lock so we can get the file
and check if we need to go async in one atomic operation. That gets rid
of the need for splitting this into two steps, and hence the need for
IO_WQ_WORK_NO_CANCEL.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:43 -07:00
Jens Axboe
53dec2ea74 fs: provide locked helper variant of close_fd_get_file()
Assumes current->files->file_lock is already held on invocation. Helps
the caller check the file before removing the fd, if it needs to.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
e342c807f5 io_uring: save atomic dec for inline executed reqs
When a request is completed with comp_state, its completion reference
put is deferred to io_submit_flush_completions(), but the submission
is put not far from there, so do it together to save one atomic dec per
request. That targets requests that complete inline, e.g. buffered rw,
send/recv.

Proper benchmarking haven't been conducted but for nops(batch=32) it was
around 7901 vs 8117 KIOPS (~2.7%), or ~4% per perf profiling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
9affd664f0 io_uring: don't flush CQEs deep down the stack
io_submit_flush_completions() is called down the stack in the _state
version of io_req_complete(), that's ok because is only called by
io_uring opcode handler functions directly. Move it up to
__io_queue_sqe() as preparation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
a38d68db67 io_uring: help inlining of io_req_complete()
__io_req_complete() inlining is a bit weird, some compilers don't
optimise out the non-NULL branch of it even when called as
io_req_complete(). Help it a bit by extracting state and stateless
helpers out of __io_req_complete().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
8662daec09 io_uring: add a helper timeout mode calculation
Deduplicates translation of timeout flags into hrtimer_mode.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
eab30c4d20 io_uring: deduplicate failing task_work_add
When io_req_task_work_add() fails, the request will be cancelled by
enqueueing via task_works of io-wq. Extract a function for that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
02b23a9af5 io_uring: remove __io_state_file_put
The check in io_state_file_put() is optimised pretty well when called
from __io_file_get(). Don't pollute the code with all these variants.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
85bcb6c67e io_uring: simplify io_alloc_req()
Get rid of a label in io_alloc_req(), it's cleaner to do return
directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
888aae2eed io_uring: further deduplicate #CQ events calc
Apparently, there is one more place hand coded calculation of number of
CQ events in the ring. Use __io_cqring_events() helper in
io_get_cqring() as well. Naturally, assembly stays identical.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
ec30e04ba4 io_uring: inline __io_commit_cqring()
Inline it in its only user, that's cleaner

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
2d7e935809 io_uring: inline io_async_submit()
The name is confusing and it's used only in one place.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
5c766a908d io_uring: cleanup personalities under uring_lock
personality_idr is usually synchronised by uring_lock, the exception
would be removing personalities in io_ring_ctx_wait_and_kill(), which
is legit as refs are killed by that point but still would be more
resilient to do it under the lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
dc2a6e9aa9 io_uring: refactor io_resubmit_prep()
It's awkward to pass return a value into a function for it to return it
back. Check it at the caller site and clean up io_resubmit_prep() a bit.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
bf6182b6d4 io_uring: optimise io_rw_reissue()
The hot path is IO completing on the first try. Reshuffle io_rw_reissue() so
it's checked first.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Bijan Mottahedeh
00835dce14 io_uring: make percpu_ref_release names consistent
Make the percpu ref release function names consistent between rsrc data
and nodes.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Bijan Mottahedeh
1ad555c6ae io_uring: create common fixed_rsrc_data allocation routines
Create common alloc/free fixed_rsrc_data routines for both files and
buffers.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[remove buffer part]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Bijan Mottahedeh
d7954b2ba9 io_uring: create common fixed_rsrc_ref_node handling routines
Create common routines to be used for both files/buffers registration.

[remove io_sqe_rsrc_set_node substitution]

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[merge, quiesce only for files]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Pavel Begunkov
bc9744cd16 io_uring: split ref_node alloc and init
A simple prep patch allowing to set refnode callbacks after it was
allocated. This needed to 1) keep ourself off of hi-level functions
where it's not pretty and they are not necessary 2) amortise ref_node
allocation in the future, e.g. for updates.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Bijan Mottahedeh
6802535df7 io_uring: split alloc_fixed_file_ref_node
Split alloc_fixed_file_ref_node into resource generic/specific parts,
to be leveraged for fixed buffers.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Bijan Mottahedeh
2a63b2d9c3 io_uring: add rsrc_ref locking routines
Encapsulate resource reference locking into separate routines.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Bijan Mottahedeh
d67d2263fb io_uring: separate ref_list from fixed_rsrc_data
Uplevel ref_list and make it common to all resources.  This is to
allow one common ref_list to be used for both files, and buffers
in upcoming patches.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:42 -07:00
Bijan Mottahedeh
5023853183 io_uring: generalize io_queue_rsrc_removal
Generalize io_queue_rsrc_removal to handle both files and buffers.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[remove io_mapped_ubuf from rsrc tables/etc. for now]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:41 -07:00
Bijan Mottahedeh
269bbe5fd4 io_uring: rename file related variables to rsrc
This is a prep rename patch for subsequent patches to generalize file
registration.

[io_uring_rsrc_update:: rename fds -> data]

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[leave io_uring_files_update as struct]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:41 -07:00
Bijan Mottahedeh
2b358604aa io_uring: modularize io_sqe_buffers_register
Move allocation of buffer management structures, and validation of
buffers into separate routines.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:41 -07:00
Bijan Mottahedeh
0a96bbe499 io_uring: modularize io_sqe_buffer_register
Split io_sqe_buffer_register into two routines:

- io_sqe_buffer_register() registers a single buffer
- io_sqe_buffers_register iterates over all user specified buffers

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:41 -07:00
Jens Axboe
3a81fd0204 io_uring: enable LOOKUP_CACHED path resolution for filename lookups
Instead of being pessimistic and assume that path lookup will block, use
LOOKUP_CACHED to attempt just a cached lookup. This ensures that the
fast path is always done inline, and we only punt to async context if
IO is needed to satisfy the lookup.

For forced nonblock open attempts, mark the file O_NONBLOCK over the
actual ->open() call as well. We can safely clear this again before
doing fd_install(), so it'll never be user visible that we fiddled with
it.

This greatly improves the performance of file open where the dentry is
already cached:

ached		5.10-git	5.10-git+LOOKUP_CACHED	Speedup
---------------------------------------------------------------
33%		1,014,975	900,474			1.1x
89%		 545,466	292,937			1.9x
100%		 435,636	151,475			2.9x

The more cache hot we are, the faster the inline LOOKUP_CACHED
optimization helps. This is unsurprising and expected, as a thread
offload becomes a more dominant part of the total overhead. If we look
at io_uring tracing, doing an IORING_OP_OPENAT on a file that isn't in
the dentry cache will yield:

275.550481: io_uring_create: ring 00000000ddda6278, fd 3 sq size 8, cq size 16, flags 0
275.550491: io_uring_submit_sqe: ring 00000000ddda6278, op 18, data 0x0, non block 1, sq_thread 0
275.550498: io_uring_queue_async_work: ring 00000000ddda6278, request 00000000c0267d17, flags 69760, normal queue, work 000000003d683991
275.550502: io_uring_cqring_wait: ring 00000000ddda6278, min_events 1
275.550556: io_uring_complete: ring 00000000ddda6278, user_data 0x0, result 4

which shows a failed nonblock lookup, then punt to worker, and then we
complete with fd == 4. This takes 65 usec in total. Re-running the same
test case again:

281.253956: io_uring_create: ring 0000000008207252, fd 3 sq size 8, cq size 16, flags 0
281.253967: io_uring_submit_sqe: ring 0000000008207252, op 18, data 0x0, non block 1, sq_thread 0
281.253973: io_uring_complete: ring 0000000008207252, user_data 0x0, result 4

shows the same request completing inline, also returning fd == 4. This
takes 6 usec.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-01 10:02:41 -07:00
Jens Axboe
b2d86c7cec Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-5.12/io_uring
Merge RESOLVE_CACHED bits from Al, as the io_uring changes will build on
top of that.

* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED
  fs: add support for LOOKUP_CACHED
  saner calling conventions for unlazy_child()
  fs: make unlazy_walk() error handling consistent
  fs/namei.c: Remove unlikely of status being -ECHILD in lookup_fast()
  do_tmpfile(): don't mess with finish_open()
2021-02-01 10:02:28 -07:00
Linus Torvalds
1048ba83fb Linux 5.11-rc6 2021-01-31 13:50:09 -08:00
Linus Torvalds
ac8c6edd20 A single EFI fix from Lukas:
- handle boolean device properties imported from Apple firmware
   correctly.
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmARRNIACgkQw08iOZLZ
 jyQitgwAvGzlv41S+q6FU1zf8cYHetGZ9/hT4Dwykhi+wtCky/m7N1DyszJhIN49
 u0yMx/m4PVlrKz+vMotXLHIFGWZHaTZhARp+mrcHu5t0d41RC1JPWfJuVPedNrAx
 0QZ6A50pRJcFthG+J4mmW3ph+RxntnSKomyojM07cecS4bAwV/Rm4RO7Jki2FLZ9
 w5bKK2swYCuA25On4+SHkHMok+Cvx7shpHDbu1QsXtVGtu1RPSZXNqCP2YmjhbYv
 YHpAXd5gGbnX0/gZTlDm5jQ2zvXDPO4GJjlQuhQOloDbVGdL5ARYpW23o7tdFanw
 L/NyLrebRpFJwHwqSud1WIgfVVY9OstDDACX335NrWXvh2Tm473KtqtDeb78WnGp
 gSfrXMjTRInOEyHKZNCE5Dpbj5yOLMTA/CChYeqO/xlsagZ5H5h2YIgrNFaYyps3
 Y5OG3/dqueFDr47xM4iXcy+1tS6H0SZilAcwHBTq4TBawd77cY75/laNq9uC04Du
 Cj4Jhe7l
 =jRKO
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull EFI fix from Borislav Petkov:
 "A single fix from Lukas: handle boolean device properties imported
  from Apple firmware correctly"

* tag 'efi-urgent-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/apple-properties: Reinstate support for boolean properties
2021-01-31 11:57:37 -08:00
Linus Torvalds
f5a376edde A single fix for objtool to generate proper unwind info for newer
toolchains which do not generate section symbols anymore. And a cleanup
 ontop.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmAWi8EACgkQEsHwGGHe
 VUrZhBAAmbaEBdU05+ah24r+XHLsCJBQwlwHAr71YfNnEpq/KRHXLtc3QJLAdOVf
 ku0536gDJvMUX7d7ap2ooSVAD9Ed1h4N7BvOn8eLFiaPc6NG9Tw6FZc/X6OKtyLd
 DyFOsNAa9JnjjeqT7TTYdqbcJUzPSqd3Ufg5V4UZcVwvGTkbc+k1TktnteTMXWUI
 t99wXCOfw2accdUrr3MIkdvSGNo099VZa/DBZQVmpjCcMSOfe/0KQIoeVagEpAew
 T0WxONdM62Nz4Tv03N6m6EqVpIOc8BueRuOWlX/c5XVCmYx8BDSdFb6EY9sEh10i
 hLU1U36BCUT1uAA/ZAuw/I22fy5MXqbrGvWJrcW8Wav1fQfaDYkDyGNE+aBjXysQ
 uZGTzbfAdAS2B8XTElzYJZwh1WW7Je7b2pZhL5/6kwoa8E82NsR7a2inl6pdkKin
 LcrLlxrSZYbAjhYuA3Da4iErvtu/UloQwfDhga7NasGdVQzlwUQBX67Tgt1PA9B2
 JWoeY1NKBGboNEQa3NWq37yCtfcpx2hL4wWgyUbj0TMOXO06V/ZhrPzIQDrMmVGx
 g52NrYnH/CujrKgWH3+Q+kBWA/BSVP5p3UnhLCDM1X7dyZiimuLOJNDUQ9WldENV
 rsGgKyW3/6F4UzmqLr0oOB6X9/2v15LSktN9BJtv3UWUCl/PfXU=
 =VV7d
 -----END PGP SIGNATURE-----

Merge tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Borislav Petkov:
 "A single fix for objtool to generate proper unwind info for newer
  toolchains which do not generate section symbols anymore. And a
  cleanup ontop.

  This was originally going to go during the next merge window but
  people can already trigger a build error with binutils-2.36 which
  doesn't emit section symbols - something which objtool relies on - so
  let's expedite it"

* tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/entry: Remove put_ret_addr_in_rdi THUNK macro argument
  x86/entry: Emit a symbol for register restoring thunk
2021-01-31 11:48:12 -08:00
Linus Torvalds
17b756d037 A fix for handling advertised, but non-existent 146818 RTCs correctly. With
the recent UIP handling changes the time readout of non-existent RTCs hangs
 forever as the read returns always 0xFF which means the UIP bit is
 set. Sanity check the RTC before registering by checking the RTC_VALID
 register for correctness.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmAWiNYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXnqD/9NDgAilf97MsbTowlN+veoJHW/LHG5
 2CDTqYHSPYnfR8Nzr+ZdnQMzsw0IfbfXT8Z2GceCf1moXaKAx1DxHiRr+GPnFYov
 SANZ1W8xKwk4PbiLhq6wvtHc2RVmSZuJoh6a3FlspUR7NbjnqsJqrqoAQvjTn6b2
 /G/sXMtVRM9VQ/v80L6OukQfLzXDjLDZ50nYBAcobtNnSNCoFCyScZL7pOr4M7hY
 u9V9j+ZFVPb4Gut3j4dRUMdEExTMGzOMfQnABHW0QDn8kjOOvMey3XJLEAiJPT2T
 WkB/UBscq+le8h2aE2dX106iMXzGNmE1pphHhjwxk60dexdZpw1bgBKIPmqpnzXK
 W3nUSw3PHjUQpwPgl4GdnyZjoJO1XlsP322RrzPAdYOy2sqDrcKNTbueZFIC/zLT
 7PYLmnT2qZsTrBPGzFEkN9biz1sZvnOmJ/G8pMyIbV+JtaGpig7uy7wFczXqsKGH
 gBU0Zqv+UffoTuBQbYR/sdgotFCaPuUQKm26UzKE8FLA8Y9MUJ2St5QZcqDasQEO
 wrvdEWdXRtMexVJVJ1kzzhSBwnkkCOVARRPRr80+DDLfeqmgzwjV2jyaYkrBPtgo
 XezxOROjuhIn5GpsCX68CUgg9+l5mxLFLMFB1vYoxQyZ49jZKXunob9flo4xfVTv
 Z2oPwmDi/Ew7kw==
 =Q1FX
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Thomas Gleixner:
 "A fix for handling advertised, but non-existent 146818 RTCs correctly.

  With the recent UIP handling changes the time readout of non-existent
  RTCs hangs forever as the read returns always 0xFF which means the UIP
  bit is set.

  Sanity check the RTC before registering by checking the RTC_VALID
  register for correctness"

* tag 'timers-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rtc: mc146818: Detect and handle broken RTCs
2021-01-31 11:40:57 -08:00
Linus Torvalds
f7ea44c717 A single fix for the single step reporting regression caused by getting the
condition wrong when moving SYSCALL_EMU away from TIF flags.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmAWh7cTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYof8zEAC+Qm4Myg9SiHWr8EiZa4+TqmUxTge8
 oV36+Je18y7vHFElGBByCwfEHvsLO/mi3kgKn2lBZsSyiSiUs15p0S5M/7A7HmbW
 mcFmHioECe7VUL5Ml1Y6mhyhA9o3QdAv3PAHwNBbvUwbJSrCS7rld94T4xeZiaBh
 y00qFikxKTbblgSZpVKDG7wUKYHVQwJMqYVw6I6Y4iB+QfM1EGQxWFzV2td3H/UE
 A7g8Ay8QOXxd/agnwZaOTHrQy2Rsnp3n9sD5Y6hVpZLT3FulxsaSftK/ngn9uTou
 bFYagpXxJRPt6TylK2Y8Nn2Y1ZcLoq/bj7XKSN0MpgcM+y3/vV9GUOpyFmDTug2F
 P5onx7S6vKxG3ews+WlTxHYaSRWbO0OHWLTM+FHbW7ben/DjWNVNBa4L1u3w0Skq
 igyqmCzQURjkDbsCaMsdKPeG0KJOlCqTNj4aImskNGv5OUt77rziGg42jI07MLYV
 mE9+e/cw5P1FVoVaaMwplUvOmGaG8647IdapDo0UctHm7Y+GC81bwry/bbW/oesi
 7acnmCrO/sILwzE1H+YQnxofVlTXW/pCx3MUfNUEyJOUuI7orobfd1MOJjVUKj++
 Zm5bRy8h0RZ5q9Xy2GwCh0mSRihbtQzBXdbwIDpltFYUBF6cHh1ryqKAHBE55JiB
 IQ4J+3OK1f8dNQ==
 =WMj1
 -----END PGP SIGNATURE-----

Merge tag 'core-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull single stepping fix from Thomas Gleixner:
 "A single fix for the single step reporting regression caused by
  getting the condition wrong when moving SYSCALL_EMU away from TIF
  flags"

[ There's apparently another problem too, fix pending ]

* tag 'core-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Unbreak single step reporting behaviour
2021-01-31 11:39:32 -08:00