linux/drivers/gpu/drm/i915/intel_breadcrumbs.c

876 lines
25 KiB
C
Raw Normal View History

drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
/*
* Copyright © 2015 Intel Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
*/
#include <linux/kthread.h>
#include <uapi/linux/sched/types.h>
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
#include "i915_drv.h"
static unsigned int __intel_breadcrumbs_wakeup(struct intel_breadcrumbs *b)
{
struct intel_wait *wait;
unsigned int result = 0;
lockdep_assert_held(&b->irq_lock);
wait = b->irq_wait;
if (wait) {
result = ENGINE_WAKEUP_WAITER;
if (wake_up_process(wait->tsk))
result |= ENGINE_WAKEUP_ASLEEP;
}
return result;
}
unsigned int intel_engine_wakeup(struct intel_engine_cs *engine)
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
drm/i915: Protect intel_engine_wakeup() for call from irq context intel_engine_wakeup() is called by nop_request_submit() which is installed to handle third party fences completed from within irq context. As such, it needs the full irqsave/irqrestore and not the partial spin_irq_lock handling. [18942.714467] ================================= [18942.719076] [ INFO: inconsistent lock state ] [18942.723522] 4.11.0-rc2-CI-CI_DRM_2368+ #1 Tainted: G U W [18942.729970] --------------------------------- [18942.734466] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [18942.740594] gem_eio/1275 [HC0[0]:SC0[0]:HE1:SE1] takes: [18942.745932] (&(&fence->lock)->rlock){+.?...}, at: [<ffffffff815ec100>] dma_fence_signal+0x100/0x 230 [18942.755331] {IN-SOFTIRQ-W} state was registered at: [18942.760356] __lock_acquire+0x5d0/0x1bb0 [18942.764444] lock_acquire+0xc9/0x220 [18942.768196] _raw_spin_lock_irqsave+0x41/0x60 [18942.772747] dma_fence_signal+0x100/0x230 [18942.776927] vgem_fence_timeout+0x9/0x10 [vgem] [18942.781701] call_timer_fn+0x92/0x380 [18942.785557] expire_timers+0x150/0x1f0 [18942.789491] run_timer_softirq+0x7c/0x160 [18942.793705] __do_softirq+0x116/0x4c0 [18942.797560] irq_exit+0xa9/0xc0 [18942.800873] smp_apic_timer_interrupt+0x38/0x50 [18942.805611] apic_timer_interrupt+0x90/0xa0 [18942.810008] cpuidle_enter_state+0x135/0x380 [18942.814503] cpuidle_enter+0x12/0x20 [18942.818250] call_cpuidle+0x1e/0x40 [18942.821906] do_idle+0x17e/0x1f0 [18942.825333] cpu_startup_entry+0x18/0x20 [18942.829463] rest_init+0x127/0x130 [18942.833025] start_kernel+0x3f1/0x3fe [18942.836908] x86_64_start_reservations+0x2a/0x2c [18942.841733] x86_64_start_kernel+0x173/0x186 [18942.846234] verify_cpu+0x0/0xfc [18942.849604] irq event stamp: 30568 [18942.853140] hardirqs last enabled at (30567): [<ffffffff8110b81f>] ktime_get+0xef/0x120 [18942.861468] hardirqs last disabled at (30568): [<ffffffff81876377>] _raw_spin_lock_irqsave+0x17/0 x60 [18942.870812] softirqs last enabled at (30462): [<ffffffff81085cd9>] __do_softirq+0x1d9/0x4c0 [18942.879443] softirqs last disabled at (30439): [<ffffffff81086139>] irq_exit+0xa9/0xc0 [18942.887616] [18942.887616] other info that might help us debug this: [18942.894279] Possible unsafe locking scenario: [18942.894279] [18942.900336] CPU0 [18942.902851] ---- [18942.905362] lock(&(&fence->lock)->rlock); [18942.909647] <Interrupt> [18942.912330] lock(&(&fence->lock)->rlock); [18942.916821] [18942.916821] *** DEADLOCK *** [18942.916821] [18942.922862] 1 lock held by gem_eio/1275: [18942.926859] #0: (&(&fence->lock)->rlock){+.?...}, at: [<ffffffff815ec100>] dma_fence_signal+0x1 00/0x230 [18942.936651] [18942.936651] stack backtrace: [18942.941142] CPU: 3 PID: 1275 Comm: gem_eio Tainted: G U W 4.11.0-rc2-CI-CI_DRM_2368+ # 1 [18942.950367] Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F21 01/06/2 017 [18942.959756] Call Trace: [18942.962244] dump_stack+0x67/0x92 [18942.965626] print_usage_bug.part.23+0x259/0x268 [18942.970362] mark_lock+0x12c/0x6f0 [18942.973851] ? check_usage_forwards+0x130/0x130 [18942.978487] mark_held_locks+0x6f/0xa0 [18942.982329] ? _raw_spin_unlock_irq+0x27/0x50 [18942.986797] trace_hardirqs_on_caller+0x150/0x200 [18942.991599] trace_hardirqs_on+0xd/0x10 [18942.995515] _raw_spin_unlock_irq+0x27/0x50 [18942.999796] intel_engine_wakeup+0x26/0x30 [i915] [18943.004670] intel_engine_init_global_seqno+0x131/0x1a0 [i915] [18943.010745] nop_submit_request+0x2e/0x40 [i915] [18943.015476] submit_notify+0x3f/0x5c [i915] [18943.019763] __i915_sw_fence_complete+0x176/0x220 [i915] [18943.025234] ? try_to_del_timer_sync+0x4d/0x60 [18943.029825] i915_sw_fence_complete+0x25/0x40 [i915] [18943.034887] dma_i915_sw_fence_wake+0x26/0x60 [i915] [18943.039959] dma_fence_signal+0x146/0x230 [18943.044109] vgem_fence_signal_ioctl+0x6c/0xc0 [vgem] [18943.049275] drm_ioctl+0x200/0x450 [18943.052758] ? vgem_fence_attach_ioctl+0x270/0x270 [vgem] [18943.058334] do_vfs_ioctl+0x90/0x6e0 [18943.061991] ? entry_SYSCALL_64_fastpath+0x5/0xb1 [18943.066843] ? __this_cpu_preempt_check+0x13/0x20 [18943.071643] ? trace_hardirqs_on_caller+0xe7/0x200 [18943.076532] SyS_ioctl+0x3c/0x70 [18943.079842] entry_SYSCALL_64_fastpath+0x1c/0xb1 [18943.084558] RIP: 0033:0x7f0dfcc14357 [18943.088240] RSP: 002b:00007ffeb4628da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [18943.095996] RAX: ffffffffffffffda RBX: ffffffff8147eb93 RCX: 00007f0dfcc14357 [18943.103311] RDX: 00007ffeb4628de0 RSI: 0000000040086442 RDI: 0000000000000005 [18943.110574] RBP: ffffc9000176ff88 R08: 0000000000000004 R09: 0000000000000000 [18943.117845] R10: 0000000000000029 R11: 0000000000000246 R12: 0000000000000001 [18943.125168] R13: 0000000000000005 R14: 0000000040086442 R15: 0000000000000000 [18943.132520] ? __this_cpu_preempt_check+0x13/0x20 Fixes: cdc3a4539034 ("drm/i915: No need to save/restore irq status in intel_engine_wakeup") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170320143133.1507-1-chris@chris-wilson.co.uk Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
2017-03-20 22:31:33 +08:00
unsigned long flags;
unsigned int result;
drm/i915: Protect intel_engine_wakeup() for call from irq context intel_engine_wakeup() is called by nop_request_submit() which is installed to handle third party fences completed from within irq context. As such, it needs the full irqsave/irqrestore and not the partial spin_irq_lock handling. [18942.714467] ================================= [18942.719076] [ INFO: inconsistent lock state ] [18942.723522] 4.11.0-rc2-CI-CI_DRM_2368+ #1 Tainted: G U W [18942.729970] --------------------------------- [18942.734466] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [18942.740594] gem_eio/1275 [HC0[0]:SC0[0]:HE1:SE1] takes: [18942.745932] (&(&fence->lock)->rlock){+.?...}, at: [<ffffffff815ec100>] dma_fence_signal+0x100/0x 230 [18942.755331] {IN-SOFTIRQ-W} state was registered at: [18942.760356] __lock_acquire+0x5d0/0x1bb0 [18942.764444] lock_acquire+0xc9/0x220 [18942.768196] _raw_spin_lock_irqsave+0x41/0x60 [18942.772747] dma_fence_signal+0x100/0x230 [18942.776927] vgem_fence_timeout+0x9/0x10 [vgem] [18942.781701] call_timer_fn+0x92/0x380 [18942.785557] expire_timers+0x150/0x1f0 [18942.789491] run_timer_softirq+0x7c/0x160 [18942.793705] __do_softirq+0x116/0x4c0 [18942.797560] irq_exit+0xa9/0xc0 [18942.800873] smp_apic_timer_interrupt+0x38/0x50 [18942.805611] apic_timer_interrupt+0x90/0xa0 [18942.810008] cpuidle_enter_state+0x135/0x380 [18942.814503] cpuidle_enter+0x12/0x20 [18942.818250] call_cpuidle+0x1e/0x40 [18942.821906] do_idle+0x17e/0x1f0 [18942.825333] cpu_startup_entry+0x18/0x20 [18942.829463] rest_init+0x127/0x130 [18942.833025] start_kernel+0x3f1/0x3fe [18942.836908] x86_64_start_reservations+0x2a/0x2c [18942.841733] x86_64_start_kernel+0x173/0x186 [18942.846234] verify_cpu+0x0/0xfc [18942.849604] irq event stamp: 30568 [18942.853140] hardirqs last enabled at (30567): [<ffffffff8110b81f>] ktime_get+0xef/0x120 [18942.861468] hardirqs last disabled at (30568): [<ffffffff81876377>] _raw_spin_lock_irqsave+0x17/0 x60 [18942.870812] softirqs last enabled at (30462): [<ffffffff81085cd9>] __do_softirq+0x1d9/0x4c0 [18942.879443] softirqs last disabled at (30439): [<ffffffff81086139>] irq_exit+0xa9/0xc0 [18942.887616] [18942.887616] other info that might help us debug this: [18942.894279] Possible unsafe locking scenario: [18942.894279] [18942.900336] CPU0 [18942.902851] ---- [18942.905362] lock(&(&fence->lock)->rlock); [18942.909647] <Interrupt> [18942.912330] lock(&(&fence->lock)->rlock); [18942.916821] [18942.916821] *** DEADLOCK *** [18942.916821] [18942.922862] 1 lock held by gem_eio/1275: [18942.926859] #0: (&(&fence->lock)->rlock){+.?...}, at: [<ffffffff815ec100>] dma_fence_signal+0x1 00/0x230 [18942.936651] [18942.936651] stack backtrace: [18942.941142] CPU: 3 PID: 1275 Comm: gem_eio Tainted: G U W 4.11.0-rc2-CI-CI_DRM_2368+ # 1 [18942.950367] Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F21 01/06/2 017 [18942.959756] Call Trace: [18942.962244] dump_stack+0x67/0x92 [18942.965626] print_usage_bug.part.23+0x259/0x268 [18942.970362] mark_lock+0x12c/0x6f0 [18942.973851] ? check_usage_forwards+0x130/0x130 [18942.978487] mark_held_locks+0x6f/0xa0 [18942.982329] ? _raw_spin_unlock_irq+0x27/0x50 [18942.986797] trace_hardirqs_on_caller+0x150/0x200 [18942.991599] trace_hardirqs_on+0xd/0x10 [18942.995515] _raw_spin_unlock_irq+0x27/0x50 [18942.999796] intel_engine_wakeup+0x26/0x30 [i915] [18943.004670] intel_engine_init_global_seqno+0x131/0x1a0 [i915] [18943.010745] nop_submit_request+0x2e/0x40 [i915] [18943.015476] submit_notify+0x3f/0x5c [i915] [18943.019763] __i915_sw_fence_complete+0x176/0x220 [i915] [18943.025234] ? try_to_del_timer_sync+0x4d/0x60 [18943.029825] i915_sw_fence_complete+0x25/0x40 [i915] [18943.034887] dma_i915_sw_fence_wake+0x26/0x60 [i915] [18943.039959] dma_fence_signal+0x146/0x230 [18943.044109] vgem_fence_signal_ioctl+0x6c/0xc0 [vgem] [18943.049275] drm_ioctl+0x200/0x450 [18943.052758] ? vgem_fence_attach_ioctl+0x270/0x270 [vgem] [18943.058334] do_vfs_ioctl+0x90/0x6e0 [18943.061991] ? entry_SYSCALL_64_fastpath+0x5/0xb1 [18943.066843] ? __this_cpu_preempt_check+0x13/0x20 [18943.071643] ? trace_hardirqs_on_caller+0xe7/0x200 [18943.076532] SyS_ioctl+0x3c/0x70 [18943.079842] entry_SYSCALL_64_fastpath+0x1c/0xb1 [18943.084558] RIP: 0033:0x7f0dfcc14357 [18943.088240] RSP: 002b:00007ffeb4628da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [18943.095996] RAX: ffffffffffffffda RBX: ffffffff8147eb93 RCX: 00007f0dfcc14357 [18943.103311] RDX: 00007ffeb4628de0 RSI: 0000000040086442 RDI: 0000000000000005 [18943.110574] RBP: ffffc9000176ff88 R08: 0000000000000004 R09: 0000000000000000 [18943.117845] R10: 0000000000000029 R11: 0000000000000246 R12: 0000000000000001 [18943.125168] R13: 0000000000000005 R14: 0000000040086442 R15: 0000000000000000 [18943.132520] ? __this_cpu_preempt_check+0x13/0x20 Fixes: cdc3a4539034 ("drm/i915: No need to save/restore irq status in intel_engine_wakeup") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170320143133.1507-1-chris@chris-wilson.co.uk Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
2017-03-20 22:31:33 +08:00
spin_lock_irqsave(&b->irq_lock, flags);
result = __intel_breadcrumbs_wakeup(b);
drm/i915: Protect intel_engine_wakeup() for call from irq context intel_engine_wakeup() is called by nop_request_submit() which is installed to handle third party fences completed from within irq context. As such, it needs the full irqsave/irqrestore and not the partial spin_irq_lock handling. [18942.714467] ================================= [18942.719076] [ INFO: inconsistent lock state ] [18942.723522] 4.11.0-rc2-CI-CI_DRM_2368+ #1 Tainted: G U W [18942.729970] --------------------------------- [18942.734466] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [18942.740594] gem_eio/1275 [HC0[0]:SC0[0]:HE1:SE1] takes: [18942.745932] (&(&fence->lock)->rlock){+.?...}, at: [<ffffffff815ec100>] dma_fence_signal+0x100/0x 230 [18942.755331] {IN-SOFTIRQ-W} state was registered at: [18942.760356] __lock_acquire+0x5d0/0x1bb0 [18942.764444] lock_acquire+0xc9/0x220 [18942.768196] _raw_spin_lock_irqsave+0x41/0x60 [18942.772747] dma_fence_signal+0x100/0x230 [18942.776927] vgem_fence_timeout+0x9/0x10 [vgem] [18942.781701] call_timer_fn+0x92/0x380 [18942.785557] expire_timers+0x150/0x1f0 [18942.789491] run_timer_softirq+0x7c/0x160 [18942.793705] __do_softirq+0x116/0x4c0 [18942.797560] irq_exit+0xa9/0xc0 [18942.800873] smp_apic_timer_interrupt+0x38/0x50 [18942.805611] apic_timer_interrupt+0x90/0xa0 [18942.810008] cpuidle_enter_state+0x135/0x380 [18942.814503] cpuidle_enter+0x12/0x20 [18942.818250] call_cpuidle+0x1e/0x40 [18942.821906] do_idle+0x17e/0x1f0 [18942.825333] cpu_startup_entry+0x18/0x20 [18942.829463] rest_init+0x127/0x130 [18942.833025] start_kernel+0x3f1/0x3fe [18942.836908] x86_64_start_reservations+0x2a/0x2c [18942.841733] x86_64_start_kernel+0x173/0x186 [18942.846234] verify_cpu+0x0/0xfc [18942.849604] irq event stamp: 30568 [18942.853140] hardirqs last enabled at (30567): [<ffffffff8110b81f>] ktime_get+0xef/0x120 [18942.861468] hardirqs last disabled at (30568): [<ffffffff81876377>] _raw_spin_lock_irqsave+0x17/0 x60 [18942.870812] softirqs last enabled at (30462): [<ffffffff81085cd9>] __do_softirq+0x1d9/0x4c0 [18942.879443] softirqs last disabled at (30439): [<ffffffff81086139>] irq_exit+0xa9/0xc0 [18942.887616] [18942.887616] other info that might help us debug this: [18942.894279] Possible unsafe locking scenario: [18942.894279] [18942.900336] CPU0 [18942.902851] ---- [18942.905362] lock(&(&fence->lock)->rlock); [18942.909647] <Interrupt> [18942.912330] lock(&(&fence->lock)->rlock); [18942.916821] [18942.916821] *** DEADLOCK *** [18942.916821] [18942.922862] 1 lock held by gem_eio/1275: [18942.926859] #0: (&(&fence->lock)->rlock){+.?...}, at: [<ffffffff815ec100>] dma_fence_signal+0x1 00/0x230 [18942.936651] [18942.936651] stack backtrace: [18942.941142] CPU: 3 PID: 1275 Comm: gem_eio Tainted: G U W 4.11.0-rc2-CI-CI_DRM_2368+ # 1 [18942.950367] Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F21 01/06/2 017 [18942.959756] Call Trace: [18942.962244] dump_stack+0x67/0x92 [18942.965626] print_usage_bug.part.23+0x259/0x268 [18942.970362] mark_lock+0x12c/0x6f0 [18942.973851] ? check_usage_forwards+0x130/0x130 [18942.978487] mark_held_locks+0x6f/0xa0 [18942.982329] ? _raw_spin_unlock_irq+0x27/0x50 [18942.986797] trace_hardirqs_on_caller+0x150/0x200 [18942.991599] trace_hardirqs_on+0xd/0x10 [18942.995515] _raw_spin_unlock_irq+0x27/0x50 [18942.999796] intel_engine_wakeup+0x26/0x30 [i915] [18943.004670] intel_engine_init_global_seqno+0x131/0x1a0 [i915] [18943.010745] nop_submit_request+0x2e/0x40 [i915] [18943.015476] submit_notify+0x3f/0x5c [i915] [18943.019763] __i915_sw_fence_complete+0x176/0x220 [i915] [18943.025234] ? try_to_del_timer_sync+0x4d/0x60 [18943.029825] i915_sw_fence_complete+0x25/0x40 [i915] [18943.034887] dma_i915_sw_fence_wake+0x26/0x60 [i915] [18943.039959] dma_fence_signal+0x146/0x230 [18943.044109] vgem_fence_signal_ioctl+0x6c/0xc0 [vgem] [18943.049275] drm_ioctl+0x200/0x450 [18943.052758] ? vgem_fence_attach_ioctl+0x270/0x270 [vgem] [18943.058334] do_vfs_ioctl+0x90/0x6e0 [18943.061991] ? entry_SYSCALL_64_fastpath+0x5/0xb1 [18943.066843] ? __this_cpu_preempt_check+0x13/0x20 [18943.071643] ? trace_hardirqs_on_caller+0xe7/0x200 [18943.076532] SyS_ioctl+0x3c/0x70 [18943.079842] entry_SYSCALL_64_fastpath+0x1c/0xb1 [18943.084558] RIP: 0033:0x7f0dfcc14357 [18943.088240] RSP: 002b:00007ffeb4628da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [18943.095996] RAX: ffffffffffffffda RBX: ffffffff8147eb93 RCX: 00007f0dfcc14357 [18943.103311] RDX: 00007ffeb4628de0 RSI: 0000000040086442 RDI: 0000000000000005 [18943.110574] RBP: ffffc9000176ff88 R08: 0000000000000004 R09: 0000000000000000 [18943.117845] R10: 0000000000000029 R11: 0000000000000246 R12: 0000000000000001 [18943.125168] R13: 0000000000000005 R14: 0000000040086442 R15: 0000000000000000 [18943.132520] ? __this_cpu_preempt_check+0x13/0x20 Fixes: cdc3a4539034 ("drm/i915: No need to save/restore irq status in intel_engine_wakeup") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170320143133.1507-1-chris@chris-wilson.co.uk Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
2017-03-20 22:31:33 +08:00
spin_unlock_irqrestore(&b->irq_lock, flags);
return result;
}
static unsigned long wait_timeout(void)
{
return round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES);
}
static noinline void missed_breadcrumb(struct intel_engine_cs *engine)
{
DRM_DEBUG_DRIVER("%s missed breadcrumb at %pF, irq posted? %s, current seqno=%x, last=%x\n",
engine->name, __builtin_return_address(0),
yesno(test_bit(ENGINE_IRQ_BREADCRUMB,
&engine->irq_posted)),
intel_engine_get_seqno(engine),
intel_engine_last_submit(engine));
set_bit(engine->id, &engine->i915->gpu_error.missed_irq_rings);
}
static void intel_breadcrumbs_hangcheck(unsigned long data)
{
struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
struct intel_breadcrumbs *b = &engine->breadcrumbs;
if (!b->irq_armed)
return;
if (b->hangcheck_interrupts != atomic_read(&engine->irq_count)) {
b->hangcheck_interrupts = atomic_read(&engine->irq_count);
mod_timer(&b->hangcheck, wait_timeout());
return;
}
/* We keep the hangcheck timer alive until we disarm the irq, even
* if there are no waiters at present.
*
* If the waiter was currently running, assume it hasn't had a chance
* to process the pending interrupt (e.g, low priority task on a loaded
* system) and wait until it sleeps before declaring a missed interrupt.
*
* If the waiter was asleep (and not even pending a wakeup), then we
* must have missed an interrupt as the GPU has stopped advancing
* but we still have a waiter. Assuming all batches complete within
* DRM_I915_HANGCHECK_JIFFIES [1.5s]!
*/
if (intel_engine_wakeup(engine) & ENGINE_WAKEUP_ASLEEP) {
missed_breadcrumb(engine);
mod_timer(&engine->breadcrumbs.fake_irq, jiffies + 1);
} else {
mod_timer(&b->hangcheck, wait_timeout());
}
}
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
static void intel_breadcrumbs_fake_irq(unsigned long data)
{
struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
struct intel_breadcrumbs *b = &engine->breadcrumbs;
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
/* The timer persists in case we cannot enable interrupts,
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
* or if we have previously seen seqno/interrupt incoherency
* ("missed interrupt" syndrome, better known as a "missed breadcrumb").
* Here the worker will wake up every jiffie in order to kick the
* oldest waiter to do the coherent seqno check.
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
*/
spin_lock_irq(&b->irq_lock);
if (!__intel_breadcrumbs_wakeup(b))
__intel_engine_disarm_breadcrumbs(engine);
spin_unlock_irq(&b->irq_lock);
if (!b->irq_armed)
return;
mod_timer(&b->fake_irq, jiffies + 1);
/* Ensure that even if the GPU hangs, we get woken up.
*
* However, note that if no one is waiting, we never notice
* a gpu hang. Eventually, we will have to wait for a resource
* held by the GPU and so trigger a hangcheck. In the most
* pathological case, this will be upon memory starvation! To
* prevent this, we also queue the hangcheck from the retire
* worker.
*/
i915_queue_hangcheck(engine->i915);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
}
static void irq_enable(struct intel_engine_cs *engine)
{
/* Enabling the IRQ may miss the generation of the interrupt, but
* we still need to force the barrier before reading the seqno,
* just in case.
*/
set_bit(ENGINE_IRQ_BREADCRUMB, &engine->irq_posted);
drm/i915: Convert breadcrumbs spinlock to be irqsafe The breadcrumbs are about to be used from within IRQ context sections (e.g. nouveau signals a fence from an interrupt handler causing us to submit a new request) and/or from bottom-half tasklets (i.e. intel_lrc_irq_handler), therefore we need to employ the irqsafe spinlock variants. For example, deferring the request submission to the intel_lrc_irq_handler generates this trace: [ 66.388639] ================================= [ 66.388650] [ INFO: inconsistent lock state ] [ 66.388663] 4.9.0-rc2+ #56 Not tainted [ 66.388672] --------------------------------- [ 66.388682] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 66.388695] swapper/1/0 [HC0[0]:SC1[1]:HE0:SE0] takes: [ 66.388706] (&(&b->lock)->rlock){+.?...} , at: [<ffffffff81401c88>] intel_engine_enable_signaling+0x78/0x150 [ 66.388761] {SOFTIRQ-ON-W} state was registered at: [ 66.388772] [ 66.388783] [<ffffffff810bd842>] __lock_acquire+0x682/0x1870 [ 66.388795] [ 66.388803] [<ffffffff810bedbc>] lock_acquire+0x6c/0xb0 [ 66.388814] [ 66.388824] [<ffffffff8161753a>] _raw_spin_lock+0x2a/0x40 [ 66.388835] [ 66.388845] [<ffffffff81401e41>] intel_engine_reset_breadcrumbs+0x21/0xb0 [ 66.388857] [ 66.388866] [<ffffffff81403ae7>] gen8_init_common_ring+0x67/0x100 [ 66.388878] [ 66.388887] [<ffffffff81403b92>] gen8_init_render_ring+0x12/0x60 [ 66.388903] [ 66.388912] [<ffffffff813f8707>] i915_gem_init_hw+0xf7/0x2a0 [ 66.388927] [ 66.388936] [<ffffffff813f899b>] i915_gem_init+0xbb/0xf0 [ 66.388950] [ 66.388959] [<ffffffff813b4980>] i915_driver_load+0x7e0/0x1330 [ 66.388978] [ 66.388988] [<ffffffff813c09d8>] i915_pci_probe+0x28/0x40 [ 66.389003] [ 66.389013] [<ffffffff812fa0db>] pci_device_probe+0x8b/0xf0 [ 66.389028] [ 66.389037] [<ffffffff8147737e>] driver_probe_device+0x21e/0x430 [ 66.389056] [ 66.389065] [<ffffffff8147766e>] __driver_attach+0xde/0xe0 [ 66.389080] [ 66.389090] [<ffffffff814751ad>] bus_for_each_dev+0x5d/0x90 [ 66.389105] [ 66.389113] [<ffffffff81477799>] driver_attach+0x19/0x20 [ 66.389134] [ 66.389144] [<ffffffff81475ced>] bus_add_driver+0x15d/0x260 [ 66.389159] [ 66.389168] [<ffffffff81477e3b>] driver_register+0x5b/0xd0 [ 66.389183] [ 66.389281] [<ffffffff812fa19b>] __pci_register_driver+0x5b/0x60 [ 66.389301] [ 66.389312] [<ffffffff81aed333>] i915_init+0x3e/0x45 [ 66.389326] [ 66.389336] [<ffffffff81ac2ffa>] do_one_initcall+0x8b/0x118 [ 66.389350] [ 66.389359] [<ffffffff81ac323a>] kernel_init_freeable+0x1b3/0x23b [ 66.389378] [ 66.389387] [<ffffffff8160fc39>] kernel_init+0x9/0x100 [ 66.389402] [ 66.389411] [<ffffffff816180e7>] ret_from_fork+0x27/0x40 [ 66.389426] irq event stamp: 315865 [ 66.389438] hardirqs last enabled at (315864): [<ffffffff816178f1>] _raw_spin_unlock_irqrestore+0x31/0x50 [ 66.389469] hardirqs last disabled at (315865): [<ffffffff816176b3>] _raw_spin_lock_irqsave+0x13/0x50 [ 66.389499] softirqs last enabled at (315818): [<ffffffff8107a04c>] _local_bh_enable+0x1c/0x50 [ 66.389530] softirqs last disabled at (315819): [<ffffffff8107a50e>] irq_exit+0xbe/0xd0 [ 66.389559] [ 66.389559] other info that might help us debug this: [ 66.389580] Possible unsafe locking scenario: [ 66.389580] [ 66.389598] CPU0 [ 66.389609] ---- [ 66.389620] lock(&(&b->lock)->rlock); [ 66.389650] <Interrupt> [ 66.389661] lock(&(&b->lock)->rlock); [ 66.389690] [ 66.389690] *** DEADLOCK *** [ 66.389690] [ 66.389715] 2 locks held by swapper/1/0: [ 66.389728] #0: (&(&tl->lock)->rlock){..-...}, at: [<ffffffff81403e01>] intel_lrc_irq_handler+0x201/0x3c0 [ 66.389785] #1: (&(&req->lock)->rlock/1){..-...}, at: [<ffffffff813fc0af>] __i915_gem_request_submit+0x8f/0x170 [ 66.389854] [ 66.389854] stack backtrace: [ 66.389959] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc2+ #56 [ 66.389976] Hardware name: / , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 [ 66.389999] ffff88027fd03c58 ffffffff812beae5 ffff88027696e680 ffffffff822afe20 [ 66.390036] ffff88027fd03ca8 ffffffff810bb420 0000000000000001 0000000000000000 [ 66.390070] 0000000000000000 0000000000000006 0000000000000004 ffff88027696ee10 [ 66.390104] Call Trace: [ 66.390117] <IRQ> [ 66.390128] [<ffffffff812beae5>] dump_stack+0x68/0x93 [ 66.390147] [<ffffffff810bb420>] print_usage_bug+0x1d0/0x1e0 [ 66.390164] [<ffffffff810bb8a0>] mark_lock+0x470/0x4f0 [ 66.390181] [<ffffffff810ba9d0>] ? print_shortest_lock_dependencies+0x1b0/0x1b0 [ 66.390203] [<ffffffff810bd75d>] __lock_acquire+0x59d/0x1870 [ 66.390221] [<ffffffff810bedbc>] lock_acquire+0x6c/0xb0 [ 66.390237] [<ffffffff810bedbc>] ? lock_acquire+0x6c/0xb0 [ 66.390255] [<ffffffff81401c88>] ? intel_engine_enable_signaling+0x78/0x150 [ 66.390273] [<ffffffff8161753a>] _raw_spin_lock+0x2a/0x40 [ 66.390291] [<ffffffff81401c88>] ? intel_engine_enable_signaling+0x78/0x150 [ 66.390309] [<ffffffff81401c88>] intel_engine_enable_signaling+0x78/0x150 [ 66.390327] [<ffffffff813fc170>] __i915_gem_request_submit+0x150/0x170 [ 66.390345] [<ffffffff81403e8b>] intel_lrc_irq_handler+0x28b/0x3c0 [ 66.390363] [<ffffffff81079d97>] tasklet_action+0x57/0xc0 [ 66.390380] [<ffffffff8107a249>] __do_softirq+0x119/0x240 [ 66.390396] [<ffffffff8107a50e>] irq_exit+0xbe/0xd0 [ 66.390414] [<ffffffff8101afd5>] do_IRQ+0x65/0x110 [ 66.390431] [<ffffffff81618806>] common_interrupt+0x86/0x86 [ 66.390446] <EOI> [ 66.390457] [<ffffffff814ec6d1>] ? cpuidle_enter_state+0x151/0x200 [ 66.390480] [<ffffffff814ec7a2>] cpuidle_enter+0x12/0x20 [ 66.390498] [<ffffffff810b639e>] call_cpuidle+0x1e/0x40 [ 66.390516] [<ffffffff810b65ae>] cpu_startup_entry+0x10e/0x1f0 [ 66.390534] [<ffffffff81036133>] start_secondary+0x103/0x130 (This is split out of the defer global seqno allocation patch due to realisation that we need a more complete conversion if we want to defer request submission even further.) v2: lockdep was warning about mixed SOFTIRQ contexts not HARDIRQ contexts so we only need to use spin_lock_bh and not disable interrupts. v3: We need full irq protection as we may be called from a third party interrupt handler (via fences). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-32-chris@chris-wilson.co.uk
2016-10-28 20:58:55 +08:00
/* Caller disables interrupts */
spin_lock(&engine->i915->irq_lock);
engine->irq_enable(engine);
drm/i915: Convert breadcrumbs spinlock to be irqsafe The breadcrumbs are about to be used from within IRQ context sections (e.g. nouveau signals a fence from an interrupt handler causing us to submit a new request) and/or from bottom-half tasklets (i.e. intel_lrc_irq_handler), therefore we need to employ the irqsafe spinlock variants. For example, deferring the request submission to the intel_lrc_irq_handler generates this trace: [ 66.388639] ================================= [ 66.388650] [ INFO: inconsistent lock state ] [ 66.388663] 4.9.0-rc2+ #56 Not tainted [ 66.388672] --------------------------------- [ 66.388682] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 66.388695] swapper/1/0 [HC0[0]:SC1[1]:HE0:SE0] takes: [ 66.388706] (&(&b->lock)->rlock){+.?...} , at: [<ffffffff81401c88>] intel_engine_enable_signaling+0x78/0x150 [ 66.388761] {SOFTIRQ-ON-W} state was registered at: [ 66.388772] [ 66.388783] [<ffffffff810bd842>] __lock_acquire+0x682/0x1870 [ 66.388795] [ 66.388803] [<ffffffff810bedbc>] lock_acquire+0x6c/0xb0 [ 66.388814] [ 66.388824] [<ffffffff8161753a>] _raw_spin_lock+0x2a/0x40 [ 66.388835] [ 66.388845] [<ffffffff81401e41>] intel_engine_reset_breadcrumbs+0x21/0xb0 [ 66.388857] [ 66.388866] [<ffffffff81403ae7>] gen8_init_common_ring+0x67/0x100 [ 66.388878] [ 66.388887] [<ffffffff81403b92>] gen8_init_render_ring+0x12/0x60 [ 66.388903] [ 66.388912] [<ffffffff813f8707>] i915_gem_init_hw+0xf7/0x2a0 [ 66.388927] [ 66.388936] [<ffffffff813f899b>] i915_gem_init+0xbb/0xf0 [ 66.388950] [ 66.388959] [<ffffffff813b4980>] i915_driver_load+0x7e0/0x1330 [ 66.388978] [ 66.388988] [<ffffffff813c09d8>] i915_pci_probe+0x28/0x40 [ 66.389003] [ 66.389013] [<ffffffff812fa0db>] pci_device_probe+0x8b/0xf0 [ 66.389028] [ 66.389037] [<ffffffff8147737e>] driver_probe_device+0x21e/0x430 [ 66.389056] [ 66.389065] [<ffffffff8147766e>] __driver_attach+0xde/0xe0 [ 66.389080] [ 66.389090] [<ffffffff814751ad>] bus_for_each_dev+0x5d/0x90 [ 66.389105] [ 66.389113] [<ffffffff81477799>] driver_attach+0x19/0x20 [ 66.389134] [ 66.389144] [<ffffffff81475ced>] bus_add_driver+0x15d/0x260 [ 66.389159] [ 66.389168] [<ffffffff81477e3b>] driver_register+0x5b/0xd0 [ 66.389183] [ 66.389281] [<ffffffff812fa19b>] __pci_register_driver+0x5b/0x60 [ 66.389301] [ 66.389312] [<ffffffff81aed333>] i915_init+0x3e/0x45 [ 66.389326] [ 66.389336] [<ffffffff81ac2ffa>] do_one_initcall+0x8b/0x118 [ 66.389350] [ 66.389359] [<ffffffff81ac323a>] kernel_init_freeable+0x1b3/0x23b [ 66.389378] [ 66.389387] [<ffffffff8160fc39>] kernel_init+0x9/0x100 [ 66.389402] [ 66.389411] [<ffffffff816180e7>] ret_from_fork+0x27/0x40 [ 66.389426] irq event stamp: 315865 [ 66.389438] hardirqs last enabled at (315864): [<ffffffff816178f1>] _raw_spin_unlock_irqrestore+0x31/0x50 [ 66.389469] hardirqs last disabled at (315865): [<ffffffff816176b3>] _raw_spin_lock_irqsave+0x13/0x50 [ 66.389499] softirqs last enabled at (315818): [<ffffffff8107a04c>] _local_bh_enable+0x1c/0x50 [ 66.389530] softirqs last disabled at (315819): [<ffffffff8107a50e>] irq_exit+0xbe/0xd0 [ 66.389559] [ 66.389559] other info that might help us debug this: [ 66.389580] Possible unsafe locking scenario: [ 66.389580] [ 66.389598] CPU0 [ 66.389609] ---- [ 66.389620] lock(&(&b->lock)->rlock); [ 66.389650] <Interrupt> [ 66.389661] lock(&(&b->lock)->rlock); [ 66.389690] [ 66.389690] *** DEADLOCK *** [ 66.389690] [ 66.389715] 2 locks held by swapper/1/0: [ 66.389728] #0: (&(&tl->lock)->rlock){..-...}, at: [<ffffffff81403e01>] intel_lrc_irq_handler+0x201/0x3c0 [ 66.389785] #1: (&(&req->lock)->rlock/1){..-...}, at: [<ffffffff813fc0af>] __i915_gem_request_submit+0x8f/0x170 [ 66.389854] [ 66.389854] stack backtrace: [ 66.389959] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc2+ #56 [ 66.389976] Hardware name: / , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 [ 66.389999] ffff88027fd03c58 ffffffff812beae5 ffff88027696e680 ffffffff822afe20 [ 66.390036] ffff88027fd03ca8 ffffffff810bb420 0000000000000001 0000000000000000 [ 66.390070] 0000000000000000 0000000000000006 0000000000000004 ffff88027696ee10 [ 66.390104] Call Trace: [ 66.390117] <IRQ> [ 66.390128] [<ffffffff812beae5>] dump_stack+0x68/0x93 [ 66.390147] [<ffffffff810bb420>] print_usage_bug+0x1d0/0x1e0 [ 66.390164] [<ffffffff810bb8a0>] mark_lock+0x470/0x4f0 [ 66.390181] [<ffffffff810ba9d0>] ? print_shortest_lock_dependencies+0x1b0/0x1b0 [ 66.390203] [<ffffffff810bd75d>] __lock_acquire+0x59d/0x1870 [ 66.390221] [<ffffffff810bedbc>] lock_acquire+0x6c/0xb0 [ 66.390237] [<ffffffff810bedbc>] ? lock_acquire+0x6c/0xb0 [ 66.390255] [<ffffffff81401c88>] ? intel_engine_enable_signaling+0x78/0x150 [ 66.390273] [<ffffffff8161753a>] _raw_spin_lock+0x2a/0x40 [ 66.390291] [<ffffffff81401c88>] ? intel_engine_enable_signaling+0x78/0x150 [ 66.390309] [<ffffffff81401c88>] intel_engine_enable_signaling+0x78/0x150 [ 66.390327] [<ffffffff813fc170>] __i915_gem_request_submit+0x150/0x170 [ 66.390345] [<ffffffff81403e8b>] intel_lrc_irq_handler+0x28b/0x3c0 [ 66.390363] [<ffffffff81079d97>] tasklet_action+0x57/0xc0 [ 66.390380] [<ffffffff8107a249>] __do_softirq+0x119/0x240 [ 66.390396] [<ffffffff8107a50e>] irq_exit+0xbe/0xd0 [ 66.390414] [<ffffffff8101afd5>] do_IRQ+0x65/0x110 [ 66.390431] [<ffffffff81618806>] common_interrupt+0x86/0x86 [ 66.390446] <EOI> [ 66.390457] [<ffffffff814ec6d1>] ? cpuidle_enter_state+0x151/0x200 [ 66.390480] [<ffffffff814ec7a2>] cpuidle_enter+0x12/0x20 [ 66.390498] [<ffffffff810b639e>] call_cpuidle+0x1e/0x40 [ 66.390516] [<ffffffff810b65ae>] cpu_startup_entry+0x10e/0x1f0 [ 66.390534] [<ffffffff81036133>] start_secondary+0x103/0x130 (This is split out of the defer global seqno allocation patch due to realisation that we need a more complete conversion if we want to defer request submission even further.) v2: lockdep was warning about mixed SOFTIRQ contexts not HARDIRQ contexts so we only need to use spin_lock_bh and not disable interrupts. v3: We need full irq protection as we may be called from a third party interrupt handler (via fences). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-32-chris@chris-wilson.co.uk
2016-10-28 20:58:55 +08:00
spin_unlock(&engine->i915->irq_lock);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
}
static void irq_disable(struct intel_engine_cs *engine)
{
drm/i915: Convert breadcrumbs spinlock to be irqsafe The breadcrumbs are about to be used from within IRQ context sections (e.g. nouveau signals a fence from an interrupt handler causing us to submit a new request) and/or from bottom-half tasklets (i.e. intel_lrc_irq_handler), therefore we need to employ the irqsafe spinlock variants. For example, deferring the request submission to the intel_lrc_irq_handler generates this trace: [ 66.388639] ================================= [ 66.388650] [ INFO: inconsistent lock state ] [ 66.388663] 4.9.0-rc2+ #56 Not tainted [ 66.388672] --------------------------------- [ 66.388682] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 66.388695] swapper/1/0 [HC0[0]:SC1[1]:HE0:SE0] takes: [ 66.388706] (&(&b->lock)->rlock){+.?...} , at: [<ffffffff81401c88>] intel_engine_enable_signaling+0x78/0x150 [ 66.388761] {SOFTIRQ-ON-W} state was registered at: [ 66.388772] [ 66.388783] [<ffffffff810bd842>] __lock_acquire+0x682/0x1870 [ 66.388795] [ 66.388803] [<ffffffff810bedbc>] lock_acquire+0x6c/0xb0 [ 66.388814] [ 66.388824] [<ffffffff8161753a>] _raw_spin_lock+0x2a/0x40 [ 66.388835] [ 66.388845] [<ffffffff81401e41>] intel_engine_reset_breadcrumbs+0x21/0xb0 [ 66.388857] [ 66.388866] [<ffffffff81403ae7>] gen8_init_common_ring+0x67/0x100 [ 66.388878] [ 66.388887] [<ffffffff81403b92>] gen8_init_render_ring+0x12/0x60 [ 66.388903] [ 66.388912] [<ffffffff813f8707>] i915_gem_init_hw+0xf7/0x2a0 [ 66.388927] [ 66.388936] [<ffffffff813f899b>] i915_gem_init+0xbb/0xf0 [ 66.388950] [ 66.388959] [<ffffffff813b4980>] i915_driver_load+0x7e0/0x1330 [ 66.388978] [ 66.388988] [<ffffffff813c09d8>] i915_pci_probe+0x28/0x40 [ 66.389003] [ 66.389013] [<ffffffff812fa0db>] pci_device_probe+0x8b/0xf0 [ 66.389028] [ 66.389037] [<ffffffff8147737e>] driver_probe_device+0x21e/0x430 [ 66.389056] [ 66.389065] [<ffffffff8147766e>] __driver_attach+0xde/0xe0 [ 66.389080] [ 66.389090] [<ffffffff814751ad>] bus_for_each_dev+0x5d/0x90 [ 66.389105] [ 66.389113] [<ffffffff81477799>] driver_attach+0x19/0x20 [ 66.389134] [ 66.389144] [<ffffffff81475ced>] bus_add_driver+0x15d/0x260 [ 66.389159] [ 66.389168] [<ffffffff81477e3b>] driver_register+0x5b/0xd0 [ 66.389183] [ 66.389281] [<ffffffff812fa19b>] __pci_register_driver+0x5b/0x60 [ 66.389301] [ 66.389312] [<ffffffff81aed333>] i915_init+0x3e/0x45 [ 66.389326] [ 66.389336] [<ffffffff81ac2ffa>] do_one_initcall+0x8b/0x118 [ 66.389350] [ 66.389359] [<ffffffff81ac323a>] kernel_init_freeable+0x1b3/0x23b [ 66.389378] [ 66.389387] [<ffffffff8160fc39>] kernel_init+0x9/0x100 [ 66.389402] [ 66.389411] [<ffffffff816180e7>] ret_from_fork+0x27/0x40 [ 66.389426] irq event stamp: 315865 [ 66.389438] hardirqs last enabled at (315864): [<ffffffff816178f1>] _raw_spin_unlock_irqrestore+0x31/0x50 [ 66.389469] hardirqs last disabled at (315865): [<ffffffff816176b3>] _raw_spin_lock_irqsave+0x13/0x50 [ 66.389499] softirqs last enabled at (315818): [<ffffffff8107a04c>] _local_bh_enable+0x1c/0x50 [ 66.389530] softirqs last disabled at (315819): [<ffffffff8107a50e>] irq_exit+0xbe/0xd0 [ 66.389559] [ 66.389559] other info that might help us debug this: [ 66.389580] Possible unsafe locking scenario: [ 66.389580] [ 66.389598] CPU0 [ 66.389609] ---- [ 66.389620] lock(&(&b->lock)->rlock); [ 66.389650] <Interrupt> [ 66.389661] lock(&(&b->lock)->rlock); [ 66.389690] [ 66.389690] *** DEADLOCK *** [ 66.389690] [ 66.389715] 2 locks held by swapper/1/0: [ 66.389728] #0: (&(&tl->lock)->rlock){..-...}, at: [<ffffffff81403e01>] intel_lrc_irq_handler+0x201/0x3c0 [ 66.389785] #1: (&(&req->lock)->rlock/1){..-...}, at: [<ffffffff813fc0af>] __i915_gem_request_submit+0x8f/0x170 [ 66.389854] [ 66.389854] stack backtrace: [ 66.389959] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc2+ #56 [ 66.389976] Hardware name: / , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 [ 66.389999] ffff88027fd03c58 ffffffff812beae5 ffff88027696e680 ffffffff822afe20 [ 66.390036] ffff88027fd03ca8 ffffffff810bb420 0000000000000001 0000000000000000 [ 66.390070] 0000000000000000 0000000000000006 0000000000000004 ffff88027696ee10 [ 66.390104] Call Trace: [ 66.390117] <IRQ> [ 66.390128] [<ffffffff812beae5>] dump_stack+0x68/0x93 [ 66.390147] [<ffffffff810bb420>] print_usage_bug+0x1d0/0x1e0 [ 66.390164] [<ffffffff810bb8a0>] mark_lock+0x470/0x4f0 [ 66.390181] [<ffffffff810ba9d0>] ? print_shortest_lock_dependencies+0x1b0/0x1b0 [ 66.390203] [<ffffffff810bd75d>] __lock_acquire+0x59d/0x1870 [ 66.390221] [<ffffffff810bedbc>] lock_acquire+0x6c/0xb0 [ 66.390237] [<ffffffff810bedbc>] ? lock_acquire+0x6c/0xb0 [ 66.390255] [<ffffffff81401c88>] ? intel_engine_enable_signaling+0x78/0x150 [ 66.390273] [<ffffffff8161753a>] _raw_spin_lock+0x2a/0x40 [ 66.390291] [<ffffffff81401c88>] ? intel_engine_enable_signaling+0x78/0x150 [ 66.390309] [<ffffffff81401c88>] intel_engine_enable_signaling+0x78/0x150 [ 66.390327] [<ffffffff813fc170>] __i915_gem_request_submit+0x150/0x170 [ 66.390345] [<ffffffff81403e8b>] intel_lrc_irq_handler+0x28b/0x3c0 [ 66.390363] [<ffffffff81079d97>] tasklet_action+0x57/0xc0 [ 66.390380] [<ffffffff8107a249>] __do_softirq+0x119/0x240 [ 66.390396] [<ffffffff8107a50e>] irq_exit+0xbe/0xd0 [ 66.390414] [<ffffffff8101afd5>] do_IRQ+0x65/0x110 [ 66.390431] [<ffffffff81618806>] common_interrupt+0x86/0x86 [ 66.390446] <EOI> [ 66.390457] [<ffffffff814ec6d1>] ? cpuidle_enter_state+0x151/0x200 [ 66.390480] [<ffffffff814ec7a2>] cpuidle_enter+0x12/0x20 [ 66.390498] [<ffffffff810b639e>] call_cpuidle+0x1e/0x40 [ 66.390516] [<ffffffff810b65ae>] cpu_startup_entry+0x10e/0x1f0 [ 66.390534] [<ffffffff81036133>] start_secondary+0x103/0x130 (This is split out of the defer global seqno allocation patch due to realisation that we need a more complete conversion if we want to defer request submission even further.) v2: lockdep was warning about mixed SOFTIRQ contexts not HARDIRQ contexts so we only need to use spin_lock_bh and not disable interrupts. v3: We need full irq protection as we may be called from a third party interrupt handler (via fences). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-32-chris@chris-wilson.co.uk
2016-10-28 20:58:55 +08:00
/* Caller disables interrupts */
spin_lock(&engine->i915->irq_lock);
engine->irq_disable(engine);
drm/i915: Convert breadcrumbs spinlock to be irqsafe The breadcrumbs are about to be used from within IRQ context sections (e.g. nouveau signals a fence from an interrupt handler causing us to submit a new request) and/or from bottom-half tasklets (i.e. intel_lrc_irq_handler), therefore we need to employ the irqsafe spinlock variants. For example, deferring the request submission to the intel_lrc_irq_handler generates this trace: [ 66.388639] ================================= [ 66.388650] [ INFO: inconsistent lock state ] [ 66.388663] 4.9.0-rc2+ #56 Not tainted [ 66.388672] --------------------------------- [ 66.388682] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 66.388695] swapper/1/0 [HC0[0]:SC1[1]:HE0:SE0] takes: [ 66.388706] (&(&b->lock)->rlock){+.?...} , at: [<ffffffff81401c88>] intel_engine_enable_signaling+0x78/0x150 [ 66.388761] {SOFTIRQ-ON-W} state was registered at: [ 66.388772] [ 66.388783] [<ffffffff810bd842>] __lock_acquire+0x682/0x1870 [ 66.388795] [ 66.388803] [<ffffffff810bedbc>] lock_acquire+0x6c/0xb0 [ 66.388814] [ 66.388824] [<ffffffff8161753a>] _raw_spin_lock+0x2a/0x40 [ 66.388835] [ 66.388845] [<ffffffff81401e41>] intel_engine_reset_breadcrumbs+0x21/0xb0 [ 66.388857] [ 66.388866] [<ffffffff81403ae7>] gen8_init_common_ring+0x67/0x100 [ 66.388878] [ 66.388887] [<ffffffff81403b92>] gen8_init_render_ring+0x12/0x60 [ 66.388903] [ 66.388912] [<ffffffff813f8707>] i915_gem_init_hw+0xf7/0x2a0 [ 66.388927] [ 66.388936] [<ffffffff813f899b>] i915_gem_init+0xbb/0xf0 [ 66.388950] [ 66.388959] [<ffffffff813b4980>] i915_driver_load+0x7e0/0x1330 [ 66.388978] [ 66.388988] [<ffffffff813c09d8>] i915_pci_probe+0x28/0x40 [ 66.389003] [ 66.389013] [<ffffffff812fa0db>] pci_device_probe+0x8b/0xf0 [ 66.389028] [ 66.389037] [<ffffffff8147737e>] driver_probe_device+0x21e/0x430 [ 66.389056] [ 66.389065] [<ffffffff8147766e>] __driver_attach+0xde/0xe0 [ 66.389080] [ 66.389090] [<ffffffff814751ad>] bus_for_each_dev+0x5d/0x90 [ 66.389105] [ 66.389113] [<ffffffff81477799>] driver_attach+0x19/0x20 [ 66.389134] [ 66.389144] [<ffffffff81475ced>] bus_add_driver+0x15d/0x260 [ 66.389159] [ 66.389168] [<ffffffff81477e3b>] driver_register+0x5b/0xd0 [ 66.389183] [ 66.389281] [<ffffffff812fa19b>] __pci_register_driver+0x5b/0x60 [ 66.389301] [ 66.389312] [<ffffffff81aed333>] i915_init+0x3e/0x45 [ 66.389326] [ 66.389336] [<ffffffff81ac2ffa>] do_one_initcall+0x8b/0x118 [ 66.389350] [ 66.389359] [<ffffffff81ac323a>] kernel_init_freeable+0x1b3/0x23b [ 66.389378] [ 66.389387] [<ffffffff8160fc39>] kernel_init+0x9/0x100 [ 66.389402] [ 66.389411] [<ffffffff816180e7>] ret_from_fork+0x27/0x40 [ 66.389426] irq event stamp: 315865 [ 66.389438] hardirqs last enabled at (315864): [<ffffffff816178f1>] _raw_spin_unlock_irqrestore+0x31/0x50 [ 66.389469] hardirqs last disabled at (315865): [<ffffffff816176b3>] _raw_spin_lock_irqsave+0x13/0x50 [ 66.389499] softirqs last enabled at (315818): [<ffffffff8107a04c>] _local_bh_enable+0x1c/0x50 [ 66.389530] softirqs last disabled at (315819): [<ffffffff8107a50e>] irq_exit+0xbe/0xd0 [ 66.389559] [ 66.389559] other info that might help us debug this: [ 66.389580] Possible unsafe locking scenario: [ 66.389580] [ 66.389598] CPU0 [ 66.389609] ---- [ 66.389620] lock(&(&b->lock)->rlock); [ 66.389650] <Interrupt> [ 66.389661] lock(&(&b->lock)->rlock); [ 66.389690] [ 66.389690] *** DEADLOCK *** [ 66.389690] [ 66.389715] 2 locks held by swapper/1/0: [ 66.389728] #0: (&(&tl->lock)->rlock){..-...}, at: [<ffffffff81403e01>] intel_lrc_irq_handler+0x201/0x3c0 [ 66.389785] #1: (&(&req->lock)->rlock/1){..-...}, at: [<ffffffff813fc0af>] __i915_gem_request_submit+0x8f/0x170 [ 66.389854] [ 66.389854] stack backtrace: [ 66.389959] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc2+ #56 [ 66.389976] Hardware name: / , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 [ 66.389999] ffff88027fd03c58 ffffffff812beae5 ffff88027696e680 ffffffff822afe20 [ 66.390036] ffff88027fd03ca8 ffffffff810bb420 0000000000000001 0000000000000000 [ 66.390070] 0000000000000000 0000000000000006 0000000000000004 ffff88027696ee10 [ 66.390104] Call Trace: [ 66.390117] <IRQ> [ 66.390128] [<ffffffff812beae5>] dump_stack+0x68/0x93 [ 66.390147] [<ffffffff810bb420>] print_usage_bug+0x1d0/0x1e0 [ 66.390164] [<ffffffff810bb8a0>] mark_lock+0x470/0x4f0 [ 66.390181] [<ffffffff810ba9d0>] ? print_shortest_lock_dependencies+0x1b0/0x1b0 [ 66.390203] [<ffffffff810bd75d>] __lock_acquire+0x59d/0x1870 [ 66.390221] [<ffffffff810bedbc>] lock_acquire+0x6c/0xb0 [ 66.390237] [<ffffffff810bedbc>] ? lock_acquire+0x6c/0xb0 [ 66.390255] [<ffffffff81401c88>] ? intel_engine_enable_signaling+0x78/0x150 [ 66.390273] [<ffffffff8161753a>] _raw_spin_lock+0x2a/0x40 [ 66.390291] [<ffffffff81401c88>] ? intel_engine_enable_signaling+0x78/0x150 [ 66.390309] [<ffffffff81401c88>] intel_engine_enable_signaling+0x78/0x150 [ 66.390327] [<ffffffff813fc170>] __i915_gem_request_submit+0x150/0x170 [ 66.390345] [<ffffffff81403e8b>] intel_lrc_irq_handler+0x28b/0x3c0 [ 66.390363] [<ffffffff81079d97>] tasklet_action+0x57/0xc0 [ 66.390380] [<ffffffff8107a249>] __do_softirq+0x119/0x240 [ 66.390396] [<ffffffff8107a50e>] irq_exit+0xbe/0xd0 [ 66.390414] [<ffffffff8101afd5>] do_IRQ+0x65/0x110 [ 66.390431] [<ffffffff81618806>] common_interrupt+0x86/0x86 [ 66.390446] <EOI> [ 66.390457] [<ffffffff814ec6d1>] ? cpuidle_enter_state+0x151/0x200 [ 66.390480] [<ffffffff814ec7a2>] cpuidle_enter+0x12/0x20 [ 66.390498] [<ffffffff810b639e>] call_cpuidle+0x1e/0x40 [ 66.390516] [<ffffffff810b65ae>] cpu_startup_entry+0x10e/0x1f0 [ 66.390534] [<ffffffff81036133>] start_secondary+0x103/0x130 (This is split out of the defer global seqno allocation patch due to realisation that we need a more complete conversion if we want to defer request submission even further.) v2: lockdep was warning about mixed SOFTIRQ contexts not HARDIRQ contexts so we only need to use spin_lock_bh and not disable interrupts. v3: We need full irq protection as we may be called from a third party interrupt handler (via fences). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-32-chris@chris-wilson.co.uk
2016-10-28 20:58:55 +08:00
spin_unlock(&engine->i915->irq_lock);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
}
void __intel_engine_disarm_breadcrumbs(struct intel_engine_cs *engine)
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
lockdep_assert_held(&b->irq_lock);
drm/i915: Wake up all waiters before idling When we idle, we wakeup the first waiter (checking to see if it missed an earlier wakeup) and disarm the breadcrumbs. However, we now assert that there are no waiter when the interrupt is disabled, triggering an assert if there were multiple waiters when we idled. [ 420.842275] invalid opcode: 0000 [#1] PREEMPT SMP [ 420.842285] Modules linked in: vgem snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep mei_me snd_hda_core mei snd_pcm lpc_ich i915 r8169 mii prime_numbers [ 420.842357] CPU: 4 PID: 8714 Comm: kms_pipe_crc_ba Tainted: G U W 4.10.0-CI-CI_DRM_2280+ #1 [ 420.842377] Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.11 10/24/2012 [ 420.842395] task: ffff880117ddce40 task.stack: ffffc90001114000 [ 420.842439] RIP: 0010:__intel_engine_remove_wait+0x1f4/0x200 [i915] [ 420.842454] RSP: 0018:ffffc90001117b18 EFLAGS: 00010046 [ 420.842467] RAX: 0000000000000000 RBX: ffff88010c25c2a8 RCX: 0000000000000001 [ 420.842481] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffffc90001117c50 [ 420.842495] RBP: ffffc90001117b58 R08: 0000000011e52352 R09: c4d16acc00000000 [ 420.842511] R10: ffffffff82789eb0 R11: ffff880117ddce40 R12: ffffc90001117c50 [ 420.842525] R13: ffffc90001117c50 R14: 0000000000000078 R15: 0000000000000000 [ 420.842540] FS: 00007fe47dda0a40(0000) GS:ffff88011fb00000(0000) knlGS:0000000000000000 [ 420.842559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 420.842571] CR2: 00007fd6c0a2cec4 CR3: 000000010a5e5000 CR4: 00000000001406e0 [ 420.842586] Call Trace: [ 420.842595] ? do_raw_spin_lock+0xad/0xb0 [ 420.842635] intel_engine_remove_wait.part.3+0x26/0x40 [i915] [ 420.842678] intel_engine_remove_wait+0xe/0x20 [i915] [ 420.842721] i915_wait_request+0x4f0/0x8c0 [i915] [ 420.842736] ? wake_up_q+0x70/0x70 [ 420.842747] ? wake_up_q+0x70/0x70 [ 420.842787] i915_gem_object_wait_fence+0x7d/0x1a0 [i915] [ 420.842829] i915_gem_object_wait+0x30d/0x520 [i915] [ 420.842842] ? __this_cpu_preempt_check+0x13/0x20 [ 420.842884] i915_gem_wait_ioctl+0x12e/0x2e0 [i915] [ 420.842924] ? i915_gem_wait_ioctl+0x22/0x2e0 [i915] [ 420.842939] drm_ioctl+0x200/0x450 [ 420.842976] ? i915_gem_set_wedged+0x90/0x90 [i915] [ 420.842993] do_vfs_ioctl+0x90/0x6e0 [ 420.843003] ? entry_SYSCALL_64_fastpath+0x5/0xb1 [ 420.843017] ? __this_cpu_preempt_check+0x13/0x20 [ 420.843030] ? trace_hardirqs_on_caller+0xe7/0x200 [ 420.843042] SyS_ioctl+0x3c/0x70 [ 420.843054] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 420.843065] RIP: 0033:0x7fe47c4b9357 [ 420.843075] RSP: 002b:00007ffc3c0633c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 420.843094] RAX: ffffffffffffffda RBX: ffffffff81482393 RCX: 00007fe47c4b9357 [ 420.843109] RDX: 00007ffc3c063400 RSI: 00000000c010646c RDI: 0000000000000004 [ 420.843123] RBP: ffffc90001117f88 R08: 0000000000000008 R09: 0000000000000000 [ 420.843137] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 420.843151] R13: 0000000000000004 R14: 00000000c010646c R15: 0000000000000000 [ 420.843168] ? __this_cpu_preempt_check+0x13/0x20 [ 420.843180] Code: 81 48 c7 c1 40 6a 16 a0 48 c7 c2 47 29 15 a0 be 17 01 00 00 48 c7 c7 10 6a 16 a0 e8 c7 ea fe e0 e9 5d ff ff ff 0f 0b 0f 0b 0f 0b <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 e8 67 41 7e e1 [ 420.843325] RIP: __intel_engine_remove_wait+0x1f4/0x200 [i915] RSP: ffffc90001117b18 Fixes: b66255f0f779 ("drm/i915: Refactor wakeup of the next breadcrumb waiter") Fixes: 67b807a89230 ("drm/i915: Delay disabling the user interrupt for breadcrumbs") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170306092916.11623-2-chris@chris-wilson.co.uk Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
2017-03-06 17:29:15 +08:00
GEM_BUG_ON(b->irq_wait);
if (b->irq_enabled) {
irq_disable(engine);
b->irq_enabled = false;
}
b->irq_armed = false;
}
void intel_engine_disarm_breadcrumbs(struct intel_engine_cs *engine)
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
struct intel_wait *wait, *n, *first;
if (!b->irq_armed)
return;
/* We only disarm the irq when we are idle (all requests completed),
drm/i915: Wake up all waiters before idling When we idle, we wakeup the first waiter (checking to see if it missed an earlier wakeup) and disarm the breadcrumbs. However, we now assert that there are no waiter when the interrupt is disabled, triggering an assert if there were multiple waiters when we idled. [ 420.842275] invalid opcode: 0000 [#1] PREEMPT SMP [ 420.842285] Modules linked in: vgem snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep mei_me snd_hda_core mei snd_pcm lpc_ich i915 r8169 mii prime_numbers [ 420.842357] CPU: 4 PID: 8714 Comm: kms_pipe_crc_ba Tainted: G U W 4.10.0-CI-CI_DRM_2280+ #1 [ 420.842377] Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.11 10/24/2012 [ 420.842395] task: ffff880117ddce40 task.stack: ffffc90001114000 [ 420.842439] RIP: 0010:__intel_engine_remove_wait+0x1f4/0x200 [i915] [ 420.842454] RSP: 0018:ffffc90001117b18 EFLAGS: 00010046 [ 420.842467] RAX: 0000000000000000 RBX: ffff88010c25c2a8 RCX: 0000000000000001 [ 420.842481] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffffc90001117c50 [ 420.842495] RBP: ffffc90001117b58 R08: 0000000011e52352 R09: c4d16acc00000000 [ 420.842511] R10: ffffffff82789eb0 R11: ffff880117ddce40 R12: ffffc90001117c50 [ 420.842525] R13: ffffc90001117c50 R14: 0000000000000078 R15: 0000000000000000 [ 420.842540] FS: 00007fe47dda0a40(0000) GS:ffff88011fb00000(0000) knlGS:0000000000000000 [ 420.842559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 420.842571] CR2: 00007fd6c0a2cec4 CR3: 000000010a5e5000 CR4: 00000000001406e0 [ 420.842586] Call Trace: [ 420.842595] ? do_raw_spin_lock+0xad/0xb0 [ 420.842635] intel_engine_remove_wait.part.3+0x26/0x40 [i915] [ 420.842678] intel_engine_remove_wait+0xe/0x20 [i915] [ 420.842721] i915_wait_request+0x4f0/0x8c0 [i915] [ 420.842736] ? wake_up_q+0x70/0x70 [ 420.842747] ? wake_up_q+0x70/0x70 [ 420.842787] i915_gem_object_wait_fence+0x7d/0x1a0 [i915] [ 420.842829] i915_gem_object_wait+0x30d/0x520 [i915] [ 420.842842] ? __this_cpu_preempt_check+0x13/0x20 [ 420.842884] i915_gem_wait_ioctl+0x12e/0x2e0 [i915] [ 420.842924] ? i915_gem_wait_ioctl+0x22/0x2e0 [i915] [ 420.842939] drm_ioctl+0x200/0x450 [ 420.842976] ? i915_gem_set_wedged+0x90/0x90 [i915] [ 420.842993] do_vfs_ioctl+0x90/0x6e0 [ 420.843003] ? entry_SYSCALL_64_fastpath+0x5/0xb1 [ 420.843017] ? __this_cpu_preempt_check+0x13/0x20 [ 420.843030] ? trace_hardirqs_on_caller+0xe7/0x200 [ 420.843042] SyS_ioctl+0x3c/0x70 [ 420.843054] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 420.843065] RIP: 0033:0x7fe47c4b9357 [ 420.843075] RSP: 002b:00007ffc3c0633c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 420.843094] RAX: ffffffffffffffda RBX: ffffffff81482393 RCX: 00007fe47c4b9357 [ 420.843109] RDX: 00007ffc3c063400 RSI: 00000000c010646c RDI: 0000000000000004 [ 420.843123] RBP: ffffc90001117f88 R08: 0000000000000008 R09: 0000000000000000 [ 420.843137] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 420.843151] R13: 0000000000000004 R14: 00000000c010646c R15: 0000000000000000 [ 420.843168] ? __this_cpu_preempt_check+0x13/0x20 [ 420.843180] Code: 81 48 c7 c1 40 6a 16 a0 48 c7 c2 47 29 15 a0 be 17 01 00 00 48 c7 c7 10 6a 16 a0 e8 c7 ea fe e0 e9 5d ff ff ff 0f 0b 0f 0b 0f 0b <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 e8 67 41 7e e1 [ 420.843325] RIP: __intel_engine_remove_wait+0x1f4/0x200 [i915] RSP: ffffc90001117b18 Fixes: b66255f0f779 ("drm/i915: Refactor wakeup of the next breadcrumb waiter") Fixes: 67b807a89230 ("drm/i915: Delay disabling the user interrupt for breadcrumbs") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170306092916.11623-2-chris@chris-wilson.co.uk Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
2017-03-06 17:29:15 +08:00
* so if the bottom-half remains asleep, it missed the request
* completion.
*/
drm/i915: Wake up all waiters before idling When we idle, we wakeup the first waiter (checking to see if it missed an earlier wakeup) and disarm the breadcrumbs. However, we now assert that there are no waiter when the interrupt is disabled, triggering an assert if there were multiple waiters when we idled. [ 420.842275] invalid opcode: 0000 [#1] PREEMPT SMP [ 420.842285] Modules linked in: vgem snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep mei_me snd_hda_core mei snd_pcm lpc_ich i915 r8169 mii prime_numbers [ 420.842357] CPU: 4 PID: 8714 Comm: kms_pipe_crc_ba Tainted: G U W 4.10.0-CI-CI_DRM_2280+ #1 [ 420.842377] Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.11 10/24/2012 [ 420.842395] task: ffff880117ddce40 task.stack: ffffc90001114000 [ 420.842439] RIP: 0010:__intel_engine_remove_wait+0x1f4/0x200 [i915] [ 420.842454] RSP: 0018:ffffc90001117b18 EFLAGS: 00010046 [ 420.842467] RAX: 0000000000000000 RBX: ffff88010c25c2a8 RCX: 0000000000000001 [ 420.842481] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffffc90001117c50 [ 420.842495] RBP: ffffc90001117b58 R08: 0000000011e52352 R09: c4d16acc00000000 [ 420.842511] R10: ffffffff82789eb0 R11: ffff880117ddce40 R12: ffffc90001117c50 [ 420.842525] R13: ffffc90001117c50 R14: 0000000000000078 R15: 0000000000000000 [ 420.842540] FS: 00007fe47dda0a40(0000) GS:ffff88011fb00000(0000) knlGS:0000000000000000 [ 420.842559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 420.842571] CR2: 00007fd6c0a2cec4 CR3: 000000010a5e5000 CR4: 00000000001406e0 [ 420.842586] Call Trace: [ 420.842595] ? do_raw_spin_lock+0xad/0xb0 [ 420.842635] intel_engine_remove_wait.part.3+0x26/0x40 [i915] [ 420.842678] intel_engine_remove_wait+0xe/0x20 [i915] [ 420.842721] i915_wait_request+0x4f0/0x8c0 [i915] [ 420.842736] ? wake_up_q+0x70/0x70 [ 420.842747] ? wake_up_q+0x70/0x70 [ 420.842787] i915_gem_object_wait_fence+0x7d/0x1a0 [i915] [ 420.842829] i915_gem_object_wait+0x30d/0x520 [i915] [ 420.842842] ? __this_cpu_preempt_check+0x13/0x20 [ 420.842884] i915_gem_wait_ioctl+0x12e/0x2e0 [i915] [ 420.842924] ? i915_gem_wait_ioctl+0x22/0x2e0 [i915] [ 420.842939] drm_ioctl+0x200/0x450 [ 420.842976] ? i915_gem_set_wedged+0x90/0x90 [i915] [ 420.842993] do_vfs_ioctl+0x90/0x6e0 [ 420.843003] ? entry_SYSCALL_64_fastpath+0x5/0xb1 [ 420.843017] ? __this_cpu_preempt_check+0x13/0x20 [ 420.843030] ? trace_hardirqs_on_caller+0xe7/0x200 [ 420.843042] SyS_ioctl+0x3c/0x70 [ 420.843054] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 420.843065] RIP: 0033:0x7fe47c4b9357 [ 420.843075] RSP: 002b:00007ffc3c0633c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 420.843094] RAX: ffffffffffffffda RBX: ffffffff81482393 RCX: 00007fe47c4b9357 [ 420.843109] RDX: 00007ffc3c063400 RSI: 00000000c010646c RDI: 0000000000000004 [ 420.843123] RBP: ffffc90001117f88 R08: 0000000000000008 R09: 0000000000000000 [ 420.843137] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 420.843151] R13: 0000000000000004 R14: 00000000c010646c R15: 0000000000000000 [ 420.843168] ? __this_cpu_preempt_check+0x13/0x20 [ 420.843180] Code: 81 48 c7 c1 40 6a 16 a0 48 c7 c2 47 29 15 a0 be 17 01 00 00 48 c7 c7 10 6a 16 a0 e8 c7 ea fe e0 e9 5d ff ff ff 0f 0b 0f 0b 0f 0b <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 e8 67 41 7e e1 [ 420.843325] RIP: __intel_engine_remove_wait+0x1f4/0x200 [i915] RSP: ffffc90001117b18 Fixes: b66255f0f779 ("drm/i915: Refactor wakeup of the next breadcrumb waiter") Fixes: 67b807a89230 ("drm/i915: Delay disabling the user interrupt for breadcrumbs") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170306092916.11623-2-chris@chris-wilson.co.uk Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
2017-03-06 17:29:15 +08:00
spin_lock_irq(&b->rb_lock);
spin_lock(&b->irq_lock);
first = fetch_and_zero(&b->irq_wait);
__intel_engine_disarm_breadcrumbs(engine);
spin_unlock(&b->irq_lock);
drm/i915: Wake up all waiters before idling When we idle, we wakeup the first waiter (checking to see if it missed an earlier wakeup) and disarm the breadcrumbs. However, we now assert that there are no waiter when the interrupt is disabled, triggering an assert if there were multiple waiters when we idled. [ 420.842275] invalid opcode: 0000 [#1] PREEMPT SMP [ 420.842285] Modules linked in: vgem snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep mei_me snd_hda_core mei snd_pcm lpc_ich i915 r8169 mii prime_numbers [ 420.842357] CPU: 4 PID: 8714 Comm: kms_pipe_crc_ba Tainted: G U W 4.10.0-CI-CI_DRM_2280+ #1 [ 420.842377] Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.11 10/24/2012 [ 420.842395] task: ffff880117ddce40 task.stack: ffffc90001114000 [ 420.842439] RIP: 0010:__intel_engine_remove_wait+0x1f4/0x200 [i915] [ 420.842454] RSP: 0018:ffffc90001117b18 EFLAGS: 00010046 [ 420.842467] RAX: 0000000000000000 RBX: ffff88010c25c2a8 RCX: 0000000000000001 [ 420.842481] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffffc90001117c50 [ 420.842495] RBP: ffffc90001117b58 R08: 0000000011e52352 R09: c4d16acc00000000 [ 420.842511] R10: ffffffff82789eb0 R11: ffff880117ddce40 R12: ffffc90001117c50 [ 420.842525] R13: ffffc90001117c50 R14: 0000000000000078 R15: 0000000000000000 [ 420.842540] FS: 00007fe47dda0a40(0000) GS:ffff88011fb00000(0000) knlGS:0000000000000000 [ 420.842559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 420.842571] CR2: 00007fd6c0a2cec4 CR3: 000000010a5e5000 CR4: 00000000001406e0 [ 420.842586] Call Trace: [ 420.842595] ? do_raw_spin_lock+0xad/0xb0 [ 420.842635] intel_engine_remove_wait.part.3+0x26/0x40 [i915] [ 420.842678] intel_engine_remove_wait+0xe/0x20 [i915] [ 420.842721] i915_wait_request+0x4f0/0x8c0 [i915] [ 420.842736] ? wake_up_q+0x70/0x70 [ 420.842747] ? wake_up_q+0x70/0x70 [ 420.842787] i915_gem_object_wait_fence+0x7d/0x1a0 [i915] [ 420.842829] i915_gem_object_wait+0x30d/0x520 [i915] [ 420.842842] ? __this_cpu_preempt_check+0x13/0x20 [ 420.842884] i915_gem_wait_ioctl+0x12e/0x2e0 [i915] [ 420.842924] ? i915_gem_wait_ioctl+0x22/0x2e0 [i915] [ 420.842939] drm_ioctl+0x200/0x450 [ 420.842976] ? i915_gem_set_wedged+0x90/0x90 [i915] [ 420.842993] do_vfs_ioctl+0x90/0x6e0 [ 420.843003] ? entry_SYSCALL_64_fastpath+0x5/0xb1 [ 420.843017] ? __this_cpu_preempt_check+0x13/0x20 [ 420.843030] ? trace_hardirqs_on_caller+0xe7/0x200 [ 420.843042] SyS_ioctl+0x3c/0x70 [ 420.843054] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 420.843065] RIP: 0033:0x7fe47c4b9357 [ 420.843075] RSP: 002b:00007ffc3c0633c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 420.843094] RAX: ffffffffffffffda RBX: ffffffff81482393 RCX: 00007fe47c4b9357 [ 420.843109] RDX: 00007ffc3c063400 RSI: 00000000c010646c RDI: 0000000000000004 [ 420.843123] RBP: ffffc90001117f88 R08: 0000000000000008 R09: 0000000000000000 [ 420.843137] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 420.843151] R13: 0000000000000004 R14: 00000000c010646c R15: 0000000000000000 [ 420.843168] ? __this_cpu_preempt_check+0x13/0x20 [ 420.843180] Code: 81 48 c7 c1 40 6a 16 a0 48 c7 c2 47 29 15 a0 be 17 01 00 00 48 c7 c7 10 6a 16 a0 e8 c7 ea fe e0 e9 5d ff ff ff 0f 0b 0f 0b 0f 0b <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 e8 67 41 7e e1 [ 420.843325] RIP: __intel_engine_remove_wait+0x1f4/0x200 [i915] RSP: ffffc90001117b18 Fixes: b66255f0f779 ("drm/i915: Refactor wakeup of the next breadcrumb waiter") Fixes: 67b807a89230 ("drm/i915: Delay disabling the user interrupt for breadcrumbs") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170306092916.11623-2-chris@chris-wilson.co.uk Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
2017-03-06 17:29:15 +08:00
rbtree_postorder_for_each_entry_safe(wait, n, &b->waiters, node) {
RB_CLEAR_NODE(&wait->node);
if (wake_up_process(wait->tsk) && wait == first)
drm/i915: Wake up all waiters before idling When we idle, we wakeup the first waiter (checking to see if it missed an earlier wakeup) and disarm the breadcrumbs. However, we now assert that there are no waiter when the interrupt is disabled, triggering an assert if there were multiple waiters when we idled. [ 420.842275] invalid opcode: 0000 [#1] PREEMPT SMP [ 420.842285] Modules linked in: vgem snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep mei_me snd_hda_core mei snd_pcm lpc_ich i915 r8169 mii prime_numbers [ 420.842357] CPU: 4 PID: 8714 Comm: kms_pipe_crc_ba Tainted: G U W 4.10.0-CI-CI_DRM_2280+ #1 [ 420.842377] Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.11 10/24/2012 [ 420.842395] task: ffff880117ddce40 task.stack: ffffc90001114000 [ 420.842439] RIP: 0010:__intel_engine_remove_wait+0x1f4/0x200 [i915] [ 420.842454] RSP: 0018:ffffc90001117b18 EFLAGS: 00010046 [ 420.842467] RAX: 0000000000000000 RBX: ffff88010c25c2a8 RCX: 0000000000000001 [ 420.842481] RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffffc90001117c50 [ 420.842495] RBP: ffffc90001117b58 R08: 0000000011e52352 R09: c4d16acc00000000 [ 420.842511] R10: ffffffff82789eb0 R11: ffff880117ddce40 R12: ffffc90001117c50 [ 420.842525] R13: ffffc90001117c50 R14: 0000000000000078 R15: 0000000000000000 [ 420.842540] FS: 00007fe47dda0a40(0000) GS:ffff88011fb00000(0000) knlGS:0000000000000000 [ 420.842559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 420.842571] CR2: 00007fd6c0a2cec4 CR3: 000000010a5e5000 CR4: 00000000001406e0 [ 420.842586] Call Trace: [ 420.842595] ? do_raw_spin_lock+0xad/0xb0 [ 420.842635] intel_engine_remove_wait.part.3+0x26/0x40 [i915] [ 420.842678] intel_engine_remove_wait+0xe/0x20 [i915] [ 420.842721] i915_wait_request+0x4f0/0x8c0 [i915] [ 420.842736] ? wake_up_q+0x70/0x70 [ 420.842747] ? wake_up_q+0x70/0x70 [ 420.842787] i915_gem_object_wait_fence+0x7d/0x1a0 [i915] [ 420.842829] i915_gem_object_wait+0x30d/0x520 [i915] [ 420.842842] ? __this_cpu_preempt_check+0x13/0x20 [ 420.842884] i915_gem_wait_ioctl+0x12e/0x2e0 [i915] [ 420.842924] ? i915_gem_wait_ioctl+0x22/0x2e0 [i915] [ 420.842939] drm_ioctl+0x200/0x450 [ 420.842976] ? i915_gem_set_wedged+0x90/0x90 [i915] [ 420.842993] do_vfs_ioctl+0x90/0x6e0 [ 420.843003] ? entry_SYSCALL_64_fastpath+0x5/0xb1 [ 420.843017] ? __this_cpu_preempt_check+0x13/0x20 [ 420.843030] ? trace_hardirqs_on_caller+0xe7/0x200 [ 420.843042] SyS_ioctl+0x3c/0x70 [ 420.843054] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 420.843065] RIP: 0033:0x7fe47c4b9357 [ 420.843075] RSP: 002b:00007ffc3c0633c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 420.843094] RAX: ffffffffffffffda RBX: ffffffff81482393 RCX: 00007fe47c4b9357 [ 420.843109] RDX: 00007ffc3c063400 RSI: 00000000c010646c RDI: 0000000000000004 [ 420.843123] RBP: ffffc90001117f88 R08: 0000000000000008 R09: 0000000000000000 [ 420.843137] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 420.843151] R13: 0000000000000004 R14: 00000000c010646c R15: 0000000000000000 [ 420.843168] ? __this_cpu_preempt_check+0x13/0x20 [ 420.843180] Code: 81 48 c7 c1 40 6a 16 a0 48 c7 c2 47 29 15 a0 be 17 01 00 00 48 c7 c7 10 6a 16 a0 e8 c7 ea fe e0 e9 5d ff ff ff 0f 0b 0f 0b 0f 0b <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 e8 67 41 7e e1 [ 420.843325] RIP: __intel_engine_remove_wait+0x1f4/0x200 [i915] RSP: ffffc90001117b18 Fixes: b66255f0f779 ("drm/i915: Refactor wakeup of the next breadcrumb waiter") Fixes: 67b807a89230 ("drm/i915: Delay disabling the user interrupt for breadcrumbs") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170306092916.11623-2-chris@chris-wilson.co.uk Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
2017-03-06 17:29:15 +08:00
missed_breadcrumb(engine);
}
b->waiters = RB_ROOT;
spin_unlock_irq(&b->rb_lock);
}
static bool use_fake_irq(const struct intel_breadcrumbs *b)
{
const struct intel_engine_cs *engine =
container_of(b, struct intel_engine_cs, breadcrumbs);
if (!test_bit(engine->id, &engine->i915->gpu_error.missed_irq_rings))
return false;
/* Only start with the heavy weight fake irq timer if we have not
* seen any interrupts since enabling it the first time. If the
* interrupts are still arriving, it means we made a mistake in our
* engine->seqno_barrier(), a timing error that should be transient
* and unlikely to reoccur.
*/
return atomic_read(&engine->irq_count) == b->hangcheck_interrupts;
}
static void enable_fake_irq(struct intel_breadcrumbs *b)
{
/* Ensure we never sleep indefinitely */
if (!b->irq_enabled || use_fake_irq(b))
mod_timer(&b->fake_irq, jiffies + 1);
else
mod_timer(&b->hangcheck, wait_timeout());
}
static void __intel_breadcrumbs_enable_irq(struct intel_breadcrumbs *b)
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
{
struct intel_engine_cs *engine =
container_of(b, struct intel_engine_cs, breadcrumbs);
struct drm_i915_private *i915 = engine->i915;
lockdep_assert_held(&b->irq_lock);
if (b->irq_armed)
return;
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
/* The breadcrumb irq will be disarmed on the interrupt after the
* waiters are signaled. This gives us a single interrupt window in
* which we can add a new waiter and avoid the cost of re-enabling
* the irq.
*/
b->irq_armed = true;
GEM_BUG_ON(b->irq_enabled);
if (I915_SELFTEST_ONLY(b->mock)) {
/* For our mock objects we want to avoid interaction
* with the real hardware (which is not set up). So
* we simply pretend we have enabled the powerwell
* and the irq, and leave it up to the mock
* implementation to call intel_engine_wakeup()
* itself when it wants to simulate a user interrupt,
*/
return;
}
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
/* Since we are waiting on a request, the GPU should be busy
* and should have its own rpm reference. This is tracked
* by i915->gt.awake, we can forgo holding our own wakref
* for the interrupt as before i915->gt.awake is released (when
* the driver is idle) we disarm the breadcrumbs.
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
*/
/* No interrupts? Kick the waiter every jiffie! */
if (intel_irqs_enabled(i915)) {
if (!test_bit(engine->id, &i915->gpu_error.test_irq_rings))
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
irq_enable(engine);
b->irq_enabled = true;
}
enable_fake_irq(b);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
}
static inline struct intel_wait *to_wait(struct rb_node *node)
{
return rb_entry(node, struct intel_wait, node);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
}
static inline void __intel_breadcrumbs_finish(struct intel_breadcrumbs *b,
struct intel_wait *wait)
{
lockdep_assert_held(&b->rb_lock);
GEM_BUG_ON(b->irq_wait == wait);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
/* This request is completed, so remove it from the tree, mark it as
* complete, and *then* wake up the associated task. N.B. when the
* task wakes up, it will find the empty rb_node, discern that it
* has already been removed from the tree and skip the serialisation
* of the b->rb_lock and b->irq_lock. This means that the destruction
* of the intel_wait is not serialised with the interrupt handler
* by the waiter - it must instead be serialised by the caller.
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
*/
rb_erase(&wait->node, &b->waiters);
RB_CLEAR_NODE(&wait->node);
wake_up_process(wait->tsk); /* implicit smp_wmb() */
}
static inline void __intel_breadcrumbs_next(struct intel_engine_cs *engine,
struct rb_node *next)
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
spin_lock(&b->irq_lock);
GEM_BUG_ON(!b->irq_armed);
GEM_BUG_ON(!b->irq_wait);
b->irq_wait = to_wait(next);
spin_unlock(&b->irq_lock);
/* We always wake up the next waiter that takes over as the bottom-half
* as we may delegate not only the irq-seqno barrier to the next waiter
* but also the task of waking up concurrent waiters.
*/
if (next)
wake_up_process(to_wait(next)->tsk);
}
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
static bool __intel_engine_add_wait(struct intel_engine_cs *engine,
struct intel_wait *wait)
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
struct rb_node **p, *parent, *completed;
bool first;
u32 seqno;
/* Insert the request into the retirement ordered list
* of waiters by walking the rbtree. If we are the oldest
* seqno in the tree (the first to be retired), then
* set ourselves as the bottom-half.
*
* As we descend the tree, prune completed branches since we hold the
* spinlock we know that the first_waiter must be delayed and can
* reduce some of the sequential wake up latency if we take action
* ourselves and wake up the completed tasks in parallel. Also, by
* removing stale elements in the tree, we may be able to reduce the
* ping-pong between the old bottom-half and ourselves as first-waiter.
*/
first = true;
parent = NULL;
completed = NULL;
seqno = intel_engine_get_seqno(engine);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
/* If the request completed before we managed to grab the spinlock,
* return now before adding ourselves to the rbtree. We let the
* current bottom-half handle any pending wakeups and instead
* try and get out of the way quickly.
*/
if (i915_seqno_passed(seqno, wait->seqno)) {
RB_CLEAR_NODE(&wait->node);
return first;
}
p = &b->waiters.rb_node;
while (*p) {
parent = *p;
if (wait->seqno == to_wait(parent)->seqno) {
/* We have multiple waiters on the same seqno, select
* the highest priority task (that with the smallest
* task->prio) to serve as the bottom-half for this
* group.
*/
if (wait->tsk->prio > to_wait(parent)->tsk->prio) {
p = &parent->rb_right;
first = false;
} else {
p = &parent->rb_left;
}
} else if (i915_seqno_passed(wait->seqno,
to_wait(parent)->seqno)) {
p = &parent->rb_right;
if (i915_seqno_passed(seqno, to_wait(parent)->seqno))
completed = parent;
else
first = false;
} else {
p = &parent->rb_left;
}
}
rb_link_node(&wait->node, parent, p);
rb_insert_color(&wait->node, &b->waiters);
if (first) {
spin_lock(&b->irq_lock);
b->irq_wait = wait;
/* After assigning ourselves as the new bottom-half, we must
* perform a cursory check to prevent a missed interrupt.
* Either we miss the interrupt whilst programming the hardware,
* or if there was a previous waiter (for a later seqno) they
* may be woken instead of us (due to the inherent race
* in the unlocked read of b->irq_seqno_bh in the irq handler)
* and so we miss the wake up.
*/
__intel_breadcrumbs_enable_irq(b);
spin_unlock(&b->irq_lock);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
}
if (completed) {
/* Advance the bottom-half (b->irq_wait) before we wake up
* the waiters who may scribble over their intel_wait
* just as the interrupt handler is dereferencing it via
* b->irq_wait.
*/
if (!first) {
struct rb_node *next = rb_next(completed);
GEM_BUG_ON(next == &wait->node);
__intel_breadcrumbs_next(engine, next);
}
do {
struct intel_wait *crumb = to_wait(completed);
completed = rb_prev(completed);
__intel_breadcrumbs_finish(b, crumb);
} while (completed);
}
GEM_BUG_ON(!b->irq_wait);
GEM_BUG_ON(!b->irq_armed);
GEM_BUG_ON(rb_first(&b->waiters) != &b->irq_wait->node);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
return first;
}
bool intel_engine_add_wait(struct intel_engine_cs *engine,
struct intel_wait *wait)
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
bool first;
spin_lock_irq(&b->rb_lock);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
first = __intel_engine_add_wait(engine, wait);
spin_unlock_irq(&b->rb_lock);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
return first;
}
static inline bool chain_wakeup(struct rb_node *rb, int priority)
{
return rb && to_wait(rb)->tsk->prio <= priority;
}
static inline int wakeup_priority(struct intel_breadcrumbs *b,
struct task_struct *tsk)
{
if (tsk == b->signaler)
return INT_MIN;
else
return tsk->prio;
}
static void __intel_engine_remove_wait(struct intel_engine_cs *engine,
struct intel_wait *wait)
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
lockdep_assert_held(&b->rb_lock);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
if (RB_EMPTY_NODE(&wait->node))
goto out;
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
if (b->irq_wait == wait) {
const int priority = wakeup_priority(b, wait->tsk);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
struct rb_node *next;
/* We are the current bottom-half. Find the next candidate,
* the first waiter in the queue on the remaining oldest
* request. As multiple seqnos may complete in the time it
* takes us to wake up and find the next waiter, we have to
* wake up that waiter for it to perform its own coherent
* completion check.
*/
next = rb_next(&wait->node);
if (chain_wakeup(next, priority)) {
/* If the next waiter is already complete,
* wake it up and continue onto the next waiter. So
* if have a small herd, they will wake up in parallel
* rather than sequentially, which should reduce
* the overall latency in waking all the completed
* clients.
*
* However, waking up a chain adds extra latency to
* the first_waiter. This is undesirable if that
* waiter is a high priority task.
*/
u32 seqno = intel_engine_get_seqno(engine);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
while (i915_seqno_passed(seqno, to_wait(next)->seqno)) {
struct rb_node *n = rb_next(next);
__intel_breadcrumbs_finish(b, to_wait(next));
next = n;
if (!chain_wakeup(next, priority))
break;
}
}
__intel_breadcrumbs_next(engine, next);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
} else {
GEM_BUG_ON(rb_first(&b->waiters) == &wait->node);
}
GEM_BUG_ON(RB_EMPTY_NODE(&wait->node));
rb_erase(&wait->node, &b->waiters);
out:
GEM_BUG_ON(b->irq_wait == wait);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
GEM_BUG_ON(rb_first(&b->waiters) !=
(b->irq_wait ? &b->irq_wait->node : NULL));
}
void intel_engine_remove_wait(struct intel_engine_cs *engine,
struct intel_wait *wait)
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
/* Quick check to see if this waiter was already decoupled from
* the tree by the bottom-half to avoid contention on the spinlock
* by the herd.
*/
if (RB_EMPTY_NODE(&wait->node)) {
GEM_BUG_ON(READ_ONCE(b->irq_wait) == wait);
return;
}
spin_lock_irq(&b->rb_lock);
__intel_engine_remove_wait(engine, wait);
spin_unlock_irq(&b->rb_lock);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
}
static bool signal_valid(const struct drm_i915_gem_request *request)
{
return intel_wait_check_request(&request->signaling.wait, request);
}
static bool signal_complete(const struct drm_i915_gem_request *request)
{
if (!request)
return false;
/* If another process served as the bottom-half it may have already
* signalled that this wait is already completed.
*/
if (intel_wait_complete(&request->signaling.wait))
return signal_valid(request);
/* Carefully check if the request is complete, giving time for the
* seqno to be visible or if the GPU hung.
*/
if (__i915_request_irq_complete(request))
return true;
return false;
}
static struct drm_i915_gem_request *to_signaler(struct rb_node *rb)
{
return rb_entry(rb, struct drm_i915_gem_request, signaling.node);
}
static void signaler_set_rtpriority(void)
{
struct sched_param param = { .sched_priority = 1 };
sched_setscheduler_nocheck(current, SCHED_FIFO, &param);
}
static int intel_breadcrumbs_signaler(void *arg)
{
struct intel_engine_cs *engine = arg;
struct intel_breadcrumbs *b = &engine->breadcrumbs;
struct drm_i915_gem_request *request;
/* Install ourselves with high priority to reduce signalling latency */
signaler_set_rtpriority();
do {
bool do_schedule = true;
set_current_state(TASK_INTERRUPTIBLE);
/* We are either woken up by the interrupt bottom-half,
* or by a client adding a new signaller. In both cases,
* the GPU seqno may have advanced beyond our oldest signal.
* If it has, propagate the signal, remove the waiter and
* check again with the next oldest signal. Otherwise we
* need to wait for a new interrupt from the GPU or for
* a new client.
*/
rcu_read_lock();
request = rcu_dereference(b->first_signal);
if (request)
request = i915_gem_request_get_rcu(request);
rcu_read_unlock();
if (signal_complete(request)) {
local_bh_disable();
dma_fence_signal(&request->fence);
local_bh_enable(); /* kick start the tasklets */
spin_lock_irq(&b->rb_lock);
/* Wake up all other completed waiters and select the
* next bottom-half for the next user interrupt.
*/
__intel_engine_remove_wait(engine,
&request->signaling.wait);
/* Find the next oldest signal. Note that as we have
* not been holding the lock, another client may
* have installed an even older signal than the one
* we just completed - so double check we are still
* the oldest before picking the next one.
*/
if (request == rcu_access_pointer(b->first_signal)) {
struct rb_node *rb =
rb_next(&request->signaling.node);
rcu_assign_pointer(b->first_signal,
rb ? to_signaler(rb) : NULL);
}
rb_erase(&request->signaling.node, &b->signals);
RB_CLEAR_NODE(&request->signaling.node);
spin_unlock_irq(&b->rb_lock);
i915_gem_request_put(request);
/* If the engine is saturated we may be continually
* processing completed requests. This angers the
* NMI watchdog if we never let anything else
* have access to the CPU. Let's pretend to be nice
* and relinquish the CPU if we burn through the
* entire RT timeslice!
*/
do_schedule = need_resched();
}
if (unlikely(do_schedule)) {
DEFINE_WAIT(exec);
if (kthread_should_park())
kthread_parkme();
if (kthread_should_stop()) {
GEM_BUG_ON(request);
break;
}
if (request)
add_wait_queue(&request->execute, &exec);
schedule();
if (request)
remove_wait_queue(&request->execute, &exec);
}
i915_gem_request_put(request);
} while (1);
__set_current_state(TASK_RUNNING);
return 0;
}
void intel_engine_enable_signaling(struct drm_i915_gem_request *request)
{
struct intel_engine_cs *engine = request->engine;
struct intel_breadcrumbs *b = &engine->breadcrumbs;
struct rb_node *parent, **p;
bool first, wakeup;
u32 seqno;
drm/i915: Convert breadcrumbs spinlock to be irqsafe The breadcrumbs are about to be used from within IRQ context sections (e.g. nouveau signals a fence from an interrupt handler causing us to submit a new request) and/or from bottom-half tasklets (i.e. intel_lrc_irq_handler), therefore we need to employ the irqsafe spinlock variants. For example, deferring the request submission to the intel_lrc_irq_handler generates this trace: [ 66.388639] ================================= [ 66.388650] [ INFO: inconsistent lock state ] [ 66.388663] 4.9.0-rc2+ #56 Not tainted [ 66.388672] --------------------------------- [ 66.388682] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 66.388695] swapper/1/0 [HC0[0]:SC1[1]:HE0:SE0] takes: [ 66.388706] (&(&b->lock)->rlock){+.?...} , at: [<ffffffff81401c88>] intel_engine_enable_signaling+0x78/0x150 [ 66.388761] {SOFTIRQ-ON-W} state was registered at: [ 66.388772] [ 66.388783] [<ffffffff810bd842>] __lock_acquire+0x682/0x1870 [ 66.388795] [ 66.388803] [<ffffffff810bedbc>] lock_acquire+0x6c/0xb0 [ 66.388814] [ 66.388824] [<ffffffff8161753a>] _raw_spin_lock+0x2a/0x40 [ 66.388835] [ 66.388845] [<ffffffff81401e41>] intel_engine_reset_breadcrumbs+0x21/0xb0 [ 66.388857] [ 66.388866] [<ffffffff81403ae7>] gen8_init_common_ring+0x67/0x100 [ 66.388878] [ 66.388887] [<ffffffff81403b92>] gen8_init_render_ring+0x12/0x60 [ 66.388903] [ 66.388912] [<ffffffff813f8707>] i915_gem_init_hw+0xf7/0x2a0 [ 66.388927] [ 66.388936] [<ffffffff813f899b>] i915_gem_init+0xbb/0xf0 [ 66.388950] [ 66.388959] [<ffffffff813b4980>] i915_driver_load+0x7e0/0x1330 [ 66.388978] [ 66.388988] [<ffffffff813c09d8>] i915_pci_probe+0x28/0x40 [ 66.389003] [ 66.389013] [<ffffffff812fa0db>] pci_device_probe+0x8b/0xf0 [ 66.389028] [ 66.389037] [<ffffffff8147737e>] driver_probe_device+0x21e/0x430 [ 66.389056] [ 66.389065] [<ffffffff8147766e>] __driver_attach+0xde/0xe0 [ 66.389080] [ 66.389090] [<ffffffff814751ad>] bus_for_each_dev+0x5d/0x90 [ 66.389105] [ 66.389113] [<ffffffff81477799>] driver_attach+0x19/0x20 [ 66.389134] [ 66.389144] [<ffffffff81475ced>] bus_add_driver+0x15d/0x260 [ 66.389159] [ 66.389168] [<ffffffff81477e3b>] driver_register+0x5b/0xd0 [ 66.389183] [ 66.389281] [<ffffffff812fa19b>] __pci_register_driver+0x5b/0x60 [ 66.389301] [ 66.389312] [<ffffffff81aed333>] i915_init+0x3e/0x45 [ 66.389326] [ 66.389336] [<ffffffff81ac2ffa>] do_one_initcall+0x8b/0x118 [ 66.389350] [ 66.389359] [<ffffffff81ac323a>] kernel_init_freeable+0x1b3/0x23b [ 66.389378] [ 66.389387] [<ffffffff8160fc39>] kernel_init+0x9/0x100 [ 66.389402] [ 66.389411] [<ffffffff816180e7>] ret_from_fork+0x27/0x40 [ 66.389426] irq event stamp: 315865 [ 66.389438] hardirqs last enabled at (315864): [<ffffffff816178f1>] _raw_spin_unlock_irqrestore+0x31/0x50 [ 66.389469] hardirqs last disabled at (315865): [<ffffffff816176b3>] _raw_spin_lock_irqsave+0x13/0x50 [ 66.389499] softirqs last enabled at (315818): [<ffffffff8107a04c>] _local_bh_enable+0x1c/0x50 [ 66.389530] softirqs last disabled at (315819): [<ffffffff8107a50e>] irq_exit+0xbe/0xd0 [ 66.389559] [ 66.389559] other info that might help us debug this: [ 66.389580] Possible unsafe locking scenario: [ 66.389580] [ 66.389598] CPU0 [ 66.389609] ---- [ 66.389620] lock(&(&b->lock)->rlock); [ 66.389650] <Interrupt> [ 66.389661] lock(&(&b->lock)->rlock); [ 66.389690] [ 66.389690] *** DEADLOCK *** [ 66.389690] [ 66.389715] 2 locks held by swapper/1/0: [ 66.389728] #0: (&(&tl->lock)->rlock){..-...}, at: [<ffffffff81403e01>] intel_lrc_irq_handler+0x201/0x3c0 [ 66.389785] #1: (&(&req->lock)->rlock/1){..-...}, at: [<ffffffff813fc0af>] __i915_gem_request_submit+0x8f/0x170 [ 66.389854] [ 66.389854] stack backtrace: [ 66.389959] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc2+ #56 [ 66.389976] Hardware name: / , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 [ 66.389999] ffff88027fd03c58 ffffffff812beae5 ffff88027696e680 ffffffff822afe20 [ 66.390036] ffff88027fd03ca8 ffffffff810bb420 0000000000000001 0000000000000000 [ 66.390070] 0000000000000000 0000000000000006 0000000000000004 ffff88027696ee10 [ 66.390104] Call Trace: [ 66.390117] <IRQ> [ 66.390128] [<ffffffff812beae5>] dump_stack+0x68/0x93 [ 66.390147] [<ffffffff810bb420>] print_usage_bug+0x1d0/0x1e0 [ 66.390164] [<ffffffff810bb8a0>] mark_lock+0x470/0x4f0 [ 66.390181] [<ffffffff810ba9d0>] ? print_shortest_lock_dependencies+0x1b0/0x1b0 [ 66.390203] [<ffffffff810bd75d>] __lock_acquire+0x59d/0x1870 [ 66.390221] [<ffffffff810bedbc>] lock_acquire+0x6c/0xb0 [ 66.390237] [<ffffffff810bedbc>] ? lock_acquire+0x6c/0xb0 [ 66.390255] [<ffffffff81401c88>] ? intel_engine_enable_signaling+0x78/0x150 [ 66.390273] [<ffffffff8161753a>] _raw_spin_lock+0x2a/0x40 [ 66.390291] [<ffffffff81401c88>] ? intel_engine_enable_signaling+0x78/0x150 [ 66.390309] [<ffffffff81401c88>] intel_engine_enable_signaling+0x78/0x150 [ 66.390327] [<ffffffff813fc170>] __i915_gem_request_submit+0x150/0x170 [ 66.390345] [<ffffffff81403e8b>] intel_lrc_irq_handler+0x28b/0x3c0 [ 66.390363] [<ffffffff81079d97>] tasklet_action+0x57/0xc0 [ 66.390380] [<ffffffff8107a249>] __do_softirq+0x119/0x240 [ 66.390396] [<ffffffff8107a50e>] irq_exit+0xbe/0xd0 [ 66.390414] [<ffffffff8101afd5>] do_IRQ+0x65/0x110 [ 66.390431] [<ffffffff81618806>] common_interrupt+0x86/0x86 [ 66.390446] <EOI> [ 66.390457] [<ffffffff814ec6d1>] ? cpuidle_enter_state+0x151/0x200 [ 66.390480] [<ffffffff814ec7a2>] cpuidle_enter+0x12/0x20 [ 66.390498] [<ffffffff810b639e>] call_cpuidle+0x1e/0x40 [ 66.390516] [<ffffffff810b65ae>] cpu_startup_entry+0x10e/0x1f0 [ 66.390534] [<ffffffff81036133>] start_secondary+0x103/0x130 (This is split out of the defer global seqno allocation patch due to realisation that we need a more complete conversion if we want to defer request submission even further.) v2: lockdep was warning about mixed SOFTIRQ contexts not HARDIRQ contexts so we only need to use spin_lock_bh and not disable interrupts. v3: We need full irq protection as we may be called from a third party interrupt handler (via fences). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-32-chris@chris-wilson.co.uk
2016-10-28 20:58:55 +08:00
/* Note that we may be called from an interrupt handler on another
* device (e.g. nouveau signaling a fence completion causing us
* to submit a request, and so enable signaling). As such,
* we need to make sure that all other users of b->rb_lock protect
drm/i915: Convert breadcrumbs spinlock to be irqsafe The breadcrumbs are about to be used from within IRQ context sections (e.g. nouveau signals a fence from an interrupt handler causing us to submit a new request) and/or from bottom-half tasklets (i.e. intel_lrc_irq_handler), therefore we need to employ the irqsafe spinlock variants. For example, deferring the request submission to the intel_lrc_irq_handler generates this trace: [ 66.388639] ================================= [ 66.388650] [ INFO: inconsistent lock state ] [ 66.388663] 4.9.0-rc2+ #56 Not tainted [ 66.388672] --------------------------------- [ 66.388682] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 66.388695] swapper/1/0 [HC0[0]:SC1[1]:HE0:SE0] takes: [ 66.388706] (&(&b->lock)->rlock){+.?...} , at: [<ffffffff81401c88>] intel_engine_enable_signaling+0x78/0x150 [ 66.388761] {SOFTIRQ-ON-W} state was registered at: [ 66.388772] [ 66.388783] [<ffffffff810bd842>] __lock_acquire+0x682/0x1870 [ 66.388795] [ 66.388803] [<ffffffff810bedbc>] lock_acquire+0x6c/0xb0 [ 66.388814] [ 66.388824] [<ffffffff8161753a>] _raw_spin_lock+0x2a/0x40 [ 66.388835] [ 66.388845] [<ffffffff81401e41>] intel_engine_reset_breadcrumbs+0x21/0xb0 [ 66.388857] [ 66.388866] [<ffffffff81403ae7>] gen8_init_common_ring+0x67/0x100 [ 66.388878] [ 66.388887] [<ffffffff81403b92>] gen8_init_render_ring+0x12/0x60 [ 66.388903] [ 66.388912] [<ffffffff813f8707>] i915_gem_init_hw+0xf7/0x2a0 [ 66.388927] [ 66.388936] [<ffffffff813f899b>] i915_gem_init+0xbb/0xf0 [ 66.388950] [ 66.388959] [<ffffffff813b4980>] i915_driver_load+0x7e0/0x1330 [ 66.388978] [ 66.388988] [<ffffffff813c09d8>] i915_pci_probe+0x28/0x40 [ 66.389003] [ 66.389013] [<ffffffff812fa0db>] pci_device_probe+0x8b/0xf0 [ 66.389028] [ 66.389037] [<ffffffff8147737e>] driver_probe_device+0x21e/0x430 [ 66.389056] [ 66.389065] [<ffffffff8147766e>] __driver_attach+0xde/0xe0 [ 66.389080] [ 66.389090] [<ffffffff814751ad>] bus_for_each_dev+0x5d/0x90 [ 66.389105] [ 66.389113] [<ffffffff81477799>] driver_attach+0x19/0x20 [ 66.389134] [ 66.389144] [<ffffffff81475ced>] bus_add_driver+0x15d/0x260 [ 66.389159] [ 66.389168] [<ffffffff81477e3b>] driver_register+0x5b/0xd0 [ 66.389183] [ 66.389281] [<ffffffff812fa19b>] __pci_register_driver+0x5b/0x60 [ 66.389301] [ 66.389312] [<ffffffff81aed333>] i915_init+0x3e/0x45 [ 66.389326] [ 66.389336] [<ffffffff81ac2ffa>] do_one_initcall+0x8b/0x118 [ 66.389350] [ 66.389359] [<ffffffff81ac323a>] kernel_init_freeable+0x1b3/0x23b [ 66.389378] [ 66.389387] [<ffffffff8160fc39>] kernel_init+0x9/0x100 [ 66.389402] [ 66.389411] [<ffffffff816180e7>] ret_from_fork+0x27/0x40 [ 66.389426] irq event stamp: 315865 [ 66.389438] hardirqs last enabled at (315864): [<ffffffff816178f1>] _raw_spin_unlock_irqrestore+0x31/0x50 [ 66.389469] hardirqs last disabled at (315865): [<ffffffff816176b3>] _raw_spin_lock_irqsave+0x13/0x50 [ 66.389499] softirqs last enabled at (315818): [<ffffffff8107a04c>] _local_bh_enable+0x1c/0x50 [ 66.389530] softirqs last disabled at (315819): [<ffffffff8107a50e>] irq_exit+0xbe/0xd0 [ 66.389559] [ 66.389559] other info that might help us debug this: [ 66.389580] Possible unsafe locking scenario: [ 66.389580] [ 66.389598] CPU0 [ 66.389609] ---- [ 66.389620] lock(&(&b->lock)->rlock); [ 66.389650] <Interrupt> [ 66.389661] lock(&(&b->lock)->rlock); [ 66.389690] [ 66.389690] *** DEADLOCK *** [ 66.389690] [ 66.389715] 2 locks held by swapper/1/0: [ 66.389728] #0: (&(&tl->lock)->rlock){..-...}, at: [<ffffffff81403e01>] intel_lrc_irq_handler+0x201/0x3c0 [ 66.389785] #1: (&(&req->lock)->rlock/1){..-...}, at: [<ffffffff813fc0af>] __i915_gem_request_submit+0x8f/0x170 [ 66.389854] [ 66.389854] stack backtrace: [ 66.389959] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc2+ #56 [ 66.389976] Hardware name: / , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 [ 66.389999] ffff88027fd03c58 ffffffff812beae5 ffff88027696e680 ffffffff822afe20 [ 66.390036] ffff88027fd03ca8 ffffffff810bb420 0000000000000001 0000000000000000 [ 66.390070] 0000000000000000 0000000000000006 0000000000000004 ffff88027696ee10 [ 66.390104] Call Trace: [ 66.390117] <IRQ> [ 66.390128] [<ffffffff812beae5>] dump_stack+0x68/0x93 [ 66.390147] [<ffffffff810bb420>] print_usage_bug+0x1d0/0x1e0 [ 66.390164] [<ffffffff810bb8a0>] mark_lock+0x470/0x4f0 [ 66.390181] [<ffffffff810ba9d0>] ? print_shortest_lock_dependencies+0x1b0/0x1b0 [ 66.390203] [<ffffffff810bd75d>] __lock_acquire+0x59d/0x1870 [ 66.390221] [<ffffffff810bedbc>] lock_acquire+0x6c/0xb0 [ 66.390237] [<ffffffff810bedbc>] ? lock_acquire+0x6c/0xb0 [ 66.390255] [<ffffffff81401c88>] ? intel_engine_enable_signaling+0x78/0x150 [ 66.390273] [<ffffffff8161753a>] _raw_spin_lock+0x2a/0x40 [ 66.390291] [<ffffffff81401c88>] ? intel_engine_enable_signaling+0x78/0x150 [ 66.390309] [<ffffffff81401c88>] intel_engine_enable_signaling+0x78/0x150 [ 66.390327] [<ffffffff813fc170>] __i915_gem_request_submit+0x150/0x170 [ 66.390345] [<ffffffff81403e8b>] intel_lrc_irq_handler+0x28b/0x3c0 [ 66.390363] [<ffffffff81079d97>] tasklet_action+0x57/0xc0 [ 66.390380] [<ffffffff8107a249>] __do_softirq+0x119/0x240 [ 66.390396] [<ffffffff8107a50e>] irq_exit+0xbe/0xd0 [ 66.390414] [<ffffffff8101afd5>] do_IRQ+0x65/0x110 [ 66.390431] [<ffffffff81618806>] common_interrupt+0x86/0x86 [ 66.390446] <EOI> [ 66.390457] [<ffffffff814ec6d1>] ? cpuidle_enter_state+0x151/0x200 [ 66.390480] [<ffffffff814ec7a2>] cpuidle_enter+0x12/0x20 [ 66.390498] [<ffffffff810b639e>] call_cpuidle+0x1e/0x40 [ 66.390516] [<ffffffff810b65ae>] cpu_startup_entry+0x10e/0x1f0 [ 66.390534] [<ffffffff81036133>] start_secondary+0x103/0x130 (This is split out of the defer global seqno allocation patch due to realisation that we need a more complete conversion if we want to defer request submission even further.) v2: lockdep was warning about mixed SOFTIRQ contexts not HARDIRQ contexts so we only need to use spin_lock_bh and not disable interrupts. v3: We need full irq protection as we may be called from a third party interrupt handler (via fences). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-32-chris@chris-wilson.co.uk
2016-10-28 20:58:55 +08:00
* against interrupts, i.e. use spin_lock_irqsave.
*/
/* locked by dma_fence_enable_sw_signaling() (irqsafe fence->lock) */
GEM_BUG_ON(!irqs_disabled());
lockdep_assert_held(&request->lock);
seqno = i915_gem_request_global_seqno(request);
if (!seqno)
return;
request->signaling.wait.tsk = b->signaler;
request->signaling.wait.request = request;
request->signaling.wait.seqno = seqno;
i915_gem_request_get(request);
spin_lock(&b->rb_lock);
/* First add ourselves into the list of waiters, but register our
* bottom-half as the signaller thread. As per usual, only the oldest
* waiter (not just signaller) is tasked as the bottom-half waking
* up all completed waiters after the user interrupt.
*
* If we are the oldest waiter, enable the irq (after which we
* must double check that the seqno did not complete).
*/
wakeup = __intel_engine_add_wait(engine, &request->signaling.wait);
/* Now insert ourselves into the retirement ordered list of signals
* on this engine. We track the oldest seqno as that will be the
* first signal to complete.
*/
parent = NULL;
first = true;
p = &b->signals.rb_node;
while (*p) {
parent = *p;
if (i915_seqno_passed(seqno,
to_signaler(parent)->signaling.wait.seqno)) {
p = &parent->rb_right;
first = false;
} else {
p = &parent->rb_left;
}
}
rb_link_node(&request->signaling.node, parent, p);
rb_insert_color(&request->signaling.node, &b->signals);
if (first)
rcu_assign_pointer(b->first_signal, request);
spin_unlock(&b->rb_lock);
if (wakeup)
wake_up_process(b->signaler);
}
void intel_engine_cancel_signaling(struct drm_i915_gem_request *request)
{
struct intel_engine_cs *engine = request->engine;
struct intel_breadcrumbs *b = &engine->breadcrumbs;
GEM_BUG_ON(!irqs_disabled());
lockdep_assert_held(&request->lock);
GEM_BUG_ON(!request->signaling.wait.seqno);
spin_lock(&b->rb_lock);
if (!RB_EMPTY_NODE(&request->signaling.node)) {
if (request == rcu_access_pointer(b->first_signal)) {
struct rb_node *rb =
rb_next(&request->signaling.node);
rcu_assign_pointer(b->first_signal,
rb ? to_signaler(rb) : NULL);
}
rb_erase(&request->signaling.node, &b->signals);
RB_CLEAR_NODE(&request->signaling.node);
i915_gem_request_put(request);
}
__intel_engine_remove_wait(engine, &request->signaling.wait);
spin_unlock(&b->rb_lock);
request->signaling.wait.seqno = 0;
}
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
int intel_engine_init_breadcrumbs(struct intel_engine_cs *engine)
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
struct task_struct *tsk;
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
spin_lock_init(&b->rb_lock);
spin_lock_init(&b->irq_lock);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
setup_timer(&b->fake_irq,
intel_breadcrumbs_fake_irq,
(unsigned long)engine);
setup_timer(&b->hangcheck,
intel_breadcrumbs_hangcheck,
(unsigned long)engine);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
/* Spawn a thread to provide a common bottom-half for all signals.
* As this is an asynchronous interface we cannot steal the current
* task for handling the bottom-half to the user interrupt, therefore
* we create a thread to do the coherent seqno dance after the
* interrupt and then signal the waitqueue (via the dma-buf/fence).
*/
tsk = kthread_run(intel_breadcrumbs_signaler, engine,
"i915/signal:%d", engine->id);
if (IS_ERR(tsk))
return PTR_ERR(tsk);
b->signaler = tsk;
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
return 0;
}
static void cancel_fake_irq(struct intel_engine_cs *engine)
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
del_timer_sync(&b->hangcheck);
del_timer_sync(&b->fake_irq);
clear_bit(engine->id, &engine->i915->gpu_error.missed_irq_rings);
}
void intel_engine_reset_breadcrumbs(struct intel_engine_cs *engine)
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
cancel_fake_irq(engine);
spin_lock_irq(&b->irq_lock);
if (b->irq_enabled)
irq_enable(engine);
else
irq_disable(engine);
/* We set the IRQ_BREADCRUMB bit when we enable the irq presuming the
* GPU is active and may have already executed the MI_USER_INTERRUPT
* before the CPU is ready to receive. However, the engine is currently
* idle (we haven't started it yet), there is no possibility for a
* missed interrupt as we enabled the irq and so we can clear the
* immediate wakeup (until a real interrupt arrives for the waiter).
*/
clear_bit(ENGINE_IRQ_BREADCRUMB, &engine->irq_posted);
if (b->irq_armed)
enable_fake_irq(b);
spin_unlock_irq(&b->irq_lock);
}
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
void intel_engine_fini_breadcrumbs(struct intel_engine_cs *engine)
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
/* The engines should be idle and all requests accounted for! */
WARN_ON(READ_ONCE(b->irq_wait));
WARN_ON(!RB_EMPTY_ROOT(&b->waiters));
WARN_ON(rcu_access_pointer(b->first_signal));
WARN_ON(!RB_EMPTY_ROOT(&b->signals));
if (!IS_ERR_OR_NULL(b->signaler))
kthread_stop(b->signaler);
cancel_fake_irq(engine);
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
}
bool intel_breadcrumbs_busy(struct intel_engine_cs *engine)
drm/i915: Slaughter the thundering i915_wait_request herd One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
2016-07-02 00:23:15 +08:00
{
struct intel_breadcrumbs *b = &engine->breadcrumbs;
bool busy = false;
spin_lock_irq(&b->rb_lock);
if (b->irq_wait) {
wake_up_process(b->irq_wait->tsk);
busy = true;
}
if (rcu_access_pointer(b->first_signal)) {
wake_up_process(b->signaler);
busy = true;
}
spin_unlock_irq(&b->rb_lock);
return busy;
}
#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
#include "selftests/intel_breadcrumbs.c"
#endif