Freedreno
=========

Freedreno is a GLES and GL driver for Adreno 2xx-6xx GPUs. It implements up to
OpenGL ES 3.2 and desktop OpenGL 4.5.

See the `Freedreno Wiki
<https://gitlab.freedesktop.org/freedreno/freedreno/-/wikis/home>`__ for more
details.

Turnip
------

Turnip is a Vulkan 1.3 driver for Adreno 6xx GPUs.

The current set of specific chip versions supported can be found in
:file:`src/freedreno/common/freedreno_devices.py`. The current set of features
supported can be found rendered at `Mesa Matrix <https://mesamatrix.net/>`__.
There are no plans to port to a5xx or earlier GPUs.

Hardware architecture
---------------------

Adreno is a mostly tile-mode renderer, but with the option to bypass tiling
("gmem") and render directly to system memory ("sysmem"). It is UMA, using
mostly write combined memory but with the ability to map some buffers as cache
coherent with the CPU.

.. toctree::
   :glob:

   freedreno/hw/*

Hardware acronyms
^^^^^^^^^^^^^^^^^

.. glossary::

   Cluster
      A group of hardware registers, often with multiple copies to allow
      pipelining. There is an M:N relationship between hardware blocks that do
      work and the clusters of registers for the state that hardware blocks use.

   CP
      Command Processor. Reads the stream of state changes and draw commands
      generated by the driver.

   PFP
      Prefetch Parser. Adreno 2xx-4xx CP component.

   ME
      Micro Engine. Adreno 2xx-4xx CP component after PFP, handles most PM4 commands.

   SQE
      a6xx+ replacement for PFP/ME. This is the microcontroller that runs the
      microcode (loaded from Linux) which actually processes the command stream
      and writes to the hardware registers. See `afuc
      <https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/afuc/README.rst>`__.

   ROQ
      DMA engine used by the SQE for reading memory, with some prefetch buffering.
      Mostly reads in the command stream, but also serves for
      ``CP_MEMCPY``/``CP_MEM_TO_REG`` and visibility stream reads.

   SP
      Shader Processor. Unified, scalar shader engine. One or more, depending on
      GPU and tier.

   TP
      Texture Processor.

   UCHE
      Unified L2 Cache. 32KB on A330, unclear how big now.

   CCU
      Color Cache Unit.

   VSC
      Visibility Stream Compressor

   PVS
      Primitive Visibility Stream

   FE
      Front End? Index buffer and vertex attribute fetch cluster. Includes PC,
      VFD, VPC.

   VFD
      Vertex Fetch and Decode

   VPC
      Varying/Position Cache? Hardware block that stores shaded vertex data for
      primitive assembly.

   HLSQ
      High Level Sequencer. Manages state for the SPs, batches up PS invocations
      between primitives, is involved in preemption.

   PC_VS
      Cluster where varyings are read from VPC and assembled into primitives to
      feed GRAS.

   VS
      Vertex Shader. Responsible for generating VS/GS/tess invocations

   GRAS
      Rasterizer. Responsible for generating PS invocations from primitives, also
      does LRZ

   PS
      Pixel Shader.

   RB
      Render Backend. Performs both early and late Z testing, blending, and
      attachment stores of output of the PS.

   GMEM
      Roughly 128KB-1MB of memory on the GPU (SKU-dependent), used to store
      attachments during tiled rendering

   LRZ
      Low Resolution Z. A low resolution area of the depth buffer that can be
      initialized during the binning pass to contain the worst-case (farthest) Z
      values in a block, and then used to early reject fragments during
      rasterization.

Cache hierarchy
^^^^^^^^^^^^^^^

The a6xx GPUs have two main caches: CCU and UCHE.

UCHE (Unified L2 Cache) is the cache behind the vertex fetch, VSC writes,
texture L1, LRZ, and storage image accesses (``ldib``/``stib``). Misses and
flushes access system memory.

The CCU is the separate cache used by 2D blits and sysmem render target access
(and also for resolves to system memory when in GMEM mode). Its memory comes
from a carveout of GMEM controlled by ``RB_CCU_CNTL``, with a varying amount
reserved based on whether we're in a render pass using GMEM for attachment
storage, or we're doing sysmem rendering. Cache entries have the attachment
number and layer mixed into the cache tag in some way, likely so that a
fragment's access is spread through the cache even if the attachments have the
same size and alignment in address space. This means that the cache must be
flushed and invalidated between memory being used for one attachment and another
(notably depth vs color, but also MRT color).

The Texture Processors (TP) additionally have a small L1 cache (1KB on A330,
unclear how big now) before accessing UCHE. This cache is used for normal
sampling like ``sam`` and ``isam`` (and the compiler will make read-only
storage image access go through it as well). It is not coherent with UCHE (you
may get stale results when you ``sam`` after ``stib``), but it appears to be
flushed per draw or similar, because you don't need a manual invalidate between
draws storing to an image and draws sampling from a texture.

The command processor (CP) does not read from either of these caches, and
instead uses FIFOs in the ROQ to avoid stalls reading from system memory.

Draw states
^^^^^^^^^^^

The SQE is not a fast processor, and tiled rendering means that many draws
won't even be used in many bins. For that reason, since a5xx, state updates can
be batched up into "draw states" that point to a fragment of CP packets. At draw
time, if the draw call is going to actually execute (some primitive is visible
in the current tile), the SQE goes through the ``GROUP_ID``\s and for any with
an update since the last time they were executed, it executes the corresponding
fragment.

Starting with a6xx, states can be tagged with whether they should be executed
at draw time for any of sysmem, binning, or tile rendering. This allows a
single command stream to be generated which can be executed in any of the modes,
unlike pre-a6xx where we had to generate separate command lists for the binning
and rendering phases.

Note that this means that the generated draw state has to always update all of
the state you have chosen to pack into that ``GROUP_ID``, since any of your
previous state changes in a previous draw state command may have been skipped.

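
The consequence of skipped draw states can be illustrated with a toy model. The
sketch below is not the SQE firmware logic; ``DrawStateModel``, its methods, and
the ``REG_A``/``REG_B`` names are hypothetical and only mirror the behavior
described above: a group's fragment runs only when a visible draw finds the
group updated since its last execution.

```python
class DrawStateModel:
    """Toy model of draw-state groups (hypothetical, not SQE firmware)."""

    def __init__(self):
        self.fragments = {}  # GROUP_ID -> list of (register, value) writes
        self.dirty = set()   # GROUP_IDs updated since they were last executed
        self.regs = {}       # register values as the hardware would see them

    def set_draw_state(self, group_id, fragment):
        # Replaces the group's fragment; the previous one may never have run.
        self.fragments[group_id] = list(fragment)
        self.dirty.add(group_id)

    def draw(self, visible):
        # Fragments only execute when the draw is visible in the current tile.
        if not visible:
            return
        for group_id in sorted(self.dirty):
            for reg, value in self.fragments[group_id]:
                self.regs[reg] = value
        self.dirty.clear()


model = DrawStateModel()
model.set_draw_state(0, [("REG_A", 1), ("REG_B", 2)])
model.draw(visible=False)                # skipped: nothing is written
model.set_draw_state(0, [("REG_B", 3)])  # incomplete: omits REG_A
model.draw(visible=True)
print(model.regs)
```

In this model ``REG_A`` is never written, because the only fragment that set it
was replaced before any visible draw executed it. That is why each draw state
has to carry all of the state packed into its ``GROUP_ID``.
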
Pipelining (a6xx+)
^^^^^^^^^^^^^^^^^^

Most CP commands write to registers. In a6xx+, the registers are located in
clusters corresponding to the stage of the pipeline they are used from (see
``enum tu_stage`` for a list). To pipeline state updates and drawing, registers
generally have two copies ("contexts") in their cluster, so previous draws can
be working on the previous set of register state while the next draw's state is
being set up. You can find what registers go into which clusters by looking at
:command:`crashdec` output in the ``regs-name: CP_MEMPOOL`` section.

As SQE processes register writes in the command stream, it sends them into a
per-cluster queue stored in ``CP_MEMPOOL``. This allows the pipeline stages to
process their stream of register updates and events independent of each other
(so even with just 2 contexts in a stage, earlier stages can proceed on to later
draws before later stages have caught up).

Each cluster has a per-context bit indicating that the context is done/free.
Register writes will stall on the context being done.

During a 3D draw command, SQE generates several internal events that flow
through the pipeline:

- ``CP_EVENT_START`` clears the done bit for the context when written to the
  cluster.
- ``PC_EVENT_CMD``/``PC_DRAW_CMD``/``HLSQ_EVENT_CMD``/``HLSQ_DRAW_CMD`` kick off
  the actual event/drawing.
- ``CONTEXT_DONE`` event completes after the event/draw is complete and sets the
  done flag.
- ``CP_EVENT_END`` waits for the done flag on the next context, then copies all
  the registers that were dirtied in this context to that one.

The 2D blit engine has its own ``CP_2D_EVENT_START``, ``CP_2D_EVENT_END``,
``CONTEXT_DONE_2D``, so 2D and 3D register contexts can do separate context
rollover.

Because the clusters proceed independently of each other even across draws, if
you need to synchronize an earlier cluster to the output of a later one, then
you will need to ``CP_WAIT_FOR_IDLE`` after flushing and invalidating any
necessary caches.

Also, note that some registers are not banked at all, and will require a
``CP_WAIT_FOR_IDLE`` for any previous usage of the register to complete.

In a2xx-a4xx, there weren't per-stage clusters, and instead there were two
register banks that were flipped between per draw.

Bindless/Bindful Descriptors (a6xx+)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Starting with a6xx+, cat5 (texture) and cat6 (image/SSBO/UBO) instructions are
extended to support bindless descriptors.

In the old bindful model, descriptors are separate for textures, samplers,
UBOs, and IBOs (combined descriptor for images and SSBOs), with separate
registers for the memory containing the array of descriptors, and/or different
``STATE_TYPE`` and ``STATE_BLOCK`` for ``CP_LOAD_STATE``/``_FRAG``/``_GEOM``
to pre-load the descriptors into cache.

- textures - per-shader-stage

  - registers: ``SP_xS_TEX_CONST``/``SP_xS_TEX_COUNT``
  - state-type: ``ST6_CONSTANTS``
  - state-block: ``SB6_xS_TEX``

- samplers - per-shader-stage

  - registers: ``SP_xS_TEX_SAMP``
  - state-type: ``ST6_SHADER``
  - state-block: ``SB6_xS_TEX``

- UBOs - per-shader-stage

  - registers: none
  - state-type: ``ST6_UBO``
  - state-block: ``SB6_xS_SHADER``

- IBOs - global across 3d shader stages, separate for compute shader

  - registers: ``SP_IBO``/``SP_IBO_COUNT`` or ``SP_CS_IBO``/``SP_CS_IBO_COUNT``
  - state-type: ``ST6_SHADER``
  - state-block: ``SB6_IBO`` or ``SB6_CS_IBO`` for compute shaders
  - Note, unlike per-shader-stage descriptors, ``CP_LOAD_STATE6`` is used,
    as opposed to ``CP_LOAD_STATE6_GEOM`` or ``CP_LOAD_STATE6_FRAG``
    depending on shader stage.

.. note::
   For the per-shader-stage registers and state-blocks the ``xS`` notation
   refers to per-shader-stage names, e.g. ``SP_FS_TEX_CONST`` or ``SB6_DS_TEX``.

Textures and IBOs (images) use *basically* the same 64-byte descriptor format
with some exceptions (e.g. for IBOs, cubemaps are handled as 2d arrays).
SSBOs are just untyped buffers, but otherwise use the same descriptors and
instructions as images. Samplers use a 16-byte descriptor, and UBOs use an
8-byte descriptor which packs the size in the upper 15 bits of the UBO address.

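
As an illustration of the 8-byte UBO descriptor layout, here is a minimal sketch
that packs a size into the upper 15 bits of a 64-bit value. The function names
are hypothetical, the bit position (bits 49..63) is inferred from "upper 15
bits" of an 8-byte descriptor, and the unit of the size field is not stated
here, so the code treats it as an opaque integer.

```python
UBO_SIZE_BITS = 15
UBO_SIZE_SHIFT = 64 - UBO_SIZE_BITS          # assumed: size lives in bits 49..63
UBO_ADDR_MASK = (1 << UBO_SIZE_SHIFT) - 1    # low 49 bits hold the address


def pack_ubo_descriptor(addr: int, size: int) -> int:
    """Pack address and size into one 64-bit descriptor (illustrative only)."""
    if addr < 0 or addr > UBO_ADDR_MASK:
        raise ValueError("address does not fit below the size field")
    if not 0 <= size < (1 << UBO_SIZE_BITS):
        raise ValueError("size does not fit in 15 bits")
    return addr | (size << UBO_SIZE_SHIFT)


def unpack_ubo_descriptor(desc: int) -> tuple[int, int]:
    """Return (addr, size) from a packed descriptor."""
    return desc & UBO_ADDR_MASK, desc >> UBO_SIZE_SHIFT


desc = pack_ubo_descriptor(0x1_0066_A000, 16)
print(hex(desc), unpack_ubo_descriptor(desc))
```
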
In the bindless model, descriptors are split into 5 descriptor sets, which are
global across shader stages (but as with bindful IBO descriptors, separate for
3d stages vs compute stage). Each HW descriptor set is an array of descriptors
of configurable size (each descriptor set can be configured for a descriptor
pitch of 8 bytes or 64 bytes). Each descriptor can be of arbitrary format (i.e.
UBOs/IBOs/textures/samplers interleaved); its interpretation by the HW is
determined by the instruction that references the descriptor. Each descriptor
set can contain at least 2^16 descriptors.

The HW is configured with the base address of the descriptor set via an array
of "BINDLESS_BASE" registers, i.e. ``SP_BINDLESS_BASE[n]``/``HLSQ_BINDLESS_BASE[n]``
for 3d shader stages, or ``SP_CS_BINDLESS_BASE[n]``/``HLSQ_CS_BINDLESS_BASE[n]``
for compute shaders, with the descriptor pitch encoded in the low bits.
Which of the descriptor sets is referenced is encoded via three bits in the
instruction. The address of the descriptor is calculated as::

   descriptor_addr = (BINDLESS_BASE[n] & ~0x3) +
                     (idx * 4 * (2 << (BINDLESS_BASE[n] & 0x3)))

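
The address calculation can be checked with a small worked example. This is a
sketch of the formula only: ``bindless_descriptor_addr`` and its parameters are
hypothetical names, and the shift is applied to the two low pitch bits, which is
the reading that yields the 8-byte and 64-byte pitches mentioned above.

```python
def bindless_descriptor_addr(bindless_base: int, idx: int) -> int:
    """Address of descriptor `idx` in the set described by `bindless_base`.

    The low two bits of the register value encode the descriptor pitch;
    the remaining bits are the base address of the descriptor set.
    """
    pitch_code = bindless_base & 0x3
    pitch_bytes = 4 * (2 << pitch_code)  # code 0 -> 8 bytes, code 3 -> 64 bytes
    return (bindless_base & ~0x3) + idx * pitch_bytes


# A set based at 0x1000 with 64-byte pitch (code 3) and one with 8-byte pitch.
print(hex(bindless_descriptor_addr(0x1000 | 0x3, 2)))
print(hex(bindless_descriptor_addr(0x2000 | 0x0, 5)))
```
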

.. note::
   Turnip reserves one descriptor set for internal use and exposes the other
   four for the application via the Vulkan API.

Software Architecture
---------------------

Freedreno and Turnip use a shared core for shader compiler, image layout, and
register and command stream definitions. They implement separate state
management and command stream generation.

.. toctree::
   :glob:

   freedreno/*

GPU devcoredump
^^^^^^^^^^^^^^^

A kernel message from DRM of "gpu fault" can mean any sort of error reported by
the GPU (including its internal hang detection). If a fault in GPU address
space happened, you should expect to find a message from the iommu, with the
faulting address and a hardware unit involved:

.. code-block:: text

   *** gpu fault: ttbr0=000000001c941000 iova=000000010066a000 dir=READ type=TRANSLATION source=TP|VFD (0,0,0,1)

On a GPU fault or hang, a GPU core dump is taken by the DRM driver and saved to
``/sys/devices/virtual/devcoredump/**/data``. You can :command:`cp` that file to
a :file:`crash.devcore` to save it, otherwise the kernel will expire it
eventually. Echo 1 to the file to free the core early, as another core won't be
taken until then.

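
The copy-then-free sequence can be scripted. The following is a minimal sketch,
assuming the dump sits one directory below the devcoredump root (which is
consistent with the ``**`` pattern above); ``save_devcoredump`` and its
parameters are hypothetical names, and writing to sysfs normally requires root.

```python
import glob
import os
import shutil


def save_devcoredump(dest, root="/sys/devices/virtual/devcoredump", free=True):
    """Copy the first devcoredump found under `root` to `dest`.

    Returns the source path, or None if no dump is present. With free=True,
    "1" is written to the data file afterwards so that the kernel releases
    the core and a new one can be taken.
    """
    candidates = sorted(glob.glob(os.path.join(root, "*", "data")))
    if not candidates:
        return None
    src = candidates[0]
    # Plain read/write loop; sysfs files do not report a useful size.
    with open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    if free:
        with open(src, "w") as f:
            f.write("1")
    return src


if __name__ == "__main__":
    print(save_devcoredump("crash.devcore"))
```
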
Once you have your core file, you can use :command:`crashdec -f crash.devcore`
to decode it. The output will have ``ESTIMATED CRASH LOCATION`` where we
estimate the CP to have stopped. Note that it is expected that this will be
some distance past whatever state triggered the fault, given GPU pipelining, and
will often be at some ``CP_REG_TO_MEM`` (which waits on previous WFIs) or
``CP_WAIT_FOR_ME`` (which waits for all register writes to land) or similar
event. You can try running the workload with ``TU_DEBUG=flushall`` or
``FD_MESA_DEBUG=flush`` to try to close in on the failing commands.

You can also find what commands were queued up to each cluster in the
``regs-name: CP_MEMPOOL`` section.

If ``ESTIMATED CRASH LOCATION`` doesn't exist, you can look at ``CP_SQE_STAT``,
though this is a last resort and likely won't be helpful.

.. code-block::

   indexed-registers:
     - regs-name: CP_SQE_STAT
       dwords: 51
        PC: 00d7                                <-------------
        PKT: CP_LOAD_STATE6_FRAG
        $01: 70348003           $11: 00000000
        $02: 20000000           $12: 00000022

The ``PC`` value is an instruction address in the current firmware.
You would need to disassemble the firmware
(:file:`/lib/firmware/qcom/aXXX_sqe.fw`) via:

.. code-block:: sh

   afuc-disasm -v a650_sqe.fw > a650_sqe.fw.disasm

Now search for the PC value in the disassembly, e.g.:

.. code-block::

   l018: 00d1: 08dd0001  add $addr, $06, 0x0001
         00d2: 981ff806  mov $data, $data
         00d3: 8a080001  mov $08, 0x0001 << 16
         00d4: 3108ffff  or $08, $08, 0xffff
         00d5: 9be8f805  and $data, $data, $08
         00d6: 9806e806  mov $addr, $06
         00d7: 9803f806  mov $data, $03   <------------- HERE
         00d8: d8000000  waitin
         00d9: 981f0806  mov $01, $data

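
Searching the disassembly for the PC can be automated. The sketch below assumes
the line layout shown in the excerpt above (an optional label, a 4-digit hex
address, an 8-digit hex instruction word, then the mnemonic); ``find_pc`` is a
hypothetical helper and other :command:`afuc-disasm` versions may format their
output differently.

```python
import re

# Optional "l018:" style label, then "00d7: 9803f806  mov $data, $03".
LINE_RE = re.compile(
    r"^(?:\w+:\s+)?([0-9a-fA-F]{4}):\s+([0-9a-fA-F]{8})\s+(.*)$"
)


def find_pc(disasm_text: str, pc: int):
    """Return the instruction text at address `pc`, or None if not found."""
    for line in disasm_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and int(match.group(1), 16) == pc:
            return match.group(3).strip()
    return None


sample = """\
l018: 00d1: 08dd0001  add $addr, $06, 0x0001
      00d6: 9806e806  mov $addr, $06
      00d7: 9803f806  mov $data, $03
      00d8: d8000000  waitin
"""
print(find_pc(sample, 0x00D7))
```
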

Command Stream Capture
^^^^^^^^^^^^^^^^^^^^^^

During Mesa development, it's often useful to look at the command streams we
send to the kernel. We have an interface for the kernel to capture all
submitted command streams:

.. code-block:: sh

   cat /sys/kernel/debug/dri/0/rd > cmdstream &

By default, command stream capture does not capture texture/vertex/etc. data.
You can enable capturing all the BOs with:

.. code-block:: sh

   echo Y > /sys/module/msm/parameters/rd_full

Note that, since all command streams get captured, it is easy to run the system
out of memory doing this, so you probably don't want to enable it while playing
a heavyweight game. Instead, to capture a command stream within a game, you
probably want to cause a crash in the GPU during a frame of interest so that a
single GPU core dump is generated. Emitting ``0xdeadbeef`` in the CS should be
enough to cause a fault.

``fd_rd_output`` facilities provide support for generating the command stream
capture from inside Mesa. Different ``FD_RD_DUMP`` options are available:

- ``enable`` simply enables dumping the command stream on each submit for a
  given logical device. When a more advanced option is specified, ``enable`` is
  implied.
- ``combine`` will combine all dumps into a single file instead of writing the
  dump for each submit into a standalone file.
- ``full`` will dump every buffer object, which is necessary for replays of
  command streams (see below).
- ``trigger`` will establish a trigger file through which dumps can be better
  controlled. Writing a positive integer value into the file will enable dumping
  of that many subsequent submits. Writing -1 will enable dumping of submits
  until disabled. Writing 0 (or any other value) will disable dumps.

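
The ``trigger`` semantics can be summarized as a small state machine. This is a
model of the behavior described in the list above, not Mesa's implementation;
``DumpTrigger`` and its methods are hypothetical names.

```python
class DumpTrigger:
    """Model of the trigger-file semantics (illustrative only)."""

    def __init__(self):
        self.remaining = 0  # 0: disabled, -1: until disabled, N > 0: N submits

    def write(self, text: str) -> None:
        """Emulate writing `text` into the trigger file."""
        try:
            value = int(text.strip())
        except ValueError:
            value = 0  # anything unparsable disables dumping
        # Positive counts and -1 enable dumping; every other value disables it.
        self.remaining = value if (value > 0 or value == -1) else 0

    def on_submit(self) -> bool:
        """Return True if this submit should be dumped."""
        if self.remaining == -1:
            return True
        if self.remaining > 0:
            self.remaining -= 1
            return True
        return False


trigger = DumpTrigger()
trigger.write("2")
print([trigger.on_submit() for _ in range(3)])
```
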
Output dump files and trigger file (when enabled) are hard-coded to be placed
under ``/tmp``, or ``/data/local/tmp`` under Android. ``FD_RD_DUMP_TESTNAME`` can
be used to specify a more descriptive prefix for the output or trigger files.

Functionality is generic to any Freedreno-based backend, but is currently only
integrated in the MSM backend of Turnip. Using the existing ``TU_DEBUG=rd``
option will translate to ``FD_RD_DUMP=enable``.

Capturing Hang RD
+++++++++++++++++

A devcore file doesn't contain all submitted command streams, only the hanging
one. Additionally, it is geared towards analyzing the GPU state at the moment of
the crash.

Alternatively, it's possible to obtain the whole submission with all command
streams via ``/sys/kernel/debug/dri/0/hangrd``:

.. code-block:: sh

   sudo cat /sys/kernel/debug/dri/0/hangrd > logfile.rd # Do the cat _before_ the expected hang

The format of hangrd is the same as in ordinary command stream capture.
``rd_full`` also has the same effect on it.

Replaying Command Stream
^^^^^^^^^^^^^^^^^^^^^^^^

The ``replay`` tool allows capturing and replaying ``rd`` to reproduce GPU
faults. It is especially useful for transient GPU issues since it has a much
higher chance of reproducing them.

Dumping rendering results or even just memory is currently unsupported.

- Replaying command streams requires a kernel with ``MSM_INFO_SET_IOVA`` support.
- Requires the ``rd`` capture to have full snapshots of the memory (``rd_full``
  is enabled).

Replaying is done via the ``replay`` tool:

.. code-block:: sh

   ./replay test_replay.rd

More examples:

.. code-block:: sh

   ./replay --first=start_submit_n --last=last_submit_n test_replay.rd

.. code-block:: sh

   ./replay --override=0 test_replay.rd

Editing Command Stream (a6xx+)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While replaying a fault is useful in itself, modifying the capture to
understand what causes the fault could be even more useful.

``rddecompiler`` decompiles a single cmdstream from ``rd`` into compilable C
source. Given the address space bounds, the generated program creates a new
``rd`` which can be used to override a cmdstream with ``replay``. The generated
``rd`` is not replayable on its own and depends on buffers provided by the
source ``rd``.

The C source can be compiled by putting it into
:file:`src/freedreno/decode/generate-rd.cc`.

The workflow would look like this:

1. Find the number of the cmdstream you want to edit;
2. Decompile it:

   .. code-block:: sh

      ./rddecompiler -s %cmd_stream_n% example.rd > src/freedreno/decode/generate-rd.cc

3. Edit the command stream;
4. Compile and deploy freedreno tools;
5. Plug the generator into cmdstream replay:

   .. code-block:: sh

      ./replay --override=%cmd_stream_n%

6. Repeat 3-5.

GPU Hang Debugging
^^^^^^^^^^^^^^^^^^

This is not a guide for how to do it, but mostly an enumeration of methods.

Useful ``TU_DEBUG`` (for Turnip) options to narrow down the hang cause:

``sysmem``, ``gmem``, ``nobin``, ``forcebin``, ``noubwc``, ``nolrz``, ``flushall``, ``syncdraw``, ``rast_order``

Useful ``FD_MESA_DEBUG`` (for Freedreno) options:

``sysmem``, ``gmem``, ``nobin``, ``noubwc``, ``nolrz``, ``notile``, ``dclear``, ``ddraw``, ``flush``, ``inorder``, ``noblit``

Useful ``IR3_SHADER_DEBUG`` options:

``nouboopt``, ``spillall``, ``nopreamble``, ``nofp16``

Use Graphics Flight Recorder to narrow down the place which hangs, or
use our own breadcrumbs implementation in case of unrecoverable hangs.

In case of faults, use RenderDoc to find the problematic command. If it's
a draw call, edit the shader in RenderDoc to find out whether the culprit is a
shader. If yes, bisect it.

If editing the shader changes the assembly too much and the issue becomes
unreproducible, try editing the assembly itself via ``IR3_SHADER_OVERRIDE_PATH``.

If the fault or hang is transient, try capturing an ``rd`` and replaying it. If
the issue is reproduced, bisect the GPU packets until the culprit is found.

Do the same if the culprit is not a shader.

The hang recovery mechanism in the kernel is not perfect. In case of
unrecoverable hangs, check whether the kernel is up to date and look for
unmerged patches which could improve the recovery.

GPU Breadcrumbs
+++++++++++++++

Breadcrumbs described below are available only in Turnip.

Freedreno has simpler breadcrumbs: in debug builds it writes breadcrumbs
into ``CP_SCRATCH_REG[6]`` and per-tile breadcrumbs into ``CP_SCRATCH_REG[7]``,
so that they are available in the devcoredump. TODO: generalize Turnip's
breadcrumbs implementation.

This is a simple implementation of breadcrumbs tracking of GPU progress,
intended to be a last resort when debugging unrecoverable hangs.
For best results, use Vulkan traces to have a predictable place of hang.

For ordinary hangs, use GFR ("Graphics Flight Recorder") as a more
user-friendly solution.

Our breadcrumbs implementation aims to handle cases where nothing can be done
after the hang. In-driver breadcrumbs also allow more precise tracking since
we could target a single GPU packet.

While breadcrumbs support gmem, try to reproduce the hang in sysmem mode
because it requires far fewer breadcrumb writes and syncs.

Breadcrumbs settings:

.. code-block:: sh

   TU_BREADCRUMBS=%IP%:%PORT%,break=%BREAKPOINT%:%BREAKPOINT_HITS%

``BREAKPOINT``
   The breadcrumb starting from which we require explicit ack.
``BREAKPOINT_HITS``
   How many times the breakpoint should be reached for the break to occur.
   Necessary for gmem mode and re-usable cmdbuffers, in both of which
   the same cmdstream could be executed several times.

A typical workflow would be:

- Start listening for breadcrumbs on a remote host:

  .. code-block:: sh

     nc -lvup $PORT | stdbuf -o0 xxd -pc -c 4 | awk -Wposix '{printf("%u:%u\n", "0x" $0, a[$0]++)}'

- Start capturing the command stream;
- Replay the hanging trace with:

  .. code-block:: sh

     TU_BREADCRUMBS=$IP:$PORT,break=-1:0

- Increase the hangcheck period:

  .. code-block:: sh

     echo -n 60000 > /sys/kernel/debug/dri/0/hangcheck_period_ms

- After the GPU hang, note the last breadcrumb and relaunch the trace with:

  .. code-block:: sh

     TU_BREADCRUMBS=%IP%:%PORT%,break=%LAST_BREADCRUMB%:%HITS%

- After the breakpoint is reached, each breadcrumb requires an
  explicit ack from the user. This way it's possible to find
  the last packet which didn't hang.

- Find the packet in the decoded cmdstream.

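
For reference, the decoding done by the ``xxd``/``awk`` part of the listener
pipeline above can be sketched as follows. The sketch mirrors that pipeline
only: it takes 4 bytes at a time, reads them as one number in the order
received (big-endian, as ``xxd -p`` prints them), and emits ``value:hits``
where ``hits`` counts earlier occurrences of the same value.
``decode_breadcrumbs`` is a hypothetical name, and receiving the UDP stream is
left out.

```python
from collections import Counter


def decode_breadcrumbs(data: bytes, hits=None):
    """Turn a raw breadcrumb byte stream into "value:hits" strings."""
    hits = Counter() if hits is None else hits
    lines = []
    usable = len(data) - len(data) % 4  # ignore a trailing partial word
    for offset in range(0, usable, 4):
        value = int.from_bytes(data[offset:offset + 4], "big")
        lines.append(f"{value}:{hits[value]}")  # hit count before increment
        hits[value] += 1
    return lines


stream = bytes.fromhex("00000001" "00000002" "00000001")
print(decode_breadcrumbs(stream))
```

The hit count in the output corresponds to the ``%HITS%`` value passed in
``break=%LAST_BREADCRUMB%:%HITS%`` when the same breadcrumb is written more
than once.
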
Debugging random failures
^^^^^^^^^^^^^^^^^^^^^^^^^

In most cases random GPU faults and rendering artifacts are caused by some kind
of undefined behavior that falls under the following categories:

- Usage of a stale reg value;
- Usage of stale memory (e.g. expecting it to be zeroed when it is not);
- Lack of the proper synchronization.

Finding instances of stale reg reads
++++++++++++++++++++++++++++++++++++

Turnip has a debug option to stomp the registers with invalid values to catch
the cases where stale data is read.

.. code-block:: sh

   MESA_VK_ABORT_ON_DEVICE_LOSS=1 \
   TU_DEBUG_STALE_REGS_RANGE=0x00000c00,0x0000be01 \
   TU_DEBUG_STALE_REGS_FLAGS=cmdbuf,renderpass \
   ./app

.. envvar:: TU_DEBUG_STALE_REGS_RANGE

   The reg range in which registers would be stomped. Add ``inverse`` to the
   flags in order for this range to specify which registers NOT to stomp.

.. envvar:: TU_DEBUG_STALE_REGS_FLAGS

   ``cmdbuf``
      stomp registers at the start of each command buffer.
   ``renderpass``
      stomp registers before each render pass.
   ``inverse``
      changes ``TU_DEBUG_STALE_REGS_RANGE`` meaning to
      "regs that should NOT be stomped".

The best way to pinpoint the reg which causes a failure is to bisect the regs
range. When a failure is caused by a combination of several registers,
the ``inverse`` flag may be set to find the reg which prevents the failure.

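
Bisecting the register range can be scripted along these lines. This is a
minimal sketch under several assumptions: the failure is caused by a single
register, it reproduces reliably when that register is inside the stomped
range, and both ends of ``TU_DEBUG_STALE_REGS_RANGE`` are treated as inclusive
(which is not stated above). ``bisect_reg_range``, ``stale_regs_env`` and the
``reproduces`` callback are hypothetical; the callback would run the app with
the given range and report whether the failure occurred.

```python
def stale_regs_env(first: int, last: int) -> dict:
    """Environment variables selecting the register range to stomp."""
    return {"TU_DEBUG_STALE_REGS_RANGE": f"0x{first:08x},0x{last:08x}"}


def bisect_reg_range(first: int, last: int, reproduces) -> int:
    """Narrow [first, last] down to the one register that triggers the failure.

    `reproduces(lo, hi)` must return True when stomping registers lo..hi
    makes the failure appear.
    """
    while first < last:
        mid = (first + last) // 2
        if reproduces(first, mid):
            last = mid          # culprit is in the lower half
        else:
            first = mid + 1     # culprit must be in the upper half
    return first


# Stand-in for running the application: pretend register 0x8c01 is the culprit.
culprit = 0x8C01
found = bisect_reg_range(0x0C00, 0xBE01, lambda lo, hi: lo <= culprit <= hi)
print(hex(found), stale_regs_env(found, found))
```
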