Freedreno
=========

Freedreno is a GLES and GL driver for Adreno 2xx-6xx GPUs. It implements up to
OpenGL ES 3.2 and desktop OpenGL 4.5.

See the `Freedreno Wiki
<https://gitlab.freedesktop.org/freedreno/freedreno/-/wikis/home>`__ for more
details.

Turnip
------

Turnip is a Vulkan 1.3 driver for Adreno 6xx GPUs.

The current set of specific chip versions supported can be found in
:file:`src/freedreno/common/freedreno_devices.py`. The current set of features
supported can be found rendered at `Mesa Matrix <https://mesamatrix.net/>`__.
There are no plans to port to a5xx or earlier GPUs.

Hardware architecture
---------------------

Adreno is a mostly tile-mode renderer, but with the option to bypass tiling
("gmem") and render directly to system memory ("sysmem"). It is UMA, using
mostly write combined memory but with the ability to map some buffers as cache
coherent with the CPU.

.. toctree::
   :glob:

   freedreno/hw/*

Hardware acronyms
^^^^^^^^^^^^^^^^^

.. glossary::

   Cluster
      A group of hardware registers, often with multiple copies to allow
      pipelining. There is an M:N relationship between hardware blocks that do
      work and the clusters of registers for the state that hardware blocks use.

   CP
      Command Processor. Reads the stream of state changes and draw commands
      generated by the driver.

   PFP
      Prefetch Parser. Adreno 2xx-4xx CP component.

   ME
      Micro Engine. Adreno 2xx-4xx CP component after PFP, handles most PM4 commands.

   SQE
      a6xx+ replacement for PFP/ME. This is the microcontroller that runs the
      microcode (loaded from Linux) which actually processes the command stream
      and writes to the hardware registers. See `afuc
      <https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/afuc/README.rst>`__.

   ROQ
      DMA engine used by the SQE for reading memory, with some prefetch buffering.
      Mostly reads in the command stream, but also serves for
      ``CP_MEMCPY``/``CP_MEM_TO_REG`` and visibility stream reads.

   SP
      Shader Processor. Unified, scalar shader engine. One or more, depending on
      GPU and tier.

   TP
      Texture Processor.

   UCHE
      Unified L2 Cache. 32KB on A330, unclear how big now.

   CCU
      Color Cache Unit.

   VSC
      Visibility Stream Compressor.

   PVS
      Primitive Visibility Stream.

   FE
      Front End? Index buffer and vertex attribute fetch cluster. Includes PC,
      VFD, VPC.

   VFD
      Vertex Fetch and Decode.

   VPC
      Varying/Position Cache? Hardware block that stores shaded vertex data for
      primitive assembly.

   HLSQ
      High Level Sequencer. Manages state for the SPs, batches up PS invocations
      between primitives, is involved in preemption.

   PC_VS
      Cluster where varyings are read from VPC and assembled into primitives to
      feed GRAS.

   VS
      Vertex Shader. Responsible for generating VS/GS/tess invocations.

   GRAS
      Rasterizer. Responsible for generating PS invocations from primitives, also
      does LRZ.

   PS
      Pixel Shader.

   RB
      Render Backend. Performs both early and late Z testing, blending, and
      attachment stores of output of the PS.

   GMEM
      Roughly 128KB-1MB of memory on the GPU (SKU-dependent), used to store
      attachments during tiled rendering.

   LRZ
      Low Resolution Z. A low resolution area of the depth buffer that can be
      initialized during the binning pass to contain the worst-case (farthest) Z
      values in a block, and then used to early reject fragments during
      rasterization.

Cache hierarchy
^^^^^^^^^^^^^^^

The a6xx GPUs have two main caches: CCU and UCHE.

UCHE (Unified L2 Cache) is the cache behind the vertex fetch, VSC writes,
texture L1, LRZ, and storage image accesses (``ldib``/``stib``). Misses and
flushes access system memory.

The CCU is the separate cache used by 2D blits and sysmem render target access
(and also for resolves to system memory when in GMEM mode). Its memory comes
from a carveout of GMEM controlled by ``RB_CCU_CNTL``, with a varying amount
reserved based on whether we're in a render pass using GMEM for attachment
storage, or we're doing sysmem rendering. Cache entries have the attachment
number and layer mixed into the cache tag in some way, likely so that a
fragment's access is spread through the cache even if the attachments are the
same size and alignments in address space. This means that the cache must be
flushed and invalidated between memory being used for one attachment and another
(notably depth vs color, but also MRT color).

The Texture Processors (TP) additionally have a small L1 cache (1KB on A330,
unclear how big now) before accessing UCHE. This cache is used for normal
sampling like ``sam`` and ``isam`` (and the compiler will make read-only
storage image access through it as well). It is not coherent with UCHE (you may
get stale results when you ``sam`` after ``stib``), but it appears to get
flushed per draw or similar, because you don't need a manual invalidate between
draws storing to an image and draws sampling from a texture.

The command processor (CP) does not read from either of these caches, and
instead uses FIFOs in the ROQ to avoid stalls reading from system memory.

Draw states
^^^^^^^^^^^

The SQE is not a fast processor, and tiled rendering means that many draws
won't even be used in many bins. For that reason, since a5xx, state updates can
be batched up into "draw states" that point to a fragment of CP packets. At draw
time, if the draw call is going to actually execute (some primitive is visible
in the current tile), the SQE goes through the ``GROUP_ID``\s and for any with
an update since the last time they were executed, it executes the corresponding
fragment.

Starting with a6xx, states can be tagged with whether they should be executed
at draw time for any of sysmem, binning, or tile rendering. This allows a
single command stream to be generated which can be executed in any of the modes,
unlike pre-a6xx where we had to generate separate command lists for the binning
and rendering phases.

Note that this means that the generated draw state has to always update all of
the state you have chosen to pack into that ``GROUP_ID``, since any of your
previous state changes in a previous draw state command may have been skipped.

Pipelining (a6xx+)
^^^^^^^^^^^^^^^^^^

Most CP commands write to registers. In a6xx+, the registers are located in
clusters corresponding to the stage of the pipeline they are used from (see
``enum tu_stage`` for a list). To pipeline state updates and drawing, registers
generally have two copies ("contexts") in their cluster, so previous draws can
be working on the previous set of register state while the next draw's state is
being set up. You can find what registers go into which clusters by looking at
:command:`crashdec` output in the ``regs-name: CP_MEMPOOL`` section.

As SQE processes register writes in the command stream, it sends them into a
per-cluster queue stored in ``CP_MEMPOOL``. This allows the pipeline stages to
process their stream of register updates and events independent of each other
(so even with just 2 contexts in a stage, earlier stages can proceed on to later
draws before later stages have caught up).

Each cluster has a per-context bit indicating that the context is done/free.
Register writes will stall on the context being done.

During a 3D draw command, SQE generates several internal events that flow
through the pipeline:

- ``CP_EVENT_START`` clears the done bit for the context when written to the
  cluster.
- ``PC_EVENT_CMD``/``PC_DRAW_CMD``/``HLSQ_EVENT_CMD``/``HLSQ_DRAW_CMD`` kick off
  the actual event/drawing.
- ``CONTEXT_DONE`` event completes after the event/draw is complete and sets the
  done flag.
- ``CP_EVENT_END`` waits for the done flag on the next context, then copies all
  the registers that were dirtied in this context to that one.

The 2D blit engine has its own ``CP_2D_EVENT_START``, ``CP_2D_EVENT_END``,
``CONTEXT_DONE_2D``, so 2D and 3D register contexts can do separate context
rollover.

Because the clusters proceed independently of each other even across draws, if
you need to synchronize an earlier cluster to the output of a later one, then
you will need to ``CP_WAIT_FOR_IDLE`` after flushing and invalidating any
necessary caches.

Also, note that some registers are not banked at all, and will require a
``CP_WAIT_FOR_IDLE`` for any previous usage of the register to complete.

In a2xx-a4xx, there weren't per-stage clusters, and instead there were two
register banks that were flipped between per draw.

Bindless/Bindful Descriptors (a6xx+)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Starting with a6xx, cat5 (texture) and cat6 (image/ssbo/ubo) instructions are
extended to support bindless descriptors.

In the old bindful model, descriptors are separate for textures, samplers,
UBOs, and IBOs (combined descriptor for images and SSBOs), with separate
registers for the memory containing the array of descriptors, and/or different
``STATE_TYPE`` and ``STATE_BLOCK`` for ``CP_LOAD_STATE``/``_FRAG``/``_GEOM``
to pre-load the descriptors into cache.

- textures - per-shader-stage

  - registers: ``SP_xS_TEX_CONST``/``SP_xS_TEX_COUNT``
  - state-type: ``ST6_CONSTANTS``
  - state-block: ``SB6_xS_TEX``

- samplers - per-shader-stage

  - registers: ``SP_xS_TEX_SAMP``
  - state-type: ``ST6_SHADER``
  - state-block: ``SB6_xS_TEX``

- UBOs - per-shader-stage

  - registers: none
  - state-type: ``ST6_UBO``
  - state-block: ``SB6_xS_SHADER``

- IBOs - global across 3d shader stages, separate for compute shader

  - registers: ``SP_IBO``/``SP_IBO_COUNT`` or ``SP_CS_IBO``/``SP_CS_IBO_COUNT``
  - state-type: ``ST6_SHADER``
  - state-block: ``ST6_IBO`` or ``ST6_CS_IBO`` for compute shaders
  - Note, unlike per-shader-stage descriptors, ``CP_LOAD_STATE6`` is used,
    as opposed to ``CP_LOAD_STATE6_GEOM`` or ``CP_LOAD_STATE6_FRAG``
    depending on shader stage.

.. note::
   For the per-shader-stage registers and state-blocks the ``xS`` notation
   refers to per-shader-stage names, e.g. ``SP_FS_TEX_CONST`` or ``SB6_DS_TEX``.

Textures and IBOs (images) use *basically* the same 64-byte descriptor format
with some exceptions (for example, for IBOs cubemaps are handled as 2d arrays).
SSBOs are just untyped buffers, but otherwise use the same descriptors and
instructions as images. Samplers use a 16-byte descriptor, and UBOs use an
8-byte descriptor which packs the size in the upper 15 bits of the UBO address.

In the bindless model, descriptors are split into 5 descriptor sets, which are
global across shader stages (but as with bindful IBO descriptors, separate for
3d stages vs compute stage). Each hw descriptor set is an array of descriptors
of configurable size (each descriptor set can be configured for a descriptor
pitch of 8 bytes or 64 bytes). Each descriptor can be of arbitrary format (i.e.
UBOs/IBOs/textures/samplers interleaved); its interpretation by the hw is
determined by the instruction that references the descriptor. Each descriptor
set can contain at least 2^16 descriptors.

The hw is configured with the base address of the descriptor set via an array
of "BINDLESS_BASE" registers, i.e. ``SP_BINDLESS_BASE[n]``/``HLSQ_BINDLESS_BASE[n]``
for 3d shader stages, or ``SP_CS_BINDLESS_BASE[n]``/``HLSQ_CS_BINDLESS_BASE[n]``
for compute shaders, with the descriptor pitch encoded in the low bits.
Which of the descriptor sets is referenced is encoded via three bits in the
instruction. The address of the descriptor is calculated as::

   descriptor_addr = (BINDLESS_BASE[n] & ~0x3) +
                     (idx * 4 * (2 << (BINDLESS_BASE[n] & 0x3)))

.. note::
   Turnip reserves one descriptor set for internal use and exposes the other
   four for the application via the Vulkan API.

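
As a worked example of the address calculation above, the following shell
sketch evaluates the formula for a raw base register value and a descriptor
index. The function name ``descriptor_addr`` is made up for illustration, and
the sketch assumes the pitch field is ``BINDLESS_BASE[n] & 0x3``, so that a
value of 0 gives the 8-byte pitch and a value of 3 gives the 64-byte pitch.

.. code-block:: sh

   # Hypothetical helper: evaluate the bindless descriptor address formula.
   #   $1: raw BINDLESS_BASE[n] value (base address with pitch in the low bits)
   #   $2: descriptor index within the set
   descriptor_addr() {
       base=$(($1))
       idx=$(($2))
       pitch_bits=$((base & 0x3))
       # Descriptor pitch in bytes: 8 for pitch_bits == 0, 64 for pitch_bits == 3.
       pitch=$((4 * (2 << pitch_bits)))
       printf '0x%x\n' $(( (base & ~0x3) + idx * pitch ))
   }

   descriptor_addr 0x10000003 2   # 64-byte pitch: prints 0x10000080
   descriptor_addr 0x10000000 2   # 8-byte pitch: prints 0x10000010

The two calls differ only in the low bits of the base value, which is enough
to move the third descriptor from byte offset 16 to byte offset 128.
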
Software Architecture
---------------------

Freedreno and Turnip use a shared core for shader compiler, image layout, and
register and command stream definitions. They implement separate state
management and command stream generation.

.. toctree::
   :glob:

   freedreno/*

GPU devcoredump
^^^^^^^^^^^^^^^

A kernel message from DRM of "gpu fault" can mean any sort of error reported by
the GPU (including its internal hang detection). If a fault in GPU address
space happened, you should expect to find a message from the iommu, with the
faulting address and a hardware unit involved:

.. code-block:: text

   *** gpu fault: ttbr0=000000001c941000 iova=000000010066a000 dir=READ type=TRANSLATION source=TP|VFD (0,0,0,1)

On a GPU fault or hang, a GPU core dump is taken by the DRM driver and saved to
``/sys/devices/virtual/devcoredump/**/data``. You can copy that file to a
:file:`crash.devcore` to save it, otherwise the kernel will expire it
eventually. Echo 1 to the file to free the core early, as another core won't be
taken until then.

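
The save-then-free steps above can be sketched as a small shell helper. The
function name ``save_devcores`` and the output file naming are made up for
illustration, and the sketch assumes that each pending core sits exactly one
directory below the devcoredump directory.

.. code-block:: sh

   # Hypothetical helper: copy every pending devcoredump, then free it.
   #   $1: devcoredump sysfs directory (defaults to the path given above)
   #   $2: directory to store the copies in (defaults to the current directory)
   save_devcores() {
       sysfs_dir=${1:-/sys/devices/virtual/devcoredump}
       out_dir=${2:-.}
       n=0
       for data in "$sysfs_dir"/*/data; do
           [ -e "$data" ] || continue      # glob did not match: nothing pending
           cp "$data" "$out_dir/crash$n.devcore"
           echo 1 > "$data"                # free the core so a new one can be taken
           n=$((n + 1))
       done
       echo "saved $n core(s)"
   }

Running ``save_devcores`` without arguments (as root) would use the default
sysfs path. Passing explicit directories makes it possible to try the helper
against a scratch directory first.
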
Once you have your core file, you can use :command:`crashdec -f crash.devcore`
to decode it. The output will have ``ESTIMATED CRASH LOCATION`` where we
estimate the CP to have stopped. Note that it is expected that this will be
some distance past whatever state triggered the fault, given GPU pipelining, and
will often be at some ``CP_REG_TO_MEM`` (which waits on previous WFIs) or
``CP_WAIT_FOR_ME`` (which waits for all register writes to land) or similar
event. You can try running the workload with ``TU_DEBUG=flushall`` or
``FD_MESA_DEBUG=flush`` to try to close in on the failing commands.

You can also find what commands were queued up to each cluster in the
``regs-name: CP_MEMPOOL`` section.

If ``ESTIMATED CRASH LOCATION`` doesn't exist, you can look at ``CP_SQE_STAT``
instead, though this is a last resort and likely won't be helpful.

.. code-block:: text

   indexed-registers:
     - regs-name: CP_SQE_STAT
       dwords: 51
        PC: 00d7                 <-------------
        PKT: CP_LOAD_STATE6_FRAG
        $01: 70348003           $11: 00000000
        $02: 20000000           $12: 00000022

The ``PC`` value is an instruction address in the current firmware.
You would need to disassemble the firmware
(:file:`/lib/firmware/qcom/aXXX_sqe.fw`) via:

.. code-block:: sh

   afuc-disasm -v a650_sqe.fw > a650_sqe.fw.disasm

Now you should search for the PC value in the disassembly, e.g.:

.. code-block:: text

   l018: 00d1: 08dd0001  add $addr, $06, 0x0001
         00d2: 981ff806  mov $data, $data
         00d3: 8a080001  mov $08, 0x0001 << 16
         00d4: 3108ffff  or $08, $08, 0xffff
         00d5: 9be8f805  and $data, $data, $08
         00d6: 9806e806  mov $addr, $06
         00d7: 9803f806  mov $data, $03     <------------- HERE
         00d8: d8000000  waitin
         00d9: 981f0806  mov $01, $data

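
The search for the PC value can be scripted with ``grep``. This is only a
sketch: ``show_pc`` is a made-up name, and it assumes the disassembly uses the
``PC: opcode  instruction`` line layout shown above, optionally preceded by a
label.

.. code-block:: sh

   # Hypothetical helper: show the disassembly around a firmware PC.
   #   $1: disassembly file produced by afuc-disasm
   #   $2: PC as printed in CP_SQE_STAT (four hex digits, e.g. 00d7)
   show_pc() {
       # Match the PC column only: it follows start-of-line or whitespace
       # and is terminated by a colon. Print 6 lines before and 2 after.
       grep -E -B 6 -A 2 "(^|[[:space:]])$2:" "$1"
   }

For the dump above this would be invoked as
``show_pc a650_sqe.fw.disasm 00d7``.
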
Command Stream Capture
^^^^^^^^^^^^^^^^^^^^^^

During Mesa development, it's often useful to look at the command streams we
send to the kernel. We have an interface for the kernel to capture all
submitted command streams:

.. code-block:: sh

   cat /sys/kernel/debug/dri/0/rd > cmdstream &

By default, command stream capture does not capture texture/vertex/etc. data.
You can enable capturing all the BOs with:

.. code-block:: sh

   echo Y > /sys/module/msm/parameters/rd_full

Note that, since all command streams get captured, it is easy to run the system
out of memory doing this, so you probably don't want to enable it during play of
a heavyweight game. Instead, to capture a command stream within a game, you
probably want to cause a crash in the GPU during a frame of interest so that a
single GPU core dump is generated. Emitting ``0xdeadbeef`` in the CS should be
enough to cause a fault.

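
For short workloads, the kernel capture shown above can be limited to the
lifetime of a single command. The following is a minimal sketch, not an
existing tool: ``capture_rd`` is a made-up name, and stopping the reader right
after the workload exits may cut off data that the kernel has not handed over
yet.

.. code-block:: sh

   # Hypothetical helper: capture command streams while a command runs.
   #   $1: rd node, e.g. /sys/kernel/debug/dri/0/rd
   #   $2: output file for the capture
   #   $3...: workload command and its arguments
   capture_rd() {
       rd_node=$1
       out=$2
       shift 2
       cat "$rd_node" > "$out" &           # start reading before the workload
       cat_pid=$!
       status=0
       "$@" || status=$?                   # run the workload, keep its exit code
       kill "$cat_pid" 2>/dev/null || true # stop the reader once it is done
       wait "$cat_pid" 2>/dev/null || true
       return "$status"
   }

A call would look like
``capture_rd /sys/kernel/debug/dri/0/rd cmdstream ./app``, run with enough
privileges to read debugfs.
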
``fd_rd_output`` facilities provide support for generating the command stream
capture from inside Mesa. Different ``FD_RD_DUMP`` options are available:

- ``enable`` simply enables dumping the command stream on each submit for a
  given logical device. When a more advanced option is specified, ``enable`` is
  implied.
- ``combine`` will combine all dumps into a single file instead of writing the
  dump for each submit into a standalone file.
- ``full`` will dump every buffer object, which is necessary for replays of
  command streams (see below).
- ``trigger`` will establish a trigger file through which dumps can be better
  controlled. Writing a positive integer value into the file will enable dumping
  of that many subsequent submits. Writing -1 will enable dumping of submits
  until disabled. Writing 0 (or any other value) will disable dumps.

Output dump files and trigger file (when enabled) are hard-coded to be placed
under ``/tmp``, or ``/data/local/tmp`` under Android. ``FD_RD_DUMP_TESTNAME`` can
be used to specify a more descriptive prefix for the output or trigger files.

Functionality is generic to any Freedreno-based backend, but is currently only
integrated in the MSM backend of Turnip. Using the existing ``TU_DEBUG=rd``
option will translate to ``FD_RD_DUMP=enable``.

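
The trigger file semantics described above can be wrapped in a small helper.
This is a sketch under assumptions: ``rd_trigger`` is a made-up name, the
trigger file path has to be looked up on the target because its exact name is
not given here, and the commented launch line assumes that ``FD_RD_DUMP``
accepts a comma-separated list of options.

.. code-block:: sh

   # Assumed launch line (comma-separated option list is an assumption):
   #   FD_RD_DUMP=combine,trigger FD_RD_DUMP_TESTNAME=my-test ./app

   # Hypothetical helper: write a dump request to the trigger file.
   #   $1: path of the trigger file
   #   $2: number of submits to dump, -1 for "until disabled", 0 to disable
   rd_trigger() {
       case $2 in
           -1) ;;                          # dump until disabled
           ''|*[!0-9]*)                    # reject anything that is not a count
               echo "rd_trigger: expected -1 or a non-negative integer" >&2
               return 1 ;;
       esac
       printf '%s\n' "$2" > "$1"
   }

Rejecting other values in the helper is a design choice: the driver treats
them as "disable", which is easy to trigger by accident with a typo.
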
Capturing Hang RD
+++++++++++++++++

A devcore file doesn't contain all submitted command streams, only the hanging one.
Additionally, it is geared towards analyzing the GPU state at the moment of the crash.

Alternatively, it's possible to obtain the whole submission with all command
streams via ``/sys/kernel/debug/dri/0/hangrd``:

.. code-block:: sh

   # Do the cat _before_ the expected hang
   sudo cat /sys/kernel/debug/dri/0/hangrd > logfile.rd

The format of hangrd is the same as in ordinary command stream capture.
``rd_full`` also has the same effect on it.

Replaying Command Stream
^^^^^^^^^^^^^^^^^^^^^^^^

The ``replay`` tool allows capturing and replaying ``rd`` to reproduce GPU faults.
It is especially useful for transient GPU issues, since it has a much higher
chance of reproducing them.

Dumping rendering results or even just memory is currently unsupported.

- Replaying command streams requires a kernel with ``MSM_INFO_SET_IOVA`` support.
- Requires the ``rd`` capture to have full snapshots of the memory (``rd_full`` is enabled).

Replaying is done via the ``replay`` tool:

.. code-block:: sh

   ./replay test_replay.rd

More examples:

.. code-block:: sh

   ./replay --first=start_submit_n --last=last_submit_n test_replay.rd

.. code-block:: sh

   ./replay --override=0 test_replay.rd

Editing Command Stream (a6xx+)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While replaying a fault is useful in itself, modifying the capture to
understand what causes the fault could be even more useful.

``rddecompiler`` decompiles a single cmdstream from ``rd`` into compilable C source.
Given the address space bounds, the generated program creates a new ``rd`` which
can be used to override the cmdstream with ``replay``. The generated ``rd`` is not
replayable on its own and depends on buffers provided by the source ``rd``.

The C source can be compiled by putting it into
:file:`src/freedreno/decode/generate-rd.cc`.

The workflow would look like this:

1. Find the cmdstream № you want to edit;
2. Decompile it:

   .. code-block:: sh

      ./rddecompiler -s %cmd_stream_n% example.rd > src/freedreno/decode/generate-rd.cc

3. Edit the command stream;
4. Compile and deploy freedreno tools;
5. Plug the generator into cmdstream replay:

   .. code-block:: sh

      ./replay --override=%cmd_stream_№%

6. Repeat 3-5.

GPU Hang Debugging
^^^^^^^^^^^^^^^^^^

Not a guide for how to do it but mostly an enumeration of methods.

Useful ``TU_DEBUG`` (for Turnip) options to narrow down the hang cause:

``sysmem``, ``gmem``, ``nobin``, ``forcebin``, ``noubwc``, ``nolrz``, ``flushall``, ``syncdraw``, ``rast_order``

Useful ``FD_MESA_DEBUG`` (for Freedreno) options:

``sysmem``, ``gmem``, ``nobin``, ``noubwc``, ``nolrz``, ``notile``, ``dclear``, ``ddraw``, ``flush``, ``inorder``, ``noblit``

Useful ``IR3_SHADER_DEBUG`` options:

``nouboopt``, ``spillall``, ``nopreamble``, ``nofp16``

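
One way to work through the option lists above is to run a reproducer once
per option and note which options make the hang go away. The helper below is
a sketch: ``sweep_debug_opts`` is a made-up name, and it assumes the
reproducer is an external command that exits non-zero when the hang occurs.

.. code-block:: sh

   # Hypothetical helper: try debug options one at a time.
   #   $1: environment variable to set (e.g. TU_DEBUG)
   #   $2: space-separated list of options, one per run
   #   $3...: reproducer command; must exit non-zero when the hang occurs
   sweep_debug_opts() {
       var=$1
       opts=$2
       shift 2
       for opt in $opts; do
           if env "$var=$opt" "$@" >/dev/null 2>&1; then
               echo "$opt: ok"
           else
               echo "$opt: still failing"
           fi
       done
   }

For example, ``sweep_debug_opts TU_DEBUG "sysmem nolrz noubwc" ./app`` runs
``./app`` three times, each with a single option set.
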
Use Graphics Flight Recorder to narrow down the place which hangs;
use our own breadcrumbs implementation in case of unrecoverable hangs.

In case of faults, use RenderDoc to find the problematic command. If it's
a draw call, edit the shader in RenderDoc to find whether the culprit is a shader.
If yes, bisect it.

If editing the shader changes the assembly too much and the issue becomes
unreproducible, try editing the assembly itself via ``IR3_SHADER_OVERRIDE_PATH``.

If the fault or hang is transient, try capturing an ``rd`` and replaying it. If
the issue is reproduced, bisect the GPU packets until the culprit is found.

Do the same if the culprit is not a shader.

The hang recovery mechanism in the kernel is not perfect. In case of
unrecoverable hangs, check whether the kernel is up to date and look for
unmerged patches which could improve the recovery.

GPU Breadcrumbs
+++++++++++++++

Breadcrumbs described below are available only in Turnip.

Freedreno has simpler breadcrumbs: in debug builds it writes breadcrumbs
into ``CP_SCRATCH_REG[6]`` and per-tile breadcrumbs into ``CP_SCRATCH_REG[7]``,
so that they are available in the devcoredump. TODO: generalize Turnip's
breadcrumbs implementation.

This is a simple implementation of breadcrumbs tracking of GPU progress
intended to be a last resort when debugging unrecoverable hangs.
For best results use Vulkan traces to have a predictable place of hang.

For ordinary hangs, GFR ("Graphics Flight Recorder") is a more user-friendly
solution.

Our breadcrumbs implementation aims to handle cases where nothing can be done
after the hang. In-driver breadcrumbs also allow more precise tracking since
we could target a single GPU packet.

While breadcrumbs support gmem, try to reproduce the hang in sysmem mode
because it requires far fewer breadcrumb writes and syncs.

Breadcrumbs settings:

.. code-block:: sh

   TU_BREADCRUMBS=%IP%:%PORT%,break=%BREAKPOINT%:%BREAKPOINT_HITS%

``BREAKPOINT``
   The breadcrumb starting from which we require explicit ack.
``BREAKPOINT_HITS``
   How many times the breakpoint should be reached for the break to occur.
   Necessary for gmem mode and re-usable cmdbuffers, in both of which
   the same cmdstream could be executed several times.

A typical workflow would be:

- Start listening for breadcrumbs on a remote host:

  .. code-block:: sh

     nc -lvup $PORT | stdbuf -o0 xxd -pc -c 4 | awk -Wposix '{printf("%u:%u\n", "0x" $0, a[$0]++)}'

- Start capturing command stream;
- Replay the hanging trace with:

  .. code-block:: sh

     TU_BREADCRUMBS=$IP:$PORT,break=-1:0

- Increase hangcheck period:

  .. code-block:: sh

     echo -n 60000 > /sys/kernel/debug/dri/0/hangcheck_period_ms

- After the GPU hang, note the last breadcrumb and relaunch the trace with:

  .. code-block:: sh

     TU_BREADCRUMBS=%IP%:%PORT%,break=%LAST_BREADCRUMB%:%HITS%

- After the breakpoint is reached, each breadcrumb requires an
  explicit ack from the user. This way it's possible to find
  the last packet which didn't hang.

- Find the packet in the decoded cmdstream.

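
To make the listener pipeline in the workflow above easier to follow, here is
a sketch of its decode stage on its own. It assumes the input is one 32-bit
big-endian breadcrumb per line in hex, which is what ``xxd -p -c 4`` produces.
``decode_breadcrumbs`` is a made-up name.

.. code-block:: sh

   # Sketch of the decode stage: print "breadcrumb:hits" for each hex word,
   # where hits is how many times that breadcrumb was seen before.
   decode_breadcrumbs() {
       while read -r hex; do
           [ -n "$hex" ] || continue       # skip empty lines
           echo $((0x$hex))                # hex word to decimal
       done | awk '{ printf("%u:%u\n", $1, seen[$1]++) }'
   }

   printf '0000002a\n0000002b\n0000002a\n' | decode_breadcrumbs
   # prints:
   #   42:0
   #   43:0
   #   42:1

The ``breadcrumb:hits`` pairs have the same shape as the ``break=`` argument,
so the last pair printed before a hang appears to be what
``%LAST_BREADCRUMB%:%HITS%`` expects.
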
Debugging random failures
^^^^^^^^^^^^^^^^^^^^^^^^^

In most cases random GPU faults and rendering artifacts are caused by some kind
of undefined behaviour that falls under the following categories:

- Usage of a stale reg value;
- Usage of stale memory (e.g. expecting it to be zeroed when it is not);
- Lack of proper synchronization.

Finding instances of stale reg reads
++++++++++++++++++++++++++++++++++++

Turnip has a debug option to stomp the registers with invalid values to catch
the cases where stale data is read.

.. code-block:: sh

   MESA_VK_ABORT_ON_DEVICE_LOSS=1 \
   TU_DEBUG_STALE_REGS_RANGE=0x00000c00,0x0000be01 \
   TU_DEBUG_STALE_REGS_FLAGS=cmdbuf,renderpass \
   ./app

.. envvar:: TU_DEBUG_STALE_REGS_RANGE

   The reg range in which registers would be stomped. Add ``inverse`` to the
   flags in order for this range to specify which registers NOT to stomp.

.. envvar:: TU_DEBUG_STALE_REGS_FLAGS

   ``cmdbuf``
      stomp registers at the start of each command buffer.
   ``renderpass``
      stomp registers before each renderpass.
   ``inverse``
      changes ``TU_DEBUG_STALE_REGS_RANGE`` meaning to
      "regs that should NOT be stomped".

The best way to pinpoint the reg which causes a failure is to bisect the regs
range. When a failure is caused by a combination of several registers,
the ``inverse`` flag may be set to find the reg which prevents the failure.

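
The bisection of the regs range can be automated along the following lines.
This is a sketch under assumptions: ``bisect_stale_reg`` is a made-up name,
the range is treated as inclusive at both ends, exactly one register in the
initial range is assumed to be responsible, and the test command is assumed
to exit non-zero when the failure reproduces (for example through
``MESA_VK_ABORT_ON_DEVICE_LOSS=1``).

.. code-block:: sh

   # Hypothetical helper: binary-search the stomped register range.
   #   $1, $2: first and last register of the range to search
   #   $3...: test command; must exit non-zero when the failure reproduces
   # The caller is expected to export TU_DEBUG_STALE_REGS_FLAGS beforehand.
   bisect_stale_reg() {
       lo=$(($1))
       hi=$(($2))
       shift 2
       while [ "$lo" -lt "$hi" ]; do
           mid=$(( (lo + hi) / 2 ))
           range=$(printf '0x%08x,0x%08x' "$lo" "$mid")
           if TU_DEBUG_STALE_REGS_RANGE=$range "$@"; then
               lo=$((mid + 1))     # lower half passed: culprit is above mid
           else
               hi=$mid             # lower half failed: culprit is in [lo, mid]
           fi
       done
       printf '0x%08x\n' "$lo"
   }

With the range from the example above, the call would be
``bisect_stale_reg 0x00000c00 0x0000be01 ./app``, and the register it prints
is the candidate to inspect.
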