mirror of
https://gitlab.freedesktop.org/mesa/mesa.git
synced 2024-12-14 04:34:23 +08:00
ae74d486ad
Use the hawkmoth c:auto* directives to incorporate isl documentation. Convert @param style parameter descriptions to rst info field lists. Add static stubs for generated headers. Fix a lot of references, in particular the symbols are now in the Sphinx C domain, not C++ domain. Tweak syntax here and there. Based on the earlier work by Erik Faye-Lund <kusmabite@gmail.com> Reviewed-by: Erik Faye-Lund <erik.faye-lund@collabora.com> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24507>
324 lines
17 KiB
ReStructuredText
324 lines
17 KiB
ReStructuredText
Tiling
|
|
======
|
|
|
|
The naive view of an image in memory is that the pixels are stored one after
|
|
another in memory usually in an X-major order. An image that is arranged in
|
|
this way is called "linear". Linear images, while easy to reason about, can
|
|
have very bad cache locality. Graphics operations tend to act on pixels that
|
|
are close together in 2-D euclidean space. If you move one pixel to the right
|
|
or left in a linear image, you only move a few bytes to one side or the other
|
|
in memory. However, if you move one pixel up or down you can end up kilobytes
|
|
or even megabytes away.
|
|
|
|
Tiling (sometimes referred to as swizzling) is a method of re-arranging the
|
|
pixels of a surface so that pixels which are close in 2-D euclidean space are
|
|
likely to be close in memory.
|
|
|
|
Basics
|
|
------
|
|
|
|
The basic idea of a tiled image is that the image is first divided into
|
|
two-dimensional blocks or tiles. Each tile takes up a chunk of contiguous
|
|
memory and the tiles are arranged like pixels in linear surface. This is best
|
|
demonstrated with a specific example. Suppose we have a RGBA8888 X-tiled
|
|
surface on Intel graphics. Then the surface is divided into 128x8 pixel tiles
|
|
each of which is 4KB of memory. Within each tile, the pixels are laid out like
|
|
a 128x8 linear image. The tiles themselves are laid out row-major in memory
|
|
like giant pixels. This means that, as long as you don't leave your 128x8
|
|
tile, you can move in both dimensions without leaving the same 4K page in
|
|
memory.
|
|
|
|
.. image:: tiling-basic.svg
|
|
:alt: Example of an X-tiled image
|
|
|
|
You can, however do even better than this. Suppose that same image is,
|
|
instead, Y-tiled. Then the surface is divided into 32x32 pixel tiles each of
|
|
which is 4KB of memory. Within a tile, each 64B cache line corresponds to 4x4
|
|
pixel region of the image (you can think of it as a tile within a tile). This
|
|
means that very small deviations don't even leave the cache line. This added
|
|
bit of pixel shuffling is known to have a substantial performance impact in
|
|
most real-world applications.
|
|
|
|
Intel graphics has several different tiling formats that we'll discuss in
|
|
detail in later sections. The most commonly used as of the writing of this
|
|
chapter is Y-tiling. In all tiling formats the basic principal is the same:
|
|
The image is divided into tiles of a particular size and, within those tiles,
|
|
the data is re-arranged (or swizzled) based on a particular pattern. A tile
|
|
size will always be specified in bytes by rows and the actual X-dimension of
|
|
the tile in elements depends on the size of the element in bytes.
|
|
|
|
Bit-6 Swizzling
|
|
^^^^^^^^^^^^^^^
|
|
|
|
On some older hardware, there is an additional address swizzle that is applied
|
|
on top of the tiling format. This has been removed starting with Broadwell
|
|
because, as it says in the Broadwell PRM Vol 5 "Tiling Algorithm" (p. 17):
|
|
|
|
Address Swizzling for Tiled-Surfaces is no longer used because the main
|
|
memory controller has a more effective address swizzling algorithm.
|
|
|
|
Whether or not swizzling is enabled depends on the memory configuration of the
|
|
system. Generally, systems with dual-channel RAM have swizzling enabled and
|
|
single-channel do not. Supposedly, this swizzling allows for better balancing
|
|
between the two memory channels and increases performance. Because it depends
|
|
on the memory configuration which may change from one boot to the next, it
|
|
requires a run-time check.
|
|
|
|
The best documentation for bit-6 swizzling can be found in the Haswell PRM Vol.
|
|
5 "Memory Views" in the section entitled "Address Swizzling for Tiled-Y
|
|
Surfaces". It exists on older platforms but the docs get progressively worse
|
|
the further you go back.
|
|
|
|
ISL Representation
|
|
------------------
|
|
|
|
The structure of any given tiling format is represented by ISL using the
|
|
:c:enum:`isl_tiling` enum and the :c:struct:`isl_tile_info` structure:
|
|
|
|
.. c:autoenum:: isl_tiling
|
|
:file: src/intel/isl/isl.h
|
|
:members:
|
|
|
|
.. c:autofunction:: isl_tiling_get_info
|
|
:file: src/intel/isl/isl.c
|
|
|
|
.. c:autostruct:: isl_tile_info
|
|
:members:
|
|
|
|
The ``isl_tile_info`` structure has two different sizes for a tile: a logical
|
|
size in surface elements and a physical size in bytes. In order to determine
|
|
the proper logical size, the bits-per-block of the underlying format has to be
|
|
passed into ``isl_tiling_get_info``. The proper way to compute the size of an
|
|
image in bytes given a width and height in elements is as follows:
|
|
|
|
.. code-block:: c
|
|
|
|
uint32_t width_tl = DIV_ROUND_UP(width_el * (format_bpb / tile_info.format_bpb),
|
|
tile_info.logical_extent_el.w);
|
|
uint32_t height_tl = DIV_ROUND_UP(height_el, tile_info.logical_extent_el.h);
|
|
uint32_t row_pitch = width_tl * tile_info.phys_extent_el.w;
|
|
uint32_t size = height_tl * tile_info.phys_extent_el.h * row_pitch;
|
|
|
|
It is very important to note that there is no direct conversion between
|
|
:c:member:`isl_tile_info.logical_extent_el` and
|
|
:c:member:`isl_tile_info.phys_extent_B`. It is tempting to assume that the
|
|
logical and physical heights are the same and simply divide the width of
|
|
:c:member:`isl_tile_info.phys_extent_B` by the size of the format (which is
|
|
what the PRM does) to get :c:member:`isl_tile_info.logical_extent_el` but
|
|
this is not at all correct. Some tiling formats have logical and physical
|
|
heights that differ and so no such calculation will work in general. The
|
|
easiest case study for this is W-tiling. From the Sky Lake PRM Vol. 2d,
|
|
"RENDER_SURFACE_STATE" (p. 427):
|
|
|
|
If the surface is a stencil buffer (and thus has Tile Mode set to
|
|
TILEMODE_WMAJOR), the pitch must be set to 2x the value computed based on
|
|
width, as the stencil buffer is stored with two rows interleaved.
|
|
|
|
What does this mean? Why are we multiplying the pitch by two? What does it
|
|
mean that "the stencil buffer is stored with two rows interleaved"? The
|
|
explanation for all these questions is that a W-tile (which is only used for
|
|
stencil) has a logical size of 64el x 64el but a physical size of 128B
|
|
x 32rows. In memory, a W-tile has the same footprint as a Y-tile (128B
|
|
x 32rows) but every pair of rows in the stencil buffer is interleaved into
|
|
a single row of bytes yielding a two-dimensional area of 64el x 64el. You can
|
|
consider this as its own tiling format or as a modification of Y-tiling. The
|
|
interpretation in the PRMs vary by hardware generation; on Sandy Bridge they
|
|
simply said it was Y-tiled but by Sky Lake there is almost no mention of
|
|
Y-tiling in connection with stencil buffers and they are always W-tiled. This
|
|
mismatch between logical and physical tile sizes are also relevant for
|
|
hierarchical depth buffers as well as single-channel MCS and CCS buffers.
|
|
|
|
X-tiling
|
|
--------
|
|
|
|
The simplest tiling format available on Intel graphics (which has been
|
|
available since gen4) is X-tiling. An X-tile is 512B x 8rows and, within the
|
|
tile, the data is arranged in an X-major linear fashion. You can also look at
|
|
X-tiling as being an 8x8 cache line grid where the cache lines are arranged
|
|
X-major as follows:
|
|
|
|
======= ======= ======= ======= ======= ======= ======= =======
|
|
`0x000` `0x040` `0x080` `0x0c0` `0x100` `0x140` `0x180` `0x1c0`
|
|
`0x200` `0x240` `0x280` `0x2c0` `0x300` `0x340` `0x380` `0x3c0`
|
|
`0x400` `0x440` `0x480` `0x4c0` `0x500` `0x540` `0x580` `0x5c0`
|
|
`0x600` `0x640` `0x680` `0x6c0` `0x700` `0x740` `0x780` `0x7c0`
|
|
`0x800` `0x840` `0x880` `0x8c0` `0x900` `0x940` `0x980` `0x9c0`
|
|
`0xa00` `0xa40` `0xa80` `0xac0` `0xb00` `0xb40` `0xb80` `0xbc0`
|
|
`0xc00` `0xc40` `0xc80` `0xcc0` `0xd00` `0xd40` `0xd80` `0xdc0`
|
|
`0xe00` `0xe40` `0xe80` `0xec0` `0xf00` `0xf40` `0xf80` `0xfc0`
|
|
======= ======= ======= ======= ======= ======= ======= =======
|
|
|
|
Each cache line represents a piece of a single row of pixels within the image.
|
|
The memory locations of two vertically adjacent pixels within the same X-tile
|
|
always differs by 512B or 8 cache lines.
|
|
|
|
As mentioned above, X-tiling is slower than Y-tiling (though still faster than
|
|
linear). However, until Sky Lake, the display scan-out hardware could only do
|
|
X-tiling so we have historically used X-tiling for all window-system buffers
|
|
(because X or a Wayland compositor may want to put it in a plane).
|
|
|
|
Bit-6 Swizzling
|
|
^^^^^^^^^^^^^^^
|
|
|
|
When bit-6 swizzling is enabled, bits 9 and 10 are XORed in with bit 6 of the
|
|
tiled address:
|
|
|
|
.. code-block:: c
|
|
|
|
addr[6] ^= addr[9] ^ addr[10];
|
|
|
|
Y-tiling
|
|
--------
|
|
|
|
The Y-tiling format, also available since gen4, is substantially different from
|
|
X-tiling and performs much better in practice. Each Y-tile is an 8x8 grid of cache lines arranged Y-major as follows:
|
|
|
|
======= ======= ======= ======= ======= ======= ======= =======
|
|
`0x000` `0x200` `0x400` `0x600` `0x800` `0xa00` `0xc00` `0xe00`
|
|
`0x040` `0x240` `0x440` `0x640` `0x840` `0xa40` `0xc40` `0xe40`
|
|
`0x080` `0x280` `0x480` `0x680` `0x880` `0xa80` `0xc80` `0xe80`
|
|
`0x0c0` `0x2c0` `0x4c0` `0x6c0` `0x8c0` `0xac0` `0xcc0` `0xec0`
|
|
`0x100` `0x300` `0x500` `0x700` `0x900` `0xb00` `0xd00` `0xf00`
|
|
`0x140` `0x340` `0x540` `0x740` `0x940` `0xb40` `0xd40` `0xf40`
|
|
`0x180` `0x380` `0x580` `0x780` `0x980` `0xb80` `0xd80` `0xf80`
|
|
`0x1c0` `0x3c0` `0x5c0` `0x7c0` `0x9c0` `0xbc0` `0xdc0` `0xfc0`
|
|
======= ======= ======= ======= ======= ======= ======= =======
|
|
|
|
Each 64B cache line within the tile is laid out as 4 rows of 16B each:
|
|
|
|
====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
|
|
`0x00` `0x01` `0x02` `0x03` `0x04` `0x05` `0x06` `0x07` `0x08` `0x09` `0x0a` `0x0b` `0x0c` `0x0d` `0x0e` `0x0f`
|
|
`0x10` `0x11` `0x12` `0x13` `0x14` `0x15` `0x16` `0x17` `0x18` `0x19` `0x1a` `0x1b` `0x1c` `0x1d` `0x1e` `0x1f`
|
|
`0x20` `0x21` `0x22` `0x23` `0x24` `0x25` `0x26` `0x27` `0x28` `0x29` `0x2a` `0x2b` `0x2c` `0x2d` `0x2e` `0x2f`
|
|
`0x30` `0x31` `0x32` `0x33` `0x34` `0x35` `0x36` `0x37` `0x38` `0x39` `0x3a` `0x3b` `0x3c` `0x3d` `0x3e` `0x3f`
|
|
====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
|
|
|
|
Y-tiling is widely regarded as being substantially faster than X-tiling so it
|
|
is generally preferred. However, prior to Sky Lake, Y-tiling was not available
|
|
for scanout so X tiling was used for any sort of window-system buffers.
|
|
Starting with Sky Lake, we can scan out from Y-tiled buffers.
|
|
|
|
Bit-6 Swizzling
|
|
^^^^^^^^^^^^^^^
|
|
|
|
When bit-6 swizzling is enabled, bit 9 is XORed in with bit 6 of the tiled
|
|
address:
|
|
|
|
.. code-block:: c
|
|
|
|
addr[6] ^= addr[9];
|
|
|
|
W-tiling
|
|
--------
|
|
|
|
W-tiling is a new tiling format added on Sandy Bridge for use in stencil
|
|
buffers. W-tiling is similar to Y-tiling in that it's arranged as an 8x8
|
|
Y-major grid of cache lines. The bytes within each cache line are arranged as
|
|
follows:
|
|
|
|
====== ====== ====== ====== ====== ====== ====== ======
|
|
`0x00` `0x01` `0x04` `0x05` `0x10` `0x11` `0x14` `0x15`
|
|
`0x02` `0x03` `0x06` `0x07` `0x12` `0x13` `0x16` `0x17`
|
|
`0x08` `0x09` `0x0c` `0x0d` `0x18` `0x19` `0x1c` `0x1d`
|
|
`0x0a` `0x0b` `0x0e` `0x0f` `0x1a` `0x1b` `0x1e` `0x1f`
|
|
`0x20` `0x21` `0x24` `0x25` `0x30` `0x31` `0x34` `0x35`
|
|
`0x22` `0x23` `0x26` `0x27` `0x32` `0x33` `0x36` `0x37`
|
|
`0x28` `0x29` `0x2c` `0x2d` `0x38` `0x39` `0x3c` `0x3d`
|
|
`0x2a` `0x2b` `0x2e` `0x2f` `0x3a` `0x3b` `0x3e` `0x3f`
|
|
====== ====== ====== ====== ====== ====== ====== ======
|
|
|
|
While W-tiling has been required for stencil all the way back to Sandy Bridge,
|
|
the docs are somewhat confused as to whether stencil buffers are W or Y-tiled.
|
|
This seems to stem from the fact that the hardware seems to implement W-tiling
|
|
as a sort of modified Y-tiling. One example of this is the somewhat odd
|
|
requirement that W-tiled buffers have their pitch multiplied by 2. From the
|
|
Sky Lake PRM Vol. 2d, "RENDER_SURFACE_STATE" (p. 427):
|
|
|
|
If the surface is a stencil buffer (and thus has Tile Mode set to
|
|
TILEMODE_WMAJOR), the pitch must be set to 2x the value computed based on
|
|
width, as the stencil buffer is stored with two rows interleaved.
|
|
|
|
The last phrase holds the key here: "the stencil buffer is stored with two rows
|
|
interleaved". More accurately, a W-tiled buffer can be viewed as a Y-tiled
|
|
buffer with each set of 4 W-tiled lines interleaved to form 2 Y-tiled lines. In
|
|
ISL, we represent a W-tile as a tiling with a logical dimension of 64el x 64el
|
|
but a physical size of 128B x 32rows. This cleanly takes care of the pitch
|
|
issue above and seems to nicely model the hardware.
|
|
|
|
Tile4
|
|
-----
|
|
|
|
The tile4 format, introduced on Xe-HP, is somewhat similar to Y but with more
|
|
internal shuffling. Each tile4 tile is an 8x8 grid of cache lines arranged
|
|
as follows:
|
|
|
|
======= ======= ======= ======= ======= ======= ======= =======
|
|
`0x000` `0x040` `0x080` `0x0a0` `0x200` `0x240` `0x280` `0x2a0`
|
|
`0x100` `0x140` `0x180` `0x1a0` `0x300` `0x340` `0x380` `0x3a0`
|
|
`0x400` `0x440` `0x480` `0x4a0` `0x600` `0x640` `0x680` `0x6a0`
|
|
`0x500` `0x540` `0x580` `0x5a0` `0x700` `0x740` `0x780` `0x7a0`
|
|
`0x800` `0x840` `0x880` `0x8a0` `0xa00` `0xa40` `0xa80` `0xaa0`
|
|
`0x900` `0x940` `0x980` `0x9a0` `0xb00` `0xb40` `0xb80` `0xba0`
|
|
`0xc00` `0xc40` `0xc80` `0xca0` `0xe00` `0xe40` `0xe80` `0xea0`
|
|
`0xd00` `0xd40` `0xd80` `0xda0` `0xf00` `0xf40` `0xf80` `0xfa0`
|
|
======= ======= ======= ======= ======= ======= ======= =======
|
|
|
|
Each 64B cache line within the tile is laid out the same way as for a Y-tile,
|
|
as 4 rows of 16B each:
|
|
|
|
====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
|
|
`0x00` `0x01` `0x02` `0x03` `0x04` `0x05` `0x06` `0x07` `0x08` `0x09` `0x0a` `0x0b` `0x0c` `0x0d` `0x0e` `0x0f`
|
|
`0x10` `0x11` `0x12` `0x13` `0x14` `0x15` `0x16` `0x17` `0x18` `0x19` `0x1a` `0x1b` `0x1c` `0x1d` `0x1e` `0x1f`
|
|
`0x20` `0x21` `0x22` `0x23` `0x24` `0x25` `0x26` `0x27` `0x28` `0x29` `0x2a` `0x2b` `0x2c` `0x2d` `0x2e` `0x2f`
|
|
`0x30` `0x31` `0x32` `0x33` `0x34` `0x35` `0x36` `0x37` `0x38` `0x39` `0x3a` `0x3b` `0x3c` `0x3d` `0x3e` `0x3f`
|
|
====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
|
|
|
|
Tiling as a bit pattern
|
|
-----------------------
|
|
|
|
There is one more important angle on tiling that should be discussed before we
|
|
finish. Every tiling can be described by three things:
|
|
|
|
1. A logical width and height in elements
|
|
2. A physical width in bytes and height in rows
|
|
3. A mapping from logical elements to physical bytes within the tile
|
|
|
|
We have spent a good deal of time on the first two because this is what you
|
|
really need for doing surface layout calculations. However, there are cases in
|
|
which the map from logical to physical elements is critical. One example is
|
|
W-tiling where we have code to do W-tiled encoding and decoding in the shader
|
|
for doing stencil blits because the hardware does not allow us to render to
|
|
W-tiled surfaces.
|
|
|
|
There are many ways to mathematically describe the mapping from logical
|
|
elements to physical bytes. In the PRMs they give a very complicated set of
|
|
formulas involving lots of multiplication, modulus, and sums that show you how
|
|
to compute the mapping. With a little creativity, you can easily reduce those
|
|
to a set of bit shifts and ORs. By far the simplest formulation, however, is
|
|
as a mapping from the bits of the texture coordinates to bits in the address.
|
|
Suppose that :math:`(u, v)` is location of a 1-byte element within a tile. If
|
|
you represent :math:`u` as :math:`u_n u_{n-1} \cdots u_2 u_1 u_0` where
|
|
:math:`u_0` is the LSB and :math:`u_n` is the MSB of :math:`u` and similarly
|
|
:math:`v = v_m v_{m-1} \cdots v_2 v_1 v_0`, then the bits of the address within
|
|
the tile are given by the table below:
|
|
|
|
=========================================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
|
|
Tiling 11 10 9 8 7 6 5 4 3 2 1 0
|
|
=========================================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
|
|
:c:enumerator:`isl_tiling.ISL_TILING_X` :math:`v_2` :math:`v_1` :math:`v_0` :math:`u_8` :math:`u_7` :math:`u_6` :math:`u_5` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0`
|
|
:c:enumerator:`isl_tiling.ISL_TILING_Y0` :math:`u_6` :math:`u_5` :math:`u_4` :math:`v_4` :math:`v_3` :math:`v_2` :math:`v_1` :math:`v_0` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0`
|
|
:c:enumerator:`isl_tiling.ISL_TILING_W` :math:`u_5` :math:`u_4` :math:`u_3` :math:`v_5` :math:`v_4` :math:`v_3` :math:`v_2` :math:`u_2` :math:`v_1` :math:`u_1` :math:`v_0` :math:`u_0`
|
|
:c:enumerator:`isl_tiling.ISL_TILING_4` :math:`v_4` :math:`v_3` :math:`u_6` :math:`v_2` :math:`u_5` :math:`u_4` :math:`v_1` :math:`v_0` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0`
|
|
=========================================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
|
|
|
|
Constructing the mapping this way makes a lot of sense when you think about
|
|
hardware. It may seem complex on paper but "simple" things such as addition
|
|
are relatively expensive in hardware while interleaving bits in a well-defined
|
|
pattern is practically free. For a format that has more than one byte per
|
|
element, you simply chop bits off the bottom of the pattern, hard-code them to
|
|
0, and adjust bit indices as needed. For a 128-bit format, for instance, the
|
|
Y-tiled pattern becomes :math:`u_2 u_1 u_0 v_4 v_3 v_2 v_1 v_0`. The Sky Lake
|
|
PRM Vol. 5 in the section "2D Surfaces" contains an expanded version of the
|
|
above table (which we will not repeat here) that also includes the bit patterns
|
|
for the Ys and Yf tiling formats.
|