From e0752673becc9d6263f1e982289111f6d7aa7c43 Mon Sep 17 00:00:00 2001
From: Alyssa Rosenzweig
Date: Wed, 28 Dec 2022 15:26:45 -0500
Subject: [PATCH] docs/panfrost: Move description of instancing

Connor Abbott wrote a nice explanation of how instance divisors work on
Mali. Let's add it to the driver docs instead of letting it languish in a
forgotten header file. This is mostly pasted from the existing header in
tree, with a few local changes applied.

Signed-off-by: Alyssa Rosenzweig
Part-of:
---
 docs/drivers/panfrost.rst           | 111 +++++++++++++++++++++++++
 src/panfrost/include/panfrost-job.h | 123 ----------------------------
 2 files changed, 111 insertions(+), 123 deletions(-)

diff --git a/docs/drivers/panfrost.rst b/docs/drivers/panfrost.rst
index 0c51b20c001..1adc0d7053b 100644
--- a/docs/drivers/panfrost.rst
+++ b/docs/drivers/panfrost.rst
@@ -175,3 +175,114 @@ Mali-T760 and newer, Arm Framebuffer Compression (AFBC) is more efficient and
 should be used instead where possible. However, not all formats are
 compressible, so u-interleaved tiling remains an important fallback on Panfrost.

+Instancing
+----------
+
+The attribute descriptor lets the attribute unit compute the address of an
+attribute given the vertex and instance ID. Unfortunately, the way this works is
+rather complicated when instancing is enabled.
+
+To explain this, first we need to explain how compute and vertex threads are
+dispatched. When a quad is dispatched, it receives a single, linear index.
+However, we need to translate that index into a (vertex id, instance id) pair.
+One option would be to do:
+
+.. math::
+   \text{vertex id} = \text{linear id} \% \text{num vertices}
+
+   \text{instance id} = \text{linear id} / \text{num vertices}
+
+but this involves a costly division and modulus by an arbitrary number.
+Instead, we could pad num_vertices. We dispatch padded_num_vertices *
+num_instances threads instead of num_vertices * num_instances, which results
+in some "extra" threads with vertex_id >= num_vertices, which we have to
+discard. The more we pad num_vertices, the more "wasted" threads we
+dispatch, but the division is potentially easier.
+
+One straightforward choice is to pad num_vertices to the next power of two,
+which means that the division and modulus are just simple bit shifts and
+masking. But the actual algorithm is a bit more complicated. The thread
+dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
+to dividing by a power of two. As a result, padded_num_vertices can be
+1, 3, 5, 7, or 9 times a power of two. This results in fewer wasted threads,
+since we need less padding.
+
+padded_num_vertices is picked by the hardware. The driver just specifies the
+actual number of vertices. Note that padded_num_vertices is a multiple of four
+(presumably because threads are dispatched in groups of 4). Also,
+padded_num_vertices is always at least one more than num_vertices, which seems
+like a quirk of the hardware. For larger num_vertices, the hardware uses the
+following algorithm: using the binary representation of num_vertices, we look at
+the most significant set bit as well as the following 3 bits. Let n be the
+number of bits after those 4 bits. Then we set padded_num_vertices according to
+the following table:
+
+========== =======================
+high bits  padded_num_vertices
+========== =======================
+1000       :math:`9 \cdot 2^n`
+1001       :math:`5 \cdot 2^{n+1}`
+101x       :math:`3 \cdot 2^{n+2}`
+110x       :math:`7 \cdot 2^{n+1}`
+111x       :math:`2^{n+4}`
+========== =======================
+
+For example, if num_vertices = 70 is passed to glDraw(), its binary
+representation is 1000110, so n = 3 and the high bits are 1000, and
+therefore padded_num_vertices = :math:`9 \cdot 2^3` = 72.
+
+The attribute unit works in terms of the original linear_id. If
+num_instances = 1, then they are the same, and everything is simple.
+However, with instancing things get more complicated. There are four
+possible modes, two of which we can group together:
+
+1. Use the linear_id directly. Only used when there is no instancing.
+
+2. Use the linear_id modulo a constant. This is used for per-vertex
+   attributes with instancing enabled by making the constant equal
+   padded_num_vertices. Because the modulus is always padded_num_vertices, this
+   mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
+   The shift field specifies the power of two, while the extra_flags field
+   specifies the odd number. If shift = n and extra_flags = m, then the modulus
+   is :math:`(2m + 1) \cdot 2^n`. As an example, if num_vertices = 70, then as
+   computed above, padded_num_vertices = :math:`9 \cdot 2^3`, so we should set
+   extra_flags = 4 and shift = 3. Note that we must exactly follow the hardware
+   algorithm used to get padded_num_vertices in order to correctly implement
+   per-vertex attributes.
+
+3. Divide the linear_id by a constant. In order to correctly implement
+   instance divisors, we have to divide linear_id by padded_num_vertices times
+   the user-specified divisor. So first we compute padded_num_vertices, again
+   following the exact same algorithm that the hardware uses, then multiply it
+   by the GL-level divisor to get the hardware-level divisor. This case is
+   further divided into two more cases. If the hardware-level divisor is a
+   power of two, then we just need to shift. The shift amount is specified by
+   the shift field, so that the hardware-level divisor is just 2^shift.
+
+If it isn't a power of two, then we have to divide by an arbitrary integer.
+For that, we use the well-known technique of multiplying by an approximation
+of the inverse. The driver must compute the magic multiplier and shift
+amount, and then the hardware does the multiplication and shift. The
+hardware and driver also use the "round-down" optimization as described in
+http://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
+The hardware further assumes the multiplier is between 2^31 and 2^32, so the
+high bit is implicitly set to 1 even though it is set to 0 by the driver --
+presumably this simplifies the hardware multiplier a little. The hardware
+first multiplies linear_id by the multiplier and takes the high 32 bits,
+then applies the round-down correction if extra_flags = 1, then finally
+shifts right by the shift field.
+
+There are some differences between ridiculousfish's algorithm and the Mali
+hardware algorithm, which means that the reference code from ridiculousfish
+doesn't always produce the right constants. Mali does not use the pre-shift
+optimization, since that would make a hardware implementation slower (it
+would have to always do the pre-shift, multiply, and post-shift operations).
+It also forces the multiplier to be at least 2^31, which means that the
+exponent is entirely fixed, so there is no trial-and-error. Altogether,
+given the divisor d, the algorithm the driver must follow is:
+
+1. Set shift = :math:`\lfloor \log_2(d) \rfloor`.
+2. Compute :math:`m = \lceil 2^{shift + 32} / d \rceil` and :math:`e = 2^{shift + 32} \bmod d`.
+3. If :math:`e \le 2^{shift}`, then we need to use the round-down algorithm. Set
+   magic_divisor = m - 1 and extra_flags = 1.
+4. Otherwise, set magic_divisor = m and extra_flags = 0.
diff --git a/src/panfrost/include/panfrost-job.h b/src/panfrost/include/panfrost-job.h
index efdf26a1fb8..6138ca7fbe6 100644
--- a/src/panfrost/include/panfrost-job.h
+++ b/src/panfrost/include/panfrost-job.h
@@ -42,129 +42,6 @@ typedef uint64_t mali_ptr;
 #define MALI_EXTRACT_TYPE(fmt) ((fmt)&0xe0)
 #define MALI_EXTRACT_INDEX(pixfmt) (((pixfmt) >> 12) & 0xFF)

-/*
- * Mali Attributes
- *
- * This structure lets the attribute unit compute the address of an attribute
- * given the vertex and instance ID. Unfortunately, the way this works is
- * rather complicated when instancing is enabled.
- *
- * To explain this, first we need to explain how compute and vertex threads are
- * dispatched. This is a guess (although a pretty firm guess!) since the
- * details are mostly hidden from the driver, except for attribute instancing.
- * When a quad is dispatched, it receives a single, linear index. However, we
- * need to translate that index into a (vertex id, instance id) pair, or a
- * (local id x, local id y, local id z) triple for compute shaders (although
- * vertex shaders and compute shaders are handled almost identically).
- * Focusing on vertex shaders, one option would be to do:
- *
- * vertex_id = linear_id % num_vertices
- * instance_id = linear_id / num_vertices
- *
- * but this involves a costly division and modulus by an arbitrary number.
- * Instead, we could pad num_vertices. We dispatch padded_num_vertices *
- * num_instances threads instead of num_vertices * num_instances, which results
- * in some "extra" threads with vertex_id >= num_vertices, which we have to
- * discard. The more we pad num_vertices, the more "wasted" threads we
- * dispatch, but the division is potentially easier.
- *
- * One straightforward choice is to pad num_vertices to the next power of two,
- * which means that the division and modulus are just simple bit shifts and
- * masking. But the actual algorithm is a bit more complicated. The thread
- * dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
- * to dividing by a power of two. This is possibly using the technique
- * described in patent US20170010862A1. As a result, padded_num_vertices can be
- * 1, 3, 5, 7, or 9 times a power of two. This results in less wasted threads,
- * since we need less padding.
- *
- * padded_num_vertices is picked by the hardware. The driver just specifies the
- * actual number of vertices. At least for Mali G71, the first few cases are
- * given by:
- *
- * num_vertices | padded_num_vertices
- * 3            | 4
- * 4-7          | 8
- * 8-11         | 12 (3 * 4)
- * 12-15        | 16
- * 16-19        | 20 (5 * 4)
- *
- * Note that padded_num_vertices is a multiple of four (presumably because
- * threads are dispatched in groups of 4). Also, padded_num_vertices is always
- * at least one more than num_vertices, which seems like a quirk of the
- * hardware. For larger num_vertices, the hardware uses the following
- * algorithm: using the binary representation of num_vertices, we look at the
- * most significant set bit as well as the following 3 bits. Let n be the
- * number of bits after those 4 bits. Then we set padded_num_vertices according
- * to the following table:
- *
- * high bits | padded_num_vertices
- * 1000      | 9 * 2^n
- * 1001      | 5 * 2^(n+1)
- * 101x      | 3 * 2^(n+2)
- * 110x      | 7 * 2^(n+1)
- * 111x      | 2^(n+4)
- *
- * For example, if num_vertices = 70 is passed to glDraw(), its binary
- * representation is 1000110, so n = 3 and the high bits are 1000, and
- * therefore padded_num_vertices = 9 * 2^3 = 72.
- *
- * The attribute unit works in terms of the original linear_id. if
- * num_instances = 1, then they are the same, and everything is simple.
- * However, with instancing things get more complicated. There are four
- * possible modes, two of them we can group together:
- *
- * 1. Use the linear_id directly. Only used when there is no instancing.
- *
- * 2. Use the linear_id modulo a constant. This is used for per-vertex
- * attributes with instancing enabled by making the constant equal
- * padded_num_vertices. Because the modulus is always padded_num_vertices, this
- * mode only supports a modulus that is a power of 2 times 1, 3, 5, 7, or 9.
- * The shift field specifies the power of two, while the extra_flags field
- * specifies the odd number. If shift = n and extra_flags = m, then the modulus
- * is (2m + 1) * 2^n. As an example, if num_vertices = 70, then as computed
- * above, padded_num_vertices = 9 * 2^3, so we should set extra_flags = 4 and
- * shift = 3. Note that we must exactly follow the hardware algorithm used to
- * get padded_num_vertices in order to correctly implement per-vertex
- * attributes.
- *
- * 3. Divide the linear_id by a constant. In order to correctly implement
- * instance divisors, we have to divide linear_id by padded_num_vertices times
- * to user-specified divisor. So first we compute padded_num_vertices, again
- * following the exact same algorithm that the hardware uses, then multiply it
- * by the GL-level divisor to get the hardware-level divisor. This case is
- * further divided into two more cases. If the hardware-level divisor is a
- * power of two, then we just need to shift. The shift amount is specified by
- * the shift field, so that the hardware-level divisor is just 2^shift.
- *
- * If it isn't a power of two, then we have to divide by an arbitrary integer.
- * For that, we use the well-known technique of multiplying by an approximation
- * of the inverse. The driver must compute the magic multiplier and shift
- * amount, and then the hardware does the multiplication and shift. The
- * hardware and driver also use the "round-down" optimization as described in
- * http://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
- * The hardware further assumes the multiplier is between 2^31 and 2^32, so the
- * high bit is implicitly set to 1 even though it is set to 0 by the driver --
- * presumably this simplifies the hardware multiplier a little. The hardware
- * first multiplies linear_id by the multiplier and takes the high 32 bits,
- * then applies the round-down correction if extra_flags = 1, then finally
- * shifts right by the shift field.
- *
- * There are some differences between ridiculousfish's algorithm and the Mali
- * hardware algorithm, which means that the reference code from ridiculousfish
- * doesn't always produce the right constants. Mali does not use the pre-shift
- * optimization, since that would make a hardware implementation slower (it
- * would have to always do the pre-shift, multiply, and post-shift operations).
- * It also forces the multplier to be at least 2^31, which means that the
- * exponent is entirely fixed, so there is no trial-and-error. Altogether,
- * given the divisor d, the algorithm the driver must follow is:
- *
- * 1. Set shift = floor(log2(d)).
- * 2. Compute m = ceil(2^(shift + 32) / d) and e = 2^(shift + 32) % d.
- * 3. If e <= 2^shift, then we need to use the round-down algorithm. Set
- * magic_divisor = m - 1 and extra_flags = 1.
- * 4. Otherwise, set magic_divisor = m and extra_flags = 0.
- */
-
 /* Purposeful off-by-one in width, height fields. For example, a (64, 64)
  * texture is stored as (63, 63) in these fields. This adjusts for that.
  * There's an identical pattern in the framebuffer descriptor. Even vertex