From e2eba3c7dafee67cd6e5da5e0458b6d5de441e00 Mon Sep 17 00:00:00 2001
From: Francisco Jerez <currojerez@riseup.net>
Date: Wed, 16 Oct 2024 14:47:06 -0700
Subject: [PATCH] intel/brw/xe2+: Adjust performance analysis divergence weight
 due to EU fusion removal.

This reduces the penalty the heuristic gives to SIMD32 shaders
relative to SIMD16 in presence of discard control flow on Xe2+.  The
penalty was meant to account for the inefficient divergence behavior
of SIMD32 shaders on Gfx12.x platforms, since Gfx12 hardware had EUs
bundled in groups of two, and each pair shared control flow logic so
both EUs could only execute instructions in lockstep, which meant that
SIMD32 shaders had an effective warp size of 64 on Gfx12.x.

This change switches back to more optimistic modelling of discard
divergence.  With it we gain about 6% performance in a Shadow of the
Tomb Raider trace (tested on BMG).

One may wonder if there are still workloads that would suffer
materially from enabling SIMD32 for all pixel shaders on Xe2 instead
of using this heuristic, since Xe2 EUs have twice the GRF space, twice
the FPU throughput and better divergence behavior than Xe, but the
answer seems to be yes unfortunately: E.g. Superposition has some
pixel shaders where SIMD32 has substantially worse scheduling due to
the increased number of false dependencies due to higher register
pressure, and using SIMD32 for them reduces performance significantly.
The heuristic seems to model this correctly so it doesn't look like we
can do without it at least right now on Xe2.

Acked-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/31697>
---
 src/intel/compiler/brw_ir_performance.cpp | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/intel/compiler/brw_ir_performance.cpp b/src/intel/compiler/brw_ir_performance.cpp
index 47bde298114..ffe9bc963e4 100644
--- a/src/intel/compiler/brw_ir_performance.cpp
+++ b/src/intel/compiler/brw_ir_performance.cpp
@@ -1012,13 +1012,15 @@ namespace {
        *       weights used elsewhere in the compiler back-end.
        *
        *       Note that we provide slightly more pessimistic weights on
-       *       Gfx12+ for SIMD32, since the effective warp size on that
+       *       Gfx12.x for SIMD32, since the effective warp size on that
        *       platform is 2x the SIMD width due to EU fusion, which increases
        *       the likelihood of divergent control flow in comparison to
        *       previous generations, giving narrower SIMD modes a performance
        *       advantage in several test-cases with non-uniform discard jumps.
+       *       EU fusion has been removed on Xe2+ so its divergence behavior is
+       *       expected to be closer to pre-Gfx12 platforms.
        */
-      const float discard_weight = (dispatch_width > 16 || s->devinfo->ver < 12 ?
+      const float discard_weight = (dispatch_width > 16 || s->devinfo->ver != 12 ?
                                     1.0 : 0.5);
       const float loop_weight = 10;
       unsigned halt_count = 0;