From e2eba3c7dafee67cd6e5da5e0458b6d5de441e00 Mon Sep 17 00:00:00 2001
From: Francisco Jerez
Date: Wed, 16 Oct 2024 14:47:06 -0700
Subject: [PATCH] intel/brw/xe2+: Adjust performance analysis divergence weight
 due to EU fusion removal.

This reduces the penalty the heuristic gives to SIMD32 shaders
relative to SIMD16 in presence of discard control flow on Xe2+.  The
penalty was meant to account for the inefficient divergence behavior
of SIMD32 shaders on Gfx12.x platforms, since Gfx12 hardware had EUs
bundled in groups of two, and each pair shared control flow logic so
both EUs could only execute instructions in lockstep, which meant that
SIMD32 shaders had an effective warp size of 64 on Gfx12.x.

This change switches back to more optimistic modelling of discard
divergence.  With it we gain about 6% performance in a Shadow of the
Tomb Raider trace (tested on BMG).

One may wonder if there are still workloads that would suffer
materially from enabling SIMD32 for all pixel shaders on Xe2 instead
of using this heuristic, since Xe2 EUs have twice the GRF space, twice
the FPU throughput and better divergence behavior than Xe, but the
answer seems to be yes unfortunately: E.g. Superposition has some
pixel shaders where SIMD32 has substantially worse scheduling due to
the increased number of false dependencies due to higher register
pressure, and using SIMD32 for them reduces performance significantly.
The heuristic seems to model this correctly so it doesn't look like we
can do without it at least right now on Xe2.

Acked-by: Lionel Landwerlin
Part-of:
---
 src/intel/compiler/brw_ir_performance.cpp | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/intel/compiler/brw_ir_performance.cpp b/src/intel/compiler/brw_ir_performance.cpp
index 47bde298114..ffe9bc963e4 100644
--- a/src/intel/compiler/brw_ir_performance.cpp
+++ b/src/intel/compiler/brw_ir_performance.cpp
@@ -1012,13 +1012,15 @@ namespace {
        * weights used elsewhere in the compiler back-end.
        *
        * Note that we provide slightly more pessimistic weights on
-       * Gfx12+ for SIMD32, since the effective warp size on that
+       * Gfx12.x for SIMD32, since the effective warp size on that
        * platform is 2x the SIMD width due to EU fusion, which increases
        * the likelihood of divergent control flow in comparison to
        * previous generations, giving narrower SIMD modes a performance
        * advantage in several test-cases with non-uniform discard jumps.
+       * EU fusion has been removed on Xe2+ so its divergence behavior is
+       * expected to be closer to pre-Gfx12 platforms.
        */
-      const float discard_weight = (dispatch_width > 16 || s->devinfo->ver < 12 ?
+      const float discard_weight = (dispatch_width > 16 || s->devinfo->ver != 12 ?
                                     1.0 : 0.5);
       const float loop_weight = 10;
       unsigned halt_count = 0;
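
Illustration (not part of the patch): the effect of the condition change can be checked outside the compiler. The block below is a minimal standalone sketch, not Mesa code. `discard_weight_before` and `discard_weight_after` are hypothetical helpers that copy the old and new expressions from the hunk, and the plain integer `ver` stands in for `s->devinfo->ver`. The value 20 used for Xe2 is an assumption made for the example; any value other than 12 takes the same branch in the new expression.

```cpp
// Hypothetical helper: the expression on the removed ('-') line of the hunk.
// Every platform with ver >= 12 gets the reduced weight for SIMD16 and below.
constexpr float
discard_weight_before(unsigned dispatch_width, int ver)
{
   return (dispatch_width > 16 || ver < 12) ? 1.0f : 0.5f;
}

// Hypothetical helper: the expression on the added ('+') line of the hunk.
// Only ver == 12 (the platforms with EU fusion) keeps the reduced weight.
constexpr float
discard_weight_after(unsigned dispatch_width, int ver)
{
   return (dispatch_width > 16 || ver != 12) ? 1.0f : 0.5f;
}

// Assumed value for Xe2 in this sketch; see the note above.
constexpr int assumed_xe2_ver = 20;

// Gfx12.x is unaffected: SIMD16 still gets 0.5, SIMD32 still gets 1.0.
static_assert(discard_weight_before(16, 12) == 0.5f, "Gfx12 SIMD16, old");
static_assert(discard_weight_after(16, 12) == 0.5f, "Gfx12 SIMD16, new");
static_assert(discard_weight_after(32, 12) == 1.0f, "Gfx12 SIMD32, new");

// Pre-Gfx12 is unaffected as well: the weight is 1.0 either way.
static_assert(discard_weight_before(16, 11) == 1.0f, "pre-Gfx12, old");
static_assert(discard_weight_after(16, 11) == 1.0f, "pre-Gfx12, new");

// Xe2+ is where the behavior changes: SIMD16 moves from 0.5 to 1.0, so
// SIMD16 and SIMD32 are weighted identically again.
static_assert(discard_weight_before(16, assumed_xe2_ver) == 0.5f, "Xe2, old");
static_assert(discard_weight_after(16, assumed_xe2_ver) == 1.0f, "Xe2, new");
```

The checks are `static_assert`s, so the sketch verifies itself at compile time. It models only the selection of the weight, not how the performance analysis pass applies that weight afterwards.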