Introduce smul_highpart and umul_highpart RTX for high-part multiplications

This patch introduces new RTX codes to allow the RTL passes and
backends to consistently represent high-part multiplications.
Currently, the RTL used by different backends for expanding
smul<mode>3_highpart and umul<mode>3_highpart varies greatly,
with many, but not all, expressing it along the lines of:

(define_insn "smuldi3_highpart"
  [(set (match_operand:DI 0 "nvptx_register_operand" "=R")
       (truncate:DI
        (lshiftrt:TI
         (mult:TI (sign_extend:TI
                   (match_operand:DI 1 "nvptx_register_operand" "R"))
                  (sign_extend:TI
                   (match_operand:DI 2 "nvptx_register_operand" "R")))
         (const_int 64))))]
  ""
  "%.\\tmul.hi.s64\\t%0, %1, %2;")

One complication with using this "widening multiplication" representation
is that it requires an intermediate in a wider mode, making it difficult
or impossible to encode a high-part multiplication of the widest supported
integer mode.  A second is that it can interfere with optimization; for
example simplify-rtx.c contains the comment:

   case TRUNCATE:
      /* Don't optimize (lshiftrt (mult ...)) as it would interfere
         with the umulXi3_highpart patterns.  */

Hopefully these problems are solved (or reduced) by introducing a
new canonical form for high-part multiplications in RTL passes.
This also simplifies insn patterns when one operand is constant.
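
As a rough illustration of the first complication in plain C (a sketch, not part of the patch): a 32-bit high part is easy to write with a 64-bit intermediate, mirroring the truncate/lshiftrt/mult idiom above, but for the widest integer type there is no wider intermediate, so the result has to be assembled from narrower partial products. The helper names `smul_highpart32` and `umul_highpart64` are hypothetical.

```c
#include <stdint.h>

/* Signed 32-bit high part via a 64-bit intermediate: the C analogue of
   (truncate (lshiftrt (mult (sign_extend x) (sign_extend y)) 32)).  */
static int32_t smul_highpart32(int32_t x, int32_t y)
{
    int64_t wide = (int64_t)x * (int64_t)y;
    /* Right-shifting a negative value is implementation-defined in C;
       this sketch assumes an arithmetic shift, as GCC provides.  */
    return (int32_t)(wide >> 32);
}

/* Unsigned 64-bit high part with no 128-bit type available: split each
   operand into 32-bit halves and sum the partial products.  */
static uint64_t umul_highpart64(uint64_t x, uint64_t y)
{
    uint64_t x_lo = x & 0xffffffffu, x_hi = x >> 32;
    uint64_t y_lo = y & 0xffffffffu, y_hi = y >> 32;

    uint64_t lo_lo = x_lo * y_lo;
    uint64_t hi_lo = x_hi * y_lo;
    uint64_t lo_hi = x_lo * y_hi;
    uint64_t hi_hi = x_hi * y_hi;

    /* Bits 32..63 of the full product; the sum of three values below
       2^32 cannot overflow 64 bits.  Its upper half is the carry.  */
    uint64_t mid = (lo_lo >> 32) + (hi_lo & 0xffffffffu)
                   + (lo_hi & 0xffffffffu);

    return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (mid >> 32);
}
```

The second function shows the kind of work that the widening representation cannot express directly when no wider mode exists.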

Whilst implementing some constant folding simplifications and
compile-time evaluation of these new RTX codes, I noticed that
this functionality could also be added for the existing saturating
arithmetic RTX codes.  Then likewise when documenting these new RTX
codes, I also took the opportunity to silence the @xref warnings in
invoke.texi.

2021-10-07  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* rtl.def (SMUL_HIGHPART, UMUL_HIGHPART): New RTX codes for
	representing signed and unsigned high-part multiplication resp.
	* simplify-rtx.c (simplify_binary_operation_1) [SMUL_HIGHPART,
	UMUL_HIGHPART]: Simplify high-part multiplications by zero.
	[SS_PLUS, US_PLUS, SS_MINUS, US_MINUS, SS_MULT, US_MULT,
	SS_DIV, US_DIV]: Similar simplifications for saturating
	arithmetic.
	(simplify_const_binary_operation) [SS_PLUS, US_PLUS, SS_MINUS,
	US_MINUS, SS_MULT, US_MULT, SMUL_HIGHPART, UMUL_HIGHPART]:
	Implement compile-time evaluation for constant operands.

	* dwarf2out.c (mem_loc_descriptor): Skip SMUL_HIGHPART and
	UMUL_HIGHPART.
	* doc/rtl.texi (smul_highpart, umul_highpart): Document RTX codes.
	* doc/md.texi (smul@var{m}3_highpart, umul@var{m}3_highpart):
	Mention the new smul_highpart and umul_highpart RTX codes.
	* doc/invoke.texi: Silence @xref "compilation" warnings.

gcc/testsuite/ChangeLog
	* gcc.target/i386/sse2-mmx-paddsb-2.c: New test case.
	* gcc.target/i386/sse2-mmx-paddusb-2.c: New test case.
	* gcc.target/i386/sse2-mmx-psubsb-2.c: New test case.
	* gcc.target/i386/sse2-mmx-psubusb-2.c: New test case.
Roger Sayle 2021-10-07 15:42:09 +01:00
parent 1a7d452c09
commit 555fa3545e
10 changed files with 216 additions and 9 deletions

--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -3126,7 +3126,7 @@ errors if these functions are not inlined everywhere they are called.
 @itemx -fno-modules-ts
 @opindex fmodules-ts
 @opindex fno-modules-ts
-Enable support for C++20 modules (@xref{C++ Modules}). The
+Enable support for C++20 modules (@pxref{C++ Modules}). The
 @option{-fno-modules-ts} is usually not needed, as that is the
 default. Even though this is a C++20 feature, it is not currently
 implicitly enabled by selecting that standard version.
@@ -33608,7 +33608,7 @@ version selected, although in pre-C++20 versions, it is of course an
 extension.
 No new source file suffixes are required or supported. If you wish to
-use a non-standard suffix (@xref{Overall Options}), you also need
+use a non-standard suffix (@pxref{Overall Options}), you also need
 to provide a @option{-x c++} option too.@footnote{Some users like to
 distinguish module interface files with a new suffix, such as naming
 the source @code{module.cppm}, which involves
@@ -33670,8 +33670,8 @@ to be resolved at the end of compilation. Without this, imported
 macros are only resolved when expanded or (re)defined. This option
 detects conflicting import definitions for all macros.
-@xref{C++ Module Mapper} for details of the @option{-fmodule-mapper}
-family of options.
+For details of the @option{-fmodule-mapper} family of options,
+@pxref{C++ Module Mapper}.
 @menu
 * C++ Module Mapper:: Module Mapper
@@ -33888,8 +33888,8 @@ dialect used and imports of the module.@footnote{The precise contents
 of this output may change.} The timestamp is the same value as that
 provided by the @code{__DATE__} & @code{__TIME__} macros, and may be
 explicitly specified with the environment variable
-@code{SOURCE_DATE_EPOCH}. @xref{Environment Variables} for further
-details.
+@code{SOURCE_DATE_EPOCH}. For further details
+@pxref{Environment Variables}.
 A set of related CMIs may be copied, provided the relative pathnames
 are preserved.

--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5776,11 +5776,13 @@ multiplication.
 @item @samp{smul@var{m}3_highpart}
 Perform a signed multiplication of operands 1 and 2, which have mode
 @var{m}, and store the most significant half of the product in operand 0.
-The least significant half of the product is discarded.
+The least significant half of the product is discarded. This may be
+represented in RTL using a @code{smul_highpart} RTX expression.
 @cindex @code{umul@var{m}3_highpart} instruction pattern
 @item @samp{umul@var{m}3_highpart}
-Similar, but the multiplication is unsigned.
+Similar, but the multiplication is unsigned. This may be represented
+in RTL using an @code{umul_highpart} RTX expression.
 @cindex @code{madd@var{m}@var{n}4} instruction pattern
 @item @samp{madd@var{m}@var{n}4}

--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -2524,7 +2524,19 @@ not be the same.
 For unsigned widening multiplication, use the same idiom, but with
 @code{zero_extend} instead of @code{sign_extend}.
+@findex smul_highpart
+@findex umul_highpart
+@cindex high-part multiplication
+@cindex multiplication high part
+@item (smul_highpart:@var{m} @var{x} @var{y})
+@itemx (umul_highpart:@var{m} @var{x} @var{y})
+Represents the high-part multiplication of @var{x} and @var{y} carried
+out in machine mode @var{m}. @code{smul_highpart} returns the high part
+of a signed multiplication, @code{umul_highpart} returns the high part
+of an unsigned multiplication.
 @findex fma
 @cindex fused multiply-add
 @item (fma:@var{m} @var{x} @var{y} @var{z})
 Represents the @code{fma}, @code{fmaf}, and @code{fmal} builtin
 functions, which compute @samp{@var{x} * @var{y} + @var{z}}

--- a/gcc/dwarf2out.c
+++ b/gcc/dwarf2out.c
@@ -16809,6 +16809,8 @@ mem_loc_descriptor (rtx rtl, machine_mode mode,
     case CONST_FIXED:
     case CLRSB:
     case CLOBBER:
+    case SMUL_HIGHPART:
+    case UMUL_HIGHPART:
       break;
     case CONST_STRING:

--- a/gcc/rtl.def
+++ b/gcc/rtl.def
@@ -467,6 +467,11 @@ DEF_RTL_EXPR(SS_MULT, "ss_mult", "ee", RTX_COMM_ARITH)
 /* Multiplication with unsigned saturation */
 DEF_RTL_EXPR(US_MULT, "us_mult", "ee", RTX_COMM_ARITH)
+/* Signed high-part multiplication. */
+DEF_RTL_EXPR(SMUL_HIGHPART, "smul_highpart", "ee", RTX_COMM_ARITH)
+/* Unsigned high-part multiplication. */
+DEF_RTL_EXPR(UMUL_HIGHPART, "umul_highpart", "ee", RTX_COMM_ARITH)
 /* Operand 0 divided by operand 1. */
 DEF_RTL_EXPR(DIV, "div", "ee", RTX_BIN_ARITH)
 /* Division with signed saturation */

--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -4142,11 +4142,36 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
     case US_PLUS:
     case SS_MINUS:
     case US_MINUS:
+      /* Simplify x +/- 0 to x, if possible. */
+      if (trueop1 == CONST0_RTX (mode))
+        return op0;
+      return 0;
     case SS_MULT:
     case US_MULT:
+      /* Simplify x * 0 to 0, if possible. */
+      if (trueop1 == CONST0_RTX (mode)
+          && !side_effects_p (op0))
+        return op1;
+      /* Simplify x * 1 to x, if possible. */
+      if (trueop1 == CONST1_RTX (mode))
+        return op0;
+      return 0;
+    case SMUL_HIGHPART:
+    case UMUL_HIGHPART:
+      /* Simplify x * 0 to 0, if possible. */
+      if (trueop1 == CONST0_RTX (mode)
+          && !side_effects_p (op0))
+        return op1;
+      return 0;
     case SS_DIV:
     case US_DIV:
-      /* ??? There are simplifications that can be done. */
+      /* Simplify x / 1 to x, if possible. */
+      if (trueop1 == CONST1_RTX (mode))
+        return op0;
       return 0;
     case VEC_SERIES:
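
The identities used in this hunk can be checked exhaustively on a small model. The 8-bit helpers below (`ss_mult8`, `smul_highpart8`, `identities_hold`) are hypothetical C models written for illustration, not GCC code. The check also suggests why the high-part case has no `x * 1` rule: the high part of `x * 1` is the sign fill of `x` (0 or -1), not `x` itself.

```c
#include <stdint.h>

/* Signed saturating 8-bit multiply, a C model of ss_mult:QI.  */
static int8_t ss_mult8(int8_t x, int8_t y)
{
    int p = x * y;                 /* exact product in int */
    if (p > INT8_MAX) return INT8_MAX;
    if (p < INT8_MIN) return INT8_MIN;
    return (int8_t)p;
}

/* Signed high-part 8-bit multiply, a C model of smul_highpart:QI.
   Computes floor(p / 256) without shifting a negative value.  */
static int8_t smul_highpart8(int8_t x, int8_t y)
{
    int p = x * y;
    int q = p / 256;               /* truncates toward zero */
    if (p % 256 < 0)
        q--;                       /* adjust to floor for negative p */
    return (int8_t)q;
}

/* Returns 1 if the identities hold for every 8-bit x, else 0.  */
static int identities_hold(void)
{
    for (int x = INT8_MIN; x <= INT8_MAX; x++) {
        if (ss_mult8((int8_t)x, 0) != 0) return 0;        /* x * 0 == 0 */
        if (ss_mult8((int8_t)x, 1) != x) return 0;        /* x * 1 == x */
        if (smul_highpart8((int8_t)x, 0) != 0) return 0;  /* high (x * 0) == 0 */
        /* high (x * 1) is 0 or -1 depending on the sign of x.  */
        if (smul_highpart8((int8_t)x, 1) != (x < 0 ? -1 : 0)) return 0;
    }
    return 1;
}
```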
@@ -5012,6 +5037,51 @@ simplify_const_binary_operation (enum rtx_code code, machine_mode mode,
               }
             break;
           }
+        case SS_PLUS:
+          result = wi::add (pop0, pop1, SIGNED, &overflow);
+ clamp_signed_saturation:
+          if (overflow == wi::OVF_OVERFLOW)
+            result = wi::max_value (GET_MODE_PRECISION (int_mode), SIGNED);
+          else if (overflow == wi::OVF_UNDERFLOW)
+            result = wi::min_value (GET_MODE_PRECISION (int_mode), SIGNED);
+          else if (overflow != wi::OVF_NONE)
+            return NULL_RTX;
+          break;
+        case US_PLUS:
+          result = wi::add (pop0, pop1, UNSIGNED, &overflow);
+ clamp_unsigned_saturation:
+          if (overflow != wi::OVF_NONE)
+            result = wi::max_value (GET_MODE_PRECISION (int_mode), UNSIGNED);
+          break;
+        case SS_MINUS:
+          result = wi::sub (pop0, pop1, SIGNED, &overflow);
+          goto clamp_signed_saturation;
+        case US_MINUS:
+          result = wi::sub (pop0, pop1, UNSIGNED, &overflow);
+          if (overflow != wi::OVF_NONE)
+            result = wi::min_value (GET_MODE_PRECISION (int_mode), UNSIGNED);
+          break;
+        case SS_MULT:
+          result = wi::mul (pop0, pop1, SIGNED, &overflow);
+          goto clamp_signed_saturation;
+        case US_MULT:
+          result = wi::mul (pop0, pop1, UNSIGNED, &overflow);
+          goto clamp_unsigned_saturation;
+        case SMUL_HIGHPART:
+          result = wi::mul_high (pop0, pop1, SIGNED);
+          break;
+        case UMUL_HIGHPART:
+          result = wi::mul_high (pop0, pop1, UNSIGNED);
+          break;
         default:
           return NULL_RTX;
         }
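
The clamping rules in this hunk can be modelled in plain C for 8-bit operands. This is only an illustrative sketch: the function names (`ss_plus8`, `ss_minus8`, `us_plus8`, `us_minus8`) are hypothetical, and the code computes in `int` rather than using GCC's `wi::` overflow flags.

```c
#include <stdint.h>

/* Clamp an exact int result to the signed 8-bit range.  */
static int8_t clamp_signed8(int r)
{
    if (r > INT8_MAX) return INT8_MAX;   /* overflow: largest value */
    if (r < INT8_MIN) return INT8_MIN;   /* underflow: smallest value */
    return (int8_t)r;
}

/* C models of ss_plus:QI and ss_minus:QI.  */
static int8_t ss_plus8(int8_t a, int8_t b)  { return clamp_signed8(a + b); }
static int8_t ss_minus8(int8_t a, int8_t b) { return clamp_signed8(a - b); }

/* C model of us_plus:QI: any overflow clamps to the maximum.  */
static uint8_t us_plus8(uint8_t a, uint8_t b)
{
    unsigned r = (unsigned)a + b;
    return r > UINT8_MAX ? UINT8_MAX : (uint8_t)r;
}

/* C model of us_minus:QI: any underflow clamps to zero.  */
static uint8_t us_minus8(uint8_t a, uint8_t b)
{
    return a < b ? 0 : (uint8_t)(a - b);
}
```

With the operands used in the new test cases, this model gives 3, 127 and -128 for the signed additions, and 255 for the unsigned addition of 200 and 200. The bit pattern of 255 reads as -1 when viewed as a signed 8-bit value, which is consistent with the `$-1` the paddusb test scans for.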

--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse2-mmx-paddsb-2.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2" } */
+typedef char v8qi __attribute__ ((vector_size (8)));
+char foo()
+{
+  v8qi tx = { 1, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi t = __builtin_ia32_paddsb(tx, ty);
+  return t[0];
+}
+char bar()
+{
+  v8qi tx = { 100, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi ty = { 100, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi t = __builtin_ia32_paddsb(tx, ty);
+  return t[0];
+}
+char baz()
+{
+  v8qi tx = { -100, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi ty = { -100, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi t = __builtin_ia32_paddsb(tx, ty);
+  return t[0];
+}
+/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$3," 1 } } */
+/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$127," 1 } } */
+/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$-128," 1 } } */
+/* { dg-final { scan-assembler-not "paddsb\[ \\t\]+%xmm\[0-9\]+" } } */

--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse2-mmx-paddusb-2.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2" } */
+typedef char v8qi __attribute__ ((vector_size (8)));
+char foo()
+{
+  v8qi tx = { 1, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi t = __builtin_ia32_paddusb(tx, ty);
+  return t[0];
+}
+char bar()
+{
+  v8qi tx = { 200, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi ty = { 200, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi t = __builtin_ia32_paddusb(tx, ty);
+  return t[0];
+}
+/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$3," 1 } } */
+/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$-1," 1 } } */
+/* { dg-final { scan-assembler-not "paddusb\[ \\t\]+%xmm\[0-9\]+" } } */

--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse2-mmx-psubsb-2.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2" } */
+typedef char v8qi __attribute__ ((vector_size (8)));
+char foo()
+{
+  v8qi tx = { 5, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi t = __builtin_ia32_psubsb(tx, ty);
+  return t[0];
+}
+char bar()
+{
+  v8qi tx = { -100, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi ty = { 100, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi t = __builtin_ia32_psubsb(tx, ty);
+  return t[0];
+}
+char baz()
+{
+  v8qi tx = { 100, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi ty = { -100, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi t = __builtin_ia32_psubsb(tx, ty);
+  return t[0];
+}
+/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$3," 1 } } */
+/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$-128," 1 } } */
+/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$127," 1 } } */
+/* { dg-final { scan-assembler-not "psubsb\[ \\t\]+%xmm\[0-9\]+" } } */

--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse2-mmx-psubusb-2.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2" } */
+typedef char v8qi __attribute__ ((vector_size (8)));
+char foo()
+{
+  v8qi tx = { 5, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi t = __builtin_ia32_psubusb(tx, ty);
+  return t[0];
+}
+char bar()
+{
+  v8qi tx = { 100, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi ty = { 200, 0, 0, 0, 0, 0, 0, 0 };
+  v8qi t = __builtin_ia32_psubusb(tx, ty);
+  return t[0];
+}
+/* { dg-final { scan-assembler-times "movl\[ \\t\]+\\\$3," 1 } } */
+/* { dg-final { scan-assembler-times "xorl\[ \\t\]+" 1 } } */
+/* { dg-final { scan-assembler-not "psubusb\[ \\t\]+%xmm\[0-9\]+" } } */