Go to file
Richard Biener 9aaedfc414 load and store-lanes with SLP
The following is a prototype for how to represent load/store-lanes
within SLP.  I've for now settled with having a single load node
with multiple permute nodes acting as selection, one for each loaded lane
and a single store node fed from all stored lanes.  For

  for (int i = 0; i < 1024; ++i)
    {
      a[2*i] = b[2*i] + 7;
      a[2*i+1] = b[2*i+1] * 3;
    }

you have the following SLP graph where I explain how things are set
up and code-generated:

t.c:23:21: note:   SLP graph after lowering permutations:
t.c:23:21: note:   node 0x50dc8b0 (max_nunits=1, refcnt=1) vector(4) int
t.c:23:21: note:   op template: *_6 = _7;
t.c:23:21: note:        stmt 0 *_6 = _7;
t.c:23:21: note:        stmt 1 *_12 = _13;
t.c:23:21: note:        children 0x50dc488 0x50dc6e8

This is the store node, it's marked with ldst_lanes = true during
SLP discovery.  This node code-generates

  vect_array.65[0] = vect__7.61_29;
  vect_array.65[1] = vect__13.62_28;
  MEM <int[8]> [(int *)vectp_a.63_27] = .STORE_LANES (vect_array.65);

...
t.c:23:21: note:   node 0x50dc520 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note:   op: VEC_PERM_EXPR
t.c:23:21: note:        stmt 0 _5 = *_4;
t.c:23:21: note:        lane permutation { 0[0] }
t.c:23:21: note:        children 0x50dc948
t.c:23:21: note:   node 0x50dc780 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note:   op: VEC_PERM_EXPR
t.c:23:21: note:        stmt 0 _11 = *_10;
t.c:23:21: note:        lane permutation { 0[1] }
t.c:23:21: note:        children 0x50dc948

These are the selection nodes, marked with ldst_lanes = true.
They code generate nothing.

t.c:23:21: note:   node 0x50dc948 (max_nunits=4, refcnt=3) vector(4) int
t.c:23:21: note:   op template: _5 = *_4;
t.c:23:21: note:        stmt 0 _5 = *_4;
t.c:23:21: note:        stmt 1 _11 = *_10;
t.c:23:21: note:        load permutation { 0 1 }

This is the load node, marked with ldst_lanes = true (the load
permutation is only accurate when taking into account the lane permute
in the selection nodes).  It code generates

  vect_array.58 = .LOAD_LANES (MEM <int[8]> [(int *)vectp_b.56_33]);
  vect__5.59_31 = vect_array.58[0];
  vect__5.60_30 = vect_array.58[1];

This scheme allows to leave code generation in vectorizable_load/store
mostly as-is.

While this should support both load-lanes and (masked) store-lanes
the decision to do either is done during SLP discovery time and
cannot be reversed without altering the SLP tree - as-is the SLP
tree is not usable for non-store-lanes on the store side, the
load side is OK representation-wise but will very likely fail
permute handling as the lowering to deal with the two input vector
restriction isn't done - but of course since the permute node is
marked as to be ignored that doesn't work out.  So I've put
restrictions in place that fail vectorization if a load/store-lane
SLP tree is later classified differently by get_load_store_type.

I'll note that for example gcc.target/aarch64/sve/mask_struct_store_3.c
will not get SLP store-lanes used because the full store SLPs just
fine though we then fail to handle the "splat" load-permutation

t2.c:5:21: note:   node 0x4db2630 (max_nunits=4, refcnt=2) vector([4,4]) int
t2.c:5:21: note:   op template: _6 = *_5;
t2.c:5:21: note:        stmt 0 _6 = *_5;
t2.c:5:21: note:        stmt 1 _6 = *_5;
t2.c:5:21: note:        stmt 2 _6 = *_5;
t2.c:5:21: note:        stmt 3 _6 = *_5;
t2.c:5:21: note:        load permutation { 0 0 0 0 }

the load permute lowering code currently doesn't consider it worth
lowering single loads from a group (or in this case not grouped loads).
The expectation is the target can handle this by two interleaves with
itself.

So what we see here is that while the explicit SLP representation is
helpful in some cases, in cases like this it would require changing
it when we make decisions how to vectorize.  My idea is that this
all will change a lot when we re-do SLP discovery (for loops) and
when we get rid of non-SLP as I think vectorizable_* should be
allowed to alter the SLP graph during analysis.

The patch also removes the code cancelling SLP if we can use
load/store-lanes from the main loop vector analysis code and
re-implements it as re-discovering the SLP instance with
forced single-lane splits so SLP load/store-lanes scheme can be
used.

This is now done after SLP discovery and SLP pattern recog are
complete to not disturb the latter but per SLP instance instead
of being a global decision on the whole loop.

This is a behavioral change that for example shows in
gcc.dg/vect/slp-perm-6.c on ARM where we formerly used SLP permutes
but now a mix of SLP without permutes and load/store lanes.  The
previous flaky heuristic is now flaky in a different way.

Testing on RISC-V and aarch64 reveal several testcases that require
adjustment as to now expect SLP even when load/store lanes are being
used.  If in doubt I've adjusted them to the final expectation which
will lead to one or two new FAILs where we still do the SLP cancelling.
I have a followup that implements that while remaining in SLP that's
in final testing.

Note that gcc.dg/vect/slp-42.c and gcc.dg/vect/pr68445.c will FAIL
on aarch64 with SVE because for some odd reason vect_stridedN
is true for any N for check_effective_target_vect_fully_masked
targets but SVE cannot do ld8 while risc-v can.

I have not bothered to adjust target tests that now fail assembly-scan.

	* tree-vectorizer.h (_slp_tree::ldst_lanes): New flag to mark
	load, store and permute nodes.
	* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize ldst_lanes.
	(vect_build_slp_instance): For stores iff the target prefers
	store-lanes discover single-lane sub-groups, do not perform
	interleaving lowering but mark the node with ldst_lanes.
	Also allow i == 0 - fatal failure - for splitting up a store group
	when we're not doing single-lane discovery already.
	(vect_lower_load_permutations): When the target supports
	load lanes and the loads all fit the pattern split out
	a single level of permutes only and mark the load and
	permute nodes with ldst_lanes.
	(vectorizable_slp_permutation_1): Handle the load-lane permute
	forwarding of vector defs.
	(vect_analyze_slp): After SLP pattern recog is finished see if
	there are any SLP instances that would benefit from using
	load/store-lanes and re-discover those with forced single lanes.
	* tree-vect-stmts.cc (get_group_load_store_type): Support
	load/store-lanes for SLP.
	(vectorizable_store): Support SLP code generation for store-lanes.
	(vectorizable_load): Support SLP code generation for load-lanes.
	* tree-vect-loop.cc (vect_analyze_loop_2): Do not cancel SLP
	when store-lanes can be used.

	* gcc.dg/vect/slp-55.c: New testcase.
	* gcc.dg/vect/slp-56.c: Likewise.
	* gcc.dg/vect/slp-11c.c: Adjust.
	* gcc.dg/vect/slp-53.c: Likewise.
	* gcc.dg/vect/slp-cond-1.c: Likewise.
	* gcc.dg/vect/vect-complex-5.c: Likewise.
	* gcc.dg/vect/slp-1.c: Likewise.
	* gcc.dg/vect/slp-54.c: Remove riscv XFAIL.
	* gcc.dg/vect/slp-perm-5.c: Adjust.
	* gcc.dg/vect/slp-perm-7.c: Likewise.
	* gcc.dg/vect/slp-perm-8.c: Likewise.
	* gcc.dg/vect/slp-multitypes-11.c: Likewise.
	* gcc.dg/vect/slp-multitypes-11-big-array.c: Likewise.
	* gcc.dg/vect/slp-perm-9.c: Remove expected SLP fail due to
	three-vector permute.
	* gcc.dg/vect/slp-perm-6.c: Remove XFAIL.
	* gcc.dg/vect/slp-perm-1.c: Adjust.
	* gcc.dg/vect/slp-perm-2.c: Likewise.
	* gcc.dg/vect/slp-perm-3.c: Likewise.
	* gcc.dg/vect/slp-perm-4.c: Likewise.
	* gcc.dg/vect/pr68445.c: Likewise.
	* gcc.dg/vect/slp-11b.c: Likewise.
	* gcc.dg/vect/slp-2.c: Likewise.
	* gcc.dg/vect/slp-23.c: Likewise.
	* gcc.dg/vect/slp-33.c: Likewise.
	* gcc.dg/vect/slp-42.c: Likewise.
	* gcc.dg/vect/slp-46.c: Likewise.
	* gcc.dg/vect/slp-perm-10.c: Likewise.
2024-09-02 08:50:32 +02:00
.github
c++tools Daily bump. 2024-05-09 10:58:01 +00:00
config Daily bump. 2024-04-17 00:18:45 +00:00
contrib Daily bump. 2024-08-02 00:18:55 +00:00
fixincludes Daily bump. 2024-07-12 00:17:52 +00:00
gcc load and store-lanes with SLP 2024-09-02 08:50:32 +02:00
gnattools Daily bump. 2024-07-08 00:17:01 +00:00
gotools Daily bump. 2024-04-16 00:18:06 +00:00
include Daily bump. 2024-08-10 00:17:05 +00:00
INSTALL
libada Update copyright years. 2024-01-03 12:19:35 +01:00
libatomic Daily bump. 2024-07-19 00:18:20 +00:00
libbacktrace Daily bump. 2024-08-06 00:17:19 +00:00
libcc1 Daily bump. 2024-03-17 00:17:21 +00:00
libcody Update Copyright year in ChangeLog files 2024-01-03 11:35:18 +01:00
libcpp Daily bump. 2024-08-28 00:19:45 +00:00
libdecnumber Daily bump. 2024-04-03 00:17:29 +00:00
libffi Daily bump. 2024-07-02 00:17:36 +00:00
libgcc Daily bump. 2024-08-28 00:19:45 +00:00
libgfortran Daily bump. 2024-08-21 00:19:34 +00:00
libgm2 Daily bump. 2024-05-30 00:16:44 +00:00
libgo runtime: dump registers on Solaris 2024-04-29 11:39:58 -07:00
libgomp Daily bump. 2024-08-29 00:19:25 +00:00
libgrust Daily bump. 2024-08-02 00:18:55 +00:00
libiberty Daily bump. 2024-08-06 00:17:19 +00:00
libitm Daily bump. 2024-06-01 00:17:20 +00:00
libobjc Daily bump. 2024-09-01 00:25:25 +00:00
libphobos Daily bump. 2024-06-01 00:17:20 +00:00
libquadmath Daily bump. 2024-08-29 00:19:25 +00:00
libsanitizer Daily bump. 2024-02-17 00:17:08 +00:00
libssp Daily bump. 2024-05-09 10:58:01 +00:00
libstdc++-v3 Daily bump. 2024-08-29 00:19:25 +00:00
libvtv Daily bump. 2024-06-01 00:17:20 +00:00
lto-plugin Daily bump. 2024-08-24 00:18:13 +00:00
maintainer-scripts Daily bump. 2024-07-20 00:17:53 +00:00
zlib
.b4-config Add config file so b4 uses inbox.sourceware.org automatically 2024-07-28 11:13:16 +01:00
.dir-locals.el dir-locals: apply our C settings in C++ also 2024-07-31 20:38:27 +02:00
.gitattributes
.gitignore *: add modern gettext 2023-11-14 00:47:11 +01:00
ABOUT-NLS
ar-lib
ChangeLog Daily bump. 2024-08-23 00:17:24 +00:00
ChangeLog.jit
ChangeLog.tree-ssa
compile
config-ml.in config-ml.in: Fix multi-os-dir search 2024-05-06 12:08:28 +08:00
config.guess
config.rpath
config.sub
configure build: Fix missing variable quotes and typo 2024-06-20 10:58:37 +08:00
configure.ac build: Fix missing variable quotes and typo 2024-06-20 10:58:37 +08:00
COPYING
COPYING3
COPYING3.LIB
COPYING.LIB
COPYING.RUNTIME
depcomp
install-sh
libtool-ldflags
libtool.m4 Build: fix error in fixinclude configure 2023-11-22 11:54:33 +01:00
lt~obsolete.m4
ltgcc.m4
ltmain.sh
ltoptions.m4
ltsugar.m4
ltversion.m4
MAINTAINERS MAINTAINERS: Change my contact email in MAINTAINERS file. 2024-08-07 04:55:37 +00:00
Makefile.def gccrs: Fix missing build dependency 2024-01-16 16:23:02 +01:00
Makefile.in Makefile.tpl: fix whitespace in licence header 2024-08-22 03:41:12 +01:00
Makefile.tpl Makefile.tpl: fix whitespace in licence header 2024-08-22 03:41:12 +01:00
missing
mkdep
mkinstalldirs
move-if-change
multilib.am
README
SECURITY.txt SECURITY.txt: Drop "exploitable" in reference to hardening issues 2024-01-09 10:49:01 -05:00
symlink-tree
test-driver
ylwrap

This directory contains the GNU Compiler Collection (GCC).

The GNU Compiler Collection is free software.  See the files whose
names start with COPYING for copying permission.  The manuals, and
some of the runtime libraries, are under different terms; see the
individual source files for details.

The directory INSTALL contains copies of the installation information
as HTML and plain text.  The source of this information is
gcc/doc/install.texi.  The installation information includes details
of what is included in the GCC sources and what files GCC installs.

See the file gcc/doc/gcc.texi (together with other files that it
includes) for usage and porting information.  An online readable
version of the manual is in the files gcc/doc/gcc.info*.

See http://gcc.gnu.org/bugs/ for how to report bugs usefully.

Copyright years on GCC source files may be listed using range
notation, e.g., 1987-2012, indicating that every year in the range,
inclusive, is a copyrightable year that could otherwise be listed
individually.