Currently the unaligned YMM and ZMM load and store costs are cheaper than
the aligned ones, which causes the vectorizer to purposely mis-align
accesses by adding an alignment prologue. It looks like the unaligned
costs were simply copied from the bogus znver4 costs. The following makes
the unaligned costs equal to the aligned costs, as in the fixed znver4
version.
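For illustration only, a minimal sketch of the shape of the change. The
field names follow i386's processor_costs; the numeric values are made
up and are not znver5's real entries:

  /* Sketch: the vector load/store cost rows of a processor cost table,
     indexed by access size (32, 64, 128, 256 and 512 bits).  */
  struct vec_mem_costs
  {
    int sse_load[5];
    int sse_unaligned_load[5];
    int sse_store[5];
    int sse_unaligned_store[5];
  };

  /* Before: the 256/512-bit unaligned entries are cheaper than the
     aligned ones, so the vectorizer prefers mis-aligned accesses.  */
  const vec_mem_costs before = {
    { 6, 6, 6, 10, 20 }, { 6, 6, 6,  6, 12 },
    { 8, 8, 8, 12, 24 }, { 8, 8, 8,  8, 16 },
  };

  /* After: the unaligned rows simply mirror the aligned rows.  */
  const vec_mem_costs after = {
    { 6, 6, 6, 10, 20 }, { 6, 6, 6, 10, 20 },
    { 8, 8, 8, 12, 24 }, { 8, 8, 8, 12, 24 },
  };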
* config/i386/x86-tune-costs.h (znver5_cost): Update unaligned
load and store cost from the aligned costs.
(cherry picked from commit 896393791e)
An address override only applies to the (reg32) part in the thread
address fs:(reg32). Don't rewrite a thread address like
(set (reg:CCZ 17 flags)
(compare:CCZ (reg:SI 98 [ __gmpfr_emax.0_1 ])
(mem/c:SI (plus:SI (plus:SI (unspec:SI [
(const_int 0 [0])
] UNSPEC_TP)
(reg:SI 107))
(const:SI (unspec:SI [
(symbol_ref:SI ("previous_emax") [flags 0x1a] <var_decl 0x7fffe9a11cf0 previous_emax>)
] UNSPEC_DTPOFF))) [1 previous_emax+0 S4 A32])))
when an address override is used, in order to avoid generating an
invalid memory operand like
cmpl %fs:previous_emax@dtpoff(%eax), %r12d
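A sketch of the shape of the new early return; the exact placement
inside ix86_rewrite_tls_address_1 and the accessors used are
assumptions, not the committed hunk:

  /* Sketch: if the TLS address is already "thread register plus an
     integer register", i.e. (plus (unspec [...] UNSPEC_TP) (reg)),
     an address override is in use; leave the address alone instead
     of rewriting it into an invalid %fs:...@dtpoff(reg) operand.  */
  if (GET_CODE (addr) == PLUS
      && GET_CODE (XEXP (addr, 0)) == UNSPEC
      && XINT (XEXP (addr, 0), 1) == UNSPEC_TP
      && REG_P (XEXP (addr, 1)))
    return addr;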
gcc/
PR target/116839
* config/i386/i386.cc (ix86_rewrite_tls_address_1): Make it
static. Return if TLS address is thread register plus an integer
register.
gcc/testsuite/
PR target/116839
* gcc.target/i386/pr116839.c: New file.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c79cc30862)
Currently subregs originating from *tf_to_fprx2_0 and *tf_to_fprx2_1
survive register allocation. This in turn leads to wrong register
renaming. Keeping the current approach would mean we need two insns for
*tf_to_fprx2_0 and *tf_to_fprx2_1, respectively, something along the
lines of
(define_insn "*tf_to_fprx2_0"
[(set (subreg:DF (match_operand:FPRX2 0 "nonimmediate_operand" "=f") 0)
(unspec:DF [(match_operand:TF 1 "general_operand" "v")]
UNSPEC_TF_TO_FPRX2_0))]
"TARGET_VXE"
"#")
(define_insn "*tf_to_fprx2_0"
[(set (match_operand:DF 0 "nonimmediate_operand" "=f")
(unspec:DF [(match_operand:TF 1 "general_operand" "v")]
UNSPEC_TF_TO_FPRX2_0))]
"TARGET_VXE"
"vpdi\t%v0,%v1,%v0,1
[(set_attr "op_type" "VRR")])
and similar for *tf_to_fprx2_1. Note that before register allocation
operand 0 has mode FPRX2, and afterwards DF once subregs have been
eliminated.
Since we always copy a whole vector register into a floating-point
register pair, another way to fix this is to merge *tf_to_fprx2_0 and
*tf_to_fprx2_1 into a single insn, which means we don't have to use
subregs at all. The downside of this is that the assembler template now
contains two instructions. The upside is that we don't have to come up
with some artificial insn before RA, which is arguably more
readable/maintainable. That is what this patch implements.
In commit r11-4872-ge627cda5686592, the output operand specifier %V was
introduced; it is now used in tf_to_fprx2 only. Instead of coming up
with its counterpart %F for floating-point registers, which would also
only be used in tf_to_fprx2, I print the operands directly. This
renders %V unused, which is why it is removed by this patch.
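As a purely illustrative example (not one of the adapted tests) of when
this copy is needed: with the vector facility a TFmode value is computed
in a vector register, but the ABI hands the result back in a
floating-point register pair, which is what tf_to_fprx2 implements.

  /* Illustration only: with -march=z14 the multiplication is done in a
     vector register, while the long double result is returned in the
     floating-point register pair f0/f2, so a vector-register-to-FPR-pair
     copy is emitted on return.  */
  long double
  scale (long double x)
  {
    return x * 2.0L;
  }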
gcc/ChangeLog:
PR target/115860
* config/s390/s390.cc (print_operand): Remove operand specifier
%V.
* config/s390/s390.md (UNSPEC_TF_TO_FPRX2): New.
* config/s390/vector.md (*tf_to_fprx2_0): Remove.
(*tf_to_fprx2_1): Remove.
(tf_to_fprx2): New.
gcc/testsuite/ChangeLog:
* gcc.target/s390/vector/long-double-asm-abi.c: Adapt
scan-assembler directive.
* gcc.target/s390/vector/long-double-to-i64.c: Adapt
scan-assembler directive.
* gcc.target/s390/pr115860-1.c: New test.
(cherry picked from commit 46c2538435)
For the AQ and AR constraints, ensure that the displacement is still
valid after adding any positive offset smaller than the size of the
object being referenced.
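A minimal model of the tightened check; it assumes a 12-bit unsigned
short displacement, and the real s390_mem_constraint code differs:

  /* Model only: every byte of an N-byte access must still have a valid
     displacement, not just its first byte.  */
  static bool
  disp_ok_for_whole_access (long disp, long size)
  {
    return disp >= 0 && disp + size - 1 <= 0xfff;
  }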
gcc/ChangeLog:
* config/s390/s390.cc (s390_mem_constraint): Check displacement
for AQ and AR constraints.
(cherry picked from commit 1a71ff3b89)
gcc/fortran/ChangeLog:
PR fortran/100273
* trans-decl.cc (gfc_create_module_variable): Handle module
variable also when it is needed for the result specification
of a contained function.
gcc/testsuite/ChangeLog:
PR fortran/100273
* gfortran.dg/pr100273.f90: New test.
(cherry picked from commit 1f462b5072)
When a memory copy operation is analyzed by analyze_ssa_name, if both the
load and store are made through the same SSA name, the store is overlooked.
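A hypothetical illustration of such a copy (this is not the committed
modref-4.c): both the load and the store are based on the same SSA name
p, so the store side must be recorded as well.

  struct inner { int a[4]; };
  struct outer { struct inner x, y; };

  /* Aggregate copy: loads from p->y and stores to p->x, both through
     the single pointer SSA name for p.  */
  void
  copy_halves (struct outer *p)
  {
    p->x = p->y;
  }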
gcc/
* ipa-modref.cc (modref_eaf_analysis::analyze_ssa_name): Always
process both the load and the store of a memory copy operation.
gcc/testsuite/
* gcc.dg/ipa/modref-4.c: New test.
In s390_expand_insv(), when generating code for ICM et al., src may be
a MEM, and gen_lowpart() might force src into a register, so that we
end up with patterns which do not match anymore. Use adjust_address()
instead in order to preserve the MEM.
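A sketch of that distinction; the helper name is hypothetical and only
the adjust_address()/gen_lowpart() contrast is the point:

  /* Hypothetical helper, sketch only: narrow SRC to MODE while keeping
     a MEM a MEM.  gen_lowpart () may force the MEM into a new pseudo,
     after which the ICM pattern no longer matches.  */
  static rtx
  narrow_keeping_mem (rtx src, machine_mode mode)
  {
    if (MEM_P (src))
      return adjust_address (src, mode, 0);
    return gen_lowpart (mode, src);
  }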
Furthermore, it is not straightforward to enforce a subreg. For
example, in case of a paradoxical subreg, gen_lowpart() may return a
register. To compensate for this, s390_gen_lowpart_subreg() emits a
reference to a pseudo which does not coincide with its definition,
which is wrong. Additionally, if dest is a paradoxical subreg, do not
try to emit a strict_low_part, since it could mean that dest was not
initialized, even though this might be fixed up later by init-regs.
The splitters for the insns *get_tp_64, *zero_extendhisi2_31,
*zero_extendqisi2_31 and *zero_extendqihi2_31 are applied after reload.
Thus, operands[0] is a hard register and gen_lowpart (m, operands[0])
just returns the hard register for mode m, which is fine to use as an
argument for strict_low_part, i.e., we do not need to enforce subregs
here since after reload subregs are supposed to be eliminated anyway.
This fixes gcc.dg/torture/pr111821.c.
gcc/ChangeLog:
* config/s390/s390-protos.h (s390_gen_lowpart_subreg): Remove.
* config/s390/s390.cc (s390_gen_lowpart_subreg): Remove.
(s390_expand_insv): Use adjust_address() and emit a
strict_low_part only in case of a natural subreg.
* config/s390/s390.md: Use gen_lowpart() instead of
s390_gen_lowpart_subreg().
(cherry picked from commit 9ebc9fbddd)
This patch is backported from GCC15 with some tweaks.
Since r15-3539, requests have been coming in to add documentation for
the other alias options. This patch adds all of them, including corei7,
corei7-avx, core-avx-i, core-avx2, atom and slm.
Also in the patch, I reordered that part of the documentation, since
currently all the CPUs/products are just all over the place. I
regrouped them into date-to-now products (from the very first CPU to
the latest Panther Lake), P-core (since the client products became
hybrid cores, starting from Sapphire Rapids) and E-core (since
Bonnell). In GCC 14 and earlier GCC, the Xeon Phi CPUs are still there,
so I put them after the E-core CPUs.
In the patch I also refined the product names in the documentation.
gcc/ChangeLog:
* doc/invoke.texi: Add corei7, corei7-avx, core-avx-i,
core-avx2, atom, and slm. Reorder the -march documentation by
splitting them into date-to-now products, P-core, E-core and
Xeon Phi. Refine the product names in documentation.
r12-3495 added maybe_warn_about_constant_value, which crashes if it
gets a nameless VAR_DECL, and that is what happens in this PR: the
nameless VAR_DECL is created in cp_parser_decomposition_declaration.
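For illustration of the construct involved (this is not the committed
constexpr-116676.C): the variable the parser creates to hold the
decomposed object carries no DECL_NAME.

  struct A { int i, j; };
  constexpr A a{1, 2};

  // The decomposition declaration below makes
  // cp_parser_decomposition_declaration create an unnamed backing
  // VAR_DECL for the whole object; warning code must not assume
  // DECL_NAME is set on it.
  const auto &[x, y] = a;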
PR c++/116676
gcc/cp/ChangeLog:
* constexpr.cc (maybe_warn_about_constant_value): Check DECL_NAME.
gcc/testsuite/ChangeLog:
* g++.dg/cpp1z/constexpr-116676.C: New test.
Reviewed-by: Jason Merrill <jason@redhat.com>
(cherry picked from commit dfe0d4389a)
Don't use a temporary for a PARALLEL BLKmode argument whose EXPR_LIST
expression is in a TImode register. Otherwise, the TImode variable
would be put in the GPR save area, which guarantees only 8-byte
alignment.
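As a hypothetical illustration of the alignment concern only (this is
not the committed pr116621.c and may not exercise the exact PARALLEL
shape the fix targets): a 16-byte-aligned value fetched with va_arg
must not be forced through the 8-byte-aligned register save area.

  #include <stdarg.h>

  /* Illustration only: VALUE has 16-byte alignment; spilling it into
     the GPR save area, which is only 8-byte aligned, would break an
     aligned TImode access to it.  */
  __int128
  first_vararg (int n, ...)
  {
    va_list ap;
    va_start (ap, n);
    __int128 value = va_arg (ap, __int128);
    va_end (ap);
    return value;
  }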
gcc/
PR target/116621
* config/i386/i386.cc (ix86_gimplify_va_arg): Don't use temp for
a PARALLEL BLKmode container of an EXPR_LIST expression in a
TImode register.
gcc/testsuite/
PR target/116621
* gcc.target/i386/pr116621.c: New test.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit fa7bbb065c)
Update analyze_parms not to disable function parameter analysis for
-ffat-lto-objects. Tested on x86-64: there are no differences in zstd
built with "-O2 -flto=auto -g" vs. "-O2 -flto=auto -g
-ffat-lto-objects".
PR ipa/116410
* ipa-modref.cc (analyze_parms): Always analyze function parameter
for LTO.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2f1689ea8e)
The non-optimized version of the intrinsic has a typo in the mask type,
which causes the high bits of the __mmask32 result to be unexpectedly
zeroed.
The test does not fail under -O0 with the current 1b testcase since the
testcase is wrong: avx512-mask-type.h needs to be included after SIZE
is defined, otherwise the mask type is always __mmask8. The same
problem exists in the AVX10.2 testcases; I will write a separate patch
to fix that.
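A hypothetical use showing why the full __mmask32 matters (this is not
the new 1c testcase); at -O0 the macro form of the intrinsic is used,
which is the variant that had the typo:

  #include <immintrin.h>

  /* Returns one bit per 16-bit lane; lanes 8..31 were lost when the
     -O0 macro version truncated the result to __mmask8.  The imm8
     value 0x18 selects positive and negative infinity as the classes
     to test for.  */
  __mmask32
  lanes_that_are_inf (__m512h x)
  {
    return _mm512_fpclass_ph_mask (x, 0x18);
  }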
gcc/ChangeLog:
* config/i386/avx512fp16intrin.h
(_mm512_mask_fpclass_ph_mask): Correct mask type to __mmask32.
(_mm512_fpclass_ph_mask): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512fp16-vfpclassph-1c.c: New test.
For function arguments/returns, when the value has BLKmode, it is put
in a PARALLEL with an EXPR_LIST, and the EXPR_LIST contains the real
mode and registers.
The current ix86_check_avx_upper_register only checks for SSE_REG_P and
fails to handle that. The patch extends the check to each subrtx.
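A sketch of the reworked check, close to what the ChangeLog describes,
but the helper name and exact predicate here are assumptions rather
than the committed hunk:

  /* Sketch: return true if EXP contains, at any depth, an SSE register
     whose mode is wider than 128 bits, so that a register inside a
     (parallel [(expr_list (reg) ...)]) argument container is also
     caught.  */
  static bool
  check_avx_upper_register (const_rtx exp)
  {
    subrtx_iterator::array_type array;
    FOR_EACH_SUBRTX (iter, array, exp, NONCONST)
      {
        const_rtx x = *iter;
        if (SSE_REG_P (x)
            && !EXT_REX_SSE_REG_P (x)
            && GET_MODE_BITSIZE (GET_MODE (x)) > 128)
          return true;
      }
    return false;
  }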
gcc/ChangeLog:
PR target/116512
* config/i386/i386.cc (ix86_check_avx_upper_register): Iterate
subrtx to scan for avx upper register.
(ix86_check_avx_upper_stores): Inline old
ix86_check_avx_upper_register.
(ix86_avx_u128_mode_needed): Ditto, and replace
FOR_EACH_SUBRTX with call to new
ix86_check_avx_upper_register.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr116512.c: New test.
(cherry picked from commit ab214ef734)