Those TR13/OpenMP 6.0 routines permit a reproducible offloading to
a specific device by mapping an OpenMP device number to a
unique ID (UID). The GPU device UIDs should be universally unique,
the one for the host is not.
gcc/ChangeLog:
* omp-general.cc (omp_runtime_api_procname): Add
get_device_from_uid and omp_get_uid_from_device routines.
include/ChangeLog:
* cuda/cuda.h (cuDeviceGetUuid): Declare.
(cuDeviceGetUuid_v2): Add prototype.
libgomp/ChangeLog:
* config/gcn/target.c (omp_get_uid_from_device,
omp_get_device_from_uid): Add stub implementation.
* config/nvptx/target.c (omp_get_uid_from_device,
omp_get_device_from_uid): Likewise.
* fortran.c (omp_get_uid_from_device_,
omp_get_uid_from_device_8_): New functions.
* libgomp-plugin.h (GOMP_OFFLOAD_get_uid): Add prototype.
* libgomp.h (struct gomp_device_descr): Add 'uid' and 'get_uid_func'.
* libgomp.map (GOMP_6.0): New, includind the new UID routines.
* libgomp.texi (OpenMP Technical Report 13): Mark UID routines as 'Y'.
(Device Information Routines): Document new UID routines.
(Offload-Target Specifics): Document UID format.
* omp.h.in (omp_get_device_from_uid, omp_get_uid_from_device):
New prototype.
* omp_lib.f90.in (omp_get_device_from_uid, omp_get_uid_from_device):
New interface.
* omp_lib.h.in: Likewise.
* plugin/cuda-lib.def: Add cuDeviceGetUuid and cuDeviceGetUuid_v2 via
CUDA_ONE_CALL_MAYBE_NULL.
* plugin/plugin-gcn.c (GOMP_OFFLOAD_get_uid): New.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_uid): New.
* target.c (str_omp_initial_device): New static var.
(STR_OMP_DEV_PREFIX): Define.
(gomp_get_uid_for_device, omp_get_uid_from_device,
omp_get_device_from_uid): New.
(gomp_load_plugin_for_device): DLSYM_OPT the function 'get_uid'.
(gomp_target_init): Set the device's 'uid' field to NULL.
* testsuite/libgomp.c/device_uid.c: New test.
* testsuite/libgomp.fortran/device_uid.f90: New test.
This commit adds OpenMP 5.1+'s interop enumeration, type and routine
declarations to the C/C++ header file and, new in OpenMP TR13, also to
the Fortran module and omp_lib.h header file.
While a stub implementation is provided, only with foreign runtime
support by the libgomp GPU plugins and with the 'interop' directive,
this becomes really useful.
libgomp/ChangeLog:
* fortran.c (omp_get_interop_str_, omp_get_interop_name_,
omp_get_interop_type_desc_, omp_get_interop_rc_desc_): Add.
* libgomp.map (GOMP_5.1.3): New; add interop routines.
* omp.h.in: Add interop typedefs, enum and prototypes.
(__GOMP_DEFAULT_NULL): Define.
(omp_target_memcpy_async, omp_target_memcpy_rect_async):
Use it for the optional depend argument.
* omp_lib.f90.in: Add paramters and interfaces for interop.
* omp_lib.h.in: Likewise; move F90 '&' to column 81 for
-ffree-length-80.
* target.c (omp_get_num_interop_properties, omp_get_interop_int,
omp_get_interop_ptr, omp_get_interop_str, omp_get_interop_name,
omp_get_interop_type_desc, omp_get_interop_rc_desc): Add.
* config/gcn/target.c (omp_get_num_interop_properties,
omp_get_interop_int, omp_get_interop_ptr, omp_get_interop_str,
omp_get_interop_name, omp_get_interop_type_desc,
omp_get_interop_rc_desc): Add.
* config/nvptx/target.c (omp_get_num_interop_properties,
omp_get_interop_int, omp_get_interop_ptr, omp_get_interop_str,
omp_get_interop_name, omp_get_interop_type_desc,
omp_get_interop_rc_desc): Add.
* testsuite/libgomp.c-c++-common/interop-routines-1.c: New test.
* testsuite/libgomp.c-c++-common/interop-routines-2.c: New test.
* testsuite/libgomp.fortran/interop-routines-1.F90: New test.
* testsuite/libgomp.fortran/interop-routines-2.F90: New test.
* testsuite/libgomp.fortran/interop-routines-3.F: New test.
* testsuite/libgomp.fortran/interop-routines-4.F: New test.
* testsuite/libgomp.fortran/interop-routines-5.F: New test.
* testsuite/libgomp.fortran/interop-routines-6.F: New test.
* testsuite/libgomp.fortran/interop-routines-7.F90: New test.
Corresponding to commit 9fa72756d9
"libgomp, nvptx: Honor OpenMP 5.1 num_teams lower bound", these are the
GCN offloading changes to fix:
PASS: libgomp.c/../libgomp.c-c++-common/teams-2.c (test for excess errors)
[-FAIL:-]{+PASS:+} libgomp.c/../libgomp.c-c++-common/teams-2.c execution test
PASS: libgomp.c++/../libgomp.c-c++-common/teams-2.c (test for excess errors)
[-FAIL:-]{+PASS:+} libgomp.c++/../libgomp.c-c++-common/teams-2.c execution test
..., and omptests' 't-critical' test case. I've cross checked that those test
cases are the ones that regress for nvptx offloading, if I locally revert the
"libgomp, nvptx: Honor OpenMP 5.1 num_teams lower bound" changes.
libgomp/
* config/gcn/libgomp-gcn.h (GOMP_TEAM_NUM): Inject.
* config/gcn/target.c (GOMP_teams4): Handle.
* config/gcn/team.c (gomp_gcn_enter_kernel): Initialize.
* config/gcn/teams.c (omp_get_team_num): Adjust.
A splay-tree was previously used to lookup equivalent target addresses
for a given host address on offload targets. However, as splay-trees can
modify their structure on lookup, they are not suitable for concurrent
access from separate teams/threads without some form of locking. This
patch changes the lookup data structure to a hashtab instead, which does
not have these issues.
The call to build_indirect_map to initialize the data structure is now
called from just the first thread of the first team to avoid redundant
calls to this function.
2024-03-22 Kwok Cheung Yeung <kcyeung@baylibre.com>
libgomp/
* config/accel/target-indirect.c: Include string.h and hashtab.h.
Remove include of splay-tree.h. Update comments.
(splay_tree_prefix, splay_tree_c): Delete.
(struct indirect_map_t): New.
(hash_entry_type, htab_alloc, htab_free, htab_hash, htab_eq): New.
(GOMP_INDIRECT_ADD_MAP): Remove volatile qualifier.
(USE_SPLAY_TREE_LOOKUP): Rename to...
(USE_HASHTAB_LOOKUP): ..this.
(indirect_map, indirect_array): Delete.
(indirect_htab): New.
(build_indirect_map): Remove locking. Build indirect map using
hashtab.
(GOMP_target_map_indirect_ptr): Use indirect_htab to lookup target
address.
(GOMP_target_map_indirect_ptr): Remove volatile qualifier.
* config/gcn/team.c (gomp_gcn_enter_kernel): Call build_indirect_map
from first thread of first team only.
* config/nvptx/team.c (gomp_nvptx_main): Likewise.
* testsuite/libgomp.c-c++-common/declare-target-indirect-2.c (main):
Add missing break statements.
* testsuite/libgomp.fortran/declare-target-indirect-2.f90: Remove
xfail.
This is enough to get gfx1030 and gfx1100 working; there are still some test
failures to investigate, and probably some tuning to do.
gcc/ChangeLog:
* config/gcn/gcn-opts.h (TARGET_PACKED_WORK_ITEMS): Add TARGET_RDNA3.
* config/gcn/gcn-valu.md (all_convert): New iterator.
(<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>): New
define_expand, and rename the old one to ...
(*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>): ... this.
(extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>): Likewise, to ...
(extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>): .. this.
(*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_shift<exec>): New.
* config/gcn/gcn.cc (gcn_global_address_p): Use "offsetbits" correctly.
(gcn_hsa_declare_function_name): Update the vgpr counting for gfx1100.
* config/gcn/gcn.md (<u>mulhisi3): Disable on RDNA3.
(<u>mulqihi3_scalar): Likewise.
libgcc/ChangeLog:
* config/gcn/amdgcn_veclib.h (CDNA3_PLUS): Handle RDNA3.
libgomp/ChangeLog:
* config/gcn/time.c (RTC_TICKS): Configure RDNA3.
(omp_get_wtime): Add RDNA3-compatible variant.
* plugin/plugin-gcn.c (max_isa_vgprs): Tune for gfx1030 and gfx1100.
Signed-off-by: Andrew Stubbs <ams@baylibre.com>
Fix-up for commit 348874f0ba
"libgomp: basic pinned memory on Linux", which may result in build failures
as follow, for example, for the '-m32' multilib of x86_64-pc-linux-gnu:
In file included from [...]/source-gcc/libgomp/config/linux/allocator.c:31:
[...]/source-gcc/libgomp/config/linux/allocator.c: In function ‘linux_memspace_alloc’:
[...]/source-gcc/libgomp/config/linux/allocator.c:70:26: error: format ‘%ld’ expects argument of type ‘long int’, but argument 3 has type ‘size_t’ {aka ‘unsigned int’} [-Werror=format=]
70 | gomp_debug (0, "libgomp: failed to pin %ld bytes of"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
71 | " memory (ulimit too low?)\n", size);
| ~~~~
| |
| size_t {aka unsigned int}
[...]/source-gcc/libgomp/libgomp.h:186:29: note: in definition of macro ‘gomp_debug’
186 | (gomp_debug) ((KIND), __VA_ARGS__); \
| ^~~~~~~~~~~
[...]/source-gcc/libgomp/config/linux/allocator.c:70:52: note: format string is defined here
70 | gomp_debug (0, "libgomp: failed to pin %ld bytes of"
| ~~^
| |
| long int
| %d
cc1: all warnings being treated as errors
make[9]: *** [allocator.lo] Error 1
make[9]: Leaving directory `[...]/build-gcc/x86_64-pc-linux-gnu/32/libgomp'
[...]
Fix this in the same way as used elsewhere in libgomp.
libgomp/
* config/linux/allocator.c (linux_memspace_alloc): Fix 'size_t'
vs. '%ld' format string mismatch.
Implement the OpenMP pinned memory trait on Linux hosts using the mlock
syscall. Pinned allocations are performed using mmap, not malloc, to ensure
that they can be unpinned safely when freed.
This implementation will work OK for page-scale allocations, and finer-grained
allocations will be implemented in a future patch.
libgomp/ChangeLog:
* allocator.c (MEMSPACE_ALLOC): Add PIN.
(MEMSPACE_CALLOC): Add PIN.
(MEMSPACE_REALLOC): Add PIN.
(MEMSPACE_FREE): Add PIN.
(MEMSPACE_VALIDATE): Add PIN.
(omp_init_allocator): Use MEMSPACE_VALIDATE to check pinning.
(omp_aligned_alloc): Add pinning to all MEMSPACE_* calls.
(omp_aligned_calloc): Likewise.
(omp_realloc): Likewise.
(omp_free): Likewise.
* config/linux/allocator.c: New file.
* config/nvptx/allocator.c (MEMSPACE_ALLOC): Add PIN.
(MEMSPACE_CALLOC): Add PIN.
(MEMSPACE_REALLOC): Add PIN.
(MEMSPACE_FREE): Add PIN.
(MEMSPACE_VALIDATE): Add PIN.
* config/gcn/allocator.c (MEMSPACE_ALLOC): Add PIN.
(MEMSPACE_CALLOC): Add PIN.
(MEMSPACE_REALLOC): Add PIN.
(MEMSPACE_FREE): Add PIN.
* libgomp.texi: Switch pinned trait to supported.
(MEMSPACE_VALIDATE): Add PIN.
* testsuite/libgomp.c/alloc-pinned-1.c: New test.
* testsuite/libgomp.c/alloc-pinned-2.c: New test.
* testsuite/libgomp.c/alloc-pinned-3.c: New test.
* testsuite/libgomp.c/alloc-pinned-4.c: New test.
Co-Authored-By: Thomas Schwinge <thomas@codesourcery.com>
This implements the OpenMP low-latency memory allocator for AMD GCN using the
small per-team LDS memory (Local Data Store).
Since addresses can now refer to LDS space, the "Global" address space is
no-longer compatible. This patch therefore switches the backend to use
entirely "Flat" addressing (which supports both memories). A future patch
will re-enable "global" instructions for cases where it is known to be safe
to do so.
gcc/ChangeLog:
* config/gcn/gcn-builtins.def (DISPATCH_PTR): New built-in.
* config/gcn/gcn.cc (gcn_init_machine_status): Disable global
addressing.
(gcn_expand_builtin_1): Implement GCN_BUILTIN_DISPATCH_PTR.
libgomp/ChangeLog:
* config/gcn/libgomp-gcn.h (TEAM_ARENA_START): Move to here.
(TEAM_ARENA_FREE): Likewise.
(TEAM_ARENA_END): Likewise.
(GCN_LOWLAT_HEAP): New.
* config/gcn/team.c (LITTLEENDIAN_CPU): New, and import hsa.h.
(__gcn_lowlat_init): New prototype.
(gomp_gcn_enter_kernel): Initialize the low-latency heap.
* libgomp.h (TEAM_ARENA_START): Move to libgomp.h.
(TEAM_ARENA_FREE): Likewise.
(TEAM_ARENA_END): Likewise.
* plugin/plugin-gcn.c (lowlat_size): New variable.
(print_kernel_dispatch): Label the group_segment_size purpose.
(init_environment_variables): Read GOMP_GCN_LOWLAT_POOL.
(create_kernel_dispatch): Pass low-latency head allocation to kernel.
(run_kernel): Use shadow; don't assume values.
* testsuite/libgomp.c/omp_alloc-traits.c: Enable for amdgcn.
* config/gcn/allocator.c: New file.
* libgomp.texi: Document low-latency implementation details.
The NVPTX low latency memory is not accessible outside the team that allocates
it, and therefore should be unavailable for allocators with the access trait
"all". This change means that the omp_low_lat_mem_alloc predefined
allocator no longer works (but omp_cgroup_mem_alloc still does).
libgomp/ChangeLog:
* allocator.c (MEMSPACE_VALIDATE): New macro.
(omp_init_allocator): Use MEMSPACE_VALIDATE.
(omp_aligned_alloc): Use OMP_LOW_LAT_MEM_ALLOC_INVALID.
(omp_aligned_calloc): Likewise.
(omp_realloc): Likewise.
* config/nvptx/allocator.c (nvptx_memspace_validate): New function.
(MEMSPACE_VALIDATE): New macro.
(OMP_LOW_LAT_MEM_ALLOC_INVALID): New define.
* libgomp.texi: Document low-latency implementation details.
* testsuite/libgomp.c/omp_alloc-1.c (main): Add gnu_lowlat.
* testsuite/libgomp.c/omp_alloc-2.c (main): Add gnu_lowlat.
* testsuite/libgomp.c/omp_alloc-3.c (main): Add gnu_lowlat.
* testsuite/libgomp.c/omp_alloc-4.c (main): Add access trait.
* testsuite/libgomp.c/omp_alloc-5.c (main): Add gnu_lowlat.
* testsuite/libgomp.c/omp_alloc-6.c (main): Add access trait.
* testsuite/libgomp.c/omp_alloc-traits.c: New test.
This patch adds support for allocating low-latency ".shared" memory on
NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc. The memory
can be allocated, reallocated, and freed using a basic but fast algorithm,
is thread safe and the size of the low-latency heap can be configured using
the GOMP_NVPTX_LOWLAT_POOL environment variable.
The use of the PTX dynamic_smem_size feature means that low-latency allocator
will not work with the PTX 3.1 multilib.
For now, the omp_low_lat_mem_alloc allocator also works, but that will change
when I implement the access traits.
libgomp/ChangeLog:
* allocator.c (MEMSPACE_ALLOC): New macro.
(MEMSPACE_CALLOC): New macro.
(MEMSPACE_REALLOC): New macro.
(MEMSPACE_FREE): New macro.
(predefined_alloc_mapping): New array. Add _Static_assert to match.
(ARRAY_SIZE): New macro.
(omp_aligned_alloc): Use MEMSPACE_ALLOC.
Implement fall-backs for predefined allocators. Simplify existing
fall-backs.
(omp_free): Use MEMSPACE_FREE.
(omp_calloc): Use MEMSPACE_CALLOC. Implement fall-backs for
predefined allocators. Simplify existing fall-backs.
(omp_realloc): Use MEMSPACE_REALLOC, MEMSPACE_ALLOC, and MEMSPACE_FREE.
Implement fall-backs for predefined allocators. Simplify existing
fall-backs.
* config/nvptx/team.c (__nvptx_lowlat_pool): New asm variable.
(__nvptx_lowlat_init): New prototype.
(gomp_nvptx_main): Call __nvptx_lowlat_init.
* libgomp.texi: Update memory space table.
* plugin/plugin-nvptx.c (lowlat_pool_size): New variable.
(GOMP_OFFLOAD_init_device): Read the GOMP_NVPTX_LOWLAT_POOL envvar.
(GOMP_OFFLOAD_run): Apply lowlat_pool_size.
* basic-allocator.c: New file.
* config/nvptx/allocator.c: New file.
* testsuite/libgomp.c/omp_alloc-1.c: New test.
* testsuite/libgomp.c/omp_alloc-2.c: New test.
* testsuite/libgomp.c/omp_alloc-3.c: New test.
* testsuite/libgomp.c/omp_alloc-4.c: New test.
* testsuite/libgomp.c/omp_alloc-5.c: New test.
* testsuite/libgomp.c/omp_alloc-6.c: New test.
Co-authored-by: Kwok Cheung Yeung <kcy@codesourcery.com>
Co-Authored-By: Thomas Schwinge <thomas@codesourcery.com>
The patch 'openmp: Add support for the 'indirect' clause in C/C++'
introduced a new file target-indirect.c into the Makefile sources,
but that file was only present in config/linux/ and config/accel/,
so targets that are not Linux-based or GPU accelerators will not
pick it up and fail to build.
This is fixed by making the version in config/linux/ the default by
moving it into the base directory of libgomp.
2023-11-07 Kwok Cheung Yeung <kcy@codesourcery.com>
libgomp/
* config/linux/target-indirect.c: Move to...
* target-indirect.c: ...here.
This adds support for the 'indirect' clause in the 'declare target'
directive. Functions declared as indirect may be called via function
pointers passed from the host in offloaded code.
Virtual calls to member functions via the object pointer in C++ are
currently not supported in target regions.
2023-11-07 Kwok Cheung Yeung <kcy@codesourcery.com>
gcc/c-family/
* c-attribs.cc (c_common_attribute_table): Add attribute for
indirect functions.
* c-pragma.h (enum parma_omp_clause): Add entry for indirect clause.
gcc/c/
* c-decl.cc (c_decl_attributes): Add attribute for indirect
functions.
* c-lang.h (c_omp_declare_target_attr): Add indirect field.
* c-parser.cc (c_parser_omp_clause_name): Handle indirect clause.
(c_parser_omp_clause_indirect): New.
(c_parser_omp_all_clauses): Handle indirect clause.
(OMP_DECLARE_TARGET_CLAUSE_MASK): Add indirect clause to mask.
(c_parser_omp_declare_target): Handle indirect clause. Emit error
message if device_type or indirect clauses used alone. Emit error
if indirect clause used with device_type that is not 'any'.
(OMP_BEGIN_DECLARE_TARGET_CLAUSE_MASK): Add indirect clause to mask.
(c_parser_omp_begin): Handle indirect clause.
* c-typeck.cc (c_finish_omp_clauses): Handle indirect clause.
gcc/cp/
* cp-tree.h (cp_omp_declare_target_attr): Add indirect field.
* decl2.cc (cplus_decl_attributes): Add attribute for indirect
functions.
* parser.cc (cp_parser_omp_clause_name): Handle indirect clause.
(cp_parser_omp_clause_indirect): New.
(cp_parser_omp_all_clauses): Handle indirect clause.
(handle_omp_declare_target_clause): Add extra parameter. Add
indirect attribute for indirect functions.
(OMP_DECLARE_TARGET_CLAUSE_MASK): Add indirect clause to mask.
(cp_parser_omp_declare_target): Handle indirect clause. Emit error
message if device_type or indirect clauses used alone. Emit error
if indirect clause used with device_type that is not 'any'.
(OMP_BEGIN_DECLARE_TARGET_CLAUSE_MASK): Add indirect clause to mask.
(cp_parser_omp_begin): Handle indirect clause.
* semantics.cc (finish_omp_clauses): Handle indirect clause.
gcc/
* lto-cgraph.cc (enum LTO_symtab_tags): Add tag for indirect
functions.
(output_offload_tables): Write indirect functions.
(input_offload_tables): read indirect functions.
* lto-section-names.h (OFFLOAD_IND_FUNC_TABLE_SECTION_NAME): New.
* omp-builtins.def (BUILT_IN_GOMP_TARGET_MAP_INDIRECT_PTR): New.
* omp-offload.cc (offload_ind_funcs): New.
(omp_discover_implicit_declare_target): Add functions marked with
'omp declare target indirect' to indirect functions list.
(omp_finish_file): Add indirect functions to section for offload
indirect functions.
(execute_omp_device_lower): Redirect indirect calls on target by
passing function pointer to BUILT_IN_GOMP_TARGET_MAP_INDIRECT_PTR.
(pass_omp_device_lower::gate): Run pass_omp_device_lower if
indirect functions are present on an accelerator device.
* omp-offload.h (offload_ind_funcs): New.
* tree-core.h (omp_clause_code): Add OMP_CLAUSE_INDIRECT.
* tree.cc (omp_clause_num_ops): Add entry for OMP_CLAUSE_INDIRECT.
(omp_clause_code_name): Likewise.
* tree.h (OMP_CLAUSE_INDIRECT_EXPR): New.
* config/gcn/mkoffload.cc (process_asm): Process offload_ind_funcs
section. Count number of indirect functions.
(process_obj): Emit number of indirect functions.
* config/nvptx/mkoffload.cc (ind_func_ids, ind_funcs_tail): New.
(process): Emit offload_ind_func_table in PTX code. Emit indirect
function names and count in image.
* config/nvptx/nvptx.cc (nvptx_record_offload_symbol): Mark
indirect functions in PTX code with IND_FUNC_MAP.
gcc/testsuite/
* c-c++-common/gomp/declare-target-7.c: Update expected error message.
* c-c++-common/gomp/declare-target-indirect-1.c: New.
* c-c++-common/gomp/declare-target-indirect-2.c: New.
* g++.dg/gomp/attrs-21.C (v12): Update expected error message.
* g++.dg/gomp/declare-target-indirect-1.C: New.
* gcc.dg/gomp/attrs-21.c (v12): Update expected error message.
include/
* gomp-constants.h (GOMP_VERSION): Increment to 3.
(GOMP_VERSION_SUPPORTS_INDIRECT_FUNCS): New.
libgcc/
* offloadstuff.c (OFFLOAD_IND_FUNC_TABLE_SECTION_NAME): New.
(__offload_ind_func_table): New.
(__offload_ind_funcs_end): New.
(__OFFLOAD_TABLE__): Add entries for indirect functions.
libgomp/
* Makefile.am (libgomp_la_SOURCES): Add target-indirect.c.
* Makefile.in: Regenerate.
* libgomp-plugin.h (GOMP_INDIRECT_ADDR_MAP): New define.
(GOMP_OFFLOAD_load_image): Add extra argument.
* libgomp.h (struct indirect_splay_tree_key_s): New.
(indirect_splay_tree_node, indirect_splay_tree,
indirect_splay_tree_key): New.
(indirect_splay_compare): New.
* libgomp.map (GOMP_5.1.1): Add GOMP_target_map_indirect_ptr.
* libgomp.texi (OpenMP 5.1): Update documentation on indirect
calls in target region and on indirect clause.
(Other new OpenMP 5.2 features): Add entry for virtual function calls.
* libgomp_g.h (GOMP_target_map_indirect_ptr): Add prototype.
* oacc-host.c (host_load_image): Add extra argument.
* target.c (gomp_load_image_to_device): If the GOMP_VERSION is high
enough, read host indirect functions table and pass to
load_image_func.
* config/accel/target-indirect.c: New.
* config/linux/target-indirect.c: New.
* config/gcn/team.c (build_indirect_map): Add prototype.
(gomp_gcn_enter_kernel): Initialize support for indirect
function calls on GCN target.
* config/nvptx/team.c (build_indirect_map): Add prototype.
(gomp_nvptx_main): Initialize support for indirect function
calls on NVPTX target.
* plugin/plugin-gcn.c (struct gcn_image_desc): Add field for
indirect functions count.
(GOMP_OFFLOAD_load_image): Add extra argument. If the GOMP_VERSION
is high enough, build address translation table and copy it to target
memory.
* plugin/plugin-nvptx.c (nvptx_tdata): Add field for indirect
functions count.
(GOMP_OFFLOAD_load_image): Add extra argument. If the GOMP_VERSION
is high enough, Build address translation table and copy it to target
memory.
* testsuite/libgomp.c-c++-common/declare-target-indirect-1.c: New.
* testsuite/libgomp.c-c++-common/declare-target-indirect-2.c: New.
* testsuite/libgomp.c++/declare-target-indirect-1.C: New.
As with the memkind library, it is only used when found at runtime;
it does not need to be present when building GCC.
The included testcase does not check whether the memory has been placed
on the nearest node as the Linux kernel memory handling too often ignores
that hint, using a different node for the allocation. However, when
running with 'numactl --preferred=<node> ./executable', it is clearly
visible that the feature works by comparing malloc/default vs. nearest
placement (using get_mempolicy to obtain the node for a mem addr).
libgomp/ChangeLog:
* allocator.c: Add ifdef for LIBGOMP_USE_LIBNUMA.
(enum gomp_numa_memkind_kind): Renamed from gomp_memkind_kind;
add GOMP_MEMKIND_LIBNUMA.
(struct gomp_libnuma_data, gomp_init_libnuma, gomp_get_libnuma): New.
(omp_init_allocator): Handle partition=nearest with libnuma if avail.
(omp_aligned_alloc, omp_free, omp_aligned_calloc, omp_realloc): Add
numa_alloc_local (+ memset), numa_free, and numa_realloc calls as
needed.
* config/linux/allocator.c (LIBGOMP_USE_LIBNUMA): Define
* libgomp.texi: Fix a typo; use 'fi' instead of its ligature char.
(Memory allocation): Renamed from 'Memory allocation with libmemkind';
updated for libnuma usage.
* testsuite/libgomp.c-c++-common/alloc-11.c: New test.
* testsuite/libgomp.c-c++-common/alloc-12.c: New test.
The nteams-var ICV exists per device and can be set either via the routine
omp_set_num_teams or as environment variable (OMP_NUM_TEAMS with optional
_ALL/_DEV/_DEV_<num> suffix); it is default-initialized to zero. The number
of teams created is described under the num_teams clause. If the clause is
absent, the number of teams is implementation defined but at least
one team must exist and, if nteams-var is positive, at most nteams-var
teams may exist.
The latter condition was not honored in a target region before this
commit, such that too many teams were created.
Already before this commit, both the num_teams([lower:]upper) clause
(on the host and in target regions) and, only on the host, the nteams-var
ICV were honored. And as only one teams is created for host fallback,
unless the clause specifies otherwise, the nteams-var ICV was and is
effectively honored.
libgomp/ChangeLog:
PR libgomp/109875
* config/gcn/target.c (GOMP_teams4): Honor nteams-var ICV.
* config/nvptx/target.c (GOMP_teams4): Likewise.
* testsuite/libgomp.c-c++-common/teams-nteams-icv-1.c: New test.
* testsuite/libgomp.c-c++-common/teams-nteams-icv-2.c: New test.
* testsuite/libgomp.c-c++-common/teams-nteams-icv-3.c: New test.
* testsuite/libgomp.c-c++-common/teams-nteams-icv-4.c: New test.
Switch from using stacks in the "private segment" to using a memory block
allocated on the host side. The primary reason is to permit the reverse
offload implementation to access values located on the device stack, but
there may also be performance benefits, especially with repeated kernel
invocations.
This implementation unifies the stacks with the "team arena" optimization
feature, and now allows both to have run-time configurable sizes.
A new ABI is needed, so all libraries must be rebuilt, and newlib must be
version 4.3.0.20230120 or newer.
gcc/ChangeLog:
* config/gcn/gcn-run.cc: Include libgomp-gcn.h.
(struct kernargs): Replace the common content with kernargs_abi.
(struct heap): Delete.
(main): Read GCN_STACK_SIZE envvar.
Allocate space for the device stacks.
Write the new kernargs fields.
* config/gcn/gcn.cc (gcn_option_override): Remove stack_size_opt.
(default_requested_args): Remove PRIVATE_SEGMENT_BUFFER_ARG and
PRIVATE_SEGMENT_WAVE_OFFSET_ARG.
(gcn_addr_space_convert): Mask the QUEUE_PTR_ARG content.
(gcn_expand_prologue): Move the TARGET_PACKED_WORK_ITEMS to the top.
Set up the stacks from the values in the kernargs, not private.
(gcn_expand_builtin_1): Match the stack configuration in the prologue.
(gcn_hsa_declare_function_name): Turn off the private segment.
(gcn_conditional_register_usage): Ensure QUEUE_PTR is fixed.
* config/gcn/gcn.h (FIXED_REGISTERS): Fix the QUEUE_PTR register.
* config/gcn/gcn.opt (mstack-size): Change the description.
include/ChangeLog:
* gomp-constants.h (GOMP_VERSION_GCN): Bump.
libgomp/ChangeLog:
* config/gcn/libgomp-gcn.h (DEFAULT_GCN_STACK_SIZE): New define.
(DEFAULT_TEAM_ARENA_SIZE): New define.
(struct heap): Move to this file.
(struct kernargs_abi): Likewise.
* config/gcn/team.c (gomp_gcn_enter_kernel): Use team arena size from
the kernargs.
* libgomp.h: Include libgomp-gcn.h.
(TEAM_ARENA_SIZE): Remove.
(team_malloc): Update the error message.
* plugin/plugin-gcn.c (struct kernargs): Move common content to
struct kernargs_abi.
(struct agent_info): Rename team arenas to ephemeral memories.
(struct team_arena_list): Rename ....
(struct ephemeral_memories_list): to this.
(struct heap): Delete.
(team_arena_size): New variable.
(stack_size): New variable.
(print_kernel_dispatch): Update debug messages.
(init_environment_variables): Read GCN_TEAM_ARENA_SIZE.
Read GCN_STACK_SIZE.
(get_team_arena): Rename ...
(configure_ephemeral_memories): ... to this, and set up stacks.
(release_team_arena): Rename ...
(release_ephemeral_memories): ... to this.
(destroy_team_arenas): Rename ...
(destroy_ephemeral_memories): ... to this.
(create_kernel_dispatch): Add num_threads parameter.
Adjust for kernargs_abi refactor and ephemeral memories.
(release_kernel_dispatch): Adjust for ephemeral memories.
(run_kernel): Pass thread-count to create_kernel_dispatch.
(GOMP_OFFLOAD_init_device): Adjust for ephemeral memories.
(GOMP_OFFLOAD_fini_device): Adjust for ephemeral memories.
gcc/testsuite/ChangeLog:
* gcc.c-torture/execute/pr47237.c: Xfail on amdgcn.
* gcc.dg/builtin-apply3.c: Xfail for amdgcn.
* gcc.dg/builtin-apply4.c: Xfail for amdgcn.
* gcc.dg/torture/stackalign/builtin-apply-3.c: Xfail for amdgcn.
* gcc.dg/torture/stackalign/builtin-apply-4.c: Xfail for amdgcn.
Instead of trying to have the GPU do CPU-with-OS-like things, this new barriers
implementation for NVPTX uses simplistic bar.* synchronization instructions.
Tasks are processed after threads have joined, and only if team->task_count != 0
It is noted that: there might be a little bit of performance forfeited for
cases where earlier arriving threads could've been used to process tasks ahead
of other threads, but that has the requirement of implementing complex
futex-wait/wake like behavior, which is what we're try to avoid with this patch.
It is deemed that task processing is not what GPU target offloading is usually
used for.
Implementation highlight notes:
1. gomp_team_barrier_wake() is now an empty function (threads never "wake" in
the usual manner)
2. gomp_team_barrier_cancel() now uses the "exit" PTX instruction.
3. gomp_barrier_wait_last() now is implemented using "bar.arrive"
4. gomp_team_barrier_wait_end()/gomp_team_barrier_wait_cancel_end():
The main synchronization is done using a 'bar.red' instruction. This reduces
across all threads the condition (team->task_count != 0), to enable the task
processing down below if any thread created a task.
(this bar.red usage means that this patch is dependent on the prior NVPTX
bar.red GCC patch)
PR target/99555
libgomp/ChangeLog:
* config/nvptx/bar.c (generation_to_barrier): Remove.
(futex_wait,futex_wake,do_spin,do_wait): Remove.
(GOMP_WAIT_H): Remove.
(#include "../linux/bar.c"): Remove.
(gomp_barrier_wait_end): New function.
(gomp_barrier_wait): Likewise.
(gomp_barrier_wait_last): Likewise.
(gomp_team_barrier_wait_end): Likewise.
(gomp_team_barrier_wait): Likewise.
(gomp_team_barrier_wait_final): Likewise.
(gomp_team_barrier_wait_cancel_end): Likewise.
(gomp_team_barrier_wait_cancel): Likewise.
(gomp_team_barrier_cancel): Likewise.
* config/nvptx/bar.h (gomp_barrier_t): Remove waiters, lock fields.
(gomp_barrier_init): Remove init of waiters, lock fields.
(gomp_team_barrier_wake): Remove prototype, add new static inline
function.
This patch adds support for omp_get_max_teams, omp_set_num_teams, and
omp_{gs}et_teams_thread_limit on offload devices. That includes the usage of
device-specific ICV values (specified as environment variables or changed on a
device). In order to reuse device-specific ICV values, a copy back mechanism is
implemented that copies ICV values back from device to the host.
Additionally, a limitation of the number of teams on gcn offload devices is
implemented. The number of teams is limited by twice the number of compute
units (one team is executed on one compute unit). This avoids queueing
unnessecary many teams and a corresponding allocation of large amounts of
memory. Without that limitation the memory allocation for a large number of
user-specified teams can result in an "memory access fault".
A limitation of the number of teams is already also implemented for nvptx
devices (see nvptx_adjust_launch_bounds in libgomp/plugin/plugin-nvptx.c).
gcc/ChangeLog:
* gimplify.cc (optimize_target_teams): Set initial num_teams_upper
to "-2" instead of "1" for non-existing num_teams clause in order to
disambiguate from the case of an existing num_teams clause with value 1.
libgomp/ChangeLog:
* config/gcn/icv-device.c (omp_get_teams_thread_limit): Added to
allow processing of device-specific values.
(omp_set_teams_thread_limit): Likewise.
(ialias): Likewise.
* config/nvptx/icv-device.c (omp_get_teams_thread_limit): Likewise.
(omp_set_teams_thread_limit): Likewise.
(ialias): Likewise.
* icv-device.c (omp_get_teams_thread_limit): Likewise.
(ialias): Likewise.
(omp_set_teams_thread_limit): Likewise.
* icv.c (omp_set_teams_thread_limit): Removed.
(omp_get_teams_thread_limit): Likewise.
(ialias): Likewise.
* libgomp.texi: Updated documentation for nvptx and gcn corresponding
to the limitation of the number of teams.
* plugin/plugin-gcn.c (limit_teams): New helper function that limits
the number of teams by twice the number of compute units.
(parse_target_attributes): Limit the number of teams on gcn offload
devices.
* target.c (get_gomp_offload_icvs): Added teams_thread_limit_var
handling.
(gomp_load_image_to_device): Added a size check for the ICVs struct
variable.
(gomp_copy_back_icvs): New function that is used in GOMP_target_ext to
copy back the ICV values from device to host.
(GOMP_target_ext): Update the number of teams and threads in the kernel
args also considering device-specific values.
* testsuite/libgomp.c-c++-common/icv-4.c: Fixed an error in the reading
of OMP_TEAMS_THREAD_LIMIT from the environment.
* testsuite/libgomp.c-c++-common/icv-5.c: Extended.
* testsuite/libgomp.c-c++-common/icv-6.c: Extended.
* testsuite/libgomp.c-c++-common/icv-7.c: Extended.
* testsuite/libgomp.c-c++-common/icv-9.c: New test.
* testsuite/libgomp.fortran/icv-5.f90: New test.
* testsuite/libgomp.fortran/icv-6.f90: New test.
gcc/testsuite/ChangeLog:
* c-c++-common/gomp/target-teams-1.c: Adapt expected values for
num_teams from "1" to "-2" in cases without num_teams clause.
* g++.dg/gomp/target-teams-1.C: Likewise.
* gfortran.dg/gomp/defaultmap-4.f90: Likewise.
* gfortran.dg/gomp/defaultmap-5.f90: Likewise.
* gfortran.dg/gomp/defaultmap-6.f90: Likewise.
output.printf_data.(value union) contains text[128], which has the size
of 128 bytes, sufficient for 16 uint64_t variables; hence value_u64[2]
could be extended to value_u64[6] - sufficient for all required arguments
to gomp_target_rev. Additionally, next_output.printf_data.(msg union)
contained msg_u64 which then is no longer needed and also caused 32bit
vs 64bit alignment issues.
libgomp/
* config/gcn/libgomp-gcn.h (struct output):
Remove 'msg_u64' from the union, change
value_u64[2] to value_u64[6].
* config/gcn/target.c (GOMP_target_ext): Update accordingly.
* plugin/plugin-gcn.c (process_reverse_offload, console_output):
Likewise.
libgomp/ChangeLog:
* config/gcn/libgomp-gcn.h: New file; contains
struct output, declared previously in plugin-gcn.c.
* config/gcn/target.c: Include it.
(GOMP_ADDITIONAL_ICVS): Declare as extern var.
(GOMP_target_ext): Handle reverse offload.
* plugin/plugin-gcn.c: Include libgomp-gcn.h.
(struct kernargs): Replace struct def by the one
from libgomp-gcn.h for output_data.
(process_reverse_offload): New.
(console_output): Call it.
Add __builtin_gcn_kernarg_ptr to avoid using hard-coded register values
and permit future ABI changes while keeping the API.
gcc/ChangeLog:
* config/gcn/gcn-builtins.def (KERNARG_PTR): Add.
* config/gcn/gcn.cc (gcn_init_builtin_types): Change siptr_type_node,
sfptr_type_node and voidptr_type_node from FLAT to ADDR_SPACE_DEFAULT.
(gcn_expand_builtin_1): Handle GCN_BUILTIN_KERNARG_PTR.
(gcn_oacc_dim_size): Return in ADDR_SPACE_FLAT.
libgomp/ChangeLog:
* config/gcn/team.c (gomp_gcn_enter_kernel): Use
__builtin_gcn_kernarg_ptr instead of asm ("s8").
Co-Authored-By: Andrew Stubbs <ams@codesourcery.com>
This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the offload gomp_target_ext call
to the host by setting values in a struct on the device and querying it on
the host - invoking gomp_target_rev on the result.
include/ChangeLog:
* cuda/cuda.h (enum CUdevice_attribute): Add
CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
(CU_MEMHOSTALLOC_DEVICEMAP): Define.
(cuMemHostAlloc): Add prototype.
libgomp/ChangeLog:
* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
'static' for this variable.
* config/nvptx/libgomp-nvptx.h: New file.
* config/nvptx/target.c: Include it.
(GOMP_ADDITIONAL_ICVS): Declare extern var.
(GOMP_REV_OFFLOAD_VAR): Declare var.
(GOMP_target_ext): Handle reverse offload.
* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
* target.c (gomp_target_rev): ... this new stub function.
* libgomp.h (gomp_target_rev): Declare.
* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
* plugin/cuda-lib.def (cuMemHostAlloc): Add.
* plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
(struct ptx_device): Add rev_data member.
(nvptx_open_device): Remove async_engines query, last used in
r10-304-g1f4c5b9b; add unified-address assert check.
(GOMP_OFFLOAD_get_num_devices): Claim unified address
support.
(GOMP_OFFLOAD_load_image): Free rev_fn_table if no
offload functions exist. Make offload var available
on host and device.
(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
(GOMP_OFFLOAD_run): Handle reverse offload.
This patch considers the environment variable syntax extension for
device-specific variants of environment variables from OpenMP 5.1 (see
OpenMP 5.1 specification, p. 75 and p. 639). An environment variable (e.g.
OMP_NUM_TEAMS) can have different suffixes:
_DEV (e.g. OMP_NUM_TEAMS_DEV): affects all devices but not the host.
_DEV_<device> (e.g. OMP_NUM_TEAMS_DEV_42): affects only device with
number <device>.
no suffix (e.g. OMP_NUM_TEAMS): affects only the host.
In future OpenMP versions also suffix _ALL will be introduced (see discussion
https://github.com/OpenMP/spec/issues/3179). This is also considered in this
patch:
_ALL (e.g. OMP_NUM_TEAMS_ALL): affects all devices and the host.
The precedence is as follows (descending). For the host:
1. no suffix
2. _ALL
For devices:
1. _DEV_<device>
2. _DEV
3. _ALL
That means, _DEV_<device> is used whenever available. Otherwise _DEV is used if
available, and at last _ALL. If there is no value for any of the variable
variants, default values are used as already implemented before.
This patch concerns parsing (a), storing (b), output (c) and transmission to the
device (d):
(a) The actual number of devices and the numbering are not known when parsing
the environment variables. Thus all environment variables are iterated and
searched for device-specific ones.
(b) Only configured device-specific variables are stored. Thus, a linked list
is used.
(c) The output is done in omp_display_env (see specification p. 468f). Global
ICVs are tagged with [all], see https://github.com/OpenMP/spec/issues/3179.
ICVs which are not global but aren't handled device-specific yet are tagged
with [host]. omp_display_env outputs the initial values of the ICVs. That is
why a dedicated data structure is introduced for the inital values only
(gomp_initial_icv_list).
(d) Device-specific ICVs are transmitted to the device via GOMP_ADDITIONAL_ICVS.
libgomp/ChangeLog:
* config/gcn/icv-device.c (omp_get_default_device): Return device-
specific ICV.
(omp_get_max_teams): Added for GCN devices.
(omp_set_num_teams): Likewise.
(ialias): Likewise.
* config/nvptx/icv-device.c (omp_get_default_device): Return device-
specific ICV.
(omp_get_max_teams): Added for NVPTX devices.
(omp_set_num_teams): Likewise.
(ialias): Likewise.
* env.c (struct gomp_icv_list): New struct to store entries of initial
ICV values.
(struct gomp_offload_icv_list): New struct to store entries of device-
specific ICV values that are copied to the device and back.
(struct gomp_default_icv_values): New struct to store default values of
ICVs according to the OpenMP standard.
(parse_schedule): Generalized for different variants of OMP_SCHEDULE.
(print_env_var_error): Function that prints an error for invalid values
for ICVs.
(parse_unsigned_long_1): Removed getenv. Generalized.
(parse_unsigned_long): Likewise.
(parse_int_1): Likewise.
(parse_int): Likewise.
(parse_int_secure): Likewise.
(parse_unsigned_long_list): Likewise.
(parse_target_offload): Likewise.
(parse_bind_var): Likewise.
(parse_stacksize): Likewise.
(parse_boolean): Likewise.
(parse_wait_policy): Likewise.
(parse_allocator): Likewise.
(omp_display_env): Extended to output different variants of environment
variables.
(print_schedule): New helper function for omp_display_env which prints
the values of run_sched_var.
(print_proc_bind): New helper function for omp_display_env which prints
the values of proc_bind_var.
(enum gomp_parse_type): Collection of types used for parsing environment
variables.
(ENTRY): Preprocess string lengths of environment variables.
(OMP_VAR_CNT): Preprocess table size.
(OMP_HOST_VAR_CNT): Likewise.
(INT_MAX_STR_LEN): Constant for the maximal number of digits of a device
number.
(gomp_get_icv_flag): Returns if a flag for a particular ICV is set.
(gomp_set_icv_flag): Sets a flag for a particular ICV.
(print_device_specific_icvs): New helper function for omp_display_env to
print device specific ICV values.
(get_device_num): New helper function for parse_device_specific.
Extracts the device number from an environment variable name.
(get_icv_member_addr): Gets the memory address for a particular member
of an ICV struct.
(gomp_get_initial_icv_item): Get a list item of gomp_initial_icv_list.
(initialize_icvs): New function to initialize a gomp_initial_icvs
struct.
(add_initial_icv_to_list): Adds an ICV struct to gomp_initial_icv_list.
(startswith): Checks if a string starts with a given prefix.
(initialize_env): Extended to parse the new syntax of environment
variables.
* icv-device.c (omp_get_max_teams): Added.
(ialias): Likewise.
(omp_set_num_teams): Likewise.
* icv.c (omp_set_num_teams): Moved to icv-device.c.
(omp_get_max_teams): Likewise.
(ialias): Likewise.
* libgomp-plugin.h (GOMP_DEVICE_NUM_VAR): Removed.
(GOMP_ADDITIONAL_ICVS): New target-side struct that
holds the designated ICVs of the target device.
* libgomp.h (enum gomp_icvs): Collection of ICVs.
(enum gomp_device_num): Definition of device numbers for _ALL, _DEV, and
no suffix.
(enum gomp_env_suffix): Collection of possible suffixes of environment
variables.
(struct gomp_initial_icvs): Contains all ICVs for which we need to store
initial values.
(struct gomp_default_icv):New struct to hold ICVs for which we need
to store initial values.
(struct gomp_icv_list): Definition of a linked list that is used for
storing ICVs for the devices and also for _DEV, _ALL, and without
suffix.
(struct gomp_offload_icvs): New struct to hold ICVs that are copied to
a device.
(struct gomp_offload_icv_list): Definition of a linked list that holds
device-specific ICVs that are copied to devices.
(gomp_get_initial_icv_item): Get a list item of gomp_initial_icv_list.
(gomp_get_icv_flag): Returns if a flag for a particular ICV is set.
* libgomp.texi: Updated.
* plugin/plugin-gcn.c (GOMP_OFFLOAD_load_image): Extended to read
further ICVs from the offload image.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_load_image): Likewise.
* target.c (gomp_get_offload_icv_item): Get a list item of
gomp_offload_icv_list.
(get_gomp_offload_icvs): New. Returns the ICV values
depending on the device num and the variable hierarchy.
(gomp_load_image_to_device): Extended to copy further ICVs to a device.
* testsuite/libgomp.c-c++-common/icv-5.c: New test.
* testsuite/libgomp.c-c++-common/icv-6.c: New test.
* testsuite/libgomp.c-c++-common/icv-7.c: New test.
* testsuite/libgomp.c-c++-common/icv-8.c: New test.
* testsuite/libgomp.c-c++-common/omp-display-env-1.c: New test.
* testsuite/libgomp.c-c++-common/omp-display-env-2.c: New test.
As reported by Richard Sandiford, #include "../../../allocator.c"
has one too many ../s, dunno why it worked for me when using
../configure (VPATH = ../../../libgomp)
2022-06-09 Jakub Jelinek <jakub@redhat.com>
* config/linux/allocator.c: Fix up #include directive.
This patch adds support for dlopening libmemkind.so on Linux and uses it
for some kinds of allocations (but not yet e.g. pinned memory).
2022-06-09 Jakub Jelinek <jakub@redhat.com>
* allocator.c: Include dlfcn.h if LIBGOMP_USE_MEMKIND is defined.
(enum gomp_memkind_kind): New type.
(struct omp_allocator_data): Add memkind field if LIBGOMP_USE_MEMKIND
is defined.
(struct gomp_memkind_data): New type.
(memkind_data, memkind_data_once): New variables.
(gomp_init_memkind, gomp_get_memkind): New functions.
(omp_init_allocator): Initialize data.memkind, don't fail for
omp_high_bw_mem_space if libmemkind supports it.
(omp_aligned_alloc, omp_free, omp_aligned_calloc, omp_realloc): Add
memkind support of LIBGOMP_USE_MEMKIND is defined.
* config/linux/allocator.c: New file.
On Thu, Nov 11, 2021 at 02:14:05PM +0100, Thomas Schwinge wrote:
> There appears to be yet another issue: there still are quite a number of
> 'FAIL: libgomp.c/places-10.c execution test' reports on
> <gcc-testresults@gcc.gnu.org>. Also in my testing testing, on a system
> where '/sys/devices/system/node/online' contains '0-1', I get a FAIL:
>
> [...]
> OPENMP DISPLAY ENVIRONMENT BEGIN
> _OPENMP = '201511'
> OMP_DYNAMIC = 'FALSE'
> OMP_NESTED = 'FALSE'
> OMP_NUM_THREADS = '8'
> OMP_SCHEDULE = 'DYNAMIC'
> OMP_PROC_BIND = 'TRUE'
> OMP_PLACES = '{0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30},{FAIL: libgomp.c/places-10.c execution test
I've finally managed to debug this (by dumping used /sys/ files from
an affected system in Fedora build system, replacing /sys/ with /tmp/
in gcc sources and populating there those files), I think following patch
ought to fix it.
2022-03-18 Jakub Jelinek <jakub@redhat.com>
* config/linux/affinity.c (gomp_affinity_init_numa_domains): Move seen
variable next to pl variable.
Consider the following omp fragment.
...
#pragma omp target
#pragma omp parallel num_threads (2)
#pragma omp task
;
...
This hangs at -O0 for nvptx.
Investigating the behaviour gives us the following trace of events:
- both threads execute GOMP_task, where they:
- deposit a task, and
- execute gomp_team_barrier_wake
- thread 1 executes gomp_team_barrier_wait_end and, not being the last thread,
proceeds to wait at the team barrier
- thread 0 executes gomp_team_barrier_wait_end and, being the last thread, it
calls gomp_barrier_handle_tasks, where it:
- executes both tasks and marks the team barrier done
- executes a gomp_team_barrier_wake which wakes up thread 1
- thread 1 exits the team barrier
- thread 0 returns from gomp_barrier_handle_tasks and goes to wait at
the team barrier.
- thread 0 hangs.
To understand why there is a hang here, it's good to understand how things
are setup for nvptx. The libgomp/config/nvptx/bar.c implementation is
a copy of the libgomp/config/linux/bar.c implementation, with uses of both
futex_wake and do_wait replaced with uses of ptx insn bar.sync:
...
if (bar->total > 1)
asm ("bar.sync 1, %0;" : : "r" (32 * bar->total));
...
The point where thread 0 goes to wait at the team barrier, corresponds in
the linux implementation with a do_wait. In the linux case, the call to
do_wait doesn't hang, because it's waiting for bar->generation to become
a certain value, and if bar->generation already has that value, it just
proceeds, without any need for coordination with other threads.
In the nvtpx case, the bar.sync waits until thread 1 joins it in the same
logical barrier, which never happens: thread 1 is lingering in the
thread pool at the thread pool barrier (using a different logical barrier),
waiting to join a new team.
The easiest way to fix this is to revert to the posix implementation for
bar.{c,h}. That however falls back on a busy-waiting approach, and
does not take advantage of the ptx bar.sync insn.
Instead, we revert to the linux implementation for bar.c,
and implement bar.c local functions futex_wait and futex_wake using the
bar.sync insn.
The bar.sync insn takes an argument specifying how many threads are
participating, and that doesn't play well with the futex syntax where it's
not clear in advance how many threads will be woken up.
This is solved by waking up all waiting threads each time a futex_wait or
futex_wake happens, and possibly going back to sleep with an updated thread
count.
Tested libgomp on x86_64 with nvptx accelerator.
libgomp/ChangeLog:
2021-04-20 Tom de Vries <tdevries@suse.de>
PR target/99555
* config/nvptx/bar.c (generation_to_barrier): New function, copied
from config/rtems/bar.c.
(futex_wait, futex_wake): New function.
(do_spin, do_wait): New function, copied from config/linux/wait.h.
(gomp_barrier_wait_end, gomp_barrier_wait_last)
(gomp_team_barrier_wake, gomp_team_barrier_wait_end):
(gomp_team_barrier_wait_cancel_end, gomp_team_barrier_cancel): Remove
and replace with include of config/linux/bar.c.
* config/nvptx/bar.h (gomp_barrier_t): Add fields waiters and lock.
(gomp_barrier_init): Init new fields.
* testsuite/libgomp.c-c++-common/task-detach-6.c: Remove nvptx-specific
workarounds.
* testsuite/libgomp.c/pr99555-1.c: Same.
* testsuite/libgomp.fortran/task-detach-6.f90: Same.
Currently omp_get_device_num does not work on gcn targets with more than one
offload device. The reason is that GOMP_DEVICE_NUM_VAR is static in
icv-device.c and thus "__gomp_device_num" is not visible in the offload image.
This patch removes "static" such that "__gomp_device_num" is now part of the
offload image and can now be found in GOMP_OFFLOAD_load_image in the plugin.
This is not an issue for nvptx. There, "__gomp_device_num" is in the offload
image even with "static".
libgomp/ChangeLog:
* config/gcn/icv-device.c: Make GOMP_DEVICE_NUM_VAR public (remove
"static") to make the device num available in the offload image.
OpenMP 5.1 says that thread_limit clause can also appear on target,
and similarly to teams should affect the thread-limit-var ICV.
On combined target teams, the clause goes to both.
We actually passed thread_limit internally on target already before,
but only used it for gcn/ptx offloading to hint how many threads should be
created and for ptx didn't set thread_limit_var in that case.
Similarly for host fallback.
Also, I found that we weren't copying the args array that contains encoded
thread_limit and num_teams clause for target (etc.) for async target.
2021-11-15 Jakub Jelinek <jakub@redhat.com>
gcc/
* gimplify.c (optimize_target_teams): Only add OMP_CLAUSE_THREAD_LIMIT
to OMP_TARGET_CLAUSES if it isn't there already.
gcc/c-family/
* c-omp.c (c_omp_split_clauses) <case OMP_CLAUSE_THREAD_LIMIT>:
Duplicate to both OMP_TARGET and OMP_TEAMS.
gcc/c/
* c-parser.c (OMP_TARGET_CLAUSE_MASK): Add
PRAGMA_OMP_CLAUSE_THREAD_LIMIT.
gcc/cp/
* parser.c (OMP_TARGET_CLAUSE_MASK): Add
PRAGMA_OMP_CLAUSE_THREAD_LIMIT.
libgomp/
* task.c (gomp_create_target_task): Copy args array as well.
* target.c (gomp_target_fallback): Add args argument.
Set gomp_icv (true)->thread_limit_var if thread_limit is present.
(GOMP_target): Adjust gomp_target_fallback caller.
(GOMP_target_ext): Likewise.
(gomp_target_task_fn): Likewise.
* config/nvptx/team.c (gomp_nvptx_main): Set
gomp_global_icv.thread_limit_var.
* testsuite/libgomp.c-c++-common/thread-limit-1.c: New test.
Here is a PTX implementation of what I was talking about, that for
num_teams_upper 0 or whenever num_teams_lower <= num_blocks, the current
implementation is fine but if the user explicitly asks for more
teams than we can provide in hardware, we need to stop assuming that
omp_get_team_num () is equal to the hw team id, but instead need to use some
team specific memory (it is .shared for PTX), or if none is
provided, array indexed by the hw team id and run some teams serially within
the same hw thread.
2021-11-15 Jakub Jelinek <jakub@redhat.com>
* config/nvptx/team.c (__gomp_team_num): Define as
__attribute__((shared)) var.
(gomp_nvptx_main): Initialize __gomp_team_num to 0.
* config/nvptx/target.c (__gomp_team_num): Declare as
extern __attribute__((shared)) var.
(GOMP_teams4): Use __gomp_team_num as the team number instead of
%ctaid.x. If first, initialize it to %ctaid.x. If num_teams_lower
is bigger than num_blocks, use num_teams_lower teams and arrange for
bumping of __gomp_team_num if !first and returning false once we run
out of teams.
* config/nvptx/teams.c (__gomp_team_num): Declare as
extern __attribute__((shared)) var.
(omp_get_team_num): Return __gomp_team_num value instead of %ctaid.x.
My recent libgomp change apparently broke libgomp build for gcn offloading.
The problem is that gcn, unlike nvptx, doesn't override teams.c source file
and the patch I've committed assumed all the non-LIBGOMP_USE_PTHREADS targets
do not use it. My understanding is that gcn included omp_get_num_teams
and omp_get_team_num definitions in both icv-device.o and teams.o,
with the definitions only in the former working correctly.
This patch brings gcn into sync with how nvptx does it, that teams.c
is overridden, provides a dummy GOMP_teams_reg and omp_get_{num_teams,team_num}
definitions and icv-device.c doesn't provide those.
2021-11-12 Jakub Jelinek <jakub@redhat.com>
PR target/103201
* config/gcn/icv-device.c (omp_get_num_teams, omp_get_team_num): Move
to ...
* config/gcn/teams.c: ... here. New file.
The following patch implements what I've been talking about earlier,
honor that for explicit num_teams clause we create at least the
lower-bound (if not specified, upper-bound) teams in the league.
For host fallback, it still means we only have one thread doing all the
teams, sequentially one after another.
For PTX and GCN, I think the new teams-2.c test and maybe teams-4.c too
will or might fail.
For these offloads, I think it is ok to remove symbols no longer used
from libgomp.a.
If num_teams_lower is bigger than the provided num_blocks or num_workgroups,
we should arrange for gomp_num_teams_var to be num_teams_lower - 1,
stop using the %ctaid.x or __builtin_gcn_dim_pos (0) for omp_get_team_num ()
and instead use for it some .shared var that GOMP_teams4 initializes to
%ctaid.x or __builtin_gcn_dim_pos (0) when first and for !first
increment that by num_blocks or num_workgroups each time and only
return false when we are above num_teams_lower.
Any help with actually implementing this for the 2 architectures highly
appreciated.
2021-11-12 Jakub Jelinek <jakub@redhat.com>
gcc/
* omp-builtins.def (BUILT_IN_GOMP_TEAMS): Remove.
(BUILT_IN_GOMP_TEAMS4): New.
* builtin-types.def (BT_FN_VOID_UINT_UINT): Remove.
(BT_FN_BOOL_UINT_UINT_UINT_BOOL): New.
* omp-low.c (lower_omp_teams): Use GOMP_teams4 instead of
GOMP_teams, pass to it also num_teams lower-bound expression
or a dup of upper-bound if it is missing and a flag whether
it is the first call or not.
gcc/fortran/
* types.def (BT_FN_VOID_UINT_UINT): Remove.
(BT_FN_BOOL_UINT_UINT_UINT_BOOL): New.
libgomp/
* libgomp_g.h (GOMP_teams4): Declare.
* libgomp.map (GOMP_5.1): Export GOMP_teams4.
* target.c (GOMP_teams4): New function.
* config/nvptx/target.c (GOMP_teams): Remove.
(GOMP_teams4): New function.
* config/gcn/target.c (GOMP_teams): Remove.
(GOMP_teams4): New function.
* testsuite/libgomp.c/teams-4.c (main): Expect exactly 2
teams instead of <= 2.
* testsuite/libgomp.c-c++-common/teams-2.c: New test.
If numa-domains is used with num-places count, sometimes the function
could create more places than requested and crash. This depended on the
content of /sys/devices/system/node/online file, e.g. if the file
contains
0-1,16-17
and all NUMA nodes contain at least one CPU in the cpuset of the program,
then numa_domains(2) or numa_domains(4) (or 5+) work fine while
numa_domains(1) or numa_domains(3) misbehave. I.e. the function was able
to stop after reaching limit on the , separators (or trivially at the end),
but not within in the ranges.
2021-10-18 Jakub Jelinek <jakub@redhat.com>
* config/linux/affinity.c (gomp_affinity_init_numa_domains): Add
&& gomp_places_list_len < count after nfirst <= nlast loop condition.
Yesterday when working on numa_domains, I've noticed because of a bug
in my patch a hang on a large NUMA machine. I've fixed the bug, but
also discovered that the hang was a result of making wrong assumptions
about strtoul/strtoull. All the uses were for portability setting
errno = 0 before the calls and treating non-zero errno after the call
as invalid input, but for the case where there are no valid digits at
all strtoul may set errno to EINVAL, but doesn't have to and with
glibc doesn't do that. So, this patch goes through all the strtoul calls
and next to errno != 0 checks adds also endptr == startptr check.
Haven't done it in places where we immediately reject strtoul returning 0
the same as we reject errno != 0, because strtoul must return 0 in the
case where it sets endptr to the start pointer. In some spots the code
was using errno = 0; x = strtoul (p, &p, 10); if (errno) { /*invalid*/ }
and those spots had to be changed to
errno = 0; x = strtoul (p, &end, 10); if (errno || end == p) { /*invalid*/ }
p = end;
2021-10-15 Jakub Jelinek <jakub@redhat.com>
* env.c (parse_schedule): For strtoul or strtoull calls which don't
clearly reject return value 0 as invalid handle the case where end
pointer is the same as first argument as invalid.
(parse_unsigned_long_1): Likewise.
(parse_one_place): Likewise.
(parse_places_var): Likewise.
(parse_stacksize): Likewise.
(parse_spincount): Likewise.
(parse_affinity): Likewise.
(parse_gomp_openacc_dim): Likewise. Avoid strict aliasing violation.
Make code valid C89.
* config/linux/affinity.c (gomp_affinity_find_last_cache_level):
For strtoul calls which don't clearly reject return value 0 as
invalid handle the case where end pointer is the same as first
argument as invalid.
(gomp_affinity_init_level_1): Likewise.
(gomp_affinity_init_numa_domains): Likewise.
* config/rtems/proc.c (parse_thread_pools): Likewise.
When writing the places-*.c tests, I've noticed that we mishandle threads
abstract name with specified num-places if num-places isn't a multiple of
number of hw threads in a core. It then happily ignores the maximum count
and overwrites for the remaining hw threads in a core further places that
haven't been allocated.
2021-10-15 Jakub Jelinek <jakub@redhat.com>
* config/linux/affinity.c (gomp_affinity_init_level_1): For level 1
after creating count places clean up and return immediately.
* testsuite/libgomp.c/places-6.c: New test.
* testsuite/libgomp.c/places-7.c: New test.
* testsuite/libgomp.c/places-8.c: New test.
* testsuite/libgomp.c/places-9.c: New test.
* testsuite/libgomp.c/places-10.c: New test.
This adds support for numa_domains abstract name in OMP_PLACES, also new
in OpenMP 5.1.
Way to test this is
OMP_PLACES=numa_domains OMP_DISPLAY_ENV=true LD_PRELOAD=.libs/libgomp.so.1 /bin/true
and see what it prints on OMP_PLACES line.
For non-NUMA machines it should print a single place that covers all CPUs,
for NUMA machine one place for each NUMA node with corresponding CPUs.
2021-10-15 Jakub Jelinek <jakub@redhat.com>
* env.c (parse_places_var): Handle numa_domains as level 5.
* config/linux/affinity.c (gomp_affinity_init_numa_domains): New
function.
(gomp_affinity_init_level): Use it instead of
gomp_affinity_init_level_1 for level == 5.
* testsuite/libgomp.c/places-5.c: New test.
This patch implements support for ll_caches abstract name in OMP_PLACES,
which stands for places where logical cpus in each place share the last
level cache.
This seems to work fine for me on x86 and kernel sources show that it is
in common code, but on some machines on CompileFarm the files I'm using,
i.e.
/sys/devices/system/cpu/cpuN/cache/indexN/level
/sys/devices/system/cpu/cpuN/cache/indexN/shared_cpu_list
don't exist, is that because they have too old kernel and newer kernels
are fine or should I implement some fallback methods (which)?
E.g. on gcc112.fsffrance.org I see just shared_cpu_map and not shared_cpu_list
(with shared_cpu_map being harder to parse) and on another box I didn't even
see the cache subdirectories.
Way to test this is
OMP_PLACES=ll_caches OMP_DISPLAY_ENV=true LD_PRELOAD=.libs/libgomp.so.1 /bin/true
and see what it prints on OMP_PLACES line.
2021-10-15 Jakub Jelinek <jakub@redhat.com>
* env.c (parse_places_var): Handle ll_caches as level 4.
* config/linux/affinity.c (gomp_affinity_find_last_cache_level): New
function.
(gomp_affinity_init_level_1): Handle level 4 as logical cpus sharing
last level cache.
(gomp_affinity_init_level): Likewise.
* testsuite/libgomp.c/places-1.c: New test.
* testsuite/libgomp.c/places-2.c: New test.
* testsuite/libgomp.c/places-3.c: New test.
* testsuite/libgomp.c/places-4.c: New test.
... and add a minimum amount of offloading testing.
(Leaving aside that 'fwrite' to 'stderr' probably wouldn't work anyway) the
'fwrite' calls in 'libgomp/error.c:GOMP_warning', 'libgomp/error.c:GOMP_error'
drag in 'isatty', which isn't provided by my nvptx newlib build at present, so
we get, for example:
[...]
FAIL: libgomp.c/../libgomp.c-c++-common/declare_target-1.c (test for excess errors)
Excess errors:
unresolved symbol isatty
mkoffload: fatal error: [...]/build-gcc/./gcc/x86_64-pc-linux-gnu-accel-nvptx-none-gcc returned 1 exit status
[...]
..., and many more.
Fix up for recent commit 0d973c0a0d
"openmp: Implement the error directive".
libgomp/
* config/nvptx/error.c (fwrite, exit): Override, too.
* testsuite/libgomp.c-c++-common/error-1.c: Add a minimum amount
of offloading testing.
* testsuite/libgomp.fortran/error-1.f90: Likewise.
This patch implements the omp_get_device_num library routine, specified in
OpenMP 5.0.
GOMP_DEVICE_NUM_VAR is a macro symbol which defines name of a "device number"
variable, is defined on the device-side libgomp, has it's address returned to
host-side libgomp during device initialization, and the host libgomp then
sets its value to the designated device number.
libgomp/ChangeLog:
* icv-device.c (omp_get_device_num): New API function, host side.
* fortran.c (omp_get_device_num_): New interface function.
* libgomp-plugin.h (GOMP_DEVICE_NUM_VAR): Define macro symbol.
* libgomp.map (OMP_5.0.2): New version space with omp_get_device_num,
omp_get_device_num_.
* libgomp.texi (omp_get_device_num): Add documentation for new API
function.
* omp.h.in (omp_get_device_num): Add declaration.
* omp_lib.f90.in (omp_get_device_num): Likewise.
* omp_lib.h.in (omp_get_device_num): Likewise.
* target.c (gomp_load_image_to_device): If additional entry for device
number exists at end of returned entries from 'load_image_func' hook,
copy the assigned device number over to the device variable.
* config/gcn/icv-device.c (GOMP_DEVICE_NUM_VAR): Define static global.
(omp_get_device_num): New API function, device side.
* plugin/plugin-gcn.c ("symcat.h"): Add include.
(GOMP_OFFLOAD_load_image): Add addresses of device GOMP_DEVICE_NUM_VAR
at end of returned 'target_table' entries.
* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Define static global.
(omp_get_device_num): New API function, device side.
* plugin/plugin-nvptx.c ("symcat.h"): Add include.
(GOMP_OFFLOAD_load_image): Add addresses of device GOMP_DEVICE_NUM_VAR
at end of returned 'target_table' entries.
* testsuite/lib/libgomp.exp
(check_effective_target_offload_target_intelmic): New function for
testing for intelmic offloading.
* testsuite/libgomp.c-c++-common/target-45.c: New test.
* testsuite/libgomp.fortran/target10.f90: New test.
With yesterday's commit 9f2bc5077d "[gcn]
Work-around libgomp 'error: array subscript 0 is outside array bounds of
‘__lds struct gomp_thread * __lds[0]’ [-Werror=array-bounds]' [PR101484]",
I did defuse the "unexpected" '-Werror=array-bounds' diagnostics that we see
as of commit a110855667 "Correct handling of
variable offset minus constant in -Warray-bounds [PR100137]". However, these
'#pragma GCC diagnostic [...]' directives cause some code generation changes
(that seems unexpected, problematic!), which results in a lot (ten thousands)
of 'GCN team arena exhausted' run-time diagnostics, also leading to a few
FAILs:
PASS: libgomp.c/../libgomp.c-c++-common/for-11.c (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-11.c execution test
PASS: libgomp.c/../libgomp.c-c++-common/for-12.c (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-12.c execution test
PASS: libgomp.c/../libgomp.c-c++-common/for-3.c (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-3.c execution test
PASS: libgomp.c/../libgomp.c-c++-common/for-5.c (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-5.c execution test
PASS: libgomp.c/../libgomp.c-c++-common/for-6.c (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-6.c execution test
PASS: libgomp.c/../libgomp.c-c++-common/for-9.c (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/for-9.c execution test
Same for 'libgomp.c++'.
It remains to be analyzed how '#pragma GCC diagnostic [...]' directives can
cause code generation changes; for now I'm working around the "unexpected"
'-Werror=array-bounds' diagnostics differently.
Overall, still awaiting a different solution, of course.
libgomp/
PR target/101484
* configure.tgt [amdgcn*-*-*] (XCFLAGS): Add
'-Wno-error=array-bounds'.
* config/gcn/team.c: Remove '-Werror=array-bounds' work-around.
* libgomp.h [__AMDGCN__]: Likewise.
... seen as of commit a110855667 "Correct
handling of variable offset minus constant in -Warray-bounds [PR100137]".
Awaiting a different solution, of course.
libgomp/
PR target/101484
* config/gcn/team.c: Apply '-Werror=array-bounds' work-around.
* libgomp.h [__AMDGCN__]: Likewise.
sem.h is included in between # pragma GCC visibility push(hidden)
and # pragma GCC visibility pop and includes limits.h there, which
since the introduction of sysconf declaration in recent glibcs
in there causes trouble. libgomp assumes it is compiled by gcc,
so we don't really need to include limits.h there and can use
-__INT_MAX__ - 1 instead (which clang and icc support too for years).
2021-07-13 Jakub Jelinek <jakub@redhat.com>
Florian Weimer <fweimer@redhat.com>
* config/linux/sem.h: Don't include limits.h.
(SEM_WAIT): Define to -__INT_MAX__ - 1 instead of INT_MIN.
* config/linux/affinity.c: Include limits.h.
OpenMP Nesting of Regions restrictions say:
- If a target update, target data, target enter data, or target exit data
construct is encountered during execution of a target region, the behavior is unspecified.
- If a target construct is encountered during execution of a target region and a device
clause in which the ancestor device-modifier appears is not present on the construct, the
behavior is unspecified.
That wording is about the dynamic (runtime) behavior, not about lexical nesting,
so while it is UB if omp target * is encountered in the target region, we need to make
it compile and link (for lexical nesting of target * inside of target we actually
emit a warning).
To make this work, I had to do multiple changes.
One was to mark .omp_data_{sizes,kinds}.* variables when static as "omp declare target".
Another one was to add stub GOMP_target* entrypoints to nvptx and gcn libgomp.a.
The entrypoint functions shouldn't be called or passed in the offload regions,
otherwise
libgomp: cuLaunchKernel error: too many resources requested for launch
was reported; fixed by changing those arguments of calls to GOMP_target_ext
to NULL.
And we didn't mark the entrypoints "omp target entrypoint" when the caller
has been "omp declare target".
2021-05-26 Jakub Jelinek <jakub@redhat.com>
PR libgomp/100573
gcc/
* omp-low.c: Include omp-offload.h.
(create_omp_child_function): If current_function_decl has
"omp declare target" attribute and is_gimple_omp_offloaded,
remove that attribute from the copy of attribute list and
add "omp target entrypoint" attribute instead.
(lower_omp_target): Mark .omp_data_sizes.* and .omp_data_kinds.*
variables for offloading if in omp_maybe_offloaded_ctx.
* omp-offload.c (pass_omp_target_link::execute): Nullify second
argument to GOMP_target_data_ext in offloaded code.
libgomp/
* config/nvptx/target.c (GOMP_target_ext, GOMP_target_data_ext,
GOMP_target_end_data, GOMP_target_update_ext,
GOMP_target_enter_exit_data): New dummy entrypoints.
* config/gcn/target.c (GOMP_target_ext, GOMP_target_data_ext,
GOMP_target_end_data, GOMP_target_update_ext,
GOMP_target_enter_exit_data): Likewise.
* testsuite/libgomp.c-c++-common/for-3.c (DO_PRAGMA, OMPTEAMS,
OMPFROM, OMPTO): Define.
(main): Remove #pragma omp target teams around all the tests.
* testsuite/libgomp.c-c++-common/target-41.c: New test.
* testsuite/libgomp.c-c++-common/target-42.c: New test.
Linux man-pages 5.07 wrongly declares syscall output type as int. This error
was fixed in release 5.10, so this patch reverts my recent change.
2021-02-11 Uroš Bizjak <ubizjak@gmail.com>
libgomp/
* config/linux/x86/futex.h (__futex_wait):
Revert output type back to long.
(__futex_wake): Ditto.
(futex_wait): Update for revert.
(futex_wake): Ditto.