Introduce an Arm MTE compatible strlen implementation.
The existing implementation assumes that any access to the pages in
which the string resides is safe. This assumption is not true when
MTE is enabled. This patch updates the algorithm to ensure that
accesses remain within the bounds of an MTE tag (16-byte chunks) and
improves overall performance on modern cores. On cores with less
efficient Advanced SIMD implementation such as Cortex-A53 it can
be slower.
Benchmarked on Cortex-A72, Cortex-A53, Neoverse N1.
Co-authored-by: Wilco Dijkstra <wilco.dijkstra@arm.com>
Introduce an Arm MTE compatible strchr implementation.
The existing implementation assumes that any access to the pages in
which the string resides is safe. This assumption is not true when
MTE is enabled. This patch updates the algorithm to ensure that
accesses remain within the bounds of an MTE tag (16-byte chunks) and
improves overall performance.
Benchmarked on Cortex-A72, Cortex-A53, Neoverse N1.
Co-authored-by: Wilco Dijkstra <wilco.dijkstra@arm.com>
Introduce an Arm MTE compatible strchrnul implementation.
The existing implementation assumes that any access to the pages in
which the string resides is safe. This assumption is not true when
MTE is enabled. This patch updates the algorithm to ensure that
accesses remain within the bounds of an MTE tag (16-byte chunks) and
improves overall performance.
Benchmarked on Cortex-A72, Cortex-A53, Neoverse N1.
Co-authored-by: Wilco Dijkstra <wilco.dijkstra@arm.com>
Falkor's memcpy and memmove share some implementation details,
therefore, the two routines are moved to a single source file
for code reuse.
The two routines now share code for small and medium copies
(up to and including 128 bytes). Large copies in memcpy do not
handle overlap correctly, consequently, the loops for
moving/copying more than 128 bytes stay separate for memcpy
and memmove.
To increase code reuse a number of small modifications were made:
1. The old implementation of memcpy copied the first 16-bytes as
soon as the size of data was determined to be greater than 32 bytes.
For memcpy code to also work when copying small/medium overlapping
data, the first load and store was moved to the large copy case.
2. Medium memcpy case no longer assumes that 16 bytes were already
copied and uses 8 registers to copy up to 128 bytes.
3. Small case for memmove was enlarged to that of memcpy, which is
less than or equal to 32 bytes.
4. Medium case for memmove was enlarged to that of memcpy, which is
less than or equal to 128 bytes.
Other changes include:
1. Improve alignment of existing loop bodies.
2. 'Delouse' memmove and memcpy input arguments. Make sure that
upper 32-bits of input registers are zeroed if unused.
3. Do one more iteration in memmove loops and reduce the number of
copies made from the start/end of the buffer, depending on
the direction of the memmove loop.
Benchmarking:
Looking at the results from bench-memcpy-random.out, we can see that
now memmove_falkor is about 5% faster than memcpy_falkor_old, while
memmove_falkor_old was more than 15% slower. The memcpy implementation
remained largely unmodified, so there is no significant performance
change.
The reason for such a significant memmove performance gain is the
increase of the upper bound on the small copy case to 32 bytes and
the increase of the upper bound on the medium copy case to 128 bytes.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
d6d74ec16 ('htl: Enable more tests') moved the linking rules from
nptl/Makefile and htl/Makefile to the shared sysdeps/pthread/Makefile. But
e.g. on powerpc some tests are added in sysdeps/powerpc/Makefile, which is
included *after* sysdeps/pthread/Makefile, and thus the tests don't get
affected by the rules and fail to link. For now let's just copy over the
set of rules in both nptl/Makefile and htl/Makefile.
* sysdeps/pthread/Makefile: Move libpthread linking rules to...
* htl/Makefile: ... here and...
* nptl/Makefile: ... there.
We really need modules to use their own pthread_atfork so that
__dso_handle properly identifies them.
* sysdeps/htl/pt-atfork.c (__pthread_atfork): Hide function.
(pthread_atfork): Hide alias.
* sysdeps/htl/old_pt-atfork.c (pthread_atfork): Rename macro to
__pthread_atfork to fix building the compatibility alias.
* sysdeps/htl/pthreadP.h: Include <link.h>
(__pthread_init_static_tls): New prototype.
* htl/pt-alloc.c (__pthread_init_static_tls): New function.
* sysdeps/mach/hurd/htl/pt-sysdep.c (_init_routine): Initialize tcb
field of initial thread. Set GL(dl_init_static_tls) to
&__pthread_init_static_tls.
and add _nocancel variants.
* sysdeps/mach/hurd/pread64.c (__libc_pread64): Call __pread64_nocancel
surrounded by enabling async cancel, to replace implementation moved to...
* sysdeps/mach/hurd/pread64_nocancel.c (__pread64_nocancel): ... here.
* sysdeps/mach/hurd/read.c (__libc_read): Call __read_nocancel surrounded by
enabling async cancel, to replace implementation moved to...
* sysdeps/mach/hurd/read_nocancel.c (__read_nocancel): ... here.
* sysdeps/mach/hurd/Makefile (sysdep_routines): Add read_nocancel and
pread64_nocancel.
* sysdeps/mach/hurd/not-cancel.h (__read_nocancel, __pread64_nocancel):
Replace macros with prototypes with a hidden proto on libc.
* sysdeps/mach/hurd/dl-sysdep.c: Include <not-cancel.h>.
(__pread64_nocancel): New alias, check that it is not hidden.
(__read_nocancel): New alias, check that it is not hidden.
* sysdeps/mach/hurd/Versions (libc.GLIBC_PRIVATE): Add __read_nocancel and
__pread64_nocancel.
(ld.GLIBC_2.1): Add __pread64.
(ld.GLIBC_PRIVATE): Add __read_nocancel and __pread64_nocancel.
* sysdeps/mach/hurd/i386/ld.abilist (__pread64): Add symbol.
* sysdeps/mach/hurd/i386/localplt.data (__read_nocancel, __pread64,
__pread64_nocancel): Add references.
* sysdeps/i386/htl/Makefile: New file.
* sysdeps/i386/htl/tcb-offsets.sym: New file.
* sysdeps/mach/hurd/i386/Makefile [setjmp] (gen-as-const-headers): Add
signal-defines.sym.
* sysdeps/mach/hurd/i386/____longjmp_chk.S: Include tcb-offsets.h.
(____longjmp_chk): Harmonize with i386's __longjmp. Clear SS_ONSTACK
when jumping off the alternate stack.
* sysdeps/mach/hurd/i386/__longjmp.S: New file.
The existing macros are fragile and expect local variables with a
certain name. Fix this by defining them as functions with default
implementation in a new header dl-runtime.h which arches can override
if need be.
This came up during ARC port review, hence the need for argument pltgot
in reloc_index() which is not needed by existing ports.
This patch potentially only affects hppa/x86 ports,
build tested for both those configs and a few more.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This started as a trivial change to Anton's rawmemchr. I got
carried away. This is a hybrid between P8's asympotically
faster 64B checks with extremely efficient small string checks
e.g <64B (and sometimes a little bit more depending on alignment).
The second trick is to align to 64B by running a 48B checking loop
16B at a time until we naturally align to 64B (i.e checking 48/96/144
bytes/iteration based on the alignment after the first 5 comparisons).
This allieviates the need to check page boundaries.
Finally, explicly use the P7 strlen with the runtime loader when building
P9. We need to be cautious about vector/vsx extensions here on P9 only
builds.
This defines the macro such that it should behave best on all
supported powerpc targets. Likewise, this allows us to remove the
ppc64le specific s_fmaf128.c.
I have verified powerpc64le multiarch and powerpc64le power9
no-multiarch builds continue to generate optimize fmaf128.
commit 7621e38bf3
Author: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Date: Tue Jan 29 17:43:45 2019 +0000
Add generic hp-timing support
removed the clock_gettime option. Restore the clock_gettime option for
some x86 CPUs on which value from RDTSC may not be incremented at a fixed
rate.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
commit e9698175b0
Author: Lukasz Majewski <lukma@denx.de>
Date: Mon Mar 16 08:31:41 2020 +0100
y2038: Replace __clock_gettime with __clock_gettime64
breaks benchtests with sysdeps/generic/hp-timing.h:
In file included from ./bench-timing.h:23,
from ./bench-skeleton.c:25,
from
/export/build/gnu/tools-build/glibc-gitlab/build-x86_64-linux/benchtests/bench-rint.c:45:
./bench-skeleton.c: In function ‘main’:
../sysdeps/generic/hp-timing.h:37:23: error: storage size of ‘tv’ isn’t known
37 | struct __timespec64 tv; \
| ^~
Define HP_TIMING_NOW with clock_gettime in sysdeps/generic/hp-timing.h
if _ISOMAC is defined. Don't define __clock_gettime in bench-timing.h
since it is no longer needed.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
There are:
#define TUNABLE_SET_VAL_IF_VALID_RANGE(__cur, __val, __type) \
({ \
__type min = (__cur)->type.min; \
__type max = (__cur)->type.max; \
\
if ((__type) (__val) >= min && (__type) (val) <= max) \
^^^ Should be __val
{ \
(__cur)->val.numval = val; \
^^^ Should be __val
(__cur)->initialized = true; \
} \
})
Luckily since all TUNABLE_SET_VAL_IF_VALID_RANGE usages are
TUNABLE_SET_VAL_IF_VALID_RANGE (cur, val, int64_t);
this didn't cause any issues.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
When detecting hole support, we write at 16MiB, and filesystems will
typically need two levels of data to record that. On filesystems with
8KB block, the two indirection blocks will require a total of 16KB
overhead, thus 32 512-byte sectors.
Spotted on GNU/Hurd with a 4KB blocks filesystem, but also happens on Linux
with 4KB or 8KB blocks filesystems.
* support/support_descriptor_supports_holes.c
(support_descriptor_supports_holes): Set block_headroom to 32.
The build uses an undefined macro evaluation for fmaf128 build.
For now set USE_FMAL_BUILTIN and USE_FMAF128_BUILTIN to 0.
Checked with a build for:
powerpc64le-linux-gnu-power9-disable-multi-arch
powerpc64le-linux-gnu-power9
powerpc64le-linux-gnu
powerpc64-linux-gnu-power8
powerpc64-linux-gnu
powerpc-linux-gnu-power4
powerpc-linux-gnu
timer_create needs to create threads with all signals blocked,
including SIGTIMER (which happens to equal SIGCANCEL).
Fixes commit b3cae39dcb ("nptl: Start
new threads with all signals blocked [BZ #25098]").
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This introduces the function __pthread_attr_extension to allocate the
extension space, which is freed by pthread_attr_destroy.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
union pthread_attr_transparent has always the correct size, even if
pthread_attr_t has padding that is not present in struct pthread_attr.
This should not result in an observable behavioral change. The
existing code appears to have been correct, but it was brittle because
it was not clear which functions were allowed to write to an entire
pthread_attr_t argument (e.g., by copying it).
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This allows to reuse the storage after calling pthread_cond_destroy.
* sysdeps/htl/bits/types/struct___pthread_cond.h (__pthread_cond):
Replace unused struct __pthread_condimpl *__impl field with unsigned int
__wrefs.
(__PTHREAD_COND_INITIALIZER): Update accordingly.
* sysdeps/htl/pt-cond-timedwait.c (__pthread_cond_timedwait_internal):
Register as waiter in __wrefs field. On unregistering, wake any pending
pthread_cond_destroy.
* sysdeps/htl/pt-cond-destroy.c (__pthread_cond_destroy): Register wake
request in __wrefs.
* nptl/Makefile (tests): Move tst-cond20 tst-cond21 to...
* sysdeps/pthread/Makefile (tests): ... here.
* nptl/tst-cond20.c nptl/tst-cond21.c: Move to...
* sysdeps/pthread/tst-cond20.c sysdeps/pthread/tst-cond21.c: ... here.
The function mbstowcs, by an XSI extension to POSIX, accepts a null
pointer for the destination wchar_t array. This API behaviour allows
you to use the function to compute the length of the required wchar_t
array i.e. does the conversion without storing it and returns the
number of wide characters required.
We remove the __write_only__ markup for the first argument because it
is not true since the destination may be a null pointer, and so the
length argument may not apply. We remove the markup otherwise the new
test case cannot be compiled with -Werror=nonnull.
We add a new test case for mbstowcs which exercises the destination is
a null pointer behaviour which we have now explicitly documented.
The mbsrtowcs and mbsnrtowcs behave similarly, and mbsrtowcs is
documented as doing this in C11, even if the standard doesn't come out
and call out this specific use case. We add one note to each of
mbsrtowcs and mbsnrtowcs to call out that they support a null pointer
for the destination.
The wcsrtombs function behaves similarly but in the other way around
and allows you to use a null destination pointer to compute how many
bytes you would need to convert the wide character input. We document
this particular case also, but leave wcsnrtombs as a references to
wcsrtombs, so the reader must still read the details of the semantics
for wcsrtombs.