gcc/unicode at master - gcc

mirror of https://gcc.gnu.org/git/gcc.git synced 2024-11-23 02:44:18 +08:00

History

Jakub Jelinek d0e8f58b81 contrib, libcpp, libstdc++: Update to Unicode 16.0 It is autumn again and there is a new Unicode version 16.0. The following patch updates our Unicode stuff in contrib, libcpp and libstdc++ from that Unicode version. 2024-10-08 Jakub Jelinek <jakub@redhat.com> contrib/ * unicode/README: Update glibc git commit hash, replace Unicode 15 or 15.1 versions with 16. * unicode/gen_libstdcxx_unicode_data.py: Use 160000 instead of 150100 in _GLIBCXX_GET_UNICODE_DATA test. * unicode/from_glibc/utf8_gen.py: Updated from glibc 064c708c78cc2a6b5802dce73108fc0c1c6bfc80 commit. * unicode/DerivedCoreProperties.txt: Updated from Unicode 16.0. * unicode/emoji-data.txt: Likewise. * unicode/PropList.txt: Likewise. * unicode/GraphemeBreakProperty.txt: Likewise. * unicode/DerivedNormalizationProps.txt: Likewise. * unicode/NameAliases.txt: Likewise. * unicode/UnicodeData.txt: Likewise. * unicode/EastAsianWidth.txt: Likewise. gcc/testsuite/ * c-c++-common/cpp/named-universal-char-escape-1.c: Add tests for some Unicode 16.0 characters, both normal and generated. libcpp/ * makeucnid.cc (write_copyright): Update Unicode Copyright years. * makeuname2c.cc (generated_ranges): Adjust Unicode version from 15.1 to 16.0. Add EGYPTIAN HIEROGLYPH- generated range, adjust indexes in following entries. (write_copyright): Update Unicode Copyright years. * generated_cpp_wcwidth.h: Regenerated. * ucnid.h: Regenerated. * uname2c.h: Regenerated. libstdc++-v3/ * include/bits/unicode.h (std::__unicode::__v15_1_0): Rename inline namespace to ... (std::__unicode::__v16_0_0): ... this. (_GLIBCXX_GET_UNICODE_DATA): Change from 150100 to 160000. * include/bits/unicode-data.h: Regenerated. * testsuite/ext/unicode/properties.cc: Check for _Gcb_SpacingMark on U+11F03 rather than U+1D16D as the latter lost SpacingMark property in Unicode 16.0.		2024-10-08 10:01:47 +02:00
..
from_glibc	contrib, libcpp, libstdc++: Update to Unicode 16.0	2024-10-08 10:01:47 +02:00
DerivedCoreProperties.txt	contrib, libcpp, libstdc++: Update to Unicode 16.0	2024-10-08 10:01:47 +02:00
DerivedNormalizationProps.txt	contrib, libcpp, libstdc++: Update to Unicode 16.0	2024-10-08 10:01:47 +02:00
EastAsianWidth.txt	contrib, libcpp, libstdc++: Update to Unicode 16.0	2024-10-08 10:01:47 +02:00
emoji-data.txt	contrib, libcpp, libstdc++: Update to Unicode 16.0	2024-10-08 10:01:47 +02:00
gen_libstdcxx_unicode_data.py	contrib, libcpp, libstdc++: Update to Unicode 16.0	2024-10-08 10:01:47 +02:00
gen_wcwidth.py	contrib: Remove C-style comments from Python files	2024-01-05 13:57:05 +00:00
gen-box-drawing-chars.py	contrib: Remove C-style comments from Python files	2024-01-05 13:57:05 +00:00
gen-combining-chars.py	contrib: Remove C-style comments from Python files	2024-01-05 13:57:05 +00:00
gen-printable-chars.py	contrib: Remove C-style comments from Python files	2024-01-05 13:57:05 +00:00
GraphemeBreakProperty.txt	contrib, libcpp, libstdc++: Update to Unicode 16.0	2024-10-08 10:01:47 +02:00
NameAliases.txt	contrib, libcpp, libstdc++: Update to Unicode 16.0	2024-10-08 10:01:47 +02:00
PropList.txt	contrib, libcpp, libstdc++: Update to Unicode 16.0	2024-10-08 10:01:47 +02:00
README	contrib, libcpp, libstdc++: Update to Unicode 16.0	2024-10-08 10:01:47 +02:00
unicode-license.txt
UnicodeData.txt	contrib, libcpp, libstdc++: Update to Unicode 16.0	2024-10-08 10:01:47 +02:00
utf8-dump.py	contrib: add unicode/utf8-dump.py	2021-11-01 11:52:28 -04:00

README

This directory contains a mechanism for GCC to have its own internal
implementation of wcwidth functionality (cpp_wcwidth () in libcpp/charset.c),
as well as a mechanism to update the information about codepoints permitted in
identifiers, which is encoded in libcpp/ucnid.h, and mapping between Unicode
names and codepoints, which is encoded in libcpp/uname2c.h.

The idea is to produce the necessary lookup tables
(../../libcpp/{ucnid.h,uname2c.h,generated_cpp_wcwidth.h}) in a reproducible
way, starting from the following files that are distributed by the Unicode
Consortium:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
ftp://ftp.unicode.org/Public/UNIDATA/EastAsianWidth.txt
ftp://ftp.unicode.org/Public/UNIDATA/PropList.txt
ftp://ftp.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
ftp://ftp.unicode.org/Public/UNIDATA/NameAliases.txt

Two additional files are needed for lookup tables in libstdc++:

ftp://ftp.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakProperty.txt
ftp://ftp.unicode.org/Public/UNIDATA/emoji/emoji-data.txt

All these files have been added to source control in this directory;
please see unicode-license.txt for the relevant copyright information.

In order to keep in sync with glibc's wcwidth as much as possible, it is
desirable for the logic that processes the Unicode data to be the same as
glibc's.  To that end, we also put in this directory, in the from_glibc/
directory, the glibc python code that implements their logic.  This code was
copied verbatim from glibc, and it can be updated at any time from the glibc
source code repository.  The files copied from that repository are:

localedata/unicode-gen/unicode_utils.py
localedata/unicode-gen/utf8_gen.py

And the most recent versions added to GCC are from glibc git commit:
064c708c78cc2a6b5802dce73108fc0c1c6bfc80

The script gen_wcwidth.py found here contains the GCC-specific code to
map glibc's output to the lookup tables we require.  This script should not need
to change, unless there are structural changes to the Unicode data files or to
the glibc code.  Similarly, makeucnid.cc in ../../libcpp contains the logic to
produce ucnid.h.

The procedure to update GCC's Unicode support is the following:

1.  Update the six Unicode data files from the above URLs.

2.  Update the two glibc files in from_glibc/ from glibc's git.  Update
    the commit number above in this README.

3.  Run ./gen_wcwidth.py X.Y > ../../libcpp/generated_cpp_wcwidth.h
    (where X.Y is the version of the Unicode standard corresponding to the
    Unicode data files being used, most recently, 16.0.0).

4.  Update Unicode Copyright years in libcpp/makeucnid.cc and in
    libcpp/makeuname2c.cc up to the year in which the Unicode
    standard has been released.

5.  Compile makeucnid, e.g. with:
      g++ -O2 ../../libcpp/makeucnid.cc -o ../../libcpp/makeucnid

6.  Generate ucnid.h as follows:
      ../../libcpp/makeucnid ../../libcpp/ucnid.tab UnicodeData.txt \
	DerivedNormalizationProps.txt DerivedCoreProperties.txt \
	> ../../libcpp/ucnid.h

7.  Read the corresponding Unicode's standard and update correspondingly
    generated_ranges table in libcpp/makeuname2c.cc (in Unicode 16 all
    the needed information was in Table 4-8).

8.  Compile makeuname2c, e.g. with:
      g++ -O2 ../../libcpp/makeuname2c.cc -o ../../libcpp/makeuname2c

9:  Generate uname2c.h as follows:
      ../../libcpp/makeuname2c UnicodeData.txt NameAliases.txt \
	> ../../libcpp/uname2c.h

See gen_libstdcxx_unicode_data.py for instructions on updating the lookup
tables in libstdc++.