ext/mbstring: update to Unicode 15.1

Updates the UCD data to Unicode 15.1 (released September 2023).
Unicode 16 is expected around September 2024.

Previously: 0fdffc18, #7502

UCD 15.1 `DerivedNormalizationProps` contains multiple properties on
the same line, which breaks the parser. This commit also updates the
`ucgendat.php` script to allow two or three fields on each line, and to
look for the `Cased` and `Case_Ignorable` properties in either of the
fields, which preserves the previous behavior.
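
The field handling described above can be sketched as follows. This is
only an illustration in C, not the code in `ucgendat.php`: the helper
name `line_has_property` is hypothetical, and it assumes that "either of
the fields" means the second or third semicolon-separated field, with
anything after `#` treated as a comment.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of php-src): returns true if `prop` is the
 * second or third semicolon-separated field of a UCD-style data line.
 * The first field (the code point or range) is never compared. */
static bool line_has_property(const char *line, const char *prop)
{
    size_t prop_len = strlen(prop);
    /* Everything from '#' to the end of the line is a comment. */
    const char *end = line + strcspn(line, "#\n");
    const char *field = line;
    int index = 0;

    while (index < 3) {
        const char *sep = memchr(field, ';', (size_t)(end - field));
        const char *a = field;
        const char *b = sep ? sep : end;

        /* Trim surrounding whitespace from the current field. */
        while (a < b && isspace((unsigned char)*a)) a++;
        while (b > a && isspace((unsigned char)b[-1])) b--;

        if (index >= 1 && (size_t)(b - a) == prop_len &&
            memcmp(a, prop, prop_len) == 0) {
            return true;
        }
        if (!sep) {
            break; /* no more fields on this line */
        }
        field = sep + 1;
        index++;
    }
    return false;
}
```

With this sketch, a two-field line such as `0041..005A ; Cased # ...`
matches `Cased`, and a three-field line whose fields are neither `Cased`
nor `Case_Ignorable` is skipped instead of being rejected.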
Ayesh Karunaratne 2024-06-27 01:58:14 +07:00 committed by Alex Dowad
parent 23f99f08c9
commit 421ac9ac28
5 changed files with 1192 additions and 1126 deletions

NEWS

@@ -123,6 +123,7 @@ PHP NEWS
 - MBString:
 . Added mb_trim, mb_ltrim and mb_rtrim. (Yuya Hamada)
 . Added mb_ucfirst and mb_lcfirst. (Yuya Hamada)
+. Updated Unicode data tables to Unicode 15.1. (Ayesh Karunaratne)
 - MySQLnd:
 . Fixed bug GH-13440 (PDO quote bottleneck). (nielsdos)


@@ -684,6 +684,9 @@ PHP 8.4 UPGRADE NOTES
 $domain name is empty or too long, and if $variant is not
 INTL_IDNA_VARIANT_UTS46.
+- MBString:
+. Unicode data tables have been updated to Unicode 15.1.
 - OpenSSL:
 . The OpenSSL extension now requires at least OpenSSL 1.1.1.


@@ -58,14 +58,13 @@ static const struct {
 { 0x2e80, 0x2e99 },
 { 0x2e9b, 0x2ef3 },
 { 0x2f00, 0x2fd5 },
-{ 0x2ff0, 0x2ffb },
-{ 0x3000, 0x303e },
+{ 0x2ff0, 0x303e },
 { 0x3041, 0x3096 },
 { 0x3099, 0x30ff },
 { 0x3105, 0x312f },
 { 0x3131, 0x318e },
 { 0x3190, 0x31e3 },
-{ 0x31f0, 0x321e },
+{ 0x31ef, 0x321e },
 { 0x3220, 0x3247 },
 { 0x3250, 0x4dbf },
 { 0x4e00, 0xa48c },
@@ -88,7 +87,9 @@ static const struct {
 { 0x1aff5, 0x1affb },
 { 0x1affd, 0x1affe },
 { 0x1b000, 0x1b122 },
+{ 0x1b132, 0x1b132 },
 { 0x1b150, 0x1b152 },
+{ 0x1b155, 0x1b155 },
 { 0x1b164, 0x1b167 },
 { 0x1b170, 0x1b2fb },
 { 0x1f004, 0x1f004 },
@@ -122,7 +123,7 @@ static const struct {
 { 0x1f6cc, 0x1f6cc },
 { 0x1f6d0, 0x1f6d2 },
 { 0x1f6d5, 0x1f6d7 },
-{ 0x1f6dd, 0x1f6df },
+{ 0x1f6dc, 0x1f6df },
 { 0x1f6eb, 0x1f6ec },
 { 0x1f6f4, 0x1f6fc },
 { 0x1f7e0, 0x1f7eb },
@@ -130,15 +131,13 @@ static const struct {
 { 0x1f90c, 0x1f93a },
 { 0x1f93c, 0x1f945 },
 { 0x1f947, 0x1f9ff },
-{ 0x1fa70, 0x1fa74 },
-{ 0x1fa78, 0x1fa7c },
-{ 0x1fa80, 0x1fa86 },
-{ 0x1fa90, 0x1faac },
-{ 0x1fab0, 0x1faba },
-{ 0x1fac0, 0x1fac5 },
-{ 0x1fad0, 0x1fad9 },
-{ 0x1fae0, 0x1fae7 },
-{ 0x1faf0, 0x1faf6 },
+{ 0x1fa70, 0x1fa7c },
+{ 0x1fa80, 0x1fa88 },
+{ 0x1fa90, 0x1fabd },
+{ 0x1fabf, 0x1fac5 },
+{ 0x1face, 0x1fadb },
+{ 0x1fae0, 0x1fae8 },
+{ 0x1faf0, 0x1faf8 },
 { 0x20000, 0x2fffd },
 { 0x30000, 0x3fffd },
 };
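
The table in the hunks above is a sorted list of inclusive code point
ranges, so a width lookup can be done with a binary search. The
following is a minimal sketch under that assumption, not the lookup code
used by mbstring. The names `width_range`, `wide_excerpt`, `in_ranges`
and `sketch_char_width` are hypothetical, and the excerpt holds only a
few post-update ranges copied from the diff.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for one inclusive range of wide code points. */
struct width_range {
    uint32_t begin;
    uint32_t end;
};

/* A small excerpt of the updated ranges from the diff, in ascending order.
 * The real table is much longer. */
static const struct width_range wide_excerpt[] = {
    { 0x3041, 0x3096 },
    { 0x1f6d5, 0x1f6d7 },
    { 0x1f6dc, 0x1f6df },
    { 0x1f6eb, 0x1f6ec },
    { 0x1fa70, 0x1fa7c },
    { 0x1fa80, 0x1fa88 },
};

/* Binary search over sorted, non-overlapping inclusive ranges. */
static bool in_ranges(uint32_t cp, const struct width_range *table, size_t n)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < table[mid].begin) {
            hi = mid;
        } else if (cp > table[mid].end) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}

/* Width 2 for code points inside a listed range, width 1 otherwise. */
static int sketch_char_width(uint32_t cp)
{
    size_t n = sizeof(wide_excerpt) / sizeof(wide_excerpt[0]);
    return in_ranges(cp, wide_excerpt, n) ? 2 : 1;
}
```

Against this excerpt, U+1F6DC falls inside `{ 0x1f6dc, 0x1f6df }` and is
reported as width 2, while U+1F6DB falls between two ranges and is
reported as width 1. With the pre-update range `{ 0x1f6dd, 0x1f6df }`,
U+1F6DC would have been treated as narrow.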


@@ -0,0 +1,28 @@
--TEST--
mbstring Unicode Data tests
--EXTENSIONS--
mbstring
--FILE--
<?php
print "ASCII (PHP): " . mb_strwidth('PHP', 'UTF-8') . "\n";
print "Vietnamese (Xin chào): " . mb_strwidth('Xin chào', 'UTF-8') . "\n";
print "Traditional Chinese (你好): " . mb_strwidth('你好', 'UTF-8') . "\n";
print "Sinhalese (අයේෂ්): " . mb_strwidth('අයේෂ්', 'UTF-8') . "\n";
print "Emoji (\u{1F418}): " . mb_strwidth("\u{1F418}", 'UTF-8') . "\n";
// New in Unicode 15.0, width=2
print "Emoji (\u{1F6DC}): " . mb_strwidth("\u{1F6DC}", 'UTF-8') . "\n";
?>
--EXPECT--
ASCII (PHP): 3
Vietnamese (Xin chào): 8
Traditional Chinese (你好): 4
Sinhalese (අයේෂ්): 5
Emoji (🐘): 2
Emoji (🛜): 2
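
The expected widths above are consistent with a simple model: decode the
string into code points, count 2 for each code point in a wide range and
1 for everything else. The sketch below illustrates that model in C. It
is an assumption about the behavior being tested, not the mbstring
implementation. The names `wide_sample`, `next_codepoint` and
`sketch_strwidth` are hypothetical, the input is assumed to be valid
NUL-terminated UTF-8, and only two ranges from the updated table are
included.

```c
#include <stddef.h>
#include <stdint.h>

/* Two ranges copied from the updated table in this commit.
 * The real table is much longer. */
static const struct { uint32_t begin, end; } wide_sample[] = {
    { 0x4e00, 0xa48c },
    { 0x1f6dc, 0x1f6df },
};

/* Decode one UTF-8 sequence and advance *s past it.
 * Assumes well-formed input; no validation is performed. */
static uint32_t next_codepoint(const unsigned char **s)
{
    const unsigned char *p = *s;
    uint32_t cp;
    int extra;

    if (p[0] < 0x80)      { cp = p[0];        extra = 0; }
    else if (p[0] < 0xe0) { cp = p[0] & 0x1f; extra = 1; }
    else if (p[0] < 0xf0) { cp = p[0] & 0x0f; extra = 2; }
    else                  { cp = p[0] & 0x07; extra = 3; }
    p++;

    /* Each continuation byte contributes its low six bits. */
    while (extra-- > 0) {
        cp = (cp << 6) | (uint32_t)(*p++ & 0x3f);
    }
    *s = p;
    return cp;
}

/* Sum of per-code-point widths: 2 inside a sample range, 1 otherwise. */
static size_t sketch_strwidth(const char *utf8)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t width = 0;

    while (*p != 0) {
        uint32_t cp = next_codepoint(&p);
        size_t w = 1;
        for (size_t i = 0; i < sizeof(wide_sample) / sizeof(wide_sample[0]); i++) {
            if (cp >= wide_sample[i].begin && cp <= wide_sample[i].end) {
                w = 2;
            }
        }
        width += w;
    }
    return width;
}
```

Under this model, `PHP` has width 3, the two CJK ideographs in the test
have width 4 in total, and U+1F6DC (encoded as `F0 9F 9B 9C`) has width
2 only because the updated table starts the range at `0x1f6dc`.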

File diff suppressed because it is too large