Commit Graph

19 Commits

Author SHA1 Message Date
Nikita Popov
f2be6e732a Update data tables for Unicode 11 2018-06-11 20:25:37 +02:00
Gabriel Caruso
2238403892 Trailing whitespaces on ext/*
Signed-off-by: Gabriel Caruso <carusogabriel34@gmail.com>
2018-01-04 02:38:32 -02:00
Nikita Popov
f4a1d9c821 Fixed bug #65544 and #71298 2017-07-28 14:57:08 +02:00
Nikita Popov
582a65b06f Implement full case mapping
Implement full case mapping according to SpecialCasing.txt and
also full case folding according to CaseFolding.txt (F). There
are a number of caveats:

* Only language-agnostic and unconditional full case mapping
  is implemented. The only language-agnostic conditional case
  mapping rule relates to Greek sigma in final position
  (Final_Sigma). Correctly handling this requires both arbitrary
  lookahead and lookbehind, which would require some larger
  changes to how the case mapping is implemented. This is a
  possible future extension.
* The only language-specific handling that is implemented is
  for Turkish dotted/undotted Is, if the ISO-8859-9 encoding
  is used. This matches the previous behavior and makes sure
  that no codepoints not supported by the encoding are
  produced. A future extension would be to also handle the
  Turkish mappings specified by SpecialCasing.txt based on
  the mbfl internal language.
* Full case folding is implemented, but case-insensitive mb_*
  operations continue to use simple case folding. The reason is
  that full case folding of the haystack string may change the
  position at which a match occurred. This would have to be
  mapped back into the position in the original string.
* mb_convert_case() exposes both the full and the simple case
  mapping / folding, where full is the default. The constants
  are:

   * MB_CASE_LOWER (used by mb_strtolower)
   * MB_CASE_UPPER (used by mb_strtolower)
   * MB_CASE_TITLE
   * MB_CASE_FOLD
   * MB_CASE_LOWER_SIMPLE
   * MB_CASE_UPPER_SIMPLE
   * MB_CASE_TITLE_SIMPLE
   * MB_CASE_FOLD_SIMPLE (used by case-insensitive operations)
2017-07-28 12:32:50 +02:00
Nikita Popov
9ac7c1e71d Use case-folding for case insensitive comparisons
Instead of using lowercasing.
2017-07-28 12:32:50 +02:00
Nikita Popov
80a0601fe5 Use MPH for case maps
Instead of performing a binary search, use a hashtable to store
the case maps. In particular a minimal perfect hash construction
is used, which does not require collision resolution (but does
use an auxiliary table for the hash perturbation).
2017-07-28 12:32:50 +02:00
Nikita Popov
eacd70f762 Don't store titlecase if same as uppercase
The totitle code already has a fallback for that case.
2017-07-28 12:32:50 +02:00
Nikita Popov
cedfc2f426 Drop implementation-specific character properties
No point in keeping around non-standard character properties if
we're not using them and most are not even being populated.
2017-07-28 12:32:50 +02:00
Nikita Popov
8ace7045e9 Handle character ranges in ucgendat generically
In particular, the previous implementation did not account for
Tangut Ideographs and CJK Ideograph extensions C through F.
2017-07-25 18:48:12 +02:00
Nikita Popov
4bd61ec7ad Fix handling of some special ranges in ucgendat
* Han Ideagraphs go up to U+9FEA.
* CJK Compatibility Ideographs are no longer specified as a special
  range in remotely recent versions of Unicode.
* Surrogate properties should be assigned to U+D800-U+DFFF, not to
  U+10000-U+1FFFF.
2017-07-25 18:48:12 +02:00
Nikita Popov
3c6b2512cb Change layout of case mapping table
Previously the case mapping table was segregated by the type of
the character (upper, lower, title) and always stored the other
two variants (key, other1, other2). Now the table is segregated
by the target type (key, other). As only very few characters have
more than one target this only slightly increases the size of the
table.

The advantage of this layout is that we only need to perform a
single table lookup in the case table. Previously, depending on
the case that was hit, either one lookup in the property table,
or two lookups in the property table and one lookup in the case
table were required.

This changes the layout from libunicode in the OpenLDAP project
-- however, the last commit there was over 10 years ago, so I
don't see value in keeping this in sync.
2017-07-23 18:33:15 +02:00
Nikita Popov
706f0cf8a0 Update Unicode data for Unicode 10 2017-07-23 16:05:39 +02:00
Nikita Popov
24cfbfd56f Update ucgendat for more bidi properties
Handle them the same way as others -- by classifying as Other
Neutral.
2017-07-23 16:03:11 +02:00
Nikita Popov
077e61fad3 Fixed bug #69267 completely
ucgendat.c was assuming that a title-case character is a character
that has both lower and upper-case variants. However, there are
title-case characters that only have a lower-case variant. Use the
Lt general character proprety to determine where in the case map
the character should be placed instead.
2017-07-23 15:30:17 +02:00
Nikita Popov
0e4af9192f Partial fix for bug #69267
This pulls in 60a25c72ba389f53b0621ca250bc99f3b295d43f from the
OpenLDAP project.
2017-07-23 14:47:21 +02:00
Xinchen Hui
e841016df7 Upgrade unicode_data.h to UnicodeData.txt 8.0.0 (part of bug #70475 ext/mbstring/unicode_data.h needs update) 2015-09-15 07:56:10 -07:00
Stanislav Malyshev
b7a7b1a624 trailing whitespace removal 2015-01-10 15:07:38 -08:00
Gustavo André dos Santos Lopes
42dae97fd4 - Fixed bug #52981 (Unicode casing table was out-of-date).
Updated with UnicodeData-6.0.0d7.txt and included the
  source of the generator program with the distribution.
#The replaced tables, generated circa 2002, seem to reflect
#Unicode 3.2. I was unable to generate the same property
#offsets with Unicode 3.2 data, but all the tests I made
#indicate php_unicode_is_prop() is returning the correct
#values. The replaced file merely says it used a "modified
#version" of ucgendat, which is not very helpful. The results
#I got were not significantly different, only slightly higher
#offsets at two properties, which were carried over to the
#subsequent properties.
#I was, however, able to replicate precisely the casing table.
#The extent of the "modifications" besides omitting most of
#the tables, a slightly different layout and the casing table
#offsets having been multiplied by 3 is unclear.
#The test suite showed no regressions; however, it's very poor
#in testing the modified portion of the extension.
2010-10-05 01:54:17 +00:00
Wez Furlong
1a87c6b5bf (PHP mb_convert_case) Add function that will convert the case of a string
Respecting it's encoding (or the internal encoding).
2002-09-26 00:53:47 +00:00