This document attempts to describe portions of the API related to the new
Unicode functionality and the best practices for upgrading existing
functions to support Unicode.

Your first stop should be README.UNICODE: it covers the general Unicode
functionality and concepts without going into technical implementation
details.

Working in Unicode World
========================

Strings
-------

A lot of internal functionality is controlled by the unicode_semantics
switch. Its value is found in the Unicode globals variable, UG(unicode). It
is either on or off for the entire request.

The big thing is that there are two new string types: IS_UNICODE and
IS_BINARY. The former one has its own storage in the value union part of
zval (value.ustr) and the latter re-uses value.str.

Both types have new macros to set the zval value and to access it.

Z_USTRVAL(), Z_USTRLEN()
  - accesses the value and length (in code units) of the Unicode type string

Z_BINVAL(), Z_BINLEN()
  - accesses the value and length of the binary type string

Z_UNIVAL(), Z_UNILEN()
  - accesses either Unicode or native string value, depending on the current
    setting of UG(unicode) switch. The Z_UNIVAL() type resolves to char*, so
    you may need to cast it appropriately.

Z_USTRCPLEN()
  - gives the number of codepoints in the Unicode type string

ZVAL_BINARY(), ZVAL_BINARYL()
  - Sets zval to hold a binary string. Takes the same parameters as
    ZVAL_STRING(), ZVAL_STRINGL().

ZVAL_UNICODE(), ZVAL_UNICODEL()
  - Sets zval to hold a Unicode string. Takes the same parameters as
    ZVAL_STRING(), ZVAL_STRINGL().

ZVAL_ASCII_STRING(), ZVAL_ASCII_STRINGL()
  - When UG(unicode) is off, it's equivalent to ZVAL_STRING(),
    ZVAL_STRINGL(). When UG(unicode) is on, it sets zval to hold a Unicode
    representation of the passed-in ASCII string. It will always create a
    new string in UG(unicode)=1 case, so the value of the duplicate flag is
    not taken into account.

ZVAL_RT_STRING()
  - When UG(unicode) is off, it's equivalent to ZVAL_STRING(),
    ZVAL_STRINGL(). When UG(unicode) is on, it takes the input string,
    converts it to Unicode using the runtime_encoding converter and sets
    zval to it. Since a new string is always created in this case, the
    value of the duplicate flag does not matter.

ZVAL_TEXT()
  - This macro sets the zval to hold either a Unicode or a normal string,
    depending on the value of UG(unicode). No conversion happens, so the
    argument has to be cast to (char*) when using this macro. One example of
    its usage would be to initialize zval to hold the name of a user
    function.

There are, of course, related conversion macros.

convert_to_string_with_converter(zval *op, UConverter *conv)
  - converts a zval to native string using the specified converter, if
    necessary.

convert_to_binary()
  - converts a zval to binary string.

convert_to_unicode()
  - converts a zval to Unicode string.

convert_to_unicode_with_converter(zval *op, UConverter *conv)
  - converts a zval to Unicode string using the specified converter, if
    necessary.

convert_to_text(zval *op)
  - converts a zval to either Unicode or native string, depending on the
    value of UG(unicode) switch

zend_ascii_to_unicode() function can be used to convert an ASCII char*
string to Unicode. This is useful especially for inline string literals, in
which case you can simply use USTR_MAKE() macro, e.g.:

    UChar* ustr;

    ustr = USTR_MAKE("main");

If you need to initialize a few such variables, it may be more efficient to
use ICU macros, which avoid the conversion, depending on the platform. See
[1] for more information.

USTR_FREE() can be used to free a UChar* string safely, since it checks for
NULL argument. USTR_LEN() takes either a UChar* or a char* argument,
depending on the UG(unicode) value, and returns its length. Cast the
argument to char* before passing it.

The list of functions that add new array values and add object properties
has also been expanded to include the new types. Please see zend_API.h for
full listing (add_*_ascii_string_*, add_*_rt_string_*, add_*_unicode_*,
add_*_binary_*).

UBYTES() macro can be used to obtain the number of bytes necessary to store
the given number of UChar's. The typical usage is:

    char *constant_name = colon + (UG(unicode)?UBYTES(2):2);


Code Points and Code Units
--------------------------

Unicode type strings are in the UTF-16 encoding where 1 Unicode character
may be represented by 1 or 2 UChar's. Each UChar is referred to as a "code
unit", and a full Unicode character as a "code point". Consequently, the
number of code units and the number of code points for the same Unicode
string may be different. This has many implications, the most important of
which is that you cannot simply index the UChar* string to get the desired
codepoint.

The zval's value.ustr.len actually contains the number of code units. To
obtain the number of code points, one can use the u_countChar32() ICU API
function or the Z_USTRCPLEN() macro.

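To make the distinction concrete, here is a minimal standalone sketch that
counts code points in a UTF-16 buffer. It does not use ICU or the Zend API:
uint16_t stands in for UChar, and count_code_points() is a hypothetical
helper written for this illustration, not the implementation of
u_countChar32().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of code points in `len` UTF-16 code units.
 * An unpaired surrogate is counted as one code point. */
static int32_t count_code_points(const uint16_t *s, int32_t len)
{
    int32_t i = 0, n = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        /* A lead surrogate followed by a trail surrogate forms one
         * code point that occupies two code units. */
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i += 2;
        } else {
            i += 1;
        }
        n++;
    }
    return n;
}
```

For the four code units 0x0061 0xD800 0xDDA2 0x0062 (that is, 'a', U+101A2,
'b') the helper reports three code points, even though the length in code
units is four.
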
ICU provides a number of macros for working with UTF-16 strings on the
codepoint level [2]. They allow you to do things like obtain a codepoint at
random code unit offset, move forward and backward over the string, etc.
There are two versions of iterator macros, *_SAFE and *_UNSAFE. It is
strongly recommended to use the *_SAFE versions, since they handle unpaired
surrogates and check for string boundaries. Here is an example of how to
move through a UChar* string and work on codepoints.

    UChar *str = ...;
    int32_t str_len = ...;
    UChar32 codepoint;
    int32_t offset = 0;

    while (offset < str_len) {
        U16_NEXT(str, offset, str_len, codepoint);
        /* now we have the Unicode character in codepoint */
    }

There is no macro to get a codepoint at a certain code point offset, but
there is a Zend API function that does it.

    inline UChar32 zend_get_codepoint_at(UChar *str, int32_t length, int32_t n);

To retrieve 3rd codepoint, you would call:

    zend_get_codepoint_at(str, str_len, 3);

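The following standalone sketch shows the kind of scan such a lookup has
to perform. code_point_at() is a hypothetical helper, not the Zend
function: uint16_t stands in for UChar, the index n is assumed to be
zero-based, and -1 is an arbitrary choice for "out of range". None of these
details are taken from zend_get_codepoint_at() itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the code point with (assumed zero-based)
 * code point index n, or -1 if the string has fewer code points. */
static int32_t code_point_at(const uint16_t *s, int32_t len, int32_t n)
{
    int32_t i = 0;

    assert(s != NULL || len == 0);
    if (n < 0) {
        return -1;
    }
    while (i < len) {
        uint32_t c = s[i++];

        /* Combine a valid surrogate pair into one supplementary code point. */
        if (c >= 0xD800 && c <= 0xDBFF && i < len &&
            s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(s[i++] - 0xDC00);
        }
        if (n-- == 0) {
            return (int32_t)c;
        }
    }
    return -1;
}
```

Note that the cost is linear in n: every earlier code point has to be
stepped over, because a surrogate pair shifts all following offsets by one
code unit.
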
If you have a UChar32 codepoint and need to put it into a UChar* string,
there is another helper function, zend_codepoint_to_uchar(). It takes
a single UChar32 and converts it to a UChar sequence (1 or 2 UChar's).

    UChar buf[8];
    UChar32 codepoint = 0x101a2;
    int8_t num_uchars;
    num_uchars = zend_codepoint_to_uchar(codepoint, buf);

The return value is the number of resulting UChar's, or 0, which indicates
an invalid codepoint.

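For reference, the UTF-16 encoding step itself can be sketched in plain C
as follows. encode_utf16() is a hypothetical stand-in written for this
note, with uint16_t in place of UChar. Treating surrogate values
(U+D800..U+DFFF) as invalid input is an assumption of the sketch, not a
documented property of zend_codepoint_to_uchar().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-16 form of `cp` into buf (which must
 * have room for 2 units) and returns the number of units written, or 0 if
 * `cp` is not a valid code point. */
static int8_t encode_utf16(uint32_t cp, uint16_t *buf)
{
    assert(buf != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;                       /* out of range, or a surrogate */
    }
    if (cp < 0x10000) {
        buf[0] = (uint16_t)cp;          /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    buf[0] = (uint16_t)(0xD800 + (cp >> 10));    /* lead: high 10 bits */
    buf[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* trail: low 10 bits */
    return 2;
}
```

With the codepoint 0x101a2 from the example above, the sketch produces the
two code units 0xD800 0xDDA2.
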
Memory Allocation
-----------------

For ease of use and to reduce possible bugs, there are memory allocation
functions specific to Unicode strings. Please use them at all times when
allocating UChar's.

    eumalloc(size)
    eurealloc(ptr, size)
    eustrndup(s, length)
    eustrdup(s)

    peumalloc(size, persistent)
    peurealloc(ptr, size, persistent)

The size parameter refers to the number of UChar's, not bytes.

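The units-versus-bytes distinction is the usual source of bugs here. As a
rough illustration only, the sketch below uses plain malloc() and uint16_t
instead of the Zend allocator and UChar. UNITS_TO_BYTES() and dup_units()
are hypothetical names that mimic the role of UBYTES() and eustrndup(),
they are not their actual definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical analogue of UBYTES(): bytes needed for n code units. */
#define UNITS_TO_BYTES(n) ((size_t)(n) * sizeof(uint16_t))

/* Hypothetical analogue of eustrndup(): copies `len` code units and
 * appends a terminating 0 unit. Returns NULL on overflow or if the
 * allocation fails; the caller releases the result with free(). */
static uint16_t *dup_units(const uint16_t *s, size_t len)
{
    uint16_t *copy;

    assert(s != NULL);
    if (len >= SIZE_MAX / sizeof(uint16_t)) {
        return NULL;            /* (len + 1) units would overflow size_t */
    }
    copy = malloc(UNITS_TO_BYTES(len + 1));
    if (copy != NULL) {
        memcpy(copy, s, UNITS_TO_BYTES(len));
        copy[len] = 0;
    }
    return copy;
}
```

The caller passes a length in code units throughout; the conversion to
bytes happens in exactly one place, which is the point of having
UChar-specific allocation functions.
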
Hashes
------

Hashes API has been upgraded to work with Unicode and binary strings. All
hash functions that worked with string keys now have their equivalent
zend_u_hash_* API. The zend_u_hash_* functions take the type of the key
string as the second argument.

When UG(unicode) switch is on, the IS_STRING keys are upconverted to
IS_UNICODE and then used in the hash lookup.

There are two new constants that define key types:

    #define HASH_KEY_IS_BINARY 4
    #define HASH_KEY_IS_UNICODE 5

Note that zend_hash_get_current_key_ex() does not have a zend_u_hash_*
version. It returns the key as a char* pointer, which you can cast
appropriately based on the key type.


Identifiers and Class Entries
-----------------------------

In Unicode mode all the identifiers are Unicode strings. This means that
while various structures such as zend_class_entry, zend_function, etc. store
the identifier name as a char* pointer, it will actually point to a UChar*
string. Be careful when accessing the names of classes, functions, and such
-- always check UG(unicode) before using them.

In addition, zend_class_entry has a u_twin field that points to its Unicode
counterpart in UG(unicode) mode. Use U_CLASS_ENTRY() macro to access the
correct class entry, e.g.:

    ce = U_CLASS_ENTRY(default_exception_ce);


Formatted Output
----------------

Since UTF-16 strings frequently contain NUL bytes, you cannot simply use
%s format to print them out. Towards that end, output functions such as
php_printf(), spprintf(), etc. now have three different formats for use with
Unicode strings:

  %r
    This format treats the corresponding argument as a Unicode string. The
    string is automatically converted to the output encoding. If you wish to
    apply a different converter to the string, use %*r and pass the
    converter before the string argument.

        UChar *class_name = USTR_MAKE("ReflectionClass");
        zend_printf("%r", class_name);

  %R
    This format requires at least two arguments: the first one specifies the
    type of the string to follow (IS_STRING or IS_UNICODE), and the second
    one - the string itself. If the string is of Unicode type, it is
    automatically converted to the output encoding. If you wish to apply
    a different converter to the string, use %*R and pass the converter
    before the string argument.

        zend_throw_exception_ex(U_CLASS_ENTRY(reflection_exception_ptr), 0 TSRMLS_CC,
                "Interface %R does not exist",
                Z_TYPE_P(class_name), Z_UNIVAL_P(class_name));

  %v
    This format takes only one parameter, the string, but the expected
    string type depends on the UG(unicode) value. If the string is of
    Unicode type, it is automatically converted to the output encoding. If
    you wish to apply a different converter to the string, use %*v and pass
    the converter before the string argument.

        zend_error(E_WARNING, "%v::__toString() did not return anything",
                Z_OBJCE_P(object)->name);


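The reason plain %s cannot be used on UTF-16 data can be seen without any
PHP or ICU code. In the sketch below, uint16_t stands in for UChar and
bytes_seen_as_c_string() is a hypothetical helper that reports how far a
byte-oriented routine such as strlen() gets before hitting a zero byte.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: length of a UTF-16 buffer when it is misread as a
 * NUL-terminated char* string. The buffer must end with a 0 code unit. */
static size_t bytes_seen_as_c_string(const uint16_t *s)
{
    assert(s != NULL);
    /* Every ASCII character is stored as one zero byte plus one non-zero
     * byte, so the scan stops inside the first code unit (after 0 or 1
     * bytes, depending on the platform's byte order). */
    return strlen((const char *)s);
}
```

For the two-character string "AB" stored as the code units 0x0041 0x0042
0x0000, the result is smaller than the four bytes of real data on both
little-endian and big-endian platforms, so a %s-style formatter would
truncate the text.
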
Upgrading Functions
===================

Let's take a look at a couple of functions that have been upgraded to
support new string types.

substr()
--------

This function returns part of a string based on offset and length
parameters.

    void *str;
    int32_t str_len, cp_len;
    zend_uchar str_type;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "tl|l", &str, &str_len, &str_type, &f, &l) == FAILURE) {
        return;
    }

The first thing we notice is that the incoming string specifier is 't',
which means that we can accept all 3 string types. The 'str' variable is
declared as void*, because it can point to either UChar* or char*.
The actual type of the incoming string is stored in 'str_type' variable.

    if (str_type == IS_UNICODE) {
        cp_len = u_countChar32(str, str_len);
    } else {
        cp_len = str_len;
    }

If the string is a Unicode one, we cannot rely on the str_len value to tell
us the number of characters in it. Instead, we call u_countChar32() to
obtain it.

The next several lines normalize start and length parameters to fit within the
string. Nothing new here. Then we locate the appropriate segment.

    if (str_type == IS_UNICODE) {
        int32_t start = 0, end = 0;
        U16_FWD_N((UChar*)str, end, str_len, f);
        start = end;
        U16_FWD_N((UChar*)str, end, str_len, l);
        RETURN_UNICODEL((UChar*)str + start, end-start, 1);

Since codepoint (character) #n is not necessarily at offset #n in Unicode
strings, we start at the beginning and iterate forward until we have gone
through the required number of codepoints to reach the start of the segment.
Then we save the location in 'start' and continue iterating through the number
of codepoints specified by the length. Once that's done, we can return the
segment as a Unicode string.

    } else {
        RETURN_STRINGL((char*)str + f, l, 1);
    }

For native and binary types, we can return the segment directly.

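The offset arithmetic in the Unicode branch can be tried out in isolation.
The sketch below is not the PHP implementation: uint16_t stands in for
UChar, fwd_n() is a hypothetical function that plays the role of the
U16_FWD_N() macro, and substr_bounds() is a hypothetical wrapper that maps
a codepoint offset f and length l to code unit bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counterpart of U16_FWD_N(): starting at code unit index i,
 * step over at most n code points without passing `len`. */
static int32_t fwd_n(const uint16_t *s, int32_t i, int32_t len, int32_t n)
{
    while (n > 0 && i < len) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i += 2;             /* surrogate pair: one code point */
        } else {
            i += 1;
        }
        n--;
    }
    return i;
}

/* Hypothetical wrapper: converts a codepoint offset `f` and length `l`
 * into the code unit range [*start, *end). */
static void substr_bounds(const uint16_t *s, int32_t len, int32_t f,
                          int32_t l, int32_t *start, int32_t *end)
{
    assert(s != NULL && start != NULL && end != NULL);
    *start = fwd_n(s, 0, len, f);       /* skip the first f code points */
    *end = fwd_n(s, *start, len, l);    /* then span l code points */
}
```

For the code units 0x0061 0xD800 0xDDA2 0x0062 0x0063 with f = 1 and l = 2,
the bounds are start = 1 and end = 4: the segment is two code points long
but occupies three code units.
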
strrev()
--------

Let's look at strrev(), which requires a somewhat more complicated upgrade.
While one of the guidelines for upgrades is that combining sequences are not
really taken into account during processing -- substr() can break them up,
for example -- in this case, we actually should be concerned, because
reversing a combining sequence may result in a completely different string.
To illustrate:

      a    (U+0061 LATIN SMALL LETTER A)
      o    (U+006f LATIN SMALL LETTER O)
    + '    (U+0301 COMBINING ACUTE ACCENT)
    + _    (U+0320 COMBINING MINUS SIGN BELOW)
      l    (U+006C LATIN SMALL LETTER L)

Reversing this would result in:

      l    (U+006C LATIN SMALL LETTER L)
    + _    (U+0320 COMBINING MINUS SIGN BELOW)
    + '    (U+0301 COMBINING ACUTE ACCENT)
      o    (U+006f LATIN SMALL LETTER O)
      a    (U+0061 LATIN SMALL LETTER A)

All of a sudden the combining marks are being applied to 'l' instead of 'o'.
To avoid this, we need to treat combining sequences as a unit, by checking
the combining character class of each character with u_getCombiningClass().

strrev() obtains its single argument, a string, and unless the string is of
Unicode type, processes it exactly as before, simply swapping bytes around.
For Unicode case, the magic is like this:

    int32_t i, x1, x2;
    UChar32 ch;
    UChar *u_s, *u_n, *u_p;

    u_n = eumalloc(Z_USTRLEN_PP(str)+1);
    u_p = u_n;
    u_s = Z_USTRVAL_PP(str);

    i = Z_USTRLEN_PP(str);
    while (i > 0) {
        /* step back one codepoint */
        U16_PREV(u_s, 0, i, ch);
        if (u_getCombiningClass(ch) == 0) {
            /* base character: append it directly */
            u_p += zend_codepoint_to_uchar(ch, u_p);
        } else {
            /* combining mark: remember where the sequence ends */
            x2 = i;
            do {
                U16_PREV(u_s, 0, i, ch);
            } while (u_getCombiningClass(ch) != 0);
            /* copy base character and its marks in original order */
            x1 = i;
            while (x1 <= x2) {
                U16_NEXT(u_s, x1, Z_USTRLEN_PP(str), ch);
                u_p += zend_codepoint_to_uchar(ch, u_p);
            }
        }
    }
    *u_p = 0;

The basic idea is to walk the string backwards from the end, using
U16_PREV() macro. If the combining class of the current character is 0,
meaning it's a base character and not a combining mark, we simply append it
to the new string. Otherwise, we save the location of the index and do a run
over the characters until we get to the next one with combining class 0. At
that point we append the sequence as is, without reversing, to the new
string. Voila.

Note that the code uses zend_codepoint_to_uchar() to convert full Unicode
characters (UChar32 type) to 1 or 2 UTF-16 code units (UChar type).

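The same idea can be exercised outside of PHP with a simplified,
self-contained sketch. It is not the strrev() implementation: uint16_t
stands in for UChar, and instead of calling u_getCombiningClass() it uses a
hypothetical is_mark() test that only recognizes the Combining Diacritical
Marks block (U+0300..U+036F), which is a deliberate simplification.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical stand-in for u_getCombiningClass(ch) != 0. */
static int is_mark(uint16_t u)
{
    return u >= 0x0300 && u <= 0x036F;
}

/* Index where the code point ending just before index i (i > 0) starts. */
static int32_t prev_start(const uint16_t *s, int32_t i)
{
    if (i >= 2 && s[i - 1] >= 0xDC00 && s[i - 1] <= 0xDFFF &&
        s[i - 2] >= 0xD800 && s[i - 2] <= 0xDBFF) {
        return i - 2;           /* surrogate pair */
    }
    return i - 1;
}

/* Reverses `len` code units of s into out (room for `len` units), keeping
 * each base character together with the marks that follow it. */
static void reverse_keep_marks(const uint16_t *s, int32_t len, uint16_t *out)
{
    int32_t end = len, o = 0;

    assert(s != NULL && out != NULL);
    while (end > 0) {
        int32_t start = prev_start(s, end);

        /* Walk back over combining marks until a base character is found
         * or the beginning of the buffer is reached. */
        while (start > 0 && is_mark(s[start])) {
            start = prev_start(s, start);
        }
        /* Copy the whole sequence in its original order. */
        memcpy(out + o, s + start, (size_t)(end - start) * sizeof(uint16_t));
        o += end - start;
        end = start;
    }
}
```

Applied to the sequence from the illustration above (a, o, U+0301, U+0320,
l), the output is l, o, U+0301, U+0320, a: the marks stay attached to 'o'.
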

References
==========

[1] http://icu.sourceforge.net/apiref/icu4c/ustring_8h.html#a1

[2] http://icu.sourceforge.net/apiref/icu4c/utf16_8h.html

vim: set et ai tw=76 fo=tron21: