mirror of
https://github.com/php/php-src.git
synced 2024-11-25 19:05:31 +08:00
Introduction
|
|
============
|
|
|
|
As successful as PHP has proven to be in the past several years, it is still
|
|
the only remaining member of the P-trinity of scripting languages - Perl and
|
|
Python being the other two - that remains blithely ignorant of the
|
|
multilingual and multinational environment around it. The software
|
|
development community has been moving towards the Unicode Standard for some time
|
|
now, and PHP can no longer afford to be outside of this movement. Admittedly,
|
|
some steps have been taken recently to allow for easier processing of
|
|
multibyte data with the mbstring extension, but it is not enabled in PHP by
|
|
default and is not as intuitive or transparent as it could be.
|
|
|
|
The basic goal of this document is to describe how PHP 6 will support the
|
|
Unicode Standard natively. Since the full implementation of the Unicode
|
|
Standard is very involved, the idea is to use the already existing,
|
|
well-tested, full-featured, and freely available ICU (International
|
|
Components for Unicode) library. This will allow us to concentrate on the
|
|
details of PHP integration and speed up the implementation.
|
|
|
|
General Remarks
|
|
===============
|
|
|
|
Backwards Compatibility
|
|
-----------------------
|
|
Throughout the design and implementation of Unicode support, backwards
|
|
compatibility must be of paramount concern. PHP is used on an enormous number of
|
|
sites and the upgrade to Unicode-enabled PHP has to be transparent. This means
|
|
that the existing data types and functions must work as they have always
|
|
done. However, the speed of certain operations may be affected, due to
|
|
increased complexity of the code overall.
|
|
|
|
Unicode Encoding
|
|
----------------
|
|
The initial version will not support the Byte Order Mark (BOM). Characters are
expected to be precomposed, i.e. in Normalization Form C. Later versions will
support the BOM as well as decomposed and other character forms.
|
|
|
|
|
|
Implementation Approach
|
|
=======================
|
|
|
|
The implementation is done in phases. This allows for more basic and
|
|
low-level implementation issues to be ironed out and tested before
|
|
proceeding to more advanced topics.
|
|
|
|
Legend:
|
|
- TODO
|
|
+ finished
|
|
* in progress
|
|
|
|
Phase I
|
|
-------
|
|
+ Basic Unicode string support, including instantiation, concatenation,
|
|
indexing
|
|
|
|
+ Simple output of Unicode strings via 'print' and 'echo' statements
|
|
with appropriate output encoding conversion
|
|
|
|
+ Conversion of Unicode strings to/from various encodings via encode() and
|
|
decode() functions
|
|
|
|
+ Determining length of Unicode strings via strlen() function, some
|
|
simple string functions ported (substr).
|
|
|
|
|
|
Phase II
|
|
--------
|
|
* HTTP input request decoding
|
|
|
|
+ Fixing remaining string-aware operators (assignment to {}, etc)
|
|
|
|
+ Comparison (collation) of Unicode strings with built-in operators
|
|
|
|
* Support for Unicode and binary strings in PHP streams
|
|
|
|
+ Support for Unicode identifiers
|
|
|
|
* Configurable handling of conversion failures
|
|
|
|
+ \C{} escape sequence in strings
|
|
|
|
|
|
Phase III
|
|
---------
|
|
* Exposing ICU API
|
|
|
|
- Porting all remaining functions to support Unicode and/or binary
|
|
strings
|
|
|
|
|
|
Encoding Names
|
|
==============
|
|
All the encoding settings discussed in this document accept any valid
|
|
encoding name supported by ICU. See ICU online documentation for the full
|
|
list of encodings.
|
|
|
|
|
|
Internal Encoding
|
|
=================
|
|
|
|
UTF-16 is the internal encoding used for Unicode strings. UTF-16 consumes
|
|
two bytes for any Unicode character in the Basic Multilingual Plane, which
|
|
is where most of the current world's languages are represented. While being
|
|
less memory efficient for basic ASCII text it simplifies the processing and
|
|
makes interfacing with ICU easier, since ICU uses UTF-16 for its internal
|
|
processing as well.
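
The two-bytes-per-BMP-character claim can be illustrated with a minimal
sketch of UTF-16 encoding. The function name utf16_encode() is made up for
this example and is not part of PHP or ICU; it only shows how one code point
maps to one or two 16-bit code units.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
 * Returns the number of 16-bit code units written (1 or 2),
 * or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;                               /* out of range or surrogate */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* BMP: one unit, two bytes */
        return 1;
    }
    cp -= 0x10000;                              /* 20 payload bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

Characters outside the Basic Multilingual Plane therefore take four bytes,
which is why the rest of this document distinguishes code points from code
units.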
|
|
|
|
|
|
Fallback Encoding
|
|
=================
|
|
|
|
This setting specifies the "fallback" encoding for all the other ones. So if
|
|
a specific encoding setting is not set, PHP defaults it to the fallback
|
|
encoding. If the fallback_encoding is not specified either, it is set to
|
|
UTF-8.
|
|
|
|
fallback_encoding = "iso-8859-1"
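
The lookup order described above can be sketched as follows. This is an
illustration only: resolve_encoding() is a hypothetical name, and an unset
setting is assumed to be represented by NULL or an empty string.

```c
#include <stddef.h>

/* Hypothetical helper: pick the effective encoding for one setting.
 * specific - value of the setting itself (e.g. output_encoding)
 * fallback - value of fallback_encoding
 * An unset value is assumed to be NULL or "". */
const char *resolve_encoding(const char *specific, const char *fallback)
{
    if (specific != NULL && specific[0] != '\0') {
        return specific;        /* the setting was given explicitly */
    }
    if (fallback != NULL && fallback[0] != '\0') {
        return fallback;        /* use fallback_encoding */
    }
    return "UTF-8";             /* nothing set at all */
}
```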
|
|
|
|
|
|
Runtime Encoding
|
|
================
|
|
|
|
Currently PHP neither specifies nor cares what the encoding of its strings
|
|
is. However, the Unicode implementation needs to know what this encoding is
|
|
for several reasons, including type coercion and encoding conversion for
|
|
strings generated at runtime via function calls and casting. This setting
|
|
specifies this runtime encoding.
|
|
|
|
runtime_encoding = "iso-8859-1"
|
|
|
|
|
|
Output Encoding
|
|
===============
|
|
|
|
Automatic output encoding conversion is supported on the standard output
|
|
stream. Therefore, commands such as 'print' and 'echo' automatically convert
|
|
their arguments to the specified encoding. No automatic output encoding is
|
|
performed for anything else. Therefore, when writing to files or external
|
|
resources, the developer has to manually encode the data using functions
|
|
provided by the unicode extension or rely on stream encoding filters. The
|
|
unicode extension provides necessary stream filters to make developers'
|
|
lives easier.
|
|
|
|
The existing default_charset setting so far has been used only for
|
|
specifying the charset portion of the Content-Type MIME header. For several
|
|
reasons, this setting is deprecated. Now it is only used when the Unicode
|
|
semantics switch is disabled and does not affect the actual transcoding of
|
|
the output stream. The output encoding setting takes precedence in all other
|
|
cases.
|
|
|
|
output_encoding = "utf-8"
|
|
|
|
|
|
HTTP Input Encoding
|
|
===================
|
|
|
|
To make accessing HTTP input variables easier, PHP automatically decodes
|
|
HTTP GET and POST requests based on the specified encoding. If the HTTP
|
|
request contains the encoding specification in the headers, then it will be
|
|
used instead of this setting. If the HTTP input encoding setting is not
|
|
specified, PHP falls back onto the output encoding setting, because modern
|
|
browsers are supposed to return the data in the same encoding as they
|
|
received it in.
|
|
|
|
If the actual encoding is passed in the request itself or is found
|
|
elsewhere, then the application can ask PHP to re-decode the raw input
|
|
explicitly.
|
|
|
|
http_input_encoding = "utf-8"
|
|
|
|
|
|
Script Encoding
|
|
===============
|
|
|
|
PHP scripts may be written in any encoding supported by ICU. The encoding
|
|
of the scripts can be specified site-wide via an INI directive
|
|
script_encoding, or with a 'declare' pragma at the beginning of the script.
|
|
The reason for pragma is that an application written in Shift-JIS, for
|
|
example, should be executable on a system where the INI directive cannot be
|
|
changed by the application itself. The pragma setting is valid only for the
|
|
script it occurs in, and does not propagate to the included files.
|
|
|
|
pragma:
|
|
<?php declare(encoding = 'utf-8'); ?>
|
|
|
|
INI setting:
|
|
script_encoding = utf-8
|
|
|
|
|
|
Conversion Semantics
|
|
====================
|
|
|
|
Not all characters can be converted between Unicode and legacy encodings.
|
|
Normally, when downconverting from Unicode, the default behavior of ICU
|
|
converters is to substitute the missing sequence with the appropriate
|
|
substitution sequence for that codepage, such as 0x1A (Control-Z) in
|
|
ISO-8859-1. When upconverting to Unicode, if an encoding has a character
|
|
which cannot be converted into Unicode, that sequence is replaced by the
|
|
Unicode substitution character (U+FFFD).
|
|
|
|
The conversion failure behavior can be customized:
|
|
|
|
- perform substitution as described above with a custom substitution
|
|
character
|
|
- skip any invalid characters
|
|
- stop the conversion, raise an error, and return partial conversion
|
|
results
|
|
- replace the missing character with a diagnostic character and continue,
|
|
e.g. [U+hhhh]
|
|
|
|
There are two INI settings that control this.
|
|
|
|
unicode.from_error_mode = U_INVALID_SUBSTITUTE
                          U_INVALID_SKIP
                          U_INVALID_STOP
                          U_INVALID_ESCAPE
|
|
|
|
unicode.from_error_subst_char = a2
|
|
|
|
The second setting is supposed to contain the Unicode code point value for
|
|
the substitution character. This value has to be representable in the target
|
|
encoding.
|
|
|
|
Note that PHP always tries to convert as much of the data as possible and
|
|
returns the converted results even if an error happens.
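
The four failure modes can be sketched with a toy converter from code
points to ISO-8859-1. Everything here is an assumption made for
illustration: the enum names, the to_latin1() function and its signature
are hypothetical, and the real conversion is performed by ICU converters
rather than by code like this.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the four unicode.from_error_mode values. */
enum err_mode { ERR_SUBSTITUTE, ERR_SKIP, ERR_STOP, ERR_ESCAPE };

/* Convert n code points to ISO-8859-1, NUL-terminating dst.
 * Returns the number of bytes written (excluding the NUL) and sets
 * *failed to 1 if any code point could not be represented. */
size_t to_latin1(const uint32_t *src, size_t n, char *dst, size_t cap,
                 enum err_mode mode, unsigned char subst, int *failed)
{
    size_t out = 0;

    *failed = 0;
    if (cap == 0) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        if (src[i] <= 0xFF) {               /* representable in Latin-1 */
            if (out + 1 >= cap) break;
            dst[out++] = (char)src[i];
            continue;
        }
        *failed = 1;
        if (mode == ERR_STOP) break;        /* keep the partial result */
        if (mode == ERR_SKIP) continue;     /* drop the character */
        if (mode == ERR_SUBSTITUTE) {
            if (out + 1 >= cap) break;
            dst[out++] = (char)subst;       /* e.g. 0x1A */
        } else {                            /* ERR_ESCAPE: [U+hhhh] */
            char esc[16];
            int len = snprintf(esc, sizeof esc, "[U+%04X]", (unsigned)src[i]);
            if (out + (size_t)len >= cap) break;
            for (int k = 0; k < len; k++) dst[out++] = esc[k];
        }
    }
    dst[out] = '\0';
    return out;
}
```

In every mode the bytes converted so far are kept, matching the note above
that partial results are returned even when an error occurs.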
|
|
|
|
|
|
Unicode Switch
|
|
==============
|
|
|
|
Obviously, PHP cannot simply impose new Unicode support on everyone. There
|
|
are many applications that do not care about Unicode and do not need it.
|
|
Consequently, there is a switch that enables certain fundamental language
|
|
changes related to Unicode. This switch is available as a site-wide, or
|
|
per-dir INI setting only.
|
|
|
|
Note that having the switch turned off does not imply that PHP is unaware of
|
|
Unicode at all and that no Unicode string can exist. It only affects certain
|
|
aspects of the language, and Unicode strings can always be created
|
|
programmatically.
|
|
|
|
unicode_semantics = On
|
|
|
|
[TODO: list areas that are affected by this switch]
|
|
|
|
|
|
Unicode String Type
|
|
===================
|
|
|
|
Unicode string type (IS_UNICODE) is supposed to contain text data encoded in
|
|
UTF-16 format. It is the main string type in PHP when Unicode semantics
|
|
switch is turned on. Unicode strings can exist when the switch is off, but
|
|
they have to be produced programmatically, via calls to functions that
|
|
return Unicode type.
|
|
|
|
The operational unit when working with Unicode strings is a code point, not
|
|
code unit or byte. One code point in UTF-16 may be comprised of 1 or 2 code
|
|
units, each of which is a 16-bit word. Working on the code point level is
|
|
necessary because doing otherwise would mean offloading the processing of
|
|
surrogate pairs onto PHP users, and that is less than desirable.
|
|
|
|
The repercussion is that one cannot expect code point N to be at offset
N in the Unicode string. Instead, one has to iterate from the beginning of
the string, using ICU's U16_FWD_1() or U16_FWD_N() macro, until the desired
code point is reached.
|
|
|
|
The codepoint access is one of the primary areas targeted for optimization.
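
To show why code point access is linear rather than constant time, here is
a minimal sketch that walks a UTF-16 buffer by hand. The function name
utf16_offset_of() is hypothetical; real code would rely on the ICU macros
rather than this loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the code unit offset at which code point
 * number n (0-based) starts. Returns len if the string holds n or fewer
 * code points. Unpaired surrogates count as one code point each. */
size_t utf16_offset_of(const uint16_t *s, size_t len, size_t n)
{
    size_t i = 0;

    while (n > 0 && i < len) {
        /* a high surrogate followed by a low one forms a single code point */
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i += 2;
        } else {
            i += 1;
        }
        n--;
    }
    return i;
}
```

For the string 'a', U+10000, 'b' the third code point starts at code unit
offset 3, not 2, because the supplementary character occupies two units.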
|
|
|
|
|
|
Native Encoding String Type
|
|
===========================
|
|
|
|
Native encoding string type (IS_STRING) serves two purposes: backwards
|
|
compatibility when Unicode semantics switch is off, and for representing
|
|
strings in non-Unicode encodings (native encodings) when it is on. It is
|
|
processed on the byte level.
|
|
|
|
|
|
Binary String Type
|
|
==================
|
|
|
|
Binary string type (IS_BINARY) can be used for storing images, PDFs, or
|
|
other binary data intended to be processed on a byte-level and that cannot
|
|
be interpreted as text.
|
|
|
|
Binary data type does not participate in implicit conversions, and cannot be
|
|
explicitly upconverted to other string types, although the inverse is
|
|
possible.
|
|
|
|
Printing binary data to the standard output passes it through as-is,
|
|
independent of the output encoding.
|
|
|
|
When Unicode semantics switch is off, binary string literals and binary
|
|
strings returned by functions actually resolve to IS_STRING type, for
|
|
backwards compatibility reasons.
|
|
|
|
|
|
Zval Structure Changes
|
|
======================
|
|
|
|
PHP is a type-agnostic language. Its data values are encapsulated in a zval
|
|
(Zend value) structure that can change as necessary to accommodate various types.
|
|
|
|
    struct _zval_struct {
        /* Variable information */
        union {
            long lval;                 /* long value */
            double dval;               /* double value */
            struct {
                char *val;
                int len;
            } str;                     /* string value */
            HashTable *ht;             /* hash table value */
            zend_object_value obj;     /* object value */
        } value;
        zend_uint refcount;
        zend_uchar type;               /* active type */
        zend_uchar is_ref;
    };
|
|
|
|
The type field determines what is stored in the union, IS_STRING being the only
|
|
data type pertinent to this discussion. In the current version, the strings
|
|
are binary-safe, but, for all intents and purposes, are assumed to be
|
|
comprised of 8-bit characters. It is possible to treat the string value as
|
|
an opaque type containing arbitrary binary data, and in fact that is how
|
|
mbstring extension uses it, in order to store multibyte strings. However,
|
|
many extensions and the Zend engine itself manipulate the string value
|
|
directly without regard to its internals. Needless to say, this can lead to
|
|
problems.
|
|
|
|
For IS_UNICODE type, we need to add another structure to the union:
|
|
|
|
    union {
        ....
        struct {
            UChar *val;            /* Unicode string value */
            int32_t len;           /* number of UChar's */
        } ustr;
        ....
    } value;
|
|
|
|
This cleanly separates the two types of strings and helps preserve backwards
|
|
compatibility. For IS_BINARY type, we can re-use the str structure.
|
|
|
|
|
|
Language Modifications
|
|
======================
|
|
|
|
If the Unicode switch is turned on, PHP string literals - single-quoted,
|
|
double-quoted, and heredocs - become Unicode strings (IS_UNICODE type).
|
|
They support all the same escape sequences and variable interpolations as
|
|
previously, with the addition of some new escape sequences.
|
|
|
|
The contents of the strings are interpreted as follows:
|
|
|
|
- all non-escaped characters are interpreted as a corresponding Unicode
|
|
codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) =>
|
|
U+0061, Shift-JIS (0x92 0x86) => U+4E2D
|
|
|
|
- existing PHP escape sequences are also interpreted as Unicode codepoints,
|
|
including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020
|
|
|
|
- two new escape sequences, \uXXXX and \UXXXXXX are interpreted as a 4 or
|
|
6-hex Unicode codepoint value, e.g. \u0221 => U+0221, \U010410 =>
|
|
U+10410
|
|
|
|
- a new escape sequence allows specifying a character by its full
|
|
Unicode name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20
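
The \uXXXX and \UXXXXXX rules listed above can be sketched as a small
parser. This is not the Zend scanner code; parse_unicode_escape() and
hex_value() are hypothetical names, and rejecting values above U+10FFFF is
an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if c is not a hex digit (including NUL). */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: parse "\uXXXX" or "\UXXXXXX" at the start of s.
 * On success stores the code point in *cp and returns the number of
 * characters consumed (6 or 8); returns 0 otherwise. */
size_t parse_unicode_escape(const char *s, uint32_t *cp)
{
    size_t digits;
    uint32_t value = 0;

    if (s[0] != '\\') return 0;
    if (s[1] == 'u') {
        digits = 4;
    } else if (s[1] == 'U') {
        digits = 6;
    } else {
        return 0;
    }
    for (size_t i = 0; i < digits; i++) {
        int h = hex_value(s[2 + i]);
        if (h < 0) return 0;            /* too few digits or bad digit */
        value = (value << 4) | (uint32_t)h;
    }
    if (value > 0x10FFFF) return 0;     /* beyond the Unicode range */
    *cp = value;
    return 2 + digits;
}
```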
|
|
|
|
The single-quoted string is more restrictive than the other two types: so
|
|
far the only escape sequences allowed inside of it were \' and \\, which specify
a literal single quote and a literal backslash. However, single-quoted strings
now support the new
|
|
Unicode character escape sequences as well.
|
|
|
|
PHP allows variable interpolation inside the double-quoted and heredoc strings.
|
|
However, the parser separates the string into literal and variable chunks during
|
|
compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that the
|
|
literal chunks can be handled in the normal way as far as Unicode
|
|
support is concerned.
|
|
|
|
Since all string literals become Unicode by default, one loses the ability
|
|
to specify byte-oriented or binary strings. In order to create binary string
|
|
literals, a new syntax is necessary: prefixing a string literal with letter
|
|
'b' creates a binary string.
|
|
|
|
$var = b'abc\001';
|
|
$var = b"abc\001";
|
|
$var = b<<<EOD
|
|
abc\001
|
|
EOD;
|
|
|
|
The binary string literals support the same escape sequences as the current
|
|
PHP strings. If the Unicode switch is turned off, then the binary string
|
|
literals generate normal string (IS_STRING) type internally, without any
|
|
effect on the application.
|
|
|
|
The string operators have been changed to accommodate the new IS_UNICODE and
|
|
IS_BINARY types. In more detail:
|
|
|
|
- The concatenation (.) operator has been changed to automatically coerce
|
|
IS_STRING type to the more precise IS_UNICODE if its operands are of two
|
|
different string types. It does not perform coercion for IS_BINARY type,
|
|
however, since binary data is not considered to be in any encoding. To
|
|
concatenate string with binary data, strings have to be cast to binary
|
|
type first. The coercion uses the conversion matrix specified later in
|
|
this document.
|
|
|
|
- The concatenation assignment operator (.=) has been changed similarly.
|
|
|
|
- The string indexing operators {}/[] have been changed to accommodate
|
|
IS_UNICODE type strings and extract the specified character. Note that
|
|
the index specifies a code point, not a byte, or a code unit, thus
|
|
supporting supplementary characters as well.
|
|
|
|
- Both Unicode and binary string types can be used as array keys. If the
|
|
Unicode switch is on, the native encoding strings are converted to
|
|
Unicode, if they are used as hash keys, but binary strings are not.
|
|
Note that this means if Unicode switch is off, then Unicode string "abc"
|
|
and native string "abc" do not hash to the same value.
|
|
|
|
- Bitwise operators and increment/decrement operators do not work on
|
|
Unicode strings. They do work on binary strings.
|
|
|
|
- Two new casting operators are introduced, (unicode) and (binary).
|
|
They use the conversion matrix specified later in this document.
|
|
|
|
- The comparison operators, when applied to Unicode strings, perform
comparison in binary code point order. They also do appropriate coercion
|
|
if the strings are of differing types.
|
|
|
|
- The arithmetic operators use the same semantics as today for converting
|
|
strings to numbers. A Unicode string is considered numeric if it
|
|
represents a long or a double number in en_US_POSIX locale.
|
|
|
|
|
|
Inline HTML
|
|
===========
|
|
Because inline HTML blocks are intermixed with PHP ones, they are also
|
|
written in the script encoding. PHP transcodes the HTML blocks to the output
|
|
encoding as needed, resulting in direct passthrough if the script encoding
|
|
matches output encoding.
|
|
|
|
|
|
Identifiers
|
|
===========
|
|
Considering that scripts may be written in various encodings, we do not
|
|
restrict identifiers to be ASCII-only. PHP allows any valid identifier based
|
|
on the Unicode Standard Annex #31. The identifiers are case folded when
|
|
necessary (class and function names) and converted to normalization form
|
|
NFKC, so that two identifiers written in two compatible ways refer to the
|
|
same thing.
|
|
|
|
|
|
Numbers
|
|
=======
|
|
Unlike identifiers, we restrict numbers to consist only of ASCII digits and
|
|
do not interpret them as written in a specific locale. The numbers are
|
|
expected to adhere to en_US_POSIX or C locale, i.e. having no thousands
|
|
separator and fractional separator being (.) "full stop". Numeric strings
|
|
are supposed to adhere to the same rules, i.e. "10,3" is not interpreted as
|
|
a number even if the current locale's fractional separator is comma.
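
A locale-independent numeric check along these lines might look like the
sketch below. It is deliberately simplified: is_numeric_ascii() is a
hypothetical name, and details of PHP's actual numeric string rules (such
as leading whitespace) are left out.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if s is [sign] digits [. digits]
 * [e [sign] digits] using ASCII digits and '.' only, else 0.
 * No locale lookup is involved, so "10,3" is never a number. */
int is_numeric_ascii(const char *s)
{
    size_t i = 0, digits = 0;

    if (s[i] == '+' || s[i] == '-') i++;
    while (s[i] >= '0' && s[i] <= '9') { i++; digits++; }
    if (s[i] == '.') {                  /* fractional separator: full stop */
        i++;
        while (s[i] >= '0' && s[i] <= '9') { i++; digits++; }
    }
    if (digits == 0) return 0;          /* need at least one digit */
    if (s[i] == 'e' || s[i] == 'E') {
        size_t exp_digits = 0;
        i++;
        if (s[i] == '+' || s[i] == '-') i++;
        while (s[i] >= '0' && s[i] <= '9') { i++; exp_digits++; }
        if (exp_digits == 0) return 0;  /* exponent needs digits */
    }
    return s[i] == '\0';                /* no trailing characters allowed */
}
```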
|
|
|
|
|
|
Parameter Parsing API Modifications
|
|
===================================
|
|
|
|
Internal PHP functions largely use the zend_parse_parameters() API in order to
|
|
obtain the parameters passed to them by the user. For example:
|
|
|
|
    char *str;
    int len;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &str, &len) == FAILURE) {
        return;
    }
|
|
|
|
This forces the input parameter to be a string, and its value and length are
|
|
stored in the variables specified by the caller.
|
|
|
|
There are now three new specifiers: 't', 'u', and 'T'.
|
|
|
|
't' specifier
|
|
-------------
|
|
This specifier indicates that the caller requires the incoming parameter
|
|
to be string data (IS_STRING, IS_UNICODE, IS_BINARY). The caller has to provide
|
|
the storage for string value, length, and type.
|
|
|
|
    void *str;
    int len;
    zend_uchar type;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "t", &str, &len, &type) == FAILURE) {
        return;
    }
    if (type == IS_UNICODE) {
        /* process UTF-16 data */
    } else {
        /* process native string or binary data */
    }
|
|
|
|
For IS_STRING and IS_BINARY types, the length represents the number of
|
|
bytes, and for IS_UNICODE the number of UChar's. When converting other
|
|
types (numbers, booleans, etc) to strings, the exact behavior depends on
|
|
the Unicode semantics switch: if on, they are converted to IS_UNICODE,
|
|
otherwise to IS_STRING.
|
|
|
|
|
|
'u' specifier
|
|
-------------
|
|
This specifier indicates that the caller requires the incoming parameter
|
|
to be a Unicode UTF-16 encoded string. If a non-Unicode string is passed,
|
|
the engine creates a copy of the string and automatically converts it
|
|
to Unicode type before passing it to the internal function. No such
|
|
conversion is necessary for Unicode strings, obviously. Binary type cannot
|
|
be upconverted, and the engine issues an error in such case.
|
|
|
|
    UChar *str;
    int32_t len;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "u", &str, &len) == FAILURE) {
        return;
    }
    /* process UTF-16 data */
|
|
|
|
|
|
'T' specifier
|
|
-------------
|
|
This specifier is useful when the function takes two or more strings and
|
|
operates on them. Using 't' specifier for each one would be somewhat
|
|
problematic if the passed-in strings are of mixed types, and multiple
|
|
checks need to be performed in order to do anything. All parameters
|
|
marked by the 'T' specifier are promoted to the same type.
|
|
|
|
Binary type is generally speaking the most precise one. However, we do not
|
|
want to convert Unicode strings to binary ones, so an error is thrown
|
|
if the incoming list of parameters has both Unicode and binary strings in
|
|
it.
|
|
|
|
If there are no binary strings, and at least one of the strings is of
|
|
Unicode type, then all the rest of the strings are upconverted to Unicode.
|
|
|
|
Otherwise the promotion is to IS_STRING type.
|
|
|
|
|
|
    void *str1, *str2;
    int len1, len2;
    zend_uchar type1, type2;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "TT", &str1, &len1,
                              &type1, &str2, &len2, &type2) == FAILURE) {
        return;
    }
    if (type1 == IS_UNICODE) {
        /* process as Unicode, str2 is guaranteed to be Unicode as well */
    } else {
        /* process as native string, str2 is guaranteed to be the same */
    }
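
The promotion rule for 'T' can be restated as a small standalone sketch.
It follows the three rules above literally and is not taken from the engine
source; the enum values and promote_types() are hypothetical stand-ins for
the IS_* constants and the real parameter parsing code.

```c
#include <stddef.h>

/* Hypothetical stand-ins for IS_STRING, IS_UNICODE and IS_BINARY. */
enum str_type { TYPE_STRING, TYPE_UNICODE, TYPE_BINARY };

/* Decide the common type for all 'T' parameters.
 * Returns 0 and stores the result in *common, or returns -1 when the
 * list mixes Unicode and binary strings (an error per the rules above). */
int promote_types(const enum str_type *types, size_t n, enum str_type *common)
{
    int has_unicode = 0, has_binary = 0;

    for (size_t i = 0; i < n; i++) {
        if (types[i] == TYPE_UNICODE) has_unicode = 1;
        if (types[i] == TYPE_BINARY)  has_binary = 1;
    }
    if (has_unicode && has_binary) {
        return -1;                      /* Unicode is never made binary */
    }
    /* no binary strings and at least one Unicode string: upconvert all;
     * otherwise the promotion is to the native string type */
    *common = has_unicode ? TYPE_UNICODE : TYPE_STRING;
    return 0;
}
```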
|
|
|
|
|
|
The existing 's' specifier has been modified as well. If a Unicode string is
|
|
passed in, it automatically copies and converts the string to the runtime
|
|
encoding, and issues a warning. If a binary type is passed in, no conversion
|
|
is necessary.
|
|
|
|
|
|
Upgrading Existing Functions
|
|
============================
|
|
|
|
Upgrading functions to work with new data types will be a deliberate and
|
|
involved process, because one needs to consider not only the mechanisms for
|
|
processing Unicode characters, for example, but also the semantics of
|
|
the function.
|
|
|
|
The main tenet of the upgrade process should be that when processing Unicode
|
|
strings, the unit of operation is a code point, not a code unit or a byte.
|
|
For example, strlen() returns the number of code points in the string.
|
|
|
|
strlen('abc') = 3
|
|
strlen('ab\U010000') = 3
|
|
strlen('ab\uD800\uDC00') = 3 /* not 4 */
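
The results above follow from counting code points instead of code units.
A minimal sketch of such a count over a UTF-16 buffer is shown below;
utf16_strlen() is a hypothetical name and not the actual strlen()
implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of code points in a UTF-16 buffer of
 * len code units. A well-formed surrogate pair counts once; unpaired
 * surrogates count as one code point each. */
size_t utf16_strlen(const uint16_t *s, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        /* skip the low half of a pair: its high half was already counted */
        if (i > 0 && s[i] >= 0xDC00 && s[i] <= 0xDFFF &&
            s[i - 1] >= 0xD800 && s[i - 1] <= 0xDBFF) {
            continue;
        }
        count++;
    }
    return count;
}
```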
|
|
|
|
Function upgrade guidelines are available in a separate document.
|
|
|
|
|
|
Unicode Extension
|
|
=================
|
|
|
|
There will be one or more extensions that provide Unicode and i18n services
|
|
to PHP. In phase I only the conversion service is necessary. The Unicode
|
|
extension is 'ext/unicode' and its functions should be prefixed with 'unicode'
|
|
or 'icu'.
|
|
|
|
Conversion Functions
|
|
--------------------
|
|
|
|
string unicode_encode(unicode $input, text $encoding)
|
|
|
|
Takes a UTF-16 Unicode string and converts it to the target
|
|
encoding, returning the result.
|
|
|
|
unicode unicode_decode(string $input, text $encoding)
|
|
|
|
Takes a string in the source encoding and converts it to a UTF-16
|
|
Unicode string, returning the result.
|
|
|
|
|
|
Type Conversion Matrix
|
|
======================
|
|
|
|
                to |  IS_STRING     |  IS_UNICODE    |  IS_BINARY
    from           |                |                |
    ---------------------------------------------------------------------
    IS_STRING      |  n/a           |  implicit=yes  |  implicit=no
                   |                |  explicit=yes  |  explicit=yes
    ---------------------------------------------------------------------
    IS_UNICODE     |  implicit=no   |  n/a           |  implicit=no
                   |  explicit=yes  |                |  explicit=yes
    ---------------------------------------------------------------------
    IS_BINARY      |  implicit=no   |  implicit=no   |  n/a
                   |  explicit=no   |  explicit=no   |
|
|
|
|
explicit = casting
|
|
implicit = for concatenation, etc
|
|
|
|
IS_STRING <-> IS_UNICODE uses runtime-encoding
|
|
IS_UNICODE -> IS_BINARY converts to runtime encoding first, then to binary
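
The matrix reduces to two short predicates, sketched here for clarity. The
enum values and the function names can_cast() and can_coerce() are
hypothetical; treating a same-type "conversion" (n/a in the matrix) as
allowed is an assumption of this sketch.

```c
/* Hypothetical stand-ins for IS_STRING, IS_UNICODE and IS_BINARY. */
enum str_type { TYPE_STRING, TYPE_UNICODE, TYPE_BINARY };

/* Explicit conversion (casting): anything except binary can be cast. */
int can_cast(enum str_type from, enum str_type to)
{
    if (from == to) return 1;           /* n/a: nothing to convert */
    return from != TYPE_BINARY;
}

/* Implicit conversion (concatenation etc.): only native -> Unicode. */
int can_coerce(enum str_type from, enum str_type to)
{
    if (from == to) return 1;           /* n/a: nothing to convert */
    return from == TYPE_STRING && to == TYPE_UNICODE;
}
```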
|
|
|
|
|
|
Implementation Details That Need Expanding
|
|
==========================================
|
|
- Streams support for Unicode - What stream filters will we be providing?
|
|
- Conversion errors behavior - Need to define the default.
|
|
- INI files encoding - Do we support BOMs?
|
|
- There are likely to be other issues which are missing from this document
|
|
|
|
|
|
Build System
|
|
============
|
|
|
|
Unicode support in PHP is always enabled. The only configuration option
|
|
during development should be the location of the ICU headers and libraries.
|
|
|
|
--with-icu-dir=<dir> <dir> parameter specifies the location of ICU
|
|
header and library files.
|
|
|
|
After the initial development we have to repackage ICU library for our needs
|
|
and bundle it with PHP.
|
|
|
|
|
|
Document History
|
|
================
|
|
0.5: Updated per latest discussions. Removed tentative language in several
|
|
places, since we have decided on everything described here already.
|
|
Clarified details according to Phase II progress.
|
|
|
|
0.4: Updated to include all the latest discussions. Updated development
|
|
phases.
|
|
|
|
0.3: Updated to include all the latest discussions.
|
|
|
|
0.2: Updated Phase I design proposal per discussion on unicode@php.net.
|
|
Modified Internal Encoding section to contain only UTF-16 info.
|
|
Expanded Script Encoding section.
|
|
Added Binary Data Type section.
|
|
Amended Language Modifications section to describe string literals
|
|
behavior.
|
|
Amended Build System section.
|
|
|
|
0.1: Phase I design proposal
|
|
|
|
|
|
References
|
|
==========
|
|
|
|
Unicode
|
|
http://www.unicode.org
|
|
|
|
Unicode Glossary
|
|
http://www.unicode.org/glossary/
|
|
|
|
UTF-8
|
|
http://www.utf-8.com/
|
|
|
|
UTF-16
|
|
http://www.ietf.org/rfc/rfc2781.txt
|
|
|
|
ICU Homepage
|
|
http://www.ibm.com/software/globalization/icu/
|
|
|
|
ICU User Guide and API Reference
|
|
http://icu.sourceforge.net/
|
|
|
|
Unicode Annex #31
|
|
http://www.unicode.org/reports/tr31/
|
|
|
|
PHP Parameter Parsing API
|
|
http://www.php.net/manual/en/zend.arguments.retrieval.php
|
|
|
|
|
|
Authors
|
|
=======
|
|
Andrei Zmievski <andrei@gravitonic.com>
|
|
|
|
vim: set et :
|