mirror of
https://github.com/php/php-src.git
synced 2025-01-09 20:44:33 +08:00
Introduction
============

As successful as PHP has proven to be in the past several years, it is still
the only remaining member of the P-trinity of scripting languages - Perl and
Python being the other two - that remains blithely ignorant of the
multilingual and multinational environment around it. The software
development community has been moving towards the Unicode Standard for some
time now, and PHP can no longer afford to be outside of this movement.
Admittedly, some steps have been taken recently to allow for easier processing
of multibyte data with the mbstring extension, but it is not enabled in PHP by
default and is not as intuitive or transparent as it could be.

The basic goal of this document is to describe how PHP 6 will support the
Unicode Standard natively. Since a full implementation of the Unicode
Standard is very involved, the idea is to use the already existing,
well-tested, full-featured, and freely available ICU (International
Components for Unicode) library. This will allow us to concentrate on the
details of PHP integration and speed up the implementation.

General Remarks
===============

Backwards Compatibility
-----------------------
Throughout the design and implementation of Unicode support, backwards
compatibility must be of paramount concern. PHP is used on an enormous number
of sites and the upgrade to Unicode-enabled PHP has to be transparent. This
means that the existing data types and functions must work as they have always
done. However, the speed of certain operations may be affected, due to
increased complexity of the code overall.

Unicode Encoding
----------------
The initial version will not support the Byte Order Mark (BOM). Characters are
expected to be composed, i.e. in Normalization Form C. Later versions will
support BOM, as well as decomposed and other characters.

Implementation Approach
=======================

The implementation is done in phases. This allows the more basic and
low-level implementation issues to be ironed out and tested before
proceeding to more advanced topics.

Legend:
  - TODO
  + finished
  * in progress

Phase I
-------
  + Basic Unicode string support, including instantiation, concatenation,
    indexing

  + Simple output of Unicode strings via 'print' and 'echo' statements
    with appropriate output encoding conversion

  + Conversion of Unicode strings to/from various encodings via encode() and
    decode() functions

  + Determining length of Unicode strings via strlen() function, some
    simple string functions ported (substr).

Phase II
--------
  * HTTP input request decoding

  + Fixing remaining string-aware operators (assignment to [] etc)

  + Support for Unicode and binary strings in PHP streams

  + Support for Unicode identifiers

  + Configurable handling of conversion failures

  + \C{} escape sequence in strings

Phase III
---------
  * Exposing ICU API

  * Porting all remaining functions to support Unicode and/or binary
    strings

Encoding Names
==============
All the encoding settings discussed in this document accept any valid
encoding name supported by ICU. See the ICU online documentation for the full
list of encodings.

Unicode Semantics Switch
========================

Obviously, PHP cannot simply impose new Unicode support on everyone. There
are many applications that do not care about Unicode and do not need it.
Consequently, there is a switch that enables certain fundamental language
changes related to Unicode. This switch is available only as a site-wide (per
virtual server) INI setting.

Note that having the switch turned off does not imply that PHP is unaware of
Unicode or that no Unicode strings can exist. It only affects certain aspects
of the language, and Unicode strings can always be created programmatically.
All the functions and operators will still support Unicode strings and work
appropriately.

  unicode.semantics = On

Internal Encoding
=================

UTF-16 is the internal encoding used for Unicode strings. UTF-16 consumes
two bytes for any Unicode character in the Basic Multilingual Plane, which
is where most of the world's current languages are represented. While it is
less memory efficient for basic ASCII text, it simplifies the processing and
makes interfacing with ICU easier, since ICU uses UTF-16 for its internal
processing as well.

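To make the storage cost of UTF-16 concrete, the following C sketch encodes a
single code point into one or two 16-bit code units. The function name
utf16_encode() is hypothetical and is not part of PHP or ICU; in real code
ICU's U16_APPEND macro plays this role.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode code point cp into out[] as UTF-16.
 * Returns the number of code units written (1 or 2), or 0 if cp is not a
 * valid scalar value (above U+10FFFF or inside the surrogate range).
 */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: one code unit, two bytes */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* lead surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* trail surrogate */
    return 2;
}
```

Under this scheme U+0061 occupies a single code unit, while a supplementary
character such as U+10410 needs the surrogate pair 0xD801 0xDC10.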
Fallback Encoding
=================

This setting specifies the "fallback" encoding for all the other ones. So if
a specific encoding setting is not set, PHP defaults it to the fallback
encoding. If the fallback_encoding is not specified either, it is set to
UTF-8.

  unicode.fallback_encoding = "iso-8859-1"

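The lookup order described for the fallback encoding can be sketched in C as
follows. This is only an illustration under assumed names: struct unicode_ini,
enum enc_setting and effective_encoding() are hypothetical and do not reflect
how the INI values are actually stored in the engine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical identifiers for the per-purpose encoding settings. */
enum enc_setting { ENC_RUNTIME, ENC_OUTPUT, ENC_SCRIPT, ENC_COUNT };

/* Hypothetical container for the INI values; NULL means "not set". */
struct unicode_ini {
    const char *fallback;             /* unicode.fallback_encoding */
    const char *specific[ENC_COUNT];  /* runtime, output, script encodings */
};

/* Resolve a setting: specific value, else fallback, else UTF-8. */
const char *effective_encoding(const struct unicode_ini *ini,
                               enum enc_setting which)
{
    assert(ini != NULL && (int)which < ENC_COUNT);
    if (ini->specific[which] != NULL) {
        return ini->specific[which];
    }
    if (ini->fallback != NULL) {
        return ini->fallback;
    }
    return "UTF-8";
}
```

With only the fallback set to "iso-8859-1", every setting resolves to
"iso-8859-1"; with nothing set at all, every setting resolves to "UTF-8".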
Runtime Encoding
================

Currently PHP neither specifies nor cares what the encoding of its strings
is. However, the Unicode implementation needs to know what this encoding is
for several reasons, including explicit (casting) and implicit (concatenation,
comparison, parameter passing) type coercions. This setting specifies the
runtime encoding.

  unicode.runtime_encoding = "iso-8859-1"

Output Encoding
===============

Automatic output encoding conversion is supported on the standard output
stream. Therefore, commands such as 'print' and 'echo' automatically convert
their arguments to the specified encoding. No automatic output encoding is
performed for anything else. Therefore, when writing to files or external
resources, the developer has to manually encode the data using functions
provided by the unicode extension or rely on stream encoding features.

The existing default_charset setting has so far been used only for
specifying the charset portion of the Content-Type MIME header. For several
reasons, this setting is deprecated. Now it is only used when the Unicode
semantics switch is disabled and does not affect the actual transcoding of
the output stream. The output encoding setting takes precedence in all other
cases. If the output encoding is set, PHP will automatically add the 'charset'
portion to the Content-Type header.

  unicode.output_encoding = "utf-8"

HTTP Input Encoding
===================

There will be no explicit input encoding setting. Instead, PHP will rely on a
couple of heuristics to determine what encoding the incoming request might be
in. Firstly, PHP will attempt to decode the input using the value of the
unicode.output_encoding setting, because that is the most logical choice if we
assume that the clients send the data back in the encoding that the page with
the form was in. If that is unsuccessful, we could fall back on the "_charset_"
form parameter, if present. This parameter is sent by IE (and possibly Firefox)
along with the form data and indicates the encoding of the request. Note that
this parameter will be present only if the form contains a hidden field named
"_charset_".

The variables that are decoded successfully will be put into the request arrays
as Unicode strings; those that fail will be stored as binary strings. PHP will
set a flag (probably in the $_SERVER array) indicating that there were problems
during the conversion. In case of failure, the user will have access to the raw
input via the input filter extension and can access the request parameters
via the input_get_arg() function. The input filter extension always looks in
the raw input data and not in the request arrays, and input_get_arg() has a
'charset' parameter that can be specified to tell PHP what charset the incoming
data is in. This kills two birds with one stone: users have access to request
arrays data on successful decoding as well as a standard and secure way to get
at the data in case of failed decoding.

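The two-step heuristic for request decoding can be sketched as an ordered list
of candidates that are tried in turn. Everything named here is an assumption
made for illustration: pick_input_encoding(), the decode_check_fn callback
type, and the toy checker demo_can_decode() are hypothetical, and a real
implementation would ask an ICU converter whether the data decodes cleanly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: non-zero if data decodes cleanly as encoding. */
typedef int (*decode_check_fn)(const char *encoding,
                               const unsigned char *data, size_t len);

/*
 * Try the output encoding first, then the "_charset_" form parameter.
 * Returns the encoding that worked, or NULL if both failed, in which case
 * the variable would be kept as a binary string and the failure flag set.
 */
const char *pick_input_encoding(const char *output_encoding,
                                const char *charset_param,
                                const unsigned char *data, size_t len,
                                decode_check_fn can_decode)
{
    assert(can_decode != NULL);
    if (output_encoding != NULL && can_decode(output_encoding, data, len)) {
        return output_encoding;
    }
    if (charset_param != NULL && can_decode(charset_param, data, len)) {
        return charset_param;
    }
    return NULL;
}

/* Toy checker: "us-ascii" rejects bytes >= 0x80, any other name accepts. */
int demo_can_decode(const char *encoding,
                    const unsigned char *data, size_t len)
{
    size_t i;
    if (strcmp(encoding, "us-ascii") != 0) {
        return 1;
    }
    for (i = 0; i < len; i++) {
        if (data[i] >= 0x80) {
            return 0;
        }
    }
    return 1;
}
```

Keeping the check behind a callback keeps the ordering logic separate from
the actual converter, so the same routine works whichever library performs
the decoding.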
Script Encoding
===============

PHP scripts may be written in any encoding supported by ICU. The encoding
of the scripts can be specified site-wide via an INI directive, or with a
'declare' pragma at the beginning of the script. The reason for the pragma is
that an application written in Shift-JIS, for example, should be executable on
a system where the INI directive cannot be changed by the application itself.
The pragma setting is valid only for the script it occurs in, and does not
propagate to the included files.

  pragma:
    <?php declare(encoding = 'utf-8'); ?>

  INI setting:
    unicode.script_encoding = utf-8

Conversion Semantics
====================

Not all characters can be converted between Unicode and legacy encodings.
Normally, when downconverting from Unicode, the default behavior of ICU
converters is to substitute the missing sequence with the appropriate
substitution sequence for that codepage, such as 0x1A (Control-Z) in
ISO-8859-1. When upconverting to Unicode, if an encoding has a character
which cannot be converted into Unicode, that sequence is replaced by the
Unicode substitution character (U+FFFD).

The conversion error behavior can be customized:

  - stop the conversion and return an empty string
  - skip any invalid characters
  - substitute invalid characters with a custom substitution character
  - escape the invalid character in various formats

The global conversion error settings can be controlled with these two functions:

  unicode_set_error_mode(int direction, int mode)
  unicode_set_subst_char(unicode char)

Where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of these
constants:

  U_CONV_ERROR_STOP
  U_CONV_ERROR_SKIP
  U_CONV_ERROR_SUBST
  U_CONV_ERROR_ESCAPE_UNICODE
  U_CONV_ERROR_ESCAPE_ICU
  U_CONV_ERROR_ESCAPE_JAVA
  U_CONV_ERROR_ESCAPE_XML_DEC
  U_CONV_ERROR_ESCAPE_XML_HEX

The substitution character can be set only for the FROM_UNICODE direction and
has to exist in the target character set.

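The difference between the stop, skip, and substitute behaviors can be seen in
a small C sketch that downconverts code points to ISO-8859-1, where only
U+0000 through U+00FF are representable. The enum values and to_latin1() are
hypothetical stand-ins chosen for this example; they are not the
U_CONV_ERROR_* constants themselves, and the escape modes are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for three of the error modes. */
enum conv_error_mode { CONV_STOP, CONV_SKIP, CONV_SUBST };

/*
 * Convert n code points to ISO-8859-1 bytes. out must have room for n bytes.
 * Returns the number of bytes written. In CONV_STOP mode an unconvertible
 * character aborts the conversion and yields an empty result (0).
 */
size_t to_latin1(const uint32_t *in, size_t n, unsigned char *out,
                 enum conv_error_mode mode, unsigned char subst)
{
    size_t i, len = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (in[i] <= 0xFF) {
            out[len++] = (unsigned char)in[i];   /* direct mapping */
            continue;
        }
        switch (mode) {
        case CONV_STOP:
            return 0;                /* stop and return an empty string */
        case CONV_SKIP:
            break;                   /* drop the character */
        case CONV_SUBST:
            out[len++] = subst;      /* e.g. 0x1A, or a custom character */
            break;
        }
    }
    return len;
}
```

For the input U+0061 U+4E2D U+0062, stop mode produces nothing, skip mode
produces the two bytes "ab", and substitute mode with 0x1A produces three
bytes with 0x1A in the middle.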
Unicode String Type
===================

The Unicode string type (IS_UNICODE) is supposed to contain text data encoded
in UTF-16 format. It is the main string type in PHP when the Unicode semantics
switch is turned on. Unicode strings can exist when the switch is off, but
they have to be produced programmatically, via calls to functions that
return Unicode type.

The operational unit when working with Unicode strings is a code point, not a
code unit or byte. One code point in UTF-16 may consist of 1 or 2 code
units, each of which is a 16-bit word. Working on the code point level is
necessary because doing otherwise would mean offloading the processing of
surrogate pairs onto PHP users, and that is less than desirable.

The repercussions are that one cannot expect code point N to be at offset N in
the Unicode string. Instead, one has to iterate from the beginning of the
string using the U16_FWD() macro until the desired codepoint is reached. This
will be transparent to the end user, who will work only with "character"
offsets.

The codepoint access is one of the primary areas targeted for optimization.

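The forward iteration needed to locate a code point can be sketched in plain
C. The helper utf16_offset_of() is a hypothetical name used for illustration;
it mimics what ICU's U16_FWD_N macro does and treats an unpaired surrogate as
a single code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the code unit offset at which code point number n starts
 * (counting from 0), or len if the string has fewer than n + 1 code points.
 */
size_t utf16_offset_of(const uint16_t *s, size_t len, size_t n)
{
    size_t i = 0;
    assert(s != NULL || len == 0);
    while (n > 0 && i < len) {
        /* A lead surrogate followed by a trail surrogate is one code point. */
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i += 2;
        } else {
            i += 1;
        }
        n--;
    }
    return i;
}
```

For the code units 0x0061 0xD800 0xDC00 0x0062, code point 2 (the letter 'b')
starts at code unit offset 3, not 2. The lookup is linear in the offset, which
is why this access path is a target for optimization.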
Binary String Type
==================

The binary string type (IS_STRING) serves two purposes: backwards compatibility
and representing non-Unicode strings and binary data. When the Unicode
semantics switch is off, it is used for all strings in PHP, the same as in
previous versions. When the switch is on, this type will be used to store text
in other encodings as well as true binary data such as images, PDFs, etc.

Printing binary data to the standard output passes it through as-is, independent
of the output encoding.

Zval Structure Changes
======================

PHP is a type-agnostic language. Its data values are encapsulated in a zval
(Zend value) structure that can change as necessary to accommodate various
types.

  struct _zval_struct {
      /* Variable information */
      union {
          long lval;                /* long value */
          double dval;              /* double value */
          struct {
              char *val;
              int len;
          } str;                    /* string value */
          HashTable *ht;            /* hash table value */
          zend_object_value obj;    /* object value */
      } value;
      zend_uint refcount;
      zend_uchar type;              /* active type */
      zend_uchar is_ref;
  };

The type field determines what is stored in the union, IS_STRING being the only
data type pertinent to this discussion. In the current version, the strings
are binary-safe, but, for all intents and purposes, are assumed to be
composed of 8-bit characters. It is possible to treat the string value as
an opaque type containing arbitrary binary data, and in fact that is how
the mbstring extension uses it, in order to store multibyte strings. However,
many extensions and the Zend engine itself manipulate the string value
directly without regard to its internals. Needless to say, this can lead to
problems.

For the IS_UNICODE type, we need to add another structure to the union:

  union {
      ....
      struct {
          UChar *val;               /* Unicode string value */
          int len;                  /* number of UChar's */
      } ustr;
      ....
  } value;

This cleanly separates the two types of strings and helps preserve backwards
compatibility.

To optimize access to IS_STRING and IS_UNICODE storage at runtime, we need yet
another structure:

  union {
      ....
      struct {                      /* Universal string type */
          zstr val;
          int len;
      } uni;
      ....
  } value;

Where zstr is a union of char*, UChar*, and void*.

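A standalone C sketch of the zstr idea is shown below. It is not the engine's
definition: the UChar typedef stands in for ICU's 16-bit type, and the
demo_type, demo_string, and demo_string_bytes() names are hypothetical. The
point is only that one pointer-sized slot can be viewed as char*, UChar*, or
void*, with a separate type tag saying which view is valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UChar;   /* assumption: stand-in for ICU's UChar */

/* One storage slot, three views of the same pointer. */
typedef union {
    char  *s;   /* binary string data */
    UChar *u;   /* UTF-16 code units */
    void  *v;   /* type-agnostic access, e.g. for freeing or copying */
} zstr;

/* Hypothetical type tags; the engine uses IS_STRING and IS_UNICODE. */
enum demo_type { DEMO_IS_STRING, DEMO_IS_UNICODE };

struct demo_string {
    zstr val;
    int len;              /* bytes for binary, code units for Unicode */
    enum demo_type type;
};

/* Number of bytes occupied by the string data, whatever its type. */
size_t demo_string_bytes(const struct demo_string *str)
{
    assert(str != NULL && str->len >= 0);
    if (str->type == DEMO_IS_UNICODE) {
        return (size_t)str->len * sizeof(UChar);
    }
    return (size_t)str->len;
}
```

Code that only needs to move or release the buffer can go through the void*
member without branching on the type, while code that inspects characters
picks the member that matches the tag.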
Language Modifications
======================

If the Unicode semantics switch is turned on, PHP string literals -
single-quoted, double-quoted, and heredocs - become Unicode strings
(IS_UNICODE type). They support all the same escape sequences and variable
interpolations as previously, with the addition of some new escape sequences.

The contents of the strings are interpreted as follows:

  - all non-escaped characters are interpreted as a corresponding Unicode
    codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) =>
    U+0061, Shift-JIS (0x92 0x86) => U+4E2D

  - existing PHP escape sequences are also interpreted as Unicode codepoints,
    including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020

  - two new escape sequences, \uXXXX and \UXXXXXX, are interpreted as a 4 or
    6-hex Unicode codepoint value, e.g. \u0221 => U+0221, \U010410 =>
    U+10410

  - a new escape sequence allows specifying a character by its full
    Unicode name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20

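The two fixed-width hexadecimal escapes are simple to decode, as the following
C sketch suggests. The function parse_unicode_escape() is a hypothetical
helper written for this document and is not taken from the PHP scanner; it
handles only the 4-digit and 6-digit forms and not the \C{} named form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode a backslash-u (4 hex digits) or backslash-U (6 hex digits) escape
 * that starts at p. On success stores the code point in *cp and returns the
 * number of characters consumed (6 or 8). Returns 0 if p does not start
 * with a complete, valid escape.
 */
size_t parse_unicode_escape(const char *p, uint32_t *cp)
{
    size_t digits, i;
    uint32_t value = 0;

    assert(p != NULL && cp != NULL);
    if (p[0] != '\\') {
        return 0;
    }
    if (p[1] == 'u') {
        digits = 4;
    } else if (p[1] == 'U') {
        digits = 6;
    } else {
        return 0;
    }
    for (i = 0; i < digits; i++) {
        char c = p[2 + i];
        uint32_t d;
        if (c >= '0' && c <= '9') {
            d = (uint32_t)(c - '0');
        } else if (c >= 'a' && c <= 'f') {
            d = (uint32_t)(c - 'a' + 10);
        } else if (c >= 'A' && c <= 'F') {
            d = (uint32_t)(c - 'A' + 10);
        } else {
            return 0;   /* too few digits; also catches the terminating NUL */
        }
        value = (value << 4) | d;
    }
    if (value > 0x10FFFF) {
        return 0;       /* beyond the Unicode code space */
    }
    *cp = value;
    return 2 + digits;
}
```

Six hex digits can express values up to 0xFFFFFF, so the final range check is
needed to reject anything above U+10FFFF.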
The single-quoted string is more restrictive than the other two types: so
far the only escape sequence allowed inside of it was \', which specifies
a literal single quote. However, single-quoted strings now support the new
Unicode character escape sequences as well.

PHP allows variable interpolation inside the double-quoted and heredoc strings.
However, the parser separates the string into literal and variable chunks during
compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that the
literal chunks can be handled in the normal way as far as Unicode
support is concerned.

Since all string literals become Unicode by default, one loses the ability
to specify byte-oriented or binary strings. In order to create binary string
literals, a new syntax is necessary: prefixing a string literal with the letter
'b' creates a binary string.

  $var = b'abc\001';
  $var = b"abc\001";
  $var = b<<<EOD
  abc\001
  EOD;

The binary string literals support the same escape sequences as the current
PHP strings. If the Unicode switch is turned off, then the binary string
literals generate the normal string (IS_STRING) type internally, without any
effect on the application.

The string operators have been changed to accommodate the new IS_UNICODE and
binary (IS_STRING) types. In more detail:

  - The concatenation (.) operator has been changed to automatically coerce
    IS_STRING type to the more precise IS_UNICODE if its operands are of two
    different string types.

  - The concatenation assignment operator (.=) has been changed similarly.

  - The string indexing operator [] has been changed to accommodate IS_UNICODE
    type strings and extract the specified character. Note that the index
    specifies a code point, not a byte or a code unit, thus supporting
    supplementary characters.

  - Both Unicode and binary string types can be used as array keys. If the
    Unicode switch is on, the binary keys are converted to Unicode.

  - Bitwise operators and increment/decrement operators do not work on
    Unicode strings. They do work on binary strings.

  - Two new casting operators are introduced, (unicode) and (binary). The
    (string) operator will cast to Unicode type if the Unicode semantics
    switch is on, and to binary type otherwise.

  - The comparison operators, when applied to Unicode strings, perform
    comparison in binary code point order. They also do appropriate coercion
    if the strings are of differing types.

  - The arithmetic operators use the same semantics as today for converting
    strings to numbers. A Unicode string is considered numeric if it
    represents a long or a double number in en_US_POSIX locale.

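Comparison in code point order deserves a short illustration, because
comparing raw UTF-16 code units gives a different order: surrogates
(0xD800..0xDFFF) sort below the BMP range U+E000..U+FFFF even though they
encode larger code points. The C sketch below applies a common fix-up to the
first differing code units. The function names are hypothetical, well-formed
UTF-16 input is assumed, and ICU ships its own routines for this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Remap a code unit so that surrogates sort above all other BMP units. */
uint16_t utf16_fixup(uint16_t u)
{
    if (u >= 0xE000) {
        return (uint16_t)(u - 0x800);    /* E000..FFFF -> D800..F7FF */
    }
    if (u >= 0xD800) {
        return (uint16_t)(u + 0x2000);   /* D800..DFFF -> F800..FFFF */
    }
    return u;
}

/* Compare two UTF-16 strings in code point order: <0, 0, or >0. */
int utf16_cmp_code_point_order(const uint16_t *a, size_t alen,
                               const uint16_t *b, size_t blen)
{
    size_t i, n = alen < blen ? alen : blen;
    assert(a != NULL && b != NULL);
    for (i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            return utf16_fixup(a[i]) < utf16_fixup(b[i]) ? -1 : 1;
        }
    }
    if (alen == blen) {
        return 0;
    }
    return alen < blen ? -1 : 1;         /* the shorter prefix sorts first */
}
```

With this fix-up, U+FF61 (code unit 0xFF61) compares less than U+10000
(code units 0xD800 0xDC00), whereas a plain code unit comparison would order
them the other way round.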
Inline HTML
===========
Because inline HTML blocks are intermixed with PHP ones, they are also
written in the script encoding. PHP transcodes the HTML blocks to the output
encoding as needed, resulting in direct passthrough if the script encoding
matches the output encoding.

Identifiers
===========
Considering that scripts may be written in various encodings, we do not
restrict identifiers to be ASCII-only. PHP allows any valid identifier based
on the Unicode Standard Annex #31. The identifiers are case folded when
necessary (class and function names) and converted to normalization form
NFKC, so that two identifiers written in two compatible ways refer to the
same thing.

Numbers
=======
Unlike identifiers, we restrict numbers to consist only of ASCII digits and
do not interpret them as written in a specific locale. The numbers are
expected to adhere to the en_US_POSIX or C locale, i.e. having no thousands
separator and using (.) "full stop" as the fractional separator. Numeric
strings are supposed to adhere to the same rules, i.e. "10,3" is not
interpreted as a number even if the current locale's fractional separator is
a comma.

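A locale-independent check of this kind can be sketched in C as follows. The
function is_c_locale_number() is a hypothetical, deliberately simplified
helper: it accepts an optional sign, ASCII digits, and at most one full stop,
and it ignores exponents and surrounding whitespace, which a real numeric
string check would also need to handle.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if s consists entirely of a decimal number written with ASCII
 * digits and an optional '.' fractional separator, 0 otherwise. The current
 * locale is never consulted.
 */
int is_c_locale_number(const char *s)
{
    size_t digits = 0;

    assert(s != NULL);
    if (*s == '+' || *s == '-') {
        s++;
    }
    while (*s >= '0' && *s <= '9') {
        s++;
        digits++;
    }
    if (*s == '.') {
        s++;
        while (*s >= '0' && *s <= '9') {
            s++;
            digits++;
        }
    }
    /* At least one digit, and nothing left over such as ',' */
    return digits > 0 && *s == '\0';
}
```

Because the comparison is against literal ASCII characters, "10.3" is accepted
and "10,3" is rejected regardless of what setlocale() has been called with.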
Parameter Parsing API Modifications
===================================

Internal PHP functions largely use the zend_parse_parameters() API in order to
obtain the parameters passed to them by the user. For example:

  char *str;
  int len;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &str, &len) == FAILURE) {
      return;
  }

This forces the input parameter to be a string, and its value and length are
stored in the variables specified by the caller.

There are now five new specifiers: 'u', 't', 'T', 'U', and 'S'.

't' specifier
-------------
This specifier indicates that the caller requires the incoming parameter to be
string data (IS_STRING, IS_UNICODE). The caller has to provide the storage for
string value, length, and type.

  void *str;
  int len;
  zend_uchar type;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "t", &str, &len, &type) == FAILURE) {
      return;
  }
  if (type == IS_UNICODE) {
      /* process Unicode string */
  } else {
      /* process binary string */
  }

For IS_STRING type, the length represents the number of bytes, and for
IS_UNICODE the number of UChar's. When converting other types (numbers,
booleans, etc) to strings, the exact behavior depends on the Unicode semantics
switch: if on, they are converted to IS_UNICODE, otherwise to IS_STRING.

'u' specifier
-------------
This specifier indicates that the caller requires the incoming parameter
to be a Unicode encoded string. If a non-Unicode string is passed, the engine
creates a copy of the string and automatically converts it to Unicode type
before passing it to the internal function. No such conversion is necessary
for Unicode strings, obviously.

  UChar *str;
  int len;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "u", &str, &len) == FAILURE) {
      return;
  }
  /* process Unicode string */

'T' specifier
-------------
This specifier is useful when the function takes two or more strings and
operates on them. Using the 't' specifier for each one would be somewhat
problematic if the passed-in strings are of mixed types, and multiple
checks need to be performed in order to do anything. All parameters
marked by the 'T' specifier are promoted to the same type.

If at least one of the 'T' parameters is of Unicode type, then the rest of
them are converted to IS_UNICODE. Otherwise all 'T' parameters are converted to
IS_STRING type.

  void *str1, *str2;
  int len1, len2;
  zend_uchar type1, type2;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "TT", &str1, &len1,
                            &type1, &str2, &len2, &type2) == FAILURE) {
      return;
  }
  if (type1 == IS_UNICODE) {
      /* process as Unicode, str2 is guaranteed to be Unicode as well */
  } else {
      /* process as binary string, str2 is guaranteed to be the same */
  }

The existing 's' specifier has been modified as well. If a Unicode string is
passed in, it automatically copies and converts the string to the runtime
encoding, and issues a warning. If a binary type is passed in, no conversion
is necessary.

The 'U' and 'S' specifiers are similar to 'u' and 's' but they are more strict
about the type of the passed-in parameter. If 'U' is specified and a binary
string is passed in, the engine will issue a warning instead of doing automatic
conversion. The converse applies to the 'S' specifier.

Upgrading Existing Functions
============================

Upgrading functions to work with new data types will be a deliberate and
involved process, because one needs to consider not only the mechanisms for
processing Unicode characters, for example, but also the semantics of
the function.

The main tenet of the upgrade process should be that when processing Unicode
strings, the unit of operation is a code point, not a code unit or a byte.
For example, strlen() returns the number of code points in the string.

  strlen('abc') = 3
  strlen('ab\U010000') = 3
  strlen('ab\uD800\uDC00') = 3 /* not 4 */

Function upgrade guidelines are available in a separate document.

Document TODO
=============
  - Streams support for Unicode - What stream filters will be provided?
  - User conversion error handler
  - INI files encoding - UTF-8? Do we support BOMs?
  - There are likely to be other issues which are missing from this document

Build System
============

Unicode support in PHP is always enabled. The only configuration option
during development should be the location of the ICU headers and libraries.

  --with-icu-dir=<dir>     <dir> parameter specifies the location of ICU
                           header and library files.

After the initial development we have to repackage the ICU library for our
needs and bundle it with PHP.

Document History
================
  0.6: Remove notion of native encoding string, only 2 string types are used
       now. Update conversion error behavior section and parameter parsing.
       Bring the document up-to-date with reality in general.

  0.5: Updated per latest discussions. Removed tentative language in several
       places, since we have decided on everything described here already.
       Clarified details according to Phase II progress.

  0.4: Updated to include all the latest discussions. Updated development
       phases.

  0.3: Updated to include all the latest discussions.

  0.2: Updated Phase I design proposal per discussion on unicode@php.net.
       Modified Internal Encoding section to contain only UTF-16 info.
       Expanded Script Encoding section.
       Added Binary Data Type section.
       Amended Language Modifications section to describe string literals
       behavior.
       Amended Build System section.

  0.1: Phase I design proposal

References
==========

  Unicode
    http://www.unicode.org

  Unicode Glossary
    http://www.unicode.org/glossary/

  UTF-8
    http://www.utf-8.com/

  UTF-16
    http://www.ietf.org/rfc/rfc2781.txt

  ICU Homepage
    http://www.ibm.com/software/globalization/icu/

  ICU User Guide and API Reference
    http://icu.sourceforge.net/

  Unicode Annex #31
    http://www.unicode.org/reports/tr31/

  PHP Parameter Parsing API
    http://www.php.net/manual/en/zend.arguments.retrieval.php

Authors
=======
  Andrei Zmievski <andrei@gravitonic.com>

vim: set et :