Audience
========

This README describes how PHP 6 provides native support for the Unicode
Standard. Readers of this document should be proficient with PHP and have a
basic understanding of Unicode concepts. For more technical details about
PHP 6 design principles and for guidelines about writing Unicode-ready PHP
extensions, refer to README.UNICODE-UPGRADES.

Introduction
============

As successful as PHP has proven to be over the years, its support for
multilingual and multinational environments has languished. PHP can no
longer afford to remain outside the overall movement towards the Unicode
standard. Although recent updates involving the mbstring extension have
enabled easier multibyte data processing, this does not constitute native
Unicode support.

Since the full implementation of the Unicode Standard is very involved, our
approach is to speed up implementation by using the well-tested,
full-featured, and freely available ICU (International Components for
Unicode) library.


General Remarks
===============

International Components for Unicode
------------------------------------

ICU (International Components for Unicode) is a mature, widely used set of
C/C++ and Java libraries for Unicode support, software internationalization,
and globalization. It provides:

  - Encoding conversions
  - Collations
  - Unicode text processing
  - and much more

When building PHP 6, Unicode support is always enabled. The only
configuration option during development should be the location of the ICU
headers and libraries.

    --with-icu-dir=<dir>

where <dir> specifies the location of ICU header and library files. If you do
not specify this option, PHP attempts to find ICU under /usr and /usr/local.

NOTE: ICU is not bundled with PHP 6 yet. To download the distribution, visit
http://icu.sourceforge.net. PHP requires ICU version 3.4 or higher.

Backwards Compatibility
-----------------------
Our paramount concern for providing Unicode support is backwards compatibility.
Because PHP is used on so many sites, existing data types and functions must
work as they always have. However, although PHP's interfaces must remain
backwards-compatible, the speed of certain operations might be affected due to
internal implementation changes.

Encoding Names
--------------
All the encoding settings discussed in this document can accept any valid
encoding name supported by ICU. For a full list of encodings, refer to the ICU
online documentation.

NOTE: References to "Unicode" in this document generally mean the UTF-16
character encoding, unless explicitly stated otherwise.

Unicode Semantics Switch
========================

Because many applications do not require Unicode, PHP 6 provides a server-wide
INI setting to enable Unicode support:

    unicode.semantics = On/Off

This switch is off by default. If your applications do not require native
Unicode support, you may leave this switch off, and continue to use Unicode
strings only when you need to.

However, if your application is ready to fully support Unicode, you should
turn this switch on. This activates various Unicode support mechanisms,
including:

  * All string literals become Unicode
  * All variables received from HTTP requests become Unicode
  * PHP identifiers may use Unicode characters

More fundamentally, your PHP environment is now a Unicode environment. Strings
inside PHP are Unicode, and the system is responsible for converting non-Unicode
strings on PHP's periphery (for example, in HTTP input and output, streams, and
filesystem operations). With unicode.semantics on, you must specify binary
strings explicitly. PHP makes no assumptions about the content of a binary
string, so your application must handle all binary strings appropriately.

Conversely, if unicode.semantics is off, PHP behaves as it did in the past.
String literals do not become Unicode, and files are binary strings for
backwards compatibility. You can always create Unicode strings programmatically,
and all functions and operators support Unicode strings transparently.


Fallback Encoding
=================

The fallback encoding provides a default value for all other unicode.*_encoding
INI settings. If you do not set a particular unicode.*_encoding setting, PHP
uses the fallback encoding. If you do not specify a fallback encoding, PHP uses
UTF-8.

    unicode.fallback_encoding = "iso-8859-1"


Runtime Encoding
================

The runtime encoding specifies the encoding PHP uses for converting binary
strings within the PHP engine itself.

    unicode.runtime_encoding = "iso-8859-1"

This setting has no effect on I/O-related operations such as writing to
standard out, reading from the filesystem, or decoding HTTP input variables.

PHP enables you to explicitly convert strings using casting:

  * (binary)  -- casts to binary string type
  * (unicode) -- casts to Unicode string type
  * (string)  -- casts to Unicode string type if unicode.semantics is on,
                 to binary otherwise

For example, if unicode.runtime_encoding is iso-8859-1, and $uni is a Unicode
string, then

    $str = (binary)$uni;

creates a binary string $str in the ISO-8859-1 encoding.

Implicit conversions include concatenation, comparison, and parameter passing.
For better precision, PHP attempts to convert strings to Unicode before
performing these sorts of operations. For example, if we concatenate our binary
string $str with a Unicode literal, PHP converts $str to Unicode first, using
the encoding specified by unicode.runtime_encoding.

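To make the byte-level effect of these conversions concrete, the following is
a minimal sketch that runs on released PHP versions. It does not use the
(binary) and (unicode) casts. Instead it assumes the mbstring extension is
loaded and uses mb_convert_encoding() as a stand-in, with UTF-8 bytes playing
the role of the Unicode string and ISO-8859-1 playing the role of
unicode.runtime_encoding.

```php
<?php
// Assumption: mbstring is available. UTF-8 stands in for the Unicode type.
$uni = "caf\xC3\xA9";                         // "cafe" with e-acute, as UTF-8 bytes

// Analogue of $str = (binary)$uni with a runtime encoding of ISO-8859-1.
$str = mb_convert_encoding($uni, 'ISO-8859-1', 'UTF-8');
echo bin2hex($str), "\n";                     // 636166e9

// Analogue of the implicit rule: upconvert the legacy operand first,
// then concatenate.
$joined = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1') . "!";
echo bin2hex($joined), "\n";                  // 636166c3a921
```

The e-acute occupies two bytes in UTF-8 and one byte in ISO-8859-1, which is
why the converted string is one byte shorter.
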
Output Encoding
===============

PHP automatically converts output for commands that write to the standard
output stream, such as 'print' and 'echo'.

    unicode.output_encoding = "utf-8"

However, PHP does not convert binary strings. When writing to files or external
resources, you must rely on stream encoding features or manually encode the data
using functions provided by the unicode extension.

The existing default_charset INI setting is DEPRECATED in favor of
unicode.output_encoding. Previously, default_charset only specified the charset
portion of the Content-Type MIME header. Now default_charset only takes effect
when unicode.semantics is off, and it does not affect the actual transcoding of
the output stream. Setting unicode.output_encoding causes PHP to add the
'charset' portion to the Content-Type header, overriding any value set for
default_charset.


HTTP Input Encoding
===================

The HTTP input encoding specifies the encoding of variables received via
HTTP, such as the contents of the $_GET and $_POST arrays.

This functionality is currently under development. For a discussion of the
approach that the PHP 6 team is taking, refer to:

    http://marc.theaimsgroup.com/?t=116613047300005&r=1&w=2


Filesystem Encoding
===================

The filesystem encoding specifies the encoding of file and directory names
on the filesystem.

    unicode.filename_encoding = "utf-8"

Filesystem-related functions such as opendir() perform this conversion when
accepting and returning file names. You should set the filename encoding to
the encoding used by your filesystem.


Script Encoding
===============

You may write PHP scripts in any encoding supported by ICU. To specify the
script encoding site-wide, use the INI setting:

    unicode.script_encoding = utf-8

If you cannot change the encoding site-wide, you can use a pragma to
override the INI setting in a local script:

    <?php declare(encoding = 'Shift-JIS'); ?>

The pragma setting must be the first statement in the script. It only affects
the script in which it occurs, and does not propagate to any included files.


INI Files
=========

If unicode.semantics is on, INI files are presumed to contain UTF-8 encoded
keys and values. If unicode.semantics is off, the data is taken as-is,
similar to PHP 5. No validation occurs during parsing. Instead, invalid UTF-8
sequences are caught during access by ini_*() functions.


Stream I/O
==========

PHP has a streams-based I/O system for generalized filesystem access,
networking, data compression, and other operations. Since the data on the
other end of the stream can be in any encoding, you need to think about
data conversion.

By default, PHP opens streams in binary mode. To open a file in text mode,
where conversions apply, you must use the 't' flag (or the FILE_TEXT
parameter -- see below). The default encoding for streams in text mode is
UTF-8. This means that if 'file.txt' is a UTF-8 text file, this code snippet:

    $fp = fopen('file.txt', 'rt');
    $str = fread($fp, 100);

reads up to 100 Unicode characters, while:

    $fp = fopen('file.txt', 'wt');
    fwrite($fp, $uni);

writes to a UTF-8 text file.

If you mainly work with files in an encoding other than UTF-8, you can
change the default context encoding setting:

    stream_default_encoding('Shift-JIS');
    $data = file_get_contents('file.txt', FILE_TEXT);
    // work on $data
    file_put_contents('file.txt', $data, FILE_TEXT);

The file_get_contents() and file_put_contents() functions now accept an
additional parameter, FILE_TEXT. If you provide FILE_TEXT for
file_get_contents(), PHP returns a Unicode string. Without FILE_TEXT, PHP
returns a binary string (which would be appropriate for true binary data, such
as an image file). When writing a Unicode string with file_put_contents(), you
must supply the FILE_TEXT parameter, or PHP generates a warning.

If you need to work with multiple encodings, you can create custom contexts
using stream_context_create() and then pass in the custom context as an
additional parameter. For example:

    $ctx = stream_context_create(NULL, array('encoding' => 'big5'));
    $data = file_get_contents('file.txt', FILE_TEXT, $ctx);
    // work on $data
    file_put_contents('file.txt', $data, FILE_TEXT, $ctx);


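The difference between binary mode and text mode can be sketched on a released
PHP version by doing the decoding step by hand. This is only an illustration
under stated assumptions: it reads the file in binary mode and then calls
mb_convert_encoding() (mbstring extension), which approximates what a text-mode
stream with a Shift-JIS encoding is described as doing implicitly. The temp
file and variable names are made up for the example.

```php
<?php
// Assumption: mbstring is available; the file is a throwaway temp file.
$path = tempnam(sys_get_temp_dir(), 'enc');

// Two kanji (U+65E5 U+672C) as Shift-JIS bytes, written without conversion.
file_put_contents($path, "\x93\xFA\x96\x7B");

// Binary read: the bytes come back exactly as stored.
$raw = file_get_contents($path);

// Explicit decode, standing in for the implicit text-mode conversion.
$text = mb_convert_encoding($raw, 'UTF-8', 'SJIS');

echo strlen($raw), " bytes, ", mb_strlen($text, 'UTF-8'), " characters\n";
// 4 bytes, 2 characters

unlink($path);
```
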
Conversion Semantics and Error Handling
=======================================

PHP can convert strings explicitly (casting) and implicitly (concatenation,
comparison, and parameter passing). For example, when concatenating a Unicode
string and a binary string, PHP converts the binary string to Unicode for better
precision.

However, not all characters can be converted between Unicode and legacy
encodings. The first possibility is that a string contains corrupt data or
an illegal byte sequence. In this case, the converter simply stops with
a message that resembles:

    Warning: Could not convert binary string to Unicode string
    (converter UTF-8 failed on bytes (0xE9) at offset 2)

Conversely, if a similar error occurs when attempting to convert Unicode to
a legacy string, the converter generates a message that resembles:

    Warning: Could not convert Unicode string to binary string
    (converter ISO-8859-1 failed on character {U+DC00} at offset 2)

To customize this behavior, refer to "Creating a Custom Error Handler" below.

The second possibility is that a Unicode character simply cannot be represented
in the legacy encoding. By default, when downconverting from Unicode, the
converter substitutes any missing sequences with the appropriate substitution
sequence for that codepage, such as 0x1A (Control-Z) in ISO-8859-1. When
upconverting to Unicode, the converter replaces any byte sequence that has no
Unicode equivalent with the Unicode substitution character (U+FFFD).

You can customize the conversion error behavior to:

  - stop the conversion and return an empty string
  - skip any invalid characters
  - substitute invalid characters with a custom substitution character
  - escape the invalid character in various formats

To control the global conversion error settings, use the functions:

    unicode_set_error_mode(int direction, int mode)
    unicode_set_subst_char(unicode char)

where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of these
constants:

    U_CONV_ERROR_STOP
    U_CONV_ERROR_SKIP
    U_CONV_ERROR_SUBST
    U_CONV_ERROR_ESCAPE_UNICODE
    U_CONV_ERROR_ESCAPE_ICU
    U_CONV_ERROR_ESCAPE_JAVA
    U_CONV_ERROR_ESCAPE_XML_DEC
    U_CONV_ERROR_ESCAPE_XML_HEX

As an example, with a runtime encoding of ISO-8859-1, the conversion:

    $str = (binary)"< \u30AB >";

results in:

    MODE                    RESULT
    --------------------------------------
    stop                    ""
    skip                    "<  >"
    substitute              "< ? >"
    escape (Unicode)        "< {U+30AB} >"
    escape (ICU)            "< %U30AB >"
    escape (Java)           "< \u30AB >"
    escape (XML decimal)    "< &#12459; >"
    escape (XML hex)        "< &#x30AB; >"

With a runtime encoding of UTF-8, the conversion of the (illegal) sequence:

    $str = (unicode)b"< \xe9\xfe >";

results in:

    MODE                    RESULT
    --------------------------------------
    stop                    ""
    skip                    ""
    substitute              ""
    escape (Unicode)        "< %XE9%XFE >"
    escape (ICU)            "< %XE9%XFE >"
    escape (Java)           "< \xE9\xFE >"
    escape (XML decimal)    "< &#233;&#254; >"
    escape (XML hex)        "< &#xE9;&#xFE; >"

The substitution character can be set only for FROM_UNICODE direction and has to
exist in the target character set. The default substitution character is (?).

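The "substitute" and "skip" rows of the first table can be reproduced on a
released PHP version with the mbstring extension. This is a sketch by analogy,
not the PHP 6 API: mb_substitute_character() stands in for
unicode_set_error_mode() and unicode_set_subst_char(), and UTF-8 bytes stand
in for the Unicode string.

```php
<?php
// Assumption: mbstring is available.
$uni = "< \xE3\x82\xAB >";                  // "<", U+30AB, ">" as UTF-8 bytes

$previous = mb_substitute_character();      // remember the current setting

mb_substitute_character(0x3F);              // substitute with "?"
$subst = mb_convert_encoding($uni, 'ISO-8859-1', 'UTF-8');

mb_substitute_character('none');            // skip unconvertible characters
$skip = mb_convert_encoding($uni, 'ISO-8859-1', 'UTF-8');

mb_substitute_character($previous);         // restore the earlier setting

var_dump($subst);                           // string(5) "< ? >"
var_dump($skip);                            // string(4) "<  >"
```

Saving and restoring the previous setting matters because, like the global
error mode described above, the mbstring substitution setting applies to every
later conversion in the request.
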
NOTE: Casting is just a shortcut for using unicode.runtime_encoding. To convert
using an alternative encoding, use the unicode_encode() and unicode_decode()
functions. For example,

    $str = unicode_encode($uni, 'koi8-r', U_CONV_ERROR_SUBST);

results in a binary KOI8-R encoded string.

Creating a Custom Error Handler
-------------------------------
If an error occurs during the conversion, PHP outputs a warning describing the
problem. Instead of this default behavior, PHP can invoke a user-provided error
handler, similar to how the current user-defined error handler works. To set
the custom conversion error handler, call:

    mixed unicode_set_error_handler(callback error_handler)

The function returns the previously defined custom error handler. If no error
handler was defined, or if an error occurs when returning the handler, this
function returns NULL.

When the custom handler is set, the standard error handler is bypassed. It is
the responsibility of the custom handler to output or log any messages, raise
exceptions, or die(), if necessary. However, if the custom error handler returns
FALSE, the standard handler will be invoked afterwards.

The user function specified as the error_handler must accept five parameters:

    mixed error_handler($direction, $encoding, $char_or_byte, $offset,
                        $message)

where:

    $direction    - the direction of conversion, FROM_UNICODE/TO_UNICODE

    $encoding     - the name of the encoding to/from which the conversion
                    was attempted

    $char_or_byte - either Unicode character or byte sequence (depending
                    on direction) which caused the error

    $offset       - the offset of the failed character/byte sequence in
                    the source string

    $message      - the error message describing the problem

NOTE: If the error mode set by unicode_set_error_mode() is substitute,
skip, or escape, the handler won't be called, since these are non-error
causing operations. To always invoke your handler, set the error mode to
U_CONV_ERROR_STOP.


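The shape of such a handler can be sketched as follows. The function name
record_conversion_error and the $conversion_errors array are hypothetical, and
the registration calls are shown only as comments, since they depend on a build
that provides the functions described above. The handler is invoked by hand
with made-up arguments so that the sketch runs on its own.

```php
<?php
// Hypothetical handler using the five-parameter signature described above.
// It records each failure and returns FALSE so that, per the text, the
// standard handler would still be invoked afterwards.
function record_conversion_error($direction, $encoding, $char_or_byte,
                                 $offset, $message)
{
    global $conversion_errors;
    $conversion_errors[] = sprintf('%s (%s, offset %d)',
                                   $message, $encoding, $offset);
    return false;
}

$conversion_errors = array();

// On a build that provides these functions, registration would look like:
//     unicode_set_error_mode(FROM_UNICODE, U_CONV_ERROR_STOP);
//     unicode_set_error_handler('record_conversion_error');

// Simulated call; the string 'FROM_UNICODE' is a placeholder for the constant.
$result = record_conversion_error('FROM_UNICODE', 'ISO-8859-1', 'X', 2,
    'Could not convert Unicode string to binary string');

print_r($conversion_errors);
```
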
Unicode String Type
===================

The Unicode string type (IS_UNICODE) contains text data encoded in UTF-16.
This is the main string type in PHP when the Unicode semantics switch is
turned on. Unicode strings can exist when the switch is off, but they have to be
produced programmatically via calls to functions that return Unicode types.


Binary String Type
==================

The binary string type (IS_STRING) serves two purposes: backwards compatibility
and representing non-Unicode strings and binary data. When the Unicode semantics
switch is off, it is used for all strings in PHP, the same as in previous
versions. When the switch is on, this type stores text in other encodings as
well as true binary data such as images and PDFs.

Printing binary data to the standard output passes it through as-is, independent
of the output encoding.

For examples of specifying binary string literals, refer to the section
"Language Modifications".

Language Modifications
======================

If the Unicode semantics switch is turned on, PHP string literals --
single-quoted, double-quoted, and heredocs -- become Unicode strings (IS_UNICODE
type). String literals support all the same escape sequences and variable
interpolations as before, plus several new escape sequences.

PHP interprets the contents of strings as follows:

  - all non-escaped characters are interpreted as a corresponding Unicode
    codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) =>
    U+0061, Shift-JIS (0x92 0x86) => U+4E2D

  - existing PHP escape sequences are also interpreted as Unicode codepoints,
    including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020

  - two new escape sequences, \uXXXX and \UXXXXXX, are interpreted as a
    4-digit or 6-digit hexadecimal Unicode codepoint value, e.g.
    \u0221 => U+0221, \U010410 => U+10410. (Having two sequences avoids the
    ambiguity of \u020608 -- is that supposed to be U+0206 followed by "08",
    or U+020608?)

  - a new escape sequence allows specifying a character by its full
    Unicode name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20

PHP allows variable interpolation inside the double-quoted and heredoc strings.
However, the parser separates the string into literal and variable chunks during
compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that PHP
can handle literal chunks in the normal way as far as Unicode support is
concerned.

Since all string literals become Unicode by default, PHP 6 introduces new syntax
for creating byte-oriented or binary strings. Prefixing a string literal with
the letter 'b' creates a binary string:

    $var = b'abc\001';
    $var = b"abc\001";
    $var = b<<<EOD
    abc\001
    EOD;

The content of a binary string is the literal byte sequence inside the
delimiters, which depends on the script encoding (unicode.script_encoding).
Binary string literals support the same escape sequences as PHP 5 strings. If
the Unicode switch is turned off, then the binary string literals generate the
normal string (IS_STRING) type internally without any effect on the application.

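Because the b prefix is also accepted by released PHP versions, where it does
not change the resulting string type, the escape handling of the three literal
forms can be checked directly. The sketch below only illustrates that point;
the variable names are arbitrary.

```php
<?php
$single  = b'abc\001';     // single quotes: the backslash sequence stays literal
$double  = b"abc\001";     // double quotes: \001 becomes one byte with value 1
$heredoc = b<<<EOD
abc\001
EOD;

var_dump(strlen($single));        // int(7)
var_dump(strlen($double));        // int(4)
var_dump($heredoc === $double);   // bool(true)
```

The single-quoted form is three bytes longer because the four characters of
the \001 sequence are stored as-is rather than collapsed into one byte.
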
The string operators now accommodate the IS_UNICODE and IS_STRING types:

  - The concatenation operator (.) and concatenation assignment operator (.=)
    automatically coerce the IS_STRING type to the more precise IS_UNICODE if
    the operands are of different string types.

  - The string indexing operator [] now accommodates IS_UNICODE type strings
    and extracts the specified character. To support supplementary characters,
    the index specifies a code point, not a byte or a code unit.

  - Bitwise operators and increment/decrement operators do not work on
    Unicode strings. They do work on binary strings.

  - Two new casting operators are introduced, (unicode) and (binary). The
    (string) operator casts to Unicode type if the Unicode semantics switch is
    on, and to binary type otherwise.

  - The comparison operators compare Unicode strings in binary code point
    order. They also coerce strings to Unicode if the strings are of different
    types.

  - The arithmetic operators use the same semantics as today for converting
    strings to numbers. A Unicode string is considered numeric if it
    represents a long or a double number in the en_US_POSIX locale.


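The distinction between bytes, code units, and code points in the indexing
rule can be illustrated with a supplementary character. This sketch assumes
the mbstring extension on a released PHP version and uses UTF-8 bytes in place
of a Unicode string; mb_substr() stands in for the [] operator on IS_UNICODE
strings.

```php
<?php
// Assumption: mbstring is available.
$s = "a\xF0\x90\x90\x90b";     // "a", U+10410 (supplementary), "b" as UTF-8

$bytes      = strlen($s);                                                // 6
$codeUnits  = strlen(mb_convert_encoding($s, 'UTF-16BE', 'UTF-8')) / 2;  // 4
$codePoints = mb_strlen($s, 'UTF-8');                                    // 3

// Indexing by code point returns the whole supplementary character,
// not half of its UTF-16 surrogate pair.
$second = mb_substr($s, 1, 1, 'UTF-8');
var_dump(bin2hex($second));    // string(8) "f0909090"
```
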
Unicode Support in Existing Functions
=====================================

All functions in the PHP default distribution are undergoing analysis to
determine which functions need to be upgraded for native Unicode support.
You can track progress here:

    http://www.php.net/~scoates/unicode/render_func_data.php

Key extensions that are fully converted include:

  * curl
  * dom
  * json
  * mysql
  * mysqli
  * oci8
  * pcre
  * reflection
  * simplexml
  * soap
  * sqlite
  * xml
  * xmlreader/xmlwriter
  * xsl
  * zlib

NOTE: Unsafe functions might still work, since PHP performs Unicode conversions
at runtime. However, unsafe functions might not work correctly with multibyte
binary strings, or Unicode characters that are not representable in the
specified unicode.runtime_encoding.


Identifiers
===========

Since scripts may be written in various encodings, we do not restrict
identifiers to be ASCII-only. PHP allows any valid identifier based
on the Unicode Standard Annex #31.


Numbers
=======

Unlike identifiers, numbers must consist only of ASCII digits, and are
restricted to the en_US_POSIX or C locale. In other words, numbers have no
thousands separator, and the fractional separator is (.) "full stop". Numeric
strings adhere to the same rules, so "10,3" is not interpreted as a number even
if the current locale's fractional separator is a comma.

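The numeric-string rule can be checked with is_numeric(), which in released
PHP versions already accepts only ASCII digits with a full stop as the
fractional separator. A short sketch:

```php
<?php
var_dump(is_numeric("10.3"));     // bool(true)
var_dump(is_numeric("10,3"));     // bool(false): comma is not a separator
var_dump(is_numeric("1,000"));    // bool(false): no thousands separator
```
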
TextIterators
=============

Instead of using the offset operator [] to access characters in a linear
fashion, use a TextIterator. TextIterator is very fast and enables you
to iterate over code points, combining sequences, characters, words, lines, and
sentences, both forward and backward. For example:

    $text = "nai\u0308ve";
    foreach (new TextIterator($text) as $u) {
        var_inspect($u);
    }

lists six code points, including the umlaut (U+0308) as a separate code point.
Instantiating the TextIterator to iterate over characters,

    $text = "nai\u0308ve";
    foreach (new TextIterator($text, TextIterator::CHARACTER) as $u) {
        var_inspect($u);
    }

lists five characters, including an "i" with an umlaut as a single character.

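The six-versus-five count can be reproduced without TextIterator on a released
PHP version. This is a sketch by analogy, assuming the PCRE extension is built
with UTF-8 support: splitting on the empty pattern with the /u modifier yields
code points, and \X matches a base character together with the combining marks
that follow it.

```php
<?php
// Assumption: PCRE with UTF-8 support. UTF-8 bytes stand in for Unicode.
$text = "nai\xCC\x88ve";       // "naive" with a combining U+0308 after the "i"

// Code points: split between every UTF-8 character.
$codePoints = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

// Characters: each \X match is a base character plus its combining marks.
preg_match_all('/\X/u', $text, $m);
$characters = $m[0];

echo count($codePoints), " code points, ", count($characters), " characters\n";
// 6 code points, 5 characters
```
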
Locales
=======

Unicode support in PHP relies exclusively on ICU locales, NOT the POSIX locales
installed on the system. You may set and retrieve the default ICU locale using:

    locale_set_default()
    locale_get_default()

ICU locale IDs have a somewhat different format from POSIX locale IDs. The ICU
syntax is:

    <language>[_<script>]_<country>[_<variant>][@<keywords>]

For example, sr_Latn_YU_REVISED@currency=USD is Serbian (Latin, Yugoslavia,
Revised Orthography, Currency=US Dollar).

Do not use the deprecated setlocale() function. This function interacts with the
POSIX locale. If Unicode semantics are on, using setlocale() generates
a deprecation warning.

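To show how the locale ID syntax decomposes, here is a small sketch that splits
an ID with a regular expression. The function name parse_icu_locale_id is
hypothetical, and the pattern is a simplification that assumes a 2-3 letter
lowercase language, a 4-letter script, and a 2-letter uppercase country. It is
not a substitute for ICU's own parsing.

```php
<?php
// Simplified pattern for <language>[_<script>]_<country>[_<variant>][@<keywords>].
function parse_icu_locale_id($id)
{
    $pattern = '/^(?<language>[a-z]{2,3})'
             . '(?:_(?<script>[A-Z][a-z]{3}))?'
             . '_(?<country>[A-Z]{2})'
             . '(?:_(?<variant>[A-Z0-9]+))?'
             . '(?:@(?<keywords>.+))?$/';
    if (!preg_match($pattern, $id, $m)) {
        return null;
    }
    $parts = array();
    foreach (array('language', 'script', 'country', 'variant', 'keywords') as $name) {
        // Optional groups that did not take part in the match are missing or empty.
        $parts[$name] = (isset($m[$name]) && $m[$name] !== '') ? $m[$name] : null;
    }
    return $parts;
}

print_r(parse_icu_locale_id('sr_Latn_YU_REVISED@currency=USD'));
```

For the example ID from the text, the parts come out as language "sr", script
"Latn", country "YU", variant "REVISED", and keywords "currency=USD".
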
Document TODO
=============
- Final review.
- Fix the HTTP Input Encoding section, that's obsolete now.


References
==========

Unicode
    http://www.unicode.org

Unicode Glossary
    http://www.unicode.org/glossary/

UTF-8
    http://www.utf-8.com/

UTF-16
    http://www.ietf.org/rfc/rfc2781.txt

ICU Homepage
    http://www.ibm.com/software/globalization/icu/

ICU User Guide and API Reference
    http://icu.sourceforge.net/

Unicode Annex #31
    http://www.unicode.org/reports/tr31/

PHP Parameter Parsing API
    http://www.php.net/manual/en/zend.arguments.retrieval.php


Authors
=======
Andrei Zmievski <andrei@gravitonic.com>
Evan Goer <goer@yahoo-inc.com>

vim: set et tw=80 :