Audience
========

This README describes how PHP 6 provides native support for the Unicode
Standard. Readers of this document should be proficient with PHP and have a
basic understanding of Unicode concepts. For more technical details about
PHP 6 design principles and for guidelines about writing Unicode-ready PHP
extensions, refer to README.UNICODE-UPGRADES.

Introduction
============

As successful as PHP has proven to be over the years, its support for
multilingual and multinational environments has languished. PHP can no longer
afford to remain outside the overall movement towards the Unicode standard.
Although recent updates involving the mbstring extension have enabled easier
multibyte data processing, this does not constitute native Unicode support.

Since the full implementation of the Unicode Standard is very involved, our
approach is to speed up implementation by using the well-tested,
full-featured, and freely available ICU (International Components for
Unicode) library.

General Remarks
===============

International Components for Unicode
------------------------------------

ICU (International Components for Unicode) is a mature, widely used set of
C/C++ and Java libraries for Unicode support, software internationalization
and globalization. It provides:

  - Encoding conversions
  - Collations
  - Unicode text processing
  - and much more

When building PHP 6, Unicode support is always enabled. The only
configuration option during development should be the location of the ICU
headers and libraries.

  --with-icu-dir=DIR

where DIR specifies the location of ICU header and library files. If you do
not specify this option, PHP attempts to find ICU under /usr and /usr/local.

NOTE: ICU is not bundled with PHP 6 yet. To download the distribution, visit
http://icu.sourceforge.net. PHP requires ICU version 3.4 or higher.

Backwards Compatibility
-----------------------

Our paramount concern for providing Unicode support is backwards compatibility.
Because PHP is used on so many sites, existing data types and functions must
work as they always have. However, although PHP's interfaces must remain
backwards-compatible, the speed of certain operations might be affected due
to internal implementation changes.

Encoding Names
--------------

All the encoding settings discussed in this document can accept any valid
encoding name supported by ICU. For a full list of encodings, refer to the
ICU online documentation.

NOTE: References to "Unicode" in this document generally mean the UTF-16
character encoding, unless explicitly stated otherwise.

Unicode Semantics Switch
========================

Because many applications do not require Unicode, PHP 6 provides a
server-wide INI setting to enable Unicode support:

  unicode.semantics = On/Off

This switch is off by default. If your applications do not require native
Unicode support, you may leave this switch off, and continue to use Unicode
strings only when you need to.

However, if your application is ready to fully support Unicode, you should
turn this switch on. This activates various Unicode support mechanisms,
including:

  * All string literals become Unicode
  * All variables received from HTTP requests become Unicode
  * PHP identifiers may use Unicode characters

More fundamentally, your PHP environment is now a Unicode environment.
Strings inside PHP are Unicode, and the system is responsible for converting
non-Unicode strings on PHP's periphery (for example, in HTTP input and
output, streams, and filesystem operations). With unicode.semantics on, you
must specify binary strings explicitly. PHP makes no assumptions about the
content of a binary string, so your application must handle all binary
strings appropriately.

Conversely, if unicode.semantics is off, PHP behaves as it did in the past.
String literals do not become Unicode, and files are binary strings for
backwards compatibility.
You can always create Unicode strings programmatically, and all functions and
operators support Unicode strings transparently.

Fallback Encoding
=================

The fallback encoding provides a default value for all other
unicode.*_encoding INI settings. If you do not set a particular
unicode.*_encoding setting, PHP uses the fallback encoding. If you do not
specify a fallback encoding, PHP uses UTF-8.

  unicode.fallback_encoding = "iso-8859-1"

Runtime Encoding
================

The runtime encoding specifies the encoding PHP uses for converting binary
strings within the PHP engine itself.

  unicode.runtime_encoding = "iso-8859-1"

This setting has no effect on I/O-related operations such as writing to
standard out, reading from the filesystem, or decoding HTTP input variables.

PHP enables you to explicitly convert strings using casting:

  * (binary) -- casts to binary string type
  * (unicode) -- casts to Unicode string type
  * (string) -- casts to Unicode string type if unicode.semantics is on,
    to binary otherwise

For example, if unicode.runtime_encoding is iso-8859-1, and $uni is a unicode
string, then

  $str = (binary)$uni

creates a binary string $str in the ISO-8859-1 encoding.

Implicit conversions include concatenation, comparison, and parameter
passing. For better precision, PHP attempts to convert strings to Unicode
before performing these sorts of operations. For example, if we concatenate
our binary string $str with a unicode literal, PHP converts $str to Unicode
first, using the encoding specified by unicode.runtime_encoding.

Output Encoding
===============

PHP automatically converts output for commands that write to the standard
output stream, such as 'print' and 'echo'.

  unicode.output_encoding = "utf-8"

However, PHP does not convert binary strings. When writing to files or
external resources, you must rely on stream encoding features or manually
encode the data using functions provided by the unicode extension.
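
As a rough sketch of the manual route, the following assumes a PHP 6
development build with unicode.semantics turned on; it will not run on
released PHP versions, which lack the unicode extension. It uses
unicode_encode() as described under "Conversion Semantics and Error
Handling" and relies on streams opening in binary mode as described under
"Stream I/O". The file name out.bin is hypothetical.

```php
<?php
// Sketch only: assumes a PHP 6 development build with unicode.semantics = On.
// The file name below is hypothetical.

// A Unicode string literal (string literals are Unicode when the switch is on).
$uni = "< \u30AB >";

// Encode explicitly to UTF-8 bytes; stop with a warning if conversion fails.
$bytes = unicode_encode($uni, 'utf-8', U_CONV_ERROR_STOP);

// Open the stream in binary mode so the bytes are written without conversion.
$fp = fopen('out.bin', 'wb');
fwrite($fp, $bytes);
fclose($fp);
```

Encoding by hand like this keeps the choice of target encoding next to the
write itself, rather than depending on a stream or INI default.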
The existing default_charset INI setting is DEPRECATED in favor of
unicode.output_encoding. Previously, default_charset only specified the
charset portion of the Content-Type MIME header. Now default_charset only
takes effect when unicode.semantics is off, and it does not affect the actual
transcoding of the output stream. Setting unicode.output_encoding causes PHP
to add the 'charset' portion to the Content-Type header, overriding any value
set for default_charset.

HTTP Input Encoding
===================

The HTTP input encoding specifies the encoding of variables received via
HTTP, such as the contents of the $_GET and $_POST arrays.

This functionality is currently under development. For a discussion of the
approach that the PHP 6 team is taking, refer to:

  http://marc.theaimsgroup.com/?t=116613047300005&r=1&w=2

Filesystem Encoding
===================

The filesystem encoding specifies the encoding of file and directory names on
the filesystem.

  unicode.filename_encoding = "utf-8"

Filesystem-related functions such as opendir() perform this conversion when
accepting and returning file names. You should set the filename encoding to
the encoding used by your filesystem.

Script Encoding
===============

You may write PHP scripts in any encoding supported by ICU. To specify the
script encoding site-wide, use the INI setting:

  unicode.script_encoding = utf-8

If you cannot change the encoding system-wide, you can use a pragma to
override the INI setting in a local script:

  <?php declare(encoding = 'Shift-JIS'); ?>

The pragma setting must be the first statement in the script. It only affects
the script in which it occurs, and does not propagate to any included files.

INI Files
=========

If unicode.semantics is on, INI files are presumed to contain UTF-8 encoded
keys and values. If unicode.semantics is off, the data is taken as-is,
similar to PHP 5. No validation occurs during parsing. Instead, invalid UTF-8
sequences are caught during access by ini_*() functions.
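
The settings described so far can be collected into one php.ini fragment.
This is only a sketch: the setting names come from the sections above, while
the values are illustrative choices for a hypothetical site that keeps legacy
binary data in ISO-8859-1 and serves UTF-8 pages.

```ini
; Sketch of a php.ini fragment; all values are illustrative assumptions.
unicode.semantics         = On
unicode.fallback_encoding = "utf-8"
unicode.runtime_encoding  = "iso-8859-1"
unicode.output_encoding   = "utf-8"
unicode.filename_encoding = "utf-8"
unicode.script_encoding   = "utf-8"
```

Because any unicode.*_encoding setting left unset takes the fallback value,
the last three lines could be dropped here without changing the result.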
Stream I/O
==========

PHP has a streams-based I/O system for generalized filesystem access,
networking, data compression, and other operations. Since the data on the
other end of the stream can be in any encoding, you need to think about data
conversion.

By default, PHP opens streams in binary mode. To open a file in text mode,
you must use the 't' flag (or the FILE_TEXT parameter -- see below). The
default encoding for streams in text mode is UTF-8. This means that if
'file.txt' is a UTF-8 text file, this code snippet:

  $fp = fopen('file.txt', 'rt');
  $str = fread($fp, 100);

returns 100 Unicode characters, while:

  $fp = fopen('file.txt', 'wt');
  fwrite($fp, $uni);

writes to a UTF-8 text file.

If you mainly work with files in an encoding other than UTF-8, you can change
the default context encoding setting:

  stream_default_encoding('Shift-JIS');
  $data = file_get_contents('file.txt', FILE_TEXT);
  // work on $data
  file_put_contents('file.txt', $data, FILE_TEXT);

The file_get_contents() and file_put_contents() functions now accept an
additional parameter, FILE_TEXT. If you provide FILE_TEXT for
file_get_contents(), PHP returns a Unicode string. Without FILE_TEXT, PHP
returns a binary string (which would be appropriate for true binary data,
such as an image file). When writing a Unicode string with
file_put_contents(), you must supply the FILE_TEXT parameter, or PHP
generates a warning.

If you need to work with multiple encodings, you can create custom contexts
using stream_context_create() and then pass in the custom context as an
additional parameter.
For example:

  $ctx = stream_context_create(NULL, array('encoding' => 'big5'));
  $data = file_get_contents('file.txt', FILE_TEXT, $ctx);
  // work on $data
  file_put_contents('file.txt', $data, FILE_TEXT, $ctx);

Conversion Semantics and Error Handling
=======================================

PHP can convert strings explicitly (casting) and implicitly (concatenation,
comparison, and parameter passing). For example, when concatenating a Unicode
string and a binary string, PHP converts the binary string to Unicode for
better precision.

However, not all characters can be converted between Unicode and legacy
encodings. The first possibility is that a string contains corrupt data or an
illegal byte sequence. In this case, the converter simply stops with a
message that resembles:

  Warning: Could not convert binary string to Unicode string
  (converter UTF-8 failed on bytes (0xE9) at offset 2)

Conversely, if a similar error occurs when attempting to convert Unicode to a
legacy string, the converter generates a message that resembles:

  Warning: Could not convert Unicode string to binary string
  (converter ISO-8859-1 failed on character {U+DC00} at offset 2)

To customize this behavior, refer to "Creating a Custom Error Handler" below.

The second possibility is that a Unicode character simply cannot be
represented in the legacy encoding. By default, when downconverting from
Unicode, the converter substitutes any missing sequences with the appropriate
substitution sequence for that codepage, such as 0x1A (Control-Z) in
ISO-8859-1. When upconverting to Unicode, the converter replaces any byte
sequence that has no Unicode equivalent with the Unicode substitution
character (U+FFFD).
You can customize the conversion error behavior to:

  - stop the conversion and return an empty string
  - skip any invalid characters
  - substitute invalid characters with a custom substitution character
  - escape the invalid character in various formats

To control the global conversion error settings, use the functions:

  unicode_set_error_mode(int direction, int mode)
  unicode_set_subst_char(unicode char)

where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of
these constants:

  U_CONV_ERROR_STOP
  U_CONV_ERROR_SKIP
  U_CONV_ERROR_SUBST
  U_CONV_ERROR_ESCAPE_UNICODE
  U_CONV_ERROR_ESCAPE_ICU
  U_CONV_ERROR_ESCAPE_JAVA
  U_CONV_ERROR_ESCAPE_XML_DEC
  U_CONV_ERROR_ESCAPE_XML_HEX

As an example, with a runtime encoding of ISO-8859-1, the conversion:

  $str = (binary)"< \u30AB >";

results in:

  MODE                  RESULT
  --------------------------------------
  stop                  ""
  skip                  "<  >"
  substitute            "< ? >"
  escape (Unicode)      "< {U+30AB} >"
  escape (ICU)          "< %U30AB >"
  escape (Java)         "< \u30AB >"
  escape (XML decimal)  "< &#12459; >"
  escape (XML hex)      "< &#x30AB; >"

With a runtime encoding of UTF-8, the conversion of the (illegal) sequence:

  $str = (unicode)b"< \xe9\xfe >";

results in:

  MODE                  RESULT
  --------------------------------------
  stop                  ""
  skip                  ""
  substitute            ""
  escape (Unicode)      "< %XE9%XFE >"
  escape (ICU)          "< %XE9%XFE >"
  escape (Java)         "< \xE9\xFE >"
  escape (XML decimal)  "< &#233;&#254; >"
  escape (XML hex)      "< &#xE9;&#xFE; >"

The substitution character can be set only for FROM_UNICODE direction and has
to exist in the target character set. The default substitution character is
(?).

NOTE: Casting is just a shortcut for using unicode.runtime_encoding. To
convert using an alternative encoding, use the unicode_encode() and
unicode_decode() functions. For example,

  $str = unicode_encode($uni, 'koi8-r', U_CONV_ERROR_SUBST);

results in a binary KOI8-R encoded string.

Creating a Custom Error Handler
-------------------------------

If an error occurs during the conversion, PHP outputs a warning describing the problem.
Instead of this default behavior, PHP can invoke a user-provided error
handler, similar to how the current user-defined error handler works. To set
the custom conversion error handler, call:

  mixed unicode_set_error_handler(callback error_handler)

The function returns the previously defined custom error handler. If no error
handler was defined, or if an error occurs when returning the handler, this
function returns NULL.

When the custom handler is set, the standard error handler is bypassed. It is
the responsibility of the custom handler to output or log any messages, raise
exceptions, or die(), if necessary. However, if the custom error handler
returns FALSE, the standard handler will be invoked afterwards.

The user function specified as the error_handler must accept five parameters:

  mixed error_handler($direction, $encoding, $char_or_byte, $offset,
                      $message)

where:

  $direction    - the direction of conversion, FROM_UNICODE/TO_UNICODE
  $encoding     - the name of the encoding to/from which the conversion
                  was attempted
  $char_or_byte - either Unicode character or byte sequence (depending
                  on direction) which caused the error
  $offset       - the offset of the failed character/byte sequence in
                  the source string
  $message      - the error message describing the problem

NOTE: If the error mode set by unicode_set_error_mode() is substitute, skip,
or escape, the handler won't be called, since these are non-error causing
operations. To always invoke your handler, set the error mode to
U_CONV_ERROR_STOP.

Unicode String Type
===================

The Unicode string type (IS_UNICODE) is supposed to contain text data encoded
in UTF-16. This is the main string type in PHP when the Unicode semantics
switch is turned on. Unicode strings can exist when the switch is off, but
they have to be produced programmatically via calls to functions that return
Unicode types.
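
A minimal sketch of producing a Unicode string while the switch is off. It
assumes a PHP 6 development build (released PHP versions do not have the
(unicode) cast) with unicode.semantics off and a runtime encoding of
ISO-8859-1. The variable names are illustrative.

```php
<?php
// Sketch only: assumes PHP 6 with unicode.semantics = Off and
// unicode.runtime_encoding = "iso-8859-1".

// With the switch off, a plain literal is a binary string (IS_STRING).
// The byte 0xE9 is "e with acute accent" in ISO-8859-1.
$bin = "caf\xe9";

// The cast decodes the bytes using unicode.runtime_encoding and yields
// a Unicode string (IS_UNICODE).
$uni = (unicode)$bin;

// Casting back re-encodes the text using the same runtime encoding.
$bytes = (binary)$uni;
```

Both casts go through unicode.runtime_encoding, so changing that setting
changes how the same bytes are interpreted.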
Binary String Type
==================

The binary string type (IS_STRING) serves two purposes: backwards
compatibility and representing non-Unicode strings and binary data. When the
Unicode semantics switch is off, it is used for all strings in PHP, the same
as in previous versions. When the switch is on, this type will be used to
store text in other encodings as well as true binary data such as images,
PDFs, etc.

Printing binary data to the standard output passes it through as-is,
independent of the output encoding.

For examples of specifying binary string literals, refer to the section
"Language Modifications".

Language Modifications
======================

If the Unicode semantics switch is turned on, PHP string literals --
single-quoted, double-quoted, and heredocs -- become Unicode strings
(IS_UNICODE type). String literals support all the same escape sequences and
variable interpolations as before, plus several new escape sequences.

PHP interprets the contents of strings as follows:

  - all non-escaped characters are interpreted as a corresponding Unicode
    codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) =>
    U+0061, Shift-JIS (0x92 0x86) => U+4E2D

  - existing PHP escape sequences are also interpreted as Unicode
    codepoints, including \xXX (hex) and \OOO (octal) numbers, e.g.
    "\x20" => U+0020

  - two new escape sequences, \uXXXX and \UXXXXXX, are interpreted as a 4-
    or 6-digit hex Unicode codepoint value, e.g. \u0221 => U+0221,
    \U010410 => U+10410. (Having two sequences avoids the ambiguity of
    \u020608 -- is that supposed to be U+0206 followed by "08", or
    U+020608 ?)

  - a new escape sequence allows specifying a character by its full Unicode
    name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20

PHP allows variable interpolation inside the double-quoted and heredoc
strings. However, the parser separates the string into literal and variable
chunks during compilation, e.g. "abc $var def" -> "abc" . $var . "def".
This means that PHP can handle literal chunks in the normal way as far as
Unicode support is concerned.

Since all string literals become Unicode by default, PHP 6 introduces new
syntax for creating byte-oriented or binary strings. Prefixing a string
literal with the letter 'b' creates a binary string:

  $var = b'abc\001';
  $var = b"abc\001";
  $var = b<<[_