php-src/ext/standard/html_tables.h

6236 lines
472 KiB
C
Raw Normal View History

/*
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&<" would be decoded, but not "&é". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only <, >, &, ", ' and ', now e.g. ", ', ', ', etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
+----------------------------------------------------------------------+
| PHP Version 5 |
+----------------------------------------------------------------------+
2012-01-01 21:15:04 +08:00
| Copyright (c) 1997-2012 The PHP Group |
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&<" would be decoded, but not "&é". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only <, >, &, ", ' and ', now e.g. ", ', ', ', etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
+----------------------------------------------------------------------+
| This source file is subject to version 3.01 of the PHP license, |
| that is bundled with this package in the file LICENSE, and is |
| available through the world-wide-web at the following url: |
| http://www.php.net/license/3_01.txt |
| If you did not receive a copy of the PHP license and are unable to |
| obtain it through the world-wide-web, please send a note to |
| license@php.net so we can mail you a copy immediately. |
+----------------------------------------------------------------------+
*/
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* $Id$ */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
#ifndef HTML_TABLES_H
#define HTML_TABLES_H
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/**************************************************************************
***************************************************************************
** THIS FILE IS AUTOMATICALLY GENERATED. DO NOT MODIFY IT. **
***************************************************************************
** Please change html_tables/html_table_gen.php instead and then **
** run it in order to generate this file **
***************************************************************************
**************************************************************************/
enum entity_charset { cs_utf_8, cs_8859_1, cs_cp1252, cs_8859_15, cs_cp1251,
cs_8859_5, cs_cp866, cs_macroman, cs_koi8r, cs_big5,
cs_gb2312, cs_big5hkscs, cs_sjis, cs_eucjp,
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
cs_numelems /* used to count the number of charsets */
};
#define CHARSET_UNICODE_COMPAT(cs) ((cs) <= cs_8859_1)
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
#define CHARSET_SINGLE_BYTE(cs) ((cs) > cs_utf_8 && (cs) < cs_big5)
#define CHARSET_PARTIAL_SUPPORT(cs) ((cs) >= cs_big5)
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
static const struct {
const char *codeset;
enum entity_charset charset;
} charset_map[] = {
{ "ISO-8859-1", cs_8859_1 },
{ "ISO8859-1", cs_8859_1 },
{ "ISO-8859-15", cs_8859_15 },
{ "ISO8859-15", cs_8859_15 },
{ "utf-8", cs_utf_8 },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
{ "cp1252", cs_cp1252 },
{ "Windows-1252", cs_cp1252 },
{ "1252", cs_cp1252 },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
{ "BIG5", cs_big5 },
{ "950", cs_big5 },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
{ "GB2312", cs_gb2312 },
{ "936", cs_gb2312 },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
{ "BIG5-HKSCS", cs_big5hkscs },
{ "Shift_JIS", cs_sjis },
{ "SJIS", cs_sjis },
{ "932", cs_sjis },
{ "SJIS-win", cs_sjis },
{ "CP932", cs_sjis },
{ "EUCJP", cs_eucjp },
{ "EUC-JP", cs_eucjp },
{ "eucJP-win", cs_eucjp },
{ "KOI8-R", cs_koi8r },
{ "koi8-ru", cs_koi8r },
{ "koi8r", cs_koi8r },
{ "cp1251", cs_cp1251 },
{ "Windows-1251", cs_cp1251 },
{ "win-1251", cs_cp1251 },
{ "iso8859-5", cs_8859_5 },
{ "iso-8859-5", cs_8859_5 },
{ "cp866", cs_cp866 },
{ "866", cs_cp866 },
{ "ibm866", cs_cp866 },
{ "MacRoman", cs_macroman },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
{ NULL }
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* longest entity name length excluding & and ; */
#define LONGEST_ENTITY_LENGTH 31
/* Definitions for mappings *to* Unicode.
* The origin charset must have at most 256 code points.
* The multi-byte encodings are not supported */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
typedef struct {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
unsigned short uni_cp[64];
} enc_to_uni_stage2;
typedef struct {
const enc_to_uni_stage2 *inner[4];
} enc_to_uni;
/* bits 7-8 bits (only single bytes encodings supported )*/
#define ENT_ENC_TO_UNI_STAGE1(k) ((k & 0xC0) >> 6)
/* bits 1-6 */
#define ENT_ENC_TO_UNI_STAGE2(k) ((k) & 0x3F)
/* {{{ Mappings *to* Unicode for ISO-8859-1 */
/* {{{ Stage 2 tables for ISO-8859-1 */
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_iso88591_00 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005,
0x0006, 0x0007, 0x0008, 0x0009, 0x000A, 0x000B,
0x000C, 0x000D, 0x000E, 0x000F, 0x0010, 0x0011,
0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017,
0x0018, 0x0019, 0x001A, 0x001B, 0x001C, 0x001D,
0x001E, 0x001F, 0x0020, 0x0021, 0x0022, 0x0023,
0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029,
0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035,
0x0036, 0x0037, 0x0038, 0x0039, 0x003A, 0x003B,
0x003C, 0x003D, 0x003E, 0x003F,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_iso88591_40 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045,
0x0046, 0x0047, 0x0048, 0x0049, 0x004A, 0x004B,
0x004C, 0x004D, 0x004E, 0x004F, 0x0050, 0x0051,
0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D,
0x005E, 0x005F, 0x0060, 0x0061, 0x0062, 0x0063,
0x0064, 0x0065, 0x0066, 0x0067, 0x0068, 0x0069,
0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075,
0x0076, 0x0077, 0x0078, 0x0079, 0x007A, 0x007B,
0x007C, 0x007D, 0x007E, 0x007F,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_iso88591_80 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x0080, 0x0081, 0x0082, 0x0083, 0x0084, 0x0085,
0x0086, 0x0087, 0x0088, 0x0089, 0x008A, 0x008B,
0x008C, 0x008D, 0x008E, 0x008F, 0x0090, 0x0091,
0x0092, 0x0093, 0x0094, 0x0095, 0x0096, 0x0097,
0x0098, 0x0099, 0x009A, 0x009B, 0x009C, 0x009D,
0x009E, 0x009F, 0x00A0, 0x00A1, 0x00A2, 0x00A3,
0x00A4, 0x00A5, 0x00A6, 0x00A7, 0x00A8, 0x00A9,
0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5,
0x00B6, 0x00B7, 0x00B8, 0x00B9, 0x00BA, 0x00BB,
0x00BC, 0x00BD, 0x00BE, 0x00BF,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_iso88591_C0 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5,
0x00C6, 0x00C7, 0x00C8, 0x00C9, 0x00CA, 0x00CB,
0x00CC, 0x00CD, 0x00CE, 0x00CF, 0x00D0, 0x00D1,
0x00D2, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x00D7,
0x00D8, 0x00D9, 0x00DA, 0x00DB, 0x00DC, 0x00DD,
0x00DE, 0x00DF, 0x00E0, 0x00E1, 0x00E2, 0x00E3,
0x00E4, 0x00E5, 0x00E6, 0x00E7, 0x00E8, 0x00E9,
0x00EA, 0x00EB, 0x00EC, 0x00ED, 0x00EE, 0x00EF,
0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x00F4, 0x00F5,
0x00F6, 0x00F7, 0x00F8, 0x00F9, 0x00FA, 0x00FB,
0x00FC, 0x00FD, 0x00FE, 0x00FF,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* end of stage 2 tables for ISO-8859-1 }}} */
/* {{{ Stage 1 table for ISO-8859-1 */
2011-08-31 13:47:19 +08:00
static const enc_to_uni enc_to_uni_iso88591 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
&enc_to_uni_s2_iso88591_00,
&enc_to_uni_s2_iso88591_40,
&enc_to_uni_s2_iso88591_80,
2011-08-31 13:47:19 +08:00
&enc_to_uni_s2_iso88591_C0 }
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
/* end of stage 1 table for ISO-8859-1 }}} */
/* {{{ Mappings *to* Unicode for ISO-8859-5 */
/* {{{ Stage 2 tables for ISO-8859-5 */
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_iso88595_80 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x0080, 0x0081, 0x0082, 0x0083, 0x0084, 0x0085,
0x0086, 0x0087, 0x0088, 0x0089, 0x008A, 0x008B,
0x008C, 0x008D, 0x008E, 0x008F, 0x0090, 0x0091,
0x0092, 0x0093, 0x0094, 0x0095, 0x0096, 0x0097,
0x0098, 0x0099, 0x009A, 0x009B, 0x009C, 0x009D,
0x009E, 0x009F, 0x00A0, 0x0401, 0x0402, 0x0403,
0x0404, 0x0405, 0x0406, 0x0407, 0x0408, 0x0409,
0x040A, 0x040B, 0x040C, 0x00AD, 0x040E, 0x040F,
0x0410, 0x0411, 0x0412, 0x0413, 0x0414, 0x0415,
0x0416, 0x0417, 0x0418, 0x0419, 0x041A, 0x041B,
0x041C, 0x041D, 0x041E, 0x041F,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_iso88595_C0 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x0420, 0x0421, 0x0422, 0x0423, 0x0424, 0x0425,
0x0426, 0x0427, 0x0428, 0x0429, 0x042A, 0x042B,
0x042C, 0x042D, 0x042E, 0x042F, 0x0430, 0x0431,
0x0432, 0x0433, 0x0434, 0x0435, 0x0436, 0x0437,
0x0438, 0x0439, 0x043A, 0x043B, 0x043C, 0x043D,
0x043E, 0x043F, 0x0440, 0x0441, 0x0442, 0x0443,
0x0444, 0x0445, 0x0446, 0x0447, 0x0448, 0x0449,
0x044A, 0x044B, 0x044C, 0x044D, 0x044E, 0x044F,
0x2116, 0x0451, 0x0452, 0x0453, 0x0454, 0x0455,
0x0456, 0x0457, 0x0458, 0x0459, 0x045A, 0x045B,
0x045C, 0x00A7, 0x045E, 0x045F,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* end of stage 2 tables for ISO-8859-5 }}} */
/* {{{ Stage 1 table for ISO-8859-5 */
2011-08-31 13:47:19 +08:00
static const enc_to_uni enc_to_uni_iso88595 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
&enc_to_uni_s2_iso88591_00,
&enc_to_uni_s2_iso88591_40,
&enc_to_uni_s2_iso88595_80,
2011-08-31 13:47:19 +08:00
&enc_to_uni_s2_iso88595_C0 }
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
/* end of stage 1 table for ISO-8859-5 }}} */
/* {{{ Mappings *to* Unicode for ISO-8859-15 */
/* {{{ Stage 2 tables for ISO-8859-15 */
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_iso885915_80 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x0080, 0x0081, 0x0082, 0x0083, 0x0084, 0x0085,
0x0086, 0x0087, 0x0088, 0x0089, 0x008A, 0x008B,
0x008C, 0x008D, 0x008E, 0x008F, 0x0090, 0x0091,
0x0092, 0x0093, 0x0094, 0x0095, 0x0096, 0x0097,
0x0098, 0x0099, 0x009A, 0x009B, 0x009C, 0x009D,
0x009E, 0x009F, 0x00A0, 0x00A1, 0x00A2, 0x00A3,
0x20AC, 0x00A5, 0x0160, 0x00A7, 0x0161, 0x00A9,
0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x017D, 0x00B5,
0x00B6, 0x00B7, 0x017E, 0x00B9, 0x00BA, 0x00BB,
0x0152, 0x0153, 0x0178, 0x00BF,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* end of stage 2 tables for ISO-8859-15 }}} */
/* {{{ Stage 1 table for ISO-8859-15 */
2011-08-31 13:47:19 +08:00
static const enc_to_uni enc_to_uni_iso885915 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
&enc_to_uni_s2_iso88591_00,
&enc_to_uni_s2_iso88591_40,
&enc_to_uni_s2_iso885915_80,
2011-08-31 13:47:19 +08:00
&enc_to_uni_s2_iso88591_C0 }
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
/* end of stage 1 table for ISO-8859-15 }}} */
/* {{{ Mappings *to* Unicode for Windows-1252 */
/* {{{ Stage 2 tables for Windows-1252 */
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_win1252_80 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x20AC, 0xFFFF, 0x201A, 0x0192, 0x201E, 0x2026,
0x2020, 0x2021, 0x02C6, 0x2030, 0x0160, 0x2039,
0x0152, 0xFFFF, 0x017D, 0xFFFF, 0xFFFF, 0x2018,
0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFF,
0x017E, 0x0178, 0x00A0, 0x00A1, 0x00A2, 0x00A3,
0x00A4, 0x00A5, 0x00A6, 0x00A7, 0x00A8, 0x00A9,
0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5,
0x00B6, 0x00B7, 0x00B8, 0x00B9, 0x00BA, 0x00BB,
0x00BC, 0x00BD, 0x00BE, 0x00BF,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* end of stage 2 tables for Windows-1252 }}} */
/* {{{ Stage 1 table for Windows-1252 */
2011-08-31 13:47:19 +08:00
static const enc_to_uni enc_to_uni_win1252 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
&enc_to_uni_s2_iso88591_00,
&enc_to_uni_s2_iso88591_40,
&enc_to_uni_s2_win1252_80,
2011-08-31 13:47:19 +08:00
&enc_to_uni_s2_iso88591_C0 }
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
/* end of stage 1 table for Windows-1252 }}} */
/* {{{ Mappings *to* Unicode for Windows-1251 */
/* {{{ Stage 2 tables for Windows-1251 */
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_win1251_80 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x0402, 0x0403, 0x201A, 0x0453, 0x201E, 0x2026,
0x2020, 0x2021, 0x20AC, 0x2030, 0x0409, 0x2039,
0x040A, 0x040C, 0x040B, 0x040F, 0x0452, 0x2018,
0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
0xFFFF, 0x2122, 0x0459, 0x203A, 0x045A, 0x045C,
0x045B, 0x045F, 0x00A0, 0x040E, 0x045E, 0x0408,
0x00A4, 0x0490, 0x00A6, 0x00A7, 0x0401, 0x00A9,
0x0404, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x0407,
0x00B0, 0x00B1, 0x0406, 0x0456, 0x0491, 0x00B5,
0x00B6, 0x00B7, 0x0451, 0x2116, 0x0454, 0x00BB,
0x0458, 0x0405, 0x0455, 0x0457,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_win1251_C0 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x0410, 0x0411, 0x0412, 0x0413, 0x0414, 0x0415,
0x0416, 0x0417, 0x0418, 0x0419, 0x041A, 0x041B,
0x041C, 0x041D, 0x041E, 0x041F, 0x0420, 0x0421,
0x0422, 0x0423, 0x0424, 0x0425, 0x0426, 0x0427,
0x0428, 0x0429, 0x042A, 0x042B, 0x042C, 0x042D,
0x042E, 0x042F, 0x0430, 0x0431, 0x0432, 0x0433,
0x0434, 0x0435, 0x0436, 0x0437, 0x0438, 0x0439,
0x043A, 0x043B, 0x043C, 0x043D, 0x043E, 0x043F,
0x0440, 0x0441, 0x0442, 0x0443, 0x0444, 0x0445,
0x0446, 0x0447, 0x0448, 0x0449, 0x044A, 0x044B,
0x044C, 0x044D, 0x044E, 0x044F,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* end of stage 2 tables for Windows-1251 }}} */
/* {{{ Stage 1 table for Windows-1251 */
2011-08-31 13:47:19 +08:00
static const enc_to_uni enc_to_uni_win1251 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
&enc_to_uni_s2_iso88591_00,
&enc_to_uni_s2_iso88591_40,
&enc_to_uni_s2_win1251_80,
2011-08-31 13:47:19 +08:00
&enc_to_uni_s2_win1251_C0 }
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
/* end of stage 1 table for Windows-1251 }}} */
/* {{{ Mappings *to* Unicode for KOI8-R */
/* {{{ Stage 2 tables for KOI8-R */
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_koi8r_80 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x2500, 0x2502, 0x250C, 0x2510, 0x2514, 0x2518,
0x251C, 0x2524, 0x252C, 0x2534, 0x253C, 0x2580,
0x2584, 0x2588, 0x258C, 0x2590, 0x2591, 0x2592,
0x2593, 0x2320, 0x25A0, 0x2219, 0x221A, 0x2248,
0x2264, 0x2265, 0x00A0, 0x2321, 0x00B0, 0x00B2,
0x00B7, 0x00F7, 0x2550, 0x2551, 0x2552, 0x0451,
0x2553, 0x2554, 0x2555, 0x2556, 0x2557, 0x2558,
0x2559, 0x255A, 0x255B, 0x255C, 0x255D, 0x255E,
0x255F, 0x2560, 0x2561, 0x0401, 0x2562, 0x2563,
0x2564, 0x2565, 0x2566, 0x2567, 0x2568, 0x2569,
0x256A, 0x256B, 0x256C, 0x00A9,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_koi8r_C0 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x044E, 0x0430, 0x0431, 0x0446, 0x0434, 0x0435,
0x0444, 0x0433, 0x0445, 0x0438, 0x0439, 0x043A,
0x043B, 0x043C, 0x043D, 0x043E, 0x043F, 0x044F,
0x0440, 0x0441, 0x0442, 0x0443, 0x0436, 0x0432,
0x044C, 0x044B, 0x0437, 0x0448, 0x044D, 0x0449,
0x0447, 0x044A, 0x042E, 0x0410, 0x0411, 0x0426,
0x0414, 0x0415, 0x0424, 0x0413, 0x0425, 0x0418,
0x0419, 0x041A, 0x041B, 0x041C, 0x041D, 0x041E,
0x041F, 0x042F, 0x0420, 0x0421, 0x0422, 0x0423,
0x0416, 0x0412, 0x042C, 0x042B, 0x0417, 0x0428,
0x042D, 0x0429, 0x0427, 0x042A,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* end of stage 2 tables for KOI8-R }}} */
/* {{{ Stage 1 table for KOI8-R */
2011-08-31 13:47:19 +08:00
static const enc_to_uni enc_to_uni_koi8r = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
&enc_to_uni_s2_iso88591_00,
&enc_to_uni_s2_iso88591_40,
&enc_to_uni_s2_koi8r_80,
2011-08-31 13:47:19 +08:00
&enc_to_uni_s2_koi8r_C0 }
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
/* end of stage 1 table for KOI8-R }}} */
/* {{{ Mappings *to* Unicode for CP-866 */
/* {{{ Stage 2 tables for CP-866 */
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_cp866_80 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x0410, 0x0411, 0x0412, 0x0413, 0x0414, 0x0415,
0x0416, 0x0417, 0x0418, 0x0419, 0x041A, 0x041B,
0x041C, 0x041D, 0x041E, 0x041F, 0x0420, 0x0421,
0x0422, 0x0423, 0x0424, 0x0425, 0x0426, 0x0427,
0x0428, 0x0429, 0x042A, 0x042B, 0x042C, 0x042D,
0x042E, 0x042F, 0x0430, 0x0431, 0x0432, 0x0433,
0x0434, 0x0435, 0x0436, 0x0437, 0x0438, 0x0439,
0x043A, 0x043B, 0x043C, 0x043D, 0x043E, 0x043F,
0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561,
0x2562, 0x2556, 0x2555, 0x2563, 0x2551, 0x2557,
0x255D, 0x255C, 0x255B, 0x2510,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_cp866_C0 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C,
0x255E, 0x255F, 0x255A, 0x2554, 0x2569, 0x2566,
0x2560, 0x2550, 0x256C, 0x2567, 0x2568, 0x2564,
0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C,
0x2590, 0x2580, 0x0440, 0x0441, 0x0442, 0x0443,
0x0444, 0x0445, 0x0446, 0x0447, 0x0448, 0x0449,
0x044A, 0x044B, 0x044C, 0x044D, 0x044E, 0x044F,
0x0401, 0x0451, 0x0404, 0x0454, 0x0407, 0x0457,
0x040E, 0x045E, 0x00B0, 0x2219, 0x00B7, 0x221A,
0x2116, 0x00A4, 0x25A0, 0x00A0,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* end of stage 2 tables for CP-866 }}} */
/* {{{ Stage 1 table for CP-866 */
2011-08-31 13:47:19 +08:00
static const enc_to_uni enc_to_uni_cp866 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
&enc_to_uni_s2_iso88591_00,
&enc_to_uni_s2_iso88591_40,
&enc_to_uni_s2_cp866_80,
2011-08-31 13:47:19 +08:00
&enc_to_uni_s2_cp866_C0 }
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
/* end of stage 1 table for CP-866 }}} */
/* {{{ Mappings *to* Unicode for MacRoman */
/* {{{ Stage 2 tables for MacRoman */
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_macroman_00 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF,
0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF,
0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF,
0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF,
0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF,
0xFFFF, 0xFFFF, 0x0020, 0x0021, 0x0022, 0x0023,
0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029,
0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035,
0x0036, 0x0037, 0x0038, 0x0039, 0x003A, 0x003B,
0x003C, 0x003D, 0x003E, 0x003F,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_macroman_40 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045,
0x0046, 0x0047, 0x0048, 0x0049, 0x004A, 0x004B,
0x004C, 0x004D, 0x004E, 0x004F, 0x0050, 0x0051,
0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D,
0x005E, 0x005F, 0x0060, 0x0061, 0x0062, 0x0063,
0x0064, 0x0065, 0x0066, 0x0067, 0x0068, 0x0069,
0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075,
0x0076, 0x0077, 0x0078, 0x0079, 0x007A, 0x007B,
0x007C, 0x007D, 0x007E, 0xFFFF,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_macroman_80 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x00C4, 0x00C5, 0x00C7, 0x00C9, 0x00D1, 0x00D6,
0x00DC, 0x00E1, 0x00E0, 0x00E2, 0x00E4, 0x00E3,
0x00E5, 0x00E7, 0x00E9, 0x00E8, 0x00EA, 0x00EB,
0x00ED, 0x00EC, 0x00EE, 0x00EF, 0x00F1, 0x00F3,
0x00F2, 0x00F4, 0x00F6, 0x00F5, 0x00FA, 0x00F9,
0x00FB, 0x00FC, 0x2020, 0x00B0, 0x00A2, 0x00A3,
0x00A7, 0x2022, 0x00B6, 0x00DF, 0x00AE, 0x00A9,
0x2122, 0x00B4, 0x00A8, 0x2260, 0x00C6, 0x00D8,
0x221E, 0x00B1, 0x2264, 0x2265, 0x00A5, 0x00B5,
0x2202, 0x2211, 0x220F, 0x03C0, 0x222B, 0x00AA,
0x00BA, 0x03A9, 0x00E6, 0x00F8,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
2011-08-31 13:47:19 +08:00
static const enc_to_uni_stage2 enc_to_uni_s2_macroman_C0 = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
0x00BF, 0x00A1, 0x00AC, 0x221A, 0x0192, 0x2248,
0x2206, 0x00AB, 0x00BB, 0x2026, 0x00A0, 0x00C0,
0x00C3, 0x00D5, 0x0152, 0x0153, 0x2013, 0x2014,
0x201C, 0x201D, 0x2018, 0x2019, 0x00F7, 0x25CA,
0x00FF, 0x0178, 0x2044, 0x20AC, 0x2039, 0x203A,
0xFB01, 0xFB02, 0x2021, 0x00B7, 0x201A, 0x201E,
0x2030, 0x00C2, 0x00CA, 0x00C1, 0x00CB, 0x00C8,
0x00CD, 0x00CE, 0x00CF, 0x00CC, 0x00D3, 0x00D4,
0xF8FF, 0x00D2, 0x00DA, 0x00DB, 0x00D9, 0x0131,
0x02C6, 0x02DC, 0x00AF, 0x02D8, 0x02D9, 0x02DA,
0x00B8, 0x02DD, 0x02DB, 0x02C7,
2011-08-31 13:47:19 +08:00
} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* end of stage 2 tables for MacRoman }}} */
/* {{{ Stage 1 table for MacRoman */
2011-08-31 13:47:19 +08:00
static const enc_to_uni enc_to_uni_macroman = { {
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
&enc_to_uni_s2_macroman_00,
&enc_to_uni_s2_macroman_40,
&enc_to_uni_s2_macroman_80,
2011-08-31 13:47:19 +08:00
&enc_to_uni_s2_macroman_C0 }
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
/* end of stage 1 table for MacRoman }}} */
/* {{{ Index of tables for encoding conversion */
static const enc_to_uni *const enc_to_uni_index[cs_numelems] = {
NULL,
&enc_to_uni_iso88591,
&enc_to_uni_win1252,
&enc_to_uni_iso885915,
&enc_to_uni_win1251,
&enc_to_uni_iso88595,
&enc_to_uni_cp866,
&enc_to_uni_macroman,
&enc_to_uni_koi8r,
};
/* }}} */
/* Definitions for mappings *from* Unicode */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
typedef struct {
unsigned short un_code_point; /* we don't need bigger */
unsigned char cs_code; /* currently, we only have maps to single-byte encodings */
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
} uni_to_enc;
/* {{{ Mappings *from* Unicode for ISO-8859-15 */
static const uni_to_enc unimap_iso885915[] = {
{ 0x00A5, 0xA5 }, /* yen sign */
{ 0x00A7, 0xA7 }, /* section sign */
{ 0x00A9, 0xA9 }, /* copyright sign */
{ 0x00AA, 0xAA }, /* feminine ordinal indicator */
{ 0x00AB, 0xAB }, /* left-pointing double angle quotation mark */
{ 0x00AC, 0xAC }, /* not sign */
{ 0x00AD, 0xAD }, /* soft hyphen */
{ 0x00AE, 0xAE }, /* registered sign */
{ 0x00AF, 0xAF }, /* macron */
{ 0x00B0, 0xB0 }, /* degree sign */
{ 0x00B1, 0xB1 }, /* plus-minus sign */
{ 0x00B2, 0xB2 }, /* superscript two */
{ 0x00B3, 0xB3 }, /* superscript three */
{ 0x00B5, 0xB5 }, /* micro sign */
{ 0x00B6, 0xB6 }, /* pilcrow sign */
{ 0x00B7, 0xB7 }, /* middle dot */
{ 0x00B9, 0xB9 }, /* superscript one */
{ 0x00BA, 0xBA }, /* masculine ordinal indicator */
{ 0x00BB, 0xBB }, /* right-pointing double angle quotation mark */
{ 0x0152, 0xBC }, /* latin capital ligature oe */
{ 0x0153, 0xBD }, /* latin small ligature oe */
{ 0x0160, 0xA6 }, /* latin capital letter s with caron */
{ 0x0161, 0xA8 }, /* latin small letter s with caron */
{ 0x0178, 0xBE }, /* latin capital letter y with diaeresis */
{ 0x017D, 0xB4 }, /* latin capital letter z with caron */
{ 0x017E, 0xB8 }, /* latin small letter z with caron */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
{ 0x20AC, 0xA4 }, /* euro sign */
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* {{{ end of mappings *from* Unicode for ISO-8859-15 */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* {{{ Mappings *from* Unicode for Windows-1252 */
static const uni_to_enc unimap_win1252[] = {
{ 0x0152, 0x8C }, /* latin capital ligature oe */
{ 0x0153, 0x9C }, /* latin small ligature oe */
{ 0x0160, 0x8A }, /* latin capital letter s with caron */
{ 0x0161, 0x9A }, /* latin small letter s with caron */
{ 0x0178, 0x9F }, /* latin capital letter y with diaeresis */
{ 0x017D, 0x8E }, /* latin capital letter z with caron */
{ 0x017E, 0x9E }, /* latin small letter z with caron */
{ 0x0192, 0x83 }, /* latin small letter f with hook */
{ 0x02C6, 0x88 }, /* modifier letter circumflex accent */
{ 0x02DC, 0x98 }, /* small tilde */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
{ 0x2013, 0x96 }, /* en dash */
{ 0x2014, 0x97 }, /* em dash */
{ 0x2018, 0x91 }, /* left single quotation mark */
{ 0x2019, 0x92 }, /* right single quotation mark */
{ 0x201A, 0x82 }, /* single low-9 quotation mark */
{ 0x201C, 0x93 }, /* left double quotation mark */
{ 0x201D, 0x94 }, /* right double quotation mark */
{ 0x201E, 0x84 }, /* double low-9 quotation mark */
{ 0x2020, 0x86 }, /* dagger */
{ 0x2021, 0x87 }, /* double dagger */
{ 0x2022, 0x95 }, /* bullet */
{ 0x2026, 0x85 }, /* horizontal ellipsis */
{ 0x2030, 0x89 }, /* per mille sign */
{ 0x2039, 0x8B }, /* single left-pointing angle quotation mark */
{ 0x203A, 0x9B }, /* single right-pointing angle quotation mark */
{ 0x20AC, 0x80 }, /* euro sign */
{ 0x2122, 0x99 }, /* trade mark sign */
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* {{{ end of mappings *from* Unicode for Windows-1252 */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* {{{ Mappings *from* Unicode for Windows-1251 */
static const uni_to_enc unimap_win1251[] = {
{ 0x00A0, 0xA0 }, /* no-break space */
{ 0x00A4, 0xA4 }, /* currency sign */
{ 0x00A6, 0xA6 }, /* broken bar */
{ 0x00A7, 0xA7 }, /* section sign */
{ 0x00A9, 0xA9 }, /* copyright sign */
{ 0x00AB, 0xAB }, /* left-pointing double angle quotation mark */
{ 0x00AC, 0xAC }, /* not sign */
{ 0x00AD, 0xAD }, /* soft hyphen */
{ 0x00AE, 0xAE }, /* registered sign */
{ 0x00B0, 0xB0 }, /* degree sign */
{ 0x00B1, 0xB1 }, /* plus-minus sign */
{ 0x00B5, 0xB5 }, /* micro sign */
{ 0x00B6, 0xB6 }, /* pilcrow sign */
{ 0x00B7, 0xB7 }, /* middle dot */
{ 0x00BB, 0xBB }, /* right-pointing double angle quotation mark */
{ 0x0401, 0xA8 }, /* cyrillic capital letter io */
{ 0x0402, 0x80 }, /* cyrillic capital letter dje */
{ 0x0403, 0x81 }, /* cyrillic capital letter gje */
{ 0x0404, 0xAA }, /* cyrillic capital letter ukrainian ie */
{ 0x0405, 0xBD }, /* cyrillic capital letter dze */
{ 0x0406, 0xB2 }, /* cyrillic capital letter byelorussian-ukrainian i */
{ 0x0407, 0xAF }, /* cyrillic capital letter yi */
{ 0x0408, 0xA3 }, /* cyrillic capital letter je */
{ 0x0409, 0x8A }, /* cyrillic capital letter lje */
{ 0x040A, 0x8C }, /* cyrillic capital letter nje */
{ 0x040B, 0x8E }, /* cyrillic capital letter tshe */
{ 0x040C, 0x8D }, /* cyrillic capital letter kje */
{ 0x040E, 0xA1 }, /* cyrillic capital letter short u */
{ 0x040F, 0x8F }, /* cyrillic capital letter dzhe */
{ 0x0410, 0xC0 }, /* cyrillic capital letter a */
{ 0x0411, 0xC1 }, /* cyrillic capital letter be */
{ 0x0412, 0xC2 }, /* cyrillic capital letter ve */
{ 0x0413, 0xC3 }, /* cyrillic capital letter ghe */
{ 0x0414, 0xC4 }, /* cyrillic capital letter de */
{ 0x0415, 0xC5 }, /* cyrillic capital letter ie */
{ 0x0416, 0xC6 }, /* cyrillic capital letter zhe */
{ 0x0417, 0xC7 }, /* cyrillic capital letter ze */
{ 0x0418, 0xC8 }, /* cyrillic capital letter i */
{ 0x0419, 0xC9 }, /* cyrillic capital letter short i */
{ 0x041A, 0xCA }, /* cyrillic capital letter ka */
{ 0x041B, 0xCB }, /* cyrillic capital letter el */
{ 0x041C, 0xCC }, /* cyrillic capital letter em */
{ 0x041D, 0xCD }, /* cyrillic capital letter en */
{ 0x041E, 0xCE }, /* cyrillic capital letter o */
{ 0x041F, 0xCF }, /* cyrillic capital letter pe */
{ 0x0420, 0xD0 }, /* cyrillic capital letter er */
{ 0x0421, 0xD1 }, /* cyrillic capital letter es */
{ 0x0422, 0xD2 }, /* cyrillic capital letter te */
{ 0x0423, 0xD3 }, /* cyrillic capital letter u */
{ 0x0424, 0xD4 }, /* cyrillic capital letter ef */
{ 0x0425, 0xD5 }, /* cyrillic capital letter ha */
{ 0x0426, 0xD6 }, /* cyrillic capital letter tse */
{ 0x0427, 0xD7 }, /* cyrillic capital letter che */
{ 0x0428, 0xD8 }, /* cyrillic capital letter sha */
{ 0x0429, 0xD9 }, /* cyrillic capital letter shcha */
{ 0x042A, 0xDA }, /* cyrillic capital letter hard sign */
{ 0x042B, 0xDB }, /* cyrillic capital letter yeru */
{ 0x042C, 0xDC }, /* cyrillic capital letter soft sign */
{ 0x042D, 0xDD }, /* cyrillic capital letter e */
{ 0x042E, 0xDE }, /* cyrillic capital letter yu */
{ 0x042F, 0xDF }, /* cyrillic capital letter ya */
{ 0x0430, 0xE0 }, /* cyrillic small letter a */
{ 0x0431, 0xE1 }, /* cyrillic small letter be */
{ 0x0432, 0xE2 }, /* cyrillic small letter ve */
{ 0x0433, 0xE3 }, /* cyrillic small letter ghe */
{ 0x0434, 0xE4 }, /* cyrillic small letter de */
{ 0x0435, 0xE5 }, /* cyrillic small letter ie */
{ 0x0436, 0xE6 }, /* cyrillic small letter zhe */
{ 0x0437, 0xE7 }, /* cyrillic small letter ze */
{ 0x0438, 0xE8 }, /* cyrillic small letter i */
{ 0x0439, 0xE9 }, /* cyrillic small letter short i */
{ 0x043A, 0xEA }, /* cyrillic small letter ka */
{ 0x043B, 0xEB }, /* cyrillic small letter el */
{ 0x043C, 0xEC }, /* cyrillic small letter em */
{ 0x043D, 0xED }, /* cyrillic small letter en */
{ 0x043E, 0xEE }, /* cyrillic small letter o */
{ 0x043F, 0xEF }, /* cyrillic small letter pe */
{ 0x0440, 0xF0 }, /* cyrillic small letter er */
{ 0x0441, 0xF1 }, /* cyrillic small letter es */
{ 0x0442, 0xF2 }, /* cyrillic small letter te */
{ 0x0443, 0xF3 }, /* cyrillic small letter u */
{ 0x0444, 0xF4 }, /* cyrillic small letter ef */
{ 0x0445, 0xF5 }, /* cyrillic small letter ha */
{ 0x0446, 0xF6 }, /* cyrillic small letter tse */
{ 0x0447, 0xF7 }, /* cyrillic small letter che */
{ 0x0448, 0xF8 }, /* cyrillic small letter sha */
{ 0x0449, 0xF9 }, /* cyrillic small letter shcha */
{ 0x044A, 0xFA }, /* cyrillic small letter hard sign */
{ 0x044B, 0xFB }, /* cyrillic small letter yeru */
{ 0x044C, 0xFC }, /* cyrillic small letter soft sign */
{ 0x044D, 0xFD }, /* cyrillic small letter e */
{ 0x044E, 0xFE }, /* cyrillic small letter yu */
{ 0x044F, 0xFF }, /* cyrillic small letter ya */
{ 0x0451, 0xB8 }, /* cyrillic small letter io */
{ 0x0452, 0x90 }, /* cyrillic small letter dje */
{ 0x0453, 0x83 }, /* cyrillic small letter gje */
{ 0x0454, 0xBA }, /* cyrillic small letter ukrainian ie */
{ 0x0455, 0xBE }, /* cyrillic small letter dze */
{ 0x0456, 0xB3 }, /* cyrillic small letter byelorussian-ukrainian i */
{ 0x0457, 0xBF }, /* cyrillic small letter yi */
{ 0x0458, 0xBC }, /* cyrillic small letter je */
{ 0x0459, 0x9A }, /* cyrillic small letter lje */
{ 0x045A, 0x9C }, /* cyrillic small letter nje */
{ 0x045B, 0x9E }, /* cyrillic small letter tshe */
{ 0x045C, 0x9D }, /* cyrillic small letter kje */
{ 0x045E, 0xA2 }, /* cyrillic small letter short u */
{ 0x045F, 0x9F }, /* cyrillic small letter dzhe */
{ 0x0490, 0xA5 }, /* cyrillic capital letter ghe with upturn */
{ 0x0491, 0xB4 }, /* cyrillic small letter ghe with upturn */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
{ 0x2013, 0x96 }, /* en dash */
{ 0x2014, 0x97 }, /* em dash */
{ 0x2018, 0x91 }, /* left single quotation mark */
{ 0x2019, 0x92 }, /* right single quotation mark */
{ 0x201A, 0x82 }, /* single low-9 quotation mark */
{ 0x201C, 0x93 }, /* left double quotation mark */
{ 0x201D, 0x94 }, /* right double quotation mark */
{ 0x201E, 0x84 }, /* double low-9 quotation mark */
{ 0x2020, 0x86 }, /* dagger */
{ 0x2021, 0x87 }, /* double dagger */
{ 0x2022, 0x95 }, /* bullet */
{ 0x2026, 0x85 }, /* horizontal ellipsis */
{ 0x2030, 0x89 }, /* per mille sign */
{ 0x2039, 0x8B }, /* single left-pointing angle quotation mark */
{ 0x203A, 0x9B }, /* single right-pointing angle quotation mark */
{ 0x20AC, 0x88 }, /* euro sign */
{ 0x2116, 0xB9 }, /* numero sign */
{ 0x2122, 0x99 }, /* trade mark sign */
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* {{{ end of mappings *from* Unicode for Windows-1251 */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* {{{ Mappings *from* Unicode for KOI8-R */
static const uni_to_enc unimap_koi8r[] = {
{ 0x00A0, 0x9A }, /* no-break space */
{ 0x00A9, 0xBF }, /* copyright sign */
{ 0x00B0, 0x9C }, /* degree sign */
{ 0x00B2, 0x9D }, /* superscript two */
{ 0x00B7, 0x9E }, /* middle dot */
{ 0x00F7, 0x9F }, /* division sign */
{ 0x0401, 0xB3 }, /* cyrillic capital letter io */
{ 0x0410, 0xE1 }, /* cyrillic capital letter a */
{ 0x0411, 0xE2 }, /* cyrillic capital letter be */
{ 0x0412, 0xF7 }, /* cyrillic capital letter ve */
{ 0x0413, 0xE7 }, /* cyrillic capital letter ghe */
{ 0x0414, 0xE4 }, /* cyrillic capital letter de */
{ 0x0415, 0xE5 }, /* cyrillic capital letter ie */
{ 0x0416, 0xF6 }, /* cyrillic capital letter zhe */
{ 0x0417, 0xFA }, /* cyrillic capital letter ze */
{ 0x0418, 0xE9 }, /* cyrillic capital letter i */
{ 0x0419, 0xEA }, /* cyrillic capital letter short i */
{ 0x041A, 0xEB }, /* cyrillic capital letter ka */
{ 0x041B, 0xEC }, /* cyrillic capital letter el */
{ 0x041C, 0xED }, /* cyrillic capital letter em */
{ 0x041D, 0xEE }, /* cyrillic capital letter en */
{ 0x041E, 0xEF }, /* cyrillic capital letter o */
{ 0x041F, 0xF0 }, /* cyrillic capital letter pe */
{ 0x0420, 0xF2 }, /* cyrillic capital letter er */
{ 0x0421, 0xF3 }, /* cyrillic capital letter es */
{ 0x0422, 0xF4 }, /* cyrillic capital letter te */
{ 0x0423, 0xF5 }, /* cyrillic capital letter u */
{ 0x0424, 0xE6 }, /* cyrillic capital letter ef */
{ 0x0425, 0xE8 }, /* cyrillic capital letter ha */
{ 0x0426, 0xE3 }, /* cyrillic capital letter tse */
{ 0x0427, 0xFE }, /* cyrillic capital letter che */
{ 0x0428, 0xFB }, /* cyrillic capital letter sha */
{ 0x0429, 0xFD }, /* cyrillic capital letter shcha */
{ 0x042A, 0xFF }, /* cyrillic capital letter hard sign */
{ 0x042B, 0xF9 }, /* cyrillic capital letter yeru */
{ 0x042C, 0xF8 }, /* cyrillic capital letter soft sign */
{ 0x042D, 0xFC }, /* cyrillic capital letter e */
{ 0x042E, 0xE0 }, /* cyrillic capital letter yu */
{ 0x042F, 0xF1 }, /* cyrillic capital letter ya */
{ 0x0430, 0xC1 }, /* cyrillic small letter a */
{ 0x0431, 0xC2 }, /* cyrillic small letter be */
{ 0x0432, 0xD7 }, /* cyrillic small letter ve */
{ 0x0433, 0xC7 }, /* cyrillic small letter ghe */
{ 0x0434, 0xC4 }, /* cyrillic small letter de */
{ 0x0435, 0xC5 }, /* cyrillic small letter ie */
{ 0x0436, 0xD6 }, /* cyrillic small letter zhe */
{ 0x0437, 0xDA }, /* cyrillic small letter ze */
{ 0x0438, 0xC9 }, /* cyrillic small letter i */
{ 0x0439, 0xCA }, /* cyrillic small letter short i */
{ 0x043A, 0xCB }, /* cyrillic small letter ka */
{ 0x043B, 0xCC }, /* cyrillic small letter el */
{ 0x043C, 0xCD }, /* cyrillic small letter em */
{ 0x043D, 0xCE }, /* cyrillic small letter en */
{ 0x043E, 0xCF }, /* cyrillic small letter o */
{ 0x043F, 0xD0 }, /* cyrillic small letter pe */
{ 0x0440, 0xD2 }, /* cyrillic small letter er */
{ 0x0441, 0xD3 }, /* cyrillic small letter es */
{ 0x0442, 0xD4 }, /* cyrillic small letter te */
{ 0x0443, 0xD5 }, /* cyrillic small letter u */
{ 0x0444, 0xC6 }, /* cyrillic small letter ef */
{ 0x0445, 0xC8 }, /* cyrillic small letter ha */
{ 0x0446, 0xC3 }, /* cyrillic small letter tse */
{ 0x0447, 0xDE }, /* cyrillic small letter che */
{ 0x0448, 0xDB }, /* cyrillic small letter sha */
{ 0x0449, 0xDD }, /* cyrillic small letter shcha */
{ 0x044A, 0xDF }, /* cyrillic small letter hard sign */
{ 0x044B, 0xD9 }, /* cyrillic small letter yeru */
{ 0x044C, 0xD8 }, /* cyrillic small letter soft sign */
{ 0x044D, 0xDC }, /* cyrillic small letter e */
{ 0x044E, 0xC0 }, /* cyrillic small letter yu */
{ 0x044F, 0xD1 }, /* cyrillic small letter ya */
{ 0x0451, 0xA3 }, /* cyrillic small letter io */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
{ 0x2219, 0x95 }, /* bullet operator */
{ 0x221A, 0x96 }, /* square root */
{ 0x2248, 0x97 }, /* almost equal to */
{ 0x2264, 0x98 }, /* less-than or equal to */
{ 0x2265, 0x99 }, /* greater-than or equal to */
{ 0x2320, 0x93 }, /* top half integral */
{ 0x2321, 0x9B }, /* bottom half integral */
{ 0x2500, 0x80 }, /* box drawings light horizontal */
{ 0x2502, 0x81 }, /* box drawings light vertical */
{ 0x250C, 0x82 }, /* box drawings light down and right */
{ 0x2510, 0x83 }, /* box drawings light down and left */
{ 0x2514, 0x84 }, /* box drawings light up and right */
{ 0x2518, 0x85 }, /* box drawings light up and left */
{ 0x251C, 0x86 }, /* box drawings light vertical and right */
{ 0x2524, 0x87 }, /* box drawings light vertical and left */
{ 0x252C, 0x88 }, /* box drawings light down and horizontal */
{ 0x2534, 0x89 }, /* box drawings light up and horizontal */
{ 0x253C, 0x8A }, /* box drawings light vertical and horizontal */
{ 0x2550, 0xA0 }, /* box drawings double horizontal */
{ 0x2551, 0xA1 }, /* box drawings double vertical */
{ 0x2552, 0xA2 }, /* box drawings down single and right double */
{ 0x2553, 0xA4 }, /* box drawings down double and right single */
{ 0x2554, 0xA5 }, /* box drawings double down and right */
{ 0x2555, 0xA6 }, /* box drawings down single and left double */
{ 0x2556, 0xA7 }, /* box drawings down double and left single */
{ 0x2557, 0xA8 }, /* box drawings double down and left */
{ 0x2558, 0xA9 }, /* box drawings up single and right double */
{ 0x2559, 0xAA }, /* box drawings up double and right single */
{ 0x255A, 0xAB }, /* box drawings double up and right */
{ 0x255B, 0xAC }, /* box drawings up single and left double */
{ 0x255C, 0xAD }, /* box drawings up double and left single */
{ 0x255D, 0xAE }, /* box drawings double up and left */
{ 0x255E, 0xAF }, /* box drawings vertical single and right double */
{ 0x255F, 0xB0 }, /* box drawings vertical double and right single */
{ 0x2560, 0xB1 }, /* box drawings double vertical and right */
{ 0x2561, 0xB2 }, /* box drawings vertical single and left double */
{ 0x2562, 0xB4 }, /* box drawings vertical double and left single */
{ 0x2563, 0xB5 }, /* box drawings double vertical and left */
{ 0x2564, 0xB6 }, /* box drawings down single and horizontal double */
{ 0x2565, 0xB7 }, /* box drawings down double and horizontal single */
{ 0x2566, 0xB8 }, /* box drawings double down and horizontal */
{ 0x2567, 0xB9 }, /* box drawings up single and horizontal double */
{ 0x2568, 0xBA }, /* box drawings up double and horizontal single */
{ 0x2569, 0xBB }, /* box drawings double up and horizontal */
{ 0x256A, 0xBC }, /* box drawings vertical single and horizontal double */
{ 0x256B, 0xBD }, /* box drawings vertical double and horizontal single */
{ 0x256C, 0xBE }, /* box drawings double vertical and horizontal */
{ 0x2580, 0x8B }, /* upper half block */
{ 0x2584, 0x8C }, /* lower half block */
{ 0x2588, 0x8D }, /* full block */
{ 0x258C, 0x8E }, /* left half block */
{ 0x2590, 0x8F }, /* right half block */
{ 0x2591, 0x90 }, /* light shade */
{ 0x2592, 0x91 }, /* medium shade */
{ 0x2593, 0x92 }, /* dark shade */
{ 0x25A0, 0x94 }, /* black square */
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* {{{ end of mappings *from* Unicode for KOI8-R */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* {{{ Mappings *from* Unicode for CP-866 */
static const uni_to_enc unimap_cp866[] = {
{ 0x00A0, 0xFF }, /* no-break space */
{ 0x00A4, 0xFD }, /* currency sign */
{ 0x00B0, 0xF8 }, /* degree sign */
{ 0x00B7, 0xFA }, /* middle dot */
{ 0x0401, 0xF0 }, /* cyrillic capital letter io */
{ 0x0404, 0xF2 }, /* cyrillic capital letter ukrainian ie */
{ 0x0407, 0xF4 }, /* cyrillic capital letter yi */
{ 0x040E, 0xF6 }, /* cyrillic capital letter short u */
{ 0x0410, 0x80 }, /* cyrillic capital letter a */
{ 0x0411, 0x81 }, /* cyrillic capital letter be */
{ 0x0412, 0x82 }, /* cyrillic capital letter ve */
{ 0x0413, 0x83 }, /* cyrillic capital letter ghe */
{ 0x0414, 0x84 }, /* cyrillic capital letter de */
{ 0x0415, 0x85 }, /* cyrillic capital letter ie */
{ 0x0416, 0x86 }, /* cyrillic capital letter zhe */
{ 0x0417, 0x87 }, /* cyrillic capital letter ze */
{ 0x0418, 0x88 }, /* cyrillic capital letter i */
{ 0x0419, 0x89 }, /* cyrillic capital letter short i */
{ 0x041A, 0x8A }, /* cyrillic capital letter ka */
{ 0x041B, 0x8B }, /* cyrillic capital letter el */
{ 0x041C, 0x8C }, /* cyrillic capital letter em */
{ 0x041D, 0x8D }, /* cyrillic capital letter en */
{ 0x041E, 0x8E }, /* cyrillic capital letter o */
{ 0x041F, 0x8F }, /* cyrillic capital letter pe */
{ 0x0420, 0x90 }, /* cyrillic capital letter er */
{ 0x0421, 0x91 }, /* cyrillic capital letter es */
{ 0x0422, 0x92 }, /* cyrillic capital letter te */
{ 0x0423, 0x93 }, /* cyrillic capital letter u */
{ 0x0424, 0x94 }, /* cyrillic capital letter ef */
{ 0x0425, 0x95 }, /* cyrillic capital letter ha */
{ 0x0426, 0x96 }, /* cyrillic capital letter tse */
{ 0x0427, 0x97 }, /* cyrillic capital letter che */
{ 0x0428, 0x98 }, /* cyrillic capital letter sha */
{ 0x0429, 0x99 }, /* cyrillic capital letter shcha */
{ 0x042A, 0x9A }, /* cyrillic capital letter hard sign */
{ 0x042B, 0x9B }, /* cyrillic capital letter yeru */
{ 0x042C, 0x9C }, /* cyrillic capital letter soft sign */
{ 0x042D, 0x9D }, /* cyrillic capital letter e */
{ 0x042E, 0x9E }, /* cyrillic capital letter yu */
{ 0x042F, 0x9F }, /* cyrillic capital letter ya */
{ 0x0430, 0xA0 }, /* cyrillic small letter a */
{ 0x0431, 0xA1 }, /* cyrillic small letter be */
{ 0x0432, 0xA2 }, /* cyrillic small letter ve */
{ 0x0433, 0xA3 }, /* cyrillic small letter ghe */
{ 0x0434, 0xA4 }, /* cyrillic small letter de */
{ 0x0435, 0xA5 }, /* cyrillic small letter ie */
{ 0x0436, 0xA6 }, /* cyrillic small letter zhe */
{ 0x0437, 0xA7 }, /* cyrillic small letter ze */
{ 0x0438, 0xA8 }, /* cyrillic small letter i */
{ 0x0439, 0xA9 }, /* cyrillic small letter short i */
{ 0x043A, 0xAA }, /* cyrillic small letter ka */
{ 0x043B, 0xAB }, /* cyrillic small letter el */
{ 0x043C, 0xAC }, /* cyrillic small letter em */
{ 0x043D, 0xAD }, /* cyrillic small letter en */
{ 0x043E, 0xAE }, /* cyrillic small letter o */
{ 0x043F, 0xAF }, /* cyrillic small letter pe */
{ 0x0440, 0xE0 }, /* cyrillic small letter er */
{ 0x0441, 0xE1 }, /* cyrillic small letter es */
{ 0x0442, 0xE2 }, /* cyrillic small letter te */
{ 0x0443, 0xE3 }, /* cyrillic small letter u */
{ 0x0444, 0xE4 }, /* cyrillic small letter ef */
{ 0x0445, 0xE5 }, /* cyrillic small letter ha */
{ 0x0446, 0xE6 }, /* cyrillic small letter tse */
{ 0x0447, 0xE7 }, /* cyrillic small letter che */
{ 0x0448, 0xE8 }, /* cyrillic small letter sha */
{ 0x0449, 0xE9 }, /* cyrillic small letter shcha */
{ 0x044A, 0xEA }, /* cyrillic small letter hard sign */
{ 0x044B, 0xEB }, /* cyrillic small letter yeru */
{ 0x044C, 0xEC }, /* cyrillic small letter soft sign */
{ 0x044D, 0xED }, /* cyrillic small letter e */
{ 0x044E, 0xEE }, /* cyrillic small letter yu */
{ 0x044F, 0xEF }, /* cyrillic small letter ya */
{ 0x0451, 0xF1 }, /* cyrillic small letter io */
{ 0x0454, 0xF3 }, /* cyrillic small letter ukrainian ie */
{ 0x0457, 0xF5 }, /* cyrillic small letter yi */
{ 0x045E, 0xF7 }, /* cyrillic small letter short u */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
{ 0x2116, 0xFC }, /* numero sign */
{ 0x2219, 0xF9 }, /* bullet operator */
{ 0x221A, 0xFB }, /* square root */
{ 0x2500, 0xC4 }, /* box drawings light horizontal */
{ 0x2502, 0xB3 }, /* box drawings light vertical */
{ 0x250C, 0xDA }, /* box drawings light down and right */
{ 0x2510, 0xBF }, /* box drawings light down and left */
{ 0x2514, 0xC0 }, /* box drawings light up and right */
{ 0x2518, 0xD9 }, /* box drawings light up and left */
{ 0x251C, 0xC3 }, /* box drawings light vertical and right */
{ 0x2524, 0xB4 }, /* box drawings light vertical and left */
{ 0x252C, 0xC2 }, /* box drawings light down and horizontal */
{ 0x2534, 0xC1 }, /* box drawings light up and horizontal */
{ 0x253C, 0xC5 }, /* box drawings light vertical and horizontal */
{ 0x2550, 0xCD }, /* box drawings double horizontal */
{ 0x2551, 0xBA }, /* box drawings double vertical */
{ 0x2552, 0xD5 }, /* box drawings down single and right double */
{ 0x2553, 0xD6 }, /* box drawings down double and right single */
{ 0x2554, 0xC9 }, /* box drawings double down and right */
{ 0x2555, 0xB8 }, /* box drawings down single and left double */
{ 0x2556, 0xB7 }, /* box drawings down double and left single */
{ 0x2557, 0xBB }, /* box drawings double down and left */
{ 0x2558, 0xD4 }, /* box drawings up single and right double */
{ 0x2559, 0xD3 }, /* box drawings up double and right single */
{ 0x255A, 0xC8 }, /* box drawings double up and right */
{ 0x255B, 0xBE }, /* box drawings up single and left double */
{ 0x255C, 0xBD }, /* box drawings up double and left single */
{ 0x255D, 0xBC }, /* box drawings double up and left */
{ 0x255E, 0xC6 }, /* box drawings vertical single and right double */
{ 0x255F, 0xC7 }, /* box drawings vertical double and right single */
{ 0x2560, 0xCC }, /* box drawings double vertical and right */
{ 0x2561, 0xB5 }, /* box drawings vertical single and left double */
{ 0x2562, 0xB6 }, /* box drawings vertical double and left single */
{ 0x2563, 0xB9 }, /* box drawings double vertical and left */
{ 0x2564, 0xD1 }, /* box drawings down single and horizontal double */
{ 0x2565, 0xD2 }, /* box drawings down double and horizontal single */
{ 0x2566, 0xCB }, /* box drawings double down and horizontal */
{ 0x2567, 0xCF }, /* box drawings up single and horizontal double */
{ 0x2568, 0xD0 }, /* box drawings up double and horizontal single */
{ 0x2569, 0xCA }, /* box drawings double up and horizontal */
{ 0x256A, 0xD8 }, /* box drawings vertical single and horizontal double */
{ 0x256B, 0xD7 }, /* box drawings vertical double and horizontal single */
{ 0x256C, 0xCE }, /* box drawings double vertical and horizontal */
{ 0x2580, 0xDF }, /* upper half block */
{ 0x2584, 0xDC }, /* lower half block */
{ 0x2588, 0xDB }, /* full block */
{ 0x258C, 0xDD }, /* left half block */
{ 0x2590, 0xDE }, /* right half block */
{ 0x2591, 0xB0 }, /* light shade */
{ 0x2592, 0xB1 }, /* medium shade */
{ 0x2593, 0xB2 }, /* dark shade */
{ 0x25A0, 0xFE }, /* black square */
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* {{{ end of mappings *from* Unicode for CP-866 */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* {{{ Mappings *from* Unicode for MacRoman */
static const uni_to_enc unimap_macroman[] = {
{ 0x00A0, 0xCA }, /* no-break space */
{ 0x00A1, 0xC1 }, /* inverted exclamation mark */
{ 0x00A2, 0xA2 }, /* cent sign */
{ 0x00A3, 0xA3 }, /* pound sign */
{ 0x00A5, 0xB4 }, /* yen sign */
{ 0x00A7, 0xA4 }, /* section sign */
{ 0x00A8, 0xAC }, /* diaeresis */
{ 0x00A9, 0xA9 }, /* copyright sign */
{ 0x00AA, 0xBB }, /* feminine ordinal indicator */
{ 0x00AB, 0xC7 }, /* left-pointing double angle quotation mark */
{ 0x00AC, 0xC2 }, /* not sign */
{ 0x00AE, 0xA8 }, /* registered sign */
{ 0x00AF, 0xF8 }, /* macron */
{ 0x00B0, 0xA1 }, /* degree sign */
{ 0x00B1, 0xB1 }, /* plus-minus sign */
{ 0x00B4, 0xAB }, /* acute accent */
{ 0x00B5, 0xB5 }, /* micro sign */
{ 0x00B6, 0xA6 }, /* pilcrow sign */
{ 0x00B7, 0xE1 }, /* middle dot */
{ 0x00B8, 0xFC }, /* cedilla */
{ 0x00BA, 0xBC }, /* masculine ordinal indicator */
{ 0x00BB, 0xC8 }, /* right-pointing double angle quotation mark */
{ 0x00BF, 0xC0 }, /* inverted question mark */
{ 0x00C0, 0xCB }, /* latin capital letter a with grave */
{ 0x00C1, 0xE7 }, /* latin capital letter a with acute */
{ 0x00C2, 0xE5 }, /* latin capital letter a with circumflex */
{ 0x00C3, 0xCC }, /* latin capital letter a with tilde */
{ 0x00C4, 0x80 }, /* latin capital letter a with diaeresis */
{ 0x00C5, 0x81 }, /* latin capital letter a with ring above */
{ 0x00C6, 0xAE }, /* latin capital letter ae */
{ 0x00C7, 0x82 }, /* latin capital letter c with cedilla */
{ 0x00C8, 0xE9 }, /* latin capital letter e with grave */
{ 0x00C9, 0x83 }, /* latin capital letter e with acute */
{ 0x00CA, 0xE6 }, /* latin capital letter e with circumflex */
{ 0x00CB, 0xE8 }, /* latin capital letter e with diaeresis */
{ 0x00CC, 0xED }, /* latin capital letter i with grave */
{ 0x00CD, 0xEA }, /* latin capital letter i with acute */
{ 0x00CE, 0xEB }, /* latin capital letter i with circumflex */
{ 0x00CF, 0xEC }, /* latin capital letter i with diaeresis */
{ 0x00D1, 0x84 }, /* latin capital letter n with tilde */
{ 0x00D2, 0xF1 }, /* latin capital letter o with grave */
{ 0x00D3, 0xEE }, /* latin capital letter o with acute */
{ 0x00D4, 0xEF }, /* latin capital letter o with circumflex */
{ 0x00D5, 0xCD }, /* latin capital letter o with tilde */
{ 0x00D6, 0x85 }, /* latin capital letter o with diaeresis */
{ 0x00D8, 0xAF }, /* latin capital letter o with stroke */
{ 0x00D9, 0xF4 }, /* latin capital letter u with grave */
{ 0x00DA, 0xF2 }, /* latin capital letter u with acute */
{ 0x00DB, 0xF3 }, /* latin capital letter u with circumflex */
{ 0x00DC, 0x86 }, /* latin capital letter u with diaeresis */
{ 0x00DF, 0xA7 }, /* latin small letter sharp s */
{ 0x00E0, 0x88 }, /* latin small letter a with grave */
{ 0x00E1, 0x87 }, /* latin small letter a with acute */
{ 0x00E2, 0x89 }, /* latin small letter a with circumflex */
{ 0x00E3, 0x8B }, /* latin small letter a with tilde */
{ 0x00E4, 0x8A }, /* latin small letter a with diaeresis */
{ 0x00E5, 0x8C }, /* latin small letter a with ring above */
{ 0x00E6, 0xBE }, /* latin small letter ae */
{ 0x00E7, 0x8D }, /* latin small letter c with cedilla */
{ 0x00E8, 0x8F }, /* latin small letter e with grave */
{ 0x00E9, 0x8E }, /* latin small letter e with acute */
{ 0x00EA, 0x90 }, /* latin small letter e with circumflex */
{ 0x00EB, 0x91 }, /* latin small letter e with diaeresis */
{ 0x00EC, 0x93 }, /* latin small letter i with grave */
{ 0x00ED, 0x92 }, /* latin small letter i with acute */
{ 0x00EE, 0x94 }, /* latin small letter i with circumflex */
{ 0x00EF, 0x95 }, /* latin small letter i with diaeresis */
{ 0x00F1, 0x96 }, /* latin small letter n with tilde */
{ 0x00F2, 0x98 }, /* latin small letter o with grave */
{ 0x00F3, 0x97 }, /* latin small letter o with acute */
{ 0x00F4, 0x99 }, /* latin small letter o with circumflex */
{ 0x00F5, 0x9B }, /* latin small letter o with tilde */
{ 0x00F6, 0x9A }, /* latin small letter o with diaeresis */
{ 0x00F7, 0xD6 }, /* division sign */
{ 0x00F8, 0xBF }, /* latin small letter o with stroke */
{ 0x00F9, 0x9D }, /* latin small letter u with grave */
{ 0x00FA, 0x9C }, /* latin small letter u with acute */
{ 0x00FB, 0x9E }, /* latin small letter u with circumflex */
{ 0x00FC, 0x9F }, /* latin small letter u with diaeresis */
{ 0x00FF, 0xD8 }, /* latin small letter y with diaeresis */
{ 0x0131, 0xF5 }, /* latin small letter dotless i */
{ 0x0152, 0xCE }, /* latin capital ligature oe */
{ 0x0153, 0xCF }, /* latin small ligature oe */
{ 0x0178, 0xD9 }, /* latin capital letter y with diaeresis */
{ 0x0192, 0xC4 }, /* latin small letter f with hook */
{ 0x02C6, 0xF6 }, /* modifier letter circumflex accent */
{ 0x02C7, 0xFF }, /* caron */
{ 0x02D8, 0xF9 }, /* breve */
{ 0x02D9, 0xFA }, /* dot above */
{ 0x02DA, 0xFB }, /* ring above */
{ 0x02DB, 0xFE }, /* ogonek */
{ 0x02DC, 0xF7 }, /* small tilde */
{ 0x02DD, 0xFD }, /* double acute accent */
{ 0x03A9, 0xBD }, /* greek capital letter omega */
{ 0x03C0, 0xB9 }, /* greek small letter pi */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
{ 0x2013, 0xD0 }, /* en dash */
{ 0x2014, 0xD1 }, /* em dash */
{ 0x2018, 0xD4 }, /* left single quotation mark */
{ 0x2019, 0xD5 }, /* right single quotation mark */
{ 0x201A, 0xE2 }, /* single low-9 quotation mark */
{ 0x201C, 0xD2 }, /* left double quotation mark */
{ 0x201D, 0xD3 }, /* right double quotation mark */
{ 0x201E, 0xE3 }, /* double low-9 quotation mark */
{ 0x2020, 0xA0 }, /* dagger */
{ 0x2021, 0xE0 }, /* double dagger */
{ 0x2022, 0xA5 }, /* bullet */
{ 0x2026, 0xC9 }, /* horizontal ellipsis */
{ 0x2030, 0xE4 }, /* per mille sign */
{ 0x2039, 0xDC }, /* single left-pointing angle quotation mark */
{ 0x203A, 0xDD }, /* single right-pointing angle quotation mark */
{ 0x2044, 0xDA }, /* fraction slash */
{ 0x20AC, 0xDB }, /* euro sign */
{ 0x2122, 0xAA }, /* trade mark sign */
{ 0x2202, 0xB6 }, /* partial differential */
{ 0x2206, 0xC6 }, /* increment */
{ 0x220F, 0xB8 }, /* n-ary product */
{ 0x2211, 0xB7 }, /* n-ary summation */
{ 0x221A, 0xC3 }, /* square root */
{ 0x221E, 0xB0 }, /* infinity */
{ 0x222B, 0xBA }, /* integral */
{ 0x2248, 0xC5 }, /* almost equal to */
{ 0x2260, 0xAD }, /* not equal to */
{ 0x2264, 0xB2 }, /* less-than or equal to */
{ 0x2265, 0xB3 }, /* greater-than or equal to */
{ 0x25CA, 0xD7 }, /* lozenge */
{ 0xF8FF, 0xF0 }, /* apple logo */
{ 0xFB01, 0xDE }, /* latin small ligature fi */
{ 0xFB02, 0xDF }, /* latin small ligature fl */
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* {{{ end of mappings *from* Unicode for MacRoman */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* HTML 5 has many more named entities.
* Some of them map to two unicode code points, not one.
* We're going to use a three-stage table (with an extra one for the entities
* with two code points). */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
#define ENT_STAGE1_INDEX(k) (((k) & 0xFFF000) >> 12) /* > 1D, we have no mapping */
#define ENT_STAGE2_INDEX(k) (((k) & 0xFC0) >> 6)
#define ENT_STAGE3_INDEX(k) ((k) & 0x3F)
#define ENT_CODE_POINT_FROM_STAGES(i,j,k) (((i) << 12) | ((j) << 6) | (k))
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* Table should be organized with a leading row telling the size of
* the table and the default entity (maybe NULL) and the rest being
* normal rows ordered by code point so that we can do a binary search */
typedef union {
struct {
unsigned size; /* number of remaining entries in the table */
const char *default_entity;
unsigned short default_entity_len;
} leading_entry;
struct {
unsigned second_cp; /* second code point */
const char *entity;
unsigned short entity_len;
} normal_entry;
} entity_multicodepoint_row;
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* blocks of these should start at code points k where k % 0xFC0 == 0 */
typedef struct {
char ambiguous; /* if 0 look into entity */
union {
struct {
const char *entity; /* may be NULL */
unsigned short entity_len;
} ent;
const entity_multicodepoint_row *multicodepoint_table;
} data;
} entity_stage3_row;
/* Calculate k & 0x3F Use as offset */
typedef const entity_stage3_row *entity_stage2_row; /* 64 elements */
/* Calculate k & 0xFC0 >> 6. Use as offset */
typedef const entity_stage3_row *const *entity_stage1_row; /* 64 elements */
/* For stage 1, Calculate k & 0xFFF000 >> 3*4.
* If larger than 1D, we have no mapping. Otherwise lookup that index */
typedef struct {
const entity_stage1_row *ms_table;
/* for tables with only basic entities, this member is to be accessed
* directly for better performance: */
const entity_stage3_row *table;
} entity_table_opt;
/* Replaced "GT" > "gt" and "QUOT" > "quot" for consistency's sake. */
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* {{{ Start of HTML5 multi-stage table for codepoint -> entity */
/* {{{ Start of double code point tables for HTML5 */
static const entity_multicodepoint_row multi_cp_html5_0003C[] = {
2011-08-31 13:47:19 +08:00
{ {01, "lt", 2} },
{ {0x020D2, "nvlt", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0003D[] = {
2011-08-31 13:47:19 +08:00
{ {01, "equals", 6} },
{ {0x020E5, "bne", 3} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0003E[] = {
2011-08-31 13:47:19 +08:00
{ {01, "gt", 2} },
{ {0x020D2, "nvgt", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_00066[] = {
2011-08-31 13:47:19 +08:00
{ {01, NULL , 0} },
{ {0x0006A, "fjlig", 5} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0205F[] = {
2011-08-31 13:47:19 +08:00
{ {01, "MediumSpace", 11} },
{ {0x0200A, "ThickSpace", 10} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0219D[] = {
2011-08-31 13:47:19 +08:00
{ {01, "rarrw", 5} },
{ {0x00338, "nrarrw", 6} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02202[] = {
2011-08-31 13:47:19 +08:00
{ {01, "part", 4} },
{ {0x00338, "npart", 5} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02220[] = {
2011-08-31 13:47:19 +08:00
{ {01, "angle", 5} },
{ {0x020D2, "nang", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02229[] = {
2011-08-31 13:47:19 +08:00
{ {01, "cap", 3} },
{ {0x0FE00, "caps", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0222A[] = {
2011-08-31 13:47:19 +08:00
{ {01, "cup", 3} },
{ {0x0FE00, "cups", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0223C[] = {
2011-08-31 13:47:19 +08:00
{ {01, "sim", 3} },
{ {0x020D2, "nvsim", 5} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0223D[] = {
2011-08-31 13:47:19 +08:00
{ {01, "bsim", 4} },
{ {0x00331, "race", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0223E[] = {
2011-08-31 13:47:19 +08:00
{ {01, "ac", 2} },
{ {0x00333, "acE", 3} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02242[] = {
2011-08-31 13:47:19 +08:00
{ {01, "esim", 4} },
{ {0x00338, "nesim", 5} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0224B[] = {
2011-08-31 13:47:19 +08:00
{ {01, "apid", 4} },
{ {0x00338, "napid", 5} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0224D[] = {
2011-08-31 13:47:19 +08:00
{ {01, "CupCap", 6} },
{ {0x020D2, "nvap", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0224E[] = {
2011-08-31 13:47:19 +08:00
{ {01, "bump", 4} },
{ {0x00338, "nbump", 5} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0224F[] = {
2011-08-31 13:47:19 +08:00
{ {01, "HumpEqual", 9} },
{ {0x00338, "nbumpe", 6} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02250[] = {
2011-08-31 13:47:19 +08:00
{ {01, "esdot", 5} },
{ {0x00338, "nedot", 5} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02261[] = {
2011-08-31 13:47:19 +08:00
{ {01, "Congruent", 9} },
{ {0x020E5, "bnequiv", 7} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02264[] = {
2011-08-31 13:47:19 +08:00
{ {01, "leq", 3} },
{ {0x020D2, "nvle", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02265[] = {
2011-08-31 13:47:19 +08:00
{ {01, "ge", 2} },
{ {0x020D2, "nvge", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02266[] = {
2011-08-31 13:47:19 +08:00
{ {01, "lE", 2} },
{ {0x00338, "nlE", 3} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02267[] = {
2011-08-31 13:47:19 +08:00
{ {01, "geqq", 4} },
{ {0x00338, "NotGreaterFullEqual", 19} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02268[] = {
2011-08-31 13:47:19 +08:00
{ {01, "lneqq", 5} },
{ {0x0FE00, "lvertneqq", 9} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02269[] = {
2011-08-31 13:47:19 +08:00
{ {01, "gneqq", 5} },
{ {0x0FE00, "gvertneqq", 9} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0226A[] = {
2011-08-31 13:47:19 +08:00
{ {02, "ll", 2} },
{ {0x00338, "nLtv", 4} },
{ {0x020D2, "nLt", 3} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0226B[] = {
2011-08-31 13:47:19 +08:00
{ {02, "gg", 2} },
{ {0x00338, "NotGreaterGreater", 17} },
{ {0x020D2, "nGt", 3} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0227F[] = {
2011-08-31 13:47:19 +08:00
{ {01, "SucceedsTilde", 13} },
{ {0x00338, "NotSucceedsTilde", 16} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02282[] = {
2011-08-31 13:47:19 +08:00
{ {01, "sub", 3} },
{ {0x020D2, "vnsub", 5} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02283[] = {
2011-08-31 13:47:19 +08:00
{ {01, "sup", 3} },
{ {0x020D2, "nsupset", 7} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0228A[] = {
2011-08-31 13:47:19 +08:00
{ {01, "subsetneq", 9} },
{ {0x0FE00, "vsubne", 6} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0228B[] = {
2011-08-31 13:47:19 +08:00
{ {01, "supsetneq", 9} },
{ {0x0FE00, "vsupne", 6} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_0228F[] = {
2011-08-31 13:47:19 +08:00
{ {01, "sqsub", 5} },
{ {0x00338, "NotSquareSubset", 15} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02290[] = {
2011-08-31 13:47:19 +08:00
{ {01, "sqsupset", 8} },
{ {0x00338, "NotSquareSuperset", 17} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02293[] = {
2011-08-31 13:47:19 +08:00
{ {01, "sqcap", 5} },
{ {0x0FE00, "sqcaps", 6} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02294[] = {
2011-08-31 13:47:19 +08:00
{ {01, "sqcup", 5} },
{ {0x0FE00, "sqcups", 6} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_022B4[] = {
2011-08-31 13:47:19 +08:00
{ {01, "LeftTriangleEqual", 17} },
{ {0x020D2, "nvltrie", 7} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_022B5[] = {
2011-08-31 13:47:19 +08:00
{ {01, "RightTriangleEqual", 18} },
{ {0x020D2, "nvrtrie", 7} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_022D8[] = {
2011-08-31 13:47:19 +08:00
{ {01, "Ll", 2} },
{ {0x00338, "nLl", 3} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_022D9[] = {
2011-08-31 13:47:19 +08:00
{ {01, "Gg", 2} },
{ {0x00338, "nGg", 3} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_022DA[] = {
2011-08-31 13:47:19 +08:00
{ {01, "lesseqgtr", 9} },
{ {0x0FE00, "lesg", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_022DB[] = {
2011-08-31 13:47:19 +08:00
{ {01, "gtreqless", 9} },
{ {0x0FE00, "gesl", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_022F5[] = {
2011-08-31 13:47:19 +08:00
{ {01, "isindot", 7} },
{ {0x00338, "notindot", 8} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_022F9[] = {
2011-08-31 13:47:19 +08:00
{ {01, "isinE", 5} },
{ {0x00338, "notinE", 6} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02933[] = {
2011-08-31 13:47:19 +08:00
{ {01, "rarrc", 5} },
{ {0x00338, "nrarrc", 6} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_029CF[] = {
2011-08-31 13:47:19 +08:00
{ {01, "LeftTriangleBar", 15} },
{ {0x00338, "NotLeftTriangleBar", 18} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_029D0[] = {
2011-08-31 13:47:19 +08:00
{ {01, "RightTriangleBar", 16} },
{ {0x00338, "NotRightTriangleBar", 19} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02A6D[] = {
2011-08-31 13:47:19 +08:00
{ {01, "congdot", 7} },
{ {0x00338, "ncongdot", 8} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02A70[] = {
2011-08-31 13:47:19 +08:00
{ {01, "apE", 3} },
{ {0x00338, "napE", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02A7D[] = {
2011-08-31 13:47:19 +08:00
{ {01, "les", 3} },
{ {0x00338, "nles", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02A7E[] = {
2011-08-31 13:47:19 +08:00
{ {01, "ges", 3} },
{ {0x00338, "nges", 4} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02AA1[] = {
2011-08-31 13:47:19 +08:00
{ {01, "LessLess", 8} },
{ {0x00338, "NotNestedLessLess", 17} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02AA2[] = {
2011-08-31 13:47:19 +08:00
{ {01, "GreaterGreater", 14} },
{ {0x00338, "NotNestedGreaterGreater", 23} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02AAC[] = {
2011-08-31 13:47:19 +08:00
{ {01, "smte", 4} },
{ {0x0FE00, "smtes", 5} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02AAD[] = {
2011-08-31 13:47:19 +08:00
{ {01, "late", 4} },
{ {0x0FE00, "lates", 5} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02AAF[] = {
2011-08-31 13:47:19 +08:00
{ {01, "preceq", 6} },
{ {0x00338, "NotPrecedesEqual", 16} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02AB0[] = {
2011-08-31 13:47:19 +08:00
{ {01, "SucceedsEqual", 13} },
{ {0x00338, "NotSucceedsEqual", 16} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02AC5[] = {
2011-08-31 13:47:19 +08:00
{ {01, "subE", 4} },
{ {0x00338, "nsubE", 5} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02AC6[] = {
2011-08-31 13:47:19 +08:00
{ {01, "supseteqq", 9} },
{ {0x00338, "nsupseteqq", 10} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02ACB[] = {
2011-08-31 13:47:19 +08:00
{ {01, "subsetneqq", 10} },
{ {0x0FE00, "vsubnE", 6} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02ACC[] = {
2011-08-31 13:47:19 +08:00
{ {01, "supnE", 5} },
{ {0x0FE00, "varsupsetneqq", 13} },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_multicodepoint_row multi_cp_html5_02AFD[] = {
2011-08-31 13:47:19 +08:00
{ {01, NULL , 0} },
{ {0x0FE00, "varsupsetneqq", 13} },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
/* End of double code point tables }}} */
/* {{{ Stage 3 Tables for HTML5 */
static const entity_stage3_row empty_stage3_table[] = {
/* 64 elements */
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_00000[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"Tab", 3} } }, {0, { {"NewLine", 7} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"excl", 4} } }, {0, { {"quot", 4} } }, {0, { {"num", 3} } },
{0, { {"dollar", 6} } }, {0, { {"percnt", 6} } }, {0, { {"amp", 3} } }, {0, { {"apos", 4} } },
{0, { {"lpar", 4} } }, {0, { {"rpar", 4} } }, {0, { {"ast", 3} } }, {0, { {"plus", 4} } },
{0, { {"comma", 5} } }, {0, { {NULL, 0} } }, {0, { {"period", 6} } }, {0, { {"sol", 3} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"colon", 5} } }, {0, { {"semi", 4} } },
{1, { {(void *)multi_cp_html5_0003C} } }, {1, { {(void *)multi_cp_html5_0003D} } }, {1, { {(void *)multi_cp_html5_0003E} } }, {0, { {"quest", 5} } },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
static const entity_stage3_row stage3_table_html5_00040[] = {
2011-08-31 13:47:19 +08:00
{0, { {"commat", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"lbrack", 6} } },
{0, { {"bsol", 4} } }, {0, { {"rsqb", 4} } }, {0, { {"Hat", 3} } }, {0, { {"lowbar", 6} } },
{0, { {"grave", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {1, { {(void *)multi_cp_html5_00066} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"lbrace", 6} } },
{0, { {"vert", 4} } }, {0, { {"rcub", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
static const entity_stage3_row stage3_table_html5_00080[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"nbsp", 4} } }, {0, { {"iexcl", 5} } }, {0, { {"cent", 4} } }, {0, { {"pound", 5} } },
{0, { {"curren", 6} } }, {0, { {"yen", 3} } }, {0, { {"brvbar", 6} } }, {0, { {"sect", 4} } },
{0, { {"DoubleDot", 9} } }, {0, { {"copy", 4} } }, {0, { {"ordf", 4} } }, {0, { {"laquo", 5} } },
{0, { {"not", 3} } }, {0, { {"shy", 3} } }, {0, { {"reg", 3} } }, {0, { {"macr", 4} } },
{0, { {"deg", 3} } }, {0, { {"plusmn", 6} } }, {0, { {"sup2", 4} } }, {0, { {"sup3", 4} } },
{0, { {"DiacriticalAcute", 16} } }, {0, { {"micro", 5} } }, {0, { {"para", 4} } }, {0, { {"CenterDot", 9} } },
{0, { {"Cedilla", 7} } }, {0, { {"sup1", 4} } }, {0, { {"ordm", 4} } }, {0, { {"raquo", 5} } },
{0, { {"frac14", 6} } }, {0, { {"half", 4} } }, {0, { {"frac34", 6} } }, {0, { {"iquest", 6} } },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
static const entity_stage3_row stage3_table_html5_000C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {"Agrave", 6} } }, {0, { {"Aacute", 6} } }, {0, { {"Acirc", 5} } }, {0, { {"Atilde", 6} } },
{0, { {"Auml", 4} } }, {0, { {"Aring", 5} } }, {0, { {"AElig", 5} } }, {0, { {"Ccedil", 6} } },
{0, { {"Egrave", 6} } }, {0, { {"Eacute", 6} } }, {0, { {"Ecirc", 5} } }, {0, { {"Euml", 4} } },
{0, { {"Igrave", 6} } }, {0, { {"Iacute", 6} } }, {0, { {"Icirc", 5} } }, {0, { {"Iuml", 4} } },
{0, { {"ETH", 3} } }, {0, { {"Ntilde", 6} } }, {0, { {"Ograve", 6} } }, {0, { {"Oacute", 6} } },
{0, { {"Ocirc", 5} } }, {0, { {"Otilde", 6} } }, {0, { {"Ouml", 4} } }, {0, { {"times", 5} } },
{0, { {"Oslash", 6} } }, {0, { {"Ugrave", 6} } }, {0, { {"Uacute", 6} } }, {0, { {"Ucirc", 5} } },
{0, { {"Uuml", 4} } }, {0, { {"Yacute", 6} } }, {0, { {"THORN", 5} } }, {0, { {"szlig", 5} } },
{0, { {"agrave", 6} } }, {0, { {"aacute", 6} } }, {0, { {"acirc", 5} } }, {0, { {"atilde", 6} } },
{0, { {"auml", 4} } }, {0, { {"aring", 5} } }, {0, { {"aelig", 5} } }, {0, { {"ccedil", 6} } },
{0, { {"egrave", 6} } }, {0, { {"eacute", 6} } }, {0, { {"ecirc", 5} } }, {0, { {"euml", 4} } },
{0, { {"igrave", 6} } }, {0, { {"iacute", 6} } }, {0, { {"icirc", 5} } }, {0, { {"iuml", 4} } },
{0, { {"eth", 3} } }, {0, { {"ntilde", 6} } }, {0, { {"ograve", 6} } }, {0, { {"oacute", 6} } },
{0, { {"ocirc", 5} } }, {0, { {"otilde", 6} } }, {0, { {"ouml", 4} } }, {0, { {"divide", 6} } },
{0, { {"oslash", 6} } }, {0, { {"ugrave", 6} } }, {0, { {"uacute", 6} } }, {0, { {"ucirc", 5} } },
{0, { {"uuml", 4} } }, {0, { {"yacute", 6} } }, {0, { {"thorn", 5} } }, {0, { {"yuml", 4} } },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
static const entity_stage3_row stage3_table_html5_00100[] = {
2011-08-31 13:47:19 +08:00
{0, { {"Amacr", 5} } }, {0, { {"amacr", 5} } }, {0, { {"Abreve", 6} } }, {0, { {"abreve", 6} } },
{0, { {"Aogon", 5} } }, {0, { {"aogon", 5} } }, {0, { {"Cacute", 6} } }, {0, { {"cacute", 6} } },
{0, { {"Ccirc", 5} } }, {0, { {"ccirc", 5} } }, {0, { {"Cdot", 4} } }, {0, { {"cdot", 4} } },
{0, { {"Ccaron", 6} } }, {0, { {"ccaron", 6} } }, {0, { {"Dcaron", 6} } }, {0, { {"dcaron", 6} } },
{0, { {"Dstrok", 6} } }, {0, { {"dstrok", 6} } }, {0, { {"Emacr", 5} } }, {0, { {"emacr", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"Edot", 4} } }, {0, { {"edot", 4} } },
{0, { {"Eogon", 5} } }, {0, { {"eogon", 5} } }, {0, { {"Ecaron", 6} } }, {0, { {"ecaron", 6} } },
{0, { {"Gcirc", 5} } }, {0, { {"gcirc", 5} } }, {0, { {"Gbreve", 6} } }, {0, { {"gbreve", 6} } },
{0, { {"Gdot", 4} } }, {0, { {"gdot", 4} } }, {0, { {"Gcedil", 6} } }, {0, { {NULL, 0} } },
{0, { {"Hcirc", 5} } }, {0, { {"hcirc", 5} } }, {0, { {"Hstrok", 6} } }, {0, { {"hstrok", 6} } },
{0, { {"Itilde", 6} } }, {0, { {"itilde", 6} } }, {0, { {"Imacr", 5} } }, {0, { {"imacr", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"Iogon", 5} } }, {0, { {"iogon", 5} } },
{0, { {"Idot", 4} } }, {0, { {"inodot", 6} } }, {0, { {"IJlig", 5} } }, {0, { {"ijlig", 5} } },
{0, { {"Jcirc", 5} } }, {0, { {"jcirc", 5} } }, {0, { {"Kcedil", 6} } }, {0, { {"kcedil", 6} } },
{0, { {"kgreen", 6} } }, {0, { {"Lacute", 6} } }, {0, { {"lacute", 6} } }, {0, { {"Lcedil", 6} } },
{0, { {"lcedil", 6} } }, {0, { {"Lcaron", 6} } }, {0, { {"lcaron", 6} } }, {0, { {"Lmidot", 6} } },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
static const entity_stage3_row stage3_table_html5_00140[] = {
2011-08-31 13:47:19 +08:00
{0, { {"lmidot", 6} } }, {0, { {"Lstrok", 6} } }, {0, { {"lstrok", 6} } }, {0, { {"Nacute", 6} } },
{0, { {"nacute", 6} } }, {0, { {"Ncedil", 6} } }, {0, { {"ncedil", 6} } }, {0, { {"Ncaron", 6} } },
{0, { {"ncaron", 6} } }, {0, { {"napos", 5} } }, {0, { {"ENG", 3} } }, {0, { {"eng", 3} } },
{0, { {"Omacr", 5} } }, {0, { {"omacr", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"Odblac", 6} } }, {0, { {"odblac", 6} } }, {0, { {"OElig", 5} } }, {0, { {"oelig", 5} } },
{0, { {"Racute", 6} } }, {0, { {"racute", 6} } }, {0, { {"Rcedil", 6} } }, {0, { {"rcedil", 6} } },
{0, { {"Rcaron", 6} } }, {0, { {"rcaron", 6} } }, {0, { {"Sacute", 6} } }, {0, { {"sacute", 6} } },
{0, { {"Scirc", 5} } }, {0, { {"scirc", 5} } }, {0, { {"Scedil", 6} } }, {0, { {"scedil", 6} } },
{0, { {"Scaron", 6} } }, {0, { {"scaron", 6} } }, {0, { {"Tcedil", 6} } }, {0, { {"tcedil", 6} } },
{0, { {"Tcaron", 6} } }, {0, { {"tcaron", 6} } }, {0, { {"Tstrok", 6} } }, {0, { {"tstrok", 6} } },
{0, { {"Utilde", 6} } }, {0, { {"utilde", 6} } }, {0, { {"Umacr", 5} } }, {0, { {"umacr", 5} } },
{0, { {"Ubreve", 6} } }, {0, { {"ubreve", 6} } }, {0, { {"Uring", 5} } }, {0, { {"uring", 5} } },
{0, { {"Udblac", 6} } }, {0, { {"udblac", 6} } }, {0, { {"Uogon", 5} } }, {0, { {"uogon", 5} } },
{0, { {"Wcirc", 5} } }, {0, { {"wcirc", 5} } }, {0, { {"Ycirc", 5} } }, {0, { {"ycirc", 5} } },
{0, { {"Yuml", 4} } }, {0, { {"Zacute", 6} } }, {0, { {"zacute", 6} } }, {0, { {"Zdot", 4} } },
{0, { {"zdot", 4} } }, {0, { {"Zcaron", 6} } }, {0, { {"zcaron", 6} } }, {0, { {NULL, 0} } },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
static const entity_stage3_row stage3_table_html5_00180[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"fnof", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"imped", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
static const entity_stage3_row stage3_table_html5_001C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"gacute", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_00200[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"jmath", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_002C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"circ", 4} } }, {0, { {"Hacek", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"Breve", 5} } }, {0, { {"dot", 3} } }, {0, { {"ring", 4} } }, {0, { {"ogon", 4} } },
{0, { {"DiacriticalTilde", 16} } }, {0, { {"DiacriticalDoubleAcute", 22} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_00300[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"DownBreve", 9} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_00380[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"Alpha", 5} } }, {0, { {"Beta", 4} } }, {0, { {"Gamma", 5} } },
{0, { {"Delta", 5} } }, {0, { {"Epsilon", 7} } }, {0, { {"Zeta", 4} } }, {0, { {"Eta", 3} } },
{0, { {"Theta", 5} } }, {0, { {"Iota", 4} } }, {0, { {"Kappa", 5} } }, {0, { {"Lambda", 6} } },
{0, { {"Mu", 2} } }, {0, { {"Nu", 2} } }, {0, { {"Xi", 2} } }, {0, { {"Omicron", 7} } },
{0, { {"Pi", 2} } }, {0, { {"Rho", 3} } }, {0, { {NULL, 0} } }, {0, { {"Sigma", 5} } },
{0, { {"Tau", 3} } }, {0, { {"Upsilon", 7} } }, {0, { {"Phi", 3} } }, {0, { {"Chi", 3} } },
{0, { {"Psi", 3} } }, {0, { {"Omega", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"alpha", 5} } }, {0, { {"beta", 4} } }, {0, { {"gamma", 5} } },
{0, { {"delta", 5} } }, {0, { {"epsi", 4} } }, {0, { {"zeta", 4} } }, {0, { {"eta", 3} } },
{0, { {"theta", 5} } }, {0, { {"iota", 4} } }, {0, { {"kappa", 5} } }, {0, { {"lambda", 6} } },
{0, { {"mu", 2} } }, {0, { {"nu", 2} } }, {0, { {"xi", 2} } }, {0, { {"omicron", 7} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_003C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {"pi", 2} } }, {0, { {"rho", 3} } }, {0, { {"sigmav", 6} } }, {0, { {"sigma", 5} } },
{0, { {"tau", 3} } }, {0, { {"upsi", 4} } }, {0, { {"phi", 3} } }, {0, { {"chi", 3} } },
{0, { {"psi", 3} } }, {0, { {"omega", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"thetasym", 8} } }, {0, { {"upsih", 5} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"straightphi", 11} } }, {0, { {"piv", 3} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"Gammad", 6} } }, {0, { {"gammad", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"varkappa", 8} } }, {0, { {"rhov", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"straightepsilon", 15} } }, {0, { {"backepsilon", 11} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_00400[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {"IOcy", 4} } }, {0, { {"DJcy", 4} } }, {0, { {"GJcy", 4} } },
{0, { {"Jukcy", 5} } }, {0, { {"DScy", 4} } }, {0, { {"Iukcy", 5} } }, {0, { {"YIcy", 4} } },
{0, { {"Jsercy", 6} } }, {0, { {"LJcy", 4} } }, {0, { {"NJcy", 4} } }, {0, { {"TSHcy", 5} } },
{0, { {"KJcy", 4} } }, {0, { {NULL, 0} } }, {0, { {"Ubrcy", 5} } }, {0, { {"DZcy", 4} } },
{0, { {"Acy", 3} } }, {0, { {"Bcy", 3} } }, {0, { {"Vcy", 3} } }, {0, { {"Gcy", 3} } },
{0, { {"Dcy", 3} } }, {0, { {"IEcy", 4} } }, {0, { {"ZHcy", 4} } }, {0, { {"Zcy", 3} } },
{0, { {"Icy", 3} } }, {0, { {"Jcy", 3} } }, {0, { {"Kcy", 3} } }, {0, { {"Lcy", 3} } },
{0, { {"Mcy", 3} } }, {0, { {"Ncy", 3} } }, {0, { {"Ocy", 3} } }, {0, { {"Pcy", 3} } },
{0, { {"Rcy", 3} } }, {0, { {"Scy", 3} } }, {0, { {"Tcy", 3} } }, {0, { {"Ucy", 3} } },
{0, { {"Fcy", 3} } }, {0, { {"KHcy", 4} } }, {0, { {"TScy", 4} } }, {0, { {"CHcy", 4} } },
{0, { {"SHcy", 4} } }, {0, { {"SHCHcy", 6} } }, {0, { {"HARDcy", 6} } }, {0, { {"Ycy", 3} } },
{0, { {"SOFTcy", 6} } }, {0, { {"Ecy", 3} } }, {0, { {"YUcy", 4} } }, {0, { {"YAcy", 4} } },
{0, { {"acy", 3} } }, {0, { {"bcy", 3} } }, {0, { {"vcy", 3} } }, {0, { {"gcy", 3} } },
{0, { {"dcy", 3} } }, {0, { {"iecy", 4} } }, {0, { {"zhcy", 4} } }, {0, { {"zcy", 3} } },
{0, { {"icy", 3} } }, {0, { {"jcy", 3} } }, {0, { {"kcy", 3} } }, {0, { {"lcy", 3} } },
{0, { {"mcy", 3} } }, {0, { {"ncy", 3} } }, {0, { {"ocy", 3} } }, {0, { {"pcy", 3} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_00440[] = {
2011-08-31 13:47:19 +08:00
{0, { {"rcy", 3} } }, {0, { {"scy", 3} } }, {0, { {"tcy", 3} } }, {0, { {"ucy", 3} } },
{0, { {"fcy", 3} } }, {0, { {"khcy", 4} } }, {0, { {"tscy", 4} } }, {0, { {"chcy", 4} } },
{0, { {"shcy", 4} } }, {0, { {"shchcy", 6} } }, {0, { {"hardcy", 6} } }, {0, { {"ycy", 3} } },
{0, { {"softcy", 6} } }, {0, { {"ecy", 3} } }, {0, { {"yucy", 4} } }, {0, { {"yacy", 4} } },
{0, { {NULL, 0} } }, {0, { {"iocy", 4} } }, {0, { {"djcy", 4} } }, {0, { {"gjcy", 4} } },
{0, { {"jukcy", 5} } }, {0, { {"dscy", 4} } }, {0, { {"iukcy", 5} } }, {0, { {"yicy", 4} } },
{0, { {"jsercy", 6} } }, {0, { {"ljcy", 4} } }, {0, { {"njcy", 4} } }, {0, { {"tshcy", 5} } },
{0, { {"kjcy", 4} } }, {0, { {NULL, 0} } }, {0, { {"ubrcy", 5} } }, {0, { {"dzcy", 4} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02000[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"ensp", 4} } }, {0, { {"emsp", 4} } },
{0, { {"emsp13", 6} } }, {0, { {"emsp14", 6} } }, {0, { {NULL, 0} } }, {0, { {"numsp", 5} } },
{0, { {"puncsp", 6} } }, {0, { {"ThinSpace", 9} } }, {0, { {"hairsp", 6} } }, {0, { {"ZeroWidthSpace", 14} } },
{0, { {"zwnj", 4} } }, {0, { {"zwj", 3} } }, {0, { {"lrm", 3} } }, {0, { {"rlm", 3} } },
{0, { {"hyphen", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"ndash", 5} } },
{0, { {"mdash", 5} } }, {0, { {"horbar", 6} } }, {0, { {"Verbar", 6} } }, {0, { {NULL, 0} } },
{0, { {"OpenCurlyQuote", 14} } }, {0, { {"rsquo", 5} } }, {0, { {"sbquo", 5} } }, {0, { {NULL, 0} } },
{0, { {"OpenCurlyDoubleQuote", 20} } }, {0, { {"rdquo", 5} } }, {0, { {"bdquo", 5} } }, {0, { {NULL, 0} } },
{0, { {"dagger", 6} } }, {0, { {"Dagger", 6} } }, {0, { {"bull", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"nldr", 4} } }, {0, { {"hellip", 6} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"permil", 6} } }, {0, { {"pertenk", 7} } }, {0, { {"prime", 5} } }, {0, { {"Prime", 5} } },
{0, { {"tprime", 6} } }, {0, { {"backprime", 9} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"lsaquo", 6} } }, {0, { {"rsaquo", 6} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"oline", 5} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02040[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {"caret", 5} } }, {0, { {NULL, 0} } }, {0, { {"hybull", 6} } },
{0, { {"frasl", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"bsemi", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"qprime", 6} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {1, { {(void *)multi_cp_html5_0205F} } },
{0, { {"NoBreak", 7} } }, {0, { {"af", 2} } }, {0, { {"InvisibleTimes", 14} } }, {0, { {"ic", 2} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02080[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"euro", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_020C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"TripleDot", 9} } },
{0, { {"DotDot", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02100[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"complexes", 9} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"incare", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"gscr", 4} } }, {0, { {"HilbertSpace", 12} } },
{0, { {"Hfr", 3} } }, {0, { {"Hopf", 4} } }, {0, { {"planckh", 7} } }, {0, { {"planck", 6} } },
{0, { {"imagline", 8} } }, {0, { {"Ifr", 3} } }, {0, { {"lagran", 6} } }, {0, { {"ell", 3} } },
{0, { {NULL, 0} } }, {0, { {"naturals", 8} } }, {0, { {"numero", 6} } }, {0, { {"copysr", 6} } },
{0, { {"wp", 2} } }, {0, { {"primes", 6} } }, {0, { {"rationals", 9} } }, {0, { {"realine", 7} } },
{0, { {"Rfr", 3} } }, {0, { {"Ropf", 4} } }, {0, { {"rx", 2} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"trade", 5} } }, {0, { {NULL, 0} } },
{0, { {"Zopf", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"mho", 3} } },
{0, { {"Zfr", 3} } }, {0, { {"iiota", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"Bscr", 4} } }, {0, { {"Cfr", 3} } }, {0, { {NULL, 0} } }, {0, { {"escr", 4} } },
{0, { {"expectation", 11} } }, {0, { {"Fouriertrf", 10} } }, {0, { {NULL, 0} } }, {0, { {"Mellintrf", 9} } },
{0, { {"orderof", 7} } }, {0, { {"aleph", 5} } }, {0, { {"beth", 4} } }, {0, { {"gimel", 5} } },
{0, { {"daleth", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02140[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"CapitalDifferentialD", 20} } }, {0, { {"DifferentialD", 13} } }, {0, { {"exponentiale", 12} } },
{0, { {"ImaginaryI", 10} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"frac13", 6} } },
{0, { {"frac23", 6} } }, {0, { {"frac15", 6} } }, {0, { {"frac25", 6} } }, {0, { {"frac35", 6} } },
{0, { {"frac45", 6} } }, {0, { {"frac16", 6} } }, {0, { {"frac56", 6} } }, {0, { {"frac18", 6} } },
{0, { {"frac38", 6} } }, {0, { {"frac58", 6} } }, {0, { {"frac78", 6} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
};
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
static const entity_stage3_row stage3_table_html5_02180[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"larr", 4} } }, {0, { {"uarr", 4} } }, {0, { {"srarr", 5} } }, {0, { {"darr", 4} } },
{0, { {"harr", 4} } }, {0, { {"UpDownArrow", 11} } }, {0, { {"nwarrow", 7} } }, {0, { {"UpperRightArrow", 15} } },
{0, { {"LowerRightArrow", 15} } }, {0, { {"swarr", 5} } }, {0, { {"nleftarrow", 10} } }, {0, { {"nrarr", 5} } },
{0, { {NULL, 0} } }, {1, { {(void *)multi_cp_html5_0219D} } }, {0, { {"Larr", 4} } }, {0, { {"Uarr", 4} } },
{0, { {"twoheadrightarrow", 17} } }, {0, { {"Darr", 4} } }, {0, { {"larrtl", 6} } }, {0, { {"rarrtl", 6} } },
{0, { {"LeftTeeArrow", 12} } }, {0, { {"UpTeeArrow", 10} } }, {0, { {"map", 3} } }, {0, { {"DownTeeArrow", 12} } },
{0, { {NULL, 0} } }, {0, { {"larrhk", 6} } }, {0, { {"rarrhk", 6} } }, {0, { {"larrlp", 6} } },
{0, { {"looparrowright", 14} } }, {0, { {"harrw", 5} } }, {0, { {"nleftrightarrow", 15} } }, {0, { {NULL, 0} } },
{0, { {"Lsh", 3} } }, {0, { {"rsh", 3} } }, {0, { {"ldsh", 4} } }, {0, { {"rdsh", 4} } },
{0, { {NULL, 0} } }, {0, { {"crarr", 5} } }, {0, { {"curvearrowleft", 14} } }, {0, { {"curarr", 6} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"olarr", 5} } }, {0, { {"orarr", 5} } },
{0, { {"leftharpoonup", 13} } }, {0, { {"leftharpoondown", 15} } }, {0, { {"RightUpVector", 13} } }, {0, { {"uharl", 5} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_021C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {"rharu", 5} } }, {0, { {"rhard", 5} } }, {0, { {"RightDownVector", 15} } }, {0, { {"dharl", 5} } },
{0, { {"rightleftarrows", 15} } }, {0, { {"udarr", 5} } }, {0, { {"lrarr", 5} } }, {0, { {"llarr", 5} } },
{0, { {"upuparrows", 10} } }, {0, { {"rrarr", 5} } }, {0, { {"downdownarrows", 14} } }, {0, { {"leftrightharpoons", 17} } },
{0, { {"rightleftharpoons", 17} } }, {0, { {"nLeftarrow", 10} } }, {0, { {"nhArr", 5} } }, {0, { {"nrArr", 5} } },
{0, { {"DoubleLeftArrow", 15} } }, {0, { {"DoubleUpArrow", 13} } }, {0, { {"Implies", 7} } }, {0, { {"Downarrow", 9} } },
{0, { {"hArr", 4} } }, {0, { {"Updownarrow", 11} } }, {0, { {"nwArr", 5} } }, {0, { {"neArr", 5} } },
{0, { {"seArr", 5} } }, {0, { {"swArr", 5} } }, {0, { {"lAarr", 5} } }, {0, { {"rAarr", 5} } },
{0, { {NULL, 0} } }, {0, { {"zigrarr", 7} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"LeftArrowBar", 12} } }, {0, { {"RightArrowBar", 13} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"DownArrowUpArrow", 16} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"loarr", 5} } }, {0, { {"roarr", 5} } }, {0, { {"hoarr", 5} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02200[] = {
2011-08-31 13:47:19 +08:00
{0, { {"forall", 6} } }, {0, { {"comp", 4} } }, {1, { {(void *)multi_cp_html5_02202} } }, {0, { {"Exists", 6} } },
{0, { {"nexist", 6} } }, {0, { {"empty", 5} } }, {0, { {NULL, 0} } }, {0, { {"nabla", 5} } },
{0, { {"isinv", 5} } }, {0, { {"notin", 5} } }, {0, { {NULL, 0} } }, {0, { {"ReverseElement", 14} } },
{0, { {"notniva", 7} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"prod", 4} } },
{0, { {"Coproduct", 9} } }, {0, { {"sum", 3} } }, {0, { {"minus", 5} } }, {0, { {"MinusPlus", 9} } },
{0, { {"plusdo", 6} } }, {0, { {NULL, 0} } }, {0, { {"ssetmn", 6} } }, {0, { {"lowast", 6} } },
{0, { {"compfn", 6} } }, {0, { {NULL, 0} } }, {0, { {"Sqrt", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"prop", 4} } }, {0, { {"infin", 5} } }, {0, { {"angrt", 5} } },
{1, { {(void *)multi_cp_html5_02220} } }, {0, { {"angmsd", 6} } }, {0, { {"angsph", 6} } }, {0, { {"mid", 3} } },
{0, { {"nshortmid", 9} } }, {0, { {"shortparallel", 13} } }, {0, { {"nparallel", 9} } }, {0, { {"and", 3} } },
{0, { {"or", 2} } }, {1, { {(void *)multi_cp_html5_02229} } }, {1, { {(void *)multi_cp_html5_0222A} } }, {0, { {"Integral", 8} } },
{0, { {"Int", 3} } }, {0, { {"tint", 4} } }, {0, { {"ContourIntegral", 15} } }, {0, { {"DoubleContourIntegral", 21} } },
{0, { {"Cconint", 7} } }, {0, { {"cwint", 5} } }, {0, { {"cwconint", 8} } }, {0, { {"awconint", 8} } },
{0, { {"there4", 6} } }, {0, { {"Because", 7} } }, {0, { {"ratio", 5} } }, {0, { {"Colon", 5} } },
{0, { {"minusd", 6} } }, {0, { {NULL, 0} } }, {0, { {"mDDot", 5} } }, {0, { {"homtht", 6} } },
{1, { {(void *)multi_cp_html5_0223C} } }, {1, { {(void *)multi_cp_html5_0223D} } }, {1, { {(void *)multi_cp_html5_0223E} } }, {0, { {"acd", 3} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02240[] = {
2011-08-31 13:47:19 +08:00
{0, { {"wr", 2} } }, {0, { {"NotTilde", 8} } }, {1, { {(void *)multi_cp_html5_02242} } }, {0, { {"simeq", 5} } },
{0, { {"nsime", 5} } }, {0, { {"TildeFullEqual", 14} } }, {0, { {"simne", 5} } }, {0, { {"ncong", 5} } },
{0, { {"approx", 6} } }, {0, { {"napprox", 7} } }, {0, { {"ape", 3} } }, {1, { {(void *)multi_cp_html5_0224B} } },
{0, { {"bcong", 5} } }, {1, { {(void *)multi_cp_html5_0224D} } }, {1, { {(void *)multi_cp_html5_0224E} } }, {1, { {(void *)multi_cp_html5_0224F} } },
{1, { {(void *)multi_cp_html5_02250} } }, {0, { {"doteqdot", 8} } }, {0, { {"fallingdotseq", 13} } }, {0, { {"risingdotseq", 12} } },
{0, { {"coloneq", 7} } }, {0, { {"eqcolon", 7} } }, {0, { {"ecir", 4} } }, {0, { {"circeq", 6} } },
{0, { {NULL, 0} } }, {0, { {"wedgeq", 6} } }, {0, { {"veeeq", 5} } }, {0, { {NULL, 0} } },
{0, { {"triangleq", 9} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"equest", 6} } },
{0, { {"NotEqual", 8} } }, {1, { {(void *)multi_cp_html5_02261} } }, {0, { {"NotCongruent", 12} } }, {0, { {NULL, 0} } },
{1, { {(void *)multi_cp_html5_02264} } }, {1, { {(void *)multi_cp_html5_02265} } }, {1, { {(void *)multi_cp_html5_02266} } }, {1, { {(void *)multi_cp_html5_02267} } },
{1, { {(void *)multi_cp_html5_02268} } }, {1, { {(void *)multi_cp_html5_02269} } }, {1, { {(void *)multi_cp_html5_0226A} } }, {1, { {(void *)multi_cp_html5_0226B} } },
{0, { {"between", 7} } }, {0, { {"NotCupCap", 9} } }, {0, { {"NotLess", 7} } }, {0, { {"ngtr", 4} } },
{0, { {"NotLessEqual", 12} } }, {0, { {"ngeq", 4} } }, {0, { {"LessTilde", 9} } }, {0, { {"GreaterTilde", 12} } },
{0, { {"nlsim", 5} } }, {0, { {"ngsim", 5} } }, {0, { {"lessgtr", 7} } }, {0, { {"gl", 2} } },
{0, { {"ntlg", 4} } }, {0, { {"NotGreaterLess", 14} } }, {0, { {"prec", 4} } }, {0, { {"succ", 4} } },
{0, { {"PrecedesSlantEqual", 18} } }, {0, { {"succcurlyeq", 11} } }, {0, { {"precsim", 7} } }, {1, { {(void *)multi_cp_html5_0227F} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02280[] = {
2011-08-31 13:47:19 +08:00
{0, { {"npr", 3} } }, {0, { {"NotSucceeds", 11} } }, {1, { {(void *)multi_cp_html5_02282} } }, {1, { {(void *)multi_cp_html5_02283} } },
{0, { {"nsub", 4} } }, {0, { {"nsup", 4} } }, {0, { {"SubsetEqual", 11} } }, {0, { {"supe", 4} } },
{0, { {"NotSubsetEqual", 14} } }, {0, { {"NotSupersetEqual", 16} } }, {1, { {(void *)multi_cp_html5_0228A} } }, {1, { {(void *)multi_cp_html5_0228B} } },
{0, { {NULL, 0} } }, {0, { {"cupdot", 6} } }, {0, { {"UnionPlus", 9} } }, {1, { {(void *)multi_cp_html5_0228F} } },
{1, { {(void *)multi_cp_html5_02290} } }, {0, { {"SquareSubsetEqual", 17} } }, {0, { {"SquareSupersetEqual", 19} } }, {1, { {(void *)multi_cp_html5_02293} } },
{1, { {(void *)multi_cp_html5_02294} } }, {0, { {"CirclePlus", 10} } }, {0, { {"ominus", 6} } }, {0, { {"CircleTimes", 11} } },
{0, { {"osol", 4} } }, {0, { {"CircleDot", 9} } }, {0, { {"ocir", 4} } }, {0, { {"oast", 4} } },
{0, { {NULL, 0} } }, {0, { {"odash", 5} } }, {0, { {"boxplus", 7} } }, {0, { {"boxminus", 8} } },
{0, { {"timesb", 6} } }, {0, { {"sdotb", 5} } }, {0, { {"vdash", 5} } }, {0, { {"dashv", 5} } },
{0, { {"DownTee", 7} } }, {0, { {"perp", 4} } }, {0, { {NULL, 0} } }, {0, { {"models", 6} } },
{0, { {"DoubleRightTee", 14} } }, {0, { {"Vdash", 5} } }, {0, { {"Vvdash", 6} } }, {0, { {"VDash", 5} } },
{0, { {"nvdash", 6} } }, {0, { {"nvDash", 6} } }, {0, { {"nVdash", 6} } }, {0, { {"nVDash", 6} } },
{0, { {"prurel", 6} } }, {0, { {NULL, 0} } }, {0, { {"vartriangleleft", 15} } }, {0, { {"vrtri", 5} } },
{1, { {(void *)multi_cp_html5_022B4} } }, {1, { {(void *)multi_cp_html5_022B5} } }, {0, { {"origof", 6} } }, {0, { {"imof", 4} } },
{0, { {"mumap", 5} } }, {0, { {"hercon", 6} } }, {0, { {"intcal", 6} } }, {0, { {"veebar", 6} } },
{0, { {NULL, 0} } }, {0, { {"barvee", 6} } }, {0, { {"angrtvb", 7} } }, {0, { {"lrtri", 5} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_022C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {"xwedge", 6} } }, {0, { {"xvee", 4} } }, {0, { {"bigcap", 6} } }, {0, { {"bigcup", 6} } },
{0, { {"diamond", 7} } }, {0, { {"sdot", 4} } }, {0, { {"Star", 4} } }, {0, { {"divonx", 6} } },
{0, { {"bowtie", 6} } }, {0, { {"ltimes", 6} } }, {0, { {"rtimes", 6} } }, {0, { {"lthree", 6} } },
{0, { {"rthree", 6} } }, {0, { {"backsimeq", 9} } }, {0, { {"curlyvee", 8} } }, {0, { {"curlywedge", 10} } },
{0, { {"Sub", 3} } }, {0, { {"Supset", 6} } }, {0, { {"Cap", 3} } }, {0, { {"Cup", 3} } },
{0, { {"pitchfork", 9} } }, {0, { {"epar", 4} } }, {0, { {"lessdot", 7} } }, {0, { {"gtrdot", 6} } },
{1, { {(void *)multi_cp_html5_022D8} } }, {1, { {(void *)multi_cp_html5_022D9} } }, {1, { {(void *)multi_cp_html5_022DA} } }, {1, { {(void *)multi_cp_html5_022DB} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"curlyeqprec", 11} } }, {0, { {"cuesc", 5} } },
{0, { {"NotPrecedesSlantEqual", 21} } }, {0, { {"NotSucceedsSlantEqual", 21} } }, {0, { {"NotSquareSubsetEqual", 20} } }, {0, { {"NotSquareSupersetEqual", 22} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"lnsim", 5} } }, {0, { {"gnsim", 5} } },
{0, { {"precnsim", 8} } }, {0, { {"scnsim", 6} } }, {0, { {"nltri", 5} } }, {0, { {"ntriangleright", 14} } },
{0, { {"nltrie", 6} } }, {0, { {"NotRightTriangleEqual", 21} } }, {0, { {"vellip", 6} } }, {0, { {"ctdot", 5} } },
{0, { {"utdot", 5} } }, {0, { {"dtdot", 5} } }, {0, { {"disin", 5} } }, {0, { {"isinsv", 6} } },
{0, { {"isins", 5} } }, {1, { {(void *)multi_cp_html5_022F5} } }, {0, { {"notinvc", 7} } }, {0, { {"notinvb", 7} } },
{0, { {NULL, 0} } }, {1, { {(void *)multi_cp_html5_022F9} } }, {0, { {"nisd", 4} } }, {0, { {"xnis", 4} } },
{0, { {"nis", 3} } }, {0, { {"notnivc", 7} } }, {0, { {"notnivb", 7} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02300[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"barwed", 6} } }, {0, { {"doublebarwedge", 14} } }, {0, { {NULL, 0} } },
{0, { {"lceil", 5} } }, {0, { {"RightCeiling", 12} } }, {0, { {"LeftFloor", 9} } }, {0, { {"RightFloor", 10} } },
{0, { {"drcrop", 6} } }, {0, { {"dlcrop", 6} } }, {0, { {"urcrop", 6} } }, {0, { {"ulcrop", 6} } },
{0, { {"bnot", 4} } }, {0, { {NULL, 0} } }, {0, { {"profline", 8} } }, {0, { {"profsurf", 8} } },
{0, { {NULL, 0} } }, {0, { {"telrec", 6} } }, {0, { {"target", 6} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"ulcorner", 8} } }, {0, { {"urcorner", 8} } }, {0, { {"llcorner", 8} } }, {0, { {"drcorn", 6} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"frown", 5} } }, {0, { {"smile", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"cylcty", 6} } }, {0, { {"profalar", 8} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"topbot", 6} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"ovbar", 5} } }, {0, { {NULL, 0} } }, {0, { {"solbar", 6} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02340[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"angzarr", 7} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02380[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"lmoust", 6} } }, {0, { {"rmoust", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"OverBracket", 11} } }, {0, { {"bbrk", 4} } }, {0, { {"bbrktbrk", 8} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_023C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"OverParenthesis", 15} } }, {0, { {"UnderParenthesis", 16} } }, {0, { {"OverBrace", 9} } }, {0, { {"UnderBrace", 10} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"trpezium", 8} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"elinters", 8} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02400[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"blank", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_024C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"oS", 2} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02500[] = {
2011-08-31 13:47:19 +08:00
{0, { {"HorizontalLine", 14} } }, {0, { {NULL, 0} } }, {0, { {"boxv", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"boxdr", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"boxdl", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"boxur", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"boxul", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"boxvr", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"boxvl", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"boxhd", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"boxhu", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"boxvh", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02540[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"boxH", 4} } }, {0, { {"boxV", 4} } }, {0, { {"boxdR", 5} } }, {0, { {"boxDr", 5} } },
{0, { {"boxDR", 5} } }, {0, { {"boxdL", 5} } }, {0, { {"boxDl", 5} } }, {0, { {"boxDL", 5} } },
{0, { {"boxuR", 5} } }, {0, { {"boxUr", 5} } }, {0, { {"boxUR", 5} } }, {0, { {"boxuL", 5} } },
{0, { {"boxUl", 5} } }, {0, { {"boxUL", 5} } }, {0, { {"boxvR", 5} } }, {0, { {"boxVr", 5} } },
{0, { {"boxVR", 5} } }, {0, { {"boxvL", 5} } }, {0, { {"boxVl", 5} } }, {0, { {"boxVL", 5} } },
{0, { {"boxHd", 5} } }, {0, { {"boxhD", 5} } }, {0, { {"boxHD", 5} } }, {0, { {"boxHu", 5} } },
{0, { {"boxhU", 5} } }, {0, { {"boxHU", 5} } }, {0, { {"boxvH", 5} } }, {0, { {"boxVh", 5} } },
{0, { {"boxVH", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02580[] = {
2011-08-31 13:47:19 +08:00
{0, { {"uhblk", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"lhblk", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"block", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"blk14", 5} } }, {0, { {"blk12", 5} } }, {0, { {"blk34", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"Square", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"squarf", 6} } }, {0, { {"EmptyVerySmallSquare", 20} } },
{0, { {NULL, 0} } }, {0, { {"rect", 4} } }, {0, { {"marker", 6} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"fltns", 5} } }, {0, { {NULL, 0} } }, {0, { {"bigtriangleup", 13} } },
{0, { {"blacktriangle", 13} } }, {0, { {"triangle", 8} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"blacktriangleright", 18} } }, {0, { {"rtri", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"bigtriangledown", 15} } }, {0, { {"blacktriangledown", 17} } }, {0, { {"triangledown", 12} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_025C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"blacktriangleleft", 17} } }, {0, { {"ltri", 4} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"lozenge", 7} } }, {0, { {"cir", 3} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"tridot", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"bigcirc", 7} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"ultri", 5} } }, {0, { {"urtri", 5} } }, {0, { {"lltri", 5} } }, {0, { {"EmptySmallSquare", 16} } },
{0, { {"FilledSmallSquare", 17} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02600[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"starf", 5} } }, {0, { {"star", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"phone", 5} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02640[] = {
2011-08-31 13:47:19 +08:00
{0, { {"female", 6} } }, {0, { {NULL, 0} } }, {0, { {"male", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"spadesuit", 9} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"clubs", 5} } },
{0, { {NULL, 0} } }, {0, { {"hearts", 6} } }, {0, { {"diamondsuit", 11} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"sung", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"flat", 4} } }, {0, { {"natur", 5} } }, {0, { {"sharp", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02700[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"check", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"cross", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"maltese", 7} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"sext", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02740[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"VerticalSeparator", 17} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"lbbrk", 5} } }, {0, { {"rbbrk", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_027C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"bsolhsub", 8} } }, {0, { {"suphsol", 7} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"LeftDoubleBracket", 17} } }, {0, { {"RightDoubleBracket", 18} } },
{0, { {"langle", 6} } }, {0, { {"RightAngleBracket", 17} } }, {0, { {"Lang", 4} } }, {0, { {"Rang", 4} } },
{0, { {"loang", 5} } }, {0, { {"roang", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"longleftarrow", 13} } }, {0, { {"LongRightArrow", 14} } }, {0, { {"LongLeftRightArrow", 18} } },
{0, { {"xlArr", 5} } }, {0, { {"DoubleLongRightArrow", 20} } }, {0, { {"xhArr", 5} } }, {0, { {NULL, 0} } },
{0, { {"xmap", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"dzigrarr", 8} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02900[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"nvlArr", 6} } }, {0, { {"nvrArr", 6} } },
{0, { {"nvHarr", 6} } }, {0, { {"Map", 3} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"lbarr", 5} } }, {0, { {"bkarow", 6} } }, {0, { {"lBarr", 5} } }, {0, { {"dbkarow", 7} } },
{0, { {"drbkarow", 8} } }, {0, { {"DDotrahd", 8} } }, {0, { {"UpArrowBar", 10} } }, {0, { {"DownArrowBar", 12} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"Rarrtl", 6} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"latail", 6} } }, {0, { {"ratail", 6} } }, {0, { {"lAtail", 6} } },
{0, { {"rAtail", 6} } }, {0, { {"larrfs", 6} } }, {0, { {"rarrfs", 6} } }, {0, { {"larrbfs", 7} } },
{0, { {"rarrbfs", 7} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"nwarhk", 6} } },
{0, { {"nearhk", 6} } }, {0, { {"searhk", 6} } }, {0, { {"swarhk", 6} } }, {0, { {"nwnear", 6} } },
{0, { {"toea", 4} } }, {0, { {"seswar", 6} } }, {0, { {"swnwar", 6} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {1, { {(void *)multi_cp_html5_02933} } },
{0, { {NULL, 0} } }, {0, { {"cudarrr", 7} } }, {0, { {"ldca", 4} } }, {0, { {"rdca", 4} } },
{0, { {"cudarrl", 7} } }, {0, { {"larrpl", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"curarrm", 7} } }, {0, { {"cularrp", 7} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02940[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"rarrpl", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"harrcir", 7} } }, {0, { {"Uarrocir", 8} } }, {0, { {"lurdshar", 8} } }, {0, { {"ldrushar", 8} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"LeftRightVector", 15} } }, {0, { {"RightUpDownVector", 17} } },
{0, { {"DownLeftRightVector", 19} } }, {0, { {"LeftUpDownVector", 16} } }, {0, { {"LeftVectorBar", 13} } }, {0, { {"RightVectorBar", 14} } },
{0, { {"RightUpVectorBar", 16} } }, {0, { {"RightDownVectorBar", 18} } }, {0, { {"DownLeftVectorBar", 17} } }, {0, { {"DownRightVectorBar", 18} } },
{0, { {"LeftUpVectorBar", 15} } }, {0, { {"LeftDownVectorBar", 17} } }, {0, { {"LeftTeeVector", 13} } }, {0, { {"RightTeeVector", 14} } },
{0, { {"RightUpTeeVector", 16} } }, {0, { {"RightDownTeeVector", 18} } }, {0, { {"DownLeftTeeVector", 17} } }, {0, { {"DownRightTeeVector", 18} } },
{0, { {"LeftUpTeeVector", 15} } }, {0, { {"LeftDownTeeVector", 17} } }, {0, { {"lHar", 4} } }, {0, { {"uHar", 4} } },
{0, { {"rHar", 4} } }, {0, { {"dHar", 4} } }, {0, { {"luruhar", 7} } }, {0, { {"ldrdhar", 7} } },
{0, { {"ruluhar", 7} } }, {0, { {"rdldhar", 7} } }, {0, { {"lharul", 6} } }, {0, { {"llhard", 6} } },
{0, { {"rharul", 6} } }, {0, { {"lrhard", 6} } }, {0, { {"udhar", 5} } }, {0, { {"ReverseUpEquilibrium", 20} } },
{0, { {"RoundImplies", 12} } }, {0, { {"erarr", 5} } }, {0, { {"simrarr", 7} } }, {0, { {"larrsim", 7} } },
{0, { {"rarrsim", 7} } }, {0, { {"rarrap", 6} } }, {0, { {"ltlarr", 6} } }, {0, { {NULL, 0} } },
{0, { {"gtrarr", 6} } }, {0, { {"subrarr", 7} } }, {0, { {NULL, 0} } }, {0, { {"suplarr", 7} } },
{0, { {"lfisht", 6} } }, {0, { {"rfisht", 6} } }, {0, { {"ufisht", 6} } }, {0, { {"dfisht", 6} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02980[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"lopar", 5} } }, {0, { {"ropar", 5} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"lbrke", 5} } },
{0, { {"rbrke", 5} } }, {0, { {"lbrkslu", 7} } }, {0, { {"rbrksld", 7} } }, {0, { {"lbrksld", 7} } },
{0, { {"rbrkslu", 7} } }, {0, { {"langd", 5} } }, {0, { {"rangd", 5} } }, {0, { {"lparlt", 6} } },
{0, { {"rpargt", 6} } }, {0, { {"gtlPar", 6} } }, {0, { {"ltrPar", 6} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"vzigzag", 7} } }, {0, { {NULL, 0} } },
{0, { {"vangrt", 6} } }, {0, { {"angrtvbd", 8} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"ange", 4} } }, {0, { {"range", 5} } }, {0, { {"dwangle", 7} } }, {0, { {"uwangle", 7} } },
{0, { {"angmsdaa", 8} } }, {0, { {"angmsdab", 8} } }, {0, { {"angmsdac", 8} } }, {0, { {"angmsdad", 8} } },
{0, { {"angmsdae", 8} } }, {0, { {"angmsdaf", 8} } }, {0, { {"angmsdag", 8} } }, {0, { {"angmsdah", 8} } },
{0, { {"bemptyv", 7} } }, {0, { {"demptyv", 7} } }, {0, { {"cemptyv", 7} } }, {0, { {"raemptyv", 8} } },
{0, { {"laemptyv", 8} } }, {0, { {"ohbar", 5} } }, {0, { {"omid", 4} } }, {0, { {"opar", 4} } },
{0, { {NULL, 0} } }, {0, { {"operp", 5} } }, {0, { {NULL, 0} } }, {0, { {"olcross", 7} } },
{0, { {"odsold", 6} } }, {0, { {NULL, 0} } }, {0, { {"olcir", 5} } }, {0, { {"ofcir", 5} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_029C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {"olt", 3} } }, {0, { {"ogt", 3} } }, {0, { {"cirscir", 7} } }, {0, { {"cirE", 4} } },
{0, { {"solb", 4} } }, {0, { {"bsolb", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"boxbox", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"trisb", 5} } }, {0, { {"rtriltri", 8} } }, {1, { {(void *)multi_cp_html5_029CF} } },
{1, { {(void *)multi_cp_html5_029D0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"iinfin", 6} } }, {0, { {"infintie", 8} } }, {0, { {"nvinfin", 7} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"eparsl", 6} } },
{0, { {"smeparsl", 8} } }, {0, { {"eqvparsl", 8} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"lozf", 4} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"RuleDelayed", 11} } }, {0, { {NULL, 0} } }, {0, { {"dsol", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02A00[] = {
2011-08-31 13:47:19 +08:00
{0, { {"xodot", 5} } }, {0, { {"bigoplus", 8} } }, {0, { {"bigotimes", 9} } }, {0, { {NULL, 0} } },
{0, { {"biguplus", 8} } }, {0, { {NULL, 0} } }, {0, { {"bigsqcup", 8} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"iiiint", 6} } }, {0, { {"fpartint", 8} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"cirfnint", 8} } }, {0, { {"awint", 5} } }, {0, { {"rppolint", 8} } }, {0, { {"scpolint", 8} } },
{0, { {"npolint", 7} } }, {0, { {"pointint", 8} } }, {0, { {"quatint", 7} } }, {0, { {"intlarhk", 8} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"pluscir", 7} } }, {0, { {"plusacir", 8} } },
{0, { {"simplus", 7} } }, {0, { {"plusdu", 6} } }, {0, { {"plussim", 7} } }, {0, { {"plustwo", 7} } },
{0, { {NULL, 0} } }, {0, { {"mcomma", 6} } }, {0, { {"minusdu", 7} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"loplus", 6} } }, {0, { {"roplus", 6} } }, {0, { {"Cross", 5} } },
{0, { {"timesd", 6} } }, {0, { {"timesbar", 8} } }, {0, { {NULL, 0} } }, {0, { {"smashp", 6} } },
{0, { {"lotimes", 7} } }, {0, { {"rotimes", 7} } }, {0, { {"otimesas", 8} } }, {0, { {"Otimes", 6} } },
{0, { {"odiv", 4} } }, {0, { {"triplus", 7} } }, {0, { {"triminus", 8} } }, {0, { {"tritime", 7} } },
{0, { {"iprod", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"amalg", 5} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02A40[] = {
2011-08-31 13:47:19 +08:00
{0, { {"capdot", 6} } }, {0, { {NULL, 0} } }, {0, { {"ncup", 4} } }, {0, { {"ncap", 4} } },
{0, { {"capand", 6} } }, {0, { {"cupor", 5} } }, {0, { {"cupcap", 6} } }, {0, { {"capcup", 6} } },
{0, { {"cupbrcap", 8} } }, {0, { {"capbrcup", 8} } }, {0, { {"cupcup", 6} } }, {0, { {"capcap", 6} } },
{0, { {"ccups", 5} } }, {0, { {"ccaps", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"ccupssm", 7} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"And", 3} } },
{0, { {"Or", 2} } }, {0, { {"andand", 6} } }, {0, { {"oror", 4} } }, {0, { {"orslope", 7} } },
{0, { {"andslope", 8} } }, {0, { {NULL, 0} } }, {0, { {"andv", 4} } }, {0, { {"orv", 3} } },
{0, { {"andd", 4} } }, {0, { {"ord", 3} } }, {0, { {NULL, 0} } }, {0, { {"wedbar", 6} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"sdote", 5} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"simdot", 6} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {1, { {(void *)multi_cp_html5_02A6D} } }, {0, { {"easter", 6} } }, {0, { {"apacir", 6} } },
{1, { {(void *)multi_cp_html5_02A70} } }, {0, { {"eplus", 5} } }, {0, { {"pluse", 5} } }, {0, { {"Esim", 4} } },
{0, { {"Colone", 6} } }, {0, { {"Equal", 5} } }, {0, { {NULL, 0} } }, {0, { {"ddotseq", 7} } },
{0, { {"equivDD", 7} } }, {0, { {"ltcir", 5} } }, {0, { {"gtcir", 5} } }, {0, { {"ltquest", 7} } },
{0, { {"gtquest", 7} } }, {1, { {(void *)multi_cp_html5_02A7D} } }, {1, { {(void *)multi_cp_html5_02A7E} } }, {0, { {"lesdot", 6} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02A80[] = {
2011-08-31 13:47:19 +08:00
{0, { {"gesdot", 6} } }, {0, { {"lesdoto", 7} } }, {0, { {"gesdoto", 7} } }, {0, { {"lesdotor", 8} } },
{0, { {"gesdotol", 8} } }, {0, { {"lap", 3} } }, {0, { {"gap", 3} } }, {0, { {"lne", 3} } },
{0, { {"gne", 3} } }, {0, { {"lnap", 4} } }, {0, { {"gnap", 4} } }, {0, { {"lesseqqgtr", 10} } },
{0, { {"gEl", 3} } }, {0, { {"lsime", 5} } }, {0, { {"gsime", 5} } }, {0, { {"lsimg", 5} } },
{0, { {"gsiml", 5} } }, {0, { {"lgE", 3} } }, {0, { {"glE", 3} } }, {0, { {"lesges", 6} } },
{0, { {"gesles", 6} } }, {0, { {"els", 3} } }, {0, { {"egs", 3} } }, {0, { {"elsdot", 6} } },
{0, { {"egsdot", 6} } }, {0, { {"el", 2} } }, {0, { {"eg", 2} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"siml", 4} } }, {0, { {"simg", 4} } }, {0, { {"simlE", 5} } },
{0, { {"simgE", 5} } }, {1, { {(void *)multi_cp_html5_02AA1} } }, {1, { {(void *)multi_cp_html5_02AA2} } }, {0, { {NULL, 0} } },
{0, { {"glj", 3} } }, {0, { {"gla", 3} } }, {0, { {"ltcc", 4} } }, {0, { {"gtcc", 4} } },
{0, { {"lescc", 5} } }, {0, { {"gescc", 5} } }, {0, { {"smt", 3} } }, {0, { {"lat", 3} } },
{1, { {(void *)multi_cp_html5_02AAC} } }, {1, { {(void *)multi_cp_html5_02AAD} } }, {0, { {"bumpE", 5} } }, {1, { {(void *)multi_cp_html5_02AAF} } },
{1, { {(void *)multi_cp_html5_02AB0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"prE", 3} } },
{0, { {"scE", 3} } }, {0, { {"precneqq", 8} } }, {0, { {"scnE", 4} } }, {0, { {"precapprox", 10} } },
{0, { {"succapprox", 10} } }, {0, { {"precnapprox", 11} } }, {0, { {"succnapprox", 11} } }, {0, { {"Pr", 2} } },
{0, { {"Sc", 2} } }, {0, { {"subdot", 6} } }, {0, { {"supdot", 6} } }, {0, { {"subplus", 7} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_02AC0[] = {
2011-08-31 13:47:19 +08:00
{0, { {"supplus", 7} } }, {0, { {"submult", 7} } }, {0, { {"supmult", 7} } }, {0, { {"subedot", 7} } },
{0, { {"supedot", 7} } }, {1, { {(void *)multi_cp_html5_02AC5} } }, {1, { {(void *)multi_cp_html5_02AC6} } }, {0, { {"subsim", 6} } },
{0, { {"supsim", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {1, { {(void *)multi_cp_html5_02ACB} } },
{1, { {(void *)multi_cp_html5_02ACC} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"csub", 4} } },
{0, { {"csup", 4} } }, {0, { {"csube", 5} } }, {0, { {"csupe", 5} } }, {0, { {"subsup", 6} } },
{0, { {"supsub", 6} } }, {0, { {"subsub", 6} } }, {0, { {"supsup", 6} } }, {0, { {"suphsub", 7} } },
{0, { {"supdsub", 7} } }, {0, { {"forkv", 5} } }, {0, { {"topfork", 7} } }, {0, { {"mlcp", 4} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"Dashv", 5} } }, {0, { {NULL, 0} } }, {0, { {"Vdashl", 6} } }, {0, { {"Barv", 4} } },
{0, { {"vBar", 4} } }, {0, { {"vBarv", 5} } }, {0, { {NULL, 0} } }, {0, { {"Vbar", 4} } },
{0, { {"Not", 3} } }, {0, { {"bNot", 4} } }, {0, { {"rnmid", 5} } }, {0, { {"cirmid", 6} } },
{0, { {"midcir", 6} } }, {0, { {"topcir", 6} } }, {0, { {"nhpar", 5} } }, {0, { {"parsim", 6} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {1, { {(void *)multi_cp_html5_02AFD} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_0FB00[] = {
2011-08-31 13:47:19 +08:00
{0, { {"fflig", 5} } }, {0, { {"filig", 5} } }, {0, { {"fllig", 5} } }, {0, { {"ffilig", 6} } },
{0, { {"ffllig", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_1D480[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"Ascr", 4} } }, {0, { {NULL, 0} } }, {0, { {"Cscr", 4} } }, {0, { {"Dscr", 4} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"Gscr", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"Jscr", 4} } }, {0, { {"Kscr", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"Nscr", 4} } }, {0, { {"Oscr", 4} } }, {0, { {"Pscr", 4} } },
{0, { {"Qscr", 4} } }, {0, { {NULL, 0} } }, {0, { {"Sscr", 4} } }, {0, { {"Tscr", 4} } },
{0, { {"Uscr", 4} } }, {0, { {"Vscr", 4} } }, {0, { {"Wscr", 4} } }, {0, { {"Xscr", 4} } },
{0, { {"Yscr", 4} } }, {0, { {"Zscr", 4} } }, {0, { {"ascr", 4} } }, {0, { {"bscr", 4} } },
{0, { {"cscr", 4} } }, {0, { {"dscr", 4} } }, {0, { {NULL, 0} } }, {0, { {"fscr", 4} } },
{0, { {NULL, 0} } }, {0, { {"hscr", 4} } }, {0, { {"iscr", 4} } }, {0, { {"jscr", 4} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_1D4C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {"kscr", 4} } }, {0, { {"lscr", 4} } }, {0, { {"mscr", 4} } }, {0, { {"nscr", 4} } },
{0, { {NULL, 0} } }, {0, { {"pscr", 4} } }, {0, { {"qscr", 4} } }, {0, { {"rscr", 4} } },
{0, { {"sscr", 4} } }, {0, { {"tscr", 4} } }, {0, { {"uscr", 4} } }, {0, { {"vscr", 4} } },
{0, { {"wscr", 4} } }, {0, { {"xscr", 4} } }, {0, { {"yscr", 4} } }, {0, { {"zscr", 4} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_1D500[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"Afr", 3} } }, {0, { {"Bfr", 3} } }, {0, { {NULL, 0} } }, {0, { {"Dfr", 3} } },
{0, { {"Efr", 3} } }, {0, { {"Ffr", 3} } }, {0, { {"Gfr", 3} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"Jfr", 3} } }, {0, { {"Kfr", 3} } }, {0, { {"Lfr", 3} } },
{0, { {"Mfr", 3} } }, {0, { {"Nfr", 3} } }, {0, { {"Ofr", 3} } }, {0, { {"Pfr", 3} } },
{0, { {"Qfr", 3} } }, {0, { {NULL, 0} } }, {0, { {"Sfr", 3} } }, {0, { {"Tfr", 3} } },
{0, { {"Ufr", 3} } }, {0, { {"Vfr", 3} } }, {0, { {"Wfr", 3} } }, {0, { {"Xfr", 3} } },
{0, { {"Yfr", 3} } }, {0, { {NULL, 0} } }, {0, { {"afr", 3} } }, {0, { {"bfr", 3} } },
{0, { {"cfr", 3} } }, {0, { {"dfr", 3} } }, {0, { {"efr", 3} } }, {0, { {"ffr", 3} } },
{0, { {"gfr", 3} } }, {0, { {"hfr", 3} } }, {0, { {"ifr", 3} } }, {0, { {"jfr", 3} } },
{0, { {"kfr", 3} } }, {0, { {"lfr", 3} } }, {0, { {"mfr", 3} } }, {0, { {"nfr", 3} } },
{0, { {"ofr", 3} } }, {0, { {"pfr", 3} } }, {0, { {"qfr", 3} } }, {0, { {"rfr", 3} } },
{0, { {"sfr", 3} } }, {0, { {"tfr", 3} } }, {0, { {"ufr", 3} } }, {0, { {"vfr", 3} } },
{0, { {"wfr", 3} } }, {0, { {"xfr", 3} } }, {0, { {"yfr", 3} } }, {0, { {"zfr", 3} } },
{0, { {"Aopf", 4} } }, {0, { {"Bopf", 4} } }, {0, { {NULL, 0} } }, {0, { {"Dopf", 4} } },
{0, { {"Eopf", 4} } }, {0, { {"Fopf", 4} } }, {0, { {"Gopf", 4} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html5_1D540[] = {
2011-08-31 13:47:19 +08:00
{0, { {"Iopf", 4} } }, {0, { {"Jopf", 4} } }, {0, { {"Kopf", 4} } }, {0, { {"Lopf", 4} } },
{0, { {"Mopf", 4} } }, {0, { {NULL, 0} } }, {0, { {"Oopf", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"Sopf", 4} } }, {0, { {"Topf", 4} } },
{0, { {"Uopf", 4} } }, {0, { {"Vopf", 4} } }, {0, { {"Wopf", 4} } }, {0, { {"Xopf", 4} } },
{0, { {"Yopf", 4} } }, {0, { {NULL, 0} } }, {0, { {"aopf", 4} } }, {0, { {"bopf", 4} } },
{0, { {"copf", 4} } }, {0, { {"dopf", 4} } }, {0, { {"eopf", 4} } }, {0, { {"fopf", 4} } },
{0, { {"gopf", 4} } }, {0, { {"hopf", 4} } }, {0, { {"iopf", 4} } }, {0, { {"jopf", 4} } },
{0, { {"kopf", 4} } }, {0, { {"lopf", 4} } }, {0, { {"mopf", 4} } }, {0, { {"nopf", 4} } },
{0, { {"oopf", 4} } }, {0, { {"popf", 4} } }, {0, { {"qopf", 4} } }, {0, { {"ropf", 4} } },
{0, { {"sopf", 4} } }, {0, { {"topf", 4} } }, {0, { {"uopf", 4} } }, {0, { {"vopf", 4} } },
{0, { {"wopf", 4} } }, {0, { {"xopf", 4} } }, {0, { {"yopf", 4} } }, {0, { {"zopf", 4} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
/* end of stage 3 Tables for HTML5 }}} */
/* {{{ Stage 2 Tables for HTML5 */
static const entity_stage2_row empty_stage2_table[] = {
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
};
static const entity_stage2_row stage2_table_html5_00000[] = {
stage3_table_html5_00000, stage3_table_html5_00040, stage3_table_html5_00080, stage3_table_html5_000C0,
stage3_table_html5_00100, stage3_table_html5_00140, stage3_table_html5_00180, stage3_table_html5_001C0,
stage3_table_html5_00200, empty_stage3_table, empty_stage3_table, stage3_table_html5_002C0,
stage3_table_html5_00300, empty_stage3_table, stage3_table_html5_00380, stage3_table_html5_003C0,
stage3_table_html5_00400, stage3_table_html5_00440, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
};
static const entity_stage2_row stage2_table_html5_02000[] = {
stage3_table_html5_02000, stage3_table_html5_02040, stage3_table_html5_02080, stage3_table_html5_020C0,
stage3_table_html5_02100, stage3_table_html5_02140, stage3_table_html5_02180, stage3_table_html5_021C0,
stage3_table_html5_02200, stage3_table_html5_02240, stage3_table_html5_02280, stage3_table_html5_022C0,
stage3_table_html5_02300, stage3_table_html5_02340, stage3_table_html5_02380, stage3_table_html5_023C0,
stage3_table_html5_02400, empty_stage3_table, empty_stage3_table, stage3_table_html5_024C0,
stage3_table_html5_02500, stage3_table_html5_02540, stage3_table_html5_02580, stage3_table_html5_025C0,
stage3_table_html5_02600, stage3_table_html5_02640, empty_stage3_table, empty_stage3_table,
stage3_table_html5_02700, stage3_table_html5_02740, empty_stage3_table, stage3_table_html5_027C0,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
stage3_table_html5_02900, stage3_table_html5_02940, stage3_table_html5_02980, stage3_table_html5_029C0,
stage3_table_html5_02A00, stage3_table_html5_02A40, stage3_table_html5_02A80, stage3_table_html5_02AC0,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
};
static const entity_stage2_row stage2_table_html5_0F000[] = {
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
stage3_table_html5_0FB00, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
};
static const entity_stage2_row stage2_table_html5_1D000[] = {
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, stage3_table_html5_1D480, stage3_table_html5_1D4C0,
stage3_table_html5_1D500, stage3_table_html5_1D540, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
};
/* end of stage 2 tables for HTML5 }}} */
static const entity_stage1_row entity_ms_table_html5[] = {
stage2_table_html5_00000,
empty_stage2_table,
stage2_table_html5_02000,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
stage2_table_html5_0F000,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
stage2_table_html5_1D000,
};
/* end of HTML5 multi-stage table for codepoint -> entity }}} */
/* {{{ HTML5 hash table for entity -> codepoint */
typedef struct {
const char *entity;
unsigned short entity_len;
unsigned int codepoint1;
unsigned int codepoint2;
} entity_cp_map;
typedef const entity_cp_map *entity_ht_bucket;
typedef struct {
unsigned num_elems; /* power of 2 */
const entity_ht_bucket *buckets; /* .num_elems elements */
} entity_ht;
2011-08-31 13:47:19 +08:00
static const entity_cp_map ht_bucket_empty[] = { {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_000[] = { {"realpart", 8, 0x0211C, 0}, {"varr", 4, 0x02195, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_001[] = { {"angrt", 5, 0x0221F, 0}, {"iogon", 5, 0x0012F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_003[] = { {"lessdot", 7, 0x022D6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_005[] = { {"simrarr", 7, 0x02972, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_007[] = { {"Zscr", 4, 0x1D4B5, 0}, {"midast", 6, 0x0002A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_00D[] = { {"copf", 4, 0x1D554, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_00F[] = { {"female", 6, 0x02640, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_017[] = { {"NegativeThickSpace", 18, 0x0200B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_020[] = { {"copy", 4, 0x000A9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_022[] = { {"angst", 5, 0x000C5, 0}, {"searr", 5, 0x02198, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_024[] = { {"sqcups", 6, 0x02294, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_027[] = { {"Acirc", 5, 0x000C2, 0}, {"gtdot", 5, 0x022D7, 0}, {"varpi", 5, 0x003D6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_028[] = { {"UpTee", 5, 0x022A5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_029[] = { {"TildeTilde", 10, 0x02248, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_02A[] = { {"mfr", 3, 0x1D52A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_02B[] = { {"RightVectorBar", 14, 0x02953, 0}, {"gesdot", 6, 0x02A80, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_02C[] = { {"Uarrocir", 8, 0x02949, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_02E[] = { {"RightTriangleBar", 16, 0x029D0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_030[] = { {"Ocy", 3, 0x0041E, 0}, {"int", 3, 0x0222B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_034[] = { {"preccurlyeq", 11, 0x0227C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_038[] = { {"sccue", 5, 0x0227D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_040[] = { {"DoubleContourIntegral", 21, 0x0222F, 0}, {"nexist", 6, 0x02204, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_047[] = { {"acirc", 5, 0x000E2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_04C[] = { {"setmn", 5, 0x02216, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_04E[] = { {"Dopf", 4, 0x1D53B, 0}, {"LeftTee", 7, 0x022A3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_051[] = { {"SquareSuperset", 14, 0x02290, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_059[] = { {"udhar", 5, 0x0296E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_05D[] = { {"Equal", 5, 0x02A75, 0}, {"pscr", 4, 0x1D4C5, 0}, {"xvee", 4, 0x022C1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_05F[] = { {"approx", 6, 0x02248, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_060[] = { {"HARDcy", 6, 0x0042A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_061[] = { {"nGg", 3, 0x022D9, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_063[] = { {"yopf", 4, 0x1D56A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_064[] = { {"prcue", 5, 0x0227C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_065[] = { {"loarr", 5, 0x021FD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_069[] = { {"mho", 3, 0x02127, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_06A[] = { {"otimesas", 8, 0x02A36, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_06D[] = { {"capcap", 6, 0x02A4B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_06E[] = { {"eplus", 5, 0x02A71, 0}, {"nGt", 3, 0x0226B, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_06F[] = { {"Bumpeq", 6, 0x0224E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_071[] = { {"submult", 7, 0x02AC1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_073[] = { {"subplus", 7, 0x02ABF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_074[] = { {"auml", 4, 0x000E4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_07A[] = { {"RightDoubleBracket", 18, 0x027E7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_07B[] = { {"varkappa", 8, 0x003F0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_07C[] = { {"plusdo", 6, 0x02214, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_07F[] = { {"mid", 3, 0x02223, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_082[] = { {"plusdu", 6, 0x02A25, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_084[] = { {"notniva", 7, 0x0220C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_085[] = { {"notnivb", 7, 0x022FE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_086[] = { {"notnivc", 7, 0x022FD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_088[] = { {"varepsilon", 10, 0x003F5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_089[] = { {"nspar", 5, 0x02226, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_08C[] = { {"Ofr", 3, 0x1D512, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_08E[] = { {"Omega", 5, 0x003A9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_090[] = { {"equals", 6, 0x0003D, 0}, {"harrcir", 7, 0x02948, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_094[] = { {"Succeeds", 8, 0x0227B, 0}, {"cupdot", 6, 0x0228D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_097[] = { {"lsqb", 4, 0x0005B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_09E[] = { {"Qscr", 4, 0x1D4AC, 0}, {"urcorn", 6, 0x0231D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0A4[] = { {"Zopf", 4, 0x02124, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0A6[] = { {"triangleleft", 12, 0x025C3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0AB[] = { {"supdsub", 7, 0x02AD8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0AC[] = { {"chcy", 4, 0x00447, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0AD[] = { {"sqsupset", 8, 0x02290, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0AE[] = { {"omega", 5, 0x003C9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0AF[] = { {"rthree", 6, 0x022CC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0B0[] = { {"THORN", 5, 0x000DE, 0}, {"clubsuit", 8, 0x02663, 0}, {"filig", 5, 0x0FB01, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0B2[] = { {"ocir", 4, 0x0229A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0B8[] = { {"ShortDownArrow", 14, 0x02193, 0}, {"atilde", 6, 0x000E3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0B9[] = { {"DownLeftTeeVector", 17, 0x0295E, 0}, {"LeftTeeArrow", 12, 0x021A4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0BA[] = { {"GreaterFullEqual", 16, 0x02267, 0}, {"emsp", 4, 0x02003, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0C0[] = { {"lozf", 4, 0x029EB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0C4[] = { {"ThinSpace", 9, 0x02009, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0CE[] = { {"fnof", 4, 0x00192, 0}, {"multimap", 8, 0x022B8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0D1[] = { {"Zacute", 6, 0x00179, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0D2[] = { {"mdash", 5, 0x02014, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0D3[] = { {"minusb", 6, 0x0229F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0D5[] = { {"minusd", 6, 0x02238, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0DF[] = { {"varsigma", 8, 0x003C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0E5[] = { {"ntilde", 6, 0x000F1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0E6[] = { {"Lambda", 6, 0x0039B, 0}, {"integers", 8, 0x02124, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0E8[] = { {"gesles", 6, 0x02A94, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0EC[] = { {"NotSubset", 9, 0x02282, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0EF[] = { {"NotLeftTriangleEqual", 20, 0x022EC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0F3[] = { {"LessLess", 8, 0x02AA1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0F4[] = { {"gscr", 4, 0x0210A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0FA[] = { {"popf", 4, 0x1D561, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0FB[] = { {"Agrave", 6, 0x000C0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0FD[] = { {"nvinfin", 7, 0x029DE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_0FE[] = { {"gacute", 6, 0x001F5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_100[] = { {"diam", 4, 0x022C4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_101[] = { {"nesim", 5, 0x02242, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_103[] = { {"YIcy", 4, 0x00407, 0}, {"bcy", 3, 0x00431, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_105[] = { {"Exists", 6, 0x02203, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_106[] = { {"vert", 4, 0x0007C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_109[] = { {"ropar", 5, 0x02986, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_10A[] = { {"topfork", 7, 0x02ADA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_10B[] = { {"nLl", 3, 0x022D8, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_10D[] = { {"notin", 5, 0x02209, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_10E[] = { {"SucceedsSlantEqual", 18, 0x0227D, 0}, {"toea", 4, 0x02928, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_10F[] = { {"ImaginaryI", 10, 0x02148, 0}, {"srarr", 5, 0x02192, 0}, {"ulcorner", 8, 0x0231C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_110[] = { {"LeftArrowBar", 12, 0x021E4, 0}, {"ldsh", 4, 0x021B2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_111[] = { {"DownBreve", 9, 0x00311, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_113[] = { {"nLt", 3, 0x0226A, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_116[] = { {"vltri", 5, 0x022B2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_11B[] = { {"VDash", 5, 0x022AB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_11C[] = { {"Dstrok", 6, 0x00110, 0}, {"Intersection", 12, 0x022C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_11E[] = { {"lrhar", 5, 0x021CB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_121[] = { {"RightTee", 8, 0x022A2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_124[] = { {"RightArrowLeftArrow", 19, 0x021C4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_129[] = { {"Ccirc", 5, 0x00108, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_12A[] = { {"ntrianglelefteq", 15, 0x022EC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_12C[] = { {"leftharpoonup", 13, 0x021BC, 0}, {"scap", 4, 0x02AB8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_12E[] = { {"darr", 4, 0x02193, 0}, {"qfr", 3, 0x1D52E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_12F[] = { {"cdot", 4, 0x0010B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_130[] = { {"supseteqq", 9, 0x02AC6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_134[] = { {"Scy", 3, 0x00421, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_135[] = { {"Hscr", 4, 0x0210B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_137[] = { {"LowerRightArrow", 15, 0x02198, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_13A[] = { {"divide", 6, 0x000F7, 0}, {"tcedil", 6, 0x00163, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_13B[] = { {"LeftArrow", 9, 0x02190, 0}, {"Qopf", 4, 0x0211A, 0}, {"vDash", 5, 0x022A8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_145[] = { {"dash", 4, 0x02010, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_147[] = { {"oror", 4, 0x02A56, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_149[] = { {"ccirc", 5, 0x00109, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_14B[] = { {"LongLeftArrow", 13, 0x027F5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_14C[] = { {"straightphi", 11, 0x003D5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_14E[] = { {"xlarr", 5, 0x027F5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_14F[] = { {"DJcy", 4, 0x00402, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_158[] = { {"nbsp", 4, 0x000A0, 0}, {"succcurlyeq", 11, 0x0227D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_159[] = { {"njcy", 4, 0x0045A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_15B[] = { {"Leftarrow", 9, 0x021D0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_15E[] = { {"dtrif", 5, 0x025BE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_15F[] = { {"bfr", 3, 0x1D51F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_161[] = { {"GreaterTilde", 12, 0x02273, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_164[] = { {"hamilt", 6, 0x0210B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_165[] = { {"Dcy", 3, 0x00414, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_168[] = { {"LeftUpVector", 12, 0x021BF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_16A[] = { {"bigoplus", 8, 0x02A01, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_170[] = { {"nwarhk", 6, 0x02923, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_173[] = { {"diams", 5, 0x02666, 0}, {"suphsol", 7, 0x027C9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_17A[] = { {"boxminus", 8, 0x0229F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_17B[] = { {"leftarrow", 9, 0x02190, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_17C[] = { {"andd", 4, 0x02A5C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_17F[] = { {"NonBreakingSpace", 16, 0x000A0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_181[] = { {"xutri", 5, 0x025B3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_189[] = { {"Longleftrightarrow", 18, 0x027FA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_18B[] = { {"Longleftarrow", 13, 0x027F8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_18C[] = { {"gtrapprox", 9, 0x02A86, 0}, {"phmmat", 6, 0x02133, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_18E[] = { {"andv", 4, 0x02A5A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_18F[] = { {"equiv", 5, 0x02261, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_190[] = { {"Sfr", 3, 0x1D516, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_191[] = { {"gopf", 4, 0x1D558, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_193[] = { {"sqsub", 5, 0x0228F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_195[] = { {"approxeq", 8, 0x0224A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_19A[] = { {"Del", 3, 0x02207, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_19C[] = { {"nrightarrow", 11, 0x0219B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_19F[] = { {"SquareUnion", 11, 0x02294, 0}, {"strns", 5, 0x000AF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1A0[] = { {"Itilde", 6, 0x00128, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1A1[] = { {"sqsup", 5, 0x02290, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1A2[] = { {"Ouml", 4, 0x000D6, 0}, {"PrecedesTilde", 13, 0x0227E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1A3[] = { {"AMP", 3, 0x00026, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1A4[] = { {"plusmn", 6, 0x000B1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1A5[] = { {"xcup", 4, 0x022C3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1A8[] = { {"radic", 5, 0x0221A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1AB[] = { {"longleftarrow", 13, 0x027F5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1AC[] = { {"lrcorner", 8, 0x0231F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1AD[] = { {"notni", 5, 0x0220C, 0}, {"updownarrow", 11, 0x02195, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1AE[] = { {"szlig", 5, 0x000DF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1AF[] = { {"ugrave", 6, 0x000F9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1B0[] = { {"imof", 4, 0x022B7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1B2[] = { {"csub", 4, 0x02ACF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1B5[] = { {"gsim", 4, 0x02273, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1B9[] = { {"leftleftarrows", 14, 0x021C7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1BD[] = { {"backcong", 8, 0x0224C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1BE[] = { {"clubs", 5, 0x02663, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1C0[] = { {"csup", 4, 0x02AD0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1C1[] = { {"Dfr", 3, 0x1D507, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1C4[] = { {"profline", 8, 0x02312, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1C6[] = { {"Zdot", 4, 0x0017B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1C9[] = { {"ClockwiseContourIntegral", 24, 0x02232, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1CA[] = { {"roplus", 6, 0x02A2E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1CD[] = { {"Rang", 4, 0x027EB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1CE[] = { {"bcong", 5, 0x0224C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1D0[] = { {"tshcy", 5, 0x0045B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1D1[] = { {"eDot", 4, 0x02251, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1D2[] = { {"Hopf", 4, 0x0210D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1D4[] = { {"lpar", 4, 0x00028, 0}, {"odash", 5, 0x0229D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1D5[] = { {"capbrcup", 8, 0x02A49, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1D6[] = { {"ucy", 3, 0x00443, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1D8[] = { {"ofcir", 5, 0x029BF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1D9[] = { {"Breve", 5, 0x002D8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1DA[] = { {"barvee", 6, 0x022BD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1DF[] = { {"backsim", 7, 0x0223D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1E0[] = { {"ange", 4, 0x029A4, 0}, {"half", 4, 0x000BD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1E1[] = { {"tscr", 4, 0x1D4C9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1E5[] = { {"realine", 7, 0x0211B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1E6[] = { {"Oacute", 6, 0x000D3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1E7[] = { {"dfisht", 6, 0x0297F, 0}, {"swnwar", 6, 0x0292A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1E8[] = { {"tscy", 4, 0x00446, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1EB[] = { {"lsquor", 6, 0x0201A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1EF[] = { {"naturals", 8, 0x02115, 0}, {"utrif", 5, 0x025B4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1F0[] = { {"DiacriticalTilde", 16, 0x002DC, 0}, {"RightUpVectorBar", 16, 0x02954, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1F2[] = { {"rHar", 4, 0x02964, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1F4[] = { {"curlyeqprec", 11, 0x022DE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1F8[] = { {"dtri", 4, 0x025BF, 0}, {"euml", 4, 0x000EB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1F9[] = { {"breve", 5, 0x002D8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1FA[] = { {"Barwed", 6, 0x02306, 0}, {"nvlArr", 6, 0x02902, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1FC[] = { {"dcaron", 6, 0x0010F, 0}, {"natural", 7, 0x0266E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1FE[] = { {"nsupseteqq", 10, 0x02AC6, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_1FF[] = { {"nedot", 5, 0x02250, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_205[] = { {"bigtriangledown", 15, 0x025BD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_207[] = { {"fcy", 3, 0x00444, 0}, {"marker", 6, 0x025AE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_20E[] = { {"Union", 5, 0x022C3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_212[] = { {"varpropto", 9, 0x0221D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_213[] = { {"CloseCurlyDoubleQuote", 21, 0x0201D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_219[] = { {"DoubleLongRightArrow", 20, 0x027F9, 0}, {"GreaterGreater", 14, 0x02AA2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_21D[] = { {"Umacr", 5, 0x0016A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_220[] = { {"Colon", 5, 0x02237, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_222[] = { {"Hat", 3, 0x0005E, 0}, {"Uscr", 4, 0x1D4B0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_227[] = { {"SHCHcy", 6, 0x00429, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_229[] = { {"nLeftarrow", 10, 0x021CD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_22B[] = { {"Ecirc", 5, 0x000CA, 0}, {"Jukcy", 5, 0x00404, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_22C[] = { {"nbumpe", 6, 0x0224F, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_22D[] = { {"NotLess", 7, 0x0226E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_22F[] = { {"gtlPar", 6, 0x02995, 0}, {"suphsub", 7, 0x02AD7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_230[] = { {"gtreqqless", 10, 0x02A8C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_232[] = { {"ufr", 3, 0x1D532, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_234[] = { {"cirscir", 7, 0x029C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_239[] = { {"LeftDownTeeVector", 17, 0x02961, 0}, {"duhar", 5, 0x0296F, 0}, {"nrtrie", 6, 0x022ED, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_23C[] = { {"llhard", 6, 0x0296B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_23D[] = { {"umacr", 5, 0x0016B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_23E[] = { {"lates", 5, 0x02AAD, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_240[] = { {"colon", 5, 0x0003A, 0}, {"iacute", 6, 0x000ED, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_241[] = { {"NotPrecedes", 11, 0x02280, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_242[] = { {"cirfnint", 8, 0x02A10, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_246[] = { {"barwedge", 8, 0x02305, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_249[] = { {"nleftarrow", 10, 0x0219A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_24A[] = { {"Ubrcy", 5, 0x0040E, 0}, {"leftthreetimes", 14, 0x022CB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_24B[] = { {"andand", 6, 0x02A55, 0}, {"ecirc", 5, 0x000EA, 0}, {"jukcy", 5, 0x00454, 0}, {"quatint", 7, 0x02A16, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_24D[] = { {"lharul", 6, 0x0296A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_251[] = { {"smtes", 5, 0x02AAC, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_252[] = { {"UnionPlus", 9, 0x0228E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_257[] = { {"NotLeftTriangle", 15, 0x022EA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_25A[] = { {"bne", 3, 0x0003D, 0x020E5}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_25B[] = { {"gtrsim", 6, 0x02273, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_25C[] = { {"Rarr", 4, 0x021A0, 0}, {"ldquor", 6, 0x0201E, 0}, {"phiv", 4, 0x003D5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_25D[] = { {"because", 7, 0x02235, 0}, {"gEl", 3, 0x02A8C, 0}, {"setminus", 8, 0x02216, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_263[] = { {"ffr", 3, 0x1D523, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_26A[] = { {"ubrcy", 5, 0x0045E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_26B[] = { {"elinters", 8, 0x023E7, 0}, {"plusb", 5, 0x0229E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_26E[] = { {"pluse", 5, 0x02A72, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_274[] = { {"CapitalDifferentialD", 20, 0x02145, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_277[] = { {"daleth", 6, 0x02138, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_278[] = { {"kscr", 4, 0x1D4C0, 0}, {"ogon", 4, 0x002DB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_27C[] = { {"SHcy", 4, 0x00428, 0}, {"equest", 6, 0x0225F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_27E[] = { {"rbarr", 5, 0x0290D, 0}, {"topf", 4, 0x1D565, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_283[] = { {"tritime", 7, 0x02A3B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_28A[] = { {"bot", 3, 0x022A5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_294[] = { {"Wfr", 3, 0x1D51A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_297[] = { {"HumpEqual", 9, 0x0224F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_298[] = { {"rightleftharpoons", 17, 0x021CC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_29D[] = { {"frasl", 5, 0x02044, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_29F[] = { {"UnderBracket", 12, 0x023B5, 0}, {"ovbar", 5, 0x0233D, 0}, {"upharpoonright", 14, 0x021BE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2A0[] = { {"euro", 4, 0x020AC, 0}, {"nhArr", 5, 0x021CE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2A9[] = { {"NotSupersetEqual", 16, 0x02289, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2AE[] = { {"cularr", 6, 0x021B6, 0}, {"scnE", 4, 0x02AB6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2B1[] = { {"napid", 5, 0x0224B, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2B2[] = { {"harr", 4, 0x02194, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2B3[] = { {"gdot", 4, 0x00121, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2B9[] = { {"Lscr", 4, 0x02112, 0}, {"zeta", 4, 0x003B6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2BF[] = { {"ENG", 3, 0x0014A, 0}, {"Uopf", 4, 0x1D54C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2C4[] = { {"esdot", 5, 0x02250, 0}, {"scsim", 5, 0x0227F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2C5[] = { {"Hfr", 3, 0x0210C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2CE[] = { {"RightArrow", 10, 0x02192, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2CF[] = { {"Sqrt", 4, 0x0221A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2D3[] = { {"xodot", 5, 0x02A00, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2DA[] = { {"ycy", 3, 0x0044B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2DB[] = { {"hkswarow", 8, 0x02926, 0}, {"urtri", 5, 0x025F9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2DC[] = { {"roang", 5, 0x027ED, 0}, {"tosa", 4, 0x02929, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2E3[] = { {"CircleMinus", 11, 0x02296, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2E4[] = { {"Lcaron", 6, 0x0013D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2EB[] = { {"ShortLeftArrow", 14, 0x02190, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2EC[] = { {"Dot", 3, 0x000A8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2EE[] = { {"Rightarrow", 10, 0x021D2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2F0[] = { {"prsim", 5, 0x0227E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2F2[] = { {"notinE", 6, 0x022F9, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_2F8[] = { {"becaus", 6, 0x02235, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_300[] = { {"NotEqualTilde", 13, 0x02242, 0x00338}, {"nparallel", 9, 0x02226, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_301[] = { {"capcup", 6, 0x02A47, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_304[] = { {"simeq", 5, 0x02243, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_305[] = { {"forall", 6, 0x02200, 0}, {"straightepsilon", 15, 0x003F5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_308[] = { {"ruluhar", 7, 0x02968, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_30B[] = { {"jcy", 3, 0x00439, 0}, {"ltcc", 4, 0x02AA6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_30F[] = { {"bscr", 4, 0x1D4B7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_311[] = { {"ExponentialE", 12, 0x02147, 0}, {"weierp", 6, 0x02118, 0}, {"yen", 3, 0x000A5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_313[] = { {"blacksquare", 11, 0x025AA, 0}, {"uml", 3, 0x000A8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_315[] = { {"backsimeq", 9, 0x022CD, 0}, {"kopf", 4, 0x1D55C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_319[] = { {"NotPrecedesEqual", 16, 0x02AAF, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_31A[] = { {"simgE", 5, 0x02AA0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_31B[] = { {"tridot", 6, 0x025EC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_326[] = { {"DoubleLongLeftArrow", 19, 0x027F8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_329[] = { {"models", 6, 0x022A7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_32A[] = { {"emptyv", 6, 0x02205, 0}, {"eqslantgtr", 10, 0x02A96, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_32D[] = { {"Gcirc", 5, 0x0011C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_330[] = { {"bernou", 6, 0x0212C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_331[] = { {"HumpDownHump", 12, 0x0224E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_336[] = { {"yfr", 3, 0x1D536, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_338[] = { {"blacktriangle", 13, 0x025B4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_33B[] = { {"yacy", 4, 0x0044F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_33F[] = { {"lsime", 5, 0x02A8D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_340[] = { {"NotTildeEqual", 13, 0x02244, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_341[] = { {"lsimg", 5, 0x02A8F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_347[] = { {"ncap", 4, 0x02A43, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_34D[] = { {"DD", 2, 0x02145, 0}, {"gcirc", 5, 0x0011D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_350[] = { {"Cscr", 4, 0x1D49E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_356[] = { {"Lopf", 4, 0x1D543, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_358[] = { {"lBarr", 5, 0x0290E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_359[] = { {"Leftrightarrow", 14, 0x021D4, 0}, {"gtrdot", 6, 0x022D7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_35D[] = { {"NotSquareSubset", 15, 0x0228F, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_35F[] = { {"sqsubset", 8, 0x0228F, 0}, {"subsetneq", 9, 0x0228A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_361[] = { {"doublebarwedge", 14, 0x02306, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_363[] = { {"blacktriangleleft", 17, 0x025C2, 0}, {"hellip", 6, 0x02026, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_365[] = { {"xscr", 4, 0x1D4CD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_367[] = { {"LessFullEqual", 13, 0x02266, 0}, {"jfr", 3, 0x1D527, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_369[] = { {"GreaterSlantEqual", 17, 0x02A7E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_36A[] = { {"Uring", 5, 0x0016E, 0}, {"VeryThinSpace", 13, 0x0200A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_36B[] = { {"roarr", 5, 0x021FE, 0}, {"scaron", 6, 0x00161, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_36D[] = { {"Lcy", 3, 0x0041B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_36E[] = { {"RightDownVector", 15, 0x021C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_36F[] = { {"Sub", 3, 0x022D0, 0}, {"pitchfork", 9, 0x022D4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_372[] = { {"nvsim", 5, 0x0223C, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_374[] = { {"xrArr", 5, 0x027F9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_378[] = { {"CloseCurlyQuote", 15, 0x02019, 0}, {"uwangle", 7, 0x029A7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_37A[] = { {"Sum", 3, 0x02211, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_37C[] = { {"iuml", 4, 0x000EF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_37D[] = { {"Sup", 3, 0x022D1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_37E[] = { {"planck", 6, 0x0210F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_37F[] = { {"Egrave", 6, 0x000C8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_380[] = { {"curlywedge", 10, 0x022CF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_382[] = { {"TildeFullEqual", 14, 0x02245, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_383[] = { {"searhk", 6, 0x02925, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_386[] = { {"ETH", 3, 0x000D0, 0}, {"napos", 5, 0x00149, 0}, {"upsi", 4, 0x003C5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_387[] = { {"twoheadleftarrow", 16, 0x0219E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_38A[] = { {"Assign", 6, 0x02254, 0}, {"uring", 5, 0x0016F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_38D[] = { {"SquareIntersection", 18, 0x02293, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_38E[] = { {"lmidot", 6, 0x00140, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_391[] = { {"kcedil", 6, 0x00137, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_394[] = { {"curren", 6, 0x000A4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_397[] = { {"acute", 5, 0x000B4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_398[] = { {"curlyeqsucc", 11, 0x022DF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_39C[] = { {"Omicron", 7, 0x0039F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_39F[] = { {"uarr", 4, 0x02191, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3A0[] = { {"Hstrok", 6, 0x00126, 0}, {"UnderBrace", 10, 0x023DF, 0}, {"tdot", 4, 0x020DB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3A1[] = { {"qint", 4, 0x02A0C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3A4[] = { {"sfrown", 6, 0x02322, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3A5[] = { {"trpezium", 8, 0x023E2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3A6[] = { {"Yscr", 4, 0x1D4B4, 0}, {"cedil", 5, 0x000B8, 0}, {"planckh", 7, 0x0210E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3A7[] = { {"lang", 4, 0x027E8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3AC[] = { {"bopf", 4, 0x1D553, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3B2[] = { {"lbbrk", 5, 0x02772, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3B4[] = { {"khcy", 4, 0x00445, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3BF[] = { {"Epsilon", 7, 0x00395, 0}, {"simlE", 5, 0x02A9F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3C0[] = { {"GT", 2, 0x0003E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3C4[] = { {"nap", 3, 0x02249, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3C9[] = { {"Lfr", 3, 0x1D50F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3CD[] = { {"succapprox", 10, 0x02AB8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3D0[] = { {"bsim", 4, 0x0223D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3D3[] = { {"Gg", 2, 0x022D9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3D9[] = { {"angrtvb", 7, 0x022BE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3DE[] = { {"xcirc", 5, 0x025EF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3E0[] = { {"Gt", 2, 0x0226B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3E1[] = { {"LeftRightVector", 15, 0x0294E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3E3[] = { {"circledast", 10, 0x0229B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3E4[] = { {"telrec", 6, 0x02315, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3E6[] = { {"SucceedsTilde", 13, 0x0227F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3E9[] = { {"nLtv", 4, 0x0226A, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3ED[] = { {"Copf", 4, 0x02102, 0}, {"napprox", 7, 0x02249, 0}, {"nsupseteq", 9, 0x02289, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3F1[] = { {"VerticalTilde", 13, 0x02240, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3F2[] = { {"parallel", 8, 0x02225, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3F7[] = { {"precnapprox", 11, 0x02AB9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3FC[] = { {"oscr", 4, 0x02134, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_3FE[] = { {"supsetneqq", 10, 0x02ACC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_402[] = { {"xopf", 4, 0x1D569, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_405[] = { {"mumap", 5, 0x022B8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_407[] = { {"varsupsetneqq", 13, 0x02ACC, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_409[] = { {"ReverseEquilibrium", 18, 0x021CB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_40E[] = { {"Ubreve", 6, 0x0016C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_40F[] = { {"YUcy", 4, 0x0042E, 0}, {"ncy", 3, 0x0043D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_413[] = { {"ltimes", 6, 0x022C9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_41A[] = { {"UpperRightArrow", 15, 0x02197, 0}, {"nvap", 4, 0x0224D, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_41B[] = { {"Im", 2, 0x02111, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_421[] = { {"simne", 5, 0x02246, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_423[] = { {"ccups", 5, 0x02A4C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_424[] = { {"nlArr", 5, 0x021CD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_425[] = { {"rarrsim", 7, 0x02974, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_426[] = { {"Ncaron", 6, 0x00147, 0}, {"vsupnE", 6, 0x02ACC, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_429[] = { {"succeq", 6, 0x02AB0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_42C[] = { {"Gammad", 6, 0x003DC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_42F[] = { {"Icirc", 5, 0x000CE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_430[] = { {"backepsilon", 11, 0x003F6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_432[] = { {"ddarr", 5, 0x021CA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_436[] = { {"larr", 4, 0x02190, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_439[] = { {"divideontimes", 13, 0x022C7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_43C[] = { {"succsim", 7, 0x0227F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_43D[] = { {"Pscr", 4, 0x1D4AB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_43E[] = { {"puncsp", 6, 0x02008, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_43F[] = { {"gtreqless", 9, 0x022DB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_440[] = { {"intcal", 6, 0x022BA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_441[] = { {"nsime", 5, 0x02244, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_443[] = { {"Yopf", 4, 0x1D550, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_446[] = { {"angsph", 6, 0x02222, 0}, {"vsupne", 6, 0x0228B, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_447[] = { {"NotNestedLessLess", 17, 0x02AA1, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_44A[] = { {"PrecedesSlantEqual", 18, 0x0227C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_44F[] = { {"icirc", 5, 0x000EE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_450[] = { {"DownLeftVectorBar", 17, 0x02956, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_454[] = { {"Auml", 4, 0x000C4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_457[] = { {"LJcy", 4, 0x00409, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_458[] = { {"sqsube", 6, 0x02291, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_45D[] = { {"nprec", 5, 0x02280, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_45F[] = { {"ngE", 3, 0x02267, 0x00338}, {"smile", 5, 0x02323, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_465[] = { {"LT", 2, 0x0003C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_466[] = { {"ldrdhar", 7, 0x02967, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_469[] = { {"utri", 4, 0x025B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_46A[] = { {"Sacute", 6, 0x0015A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_46B[] = { {"late", 4, 0x02AAD, 0}, {"nfr", 3, 0x1D52B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_46D[] = { {"NotNestedGreaterGreater", 23, 0x02AA2, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_46F[] = { {"nwarr", 5, 0x02196, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_470[] = { {"biguplus", 8, 0x02A04, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_471[] = { {"Pcy", 3, 0x0041F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_472[] = { {"bigtriangleup", 13, 0x025B3, 0}, {"rationals", 9, 0x0211A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_473[] = { {"congdot", 7, 0x02A6D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_475[] = { {"PlusMinus", 9, 0x000B1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_479[] = { {"IOcy", 4, 0x00401, 0}, {"Scedil", 6, 0x0015E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_47C[] = { {"eqcirc", 6, 0x02256, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_47D[] = { {"Ll", 2, 0x022D8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_47F[] = { {"Cayleys", 7, 0x0212D, 0}, {"nge", 3, 0x02271, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_480[] = { {"NotGreater", 10, 0x0226F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_485[] = { {"Lt", 2, 0x0226A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_488[] = { {"rotimes", 7, 0x02A35, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_48C[] = { {"caps", 4, 0x02229, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_48E[] = { {"ngt", 3, 0x0226F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_48F[] = { {"Cross", 5, 0x02A2F, 0}, {"bumpeq", 6, 0x0224F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_490[] = { {"VerticalSeparator", 17, 0x02758, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_491[] = { {"plankv", 6, 0x0210F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_493[] = { {"fscr", 4, 0x1D4BB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_495[] = { {"bsol", 4, 0x0005C, 0}, {"sqsubseteq", 10, 0x02291, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_496[] = { {"boxH", 4, 0x02550, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_498[] = { {"rightarrowtail", 14, 0x021A3, 0}, {"ufisht", 6, 0x0297E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_499[] = { {"oopf", 4, 0x1D560, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_49F[] = { {"lobrk", 5, 0x027E6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4A2[] = { {"Acy", 3, 0x00410, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4A4[] = { {"NotSubsetEqual", 14, 0x02288, 0}, {"boxV", 4, 0x02551, 0}, {"dHar", 4, 0x02965, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4A6[] = { {"precnsim", 8, 0x022E8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4A7[] = { {"Mu", 2, 0x0039C, 0}, {"aelig", 5, 0x000E6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4AA[] = { {"gescc", 5, 0x02AA9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4AB[] = { {"origof", 6, 0x022B6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4AE[] = { {"upsih", 5, 0x003D2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4AF[] = { {"cross", 5, 0x02717, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4B2[] = { {"LeftFloor", 9, 0x0230A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4B6[] = { {"boxh", 4, 0x02500, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4B8[] = { {"NotGreaterEqual", 15, 0x02271, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4BC[] = { {"profalar", 8, 0x0232E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4C0[] = { {"nsmid", 5, 0x02224, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4C2[] = { {"hbar", 4, 0x0210F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4C3[] = { {"udarr", 5, 0x021C5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4C4[] = { {"boxv", 4, 0x02502, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4C5[] = { {"olarr", 5, 0x021BA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4C8[] = { {"Nu", 2, 0x0039D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4CB[] = { {"NotCongruent", 12, 0x02262, 0}, {"bkarow", 6, 0x0290D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4CD[] = { {"Pfr", 3, 0x1D513, 0}, {"forkv", 5, 0x02AD9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4CF[] = { {"nis", 3, 0x022FC, 0}, {"trianglerighteq", 15, 0x022B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4D0[] = { {"ngeq", 4, 0x02271, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4D2[] = { {"cudarrl", 7, 0x02938, 0}, {"nges", 4, 0x02A7E, 0x00338}, {"niv", 3, 0x0220B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4D3[] = { {"SubsetEqual", 11, 0x02286, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4D4[] = { {"Gscr", 4, 0x1D4A2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4D5[] = { {"complexes", 9, 0x02102, 0}, {"eDDot", 5, 0x02A77, 0}, {"nvge", 4, 0x02265, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4D8[] = { {"cudarrr", 7, 0x02935, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4DA[] = { {"Popf", 4, 0x02119, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4DE[] = { {"LongRightArrow", 14, 0x027F6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4DF[] = { {"supseteq", 8, 0x02287, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4E3[] = { {"dollar", 6, 0x00024, 0}, {"gnsim", 5, 0x022E7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4E4[] = { {"nvgt", 4, 0x0003E, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4E6[] = { {"Or", 2, 0x02A54, 0}, {"Vert", 4, 0x02016, 0}, {"lneqq", 5, 0x02268, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4E7[] = { {"nLeftrightarrow", 15, 0x021CE, 0}, {"nbump", 5, 0x0224E, 0x00338}, {"ntriangleright", 14, 0x022EB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4E8[] = { {"ecir", 4, 0x02256, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4E9[] = { {"npolint", 7, 0x02A14, 0}, {"plus", 4, 0x0002B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4ED[] = { {"centerdot", 9, 0x000B7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4F1[] = { {"zacute", 6, 0x0017A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4F7[] = { {"odiv", 4, 0x02A38, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4F9[] = { {"cap", 3, 0x02229, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4FB[] = { {"ensp", 4, 0x02002, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_4FE[] = { {"Afr", 3, 0x1D504, 0}, {"Pi", 2, 0x003A0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_500[] = { {"iquest", 6, 0x000BF, 0}, {"ltri", 4, 0x025C3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_504[] = { {"nlE", 3, 0x02266, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_506[] = { {"Phi", 3, 0x003A6, 0}, {"lambda", 6, 0x003BB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_507[] = { {"Pr", 2, 0x02ABB, 0}, {"Vdashl", 6, 0x02AE6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_509[] = { {"SuchThat", 8, 0x0220B, 0}, {"Supset", 6, 0x022D1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_50E[] = { {"Darr", 4, 0x021A1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_50F[] = { {"Cdot", 4, 0x0010A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_513[] = { {"rcy", 3, 0x00440, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_516[] = { {"orderof", 7, 0x02134, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_518[] = { {"leqq", 4, 0x02266, 0}, {"precsim", 7, 0x0227E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_519[] = { {"RightTriangle", 13, 0x022B3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_51B[] = { {"agrave", 6, 0x000E0, 0}, {"succnapprox", 11, 0x02ABA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_51C[] = { {"Tab", 3, 0x00009, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_524[] = { {"nle", 3, 0x02270, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_525[] = { {"spades", 6, 0x02660, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_526[] = { {"gtcc", 4, 0x02AA7, 0}, {"llcorner", 8, 0x0231E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_52F[] = { {"Oslash", 6, 0x000D8, 0}, {"Tau", 3, 0x003A4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_530[] = { {"fopf", 4, 0x1D557, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_532[] = { {"Mellintrf", 9, 0x02133, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_533[] = { {"nlt", 3, 0x0226E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_534[] = { {"lparlt", 6, 0x02993, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_53B[] = { {"Ccaron", 6, 0x0010C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_53C[] = { {"Re", 2, 0x0211C, 0}, {"dstrok", 6, 0x00111, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_53F[] = { {"leftharpoondown", 15, 0x021BD, 0}, {"ssetmn", 6, 0x02216, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_542[] = { {"lrhard", 6, 0x0296D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_543[] = { {"reg", 3, 0x000AE, 0}, {"sharp", 5, 0x0266F, 0}, {"yicy", 4, 0x00457, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_545[] = { {"ShortUpArrow", 12, 0x02191, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_548[] = { {"plusacir", 8, 0x02A23, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_54F[] = { {"cent", 4, 0x000A2, 0}, {"natur", 5, 0x0266E, 0}, {"varphi", 6, 0x003D5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_550[] = { {"lesg", 4, 0x022DA, 0x0FE00}, {"supnE", 5, 0x02ACC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_551[] = { {"ohbar", 5, 0x029B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_557[] = { {"NotLessGreater", 14, 0x02278, 0}, {"nleqslant", 9, 0x02A7D, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_55B[] = { {"Sc", 2, 0x02ABC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_55D[] = { {"NotSucceedsEqual", 16, 0x02AB0, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_55F[] = { {"DZcy", 4, 0x0040F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_564[] = { {"vartheta", 8, 0x003D1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_565[] = { {"ltrie", 5, 0x022B4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_566[] = { {"Otilde", 6, 0x000D5, 0}, {"ltrif", 5, 0x025C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_56C[] = { {"Lsh", 3, 0x021B0, 0}, {"hookleftarrow", 13, 0x021A9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_56F[] = { {"rfr", 3, 0x1D52F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_570[] = { {"supne", 5, 0x0228B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_571[] = { {"Gopf", 4, 0x1D53E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_572[] = { {"UpEquilibrium", 13, 0x0296E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_575[] = { {"Tcy", 3, 0x00422, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_576[] = { {"ffilig", 6, 0x0FB03, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_577[] = { {"fork", 4, 0x022D4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_578[] = { {"oplus", 5, 0x02295, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_57A[] = { {"nvle", 4, 0x02264, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_57B[] = { {"HilbertSpace", 12, 0x0210B, 0}, {"subedot", 7, 0x02AC3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_57C[] = { {"TripleDot", 9, 0x020DB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_580[] = { {"sscr", 4, 0x1D4C8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_582[] = { {"osol", 4, 0x02298, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_583[] = { {"plustwo", 7, 0x02A27, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_586[] = { {"LessGreater", 11, 0x02276, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_588[] = { {"lrarr", 5, 0x021C6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_589[] = { {"nvlt", 4, 0x0003C, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_58D[] = { {"questeq", 7, 0x0225F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_58E[] = { {"LessTilde", 9, 0x02272, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_58F[] = { {"djcy", 4, 0x00452, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_590[] = { {"xoplus", 6, 0x02A01, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_595[] = { {"primes", 6, 0x02119, 0}, {"solb", 4, 0x029C4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_596[] = { {"not", 3, 0x000AC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_59A[] = { {"angzarr", 7, 0x0237C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_59D[] = { {"nearr", 5, 0x02197, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_59F[] = { {"lowast", 6, 0x02217, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5A0[] = { {"cfr", 3, 0x1D520, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5A3[] = { {"ltcir", 5, 0x02A79, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5A6[] = { {"Ecy", 3, 0x0042D, 0}, {"gesdotol", 8, 0x02A84, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5A9[] = { {"longleftrightarrow", 18, 0x027F7, 0}, {"para", 4, 0x000B6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5AC[] = { {"Uacute", 6, 0x000DA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5AD[] = { {"blank", 5, 0x02423, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5AE[] = { {"rho", 3, 0x003C1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5B0[] = { {"dharl", 5, 0x021C3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5B1[] = { {"rsquor", 6, 0x02019, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5B5[] = { {"NotSquareSubsetEqual", 20, 0x022E2, 0}, {"npr", 3, 0x02280, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5B6[] = { {"dharr", 5, 0x021C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5B7[] = { {"NewLine", 7, 0x0000A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5BB[] = { {"odot", 4, 0x02299, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5BC[] = { {"part", 4, 0x02202, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5BD[] = { {"cuvee", 5, 0x022CE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5BF[] = { {"lesdoto", 7, 0x02A81, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5C0[] = { {"itilde", 6, 0x00129, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5C1[] = { {"Tscr", 4, 0x1D4AF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5C2[] = { {"nsubE", 5, 0x02AC5, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5C4[] = { {"ratio", 5, 0x02236, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5D0[] = { {"Conint", 6, 0x0222F, 0}, {"LeftDownVectorBar", 17, 0x02959, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5D1[] = { {"Tfr", 3, 0x1D517, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5D3[] = { {"fllig", 5, 0x0FB02, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5D5[] = { {"thksim", 6, 0x0223C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5D8[] = { {"Euml", 4, 0x000CB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5D9[] = { {"chi", 3, 0x003C7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5DB[] = { {"ncup", 4, 0x02A42, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5DD[] = { {"SOFTcy", 6, 0x0042C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5DF[] = { {"bnequiv", 7, 0x02261, 0x020E5}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5E2[] = { {"nsube", 5, 0x02288, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5E4[] = { {"mapstoleft", 10, 0x021A4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5E7[] = { {"NotLessSlantEqual", 17, 0x02A7D, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5EA[] = { {"ldrushar", 8, 0x0294B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5ED[] = { {"Equilibrium", 11, 0x021CC, 0}, {"Uogon", 5, 0x00172, 0}, {"supsetneq", 9, 0x0228B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5F0[] = { {"Vbar", 4, 0x02AEB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5F3[] = { {"vnsub", 5, 0x02282, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5F6[] = { {"Square", 6, 0x025A1, 0}, {"lessapprox", 10, 0x02A85, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5F8[] = { {"And", 3, 0x02A53, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5FA[] = { {"gesdoto", 7, 0x02A82, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_5FD[] = { {"gap", 3, 0x02A86, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_601[] = { {"nsucc", 5, 0x02281, 0}, {"thicksim", 8, 0x0223C, 0}, {"vnsup", 5, 0x02283, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_602[] = { {"Efr", 3, 0x1D508, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_603[] = { {"Igrave", 6, 0x000CC, 0}, {"cir", 3, 0x025CB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_606[] = { {"Xi", 2, 0x0039E, 0}, {"oacute", 6, 0x000F3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_609[] = { {"nsc", 3, 0x02281, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_60D[] = { {"uogon", 5, 0x00173, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_613[] = { {"rharul", 6, 0x0296C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_615[] = { {"RuleDelayed", 11, 0x029F4, 0}, {"apacir", 6, 0x02A6F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_617[] = { {"jscr", 4, 0x1D4BF, 0}, {"vcy", 3, 0x00432, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_61A[] = { {"barwed", 6, 0x02305, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_61D[] = { {"sopf", 4, 0x1D564, 0}, {"thkap", 5, 0x02248, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_61F[] = { {"lesseqgtr", 9, 0x022DA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_622[] = { {"rdquor", 6, 0x0201D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_624[] = { {"Lstrok", 6, 0x00141, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_626[] = { {"Product", 7, 0x0220F, 0}, {"sqsupe", 6, 0x02292, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_628[] = { {"awconint", 8, 0x02233, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_62C[] = { {"hearts", 6, 0x02665, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_630[] = { {"rlm", 3, 0x0200F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_632[] = { {"comma", 5, 0x0002C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_636[] = { {"PartialD", 8, 0x02202, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_63A[] = { {"wedbar", 6, 0x02A5F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_63C[] = { {"oline", 5, 0x0203E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_63D[] = { {"OverBracket", 11, 0x023B4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_63E[] = { {"RBarr", 5, 0x02910, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_641[] = { {"uharl", 5, 0x021BF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_642[] = { {"leftrightsquigarrow", 19, 0x021AD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_645[] = { {"RightFloor", 10, 0x0230B, 0}, {"intprod", 7, 0x02A3C, 0}, {"vee", 3, 0x02228, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_646[] = { {"zigrarr", 7, 0x021DD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_647[] = { {"uharr", 5, 0x021BE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_648[] = { {"gcy", 3, 0x00433, 0}, {"varsubsetneq", 12, 0x0228A, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_649[] = { {"leqslant", 8, 0x02A7D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_64A[] = { {"Odblac", 6, 0x00150, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_651[] = { {"minus", 5, 0x02212, 0}, {"scpolint", 8, 0x02A13, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_652[] = { {"lrtri", 5, 0x022BF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_653[] = { {"DiacriticalGrave", 16, 0x00060, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_655[] = { {"num", 3, 0x00023, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_657[] = { {"quest", 5, 0x0003F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_658[] = { {"Kscr", 4, 0x1D4A6, 0}, {"UnderBar", 8, 0x0005F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_659[] = { {"lsquo", 5, 0x02018, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_65C[] = { {"rArr", 4, 0x021D2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_65E[] = { {"Topf", 4, 0x1D54B, 0}, {"heartsuit", 9, 0x02665, 0}, {"rBarr", 5, 0x0290F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_660[] = { {"emptyset", 8, 0x02205, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_669[] = { {"UnderParenthesis", 16, 0x023DD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_670[] = { {"dotplus", 7, 0x02214, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_671[] = { {"Psi", 3, 0x003A8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_672[] = { {"GJcy", 4, 0x00403, 0}, {"exist", 5, 0x02203, 0}, {"simplus", 7, 0x02A24, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_673[] = { {"vfr", 3, 0x1D533, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_676[] = { {"tprime", 6, 0x02034, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_678[] = { {"leftrightharpoons", 17, 0x021CB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_679[] = { {"rbrksld", 7, 0x0298E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_67D[] = { {"Ecaron", 6, 0x0011A, 0}, {"gel", 3, 0x022DB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_680[] = { {"capdot", 6, 0x02A40, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_682[] = { {"geq", 3, 0x02265, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_684[] = { {"LowerLeftArrow", 14, 0x02199, 0}, {"ges", 3, 0x02A7E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_685[] = { {"Colone", 6, 0x02A74, 0}, {"NotLessEqual", 12, 0x02270, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_68A[] = { {"nrarr", 5, 0x0219B, 0}, {"rbrkslu", 7, 0x02990, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_68C[] = { {"flat", 4, 0x0266D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_691[] = { {"there4", 6, 0x02234, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_693[] = { {"Gdot", 4, 0x00120, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_694[] = { {"ijlig", 5, 0x00133, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_696[] = { {"blacklozenge", 12, 0x029EB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_699[] = { {"Zeta", 4, 0x00396, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6A3[] = { {"duarr", 5, 0x021F5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6A4[] = { {"DotEqual", 8, 0x02250, 0}, {"dtdot", 5, 0x022F1, 0}, {"gfr", 3, 0x1D524, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6A8[] = { {"cirE", 4, 0x029C3, 0}, {"period", 6, 0x0002E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6A9[] = { {"lmoust", 6, 0x023B0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6AA[] = { {"Icy", 3, 0x00418, 0}, {"Rcaron", 6, 0x00158, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6AB[] = { {"LeftCeiling", 11, 0x02308, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6AE[] = { {"ascr", 4, 0x1D4B6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6B0[] = { {"boxtimes", 8, 0x022A0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6B4[] = { {"jopf", 4, 0x1D55B, 0}, {"ntriangleleft", 13, 0x022EA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6B6[] = { {"eqcolon", 7, 0x02255, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6B8[] = { {"rbbrk", 5, 0x02773, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6B9[] = { {"homtht", 6, 0x0223B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6BA[] = { {"ggg", 3, 0x022D9, 0}, {"seswar", 6, 0x02929, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6BC[] = { {"perp", 4, 0x022A5, 0}, {"shcy", 4, 0x00448, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6BF[] = { {"phone", 5, 0x0260E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6C0[] = { {"NotDoubleVerticalBar", 20, 0x02226, 0}, {"ngtr", 4, 0x0226F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6C4[] = { {"ThickSpace", 10, 0x0205F, 0x0200A}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6C5[] = { {"ForAll", 6, 0x02200, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6C6[] = { {"circ", 4, 0x002C6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6C7[] = { {"Verbar", 6, 0x02016, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6C8[] = { {"cire", 4, 0x02257, 0}, {"lesges", 6, 0x02A93, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6C9[] = { {"slarr", 5, 0x02190, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6CC[] = { {"RightDownTeeVector", 18, 0x0295D, 0}, {"triangleq", 9, 0x0225C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6CE[] = { {"checkmark", 9, 0x02713, 0}, {"quot", 4, 0x00022, 0}, {"suplarr", 7, 0x0297B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6D1[] = { {"Backslash", 9, 0x02216, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6D2[] = { {"fallingdotseq", 13, 0x02252, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6D4[] = { {"swArr", 5, 0x021D9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6D5[] = { {"Xfr", 3, 0x1D51B, 0}, {"lbrke", 5, 0x0298B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6D9[] = { {"jmath", 5, 0x00237, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6DA[] = { {"lmoustache", 10, 0x023B0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6DB[] = { {"DownTee", 7, 0x022A4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6DC[] = { {"reals", 5, 0x0211D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6DE[] = { {"quaternions", 11, 0x0210D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6E7[] = { {"vzigzag", 7, 0x0299A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6EB[] = { {"pound", 5, 0x000A3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6EE[] = { {"permil", 6, 0x02030, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6EF[] = { {"Bscr", 4, 0x0212C, 0}, {"lfisht", 6, 0x0297C, 0}, {"vartriangleleft", 15, 0x022B2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6F5[] = { {"Kopf", 4, 0x1D542, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6F7[] = { {"Tilde", 5, 0x0223C, 0}, {"gtrarr", 6, 0x02978, 0}, {"lAarr", 5, 0x021DA, 0}, {"opar", 4, 0x029B7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_6FB[] = { {"triangle", 8, 0x025B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_704[] = { {"lcaron", 6, 0x0013E, 0}, {"wscr", 4, 0x1D4CC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_705[] = { {"asympeq", 7, 0x0224D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_706[] = { {"Ifr", 3, 0x02111, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_707[] = { {"DoubleDot", 9, 0x000A8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_709[] = { {"nVdash", 6, 0x022AE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_70C[] = { {"hairsp", 6, 0x0200A, 0}, {"leftrightarrows", 15, 0x021C6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_70E[] = { {"lbrace", 6, 0x0007B, 0}, {"rightarrow", 10, 0x02192, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_70F[] = { {"Dagger", 6, 0x02021, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_712[] = { {"rsh", 3, 0x021B1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_714[] = { {"eqslantless", 11, 0x02A95, 0}, {"gnapprox", 8, 0x02A8A, 0}, {"lbrack", 6, 0x0005B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_715[] = { {"uHar", 4, 0x02963, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_717[] = { {"tilde", 5, 0x002DC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_719[] = { {"complement", 10, 0x02201, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_71B[] = { {"zcy", 3, 0x00437, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_71E[] = { {"boxDL", 5, 0x02557, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_71F[] = { {"micro", 5, 0x000B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_723[] = { {"horbar", 6, 0x02015, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_724[] = { {"boxDR", 5, 0x02554, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_727[] = { {"bsolhsub", 8, 0x027C8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_729[] = { {"ac", 2, 0x0223E, 0}, {"nvdash", 6, 0x022AC, 0}, {"precapprox", 10, 0x02AB7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_72C[] = { {"af", 2, 0x02061, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_72D[] = { {"bullet", 6, 0x02022, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_72E[] = { {"demptyv", 7, 0x029B1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_733[] = { {"geqq", 4, 0x02267, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_734[] = { {"uuarr", 5, 0x021C8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_735[] = { {"Ocirc", 5, 0x000D4, 0}, {"utdot", 5, 0x022F0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_736[] = { {"ap", 2, 0x02248, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_738[] = { {"bNot", 4, 0x02AED, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_73B[] = { {"CirclePlus", 10, 0x02295, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_73D[] = { {"glE", 3, 0x02A92, 0}, {"midcir", 6, 0x02AF0, 0}, {"rppolint", 8, 0x02A12, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_73E[] = { {"boxDl", 5, 0x02556, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_73F[] = { {"sdot", 4, 0x022C5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_744[] = { {"boxDr", 5, 0x02553, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_745[] = { {"Xscr", 4, 0x1D4B3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_749[] = { {"dlcrop", 6, 0x0230D, 0}, {"gtrless", 7, 0x02277, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_74B[] = { {"aopf", 4, 0x1D552, 0}, {"operp", 5, 0x029B9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_74C[] = { {"kcy", 3, 0x0043A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_74F[] = { {"larrfs", 6, 0x0291D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_751[] = { {"rcub", 4, 0x0007D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_754[] = { {"nrtri", 5, 0x022EB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_755[] = { {"nparsl", 6, 0x02AFD, 0x020E5}, {"ocirc", 5, 0x000F4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_759[] = { {"gla", 3, 0x02AA5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_75C[] = { {"Iuml", 4, 0x000CF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_75F[] = { {"mcomma", 6, 0x02A29, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_762[] = { {"glj", 3, 0x02AA4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_763[] = { {"Map", 3, 0x02905, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_765[] = { {"copysr", 6, 0x02117, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_766[] = { {"DownTeeArrow", 12, 0x021A7, 0}, {"Upsi", 4, 0x003D2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_768[] = { {"awint", 5, 0x02A11, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_76E[] = { {"DownRightVector", 15, 0x021C1, 0}, {"NotEqual", 8, 0x02260, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_770[] = { {"gesl", 4, 0x022DB, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_772[] = { {"NotCupCap", 9, 0x0226D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_776[] = { {"blacktriangleright", 18, 0x025B8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_777[] = { {"zfr", 3, 0x1D537, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_779[] = { {"leftrightarrow", 14, 0x02194, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_77A[] = { {"Abreve", 6, 0x00102, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_77F[] = { {"Uarr", 4, 0x0219F, 0}, {"gnE", 3, 0x02269, 0}, {"supmult", 7, 0x02AC2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_781[] = { {"supplus", 7, 0x02AC0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_783[] = { {"nabla", 5, 0x02207, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_787[] = { {"Lang", 4, 0x027EA, 0}, {"laquo", 5, 0x000AB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_789[] = { {"larrhk", 6, 0x021A9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_78C[] = { {"Bopf", 4, 0x1D539, 0}, {"lowbar", 6, 0x0005F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_78D[] = { {"cup", 3, 0x0222A, 0}, {"dd", 2, 0x02146, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_78E[] = { {"nsce", 4, 0x02AB0, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_790[] = { {"nshortparallel", 14, 0x02226, 0}, {"nsupE", 5, 0x02AC6, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_794[] = { {"OpenCurlyQuote", 14, 0x02018, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_797[] = { {"bsolb", 5, 0x029C5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_798[] = { {"DScy", 4, 0x00405, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_79A[] = { {"boxHD", 5, 0x02566, 0}, {"ltrPar", 6, 0x02996, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_79B[] = { {"nscr", 4, 0x1D4C3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_79D[] = { {"lEg", 3, 0x02A8B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_79F[] = { {"egrave", 6, 0x000E8, 0}, {"gne", 3, 0x02A88, 0}, {"larrsim", 7, 0x02973, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7A0[] = { {"COPY", 4, 0x000A9, 0}, {"bdquo", 5, 0x0201E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7A1[] = { {"wopf", 4, 0x1D568, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7A2[] = { {"NotRightTriangleEqual", 21, 0x022ED, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7A5[] = { {"robrk", 5, 0x027E7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7A8[] = { {"kfr", 3, 0x1D528, 0}, {"nlsim", 5, 0x02274, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7AA[] = { {"xhArr", 5, 0x027FA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7AB[] = { {"boxHU", 5, 0x02569, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7AC[] = { {"lHar", 4, 0x02962, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7AE[] = { {"Mcy", 3, 0x0041C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7AF[] = { {"ee", 2, 0x02147, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7B0[] = { {"nsupe", 5, 0x02289, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7B1[] = { {"eg", 2, 0x02A9A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7B5[] = { {"trade", 5, 0x02122, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7B6[] = { {"el", 2, 0x02A99, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7B7[] = { {"nsucceq", 7, 0x02AB0, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7B8[] = { {"langle", 6, 0x027E8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7BA[] = { {"boxHd", 5, 0x02564, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7BB[] = { {"Subset", 6, 0x022D0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7BD[] = { {"DownArrowBar", 12, 0x02913, 0}, {"topbot", 6, 0x02336, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7BE[] = { {"OverBrace", 9, 0x023DE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7BF[] = { {"Eta", 3, 0x00397, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7C0[] = { {"hstrok", 6, 0x00127, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7C1[] = { {"Hacek", 5, 0x002C7, 0}, {"diamond", 7, 0x022C4, 0}, {"isinsv", 6, 0x022F3, 0}, {"rtriltri", 8, 0x029CE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7C9[] = { {"nvltrie", 7, 0x022B4, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7CB[] = { {"boxHu", 5, 0x02567, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7CD[] = { {"fpartint", 8, 0x02A0D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7CE[] = { {"Proportional", 12, 0x0221D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7D1[] = { {"NotSuperset", 11, 0x02283, 0x020D2}, {"gE", 2, 0x02267, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7D2[] = { {"scnsim", 6, 0x022E9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7D5[] = { {"uparrow", 7, 0x02191, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7D6[] = { {"ltlarr", 6, 0x02976, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7D9[] = { {"rtimes", 6, 0x022CA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7DA[] = { {"ncong", 5, 0x02247, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7DC[] = { {"Oscr", 4, 0x1D4AA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7E0[] = { {"vArr", 4, 0x021D5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7E2[] = { {"Xopf", 4, 0x1D54F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7E4[] = { {"notinva", 7, 0x02209, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7E5[] = { {"notinvb", 7, 0x022F7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7E6[] = { {"notinvc", 7, 0x022F6, 0}, {"nsqsube", 7, 0x022E2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7EC[] = { {"Tcaron", 6, 0x00164, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7EF[] = { {"upsilon", 7, 0x003C5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7F1[] = { {"ge", 2, 0x02265, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7F3[] = { {"gg", 2, 0x0226B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7F6[] = { {"KJcy", 4, 0x0040C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7F8[] = { {"gl", 2, 0x02277, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7FB[] = { {"dblac", 5, 0x002DD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_7FC[] = { {"lAtail", 6, 0x0291B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_800[] = { {"gt", 2, 0x0003E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_802[] = { {"lotimes", 7, 0x02A34, 0}, {"seArr", 5, 0x021D8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_803[] = { {"Lacute", 6, 0x00139, 0}, {"Laplacetrf", 10, 0x02112, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_808[] = { {"uuml", 4, 0x000FC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_809[] = { {"Amacr", 5, 0x00100, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_80A[] = { {"Mfr", 3, 0x1D510, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_810[] = { {"Int", 3, 0x0222C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_811[] = { {"Vvdash", 6, 0x022AA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_812[] = { {"Lcedil", 6, 0x0013B, 0}, {"larrlp", 6, 0x021AB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_816[] = { {"Larr", 4, 0x0219E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_819[] = { {"CircleTimes", 11, 0x02297, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_81C[] = { {"NotReverseElement", 17, 0x0220C, 0}, {"latail", 6, 0x02919, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_81D[] = { {"ntrianglerighteq", 16, 0x022ED, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_821[] = { {"blk12", 5, 0x02592, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_822[] = { {"intlarhk", 8, 0x02A17, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_823[] = { {"blk14", 5, 0x02591, 0}, {"ccupssm", 7, 0x02A50, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_824[] = { {"hercon", 6, 0x022B9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_828[] = { {"bigotimes", 9, 0x02A02, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_829[] = { {"amacr", 5, 0x00101, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_82D[] = { {"nrarrc", 6, 0x02933, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_82E[] = { {"ubreve", 6, 0x0016D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_830[] = { {"Yacute", 6, 0x000DD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_831[] = { {"ic", 2, 0x02063, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_832[] = { {"escr", 4, 0x0212F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_837[] = { {"ii", 2, 0x02148, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_838[] = { {"DownArrowUpArrow", 16, 0x021F5, 0}, {"nopf", 4, 0x1D55F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_83C[] = { {"in", 2, 0x02208, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_83E[] = { {"bumpE", 5, 0x02AAE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_83F[] = { {"rightharpoonup", 14, 0x021C0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_841[] = { {"nrarrw", 6, 0x0219D, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_842[] = { {"it", 2, 0x02062, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_846[] = { {"ncaron", 6, 0x00148, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_84A[] = { {"succnsim", 8, 0x022E9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_84C[] = { {"gammad", 6, 0x003DD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_84F[] = { {"yucy", 4, 0x0044E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_850[] = { {"ocy", 3, 0x0043E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_855[] = { {"hybull", 6, 0x02043, 0}, {"rpargt", 6, 0x02994, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_857[] = { {"csube", 5, 0x02AD1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_85B[] = { {"iiota", 5, 0x02129, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_85C[] = { {"nsim", 4, 0x02241, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_85E[] = { {"LeftTriangleEqual", 17, 0x022B4, 0}, {"bumpe", 5, 0x0224F, 0}, {"nearhk", 6, 0x02924, 0}, {"nhpar", 5, 0x02AF2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_861[] = { {"risingdotseq", 12, 0x02253, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_865[] = { {"blk34", 5, 0x02593, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_866[] = { {"LeftTriangle", 12, 0x022B2, 0}, {"vBarv", 5, 0x02AE9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_867[] = { {"AElig", 5, 0x000C6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_868[] = { {"DoubleUpDownArrow", 17, 0x021D5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_86A[] = { {"cwint", 5, 0x02231, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_86B[] = { {"rtrie", 5, 0x022B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_86C[] = { {"rtrif", 5, 0x025B8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_873[] = { {"Fscr", 4, 0x02131, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_876[] = { {"lE", 2, 0x02266, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_879[] = { {"Oopf", 4, 0x1D546, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_87B[] = { {"spar", 4, 0x02225, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_87E[] = { {"uplus", 5, 0x0228E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_88A[] = { {"sacute", 6, 0x0015B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_88C[] = { {"fltns", 5, 0x025B1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_88E[] = { {"rrarr", 5, 0x021C9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_892[] = { {"larrpl", 6, 0x02939, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_895[] = { {"ultri", 5, 0x025F8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_896[] = { {"le", 2, 0x02264, 0}, {"xuplus", 6, 0x02A04, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_897[] = { {"ljcy", 4, 0x00459, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_898[] = { {"lg", 2, 0x02276, 0}, {"vsubnE", 6, 0x02ACB, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_899[] = { {"scedil", 6, 0x0015F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_89D[] = { {"ll", 2, 0x0226A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8A5[] = { {"lt", 2, 0x0003C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8AC[] = { {"ofr", 3, 0x1D52C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8B3[] = { {"nexists", 7, 0x02204, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8B6[] = { {"smallsetminus", 13, 0x02216, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8B7[] = { {"InvisibleComma", 14, 0x02063, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8B8[] = { {"dotminus", 8, 0x02238, 0}, {"vsubne", 6, 0x0228A, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8B9[] = { {"iocy", 4, 0x00451, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8BA[] = { {"gsime", 5, 0x02A8E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8BC[] = { {"Rarrtl", 6, 0x02916, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8BD[] = { {"cirmid", 6, 0x02AEF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8C0[] = { {"ominus", 6, 0x02296, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8C1[] = { {"gsiml", 5, 0x02A90, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8C2[] = { {"Prime", 5, 0x02033, 0}, {"mp", 2, 0x02213, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8C4[] = { {"tint", 4, 0x0222D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8C7[] = { {"mu", 2, 0x003BC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8CF[] = { {"dbkarow", 7, 0x0290F, 0}, {"eopf", 4, 0x1D556, 0}, {"ogt", 3, 0x029C1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8D0[] = { {"Precedes", 8, 0x0227A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8D3[] = { {"UpTeeArrow", 10, 0x021A5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8D6[] = { {"varsupsetneq", 12, 0x0228B, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8D8[] = { {"ne", 2, 0x02260, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8DC[] = { {"ni", 2, 0x0220B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8DD[] = { {"mDDot", 5, 0x0223A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8DE[] = { {"cularrp", 7, 0x0293D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8DF[] = { {"rnmid", 5, 0x02AEE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8E0[] = { {"hardcy", 6, 0x0044A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8E2[] = { {"prime", 5, 0x02032, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8E3[] = { {"Bcy", 3, 0x00411, 0}, {"REG", 3, 0x000AE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8E7[] = { {"oS", 2, 0x024C8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8E8[] = { {"nu", 2, 0x003BD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8E9[] = { {"ohm", 3, 0x003A9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8EB[] = { {"langd", 5, 0x02991, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8F3[] = { {"backprime", 9, 0x02035, 0}, {"esim", 4, 0x02242, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8FB[] = { {"veeeq", 5, 0x0225A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8FE[] = { {"RightCeiling", 12, 0x02309, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_8FF[] = { {"crarr", 5, 0x021B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_904[] = { {"eqsim", 5, 0x02242, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_906[] = { {"or", 2, 0x02228, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_907[] = { {"OverParenthesis", 15, 0x023DC, 0}, {"UpperLeftArrow", 14, 0x02196, 0}, {"nleftrightarrow", 15, 0x021AE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_909[] = { {"expectation", 11, 0x02130, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_90C[] = { {"coprod", 6, 0x02210, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_90E[] = { {"Qfr", 3, 0x1D514, 0}, {"dArr", 4, 0x021D3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_910[] = { {"Fopf", 4, 0x1D53D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_913[] = { {"Cconint", 7, 0x02230, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_916[] = { {"larrtl", 6, 0x021A2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_918[] = { {"Aacute", 6, 0x000C1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_919[] = { {"DownLeftRightVector", 19, 0x02950, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_91B[] = { {"circleddash", 11, 0x0229D, 0}, {"thinsp", 6, 0x02009, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_91E[] = { {"Longrightarrow", 14, 0x027F9, 0}, {"pi", 2, 0x003C0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_91F[] = { {"hookrightarrow", 14, 0x021AA, 0}, {"rscr", 4, 0x1D4C7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_920[] = { {"scE", 3, 0x02AB4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_922[] = { {"pm", 2, 0x000B1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_923[] = { {"ZHcy", 4, 0x00416, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_927[] = { {"pr", 2, 0x0227A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_929[] = { {"LongLeftRightArrow", 18, 0x027F7, 0}, {"supset", 6, 0x02283, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_92A[] = { {"UpArrowBar", 10, 0x02912, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_92C[] = { {"Utilde", 6, 0x00168, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_92E[] = { {"xlArr", 5, 0x027F8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_930[] = { {"DoubleUpArrow", 13, 0x021D1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_936[] = { {"alefsym", 7, 0x02135, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_939[] = { {"Scirc", 5, 0x0015C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_93B[] = { {"xotime", 6, 0x02A02, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_93F[] = { {"Bfr", 3, 0x1D505, 0}, {"rdca", 4, 0x02937, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_940[] = { {"sce", 3, 0x02AB0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_945[] = { {"Nacute", 6, 0x00143, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_947[] = { {"amalg", 5, 0x02A3F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_94D[] = { {"UpDownArrow", 11, 0x02195, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_94F[] = { {"EqualTilde", 10, 0x02242, 0}, {"boxUL", 5, 0x0255D, 0}, {"oslash", 6, 0x000F8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_950[] = { {"lnap", 4, 0x02A89, 0}, {"thorn", 5, 0x000FE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_952[] = { {"ssmile", 6, 0x02323, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_953[] = { {"ndash", 5, 0x02013, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_954[] = { {"Ncedil", 6, 0x00145, 0}, {"scy", 3, 0x00441, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_955[] = { {"boxUR", 5, 0x0255A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_956[] = { {"Aring", 5, 0x000C5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_959[] = { {"scirc", 5, 0x0015D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_95B[] = { {"ccaron", 6, 0x0010D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_95D[] = { {"dotsquare", 9, 0x022A1, 0}, {"nshortmid", 9, 0x02224, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_95F[] = { {"rsquo", 5, 0x02019, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_960[] = { {"Sscr", 4, 0x1D4AE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_963[] = { {"bigwedge", 8, 0x022C0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_964[] = { {"Bernoullis", 10, 0x0212C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_969[] = { {"harrw", 5, 0x021AD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_96C[] = { {"SquareSubset", 12, 0x0228F, 0}, {"boxVH", 5, 0x0256C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_96F[] = { {"boxUl", 5, 0x0255C, 0}, {"rx", 2, 0x0211E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_970[] = { {"boxVL", 5, 0x02563, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_974[] = { {"olt", 3, 0x029C0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_975[] = { {"boxUr", 5, 0x02559, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_976[] = { {"aring", 5, 0x000E5, 0}, {"boxVR", 5, 0x02560, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_97B[] = { {"sc", 2, 0x0227B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_97C[] = { {"NestedGreaterGreater", 20, 0x0226B, 0}, {"oast", 4, 0x0229B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_97F[] = { {"star", 4, 0x02606, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_981[] = { {"LeftTeeVector", 13, 0x0295A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_983[] = { {"bigsqcup", 8, 0x02A06, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_985[] = { {"dcy", 3, 0x00434, 0}, {"preceq", 6, 0x02AAF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_986[] = { {"otilde", 6, 0x000F5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_988[] = { {"luruhar", 7, 0x02966, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_98C[] = { {"boxVh", 5, 0x0256B, 0}, {"capand", 6, 0x02A44, 0}, {"yuml", 4, 0x000FF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_98D[] = { {"Updownarrow", 11, 0x021D5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_98F[] = { {"TildeEqual", 10, 0x02243, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_990[] = { {"boxVl", 5, 0x02562, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_996[] = { {"boxVr", 5, 0x0255F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_997[] = { {"HorizontalLine", 14, 0x02500, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_99B[] = { {"xmap", 4, 0x027FC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_99C[] = { {"sigmaf", 6, 0x003C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_99E[] = { {"EmptySmallSquare", 16, 0x025FB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_99F[] = { {"dzcy", 4, 0x0045F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9A0[] = { {"cups", 4, 0x0222A, 0x0FE00}, {"zwj", 3, 0x0200D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9A1[] = { {"beta", 4, 0x003B2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9A6[] = { {"supsim", 6, 0x02AC8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9A8[] = { {"beth", 4, 0x02136, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9AA[] = { {"Iukcy", 5, 0x00406, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9AC[] = { {"eparsl", 6, 0x029E3, 0}, {"sigmav", 6, 0x003C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9B0[] = { {"lhard", 5, 0x021BD, 0}, {"sfr", 3, 0x1D530, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9B4[] = { {"nsqsupe", 7, 0x022E3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9B5[] = { {"Jsercy", 6, 0x00408, 0}, {"deg", 3, 0x000B0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9B6[] = { {"Ucy", 3, 0x00423, 0}, {"iscr", 4, 0x1D4BE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9B7[] = { {"efDot", 5, 0x02252, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9BB[] = { {"uhblk", 5, 0x02580, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9BC[] = { {"ropf", 4, 0x1D563, 0}, {"vprop", 5, 0x0221D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9BD[] = { {"isinE", 5, 0x022F9, 0}, {"raemptyv", 8, 0x029B3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9C1[] = { {"lharu", 5, 0x021BC, 0}, {"ncongdot", 8, 0x02A6D, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9C2[] = { {"subnE", 5, 0x02ACB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9C3[] = { {"ngsim", 5, 0x02275, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9C5[] = { {"starf", 5, 0x02605, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9C9[] = { {"Ograve", 6, 0x000D2, 0}, {"hksearow", 8, 0x02925, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9CA[] = { {"iukcy", 5, 0x00456, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9CC[] = { {"uacute", 6, 0x000FA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9CF[] = { {"asymp", 5, 0x02248, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9D5[] = { {"lneq", 4, 0x02A87, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9D6[] = { {"Otimes", 6, 0x02A37, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9DA[] = { {"NotTildeTilde", 13, 0x02249, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9DB[] = { {"Integral", 8, 0x0222B, 0}, {"rbrke", 5, 0x0298C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9DD[] = { {"nsub", 4, 0x02284, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9DE[] = { {"rlhar", 5, 0x021CC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9E1[] = { {"dfr", 3, 0x1D521, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9E2[] = { {"subne", 5, 0x0228A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9E5[] = { {"varnothing", 10, 0x02205, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9E7[] = { {"Fcy", 3, 0x00424, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9E9[] = { {"DoubleLeftTee", 13, 0x02AE4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9EB[] = { {"isins", 5, 0x022F4, 0}, {"nsup", 4, 0x02285, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9ED[] = { {"circlearrowleft", 15, 0x021BA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9EE[] = { {"isinv", 5, 0x02208, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9EF[] = { {"IEcy", 4, 0x00415, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9F0[] = { {"conint", 6, 0x0222E, 0}, {"vBar", 4, 0x02AE8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9F1[] = { {"edot", 4, 0x00117, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9F2[] = { {"Kappa", 5, 0x0039A, 0}, {"MediumSpace", 11, 0x0205F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9F3[] = { {"lbrksld", 7, 0x0298F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9F4[] = { {"sect", 4, 0x000A7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9F5[] = { {"nldr", 4, 0x02025, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9F7[] = { {"Jscr", 4, 0x1D4A5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9F9[] = { {"shy", 3, 0x000AD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9FA[] = { {"ulcrop", 6, 0x0230F, 0}, {"veebar", 6, 0x022BB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_9FD[] = { {"Sopf", 4, 0x1D54A, 0}, {"cuwed", 5, 0x022CF, 0}, {"rAarr", 5, 0x021DB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A01[] = { {"erarr", 5, 0x02971, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A04[] = { {"lbrkslu", 7, 0x0298D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A05[] = { {"NotSucceeds", 11, 0x02281, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A06[] = { {"nsccue", 6, 0x022E1, 0}, {"subrarr", 7, 0x02979, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A08[] = { {"looparrowright", 14, 0x021AC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A0C[] = { {"wp", 2, 0x02118, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A0D[] = { {"Emacr", 5, 0x00112, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A0E[] = { {"sim", 3, 0x0223C, 0}, {"wr", 2, 0x02240, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A10[] = { {"Udblac", 6, 0x00170, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A12[] = { {"Ufr", 3, 0x1D518, 0}, {"kappa", 5, 0x003BA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A14[] = { {"notindot", 8, 0x022F5, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A15[] = { {"nleq", 4, 0x02270, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A16[] = { {"NestedLessLess", 14, 0x0226A, 0}, {"square", 6, 0x025A1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A17[] = { {"nles", 4, 0x02A7D, 0x00338}, {"squarf", 6, 0x025AA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A21[] = { {"order", 5, 0x02134, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A23[] = { {"igrave", 6, 0x000EC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A24[] = { {"precneqq", 8, 0x02AB5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A25[] = { {"csupe", 5, 0x02AD2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A26[] = { {"xi", 2, 0x003BE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A28[] = { {"NotHumpEqual", 12, 0x0224F, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A2A[] = { {"ord", 3, 0x02A5D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A2D[] = { {"emacr", 5, 0x00113, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A30[] = { {"nwnear", 6, 0x02927, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A32[] = { {"nprcue", 6, 0x022E0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A36[] = { {"NotExists", 9, 0x02204, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A37[] = { {"die", 3, 0x000A8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A39[] = { {"ddotseq", 7, 0x02A77, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A3B[] = { {"Dashv", 5, 0x02AE4, 0}, {"Ucirc", 5, 0x000DB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A3C[] = { {"orv", 3, 0x02A5B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A3D[] = { {"Because", 7, 0x02235, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A41[] = { {"kgreen", 6, 0x00138, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A43[] = { {"Ffr", 3, 0x1D509, 0}, {"LeftVector", 10, 0x021BC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A44[] = { {"lstrok", 6, 0x00142, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A45[] = { {"twixt", 5, 0x0226C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A48[] = { {"compfn", 6, 0x02218, 0}, {"div", 3, 0x000F7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A4F[] = { {"drcrop", 6, 0x0230C, 0}, {"shortmid", 8, 0x02223, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A53[] = { {"iopf", 4, 0x1D55A, 0}, {"triangledown", 12, 0x025BF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A54[] = { {"IJlig", 5, 0x00132, 0}, {"thetasym", 8, 0x003D1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A56[] = { {"Sigma", 5, 0x003A3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A57[] = { {"equivDD", 7, 0x02A78, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A5A[] = { {"Cacute", 6, 0x00106, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A5B[] = { {"dashv", 5, 0x022A3, 0}, {"ucirc", 5, 0x000FB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A61[] = { {"gneqq", 5, 0x02269, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A62[] = { {"gvertneqq", 9, 0x02269, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A63[] = { {"RightDownVectorBar", 18, 0x02955, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A64[] = { {"NotLessLess", 11, 0x0226A, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A69[] = { {"Ccedil", 6, 0x000C7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A6A[] = { {"odblac", 6, 0x00151, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A6B[] = { {"mstpos", 6, 0x0223E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A6D[] = { {"cemptyv", 7, 0x029B2, 0}, {"rarrap", 6, 0x02975, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A6F[] = { {"rmoust", 6, 0x023B1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A70[] = { {"elsdot", 6, 0x02A97, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A76[] = { {"sigma", 5, 0x003C3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A78[] = { {"Implies", 7, 0x021D2, 0}, {"isin", 4, 0x02208, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A7A[] = { {"bottom", 6, 0x022A5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A7E[] = { {"ShortRightArrow", 15, 0x02192, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A81[] = { {"cupcap", 6, 0x02A46, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A82[] = { {"NotSquareSuperset", 17, 0x02290, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A84[] = { {"LeftArrowRightArrow", 19, 0x021C6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A85[] = { {"FilledVerySmallSquare", 21, 0x025AA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A86[] = { {"LeftUpTeeVector", 15, 0x02960, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A89[] = { {"DoubleRightArrow", 16, 0x021D2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A8D[] = { {"raquo", 5, 0x000BB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A8E[] = { {"Ascr", 4, 0x1D49C, 0}, {"ReverseUpEquilibrium", 20, 0x0296F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A92[] = { {"hArr", 4, 0x021D4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A94[] = { {"Jopf", 4, 0x1D541, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A96[] = { {"npar", 4, 0x02226, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A98[] = { {"SupersetEqual", 13, 0x02287, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A99[] = { {"ffllig", 6, 0x0FB04, 0}, {"smt", 3, 0x02AAA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A9A[] = { {"twoheadrightarrow", 17, 0x021A0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A9D[] = { {"ecaron", 6, 0x0011B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_A9F[] = { {"NotRightTriangleBar", 19, 0x029D0, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AA3[] = { {"apid", 4, 0x0224B, 0}, {"vscr", 4, 0x1D4CB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AA4[] = { {"supdot", 6, 0x02ABE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AA5[] = { {"colone", 6, 0x02254, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AA7[] = { {"dwangle", 7, 0x029A6, 0}, {"shchcy", 6, 0x00449, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AAC[] = { {"ltdot", 5, 0x022D6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AB2[] = { {"downharpoonright", 16, 0x021C2, 0}, {"gjcy", 4, 0x00453, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AB4[] = { {"wfr", 3, 0x1D534, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AB5[] = { {"rfisht", 6, 0x0297D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ABA[] = { {"Ycy", 3, 0x0042B, 0}, {"swarrow", 7, 0x02199, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AC0[] = { {"nharr", 5, 0x021AE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AC4[] = { {"frac12", 6, 0x000BD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AC5[] = { {"frac13", 6, 0x02153, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AC6[] = { {"frac14", 6, 0x000BC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AC7[] = { {"GreaterEqual", 12, 0x02265, 0}, {"frac15", 6, 0x02155, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AC8[] = { {"Gamma", 5, 0x00393, 0}, {"frac16", 6, 0x02159, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ACA[] = { {"dzigrarr", 8, 0x027FF, 0}, {"frac18", 6, 0x0215B, 0}, {"rcaron", 6, 0x00159, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ACC[] = { {"DownRightTeeVector", 18, 0x0295F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ACF[] = { {"nvrtrie", 7, 0x022B5, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AD2[] = { {"iota", 4, 0x003B9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AD3[] = { {"sol", 3, 0x0002F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AD4[] = { {"rbrace", 6, 0x0007D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ADA[] = { {"rbrack", 6, 0x0005D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ADD[] = { {"rsqb", 4, 0x0005D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ADF[] = { {"oint", 4, 0x0222E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AE4[] = { {"Wscr", 4, 0x1D4B2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AE5[] = { {"hfr", 3, 0x1D525, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AE6[] = { {"frac23", 6, 0x02154, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AE7[] = { {"dlcorn", 6, 0x0231E, 0}, {"verbar", 6, 0x0007C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AE8[] = { {"frac25", 6, 0x02156, 0}, {"gamma", 5, 0x003B3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AE9[] = { {"nVDash", 6, 0x022AF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AEB[] = { {"Jcy", 3, 0x00419, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AF5[] = { {"nwarrow", 7, 0x02196, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AF6[] = { {"OverBar", 7, 0x0203E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AF7[] = { {"rightsquigarrow", 15, 0x0219D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AFA[] = { {"iexcl", 5, 0x000A1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AFD[] = { {"sqcap", 5, 0x02293, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_AFE[] = { {"pertenk", 7, 0x02031, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B08[] = { {"PrecedesEqual", 13, 0x02AAF, 0}, {"frac34", 6, 0x000BE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B09[] = { {"Therefore", 9, 0x02234, 0}, {"frac35", 6, 0x02157, 0}, {"nvDash", 6, 0x022AD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B0A[] = { {"odsold", 6, 0x029BC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B0C[] = { {"dot", 3, 0x002D9, 0}, {"frac38", 6, 0x0215C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B10[] = { {"sqcaps", 6, 0x02293, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B11[] = { {"ZeroWidthSpace", 14, 0x0200B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B15[] = { {"rarrfs", 6, 0x0291E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B16[] = { {"Yfr", 3, 0x1D51C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B1E[] = { {"CircleDot", 9, 0x02299, 0}, {"gtcir", 5, 0x02A7A, 0}, {"squ", 3, 0x025A1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B1F[] = { {"angmsd", 6, 0x02221, 0}, {"nsubseteq", 9, 0x02288, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B23[] = { {"iprod", 5, 0x02A3C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B24[] = { {"bprime", 6, 0x02035, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B27[] = { {"supsub", 6, 0x02AD4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B29[] = { {"SquareSupersetEqual", 19, 0x02292, 0}, {"therefore", 9, 0x02234, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B2A[] = { {"frac45", 6, 0x02158, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B2B[] = { {"Aopf", 4, 0x1D538, 0}, {"NotGreaterFullEqual", 19, 0x02267, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B2C[] = { {"Tstrok", 6, 0x00166, 0}, {"rightleftarrows", 15, 0x021C4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B2D[] = { {"Fouriertrf", 10, 0x02131, 0}, {"epar", 4, 0x022D5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B2E[] = { {"omid", 4, 0x029B6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B2F[] = { {"OpenCurlyDoubleQuote", 20, 0x0201C, 0}, {"dagger", 6, 0x02020, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B33[] = { {"semi", 4, 0x0003B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B35[] = { {"supsup", 6, 0x02AD6, 0}, {"zeetrf", 6, 0x02128, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B36[] = { {"DifferentialD", 13, 0x02146, 0}, {"topcir", 6, 0x02AF1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B3A[] = { {"mscr", 4, 0x1D4C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B3D[] = { {"Wcirc", 5, 0x00174, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B3E[] = { {"boxdL", 5, 0x02555, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B40[] = { {"Gbreve", 6, 0x0011E, 0}, {"vopf", 4, 0x1D567, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B42[] = { {"lap", 3, 0x02A85, 0}, {"llarr", 5, 0x021C7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B44[] = { {"boxdR", 5, 0x02552, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B46[] = { {"RightAngleBracket", 17, 0x027E9, 0}, {"lat", 3, 0x02AAB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B47[] = { {"Jfr", 3, 0x1D50D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B4C[] = { {"frac56", 6, 0x0215A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B4E[] = { {"frac58", 6, 0x0215D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B4F[] = { {"rarrhk", 6, 0x021AA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B50[] = { {"lesdot", 6, 0x02A7F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B51[] = { {"ApplyFunction", 13, 0x02061, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B52[] = { {"NotGreaterTilde", 15, 0x02275, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B53[] = { {"Cedilla", 7, 0x000B8, 0}, {"curvearrowright", 15, 0x021B7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B56[] = { {"rdsh", 4, 0x021B3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B58[] = { {"larrb", 5, 0x021E4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B5C[] = { {"vrtri", 5, 0x022B3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B5D[] = { {"nequiv", 6, 0x02262, 0}, {"wcirc", 5, 0x00175, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B5E[] = { {"boxdl", 5, 0x02510, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B63[] = { {"DoubleDownArrow", 15, 0x021D3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B64[] = { {"boxdr", 5, 0x0250C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B67[] = { {"pluscir", 7, 0x02A22, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B69[] = { {"longmapsto", 10, 0x027FC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B6B[] = { {"gnap", 4, 0x02A8A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B6D[] = { {"bigodot", 7, 0x02A00, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B72[] = { {"thickapprox", 11, 0x02248, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B73[] = { {"DotDot", 6, 0x020DC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B77[] = { {"incare", 6, 0x02105, 0}, {"rarrbfs", 7, 0x02920, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B78[] = { {"apos", 4, 0x00027, 0}, {"tbrk", 4, 0x023B4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B7A[] = { {"grave", 5, 0x00060, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B7B[] = { {"Nscr", 4, 0x1D4A9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B7E[] = { {"rangle", 6, 0x027E9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B7F[] = { {"uArr", 4, 0x021D1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B81[] = { {"Wopf", 4, 0x1D54E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B82[] = { {"doteq", 5, 0x02250, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B87[] = { {"times", 5, 0x000D7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B8D[] = { {"fflig", 5, 0x0FB00, 0}, {"lcy", 3, 0x0043B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B8F[] = { {"sub", 3, 0x02282, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B90[] = { {"frac78", 6, 0x0215E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B94[] = { {"xrarr", 5, 0x027F6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B98[] = { {"UpArrowDownArrow", 16, 0x021C5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B99[] = { {"bbrktbrk", 8, 0x023B6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B9A[] = { {"abreve", 6, 0x00103, 0}, {"lsaquo", 6, 0x02039, 0}, {"sum", 3, 0x02211, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B9C[] = { {"Eacute", 6, 0x000C9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_B9D[] = { {"sup", 3, 0x02283, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BA5[] = { {"ContourIntegral", 15, 0x0222E, 0}, {"DiacriticalDot", 14, 0x002D9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BA9[] = { {"trisb", 5, 0x029CD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BAE[] = { {"Hcirc", 5, 0x00124, 0}, {"lceil", 5, 0x02308, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BB2[] = { {"Zcaron", 6, 0x0017D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BB5[] = { {"looparrowleft", 13, 0x021AB, 0}, {"oelig", 5, 0x00153, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BB6[] = { {"LessSlantEqual", 14, 0x02A7D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BB7[] = { {"NegativeThinSpace", 17, 0x0200B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BBA[] = { {"boxhD", 5, 0x02565, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BBC[] = { {"omicron", 7, 0x003BF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BBD[] = { {"leg", 3, 0x022DA, 0}, {"rightthreetimes", 15, 0x022CC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BBF[] = { {"NotSucceedsSlantEqual", 21, 0x022E1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BC1[] = { {"angmsdaa", 8, 0x029A8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BC2[] = { {"angmsdab", 8, 0x029A9, 0}, {"rAtail", 6, 0x0291C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BC3[] = { {"angmsdac", 8, 0x029AA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BC4[] = { {"angmsdad", 8, 0x029AB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BC5[] = { {"angmsdae", 8, 0x029AC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BC6[] = { {"angmsdaf", 8, 0x029AD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BC7[] = { {"angmsdag", 8, 0x029AE, 0}, {"leq", 3, 0x02264, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BC8[] = { {"angmsdah", 8, 0x029AF, 0}, {"solbar", 6, 0x0233F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BC9[] = { {"Racute", 6, 0x00154, 0}, {"les", 3, 0x02A7D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BCB[] = { {"boxhU", 5, 0x02568, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BCE[] = { {"hcirc", 5, 0x00125, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BD1[] = { {"dscr", 4, 0x1D4B9, 0}, {"smashp", 6, 0x02A33, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BD7[] = { {"mopf", 4, 0x1D55E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BD8[] = { {"Rcedil", 6, 0x00156, 0}, {"dscy", 4, 0x00455, 0}, {"prap", 4, 0x02AB7, 0}, {"rarrlp", 6, 0x021AC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BD9[] = { {"Aogon", 5, 0x00104, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BDA[] = { {"boxhd", 5, 0x0252C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BDB[] = { {"subset", 6, 0x02282, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BDD[] = { {"lgE", 3, 0x02A91, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BDF[] = { {"epsilon", 7, 0x003B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BE1[] = { {"curarrm", 7, 0x0293C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BE2[] = { {"ratail", 6, 0x0291A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BE4[] = { {"DoubleLongLeftRightArrow", 24, 0x027FA, 0}, {"rhov", 4, 0x003F1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BE7[] = { {"LeftDoubleBracket", 17, 0x027E6, 0}, {"Lleftarrow", 10, 0x021DA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BE8[] = { {"Uuml", 4, 0x000DC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BE9[] = { {"lfr", 3, 0x1D529, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BEA[] = { {"minusdu", 7, 0x02A2A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BEB[] = { {"boxhu", 5, 0x02534, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BEF[] = { {"Ncy", 3, 0x0041D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BF0[] = { {"gneq", 4, 0x02A88, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BF1[] = { {"rangd", 5, 0x02992, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BF2[] = { {"range", 5, 0x029A5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BF3[] = { {"lfloor", 6, 0x0230A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BF7[] = { {"NotSucceedsTilde", 16, 0x0227F, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BF9[] = { {"aogon", 5, 0x00105, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BFA[] = { {"NotGreaterSlantEqual", 20, 0x02A7E, 0x00338}, {"NotSquareSupersetEqual", 22, 0x022E3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_BFC[] = { {"profsurf", 8, 0x02313, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C02[] = { {"wedgeq", 6, 0x02259, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C0B[] = { {"Alpha", 5, 0x00391, 0}, {"DiacriticalDoubleAcute", 22, 0x002DD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C0C[] = { {"lltri", 5, 0x025FA, 0}, {"tcaron", 6, 0x00165, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C11[] = { {"Imacr", 5, 0x0012A, 0}, {"subseteq", 8, 0x02286, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C12[] = { {"Escr", 4, 0x02130, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C16[] = { {"lArr", 4, 0x021D0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C18[] = { {"Nopf", 4, 0x02115, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C1A[] = { {"rpar", 4, 0x00029, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C1D[] = { {"divonx", 6, 0x022C7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C1E[] = { {"olcir", 5, 0x029BE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C23[] = { {"lacute", 6, 0x0013A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C27[] = { {"zscr", 4, 0x1D4CF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C2B[] = { {"alpha", 5, 0x003B1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C31[] = { {"imacr", 5, 0x0012B, 0}, {"vellip", 6, 0x022EE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C32[] = { {"lcedil", 6, 0x0013C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C33[] = { {"sime", 4, 0x02243, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C34[] = { {"empty", 5, 0x02205, 0}, {"imped", 5, 0x001B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C35[] = { {"simg", 4, 0x02A9E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C36[] = { {"kjcy", 4, 0x0045C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C3A[] = { {"siml", 4, 0x02A9D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C3E[] = { {"LessEqualGreater", 16, 0x022DA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C3F[] = { {"Ycirc", 5, 0x00176, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C40[] = { {"RoundImplies", 12, 0x02970, 0}, {"nvrArr", 6, 0x02903, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C43[] = { {"check", 5, 0x02713, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C44[] = { {"nlarr", 5, 0x0219A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C46[] = { {"middot", 6, 0x000B7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C48[] = { {"par", 3, 0x02225, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C4A[] = { {"NotGreaterGreater", 17, 0x0226B, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C4B[] = { {"Nfr", 3, 0x1D511, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C4F[] = { {"nwArr", 5, 0x021D6, 0}, {"prec", 4, 0x0227A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C50[] = { {"Barv", 4, 0x02AE7, 0}, {"yacute", 6, 0x000FD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C54[] = { {"DoubleLeftRightArrow", 20, 0x021D4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C58[] = { {"Coproduct", 9, 0x02210, 0}, {"rarrpl", 6, 0x02945, 0}, {"subsim", 6, 0x02AC7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C5A[] = { {"ntgl", 4, 0x02279, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C5B[] = { {"LeftTriangleBar", 15, 0x029CF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C5F[] = { {"ycirc", 5, 0x00177, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C69[] = { {"doteqdot", 8, 0x02251, 0}, {"nang", 4, 0x02220, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C6B[] = { {"bigcap", 6, 0x022C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C6C[] = { {"CHcy", 4, 0x00427, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C6E[] = { {"dopf", 4, 0x1D555, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C72[] = { {"inodot", 6, 0x00131, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C76[] = { {"nvHarr", 6, 0x02904, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C77[] = { {"laemptyv", 8, 0x029B4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C78[] = { {"bigcirc", 7, 0x025EF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C7A[] = { {"scnap", 5, 0x02ABA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C7B[] = { {"DownLeftVector", 14, 0x021BD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C80[] = { {"race", 4, 0x0223D, 0x00331}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C82[] = { {"vartriangleright", 16, 0x022B3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C89[] = { {"napE", 4, 0x02A70, 0x00338}, {"supedot", 7, 0x02AC4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C8E[] = { {"acE", 3, 0x0223E, 0x00333}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C91[] = { {"pcy", 3, 0x0043F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C93[] = { {"qprime", 6, 0x02057, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C94[] = { {"RightTeeVector", 14, 0x0295B, 0}, {"curlyvee", 8, 0x022CE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C95[] = { {"swarhk", 6, 0x02926, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_C98[] = { {"Atilde", 6, 0x000C3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CA6[] = { {"bbrk", 4, 0x023B5, 0}, {"prnap", 5, 0x02AB9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CA8[] = { {"image", 5, 0x02111, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CA9[] = { {"sext", 4, 0x02736, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CAA[] = { {"ldquo", 5, 0x0201C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CAC[] = { {"NotLeftTriangleBar", 18, 0x029CF, 0x00338}, {"epsiv", 5, 0x003F5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CAD[] = { {"CenterDot", 9, 0x000B7, 0}, {"acd", 3, 0x0223F, 0}, {"upuparrows", 10, 0x021C8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CAF[] = { {"Eopf", 4, 0x1D53C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CB0[] = { {"Jcirc", 5, 0x00134, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CB2[] = { {"smid", 4, 0x02223, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CB4[] = { {"bull", 4, 0x02022, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CB6[] = { {"rhard", 5, 0x021C1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CB7[] = { {"nsupset", 7, 0x02283, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CBA[] = { {"npre", 4, 0x02AAF, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CBE[] = { {"qscr", 4, 0x1D4C6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CC2[] = { {"acy", 3, 0x00430, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CC4[] = { {"lnE", 3, 0x02268, 0}, {"zopf", 4, 0x1D56B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CC5[] = { {"Ntilde", 6, 0x000D1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CC7[] = { {"rharu", 5, 0x021C0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CC8[] = { {"kappav", 6, 0x003F0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CC9[] = { {"timesb", 6, 0x022A0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CCB[] = { {"iiiint", 6, 0x02A0C, 0}, {"timesd", 6, 0x02A30, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CD0[] = { {"jcirc", 5, 0x00135, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CD2[] = { {"nsimeq", 6, 0x02244, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CD3[] = { {"Esim", 4, 0x02A73, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CD9[] = { {"Cap", 3, 0x022D2, 0}, {"bump", 4, 0x0224E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CDA[] = { {"lvnE", 4, 0x02268, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CDC[] = { {"rarrtl", 6, 0x021A3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CE4[] = { {"lne", 3, 0x02A87, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CE6[] = { {"commat", 6, 0x00040, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CE8[] = { {"hslash", 6, 0x0210F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CE9[] = { {"lthree", 6, 0x022CB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CED[] = { {"Gcedil", 6, 0x00122, 0}, {"pfr", 3, 0x1D52D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CF1[] = { {"RightTriangleEqual", 18, 0x022B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CF2[] = { {"ngeqslant", 9, 0x02A7E, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CF3[] = { {"Rcy", 3, 0x00420, 0}, {"gimel", 5, 0x02137, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CF4[] = { {"curarr", 6, 0x021B7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CFA[] = { {"ntlg", 4, 0x02278, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_CFF[] = { {"Rscr", 4, 0x0211B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D00[] = { {"urcrop", 6, 0x0230E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D06[] = { {"Poincareplane", 13, 0x0210C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D07[] = { {"NoBreak", 7, 0x02060, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D0B[] = { {"lcub", 4, 0x0007B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D0E[] = { {"nltri", 5, 0x022EA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D10[] = { {"blacktriangledown", 17, 0x025BE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D11[] = { {"fjlig", 5, 0x00066, 0x0006A}, {"percnt", 6, 0x00025, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D12[] = { {"rightharpoondown", 16, 0x021C1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D13[] = { {"LeftAngleBracket", 16, 0x027E8, 0}, {"npreceq", 7, 0x02AAF, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D15[] = { {"cupcup", 6, 0x02A4A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D18[] = { {"LeftVectorBar", 13, 0x02952, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D19[] = { {"NJcy", 4, 0x0040A, 0}, {"triangleright", 13, 0x025B9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D1A[] = { {"Tcedil", 6, 0x00162, 0}, {"loz", 3, 0x025CA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D1E[] = { {"afr", 3, 0x1D51E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D1F[] = { {"NotLessTilde", 12, 0x02274, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D20[] = { {"NotElement", 10, 0x02209, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D22[] = { {"NotHumpDownHump", 15, 0x0224E, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D24[] = { {"SquareSubsetEqual", 17, 0x02291, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D26[] = { {"nleqq", 5, 0x02266, 0x00338}, {"phi", 3, 0x003C6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D2A[] = { {"NotRightTriangle", 16, 0x022EB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D32[] = { {"lhblk", 5, 0x02584, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D34[] = { {"caret", 5, 0x02041, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D35[] = { {"bsemi", 5, 0x0204F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D38[] = { {"aacute", 6, 0x000E1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D39[] = { {"mapsto", 6, 0x021A6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D3A[] = { {"Congruent", 9, 0x02261, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D3B[] = { {"Vdash", 5, 0x022A9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D3E[] = { {"longrightarrow", 14, 0x027F6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D42[] = { {"iinfin", 6, 0x029DC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D44[] = { {"EmptyVerySmallSquare", 20, 0x025AB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D49[] = { {"real", 4, 0x0211C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D4C[] = { {"SucceedsEqual", 13, 0x02AB0, 0}, {"utilde", 6, 0x00169, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D4F[] = { {"Rfr", 3, 0x0211C, 0}, {"tau", 3, 0x003C4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D51[] = { {"Wedge", 5, 0x022C0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D54[] = { {"piv", 3, 0x003D6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D55[] = { {"hscr", 4, 0x1D4BD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D56[] = { {"subdot", 6, 0x02ABD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D57[] = { {"dsol", 4, 0x029F6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D5A[] = { {"prnE", 4, 0x02AB5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D5B[] = { {"qopf", 4, 0x1D562, 0}, {"vdash", 5, 0x022A2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D5F[] = { {"Star", 4, 0x022C6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D63[] = { {"sqsupseteq", 10, 0x02292, 0}, {"zhcy", 4, 0x00436, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D65[] = { {"nacute", 6, 0x00144, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D69[] = { {"lessgtr", 7, 0x02276, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D6A[] = { {"nless", 5, 0x0226E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D6C[] = { {"RightTeeArrow", 13, 0x021A6, 0}, {"Yuml", 4, 0x00178, 0}, {"target", 6, 0x02316, 0}, {"upharpoonleft", 13, 0x021BF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D6F[] = { {"between", 7, 0x0226C, 0}, {"boxuL", 5, 0x0255B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D70[] = { {"TSHcy", 5, 0x0040B, 0}, {"lrm", 3, 0x0200E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D71[] = { {"excl", 4, 0x00021, 0}, {"hyphen", 6, 0x02010, 0}, {"mlcp", 4, 0x02ADB, 0}, {"wedge", 5, 0x02227, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D74[] = { {"ncedil", 6, 0x00146, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D75[] = { {"boxuR", 5, 0x02558, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D76[] = { {"Not", 3, 0x02AEC, 0}, {"epsi", 4, 0x003B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D7C[] = { {"disin", 5, 0x022F2, 0}, {"nRightarrow", 11, 0x021CF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D7D[] = { {"cylcty", 6, 0x0232D, 0}, {"neArr", 5, 0x021D7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D7E[] = { {"prnsim", 6, 0x022E8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D80[] = { {"Cfr", 3, 0x0212D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D81[] = { {"Beta", 4, 0x00392, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D85[] = { {"leftarrowtail", 13, 0x021A2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D87[] = { {"parsl", 5, 0x02AFD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D89[] = { {"xwedge", 6, 0x022C0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D8A[] = { {"olcross", 7, 0x029BB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D8C[] = { {"boxvH", 5, 0x0256A, 0}, {"lsh", 3, 0x021B0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D8D[] = { {"circledR", 8, 0x000AE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D8E[] = { {"Rho", 3, 0x003A1, 0}, {"circledS", 8, 0x024C8, 0}, {"cupor", 5, 0x02A45, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D8F[] = { {"Ugrave", 6, 0x000D9, 0}, {"boxul", 5, 0x02518, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D90[] = { {"boxvL", 5, 0x02561, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D91[] = { {"sqcup", 5, 0x02294, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D93[] = { {"rect", 4, 0x025AD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D94[] = { {"mldr", 4, 0x02026, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D95[] = { {"boxur", 5, 0x02514, 0}, {"digamma", 7, 0x003DD, 0}, {"tcy", 3, 0x00442, 0}, {"urcorner", 8, 0x0231D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D96[] = { {"DoubleLeftArrow", 15, 0x021D0, 0}, {"Iscr", 4, 0x02110, 0}, {"boxvR", 5, 0x0255E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D98[] = { {"ulcorn", 6, 0x0231C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D9A[] = { {"prod", 4, 0x0220F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_D9C[] = { {"Ropf", 4, 0x0211D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DA0[] = { {"rmoustache", 10, 0x023B1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DA5[] = { {"NegativeMediumSpace", 19, 0x0200B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DA6[] = { {"prop", 4, 0x0221D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DA8[] = { {"TScy", 4, 0x00426, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DA9[] = { {"xsqcup", 6, 0x02A06, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DAC[] = { {"bemptyv", 7, 0x029B0, 0}, {"boxvh", 5, 0x0253C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DB0[] = { {"boxvl", 5, 0x02524, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DB3[] = { {"NotTildeFullEqual", 17, 0x02247, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DB4[] = { {"subE", 4, 0x02AC5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DB6[] = { {"boxvr", 5, 0x0251C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DB7[] = { {"bigvee", 6, 0x022C1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DB9[] = { {"Chi", 3, 0x003A7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DBC[] = { {"circeq", 6, 0x02257, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DBE[] = { {"emsp13", 6, 0x02004, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DBF[] = { {"emsp14", 6, 0x02005, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DC2[] = { {"ouml", 4, 0x000F6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DC3[] = { {"RightArrowBar", 13, 0x021E5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DC6[] = { {"ecy", 3, 0x0044D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DC8[] = { {"succneqq", 8, 0x02AB6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DCA[] = { {"npart", 5, 0x02202, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DCF[] = { {"Element", 7, 0x02208, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DD1[] = { {"Edot", 4, 0x00116, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DD3[] = { {"RightUpDownVector", 17, 0x0294F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DD4[] = { {"sube", 4, 0x02286, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DD5[] = { {"jsercy", 6, 0x00458, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DD7[] = { {"varrho", 6, 0x003F1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DD9[] = { {"subsub", 6, 0x02AD5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DDC[] = { {"Dcaron", 6, 0x0010E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DDD[] = { {"Eogon", 5, 0x00118, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DE4[] = { {"geqslant", 8, 0x02A7E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DE6[] = { {"rdldhar", 7, 0x02969, 0}, {"zdot", 4, 0x0017C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DE7[] = { {"subsup", 6, 0x02AD3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DE9[] = { {"ograve", 6, 0x000F2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DEB[] = { {"ReverseElement", 14, 0x0220B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DED[] = { {"drcorn", 6, 0x0231F, 0}, {"rang", 4, 0x027E9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DF1[] = { {"tfr", 3, 0x1D531, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DF2[] = { {"hopf", 4, 0x1D559, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DF3[] = { {"succ", 4, 0x0227B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DF6[] = { {"otimes", 6, 0x02297, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DF7[] = { {"Vcy", 3, 0x00412, 0}, {"ltquest", 7, 0x02A7B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DF9[] = { {"lozenge", 7, 0x025CA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DFB[] = { {"LeftDownVector", 14, 0x021C3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_DFD[] = { {"eogon", 5, 0x00119, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E03[] = { {"amp", 3, 0x00026, 0}, {"lopar", 5, 0x02985, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E04[] = { {"loplus", 6, 0x02A2D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E08[] = { {"NotTilde", 8, 0x02241, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E09[] = { {"CounterClockwiseContourIntegral", 31, 0x02233, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E0C[] = { {"InvisibleTimes", 14, 0x02062, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E11[] = { {"lesdotor", 8, 0x02A83, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E18[] = { {"and", 3, 0x02227, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E1B[] = { {"RightUpVector", 13, 0x021BE, 0}, {"ang", 3, 0x02220, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E1C[] = { {"DoubleRightTee", 14, 0x022A8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E1D[] = { {"LeftUpVectorBar", 15, 0x02958, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E1E[] = { {"smte", 4, 0x02AAC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E20[] = { {"Iacute", 6, 0x000CD, 0}, {"triminus", 8, 0x02A3A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E22[] = { {"efr", 3, 0x1D522, 0}, {"iiint", 5, 0x0222D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E23[] = { {"ctdot", 5, 0x022EF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E24[] = { {"mnplus", 6, 0x02213, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E25[] = { {"Vee", 3, 0x022C1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E28[] = { {"Gcy", 3, 0x00413, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E2A[] = { {"lurdshar", 8, 0x0294A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E2C[] = { {"smeparsl", 8, 0x029E4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E2F[] = { {"DoubleVerticalBar", 17, 0x02225, 0}, {"iecy", 4, 0x00435, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E30[] = { {"udblac", 6, 0x00171, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E32[] = { {"gtquest", 7, 0x02A7C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E33[] = { {"Iopf", 4, 0x1D540, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E35[] = { {"bsime", 5, 0x022CD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E36[] = { {"RightVector", 11, 0x021C0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E37[] = { {"NotGreaterLess", 14, 0x02279, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E3B[] = { {"apE", 3, 0x02A70, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E41[] = { {"CupCap", 6, 0x0224D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E42[] = { {"uscr", 4, 0x1D4CA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E43[] = { {"erDot", 5, 0x02253, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E44[] = { {"egs", 3, 0x02A96, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E48[] = { {"rlarr", 5, 0x021C4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E4C[] = { {"prE", 3, 0x02AB3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E4E[] = { {"QUOT", 4, 0x00022, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E53[] = { {"Vfr", 3, 0x1D519, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E55[] = { {"cupbrcap", 8, 0x02A48, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E57[] = { {"intercal", 8, 0x022BA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E58[] = { {"imath", 5, 0x00131, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E59[] = { {"RightUpTeeVector", 16, 0x0295C, 0}, {"trie", 4, 0x0225C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E5B[] = { {"ape", 3, 0x0224A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E5D[] = { {"softcy", 6, 0x0044C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E5E[] = { {"rarrb", 5, 0x021E5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E5F[] = { {"FilledSmallSquare", 17, 0x025FC, 0}, {"rarrc", 5, 0x02933, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E60[] = { {"Superset", 8, 0x02283, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E61[] = { {"hoarr", 5, 0x021FF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E63[] = { {"DownRightVectorBar", 18, 0x02957, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E64[] = { {"brvbar", 6, 0x000A6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E65[] = { {"ecolon", 6, 0x02255, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E66[] = { {"GreaterLess", 11, 0x02277, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E6A[] = { {"nrArr", 5, 0x021CF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E6C[] = { {"pre", 3, 0x02AAF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E6F[] = { {"aleph", 5, 0x02135, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E70[] = { {"DiacriticalAcute", 16, 0x000B4, 0}, {"SmallCircle", 11, 0x02218, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E71[] = { {"parsim", 6, 0x02AF3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E73[] = { {"rarrw", 5, 0x0219D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E78[] = { {"caron", 5, 0x002C7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E7A[] = { {"cacute", 6, 0x00107, 0}, {"lagran", 6, 0x02112, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E7C[] = { {"rarr", 4, 0x02192, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E80[] = { {"Rrightarrow", 11, 0x021DB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E83[] = { {"Vscr", 4, 0x1D4B1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E84[] = { {"Gfr", 3, 0x1D50A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E89[] = { {"ccedil", 6, 0x000E7, 0}, {"propto", 6, 0x0221D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E8E[] = { {"zwnj", 4, 0x0200C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E91[] = { {"psi", 3, 0x003C8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E99[] = { {"infin", 5, 0x0221E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_E9C[] = { {"circledcirc", 11, 0x0229A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EA1[] = { {"Proportion", 10, 0x02237, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EA2[] = { {"subseteqq", 9, 0x02AC5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EA4[] = { {"nGtv", 4, 0x0226B, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EA8[] = { {"macr", 4, 0x000AF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EA9[] = { {"orslope", 7, 0x02A57, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EB1[] = { {"frown", 5, 0x02322, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EB2[] = { {"Iota", 4, 0x00399, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EB4[] = { {"rceil", 5, 0x02309, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EB7[] = { {"spadesuit", 9, 0x02660, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EB8[] = { {"sstarf", 6, 0x022C6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ECA[] = { {"icy", 3, 0x00438, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ECD[] = { {"ast", 3, 0x0002A, 0}, {"nmid", 4, 0x02224, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ECF[] = { {"bowtie", 6, 0x022C8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ED1[] = { {"thetav", 6, 0x003D1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ED7[] = { {"vangrt", 6, 0x0299C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ED8[] = { {"numsp", 5, 0x02007, 0}, {"triplus", 7, 0x02A39, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_ED9[] = { {"lscr", 4, 0x1D4C1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EDA[] = { {"pointint", 8, 0x02A15, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EDB[] = { {"Theta", 5, 0x00398, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EDF[] = { {"rightrightarrows", 16, 0x021C9, 0}, {"uopf", 4, 0x1D566, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EE2[] = { {"ell", 3, 0x02113, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EE4[] = { {"cuepr", 5, 0x022DE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EE5[] = { {"NotVerticalBar", 14, 0x02224, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EE7[] = { {"xnis", 4, 0x022FB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EE9[] = { {"els", 3, 0x02A95, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EEF[] = { {"DDotrahd", 8, 0x02911, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EF1[] = { {"larrbfs", 7, 0x0291F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EF2[] = { {"Rsh", 3, 0x021B1, 0}, {"boxplus", 7, 0x0229E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EF4[] = { {"swarr", 5, 0x02199, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EF5[] = { {"gvnE", 4, 0x02269, 0x0FE00}, {"xfr", 3, 0x1D535, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EF9[] = { {"ldca", 4, 0x02936, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EFB[] = { {"NotPrecedesSlantEqual", 21, 0x022E0, 0}, {"YAcy", 4, 0x0042F, 0}, {"Zcy", 3, 0x00417, 0}, {"andslope", 8, 0x02A58, 0}, {"numero", 6, 0x02116, 0}, {"theta", 5, 0x003B8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EFE[] = { {"mapstoup", 8, 0x021A5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_EFF[] = { {"bigcup", 6, 0x022C3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F03[] = { {"nesear", 6, 0x02928, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F05[] = { {"lesssim", 7, 0x02272, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F08[] = { {"DownArrow", 9, 0x02193, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F0B[] = { {"orarr", 5, 0x021BB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F0F[] = { {"ccaps", 5, 0x02A4D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F10[] = { {"xdtri", 5, 0x025BD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F11[] = { {"xcap", 4, 0x022C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F13[] = { {"downdownarrows", 14, 0x021CA, 0}, {"nisd", 4, 0x022FA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F14[] = { {"VerticalBar", 11, 0x02223, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F15[] = { {"TRADE", 5, 0x02122, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F17[] = { {"Omacr", 5, 0x0014C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F18[] = { {"top", 3, 0x022A4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F19[] = { {"LeftRightArrow", 14, 0x02194, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F1A[] = { {"Mscr", 4, 0x02133, 0}, {"iff", 3, 0x021D4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F1F[] = { {"downharpoonleft", 15, 0x021C3, 0}, {"eng", 3, 0x0014B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F20[] = { {"Vopf", 4, 0x1D54D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F26[] = { {"ifr", 3, 0x1D526, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F28[] = { {"Downarrow", 9, 0x021D3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F2C[] = { {"Kcy", 3, 0x0041A, 0}, {"angle", 5, 0x02220, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F2F[] = { {"lescc", 5, 0x02AA8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F30[] = { {"lesseqqgtr", 10, 0x02A8B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F31[] = { {"bigstar", 7, 0x02605, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F33[] = { {"ddagger", 7, 0x02021, 0}, {"nltrie", 6, 0x022EC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F37[] = { {"omacr", 5, 0x0014D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F38[] = { {"cuesc", 5, 0x022DF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F40[] = { {"circlearrowright", 16, 0x021BB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F41[] = { {"ngeqq", 5, 0x02267, 0x00338}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F44[] = { {"squf", 4, 0x025AA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F46[] = { {"rtri", 4, 0x025B9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F47[] = { {"VerticalLine", 12, 0x0007C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F48[] = { {"downarrow", 9, 0x02193, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F4B[] = { {"Scaron", 6, 0x00160, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F4C[] = { {"tstrok", 6, 0x00167, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F50[] = { {"wreath", 6, 0x02240, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F51[] = { {"exponentiale", 12, 0x02147, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F55[] = { {"Idot", 4, 0x00130, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F57[] = { {"Zfr", 3, 0x02128, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F58[] = { {"bnot", 4, 0x02310, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F5B[] = { {"infintie", 8, 0x029DD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F5D[] = { {"angrtvbd", 8, 0x0299D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F5F[] = { {"prurel", 6, 0x022B0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F60[] = { {"gbreve", 6, 0x0011F, 0}, {"rsaquo", 6, 0x0203A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F62[] = { {"sung", 4, 0x0266A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F67[] = { {"lvertneqq", 9, 0x02268, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F68[] = { {"lnsim", 5, 0x022E6, 0}, {"searrow", 7, 0x02198, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F69[] = { {"nsubset", 7, 0x02282, 0x020D2}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F6D[] = { {"Cup", 3, 0x022D3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F6E[] = { {"Lmidot", 6, 0x0013F, 0}, {"sup1", 4, 0x000B9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F6F[] = { {"Delta", 5, 0x00394, 0}, {"sbquo", 5, 0x0201A, 0}, {"sup2", 4, 0x000B2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F70[] = { {"cscr", 4, 0x1D4B8, 0}, {"nsubseteqq", 10, 0x02AC5, 0x00338}, {"sup3", 4, 0x000B3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F71[] = { {"Kcedil", 6, 0x00136, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F72[] = { {"plussim", 7, 0x02A26, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F74[] = { {"KHcy", 4, 0x00425, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F75[] = { {"OElig", 5, 0x00152, 0}, {"simdot", 6, 0x02A6A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F76[] = { {"lopf", 4, 0x1D55D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F77[] = { {"boxbox", 6, 0x029C9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F78[] = { {"bepsi", 5, 0x003F6, 0}, {"lbarr", 5, 0x0290C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F79[] = { {"lnapprox", 8, 0x02A89, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F81[] = { {"sdotb", 5, 0x022A1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F82[] = { {"measuredangle", 13, 0x02221, 0}, {"supE", 4, 0x02AC6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F83[] = { {"map", 3, 0x021A6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F84[] = { {"sdote", 5, 0x02A66, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F86[] = { {"diamondsuit", 11, 0x02666, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F88[] = { {"Kfr", 3, 0x1D50E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F8B[] = { {"imagline", 8, 0x02110, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F8F[] = { {"delta", 5, 0x003B4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F91[] = { {"mapstodown", 10, 0x021A7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F93[] = { {"eqvparsl", 8, 0x029E5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F95[] = { {"UpArrow", 7, 0x02191, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F9A[] = { {"imagpart", 8, 0x02111, 0}, {"lsim", 4, 0x02272, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F9C[] = { {"trianglelefteq", 14, 0x022B4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_F9F[] = { {"isindot", 7, 0x022F5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FA0[] = { {"LeftUpDownVector", 16, 0x02951, 0}, {"curvearrowleft", 14, 0x021B6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FA1[] = { {"Diamond", 7, 0x022C4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FA2[] = { {"supe", 4, 0x02287, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FA3[] = { {"nearrow", 7, 0x02197, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FA9[] = { {"easter", 6, 0x02A6E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FB0[] = { {"rdquo", 5, 0x0201D, 0}, {"subsetneqq", 10, 0x02ACB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FB1[] = { {"Dscr", 4, 0x1D49F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FB4[] = { {"comp", 4, 0x02201, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FB5[] = { {"Uparrow", 7, 0x021D1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FB6[] = { {"coloneq", 7, 0x02254, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FB7[] = { {"Mopf", 4, 0x1D544, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FB9[] = { {"rfloor", 6, 0x0230B, 0}, {"varsubsetneqq", 13, 0x02ACB, 0x0FE00}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FBC[] = { {"eacute", 6, 0x000E9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FC2[] = { {"shortparallel", 13, 0x02225, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FC4[] = { {"male", 4, 0x02642, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FC6[] = { {"yscr", 4, 0x1D4CE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FCA[] = { {"xharr", 5, 0x027F7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FCC[] = { {"cong", 4, 0x02245, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FCE[] = { {"mcy", 3, 0x0043C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FCF[] = { {"Upsilon", 7, 0x003A5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FD0[] = { {"block", 5, 0x02588, 0}, {"maltese", 7, 0x02720, 0}, {"ordf", 4, 0x000AA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FD2[] = { {"zcaron", 6, 0x0017E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FD3[] = { {"malt", 4, 0x02720, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FD6[] = { {"loang", 5, 0x027EC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FD7[] = { {"ordm", 4, 0x000BA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FDD[] = { {"NegativeVeryThinSpace", 21, 0x0200B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FDF[] = { {"eta", 3, 0x003B7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FE1[] = { {"Iogon", 5, 0x0012E, 0}, {"drbkarow", 8, 0x02910, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FE6[] = { {"eth", 3, 0x000F0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FE9[] = { {"racute", 6, 0x00155, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FEA[] = { {"cwconint", 8, 0x02232, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FEB[] = { {"egsdot", 6, 0x02A98, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FF5[] = { {"MinusPlus", 9, 0x02213, 0}, {"ring", 4, 0x002DA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FF8[] = { {"rcedil", 6, 0x00157, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FFC[] = { {"timesbar", 8, 0x02A31, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html5_FFE[] = { {"GreaterEqualLess", 16, 0x022DB, 0}, {NULL, 0, 0, 0} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
static const entity_cp_map *const ht_buckets_html5[] = {
ht_bucket_html5_000, ht_bucket_html5_001, ht_bucket_empty, ht_bucket_html5_003,
ht_bucket_empty, ht_bucket_html5_005, ht_bucket_empty, ht_bucket_html5_007,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_00D, ht_bucket_empty, ht_bucket_html5_00F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_017,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_020, ht_bucket_empty, ht_bucket_html5_022, ht_bucket_empty,
ht_bucket_html5_024, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_027,
ht_bucket_html5_028, ht_bucket_html5_029, ht_bucket_html5_02A, ht_bucket_html5_02B,
ht_bucket_html5_02C, ht_bucket_empty, ht_bucket_html5_02E, ht_bucket_empty,
ht_bucket_html5_030, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_034, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_038, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_040, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_047,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_04C, ht_bucket_empty, ht_bucket_html5_04E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_051, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_059, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_05D, ht_bucket_empty, ht_bucket_html5_05F,
ht_bucket_html5_060, ht_bucket_html5_061, ht_bucket_empty, ht_bucket_html5_063,
ht_bucket_html5_064, ht_bucket_html5_065, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_069, ht_bucket_html5_06A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_06D, ht_bucket_html5_06E, ht_bucket_html5_06F,
ht_bucket_empty, ht_bucket_html5_071, ht_bucket_empty, ht_bucket_html5_073,
ht_bucket_html5_074, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_07A, ht_bucket_html5_07B,
ht_bucket_html5_07C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_07F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_082, ht_bucket_empty,
ht_bucket_html5_084, ht_bucket_html5_085, ht_bucket_html5_086, ht_bucket_empty,
ht_bucket_html5_088, ht_bucket_html5_089, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_08C, ht_bucket_empty, ht_bucket_html5_08E, ht_bucket_empty,
ht_bucket_html5_090, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_094, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_097,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_09E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_0A4, ht_bucket_empty, ht_bucket_html5_0A6, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_0AB,
ht_bucket_html5_0AC, ht_bucket_html5_0AD, ht_bucket_html5_0AE, ht_bucket_html5_0AF,
ht_bucket_html5_0B0, ht_bucket_empty, ht_bucket_html5_0B2, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_0B8, ht_bucket_html5_0B9, ht_bucket_html5_0BA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_0C0, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_0C4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_0CE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_0D1, ht_bucket_html5_0D2, ht_bucket_html5_0D3,
ht_bucket_empty, ht_bucket_html5_0D5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_0DF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_0E5, ht_bucket_html5_0E6, ht_bucket_empty,
ht_bucket_html5_0E8, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_0EC, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_0EF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_0F3,
ht_bucket_html5_0F4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_0FA, ht_bucket_html5_0FB,
ht_bucket_empty, ht_bucket_html5_0FD, ht_bucket_html5_0FE, ht_bucket_empty,
ht_bucket_html5_100, ht_bucket_html5_101, ht_bucket_empty, ht_bucket_html5_103,
ht_bucket_empty, ht_bucket_html5_105, ht_bucket_html5_106, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_109, ht_bucket_html5_10A, ht_bucket_html5_10B,
ht_bucket_empty, ht_bucket_html5_10D, ht_bucket_html5_10E, ht_bucket_html5_10F,
ht_bucket_html5_110, ht_bucket_html5_111, ht_bucket_empty, ht_bucket_html5_113,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_116, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_11B,
ht_bucket_html5_11C, ht_bucket_empty, ht_bucket_html5_11E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_121, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_124, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_129, ht_bucket_html5_12A, ht_bucket_empty,
ht_bucket_html5_12C, ht_bucket_empty, ht_bucket_html5_12E, ht_bucket_html5_12F,
ht_bucket_html5_130, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_134, ht_bucket_html5_135, ht_bucket_empty, ht_bucket_html5_137,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_13A, ht_bucket_html5_13B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_145, ht_bucket_empty, ht_bucket_html5_147,
ht_bucket_empty, ht_bucket_html5_149, ht_bucket_empty, ht_bucket_html5_14B,
ht_bucket_html5_14C, ht_bucket_empty, ht_bucket_html5_14E, ht_bucket_html5_14F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_158, ht_bucket_html5_159, ht_bucket_empty, ht_bucket_html5_15B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_15E, ht_bucket_html5_15F,
ht_bucket_empty, ht_bucket_html5_161, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_164, ht_bucket_html5_165, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_168, ht_bucket_empty, ht_bucket_html5_16A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_170, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_173,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_17A, ht_bucket_html5_17B,
ht_bucket_html5_17C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_17F,
ht_bucket_empty, ht_bucket_html5_181, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_189, ht_bucket_empty, ht_bucket_html5_18B,
ht_bucket_html5_18C, ht_bucket_empty, ht_bucket_html5_18E, ht_bucket_html5_18F,
ht_bucket_html5_190, ht_bucket_html5_191, ht_bucket_empty, ht_bucket_html5_193,
ht_bucket_empty, ht_bucket_html5_195, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_19A, ht_bucket_empty,
ht_bucket_html5_19C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_19F,
ht_bucket_html5_1A0, ht_bucket_html5_1A1, ht_bucket_html5_1A2, ht_bucket_html5_1A3,
ht_bucket_html5_1A4, ht_bucket_html5_1A5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_1A8, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_1AB,
ht_bucket_html5_1AC, ht_bucket_html5_1AD, ht_bucket_html5_1AE, ht_bucket_html5_1AF,
ht_bucket_html5_1B0, ht_bucket_empty, ht_bucket_html5_1B2, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_1B5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_1B9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_1BD, ht_bucket_html5_1BE, ht_bucket_empty,
ht_bucket_html5_1C0, ht_bucket_html5_1C1, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_1C4, ht_bucket_empty, ht_bucket_html5_1C6, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_1C9, ht_bucket_html5_1CA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_1CD, ht_bucket_html5_1CE, ht_bucket_empty,
ht_bucket_html5_1D0, ht_bucket_html5_1D1, ht_bucket_html5_1D2, ht_bucket_empty,
ht_bucket_html5_1D4, ht_bucket_html5_1D5, ht_bucket_html5_1D6, ht_bucket_empty,
ht_bucket_html5_1D8, ht_bucket_html5_1D9, ht_bucket_html5_1DA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_1DF,
ht_bucket_html5_1E0, ht_bucket_html5_1E1, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_1E5, ht_bucket_html5_1E6, ht_bucket_html5_1E7,
ht_bucket_html5_1E8, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_1EB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_1EF,
ht_bucket_html5_1F0, ht_bucket_empty, ht_bucket_html5_1F2, ht_bucket_empty,
ht_bucket_html5_1F4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_1F8, ht_bucket_html5_1F9, ht_bucket_html5_1FA, ht_bucket_empty,
ht_bucket_html5_1FC, ht_bucket_empty, ht_bucket_html5_1FE, ht_bucket_html5_1FF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_205, ht_bucket_empty, ht_bucket_html5_207,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_20E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_212, ht_bucket_html5_213,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_219, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_21D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_220, ht_bucket_empty, ht_bucket_html5_222, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_227,
ht_bucket_empty, ht_bucket_html5_229, ht_bucket_empty, ht_bucket_html5_22B,
ht_bucket_html5_22C, ht_bucket_html5_22D, ht_bucket_empty, ht_bucket_html5_22F,
ht_bucket_html5_230, ht_bucket_empty, ht_bucket_html5_232, ht_bucket_empty,
ht_bucket_html5_234, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_239, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_23C, ht_bucket_html5_23D, ht_bucket_html5_23E, ht_bucket_empty,
ht_bucket_html5_240, ht_bucket_html5_241, ht_bucket_html5_242, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_246, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_249, ht_bucket_html5_24A, ht_bucket_html5_24B,
ht_bucket_empty, ht_bucket_html5_24D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_251, ht_bucket_html5_252, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_257,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_25A, ht_bucket_html5_25B,
ht_bucket_html5_25C, ht_bucket_html5_25D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_263,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_26A, ht_bucket_html5_26B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_26E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_274, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_277,
ht_bucket_html5_278, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_27C, ht_bucket_empty, ht_bucket_html5_27E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_283,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_28A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_294, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_297,
ht_bucket_html5_298, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_29D, ht_bucket_empty, ht_bucket_html5_29F,
ht_bucket_html5_2A0, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_2A9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_2AE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_2B1, ht_bucket_html5_2B2, ht_bucket_html5_2B3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_2B9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_2BF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_2C4, ht_bucket_html5_2C5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_2CE, ht_bucket_html5_2CF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_2D3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_2DA, ht_bucket_html5_2DB,
ht_bucket_html5_2DC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_2E3,
ht_bucket_html5_2E4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_2EB,
ht_bucket_html5_2EC, ht_bucket_empty, ht_bucket_html5_2EE, ht_bucket_empty,
ht_bucket_html5_2F0, ht_bucket_empty, ht_bucket_html5_2F2, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_2F8, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_300, ht_bucket_html5_301, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_304, ht_bucket_html5_305, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_308, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_30B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_30F,
ht_bucket_empty, ht_bucket_html5_311, ht_bucket_empty, ht_bucket_html5_313,
ht_bucket_empty, ht_bucket_html5_315, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_319, ht_bucket_html5_31A, ht_bucket_html5_31B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_326, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_329, ht_bucket_html5_32A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_32D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_330, ht_bucket_html5_331, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_336, ht_bucket_empty,
ht_bucket_html5_338, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_33B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_33F,
ht_bucket_html5_340, ht_bucket_html5_341, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_347,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_34D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_350, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_356, ht_bucket_empty,
ht_bucket_html5_358, ht_bucket_html5_359, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_35D, ht_bucket_empty, ht_bucket_html5_35F,
ht_bucket_empty, ht_bucket_html5_361, ht_bucket_empty, ht_bucket_html5_363,
ht_bucket_empty, ht_bucket_html5_365, ht_bucket_empty, ht_bucket_html5_367,
ht_bucket_empty, ht_bucket_html5_369, ht_bucket_html5_36A, ht_bucket_html5_36B,
ht_bucket_empty, ht_bucket_html5_36D, ht_bucket_html5_36E, ht_bucket_html5_36F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_372, ht_bucket_empty,
ht_bucket_html5_374, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_378, ht_bucket_empty, ht_bucket_html5_37A, ht_bucket_empty,
ht_bucket_html5_37C, ht_bucket_html5_37D, ht_bucket_html5_37E, ht_bucket_html5_37F,
ht_bucket_html5_380, ht_bucket_empty, ht_bucket_html5_382, ht_bucket_html5_383,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_386, ht_bucket_html5_387,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_38A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_38D, ht_bucket_html5_38E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_391, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_394, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_397,
ht_bucket_html5_398, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_39C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_39F,
ht_bucket_html5_3A0, ht_bucket_html5_3A1, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_3A4, ht_bucket_html5_3A5, ht_bucket_html5_3A6, ht_bucket_html5_3A7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_3AC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_3B2, ht_bucket_empty,
ht_bucket_html5_3B4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_3BF,
ht_bucket_html5_3C0, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_3C4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_3C9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_3CD, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_3D0, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_3D3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_3D9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_3DE, ht_bucket_empty,
ht_bucket_html5_3E0, ht_bucket_html5_3E1, ht_bucket_empty, ht_bucket_html5_3E3,
ht_bucket_html5_3E4, ht_bucket_empty, ht_bucket_html5_3E6, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_3E9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_3ED, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_3F1, ht_bucket_html5_3F2, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_3F7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_3FC, ht_bucket_empty, ht_bucket_html5_3FE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_402, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_405, ht_bucket_empty, ht_bucket_html5_407,
ht_bucket_empty, ht_bucket_html5_409, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_40E, ht_bucket_html5_40F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_413,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_41A, ht_bucket_html5_41B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_421, ht_bucket_empty, ht_bucket_html5_423,
ht_bucket_html5_424, ht_bucket_html5_425, ht_bucket_html5_426, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_429, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_42C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_42F,
ht_bucket_html5_430, ht_bucket_empty, ht_bucket_html5_432, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_436, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_439, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_43C, ht_bucket_html5_43D, ht_bucket_html5_43E, ht_bucket_html5_43F,
ht_bucket_html5_440, ht_bucket_html5_441, ht_bucket_empty, ht_bucket_html5_443,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_446, ht_bucket_html5_447,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_44A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_44F,
ht_bucket_html5_450, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_454, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_457,
ht_bucket_html5_458, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_45D, ht_bucket_empty, ht_bucket_html5_45F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_465, ht_bucket_html5_466, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_469, ht_bucket_html5_46A, ht_bucket_html5_46B,
ht_bucket_empty, ht_bucket_html5_46D, ht_bucket_empty, ht_bucket_html5_46F,
ht_bucket_html5_470, ht_bucket_html5_471, ht_bucket_html5_472, ht_bucket_html5_473,
ht_bucket_empty, ht_bucket_html5_475, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_479, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_47C, ht_bucket_html5_47D, ht_bucket_empty, ht_bucket_html5_47F,
ht_bucket_html5_480, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_485, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_488, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_48C, ht_bucket_empty, ht_bucket_html5_48E, ht_bucket_html5_48F,
ht_bucket_html5_490, ht_bucket_html5_491, ht_bucket_empty, ht_bucket_html5_493,
ht_bucket_empty, ht_bucket_html5_495, ht_bucket_html5_496, ht_bucket_empty,
ht_bucket_html5_498, ht_bucket_html5_499, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_49F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_4A2, ht_bucket_empty,
ht_bucket_html5_4A4, ht_bucket_empty, ht_bucket_html5_4A6, ht_bucket_html5_4A7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_4AA, ht_bucket_html5_4AB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_4AE, ht_bucket_html5_4AF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_4B2, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_4B6, ht_bucket_empty,
ht_bucket_html5_4B8, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_4BC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_4C0, ht_bucket_empty, ht_bucket_html5_4C2, ht_bucket_html5_4C3,
ht_bucket_html5_4C4, ht_bucket_html5_4C5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_4C8, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_4CB,
ht_bucket_empty, ht_bucket_html5_4CD, ht_bucket_empty, ht_bucket_html5_4CF,
ht_bucket_html5_4D0, ht_bucket_empty, ht_bucket_html5_4D2, ht_bucket_html5_4D3,
ht_bucket_html5_4D4, ht_bucket_html5_4D5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_4D8, ht_bucket_empty, ht_bucket_html5_4DA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_4DE, ht_bucket_html5_4DF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_4E3,
ht_bucket_html5_4E4, ht_bucket_empty, ht_bucket_html5_4E6, ht_bucket_html5_4E7,
ht_bucket_html5_4E8, ht_bucket_html5_4E9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_4ED, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_4F1, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_4F7,
ht_bucket_empty, ht_bucket_html5_4F9, ht_bucket_empty, ht_bucket_html5_4FB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_4FE, ht_bucket_empty,
ht_bucket_html5_500, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_504, ht_bucket_empty, ht_bucket_html5_506, ht_bucket_html5_507,
ht_bucket_empty, ht_bucket_html5_509, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_50E, ht_bucket_html5_50F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_513,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_516, ht_bucket_empty,
ht_bucket_html5_518, ht_bucket_html5_519, ht_bucket_empty, ht_bucket_html5_51B,
ht_bucket_html5_51C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_524, ht_bucket_html5_525, ht_bucket_html5_526, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_52F,
ht_bucket_html5_530, ht_bucket_empty, ht_bucket_html5_532, ht_bucket_html5_533,
ht_bucket_html5_534, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_53B,
ht_bucket_html5_53C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_53F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_542, ht_bucket_html5_543,
ht_bucket_empty, ht_bucket_html5_545, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_548, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_54F,
ht_bucket_html5_550, ht_bucket_html5_551, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_557,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_55B,
ht_bucket_empty, ht_bucket_html5_55D, ht_bucket_empty, ht_bucket_html5_55F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_564, ht_bucket_html5_565, ht_bucket_html5_566, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_56C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_56F,
ht_bucket_html5_570, ht_bucket_html5_571, ht_bucket_html5_572, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_575, ht_bucket_html5_576, ht_bucket_html5_577,
ht_bucket_html5_578, ht_bucket_empty, ht_bucket_html5_57A, ht_bucket_html5_57B,
ht_bucket_html5_57C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_580, ht_bucket_empty, ht_bucket_html5_582, ht_bucket_html5_583,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_586, ht_bucket_empty,
ht_bucket_html5_588, ht_bucket_html5_589, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_58D, ht_bucket_html5_58E, ht_bucket_html5_58F,
ht_bucket_html5_590, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_595, ht_bucket_html5_596, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_59A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_59D, ht_bucket_empty, ht_bucket_html5_59F,
ht_bucket_html5_5A0, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_5A3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_5A6, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_5A9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_5AC, ht_bucket_html5_5AD, ht_bucket_html5_5AE, ht_bucket_empty,
ht_bucket_html5_5B0, ht_bucket_html5_5B1, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_5B5, ht_bucket_html5_5B6, ht_bucket_html5_5B7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_5BB,
ht_bucket_html5_5BC, ht_bucket_html5_5BD, ht_bucket_empty, ht_bucket_html5_5BF,
ht_bucket_html5_5C0, ht_bucket_html5_5C1, ht_bucket_html5_5C2, ht_bucket_empty,
ht_bucket_html5_5C4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_5D0, ht_bucket_html5_5D1, ht_bucket_empty, ht_bucket_html5_5D3,
ht_bucket_empty, ht_bucket_html5_5D5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_5D8, ht_bucket_html5_5D9, ht_bucket_empty, ht_bucket_html5_5DB,
ht_bucket_empty, ht_bucket_html5_5DD, ht_bucket_empty, ht_bucket_html5_5DF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_5E2, ht_bucket_empty,
ht_bucket_html5_5E4, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_5E7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_5EA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_5ED, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_5F0, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_5F3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_5F6, ht_bucket_empty,
ht_bucket_html5_5F8, ht_bucket_empty, ht_bucket_html5_5FA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_5FD, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_601, ht_bucket_html5_602, ht_bucket_html5_603,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_606, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_609, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_60D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_613,
ht_bucket_empty, ht_bucket_html5_615, ht_bucket_empty, ht_bucket_html5_617,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_61A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_61D, ht_bucket_empty, ht_bucket_html5_61F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_622, ht_bucket_empty,
ht_bucket_html5_624, ht_bucket_empty, ht_bucket_html5_626, ht_bucket_empty,
ht_bucket_html5_628, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_62C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_630, ht_bucket_empty, ht_bucket_html5_632, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_636, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_63A, ht_bucket_empty,
ht_bucket_html5_63C, ht_bucket_html5_63D, ht_bucket_html5_63E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_641, ht_bucket_html5_642, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_645, ht_bucket_html5_646, ht_bucket_html5_647,
ht_bucket_html5_648, ht_bucket_html5_649, ht_bucket_html5_64A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_651, ht_bucket_html5_652, ht_bucket_html5_653,
ht_bucket_empty, ht_bucket_html5_655, ht_bucket_empty, ht_bucket_html5_657,
ht_bucket_html5_658, ht_bucket_html5_659, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_65C, ht_bucket_empty, ht_bucket_html5_65E, ht_bucket_empty,
ht_bucket_html5_660, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_669, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_670, ht_bucket_html5_671, ht_bucket_html5_672, ht_bucket_html5_673,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_676, ht_bucket_empty,
ht_bucket_html5_678, ht_bucket_html5_679, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_67D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_680, ht_bucket_empty, ht_bucket_html5_682, ht_bucket_empty,
ht_bucket_html5_684, ht_bucket_html5_685, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_68A, ht_bucket_empty,
ht_bucket_html5_68C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_691, ht_bucket_empty, ht_bucket_html5_693,
ht_bucket_html5_694, ht_bucket_empty, ht_bucket_html5_696, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_699, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_6A3,
ht_bucket_html5_6A4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_6A8, ht_bucket_html5_6A9, ht_bucket_html5_6AA, ht_bucket_html5_6AB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_6AE, ht_bucket_empty,
ht_bucket_html5_6B0, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_6B4, ht_bucket_empty, ht_bucket_html5_6B6, ht_bucket_empty,
ht_bucket_html5_6B8, ht_bucket_html5_6B9, ht_bucket_html5_6BA, ht_bucket_empty,
ht_bucket_html5_6BC, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_6BF,
ht_bucket_html5_6C0, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_6C4, ht_bucket_html5_6C5, ht_bucket_html5_6C6, ht_bucket_html5_6C7,
ht_bucket_html5_6C8, ht_bucket_html5_6C9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_6CC, ht_bucket_empty, ht_bucket_html5_6CE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_6D1, ht_bucket_html5_6D2, ht_bucket_empty,
ht_bucket_html5_6D4, ht_bucket_html5_6D5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_6D9, ht_bucket_html5_6DA, ht_bucket_html5_6DB,
ht_bucket_html5_6DC, ht_bucket_empty, ht_bucket_html5_6DE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_6E7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_6EB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_6EE, ht_bucket_html5_6EF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_6F5, ht_bucket_empty, ht_bucket_html5_6F7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_6FB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_704, ht_bucket_html5_705, ht_bucket_html5_706, ht_bucket_html5_707,
ht_bucket_empty, ht_bucket_html5_709, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_70C, ht_bucket_empty, ht_bucket_html5_70E, ht_bucket_html5_70F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_712, ht_bucket_empty,
ht_bucket_html5_714, ht_bucket_html5_715, ht_bucket_empty, ht_bucket_html5_717,
ht_bucket_empty, ht_bucket_html5_719, ht_bucket_empty, ht_bucket_html5_71B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_71E, ht_bucket_html5_71F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_723,
ht_bucket_html5_724, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_727,
ht_bucket_empty, ht_bucket_html5_729, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_72C, ht_bucket_html5_72D, ht_bucket_html5_72E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_733,
ht_bucket_html5_734, ht_bucket_html5_735, ht_bucket_html5_736, ht_bucket_empty,
ht_bucket_html5_738, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_73B,
ht_bucket_empty, ht_bucket_html5_73D, ht_bucket_html5_73E, ht_bucket_html5_73F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_744, ht_bucket_html5_745, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_749, ht_bucket_empty, ht_bucket_html5_74B,
ht_bucket_html5_74C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_74F,
ht_bucket_empty, ht_bucket_html5_751, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_754, ht_bucket_html5_755, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_759, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_75C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_75F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_762, ht_bucket_html5_763,
ht_bucket_empty, ht_bucket_html5_765, ht_bucket_html5_766, ht_bucket_empty,
ht_bucket_html5_768, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_76E, ht_bucket_empty,
ht_bucket_html5_770, ht_bucket_empty, ht_bucket_html5_772, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_776, ht_bucket_html5_777,
ht_bucket_empty, ht_bucket_html5_779, ht_bucket_html5_77A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_77F,
ht_bucket_empty, ht_bucket_html5_781, ht_bucket_empty, ht_bucket_html5_783,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_787,
ht_bucket_empty, ht_bucket_html5_789, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_78C, ht_bucket_html5_78D, ht_bucket_html5_78E, ht_bucket_empty,
ht_bucket_html5_790, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_794, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_797,
ht_bucket_html5_798, ht_bucket_empty, ht_bucket_html5_79A, ht_bucket_html5_79B,
ht_bucket_empty, ht_bucket_html5_79D, ht_bucket_empty, ht_bucket_html5_79F,
ht_bucket_html5_7A0, ht_bucket_html5_7A1, ht_bucket_html5_7A2, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_7A5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_7A8, ht_bucket_empty, ht_bucket_html5_7AA, ht_bucket_html5_7AB,
ht_bucket_html5_7AC, ht_bucket_empty, ht_bucket_html5_7AE, ht_bucket_html5_7AF,
ht_bucket_html5_7B0, ht_bucket_html5_7B1, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_7B5, ht_bucket_html5_7B6, ht_bucket_html5_7B7,
ht_bucket_html5_7B8, ht_bucket_empty, ht_bucket_html5_7BA, ht_bucket_html5_7BB,
ht_bucket_empty, ht_bucket_html5_7BD, ht_bucket_html5_7BE, ht_bucket_html5_7BF,
ht_bucket_html5_7C0, ht_bucket_html5_7C1, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_7C9, ht_bucket_empty, ht_bucket_html5_7CB,
ht_bucket_empty, ht_bucket_html5_7CD, ht_bucket_html5_7CE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_7D1, ht_bucket_html5_7D2, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_7D5, ht_bucket_html5_7D6, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_7D9, ht_bucket_html5_7DA, ht_bucket_empty,
ht_bucket_html5_7DC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_7E0, ht_bucket_empty, ht_bucket_html5_7E2, ht_bucket_empty,
ht_bucket_html5_7E4, ht_bucket_html5_7E5, ht_bucket_html5_7E6, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_7EC, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_7EF,
ht_bucket_empty, ht_bucket_html5_7F1, ht_bucket_empty, ht_bucket_html5_7F3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_7F6, ht_bucket_empty,
ht_bucket_html5_7F8, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_7FB,
ht_bucket_html5_7FC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_800, ht_bucket_empty, ht_bucket_html5_802, ht_bucket_html5_803,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_808, ht_bucket_html5_809, ht_bucket_html5_80A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_810, ht_bucket_html5_811, ht_bucket_html5_812, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_816, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_819, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_81C, ht_bucket_html5_81D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_821, ht_bucket_html5_822, ht_bucket_html5_823,
ht_bucket_html5_824, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_828, ht_bucket_html5_829, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_82D, ht_bucket_html5_82E, ht_bucket_empty,
ht_bucket_html5_830, ht_bucket_html5_831, ht_bucket_html5_832, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_837,
ht_bucket_html5_838, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_83C, ht_bucket_empty, ht_bucket_html5_83E, ht_bucket_html5_83F,
ht_bucket_empty, ht_bucket_html5_841, ht_bucket_html5_842, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_846, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_84A, ht_bucket_empty,
ht_bucket_html5_84C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_84F,
ht_bucket_html5_850, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_855, ht_bucket_empty, ht_bucket_html5_857,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_85B,
ht_bucket_html5_85C, ht_bucket_empty, ht_bucket_html5_85E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_861, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_865, ht_bucket_html5_866, ht_bucket_html5_867,
ht_bucket_html5_868, ht_bucket_empty, ht_bucket_html5_86A, ht_bucket_html5_86B,
ht_bucket_html5_86C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_873,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_876, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_879, ht_bucket_empty, ht_bucket_html5_87B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_87E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_88A, ht_bucket_empty,
ht_bucket_html5_88C, ht_bucket_empty, ht_bucket_html5_88E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_892, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_895, ht_bucket_html5_896, ht_bucket_html5_897,
ht_bucket_html5_898, ht_bucket_html5_899, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_89D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_8A5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_8AC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_8B3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_8B6, ht_bucket_html5_8B7,
ht_bucket_html5_8B8, ht_bucket_html5_8B9, ht_bucket_html5_8BA, ht_bucket_empty,
ht_bucket_html5_8BC, ht_bucket_html5_8BD, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_8C0, ht_bucket_html5_8C1, ht_bucket_html5_8C2, ht_bucket_empty,
ht_bucket_html5_8C4, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_8C7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_8CF,
ht_bucket_html5_8D0, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_8D3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_8D6, ht_bucket_empty,
ht_bucket_html5_8D8, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_8DC, ht_bucket_html5_8DD, ht_bucket_html5_8DE, ht_bucket_html5_8DF,
ht_bucket_html5_8E0, ht_bucket_empty, ht_bucket_html5_8E2, ht_bucket_html5_8E3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_8E7,
ht_bucket_html5_8E8, ht_bucket_html5_8E9, ht_bucket_empty, ht_bucket_html5_8EB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_8F3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_8FB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_8FE, ht_bucket_html5_8FF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_904, ht_bucket_empty, ht_bucket_html5_906, ht_bucket_html5_907,
ht_bucket_empty, ht_bucket_html5_909, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_90C, ht_bucket_empty, ht_bucket_html5_90E, ht_bucket_empty,
ht_bucket_html5_910, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_913,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_916, ht_bucket_empty,
ht_bucket_html5_918, ht_bucket_html5_919, ht_bucket_empty, ht_bucket_html5_91B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_91E, ht_bucket_html5_91F,
ht_bucket_html5_920, ht_bucket_empty, ht_bucket_html5_922, ht_bucket_html5_923,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_927,
ht_bucket_empty, ht_bucket_html5_929, ht_bucket_html5_92A, ht_bucket_empty,
ht_bucket_html5_92C, ht_bucket_empty, ht_bucket_html5_92E, ht_bucket_empty,
ht_bucket_html5_930, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_936, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_939, ht_bucket_empty, ht_bucket_html5_93B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_93F,
ht_bucket_html5_940, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_945, ht_bucket_empty, ht_bucket_html5_947,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_94D, ht_bucket_empty, ht_bucket_html5_94F,
ht_bucket_html5_950, ht_bucket_empty, ht_bucket_html5_952, ht_bucket_html5_953,
ht_bucket_html5_954, ht_bucket_html5_955, ht_bucket_html5_956, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_959, ht_bucket_empty, ht_bucket_html5_95B,
ht_bucket_empty, ht_bucket_html5_95D, ht_bucket_empty, ht_bucket_html5_95F,
ht_bucket_html5_960, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_963,
ht_bucket_html5_964, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_969, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_96C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_96F,
ht_bucket_html5_970, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_974, ht_bucket_html5_975, ht_bucket_html5_976, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_97B,
ht_bucket_html5_97C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_97F,
ht_bucket_empty, ht_bucket_html5_981, ht_bucket_empty, ht_bucket_html5_983,
ht_bucket_empty, ht_bucket_html5_985, ht_bucket_html5_986, ht_bucket_empty,
ht_bucket_html5_988, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_98C, ht_bucket_html5_98D, ht_bucket_empty, ht_bucket_html5_98F,
ht_bucket_html5_990, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_996, ht_bucket_html5_997,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_99B,
ht_bucket_html5_99C, ht_bucket_empty, ht_bucket_html5_99E, ht_bucket_html5_99F,
ht_bucket_html5_9A0, ht_bucket_html5_9A1, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_9A6, ht_bucket_empty,
ht_bucket_html5_9A8, ht_bucket_empty, ht_bucket_html5_9AA, ht_bucket_empty,
ht_bucket_html5_9AC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_9B0, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_9B4, ht_bucket_html5_9B5, ht_bucket_html5_9B6, ht_bucket_html5_9B7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_9BB,
ht_bucket_html5_9BC, ht_bucket_html5_9BD, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_9C1, ht_bucket_html5_9C2, ht_bucket_html5_9C3,
ht_bucket_empty, ht_bucket_html5_9C5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_9C9, ht_bucket_html5_9CA, ht_bucket_empty,
ht_bucket_html5_9CC, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_9CF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_9D5, ht_bucket_html5_9D6, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_9DA, ht_bucket_html5_9DB,
ht_bucket_empty, ht_bucket_html5_9DD, ht_bucket_html5_9DE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_9E1, ht_bucket_html5_9E2, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_9E5, ht_bucket_empty, ht_bucket_html5_9E7,
ht_bucket_empty, ht_bucket_html5_9E9, ht_bucket_empty, ht_bucket_html5_9EB,
ht_bucket_empty, ht_bucket_html5_9ED, ht_bucket_html5_9EE, ht_bucket_html5_9EF,
ht_bucket_html5_9F0, ht_bucket_html5_9F1, ht_bucket_html5_9F2, ht_bucket_html5_9F3,
ht_bucket_html5_9F4, ht_bucket_html5_9F5, ht_bucket_empty, ht_bucket_html5_9F7,
ht_bucket_empty, ht_bucket_html5_9F9, ht_bucket_html5_9FA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_9FD, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_A01, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_A04, ht_bucket_html5_A05, ht_bucket_html5_A06, ht_bucket_empty,
ht_bucket_html5_A08, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_A0C, ht_bucket_html5_A0D, ht_bucket_html5_A0E, ht_bucket_empty,
ht_bucket_html5_A10, ht_bucket_empty, ht_bucket_html5_A12, ht_bucket_empty,
ht_bucket_html5_A14, ht_bucket_html5_A15, ht_bucket_html5_A16, ht_bucket_html5_A17,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_A21, ht_bucket_empty, ht_bucket_html5_A23,
ht_bucket_html5_A24, ht_bucket_html5_A25, ht_bucket_html5_A26, ht_bucket_empty,
ht_bucket_html5_A28, ht_bucket_empty, ht_bucket_html5_A2A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_A2D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_A30, ht_bucket_empty, ht_bucket_html5_A32, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_A36, ht_bucket_html5_A37,
ht_bucket_empty, ht_bucket_html5_A39, ht_bucket_empty, ht_bucket_html5_A3B,
ht_bucket_html5_A3C, ht_bucket_html5_A3D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_A41, ht_bucket_empty, ht_bucket_html5_A43,
ht_bucket_html5_A44, ht_bucket_html5_A45, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_A48, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_A4F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_A53,
ht_bucket_html5_A54, ht_bucket_empty, ht_bucket_html5_A56, ht_bucket_html5_A57,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_A5A, ht_bucket_html5_A5B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_A61, ht_bucket_html5_A62, ht_bucket_html5_A63,
ht_bucket_html5_A64, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_A69, ht_bucket_html5_A6A, ht_bucket_html5_A6B,
ht_bucket_empty, ht_bucket_html5_A6D, ht_bucket_empty, ht_bucket_html5_A6F,
ht_bucket_html5_A70, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_A76, ht_bucket_empty,
ht_bucket_html5_A78, ht_bucket_empty, ht_bucket_html5_A7A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_A7E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_A81, ht_bucket_html5_A82, ht_bucket_empty,
ht_bucket_html5_A84, ht_bucket_html5_A85, ht_bucket_html5_A86, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_A89, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_A8D, ht_bucket_html5_A8E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_A92, ht_bucket_empty,
ht_bucket_html5_A94, ht_bucket_empty, ht_bucket_html5_A96, ht_bucket_empty,
ht_bucket_html5_A98, ht_bucket_html5_A99, ht_bucket_html5_A9A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_A9D, ht_bucket_empty, ht_bucket_html5_A9F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_AA3,
ht_bucket_html5_AA4, ht_bucket_html5_AA5, ht_bucket_empty, ht_bucket_html5_AA7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_AAC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_AB2, ht_bucket_empty,
ht_bucket_html5_AB4, ht_bucket_html5_AB5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_ABA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_AC0, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_AC4, ht_bucket_html5_AC5, ht_bucket_html5_AC6, ht_bucket_html5_AC7,
ht_bucket_html5_AC8, ht_bucket_empty, ht_bucket_html5_ACA, ht_bucket_empty,
ht_bucket_html5_ACC, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_ACF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_AD2, ht_bucket_html5_AD3,
ht_bucket_html5_AD4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_ADA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_ADD, ht_bucket_empty, ht_bucket_html5_ADF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_AE4, ht_bucket_html5_AE5, ht_bucket_html5_AE6, ht_bucket_html5_AE7,
ht_bucket_html5_AE8, ht_bucket_html5_AE9, ht_bucket_empty, ht_bucket_html5_AEB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_AF5, ht_bucket_html5_AF6, ht_bucket_html5_AF7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_AFA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_AFD, ht_bucket_html5_AFE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_B08, ht_bucket_html5_B09, ht_bucket_html5_B0A, ht_bucket_empty,
ht_bucket_html5_B0C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_B10, ht_bucket_html5_B11, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_B15, ht_bucket_html5_B16, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_B1E, ht_bucket_html5_B1F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_B23,
ht_bucket_html5_B24, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_B27,
ht_bucket_empty, ht_bucket_html5_B29, ht_bucket_html5_B2A, ht_bucket_html5_B2B,
ht_bucket_html5_B2C, ht_bucket_html5_B2D, ht_bucket_html5_B2E, ht_bucket_html5_B2F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_B33,
ht_bucket_empty, ht_bucket_html5_B35, ht_bucket_html5_B36, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_B3A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_B3D, ht_bucket_html5_B3E, ht_bucket_empty,
ht_bucket_html5_B40, ht_bucket_empty, ht_bucket_html5_B42, ht_bucket_empty,
ht_bucket_html5_B44, ht_bucket_empty, ht_bucket_html5_B46, ht_bucket_html5_B47,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_B4C, ht_bucket_empty, ht_bucket_html5_B4E, ht_bucket_html5_B4F,
ht_bucket_html5_B50, ht_bucket_html5_B51, ht_bucket_html5_B52, ht_bucket_html5_B53,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_B56, ht_bucket_empty,
ht_bucket_html5_B58, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_B5C, ht_bucket_html5_B5D, ht_bucket_html5_B5E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_B63,
ht_bucket_html5_B64, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_B67,
ht_bucket_empty, ht_bucket_html5_B69, ht_bucket_empty, ht_bucket_html5_B6B,
ht_bucket_empty, ht_bucket_html5_B6D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_B72, ht_bucket_html5_B73,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_B77,
ht_bucket_html5_B78, ht_bucket_empty, ht_bucket_html5_B7A, ht_bucket_html5_B7B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_B7E, ht_bucket_html5_B7F,
ht_bucket_empty, ht_bucket_html5_B81, ht_bucket_html5_B82, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_B87,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_B8D, ht_bucket_empty, ht_bucket_html5_B8F,
ht_bucket_html5_B90, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_B94, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_B98, ht_bucket_html5_B99, ht_bucket_html5_B9A, ht_bucket_empty,
ht_bucket_html5_B9C, ht_bucket_html5_B9D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_BA5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_BA9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_BAE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_BB2, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_BB5, ht_bucket_html5_BB6, ht_bucket_html5_BB7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_BBA, ht_bucket_empty,
ht_bucket_html5_BBC, ht_bucket_html5_BBD, ht_bucket_empty, ht_bucket_html5_BBF,
ht_bucket_empty, ht_bucket_html5_BC1, ht_bucket_html5_BC2, ht_bucket_html5_BC3,
ht_bucket_html5_BC4, ht_bucket_html5_BC5, ht_bucket_html5_BC6, ht_bucket_html5_BC7,
ht_bucket_html5_BC8, ht_bucket_html5_BC9, ht_bucket_empty, ht_bucket_html5_BCB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_BCE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_BD1, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_BD7,
ht_bucket_html5_BD8, ht_bucket_html5_BD9, ht_bucket_html5_BDA, ht_bucket_html5_BDB,
ht_bucket_empty, ht_bucket_html5_BDD, ht_bucket_empty, ht_bucket_html5_BDF,
ht_bucket_empty, ht_bucket_html5_BE1, ht_bucket_html5_BE2, ht_bucket_empty,
ht_bucket_html5_BE4, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_BE7,
ht_bucket_html5_BE8, ht_bucket_html5_BE9, ht_bucket_html5_BEA, ht_bucket_html5_BEB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_BEF,
ht_bucket_html5_BF0, ht_bucket_html5_BF1, ht_bucket_html5_BF2, ht_bucket_html5_BF3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_BF7,
ht_bucket_empty, ht_bucket_html5_BF9, ht_bucket_html5_BFA, ht_bucket_empty,
ht_bucket_html5_BFC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C02, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C0B,
ht_bucket_html5_C0C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_C11, ht_bucket_html5_C12, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C16, ht_bucket_empty,
ht_bucket_html5_C18, ht_bucket_empty, ht_bucket_html5_C1A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_C1D, ht_bucket_html5_C1E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C23,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C27,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C2B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_C31, ht_bucket_html5_C32, ht_bucket_html5_C33,
ht_bucket_html5_C34, ht_bucket_html5_C35, ht_bucket_html5_C36, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C3A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C3E, ht_bucket_html5_C3F,
ht_bucket_html5_C40, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C43,
ht_bucket_html5_C44, ht_bucket_empty, ht_bucket_html5_C46, ht_bucket_empty,
ht_bucket_html5_C48, ht_bucket_empty, ht_bucket_html5_C4A, ht_bucket_html5_C4B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C4F,
ht_bucket_html5_C50, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_C54, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_C58, ht_bucket_empty, ht_bucket_html5_C5A, ht_bucket_html5_C5B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C5F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_C69, ht_bucket_empty, ht_bucket_html5_C6B,
ht_bucket_html5_C6C, ht_bucket_empty, ht_bucket_html5_C6E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C72, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C76, ht_bucket_html5_C77,
ht_bucket_html5_C78, ht_bucket_empty, ht_bucket_html5_C7A, ht_bucket_html5_C7B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_C80, ht_bucket_empty, ht_bucket_html5_C82, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_C89, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_C8E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_C91, ht_bucket_empty, ht_bucket_html5_C93,
ht_bucket_html5_C94, ht_bucket_html5_C95, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_C98, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_CA6, ht_bucket_empty,
ht_bucket_html5_CA8, ht_bucket_html5_CA9, ht_bucket_html5_CAA, ht_bucket_empty,
ht_bucket_html5_CAC, ht_bucket_html5_CAD, ht_bucket_empty, ht_bucket_html5_CAF,
ht_bucket_html5_CB0, ht_bucket_empty, ht_bucket_html5_CB2, ht_bucket_empty,
ht_bucket_html5_CB4, ht_bucket_empty, ht_bucket_html5_CB6, ht_bucket_html5_CB7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_CBA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_CBE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_CC2, ht_bucket_empty,
ht_bucket_html5_CC4, ht_bucket_html5_CC5, ht_bucket_empty, ht_bucket_html5_CC7,
ht_bucket_html5_CC8, ht_bucket_html5_CC9, ht_bucket_empty, ht_bucket_html5_CCB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_CD0, ht_bucket_empty, ht_bucket_html5_CD2, ht_bucket_html5_CD3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_CD9, ht_bucket_html5_CDA, ht_bucket_empty,
ht_bucket_html5_CDC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_CE4, ht_bucket_empty, ht_bucket_html5_CE6, ht_bucket_empty,
ht_bucket_html5_CE8, ht_bucket_html5_CE9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_CED, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_CF1, ht_bucket_html5_CF2, ht_bucket_html5_CF3,
ht_bucket_html5_CF4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_CFA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_CFF,
ht_bucket_html5_D00, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D06, ht_bucket_html5_D07,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D0B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D0E, ht_bucket_empty,
ht_bucket_html5_D10, ht_bucket_html5_D11, ht_bucket_html5_D12, ht_bucket_html5_D13,
ht_bucket_empty, ht_bucket_html5_D15, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_D18, ht_bucket_html5_D19, ht_bucket_html5_D1A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D1E, ht_bucket_html5_D1F,
ht_bucket_html5_D20, ht_bucket_empty, ht_bucket_html5_D22, ht_bucket_empty,
ht_bucket_html5_D24, ht_bucket_empty, ht_bucket_html5_D26, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D2A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D32, ht_bucket_empty,
ht_bucket_html5_D34, ht_bucket_html5_D35, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_D38, ht_bucket_html5_D39, ht_bucket_html5_D3A, ht_bucket_html5_D3B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D3E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D42, ht_bucket_empty,
ht_bucket_html5_D44, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_D49, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_D4C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D4F,
ht_bucket_empty, ht_bucket_html5_D51, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_D54, ht_bucket_html5_D55, ht_bucket_html5_D56, ht_bucket_html5_D57,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D5A, ht_bucket_html5_D5B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D5F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D63,
ht_bucket_empty, ht_bucket_html5_D65, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_D69, ht_bucket_html5_D6A, ht_bucket_empty,
ht_bucket_html5_D6C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_D6F,
ht_bucket_html5_D70, ht_bucket_html5_D71, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_D74, ht_bucket_html5_D75, ht_bucket_html5_D76, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_D7C, ht_bucket_html5_D7D, ht_bucket_html5_D7E, ht_bucket_empty,
ht_bucket_html5_D80, ht_bucket_html5_D81, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_D85, ht_bucket_empty, ht_bucket_html5_D87,
ht_bucket_empty, ht_bucket_html5_D89, ht_bucket_html5_D8A, ht_bucket_empty,
ht_bucket_html5_D8C, ht_bucket_html5_D8D, ht_bucket_html5_D8E, ht_bucket_html5_D8F,
ht_bucket_html5_D90, ht_bucket_html5_D91, ht_bucket_empty, ht_bucket_html5_D93,
ht_bucket_html5_D94, ht_bucket_html5_D95, ht_bucket_html5_D96, ht_bucket_empty,
ht_bucket_html5_D98, ht_bucket_empty, ht_bucket_html5_D9A, ht_bucket_empty,
ht_bucket_html5_D9C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_DA0, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_DA5, ht_bucket_html5_DA6, ht_bucket_empty,
ht_bucket_html5_DA8, ht_bucket_html5_DA9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_DAC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_DB0, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_DB3,
ht_bucket_html5_DB4, ht_bucket_empty, ht_bucket_html5_DB6, ht_bucket_html5_DB7,
ht_bucket_empty, ht_bucket_html5_DB9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_DBC, ht_bucket_empty, ht_bucket_html5_DBE, ht_bucket_html5_DBF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_DC2, ht_bucket_html5_DC3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_DC6, ht_bucket_empty,
ht_bucket_html5_DC8, ht_bucket_empty, ht_bucket_html5_DCA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_DCF,
ht_bucket_empty, ht_bucket_html5_DD1, ht_bucket_empty, ht_bucket_html5_DD3,
ht_bucket_html5_DD4, ht_bucket_html5_DD5, ht_bucket_empty, ht_bucket_html5_DD7,
ht_bucket_empty, ht_bucket_html5_DD9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_DDC, ht_bucket_html5_DDD, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_DE4, ht_bucket_empty, ht_bucket_html5_DE6, ht_bucket_html5_DE7,
ht_bucket_empty, ht_bucket_html5_DE9, ht_bucket_empty, ht_bucket_html5_DEB,
ht_bucket_empty, ht_bucket_html5_DED, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_DF1, ht_bucket_html5_DF2, ht_bucket_html5_DF3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_DF6, ht_bucket_html5_DF7,
ht_bucket_empty, ht_bucket_html5_DF9, ht_bucket_empty, ht_bucket_html5_DFB,
ht_bucket_empty, ht_bucket_html5_DFD, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_E03,
ht_bucket_html5_E04, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_E08, ht_bucket_html5_E09, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_E0C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_E11, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_E18, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_E1B,
ht_bucket_html5_E1C, ht_bucket_html5_E1D, ht_bucket_html5_E1E, ht_bucket_empty,
ht_bucket_html5_E20, ht_bucket_empty, ht_bucket_html5_E22, ht_bucket_html5_E23,
ht_bucket_html5_E24, ht_bucket_html5_E25, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_E28, ht_bucket_empty, ht_bucket_html5_E2A, ht_bucket_empty,
ht_bucket_html5_E2C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_E2F,
ht_bucket_html5_E30, ht_bucket_empty, ht_bucket_html5_E32, ht_bucket_html5_E33,
ht_bucket_empty, ht_bucket_html5_E35, ht_bucket_html5_E36, ht_bucket_html5_E37,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_E3B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_E41, ht_bucket_html5_E42, ht_bucket_html5_E43,
ht_bucket_html5_E44, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_E48, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_E4C, ht_bucket_empty, ht_bucket_html5_E4E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_E53,
ht_bucket_empty, ht_bucket_html5_E55, ht_bucket_empty, ht_bucket_html5_E57,
ht_bucket_html5_E58, ht_bucket_html5_E59, ht_bucket_empty, ht_bucket_html5_E5B,
ht_bucket_empty, ht_bucket_html5_E5D, ht_bucket_html5_E5E, ht_bucket_html5_E5F,
ht_bucket_html5_E60, ht_bucket_html5_E61, ht_bucket_empty, ht_bucket_html5_E63,
ht_bucket_html5_E64, ht_bucket_html5_E65, ht_bucket_html5_E66, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_E6A, ht_bucket_empty,
ht_bucket_html5_E6C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_E6F,
ht_bucket_html5_E70, ht_bucket_html5_E71, ht_bucket_empty, ht_bucket_html5_E73,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_E78, ht_bucket_empty, ht_bucket_html5_E7A, ht_bucket_empty,
ht_bucket_html5_E7C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_E80, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_E83,
ht_bucket_html5_E84, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_E89, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_E8E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_E91, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_E99, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_E9C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_EA1, ht_bucket_html5_EA2, ht_bucket_empty,
ht_bucket_html5_EA4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_EA8, ht_bucket_html5_EA9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_EB1, ht_bucket_html5_EB2, ht_bucket_empty,
ht_bucket_html5_EB4, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_EB7,
ht_bucket_html5_EB8, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_ECA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_ECD, ht_bucket_empty, ht_bucket_html5_ECF,
ht_bucket_empty, ht_bucket_html5_ED1, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_ED7,
ht_bucket_html5_ED8, ht_bucket_html5_ED9, ht_bucket_html5_EDA, ht_bucket_html5_EDB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_EDF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_EE2, ht_bucket_empty,
ht_bucket_html5_EE4, ht_bucket_html5_EE5, ht_bucket_empty, ht_bucket_html5_EE7,
ht_bucket_empty, ht_bucket_html5_EE9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_EEF,
ht_bucket_empty, ht_bucket_html5_EF1, ht_bucket_html5_EF2, ht_bucket_empty,
ht_bucket_html5_EF4, ht_bucket_html5_EF5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_EF9, ht_bucket_empty, ht_bucket_html5_EFB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_EFE, ht_bucket_html5_EFF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F03,
ht_bucket_empty, ht_bucket_html5_F05, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_F08, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F0B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F0F,
ht_bucket_html5_F10, ht_bucket_html5_F11, ht_bucket_empty, ht_bucket_html5_F13,
ht_bucket_html5_F14, ht_bucket_html5_F15, ht_bucket_empty, ht_bucket_html5_F17,
ht_bucket_html5_F18, ht_bucket_html5_F19, ht_bucket_html5_F1A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F1F,
ht_bucket_html5_F20, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F26, ht_bucket_empty,
ht_bucket_html5_F28, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_F2C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F2F,
ht_bucket_html5_F30, ht_bucket_html5_F31, ht_bucket_empty, ht_bucket_html5_F33,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F37,
ht_bucket_html5_F38, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_F40, ht_bucket_html5_F41, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_F44, ht_bucket_empty, ht_bucket_html5_F46, ht_bucket_html5_F47,
ht_bucket_html5_F48, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F4B,
ht_bucket_html5_F4C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_F50, ht_bucket_html5_F51, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_F55, ht_bucket_empty, ht_bucket_html5_F57,
ht_bucket_html5_F58, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F5B,
ht_bucket_empty, ht_bucket_html5_F5D, ht_bucket_empty, ht_bucket_html5_F5F,
ht_bucket_html5_F60, ht_bucket_empty, ht_bucket_html5_F62, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F67,
ht_bucket_html5_F68, ht_bucket_html5_F69, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_F6D, ht_bucket_html5_F6E, ht_bucket_html5_F6F,
ht_bucket_html5_F70, ht_bucket_html5_F71, ht_bucket_html5_F72, ht_bucket_empty,
ht_bucket_html5_F74, ht_bucket_html5_F75, ht_bucket_html5_F76, ht_bucket_html5_F77,
ht_bucket_html5_F78, ht_bucket_html5_F79, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_F81, ht_bucket_html5_F82, ht_bucket_html5_F83,
ht_bucket_html5_F84, ht_bucket_empty, ht_bucket_html5_F86, ht_bucket_empty,
ht_bucket_html5_F88, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F8B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F8F,
ht_bucket_empty, ht_bucket_html5_F91, ht_bucket_empty, ht_bucket_html5_F93,
ht_bucket_empty, ht_bucket_html5_F95, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F9A, ht_bucket_empty,
ht_bucket_html5_F9C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_F9F,
ht_bucket_html5_FA0, ht_bucket_html5_FA1, ht_bucket_html5_FA2, ht_bucket_html5_FA3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_FA9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_FB0, ht_bucket_html5_FB1, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_FB4, ht_bucket_html5_FB5, ht_bucket_html5_FB6, ht_bucket_html5_FB7,
ht_bucket_empty, ht_bucket_html5_FB9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_FBC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_FC2, ht_bucket_empty,
ht_bucket_html5_FC4, ht_bucket_empty, ht_bucket_html5_FC6, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_FCA, ht_bucket_empty,
ht_bucket_html5_FCC, ht_bucket_empty, ht_bucket_html5_FCE, ht_bucket_html5_FCF,
ht_bucket_html5_FD0, ht_bucket_empty, ht_bucket_html5_FD2, ht_bucket_html5_FD3,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_FD6, ht_bucket_html5_FD7,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_FDD, ht_bucket_empty, ht_bucket_html5_FDF,
ht_bucket_empty, ht_bucket_html5_FE1, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html5_FE6, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_FE9, ht_bucket_html5_FEA, ht_bucket_html5_FEB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html5_FF5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_FF8, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html5_FFC, ht_bucket_empty, ht_bucket_html5_FFE, ht_bucket_empty,
};
static const entity_ht ent_ht_html5 = {
0x1000,
ht_buckets_html5
};
/* end of HTML5 hash table for entity -> codepoint }}} */
/* {{{ Start of HTML 4.01 multi-stage table for codepoint -> entity */
/* {{{ Stage 3 Tables for HTML 4.01 */
static const entity_stage3_row stage3_table_html4_00000[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"quot", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"amp", 3} } }, {0, { {"#039", 4} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"lt", 2} } }, {0, { {NULL, 0} } }, {0, { {"gt", 2} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_00080[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"nbsp", 4} } }, {0, { {"iexcl", 5} } }, {0, { {"cent", 4} } }, {0, { {"pound", 5} } },
{0, { {"curren", 6} } }, {0, { {"yen", 3} } }, {0, { {"brvbar", 6} } }, {0, { {"sect", 4} } },
{0, { {"uml", 3} } }, {0, { {"copy", 4} } }, {0, { {"ordf", 4} } }, {0, { {"laquo", 5} } },
{0, { {"not", 3} } }, {0, { {"shy", 3} } }, {0, { {"reg", 3} } }, {0, { {"macr", 4} } },
{0, { {"deg", 3} } }, {0, { {"plusmn", 6} } }, {0, { {"sup2", 4} } }, {0, { {"sup3", 4} } },
{0, { {"acute", 5} } }, {0, { {"micro", 5} } }, {0, { {"para", 4} } }, {0, { {"middot", 6} } },
{0, { {"cedil", 5} } }, {0, { {"sup1", 4} } }, {0, { {"ordm", 4} } }, {0, { {"raquo", 5} } },
{0, { {"frac14", 6} } }, {0, { {"frac12", 6} } }, {0, { {"frac34", 6} } }, {0, { {"iquest", 6} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_000C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {"Agrave", 6} } }, {0, { {"Aacute", 6} } }, {0, { {"Acirc", 5} } }, {0, { {"Atilde", 6} } },
{0, { {"Auml", 4} } }, {0, { {"Aring", 5} } }, {0, { {"AElig", 5} } }, {0, { {"Ccedil", 6} } },
{0, { {"Egrave", 6} } }, {0, { {"Eacute", 6} } }, {0, { {"Ecirc", 5} } }, {0, { {"Euml", 4} } },
{0, { {"Igrave", 6} } }, {0, { {"Iacute", 6} } }, {0, { {"Icirc", 5} } }, {0, { {"Iuml", 4} } },
{0, { {"ETH", 3} } }, {0, { {"Ntilde", 6} } }, {0, { {"Ograve", 6} } }, {0, { {"Oacute", 6} } },
{0, { {"Ocirc", 5} } }, {0, { {"Otilde", 6} } }, {0, { {"Ouml", 4} } }, {0, { {"times", 5} } },
{0, { {"Oslash", 6} } }, {0, { {"Ugrave", 6} } }, {0, { {"Uacute", 6} } }, {0, { {"Ucirc", 5} } },
{0, { {"Uuml", 4} } }, {0, { {"Yacute", 6} } }, {0, { {"THORN", 5} } }, {0, { {"szlig", 5} } },
{0, { {"agrave", 6} } }, {0, { {"aacute", 6} } }, {0, { {"acirc", 5} } }, {0, { {"atilde", 6} } },
{0, { {"auml", 4} } }, {0, { {"aring", 5} } }, {0, { {"aelig", 5} } }, {0, { {"ccedil", 6} } },
{0, { {"egrave", 6} } }, {0, { {"eacute", 6} } }, {0, { {"ecirc", 5} } }, {0, { {"euml", 4} } },
{0, { {"igrave", 6} } }, {0, { {"iacute", 6} } }, {0, { {"icirc", 5} } }, {0, { {"iuml", 4} } },
{0, { {"eth", 3} } }, {0, { {"ntilde", 6} } }, {0, { {"ograve", 6} } }, {0, { {"oacute", 6} } },
{0, { {"ocirc", 5} } }, {0, { {"otilde", 6} } }, {0, { {"ouml", 4} } }, {0, { {"divide", 6} } },
{0, { {"oslash", 6} } }, {0, { {"ugrave", 6} } }, {0, { {"uacute", 6} } }, {0, { {"ucirc", 5} } },
{0, { {"uuml", 4} } }, {0, { {"yacute", 6} } }, {0, { {"thorn", 5} } }, {0, { {"yuml", 4} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_00140[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"OElig", 5} } }, {0, { {"oelig", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"Scaron", 6} } }, {0, { {"scaron", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"Yuml", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_00180[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"fnof", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_002C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"circ", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"tilde", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_00380[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"Alpha", 5} } }, {0, { {"Beta", 4} } }, {0, { {"Gamma", 5} } },
{0, { {"Delta", 5} } }, {0, { {"Epsilon", 7} } }, {0, { {"Zeta", 4} } }, {0, { {"Eta", 3} } },
{0, { {"Theta", 5} } }, {0, { {"Iota", 4} } }, {0, { {"Kappa", 5} } }, {0, { {"Lambda", 6} } },
{0, { {"Mu", 2} } }, {0, { {"Nu", 2} } }, {0, { {"Xi", 2} } }, {0, { {"Omicron", 7} } },
{0, { {"Pi", 2} } }, {0, { {"Rho", 3} } }, {0, { {NULL, 0} } }, {0, { {"Sigma", 5} } },
{0, { {"Tau", 3} } }, {0, { {"Upsilon", 7} } }, {0, { {"Phi", 3} } }, {0, { {"Chi", 3} } },
{0, { {"Psi", 3} } }, {0, { {"Omega", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"alpha", 5} } }, {0, { {"beta", 4} } }, {0, { {"gamma", 5} } },
{0, { {"delta", 5} } }, {0, { {"epsilon", 7} } }, {0, { {"zeta", 4} } }, {0, { {"eta", 3} } },
{0, { {"theta", 5} } }, {0, { {"iota", 4} } }, {0, { {"kappa", 5} } }, {0, { {"lambda", 6} } },
{0, { {"mu", 2} } }, {0, { {"nu", 2} } }, {0, { {"xi", 2} } }, {0, { {"omicron", 7} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_003C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {"pi", 2} } }, {0, { {"rho", 3} } }, {0, { {"sigmaf", 6} } }, {0, { {"sigma", 5} } },
{0, { {"tau", 3} } }, {0, { {"upsilon", 7} } }, {0, { {"phi", 3} } }, {0, { {"chi", 3} } },
{0, { {"psi", 3} } }, {0, { {"omega", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"thetasym", 8} } }, {0, { {"upsih", 5} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"piv", 3} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_02000[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"ensp", 4} } }, {0, { {"emsp", 4} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"thinsp", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"zwnj", 4} } }, {0, { {"zwj", 3} } }, {0, { {"lrm", 3} } }, {0, { {"rlm", 3} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"ndash", 5} } },
{0, { {"mdash", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"lsquo", 5} } }, {0, { {"rsquo", 5} } }, {0, { {"sbquo", 5} } }, {0, { {NULL, 0} } },
{0, { {"ldquo", 5} } }, {0, { {"rdquo", 5} } }, {0, { {"bdquo", 5} } }, {0, { {NULL, 0} } },
{0, { {"dagger", 6} } }, {0, { {"Dagger", 6} } }, {0, { {"bull", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"hellip", 6} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"permil", 6} } }, {0, { {NULL, 0} } }, {0, { {"prime", 5} } }, {0, { {"Prime", 5} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"lsaquo", 6} } }, {0, { {"rsaquo", 6} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"oline", 5} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_02040[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"frasl", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_02080[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"euro", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_02100[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"image", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"weierp", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"real", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"trade", 5} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"alefsym", 7} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_02180[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"larr", 4} } }, {0, { {"uarr", 4} } }, {0, { {"rarr", 4} } }, {0, { {"darr", 4} } },
{0, { {"harr", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"crarr", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_021C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"lArr", 4} } }, {0, { {"uArr", 4} } }, {0, { {"rArr", 4} } }, {0, { {"dArr", 4} } },
{0, { {"hArr", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_02200[] = {
2011-08-31 13:47:19 +08:00
{0, { {"forall", 6} } }, {0, { {NULL, 0} } }, {0, { {"part", 4} } }, {0, { {"exist", 5} } },
{0, { {NULL, 0} } }, {0, { {"empty", 5} } }, {0, { {NULL, 0} } }, {0, { {"nabla", 5} } },
{0, { {"isin", 4} } }, {0, { {"notin", 5} } }, {0, { {NULL, 0} } }, {0, { {"ni", 2} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"prod", 4} } },
{0, { {NULL, 0} } }, {0, { {"sum", 3} } }, {0, { {"minus", 5} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"lowast", 6} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"radic", 5} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"prop", 4} } }, {0, { {"infin", 5} } }, {0, { {NULL, 0} } },
{0, { {"ang", 3} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"and", 3} } },
{0, { {"or", 2} } }, {0, { {"cap", 3} } }, {0, { {"cup", 3} } }, {0, { {"int", 3} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"there4", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"sim", 3} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_02240[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"cong", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"asymp", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"ne", 2} } }, {0, { {"equiv", 5} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"le", 2} } }, {0, { {"ge", 2} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_02280[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"sub", 3} } }, {0, { {"sup", 3} } },
{0, { {"nsub", 4} } }, {0, { {NULL, 0} } }, {0, { {"sube", 4} } }, {0, { {"supe", 4} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"oplus", 5} } }, {0, { {NULL, 0} } }, {0, { {"otimes", 6} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"perp", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_022C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"sdot", 4} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_02300[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"lceil", 5} } }, {0, { {"rceil", 5} } }, {0, { {"lfloor", 6} } }, {0, { {"rfloor", 6} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {"lang", 4} } }, {0, { {"rang", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_025C0[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"loz", 3} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
static const entity_stage3_row stage3_table_html4_02640[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"spades", 6} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"clubs", 5} } },
{0, { {NULL, 0} } }, {0, { {"hearts", 6} } }, {0, { {"diams", 5} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
/* end of stage 3 Tables for HTML 4.01 }}} */
/* {{{ Stage 2 Tables for HTML 4.01 */
static const entity_stage2_row stage2_table_html4_00000[] = {
stage3_table_html4_00000, empty_stage3_table, stage3_table_html4_00080, stage3_table_html4_000C0,
empty_stage3_table, stage3_table_html4_00140, stage3_table_html4_00180, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, stage3_table_html4_002C0,
empty_stage3_table, empty_stage3_table, stage3_table_html4_00380, stage3_table_html4_003C0,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
};
static const entity_stage2_row stage2_table_html4_02000[] = {
stage3_table_html4_02000, stage3_table_html4_02040, stage3_table_html4_02080, empty_stage3_table,
stage3_table_html4_02100, empty_stage3_table, stage3_table_html4_02180, stage3_table_html4_021C0,
stage3_table_html4_02200, stage3_table_html4_02240, stage3_table_html4_02280, stage3_table_html4_022C0,
stage3_table_html4_02300, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, stage3_table_html4_025C0,
empty_stage3_table, stage3_table_html4_02640, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
empty_stage3_table, empty_stage3_table, empty_stage3_table, empty_stage3_table,
};
/* end of stage 2 tables for HTML 4.01 }}} */
static const entity_stage1_row entity_ms_table_html4[] = {
stage2_table_html4_00000,
empty_stage2_table,
stage2_table_html4_02000,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
empty_stage2_table,
};
/* end of HTML 4.01 multi-stage table for codepoint -> entity }}} */
/* {{{ HTML 4.01 hash table for entity -> codepoint */
2011-08-31 13:47:19 +08:00
static const entity_cp_map ht_bucket_html4_000[] = { {"gt", 2, 0x0003E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_003[] = { {"Igrave", 6, 0x000CC, 0}, {"amp", 3, 0x00026, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_006[] = { {"oacute", 6, 0x000F3, 0}, {"Xi", 2, 0x0039E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_008[] = { {"uuml", 4, 0x000FC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_00B[] = { {"Alpha", 5, 0x00391, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_00E[] = { {"sim", 3, 0x0223C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_012[] = { {"kappa", 5, 0x003BA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_016[] = { {"lArr", 4, 0x021D0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_018[] = { {"and", 3, 0x02227, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_01B[] = { {"ang", 3, 0x02220, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_020[] = { {"copy", 4, 0x000A9, 0}, {"Iacute", 6, 0x000CD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_023[] = { {"igrave", 6, 0x000EC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_026[] = { {"xi", 2, 0x003BE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_027[] = { {"Acirc", 5, 0x000C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_02B[] = { {"Ecirc", 5, 0x000CA, 0}, {"alpha", 5, 0x003B1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_02C[] = { {"hearts", 6, 0x02665, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_02F[] = { {"Icirc", 5, 0x000CE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_030[] = { {"Yacute", 6, 0x000DD, 0}, {"int", 3, 0x0222B, 0}, {"rlm", 3, 0x0200F, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_034[] = { {"empty", 5, 0x02205, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_036[] = { {"larr", 4, 0x02190, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_03B[] = { {"Ucirc", 5, 0x000DB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_03C[] = { {"oline", 5, 0x0203E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_040[] = { {"iacute", 6, 0x000ED, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_046[] = { {"middot", 6, 0x000B7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_047[] = { {"acirc", 5, 0x000E2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_04B[] = { {"ecirc", 5, 0x000EA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_04F[] = { {"icirc", 5, 0x000EE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_050[] = { {"yacute", 6, 0x000FD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_051[] = { {"minus", 5, 0x02212, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_054[] = { {"Auml", 4, 0x000C4, 0}, {"thetasym", 8, 0x003D1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_056[] = { {"Sigma", 5, 0x003A3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_059[] = { {"lsquo", 5, 0x02018, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_05B[] = { {"ucirc", 5, 0x000FB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_05C[] = { {"rArr", 4, 0x021D2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_064[] = { {"brvbar", 6, 0x000A6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_067[] = { {"AElig", 5, 0x000C6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_069[] = { {"Ccedil", 6, 0x000C7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_071[] = { {"Psi", 3, 0x003A8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_072[] = { {"exist", 5, 0x02203, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_074[] = { {"auml", 4, 0x000E4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_076[] = { {"sigma", 5, 0x003C3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_078[] = { {"isin", 4, 0x02208, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_07C[] = { {"rarr", 4, 0x02192, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_089[] = { {"ccedil", 6, 0x000E7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_08D[] = { {"raquo", 5, 0x000BB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_08E[] = { {"Omega", 5, 0x003A9, 0}, {"zwnj", 4, 0x0200C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_091[] = { {"psi", 3, 0x003C8, 0}, {"there4", 6, 0x02234, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_092[] = { {"hArr", 4, 0x021D4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_096[] = { {"le", 2, 0x02264, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_098[] = { {"Atilde", 6, 0x000C3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_099[] = { {"Zeta", 4, 0x00396, 0}, {"infin", 5, 0x0221E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_09D[] = { {"frasl", 5, 0x02044, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0A0[] = { {"euro", 4, 0x020AC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0A5[] = { {"lt", 2, 0x0003C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0A7[] = { {"aelig", 5, 0x000E6, 0}, {"Mu", 2, 0x0039C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0A8[] = { {"macr", 4, 0x000AF, 0}, {"image", 5, 0x02111, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0AA[] = { {"ldquo", 5, 0x0201C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0AE[] = { {"omega", 5, 0x003C9, 0}, {"upsih", 5, 0x003D2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0B0[] = { {"THORN", 5, 0x000DE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0B2[] = { {"Iota", 4, 0x00399, 0}, {"harr", 4, 0x02194, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0B4[] = { {"bull", 4, 0x02022, 0}, {"rceil", 5, 0x02309, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0B8[] = { {"atilde", 6, 0x000E3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0B9[] = { {"zeta", 4, 0x003B6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0BA[] = { {"emsp", 4, 0x02003, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0BC[] = { {"perp", 4, 0x022A5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0C2[] = { {"Prime", 5, 0x02033, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0C4[] = { {"frac12", 6, 0x000BD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0C5[] = { {"Ntilde", 6, 0x000D1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0C6[] = { {"frac14", 6, 0x000BC, 0}, {"circ", 4, 0x002C6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0C7[] = { {"mu", 2, 0x003BC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0C8[] = { {"Gamma", 5, 0x00393, 0}, {"Nu", 2, 0x0039D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0CE[] = { {"fnof", 4, 0x00192, 0}, {"quot", 4, 0x00022, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0D2[] = { {"iota", 4, 0x003B9, 0}, {"mdash", 5, 0x02014, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0D8[] = { {"ne", 2, 0x02260, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0DB[] = { {"Theta", 5, 0x00398, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0DC[] = { {"ni", 2, 0x0220B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0E2[] = { {"prime", 5, 0x02032, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0E5[] = { {"ntilde", 6, 0x000F1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0E6[] = { {"Lambda", 6, 0x0039B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0E8[] = { {"gamma", 5, 0x003B3, 0}, {"nu", 2, 0x003BD, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0EB[] = { {"pound", 5, 0x000A3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0EE[] = { {"permil", 6, 0x02030, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0F9[] = { {"cap", 3, 0x02229, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0FA[] = { {"iexcl", 5, 0x000A1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0FB[] = { {"Agrave", 6, 0x000C0, 0}, {"theta", 5, 0x003B8, 0}, {"ensp", 4, 0x02002, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0FE[] = { {"Pi", 2, 0x003A0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_0FF[] = { {"crarr", 5, 0x021B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_100[] = { {"iquest", 6, 0x000BF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_105[] = { {"forall", 6, 0x02200, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_106[] = { {"Phi", 3, 0x003A6, 0}, {"lambda", 6, 0x003BB, 0}, {"or", 2, 0x02228, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_108[] = { {"frac34", 6, 0x000BE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_10D[] = { {"notin", 5, 0x02209, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_10E[] = { {"dArr", 4, 0x021D3, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_10F[] = { {"Dagger", 6, 0x02021, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_111[] = { {"yen", 3, 0x000A5, 0}, {"weierp", 6, 0x02118, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_113[] = { {"uml", 3, 0x000A8, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_117[] = { {"tilde", 5, 0x002DC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_118[] = { {"Aacute", 6, 0x000C1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_11A[] = { {"loz", 3, 0x025CA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_11B[] = { {"agrave", 6, 0x000E0, 0}, {"thinsp", 6, 0x02009, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_11E[] = { {"pi", 2, 0x003C0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_11F[] = { {"micro", 5, 0x000B5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_125[] = { {"spades", 6, 0x02660, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_126[] = { {"phi", 3, 0x003C6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_12E[] = { {"darr", 4, 0x02193, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_12F[] = { {"Oslash", 6, 0x000D8, 0}, {"Tau", 3, 0x003A4, 0}, {"dagger", 6, 0x02020, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_135[] = { {"Ocirc", 5, 0x000D4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_136[] = { {"alefsym", 7, 0x02135, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_138[] = { {"aacute", 6, 0x000E1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_13A[] = { {"divide", 6, 0x000F7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_13F[] = { {"sdot", 4, 0x022C5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_143[] = { {"reg", 3, 0x000AE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_149[] = { {"real", 4, 0x0211C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_14B[] = { {"Scaron", 6, 0x00160, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_14F[] = { {"cent", 4, 0x000A2, 0}, {"oslash", 6, 0x000F8, 0}, {"tau", 3, 0x003C4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_150[] = { {"thorn", 5, 0x000FE, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_153[] = { {"ndash", 5, 0x02013, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_154[] = { {"piv", 3, 0x003D6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_155[] = { {"ocirc", 5, 0x000F4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_156[] = { {"Aring", 5, 0x000C5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_158[] = { {"nbsp", 4, 0x000A0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_15C[] = { {"Iuml", 4, 0x000CF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_15F[] = { {"rsquo", 5, 0x02019, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_160[] = { {"rsaquo", 6, 0x0203A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_163[] = { {"hellip", 6, 0x02026, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_166[] = { {"Otilde", 6, 0x000D5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_16B[] = { {"scaron", 6, 0x00161, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_16C[] = { {"Yuml", 4, 0x00178, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_16E[] = { {"sup1", 4, 0x000B9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_16F[] = { {"sup2", 4, 0x000B2, 0}, {"Delta", 5, 0x00394, 0}, {"sbquo", 5, 0x0201A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_170[] = { {"sup3", 4, 0x000B3, 0}, {"lrm", 3, 0x0200E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_173[] = { {"diams", 5, 0x02666, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_175[] = { {"OElig", 5, 0x00152, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_176[] = { {"aring", 5, 0x000E5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_178[] = { {"oplus", 5, 0x02295, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_17C[] = { {"iuml", 4, 0x000EF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_17F[] = { {"Egrave", 6, 0x000C8, 0}, {"uArr", 4, 0x021D1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_181[] = { {"Beta", 4, 0x00392, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_183[] = { {"nabla", 5, 0x02207, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_186[] = { {"ETH", 3, 0x000D0, 0}, {"otilde", 6, 0x000F5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_187[] = { {"laquo", 5, 0x000AB, 0}, {"times", 5, 0x000D7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_18C[] = { {"yuml", 4, 0x000FF, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_18D[] = { {"cup", 3, 0x0222A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_18E[] = { {"Rho", 3, 0x003A1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_18F[] = { {"Ugrave", 6, 0x000D9, 0}, {"delta", 5, 0x003B4, 0}, {"equiv", 5, 0x02261, 0}, {"sub", 3, 0x02282, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_194[] = { {"curren", 6, 0x000A4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_196[] = { {"not", 3, 0x000AC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_197[] = { {"acute", 5, 0x000B4, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_19A[] = { {"prod", 4, 0x0220F, 0}, {"sum", 3, 0x02211, 0}, {"lsaquo", 6, 0x02039, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_19C[] = { {"Eacute", 6, 0x000C9, 0}, {"Omicron", 7, 0x0039F, 0}, {"sigmaf", 6, 0x003C2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_19D[] = { {"sup", 3, 0x02283, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_19F[] = { {"egrave", 6, 0x000E8, 0}, {"uarr", 4, 0x02191, 0}, {"lowast", 6, 0x02217, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1A0[] = { {"zwj", 3, 0x0200D, 0}, {"bdquo", 5, 0x0201E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1A1[] = { {"beta", 4, 0x003B2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1A2[] = { {"Ouml", 4, 0x000D6, 0}, {"supe", 4, 0x02287, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1A4[] = { {"plusmn", 6, 0x000B1, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1A6[] = { {"cedil", 5, 0x000B8, 0}, {"prop", 4, 0x0221D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1A7[] = { {"lang", 4, 0x02329, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1A8[] = { {"radic", 5, 0x0221A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1A9[] = { {"para", 4, 0x000B6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1AC[] = { {"Uacute", 6, 0x000DA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1AE[] = { {"szlig", 5, 0x000DF, 0}, {"rho", 3, 0x003C1, 0}, {"lceil", 5, 0x02308, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1AF[] = { {"ugrave", 6, 0x000F9, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1B0[] = { {"rdquo", 5, 0x0201D, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1B5[] = { {"deg", 3, 0x000B0, 0}, {"trade", 5, 0x02122, 0}, {"oelig", 5, 0x00153, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1B9[] = { {"Chi", 3, 0x003A7, 0}, {"rfloor", 6, 0x0230B, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1BC[] = { {"eacute", 6, 0x000E9, 0}, {"omicron", 7, 0x003BF, 0}, {"part", 4, 0x02202, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1BE[] = { {"clubs", 5, 0x02663, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1BF[] = { {"Epsilon", 7, 0x00395, 0}, {"Eta", 3, 0x00397, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1C2[] = { {"ouml", 4, 0x000F6, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1C4[] = { {"#039", 4, 0x00027, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1C9[] = { {"Ograve", 6, 0x000D2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1CC[] = { {"uacute", 6, 0x000FA, 0}, {"cong", 4, 0x02245, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1CF[] = { {"Upsilon", 7, 0x003A5, 0}, {"asymp", 5, 0x02248, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1D0[] = { {"ordf", 4, 0x000AA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1D4[] = { {"sube", 4, 0x02286, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1D7[] = { {"ordm", 4, 0x000BA, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1D8[] = { {"Euml", 4, 0x000CB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1D9[] = { {"chi", 3, 0x003C7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1DD[] = { {"nsub", 4, 0x02284, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1DF[] = { {"epsilon", 7, 0x003B5, 0}, {"eta", 3, 0x003B7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1E6[] = { {"Oacute", 6, 0x000D3, 0}, {"eth", 3, 0x000F0, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1E8[] = { {"Uuml", 4, 0x000DC, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1E9[] = { {"ograve", 6, 0x000F2, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1ED[] = { {"rang", 4, 0x0232A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1EF[] = { {"upsilon", 7, 0x003C5, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1F1[] = { {"ge", 2, 0x02265, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1F2[] = { {"Kappa", 5, 0x0039A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1F3[] = { {"lfloor", 6, 0x0230A, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1F4[] = { {"sect", 4, 0x000A7, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1F6[] = { {"otimes", 6, 0x02297, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1F8[] = { {"euml", 4, 0x000EB, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_html4_1F9[] = { {"shy", 3, 0x000AD, 0}, {NULL, 0, 0, 0} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
static const entity_cp_map *const ht_buckets_html4[] = {
ht_bucket_html4_000, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_003,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_006, ht_bucket_empty,
ht_bucket_html4_008, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_00B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_00E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_012, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_016, ht_bucket_empty,
ht_bucket_html4_018, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_01B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_020, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_023,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_026, ht_bucket_html4_027,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_02B,
ht_bucket_html4_02C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_02F,
ht_bucket_html4_030, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_034, ht_bucket_empty, ht_bucket_html4_036, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_03B,
ht_bucket_html4_03C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_040, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_046, ht_bucket_html4_047,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_04B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_04F,
ht_bucket_html4_050, ht_bucket_html4_051, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_054, ht_bucket_empty, ht_bucket_html4_056, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_059, ht_bucket_empty, ht_bucket_html4_05B,
ht_bucket_html4_05C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_064, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_067,
ht_bucket_empty, ht_bucket_html4_069, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_071, ht_bucket_html4_072, ht_bucket_empty,
ht_bucket_html4_074, ht_bucket_empty, ht_bucket_html4_076, ht_bucket_empty,
ht_bucket_html4_078, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_07C, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_089, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_08D, ht_bucket_html4_08E, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_091, ht_bucket_html4_092, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_096, ht_bucket_empty,
ht_bucket_html4_098, ht_bucket_html4_099, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_09D, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_0A0, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_0A5, ht_bucket_empty, ht_bucket_html4_0A7,
ht_bucket_html4_0A8, ht_bucket_empty, ht_bucket_html4_0AA, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_0AE, ht_bucket_empty,
ht_bucket_html4_0B0, ht_bucket_empty, ht_bucket_html4_0B2, ht_bucket_empty,
ht_bucket_html4_0B4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_0B8, ht_bucket_html4_0B9, ht_bucket_html4_0BA, ht_bucket_empty,
ht_bucket_html4_0BC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_0C2, ht_bucket_empty,
ht_bucket_html4_0C4, ht_bucket_html4_0C5, ht_bucket_html4_0C6, ht_bucket_html4_0C7,
ht_bucket_html4_0C8, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_0CE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_0D2, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_0D8, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_0DB,
ht_bucket_html4_0DC, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_0E2, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_0E5, ht_bucket_html4_0E6, ht_bucket_empty,
ht_bucket_html4_0E8, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_0EB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_0EE, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_0F9, ht_bucket_html4_0FA, ht_bucket_html4_0FB,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_0FE, ht_bucket_html4_0FF,
ht_bucket_html4_100, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_105, ht_bucket_html4_106, ht_bucket_empty,
ht_bucket_html4_108, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_10D, ht_bucket_html4_10E, ht_bucket_html4_10F,
ht_bucket_empty, ht_bucket_html4_111, ht_bucket_empty, ht_bucket_html4_113,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_117,
ht_bucket_html4_118, ht_bucket_empty, ht_bucket_html4_11A, ht_bucket_html4_11B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_11E, ht_bucket_html4_11F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_125, ht_bucket_html4_126, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_12E, ht_bucket_html4_12F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_135, ht_bucket_html4_136, ht_bucket_empty,
ht_bucket_html4_138, ht_bucket_empty, ht_bucket_html4_13A, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_13F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_143,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_149, ht_bucket_empty, ht_bucket_html4_14B,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_14F,
ht_bucket_html4_150, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_153,
ht_bucket_html4_154, ht_bucket_html4_155, ht_bucket_html4_156, ht_bucket_empty,
ht_bucket_html4_158, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_15C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_15F,
ht_bucket_html4_160, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_163,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_166, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_16B,
ht_bucket_html4_16C, ht_bucket_empty, ht_bucket_html4_16E, ht_bucket_html4_16F,
ht_bucket_html4_170, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_173,
ht_bucket_empty, ht_bucket_html4_175, ht_bucket_html4_176, ht_bucket_empty,
ht_bucket_html4_178, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_17C, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_17F,
ht_bucket_empty, ht_bucket_html4_181, ht_bucket_empty, ht_bucket_html4_183,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_186, ht_bucket_html4_187,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_18C, ht_bucket_html4_18D, ht_bucket_html4_18E, ht_bucket_html4_18F,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_194, ht_bucket_empty, ht_bucket_html4_196, ht_bucket_html4_197,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_19A, ht_bucket_empty,
ht_bucket_html4_19C, ht_bucket_html4_19D, ht_bucket_empty, ht_bucket_html4_19F,
ht_bucket_html4_1A0, ht_bucket_html4_1A1, ht_bucket_html4_1A2, ht_bucket_empty,
ht_bucket_html4_1A4, ht_bucket_empty, ht_bucket_html4_1A6, ht_bucket_html4_1A7,
ht_bucket_html4_1A8, ht_bucket_html4_1A9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_1AC, ht_bucket_empty, ht_bucket_html4_1AE, ht_bucket_html4_1AF,
ht_bucket_html4_1B0, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_1B5, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_1B9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_1BC, ht_bucket_empty, ht_bucket_html4_1BE, ht_bucket_html4_1BF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_1C2, ht_bucket_empty,
ht_bucket_html4_1C4, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_1C9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_1CC, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_1CF,
ht_bucket_html4_1D0, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_html4_1D4, ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_1D7,
ht_bucket_html4_1D8, ht_bucket_html4_1D9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_1DD, ht_bucket_empty, ht_bucket_html4_1DF,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_html4_1E6, ht_bucket_empty,
ht_bucket_html4_1E8, ht_bucket_html4_1E9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_html4_1ED, ht_bucket_empty, ht_bucket_html4_1EF,
ht_bucket_empty, ht_bucket_html4_1F1, ht_bucket_html4_1F2, ht_bucket_html4_1F3,
ht_bucket_html4_1F4, ht_bucket_empty, ht_bucket_html4_1F6, ht_bucket_empty,
ht_bucket_html4_1F8, ht_bucket_html4_1F9, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
};
static const entity_ht ent_ht_html4 = {
0x200,
ht_buckets_html4
};
/* end of HTML 4.01 hash table for entity -> codepoint }}} */
/* {{{ Start of Basic entities (no apos) table for codepoint -> entity */
static const entity_stage3_row stage3_table_be_noapos_00000[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"quot", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"amp", 3} } }, {0, { {"#039", 4} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"lt", 2} } }, {0, { {NULL, 0} } }, {0, { {"gt", 2} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
/* {{{ Basic entities (no apos) hash table for entity -> codepoint */
2011-08-31 13:47:19 +08:00
static const entity_cp_map ht_bucket_be_noapos_000[] = { {"gt", 2, 0x0003E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_be_noapos_003[] = { {"amp", 3, 0x00026, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_be_noapos_004[] = { {"#039", 4, 0x00027, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_be_noapos_005[] = { {"lt", 2, 0x0003C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_be_noapos_00E[] = { {"quot", 4, 0x00022, 0}, {NULL, 0, 0, 0} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
static const entity_cp_map *const ht_buckets_be_noapos[] = {
ht_bucket_be_noapos_000, ht_bucket_empty, ht_bucket_empty, ht_bucket_be_noapos_003,
ht_bucket_be_noapos_004, ht_bucket_be_noapos_005, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_be_noapos_00E, ht_bucket_empty,
};
static const entity_ht ent_ht_be_noapos = {
0x10,
ht_buckets_be_noapos
};
/* end of Basic entities (no apos) hash table for entity -> codepoint }}} */
/* {{{ Start of Basic entities (with apos) table for codepoint -> entity */
static const entity_stage3_row stage3_table_be_apos_00000[] = {
2011-08-31 13:47:19 +08:00
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"quot", 4} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {"amp", 3} } }, {0, { {"apos", 4} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } }, {0, { {NULL, 0} } },
{0, { {"lt", 2} } }, {0, { {NULL, 0} } }, {0, { {"gt", 2} } }, {0, { {NULL, 0} } },
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
};
/* {{{ Basic entities (with apos) hash table for entity -> codepoint */
2011-08-31 13:47:19 +08:00
static const entity_cp_map ht_bucket_be_apos_000[] = { {"gt", 2, 0x0003E, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_be_apos_003[] = { {"amp", 3, 0x00026, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_be_apos_005[] = { {"lt", 2, 0x0003C, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_be_apos_008[] = { {"apos", 4, 0x00027, 0}, {NULL, 0, 0, 0} };
static const entity_cp_map ht_bucket_be_apos_00E[] = { {"quot", 4, 0x00022, 0}, {NULL, 0, 0, 0} };
- Completed rewrite of html.c. Except for determine_charset, almost nothing remains. - Fixed bug on determine_charset that was preventing correct detection in combination with internal mbstring encoding "none", "pass" or "auto". - Added profiles for entity encode/decode for HTMl 4.01, XHTML 1.0, XML 1.0 and HTML 5. Added the constants ENT_HTML401, ENT_XML1, ENT_XHTML and ENT_HTML5. - htmlentities()/htmlspecialchars(), when told not to double encode, verify the correctness of the existenting entities more thoroughly. It is checked whether the numerical entity represents a valid unicode code point (number is between 0 and 0x10FFFF). If using the flag ENT_DISALLOWED, it is also checked whether that numerical entity is valid in selected document. In HTML 4.01, all the numerical entities that represent a Unicode code point (< U+10FFFFFF) are valid, but that's not the case with other document types. If the entity is not valid, & is encoded to &amp;. For named entities, the check is also more thorough. While before the only check would be to determine if the entity was constituted by alphanumeric characters, now it is checked whether that entity is necessarily defined for the target document type. Otherwise, & is encoded to &amp;. - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded. But there is in this case one additional check. Entities that represent non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded. - The hash tables lazily created for decoding in html_entity_decode() that were added recently were substituted by static hash tables. Instead of 1 hash table per encoding, there's only one hash table per document type defined in terms of unicode code points. This means that for charsets other than UTF-8 and ISO-8859-1, a conversion to unicode code points is necessary before decoding. - On the encoding side, the ad hoc ranges of entities of the translation tables, which mapped (in general) non-unicode code points to HTML entities were replaced by three-stage tables for HTML 4 and HTML 5. This mapping tables are defined only in terms of unicode code points, so a conversion is necessary for charsets other than UTF-8 and ISO-8859-1. Even so, the multi-stage table is much faster than the previous method, by a factor of 5; the conversion to unicode is a small penalty because it's just a simple table lookup. XML 1.0/htmlspecialchars() uses a simple table instead of a three-stage table. - Added the flag ENT_SUBSTITUTE, which makes htmlentities()/htmlspecialchars() replace the invalid multibyte sequences with U+FFFD (UTF-8) or &#FFFD; (other encodings). - Added the flag ENT_DISALLOWED. Implements FR #52860. Characters that cannot appear literally are replaced by U+FFFD (UTF-8) or &#FFFD; (otherwise). An alternative implementation would be to encode those characters into numerical entities, but that would only work in HTML 4.01 due to limitations on the values of numerical entities in other document types. See also the effects on htmlentities()/htmlspecialchars() with !double_encode above.
2010-10-24 23:01:02 +08:00
static const entity_cp_map *const ht_buckets_be_apos[] = {
ht_bucket_be_apos_000, ht_bucket_empty, ht_bucket_empty, ht_bucket_be_apos_003,
ht_bucket_empty, ht_bucket_be_apos_005, ht_bucket_empty, ht_bucket_empty,
ht_bucket_be_apos_008, ht_bucket_empty, ht_bucket_empty, ht_bucket_empty,
ht_bucket_empty, ht_bucket_empty, ht_bucket_be_apos_00E, ht_bucket_empty,
};
static const entity_ht ent_ht_be_apos = {
0x10,
ht_buckets_be_apos
};
/* end of Basic entities (with apos) hash table for entity -> codepoint }}} */
- Revamp of the decoding portion of html.c. - Dramatic improvements on the performance of html_entity_decode and htmlspecialchars_decode, as the string is now traversed only once. Speedups of 20 to 25 times with Windows release builds and a ~250 characters string (for 2nd and subsequent calls). - Consistent behavior on html_entity_decode. For instance, the entity in "&&lt;" would be decoded, but not "&&#233;". Not anymore. The code path for "basic" and non-basic entities is now mostly shared. - Code of html_entity_decode and htmlspecialchars_decode is now shared. - [DOC] More consistent behavior of htmlspecialchars_decode. Instead of translating only &lt;, &gt;, &amp;, &quot;, &#039; and &#39;, now e.g. &#34;, &apos;, &#0039;, &#x27;, etc. are also decoded. - [DOC] Previous translation of unicode code points in numerical entities was seriously broken. When the code points for some character were not the same in unicode and the target encoding, the behavior could be an erroneous translation (e.g. 0x80-0xA0 in win-1252) or no translation at all. Added unicode translation tables for all single-byte encodings. Entities are not translated for multi-byte entities, except for ASCII characters whose code points are shared. We could add the huge translation tables (several thousand elements) for those encodings in the future. - Fixed numerical entities that after # had text accepted by strcol being accepted. - Much more commented and well-structured code... - Tests for get_html_translation_table()) are broken. I stared fixing the tests, but then I realized it was completely helpless because get_html_translation_table() is broken by not handling multi-byte characters correctly.
2010-10-11 03:04:59 +08:00
#endif /* HTML_TABLES_H */