Improve detection accuracy of mb_detect_encoding

Originally, `mb_detect_encoding` essentially just checked all candidate
encodings to see which ones the input string was valid in. However, it
could only do this for a small subset of the text encodings which are
officially supported by mbstring.

In 3e7acf901d, I modified it so it could 'detect' any text encoding
supported by mbstring. While this is arguably an improvement, if the
only text encodings one is interested in are those which
`mb_detect_encoding` could originally handle, the old
`mb_detect_encoding` may have been preferable. Because the new one has
more possible encodings which it can guess, it also has more chances to
get the answer wrong.

This commit adjusts the detection heuristics to provide accurate
detection in a wider variety of scenarios. While the previous detection
code would frequently confuse UTF-32BE with UTF-32LE or UTF-16BE with
UTF-16LE, the adjusted code is extremely accurate in those cases.
Detection for Chinese text in Chinese encodings like GB18030 or BIG5
and for Japanese text in Japanese encodings like EUC-JP or SJIS is
greatly improved. Detection of UTF-7 is also greatly improved. An 8KB
table, with one bit for each codepoint from U+0000 up to U+FFFF, is
used to achieve this.
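
As a rough illustration of how such a table can work (a sketch only, not
the generated `rare_cp_bitvec.h`): 65,536 codepoints at one bit each pack
into 2,048 32-bit words, which is 8,192 bytes. The names `demo_bitvec`,
`demo_init`, `demo_mark_common` and `demo_is_rare` are hypothetical; only
the indexing scheme (`c >> 5` selects the word, `c & 0x1F` selects the
bit) mirrors what the patch does.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical demo table: one bit per codepoint U+0000..U+FFFF, 1 = 'rare'. */
#define DEMO_WORDS (0x10000 / 32) /* 2048 words * 4 bytes = 8192 bytes */
static uint32_t demo_bitvec[DEMO_WORDS];

/* Start with every codepoint marked 'rare'. */
static void demo_init(void)
{
    memset(demo_bitvec, 0xFF, sizeof(demo_bitvec));
}

/* Clear the 'rare' bit for every codepoint in [start, end]. */
static void demo_mark_common(uint32_t start, uint32_t end)
{
    for (uint32_t c = start; c <= end && c <= 0xFFFF; c++) {
        demo_bitvec[c >> 5] &= ~(UINT32_C(1) << (c & 0x1F));
    }
}

/* Return 1 if the codepoint is still marked 'rare'.
 * The caller must ensure c <= 0xFFFF. */
static int demo_is_rare(uint32_t c)
{
    return (demo_bitvec[c >> 5] >> (c & 0x1F)) & 1;
}
```

Starting from "everything is rare" and clearing bits for known-common
ranges keeps the default conservative: any codepoint nobody listed
counts against the candidate encoding.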

One significant constraint is that the heuristics are completely based
on looking at each codepoint in a string in isolation, treating some
codepoints as 'likely' and others as 'unlikely'. It might still be
possible to achieve great gains in detection accuracy by looking at
sequences of codepoints rather than individual codepoints. However,
this might require huge tables. Further, we might need a huge corpus
of text in various languages to derive those tables.
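
To put a rough number on "huge", here is a back-of-the-envelope sketch.
It assumes a naive dense table with one bit per ordered sequence of
codepoints in the U+0000..U+FFFF range; a real design would more likely
use something sparse, so treat this as an upper bound for that naive
layout. `demo_table_bytes` is a hypothetical helper.

```c
#include <stdint.h>

/* Bytes needed for a dense table with one bit per length-n sequence of
 * codepoints in U+0000..U+FFFF. Only meaningful for 1 <= n <= 3; larger
 * n overflows 64 bits. */
static uint64_t demo_table_bytes(unsigned n)
{
    uint64_t bits = 1;
    for (unsigned i = 0; i < n; i++) {
        bits *= UINT64_C(0x10000);
    }
    return bits / 8;
}
```

For single codepoints this gives the 8KB used here; for pairs it gives
2^29 bytes, or 512 MiB.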

Accuracy is still dismal when trying to distinguish single-byte
encodings like ISO-8859-1, ISO-8859-2, KOI8-R, and so on. This is
because the valid bytes in these encodings are basically all the same,
and all valid bytes decode to 'likely' codepoints, so our method of
detection (which is based on rating codepoints as likely or unlikely)
cannot tell any difference between the candidates at all. It just
selects the first encoding in the provided list of candidates.

Speaking of which, to get good results from `mb_detect_encoding`, it
is important to order the list of candidate encodings according to
your prior belief about which are more likely to be correct. When the
function cannot tell any difference between two candidates, it returns
whichever appeared earlier in the array.
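
The tie-breaking behaviour can be sketched as follows. This is a
simplified model, not the actual mbfl selection code (which also has to
account for invalid input); `struct demo_candidate` and `demo_pick_best`
are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical per-candidate result: total demerits for one encoding. */
struct demo_candidate {
    const char *name;
    size_t demerits;
};

/* Return the index of the candidate with the fewest demerits (n >= 1).
 * The strict '<' means a later candidate only wins if it is strictly
 * better, so ties go to whichever appeared earlier in the list. */
static size_t demo_pick_best(const struct demo_candidate *cands, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (cands[i].demerits < cands[best].demerits) {
            best = i;
        }
    }
    return best;
}
```

With single-byte encodings that all score the same, the result is
decided purely by the caller's ordering, which is why the order of the
candidate list matters.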
This commit is contained in:
Alex Dowad 2021-10-09 23:11:53 +02:00
parent e9cf14e89c
commit 28b346bc06
5 changed files with 7555 additions and 68 deletions

File diff suppressed because it is too large


@@ -0,0 +1,61 @@
#!/usr/bin/env php
<?php
if ($argc < 2) {
echo "Usage: php gen_rare_cp_bitvec.php ./common_codepoints.txt\n";
return;
}
$bitvec = array_fill(0, intdiv(0x10000, 32), 0xFFFFFFFF); // 2048 words, one bit per codepoint
$input = file_get_contents($argv[1]);
foreach (explode("\n", $input) as $line) {
if (false !== $hashPos = strpos($line, '#')) {
$line = substr($line, 0, $hashPos);
}
$line = trim($line);
if ($line === '') {
continue;
}
$range = explode("\t", $line);
$start = hexdec($range[0]);
$end = hexdec($range[1]);
for ($i = $start; $i <= $end; $i++) {
$bitvec[$i >> 5] &= ~(1 << ($i & 0x1F));
}
}
$result = <<<'HEADER'
/* Machine-generated file; do not edit! See gen_rare_cp_bitvec.php.
*
* The below array has one bit for each Unicode codepoint from U+0000 to U+FFFF.
* The bit is 1 if the codepoint is considered 'rare' for the purpose of
* guessing the text encoding of a string.
*
* Each 'rare' codepoint which appears in a string when it is interpreted
* using a candidate encoding causes the candidate encoding to be treated
* as less likely to be the correct one.
*/
static uint32_t rare_codepoint_bitvec[] = {
HEADER;
for ($i = 0; $i < intdiv(0x10000, 32); $i++) {
if ($i % 8 === 0) {
$result .= "\n";
} else {
$result .= " ";
}
$result .= "0x" . str_pad(dechex($bitvec[$i]), 8, '0', STR_PAD_LEFT) . ",";
}
$result .= "\n};\n";
file_put_contents(__DIR__ . '/rare_cp_bitvec.h', $result);
echo "Done.\n";
?>


@@ -95,6 +95,7 @@
 #include "filters/mbfilter_utf8.h"
 #include "eaw_table.h"
+#include "rare_cp_bitvec.h"
 /* hex character table "0123456789ABCDEF" */
 static char mbfl_hexchar_table[] = {
@@ -285,26 +286,52 @@ size_t mbfl_buffer_illegalchars(mbfl_buffer_converter *convd)
 /*
 * encoding detector
 */
-static int mbfl_estimate_encoding_likelihood(int c, void *void_data)
+static int mbfl_estimate_encoding_likelihood(int input_cp, void *void_data)
 {
 mbfl_encoding_detector_data *data = void_data;
+unsigned int c = input_cp;
-/* Receive wchars decoded from test string using candidate encoding
-* If the test string was invalid in the candidate encoding, we assume
-* it's the wrong one. */
+/* Receive wchars decoded from input string using candidate encoding.
+* If the string was invalid in the candidate encoding, we assume
+* it's the wrong one. Otherwise, give the candidate many 'demerits'
+* for each 'rare' codepoint found, a smaller number for each ASCII
+* punctuation character, and 1 for all other codepoints.
+*
+* The 'common' codepoints should cover the vast majority of
+* codepoints we are likely to see in practice, while only covering
+* a small minority of the entire Unicode encoding space. Why?
+* Well, if the test string happens to be valid in an incorrect
+* candidate encoding, the bogus codepoints which it decodes to will
+* be more or less random. By treating the majority of codepoints as
+* 'rare', we ensure that in almost all such cases, the bogus
+* codepoints will include plenty of 'rares', thus giving the
+* incorrect candidate encoding lots of demerits. See
+* common_codepoints.txt for the actual list used.
+*
+* So, why give extra demerits for ASCII punctuation characters? It's
+* because there are some text encodings, like UTF-7, HZ, and ISO-2022,
+* which deliberately only use bytes in the ASCII range. When
+* misinterpreted as ASCII/UTF-8, strings in these encodings will
+* have an unusually high number of ASCII punctuation characters.
+* So giving extra demerits for such characters will improve
+* detection accuracy for UTF-7 and similar encodings.
+*
+* Finally, why 1 demerit for all other characters? That penalizes
+* long strings, meaning we will tend to choose a candidate encoding
+* in which the test string decodes to a smaller number of
+* codepoints. That prevents single-byte encodings in which almost
+* every possible input byte decodes to a 'common' codepoint from
+* being favored too much. */
 if (c == MBFL_BAD_INPUT) {
 data->num_illegalchars++;
-} else if (c < 0x9 || (c >= 0xE && c <= 0x1F) || (c >= 0xE000 && c <= 0xF8FF) || c >= 0xF0000) {
-/* Otherwise, count how many control characters and 'private use'
-* codepoints we see. Those are rarely used and may indicate that
-* the candidate encoding is not the right one. */
-data->score += 10;
-} else if ((c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) || (c >= 0x5B && c <= 0x60)) {
-/* Punctuation is also less common than letters/digits; further, if
-* text in ISO-2022 or similar encodings is mistakenly identified as
-* ASCII or UTF-8, the misinterpreted string will tend to have an
-* unusually high density of ASCII punctuation characters. */
-data->score++;
+} else if (c > 0xFFFF) {
+data->score += 40;
+} else if (c >= 0x21 && c <= 0x2F) {
+data->score += 6;
+} else if ((rare_codepoint_bitvec[c >> 5] >> (c & 0x1F)) & 1) {
+data->score += 30;
+} else {
+data->score += 1;
 }
 return 0;
 }
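
A worked example of the new weighting may help. This is a standalone
sketch, not the mbfl code: it scores an array of already-decoded
codepoints, omits the `MBFL_BAD_INPUT` branch, and replaces the
generated bit vector with a hypothetical stub, `demo_is_rare`, that
treats only printable ASCII and hiragana/katakana as 'common'. Only the
weights (40, 6, 30, 1) and the order of the checks are taken from the
patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for rare_codepoint_bitvec: treats only printable
 * ASCII and U+3040..U+30FF (hiragana/katakana) as 'common'. */
static int demo_is_rare(uint32_t c)
{
    if (c >= 0x20 && c <= 0x7E) return 0;
    if (c >= 0x3040 && c <= 0x30FF) return 0;
    return 1;
}

/* Sum demerits over decoded codepoints using the weights from the patch. */
static unsigned demo_score(const uint32_t *cps, size_t n)
{
    unsigned score = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = cps[i];
        if (c > 0xFFFF) {
            score += 40; /* outside the range covered by the table */
        } else if (c >= 0x21 && c <= 0x2F) {
            score += 6;  /* ASCII punctuation '!' through '/' */
        } else if (demo_is_rare(c)) {
            score += 30; /* 'rare' codepoint */
        } else {
            score += 1;  /* 'common' codepoint; still penalizes length */
        }
    }
    return score;
}
```

Under these assumptions, two hiragana codepoints score 2, while the
sequence U+0061, U+002B, U+E000, U+1F600 scores 1 + 6 + 30 + 40 = 77, so
a candidate encoding that decodes to the latter loses clearly.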


@@ -0,0 +1,269 @@
/* Machine-generated file; do not edit! See gen_rare_cp_bitvec.php.
*
* The below array has one bit for each Unicode codepoint from U+0000 to U+FFFF.
* The bit is 1 if the codepoint is considered 'rare' for the purpose of
* guessing the text encoding of a string.
*
* Each 'rare' codepoint which appears in a string when it is interpreted
* using a candidate encoding causes the candidate encoding to be treated
* as less likely to be the correct one.
*/
static uint32_t rare_codepoint_bitvec[] = {
0xffffd9ff, 0x00000000, 0x00000000, 0x80000000, 0xffffffff, 0x00002001, 0x00000000, 0x00000000,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xfffff800, 0xffffffff, 0xffffffff, 0x0300ffff, 0x0000280f, 0x00000004, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0x0001ffff, 0x00000000, 0x0000ff00, 0xffe07800,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0x00000001, 0x78000000, 0xfe000000, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0x0080f7ff, 0xbee37fb8, 0xffffffff, 0xffffffff, 0xfffffc00, 0xfdffefff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0x0000ffff, 0xffff0000, 0x0000ffff, 0x00000000, 0x00000000, 0x00000000,
0xfffbffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xfffffffe, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0x00000000, 0xffffffff, 0x00000001, 0x00000000, 0xe1800000, 0x00000000, 0x00000000, 0x00000000,
0xffffffff, 0x0001ffff, 0x00000000, 0x00000000, 0xffff8000, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0x00849074, 0x90ead309, 0x1c2204d7, 0xbff4ffbc, 0x22c924bb, 0x7bdb8409, 0xfc05131c, 0x573aabc7,
0x187c1bfd, 0xcaae7baa, 0x61d41fb7, 0x6fd76df6, 0xd7bf56a7, 0x7f9f2e14, 0x1e7e7be3, 0xbad51cfc,
0x78e955bf, 0xa9ffd1a5, 0xebdf2d7f, 0x5f0fffdf, 0xfcfddfde, 0xb753ccf7, 0xfbdd5fdf, 0xf5dfdf5f,
0x7efbfdea, 0x6bfbfbfd, 0xebe8c484, 0xc404845b, 0xe55bc89e, 0xf7126fca, 0xc7fd44ae, 0x50b45fac,
0xd079fa34, 0xd03eb0ce, 0xd7ad8373, 0x7fda7947, 0x17fdd877, 0x3de1f0f4, 0x14dbf555, 0x32dbff6b,
0xfd1ffe9c, 0x31fd3bf4, 0xea013874, 0x76447728, 0x7adfe58b, 0x77f39e9b, 0x742d80f3, 0xcc300082,
0x175408a5, 0xa485043e, 0xe65ef8b2, 0xc6d5fefb, 0xf98a27ae, 0xdef16a3f, 0x5f799c2c, 0xcbfb981d,
0xffeefd2f, 0xbf7d2f6e, 0x3ff77ba7, 0xeeffeb0d, 0xc8eff162, 0xd733a37f, 0x6ba3bf1f, 0xdff77ea5,
0x762f9cbe, 0x77a83dff, 0xef77bfff, 0xfcaf68da, 0xffd73f7f, 0xef6ffef7, 0xb3fdfbfb, 0x9be0dbf2,
0x7eb356b7, 0xbbf6de77, 0x137c1b7a, 0xff7f7bac, 0xfffbf7e3, 0xffffb7b3, 0x7fefb7f7, 0xf97efffe,
0xffddff9a, 0xffcc7bed, 0xba6debfc, 0xbddff5fb, 0x2d77fbd7, 0xbff7bcff, 0x3ed5eb7d, 0xdbb035ef,
0x6b9f5d6f, 0xb86d917a, 0xfb0c346b, 0xdca7ba5a, 0x9db3fa23, 0xb5ffbedf, 0xeb64d1bf, 0x77ffe69f,
0x66f7ff55, 0xefad7fde, 0xfbfffdbf, 0x7f0fdbbb, 0xfbfbffff, 0xdff3ffff, 0xffafedf9, 0xf7fffdb5,
0xffffe5ff, 0xfefefbfb, 0x484effff, 0xdedef5a7, 0x007f6455, 0x1d86b3a0, 0x17f2ef0b, 0x209ea419,
0xdaad1029, 0x086def6f, 0xafdd1098, 0x7dfd6fc8, 0xf36cfff9, 0xfef7cdfd, 0xfdffffff, 0xd7bedebc,
0xf72fbf7f, 0xdfffddff, 0xfffee7ff, 0xfffffdff, 0xffffffef, 0xebffffff, 0x9febdfdf, 0xb761b411,
0xc29eee91, 0xde36171f, 0x7fcfefde, 0x00a3f7f1, 0x29467b25, 0xfe1fd73f, 0x5bb7f8f1, 0x7b37eff2,
0x76467be0, 0xa95d7d1e, 0x9f53aeff, 0xe57cc19b, 0xbc70e2cc, 0xdd67b9fd, 0xec3b4f97, 0x5fddf77c,
0xcf8bbfd8, 0xf7ff7a1c, 0xf9dad7b7, 0x639648c3, 0x7bafddcf, 0xe68b68b6, 0x6addf3df, 0xf1a4043c,
0x63df7cfd, 0x7f7ff767, 0xfadda6ef, 0xff3eb673, 0xfbecb3fb, 0xbfbbe46f, 0xfffaf87b, 0x2a1bffbf,
0x7ab80afe, 0x76223bbf, 0xf6c1947c, 0x2db40577, 0xa211f9ee, 0x4ddde6c9, 0x4022841b, 0x0f654c10,
0x51bdfd79, 0x1bff72c4, 0x2fde1d9b, 0xf33fbe71, 0xf4ba6ce6, 0xfb73f850, 0xf3ba5dfe, 0xdbef99f5,
0xaf265fd8, 0xbbbfcfff, 0x7eebfb2b, 0xf4ff7d7f, 0xbfd0fe73, 0x2eda85cf, 0xbbeb9349, 0xa3e8efbe,
0xcceb7fbe, 0x34bf63c3, 0x941d6edf, 0xfe4aefb3, 0x6d54a573, 0xcd7e4d4f, 0xff7f77dd, 0xfb3dcc1a,
0x7fe72b3b, 0xafff5e5e, 0xdb9365f3, 0xbbae3caf, 0xff3dfc69, 0xffefb63b, 0xcdbffd3f, 0x2ce2efbe,
0x574fd4f6, 0xdbcd423f, 0x2fbc3db2, 0x3ffc5edc, 0xcb5efded, 0xff3f53f2, 0xefe67fef, 0x77c56fdf,
0xff35764f, 0xcc7ff9fd, 0x6fe4ee93, 0x7fbffc93, 0x5f77ffdd, 0xffd65e3b, 0x7a5bd5ee, 0xfbd9bf7e,
0xeffb9fdf, 0xfffedfff, 0xbbe7fbff, 0xcf5fdefe, 0xeffffeab, 0xf7efbeff, 0xf6fed7ff, 0xdff7ffff,
0xbdf7fbf9, 0xf8ddf9f5, 0x7cfff7ef, 0xfffdf7f9, 0xffeefffe, 0xf7ffeffb, 0xfbfffff7, 0xf7ffffdf,
0xffbffbef, 0x9bfbff60, 0x7ff6ed7f, 0xe4e37901, 0xfebff1b7, 0x634b7ffd, 0xf543179c, 0xffff77fd,
0x77257eff, 0xfe6f3633, 0x135ffd78, 0x99bafbec, 0x75b2ecf1, 0x04f7b319, 0xb7dfe9ed, 0xc7d6fad1,
0xb77bf7fd, 0x87bbfbdf, 0xe9f1330d, 0xfe6fb9bf, 0x75dfacdb, 0xeedb7c25, 0xfabde7ae, 0xf155b3ed,
0xd566905f, 0xbaeb4da4, 0xf6bfdbe7, 0x3fff7fff, 0xcfbefdd6, 0xfbbf73db, 0x9bcde74f, 0x3eeb6ccd,
0xffe77db9, 0x3ef1e1f2, 0x66edff7d, 0xfffe9faf, 0xef7fbeae, 0xfff5ffed, 0x7edbfff4, 0xbbfd77ef,
0x53dd75bf, 0xbfffefff, 0xfeddefff, 0x0f9e57f7, 0xfd9bb9ff, 0xc8f707ff, 0x74ff7ef9, 0xddffc72f,
0xf8dffdff, 0xefbeffbd, 0xadffadff, 0xdfffbd4f, 0x7deb7bef, 0xbdfdefff, 0x7ddbfef7, 0xafffdf9f,
0xf7fedfbf, 0x981dcbff, 0x75ffee3f, 0xd9dbfd79, 0xfffffbfc, 0xfdbf6f7a, 0xbd7afeea, 0xccfd8dfc,
0x47b7bfff, 0x3fafb1ff, 0xff7fbfcf, 0xf6dafef7, 0xf7fdf56f, 0xf3fa75ff, 0xfffddbfa, 0xbfffdfde,
0xf7fe7d97, 0xef87fbbb, 0x21dfffff, 0xbfbef9e7, 0xbffef577, 0xdffeff77, 0xefff7bff, 0xff3fbfb3,
0x6affffef, 0xf250d4d3, 0xcded6fdf, 0xfee6f3db, 0xf97d3b3f, 0xaddb37e5, 0xf4cbf95f, 0xfdeefbfb,
0x7ff7d7ff, 0xbefff1a9, 0x7fbbeffb, 0x83f9b7db, 0xfbb5f62b, 0x7bfdbfff, 0x26283a9f, 0xaeeb3b67,
0x7fffe4fd, 0xaa3f7cfe, 0xa7fc7ffc, 0x1dffe7b5, 0x3ffbbfcf, 0x67f6e95f, 0xe37fffff, 0x7fb715d3,
0xff8fcff9, 0xbeeaffff, 0xdfdfffde, 0xffffa7ff, 0x7e7d8dfd, 0xabf4fd7f, 0xfbdef3dd, 0xfb7f7eff,
0xfdffbffd, 0xabfff7bf, 0x1fbffcfc, 0x7a7f5ede, 0xffff1fcc, 0x11fdbb3f, 0x7ef9d5f4, 0xd6fe7d2f,
0x7bff97de, 0x4771bff6, 0xff7f0fba, 0x792ff5fb, 0x4e31dfe5, 0x7bff399f, 0x2dbff34f, 0x79bf5fd6,
0xf5edf6b7, 0xebffeebe, 0xd7a855bf, 0xbd5fffff, 0xff77577e, 0xfffdd7fd, 0xfffffd6e, 0xff7dbded,
0xffffef7f, 0x1efff77d, 0xfdffcff7, 0xd7f7effe, 0x767bf5ff, 0xbdddbb4f, 0x2d9ffbff, 0xb6ffff7f,
0xf16ed3ce, 0xf9f67778, 0xbfdeefab, 0xfd91beb9, 0xe9f77ffd, 0x12e9dffa, 0x9af973ff, 0xf7ef6cf6,
0xf7fffeed, 0x9f7db7b7, 0xbbcffbb5, 0xcdf699fd, 0xefbf4ffb, 0x58116b2a, 0x00d11a07, 0x4e461153,
0xda87bd8e, 0xdaae99ee, 0xfb5cffff, 0xff5ab9f9, 0xbbefaadd, 0xddfbfded, 0x7eebafdd, 0xe5fefdfe,
0xff5ecf94, 0x56b7ffff, 0xbaafe3bb, 0x1f227bff, 0xd2bfe517, 0x3beb39c8, 0xabbfafb4, 0xdaa4ff4d,
0xfbb4789b, 0xbcf577ff, 0xffc4b3ff, 0x30f6b79f, 0xc3ff7bfe, 0x5bf7fcfe, 0x7ef7ddba, 0xf7e7daf7,
0x2f6b881f, 0xfd1cebff, 0xfcffff7f, 0xbb707fbf, 0xcdfbf7fb, 0xdcf54fbf, 0x376d5f7f, 0xfdfd7f9f,
0xefffbfc3, 0xfc9be67f, 0xfeeaf9b6, 0xef7f7775, 0xffbfb9ff, 0xd979f7ff, 0xeff7eb7d, 0xfff97dff,
0xdfff8be7, 0xdfffee0f, 0xf77ffdbf, 0xffdde7b5, 0xedfff7fb, 0xeefbffff, 0xdff7f5ef, 0xffffafbf,
0xfb75ffff, 0xcf5fcfd5, 0xfff7f9f7, 0xbfefdbff, 0xf7efff6f, 0xffff61bf, 0xdfde5dff, 0xf5fffdcf,
0xfffdf33f, 0x7effdfff, 0x23cc3fff, 0x9dfff77f, 0xffdfebf8, 0xffffffff, 0x75ffd37e, 0xb5ffbfef,
0xee5bfefa, 0x7f7ffffd, 0xfd5fbd7f, 0xfeafffbe, 0xbfffdfd3, 0xfbfffffb, 0xfffbf7be, 0xb5fbefff,
0xffdfffdf, 0xf7bdfffb, 0xd167cf9e, 0xfd7ee6d5, 0xefbbd7cd, 0xfffdd7ff, 0xcccf7bd9, 0xdce7ffed,
0xfbfaff7b, 0xf7bbfbdf, 0x7fbfffef, 0x7ffb7bff, 0xfbb7773d, 0xdb77fb7f, 0xfff9fc88, 0xfeffffb7,
0x5657bafa, 0xeba5dbd7, 0xb7ceffff, 0xedf095b0, 0xbed7cd5f, 0x6dfaca84, 0xe7bbf77f, 0xba75973f,
0xd56fbbe8, 0xbdffeefc, 0xeaff7dff, 0xdf6a6fbf, 0xfff7fbfe, 0x083218c9, 0x905d5884, 0x92484510,
0x61d012d4, 0xff6dcca3, 0xfbfebabd, 0xfdffe3f9, 0x9ff7efff, 0x46276078, 0xa7f7fa62, 0xcbefcba0,
0x17b75adf, 0x204c0001, 0xb2a627e1, 0xff0ef7a0, 0x7ddff3dd, 0xbfe7fef7, 0x53fde7e7, 0xfb577aed,
0x9ffe69ff, 0xffcdf9fb, 0xfdfde2eb, 0x7bfbcfaf, 0xffff5b7d, 0xfbfb63dd, 0x7ffbc3fe, 0x96fffffb,
0xd7c3fdf7, 0x36ff79ff, 0x7fff9d8f, 0x47ea2c3f, 0x2414fc87, 0x89f814b7, 0x04ecbe49, 0xdc7ed39b,
0x116259b0, 0xa6d9bff2, 0xad461359, 0x4a5b9dd6, 0xfff57be4, 0xf4dd3bb3, 0xdffdbbfd, 0xdf5fdefd,
0xfdfbfdfb, 0xfffedf7f, 0xdfebc5ff, 0x7e1dabd3, 0xaffbf57f, 0xfeffe7ff, 0xce7c007c, 0xffdffff7,
0xbfcf9dff, 0xbfeffff7, 0xf77fffee, 0xffedfffb, 0xeebdffd6, 0xfd77cfff, 0xfffbff1d, 0xeef7dbef,
0xfafeffef, 0xfffb26ba, 0xf7ffd7ff, 0xbfdfffff, 0xffaffbff, 0xffffbfff, 0x7efffeff, 0xfffffff9,
0xfeffffff, 0xfedbbfff, 0xfff9ffef, 0x4ffaf7ff, 0x1d77f8ff, 0xb7d5bd18, 0xcffebfd4, 0xbabb9ffb,
0xf9fee6cc, 0xf7df3f81, 0xef9dff7e, 0x7f3e7fff, 0xfee5f5f6, 0xf5f9fcd3, 0xee9febff, 0xd04b1afd,
0x6decbddc, 0x7781bbff, 0xd7dec60e, 0xda16f8c1, 0xe4ce339f, 0xb62dfa76, 0xa59d4f0c, 0xb7327af3,
0xafb7992e, 0xfcfbff7f, 0xadbbfefb, 0xffa7fdf8, 0xfeff57df, 0xffffdffe, 0xeff7e7ff, 0x7797ed3f,
0xee70ea91, 0xe4fedfed, 0xb6cf0fbf, 0x011d777f, 0xca34c809, 0xfff7feee, 0x36ffbfef, 0x8ffb5ffb,
0xfff6e9d7, 0xfffffefd, 0xbffbf7df, 0x6b188fdf, 0xfdbbf69f, 0xffe58eff, 0xd7daff8d, 0x7ffddfff,
0xfdf3bfff, 0xbf7effff, 0x735fffde, 0x25a40fdb, 0xfb7d6f2b, 0xee7f7ecd, 0xfee77fcf, 0xfffffbbf,
0xfff7ffbf, 0xeffdff5b, 0xfbef7fc9, 0xffff7fff, 0xfffdffff, 0xffffbfff, 0xfffff9ff, 0xfffffeff,
0xfff6ffff, 0xfffbdfff, 0xff7ffdff, 0xefffffff, 0xefffdffd, 0xfefffeef, 0xbfbfcfe7, 0xffe7fddf,
0xf7fffdff, 0x77fffebf, 0xdffdffff, 0xfffbeffd, 0xffff7fff, 0xffef7fff, 0xff7fffff, 0xfdffffff,
0x33ffffff, 0x1ff75f94, 0xffff79d7, 0x5dd6ffaf, 0x7f77ffff, 0xc7ffff9f, 0x94e93fe7, 0xffff7eff,
0xfff7bfff, 0xb7fffffe, 0xfffaf3ff, 0x7ffbffbd, 0xf9fed6ef, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xc100fc6c, 0xfcfc4fee, 0xfeefd7fe, 0xfe6cffff, 0x6feee184, 0xc4fe68fc, 0xff7fefed, 0xcf946a6c,
0xeefdefee, 0xeecfdfff, 0xfffefd6f, 0x4f86cfff, 0xeffeecf9, 0xffffffef, 0xeff6feee, 0x6cffffee,
0xffefd4fc, 0xfe6cffff, 0x4fee8b94, 0xc7feecdc, 0xffffffcf, 0x8fffffff, 0xeefc4fee, 0xfeefdfff,
0xfffede7f, 0x4feecfff, 0xdffffcf1, 0xfdffffcf, 0xefdeffee, 0xecffffff, 0xefef947e, 0xffeefcff,
0xefeeefef, 0xfffffeff, 0xdd47eaef, 0xcfffffff, 0x68fd4fee, 0xeec504fc, 0xfeecfc4f, 0xffffffde,
0xc4f2e4ff, 0xfc4feec7, 0xeeccfeec, 0xecffffff, 0xfee3d4fa, 0xfffffeff, 0xefeeefff, 0xdffefcff,
0xfd4fe6cf, 0xefeffffe, 0xeeffffff, 0xefeffefe, 0xf8ecfdef, 0xffffeb9c, 0xd4feecff, 0x7147068b,
0xefc4feec, 0xffffffff, 0x668fffff, 0xfeecb54f, 0xffeeefd4, 0xffffeefc, 0xd54ea6cf, 0xeffffeee,
0xeefeffff, 0xffeff4fe, 0xfeecffff, 0xffffefd4, 0xfffffefe, 0xfd7fefef, 0xcffefefe, 0xfefd4fa6,
0xe6cfffff, 0xfeecf84f, 0x6feee7c4, 0xfffffffc, 0xffffffff, 0xe7c7f6ec, 0xfefc6fee, 0xffffffff,
0xfeecffff, 0xffffefdf, 0xfffffeff, 0xffffffef, 0xcfffffff, 0xfffdf7ef, 0xffffffff, 0xfeeeffff,
0xfffffffc, 0xf4fe6cff, 0xffcfeeef, 0xcfdefeee, 0xec384fee, 0xfecfc4fe, 0xfffffd7f, 0x4feecfff,
0xd4feec7c, 0xfc4feecf, 0xcffdffee, 0xeefd4fee, 0xffffffff, 0xfeeeffff, 0xffeeeffe, 0xd4feecfd,
0xfeffffef, 0xcfffffff, 0xecffffee, 0xeecfdefe, 0xffffbd4f, 0x4feecfff, 0x94fc2cfd, 0xf84feec5,
0xffdffefe, 0xecffffff, 0xeec794fe, 0xfeecfc4f, 0xffffef47, 0xd4fee4ff, 0xffffffef, 0xefffffff,
0xeeff7fef, 0xc68fffff, 0xfeee5d4f, 0xfffffff5, 0xfffeeeff, 0xffefeeef, 0xfff6feee, 0x6cffffff,
0x468d44fe, 0xfeecdd4f, 0xdffecdc4, 0xffffffff, 0xf94fe6cf, 0xcfc4fe6c, 0xfedc5fee, 0xee8fffff,
0xfffefd4f, 0xffffefef, 0xfcfeeefe, 0xfffffeef, 0xedd4fc6c, 0xfeffffff, 0xefefffff, 0xfeeeffff,
0xffeecffe, 0xffffffff, 0xf14faecf, 0xc7c6faec, 0xfefc6fee, 0xfffffeff, 0xfe6cffff, 0xffffefc6,
0xd6fffcfd, 0xffffffff, 0xffdcffee, 0xffffffff, 0xffffffff, 0xfffeffff, 0xefeecfdf, 0xffffffff,
0xffffffff, 0xefffffff, 0xeeffffff, 0xfffffcfe, 0xffecffff, 0x4f8e4fd7, 0xc4feecdc, 0xfd4feecf,
0x8ffffffe, 0xecbc4f26, 0xeeefc4fe, 0xfeeefc6f, 0x4fee8fff, 0xdfffeedd, 0xffffffef, 0xcff7ffee,
0x6cfd4fee, 0xffe954fe, 0xfeeefeff, 0x4feeefde, 0xd6fefefd, 0xfd4fcecf, 0xcfffffff, 0xe4bd47e6,
0xeec7ccfe, 0xfffffdef, 0xffffffdf, 0xcdfeecff, 0xffffeeef, 0xffffffff, 0x6cfffffe, 0xffefddfe,
0xfffeffff, 0xeffeefff, 0xfffffeff, 0xfdefeecf, 0xeffffffe, 0xeeffffff, 0xffffffff, 0x7eecfdff,
0xffffeffc, 0xd4feecff, 0x9c47a8cf, 0xcfc4feec, 0xee5d4f6e, 0xc4cfffff, 0xfeeca80f, 0x0f2e8fd4,
0xe6feee1c, 0xf5468ecf, 0xcfc7feec, 0xecfdeffe, 0xeecffefe, 0xfeecfd4f, 0xcfeecfd4, 0xfdfeecfc,
0xfd4feecf, 0xcf54feec, 0xee014bee, 0xcecffefe, 0xfe2cb847, 0x4feecf84, 0xdfffaefc, 0xfffffeef,
0xcf94faec, 0xeefd6fee, 0xffefcefe, 0xfeecffff, 0x7fffe714, 0xdffffefd, 0xffefeeef, 0xcfdfffee,
0xfefd4fee, 0xffefefff, 0xfeecffff, 0xefeeeffe, 0xdcfeecff, 0xffffffff, 0xc994fa6c, 0xecfc4f6a,
0xffefc4ff, 0xffffffff, 0xcfeecfff, 0xfffffefd, 0xfeffffef, 0xcfffffff, 0xfcf57fee, 0xffffffff,
0xfeeeffff, 0xfffffffe, 0xdcfeecfd, 0xffffffef, 0xffffffff, 0xfeffffff, 0xffefffff, 0xffffff6f,
0xcfeecfff, 0x84feac79, 0xfc4feecf, 0xffdffeee, 0xecffffff, 0xeecfc4fe, 0xffeefd6f, 0xffffefef,
0xd6feecff, 0xffffeeef, 0xefffffff, 0xfeffffef, 0xeecfffff, 0xfffefd4f, 0xffffefef, 0xf5fffeff,
0xfdffefef, 0xffdefeec, 0xecffffff, 0xeecfd4fc, 0xfeecfd4f, 0xffffefc4, 0xffffffff, 0xfc4fe6cf,
0xefd4feec, 0xfefe4fee, 0xeecfffff, 0xfeecfd4f, 0xffffefdf, 0xfffefefd, 0xffffffef, 0xefd4feec,
0xfefdffee, 0xeecfdfff, 0xfefeff6f, 0xcfeecffe, 0xfffffffd, 0xfd4feecf, 0xcfc4fcec, 0xfefd4fee,
0xffffffff, 0xfeecffff, 0x4feecfc4, 0xefffeefd, 0xfffffeef, 0xebd4feec, 0xfeffffff, 0xffefffff,
0xfffeff7f, 0x6feecfff, 0xeffffefd, 0xffffffef, 0xefdffeee, 0x6cfdefef, 0xfeeff4fa, 0xfeecffff,
0x4fae8fdc, 0xc4feecdc, 0xffffffcf, 0xcfffffff, 0xecfc6fee, 0xeeefd4fe, 0xfffefccf, 0x6feecfff,
0xfffffefd, 0xffffffff, 0xefffffee, 0xecff7ffe, 0xffefd6fa, 0xfffffdff, 0xfffeefff, 0xdefefeff,
0xffcfeeef, 0xcfffffff, 0xecfd4fee, 0xeecfd4de, 0xfffefc4f, 0xffffffdf, 0xd4feecff, 0xfd4feecf,
0xefc4feec, 0xecffffff, 0xeecdd4fe, 0xfffcfd7f, 0x7feecff7, 0xfffffefd, 0xfd6deecf, 0xcffefeee,
0xecfdffee, 0xeeeff4fe, 0xfc6cfdef, 0xcfeeeddc, 0xf4feecff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0xfc00ffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0xffffffff, 0xffffffff, 0xffffffdf,
};
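
A minimal sketch of how a one-bit-per-codepoint table like the one above can be consulted. It is an illustration under stated assumptions, not the code from this commit. The names `demo_bitvec`, `demo_bitvec_init` and `codepoint_bit_is_set` are hypothetical. It assumes that bit (cp & 31) of word (cp >> 5) belongs to codepoint cp, and that the row just before the closing brace covers U+FF00..U+FFFF.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One bit per codepoint in U+0000..U+FFFF: 65536 bits = 2048 words = 8KB. */
#define BITVEC_WORDS (0x10000 / 32)

/* Hypothetical stand-in for the real table: every word is zero except the
 * last eight, which are copied from the final row shown above. */
static uint32_t demo_bitvec[BITVEC_WORDS];

static void demo_bitvec_init(void)
{
    static const uint32_t last_row[8] = {
        0x00000001, 0x00000000, 0x00000000, 0x00000000,
        0x00000000, 0xffffffff, 0xffffffff, 0xffffffdf
    };
    memset(demo_bitvec, 0, sizeof(demo_bitvec));
    memcpy(&demo_bitvec[BITVEC_WORDS - 8], last_row, sizeof(last_row));
}

/* Assumed layout: bit (cp & 31) of word (cp >> 5) belongs to codepoint cp.
 * Codepoints above U+FFFF are outside the table and report false here. */
static bool codepoint_bit_is_set(const uint32_t *bitvec, uint32_t cp)
{
    if (cp > 0xFFFF) {
        return false;
    }
    return ((bitvec[cp >> 5] >> (cp & 0x1F)) & 1u) != 0;
}
```

Under that layout, the bits for U+FF21 (fullwidth 'A') and U+FFE5 (fullwidth yen sign) are clear in the final row, while the bits for U+FF00 and U+FFFF are set. That would be consistent with a set bit marking an 'unlikely' codepoint, but this reading is an inference from the visible rows, not something stated in this part of the diff.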


@@ -6,96 +6,227 @@ mbstring
mbstring.language=Japanese
--FILE--
<?php
// TODO: Add more tests
// SJIS string (BASE64 encoded)
$sjis = base64_decode('k/qWe4zqg2WDTINYg2eCxYK3gUIwMTIzNIJUglWCVoJXgliBQg==');
// JIS string (BASE64 encoded)
$jis = base64_decode('GyRCRnxLXDhsJUYlLSU5JUgkRyQ5ISMbKEIwMTIzNBskQiM1IzYjNyM4IzkhIxsoQg==');
// EUC-JP string
$euc_jp = '日本語テキストです。01234。';
$euc_jp = "\xC6\xFC\xCB\xDC\xB8\xEC\xA5\xC6\xA5\xAD\xA5\xB9\xA5\xC8\xA4\xC7\xA4\xB9\xA1\xA301234\xA3\xB5\xA3\xB6\xA3\xB7\xA3\xB8\xA3\xB9\xA1\xA3";
// Test with single "form encoding"
// Note: For some reason it complains, results are different. Not researched.
echo "== BASIC TEST ==\n";
$s = $sjis;
$s = mb_detect_encoding($s, 'SJIS');
print("SJIS: $s\n");
$s = $jis;
$s = mb_detect_encoding($s, 'JIS');
print("JIS: $s\n");
print("SJIS: " . mb_detect_encoding($sjis, 'SJIS') . "\n");
$s = $euc_jp;
$s = mb_detect_encoding($s, 'UTF-8,EUC-JP,JIS');
print("EUC-JP: $s\n");
print("JIS: " . mb_detect_encoding($jis, 'JIS') . "\n");
$s = $euc_jp;
$s = mb_detect_encoding($s, 'JIS,EUC-JP');
print("EUC-JP: $s\n");
print("EUC-JP: " . mb_detect_encoding($euc_jp, 'UTF-8,EUC-JP,JIS') . "\n");
print("EUC-JP: " . mb_detect_encoding($euc_jp, 'JIS,EUC-JP') . "\n");
// Using Encoding List Array
echo "== ARRAY ENCODING LIST ==\n";
$a = array(0=>'UTF-8',1=>'EUC-JP', 2=>'SJIS', 3=>'JIS');
$a = ['UTF-8', 'EUC-JP', 'SJIS', 'JIS'];
$s = $jis;
$s = mb_detect_encoding($s, $a);
print("JIS: $s\n");
print("JIS: " . mb_detect_encoding($jis, $a) . "\n");
$s = $euc_jp;
$s = mb_detect_encoding($s, $a);
print("EUC-JP: $s\n");
print("EUC-JP: " . mb_detect_encoding($euc_jp, $a) . "\n");
$s = $sjis;
$s = mb_detect_encoding($s, $a);
print("SJIS: $s\n");
print("SJIS: " . mb_detect_encoding($sjis, $a) . "\n");
$test = "CHARSET=windows-1252:Do\xeb;John";
$encodings = ['UTF-8', 'SJIS', 'GB2312',
'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4',
'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9',
'ISO-8859-10', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16',
'WINDOWS-1252', 'WINDOWS-1251', 'EUC-JP', 'EUC-TW', 'KOI8-R', 'BIG-5',
'ISO-2022-KR', 'ISO-2022-JP', 'UTF-16'
$encodings = [
'UTF-8', 'SJIS', 'GB2312',
'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4',
'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9',
'ISO-8859-10', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16',
'WINDOWS-1252', 'WINDOWS-1251', 'EUC-JP', 'EUC-TW', 'KOI8-R', 'BIG-5',
'ISO-2022-KR', 'ISO-2022-JP', 'UTF-16'
];
echo mb_detect_encoding($test, $encodings), "\n";
// Using Detect Order
$test = 'N:Müller;Jörg;;;
X-ABUID:2E4CB084-4767-4C85-BBCA-805B1DCB1C8E\:ABPerson';
echo mb_detect_encoding($test, ['UTF-8', 'SJIS']), "\n";
$test = 'BEGIN:VCARD
VERSION:2.1
N;ENCODING=QUOTED-PRINTABLE:Iksi=F1ski;Piotr
FN;ENCODING=QUOTED-PRINTABLE:Piotr Iksi=F1ski
EMAIL;PREF;INTERNET:piotr.iksinski@somedomain.com
X-GENDER:Male
REV:20080716T203548Z
END:VCARD
';
echo mb_detect_encoding($test, ['UTF-8', 'UTF-16']), "\n";
// We once had a problem where all kind of strings would be detected as 'UUENCODE'
echo mb_detect_encoding('abc', ['UUENCODE', 'UTF-8']), "\n";
echo "== DETECT ORDER ==\n";
mb_detect_order('auto');
print("JIS: " . mb_detect_encoding($jis) . "\n");
$s = $jis;
$s = mb_detect_encoding($s);
print("JIS: $s\n");
print("EUC-JP: " . mb_detect_encoding($euc_jp) . "\n");
$s = $euc_jp;
$s = mb_detect_encoding($s);
print("EUC-JP: $s\n");
print("SJIS: " . mb_detect_encoding($sjis) . "\n");
$s = $sjis;
$s = mb_detect_encoding($s);
print("SJIS: $s\n");
// Invalid(?) Parameters
echo "== INVALID PARAMETER ==\n";
$s = mb_detect_encoding(1234, 'EUC-JP');
print("INT: $s\n"); // EUC-JP
print("INT: " . mb_detect_encoding(1234, 'EUC-JP') . "\n"); // EUC-JP
$s = mb_detect_encoding('', 'EUC-JP');
print("EUC-JP: $s\n"); // SJIS
print("EUC-JP: " . mb_detect_encoding('', 'EUC-JP') . "\n"); // SJIS
$s = $euc_jp;
try {
var_dump(mb_detect_encoding($s, 'BAD'));
var_dump(mb_detect_encoding($euc_jp, 'BAD'));
} catch (\ValueError $e) {
echo $e->getMessage() . \PHP_EOL;
}
echo "== TORTURE TEST ==\n";
function test($strings, $encodings) {
foreach ($strings as $example) {
foreach ($encodings as $encoding) {
$converted = mb_convert_encoding($example, $encoding, 'UTF-8');
$detected = mb_detect_encoding($converted, $encodings);
if ($detected !== $encoding) {
echo "BAD! mb_detect_encoding returned $detected (should have been $encoding)\n";
echo "UTF-8 was: $example\n";
echo "$encoding bytes: ", bin2hex($converted), "\n";
}
}
}
}
$jpStrings = [
// Hat tip to Wikipedia
"日本で生まれ育ったほとんどの人は、日本語を母語とする[注 3]",
"2019年4月現在、インターネット上の言語使用者数は、英語、中国語、スペイン語、アラビア語、ポルトガル語、マレー語/インドネシア語、フランス語に次いで8番目に多い[13][信頼性要検証]。",
"日本語は地方ごとに多様な方言があり、とりわけ琉球諸島[要曖昧さ回避]で方言差が著しい(「方言」の節参照)。",
"さ し す せ そ しゃ しゅ しょ (清音)
た ち つ て と ちゃ ちゅ ちょ (清音)
な に ぬ ね の にゃ にゅ にょ ――",
"明治時代に入り、1889年から大槻文彦編の小型辞書『言海』が刊行された。これは、古典語・日常語を網羅し、五十音順に見出しを並べて、品詞・漢字表記・語釈を付した初の近代的な日本語辞書であった。『言海』は、後の辞書の模範的存在となり、後に増補版の『大言海』も刊行された。",
"奈良時代には『楊氏漢語抄』や『弁色立成(べんしきりゅうじょう)』という辞書が編纂された。それぞれ逸文として残るのみであるが、和訓を有する漢和辞書であったらしい。",
"複雑な文字体系を理由に、日本語を特殊とする議論もある。",
"一時的流行語。ある時代の若い世代が使う言葉。戦後の「アジャパー」、1970年代の「チカレタビー」など。コーホート語同世代語。",
"外国人による日本語研究も、中世末期から近世前期にかけて多く行われた。イエズス会では日本語とポルトガル語の辞書『日葡辞書』1603年が編纂され、また、同会のロドリゲスにより文法書『日本大文典』1608年および『日本小文典』1620年が表された。",
"一方、戦後になると各地の方言が失われつつあることが危惧されるようになった。NHK放送文化研究所は、昭和20年代の時点で各地の純粋な方言は80歳以上の老人の間でのみ使われているにすぎないとして、1953年から5年計画で全国の方言の録音を行った。",
"文体史 和漢混淆文の誕生",
"仮名遣いについては、早く小学校令施行規則1900年において、「にんぎやう人形」を「にんぎょー」とするなど、漢字音を発音通りにする、いわゆる「棒引き仮名遣い」が採用されたことがあった。",
"元来、日本に文字と呼べるものはなく、言葉を表記するためには中国渡来の漢字を用いた(いわゆる神代文字は後世の偽作とされている[167])。",
"第二次世界大戦が激しくなるにつれて、外来語を禁止または自粛する風潮も起こったが、戦後はアメリカ発の外来語が爆発的に多くなった。",
"そこから類推した結果、「文字を読む」に対して「文字が読むる(読める)」などの可能動詞が出来上がったものと考えられる。",
"近代以降には、外国語(特に英語)の音の影響で新しい音が使われ始めた。比較的一般化した「シェ・チェ・ツァ・ツェ・ツォ・ティ・ファ・フィ・フェ・フォ・ジェ・ディ・デュ」などの音に加え、場合によっては、「イェ・ウィ・ウェ・ウォ・クァ・クィ・クェ・クォ・ツィ・トゥ・グァ・ドゥ・テュ・フュ」などの音も使われる[147]。",
"20世紀後半から21世紀初頭にかけて中央競馬のトップジョッキーとして活躍し、競馬ファンから名手の愛称で親しまれた。",
"名鉄モ600形電車めいてつモ600がたでんしゃは、名古屋鉄道名鉄が岐阜地区の直流600 V電化路線区の一つである美濃町線において運用する目的で、1970年昭和45年に導入した電車である。",
"その視点から、真理は当初未就学期の娘を幼稚園に入園させる考えは持っていなかったが、",
// And here's to everyone's favorite blue robot...
"機械だって 涙を流して 震えながら 勇気を叫ぶだろう",
"台風だって 心を痛めて 愛を込めて さよならするだろう",
"便利な道具で 助けてくれる おもちゃの 兵器だ 「ソレ!とつげき!」"
];
$jpEncodings = [
'UTF-32BE',
'UTF-32LE',
'UTF-16BE',
'UTF-16LE',
'UTF-8',
'UTF-7',
'EUC-JP',
'SJIS',
'ISO-2022-JP'
];
test($jpStrings, $jpEncodings);
$cnStrings = [
"日本宫内厅宣布真子公主和小室圭将在10月26日完婚。",
// The Dream of Red Mansions
"此开卷第一回也。作者自云曾历过一番梦幻之后,故将真事隐去,而借“通灵”说此《石头记》一书也",
"一日,炎夏永昼,士隐于书房闲坐,手倦抛书,伏几盹睡。",
"  须臾,茶毕,早已设下杯盘。",
"士隐听了,便迎上来道:“你满口说些什么?只听见些“好了”“好了”。",
"士隐送雨村去后,回房一觉,直至红日三竿方醒。",
"时逢三五便团圞,满把清光护玉栏。",
"但弟子愚拙,不能洞悉明白。",
"按那石头上书云:当日地陷东南,这东南有个姑苏城,城中阊门",
"后来既受天地精华,复得甘露滋养,遂脱了草木之胎,幻化人形,仅仅修成女体,终日游于“离恨天”外,饥餐“秘情果”,渴饮“灌愁水”。",
"原来雨村自那日见了甄家丫鬟,曾回顾他两次,自谓是个知己,便时刻放在心上。",
// Wikipedia
// (A lot of this uses traditional Chinese characters, which we also want to be tested)
"漢语主要使用漢字書寫,為語素文字。",
"現漢字擁有兩套文字系統,分別為正體字與簡體字。",
"標準漢語中四個主要的聲調使用ma這個音節發音。",
"在語言學原則上,互相之間不能通話的應該被定性為語言而非方言。",
"但不少詞彙會採用粵語詞彙(例如採用「巴士」而非「公車」,採用「魚蛋」而非「魚丸子」,採用「沙律」而非「色拉」)",
"這是因為其他國家(除日本外)均使用表音文字,對於“文”[6]與“語”[7]並不作區分,不符合漢語語法",
"主条目:闽语、閩東語、福州語、閩南語和臺灣閩南語",
"普通話中aieiaoou等都是雙元音韻母",
"汉字",
"实词,词汇中含有实际意义的词语",
"我的老師 一位顧客 恭敬地鞠躬 完全相信 非常堅強 多麼可愛",
"敬畏生命 熱愛工作 上中學 登泰山 蓋房子 包餃子",
"参见:外來語 § 漢語外來語、中文外來語、汉字文化圈和汉字复活",
"如果将汉语延深入汉文,则汉文的信息密度更大。",
"我們家蓋了新房子。",
"他是一個高而瘦的老人。",
"我們家的臺階低。",
"我們家蓋了新房子。",
"敵人監視着葦塘。",
"連詞:用來連接詞、短語或句子,表示前後有並列、遞進、轉折、因果、假設等關係。",
"「大去之期不遠矣」",
"“官话方言”绝大多数次级方言都没有入声",
"其中,闽南语不仅有 ptk也有模糊入声"
];
$cnEncodings = [
'UTF-32BE',
'UTF-32LE',
'UTF-16BE',
'UTF-16LE',
'UTF-8',
'UTF-7',
'GB18030',
'BIG-5'
];
test($cnStrings, $cnEncodings);
$deStrings = [
// Much love to Wikipedia
"Die beiden Brücken über den Strelasund (2011)",
"die Rügenbrücke und der Rügendamm sowie die regelmäßig betriebenen Fährverbindungen zwischen Stralsund",
"Der „Rügendamm“ ist die erste feste Strelasundquerung",
"Koordinaten ♁54° 18 39″ N, 13° 7 0″ O",
"Die ausschließlich dem Kraftfahrzeugverkehr",
"Die Brücke ermöglicht dem Schiffsverkehr eine Durchfahrtshöhe von 40 m.[1]",
"Nach der Hauptbrücke folgt die Vorlandbrücke Dänholm (BW 3), eine 532,3 m",
"Die alte, als Klappbrücke ausgeführte Ziegelgrabenbrücke ist 133 Meter lang",
"Vor allem das Fährdorf Stralow („stral“ bedeutet im Mittelniederdeutschen und im Slawischen „Pfeil“) entwickelte sich rasch.",
"1946 kam es aufgrund der Zerstörung der Brücken",
"Auf der Trajektstrecke verkehrten im ersten Jahr bereits 90.000 Fahrgäste",
"Mai 1897 zwei Schnellzugpaare zwischen Berlin und Saßnitz.",
"Der Damm im Ziegelgraben und zwischen dem Dänholm und dem Widerlager der Brücke wurde mit den bei den Eisenbahnarbeiten gewonnenen Böden verfüllt.",
"Dabei passierten die vier anderen Trajekte die „Altefähr“",
"Ebenfalls in den 1980er Jahren traten zunehmend Ermüdungserscheinungen an den stark beanspruchten Stahlüberbauten auf",
"Erste Planungen für einen neuen Rügendamm
Die Kapazität der Eisenbahnbrücke war begrenzt:",
"bestehend aus den Firmen Walter Bau AG vereinigt mit Dywidag (später/nach v.g. Insolvenz durch die Dywidag Bau GmbH)",
"Bereits im Herbst 1998 erfolgten die ersten Bohrungen zur Untersuchung der Tragfähigkeit des Baugrundes im Bereich des Ziegelgrabens"
];
$deEncodings = [
'UTF-32BE',
'UTF-32LE',
'UTF-16BE',
'UTF-16LE',
'UTF-8',
'ISO-8859-1'
// TODO: It would be good if ISO-8859-2 and ISO-8859-15 can be accurately detected as well
];
test($deStrings, $deEncodings);
echo "Done!\n";
?>
--EXPECT--
== BASIC TEST ==
@@ -108,6 +239,9 @@ JIS: JIS
EUC-JP: EUC-JP
SJIS: SJIS
ISO-8859-1
UTF-8
UTF-8
UTF-8
== DETECT ORDER ==
JIS: JIS
EUC-JP: EUC-JP
@@ -116,3 +250,5 @@ SJIS: SJIS
INT: EUC-JP
EUC-JP: EUC-JP
mb_detect_encoding(): Argument #2 ($encodings) contains invalid encoding "BAD"
== TORTURE TEST ==
Done!