linux/fs/unicode
Olaf Weber a8384c6879 unicode: reduce the size of utf8data[]
Remove the Hangul decompositions from the utf8data trie, and do
algorithmic decomposition to calculate them on the fly. To store the
decomposition the caller of utf8lookup()/utf8nlookup() must provide a
12-byte buffer, which is used to synthesize a leaf with the
decomposition. This significantly reduces the size of the utf8data[]
array.

Changes made by Gabriel:
  Rebase to mainline
  Fix checkpatch errors
  Extract robustness fixes and merge back to original mkutf8data.c patch
  Regenerate utf8data.h

Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25 13:49:18 -04:00
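
The algorithmic Hangul decomposition that this commit refers to can be
sketched as follows. This is a minimal standalone illustration of the
arithmetic given in the Unicode Standard (section 3.12), not the code in
utf8-norm.c. The function names, the macro names and the 10-byte output
buffer (three 3-byte jamo plus a NUL) are assumptions made for this
example; the 12-byte leaf layout used by utf8lookup()/utf8nlookup() is
not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Constants from the Unicode Standard, section 3.12 (Hangul syllables). */
#define SBASE  0xAC00u
#define LBASE  0x1100u
#define VBASE  0x1161u
#define TBASE  0x11A7u
#define LCOUNT 19u
#define VCOUNT 21u
#define TCOUNT 28u
#define NCOUNT (VCOUNT * TCOUNT)   /* 588 */
#define SCOUNT (LCOUNT * NCOUNT)   /* 11172 */

/* Encode a code point in the range U+0800..U+FFFF as 3 bytes of UTF-8. */
static size_t put_utf8_3(uint32_t cp, unsigned char *out)
{
	out[0] = (unsigned char)(0xE0 | (cp >> 12));
	out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[2] = (unsigned char)(0x80 | (cp & 0x3F));
	return 3;
}

/*
 * Hypothetical helper: decompose a precomposed Hangul syllable into its
 * leading consonant, vowel and optional trailing consonant, written to
 * out as NUL-terminated UTF-8. Returns the number of bytes written
 * (excluding the NUL), or 0 if cp is not a Hangul syllable.
 */
size_t hangul_decompose_utf8(uint32_t cp, unsigned char out[10])
{
	uint32_t si, l, v, t;
	size_t n = 0;

	if (cp < SBASE || cp >= SBASE + SCOUNT)
		return 0;

	si = cp - SBASE;
	l = LBASE + si / NCOUNT;
	v = VBASE + (si % NCOUNT) / TCOUNT;
	t = TBASE + si % TCOUNT;

	n += put_utf8_3(l, out + n);
	n += put_utf8_3(v, out + n);
	if (t != TBASE)		/* t == TBASE means no trailing consonant */
		n += put_utf8_3(t, out + n);
	out[n] = '\0';
	return n;
}
```

For example, U+D55C decomposes to U+1112 U+1161 U+11AB, which takes 9
bytes of UTF-8. Because every syllable can be computed this way, none of
the 11,172 precomposed syllables needs its own decomposition entry in a
table.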
Kconfig
Makefile unicode: introduce code for UTF-8 normalization 2019-04-25 13:45:46 -04:00
README.utf8data unicode: reduce the size of utf8data[] 2019-04-25 13:49:18 -04:00
utf8-norm.c unicode: reduce the size of utf8data[] 2019-04-25 13:49:18 -04:00
utf8data.h unicode: reduce the size of utf8data[] 2019-04-25 13:49:18 -04:00
utf8n.h unicode: reduce the size of utf8data[] 2019-04-25 13:49:18 -04:00

The utf8data.h file in this directory is generated from the Unicode
Character Database for version 11.0.0 of the Unicode standard.

The full set of files can be found here:

  http://www.unicode.org/Public/11.0.0/ucd/

Individual source links:

  http://www.unicode.org/Public/11.0.0/ucd/CaseFolding.txt
  http://www.unicode.org/Public/11.0.0/ucd/DerivedAge.txt
  http://www.unicode.org/Public/11.0.0/ucd/extracted/DerivedCombiningClass.txt
  http://www.unicode.org/Public/11.0.0/ucd/DerivedCoreProperties.txt
  http://www.unicode.org/Public/11.0.0/ucd/NormalizationCorrections.txt
  http://www.unicode.org/Public/11.0.0/ucd/NormalizationTest.txt
  http://www.unicode.org/Public/11.0.0/ucd/UnicodeData.txt

md5sums (verify by running "md5sum -c README.utf8data"):

  414436796cf097df55f798e1585448ee  CaseFolding.txt
  6032a595fbb782694456491d86eecfac  DerivedAge.txt
  3240997d671297ac754ab0d27577acf7  DerivedCombiningClass.txt
  2a4fe257d9d8184518e036194d2248ec  DerivedCoreProperties.txt
  4e7d383fa0dd3cd9d49d64e5b7b7c9e0  NormalizationCorrections.txt
  c9500c5b8b88e584469f056023ecc3f2  NormalizationTest.txt
  acc291106c3758d2025f8d7bd5518bee  UnicodeData.txt

sha1sums (verify by running "sha1sum -c README.utf8data"):

  9184727adf7bd20e36312a68581d12ba3ffb9854  CaseFolding.txt
  86c55b3eb89de61704da16af9c3f22854f61b57d  DerivedAge.txt
  b615703f62b1dbc5110e91acc3ff8b3789a067cf  DerivedCombiningClass.txt
  f8b07ef116d7dc21a94f26e70178ed2acf8713e9  DerivedCoreProperties.txt
  a5fafb8998c0b8153a2a58430b8a35c811db0abc  NormalizationCorrections.txt
  070cdcb00cd4f0860e476750e404c59c2ebe9b25  NormalizationTest.txt
  0e060fafb08d6722fbec56d9f9ebe8509f01d0ee  UnicodeData.txt

To update to a newer version of the Unicode standard, download the
latest released version of the UCD, which can be found here:

  http://www.unicode.org/Public/UCD/latest/

To build the utf8data.h file from a kernel tree that has already been
built, cd to this directory (fs/unicode) and run this command:

	make C=../.. objdir=../.. utf8data.h.new

After sanity checking the newly generated utf8data.h.new file (the
version generated from the 11.0.0 UCD should be 4,061 lines long, and
have a total size of 320k) and/or comparing it with the older version
of utf8data.h, rename it to utf8data.h.

If you are a kernel developer updating to a newer version of the
Unicode Character Database, please update this README.utf8data file
with the version of the UCD that was used and the md5sums and sha1sums
of the *.txt files before checking in the new versions of the
utf8data.h and README.utf8data files.