The exported symbols in utf8-norm.c are not needed for normal
file system consumers, so move them to conditional _GPL exports
just for the selftest.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
utf8data.h contains a large database table which is an auto-generated
decodification trie for the unicode normalization functions.
Allow building it into a separate module.
Based on a patch from Shreeya Patel <shreeya.patel@collabora.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Instead of repeatedly looking up the version add pointers to the
NFD and NFD+CF tables to struct unicode_map, and pass a
unicode_map plus index to the functions using the normalization
tables.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Only used by the tests, so no need to keep it in the core.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Just use the utf8nlen implementation with a (size_t)-1 len argument,
similar to utf8_lookup. Also move the function to utf8-selftest.c, as
it isn't used anywhere else.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Don't bother with pointless string parsing when the caller can just pass
the version in the format that the core expects. Also remove the
fallback to the latest version that none of the callers actually uses.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Based on 2 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license 2 as published
by the free software foundation this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation this program is distributed in the hope
that it [would] be useful but without any warranty without even the
implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 9 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.804956444@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch integrates the utf8n patches with some higher level API to
perform UTF-8 string comparison, normalization and casefolding
operations. Implemented is a variation of NFD, and casefold is
performed by doing full casefold on top of NFD. These algorithms are
based on the core implemented by Olaf Weber from SGI.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove the Hangul decompositions from the utf8data trie, and do
algorithmic decomposition to calculate them on the fly. To store the
decomposition the caller of utf8lookup()/utf8nlookup() must provide a
12-byte buffer, which is used to synthesize a leaf with the
decomposition. This significantly reduces the size of the utf8data[]
array.
Changes made by Gabriel:
Rebase to mainline
Fix checkpatch errors
Extract robustness fixes and merge back to original mkutf8data.c patch
Regenerate utf8data.h
Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfdi and
nfdicf.
nfdi:
- Apply unicode normalization form NFD.
- Remove any Default_Ignorable_Code_Point.
nfdicf:
- Apply unicode normalization form NFD.
- Remove any Default_Ignorable_Code_Point.
- Apply a full casefold (C + F).
For the purposes of the code, a string is valid UTF-8 if:
- The values encoded are 0x1..0x10FFFF.
- The surrogate codepoints 0xD800..0xDFFFF are not encoded.
- The shortest possible encoding is used for all values.
The supporting functions work on null-terminated strings (utf8 prefix)
and on length-limited strings (utf8n prefix).
From the original SGI patch and for conformity with coding standards,
the utf8data_t typedef was dropped, since it was just masking the struct
keyword. On other occasions, namely utf8leaf_t and utf8trie_t, I
decided to keep it, since they are simple pointers to memory buffers,
and using uchars here wouldn't provide any more meaningful information.
From the original submission, we also converted from the compatibility
form to canonical.
Changes made by Gabriel:
Rebase to Mainline
Fix up checkpatch.pl warnings
Drop typedefs
move out of libxfs
Convert from NFKD to NFD
Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>