php-src/ext/intl/intl_error.c

331 lines
8.1 KiB
C
Raw Normal View History

2008-07-08 06:51:04 +08:00
/*
+----------------------------------------------------------------------+
2014-09-20 00:33:14 +08:00
| PHP Version 7 |
2008-07-08 06:51:04 +08:00
+----------------------------------------------------------------------+
| This source file is subject to version 3.01 of the PHP license, |
| that is bundled with this package in the file LICENSE, and is |
| available through the world-wide-web at the following url: |
| http://www.php.net/license/3_01.txt |
| If you did not receive a copy of the PHP license and are unable to |
| obtain it through the world-wide-web, please send a note to |
| license@php.net so we can mail you a copy immediately. |
+----------------------------------------------------------------------+
| Authors: Vadim Savchuk <vsavchuk@productengine.com> |
| Dmitry Lakhtyuk <dlakhtyuk@productengine.com> |
| Stanislav Malyshev <stas@zend.com> |
+----------------------------------------------------------------------+
*/
#ifdef HAVE_CONFIG_H
#include "config.h"
#endif
#include <php.h>
#include <zend_exceptions.h>
2008-07-08 06:51:04 +08:00
#include "php_intl.h"
#include "intl_error.h"
BreakIterator and RuleBasedBreakiterator added This commit adds wrappers for the classes BreakIterator and RuleBasedbreakIterator. The C++ ICU classes are described here: <http://icu-project.org/apiref/icu4c/classBreakIterator.html> <http://icu-project.org/apiref/icu4c/classRuleBasedBreakIterator.html> Additionally, a tutorial is available at: <http://userguide.icu-project.org/boundaryanalysis> This implementation wraps UTF-8 text in a UText. The text is iterated without any copying or conversion to UTF-16. There is also no validation that the input is actually UTF-8; where there are malformed sequences, the UText will simply U+FFFD. The class BreakIterator cannot be instantiated directly (has a private constructor). It provides the interface exposed by the ICU abstract class with the same name. The PHP class is not abstract because we may use it to wrap native subclasses of BreakIterator that we don't know how to wrap. This class includes methods to move the iterator position to the beginning (first()), to the end (last()), forward (next()), backwards (previous()), to the boundary preceding a certain position (preceding()) and following a certain position (following()) and to obtain the current position (current()). next() can also be used to advance or recede an arbitrary number of positions. BreakIterator also exposes other native methods: getAvailableLocales(), getLocale() and factory methods to build several predefined types of BreakIterators: createWordInstance() for word boundaries, createCharacterInstance() for locale dependent notions of "characters", createSentenceInstance() for sentences, createLineInstance() and createTitleInstance() -- for title casing breaks. These factories currently return RuleBasedbreakIterators where the names of the rule sets are found in the ICU data, observing the passed locale (although the locale is taken into considering there are very few exceptions to the root rules). The clone and compare_object PHP object handlers are also implemented, though the comparison does not yield meaningful results when used with >, <, >= and <=. Note that BreakIterator is an iterator only in the sense of the first 'Iterator' in 'IteratorIterator', i.e., it does not implement the Iterator interface. The reason is that there is no sensible implementation for Iterator::key(). Using it for an ordinal of the current boundary is not feasible because we are allowed to move to any boundary at any time. It we were to determine the current ordinal when last() is called we'd have to traverse the whole input text to find out how many breaks there were before. Therefore, BreakIterator implements only Traversable. It can be wrapped in an IteratorIterator, but the usual warnings apply. Finally, I added a convenience method to BreakIterator: getPartsIterator(). This provides an IntlIterator, backed by the BreakIterator PHP object (i.e. moving the pointer or changing the text in BreakIterator affects the iterator and also moving the iterator affects the backing BreakIterator), which allows traversing the text between each boundary. This iterator uses the original text to retrieve the text between two positions, not the code points returned by the wrapping UText. Therefore, if the text includes invalid code unit sequences, these invalid sequences will be in the output of this iterator, not U+FFFD code points. The class RuleBasedIterator exposes a constructor that allows building an iterator from arbitrary compiled or non-compiled rules. The form of these rules in described in the tutorial linked above. The rest of the methods allow retrieving the rules -- getRules() and getCompiledRules() --, a hash code of the rule set (hashCode()) and the rules statuses (getRuleStatus() and getRuleStatusVec()). Because the RuleBasedBreakIterator constructor may return parse errors, I reuse the UParseError to text function that was in the transliterator files. Therefore, I move that function to intl_error.c. common_enum.cpp was also changed, mainly to expose previously static functions. This avoided code duplication when implementing the BreakIterator iterator and the IntlIterator returned by BreakIterator::getPartsIterator().
2012-05-31 18:11:44 +08:00
#include "intl_convert.h"
2008-07-08 06:51:04 +08:00
ZEND_EXTERN_MODULE_GLOBALS( intl )
zend_class_entry *IntlException_ce_ptr;
2008-07-08 06:51:04 +08:00
/* {{{ intl_error* intl_g_error_get()
* Return global error structure.
*/
2014-12-14 06:06:14 +08:00
static intl_error* intl_g_error_get( void )
2008-07-08 06:51:04 +08:00
{
return &INTL_G( g_error );
}
/* }}} */
/* {{{ void intl_free_custom_error_msg( intl_error* err )
* Free mem.
*/
2014-12-14 06:06:14 +08:00
static void intl_free_custom_error_msg( intl_error* err )
2008-07-08 06:51:04 +08:00
{
2014-12-14 06:06:14 +08:00
if( !err && !( err = intl_g_error_get( ) ) )
2008-07-08 06:51:04 +08:00
return;
if(err->free_custom_error_message ) {
efree( err->custom_error_message );
}
2008-07-08 06:51:04 +08:00
err->custom_error_message = NULL;
err->free_custom_error_message = 0;
}
/* }}} */
/* {{{ intl_error* intl_error_create()
* Create and initialize internals of 'intl_error'.
*/
2014-12-14 06:06:14 +08:00
intl_error* intl_error_create( void )
2008-07-08 06:51:04 +08:00
{
intl_error* err = ecalloc( 1, sizeof( intl_error ) );
2014-12-14 06:06:14 +08:00
intl_error_init( err );
2008-07-08 06:51:04 +08:00
return err;
}
/* }}} */
/* {{{ void intl_error_init( intl_error* coll_error )
* Initialize internals of 'intl_error'.
*/
2014-12-14 06:06:14 +08:00
void intl_error_init( intl_error* err )
2008-07-08 06:51:04 +08:00
{
2014-12-14 06:06:14 +08:00
if( !err && !( err = intl_g_error_get( ) ) )
2008-07-08 06:51:04 +08:00
return;
err->code = U_ZERO_ERROR;
err->custom_error_message = NULL;
err->free_custom_error_message = 0;
}
/* }}} */
/* {{{ void intl_error_reset( intl_error* err )
* Set last error code to 0 and unset last error message
*/
2014-12-14 06:06:14 +08:00
void intl_error_reset( intl_error* err )
2008-07-08 06:51:04 +08:00
{
2014-12-14 06:06:14 +08:00
if( !err && !( err = intl_g_error_get( ) ) )
2008-07-08 06:51:04 +08:00
return;
err->code = U_ZERO_ERROR;
2014-12-14 06:06:14 +08:00
intl_free_custom_error_msg( err );
2008-07-08 06:51:04 +08:00
}
/* }}} */
/* {{{ void intl_error_set_custom_msg( intl_error* err, char* msg, int copyMsg )
* Set last error message to msg copying it if needed.
*/
void intl_error_set_custom_msg( intl_error* err, const char* msg, int copyMsg )
2008-07-08 06:51:04 +08:00
{
if( !msg )
return;
if( !err ) {
if( INTL_G( error_level ) )
php_error_docref( NULL, INTL_G( error_level ), "%s", msg );
if( INTL_G( use_exceptions ) )
zend_throw_exception_ex( IntlException_ce_ptr, 0, "%s", msg );
2009-07-08 05:25:46 +08:00
}
2014-12-14 06:06:14 +08:00
if( !err && !( err = intl_g_error_get( ) ) )
2008-07-08 06:51:04 +08:00
return;
/* Free previous message if any */
2014-12-14 06:06:14 +08:00
intl_free_custom_error_msg( err );
2008-07-08 06:51:04 +08:00
/* Mark message copied if any */
2008-07-08 06:51:04 +08:00
err->free_custom_error_message = copyMsg;
/* Set user's error text message */
err->custom_error_message = copyMsg ? estrdup( msg ) : (char *) msg;
2008-07-08 06:51:04 +08:00
}
/* }}} */
/* {{{ const char* intl_error_get_message( intl_error* err )
* Create output message in format "<intl_error_text>: <extra_user_error_text>".
*/
2014-12-14 06:06:14 +08:00
zend_string * intl_error_get_message( intl_error* err )
2008-07-08 06:51:04 +08:00
{
2014-06-28 00:02:50 +08:00
const char *uErrorName = NULL;
zend_string *errMessage = 0;
2008-07-08 06:51:04 +08:00
2014-12-14 06:06:14 +08:00
if( !err && !( err = intl_g_error_get( ) ) )
2014-06-28 00:02:50 +08:00
return STR_EMPTY_ALLOC();
2008-07-08 06:51:04 +08:00
uErrorName = u_errorName( err->code );
/* Format output string */
2008-07-08 06:51:04 +08:00
if( err->custom_error_message )
{
2014-06-28 00:02:50 +08:00
errMessage = strpprintf(0, "%s: %s", err->custom_error_message, uErrorName );
2008-07-08 06:51:04 +08:00
}
else
{
2014-06-28 00:02:50 +08:00
errMessage = strpprintf(0, "%s", uErrorName );
2008-07-08 06:51:04 +08:00
}
return errMessage;
}
/* }}} */
/* {{{ void intl_error_set_code( intl_error* err, UErrorCode err_code )
* Set last error code.
*/
2014-12-14 06:06:14 +08:00
void intl_error_set_code( intl_error* err, UErrorCode err_code )
2008-07-08 06:51:04 +08:00
{
2014-12-14 06:06:14 +08:00
if( !err && !( err = intl_g_error_get( ) ) )
2008-07-08 06:51:04 +08:00
return;
err->code = err_code;
}
/* }}} */
/* {{{ void intl_error_get_code( intl_error* err )
* Return last error code.
*/
2014-12-14 06:06:14 +08:00
UErrorCode intl_error_get_code( intl_error* err )
2008-07-08 06:51:04 +08:00
{
2014-12-14 06:06:14 +08:00
if( !err && !( err = intl_g_error_get( ) ) )
2008-07-08 06:51:04 +08:00
return U_ZERO_ERROR;
return err->code;
}
/* }}} */
/* {{{ void intl_error_set( intl_error* err, UErrorCode code, char* msg, int copyMsg )
* Set error code and message.
*/
void intl_error_set( intl_error* err, UErrorCode code, const char* msg, int copyMsg )
2008-07-08 06:51:04 +08:00
{
2014-12-14 06:06:14 +08:00
intl_error_set_code( err, code );
intl_error_set_custom_msg( err, msg, copyMsg );
2008-07-08 06:51:04 +08:00
}
/* }}} */
/* {{{ void intl_errors_set( intl_error* err, UErrorCode code, char* msg, int copyMsg )
* Set error code and message.
*/
void intl_errors_set( intl_error* err, UErrorCode code, const char* msg, int copyMsg )
{
2014-12-14 06:06:14 +08:00
intl_errors_set_code( err, code );
intl_errors_set_custom_msg( err, msg, copyMsg );
}
/* }}} */
2008-07-08 06:51:04 +08:00
/* {{{ void intl_errors_reset( intl_error* err )
*/
2014-12-14 06:06:14 +08:00
void intl_errors_reset( intl_error* err )
2008-07-08 06:51:04 +08:00
{
2009-05-11 03:10:36 +08:00
if(err) {
2014-12-14 06:06:14 +08:00
intl_error_reset( err );
2009-05-11 03:10:36 +08:00
}
2014-12-14 06:06:14 +08:00
intl_error_reset( NULL );
2008-07-08 06:51:04 +08:00
}
/* }}} */
/* {{{ void intl_errors_set_custom_msg( intl_error* err, char* msg, int copyMsg )
*/
void intl_errors_set_custom_msg( intl_error* err, const char* msg, int copyMsg )
2008-07-08 06:51:04 +08:00
{
2009-05-11 03:10:36 +08:00
if(err) {
intl_error_set_custom_msg( err, msg, copyMsg );
2009-05-11 03:10:36 +08:00
}
intl_error_set_custom_msg( NULL, msg, copyMsg );
2008-07-08 06:51:04 +08:00
}
/* }}} */
/* {{{ intl_errors_set_code( intl_error* err, UErrorCode err_code )
*/
2014-12-14 06:06:14 +08:00
void intl_errors_set_code( intl_error* err, UErrorCode err_code )
2008-07-08 06:51:04 +08:00
{
2009-05-11 03:10:36 +08:00
if(err) {
2014-12-14 06:06:14 +08:00
intl_error_set_code( err, err_code );
2009-05-11 03:10:36 +08:00
}
2014-12-14 06:06:14 +08:00
intl_error_set_code( NULL, err_code );
2008-07-08 06:51:04 +08:00
}
/* }}} */
2014-12-14 06:06:14 +08:00
void intl_register_IntlException_class( void )
{
zend_class_entry ce,
*default_exception_ce;
2014-12-14 06:06:14 +08:00
default_exception_ce = zend_exception_get_default( );
/* Create and register 'IntlException' class. */
INIT_CLASS_ENTRY_EX( ce, "IntlException", sizeof( "IntlException" ) - 1, NULL );
IntlException_ce_ptr = zend_register_internal_class_ex( &ce,
2014-12-14 06:06:14 +08:00
default_exception_ce );
IntlException_ce_ptr->create_object = default_exception_ce->create_object;
}
BreakIterator and RuleBasedBreakiterator added This commit adds wrappers for the classes BreakIterator and RuleBasedbreakIterator. The C++ ICU classes are described here: <http://icu-project.org/apiref/icu4c/classBreakIterator.html> <http://icu-project.org/apiref/icu4c/classRuleBasedBreakIterator.html> Additionally, a tutorial is available at: <http://userguide.icu-project.org/boundaryanalysis> This implementation wraps UTF-8 text in a UText. The text is iterated without any copying or conversion to UTF-16. There is also no validation that the input is actually UTF-8; where there are malformed sequences, the UText will simply U+FFFD. The class BreakIterator cannot be instantiated directly (has a private constructor). It provides the interface exposed by the ICU abstract class with the same name. The PHP class is not abstract because we may use it to wrap native subclasses of BreakIterator that we don't know how to wrap. This class includes methods to move the iterator position to the beginning (first()), to the end (last()), forward (next()), backwards (previous()), to the boundary preceding a certain position (preceding()) and following a certain position (following()) and to obtain the current position (current()). next() can also be used to advance or recede an arbitrary number of positions. BreakIterator also exposes other native methods: getAvailableLocales(), getLocale() and factory methods to build several predefined types of BreakIterators: createWordInstance() for word boundaries, createCharacterInstance() for locale dependent notions of "characters", createSentenceInstance() for sentences, createLineInstance() and createTitleInstance() -- for title casing breaks. These factories currently return RuleBasedbreakIterators where the names of the rule sets are found in the ICU data, observing the passed locale (although the locale is taken into considering there are very few exceptions to the root rules). The clone and compare_object PHP object handlers are also implemented, though the comparison does not yield meaningful results when used with >, <, >= and <=. Note that BreakIterator is an iterator only in the sense of the first 'Iterator' in 'IteratorIterator', i.e., it does not implement the Iterator interface. The reason is that there is no sensible implementation for Iterator::key(). Using it for an ordinal of the current boundary is not feasible because we are allowed to move to any boundary at any time. It we were to determine the current ordinal when last() is called we'd have to traverse the whole input text to find out how many breaks there were before. Therefore, BreakIterator implements only Traversable. It can be wrapped in an IteratorIterator, but the usual warnings apply. Finally, I added a convenience method to BreakIterator: getPartsIterator(). This provides an IntlIterator, backed by the BreakIterator PHP object (i.e. moving the pointer or changing the text in BreakIterator affects the iterator and also moving the iterator affects the backing BreakIterator), which allows traversing the text between each boundary. This iterator uses the original text to retrieve the text between two positions, not the code points returned by the wrapping UText. Therefore, if the text includes invalid code unit sequences, these invalid sequences will be in the output of this iterator, not U+FFFD code points. The class RuleBasedIterator exposes a constructor that allows building an iterator from arbitrary compiled or non-compiled rules. The form of these rules in described in the tutorial linked above. The rest of the methods allow retrieving the rules -- getRules() and getCompiledRules() --, a hash code of the rule set (hashCode()) and the rules statuses (getRuleStatus() and getRuleStatusVec()). Because the RuleBasedBreakIterator constructor may return parse errors, I reuse the UParseError to text function that was in the transliterator files. Therefore, I move that function to intl_error.c. common_enum.cpp was also changed, mainly to expose previously static functions. This avoided code duplication when implementing the BreakIterator iterator and the IntlIterator returned by BreakIterator::getPartsIterator().
2012-05-31 18:11:44 +08:00
smart_str intl_parse_error_to_string( UParseError* pe )
{
smart_str ret = {0};
char *buf;
2014-12-29 18:17:15 +08:00
size_t u8len;
BreakIterator and RuleBasedBreakiterator added This commit adds wrappers for the classes BreakIterator and RuleBasedbreakIterator. The C++ ICU classes are described here: <http://icu-project.org/apiref/icu4c/classBreakIterator.html> <http://icu-project.org/apiref/icu4c/classRuleBasedBreakIterator.html> Additionally, a tutorial is available at: <http://userguide.icu-project.org/boundaryanalysis> This implementation wraps UTF-8 text in a UText. The text is iterated without any copying or conversion to UTF-16. There is also no validation that the input is actually UTF-8; where there are malformed sequences, the UText will simply U+FFFD. The class BreakIterator cannot be instantiated directly (has a private constructor). It provides the interface exposed by the ICU abstract class with the same name. The PHP class is not abstract because we may use it to wrap native subclasses of BreakIterator that we don't know how to wrap. This class includes methods to move the iterator position to the beginning (first()), to the end (last()), forward (next()), backwards (previous()), to the boundary preceding a certain position (preceding()) and following a certain position (following()) and to obtain the current position (current()). next() can also be used to advance or recede an arbitrary number of positions. BreakIterator also exposes other native methods: getAvailableLocales(), getLocale() and factory methods to build several predefined types of BreakIterators: createWordInstance() for word boundaries, createCharacterInstance() for locale dependent notions of "characters", createSentenceInstance() for sentences, createLineInstance() and createTitleInstance() -- for title casing breaks. These factories currently return RuleBasedbreakIterators where the names of the rule sets are found in the ICU data, observing the passed locale (although the locale is taken into considering there are very few exceptions to the root rules). The clone and compare_object PHP object handlers are also implemented, though the comparison does not yield meaningful results when used with >, <, >= and <=. Note that BreakIterator is an iterator only in the sense of the first 'Iterator' in 'IteratorIterator', i.e., it does not implement the Iterator interface. The reason is that there is no sensible implementation for Iterator::key(). Using it for an ordinal of the current boundary is not feasible because we are allowed to move to any boundary at any time. It we were to determine the current ordinal when last() is called we'd have to traverse the whole input text to find out how many breaks there were before. Therefore, BreakIterator implements only Traversable. It can be wrapped in an IteratorIterator, but the usual warnings apply. Finally, I added a convenience method to BreakIterator: getPartsIterator(). This provides an IntlIterator, backed by the BreakIterator PHP object (i.e. moving the pointer or changing the text in BreakIterator affects the iterator and also moving the iterator affects the backing BreakIterator), which allows traversing the text between each boundary. This iterator uses the original text to retrieve the text between two positions, not the code points returned by the wrapping UText. Therefore, if the text includes invalid code unit sequences, these invalid sequences will be in the output of this iterator, not U+FFFD code points. The class RuleBasedIterator exposes a constructor that allows building an iterator from arbitrary compiled or non-compiled rules. The form of these rules in described in the tutorial linked above. The rest of the methods allow retrieving the rules -- getRules() and getCompiledRules() --, a hash code of the rule set (hashCode()) and the rules statuses (getRuleStatus() and getRuleStatusVec()). Because the RuleBasedBreakIterator constructor may return parse errors, I reuse the UParseError to text function that was in the transliterator files. Therefore, I move that function to intl_error.c. common_enum.cpp was also changed, mainly to expose previously static functions. This avoided code duplication when implementing the BreakIterator iterator and the IntlIterator returned by BreakIterator::getPartsIterator().
2012-05-31 18:11:44 +08:00
UErrorCode status;
int any = 0;
assert( pe != NULL );
smart_str_appends( &ret, "parse error " );
if( pe->line > 0 )
{
smart_str_appends( &ret, "on line " );
2014-08-26 01:24:55 +08:00
smart_str_append_long( &ret, (zend_long ) pe->line );
BreakIterator and RuleBasedBreakiterator added This commit adds wrappers for the classes BreakIterator and RuleBasedbreakIterator. The C++ ICU classes are described here: <http://icu-project.org/apiref/icu4c/classBreakIterator.html> <http://icu-project.org/apiref/icu4c/classRuleBasedBreakIterator.html> Additionally, a tutorial is available at: <http://userguide.icu-project.org/boundaryanalysis> This implementation wraps UTF-8 text in a UText. The text is iterated without any copying or conversion to UTF-16. There is also no validation that the input is actually UTF-8; where there are malformed sequences, the UText will simply U+FFFD. The class BreakIterator cannot be instantiated directly (has a private constructor). It provides the interface exposed by the ICU abstract class with the same name. The PHP class is not abstract because we may use it to wrap native subclasses of BreakIterator that we don't know how to wrap. This class includes methods to move the iterator position to the beginning (first()), to the end (last()), forward (next()), backwards (previous()), to the boundary preceding a certain position (preceding()) and following a certain position (following()) and to obtain the current position (current()). next() can also be used to advance or recede an arbitrary number of positions. BreakIterator also exposes other native methods: getAvailableLocales(), getLocale() and factory methods to build several predefined types of BreakIterators: createWordInstance() for word boundaries, createCharacterInstance() for locale dependent notions of "characters", createSentenceInstance() for sentences, createLineInstance() and createTitleInstance() -- for title casing breaks. These factories currently return RuleBasedbreakIterators where the names of the rule sets are found in the ICU data, observing the passed locale (although the locale is taken into considering there are very few exceptions to the root rules). The clone and compare_object PHP object handlers are also implemented, though the comparison does not yield meaningful results when used with >, <, >= and <=. Note that BreakIterator is an iterator only in the sense of the first 'Iterator' in 'IteratorIterator', i.e., it does not implement the Iterator interface. The reason is that there is no sensible implementation for Iterator::key(). Using it for an ordinal of the current boundary is not feasible because we are allowed to move to any boundary at any time. It we were to determine the current ordinal when last() is called we'd have to traverse the whole input text to find out how many breaks there were before. Therefore, BreakIterator implements only Traversable. It can be wrapped in an IteratorIterator, but the usual warnings apply. Finally, I added a convenience method to BreakIterator: getPartsIterator(). This provides an IntlIterator, backed by the BreakIterator PHP object (i.e. moving the pointer or changing the text in BreakIterator affects the iterator and also moving the iterator affects the backing BreakIterator), which allows traversing the text between each boundary. This iterator uses the original text to retrieve the text between two positions, not the code points returned by the wrapping UText. Therefore, if the text includes invalid code unit sequences, these invalid sequences will be in the output of this iterator, not U+FFFD code points. The class RuleBasedIterator exposes a constructor that allows building an iterator from arbitrary compiled or non-compiled rules. The form of these rules in described in the tutorial linked above. The rest of the methods allow retrieving the rules -- getRules() and getCompiledRules() --, a hash code of the rule set (hashCode()) and the rules statuses (getRuleStatus() and getRuleStatusVec()). Because the RuleBasedBreakIterator constructor may return parse errors, I reuse the UParseError to text function that was in the transliterator files. Therefore, I move that function to intl_error.c. common_enum.cpp was also changed, mainly to expose previously static functions. This avoided code duplication when implementing the BreakIterator iterator and the IntlIterator returned by BreakIterator::getPartsIterator().
2012-05-31 18:11:44 +08:00
any = 1;
}
if( pe->offset >= 0 ) {
if( any )
smart_str_appends( &ret, ", " );
else
smart_str_appends( &ret, "at " );
smart_str_appends( &ret, "offset " );
2015-01-03 17:22:58 +08:00
smart_str_append_long( &ret, (zend_long ) pe->offset );
BreakIterator and RuleBasedBreakiterator added This commit adds wrappers for the classes BreakIterator and RuleBasedbreakIterator. The C++ ICU classes are described here: <http://icu-project.org/apiref/icu4c/classBreakIterator.html> <http://icu-project.org/apiref/icu4c/classRuleBasedBreakIterator.html> Additionally, a tutorial is available at: <http://userguide.icu-project.org/boundaryanalysis> This implementation wraps UTF-8 text in a UText. The text is iterated without any copying or conversion to UTF-16. There is also no validation that the input is actually UTF-8; where there are malformed sequences, the UText will simply U+FFFD. The class BreakIterator cannot be instantiated directly (has a private constructor). It provides the interface exposed by the ICU abstract class with the same name. The PHP class is not abstract because we may use it to wrap native subclasses of BreakIterator that we don't know how to wrap. This class includes methods to move the iterator position to the beginning (first()), to the end (last()), forward (next()), backwards (previous()), to the boundary preceding a certain position (preceding()) and following a certain position (following()) and to obtain the current position (current()). next() can also be used to advance or recede an arbitrary number of positions. BreakIterator also exposes other native methods: getAvailableLocales(), getLocale() and factory methods to build several predefined types of BreakIterators: createWordInstance() for word boundaries, createCharacterInstance() for locale dependent notions of "characters", createSentenceInstance() for sentences, createLineInstance() and createTitleInstance() -- for title casing breaks. These factories currently return RuleBasedbreakIterators where the names of the rule sets are found in the ICU data, observing the passed locale (although the locale is taken into considering there are very few exceptions to the root rules). The clone and compare_object PHP object handlers are also implemented, though the comparison does not yield meaningful results when used with >, <, >= and <=. Note that BreakIterator is an iterator only in the sense of the first 'Iterator' in 'IteratorIterator', i.e., it does not implement the Iterator interface. The reason is that there is no sensible implementation for Iterator::key(). Using it for an ordinal of the current boundary is not feasible because we are allowed to move to any boundary at any time. It we were to determine the current ordinal when last() is called we'd have to traverse the whole input text to find out how many breaks there were before. Therefore, BreakIterator implements only Traversable. It can be wrapped in an IteratorIterator, but the usual warnings apply. Finally, I added a convenience method to BreakIterator: getPartsIterator(). This provides an IntlIterator, backed by the BreakIterator PHP object (i.e. moving the pointer or changing the text in BreakIterator affects the iterator and also moving the iterator affects the backing BreakIterator), which allows traversing the text between each boundary. This iterator uses the original text to retrieve the text between two positions, not the code points returned by the wrapping UText. Therefore, if the text includes invalid code unit sequences, these invalid sequences will be in the output of this iterator, not U+FFFD code points. The class RuleBasedIterator exposes a constructor that allows building an iterator from arbitrary compiled or non-compiled rules. The form of these rules in described in the tutorial linked above. The rest of the methods allow retrieving the rules -- getRules() and getCompiledRules() --, a hash code of the rule set (hashCode()) and the rules statuses (getRuleStatus() and getRuleStatusVec()). Because the RuleBasedBreakIterator constructor may return parse errors, I reuse the UParseError to text function that was in the transliterator files. Therefore, I move that function to intl_error.c. common_enum.cpp was also changed, mainly to expose previously static functions. This avoided code duplication when implementing the BreakIterator iterator and the IntlIterator returned by BreakIterator::getPartsIterator().
2012-05-31 18:11:44 +08:00
any = 1;
}
if (pe->preContext[0] != 0 ) {
if( any )
smart_str_appends( &ret, ", " );
smart_str_appends( &ret, "after \"" );
intl_convert_utf16_to_utf8( &buf, &u8len, pe->preContext, -1, &status );
if( U_FAILURE( status ) )
{
smart_str_appends( &ret, "(could not convert parser error pre-context to UTF-8)" );
}
else {
smart_str_appendl( &ret, buf, u8len );
efree( buf );
}
smart_str_appends( &ret, "\"" );
any = 1;
}
if( pe->postContext[0] != 0 )
{
if( any )
smart_str_appends( &ret, ", " );
smart_str_appends( &ret, "before or at \"" );
intl_convert_utf16_to_utf8( &buf, &u8len, pe->postContext, -1, &status );
if( U_FAILURE( status ) )
{
smart_str_appends( &ret, "(could not convert parser error post-context to UTF-8)" );
}
else
{
smart_str_appendl( &ret, buf, u8len );
efree( buf );
}
smart_str_appends( &ret, "\"" );
any = 1;
}
if( !any )
{
smart_str_free( &ret );
smart_str_appends( &ret, "no parse error" );
}
2015-01-03 17:22:58 +08:00
BreakIterator and RuleBasedBreakiterator added This commit adds wrappers for the classes BreakIterator and RuleBasedbreakIterator. The C++ ICU classes are described here: <http://icu-project.org/apiref/icu4c/classBreakIterator.html> <http://icu-project.org/apiref/icu4c/classRuleBasedBreakIterator.html> Additionally, a tutorial is available at: <http://userguide.icu-project.org/boundaryanalysis> This implementation wraps UTF-8 text in a UText. The text is iterated without any copying or conversion to UTF-16. There is also no validation that the input is actually UTF-8; where there are malformed sequences, the UText will simply U+FFFD. The class BreakIterator cannot be instantiated directly (has a private constructor). It provides the interface exposed by the ICU abstract class with the same name. The PHP class is not abstract because we may use it to wrap native subclasses of BreakIterator that we don't know how to wrap. This class includes methods to move the iterator position to the beginning (first()), to the end (last()), forward (next()), backwards (previous()), to the boundary preceding a certain position (preceding()) and following a certain position (following()) and to obtain the current position (current()). next() can also be used to advance or recede an arbitrary number of positions. BreakIterator also exposes other native methods: getAvailableLocales(), getLocale() and factory methods to build several predefined types of BreakIterators: createWordInstance() for word boundaries, createCharacterInstance() for locale dependent notions of "characters", createSentenceInstance() for sentences, createLineInstance() and createTitleInstance() -- for title casing breaks. These factories currently return RuleBasedbreakIterators where the names of the rule sets are found in the ICU data, observing the passed locale (although the locale is taken into considering there are very few exceptions to the root rules). The clone and compare_object PHP object handlers are also implemented, though the comparison does not yield meaningful results when used with >, <, >= and <=. Note that BreakIterator is an iterator only in the sense of the first 'Iterator' in 'IteratorIterator', i.e., it does not implement the Iterator interface. The reason is that there is no sensible implementation for Iterator::key(). Using it for an ordinal of the current boundary is not feasible because we are allowed to move to any boundary at any time. It we were to determine the current ordinal when last() is called we'd have to traverse the whole input text to find out how many breaks there were before. Therefore, BreakIterator implements only Traversable. It can be wrapped in an IteratorIterator, but the usual warnings apply. Finally, I added a convenience method to BreakIterator: getPartsIterator(). This provides an IntlIterator, backed by the BreakIterator PHP object (i.e. moving the pointer or changing the text in BreakIterator affects the iterator and also moving the iterator affects the backing BreakIterator), which allows traversing the text between each boundary. This iterator uses the original text to retrieve the text between two positions, not the code points returned by the wrapping UText. Therefore, if the text includes invalid code unit sequences, these invalid sequences will be in the output of this iterator, not U+FFFD code points. The class RuleBasedIterator exposes a constructor that allows building an iterator from arbitrary compiled or non-compiled rules. The form of these rules in described in the tutorial linked above. The rest of the methods allow retrieving the rules -- getRules() and getCompiledRules() --, a hash code of the rule set (hashCode()) and the rules statuses (getRuleStatus() and getRuleStatusVec()). Because the RuleBasedBreakIterator constructor may return parse errors, I reuse the UParseError to text function that was in the transliterator files. Therefore, I move that function to intl_error.c. common_enum.cpp was also changed, mainly to expose previously static functions. This avoided code duplication when implementing the BreakIterator iterator and the IntlIterator returned by BreakIterator::getPartsIterator().
2012-05-31 18:11:44 +08:00
smart_str_0( &ret );
return ret;
}
2008-07-08 06:51:04 +08:00
/*
* Local variables:
* tab-width: 4
* c-basic-offset: 4
* End:
* vim600: noet sw=4 ts=4 fdm=marker
* vim<600: noet sw=4 ts=4
*/