Upgraded bundled PCRE to version 8.10

Ilia Alshanetsky 2010-07-02 17:17:16 +00:00
parent 8584b90199
commit ef22824315
27 changed files with 4213 additions and 1252 deletions

NEWS

@ -1,8 +1,8 @@
PHP NEWS
PHP NEWS
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
?? ??? 201?, PHP 5.3.99
- Upgraded bundled sqlite to version 3.6.23.1. (Ilia)
- Upgraded bundled PCRE to version 8.02. (Ilia)
- Upgraded bundled PCRE to version 8.10. (Ilia)
- Added caches to eliminate repeatable run-time bindings of functions, classes,
constants, methods and properties (Dmitry)


@ -1,6 +1,101 @@
ChangeLog for PCRE
------------------
Version 8.10 25-Jun-2010
------------------------
1. Added support for (*MARK:ARG) and for ARG additions to PRUNE, SKIP, and
THEN.
2. (*ACCEPT) was not working when inside an atomic group.
3. Inside a character class, \B is treated as a literal by default, but
faulted if PCRE_EXTRA is set. This mimics Perl's behaviour (the -w option
causes the error). The code is unchanged, but I tidied the documentation.
4. Inside a character class, PCRE always treated \R and \X as literals,
whereas Perl faults them if its -w option is set. I have changed PCRE so
that it faults them when PCRE_EXTRA is set.
5. Added support for \N, which always matches any character other than
newline. (It is the same as "." when PCRE_DOTALL is not set.)
6. When compiling pcregrep with newer versions of gcc which may have
FORTIFY_SOURCE set, several warnings "ignoring return value of 'fwrite',
declared with attribute warn_unused_result" were given. Just casting the
result to (void) does not stop the warnings; a more elaborate fudge is
needed. I've used a macro to implement this.
7. Minor change to pcretest.c to avoid a compiler warning.
8. Added four artificial Unicode properties to help with an option to make
\s etc use properties (see next item). The new properties are: Xan
(alphanumeric), Xsp (Perl space), Xps (POSIX space), and Xwd (word).
9. Added PCRE_UCP to make \b, \d, \s, \w, and certain POSIX character classes
use Unicode properties. (*UCP) at the start of a pattern can be used to set
this option. Modified pcretest to add /W to test this facility. Added
REG_UCP to make it available via the POSIX interface.
10. Added --line-buffered to pcregrep.
11. In UTF-8 mode, if a pattern that was compiled with PCRE_CASELESS was
studied, and the match started with a letter with a code point greater than
127 whose first byte was different to the first byte of the other case of
the letter, the other case of this starting letter was not recognized
(#976).
12. If a pattern that was studied started with a repeated Unicode property
test, for example, \p{Nd}+, there was the theoretical possibility of
setting up an incorrect bitmap of starting bytes, but fortunately it could
not have actually happened in practice until change 8 above was made (it
added property types that matched character-matching opcodes).
13. pcre_study() now recognizes \h, \v, and \R when constructing a bit map of
possible starting bytes for non-anchored patterns.
14. Extended the "auto-possessify" feature of pcre_compile(). It now recognizes
\R, and also a number of cases that involve Unicode properties, both
explicit and implicit when PCRE_UCP is set.
15. If a repeated Unicode property match (e.g. \p{Lu}*) was used with non-UTF-8
input, it could crash or give wrong results if characters with values
greater than 0xc0 were present in the subject string. (Detail: it assumed
UTF-8 input when processing these items.)
16. Added a lot of (int) casts to avoid compiler warnings in systems where
size_t is 64-bit (#991).
17. Added a check for running out of memory when PCRE is compiled with
--disable-stack-for-recursion (#990).
18. If the last data line in a file for pcretest does not have a newline on
the end, a newline was missing in the output.
19. The default pcre_chartables.c file recognizes only ASCII characters (values
less than 128) in its various bitmaps. However, there is a facility for
generating tables according to the current locale when PCRE is compiled. It
turns out that in some environments, 0x85 and 0xa0, which are Unicode space
characters, are recognized by isspace() and therefore were getting set in
these tables, and indeed these tables seem to approximate to ISO 8859. This
caused a problem in UTF-8 mode when pcre_study() was used to create a list
of bytes that can start a match. For \s, it was including 0x85 and 0xa0,
which of course cannot start UTF-8 characters. I have changed the code so
that only real ASCII characters (less than 128) and the correct starting
bytes for UTF-8 encodings are set for characters greater than 127 when in
UTF-8 mode. (When PCRE_UCP is set - see 9 above - the code is different
altogether.)
20. Added the /T option to pcretest so as to be able to run tests with non-
standard character tables, thus making it possible to include the tests
used for 19 above in the standard set of tests.
21. A pattern such as (?&t)(?#()(?(DEFINE)(?<t>a)) which has a forward
reference to a subpattern the other side of a comment that contains an
opening parenthesis caused either an internal compiling error, or a
reference to the wrong subpattern.
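
Item 19 rests on a property of UTF-8: bytes in the range 0x80 to 0xBF are continuation bytes, so 0x85 and 0xa0 can never begin a character, and U+0085 and U+00A0 are both encoded with 0xC2 as their first byte. The following is a minimal sketch of that point, not PCRE code; utf8_encode() and utf8_is_continuation() are illustrative names that do not appear in the library.

```c
#include <assert.h>
#include <stddef.h>

/* Encode code point cp (assumed <= 0x10FFFF) as UTF-8 into buf, which must
   hold at least 4 bytes. Returns the number of bytes written. This is an
   illustrative helper, not PCRE's _pcre_ord2utf8(). */
size_t utf8_encode(unsigned int cp, unsigned char *buf)
{
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    buf[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {
    buf[0] = (unsigned char)(0xC0 | (cp >> 6));
    buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    buf[0] = (unsigned char)(0xE0 | (cp >> 12));
    buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  buf[0] = (unsigned char)(0xF0 | (cp >> 18));
  buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
  buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
  buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
  return 4;
}

/* A byte of the form 10xxxxxx is a continuation byte; it can never be the
   first byte of a UTF-8 character. */
int utf8_is_continuation(unsigned char b)
{
  return (b & 0xC0) == 0x80;
}
```

Under these assumptions, a starting-byte list for \s in UTF-8 mode would need 0xC2 rather than 0x85 or 0xa0.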
Version 8.02 19-Mar-2010
------------------------


@ -1,6 +1,17 @@
News about PCRE releases
------------------------
Release 8.10 25-Jun-2010
------------------------
There are two major additions: support for (*MARK) and friends, and the option
PCRE_UCP, which changes the behaviour of \b, \d, \s, and \w (and their
opposites) so that they make use of Unicode properties. There are also a number
of lesser new features, and several bugs have been fixed. A new option,
--line-buffered, has been added to pcregrep, for use when it is connected to
pipes.
Release 8.02 19-Mar-2010
------------------------


@ -188,9 +188,9 @@ significantly slower when this is done. There is more about stack usage in the
LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
If you want to statically link a program against a PCRE library in the form of
a non-dll .a file, you must define PCRE_STATIC before including pcre.h,
otherwise the pcre_malloc() and pcre_free() exported functions will be declared
__declspec(dllimport), with unwanted results.
a non-dll .a file, you must define PCRE_STATIC before including pcre.h or
pcrecpp.h, otherwise the pcre_malloc() and pcre_free() exported functions will
be declared __declspec(dllimport), with unwanted results.
CALLING CONVENTIONS IN WINDOWS ENVIRONMENTS
@ -497,5 +497,5 @@ build.log file in the root of the package also.
=========================
Last Updated: 19 January 2010
Last Updated: 26 May 2010
****


@ -271,13 +271,16 @@ them both to 0; an emulation function will be used. */
#define PACKAGE_NAME "PCRE"
/* Define to the full name and version of this package. */
#define PACKAGE_STRING "PCRE 8.02"
#define PACKAGE_STRING "PCRE 8.10"
/* Define to the one symbol short name of this package. */
#define PACKAGE_TARNAME "pcre"
/* Define to the home page for this package. */
#define PACKAGE_URL ""
/* Define to the version of this package. */
#define PACKAGE_VERSION "8.02"
#define PACKAGE_VERSION "8.10"
/* If you are compiling for a system other than a Unix-like system or
@ -333,7 +336,7 @@ them both to 0; an emulation function will be used. */
/* Version number of package */
#ifndef VERSION
#define VERSION "8.02"
#define VERSION "8.10"
#endif
/* Define to empty if `const' does not conform to ANSI C. */

File diff suppressed because it is too large


@ -5,7 +5,7 @@
/* This is the public header file for the PCRE library, to be #included by
applications that call the PCRE functions.
Copyright (c) 1997-2009 University of Cambridge
Copyright (c) 1997-2010 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
/* The current PCRE version information. */
#define PCRE_MAJOR 8
#define PCRE_MINOR 02
#define PCRE_MINOR 10
#define PCRE_PRERELEASE
#define PCRE_DATE 2010-03-19
#define PCRE_DATE 2010-06-25
/* When an application links to a PCRE DLL in Windows, the symbols that are
imported have to be identified as such. When building PCRE, the appropriate
@ -131,6 +131,7 @@ both, so we keep them all distinct. */
#define PCRE_NO_START_OPTIMISE 0x04000000
#define PCRE_PARTIAL_HARD 0x08000000
#define PCRE_NOTEMPTY_ATSTART 0x10000000
#define PCRE_UCP 0x20000000
/* Exec-time and get/set-time error codes */
@ -200,6 +201,7 @@ these bits, just add new ones on the end, in order to remain compatible. */
#define PCRE_EXTRA_CALLOUT_DATA 0x0004
#define PCRE_EXTRA_TABLES 0x0008
#define PCRE_EXTRA_MATCH_LIMIT_RECURSION 0x0010
#define PCRE_EXTRA_MARK 0x0020
/* Types */
@ -225,6 +227,7 @@ typedef struct pcre_extra {
void *callout_data; /* Data passed back in callouts */
const unsigned char *tables; /* Pointer to character tables */
unsigned long int match_limit_recursion; /* Max recursive calls to match() */
unsigned char **mark; /* For passing back a mark pointer */
} pcre_extra;
/* The structure for passing out data via the pcre_callout_function. We use a


@ -14,7 +14,7 @@ example ISO-8859-1. When dftables is run, it creates these tables in the
current locale. If PCRE is configured with --enable-rebuild-chartables, this
happens automatically.
The following #includes are present because without the gcc 4.x may remove the
The following #includes are present because without them gcc 4.x may remove the
array definition from the final binary if PCRE is built into a static library
and dead code stripping is activated. This leads to link errors. Pulling in the
header ensures that the array gets flagged as "someone outside this compilation

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -192,9 +192,7 @@ stdint.h is available, include it; it may define INT64_MAX. Systems that do not
have stdint.h (e.g. Solaris) may have inttypes.h. The macro int64_t may be set
by "configure". */
#ifdef PHP_WIN32
#include "win32/php_stdint.h"
#elif HAVE_STDINT_H
#if HAVE_STDINT_H
#include <stdint.h>
#elif HAVE_INTTYPES_H
#include <inttypes.h>
@ -477,7 +475,8 @@ know we are in UTF-8 mode. */
} \
}
/* Get the next character, testing for UTF-8 mode, and advancing the pointer */
/* Get the next character, testing for UTF-8 mode, and advancing the pointer.
This is called when we don't know if we are in UTF-8 mode. */
#define GETCHARINCTEST(c, eptr) \
c = *eptr++; \
@ -514,7 +513,7 @@ if there are extra bytes. This is called when we know we are in UTF-8 mode. */
/* Get the next UTF-8 character, testing for UTF-8 mode, not advancing the
pointer, incrementing length if there are extra bytes. This is called when we
know we are in UTF-8 mode. */
do not know if we are in UTF-8 mode. */
#define GETCHARLENTEST(c, eptr, len) \
c = *eptr; \
@ -582,7 +581,7 @@ time, run time, or study time, respectively. */
PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8| \
PCRE_NO_AUTO_CAPTURE|PCRE_NO_UTF8_CHECK|PCRE_AUTO_CALLOUT|PCRE_FIRSTLINE| \
PCRE_DUPNAMES|PCRE_NEWLINE_BITS|PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE| \
PCRE_JAVASCRIPT_COMPAT)
PCRE_JAVASCRIPT_COMPAT|PCRE_UCP)
#define PUBLIC_EXEC_OPTIONS \
(PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY|PCRE_NOTEMPTY_ATSTART| \
@ -877,6 +876,7 @@ so that PCRE works on both ASCII and EBCDIC platforms, in non-UTF-mode only. */
#define STRING_COMMIT0 "COMMIT\0"
#define STRING_F0 "F\0"
#define STRING_FAIL0 "FAIL\0"
#define STRING_MARK0 "MARK\0"
#define STRING_PRUNE0 "PRUNE\0"
#define STRING_SKIP0 "SKIP\0"
#define STRING_THEN "THEN"
@ -906,6 +906,7 @@ so that PCRE works on both ASCII and EBCDIC platforms, in non-UTF-mode only. */
#define STRING_BSR_ANYCRLF_RIGHTPAR "BSR_ANYCRLF)"
#define STRING_BSR_UNICODE_RIGHTPAR "BSR_UNICODE)"
#define STRING_UTF8_RIGHTPAR "UTF8)"
#define STRING_UCP_RIGHTPAR "UCP)"
#else /* SUPPORT_UTF8 */
@ -1129,6 +1130,7 @@ only. */
#define STRING_COMMIT0 STR_C STR_O STR_M STR_M STR_I STR_T "\0"
#define STRING_F0 STR_F "\0"
#define STRING_FAIL0 STR_F STR_A STR_I STR_L "\0"
#define STRING_MARK0 STR_M STR_A STR_R STR_K "\0"
#define STRING_PRUNE0 STR_P STR_R STR_U STR_N STR_E "\0"
#define STRING_SKIP0 STR_S STR_K STR_I STR_P "\0"
#define STRING_THEN STR_T STR_H STR_E STR_N
@ -1158,6 +1160,7 @@ only. */
#define STRING_BSR_ANYCRLF_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_BSR_UNICODE_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
#define STRING_UTF8_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
#endif /* SUPPORT_UTF8 */
@ -1190,9 +1193,13 @@ only. */
#define PT_ANY 0 /* Any property - matches all chars */
#define PT_LAMP 1 /* L& - the union of Lu, Ll, Lt */
#define PT_GC 2 /* General characteristic (e.g. L) */
#define PT_PC 3 /* Particular characteristic (e.g. Lu) */
#define PT_GC 2 /* Specified general characteristic (e.g. L) */
#define PT_PC 3 /* Specified particular characteristic (e.g. Lu) */
#define PT_SC 4 /* Script (e.g. Han) */
#define PT_ALNUM 5 /* Alphanumeric - the union of L and N */
#define PT_SPACE 6 /* Perl space - Z plus 9,10,12,13 */
#define PT_PXSPACE 7 /* POSIX space - Z plus 9,10,11,12,13 */
#define PT_WORD 8 /* Word - L plus N plus underscore */
/* Flag bits and data types for the extended class (OP_XCLASS) for classes that
contain UTF-8 characters with values greater than 255. */
@ -1209,9 +1216,15 @@ contain UTF-8 characters with values greater than 255. */
/* These are escaped items that aren't just an encoding of a particular data
value such as \n. They must have non-zero values, as check_escape() returns
their negation. Also, they must appear in the same order as in the opcode
definitions below, up to ESC_z. There's a dummy for OP_ANY because it
corresponds to "." rather than an escape sequence, and another for OP_ALLANY
(which is used for [^] in JavaScript compatibility mode).
definitions below, up to ESC_z. There's a dummy for OP_ALLANY because it
corresponds to "." in DOTALL mode rather than an escape sequence. It is also
used for [^] in JavaScript compatibility mode. In non-DOTALL mode, "." behaves
like \N.
The special values ESC_DU, ESC_du, etc. are used instead of ESC_D, ESC_d, etc.
when PCRE_UCP is set, when replacement of \d etc by \p sequences is required.
They must be contiguous, and remain in order so that the replacements can be
looked up from a table.
The final escape must be ESC_REF as subsequent values are used for
backreferences (\1, \2, \3, etc). There are two tests in the code for an escape
@ -1221,11 +1234,12 @@ put in between that don't consume a character, that code will have to change.
*/
enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
ESC_W, ESC_w, ESC_dum1, ESC_dum2, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H,
ESC_h, ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_g, ESC_k,
ESC_W, ESC_w, ESC_N, ESC_dum, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H,
ESC_h, ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z,
ESC_E, ESC_Q, ESC_g, ESC_k,
ESC_DU, ESC_du, ESC_SU, ESC_su, ESC_WU, ESC_wu,
ESC_REF };
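
The contiguity requirement for ESC_DU through ESC_wu suggests a lookup of the form table[escape - ESC_DU]. A rough sketch follows, assuming a trimmed enum with made-up numbering and a replacement table whose name (ucp_substitutes) and contents are guesses based on ChangeLog items 8 and 9; the actual table is not visible in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Trimmed stand-in for the enum above. Only the relative order of the
   ESC_DU .. ESC_wu block matters; the numeric values are made up. */
enum { ESC_k = 1, ESC_DU, ESC_du, ESC_SU, ESC_su, ESC_WU, ESC_wu, ESC_REF };

/* Hypothetical replacement table, indexed by (escape - ESC_DU). The entries
   are an assumption, not copied from PCRE. */
static const char *const ucp_substitutes[] = {
  "\\P{Nd}",  "\\p{Nd}",    /* \D, \d */
  "\\P{Xsp}", "\\p{Xsp}",   /* \S, \s */
  "\\P{Xwd}", "\\p{Xwd}"    /* \W, \w */
};

/* Return the \p or \P sequence for a UCP escape, or NULL if esc is not one
   of the six special values. */
const char *ucp_substitute(int esc)
{
  if (esc < ESC_DU || esc > ESC_wu) return NULL;
  return ucp_substitutes[esc - ESC_DU];
}
```

Because the index is a plain subtraction, inserting a new escape between ESC_DU and ESC_wu, or reordering them, would silently shift every replacement; that is presumably why the comment insists on contiguity and order.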
/* Opcode table: Starting from 1 (i.e. after OP_END), the values up to
OP_EOD must correspond in order to the list of escapes immediately above.
@ -1249,8 +1263,8 @@ enum {
OP_WHITESPACE, /* 9 \s */
OP_NOT_WORDCHAR, /* 10 \W */
OP_WORDCHAR, /* 11 \w */
OP_ANY, /* 12 Match any character (subject to DOTALL) */
OP_ALLANY, /* 13 Match any character (not subject to DOTALL) */
OP_ANY, /* 12 Match any character except newline */
OP_ALLANY, /* 13 Match any character */
OP_ANYBYTE, /* 14 Match any byte (\C); different to OP_ANY for UTF-8 */
OP_NOTPROP, /* 15 \P (not Unicode property) */
OP_PROP, /* 16 \p (Unicode property) */
@ -1380,20 +1394,24 @@ enum {
/* These are backtracking control verbs */
OP_PRUNE, /* 107 */
OP_SKIP, /* 108 */
OP_THEN, /* 109 */
OP_COMMIT, /* 110 */
OP_MARK, /* 107 always has an argument */
OP_PRUNE, /* 108 */
OP_PRUNE_ARG, /* 109 same, but with argument */
OP_SKIP, /* 110 */
OP_SKIP_ARG, /* 111 same, but with argument */
OP_THEN, /* 112 */
OP_THEN_ARG, /* 113 same, but with argument */
OP_COMMIT, /* 114 */
/* These are forced failure and success verbs */
OP_FAIL, /* 111 */
OP_ACCEPT, /* 112 */
OP_CLOSE, /* 113 Used before OP_ACCEPT to close open captures */
OP_FAIL, /* 115 */
OP_ACCEPT, /* 116 */
OP_CLOSE, /* 117 Used before OP_ACCEPT to close open captures */
/* This is used to skip a subpattern with a {0} quantifier */
OP_SKIPZERO, /* 114 */
OP_SKIPZERO, /* 118 */
/* This is not an opcode, but is used to check that tables indexed by opcode
are the correct length, in order to catch updating errors - there have been
@ -1404,7 +1422,7 @@ enum {
/* *** NOTE NOTE NOTE *** Whenever the list above is updated, the two macro
definitions that follow must also be updated to match. There are also tables
called "coptable" cna "poptable" in pcre_dfa_exec.c that must be updated. */
called "coptable" and "poptable" in pcre_dfa_exec.c that must be updated. */
/* This macro defines textual names for all the opcodes. These are used only
@ -1429,7 +1447,8 @@ for debugging. The macro is referenced only in pcre_printint.c. */
"Once", "Bra", "CBra", "Cond", "SBra", "SCBra", "SCond", \
"Cond ref", "Cond nref", "Cond rec", "Cond nrec", "Cond def", \
"Brazero", "Braminzero", \
"*PRUNE", "*SKIP", "*THEN", "*COMMIT", "*FAIL", "*ACCEPT", \
"*MARK", "*PRUNE", "*PRUNE", "*SKIP", "*SKIP", \
"*THEN", "*THEN", "*COMMIT", "*FAIL", "*ACCEPT", \
"Close", "Skip zero"
@ -1495,8 +1514,9 @@ in UTF-8 mode. The code that uses this table must know about such things. */
3, 3, /* RREF, NRREF */ \
1, /* DEF */ \
1, 1, /* BRAZERO, BRAMINZERO */ \
1, 1, 1, 1, /* PRUNE, SKIP, THEN, COMMIT, */ \
1, 1, 3, 1 /* FAIL, ACCEPT, CLOSE, SKIPZERO */
3, 1, 3, /* MARK, PRUNE, PRUNE_ARG, */ \
1, 3, 1, 3, /* SKIP, SKIP_ARG, THEN, THEN_ARG, */ \
1, 1, 1, 3, 1 /* COMMIT, FAIL, ACCEPT, CLOSE, SKIPZERO */
/* A magic value for OP_RREF and OP_NRREF to indicate the "any recursion"
@ -1514,7 +1534,7 @@ enum { ERR0, ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9,
ERR30, ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39,
ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERRCOUNT };
ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERRCOUNT };
/* The real format of the start of the pcre block; the index of names and the
code vector run on as long as necessary after the end. We store an explicit
@ -1657,6 +1677,7 @@ typedef struct match_data {
BOOL noteol; /* NOTEOL flag */
BOOL utf8; /* UTF8 flag */
BOOL jscript_compat; /* JAVASCRIPT_COMPAT flag */
BOOL use_ucp; /* PCRE_UCP flag */
BOOL endonly; /* Dollar not before final \n */
BOOL notempty; /* Empty string match not wanted */
BOOL notempty_atstart; /* Empty string match at start not wanted */
@ -1676,6 +1697,7 @@ typedef struct match_data {
int eptrn; /* Next free eptrblock */
recursion_info *recursive; /* Linked list of recursion data */
void *callout_data; /* To pass back to callouts */
const uschar *mark; /* Mark pointer to pass back */
} match_data;
/* A similar structure is used for the same purpose by the DFA matching


@ -534,6 +534,14 @@ for(;;)
}
break;
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
case OP_THEN_ARG:
fprintf(f, " %s %s", OP_names[*code], code + 2);
extra += code[1];
break;
/* Anything else is just an item with no data */
default:
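
The printing code above implies a layout for the verbs that carry an argument: the opcode byte, a one-byte name length in code[1], then the name starting at code + 2 and terminated by a zero byte (it is printed with %s). A small sketch of walking over such an item, assuming that layout and using made-up opcode numbers and a three-entry length table rather than PCRE's own:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Made-up opcode numbers for the sketch; PCRE's real values differ. */
enum { OP_END = 0, OP_MARK = 1, OP_COMMIT = 2 };

/* Fixed part of each item: MARK is 3 (opcode, length byte, final zero). */
static const unsigned char op_lengths[] = { 1, 3, 1 };

/* Number of bytes occupied by the item starting at code. */
size_t item_size(const unsigned char *code)
{
  size_t n = op_lengths[code[0]];
  if (code[0] == OP_MARK) n += code[1];   /* add in the name length */
  return n;
}

/* Pointer to the zero-terminated name of a MARK item. */
const char *mark_name(const unsigned char *code)
{
  assert(code[0] == OP_MARK);
  return (const char *)(code + 2);
}
```

For an item encoded as { OP_MARK, 2, 'A', 'B', 0 }, item_size() gives 5, so the next opcode is found at code + 5.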


@ -46,6 +46,7 @@ supporting functions. */
#include "pcre_internal.h"
#define SET_BIT(c) start_bits[c/8] |= (1 << (c&7))
/* Returns from set_start_bits() */
@ -411,6 +412,15 @@ for (;;)
#endif
break;
/* Skip these, but we need to add in the name length. */
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
case OP_THEN_ARG:
cc += _pcre_OP_lengths[op] + cc[1];
break;
/* For the record, these are the opcodes that are matched by "default":
OP_ACCEPT, OP_CLOSE, OP_COMMIT, OP_FAIL, OP_PRUNE, OP_SET_SOM, OP_SKIP,
OP_THEN. */
@ -429,25 +439,121 @@ for (;;)
* Set a bit and maybe its alternate case *
*************************************************/
/* Given a character, set its bit in the table, and also the bit for the other
version of a letter if we are caseless.
/* Given a character, set its first byte's bit in the table, and also the
corresponding bit for the other version of a letter if we are caseless. In
UTF-8 mode, for characters greater than 127, we can only do the caseless thing
when Unicode property support is available.
Arguments:
start_bits points to the bit map
c is the character
p points to the character
caseless the caseless flag
cd the block with char table pointers
utf8 TRUE for UTF-8 mode
Returns: nothing
Returns: pointer after the character
*/
static const uschar *
set_table_bit(uschar *start_bits, const uschar *p, BOOL caseless,
compile_data *cd, BOOL utf8)
{
unsigned int c = *p;
SET_BIT(c);
#ifdef SUPPORT_UTF8
if (utf8 && c > 127)
{
GETCHARINC(c, p);
#ifdef SUPPORT_UCP
if (caseless)
{
uschar buff[8];
c = UCD_OTHERCASE(c);
(void)_pcre_ord2utf8(c, buff);
SET_BIT(buff[0]);
}
#endif
return p;
}
#endif
/* Not UTF-8 mode, or character is not greater than 127. */
if (caseless && (cd->ctypes[c] & ctype_letter) != 0) SET_BIT(cd->fcc[c]);
return p + 1;
}
/*************************************************
* Set bits for a positive character type *
*************************************************/
/* This function sets starting bits for a character type. In UTF-8 mode, we can
only do a direct setting for bytes less than 128, as otherwise there can be
confusion with bytes in the middle of UTF-8 characters. In a "traditional"
environment, the tables will only recognize ASCII characters anyway, but in at
least one Windows environment, bits for some higher bytes were set in the tables.
So we deal with that case by considering the UTF-8 encoding.
Arguments:
start_bits the starting bitmap
cbit_type the type of character wanted
table_limit 32 for non-UTF-8; 16 for UTF-8
cd the block with char table pointers
Returns: nothing
*/
static void
set_table_bit(uschar *start_bits, unsigned int c, BOOL caseless,
set_type_bits(uschar *start_bits, int cbit_type, int table_limit,
compile_data *cd)
{
start_bits[c/8] |= (1 << (c&7));
if (caseless && (cd->ctypes[c] & ctype_letter) != 0)
start_bits[cd->fcc[c]/8] |= (1 << (cd->fcc[c]&7));
register int c;
for (c = 0; c < table_limit; c++) start_bits[c] |= cd->cbits[c+cbit_type];
if (table_limit == 32) return;
for (c = 128; c < 256; c++)
{
if ((cd->cbits[c/8] & (1 << (c&7))) != 0)
{
uschar buff[8];
(void)_pcre_ord2utf8(c, buff);
SET_BIT(buff[0]);
}
}
}
/*************************************************
* Set bits for a negative character type *
*************************************************/
/* This function sets starting bits for a negative character type such as \D.
In UTF-8 mode, we can only do a direct setting for bytes less than 128, as
otherwise there can be confusion with bytes in the middle of UTF-8 characters.
Unlike in the positive case, where we can set appropriate starting bits for
specific high-valued UTF-8 characters, in this case we have to set the bits for
all high-valued characters. The lowest is 0xc2, but we overkill by starting at
0xc0 (192) for simplicity.
Arguments:
start_bits the starting bitmap
cbit_type the type of character wanted
table_limit 32 for non-UTF-8; 16 for UTF-8
cd the block with char table pointers
Returns: nothing
*/
static void
set_nottype_bits(uschar *start_bits, int cbit_type, int table_limit,
compile_data *cd)
{
register int c;
for (c = 0; c < table_limit; c++) start_bits[c] |= ~cd->cbits[c+cbit_type];
if (table_limit != 32) for (c = 24; c < 32; c++) start_bits[c] = 0xff;
}
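
To make the effect of table_limit concrete, here is a self-contained sketch of the negative case on a 256-bit (32-byte) starting-byte map. It mirrors the shape of set_nottype_bits() but is not the PCRE function: make_digit_map(), set_not_class(), and the TEST_BIT macro are hypothetical, and the digit map stands in for cd->cbits + cbit_digit.

```c
#include <assert.h>
#include <string.h>

#define SET_BIT(map, c)  ((map)[(c) / 8] |= (unsigned char)(1u << ((c) & 7)))
#define TEST_BIT(map, c) (((map)[(c) / 8] >> ((c) & 7)) & 1)

/* Build a class bitmap containing only the ASCII digits. */
void make_digit_map(unsigned char map[32])
{
  int c;
  memset(map, 0, 32);
  for (c = '0'; c <= '9'; c++) SET_BIT(map, c);
}

/* OR the complement of class_map into start_bits. With table_limit 16
   (UTF-8), only bytes below 128 are taken from the table; every byte from
   0xc0 upwards is then allowed as a possible start of a character. */
void set_not_class(unsigned char start_bits[32],
                   const unsigned char class_map[32], int table_limit)
{
  int c;
  for (c = 0; c < table_limit; c++)
    start_bits[c] |= (unsigned char)~class_map[c];
  if (table_limit != 32)
    for (c = 24; c < 32; c++) start_bits[c] = 0xff;
}
```

With table_limit set to 16, map bytes 16 to 23 (values 0x80 to 0xBF) are left untouched, so continuation bytes are never marked as possible starting bytes.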
@ -482,6 +588,7 @@ set_start_bits(const uschar *code, uschar *start_bits, BOOL caseless,
{
register int c;
int yield = SSB_DONE;
int table_limit = utf8? 16:32;
#if 0
/* ========================================================================= */
@ -605,12 +712,7 @@ do
case OP_QUERY:
case OP_MINQUERY:
case OP_POSQUERY:
set_table_bit(start_bits, tcode[1], caseless, cd);
tcode += 2;
#ifdef SUPPORT_UTF8
if (utf8 && tcode[-1] >= 0xc0)
tcode += _pcre_utf8_table4[tcode[-1] & 0x3f];
#endif
tcode = set_table_bit(start_bits, tcode + 1, caseless, cd, utf8);
break;
/* Single-char upto sets the bit and tries the next */
@ -618,12 +720,7 @@ do
case OP_UPTO:
case OP_MINUPTO:
case OP_POSUPTO:
set_table_bit(start_bits, tcode[3], caseless, cd);
tcode += 4;
#ifdef SUPPORT_UTF8
if (utf8 && tcode[-1] >= 0xc0)
tcode += _pcre_utf8_table4[tcode[-1] & 0x3f];
#endif
tcode = set_table_bit(start_bits, tcode + 3, caseless, cd, utf8);
break;
/* At least one single char sets the bit and stops */
@ -636,59 +733,86 @@ do
case OP_PLUS:
case OP_MINPLUS:
case OP_POSPLUS:
set_table_bit(start_bits, tcode[1], caseless, cd);
(void)set_table_bit(start_bits, tcode + 1, caseless, cd, utf8);
try_next = FALSE;
break;
/* Single character type sets the bits and stops */
/* Special spacing and line-terminating items. These recognize specific
lists of characters. The difference between VSPACE and ANYNL is that the
latter can match the two-character CRLF sequence, but that is not
relevant for finding the first character, so their code here is
identical. */
case OP_HSPACE:
SET_BIT(0x09);
SET_BIT(0x20);
if (utf8)
{
SET_BIT(0xC2); /* For U+00A0 */
SET_BIT(0xE1); /* For U+1680, U+180E */
SET_BIT(0xE2); /* For U+2000 - U+200A, U+202F, U+205F */
SET_BIT(0xE3); /* For U+3000 */
}
else SET_BIT(0xA0);
try_next = FALSE;
break;
case OP_ANYNL:
case OP_VSPACE:
SET_BIT(0x0A);
SET_BIT(0x0B);
SET_BIT(0x0C);
SET_BIT(0x0D);
if (utf8)
{
SET_BIT(0xC2); /* For U+0085 */
SET_BIT(0xE2); /* For U+2028, U+2029 */
}
else SET_BIT(0x85);
try_next = FALSE;
break;
/* Single character types set the bits and stop. Note that if PCRE_UCP
is set, we do not see these op codes because \d etc are converted to
properties. Therefore, these apply in the case when only characters less
than 256 are recognized to match the types. */
case OP_NOT_DIGIT:
for (c = 0; c < 32; c++)
start_bits[c] |= ~cd->cbits[c+cbit_digit];
set_nottype_bits(start_bits, cbit_digit, table_limit, cd);
try_next = FALSE;
break;
case OP_DIGIT:
for (c = 0; c < 32; c++)
start_bits[c] |= cd->cbits[c+cbit_digit];
set_type_bits(start_bits, cbit_digit, table_limit, cd);
try_next = FALSE;
break;
/* The cbit_space table has vertical tab as whitespace; we have to
discard it. */
ensure it is set as not whitespace. */
case OP_NOT_WHITESPACE:
for (c = 0; c < 32; c++)
{
int d = cd->cbits[c+cbit_space];
if (c == 1) d &= ~0x08;
start_bits[c] |= ~d;
}
set_nottype_bits(start_bits, cbit_space, table_limit, cd);
start_bits[1] |= 0x08;
try_next = FALSE;
break;
/* The cbit_space table has vertical tab as whitespace; we have to
discard it. */
not set it from the table. */
case OP_WHITESPACE:
for (c = 0; c < 32; c++)
{
int d = cd->cbits[c+cbit_space];
if (c == 1) d &= ~0x08;
start_bits[c] |= d;
}
c = start_bits[1]; /* Save in case it was already set */
set_type_bits(start_bits, cbit_space, table_limit, cd);
start_bits[1] = (start_bits[1] & ~0x08) | c;
try_next = FALSE;
break;
case OP_NOT_WORDCHAR:
for (c = 0; c < 32; c++)
start_bits[c] |= ~cd->cbits[c+cbit_word];
set_nottype_bits(start_bits, cbit_word, table_limit, cd);
try_next = FALSE;
break;
case OP_WORDCHAR:
for (c = 0; c < 32; c++)
start_bits[c] |= cd->cbits[c+cbit_word];
set_type_bits(start_bits, cbit_word, table_limit, cd);
try_next = FALSE;
break;
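
The 0x08 mask in the whitespace cases is the bit for vertical tab: VT is 0x0B, which lives in byte 0x0B / 8 = 1 of the bitmap at bit 0x0B & 7 = 3. A minimal sketch of the save-and-restore step used for OP_WHITESPACE, reduced to a single byte; merge_space_byte1() is a hypothetical helper, not a PCRE function:

```c
#include <assert.h>

/* Merge the class table's byte 1 into the starting bitmap's byte 1 without
   letting the class contribute the VT bit (0x08). If VT was already set in
   the bitmap by an earlier item, it stays set. */
unsigned char merge_space_byte1(unsigned char start_byte1,
                                unsigned char class_byte1)
{
  unsigned char saved = start_byte1;        /* save in case VT was set */
  start_byte1 |= class_byte1;               /* what set_type_bits does */
  return (unsigned char)((start_byte1 & ~0x08) | saved);
}
```

For a class byte of 0x3E (HT, NL, VT, FF, CR) merged into an empty byte, the result is 0x36, that is, everything except VT.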
@ -697,6 +821,7 @@ do
case OP_TYPEPLUS:
case OP_TYPEMINPLUS:
case OP_TYPEPOSPLUS:
tcode++;
break;
@ -720,52 +845,69 @@ do
case OP_TYPEPOSQUERY:
switch(tcode[1])
{
default:
case OP_ANY:
case OP_ALLANY:
return SSB_FAIL;
case OP_HSPACE:
SET_BIT(0x09);
SET_BIT(0x20);
if (utf8)
{
SET_BIT(0xC2); /* For U+00A0 */
SET_BIT(0xE1); /* For U+1680, U+180E */
SET_BIT(0xE2); /* For U+2000 - U+200A, U+202F, U+205F */
SET_BIT(0xE3); /* For U+3000 */
}
else SET_BIT(0xA0);
break;
case OP_ANYNL:
case OP_VSPACE:
SET_BIT(0x0A);
SET_BIT(0x0B);
SET_BIT(0x0C);
SET_BIT(0x0D);
if (utf8)
{
SET_BIT(0xC2); /* For U+0085 */
SET_BIT(0xE2); /* For U+2028, U+2029 */
}
else SET_BIT(0x85);
break;
case OP_NOT_DIGIT:
for (c = 0; c < 32; c++)
start_bits[c] |= ~cd->cbits[c+cbit_digit];
set_nottype_bits(start_bits, cbit_digit, table_limit, cd);
break;
case OP_DIGIT:
for (c = 0; c < 32; c++)
start_bits[c] |= cd->cbits[c+cbit_digit];
set_type_bits(start_bits, cbit_digit, table_limit, cd);
break;
/* The cbit_space table has vertical tab as whitespace; we have to
discard it. */
ensure it gets set as not whitespace. */
case OP_NOT_WHITESPACE:
for (c = 0; c < 32; c++)
{
int d = cd->cbits[c+cbit_space];
if (c == 1) d &= ~0x08;
start_bits[c] |= ~d;
}
set_nottype_bits(start_bits, cbit_space, table_limit, cd);
start_bits[1] |= 0x08;
break;
/* The cbit_space table has vertical tab as whitespace; we have to
discard it. */
avoid setting it. */
case OP_WHITESPACE:
for (c = 0; c < 32; c++)
{
int d = cd->cbits[c+cbit_space];
if (c == 1) d &= ~0x08;
start_bits[c] |= d;
}
c = start_bits[1]; /* Save in case it was already set */
set_type_bits(start_bits, cbit_space, table_limit, cd);
start_bits[1] = (start_bits[1] & ~0x08) | c;
break;
case OP_NOT_WORDCHAR:
for (c = 0; c < 32; c++)
start_bits[c] |= ~cd->cbits[c+cbit_word];
set_nottype_bits(start_bits, cbit_word, table_limit, cd);
break;
case OP_WORDCHAR:
for (c = 0; c < 32; c++)
start_bits[c] |= cd->cbits[c+cbit_word];
set_type_bits(start_bits, cbit_word, table_limit, cd);
break;
}


@ -241,6 +241,10 @@ strings to make sure that UTF-8 support works on EBCDIC platforms. */
#define STRING_Tifinagh0 STR_T STR_i STR_f STR_i STR_n STR_a STR_g STR_h "\0"
#define STRING_Ugaritic0 STR_U STR_g STR_a STR_r STR_i STR_t STR_i STR_c "\0"
#define STRING_Vai0 STR_V STR_a STR_i "\0"
#define STRING_Xan0 STR_X STR_a STR_n "\0"
#define STRING_Xps0 STR_X STR_p STR_s "\0"
#define STRING_Xsp0 STR_X STR_s STR_p "\0"
#define STRING_Xwd0 STR_X STR_w STR_d "\0"
#define STRING_Yi0 STR_Y STR_i "\0"
#define STRING_Z0 STR_Z "\0"
#define STRING_Zl0 STR_Z STR_l "\0"
@ -374,6 +378,10 @@ const char _pcre_utt_names[] =
STRING_Tifinagh0
STRING_Ugaritic0
STRING_Vai0
STRING_Xan0
STRING_Xps0
STRING_Xsp0
STRING_Xwd0
STRING_Yi0
STRING_Z0
STRING_Zl0
@ -507,11 +515,15 @@ const ucp_type_table _pcre_utt[] = {
{ 891, PT_SC, ucp_Tifinagh },
{ 900, PT_SC, ucp_Ugaritic },
{ 909, PT_SC, ucp_Vai },
{ 913, PT_SC, ucp_Yi },
{ 916, PT_GC, ucp_Z },
{ 918, PT_PC, ucp_Zl },
{ 921, PT_PC, ucp_Zp },
{ 924, PT_PC, ucp_Zs }
{ 913, PT_ALNUM, 0 },
{ 917, PT_PXSPACE, 0 },
{ 921, PT_SPACE, 0 },
{ 925, PT_WORD, 0 },
{ 929, PT_SC, ucp_Yi },
{ 932, PT_GC, ucp_Z },
{ 934, PT_PC, ucp_Zl },
{ 937, PT_PC, ucp_Zp },
{ 940, PT_PC, ucp_Zs }
};
const int _pcre_utt_size = sizeof(_pcre_utt)/sizeof(ucp_type_table);
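
The first field of each _pcre_utt entry is an offset into the concatenated names string, which is why adding four 4-byte names ("Xan\0" and so on) moves Yi from 913 to 929 and shifts every later entry by 16. A short sketch that recomputes the offsets shown above, taking the offset of Vai (909) from the diff; utt_names_tail and name_offset() are illustrative names only:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tail of a names table in the style of _pcre_utt_names: zero-terminated
   names stored back to back. */
static const char utt_names_tail[] =
  "Vai\0" "Xan\0" "Xps\0" "Xsp\0" "Xwd\0" "Yi\0" "Z\0" "Zl\0" "Zp\0" "Zs\0";

/* Offset of the idx-th name, given that entry 0 sits at offset base in the
   full table. Each name occupies strlen(name) + 1 bytes. */
size_t name_offset(const char *table, size_t idx, size_t base)
{
  size_t off = 0;
  while (idx-- > 0) off += strlen(table + off) + 1;
  return base + off;
}
```

Calling name_offset(utt_names_tail, i, 909) for i = 1 to 9 yields 913, 917, 921, 925, 929, 932, 934, 937, and 940, matching the new table entries.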


@ -6,7 +6,7 @@
and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
Copyright (c) 1997-2009 University of Cambridge
Copyright (c) 1997-2010 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -110,12 +110,13 @@ while ((t = *data++) != XCL_END)
break;
case PT_LAMP:
if ((prop->chartype == ucp_Lu || prop->chartype == ucp_Ll || prop->chartype == ucp_Lt) ==
(t == XCL_PROP)) return !negated;
if ((prop->chartype == ucp_Lu || prop->chartype == ucp_Ll ||
prop->chartype == ucp_Lt) == (t == XCL_PROP)) return !negated;
break;
case PT_GC:
if ((data[1] == _pcre_ucp_gentype[prop->chartype]) == (t == XCL_PROP)) return !negated;
if ((data[1] == _pcre_ucp_gentype[prop->chartype]) == (t == XCL_PROP))
return !negated;
break;
case PT_PC:
@ -126,6 +127,33 @@ while ((t = *data++) != XCL_END)
if ((data[1] == prop->script) == (t == XCL_PROP)) return !negated;
break;
case PT_ALNUM:
if ((_pcre_ucp_gentype[prop->chartype] == ucp_L ||
_pcre_ucp_gentype[prop->chartype] == ucp_N) == (t == XCL_PROP))
return !negated;
break;
case PT_SPACE: /* Perl space */
if ((_pcre_ucp_gentype[prop->chartype] == ucp_Z ||
c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR)
== (t == XCL_PROP))
return !negated;
break;
case PT_PXSPACE: /* POSIX space */
if ((_pcre_ucp_gentype[prop->chartype] == ucp_Z ||
c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
c == CHAR_FF || c == CHAR_CR) == (t == XCL_PROP))
return !negated;
break;
case PT_WORD:
if ((_pcre_ucp_gentype[prop->chartype] == ucp_L ||
_pcre_ucp_gentype[prop->chartype] == ucp_N || c == CHAR_UNDERSCORE)
== (t == XCL_PROP))
return !negated;
break;
/* This should never occur, but compilers may mutter if there is no
default. */
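
The four new cases differ only in which extra characters they admit on top of a general category. The sketch below restates them as standalone predicates. It is not the PCRE code: `sketch_gentype` is an ASCII-only stand-in for the `_pcre_ucp_gentype[prop->chartype]` lookup, and every `sketch_`/`SK_` name is hypothetical.

```c
/* Hypothetical stand-ins for PCRE's general-category codes. */
enum sketch_gc { SK_L, SK_N, SK_Z, SK_OTHER };

/* ASCII-only stand-in for _pcre_ucp_gentype[prop->chartype]; the real code
   looks the category up in the Unicode property tables. */
static enum sketch_gc sketch_gentype(int c)
{
if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) return SK_L;
if (c >= '0' && c <= '9') return SK_N;
if (c == ' ') return SK_Z;
return SK_OTHER;
}

/* Xan: letter or number. */
static int sketch_is_xan(int c)
{
enum sketch_gc gc = sketch_gentype(c);
return gc == SK_L || gc == SK_N;
}

/* Xsp (Perl space): separator, or HT, NL, FF, CR. VT is excluded. */
static int sketch_is_xsp(int c)
{
return sketch_gentype(c) == SK_Z ||
       c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Xps (POSIX space): as Xsp, plus VT. */
static int sketch_is_xps(int c)
{
return sketch_is_xsp(c) || c == '\v';
}

/* Xwd: as Xan, plus underscore. */
static int sketch_is_xwd(int c)
{
return sketch_is_xan(c) || c == '_';
}
```

The only difference between `Xsp` and `Xps` is the vertical tab, and the only difference between `Xan` and `Xwd` is the underscore, which is what the `\x{0b}` and `_` subjects in the new test cases exercise.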


@ -6,7 +6,7 @@
and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
Copyright (c) 1997-2009 University of Cambridge
Copyright (c) 1997-2010 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@ -55,6 +55,11 @@ previously been set. */
# define PCREPOSIX_EXP_DEFN __declspec(dllexport)
#endif
/* We include pcre.h before pcre_internal.h so that the PCRE library functions
are declared as "import" for Windows by defining PCRE_EXP_DECL as "import".
This is needed even though pcre_internal.h itself includes pcre.h, because it
does so after it has set PCRE_EXP_DECL to "export" if it is not already set. */
#include "pcre.h"
#include "pcre_internal.h"
#include "pcreposix.h"
@ -133,7 +138,7 @@ static const int eint[] = {
REG_INVARG, /* inconsistent NEWLINE options */
REG_BADPAT, /* \g is not followed by an (optionally braced) non-zero number */
REG_BADPAT, /* a numbered reference must not be zero */
REG_BADPAT, /* (*VERB) with an argument is not supported */
REG_BADPAT, /* an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT) */
/* 60 */
REG_BADPAT, /* (*VERB) not recognized */
REG_BADPAT, /* number is too big */
@ -141,7 +146,9 @@ static const int eint[] = {
REG_BADPAT, /* digit expected after (?+ */
REG_BADPAT, /* ] is an invalid data character in JavaScript compatibility mode */
/* 65 */
REG_BADPAT /* different names for subpatterns of the same number are not allowed */
REG_BADPAT, /* different names for subpatterns of the same number are not allowed */
REG_BADPAT, /* (*MARK) must have an argument */
REG_INVARG, /* this version of PCRE is not compiled with PCRE_UCP support */
};
/* Table of texts corresponding to POSIX error codes */
@ -245,6 +252,7 @@ if ((cflags & REG_NEWLINE) != 0) options |= PCRE_MULTILINE;
if ((cflags & REG_DOTALL) != 0) options |= PCRE_DOTALL;
if ((cflags & REG_NOSUB) != 0) options |= PCRE_NO_AUTO_CAPTURE;
if ((cflags & REG_UTF8) != 0) options |= PCRE_UTF8;
if ((cflags & REG_UCP) != 0) options |= PCRE_UCP;
if ((cflags & REG_UNGREEDY) != 0) options |= PCRE_UNGREEDY;
preg->re_pcre = pcre_compile2(pattern, options, &errorcode, &errorptr,
@ -334,13 +342,13 @@ if ((eflags & REG_STARTEND) != 0)
else
{
so = 0;
eo = strlen(string);
eo = (int)strlen(string);
}
rc = pcre_exec((const pcre *)preg->re_pcre, NULL, string + so, (eo - so),
0, options, ovector, nmatch * 3);
0, options, ovector, (int)(nmatch * 3));
if (rc == 0) rc = nmatch; /* All captured slots were filled in */
if (rc == 0) rc = (int)nmatch; /* All captured slots were filled in */
/* Successful match */
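
The new `REG_UCP` line follows the same pattern as the neighbouring flags: each POSIX-side bit in `cflags` is tested and, if set, the matching PCRE option is OR-ed into `options`. A minimal sketch of that mapping follows. The `SKETCH_REG_*` values are the ones shown in the `pcreposix.h` hunk of this commit, but the `SKETCH_PCRE_*` values are placeholders (the real `PCRE_*` bits are defined in `pcre.h`), and `sketch_map_cflags` is a hypothetical helper, not a function in the wrapper.

```c
/* POSIX-wrapper flag bits, as listed in the pcreposix.h hunk. */
#define SKETCH_REG_UNGREEDY 0x0200
#define SKETCH_REG_UCP      0x0400

/* Placeholder option bits; assumptions for illustration only. */
#define SKETCH_PCRE_UNGREEDY 0x1
#define SKETCH_PCRE_UCP      0x2

/* Translate wrapper compile flags into (placeholder) PCRE options. */
static int sketch_map_cflags(int cflags)
{
int options = 0;
if ((cflags & SKETCH_REG_UCP) != 0) options |= SKETCH_PCRE_UCP;
if ((cflags & SKETCH_REG_UNGREEDY) != 0) options |= SKETCH_PCRE_UNGREEDY;
return options;
}
```

Bits that have no mapping are ignored, so passing only unrelated flags yields no options.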


@ -62,6 +62,7 @@ extern "C" {
#define REG_STARTEND 0x0080 /* BSD feature: pass subject string by so,eo */
#define REG_NOTEMPTY 0x0100 /* NOT defined by POSIX; maps to PCRE_NOTEMPTY */
#define REG_UNGREEDY 0x0200 /* NOT defined by POSIX; maps to PCRE_UNGREEDY */
#define REG_UCP 0x0400 /* NOT defined by POSIX; maps to PCRE_UCP */
/* This is not used by PCRE, but by defining it we make it easier
to slot PCRE into existing programs that make POSIX calls. */


@ -1,7 +1,8 @@
/-- These are a few representative patterns whose lengths and offsets are to be
shown when the link size is 2. This is just a doublecheck test to ensure the
sizes don't go horribly wrong when something is changed. The pattern contents
are all themselves checked in other tests. --/
are all themselves checked in other tests. Unicode, including property support,
is required for these tests. --/
/((?i)b)/BM
@ -121,4 +122,14 @@ are all themselves checked in other tests. --/
/[^\xaa]/8BM
/[^\d]/8WB
/[[:^alpha:][:^cntrl:]]+/8WB
/[[:^cntrl:][:^alpha:]]+/8WB
/[[:alpha:]]+/8WB
/[[:^alpha:]\S]+/8WB
/-- End of testinput10 --/


@ -2,12 +2,12 @@
of PCRE's API, error diagnostics, and the compiled code of some patterns.
It also checks the non-Perl syntax that PCRE supports (Python, .NET,
Oniguruma). Finally, there are some tests where PCRE and Perl differ,
either because PCRE can't be compatible, or there is potential Perl
either because PCRE can't be compatible, or there is a possible Perl
bug. --/
/-- Originally, the Perl 5.10 things were in here too, but now I have separated
many (most?) of them out into test 11. However, there may still be some
that were overlooked. --/
/-- Originally, the Perl 5.10 and 5.11 things were in here too, but now I have
separated many (most?) of them out into test 11. However, there may still
be some that were overlooked. --/
/(a)b|/I
@ -51,6 +51,16 @@
/(?X)[\B]/
/(?X)[\R]/
/(?X)[\X]/
/[\B]/BZ
/[\R]/BZ
/[\X]/BZ
/[z-a]/
/^*/
@ -2279,8 +2289,6 @@ a random value. /Ix
/a+b?(*THEN)c+(*FAIL)/C
aaabccc
/a(*PRUNE:XXX)b/
/a(*MARK)b/
/(?i:A{1,}\6666666666)/
@ -3232,4 +3240,255 @@ a random value. /Ix
/(?P<L1>(?P<L2>0|)|(?P>L2)(?P>L1))/
/abc(*MARK:)pqr/
/abc(*:)pqr/
/abc(*FAIL:123)xyz/
/--- This should, and does, fail. In Perl, it does not, which I think is a
bug because replacing the B in the pattern by (B|D) does make it fail. ---/
/A(*COMMIT)B/+K
ACABX
/--- These should be different, but in Perl 5.11 are not, which I think
is a bug in Perl. ---/
/A(*THEN)B|A(*THEN)C/K
AC
/A(*PRUNE)B|A(*PRUNE)C/K
AC
/--- A whole lot of tests of verbs with arguments are here rather than in test
11 because Perl doesn't seem to follow its specification entirely
correctly. ---/
/--- Perl 5.11 sets $REGERROR on the AC failure case here; PCRE does not. It is
not clear how Perl defines "involved in the failure of the match". ---/
/^(A(*THEN:A)B|C(*THEN:B)D)/K
AB
CD
** Failers
AC
CB
/--- Check the use of names for success and failure. PCRE doesn't show these
names for success, though Perl does, contrary to its spec. ---/
/^(A(*PRUNE:A)B|C(*PRUNE:B)D)/K
AB
CD
** Failers
AC
CB
/--- An empty name does not pass back an empty string. It is the same as if no
name were given. ---/
/^(A(*PRUNE:)B|C(*PRUNE:B)D)/K
AB
CD
/--- PRUNE goes to next bumpalong; COMMIT does not. ---/
/A(*PRUNE:A)B/K
ACAB
/(*MARK:A)(*PRUNE:B)(C|X)/K
C
D
/(*MARK:A)(*THEN:B)(C|X)/K
C
D
/--- This should fail, as the skip causes a bump to offset 3 (the skip) ---/
/A(*MARK:A)A+(*SKIP)(B|Z) | AC/xK
AAAC
/--- Same --/
/A(*MARK:A)A+(*MARK:B)(*SKIP:B)(B|Z) | AC/xK
AAAC
/--- This should fail; the SKIP advances by one, but when we get to AC, the
PRUNE kills it. ---/
/A(*PRUNE:A)A+(*SKIP:A)(B|Z) | AC/xK
AAAC
/A(*:A)A+(*SKIP)(B|Z) | AC/xK
AAAC
/--- This should fail, as a null name is the same as no name ---/
/A(*MARK:A)A+(*SKIP:)(B|Z) | AC/xK
AAAC
/--- This fails in PCRE, and I think that is in accordance with Perl's
documentation, though in Perl it succeeds. ---/
/A(*MARK:A)A+(*SKIP:B)(B|Z) | AAC/xK
AAAC
/--- Mark names can be duplicated ---/
/A(*:A)B|X(*:A)Y/K
AABC
XXYZ
/^A(*:A)B|^X(*:A)Y/K
** Failers
XAQQ
/--- A check on what happens after hitting a mark and then bumping along to
something that does not even start. Perl reports tags after the failures here,
though it does not when the individual letters are made into something
more complicated. ---/
/A(*:A)B|XX(*:B)Y/K
AABC
XXYZ
** Failers
XAQQ
XAQQXZZ
AXQQQ
AXXQQQ
/--- COMMIT at the start of a pattern should be the same as an anchor. Perl
optimizations defeat this. So does the PCRE optimization unless we disable it
with \Y. ---/
/(*COMMIT)ABC/
ABCDEFG
** Failers
DEFGABC\Y
/--- Repeat some tests with added studying. ---/
/A(*COMMIT)B/+KS
ACABX
/A(*THEN)B|A(*THEN)C/KS
AC
/A(*PRUNE)B|A(*PRUNE)C/KS
AC
/^(A(*THEN:A)B|C(*THEN:B)D)/KS
AB
CD
** Failers
AC
CB
/^(A(*PRUNE:A)B|C(*PRUNE:B)D)/KS
AB
CD
** Failers
AC
CB
/^(A(*PRUNE:)B|C(*PRUNE:B)D)/KS
AB
CD
/A(*PRUNE:A)B/KS
ACAB
/(*MARK:A)(*PRUNE:B)(C|X)/KS
C
D
/(*MARK:A)(*THEN:B)(C|X)/KS
C
D
/A(*MARK:A)A+(*SKIP)(B|Z) | AC/xKS
AAAC
/A(*MARK:A)A+(*MARK:B)(*SKIP:B)(B|Z) | AC/xKS
AAAC
/A(*PRUNE:A)A+(*SKIP:A)(B|Z) | AC/xKS
AAAC
/A(*:A)A+(*SKIP)(B|Z) | AC/xKS
AAAC
/A(*MARK:A)A+(*SKIP:)(B|Z) | AC/xKS
AAAC
/A(*MARK:A)A+(*SKIP:B)(B|Z) | AAC/xKS
AAAC
/A(*:A)B|XX(*:B)Y/KS
AABC
XXYZ
** Failers
XAQQ
XAQQXZZ
AXQQQ
AXXQQQ
/(*COMMIT)ABC/
ABCDEFG
** Failers
DEFGABC\Y
/^(ab (c+(*THEN)cd) | xyz)/x
abcccd
/^(ab (c+(*PRUNE)cd) | xyz)/x
abcccd
/^(ab (c+(*FAIL)cd) | xyz)/x
abcccd
/--- Perl 5.11 gets some of these wrong ---/
/(?>.(*ACCEPT))*?5/
abcde
/(.(*ACCEPT))*?5/
abcde
/(.(*ACCEPT))5/
abcde
/(.(*ACCEPT))*5/
abcde
/A\NB./BZ
ACBD
** Failers
A\nB
ACB\n
/A\NB./sBZ
ACBD
ACB\n
** Failers
A\nB
/A\NB/<crlf>
A\nB
A\rB
** Failers
A\r\nB
/\R+b/BZ
/\R+\n/BZ
/\R+\d/BZ
/\d*\R/BZ
/\s*\R/BZ
/-- End of testinput2 --/


@ -745,4 +745,53 @@ can't tell the difference.) --/
/X\W{3}X/8
\PX
/\h/SI
/\h/SI8
ABC\x{09}
ABC\x{20}
ABC\x{a0}
ABC\x{1680}
ABC\x{180e}
ABC\x{2000}
ABC\x{202f}
ABC\x{205f}
ABC\x{3000}
/\v/SI
/\v/SI8
ABC\x{0a}
ABC\x{0b}
ABC\x{0c}
ABC\x{0d}
ABC\x{85}
ABC\x{2028}
/\R/SI
/\R/SI8
/\h*A/SI8
CDBABC
/\v+A/SI8
/\s?xxx\s/8SI
/\sxxx\s/8T1
AB\x{85}xxx\x{a0}XYZ
AB\x{a0}xxx\x{85}XYZ
/\sxxx\s/I8ST1
AB\x{85}xxx\x{a0}XYZ
AB\x{a0}xxx\x{85}XYZ
/\S \S/8T1
\x{a2} \x{84}
/\S \S/I8ST1
\x{a2} \x{84}
A Z
/-- End of testinput5 --/


@ -752,4 +752,54 @@
/\p{Avestan}\p{Bamum}\p{Egyptian_Hieroglyphs}\p{Imperial_Aramaic}\p{Inscriptional_Pahlavi}\p{Inscriptional_Parthian}\p{Javanese}\p{Kaithi}\p{Lisu}\p{Meetei_Mayek}\p{Old_South_Arabian}\p{Old_Turkic}\p{Samaritan}\p{Tai_Tham}\p{Tai_Viet}/8
\x{10b00}\x{a6ef}\x{13007}\x{10857}\x{10b78}\x{10b58}\x{a980}\x{110c1}\x{a4ff}\x{abc0}\x{10a7d}\x{10c48}\x{0800}\x{1aad}\x{aac0}
/^\w+/8W
Az_\x{aa}\x{c0}\x{1c5}\x{2b0}\x{3b6}\x{1d7c9}\x{2fa1d}1\x{660}\x{bef}\x{16ee}
/^[[:xdigit:]]*/8W
1a\x{660}\x{bef}\x{16ee}
/^\d+/8W
1\x{660}\x{bef}\x{16ee}
/^[[:digit:]]+/8W
1\x{660}\x{bef}\x{16ee}
/^>\s+/8W
>\x{20}\x{a0}\x{1680}\x{2028}\x{2029}\x{202f}\x{9}\x{b}
/^>\pZ+/8W
>\x{20}\x{a0}\x{1680}\x{2028}\x{2029}\x{202f}\x{9}\x{b}
/^>[[:space:]]*/8W
>\x{20}\x{a0}\x{1680}\x{2028}\x{2029}\x{202f}\x{9}\x{b}
/^>[[:blank:]]*/8W
>\x{20}\x{a0}\x{1680}\x{180e}\x{2000}\x{202f}\x{9}\x{b}\x{2028}
/^[[:alpha:]]*/8W
Az\x{aa}\x{c0}\x{1c5}\x{2b0}\x{3b6}\x{1d7c9}\x{2fa1d}
/^[[:alnum:]]*/8W
Az\x{aa}\x{c0}\x{1c5}\x{2b0}\x{3b6}\x{1d7c9}\x{2fa1d}1\x{660}\x{bef}\x{16ee}
/^[[:cntrl:]]*/8W
\x{0}\x{09}\x{1f}\x{7f}\x{9f}
/^[[:graph:]]*/8W
A\x{a1}\x{a0}
/^[[:print:]]*/8W
A z\x{a0}\x{a1}
/^[[:punct:]]*/8W
.+\x{a1}\x{a0}
/\p{Zs}*?\R/
** Failers
a\xFCb
/\p{Zs}*\R/
** Failers
a\xFCb
/-- End of testinput6 --/


@ -847,4 +847,143 @@
** Failers
\x{1d79}\x{a77d}
/^\p{Xan}/8
ABCD
1234
\x{6ca}
\x{a6c}
\x{10a7}
** Failers
_ABC
/^\p{Xan}+/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
** Failers
_ABC
/^\p{Xan}*/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
/^\p{Xan}{2,9}/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
/^[\p{Xan}]/8
ABCD1234_
1234abcd_
\x{6ca}
\x{a6c}
\x{10a7}
** Failers
_ABC
/^[\p{Xan}]+/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
** Failers
_ABC
/^>\p{Xsp}/8
>\x{1680}\x{2028}\x{0b}
** Failers
\x{0b}
/^>\p{Xsp}+/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
/^>\p{Xsp}*/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
/^>\p{Xsp}{2,9}/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
/^>[\p{Xsp}]/8
>\x{2028}\x{0b}
/^>[\p{Xsp}]+/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
/^>\p{Xps}/8
>\x{1680}\x{2028}\x{0b}
>\x{a0}
** Failers
\x{0b}
/^>\p{Xps}+/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
/^>\p{Xps}+?/8
>\x{1680}\x{2028}\x{0b}
/^>\p{Xps}*/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
/^>\p{Xps}{2,9}/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
/^>\p{Xps}{2,9}?/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
/^>[\p{Xps}]/8
>\x{2028}\x{0b}
/^>[\p{Xps}]+/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
/^\p{Xwd}/8
ABCD
1234
\x{6ca}
\x{a6c}
\x{10a7}
_ABC
** Failers
[]
/^\p{Xwd}+/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
/^\p{Xwd}*/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
/^\p{Xwd}{2,9}/8
A_12\x{6ca}\x{a6c}\x{10a7}
/^[\p{Xwd}]/8
ABCD1234_
1234abcd_
\x{6ca}
\x{a6c}
\x{10a7}
_ABC
** Failers
[]
/^[\p{Xwd}]+/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
/-- Unicode properties for \b and \B --/
/\b...\B/8W
abc_
\x{37e}abc\x{376}
\x{37e}\x{376}\x{371}\x{393}\x{394}
!\x{c0}++\x{c1}\x{c2}
!\x{c0}+++++
/-- Without PCRE_UCP, non-ASCII always fail, even if < 256 --/
/\b...\B/8
abc_
** Failers
\x{37e}abc\x{376}
\x{37e}\x{376}\x{371}\x{393}\x{394}
!\x{c0}++\x{c1}\x{c2}
!\x{c0}+++++
/-- With PCRE_UCP, non-UTF8 chars that are < 256 still check properties --/
/\b...\B/W
abc_
!\x{c0}++\x{c1}\x{c2}
!\x{c0}+++++
/-- End of testinput9 --/


@ -1,7 +1,8 @@
/-- These are a few representative patterns whose lengths and offsets are to be
shown when the link size is 2. This is just a doublecheck test to ensure the
sizes don't go horribly wrong when something is changed. The pattern contents
are all themselves checked in other tests. --/
are all themselves checked in other tests. Unicode, including property support,
is required for these tests. --/
/((?i)b)/BM
Memory allocation (code space): 21
@ -666,4 +667,44 @@ Memory allocation (code space): 40
39 End
------------------------------------------------------------------
/[^\d]/8WB
------------------------------------------------------------------
0 11 Bra
3 [^\p{Nd}]
11 11 Ket
14 End
------------------------------------------------------------------
/[[:^alpha:][:^cntrl:]]+/8WB
------------------------------------------------------------------
0 44 Bra
3 [ -~\x80-\xff\P{L}]+
44 44 Ket
47 End
------------------------------------------------------------------
/[[:^cntrl:][:^alpha:]]+/8WB
------------------------------------------------------------------
0 44 Bra
3 [ -~\x80-\xff\P{L}]+
44 44 Ket
47 End
------------------------------------------------------------------
/[[:alpha:]]+/8WB
------------------------------------------------------------------
0 12 Bra
3 [\p{L}]+
12 12 Ket
15 End
------------------------------------------------------------------
/[[:^alpha:]\S]+/8WB
------------------------------------------------------------------
0 15 Bra
3 [\P{L}\P{Xsp}]+
15 15 Ket
18 End
------------------------------------------------------------------
/-- End of testinput10 --/


@ -2,12 +2,12 @@
of PCRE's API, error diagnostics, and the compiled code of some patterns.
It also checks the non-Perl syntax that PCRE supports (Python, .NET,
Oniguruma). Finally, there are some tests where PCRE and Perl differ,
either because PCRE can't be compatible, or there is potential Perl
either because PCRE can't be compatible, or there is a possible Perl
bug. --/
/-- Originally, the Perl 5.10 things were in here too, but now I have separated
many (most?) of them out into test 11. However, there may still be some
that were overlooked. --/
/-- Originally, the Perl 5.10 and 5.11 things were in here too, but now I have
separated many (most?) of them out into test 11. However, there may still
be some that were overlooked. --/
/(a)b|/I
Capturing subpattern count = 1
@ -103,6 +103,36 @@ Failed: missing terminating ] for character class at offset 5
/(?X)[\B]/
Failed: invalid escape sequence in character class at offset 6
/(?X)[\R]/
Failed: invalid escape sequence in character class at offset 6
/(?X)[\X]/
Failed: invalid escape sequence in character class at offset 6
/[\B]/BZ
------------------------------------------------------------------
Bra
B
Ket
End
------------------------------------------------------------------
/[\R]/BZ
------------------------------------------------------------------
Bra
R
Ket
End
------------------------------------------------------------------
/[\X]/BZ
------------------------------------------------------------------
Bra
X
Ket
End
------------------------------------------------------------------
/[z-a]/
Failed: range out of order in character class at offset 3
@ -3198,19 +3228,19 @@ Failed: POSIX collating elements are not supported at offset 0
Failed: POSIX named classes are supported only within a class at offset 0
/\l/I
Failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1
Failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1
/\L/I
Failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1
Failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1
/\N{name}/I
Failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1
Failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1
/\u/I
Failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1
Failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1
/\U/I
Failed: PCRE does not support \L, \l, \N, \U, or \u at offset 1
Failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1
/[/I
Failed: missing terminating ] for character class at offset 1
@ -8667,11 +8697,8 @@ No match
+13 ^ ^ (*FAIL)
No match
/a(*PRUNE:XXX)b/
Failed: (*VERB) with an argument is not supported at offset 8
/a(*MARK)b/
Failed: (*VERB) not recognized at offset 7
Failed: (*MARK) must have an argument at offset 7
/(?i:A{1,}\6666666666)/
Failed: number is too big at offset 19
@ -10668,4 +10695,435 @@ No match
/(?P<L1>(?P<L2>0|)|(?P>L2)(?P>L1))/
Failed: recursive call could loop indefinitely at offset 31
/abc(*MARK:)pqr/
Failed: (*MARK) must have an argument at offset 10
/abc(*:)pqr/
Failed: (*MARK) must have an argument at offset 6
/abc(*FAIL:123)xyz/
Failed: an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT) at offset 13
/--- This should, and does, fail. In Perl, it does not, which I think is a
bug because replacing the B in the pattern by (B|D) does make it fail. ---/
/A(*COMMIT)B/+K
ACABX
No match
/--- These should be different, but in Perl 5.11 are not, which I think
is a bug in Perl. ---/
/A(*THEN)B|A(*THEN)C/K
AC
0: AC
/A(*PRUNE)B|A(*PRUNE)C/K
AC
No match
/--- A whole lot of tests of verbs with arguments are here rather than in test
11 because Perl doesn't seem to follow its specification entirely
correctly. ---/
/--- Perl 5.11 sets $REGERROR on the AC failure case here; PCRE does not. It is
not clear how Perl defines "involved in the failure of the match". ---/
/^(A(*THEN:A)B|C(*THEN:B)D)/K
AB
0: AB
1: AB
CD
0: CD
1: CD
** Failers
No match
AC
No match
CB
No match, mark = B
/--- Check the use of names for success and failure. PCRE doesn't show these
names for success, though Perl does, contrary to its spec. ---/
/^(A(*PRUNE:A)B|C(*PRUNE:B)D)/K
AB
0: AB
1: AB
CD
0: CD
1: CD
** Failers
No match
AC
No match, mark = A
CB
No match, mark = B
/--- An empty name does not pass back an empty string. It is the same as if no
name were given. ---/
/^(A(*PRUNE:)B|C(*PRUNE:B)D)/K
AB
0: AB
1: AB
CD
0: CD
1: CD
/--- PRUNE goes to next bumpalong; COMMIT does not. ---/
/A(*PRUNE:A)B/K
ACAB
0: AB
/(*MARK:A)(*PRUNE:B)(C|X)/K
C
0: C
1: C
MK: A
D
No match, mark = B
/(*MARK:A)(*THEN:B)(C|X)/K
C
0: C
1: C
MK: A
D
No match, mark = B
/--- This should fail, as the skip causes a bump to offset 3 (the skip) ---/
/A(*MARK:A)A+(*SKIP)(B|Z) | AC/xK
AAAC
No match
/--- Same --/
/A(*MARK:A)A+(*MARK:B)(*SKIP:B)(B|Z) | AC/xK
AAAC
No match
/--- This should fail; the SKIP advances by one, but when we get to AC, the
PRUNE kills it. ---/
/A(*PRUNE:A)A+(*SKIP:A)(B|Z) | AC/xK
AAAC
No match
/A(*:A)A+(*SKIP)(B|Z) | AC/xK
AAAC
No match
/--- This should fail, as a null name is the same as no name ---/
/A(*MARK:A)A+(*SKIP:)(B|Z) | AC/xK
AAAC
No match
/--- This fails in PCRE, and I think that is in accordance with Perl's
documentation, though in Perl it succeeds. ---/
/A(*MARK:A)A+(*SKIP:B)(B|Z) | AAC/xK
AAAC
No match
/--- Mark names can be duplicated ---/
/A(*:A)B|X(*:A)Y/K
AABC
0: AB
MK: A
XXYZ
0: XY
MK: A
/^A(*:A)B|^X(*:A)Y/K
** Failers
No match
XAQQ
No match, mark = A
/--- A check on what happens after hitting a mark and then bumping along to
something that does not even start. Perl reports tags after the failures here,
though it does not when the individual letters are made into something
more complicated. ---/
/A(*:A)B|XX(*:B)Y/K
AABC
0: AB
MK: A
XXYZ
0: XXY
MK: B
** Failers
No match
XAQQ
No match
XAQQXZZ
No match
AXQQQ
No match
AXXQQQ
No match
/--- COMMIT at the start of a pattern should be the same as an anchor. Perl
optimizations defeat this. So does the PCRE optimization unless we disable it
with \Y. ---/
/(*COMMIT)ABC/
ABCDEFG
0: ABC
** Failers
No match
DEFGABC\Y
No match
/--- Repeat some tests with added studying. ---/
/A(*COMMIT)B/+KS
ACABX
No match
/A(*THEN)B|A(*THEN)C/KS
AC
0: AC
/A(*PRUNE)B|A(*PRUNE)C/KS
AC
No match
/^(A(*THEN:A)B|C(*THEN:B)D)/KS
AB
0: AB
1: AB
CD
0: CD
1: CD
** Failers
No match
AC
No match
CB
No match, mark = B
/^(A(*PRUNE:A)B|C(*PRUNE:B)D)/KS
AB
0: AB
1: AB
CD
0: CD
1: CD
** Failers
No match
AC
No match, mark = A
CB
No match, mark = B
/^(A(*PRUNE:)B|C(*PRUNE:B)D)/KS
AB
0: AB
1: AB
CD
0: CD
1: CD
/A(*PRUNE:A)B/KS
ACAB
0: AB
/(*MARK:A)(*PRUNE:B)(C|X)/KS
C
0: C
1: C
MK: A
D
No match
/(*MARK:A)(*THEN:B)(C|X)/KS
C
0: C
1: C
MK: A
D
No match
/A(*MARK:A)A+(*SKIP)(B|Z) | AC/xKS
AAAC
No match
/A(*MARK:A)A+(*MARK:B)(*SKIP:B)(B|Z) | AC/xKS
AAAC
No match
/A(*PRUNE:A)A+(*SKIP:A)(B|Z) | AC/xKS
AAAC
No match
/A(*:A)A+(*SKIP)(B|Z) | AC/xKS
AAAC
No match
/A(*MARK:A)A+(*SKIP:)(B|Z) | AC/xKS
AAAC
No match
/A(*MARK:A)A+(*SKIP:B)(B|Z) | AAC/xKS
AAAC
No match
/A(*:A)B|XX(*:B)Y/KS
AABC
0: AB
MK: A
XXYZ
0: XXY
MK: B
** Failers
No match
XAQQ
No match
XAQQXZZ
No match
AXQQQ
No match
AXXQQQ
No match
/(*COMMIT)ABC/
ABCDEFG
0: ABC
** Failers
No match
DEFGABC\Y
No match
/^(ab (c+(*THEN)cd) | xyz)/x
abcccd
No match
/^(ab (c+(*PRUNE)cd) | xyz)/x
abcccd
No match
/^(ab (c+(*FAIL)cd) | xyz)/x
abcccd
No match
/--- Perl 5.11 gets some of these wrong ---/
/(?>.(*ACCEPT))*?5/
abcde
0: a
/(.(*ACCEPT))*?5/
abcde
0: a
1: a
/(.(*ACCEPT))5/
abcde
0: a
1: a
/(.(*ACCEPT))*5/
abcde
0: a
1: a
/A\NB./BZ
------------------------------------------------------------------
Bra
A
Any
B
Any
Ket
End
------------------------------------------------------------------
ACBD
0: ACBD
** Failers
No match
A\nB
No match
ACB\n
No match
/A\NB./sBZ
------------------------------------------------------------------
Bra
A
Any
B
AllAny
Ket
End
------------------------------------------------------------------
ACBD
0: ACBD
ACB\n
0: ACB\x0a
** Failers
No match
A\nB
No match
/A\NB/<crlf>
A\nB
0: A\x0aB
A\rB
0: A\x0dB
** Failers
No match
A\r\nB
No match
/\R+b/BZ
------------------------------------------------------------------
Bra
\R++
b
Ket
End
------------------------------------------------------------------
/\R+\n/BZ
------------------------------------------------------------------
Bra
\R+
\x0a
Ket
End
------------------------------------------------------------------
/\R+\d/BZ
------------------------------------------------------------------
Bra
\R++
\d
Ket
End
------------------------------------------------------------------
/\d*\R/BZ
------------------------------------------------------------------
Bra
\d*+
\R
Ket
End
------------------------------------------------------------------
/\s*\R/BZ
------------------------------------------------------------------
Bra
\s*+
\R
Ket
End
------------------------------------------------------------------
/-- End of testinput2 --/


@ -2076,4 +2076,150 @@ Partial match: abcde
\PX
Partial match: X
/\h/SI
Capturing subpattern count = 0
No options
No first char
No need char
Subject length lower bound = 1
Starting byte set: \x09 \x20 \xa0
/\h/SI8
Capturing subpattern count = 0
Options: utf8
No first char
No need char
Subject length lower bound = 1
Starting byte set: \x09 \x20 \xc2 \xe1 \xe2 \xe3
ABC\x{09}
0: \x{09}
ABC\x{20}
0:
ABC\x{a0}
0: \x{a0}
ABC\x{1680}
0: \x{1680}
ABC\x{180e}
0: \x{180e}
ABC\x{2000}
0: \x{2000}
ABC\x{202f}
0: \x{202f}
ABC\x{205f}
0: \x{205f}
ABC\x{3000}
0: \x{3000}
/\v/SI
Capturing subpattern count = 0
No options
No first char
No need char
Subject length lower bound = 1
Starting byte set: \x0a \x0b \x0c \x0d \x85
/\v/SI8
Capturing subpattern count = 0
Options: utf8
No first char
No need char
Subject length lower bound = 1
Starting byte set: \x0a \x0b \x0c \x0d \xc2 \xe2
ABC\x{0a}
0: \x{0a}
ABC\x{0b}
0: \x{0b}
ABC\x{0c}
0: \x{0c}
ABC\x{0d}
0: \x{0d}
ABC\x{85}
0: \x{85}
ABC\x{2028}
0: \x{2028}
/\R/SI
Capturing subpattern count = 0
No options
No first char
No need char
Subject length lower bound = 2
Starting byte set: \x0a \x0b \x0c \x0d \x85
/\R/SI8
Capturing subpattern count = 0
Options: utf8
No first char
No need char
Subject length lower bound = 2
Starting byte set: \x0a \x0b \x0c \x0d \xc2 \xe2
/\h*A/SI8
Capturing subpattern count = 0
Options: utf8
No first char
Need char = 'A'
Subject length lower bound = 1
Starting byte set: \x09 \x20 A \xc2 \xe1 \xe2 \xe3
CDBABC
0: A
/\v+A/SI8
Capturing subpattern count = 0
Options: utf8
No first char
Need char = 'A'
Subject length lower bound = 2
Starting byte set: \x0a \x0b \x0c \x0d \xc2 \xe2
/\s?xxx\s/8SI
Capturing subpattern count = 0
Options: utf8
No first char
Need char = 'x'
Subject length lower bound = 4
Starting byte set: \x09 \x0a \x0c \x0d \x20 x
/\sxxx\s/8T1
AB\x{85}xxx\x{a0}XYZ
0: \x{85}xxx\x{a0}
AB\x{a0}xxx\x{85}XYZ
0: \x{a0}xxx\x{85}
/\sxxx\s/I8ST1
Capturing subpattern count = 0
Options: utf8
No first char
Need char = 'x'
Subject length lower bound = 5
Starting byte set: \x09 \x0a \x0c \x0d \x20 \xc2
AB\x{85}xxx\x{a0}XYZ
0: \x{85}xxx\x{a0}
AB\x{a0}xxx\x{85}XYZ
0: \x{a0}xxx\x{85}
/\S \S/8T1
\x{a2} \x{84}
0: \x{a2} \x{84}
/\S \S/I8ST1
Capturing subpattern count = 0
Options: utf8
No first char
Need char = ' '
Subject length lower bound = 3
Starting byte set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x0b \x0e
\x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 \x1a \x1b \x1c \x1d
\x1e \x1f ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e
f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f \xc0 \xc1 \xc2 \xc3
\xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca \xcb \xcc \xcd \xce \xcf \xd0 \xd1 \xd2
\xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9 \xda \xdb \xdc \xdd \xde \xdf \xe0 \xe1
\xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8 \xe9 \xea \xeb \xec \xed \xee \xef \xf0
\xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7 \xf8 \xf9 \xfa \xfb \xfc \xfd \xfe \xff
\x{a2} \x{84}
0: \x{a2} \x{84}
A Z
0: A Z
/-- End of testinput5 --/


@ -1285,4 +1285,72 @@ No match
\x{10b00}\x{a6ef}\x{13007}\x{10857}\x{10b78}\x{10b58}\x{a980}\x{110c1}\x{a4ff}\x{abc0}\x{10a7d}\x{10c48}\x{0800}\x{1aad}\x{aac0}
0: \x{10b00}\x{a6ef}\x{13007}\x{10857}\x{10b78}\x{10b58}\x{a980}\x{110c1}\x{a4ff}\x{abc0}\x{10a7d}\x{10c48}\x{800}\x{1aad}\x{aac0}
/^\w+/8W
Az_\x{aa}\x{c0}\x{1c5}\x{2b0}\x{3b6}\x{1d7c9}\x{2fa1d}1\x{660}\x{bef}\x{16ee}
0: Az_\x{aa}\x{c0}\x{1c5}\x{2b0}\x{3b6}\x{1d7c9}\x{2fa1d}1\x{660}\x{bef}\x{16ee}
/^[[:xdigit:]]*/8W
1a\x{660}\x{bef}\x{16ee}
0: 1a
/^\d+/8W
1\x{660}\x{bef}\x{16ee}
0: 1\x{660}\x{bef}
/^[[:digit:]]+/8W
1\x{660}\x{bef}\x{16ee}
0: 1\x{660}\x{bef}
/^>\s+/8W
>\x{20}\x{a0}\x{1680}\x{2028}\x{2029}\x{202f}\x{9}\x{b}
0: > \x{a0}\x{1680}\x{2028}\x{2029}\x{202f}\x{09}
/^>\pZ+/8W
>\x{20}\x{a0}\x{1680}\x{2028}\x{2029}\x{202f}\x{9}\x{b}
0: > \x{a0}\x{1680}\x{2028}\x{2029}\x{202f}
/^>[[:space:]]*/8W
>\x{20}\x{a0}\x{1680}\x{2028}\x{2029}\x{202f}\x{9}\x{b}
0: > \x{a0}\x{1680}\x{2028}\x{2029}\x{202f}\x{09}\x{0b}
/^>[[:blank:]]*/8W
>\x{20}\x{a0}\x{1680}\x{180e}\x{2000}\x{202f}\x{9}\x{b}\x{2028}
0: > \x{a0}\x{1680}\x{180e}\x{2000}\x{202f}\x{09}
/^[[:alpha:]]*/8W
Az\x{aa}\x{c0}\x{1c5}\x{2b0}\x{3b6}\x{1d7c9}\x{2fa1d}
0: Az\x{aa}\x{c0}\x{1c5}\x{2b0}\x{3b6}\x{1d7c9}\x{2fa1d}
/^[[:alnum:]]*/8W
Az\x{aa}\x{c0}\x{1c5}\x{2b0}\x{3b6}\x{1d7c9}\x{2fa1d}1\x{660}\x{bef}\x{16ee}
0: Az\x{aa}\x{c0}\x{1c5}\x{2b0}\x{3b6}\x{1d7c9}\x{2fa1d}1\x{660}\x{bef}\x{16ee}
/^[[:cntrl:]]*/8W
\x{0}\x{09}\x{1f}\x{7f}\x{9f}
0: \x{00}\x{09}\x{1f}\x{7f}
/^[[:graph:]]*/8W
A\x{a1}\x{a0}
0: A
/^[[:print:]]*/8W
A z\x{a0}\x{a1}
0: A z
/^[[:punct:]]*/8W
.+\x{a1}\x{a0}
0: .+
/\p{Zs}*?\R/
** Failers
No match
a\xFCb
No match
/\p{Zs}*\R/
** Failers
No match
a\xFCb
No match
/-- End of testinput6 --/


@ -1674,4 +1674,364 @@ No match
\x{1d79}\x{a77d}
No match
/^\p{Xan}/8
ABCD
0: A
1234
0: 1
\x{6ca}
0: \x{6ca}
\x{a6c}
0: \x{a6c}
\x{10a7}
0: \x{10a7}
** Failers
No match
_ABC
No match
/^\p{Xan}+/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
0: ABCD1234\x{6ca}\x{a6c}\x{10a7}
1: ABCD1234\x{6ca}\x{a6c}
2: ABCD1234\x{6ca}
3: ABCD1234
4: ABCD123
5: ABCD12
6: ABCD1
7: ABCD
8: ABC
9: AB
10: A
** Failers
No match
_ABC
No match
/^\p{Xan}*/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
0: ABCD1234\x{6ca}\x{a6c}\x{10a7}
1: ABCD1234\x{6ca}\x{a6c}
2: ABCD1234\x{6ca}
3: ABCD1234
4: ABCD123
5: ABCD12
6: ABCD1
7: ABCD
8: ABC
9: AB
10: A
11:
/^\p{Xan}{2,9}/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
0: ABCD1234\x{6ca}
1: ABCD1234
2: ABCD123
3: ABCD12
4: ABCD1
5: ABCD
6: ABC
7: AB
/^[\p{Xan}]/8
ABCD1234_
0: A
1234abcd_
0: 1
\x{6ca}
0: \x{6ca}
\x{a6c}
0: \x{a6c}
\x{10a7}
0: \x{10a7}
** Failers
No match
_ABC
No match
/^[\p{Xan}]+/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
0: ABCD1234\x{6ca}\x{a6c}\x{10a7}
1: ABCD1234\x{6ca}\x{a6c}
2: ABCD1234\x{6ca}
3: ABCD1234
4: ABCD123
5: ABCD12
6: ABCD1
7: ABCD
8: ABC
9: AB
10: A
** Failers
No match
_ABC
No match
/^>\p{Xsp}/8
>\x{1680}\x{2028}\x{0b}
0: >\x{1680}
** Failers
No match
\x{0b}
No match
/^>\p{Xsp}+/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
0: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}
1: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}
2: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}
3: > \x{09}\x{0a}\x{0c}\x{0d}
4: > \x{09}\x{0a}\x{0c}
5: > \x{09}\x{0a}
6: > \x{09}
7: >
/^>\p{Xsp}*/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
0: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}
1: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}
2: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}
3: > \x{09}\x{0a}\x{0c}\x{0d}
4: > \x{09}\x{0a}\x{0c}
5: > \x{09}\x{0a}
6: > \x{09}
7: >
8: >
/^>\p{Xsp}{2,9}/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
0: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}
1: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}
2: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}
3: > \x{09}\x{0a}\x{0c}\x{0d}
4: > \x{09}\x{0a}\x{0c}
5: > \x{09}\x{0a}
6: > \x{09}
/^>[\p{Xsp}]/8
>\x{2028}\x{0b}
0: >\x{2028}
/^>[\p{Xsp}]+/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
0: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}
1: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}
2: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}
3: > \x{09}\x{0a}\x{0c}\x{0d}
4: > \x{09}\x{0a}\x{0c}
5: > \x{09}\x{0a}
6: > \x{09}
7: >
/^>\p{Xps}/8
>\x{1680}\x{2028}\x{0b}
0: >\x{1680}
>\x{a0}
0: >\x{a0}
** Failers
No match
\x{0b}
No match
/^>\p{Xps}+/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
0: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
1: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}
2: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}
3: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}
4: > \x{09}\x{0a}\x{0c}\x{0d}
5: > \x{09}\x{0a}\x{0c}
6: > \x{09}\x{0a}
7: > \x{09}
8: >
/^>\p{Xps}+?/8
>\x{1680}\x{2028}\x{0b}
0: >\x{1680}\x{2028}\x{0b}
1: >\x{1680}\x{2028}
2: >\x{1680}
/^>\p{Xps}*/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
0: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
1: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}
2: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}
3: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}
4: > \x{09}\x{0a}\x{0c}\x{0d}
5: > \x{09}\x{0a}\x{0c}
6: > \x{09}\x{0a}
7: > \x{09}
8: >
9: >
/^>\p{Xps}{2,9}/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
0: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
1: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}
2: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}
3: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}
4: > \x{09}\x{0a}\x{0c}\x{0d}
5: > \x{09}\x{0a}\x{0c}
6: > \x{09}\x{0a}
7: > \x{09}
/^>\p{Xps}{2,9}?/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
0: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
1: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}
2: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}
3: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}
4: > \x{09}\x{0a}\x{0c}\x{0d}
5: > \x{09}\x{0a}\x{0c}
6: > \x{09}\x{0a}
7: > \x{09}
/^>[\p{Xps}]/8
>\x{2028}\x{0b}
0: >\x{2028}
/^>[\p{Xps}]+/8
> \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
0: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}\x{0b}
1: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}\x{2028}
2: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}\x{1680}
3: > \x{09}\x{0a}\x{0c}\x{0d}\x{a0}
4: > \x{09}\x{0a}\x{0c}\x{0d}
5: > \x{09}\x{0a}\x{0c}
6: > \x{09}\x{0a}
7: > \x{09}
8: >
/^\p{Xwd}/8
ABCD
0: A
1234
0: 1
\x{6ca}
0: \x{6ca}
\x{a6c}
0: \x{a6c}
\x{10a7}
0: \x{10a7}
_ABC
0: _
** Failers
No match
[]
No match
/^\p{Xwd}+/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
0: ABCD1234\x{6ca}\x{a6c}\x{10a7}_
1: ABCD1234\x{6ca}\x{a6c}\x{10a7}
2: ABCD1234\x{6ca}\x{a6c}
3: ABCD1234\x{6ca}
4: ABCD1234
5: ABCD123
6: ABCD12
7: ABCD1
8: ABCD
9: ABC
10: AB
11: A
/^\p{Xwd}*/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
0: ABCD1234\x{6ca}\x{a6c}\x{10a7}_
1: ABCD1234\x{6ca}\x{a6c}\x{10a7}
2: ABCD1234\x{6ca}\x{a6c}
3: ABCD1234\x{6ca}
4: ABCD1234
5: ABCD123
6: ABCD12
7: ABCD1
8: ABCD
9: ABC
10: AB
11: A
12:
/^\p{Xwd}{2,9}/8
A_12\x{6ca}\x{a6c}\x{10a7}
0: A_12\x{6ca}\x{a6c}\x{10a7}
1: A_12\x{6ca}\x{a6c}
2: A_12\x{6ca}
3: A_12
4: A_1
5: A_
/^[\p{Xwd}]/8
ABCD1234_
0: A
1234abcd_
0: 1
\x{6ca}
0: \x{6ca}
\x{a6c}
0: \x{a6c}
\x{10a7}
0: \x{10a7}
_ABC
0: _
** Failers
No match
[]
No match
/^[\p{Xwd}]+/8
ABCD1234\x{6ca}\x{a6c}\x{10a7}_
0: ABCD1234\x{6ca}\x{a6c}\x{10a7}_
1: ABCD1234\x{6ca}\x{a6c}\x{10a7}
2: ABCD1234\x{6ca}\x{a6c}
3: ABCD1234\x{6ca}
4: ABCD1234
5: ABCD123
6: ABCD12
7: ABCD1
8: ABCD
9: ABC
10: AB
11: A
/-- Unicode properties for \b and \B --/
/\b...\B/8W
abc_
0: abc
\x{37e}abc\x{376}
0: abc
\x{37e}\x{376}\x{371}\x{393}\x{394}
0: \x{376}\x{371}\x{393}
!\x{c0}++\x{c1}\x{c2}
0: ++\x{c1}
!\x{c0}+++++
0: \x{c0}++
/-- Without PCRE_UCP, non-ASCII always fail, even if < 256 --/
/\b...\B/8
abc_
0: abc
** Failers
0: Fai
\x{37e}abc\x{376}
No match
\x{37e}\x{376}\x{371}\x{393}\x{394}
No match
!\x{c0}++\x{c1}\x{c2}
No match
!\x{c0}+++++
No match
/-- With PCRE_UCP, non-UTF8 chars that are < 256 still check properties --/
/\b...\B/W
abc_
0: abc
!\x{c0}++\x{c1}\x{c2}
0: ++\xc1
!\x{c0}+++++
0: \xc0++
/-- End of testinput9 --/