From c65c18269302de59e7b657bce2e30b2726935d73 Mon Sep 17 00:00:00 2001
From: Andrei Zmievski
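
The LINK_SIZE comment in the internal.h part of this patch says that offsets
are kept in the compiled code as 2-byte quantities by default and are stored
and loaded through the PUT and GET macros. A minimal sketch of that round
trip, assuming LINK_SIZE is 2: the two macros are copied from the patch, while
store_and_load() and check_link_roundtrip() are hypothetical helpers that are
not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char uschar;   /* same typedef as internal.h */

#define LINK_SIZE 2             /* the default named in the patch comment */

/* LINK_SIZE == 2 variants, copied from the patch */
#define PUT(a,n,d) \
  (a[n] = (d) >> 8), \
  (a[(n)+1] = (d) & 255)

#define GET(a,n) \
  (((a)[n] << 8) | (a)[(n)+1])

/* Hypothetical helper: write offset d after a stand-in opcode byte in a
   scratch buffer, then read it back through GET. */
static unsigned int store_and_load(unsigned int d)
{
  uschar code[1 + LINK_SIZE] = { 0 };
  PUT(code, 1, d);              /* high byte first, then low byte */
  return (unsigned int)GET(code, 1);
}

/* Hypothetical self-check showing what does and does not survive. */
static void check_link_roundtrip(void)
{
  assert(store_and_load(0x1234u) == 0x1234u);   /* fits in 2 bytes */
  assert(store_and_load(0xFFFFu) == 0xFFFFu);   /* largest 2-byte value */
  assert(store_and_load(0x12345u) == 0x2345u);  /* high bits are lost */
}
```

With 2 bytes per offset, anything at or above 1 << 16 loses its high bits,
which matches MAX_PATTERN_SIZE being (1 << 16) for this setting and is the
reason the patch also provides LINK_SIZE 3 and 4.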
+>>>>>>>>>>>>Only WORD is perl. BLANK is GNU. +
+The names "ascii" and "word" are Perl extensions. Another Perl extension is negation, which is indicated by a ^ character after the colon. For example,
diff --git a/ext/pcre/pcrelib/doc/pcre.txt b/ext/pcre/pcrelib/doc/pcre.txt
index 95f148f3dea..46ede597548 100644
--- a/ext/pcre/pcrelib/doc/pcre.txt
+++ b/ext/pcre/pcrelib/doc/pcre.txt
@@ -1273,6 +1273,8 @@ POSIX CHARACTER CLASSES
        word     "word" characters (same as \w)
        xdigit   hexadecimal digits
 
+     >>>>>>>>>>>>Only WORD is perl. BLANK is GNU.
+
      The names "ascii" and "word" are Perl extensions. Another
      Perl extension is negation, which is indicated by a ^ char-
      acter after the colon. For example,
@@ -1416,7 +1418,6 @@ SUBPATTERNS
      are numbered 1 and 2. The maximum number of captured sub-
      strings is 99, and the maximum number of all subpatterns,
      both capturing and non-capturing, is 200.
-
      As a convenient shorthand, if any option settings are
      required at the start of a non-capturing subpattern, the
      option letters may appear between the "?" and the ":". Thus
@@ -1468,8 +1469,9 @@ REPETITION
      matches exactly 8 digits. An opening curly bracket that
      appears in a position where a quantifier is not allowed, or
      one that does not match the syntax of a quantifier, is taken
-     as a literal character. For example, {,6} is not a quantif-
-     ier, but a literal string of four characters.
+     as a literal character. For example, {,6} is not a
+     quantifier, but a literal string of four characters.
+
      The quantifier {0} is permitted, causing the expression to
      behave as if the previous item and the quantifier were not
      present.
@@ -1519,8 +1521,8 @@ REPETITION
      does the right thing with the C comments. The meaning of the
      various quantifiers is not otherwise changed, just the pre-
-     ferred number of matches. Do not confuse this use of ques-
-     tion mark with its use as a quantifier in its own right.
+     ferred number of matches. Do not confuse this use of
+     question mark with its use as a quantifier in its own right.
      Because it has two uses, it can sometimes appear doubled, as
      in
@@ -1571,17 +1573,10 @@ REPETITION
+
 BACK REFERENCES
      Outside a character class, a backslash followed by a digit
      greater than 0 (and possibly further digits) is a back
-
-
-
-
-SunOS 5.8 Last change: 30
-
-
-
      reference to a capturing subpattern earlier (i.e. to its
      left) in the pattern, provided there have been that many
      previous capturing left parentheses.
@@ -1630,8 +1625,8 @@ SunOS 5.8 Last change: 30
      A back reference that occurs inside the parentheses to which
      it refers fails when the subpattern is first used, so, for
      example, (a\1) never matches. However, such references can
-     be useful inside repeated subpatterns. For example, the pat-
-     tern
+     be useful inside repeated subpatterns. For example, the
+     pattern
 
        (a|b\1)+
 
@@ -2100,12 +2095,11 @@ UTF-8 SUPPORT
      UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
      you pass invalid UTF-8 strings to PCRE, the results are
      undefined.
-
      Running with PCRE_UTF8 set causes these changes in the way
      PCRE works:
 
-     1. In a pattern, the escape sequence \x{...}, where the
-     contents of the braces is a string of hexadecimal digits, is
+     1. In a pattern, the escape sequence \x{...}, where the con-
+     tents of the braces is a string of hexadecimal digits, is
      interpreted as a UTF-8 character whose code number is the
      given hexadecimal number, for example: \x{1234}. This
      inserts from one to six literal bytes into the pattern,
@@ -2153,7 +2147,6 @@ UTF-8 SUPPORT
      9. The character types such as \d and \w do not work
      correctly with UTF-8 characters. They continue to test a
      single byte.
-
      10. Anything not explicitly mentioned here continues to work
      in bytes rather than in characters.
@@ -2310,6 +2303,5 @@ AUTHOR
      New Museums Site,
      Cambridge CB2 3QG, England.
      Phone: +44 1223 334714
-
      Last updated: 15 August 2001
      Copyright (c) 1997-2001 University of Cambridge.
diff --git a/ext/pcre/pcrelib/doc/pcregrep.1 b/ext/pcre/pcrelib/doc/pcregrep.1
index 5d3151e8677..b55745aca8f 100644
--- a/ext/pcre/pcrelib/doc/pcregrep.1
+++ b/ext/pcre/pcrelib/doc/pcregrep.1
@@ -2,7 +2,7 @@
 .SH NAME
 pcregrep - a grep with Perl-compatible regular expressions.
 .SH SYNOPSIS
-.B pcregrep [-Vcfhilnrsvx] pattern [file] ...
+.B pcregrep [-Vcfhilnrsvx] [pattern] [file1 file2 ...]
 .SH DESCRIPTION
@@ -11,6 +11,9 @@ grep commands do, but it uses the PCRE regular expression library to support
 patterns that are compatible with the regular expressions of Perl 5. See
 \fBpcre(3)\fR for a full description of syntax and semantics.
 
+A pattern must be specified on the command line unless the \fB-f\fR option is
+used (see below).
+
 If no files are specified, \fBpcregrep\fR reads the standard input. By default,
 each line that matches the pattern is copied to the standard output, and if
 there is more than one file, the file name is printed before each line of
@@ -32,11 +35,12 @@ Do not print individual lines; instead just print a count of the number of
 lines that would otherwise have been printed. If several files are given, a
 count is printed for each of them.
 .TP
-\fB-f\fIfilename\fR
-Read patterns from the file, one per line, and match all patterns against each
-line. There is a maximum of 100 patterns. Trailing white space is removed, and
-blank lines are ignored. An empty file contains no patterns and therefore
-matches nothing.
+\fB-f\fIfilename\fR Read a number of patterns from the file, one per line, and
+match all of them against each line of input. A line is output if any of the
+patterns match it. When \fB-f\fR is used, no pattern is taken from the command
+line; all arguments are treated as file names. There is a maximum of 100
+patterns. Trailing white space is removed, and blank lines are ignored. An
+empty file contains no patterns and therefore matches nothing.
 .TP
 \fB-h\fR
 Suppress printing of filenames when searching multiple files.
@@ -83,6 +87,6 @@ for syntax errors or inacessible files (even if matches were found).
 .SH AUTHOR
 Philip Hazel
-pcregrep [-Vcfhilnrsvx] pattern [file] ...
+pcregrep [-Vcfhilnrsvx] [pattern] [file1 file2 ...]
@@ -32,6 +32,10 @@
 patterns that are compatible with the regular expressions of Perl 5. See
 pcre(3) for a full description of syntax and semantics.
+A pattern must be specified on the command line unless the -f option is
+used (see below).
+
+If no files are specified, pcregrep reads the standard input. By default,
 each line that matches the pattern is copied to the standard output, and if
 there is more than one file, the file name is printed before each line of
@@ -55,11 +59,12 @@
 lines that would otherwise have been printed. If several files are given, a
 count is printed for each of them.
-\fB-ffilename
-Read patterns from the file, one per line, and match all patterns against each
-line. There is a maximum of 100 patterns. Trailing white space is removed, and
-blank lines are ignored. An empty file contains no patterns and therefore
-matches nothing.
+\fB-ffilename Read a number of patterns from the file, one per line, and
+match all of them against each line of input. A line is output if any of the
+patterns match it. When -f is used, no pattern is taken from the command
+line; all arguments are treated as file names. There is a maximum of 100
+patterns. Trailing white space is removed, and blank lines are ignored. An
+empty file contains no patterns and therefore matches nothing.
 -h
@@ -115,6 +120,6 @@ for syntax errors or inacessible files (even if matches were found).
 Philip Hazel <ph10@cam.ac.uk>
-Last updated: 15 August 2001
+Last updated: 25 July 2002
-Copyright (c) 1997-2001 University of Cambridge.
+Copyright (c) 1997-2002 University of Cambridge.
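
The -f rules documented in the hunks above (trailing white space is removed
from each pattern, at most 100 patterns, a line is output if any of the
patterns match it, and an empty pattern list matches nothing) can be sketched
in C as follows. This is only an illustration and not pcregrep's code: it uses
POSIX <regex.h> (regcomp/regexec) as a stand-in for PCRE, so the pattern
syntax is POSIX extended rather than Perl-compatible, and
strip_trailing_space() and line_matches_any() are hypothetical names.

```c
#include <ctype.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_PATTERNS 100   /* limit stated in the pcregrep documentation */

/* Remove trailing white space in place and return the new length.
   A result of 0 means the line was blank and should be ignored. */
static size_t strip_trailing_space(char *s)
{
  size_t len = strlen(s);
  while (len > 0 && isspace((unsigned char)s[len - 1])) s[--len] = '\0';
  return len;
}

/* Return 1 if any of the n patterns matches line, 0 if none does, and -1
   if there are too many patterns or one of them fails to compile.
   With n == 0 the loop never runs, so nothing matches. */
static int line_matches_any(const char *const *patterns, size_t n,
                            const char *line)
{
  size_t i;
  if (n > MAX_PATTERNS) return -1;
  for (i = 0; i < n; i++)
    {
    regex_t re;
    int rc;
    if (regcomp(&re, patterns[i], REG_EXTENDED | REG_NOSUB) != 0) return -1;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    if (rc == 0) return 1;      /* first matching pattern is enough */
    }
  return 0;
}
```

Compiling every pattern again for each line keeps the sketch short; a real
tool would compile the patterns once and reuse them for every input line.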
diff --git a/ext/pcre/pcrelib/doc/pcregrep.txt b/ext/pcre/pcrelib/doc/pcregrep.txt
index 16002284025..ce53f7a8890 100644
--- a/ext/pcre/pcrelib/doc/pcregrep.txt
+++ b/ext/pcre/pcrelib/doc/pcregrep.txt
@@ -4,7 +4,7 @@ NAME
SYNOPSIS
- pcregrep [-Vcfhilnrsvx] pattern [file] ...
+ pcregrep [-Vcfhilnrsvx] [pattern] [file1 file2 ...]
@@ -15,6 +15,9 @@ DESCRIPTION
with the regular expressions of Perl 5. See pcre(3) for a
full description of syntax and semantics.
+ A pattern must be specified on the command line unless the
+ -f option is used (see below).
+
If no files are specified, pcregrep reads the standard
input. By default, each line that matches the pattern is
copied to the standard output, and if there is more than one
@@ -37,13 +40,19 @@ OPTIONS
wise have been printed. If several files are
given, a count is printed for each of them.
- -ffilename
- Read patterns from the file, one per line, and
- match all patterns against each line. There is a
- maximum of 100 patterns. Trailing white space is
- removed, and blank lines are ignored. An empty
- file contains no patterns and therefore matches
- nothing.
+
+
+and
+ -
+ ffilename Read a number of patterns from the file, one per line,
+ match all of them against each line of input. A
+ line is output if any of the patterns match it.
+ When -f is used, no pattern is taken from the com-
+ mand line; all arguments are treated as file
+ names. There is a maximum of 100 patterns. Trail-
+ ing white space is removed, and blank lines are
+ ignored. An empty file contains no patterns and
+ therefore matches nothing.
-h Suppress printing of filenames when searching mul-
tiple files.
@@ -52,7 +61,6 @@ OPTIONS
parisons.
-l Instead of printing lines from the files, just
-
print the names of the files containing lines that
would have been printed. Each file name is printed
once, on a separate line.
@@ -97,5 +105,5 @@ DIAGNOSTICS
AUTHOR
Philip Hazel
 -t
-Run each compile, study, and match 20000 times with a timer, and output
+Run each compile, study, and match many times with a timer, and output
 resulting time per compile or match (in milliseconds). Do not set -t with -m,
 because you will then get the size output 20000 times and the timing will be
 distorted.
@@ -78,10 +78,18 @@ expressions, and "data>" to prompt for data lines.
 The program handles any number of sets of input on a single input file. Each
 set starts with a regular expression, and continues with any number of data
-lines to be matched against the pattern. An empty line signals the end of the
-data lines, at which point a new regular expression is read. The regular
-expressions are given enclosed in any non-alphameric delimiters other than
-backslash, for example
+lines to be matched against the pattern.
+
+
+Each line is matched separately and independently. If you want to do
+multiple-line matches, you have to use the \n escape sequence in a single line
+of input to encode the newline characters. The maximum length of data line is
+30,000 characters.
+
+
+An empty line signals the end of the data lines, at which point a new regular
+expression is read. The regular expressions are given enclosed in any
+non-alphameric delimiters other than backslash, for example
@@ -364,6 +372,6 @@
 Cambridge CB2 3QG, England.
 Phone: +44 1223 334714
-Last updated: 15 August 2001
+Last updated: 25 August 2002
-Copyright (c) 1997-2001 University of Cambridge.
+Copyright (c) 1997-2002 University of Cambridge.
diff --git a/ext/pcre/pcrelib/doc/pcretest.txt b/ext/pcre/pcrelib/doc/pcretest.txt
index 0e13b6c6c50..54e989a83c8 100644
--- a/ext/pcre/pcrelib/doc/pcretest.txt
+++ b/ext/pcre/pcrelib/doc/pcretest.txt
@@ -42,11 +42,11 @@ OPTIONS
                wrapper API is used to call PCRE. None of the
                other options has any effect when -p is set.
-     -t        Run each compile, study, and match 20000 times
-               with a timer, and output resulting time per com-
-               pile or match (in milliseconds). Do not set -t
-               with -m, because you will then get the size output
-               20000 times and the timing will be distorted.
+     -t        Run each compile, study, and match many times with
+               a timer, and output resulting time per compile or
+               match (in milliseconds). Do not set -t with -m,
+               because you will then get the size output 20000
+               times and the timing will be distorted.
@@ -70,10 +70,18 @@ SunOS 5.8 Last change: 1
      The program handles any number of sets of input on a single
      input file. Each set starts with a regular expression, and
      continues with any number of data lines to be matched
-     against the pattern. An empty line signals the end of the
-     data lines, at which point a new regular expression is read.
-     The regular expressions are given enclosed in any non-
-     alphameric delimiters other than backslash, for example
+     against the pattern.
+
+     Each line is matched separately and independently. If you
+     want to do multiple-line matches, you have to use the \n
+     escape sequence in a single line of input to encode the new-
+     line characters. The maximum length of data line is 30,000
+     characters.
+
+     An empty line signals the end of the data lines, at which
+     point a new regular expression is read. The regular expres-
+     sions are given enclosed in any non-alphameric delimiters
+     other than backslash, for example
 
        /(a|bc)x+yz/
 
@@ -165,6 +173,7 @@ PATTERN MODIFIERS
      pcre_fullinfo() after compiling an expression, and output-
      ting the information it gets back. If the pattern is stu-
      died, the results of that are also output.
+
      The /D modifier is a PCRE debugging feature, which also
      assumes /I. It causes the internal form of compiled regular
      expressions to be output after compilation.
@@ -208,7 +217,8 @@ DATA LINES
        \t         tab
        \v         vertical tab
        \nnn       octal character (up to 3 octal digits)
-       \xhh       hexadecimal character (up to 2 hex digits)
+
+                  hexadecimal character (up to 2 hex digits)
        \x{hh...}  hexadecimal UTF-8 character
 
        \A         pass the PCRE_ANCHORED option to pcre_exec()
@@ -217,7 +227,6 @@ DATA LINES
                   after a successful match (any decimal number
                   less than 32)
        \Gdd       call pcre_get_substring() for substring dd
-
                   after a successful match (any decimal number
                   less than 32)
        \L         call pcre_get_substringlist() after a
@@ -261,6 +270,7 @@ OUTPUT FROM PCRETEST
        re> /^abc(\d+)/
        data> abc123
+
         0: abc123
         1: 123
        data> xyz
@@ -315,5 +325,5 @@ AUTHOR
      Cambridge CB2 3QG, England.
      Phone: +44 1223 334714
 
-     Last updated: 15 August 2001
-     Copyright (c) 1997-2001 University of Cambridge.
+     Last updated: 25 August 2002
+     Copyright (c) 1997-2002 University of Cambridge.
diff --git a/ext/pcre/pcrelib/doc/perltest.txt b/ext/pcre/pcrelib/doc/perltest.txt
index 5a404016b52..9ea9d932a5b 100644
--- a/ext/pcre/pcrelib/doc/perltest.txt
+++ b/ext/pcre/pcrelib/doc/perltest.txt
@@ -13,10 +13,15 @@ for perltest as well as for pcretest, and the special upper case modifiers such
 as /A that pcretest recognizes are not used in these files. The output should
 be identical, apart from the initial identifying banner.
 
-For testing UTF-8 features, an alternative form of perltest, called perltest8,
-is supplied. This requires Perl 5.6 or higher. It recognizes the special
-modifier /8 that pcretest uses to invoke UTF-8 functionality. The testinput5
-file can be fed to perltest8.
+The perltest script can also test UTF-8 features. It works as is for Perl 5.8
+or higher. It recognizes the special modifier /8 that pcretest uses to invoke
+UTF-8 functionality. The testinput5 file can be fed to perltest to run UTF-8
+tests.
+
+For Perl 5.6, perltest won't work unmodified for the UTF-8 tests. You need to
+uncomment the "use utf8" lines that it contains. It is best to do this on a
+copy of the script, because for non-UTF-8 tests, these lines should remain
+commented out.
 
 The testinput2 and testinput4 files are not suitable for feeding to perltest,
 since they do make use of the special upper case modifiers and escapes that
@@ -26,4 +31,4 @@ them correctly. Similarly, testinput6 tests UTF-8 features that do not relate to
 Perl.
 
 Philip Hazel
-August 2000
+August 2002
diff --git a/ext/pcre/pcrelib/internal.h b/ext/pcre/pcrelib/internal.h
index 0482c8a4cad..a727f92bd87 100644
--- a/ext/pcre/pcrelib/internal.h
+++ b/ext/pcre/pcrelib/internal.h
@@ -9,7 +9,7 @@ the file Tech.Notes for some information on the internals.
 
 Written by: Philip Hazel
 
-           Copyright (c) 1997-2001 University of Cambridge
+           Copyright (c) 1997-2002 University of Cambridge
 
 -----------------------------------------------------------------------------
 Permission is granted to anyone to use this software for any purpose on any
@@ -52,7 +52,18 @@ On Unix systems, "configure" can be used to override this default. */
 #ifndef NEWLINE
 #define NEWLINE '\n'
 #endif
-
+
+/* When compiling for use with the Virtual Pascal compiler, these functions
+need to have their names changed. PCRE must be compiled with the -DVPCOMPAT
+option on the command line. */
+
+#ifdef VPCOMPAT
+#define strncmp(s1,s2,m) _strncmp(s1,s2,m)
+#define memcpy(d,s,n)    _memcpy(d,s,n)
+#define memmove(d,s,n)   _memmove(d,s,n)
+#define memset(s,c,n)    _memset(s,c,n)
+#else  /* VPCOMPAT */
+
 /* To cope with SunOS4 and other systems that lack memmove() but have bcopy(),
 define a macro for memmove() if HAVE_MEMMOVE is false, provided that HAVE_BCOPY
 is set. Otherwise, include an emulating function for those systems that have
@@ -64,7 +75,7 @@ case in PCRE. */
 #undef  memmove        /* some systems may have a macro */
 #if HAVE_BCOPY
 #define memmove(a, b, c) bcopy(b, a, c)
-#else
+#else  /* HAVE_BCOPY */
 void *
 pcre_memmove(unsigned char *dest, const unsigned char *src, size_t n)
 {
@@ -74,8 +85,85 @@ src += n;
 for (i = 0; i < n; ++i) *(--dest) = *(--src);
 }
 #define memmove(a, b, c) pcre_memmove(a, b, c)
+#endif   /* not HAVE_BCOPY */
+#endif   /* not HAVE_MEMMOVE */
+#endif   /* not VPCOMPAT */
+
+
+/* PCRE keeps offsets in its compiled code as 2-byte quantities by default.
+These are used, for example, to link from the start of a subpattern to its
+alternatives and its end. The use of 2 bytes per offset limits the size of the
+compiled regex to around 64K, which is big enough for almost everybody.
+However, I received a request for an even bigger limit. For this reason, and
+also to make the code easier to maintain, the storing and loading of offsets
+from the byte string is now handled by the macros that are defined here.
+
+The macros are controlled by the value of LINK_SIZE. This defaults to 2 in
+the config.h file, but can be overridden by using -D on the command line. This
+is automated on Unix systems via the "configure" command. */
+
+#if LINK_SIZE == 2
+
+#define PUT(a,n,d) \
+  (a[n] = (d) >> 8), \
+  (a[(n)+1] = (d) & 255)
+
+#define GET(a,n) \
+  (((a)[n] << 8) | (a)[(n)+1])
+
+#define MAX_PATTERN_SIZE (1 << 16)
+
+
+#elif LINK_SIZE == 3
+
+#define PUT(a,n,d) \
+  (a[n] = (d) >> 16), \
+  (a[(n)+1] = (d) >> 8), \
+  (a[(n)+2] = (d) & 255)
+
+#define GET(a,n) \
+  (((a)[n] << 16) | ((a)[(n)+1] << 8) | (a)[(n)+2])
+
+#define MAX_PATTERN_SIZE (1 << 24)
+
+
+#elif LINK_SIZE == 4
+
+#define PUT(a,n,d) \
+  (a[n] = (d) >> 24), \
+  (a[(n)+1] = (d) >> 16), \
+  (a[(n)+2] = (d) >> 8), \
+  (a[(n)+3] = (d) & 255)
+
+#define GET(a,n) \
+  (((a)[n] << 24) | ((a)[(n)+1] << 16) | ((a)[(n)+2] << 8) | (a)[(n)+3])
+
+#define MAX_PATTERN_SIZE (1 << 30)   /* Keep it positive */
+
+
+#else
+#error LINK_SIZE must be either 2, 3, or 4
 #endif
-#endif
+
+
+/* Convenience macro defined in terms of the others */
+
+#define PUTINC(a,n,d)   PUT(a,n,d), a += LINK_SIZE
+
+
+/* PCRE uses some other 2-byte quantities that do not change when the size of
+offsets changes. There are used for repeat counts and for other things such as
+capturing parenthesis numbers in back references. */
+
+#define PUT2(a,n,d) \
+  a[n] = (d) >> 8; \
+  a[(n)+1] = (d) & 255
+
+#define GET2(a,n) \
+  (((a)[n] << 8) | (a)[(n)+1])
+
+#define PUT2INC(a,n,d)  PUT2(a,n,d), a += 2
+
 
 /* Standard C headers plus the external interface definition */
 
@@ -107,8 +195,7 @@ to four bytes there is plenty of space. */
 #define PCRE_FIRSTSET      0x40000000  /* first_char is set */
 #define PCRE_REQCHSET      0x20000000  /* req_char is set */
 #define PCRE_STARTLINE     0x10000000  /* start after \n for multiline */
-#define PCRE_INGROUP       0x08000000  /* compiling inside a group */
-#define PCRE_ICHANGED      0x04000000  /* i option changes within regex */
+#define PCRE_ICHANGED      0x08000000  /* i option changes within regex */
 
 /* Options for the "extra" block produced by pcre_study(). */
 
@@ -130,6 +217,15 @@ time, run time or study time, respectively. */
 
 #define MAGIC_NUMBER  0x50435245UL   /* 'PCRE' */
 
+/* Negative values for the firstchar and reqchar variables */
+
+#define REQ_UNSET (-2)
+#define REQ_NONE  (-1)
+
+/* Flags added to firstchar or reqchar */
+
+#define REQ_CASELESS 0x0100    /* indicates caselessness */
+
 /* Miscellaneous definitions */
 
 typedef int BOOL;
@@ -138,143 +234,213 @@ typedef int BOOL;
 #define TRUE    1
 
 /* Escape items that are just an encoding of a particular data value. Note that
-ESC_N is defined as yet another macro, which is set in config.h to either \n
+ESC_n is defined as yet another macro, which is set in config.h to either \n
 (the default) or \r (which some people want). */
 
-#ifndef ESC_E
-#define ESC_E 27
+#ifndef ESC_e
+#define ESC_e 27
 #endif
 
-#ifndef ESC_F
-#define ESC_F '\f'
+#ifndef ESC_f
+#define ESC_f '\f'
 #endif
 
-#ifndef ESC_N
-#define ESC_N NEWLINE
+#ifndef ESC_n
+#define ESC_n NEWLINE
 #endif
 
-#ifndef ESC_R
-#define ESC_R '\r'
+#ifndef ESC_r
+#define ESC_r '\r'
 #endif
 
-#ifndef ESC_T
-#define ESC_T '\t'
+#ifndef ESC_t
+#define ESC_t '\t'
 #endif
 
 /* These are escaped items that aren't just an encoding of a particular data
 value such as \n. They must have non-zero values, as check_escape() returns
 their negation. Also, they must appear in the same order as in the opcode
-definitions below, up to ESC_z. The final one must be ESC_REF as subsequent
-values are used for \1, \2, \3, etc. There is a test in the code for an escape
-greater than ESC_b and less than ESC_Z to detect the types that may be
-repeated. If any new escapes are put in-between that don't consume a character,
-that code will have to change. */
+definitions below, up to ESC_z. There's a dummy for OP_ANY because it
+corresponds to "." rather than an escape sequence. The final one must be
+ESC_REF as subsequent values are used for \1, \2, \3, etc. There is are two
+tests in the code for an escape greater than ESC_b and less than ESC_Z to
+detect the types that may be repeated. These are the types that consume a
+character. If any new escapes are put in between that don't consume a
+character, that code will have to change. */
+
+enum { ESC_A = 1, ESC_G, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s, ESC_W,
+       ESC_w, ESC_dum1, ESC_C, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_REF };
 
-enum { ESC_A = 1, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s, ESC_W, ESC_w,
-       ESC_Z, ESC_z, ESC_REF };
 
 /* Opcode table: OP_BRA must be last, as all values >= it are used for brackets
 that extract substrings. Starting from 1 (i.e. after OP_END), the values up to
-OP_EOD must correspond in order to the list of escapes immediately above. */
+OP_EOD must correspond in order to the list of escapes immediately above.
+Note that whenever this list is updated, the two macro definitions that follow
+must also be updated to match. */
 
 enum {
-  OP_END,               /* End of pattern */
+  OP_END,               /* 0 End of pattern */
 
   /* Values corresponding to backslashed metacharacters */
 
-  OP_SOD,               /* Start of data: \A */
-  OP_NOT_WORD_BOUNDARY, /* \B */
-  OP_WORD_BOUNDARY,     /* \b */
-  OP_NOT_DIGIT,         /* \D */
-  OP_DIGIT,             /* \d */
-  OP_NOT_WHITESPACE,    /* \S */
-  OP_WHITESPACE,        /* \s */
-  OP_NOT_WORDCHAR,      /* \W */
-  OP_WORDCHAR,          /* \w */
-  OP_EODN,              /* End of data or \n at end of data: \Z. */
-  OP_EOD,               /* End of data: \z */
+  OP_SOD,               /* 1 Start of data: \A */
+  OP_SOM,               /* 2 Start of match (subject + offset): \G */
+  OP_NOT_WORD_BOUNDARY, /* 3 \B */
+  OP_WORD_BOUNDARY,     /* 4 \b */
+  OP_NOT_DIGIT,         /* 5 \D */
+  OP_DIGIT,             /* 6 \d */
+  OP_NOT_WHITESPACE,    /* 7 \S */
+  OP_WHITESPACE,        /* 8 \s */
+  OP_NOT_WORDCHAR,      /* 9 \W */
+  OP_WORDCHAR,          /* 10 \w */
+  OP_ANY,               /* 11 Match any character */
+  OP_ANYBYTE,           /* 12 Match any byte (\C); different to OP_ANY for UTF-8 */
+  OP_EODN,              /* 13 End of data or \n at end of data: \Z. */
+  OP_EOD,               /* 14 End of data: \z */
 
-  OP_OPT,               /* Set runtime options */
-  OP_CIRC,              /* Start of line - varies with multiline switch */
-  OP_DOLL,              /* End of line - varies with multiline switch */
-  OP_ANY,               /* Match any character */
-  OP_CHARS,             /* Match string of characters */
-  OP_NOT,               /* Match anything but the following char */
+  OP_OPT,               /* 15 Set runtime options */
+  OP_CIRC,              /* 16 Start of line - varies with multiline switch */
+  OP_DOLL,              /* 17 End of line - varies with multiline switch */
+  OP_CHARS,             /* 18 Match string of characters */
+  OP_NOT,               /* 19 Match anything but the following char */
 
-  OP_STAR,              /* The maximizing and minimizing versions of */
-  OP_MINSTAR,           /* all these opcodes must come in pairs, with */
-  OP_PLUS,              /* the minimizing one second. */
-  OP_MINPLUS,           /* This first set applies to single characters */
-  OP_QUERY,
-  OP_MINQUERY,
-  OP_UPTO,              /* From 0 to n matches */
-  OP_MINUPTO,
-  OP_EXACT,             /* Exactly n matches */
+  OP_STAR,              /* 20 The maximizing and minimizing versions of */
+  OP_MINSTAR,           /* 21 all these opcodes must come in pairs, with */
+  OP_PLUS,              /* 22 the minimizing one second. */
+  OP_MINPLUS,           /* 23 This first set applies to single characters */
+  OP_QUERY,             /* 24 */
+  OP_MINQUERY,          /* 25 */
+  OP_UPTO,              /* 26 From 0 to n matches */
+  OP_MINUPTO,           /* 27 */
+  OP_EXACT,             /* 28 Exactly n matches */
 
-  OP_NOTSTAR,           /* The maximizing and minimizing versions of */
-  OP_NOTMINSTAR,        /* all these opcodes must come in pairs, with */
-  OP_NOTPLUS,           /* the minimizing one second. */
-  OP_NOTMINPLUS,        /* This first set applies to "not" single characters */
-  OP_NOTQUERY,
-  OP_NOTMINQUERY,
-  OP_NOTUPTO,           /* From 0 to n matches */
-  OP_NOTMINUPTO,
-  OP_NOTEXACT,          /* Exactly n matches */
+  OP_NOTSTAR,           /* 29 The maximizing and minimizing versions of */
+  OP_NOTMINSTAR,        /* 30 all these opcodes must come in pairs, with */
+  OP_NOTPLUS,           /* 31 the minimizing one second. */
+  OP_NOTMINPLUS,        /* 32 This set applies to "not" single characters */
+  OP_NOTQUERY,          /* 33 */
+  OP_NOTMINQUERY,       /* 34 */
+  OP_NOTUPTO,           /* 35 From 0 to n matches */
+  OP_NOTMINUPTO,        /* 36 */
+  OP_NOTEXACT,          /* 37 Exactly n matches */
 
-  OP_TYPESTAR,          /* The maximizing and minimizing versions of */
-  OP_TYPEMINSTAR,       /* all these opcodes must come in pairs, with */
-  OP_TYPEPLUS,          /* the minimizing one second. These codes must */
-  OP_TYPEMINPLUS,       /* be in exactly the same order as those above. */
-  OP_TYPEQUERY,         /* This set applies to character types such as \d */
-  OP_TYPEMINQUERY,
-  OP_TYPEUPTO,          /* From 0 to n matches */
-  OP_TYPEMINUPTO,
-  OP_TYPEEXACT,         /* Exactly n matches */
+  OP_TYPESTAR,          /* 38 The maximizing and minimizing versions of */
+  OP_TYPEMINSTAR,       /* 39 all these opcodes must come in pairs, with */
+  OP_TYPEPLUS,          /* 40 the minimizing one second. These codes must */
+  OP_TYPEMINPLUS,       /* 41 be in exactly the same order as those above. */
+  OP_TYPEQUERY,         /* 42 This set applies to character types such as \d */
+  OP_TYPEMINQUERY,      /* 43 */
+  OP_TYPEUPTO,          /* 44 From 0 to n matches */
+  OP_TYPEMINUPTO,       /* 45 */
+  OP_TYPEEXACT,         /* 46 Exactly n matches */
 
-  OP_CRSTAR,            /* The maximizing and minimizing versions of */
-  OP_CRMINSTAR,         /* all these opcodes must come in pairs, with */
-  OP_CRPLUS,            /* the minimizing one second. These codes must */
-  OP_CRMINPLUS,         /* be in exactly the same order as those above. */
-  OP_CRQUERY,           /* These are for character classes and back refs */
-  OP_CRMINQUERY,
-  OP_CRRANGE,           /* These are different to the three seta above. */
-  OP_CRMINRANGE,
+  OP_CRSTAR,            /* 47 The maximizing and minimizing versions of */
+  OP_CRMINSTAR,         /* 48 all these opcodes must come in pairs, with */
+  OP_CRPLUS,            /* 49 the minimizing one second. These codes must */
+  OP_CRMINPLUS,         /* 50 be in exactly the same order as those above. */
+  OP_CRQUERY,           /* 51 These are for character classes and back refs */
+  OP_CRMINQUERY,        /* 52 */
+  OP_CRRANGE,           /* 53 These are different to the three seta above. */
+  OP_CRMINRANGE,        /* 54 */
 
-  OP_CLASS,             /* Match a character class */
-  OP_REF,               /* Match a back reference */
-  OP_RECURSE,           /* Match this pattern recursively */
+  OP_CLASS,             /* 55 Match a character class */
+  OP_REF,               /* 56 Match a back reference */
+  OP_RECURSE,           /* 57 Match a numbered subpattern (possibly recursive) */
+  OP_CALLOUT,           /* 58 Call out to external function if provided */
 
-  OP_ALT,               /* Start of alternation */
-  OP_KET,               /* End of group that doesn't have an unbounded repeat */
-  OP_KETRMAX,           /* These two must remain together and in this */
-  OP_KETRMIN,           /* order. They are for groups the repeat for ever. */
+  OP_ALT,               /* 59 Start of alternation */
+  OP_KET,               /* 60 End of group that doesn't have an unbounded repeat */
+  OP_KETRMAX,           /* 61 These two must remain together and in this */
+  OP_KETRMIN,           /* 62 order. They are for groups the repeat for ever. */
 
   /* The assertions must come before ONCE and COND */
 
-  OP_ASSERT,            /* Positive lookahead */
-  OP_ASSERT_NOT,        /* Negative lookahead */
-  OP_ASSERTBACK,        /* Positive lookbehind */
-  OP_ASSERTBACK_NOT,    /* Negative lookbehind */
-  OP_REVERSE,           /* Move pointer back - used in lookbehind assertions */
+  OP_ASSERT,            /* 63 Positive lookahead */
+  OP_ASSERT_NOT,        /* 64 Negative lookahead */
+  OP_ASSERTBACK,        /* 65 Positive lookbehind */
+  OP_ASSERTBACK_NOT,    /* 66 Negative lookbehind */
+  OP_REVERSE,           /* 67 Move pointer back - used in lookbehind assertions */
 
   /* ONCE and COND must come after the assertions, with ONCE first, as there's
   a test for >= ONCE for a subpattern that isn't an assertion. */
 
-  OP_ONCE,              /* Once matched, don't back up into the subpattern */
-  OP_COND,              /* Conditional group */
-  OP_CREF,              /* Used to hold an extraction string number (cond ref) */
+  OP_ONCE,              /* 68 Once matched, don't back up into the subpattern */
+  OP_COND,              /* 69 Conditional group */
+  OP_CREF,              /* 70 Used to hold an extraction string number (cond ref) */
 
-  OP_BRAZERO,           /* These two must remain together and in this */
-  OP_BRAMINZERO,        /* order. */
+  OP_BRAZERO,           /* 71 These two must remain together and in this */
+  OP_BRAMINZERO,        /* 72 order. */
 
-  OP_BRANUMBER,         /* Used for extracting brackets whose number is greater
-                           than can fit into an opcode. */
+  OP_BRANUMBER,         /* 73 Used for extracting brackets whose number is greater
+                           than can fit into an opcode. */
 
-  OP_BRA                /* This and greater values are used for brackets that
-                           extract substrings up to a basic limit. After that,
-                           use is made of OP_BRANUMBER. */
+  OP_BRA                /* 74 This and greater values are used for brackets that
+                           extract substrings up to a basic limit. After that,
+                           use is made of OP_BRANUMBER. */
 };
 
+
+/* This macro defines textual names for all the opcodes. There are used only
+for debugging, in pcre.c when DEBUG is defined, and also in pcretest.c. The
+macro is referenced only in printint.c. */
+
+#define OP_NAME_LIST \
+  "End", "\\A", "\\G", "\\B", "\\b", "\\D", "\\d", \
+  "\\S", "\\s", "\\W", "\\w", "Any", "Anybyte", "\\Z", "\\z", \
+  "Opt", "^", "$", "chars", "not", \
+  "*", "*?", "+", "+?", "?", "??", "{", "{", "{", \
+  "*", "*?", "+", "+?", "?", "??", "{", "{", "{", \
+  "*", "*?", "+", "+?", "?", "??", "{", "{", "{", \
+  "*", "*?", "+", "+?", "?", "??", "{", "{", \
+  "class", "Ref", "Recurse", "Callout", \
+  "Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not", \
+  "AssertB", "AssertB not", "Reverse", "Once", "Cond", "Cond ref",\
+  "Brazero", "Braminzero", "Branumber", "Bra"
+
+
+/* This macro defines the length of fixed length operations in the compiled
+regex. The lengths are used when searching for specific things, and also in the
+debugging printing of a compiled regex. We use a macro so that it can be
+incorporated both into pcre.c and pcretest.c without being publicly exposed. */
+
+#define OP_LENGTHS \
+  1,                             /* End */ \
+  1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  /* \A, \G, \B, \B, \D, \d, \S, \s, \W, \w */ \
+  1, 1, 1, 1, 2, 1, 1,           /* Any, Anybyte, \Z, \z, Opt, ^, $ */ \
+  2,                             /* Chars - the minimum length */ \
+  2,                             /* not */ \
+  /* Positive single-char repeats */ \
+  2, 2, 2, 2, 2, 2,              /* *, *?, +, +?, ?, ?? */ \
+  4, 4, 4,                       /* upto, minupto, exact */ \
+  /* Negative single-char repeats */ \
+  2, 2, 2, 2, 2, 2,              /* NOT *, *?, +, +?, ?, ?? */ \
+  4, 4, 4,                       /* NOT upto, minupto, exact */ \
+  /* Positive type repeats */ \
+  2, 2, 2, 2, 2, 2,              /* Type *, *?, +, +?, ?, ?? */ \
+  4, 4, 4,                       /* Type upto, minupto, exact */ \
+  /* Multi-char class repeats */ \
+  1, 1, 1, 1, 1, 1,              /* *, *?, +, +?, ?, ?? */ \
+  5, 5,                          /* CRRANGE, CRMINRANGE */ \
+  33, 3,                         /* CLASS, REF */ \
+  1+LINK_SIZE,                   /* RECURSE */ \
+  2,                             /* CALLOUT */ \
+  1+LINK_SIZE,                   /* Alt */ \
+  1+LINK_SIZE,                   /* Ket */ \
+  1+LINK_SIZE,                   /* KetRmax */ \
+  1+LINK_SIZE,                   /* KetRmin */ \
+  1+LINK_SIZE,                   /* Assert */ \
+  1+LINK_SIZE,                   /* Assert not */ \
+  1+LINK_SIZE,                   /* Assert behind */ \
+  1+LINK_SIZE,                   /* Assert behind not */ \
+  1+LINK_SIZE,                   /* Reverse */ \
+  1+LINK_SIZE,                   /* Once */ \
+  1,                             /* COND */ \
+  3,                             /* CREF */ \
+  1, 1,                          /* BRAZERO, BRAMINZERO */ \
+  3,                             /* BRANUMBER */ \
+  1+LINK_SIZE                    /* BRA */ \
+
+
 /* The highest extraction number before we have to start using additional
 bytes. (Originally PCRE didn't have support for extraction counts highter than
 this number.) The value is limited by the number of opcodes left after OP_BRA,
@@ -283,6 +449,10 @@ opcodes. */
 
 #define EXTRACT_BASIC_MAX  150
 
+/* A magic value for OP_CREF to indicate the "in recursion" condition. */
+
+#define CREF_RECURSE  0xffff
+
 /* The texts of compile-time error messages are defined as macros here so that
 they can be accessed by the POSIX wrapper and converted into error codes. Yes,
 I could have used error codes in the first place, but didn't feel like changing
@@ -300,9 +470,9 @@ just to accommodate the POSIX wrapper. */
 #define ERR10 "operand of unlimited repeat could match the empty string"
 #define ERR11 "internal error: unexpected repeat"
 #define ERR12 "unrecognized character after (?"
-#define ERR13 "unused error"
+#define ERR13 "POSIX named classes are supported only within a class"
 #define ERR14 "missing )"
-#define ERR15 "back reference to non-existent subpattern"
+#define ERR15 "reference to non-existent subpattern"
 #define ERR16 "erroffset passed as NULL"
 #define ERR17 "unknown option bit(s) set"
 #define ERR18 "missing ) after comment"
@@ -316,13 +486,21 @@ just to accommodate the POSIX wrapper. */
 #define ERR26 "malformed number after (?("
 #define ERR27 "conditional group contains more than two branches"
 #define ERR28 "assertion expected after (?("
-#define ERR29 "(?p must be followed by )"
+#define ERR29 "(?R or (?digits must be followed by )"
 #define ERR30 "unknown POSIX class name"
 #define ERR31 "POSIX collating elements are not supported"
 #define ERR32 "this version of PCRE is not compiled with PCRE_UTF8 support"
 #define ERR33 "characters with values > 255 are not yet supported in classes"
 #define ERR34 "character value in \\x{...} sequence is too large"
 #define ERR35 "invalid condition (?(0)"
+#define ERR36 "\\C not allowed in lookbehind assertion"
+#define ERR37 "PCRE does not support \\L, \\l, \\N, \\P, \\p, \\U, \\u, or \\X"
+#define ERR38 "number after (?C is > 255"
+#define ERR39 "closing ) for (?C expected"
+#define ERR40 "recursive call could loop indefinitely"
+#define ERR41 "unrecognized character after (?P"
+#define ERR42 "syntax error after (?P"
+#define ERR43 "two named groups have the same name"
 
 /* All character handling must be done as unsigned characters. Otherwise there
 are problems with top-bit-set characters and functions such as isspace().
@@ -333,19 +511,20 @@ Unix, where it is defined in sys/types, so use "uschar" instead. */
 
 typedef unsigned char uschar;
 
-/* The real format of the start of the pcre block; the actual code vector
-runs on as long as necessary after the end. */
+/* The real format of the start of the pcre block; the index of names and the
+code vector run on as long as necessary after the end. */
 
 typedef struct real_pcre {
   unsigned long int magic_number;
-  size_t size;
-  const unsigned char *tables;
+  size_t size;                        /* Total that was malloced */
+  const unsigned char *tables;        /* Pointer to tables */
   unsigned long int options;
   unsigned short int top_bracket;
   unsigned short int top_backref;
-  uschar first_char;
-  uschar req_char;
-  uschar code[1];
+  unsigned short int first_char;
+  unsigned short int req_char;
+  unsigned short int name_entry_size; /* Size of any name items; 0 => none */
+  unsigned short int name_count;      /* Number of name items */
 } real_pcre;
 
 /* The real format of the extra block returned by pcre_study(). */
@@ -364,8 +543,32 @@ typedef struct compile_data {
   const uschar *fcc;            /* Points to case-flipping table */
   const uschar *cbits;          /* Points to character type table */
   const uschar *ctypes;         /* Points to table of type maps */
+  const uschar *start_code;     /* The start of the compiled code */
+  uschar *name_table;           /* The name/number table */
+  int  names_found;             /* Number of entries so far */
+  int  name_entry_size;         /* Size of each entry */
 } compile_data;
 
+/* Structure for maintaining a chain of pointers to the currently incomplete
+branches, for testing for left recursion. */
+
+typedef struct branch_chain {
+  struct branch_chain *outer;
+  uschar *current;
+} branch_chain;
+
+/* Structure for items in a linked list that represents an explicit recursive
+call within the pattern.
*/ + +typedef struct recursion_info { + struct recursion_info *prev; /* Previous recursion record (or NULL) */ + int group_num; /* Number of group that was called */ + const uschar *after_call; /* "Return value": points after the call in the expr */ + const uschar *save_start; /* Old value of md->start_match */ + int *offset_save; /* Pointer to start of saved offsets */ + int saved_max; /* Number of saved offsets */ +} recursion_info; + /* Structure for passing "static" information around between the functions doing the matching, so that they are thread-safe. */ @@ -382,12 +585,15 @@ typedef struct match_data { BOOL utf8; /* UTF8 flag */ BOOL endonly; /* Dollar not before final \n */ BOOL notempty; /* Empty string match not wanted */ - const uschar *start_pattern; /* For use when recursing */ + const uschar *start_code; /* For use when recursing */ const uschar *start_subject; /* Start of the subject string */ const uschar *end_subject; /* End of the subject string */ const uschar *start_match; /* Start of this match attempt */ const uschar *end_match_ptr; /* Subject position at end match */ int end_offset_top; /* Highwater mark at end of match */ + int capture_last; /* Most recent capture number */ + int start_offset; /* The start offset value */ + recursion_info *recursive; /* Linked list of recursion data */ } match_data; /* Bit definitions for entries in the pcre_ctypes table. */ diff --git a/ext/pcre/pcrelib/maketables.c b/ext/pcre/pcrelib/maketables.c index 01078f19e6a..f89765214c4 100644 --- a/ext/pcre/pcrelib/maketables.c +++ b/ext/pcre/pcrelib/maketables.c @@ -82,7 +82,9 @@ for (i = 0; i < 256; i++) *p++ = tolower(i); for (i = 0; i < 256; i++) *p++ = islower(i)? toupper(i) : tolower(i); /* Then the character class tables. Don't try to be clever and save effort -on exclusive ones - in some locales things may be different. */ +on exclusive ones - in some locales things may be different. 
Note that the +table for "space" includes everything "isspace" gives, including VT in the +default locale. This makes it work for the POSIX class [:space:]. */ memset(p, 0, cbit_length); for (i = 0; i < 256; i++) @@ -112,12 +114,14 @@ for (i = 0; i < 256; i++) } p += cbit_length; -/* Finally, the character type table */ +/* Finally, the character type table. In this, we exclude VT from the white +space chars, because Perl doesn't recognize it as such for \s and for comments +within regexes. */ for (i = 0; i < 256; i++) { int x = 0; - if (isspace(i)) x += ctype_space; + if (i != 0x0b && isspace(i)) x += ctype_space; if (isalpha(i)) x += ctype_letter; if (isdigit(i)) x += ctype_digit; if (isxdigit(i)) x += ctype_xdigit; diff --git a/ext/pcre/pcrelib/pcre-config.in b/ext/pcre/pcrelib/pcre-config.in deleted file mode 100644 index 8daded9fe12..00000000000 --- a/ext/pcre/pcrelib/pcre-config.in +++ /dev/null @@ -1,59 +0,0 @@ -#!/bin/sh - -prefix=@prefix@ -exec_prefix=@exec_prefix@ -exec_prefix_set=no - -usage="\ -Usage: pcre-config [--prefix] [--exec-prefix] [--version] [--libs] [--libs-posix] [--cflags] [--cflags-posix]" - -if test $# -eq 0; then - echo "${usage}" 1>&2 - exit 1 -fi - -while test $# -gt 0; do - case "$1" in - -*=*) optarg=`echo "$1" | sed 's/[-_a-zA-Z0-9]*=//'` ;; - *) optarg= ;; - esac - - case $1 in - --prefix=*) - prefix=$optarg - if test $exec_prefix_set = no ; then - exec_prefix=$optarg - fi - ;; - --prefix) - echo $prefix - ;; - --exec-prefix=*) - exec_prefix=$optarg - exec_prefix_set=yes - ;; - --exec-prefix) - echo $exec_prefix - ;; - --version) - echo @PCRE_VERSION@ - ;; - --cflags | --cflags-posix) - if test @includedir@ != /usr/include ; then - includes=-I@includedir@ - fi - echo $includes - ;; - --libs-posix) - echo -L@libdir@ -lpcreposix -lpcre - ;; - --libs) - echo -L@libdir@ -lpcre - ;; - *) - echo "${usage}" 1>&2 - exit 1 - ;; - esac - shift -done diff --git a/ext/pcre/pcrelib/pcre.c b/ext/pcre/pcrelib/pcre.c index ad3ddc7c573..9d18d989bf9 
100644 --- a/ext/pcre/pcrelib/pcre.c +++ b/ext/pcre/pcrelib/pcre.c @@ -9,7 +9,7 @@ the file Tech.Notes for some information on the internals. Written by: Philip Hazel - Copyright (c) 1997-2001 University of Cambridge + Copyright (c) 1997-2002 University of Cambridge ----------------------------------------------------------------------------- Permission is granted to anyone to use this software for any purpose on any @@ -69,6 +69,14 @@ compile time. */ #define BRASTACK_SIZE 200 + +/* Maximum number of ints of offset to save on the stack for recursive calls. +If the offset vector is bigger, malloc is used. This should be a multiple of 3, +because the offset vector is always a multiple of 3 long. */ + +#define REC_STACK_SAVE_MAX 30 + + /* The number of bytes in a literal character string above which we can't add any more is different when UTF-8 characters may be encountered. */ @@ -79,29 +87,16 @@ any more is different when UTF-8 characters may be encountered. */ #endif +/* Table of sizes for the fixed-length opcodes. It's defined in a macro so that +the definition is next to the definition of the opcodes in internal.h. 
*/ + +static uschar OP_lengths[] = { OP_LENGTHS }; + /* Min and max values for the common repeats; for the maxima, 0 => infinity */ static const char rep_min[] = { 0, 0, 1, 1, 0, 0 }; static const char rep_max[] = { 0, 0, 0, 0, 1, 1 }; -/* Text forms of OP_ values and things, for debugging (not all used) */ - -#ifdef DEBUG -static const char *OP_names[] = { - "End", "\\A", "\\B", "\\b", "\\D", "\\d", - "\\S", "\\s", "\\W", "\\w", "\\Z", "\\z", - "Opt", "^", "$", "Any", "chars", "not", - "*", "*?", "+", "+?", "?", "??", "{", "{", "{", - "*", "*?", "+", "+?", "?", "??", "{", "{", "{", - "*", "*?", "+", "+?", "?", "??", "{", "{", "{", - "*", "*?", "+", "+?", "?", "??", "{", "{", - "class", "Ref", "Recurse", - "Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not", - "AssertB", "AssertB not", "Reverse", "Once", "Cond", "Cref", - "Brazero", "Braminzero", "Branumber", "Bra" -}; -#endif - /* Table for handling escaped characters in the range '0'-'z'. Positive returns are simple data values; negative values are for special things like \d and so on. Zero means further processing is needed (for things like \x), or the escape @@ -110,13 +105,13 @@ is invalid. */ static const short int escapes[] = { 0, 0, 0, 0, 0, 0, 0, 0, /* 0 - 7 */ 0, 0, ':', ';', '<', '=', '>', '?', /* 8 - ? 
*/ - '@', -ESC_A, -ESC_B, 0, -ESC_D, 0, 0, 0, /* @ - G */ + '@', -ESC_A, -ESC_B, -ESC_C, -ESC_D, -ESC_E, 0, -ESC_G, /* @ - G */ 0, 0, 0, 0, 0, 0, 0, 0, /* H - O */ - 0, 0, 0, -ESC_S, 0, 0, 0, -ESC_W, /* P - W */ + 0, -ESC_Q, 0, -ESC_S, 0, 0, 0, -ESC_W, /* P - W */ 0, 0, -ESC_Z, '[', '\\', ']', '^', '_', /* X - _ */ - '`', 7, -ESC_b, 0, -ESC_d, ESC_E, ESC_F, 0, /* ` - g */ - 0, 0, 0, 0, 0, 0, ESC_N, 0, /* h - o */ - 0, 0, ESC_R, -ESC_s, ESC_T, 0, 0, -ESC_w, /* p - w */ + '`', 7, -ESC_b, 0, -ESC_d, ESC_e, ESC_f, 0, /* ` - g */ + 0, 0, 0, 0, 0, 0, ESC_n, 0, /* h - o */ + 0, 0, ESC_r, -ESC_s, ESC_t, 0, 0, -ESC_w, /* p - w */ 0, 0, -ESC_z /* x - z */ }; @@ -126,14 +121,15 @@ as this is assumed for handling case independence. */ static const char *posix_names[] = { "alpha", "lower", "upper", - "alnum", "ascii", "cntrl", "digit", "graph", + "alnum", "ascii", "blank", "cntrl", "digit", "graph", "print", "punct", "space", "word", "xdigit" }; static const uschar posix_name_lengths[] = { - 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 6, 0 }; + 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 6, 0 }; /* Table of class bit maps for each POSIX class; up to three may be combined -to form the class. */ +to form the class. The table for [:blank:] is dynamically modified to remove +the vertical space characters. 
*/ static const int posix_class_maps[] = { cbit_lower, cbit_upper, -1, /* alpha */ @@ -141,13 +137,14 @@ static const int posix_class_maps[] = { cbit_upper, -1, -1, /* upper */ cbit_digit, cbit_lower, cbit_upper, /* alnum */ cbit_print, cbit_cntrl, -1, /* ascii */ + cbit_space, -1, -1, /* blank - a GNU extension */ cbit_cntrl, -1, -1, /* cntrl */ cbit_digit, -1, -1, /* digit */ cbit_graph, -1, -1, /* graph */ cbit_print, -1, -1, /* print */ cbit_punct, -1, -1, /* punct */ cbit_space, -1, -1, /* space */ - cbit_word, -1, -1, /* word */ + cbit_word, -1, -1, /* word - a Perl extension */ cbit_xdigit,-1, -1 /* xdigit */ }; @@ -156,7 +153,7 @@ static const int posix_class_maps[] = { static BOOL compile_regex(int, int, int *, uschar **, const uschar **, const char **, - BOOL, int, int *, int *, compile_data *); + BOOL, int, int *, int *, branch_chain *, compile_data *); /* Structure for building a chain of data that actually lives on the stack, for holding the values of the subject pointer at the start of each @@ -181,12 +178,15 @@ typedef struct eptrblock { /* PCRE is thread-clean and doesn't use any global variables in the normal sense. However, it calls memory allocation and free functions via the two -indirections below, which are can be changed by the caller, but are shared -between all threads. */ +indirections below, and it can optionally do callouts. These values can be +changed by the caller, but are shared between all threads. However, when +compiling for Virtual Pascal, things are done differently (see pcre.in). 
*/ +#ifndef VPCOMPAT void *(*pcre_malloc)(size_t) = malloc; void (*pcre_free)(void *) = free; - +int (*pcre_callout)(pcre_callout_block *) = NULL; +#endif /************************************************* @@ -322,6 +322,19 @@ return i + 1; +/************************************************* +* Print compiled regex * +*************************************************/ + +/* The code for doing this is held in a separate file that is also included in +pcretest.c. It defines a function called print_internals(). */ + +#ifdef DEBUG +#include "printint.c" +#endif + + + /************************************************* * Return version string * *************************************************/ @@ -436,6 +449,18 @@ switch (what) ((re->options & PCRE_REQCHSET) != 0)? re->req_char : -1; break; + case PCRE_INFO_NAMEENTRYSIZE: + *((int *)where) = re->name_entry_size; + break; + + case PCRE_INFO_NAMECOUNT: + *((int *)where) = re->name_count; + break; + + case PCRE_INFO_NAMETABLE: + *((const uschar **)where) = (const uschar *)re + sizeof(real_pcre); + break; + default: return PCRE_ERROR_BADOPTION; } @@ -525,6 +550,20 @@ else const uschar *oldptr; switch (c) { + /* A number of Perl escapes are not handled by PCRE. We give an explicit + error. */ + + case 'l': + case 'L': + case 'N': + case 'p': + case 'P': + case 'u': + case 'U': + case 'X': + *errorptr = ERR37; + break; + /* The handling of escape sequences consisting of a string of digits starting with one that is not zero is not straightforward. By experiment, the way Perl works seems to be as follows: @@ -745,6 +784,60 @@ return p; +/************************************************* +* Find first significant op code * +*************************************************/ + +/* This is called by several functions that scan a compiled expression looking +for a fixed first character, or an anchoring op code etc. It skips over things +that do not influence this. For some calls, a change of option is important. 
+ +Arguments: + code pointer to the start of the group + options pointer to external options + optbit the option bit whose changing is significant, or + zero if none are + +Returns: pointer to the first significant opcode +*/ + +static const uschar* +first_significant_code(const uschar *code, int *options, int optbit) +{ +for (;;) + { + switch ((int)*code) + { + case OP_OPT: + if (optbit > 0 && ((int)code[1] & optbit) != (*options & optbit)) + *options = (int)code[1]; + code += 2; + break; + + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + do code += GET(code, 1); while (*code == OP_ALT); + /* Fall through */ + + case OP_CALLOUT: + case OP_CREF: + case OP_BRANUMBER: + case OP_WORD_BOUNDARY: + case OP_NOT_WORD_BOUNDARY: + code += OP_lengths[*code]; + break; + + default: + return code; + } + } +/* Control never reaches here */ +} + + + + /************************************************* * Find the fixed length of a pattern * *************************************************/ @@ -756,7 +849,8 @@ Arguments: code points to the start of the pattern (the bracket) options the compiling options -Returns: the fixed length, or -1 if there is no fixed length +Returns: the fixed length, or -1 if there is no fixed length, + or -2 if \C was encountered */ static int @@ -765,7 +859,7 @@ find_fixedlength(uschar *code, int options) int length = -1; register int branchlength = 0; -register uschar *cc = code + 3; +register uschar *cc = code + 1 + LINK_SIZE; /* Scan along the opcodes for this branch. If we get to the end of the branch, check the length against that of the other branches. 
*/ @@ -782,10 +876,10 @@ for (;;) case OP_ONCE: case OP_COND: d = find_fixedlength(cc, options); - if (d < 0) return -1; + if (d < 0) return d; branchlength += d; - do cc += (cc[1] << 8) + cc[2]; while (*cc == OP_ALT); - cc += 3; + do cc += GET(cc, 1); while (*cc == OP_ALT); + cc += 1 + LINK_SIZE; break; /* Reached end of a branch; if it's a ket it is the end of a nested @@ -800,7 +894,7 @@ for (;;) if (length < 0) length = branchlength; else if (length != branchlength) return -1; if (*cc != OP_ALT) return length; - cc += 3; + cc += 1 + LINK_SIZE; branchlength = 0; break; @@ -810,36 +904,31 @@ for (;;) case OP_ASSERT_NOT: case OP_ASSERTBACK: case OP_ASSERTBACK_NOT: - do cc += (cc[1] << 8) + cc[2]; while (*cc == OP_ALT); - cc += 3; - break; + do cc += GET(cc, 1); while (*cc == OP_ALT); + /* Fall through */ /* Skip over things that don't match chars */ case OP_REVERSE: case OP_BRANUMBER: case OP_CREF: - cc++; - /* Fall through */ - case OP_OPT: - cc++; - /* Fall through */ - + case OP_CALLOUT: case OP_SOD: + case OP_SOM: case OP_EOD: case OP_EODN: case OP_CIRC: case OP_DOLL: case OP_NOT_WORD_BOUNDARY: case OP_WORD_BOUNDARY: - cc++; + cc += OP_lengths[*cc]; break; /* Handle char strings. In UTF-8 mode we must count characters, not bytes. This requires a scan of the string, unfortunately. We assume valid UTF-8 - strings, so all we do is reduce the length by one for byte whose bits are - 10xxxxxx. */ + strings, so all we do is reduce the length by one for every byte whose bits + are 10xxxxxx. 
*/ case OP_CHARS: branchlength += *(++cc); @@ -854,7 +943,7 @@ for (;;) case OP_EXACT: case OP_TYPEEXACT: - branchlength += (cc[1] << 8) + cc[2]; + branchlength += GET2(cc,1); cc += 4; break; @@ -871,6 +960,10 @@ for (;;) cc++; break; + /* The single-byte matcher isn't allowed */ + + case OP_ANYBYTE: + return -2; /* Check a class for variable quantification */ @@ -887,8 +980,8 @@ for (;;) case OP_CRRANGE: case OP_CRMINRANGE: - if ((cc[1] << 8) + cc[2] != (cc[3] << 8) + cc[4]) return -1; - branchlength += (cc[1] << 8) + cc[2]; + if (GET2(cc,1) != GET2(cc,3)) return -1; + branchlength += GET2(cc,1); cc += 5; break; @@ -909,6 +1002,184 @@ for (;;) +/************************************************* +* Scan compiled regex for numbered bracket * +*************************************************/ + +/* This little function scans through a compiled pattern until it finds a +capturing bracket with the given number. + +Arguments: + code points to start of expression + number the required bracket number + +Returns: pointer to the opcode for the bracket, or NULL if not found +*/ + +static const uschar * +find_bracket(const uschar *code, int number) +{ +for (;;) + { + register int c = *code; + if (c == OP_END) return NULL; + else if (c == OP_CHARS) code += code[1] + OP_lengths[c]; + else if (c > OP_BRA) + { + int n = c - OP_BRA; + if (n > EXTRACT_BASIC_MAX) n = GET2(code, 2+LINK_SIZE); + if (n == number) return (uschar *)code; + code += OP_lengths[OP_BRA]; + } + else code += OP_lengths[c]; + } +} + + + +/************************************************* +* Scan compiled branch for non-emptiness * +*************************************************/ + +/* This function scans through a branch of a compiled pattern to see whether it +can match the empty string or not. It is called only from could_be_empty() +below. Note that first_significant_code() skips over assertions. 
If we hit an +unclosed bracket, we return "empty" - this means we've struck an inner bracket +whose current branch will already have been scanned. + +Arguments: + code points to start of search + endcode points to where to stop + +Returns: TRUE if what is matched could be empty +*/ + +static BOOL +could_be_empty_branch(const uschar *code, const uschar *endcode) +{ +register int c; +for (code = first_significant_code(code + 1 + LINK_SIZE, NULL, 0); + code < endcode; + code = first_significant_code(code + OP_lengths[c], NULL, 0)) + { + c = *code; + + if (c >= OP_BRA) + { + BOOL empty_branch; + if (GET(code, 1) == 0) return TRUE; /* Hit unclosed bracket */ + + /* Scan a closed bracket */ + + empty_branch = FALSE; + do + { + if (!empty_branch && could_be_empty_branch(code, endcode)) + empty_branch = TRUE; + code += GET(code, 1); + } + while (*code == OP_ALT); + if (!empty_branch) return FALSE; /* All branches are non-empty */ + code += 1 + LINK_SIZE; + c = *code; + } + + /* Check for any quantifier after a class */ + + else if (c == OP_CLASS) + { + const uschar *ccode = code + 33; + + switch (*ccode) + { + case OP_CRSTAR: /* These could be empty; continue */ + case OP_CRMINSTAR: + case OP_CRQUERY: + case OP_CRMINQUERY: + break; + + default: /* Non-repeat => class must match */ + case OP_CRPLUS: /* These repeats aren't empty */ + case OP_CRMINPLUS: + return FALSE; + + case OP_CRRANGE: + case OP_CRMINRANGE: + if (GET2(ccode, 1) > 0) return FALSE; /* Minimum > 0 */ + break; + } + } + + /* Test for an opcode that must match a character. 
*/ + + else switch (c) + { + case OP_NOT_DIGIT: + case OP_DIGIT: + case OP_NOT_WHITESPACE: + case OP_WHITESPACE: + case OP_NOT_WORDCHAR: + case OP_WORDCHAR: + case OP_ANY: + case OP_ANYBYTE: + case OP_CHARS: + case OP_NOT: + case OP_PLUS: + case OP_MINPLUS: + case OP_EXACT: + case OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTEXACT: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case OP_TYPEEXACT: + return FALSE; + + /* End of branch */ + + case OP_KET: + case OP_KETRMAX: + case OP_KETRMIN: + case OP_ALT: + return TRUE; + } + } + +return TRUE; +} + + + +/************************************************* +* Scan compiled regex for non-emptiness * +*************************************************/ + +/* This function is called to check for left recursive calls. We want to check +the current branch of the current pattern to see if it could match the empty +string. If it could, we must look outwards for branches at other levels, +stopping when we pass beyond the bracket which is the subject of the recursion. + +Arguments: + code points to start of the recursion + endcode points to where to stop (current RECURSE item) + bcptr points to the chain of current (unclosed) branch starts + +Returns: TRUE if what is matched could be empty +*/ + +static BOOL +could_be_empty(const uschar *code, const uschar *endcode, branch_chain *bcptr) +{ +while (bcptr != NULL && bcptr->current >= code) + { + if (!could_be_empty_branch(bcptr->current, endcode)) return FALSE; + bcptr = bcptr->outer; + } +return TRUE; +} + + + /************************************************* * Check for POSIX class syntax * *************************************************/ @@ -978,38 +1249,45 @@ return -1; * Compile one branch * *************************************************/ -/* Scan the pattern, compiling it into the code vector. +/* Scan the pattern, compiling it into the code vector. If the options are +changed during the branch, the pointer is used to change the external options +bits. 
Arguments: - options the option bits - brackets points to number of extracting brackets used - code points to the pointer to the current code point - ptrptr points to the current pattern pointer - errorptr points to pointer to error message - optchanged set to the value of the last OP_OPT item compiled - reqchar set to the last literal character required, else -1 - countlits set to count of mandatory literal characters - cd contains pointers to tables + optionsptr pointer to the option bits + brackets points to number of extracting brackets used + code points to the pointer to the current code point + ptrptr points to the current pattern pointer + errorptr points to pointer to error message + firstcharptr set to initial literal character, or < 0 (REQ_UNSET, REQ_NONE) + reqcharptr set to the last literal character required, else < 0 + bcptr points to current branch chain + cd contains pointers to tables etc. -Returns: TRUE on success - FALSE, with *errorptr set on error +Returns: TRUE on success + FALSE, with *errorptr set on error */ static BOOL -compile_branch(int options, int *brackets, uschar **codeptr, - const uschar **ptrptr, const char **errorptr, int *optchanged, - int *reqchar, int *countlits, compile_data *cd) +compile_branch(int *optionsptr, int *brackets, uschar **codeptr, + const uschar **ptrptr, const char **errorptr, int *firstcharptr, + int *reqcharptr, branch_chain *bcptr, compile_data *cd) { int repeat_type, op_type; -int repeat_min, repeat_max; -int bravalue, length; +int repeat_min = 0, repeat_max = 0; /* To please picky compilers */ +int bravalue = 0; +int length; int greedy_default, greedy_non_default; -int prevreqchar; +int firstchar, reqchar; +int zeroreqchar, zerofirstchar; +int req_caseopt; int condcount = 0; -int subcountlits = 0; +int options = *optionsptr; register int c; register uschar *code = *codeptr; uschar *tempcode; +BOOL inescq = FALSE; +BOOL groupsetfirstchar = FALSE; const uschar *ptr = *ptrptr; const uschar *tempptr; uschar 
*previous = NULL; @@ -1020,23 +1298,42 @@ uschar class[32]; greedy_default = ((options & PCRE_UNGREEDY) != 0); greedy_non_default = greedy_default ^ 1; -/* Initialize no required char, and count of literals */ +/* Initialize no first char, no required char. REQ_UNSET means "no char +matching encountered yet". It gets changed to REQ_NONE if we hit something that +matches a non-fixed char first char; reqchar just remains unset if we never +find one. -*reqchar = prevreqchar = -1; -*countlits = 0; +When we hit a repeat whose minimum is zero, we may have to adjust these values +to take the zero repeat into account. This is implemented by setting them to +zerofirstchar and zeroreqchar when such a repeat is encountered. The individual +item types that can be repeated set these backoff variables appropriately. */ + +firstchar = reqchar = zerofirstchar = zeroreqchar = REQ_UNSET; + +/* The variable req_caseopt contains either the REQ_CASELESS value or zero, +according to the current setting of the caseless flag. REQ_CASELESS is a bit +value > 255. It is added into the firstchar or reqchar variables to record the +case status of the value. */ + +req_caseopt = ((options & PCRE_CASELESS) != 0)? REQ_CASELESS : 0; /* Switch on next character until the end of the branch */ for (;; ptr++) { BOOL negate_class; + BOOL possessive_quantifier; int class_charcount; int class_lastchar; int newoptions; + int recno; int skipbytes; int subreqchar; + int subfirstchar; c = *ptr; + if (inescq && c != 0) goto NORMAL_CHAR; + if ((options & PCRE_EXTENDED) != 0) { if ((cd->ctypes[c] & ctype_space) != 0) continue; @@ -1045,7 +1342,7 @@ for (;; ptr++) /* The space before the ; is to avoid a warning on a silly compiler on the Macintosh. 
*/ while ((c = *(++ptr)) != 0 && c != NEWLINE) ; - continue; + if (c != 0) continue; /* Else fall through to handle end of string */ } } @@ -1056,13 +1353,20 @@ for (;; ptr++) case 0: case '|': case ')': + *firstcharptr = firstchar; + *reqcharptr = reqchar; *codeptr = code; *ptrptr = ptr; return TRUE; - /* Handle single-character metacharacters */ + /* Handle single-character metacharacters. In multiline mode, ^ disables + the setting of any following char as a first character. */ case '^': + if ((options & PCRE_MULTILINE) != 0) + { + if (firstchar == REQ_UNSET) firstchar = REQ_NONE; + } previous = NULL; *code++ = OP_CIRC; break; @@ -1072,7 +1376,13 @@ for (;; ptr++) *code++ = OP_DOLL; break; + /* There can never be a first char if '.' is first, whatever happens about + repeats. The value of reqchar doesn't change either. */ + case '.': + if (firstchar == REQ_UNSET) firstchar = REQ_NONE; + zerofirstchar = firstchar; + zeroreqchar = reqchar; previous = code; *code++ = OP_ANY; break; @@ -1086,6 +1396,16 @@ for (;; ptr++) previous = code; *code++ = OP_CLASS; + /* PCRE supports POSIX class stuff inside a class. Perl gives an error if + they are encountered at the top level, so we'll do that too. */ + + if ((ptr[1] == ':' || ptr[1] == '.' || ptr[1] == '=') && + check_posix_syntax(ptr, &tempptr, cd)) + { + *errorptr = (ptr[1] == ':')? ERR13 : ERR31; + goto FAILED; + } + /* If the first character is '^', set the negation flag and skip it. */ if ((c = *(++ptr)) == '^') @@ -1109,21 +1429,16 @@ for (;; ptr++) memset(class, 0, 32 * sizeof(uschar)); /* Process characters until ] is reached. By writing this as a "do" it - means that an initial ] is taken as a data character. */ + means that an initial ] is taken as a data character. The first pass + checked the overall syntax. */ do { - if (c == 0) - { - *errorptr = ERR6; - goto FAILED; - } - /* Handle POSIX class names. Perl allows a negation extension of the - form [:^name]. 
A square bracket that doesn't match the syntax is + form [:^name:]. A square bracket that doesn't match the syntax is treated as a literal. We also recognize the POSIX constructions [.ch.] and [=ch=] ("collating elements") and fault them, as Perl - 5.6 does. */ + 5.6 and 5.8 do. */ if (c == '[' && (ptr[1] == ':' || ptr[1] == '.' || ptr[1] == '=') && @@ -1161,17 +1476,26 @@ for (;; ptr++) posix_class = 0; /* Or into the map we are building up to 3 of the static class - tables, or their negations. */ + tables, or their negations. The [:blank:] class sets up the same + chars as the [:space:] class (all white space). We remove the vertical + white space chars afterwards. */ posix_class *= 3; for (i = 0; i < 3; i++) { + BOOL isblank = strncmp(ptr, "blank", 5) == 0; int taboffset = posix_class_maps[posix_class + i]; if (taboffset < 0) break; if (local_negate) + { for (c = 0; c < 32; c++) class[c] |= ~cbits[c+taboffset]; + if (isblank) class[1] |= 0x3c; + } else + { for (c = 0; c < 32; c++) class[c] |= cbits[c+taboffset]; + if (isblank) class[1] &= ~0x3c; + } } ptr = tempptr + 1; @@ -1194,7 +1518,7 @@ for (;; ptr++) else if (c < 0) { register const uschar *cbits = cd->cbits; - class_charcount = 10; + class_charcount = 10; /* Greater than 1 is what matters */ switch (-c) { case ESC_d: @@ -1215,10 +1539,12 @@ for (;; ptr++) case ESC_s: for (c = 0; c < 32; c++) class[c] |= cbits[c+cbit_space]; + class[1] &= ~0x08; /* Perl 5.004 onwards omits VT from \s */ continue; case ESC_S: for (c = 0; c < 32; c++) class[c] |= ~cbits[c+cbit_space]; + class[1] |= 0x08; /* Perl 5.004 onwards omits VT from \s */ continue; default: @@ -1249,12 +1575,6 @@ for (;; ptr++) ptr += 2; d = *ptr; - if (d == 0) - { - *errorptr = ERR6; - goto FAILED; - } - /* The second part of a range can be a single-character escape, but not any of the other escapes. Perl 5.6 treats a hyphen as a literal in such circumstances. 
*/ @@ -1324,18 +1644,34 @@ for (;; ptr++) while ((c = *(++ptr)) != ']'); /* If class_charcount is 1 and class_lastchar is not negative, we saw - precisely one character. This doesn't need the whole 32-byte bit map. - We turn it into a 1-character OP_CHAR if it's positive, or OP_NOT if - it's negative. */ + precisely one character. This doesn't need the whole 32-byte bit map. We + turn it into a 1-character OP_CHARS if it's positive, or OP_NOT if it's + negative. In the positive case, it can cause firstchar to be set. + Otherwise, there can be no first char if this item is first, whatever + repeat count may follow. In the case of reqchar, save the previous value + for reinstating. */ if (class_charcount == 1 && class_lastchar >= 0) { + zeroreqchar = reqchar; if (negate_class) { + if (firstchar == REQ_UNSET) firstchar = REQ_NONE; + zerofirstchar = firstchar; code[-1] = OP_NOT; } else { + if (firstchar == REQ_UNSET) + { + zerofirstchar = REQ_NONE; + firstchar = class_lastchar | req_caseopt; + } + else + { + zerofirstchar = firstchar; + reqchar = class_lastchar | req_caseopt; + } code[-1] = OP_CHARS; *code++ = 1; } @@ -1343,10 +1679,15 @@ for (;; ptr++) } /* Otherwise, negate the 32-byte map if necessary, and copy it into - the code vector. */ + the code vector. If this is the first thing in the branch, there can be + no first char setting, whatever the repeat count. Any reqchar setting + must remain unchanged after any kind of repeat. */ else { + if (firstchar == REQ_UNSET) firstchar = REQ_NONE; + zerofirstchar = firstchar; + zeroreqchar = reqchar; if (negate_class) for (c = 0; c < 32; c++) code[c] = ~class[c]; else @@ -1384,47 +1725,86 @@ for (;; ptr++) goto FAILED; } - /* If the next character is '?' this is a minimizing repeat, by default, - but if PCRE_UNGREEDY is set, it works the other way round. Advance to the - next character. 
*/ + if (repeat_min == 0) + { + firstchar = zerofirstchar; /* Adjust for zero repeat */ + reqchar = zeroreqchar; /* Ditto */ + } - if (ptr[1] == '?') - { repeat_type = greedy_non_default; ptr++; } + op_type = 0; /* Default single-char op codes */ + possessive_quantifier = FALSE; /* Default not possessive quantifier */ + + /* Save start of previous item, in case we have to move it up to make space + for an inserted OP_ONCE for the additional '+' extension. */ + + tempcode = previous; + + /* If the next character is '+', we have a possessive quantifier. This + implies greediness, whatever the setting of the PCRE_UNGREEDY option. + If the next character is '?' this is a minimizing repeat, by default, + but if PCRE_UNGREEDY is set, it works the other way round. We change the + repeat type to the non-default. */ + + if (ptr[1] == '+') + { + repeat_type = 0; /* Force greedy */ + possessive_quantifier = TRUE; + ptr++; + } + else if (ptr[1] == '?') + { + repeat_type = greedy_non_default; + ptr++; + } else repeat_type = greedy_default; + /* If previous was a recursion, we need to wrap it inside brackets so that + it can be replicated if necessary. */ + + if (*previous == OP_RECURSE) + { + memmove(previous + 1 + LINK_SIZE, previous, 1 + LINK_SIZE); + code += 1 + LINK_SIZE; + *previous = OP_BRA; + PUT(previous, 1, code - previous); + *code = OP_KET; + PUT(code, 1, code - previous); + code += 1 + LINK_SIZE; + } + /* If previous was a string of characters, chop off the last one and use it as the subject of the repeat. If there was only one character, we can - abolish the previous item altogether. A repeat with a zero minimum wipes - out any reqchar setting, backing up to the previous value. We must also - adjust the countlits value. */ + abolish the previous item altogether. 
If a one-char item has a minimum of + more than one, ensure that it is set in reqchar - it might not be if a + sequence such as x{3} is the first thing in a branch because the x will + have gone into firstchar instead. */ if (*previous == OP_CHARS) { int len = previous[1]; - - if (repeat_min == 0) *reqchar = prevreqchar; - *countlits += repeat_min - 1; - if (len == 1) { c = previous[2]; code = previous; + if (repeat_min > 1) reqchar = c | req_caseopt; } else { c = previous[len+1]; previous[1]--; code--; + tempcode = code; /* Adjust position to be moved for '+' */ } - op_type = 0; /* Use single-char op codes */ + goto OUTPUT_SINGLE_REPEAT; /* Code shared with single character types */ } /* If previous was a single negated character ([^a] or similar), we use one of the special opcodes, replacing it. The code is shared with single- - character repeats by adding a suitable offset into repeat_type. */ + character repeats by setting op_type to add a suitable offset into + repeat_type. */ - else if ((int)*previous == OP_NOT) + else if (*previous == OP_NOT) { op_type = OP_NOTSTAR - OP_STAR; /* Use "not" opcodes */ c = previous[1]; @@ -1434,9 +1814,9 @@ for (;; ptr++) /* If previous was a character type match (\d or similar), abolish it and create a suitable repeat item. The code is shared with single-character - repeats by adding a suitable offset into repeat_type. */ + repeats by setting op_type to add a suitable offset into repeat_type. 
*/ - else if ((int)*previous < OP_EODN || *previous == OP_ANY) + else if (*previous < OP_EODN) { op_type = OP_TYPESTAR - OP_STAR; /* Use type opcodes */ c = *previous; @@ -1463,8 +1843,7 @@ for (;; ptr++) else { *code++ = OP_UPTO + repeat_type; - *code++ = repeat_max >> 8; - *code++ = (repeat_max & 255); + PUT2INC(code, 0, repeat_max); } } @@ -1481,8 +1860,7 @@ for (;; ptr++) if (repeat_min != 1) { *code++ = OP_EXACT + op_type; /* NB EXACT doesn't have repeat_type */ - *code++ = repeat_min >> 8; - *code++ = (repeat_min & 255); + PUT2INC(code, 0, repeat_min); } /* If the mininum is 1 and the previous item was a character string, @@ -1517,8 +1895,7 @@ for (;; ptr++) *code++ = c; repeat_max -= repeat_min; *code++ = OP_UPTO + repeat_type; - *code++ = repeat_max >> 8; - *code++ = (repeat_max & 255); + PUT2INC(code, 0, repeat_max); } } @@ -1546,19 +1923,17 @@ for (;; ptr++) else { *code++ = OP_CRRANGE + repeat_type; - *code++ = repeat_min >> 8; - *code++ = repeat_min & 255; + PUT2INC(code, 0, repeat_min); if (repeat_max == -1) repeat_max = 0; /* 2-byte encoding for max */ - *code++ = repeat_max >> 8; - *code++ = repeat_max & 255; + PUT2INC(code, 0, repeat_max); } } /* If previous was a bracket group, we may have to replicate it in certain cases. */ - else if ((int)*previous >= OP_BRA || (int)*previous == OP_ONCE || - (int)*previous == OP_COND) + else if (*previous >= OP_BRA || *previous == OP_ONCE || + *previous == OP_COND) { register int i; int ketoffset = 0; @@ -1574,7 +1949,7 @@ for (;; ptr++) if (repeat_max == -1) { register uschar *ket = previous; - do ket += (ket[1] << 8) + ket[2]; while (*ket != OP_KET); + do ket += GET(ket, 1); while (*ket != OP_KET); ketoffset = code - ket; } @@ -1587,15 +1962,6 @@ for (;; ptr++) if (repeat_min == 0) { - /* If we set up a required char from the bracket, we must back off - to the previous value and reset the countlits value too. 
*/ - - if (subcountlits > 0) - { - *reqchar = prevreqchar; - *countlits -= subcountlits; - } - /* If the maximum is also zero, we just omit the group from the output altogether. */ @@ -1625,8 +1991,8 @@ for (;; ptr++) else { int offset; - memmove(previous+4, previous, len); - code += 4; + memmove(previous + 2 + LINK_SIZE, previous, len); + code += 2 + LINK_SIZE; *previous++ = OP_BRAZERO + repeat_type; *previous++ = OP_BRA; @@ -1635,8 +2001,7 @@ for (;; ptr++) offset = (bralink == NULL)? 0 : previous - bralink; bralink = previous; - *previous++ = offset >> 8; - *previous++ = offset & 255; + PUTINC(previous, 0, offset); } repeat_max--; @@ -1644,14 +2009,19 @@ for (;; ptr++) /* If the minimum is greater than zero, replicate the group as many times as necessary, and adjust the maximum to the number of subsequent - copies that we need. */ + copies that we need. If we set a first char from the group, and didn't + set a required char, copy the latter from the former. */ else { - for (i = 1; i < repeat_min; i++) + if (repeat_min > 1) { - memcpy(code, previous, len); - code += len; + if (groupsetfirstchar && reqchar < 0) reqchar = firstchar; + for (i = 1; i < repeat_min; i++) + { + memcpy(code, previous, len); + code += len; + } } if (repeat_max > 0) repeat_max -= repeat_min; } @@ -1677,8 +2047,7 @@ for (;; ptr++) *code++ = OP_BRA; offset = (bralink == NULL)? 0 : code - bralink; bralink = code; - *code++ = offset >> 8; - *code++ = offset & 255; + PUTINC(code, 0, offset); } memcpy(code, previous, len); @@ -1693,11 +2062,11 @@ for (;; ptr++) int oldlinkoffset; int offset = code - bralink + 1; uschar *bra = code - offset; - oldlinkoffset = (bra[1] << 8) + bra[2]; + oldlinkoffset = GET(bra, 1); bralink = (oldlinkoffset == 0)? 
NULL : bralink - oldlinkoffset; *code++ = OP_KET; - *code++ = bra[1] = offset >> 8; - *code++ = bra[2] = (offset & 255); + PUTINC(code, 0, offset); + PUT(bra, 1, offset); } } @@ -1717,6 +2086,24 @@ for (;; ptr++) goto FAILED; } + /* If the character following a repeat is '+', we wrap the entire repeated + item inside OP_ONCE brackets. This is just syntactic sugar, taken from + Sun's Java package. The repeated item starts at tempcode, not at previous, + which might be the first part of a string whose (former) last char we + repeated. However, we don't support '+' after a greediness '?'. */ + + if (possessive_quantifier) + { + int len = code - tempcode; + memmove(tempcode + 1+LINK_SIZE, tempcode, len); + code += 1 + LINK_SIZE; + len += 1 + LINK_SIZE; + tempcode[0] = OP_ONCE; + *code++ = OP_KET; + PUTINC(code, 0, len); + PUT(tempcode, 1, len); + } + /* In all case we no longer have a previous item. */ END_REPEAT: @@ -1754,9 +2141,22 @@ for (;; ptr++) case '(': bravalue = OP_COND; /* Conditional group */ - if ((cd->ctypes[*(++ptr)] & ctype_digit) != 0) + + /* Condition to test for recursion */ + + if (ptr[1] == 'R') { - int condref = *ptr - '0'; + code[1+LINK_SIZE] = OP_CREF; + PUT2(code, 2+LINK_SIZE, CREF_RECURSE); + skipbytes += 1+LINK_SIZE; + ptr += 3; + } + + /* Condition to test for a numbered subpattern match */ + + else if ((cd->ctypes[ptr[1]] & ctype_digit) != 0) + { + int condref = *(++ptr) - '0'; while (*(++ptr) != ')') condref = condref*10 + *ptr - '0'; if (condref == 0) { @@ -1764,12 +2164,12 @@ for (;; ptr++) goto FAILED; } ptr++; - code[3] = OP_CREF; - code[4] = condref >> 8; - code[5] = condref & 255; + code[1+LINK_SIZE] = OP_CREF; + PUT2(code, 2+LINK_SIZE, condref); skipbytes = 3; } - else ptr--; + /* For conditions that are assertions, we just fall through, having + set bravalue above. 
*/ break; case '=': /* Positive lookahead */ @@ -1794,10 +2194,6 @@ for (;; ptr++) bravalue = OP_ASSERTBACK_NOT; ptr++; break; - - default: /* Syntax error */ - *errorptr = ERR24; - goto FAILED; } break; @@ -1806,11 +2202,145 @@ for (;; ptr++) ptr++; break; - case 'R': /* Pattern recursion */ - *code++ = OP_RECURSE; - ptr++; + case 'C': /* Callout - may be followed by digits */ + *code++ = OP_CALLOUT; + { + int n = 0; + while ((cd->ctypes[*(++ptr)] & ctype_digit) != 0) + n = n * 10 + *ptr - '0'; + if (n > 255) + { + *errorptr = ERR38; + goto FAILED; + } + *code++ = n; + } + previous = NULL; continue; + case 'P': /* Named subpattern handling */ + if (*(++ptr) == '<') /* Definition */ + { + int i, namelen; + const uschar *name = ++ptr; + uschar *slot = cd->name_table; + + while (*ptr++ != '>'); + namelen = ptr - name - 1; + + for (i = 0; i < cd->names_found; i++) + { + int c = strncmp(name, slot+2, namelen); + if (c == 0) + { + *errorptr = ERR43; + goto FAILED; + } + if (c < 0) + { + memmove(slot + cd->name_entry_size, slot, + (cd->names_found - i) * cd->name_entry_size); + break; + } + slot += cd->name_entry_size; + } + + PUT2(slot, 0, *brackets + 1); + memcpy(slot + 2, name, namelen); + slot[2+namelen] = 0; + cd->names_found++; + goto NUMBERED_GROUP; + } + + if (*ptr == '=' || *ptr == '>') /* Reference or recursion */ + { + int i, namelen; + int type = *ptr++; + const uschar *name = ptr; + uschar *slot = cd->name_table; + + while (*ptr != ')') ptr++; + namelen = ptr - name; + + for (i = 0; i < cd->names_found; i++) + { + if (strncmp(name, slot+2, namelen) == 0) break; + slot += cd->name_entry_size; + } + if (i >= cd->names_found) + { + *errorptr = ERR15; + goto FAILED; + } + + recno = GET2(slot, 0); + + if (type == '>') goto HANDLE_RECURSION; /* A few lines below */ + + /* Back reference */ + + previous = code; + *code++ = OP_REF; + PUT2INC(code, 0, recno); + continue; + } + + /* Should never happen */ + break; + + case 'R': /* Pattern recursion */ + ptr++; /* Same 
as (?0) */ + /* Fall through */ + + /* Recursion or "subroutine" call */ + + case '0': case '1': case '2': case '3': case '4': + case '5': case '6': case '7': case '8': case '9': + { + const uschar *called; + recno = 0; + + while ((cd->ctypes[*ptr] & ctype_digit) != 0) + recno = recno * 10 + *ptr++ - '0'; + + /* Come here from code above that handles a named recursion */ + + HANDLE_RECURSION: + + previous = code; + + /* Find the bracket that is being referenced. Temporarily end the + regex in case it doesn't exist. */ + + *code = OP_END; + called = (recno == 0)? + cd->start_code : find_bracket(cd->start_code, recno); + if (called == NULL) + { + *errorptr = ERR15; + goto FAILED; + } + + /* If the subpattern is still open, this is a recursive call. We + check to see if this is a left recursion that could loop for ever, + and diagnose that case. */ + + if (GET(called, 1) == 0 && could_be_empty(called, code, bcptr)) + { + *errorptr = ERR40; + goto FAILED; + } + + /* Insert the recursion/subroutine item */ + + *code = OP_RECURSE; + PUT(code, 1, called - cd->start_code); + code += 1 + LINK_SIZE; + } + continue; + + /* Character after (? not specially recognized */ + default: /* Option setting */ set = unset = 0; optset = &set; @@ -1827,10 +2357,6 @@ for (;; ptr++) case 'x': *optset |= PCRE_EXTENDED; break; case 'U': *optset |= PCRE_UNGREEDY; break; case 'X': *optset |= PCRE_EXTRA; break; - - default: - *errorptr = ERR12; - goto FAILED; } } @@ -1839,23 +2365,33 @@ for (;; ptr++) newoptions = (options | set) & (~unset); /* If the options ended with ')' this is not the start of a nested - group with option changes, so the options change at this level. 
At top - level there is nothing else to be done (the options will in fact have - been set from the start of compiling as a result of the first pass) but - at an inner level we must compile code to change the ims options if - necessary, and pass the new setting back so that it can be put at the - start of any following branches, and when this group ends, a resetting - item can be compiled. */ + group with option changes, so the options change at this level. Compile + code to change the ims options if this setting actually changes any of + them. We also pass the new setting back so that it can be put at the + start of any following branches, and when this group ends (if we are in + a group), a resetting item can be compiled. + + Note that if this item is right at the start of the pattern, the + options will have been abstracted and made global, so there will be no + change to compile. */ if (*ptr == ')') { - if ((options & PCRE_INGROUP) != 0 && - (options & PCRE_IMS) != (newoptions & PCRE_IMS)) + if ((options & PCRE_IMS) != (newoptions & PCRE_IMS)) { *code++ = OP_OPT; - *code++ = *optchanged = newoptions & PCRE_IMS; + *code++ = newoptions & PCRE_IMS; } - options = newoptions; /* Change options at this level */ + + /* Change options at this level, and pass them back for use + in subsequent branches. Reset the greedy defaults and the case + value for firstchar and reqchar. */ + + *optionsptr = options = newoptions; + greedy_default = ((newoptions & PCRE_UNGREEDY) != 0); + greedy_non_default = greedy_default ^ 1; + req_caseopt = ((options & PCRE_CASELESS) != 0)? 
REQ_CASELESS : 0; + previous = NULL; /* This item can't be repeated */ continue; /* It is complete */ } @@ -1876,12 +2412,12 @@ for (;; ptr++) else { + NUMBERED_GROUP: if (++(*brackets) > EXTRACT_BASIC_MAX) { bravalue = OP_BRA + EXTRACT_BASIC_MAX + 1; - code[3] = OP_BRANUMBER; - code[4] = *brackets >> 8; - code[5] = *brackets & 255; + code[1+LINK_SIZE] = OP_BRANUMBER; + PUT2(code, 2+LINK_SIZE, *brackets); skipbytes = 3; } else bravalue = OP_BRA + *brackets; @@ -1897,9 +2433,8 @@ for (;; ptr++) tempcode = code; if (!compile_regex( - options | PCRE_INGROUP, /* Set for all nested groups */ - ((options & PCRE_IMS) != (newoptions & PCRE_IMS))? - newoptions & PCRE_IMS : -1, /* Pass ims options if changed */ + newoptions, /* The complete new option state */ + options & PCRE_IMS, /* The previous ims option state */ brackets, /* Extracting bracket count */ &tempcode, /* Where to put code (updated) */ &ptr, /* Input pointer (updated) */ @@ -1907,8 +2442,9 @@ for (;; ptr++) (bravalue == OP_ASSERTBACK || bravalue == OP_ASSERTBACK_NOT), /* TRUE if back assert */ skipbytes, /* Skip over OP_COND/OP_BRANUMBER */ + &subfirstchar, /* For possible first char */ &subreqchar, /* For possible last char */ - &subcountlits, /* For literal count */ + bcptr, /* Current branch chain */ cd)) /* Tables block */ goto FAILED; @@ -1927,7 +2463,7 @@ for (;; ptr++) do { condcount++; - tc += (tc[1] << 8) | tc[2]; + tc += GET(tc,1); } while (*tc != OP_KET); @@ -1936,25 +2472,63 @@ for (;; ptr++) *errorptr = ERR27; goto FAILED; } + + /* If there is just one branch, we must not make use of its firstchar or + reqchar, because this is equivalent to an empty second branch. */ + + if (condcount == 1) subfirstchar = subreqchar = REQ_NONE; } - /* Handle updating of the required character. If the subpattern didn't - set one, leave it as it was. Otherwise, update it for normal brackets of - all kinds, forward assertions, and conditions with two branches. 
Don't - update the literal count for forward assertions, however. If the bracket - is followed by a quantifier with zero repeat, we have to back off. Hence - the definition of prevreqchar and subcountlits outside the main loop so - that they can be accessed for the back off. */ + /* Handle updating of the required and first characters. Update for normal + brackets of all kinds, and conditions with two branches (see code above). + If the bracket is followed by a quantifier with zero repeat, we have to + back off. Hence the definition of zeroreqchar and zerofirstchar outside the + main loop so that they can be accessed for the back off. */ - if (subreqchar > 0 && - (bravalue >= OP_BRA || bravalue == OP_ONCE || bravalue == OP_ASSERT || - (bravalue == OP_COND && condcount == 2))) + zeroreqchar = reqchar; + zerofirstchar = firstchar; + groupsetfirstchar = FALSE; + + if (bravalue >= OP_BRA || bravalue == OP_ONCE || bravalue == OP_COND) { - prevreqchar = *reqchar; - *reqchar = subreqchar; - if (bravalue != OP_ASSERT) *countlits += subcountlits; + /* If we have not yet set a firstchar in this branch, take it from the + subpattern, remembering that it was set here so that a repeat of more + than one can replicate it as reqchar if necessary. If the subpattern has + no firstchar, set "none" for the whole branch. In both cases, a zero + repeat forces firstchar to "none". */ + + if (firstchar == REQ_UNSET) + { + if (subfirstchar >= 0) + { + firstchar = subfirstchar; + groupsetfirstchar = TRUE; + } + else firstchar = REQ_NONE; + zerofirstchar = REQ_NONE; + } + + /* If firstchar was previously set, convert the subpattern's firstchar + into reqchar if there wasn't one. */ + + else if (subfirstchar >= 0 && subreqchar < 0) subreqchar = subfirstchar; + + /* If the subpattern set a required char (or set a first char that isn't + really the first char - see above), set it. */ + + if (subreqchar >= 0) reqchar = subreqchar; } + /* For a forward assertion, we take the reqchar, if set. 
This can be + helpful if the pattern that follows the assertion doesn't set a different + char. For example, it's useful for /(?=abcde).+/. We can't set firstchar + for an assertion, however because it leads to incorrect effect for patterns + such as /(?=a)a.+/ when the "real" "a" would then become a reqchar instead + of a firstchar. This is overcome by a scan at the end if there's no + firstchar, looking for an asserted first char. */ + + else if (bravalue == OP_ASSERT && subreqchar >= 0) reqchar = subreqchar; + /* Now update the main code pointer to the end of the group. */ code = tempcode; @@ -1985,13 +2559,32 @@ for (;; ptr++) if (c < 0) { + if (-c == ESC_Q) /* Handle start of quoted string */ + { + if (ptr[1] == '\\' && ptr[2] == 'E') ptr += 2; /* avoid empty string */ + else inescq = TRUE; + continue; + } + + /* For metasequences that actually match a character, we disable the + setting of a first character if it hasn't already been set. */ + + if (firstchar == REQ_UNSET && -c > ESC_b && -c < ESC_Z) + firstchar = REQ_NONE; + + /* Set values to reset to if this is followed by a zero repeat. 
*/ + + zerofirstchar = firstchar; + zeroreqchar = reqchar; + + /* Back references are handled specially */ + if (-c >= ESC_REF) { int number = -c - ESC_REF; previous = code; *code++ = OP_REF; - *code++ = number >> 8; - *code++ = number & 255; + PUT2INC(code, 0, number); } else { @@ -2019,6 +2612,25 @@ for (;; ptr++) do { + /* If in \Q...\E, check for the end; if not, we always have a literal */ + + if (inescq) + { + if (c == '\\' && ptr[1] == 'E') + { + inescq = FALSE; + ptr++; + } + else + { + *code++ = c; + length++; + } + continue; + } + + /* Skip white space and comments for /x patterns */ + if ((options & PCRE_EXTENDED) != 0) { if ((cd->ctypes[c] & ctype_space) != 0) continue; @@ -2067,14 +2679,31 @@ for (;; ptr++) while (length < MAXLIT && (cd->ctypes[c = *(++ptr)] & ctype_meta) == 0); - /* Update the last character and the count of literals */ + /* Update the first and last character */ - prevreqchar = (length > 1)? code[-2] : *reqchar; - *reqchar = code[-1]; - *countlits += length; + if (firstchar == REQ_UNSET) + { + if (length > 1) + { + zerofirstchar = firstchar = previous[2] | req_caseopt; + zeroreqchar = (length > 2)? (code[-2] | req_caseopt) : reqchar; + reqchar = code[-1] | req_caseopt; + } + else + { + zerofirstchar = REQ_NONE; + firstchar = code[-1] | req_caseopt; + zeroreqchar = reqchar; + } + } + else /* firstchar previously set */ + { + zerofirstchar = firstchar; + zeroreqchar = (length > 1)? (code[-2] | req_caseopt) : reqchar; + reqchar = code[-1] | req_caseopt; + } - /* Compute the length and set it in the data vector, and advance to - the next state. */ + /* Set the length in the data vector, and advance to the next state. */ previous[1] = length; if (length < MAXLIT) ptr--; @@ -2107,52 +2736,56 @@ following branch to ensure they get set correctly at run time, and also pass the new options into every subsequent branch compile. 
Argument: - options the option bits - optchanged new ims options to set as if (?ims) were at the start, or -1 - for no change - brackets -> int containing the number of extracting brackets used - codeptr -> the address of the current code pointer - ptrptr -> the address of the current pattern pointer - errorptr -> pointer to error message - lookbehind TRUE if this is a lookbehind assertion - skipbytes skip this many bytes at start (for OP_COND, OP_BRANUMBER) - reqchar -> place to put the last required character, or a negative number - countlits -> place to put the shortest literal count of any branch - cd points to the data block with tables pointers + options option bits, including any changes for this subpattern + oldims previous settings of ims option bits + brackets -> int containing the number of extracting brackets used + codeptr -> the address of the current code pointer + ptrptr -> the address of the current pattern pointer + errorptr -> pointer to error message + lookbehind TRUE if this is a lookbehind assertion + skipbytes skip this many bytes at start (for OP_COND, OP_BRANUMBER) + firstcharptr place to put the first required character, or a negative number + reqcharptr place to put the last required character, or a negative number + bcptr pointer to the chain of currently open branches + cd points to the data block with tables pointers etc. 
Returns: TRUE on success */ static BOOL -compile_regex(int options, int optchanged, int *brackets, uschar **codeptr, +compile_regex(int options, int oldims, int *brackets, uschar **codeptr, const uschar **ptrptr, const char **errorptr, BOOL lookbehind, int skipbytes, - int *reqchar, int *countlits, compile_data *cd) + int *firstcharptr, int *reqcharptr, branch_chain *bcptr, compile_data *cd) { const uschar *ptr = *ptrptr; uschar *code = *codeptr; uschar *last_branch = code; uschar *start_bracket = code; uschar *reverse_count = NULL; -int oldoptions = options & PCRE_IMS; -int branchreqchar, branchcountlits; +int firstchar, reqchar; +int branchfirstchar, branchreqchar; +branch_chain bc; -*reqchar = -1; -*countlits = INT_MAX; -code += 3 + skipbytes; +bc.outer = bcptr; +bc.current = code; + +firstchar = reqchar = REQ_UNSET; + +/* Offset is set zero to mark that this bracket is still open */ + +PUT(code, 1, 0); +code += 1 + LINK_SIZE + skipbytes; /* Loop for each alternative branch */ for (;;) { - int length; + /* Handle a change of ims options at the start of the branch */ - /* Handle change of options */ - - if (optchanged >= 0) + if ((options & PCRE_IMS) != oldims) { *code++ = OP_OPT; - *code++ = optchanged; - options = (options & ~PCRE_IMS) | optchanged; + *code++ = options & PCRE_IMS; } /* Set up dummy OP_REVERSE if lookbehind assertion */ @@ -2161,43 +2794,52 @@ for (;;) { *code++ = OP_REVERSE; reverse_count = code; - *code++ = 0; - *code++ = 0; + PUTINC(code, 0, 0); } /* Now compile the branch */ - if (!compile_branch(options, brackets, &code, &ptr, errorptr, &optchanged, - &branchreqchar, &branchcountlits, cd)) + if (!compile_branch(&options, brackets, &code, &ptr, errorptr, + &branchfirstchar, &branchreqchar, &bc, cd)) { *ptrptr = ptr; return FALSE; } - /* Fill in the length of the last branch */ + /* If this is the first branch, the firstchar and reqchar values for the + branch become the values for the regex. 
*/ - length = code - last_branch; - last_branch[1] = length >> 8; - last_branch[2] = length & 255; - - /* Save the last required character if all branches have the same; a current - value of -1 means unset, while -2 means "previous branch had no last required - char". */ - - if (*reqchar != -2) + if (*last_branch != OP_ALT) { - if (branchreqchar >= 0) - { - if (*reqchar == -1) *reqchar = branchreqchar; - else if (*reqchar != branchreqchar) *reqchar = -2; - } - else *reqchar = -2; + firstchar = branchfirstchar; + reqchar = branchreqchar; } - /* Keep the shortest literal count */ + /* If this is not the first branch, the first char and reqchar have to + match the values from all the previous branches. */ - if (branchcountlits < *countlits) *countlits = branchcountlits; - DPRINTF(("literal count = %d min=%d\n", branchcountlits, *countlits)); + else + { + /* If we previously had a firstchar, but it doesn't match the new branch, + we have to abandon the firstchar for the regex, but if there was previously + no reqchar, it takes on the value of the old firstchar. */ + + if (firstchar >= 0 && firstchar != branchfirstchar) + { + if (reqchar < 0) reqchar = firstchar; + firstchar = REQ_NONE; + } + + /* If we (now or from before) have no firstchar, a firstchar from the + branch becomes a reqchar if there isn't a branch reqchar. */ + + if (firstchar < 0 && branchfirstchar >= 0 && branchreqchar < 0) + branchreqchar = branchfirstchar; + + /* Now ensure that the reqchars match */ + + if (reqchar != branchreqchar) reqchar = REQ_NONE; + } /* If lookbehind, check that this branch matches a fixed-length string, and put the length into the OP_REVERSE item. Temporarily mark the end of @@ -2205,45 +2847,72 @@ for (;;) if (lookbehind) { + int length; *code = OP_END; length = find_fixedlength(last_branch, options); DPRINTF(("fixed length = %d\n", length)); if (length < 0) { - *errorptr = ERR25; + *errorptr = (length == -2)? 
ERR36 : ERR25; *ptrptr = ptr; return FALSE; } - reverse_count[0] = (length >> 8); - reverse_count[1] = length & 255; + PUT(reverse_count, 0, length); } - /* Reached end of expression, either ')' or end of pattern. Insert a - terminating ket and the length of the whole bracketed item, and return, - leaving the pointer at the terminating char. If any of the ims options - were changed inside the group, compile a resetting op-code following. */ + /* Reached end of expression, either ')' or end of pattern. Go back through + the alternative branches and reverse the chain of offsets, with the field in + the BRA item now becoming an offset to the first alternative. If there are + no alternatives, it points to the end of the group. The length in the + terminating ket is always the length of the whole bracketed item. If any of + the ims options were changed inside the group, compile a resetting op-code + following, except at the very end of the pattern. Return leaving the pointer + at the terminating char. */ if (*ptr != '|') { - length = code - start_bracket; - *code++ = OP_KET; - *code++ = length >> 8; - *code++ = length & 255; - if (optchanged >= 0) + int length = code - last_branch; + do + { + int prev_length = GET(last_branch, 1); + PUT(last_branch, 1, length); + length = prev_length; + last_branch -= length; + } + while (length > 0); + + /* Fill in the ket */ + + *code = OP_KET; + PUT(code, 1, code - start_bracket); + code += 1 + LINK_SIZE; + + /* Resetting option if needed */ + + if ((options & PCRE_IMS) != oldims && *ptr == ')') { *code++ = OP_OPT; - *code++ = oldoptions; + *code++ = oldims; } + + /* Set values to pass back */ + *codeptr = code; *ptrptr = ptr; + *firstcharptr = firstchar; + *reqcharptr = reqchar; return TRUE; } - /* Another branch follows; insert an "or" node and advance the pointer. */ + /* Another branch follows; insert an "or" node. Its length field points back + to the previous branch while the bracket remains open. 
At the end the chain + is reversed. It's done like this so that the start of the bracket has a + zero offset until it is closed, making it possible to detect recursion. */ *code = OP_ALT; - last_branch = code; - code += 3; + PUT(code, 1, code - last_branch); + bc.current = last_branch = code; + code += 1 + LINK_SIZE; ptr++; } /* Control never reaches here */ @@ -2252,70 +2921,6 @@ for (;;) -/************************************************* -* Find first significant op code * -*************************************************/ - -/* This is called by several functions that scan a compiled expression looking -for a fixed first character, or an anchoring op code etc. It skips over things -that do not influence this. For one application, a change of caseless option is -important. - -Arguments: - code pointer to the start of the group - options pointer to external options - optbit the option bit whose changing is significant, or - zero if none are - optstop TRUE to return on option change, otherwise change the options - value and continue - -Returns: pointer to the first significant opcode -*/ - -static const uschar* -first_significant_code(const uschar *code, int *options, int optbit, - BOOL optstop) -{ -for (;;) - { - switch ((int)*code) - { - case OP_OPT: - if (optbit > 0 && ((int)code[1] & optbit) != (*options & optbit)) - { - if (optstop) return code; - *options = (int)code[1]; - } - code += 2; - break; - - case OP_CREF: - case OP_BRANUMBER: - code += 3; - break; - - case OP_WORD_BOUNDARY: - case OP_NOT_WORD_BOUNDARY: - code++; - break; - - case OP_ASSERT_NOT: - case OP_ASSERTBACK: - case OP_ASSERTBACK_NOT: - do code += (code[1] << 8) + code[2]; while (*code == OP_ALT); - code += 3; - break; - - default: - return code; - } - } -/* Control never reaches here */ -} - - - - /************************************************* * Check for anchored expression * *************************************************/ @@ -2328,33 +2933,64 @@ counts, since OP_CIRC can match in 
the middle. A branch is also implicitly anchored if it starts with .* and DOTALL is set, because that will try the rest of the pattern at all possible matching points, -so there is no point trying them again. +so there is no point trying again.... er .... + +.... except when the .* appears inside capturing parentheses, and there is a +subsequent back reference to those parentheses. We haven't enough information +to catch that case precisely. The best we can do is to detect when .* is in +capturing brackets and the highest back reference is greater than or equal to +that level. Arguments: - code points to start of expression (the bracket) - options points to the options setting + code points to start of expression (the bracket) + options points to the options setting + in_brackets TRUE if inside capturing parentheses + top_backref the highest back reference in the regex Returns: TRUE or FALSE */ static BOOL -is_anchored(register const uschar *code, int *options) +is_anchored(register const uschar *code, int *options, BOOL in_brackets, + int top_backref) { do { - const uschar *scode = first_significant_code(code + 3, options, - PCRE_MULTILINE, FALSE); + const uschar *scode = + first_significant_code(code + 1+LINK_SIZE, options, PCRE_MULTILINE); register int op = *scode; - if (op >= OP_BRA || op == OP_ASSERT || op == OP_ONCE || op == OP_COND) - { if (!is_anchored(scode, options)) return FALSE; } + + /* Capturing brackets */ + + if (op > OP_BRA) + { + if (!is_anchored(scode, options, TRUE, top_backref)) return FALSE; + } + + /* Other brackets */ + + else if (op == OP_BRA || op == OP_ASSERT || op == OP_ONCE || op == OP_COND) + { + if (!is_anchored(scode, options, in_brackets, top_backref)) + return FALSE; + } + + /* .* is not anchored unless DOTALL is set and it isn't in brackets that + may be referenced. 
*/ + else if ((op == OP_TYPESTAR || op == OP_TYPEMINSTAR) && (*options & PCRE_DOTALL) != 0) - { if (scode[1] != OP_ANY) return FALSE; } + { + if (scode[1] != OP_ANY || (in_brackets && top_backref > 0)) return FALSE; + } + + /* Check for explicit anchoring */ + else if (op != OP_SOD && ((*options & PCRE_MULTILINE) != 0 || op != OP_CIRC)) return FALSE; - code += (code[1] << 8) + code[2]; + code += GET(code, 1); } -while (*code == OP_ALT); +while (*code == OP_ALT); /* Loop for each alternative */ return TRUE; } @@ -2367,56 +3003,82 @@ return TRUE; /* This is called to find out if every branch starts with ^ or .* so that "first char" processing can be done to speed things up in multiline matching and for non-DOTALL patterns that start with .* (which must start at -the beginning or after \n). +the beginning or after \n). As in the case of is_anchored() (see above), we +have to take account of back references to capturing brackets that contain .* +because in that case we can't make the assumption. 
-Argument:  points to start of expression (the bracket)
-Returns:   TRUE or FALSE
+Arguments:
+  code          points to start of expression (the bracket)
+  in_brackets   TRUE if inside capturing parentheses
+  top_backref   the highest back reference in the regex
+
+Returns:        TRUE or FALSE
 */
 static BOOL
-is_startline(const uschar *code)
+is_startline(const uschar *code, BOOL in_brackets, int top_backref)
 {
 do {
-   const uschar *scode = first_significant_code(code + 3, NULL, 0, FALSE);
+   const uschar *scode = first_significant_code(code + 1+LINK_SIZE, NULL, 0);
    register int op = *scode;
-   if (op >= OP_BRA || op == OP_ASSERT || op == OP_ONCE || op == OP_COND)
-     { if (!is_startline(scode)) return FALSE; }
+
+   /* Capturing brackets */
+
+   if (op > OP_BRA)
+     { if (!is_startline(scode, TRUE, top_backref)) return FALSE; }
+
+   /* Other brackets */
+
+   else if (op == OP_BRA || op == OP_ASSERT || op == OP_ONCE || op == OP_COND)
+     { if (!is_startline(scode, in_brackets, top_backref)) return FALSE; }
+
+   /* .* is not anchored unless DOTALL is set and it isn't in brackets that
+   may be referenced. */
+
    else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR)
-     { if (scode[1] != OP_ANY) return FALSE; }
+     {
+     if (scode[1] != OP_ANY || (in_brackets && top_backref > 0)) return FALSE;
+     }
+
+   /* Check for explicit circumflex */
+
    else if (op != OP_CIRC) return FALSE;
-   code += (code[1] << 8) + code[2];
+   code += GET(code, 1);
    }
-while (*code == OP_ALT);
+while (*code == OP_ALT);  /* Loop for each alternative */
 return TRUE;
 }
 /*************************************************
-*       Check for fixed first char               *
+*     Check for asserted fixed first char        *
 *************************************************/
-/* Try to find out if there is a fixed first character. This is called for
-unanchored expressions, as it speeds up their processing quite considerably.
-Consider each alternative branch.
If they all start with the same char, or with
-a bracket all of whose alternatives start with the same char (recurse ad lib),
-then we return that char, otherwise -1.
+/* During compilation, the "first char" settings from forward assertions are
+discarded, because they can cause conflicts with actual literals that follow.
+However, if we end up without a first char setting for an unanchored pattern,
+it is worth scanning the regex to see if there is an initial asserted first
+char. If all branches start with the same asserted char, or with a bracket all
+of whose alternatives start with the same asserted char (recurse ad lib), then
+we return that char, otherwise -1.
 Arguments:
   code       points to start of expression (the bracket)
   options    pointer to the options (used to check casing changes)
+  inassert   TRUE if in an assertion
 Returns:     -1 or the fixed first char
 */
 static int
-find_firstchar(const uschar *code, int *options)
+find_firstassertedchar(const uschar *code, int *options, BOOL inassert)
 {
 register int c = -1;
 do {
    int d;
-   const uschar *scode = first_significant_code(code + 3, options,
-     PCRE_CASELESS, TRUE);
+   const uschar *scode =
+     first_significant_code(code + 1+LINK_SIZE, options, PCRE_CASELESS);
    register int op = *scode;
    if (op >= OP_BRA) op = OP_BRA;
@@ -2430,7 +3092,8 @@ do {
      case OP_ASSERT:
      case OP_ONCE:
      case OP_COND:
-     if ((d = find_firstchar(scode, options)) < 0) return -1;
+     if ((d = find_firstassertedchar(scode, options, op == OP_ASSERT)) < 0)
+       return -1;
      if (c < 0) c = d; else if (c != d) return -1;
      break;
@@ -2442,11 +3105,17 @@ do {
      case OP_PLUS:
      case OP_MINPLUS:
-     if (c < 0) c = scode[1]; else if (c != scode[1]) return -1;
+     if (!inassert) return -1;
+     if (c < 0)
+       {
+       c = scode[1];
+       if ((*options & PCRE_CASELESS) != 0) c |= REQ_CASELESS;
+       }
+     else if (c != scode[1]) return -1;
      break;
      }
-   code += (code[1] << 8) + code[2];
+   code += GET(code, 1);
    }
 while (*code == OP_ALT);
 return c;
@@ -2455,7 +3124,6 @@ return c;
-
 /*************************************************
 *         Compile a Regular Expression           *
 *************************************************/
@@ -2479,25 +3147,26 @@ pcre_compile(const char *pattern, int options, const char **errorptr,
   int *erroroffset, const unsigned char *tables)
 {
 real_pcre *re;
-int length = 3;      /* For initial BRA plus length */
+int length = 1 + LINK_SIZE;      /* For initial BRA plus length */
 int runlength;
-int c, reqchar, countlits;
+int c, firstchar, reqchar;
 int bracount = 0;
 int top_backref = 0;
 int branch_extra = 0;
 int branch_newextra;
+int item_count = -1;
+int name_count = 0;
+int max_name_size = 0;
+BOOL inescq = FALSE;
 unsigned int brastackptr = 0;
 size_t size;
 uschar *code;
+const uschar *codestart;
 const uschar *ptr;
 compile_data compile_block;
 int brastack[BRASTACK_SIZE];
 uschar bralenstack[BRASTACK_SIZE];
-#ifdef DEBUG
-uschar *code_base, *code_end;
-#endif
-
 /* Can't support UTF8 unless PCRE has been compiled to include the code. */
 #ifndef SUPPORT_UTF8
@@ -2545,9 +3214,9 @@ DPRINTF(("%s\n", pattern));
 /* The first thing to do is to make a pass over the pattern to compute the
 amount of store required to hold the compiled code. This does not have to be
 perfect as long as errors are overestimates. At the same time we can detect any
-internal flag settings. Make an attempt to correct for any counted white space
-if an "extended" flag setting appears late in the pattern. We can't be so
-clever for #-comments. */
+flag settings right at the start, and extract them. Make an attempt to correct
+for any counted white space if an "extended" flag setting appears late in the
+pattern. We can't be so clever for #-comments.
*/ ptr = (const uschar *)(pattern - 1); while ((c = *(++ptr)) != 0) @@ -2555,6 +3224,13 @@ while ((c = *(++ptr)) != 0) int min, max; int class_charcount; int bracket_length; + int duplength; + + /* If we are inside a \Q...\E sequence, all chars are literal */ + + if (inescq) goto NORMAL_CHAR; + + /* Otherwise, first check for ignored whitespace and comments */ if ((options & PCRE_EXTENDED) != 0) { @@ -2564,10 +3240,13 @@ while ((c = *(++ptr)) != 0) /* The space before the ; is to avoid a warning on a silly compiler on the Macintosh. */ while ((c = *(++ptr)) != 0 && c != NEWLINE) ; + if (c == 0) break; continue; } } + item_count++; /* Is zero for the first non-comment item */ + switch(c) { /* A backslashed item may be an escaped "normal" character or a @@ -2587,6 +3266,17 @@ while ((c = *(++ptr)) != 0) goto NORMAL_CHAR; } } + + /* If \Q, enter "literal" mode */ + + if (-c == ESC_Q) + { + inescq = TRUE; + continue; + } + + /* Other escapes need one byte */ + length++; /* A back reference needs an additional 2 bytes, plus either one or 5 @@ -2611,12 +3301,19 @@ while ((c = *(++ptr)) != 0) } continue; - case '^': + case '*': /* These repeats won't be after brackets; */ + case '+': /* those are handled separately */ + case '?': + if (ptr[1] == '+') /* Handle "possessive quantifier" */ + { + length += 2 + 2*LINK_SIZE; + ptr++; + } + /* Fall through */ + + case '^': /* Single-byte metacharacters */ case '.': case '$': - case '*': /* These repeats won't be after brackets; */ - case '+': /* those are handled separately */ - case '?': length++; continue; @@ -2636,7 +3333,12 @@ while ((c = *(++ptr)) != 0) if (min == 1) length++; else if (min > 0) length += 4; if (max > 0) length += 4; else length += 2; } - if (ptr[1] == '?') ptr++; + if (ptr[1] == '?') ptr++; /* Needs no extra length */ + if (ptr[1] == '+') /* Possessive quantifier */ + { + ptr++; + length += 2 + 2*LINK_SIZE; /* Allow for atomic brackets */ + } continue; /* An alternation contains an offset to the next branch 
or ket. If any ims @@ -2645,7 +3347,7 @@ while ((c = *(++ptr)) != 0) branch. This is handled by branch_extra. */ case '|': - length += 3 + branch_extra; + length += 1 + LINK_SIZE + branch_extra; continue; /* A character class uses 33 characters. Don't worry about character types @@ -2656,7 +3358,10 @@ while ((c = *(++ptr)) != 0) case '[': class_charcount = 0; if (*(++ptr) == '^') ptr++; - do + + /* Written as a "do" so that an initial ']' is taken as data */ + + if (*ptr != 0) do { if (*ptr == '\\') { @@ -2665,10 +3370,27 @@ while ((c = *(++ptr)) != 0) if (*errorptr != NULL) goto PCRE_ERROR_RETURN; if (-ch == ESC_b) class_charcount++; else class_charcount = 10; } + + /* Check the syntax for POSIX stuff. The bits we actually handle are + checked during the real compile phase. */ + + else if (*ptr == '[' && check_posix_syntax(ptr, &ptr, &compile_block)) + { + ptr++; + class_charcount = 10; /* Make sure > 1 */ + } + + /* Anything else just counts as one char */ + else class_charcount++; - ptr++; } - while (*ptr != 0 && *ptr != ']'); + while (*(++ptr) != 0 && *ptr != ']'); /* Concludes "do" above */ + + if (*ptr == 0) /* Missing terminating ']' */ + { + *errorptr = ERR6; + goto PCRE_ERROR_RETURN; + } /* Repeats for negated single chars are handled by the general code */ @@ -2695,7 +3417,7 @@ while ((c = *(++ptr)) != 0) case '(': branch_newextra = 0; - bracket_length = 3; + bracket_length = 1 + LINK_SIZE; /* Handle special forms of bracket, which all start (? */ @@ -2729,27 +3451,98 @@ while ((c = *(++ptr)) != 0) ptr += 2; break; - /* A recursive call to the regex is an extension, to provide the - facility which can be obtained by $(?p{perl-code}) in Perl 5.6. */ + /* (?R) specifies a recursive call to the regex, which is an extension + to provide the facility which can be obtained by (?p{perl-code}) in + Perl 5.6. In Perl 5.8 this has become (??{perl-code}). + + From PCRE 4.00, items such as (?3) specify subroutine-like "calls" to + the appropriate numbered brackets. 
This includes both recursive and + non-recursive calls. (?R) is now synonymous with (?0). */ case 'R': - if (ptr[3] != ')') + ptr++; + + case '0': case '1': case '2': case '3': case '4': + case '5': case '6': case '7': case '8': case '9': + ptr += 2; + if (c != 'R') + while ((compile_block.ctypes[*(++ptr)] & ctype_digit) != 0); + if (*ptr != ')') { *errorptr = ERR29; goto PCRE_ERROR_RETURN; } + length += 1 + LINK_SIZE; + + /* If this item is quantified, it will get wrapped inside brackets so + as to use the code for quantified brackets. We jump down and use the + code that handles this for real brackets. */ + + if (ptr[1] == '+' || ptr[1] == '*' || ptr[1] == '?' || ptr[1] == '{') + { + length += 2 + 2 * LINK_SIZE; /* to make bracketed */ + duplength = 5 + 3 * LINK_SIZE; + goto HANDLE_QUANTIFIED_BRACKETS; + } + continue; + + /* (?C) is an extension which provides "callout" - to provide a bit of + the functionality of the Perl (?{...}) feature. An optional number may + follow (default is zero). 
*/ + + case 'C': + ptr += 2; + while ((compile_block.ctypes[*(++ptr)] & ctype_digit) != 0); + if (*ptr != ')') + { + *errorptr = ERR39; + goto PCRE_ERROR_RETURN; + } + length += 2; + continue; + + /* Named subpatterns are an extension copied from Python */ + + case 'P': ptr += 3; - length += 1; - break; + if (*ptr == '<') + { + const uschar *p = ++ptr; + while ((compile_block.ctypes[*ptr] & ctype_word) != 0) ptr++; + if (*ptr != '>') + { + *errorptr = ERR42; + goto PCRE_ERROR_RETURN; + } + name_count++; + if (ptr - p > max_name_size) max_name_size = (ptr - p); + break; + } + + if (*ptr == '=' || *ptr == '>') + { + while ((compile_block.ctypes[*(++ptr)] & ctype_word) != 0); + if (*ptr != ')') + { + *errorptr = ERR42; + goto PCRE_ERROR_RETURN; + } + break; + } + + /* Unknown character after (?P */ + + *errorptr = ERR41; + goto PCRE_ERROR_RETURN; /* Lookbehinds are in Perl from version 5.005 */ case '<': - if (ptr[3] == '=' || ptr[3] == '!') + ptr += 3; + if (*ptr == '=' || *ptr == '!') { - ptr += 3; - branch_newextra = 3; - length += 3; /* For the first branch */ + branch_newextra = 1 + LINK_SIZE; + length += 1 + LINK_SIZE; /* For the first branch */ break; } *errorptr = ERR24; @@ -2757,10 +3550,15 @@ while ((c = *(++ptr)) != 0) /* Conditionals are in Perl from version 5.005. The bracket must either be followed by a number (for bracket reference) or by an assertion - group. */ + group, or (a PCRE extension) by 'R' for a recursion test. */ case '(': - if ((compile_block.ctypes[ptr[3]] & ctype_digit) != 0) + if (ptr[3] == 'R' && ptr[4] == ')') + { + ptr += 4; + length += 3; + } + else if ((compile_block.ctypes[ptr[3]] & ctype_digit) != 0) { ptr += 4; length += 3; @@ -2827,17 +3625,27 @@ while ((c = *(++ptr)) != 0) optset = &unset; continue; - /* A termination by ')' indicates an options-setting-only item; - this is global at top level; otherwise nothing is done here and - it is handled during the compiling process on a per-bracket-group - basis. 
*/ + /* A termination by ')' indicates an options-setting-only item; if + this is at the very start of the pattern (indicated by item_count + being zero), we use it to set the global options. This is helpful + when analyzing the pattern for first characters, etc. Otherwise + nothing is done here and it is handled during the compiling + process. + + [Historical note: Up to Perl 5.8, options settings at top level + were always global settings, wherever they appeared in the pattern. + That is, they were equivalent to an external setting. From 5.8 + onwards, they apply only to what follows (which is what you might + expect).] */ case ')': - if (brastackptr == 0) + if (item_count == 0) { options = (options | set) & (~unset); set = unset = 0; /* To save length */ + item_count--; /* To allow for several */ } + /* Fall through */ /* A termination by ':' indicates the start of a nested group with @@ -2879,7 +3687,8 @@ while ((c = *(++ptr)) != 0) END_OPTIONS: if (c == ')') { - if (branch_newextra == 2 && (branch_extra == 0 || branch_extra == 3)) + if (branch_newextra == 2 && + (branch_extra == 0 || branch_extra == 1+LINK_SIZE)) branch_extra += branch_newextra; continue; } @@ -2924,55 +3733,65 @@ while ((c = *(++ptr)) != 0) the branch_extra value. */ case ')': - length += 3; + length += 1 + LINK_SIZE; + if (brastackptr > 0) { - int minval = 1; - int maxval = 1; - int duplength; + duplength = length - brastack[--brastackptr]; + branch_extra = bralenstack[brastackptr]; + } + else duplength = 0; - if (brastackptr > 0) - { - duplength = length - brastack[--brastackptr]; - branch_extra = bralenstack[brastackptr]; - } - else duplength = 0; + /* The following code is also used when a recursion such as (?3) is + followed by a quantifier, because in that case, it has to be wrapped inside + brackets so that the quantifier works. The value of duplength must be + set before arrival. 
*/ - /* Leave ptr at the final char; for read_repeat_counts this happens - automatically; for the others we need an increment. */ + HANDLE_QUANTIFIED_BRACKETS: - if ((c = ptr[1]) == '{' && is_counted_repeat(ptr+2, &compile_block)) - { - ptr = read_repeat_counts(ptr+2, &minval, &maxval, errorptr, - &compile_block); - if (*errorptr != NULL) goto PCRE_ERROR_RETURN; - } - else if (c == '*') { minval = 0; maxval = -1; ptr++; } - else if (c == '+') { maxval = -1; ptr++; } - else if (c == '?') { minval = 0; ptr++; } + /* Leave ptr at the final char; for read_repeat_counts this happens + automatically; for the others we need an increment. */ - /* If the minimum is zero, we have to allow for an OP_BRAZERO before the - group, and if the maximum is greater than zero, we have to replicate - maxval-1 times; each replication acquires an OP_BRAZERO plus a nesting - bracket set - hence the 7. */ + if ((c = ptr[1]) == '{' && is_counted_repeat(ptr+2, &compile_block)) + { + ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + } + else if (c == '*') { min = 0; max = -1; ptr++; } + else if (c == '+') { min = 1; max = -1; ptr++; } + else if (c == '?') { min = 0; max = 1; ptr++; } + else { min = 1; max = 1; } - if (minval == 0) - { - length++; - if (maxval > 0) length += (maxval - 1) * (duplength + 7); - } + /* If the minimum is zero, we have to allow for an OP_BRAZERO before the + group, and if the maximum is greater than zero, we have to replicate + maxval-1 times; each replication acquires an OP_BRAZERO plus a nesting + bracket set. */ - /* When the minimum is greater than zero, 1 we have to replicate up to - minval-1 times, with no additions required in the copies. Then, if - there is a limited maximum we have to replicate up to maxval-1 times - allowing for a BRAZERO item before each optional copy and nesting - brackets for all but one of the optional copies. 
*/ + if (min == 0) + { + length++; + if (max > 0) length += (max - 1) * (duplength + 3 + 2*LINK_SIZE); + } - else - { - length += (minval - 1) * duplength; - if (maxval > minval) /* Need this test as maxval=-1 means no limit */ - length += (maxval - minval) * (duplength + 7) - 6; - } + /* When the minimum is greater than zero, we have to replicate up to + minval-1 times, with no additions required in the copies. Then, if there + is a limited maximum we have to replicate up to maxval-1 times allowing + for a BRAZERO item before each optional copy and nesting brackets for all + but one of the optional copies. */ + + else + { + length += (min - 1) * duplength; + if (max > min) /* Need this test as max=-1 means no limit */ + length += (max - min) * (duplength + 3 + 2*LINK_SIZE) + - (2 + 2*LINK_SIZE); + } + + /* Allow space for once brackets for "possessive quantifier" */ + + if (ptr[1] == '+') + { + ptr++; + length += 2 + 2*LINK_SIZE; } continue; @@ -2987,6 +3806,20 @@ while ((c = *(++ptr)) != 0) runlength = 0; do { + /* If in a \Q...\E sequence, check for end; otherwise it's a literal */ + if (inescq) + { + if (c == '\\' && ptr[1] == 'E') + { + inescq = FALSE; + ptr++; + } + else runlength++; + continue; + } + + /* Skip whitespace and comments for /x */ + if ((options & PCRE_EXTENDED) != 0) { if ((compile_block.ctypes[c] & ctype_space) != 0) continue; @@ -3031,27 +3864,24 @@ while ((c = *(++ptr)) != 0) while (runlength < MAXLIT && (compile_block.ctypes[c = *(++ptr)] & ctype_meta) == 0); - ptr--; + if (runlength < MAXLIT) ptr--; length += runlength; continue; } } -length += 4; /* For final KET and END */ +length += 2 + LINK_SIZE; /* For final KET and END */ -if (length > 65539) +if (length > MAX_PATTERN_SIZE) { *errorptr = ERR20; return NULL; } /* Compute the size of data block needed and get it, either from malloc or -externally provided function. 
We specify "code[0]" in the offsetof() expression
-rather than just "code", because it has been reported that one broken compiler
-fails on "code" because it is also an independent variable. It should make no
-difference to the value of the offsetof(). */
+externally provided function. */
-size = length + offsetof(real_pcre, code[0]);
+size = length + sizeof(real_pcre) + name_count * (max_name_size + 3);
 re = (real_pcre *)(pcre_malloc)(size);
 if (re == NULL)
@@ -3066,17 +3896,28 @@ re->magic_number = MAGIC_NUMBER;
 re->size = size;
 re->options = options;
 re->tables = tables;
+re->name_entry_size = max_name_size + 3;
+re->name_count = name_count;
+
+/* The starting points of the name/number translation table and of the code are
+passed around in the compile data block. */
+
+compile_block.names_found = 0;
+compile_block.name_entry_size = max_name_size + 3;
+compile_block.name_table = (uschar *)re + sizeof(real_pcre);
+codestart = compile_block.name_table + re->name_entry_size * re->name_count;
+compile_block.start_code = codestart;
 /* Set up a starting, non-extracting bracket, then compile the expression. On
 error, *errorptr will be set non-NULL, so we don't need to look at the result
 of the function here. */
 ptr = (const uschar *)pattern;
-code = re->code;
+code = (uschar *)codestart;
 *code = OP_BRA;
 bracount = 0;
-(void)compile_regex(options, -1, &bracount, &code, &ptr, errorptr, FALSE, 0,
-  &reqchar, &countlits, &compile_block);
+(void)compile_regex(options, options & PCRE_IMS, &bracount, &code, &ptr,
+  errorptr, FALSE, 0, &firstchar, &reqchar, NULL, &compile_block);
 re->top_bracket = bracount;
 re->top_backref = top_backref;
@@ -3090,7 +3931,7 @@ if debugging, leave the test till after things are printed out. */
 *code++ = OP_END;
 #ifndef DEBUG
-if (code - re->code > length) *errorptr = ERR23;
+if (code - codestart > length) *errorptr = ERR23;
 #endif
 /* Give an error if there's back reference to a non-existent capturing
@@ -3098,7 +3939,7 @@ subpattern.
*/
 if (top_backref > re->top_bracket) *errorptr = ERR15;
-/* Failed to compile */
+/* Failed to compile, or error while post-processing */
 if (*errorptr != NULL)
   {
@@ -3108,12 +3949,12 @@ if (*errorptr != NULL)
   return NULL;
   }
-/* If the anchored option was not passed, set flag if we can determine that the
-pattern is anchored by virtue of ^ characters or \A or anything else (such as
-starting with .* when DOTALL is set).
+/* If the anchored option was not passed, set the flag if we can determine that
+the pattern is anchored by virtue of ^ characters or \A or anything else (such
+as starting with .* when DOTALL is set).
-Otherwise, see if we can determine what the first character has to be, because
-that speeds up unanchored matches no end. If not, see if we can set the
+Otherwise, if we know what the first character has to be, save it, because that
+speeds up unanchored matches no end. If not, see if we can set the
 PCRE_STARTLINE flag. This is helpful for multiline matches when all branches
 start with ^. and also when all branches start with .* for non-DOTALL matches.
 */
@@ -3121,27 +3962,35 @@ start with ^. and also when all branches start with .* for non-DOTALL matches.
 if ((options & PCRE_ANCHORED) == 0)
   {
   int temp_options = options;
-  if (is_anchored(re->code, &temp_options))
+  if (is_anchored(codestart, &temp_options, FALSE, top_backref))
     re->options |= PCRE_ANCHORED;
   else
     {
-    int ch = find_firstchar(re->code, &temp_options);
-    if (ch >= 0)
+    if (firstchar < 0)
+      firstchar = find_firstassertedchar(codestart, &temp_options, FALSE);
+    if (firstchar >= 0)    /* Remove caseless flag for non-caseable chars */
       {
-      re->first_char = ch;
+      int ch = firstchar & 255;
+      re->first_char = ((firstchar & REQ_CASELESS) != 0 &&
+         compile_block.fcc[ch] == ch)?
ch : firstchar;
       re->options |= PCRE_FIRSTSET;
       }
-    else if (is_startline(re->code))
+    else if (is_startline(codestart, FALSE, top_backref))
       re->options |= PCRE_STARTLINE;
     }
   }
-/* Save the last required character if there are at least two literal
-characters on all paths, or if there is no first character setting. */
+/* Save the last required character if any. Remove caseless flag for
+non-caseable chars. */
-if (reqchar >= 0 && (countlits > 1 || (re->options & PCRE_FIRSTSET) == 0))
+if ((re->options & PCRE_ANCHORED) != 0 && reqchar < 0 && firstchar >= 0)
+  reqchar = firstchar;
+
+if (reqchar >= 0)
   {
-  re->req_char = reqchar;
+  int ch = reqchar & 255;
+  re->req_char = ((reqchar & REQ_CASELESS) != 0 &&
+    compile_block.fcc[ch] == ch)? ch : reqchar;
   re->options |= PCRE_REQCHSET;
   }
@@ -3168,209 +4017,26 @@ if (re->options != 0)
 if ((re->options & PCRE_FIRSTSET) != 0)
   {
-  if (isprint(re->first_char)) printf("First char = %c\n", re->first_char);
-    else printf("First char = \\x%02x\n", re->first_char);
+  int ch = re->first_char & 255;
+  char *caseless = ((re->first_char & REQ_CASELESS) == 0)?
"" : " (caseless)"; + if (isprint(ch)) printf("Req char = %c%s\n", ch, caseless); + else printf("Req char = \\x%02x%s\n", ch, caseless); } -code_end = code; -code_base = code = re->code; - -while (code < code_end) - { - int charlength; - - printf("%3d ", code - code_base); - - if (*code >= OP_BRA) - { - if (*code - OP_BRA > EXTRACT_BASIC_MAX) - printf("%3d Bra extra", (code[1] << 8) + code[2]); - else - printf("%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA); - code += 2; - } - - else switch(*code) - { - case OP_OPT: - printf(" %.2x %s", code[1], OP_names[*code]); - code++; - break; - - case OP_CHARS: - charlength = *(++code); - printf("%3d ", charlength); - while (charlength-- > 0) - if (isprint(c = *(++code))) printf("%c", c); else printf("\\x%02x", c); - break; - - case OP_KETRMAX: - case OP_KETRMIN: - case OP_ALT: - case OP_KET: - case OP_ASSERT: - case OP_ASSERT_NOT: - case OP_ASSERTBACK: - case OP_ASSERTBACK_NOT: - case OP_ONCE: - case OP_REVERSE: - case OP_BRANUMBER: - case OP_COND: - case OP_CREF: - printf("%3d %s", (code[1] << 8) + code[2], OP_names[*code]); - code += 2; - break; - - case OP_STAR: - case OP_MINSTAR: - case OP_PLUS: - case OP_MINPLUS: - case OP_QUERY: - case OP_MINQUERY: - case OP_TYPESTAR: - case OP_TYPEMINSTAR: - case OP_TYPEPLUS: - case OP_TYPEMINPLUS: - case OP_TYPEQUERY: - case OP_TYPEMINQUERY: - if (*code >= OP_TYPESTAR) - printf(" %s", OP_names[code[1]]); - else if (isprint(c = code[1])) printf(" %c", c); - else printf(" \\x%02x", c); - printf("%s", OP_names[*code++]); - break; - - case OP_EXACT: - case OP_UPTO: - case OP_MINUPTO: - if (isprint(c = code[3])) printf(" %c{", c); - else printf(" \\x%02x{", c); - if (*code != OP_EXACT) printf("0,"); - printf("%d}", (code[1] << 8) + code[2]); - if (*code == OP_MINUPTO) printf("?"); - code += 3; - break; - - case OP_TYPEEXACT: - case OP_TYPEUPTO: - case OP_TYPEMINUPTO: - printf(" %s{", OP_names[code[3]]); - if (*code != OP_TYPEEXACT) printf(","); - printf("%d}", (code[1] << 8) + 
code[2]); - if (*code == OP_TYPEMINUPTO) printf("?"); - code += 3; - break; - - case OP_NOT: - if (isprint(c = *(++code))) printf(" [^%c]", c); - else printf(" [^\\x%02x]", c); - break; - - case OP_NOTSTAR: - case OP_NOTMINSTAR: - case OP_NOTPLUS: - case OP_NOTMINPLUS: - case OP_NOTQUERY: - case OP_NOTMINQUERY: - if (isprint(c = code[1])) printf(" [^%c]", c); - else printf(" [^\\x%02x]", c); - printf("%s", OP_names[*code++]); - break; - - case OP_NOTEXACT: - case OP_NOTUPTO: - case OP_NOTMINUPTO: - if (isprint(c = code[3])) printf(" [^%c]{", c); - else printf(" [^\\x%02x]{", c); - if (*code != OP_NOTEXACT) printf(","); - printf("%d}", (code[1] << 8) + code[2]); - if (*code == OP_NOTMINUPTO) printf("?"); - code += 3; - break; - - case OP_REF: - printf(" \\%d", (code[1] << 8) | code[2]); - code += 3; - goto CLASS_REF_REPEAT; - - case OP_CLASS: - { - int i, min, max; - code++; - printf(" ["); - - for (i = 0; i < 256; i++) - { - if ((code[i/8] & (1 << (i&7))) != 0) - { - int j; - for (j = i+1; j < 256; j++) - if ((code[j/8] & (1 << (j&7))) == 0) break; - if (i == '-' || i == ']') printf("\\"); - if (isprint(i)) printf("%c", i); else printf("\\x%02x", i); - if (--j > i) - { - printf("-"); - if (j == '-' || j == ']') printf("\\"); - if (isprint(j)) printf("%c", j); else printf("\\x%02x", j); - } - i = j; - } - } - printf("]"); - code += 32; - - CLASS_REF_REPEAT: - - switch(*code) - { - case OP_CRSTAR: - case OP_CRMINSTAR: - case OP_CRPLUS: - case OP_CRMINPLUS: - case OP_CRQUERY: - case OP_CRMINQUERY: - printf("%s", OP_names[*code]); - break; - - case OP_CRRANGE: - case OP_CRMINRANGE: - min = (code[1] << 8) + code[2]; - max = (code[3] << 8) + code[4]; - if (max == 0) printf("{%d,}", min); - else printf("{%d,%d}", min, max); - if (*code == OP_CRMINRANGE) printf("?"); - code += 4; - break; - - default: - code--; - } - } - break; - - /* Anything else is just a one-node item */ - - default: - printf(" %s", OP_names[*code]); - break; - } - - code++; - printf("\n"); - } 
-printf("------------------------------------------------------------------\n"); +print_internals(re, stdout); /* This check is done here in the debugging case so that the code that was compiled can be seen. */ -if (code - re->code > length) +if (code - codestart > length) { *errorptr = ERR23; (pcre_free)(re); @@ -3515,7 +4181,8 @@ for (;;) /* For extended extraction brackets (large number), we have to fish out the number from a dummy opcode at the start. */ - if (number > EXTRACT_BASIC_MAX) number = (ecode[4] << 8) | ecode[5]; + if (number > EXTRACT_BASIC_MAX) + number = GET2(ecode, 2+LINK_SIZE); offset = number << 1; #ifdef DEBUG @@ -3529,15 +4196,17 @@ for (;;) int save_offset1 = md->offset_vector[offset]; int save_offset2 = md->offset_vector[offset+1]; int save_offset3 = md->offset_vector[md->offset_end - number]; + int save_capture_last = md->capture_last; DPRINTF(("saving %d %d %d\n", save_offset1, save_offset2, save_offset3)); md->offset_vector[md->offset_end - number] = eptr - md->start_subject; do { - if (match(eptr, ecode+3, offset_top, md, ims, eptrb, match_isgroup)) - return TRUE; - ecode += (ecode[1] << 8) + ecode[2]; + if (match(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, eptrb, + match_isgroup)) return TRUE; + md->capture_last = save_capture_last; + ecode += GET(ecode, 1); } while (*ecode == OP_ALT); @@ -3563,9 +4232,9 @@ for (;;) DPRINTF(("start bracket 0\n")); do { - if (match(eptr, ecode+3, offset_top, md, ims, eptrb, match_isgroup)) - return TRUE; - ecode += (ecode[1] << 8) + ecode[2]; + if (match(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, eptrb, + match_isgroup)) return TRUE; + ecode += GET(ecode, 1); } while (*ecode == OP_ALT); DPRINTF(("bracket 0 failed\n")); @@ -3577,12 +4246,14 @@ for (;;) exactly what going to the ket would do. 
*/ case OP_COND: - if (ecode[3] == OP_CREF) /* Condition is extraction test */ + if (ecode[LINK_SIZE+1] == OP_CREF) /* Condition extract or recurse test */ { - int offset = (ecode[4] << 9) | (ecode[5] << 1); /* Doubled ref number */ - return match(eptr, - ecode + ((offset < offset_top && md->offset_vector[offset] >= 0)? - 6 : 3 + (ecode[1] << 8) + ecode[2]), + int offset = GET2(ecode, LINK_SIZE+2) << 1; /* Doubled ref number */ + BOOL condition = (offset == CREF_RECURSE * 2)? + (md->recursive != NULL) : + (offset < offset_top && md->offset_vector[offset] >= 0); + return match(eptr, ecode + (condition? + (LINK_SIZE + 4) : (LINK_SIZE + 1 + GET(ecode, 1))), offset_top, md, ims, eptrb, match_isgroup); } @@ -3591,14 +4262,15 @@ for (;;) else { - if (match(eptr, ecode+3, offset_top, md, ims, NULL, + if (match(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, NULL, match_condassert | match_isgroup)) { - ecode += 3 + (ecode[4] << 8) + ecode[5]; - while (*ecode == OP_ALT) ecode += (ecode[1] << 8) + ecode[2]; + ecode += 1 + LINK_SIZE + GET(ecode, LINK_SIZE+2); + while (*ecode == OP_ALT) ecode += GET(ecode, 1); } - else ecode += (ecode[1] << 8) + ecode[2]; - return match(eptr, ecode+3, offset_top, md, ims, eptrb, match_isgroup); + else ecode += GET(ecode, 1); + return match(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, eptrb, + match_isgroup); } /* Control never reaches here */ @@ -3610,10 +4282,26 @@ for (;;) ecode += 3; break; - /* End of the pattern. If PCRE_NOTEMPTY is set, fail if we have matched - an empty string - recursion will then try other alternatives, if any. */ + /* End of the pattern. If we are in a recursion, we should restore the + offsets appropriately and continue from after the call. 
*/ case OP_END: + if (md->recursive != NULL && md->recursive->group_num == 0) + { + recursion_info *rec = md->recursive; + DPRINTF(("Hit the end in a (?0) recursion\n")); + md->recursive = rec->prev; + memmove(md->offset_vector, rec->offset_save, + rec->saved_max * sizeof(int)); + md->start_match = rec->save_start; + ims = original_ims; + ecode = rec->after_call; + break; + } + + /* Otherwise, if PCRE_NOTEMPTY is set, fail if we have matched an empty + string - backtracking will then try other alternatives, if any. */ + if (md->notempty && eptr == md->start_match) return FALSE; md->end_match_ptr = eptr; /* Record where we ended */ md->end_offset_top = offset_top; /* and how many extracts were taken */ @@ -3637,8 +4325,9 @@ for (;;) case OP_ASSERTBACK: do { - if (match(eptr, ecode+3, offset_top, md, ims, NULL, match_isgroup)) break; - ecode += (ecode[1] << 8) + ecode[2]; + if (match(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, NULL, + match_isgroup)) break; + ecode += GET(ecode, 1); } while (*ecode == OP_ALT); if (*ecode == OP_KET) return FALSE; @@ -3650,8 +4339,8 @@ for (;;) /* Continue from after the assertion, updating the offsets high water mark, since extracts may have been taken during the assertion. */ - do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); - ecode += 3; + do ecode += GET(ecode,1); while (*ecode == OP_ALT); + ecode += 1 + LINK_SIZE; offset_top = md->end_offset_top; continue; @@ -3661,15 +4350,15 @@ for (;;) case OP_ASSERTBACK_NOT: do { - if (match(eptr, ecode+3, offset_top, md, ims, NULL, match_isgroup)) - return FALSE; - ecode += (ecode[1] << 8) + ecode[2]; + if (match(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, NULL, + match_isgroup)) return FALSE; + ecode += GET(ecode,1); } while (*ecode == OP_ALT); if ((flags & match_condassert) != 0) return TRUE; - ecode += 3; + ecode += 1 + LINK_SIZE; continue; /* Move the subject pointer back. 
This occurs only at the start of @@ -3679,75 +4368,161 @@ for (;;) case OP_REVERSE: #ifdef SUPPORT_UTF8 - c = (ecode[1] << 8) + ecode[2]; + c = GET(ecode,1); for (i = 0; i < c; i++) { eptr--; BACKCHAR(eptr) } #else - eptr -= (ecode[1] << 8) + ecode[2]; + eptr -= GET(ecode,1); #endif if (eptr < md->start_subject) return FALSE; - ecode += 3; + ecode += 1 + LINK_SIZE; break; - /* Recursion matches the current regex, nested. If there are any capturing - brackets started but not finished, we have to save their starting points - and reinstate them after the recursion. However, we don't know how many - such there are (offset_top records the completed total) so we just have - to save all the potential data. There may be up to 99 such values, which - is a bit large to put on the stack, but using malloc for small numbers - seems expensive. As a compromise, the stack is used when there are fewer - than 16 values to store; otherwise malloc is used. A problem is what to do - if the malloc fails ... there is no way of returning to the top level with - an error. Save the top 15 values on the stack, and accept that the rest - may be wrong. */ + /* The callout item calls an external function, if one is provided, passing + details of the match so far. This is mainly for debugging, though the + function is able to force a failure. */ + + case OP_CALLOUT: + if (pcre_callout != NULL) + { + pcre_callout_block cb; + cb.version = 0; /* Version 0 of the callout block */ + cb.callout_number = ecode[1]; + cb.offset_vector = md->offset_vector; + cb.subject = (const char *)md->start_subject; + cb.subject_length = md->end_subject - md->start_subject; + cb.start_match = md->start_match - md->start_subject; + cb.current_position = eptr - md->start_subject; + cb.capture_top = offset_top/2; + cb.capture_last = md->capture_last; + if ((*pcre_callout)(&cb) != 0) return FALSE; + } + ecode += 2; + break; + + /* Recursion either matches the current regex, or some subexpression. 
The + offset data is the offset to the starting bracket from the start of the + whole pattern. However, it is possible that a BRAZERO was inserted before + this bracket after we took the offset - we just skip it if encountered. + + If there are any capturing brackets started but not finished, we have to + save their starting points and reinstate them after the recursion. However, + we don't know how many such there are (offset_top records the completed + total) so we just have to save all the potential data. There may be up to + 65535 such values, which is too large to put on the stack, but using malloc + for small numbers seems expensive. As a compromise, the stack is used when + there are no more than REC_STACK_SAVE_MAX values to store; otherwise malloc + is used. A problem is what to do if the malloc fails ... there is no way of + returning to the top level with an error. Save the top REC_STACK_SAVE_MAX + values on the stack, and accept that the rest may be wrong. + + There are also other values that have to be saved. We use a chained + sequence of blocks that actually live on the stack. Thanks to Robin Houston + for the original version of this logic. */ case OP_RECURSE: { - BOOL rc; - int *save; - int stacksave[15]; + int stacksave[REC_STACK_SAVE_MAX]; + recursion_info new_recursive; + const uschar *callpat = md->start_code + GET(ecode, 1); - c = md->offset_max; + if (*callpat == OP_BRAZERO) callpat++; - if (c < 16) save = stacksave; else + new_recursive.group_num = *callpat - OP_BRA; + + /* For extended extraction brackets (large number), we have to fish out + the number from a dummy opcode at the start. 
*/ + + if (new_recursive.group_num > EXTRACT_BASIC_MAX) + new_recursive.group_num = GET2(callpat, 2+LINK_SIZE); + + /* Add to "recursing stack" */ + + new_recursive.prev = md->recursive; + md->recursive = &new_recursive; + + /* Find where to continue from afterwards */ + + ecode += 1 + LINK_SIZE; + new_recursive.after_call = ecode; + + /* Now save the offset data. */ + + new_recursive.saved_max = md->offset_end; + if (new_recursive.saved_max <= REC_STACK_SAVE_MAX) + new_recursive.offset_save = stacksave; + else { - save = (int *)(pcre_malloc)((c+1) * sizeof(int)); - if (save == NULL) + new_recursive.offset_save = (int *) + (pcre_malloc)(new_recursive.saved_max * sizeof(int)); + + /* RH: Warning: This may cause INCORRECT RESULTS if we run out of + memory here, because we won't be restoring all the stored strings + correctly. We either need proper run-time error handling or, at the + very least, some way to warn the user. Could we just spit a message to + stderr? + + PH: No, Robin, no! You must NEVER write to stderr from inside a general + library function, because you don't know anything about the state of + the file descriptor. + + RH: Returning error values would be very tedious because of the + recursion; and Philip Hazel says that longjmp() - in many ways the + obvious solution - has previously caused problems on some platforms. 
*/ + + if (new_recursive.offset_save == NULL) { - save = stacksave; - c = 15; + DPRINTF(("malloc() failed - results may be wrong\n")); + new_recursive.offset_save = stacksave; + new_recursive.saved_max = REC_STACK_SAVE_MAX; } } - for (i = 1; i <= c; i++) - save[i] = md->offset_vector[md->offset_end - i]; - rc = match(eptr, md->start_pattern, offset_top, md, ims, eptrb, - match_isgroup); - for (i = 1; i <= c; i++) - md->offset_vector[md->offset_end - i] = save[i]; - if (save != stacksave) (pcre_free)(save); - if (!rc) return FALSE; + memcpy(new_recursive.offset_save, md->offset_vector, + new_recursive.saved_max * sizeof(int)); + new_recursive.save_start = md->start_match; + md->start_match = eptr; - /* In case the recursion has set more capturing values, save the final - number, then move along the subject till after the recursive match, - and advance one byte in the pattern code. */ + /* OK, now we can do the recursion. For each top-level alternative we + restore the offset and recursion data. */ - offset_top = md->end_offset_top; - eptr = md->end_match_ptr; - ecode++; + DPRINTF(("Recursing into group %d\n", new_recursive.group_num)); + do + { + if (match(eptr, callpat + 1 + LINK_SIZE, offset_top, md, ims, eptrb, + match_isgroup)) + { + md->recursive = new_recursive.prev; + if (new_recursive.offset_save != stacksave) + (pcre_free)(new_recursive.offset_save); + return TRUE; + } + + md->recursive = &new_recursive; + memcpy(md->offset_vector, new_recursive.offset_save, + new_recursive.saved_max * sizeof(int)); + callpat += GET(callpat, 1); + } + while (*callpat == OP_ALT); + + DPRINTF(("Recursion didn't match\n")); + md->recursive = new_recursive.prev; + if (new_recursive.offset_save != stacksave) + (pcre_free)(new_recursive.offset_save); + return FALSE; } break; /* "Once" brackets are like assertion brackets except that after a match, the point in the subject string is not moved back. Thus there can never be - a move back into the brackets. 
Check the alternative branches in turn - the - matching won't pass the KET for this kind of subpattern. If any one branch - matches, we carry on as at the end of a normal bracket, leaving the subject - pointer. */ + a move back into the brackets. Friedl calls these "atomic" subpatterns. + Check the alternative branches in turn - the matching won't pass the KET + for this kind of subpattern. If any one branch matches, we carry on as at + the end of a normal bracket, leaving the subject pointer. */ case OP_ONCE: { @@ -3756,9 +4531,9 @@ for (;;) do { - if (match(eptr, ecode+3, offset_top, md, ims, eptrb, match_isgroup)) - break; - ecode += (ecode[1] << 8) + ecode[2]; + if (match(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, eptrb, + match_isgroup)) break; + ecode += GET(ecode,1); } while (*ecode == OP_ALT); @@ -3769,7 +4544,7 @@ for (;;) /* Continue as from after the assertion, updating the offsets high water mark, since extracts may have been taken. */ - do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + do ecode += GET(ecode,1); while (*ecode == OP_ALT); offset_top = md->end_offset_top; eptr = md->end_match_ptr; @@ -3782,7 +4557,7 @@ for (;;) if (*ecode == OP_KET || eptr == saved_eptr) { - ecode += 3; + ecode += 1+LINK_SIZE; break; } @@ -3791,7 +4566,7 @@ for (;;) that changed within the bracket before re-running it, so check the next opcode. 
*/ - if (ecode[3] == OP_OPT) + if (ecode[1+LINK_SIZE] == OP_OPT) { ims = (ims & ~PCRE_IMS) | ecode[4]; DPRINTF(("ims set to %02lx at group repeat\n", ims)); @@ -3799,14 +4574,16 @@ for (;;) if (*ecode == OP_KETRMIN) { - if (match(eptr, ecode+3, offset_top, md, ims, eptrb, 0) || + if (match(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, eptrb, 0) + || match(eptr, prev, offset_top, md, ims, eptrb, match_isgroup)) return TRUE; } else /* OP_KETRMAX */ { if (match(eptr, prev, offset_top, md, ims, eptrb, match_isgroup) || - match(eptr, ecode+3, offset_top, md, ims, eptrb, 0)) return TRUE; + match(eptr, ecode + 1+LINK_SIZE, offset_top, md, ims, eptrb, 0)) + return TRUE; } } return FALSE; @@ -3815,7 +4592,7 @@ for (;;) bracketed group and go to there. */ case OP_ALT: - do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + do ecode += GET(ecode,1); while (*ecode == OP_ALT); break; /* BRAZERO and BRAMINZERO occur just before a bracket group, indicating @@ -3829,17 +4606,17 @@ for (;;) const uschar *next = ecode+1; if (match(eptr, next, offset_top, md, ims, eptrb, match_isgroup)) return TRUE; - do next += (next[1] << 8) + next[2]; while (*next == OP_ALT); - ecode = next + 3; + do next += GET(next,1); while (*next == OP_ALT); + ecode = next + 1+LINK_SIZE; } break; case OP_BRAMINZERO: { const uschar *next = ecode+1; - do next += (next[1] << 8) + next[2]; while (*next == OP_ALT); - if (match(eptr, next+3, offset_top, md, ims, eptrb, match_isgroup)) - return TRUE; + do next += GET(next,1); while (*next == OP_ALT); + if (match(eptr, next + 1+LINK_SIZE, offset_top, md, ims, eptrb, + match_isgroup)) return TRUE; ecode++; } break; @@ -3853,7 +4630,7 @@ for (;;) case OP_KETRMIN: case OP_KETRMAX: { - const uschar *prev = ecode - (ecode[1] << 8) - ecode[2]; + const uschar *prev = ecode - GET(ecode, 1); const uschar *saved_eptr = eptrb->saved_eptr; eptrb = eptrb->prev; /* Back up the stack of bracket start pointers */ @@ -3879,7 +4656,7 @@ for (;;) /* For extended 
extraction brackets (large number), we have to fish out the number from a dummy opcode at the start. */ - if (number > EXTRACT_BASIC_MAX) number = (prev[4] << 8) | prev[5]; + if (number > EXTRACT_BASIC_MAX) number = GET2(prev, 2+LINK_SIZE); offset = number << 1; #ifdef DEBUG @@ -3887,8 +4664,14 @@ for (;;) printf("\n"); #endif + /* Test for a numbered group. This includes groups called as a result + of recursion. Note that whole-pattern recursion is coded as a recurse + into group 0, so it won't be picked up here. Instead, we catch it when + the OP_END is reached. */ + if (number > 0) { + md->capture_last = number; if (offset >= md->offset_max) md->offset_overflow = TRUE; else { md->offset_vector[offset] = @@ -3896,6 +4679,22 @@ for (;;) md->offset_vector[offset+1] = eptr - md->start_subject; if (offset_top <= offset) offset_top = offset + 2; } + + /* Handle a recursively called group. Restore the offsets + appropriately and continue from after the call. */ + + if (md->recursive != NULL && md->recursive->group_num == number) + { + recursion_info *rec = md->recursive; + DPRINTF(("Recursion (%d) succeeded - continuing\n", number)); + md->recursive = rec->prev; + md->start_match = rec->save_start; + memcpy(md->offset_vector, rec->offset_save, + rec->saved_max * sizeof(int)); + ecode = rec->after_call; + ims = original_ims; + break; + } } } @@ -3913,7 +4712,7 @@ for (;;) if (*ecode == OP_KET || eptr == saved_eptr) { - ecode += 3; + ecode += 1 + LINK_SIZE; break; } @@ -3922,14 +4721,15 @@ for (;;) if (*ecode == OP_KETRMIN) { - if (match(eptr, ecode+3, offset_top, md, ims, eptrb, 0) || + if (match(eptr, ecode + 1+LINK_SIZE, offset_top, md, ims, eptrb, 0) || match(eptr, prev, offset_top, md, ims, eptrb, match_isgroup)) return TRUE; } else /* OP_KETRMAX */ { if (match(eptr, prev, offset_top, md, ims, eptrb, match_isgroup) || - match(eptr, ecode+3, offset_top, md, ims, eptrb, 0)) return TRUE; + match(eptr, ecode + 1+LINK_SIZE, offset_top, md, ims, eptrb, 0)) + return TRUE; 
} } return FALSE; @@ -3953,6 +4753,13 @@ for (;;) ecode++; break; + /* Start of match assertion */ + + case OP_SOM: + if (eptr != md->start_subject + md->start_offset) return FALSE; + ecode++; + break; + /* Assert before internal newline if multiline, or before a terminating newline unless endonly is set, else end of subject unless noteol is set. */ @@ -4021,6 +4828,14 @@ for (;;) ecode++; break; + /* Match a single byte, even in UTF-8 mode. This opcode really does match + any byte, even newline, independent of the setting of PCRE_DOTALL. */ + + case OP_ANYBYTE: + if (eptr++ >= md->end_subject) return FALSE; + ecode++; + break; + case OP_NOT_DIGIT: if (eptr >= md->end_subject || (md->ctypes[*eptr++] & ctype_digit) != 0) @@ -4074,7 +4889,7 @@ for (;;) case OP_REF: { int length; - int offset = (ecode[1] << 9) | (ecode[2] << 1); /* Doubled ref number */ + int offset = GET2(ecode, 1) << 1; /* Doubled ref number */ ecode += 3; /* Advance past item */ /* If the reference is unset, set the length to be longer than the amount @@ -4106,8 +4921,8 @@ for (;;) case OP_CRRANGE: case OP_CRMINRANGE: minimize = (*ecode == OP_CRMINRANGE); - min = (ecode[1] << 8) + ecode[2]; - max = (ecode[3] << 8) + ecode[4]; + min = GET2(ecode, 1); + max = GET2(ecode, 3); if (max == 0) max = INT_MAX; ecode += 5; break; @@ -4203,8 +5018,8 @@ for (;;) case OP_CRRANGE: case OP_CRMINRANGE: minimize = (*ecode == OP_CRMINRANGE); - min = (ecode[1] << 8) + ecode[2]; - max = (ecode[3] << 8) + ecode[4]; + min = GET2(ecode, 1); + max = GET2(ecode, 3); if (max == 0) max = INT_MAX; ecode += 5; break; @@ -4327,14 +5142,14 @@ for (;;) /* Match a single character repeatedly; different opcodes share code. 
*/ case OP_EXACT: - min = max = (ecode[1] << 8) + ecode[2]; + min = max = GET2(ecode, 1); ecode += 3; goto REPEATCHAR; case OP_UPTO: case OP_MINUPTO: min = 0; - max = (ecode[1] << 8) + ecode[2]; + max = GET2(ecode, 1); minimize = *ecode == OP_MINUPTO; ecode += 3; goto REPEATCHAR; @@ -4458,14 +5273,14 @@ for (;;) time taken, but character matching *is* what this is all about... */ case OP_NOTEXACT: - min = max = (ecode[1] << 8) + ecode[2]; + min = max = GET2(ecode, 1); ecode += 3; goto REPEATNOTCHAR; case OP_NOTUPTO: case OP_NOTMINUPTO: min = 0; - max = (ecode[1] << 8) + ecode[2]; + max = GET2(ecode, 1); minimize = *ecode == OP_NOTMINUPTO; ecode += 3; goto REPEATNOTCHAR; @@ -4572,7 +5387,7 @@ for (;;) repeat it in the interests of efficiency. */ case OP_TYPEEXACT: - min = max = (ecode[1] << 8) + ecode[2]; + min = max = GET2(ecode, 1); minimize = TRUE; ecode += 3; goto REPEATTYPE; @@ -4580,7 +5395,7 @@ for (;;) case OP_TYPEUPTO: case OP_TYPEMINUPTO: min = 0; - max = (ecode[1] << 8) + ecode[2]; + max = GET2(ecode, 1); minimize = *ecode == OP_TYPEMINUPTO; ecode += 3; goto REPEATTYPE; @@ -4632,6 +5447,10 @@ for (;;) else eptr += min; break; + case OP_ANYBYTE: + eptr += min; + break; + case OP_NOT_DIGIT: for (i = 1; i <= min; i++) if ((md->ctypes[*eptr++] & ctype_digit) != 0) return FALSE; @@ -4690,6 +5509,9 @@ for (;;) #endif break; + case OP_ANYBYTE: + break; + case OP_NOT_DIGIT: if ((md->ctypes[c] & ctype_digit) != 0) return FALSE; break; @@ -4761,13 +5583,14 @@ for (;;) if (eptr >= md->end_subject || *eptr == NEWLINE) break; eptr++; } + break; } - else - { - c = max - min; - if (c > md->end_subject - eptr) c = md->end_subject - eptr; - eptr += c; - } + /* For non-UTF8 DOTALL case, fall through and treat as \C */ + + case OP_ANYBYTE: + c = max - min; + if (c > md->end_subject - eptr) c = md->end_subject - eptr; + eptr += c; break; case OP_NOT_DIGIT: @@ -4898,9 +5721,13 @@ const uschar *end_subject; const uschar *req_char_ptr = start_match - 1; const real_pcre *re = 
(const real_pcre *)external_re; const real_pcre_extra *extra = (const real_pcre_extra *)external_extra; +const uschar *codestart = + (const uschar *)re + sizeof(real_pcre) + re->name_count * re->name_entry_size; BOOL using_temporary_offsets = FALSE; BOOL anchored; BOOL startline; +BOOL first_char_caseless = FALSE; +BOOL req_char_caseless = FALSE; if ((options & ~PUBLIC_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION; @@ -4911,8 +5738,9 @@ if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC; anchored = ((re->options | options) & PCRE_ANCHORED) != 0; startline = (re->options & PCRE_STARTLINE) != 0; -match_block.start_pattern = re->code; +match_block.start_code = codestart; match_block.start_subject = (const uschar *)subject; +match_block.start_offset = start_offset; match_block.end_subject = match_block.start_subject + length; end_subject = match_block.end_subject; @@ -4924,6 +5752,7 @@ match_block.noteol = (options & PCRE_NOTEOL) != 0; match_block.notempty = (options & PCRE_NOTEMPTY) != 0; match_block.errorcode = PCRE_ERROR_NOMATCH; /* Default error */ +match_block.recursive = NULL; /* No recursion */ match_block.lcc = re->tables + lcc_offset; match_block.ctypes = re->tables + ctypes_offset; @@ -4954,6 +5783,7 @@ else match_block.offset_vector = offsets; match_block.offset_end = ocount; match_block.offset_max = (2*ocount)/3; match_block.offset_overflow = FALSE; +match_block.capture_last = -1; /* Compute the minimum number of offsets that we need to reset each time. 
Doing this makes a huge difference to execution time when there aren't many brackets @@ -4983,8 +5813,9 @@ if (!anchored) { if ((re->options & PCRE_FIRSTSET) != 0) { - first_char = re->first_char; - if ((ims & PCRE_CASELESS) != 0) first_char = match_block.lcc[first_char]; + first_char = re->first_char & 255; + if ((first_char_caseless = ((re->first_char & REQ_CASELESS) != 0)) == TRUE) + first_char = match_block.lcc[first_char]; } else if (!startline && extra != NULL && @@ -4993,18 +5824,13 @@ if (!anchored) } /* For anchored or unanchored matches, there may be a "last known required -character" set. If the PCRE_CASELESS is set, implying that the match starts -caselessly, or if there are any changes of this flag within the regex, set up -both cases of the character. Otherwise set the two values the same, which will -avoid duplicate testing (which takes significant time). This covers the vast -majority of cases. It will be suboptimal when the case flag changes in a regex -and the required character in fact is caseful. */ +character" set. */ if ((re->options & PCRE_REQCHSET) != 0) { - req_char = re->req_char; - req_char2 = ((re->options & (PCRE_CASELESS | PCRE_ICHANGED)) != 0)? - (re->tables + fcc_offset)[req_char] : req_char; + req_char = re->req_char & 255; + req_char_caseless = (re->req_char & REQ_CASELESS) != 0; + req_char2 = (re->tables + fcc_offset)[req_char]; /* case flipped */ } /* Loop for handling unanchored repeated matching attempts; for anchored regexs @@ -5024,7 +5850,7 @@ do if (first_char >= 0) { - if ((ims & PCRE_CASELESS) != 0) + if (first_char_caseless) while (start_match < end_subject && match_block.lcc[*start_match] != first_char) start_match++; @@ -5065,12 +5891,9 @@ do for the match to succeed. If the first character is set, req_char must be later in the subject; otherwise the test starts at the match point. This optimization can save a huge amount of backtracking in patterns with nested - unlimited repeats that aren't going to match. 
We don't know what the state of - case matching may be when this character is hit, so test for it in both its - cases if necessary. However, the different cased versions will not be set up - unless PCRE_CASELESS was given or the casing state changes within the regex. - Writing separate code makes it go faster, as does using an autoincrement and - backing off on a match. */ + unlimited repeats that aren't going to match. Writing separate code for + cased/caseless versions makes it go faster, as does using an autoincrement + and backing off on a match. */ if (req_char >= 0) { @@ -5081,19 +5904,7 @@ do if (p > req_char_ptr) { - /* Do a single test if no case difference is set up */ - - if (req_char == req_char2) - { - while (p < end_subject) - { - if (*p++ == req_char) { p--; break; } - } - } - - /* Otherwise test for either case */ - - else + if (req_char_caseless) { while (p < end_subject) { @@ -5101,6 +5912,13 @@ do if (pp == req_char || pp == req_char2) { p--; break; } } } + else + { + while (p < end_subject) + { + if (*p++ == req_char) { p--; break; } + } + } /* If we can't find the required character, break the matching loop */ @@ -5122,7 +5940,7 @@ do if certain parts of the pattern were not used. 
*/ match_block.start_match = start_match; - if (!match(start_match, re->code, 2, &match_block, ims, NULL, match_isgroup)) + if (!match(start_match, codestart, 2, &match_block, ims, NULL, match_isgroup)) continue; /* Copy the offset information from temporary store if necessary */ diff --git a/ext/pcre/pcrelib/pcre.def b/ext/pcre/pcrelib/pcre.def index 0e8cf3f442a..4f6c4bff40d 100644 --- a/ext/pcre/pcrelib/pcre.def +++ b/ext/pcre/pcrelib/pcre.def @@ -8,7 +8,10 @@ pcre_copy_substring pcre_exec pcre_get_substring pcre_get_substring_list +pcre_free_substring +pcre_free_substring_list pcre_info +pcre_fullinfo pcre_maketables pcre_study pcre_version diff --git a/ext/pcre/pcrelib/pcre.h b/ext/pcre/pcrelib/pcre.h index f6fbe73cfb1..2815e56b2cd 100644 --- a/ext/pcre/pcrelib/pcre.h +++ b/ext/pcre/pcrelib/pcre.h @@ -2,7 +2,7 @@ * Perl-Compatible Regular Expressions * *************************************************/ -/* Copyright (c) 1997-2001 University of Cambridge */ +/* Copyright (c) 1997-2002 University of Cambridge */ #ifndef _PCRE_H #define _PCRE_H @@ -11,9 +11,10 @@ make changes to pcre.in. */ #include "php_compat.h" + #define PCRE_MAJOR 3 -#define PCRE_MINOR 9 -#define PCRE_DATE 02-Jan-2002 +#define PCRE_MINOR 92 +#define PCRE_DATE 11-Sep-2002 /* Win32 uses DLL by default */ @@ -72,6 +73,9 @@ extern "C" { #define PCRE_INFO_FIRSTCHAR 4 #define PCRE_INFO_FIRSTTABLE 5 #define PCRE_INFO_LASTLITERAL 6 +#define PCRE_INFO_NAMEENTRYSIZE 7 +#define PCRE_INFO_NAMECOUNT 8 +#define PCRE_INFO_NAMETABLE 9 /* Types */ @@ -81,32 +85,64 @@ struct real_pcre_extra; /* declaration; the definition is private */ typedef struct real_pcre pcre; typedef struct real_pcre_extra pcre_extra; -/* Store get and free functions. These can be set to alternative malloc/free -functions if required. Some magic is required for Win32 DLL; it is null on -other OS. */ +/* The structure for passing out data via the pcre_callout_function. 
We use a +structure so that new fields can be added on the end in future versions, +without changing the API of the function, thereby allowing old clients to work +without modification. */ +typedef struct pcre_callout_block { + int version; /* Identifies version of block */ + /* ------------------------ Version 0 ------------------------------- */ + int callout_number; /* Number compiled into pattern */ + int *offset_vector; /* The offset vector */ + const char *subject; /* The subject being matched */ + int subject_length; /* The length of the subject */ + int start_match; /* Offset to start of this match attempt */ + int current_position; /* Where we currently are */ + int capture_top; /* Max current capture */ + int capture_last; /* Most recently closed capture */ + /* ------------------------------------------------------------------ */ +} pcre_callout_block; + +/* Indirection for store get and free functions. These can be set to +alternative malloc/free functions if required. There is also an optional +callout function that is triggered by the (?) regex item. Some magic is +required for Win32 DLL; it is null on other OS. For Virtual Pascal, these have +to be different again. 
*/ + +#ifndef VPCOMPAT PCRE_DL_IMPORT extern void *(*pcre_malloc)(size_t); PCRE_DL_IMPORT extern void (*pcre_free)(void *); +PCRE_DL_IMPORT extern int (*pcre_callout)(pcre_callout_block *); +#else /* VPCOMPAT */ +extern void *pcre_malloc(size_t); +extern void pcre_free(void *); +extern int pcre_callout(pcre_callout_block *); +#endif /* VPCOMPAT */ + +/* Exported PCRE functions */ + +PCRE_DL_IMPORT extern pcre *pcre_compile(const char *, int, const char **, + int *, const unsigned char *); +PCRE_DL_IMPORT extern int pcre_copy_substring(const char *, int *, int, int, + char *, int); +PCRE_DL_IMPORT extern int pcre_exec(const pcre *, const pcre_extra *, + const char *, int, int, int, int *, int); +PCRE_DL_IMPORT extern void pcre_free_substring(const char *); +PCRE_DL_IMPORT extern void pcre_free_substring_list(const char **); +PCRE_DL_IMPORT extern int pcre_get_substring(const char *, int *, int, int, + const char **); +PCRE_DL_IMPORT extern int pcre_get_substring_list(const char *, int *, int, + const char ***); +PCRE_DL_IMPORT extern int pcre_info(const pcre *, int *, int *); +PCRE_DL_IMPORT extern int pcre_fullinfo(const pcre *, const pcre_extra *, int, + void *); +PCRE_DL_IMPORT extern const unsigned char *pcre_maketables(void); +PCRE_DL_IMPORT extern pcre_extra *pcre_study(const pcre *, int, const char **); +PCRE_DL_IMPORT extern const char *pcre_version(void); #undef PCRE_DL_IMPORT -/* Functions */ - -extern pcre *pcre_compile(const char *, int, const char **, int *, - const unsigned char *); -extern int pcre_copy_substring(const char *, int *, int, int, char *, int); -extern int pcre_exec(const pcre *, const pcre_extra *, const char *, - int, int, int, int *, int); -extern void pcre_free_substring(const char *); -extern void pcre_free_substring_list(const char **); -extern int pcre_get_substring(const char *, int *, int, int, const char **); -extern int pcre_get_substring_list(const char *, int *, int, const char ***); -extern int pcre_info(const pcre *, int *, 
int *); -extern int pcre_fullinfo(const pcre *, const pcre_extra *, int, void *); -extern const unsigned char *pcre_maketables(void); -extern pcre_extra *pcre_study(const pcre *, int, const char **); -extern const char *pcre_version(void); - #ifdef __cplusplus } /* extern "C" */ #endif diff --git a/ext/pcre/pcrelib/pcregrep.c b/ext/pcre/pcrelib/pcregrep.c index b50ed0780bf..653f4ffd48b 100644 --- a/ext/pcre/pcrelib/pcregrep.c +++ b/ext/pcre/pcrelib/pcregrep.c @@ -3,7 +3,8 @@ *************************************************/ /* This is a grep program that uses the PCRE regular expression library to do -its pattern matching. On a Unix system it can recurse into directories. */ +its pattern matching. On a Unix or Win32 system it can recurse into +directories. */ #include #include @@ -18,7 +19,7 @@ its pattern matching. On a Unix system it can recurse into directories. */ typedef int BOOL; -#define VERSION "2.0 01-Aug-2001" +#define VERSION "2.2 10-Sep-2002" #define MAX_PATTERN_COUNT 100 @@ -70,8 +71,8 @@ static option_item optionlist[] = { *************************************************/ /* These functions are defined so that they can be made system specific, -although at present the only ones are for Unix, and for "no directory recursion -support". */ +although at present the only ones are for Unix, Win32, and for "no directory +recursion support". */ /************* Directory scanning in Unix ***********/ @@ -118,13 +119,105 @@ closedir(dir); } -#else +/************* Directory scanning in Win32 ***********/ + +/* I (Philip Hazel) have no means of testing this code. It was contributed by +Lionel Fourquaux. 
*/ + + +#elif HAVE_WIN32API + +#ifndef STRICT +# define STRICT +#endif +#ifndef WIN32_LEAN_AND_MEAN +# define WIN32_LEAN_AND_MEAN +#endif +#include + +typedef struct directory_type +{ +HANDLE handle; +BOOL first; +WIN32_FIND_DATA data; +} directory_type; + +int +isdirectory(char *filename) +{ +DWORD attr = GetFileAttributes(filename); +if (attr == INVALID_FILE_ATTRIBUTES) + return 0; +return ((attr & FILE_ATTRIBUTE_DIRECTORY) != 0) ? '/' : 0; +} + +directory_type * +opendirectory(char *filename) +{ +size_t len; +char *pattern; +directory_type *dir; +DWORD err; +len = strlen(filename); +pattern = (char *) malloc(len + 3); +dir = (directory_type *) malloc(sizeof(*dir)); +if ((pattern == NULL) || (dir == NULL)) + { + fprintf(stderr, "pcregrep: malloc failed\n"); + exit(2); + } +memcpy(pattern, filename, len); +memcpy(&(pattern[len]), "\\*", 3); +dir->handle = FindFirstFile(pattern, &(dir->data)); +if (dir->handle != INVALID_HANDLE_VALUE) + { + free(pattern); + dir->first = TRUE; + return dir; + } +err = GetLastError(); +free(pattern); +free(dir); +errno = (err == ERROR_ACCESS_DENIED) ? EACCES : ENOENT; +return NULL; +} + +char * +readdirectory(directory_type *dir) +{ +for (;;) + { + if (!dir->first) + { + if (!FindNextFile(dir->handle, &(dir->data))) + return NULL; + } + else + { + dir->first = FALSE; + } + if (strcmp(dir->data.cFileName, ".") != 0 && strcmp(dir->data.cFileName, "..") != 0) + return dir->data.cFileName; + } +#ifndef _MSC_VER +return NULL; /* Keep compiler happy; never executed */ +#endif +} + +void +closedirectory(directory_type *dir) +{ +FindClose(dir->handle); +free(dir); +} /************* Directory scanning when we can't do it ***********/ /* The type is void, and apart from isdirectory(), the functions do nothing. */ +#else + typedef void directory_type; int isdirectory(char *filename) { return FALSE; } @@ -262,8 +355,9 @@ if ((sep = isdirectory(filename)) != 0 && recurse) } /* If the file is not a directory, or we are not recursing, scan it. 
If this is -the first and only argument at top level, we don't show the file name. -Otherwise, control is via the show_filenames variable. */ +the first and only argument at top level, we don't show the file name (unless +we are only showing the file name). Otherwise, control is via the +show_filenames variable. */ in = fopen(filename, "r"); if (in == NULL) @@ -272,7 +366,8 @@ if (in == NULL) return 2; } -rc = pcregrep(in, (show_filenames && !only_one_at_top)? filename : NULL); +rc = pcregrep(in, (filenames_only || (show_filenames && !only_one_at_top))? + filename : NULL); fclose(in); return rc; } @@ -287,7 +382,7 @@ return rc; static int usage(int rc) { -fprintf(stderr, "Usage: pcregrep [-Vcfhilnrsvx] [long-options] pattern [file] ...\n"); +fprintf(stderr, "Usage: pcregrep [-Vcfhilnrsvx] [long-options] [pattern] [file1 file2 ...]\n"); fprintf(stderr, "Type `pcregrep --help' for more information.\n"); return rc; } @@ -304,8 +399,9 @@ help(void) { option_item *op; -printf("Usage: pcregrep [OPTION]... PATTERN [FILE] ...\n"); +printf("Usage: pcregrep [OPTION]... [PATTERN] [FILE1 FILE2 ...]\n"); printf("Search for PATTERN in each FILE or standard input.\n"); +printf("PATTERN must be present if -f is not used.\n"); printf("Example: pcregrep -i 'hello.*world' menu.h main.c\n\n"); printf("Options:\n"); @@ -390,6 +486,10 @@ for (i = 1; i < argc; i++) { if (argv[i][0] != '-') break; + /* Missing options */ + + if (argv[i][1] == 0) exit(usage(2)); + /* Long name options */ if (argv[i][1] == '-') @@ -492,7 +592,7 @@ if (pattern_filename != NULL) else { - if (i >= argc) return usage(0); + if (i >= argc) return usage(2); pattern_list[0] = pcre_compile(argv[i++], options, &error, &errptr, NULL); if (pattern_list[0] == NULL) { diff --git a/ext/pcre/pcrelib/pcreposix.c b/ext/pcre/pcrelib/pcreposix.c index ba48d55a001..91e89a1a218 100644 --- a/ext/pcre/pcrelib/pcreposix.c +++ b/ext/pcre/pcrelib/pcreposix.c @@ -12,7 +12,7 @@ functions. 
Written by: Philip Hazel - Copyright (c) 1997-2001 University of Cambridge + Copyright (c) 1997-2002 University of Cambridge ----------------------------------------------------------------------------- Permission is granted to anyone to use this software for any purpose on any @@ -47,7 +47,8 @@ static const char *estring[] = { ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9, ERR10, ERR11, ERR12, ERR13, ERR14, ERR15, ERR16, ERR17, ERR18, ERR19, ERR20, ERR21, ERR22, ERR23, ERR24, ERR25, ERR26, ERR27, ERR29, ERR29, ERR30, - ERR31 }; + ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39, ERR40, + ERR41, ERR42, ERR43 }; static int eint[] = { REG_EESCAPE, /* "\\ at end of pattern" */ @@ -62,9 +63,9 @@ static int eint[] = { REG_BADRPT, /* "operand of unlimited repeat could match the empty string" */ REG_ASSERT, /* "internal error: unexpected repeat" */ REG_BADPAT, /* "unrecognized character after (?" */ - REG_ASSERT, /* "unused error" */ + REG_BADPAT, /* "POSIX named classes are supported only within a class" */ REG_EPAREN, /* "missing )" */ - REG_ESUBREG, /* "back reference to non-existent subpattern" */ + REG_ESUBREG, /* "reference to non-existent subpattern" */ REG_INVARG, /* "erroffset passed as NULL" */ REG_INVARG, /* "unknown option bit(s) set" */ REG_EPAREN, /* "missing ) after comment" */ @@ -78,13 +79,21 @@ static int eint[] = { REG_BADPAT, /* "malformed number after (?(" */ REG_BADPAT, /* "conditional group containe more than two branches" */ REG_BADPAT, /* "assertion expected after (?(" */ - REG_BADPAT, /* "(?p must be followed by )" */ + REG_BADPAT, /* "(?R or (?digits must be followed by )" */ REG_ECTYPE, /* "unknown POSIX class name" */ REG_BADPAT, /* "POSIX collating elements are not supported" */ REG_INVARG, /* "this version of PCRE is not compiled with PCRE_UTF8 support" */ REG_BADPAT, /* "characters with values > 255 are not yet supported in classes" */ REG_BADPAT, /* "character value in \x{...} sequence is too large" */ - REG_BADPAT /* 
"invalid condition (?(0)" */ + REG_BADPAT, /* "invalid condition (?(0)" */ + REG_BADPAT, /* "\\C not allowed in lookbehind assertion" */ + REG_EESCAPE, /* "PCRE does not support \\L, \\l, \\N, \\P, \\p, \\U, \\u, or \\X" */ + REG_BADPAT, /* "number after (?C is > 255" */ + REG_BADPAT, /* "closing ) for (?C expected" */ + REG_BADPAT, /* "recursive call could loop indefinitely" */ + REG_BADPAT, /* "unrecognized character after (?P" */ + REG_BADPAT, /* "syntax error after (?P" */ + REG_BADPAT /* "two named groups have the same name" */ }; /* Table of texts corresponding to POSIX error codes */ @@ -222,7 +231,9 @@ return 0; /* Unfortunately, PCRE requires 3 ints of working space for each captured substring, so we have to get and release working store instead of just using the POSIX structures as was done in earlier releases when PCRE needed only 2 -ints. */ +ints. However, if the number of possible capturing brackets is small, use a +block of store on the stack, to reduce the use of malloc/free. The threshold is +in a macro that can be changed at configure time. 
*/ int regexec(regex_t *preg, const char *string, size_t nmatch, @@ -231,6 +242,8 @@ regexec(regex_t *preg, const char *string, size_t nmatch, int rc; int options = 0; int *ovector = NULL; +int small_ovector[POSIX_MALLOC_THRESHOLD * 3]; +BOOL allocated_ovector = FALSE; if ((eflags & REG_NOTBOL) != 0) options |= PCRE_NOTBOL; if ((eflags & REG_NOTEOL) != 0) options |= PCRE_NOTEOL; @@ -239,8 +252,16 @@ preg->re_erroffset = (size_t)(-1); /* Only has meaning after compile */ if (nmatch > 0) { - ovector = (int *)malloc(sizeof(int) * nmatch * 3); - if (ovector == NULL) return REG_ESPACE; + if (nmatch <= POSIX_MALLOC_THRESHOLD) + { + ovector = &(small_ovector[0]); + } + else + { + ovector = (int *)malloc(sizeof(int) * nmatch * 3); + if (ovector == NULL) return REG_ESPACE; + allocated_ovector = TRUE; + } } rc = pcre_exec(preg->re_pcre, NULL, string, (int)strlen(string), 0, options, @@ -251,19 +272,19 @@ if (rc == 0) rc = nmatch; /* All captured slots were filled in */ if (rc >= 0) { size_t i; - for (i = 0; i < rc; i++) + for (i = 0; i < (size_t)rc; i++) { pmatch[i].rm_so = ovector[i*2]; pmatch[i].rm_eo = ovector[i*2+1]; } - if (ovector != NULL) free(ovector); + if (allocated_ovector) free(ovector); for (; i < nmatch; i++) pmatch[i].rm_so = pmatch[i].rm_eo = -1; return 0; } else { - if (ovector != NULL) free(ovector); + if (allocated_ovector) free(ovector); switch(rc) { case PCRE_ERROR_NOMATCH: return REG_NOMATCH; diff --git a/ext/pcre/pcrelib/pcretest.c b/ext/pcre/pcrelib/pcretest.c index f04443ab308..8977ef7f66e 100644 --- a/ext/pcre/pcrelib/pcretest.c +++ b/ext/pcre/pcrelib/pcretest.c @@ -2,6 +2,10 @@ * PCRE testing program * *************************************************/ +/* This program was hacked up as a tester for PCRE. I really should have +written it more tidily in the first place. Will I ever learn? It has grown and +been extended and consequently is now rather untidy in places. 
*/ + #include #include #include @@ -9,7 +13,8 @@ #include #include -/* Use the internal info for displaying the results of pcre_study(). */ +/* We need the internal info for displaying the results of pcre_study(). Also +for getting the opcodes for showing compiled code. */ #include "internal.h" @@ -29,11 +34,17 @@ Makefile. */ #endif #endif -#define LOOPREPEAT 20000 +#define LOOPREPEAT 50000 static FILE *outfile; static int log_store = 0; +static int callout_count; +static int callout_extra; +static int callout_fail_count; +static int callout_fail_id; +static int first_callout; +static int utf8; static size_t gotten_store; @@ -48,6 +59,49 @@ static int utf8_table3[] = { 0xff, 0x1f, 0x0f, 0x07, 0x03, 0x01}; + +/************************************************* +* Print compiled regex * +*************************************************/ + +/* The code for doing this is held in a separate file that is also included in +pcre.c when it is compiled with the debug switch. It defines a function called +print_internals(), which uses a table of opcode lengths defined by the macro +OP_LENGTHS, whose name must be OP_lengths. */ + +static uschar OP_lengths[] = { OP_LENGTHS }; + +#include "printint.c" + + + +/************************************************* +* Read number from string * +*************************************************/ + +/* We don't use strtoul() because SunOS4 doesn't have it. Rather than mess +around with conditional compilation, just do the job by hand. It is only used +for unpicking the -o argument, so just keep it simple. 
+ +Arguments: + str string to be converted + endptr where to put the end pointer + +Returns: the unsigned long +*/ + +static int +get_value(unsigned char *str, unsigned char **endptr) +{ +int result = 0; +while(*str != 0 && isspace(*str)) str++; +while (isdigit(*str)) result = result * 10 + (int)(*str++ - '0'); +*endptr = str; +return(result); +} + + + /************************************************* * Convert character value to UTF-8 * *************************************************/ @@ -143,271 +197,152 @@ return i+1; +/************************************************* +* Print character string * +*************************************************/ +/* Character string printing function. Must handle UTF-8 strings in utf8 +mode. Yields number of characters printed. If handed a NULL file, just counts +chars without printing. */ - -/* Debugging function to print the internal form of the regex. This is the same -code as contained in pcre.c under the DEBUG macro. */ - -static const char *OP_names[] = { - "End", "\\A", "\\B", "\\b", "\\D", "\\d", - "\\S", "\\s", "\\W", "\\w", "\\Z", "\\z", - "Opt", "^", "$", "Any", "chars", "not", - "*", "*?", "+", "+?", "?", "??", "{", "{", "{", - "*", "*?", "+", "+?", "?", "??", "{", "{", "{", - "*", "*?", "+", "+?", "?", "??", "{", "{", "{", - "*", "*?", "+", "+?", "?", "??", "{", "{", - "class", "Ref", "Recurse", - "Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not", - "AssertB", "AssertB not", "Reverse", "Once", "Cond", "Cref", - "Brazero", "Braminzero", "Branumber", "Bra" -}; - - -static void print_internals(pcre *re) -{ -unsigned char *code = ((real_pcre *)re)->code; - -fprintf(outfile, "------------------------------------------------------------------\n"); - -for(;;) - { - int c; - int charlength; - - fprintf(outfile, "%3d ", (int)(code - ((real_pcre *)re)->code)); - - if (*code >= OP_BRA) - { - if (*code - OP_BRA > EXTRACT_BASIC_MAX) - fprintf(outfile, "%3d Bra extra", (code[1] << 8) + code[2]); - else - 
fprintf(outfile, "%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA); - code += 2; - } - - else switch(*code) - { - case OP_END: - fprintf(outfile, " %s\n", OP_names[*code]); - fprintf(outfile, "------------------------------------------------------------------\n"); - return; - - case OP_OPT: - fprintf(outfile, " %.2x %s", code[1], OP_names[*code]); - code++; - break; - - case OP_CHARS: - charlength = *(++code); - fprintf(outfile, "%3d ", charlength); - while (charlength-- > 0) - if (isprint(c = *(++code))) fprintf(outfile, "%c", c); - else fprintf(outfile, "\\x%02x", c); - break; - - case OP_KETRMAX: - case OP_KETRMIN: - case OP_ALT: - case OP_KET: - case OP_ASSERT: - case OP_ASSERT_NOT: - case OP_ASSERTBACK: - case OP_ASSERTBACK_NOT: - case OP_ONCE: - case OP_COND: - case OP_BRANUMBER: - case OP_REVERSE: - case OP_CREF: - fprintf(outfile, "%3d %s", (code[1] << 8) + code[2], OP_names[*code]); - code += 2; - break; - - case OP_STAR: - case OP_MINSTAR: - case OP_PLUS: - case OP_MINPLUS: - case OP_QUERY: - case OP_MINQUERY: - case OP_TYPESTAR: - case OP_TYPEMINSTAR: - case OP_TYPEPLUS: - case OP_TYPEMINPLUS: - case OP_TYPEQUERY: - case OP_TYPEMINQUERY: - if (*code >= OP_TYPESTAR) - fprintf(outfile, " %s", OP_names[code[1]]); - else if (isprint(c = code[1])) fprintf(outfile, " %c", c); - else fprintf(outfile, " \\x%02x", c); - fprintf(outfile, "%s", OP_names[*code++]); - break; - - case OP_EXACT: - case OP_UPTO: - case OP_MINUPTO: - if (isprint(c = code[3])) fprintf(outfile, " %c{", c); - else fprintf(outfile, " \\x%02x{", c); - if (*code != OP_EXACT) fprintf(outfile, ","); - fprintf(outfile, "%d}", (code[1] << 8) + code[2]); - if (*code == OP_MINUPTO) fprintf(outfile, "?"); - code += 3; - break; - - case OP_TYPEEXACT: - case OP_TYPEUPTO: - case OP_TYPEMINUPTO: - fprintf(outfile, " %s{", OP_names[code[3]]); - if (*code != OP_TYPEEXACT) fprintf(outfile, "0,"); - fprintf(outfile, "%d}", (code[1] << 8) + code[2]); - if (*code == OP_TYPEMINUPTO) fprintf(outfile, "?"); 
- code += 3; - break; - - case OP_NOT: - if (isprint(c = *(++code))) fprintf(outfile, " [^%c]", c); - else fprintf(outfile, " [^\\x%02x]", c); - break; - - case OP_NOTSTAR: - case OP_NOTMINSTAR: - case OP_NOTPLUS: - case OP_NOTMINPLUS: - case OP_NOTQUERY: - case OP_NOTMINQUERY: - if (isprint(c = code[1])) fprintf(outfile, " [^%c]", c); - else fprintf(outfile, " [^\\x%02x]", c); - fprintf(outfile, "%s", OP_names[*code++]); - break; - - case OP_NOTEXACT: - case OP_NOTUPTO: - case OP_NOTMINUPTO: - if (isprint(c = code[3])) fprintf(outfile, " [^%c]{", c); - else fprintf(outfile, " [^\\x%02x]{", c); - if (*code != OP_NOTEXACT) fprintf(outfile, ","); - fprintf(outfile, "%d}", (code[1] << 8) + code[2]); - if (*code == OP_NOTMINUPTO) fprintf(outfile, "?"); - code += 3; - break; - - case OP_REF: - fprintf(outfile, " \\%d", (code[1] << 8) | code[2]); - code += 3; - goto CLASS_REF_REPEAT; - - case OP_CLASS: - { - int i, min, max; - code++; - fprintf(outfile, " ["); - - for (i = 0; i < 256; i++) - { - if ((code[i/8] & (1 << (i&7))) != 0) - { - int j; - for (j = i+1; j < 256; j++) - if ((code[j/8] & (1 << (j&7))) == 0) break; - if (i == '-' || i == ']') fprintf(outfile, "\\"); - if (isprint(i)) fprintf(outfile, "%c", i); else fprintf(outfile, "\\x%02x", i); - if (--j > i) - { - fprintf(outfile, "-"); - if (j == '-' || j == ']') fprintf(outfile, "\\"); - if (isprint(j)) fprintf(outfile, "%c", j); else fprintf(outfile, "\\x%02x", j); - } - i = j; - } - } - fprintf(outfile, "]"); - code += 32; - - CLASS_REF_REPEAT: - - switch(*code) - { - case OP_CRSTAR: - case OP_CRMINSTAR: - case OP_CRPLUS: - case OP_CRMINPLUS: - case OP_CRQUERY: - case OP_CRMINQUERY: - fprintf(outfile, "%s", OP_names[*code]); - break; - - case OP_CRRANGE: - case OP_CRMINRANGE: - min = (code[1] << 8) + code[2]; - max = (code[3] << 8) + code[4]; - if (max == 0) fprintf(outfile, "{%d,}", min); - else fprintf(outfile, "{%d,%d}", min, max); - if (*code == OP_CRMINRANGE) fprintf(outfile, "?"); - code += 4; - break; - 
- default: - code--; - } - } - break; - - /* Anything else is just a one-node item */ - - default: - fprintf(outfile, " %s", OP_names[*code]); - break; - } - - code++; - fprintf(outfile, "\n"); - } -} - - - -/* Character string printing function. A "normal" and a UTF-8 version. */ - -static void pchars(unsigned char *p, int length, int utf8) +static int pchars(unsigned char *p, int length, FILE *f) { int c; +int yield = 0; + while (length-- > 0) { if (utf8) { int rc = utf82ord(p, &c); - if (rc > 0) + + if (rc > 0 && rc <= length + 1) /* Mustn't run over the end */ { length -= rc - 1; p += rc; - if (c < 256 && isprint(c)) fprintf(outfile, "%c", c); - else fprintf(outfile, "\\x{%02x}", c); + if (c < 256 && isprint(c)) + { + if (f != NULL) fprintf(f, "%c", c); + yield++; + } + else + { + int n; + if (f != NULL) fprintf(f, "\\x{%02x}%n", c, &n); + yield += n; + } continue; } } /* Not UTF-8, or malformed UTF-8 */ - if (isprint(c = *(p++))) fprintf(outfile, "%c", c); - else fprintf(outfile, "\\x%02x", c); + if (isprint(c = *(p++))) + { + if (f != NULL) fprintf(f, "%c", c); + yield++; + } + else + { + if (f != NULL) fprintf(f, "\\x%02x", c); + yield += 4; + } } + +return yield; } +/************************************************* +* Callout function * +*************************************************/ + +/* Called from PCRE as a result of the (?C) item. We print out where we are in +the match. Yield OK unless more callouts than the fail count. */ + +static int callout(pcre_callout_block *cb) +{ +FILE *f = (first_callout | callout_extra)? 
outfile : NULL; +int i, pre_start, post_start; + +if (callout_extra) + { + int i; + fprintf(f, "Callout %d: last capture = %d\n", + cb->callout_number, cb->capture_last); + + for (i = 0; i < cb->capture_top * 2; i += 2) + { + if (cb->offset_vector[i] < 0) + fprintf(f, "%2d: \n", i/2); + else + { + fprintf(f, "%2d: ", i/2); + (void)pchars((unsigned char *)cb->subject + cb->offset_vector[i], + cb->offset_vector[i+1] - cb->offset_vector[i], f); + fprintf(f, "\n"); + } + } + } + +/* Re-print the subject in canonical form, the first time or if giving full +details. On subsequent calls in the same match, we use pchars just to find the +printed lengths of the substrings. */ + +if (f != NULL) fprintf(f, "--->"); + +pre_start = pchars((unsigned char *)cb->subject, cb->start_match, f); +post_start = pchars((unsigned char *)(cb->subject + cb->start_match), + cb->current_position - cb->start_match, f); + +(void)pchars((unsigned char *)(cb->subject + cb->current_position), + cb->subject_length - cb->current_position, f); + +if (f != NULL) fprintf(f, "\n"); + +/* Always print appropriate indicators, with callout number if not already +shown */ + +if (callout_extra) fprintf(outfile, " "); + else fprintf(outfile, "%3d ", cb->callout_number); + +for (i = 0; i < pre_start; i++) fprintf(outfile, " "); +fprintf(outfile, "^"); + +if (post_start > 0) + { + for (i = 0; i < post_start - 1; i++) fprintf(outfile, " "); + fprintf(outfile, "^"); + } + +fprintf(outfile, "\n"); + +first_callout = 0; + +return (cb->callout_number != callout_fail_id)? 0 : + (++callout_count >= callout_fail_count)? 1 : 0; +} + + +/************************************************* +* Local malloc function * +*************************************************/ + /* Alternative malloc function, to test functionality and show the size of the compiled re. 
*/ static void *new_malloc(size_t size) { gotten_store = size; -if (log_store) - fprintf(outfile, "Memory allocation (code space): %d\n", - (int)((int)size - offsetof(real_pcre, code[0]))); return malloc(size); } +/************************************************* +* Call pcre_fullinfo() * +*************************************************/ /* Get one piece of information from the pcre_fullinfo() function */ @@ -420,6 +355,9 @@ if ((rc = pcre_fullinfo(re, study, option, ptr)) < 0) +/************************************************* +* Main Program * +*************************************************/ /* Read lines from named file or stdin and write to named file or stdout; lines consist of a regular expression, in delimiters and optionally followed by @@ -453,7 +391,7 @@ outfile = stdout; while (argc > 1 && argv[op][0] == '-') { - char *endptr; + unsigned char *endptr; if (strcmp(argv[op], "-s") == 0 || strcmp(argv[op], "-m") == 0) showstore = 1; @@ -461,7 +399,7 @@ while (argc > 1 && argv[op][0] == '-') else if (strcmp(argv[op], "-i") == 0) showinfo = 1; else if (strcmp(argv[op], "-d") == 0) showinfo = debug = 1; else if (strcmp(argv[op], "-o") == 0 && argc > 2 && - ((size_offsets = (int)strtoul(argv[op+1], &endptr, 10)), *endptr == 0)) + ((size_offsets = get_value(argv[op+1], &endptr)), *endptr == 0)) { op++; argc--; @@ -549,12 +487,14 @@ while (!done) int do_g = 0; int do_showinfo = showinfo; int do_showrest = 0; - int utf8 = 0; int erroroffset, len, delimiter; + utf8 = 0; + if (infile == stdin) printf(" re> "); if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) break; if (infile != stdin) fprintf(outfile, "%s", (char *)buffer); + fflush(outfile); p = buffer; while (isspace(*p)) p++; @@ -705,8 +645,8 @@ while (!done) } time_taken = clock() - start_time; fprintf(outfile, "Compile time %.3f milliseconds\n", - ((double)time_taken * 1000.0) / - ((double)LOOPREPEAT * (double)CLOCKS_PER_SEC)); + (((double)time_taken * 1000.0) / (double)LOOPREPEAT) / + 
(double)CLOCKS_PER_SEC); } re = pcre_compile((char *)p, options, &error, &erroroffset, tables); @@ -740,14 +680,26 @@ while (!done) info-returning functions. The old one has a limited interface and returns only limited data. Check that it agrees with the newer one. */ + if (log_store) + fprintf(outfile, "Memory allocation (code space): %d\n", + (int)(gotten_store - + sizeof(real_pcre) - + ((real_pcre *)re)->name_count * ((real_pcre *)re)->name_entry_size)); + if (do_showinfo) { unsigned long int get_options; int old_first_char, old_options, old_count; int count, backrefmax, first_char, need_char; + int nameentrysize, namecount; + const uschar *nametable; size_t size; - if (do_debug) print_internals(re); + if (do_debug) + { + fprintf(outfile, "------------------------------------------------------------------\n"); + print_internals(re, outfile); + } new_info(re, NULL, PCRE_INFO_OPTIONS, &get_options); new_info(re, NULL, PCRE_INFO_SIZE, &size); @@ -755,6 +707,9 @@ while (!done) new_info(re, NULL, PCRE_INFO_BACKREFMAX, &backrefmax); new_info(re, NULL, PCRE_INFO_FIRSTCHAR, &first_char); new_info(re, NULL, PCRE_INFO_LASTLITERAL, &need_char); + new_info(re, NULL, PCRE_INFO_NAMEENTRYSIZE, &nameentrysize); + new_info(re, NULL, PCRE_INFO_NAMECOUNT, &namecount); + new_info(re, NULL, PCRE_INFO_NAMETABLE, &nametable); old_count = pcre_info(re, &old_options, &old_first_char); if (count < 0) fprintf(outfile, @@ -781,6 +736,19 @@ while (!done) fprintf(outfile, "Capturing subpattern count = %d\n", count); if (backrefmax > 0) fprintf(outfile, "Max back reference = %d\n", backrefmax); + + if (namecount > 0) + { + fprintf(outfile, "Named capturing subpatterns:\n"); + while (namecount-- > 0) + { + fprintf(outfile, " %s %*s%3d\n", nametable + 2, + nameentrysize - 3 - (int)strlen((char *)nametable + 2), "", + GET2(nametable, 0)); + nametable += nameentrysize; + } + } + if (get_options == 0) fprintf(outfile, "No options\n"); else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s%s\n", 
((get_options & PCRE_ANCHORED) != 0)? " anchored" : "", @@ -806,10 +774,13 @@ while (!done) } else { - if (isprint(first_char)) - fprintf(outfile, "First char = \'%c\'\n", first_char); + int ch = first_char & 255; + char *caseless = ((first_char & REQ_CASELESS) == 0)? + "" : " (caseless)"; + if (isprint(ch)) + fprintf(outfile, "First char = \'%c\'%s\n", ch, caseless); else - fprintf(outfile, "First char = %d\n", first_char); + fprintf(outfile, "First char = %d%s\n", ch, caseless); } if (need_char < 0) @@ -818,10 +789,13 @@ while (!done) } else { + int ch = need_char & 255; + char *caseless = ((need_char & REQ_CASELESS) == 0)? + "" : " (caseless)"; if (isprint(need_char)) - fprintf(outfile, "Need char = \'%c\'\n", need_char); + fprintf(outfile, "Need char = \'%c\'%s\n", ch, caseless); else - fprintf(outfile, "Need char = %d\n", need_char); + fprintf(outfile, "Need char = %d%s\n", ch, caseless); } } @@ -840,8 +814,8 @@ while (!done) time_taken = clock() - start_time; if (extra != NULL) free(extra); fprintf(outfile, " Study time %.3f milliseconds\n", - ((double)time_taken * 1000.0)/ - ((double)LOOPREPEAT * (double)CLOCKS_PER_SEC)); + (((double)time_taken * 1000.0) / (double)LOOPREPEAT) / + (double)CLOCKS_PER_SEC); } extra = pcre_study(re, study_options, &error); @@ -906,6 +880,13 @@ while (!done) options = 0; + pcre_callout = callout; + first_callout = 1; + callout_extra = 0; + callout_count = 0; + callout_fail_count = 999999; + callout_fail_id = -1; + if (infile == stdin) printf("data> "); if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) { @@ -927,6 +908,7 @@ while (!done) { int i = 0; int n = 0; + if (c == '\\') switch ((c = *p++)) { case 'a': c = 7; break; @@ -991,8 +973,35 @@ while (!done) continue; case 'C': - while(isdigit(*p)) n = n * 10 + *p++ - '0'; - copystrings |= 1 << n; + if (isdigit(*p)) /* Set copy string */ + { + while(isdigit(*p)) n = n * 10 + *p++ - '0'; + copystrings |= 1 << n; + } + else if (*p == '+') + { + callout_extra = 1; + p++; + } 
+ else if (*p == '-') + { + pcre_callout = NULL; + p++; + } + else if (*p == '!') + { + callout_fail_id = 0; + p++; + while(isdigit(*p)) + callout_fail_id = callout_fail_id * 10 + *p++ - '0'; + callout_fail_count = 0; + if (*p == '!') + { + p++; + while(isdigit(*p)) + callout_fail_count = callout_fail_count * 10 + *p++ - '0'; + } + } continue; case 'G': @@ -1023,7 +1032,7 @@ while (!done) } } use_size_offsets = n; - if (n == 0) use_offsets = NULL; + if (n == 0) use_offsets = NULL; /* Ensures it can't write to it */ continue; case 'Z': @@ -1057,18 +1066,19 @@ while (!done) else { size_t i; - for (i = 0; i < use_size_offsets; i++) + for (i = 0; i < (size_t)use_size_offsets; i++) { if (pmatch[i].rm_so >= 0) { fprintf(outfile, "%2d: ", (int)i); - pchars(dbuffer + pmatch[i].rm_so, - pmatch[i].rm_eo - pmatch[i].rm_so, utf8); + (void)pchars(dbuffer + pmatch[i].rm_so, + pmatch[i].rm_eo - pmatch[i].rm_so, outfile); fprintf(outfile, "\n"); if (i == 0 && do_showrest) { fprintf(outfile, " 0+ "); - pchars(dbuffer + pmatch[i].rm_eo, len - pmatch[i].rm_eo, utf8); + (void)pchars(dbuffer + pmatch[i].rm_eo, len - pmatch[i].rm_eo, + outfile); fprintf(outfile, "\n"); } } @@ -1094,8 +1104,8 @@ while (!done) start_offset, options | g_notempty, use_offsets, use_size_offsets); time_taken = clock() - start_time; fprintf(outfile, "Execute time %.3f milliseconds\n", - ((double)time_taken * 1000.0)/ - ((double)LOOPREPEAT * (double)CLOCKS_PER_SEC)); + (((double)time_taken * 1000.0) / (double)LOOPREPEAT) / + (double)CLOCKS_PER_SEC); } count = pcre_exec(re, extra, (char *)bptr, len, @@ -1119,14 +1129,16 @@ while (!done) else { fprintf(outfile, "%2d: ", i/2); - pchars(bptr + use_offsets[i], use_offsets[i+1] - use_offsets[i], utf8); + (void)pchars(bptr + use_offsets[i], + use_offsets[i+1] - use_offsets[i], outfile); fprintf(outfile, "\n"); if (i == 0) { if (do_showrest) { fprintf(outfile, " 0+ "); - pchars(bptr + use_offsets[i+1], len - use_offsets[i+1], utf8); + (void)pchars(bptr + 
use_offsets[i+1], len - use_offsets[i+1], + outfile); fprintf(outfile, "\n"); } } diff --git a/ext/pcre/pcrelib/perltest b/ext/pcre/pcrelib/perltest deleted file mode 100755 index e6f797498c5..00000000000 --- a/ext/pcre/pcrelib/perltest +++ /dev/null @@ -1,169 +0,0 @@ -#! /usr/bin/perl - -# Program for testing regular expressions with perl to check that PCRE handles -# them the same. - - -# Function for turning a string into a string of printing chars - -sub pchars { -my($t) = ""; - -foreach $c (split(//, $_[0])) - { - if (ord $c >= 32 && ord $c < 127) { $t .= $c; } - else { $t .= sprintf("\\x%02x", ord $c); } - } -$t; -} - - - -# Read lines from named file or stdin and write to named file or stdout; lines -# consist of a regular expression, in delimiters and optionally followed by -# options, followed by a set of test data, terminated by an empty line. - -# Sort out the input and output files - -if (@ARGV > 0) - { - open(INFILE, "<$ARGV[0]") || die "Failed to open $ARGV[0]\n"; - $infile = "INFILE"; - } -else { $infile = "STDIN"; } - -if (@ARGV > 1) - { - open(OUTFILE, ">$ARGV[1]") || die "Failed to open $ARGV[1]\n"; - $outfile = "OUTFILE"; - } -else { $outfile = "STDOUT"; } - -printf($outfile "Perl $] Regular Expressions\n\n"); - -# Main loop - -NEXT_RE: -for (;;) - { - printf " re> " if $infile eq "STDIN"; - last if ! ($_ = <$infile>); - printf $outfile "$_" if $infile ne "STDIN"; - next if ($_ eq ""); - - $pattern = $_; - - while ($pattern !~ /^\s*(.).*\1/s) - { - printf " > " if $infile eq "STDIN"; - last if ! ($_ = <$infile>); - printf $outfile "$_" if $infile ne "STDIN"; - $pattern .= $_; - } - - chomp($pattern); - $pattern =~ s/\s+$//; - - # The private /+ modifier means "print $' afterwards". We use it - # only on the end of patterns to make it easy to chop off here. 
- - $showrest = ($pattern =~ s/\+(?=[a-z]*$)//); - - # Check that the pattern is valid - - eval "\$_ =~ ${pattern}"; - if ($@) - { - printf $outfile "Error: $@"; - next NEXT_RE; - } - - # If the /g modifier is present, we want to put a loop round the matching; - # otherwise just a single "if". - - $cmd = ($pattern =~ /g[a-z]*$/)? "while" : "if"; - - # If the pattern is actually the null string, Perl uses the most recently - # executed (and successfully compiled) regex is used instead. This is a - # nasty trap for the unwary! The PCRE test suite does contain null strings - # in places - if they are allowed through here all sorts of weird and - # unexpected effects happen. To avoid this, we replace such patterns with - # a non-null pattern that has the same effect. - - $pattern = "/(?#)/$2" if ($pattern =~ /^(.)\1(.*)$/); - - # Read data lines and test them - - for (;;) - { - printf "data> " if $infile eq "STDIN"; - last NEXT_RE if ! ($_ = <$infile>); - chomp; - printf $outfile "$_\n" if $infile ne "STDIN"; - - s/\s+$//; - s/^\s+//; - - last if ($_ eq ""); - - $x = eval "\"$_\""; # To get escapes processed - - # Empty array for holding results, then do the matching. - - @subs = (); - - eval "${cmd} (\$x =~ ${pattern}) {" . - "push \@subs,\$&;" . - "push \@subs,\$1;" . - "push \@subs,\$2;" . - "push \@subs,\$3;" . - "push \@subs,\$4;" . - "push \@subs,\$5;" . - "push \@subs,\$6;" . - "push \@subs,\$7;" . - "push \@subs,\$8;" . - "push \@subs,\$9;" . - "push \@subs,\$10;" . - "push \@subs,\$11;" . - "push \@subs,\$12;" . - "push \@subs,\$13;" . - "push \@subs,\$14;" . - "push \@subs,\$15;" . - "push \@subs,\$16;" . 
- "push \@subs,\$'; }"; - - if ($@) - { - printf $outfile "Error: $@\n"; - next NEXT_RE; - } - elsif (scalar(@subs) == 0) - { - printf $outfile "No match\n"; - } - else - { - while (scalar(@subs) != 0) - { - printf $outfile (" 0: %s\n", &pchars($subs[0])); - printf $outfile (" 0+ %s\n", &pchars($subs[17])) if $showrest; - $last_printed = 0; - for ($i = 1; $i <= 16; $i++) - { - if (defined $subs[$i]) - { - while ($last_printed++ < $i-1) - { printf $outfile ("%2d: \n", $last_printed); } - printf $outfile ("%2d: %s\n", $i, &pchars($subs[$i])); - $last_printed = $i; - } - } - splice(@subs, 0, 18); - } - } - } - } - -printf $outfile "\n"; - -# End diff --git a/ext/pcre/pcrelib/perltest8 b/ext/pcre/pcrelib/perltest8 deleted file mode 100755 index 2fe522d60d3..00000000000 --- a/ext/pcre/pcrelib/perltest8 +++ /dev/null @@ -1,208 +0,0 @@ -#! /usr/bin/perl - -# Program for testing regular expressions with perl to check that PCRE handles -# them the same. This is the version that supports /8 for UTF-8 testing. It -# requires at least Perl 5.6. - - -# Function for turning a string into a string of printing chars. There are -# currently problems with UTF-8 strings; this fudges round them. - -sub pchars { -my($t) = ""; - -if ($utf8) - { - use utf8; - @p = unpack('U*', $_[0]); - foreach $c (@p) - { - if ($c >= 32 && $c < 127) { $t .= chr $c; } - else { $t .= sprintf("\\x{%02x}", $c); } - } - } - -else - { - foreach $c (split(//, $_[0])) - { - if (ord $c >= 32 && ord $c < 127) { $t .= $c; } - else { $t .= sprintf("\\x%02x", ord $c); } - } - } - -$t; -} - - - -# Read lines from named file or stdin and write to named file or stdout; lines -# consist of a regular expression, in delimiters and optionally followed by -# options, followed by a set of test data, terminated by an empty line. 
- -# Sort out the input and output files - -if (@ARGV > 0) - { - open(INFILE, "<$ARGV[0]") || die "Failed to open $ARGV[0]\n"; - $infile = "INFILE"; - } -else { $infile = "STDIN"; } - -if (@ARGV > 1) - { - open(OUTFILE, ">$ARGV[1]") || die "Failed to open $ARGV[1]\n"; - $outfile = "OUTFILE"; - } -else { $outfile = "STDOUT"; } - -printf($outfile "Perl $] Regular Expressions\n\n"); - -# Main loop - -NEXT_RE: -for (;;) - { - printf " re> " if $infile eq "STDIN"; - last if ! ($_ = <$infile>); - printf $outfile "$_" if $infile ne "STDIN"; - next if ($_ eq ""); - - $pattern = $_; - - while ($pattern !~ /^\s*(.).*\1/s) - { - printf " > " if $infile eq "STDIN"; - last if ! ($_ = <$infile>); - printf $outfile "$_" if $infile ne "STDIN"; - $pattern .= $_; - } - - chomp($pattern); - $pattern =~ s/\s+$//; - - # The private /+ modifier means "print $' afterwards". - - $showrest = ($pattern =~ s/\+(?=[a-z]*$)//); - - # The private /8 modifier means "operate in UTF-8". Currently, Perl - # has bugs that we try to work around using this flag. - - $utf8 = ($pattern =~ s/8(?=[a-z]*$)//); - - # Check that the pattern is valid - - if ($utf8) - { - use utf8; - eval "\$_ =~ ${pattern}"; - } - else - { - eval "\$_ =~ ${pattern}"; - } - - if ($@) - { - printf $outfile "Error: $@"; - next NEXT_RE; - } - - # If the /g modifier is present, we want to put a loop round the matching; - # otherwise just a single "if". - - $cmd = ($pattern =~ /g[a-z]*$/)? "while" : "if"; - - # If the pattern is actually the null string, Perl uses the most recently - # executed (and successfully compiled) regex is used instead. This is a - # nasty trap for the unwary! The PCRE test suite does contain null strings - # in places - if they are allowed through here all sorts of weird and - # unexpected effects happen. To avoid this, we replace such patterns with - # a non-null pattern that has the same effect. 
- - $pattern = "/(?#)/$2" if ($pattern =~ /^(.)\1(.*)$/); - - # Read data lines and test them - - for (;;) - { - printf "data> " if $infile eq "STDIN"; - last NEXT_RE if ! ($_ = <$infile>); - chomp; - printf $outfile "$_\n" if $infile ne "STDIN"; - - s/\s+$//; - s/^\s+//; - - last if ($_ eq ""); - - $x = eval "\"$_\""; # To get escapes processed - - # Empty array for holding results, then do the matching. - - @subs = (); - - $pushes = "push \@subs,\$&;" . - "push \@subs,\$1;" . - "push \@subs,\$2;" . - "push \@subs,\$3;" . - "push \@subs,\$4;" . - "push \@subs,\$5;" . - "push \@subs,\$6;" . - "push \@subs,\$7;" . - "push \@subs,\$8;" . - "push \@subs,\$9;" . - "push \@subs,\$10;" . - "push \@subs,\$11;" . - "push \@subs,\$12;" . - "push \@subs,\$13;" . - "push \@subs,\$14;" . - "push \@subs,\$15;" . - "push \@subs,\$16;" . - "push \@subs,\$'; }"; - - if ($utf8) - { - use utf8; - eval "${cmd} (\$x =~ ${pattern}) {" . $pushes; - } - else - { - eval "${cmd} (\$x =~ ${pattern}) {" . $pushes; - } - - if ($@) - { - printf $outfile "Error: $@\n"; - next NEXT_RE; - } - elsif (scalar(@subs) == 0) - { - printf $outfile "No match\n"; - } - else - { - while (scalar(@subs) != 0) - { - printf $outfile (" 0: %s\n", &pchars($subs[0])); - printf $outfile (" 0+ %s\n", &pchars($subs[17])) if $showrest; - $last_printed = 0; - for ($i = 1; $i <= 16; $i++) - { - if (defined $subs[$i]) - { - while ($last_printed++ < $i-1) - { printf $outfile ("%2d: \n", $last_printed); } - printf $outfile ("%2d: %s\n", $i, &pchars($subs[$i])); - $last_printed = $i; - } - } - splice(@subs, 0, 18); - } - } - } - } - -printf $outfile "\n"; - -# End diff --git a/ext/pcre/pcrelib/study.c b/ext/pcre/pcrelib/study.c index f924543d213..948560f61e9 100644 --- a/ext/pcre/pcrelib/study.c +++ b/ext/pcre/pcrelib/study.c @@ -9,7 +9,7 @@ the file Tech.Notes for some information on the internals. 
Written by: Philip Hazel - Copyright (c) 1997-2001 University of Cambridge + Copyright (c) 1997-2002 University of Cambridge ----------------------------------------------------------------------------- Permission is granted to anyone to use this software for any purpose on any @@ -99,7 +99,7 @@ volatile int dummy; do { - const uschar *tcode = code + 3; + const uschar *tcode = code + 1 + LINK_SIZE; BOOL try_next = TRUE; while (try_next) @@ -119,6 +119,12 @@ do default: return FALSE; + /* Skip over callout */ + + case OP_CALLOUT: + tcode += 2; + break; + /* Skip over extended extraction bracket number */ case OP_BRANUMBER: @@ -130,8 +136,8 @@ do case OP_ASSERT_NOT: case OP_ASSERTBACK: case OP_ASSERTBACK_NOT: - do tcode += (tcode[1] << 8) + tcode[2]; while (*tcode == OP_ALT); - tcode += 3; + do tcode += GET(tcode, 1); while (*tcode == OP_ALT); + tcode += 1+LINK_SIZE; break; /* Skip over an option setting, changing the caseless flag */ @@ -148,8 +154,8 @@ do if (!set_start_bits(++tcode, start_bits, caseless, cd)) return FALSE; dummy = 1; - do tcode += (tcode[1] << 8) + tcode[2]; while (*tcode == OP_ALT); - tcode += 3; + do tcode += GET(tcode,1); while (*tcode == OP_ALT); + tcode += 1+LINK_SIZE; break; /* Single-char * or ? 
sets the bit and tries the next item */ @@ -314,7 +320,7 @@ do } /* End of switch */ } /* End of try_next loop */ - code += (code[1] << 8) + code[2]; /* Advance to next branch */ + code += GET(code, 1); /* Advance to next branch */ } while (*code == OP_ALT); return TRUE; @@ -346,6 +352,8 @@ pcre_study(const pcre *external_re, int options, const char **errorptr) uschar start_bits[32]; real_pcre_extra *extra; const real_pcre *re = (const real_pcre *)external_re; +uschar *code = (uschar *)re + sizeof(real_pcre) + + (re->name_count * re->name_entry_size); compile_data compile_block; *errorptr = NULL; @@ -362,9 +370,9 @@ if ((options & ~PUBLIC_STUDY_OPTIONS) != 0) return NULL; } -/* For an anchored pattern, or an unchored pattern that has a first char, or a -multiline pattern that matches only at "line starts", no further processing at -present. */ +/* For an anchored pattern, or an unanchored pattern that has a first char, or +a multiline pattern that matches only at "line starts", no further processing +at present. */ if ((re->options & (PCRE_ANCHORED|PCRE_FIRSTSET|PCRE_STARTLINE)) != 0) return NULL; @@ -379,7 +387,7 @@ compile_block.ctypes = re->tables + ctypes_offset; /* See if we can find a fixed set of initial characters for the pattern. */ memset(start_bits, 0, 32 * sizeof(uschar)); -if (!set_start_bits(re->code, start_bits, (re->options & PCRE_CASELESS) != 0, +if (!set_start_bits(code, start_bits, (re->options & PCRE_CASELESS) != 0, &compile_block)) return NULL; /* Get an "extra" block and put the information therein. 
*/ diff --git a/ext/pcre/pcrelib/testdata/testinput1 b/ext/pcre/pcrelib/testdata/testinput1 index 441f99e3203..c64257685be 100644 --- a/ext/pcre/pcrelib/testdata/testinput1 +++ b/ext/pcre/pcrelib/testdata/testinput1 @@ -1518,7 +1518,7 @@ /(abc)[\1]de/ abc\1de -/a.b(?s)/ +/(?s)a.b/ a\nb /^([^a])([^\b])([^c]*)([^d]{3,4})/ @@ -1948,4 +1948,1861 @@ /(AB)*\1/ ABABAB +/(?.*/)foo" + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/it/you/see/ + +"(?>.*/)foo" + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + +/(?>(\.\d\d[1-9]?))\d+/ + 1.230003938 + 1.875000282 + *** Failers + 1.235 + +/^((?>\w+)|(?>\s+))*$/ + now is the time for all good men to come to the aid of the party + *** Failers + this is not a line with only words and spaces! + +/(\d+)(\w)/ + 12345a + 12345+ + +/((?>\d+))(\w)/ + 12345a + *** Failers + 12345+ + +/(?>a+)b/ + aaab + +/((?>a+)b)/ + aaab + +/(?>(a+))b/ + aaab + +/(?>b)+/ + aaabbbccc + +/(?>a+|b+|c+)*c/ + aaabbbbccccd + +/((?>[^()]+)|\([^()]*\))+/ + ((abc(ade)ufh()()x + +/\(((?>[^()]+)|\([^()]+\))+\)/ + (abc) + (abc(def)xyz) + *** Failers + ((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa + +/a(?-i)b/i + ab + Ab + *** Failers + aB + AB + +/(a (?x)b c)d e/ + a bcd e + *** Failers + a b cd e + abcd e + a bcde + +/(a b(?x)c d (?-x)e f)/ + a bcde f + *** Failers + abcdef + +/(a(?i)b)c/ + abc + aBc + *** Failers + abC + aBC + Abc + ABc + ABC + AbC + +/a(?i:b)c/ + abc + aBc + *** Failers + ABC + abC + aBC + +/a(?i:b)*c/ + aBc + aBBc + *** Failers + aBC + aBBC + +/a(?=b(?i)c)\w\wd/ + abcd + abCd + *** Failers + aBCd + abcD + +/(?s-i:more.*than).*million/i + more than million + more than MILLION + more \n than Million + *** Failers + MORE THAN MILLION + more \n than \n million + +/(?:(?s-i)more.*than).*million/i + more than million + more than MILLION + more \n than Million + *** Failers + MORE THAN MILLION + more \n than \n million + +/(?>a(?i)b+)+c/ + abc + aBbc + aBBc + *** Failers + Abc + abAb + abbC + +/(?=a(?i)b)\w\wc/ + 
abc + aBc + *** Failers + Ab + abC + aBC + +/(?<=a(?i)b)(\w\w)c/ + abxxc + aBxxc + *** Failers + Abxxc + ABxxc + abxxC + +/(?:(a)|b)(?(1)A|B)/ + aA + bB + *** Failers + aB + bA + +/^(a)?(?(1)a|b)+$/ + aa + b + bb + *** Failers + ab + +/^(?(?=abc)\w{3}:|\d\d)$/ + abc: + 12 + *** Failers + 123 + xyz + +/^(?(?!abc)\d\d|\w{3}:)$/ + abc: + 12 + *** Failers + 123 + xyz + +/(?(?<=foo)bar|cat)/ + foobar + cat + fcat + focat + *** Failers + foocat + +/(?(?a*)*/ + a + aa + aaaa + +/(abc|)+/ + abc + abcabc + abcabcabc + xyz + +/([a]*)*/ + a + aaaaa + +/([ab]*)*/ + a + b + ababab + aaaabcde + bbbb + +/([^a]*)*/ + b + bbbb + aaa + +/([^ab]*)*/ + cccc + abab + +/([a]*?)*/ + a + aaaa + +/([ab]*?)*/ + a + b + abab + baba + +/([^a]*?)*/ + b + bbbb + aaa + +/([^ab]*?)*/ + c + cccc + baba + +/(?>a*)*/ + a + aaabcde + +/((?>a*))*/ + aaaaa + aabbaa + +/((?>a*?))*/ + aaaaa + aabbaa + +/(?(?=[^a-z]+[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) /x + 12-sep-98 + 12-09-98 + *** Failers + sep-12-98 + +/(?<=(foo))bar\1/ + foobarfoo + foobarfootling + *** Failers + foobar + barfoo + +/(?i:saturday|sunday)/ + saturday + sunday + Saturday + Sunday + SATURDAY + SUNDAY + SunDay + +/(a(?i)bc|BB)x/ + abcx + aBCx + bbx + BBx + *** Failers + abcX + aBCX + bbX + BBX + +/^([ab](?i)[cd]|[ef])/ + ac + aC + bD + elephant + Europe + frog + France + *** Failers + Africa + +/^(ab|a(?i)[b-c](?m-i)d|x(?i)y|z)/ + ab + aBd + xy + xY + zebra + Zambesi + *** Failers + aCD + XY + +/(?<=foo\n)^bar/m + foo\nbar + *** Failers + bar + baz\nbar + +/(?<=(?]&/ + <&OUT + +/^(a\1?){4}$/ + aaaaaaaaaa + *** Failers + AB + aaaaaaaaa + aaaaaaaaaaa + +/^(a(?(1)\1)){4}$/ + aaaaaaaaaa + *** Failers + aaaaaaaaa + aaaaaaaaaaa + +/(?:(f)(o)(o)|(b)(a)(r))*/ + foobar + +/(?<=a)b/ + ab + *** Failers + cb + b + +/(?a+)ab/ + +/(?>a+)b/ + aaab + +/([[:]+)/ + a:[b]: + +/([[=]+)/ + a=[b]= + +/([[.]+)/ + a.[b]. 
+ +/((?>a+)b)/ + aaab + +/(?>(a+))b/ + aaab + +/((?>[^()]+)|\([^()]*\))+/ + ((abc(ade)ufh()()x + +/a\Z/ + *** Failers + aaab + a\nb\n + +/b\Z/ + a\nb\n + +/b\z/ + +/b\Z/ + a\nb + +/b\z/ + a\nb + *** Failers + +/^(?>(?(1)\.|())[^\W_](?>[a-z0-9-]*[^\W_])?)+$/ + a + abc + a-b + 0-9 + a.b + 5.6.7 + the.quick.brown.fox + a100.b200.300c + 12-ab.1245 + ***Failers + \ + .a + -a + a- + a. + a_b + a.- + a.. + ab..bc + the.quick.brown.fox- + the.quick.brown.fox. + the.quick.brown.fox_ + the.quick.brown.fox+ + +/(?>.*)(?<=(abcd|wxyz))/ + alphabetabcd + endingwxyz + *** Failers + a rather long string that doesn't end with one of them + +/word (?>(?:(?!otherword)[a-zA-Z0-9]+ ){0,30})otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark otherword + word cat dog elephant mussel cow horse canary baboon snake shark + +/word (?>[a-zA-Z0-9]+ ){0,30}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark the quick brown fox and the lazy dog and several other words getting close to thirty by now I hope + +/(?<=\d{3}(?!999))foo/ + 999foo + 123999foo + *** Failers + 123abcfoo + +/(?<=(?!...999)\d{3})foo/ + 999foo + 123999foo + *** Failers + 123abcfoo + +/(?<=\d{3}(?!999)...)foo/ + 123abcfoo + 123456foo + *** Failers + 123999foo + +/(?<=\d{3}...)(?\s*)=(?>\s*) # find Z)+|A)*/ + ZABCDEFG + +/((?>)+|A)*/ + ZABCDEFG + +/a*/g + abbab + +/^[a-\d]/ + abcde + -things + 0digit + *** Failers + bcdef + +/^[\d-a]/ + abcde + -things + 0digit + *** Failers + bcdef + +/[[:space:]]+/ + > \x09\x0a\x0c\x0d\x0b< + +/[[:blank:]]+/ + > \x09\x0a\x0c\x0d\x0b< + +/[\s]+/ + > \x09\x0a\x0c\x0d\x0b< + +/\s+/ + > \x09\x0a\x0c\x0d\x0b< + +/ab/x + ab + +/(?!\A)x/m + a\nxb\n + +/(?!^)x/m + a\nxb\n + +/abc\Qabc\Eabc/ + abcabcabc + +/abc\Q(*+|\Eabc/ + abc(*+|abc + +/ abc\Q abc\Eabc/x + abc abcabc + *** Failers + abcabcabc + +/abc#comment + \Q#not comment + literal\E/x + abc#not comment\n literal + +/abc#comment + \Q#not comment + literal/x + abc#not comment\n literal + 
+/abc#comment + \Q#not comment + literal\E #more comment + /x + abc#not comment\n literal + +/abc#comment + \Q#not comment + literal\E #more comment/x + abc#not comment\n literal + +/\Qabc\$xyz\E/ + abc\\\$xyz + +/\Qabc\E\$\Qxyz\E/ + abc\$xyz + +/\Gabc/ + abc + *** Failers + xyzabc + +/\Gabc./g + abc1abc2xyzabc3 + +/abc./g + abc1abc2xyzabc3 + +/a(?x: b c )d/ + XabcdY + *** Failers + Xa b c d Y + +/((?x)x y z | a b c)/ + XabcY + AxyzB + +/(?i)AB(?-i)C/ + XabCY + *** Failers + XabcY + +/((?i)AB(?-i)C|D)E/ + abCE + DE + *** Failers + abcE + abCe + dE + De + +/(.*)\d+\1/ + abc123abc + abc123bc + +/(.*)\d+\1/s + abc123abc + abc123bc + +/((.*))\d+\1/ + abc123abc + abc123bc + +/-- This tests for an IPv6 address in the form where it can have up to --/ +/-- eight components, one and only one of which is empty. This must be --/ +/-- an internal component. --/ + +/^(?!:) # colon disallowed at start + (?: # start of item + (?: [0-9a-f]{1,4} | # 1-4 hex digits or + (?(1)0 | () ) ) # if null previously matched, fail; else null + : # followed by colon + ){1,7} # end item; 1-7 of them required + [0-9a-f]{1,4} $ # final hex number at end of string + (?(1)|.) 
# check that there was an empty component + /xi + a123::a123 + a123:b342::abcd + a123:b342::324e:abcd + a123:ddde:b342::324e:abcd + a123:ddde:b342::324e:dcba:abcd + a123:ddde:9999:b342::324e:dcba:abcd + *** Failers + 1:2:3:4:5:6:7:8 + a123:bce:ddde:9999:b342::324e:dcba:abcd + a123::9999:b342::324e:dcba:abcd + abcde:2:3:4:5:6:7:8 + ::1 + abcd:fee0:123:: + :1 + 1: + / End of testinput1 / diff --git a/ext/pcre/pcrelib/testdata/testinput2 b/ext/pcre/pcrelib/testdata/testinput2 index f41478e1046..2dd498a713b 100644 --- a/ext/pcre/pcrelib/testdata/testinput2 +++ b/ext/pcre/pcrelib/testdata/testinput2 @@ -173,7 +173,7 @@ /<.*>/U abc ghi nop -/<.*>(?U)/ +/(?U)<.*>/ abc ghi nop /<.*?>/U @@ -658,6 +658,8 @@ /^[[:ascii:]]/D +/^[[:blank:]]/D + /^[[:cntrl:]]/D /^[[:digit:]]/D @@ -682,6 +684,8 @@ /^[12[:^digit:]]/D +/^[[:^blank:]]/D + /[01[:alpha:]%]/D /[[.ch.]]/ @@ -720,4 +724,439 @@ mainmain mainOmain +/These are all cases where Perl does it differently (nested captures)/ + +/^(a(b)?)+$/ + aba + +/^(aa(bb)?)+$/ + aabbaa + +/^(aa|aa(bb))+$/ + aabbaa + +/^(aa(bb)??)+$/ + aabbaa + +/^(?:aa(bb)?)+$/ + aabbaa + +/^(aa(b(b))?)+$/ + aabbaa + +/^(?:aa(b(b))?)+$/ + aabbaa + +/^(?:aa(b(?:b))?)+$/ + aabbaa + +/^(?:aa(bb(?:b))?)+$/ + aabbbaa + +/^(?:aa(b(?:bb))?)+$/ + aabbbaa + +/^(?:aa(?:b(b))?)+$/ + aabbaa + +/^(?:aa(?:b(bb))?)+$/ + aabbbaa + +/^(aa(b(bb))?)+$/ + aabbbaa + +/^(aa(bb(bb))?)+$/ + aabbbbaa + +/--------------------------------------------------------------------/ + +/#/xMD + +/a#/xMD + +/[\s]/D + +/[\S]/D + +/a(?i)b/D + ab + aB + *** Failers + AB + +/(a(?i)b)/D + ab + aB + *** Failers + AB + +/ (?i)abc/xD + +/#this is a comment + (?i)abc/xD + +/123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890/D + 
+/\Q123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890/D + +/\Q\E/D + \ + +/\Q\Ex/D + +/ \Q\E/D + +/a\Q\E/D + abc + bca + bac + +/a\Q\Eb/D + abc + +/\Q\Eabc/D + +/x*+\w/D + ****Failers + xxxxx + +/x?+/D + +/x++/D + +/x{1,3}+/D + +/(x)*+/D + +/^(\w++|\s++)*$/ + now is the time for all good men to come to the aid of the party + *** Failers + this is not a line with only words and spaces! + +/(\d++)(\w)/ + 12345a + *** Failers + 12345+ + +/a++b/ + aaab + +/(a++b)/ + aaab + +/(a++)b/ + aaab + +/([^()]++|\([^()]*\))+/ + ((abc(ade)ufh()()x + +/\(([^()]++|\([^()]+\))+\)/ + (abc) + (abc(def)xyz) + *** Failers + ((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa + +/(abc){1,3}+/D + +/a+?+/ + +/a{2,3}?+b/ + +/(?U)a+?+/ + +/a{2,3}?+b/U + +/x(?U)a++b/D + xaaaab + +/(?U)xa++b/D + xaaaab + +/^((a+)(?U)([ab]+)(?-U)([bc]+)(\w*))/D + +/^x(?U)a+b/D + +/^x(?U)(a+)b/D + +/[.x.]/ + +/[=x=]/ + +/[:x:]/ + +/\l/ + +/\L/ + +/\N{name}/ + +/\pP/ + +/\PP/ + +/\p{prop}/ + +/\P{prop}/ + +/\u/ + +/\U/ + +/\X/ + +/[/ + +/[a-/ + +/[[:space:]/ + +/[\s]/DM + +/[[:space:]]/DM + +/[[:space:]abcde]/DM + +/< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >/x + <> + + hij> + hij> + def> + + *** Failers + iV\;\?aPhFB\<\*vW\@QW\@sO9\}cfZA\-i\'w\%hKd6gt1UJP\,15_\#QY\$M\^Mss_U\/\]\&LK9\[5vQub\^w\[KDD\ qmj\;2\}YWFdYx\.Ap\]hjCPTP\(n28k\+3\;o\&WXqs\/gOXdr\$\:r\'do0\;b4c\(f_Gr\=\"\\4\)\[01T7ajQJvL\$W\~mL_sS\/4h\:x\*\[ZN\=KLs\&L5zX\/\/\>it\,o\:aU\(\;Z\>pW\&T7oP\'2K\^E\:x9\'c\[\%z\-\,64JQ5AeH_G\#KijUKghQw\^\\vea3a\?kka_G\$8\#\`\*kynsxzBLru\'\]k_\[7FrVx\}\^\=\$blx\>s\-N\%j\;D\*aZDnsw\:YKZ\%Q\.Kne9\#hP\?\+b3\(SOvL\,\^\;\&u5\@\?5C5Bhb\=m\-vEh_L15Jl\]U\)0RP6\{q\%L\^_z5E\'Dw6X\b|DM + 
+|\$\<\.X\+ix\[d1b\!H\#\?vV0vrK\:ZH1\=2M\>iV\;\?aPhFB\<\*vW\@QW\@sO9\}cfZA\-i\'w\%hKd6gt1UJP\,15_\#QY\$M\^Mss_U\/\]\&LK9\[5vQub\^w\[KDD\ qmj\;2\}YWFdYx\.Ap\]hjCPTP\(n28k\+3\;o\&WXqs\/gOXdr\$\:r\'do0\;b4c\(f_Gr\=\"\\4\)\[01T7ajQJvL\$W\~mL_sS\/4h\:x\*\[ZN\=KLs\&L5zX\/\/\>it\,o\:aU\(\;Z\>pW\&T7oP\'2K\^E\:x9\'c\[\%z\-\,64JQ5AeH_G\#KijUKghQw\^\\vea3a\?kka_G\$8\#\`\*kynsxzBLru\'\]k_\[7FrVx\}\^\=\$blx\>s\-N\%j\;D\*aZDnsw\:YKZ\%Q\.Kne9\#hP\?\+b3\(SOvL\,\^\;\&u5\@\?5C5Bhb\=m\-vEh_L15Jl\]U\)0RP6\{q\%L\^_z5E\'Dw6X\b|DM + +/(.*)\d+\1/I + +/(.*)\d+/I + +/(.*)\d+\1/Is + +/(.*)\d+/Is + +/(.*(xyz))\d+\2/I + +/((.*))\d+\1/I + abc123bc + +/a[b]/I + +/(?=a).*/I + +/(?=abc).xyz/iI + +/(?=abc)(?i).xyz/I + +/(?=a)(?=b)/I + +/(?=.)a/I + +/((?=abcda)a)/I + +/((?=abcda)ab)/I + +/()a/I + +/(?(1)ab|ac)/I + +/(?(1)abz|acz)/I + +/(?(1)abz)/I + +/(?(1)abz)123/I + +/(a)+/I + +/(a){2,3}/I + +/(a)*/I + +/[a]/I + +/[ab]/I + +/[ab]/IS + +/[^a]/I + +/\d456/I + +/\d456/IS + +/a^b/I + +/^a/mI + abcde + xy\nabc + *** Failers + xyabc + +/c|abc/I + +/(?i)[ab]/IS + +/[ab](?i)cd/IS + +/abc(?C)def/ + abcdef + 1234abcdef + *** Failers + abcxyz + abcxyzf + +/abc(?C)de(?C1)f/ + 123abcdef + +/(?C1)\dabc(?C2)def/ + 1234abcdef + *** Failers + abcdef + +/(?C255)ab/ + +/(?C256)ab/ + +/(?Cab)xx/ + +/(?C12vr)x/ + +/abc(?C)def/ + *** Failers + \x83\x0\x61bcdef + +/(abc)(?C)de(?C1)f/ + 123abcdef + 123abcdef\C+ + 123abcdef\C- + *** Failers + 123abcdef\C!1 + +/(?C0)(abc(?C1))*/ + abcabcabc + abcabc\C!1!3 + *** Failers + abcabcabc\C!1!3 + +/(\d{3}(?C))*/ + 123\C+ + 123456\C+ + 123456789\C+ + +/((xyz)(?C)p|(?C1)xyzabc)/ + xyzabc\C+ + +/(X)((xyz)(?C)p|(?C1)xyzabc)/ + Xxyzabc\C+ + +/(?=(abc))(?C)abcdef/ + abcdef\C+ + +/(?!(abc)(?C1)d)(?C2)abcxyz/ + abcxyz\C+ + +/(?<=(abc)(?C))xyz/ + abcxyz\C+ + +/(?C)abc/ + +/(?C)^abc/ + +/(?C)a|b/S + +/(?R)/ + +/(a|(?R))/ + +/(ab|(bc|(de|(?R))))/ + +/x(ab|(bc|(de|(?R))))/ + xab + xbc + xde + xxab + xxxab + *** Failers + xyab + +/(ab|(bc|(de|(?1))))/ + +/x(ab|(bc|(de|(?1)x)x)x)/ + 
+/^([^()]|\((?1)*\))*$/ + abc + a(b)c + a(b(c))d + *** Failers) + a(b(c)d + +/^>abc>([^()]|\((?1)*\))* abc>123 abc>1(2)3 abc>(1(2)3) ]*+) | (?2)) * >))/x + <> + + hij> + hij> + def> + + *** Failers + b|c)d(?P e)/D + abde + acde + +/(?:a(?P c(?P d)))(?Pa)/D + +/(?Pa)...(?P=a)bbb(?P>a)d/D + / End of testinput2 / diff --git a/ext/pcre/pcrelib/testdata/testinput3 b/ext/pcre/pcrelib/testdata/testinput3 index d3bd74fdd33..391aa620696 100644 --- a/ext/pcre/pcrelib/testdata/testinput3 +++ b/ext/pcre/pcrelib/testdata/testinput3 @@ -1,1724 +1,65 @@ -/(?.*/)foo" - /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/it/you/see/ +/[\b]/Lfr + \b + *** Failers + a -"(?>.*/)foo" - /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo +/^\w+/ + *** Failers + École -/(?>(\.\d\d[1-9]?))\d+/ - 1.230003938 - 1.875000282 +/^\w+/Lfr + École + +/(.+)\b(.+)/ + École + +/(.+)\b(.+)/Lfr + *** Failers + École + +/École/i + École + *** Failers + école + +/École/iLfr + École + école + +/\w/IS + +/\w/ISLfr + +/^[\xc8-\xc9]/iLfr + École + école + +/^[\xc8-\xc9]/Lfr + École *** Failers - 1.235 + école -/^((?>\w+)|(?>\s+))*$/ - now is the time for all good men to come to the aid of the party - *** Failers - this is not a line with only words and spaces! 
- -/(\d+)(\w)/ - 12345a - 12345+ - -/((?>\d+))(\w)/ - 12345a - *** Failers - 12345+ - -/(?>a+)b/ - aaab - -/((?>a+)b)/ - aaab - -/(?>(a+))b/ - aaab - -/(?>b)+/ - aaabbbccc - -/(?>a+|b+|c+)*c/ - aaabbbbccccd - -/((?>[^()]+)|\([^()]*\))+/ - ((abc(ade)ufh()()x - -/\(((?>[^()]+)|\([^()]+\))+\)/ - (abc) - (abc(def)xyz) - *** Failers - ((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa - -/a(?-i)b/i - ab - *** Failers - Ab - aB - AB - -/(a (?x)b c)d e/ - a bcd e - *** Failers - a b cd e - abcd e - a bcde - -/(a b(?x)c d (?-x)e f)/ - a bcde f - *** Failers - abcdef - -/(a(?i)b)c/ - abc - aBc - *** Failers - abC - aBC - Abc - ABc - ABC - AbC - -/a(?i:b)c/ - abc - aBc - *** Failers - ABC - abC - aBC - -/a(?i:b)*c/ - aBc - aBBc - *** Failers - aBC - aBBC - -/a(?=b(?i)c)\w\wd/ - abcd - abCd - *** Failers - aBCd - abcD - -/(?s-i:more.*than).*million/i - more than million - more than MILLION - more \n than Million - *** Failers - MORE THAN MILLION - more \n than \n million - -/(?:(?s-i)more.*than).*million/i - more than million - more than MILLION - more \n than Million - *** Failers - MORE THAN MILLION - more \n than \n million - -/(?>a(?i)b+)+c/ - abc - aBbc - aBBc - *** Failers - Abc - abAb - abbC - -/(?=a(?i)b)\w\wc/ - abc - aBc - *** Failers - Ab - abC - aBC - -/(?<=a(?i)b)(\w\w)c/ - abxxc - aBxxc - *** Failers - Abxxc - ABxxc - abxxC - -/(?:(a)|b)(?(1)A|B)/ - aA - bB - *** Failers - aB - bA - -/^(a)?(?(1)a|b)+$/ - aa - b - bb - *** Failers - ab - -/^(?(?=abc)\w{3}:|\d\d)$/ - abc: - 12 - *** Failers - 123 - xyz - -/^(?(?!abc)\d\d|\w{3}:)$/ - abc: - 12 - *** Failers - 123 - xyz - -/(?(?<=foo)bar|cat)/ - foobar - cat - fcat - focat - *** Failers - foocat - -/(?(?a*)*/ - a - aa - aaaa - -/(abc|)+/ - abc - abcabc - abcabcabc - xyz - -/([a]*)*/ - a - aaaaa - -/([ab]*)*/ - a - b - ababab - aaaabcde - bbbb - -/([^a]*)*/ - b - bbbb - aaa - -/([^ab]*)*/ - cccc - abab - -/([a]*?)*/ - a - aaaa - -/([ab]*?)*/ - a - b - abab - baba - -/([^a]*?)*/ - b - bbbb - aaa - -/([^ab]*?)*/ - c - 
cccc - baba - -/(?>a*)*/ - a - aaabcde - -/((?>a*))*/ - aaaaa - aabbaa - -/((?>a*?))*/ - aaaaa - aabbaa - -/(?(?=[^a-z]+[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) /x - 12-sep-98 - 12-09-98 - *** Failers - sep-12-98 - -/(?<=(foo))bar\1/ - foobarfoo - foobarfootling - *** Failers - foobar - barfoo - -/(?i:saturday|sunday)/ - saturday - sunday - Saturday - Sunday - SATURDAY - SUNDAY - SunDay - -/(a(?i)bc|BB)x/ - abcx - aBCx - bbx - BBx - *** Failers - abcX - aBCX - bbX - BBX - -/^([ab](?i)[cd]|[ef])/ - ac - aC - bD - elephant - Europe - frog - France - *** Failers - Africa - -/^(ab|a(?i)[b-c](?m-i)d|x(?i)y|z)/ - ab - aBd - xy - xY - zebra - Zambesi - *** Failers - aCD - XY - -/(?<=foo\n)^bar/m - foo\nbar - *** Failers - bar - baz\nbar - -/(?<=(?]&/ - <&OUT - -/^(a\1?){4}$/ - aaaaaaaaaa - *** Failers - AB - aaaaaaaaa - aaaaaaaaaaa - -/^(a(?(1)\1)){4}$/ - aaaaaaaaaa - *** Failers - aaaaaaaaa - aaaaaaaaaaa - -/(?:(f)(o)(o)|(b)(a)(r))*/ - foobar - -/(?<=a)b/ - ab - *** Failers - cb - b - -/(?a+)ab/ - -/(?>a+)b/ - aaab - -/([[:]+)/ - a:[b]: - -/([[=]+)/ - a=[b]= - -/([[.]+)/ - a.[b]. - -/((?>a+)b)/ - aaab - -/(?>(a+))b/ - aaab - -/((?>[^()]+)|\([^()]*\))+/ - ((abc(ade)ufh()()x - -/a\Z/ - *** Failers - aaab - a\nb\n - -/b\Z/ - a\nb\n - -/b\z/ - -/b\Z/ - a\nb - -/b\z/ - a\nb - *** Failers - -/^(?>(?(1)\.|())[^\W_](?>[a-z0-9-]*[^\W_])?)+$/ - a - abc - a-b - 0-9 - a.b - 5.6.7 - the.quick.brown.fox - a100.b200.300c - 12-ab.1245 - ***Failers - \ - .a - -a - a- - a. - a_b - a.- - a.. - ab..bc - the.quick.brown.fox- - the.quick.brown.fox. 
- the.quick.brown.fox_ - the.quick.brown.fox+ - -/(?>.*)(?<=(abcd|wxyz))/ - alphabetabcd - endingwxyz - *** Failers - a rather long string that doesn't end with one of them - -/word (?>(?:(?!otherword)[a-zA-Z0-9]+ ){0,30})otherword/ - word cat dog elephant mussel cow horse canary baboon snake shark otherword - word cat dog elephant mussel cow horse canary baboon snake shark - -/word (?>[a-zA-Z0-9]+ ){0,30}otherword/ - word cat dog elephant mussel cow horse canary baboon snake shark the quick brown fox and the lazy dog and several other words getting close to thirty by now I hope - -/(?<=\d{3}(?!999))foo/ - 999foo - 123999foo - *** Failers - 123abcfoo - -/(?<=(?!...999)\d{3})foo/ - 999foo - 123999foo - *** Failers - 123abcfoo - -/(?<=\d{3}(?!999)...)foo/ - 123abcfoo - 123456foo - *** Failers - 123999foo - -/(?<=\d{3}...)(?\s*)=(?>\s*) # find Z)+|A)*/ - ZABCDEFG - -/((?>)+|A)*/ - ZABCDEFG - -/a*/g - abbab - -/^[a-\d]/ - abcde - -things - 0digit - *** Failers - bcdef - -/^[\d-a]/ - abcde - -things - 0digit - *** Failers - bcdef - -/ End of testinput3 / +/ End of testinput3 / diff --git a/ext/pcre/pcrelib/testdata/testinput4 b/ext/pcre/pcrelib/testdata/testinput4 index f2878965f64..51d6b976328 100644 --- a/ext/pcre/pcrelib/testdata/testinput4 +++ b/ext/pcre/pcrelib/testdata/testinput4 @@ -1,65 +1,155 @@ -/^[\w]+/ +/-- Do not use the \x{} construct except with patterns that have the --/ +/-- /8 option set, because PCRE doesn't recognize them as UTF-8 unless --/ +/-- that option is set. However, the latest Perls recognize them always. 
--/ + +/a.b/8 + acb + a\x7fb + a\x{100}b *** Failers - École + a\nb -/^[\w]+/Lfr - École - -/^[\w]+/ +/a(.{3})b/8 + a\x{4000}xyb + a\x{4000}\x7fyb + a\x{4000}\x{100}yb *** Failers - École + a\x{4000}b + ac\ncb -/^[\W]+/ - École +/a(.*?)(.)/ + a\xc0\x88b -/^[\W]+/Lfr +/a(.*?)(.)/8 + a\x{100}b + +/a(.*)(.)/ + a\xc0\x88b + +/a(.*)(.)/8 + a\x{100}b + +/a(.)(.)/ + a\xc0\x92bcd + +/a(.)(.)/8 + a\x{240}bcd + +/a(.?)(.)/ + a\xc0\x92bcd + +/a(.?)(.)/8 + a\x{240}bcd + +/a(.??)(.)/ + a\xc0\x92bcd + +/a(.??)(.)/8 + a\x{240}bcd + +/a(.{3})b/8 + a\x{1234}xyb + a\x{1234}\x{4321}yb + a\x{1234}\x{4321}\x{3412}b *** Failers - École + a\x{1234}b + ac\ncb -/[\b]/ - \b +/a(.{3,})b/8 + a\x{1234}xyb + a\x{1234}\x{4321}yb + a\x{1234}\x{4321}\x{3412}b + axxxxbcdefghijb + a\x{1234}\x{4321}\x{3412}\x{3421}b *** Failers - a + a\x{1234}b -/[\b]/Lfr - \b +/a(.{3,}?)b/8 + a\x{1234}xyb + a\x{1234}\x{4321}yb + a\x{1234}\x{4321}\x{3412}b + axxxxbcdefghijb + a\x{1234}\x{4321}\x{3412}\x{3421}b *** Failers - a + a\x{1234}b -/^\w+/ +/a(.{3,5})b/8 + a\x{1234}xyb + a\x{1234}\x{4321}yb + a\x{1234}\x{4321}\x{3412}b + axxxxbcdefghijb + a\x{1234}\x{4321}\x{3412}\x{3421}b + axbxxbcdefghijb + axxxxxbcdefghijb *** Failers - École + a\x{1234}b + axxxxxxbcdefghijb -/^\w+/Lfr - École - -/(.+)\b(.+)/ - École - -/(.+)\b(.+)/Lfr +/a(.{3,5}?)b/8 + a\x{1234}xyb + a\x{1234}\x{4321}yb + a\x{1234}\x{4321}\x{3412}b + axxxxbcdefghijb + a\x{1234}\x{4321}\x{3412}\x{3421}b + axbxxbcdefghijb + axxxxxbcdefghijb *** Failers - École + a\x{1234}b + axxxxxxbcdefghijb -/École/i - École +/^[a\x{c0}]/8 *** Failers - école + \x{100} -/École/iLfr - École - école +/(?<=aXb)cd/8 + aXbcd -/\w/IS +/(?<=a\x{100}b)cd/8 + a\x{100}bcd -/\w/ISLfr - -/^[\xc8-\xc9]/iLfr - École - école - -/^[\xc8-\xc9]/Lfr - École +/(?<=a\x{100000}b)cd/8 + a\x{100000}bcd + +/(?:\x{100}){3}b/8 + \x{100}\x{100}\x{100}b *** Failers - école + \x{100}\x{100}b + +/\x{ab}/8 + \x{ab} + \xc2\xab + *** Failers + \x00{ab} + +/(?<=(.))X/8 + WXYZ + \x{256}XYZ + *** Failers + 
XYZ + +/X(\C{3})/8 + X\x{1234} + +/X(\C{4})/8 + X\x{1234}YZ + +/X\C*/8 + XYZabcdce + +/X\C*?/8 + XYZabcde + +/X\C{3,5}/8 + Xabcdefg + X\x{1234} + X\x{1234}YZ + X\x{1234}\x{512} + X\x{1234}\x{512}YZ + +/X\C{3,5}?/8 + Xabcdefg + X\x{1234} + X\x{1234}YZ + X\x{1234}\x{512} / End of testinput4 / diff --git a/ext/pcre/pcrelib/testdata/testinput5 b/ext/pcre/pcrelib/testdata/testinput5 index d66cfbddf30..81fe233e6ba 100644 --- a/ext/pcre/pcrelib/testdata/testinput5 +++ b/ext/pcre/pcrelib/testdata/testinput5 @@ -1,118 +1,91 @@ -/-- Because of problems with Perl 5.6 in handling UTF-8 vs non UTF-8 --/ -/-- strings automatically, do not use the \x{} construct except with --/ -/-- patterns that have the /8 option set, and don't use them without! --/ +/\x{100}/8DM -/a.b/8 - acb - a\x7fb - a\x{100}b - *** Failers - a\nb +/\x{1000}/8DM -/a(.{3})b/8 - a\x{4000}xyb - a\x{4000}\x7fyb - a\x{4000}\x{100}yb - *** Failers - a\x{4000}b - ac\ncb +/\x{10000}/8DM -/a(.*?)(.)/ - a\xc0\x88b +/\x{100000}/8DM -/a(.*?)(.)/8 - a\x{100}b +/\x{1000000}/8DM -/a(.*)(.)/ - a\xc0\x88b +/\x{4000000}/8DM -/a(.*)(.)/8 - a\x{100}b +/\x{7fffFFFF}/8DM -/a(.)(.)/ - a\xc0\x92bcd +/[\x{ff}]/8DM -/a(.)(.)/8 - a\x{240}bcd +/[\x{100}]/8DM -/a(.?)(.)/ - a\xc0\x92bcd +/\x{ffffffff}/8 -/a(.?)(.)/8 - a\x{240}bcd +/\x{100000000}/8 -/a(.??)(.)/ - a\xc0\x92bcd +/^\x{100}a\x{1234}/8 + \x{100}a\x{1234}bcd -/a(.??)(.)/8 - a\x{240}bcd +/\x80/8D -/a(.{3})b/8 - a\x{1234}xyb - a\x{1234}\x{4321}yb - a\x{1234}\x{4321}\x{3412}b - *** Failers - a\x{1234}b - ac\ncb +/\xff/8D -/a(.{3,})b/8 - a\x{1234}xyb - a\x{1234}\x{4321}yb - a\x{1234}\x{4321}\x{3412}b - axxxxbcdefghijb - a\x{1234}\x{4321}\x{3412}\x{3421}b - *** Failers - a\x{1234}b - -/a(.{3,}?)b/8 - a\x{1234}xyb - a\x{1234}\x{4321}yb - a\x{1234}\x{4321}\x{3412}b - axxxxbcdefghijb - a\x{1234}\x{4321}\x{3412}\x{3421}b - *** Failers - a\x{1234}b - -/a(.{3,5})b/8 - a\x{1234}xyb - a\x{1234}\x{4321}yb - a\x{1234}\x{4321}\x{3412}b - axxxxbcdefghijb - a\x{1234}\x{4321}\x{3412}\x{3421}b - 
axbxxbcdefghijb - axxxxxbcdefghijb - *** Failers - a\x{1234}b - axxxxxxbcdefghijb - -/a(.{3,5}?)b/8 - a\x{1234}xyb - a\x{1234}\x{4321}yb - a\x{1234}\x{4321}\x{3412}b - axxxxbcdefghijb - a\x{1234}\x{4321}\x{3412}\x{3421}b - axbxxbcdefghijb - axxxxxbcdefghijb - *** Failers - a\x{1234}b - axxxxxxbcdefghijb - -/^[a\x{c0}]/8 - *** Failers - \x{100} - -/(?<=aXb)cd/8 - aXbcd - -/(?<=a\x{100}b)cd/8 - a\x{100}bcd - -/(?<=a\x{100000}b)cd/8 - a\x{100000}bcd +/\x{0041}\x{2262}\x{0391}\x{002e}/D8 + \x{0041}\x{2262}\x{0391}\x{002e} -/(?:\x{100}){3}b/8 - \x{100}\x{100}\x{100}b - *** Failers - \x{100}\x{100}b +/\x{D55c}\x{ad6d}\x{C5B4}/D8 + \x{D55c}\x{ad6d}\x{C5B4} + +/\x{65e5}\x{672c}\x{8a9e}/D8 + \x{65e5}\x{672c}\x{8a9e} + +/\x{80}/D8 + +/\x{084}/D8 + +/\x{104}/D8 + +/\x{861}/D8 + +/\x{212ab}/D8 + +/.{3,5}X/D8 + \x{212ab}\x{212ab}\x{212ab}\x{861}X + + +/.{3,5}?/D8 + \x{212ab}\x{212ab}\x{212ab}\x{861} + +/-- These tests are here rather than in testinput4 because Perl 5.6 has --/ +/-- some problems with UTF-8 support, in the area of \x{..} where the --/ +/-- value is < 255. It grumbles about invalid UTF-8 strings. --/ + +/^[a\x{c0}]b/8 + \x{c0}b + +/^([a\x{c0}]*?)aa/8 + a\x{c0}aaaa/ + +/^([a\x{c0}]*?)aa/8 + a\x{c0}aaaa/ + a\x{c0}a\x{c0}aaa/ + +/^([a\x{c0}]*)aa/8 + a\x{c0}aaaa/ + a\x{c0}a\x{c0}aaa/ + +/^([a\x{c0}]*)a\x{c0}/8 + a\x{c0}aaaa/ + a\x{c0}a\x{c0}aaa/ + +/-- --/ + +/(?<=\C)X/8 + Should produce an error diagnostic + +/-- This one is here not because it's different to Perl, but because the --/ +/-- way the captured single-byte is displayed. (In Perl it becomes a --/ +/-- character, and you can't tell the difference.) 
--/ + +/X(\C)(.*)/8 + X\x{1234} + X\nabc / End of testinput5 / diff --git a/ext/pcre/pcrelib/testdata/testoutput1 b/ext/pcre/pcrelib/testdata/testoutput1 index ea14af11d35..81bf6cef3d5 100644 --- a/ext/pcre/pcrelib/testdata/testoutput1 +++ b/ext/pcre/pcrelib/testdata/testoutput1 @@ -1,4 +1,4 @@ -PCRE version 3.9 02-Jan-2002 +PCRE version 3.92 11-Sep-2002 /the quick brown fox/ the quick brown fox @@ -2225,7 +2225,7 @@ No match 0: abc\x01de 1: abc -/a.b(?s)/ +/(?s)a.b/ a\nb 0: a\x0ab @@ -3015,5 +3015,3208 @@ No match 0: ABABAB 1: AB +/(?.*/)foo" + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/it/you/see/ +No match + +"(?>.*/)foo" + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + 0: /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + +/(?>(\.\d\d[1-9]?))\d+/ + 1.230003938 + 0: .230003938 + 1: .23 + 1.875000282 + 0: .875000282 + 1: .875 + *** Failers +No match + 1.235 +No match + +/^((?>\w+)|(?>\s+))*$/ + now is the time for all good men to come to the aid of the party + 0: now is the time for all good men to come to the aid of the party + 1: party + *** Failers +No match + this is not a line with only words and spaces! 
+No match + +/(\d+)(\w)/ + 12345a + 0: 12345a + 1: 12345 + 2: a + 12345+ + 0: 12345 + 1: 1234 + 2: 5 + +/((?>\d+))(\w)/ + 12345a + 0: 12345a + 1: 12345 + 2: a + *** Failers +No match + 12345+ +No match + +/(?>a+)b/ + aaab + 0: aaab + +/((?>a+)b)/ + aaab + 0: aaab + 1: aaab + +/(?>(a+))b/ + aaab + 0: aaab + 1: aaa + +/(?>b)+/ + aaabbbccc + 0: bbb + +/(?>a+|b+|c+)*c/ + aaabbbbccccd + 0: aaabbbbc + +/((?>[^()]+)|\([^()]*\))+/ + ((abc(ade)ufh()()x + 0: abc(ade)ufh()()x + 1: x + +/\(((?>[^()]+)|\([^()]+\))+\)/ + (abc) + 0: (abc) + 1: abc + (abc(def)xyz) + 0: (abc(def)xyz) + 1: xyz + *** Failers +No match + ((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +No match + +/a(?-i)b/i + ab + 0: ab + Ab + 0: Ab + *** Failers +No match + aB +No match + AB +No match + +/(a (?x)b c)d e/ + a bcd e + 0: a bcd e + 1: a bc + *** Failers +No match + a b cd e +No match + abcd e +No match + a bcde +No match + +/(a b(?x)c d (?-x)e f)/ + a bcde f + 0: a bcde f + 1: a bcde f + *** Failers +No match + abcdef +No match + +/(a(?i)b)c/ + abc + 0: abc + 1: ab + aBc + 0: aBc + 1: aB + *** Failers +No match + abC +No match + aBC +No match + Abc +No match + ABc +No match + ABC +No match + AbC +No match + +/a(?i:b)c/ + abc + 0: abc + aBc + 0: aBc + *** Failers +No match + ABC +No match + abC +No match + aBC +No match + +/a(?i:b)*c/ + aBc + 0: aBc + aBBc + 0: aBBc + *** Failers +No match + aBC +No match + aBBC +No match + +/a(?=b(?i)c)\w\wd/ + abcd + 0: abcd + abCd + 0: abCd + *** Failers +No match + aBCd +No match + abcD +No match + +/(?s-i:more.*than).*million/i + more than million + 0: more than million + more than MILLION + 0: more than MILLION + more \n than Million + 0: more \x0a than Million + *** Failers +No match + MORE THAN MILLION +No match + more \n than \n million +No match + +/(?:(?s-i)more.*than).*million/i + more than million + 0: more than million + more than MILLION + 0: more than MILLION + more \n than Million + 0: more \x0a than Million + *** Failers +No match + MORE THAN MILLION 
+No match + more \n than \n million +No match + +/(?>a(?i)b+)+c/ + abc + 0: abc + aBbc + 0: aBbc + aBBc + 0: aBBc + *** Failers +No match + Abc +No match + abAb +No match + abbC +No match + +/(?=a(?i)b)\w\wc/ + abc + 0: abc + aBc + 0: aBc + *** Failers +No match + Ab +No match + abC +No match + aBC +No match + +/(?<=a(?i)b)(\w\w)c/ + abxxc + 0: xxc + 1: xx + aBxxc + 0: xxc + 1: xx + *** Failers +No match + Abxxc +No match + ABxxc +No match + abxxC +No match + +/(?:(a)|b)(?(1)A|B)/ + aA + 0: aA + 1: a + bB + 0: bB + *** Failers +No match + aB +No match + bA +No match + +/^(a)?(?(1)a|b)+$/ + aa + 0: aa + 1: a + b + 0: b + bb + 0: bb + *** Failers +No match + ab +No match + +/^(?(?=abc)\w{3}:|\d\d)$/ + abc: + 0: abc: + 12 + 0: 12 + *** Failers +No match + 123 +No match + xyz +No match + +/^(?(?!abc)\d\d|\w{3}:)$/ + abc: + 0: abc: + 12 + 0: 12 + *** Failers +No match + 123 +No match + xyz +No match + +/(?(?<=foo)bar|cat)/ + foobar + 0: bar + cat + 0: cat + fcat + 0: cat + focat + 0: cat + *** Failers +No match + foocat +No match + +/(?(?a*)*/ + a + 0: a + aa + 0: aa + aaaa + 0: aaaa + +/(abc|)+/ + abc + 0: abc + 1: + abcabc + 0: abcabc + 1: + abcabcabc + 0: abcabcabc + 1: + xyz + 0: + 1: + +/([a]*)*/ + a + 0: a + 1: + aaaaa + 0: aaaaa + 1: + +/([ab]*)*/ + a + 0: a + 1: + b + 0: b + 1: + ababab + 0: ababab + 1: + aaaabcde + 0: aaaab + 1: + bbbb + 0: bbbb + 1: + +/([^a]*)*/ + b + 0: b + 1: + bbbb + 0: bbbb + 1: + aaa + 0: + 1: + +/([^ab]*)*/ + cccc + 0: cccc + 1: + abab + 0: + 1: + +/([a]*?)*/ + a + 0: + 1: + aaaa + 0: + 1: + +/([ab]*?)*/ + a + 0: + 1: + b + 0: + 1: + abab + 0: + 1: + baba + 0: + 1: + +/([^a]*?)*/ + b + 0: + 1: + bbbb + 0: + 1: + aaa + 0: + 1: + +/([^ab]*?)*/ + c + 0: + 1: + cccc + 0: + 1: + baba + 0: + 1: + +/(?>a*)*/ + a + 0: a + aaabcde + 0: aaa + +/((?>a*))*/ + aaaaa + 0: aaaaa + 1: + aabbaa + 0: aa + 1: + +/((?>a*?))*/ + aaaaa + 0: + 1: + aabbaa + 0: + 1: + +/(?(?=[^a-z]+[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) /x + 12-sep-98 + 0: 12-sep-98 
+ 12-09-98 + 0: 12-09-98 + *** Failers +No match + sep-12-98 +No match + +/(?<=(foo))bar\1/ + foobarfoo + 0: barfoo + 1: foo + foobarfootling + 0: barfoo + 1: foo + *** Failers +No match + foobar +No match + barfoo +No match + +/(?i:saturday|sunday)/ + saturday + 0: saturday + sunday + 0: sunday + Saturday + 0: Saturday + Sunday + 0: Sunday + SATURDAY + 0: SATURDAY + SUNDAY + 0: SUNDAY + SunDay + 0: SunDay + +/(a(?i)bc|BB)x/ + abcx + 0: abcx + 1: abc + aBCx + 0: aBCx + 1: aBC + bbx + 0: bbx + 1: bb + BBx + 0: BBx + 1: BB + *** Failers +No match + abcX +No match + aBCX +No match + bbX +No match + BBX +No match + +/^([ab](?i)[cd]|[ef])/ + ac + 0: ac + 1: ac + aC + 0: aC + 1: aC + bD + 0: bD + 1: bD + elephant + 0: e + 1: e + Europe + 0: E + 1: E + frog + 0: f + 1: f + France + 0: F + 1: F + *** Failers +No match + Africa +No match + +/^(ab|a(?i)[b-c](?m-i)d|x(?i)y|z)/ + ab + 0: ab + 1: ab + aBd + 0: aBd + 1: aBd + xy + 0: xy + 1: xy + xY + 0: xY + 1: xY + zebra + 0: z + 1: z + Zambesi + 0: Z + 1: Z + *** Failers +No match + aCD +No match + XY +No match + +/(?<=foo\n)^bar/m + foo\nbar + 0: bar + *** Failers +No match + bar +No match + baz\nbar +No match + +/(?<=(?]&/ + <&OUT + 0: <& + +/^(a\1?){4}$/ + aaaaaaaaaa + 0: aaaaaaaaaa + 1: aaaa + *** Failers +No match + AB +No match + aaaaaaaaa +No match + aaaaaaaaaaa +No match + +/^(a(?(1)\1)){4}$/ + aaaaaaaaaa + 0: aaaaaaaaaa + 1: aaaa + *** Failers +No match + aaaaaaaaa +No match + aaaaaaaaaaa +No match + +/(?:(f)(o)(o)|(b)(a)(r))*/ + foobar + 0: foobar + 1: f + 2: o + 3: o + 4: b + 5: a + 6: r + +/(?<=a)b/ + ab + 0: b + *** Failers +No match + cb +No match + b +No match + +/(? 
+ 2: abcd + xy:z:::abcd + 0: xy:z:::abcd + 1: xy:z::: + 2: abcd + +/^[^bcd]*(c+)/ + aexycd + 0: aexyc + 1: c + +/(a*)b+/ + caab + 0: aab + 1: aa + +/([\w:]+::)?(\w+)$/ + abcd + 0: abcd + 1: + 2: abcd + xy:z:::abcd + 0: xy:z:::abcd + 1: xy:z::: + 2: abcd + *** Failers + 0: Failers + 1: + 2: Failers + abcd: +No match + abcd: +No match + +/^[^bcd]*(c+)/ + aexycd + 0: aexyc + 1: c + +/(>a+)ab/ + +/(?>a+)b/ + aaab + 0: aaab + +/([[:]+)/ + a:[b]: + 0: :[ + 1: :[ + +/([[=]+)/ + a=[b]= + 0: =[ + 1: =[ + +/([[.]+)/ + a.[b]. + 0: .[ + 1: .[ + +/((?>a+)b)/ + aaab + 0: aaab + 1: aaab + +/(?>(a+))b/ + aaab + 0: aaab + 1: aaa + +/((?>[^()]+)|\([^()]*\))+/ + ((abc(ade)ufh()()x + 0: abc(ade)ufh()()x + 1: x + +/a\Z/ + *** Failers +No match + aaab +No match + a\nb\n +No match + +/b\Z/ + a\nb\n + 0: b + +/b\z/ + +/b\Z/ + a\nb + 0: b + +/b\z/ + a\nb + 0: b + *** Failers +No match + +/^(?>(?(1)\.|())[^\W_](?>[a-z0-9-]*[^\W_])?)+$/ + a + 0: a + 1: + abc + 0: abc + 1: + a-b + 0: a-b + 1: + 0-9 + 0: 0-9 + 1: + a.b + 0: a.b + 1: + 5.6.7 + 0: 5.6.7 + 1: + the.quick.brown.fox + 0: the.quick.brown.fox + 1: + a100.b200.300c + 0: a100.b200.300c + 1: + 12-ab.1245 + 0: 12-ab.1245 + 1: + ***Failers +No match + \ +No match + .a +No match + -a +No match + a- +No match + a. +No match + a_b +No match + a.- +No match + a.. +No match + ab..bc +No match + the.quick.brown.fox- +No match + the.quick.brown.fox. 
+No match + the.quick.brown.fox_ +No match + the.quick.brown.fox+ +No match + +/(?>.*)(?<=(abcd|wxyz))/ + alphabetabcd + 0: alphabetabcd + 1: abcd + endingwxyz + 0: endingwxyz + 1: wxyz + *** Failers +No match + a rather long string that doesn't end with one of them +No match + +/word (?>(?:(?!otherword)[a-zA-Z0-9]+ ){0,30})otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark otherword + 0: word cat dog elephant mussel cow horse canary baboon snake shark otherword + word cat dog elephant mussel cow horse canary baboon snake shark +No match + +/word (?>[a-zA-Z0-9]+ ){0,30}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark the quick brown fox and the lazy dog and several other words getting close to thirty by now I hope +No match + +/(?<=\d{3}(?!999))foo/ + 999foo + 0: foo + 123999foo + 0: foo + *** Failers +No match + 123abcfoo +No match + +/(?<=(?!...999)\d{3})foo/ + 999foo + 0: foo + 123999foo + 0: foo + *** Failers +No match + 123abcfoo +No match + +/(?<=\d{3}(?!999)...)foo/ + 123abcfoo + 0: foo + 123456foo + 0: foo + *** Failers +No match + 123999foo +No match + +/(?<=\d{3}...)(? 
+ 2: + 3: abcd + + 2: + 3: abcd + \s*)=(?>\s*) # find + 2: + 3: abcd + Z)+|A)*/ + ZABCDEFG + 0: ZA + 1: A + +/((?>)+|A)*/ + ZABCDEFG + 0: + 1: + +/a*/g + abbab + 0: a + 0: + 0: + 0: a + 0: + 0: + +/^[a-\d]/ + abcde + 0: a + -things + 0: - + 0digit + 0: 0 + *** Failers +No match + bcdef +No match + +/^[\d-a]/ + abcde + 0: a + -things + 0: - + 0digit + 0: 0 + *** Failers +No match + bcdef +No match + +/[[:space:]]+/ + > \x09\x0a\x0c\x0d\x0b< + 0: \x09\x0a\x0c\x0d\x0b + +/[[:blank:]]+/ + > \x09\x0a\x0c\x0d\x0b< + 0: \x09 + +/[\s]+/ + > \x09\x0a\x0c\x0d\x0b< + 0: \x09\x0a\x0c\x0d + +/\s+/ + > \x09\x0a\x0c\x0d\x0b< + 0: \x09\x0a\x0c\x0d + +/ab/x + ab +No match + +/(?!\A)x/m + a\nxb\n + 0: x + +/(?!^)x/m + a\nxb\n +No match + +/abc\Qabc\Eabc/ + abcabcabc + 0: abcabcabc + +/abc\Q(*+|\Eabc/ + abc(*+|abc + 0: abc(*+|abc + +/ abc\Q abc\Eabc/x + abc abcabc + 0: abc abcabc + *** Failers +No match + abcabcabc +No match + +/abc#comment + \Q#not comment + literal\E/x + abc#not comment\n literal + 0: abc#not comment\x0a literal + +/abc#comment + \Q#not comment + literal/x + abc#not comment\n literal + 0: abc#not comment\x0a literal + +/abc#comment + \Q#not comment + literal\E #more comment + /x + abc#not comment\n literal + 0: abc#not comment\x0a literal + +/abc#comment + \Q#not comment + literal\E #more comment/x + abc#not comment\n literal + 0: abc#not comment\x0a literal + +/\Qabc\$xyz\E/ + abc\\\$xyz + 0: abc\$xyz + +/\Qabc\E\$\Qxyz\E/ + abc\$xyz + 0: abc$xyz + +/\Gabc/ + abc + 0: abc + *** Failers +No match + xyzabc +No match + +/\Gabc./g + abc1abc2xyzabc3 + 0: abc1 + 0: abc2 + +/abc./g + abc1abc2xyzabc3 + 0: abc1 + 0: abc2 + 0: abc3 + +/a(?x: b c )d/ + XabcdY + 0: abcd + *** Failers +No match + Xa b c d Y +No match + +/((?x)x y z | a b c)/ + XabcY + 0: abc + 1: abc + AxyzB + 0: xyz + 1: xyz + +/(?i)AB(?-i)C/ + XabCY + 0: abC + *** Failers +No match + XabcY +No match + +/((?i)AB(?-i)C|D)E/ + abCE + 0: abCE + 1: abC + DE + 0: DE + 1: D + *** Failers +No match + abcE +No match 
+ abCe +No match + dE +No match + De +No match + +/(.*)\d+\1/ + abc123abc + 0: abc123abc + 1: abc + abc123bc + 0: bc123bc + 1: bc + +/(.*)\d+\1/s + abc123abc + 0: abc123abc + 1: abc + abc123bc + 0: bc123bc + 1: bc + +/((.*))\d+\1/ + abc123abc + 0: abc123abc + 1: abc + 2: abc + abc123bc + 0: bc123bc + 1: bc + 2: bc + +/-- This tests for an IPv6 address in the form where it can have up to --/ +/-- eight components, one and only one of which is empty. This must be --/ +No match +/-- an internal component. --/ +No match + +/^(?!:) # colon disallowed at start + (?: # start of item + (?: [0-9a-f]{1,4} | # 1-4 hex digits or + (?(1)0 | () ) ) # if null previously matched, fail; else null + : # followed by colon + ){1,7} # end item; 1-7 of them required + [0-9a-f]{1,4} $ # final hex number at end of string + (?(1)|.) # check that there was an empty component + /xi + a123::a123 + 0: a123::a123 + 1: + a123:b342::abcd + 0: a123:b342::abcd + 1: + a123:b342::324e:abcd + 0: a123:b342::324e:abcd + 1: + a123:ddde:b342::324e:abcd + 0: a123:ddde:b342::324e:abcd + 1: + a123:ddde:b342::324e:dcba:abcd + 0: a123:ddde:b342::324e:dcba:abcd + 1: + a123:ddde:9999:b342::324e:dcba:abcd + 0: a123:ddde:9999:b342::324e:dcba:abcd + 1: + *** Failers +No match + 1:2:3:4:5:6:7:8 +No match + a123:bce:ddde:9999:b342::324e:dcba:abcd +No match + a123::9999:b342::324e:dcba:abcd +No match + abcde:2:3:4:5:6:7:8 +No match + ::1 +No match + abcd:fee0:123:: +No match + :1 +No match + 1: +No match + / End of testinput1 / diff --git a/ext/pcre/pcrelib/testdata/testoutput2 b/ext/pcre/pcrelib/testdata/testoutput2 index e8844d2aebe..2d1db0fb134 100644 --- a/ext/pcre/pcrelib/testdata/testoutput2 +++ b/ext/pcre/pcrelib/testdata/testoutput2 @@ -1,4 +1,4 @@ -PCRE version 3.9 02-Jan-2002 +PCRE version 3.92 11-Sep-2002 /(a)b|/ Capturing subpattern count = 1 @@ -185,10 +185,10 @@ Capturing subpattern count = 1 No options No first char No need char -Starting character set: \x09 \x0a \x0b \x0c \x0d \x20 a b +Starting 
character set: \x09 \x0a \x0c \x0d \x20 a b /(ab\2)/ -Failed: back reference to non-existent subpattern at offset 6 +Failed: reference to non-existent subpattern at offset 6 /{4,5}abc/ Failed: nothing to repeat at offset 4 @@ -281,7 +281,7 @@ No match No match /(a)(b)(c)(d)(e)\6/ -Failed: back reference to non-existent subpattern at offset 17 +Failed: reference to non-existent subpattern at offset 17 /the quick brown fox/ Capturing subpattern count = 0 @@ -426,7 +426,7 @@ Need char = '>' abc ghi nop 0: -/<.*>(?U)/ +/(?U)<.*>/ Capturing subpattern count = 0 Options: ungreedy First char = '<' @@ -486,8 +486,8 @@ Failed: lookbehind assertion is not fixed length at offset 12 /(?i)abc/ Capturing subpattern count = 0 Options: caseless -First char = 'a' -Need char = 'c' +First char = 'a' (caseless) +Need char = 'c' (caseless) /(a|(?m)a)/ Capturing subpattern count = 1 @@ -563,7 +563,7 @@ Failed: assertion expected after (?( at offset 3 Failed: assertion expected after (?( at offset 3 /(?(? qrs)123) 1: abcd(xyz qrs)123 2: 123 - 3:
qrs + 3:
/\( ( ( (?>[^()]+) | ((?R)) )* ) \) /x Capturing subpattern count = 3 @@ -1844,6 +1844,19 @@ Options: anchored No first char No need char +/^[[:blank:]]/D +------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [\x09 ] + 37 37 Ket + 40 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: anchored +No first char +No need char + /^[[:cntrl:]]/D ------------------------------------------------------------------ 0 37 Bra 0 @@ -2000,6 +2013,19 @@ Options: anchored No first char No need char +/^[[:^blank:]]/D +------------------------------------------------------------------ + 0 37 Bra 0 + 3 ^ + 4 [\x00-\x08\x0a-\x1f!-\xff] + 37 37 Ket + 40 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: anchored +No first char +No need char + /[01[:alpha:]%]/D ------------------------------------------------------------------ 0 36 Bra 0 @@ -2372,6 +2398,1688 @@ Need char = 'n' 1: main 2: O +/These are all cases where Perl does it differently (nested captures)/ +Capturing subpattern count = 1 +No options +First char = 'T' +Need char = 's' + +/^(a(b)?)+$/ +Capturing subpattern count = 2 +Options: anchored +No first char +Need char = 'a' + aba + 0: aba + 1: a + 2: b + +/^(aa(bb)?)+$/ +Capturing subpattern count = 2 +Options: anchored +No first char +Need char = 'a' + aabbaa + 0: aabbaa + 1: aa + 2: bb + +/^(aa|aa(bb))+$/ +Capturing subpattern count = 2 +Options: anchored +No first char +Need char = 'a' + aabbaa + 0: aabbaa + 1: aa + 2: bb + +/^(aa(bb)??)+$/ +Capturing subpattern count = 2 +Options: anchored +No first char +Need char = 'a' + aabbaa + 0: aabbaa + 1: aa + 2: bb + +/^(?:aa(bb)?)+$/ +Capturing subpattern count = 1 +Options: anchored +No first char +Need char = 'a' + aabbaa + 0: aabbaa + 1: bb + +/^(aa(b(b))?)+$/ +Capturing subpattern count = 3 +Options: anchored +No first char +Need char = 'a' + aabbaa + 0: aabbaa + 
1: aa + 2: bb + 3: b + +/^(?:aa(b(b))?)+$/ +Capturing subpattern count = 2 +Options: anchored +No first char +Need char = 'a' + aabbaa + 0: aabbaa + 1: bb + 2: b + +/^(?:aa(b(?:b))?)+$/ +Capturing subpattern count = 1 +Options: anchored +No first char +Need char = 'a' + aabbaa + 0: aabbaa + 1: bb + +/^(?:aa(bb(?:b))?)+$/ +Capturing subpattern count = 1 +Options: anchored +No first char +Need char = 'a' + aabbbaa + 0: aabbbaa + 1: bbb + +/^(?:aa(b(?:bb))?)+$/ +Capturing subpattern count = 1 +Options: anchored +No first char +Need char = 'a' + aabbbaa + 0: aabbbaa + 1: bbb + +/^(?:aa(?:b(b))?)+$/ +Capturing subpattern count = 1 +Options: anchored +No first char +Need char = 'a' + aabbaa + 0: aabbaa + 1: b + +/^(?:aa(?:b(bb))?)+$/ +Capturing subpattern count = 1 +Options: anchored +No first char +Need char = 'a' + aabbbaa + 0: aabbbaa + 1: bb + +/^(aa(b(bb))?)+$/ +Capturing subpattern count = 3 +Options: anchored +No first char +Need char = 'a' + aabbbaa + 0: aabbbaa + 1: aa + 2: bbb + 3: bb + +/^(aa(bb(bb))?)+$/ +Capturing subpattern count = 3 +Options: anchored +No first char +Need char = 'a' + aabbbbaa + 0: aabbbbaa + 1: aa + 2: bbbb + 3: bb + +/--------------------------------------------------------------------/ +Capturing subpattern count = 0 +No options +First char = '-' +Need char = '-' + +/#/xMD +Memory allocation (code space): 7 +------------------------------------------------------------------ + 0 3 Bra 0 + 3 3 Ket + 6 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: extended +No first char +No need char + +/a#/xMD +Memory allocation (code space): 13 +------------------------------------------------------------------ + 0 6 Bra 0 + 3 1 a + 6 6 Ket + 9 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: extended +First char = 'a' +No need char + +/[\s]/D +------------------------------------------------------------------ + 0 36 Bra 0 
+ 3 [\x09-\x0a\x0c-\x0d ] + 36 36 Ket + 39 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +No first char +No need char + +/[\S]/D +------------------------------------------------------------------ + 0 36 Bra 0 + 3 [\x00-\x08\x0b\x0e-\x1f!-\xff] + 36 36 Ket + 39 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +No first char +No need char + +/a(?i)b/D +------------------------------------------------------------------ + 0 11 Bra 0 + 3 1 a + 6 01 Opt + 8 1 b + 11 11 Ket + 14 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +Case state changes +First char = 'a' +Need char = 'b' (caseless) + ab + 0: ab + aB + 0: aB + *** Failers +No match + AB +No match + +/(a(?i)b)/D +------------------------------------------------------------------ + 0 19 Bra 0 + 3 11 Bra 1 + 6 1 a + 9 01 Opt + 11 1 b + 14 11 Ket + 17 00 Opt + 19 19 Ket + 22 End +------------------------------------------------------------------ +Capturing subpattern count = 1 +No options +Case state changes +First char = 'a' +Need char = 'b' (caseless) + ab + 0: ab + 1: ab + aB + 0: aB + 1: aB + *** Failers +No match + AB +No match + +/ (?i)abc/xD +------------------------------------------------------------------ + 0 8 Bra 0 + 3 3 abc + 8 8 Ket + 11 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: caseless extended +First char = 'a' (caseless) +Need char = 'c' (caseless) + +/#this is a comment + (?i)abc/xD +------------------------------------------------------------------ + 0 8 Bra 0 + 3 3 abc + 8 8 Ket + 11 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: caseless extended +First char = 'a' (caseless) +Need char = 'c' (caseless) + 
+/123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890/D +------------------------------------------------------------------ + 0 307 Bra 0 + 3 250 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 +255 50 12345678901234567890123456789012345678901234567890 +307 307 Ket +310 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +First char = '1' +Need char = '0' + +/\Q123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890/D +------------------------------------------------------------------ + 0 307 Bra 0 + 3 250 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 +255 50 12345678901234567890123456789012345678901234567890 +307 307 Ket +310 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +First char = '1' +Need char = '0' + +/\Q\E/D +------------------------------------------------------------------ + 0 3 Bra 0 + 3 3 Ket + 6 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +No first char +No need 
char + \ + 0: + +/\Q\Ex/D +------------------------------------------------------------------ + 0 6 Bra 0 + 3 1 x + 6 6 Ket + 9 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +First char = 'x' +No need char + +/ \Q\E/D +------------------------------------------------------------------ + 0 6 Bra 0 + 3 1 + 6 6 Ket + 9 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +First char = ' ' +No need char + +/a\Q\E/D +------------------------------------------------------------------ + 0 6 Bra 0 + 3 1 a + 6 6 Ket + 9 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +First char = 'a' +No need char + abc + 0: a + bca + 0: a + bac + 0: a + +/a\Q\Eb/D +------------------------------------------------------------------ + 0 9 Bra 0 + 3 1 a + 6 1 b + 9 9 Ket + 12 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +First char = 'a' +Need char = 'b' + abc + 0: ab + +/\Q\Eabc/D +------------------------------------------------------------------ + 0 8 Bra 0 + 3 3 abc + 8 8 Ket + 11 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +First char = 'a' +Need char = 'c' + +/x*+\w/D +------------------------------------------------------------------ + 0 12 Bra 0 + 3 5 Once + 6 x* + 8 5 Ket + 11 \w + 12 12 Ket + 15 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +No first char +No need char + ****Failers + 0: F + xxxxx +No match + +/x?+/D +------------------------------------------------------------------ + 0 11 Bra 0 + 3 5 Once + 6 x? 
+ 8 5 Ket + 11 11 Ket + 14 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +No first char +No need char + +/x++/D +------------------------------------------------------------------ + 0 11 Bra 0 + 3 5 Once + 6 x+ + 8 5 Ket + 11 11 Ket + 14 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +First char = 'x' +No need char + +/x{1,3}+/D +------------------------------------------------------------------ + 0 16 Bra 0 + 3 10 Once + 6 1 x + 9 x{,2} + 13 10 Ket + 16 16 Ket + 19 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +First char = 'x' +No need char + +/(x)*+/D +------------------------------------------------------------------ + 0 19 Bra 0 + 3 13 Once + 6 Brazero + 7 6 Bra 1 + 10 1 x + 13 6 KetRmax + 16 13 Ket + 19 19 Ket + 22 End +------------------------------------------------------------------ +Capturing subpattern count = 1 +No options +No first char +No need char + +/^(\w++|\s++)*$/ +Capturing subpattern count = 1 +Options: anchored +No first char +No need char + now is the time for all good men to come to the aid of the party + 0: now is the time for all good men to come to the aid of the party + 1: party + *** Failers +No match + this is not a line with only words and spaces! 
+No match + +/(\d++)(\w)/ +Capturing subpattern count = 2 +No options +No first char +No need char + 12345a + 0: 12345a + 1: 12345 + 2: a + *** Failers +No match + 12345+ +No match + +/a++b/ +Capturing subpattern count = 0 +No options +First char = 'a' +Need char = 'b' + aaab + 0: aaab + +/(a++b)/ +Capturing subpattern count = 1 +No options +First char = 'a' +Need char = 'b' + aaab + 0: aaab + 1: aaab + +/(a++)b/ +Capturing subpattern count = 1 +No options +First char = 'a' +Need char = 'b' + aaab + 0: aaab + 1: aaa + +/([^()]++|\([^()]*\))+/ +Capturing subpattern count = 1 +No options +No first char +No need char + ((abc(ade)ufh()()x + 0: abc(ade)ufh()()x + 1: x + +/\(([^()]++|\([^()]+\))+\)/ +Capturing subpattern count = 1 +No options +First char = '(' +Need char = ')' + (abc) + 0: (abc) + 1: abc + (abc(def)xyz) + 0: (abc(def)xyz) + 1: xyz + *** Failers +No match + ((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +No match + +/(abc){1,3}+/D +------------------------------------------------------------------ + 0 50 Bra 0 + 3 44 Once + 6 8 Bra 1 + 9 3 abc + 14 8 Ket + 17 Brazero + 18 26 Bra 0 + 21 8 Bra 1 + 24 3 abc + 29 8 Ket + 32 Brazero + 33 8 Bra 1 + 36 3 abc + 41 8 Ket + 44 26 Ket + 47 44 Ket + 50 50 Ket + 53 End +------------------------------------------------------------------ +Capturing subpattern count = 1 +No options +First char = 'a' +Need char = 'c' + +/a+?+/ +Failed: nothing to repeat at offset 3 + +/a{2,3}?+b/ +Failed: nothing to repeat at offset 7 + +/(?U)a+?+/ +Failed: nothing to repeat at offset 7 + +/a{2,3}?+b/U +Failed: nothing to repeat at offset 7 + +/x(?U)a++b/D +------------------------------------------------------------------ + 0 17 Bra 0 + 3 1 x + 6 5 Once + 9 a+ + 11 5 Ket + 14 1 b + 17 17 Ket + 20 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +First char = 'x' +Need char = 'b' + xaaaab + 0: xaaaab + +/(?U)xa++b/D 
+------------------------------------------------------------------ + 0 17 Bra 0 + 3 1 x + 6 5 Once + 9 a+ + 11 5 Ket + 14 1 b + 17 17 Ket + 20 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: ungreedy +First char = 'x' +Need char = 'b' + xaaaab + 0: xaaaab + +/^((a+)(?U)([ab]+)(?-U)([bc]+)(\w*))/D +------------------------------------------------------------------ + 0 106 Bra 0 + 3 ^ + 4 99 Bra 1 + 7 5 Bra 2 + 10 a+ + 12 5 Ket + 15 37 Bra 3 + 18 [a-b]+? + 52 37 Ket + 55 37 Bra 4 + 58 [b-c]+ + 92 37 Ket + 95 5 Bra 5 + 98 \w* +100 5 Ket +103 99 Ket +106 106 Ket +109 End +------------------------------------------------------------------ +Capturing subpattern count = 5 +Options: anchored +No first char +Need char = 'a' + +/^x(?U)a+b/D +------------------------------------------------------------------ + 0 12 Bra 0 + 3 ^ + 4 1 x + 7 a+? + 9 1 b + 12 12 Ket + 15 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: anchored +No first char +Need char = 'b' + +/^x(?U)(a+)b/D +------------------------------------------------------------------ + 0 18 Bra 0 + 3 ^ + 4 1 x + 7 5 Bra 1 + 10 a+? 
+ 12 5 Ket + 15 1 b + 18 18 Ket + 21 End +------------------------------------------------------------------ +Capturing subpattern count = 1 +Options: anchored +No first char +Need char = 'b' + +/[.x.]/ +Failed: POSIX collating elements are not supported at offset 0 + +/[=x=]/ +Failed: POSIX collating elements are not supported at offset 0 + +/[:x:]/ +Failed: POSIX named classes are supported only within a class at offset 0 + +/\l/ +Failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 1 + +/\L/ +Failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 1 + +/\N{name}/ +Failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 1 + +/\pP/ +Failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 1 + +/\PP/ +Failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 1 + +/\p{prop}/ +Failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 1 + +/\P{prop}/ +Failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 1 + +/\u/ +Failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 1 + +/\U/ +Failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 1 + +/\X/ +Failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X at offset 1 + +/[/ +Failed: missing terminating ] for character class at offset 1 + +/[a-/ +Failed: missing terminating ] for character class at offset 3 + +/[[:space:]/ +Failed: missing terminating ] for character class at offset 10 + +/[\s]/DM +Memory allocation (code space): 40 +------------------------------------------------------------------ + 0 36 Bra 0 + 3 [\x09-\x0a\x0c-\x0d ] + 36 36 Ket + 39 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +No first char +No need char + +/[[:space:]]/DM +Memory allocation (code space): 40 +------------------------------------------------------------------ + 0 36 Bra 0 + 3 [\x09-\x0d ] + 36 36 
Ket + 39 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +No first char +No need char + +/[[:space:]abcde]/DM +Memory allocation (code space): 40 +------------------------------------------------------------------ + 0 36 Bra 0 + 3 [\x09-\x0d a-e] + 36 36 Ket + 39 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +No first char +No need char + +/< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >/x +Capturing subpattern count = 0 +Options: extended +First char = '<' +Need char = '>' + <> + 0: <> + + 0: + hij> + 0: hij> + hij> + 0: + def> + 0: def> + + 0: <> + *** Failers +No match + iV\;\?aPhFB\<\*vW\@QW\@sO9\}cfZA\-i\'w\%hKd6gt1UJP\,15_\#QY\$M\^Mss_U\/\]\&LK9\[5vQub\^w\[KDD\ qmj\;2\}YWFdYx\.Ap\]hjCPTP\(n28k\+3\;o\&WXqs\/gOXdr\$\:r\'do0\;b4c\(f_Gr\=\"\\4\)\[01T7ajQJvL\$W\~mL_sS\/4h\:x\*\[ZN\=KLs\&L5zX\/\/\>it\,o\:aU\(\;Z\>pW\&T7oP\'2K\^E\:x9\'c\[\%z\-\,64JQ5AeH_G\#KijUKghQw\^\\vea3a\?kka_G\$8\#\`\*kynsxzBLru\'\]k_\[7FrVx\}\^\=\$blx\>s\-N\%j\;D\*aZDnsw\:YKZ\%Q\.Kne9\#hP\?\+b3\(SOvL\,\^\;\&u5\@\?5C5Bhb\=m\-vEh_L15Jl\]U\)0RP6\{q\%L\^_z5E\'Dw6X\b|DM +Memory allocation (code space): 421 +------------------------------------------------------------------ + 0 417 Bra 0 + 3 250 8J$WE<.rX+ix[d1b!H#?vV0vrK:ZH1=2M>iV;?aPhFB<*vW@QW@sO9}cfZA-i'w%hKd6gt1UJP,15_#QY$M^Mss_U/]&LK9[5vQub^w[KDD qmj;2}YWFdYx.Ap]hjCPTP(n28k+3;o&WXqs/gOXdr$:r'do0;b4c(f_Gr="\4)[01T7ajQJvL$W~mL_sS/4h:x*[ZN=KLs&L5zX//>it,o:aU(;Z>pW&T7oP'2K^E: +255 159 x9'c[%z-,64JQ5AeH_G#KijUKghQw^\vea3a?kka_G$8#`*kynsxzBLru']k_[7FrVx}^=$blx>s-N%j;D*aZDnsw:YKZ%Q.Kne9#hP?+b3(SOvL,^;&u5@?5C5Bhb=m-vEh_L15Jl]U)0RP6{q%L^_z5E'Dw6X +416 \b +417 417 Ket +420 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +First char = '8' +Need char = 'X' + 
+|\$\<\.X\+ix\[d1b\!H\#\?vV0vrK\:ZH1\=2M\>iV\;\?aPhFB\<\*vW\@QW\@sO9\}cfZA\-i\'w\%hKd6gt1UJP\,15_\#QY\$M\^Mss_U\/\]\&LK9\[5vQub\^w\[KDD\ qmj\;2\}YWFdYx\.Ap\]hjCPTP\(n28k\+3\;o\&WXqs\/gOXdr\$\:r\'do0\;b4c\(f_Gr\=\"\\4\)\[01T7ajQJvL\$W\~mL_sS\/4h\:x\*\[ZN\=KLs\&L5zX\/\/\>it\,o\:aU\(\;Z\>pW\&T7oP\'2K\^E\:x9\'c\[\%z\-\,64JQ5AeH_G\#KijUKghQw\^\\vea3a\?kka_G\$8\#\`\*kynsxzBLru\'\]k_\[7FrVx\}\^\=\$blx\>s\-N\%j\;D\*aZDnsw\:YKZ\%Q\.Kne9\#hP\?\+b3\(SOvL\,\^\;\&u5\@\?5C5Bhb\=m\-vEh_L15Jl\]U\)0RP6\{q\%L\^_z5E\'Dw6X\b|DM +Memory allocation (code space): 416 +------------------------------------------------------------------ + 0 412 Bra 0 + 3 250 $<.X+ix[d1b!H#?vV0vrK:ZH1=2M>iV;?aPhFB<*vW@QW@sO9}cfZA-i'w%hKd6gt1UJP,15_#QY$M^Mss_U/]&LK9[5vQub^w[KDD qmj;2}YWFdYx.Ap]hjCPTP(n28k+3;o&WXqs/gOXdr$:r'do0;b4c(f_Gr="\4)[01T7ajQJvL$W~mL_sS/4h:x*[ZN=KLs&L5zX//>it,o:aU(;Z>pW&T7oP'2K^E:x9'c[ +255 154 %z-,64JQ5AeH_G#KijUKghQw^\vea3a?kka_G$8#`*kynsxzBLru']k_[7FrVx}^=$blx>s-N%j;D*aZDnsw:YKZ%Q.Kne9#hP?+b3(SOvL,^;&u5@?5C5Bhb=m-vEh_L15Jl]U)0RP6{q%L^_z5E'Dw6X +411 \b +412 412 Ket +415 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +No options +First char = '$' +Need char = 'X' + +/(.*)\d+\1/I +Capturing subpattern count = 1 +Max back reference = 1 +No options +No first char +No need char + +/(.*)\d+/I +Capturing subpattern count = 1 +No options +First char at start or follows \n +No need char + +/(.*)\d+\1/Is +Capturing subpattern count = 1 +Max back reference = 1 +Options: dotall +No first char +No need char + +/(.*)\d+/Is +Capturing subpattern count = 1 +Options: anchored dotall +No first char +No need char + +/(.*(xyz))\d+\2/I +Capturing subpattern count = 2 +Max back reference = 2 +No options +No first char +Need char = 'z' + +/((.*))\d+\1/I +Capturing subpattern count = 2 +Max back reference = 1 +No options +No first char +No need char + abc123bc + 0: bc123bc + 1: bc + 2: bc + +/a[b]/I +Capturing subpattern count = 0 +No options +First 
char = 'a' +Need char = 'b' + +/(?=a).*/I +Capturing subpattern count = 0 +No options +First char = 'a' +No need char + +/(?=abc).xyz/iI +Capturing subpattern count = 0 +Options: caseless +First char = 'a' (caseless) +Need char = 'z' (caseless) + +/(?=abc)(?i).xyz/I +Capturing subpattern count = 0 +No options +Case state changes +First char = 'a' +Need char = 'z' (caseless) + +/(?=a)(?=b)/I +Capturing subpattern count = 0 +No options +First char = 'a' +No need char + +/(?=.)a/I +Capturing subpattern count = 0 +No options +First char = 'a' +No need char + +/((?=abcda)a)/I +Capturing subpattern count = 1 +No options +First char = 'a' +Need char = 'a' + +/((?=abcda)ab)/I +Capturing subpattern count = 1 +No options +First char = 'a' +Need char = 'b' + +/()a/I +Capturing subpattern count = 1 +No options +No first char +Need char = 'a' + +/(?(1)ab|ac)/I +Capturing subpattern count = 0 +No options +First char = 'a' +No need char + +/(?(1)abz|acz)/I +Capturing subpattern count = 0 +No options +First char = 'a' +Need char = 'z' + +/(?(1)abz)/I +Capturing subpattern count = 0 +No options +No first char +No need char + +/(?(1)abz)123/I +Capturing subpattern count = 0 +No options +No first char +Need char = '3' + +/(a)+/I +Capturing subpattern count = 1 +No options +First char = 'a' +No need char + +/(a){2,3}/I +Capturing subpattern count = 1 +No options +First char = 'a' +Need char = 'a' + +/(a)*/I +Capturing subpattern count = 1 +No options +No first char +No need char + +/[a]/I +Capturing subpattern count = 0 +No options +First char = 'a' +No need char + +/[ab]/I +Capturing subpattern count = 0 +No options +No first char +No need char + +/[ab]/IS +Capturing subpattern count = 0 +No options +No first char +No need char +Starting character set: a b + +/[^a]/I +Capturing subpattern count = 0 +No options +No first char +No need char + +/\d456/I +Capturing subpattern count = 0 +No options +No first char +Need char = '6' + +/\d456/IS +Capturing subpattern count = 0 +No options 
+No first char +Need char = '6' +Starting character set: 0 1 2 3 4 5 6 7 8 9 + +/a^b/I +Capturing subpattern count = 0 +No options +First char = 'a' +Need char = 'b' + +/^a/mI +Capturing subpattern count = 0 +Options: multiline +First char at start or follows \n +Need char = 'a' + abcde + 0: a + xy\nabc + 0: a + *** Failers +No match + xyabc +No match + +/c|abc/I +Capturing subpattern count = 0 +No options +No first char +Need char = 'c' + +/(?i)[ab]/IS +Capturing subpattern count = 0 +Options: caseless +No first char +No need char +Starting character set: A B a b + +/[ab](?i)cd/IS +Capturing subpattern count = 0 +No options +Case state changes +No first char +Need char = 'd' (caseless) +Starting character set: a b + +/abc(?C)def/ +Capturing subpattern count = 0 +No options +First char = 'a' +Need char = 'f' + abcdef +--->abcdef + 0 ^ ^ + 0: abcdef + 1234abcdef +--->1234abcdef + 0 ^ ^ + 0: abcdef + *** Failers +No match + abcxyz +No match + abcxyzf +--->abcxyzf + 0 ^ ^ +No match + +/abc(?C)de(?C1)f/ +Capturing subpattern count = 0 +No options +First char = 'a' +Need char = 'f' + 123abcdef +--->123abcdef + 0 ^ ^ + 1 ^ ^ + 0: abcdef + +/(?C1)\dabc(?C2)def/ +Capturing subpattern count = 0 +No options +No first char +Need char = 'f' + 1234abcdef +--->1234abcdef + 1 ^ + 1 ^ + 1 ^ + 1 ^ + 2 ^ ^ + 0: 4abcdef + *** Failers +No match + abcdef +--->abcdef + 1 ^ + 1 ^ + 1 ^ + 1 ^ + 1 ^ + 1 ^ +No match + +/(?C255)ab/ +Capturing subpattern count = 0 +No options +First char = 'a' +Need char = 'b' + +/(?C256)ab/ +Failed: number after (?C is > 255 at offset 6 + +/(?Cab)xx/ +Failed: closing ) for (?C expected at offset 3 + +/(?C12vr)x/ +Failed: closing ) for (?C expected at offset 5 + +/abc(?C)def/ +Capturing subpattern count = 0 +No options +First char = 'a' +Need char = 'f' + *** Failers +No match + \x83\x0\x61bcdef +--->\x83\x00abcdef + 0 ^ ^ + 0: abcdef + +/(abc)(?C)de(?C1)f/ +Capturing subpattern count = 1 +No options +First char = 'a' +Need char = 'f' + 123abcdef 
+--->123abcdef + 0 ^ ^ + 1 ^ ^ + 0: abcdef + 1: abc + 123abcdef\C+ +Callout 0: last capture = 1 + 0: + 1: abc +--->123abcdef + ^ ^ +Callout 1: last capture = 1 + 0: + 1: abc +--->123abcdef + ^ ^ + 0: abcdef + 1: abc + 123abcdef\C- + 0: abcdef + 1: abc + *** Failers +No match + 123abcdef\C!1 +--->123abcdef + 0 ^ ^ + 1 ^ ^ +No match + +/(?C0)(abc(?C1))*/ +Capturing subpattern count = 1 +No options +No first char +No need char + abcabcabc +--->abcabcabc + 0 ^ + 1 ^ ^ + 1 ^ ^ + 1 ^ ^ + 0: abcabcabc + 1: abc + abcabc\C!1!3 +--->abcabc + 0 ^ + 1 ^ ^ + 1 ^ ^ + 0: abcabc + 1: abc + *** Failers +--->*** Failers + 0 ^ + 0: + abcabcabc\C!1!3 +--->abcabcabc + 0 ^ + 1 ^ ^ + 1 ^ ^ + 1 ^ ^ + 0: abcabc + 1: abc + +/(\d{3}(?C))*/ +Capturing subpattern count = 1 +No options +No first char +No need char + 123\C+ +Callout 0: last capture = -1 + 0: +--->123 + ^ ^ + 0: 123 + 1: 123 + 123456\C+ +Callout 0: last capture = -1 + 0: +--->123456 + ^ ^ +Callout 0: last capture = 1 + 0: + 1: 123 +--->123456 + ^ ^ + 0: 123456 + 1: 456 + 123456789\C+ +Callout 0: last capture = -1 + 0: +--->123456789 + ^ ^ +Callout 0: last capture = 1 + 0: + 1: 123 +--->123456789 + ^ ^ +Callout 0: last capture = 1 + 0: + 1: 456 +--->123456789 + ^ ^ + 0: 123456789 + 1: 789 + +/((xyz)(?C)p|(?C1)xyzabc)/ +Capturing subpattern count = 2 +No options +First char = 'x' +No need char + xyzabc\C+ +Callout 0: last capture = 2 + 0: + 1: + 2: xyz +--->xyzabc + ^ ^ +Callout 1: last capture = -1 + 0: +--->xyzabc + ^ + 0: xyzabc + 1: xyzabc + +/(X)((xyz)(?C)p|(?C1)xyzabc)/ +Capturing subpattern count = 3 +No options +First char = 'X' +Need char = 'x' + Xxyzabc\C+ +Callout 0: last capture = 3 + 0: + 1: X + 2: + 3: xyz +--->Xxyzabc + ^ ^ +Callout 1: last capture = 1 + 0: + 1: X +--->Xxyzabc + ^^ + 0: Xxyzabc + 1: X + 2: xyzabc + +/(?=(abc))(?C)abcdef/ +Capturing subpattern count = 1 +No options +First char = 'a' +Need char = 'f' + abcdef\C+ +Callout 0: last capture = 1 + 0: + 1: abc +--->abcdef + ^ + 0: abcdef + 1: abc + 
+/(?!(abc)(?C1)d)(?C2)abcxyz/ +Capturing subpattern count = 1 +No options +First char = 'a' +Need char = 'z' + abcxyz\C+ +Callout 1: last capture = 1 + 0: + 1: abc +--->abcxyz + ^ ^ +Callout 2: last capture = -1 + 0: +--->abcxyz + ^ + 0: abcxyz + +/(?<=(abc)(?C))xyz/ +Capturing subpattern count = 1 +No options +First char = 'x' +Need char = 'z' + abcxyz\C+ +Callout 0: last capture = 1 + 0: + 1: abc +--->abcxyz + ^ + 0: xyz + 1: abc + +/(?C)abc/ +Capturing subpattern count = 0 +No options +First char = 'a' +Need char = 'c' + +/(?C)^abc/ +Capturing subpattern count = 0 +Options: anchored +No first char +Need char = 'c' + +/(?C)a|b/S +Capturing subpattern count = 0 +No options +No first char +No need char +Starting character set: a b + +/(?R)/ +Failed: recursive call could loop indefinitely at offset 3 + +/(a|(?R))/ +Failed: recursive call could loop indefinitely at offset 6 + +/(ab|(bc|(de|(?R))))/ +Failed: recursive call could loop indefinitely at offset 15 + +/x(ab|(bc|(de|(?R))))/ +Capturing subpattern count = 3 +No options +First char = 'x' +No need char + xab + 0: xab + 1: ab + xbc + 0: xbc + 1: bc + 2: bc + xde + 0: xde + 1: de + 2: de + 3: de + xxab + 0: xxab + 1: xab + 2: xab + 3: xab + xxxab + 0: xxxab + 1: xxab + 2: xxab + 3: xxab + *** Failers +No match + xyab +No match + +/(ab|(bc|(de|(?1))))/ +Failed: recursive call could loop indefinitely at offset 15 + +/x(ab|(bc|(de|(?1)x)x)x)/ +Failed: recursive call could loop indefinitely at offset 16 + +/^([^()]|\((?1)*\))*$/ +Capturing subpattern count = 1 +Options: anchored +No first char +No need char + abc + 0: abc + 1: c + a(b)c + 0: a(b)c + 1: c + a(b(c))d + 0: a(b(c))d + 1: d + *** Failers) +No match + a(b(c)d +No match + +/^>abc>([^()]|\((?1)*\))* abc>123 abc>123 abc>1(2)3 abc>1(2)3 abc>(1(2)3) abc>(1(2)3) + 2: + 3: Satan, oscillate my metallic sonatas + 4: S + A man, a plan, a canal: Panama! + 0: A man, a plan, a canal: Panama! 
+ 1: + 2: + 3: A man, a plan, a canal: Panama + 4: A + Able was I ere I saw Elba. + 0: Able was I ere I saw Elba. + 1: + 2: + 3: Able was I ere I saw Elba + 4: A + *** Failers +No match + The quick brown fox +No match + +/^(\d+|\((?1)([+*-])(?1)\)|-(?1))$/ +Capturing subpattern count = 2 +Options: anchored +No first char +No need char + 12 + 0: 12 + 1: 12 + (((2+2)*-3)-7) + 0: (((2+2)*-3)-7) + 1: (((2+2)*-3)-7) + 2: - + -12 + 0: -12 + 1: -12 + *** Failers +No match + ((2+2)*-3)-7) +No match + +/^(x(y|(?1){2})z)/ +Capturing subpattern count = 2 +Options: anchored +No first char +Need char = 'z' + xyz + 0: xyz + 1: xyz + 2: y + xxyzxyzz + 0: xxyzxyzz + 1: xxyzxyzz + 2: xyzxyz + *** Failers +No match + xxyzz +No match + xxyzxyzxyzz +No match + +/((< (?: (?(R) \d++ | [^<>]*+) | (?2)) * >))/x +Capturing subpattern count = 2 +Options: extended +First char = '<' +Need char = '>' + <> + 0: <> + 1: <> + 2: <> + + 0: + 1: + 2: + hij> + 0: hij> + 1: hij> + 2: hij> + hij> + 0: + 1: + 2: + def> + 0: def> + 1: def> + 2: def> + + 0: <> + 1: <> + 2: <> + *** Failers +No match + b|c)d(?P e)/D +------------------------------------------------------------------ + 0 33 Bra 0 + 3 1 a + 6 6 Bra 1 + 9 1 b + 12 6 Alt + 15 1 c + 18 12 Ket + 21 1 d + 24 6 Bra 2 + 27 1 e + 30 6 Ket + 33 33 Ket + 36 End +------------------------------------------------------------------ +Capturing subpattern count = 2 +Named capturing subpatterns: + longername2 2 + name1 1 +No options +First char = 'a' +Need char = 'e' + abde + 0: abde + 1: b + 2: e + acde + 0: acde + 1: c + 2: e + +/(?:a(?P c(?P d)))(?Pa)/D +------------------------------------------------------------------ + 0 39 Bra 0 + 3 24 Bra 0 + 6 1 a + 9 15 Bra 1 + 12 1 c + 15 6 Bra 2 + 18 1 d + 21 6 Ket + 24 15 Ket + 27 24 Ket + 30 6 Bra 3 + 33 1 a + 36 6 Ket + 39 39 Ket + 42 End +------------------------------------------------------------------ +Capturing subpattern count = 3 +Named capturing subpatterns: + a 3 + c 1 + d 2 +No options +First char = 
'a' +Need char = 'a' + +/(?Pa)...(?P=a)bbb(?P>a)d/D +------------------------------------------------------------------ + 0 29 Bra 0 + 3 6 Bra 1 + 6 1 a + 9 6 Ket + 12 Any + 13 Any + 14 Any + 15 \1 + 18 3 bbb + 23 3 Recurse + 26 1 d + 29 29 Ket + 32 End +------------------------------------------------------------------ +Capturing subpattern count = 1 +Named capturing subpatterns: + a 1 +No options +First char = 'a' +Need char = 'd' + / End of testinput2 / Capturing subpattern count = 0 No options diff --git a/ext/pcre/pcrelib/testdata/testoutput3 b/ext/pcre/pcrelib/testdata/testoutput3 index cbe9aaa7553..8cc3e8dc646 100644 --- a/ext/pcre/pcrelib/testdata/testoutput3 +++ b/ext/pcre/pcrelib/testdata/testoutput3 @@ -1,2991 +1,116 @@ -PCRE version 3.9 02-Jan-2002 +PCRE version 3.92 11-Sep-2002 -/(?.*/)foo" - /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/it/you/see/ -No match - -"(?>.*/)foo" - /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo - 0: /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo - -/(?>(\.\d\d[1-9]?))\d+/ - 1.230003938 - 0: .230003938 - 1: .23 - 1.875000282 - 0: .875000282 - 1: .875 - *** Failers -No match - 1.235 -No match - -/^((?>\w+)|(?>\s+))*$/ - now is the time for all good men to come to the aid of the party - 0: now is the time for all good men to come to the aid of the party - 1: party - *** Failers -No match - this is not a line with only words and spaces! 
-No match - -/(\d+)(\w)/ - 12345a - 0: 12345a - 1: 12345 - 2: a - 12345+ - 0: 12345 - 1: 1234 - 2: 5 - -/((?>\d+))(\w)/ - 12345a - 0: 12345a - 1: 12345 - 2: a - *** Failers -No match - 12345+ -No match - -/(?>a+)b/ - aaab - 0: aaab - -/((?>a+)b)/ - aaab - 0: aaab - 1: aaab - -/(?>(a+))b/ - aaab - 0: aaab - 1: aaa - -/(?>b)+/ - aaabbbccc - 0: bbb - -/(?>a+|b+|c+)*c/ - aaabbbbccccd - 0: aaabbbbc - -/((?>[^()]+)|\([^()]*\))+/ - ((abc(ade)ufh()()x - 0: abc(ade)ufh()()x - 1: x - -/\(((?>[^()]+)|\([^()]+\))+\)/ - (abc) - 0: (abc) - 1: abc - (abc(def)xyz) - 0: (abc(def)xyz) - 1: xyz - *** Failers -No match - ((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa -No match - -/a(?-i)b/i - ab - 0: ab - *** Failers -No match - Ab -No match - aB -No match - AB -No match - -/(a (?x)b c)d e/ - a bcd e - 0: a bcd e - 1: a bc - *** Failers -No match - a b cd e -No match - abcd e -No match - a bcde -No match - -/(a b(?x)c d (?-x)e f)/ - a bcde f - 0: a bcde f - 1: a bcde f - *** Failers -No match - abcdef -No match - -/(a(?i)b)c/ - abc - 0: abc - 1: ab - aBc - 0: aBc - 1: aB - *** Failers -No match - abC -No match - aBC -No match - Abc -No match - ABc -No match - ABC -No match - AbC -No match - -/a(?i:b)c/ - abc - 0: abc - aBc - 0: aBc - *** Failers -No match - ABC -No match - abC -No match - aBC -No match - -/a(?i:b)*c/ - aBc - 0: aBc - aBBc - 0: aBBc - *** Failers -No match - aBC -No match - aBBC -No match - -/a(?=b(?i)c)\w\wd/ - abcd - 0: abcd - abCd - 0: abCd - *** Failers -No match - aBCd -No match - abcD -No match - -/(?s-i:more.*than).*million/i - more than million - 0: more than million - more than MILLION - 0: more than MILLION - more \n than Million - 0: more \x0a than Million - *** Failers -No match - MORE THAN MILLION -No match - more \n than \n million -No match - -/(?:(?s-i)more.*than).*million/i - more than million - 0: more than million - more than MILLION - 0: more than MILLION - more \n than Million - 0: more \x0a than Million - *** Failers -No match - MORE THAN 
MILLION -No match - more \n than \n million -No match - -/(?>a(?i)b+)+c/ - abc - 0: abc - aBbc - 0: aBbc - aBBc - 0: aBBc - *** Failers -No match - Abc -No match - abAb -No match - abbC -No match - -/(?=a(?i)b)\w\wc/ - abc - 0: abc - aBc - 0: aBc - *** Failers -No match - Ab -No match - abC -No match - aBC -No match - -/(?<=a(?i)b)(\w\w)c/ - abxxc - 0: xxc - 1: xx - aBxxc - 0: xxc - 1: xx - *** Failers -No match - Abxxc -No match - ABxxc -No match - abxxC -No match - -/(?:(a)|b)(?(1)A|B)/ - aA - 0: aA - 1: a - bB - 0: bB - *** Failers -No match - aB -No match - bA -No match - -/^(a)?(?(1)a|b)+$/ - aa - 0: aa - 1: a - b - 0: b - bb - 0: bb - *** Failers -No match - ab -No match - -/^(?(?=abc)\w{3}:|\d\d)$/ - abc: - 0: abc: - 12 - 0: 12 - *** Failers -No match - 123 -No match - xyz -No match - -/^(?(?!abc)\d\d|\w{3}:)$/ - abc: - 0: abc: - 12 - 0: 12 - *** Failers -No match - 123 -No match - xyz -No match - -/(?(?<=foo)bar|cat)/ - foobar - 0: bar - cat - 0: cat - fcat - 0: cat - focat - 0: cat - *** Failers -No match - foocat -No match - -/(?(?a*)*/ - a - 0: a - aa - 0: aa - aaaa - 0: aaaa - -/(abc|)+/ - abc - 0: abc - 1: - abcabc - 0: abcabc - 1: - abcabcabc - 0: abcabcabc - 1: - xyz - 0: - 1: - -/([a]*)*/ - a - 0: a - 1: - aaaaa - 0: aaaaa - 1: - -/([ab]*)*/ - a - 0: a - 1: - b - 0: b - 1: - ababab - 0: ababab - 1: - aaaabcde - 0: aaaab - 1: - bbbb - 0: bbbb - 1: - -/([^a]*)*/ - b - 0: b - 1: - bbbb - 0: bbbb - 1: - aaa - 0: - 1: - -/([^ab]*)*/ - cccc - 0: cccc - 1: - abab - 0: - 1: - -/([a]*?)*/ - a - 0: - 1: - aaaa - 0: - 1: - -/([ab]*?)*/ - a - 0: - 1: - b - 0: - 1: - abab - 0: - 1: - baba - 0: - 1: - -/([^a]*?)*/ - b - 0: - 1: - bbbb - 0: - 1: - aaa - 0: - 1: - -/([^ab]*?)*/ - c - 0: - 1: - cccc - 0: - 1: - baba - 0: - 1: - -/(?>a*)*/ - a - 0: a - aaabcde - 0: aaa - -/((?>a*))*/ - aaaaa - 0: aaaaa - 1: - aabbaa - 0: aa - 1: - -/((?>a*?))*/ - aaaaa - 0: - 1: - aabbaa - 0: - 1: - -/(?(?=[^a-z]+[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) /x - 12-sep-98 - 0: 
12-sep-98 - 12-09-98 - 0: 12-09-98 - *** Failers -No match - sep-12-98 -No match - -/(?<=(foo))bar\1/ - foobarfoo - 0: barfoo - 1: foo - foobarfootling - 0: barfoo - 1: foo - *** Failers -No match - foobar -No match - barfoo -No match - -/(?i:saturday|sunday)/ - saturday - 0: saturday - sunday - 0: sunday - Saturday - 0: Saturday - Sunday - 0: Sunday - SATURDAY - 0: SATURDAY - SUNDAY - 0: SUNDAY - SunDay - 0: SunDay - -/(a(?i)bc|BB)x/ - abcx - 0: abcx - 1: abc - aBCx - 0: aBCx - 1: aBC - bbx - 0: bbx - 1: bb - BBx - 0: BBx - 1: BB - *** Failers -No match - abcX -No match - aBCX -No match - bbX -No match - BBX -No match - -/^([ab](?i)[cd]|[ef])/ - ac - 0: ac - 1: ac - aC - 0: aC - 1: aC - bD - 0: bD - 1: bD - elephant - 0: e - 1: e - Europe - 0: E - 1: E - frog - 0: f - 1: f - France - 0: F - 1: F - *** Failers -No match - Africa -No match - -/^(ab|a(?i)[b-c](?m-i)d|x(?i)y|z)/ - ab - 0: ab - 1: ab - aBd - 0: aBd - 1: aBd - xy - 0: xy - 1: xy - xY - 0: xY - 1: xY - zebra - 0: z - 1: z - Zambesi - 0: Z - 1: Z - *** Failers -No match - aCD -No match - XY -No match - -/(?<=foo\n)^bar/m - foo\nbar - 0: bar - *** Failers -No match - bar -No match - baz\nbar -No match - -/(?<=(?]&/ - <&OUT - 0: <& - -/^(a\1?){4}$/ - aaaaaaaaaa - 0: aaaaaaaaaa - 1: aaaa - *** Failers -No match - AB -No match - aaaaaaaaa -No match - aaaaaaaaaaa -No match - -/^(a(?(1)\1)){4}$/ - aaaaaaaaaa - 0: aaaaaaaaaa - 1: aaaa - *** Failers -No match - aaaaaaaaa -No match - aaaaaaaaaaa -No match - -/(?:(f)(o)(o)|(b)(a)(r))*/ - foobar - 0: foobar - 1: f - 2: o - 3: o - 4: b - 5: a - 6: r - -/(?<=a)b/ - ab - 0: b - *** Failers -No match - cb -No match - b -No match - -/(? 
- 2: abcd - xy:z:::abcd - 0: xy:z:::abcd - 1: xy:z::: - 2: abcd - -/^[^bcd]*(c+)/ - aexycd - 0: aexyc - 1: c - -/(a*)b+/ - caab - 0: aab - 1: aa - -/([\w:]+::)?(\w+)$/ - abcd - 0: abcd - 1: - 2: abcd - xy:z:::abcd - 0: xy:z:::abcd - 1: xy:z::: - 2: abcd - *** Failers - 0: Failers - 1: + 0: *** Failers + 1: *** 2: Failers - abcd: -No match - abcd: + École No match -/^[^bcd]*(c+)/ - aexycd - 0: aexyc - 1: c - -/(>a+)ab/ - -/(?>a+)b/ - aaab - 0: aaab - -/([[:]+)/ - a:[b]: - 0: :[ - 1: :[ - -/([[=]+)/ - a=[b]= - 0: =[ - 1: =[ - -/([[.]+)/ - a.[b]. - 0: .[ - 1: .[ - -/((?>a+)b)/ - aaab - 0: aaab - 1: aaab - -/(?>(a+))b/ - aaab - 0: aaab - 1: aaa - -/((?>[^()]+)|\([^()]*\))+/ - ((abc(ade)ufh()()x - 0: abc(ade)ufh()()x - 1: x - -/a\Z/ +/École/i + École + 0: \xc9cole *** Failers No match - aaab -No match - a\nb\n + école No match -/b\Z/ - a\nb\n - 0: b +/École/iLfr + École + 0: École + école + 0: école -/b\z/ +/\w/IS +Capturing subpattern count = 0 +No options +No first char +No need char +Starting character set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P + Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z -/b\Z/ - a\nb - 0: b +/\w/ISLfr +Capturing subpattern count = 0 +No options +No first char +No need char +Starting character set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P + Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z + À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å + æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ -/b\z/ - a\nb - 0: b - *** Failers +/^[\xc8-\xc9]/iLfr + École + 0: É + école + 0: é + +/^[\xc8-\xc9]/Lfr + École + 0: É + *** Failers No match - -/^(?>(?(1)\.|())[^\W_](?>[a-z0-9-]*[^\W_])?)+$/ - a - 0: a - 1: - abc - 0: abc - 1: - a-b - 0: a-b - 1: - 0-9 - 0: 0-9 - 1: - a.b - 0: a.b - 1: - 5.6.7 - 0: 5.6.7 - 1: - the.quick.brown.fox - 0: the.quick.brown.fox - 1: - a100.b200.300c - 0: a100.b200.300c - 1: - 12-ab.1245 - 0: 12-ab.1245 - 1: - ***Failers -No 
match - \ -No match - .a -No match - -a -No match - a- -No match - a. -No match - a_b -No match - a.- -No match - a.. -No match - ab..bc -No match - the.quick.brown.fox- -No match - the.quick.brown.fox. -No match - the.quick.brown.fox_ -No match - the.quick.brown.fox+ + école No match -/(?>.*)(?<=(abcd|wxyz))/ - alphabetabcd - 0: alphabetabcd - 1: abcd - endingwxyz - 0: endingwxyz - 1: wxyz - *** Failers -No match - a rather long string that doesn't end with one of them -No match - -/word (?>(?:(?!otherword)[a-zA-Z0-9]+ ){0,30})otherword/ - word cat dog elephant mussel cow horse canary baboon snake shark otherword - 0: word cat dog elephant mussel cow horse canary baboon snake shark otherword - word cat dog elephant mussel cow horse canary baboon snake shark -No match - -/word (?>[a-zA-Z0-9]+ ){0,30}otherword/ - word cat dog elephant mussel cow horse canary baboon snake shark the quick brown fox and the lazy dog and several other words getting close to thirty by now I hope -No match - -/(?<=\d{3}(?!999))foo/ - 999foo - 0: foo - 123999foo - 0: foo - *** Failers -No match - 123abcfoo -No match - -/(?<=(?!...999)\d{3})foo/ - 999foo - 0: foo - 123999foo - 0: foo - *** Failers -No match - 123abcfoo -No match - -/(?<=\d{3}(?!999)...)foo/ - 123abcfoo - 0: foo - 123456foo - 0: foo - *** Failers -No match - 123999foo -No match - -/(?<=\d{3}...)(? 
- 2: - 3: abcd - - 2: - 3: abcd - \s*)=(?>\s*) # find - 2: - 3: abcd - Z)+|A)*/ - ZABCDEFG - 0: ZA - 1: A - -/((?>)+|A)*/ - ZABCDEFG - 0: - 1: - -/a*/g - abbab - 0: a - 0: - 0: - 0: a - 0: - 0: - -/^[a-\d]/ - abcde - 0: a - -things - 0: - - 0digit - 0: 0 - *** Failers -No match - bcdef -No match - -/^[\d-a]/ - abcde - 0: a - -things - 0: - - 0digit - 0: 0 - *** Failers -No match - bcdef -No match - -/ End of testinput3 / +/ End of testinput3 / diff --git a/ext/pcre/pcrelib/testdata/testoutput4 b/ext/pcre/pcrelib/testdata/testoutput4 index df81a0f5480..3018c9baa7f 100644 --- a/ext/pcre/pcrelib/testdata/testoutput4 +++ b/ext/pcre/pcrelib/testdata/testoutput4 @@ -1,116 +1,304 @@ -PCRE version 3.9 02-Jan-2002 +PCRE version 3.92 11-Sep-2002 -/^[\w]+/ +/-- Do not use the \x{} construct except with patterns that have the --/ +/-- /8 option set, because PCRE doesn't recognize them as UTF-8 unless --/ +No match +/-- that option is set. However, the latest Perls recognize them always. --/ +No match + +/a.b/8 + acb + 0: acb + a\x7fb + 0: a\x{7f}b + a\x{100}b + 0: a\x{100}b *** Failers No match - École + a\nb No match -/^[\w]+/Lfr - École - 0: École - -/^[\w]+/ +/a(.{3})b/8 + a\x{4000}xyb + 0: a\x{4000}xyb + 1: \x{4000}xy + a\x{4000}\x7fyb + 0: a\x{4000}\x{7f}yb + 1: \x{4000}\x{7f}y + a\x{4000}\x{100}yb + 0: a\x{4000}\x{100}yb + 1: \x{4000}\x{100}y *** Failers No match - École + a\x{4000}b +No match + ac\ncb No match -/^[\W]+/ - École - 0: \xc9 +/a(.*?)(.)/ + a\xc0\x88b + 0: a\xc0 + 1: + 2: \xc0 -/^[\W]+/Lfr - *** Failers - 0: *** - École -No match +/a(.*?)(.)/8 + a\x{100}b + 0: a\x{100} + 1: + 2: \x{100} -/[\b]/ - \b - 0: \x08 +/a(.*)(.)/ + a\xc0\x88b + 0: a\xc0\x88b + 1: \xc0\x88 + 2: b + +/a(.*)(.)/8 + a\x{100}b + 0: a\x{100}b + 1: \x{100} + 2: b + +/a(.)(.)/ + a\xc0\x92bcd + 0: a\xc0\x92 + 1: \xc0 + 2: \x92 + +/a(.)(.)/8 + a\x{240}bcd + 0: a\x{240}b + 1: \x{240} + 2: b + +/a(.?)(.)/ + a\xc0\x92bcd + 0: a\xc0\x92 + 1: \xc0 + 2: \x92 + +/a(.?)(.)/8 + a\x{240}bcd + 0: 
a\x{240}b + 1: \x{240} + 2: b + +/a(.??)(.)/ + a\xc0\x92bcd + 0: a\xc0 + 1: + 2: \xc0 + +/a(.??)(.)/8 + a\x{240}bcd + 0: a\x{240} + 1: + 2: \x{240} + +/a(.{3})b/8 + a\x{1234}xyb + 0: a\x{1234}xyb + 1: \x{1234}xy + a\x{1234}\x{4321}yb + 0: a\x{1234}\x{4321}yb + 1: \x{1234}\x{4321}y + a\x{1234}\x{4321}\x{3412}b + 0: a\x{1234}\x{4321}\x{3412}b + 1: \x{1234}\x{4321}\x{3412} *** Failers No match - a + a\x{1234}b +No match + ac\ncb No match -/[\b]/Lfr - \b - 0: \x08 +/a(.{3,})b/8 + a\x{1234}xyb + 0: a\x{1234}xyb + 1: \x{1234}xy + a\x{1234}\x{4321}yb + 0: a\x{1234}\x{4321}yb + 1: \x{1234}\x{4321}y + a\x{1234}\x{4321}\x{3412}b + 0: a\x{1234}\x{4321}\x{3412}b + 1: \x{1234}\x{4321}\x{3412} + axxxxbcdefghijb + 0: axxxxbcdefghijb + 1: xxxxbcdefghij + a\x{1234}\x{4321}\x{3412}\x{3421}b + 0: a\x{1234}\x{4321}\x{3412}\x{3421}b + 1: \x{1234}\x{4321}\x{3412}\x{3421} *** Failers No match - a + a\x{1234}b No match -/^\w+/ +/a(.{3,}?)b/8 + a\x{1234}xyb + 0: a\x{1234}xyb + 1: \x{1234}xy + a\x{1234}\x{4321}yb + 0: a\x{1234}\x{4321}yb + 1: \x{1234}\x{4321}y + a\x{1234}\x{4321}\x{3412}b + 0: a\x{1234}\x{4321}\x{3412}b + 1: \x{1234}\x{4321}\x{3412} + axxxxbcdefghijb + 0: axxxxb + 1: xxxx + a\x{1234}\x{4321}\x{3412}\x{3421}b + 0: a\x{1234}\x{4321}\x{3412}\x{3421}b + 1: \x{1234}\x{4321}\x{3412}\x{3421} *** Failers No match - École + a\x{1234}b No match -/^\w+/Lfr - École - 0: École - -/(.+)\b(.+)/ - École - 0: \xc9cole - 1: \xc9 - 2: cole - -/(.+)\b(.+)/Lfr - *** Failers - 0: *** Failers - 1: *** - 2: Failers - École -No match - -/École/i - École - 0: \xc9cole +/a(.{3,5})b/8 + a\x{1234}xyb + 0: a\x{1234}xyb + 1: \x{1234}xy + a\x{1234}\x{4321}yb + 0: a\x{1234}\x{4321}yb + 1: \x{1234}\x{4321}y + a\x{1234}\x{4321}\x{3412}b + 0: a\x{1234}\x{4321}\x{3412}b + 1: \x{1234}\x{4321}\x{3412} + axxxxbcdefghijb + 0: axxxxb + 1: xxxx + a\x{1234}\x{4321}\x{3412}\x{3421}b + 0: a\x{1234}\x{4321}\x{3412}\x{3421}b + 1: \x{1234}\x{4321}\x{3412}\x{3421} + axbxxbcdefghijb + 0: axbxxb + 1: xbxx + axxxxxbcdefghijb 
+ 0: axxxxxb + 1: xxxxx *** Failers No match - école + a\x{1234}b +No match + axxxxxxbcdefghijb No match -/École/iLfr - École - 0: École - école - 0: école +/a(.{3,5}?)b/8 + a\x{1234}xyb + 0: a\x{1234}xyb + 1: \x{1234}xy + a\x{1234}\x{4321}yb + 0: a\x{1234}\x{4321}yb + 1: \x{1234}\x{4321}y + a\x{1234}\x{4321}\x{3412}b + 0: a\x{1234}\x{4321}\x{3412}b + 1: \x{1234}\x{4321}\x{3412} + axxxxbcdefghijb + 0: axxxxb + 1: xxxx + a\x{1234}\x{4321}\x{3412}\x{3421}b + 0: a\x{1234}\x{4321}\x{3412}\x{3421}b + 1: \x{1234}\x{4321}\x{3412}\x{3421} + axbxxbcdefghijb + 0: axbxxb + 1: xbxx + axxxxxbcdefghijb + 0: axxxxxb + 1: xxxxx + *** Failers +No match + a\x{1234}b +No match + axxxxxxbcdefghijb +No match -/\w/IS -Capturing subpattern count = 0 -No options -No first char -No need char -Starting character set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P - Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z +/^[a\x{c0}]/8 + *** Failers +No match + \x{100} +No match -/\w/ISLfr -Capturing subpattern count = 0 -No options -No first char -No need char -Starting character set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P - Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z - À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å - æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ +/(?<=aXb)cd/8 + aXbcd + 0: cd -/^[\xc8-\xc9]/iLfr - École - 0: É - école - 0: é +/(?<=a\x{100}b)cd/8 + a\x{100}bcd + 0: cd -/^[\xc8-\xc9]/Lfr - École - 0: É +/(?<=a\x{100000}b)cd/8 + a\x{100000}bcd + 0: cd + +/(?:\x{100}){3}b/8 + \x{100}\x{100}\x{100}b + 0: \x{100}\x{100}\x{100}b *** Failers No match - école + \x{100}\x{100}b No match +/\x{ab}/8 + \x{ab} + 0: \x{ab} + \xc2\xab + 0: \x{ab} + *** Failers +No match + \x00{ab} +No match + +/(?<=(.))X/8 + WXYZ + 0: X + 1: W + \x{256}XYZ + 0: X + 1: \x{256} + *** Failers +No match + XYZ +No match + +/X(\C{3})/8 + X\x{1234} + 0: X\x{1234} + 1: \x{1234} + +/X(\C{4})/8 + X\x{1234}YZ + 0: 
X\x{1234}Y + 1: \x{1234}Y + +/X\C*/8 + XYZabcdce + 0: XYZabcdce + +/X\C*?/8 + XYZabcde + 0: X + +/X\C{3,5}/8 + Xabcdefg + 0: Xabcde + X\x{1234} + 0: X\x{1234} + X\x{1234}YZ + 0: X\x{1234}YZ + X\x{1234}\x{512} + 0: X\x{1234}\x{512} + X\x{1234}\x{512}YZ + 0: X\x{1234}\x{512} + +/X\C{3,5}?/8 + Xabcdefg + 0: Xabc + X\x{1234} + 0: X\x{1234} + X\x{1234}YZ + 0: X\x{1234} + X\x{1234}\x{512} + 0: X\x{1234} + / End of testinput4 / diff --git a/ext/pcre/pcrelib/testdata/testoutput5 b/ext/pcre/pcrelib/testdata/testoutput5 index 6bb9ad31b4e..01daca505ee 100644 --- a/ext/pcre/pcrelib/testdata/testoutput5 +++ b/ext/pcre/pcrelib/testdata/testoutput5 @@ -1,242 +1,339 @@ -PCRE version 3.9 02-Jan-2002 +PCRE version 3.92 11-Sep-2002 -/-- Because of problems with Perl 5.6 in handling UTF-8 vs non UTF-8 --/ -/-- strings automatically, do not use the \x{} construct except with --/ -No match -/-- patterns that have the /8 option set, and don't use them without! --/ -No match +/\x{100}/8DM +Memory allocation (code space): 11 +------------------------------------------------------------------ + 0 7 Bra 0 + 3 2 \xc4\x80 + 7 7 Ket + 10 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 196 +Need char = 128 -/a.b/8 - acb - 0: acb - a\x7fb - 0: a\x{7f}b - a\x{100}b - 0: a\x{100}b - *** Failers -No match - a\nb -No match +/\x{1000}/8DM +Memory allocation (code space): 12 +------------------------------------------------------------------ + 0 8 Bra 0 + 3 3 \xe1\x80\x80 + 8 8 Ket + 11 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 225 +Need char = 128 -/a(.{3})b/8 - a\x{4000}xyb - 0: a\x{4000}xyb - 1: \x{4000}xy - a\x{4000}\x7fyb - 0: a\x{4000}\x{7f}yb - 1: \x{4000}\x{7f}y - a\x{4000}\x{100}yb - 0: a\x{4000}\x{100}yb - 1: \x{4000}\x{100}y - *** Failers -No match - a\x{4000}b -No match - ac\ncb -No match +/\x{10000}/8DM +Memory 
allocation (code space): 13 +------------------------------------------------------------------ + 0 9 Bra 0 + 3 4 \xf0\x90\x80\x80 + 9 9 Ket + 12 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 240 +Need char = 128 -/a(.*?)(.)/ - a\xc0\x88b - 0: a\xc0 - 1: - 2: \xc0 +/\x{100000}/8DM +Memory allocation (code space): 13 +------------------------------------------------------------------ + 0 9 Bra 0 + 3 4 \xf4\x80\x80\x80 + 9 9 Ket + 12 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 244 +Need char = 128 -/a(.*?)(.)/8 - a\x{100}b - 0: a\x{100} - 1: - 2: \x{100} +/\x{1000000}/8DM +Memory allocation (code space): 14 +------------------------------------------------------------------ + 0 10 Bra 0 + 3 5 \xf9\x80\x80\x80\x80 + 10 10 Ket + 13 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 249 +Need char = 128 -/a(.*)(.)/ - a\xc0\x88b - 0: a\xc0\x88b - 1: \xc0\x88 - 2: b +/\x{4000000}/8DM +Memory allocation (code space): 15 +------------------------------------------------------------------ + 0 11 Bra 0 + 3 6 \xfc\x84\x80\x80\x80\x80 + 11 11 Ket + 14 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 252 +Need char = 128 -/a(.*)(.)/8 - a\x{100}b - 0: a\x{100}b - 1: \x{100} - 2: b +/\x{7fffFFFF}/8DM +Memory allocation (code space): 15 +------------------------------------------------------------------ + 0 11 Bra 0 + 3 6 \xfd\xbf\xbf\xbf\xbf\xbf + 11 11 Ket + 14 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 253 +Need char = 191 -/a(.)(.)/ - a\xc0\x92bcd - 0: a\xc0\x92 - 1: \xc0 - 2: \x92 +/[\x{ff}]/8DM +Memory allocation (code space): 40 
+------------------------------------------------------------------ + 0 6 Bra 0 + 3 1 \xff + 6 6 Ket + 9 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 255 +No need char -/a(.)(.)/8 - a\x{240}bcd - 0: a\x{240}b - 1: \x{240} - 2: b +/[\x{100}]/8DM +Failed: characters with values > 255 are not yet supported in classes at offset 7 -/a(.?)(.)/ - a\xc0\x92bcd - 0: a\xc0\x92 - 1: \xc0 - 2: \x92 +/\x{ffffffff}/8 +Failed: character value in \x{...} sequence is too large at offset 11 -/a(.?)(.)/8 - a\x{240}bcd - 0: a\x{240}b - 1: \x{240} - 2: b +/\x{100000000}/8 +Failed: character value in \x{...} sequence is too large at offset 12 -/a(.??)(.)/ - a\xc0\x92bcd - 0: a\xc0 - 1: - 2: \xc0 +/^\x{100}a\x{1234}/8 + \x{100}a\x{1234}bcd + 0: \x{100}a\x{1234} -/a(.??)(.)/8 - a\x{240}bcd - 0: a\x{240} - 1: - 2: \x{240} +/\x80/8D +------------------------------------------------------------------ + 0 7 Bra 0 + 3 2 \xc2\x80 + 7 7 Ket + 10 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 194 +Need char = 128 -/a(.{3})b/8 - a\x{1234}xyb - 0: a\x{1234}xyb - 1: \x{1234}xy - a\x{1234}\x{4321}yb - 0: a\x{1234}\x{4321}yb - 1: \x{1234}\x{4321}y - a\x{1234}\x{4321}\x{3412}b - 0: a\x{1234}\x{4321}\x{3412}b - 1: \x{1234}\x{4321}\x{3412} - *** Failers -No match - a\x{1234}b -No match - ac\ncb -No match +/\xff/8D +------------------------------------------------------------------ + 0 7 Bra 0 + 3 2 \xc3\xbf + 7 7 Ket + 10 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 195 +Need char = 191 -/a(.{3,})b/8 - a\x{1234}xyb - 0: a\x{1234}xyb - 1: \x{1234}xy - a\x{1234}\x{4321}yb - 0: a\x{1234}\x{4321}yb - 1: \x{1234}\x{4321}y - a\x{1234}\x{4321}\x{3412}b - 0: a\x{1234}\x{4321}\x{3412}b - 1: \x{1234}\x{4321}\x{3412} - axxxxbcdefghijb - 0: axxxxbcdefghijb - 
1: xxxxbcdefghij - a\x{1234}\x{4321}\x{3412}\x{3421}b - 0: a\x{1234}\x{4321}\x{3412}\x{3421}b - 1: \x{1234}\x{4321}\x{3412}\x{3421} - *** Failers -No match - a\x{1234}b -No match - -/a(.{3,}?)b/8 - a\x{1234}xyb - 0: a\x{1234}xyb - 1: \x{1234}xy - a\x{1234}\x{4321}yb - 0: a\x{1234}\x{4321}yb - 1: \x{1234}\x{4321}y - a\x{1234}\x{4321}\x{3412}b - 0: a\x{1234}\x{4321}\x{3412}b - 1: \x{1234}\x{4321}\x{3412} - axxxxbcdefghijb - 0: axxxxb - 1: xxxx - a\x{1234}\x{4321}\x{3412}\x{3421}b - 0: a\x{1234}\x{4321}\x{3412}\x{3421}b - 1: \x{1234}\x{4321}\x{3412}\x{3421} - *** Failers -No match - a\x{1234}b -No match - -/a(.{3,5})b/8 - a\x{1234}xyb - 0: a\x{1234}xyb - 1: \x{1234}xy - a\x{1234}\x{4321}yb - 0: a\x{1234}\x{4321}yb - 1: \x{1234}\x{4321}y - a\x{1234}\x{4321}\x{3412}b - 0: a\x{1234}\x{4321}\x{3412}b - 1: \x{1234}\x{4321}\x{3412} - axxxxbcdefghijb - 0: axxxxb - 1: xxxx - a\x{1234}\x{4321}\x{3412}\x{3421}b - 0: a\x{1234}\x{4321}\x{3412}\x{3421}b - 1: \x{1234}\x{4321}\x{3412}\x{3421} - axbxxbcdefghijb - 0: axbxxb - 1: xbxx - axxxxxbcdefghijb - 0: axxxxxb - 1: xxxxx - *** Failers -No match - a\x{1234}b -No match - axxxxxxbcdefghijb -No match - -/a(.{3,5}?)b/8 - a\x{1234}xyb - 0: a\x{1234}xyb - 1: \x{1234}xy - a\x{1234}\x{4321}yb - 0: a\x{1234}\x{4321}yb - 1: \x{1234}\x{4321}y - a\x{1234}\x{4321}\x{3412}b - 0: a\x{1234}\x{4321}\x{3412}b - 1: \x{1234}\x{4321}\x{3412} - axxxxbcdefghijb - 0: axxxxb - 1: xxxx - a\x{1234}\x{4321}\x{3412}\x{3421}b - 0: a\x{1234}\x{4321}\x{3412}\x{3421}b - 1: \x{1234}\x{4321}\x{3412}\x{3421} - axbxxbcdefghijb - 0: axbxxb - 1: xbxx - axxxxxbcdefghijb - 0: axxxxxb - 1: xxxxx - *** Failers -No match - a\x{1234}b -No match - axxxxxxbcdefghijb -No match - -/^[a\x{c0}]/8 - *** Failers -No match - \x{100} -No match - -/(?<=aXb)cd/8 - aXbcd - 0: cd - -/(?<=a\x{100}b)cd/8 - a\x{100}bcd - 0: cd - -/(?<=a\x{100000}b)cd/8 - a\x{100000}bcd - 0: cd +/\x{0041}\x{2262}\x{0391}\x{002e}/D8 +------------------------------------------------------------------ + 0 12 Bra 
0 + 3 7 A\xe2\x89\xa2\xce\x91. + 12 12 Ket + 15 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 'A' +Need char = '.' + \x{0041}\x{2262}\x{0391}\x{002e} + 0: A\x{2262}\x{391}. -/(?:\x{100}){3}b/8 - \x{100}\x{100}\x{100}b - 0: \x{100}\x{100}\x{100}b - *** Failers +/\x{D55c}\x{ad6d}\x{C5B4}/D8 +------------------------------------------------------------------ + 0 14 Bra 0 + 3 9 \xed\x95\x9c\xea\xb5\xad\xec\x96\xb4 + 14 14 Ket + 17 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 237 +Need char = 180 + \x{D55c}\x{ad6d}\x{C5B4} + 0: \x{d55c}\x{ad6d}\x{c5b4} + +/\x{65e5}\x{672c}\x{8a9e}/D8 +------------------------------------------------------------------ + 0 14 Bra 0 + 3 9 \xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e + 14 14 Ket + 17 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 230 +Need char = 158 + \x{65e5}\x{672c}\x{8a9e} + 0: \x{65e5}\x{672c}\x{8a9e} + +/\x{80}/D8 +------------------------------------------------------------------ + 0 7 Bra 0 + 3 2 \xc2\x80 + 7 7 Ket + 10 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 194 +Need char = 128 + +/\x{084}/D8 +------------------------------------------------------------------ + 0 7 Bra 0 + 3 2 \xc2\x84 + 7 7 Ket + 10 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 194 +Need char = 132 + +/\x{104}/D8 +------------------------------------------------------------------ + 0 7 Bra 0 + 3 2 \xc4\x84 + 7 7 Ket + 10 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 196 +Need char = 132 + +/\x{861}/D8 
+------------------------------------------------------------------ + 0 8 Bra 0 + 3 3 \xe0\xa1\xa1 + 8 8 Ket + 11 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 224 +Need char = 161 + +/\x{212ab}/D8 +------------------------------------------------------------------ + 0 9 Bra 0 + 3 4 \xf0\xa1\x8a\xab + 9 9 Ket + 12 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +First char = 240 +Need char = 171 + +/.{3,5}X/D8 +------------------------------------------------------------------ + 0 14 Bra 0 + 3 Any{3} + 7 Any{0,2} + 11 1 X + 14 14 Ket + 17 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +No first char +Need char = 'X' + \x{212ab}\x{212ab}\x{212ab}\x{861}X + 0: \x{212ab}\x{212ab}\x{212ab}\x{861}X + + +/.{3,5}?/D8 +------------------------------------------------------------------ + 0 11 Bra 0 + 3 Any{3} + 7 Any{0,2}? + 11 11 Ket + 14 End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: utf8 +No first char +No need char + \x{212ab}\x{212ab}\x{212ab}\x{861} + 0: \x{212ab}\x{212ab}\x{212ab} + +/-- These tests are here rather than in testinput4 because Perl 5.6 has --/ +/-- some problems with UTF-8 support, in the area of \x{..} where the --/ No match - \x{100}\x{100}b +/-- value is < 255. It grumbles about invalid UTF-8 strings. 
--/ No match +/^[a\x{c0}]b/8 + \x{c0}b + 0: \x{c0}b + +/^([a\x{c0}]*?)aa/8 + a\x{c0}aaaa/ + 0: a\x{c0}aa + 1: a\x{c0} + +/^([a\x{c0}]*?)aa/8 + a\x{c0}aaaa/ + 0: a\x{c0}aa + 1: a\x{c0} + a\x{c0}a\x{c0}aaa/ + 0: a\x{c0}a\x{c0}aa + 1: a\x{c0}a\x{c0} + +/^([a\x{c0}]*)aa/8 + a\x{c0}aaaa/ + 0: a\x{c0}aaaa + 1: a\x{c0}aa + a\x{c0}a\x{c0}aaa/ + 0: a\x{c0}a\x{c0}aaa + 1: a\x{c0}a\x{c0}a + +/^([a\x{c0}]*)a\x{c0}/8 + a\x{c0}aaaa/ + 0: a\x{c0} + 1: + a\x{c0}a\x{c0}aaa/ + 0: a\x{c0}a\x{c0} + 1: a\x{c0} + +/-- --/ + +/(?<=\C)X/8 +Failed: \C not allowed in lookbehind assertion at offset 6 + +/-- This one is here not because it's different to Perl, but because the --/ +/-- way the captured single-byte is displayed. (In Perl it becomes a --/ +No match +/-- character, and you can't tell the difference.) --/ +No match + +/X(\C)(.*)/8 + X\x{1234} + 0: X\x{1234} + 1: \xe1 + 2: \x88\xb4 + X\nabc + 0: X\x{0a}abc + 1: \x{0a} + 2: abc + / End of testinput5 /