2006-09-18 07:02:52 +08:00
|
|
|
#ifndef GREP_H
|
|
|
|
#define GREP_H
|
2009-03-07 20:32:32 +08:00
|
|
|
#include "color.h"
|
2017-05-26 03:45:28 +08:00
|
|
|
#ifdef USE_LIBPCRE1
|
2011-05-10 05:52:05 +08:00
|
|
|
#include <pcre.h>
|
2017-05-26 04:05:26 +08:00
|
|
|
#ifdef PCRE_CONFIG_JIT
|
|
|
|
#if PCRE_MAJOR >= 8 && PCRE_MINOR >= 32
|
grep: un-break building with PCRE >= 8.32 without --enable-jit
Amend my change earlier in this series ("grep: add support for the
PCRE v1 JIT API", 2017-04-11) to un-break the build on PCRE v1
versions later than 8.31 compiled without --enable-jit.
As explained in that change and a later compatibility change in this
series ("grep: un-break building with PCRE < 8.32", 2017-05-10) the
pcre_jit_exec() function is a faster path to execute the JIT.
Unfortunately there's no compatibility stub for that function compiled
into the library if pcre_config(PCRE_CONFIG_JIT, &ret) would return 0,
and no macro that can be used to check for it, so the only portable
option to support builds without --enable-jit is via a new
NO_LIBPCRE1_JIT=UnfortunatelyYes Makefile option[1].
Another option would be to make the JIT opt-in via
USE_LIBPCRE1_JIT=YesPlease, after all it's not a default option of
PCRE v1.
I think it makes more sense to make it opt-out since even though it's
not a default option, most packagers of PCRE seem to turn it on by
default, with the notable exception of the MinGW package.
Make the MinGW platform work by default by changing the build defaults
to turn on NO_LIBPCRE1_JIT=UnfortunatelyYes. It is the only platform
that turns on USE_LIBPCRE=YesPlease by default, see commit
df5218b4c3 ("config.mak.uname: support MSys2", 2016-01-13) for that
change.
1. "How do I support pcre1 JIT on all
versions?" (https://lists.exim.org/lurker/thread/20170601.103148.10253788.en.html)
2. https://github.com/Alexpux/MINGW-packages/blob/master/mingw-w64-pcre/PKGBUILD
(referenced from "Re: PCRE v2 compile error, was Re: What's cooking
in git.git (May 2017, #01; Mon, 1)";
<alpine.DEB.2.20.1705021756530.3480@virtualbox>)
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-02 02:20:55 +08:00
|
|
|
#ifndef NO_LIBPCRE1_JIT
|
2017-05-26 04:05:26 +08:00
|
|
|
#define GIT_PCRE1_USE_JIT
|
|
|
|
#endif
|
|
|
|
#endif
|
grep: un-break building with PCRE >= 8.32 without --enable-jit
Amend my change earlier in this series ("grep: add support for the
PCRE v1 JIT API", 2017-04-11) to un-break the build on PCRE v1
versions later than 8.31 compiled without --enable-jit.
As explained in that change and a later compatibility change in this
series ("grep: un-break building with PCRE < 8.32", 2017-05-10) the
pcre_jit_exec() function is a faster path to execute the JIT.
Unfortunately there's no compatibility stub for that function compiled
into the library if pcre_config(PCRE_CONFIG_JIT, &ret) would return 0,
and no macro that can be used to check for it, so the only portable
option to support builds without --enable-jit is via a new
NO_LIBPCRE1_JIT=UnfortunatelyYes Makefile option[1].
Another option would be to make the JIT opt-in via
USE_LIBPCRE1_JIT=YesPlease, after all it's not a default option of
PCRE v1.
I think it makes more sense to make it opt-out since even though it's
not a default option, most packagers of PCRE seem to turn it on by
default, with the notable exception of the MinGW package.
Make the MinGW platform work by default by changing the build defaults
to turn on NO_LIBPCRE1_JIT=UnfortunatelyYes. It is the only platform
that turns on USE_LIBPCRE=YesPlease by default, see commit
df5218b4c3 ("config.mak.uname: support MSys2", 2016-01-13) for that
change.
1. "How do I support pcre1 JIT on all
versions?" (https://lists.exim.org/lurker/thread/20170601.103148.10253788.en.html)
2. https://github.com/Alexpux/MINGW-packages/blob/master/mingw-w64-pcre/PKGBUILD
(referenced from "Re: PCRE v2 compile error, was Re: What's cooking
in git.git (May 2017, #01; Mon, 1)";
<alpine.DEB.2.20.1705021756530.3480@virtualbox>)
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-02 02:20:55 +08:00
|
|
|
#endif
|
grep: add support for the PCRE v1 JIT API
Change the grep PCRE v1 code to use JIT when available. When PCRE
support was initially added in commit 63e7e9d8b6 ("git-grep: Learn
PCRE", 2011-05-09) PCRE had no JIT support, it was integrated into
8.20 released on 2011-10-21.
Enabling JIT support usually improves performance by more than
40%. The pattern compilation times are relatively slower, but those
relative numbers are tiny, and are easily made back in all but the
most trivial cases of grep. Detailed benchmarks & overview of
compilation times is at: http://sljit.sourceforge.net/pcre.html
With this change the difference in a t/perf/p7820-grep-engines.sh run
is, with just the /perl/ tests shown:
$ GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre/inst/lib' ./run HEAD~ HEAD p7820-grep-engines.sh
Test HEAD~ HEAD
---------------------------------------------------------------------------------------
7820.3: perl grep 'how.to' 0.35(1.11+0.43) 0.23(0.42+0.46) -34.3%
7820.7: perl grep '^how to' 0.64(2.71+0.36) 0.27(0.66+0.44) -57.8%
7820.11: perl grep '[how] to' 0.63(2.51+0.42) 0.33(0.98+0.39) -47.6%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 1.17(5.61+0.35) 0.34(1.08+0.46) -70.9%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.43(1.52+0.44) 0.30(0.88+0.42) -30.2%
The conditional support for JIT is implemented as suggested in the
pcrejit(3) man page. E.g. defining PCRE_STUDY_JIT_COMPILE to 0 if it's
not present.
The implementation is relatively verbose because even if
PCRE_CONFIG_JIT is defined only a call to pcre_config() can determine
if the JIT is available, and if so the faster pcre_jit_exec() function
should be called instead of pcre_exec(), and a different (but not
complimentary!) function needs to be called to free pcre1_extra_info.
There's no graceful fallback if pcre_jit_stack_alloc() fails under
PCRE_CONFIG_JIT, instead the program will simply abort. I don't think
this is worth handling gracefully, it'll only fail in cases where
malloc() doesn't work, in which case we're screwed anyway.
That there's no assignment of `p->pcre1_jit_on = 0` when
PCRE_CONFIG_JIT isn't defined isn't a bug. The create_grep_pat()
function allocates the grep_pat allocates it with calloc(), so it's
guaranteed to be 0 when PCRE_CONFIG_JIT isn't defined.
I you're bisecting and find this change, check that your PCRE isn't
older than 8.32. This change intentionally broke really old versions
of PCRE, but that's fixed in follow-up commits.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-26 04:05:25 +08:00
|
|
|
#ifndef PCRE_STUDY_JIT_COMPILE
|
|
|
|
#define PCRE_STUDY_JIT_COMPILE 0
|
|
|
|
#endif
|
2017-05-26 04:05:27 +08:00
|
|
|
#if PCRE_MAJOR <= 8 && PCRE_MINOR < 20
|
|
|
|
typedef int pcre_jit_stack;
|
|
|
|
#endif
|
2011-05-10 05:52:05 +08:00
|
|
|
#else
|
|
|
|
typedef int pcre;
|
|
|
|
typedef int pcre_extra;
|
grep: add support for the PCRE v1 JIT API
Change the grep PCRE v1 code to use JIT when available. When PCRE
support was initially added in commit 63e7e9d8b6 ("git-grep: Learn
PCRE", 2011-05-09) PCRE had no JIT support, it was integrated into
8.20 released on 2011-10-21.
Enabling JIT support usually improves performance by more than
40%. The pattern compilation times are relatively slower, but those
relative numbers are tiny, and are easily made back in all but the
most trivial cases of grep. Detailed benchmarks & overview of
compilation times is at: http://sljit.sourceforge.net/pcre.html
With this change the difference in a t/perf/p7820-grep-engines.sh run
is, with just the /perl/ tests shown:
$ GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre/inst/lib' ./run HEAD~ HEAD p7820-grep-engines.sh
Test HEAD~ HEAD
---------------------------------------------------------------------------------------
7820.3: perl grep 'how.to' 0.35(1.11+0.43) 0.23(0.42+0.46) -34.3%
7820.7: perl grep '^how to' 0.64(2.71+0.36) 0.27(0.66+0.44) -57.8%
7820.11: perl grep '[how] to' 0.63(2.51+0.42) 0.33(0.98+0.39) -47.6%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 1.17(5.61+0.35) 0.34(1.08+0.46) -70.9%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.43(1.52+0.44) 0.30(0.88+0.42) -30.2%
The conditional support for JIT is implemented as suggested in the
pcrejit(3) man page. E.g. defining PCRE_STUDY_JIT_COMPILE to 0 if it's
not present.
The implementation is relatively verbose because even if
PCRE_CONFIG_JIT is defined only a call to pcre_config() can determine
if the JIT is available, and if so the faster pcre_jit_exec() function
should be called instead of pcre_exec(), and a different (but not
complimentary!) function needs to be called to free pcre1_extra_info.
There's no graceful fallback if pcre_jit_stack_alloc() fails under
PCRE_CONFIG_JIT, instead the program will simply abort. I don't think
this is worth handling gracefully, it'll only fail in cases where
malloc() doesn't work, in which case we're screwed anyway.
That there's no assignment of `p->pcre1_jit_on = 0` when
PCRE_CONFIG_JIT isn't defined isn't a bug. The create_grep_pat()
function allocates the grep_pat allocates it with calloc(), so it's
guaranteed to be 0 when PCRE_CONFIG_JIT isn't defined.
I you're bisecting and find this change, check that your PCRE isn't
older than 8.32. This change intentionally broke really old versions
of PCRE, but that's fixed in follow-up commits.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-26 04:05:25 +08:00
|
|
|
typedef int pcre_jit_stack;
|
2011-05-10 05:52:05 +08:00
|
|
|
#endif
|
grep: add support for PCRE v2
Add support for v2 of the PCRE API. This is a new major version of
PCRE that came out in early 2015[1].
The regular expression syntax is the same, but while the API is
similar, pretty much every function is either renamed or takes
different arguments. Thus using it via entirely new functions makes
sense, as opposed to trying to e.g. have one compile_pcre_pattern()
that would call either PCRE v1 or v2 functions.
Git can now be compiled with either USE_LIBPCRE1=YesPlease or
USE_LIBPCRE2=YesPlease, with USE_LIBPCRE=YesPlease currently being a
synonym for the former. Providing both is a compile-time error.
With earlier patches to enable JIT for PCRE v1 the performance of the
release versions of both libraries is almost exactly the same, with
PCRE v2 being around 1% slower.
However after I reported this to the pcre-dev mailing list[2] I got a
lot of help with the API use from Zoltán Herczeg, he subsequently
optimized some of the JIT functionality in v2 of the library.
Running the p7820-grep-engines.sh performance test against the latest
Subversion trunk of both, with both them and git compiled as -O3, and
the test run against linux.git, gives the following results. Just the
/perl/ tests shown:
$ GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_COMMAND='grep -q LIBPCRE2 Makefile && make -j8 USE_LIBPCRE2=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre2/inst/lib || make -j8 USE_LIBPCRE=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre/inst/lib' ./run HEAD~5 HEAD~ HEAD p7820-grep-engines.sh
[...]
Test HEAD~5 HEAD~ HEAD
-----------------------------------------------------------------------------------------------------------------
7820.3: perl grep 'how.to' 0.31(1.10+0.48) 0.21(0.35+0.56) -32.3% 0.21(0.34+0.55) -32.3%
7820.7: perl grep '^how to' 0.56(2.70+0.40) 0.24(0.64+0.52) -57.1% 0.20(0.28+0.60) -64.3%
7820.11: perl grep '[how] to' 0.56(2.66+0.38) 0.29(0.95+0.45) -48.2% 0.23(0.45+0.54) -58.9%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 1.02(5.77+0.42) 0.31(1.02+0.54) -69.6% 0.23(0.50+0.54) -77.5%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.38(1.57+0.42) 0.27(0.85+0.46) -28.9% 0.21(0.33+0.57) -44.7%
See commit ("perf: add a comparison test of grep regex engines",
2017-04-19) for details on the machine the above test run was executed
on.
Here HEAD~2 is git with PCRE v1 without JIT, HEAD~ is PCRE v1 with
JIT, and HEAD is PCRE v2 (also with JIT). See previous commits of mine
mentioning p7820-grep-engines.sh for more details on the test setup.
For ease of readability, a different run just of HEAD~ (PCRE v1 with
JIT v.s. PCRE v2), again with just the /perl/ tests shown:
[...]
Test HEAD~ HEAD
----------------------------------------------------------------------------------------
7820.3: perl grep 'how.to' 0.21(0.42+0.52) 0.21(0.31+0.58) +0.0%
7820.7: perl grep '^how to' 0.25(0.65+0.50) 0.20(0.31+0.57) -20.0%
7820.11: perl grep '[how] to' 0.30(0.90+0.50) 0.23(0.46+0.53) -23.3%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 0.30(1.19+0.38) 0.23(0.51+0.51) -23.3%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.27(0.84+0.48) 0.21(0.34+0.57) -22.2%
I.e. the two are either neck-to-neck, but PCRE v2 usually pulls ahead,
when it does it's around 20% faster.
A brief note on thread safety: As noted in pcre2api(3) & pcre2jit(3)
the compiled pattern can be shared between threads, but not some of
the JIT context, however the grep threading support does all pattern &
JIT compilation in separate threads, so this code doesn't need to
concern itself with thread safety.
See commit 63e7e9d8b6 ("git-grep: Learn PCRE", 2011-05-09) for the
initial addition of PCRE v1. This change follows some of the same
patterns it did (and which were discussed on list at the time),
e.g. mocking up types with typedef instead of ifdef-ing them out when
USE_LIBPCRE2 isn't defined. This adds some trivial memory use to the
program, but makes the code look nicer.
1. https://lists.exim.org/lurker/message/20150105.162835.0666407a.en.html
2. https://lists.exim.org/lurker/thread/20170419.172322.833ee099.en.html
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-02 02:20:56 +08:00
|
|
|
#ifdef USE_LIBPCRE2
|
|
|
|
#define PCRE2_CODE_UNIT_WIDTH 8
|
|
|
|
#include <pcre2.h>
|
|
|
|
#else
|
|
|
|
typedef int pcre2_code;
|
|
|
|
typedef int pcre2_match_data;
|
|
|
|
typedef int pcre2_compile_context;
|
|
|
|
typedef int pcre2_match_context;
|
|
|
|
typedef int pcre2_jit_stack;
|
|
|
|
#endif
|
Use kwset in grep
Benchmarks for the hot cache case:
before:
$ perf stat --repeat=5 git grep qwerty > /dev/null
Performance counter stats for 'git grep qwerty' (5 runs):
3,478,085 cache-misses # 2.322 M/sec ( +- 2.690% )
11,356,177 cache-references # 7.582 M/sec ( +- 2.598% )
3,872,184 branch-misses # 0.363 % ( +- 0.258% )
1,067,367,848 branches # 712.673 M/sec ( +- 2.622% )
3,828,370,782 instructions # 0.947 IPC ( +- 0.033% )
4,043,832,831 cycles # 2700.037 M/sec ( +- 0.167% )
8,518 page-faults # 0.006 M/sec ( +- 3.648% )
847 CPU-migrations # 0.001 M/sec ( +- 3.262% )
6,546 context-switches # 0.004 M/sec ( +- 2.292% )
1497.695495 task-clock-msecs # 3.303 CPUs ( +- 2.550% )
0.453394396 seconds time elapsed ( +- 0.912% )
after:
$ perf stat --repeat=5 git grep qwerty > /dev/null
Performance counter stats for 'git grep qwerty' (5 runs):
2,989,918 cache-misses # 3.166 M/sec ( +- 5.013% )
10,986,041 cache-references # 11.633 M/sec ( +- 4.899% ) (scaled from 95.06%)
3,511,993 branch-misses # 1.422 % ( +- 0.785% )
246,893,561 branches # 261.433 M/sec ( +- 3.967% )
1,392,727,757 instructions # 0.564 IPC ( +- 0.040% )
2,468,142,397 cycles # 2613.494 M/sec ( +- 0.110% )
7,747 page-faults # 0.008 M/sec ( +- 3.995% )
897 CPU-migrations # 0.001 M/sec ( +- 2.383% )
6,535 context-switches # 0.007 M/sec ( +- 1.993% )
944.384228 task-clock-msecs # 3.177 CPUs ( +- 0.268% )
0.297257643 seconds time elapsed ( +- 0.450% )
So we gain about 35% by using the kwset code.
As a side effect of using kwset two grep tests are fixed by this
patch. The first is fixed because kwset can deal with case-insensitive
search containing NULs, something strcasestr cannot do. The second one
is fixed because we consider patterns containing NULs as fixed strings
(regcomp cannot accept patterns with NULs).
Signed-off-by: Fredrik Kuivinen <frekui@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-08-21 06:42:18 +08:00
|
|
|
#include "kwset.h"
|
2011-12-13 05:16:07 +08:00
|
|
|
#include "thread-utils.h"
|
2012-02-02 16:20:43 +08:00
|
|
|
#include "userdiff.h"
|
2006-09-18 07:02:52 +08:00
|
|
|
|
|
|
|
enum grep_pat_token {
|
|
|
|
GREP_PATTERN,
|
2006-09-21 03:39:46 +08:00
|
|
|
GREP_PATTERN_HEAD,
|
|
|
|
GREP_PATTERN_BODY,
|
2006-09-18 07:02:52 +08:00
|
|
|
GREP_AND,
|
|
|
|
GREP_OPEN_PAREN,
|
|
|
|
GREP_CLOSE_PAREN,
|
|
|
|
GREP_NOT,
|
2010-05-14 17:31:35 +08:00
|
|
|
GREP_OR
|
2006-09-18 07:02:52 +08:00
|
|
|
};
|
|
|
|
|
2006-09-21 03:39:46 +08:00
|
|
|
enum grep_context {
|
|
|
|
GREP_CONTEXT_HEAD,
|
2010-05-14 17:31:35 +08:00
|
|
|
GREP_CONTEXT_BODY
|
2006-09-21 03:39:46 +08:00
|
|
|
};
|
|
|
|
|
log --author/--committer: really match only with name part
When we tried to find commits done by AUTHOR, the first implementation
tried to pattern match a line with "^author .*AUTHOR", which later was
enhanced to strip leading caret and look for "^author AUTHOR" when the
search pattern was anchored at the left end (i.e. --author="^AUTHOR").
This had a few problems:
* When looking for fixed strings (e.g. "git log -F --author=x --grep=y"),
the regexp internally used "^author .*x" would never match anything;
* To match at the end (e.g. "git log --author='google.com>$'"), the
generated regexp has to also match the trailing timestamp part the
commit header lines have. Also, in order to determine if the '$' at
the end means "match at the end of the line" or just a literal dollar
sign (probably backslash-quoted), we would need to parse the regexp
ourselves.
An earlier alternative tried to make sure that a line matches "^author "
(to limit by field name) and the user supplied pattern at the same time.
While it solved the -F problem by introducing a special override for
matching the "^author ", it did not solve the trailing timestamp nor tail
match problem. It also would have matched every commit if --author=author
was asked for, not because the author's email part had this string, but
because every commit header line that talks about the author begins with
that field name, regardleses of who wrote it.
Instead of piling more hacks on top of hacks, this rethinks the grep
machinery that is used to look for strings in the commit header, and makes
sure that (1) field name matches literally at the beginning of the line,
followed by a SP, and (2) the user supplied pattern is matched against the
remainder of the line, excluding the trailing timestamp data.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-09-05 13:15:02 +08:00
|
|
|
enum grep_header_field {
|
2013-02-03 22:37:09 +08:00
|
|
|
GREP_HEADER_FIELD_MIN = 0,
|
|
|
|
GREP_HEADER_AUTHOR = GREP_HEADER_FIELD_MIN,
|
2012-09-29 12:41:27 +08:00
|
|
|
GREP_HEADER_COMMITTER,
|
2012-09-29 12:41:28 +08:00
|
|
|
GREP_HEADER_REFLOG,
|
2012-09-29 12:41:27 +08:00
|
|
|
|
|
|
|
/* Must be at the end of the enum */
|
|
|
|
GREP_HEADER_FIELD_MAX
|
log --author/--committer: really match only with name part
When we tried to find commits done by AUTHOR, the first implementation
tried to pattern match a line with "^author .*AUTHOR", which later was
enhanced to strip leading caret and look for "^author AUTHOR" when the
search pattern was anchored at the left end (i.e. --author="^AUTHOR").
This had a few problems:
* When looking for fixed strings (e.g. "git log -F --author=x --grep=y"),
the regexp internally used "^author .*x" would never match anything;
* To match at the end (e.g. "git log --author='google.com>$'"), the
generated regexp has to also match the trailing timestamp part the
commit header lines have. Also, in order to determine if the '$' at
the end means "match at the end of the line" or just a literal dollar
sign (probably backslash-quoted), we would need to parse the regexp
ourselves.
An earlier alternative tried to make sure that a line matches "^author "
(to limit by field name) and the user supplied pattern at the same time.
While it solved the -F problem by introducing a special override for
matching the "^author ", it did not solve the trailing timestamp nor tail
match problem. It also would have matched every commit if --author=author
was asked for, not because the author's email part had this string, but
because every commit header line that talks about the author begins with
that field name, regardleses of who wrote it.
Instead of piling more hacks on top of hacks, this rethinks the grep
machinery that is used to look for strings in the commit header, and makes
sure that (1) field name matches literally at the beginning of the line,
followed by a SP, and (2) the user supplied pattern is matched against the
remainder of the line, excluding the trailing timestamp data.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-09-05 13:15:02 +08:00
|
|
|
};
|
|
|
|
|
2006-09-18 07:02:52 +08:00
|
|
|
struct grep_pat {
|
|
|
|
struct grep_pat *next;
|
|
|
|
const char *origin;
|
|
|
|
int no;
|
|
|
|
enum grep_pat_token token;
|
2012-05-20 22:33:07 +08:00
|
|
|
char *pattern;
|
2010-05-23 05:43:43 +08:00
|
|
|
size_t patternlen;
|
log --author/--committer: really match only with name part
When we tried to find commits done by AUTHOR, the first implementation
tried to pattern match a line with "^author .*AUTHOR", which later was
enhanced to strip leading caret and look for "^author AUTHOR" when the
search pattern was anchored at the left end (i.e. --author="^AUTHOR").
This had a few problems:
* When looking for fixed strings (e.g. "git log -F --author=x --grep=y"),
the regexp internally used "^author .*x" would never match anything;
* To match at the end (e.g. "git log --author='google.com>$'"), the
generated regexp has to also match the trailing timestamp part the
commit header lines have. Also, in order to determine if the '$' at
the end means "match at the end of the line" or just a literal dollar
sign (probably backslash-quoted), we would need to parse the regexp
ourselves.
An earlier alternative tried to make sure that a line matches "^author "
(to limit by field name) and the user supplied pattern at the same time.
While it solved the -F problem by introducing a special override for
matching the "^author ", it did not solve the trailing timestamp nor tail
match problem. It also would have matched every commit if --author=author
was asked for, not because the author's email part had this string, but
because every commit header line that talks about the author begins with
that field name, regardleses of who wrote it.
Instead of piling more hacks on top of hacks, this rethinks the grep
machinery that is used to look for strings in the commit header, and makes
sure that (1) field name matches literally at the beginning of the line,
followed by a SP, and (2) the user supplied pattern is matched against the
remainder of the line, excluding the trailing timestamp data.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-09-05 13:15:02 +08:00
|
|
|
enum grep_header_field field;
|
2006-09-18 07:02:52 +08:00
|
|
|
regex_t regexp;
|
2017-05-26 03:45:29 +08:00
|
|
|
pcre *pcre1_regexp;
|
|
|
|
pcre_extra *pcre1_extra_info;
|
grep: add support for the PCRE v1 JIT API
Change the grep PCRE v1 code to use JIT when available. When PCRE
support was initially added in commit 63e7e9d8b6 ("git-grep: Learn
PCRE", 2011-05-09) PCRE had no JIT support, it was integrated into
8.20 released on 2011-10-21.
Enabling JIT support usually improves performance by more than
40%. The pattern compilation times are relatively slower, but those
relative numbers are tiny, and are easily made back in all but the
most trivial cases of grep. Detailed benchmarks & overview of
compilation times is at: http://sljit.sourceforge.net/pcre.html
With this change the difference in a t/perf/p7820-grep-engines.sh run
is, with just the /perl/ tests shown:
$ GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre/inst/lib' ./run HEAD~ HEAD p7820-grep-engines.sh
Test HEAD~ HEAD
---------------------------------------------------------------------------------------
7820.3: perl grep 'how.to' 0.35(1.11+0.43) 0.23(0.42+0.46) -34.3%
7820.7: perl grep '^how to' 0.64(2.71+0.36) 0.27(0.66+0.44) -57.8%
7820.11: perl grep '[how] to' 0.63(2.51+0.42) 0.33(0.98+0.39) -47.6%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 1.17(5.61+0.35) 0.34(1.08+0.46) -70.9%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.43(1.52+0.44) 0.30(0.88+0.42) -30.2%
The conditional support for JIT is implemented as suggested in the
pcrejit(3) man page. E.g. defining PCRE_STUDY_JIT_COMPILE to 0 if it's
not present.
The implementation is relatively verbose because even if
PCRE_CONFIG_JIT is defined only a call to pcre_config() can determine
if the JIT is available, and if so the faster pcre_jit_exec() function
should be called instead of pcre_exec(), and a different (but not
complimentary!) function needs to be called to free pcre1_extra_info.
There's no graceful fallback if pcre_jit_stack_alloc() fails under
PCRE_CONFIG_JIT, instead the program will simply abort. I don't think
this is worth handling gracefully, it'll only fail in cases where
malloc() doesn't work, in which case we're screwed anyway.
That there's no assignment of `p->pcre1_jit_on = 0` when
PCRE_CONFIG_JIT isn't defined isn't a bug. The create_grep_pat()
function allocates the grep_pat allocates it with calloc(), so it's
guaranteed to be 0 when PCRE_CONFIG_JIT isn't defined.
I you're bisecting and find this change, check that your PCRE isn't
older than 8.32. This change intentionally broke really old versions
of PCRE, but that's fixed in follow-up commits.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-26 04:05:25 +08:00
|
|
|
pcre_jit_stack *pcre1_jit_stack;
|
2017-05-26 03:45:29 +08:00
|
|
|
const unsigned char *pcre1_tables;
|
grep: add support for the PCRE v1 JIT API
Change the grep PCRE v1 code to use JIT when available. When PCRE
support was initially added in commit 63e7e9d8b6 ("git-grep: Learn
PCRE", 2011-05-09) PCRE had no JIT support, it was integrated into
8.20 released on 2011-10-21.
Enabling JIT support usually improves performance by more than
40%. The pattern compilation times are relatively slower, but those
relative numbers are tiny, and are easily made back in all but the
most trivial cases of grep. Detailed benchmarks & overview of
compilation times is at: http://sljit.sourceforge.net/pcre.html
With this change the difference in a t/perf/p7820-grep-engines.sh run
is, with just the /perl/ tests shown:
$ GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre/inst/lib' ./run HEAD~ HEAD p7820-grep-engines.sh
Test HEAD~ HEAD
---------------------------------------------------------------------------------------
7820.3: perl grep 'how.to' 0.35(1.11+0.43) 0.23(0.42+0.46) -34.3%
7820.7: perl grep '^how to' 0.64(2.71+0.36) 0.27(0.66+0.44) -57.8%
7820.11: perl grep '[how] to' 0.63(2.51+0.42) 0.33(0.98+0.39) -47.6%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 1.17(5.61+0.35) 0.34(1.08+0.46) -70.9%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.43(1.52+0.44) 0.30(0.88+0.42) -30.2%
The conditional support for JIT is implemented as suggested in the
pcrejit(3) man page. E.g. defining PCRE_STUDY_JIT_COMPILE to 0 if it's
not present.
The implementation is relatively verbose because even if
PCRE_CONFIG_JIT is defined only a call to pcre_config() can determine
if the JIT is available, and if so the faster pcre_jit_exec() function
should be called instead of pcre_exec(), and a different (but not
complimentary!) function needs to be called to free pcre1_extra_info.
There's no graceful fallback if pcre_jit_stack_alloc() fails under
PCRE_CONFIG_JIT, instead the program will simply abort. I don't think
this is worth handling gracefully, it'll only fail in cases where
malloc() doesn't work, in which case we're screwed anyway.
That there's no assignment of `p->pcre1_jit_on = 0` when
PCRE_CONFIG_JIT isn't defined isn't a bug. The create_grep_pat()
function allocates the grep_pat allocates it with calloc(), so it's
guaranteed to be 0 when PCRE_CONFIG_JIT isn't defined.
I you're bisecting and find this change, check that your PCRE isn't
older than 8.32. This change intentionally broke really old versions
of PCRE, but that's fixed in follow-up commits.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-26 04:05:25 +08:00
|
|
|
int pcre1_jit_on;
|
grep: add support for PCRE v2
Add support for v2 of the PCRE API. This is a new major version of
PCRE that came out in early 2015[1].
The regular expression syntax is the same, but while the API is
similar, pretty much every function is either renamed or takes
different arguments. Thus using it via entirely new functions makes
sense, as opposed to trying to e.g. have one compile_pcre_pattern()
that would call either PCRE v1 or v2 functions.
Git can now be compiled with either USE_LIBPCRE1=YesPlease or
USE_LIBPCRE2=YesPlease, with USE_LIBPCRE=YesPlease currently being a
synonym for the former. Providing both is a compile-time error.
With earlier patches to enable JIT for PCRE v1 the performance of the
release versions of both libraries is almost exactly the same, with
PCRE v2 being around 1% slower.
However after I reported this to the pcre-dev mailing list[2] I got a
lot of help with the API use from Zoltán Herczeg, he subsequently
optimized some of the JIT functionality in v2 of the library.
Running the p7820-grep-engines.sh performance test against the latest
Subversion trunk of both, with both them and git compiled as -O3, and
the test run against linux.git, gives the following results. Just the
/perl/ tests shown:
$ GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_COMMAND='grep -q LIBPCRE2 Makefile && make -j8 USE_LIBPCRE2=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre2/inst/lib || make -j8 USE_LIBPCRE=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre/inst/lib' ./run HEAD~5 HEAD~ HEAD p7820-grep-engines.sh
[...]
Test HEAD~5 HEAD~ HEAD
-----------------------------------------------------------------------------------------------------------------
7820.3: perl grep 'how.to' 0.31(1.10+0.48) 0.21(0.35+0.56) -32.3% 0.21(0.34+0.55) -32.3%
7820.7: perl grep '^how to' 0.56(2.70+0.40) 0.24(0.64+0.52) -57.1% 0.20(0.28+0.60) -64.3%
7820.11: perl grep '[how] to' 0.56(2.66+0.38) 0.29(0.95+0.45) -48.2% 0.23(0.45+0.54) -58.9%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 1.02(5.77+0.42) 0.31(1.02+0.54) -69.6% 0.23(0.50+0.54) -77.5%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.38(1.57+0.42) 0.27(0.85+0.46) -28.9% 0.21(0.33+0.57) -44.7%
See commit ("perf: add a comparison test of grep regex engines",
2017-04-19) for details on the machine the above test run was executed
on.
Here HEAD~2 is git with PCRE v1 without JIT, HEAD~ is PCRE v1 with
JIT, and HEAD is PCRE v2 (also with JIT). See previous commits of mine
mentioning p7820-grep-engines.sh for more details on the test setup.
For ease of readability, a different run just of HEAD~ (PCRE v1 with
JIT v.s. PCRE v2), again with just the /perl/ tests shown:
[...]
Test HEAD~ HEAD
----------------------------------------------------------------------------------------
7820.3: perl grep 'how.to' 0.21(0.42+0.52) 0.21(0.31+0.58) +0.0%
7820.7: perl grep '^how to' 0.25(0.65+0.50) 0.20(0.31+0.57) -20.0%
7820.11: perl grep '[how] to' 0.30(0.90+0.50) 0.23(0.46+0.53) -23.3%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 0.30(1.19+0.38) 0.23(0.51+0.51) -23.3%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.27(0.84+0.48) 0.21(0.34+0.57) -22.2%
I.e. the two are either neck-to-neck, but PCRE v2 usually pulls ahead,
when it does it's around 20% faster.
A brief note on thread safety: As noted in pcre2api(3) & pcre2jit(3)
the compiled pattern can be shared between threads, but not some of
the JIT context, however the grep threading support does all pattern &
JIT compilation in separate threads, so this code doesn't need to
concern itself with thread safety.
See commit 63e7e9d8b6 ("git-grep: Learn PCRE", 2011-05-09) for the
initial addition of PCRE v1. This change follows some of the same
patterns it did (and which were discussed on list at the time),
e.g. mocking up types with typedef instead of ifdef-ing them out when
USE_LIBPCRE2 isn't defined. This adds some trivial memory use to the
program, but makes the code look nicer.
1. https://lists.exim.org/lurker/message/20150105.162835.0666407a.en.html
2. https://lists.exim.org/lurker/thread/20170419.172322.833ee099.en.html
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-02 02:20:56 +08:00
|
|
|
pcre2_code *pcre2_pattern;
|
|
|
|
pcre2_match_data *pcre2_match_data;
|
|
|
|
pcre2_compile_context *pcre2_compile_context;
|
|
|
|
pcre2_match_context *pcre2_match_context;
|
|
|
|
pcre2_jit_stack *pcre2_jit_stack;
|
|
|
|
uint32_t pcre2_jit_on;
|
Use kwset in grep
Benchmarks for the hot cache case:
before:
$ perf stat --repeat=5 git grep qwerty > /dev/null
Performance counter stats for 'git grep qwerty' (5 runs):
3,478,085 cache-misses # 2.322 M/sec ( +- 2.690% )
11,356,177 cache-references # 7.582 M/sec ( +- 2.598% )
3,872,184 branch-misses # 0.363 % ( +- 0.258% )
1,067,367,848 branches # 712.673 M/sec ( +- 2.622% )
3,828,370,782 instructions # 0.947 IPC ( +- 0.033% )
4,043,832,831 cycles # 2700.037 M/sec ( +- 0.167% )
8,518 page-faults # 0.006 M/sec ( +- 3.648% )
847 CPU-migrations # 0.001 M/sec ( +- 3.262% )
6,546 context-switches # 0.004 M/sec ( +- 2.292% )
1497.695495 task-clock-msecs # 3.303 CPUs ( +- 2.550% )
0.453394396 seconds time elapsed ( +- 0.912% )
after:
$ perf stat --repeat=5 git grep qwerty > /dev/null
Performance counter stats for 'git grep qwerty' (5 runs):
2,989,918 cache-misses # 3.166 M/sec ( +- 5.013% )
10,986,041 cache-references # 11.633 M/sec ( +- 4.899% ) (scaled from 95.06%)
3,511,993 branch-misses # 1.422 % ( +- 0.785% )
246,893,561 branches # 261.433 M/sec ( +- 3.967% )
1,392,727,757 instructions # 0.564 IPC ( +- 0.040% )
2,468,142,397 cycles # 2613.494 M/sec ( +- 0.110% )
7,747 page-faults # 0.008 M/sec ( +- 3.995% )
897 CPU-migrations # 0.001 M/sec ( +- 2.383% )
6,535 context-switches # 0.007 M/sec ( +- 1.993% )
944.384228 task-clock-msecs # 3.177 CPUs ( +- 0.268% )
0.297257643 seconds time elapsed ( +- 0.450% )
So we gain about 35% by using the kwset code.
As a side effect of using kwset two grep tests are fixed by this
patch. The first is fixed because kwset can deal with case-insensitive
search containing NULs, something strcasestr cannot do. The second one
is fixed because we consider patterns containing NULs as fixed strings
(regcomp cannot accept patterns with NULs).
Signed-off-by: Fredrik Kuivinen <frekui@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-08-21 06:42:18 +08:00
|
|
|
kwset_t kws;
|
2009-01-10 07:18:34 +08:00
|
|
|
unsigned fixed:1;
|
2009-11-06 17:22:35 +08:00
|
|
|
unsigned ignore_case:1;
|
2009-03-07 20:28:40 +08:00
|
|
|
unsigned word_regexp:1;
|
2006-09-18 07:02:52 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
enum grep_expr_node {
|
|
|
|
GREP_NODE_ATOM,
|
|
|
|
GREP_NODE_NOT,
|
|
|
|
GREP_NODE_AND,
|
2010-09-13 13:15:35 +08:00
|
|
|
GREP_NODE_TRUE,
|
2010-05-14 17:31:35 +08:00
|
|
|
GREP_NODE_OR
|
2006-09-18 07:02:52 +08:00
|
|
|
};
|
|
|
|
|
grep: add a grep.patternType configuration setting
The grep.extendedRegexp configuration setting enables the -E flag on grep
by default but there are no equivalents for the -G, -F and -P flags.
Rather than adding an additional setting for grep.fooRegexp for current
and future pattern matching options, add a grep.patternType setting that
can accept appropriate values for modifying the default grep pattern
matching behavior. The current values are "basic", "extended", "fixed",
"perl" and "default" for setting -G, -E, -F, -P and the default behavior
respectively.
When grep.patternType is set to a value other than "default", the
grep.extendedRegexp setting is ignored. The value of "default" restores
the current default behavior, including the grep.extendedRegexp
behavior.
Signed-off-by: J Smith <dark.panda@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-08-03 22:53:50 +08:00
|
|
|
enum grep_pattern_type {
|
|
|
|
GREP_PATTERN_TYPE_UNSPECIFIED = 0,
|
|
|
|
GREP_PATTERN_TYPE_BRE,
|
|
|
|
GREP_PATTERN_TYPE_ERE,
|
|
|
|
GREP_PATTERN_TYPE_FIXED,
|
|
|
|
GREP_PATTERN_TYPE_PCRE
|
|
|
|
};
|
|
|
|
|
2006-09-18 07:02:52 +08:00
|
|
|
struct grep_expr {
|
|
|
|
enum grep_expr_node node;
|
2006-09-28 08:50:52 +08:00
|
|
|
unsigned hit;
|
2006-09-18 07:02:52 +08:00
|
|
|
union {
|
|
|
|
struct grep_pat *atom;
|
|
|
|
struct grep_expr *unary;
|
|
|
|
struct {
|
|
|
|
struct grep_expr *left;
|
|
|
|
struct grep_expr *right;
|
|
|
|
} binary;
|
|
|
|
} u;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct grep_opt {
|
|
|
|
struct grep_pat *pattern_list;
|
|
|
|
struct grep_pat **pattern_tail;
|
"log --author=me --grep=it" should find intersection, not union
Historically, any grep filter in "git log" family of commands were taken
as restricting to commits with any of the words in the commit log message.
However, the user almost always want to find commits "done by this person
on that topic". With "--all-match" option, a series of grep patterns can
be turned into a requirement that all of them must produce a match, but
that makes it impossible to ask for "done by me, on either this or that"
with:
log --author=me --committer=him --grep=this --grep=that
because it will require both "this" and "that" to appear.
Change the "header" parser of grep library to treat the headers specially,
and parse it as:
(all-match-OR (HEADER-AUTHOR me)
(HEADER-COMMITTER him)
(OR
(PATTERN this)
(PATTERN that) ) )
Even though the "log" command line parser doesn't give direct access to
the extended grep syntax to group terms with parentheses, this change will
cover the majority of the case the users would want.
This incidentally revealed that one test in t7002 was bogus. It ran:
log --author=Thor --grep=Thu --format='%s'
and expected (wrongly) "Thu" to match "Thursday" in the author/committer
date, but that would never match, as the timestamp in raw commit buffer
does not have the name of the day-of-the-week.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-01-18 12:09:06 +08:00
|
|
|
struct grep_pat *header_list;
|
|
|
|
struct grep_pat **header_tail;
|
2006-09-18 07:02:52 +08:00
|
|
|
struct grep_expr *pattern_expression;
|
2009-09-05 20:31:17 +08:00
|
|
|
const char *prefix;
|
2006-09-18 07:02:52 +08:00
|
|
|
int prefix_length;
|
|
|
|
regex_t regexp;
|
2009-05-08 03:46:48 +08:00
|
|
|
int linenum;
|
|
|
|
int invert;
|
2009-11-06 17:22:35 +08:00
|
|
|
int ignore_case;
|
2009-05-08 03:46:48 +08:00
|
|
|
int status_only;
|
|
|
|
int name_only;
|
|
|
|
int unmatch_name_only;
|
|
|
|
int count;
|
|
|
|
int word_regexp;
|
|
|
|
int fixed;
|
|
|
|
int all_match;
|
2012-09-14 05:21:44 +08:00
|
|
|
int debug;
|
2006-09-18 07:02:52 +08:00
|
|
|
#define GREP_BINARY_DEFAULT 0
|
|
|
|
#define GREP_BINARY_NOMATCH 1
|
|
|
|
#define GREP_BINARY_TEXT 2
|
2009-05-08 03:46:48 +08:00
|
|
|
int binary;
|
2013-05-10 23:10:15 +08:00
|
|
|
int allow_textconv;
|
2009-05-08 03:46:48 +08:00
|
|
|
int extended;
|
2012-09-30 02:59:52 +08:00
|
|
|
int use_reflog_filter;
|
2017-05-26 03:45:29 +08:00
|
|
|
int pcre1;
|
grep: add support for PCRE v2
Add support for v2 of the PCRE API. This is a new major version of
PCRE that came out in early 2015[1].
The regular expression syntax is the same, but while the API is
similar, pretty much every function is either renamed or takes
different arguments. Thus using it via entirely new functions makes
sense, as opposed to trying to e.g. have one compile_pcre_pattern()
that would call either PCRE v1 or v2 functions.
Git can now be compiled with either USE_LIBPCRE1=YesPlease or
USE_LIBPCRE2=YesPlease, with USE_LIBPCRE=YesPlease currently being a
synonym for the former. Providing both is a compile-time error.
With earlier patches to enable JIT for PCRE v1 the performance of the
release versions of both libraries is almost exactly the same, with
PCRE v2 being around 1% slower.
However after I reported this to the pcre-dev mailing list[2] I got a
lot of help with the API use from Zoltán Herczeg, he subsequently
optimized some of the JIT functionality in v2 of the library.
Running the p7820-grep-engines.sh performance test against the latest
Subversion trunk of both, with both them and git compiled as -O3, and
the test run against linux.git, gives the following results. Just the
/perl/ tests shown:
$ GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_COMMAND='grep -q LIBPCRE2 Makefile && make -j8 USE_LIBPCRE2=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre2/inst/lib || make -j8 USE_LIBPCRE=YesPlease CC=~/perl5/installed/bin/gcc NO_R_TO_GCC_LINKER=YesPlease CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst LDFLAGS=-Wl,-rpath,/home/avar/g/pcre/inst/lib' ./run HEAD~5 HEAD~ HEAD p7820-grep-engines.sh
[...]
Test HEAD~5 HEAD~ HEAD
-----------------------------------------------------------------------------------------------------------------
7820.3: perl grep 'how.to' 0.31(1.10+0.48) 0.21(0.35+0.56) -32.3% 0.21(0.34+0.55) -32.3%
7820.7: perl grep '^how to' 0.56(2.70+0.40) 0.24(0.64+0.52) -57.1% 0.20(0.28+0.60) -64.3%
7820.11: perl grep '[how] to' 0.56(2.66+0.38) 0.29(0.95+0.45) -48.2% 0.23(0.45+0.54) -58.9%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 1.02(5.77+0.42) 0.31(1.02+0.54) -69.6% 0.23(0.50+0.54) -77.5%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.38(1.57+0.42) 0.27(0.85+0.46) -28.9% 0.21(0.33+0.57) -44.7%
See commit ("perf: add a comparison test of grep regex engines",
2017-04-19) for details on the machine the above test run was executed
on.
Here HEAD~2 is git with PCRE v1 without JIT, HEAD~ is PCRE v1 with
JIT, and HEAD is PCRE v2 (also with JIT). See previous commits of mine
mentioning p7820-grep-engines.sh for more details on the test setup.
For ease of readability, a different run just of HEAD~ (PCRE v1 with
JIT v.s. PCRE v2), again with just the /perl/ tests shown:
[...]
Test HEAD~ HEAD
----------------------------------------------------------------------------------------
7820.3: perl grep 'how.to' 0.21(0.42+0.52) 0.21(0.31+0.58) +0.0%
7820.7: perl grep '^how to' 0.25(0.65+0.50) 0.20(0.31+0.57) -20.0%
7820.11: perl grep '[how] to' 0.30(0.90+0.50) 0.23(0.46+0.53) -23.3%
7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 0.30(1.19+0.38) 0.23(0.51+0.51) -23.3%
7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.27(0.84+0.48) 0.21(0.34+0.57) -22.2%
I.e. the two are either neck-to-neck, but PCRE v2 usually pulls ahead,
when it does it's around 20% faster.
A brief note on thread safety: As noted in pcre2api(3) & pcre2jit(3)
the compiled pattern can be shared between threads, but not some of
the JIT context, however the grep threading support does all pattern &
JIT compilation in separate threads, so this code doesn't need to
concern itself with thread safety.
See commit 63e7e9d8b6 ("git-grep: Learn PCRE", 2011-05-09) for the
initial addition of PCRE v1. This change follows some of the same
patterns it did (and which were discussed on list at the time),
e.g. mocking up types with typedef instead of ifdef-ing them out when
USE_LIBPCRE2 isn't defined. This adds some trivial memory use to the
program, but makes the code look nicer.
1. https://lists.exim.org/lurker/message/20150105.162835.0666407a.en.html
2. https://lists.exim.org/lurker/thread/20170419.172322.833ee099.en.html
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-02 02:20:56 +08:00
|
|
|
int pcre2;
|
2009-05-08 03:46:48 +08:00
|
|
|
int relative;
|
|
|
|
int pathname;
|
|
|
|
int null_following_name;
|
2009-03-07 20:32:32 +08:00
|
|
|
int color;
|
grep: Add --max-depth option.
It is useful to grep directories non-recursively, e.g. when one wants to
look for all files in the toplevel directory, but not in any subdirectory,
or in Documentation/, but not in Documentation/technical/.
This patch adds support for --max-depth <depth> option to git-grep. If it is
given, git-grep descends at most <depth> levels of directories below paths
specified on the command line.
Note that if path specified on command line contains wildcards, this option
makes no sense, e.g.
$ git grep -l --max-depth 0 GNU -- 'contrib/*'
(note the quotes) will search all files in contrib/, even in
subdirectories, because '*' matches all files.
Documentation updates, bash-completion and simple test cases are also
provided.
Signed-off-by: Michał Kiedrowicz <michal.kiedrowicz@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-07-23 01:52:15 +08:00
|
|
|
int max_depth;
|
2009-07-02 06:06:34 +08:00
|
|
|
int funcname;
|
2011-08-02 01:20:53 +08:00
|
|
|
int funcbody;
|
grep: add a grep.patternType configuration setting
The grep.extendedRegexp configuration setting enables the -E flag on grep
by default but there are no equivalents for the -G, -F and -P flags.
Rather than adding an additional setting for grep.fooRegexp for current
and future pattern matching options, add a grep.patternType setting that
can accept appropriate values for modifying the default grep pattern
matching behavior. The current values are "basic", "extended", "fixed",
"perl" and "default" for setting -G, -E, -F, -P and the default behavior
respectively.
When grep.patternType is set to a value other than "default", the
grep.extendedRegexp setting is ignored. The value of "default" restores
the current default behavior, including the grep.extendedRegexp
behavior.
Signed-off-by: J Smith <dark.panda@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-08-03 22:53:50 +08:00
|
|
|
int extended_regexp_option;
|
|
|
|
int pattern_type_option;
|
2010-03-08 00:52:47 +08:00
|
|
|
char color_context[COLOR_MAXLEN];
|
grep: Colorize filename, line number, and separator
Colorize the filename, line number, and separator in git grep output, as
GNU grep does. The colors are customizable through color.grep.<slot>.
The default is to only color the separator (in cyan), since this gives
the biggest legibility increase without overwhelming the user with
colors. GNU grep also defaults cyan for the separator, but defaults to
magenta for the filename and to green for the line number, as well.
There is one difference from GNU grep: When a binary file matches
without -a, GNU grep does not color the <file> in "Binary file <file>
matches", but we do.
Like GNU grep, if --null is given, the null separators are not colored.
For config.txt, use a a sub-list to describe the slots, rather than
a single paragraph with parentheses, since this is much more readable.
Remove the cast to int for `rm_eo - rm_so` since it is not necessary.
Signed-off-by: Mark Lodato <lodatom@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-03-08 00:52:46 +08:00
|
|
|
char color_filename[COLOR_MAXLEN];
|
2010-03-08 00:52:47 +08:00
|
|
|
char color_function[COLOR_MAXLEN];
|
grep: Colorize filename, line number, and separator
Colorize the filename, line number, and separator in git grep output, as
GNU grep does. The colors are customizable through color.grep.<slot>.
The default is to only color the separator (in cyan), since this gives
the biggest legibility increase without overwhelming the user with
colors. GNU grep also defaults cyan for the separator, but defaults to
magenta for the filename and to green for the line number, as well.
There is one difference from GNU grep: When a binary file matches
without -a, GNU grep does not color the <file> in "Binary file <file>
matches", but we do.
Like GNU grep, if --null is given, the null separators are not colored.
For config.txt, use a a sub-list to describe the slots, rather than
a single paragraph with parentheses, since this is much more readable.
Remove the cast to int for `rm_eo - rm_so` since it is not necessary.
Signed-off-by: Mark Lodato <lodatom@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-03-08 00:52:46 +08:00
|
|
|
char color_lineno[COLOR_MAXLEN];
|
2014-10-28 02:23:05 +08:00
|
|
|
char color_match_context[COLOR_MAXLEN];
|
|
|
|
char color_match_selected[COLOR_MAXLEN];
|
2010-03-08 00:52:47 +08:00
|
|
|
char color_selected[COLOR_MAXLEN];
|
grep: Colorize filename, line number, and separator
Colorize the filename, line number, and separator in git grep output, as
GNU grep does. The colors are customizable through color.grep.<slot>.
The default is to only color the separator (in cyan), since this gives
the biggest legibility increase without overwhelming the user with
colors. GNU grep also defaults cyan for the separator, but defaults to
magenta for the filename and to green for the line number, as well.
There is one difference from GNU grep: When a binary file matches
without -a, GNU grep does not color the <file> in "Binary file <file>
matches", but we do.
Like GNU grep, if --null is given, the null separators are not colored.
For config.txt, use a a sub-list to describe the slots, rather than
a single paragraph with parentheses, since this is much more readable.
Remove the cast to int for `rm_eo - rm_so` since it is not necessary.
Signed-off-by: Mark Lodato <lodatom@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-03-08 00:52:46 +08:00
|
|
|
char color_sep[COLOR_MAXLEN];
|
2006-09-18 07:02:52 +08:00
|
|
|
int regflags;
|
|
|
|
unsigned pre_context;
|
|
|
|
unsigned post_context;
|
2009-07-02 06:02:38 +08:00
|
|
|
unsigned last_shown;
|
2009-07-02 06:03:44 +08:00
|
|
|
int show_hunk_mark;
|
2011-06-05 23:24:25 +08:00
|
|
|
int file_break;
|
2011-06-05 23:24:36 +08:00
|
|
|
int heading;
|
2009-07-02 06:07:24 +08:00
|
|
|
void *priv;
|
2010-01-26 06:51:39 +08:00
|
|
|
|
|
|
|
void (*output)(struct grep_opt *opt, const void *data, size_t size);
|
|
|
|
void *output_priv;
|
2006-09-18 07:02:52 +08:00
|
|
|
};
|
|
|
|
|
2012-10-10 07:17:50 +08:00
|
|
|
extern void init_grep_defaults(void);
|
|
|
|
extern int grep_config(const char *var, const char *value, void *);
|
|
|
|
extern void grep_init(struct grep_opt *, const char *prefix);
|
2012-10-04 05:47:48 +08:00
|
|
|
void grep_commit_pattern_type(enum grep_pattern_type, struct grep_opt *opt);
|
2012-10-10 07:17:50 +08:00
|
|
|
|
2010-05-23 05:43:43 +08:00
|
|
|
extern void append_grep_pat(struct grep_opt *opt, const char *pat, size_t patlen, const char *origin, int no, enum grep_pat_token t);
|
2006-09-18 07:02:52 +08:00
|
|
|
extern void append_grep_pattern(struct grep_opt *opt, const char *pat, const char *origin, int no, enum grep_pat_token t);
|
log --author/--committer: really match only with name part
When we tried to find commits done by AUTHOR, the first implementation
tried to pattern match a line with "^author .*AUTHOR", which later was
enhanced to strip leading caret and look for "^author AUTHOR" when the
search pattern was anchored at the left end (i.e. --author="^AUTHOR").
This had a few problems:
* When looking for fixed strings (e.g. "git log -F --author=x --grep=y"),
the regexp internally used "^author .*x" would never match anything;
* To match at the end (e.g. "git log --author='google.com>$'"), the
generated regexp has to also match the trailing timestamp part the
commit header lines have. Also, in order to determine if the '$' at
the end means "match at the end of the line" or just a literal dollar
sign (probably backslash-quoted), we would need to parse the regexp
ourselves.
An earlier alternative tried to make sure that a line matches "^author "
(to limit by field name) and the user supplied pattern at the same time.
While it solved the -F problem by introducing a special override for
matching the "^author ", it did not solve the trailing timestamp nor tail
match problem. It also would have matched every commit if --author=author
was asked for, not because the author's email part had this string, but
because every commit header line that talks about the author begins with
that field name, regardleses of who wrote it.
Instead of piling more hacks on top of hacks, this rethinks the grep
machinery that is used to look for strings in the commit header, and makes
sure that (1) field name matches literally at the beginning of the line,
followed by a SP, and (2) the user supplied pattern is matched against the
remainder of the line, excluding the trailing timestamp data.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-09-05 13:15:02 +08:00
|
|
|
extern void append_header_grep_pattern(struct grep_opt *, enum grep_header_field, const char *);
|
2006-09-18 07:02:52 +08:00
|
|
|
extern void compile_grep_patterns(struct grep_opt *opt);
|
2006-09-28 07:27:10 +08:00
|
|
|
extern void free_grep_patterns(struct grep_opt *opt);
|
2012-02-02 16:20:10 +08:00
|
|
|
extern int grep_buffer(struct grep_opt *opt, char *buf, unsigned long size);
|
2006-09-18 07:02:52 +08:00
|
|
|
|
grep: refactor the concept of "grep source" into an object
The main interface to the low-level grep code is
grep_buffer, which takes a pointer to a buffer and a size.
This is convenient and flexible (we use it to grep commit
bodies, files on disk, and blobs by sha1), but it makes it
hard to pass extra information about what we are grepping
(either for correctness, like overriding binary
auto-detection, or for optimizations, like lazily loading
blob contents).
Instead, let's encapsulate the idea of a "grep source",
including the buffer, its size, and where the data is coming
from. This is similar to the diff_filespec structure used by
the diff code (unsurprising, since future patches will
implement some of the same optimizations found there).
The diffstat is slightly scarier than the actual patch
content. Most of the modified lines are simply replacing
access to raw variables with their counterparts that are now
in a "struct grep_source". Most of the added lines were
taken from builtin/grep.c, which partially abstracted the
idea of grep sources (for file vs sha1 sources).
Instead of dropping the now-redundant code, this patch
leaves builtin/grep.c using the traditional grep_buffer
interface (which now wraps the grep_source interface). That
makes it easy to test that there is no change of behavior
(yet).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-02 16:19:28 +08:00
|
|
|
struct grep_source {
|
|
|
|
char *name;
|
|
|
|
|
|
|
|
enum grep_source_type {
|
|
|
|
GREP_SOURCE_SHA1,
|
|
|
|
GREP_SOURCE_FILE,
|
|
|
|
GREP_SOURCE_BUF,
|
2016-12-17 03:03:19 +08:00
|
|
|
GREP_SOURCE_SUBMODULE,
|
grep: refactor the concept of "grep source" into an object
The main interface to the low-level grep code is
grep_buffer, which takes a pointer to a buffer and a size.
This is convenient and flexible (we use it to grep commit
bodies, files on disk, and blobs by sha1), but it makes it
hard to pass extra information about what we are grepping
(either for correctness, like overriding binary
auto-detection, or for optimizations, like lazily loading
blob contents).
Instead, let's encapsulate the idea of a "grep source",
including the buffer, its size, and where the data is coming
from. This is similar to the diff_filespec structure used by
the diff code (unsurprising, since future patches will
implement some of the same optimizations found there).
The diffstat is slightly scarier than the actual patch
content. Most of the modified lines are simply replacing
access to raw variables with their counterparts that are now
in a "struct grep_source". Most of the added lines were
taken from builtin/grep.c, which partially abstracted the
idea of grep sources (for file vs sha1 sources).
Instead of dropping the now-redundant code, this patch
leaves builtin/grep.c using the traditional grep_buffer
interface (which now wraps the grep_source interface). That
makes it easy to test that there is no change of behavior
(yet).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-02 16:19:28 +08:00
|
|
|
} type;
|
|
|
|
void *identifier;
|
|
|
|
|
|
|
|
char *buf;
|
|
|
|
unsigned long size;
|
2012-02-02 16:20:43 +08:00
|
|
|
|
2012-10-12 18:49:38 +08:00
|
|
|
char *path; /* for attribute lookups */
|
2012-02-02 16:20:43 +08:00
|
|
|
struct userdiff_driver *driver;
|
grep: refactor the concept of "grep source" into an object
The main interface to the low-level grep code is
grep_buffer, which takes a pointer to a buffer and a size.
This is convenient and flexible (we use it to grep commit
bodies, files on disk, and blobs by sha1), but it makes it
hard to pass extra information about what we are grepping
(either for correctness, like overriding binary
auto-detection, or for optimizations, like lazily loading
blob contents).
Instead, let's encapsulate the idea of a "grep source",
including the buffer, its size, and where the data is coming
from. This is similar to the diff_filespec structure used by
the diff code (unsurprising, since future patches will
implement some of the same optimizations found there).
The diffstat is slightly scarier than the actual patch
content. Most of the modified lines are simply replacing
access to raw variables with their counterparts that are now
in a "struct grep_source". Most of the added lines were
taken from builtin/grep.c, which partially abstracted the
idea of grep sources (for file vs sha1 sources).
Instead of dropping the now-redundant code, this patch
leaves builtin/grep.c using the traditional grep_buffer
interface (which now wraps the grep_source interface). That
makes it easy to test that there is no change of behavior
(yet).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-02 16:19:28 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
void grep_source_init(struct grep_source *gs, enum grep_source_type type,
|
2012-10-12 18:49:38 +08:00
|
|
|
const char *name, const char *path,
|
|
|
|
const void *identifier);
|
grep: refactor the concept of "grep source" into an object
The main interface to the low-level grep code is
grep_buffer, which takes a pointer to a buffer and a size.
This is convenient and flexible (we use it to grep commit
bodies, files on disk, and blobs by sha1), but it makes it
hard to pass extra information about what we are grepping
(either for correctness, like overriding binary
auto-detection, or for optimizations, like lazily loading
blob contents).
Instead, let's encapsulate the idea of a "grep source",
including the buffer, its size, and where the data is coming
from. This is similar to the diff_filespec structure used by
the diff code (unsurprising, since future patches will
implement some of the same optimizations found there).
The diffstat is slightly scarier than the actual patch
content. Most of the modified lines are simply replacing
access to raw variables with their counterparts that are now
in a "struct grep_source". Most of the added lines were
taken from builtin/grep.c, which partially abstracted the
idea of grep sources (for file vs sha1 sources).
Instead of dropping the now-redundant code, this patch
leaves builtin/grep.c using the traditional grep_buffer
interface (which now wraps the grep_source interface). That
makes it easy to test that there is no change of behavior
(yet).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-02 16:19:28 +08:00
|
|
|
void grep_source_clear_data(struct grep_source *gs);
|
|
|
|
void grep_source_clear(struct grep_source *gs);
|
2012-02-02 16:20:43 +08:00
|
|
|
void grep_source_load_driver(struct grep_source *gs);
|
2012-09-16 05:04:36 +08:00
|
|
|
|
grep: refactor the concept of "grep source" into an object
The main interface to the low-level grep code is
grep_buffer, which takes a pointer to a buffer and a size.
This is convenient and flexible (we use it to grep commit
bodies, files on disk, and blobs by sha1), but it makes it
hard to pass extra information about what we are grepping
(either for correctness, like overriding binary
auto-detection, or for optimizations, like lazily loading
blob contents).
Instead, let's encapsulate the idea of a "grep source",
including the buffer, its size, and where the data is coming
from. This is similar to the diff_filespec structure used by
the diff code (unsurprising, since future patches will
implement some of the same optimizations found there).
The diffstat is slightly scarier than the actual patch
content. Most of the modified lines are simply replacing
access to raw variables with their counterparts that are now
in a "struct grep_source". Most of the added lines were
taken from builtin/grep.c, which partially abstracted the
idea of grep sources (for file vs sha1 sources).
Instead of dropping the now-redundant code, this patch
leaves builtin/grep.c using the traditional grep_buffer
interface (which now wraps the grep_source interface). That
makes it easy to test that there is no change of behavior
(yet).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-02 16:19:28 +08:00
|
|
|
|
|
|
|
int grep_source(struct grep_opt *opt, struct grep_source *gs);
|
|
|
|
|
2010-01-26 06:51:39 +08:00
|
|
|
extern struct grep_opt *grep_opt_dup(const struct grep_opt *opt);
|
|
|
|
extern int grep_threads_ok(const struct grep_opt *opt);
|
|
|
|
|
2011-12-13 05:16:07 +08:00
|
|
|
#ifndef NO_PTHREADS
|
|
|
|
/*
|
|
|
|
* Mutex used around access to the attributes machinery if
|
|
|
|
* opt->use_threads. Must be initialized/destroyed by callers!
|
|
|
|
*/
|
grep: make locking flag global
The low-level grep code traditionally didn't care about
threading, as it doesn't do any threading itself and didn't
call out to other non-thread-safe code. That changed with
0579f91 (grep: enable threading with -p and -W using lazy
attribute lookup, 2011-12-12), which pushed the lookup of
funcname attributes (which is not thread-safe) into the
low-level grep code.
As a result, the low-level code learned about a new global
"grep_attr_mutex" to serialize access to the attribute code.
A multi-threaded caller (e.g., builtin/grep.c) is expected
to initialize the mutex and set "use_threads" in the
grep_opt structure. The low-level code only uses the lock if
use_threads is set.
However, putting the use_threads flag into the grep_opt
struct is not the most logical place. Whether threading is
in use is not something that matters for each call to
grep_buffer, but is instead global to the whole program
(i.e., if any thread is doing multi-threaded grep, every
other thread, even if it thinks it is doing its own
single-threaded grep, would need to use the locking). In
practice, this distinction isn't a problem for us, because
the only user of multi-threaded grep is "git-grep", which
does nothing except call grep.
This patch turns the opt->use_threads flag into a global
flag. More important than the nit-picking semantic argument
above is that this means that the locking functions don't
need to actually have access to a grep_opt to know whether
to lock. Which in turn can make adding new locks simpler, as
we don't need to pass around a grep_opt.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-02 16:18:29 +08:00
|
|
|
extern int grep_use_locks;
|
2011-12-13 05:16:07 +08:00
|
|
|
extern pthread_mutex_t grep_attr_mutex;
|
2012-02-02 16:18:41 +08:00
|
|
|
extern pthread_mutex_t grep_read_mutex;
|
|
|
|
|
|
|
|
static inline void grep_read_lock(void)
|
|
|
|
{
|
|
|
|
if (grep_use_locks)
|
|
|
|
pthread_mutex_lock(&grep_read_mutex);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void grep_read_unlock(void)
|
|
|
|
{
|
|
|
|
if (grep_use_locks)
|
|
|
|
pthread_mutex_unlock(&grep_read_mutex);
|
|
|
|
}
|
|
|
|
|
|
|
|
#else
|
|
|
|
#define grep_read_lock()
|
|
|
|
#define grep_read_unlock()
|
2011-12-13 05:16:07 +08:00
|
|
|
#endif
|
|
|
|
|
2006-09-18 07:02:52 +08:00
|
|
|
#endif
|