Simplify the parsing of "grep.patternType" and
"grep.extendedRegexp". This changes no behavior, but gets rid of
complex parsing logic that isn't needed anymore.
When "grep.patternType" was introduced in 84befcd0a4 (grep: add a
grep.patternType configuration setting, 2012-08-03) we promised that:
1. You can set "grep.patternType", and "[setting it to] 'default'
will return to the default matching behavior".
In that context "the default" meant whatever the configuration
system specified before that change, i.e. via grep.extendedRegexp.
2. We'd support the existing "grep.extendedRegexp" option, but ignore
it when the new "grep.patternType" option is set. We said we'd
only ignore the older "grep.extendedRegexp" option "when the
`grep.patternType` option is set to a value other than
'default'".
In a preceding commit we changed grep_config() to be called after
grep_init(), which means that much of the complexity here can go
away.
As before both "grep.patternType" and "grep.extendedRegexp" are
last-one-wins variable, with "grep.extendedRegexp" yielding to
"grep.patternType", except when "grep.patternType=default".
Note that as the previously added tests indicate this cannot be done
on-the-fly as we see the config variables, without introducing more
state keeping. I.e. if we see:
-c grep.extendedRegexp=false
-c grep.patternType=default
-c extendedRegexp=true
We need to select ERE, since grep.patternType=default unselects that
variable, which normally has higher precedence, but we also need to
select BRE in cases of:
-c grep.extendedRegexp=true \
-c grep.extendedRegexp=false
Which would not be the case for this, which select ERE:
-c grep.patternType=extended \
-c grep.extendedRegexp=false
Therefore we cannot do this on-the-fly in grep_config without also
introducing tracking variables for not only the pattern type, but what
the source of that pattern type was.
So we need to decide on the pattern after our config was fully
parsed. Let's do that by deferring the decision on the pattern type
until it's time to compile it in compile_regexp().
By that time we've not only parsed the config, but also handled the
command-line options. Those will set "opt.pattern_type_option" (*not*
"opt.extended_regexp_option"!).
At that point all we need to do is see if "grep.patternType" was
UNSPECIFIED in the end (including an explicit "=default"), if so we'll
use the "grep.extendedRegexp" configuration, if any.
See my 07a3d41173 (grep: remove regflags from the public grep_opt
API, 2017-06-29) for addition of the two comments being removed here,
i.e. the complexity noted in that commit is now going away.
1. https://lore.kernel.org/git/patch-v8-09.10-c211bb0c69d-20220118T155211Z-avarab@gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Change code in "builtin/grep.c" and "builtin/ls-tree.c" to trust the
"prefix" passed from "run_builtin()". The "prefix" we get from setup.c
is either going to be NULL or a string of length >0, never "".
So we can drop the "prefix && *prefix" checks added for
"builtin/grep.c" in 0d042fecf2 (git-grep: show pathnames relative to
the current directory, 2006-08-11), and for "builtin/ls-tree.c" in
a69dd585fc (ls-tree: chomp leading directories when run from a
subdirectory, 2005-12-23).
As seen in code in revision.c that was added in cd676a5136 (diff
--relative: output paths as relative to the current subdirectory,
2008-02-12) we already have existing code that does away with this
assertion.
This makes it easier to reason about a subsequent change to the
"prefix_length" code in grep.c in a subsequent commit, and since we're
going to the trouble of doing that let's leave behind an assert() to
promise this to any future callers.
For "builtin/grep.c" it would be painful to pass the "prefix" down the
callchain of:
cmd_grep -> grep_tree -> grep_submodule -> grep_cache -> grep_oid ->
grep_source_name
So for the code that needs it in grep_source_name() let's add a
"grep_prefix" variable similar to the existing "ls_tree_prefix".
While at it let's move the code in cmd_ls_tree() around so that we
assign to the "ls_tree_prefix" right after declaring the variables,
and stop assigning to "prefix". We only subsequently used that
variable later in the function after clobbering it. Let's just use our
own "grep_prefix" instead.
Let's also add an assert() in git.c, so that we'll make this promise
about the "prefix" to any current and future callers, as well as to
any readers of the code.
Code history:
* The strlen() in "grep.c" hasn't been used since 493b7a08d8 (grep:
accept relative paths outside current working directory, 2009-09-05).
When that code was added in 0d042fecf2 (git-grep: show pathnames
relative to the current directory, 2006-08-11) we used the length.
But since 493b7a08d8 we haven't used it for anything except a
boolean check that we could have done on the "prefix" member
itself.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Similar message templates have been consolidated so that
translators need to work on fewer number of messages.
* ja/i18n-similar-messages:
i18n: turn even more messages into "cannot be used together" ones
i18n: ref-filter: factorize "%(foo) atom used without %(bar) atom"
i18n: factorize "--foo outside a repository"
i18n: refactor "unrecognized %(foo) argument" strings
i18n: factorize "no directory given for --foo"
i18n: factorize "--foo requires --bar" and the like
i18n: tag.c factorize i18n strings
i18n: standardize "cannot open" and "cannot read"
i18n: turn "options are incompatible" into "cannot be used together"
i18n: refactor "%s, %s and %s are mutually exclusive"
i18n: refactor "foo and bar are mutually exclusive"
"git log --invert-grep --author=<name>" used to exclude commits
written by the given author, but now "--invert-grep" only affects
the matches made by the "--grep=<pattern>" option.
* rs/log-invert-grep-with-headers:
log: let --invert-grep only invert --grep
They are all replaced by "the option '%s' requires '%s'", which is a
new string but replaces 17 previous unique strings.
Signed-off-by: Jean-Noël Avila <jn.avila@free.fr>
Reviewed-by: Johannes Sixt <j6t@kdbg.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Weather balloon to find compilers that do not grok variable
declaration in the for() loop.
* jc/c99-var-decl-in-for-loop:
revision: use C99 declaration of variable in for() loop
The option --invert-grep is documented to filter out commits whose
messages match the --grep filters. However, it also affects the
header matches (--author, --committer), which is not intended.
Move the handling of that option to grep.c, as only the code there can
distinguish between matches in the header from those in the message
body. If --invert-grep is given then enable extended expressions (not
the regex type, we just need git grep's --not to work), negate the body
patterns and check if any of them match by piggy-backing on the
collect_hits mechanism of grep_source_1().
Collecting the matches in struct grep_opt is a bit iffy, but with
"last_shown" we have a precedent for writing state information to that
struct.
Reported-by: Dotan Cohen <dotancohen@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are certain C99 features that might be nice to use in our code
base, but we've hesitated to do so in order to avoid breaking
compatibility with older compilers. But we don't actually know if
people are even using pre-C99 compilers these days.
One way to figure that out is to introduce a very small use of a
feature, and see if anybody complains, and we've done so to probe
the portability for a few features like "trailing comma in enum
declaration", "designated initializer for struct", and "designated
initializer for array". A few years ago, we tried to use a handy
for (int i = 0; i < n; i++)
use(i);
to introduce a new variable valid only in the loop, but found that
some compilers we cared about didn't like it back then. Two years
is a long-enough time, so let's try it again.
If this patch can survive a few releases without complaint, then we
can feel more confident that variable declaration in for() loop is
supported by the compilers our user base use. And if we do get
complaints, then we'll have gained some data and we can easily
revert this patch.
Helped-by: Martin Ågren <martin.agren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This reverts commit f45022dc2f,
as this is like breakage in the traversal more likely. In a
history with 10 single strand of pearls,
1-->2-->3--...->7-->8-->9-->10
asking "rev-list --unsorted-input 1 10 --not 9 8 7 6 5 4" fails to
paint the bottom 1 uninteresting as the traversal stops, without
completing the propagation of uninteresting bit starting at 4 down
through 3 and 2 to 1.
The ref iteration code used to optionally allow dangling refs to be
shown, which has been tightened up.
* jk/ref-paranoia:
refs: drop "broken" flag from for_each_fullref_in()
ref-filter: drop broken-ref code entirely
ref-filter: stop setting FILTER_REFS_INCLUDE_BROKEN
repack, prune: drop GIT_REF_PARANOIA settings
refs: turn on GIT_REF_PARANOIA by default
refs: omit dangling symrefs when using GIT_REF_PARANOIA
refs: add DO_FOR_EACH_OMIT_DANGLING_SYMREFS flag
refs-internal.h: reorganize DO_FOR_EACH_* flag documentation
refs-internal.h: move DO_FOR_EACH_* flags next to each other
t5312: be more assertive about command failure
t5312: test non-destructive repack
t5312: create bogus ref as necessary
t5312: drop "verbose" helper
t5600: provide detached HEAD for corruption failures
t5516: don't use HEAD ref for invalid ref-deletion tests
t7900: clean up some more broken refs
More code paths that use the hack to add submodule's object
database to the set of alternate object store have been cleaned up.
* jt/add-submodule-odb-clean-up:
revision: remove "submodule" from opt struct
repository: support unabsorbed in repo_submodule_init
submodule: remove unnecessary unabsorbed fallback
In C it isn't required to specify that all members of a struct are
zero'd out to 0, NULL or '\0', just providing a "{ 0 }" will
accomplish that.
Let's also change code that provided N zero'd fields to just
provide one, and change e.g. "{ NULL }" to "{ 0 }" for
consistency. I.e. even if the first member is a pointer let's use "0"
instead of "NULL". The point of using "0" consistently is to pick one,
and to not have the reader wonder why we're not using the same pattern
everywhere.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
No callers pass in anything but "0" here. Likewise to our sibling
functions. Note that some of them ferry along the flag, but none of
their callers pass anything but "0" either.
Nor is anybody likely to change that. Callers which really want to see
all of the raw refs use for_each_rawref(). And anybody interested in
iterating a subset of the refs will likely be happy to use the
now-default behavior of showing broken refs, but omitting dangling
symlinks.
So we can get rid of this whole feature.
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Clean up a TODO in revision.h by removing the "submodule" field from
struct setup_revision_opt. This field is only used to specify the ref
store to use, so use rev_info->repo to determine the ref store instead.
The only users of this field are merge-ort.c and merge-recursive.c.
However, both these files specify the superproject as rev_info->repo and
the submodule as setup_revision_opt->submodule. In order to be able to
pass the submodule as rev_info->repo, all commits must be parsed with
the submodule explicitly specified; this patch does that as well. (An
incremental solution in which only some commits are parsed with explicit
submodule will not work, because if the same commit is parsed twice in
different repositories, there will be 2 heap-allocated object structs
corresponding to that commit, and any flag set by the revision walking
mechanism on one of them will not be reflected onto the other.)
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When queueing references in git-rev-list(1), we try to optimize parsing
of commits via the commit-graph. To do so, we first look up the object's
type, and if it is a commit we call `repo_parse_commit()` instead of
`parse_object()`. This is quite inefficient though given that we're
always uncompressing the object header in order to determine the type.
Instead, we can opportunistically search the commit-graph for the object
ID: in case it's found, we know it's a commit and can directly fill in
the commit object without having to uncompress the object header.
Expose a new function `lookup_commit_in_graph()`, which tries to find a
commit in the commit-graph by ID, and convert `get_reference()` to use
this function. This provides a big performance win in cases where we
load references in a repository with lots of references pointing to
commits. The following has been executed in a real-world repository with
about 2.2 million refs:
Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
Time (mean ± σ): 4.458 s ± 0.044 s [User: 4.115 s, System: 0.342 s]
Range (min … max): 4.409 s … 4.534 s 10 runs
Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
Time (mean ± σ): 3.089 s ± 0.015 s [User: 2.768 s, System: 0.321 s]
Range (min … max): 3.061 s … 3.105 s 10 runs
Summary
'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
1.44 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When queueing up references for the revision walk, `handle_one_ref()`
will resolve the reference's object ID via `get_reference()` and then
queue the ID as pending object via `add_pending_oid()`. But given that
`add_pending_oid()` is only a thin wrapper around `add_pending_object()`
which fist calls `get_reference()`, we effectively resolve the reference
twice and thus duplicate some of the work.
Fix the issue by instead calling `add_pending_object()` directly, which
takes the already-resolved object as input. In a repository with lots of
refs, this translates into a near 10% speedup:
Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
Time (mean ± σ): 5.015 s ± 0.038 s [User: 4.698 s, System: 0.316 s]
Range (min … max): 4.970 s … 5.089 s 10 runs
Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
Time (mean ± σ): 4.606 s ± 0.029 s [User: 4.260 s, System: 0.345 s]
Range (min … max): 4.565 s … 4.657 s 10 runs
Summary
'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
1.09 ± 0.01 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In order to compute whether objects reachable from a set of tips are all
connected, we do a revision walk with these tips as positive references
and `--not --all`. `--not --all` will cause the revision walk to load
all preexisting references as uninteresting, which can be very expensive
in repositories with many references.
Benchmarking the git-rev-list(1) command highlights that by far the most
expensive single phase is initial sorting of the input revisions: after
all references have been loaded, we first sort commits by author date.
In a real-world repository with about 2.2 million references, it makes
up about 40% of the total runtime of git-rev-list(1).
Ultimately, the connectivity check shouldn't really bother about the
order of input revisions at all. We only care whether we can actually
walk all objects until we hit the cut-off point. So sorting the input is
a complete waste of time.
Introduce a new "--unsorted-input" flag to git-rev-list(1) which will
cause it to not sort the commits and adjust the connectivity check to
always pass the flag. This results in the following speedups, executed
in a clone of gitlab-org/gitlab [1]:
Benchmark #1: git rev-list --objects --quiet --not --all --not $(cat newrev)
Time (mean ± σ): 7.639 s ± 0.065 s [User: 7.304 s, System: 0.335 s]
Range (min … max): 7.543 s … 7.742 s 10 runs
Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev
Time (mean ± σ): 4.995 s ± 0.044 s [User: 4.657 s, System: 0.337 s]
Range (min … max): 4.909 s … 5.048 s 10 runs
Summary
'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran
1.53 ± 0.02 times faster than 'git rev-list --objects --quiet --not --all --not $newrev'
[1]: https://gitlab.com/gitlab-org/gitlab.git. Note that not all refs
are visible to clients.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `--no-walk` flag supports two modes: either it sorts the revisions
given as input input or it doesn't. This is reflected in a single
`no_walk` flag, which reflects one of the three states "walk", "don't
walk but without sorting" and "don't walk but with sorting".
Split up the flag into two separate bits, one indicating whether we
should walk or not and one indicating whether the input should be sorted
or not. This will allow us to more easily introduce a new flag
`--unsorted-input`, which only impacts the sorting bit.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When compiling with -O3, some gcc versions (10.2.1 here) complain about
an out-of-bounds subscript:
revision.c: In function ‘do_add_index_objects_to_pending’:
revision.c:321:22: error: array subscript [1, 2147483647] is outside array bounds of ‘char[1]’ [-Werror=array-bounds]
321 | if (0 < len && name[len] && buf.len)
| ~~~~^~~~~
The "len" parameter here comes from calling interpret_branch_name(),
which intends to return the number of characters of "name" it parsed.
But the compiler doesn't realize this. It knows the size of the empty
string "name" passed in from do_add_index_objects_to_pending(), but it
has no clue that the "len" we get back will be constrained to "0" in
that case.
And I don't think the warning is telling us about some subtle or clever
bug. The implementation of interpret_branch_name() is in another file
entirely, and the compiler can't see it (you can even verify there is no
clever LTO going on by replacing it with "return 0" and still getting
the warning).
We can work around this by replacing our "did we hit the trailing NUL"
subscript dereference with a length check. We do not even have to pay
the cost for an extra strlen(), as we can pass our new length into
interpret_branch_name(), which was converting our "0" into a call to
strlen() anyway.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Plug various leans reported by LSAN.
* ah/plugleaks:
builtin/rm: avoid leaking pathspec and seen
builtin/rebase: release git_format_patch_opt too
builtin/for-each-ref: free filter and UNLEAK sorting.
mailinfo: also free strbuf lists when clearing mailinfo
builtin/checkout: clear pending objects after diffing
builtin/check-ignore: clear_pathspec before returning
builtin/bugreport: don't leak prefixed filename
branch: FREE_AND_NULL instead of NULL'ing real_ref
bloom: clear each bloom_key after use
ls-files: free max_prefix when done
wt-status: fix multiple small leaks
revision: free remainder of old commit list in limit_list
"git rev-list" learns the "--filter=object:type=<type>" option,
which can be used to exclude objects of the given kind from the
packfile generated by pack-objects.
* ps/rev-list-object-type-filter:
rev-list: allow filtering of provided items
pack-bitmap: implement combined filter
pack-bitmap: implement object type filter
list-objects: implement object type filter
list-objects: support filtering by tag and commit
list-objects: move tag processing into its own function
revision: mark commit parents as NOT_USER_GIVEN
uploadpack.txt: document implication of `uploadpackfilter.allow`
Builds on top of the sparse-index infrastructure to mark operations
that are not ready to mark with the sparse index, causing them to
fall back on fully-populated index that they always have worked with.
* ds/sparse-index-protections: (47 commits)
name-hash: use expand_to_path()
sparse-index: expand_to_path()
name-hash: don't add directories to name_hash
revision: ensure full index
resolve-undo: ensure full index
read-cache: ensure full index
pathspec: ensure full index
merge-recursive: ensure full index
entry: ensure full index
dir: ensure full index
update-index: ensure full index
stash: ensure full index
rm: ensure full index
merge-index: ensure full index
ls-files: ensure full index
grep: ensure full index
fsck: ensure full index
difftool: ensure full index
commit: ensure full index
checkout: ensure full index
...
limit_list() iterates over the original revs->commits list, and consumes
many of its entries via pop_commit. However we might stop iterating over
the list early (e.g. if we realise that the rest of the list is
uninteresting). If we do stop iterating early, list will be pointing to
the unconsumed portion of revs->commits - and we need to free this list
to avoid a leak. (revs->commits itself will be an invalid pointer: it
will have been free'd during the first pop_commit.)
However the list pointer is later reused to iterate over our new list,
but only for the limiting_can_increase_treesame() branch. We therefore
need to introduce a new variable for that branch - and while we're here
we can rename the original list to original_list as that makes its
purpose more obvious.
This leak was found while running t0090. It's not likely to be very
impactful, but it can happen quite early during some checkout
invocations, and hence seems to be worth fixing:
Direct leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x49a85d in malloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3
#1 0x9ac084 in do_xmalloc wrapper.c:41:8
#2 0x9ac05a in xmalloc wrapper.c:62:9
#3 0x7175d6 in commit_list_insert commit.c:540:33
#4 0x71800f in commit_list_insert_by_date commit.c:604:9
#5 0x8f8d2e in process_parents revision.c:1128:5
#6 0x8f2f2c in limit_list revision.c:1418:7
#7 0x8f210e in prepare_revision_walk revision.c:3577:7
#8 0x514170 in orphaned_commit_warning builtin/checkout.c:1185:6
#9 0x512f05 in switch_branches builtin/checkout.c:1250:3
#10 0x50f8de in checkout_branch builtin/checkout.c:1646:9
#11 0x50ba12 in checkout_main builtin/checkout.c:2003:9
#12 0x5086c0 in cmd_checkout builtin/checkout.c:2055:8
#13 0x4cd91d in run_builtin git.c:467:11
#14 0x4cb5f3 in handle_builtin git.c:719:3
#15 0x4ccf47 in run_argv git.c:808:4
#16 0x4caf49 in cmd_main git.c:939:19
#17 0x69dc0e in main common-main.c:52:11
#18 0x7faaabd0e349 in __libc_start_main (/lib64/libc.so.6+0x24349)
Indirect leak of 48 byte(s) in 3 object(s) allocated from:
#0 0x49a85d in malloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3
#1 0x9ac084 in do_xmalloc wrapper.c:41:8
#2 0x9ac05a in xmalloc wrapper.c:62:9
#3 0x717de6 in commit_list_append commit.c:1609:35
#4 0x8f1f9b in prepare_revision_walk revision.c:3554:12
#5 0x514170 in orphaned_commit_warning builtin/checkout.c:1185:6
#6 0x512f05 in switch_branches builtin/checkout.c:1250:3
#7 0x50f8de in checkout_branch builtin/checkout.c:1646:9
#8 0x50ba12 in checkout_main builtin/checkout.c:2003:9
#9 0x5086c0 in cmd_checkout builtin/checkout.c:2055:8
#10 0x4cd91d in run_builtin git.c:467:11
#11 0x4cb5f3 in handle_builtin git.c:719:3
#12 0x4ccf47 in run_argv git.c:808:4
#13 0x4caf49 in cmd_main git.c:939:19
#14 0x69dc0e in main common-main.c:52:11
#15 0x7faaabd0e349 in __libc_start_main (/lib64/libc.so.6+0x24349)
Signed-off-by: Andrzej Hunt <ajrhunt@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Before iterating over all index entries, ensure that a sparse index is
expanded to a full index to avoid unexpected behavior. This case could
be integrated later by ensuring that we walk the tree in the
sparse-directory entry, but the current behavior is only expecting
blobs. Save this integration for later when it can be properly tested.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When --exclude-promisor-objects is given, before traversing any objects
we iterate over all of the objects in any promisor packs, marking them
as UNINTERESTING and SEEN. We turn the oid we get from iterating the
pack into an object with parse_object(), but this has two problems:
- it's slow; we are zlib inflating (and reconstructing from deltas)
every byte of every object in the packfile
- it leaves the tree buffers attached to their structs, which means
our heap usage will grow to store every uncompressed tree
simultaneously. This can be gigabytes.
We can obviously fix the second by freeing the tree buffers after we've
parsed them. But we can observe that the function doesn't look at the
object contents at all! The only reason we call parse_object() is that
we need a "struct object" on which to set the flags. There are two
options here:
- we can look up just the object type via oid_object_info(), and then
call the appropriate lookup_foo() function
- we can call lookup_unknown_object(), which gives us an OBJ_NONE
struct (which will get auto-converted later by object_as_type() via
calls to lookup_commit(), etc).
The first one is closer to the current code, but we do pay the price to
look up the type for each object. The latter should be more efficient in
CPU, though it wastes a little bit of memory (the "unknown" object
structs are a union of all object types, so some of the structs are
bigger than they need to be). It also runs the risk of triggering a
latent bug in code that calls lookup_object() directly but isn't ready
to handle OBJ_NONE (such code would already be buggy, but we use
lookup_unknown_object() infrequently enough that it might be hiding).
I went with the second option here. I don't think the risk is high (and
we'd want to find and fix any such bugs anyway), and it should be more
efficient overall.
The new tests in p5600 show off the improvement (this is on git.git):
Test HEAD^ HEAD
-------------------------------------------------------------------------------
5600.5: count commits 0.37(0.37+0.00) 0.38(0.38+0.00) +2.7%
5600.6: count non-promisor commits 11.74(11.37+0.37) 0.04(0.03+0.00) -99.7%
The improvement is particularly big in this script because _every_
object in the newly-cloned partial repo is a promisor object. So after
marking them all, there's nothing left to traverse.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The NOT_USER_GIVEN flag of an object marks whether a flag was explicitly
provided by the user or not. The most important use case for this is
when filtering objects: only objects that were not explicitly requested
will get filtered.
The flag is currently only set for blobs and trees, which has been fine
given that there are no filters for tags or commits currently. We're
about to extend filtering capabilities to add object type filter though,
which requires us to set up the NOT_USER_GIVEN flag correctly -- if it's
not set, the object wouldn't get filtered at all.
Mark unseen commit parents as NOT_USER_GIVEN when processing parents.
Like this, explicitly provided parents stay user-given and thus
unfiltered, while parents which get loaded as part of the graph walk
can be filtered.
This commit shouldn't have any user-visible impact yet as there is no
logic to filter commits yet.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git repack" so far has been only capable of repacking everything
under the sun into a single pack (or split by size). A cleverer
strategy to reduce the cost of repacking a repository has been
introduced.
* tb/geometric-repack:
builtin/pack-objects.c: ignore missing links with --stdin-packs
builtin/repack.c: reword comment around pack-objects flags
builtin/repack.c: be more conservative with unsigned overflows
builtin/repack.c: assign pack split later
t7703: test --geometric repack with loose objects
builtin/repack.c: do not repack single packs with --geometric
builtin/repack.c: add '--geometric' option
packfile: add kept-pack cache for find_kept_pack_entry()
builtin/pack-objects.c: rewrite honor-pack-keep logic
p5303: measure time to repack with keep
p5303: add missing &&-chains
builtin/pack-objects.c: add '--stdin-packs' option
revision: learn '--no-kept-objects'
packfile: introduce 'find_kept_pack_entry()'
Add and apply a semantic patch for converting code that open-codes
CALLOC_ARRAY to use it instead. It shortens the code and infers the
element size automatically.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A future caller will want to be able to perform a reachability traversal
which terminates when visiting an object found in a kept pack. The
closest existing option is '--honor-pack-keep', but this isn't quite
what we want. Instead of halting the traversal midway through, a full
traversal is always performed, and the results are only trimmed
afterwords.
Besides needing to introduce a new flag (since culling results
post-facto can be different than halting the traversal as it's
happening), there is an additional wrinkle handling the distinction
in-core and on-disk kept packs. That is: what kinds of kept pack should
stop the traversal?
Introduce '--no-kept-objects[=<on-disk|in-core>]' to specify which kinds
of kept packs, if any, should stop a traversal. This can be useful for
callers that want to perform a reachability analysis, but want to leave
certain packs alone (for e.g., when doing a geometric repack that has
some "large" packs which are kept in-core that it wants to leave alone).
Note that this option is not guaranteed to produce exactly the set of
objects that aren't in kept packs, since it's possible the traversal
order may end up in a situation where a non-kept ancestor was "cut off"
by a kept object (at which point we would stop traversing). But, we
don't care about absolute correctness here, since this will eventually
be used as a purely additive guide in an upcoming new repack mode.
Explicitly avoid documenting this new flag, since it is only used
internally. In theory we could avoid even adding it rev-list, but being
able to spell this option out on the command-line makes some special
cases easier to test without promising to keep it behaving consistently
forever. Those tricky cases are exercised in t6114.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The commit-graph learned to use corrected commit dates instead of
the generation number to help topological revision traversal.
* ak/corrected-commit-date:
doc: add corrected commit date info
commit-reach: use corrected commit dates in paint_down_to_common()
commit-graph: use generation v2 only if entire chain does
commit-graph: implement generation data chunk
commit-graph: implement corrected commit date
commit-graph: return 64-bit generation number
commit-graph: add a slab to store topological levels
t6600-test-reach: generalize *_three_modes
commit-graph: consolidate fill_commit_graph_info
revision: parse parent in indegree_walk_step()
commit-graph: fix regression when computing Bloom filters
Lose the debugging aid that may have been useful in the past, but
no longer is, in the "grep" codepaths.
* ab/lose-grep-debug:
grep/log: remove hidden --debug and --grep-debug options
"git log" learned a new "--diff-merges=<how>" option.
* so/log-diff-merge: (32 commits)
t4013: add tests for --diff-merges=first-parent
doc/git-show: include --diff-merges description
doc/rev-list-options: document --first-parent changes merges format
doc/diff-generate-patch: mention new --diff-merges option
doc/git-log: describe new --diff-merges options
diff-merges: add '--diff-merges=1' as synonym for 'first-parent'
diff-merges: add old mnemonic counterparts to --diff-merges
diff-merges: let new options enable diff without -p
diff-merges: do not imply -p for new options
diff-merges: implement new values for --diff-merges
diff-merges: make -m/-c/--cc explicitly mutually exclusive
diff-merges: refactor opt settings into separate functions
diff-merges: get rid of now empty diff_merges_init_revs()
diff-merges: group diff-merge flags next to each other inside 'rev_info'
diff-merges: split 'ignore_merges' field
diff-merges: fix -m to properly override -c/--cc
t4013: add tests for -m failing to override -c/--cc
t4013: support test_expect_failure through ':failure' magic
diff-merges: revise revs->diff flag handling
diff-merges: handle imply -p on -c/--cc logic for log.c
...
Remove the hidden "grep --debug" and "log --grep-debug" options added
in 17bf35a3c7 (grep: teach --debug option to dump the parse tree,
2012-09-13).
At the time these options seem to have been intended to go along with
a documentation discussion and to help the author of relevant tests to
perform ad-hoc debugging on them[1].
Reasons to want this gone:
1. They were never documented, and the only (rather trivial) use of
them in our own codebase for testing is something I removed back
in e01b4dab01 (grep: change non-ASCII -i test to stop using
--debug, 2017-05-20).
2. Googling around doesn't show any in-the-wild uses I could dig up,
and on the Git ML the only mentions after the original discussion
seem to have been when they came up in unrelated diff contexts, or
that test commit of mine.
3. An exception to that is c581e4a749 (grep: under --debug, show
whether PCRE JIT is enabled, 2019-08-18) where we added the
ability to dump out when PCREv2 has the JIT in effect.
The combination of that and my earlier b65abcafc7 (grep: use PCRE
v2 for optimized fixed-string search, 2019-07-01) means Git prints
this out in its most common in-the-wild configuration:
$ git log --grep-debug --grep=foo --grep=bar --grep=baz --all-match
pcre2_jit_on=1
pcre2_jit_on=1
pcre2_jit_on=1
[all-match]
(or
pattern_body<body>foo
(or
pattern_body<body>bar
pattern_body<body>baz
)
)
$ git grep --debug \( -e foo --and -e bar \) --or -e baz
pcre2_jit_on=1
pcre2_jit_on=1
pcre2_jit_on=1
(or
(and
patternfoo
patternbar
)
patternbaz
)
I.e. for each pattern we're considering for the and/or/--all-match
etc. debugging we'll now diligently spew out another identical line
saying whether the PCREv2 JIT is on or not.
I think that nobody's complained about that rather glaringly obviously
bad output says something about how much this is used, i.e. it's
not.
The need for this debugging aid for the composed grep/log patterns
seems to have passed, and the desire to dump the JIT config seems to
have been another one-off around the time we had JIT-related issues on
the PCREv2 codepath. That the original author of this debugging
facility seemingly hasn't noticed the bad output since then[2] is
probably some indicator.
1. https://lore.kernel.org/git/cover.1347615361.git.git@drmicha.warpmail.net/
2. https://lore.kernel.org/git/xmqqk1b8x0ac.fsf@gitster-ct.c.googlers.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When more than one commit with the same patch ID appears on one
side, "git log --cherry-pick A...B" did not exclude them all when a
commit with the same patch ID appears on the other side. Now it
does.
* jk/log-cherry-pick-duplicate-patches:
patch-ids: handle duplicate hashmap entries
In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.
We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.
With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This fixes a bug introduced in dfb7a1b4d0 (patch-ids: stop using a
hand-rolled hashmap implementation, 2016-07-29) in which
git rev-list --cherry-pick A...B
will fail to suppress commits reachable from A even if a commit with
matching patch-id appears in B.
Around the time of that commit, the algorithm for "--cherry-pick" looked
something like this:
0. Traverse all of the commits, marking them as being on the left or
right side of the symmetric difference.
1. Iterate over the left-hand commits, inserting a patch-id struct for
each into a hashmap, and pointing commit->util to the patch-id
struct.
2. Iterate over the right-hand commits, checking which are present in
the hashmap. If so, we exclude the commit from the output _and_ we
mark the patch-id as "seen".
3. Iterate again over the left-hand commits, checking whether
commit->util->seen is set; if so, exclude them from the output.
At the end, we'll have eliminated commits from both sides that have a
matching patch-id on the other side. But there's a subtle assumption
here: for any given patch-id, we must have exactly one struct
representing it. If two commits from A both have the same patch-id and
we allow duplicates in the hashmap, then we run into a problem:
a. In step 1, we insert two patch-id structs into the hashmap.
b. In step 2, our lookups will find only one of these structs, so only
one "seen" flag is marked.
c. In step 3, one of the commits in A will have its commit->util->seen
set, but the other will not. We'll erroneously output the latter.
Prior to dfb7a1b4d0, our hashmap did not allow duplicates. Afterwards,
it used hashmap_add(), which explicitly does allow duplicates.
At that point, the solution would have been easy: when we are about to
add a duplicate, skip doing so and return the existing entry which
matches. But it gets more complicated.
In 683f17ec44 (patch-ids: replace the seen indicator with a commit
pointer, 2016-07-29), our step 3 goes away entirely. Instead, in step 2,
when the right-hand side finds a matching patch_id from the left-hand
side, we can directly mark the left-hand patch_id->commit to be omitted.
Solving that would be easy, too; there's a one-to-many relationship of
patch-ids to commits, so we just need to keep a list.
But there's more. Commit b3dfeebb92 (rebase: avoid computing unnecessary
patch IDs, 2016-07-29) built on that by lazily computing the full
patch-ids. So we don't even know when adding to the hashmap whether two
commits truly have the same id. We'd have to tentatively assign them a
list, and then possibly split them apart (possibly into N new structs)
at the moment we compute the real patch-ids. This could work, but it's
complicated and error-prone.
Instead, let's accept that we may store duplicates, and teach the lookup
side to be more clever. Rather than asking for a single matching
patch-id, it will need to iterate over all matching patch-ids. This does
mean examining every entry in a single hash bucket, but the worst-case
for a hash lookup was already doing that.
We'll keep the hashmap details out of the caller by providing a simple
iteration interface. We can retain the simple has_commit_patch_id()
interface for the other callers, but we'll simplify its return value
into an integer, rather than returning the patch_id struct. That way
they won't be tempted to look at the "commit" field of the return value
without iterating.
Reported-by: Arnaud Morin <arnaud.morin@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We trace statistics about the effectiveness of changed-path Bloom
filters since 42e50e78 (revision.c: add trace2 stats around Bloom
filter usage, 2020-04-06). Add similar tracing for the topo-walk
algorithm that uses generation numbers to limit the walk size.
This information can help investigate and describe benefits to
heuristics and other changes.
The information that is printed is in JSON format and can be formatted
nicely to present as follows:
{
"count_explort_walked":2603,
"count_indegree_walked":2603,
"count_topo_walked":473
}
Each of these values count the number of commits are visited by each of
the three "stages" of the topo-walk as detailed in b4542418 (revision.c:
generation-based topo-order algorithm, 2018-11-01).
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
After getting rid of 'ignore_merges' field, the diff_merges_init_revs()
function became empty. Get rid of it.
Signed-off-by: Sergey Organov <sorganov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Use the same "diff_merges" prefix for all the diff merges function
names.
Signed-off-by: Sergey Organov <sorganov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Create separate diff-merges.c and diff-merges.h files, and move all
the code related to handling of diff merges there.
Signed-off-by: Sergey Organov <sorganov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Use these implementations from show_setup_revisions_tweak() and
log_setup_revisions_tweak() in builtin/log.c.
This completes moving of management of diff merges parameters to a
single place, where we can finally observe them simultaneously.
Signed-off-by: Sergey Organov <sorganov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Move initialization code related to diffing merges into new
init_diff_merge_revs() function.
Signed-off-by: Sergey Organov <sorganov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Move all the setting code related to diffing merges into new
setup_diff_merge_revs() function.
Signed-off-by: Sergey Organov <sorganov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Move all the parsing code related to diffing merges into new
parse_diff_merge_opts() function.
Signed-off-by: Sergey Organov <sorganov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code clean-up.
* ma/grep-init-default:
MyFirstObjectWalk: drop `init_walken_defaults()`
grep: copy struct in one fell swoop
grep: use designated initializers for `grep_defaults`
grep: don't set up a "default" repo for grep
In 15fabd1bbd ("builtin/grep.c: make configuration callback more
reusable", 2012-10-09), we learned to fill a `static struct grep_opt
grep_defaults` which we can use as a blueprint for other such structs.
At the time, we didn't consider designated initializers to be widely
useable, but these days, we do. (See, e.g., cbc0f81d96 ("strbuf: use
designated initializers in STRBUF_INIT", 2017-07-10).)
Use designated initializers to let the compiler set up the struct and so
that we don't need to remember to call `init_grep_defaults()`.
Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>