global: introduce `USE_THE_REPOSITORY_VARIABLE` macro
Use of the `the_repository` variable is deprecated nowadays, and we
slowly but steadily convert the codebase to not use it anymore. Instead,
callers should be passing down the repository to work on via parameters.
It is hard though to prove that a given code unit does not use this
variable anymore. The most trivial case, merely demonstrating that there
is no direct use of `the_repository`, is already a bit of a pain during
code reviews as the reviewer needs to manually verify claims made by the
patch author. The bigger problem though is that we have many interfaces
that implicitly rely on `the_repository`.
Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code
units to opt into usage of `the_repository`. The intent of this macro is
to demonstrate that a certain code unit does not use this variable
anymore, and to keep it from new dependencies on it in future changes,
be it explicit or implicit
For now, the macro only guards `the_repository` itself as well as
`the_hash_algo`. There are many more known interfaces where we have an
implicit dependency on `the_repository`, but those are not guarded at
the current point in time. Over time though, we should start to add
guards as required (or even better, just remove them).
Define the macro as required in our code units. As expected, most of our
code still relies on the global variable. Nearly all of our builtins
rely on the variable as there is no way yet to pass `the_repository` to
their entry point. For now, declare the macro in "biultin.h" to keep the
required changes at least a little bit more contained.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14 14:50:23 +08:00
|
|
|
#define USE_THE_REPOSITORY_VARIABLE
|
|
|
|
|
2023-05-16 14:33:57 +08:00
|
|
|
#include "git-compat-util.h"
|
2022-11-17 13:46:56 +08:00
|
|
|
#include "config.h"
|
2023-03-21 14:26:03 +08:00
|
|
|
#include "environment.h"
|
2023-03-21 14:25:54 +08:00
|
|
|
#include "gettext.h"
|
2023-02-24 08:09:27 +08:00
|
|
|
#include "hex.h"
|
2023-04-11 15:41:49 +08:00
|
|
|
#include "object-name.h"
|
2023-04-11 15:41:53 +08:00
|
|
|
#include "object-file.h"
|
2023-05-16 14:34:06 +08:00
|
|
|
#include "object-store-ll.h"
|
2023-10-27 15:59:29 +08:00
|
|
|
#include "oidset.h"
|
2006-02-26 08:19:46 +08:00
|
|
|
#include "tag.h"
|
|
|
|
#include "blob.h"
|
|
|
|
#include "tree.h"
|
|
|
|
#include "commit.h"
|
2006-03-01 03:24:00 +08:00
|
|
|
#include "diff.h"
|
2020-12-21 23:19:33 +08:00
|
|
|
#include "diff-merges.h"
|
2006-02-26 08:19:46 +08:00
|
|
|
#include "refs.h"
|
|
|
|
#include "revision.h"
|
2018-04-12 08:21:09 +08:00
|
|
|
#include "repository.h"
|
2008-05-04 18:36:54 +08:00
|
|
|
#include "graph.h"
|
2006-09-18 06:43:40 +08:00
|
|
|
#include "grep.h"
|
2007-01-11 18:47:48 +08:00
|
|
|
#include "reflog-walk.h"
|
2007-04-09 18:40:38 +08:00
|
|
|
#include "patch-ids.h"
|
2008-04-03 17:12:06 +08:00
|
|
|
#include "decorate.h"
|
2010-03-13 01:04:26 +08:00
|
|
|
#include "string-list.h"
|
Implement line-history search (git log -L)
This is a rewrite of much of Bo's work, mainly in an effort to split
it into smaller, easier to understand routines.
The algorithm is built around the struct range_set, which encodes a
series of line ranges as intervals [a,b). This is used in two
contexts:
* A set of lines we are tracking (which will change as we dig through
history).
* To encode diffs, as pairs of ranges.
The main routine is range_set_map_across_diff(). It processes the
diff between a commit C and some parent P. It determines which diff
hunks are relevant to the ranges tracked in C, and computes the new
ranges for P.
The algorithm is then simply to process history in topological order
from newest to oldest, computing ranges and (partial) diffs. At
branch points, we need to merge the ranges we are watching. We will
find that many commits do not affect the chosen ranges, and mark them
TREESAME (in addition to those already filtered by pathspec limiting).
Another pass of history simplification then gets rid of such commits.
This is wired as an extra filtering pass in the log machinery. This
currently only reduces code duplication, but should allow for other
simplifications and options to be used.
Finally, we hook a diff printer into the output chain. Ideally we
would wire directly into the diff logic, to optionally use features
like word diff. However, that will require some major reworking of
the diff chain, so we completely replace the output with our own diff
for now.
As this was a GSoC project, and has quite some history by now, many
people have helped. In no particular order, thanks go to
Jakub Narebski <jnareb@gmail.com>
Jens Lehmann <Jens.Lehmann@web.de>
Jonathan Nieder <jrnieder@gmail.com>
Junio C Hamano <gitster@pobox.com>
Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Will Palmer <wmpalmer@gmail.com>
Apologies to everyone I forgot.
Signed-off-by: Bo Yang <struggleyb.nku@gmail.com>
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-03-29 00:47:32 +08:00
|
|
|
#include "line-log.h"
|
2013-01-06 05:26:45 +08:00
|
|
|
#include "mailmap.h"
|
log: use true parents for diff even when rewriting
When using pathspec filtering in combination with diff-based log
output, parent simplification happens before the diff is computed.
The diff is therefore against the *simplified* parents.
This works okay, arguably by accident, in the normal case:
simplification reduces to one parent as long as the commit is TREESAME
to it. So the simplified parent of any given commit must have the
same tree contents on the filtered paths as its true (unfiltered)
parent.
However, --full-diff breaks this guarantee, and indeed gives pretty
spectacular results when comparing the output of
git log --graph --stat ...
git log --graph --full-diff --stat ...
(--graph internally kicks in parent simplification, much like
--parents).
To fix it, store a copy of the parent list before simplification (in a
slab) whenever --full-diff is in effect. Then use the stored parents
instead of the simplified ones in the commit display code paths. The
latter do not actually check for --full-diff to avoid duplicated code;
they just grab the original parents if save_parents() has not been
called for this revision walk.
For ordinary commits it should be obvious that this is the right thing
to do.
Merge commits are a bit subtle. Observe that with default
simplification, merge simplification is an all-or-nothing decision:
either the merge is TREESAME to one parent and disappears, or it is
different from all parents and the parent list remains intact.
Redundant parents are not pruned, so the existing code also shows them
as a merge.
So if we do show a merge commit, the parent list just consists of the
rewrite result on each parent. Running, e.g., --cc on this in
--full-diff mode is not very useful: if any commits were skipped, some
hunks will disagree with all sides of the merge (with one side,
because commits were skipped; with the others, because they didn't
have those changes in the first place). This triggers --cc showing
these hunks spuriously.
Therefore I believe that even for merge commits it is better to show
the diffs wrt. the original parents.
Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-01 04:13:20 +08:00
|
|
|
#include "commit-slab.h"
|
2014-10-17 08:44:23 +08:00
|
|
|
#include "cache-tree.h"
|
2015-06-29 23:40:30 +08:00
|
|
|
#include "bisect.h"
|
2017-08-19 06:20:36 +08:00
|
|
|
#include "packfile.h"
|
2017-08-23 20:36:52 +08:00
|
|
|
#include "worktree.h"
|
2024-06-09 02:39:01 +08:00
|
|
|
#include "path.h"
|
2023-05-16 14:33:56 +08:00
|
|
|
#include "read-cache.h"
|
2023-03-21 14:26:05 +08:00
|
|
|
#include "setup.h"
|
2023-05-16 14:33:51 +08:00
|
|
|
#include "sparse-index.h"
|
2020-07-29 04:23:39 +08:00
|
|
|
#include "strvec.h"
|
2023-04-11 11:00:38 +08:00
|
|
|
#include "trace2.h"
|
2018-07-21 00:33:04 +08:00
|
|
|
#include "commit-reach.h"
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
#include "commit-graph.h"
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
#include "prio-queue.h"
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
#include "hashmap.h"
|
2019-06-28 07:39:05 +08:00
|
|
|
#include "utf8.h"
|
2020-04-07 00:59:52 +08:00
|
|
|
#include "bloom.h"
|
2020-04-07 00:59:53 +08:00
|
|
|
#include "json-writer.h"
|
2022-03-10 00:01:40 +08:00
|
|
|
#include "list-objects-filter-options.h"
|
2022-06-10 07:44:20 +08:00
|
|
|
#include "resolve-undo.h"
|
treewide: include parse-options.h in source files
The builtins 'ls-remote', 'pack-objects', 'receive-pack', 'reflog' and
'send-pack' use parse_options(), but their source files don't directly
include 'parse-options.h'. Furthermore, the source files
'diagnose.c', 'list-objects-filter-options.c', 'remote.c' and
'send-pack.c' define option parsing callback functions, while
'revision.c' defines an option parsing helper function, and thus need
access to various fields in 'struct option' and 'struct
parse_opt_ctx_t', but they don't directly include 'parse-options.h'
either. They all can still be built, of course, because they include
one of the header files that does include 'parse-options.h' (though
unnecessarily, see the next commit).
Add those missing includes to these files, as our general rule is that
"a C file must directly include the header files that declare the
functions and the types it uses".
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-20 00:27:11 +08:00
|
|
|
#include "parse-options.h"
|
2023-05-16 14:34:03 +08:00
|
|
|
#include "wildmatch.h"
|
2006-02-26 08:19:46 +08:00
|
|
|
|
2007-11-04 02:11:10 +08:00
|
|
|
volatile show_early_output_fn_t show_early_output;
|
|
|
|
|
2015-06-29 23:40:30 +08:00
|
|
|
static const char *term_bad;
|
|
|
|
static const char *term_good;
|
|
|
|
|
2018-05-19 13:28:24 +08:00
|
|
|
implement_shared_commit_slab(revision_sources, char *);
|
|
|
|
|
line-log: more responsive, incremental 'git log -L'
The current line-level log implementation performs a preprocessing
step in prepare_revision_walk(), during which the line_log_filter()
function filters and rewrites history to keep only commits modifying
the given line range. This preprocessing affects both responsiveness
and correctness:
- Git doesn't produce any output during this preprocessing step.
Checking whether a commit modified the given line range is
somewhat expensive, so depending on the size of the given revision
range this preprocessing can result in a significant delay before
the first commit is shown.
- Limiting the number of displayed commits (e.g. 'git log -3 -L...')
doesn't limit the amount of work during preprocessing, because
that limit is applied during history traversal. Alas, by that
point this expensive preprocessing step has already churned
through the whole revision range to find all commits modifying the
revision range, even though only a few of them need to be shown.
- It rewrites parents, with no way to turn it off. Without the user
explicitly requesting parent rewriting any parent object ID shown
should be that of the immediate parent, just like in case of a
pathspec-limited history traversal without parent rewriting.
However, after that preprocessing step rewrote history, the
subsequent "regular" history traversal (i.e. get_revision() in a
loop) only sees commits modifying the given line range.
Consequently, it can only show the object ID of the last ancestor
that modified the given line range (which might happen to be the
immediate parent, but many-many times it isn't).
This patch addresses both the correctness and, at least for the common
case, the responsiveness issues by integrating line-level log
filtering into the regular revision walking machinery:
- Make process_ranges_arbitrary_commit(), the static function in
'line-log.c' deciding whether a commit modifies the given line
range, public by removing the static keyword and adding the
'line_log_' prefix, so it can be called from other parts of the
revision walking machinery.
- If the user didn't explicitly ask for parent rewriting (which, I
believe, is the most common case):
- Call this now-public function during regular history traversal,
namely from get_commit_action() to ignore any commits not
modifying the given line range.
Note that while this check is relatively expensive, it must be
performed before other, much cheaper conditions, because the
tracked line range must be adjusted even when the commit will
end up being ignored by other conditions.
- Skip the line_log_filter() call, i.e. the expensive
preprocessing step, in prepare_revision_walk(), because, thanks
to the above points, the revision walking machinery is now able
to filter out commits not modifying the given line range while
traversing history.
This way the regular history traversal sees the unmodified
history, and is therefore able to print the object ids of the
immediate parents of the listed commits. The eliminated
preprocessing step can greatly reduce the delay before the first
commit is shown, see the numbers below.
- However, if the user did explicitly ask for parent rewriting via
'--parents' or a similar option, then stick with the current
implementation for now, i.e. perform that expensive filtering and
history rewriting in the preprocessing step just like we did
before, leaving the initial delay as long as it was.
I tried to integrate line-level log filtering with parent rewriting
into the regular history traversal, but, unfortunately, several
subtleties resisted... :) Maybe someday we'll figure out how to do
that, but until then at least the simple and common (i.e. without
parent rewriting) 'git log -L:func:file' commands can benefit from the
reduced delay.
This change makes the failing 'parent oids without parent rewriting'
test in 't4211-line-log.sh' succeed.
The reduced delay is most noticable when there's a commit modifying
the line range near the tip of a large-ish revision range:
# no parent rewriting requested, no commit-graph present
$ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0
Before:
real 0m9.570s
user 0m9.494s
sys 0m0.076s
After:
real 0m0.718s
user 0m0.674s
sys 0m0.044s
A significant part of the remaining delay is spent reading and parsing
commit objects in limit_list(). With the help of the commit-graph we
can eliminate most of that reading and parsing overhead, so here are
the timing results of the same command as above, but this time using
the commit-graph:
Before:
real 0m8.874s
user 0m8.816s
sys 0m0.057s
After:
real 0m0.107s
user 0m0.091s
sys 0m0.013s
The next patch will further reduce the remaining delay.
To be clear: this patch doesn't actually optimize the line-level log,
but merely moves most of the work from the preprocessing step to the
history traversal, so the commits modifying the line range can be
shown as soon as they are processed, and the traversal can be
terminated as soon as the given number of commits are shown.
Consequently, listing the full history of a line range, potentially
all the way to the root commit, will take the same time as before (but
at least the user might start reading the output earlier).
Furthermore, if the most recent commit modifying the line range is far
away from the starting revision, then that initial delay will still be
significant.
Additional testing by Derrick Stolee: In the Linux kernel repository,
the MAINTAINERS file was changed ~3,500 times across the ~915,000
commits. In addition to that edit frequency, the file itself is quite
large (~18,700 lines). This means that a significant portion of the
computation is taken up by computing the patch-diff of the file. This
patch improves the real time it takes to output the first result quite
a bit:
Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
Before: 3.88 s
After: 0.71 s
If we drop the "-n 1" in the command, then there is no change in
end-to-end process time. This is because the command still needs to
walk the entire commit history, which negates the point of this
patch. This is expected.
As a note for future reference, the ~4.3 seconds in the old code
spends ~2.6 seconds computing the patch-diffs, and the rest of the
time is spent walking commits and computing diffs for which paths
changed at each commit. The changed-path Bloom filters could improve
the end-to-end computation time (i.e. no "-n 1" in the command).
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-11 19:56:17 +08:00
|
|
|
static inline int want_ancestry(const struct rev_info *revs);
|
|
|
|
|
list-objects: pass full pathname to callbacks
When we find a blob at "a/b/c", we currently pass this to
our show_object_fn callbacks as two components: "a/b/" and
"c". Callbacks which want the full value then call
path_name(), which concatenates the two. But this is an
inefficient interface; the path is a strbuf, and we could
simply append "c" to it temporarily, then roll back the
length, without creating a new copy.
So we could improve this by teaching the callsites of
path_name() this trick (and there are only 3). But we can
also notice that no callback actually cares about the
broken-down representation, and simply pass each callback
the full path "a/b/c" as a string. The callback code becomes
even simpler, then, as we do not have to worry about freeing
an allocated buffer, nor rolling back our modification to
the strbuf.
This is theoretically less efficient, as some callbacks
would not bother to format the final path component. But in
practice this is not measurable. Since we use the same
strbuf over and over, our work to grow it is amortized, and
we really only pay to memcpy a few bytes.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-12 06:28:36 +08:00
|
|
|
void show_object_with_name(FILE *out, struct object *obj, const char *name)
|
2006-02-26 08:19:46 +08:00
|
|
|
{
|
2015-11-10 10:22:28 +08:00
|
|
|
fprintf(out, "%s ", oid_to_hex(&obj->oid));
|
revision: use C99 declaration of variable in for() loop
There are certain C99 features that might be nice to use in our code
base, but we've hesitated to do so in order to avoid breaking
compatibility with older compilers. But we don't actually know if
people are even using pre-C99 compilers these days.
One way to figure that out is to introduce a very small use of a
feature, and see if anybody complains, and we've done so to probe
the portability for a few features like "trailing comma in enum
declaration", "designated initializer for struct", and "designated
initializer for array". A few years ago, we tried to use a handy
for (int i = 0; i < n; i++)
use(i);
to introduce a new variable valid only in the loop, but found that
some compilers we cared about didn't like it back then. Two years
is a long-enough time, so let's try it again.
If this patch can survive a few releases without complaint, then we
can feel more confident that variable declaration in for() loop is
supported by the compilers our user base use. And if we do get
complaints, then we'll have gained some data and we can easily
revert this patch.
Helped-by: Martin Ågren <martin.agren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-15 14:27:45 +08:00
|
|
|
for (const char *p = name; *p && *p != '\n'; p++)
|
2016-02-12 06:24:18 +08:00
|
|
|
fputc(*p, out);
|
2011-08-18 05:30:35 +08:00
|
|
|
fputc('\n', out);
|
2011-08-18 05:30:34 +08:00
|
|
|
}
|
|
|
|
|
2006-02-26 08:19:46 +08:00
|
|
|
static void mark_blob_uninteresting(struct blob *blob)
|
|
|
|
{
|
2008-02-19 04:47:54 +08:00
|
|
|
if (!blob)
|
|
|
|
return;
|
2006-02-26 08:19:46 +08:00
|
|
|
if (blob->object.flags & UNINTERESTING)
|
|
|
|
return;
|
|
|
|
blob->object.flags |= UNINTERESTING;
|
|
|
|
}
|
|
|
|
|
2018-09-21 23:57:39 +08:00
|
|
|
static void mark_tree_contents_uninteresting(struct repository *r,
|
|
|
|
struct tree *tree)
|
2006-02-26 08:19:46 +08:00
|
|
|
{
|
2006-05-30 03:20:14 +08:00
|
|
|
struct tree_desc desc;
|
tree_entry(): new tree-walking helper function
This adds a "tree_entry()" function that combines the common operation of
doing a "tree_entry_extract()" + "update_tree_entry()".
It also has a simplified calling convention, designed for simple loops
that traverse over a whole tree: the arguments are pointers to the tree
descriptor and a name_entry structure to fill in, and it returns a boolean
"true" if there was an entry left to be gotten in the tree.
This allows tree traversal with
struct tree_desc desc;
struct name_entry entry;
desc.buf = tree->buffer;
desc.size = tree->size;
while (tree_entry(&desc, &entry) {
... use "entry.{path, sha1, mode, pathlen}" ...
}
which is not only shorter than writing it out in full, it's hopefully less
error prone too.
[ It's actually a tad faster too - we don't need to recalculate the entry
pathlength in both extract and update, but need to do it only once.
Also, some callers can avoid doing a "strlen()" on the result, since
it's returned as part of the name_entry structure.
However, by now we're talking just 1% speedup on "git-rev-list --objects
--all", and we're definitely at the point where tree walking is no
longer the issue any more. ]
NOTE! Not everybody wants to use this new helper function, since some of
the tree walkers very much on purpose do the descriptor update separately
from the entry extraction. So the "extract + update" sequence still
remains as the core sequence, this is just a simplified interface.
We should probably add a silly two-line inline helper function for
initializing the descriptor from the "struct tree" too, just to cut down
on the noise from that common "desc" initializer.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-05-31 00:45:45 +08:00
|
|
|
struct name_entry entry;
|
2006-02-26 08:19:46 +08:00
|
|
|
|
2018-05-12 02:00:55 +08:00
|
|
|
if (parse_tree_gently(tree, 1) < 0)
|
2006-02-26 08:19:46 +08:00
|
|
|
return;
|
2006-05-30 03:20:14 +08:00
|
|
|
|
2023-10-02 10:40:28 +08:00
|
|
|
init_tree_desc(&desc, &tree->object.oid, tree->buffer, tree->size);
|
tree_entry(): new tree-walking helper function
This adds a "tree_entry()" function that combines the common operation of
doing a "tree_entry_extract()" + "update_tree_entry()".
It also has a simplified calling convention, designed for simple loops
that traverse over a whole tree: the arguments are pointers to the tree
descriptor and a name_entry structure to fill in, and it returns a boolean
"true" if there was an entry left to be gotten in the tree.
This allows tree traversal with
struct tree_desc desc;
struct name_entry entry;
desc.buf = tree->buffer;
desc.size = tree->size;
while (tree_entry(&desc, &entry) {
... use "entry.{path, sha1, mode, pathlen}" ...
}
which is not only shorter than writing it out in full, it's hopefully less
error prone too.
[ It's actually a tad faster too - we don't need to recalculate the entry
pathlength in both extract and update, but need to do it only once.
Also, some callers can avoid doing a "strlen()" on the result, since
it's returned as part of the name_entry structure.
However, by now we're talking just 1% speedup on "git-rev-list --objects
--all", and we're definitely at the point where tree walking is no
longer the issue any more. ]
NOTE! Not everybody wants to use this new helper function, since some of
the tree walkers very much on purpose do the descriptor update separately
from the entry extraction. So the "extract + update" sequence still
remains as the core sequence, this is just a simplified interface.
We should probably add a silly two-line inline helper function for
initializing the descriptor from the "struct tree" too, just to cut down
on the noise from that common "desc" initializer.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-05-31 00:45:45 +08:00
|
|
|
while (tree_entry(&desc, &entry)) {
|
2007-11-12 07:35:23 +08:00
|
|
|
switch (object_type(entry.mode)) {
|
|
|
|
case OBJ_TREE:
|
2019-01-15 08:39:44 +08:00
|
|
|
mark_tree_uninteresting(r, lookup_tree(r, &entry.oid));
|
2007-11-12 07:35:23 +08:00
|
|
|
break;
|
|
|
|
case OBJ_BLOB:
|
2019-01-15 08:39:44 +08:00
|
|
|
mark_blob_uninteresting(lookup_blob(r, &entry.oid));
|
2007-11-12 07:35:23 +08:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
/* Subproject commit - not in this repository */
|
|
|
|
break;
|
|
|
|
}
|
2006-02-26 08:19:46 +08:00
|
|
|
}
|
2006-05-30 03:20:14 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't care about the tree any more
|
|
|
|
* after it has been marked uninteresting.
|
|
|
|
*/
|
2013-06-06 06:37:39 +08:00
|
|
|
free_tree_buffer(tree);
|
2006-02-26 08:19:46 +08:00
|
|
|
}
|
|
|
|
|
2018-09-21 23:57:39 +08:00
|
|
|
void mark_tree_uninteresting(struct repository *r, struct tree *tree)
|
2014-01-16 07:38:01 +08:00
|
|
|
{
|
2015-12-05 23:27:24 +08:00
|
|
|
struct object *obj;
|
2014-01-16 07:38:01 +08:00
|
|
|
|
|
|
|
if (!tree)
|
|
|
|
return;
|
2015-12-05 23:27:24 +08:00
|
|
|
|
|
|
|
obj = &tree->object;
|
2014-01-16 07:38:01 +08:00
|
|
|
if (obj->flags & UNINTERESTING)
|
|
|
|
return;
|
|
|
|
obj->flags |= UNINTERESTING;
|
2018-09-21 23:57:39 +08:00
|
|
|
mark_tree_contents_uninteresting(r, tree);
|
2006-02-26 08:19:46 +08:00
|
|
|
}
|
|
|
|
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
struct path_and_oids_entry {
|
|
|
|
struct hashmap_entry ent;
|
|
|
|
char *path;
|
|
|
|
struct oidset trees;
|
|
|
|
};
|
|
|
|
|
2022-08-26 01:09:48 +08:00
|
|
|
static int path_and_oids_cmp(const void *hashmap_cmp_fn_data UNUSED,
|
2019-10-07 07:30:37 +08:00
|
|
|
const struct hashmap_entry *eptr,
|
|
|
|
const struct hashmap_entry *entry_or_key,
|
2022-08-26 01:09:48 +08:00
|
|
|
const void *keydata UNUSED)
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
{
|
2019-10-07 07:30:37 +08:00
|
|
|
const struct path_and_oids_entry *e1, *e2;
|
|
|
|
|
|
|
|
e1 = container_of(eptr, const struct path_and_oids_entry, ent);
|
|
|
|
e2 = container_of(entry_or_key, const struct path_and_oids_entry, ent);
|
|
|
|
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
return strcmp(e1->path, e2->path);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void paths_and_oids_clear(struct hashmap *map)
|
|
|
|
{
|
|
|
|
struct hashmap_iter iter;
|
|
|
|
struct path_and_oids_entry *entry;
|
|
|
|
|
2019-10-07 07:30:41 +08:00
|
|
|
hashmap_for_each_entry(map, &iter, entry, ent /* member name */) {
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
oidset_clear(&entry->trees);
|
|
|
|
free(entry->path);
|
|
|
|
}
|
|
|
|
|
2020-11-03 02:55:05 +08:00
|
|
|
hashmap_clear_and_free(map, struct path_and_oids_entry, ent);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void paths_and_oids_insert(struct hashmap *map,
|
|
|
|
const char *path,
|
|
|
|
const struct object_id *oid)
|
|
|
|
{
|
|
|
|
int hash = strhash(path);
|
|
|
|
struct path_and_oids_entry key;
|
|
|
|
struct path_and_oids_entry *entry;
|
|
|
|
|
2019-10-07 07:30:27 +08:00
|
|
|
hashmap_entry_init(&key.ent, hash);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
|
|
|
|
/* use a shallow copy for the lookup */
|
|
|
|
key.path = (char *)path;
|
|
|
|
oidset_init(&key.trees, 0);
|
|
|
|
|
2019-10-07 07:30:42 +08:00
|
|
|
entry = hashmap_get_entry(map, &key, ent, NULL);
|
2019-10-07 07:30:30 +08:00
|
|
|
if (!entry) {
|
2021-03-14 00:17:22 +08:00
|
|
|
CALLOC_ARRAY(entry, 1);
|
2019-10-07 07:30:27 +08:00
|
|
|
hashmap_entry_init(&entry->ent, hash);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
entry->path = xstrdup(key.path);
|
|
|
|
oidset_init(&entry->trees, 16);
|
2019-10-07 07:30:32 +08:00
|
|
|
hashmap_put(map, &entry->ent);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
oidset_insert(&entry->trees, oid);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void add_children_by_path(struct repository *r,
|
|
|
|
struct tree *tree,
|
|
|
|
struct hashmap *map)
|
|
|
|
{
|
|
|
|
struct tree_desc desc;
|
|
|
|
struct name_entry entry;
|
|
|
|
|
|
|
|
if (!tree)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (parse_tree_gently(tree, 1) < 0)
|
|
|
|
return;
|
|
|
|
|
2023-10-02 10:40:28 +08:00
|
|
|
init_tree_desc(&desc, &tree->object.oid, tree->buffer, tree->size);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
while (tree_entry(&desc, &entry)) {
|
|
|
|
switch (object_type(entry.mode)) {
|
|
|
|
case OBJ_TREE:
|
2019-02-07 14:05:24 +08:00
|
|
|
paths_and_oids_insert(map, entry.path, &entry.oid);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
|
|
|
|
if (tree->object.flags & UNINTERESTING) {
|
2019-02-07 14:05:24 +08:00
|
|
|
struct tree *child = lookup_tree(r, &entry.oid);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
if (child)
|
|
|
|
child->object.flags |= UNINTERESTING;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case OBJ_BLOB:
|
|
|
|
if (tree->object.flags & UNINTERESTING) {
|
2019-02-07 14:05:24 +08:00
|
|
|
struct blob *child = lookup_blob(r, &entry.oid);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
if (child)
|
|
|
|
child->object.flags |= UNINTERESTING;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
/* Subproject commit - not in this repository */
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
free_tree_buffer(tree);
|
|
|
|
}
|
|
|
|
|
2019-01-17 02:25:58 +08:00
|
|
|
void mark_trees_uninteresting_sparse(struct repository *r,
|
|
|
|
struct oidset *trees)
|
|
|
|
{
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
unsigned has_interesting = 0, has_uninteresting = 0;
|
2020-11-12 04:02:20 +08:00
|
|
|
struct hashmap map = HASHMAP_INIT(path_and_oids_cmp, NULL);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
struct hashmap_iter map_iter;
|
|
|
|
struct path_and_oids_entry *entry;
|
2019-01-17 02:25:58 +08:00
|
|
|
struct object_id *oid;
|
|
|
|
struct oidset_iter iter;
|
|
|
|
|
|
|
|
oidset_iter_init(trees, &iter);
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
while ((!has_interesting || !has_uninteresting) &&
|
|
|
|
(oid = oidset_iter_next(&iter))) {
|
2019-01-17 02:25:58 +08:00
|
|
|
struct tree *tree = lookup_tree(r, oid);
|
|
|
|
|
|
|
|
if (!tree)
|
|
|
|
continue;
|
|
|
|
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
if (tree->object.flags & UNINTERESTING)
|
|
|
|
has_uninteresting = 1;
|
|
|
|
else
|
|
|
|
has_interesting = 1;
|
2019-01-17 02:25:58 +08:00
|
|
|
}
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
|
|
|
|
/* Do not walk unless we have both types of trees. */
|
|
|
|
if (!has_uninteresting || !has_interesting)
|
|
|
|
return;
|
|
|
|
|
|
|
|
oidset_iter_init(trees, &iter);
|
|
|
|
while ((oid = oidset_iter_next(&iter))) {
|
|
|
|
struct tree *tree = lookup_tree(r, oid);
|
|
|
|
add_children_by_path(r, tree, &map);
|
|
|
|
}
|
|
|
|
|
2019-10-07 07:30:41 +08:00
|
|
|
hashmap_for_each_entry(&map, &map_iter, entry, ent /* member name */)
|
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git
pack-objects --revs', we discover the "frontier" of commits
that we care about and the boundary with commit we find
uninteresting. From that point, we walk trees to discover which
trees and blobs are uninteresting. Finally, we walk trees from the
interesting commits to find the interesting objects that are
placed in the pack.
This commit introduces a new, "sparse" way to discover the
uninteresting trees. We use the perspective of a single user trying
to push their topic to a large repository. That user likely changed
a very small fraction of the paths in their working directory, but
we spend a lot of time walking all reachable trees.
The way to switch the logic to work in this sparse way is to start
caring about which paths introduce new trees. While it is not
possible to generate a diff between the frontier boundary and all
of the interesting commits, we can simulate that behavior by
inspecting all of the root trees as a whole, then recursing down
to the set of trees at each path.
We already had taken the first step by passing an oidset to
mark_trees_uninteresting_sparse(). We now create a dictionary
whose keys are paths and values are oidsets. We consider the set
of trees that appear at each path. While we inspect a tree, we
add its subtrees to the oidsets corresponding to the tree entry's
path. We also mark trees as UNINTERESTING if the tree we are
parsing is UNINTERESTING.
To actually improve the performance, we need to terminate our
recursion. If the oidset contains only UNINTERESTING trees, then
we do not continue the recursion. This avoids walking trees that
are likely to not be reachable from interesting trees. If the
oidset contains only interesting trees, then we will walk these
trees in the final stage that collects the intersting objects to
place in the pack. Thus, we only recurse if the oidset contains
both interesting and UNINITERESTING trees.
There are a few ways that this is not a universally better option.
First, we can pack extra objects. If someone copies a subtree
from one tree to another, the first tree will appear UNINTERESTING
and we will not recurse to see that the subtree should also be
UNINTERESTING. We will walk the new tree and see the subtree as
a "new" object and add it to the pack. A test is modified to
demonstrate this behavior and to verify that the new logic is
being exercised.
Second, we can have extra memory pressure. If instead of being a
single user pushing a small topic we are a server sending new
objects from across the entire working directory, then we will
gain very little (the recursion will rarely terminate early) but
will spend extra time maintaining the path-oidset dictionaries.
Despite these potential drawbacks, the benefits of the algorithm
are clear. By adding a counter to 'add_children_by_path' and
'mark_tree_contents_uninteresting', I measured the number of
parsed trees for the two algorithms in a variety of repos.
For git.git, I used the following input:
v2.19.0
^v2.19.0~10
Objects to pack: 550
Walked (old alg): 282
Walked (new alg): 130
For the Linux repo, I used the following input:
v4.18
^v4.18~10
Objects to pack: 518
Walked (old alg): 4,836
Walked (new alg): 188
The two repos above are rather "wide and flat" compared to
other repos that I have used in the past. As a comparison,
I tested an old topic branch in the Azure DevOps repo, which
has a much deeper folder structure than the Linux repo.
Objects to pack: 220
Walked (old alg): 22,804
Walked (new alg): 129
I used the number of walked trees the main metric above because
it is consistent across multiple runs. When I ran my tests, the
performance of the pack-objects command with the same options
could change the end-to-end time by 10x depending on the file
system being warm. However, by repeating the same test on repeat
I could get more consistent timing results. The git.git and
Linux tests were too fast overall (less than 0.5s) to measure
an end-to-end difference. The Azure DevOps case was slow enough
to see the time improve from 15s to 1s in the warm case. The
cold case was 90s to 9s in my testing.
These improvements will have even larger benefits in the super-
large Windows repository. In our experiments, we see the
"Enumerate objects" phase of pack-objects taking 60-80% of the
end-to-end time of non-trivial pushes, taking longer than the
network time to send the pack and the server time to verify the
pack.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-17 02:25:59 +08:00
|
|
|
mark_trees_uninteresting_sparse(r, &entry->trees);
|
|
|
|
|
|
|
|
paths_and_oids_clear(&map);
|
2019-01-17 02:25:58 +08:00
|
|
|
}
|
|
|
|
|
2018-05-12 02:02:27 +08:00
|
|
|
struct commit_stack {
|
|
|
|
struct commit **items;
|
|
|
|
size_t nr, alloc;
|
|
|
|
};
|
2021-09-27 20:54:25 +08:00
|
|
|
#define COMMIT_STACK_INIT { 0 }
|
2018-05-12 02:02:27 +08:00
|
|
|
|
|
|
|
static void commit_stack_push(struct commit_stack *stack, struct commit *commit)
|
|
|
|
{
|
|
|
|
ALLOC_GROW(stack->items, stack->nr + 1, stack->alloc);
|
|
|
|
stack->items[stack->nr++] = commit;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct commit *commit_stack_pop(struct commit_stack *stack)
|
|
|
|
{
|
|
|
|
return stack->nr ? stack->items[--stack->nr] : NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void commit_stack_clear(struct commit_stack *stack)
|
|
|
|
{
|
|
|
|
FREE_AND_NULL(stack->items);
|
|
|
|
stack->nr = stack->alloc = 0;
|
|
|
|
}
|
|
|
|
|
git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.
In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.
-A-----E-F-G--main
\ / /
B-C-D--topic
In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.
Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.
Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 05:39:41 +08:00
|
|
|
static void mark_one_parent_uninteresting(struct rev_info *revs, struct commit *commit,
|
2018-05-12 02:03:15 +08:00
|
|
|
struct commit_stack *pending)
|
|
|
|
{
|
|
|
|
struct commit_list *l;
|
|
|
|
|
|
|
|
if (commit->object.flags & UNINTERESTING)
|
|
|
|
return;
|
|
|
|
commit->object.flags |= UNINTERESTING;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Normally we haven't parsed the parent
|
|
|
|
* yet, so we won't have a parent of a parent
|
|
|
|
* here. However, it may turn out that we've
|
|
|
|
* reached this commit some other way (where it
|
|
|
|
* wasn't uninteresting), in which case we need
|
|
|
|
* to mark its parents recursively too..
|
|
|
|
*/
|
git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.
In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.
-A-----E-F-G--main
\ / /
B-C-D--topic
In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.
Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.
Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 05:39:41 +08:00
|
|
|
for (l = commit->parents; l; l = l->next) {
|
2018-05-12 02:03:15 +08:00
|
|
|
commit_stack_push(pending, l->item);
|
git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.
In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.
-A-----E-F-G--main
\ / /
B-C-D--topic
In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.
Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.
Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 05:39:41 +08:00
|
|
|
if (revs && revs->exclude_first_parent_only)
|
|
|
|
break;
|
|
|
|
}
|
2018-05-12 02:03:15 +08:00
|
|
|
}
|
|
|
|
|
git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.
In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.
-A-----E-F-G--main
\ / /
B-C-D--topic
In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.
Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.
Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 05:39:41 +08:00
|
|
|
void mark_parents_uninteresting(struct rev_info *revs, struct commit *commit)
|
2006-02-26 08:19:46 +08:00
|
|
|
{
|
2018-05-12 02:02:27 +08:00
|
|
|
struct commit_stack pending = COMMIT_STACK_INIT;
|
|
|
|
struct commit_list *l;
|
2012-01-14 20:19:53 +08:00
|
|
|
|
git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.
In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.
-A-----E-F-G--main
\ / /
B-C-D--topic
In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.
Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.
Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 05:39:41 +08:00
|
|
|
for (l = commit->parents; l; l = l->next) {
|
|
|
|
mark_one_parent_uninteresting(revs, l->item, &pending);
|
|
|
|
if (revs && revs->exclude_first_parent_only)
|
|
|
|
break;
|
|
|
|
}
|
2018-05-12 02:03:15 +08:00
|
|
|
|
|
|
|
while (pending.nr > 0)
|
git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.
In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.
-A-----E-F-G--main
\ / /
B-C-D--topic
In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.
Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.
Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 05:39:41 +08:00
|
|
|
mark_one_parent_uninteresting(revs, commit_stack_pop(&pending),
|
2018-05-12 02:03:15 +08:00
|
|
|
&pending);
|
2018-05-12 02:02:27 +08:00
|
|
|
|
|
|
|
commit_stack_clear(&pending);
|
2006-02-26 08:19:46 +08:00
|
|
|
}
|
|
|
|
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:43:19 +08:00
|
|
|
static void add_pending_object_with_path(struct rev_info *revs,
|
2013-05-25 17:08:07 +08:00
|
|
|
struct object *obj,
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:43:19 +08:00
|
|
|
const char *name, unsigned mode,
|
|
|
|
const char *path)
|
2006-02-26 08:19:46 +08:00
|
|
|
{
|
2020-09-02 06:28:07 +08:00
|
|
|
struct interpret_branch_name_options options = { 0 };
|
2011-05-19 09:08:09 +08:00
|
|
|
if (!obj)
|
|
|
|
return;
|
2007-02-28 08:22:52 +08:00
|
|
|
if (revs->no_walk && (obj->flags & UNINTERESTING))
|
Make 'git show' more useful
For some reason, I ended up doing
git show HEAD~5..
as an odd way of asking for a log. I realize I should just have used "git
log", but at the same time it does make perfect conceptual sense. After
all, you _could_ have done
git show HEAD HEAD~1 HEAD~2 HEAD~3 HEAD~4
and saying "git show HEAD~5.." is pretty natural. It's not like "git show"
only ever showed a single commit (or other object) before either! So
conceptually, giving a commit range is a very sensible operation, even
though you'd traditionally have used "git log" for that.
However, doing that currently results in an error
fatal: object ranges do not make sense when not walking revisions
which admittedly _also_ makes perfect sense - from an internal git
implementation standpoint in 'revision.c'.
However, I think that asking to show a range makes sense to a user, while
saying "object ranges no not make sense when not walking revisions" only
makes sense to a git developer.
So on the whole, of the two different "makes perfect sense" behaviors, I
think I originally picked the wrong one. And quite frankly, I don't really
see anybody actually _depending_ on that error case. So why not change it?
So rather than error out, just turn that non-walking error case into a
"silently turn on walking" instead.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-07-14 05:41:12 +08:00
|
|
|
revs->no_walk = 0;
|
2010-01-27 05:48:28 +08:00
|
|
|
if (revs->reflog_info && obj->type == OBJ_COMMIT) {
|
|
|
|
struct strbuf buf = STRBUF_INIT;
|
2021-06-10 21:06:43 +08:00
|
|
|
size_t namelen = strlen(name);
|
2023-03-28 21:58:46 +08:00
|
|
|
int len = repo_interpret_branch_name(the_repository, name,
|
|
|
|
namelen, &buf, &options);
|
2010-01-27 05:48:28 +08:00
|
|
|
|
2021-06-10 21:06:43 +08:00
|
|
|
if (0 < len && len < namelen && buf.len)
|
2010-01-27 05:48:28 +08:00
|
|
|
strbuf_addstr(&buf, name + len);
|
reflog-walk: stop using fake parents
The reflog-walk system works by putting a ref's tip into the
pending queue, and then "traversing" the reflog by
pretending that the parent of each commit is the previous
reflog entry.
This causes a number of user-visible oddities, as documented
in t1414 (and the commit message which introduced it). We
can fix all of them in one go by replacing the fake-reflog
system with a much simpler one: just keeping a list of
reflogs to show, and walking through them entry by entry.
The implementation is fairly straight-forward, but there are
a few items to note:
1. We obviously must skip calling add_parents_to_list()
when we are traversing reflogs, since we do not want to
walk the original parents at all. As a result, we must call
try_to_simplify_commit() ourselves.
There are other parts of add_parents_to_list() we skip,
as well, but none of them should matter for a reflog
traversal:
- We do not allow UNINTERESTING commits, nor
symmetric ranges (and we bail when these are used
with "-g").
- Using --source makes no sense, since we aren't
traversing. The reflog selector shows the same
information with more detail.
- Using --first-parent is still sensible, since you
may want to see the first-parent diff for each
entry. But since we're not traversing, we don't
need to cull the parent list here.
2. Since we now just walk the reflog entries themselves,
rather than starting with the ref tip, we now look at
the "new" field of each entry rather than the "old"
(i.e., we are showing entries, not faking parents).
This removes all of the tricky logic around skipping
past root commits.
But note that we have no way to show an entry with the
null sha1 in its "new" field (because such a commit
obviously does not exist). Normally this would not
happen, since we delete reflogs along with refs, but
there is one special case. When we rename the currently
checked out branch, we write two reflog entries into
the HEAD log: one where the commit goes away, and
another where it comes back.
Prior to this commit, we show both entries with
identical reflog messages. After this commit, we show
only the "comes back" entry. See the update in t3200
which demonstrates this.
Arguably either is fine, as the whole double-entry
thing is a bit hacky in the first place. And until a
recent fix, we truncated the traversal in such a case
anyway, which was _definitely_ wrong.
3. We show individual reflogs in order, but choose which
reflog to show at each stage based on which has the
most recent timestamp. This interleaves the output
from multiple reflogs based on date order, which is
probably what you'd want with limiting like "-n 30".
Note that the implementation aims for simplicity. It
does a linear walk over the reflog queue for each
commit it pulls, which may perform badly if you
interleave an enormous number of reflogs. That seems
like an unlikely use case; if we did want to handle it,
we could probably keep a priority queue of reflogs,
ordered by the timestamp of their current tip entry.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-07 17:14:07 +08:00
|
|
|
add_reflog_for_walk(revs->reflog_info,
|
|
|
|
(struct commit *)obj,
|
|
|
|
buf.buf[0] ? buf.buf: name);
|
2010-01-27 05:48:28 +08:00
|
|
|
strbuf_release(&buf);
|
reflog-walk: stop using fake parents
The reflog-walk system works by putting a ref's tip into the
pending queue, and then "traversing" the reflog by
pretending that the parent of each commit is the previous
reflog entry.
This causes a number of user-visible oddities, as documented
in t1414 (and the commit message which introduced it). We
can fix all of them in one go by replacing the fake-reflog
system with a much simpler one: just keeping a list of
reflogs to show, and walking through them entry by entry.
The implementation is fairly straight-forward, but there are
a few items to note:
1. We obviously must skip calling add_parents_to_list()
when we are traversing reflogs, since we do not want to
walk the original parents at all. As a result, we must call
try_to_simplify_commit() ourselves.
There are other parts of add_parents_to_list() we skip,
as well, but none of them should matter for a reflog
traversal:
- We do not allow UNINTERESTING commits, nor
symmetric ranges (and we bail when these are used
with "-g").
- Using --source makes no sense, since we aren't
traversing. The reflog selector shows the same
information with more detail.
- Using --first-parent is still sensible, since you
may want to see the first-parent diff for each
entry. But since we're not traversing, we don't
need to cull the parent list here.
2. Since we now just walk the reflog entries themselves,
rather than starting with the ref tip, we now look at
the "new" field of each entry rather than the "old"
(i.e., we are showing entries, not faking parents).
This removes all of the tricky logic around skipping
past root commits.
But note that we have no way to show an entry with the
null sha1 in its "new" field (because such a commit
obviously does not exist). Normally this would not
happen, since we delete reflogs along with refs, but
there is one special case. When we rename the currently
checked out branch, we write two reflog entries into
the HEAD log: one where the commit goes away, and
another where it comes back.
Prior to this commit, we show both entries with
identical reflog messages. After this commit, we show
only the "comes back" entry. See the update in t3200
which demonstrates this.
Arguably either is fine, as the whole double-entry
thing is a bit hacky in the first place. And until a
recent fix, we truncated the traversal in such a case
anyway, which was _definitely_ wrong.
3. We show individual reflogs in order, but choose which
reflog to show at each stage based on which has the
most recent timestamp. This interleaves the output
from multiple reflogs based on date order, which is
probably what you'd want with limiting like "-n 30".
Note that the implementation aims for simplicity. It
does a linear walk over the reflog queue for each
commit it pulls, which may perform badly if you
interleave an enormous number of reflogs. That seems
like an unlikely use case; if we did want to handle it,
we could probably keep a priority queue of reflogs,
ordered by the timestamp of their current tip entry.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-07 17:14:07 +08:00
|
|
|
return; /* do not add the commit itself */
|
2010-01-27 05:48:28 +08:00
|
|
|
}
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:43:19 +08:00
|
|
|
add_object_array_with_path(obj, name, &revs->pending, mode, path);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void add_pending_object_with_mode(struct rev_info *revs,
|
|
|
|
struct object *obj,
|
|
|
|
const char *name, unsigned mode)
|
|
|
|
{
|
|
|
|
add_pending_object_with_path(revs, obj, name, mode, NULL);
|
2006-02-26 08:19:46 +08:00
|
|
|
}
|
|
|
|
|
2013-05-25 17:08:07 +08:00
|
|
|
void add_pending_object(struct rev_info *revs,
|
|
|
|
struct object *obj, const char *name)
|
2007-06-08 17:24:58 +08:00
|
|
|
{
|
|
|
|
add_pending_object_with_mode(revs, obj, name, S_IFINVALID);
|
|
|
|
}
|
|
|
|
|
2007-12-12 02:09:04 +08:00
|
|
|
void add_head_to_pending(struct rev_info *revs)
|
|
|
|
{
|
2017-05-07 06:10:27 +08:00
|
|
|
struct object_id oid;
|
2007-12-12 02:09:04 +08:00
|
|
|
struct object *obj;
|
2023-03-28 21:58:46 +08:00
|
|
|
if (repo_get_oid(the_repository, "HEAD", &oid))
|
2007-12-12 02:09:04 +08:00
|
|
|
return;
|
2018-09-21 23:57:39 +08:00
|
|
|
obj = parse_object(revs->repo, &oid);
|
2007-12-12 02:09:04 +08:00
|
|
|
if (!obj)
|
|
|
|
return;
|
|
|
|
add_pending_object(revs, obj, "HEAD");
|
|
|
|
}
|
|
|
|
|
2013-05-25 17:08:07 +08:00
|
|
|
static struct object *get_reference(struct rev_info *revs, const char *name,
|
2017-05-07 06:10:27 +08:00
|
|
|
const struct object_id *oid,
|
2013-05-25 17:08:07 +08:00
|
|
|
unsigned int flags)
|
2006-02-26 08:19:46 +08:00
|
|
|
{
|
|
|
|
struct object *object;
|
|
|
|
|
2022-09-07 07:06:25 +08:00
|
|
|
object = parse_object_with_flags(revs->repo, oid,
|
|
|
|
revs->verify_objects ? 0 :
|
upload-pack: free tree buffers after parsing
When a client sends us a "want" or "have" line, we call parse_object()
to get an object struct. If the object is a tree, then the parsed state
means that tree->buffer points to the uncompressed contents of the tree.
But we don't really care about it. We only really need to parse commits
and tags; for trees and blobs, the important output is just a "struct
object" with the correct type.
But much worse, we do not ever free that tree buffer. It's not leaked in
the traditional sense, in that we still have a pointer to it from the
global object hash. But if the client requests many trees, we'll hold
all of their contents in memory at the same time.
Nobody really noticed because it's rare for clients to directly request
a tree. It might happen for a lightweight tag pointing straight at a
tree, or it might happen for a "tree:depth" partial clone filling in
missing trees.
But it's also possible for a malicious client to request a lot of trees,
causing upload-pack's memory to balloon. For example, without this
patch, requesting every tree in git.git like:
pktline() {
local msg="$*"
printf "%04x%s\n" $((1+4+${#msg})) "$msg"
}
want_trees() {
pktline command=fetch
printf 0001
git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)' |
while read oid type; do
test "$type" = "tree" || continue
pktline want $oid
done
pktline done
printf 0000
}
want_trees | GIT_PROTOCOL=version=2 valgrind --tool=massif ./git upload-pack . >/dev/null
shows a peak heap usage of ~3.7GB. Which is just about the sum of the
sizes of all of the uncompressed trees. For linux.git, it's closer to
17GB.
So the obvious thing to do is to call free_tree_buffer() after we
realize that we've parsed a tree. We know that upload-pack won't need it
later. But let's push the logic into parse_object_with_flags(), telling
it to discard the tree buffer immediately. There are two reasons for
this. One, all of the relevant call-sites already call the with_options
variant to pass the SKIP_HASH flag. So it actually ends up as less code
than manually free-ing in each spot. And two, it enables an extra
optimization that I'll discuss below.
I've touched all of the sites that currently use SKIP_HASH in
upload-pack. That drops the peak heap of the upload-pack invocation
above from 3.7GB to ~24MB.
I've also modified the caller in get_reference(); a partial clone
benefits from its use in pack-objects for the reasons given in
0bc2557951 (upload-pack: skip parse-object re-hashing of "want" objects,
2022-09-06), where we were measuring blob requests. But note that the
results of get_reference() are used for traversing, as well; so we
really would _eventually_ use the tree contents. That makes this at
first glance a space/time tradeoff: we won't hold all of the trees in
memory at once, but we'll have to reload them each when it comes time to
traverse.
And here's where our extra optimization comes in. If the caller is not
going to immediately look at the tree contents, and it doesn't care
about checking the hash, then parse_object() can simply skip loading the
tree entirely, just like we do for blobs! And now it's not a space/time
tradeoff in get_reference() anymore. It's just a lazy-load: we're
delaying reading the tree contents until it's time to actually traverse
them one by one.
And of course for upload-pack, this optimization means we never load the
trees at all, saving lots of CPU time. Timing the "every tree from
git.git" request above shows upload-pack dropping from 32 seconds of CPU
to 19 (the remainder is mostly due to pack-objects actually sending the
pack; timing just the upload-pack portion shows we go from 13s to
~0.28s).
These are all highly gamed numbers, of course. For real-world
partial-clone requests we're saving only a small bit of time in
practice. But it does help harden upload-pack against malicious
denial-of-service attacks.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-29 06:39:07 +08:00
|
|
|
PARSE_OBJECT_SKIP_HASH_CHECK |
|
|
|
|
PARSE_OBJECT_DISCARD_TREE);
|
2018-12-05 06:42:38 +08:00
|
|
|
|
2011-05-19 09:08:09 +08:00
|
|
|
if (!object) {
|
|
|
|
if (revs->ignore_missing)
|
2024-02-14 22:25:10 +08:00
|
|
|
return NULL;
|
2017-12-08 23:27:15 +08:00
|
|
|
if (revs->exclude_promisor_objects && is_promisor_object(oid))
|
|
|
|
return NULL;
|
rev-list: allow missing tips with --missing=[print|allow*]
In 9830926c7d (rev-list: add commit object support in `--missing`
option, 2023-10-27) we fixed the `--missing` option in `git rev-list`
so that it works with with missing commits, not just blobs/trees.
Unfortunately, such a command would still fail with a "fatal: bad
object <oid>" if it is passed a missing commit, blob or tree as an
argument (before the rev walking even begins).
When such a command is used to find the dependencies of some objects,
for example the dependencies of quarantined objects (see the
"QUARANTINE ENVIRONMENT" section in the git-receive-pack(1)
documentation), it would be better if the command would instead
consider such missing objects, especially commits, in the same way as
other missing objects.
If, for example `--missing=print` is used, it would be nice for some
use cases if the missing tips passed as arguments were reported in
the same way as other missing objects instead of the command just
failing.
We could introduce a new option to make it work like this, but most
users are likely to prefer the command to have this behavior as the
default one. Introducing a new option would require another dumb loop
to look for that option early, which isn't nice.
Also we made `git rev-list` work with missing commits very recently
and the command is most often passed commits as arguments. So let's
consider this as a bug fix related to these recent changes.
While at it let's add a NEEDSWORK comment to say that we should get
rid of the existing ugly dumb loops that parse the
`--exclude-promisor-objects` and `--missing=...` options early.
Helped-by: Linus Arver <linusa@google.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-14 22:25:13 +08:00
|
|
|
if (revs->do_not_die_on_missing_objects) {
|
|
|
|
oidset_insert(&revs->missing_commits, oid);
|
|
|
|
return NULL;
|
|
|
|
}
|
2006-02-26 08:19:46 +08:00
|
|
|
die("bad object %s", name);
|
2011-05-19 09:08:09 +08:00
|
|
|
}
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
object->flags |= flags;
|
|
|
|
return object;
|
|
|
|
}
|
|
|
|
|
2017-05-07 06:10:26 +08:00
|
|
|
void add_pending_oid(struct rev_info *revs, const char *name,
|
|
|
|
const struct object_id *oid, unsigned int flags)
|
2011-10-01 23:43:52 +08:00
|
|
|
{
|
2017-05-07 06:10:27 +08:00
|
|
|
struct object *object = get_reference(revs, name, oid, flags);
|
2011-10-01 23:43:52 +08:00
|
|
|
add_pending_object(revs, object, name);
|
|
|
|
}
|
|
|
|
|
2013-05-25 17:08:07 +08:00
|
|
|
static struct commit *handle_commit(struct rev_info *revs,
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:43:19 +08:00
|
|
|
struct object_array_entry *entry)
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
{
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:43:19 +08:00
|
|
|
struct object *object = entry->item;
|
|
|
|
const char *name = entry->name;
|
|
|
|
const char *path = entry->path;
|
|
|
|
unsigned int mode = entry->mode;
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
unsigned long flags = object->flags;
|
2006-02-26 08:19:46 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Tag object? Look what it points to..
|
|
|
|
*/
|
2006-07-12 11:45:31 +08:00
|
|
|
while (object->type == OBJ_TAG) {
|
2006-02-26 08:19:46 +08:00
|
|
|
struct tag *tag = (struct tag *) object;
|
2024-02-28 17:10:11 +08:00
|
|
|
struct object_id *oid;
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
if (revs->tag_objects && !(flags & UNINTERESTING))
|
2006-02-26 08:19:46 +08:00
|
|
|
add_pending_object(revs, object, tag->tag);
|
2024-02-28 17:10:11 +08:00
|
|
|
oid = get_tagged_oid(tag);
|
|
|
|
object = parse_object(revs->repo, oid);
|
2009-01-28 15:19:30 +08:00
|
|
|
if (!object) {
|
2017-05-20 16:30:25 +08:00
|
|
|
if (revs->ignore_missing_links || (flags & UNINTERESTING))
|
2009-01-28 15:19:30 +08:00
|
|
|
return NULL;
|
2018-07-13 08:03:06 +08:00
|
|
|
if (revs->exclude_promisor_objects &&
|
|
|
|
is_promisor_object(&tag->tagged->oid))
|
|
|
|
return NULL;
|
2024-02-28 17:10:11 +08:00
|
|
|
if (revs->do_not_die_on_missing_objects && oid) {
|
|
|
|
oidset_insert(&revs->missing_commits, oid);
|
|
|
|
return NULL;
|
|
|
|
}
|
2015-11-10 10:22:28 +08:00
|
|
|
die("bad object %s", oid_to_hex(&tag->tagged->oid));
|
2009-01-28 15:19:30 +08:00
|
|
|
}
|
2014-01-16 04:26:13 +08:00
|
|
|
object->flags |= flags;
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:43:19 +08:00
|
|
|
/*
|
|
|
|
* We'll handle the tagged object by looping or dropping
|
|
|
|
* through to the non-tag handlers below. Do not
|
revision.c: propagate tag names from pending array
When we unwrap a tag to find its commit for a traversal, we
do not propagate the "name" field of the tag in the pending
array (i.e., the ref name the user gave us in the first
place) to the commit (instead, we use an empty string). This
means that "git log --source" will never show the tag-name
for commits we reach through it.
This was broken in 2073949 (traverse_commit_list: support
pending blobs/trees with paths, 2014-10-15). That commit
tried to be careful and avoid propagating the path
information for a tag (which would be nonsensical) to trees
and blobs. But it should not have cut off the "name" field,
which should carry forward to children.
Note that this does mean that the "name" field will carry
forward to blobs and trees, too. Whereas prior to 2073949,
we always gave them an empty string. This is the right thing
to do, but in practice no callers probably use it (since now
we have an explicit separate "path" field, which was the
point of 2073949).
We add tests here not only for the broken case, but also a
basic sanity test of "log --source" in general, which did
not have any coverage in the test suite.
Reported-by: Raymundo <gypark@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-12-17 14:47:07 +08:00
|
|
|
* propagate path data from the tag's pending entry.
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:43:19 +08:00
|
|
|
*/
|
|
|
|
path = NULL;
|
|
|
|
mode = 0;
|
2006-02-26 08:19:46 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Commit object? Just return it, we'll do all the complex
|
|
|
|
* reachability crud.
|
|
|
|
*/
|
2006-07-12 11:45:31 +08:00
|
|
|
if (object->type == OBJ_COMMIT) {
|
2006-02-26 08:19:46 +08:00
|
|
|
struct commit *commit = (struct commit *)object;
|
2018-05-19 13:28:24 +08:00
|
|
|
|
2020-06-24 04:56:58 +08:00
|
|
|
if (repo_parse_commit(revs->repo, commit) < 0)
|
2006-02-26 08:19:46 +08:00
|
|
|
die("unable to parse commit %s", name);
|
2006-02-28 00:54:36 +08:00
|
|
|
if (flags & UNINTERESTING) {
|
git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.
In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.
-A-----E-F-G--main
\ / /
B-C-D--topic
In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.
Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.
Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 05:39:41 +08:00
|
|
|
mark_parents_uninteresting(revs, commit);
|
2019-05-21 21:14:38 +08:00
|
|
|
|
|
|
|
if (!revs->topo_order || !generation_numbers_enabled(the_repository))
|
|
|
|
revs->limited = 1;
|
2006-02-28 00:54:36 +08:00
|
|
|
}
|
2018-05-19 13:28:24 +08:00
|
|
|
if (revs->sources) {
|
|
|
|
char **slot = revision_sources_at(revs->sources, commit);
|
|
|
|
|
|
|
|
if (!*slot)
|
|
|
|
*slot = xstrdup(name);
|
|
|
|
}
|
2006-02-26 08:19:46 +08:00
|
|
|
return commit;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2009-04-18 02:13:30 +08:00
|
|
|
* Tree object? Either mark it uninteresting, or add it
|
2006-02-26 08:19:46 +08:00
|
|
|
* to the list of objects to look at later..
|
|
|
|
*/
|
2006-07-12 11:45:31 +08:00
|
|
|
if (object->type == OBJ_TREE) {
|
2006-02-26 08:19:46 +08:00
|
|
|
struct tree *tree = (struct tree *)object;
|
|
|
|
if (!revs->tree_objects)
|
|
|
|
return NULL;
|
|
|
|
if (flags & UNINTERESTING) {
|
2018-09-21 23:57:39 +08:00
|
|
|
mark_tree_contents_uninteresting(revs->repo, tree);
|
2006-02-26 08:19:46 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:43:19 +08:00
|
|
|
add_pending_object_with_path(revs, object, name, mode, path);
|
2006-02-26 08:19:46 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Blob object? You know the drill by now..
|
|
|
|
*/
|
2006-07-12 11:45:31 +08:00
|
|
|
if (object->type == OBJ_BLOB) {
|
2006-02-26 08:19:46 +08:00
|
|
|
if (!revs->blob_objects)
|
|
|
|
return NULL;
|
2014-01-16 04:26:13 +08:00
|
|
|
if (flags & UNINTERESTING)
|
2006-02-26 08:19:46 +08:00
|
|
|
return NULL;
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:43:19 +08:00
|
|
|
add_pending_object_with_path(revs, object, name, mode, path);
|
2006-02-26 08:19:46 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
die("%s is unknown object", name);
|
|
|
|
}
|
|
|
|
|
limit_list: avoid quadratic behavior from still_interesting
When we are limiting a rev-list traversal due to
UNINTERESTING refs, we have to walk down the tips (both
interesting and uninteresting) to find where they intersect.
We keep a queue of commits to examine, pop commits off
the queue one by one, and potentially add their parents. The
size of the queue will naturally fluctuate based on the
"width" of the history graph; i.e., the number of
simultaneous lines of development. But for the most part it
will stay in the same ballpark as the initial number of tips
we fed, shrinking over time (as we hit common ancestors of
the tips). So roughly speaking, if we start with `N` tips,
we'll spend much of the time with a queue around `N` items.
For each UNINTERESTING commit we pop, we call
still_interesting to check whether marking its parents as
UNINTERESTING has made the whole queue uninteresting (in
which case we can quit early). Because the queue is stored
as a linked list, this is `O(N)`, where `N` is the number of
items in the queue. So processing a queue with `N` commits
marked UNINTERESTING (and one or more interesting commits)
will take `O(N^2)`.
If you feed a lot of positive tips, this isn't a problem.
They aren't UNINTERESTING, so they don't incur the
still_interesting check. It also isn't a problem if you
traverse from an interesting tip to some UNINTERESTING
bases. We order the queue by recency, so the interesting
commits stay at the front of the queue as we walk down them.
The linear check can exit early as soon as it sees one
interesting commit left in the queue.
But if you want to know whether an older commit is reachable
from a set of newer tips, we end up processing in the
opposite direction: from the UNINTERESTING ones down to the
interesting one. This may happen when we call:
git rev-list $commits --not --all
in check_everything_connected after a fetch. If we fetched
something much older than most of our refs, and if we have a
large number of refs, the traversal cost is dominated by the
quadratic behavior.
These commands simulate the connectivity check of such a
fetch, when you have `$n` distinct refs in the receiver:
# positive ref is 100,000 commits deep
git rev-list --all | head -100000 | tail -1 >input
# huge number of more recent negative refs
git rev-list --all | head -$n | sed s/^/^/ >>input
time git rev-list --stdin <input
Here are timings for various `n` on the linux.git
repository. The `n=1` case provides a baseline for just
walking the commits, which lets us see the still_interesting
overhead. The times marked with `+` subtract that baseline
to show just the extra time growth due to the large number
of refs. The `x` numbers show the slowdown of the adjusted
time versus the prior trial.
n | before | after
--------------------------------------------------------
1 | 0.991s | 0.848s
10000 | 1.120s (+0.129s) | 0.885s (+0.037s)
20000 | 1.451s (+0.460s, 3.5x) | 0.923s (+0.075s, 2.0x)
40000 | 2.731s (+1.740s, 3.8x) | 0.994s (+0.146s, 1.9x)
80000 | 8.235s (+7.244s, 4.2x) | 1.123s (+0.275s, 1.9x)
Each trial doubles `n`, so you can see the quadratic (`4x`)
behavior before this patch. Afterwards, we have a roughly
linear relationship.
The implementation is fairly straightforward. Whenever we do
the linear search, we cache the interesting commit we find,
and next time check it before doing another linear search.
If that commit is removed from the list or becomes
UNINTERESTING itself, then we fall back to the linear
search. This is very similar to the trick used by fce87ae
(Fix quadratic performance in rewrite_one., 2008-07-12).
I considered and rejected several possible alternatives:
1. Keep a count of UNINTERESTING commits in the queue.
This requires managing the count not only when removing
an item from the queue, but also when marking an item
as UNINTERESTING. That requires touching the other
functions which mark commits, and would require knowing
quickly which commits are in the queue (lookup in the
queue is linear, so we would need an auxiliary
structure or to also maintain an IN_QUEUE flag in each
commit object).
2. Keep a separate list of interesting commits. Drop items
from it when they are dropped from the queue, or if
they become UNINTERESTING. This again suffers from
extra complexity to maintain the list, not to mention
CPU and memory.
3. Use a better data structure for the queue. This is
something that could help the fix in fce87ae, because
we order the queue by recency, and it is about
inserting quickly in recency order. So a normal
priority queue would help there. But here, we cannot
disturb the order of the queue, which makes things
harder. We really do need an auxiliary index to track
the flag we care about, which is basically option (2)
above.
The "cache" trick is simple, and the numbers above show that
it works well in practice. This is because the length of
time it takes to find an interesting commit is proportional
to the length of time it will remain cached (i.e., if we
have to walk a long way to find it, it also means we have to
pop a lot of elements in the queue until we get rid of it
and have to find another interesting commit).
The worst case is still quadratic, though. We could have `N`
uninteresting commits at the front of the queue, followed by
`N` interesting commits, where commit `i` has parent `i+N`.
When we pop commit `i`, we will notice that the parent of
the next commit, `i+1+N` is still interesting and cache it.
But then handling commit `i+1`, we will mark its parent
`i+1+N` uninteresting, and immediately invalidate our cache.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-04-18 06:11:04 +08:00
|
|
|
static int everybody_uninteresting(struct commit_list *orig,
|
|
|
|
struct commit **interesting_cache)
|
2006-03-01 03:24:00 +08:00
|
|
|
{
|
|
|
|
struct commit_list *list = orig;
|
limit_list: avoid quadratic behavior from still_interesting
When we are limiting a rev-list traversal due to
UNINTERESTING refs, we have to walk down the tips (both
interesting and uninteresting) to find where they intersect.
We keep a queue of commits to examine, pop commits off
the queue one by one, and potentially add their parents. The
size of the queue will naturally fluctuate based on the
"width" of the history graph; i.e., the number of
simultaneous lines of development. But for the most part it
will stay in the same ballpark as the initial number of tips
we fed, shrinking over time (as we hit common ancestors of
the tips). So roughly speaking, if we start with `N` tips,
we'll spend much of the time with a queue around `N` items.
For each UNINTERESTING commit we pop, we call
still_interesting to check whether marking its parents as
UNINTERESTING has made the whole queue uninteresting (in
which case we can quit early). Because the queue is stored
as a linked list, this is `O(N)`, where `N` is the number of
items in the queue. So processing a queue with `N` commits
marked UNINTERESTING (and one or more interesting commits)
will take `O(N^2)`.
If you feed a lot of positive tips, this isn't a problem.
They aren't UNINTERESTING, so they don't incur the
still_interesting check. It also isn't a problem if you
traverse from an interesting tip to some UNINTERESTING
bases. We order the queue by recency, so the interesting
commits stay at the front of the queue as we walk down them.
The linear check can exit early as soon as it sees one
interesting commit left in the queue.
But if you want to know whether an older commit is reachable
from a set of newer tips, we end up processing in the
opposite direction: from the UNINTERESTING ones down to the
interesting one. This may happen when we call:
git rev-list $commits --not --all
in check_everything_connected after a fetch. If we fetched
something much older than most of our refs, and if we have a
large number of refs, the traversal cost is dominated by the
quadratic behavior.
These commands simulate the connectivity check of such a
fetch, when you have `$n` distinct refs in the receiver:
# positive ref is 100,000 commits deep
git rev-list --all | head -100000 | tail -1 >input
# huge number of more recent negative refs
git rev-list --all | head -$n | sed s/^/^/ >>input
time git rev-list --stdin <input
Here are timings for various `n` on the linux.git
repository. The `n=1` case provides a baseline for just
walking the commits, which lets us see the still_interesting
overhead. The times marked with `+` subtract that baseline
to show just the extra time growth due to the large number
of refs. The `x` numbers show the slowdown of the adjusted
time versus the prior trial.
n | before | after
--------------------------------------------------------
1 | 0.991s | 0.848s
10000 | 1.120s (+0.129s) | 0.885s (+0.037s)
20000 | 1.451s (+0.460s, 3.5x) | 0.923s (+0.075s, 2.0x)
40000 | 2.731s (+1.740s, 3.8x) | 0.994s (+0.146s, 1.9x)
80000 | 8.235s (+7.244s, 4.2x) | 1.123s (+0.275s, 1.9x)
Each trial doubles `n`, so you can see the quadratic (`4x`)
behavior before this patch. Afterwards, we have a roughly
linear relationship.
The implementation is fairly straightforward. Whenever we do
the linear search, we cache the interesting commit we find,
and next time check it before doing another linear search.
If that commit is removed from the list or becomes
UNINTERESTING itself, then we fall back to the linear
search. This is very similar to the trick used by fce87ae
(Fix quadratic performance in rewrite_one., 2008-07-12).
I considered and rejected several possible alternatives:
1. Keep a count of UNINTERESTING commits in the queue.
This requires managing the count not only when removing
an item from the queue, but also when marking an item
as UNINTERESTING. That requires touching the other
functions which mark commits, and would require knowing
quickly which commits are in the queue (lookup in the
queue is linear, so we would need an auxiliary
structure or to also maintain an IN_QUEUE flag in each
commit object).
2. Keep a separate list of interesting commits. Drop items
from it when they are dropped from the queue, or if
they become UNINTERESTING. This again suffers from
extra complexity to maintain the list, not to mention
CPU and memory.
3. Use a better data structure for the queue. This is
something that could help the fix in fce87ae, because
we order the queue by recency, and it is about
inserting quickly in recency order. So a normal
priority queue would help there. But here, we cannot
disturb the order of the queue, which makes things
harder. We really do need an auxiliary index to track
the flag we care about, which is basically option (2)
above.
The "cache" trick is simple, and the numbers above show that
it works well in practice. This is because the length of
time it takes to find an interesting commit is proportional
to the length of time it will remain cached (i.e., if we
have to walk a long way to find it, it also means we have to
pop a lot of elements in the queue until we get rid of it
and have to find another interesting commit).
The worst case is still quadratic, though. We could have `N`
uninteresting commits at the front of the queue, followed by
`N` interesting commits, where commit `i` has parent `i+N`.
When we pop commit `i`, we will notice that the parent of
the next commit, `i+1+N` is still interesting and cache it.
But then handling commit `i+1`, we will mark its parent
`i+1+N` uninteresting, and immediately invalidate our cache.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-04-18 06:11:04 +08:00
|
|
|
|
|
|
|
if (*interesting_cache) {
|
|
|
|
struct commit *commit = *interesting_cache;
|
|
|
|
if (!(commit->object.flags & UNINTERESTING))
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2006-03-01 03:24:00 +08:00
|
|
|
while (list) {
|
|
|
|
struct commit *commit = list->item;
|
|
|
|
list = list->next;
|
|
|
|
if (commit->object.flags & UNINTERESTING)
|
|
|
|
continue;
|
2015-06-27 03:40:19 +08:00
|
|
|
|
|
|
|
*interesting_cache = commit;
|
2006-03-01 03:24:00 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2013-05-16 23:32:39 +08:00
|
|
|
/*
|
|
|
|
* A definition of "relevant" commit that we can use to simplify limited graphs
|
|
|
|
* by eliminating side branches.
|
|
|
|
*
|
|
|
|
* A "relevant" commit is one that is !UNINTERESTING (ie we are including it
|
|
|
|
* in our list), or that is a specified BOTTOM commit. Then after computing
|
|
|
|
* a limited list, during processing we can generally ignore boundary merges
|
|
|
|
* coming from outside the graph, (ie from irrelevant parents), and treat
|
|
|
|
* those merges as if they were single-parent. TREESAME is defined to consider
|
|
|
|
* only relevant parents, if any. If we are TREESAME to our on-graph parents,
|
|
|
|
* we don't care if we were !TREESAME to non-graph parents.
|
|
|
|
*
|
|
|
|
* Treating bottom commits as relevant ensures that a limited graph's
|
|
|
|
* connection to the actual bottom commit is not viewed as a side branch, but
|
|
|
|
* treated as part of the graph. For example:
|
|
|
|
*
|
|
|
|
* ....Z...A---X---o---o---B
|
|
|
|
* . /
|
|
|
|
* W---Y
|
|
|
|
*
|
|
|
|
* When computing "A..B", the A-X connection is at least as important as
|
|
|
|
* Y-X, despite A being flagged UNINTERESTING.
|
|
|
|
*
|
|
|
|
* And when computing --ancestry-path "A..B", the A-X connection is more
|
|
|
|
* important than Y-X, despite both A and Y being flagged UNINTERESTING.
|
|
|
|
*/
|
|
|
|
static inline int relevant_commit(struct commit *commit)
|
|
|
|
{
|
|
|
|
return (commit->object.flags & (UNINTERESTING | BOTTOM)) != UNINTERESTING;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return a single relevant commit from a parent list. If we are a TREESAME
|
|
|
|
* commit, and this selects one of our parents, then we can safely simplify to
|
|
|
|
* that parent.
|
|
|
|
*/
|
|
|
|
static struct commit *one_relevant_parent(const struct rev_info *revs,
|
|
|
|
struct commit_list *orig)
|
|
|
|
{
|
|
|
|
struct commit_list *list = orig;
|
|
|
|
struct commit *relevant = NULL;
|
|
|
|
|
|
|
|
if (!orig)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For 1-parent commits, or if first-parent-only, then return that
|
|
|
|
* first parent (even if not "relevant" by the above definition).
|
|
|
|
* TREESAME will have been set purely on that parent.
|
|
|
|
*/
|
|
|
|
if (revs->first_parent_only || !orig->next)
|
|
|
|
return orig->item;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For multi-parent commits, identify a sole relevant parent, if any.
|
|
|
|
* If we have only one relevant parent, then TREESAME will be set purely
|
|
|
|
* with regard to that parent, and we can simplify accordingly.
|
|
|
|
*
|
|
|
|
* If we have more than one relevant parent, or no relevant parents
|
|
|
|
* (and multiple irrelevant ones), then we can't select a parent here
|
|
|
|
* and return NULL.
|
|
|
|
*/
|
|
|
|
while (list) {
|
|
|
|
struct commit *commit = list->item;
|
|
|
|
list = list->next;
|
|
|
|
if (relevant_commit(commit)) {
|
|
|
|
if (relevant)
|
|
|
|
return NULL;
|
|
|
|
relevant = commit;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return relevant;
|
|
|
|
}
|
|
|
|
|
2007-03-15 04:12:18 +08:00
|
|
|
/*
|
|
|
|
* The goal is to get REV_TREE_NEW as the result only if the
|
2009-06-03 09:34:01 +08:00
|
|
|
* diff consists of all '+' (and no other changes), REV_TREE_OLD
|
|
|
|
* if the whole diff is removal of old data, and otherwise
|
|
|
|
* REV_TREE_DIFFERENT (of course if the trees are the same we
|
|
|
|
* want REV_TREE_SAME).
|
revision: quit pruning diff more quickly when possible
When the revision traversal machinery is given a pathspec,
we must compute the parent-diff for each commit to determine
which ones are TREESAME. We set the QUICK diff flag to avoid
looking at more entries than we need; we really just care
whether there are any changes at all.
But there is one case where we want to know a bit more: if
--remove-empty is set, we care about finding cases where the
change consists only of added entries (in which case we may
prune the parent in try_to_simplify_commit()). To cover that
case, our file_add_remove() callback does not quit the diff
upon seeing an added entry; it keeps looking for other types
of entries.
But this means when --remove-empty is not set (and it is not
by default), we compute more of the diff than is necessary.
You can see this in a pathological case where a commit adds
a very large number of entries, and we limit based on a
broad pathspec. E.g.:
perl -e '
chomp(my $blob = `git hash-object -w --stdin </dev/null`);
for my $a (1..1000) {
for my $b (1..1000) {
print "100644 $blob\t$a/$b\n";
}
}
' | git update-index --index-info
git commit -qm add
git rev-list HEAD -- .
This case takes about 100ms now, but after this patch only
needs 6ms. That's not a huge improvement, but it's easy to
get and it protects us against even more pathological cases
(e.g., going from 1 million to 10 million files would take
ten times as long with the current code, but not increase at
all after this patch).
This is reported to minorly speed-up pathspec limiting in
real world repositories (like the 100-million-file Windows
repository), but probably won't make a noticeable difference
outside of pathological setups.
This patch actually covers the case without --remove-empty,
and the case where we see only deletions. See the in-code
comment for details.
Note that we have to add a new member to the diff_options
struct so that our callback can see the value of
revs->remove_empty_trees. This callback parameter could be
passed to the "add_remove" and "change" callbacks, but
there's not much point. They already receive the
diff_options struct, and doing it this way avoids having to
update the function signature of the other callbacks
(arguably the format_callback and output_prefix functions
could benefit from the same simplification).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-13 23:27:45 +08:00
|
|
|
*
|
|
|
|
* The only time we care about the distinction is when
|
|
|
|
* remove_empty_trees is in effect, in which case we care only about
|
|
|
|
* whether the whole change is REV_TREE_NEW, or if there's another type
|
|
|
|
* of change. Which means we can stop the diff early in either of these
|
|
|
|
* cases:
|
|
|
|
*
|
|
|
|
* 1. We're not using remove_empty_trees at all.
|
|
|
|
*
|
|
|
|
* 2. We saw anything except REV_TREE_NEW.
|
2007-03-15 04:12:18 +08:00
|
|
|
*/
|
revision.[ch]: document and move code declared around "init"
A subsequent commit will add "REV_INFO_INIT" macro adjacent to
repo_init_revisions(), unfortunately between the "struct rev_info"
itself and that function we've added various miscellaneous code
between the two over the years.
Let's move that code either lower in revision.h, giving it API docs
while we're at it, or in cases where it wasn't public API at all move
it into revision.c No lines of code are changed here, only moved
around. The only changes are the addition of new API comments.
The "tree_difference" variable could also be declared like this, which
I think would be a lot clearer, but let's leave that for now to keep
this a move-only change:
static enum {
REV_TREE_SAME,
REV_TREE_NEW, /* Only new files */
REV_TREE_OLD, /* Only files removed */
REV_TREE_DIFFERENT, /* Mixed changes */
} tree_difference = REV_TREE_SAME;
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-14 04:01:37 +08:00
|
|
|
#define REV_TREE_SAME 0
|
|
|
|
#define REV_TREE_NEW 1 /* Only new files */
|
|
|
|
#define REV_TREE_OLD 2 /* Only files removed */
|
|
|
|
#define REV_TREE_DIFFERENT 3 /* Mixed changes */
|
2006-03-10 17:21:39 +08:00
|
|
|
static int tree_difference = REV_TREE_SAME;
|
2006-03-01 03:24:00 +08:00
|
|
|
|
|
|
|
static void file_add_remove(struct diff_options *options,
|
2022-12-13 19:13:48 +08:00
|
|
|
int addremove,
|
|
|
|
unsigned mode UNUSED,
|
|
|
|
const struct object_id *oid UNUSED,
|
|
|
|
int oid_valid UNUSED,
|
|
|
|
const char *fullpath UNUSED,
|
|
|
|
unsigned dirty_submodule UNUSED)
|
2006-03-01 03:24:00 +08:00
|
|
|
{
|
2009-06-03 09:34:01 +08:00
|
|
|
int diff = addremove == '+' ? REV_TREE_NEW : REV_TREE_OLD;
|
revision: quit pruning diff more quickly when possible
When the revision traversal machinery is given a pathspec,
we must compute the parent-diff for each commit to determine
which ones are TREESAME. We set the QUICK diff flag to avoid
looking at more entries than we need; we really just care
whether there are any changes at all.
But there is one case where we want to know a bit more: if
--remove-empty is set, we care about finding cases where the
change consists only of added entries (in which case we may
prune the parent in try_to_simplify_commit()). To cover that
case, our file_add_remove() callback does not quit the diff
upon seeing an added entry; it keeps looking for other types
of entries.
But this means when --remove-empty is not set (and it is not
by default), we compute more of the diff than is necessary.
You can see this in a pathological case where a commit adds
a very large number of entries, and we limit based on a
broad pathspec. E.g.:
perl -e '
chomp(my $blob = `git hash-object -w --stdin </dev/null`);
for my $a (1..1000) {
for my $b (1..1000) {
print "100644 $blob\t$a/$b\n";
}
}
' | git update-index --index-info
git commit -qm add
git rev-list HEAD -- .
This case takes about 100ms now, but after this patch only
needs 6ms. That's not a huge improvement, but it's easy to
get and it protects us against even more pathological cases
(e.g., going from 1 million to 10 million files would take
ten times as long with the current code, but not increase at
all after this patch).
This is reported to minorly speed-up pathspec limiting in
real world repositories (like the 100-million-file Windows
repository), but probably won't make a noticeable difference
outside of pathological setups.
This patch actually covers the case without --remove-empty,
and the case where we see only deletions. See the in-code
comment for details.
Note that we have to add a new member to the diff_options
struct so that our callback can see the value of
revs->remove_empty_trees. This callback parameter could be
passed to the "add_remove" and "change" callbacks, but
there's not much point. They already receive the
diff_options struct, and doing it this way avoids having to
update the function signature of the other callbacks
(arguably the format_callback and output_prefix functions
could benefit from the same simplification).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-13 23:27:45 +08:00
|
|
|
struct rev_info *revs = options->change_fn_data;
|
2006-03-01 03:24:00 +08:00
|
|
|
|
2009-06-03 09:34:01 +08:00
|
|
|
tree_difference |= diff;
|
revision: quit pruning diff more quickly when possible
When the revision traversal machinery is given a pathspec,
we must compute the parent-diff for each commit to determine
which ones are TREESAME. We set the QUICK diff flag to avoid
looking at more entries than we need; we really just care
whether there are any changes at all.
But there is one case where we want to know a bit more: if
--remove-empty is set, we care about finding cases where the
change consists only of added entries (in which case we may
prune the parent in try_to_simplify_commit()). To cover that
case, our file_add_remove() callback does not quit the diff
upon seeing an added entry; it keeps looking for other types
of entries.
But this means when --remove-empty is not set (and it is not
by default), we compute more of the diff than is necessary.
You can see this in a pathological case where a commit adds
a very large number of entries, and we limit based on a
broad pathspec. E.g.:
perl -e '
chomp(my $blob = `git hash-object -w --stdin </dev/null`);
for my $a (1..1000) {
for my $b (1..1000) {
print "100644 $blob\t$a/$b\n";
}
}
' | git update-index --index-info
git commit -qm add
git rev-list HEAD -- .
This case takes about 100ms now, but after this patch only
needs 6ms. That's not a huge improvement, but it's easy to
get and it protects us against even more pathological cases
(e.g., going from 1 million to 10 million files would take
ten times as long with the current code, but not increase at
all after this patch).
This is reported to minorly speed-up pathspec limiting in
real world repositories (like the 100-million-file Windows
repository), but probably won't make a noticeable difference
outside of pathological setups.
This patch actually covers the case without --remove-empty,
and the case where we see only deletions. See the in-code
comment for details.
Note that we have to add a new member to the diff_options
struct so that our callback can see the value of
revs->remove_empty_trees. This callback parameter could be
passed to the "add_remove" and "change" callbacks, but
there's not much point. They already receive the
diff_options struct, and doing it this way avoids having to
update the function signature of the other callbacks
(arguably the format_callback and output_prefix functions
could benefit from the same simplification).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-13 23:27:45 +08:00
|
|
|
if (!revs->remove_empty_trees || tree_difference != REV_TREE_NEW)
|
2017-11-01 02:19:11 +08:00
|
|
|
options->flags.has_changes = 1;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void file_change(struct diff_options *options,
|
2022-12-13 19:13:48 +08:00
|
|
|
unsigned old_mode UNUSED,
|
|
|
|
unsigned new_mode UNUSED,
|
|
|
|
const struct object_id *old_oid UNUSED,
|
|
|
|
const struct object_id *new_oid UNUSED,
|
|
|
|
int old_oid_valid UNUSED,
|
|
|
|
int new_oid_valid UNUSED,
|
|
|
|
const char *fullpath UNUSED,
|
|
|
|
unsigned old_dirty_submodule UNUSED,
|
|
|
|
unsigned new_dirty_submodule UNUSED)
|
2006-03-01 03:24:00 +08:00
|
|
|
{
|
2006-03-10 17:21:39 +08:00
|
|
|
tree_difference = REV_TREE_DIFFERENT;
|
2017-11-01 02:19:11 +08:00
|
|
|
options->flags.has_changes = 1;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
|
|
|
|
2020-04-07 00:59:53 +08:00
|
|
|
static int bloom_filter_atexit_registered;
|
|
|
|
static unsigned int count_bloom_filter_maybe;
|
|
|
|
static unsigned int count_bloom_filter_definitely_not;
|
|
|
|
static unsigned int count_bloom_filter_false_positive;
|
|
|
|
static unsigned int count_bloom_filter_not_present;
|
|
|
|
|
|
|
|
static void trace2_bloom_filter_statistics_atexit(void)
|
|
|
|
{
|
|
|
|
struct json_writer jw = JSON_WRITER_INIT;
|
|
|
|
|
|
|
|
jw_object_begin(&jw, 0);
|
|
|
|
jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
|
|
|
|
jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
|
|
|
|
jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
|
|
|
|
jw_object_intmax(&jw, "false_positive", count_bloom_filter_false_positive);
|
|
|
|
jw_end(&jw);
|
|
|
|
|
|
|
|
trace2_data_json("bloom", the_repository, "statistics", &jw);
|
|
|
|
|
|
|
|
jw_release(&jw);
|
|
|
|
}
|
|
|
|
|
2020-04-17 04:14:02 +08:00
|
|
|
static int forbid_bloom_filters(struct pathspec *spec)
|
|
|
|
{
|
|
|
|
if (spec->has_wildcard)
|
|
|
|
return 1;
|
|
|
|
if (spec->nr > 1)
|
|
|
|
return 1;
|
|
|
|
if (spec->magic & ~PATHSPEC_LITERAL)
|
|
|
|
return 1;
|
|
|
|
if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-04-07 00:59:52 +08:00
|
|
|
static void prepare_to_use_bloom_filter(struct rev_info *revs)
|
|
|
|
{
|
|
|
|
struct pathspec_item *pi;
|
|
|
|
char *path_alloc = NULL;
|
commit-graph: check all leading directories in changed path Bloom filters
The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.
So when checking modified path Bloom filters looking for commits
modifying a path with multiple path components, then check not only
the full path in the Bloom filters, but all its leading directories as
well. Take care to check these paths in "deepest first" order,
because it's the full path that is least likely to be modified, and
the Bloom filter queries can short circuit sooner.
This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks. The table below compares the average
false positive rate and runtime of
git rev-list HEAD -- "$path"
before and after this change for 5000+ randomly* selected paths from
each repository:
Average false Average Average
positive rate runtime runtime
before after before after difference
------------------------------------------------------------------
git 3.220% 0.7853% 0.0558s 0.0387s -30.6%
linux 2.453% 0.0296% 0.1046s 0.0766s -26.8%
tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2%
*Path selection was done with the following pipeline:
git ls-tree -r --name-only HEAD | sort -R | head -n 5000
The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here. However, all these timings depend on that accessing
tree objects is reasonably fast (warm caches). If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:
$ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
$ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
commit-graph write --reachable
$ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
$ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
rev-list HEAD -- "$path"
then checking all leading path component can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).
This adjusts the tracing values in t4216-log-bloom.sh, which provides a
concrete way to notice the improvement.
Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 21:27:30 +08:00
|
|
|
const char *path, *p;
|
|
|
|
size_t len;
|
|
|
|
int path_component_nr = 1;
|
2020-04-07 00:59:52 +08:00
|
|
|
|
|
|
|
if (!revs->commits)
|
2020-04-17 04:14:02 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
if (forbid_bloom_filters(&revs->prune_data))
|
|
|
|
return;
|
2020-04-07 00:59:52 +08:00
|
|
|
|
|
|
|
repo_parse_commit(revs->repo, revs->commits->item);
|
|
|
|
|
commit-graph: introduce 'get_bloom_filter_settings()'
Many places in the code often need a pointer to the commit-graph's
'struct bloom_filter_settings', in which case they often take the value
from the top-most commit-graph.
In the non-split case, this works as expected. In the split case,
however, things get a little tricky. Not all layers in a chain of
incremental commit-graphs are required to themselves have Bloom data,
and so whether or not some part of the code uses Bloom filters depends
entirely on whether or not the top-most level of the commit-graph chain
has Bloom filters.
This has been the behavior since Bloom filters were introduced, and has
been codified into the tests since a759bfa9ee (t4216: add end to end
tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
requires that Bloom filters are not used in exactly the case described
earlier.
There is no reason that this needs to be the case, since it is perfectly
valid for commits in an earlier layer to have Bloom filters when commits
in a newer layer do not.
Since Bloom settings are guaranteed in practice to be the same for any
layer in a chain that has Bloom data, it is sufficient to traverse the
'->base_graph' pointer until either (1) a non-null 'struct
bloom_filter_settings *' is found, or (2) until we are at the root of
the commit-graph chain.
Introduce a 'get_bloom_filter_settings()' function that does just this,
and use it instead of purely dereferencing the top-most graph's
'->bloom_filter_settings' pointer.
While we're at it, add an additional test in t5324 to guard against code
in the commit-graph writing machinery that doesn't correctly handle a
NULL 'struct bloom_filter *'.
Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09 23:22:44 +08:00
|
|
|
revs->bloom_filter_settings = get_bloom_filter_settings(revs->repo);
|
2020-04-07 00:59:52 +08:00
|
|
|
if (!revs->bloom_filter_settings)
|
|
|
|
return;
|
|
|
|
|
line-log: integrate with changed-path Bloom filters
The previous changes to the line-log machinery focused on making the
first result appear faster. This was achieved by no longer walking the
entire commit history before returning the early results. There is still
another way to improve the performance: walk most commits much faster.
Let's use the changed-path Bloom filters to reduce time spent computing
diffs.
Since the line-log computation requires opening blobs and checking the
content-diff, there is still a lot of necessary computation that cannot
be replaced with changed-path Bloom filters. The part that we can reduce
is most effective when checking the history of a file that is deep in
several directories and those directories are modified frequently. In
this case, the computation to check if a commit is TREESAME to its first
parent takes a large fraction of the time. That is ripe for improvement
with changed-path Bloom filters.
We must ensure that prepare_to_use_bloom_filters() is called in
revision.c so that the bloom_filter_settings are loaded into the struct
rev_info from the commit-graph. Of course, some cases are still
forbidden, but in the line-log case the pathspec is provided in a
different way than normal.
Since multiple paths and segments could be requested, we compute the
struct bloom_key data dynamically during the commit walk. This could
likely be improved, but adds code complexity that is not valuable at
this time.
There are two cases to care about: merge commits and "ordinary" commits.
Merge commits have multiple parents, but if we are TREESAME to our first
parent in every range, then pass the blame for all ranges to the first
parent. Ordinary commits have the same condition, but each is done
slightly differently in the process_ranges_[merge|ordinary]_commit()
methods. By checking if the changed-path Bloom filter can guarantee
TREESAME, we can avoid that tree-diff cost. If the filter says "probably
changed", then we need to run the tree-diff and then the blob-diff if
there was a real edit.
The Linux kernel repository is a good testing ground for the performance
improvements claimed here. There are two different cases to test. The
first is the "entire history" case, where we output the entire history
to /dev/null to see how long it would take to compute the full line-log
history. The second is the "first result" case, where we find how long
it takes to show the first value, which is an indicator of how quickly a
user would see responses when waiting at a terminal.
To test, I selected the paths that were changed most frequently in the
top 10,000 commits using this command (stolen from StackOverflow [1]):
git log --pretty=format: --name-only -n 10000 | sort | \
uniq -c | sort -rg | head -10
which results in
121 MAINTAINERS
63 fs/namei.c
60 arch/x86/kvm/cpuid.c
59 fs/io_uring.c
58 arch/x86/kvm/vmx/vmx.c
51 arch/x86/kvm/x86.c
45 arch/x86/kvm/svm.c
42 fs/btrfs/disk-io.c
42 Documentation/scsi/index.rst
(along with a bogus first result). It appears that the path
arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the
following results for the real command time:
| | Entire History | First Result |
| Path | Before | After | Before | After |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |
It is worth noting that the least speedup comes for the MAINTAINERS file
which is
* edited frequently,
* low in the directory heirarchy, and
* quite a large file.
All of those points lead to spending more time doing the blob diff and
less time doing the tree diff. Still, we see some improvement in that
case and significant improvement in other cases. A 2-4x speedup is
likely the more typical case as opposed to the small 5% change for that
file.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-11 19:56:19 +08:00
|
|
|
if (!revs->pruning.pathspec.nr)
|
|
|
|
return;
|
|
|
|
|
2020-04-07 00:59:52 +08:00
|
|
|
pi = &revs->pruning.pathspec.items[0];
|
|
|
|
|
|
|
|
/* remove single trailing slash from path, if needed */
|
revision: avoid out-of-bounds read/write on empty pathspec
Running t4216 with ASan results in it complaining of an out-of-bounds
read in prepare_to_use_bloom_filter(). The issue is this code to strip a
trailing slash:
last_index = pi->len - 1;
if (pi->match[last_index] == '/') {
because we have no guarantee that pi->len isn't zero. This can happen if
the pathspec is ".", as we translate that to an empty string. And if
that read of random memory does trigger the conditional, we'd then do an
out-of-bounds write:
path_alloc = xstrdup(pi->match);
path_alloc[last_index] = '\0';
Let's make sure to check the length before subtracting. Note that for an
empty pathspec, we'd end up bailing from the function a few lines later,
which makes it tempting to just:
if (!pi->len)
return;
early here. But our code here is stripping a trailing slash, and we need
to check for emptiness after stripping that slash, too. So we'd have two
blocks, which would require repeating some cleanup code.
Instead, just skip the trailing-slash for an empty string. Setting
last_index at all in the case is awkward since it will have a nonsense
value (and it uses an "int", which is a too-small type for a string
anyway). So while we're here, let's:
- drop last_index entirely; it's only used in two spots right next to
each other and writing out "pi->len - 1" in both is actually easier
to follow
- use xmemdupz() to duplicate the string. This is slightly more
efficient, but more importantly makes the intent more clear by
allocating the correct-sized substring in the first place. It also
eliminates any question of whether path_alloc is as long as
pi->match (which it would not be if pi->match has any embedded NULs,
though in practice this is probably impossible).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-04 15:46:52 +08:00
|
|
|
if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
|
|
|
|
path_alloc = xmemdupz(pi->match, pi->len - 1);
|
2020-07-01 21:27:28 +08:00
|
|
|
path = path_alloc;
|
2020-04-07 00:59:52 +08:00
|
|
|
} else
|
2020-07-01 21:27:28 +08:00
|
|
|
path = pi->match;
|
2020-04-07 00:59:52 +08:00
|
|
|
|
|
|
|
len = strlen(path);
|
2020-07-01 21:27:29 +08:00
|
|
|
if (!len) {
|
|
|
|
revs->bloom_filter_settings = NULL;
|
2020-08-04 15:50:17 +08:00
|
|
|
free(path_alloc);
|
2020-07-01 21:27:29 +08:00
|
|
|
return;
|
|
|
|
}
|
2020-04-07 00:59:52 +08:00
|
|
|
|
commit-graph: check all leading directories in changed path Bloom filters
The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.
So when checking modified path Bloom filters looking for commits
modifying a path with multiple path components, then check not only
the full path in the Bloom filters, but all its leading directories as
well. Take care to check these paths in "deepest first" order,
because it's the full path that is least likely to be modified, and
the Bloom filter queries can short circuit sooner.
This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks. The table below compares the average
false positive rate and runtime of
git rev-list HEAD -- "$path"
before and after this change for 5000+ randomly* selected paths from
each repository:
Average false Average Average
positive rate runtime runtime
before after before after difference
------------------------------------------------------------------
git 3.220% 0.7853% 0.0558s 0.0387s -30.6%
linux 2.453% 0.0296% 0.1046s 0.0766s -26.8%
tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2%
*Path selection was done with the following pipeline:
git ls-tree -r --name-only HEAD | sort -R | head -n 5000
The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here. However, all these timings depend on that accessing
tree objects is reasonably fast (warm caches). If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:
$ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
$ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
commit-graph write --reachable
$ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
$ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
rev-list HEAD -- "$path"
then checking all leading path component can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).
This adjusts the tracing values in t4216-log-bloom.sh, which provides a
concrete way to notice the improvement.
Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 21:27:30 +08:00
|
|
|
p = path;
|
|
|
|
while (*p) {
|
|
|
|
/*
|
|
|
|
* At this point, the path is normalized to use Unix-style
|
|
|
|
* path separators. This is required due to how the
|
|
|
|
* changed-path Bloom filters store the paths.
|
|
|
|
*/
|
|
|
|
if (*p == '/')
|
|
|
|
path_component_nr++;
|
|
|
|
p++;
|
|
|
|
}
|
|
|
|
|
|
|
|
revs->bloom_keys_nr = path_component_nr;
|
|
|
|
ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
|
2020-04-07 00:59:52 +08:00
|
|
|
|
commit-graph: check all leading directories in changed path Bloom filters
The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.
So when checking modified path Bloom filters looking for commits
modifying a path with multiple path components, then check not only
the full path in the Bloom filters, but all its leading directories as
well. Take care to check these paths in "deepest first" order,
because it's the full path that is least likely to be modified, and
the Bloom filter queries can short circuit sooner.
This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks. The table below compares the average
false positive rate and runtime of
git rev-list HEAD -- "$path"
before and after this change for 5000+ randomly* selected paths from
each repository:
Average false Average Average
positive rate runtime runtime
before after before after difference
------------------------------------------------------------------
git 3.220% 0.7853% 0.0558s 0.0387s -30.6%
linux 2.453% 0.0296% 0.1046s 0.0766s -26.8%
tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2%
*Path selection was done with the following pipeline:
git ls-tree -r --name-only HEAD | sort -R | head -n 5000
The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here. However, all these timings depend on that accessing
tree objects is reasonably fast (warm caches). If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:
$ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
$ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
commit-graph write --reachable
$ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
$ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
rev-list HEAD -- "$path"
then checking all leading path component can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).
This adjusts the tracing values in t4216-log-bloom.sh, which provides a
concrete way to notice the improvement.
Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 21:27:30 +08:00
|
|
|
fill_bloom_key(path, len, &revs->bloom_keys[0],
|
|
|
|
revs->bloom_filter_settings);
|
|
|
|
path_component_nr = 1;
|
|
|
|
|
|
|
|
p = path + len - 1;
|
|
|
|
while (p > path) {
|
|
|
|
if (*p == '/')
|
|
|
|
fill_bloom_key(path, p - path,
|
|
|
|
&revs->bloom_keys[path_component_nr++],
|
|
|
|
revs->bloom_filter_settings);
|
|
|
|
p--;
|
|
|
|
}
|
2020-04-07 00:59:52 +08:00
|
|
|
|
2020-04-07 00:59:53 +08:00
|
|
|
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
|
|
|
|
atexit(trace2_bloom_filter_statistics_atexit);
|
|
|
|
bloom_filter_atexit_registered = 1;
|
|
|
|
}
|
|
|
|
|
2020-04-07 00:59:52 +08:00
|
|
|
free(path_alloc);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
|
|
|
|
struct commit *commit)
|
|
|
|
{
|
|
|
|
struct bloom_filter *filter;
|
commit-graph: check all leading directories in changed path Bloom filters
The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.
So when checking modified path Bloom filters looking for commits
modifying a path with multiple path components, then check not only
the full path in the Bloom filters, but all its leading directories as
well. Take care to check these paths in "deepest first" order,
because it's the full path that is least likely to be modified, and
the Bloom filter queries can short circuit sooner.
This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks. The table below compares the average
false positive rate and runtime of
git rev-list HEAD -- "$path"
before and after this change for 5000+ randomly* selected paths from
each repository:
Average false Average Average
positive rate runtime runtime
before after before after difference
------------------------------------------------------------------
git 3.220% 0.7853% 0.0558s 0.0387s -30.6%
linux 2.453% 0.0296% 0.1046s 0.0766s -26.8%
tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2%
*Path selection was done with the following pipeline:
git ls-tree -r --name-only HEAD | sort -R | head -n 5000
The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here. However, all these timings depend on that accessing
tree objects is reasonably fast (warm caches). If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:
$ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
$ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
commit-graph write --reachable
$ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
$ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
rev-list HEAD -- "$path"
then checking all leading path component can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).
This adjusts the tracing values in t4216-log-bloom.sh, which provides a
concrete way to notice the improvement.
Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 21:27:30 +08:00
|
|
|
int result = 1, j;
|
2020-04-07 00:59:52 +08:00
|
|
|
|
|
|
|
if (!revs->repo->objects->commit_graph)
|
|
|
|
return -1;
|
|
|
|
|
2020-06-17 17:14:10 +08:00
|
|
|
if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
|
2020-04-07 00:59:52 +08:00
|
|
|
return -1;
|
|
|
|
|
bloom: split 'get_bloom_filter()' in two
'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another parameter to this method, which would force all but one
caller to specify an extra 'NULL' parameter at the end.
Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.
This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).
While we're at it, instrument the new 'get_or_compute_bloom_filter()'
with counters in the 'write_commit_graph_context' struct which store
the number of filters that we did and didn't compute, as well as filters
that were truncated.
It would be nice to drop the 'compute_if_not_present' flag entirely,
since all remaining callers of 'get_or_compute_bloom_filter' pass it as
'1', but this will change in a future patch and hence cannot be removed.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 02:07:32 +08:00
|
|
|
filter = get_bloom_filter(revs->repo, commit);
|
2020-04-07 00:59:52 +08:00
|
|
|
|
|
|
|
if (!filter) {
|
2020-04-07 00:59:53 +08:00
|
|
|
count_bloom_filter_not_present++;
|
2020-04-07 00:59:52 +08:00
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
commit-graph: check all leading directories in changed path Bloom filters
The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.
So when checking modified path Bloom filters looking for commits
modifying a path with multiple path components, then check not only
the full path in the Bloom filters, but all its leading directories as
well. Take care to check these paths in "deepest first" order,
because it's the full path that is least likely to be modified, and
the Bloom filter queries can short circuit sooner.
This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks. The table below compares the average
false positive rate and runtime of
git rev-list HEAD -- "$path"
before and after this change for 5000+ randomly* selected paths from
each repository:
Average false Average Average
positive rate runtime runtime
before after before after difference
------------------------------------------------------------------
git 3.220% 0.7853% 0.0558s 0.0387s -30.6%
linux 2.453% 0.0296% 0.1046s 0.0766s -26.8%
tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2%
*Path selection was done with the following pipeline:
git ls-tree -r --name-only HEAD | sort -R | head -n 5000
The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here. However, all these timings depend on that accessing
tree objects is reasonably fast (warm caches). If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:
$ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
$ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
commit-graph write --reachable
$ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
$ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
rev-list HEAD -- "$path"
then checking all leading path component can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).
This adjusts the tracing values in t4216-log-bloom.sh, which provides a
concrete way to notice the improvement.
Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 21:27:30 +08:00
|
|
|
for (j = 0; result && j < revs->bloom_keys_nr; j++) {
|
|
|
|
result = bloom_filter_contains(filter,
|
|
|
|
&revs->bloom_keys[j],
|
|
|
|
revs->bloom_filter_settings);
|
2020-04-07 00:59:52 +08:00
|
|
|
}
|
|
|
|
|
2020-04-07 00:59:53 +08:00
|
|
|
if (result)
|
|
|
|
count_bloom_filter_maybe++;
|
|
|
|
else
|
|
|
|
count_bloom_filter_definitely_not++;
|
|
|
|
|
2020-04-07 00:59:52 +08:00
|
|
|
return result;
|
|
|
|
}
|
|
|
|
|
2013-05-25 17:08:07 +08:00
|
|
|
static int rev_compare_tree(struct rev_info *revs,
|
2020-04-07 00:59:52 +08:00
|
|
|
struct commit *parent, struct commit *commit, int nth_parent)
|
2006-03-01 03:24:00 +08:00
|
|
|
{
|
2023-03-28 21:58:48 +08:00
|
|
|
struct tree *t1 = repo_get_commit_tree(the_repository, parent);
|
|
|
|
struct tree *t2 = repo_get_commit_tree(the_repository, commit);
|
2020-04-07 00:59:52 +08:00
|
|
|
int bloom_ret = 1;
|
2008-11-04 02:45:41 +08:00
|
|
|
|
2006-03-01 03:24:00 +08:00
|
|
|
if (!t1)
|
2006-03-10 17:21:39 +08:00
|
|
|
return REV_TREE_NEW;
|
2009-06-03 09:34:01 +08:00
|
|
|
if (!t2)
|
|
|
|
return REV_TREE_OLD;
|
2008-11-04 03:25:46 +08:00
|
|
|
|
|
|
|
if (revs->simplify_by_decoration) {
|
|
|
|
/*
|
|
|
|
* If we are simplifying by decoration, then the commit
|
|
|
|
* is worth showing if it has a tag pointing at it.
|
|
|
|
*/
|
2014-08-26 18:23:54 +08:00
|
|
|
if (get_name_decoration(&commit->object))
|
2008-11-04 03:25:46 +08:00
|
|
|
return REV_TREE_DIFFERENT;
|
|
|
|
/*
|
|
|
|
* A commit that is not pointed by a tag is uninteresting
|
|
|
|
* if we are not limited by path. This means that you will
|
|
|
|
* see the usual "commits that touch the paths" plus any
|
|
|
|
* tagged commit by specifying both --simplify-by-decoration
|
|
|
|
* and pathspec.
|
|
|
|
*/
|
2010-12-17 20:43:06 +08:00
|
|
|
if (!revs->prune_data.nr)
|
2008-11-04 03:25:46 +08:00
|
|
|
return REV_TREE_SAME;
|
|
|
|
}
|
2009-06-03 09:34:01 +08:00
|
|
|
|
commit-graph: check all leading directories in changed path Bloom filters
The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.
So when checking modified path Bloom filters looking for commits
modifying a path with multiple path components, then check not only
the full path in the Bloom filters, but all its leading directories as
well. Take care to check these paths in "deepest first" order,
because it's the full path that is least likely to be modified, and
the Bloom filter queries can short circuit sooner.
This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks. The table below compares the average
false positive rate and runtime of
git rev-list HEAD -- "$path"
before and after this change for 5000+ randomly* selected paths from
each repository:
Average false Average Average
positive rate runtime runtime
before after before after difference
------------------------------------------------------------------
git 3.220% 0.7853% 0.0558s 0.0387s -30.6%
linux 2.453% 0.0296% 0.1046s 0.0766s -26.8%
tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2%
*Path selection was done with the following pipeline:
git ls-tree -r --name-only HEAD | sort -R | head -n 5000
The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here. However, all these timings depend on that accessing
tree objects is reasonably fast (warm caches). If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:
$ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
$ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
commit-graph write --reachable
$ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
$ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
rev-list HEAD -- "$path"
then checking all leading path component can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).
This adjusts the tracing values in t4216-log-bloom.sh, which provides a
concrete way to notice the improvement.
Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 21:27:30 +08:00
|
|
|
if (revs->bloom_keys_nr && !nth_parent) {
|
2020-04-07 00:59:52 +08:00
|
|
|
bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
|
|
|
|
|
|
|
|
if (bloom_ret == 0)
|
|
|
|
return REV_TREE_SAME;
|
|
|
|
}
|
|
|
|
|
2006-03-10 17:21:39 +08:00
|
|
|
tree_difference = REV_TREE_SAME;
|
2017-11-01 02:19:11 +08:00
|
|
|
revs->pruning.flags.has_changes = 0;
|
diff.h: drop diff_tree_oid() & friends' return value
ll_diff_tree_oid() has only ever returned 0 [1], so it's return value
is basically useless. It's only caller diff_tree_oid() has only ever
returned the return value of ll_diff_tree_oid() as-is [2], so its
return value is just as useless. Most of diff_tree_oid()'s callers
simply ignore its return value, except:
- diff_root_tree_oid() is a thin wrapper around diff_tree_oid() and
returns with its return value, but all of diff_root_tree_oid()'s
callers ignore its return value.
- rev_compare_tree() and rev_same_tree_as_empty() do look at the
return value in a condition, but, since the return value is always
0, the former's < 0 condition is never fulfilled, while the
latter's >= 0 condition is always fulfilled.
So let's drop the return value of ll_diff_tree_oid(), diff_tree_oid()
and diff_root_tree_oid(), and drop those conditions from
rev_compare_tree() and rev_same_tree_as_empty() as well.
[1] ll_diff_tree_oid() and its ancestors have been returning only 0
ever since it was introduced as diff_tree() in 9174026cfe (Add
"diff-tree" program to show which files have changed between two
trees., 2005-04-09).
[2] diff_tree_oid() traces back to diff-tree.c:main() in 9174026cfe as
well.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-05 21:00:27 +08:00
|
|
|
diff_tree_oid(&t1->object.oid, &t2->object.oid, "", &revs->pruning);
|
2020-04-07 00:59:52 +08:00
|
|
|
|
2020-04-07 00:59:53 +08:00
|
|
|
if (!nth_parent)
|
|
|
|
if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
|
|
|
|
count_bloom_filter_false_positive++;
|
|
|
|
|
2006-03-01 03:24:00 +08:00
|
|
|
return tree_difference;
|
|
|
|
}
|
|
|
|
|
2024-06-26 01:39:13 +08:00
|
|
|
static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit,
|
|
|
|
int nth_parent)
|
2006-03-01 03:24:00 +08:00
|
|
|
{
|
2023-03-28 21:58:48 +08:00
|
|
|
struct tree *t1 = repo_get_commit_tree(the_repository, commit);
|
2024-06-26 01:39:13 +08:00
|
|
|
int bloom_ret = -1;
|
2006-03-01 03:24:00 +08:00
|
|
|
|
|
|
|
if (!t1)
|
|
|
|
return 0;
|
|
|
|
|
2024-06-26 01:39:13 +08:00
|
|
|
if (!nth_parent && revs->bloom_keys_nr) {
|
|
|
|
bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
|
|
|
|
if (!bloom_ret)
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2007-03-15 04:12:18 +08:00
|
|
|
tree_difference = REV_TREE_SAME;
|
2017-11-01 02:19:11 +08:00
|
|
|
revs->pruning.flags.has_changes = 0;
|
diff.h: drop diff_tree_oid() & friends' return value
ll_diff_tree_oid() has only ever returned 0 [1], so it's return value
is basically useless. It's only caller diff_tree_oid() has only ever
returned the return value of ll_diff_tree_oid() as-is [2], so its
return value is just as useless. Most of diff_tree_oid()'s callers
simply ignore its return value, except:
- diff_root_tree_oid() is a thin wrapper around diff_tree_oid() and
returns with its return value, but all of diff_root_tree_oid()'s
callers ignore its return value.
- rev_compare_tree() and rev_same_tree_as_empty() do look at the
return value in a condition, but, since the return value is always
0, the former's < 0 condition is never fulfilled, while the
latter's >= 0 condition is always fulfilled.
So let's drop the return value of ll_diff_tree_oid(), diff_tree_oid()
and diff_root_tree_oid(), and drop those conditions from
rev_compare_tree() and rev_same_tree_as_empty() as well.
[1] ll_diff_tree_oid() and its ancestors have been returning only 0
ever since it was introduced as diff_tree() in 9174026cfe (Add
"diff-tree" program to show which files have changed between two
trees., 2005-04-09).
[2] diff_tree_oid() traces back to diff-tree.c:main() in 9174026cfe as
well.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-05 21:00:27 +08:00
|
|
|
diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning);
|
2006-03-01 03:24:00 +08:00
|
|
|
|
2024-06-26 01:39:13 +08:00
|
|
|
if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
|
|
|
|
count_bloom_filter_false_positive++;
|
|
|
|
|
diff.h: drop diff_tree_oid() & friends' return value
ll_diff_tree_oid() has only ever returned 0 [1], so it's return value
is basically useless. It's only caller diff_tree_oid() has only ever
returned the return value of ll_diff_tree_oid() as-is [2], so its
return value is just as useless. Most of diff_tree_oid()'s callers
simply ignore its return value, except:
- diff_root_tree_oid() is a thin wrapper around diff_tree_oid() and
returns with its return value, but all of diff_root_tree_oid()'s
callers ignore its return value.
- rev_compare_tree() and rev_same_tree_as_empty() do look at the
return value in a condition, but, since the return value is always
0, the former's < 0 condition is never fulfilled, while the
latter's >= 0 condition is always fulfilled.
So let's drop the return value of ll_diff_tree_oid(), diff_tree_oid()
and diff_root_tree_oid(), and drop those conditions from
rev_compare_tree() and rev_same_tree_as_empty() as well.
[1] ll_diff_tree_oid() and its ancestors have been returning only 0
ever since it was introduced as diff_tree() in 9174026cfe (Add
"diff-tree" program to show which files have changed between two
trees., 2005-04-09).
[2] diff_tree_oid() traces back to diff-tree.c:main() in 9174026cfe as
well.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-05 21:00:27 +08:00
|
|
|
return tree_difference == REV_TREE_SAME;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
|
|
|
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
struct treesame_state {
|
|
|
|
unsigned int nparents;
|
|
|
|
unsigned char treesame[FLEX_ARRAY];
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct treesame_state *initialise_treesame(struct rev_info *revs, struct commit *commit)
|
|
|
|
{
|
|
|
|
unsigned n = commit_list_count(commit->parents);
|
2016-02-23 06:44:35 +08:00
|
|
|
struct treesame_state *st = xcalloc(1, st_add(sizeof(*st), n));
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
st->nparents = n;
|
|
|
|
add_decoration(&revs->treesame, &commit->object, st);
|
|
|
|
return st;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Must be called immediately after removing the nth_parent from a commit's
|
|
|
|
* parent list, if we are maintaining the per-parent treesame[] decoration.
|
|
|
|
* This does not recalculate the master TREESAME flag - update_treesame()
|
|
|
|
* should be called to update it after a sequence of treesame[] modifications
|
|
|
|
* that may have affected it.
|
|
|
|
*/
|
|
|
|
static int compact_treesame(struct rev_info *revs, struct commit *commit, unsigned nth_parent)
|
|
|
|
{
|
|
|
|
struct treesame_state *st;
|
|
|
|
int old_same;
|
|
|
|
|
|
|
|
if (!commit->parents) {
|
|
|
|
/*
|
|
|
|
* Have just removed the only parent from a non-merge.
|
|
|
|
* Different handling, as we lack decoration.
|
|
|
|
*/
|
|
|
|
if (nth_parent != 0)
|
|
|
|
die("compact_treesame %u", nth_parent);
|
|
|
|
old_same = !!(commit->object.flags & TREESAME);
|
2024-06-26 01:39:13 +08:00
|
|
|
if (rev_same_tree_as_empty(revs, commit, nth_parent))
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
commit->object.flags |= TREESAME;
|
|
|
|
else
|
|
|
|
commit->object.flags &= ~TREESAME;
|
|
|
|
return old_same;
|
|
|
|
}
|
|
|
|
|
|
|
|
st = lookup_decoration(&revs->treesame, &commit->object);
|
|
|
|
if (!st || nth_parent >= st->nparents)
|
|
|
|
die("compact_treesame %u", nth_parent);
|
|
|
|
|
|
|
|
old_same = st->treesame[nth_parent];
|
|
|
|
memmove(st->treesame + nth_parent,
|
|
|
|
st->treesame + nth_parent + 1,
|
|
|
|
st->nparents - nth_parent - 1);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we've just become a non-merge commit, update TREESAME
|
|
|
|
* immediately, and remove the no-longer-needed decoration.
|
|
|
|
* If still a merge, defer update until update_treesame().
|
|
|
|
*/
|
|
|
|
if (--st->nparents == 1) {
|
|
|
|
if (commit->parents->next)
|
|
|
|
die("compact_treesame parents mismatch");
|
|
|
|
if (st->treesame[0] && revs->dense)
|
|
|
|
commit->object.flags |= TREESAME;
|
|
|
|
else
|
|
|
|
commit->object.flags &= ~TREESAME;
|
|
|
|
free(add_decoration(&revs->treesame, &commit->object, NULL));
|
|
|
|
}
|
|
|
|
|
|
|
|
return old_same;
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned update_treesame(struct rev_info *revs, struct commit *commit)
|
|
|
|
{
|
|
|
|
if (commit->parents && commit->parents->next) {
|
|
|
|
unsigned n;
|
|
|
|
struct treesame_state *st;
|
2013-05-16 23:32:39 +08:00
|
|
|
struct commit_list *p;
|
|
|
|
unsigned relevant_parents;
|
|
|
|
unsigned relevant_change, irrelevant_change;
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
|
|
|
|
st = lookup_decoration(&revs->treesame, &commit->object);
|
|
|
|
if (!st)
|
2015-11-10 10:22:28 +08:00
|
|
|
die("update_treesame %s", oid_to_hex(&commit->object.oid));
|
2013-05-16 23:32:39 +08:00
|
|
|
relevant_parents = 0;
|
|
|
|
relevant_change = irrelevant_change = 0;
|
|
|
|
for (p = commit->parents, n = 0; p; n++, p = p->next) {
|
|
|
|
if (relevant_commit(p->item)) {
|
|
|
|
relevant_change |= !st->treesame[n];
|
|
|
|
relevant_parents++;
|
|
|
|
} else
|
|
|
|
irrelevant_change |= !st->treesame[n];
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
}
|
2013-05-16 23:32:39 +08:00
|
|
|
if (relevant_parents ? relevant_change : irrelevant_change)
|
|
|
|
commit->object.flags &= ~TREESAME;
|
|
|
|
else
|
|
|
|
commit->object.flags |= TREESAME;
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return commit->object.flags & TREESAME;
|
|
|
|
}
|
|
|
|
|
2013-05-16 23:32:39 +08:00
|
|
|
static inline int limiting_can_increase_treesame(const struct rev_info *revs)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* TREESAME is irrelevant unless prune && dense;
|
|
|
|
* if simplify_history is set, we can't have a mixture of TREESAME and
|
|
|
|
* !TREESAME INTERESTING parents (and we don't have treesame[]
|
|
|
|
* decoration anyway);
|
|
|
|
* if first_parent_only is set, then the TREESAME flag is locked
|
|
|
|
* against the first parent (and again we lack treesame[] decoration).
|
|
|
|
*/
|
|
|
|
return revs->prune && revs->dense &&
|
|
|
|
!revs->simplify_history &&
|
|
|
|
!revs->first_parent_only;
|
|
|
|
}
|
|
|
|
|
2006-03-01 03:24:00 +08:00
|
|
|
static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
|
|
|
|
{
|
|
|
|
struct commit_list **pp, *parent;
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
struct treesame_state *ts = NULL;
|
2013-05-16 23:32:39 +08:00
|
|
|
int relevant_change = 0, irrelevant_change = 0;
|
|
|
|
int relevant_parents, nth_parent;
|
2006-03-01 03:24:00 +08:00
|
|
|
|
2007-11-06 05:22:34 +08:00
|
|
|
/*
|
|
|
|
* If we don't do pruning, everything is interesting
|
|
|
|
*/
|
2007-11-13 15:16:08 +08:00
|
|
|
if (!revs->prune)
|
2007-11-06 05:22:34 +08:00
|
|
|
return;
|
|
|
|
|
2023-03-28 21:58:48 +08:00
|
|
|
if (!repo_get_commit_tree(the_repository, commit))
|
2006-03-01 03:24:00 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
if (!commit->parents) {
|
2024-06-26 01:39:13 +08:00
|
|
|
/*
|
|
|
|
* Pretend as if we are comparing ourselves to the
|
|
|
|
* (non-existent) first parent of this commit object. Even
|
|
|
|
* though no such parent exists, its changed-path Bloom filter
|
|
|
|
* (if one exists) is relative to the empty tree, using Bloom
|
|
|
|
* filters is allowed here.
|
|
|
|
*/
|
|
|
|
if (rev_same_tree_as_empty(revs, commit, 0))
|
2007-11-13 15:16:08 +08:00
|
|
|
commit->object.flags |= TREESAME;
|
2006-03-01 03:24:00 +08:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2007-11-06 05:22:34 +08:00
|
|
|
/*
|
|
|
|
* Normal non-merge commit? If we don't want to make the
|
|
|
|
* history dense, we consider it always to be a change..
|
|
|
|
*/
|
2007-11-13 15:16:08 +08:00
|
|
|
if (!revs->dense && !commit->parents->next)
|
2007-11-06 05:22:34 +08:00
|
|
|
return;
|
|
|
|
|
2013-05-16 23:32:39 +08:00
|
|
|
for (pp = &commit->parents, nth_parent = 0, relevant_parents = 0;
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
(parent = *pp) != NULL;
|
|
|
|
pp = &parent->next, nth_parent++) {
|
2006-03-01 03:24:00 +08:00
|
|
|
struct commit *p = parent->item;
|
2013-05-16 23:32:39 +08:00
|
|
|
if (relevant_commit(p))
|
|
|
|
relevant_parents++;
|
2006-03-01 03:24:00 +08:00
|
|
|
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
if (nth_parent == 1) {
|
|
|
|
/*
|
|
|
|
* This our second loop iteration - so we now know
|
|
|
|
* we're dealing with a merge.
|
|
|
|
*
|
|
|
|
* Do not compare with later parents when we care only about
|
|
|
|
* the first parent chain, in order to avoid derailing the
|
|
|
|
* traversal to follow a side branch that brought everything
|
|
|
|
* in the path we are limited to by the pathspec.
|
|
|
|
*/
|
|
|
|
if (revs->first_parent_only)
|
|
|
|
break;
|
|
|
|
/*
|
|
|
|
* If this will remain a potentially-simplifiable
|
|
|
|
* merge, remember per-parent treesame if needed.
|
|
|
|
* Initialise the array with the comparison from our
|
|
|
|
* first iteration.
|
|
|
|
*/
|
|
|
|
if (revs->treesame.name &&
|
|
|
|
!revs->simplify_history &&
|
|
|
|
!(commit->object.flags & UNINTERESTING)) {
|
|
|
|
ts = initialise_treesame(revs, commit);
|
2013-05-16 23:32:39 +08:00
|
|
|
if (!(irrelevant_change || relevant_change))
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
ts->treesame[0] = 1;
|
|
|
|
}
|
|
|
|
}
|
2020-06-24 04:56:58 +08:00
|
|
|
if (repo_parse_commit(revs->repo, p) < 0)
|
2007-05-05 05:54:57 +08:00
|
|
|
die("cannot simplify commit %s (because of %s)",
|
2015-11-10 10:22:28 +08:00
|
|
|
oid_to_hex(&commit->object.oid),
|
|
|
|
oid_to_hex(&p->object.oid));
|
2020-04-07 00:59:52 +08:00
|
|
|
switch (rev_compare_tree(revs, p, commit, nth_parent)) {
|
2006-03-10 17:21:39 +08:00
|
|
|
case REV_TREE_SAME:
|
2013-05-16 23:32:41 +08:00
|
|
|
if (!revs->simplify_history || !relevant_commit(p)) {
|
2006-03-11 13:59:37 +08:00
|
|
|
/* Even if a merge with an uninteresting
|
|
|
|
* side branch brought the entire change
|
|
|
|
* we are interested in, we do not want
|
|
|
|
* to lose the other branches of this
|
|
|
|
* merge, so we just keep going.
|
|
|
|
*/
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
if (ts)
|
|
|
|
ts->treesame[nth_parent] = 1;
|
2006-03-11 13:59:37 +08:00
|
|
|
continue;
|
|
|
|
}
|
2024-09-26 19:47:05 +08:00
|
|
|
|
|
|
|
free_commit_list(parent->next);
|
2006-03-01 03:24:00 +08:00
|
|
|
parent->next = NULL;
|
2024-09-26 19:47:05 +08:00
|
|
|
while (commit->parents != parent)
|
|
|
|
pop_commit(&commit->parents);
|
2006-03-01 03:24:00 +08:00
|
|
|
commit->parents = parent;
|
revision: --show-pulls adds helpful merges
The default file history simplification of "git log -- <path>" or
"git rev-list -- <path>" focuses on providing the smallest set of
commits that first contributed a change. The revision walk greatly
restricts the set of walked commits by visiting only the first
TREESAME parent of a merge commit, when one exists. This means
that portions of the commit-graph are not walked, which can be a
performance benefit, but can also "hide" commits that added changes
but were ignored by a merge resolution.
The --full-history option modifies this by walking all commits and
reporting a merge commit as "interesting" if it has _any_ parent
that is not TREESAME. This tends to be an over-representation of
important commits, especially in an environment where most merge
commits are created by pull request completion.
Suppose we have a commit A and we create a commit B on top that
changes our file. When we merge the pull request, we create a merge
commit M. If no one else changed the file in the first-parent
history between M and A, then M will not be TREESAME to its first
parent, but will be TREESAME to B. Thus, the simplified history
will be "B". However, M will appear in the --full-history mode.
However, suppose that a number of topics T1, T2, ..., Tn were
created based on commits C1, C2, ..., Cn between A and M as
follows:
A----C1----C2--- ... ---Cn----M------P1---P2--- ... ---Pn
\ \ \ \ / / / /
\ \__.. \ \/ ..__T1 / Tn
\ \__.. /\ ..__T2 /
\_____________________B \____________________/
If the commits T1, T2, ... Tn did not change the file, then all of
P1 through Pn will be TREESAME to their first parent, but not
TREESAME to their second. This means that all of those merge commits
appear in the --full-history view, with edges that immediately
collapse into the lower history without introducing interesting
single-parent commits.
The --simplify-merges option was introduced to remove these extra
merge commits. By noticing that the rewritten parents are reachable
from their first parents, those edges can be simplified away. Finally,
the commits now look like single-parent commits that are TREESAME to
their "only" parent. Thus, they are removed and this issue does not
cause issues anymore. However, this also ends up removing the commit
M from the history view! Even worse, the --simplify-merges option
requires walking the entire history before returning a single result.
Many Git users are using Git alongside a Git service that provides
code storage alongside a code review tool commonly called "Pull
Requests" or "Merge Requests" against a target branch. When these
requests are accepted and merged, they typically create a merge
commit whose first parent is the previous branch tip and the second
parent is the tip of the topic branch used for the request. This
presents a valuable order to the parents, but also makes that merge
commit slightly special. Users may want to see not only which
commits changed a file, but which pull requests merged those commits
into their branch. In the previous example, this would mean the
users want to see the merge commit "M" in addition to the single-
parent commit "C".
Users are even more likely to want these merge commits when they
use pull requests to merge into a feature branch before merging that
feature branch into their trunk.
In some sense, users are asking for the "first" merge commit to
bring in the change to their branch. As long as the parent order is
consistent, this can be handled with the following rule:
Include a merge commit if it is not TREESAME to its first
parent, but is TREESAME to a later parent.
These merges look like the merge commits that would result from
running "git pull <topic>" on a main branch. Thus, the option to
show these commits is called "--show-pulls". This has the added
benefit of showing the commits created by closing a pull request or
merge request on any of the Git hosting and code review platforms.
To test these options, extend the standard test example to include
a merge commit that is not TREESAME to its first parent. It is
surprising that that option was not already in the example, as it
is instructive.
In particular, this extension demonstrates a common issue with file
history simplification. When a user resolves a merge conflict using
"-Xours" or otherwise ignoring one side of the conflict, they create
a TREESAME edge that probably should not be TREESAME. This leads
users to become frustrated and complain that "my change disappeared!"
In my experience, showing them history with --full-history and
--simplify-merges quickly reveals the problematic merge. As mentioned,
this option is expensive to compute. The --show-pulls option
_might_ show the merge commit (usually titled "resolving conflicts")
more quickly. Of course, this depends on the user having the correct
parent order, which is backwards when using "git pull master" from a
topic branch.
There are some special considerations when combining the --show-pulls
option with --simplify-merges. This requires adding a new PULL_MERGE
object flag to store the information from the initial TREESAME
comparisons. This helps avoid dropping those commits in later filters.
This is covered by a test, including how the parents can be simplified.
Since "struct object" has already ruined its 32-bit alignment by using
33 bits across parsed, type, and flags member, let's not make it worse.
PULL_MERGE is used in revision.c with the same value (1u<<15) as
REACHABLE in commit-graph.c. The REACHABLE flag is only used when
writing a commit-graph file, and a revision walk using --show-pulls
does not happen in the same process. Care must be taken in the future
to ensure this remains the case.
Update Documentation/rev-list-options.txt with significant details
around this option. This requires updating the example in the
History Simplification section to demonstrate some of the problems
with TREESAME second parents.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-10 20:19:43 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* A merge commit is a "diversion" if it is not
|
|
|
|
* TREESAME to its first parent but is TREESAME
|
|
|
|
* to a later parent. In the simplified history,
|
|
|
|
* we "divert" the history walk to the later
|
|
|
|
* parent. These commits are shown when "show_pulls"
|
|
|
|
* is enabled, so do not mark the object as
|
|
|
|
* TREESAME here.
|
|
|
|
*/
|
|
|
|
if (!revs->show_pulls || !nth_parent)
|
|
|
|
commit->object.flags |= TREESAME;
|
|
|
|
|
2006-03-01 03:24:00 +08:00
|
|
|
return;
|
|
|
|
|
2006-03-10 17:21:39 +08:00
|
|
|
case REV_TREE_NEW:
|
|
|
|
if (revs->remove_empty_trees &&
|
2024-06-26 01:39:13 +08:00
|
|
|
rev_same_tree_as_empty(revs, p, nth_parent)) {
|
2006-03-13 05:39:31 +08:00
|
|
|
/* We are adding all the specified
|
|
|
|
* paths from this parent, so the
|
|
|
|
* history beyond this parent is not
|
|
|
|
* interesting. Remove its parents
|
|
|
|
* (they are grandparents for us).
|
|
|
|
* IOW, we pretend this parent is a
|
|
|
|
* "root" commit.
|
2006-03-13 05:39:31 +08:00
|
|
|
*/
|
2020-06-24 04:56:58 +08:00
|
|
|
if (repo_parse_commit(revs->repo, p) < 0)
|
2007-05-05 05:54:57 +08:00
|
|
|
die("cannot simplify commit %s (invalid %s)",
|
2015-11-10 10:22:28 +08:00
|
|
|
oid_to_hex(&commit->object.oid),
|
|
|
|
oid_to_hex(&p->object.oid));
|
2024-09-26 19:47:05 +08:00
|
|
|
free_commit_list(p->parents);
|
2006-03-13 05:39:31 +08:00
|
|
|
p->parents = NULL;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
|
|
|
/* fallthrough */
|
2009-06-03 09:34:01 +08:00
|
|
|
case REV_TREE_OLD:
|
2006-03-10 17:21:39 +08:00
|
|
|
case REV_TREE_DIFFERENT:
|
2013-05-16 23:32:39 +08:00
|
|
|
if (relevant_commit(p))
|
|
|
|
relevant_change = 1;
|
|
|
|
else
|
|
|
|
irrelevant_change = 1;
|
revision: --show-pulls adds helpful merges
The default file history simplification of "git log -- <path>" or
"git rev-list -- <path>" focuses on providing the smallest set of
commits that first contributed a change. The revision walk greatly
restricts the set of walked commits by visiting only the first
TREESAME parent of a merge commit, when one exists. This means
that portions of the commit-graph are not walked, which can be a
performance benefit, but can also "hide" commits that added changes
but were ignored by a merge resolution.
The --full-history option modifies this by walking all commits and
reporting a merge commit as "interesting" if it has _any_ parent
that is not TREESAME. This tends to be an over-representation of
important commits, especially in an environment where most merge
commits are created by pull request completion.
Suppose we have a commit A and we create a commit B on top that
changes our file. When we merge the pull request, we create a merge
commit M. If no one else changed the file in the first-parent
history between M and A, then M will not be TREESAME to its first
parent, but will be TREESAME to B. Thus, the simplified history
will be "B". However, M will appear in the --full-history mode.
However, suppose that a number of topics T1, T2, ..., Tn were
created based on commits C1, C2, ..., Cn between A and M as
follows:
A----C1----C2--- ... ---Cn----M------P1---P2--- ... ---Pn
\ \ \ \ / / / /
\ \__.. \ \/ ..__T1 / Tn
\ \__.. /\ ..__T2 /
\_____________________B \____________________/
If the commits T1, T2, ... Tn did not change the file, then all of
P1 through Pn will be TREESAME to their first parent, but not
TREESAME to their second. This means that all of those merge commits
appear in the --full-history view, with edges that immediately
collapse into the lower history without introducing interesting
single-parent commits.
The --simplify-merges option was introduced to remove these extra
merge commits. By noticing that the rewritten parents are reachable
from their first parents, those edges can be simplified away. Finally,
the commits now look like single-parent commits that are TREESAME to
their "only" parent. Thus, they are removed and this issue does not
cause issues anymore. However, this also ends up removing the commit
M from the history view! Even worse, the --simplify-merges option
requires walking the entire history before returning a single result.
Many Git users are using Git alongside a Git service that provides
code storage alongside a code review tool commonly called "Pull
Requests" or "Merge Requests" against a target branch. When these
requests are accepted and merged, they typically create a merge
commit whose first parent is the previous branch tip and the second
parent is the tip of the topic branch used for the request. This
presents a valuable order to the parents, but also makes that merge
commit slightly special. Users may want to see not only which
commits changed a file, but which pull requests merged those commits
into their branch. In the previous example, this would mean the
users want to see the merge commit "M" in addition to the single-
parent commit "C".
Users are even more likely to want these merge commits when they
use pull requests to merge into a feature branch before merging that
feature branch into their trunk.
In some sense, users are asking for the "first" merge commit to
bring in the change to their branch. As long as the parent order is
consistent, this can be handled with the following rule:
Include a merge commit if it is not TREESAME to its first
parent, but is TREESAME to a later parent.
These merges look like the merge commits that would result from
running "git pull <topic>" on a main branch. Thus, the option to
show these commits is called "--show-pulls". This has the added
benefit of showing the commits created by closing a pull request or
merge request on any of the Git hosting and code review platforms.
To test these options, extend the standard test example to include
a merge commit that is not TREESAME to its first parent. It is
surprising that that option was not already in the example, as it
is instructive.
In particular, this extension demonstrates a common issue with file
history simplification. When a user resolves a merge conflict using
"-Xours" or otherwise ignoring one side of the conflict, they create
a TREESAME edge that probably should not be TREESAME. This leads
users to become frustrated and complain that "my change disappeared!"
In my experience, showing them history with --full-history and
--simplify-merges quickly reveals the problematic merge. As mentioned,
this option is expensive to compute. The --show-pulls option
_might_ show the merge commit (usually titled "resolving conflicts")
more quickly. Of course, this depends on the user having the correct
parent order, which is backwards when using "git pull master" from a
topic branch.
There are some special considerations when combining the --show-pulls
option with --simplify-merges. This requires adding a new PULL_MERGE
object flag to store the information from the initial TREESAME
comparisons. This helps avoid dropping those commits in later filters.
This is covered by a test, including how the parents can be simplified.
Since "struct object" has already ruined its 32-bit alignment by using
33 bits across parsed, type, and flags member, let's not make it worse.
PULL_MERGE is used in revision.c with the same value (1u<<15) as
REACHABLE in commit-graph.c. The REACHABLE flag is only used when
writing a commit-graph file, and a revision walk using --show-pulls
does not happen in the same process. Care must be taken in the future
to ensure this remains the case.
Update Documentation/rev-list-options.txt with significant details
around this option. This requires updating the example in the
History Simplification section to demonstrate some of the problems
with TREESAME second parents.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-10 20:19:43 +08:00
|
|
|
|
|
|
|
if (!nth_parent)
|
|
|
|
commit->object.flags |= PULL_MERGE;
|
|
|
|
|
2006-03-01 03:24:00 +08:00
|
|
|
continue;
|
|
|
|
}
|
2015-11-10 10:22:28 +08:00
|
|
|
die("bad tree compare for commit %s", oid_to_hex(&commit->object.oid));
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
2013-05-16 23:32:39 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* TREESAME is straightforward for single-parent commits. For merge
|
|
|
|
* commits, it is most useful to define it so that "irrelevant"
|
|
|
|
* parents cannot make us !TREESAME - if we have any relevant
|
|
|
|
* parents, then we only consider TREESAMEness with respect to them,
|
|
|
|
* allowing irrelevant merges from uninteresting branches to be
|
|
|
|
* simplified away. Only if we have only irrelevant parents do we
|
|
|
|
* base TREESAME on them. Note that this logic is replicated in
|
|
|
|
* update_treesame, which should be kept in sync.
|
|
|
|
*/
|
|
|
|
if (relevant_parents ? !relevant_change : !irrelevant_change)
|
|
|
|
commit->object.flags |= TREESAME;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
|
|
|
|
2018-11-01 21:46:21 +08:00
|
|
|
static int process_parents(struct rev_info *revs, struct commit *commit,
|
revision: use a prio_queue to hold rewritten parents
This patch fixes a quadratic list insertion in rewrite_one() when
pathspec limiting is combined with --parents. What happens is something
like this:
1. We see that some commit X touches the path, so we try to rewrite
its parents.
2. rewrite_one() loops forever, rewriting parents, until it finds a
relevant parent (or hits the root and decides there are none). The
heavy lifting is done by process_parent(), which uses
try_to_simplify_commit() to drop parents.
3. process_parent() puts any intermediate parents into the
&revs->commits list, inserting by commit date as usual.
So if commit X is recent, and then there's a large chunk of history that
doesn't touch the path, we may add a lot of commits to &revs->commits.
And insertion by commit date is O(n) in the worst case, making the whole
thing quadratic.
We tried to deal with this long ago in fce87ae538 (Fix quadratic
performance in rewrite_one., 2008-07-12). In that scheme, we cache the
oldest commit in the list; if the new commit to be added is older, we
can start our linear traversal there. This often works well in practice
because parents are older than their descendants, and thus we tend to
add older and older commits as we traverse.
But this isn't guaranteed, and in fact there's a simple case where it is
not: merges. Imagine we look at the first parent of a merge and see a
very old commit (let's say 3 years old). And on the second parent, as we
go back 3 years in history, we might have many commits. That one
first-parent commit has polluted our oldest-commit cache; it will remain
the oldest while we traverse a huge chunk of history, during which we
have to fall back to the slow, linear method of adding to the list.
Naively, one might imagine that instead of caching the oldest commit,
we'd start at the last-added one. But that just makes some cases faster
while making others slower (and indeed, while it made a real-world test
case much faster, it does quite poorly in the perf test include here).
Fundamentally, these are just heuristics; our worst case is still
quadratic, and some cases will approach that.
Instead, let's use a data structure with better worst-case performance.
Swapping out revs->commits for something else would have repercussions
all over the code base, but we can take advantage of one fact: for the
rewrite_one() case, nobody actually needs to see those commits in
revs->commits until we've finished generating the whole list.
That leaves us with two obvious options:
1. We can generate the list _unordered_, which should be O(n), and
then sort it afterwards, which would be O(n log n) total. This is
"sort-after" below.
2. We can insert the commits into a separate data structure, like a
priority queue. This is "prio-queue" below.
I expected that sort-after would be the fastest (since it saves us the
extra step of copying the items into the linked list), but surprisingly
the prio-queue seems to be a bit faster.
Here are timings for the new p0001.6 for all three techniques across a
few repositories, as compared to master:
master cache-last sort-after prio-queue
--------------------------------------------------------------------------------------------
GIT_PERF_REPO=git.git
0.52(0.50+0.02) 0.53(0.51+0.02) +1.9% 0.37(0.33+0.03) -28.8% 0.37(0.32+0.04) -28.8%
GIT_PERF_REPO=linux.git
20.81(20.74+0.07) 20.31(20.24+0.07) -2.4% 0.94(0.86+0.07) -95.5% 0.91(0.82+0.09) -95.6%
GIT_PERF_REPO=llvm-project.git
83.67(83.57+0.09) 4.23(4.15+0.08) -94.9% 3.21(3.15+0.06) -96.2% 2.98(2.91+0.07) -96.4%
A few items to note:
- the cache-list tweak does improve the bad case for llvm-project.git
that started my digging into this problem. But it performs terribly
on linux.git, barely helping at all.
- the sort-after and prio-queue techniques work well. They approach
the timing for running without --parents at all, which is what you'd
expect (see below for more data).
- prio-queue just barely outperforms sort-after. As I said, I'm not
really sure why this is the case, but it is. You can see it even
more prominently in this real-world case on llvm-project.git:
git rev-list --parents 07ef786652e7 -- llvm/test/CodeGen/Generic/bswap.ll
where prio-queue routinely outperforms sort-after by about 7%. One
guess is that the prio-queue may just be more efficient because it
uses a compact array.
There are three new perf tests:
- "rev-list --parents" gives us a baseline for running with --parents.
This isn't sped up meaningfully here, because the bad case is
triggered only with simplification. But it's good to make sure we
don't screw it up (now, or in the future).
- "rev-list -- dummy" gives us a baseline for just traversing with
pathspec limiting. This gives a lower bound for the next test (and
it's also a good thing for us to be checking in general for
regressions, since we don't seem to have any existing tests).
- "rev-list --parents -- dummy" shows off the problem (and our fix)
Here are the timings for those three on llvm-project.git, before and
after the fix:
Test master prio-queue
------------------------------------------------------------------------------
0001.3: rev-list --parents 2.24(2.12+0.12) 2.22(2.11+0.11) -0.9%
0001.5: rev-list -- dummy 2.89(2.82+0.07) 2.92(2.89+0.03) +1.0%
0001.6: rev-list --parents -- dummy 83.67(83.57+0.09) 2.98(2.91+0.07) -96.4%
Changes in the first two are basically noise, and you can see we
approach our lower bound in the final one.
Note that we can't fully get rid of the list argument from
process_parents(). Other callers do have lists, and it would be hard to
convert them. They also don't seem to have this problem (probably
because they actually remove items from the list as they loop, meaning
it doesn't grow so large in the first place). So this basically just
drops the "cache_ptr" parameter (which was used only by the one caller
we're fixing here) and replaces it with a prio_queue. Callers are free
to use either data structure, depending on what they're prepared to
handle.
Reported-by: Björn Pettersson A <bjorn.a.pettersson@ericsson.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-04 09:41:09 +08:00
|
|
|
struct commit_list **list, struct prio_queue *queue)
|
2006-03-01 03:24:00 +08:00
|
|
|
{
|
|
|
|
struct commit_list *parent = commit->parents;
|
2022-08-19 12:28:10 +08:00
|
|
|
unsigned pass_flags;
|
2006-03-01 03:24:00 +08:00
|
|
|
|
Make "--parents" logs also be incremental
The parent rewriting feature caused us to create the whole history in one
go, and then simplify it later, because of how rewrite_parents() had been
written. However, with a little tweaking, it's perfectly possible to do
even that one incrementally.
Right now, this doesn't really much matter, because every user of
"--parents" will probably generally _also_ use "--topo-order", which will
cause the old non-incremental behaviour anyway. However, I'm hopeful that
we could make even the topological sort incremental, or at least
_partially_ so (for example, make it incremental up to the first merge).
In the meantime, this at least moves things in the right direction, and
removes a strange special case.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-09 08:05:58 +08:00
|
|
|
if (commit->object.flags & ADDED)
|
2007-05-05 05:54:57 +08:00
|
|
|
return 0;
|
2023-10-27 15:59:29 +08:00
|
|
|
if (revs->do_not_die_on_missing_objects &&
|
|
|
|
oidset_contains(&revs->missing_commits, &commit->object.oid))
|
|
|
|
return 0;
|
Make "--parents" logs also be incremental
The parent rewriting feature caused us to create the whole history in one
go, and then simplify it later, because of how rewrite_parents() had been
written. However, with a little tweaking, it's perfectly possible to do
even that one incrementally.
Right now, this doesn't really much matter, because every user of
"--parents" will probably generally _also_ use "--topo-order", which will
cause the old non-incremental behaviour anyway. However, I'm hopeful that
we could make even the topological sort incremental, or at least
_partially_ so (for example, make it incremental up to the first merge).
In the meantime, this at least moves things in the right direction, and
removes a strange special case.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-09 08:05:58 +08:00
|
|
|
commit->object.flags |= ADDED;
|
|
|
|
|
2013-10-25 02:01:41 +08:00
|
|
|
if (revs->include_check &&
|
|
|
|
!revs->include_check(commit, revs->include_check_data))
|
|
|
|
return 0;
|
|
|
|
|
2006-03-01 03:24:00 +08:00
|
|
|
/*
|
|
|
|
* If the commit is uninteresting, don't try to
|
|
|
|
* prune parents - we want the maximal uninteresting
|
|
|
|
* set.
|
|
|
|
*
|
|
|
|
* Normally we haven't parsed the parent
|
|
|
|
* yet, so we won't have a parent of a parent
|
|
|
|
* here. However, it may turn out that we've
|
|
|
|
* reached this commit some other way (where it
|
|
|
|
* wasn't uninteresting), in which case we need
|
|
|
|
* to mark its parents recursively too..
|
|
|
|
*/
|
|
|
|
if (commit->object.flags & UNINTERESTING) {
|
|
|
|
while (parent) {
|
|
|
|
struct commit *p = parent->item;
|
|
|
|
parent = parent->next;
|
2009-01-28 15:19:30 +08:00
|
|
|
if (p)
|
|
|
|
p->object.flags |= UNINTERESTING;
|
2020-06-24 04:56:58 +08:00
|
|
|
if (repo_parse_commit_gently(revs->repo, p, 1) < 0)
|
2009-01-28 15:19:30 +08:00
|
|
|
continue;
|
2006-03-01 03:24:00 +08:00
|
|
|
if (p->parents)
|
git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.
In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.
-A-----E-F-G--main
\ / /
B-C-D--topic
In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.
Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.
Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 05:39:41 +08:00
|
|
|
mark_parents_uninteresting(revs, p);
|
2006-03-01 03:24:00 +08:00
|
|
|
if (p->object.flags & SEEN)
|
|
|
|
continue;
|
2021-04-09 19:27:57 +08:00
|
|
|
p->object.flags |= (SEEN | NOT_USER_GIVEN);
|
2018-11-01 21:46:21 +08:00
|
|
|
if (list)
|
revision: use a prio_queue to hold rewritten parents
This patch fixes a quadratic list insertion in rewrite_one() when
pathspec limiting is combined with --parents. What happens is something
like this:
1. We see that some commit X touches the path, so we try to rewrite
its parents.
2. rewrite_one() loops forever, rewriting parents, until it finds a
relevant parent (or hits the root and decides there are none). The
heavy lifting is done by process_parent(), which uses
try_to_simplify_commit() to drop parents.
3. process_parent() puts any intermediate parents into the
&revs->commits list, inserting by commit date as usual.
So if commit X is recent, and then there's a large chunk of history that
doesn't touch the path, we may add a lot of commits to &revs->commits.
And insertion by commit date is O(n) in the worst case, making the whole
thing quadratic.
We tried to deal with this long ago in fce87ae538 (Fix quadratic
performance in rewrite_one., 2008-07-12). In that scheme, we cache the
oldest commit in the list; if the new commit to be added is older, we
can start our linear traversal there. This often works well in practice
because parents are older than their descendants, and thus we tend to
add older and older commits as we traverse.
But this isn't guaranteed, and in fact there's a simple case where it is
not: merges. Imagine we look at the first parent of a merge and see a
very old commit (let's say 3 years old). And on the second parent, as we
go back 3 years in history, we might have many commits. That one
first-parent commit has polluted our oldest-commit cache; it will remain
the oldest while we traverse a huge chunk of history, during which we
have to fall back to the slow, linear method of adding to the list.
Naively, one might imagine that instead of caching the oldest commit,
we'd start at the last-added one. But that just makes some cases faster
while making others slower (and indeed, while it made a real-world test
case much faster, it does quite poorly in the perf test include here).
Fundamentally, these are just heuristics; our worst case is still
quadratic, and some cases will approach that.
Instead, let's use a data structure with better worst-case performance.
Swapping out revs->commits for something else would have repercussions
all over the code base, but we can take advantage of one fact: for the
rewrite_one() case, nobody actually needs to see those commits in
revs->commits until we've finished generating the whole list.
That leaves us with two obvious options:
1. We can generate the list _unordered_, which should be O(n), and
then sort it afterwards, which would be O(n log n) total. This is
"sort-after" below.
2. We can insert the commits into a separate data structure, like a
priority queue. This is "prio-queue" below.
I expected that sort-after would be the fastest (since it saves us the
extra step of copying the items into the linked list), but surprisingly
the prio-queue seems to be a bit faster.
Here are timings for the new p0001.6 for all three techniques across a
few repositories, as compared to master:
master cache-last sort-after prio-queue
--------------------------------------------------------------------------------------------
GIT_PERF_REPO=git.git
0.52(0.50+0.02) 0.53(0.51+0.02) +1.9% 0.37(0.33+0.03) -28.8% 0.37(0.32+0.04) -28.8%
GIT_PERF_REPO=linux.git
20.81(20.74+0.07) 20.31(20.24+0.07) -2.4% 0.94(0.86+0.07) -95.5% 0.91(0.82+0.09) -95.6%
GIT_PERF_REPO=llvm-project.git
83.67(83.57+0.09) 4.23(4.15+0.08) -94.9% 3.21(3.15+0.06) -96.2% 2.98(2.91+0.07) -96.4%
A few items to note:
- the cache-list tweak does improve the bad case for llvm-project.git
that started my digging into this problem. But it performs terribly
on linux.git, barely helping at all.
- the sort-after and prio-queue techniques work well. They approach
the timing for running without --parents at all, which is what you'd
expect (see below for more data).
- prio-queue just barely outperforms sort-after. As I said, I'm not
really sure why this is the case, but it is. You can see it even
more prominently in this real-world case on llvm-project.git:
git rev-list --parents 07ef786652e7 -- llvm/test/CodeGen/Generic/bswap.ll
where prio-queue routinely outperforms sort-after by about 7%. One
guess is that the prio-queue may just be more efficient because it
uses a compact array.
There are three new perf tests:
- "rev-list --parents" gives us a baseline for running with --parents.
This isn't sped up meaningfully here, because the bad case is
triggered only with simplification. But it's good to make sure we
don't screw it up (now, or in the future).
- "rev-list -- dummy" gives us a baseline for just traversing with
pathspec limiting. This gives a lower bound for the next test (and
it's also a good thing for us to be checking in general for
regressions, since we don't seem to have any existing tests).
- "rev-list --parents -- dummy" shows off the problem (and our fix)
Here are the timings for those three on llvm-project.git, before and
after the fix:
Test master prio-queue
------------------------------------------------------------------------------
0001.3: rev-list --parents 2.24(2.12+0.12) 2.22(2.11+0.11) -0.9%
0001.5: rev-list -- dummy 2.89(2.82+0.07) 2.92(2.89+0.03) +1.0%
0001.6: rev-list --parents -- dummy 83.67(83.57+0.09) 2.98(2.91+0.07) -96.4%
Changes in the first two are basically noise, and you can see we
approach our lower bound in the final one.
Note that we can't fully get rid of the list argument from
process_parents(). Other callers do have lists, and it would be hard to
convert them. They also don't seem to have this problem (probably
because they actually remove items from the list as they loop, meaning
it doesn't grow so large in the first place). So this basically just
drops the "cache_ptr" parameter (which was used only by the one caller
we're fixing here) and replaces it with a prio_queue. Callers are free
to use either data structure, depending on what they're prepared to
handle.
Reported-by: Björn Pettersson A <bjorn.a.pettersson@ericsson.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-04 09:41:09 +08:00
|
|
|
commit_list_insert_by_date(p, list);
|
|
|
|
if (queue)
|
|
|
|
prio_queue_put(queue, p);
|
git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.
In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.
-A-----E-F-G--main
\ / /
B-C-D--topic
In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.
Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.
Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 05:39:41 +08:00
|
|
|
if (revs->exclude_first_parent_only)
|
|
|
|
break;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
2007-05-05 05:54:57 +08:00
|
|
|
return 0;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ok, the commit wasn't uninteresting. Try to
|
|
|
|
* simplify the commit history and find the parent
|
|
|
|
* that has no differences in the path set if one exists.
|
|
|
|
*/
|
2007-11-06 05:22:34 +08:00
|
|
|
try_to_simplify_commit(revs, commit);
|
2006-03-01 03:24:00 +08:00
|
|
|
|
2006-04-16 03:09:56 +08:00
|
|
|
if (revs->no_walk)
|
2007-05-05 05:54:57 +08:00
|
|
|
return 0;
|
2006-04-16 03:09:56 +08:00
|
|
|
|
2022-08-19 12:28:10 +08:00
|
|
|
pass_flags = (commit->object.flags & (SYMMETRIC_LEFT | ANCESTRY_PATH));
|
2007-03-13 16:57:22 +08:00
|
|
|
|
2008-04-28 01:32:46 +08:00
|
|
|
for (parent = commit->parents; parent; parent = parent->next) {
|
2006-03-01 03:24:00 +08:00
|
|
|
struct commit *p = parent->item;
|
2017-12-08 23:27:15 +08:00
|
|
|
int gently = revs->ignore_missing_links ||
|
2023-10-27 15:59:29 +08:00
|
|
|
revs->exclude_promisor_objects ||
|
|
|
|
revs->do_not_die_on_missing_objects;
|
2020-06-24 04:56:58 +08:00
|
|
|
if (repo_parse_commit_gently(revs->repo, p, gently) < 0) {
|
2017-12-08 23:27:15 +08:00
|
|
|
if (revs->exclude_promisor_objects &&
|
|
|
|
is_promisor_object(&p->object.oid)) {
|
|
|
|
if (revs->first_parent_only)
|
|
|
|
break;
|
|
|
|
continue;
|
|
|
|
}
|
2023-10-27 15:59:29 +08:00
|
|
|
|
|
|
|
if (revs->do_not_die_on_missing_objects)
|
|
|
|
oidset_insert(&revs->missing_commits, &p->object.oid);
|
|
|
|
else
|
|
|
|
return -1; /* corrupt repository */
|
2017-12-08 23:27:15 +08:00
|
|
|
}
|
2018-05-19 13:28:24 +08:00
|
|
|
if (revs->sources) {
|
|
|
|
char **slot = revision_sources_at(revs->sources, p);
|
|
|
|
|
|
|
|
if (!*slot)
|
|
|
|
*slot = *revision_sources_at(revs->sources, commit);
|
|
|
|
}
|
2022-08-19 12:28:10 +08:00
|
|
|
p->object.flags |= pass_flags;
|
2008-05-12 23:12:36 +08:00
|
|
|
if (!(p->object.flags & SEEN)) {
|
2021-04-09 19:27:57 +08:00
|
|
|
p->object.flags |= (SEEN | NOT_USER_GIVEN);
|
2018-11-01 21:46:21 +08:00
|
|
|
if (list)
|
revision: use a prio_queue to hold rewritten parents
This patch fixes a quadratic list insertion in rewrite_one() when
pathspec limiting is combined with --parents. What happens is something
like this:
1. We see that some commit X touches the path, so we try to rewrite
its parents.
2. rewrite_one() loops forever, rewriting parents, until it finds a
relevant parent (or hits the root and decides there are none). The
heavy lifting is done by process_parent(), which uses
try_to_simplify_commit() to drop parents.
3. process_parent() puts any intermediate parents into the
&revs->commits list, inserting by commit date as usual.
So if commit X is recent, and then there's a large chunk of history that
doesn't touch the path, we may add a lot of commits to &revs->commits.
And insertion by commit date is O(n) in the worst case, making the whole
thing quadratic.
We tried to deal with this long ago in fce87ae538 (Fix quadratic
performance in rewrite_one., 2008-07-12). In that scheme, we cache the
oldest commit in the list; if the new commit to be added is older, we
can start our linear traversal there. This often works well in practice
because parents are older than their descendants, and thus we tend to
add older and older commits as we traverse.
But this isn't guaranteed, and in fact there's a simple case where it is
not: merges. Imagine we look at the first parent of a merge and see a
very old commit (let's say 3 years old). And on the second parent, as we
go back 3 years in history, we might have many commits. That one
first-parent commit has polluted our oldest-commit cache; it will remain
the oldest while we traverse a huge chunk of history, during which we
have to fall back to the slow, linear method of adding to the list.
Naively, one might imagine that instead of caching the oldest commit,
we'd start at the last-added one. But that just makes some cases faster
while making others slower (and indeed, while it made a real-world test
case much faster, it does quite poorly in the perf test include here).
Fundamentally, these are just heuristics; our worst case is still
quadratic, and some cases will approach that.
Instead, let's use a data structure with better worst-case performance.
Swapping out revs->commits for something else would have repercussions
all over the code base, but we can take advantage of one fact: for the
rewrite_one() case, nobody actually needs to see those commits in
revs->commits until we've finished generating the whole list.
That leaves us with two obvious options:
1. We can generate the list _unordered_, which should be O(n), and
then sort it afterwards, which would be O(n log n) total. This is
"sort-after" below.
2. We can insert the commits into a separate data structure, like a
priority queue. This is "prio-queue" below.
I expected that sort-after would be the fastest (since it saves us the
extra step of copying the items into the linked list), but surprisingly
the prio-queue seems to be a bit faster.
Here are timings for the new p0001.6 for all three techniques across a
few repositories, as compared to master:
master cache-last sort-after prio-queue
--------------------------------------------------------------------------------------------
GIT_PERF_REPO=git.git
0.52(0.50+0.02) 0.53(0.51+0.02) +1.9% 0.37(0.33+0.03) -28.8% 0.37(0.32+0.04) -28.8%
GIT_PERF_REPO=linux.git
20.81(20.74+0.07) 20.31(20.24+0.07) -2.4% 0.94(0.86+0.07) -95.5% 0.91(0.82+0.09) -95.6%
GIT_PERF_REPO=llvm-project.git
83.67(83.57+0.09) 4.23(4.15+0.08) -94.9% 3.21(3.15+0.06) -96.2% 2.98(2.91+0.07) -96.4%
A few items to note:
- the cache-list tweak does improve the bad case for llvm-project.git
that started my digging into this problem. But it performs terribly
on linux.git, barely helping at all.
- the sort-after and prio-queue techniques work well. They approach
the timing for running without --parents at all, which is what you'd
expect (see below for more data).
- prio-queue just barely outperforms sort-after. As I said, I'm not
really sure why this is the case, but it is. You can see it even
more prominently in this real-world case on llvm-project.git:
git rev-list --parents 07ef786652e7 -- llvm/test/CodeGen/Generic/bswap.ll
where prio-queue routinely outperforms sort-after by about 7%. One
guess is that the prio-queue may just be more efficient because it
uses a compact array.
There are three new perf tests:
- "rev-list --parents" gives us a baseline for running with --parents.
This isn't sped up meaningfully here, because the bad case is
triggered only with simplification. But it's good to make sure we
don't screw it up (now, or in the future).
- "rev-list -- dummy" gives us a baseline for just traversing with
pathspec limiting. This gives a lower bound for the next test (and
it's also a good thing for us to be checking in general for
regressions, since we don't seem to have any existing tests).
- "rev-list --parents -- dummy" shows off the problem (and our fix)
Here are the timings for those three on llvm-project.git, before and
after the fix:
Test master prio-queue
------------------------------------------------------------------------------
0001.3: rev-list --parents 2.24(2.12+0.12) 2.22(2.11+0.11) -0.9%
0001.5: rev-list -- dummy 2.89(2.82+0.07) 2.92(2.89+0.03) +1.0%
0001.6: rev-list --parents -- dummy 83.67(83.57+0.09) 2.98(2.91+0.07) -96.4%
Changes in the first two are basically noise, and you can see we
approach our lower bound in the final one.
Note that we can't fully get rid of the list argument from
process_parents(). Other callers do have lists, and it would be hard to
convert them. They also don't seem to have this problem (probably
because they actually remove items from the list as they loop, meaning
it doesn't grow so large in the first place). So this basically just
drops the "cache_ptr" parameter (which was used only by the one caller
we're fixing here) and replaces it with a prio_queue. Callers are free
to use either data structure, depending on what they're prepared to
handle.
Reported-by: Björn Pettersson A <bjorn.a.pettersson@ericsson.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-04 09:41:09 +08:00
|
|
|
commit_list_insert_by_date(p, list);
|
|
|
|
if (queue)
|
|
|
|
prio_queue_put(queue, p);
|
2008-05-12 23:12:36 +08:00
|
|
|
}
|
2008-08-01 13:17:13 +08:00
|
|
|
if (revs->first_parent_only)
|
2008-04-28 01:32:46 +08:00
|
|
|
break;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
2007-05-05 05:54:57 +08:00
|
|
|
return 0;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
|
|
|
|
2007-07-10 21:50:49 +08:00
|
|
|
static void cherry_pick_list(struct commit_list *list, struct rev_info *revs)
|
2007-04-09 18:40:38 +08:00
|
|
|
{
|
|
|
|
struct commit_list *p;
|
|
|
|
int left_count = 0, right_count = 0;
|
|
|
|
int left_first;
|
|
|
|
struct patch_ids ids;
|
2011-03-07 20:31:40 +08:00
|
|
|
unsigned cherry_flag;
|
2007-04-09 18:40:38 +08:00
|
|
|
|
|
|
|
/* First count the commits on the left and on the right */
|
|
|
|
for (p = list; p; p = p->next) {
|
|
|
|
struct commit *commit = p->item;
|
|
|
|
unsigned flags = commit->object.flags;
|
|
|
|
if (flags & BOUNDARY)
|
|
|
|
;
|
|
|
|
else if (flags & SYMMETRIC_LEFT)
|
|
|
|
left_count++;
|
|
|
|
else
|
|
|
|
right_count++;
|
|
|
|
}
|
|
|
|
|
2010-02-20 19:42:04 +08:00
|
|
|
if (!left_count || !right_count)
|
|
|
|
return;
|
|
|
|
|
2007-04-09 18:40:38 +08:00
|
|
|
left_first = left_count < right_count;
|
2018-09-21 23:57:38 +08:00
|
|
|
init_patch_ids(revs->repo, &ids);
|
2010-12-15 23:02:38 +08:00
|
|
|
ids.diffopts.pathspec = revs->diffopt.pathspec;
|
2007-04-09 18:40:38 +08:00
|
|
|
|
|
|
|
/* Compute patch-ids for one side */
|
|
|
|
for (p = list; p; p = p->next) {
|
|
|
|
struct commit *commit = p->item;
|
|
|
|
unsigned flags = commit->object.flags;
|
|
|
|
|
|
|
|
if (flags & BOUNDARY)
|
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* If we have fewer left, left_first is set and we omit
|
|
|
|
* commits on the right branch in this loop. If we have
|
|
|
|
* fewer right, we skip the left ones.
|
|
|
|
*/
|
|
|
|
if (left_first != !!(flags & SYMMETRIC_LEFT))
|
|
|
|
continue;
|
2016-07-30 00:19:18 +08:00
|
|
|
add_commit_patch_id(commit, &ids);
|
2007-04-09 18:40:38 +08:00
|
|
|
}
|
|
|
|
|
2011-03-07 20:31:40 +08:00
|
|
|
/* either cherry_mark or cherry_pick are true */
|
|
|
|
cherry_flag = revs->cherry_mark ? PATCHSAME : SHOWN;
|
|
|
|
|
2007-04-09 18:40:38 +08:00
|
|
|
/* Check the other side */
|
|
|
|
for (p = list; p; p = p->next) {
|
|
|
|
struct commit *commit = p->item;
|
|
|
|
struct patch_id *id;
|
|
|
|
unsigned flags = commit->object.flags;
|
|
|
|
|
|
|
|
if (flags & BOUNDARY)
|
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* If we have fewer left, left_first is set and we omit
|
|
|
|
* commits on the left branch in this loop.
|
|
|
|
*/
|
|
|
|
if (left_first == !!(flags & SYMMETRIC_LEFT))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Have we seen the same patch id?
|
|
|
|
*/
|
patch-ids: handle duplicate hashmap entries
This fixes a bug introduced in dfb7a1b4d0 (patch-ids: stop using a
hand-rolled hashmap implementation, 2016-07-29) in which
git rev-list --cherry-pick A...B
will fail to suppress commits reachable from A even if a commit with
matching patch-id appears in B.
Around the time of that commit, the algorithm for "--cherry-pick" looked
something like this:
0. Traverse all of the commits, marking them as being on the left or
right side of the symmetric difference.
1. Iterate over the left-hand commits, inserting a patch-id struct for
each into a hashmap, and pointing commit->util to the patch-id
struct.
2. Iterate over the right-hand commits, checking which are present in
the hashmap. If so, we exclude the commit from the output _and_ we
mark the patch-id as "seen".
3. Iterate again over the left-hand commits, checking whether
commit->util->seen is set; if so, exclude them from the output.
At the end, we'll have eliminated commits from both sides that have a
matching patch-id on the other side. But there's a subtle assumption
here: for any given patch-id, we must have exactly one struct
representing it. If two commits from A both have the same patch-id and
we allow duplicates in the hashmap, then we run into a problem:
a. In step 1, we insert two patch-id structs into the hashmap.
b. In step 2, our lookups will find only one of these structs, so only
one "seen" flag is marked.
c. In step 3, one of the commits in A will have its commit->util->seen
set, but the other will not. We'll erroneously output the latter.
Prior to dfb7a1b4d0, our hashmap did not allow duplicates. Afterwards,
it used hashmap_add(), which explicitly does allow duplicates.
At that point, the solution would have been easy: when we are about to
add a duplicate, skip doing so and return the existing entry which
matches. But it gets more complicated.
In 683f17ec44 (patch-ids: replace the seen indicator with a commit
pointer, 2016-07-29), our step 3 goes away entirely. Instead, in step 2,
when the right-hand side finds a matching patch_id from the left-hand
side, we can directly mark the left-hand patch_id->commit to be omitted.
Solving that would be easy, too; there's a one-to-many relationship of
patch-ids to commits, so we just need to keep a list.
But there's more. Commit b3dfeebb92 (rebase: avoid computing unnecessary
patch IDs, 2016-07-29) built on that by lazily computing the full
patch-ids. So we don't even know when adding to the hashmap whether two
commits truly have the same id. We'd have to tentatively assign them a
list, and then possibly split them apart (possibly into N new structs)
at the moment we compute the real patch-ids. This could work, but it's
complicated and error-prone.
Instead, let's accept that we may store duplicates, and teach the lookup
side to be more clever. Rather than asking for a single matching
patch-id, it will need to iterate over all matching patch-ids. This does
mean examining every entry in a single hash bucket, but the worst-case
for a hash lookup was already doing that.
We'll keep the hashmap details out of the caller by providing a simple
iteration interface. We can retain the simple has_commit_patch_id()
interface for the other callers, but we'll simplify its return value
into an integer, rather than returning the patch_id struct. That way
they won't be tempted to look at the "commit" field of the return value
without iterating.
Reported-by: Arnaud Morin <arnaud.morin@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-12 23:52:32 +08:00
|
|
|
id = patch_id_iter_first(commit, &ids);
|
2007-04-09 18:40:38 +08:00
|
|
|
if (!id)
|
|
|
|
continue;
|
|
|
|
|
2016-07-30 00:19:18 +08:00
|
|
|
commit->object.flags |= cherry_flag;
|
patch-ids: handle duplicate hashmap entries
This fixes a bug introduced in dfb7a1b4d0 (patch-ids: stop using a
hand-rolled hashmap implementation, 2016-07-29) in which
git rev-list --cherry-pick A...B
will fail to suppress commits reachable from A even if a commit with
matching patch-id appears in B.
Around the time of that commit, the algorithm for "--cherry-pick" looked
something like this:
0. Traverse all of the commits, marking them as being on the left or
right side of the symmetric difference.
1. Iterate over the left-hand commits, inserting a patch-id struct for
each into a hashmap, and pointing commit->util to the patch-id
struct.
2. Iterate over the right-hand commits, checking which are present in
the hashmap. If so, we exclude the commit from the output _and_ we
mark the patch-id as "seen".
3. Iterate again over the left-hand commits, checking whether
commit->util->seen is set; if so, exclude them from the output.
At the end, we'll have eliminated commits from both sides that have a
matching patch-id on the other side. But there's a subtle assumption
here: for any given patch-id, we must have exactly one struct
representing it. If two commits from A both have the same patch-id and
we allow duplicates in the hashmap, then we run into a problem:
a. In step 1, we insert two patch-id structs into the hashmap.
b. In step 2, our lookups will find only one of these structs, so only
one "seen" flag is marked.
c. In step 3, one of the commits in A will have its commit->util->seen
set, but the other will not. We'll erroneously output the latter.
Prior to dfb7a1b4d0, our hashmap did not allow duplicates. Afterwards,
it used hashmap_add(), which explicitly does allow duplicates.
At that point, the solution would have been easy: when we are about to
add a duplicate, skip doing so and return the existing entry which
matches. But it gets more complicated.
In 683f17ec44 (patch-ids: replace the seen indicator with a commit
pointer, 2016-07-29), our step 3 goes away entirely. Instead, in step 2,
when the right-hand side finds a matching patch_id from the left-hand
side, we can directly mark the left-hand patch_id->commit to be omitted.
Solving that would be easy, too; there's a one-to-many relationship of
patch-ids to commits, so we just need to keep a list.
But there's more. Commit b3dfeebb92 (rebase: avoid computing unnecessary
patch IDs, 2016-07-29) built on that by lazily computing the full
patch-ids. So we don't even know when adding to the hashmap whether two
commits truly have the same id. We'd have to tentatively assign them a
list, and then possibly split them apart (possibly into N new structs)
at the moment we compute the real patch-ids. This could work, but it's
complicated and error-prone.
Instead, let's accept that we may store duplicates, and teach the lookup
side to be more clever. Rather than asking for a single matching
patch-id, it will need to iterate over all matching patch-ids. This does
mean examining every entry in a single hash bucket, but the worst-case
for a hash lookup was already doing that.
We'll keep the hashmap details out of the caller by providing a simple
iteration interface. We can retain the simple has_commit_patch_id()
interface for the other callers, but we'll simplify its return value
into an integer, rather than returning the patch_id struct. That way
they won't be tempted to look at the "commit" field of the return value
without iterating.
Reported-by: Arnaud Morin <arnaud.morin@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-12 23:52:32 +08:00
|
|
|
do {
|
|
|
|
id->commit->object.flags |= cherry_flag;
|
|
|
|
} while ((id = patch_id_iter_next(id, &ids)));
|
2007-04-09 18:40:38 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
free_patch_ids(&ids);
|
|
|
|
}
|
|
|
|
|
2008-03-18 09:56:33 +08:00
|
|
|
/* How many extra uninteresting commits we want to see.. */
|
|
|
|
#define SLOP 5
|
|
|
|
|
2017-04-27 03:29:31 +08:00
|
|
|
static int still_interesting(struct commit_list *src, timestamp_t date, int slop,
|
limit_list: avoid quadratic behavior from still_interesting
When we are limiting a rev-list traversal due to
UNINTERESTING refs, we have to walk down the tips (both
interesting and uninteresting) to find where they intersect.
We keep a queue of commits to examine, pop commits off
the queue one by one, and potentially add their parents. The
size of the queue will naturally fluctuate based on the
"width" of the history graph; i.e., the number of
simultaneous lines of development. But for the most part it
will stay in the same ballpark as the initial number of tips
we fed, shrinking over time (as we hit common ancestors of
the tips). So roughly speaking, if we start with `N` tips,
we'll spend much of the time with a queue around `N` items.
For each UNINTERESTING commit we pop, we call
still_interesting to check whether marking its parents as
UNINTERESTING has made the whole queue uninteresting (in
which case we can quit early). Because the queue is stored
as a linked list, this is `O(N)`, where `N` is the number of
items in the queue. So processing a queue with `N` commits
marked UNINTERESTING (and one or more interesting commits)
will take `O(N^2)`.
If you feed a lot of positive tips, this isn't a problem.
They aren't UNINTERESTING, so they don't incur the
still_interesting check. It also isn't a problem if you
traverse from an interesting tip to some UNINTERESTING
bases. We order the queue by recency, so the interesting
commits stay at the front of the queue as we walk down them.
The linear check can exit early as soon as it sees one
interesting commit left in the queue.
But if you want to know whether an older commit is reachable
from a set of newer tips, we end up processing in the
opposite direction: from the UNINTERESTING ones down to the
interesting one. This may happen when we call:
git rev-list $commits --not --all
in check_everything_connected after a fetch. If we fetched
something much older than most of our refs, and if we have a
large number of refs, the traversal cost is dominated by the
quadratic behavior.
These commands simulate the connectivity check of such a
fetch, when you have `$n` distinct refs in the receiver:
# positive ref is 100,000 commits deep
git rev-list --all | head -100000 | tail -1 >input
# huge number of more recent negative refs
git rev-list --all | head -$n | sed s/^/^/ >>input
time git rev-list --stdin <input
Here are timings for various `n` on the linux.git
repository. The `n=1` case provides a baseline for just
walking the commits, which lets us see the still_interesting
overhead. The times marked with `+` subtract that baseline
to show just the extra time growth due to the large number
of refs. The `x` numbers show the slowdown of the adjusted
time versus the prior trial.
n | before | after
--------------------------------------------------------
1 | 0.991s | 0.848s
10000 | 1.120s (+0.129s) | 0.885s (+0.037s)
20000 | 1.451s (+0.460s, 3.5x) | 0.923s (+0.075s, 2.0x)
40000 | 2.731s (+1.740s, 3.8x) | 0.994s (+0.146s, 1.9x)
80000 | 8.235s (+7.244s, 4.2x) | 1.123s (+0.275s, 1.9x)
Each trial doubles `n`, so you can see the quadratic (`4x`)
behavior before this patch. Afterwards, we have a roughly
linear relationship.
The implementation is fairly straightforward. Whenever we do
the linear search, we cache the interesting commit we find,
and next time check it before doing another linear search.
If that commit is removed from the list or becomes
UNINTERESTING itself, then we fall back to the linear
search. This is very similar to the trick used by fce87ae
(Fix quadratic performance in rewrite_one., 2008-07-12).
I considered and rejected several possible alternatives:
1. Keep a count of UNINTERESTING commits in the queue.
This requires managing the count not only when removing
an item from the queue, but also when marking an item
as UNINTERESTING. That requires touching the other
functions which mark commits, and would require knowing
quickly which commits are in the queue (lookup in the
queue is linear, so we would need an auxiliary
structure or to also maintain an IN_QUEUE flag in each
commit object).
2. Keep a separate list of interesting commits. Drop items
from it when they are dropped from the queue, or if
they become UNINTERESTING. This again suffers from
extra complexity to maintain the list, not to mention
CPU and memory.
3. Use a better data structure for the queue. This is
something that could help the fix in fce87ae, because
we order the queue by recency, and it is about
inserting quickly in recency order. So a normal
priority queue would help there. But here, we cannot
disturb the order of the queue, which makes things
harder. We really do need an auxiliary index to track
the flag we care about, which is basically option (2)
above.
The "cache" trick is simple, and the numbers above show that
it works well in practice. This is because the length of
time it takes to find an interesting commit is proportional
to the length of time it will remain cached (i.e., if we
have to walk a long way to find it, it also means we have to
pop a lot of elements in the queue until we get rid of it
and have to find another interesting commit).
The worst case is still quadratic, though. We could have `N`
uninteresting commits at the front of the queue, followed by
`N` interesting commits, where commit `i` has parent `i+N`.
When we pop commit `i`, we will notice that the parent of
the next commit, `i+1+N` is still interesting and cache it.
But then handling commit `i+1`, we will mark its parent
`i+1+N` uninteresting, and immediately invalidate our cache.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-04-18 06:11:04 +08:00
|
|
|
struct commit **interesting_cache)
|
Add "--show-all" revision walker flag for debugging
It's really not very easy to visualize the commit walker, because - on
purpose - it obvously doesn't show the uninteresting commits!
This adds a "--show-all" flag to the revision walker, which will make
it show uninteresting commits too, and they'll have a '^' in front of
them (it also fixes a logic error for !verbose_header for boundary
commits - we should show the '-' even if left_right isn't shown).
A separate patch to gitk to teach it the new '^' was sent
to paulus. With the change in place, it actually is interesting
even for the cases that git doesn't have any problems with, ie
for the kernel you can do:
gitk -d --show-all v2.6.24..
and you see just how far down it has to parse things to see it all. The
use of "-d" is a good idea, since the date-ordered toposort is much better
at showing why it goes deep down (ie the date of some of those commits
after 2.6.24 is much older, because they were merged from trees that
weren't rebased).
So I think this is a useful feature even for non-debugging - just to
visualize what git does internally more.
When it actually breaks out due to the "everybody_uninteresting()"
case, it adds the uninteresting commits (both the one it's looking at
now, and the list of pending ones) to the list
This way, we really list *all* the commits we've looked at.
Because we now end up listing commits we may not even have been parsed
at all "show_log" and "show_commit" need to protect against commits
that don't have a commit buffer entry.
That second part is debatable just how it should work. Maybe we shouldn't
show such entries at all (with this patch those entries do get shown, they
just don't get any message shown with them). But I think this is a useful
case.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-02-10 06:02:07 +08:00
|
|
|
{
|
2008-03-18 09:56:33 +08:00
|
|
|
/*
|
|
|
|
* No source list at all? We're definitely done..
|
|
|
|
*/
|
|
|
|
if (!src)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Does the destination list contain entries with a date
|
|
|
|
* before the source list? Definitely _not_ done.
|
|
|
|
*/
|
2013-03-23 02:38:19 +08:00
|
|
|
if (date <= src->item->date)
|
2008-03-18 09:56:33 +08:00
|
|
|
return SLOP;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Does the source list still have interesting commits in
|
|
|
|
* it? Definitely not done..
|
|
|
|
*/
|
limit_list: avoid quadratic behavior from still_interesting
When we are limiting a rev-list traversal due to
UNINTERESTING refs, we have to walk down the tips (both
interesting and uninteresting) to find where they intersect.
We keep a queue of commits to examine, pop commits off
the queue one by one, and potentially add their parents. The
size of the queue will naturally fluctuate based on the
"width" of the history graph; i.e., the number of
simultaneous lines of development. But for the most part it
will stay in the same ballpark as the initial number of tips
we fed, shrinking over time (as we hit common ancestors of
the tips). So roughly speaking, if we start with `N` tips,
we'll spend much of the time with a queue around `N` items.
For each UNINTERESTING commit we pop, we call
still_interesting to check whether marking its parents as
UNINTERESTING has made the whole queue uninteresting (in
which case we can quit early). Because the queue is stored
as a linked list, this is `O(N)`, where `N` is the number of
items in the queue. So processing a queue with `N` commits
marked UNINTERESTING (and one or more interesting commits)
will take `O(N^2)`.
If you feed a lot of positive tips, this isn't a problem.
They aren't UNINTERESTING, so they don't incur the
still_interesting check. It also isn't a problem if you
traverse from an interesting tip to some UNINTERESTING
bases. We order the queue by recency, so the interesting
commits stay at the front of the queue as we walk down them.
The linear check can exit early as soon as it sees one
interesting commit left in the queue.
But if you want to know whether an older commit is reachable
from a set of newer tips, we end up processing in the
opposite direction: from the UNINTERESTING ones down to the
interesting one. This may happen when we call:
git rev-list $commits --not --all
in check_everything_connected after a fetch. If we fetched
something much older than most of our refs, and if we have a
large number of refs, the traversal cost is dominated by the
quadratic behavior.
These commands simulate the connectivity check of such a
fetch, when you have `$n` distinct refs in the receiver:
# positive ref is 100,000 commits deep
git rev-list --all | head -100000 | tail -1 >input
# huge number of more recent negative refs
git rev-list --all | head -$n | sed s/^/^/ >>input
time git rev-list --stdin <input
Here are timings for various `n` on the linux.git
repository. The `n=1` case provides a baseline for just
walking the commits, which lets us see the still_interesting
overhead. The times marked with `+` subtract that baseline
to show just the extra time growth due to the large number
of refs. The `x` numbers show the slowdown of the adjusted
time versus the prior trial.
n | before | after
--------------------------------------------------------
1 | 0.991s | 0.848s
10000 | 1.120s (+0.129s) | 0.885s (+0.037s)
20000 | 1.451s (+0.460s, 3.5x) | 0.923s (+0.075s, 2.0x)
40000 | 2.731s (+1.740s, 3.8x) | 0.994s (+0.146s, 1.9x)
80000 | 8.235s (+7.244s, 4.2x) | 1.123s (+0.275s, 1.9x)
Each trial doubles `n`, so you can see the quadratic (`4x`)
behavior before this patch. Afterwards, we have a roughly
linear relationship.
The implementation is fairly straightforward. Whenever we do
the linear search, we cache the interesting commit we find,
and next time check it before doing another linear search.
If that commit is removed from the list or becomes
UNINTERESTING itself, then we fall back to the linear
search. This is very similar to the trick used by fce87ae
(Fix quadratic performance in rewrite_one., 2008-07-12).
I considered and rejected several possible alternatives:
1. Keep a count of UNINTERESTING commits in the queue.
This requires managing the count not only when removing
an item from the queue, but also when marking an item
as UNINTERESTING. That requires touching the other
functions which mark commits, and would require knowing
quickly which commits are in the queue (lookup in the
queue is linear, so we would need an auxiliary
structure or to also maintain an IN_QUEUE flag in each
commit object).
2. Keep a separate list of interesting commits. Drop items
from it when they are dropped from the queue, or if
they become UNINTERESTING. This again suffers from
extra complexity to maintain the list, not to mention
CPU and memory.
3. Use a better data structure for the queue. This is
something that could help the fix in fce87ae, because
we order the queue by recency, and it is about
inserting quickly in recency order. So a normal
priority queue would help there. But here, we cannot
disturb the order of the queue, which makes things
harder. We really do need an auxiliary index to track
the flag we care about, which is basically option (2)
above.
The "cache" trick is simple, and the numbers above show that
it works well in practice. This is because the length of
time it takes to find an interesting commit is proportional
to the length of time it will remain cached (i.e., if we
have to walk a long way to find it, it also means we have to
pop a lot of elements in the queue until we get rid of it
and have to find another interesting commit).
The worst case is still quadratic, though. We could have `N`
uninteresting commits at the front of the queue, followed by
`N` interesting commits, where commit `i` has parent `i+N`.
When we pop commit `i`, we will notice that the parent of
the next commit, `i+1+N` is still interesting and cache it.
But then handling commit `i+1`, we will mark its parent
`i+1+N` uninteresting, and immediately invalidate our cache.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-04-18 06:11:04 +08:00
|
|
|
if (!everybody_uninteresting(src, interesting_cache))
|
2008-03-18 09:56:33 +08:00
|
|
|
return SLOP;
|
|
|
|
|
|
|
|
/* Ok, we're closing in.. */
|
|
|
|
return slop-1;
|
Add "--show-all" revision walker flag for debugging
It's really not very easy to visualize the commit walker, because - on
purpose - it obvously doesn't show the uninteresting commits!
This adds a "--show-all" flag to the revision walker, which will make
it show uninteresting commits too, and they'll have a '^' in front of
them (it also fixes a logic error for !verbose_header for boundary
commits - we should show the '-' even if left_right isn't shown).
A separate patch to gitk to teach it the new '^' was sent
to paulus. With the change in place, it actually is interesting
even for the cases that git doesn't have any problems with, ie
for the kernel you can do:
gitk -d --show-all v2.6.24..
and you see just how far down it has to parse things to see it all. The
use of "-d" is a good idea, since the date-ordered toposort is much better
at showing why it goes deep down (ie the date of some of those commits
after 2.6.24 is much older, because they were merged from trees that
weren't rebased).
So I think this is a useful feature even for non-debugging - just to
visualize what git does internally more.
When it actually breaks out due to the "everybody_uninteresting()"
case, it adds the uninteresting commits (both the one it's looking at
now, and the list of pending ones) to the list
This way, we really list *all* the commits we've looked at.
Because we now end up listing commits we may not even have been parsed
at all "show_log" and "show_commit" need to protect against commits
that don't have a commit buffer entry.
That second part is debatable just how it should work. Maybe we shouldn't
show such entries at all (with this patch those entries do get shown, they
just don't get any message shown with them). But I think this is a useful
case.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-02-10 06:02:07 +08:00
|
|
|
}
|
|
|
|
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
/*
|
2022-08-19 12:28:10 +08:00
|
|
|
* "rev-list --ancestry-path=C_0 [--ancestry-path=C_1 ...] A..B"
|
|
|
|
* computes commits that are ancestors of B but not ancestors of A but
|
|
|
|
* further limits the result to those that have any of C in their
|
|
|
|
* ancestry path (i.e. are either ancestors of any of C, descendants
|
|
|
|
* of any of C, or are any of C). If --ancestry-path is specified with
|
|
|
|
* no commit, we use all bottom commits for C.
|
|
|
|
*
|
|
|
|
* Before this function is called, ancestors of C will have already
|
|
|
|
* been marked with ANCESTRY_PATH previously.
|
|
|
|
*
|
|
|
|
* This takes the list of bottom commits and the result of "A..B"
|
|
|
|
* without --ancestry-path, and limits the latter further to the ones
|
|
|
|
* that have any of C in their ancestry path. Since the ancestors of C
|
|
|
|
* have already been marked (a prerequisite of this function), we just
|
|
|
|
* need to mark the descendants, then exclude any commit that does not
|
|
|
|
* have any of these marks.
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
*/
|
2022-08-19 12:28:10 +08:00
|
|
|
static void limit_to_ancestry(struct commit_list *bottoms, struct commit_list *list)
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
{
|
|
|
|
struct commit_list *p;
|
|
|
|
struct commit_list *rlist = NULL;
|
|
|
|
int made_progress;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Reverse the list so that it will be likely that we would
|
|
|
|
* process parents before children.
|
|
|
|
*/
|
|
|
|
for (p = list; p; p = p->next)
|
|
|
|
commit_list_insert(p->item, &rlist);
|
|
|
|
|
2022-08-19 12:28:10 +08:00
|
|
|
for (p = bottoms; p; p = p->next)
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
p->item->object.flags |= TMP_MARK;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Mark the ones that can reach bottom commits in "list",
|
|
|
|
* in a bottom-up fashion.
|
|
|
|
*/
|
|
|
|
do {
|
|
|
|
made_progress = 0;
|
|
|
|
for (p = rlist; p; p = p->next) {
|
|
|
|
struct commit *c = p->item;
|
|
|
|
struct commit_list *parents;
|
|
|
|
if (c->object.flags & (TMP_MARK | UNINTERESTING))
|
|
|
|
continue;
|
|
|
|
for (parents = c->parents;
|
|
|
|
parents;
|
|
|
|
parents = parents->next) {
|
|
|
|
if (!(parents->item->object.flags & TMP_MARK))
|
|
|
|
continue;
|
|
|
|
c->object.flags |= TMP_MARK;
|
|
|
|
made_progress = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
} while (made_progress);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* NEEDSWORK: decide if we want to remove parents that are
|
|
|
|
* not marked with TMP_MARK from commit->parents for commits
|
|
|
|
* in the resulting list. We may not want to do that, though.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
2022-08-19 12:28:10 +08:00
|
|
|
* The ones that are not marked with either TMP_MARK or
|
|
|
|
* ANCESTRY_PATH are uninteresting
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
*/
|
|
|
|
for (p = list; p; p = p->next) {
|
|
|
|
struct commit *c = p->item;
|
2022-08-19 12:28:10 +08:00
|
|
|
if (c->object.flags & (TMP_MARK | ANCESTRY_PATH))
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
continue;
|
|
|
|
c->object.flags |= UNINTERESTING;
|
|
|
|
}
|
|
|
|
|
2022-08-19 12:28:10 +08:00
|
|
|
/* We are done with TMP_MARK and ANCESTRY_PATH */
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
for (p = list; p; p = p->next)
|
2022-08-19 12:28:10 +08:00
|
|
|
p->item->object.flags &= ~(TMP_MARK | ANCESTRY_PATH);
|
|
|
|
for (p = bottoms; p; p = p->next)
|
|
|
|
p->item->object.flags &= ~(TMP_MARK | ANCESTRY_PATH);
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
free_commit_list(rlist);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2022-08-19 12:28:10 +08:00
|
|
|
* Before walking the history, add the set of "negative" refs the
|
|
|
|
* caller has asked to exclude to the bottom list.
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
*
|
|
|
|
* This is used to compute "rev-list --ancestry-path A..B", as we need
|
|
|
|
* to filter the result of "A..B" further to the ones that can actually
|
|
|
|
* reach A.
|
|
|
|
*/
|
2022-08-19 12:28:10 +08:00
|
|
|
static void collect_bottom_commits(struct commit_list *list,
|
|
|
|
struct commit_list **bottom)
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
{
|
2022-08-19 12:28:10 +08:00
|
|
|
struct commit_list *elem;
|
2013-05-16 23:32:38 +08:00
|
|
|
for (elem = list; elem; elem = elem->next)
|
|
|
|
if (elem->item->object.flags & BOTTOM)
|
2022-08-19 12:28:10 +08:00
|
|
|
commit_list_insert(elem->item, bottom);
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
}
|
|
|
|
|
2011-02-22 00:09:11 +08:00
|
|
|
/* Assumes either left_only or right_only is set */
|
|
|
|
static void limit_left_right(struct commit_list *list, struct rev_info *revs)
|
|
|
|
{
|
|
|
|
struct commit_list *p;
|
|
|
|
|
|
|
|
for (p = list; p; p = p->next) {
|
|
|
|
struct commit *commit = p->item;
|
|
|
|
|
|
|
|
if (revs->right_only) {
|
|
|
|
if (commit->object.flags & SYMMETRIC_LEFT)
|
|
|
|
commit->object.flags |= SHOWN;
|
|
|
|
} else /* revs->left_only is set */
|
|
|
|
if (!(commit->object.flags & SYMMETRIC_LEFT))
|
|
|
|
commit->object.flags |= SHOWN;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2007-05-05 05:54:57 +08:00
|
|
|
static int limit_list(struct rev_info *revs)
|
2006-03-01 03:24:00 +08:00
|
|
|
{
|
2008-03-18 09:56:33 +08:00
|
|
|
int slop = SLOP;
|
2017-04-27 03:29:31 +08:00
|
|
|
timestamp_t date = TIME_MAX;
|
2021-04-25 22:16:08 +08:00
|
|
|
struct commit_list *original_list = revs->commits;
|
2006-03-01 03:24:00 +08:00
|
|
|
struct commit_list *newlist = NULL;
|
|
|
|
struct commit_list **p = &newlist;
|
limit_list: avoid quadratic behavior from still_interesting
When we are limiting a rev-list traversal due to
UNINTERESTING refs, we have to walk down the tips (both
interesting and uninteresting) to find where they intersect.
We keep a queue of commits to examine, pop commits off
the queue one by one, and potentially add their parents. The
size of the queue will naturally fluctuate based on the
"width" of the history graph; i.e., the number of
simultaneous lines of development. But for the most part it
will stay in the same ballpark as the initial number of tips
we fed, shrinking over time (as we hit common ancestors of
the tips). So roughly speaking, if we start with `N` tips,
we'll spend much of the time with a queue around `N` items.
For each UNINTERESTING commit we pop, we call
still_interesting to check whether marking its parents as
UNINTERESTING has made the whole queue uninteresting (in
which case we can quit early). Because the queue is stored
as a linked list, this is `O(N)`, where `N` is the number of
items in the queue. So processing a queue with `N` commits
marked UNINTERESTING (and one or more interesting commits)
will take `O(N^2)`.
If you feed a lot of positive tips, this isn't a problem.
They aren't UNINTERESTING, so they don't incur the
still_interesting check. It also isn't a problem if you
traverse from an interesting tip to some UNINTERESTING
bases. We order the queue by recency, so the interesting
commits stay at the front of the queue as we walk down them.
The linear check can exit early as soon as it sees one
interesting commit left in the queue.
But if you want to know whether an older commit is reachable
from a set of newer tips, we end up processing in the
opposite direction: from the UNINTERESTING ones down to the
interesting one. This may happen when we call:
git rev-list $commits --not --all
in check_everything_connected after a fetch. If we fetched
something much older than most of our refs, and if we have a
large number of refs, the traversal cost is dominated by the
quadratic behavior.
These commands simulate the connectivity check of such a
fetch, when you have `$n` distinct refs in the receiver:
# positive ref is 100,000 commits deep
git rev-list --all | head -100000 | tail -1 >input
# huge number of more recent negative refs
git rev-list --all | head -$n | sed s/^/^/ >>input
time git rev-list --stdin <input
Here are timings for various `n` on the linux.git
repository. The `n=1` case provides a baseline for just
walking the commits, which lets us see the still_interesting
overhead. The times marked with `+` subtract that baseline
to show just the extra time growth due to the large number
of refs. The `x` numbers show the slowdown of the adjusted
time versus the prior trial.
n | before | after
--------------------------------------------------------
1 | 0.991s | 0.848s
10000 | 1.120s (+0.129s) | 0.885s (+0.037s)
20000 | 1.451s (+0.460s, 3.5x) | 0.923s (+0.075s, 2.0x)
40000 | 2.731s (+1.740s, 3.8x) | 0.994s (+0.146s, 1.9x)
80000 | 8.235s (+7.244s, 4.2x) | 1.123s (+0.275s, 1.9x)
Each trial doubles `n`, so you can see the quadratic (`4x`)
behavior before this patch. Afterwards, we have a roughly
linear relationship.
The implementation is fairly straightforward. Whenever we do
the linear search, we cache the interesting commit we find,
and next time check it before doing another linear search.
If that commit is removed from the list or becomes
UNINTERESTING itself, then we fall back to the linear
search. This is very similar to the trick used by fce87ae
(Fix quadratic performance in rewrite_one., 2008-07-12).
I considered and rejected several possible alternatives:
1. Keep a count of UNINTERESTING commits in the queue.
This requires managing the count not only when removing
an item from the queue, but also when marking an item
as UNINTERESTING. That requires touching the other
functions which mark commits, and would require knowing
quickly which commits are in the queue (lookup in the
queue is linear, so we would need an auxiliary
structure or to also maintain an IN_QUEUE flag in each
commit object).
2. Keep a separate list of interesting commits. Drop items
from it when they are dropped from the queue, or if
they become UNINTERESTING. This again suffers from
extra complexity to maintain the list, not to mention
CPU and memory.
3. Use a better data structure for the queue. This is
something that could help the fix in fce87ae, because
we order the queue by recency, and it is about
inserting quickly in recency order. So a normal
priority queue would help there. But here, we cannot
disturb the order of the queue, which makes things
harder. We really do need an auxiliary index to track
the flag we care about, which is basically option (2)
above.
The "cache" trick is simple, and the numbers above show that
it works well in practice. This is because the length of
time it takes to find an interesting commit is proportional
to the length of time it will remain cached (i.e., if we
have to walk a long way to find it, it also means we have to
pop a lot of elements in the queue until we get rid of it
and have to find another interesting commit).
The worst case is still quadratic, though. We could have `N`
uninteresting commits at the front of the queue, followed by
`N` interesting commits, where commit `i` has parent `i+N`.
When we pop commit `i`, we will notice that the parent of
the next commit, `i+1+N` is still interesting and cache it.
But then handling commit `i+1`, we will mark its parent
`i+1+N` uninteresting, and immediately invalidate our cache.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-04-18 06:11:04 +08:00
|
|
|
struct commit *interesting_cache = NULL;
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
|
2022-08-19 12:28:10 +08:00
|
|
|
if (revs->ancestry_path_implicit_bottoms) {
|
|
|
|
collect_bottom_commits(original_list,
|
|
|
|
&revs->ancestry_path_bottoms);
|
|
|
|
if (!revs->ancestry_path_bottoms)
|
2010-06-04 07:17:36 +08:00
|
|
|
die("--ancestry-path given but there are no bottom commits");
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
}
|
2006-03-01 03:24:00 +08:00
|
|
|
|
2021-04-25 22:16:08 +08:00
|
|
|
while (original_list) {
|
|
|
|
struct commit *commit = pop_commit(&original_list);
|
2006-03-01 03:24:00 +08:00
|
|
|
struct object *obj = &commit->object;
|
2007-11-04 02:11:10 +08:00
|
|
|
show_early_output_fn_t show;
|
2006-03-01 03:24:00 +08:00
|
|
|
|
limit_list: avoid quadratic behavior from still_interesting
When we are limiting a rev-list traversal due to
UNINTERESTING refs, we have to walk down the tips (both
interesting and uninteresting) to find where they intersect.
We keep a queue of commits to examine, pop commits off
the queue one by one, and potentially add their parents. The
size of the queue will naturally fluctuate based on the
"width" of the history graph; i.e., the number of
simultaneous lines of development. But for the most part it
will stay in the same ballpark as the initial number of tips
we fed, shrinking over time (as we hit common ancestors of
the tips). So roughly speaking, if we start with `N` tips,
we'll spend much of the time with a queue around `N` items.
For each UNINTERESTING commit we pop, we call
still_interesting to check whether marking its parents as
UNINTERESTING has made the whole queue uninteresting (in
which case we can quit early). Because the queue is stored
as a linked list, this is `O(N)`, where `N` is the number of
items in the queue. So processing a queue with `N` commits
marked UNINTERESTING (and one or more interesting commits)
will take `O(N^2)`.
If you feed a lot of positive tips, this isn't a problem.
They aren't UNINTERESTING, so they don't incur the
still_interesting check. It also isn't a problem if you
traverse from an interesting tip to some UNINTERESTING
bases. We order the queue by recency, so the interesting
commits stay at the front of the queue as we walk down them.
The linear check can exit early as soon as it sees one
interesting commit left in the queue.
But if you want to know whether an older commit is reachable
from a set of newer tips, we end up processing in the
opposite direction: from the UNINTERESTING ones down to the
interesting one. This may happen when we call:
git rev-list $commits --not --all
in check_everything_connected after a fetch. If we fetched
something much older than most of our refs, and if we have a
large number of refs, the traversal cost is dominated by the
quadratic behavior.
These commands simulate the connectivity check of such a
fetch, when you have `$n` distinct refs in the receiver:
# positive ref is 100,000 commits deep
git rev-list --all | head -100000 | tail -1 >input
# huge number of more recent negative refs
git rev-list --all | head -$n | sed s/^/^/ >>input
time git rev-list --stdin <input
Here are timings for various `n` on the linux.git
repository. The `n=1` case provides a baseline for just
walking the commits, which lets us see the still_interesting
overhead. The times marked with `+` subtract that baseline
to show just the extra time growth due to the large number
of refs. The `x` numbers show the slowdown of the adjusted
time versus the prior trial.
n | before | after
--------------------------------------------------------
1 | 0.991s | 0.848s
10000 | 1.120s (+0.129s) | 0.885s (+0.037s)
20000 | 1.451s (+0.460s, 3.5x) | 0.923s (+0.075s, 2.0x)
40000 | 2.731s (+1.740s, 3.8x) | 0.994s (+0.146s, 1.9x)
80000 | 8.235s (+7.244s, 4.2x) | 1.123s (+0.275s, 1.9x)
Each trial doubles `n`, so you can see the quadratic (`4x`)
behavior before this patch. Afterwards, we have a roughly
linear relationship.
The implementation is fairly straightforward. Whenever we do
the linear search, we cache the interesting commit we find,
and next time check it before doing another linear search.
If that commit is removed from the list or becomes
UNINTERESTING itself, then we fall back to the linear
search. This is very similar to the trick used by fce87ae
(Fix quadratic performance in rewrite_one., 2008-07-12).
I considered and rejected several possible alternatives:
1. Keep a count of UNINTERESTING commits in the queue.
This requires managing the count not only when removing
an item from the queue, but also when marking an item
as UNINTERESTING. That requires touching the other
functions which mark commits, and would require knowing
quickly which commits are in the queue (lookup in the
queue is linear, so we would need an auxiliary
structure or to also maintain an IN_QUEUE flag in each
commit object).
2. Keep a separate list of interesting commits. Drop items
from it when they are dropped from the queue, or if
they become UNINTERESTING. This again suffers from
extra complexity to maintain the list, not to mention
CPU and memory.
3. Use a better data structure for the queue. This is
something that could help the fix in fce87ae, because
we order the queue by recency, and it is about
inserting quickly in recency order. So a normal
priority queue would help there. But here, we cannot
disturb the order of the queue, which makes things
harder. We really do need an auxiliary index to track
the flag we care about, which is basically option (2)
above.
The "cache" trick is simple, and the numbers above show that
it works well in practice. This is because the length of
time it takes to find an interesting commit is proportional
to the length of time it will remain cached (i.e., if we
have to walk a long way to find it, it also means we have to
pop a lot of elements in the queue until we get rid of it
and have to find another interesting commit).
The worst case is still quadratic, though. We could have `N`
uninteresting commits at the front of the queue, followed by
`N` interesting commits, where commit `i` has parent `i+N`.
When we pop commit `i`, we will notice that the parent of
the next commit, `i+1+N` is still interesting and cache it.
But then handling commit `i+1`, we will mark its parent
`i+1+N` uninteresting, and immediately invalidate our cache.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-04-18 06:11:04 +08:00
|
|
|
if (commit == interesting_cache)
|
|
|
|
interesting_cache = NULL;
|
|
|
|
|
2006-03-01 03:24:00 +08:00
|
|
|
if (revs->max_age != -1 && (commit->date < revs->max_age))
|
|
|
|
obj->flags |= UNINTERESTING;
|
2021-04-25 22:16:08 +08:00
|
|
|
if (process_parents(revs, commit, &original_list, NULL) < 0)
|
2007-05-05 05:54:57 +08:00
|
|
|
return -1;
|
2006-03-01 03:24:00 +08:00
|
|
|
if (obj->flags & UNINTERESTING) {
|
git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.
In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.
-A-----E-F-G--main
\ / /
B-C-D--topic
In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.
Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.
Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 05:39:41 +08:00
|
|
|
mark_parents_uninteresting(revs, commit);
|
2021-04-25 22:16:08 +08:00
|
|
|
slop = still_interesting(original_list, date, slop, &interesting_cache);
|
2008-03-18 09:56:33 +08:00
|
|
|
if (slop)
|
Add "--show-all" revision walker flag for debugging
It's really not very easy to visualize the commit walker, because - on
purpose - it obvously doesn't show the uninteresting commits!
This adds a "--show-all" flag to the revision walker, which will make
it show uninteresting commits too, and they'll have a '^' in front of
them (it also fixes a logic error for !verbose_header for boundary
commits - we should show the '-' even if left_right isn't shown).
A separate patch to gitk to teach it the new '^' was sent
to paulus. With the change in place, it actually is interesting
even for the cases that git doesn't have any problems with, ie
for the kernel you can do:
gitk -d --show-all v2.6.24..
and you see just how far down it has to parse things to see it all. The
use of "-d" is a good idea, since the date-ordered toposort is much better
at showing why it goes deep down (ie the date of some of those commits
after 2.6.24 is much older, because they were merged from trees that
weren't rebased).
So I think this is a useful feature even for non-debugging - just to
visualize what git does internally more.
When it actually breaks out due to the "everybody_uninteresting()"
case, it adds the uninteresting commits (both the one it's looking at
now, and the list of pending ones) to the list
This way, we really list *all* the commits we've looked at.
Because we now end up listing commits we may not even have been parsed
at all "show_log" and "show_commit" need to protect against commits
that don't have a commit buffer entry.
That second part is debatable just how it should work. Maybe we shouldn't
show such entries at all (with this patch those entries do get shown, they
just don't get any message shown with them). But I think this is a useful
case.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-02-10 06:02:07 +08:00
|
|
|
continue;
|
2008-03-18 09:56:33 +08:00
|
|
|
break;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
2020-07-04 20:56:32 +08:00
|
|
|
if (revs->min_age != -1 && (commit->date > revs->min_age) &&
|
|
|
|
!revs->line_level_traverse)
|
2006-03-01 03:24:00 +08:00
|
|
|
continue;
|
2022-04-23 20:59:57 +08:00
|
|
|
if (revs->max_age_as_filter != -1 &&
|
|
|
|
(commit->date < revs->max_age_as_filter) && !revs->line_level_traverse)
|
|
|
|
continue;
|
2008-03-18 09:56:33 +08:00
|
|
|
date = commit->date;
|
2006-03-01 03:24:00 +08:00
|
|
|
p = &commit_list_insert(commit, p)->next;
|
2007-11-04 02:11:10 +08:00
|
|
|
|
|
|
|
show = show_early_output;
|
|
|
|
if (!show)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
show(revs, newlist);
|
|
|
|
show_early_output = NULL;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
2011-03-07 20:31:40 +08:00
|
|
|
if (revs->cherry_pick || revs->cherry_mark)
|
2007-07-10 21:50:49 +08:00
|
|
|
cherry_pick_list(newlist, revs);
|
2007-04-09 18:40:38 +08:00
|
|
|
|
2011-02-22 00:09:11 +08:00
|
|
|
if (revs->left_only || revs->right_only)
|
|
|
|
limit_left_right(newlist, revs);
|
|
|
|
|
2022-08-19 12:28:10 +08:00
|
|
|
if (revs->ancestry_path)
|
|
|
|
limit_to_ancestry(revs->ancestry_path_bottoms, newlist);
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
|
2013-05-16 23:32:39 +08:00
|
|
|
/*
|
|
|
|
* Check if any commits have become TREESAME by some of their parents
|
|
|
|
* becoming UNINTERESTING.
|
|
|
|
*/
|
2021-04-25 22:16:08 +08:00
|
|
|
if (limiting_can_increase_treesame(revs)) {
|
|
|
|
struct commit_list *list = NULL;
|
2013-05-16 23:32:39 +08:00
|
|
|
for (list = newlist; list; list = list->next) {
|
|
|
|
struct commit *c = list->item;
|
|
|
|
if (c->object.flags & (UNINTERESTING | TREESAME))
|
|
|
|
continue;
|
|
|
|
update_treesame(revs, c);
|
|
|
|
}
|
2021-04-25 22:16:08 +08:00
|
|
|
}
|
2013-05-16 23:32:39 +08:00
|
|
|
|
2021-04-25 22:16:08 +08:00
|
|
|
free_commit_list(original_list);
|
2006-03-01 03:24:00 +08:00
|
|
|
revs->commits = newlist;
|
2007-05-05 05:54:57 +08:00
|
|
|
return 0;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
|
|
|
|
2013-05-25 17:08:02 +08:00
|
|
|
/*
|
|
|
|
* Add an entry to refs->cmdline with the specified information.
|
|
|
|
* *name is copied.
|
|
|
|
*/
|
2011-08-26 08:35:39 +08:00
|
|
|
static void add_rev_cmdline(struct rev_info *revs,
|
|
|
|
struct object *item,
|
|
|
|
const char *name,
|
|
|
|
int whence,
|
|
|
|
unsigned flags)
|
|
|
|
{
|
|
|
|
struct rev_cmdline_info *info = &revs->cmdline;
|
2017-09-22 00:49:38 +08:00
|
|
|
unsigned int nr = info->nr;
|
2011-08-26 08:35:39 +08:00
|
|
|
|
|
|
|
ALLOC_GROW(info->rev, nr + 1, info->alloc);
|
|
|
|
info->rev[nr].item = item;
|
2013-05-25 17:08:02 +08:00
|
|
|
info->rev[nr].name = xstrdup(name);
|
2011-08-26 08:35:39 +08:00
|
|
|
info->rev[nr].whence = whence;
|
|
|
|
info->rev[nr].flags = flags;
|
|
|
|
info->nr++;
|
|
|
|
}
|
|
|
|
|
2013-05-13 23:00:47 +08:00
|
|
|
static void add_rev_cmdline_list(struct rev_info *revs,
|
|
|
|
struct commit_list *commit_list,
|
|
|
|
int whence,
|
|
|
|
unsigned flags)
|
|
|
|
{
|
|
|
|
while (commit_list) {
|
|
|
|
struct object *object = &commit_list->item->object;
|
2015-11-10 10:22:28 +08:00
|
|
|
add_rev_cmdline(revs, object, oid_to_hex(&object->oid),
|
2013-05-13 23:00:47 +08:00
|
|
|
whence, flags);
|
|
|
|
commit_list = commit_list->next;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-11-17 13:46:51 +08:00
|
|
|
int ref_excluded(const struct ref_exclusions *exclusions, const char *path)
|
revision: introduce --exclude=<glob> to tame wildcards
People often find "git log --branches" etc. that includes _all_
branches is cumbersome to use when they want to grab most but except
some. The same applies to --tags, --all and --glob.
Teach the revision machinery to remember patterns, and then upon the
next such a globbing option, exclude those that match the pattern.
With this, I can view only my integration branches (e.g. maint,
master, etc.) without topic branches, which are named after two
letters from primary authors' names, slash and topic name.
git rev-list --no-walk --exclude=??/* --branches |
git name-rev --refs refs/heads/* --stdin
This one shows things reachable from local and remote branches that
have not been merged to the integration branches.
git log --remotes --branches --not --exclude=??/* --branches
It may be a bit rough around the edges, in that the pattern to give
the exclude option depends on what globbing option follows. In
these examples, the pattern "??/*" is used, not "refs/heads/??/*",
because the globbing option that follows the -"-exclude=<pattern>"
is "--branches". As each use of globbing option resets previously
set "--exclude", this may not be such a bad thing, though.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-31 07:37:55 +08:00
|
|
|
{
|
2022-11-17 13:46:56 +08:00
|
|
|
const char *stripped_path = strip_namespace(path);
|
revision: introduce --exclude=<glob> to tame wildcards
People often find "git log --branches" etc. that includes _all_
branches is cumbersome to use when they want to grab most but except
some. The same applies to --tags, --all and --glob.
Teach the revision machinery to remember patterns, and then upon the
next such a globbing option, exclude those that match the pattern.
With this, I can view only my integration branches (e.g. maint,
master, etc.) without topic branches, which are named after two
letters from primary authors' names, slash and topic name.
git rev-list --no-walk --exclude=??/* --branches |
git name-rev --refs refs/heads/* --stdin
This one shows things reachable from local and remote branches that
have not been merged to the integration branches.
git log --remotes --branches --not --exclude=??/* --branches
It may be a bit rough around the edges, in that the pattern to give
the exclude option depends on what globbing option follows. In
these examples, the pattern "??/*" is used, not "refs/heads/??/*",
because the globbing option that follows the -"-exclude=<pattern>"
is "--branches". As each use of globbing option resets previously
set "--exclude", this may not be such a bad thing, though.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-31 07:37:55 +08:00
|
|
|
struct string_list_item *item;
|
|
|
|
|
2022-11-17 13:46:51 +08:00
|
|
|
for_each_string_list_item(item, &exclusions->excluded_refs) {
|
2017-06-23 05:38:08 +08:00
|
|
|
if (!wildmatch(item->string, path, 0))
|
revision: introduce --exclude=<glob> to tame wildcards
People often find "git log --branches" etc. that includes _all_
branches is cumbersome to use when they want to grab most but except
some. The same applies to --tags, --all and --glob.
Teach the revision machinery to remember patterns, and then upon the
next such a globbing option, exclude those that match the pattern.
With this, I can view only my integration branches (e.g. maint,
master, etc.) without topic branches, which are named after two
letters from primary authors' names, slash and topic name.
git rev-list --no-walk --exclude=??/* --branches |
git name-rev --refs refs/heads/* --stdin
This one shows things reachable from local and remote branches that
have not been merged to the integration branches.
git log --remotes --branches --not --exclude=??/* --branches
It may be a bit rough around the edges, in that the pattern to give
the exclude option depends on what globbing option follows. In
these examples, the pattern "??/*" is used, not "refs/heads/??/*",
because the globbing option that follows the -"-exclude=<pattern>"
is "--branches". As each use of globbing option resets previously
set "--exclude", this may not be such a bad thing, though.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-31 07:37:55 +08:00
|
|
|
return 1;
|
|
|
|
}
|
2022-11-17 13:46:56 +08:00
|
|
|
|
|
|
|
if (ref_is_hidden(stripped_path, path, &exclusions->hidden_refs))
|
|
|
|
return 1;
|
|
|
|
|
revision: introduce --exclude=<glob> to tame wildcards
People often find "git log --branches" etc. that includes _all_
branches is cumbersome to use when they want to grab most but except
some. The same applies to --tags, --all and --glob.
Teach the revision machinery to remember patterns, and then upon the
next such a globbing option, exclude those that match the pattern.
With this, I can view only my integration branches (e.g. maint,
master, etc.) without topic branches, which are named after two
letters from primary authors' names, slash and topic name.
git rev-list --no-walk --exclude=??/* --branches |
git name-rev --refs refs/heads/* --stdin
This one shows things reachable from local and remote branches that
have not been merged to the integration branches.
git log --remotes --branches --not --exclude=??/* --branches
It may be a bit rough around the edges, in that the pattern to give
the exclude option depends on what globbing option follows. In
these examples, the pattern "??/*" is used, not "refs/heads/??/*",
because the globbing option that follows the -"-exclude=<pattern>"
is "--branches". As each use of globbing option resets previously
set "--exclude", this may not be such a bad thing, though.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-31 07:37:55 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2022-11-17 13:46:51 +08:00
|
|
|
void init_ref_exclusions(struct ref_exclusions *exclusions)
|
2022-11-17 13:46:47 +08:00
|
|
|
{
|
2022-11-17 13:46:51 +08:00
|
|
|
struct ref_exclusions blank = REF_EXCLUSIONS_INIT;
|
|
|
|
memcpy(exclusions, &blank, sizeof(*exclusions));
|
2022-11-17 13:46:47 +08:00
|
|
|
}
|
|
|
|
|
2022-11-17 13:46:51 +08:00
|
|
|
void clear_ref_exclusions(struct ref_exclusions *exclusions)
|
2022-11-17 13:46:47 +08:00
|
|
|
{
|
2022-11-17 13:46:51 +08:00
|
|
|
string_list_clear(&exclusions->excluded_refs, 0);
|
2023-07-11 05:12:33 +08:00
|
|
|
strvec_clear(&exclusions->hidden_refs);
|
2022-11-17 13:46:56 +08:00
|
|
|
exclusions->hidden_refs_configured = 0;
|
2022-11-17 13:46:51 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void add_ref_exclusion(struct ref_exclusions *exclusions, const char *exclude)
|
|
|
|
{
|
|
|
|
string_list_append(&exclusions->excluded_refs, exclude);
|
2022-11-17 13:46:47 +08:00
|
|
|
}
|
|
|
|
|
2022-11-17 13:46:56 +08:00
|
|
|
struct exclude_hidden_refs_cb {
|
|
|
|
struct ref_exclusions *exclusions;
|
|
|
|
const char *section;
|
|
|
|
};
|
|
|
|
|
config: add ctx arg to config_fn_t
Add a new "const struct config_context *ctx" arg to config_fn_t to hold
additional information about the config iteration operation.
config_context has a "struct key_value_info kvi" member that holds
metadata about the config source being read (e.g. what kind of config
source it is, the filename, etc). In this series, we're only interested
in .kvi, so we could have just used "struct key_value_info" as an arg,
but config_context makes it possible to add/adjust members in the future
without changing the config_fn_t signature. We could also consider other
ways of organizing the args (e.g. moving the config name and value into
config_context or key_value_info), but in my experiments, the
incremental benefit doesn't justify the added complexity (e.g. a
config_fn_t will sometimes invoke another config_fn_t but with a
different config value).
In subsequent commits, the .kvi member will replace the global "struct
config_reader" in config.c, making config iteration a global-free
operation. It requires much more work for the machinery to provide
meaningful values of .kvi, so for now, merely change the signature and
call sites, pass NULL as a placeholder value, and don't rely on the arg
in any meaningful way.
Most of the changes are performed by
contrib/coccinelle/config_fn_ctx.pending.cocci, which, for every
config_fn_t:
- Modifies the signature to accept "const struct config_context *ctx"
- Passes "ctx" to any inner config_fn_t, if needed
- Adds UNUSED attributes to "ctx", if needed
Most config_fn_t instances are easily identified by seeing if they are
called by the various config functions. Most of the remaining ones are
manually named in the .cocci patch. Manual cleanups are still needed,
but the majority of it is trivial; it's either adjusting config_fn_t
that the .cocci patch didn't catch, or adding forward declarations of
"struct config_context ctx" to make the signatures make sense.
The non-trivial changes are in cases where we are invoking a config_fn_t
outside of config machinery, and we now need to decide what value of
"ctx" to pass. These cases are:
- trace2/tr2_cfg.c:tr2_cfg_set_fl()
This is indirectly called by git_config_set() so that the trace2
machinery can notice the new config values and update its settings
using the tr2 config parsing function, i.e. tr2_cfg_cb().
- builtin/checkout.c:checkout_main()
This calls git_xmerge_config() as a shorthand for parsing a CLI arg.
This might be worth refactoring away in the future, since
git_xmerge_config() can call git_default_config(), which can do much
more than just parsing.
Handle them by creating a KVI_INIT macro that initializes "struct
key_value_info" to a reasonable default, and use that to construct the
"ctx" arg.
Signed-off-by: Glen Choo <chooglen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-29 03:26:22 +08:00
|
|
|
static int hide_refs_config(const char *var, const char *value,
|
|
|
|
const struct config_context *ctx UNUSED,
|
|
|
|
void *cb_data)
|
2022-11-17 13:46:56 +08:00
|
|
|
{
|
|
|
|
struct exclude_hidden_refs_cb *cb = cb_data;
|
|
|
|
cb->exclusions->hidden_refs_configured = 1;
|
|
|
|
return parse_hide_refs_config(var, value, cb->section,
|
|
|
|
&cb->exclusions->hidden_refs);
|
|
|
|
}
|
|
|
|
|
|
|
|
void exclude_hidden_refs(struct ref_exclusions *exclusions, const char *section)
|
|
|
|
{
|
|
|
|
struct exclude_hidden_refs_cb cb;
|
|
|
|
|
2023-02-12 17:04:26 +08:00
|
|
|
if (strcmp(section, "fetch") && strcmp(section, "receive") &&
|
|
|
|
strcmp(section, "uploadpack"))
|
2022-11-17 13:46:56 +08:00
|
|
|
die(_("unsupported section for hidden refs: %s"), section);
|
|
|
|
|
|
|
|
if (exclusions->hidden_refs_configured)
|
|
|
|
die(_("--exclude-hidden= passed more than once"));
|
|
|
|
|
|
|
|
cb.exclusions = exclusions;
|
|
|
|
cb.section = section;
|
|
|
|
|
|
|
|
git_config(hide_refs_config, &cb);
|
|
|
|
}
|
|
|
|
|
2022-11-17 13:46:47 +08:00
|
|
|
struct all_refs_cb {
|
|
|
|
int all_flags;
|
|
|
|
int warned_bad_reflog;
|
|
|
|
struct rev_info *all_revs;
|
|
|
|
const char *name_for_errormsg;
|
|
|
|
struct worktree *wt;
|
|
|
|
};
|
|
|
|
|
2024-08-09 23:37:50 +08:00
|
|
|
static int handle_one_ref(const char *path, const char *referent UNUSED, const struct object_id *oid,
|
2022-08-26 01:09:48 +08:00
|
|
|
int flag UNUSED,
|
2022-08-19 18:08:32 +08:00
|
|
|
void *cb_data)
|
2006-02-26 08:19:46 +08:00
|
|
|
{
|
2006-12-19 09:25:28 +08:00
|
|
|
struct all_refs_cb *cb = cb_data;
|
revision: introduce --exclude=<glob> to tame wildcards
People often find "git log --branches" etc. that includes _all_
branches is cumbersome to use when they want to grab most but except
some. The same applies to --tags, --all and --glob.
Teach the revision machinery to remember patterns, and then upon the
next such a globbing option, exclude those that match the pattern.
With this, I can view only my integration branches (e.g. maint,
master, etc.) without topic branches, which are named after two
letters from primary authors' names, slash and topic name.
git rev-list --no-walk --exclude=??/* --branches |
git name-rev --refs refs/heads/* --stdin
This one shows things reachable from local and remote branches that
have not been merged to the integration branches.
git log --remotes --branches --not --exclude=??/* --branches
It may be a bit rough around the edges, in that the pattern to give
the exclude option depends on what globbing option follows. In
these examples, the pattern "??/*" is used, not "refs/heads/??/*",
because the globbing option that follows the -"-exclude=<pattern>"
is "--branches". As each use of globbing option resets previously
set "--exclude", this may not be such a bad thing, though.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-31 07:37:55 +08:00
|
|
|
struct object *object;
|
|
|
|
|
2022-11-17 13:46:51 +08:00
|
|
|
if (ref_excluded(&cb->all_revs->ref_excludes, path))
|
revision: introduce --exclude=<glob> to tame wildcards
People often find "git log --branches" etc. that includes _all_
branches is cumbersome to use when they want to grab most but except
some. The same applies to --tags, --all and --glob.
Teach the revision machinery to remember patterns, and then upon the
next such a globbing option, exclude those that match the pattern.
With this, I can view only my integration branches (e.g. maint,
master, etc.) without topic branches, which are named after two
letters from primary authors' names, slash and topic name.
git rev-list --no-walk --exclude=??/* --branches |
git name-rev --refs refs/heads/* --stdin
This one shows things reachable from local and remote branches that
have not been merged to the integration branches.
git log --remotes --branches --not --exclude=??/* --branches
It may be a bit rough around the edges, in that the pattern to give
the exclude option depends on what globbing option follows. In
these examples, the pattern "??/*" is used, not "refs/heads/??/*",
because the globbing option that follows the -"-exclude=<pattern>"
is "--branches". As each use of globbing option resets previously
set "--exclude", this may not be such a bad thing, though.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-31 07:37:55 +08:00
|
|
|
return 0;
|
|
|
|
|
2017-05-07 06:10:27 +08:00
|
|
|
object = get_reference(cb->all_revs, path, oid, cb->all_flags);
|
2011-08-26 08:35:39 +08:00
|
|
|
add_rev_cmdline(cb->all_revs, object, path, REV_CMD_REF, cb->all_flags);
|
2021-08-09 16:11:54 +08:00
|
|
|
add_pending_object(cb->all_revs, object, path);
|
2006-02-26 08:19:46 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-01-20 17:48:25 +08:00
|
|
|
static void init_all_refs_cb(struct all_refs_cb *cb, struct rev_info *revs,
|
|
|
|
unsigned flags)
|
|
|
|
{
|
|
|
|
cb->all_revs = revs;
|
|
|
|
cb->all_flags = flags;
|
2017-08-03 06:25:27 +08:00
|
|
|
revs->rev_input_given = 1;
|
2018-10-21 16:08:56 +08:00
|
|
|
cb->wt = NULL;
|
2010-01-20 17:48:25 +08:00
|
|
|
}
|
|
|
|
|
2017-08-23 20:36:56 +08:00
|
|
|
static void handle_refs(struct ref_store *refs,
|
|
|
|
struct rev_info *revs, unsigned flags,
|
|
|
|
int (*for_each)(struct ref_store *, each_ref_fn, void *))
|
2006-02-26 08:19:46 +08:00
|
|
|
{
|
2006-12-19 09:25:28 +08:00
|
|
|
struct all_refs_cb cb;
|
2017-08-23 20:36:56 +08:00
|
|
|
|
|
|
|
if (!refs) {
|
|
|
|
/* this could happen with uninitialized submodules */
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-01-20 17:48:25 +08:00
|
|
|
init_all_refs_cb(&cb, revs, flags);
|
2017-08-23 20:36:56 +08:00
|
|
|
for_each(refs, handle_one_ref, &cb);
|
2006-12-19 09:25:28 +08:00
|
|
|
}
|
|
|
|
|
2017-02-22 07:47:32 +08:00
|
|
|
static void handle_one_reflog_commit(struct object_id *oid, void *cb_data)
|
2006-12-19 09:25:28 +08:00
|
|
|
{
|
|
|
|
struct all_refs_cb *cb = cb_data;
|
2017-02-22 07:47:32 +08:00
|
|
|
if (!is_null_oid(oid)) {
|
2018-09-21 23:57:39 +08:00
|
|
|
struct object *o = parse_object(cb->all_revs->repo, oid);
|
2006-12-22 08:49:06 +08:00
|
|
|
if (o) {
|
|
|
|
o->flags |= cb->all_flags;
|
2011-08-26 08:35:39 +08:00
|
|
|
/* ??? CMDLINEFLAGS ??? */
|
2006-12-22 08:49:06 +08:00
|
|
|
add_pending_object(cb->all_revs, o, "");
|
|
|
|
}
|
|
|
|
else if (!cb->warned_bad_reflog) {
|
2007-03-31 07:07:05 +08:00
|
|
|
warning("reflog of '%s' references pruned commits",
|
2006-12-22 08:49:06 +08:00
|
|
|
cb->name_for_errormsg);
|
|
|
|
cb->warned_bad_reflog = 1;
|
|
|
|
}
|
2006-12-19 09:25:28 +08:00
|
|
|
}
|
2006-12-22 08:49:06 +08:00
|
|
|
}
|
|
|
|
|
2017-02-22 07:47:32 +08:00
|
|
|
static int handle_one_reflog_ent(struct object_id *ooid, struct object_id *noid,
|
2022-08-26 01:09:48 +08:00
|
|
|
const char *email UNUSED,
|
|
|
|
timestamp_t timestamp UNUSED,
|
|
|
|
int tz UNUSED,
|
|
|
|
const char *message UNUSED,
|
2022-08-19 18:08:35 +08:00
|
|
|
void *cb_data)
|
2006-12-22 08:49:06 +08:00
|
|
|
{
|
2017-02-22 07:47:32 +08:00
|
|
|
handle_one_reflog_commit(ooid, cb_data);
|
|
|
|
handle_one_reflog_commit(noid, cb_data);
|
2006-12-19 09:25:28 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2024-02-21 20:37:39 +08:00
|
|
|
static int handle_one_reflog(const char *refname_in_wt, void *cb_data)
|
2006-12-19 09:25:28 +08:00
|
|
|
{
|
|
|
|
struct all_refs_cb *cb = cb_data;
|
2018-10-21 16:08:56 +08:00
|
|
|
struct strbuf refname = STRBUF_INIT;
|
|
|
|
|
2006-12-22 08:49:06 +08:00
|
|
|
cb->warned_bad_reflog = 0;
|
2018-10-21 16:08:56 +08:00
|
|
|
strbuf_worktree_ref(cb->wt, &refname, refname_in_wt);
|
|
|
|
cb->name_for_errormsg = refname.buf;
|
|
|
|
refs_for_each_reflog_ent(get_main_ref_store(the_repository),
|
|
|
|
refname.buf,
|
2017-08-23 20:37:01 +08:00
|
|
|
handle_one_reflog_ent, cb_data);
|
2018-10-21 16:08:56 +08:00
|
|
|
strbuf_release(&refname);
|
2006-12-19 09:25:28 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-08-23 20:37:01 +08:00
|
|
|
static void add_other_reflogs_to_pending(struct all_refs_cb *cb)
|
|
|
|
{
|
|
|
|
struct worktree **worktrees, **p;
|
|
|
|
|
2020-06-20 07:35:44 +08:00
|
|
|
worktrees = get_worktrees();
|
2017-08-23 20:37:01 +08:00
|
|
|
for (p = worktrees; *p; p++) {
|
|
|
|
struct worktree *wt = *p;
|
|
|
|
|
|
|
|
if (wt->is_current)
|
|
|
|
continue;
|
|
|
|
|
2018-10-21 16:08:56 +08:00
|
|
|
cb->wt = wt;
|
|
|
|
refs_for_each_reflog(get_worktree_ref_store(wt),
|
2017-08-23 20:37:01 +08:00
|
|
|
handle_one_reflog,
|
|
|
|
cb);
|
|
|
|
}
|
|
|
|
free_worktrees(worktrees);
|
|
|
|
}
|
|
|
|
|
2014-10-16 06:38:31 +08:00
|
|
|
void add_reflogs_to_pending(struct rev_info *revs, unsigned flags)
|
2006-12-19 09:25:28 +08:00
|
|
|
{
|
|
|
|
struct all_refs_cb cb;
|
2015-05-26 02:38:28 +08:00
|
|
|
|
2006-12-19 09:25:28 +08:00
|
|
|
cb.all_revs = revs;
|
|
|
|
cb.all_flags = flags;
|
2018-10-21 16:08:56 +08:00
|
|
|
cb.wt = NULL;
|
2024-05-07 15:11:53 +08:00
|
|
|
refs_for_each_reflog(get_main_ref_store(the_repository),
|
|
|
|
handle_one_reflog, &cb);
|
2017-08-23 20:37:01 +08:00
|
|
|
|
|
|
|
if (!revs->single_worktree)
|
|
|
|
add_other_reflogs_to_pending(&cb);
|
2006-02-26 08:19:46 +08:00
|
|
|
}
|
|
|
|
|
2014-10-17 08:44:23 +08:00
|
|
|
static void add_cache_tree(struct cache_tree *it, struct rev_info *revs,
|
2018-11-02 13:22:59 +08:00
|
|
|
struct strbuf *path, unsigned int flags)
|
2014-10-17 08:44:23 +08:00
|
|
|
{
|
|
|
|
size_t baselen = path->len;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (it->entry_count >= 0) {
|
2018-09-21 23:57:39 +08:00
|
|
|
struct tree *tree = lookup_tree(revs->repo, &it->oid);
|
2018-11-02 13:22:59 +08:00
|
|
|
tree->object.flags |= flags;
|
2014-10-17 08:44:23 +08:00
|
|
|
add_pending_object_with_path(revs, &tree->object, "",
|
|
|
|
040000, path->buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < it->subtree_nr; i++) {
|
|
|
|
struct cache_tree_sub *sub = it->down[i];
|
|
|
|
strbuf_addf(path, "%s%s", baselen ? "/" : "", sub->name);
|
2018-11-02 13:22:59 +08:00
|
|
|
add_cache_tree(sub->cache_tree, revs, path, flags);
|
2014-10-17 08:44:23 +08:00
|
|
|
strbuf_setlen(path, baselen);
|
|
|
|
}
|
|
|
|
|
|
|
|
}
|
|
|
|
|
2022-06-10 07:44:20 +08:00
|
|
|
static void add_resolve_undo_to_pending(struct index_state *istate, struct rev_info *revs)
|
|
|
|
{
|
|
|
|
struct string_list_item *item;
|
|
|
|
struct string_list *resolve_undo = istate->resolve_undo;
|
|
|
|
|
|
|
|
if (!resolve_undo)
|
|
|
|
return;
|
|
|
|
|
|
|
|
for_each_string_list_item(item, resolve_undo) {
|
|
|
|
const char *path = item->string;
|
|
|
|
struct resolve_undo_info *ru = item->util;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (!ru)
|
|
|
|
continue;
|
|
|
|
for (i = 0; i < 3; i++) {
|
|
|
|
struct blob *blob;
|
|
|
|
|
|
|
|
if (!ru->mode[i] || !S_ISREG(ru->mode[i]))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
blob = lookup_blob(revs->repo, &ru->oid[i]);
|
|
|
|
if (!blob) {
|
|
|
|
warning(_("resolve-undo records `%s` which is missing"),
|
|
|
|
oid_to_hex(&ru->oid[i]));
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
add_pending_object_with_path(revs, &blob->object, "",
|
|
|
|
ru->mode[i], path);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-08-23 20:36:51 +08:00
|
|
|
static void do_add_index_objects_to_pending(struct rev_info *revs,
|
2018-11-02 13:22:59 +08:00
|
|
|
struct index_state *istate,
|
|
|
|
unsigned int flags)
|
2014-10-17 08:44:23 +08:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2021-04-01 09:50:00 +08:00
|
|
|
/* TODO: audit for interaction with sparse-index. */
|
|
|
|
ensure_full_index(istate);
|
2017-08-23 20:36:51 +08:00
|
|
|
for (i = 0; i < istate->cache_nr; i++) {
|
|
|
|
struct cache_entry *ce = istate->cache[i];
|
2014-10-17 08:44:23 +08:00
|
|
|
struct blob *blob;
|
|
|
|
|
|
|
|
if (S_ISGITLINK(ce->ce_mode))
|
|
|
|
continue;
|
|
|
|
|
2018-09-21 23:57:39 +08:00
|
|
|
blob = lookup_blob(revs->repo, &ce->oid);
|
2014-10-17 08:44:23 +08:00
|
|
|
if (!blob)
|
|
|
|
die("unable to add index blob to traversal");
|
2018-11-02 13:22:59 +08:00
|
|
|
blob->object.flags |= flags;
|
2014-10-17 08:44:23 +08:00
|
|
|
add_pending_object_with_path(revs, &blob->object, "",
|
|
|
|
ce->ce_mode, ce->name);
|
|
|
|
}
|
|
|
|
|
2017-08-23 20:36:51 +08:00
|
|
|
if (istate->cache_tree) {
|
2014-10-17 08:44:23 +08:00
|
|
|
struct strbuf path = STRBUF_INIT;
|
2018-11-02 13:22:59 +08:00
|
|
|
add_cache_tree(istate->cache_tree, revs, &path, flags);
|
2014-10-17 08:44:23 +08:00
|
|
|
strbuf_release(&path);
|
|
|
|
}
|
2022-06-10 07:44:20 +08:00
|
|
|
|
|
|
|
add_resolve_undo_to_pending(istate, revs);
|
2014-10-17 08:44:23 +08:00
|
|
|
}
|
|
|
|
|
2017-08-23 20:36:51 +08:00
|
|
|
void add_index_objects_to_pending(struct rev_info *revs, unsigned int flags)
|
|
|
|
{
|
2017-08-23 20:36:52 +08:00
|
|
|
struct worktree **worktrees, **p;
|
|
|
|
|
2019-01-12 10:13:26 +08:00
|
|
|
repo_read_index(revs->repo);
|
2018-11-02 13:22:59 +08:00
|
|
|
do_add_index_objects_to_pending(revs, revs->repo->index, flags);
|
2017-08-23 20:36:52 +08:00
|
|
|
|
|
|
|
if (revs->single_worktree)
|
|
|
|
return;
|
|
|
|
|
2020-06-20 07:35:44 +08:00
|
|
|
worktrees = get_worktrees();
|
2017-08-23 20:36:52 +08:00
|
|
|
for (p = worktrees; *p; p++) {
|
|
|
|
struct worktree *wt = *p;
|
treewide: always have a valid "index_state.repo" member
When the "repo" member was added to "the_index" in [1] the
repo_read_index() was made to populate it, but the unpopulated
"the_index" variable didn't get the same treatment.
Let's do that in initialize_the_repository() when we set it up, and
likewise for all of the current callers initialized an empty "struct
index_state".
This simplifies code that needs to deal with "the_index" or a custom
"struct index_state", we no longer need to second-guess this part of
the "index_state" deep in the stack. A recent example of such
second-guessing is the "istate->repo ? istate->repo : the_repository"
code in [2]. We can now simply use "istate->repo".
We're doing this by making use of the INDEX_STATE_INIT() macro (and
corresponding function) added in [3], which now have mandatory "repo"
arguments.
Because we now call index_state_init() in repository.c's
initialize_the_repository() we don't need to handle the case where we
have a "repo->index" whose "repo" member doesn't match the "repo"
we're setting up, i.e. the "Complete the double-reference" code in
repo_read_index() being altered here. That logic was originally added
in [1], and was working around the lack of what we now have in
initialize_the_repository().
For "fsmonitor-settings.c" we can remove the initialization of a NULL
"r" argument to "the_repository". This was added back in [4], and was
needed at the time for callers that would pass us the "r" from an
"istate->repo". Before this change such a change to
"fsmonitor-settings.c" would segfault all over the test suite (e.g. in
t0002-gitfile.sh).
This change has wider eventual implications for
"fsmonitor-settings.c". The reason the other lazy loading behavior in
it is required (starting with "if (!r->settings.fsmonitor) ..." is
because of the previously passed "r" being "NULL".
I have other local changes on top of this which move its configuration
reading to "prepare_repo_settings()" in "repo-settings.c", as we could
now start to rely on it being called for our "r". But let's leave all
of that for now, and narrowly remove this particular part of the
lazy-loading.
1. 1fd9ae517c4 (repository: add repo reference to index_state,
2021-01-23)
2. ee1f0c242ef (read-cache: add index.skipHash config option,
2023-01-06)
3. 2f6b1eb794e (cache API: add a "INDEX_STATE_INIT" macro/function,
add release_index(), 2023-01-12)
4. 1e0ea5c4316 (fsmonitor: config settings are repository-specific,
2022-03-25)
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Acked-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-17 21:57:00 +08:00
|
|
|
struct index_state istate = INDEX_STATE_INIT(revs->repo);
|
2017-08-23 20:36:52 +08:00
|
|
|
|
|
|
|
if (wt->is_current)
|
|
|
|
continue; /* current index already taken care of */
|
|
|
|
|
|
|
|
if (read_index_from(&istate,
|
2024-08-13 17:13:37 +08:00
|
|
|
worktree_git_path(the_repository, wt, "index"),
|
read-cache: fix reading the shared index for other repos
read_index_from() takes a path argument for the location of the index
file. For reading the shared index in split index mode however it just
ignores that path argument, and reads it from the gitdir of the current
repository.
This works as long as an index in the_repository is read. Once that
changes, such as when we read the index of a submodule, or of a
different working tree than the current one, the gitdir of
the_repository will no longer contain the appropriate shared index,
and git will fail to read it.
For example t3007-ls-files-recurse-submodules.sh was broken with
GIT_TEST_SPLIT_INDEX set in 188dce131f ("ls-files: use repository
object", 2017-06-22), and t7814-grep-recurse-submodules.sh was also
broken in a similar manner, probably by introducing struct repository
there, although I didn't track down the exact commit for that.
be489d02d2 ("revision.c: --indexed-objects add objects from all
worktrees", 2017-08-23) breaks with split index mode in a similar
manner, not erroring out when it can't read the index, but instead
carrying on with pruning, without taking the index of the worktree into
account.
Fix this by passing an additional gitdir parameter to read_index_from,
to indicate where it should look for and read the shared index from.
read_cache_from() defaults to using the gitdir of the_repository. As it
is mostly a convenience macro, having to pass get_git_dir() for every
call seems overkill, and if necessary users can have more control by
using read_index_from().
Helped-by: Brandon Williams <bmwill@google.com>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-01-08 06:30:13 +08:00
|
|
|
get_worktree_git_dir(wt)) > 0)
|
2018-11-02 13:22:59 +08:00
|
|
|
do_add_index_objects_to_pending(revs, &istate, flags);
|
2017-08-23 20:36:52 +08:00
|
|
|
discard_index(&istate);
|
|
|
|
}
|
|
|
|
free_worktrees(worktrees);
|
2017-08-23 20:36:51 +08:00
|
|
|
}
|
|
|
|
|
2019-07-01 21:18:15 +08:00
|
|
|
struct add_alternate_refs_data {
|
|
|
|
struct rev_info *revs;
|
|
|
|
unsigned int flags;
|
|
|
|
};
|
|
|
|
|
|
|
|
static void add_one_alternate_ref(const struct object_id *oid,
|
|
|
|
void *vdata)
|
|
|
|
{
|
|
|
|
const char *name = ".alternate";
|
|
|
|
struct add_alternate_refs_data *data = vdata;
|
|
|
|
struct object *obj;
|
|
|
|
|
|
|
|
obj = get_reference(data->revs, name, oid, data->flags);
|
|
|
|
add_rev_cmdline(data->revs, obj, name, REV_CMD_REV, data->flags);
|
|
|
|
add_pending_object(data->revs, obj, name);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void add_alternate_refs_to_pending(struct rev_info *revs,
|
|
|
|
unsigned int flags)
|
|
|
|
{
|
|
|
|
struct add_alternate_refs_data data;
|
|
|
|
data.revs = revs;
|
|
|
|
data.flags = flags;
|
|
|
|
for_each_alternate_ref(add_one_alternate_ref, &data);
|
|
|
|
}
|
|
|
|
|
2016-09-27 16:32:49 +08:00
|
|
|
static int add_parents_only(struct rev_info *revs, const char *arg_, int flags,
|
|
|
|
int exclude_parent)
|
2006-04-30 15:54:29 +08:00
|
|
|
{
|
2017-05-07 06:10:27 +08:00
|
|
|
struct object_id oid;
|
2006-04-30 15:54:29 +08:00
|
|
|
struct object *it;
|
|
|
|
struct commit *commit;
|
|
|
|
struct commit_list *parents;
|
2016-09-27 16:32:49 +08:00
|
|
|
int parent_number;
|
2011-08-26 08:35:39 +08:00
|
|
|
const char *arg = arg_;
|
2006-04-30 15:54:29 +08:00
|
|
|
|
|
|
|
if (*arg == '^') {
|
2013-05-16 23:32:38 +08:00
|
|
|
flags ^= UNINTERESTING | BOTTOM;
|
2006-04-30 15:54:29 +08:00
|
|
|
arg++;
|
|
|
|
}
|
2023-03-28 21:58:46 +08:00
|
|
|
if (repo_get_oid_committish(the_repository, arg, &oid))
|
2006-04-30 15:54:29 +08:00
|
|
|
return 0;
|
|
|
|
while (1) {
|
2017-05-07 06:10:27 +08:00
|
|
|
it = get_reference(revs, arg, &oid, 0);
|
2011-05-19 09:08:09 +08:00
|
|
|
if (!it && revs->ignore_missing)
|
|
|
|
return 0;
|
2006-07-12 11:45:31 +08:00
|
|
|
if (it->type != OBJ_TAG)
|
2006-04-30 15:54:29 +08:00
|
|
|
break;
|
2008-02-19 04:48:01 +08:00
|
|
|
if (!((struct tag*)it)->tagged)
|
|
|
|
return 0;
|
2017-05-07 06:10:27 +08:00
|
|
|
oidcpy(&oid, &((struct tag*)it)->tagged->oid);
|
2006-04-30 15:54:29 +08:00
|
|
|
}
|
2006-07-12 11:45:31 +08:00
|
|
|
if (it->type != OBJ_COMMIT)
|
2006-04-30 15:54:29 +08:00
|
|
|
return 0;
|
|
|
|
commit = (struct commit *)it;
|
2016-09-27 16:32:49 +08:00
|
|
|
if (exclude_parent &&
|
|
|
|
exclude_parent > commit_list_count(commit->parents))
|
|
|
|
return 0;
|
|
|
|
for (parents = commit->parents, parent_number = 1;
|
|
|
|
parents;
|
|
|
|
parents = parents->next, parent_number++) {
|
|
|
|
if (exclude_parent && parent_number != exclude_parent)
|
|
|
|
continue;
|
|
|
|
|
2006-04-30 15:54:29 +08:00
|
|
|
it = &parents->item->object;
|
|
|
|
it->flags |= flags;
|
2011-08-26 08:35:39 +08:00
|
|
|
add_rev_cmdline(revs, it, arg_, REV_CMD_PARENTS_ONLY, flags);
|
2006-04-30 15:54:29 +08:00
|
|
|
add_pending_object(revs, it, arg);
|
|
|
|
}
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2018-09-21 23:57:38 +08:00
|
|
|
void repo_init_revisions(struct repository *r,
|
|
|
|
struct rev_info *revs,
|
|
|
|
const char *prefix)
|
2006-03-10 17:21:39 +08:00
|
|
|
{
|
2022-11-08 22:02:57 +08:00
|
|
|
struct rev_info blank = REV_INFO_INIT;
|
|
|
|
memcpy(revs, &blank, sizeof(*revs));
|
2006-04-15 13:19:38 +08:00
|
|
|
|
2018-09-21 23:57:38 +08:00
|
|
|
revs->repo = r;
|
2018-11-19 00:47:57 +08:00
|
|
|
revs->pruning.repo = r;
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
revs->pruning.add_remove = file_add_remove;
|
|
|
|
revs->pruning.change = file_change;
|
revision: quit pruning diff more quickly when possible
When the revision traversal machinery is given a pathspec,
we must compute the parent-diff for each commit to determine
which ones are TREESAME. We set the QUICK diff flag to avoid
looking at more entries than we need; we really just care
whether there are any changes at all.
But there is one case where we want to know a bit more: if
--remove-empty is set, we care about finding cases where the
change consists only of added entries (in which case we may
prune the parent in try_to_simplify_commit()). To cover that
case, our file_add_remove() callback does not quit the diff
upon seeing an added entry; it keeps looking for other types
of entries.
But this means when --remove-empty is not set (and it is not
by default), we compute more of the diff than is necessary.
You can see this in a pathological case where a commit adds
a very large number of entries, and we limit based on a
broad pathspec. E.g.:
perl -e '
chomp(my $blob = `git hash-object -w --stdin </dev/null`);
for my $a (1..1000) {
for my $b (1..1000) {
print "100644 $blob\t$a/$b\n";
}
}
' | git update-index --index-info
git commit -qm add
git rev-list HEAD -- .
This case takes about 100ms now, but after this patch only
needs 6ms. That's not a huge improvement, but it's easy to
get and it protects us against even more pathological cases
(e.g., going from 1 million to 10 million files would take
ten times as long with the current code, but not increase at
all after this patch).
This is reported to minorly speed-up pathspec limiting in
real world repositories (like the 100-million-file Windows
repository), but probably won't make a noticeable difference
outside of pathological setups.
This patch actually covers the case without --remove-empty,
and the case where we see only deletions. See the in-code
comment for details.
Note that we have to add a new member to the diff_options
struct so that our callback can see the value of
revs->remove_empty_trees. This callback parameter could be
passed to the "add_remove" and "change" callbacks, but
there's not much point. They already receive the
diff_options struct, and doing it this way avoids having to
update the function signature of the other callbacks
(arguably the format_callback and output_prefix functions
could benefit from the same simplification).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-13 23:27:45 +08:00
|
|
|
revs->pruning.change_fn_data = revs;
|
2006-07-29 12:21:48 +08:00
|
|
|
revs->prefix = prefix;
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
|
built-ins: trust the "prefix" from run_builtin()
Change code in "builtin/grep.c" and "builtin/ls-tree.c" to trust the
"prefix" passed from "run_builtin()". The "prefix" we get from setup.c
is either going to be NULL or a string of length >0, never "".
So we can drop the "prefix && *prefix" checks added for
"builtin/grep.c" in 0d042fecf2f (git-grep: show pathnames relative to
the current directory, 2006-08-11), and for "builtin/ls-tree.c" in
a69dd585fca (ls-tree: chomp leading directories when run from a
subdirectory, 2005-12-23).
As seen in code in revision.c that was added in cd676a51367 (diff
--relative: output paths as relative to the current subdirectory,
2008-02-12) we already have existing code that does away with this
assertion.
This makes it easier to reason about a subsequent change to the
"prefix_length" code in grep.c in a subsequent commit, and since we're
going to the trouble of doing that let's leave behind an assert() to
promise this to any future callers.
For "builtin/grep.c" it would be painful to pass the "prefix" down the
callchain of:
cmd_grep -> grep_tree -> grep_submodule -> grep_cache -> grep_oid ->
grep_source_name
So for the code that needs it in grep_source_name() let's add a
"grep_prefix" variable similar to the existing "ls_tree_prefix".
While at it let's move the code in cmd_ls_tree() around so that we
assign to the "ls_tree_prefix" right after declaring the variables,
and stop assigning to "prefix". We only subsequently used that
variable later in the function after clobbering it. Let's just use our
own "grep_prefix" instead.
Let's also add an assert() in git.c, so that we'll make this promise
about the "prefix" to any current and future callers, as well as to
any readers of the code.
Code history:
* The strlen() in "grep.c" hasn't been used since 493b7a08d80 (grep:
accept relative paths outside current working directory, 2009-09-05).
When that code was added in 0d042fecf2f (git-grep: show pathnames
relative to the current directory, 2006-08-11) we used the length.
But since 493b7a08d80 we haven't used it for anything except a
boolean check that we could have done on the "prefix" member
itself.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-16 08:00:34 +08:00
|
|
|
grep_init(&revs->grep_filter, revs->repo);
|
2008-08-25 14:15:05 +08:00
|
|
|
revs->grep_filter.status_only = 1;
|
|
|
|
|
2018-09-21 23:57:38 +08:00
|
|
|
repo_diff_setup(revs->repo, &revs->diffopt);
|
2008-02-13 16:34:39 +08:00
|
|
|
if (prefix && !revs->diffopt.prefix) {
|
diff --relative: output paths as relative to the current subdirectory
This adds --relative option to the diff family. When you start
from a subdirectory:
$ git diff --relative
shows only the diff that is inside your current subdirectory,
and without $prefix part. People who usually live in
subdirectories may like it.
There are a few things I should also mention about the change:
- This works not just with diff but also works with the log
family of commands, but the history pruning is not affected.
In other words, if you go to a subdirectory, you can say:
$ git log --relative -p
but it will show the log message even for commits that do not
touch the current directory. You can limit it by giving
pathspec yourself:
$ git log --relative -p .
This originally was not a conscious design choice, but we
have a way to affect diff pathspec and pruning pathspec
independently. IOW "git log --full-diff -p ." tells it to
prune history to commits that affect the current subdirectory
but show the changes with full context. I think it makes
more sense to leave pruning independent from --relative than
the obvious alternative of always pruning with the current
subdirectory, which would break the symmetry.
- Because this works also with the log family, you could
format-patch a single change, limiting the effect to your
subdirectory, like so:
$ cd gitk-git
$ git format-patch -1 --relative 911f1eb
But because that is a special purpose usage, this option will
never become the default, with or without repository or user
preference configuration. The risk of producing a partial
patch and sending it out by mistake is too great if we did
so.
- This is inherently incompatible with --no-index, which is a
bolted-on hack that does not have much to do with git
itself. I didn't bother checking and erroring out on the
combined use of the options, but probably I should.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-02-13 06:26:02 +08:00
|
|
|
revs->diffopt.prefix = prefix;
|
|
|
|
revs->diffopt.prefix_length = strlen(prefix);
|
|
|
|
}
|
2011-03-30 04:57:27 +08:00
|
|
|
|
2019-12-09 21:10:41 +08:00
|
|
|
init_display_notes(&revs->notes_opt);
|
list-objects-filter: add and use initializers
In 7e2619d8ff (list_objects_filter_options: plug leak of filter_spec
strings, 2022-09-08), we noted that the filter_spec string_list was
inconsistent in how it handled memory ownership of strings stored in the
list. The fix there was a bit of a band-aid to set the "strdup_strings"
variable right before adding anything.
That works OK, and it lets the users of the API continue to
zero-initialize the struct. But it makes the code a bit hard to follow
and accident-prone, as any other spots appending the filter_spec need to
think about whether to set the strdup_strings value, too (there's one
such spot in partial_clone_get_default_filter_spec(), which is probably
a possible memory leak).
So let's do that full cleanup now. We'll introduce a
LIST_OBJECTS_FILTER_INIT macro and matching function, and use them as
appropriate (though it is for the "_options" struct, this matches the
corresponding list_objects_filter_release() function).
This is harder than it seems! Many other structs, like
git_transport_data, embed the filter struct. So they need to initialize
it themselves even if the rest of the enclosing struct is OK with
zero-initialization. I found all of the relevant spots by grepping
manually for declarations of list_objects_filter_options. And then doing
so recursively for structs which embed it, and ones which embed those,
and so on.
I'm pretty sure I got everything, but there's no change that would alert
the compiler if any topics in flight added new declarations. To catch
this case, we now double-check in the parsing function that things were
initialized as expected and BUG() if appropriate.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-11 13:03:07 +08:00
|
|
|
list_objects_filter_init(&revs->filter);
|
2022-11-17 13:46:51 +08:00
|
|
|
init_ref_exclusions(&revs->ref_excludes);
|
rev-list: allow missing tips with --missing=[print|allow*]
In 9830926c7d (rev-list: add commit object support in `--missing`
option, 2023-10-27) we fixed the `--missing` option in `git rev-list`
so that it works with with missing commits, not just blobs/trees.
Unfortunately, such a command would still fail with a "fatal: bad
object <oid>" if it is passed a missing commit, blob or tree as an
argument (before the rev walking even begins).
When such a command is used to find the dependencies of some objects,
for example the dependencies of quarantined objects (see the
"QUARANTINE ENVIRONMENT" section in the git-receive-pack(1)
documentation), it would be better if the command would instead
consider such missing objects, especially commits, in the same way as
other missing objects.
If, for example `--missing=print` is used, it would be nice for some
use cases if the missing tips passed as arguments were reported in
the same way as other missing objects instead of the command just
failing.
We could introduce a new option to make it work like this, but most
users are likely to prefer the command to have this behavior as the
default one. Introducing a new option would require another dumb loop
to look for that option early, which isn't nice.
Also we made `git rev-list` work with missing commits very recently
and the command is most often passed commits as arguments. So let's
consider this as a bug fix related to these recent changes.
While at it let's add a NEEDSWORK comment to say that we should get
rid of the existing ugly dumb loops that parse the
`--exclude-promisor-objects` and `--missing=...` options early.
Helped-by: Linus Arver <linusa@google.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-14 22:25:13 +08:00
|
|
|
oidset_init(&revs->missing_commits, 0);
|
2006-03-10 17:21:39 +08:00
|
|
|
}
|
|
|
|
|
2006-07-02 07:29:37 +08:00
|
|
|
static void add_pending_commit_list(struct rev_info *revs,
|
2018-12-06 23:42:06 +08:00
|
|
|
struct commit_list *commit_list,
|
|
|
|
unsigned int flags)
|
2006-07-02 07:29:37 +08:00
|
|
|
{
|
|
|
|
while (commit_list) {
|
|
|
|
struct object *object = &commit_list->item->object;
|
|
|
|
object->flags |= flags;
|
2015-11-10 10:22:28 +08:00
|
|
|
add_pending_object(revs, object, oid_to_hex(&object->oid));
|
2006-07-02 07:29:37 +08:00
|
|
|
commit_list = commit_list->next;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
revision: implement `git log --merge` also for rebase/cherry-pick/revert
'git log' learned in ae3e5e1ef2 (git log -p --merge [[--] paths...],
2006-07-03) to show commits touching conflicted files in the range
HEAD...MERGE_HEAD, an addition documented in d249b45547 (Document
rev-list's option --merge, 2006-08-04).
It can be useful to look at the commit history to understand what lead
to merge conflicts also for other mergy operations besides merges, like
cherry-pick, revert and rebase.
For rebases and cherry-picks, an interesting range to look at is
HEAD...{REBASE_HEAD,CHERRY_PICK_HEAD}, since even if all the commits
included in that range are not directly part of the 3-way merge,
conflicts encountered during these operations can indeed be caused by
changes introduced in preceding commits on both sides of the history.
For revert, as we are (most likely) reversing changes from a previous
commit, an appropriate range is REVERT_HEAD..HEAD, which is equivalent
to REVERT_HEAD...HEAD and to HEAD...REVERT_HEAD, if we keep HEAD and its
parents on the left side of the range.
As such, adjust the code in prepare_show_merge so it constructs the
range HEAD...$OTHER for OTHER={MERGE_HEAD, CHERRY_PICK_HEAD, REVERT_HEAD
or REBASE_HEAD}. Note that we try these pseudorefs in order, so keep
REBASE_HEAD last since the three other operations can be performed
during a rebase. Note also that in the uncommon case where $OTHER and
HEAD do not share a common ancestor, this will show the complete
histories of both sides since their root commits, which is the same
behaviour as currently happens in that case for HEAD and MERGE_HEAD.
Adjust the documentation of this option accordingly.
Co-authored-by: Johannes Sixt <j6t@kdbg.org>
Co-authored-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Michael Lohmann <mi.al.lohmann@gmail.com>
[jc: tweaked in j6t's precedence fix that tries REBASE_HEAD last]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-28 21:54:54 +08:00
|
|
|
static const char *lookup_other_head(struct object_id *oid)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
static const char *const other_head[] = {
|
|
|
|
"MERGE_HEAD", "CHERRY_PICK_HEAD", "REVERT_HEAD", "REBASE_HEAD"
|
|
|
|
};
|
|
|
|
|
|
|
|
for (i = 0; i < ARRAY_SIZE(other_head); i++)
|
2024-05-07 15:11:53 +08:00
|
|
|
if (!refs_read_ref_full(get_main_ref_store(the_repository), other_head[i],
|
|
|
|
RESOLVE_REF_READING | RESOLVE_REF_NO_RECURSE,
|
|
|
|
oid, NULL)) {
|
revision: implement `git log --merge` also for rebase/cherry-pick/revert
'git log' learned in ae3e5e1ef2 (git log -p --merge [[--] paths...],
2006-07-03) to show commits touching conflicted files in the range
HEAD...MERGE_HEAD, an addition documented in d249b45547 (Document
rev-list's option --merge, 2006-08-04).
It can be useful to look at the commit history to understand what lead
to merge conflicts also for other mergy operations besides merges, like
cherry-pick, revert and rebase.
For rebases and cherry-picks, an interesting range to look at is
HEAD...{REBASE_HEAD,CHERRY_PICK_HEAD}, since even if all the commits
included in that range are not directly part of the 3-way merge,
conflicts encountered during these operations can indeed be caused by
changes introduced in preceding commits on both sides of the history.
For revert, as we are (most likely) reversing changes from a previous
commit, an appropriate range is REVERT_HEAD..HEAD, which is equivalent
to REVERT_HEAD...HEAD and to HEAD...REVERT_HEAD, if we keep HEAD and its
parents on the left side of the range.
As such, adjust the code in prepare_show_merge so it constructs the
range HEAD...$OTHER for OTHER={MERGE_HEAD, CHERRY_PICK_HEAD, REVERT_HEAD
or REBASE_HEAD}. Note that we try these pseudorefs in order, so keep
REBASE_HEAD last since the three other operations can be performed
during a rebase. Note also that in the uncommon case where $OTHER and
HEAD do not share a common ancestor, this will show the complete
histories of both sides since their root commits, which is the same
behaviour as currently happens in that case for HEAD and MERGE_HEAD.
Adjust the documentation of this option accordingly.
Co-authored-by: Johannes Sixt <j6t@kdbg.org>
Co-authored-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Michael Lohmann <mi.al.lohmann@gmail.com>
[jc: tweaked in j6t's precedence fix that tries REBASE_HEAD last]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-28 21:54:54 +08:00
|
|
|
if (is_null_oid(oid))
|
|
|
|
die(_("%s exists but is a symbolic ref"), other_head[i]);
|
|
|
|
return other_head[i];
|
|
|
|
}
|
|
|
|
|
|
|
|
die(_("--merge requires one of the pseudorefs MERGE_HEAD, CHERRY_PICK_HEAD, REVERT_HEAD or REBASE_HEAD"));
|
|
|
|
}
|
|
|
|
|
2006-07-03 17:59:32 +08:00
|
|
|
static void prepare_show_merge(struct rev_info *revs)
|
|
|
|
{
|
2024-02-28 17:44:14 +08:00
|
|
|
struct commit_list *bases = NULL;
|
2006-07-03 17:59:32 +08:00
|
|
|
struct commit *head, *other;
|
2017-05-07 06:10:05 +08:00
|
|
|
struct object_id oid;
|
revision: implement `git log --merge` also for rebase/cherry-pick/revert
'git log' learned in ae3e5e1ef2 (git log -p --merge [[--] paths...],
2006-07-03) to show commits touching conflicted files in the range
HEAD...MERGE_HEAD, an addition documented in d249b45547 (Document
rev-list's option --merge, 2006-08-04).
It can be useful to look at the commit history to understand what lead
to merge conflicts also for other mergy operations besides merges, like
cherry-pick, revert and rebase.
For rebases and cherry-picks, an interesting range to look at is
HEAD...{REBASE_HEAD,CHERRY_PICK_HEAD}, since even if all the commits
included in that range are not directly part of the 3-way merge,
conflicts encountered during these operations can indeed be caused by
changes introduced in preceding commits on both sides of the history.
For revert, as we are (most likely) reversing changes from a previous
commit, an appropriate range is REVERT_HEAD..HEAD, which is equivalent
to REVERT_HEAD...HEAD and to HEAD...REVERT_HEAD, if we keep HEAD and its
parents on the left side of the range.
As such, adjust the code in prepare_show_merge so it constructs the
range HEAD...$OTHER for OTHER={MERGE_HEAD, CHERRY_PICK_HEAD, REVERT_HEAD
or REBASE_HEAD}. Note that we try these pseudorefs in order, so keep
REBASE_HEAD last since the three other operations can be performed
during a rebase. Note also that in the uncommon case where $OTHER and
HEAD do not share a common ancestor, this will show the complete
histories of both sides since their root commits, which is the same
behaviour as currently happens in that case for HEAD and MERGE_HEAD.
Adjust the documentation of this option accordingly.
Co-authored-by: Johannes Sixt <j6t@kdbg.org>
Co-authored-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Michael Lohmann <mi.al.lohmann@gmail.com>
[jc: tweaked in j6t's precedence fix that tries REBASE_HEAD last]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-28 21:54:54 +08:00
|
|
|
const char *other_name;
|
2006-07-03 17:59:32 +08:00
|
|
|
const char **prune = NULL;
|
|
|
|
int i, prune_num = 1; /* counting terminating NULL */
|
2018-09-21 23:57:38 +08:00
|
|
|
struct index_state *istate = revs->repo->index;
|
2006-07-03 17:59:32 +08:00
|
|
|
|
2023-03-28 21:58:46 +08:00
|
|
|
if (repo_get_oid(the_repository, "HEAD", &oid))
|
2006-07-03 17:59:32 +08:00
|
|
|
die("--merge without HEAD?");
|
Convert lookup_commit* to struct object_id
Convert lookup_commit, lookup_commit_or_die,
lookup_commit_reference, and lookup_commit_reference_gently to take
struct object_id arguments.
Introduce a temporary in parse_object buffer in order to convert this
function. This is required since in order to convert parse_object and
parse_object_buffer, lookup_commit_reference_gently and
lookup_commit_or_die would need to be converted. Not introducing a
temporary would therefore require that lookup_commit_or_die take a
struct object_id *, but lookup_commit would take unsigned char *,
leaving a confusing and hard-to-use interface.
parse_object_buffer will lose this temporary in a later patch.
This commit was created with manual changes to commit.c, commit.h, and
object.c, plus the following semantic patch:
@@
expression E1, E2;
@@
- lookup_commit_reference_gently(E1.hash, E2)
+ lookup_commit_reference_gently(&E1, E2)
@@
expression E1, E2;
@@
- lookup_commit_reference_gently(E1->hash, E2)
+ lookup_commit_reference_gently(E1, E2)
@@
expression E1;
@@
- lookup_commit_reference(E1.hash)
+ lookup_commit_reference(&E1)
@@
expression E1;
@@
- lookup_commit_reference(E1->hash)
+ lookup_commit_reference(E1)
@@
expression E1;
@@
- lookup_commit(E1.hash)
+ lookup_commit(&E1)
@@
expression E1;
@@
- lookup_commit(E1->hash)
+ lookup_commit(E1)
@@
expression E1, E2;
@@
- lookup_commit_or_die(E1.hash, E2)
+ lookup_commit_or_die(&E1, E2)
@@
expression E1, E2;
@@
- lookup_commit_or_die(E1->hash, E2)
+ lookup_commit_or_die(E1, E2)
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-07 06:10:10 +08:00
|
|
|
head = lookup_commit_or_die(&oid, "HEAD");
|
revision: implement `git log --merge` also for rebase/cherry-pick/revert
'git log' learned in ae3e5e1ef2 (git log -p --merge [[--] paths...],
2006-07-03) to show commits touching conflicted files in the range
HEAD...MERGE_HEAD, an addition documented in d249b45547 (Document
rev-list's option --merge, 2006-08-04).
It can be useful to look at the commit history to understand what lead
to merge conflicts also for other mergy operations besides merges, like
cherry-pick, revert and rebase.
For rebases and cherry-picks, an interesting range to look at is
HEAD...{REBASE_HEAD,CHERRY_PICK_HEAD}, since even if all the commits
included in that range are not directly part of the 3-way merge,
conflicts encountered during these operations can indeed be caused by
changes introduced in preceding commits on both sides of the history.
For revert, as we are (most likely) reversing changes from a previous
commit, an appropriate range is REVERT_HEAD..HEAD, which is equivalent
to REVERT_HEAD...HEAD and to HEAD...REVERT_HEAD, if we keep HEAD and its
parents on the left side of the range.
As such, adjust the code in prepare_show_merge so it constructs the
range HEAD...$OTHER for OTHER={MERGE_HEAD, CHERRY_PICK_HEAD, REVERT_HEAD
or REBASE_HEAD}. Note that we try these pseudorefs in order, so keep
REBASE_HEAD last since the three other operations can be performed
during a rebase. Note also that in the uncommon case where $OTHER and
HEAD do not share a common ancestor, this will show the complete
histories of both sides since their root commits, which is the same
behaviour as currently happens in that case for HEAD and MERGE_HEAD.
Adjust the documentation of this option accordingly.
Co-authored-by: Johannes Sixt <j6t@kdbg.org>
Co-authored-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Michael Lohmann <mi.al.lohmann@gmail.com>
[jc: tweaked in j6t's precedence fix that tries REBASE_HEAD last]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-28 21:54:54 +08:00
|
|
|
other_name = lookup_other_head(&oid);
|
|
|
|
other = lookup_commit_or_die(&oid, other_name);
|
2006-07-03 17:59:32 +08:00
|
|
|
add_pending_object(revs, &head->object, "HEAD");
|
revision: implement `git log --merge` also for rebase/cherry-pick/revert
'git log' learned in ae3e5e1ef2 (git log -p --merge [[--] paths...],
2006-07-03) to show commits touching conflicted files in the range
HEAD...MERGE_HEAD, an addition documented in d249b45547 (Document
rev-list's option --merge, 2006-08-04).
It can be useful to look at the commit history to understand what lead
to merge conflicts also for other mergy operations besides merges, like
cherry-pick, revert and rebase.
For rebases and cherry-picks, an interesting range to look at is
HEAD...{REBASE_HEAD,CHERRY_PICK_HEAD}, since even if all the commits
included in that range are not directly part of the 3-way merge,
conflicts encountered during these operations can indeed be caused by
changes introduced in preceding commits on both sides of the history.
For revert, as we are (most likely) reversing changes from a previous
commit, an appropriate range is REVERT_HEAD..HEAD, which is equivalent
to REVERT_HEAD...HEAD and to HEAD...REVERT_HEAD, if we keep HEAD and its
parents on the left side of the range.
As such, adjust the code in prepare_show_merge so it constructs the
range HEAD...$OTHER for OTHER={MERGE_HEAD, CHERRY_PICK_HEAD, REVERT_HEAD
or REBASE_HEAD}. Note that we try these pseudorefs in order, so keep
REBASE_HEAD last since the three other operations can be performed
during a rebase. Note also that in the uncommon case where $OTHER and
HEAD do not share a common ancestor, this will show the complete
histories of both sides since their root commits, which is the same
behaviour as currently happens in that case for HEAD and MERGE_HEAD.
Adjust the documentation of this option accordingly.
Co-authored-by: Johannes Sixt <j6t@kdbg.org>
Co-authored-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Michael Lohmann <mi.al.lohmann@gmail.com>
[jc: tweaked in j6t's precedence fix that tries REBASE_HEAD last]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-28 21:54:54 +08:00
|
|
|
add_pending_object(revs, &other->object, other_name);
|
2024-02-28 17:44:14 +08:00
|
|
|
if (repo_get_merge_bases(the_repository, head, other, &bases) < 0)
|
|
|
|
exit(128);
|
2013-05-16 23:32:38 +08:00
|
|
|
add_rev_cmdline_list(revs, bases, REV_CMD_MERGE_BASE, UNINTERESTING | BOTTOM);
|
|
|
|
add_pending_commit_list(revs, bases, UNINTERESTING | BOTTOM);
|
2008-02-27 15:18:38 +08:00
|
|
|
free_commit_list(bases);
|
|
|
|
head->object.flags |= SYMMETRIC_LEFT;
|
2006-07-03 17:59:32 +08:00
|
|
|
|
2018-09-21 23:57:38 +08:00
|
|
|
if (!istate->cache_nr)
|
2019-01-12 10:13:26 +08:00
|
|
|
repo_read_index(revs->repo);
|
2018-09-21 23:57:38 +08:00
|
|
|
for (i = 0; i < istate->cache_nr; i++) {
|
|
|
|
const struct cache_entry *ce = istate->cache[i];
|
2006-07-03 17:59:32 +08:00
|
|
|
if (!ce_stage(ce))
|
|
|
|
continue;
|
2018-09-21 23:57:38 +08:00
|
|
|
if (ce_path_match(istate, ce, &revs->prune_data, NULL)) {
|
2006-07-03 17:59:32 +08:00
|
|
|
prune_num++;
|
2014-09-17 02:56:57 +08:00
|
|
|
REALLOC_ARRAY(prune, prune_num);
|
2006-07-03 17:59:32 +08:00
|
|
|
prune[prune_num-2] = ce->name;
|
|
|
|
prune[prune_num-1] = NULL;
|
|
|
|
}
|
2018-09-21 23:57:38 +08:00
|
|
|
while ((i+1 < istate->cache_nr) &&
|
|
|
|
ce_same_name(ce, istate->cache[i+1]))
|
2006-07-03 17:59:32 +08:00
|
|
|
i++;
|
|
|
|
}
|
2016-06-03 05:09:22 +08:00
|
|
|
clear_pathspec(&revs->prune_data);
|
2013-10-26 10:09:20 +08:00
|
|
|
parse_pathspec(&revs->prune_data, PATHSPEC_ALL_MAGIC & ~PATHSPEC_LITERAL,
|
2013-11-19 06:31:29 +08:00
|
|
|
PATHSPEC_PREFER_FULL | PATHSPEC_LITERAL_PATH, "", prune);
|
2008-02-27 15:18:38 +08:00
|
|
|
revs->limited = 1;
|
2006-07-03 17:59:32 +08:00
|
|
|
}
|
|
|
|
|
2017-05-19 20:52:00 +08:00
|
|
|
static int dotdot_missing(const char *arg, char *dotdot,
|
|
|
|
struct rev_info *revs, int symmetric)
|
|
|
|
{
|
|
|
|
if (revs->ignore_missing)
|
|
|
|
return 0;
|
|
|
|
/* de-munge so we report the full argument */
|
|
|
|
*dotdot = '.';
|
|
|
|
die(symmetric
|
|
|
|
? "Invalid symmetric difference expression %s"
|
|
|
|
: "Invalid revision range %s", arg);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int handle_dotdot_1(const char *arg, char *dotdot,
|
|
|
|
struct rev_info *revs, int flags,
|
2017-05-19 20:55:26 +08:00
|
|
|
int cant_be_filename,
|
|
|
|
struct object_context *a_oc,
|
|
|
|
struct object_context *b_oc)
|
2017-05-19 20:52:00 +08:00
|
|
|
{
|
|
|
|
const char *a_name, *b_name;
|
|
|
|
struct object_id a_oid, b_oid;
|
|
|
|
struct object *a_obj, *b_obj;
|
|
|
|
unsigned int a_flags, b_flags;
|
|
|
|
int symmetric = 0;
|
|
|
|
unsigned int flags_exclude = flags ^ (UNINTERESTING | BOTTOM);
|
2017-07-14 07:49:29 +08:00
|
|
|
unsigned int oc_flags = GET_OID_COMMITTISH | GET_OID_RECORD_PATH;
|
2017-05-19 20:52:00 +08:00
|
|
|
|
|
|
|
a_name = arg;
|
|
|
|
if (!*a_name)
|
|
|
|
a_name = "HEAD";
|
|
|
|
|
|
|
|
b_name = dotdot + 2;
|
|
|
|
if (*b_name == '.') {
|
|
|
|
symmetric = 1;
|
|
|
|
b_name++;
|
|
|
|
}
|
|
|
|
if (!*b_name)
|
|
|
|
b_name = "HEAD";
|
|
|
|
|
2019-01-12 10:13:28 +08:00
|
|
|
if (get_oid_with_context(revs->repo, a_name, oc_flags, &a_oid, a_oc) ||
|
|
|
|
get_oid_with_context(revs->repo, b_name, oc_flags, &b_oid, b_oc))
|
2017-05-19 20:52:00 +08:00
|
|
|
return -1;
|
|
|
|
|
|
|
|
if (!cant_be_filename) {
|
|
|
|
*dotdot = '.';
|
|
|
|
verify_non_filename(revs->prefix, arg);
|
|
|
|
*dotdot = '\0';
|
|
|
|
}
|
|
|
|
|
2018-09-21 23:57:39 +08:00
|
|
|
a_obj = parse_object(revs->repo, &a_oid);
|
|
|
|
b_obj = parse_object(revs->repo, &b_oid);
|
2017-05-19 20:52:00 +08:00
|
|
|
if (!a_obj || !b_obj)
|
|
|
|
return dotdot_missing(arg, dotdot, revs, symmetric);
|
|
|
|
|
|
|
|
if (!symmetric) {
|
|
|
|
/* just A..B */
|
|
|
|
b_flags = flags;
|
|
|
|
a_flags = flags_exclude;
|
|
|
|
} else {
|
|
|
|
/* A...B -- find merge bases between the two */
|
|
|
|
struct commit *a, *b;
|
2024-02-28 17:44:14 +08:00
|
|
|
struct commit_list *exclude = NULL;
|
2017-05-19 20:52:00 +08:00
|
|
|
|
2018-09-21 23:57:39 +08:00
|
|
|
a = lookup_commit_reference(revs->repo, &a_obj->oid);
|
|
|
|
b = lookup_commit_reference(revs->repo, &b_obj->oid);
|
2017-05-19 20:52:00 +08:00
|
|
|
if (!a || !b)
|
|
|
|
return dotdot_missing(arg, dotdot, revs, symmetric);
|
|
|
|
|
2024-02-28 17:44:14 +08:00
|
|
|
if (repo_get_merge_bases(the_repository, a, b, &exclude) < 0) {
|
|
|
|
free_commit_list(exclude);
|
|
|
|
return -1;
|
|
|
|
}
|
2017-05-19 20:52:00 +08:00
|
|
|
add_rev_cmdline_list(revs, exclude, REV_CMD_MERGE_BASE,
|
|
|
|
flags_exclude);
|
|
|
|
add_pending_commit_list(revs, exclude, flags_exclude);
|
|
|
|
free_commit_list(exclude);
|
|
|
|
|
|
|
|
b_flags = flags;
|
|
|
|
a_flags = flags | SYMMETRIC_LEFT;
|
|
|
|
}
|
|
|
|
|
|
|
|
a_obj->flags |= a_flags;
|
|
|
|
b_obj->flags |= b_flags;
|
|
|
|
add_rev_cmdline(revs, a_obj, a_name, REV_CMD_LEFT, a_flags);
|
|
|
|
add_rev_cmdline(revs, b_obj, b_name, REV_CMD_RIGHT, b_flags);
|
2017-05-19 20:55:26 +08:00
|
|
|
add_pending_object_with_path(revs, a_obj, a_name, a_oc->mode, a_oc->path);
|
|
|
|
add_pending_object_with_path(revs, b_obj, b_name, b_oc->mode, b_oc->path);
|
2017-05-19 20:52:00 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int handle_dotdot(const char *arg,
|
|
|
|
struct rev_info *revs, int flags,
|
|
|
|
int cant_be_filename)
|
|
|
|
{
|
2024-06-11 17:19:59 +08:00
|
|
|
struct object_context a_oc = {0}, b_oc = {0};
|
2017-05-19 20:52:00 +08:00
|
|
|
char *dotdot = strstr(arg, "..");
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!dotdot)
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
*dotdot = '\0';
|
2017-05-19 20:55:26 +08:00
|
|
|
ret = handle_dotdot_1(arg, dotdot, revs, flags, cant_be_filename,
|
|
|
|
&a_oc, &b_oc);
|
2017-05-19 20:52:00 +08:00
|
|
|
*dotdot = '.';
|
|
|
|
|
2024-06-11 17:19:59 +08:00
|
|
|
object_context_release(&a_oc);
|
|
|
|
object_context_release(&b_oc);
|
2017-05-19 20:52:00 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
revision: set rev_input_given in handle_revision_arg()
Commit 7ba826290a (revision: add rev_input_given flag, 2017-08-02) added
a flag to rev_info to tell whether we got any revision arguments. As
explained there, this is necessary because some revision arguments may
not produce any pending traversal objects, but should still inhibit
default behaviors (e.g., a glob that matches nothing).
However, it only set the flag in the globbing code, but not for
revisions we get on the command-line or via stdin. This leads to two
problems:
- the command-line code keeps its own separate got_rev_arg flag; this
isn't wrong, but it's confusing and an extra maintenance burden
- even specifically-named rev arguments might end up not adding any
pending objects: if --ignore-missing is set, then specifying a
missing object is a noop rather than an error.
And that leads to some user-visible bugs:
- when deciding whether a default rev like "HEAD" should kick in, we
check both got_rev_arg and rev_input_given. That means that
"--ignore-missing $ZERO_OID" works on the command-line (where we set
got_rev_arg) but not on --stdin (where we don't)
- when rev-list decides whether it should complain that it wasn't
given a starting point, it relies on rev_input_given. So it can't
even get the command-line "--ignore-missing $ZERO_OID" right
Let's consistently set the flag if we got any revision argument. That
lets us clean up the redundant got_rev_arg, and fixes both of those bugs
(but note there are three new tests: we'll confirm the already working
git-log command-line case).
A few implementation notes:
- conceptually we want to set the flag whenever handle_revision_arg()
finds an actual revision arg ("handles" it, you might say). But it
covers a ton of cases with early returns. Rather than annotating
each one, we just wrap it and use its success exit-code to set the
flag in one spot.
- the new rev-list test is in t6018, which is titled to cover globs.
This isn't exactly a glob, but it made sense to stick it with the
other tests that handle the "even though we got a rev, we have no
pending objects" case, which are globs.
- the tests check for the oid of a missing object, which it's pretty
clear --ignore-missing should ignore. You can see the same behavior
with "--ignore-missing a-ref-that-does-not-exist", because
--ignore-missing treats them both the same. That's perhaps less
clearly correct, and we may want to change that in the future. But
the way the code and tests here are written, we'd continue to do the
right thing even if it does.
Reported-by: Bryan Turner <bturner@atlassian.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-27 04:13:05 +08:00
|
|
|
static int handle_revision_arg_1(const char *arg_, struct rev_info *revs, int flags, unsigned revarg_opt)
|
2006-09-06 12:28:36 +08:00
|
|
|
{
|
2024-06-11 17:19:59 +08:00
|
|
|
struct object_context oc = {0};
|
2017-05-19 20:50:07 +08:00
|
|
|
char *mark;
|
2006-09-06 12:28:36 +08:00
|
|
|
struct object *object;
|
2017-05-07 06:10:27 +08:00
|
|
|
struct object_id oid;
|
2006-09-06 12:28:36 +08:00
|
|
|
int local_flags;
|
2011-08-26 08:35:39 +08:00
|
|
|
const char *arg = arg_;
|
2012-07-03 03:33:52 +08:00
|
|
|
int cant_be_filename = revarg_opt & REVARG_CANNOT_BE_FILENAME;
|
2017-07-14 07:49:29 +08:00
|
|
|
unsigned get_sha1_flags = GET_OID_RECORD_PATH;
|
2024-06-11 17:19:59 +08:00
|
|
|
int ret;
|
2006-09-06 12:28:36 +08:00
|
|
|
|
2013-05-16 23:32:38 +08:00
|
|
|
flags = flags & UNINTERESTING ? flags | BOTTOM : flags & ~BOTTOM;
|
|
|
|
|
2017-05-19 20:51:40 +08:00
|
|
|
if (!cant_be_filename && !strcmp(arg, "..")) {
|
|
|
|
/*
|
|
|
|
* Just ".."? That is not a range but the
|
|
|
|
* pathspec for the parent directory.
|
|
|
|
*/
|
2024-06-11 17:19:59 +08:00
|
|
|
ret = -1;
|
|
|
|
goto out;
|
2006-09-06 12:28:36 +08:00
|
|
|
}
|
2016-09-27 16:32:49 +08:00
|
|
|
|
2024-06-11 17:19:59 +08:00
|
|
|
if (!handle_dotdot(arg, revs, flags, revarg_opt)) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
2016-09-27 16:32:49 +08:00
|
|
|
|
2017-05-19 20:50:07 +08:00
|
|
|
mark = strstr(arg, "^@");
|
|
|
|
if (mark && !mark[2]) {
|
|
|
|
*mark = 0;
|
2024-06-11 17:19:59 +08:00
|
|
|
if (add_parents_only(revs, arg, flags, 0)) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
2017-05-19 20:50:07 +08:00
|
|
|
*mark = '^';
|
2006-09-06 12:28:36 +08:00
|
|
|
}
|
2017-05-19 20:50:07 +08:00
|
|
|
mark = strstr(arg, "^!");
|
|
|
|
if (mark && !mark[2]) {
|
|
|
|
*mark = 0;
|
2016-09-27 16:32:49 +08:00
|
|
|
if (!add_parents_only(revs, arg, flags ^ (UNINTERESTING | BOTTOM), 0))
|
2017-05-19 20:50:07 +08:00
|
|
|
*mark = '^';
|
2016-09-27 16:32:49 +08:00
|
|
|
}
|
2017-05-19 20:50:07 +08:00
|
|
|
mark = strstr(arg, "^-");
|
|
|
|
if (mark) {
|
2016-09-27 16:32:49 +08:00
|
|
|
int exclude_parent = 1;
|
|
|
|
|
2017-05-19 20:50:07 +08:00
|
|
|
if (mark[2]) {
|
2022-10-01 18:25:36 +08:00
|
|
|
if (strtol_i(mark + 2, 10, &exclude_parent) ||
|
2024-06-11 17:19:59 +08:00
|
|
|
exclude_parent < 1) {
|
|
|
|
ret = -1;
|
|
|
|
goto out;
|
|
|
|
}
|
2016-09-27 16:32:49 +08:00
|
|
|
}
|
|
|
|
|
2017-05-19 20:50:07 +08:00
|
|
|
*mark = 0;
|
2016-09-27 16:32:49 +08:00
|
|
|
if (!add_parents_only(revs, arg, flags ^ (UNINTERESTING | BOTTOM), exclude_parent))
|
2017-05-19 20:50:07 +08:00
|
|
|
*mark = '^';
|
2006-11-01 06:22:34 +08:00
|
|
|
}
|
|
|
|
|
2006-09-06 12:28:36 +08:00
|
|
|
local_flags = 0;
|
|
|
|
if (*arg == '^') {
|
2013-05-16 23:32:38 +08:00
|
|
|
local_flags = UNINTERESTING | BOTTOM;
|
2006-09-06 12:28:36 +08:00
|
|
|
arg++;
|
|
|
|
}
|
2012-07-03 03:43:05 +08:00
|
|
|
|
|
|
|
if (revarg_opt & REVARG_COMMITTISH)
|
2017-07-14 07:49:29 +08:00
|
|
|
get_sha1_flags |= GET_OID_COMMITTISH;
|
2012-07-03 03:43:05 +08:00
|
|
|
|
rev-list: allow missing tips with --missing=[print|allow*]
In 9830926c7d (rev-list: add commit object support in `--missing`
option, 2023-10-27) we fixed the `--missing` option in `git rev-list`
so that it works with with missing commits, not just blobs/trees.
Unfortunately, such a command would still fail with a "fatal: bad
object <oid>" if it is passed a missing commit, blob or tree as an
argument (before the rev walking even begins).
When such a command is used to find the dependencies of some objects,
for example the dependencies of quarantined objects (see the
"QUARANTINE ENVIRONMENT" section in the git-receive-pack(1)
documentation), it would be better if the command would instead
consider such missing objects, especially commits, in the same way as
other missing objects.
If, for example `--missing=print` is used, it would be nice for some
use cases if the missing tips passed as arguments were reported in
the same way as other missing objects instead of the command just
failing.
We could introduce a new option to make it work like this, but most
users are likely to prefer the command to have this behavior as the
default one. Introducing a new option would require another dumb loop
to look for that option early, which isn't nice.
Also we made `git rev-list` work with missing commits very recently
and the command is most often passed commits as arguments. So let's
consider this as a bug fix related to these recent changes.
While at it let's add a NEEDSWORK comment to say that we should get
rid of the existing ugly dumb loops that parse the
`--exclude-promisor-objects` and `--missing=...` options early.
Helped-by: Linus Arver <linusa@google.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-14 22:25:13 +08:00
|
|
|
/*
|
|
|
|
* Even if revs->do_not_die_on_missing_objects is set, we
|
|
|
|
* should error out if we can't even get an oid, as
|
|
|
|
* `--missing=print` should be able to report missing oids.
|
|
|
|
*/
|
2024-06-11 17:19:59 +08:00
|
|
|
if (get_oid_with_context(revs->repo, arg, get_sha1_flags, &oid, &oc)) {
|
|
|
|
ret = revs->ignore_missing ? 0 : -1;
|
|
|
|
goto out;
|
|
|
|
}
|
2006-09-06 12:28:36 +08:00
|
|
|
if (!cant_be_filename)
|
|
|
|
verify_non_filename(revs->prefix, arg);
|
2017-05-07 06:10:27 +08:00
|
|
|
object = get_reference(revs, arg, &oid, flags ^ local_flags);
|
2024-06-11 17:19:59 +08:00
|
|
|
if (!object) {
|
|
|
|
ret = (revs->ignore_missing || revs->do_not_die_on_missing_objects) ? 0 : -1;
|
|
|
|
goto out;
|
|
|
|
}
|
2011-08-26 08:35:39 +08:00
|
|
|
add_rev_cmdline(revs, object, arg_, REV_CMD_REV, flags ^ local_flags);
|
2017-05-19 20:55:26 +08:00
|
|
|
add_pending_object_with_path(revs, object, arg, oc.mode, oc.path);
|
2024-06-11 17:19:59 +08:00
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
|
|
|
|
out:
|
|
|
|
object_context_release(&oc);
|
|
|
|
return ret;
|
2006-09-06 12:28:36 +08:00
|
|
|
}
|
|
|
|
|
revision: set rev_input_given in handle_revision_arg()
Commit 7ba826290a (revision: add rev_input_given flag, 2017-08-02) added
a flag to rev_info to tell whether we got any revision arguments. As
explained there, this is necessary because some revision arguments may
not produce any pending traversal objects, but should still inhibit
default behaviors (e.g., a glob that matches nothing).
However, it only set the flag in the globbing code, but not for
revisions we get on the command-line or via stdin. This leads to two
problems:
- the command-line code keeps its own separate got_rev_arg flag; this
isn't wrong, but it's confusing and an extra maintenance burden
- even specifically-named rev arguments might end up not adding any
pending objects: if --ignore-missing is set, then specifying a
missing object is a noop rather than an error.
And that leads to some user-visible bugs:
- when deciding whether a default rev like "HEAD" should kick in, we
check both got_rev_arg and rev_input_given. That means that
"--ignore-missing $ZERO_OID" works on the command-line (where we set
got_rev_arg) but not on --stdin (where we don't)
- when rev-list decides whether it should complain that it wasn't
given a starting point, it relies on rev_input_given. So it can't
even get the command-line "--ignore-missing $ZERO_OID" right
Let's consistently set the flag if we got any revision argument. That
lets us clean up the redundant got_rev_arg, and fixes both of those bugs
(but note there are three new tests: we'll confirm the already working
git-log command-line case).
A few implementation notes:
- conceptually we want to set the flag whenever handle_revision_arg()
finds an actual revision arg ("handles" it, you might say). But it
covers a ton of cases with early returns. Rather than annotating
each one, we just wrap it and use its success exit-code to set the
flag in one spot.
- the new rev-list test is in t6018, which is titled to cover globs.
This isn't exactly a glob, but it made sense to stick it with the
other tests that handle the "even though we got a rev, we have no
pending objects" case, which are globs.
- the tests check for the oid of a missing object, which it's pretty
clear --ignore-missing should ignore. You can see the same behavior
with "--ignore-missing a-ref-that-does-not-exist", because
--ignore-missing treats them both the same. That's perhaps less
clearly correct, and we may want to change that in the future. But
the way the code and tests here are written, we'd continue to do the
right thing even if it does.
Reported-by: Bryan Turner <bturner@atlassian.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-27 04:13:05 +08:00
|
|
|
int handle_revision_arg(const char *arg, struct rev_info *revs, int flags, unsigned revarg_opt)
|
|
|
|
{
|
|
|
|
int ret = handle_revision_arg_1(arg, revs, flags, revarg_opt);
|
|
|
|
if (!ret)
|
|
|
|
revs->rev_input_given = 1;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-03-20 16:13:36 +08:00
|
|
|
static void read_pathspec_from_stdin(struct strbuf *sb,
|
2020-07-29 04:25:12 +08:00
|
|
|
struct strvec *prune)
|
2011-05-12 05:01:19 +08:00
|
|
|
{
|
2017-09-21 04:36:59 +08:00
|
|
|
while (strbuf_getline(sb, stdin) != EOF)
|
2020-07-29 04:25:12 +08:00
|
|
|
strvec_push(prune, sb->buf);
|
2009-11-20 18:50:21 +08:00
|
|
|
}
|
|
|
|
|
2006-09-21 04:21:56 +08:00
|
|
|
static void add_grep(struct rev_info *revs, const char *ptn, enum grep_pat_token what)
|
2006-09-18 08:23:20 +08:00
|
|
|
{
|
2008-08-25 14:15:05 +08:00
|
|
|
append_grep_pattern(&revs->grep_filter, ptn, "command line", 0, what);
|
2006-09-21 04:21:56 +08:00
|
|
|
}
|
|
|
|
|
log --author/--committer: really match only with name part
When we tried to find commits done by AUTHOR, the first implementation
tried to pattern match a line with "^author .*AUTHOR", which later was
enhanced to strip leading caret and look for "^author AUTHOR" when the
search pattern was anchored at the left end (i.e. --author="^AUTHOR").
This had a few problems:
* When looking for fixed strings (e.g. "git log -F --author=x --grep=y"),
the regexp internally used "^author .*x" would never match anything;
* To match at the end (e.g. "git log --author='google.com>$'"), the
generated regexp has to also match the trailing timestamp part the
commit header lines have. Also, in order to determine if the '$' at
the end means "match at the end of the line" or just a literal dollar
sign (probably backslash-quoted), we would need to parse the regexp
ourselves.
An earlier alternative tried to make sure that a line matches "^author "
(to limit by field name) and the user supplied pattern at the same time.
While it solved the -F problem by introducing a special override for
matching the "^author ", it did not solve the trailing timestamp nor tail
match problem. It also would have matched every commit if --author=author
was asked for, not because the author's email part had this string, but
because every commit header line that talks about the author begins with
that field name, regardleses of who wrote it.
Instead of piling more hacks on top of hacks, this rethinks the grep
machinery that is used to look for strings in the commit header, and makes
sure that (1) field name matches literally at the beginning of the line,
followed by a SP, and (2) the user supplied pattern is matched against the
remainder of the line, excluding the trailing timestamp data.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-09-05 13:15:02 +08:00
|
|
|
static void add_header_grep(struct rev_info *revs, enum grep_header_field field, const char *pattern)
|
2006-09-21 04:21:56 +08:00
|
|
|
{
|
log --author/--committer: really match only with name part
When we tried to find commits done by AUTHOR, the first implementation
tried to pattern match a line with "^author .*AUTHOR", which later was
enhanced to strip leading caret and look for "^author AUTHOR" when the
search pattern was anchored at the left end (i.e. --author="^AUTHOR").
This had a few problems:
* When looking for fixed strings (e.g. "git log -F --author=x --grep=y"),
the regexp internally used "^author .*x" would never match anything;
* To match at the end (e.g. "git log --author='google.com>$'"), the
generated regexp has to also match the trailing timestamp part the
commit header lines have. Also, in order to determine if the '$' at
the end means "match at the end of the line" or just a literal dollar
sign (probably backslash-quoted), we would need to parse the regexp
ourselves.
An earlier alternative tried to make sure that a line matches "^author "
(to limit by field name) and the user supplied pattern at the same time.
While it solved the -F problem by introducing a special override for
matching the "^author ", it did not solve the trailing timestamp nor tail
match problem. It also would have matched every commit if --author=author
was asked for, not because the author's email part had this string, but
because every commit header line that talks about the author begins with
that field name, regardleses of who wrote it.
Instead of piling more hacks on top of hacks, this rethinks the grep
machinery that is used to look for strings in the commit header, and makes
sure that (1) field name matches literally at the beginning of the line,
followed by a SP, and (2) the user supplied pattern is matched against the
remainder of the line, excluding the trailing timestamp data.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-09-05 13:15:02 +08:00
|
|
|
append_header_grep_pattern(&revs->grep_filter, field, pattern);
|
2006-09-18 08:23:20 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void add_message_grep(struct rev_info *revs, const char *pattern)
|
|
|
|
{
|
2006-09-21 04:21:56 +08:00
|
|
|
add_grep(revs, pattern, GREP_PATTERN_BODY);
|
2006-09-18 08:23:20 +08:00
|
|
|
}
|
|
|
|
|
revision: parse integer arguments to --max-count, --skip, etc., more carefully
The "rev-list" and other commands in the "log" family, being the
oldest part of the system, use their own custom argument parsers,
and integer values of some options are parsed with atoi(), which
allows a non-digit after the number (e.g., "1q") to be silently
ignored. As a natural consequence, an argument that does not begin
with a digit (e.g., "q") silently becomes zero, too.
Switch to use strtol_i() and parse_timestamp() appropriately to
catch bogus input.
Note that one may naïvely expect that --max-count, --skip, etc., to
only take non-negative values, but we must allow them to also take
negative values, as an escape hatch to countermand a limit set by an
earlier option on the command line; the underlying variables are
initialized to (-1) and "--max-count=-1", for example, is a
legitimate way to reinitialize the limit.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-09 06:35:23 +08:00
|
|
|
static int parse_count(const char *arg)
|
|
|
|
{
|
|
|
|
int count;
|
|
|
|
|
|
|
|
if (strtol_i(arg, 10, &count) < 0)
|
|
|
|
die("'%s': not an integer", arg);
|
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
|
|
|
static timestamp_t parse_age(const char *arg)
|
|
|
|
{
|
|
|
|
timestamp_t num;
|
|
|
|
char *p;
|
|
|
|
|
|
|
|
errno = 0;
|
|
|
|
num = parse_timestamp(arg, &p, 10);
|
|
|
|
if (errno || *p || p == arg)
|
|
|
|
die("'%s': not a number of seconds since epoch", arg);
|
|
|
|
return num;
|
|
|
|
}
|
|
|
|
|
2008-07-10 05:38:34 +08:00
|
|
|
static int handle_revision_opt(struct rev_info *revs, int argc, const char **argv,
|
2018-12-04 06:10:19 +08:00
|
|
|
int *unkc, const char **unkv,
|
|
|
|
const struct setup_revision_opt* opt)
|
2008-07-08 21:19:33 +08:00
|
|
|
{
|
|
|
|
const char *arg = argv[0];
|
2022-08-19 12:28:10 +08:00
|
|
|
const char *optarg = NULL;
|
2010-08-05 16:22:55 +08:00
|
|
|
int argcount;
|
2018-05-02 08:25:50 +08:00
|
|
|
const unsigned hexsz = the_hash_algo->hexsz;
|
2008-07-08 21:19:33 +08:00
|
|
|
|
|
|
|
/* pseudo revision arguments */
|
|
|
|
if (!strcmp(arg, "--all") || !strcmp(arg, "--branches") ||
|
|
|
|
!strcmp(arg, "--tags") || !strcmp(arg, "--remotes") ||
|
|
|
|
!strcmp(arg, "--reflog") || !strcmp(arg, "--not") ||
|
2009-10-28 02:28:07 +08:00
|
|
|
!strcmp(arg, "--no-walk") || !strcmp(arg, "--do-walk") ||
|
2013-12-01 04:55:40 +08:00
|
|
|
!strcmp(arg, "--bisect") || starts_with(arg, "--glob=") ||
|
2014-10-17 08:44:23 +08:00
|
|
|
!strcmp(arg, "--indexed-objects") ||
|
2019-07-01 21:18:15 +08:00
|
|
|
!strcmp(arg, "--alternate-refs") ||
|
2022-11-17 13:46:56 +08:00
|
|
|
starts_with(arg, "--exclude=") || starts_with(arg, "--exclude-hidden=") ||
|
2013-12-01 04:55:40 +08:00
|
|
|
starts_with(arg, "--branches=") || starts_with(arg, "--tags=") ||
|
|
|
|
starts_with(arg, "--remotes=") || starts_with(arg, "--no-walk="))
|
2008-07-08 21:19:33 +08:00
|
|
|
{
|
|
|
|
unkv[(*unkc)++] = arg;
|
Allow "non-option" revision options in parse_option-enabled commands
Commands which use parse_options() but also call setup_revisions()
must do their parsing in a two step process:
1. first, they parse all options. Anything unknown goes to
parse_revision_opt() (which calls handle_revision_opt), which
may claim the option or say "I don't recognize this"
2. the non-option remainder goes to setup_revisions() to
actually get turned into revisions
Some revision options are "non-options" in that they must be
parsed in order with their revision counterparts in
setup_revisions(). For example, "--all" functions as a
pseudo-option expanding to all refs, and "--no-walk" affects refs
after it on the command line, but not before. The revision option
parser in step 1 recognizes such options and sets them aside for
later parsing by setup_revisions().
However, the return value used from handle_revision_opt indicated
"I didn't recognize this", which was wrong. It did, and it took
appropriate action (even though that action was just deferring it
for later parsing). Thus it should return "yes, I recognized
this."
Previously, these pseudo-options generated an error when used with
parse_options parsers (currently just blame and shortlog). With
this patch, they should work fine, enabling things like "git
shortlog --all".
Signed-off-by: Jeff King <peff@peff.net>
Acked-By: Pierre Habouzit <madcoder@debian.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 18:22:23 +08:00
|
|
|
return 1;
|
2008-07-08 21:19:33 +08:00
|
|
|
}
|
|
|
|
|
2010-08-05 16:22:55 +08:00
|
|
|
if ((argcount = parse_long_opt("max-count", argv, &optarg))) {
|
revision: parse integer arguments to --max-count, --skip, etc., more carefully
The "rev-list" and other commands in the "log" family, being the
oldest part of the system, use their own custom argument parsers,
and integer values of some options are parsed with atoi(), which
allows a non-digit after the number (e.g., "1q") to be silently
ignored. As a natural consequence, an argument that does not begin
with a digit (e.g., "q") silently becomes zero, too.
Switch to use strtol_i() and parse_timestamp() appropriately to
catch bogus input.
Note that one may naïvely expect that --max-count, --skip, etc., to
only take non-negative values, but we must allow them to also take
negative values, as an escape hatch to countermand a limit set by an
earlier option on the command line; the underlying variables are
initialized to (-1) and "--max-count=-1", for example, is a
legitimate way to reinitialize the limit.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-09 06:35:23 +08:00
|
|
|
revs->max_count = parse_count(optarg);
|
2010-06-01 16:35:49 +08:00
|
|
|
revs->no_walk = 0;
|
2010-08-05 16:22:55 +08:00
|
|
|
return argcount;
|
|
|
|
} else if ((argcount = parse_long_opt("skip", argv, &optarg))) {
|
revision: parse integer arguments to --max-count, --skip, etc., more carefully
The "rev-list" and other commands in the "log" family, being the
oldest part of the system, use their own custom argument parsers,
and integer values of some options are parsed with atoi(), which
allows a non-digit after the number (e.g., "1q") to be silently
ignored. As a natural consequence, an argument that does not begin
with a digit (e.g., "q") silently becomes zero, too.
Switch to use strtol_i() and parse_timestamp() appropriately to
catch bogus input.
Note that one may naïvely expect that --max-count, --skip, etc., to
only take non-negative values, but we must allow them to also take
negative values, as an escape hatch to countermand a limit set by an
earlier option on the command line; the underlying variables are
initialized to (-1) and "--max-count=-1", for example, is a
legitimate way to reinitialize the limit.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-09 06:35:23 +08:00
|
|
|
revs->skip_count = parse_count(optarg);
|
2010-08-05 16:22:55 +08:00
|
|
|
return argcount;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if ((*arg == '-') && isdigit(arg[1])) {
|
2014-06-07 06:33:25 +08:00
|
|
|
/* accept -<digit>, like traditional "head" */
|
revision: parse integer arguments to --max-count, --skip, etc., more carefully
The "rev-list" and other commands in the "log" family, being the
oldest part of the system, use their own custom argument parsers,
and integer values of some options are parsed with atoi(), which
allows a non-digit after the number (e.g., "1q") to be silently
ignored. As a natural consequence, an argument that does not begin
with a digit (e.g., "q") silently becomes zero, too.
Switch to use strtol_i() and parse_timestamp() appropriately to
catch bogus input.
Note that one may naïvely expect that --max-count, --skip, etc., to
only take non-negative values, but we must allow them to also take
negative values, as an escape hatch to countermand a limit set by an
earlier option on the command line; the underlying variables are
initialized to (-1) and "--max-count=-1", for example, is a
legitimate way to reinitialize the limit.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-09 06:35:23 +08:00
|
|
|
revs->max_count = parse_count(arg + 1);
|
2010-06-01 16:35:49 +08:00
|
|
|
revs->no_walk = 0;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "-n")) {
|
|
|
|
if (argc <= 1)
|
|
|
|
return error("-n requires an argument");
|
revision: parse integer arguments to --max-count, --skip, etc., more carefully
The "rev-list" and other commands in the "log" family, being the
oldest part of the system, use their own custom argument parsers,
and integer values of some options are parsed with atoi(), which
allows a non-digit after the number (e.g., "1q") to be silently
ignored. As a natural consequence, an argument that does not begin
with a digit (e.g., "q") silently becomes zero, too.
Switch to use strtol_i() and parse_timestamp() appropriately to
catch bogus input.
Note that one may naïvely expect that --max-count, --skip, etc., to
only take non-negative values, but we must allow them to also take
negative values, as an escape hatch to countermand a limit set by an
earlier option on the command line; the underlying variables are
initialized to (-1) and "--max-count=-1", for example, is a
legitimate way to reinitialize the limit.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-09 06:35:23 +08:00
|
|
|
revs->max_count = parse_count(argv[1]);
|
2010-06-01 16:35:49 +08:00
|
|
|
revs->no_walk = 0;
|
2008-07-08 21:19:33 +08:00
|
|
|
return 2;
|
2017-06-10 02:17:32 +08:00
|
|
|
} else if (skip_prefix(arg, "-n", &optarg)) {
|
revision: parse integer arguments to --max-count, --skip, etc., more carefully
The "rev-list" and other commands in the "log" family, being the
oldest part of the system, use their own custom argument parsers,
and integer values of some options are parsed with atoi(), which
allows a non-digit after the number (e.g., "1q") to be silently
ignored. As a natural consequence, an argument that does not begin
with a digit (e.g., "q") silently becomes zero, too.
Switch to use strtol_i() and parse_timestamp() appropriately to
catch bogus input.
Note that one may naïvely expect that --max-count, --skip, etc., to
only take non-negative values, but we must allow them to also take
negative values, as an escape hatch to countermand a limit set by an
earlier option on the command line; the underlying variables are
initialized to (-1) and "--max-count=-1", for example, is a
legitimate way to reinitialize the limit.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-09 06:35:23 +08:00
|
|
|
revs->max_count = parse_count(optarg);
|
2010-06-01 16:35:49 +08:00
|
|
|
revs->no_walk = 0;
|
2010-08-05 16:22:55 +08:00
|
|
|
} else if ((argcount = parse_long_opt("max-age", argv, &optarg))) {
|
revision: parse integer arguments to --max-count, --skip, etc., more carefully
The "rev-list" and other commands in the "log" family, being the
oldest part of the system, use their own custom argument parsers,
and integer values of some options are parsed with atoi(), which
allows a non-digit after the number (e.g., "1q") to be silently
ignored. As a natural consequence, an argument that does not begin
with a digit (e.g., "q") silently becomes zero, too.
Switch to use strtol_i() and parse_timestamp() appropriately to
catch bogus input.
Note that one may naïvely expect that --max-count, --skip, etc., to
only take non-negative values, but we must allow them to also take
negative values, as an escape hatch to countermand a limit set by an
earlier option on the command line; the underlying variables are
initialized to (-1) and "--max-count=-1", for example, is a
legitimate way to reinitialize the limit.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-09 06:35:23 +08:00
|
|
|
revs->max_age = parse_age(optarg);
|
2010-08-05 16:22:55 +08:00
|
|
|
return argcount;
|
|
|
|
} else if ((argcount = parse_long_opt("since", argv, &optarg))) {
|
|
|
|
revs->max_age = approxidate(optarg);
|
|
|
|
return argcount;
|
2022-04-23 20:59:57 +08:00
|
|
|
} else if ((argcount = parse_long_opt("since-as-filter", argv, &optarg))) {
|
|
|
|
revs->max_age_as_filter = approxidate(optarg);
|
|
|
|
return argcount;
|
2010-08-05 16:22:55 +08:00
|
|
|
} else if ((argcount = parse_long_opt("after", argv, &optarg))) {
|
|
|
|
revs->max_age = approxidate(optarg);
|
|
|
|
return argcount;
|
|
|
|
} else if ((argcount = parse_long_opt("min-age", argv, &optarg))) {
|
revision: parse integer arguments to --max-count, --skip, etc., more carefully
The "rev-list" and other commands in the "log" family, being the
oldest part of the system, use their own custom argument parsers,
and integer values of some options are parsed with atoi(), which
allows a non-digit after the number (e.g., "1q") to be silently
ignored. As a natural consequence, an argument that does not begin
with a digit (e.g., "q") silently becomes zero, too.
Switch to use strtol_i() and parse_timestamp() appropriately to
catch bogus input.
Note that one may naïvely expect that --max-count, --skip, etc., to
only take non-negative values, but we must allow them to also take
negative values, as an escape hatch to countermand a limit set by an
earlier option on the command line; the underlying variables are
initialized to (-1) and "--max-count=-1", for example, is a
legitimate way to reinitialize the limit.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-09 06:35:23 +08:00
|
|
|
revs->min_age = parse_age(optarg);
|
2010-08-05 16:22:55 +08:00
|
|
|
return argcount;
|
|
|
|
} else if ((argcount = parse_long_opt("before", argv, &optarg))) {
|
|
|
|
revs->min_age = approxidate(optarg);
|
|
|
|
return argcount;
|
|
|
|
} else if ((argcount = parse_long_opt("until", argv, &optarg))) {
|
|
|
|
revs->min_age = approxidate(optarg);
|
|
|
|
return argcount;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--first-parent")) {
|
|
|
|
revs->first_parent_only = 1;
|
git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.
In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.
-A-----E-F-G--main
\ / /
B-C-D--topic
In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.
Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.
Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 05:39:41 +08:00
|
|
|
} else if (!strcmp(arg, "--exclude-first-parent-only")) {
|
|
|
|
revs->exclude_first_parent_only = 1;
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
} else if (!strcmp(arg, "--ancestry-path")) {
|
|
|
|
revs->ancestry_path = 1;
|
2010-06-04 07:17:37 +08:00
|
|
|
revs->simplify_history = 0;
|
revision: --ancestry-path
"rev-list A..H" computes the set of commits that are ancestors of H, but
excludes the ones that are ancestors of A. This is useful to see what
happened to the history leading to H since A, in the sense that "what does
H have that did not exist in A" (e.g. when you have a choice to update to
H from A).
x---x---A---B---C <-- topic
/ \
x---x---x---o---o---o---o---M---D---E---F---G <-- dev
/ \
x---o---o---o---o---o---o---o---o---o---o---o---N---H <-- master
The result in the above example would be the commits marked with caps
letters (except for A itself, of course), and the ones marked with 'o'.
When you want to find out what commits in H are contaminated with the bug
introduced by A and need fixing, however, you might want to view only the
subset of "A..B" that are actually descendants of A, i.e. excluding the
ones marked with 'o'. Introduce a new option --ancestry-path to compute
this set with "rev-list --ancestry-path A..B".
Note that in practice, you would build a fix immediately on top of A and
"git branch --contains A" will give the names of branches that you would
need to merge the fix into (i.e. topic, dev and master), so this may not
be worth paying the extra cost of postprocessing.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-21 04:48:39 +08:00
|
|
|
revs->limited = 1;
|
2022-08-19 12:28:10 +08:00
|
|
|
revs->ancestry_path_implicit_bottoms = 1;
|
|
|
|
} else if (skip_prefix(arg, "--ancestry-path=", &optarg)) {
|
|
|
|
struct commit *c;
|
|
|
|
struct object_id oid;
|
2024-02-16 18:15:37 +08:00
|
|
|
const char *msg = _("could not get commit for --ancestry-path argument %s");
|
2022-08-19 12:28:10 +08:00
|
|
|
|
|
|
|
revs->ancestry_path = 1;
|
|
|
|
revs->simplify_history = 0;
|
|
|
|
revs->limited = 1;
|
|
|
|
|
|
|
|
if (repo_get_oid_committish(revs->repo, optarg, &oid))
|
|
|
|
return error(msg, optarg);
|
|
|
|
get_reference(revs, optarg, &oid, ANCESTRY_PATH);
|
|
|
|
c = lookup_commit_reference(revs->repo, &oid);
|
|
|
|
if (!c)
|
|
|
|
return error(msg, optarg);
|
|
|
|
commit_list_insert(c, &revs->ancestry_path_bottoms);
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "-g") || !strcmp(arg, "--walk-reflogs")) {
|
|
|
|
init_reflog_walk(&revs->reflog_info);
|
|
|
|
} else if (!strcmp(arg, "--default")) {
|
|
|
|
if (argc <= 1)
|
|
|
|
return error("bad --default argument");
|
|
|
|
revs->def = argv[1];
|
|
|
|
return 2;
|
|
|
|
} else if (!strcmp(arg, "--merge")) {
|
|
|
|
revs->show_merge = 1;
|
|
|
|
} else if (!strcmp(arg, "--topo-order")) {
|
toposort: rename "lifo" field
The primary invariant of sort_in_topological_order() is that a
parent commit is not emitted until all children of it are. When
traversing a forked history like this with "git log C E":
A----B----C
\
D----E
we ensure that A is emitted after all of B, C, D, and E are done, B
has to wait until C is done, and D has to wait until E is done.
In some applications, however, we would further want to control how
these child commits B, C, D and E on two parallel ancestry chains
are shown.
Most of the time, we would want to see C and B emitted together, and
then E and D, and finally A (i.e. the --topo-order output). The
"lifo" parameter of the sort_in_topological_order() function is used
to control this behaviour. We start the traversal by knowing two
commits, C and E. While keeping in mind that we also need to
inspect E later, we pick C first to inspect, and we notice and
record that B needs to be inspected. By structuring the "work to be
done" set as a LIFO stack, we ensure that B is inspected next,
before other in-flight commits we had known that we will need to
inspect, e.g. E.
When showing in --date-order, we would want to see commits ordered
by timestamps, i.e. show C, E, B and D in this order before showing
A, possibly mixing commits from two parallel histories together.
When "lifo" parameter is set to false, the function keeps the "work
to be done" set sorted in the date order to realize this semantics.
After inspecting C, we add B to the "work to be done" set, but the
next commit we inspect from the set is E which is newer than B.
The name "lifo", however, is too strongly tied to the way how the
function implements its behaviour, and does not describe what the
behaviour _means_.
Replace this field with an enum rev_sort_order, with two possible
values: REV_SORT_IN_GRAPH_ORDER and REV_SORT_BY_COMMIT_DATE, and
update the existing code. The mechanical replacement rule is:
"lifo == 0" is equivalent to "sort_order == REV_SORT_BY_COMMIT_DATE"
"lifo == 1" is equivalent to "sort_order == REV_SORT_IN_GRAPH_ORDER"
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-06-07 07:07:14 +08:00
|
|
|
revs->sort_order = REV_SORT_IN_GRAPH_ORDER;
|
2008-07-08 21:19:33 +08:00
|
|
|
revs->topo_order = 1;
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
} else if (!strcmp(arg, "--simplify-merges")) {
|
|
|
|
revs->simplify_merges = 1;
|
2012-06-09 05:47:08 +08:00
|
|
|
revs->topo_order = 1;
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
revs->rewrite_parents = 1;
|
|
|
|
revs->simplify_history = 0;
|
|
|
|
revs->limited = 1;
|
2008-11-04 03:25:46 +08:00
|
|
|
} else if (!strcmp(arg, "--simplify-by-decoration")) {
|
|
|
|
revs->simplify_merges = 1;
|
2012-06-09 05:47:08 +08:00
|
|
|
revs->topo_order = 1;
|
2008-11-04 03:25:46 +08:00
|
|
|
revs->rewrite_parents = 1;
|
|
|
|
revs->simplify_history = 0;
|
|
|
|
revs->simplify_by_decoration = 1;
|
|
|
|
revs->limited = 1;
|
|
|
|
revs->prune = 1;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--date-order")) {
|
toposort: rename "lifo" field
The primary invariant of sort_in_topological_order() is that a
parent commit is not emitted until all children of it are. When
traversing a forked history like this with "git log C E":
A----B----C
\
D----E
we ensure that A is emitted after all of B, C, D, and E are done, B
has to wait until C is done, and D has to wait until E is done.
In some applications, however, we would further want to control how
these child commits B, C, D and E on two parallel ancestry chains
are shown.
Most of the time, we would want to see C and B emitted together, and
then E and D, and finally A (i.e. the --topo-order output). The
"lifo" parameter of the sort_in_topological_order() function is used
to control this behaviour. We start the traversal by knowing two
commits, C and E. While keeping in mind that we also need to
inspect E later, we pick C first to inspect, and we notice and
record that B needs to be inspected. By structuring the "work to be
done" set as a LIFO stack, we ensure that B is inspected next,
before other in-flight commits we had known that we will need to
inspect, e.g. E.
When showing in --date-order, we would want to see commits ordered
by timestamps, i.e. show C, E, B and D in this order before showing
A, possibly mixing commits from two parallel histories together.
When "lifo" parameter is set to false, the function keeps the "work
to be done" set sorted in the date order to realize this semantics.
After inspecting C, we add B to the "work to be done" set, but the
next commit we inspect from the set is E which is newer than B.
The name "lifo", however, is too strongly tied to the way how the
function implements its behaviour, and does not describe what the
behaviour _means_.
Replace this field with an enum rev_sort_order, with two possible
values: REV_SORT_IN_GRAPH_ORDER and REV_SORT_BY_COMMIT_DATE, and
update the existing code. The mechanical replacement rule is:
"lifo == 0" is equivalent to "sort_order == REV_SORT_BY_COMMIT_DATE"
"lifo == 1" is equivalent to "sort_order == REV_SORT_IN_GRAPH_ORDER"
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-06-07 07:07:14 +08:00
|
|
|
revs->sort_order = REV_SORT_BY_COMMIT_DATE;
|
2008-07-08 21:19:33 +08:00
|
|
|
revs->topo_order = 1;
|
2013-06-08 01:35:54 +08:00
|
|
|
} else if (!strcmp(arg, "--author-date-order")) {
|
|
|
|
revs->sort_order = REV_SORT_BY_AUTHOR_DATE;
|
2008-07-08 21:19:33 +08:00
|
|
|
revs->topo_order = 1;
|
2017-06-10 02:17:31 +08:00
|
|
|
} else if (!strcmp(arg, "--early-output")) {
|
|
|
|
revs->early_output = 100;
|
|
|
|
revs->topo_order = 1;
|
|
|
|
} else if (skip_prefix(arg, "--early-output=", &optarg)) {
|
|
|
|
if (strtoul_ui(optarg, 10, &revs->early_output) < 0)
|
|
|
|
die("'%s': not a non-negative integer", optarg);
|
|
|
|
revs->topo_order = 1;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--parents")) {
|
|
|
|
revs->rewrite_parents = 1;
|
|
|
|
revs->print_parents = 1;
|
|
|
|
} else if (!strcmp(arg, "--dense")) {
|
|
|
|
revs->dense = 1;
|
|
|
|
} else if (!strcmp(arg, "--sparse")) {
|
|
|
|
revs->dense = 0;
|
2017-11-16 10:00:35 +08:00
|
|
|
} else if (!strcmp(arg, "--in-commit-order")) {
|
|
|
|
revs->tree_blobs_in_commit_order = 1;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--remove-empty")) {
|
|
|
|
revs->remove_empty_trees = 1;
|
2009-06-30 01:28:25 +08:00
|
|
|
} else if (!strcmp(arg, "--merges")) {
|
2011-03-21 18:14:06 +08:00
|
|
|
revs->min_parents = 2;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--no-merges")) {
|
2011-03-21 18:14:06 +08:00
|
|
|
revs->max_parents = 1;
|
2017-06-10 02:17:32 +08:00
|
|
|
} else if (skip_prefix(arg, "--min-parents=", &optarg)) {
|
revision: parse integer arguments to --max-count, --skip, etc., more carefully
The "rev-list" and other commands in the "log" family, being the
oldest part of the system, use their own custom argument parsers,
and integer values of some options are parsed with atoi(), which
allows a non-digit after the number (e.g., "1q") to be silently
ignored. As a natural consequence, an argument that does not begin
with a digit (e.g., "q") silently becomes zero, too.
Switch to use strtol_i() and parse_timestamp() appropriately to
catch bogus input.
Note that one may naïvely expect that --max-count, --skip, etc., to
only take non-negative values, but we must allow them to also take
negative values, as an escape hatch to countermand a limit set by an
earlier option on the command line; the underlying variables are
initialized to (-1) and "--max-count=-1", for example, is a
legitimate way to reinitialize the limit.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-09 06:35:23 +08:00
|
|
|
revs->min_parents = parse_count(optarg);
|
2017-06-10 02:17:30 +08:00
|
|
|
} else if (!strcmp(arg, "--no-min-parents")) {
|
2011-03-21 18:14:06 +08:00
|
|
|
revs->min_parents = 0;
|
2017-06-10 02:17:32 +08:00
|
|
|
} else if (skip_prefix(arg, "--max-parents=", &optarg)) {
|
revision: parse integer arguments to --max-count, --skip, etc., more carefully
The "rev-list" and other commands in the "log" family, being the
oldest part of the system, use their own custom argument parsers,
and integer values of some options are parsed with atoi(), which
allows a non-digit after the number (e.g., "1q") to be silently
ignored. As a natural consequence, an argument that does not begin
with a digit (e.g., "q") silently becomes zero, too.
Switch to use strtol_i() and parse_timestamp() appropriately to
catch bogus input.
Note that one may naïvely expect that --max-count, --skip, etc., to
only take non-negative values, but we must allow them to also take
negative values, as an escape hatch to countermand a limit set by an
earlier option on the command line; the underlying variables are
initialized to (-1) and "--max-count=-1", for example, is a
legitimate way to reinitialize the limit.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-09 06:35:23 +08:00
|
|
|
revs->max_parents = parse_count(optarg);
|
2017-06-10 02:17:30 +08:00
|
|
|
} else if (!strcmp(arg, "--no-max-parents")) {
|
2011-03-21 18:14:06 +08:00
|
|
|
revs->max_parents = -1;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--boundary")) {
|
|
|
|
revs->boundary = 1;
|
|
|
|
} else if (!strcmp(arg, "--left-right")) {
|
|
|
|
revs->left_right = 1;
|
2011-02-22 00:09:11 +08:00
|
|
|
} else if (!strcmp(arg, "--left-only")) {
|
2011-02-22 08:58:37 +08:00
|
|
|
if (revs->right_only)
|
2023-11-26 19:57:43 +08:00
|
|
|
die(_("options '%s' and '%s' cannot be used together"),
|
|
|
|
"--left-only", "--right-only/--cherry");
|
2011-02-22 00:09:11 +08:00
|
|
|
revs->left_only = 1;
|
|
|
|
} else if (!strcmp(arg, "--right-only")) {
|
2011-02-22 08:58:37 +08:00
|
|
|
if (revs->left_only)
|
2022-01-06 04:02:16 +08:00
|
|
|
die(_("options '%s' and '%s' cannot be used together"), "--right-only", "--left-only");
|
2011-02-22 00:09:11 +08:00
|
|
|
revs->right_only = 1;
|
2011-03-07 20:31:42 +08:00
|
|
|
} else if (!strcmp(arg, "--cherry")) {
|
|
|
|
if (revs->left_only)
|
2022-01-06 04:02:16 +08:00
|
|
|
die(_("options '%s' and '%s' cannot be used together"), "--cherry", "--left-only");
|
2011-03-07 20:31:42 +08:00
|
|
|
revs->cherry_mark = 1;
|
|
|
|
revs->right_only = 1;
|
2011-03-21 18:14:06 +08:00
|
|
|
revs->max_parents = 1;
|
2011-03-07 20:31:42 +08:00
|
|
|
revs->limited = 1;
|
2010-06-10 19:47:23 +08:00
|
|
|
} else if (!strcmp(arg, "--count")) {
|
|
|
|
revs->count = 1;
|
2011-03-07 20:31:40 +08:00
|
|
|
} else if (!strcmp(arg, "--cherry-mark")) {
|
|
|
|
if (revs->cherry_pick)
|
2022-01-06 04:02:16 +08:00
|
|
|
die(_("options '%s' and '%s' cannot be used together"), "--cherry-mark", "--cherry-pick");
|
2011-03-07 20:31:40 +08:00
|
|
|
revs->cherry_mark = 1;
|
|
|
|
revs->limited = 1; /* needs limit_list() */
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--cherry-pick")) {
|
2011-03-07 20:31:40 +08:00
|
|
|
if (revs->cherry_mark)
|
2022-01-06 04:02:16 +08:00
|
|
|
die(_("options '%s' and '%s' cannot be used together"), "--cherry-pick", "--cherry-mark");
|
2008-07-08 21:19:33 +08:00
|
|
|
revs->cherry_pick = 1;
|
|
|
|
revs->limited = 1;
|
|
|
|
} else if (!strcmp(arg, "--objects")) {
|
|
|
|
revs->tag_objects = 1;
|
|
|
|
revs->tree_objects = 1;
|
|
|
|
revs->blob_objects = 1;
|
|
|
|
} else if (!strcmp(arg, "--objects-edge")) {
|
|
|
|
revs->tag_objects = 1;
|
|
|
|
revs->tree_objects = 1;
|
|
|
|
revs->blob_objects = 1;
|
|
|
|
revs->edge_hint = 1;
|
2014-12-25 07:05:39 +08:00
|
|
|
} else if (!strcmp(arg, "--objects-edge-aggressive")) {
|
|
|
|
revs->tag_objects = 1;
|
|
|
|
revs->tree_objects = 1;
|
|
|
|
revs->blob_objects = 1;
|
|
|
|
revs->edge_hint = 1;
|
|
|
|
revs->edge_hint_aggressive = 1;
|
2011-09-02 06:43:34 +08:00
|
|
|
} else if (!strcmp(arg, "--verify-objects")) {
|
|
|
|
revs->tag_objects = 1;
|
|
|
|
revs->tree_objects = 1;
|
|
|
|
revs->blob_objects = 1;
|
|
|
|
revs->verify_objects = 1;
|
rev-list: disable commit graph with --verify-objects
Since the point of --verify-objects is to actually load and checksum the
bytes of each object, optimizing out reads using the commit graph runs
contrary to our goal.
The most targeted way to implement this would be for the revision
traversal code to check revs->verify_objects and avoid using the commit
graph. But it's difficult to be sure we've hit all of the correct spots.
For instance, I started this patch by writing the first of the included
test cases, where the corrupted commit is directly on rev-list's command
line. And that is easy to fix by teaching get_reference() to check
revs->verify_objects before calling lookup_commit_in_graph().
But that doesn't cover the second test case: when we traverse to a
corrupted commit, we'd parse the parent in process_parents(). So we'd
need to check there, too. And it keeps going. In handle_commit() we
sometimes parses commits, too, though I couldn't figure out a way to
trigger it that did not already parse via get_reference() or tag
peeling. And try_to_simplify_commit() has its own parse call, and so on.
So it seems like the safest thing is to just disable the commit graph
for the whole process when we see the --verify-objects option. We can do
that either in builtin/rev-list.c, where we use the option, or in
revision.c, where we parse it. There are some subtleties:
- putting it in rev-list.c is less surprising in some ways, because
there we know we are just doing a single traversal. In a command
which does multiple traversals in a single process, it's rather
unexpected to globally disable the commit graph.
- putting it in revision.c is less surprising in some ways, because
the caller does not have to remember to disable the graph
themselves. But this is already tricky! The verify_objects flag in
rev_info doesn't do anything by itself. The caller has to provide an
object callback which does the right thing.
- for that reason, in practice nobody but rev-list uses this option in
the first place. So the distinction is probably not important either
way. Arguably it should just be an option of rev-list, and not the
general revision machinery; right now you can run "git log
--verify-objects", but it does not actually do anything useful.
- checking for a parsed revs.verify_objects flag in rev-list.c is too
late. By that time we've already passed the arguments to
setup_revisions(), which will have parsed the commits using the
graph.
So this commit disables the graph as soon as we see the option in
revision.c. That's a pretty broad hammer, but it does what we want, and
in practice nobody but rev-list is using this flag anyway.
The tests cover both the "tip" and "parent" cases. Obviously our hammer
hits them both in this case, but it's good to check both in case
somebody later tries the more focused approach.
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-07 05:04:35 +08:00
|
|
|
disable_commit_graph(revs->repo);
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--unpacked")) {
|
|
|
|
revs->unpacked = 1;
|
2013-12-01 04:55:40 +08:00
|
|
|
} else if (starts_with(arg, "--unpacked=")) {
|
2020-08-05 06:26:30 +08:00
|
|
|
die(_("--unpacked=<packfile> no longer supported"));
|
revision: learn '--no-kept-objects'
A future caller will want to be able to perform a reachability traversal
which terminates when visiting an object found in a kept pack. The
closest existing option is '--honor-pack-keep', but this isn't quite
what we want. Instead of halting the traversal midway through, a full
traversal is always performed, and the results are only trimmed
afterwords.
Besides needing to introduce a new flag (since culling results
post-facto can be different than halting the traversal as it's
happening), there is an additional wrinkle handling the distinction
in-core and on-disk kept packs. That is: what kinds of kept pack should
stop the traversal?
Introduce '--no-kept-objects[=<on-disk|in-core>]' to specify which kinds
of kept packs, if any, should stop a traversal. This can be useful for
callers that want to perform a reachability analysis, but want to leave
certain packs alone (for e.g., when doing a geometric repack that has
some "large" packs which are kept in-core that it wants to leave alone).
Note that this option is not guaranteed to produce exactly the set of
objects that aren't in kept packs, since it's possible the traversal
order may end up in a situation where a non-kept ancestor was "cut off"
by a kept object (at which point we would stop traversing). But, we
don't care about absolute correctness here, since this will eventually
be used as a purely additive guide in an upcoming new repack mode.
Explicitly avoid documenting this new flag, since it is only used
internally. In theory we could avoid even adding it rev-list, but being
able to spell this option out on the command-line makes some special
cases easier to test without promising to keep it behaving consistently
forever. Those tricky cases are exercised in t6114.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 10:25:07 +08:00
|
|
|
} else if (!strcmp(arg, "--no-kept-objects")) {
|
|
|
|
revs->no_kept_objects = 1;
|
|
|
|
revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
|
|
|
|
revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
|
|
|
|
} else if (skip_prefix(arg, "--no-kept-objects=", &optarg)) {
|
|
|
|
revs->no_kept_objects = 1;
|
|
|
|
if (!strcmp(optarg, "in-core"))
|
|
|
|
revs->keep_pack_cache_flags |= IN_CORE_KEEP_PACKS;
|
|
|
|
if (!strcmp(optarg, "on-disk"))
|
|
|
|
revs->keep_pack_cache_flags |= ON_DISK_KEEP_PACKS;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "-r")) {
|
|
|
|
revs->diff = 1;
|
2017-11-01 02:19:11 +08:00
|
|
|
revs->diffopt.flags.recursive = 1;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "-t")) {
|
|
|
|
revs->diff = 1;
|
2017-11-01 02:19:11 +08:00
|
|
|
revs->diffopt.flags.recursive = 1;
|
|
|
|
revs->diffopt.flags.tree_in_recursive = 1;
|
2020-12-21 23:19:34 +08:00
|
|
|
} else if ((argcount = diff_merges_parse_opts(revs, argv))) {
|
2020-08-06 06:08:30 +08:00
|
|
|
return argcount;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "-v")) {
|
|
|
|
revs->verbose_header = 1;
|
|
|
|
} else if (!strcmp(arg, "--pretty")) {
|
|
|
|
revs->verbose_header = 1;
|
2010-01-21 05:59:36 +08:00
|
|
|
revs->pretty_given = 1;
|
2014-07-30 01:53:40 +08:00
|
|
|
get_commit_format(NULL, revs);
|
2017-06-10 02:17:32 +08:00
|
|
|
} else if (skip_prefix(arg, "--pretty=", &optarg) ||
|
|
|
|
skip_prefix(arg, "--format=", &optarg)) {
|
2010-08-05 16:22:55 +08:00
|
|
|
/*
|
|
|
|
* Detached form ("--pretty X" as opposed to "--pretty=X")
|
|
|
|
* not allowed, since the argument is optional.
|
|
|
|
*/
|
2008-07-08 21:19:33 +08:00
|
|
|
revs->verbose_header = 1;
|
2010-01-21 05:59:36 +08:00
|
|
|
revs->pretty_given = 1;
|
2017-06-10 02:17:32 +08:00
|
|
|
get_commit_format(optarg, revs);
|
2016-03-17 00:15:53 +08:00
|
|
|
} else if (!strcmp(arg, "--expand-tabs")) {
|
2016-03-30 07:05:39 +08:00
|
|
|
revs->expand_tabs_in_log = 8;
|
2016-03-30 06:49:24 +08:00
|
|
|
} else if (!strcmp(arg, "--no-expand-tabs")) {
|
|
|
|
revs->expand_tabs_in_log = 0;
|
2016-03-30 07:05:39 +08:00
|
|
|
} else if (skip_prefix(arg, "--expand-tabs=", &arg)) {
|
|
|
|
int val;
|
|
|
|
if (strtol_i(arg, 10, &val) < 0 || val < 0)
|
|
|
|
die("'%s': not a non-negative integer", arg);
|
|
|
|
revs->expand_tabs_in_log = val;
|
2011-03-30 04:57:47 +08:00
|
|
|
} else if (!strcmp(arg, "--show-notes") || !strcmp(arg, "--notes")) {
|
notes: break set_display_notes() into smaller functions
In 8164c961e1 (format-patch: use --notes behavior for format.notes,
2019-12-09), we introduced set_display_notes() which was a monolithic
function with three mutually exclusive branches. Break the function up
into three small and simple functions that each are only responsible for
one task.
This family of functions accepts an `int *show_notes` instead of
returning a value suitable for assignment to `show_notes`. This is for
two reasons. First of all, this guarantees that the external
`show_notes` variable changes in lockstep with the
`struct display_notes_opt`. Second, this prompts future developers to be
careful about doing something meaningful with this value. In fact, a
NULL check is intentionally omitted because causing a segfault here
would tell the future developer that they are meant to use the value for
something meaningful.
One alternative was making the family of functions accept a
`struct rev_info *` instead of the `struct display_notes_opt *`, since
the former contains the `show_notes` field as well. This does not work
because we have to call git_config() before repo_init_revisions().
However, if we had a `struct rev_info`, we'd need to initialize it before
it gets assigned values from git_config(). As a result, we break the
circular dependency by having standalone `int show_notes` and
`struct display_notes_opt notes_opt` variables which temporarily hold
values from git_config() until the values are copied over to `rev`.
To implement this change, we need to get a pointer to
`rev_info::show_notes`. Unfortunately, this is not possible with
bitfields and only direct-assignment is possible. Change
`rev_info::show_notes` to a non-bitfield int so that we can get its
address.
Signed-off-by: Denton Liu <liu.denton@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-12 08:49:50 +08:00
|
|
|
enable_default_display_notes(&revs->notes_opt, &revs->show_notes);
|
2010-01-21 05:59:36 +08:00
|
|
|
revs->show_notes_given = 1;
|
2011-10-19 06:53:23 +08:00
|
|
|
} else if (!strcmp(arg, "--show-signature")) {
|
|
|
|
revs->show_signature = 1;
|
2016-06-23 00:51:25 +08:00
|
|
|
} else if (!strcmp(arg, "--no-show-signature")) {
|
|
|
|
revs->show_signature = 0;
|
2017-06-10 02:17:32 +08:00
|
|
|
} else if (!strcmp(arg, "--show-linear-break")) {
|
|
|
|
revs->break_bar = " ..........";
|
2014-03-25 21:23:27 +08:00
|
|
|
revs->track_linear = 1;
|
|
|
|
revs->track_first_time = 1;
|
2017-06-10 02:17:32 +08:00
|
|
|
} else if (skip_prefix(arg, "--show-linear-break=", &optarg)) {
|
|
|
|
revs->break_bar = xstrdup(optarg);
|
2014-03-25 21:23:27 +08:00
|
|
|
revs->track_linear = 1;
|
|
|
|
revs->track_first_time = 1;
|
2023-09-20 04:26:50 +08:00
|
|
|
} else if (!strcmp(arg, "--show-notes-by-default")) {
|
|
|
|
revs->show_notes_by_default = 1;
|
2017-06-10 02:17:32 +08:00
|
|
|
} else if (skip_prefix(arg, "--show-notes=", &optarg) ||
|
|
|
|
skip_prefix(arg, "--notes=", &optarg)) {
|
|
|
|
if (starts_with(arg, "--show-notes=") &&
|
|
|
|
revs->notes_opt.use_default_notes < 0)
|
|
|
|
revs->notes_opt.use_default_notes = 1;
|
notes: break set_display_notes() into smaller functions
In 8164c961e1 (format-patch: use --notes behavior for format.notes,
2019-12-09), we introduced set_display_notes() which was a monolithic
function with three mutually exclusive branches. Break the function up
into three small and simple functions that each are only responsible for
one task.
This family of functions accepts an `int *show_notes` instead of
returning a value suitable for assignment to `show_notes`. This is for
two reasons. First of all, this guarantees that the external
`show_notes` variable changes in lockstep with the
`struct display_notes_opt`. Second, this prompts future developers to be
careful about doing something meaningful with this value. In fact, a
NULL check is intentionally omitted because causing a segfault here
would tell the future developer that they are meant to use the value for
something meaningful.
One alternative was making the family of functions accept a
`struct rev_info *` instead of the `struct display_notes_opt *`, since
the former contains the `show_notes` field as well. This does not work
because we have to call git_config() before repo_init_revisions().
However, if we had a `struct rev_info`, we'd need to initialize it before
it gets assigned values from git_config(). As a result, we break the
circular dependency by having standalone `int show_notes` and
`struct display_notes_opt notes_opt` variables which temporarily hold
values from git_config() until the values are copied over to `rev`.
To implement this change, we need to get a pointer to
`rev_info::show_notes`. Unfortunately, this is not possible with
bitfields and only direct-assignment is possible. Change
`rev_info::show_notes` to a non-bitfield int so that we can get its
address.
Signed-off-by: Denton Liu <liu.denton@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-12 08:49:50 +08:00
|
|
|
enable_ref_display_notes(&revs->notes_opt, &revs->show_notes, optarg);
|
2019-12-09 21:10:44 +08:00
|
|
|
revs->show_notes_given = 1;
|
2010-01-21 05:59:36 +08:00
|
|
|
} else if (!strcmp(arg, "--no-notes")) {
|
notes: break set_display_notes() into smaller functions
In 8164c961e1 (format-patch: use --notes behavior for format.notes,
2019-12-09), we introduced set_display_notes() which was a monolithic
function with three mutually exclusive branches. Break the function up
into three small and simple functions that each are only responsible for
one task.
This family of functions accepts an `int *show_notes` instead of
returning a value suitable for assignment to `show_notes`. This is for
two reasons. First of all, this guarantees that the external
`show_notes` variable changes in lockstep with the
`struct display_notes_opt`. Second, this prompts future developers to be
careful about doing something meaningful with this value. In fact, a
NULL check is intentionally omitted because causing a segfault here
would tell the future developer that they are meant to use the value for
something meaningful.
One alternative was making the family of functions accept a
`struct rev_info *` instead of the `struct display_notes_opt *`, since
the former contains the `show_notes` field as well. This does not work
because we have to call git_config() before repo_init_revisions().
However, if we had a `struct rev_info`, we'd need to initialize it before
it gets assigned values from git_config(). As a result, we break the
circular dependency by having standalone `int show_notes` and
`struct display_notes_opt notes_opt` variables which temporarily hold
values from git_config() until the values are copied over to `rev`.
To implement this change, we need to get a pointer to
`rev_info::show_notes`. Unfortunately, this is not possible with
bitfields and only direct-assignment is possible. Change
`rev_info::show_notes` to a non-bitfield int so that we can get its
address.
Signed-off-by: Denton Liu <liu.denton@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-12 08:49:50 +08:00
|
|
|
disable_display_notes(&revs->notes_opt, &revs->show_notes);
|
2010-01-21 05:59:36 +08:00
|
|
|
revs->show_notes_given = 1;
|
2010-03-13 01:04:26 +08:00
|
|
|
} else if (!strcmp(arg, "--standard-notes")) {
|
|
|
|
revs->show_notes_given = 1;
|
2011-03-30 04:57:27 +08:00
|
|
|
revs->notes_opt.use_default_notes = 1;
|
2010-03-13 01:04:26 +08:00
|
|
|
} else if (!strcmp(arg, "--no-standard-notes")) {
|
2011-03-30 04:57:27 +08:00
|
|
|
revs->notes_opt.use_default_notes = 0;
|
2009-02-24 17:59:16 +08:00
|
|
|
} else if (!strcmp(arg, "--oneline")) {
|
|
|
|
revs->verbose_header = 1;
|
|
|
|
get_commit_format("oneline", revs);
|
2010-01-22 06:57:41 +08:00
|
|
|
revs->pretty_given = 1;
|
2009-02-24 17:59:16 +08:00
|
|
|
revs->abbrev_commit = 1;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--graph")) {
|
2022-02-12 00:36:24 +08:00
|
|
|
graph_clear(revs->graph);
|
2008-07-08 21:19:33 +08:00
|
|
|
revs->graph = graph_init(revs);
|
2022-02-12 00:36:25 +08:00
|
|
|
} else if (!strcmp(arg, "--no-graph")) {
|
|
|
|
graph_clear(revs->graph);
|
|
|
|
revs->graph = NULL;
|
2020-04-08 12:31:38 +08:00
|
|
|
} else if (!strcmp(arg, "--encode-email-headers")) {
|
|
|
|
revs->encode_email_headers = 1;
|
|
|
|
} else if (!strcmp(arg, "--no-encode-email-headers")) {
|
|
|
|
revs->encode_email_headers = 0;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--root")) {
|
|
|
|
revs->show_root_diff = 1;
|
|
|
|
} else if (!strcmp(arg, "--no-commit-id")) {
|
|
|
|
revs->no_commit_id = 1;
|
|
|
|
} else if (!strcmp(arg, "--always")) {
|
|
|
|
revs->always_show_header = 1;
|
|
|
|
} else if (!strcmp(arg, "--no-abbrev")) {
|
|
|
|
revs->abbrev = 0;
|
|
|
|
} else if (!strcmp(arg, "--abbrev")) {
|
|
|
|
revs->abbrev = DEFAULT_ABBREV;
|
2017-06-10 02:17:32 +08:00
|
|
|
} else if (skip_prefix(arg, "--abbrev=", &optarg)) {
|
|
|
|
revs->abbrev = strtoul(optarg, NULL, 10);
|
2008-07-08 21:19:33 +08:00
|
|
|
if (revs->abbrev < MINIMUM_ABBREV)
|
|
|
|
revs->abbrev = MINIMUM_ABBREV;
|
2018-05-02 08:25:50 +08:00
|
|
|
else if (revs->abbrev > hexsz)
|
|
|
|
revs->abbrev = hexsz;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--abbrev-commit")) {
|
|
|
|
revs->abbrev_commit = 1;
|
2011-05-19 01:56:04 +08:00
|
|
|
revs->abbrev_commit_given = 1;
|
|
|
|
} else if (!strcmp(arg, "--no-abbrev-commit")) {
|
|
|
|
revs->abbrev_commit = 0;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--full-diff")) {
|
|
|
|
revs->diff = 1;
|
|
|
|
revs->full_diff = 1;
|
revision: --show-pulls adds helpful merges
The default file history simplification of "git log -- <path>" or
"git rev-list -- <path>" focuses on providing the smallest set of
commits that first contributed a change. The revision walk greatly
restricts the set of walked commits by visiting only the first
TREESAME parent of a merge commit, when one exists. This means
that portions of the commit-graph are not walked, which can be a
performance benefit, but can also "hide" commits that added changes
but were ignored by a merge resolution.
The --full-history option modifies this by walking all commits and
reporting a merge commit as "interesting" if it has _any_ parent
that is not TREESAME. This tends to be an over-representation of
important commits, especially in an environment where most merge
commits are created by pull request completion.
Suppose we have a commit A and we create a commit B on top that
changes our file. When we merge the pull request, we create a merge
commit M. If no one else changed the file in the first-parent
history between M and A, then M will not be TREESAME to its first
parent, but will be TREESAME to B. Thus, the simplified history
will be "B". However, M will appear in the --full-history mode.
However, suppose that a number of topics T1, T2, ..., Tn were
created based on commits C1, C2, ..., Cn between A and M as
follows:
A----C1----C2--- ... ---Cn----M------P1---P2--- ... ---Pn
\ \ \ \ / / / /
\ \__.. \ \/ ..__T1 / Tn
\ \__.. /\ ..__T2 /
\_____________________B \____________________/
If the commits T1, T2, ... Tn did not change the file, then all of
P1 through Pn will be TREESAME to their first parent, but not
TREESAME to their second. This means that all of those merge commits
appear in the --full-history view, with edges that immediately
collapse into the lower history without introducing interesting
single-parent commits.
The --simplify-merges option was introduced to remove these extra
merge commits. By noticing that the rewritten parents are reachable
from their first parents, those edges can be simplified away. Finally,
the commits now look like single-parent commits that are TREESAME to
their "only" parent. Thus, they are removed and this issue does not
cause issues anymore. However, this also ends up removing the commit
M from the history view! Even worse, the --simplify-merges option
requires walking the entire history before returning a single result.
Many Git users are using Git alongside a Git service that provides
code storage alongside a code review tool commonly called "Pull
Requests" or "Merge Requests" against a target branch. When these
requests are accepted and merged, they typically create a merge
commit whose first parent is the previous branch tip and the second
parent is the tip of the topic branch used for the request. This
presents a valuable order to the parents, but also makes that merge
commit slightly special. Users may want to see not only which
commits changed a file, but which pull requests merged those commits
into their branch. In the previous example, this would mean the
users want to see the merge commit "M" in addition to the single-
parent commit "C".
Users are even more likely to want these merge commits when they
use pull requests to merge into a feature branch before merging that
feature branch into their trunk.
In some sense, users are asking for the "first" merge commit to
bring in the change to their branch. As long as the parent order is
consistent, this can be handled with the following rule:
Include a merge commit if it is not TREESAME to its first
parent, but is TREESAME to a later parent.
These merges look like the merge commits that would result from
running "git pull <topic>" on a main branch. Thus, the option to
show these commits is called "--show-pulls". This has the added
benefit of showing the commits created by closing a pull request or
merge request on any of the Git hosting and code review platforms.
To test these options, extend the standard test example to include
a merge commit that is not TREESAME to its first parent. It is
surprising that that option was not already in the example, as it
is instructive.
In particular, this extension demonstrates a common issue with file
history simplification. When a user resolves a merge conflict using
"-Xours" or otherwise ignoring one side of the conflict, they create
a TREESAME edge that probably should not be TREESAME. This leads
users to become frustrated and complain that "my change disappeared!"
In my experience, showing them history with --full-history and
--simplify-merges quickly reveals the problematic merge. As mentioned,
this option is expensive to compute. The --show-pulls option
_might_ show the merge commit (usually titled "resolving conflicts")
more quickly. Of course, this depends on the user having the correct
parent order, which is backwards when using "git pull master" from a
topic branch.
There are some special considerations when combining the --show-pulls
option with --simplify-merges. This requires adding a new PULL_MERGE
object flag to store the information from the initial TREESAME
comparisons. This helps avoid dropping those commits in later filters.
This is covered by a test, including how the parents can be simplified.
Since "struct object" has already ruined its 32-bit alignment by using
33 bits across parsed, type, and flags member, let's not make it worse.
PULL_MERGE is used in revision.c with the same value (1u<<15) as
REACHABLE in commit-graph.c. The REACHABLE flag is only used when
writing a commit-graph file, and a revision walk using --show-pulls
does not happen in the same process. Care must be taken in the future
to ensure this remains the case.
Update Documentation/rev-list-options.txt with significant details
around this option. This requires updating the example in the
History Simplification section to demonstrate some of the problems
with TREESAME second parents.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-10 20:19:43 +08:00
|
|
|
} else if (!strcmp(arg, "--show-pulls")) {
|
|
|
|
revs->show_pulls = 1;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--full-history")) {
|
|
|
|
revs->simplify_history = 0;
|
|
|
|
} else if (!strcmp(arg, "--relative-date")) {
|
convert "enum date_mode" into a struct
In preparation for adding date modes that may carry extra
information beyond the mode itself, this patch converts the
date_mode enum into a struct.
Most of the conversion is fairly straightforward; we pass
the struct as a pointer and dereference the type field where
necessary. Locations that declare a date_mode can use a "{}"
constructor. However, the tricky case is where we use the
enum labels as constants, like:
show_date(t, tz, DATE_NORMAL);
Ideally we could say:
show_date(t, tz, &{ DATE_NORMAL });
but of course C does not allow that. Likewise, we cannot
cast the constant to a struct, because we need to pass an
actual address. Our options are basically:
1. Manually add a "struct date_mode d = { DATE_NORMAL }"
definition to each caller, and pass "&d". This makes
the callers uglier, because they sometimes do not even
have their own scope (e.g., they are inside a switch
statement).
2. Provide a pre-made global "date_normal" struct that can
be passed by address. We'd also need "date_rfc2822",
"date_iso8601", and so forth. But at least the ugliness
is defined in one place.
3. Provide a wrapper that generates the correct struct on
the fly. The big downside is that we end up pointing to
a single global, which makes our wrapper non-reentrant.
But show_date is already not reentrant, so it does not
matter.
This patch implements 3, along with a minor macro to keep
the size of the callers sane.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-06-26 00:55:02 +08:00
|
|
|
revs->date_mode.type = DATE_RELATIVE;
|
2009-09-24 16:28:15 +08:00
|
|
|
revs->date_mode_explicit = 1;
|
2010-08-05 16:22:55 +08:00
|
|
|
} else if ((argcount = parse_long_opt("date", argv, &optarg))) {
|
convert "enum date_mode" into a struct
In preparation for adding date modes that may carry extra
information beyond the mode itself, this patch converts the
date_mode enum into a struct.
Most of the conversion is fairly straightforward; we pass
the struct as a pointer and dereference the type field where
necessary. Locations that declare a date_mode can use a "{}"
constructor. However, the tricky case is where we use the
enum labels as constants, like:
show_date(t, tz, DATE_NORMAL);
Ideally we could say:
show_date(t, tz, &{ DATE_NORMAL });
but of course C does not allow that. Likewise, we cannot
cast the constant to a struct, because we need to pass an
actual address. Our options are basically:
1. Manually add a "struct date_mode d = { DATE_NORMAL }"
definition to each caller, and pass "&d". This makes
the callers uglier, because they sometimes do not even
have their own scope (e.g., they are inside a switch
statement).
2. Provide a pre-made global "date_normal" struct that can
be passed by address. We'd also need "date_rfc2822",
"date_iso8601", and so forth. But at least the ugliness
is defined in one place.
3. Provide a wrapper that generates the correct struct on
the fly. The big downside is that we end up pointing to
a single global, which makes our wrapper non-reentrant.
But show_date is already not reentrant, so it does not
matter.
This patch implements 3, along with a minor macro to keep
the size of the callers sane.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-06-26 00:55:02 +08:00
|
|
|
parse_date_format(optarg, &revs->date_mode);
|
2009-09-24 16:28:15 +08:00
|
|
|
revs->date_mode_explicit = 1;
|
2010-08-05 16:22:55 +08:00
|
|
|
return argcount;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--log-size")) {
|
|
|
|
revs->show_log_size = 1;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Grepping the commit log
|
|
|
|
*/
|
2010-08-05 16:22:55 +08:00
|
|
|
else if ((argcount = parse_long_opt("author", argv, &optarg))) {
|
|
|
|
add_header_grep(revs, GREP_HEADER_AUTHOR, optarg);
|
|
|
|
return argcount;
|
|
|
|
} else if ((argcount = parse_long_opt("committer", argv, &optarg))) {
|
|
|
|
add_header_grep(revs, GREP_HEADER_COMMITTER, optarg);
|
|
|
|
return argcount;
|
2012-09-29 12:41:28 +08:00
|
|
|
} else if ((argcount = parse_long_opt("grep-reflog", argv, &optarg))) {
|
|
|
|
add_header_grep(revs, GREP_HEADER_REFLOG, optarg);
|
|
|
|
return argcount;
|
2010-08-05 16:22:55 +08:00
|
|
|
} else if ((argcount = parse_long_opt("grep", argv, &optarg))) {
|
|
|
|
add_message_grep(revs, optarg);
|
|
|
|
return argcount;
|
2012-10-04 06:01:34 +08:00
|
|
|
} else if (!strcmp(arg, "--basic-regexp")) {
|
grep: further simplify setting the pattern type
When c5c31d33 (grep: move pattern-type bits support to top-level
grep.[ch], 2012-10-03) introduced grep_commit_pattern_type() helper
function, the intention was to allow the users of grep API to having
to fiddle only with .pattern_type_option (which can be set to "fixed",
"basic", "extended", and "pcre"), and then immediately before compiling
the pattern strings for use, call grep_commit_pattern_type() to have
it prepare various bits in the grep_opt structure (like .fixed,
.regflags, etc.).
However, grep_set_pattern_type_option() helper function the grep API
internally uses were left as an external function by mistake. This
function shouldn't have been made callable by the users of the API.
Later when the grep API was used in revision traversal machinery,
the caller then mistakenly started calling the function around
34a4ae55 (log --grep: use the same helper to set -E/-F options as
"git grep", 2012-10-03), instead of setting the .pattern_type_option
field and letting the grep_commit_pattern_type() to take care of the
details.
This caused an unnecessary bug that made a configured
grep.patternType take precedence over the command line options
(e.g. --basic-regexp, --fixed-strings) in "git log" family of
commands.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-23 02:43:14 +08:00
|
|
|
revs->grep_filter.pattern_type_option = GREP_PATTERN_TYPE_BRE;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--extended-regexp") || !strcmp(arg, "-E")) {
|
grep: further simplify setting the pattern type
When c5c31d33 (grep: move pattern-type bits support to top-level
grep.[ch], 2012-10-03) introduced grep_commit_pattern_type() helper
function, the intention was to allow the users of grep API to having
to fiddle only with .pattern_type_option (which can be set to "fixed",
"basic", "extended", and "pcre"), and then immediately before compiling
the pattern strings for use, call grep_commit_pattern_type() to have
it prepare various bits in the grep_opt structure (like .fixed,
.regflags, etc.).
However, grep_set_pattern_type_option() helper function the grep API
internally uses were left as an external function by mistake. This
function shouldn't have been made callable by the users of the API.
Later when the grep API was used in revision traversal machinery,
the caller then mistakenly started calling the function around
34a4ae55 (log --grep: use the same helper to set -E/-F options as
"git grep", 2012-10-03), instead of setting the .pattern_type_option
field and letting the grep_commit_pattern_type() to take care of the
details.
This caused an unnecessary bug that made a configured
grep.patternType take precedence over the command line options
(e.g. --basic-regexp, --fixed-strings) in "git log" family of
commands.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-23 02:43:14 +08:00
|
|
|
revs->grep_filter.pattern_type_option = GREP_PATTERN_TYPE_ERE;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--regexp-ignore-case") || !strcmp(arg, "-i")) {
|
2017-05-21 05:42:08 +08:00
|
|
|
revs->grep_filter.ignore_case = 1;
|
2018-01-05 06:50:40 +08:00
|
|
|
revs->diffopt.pickaxe_opts |= DIFF_PICKAXE_IGNORE_CASE;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--fixed-strings") || !strcmp(arg, "-F")) {
|
grep: further simplify setting the pattern type
When c5c31d33 (grep: move pattern-type bits support to top-level
grep.[ch], 2012-10-03) introduced grep_commit_pattern_type() helper
function, the intention was to allow the users of grep API to having
to fiddle only with .pattern_type_option (which can be set to "fixed",
"basic", "extended", and "pcre"), and then immediately before compiling
the pattern strings for use, call grep_commit_pattern_type() to have
it prepare various bits in the grep_opt structure (like .fixed,
.regflags, etc.).
However, grep_set_pattern_type_option() helper function the grep API
internally uses were left as an external function by mistake. This
function shouldn't have been made callable by the users of the API.
Later when the grep API was used in revision traversal machinery,
the caller then mistakenly started calling the function around
34a4ae55 (log --grep: use the same helper to set -E/-F options as
"git grep", 2012-10-03), instead of setting the .pattern_type_option
field and letting the grep_commit_pattern_type() to take care of the
details.
This caused an unnecessary bug that made a configured
grep.patternType take precedence over the command line options
(e.g. --basic-regexp, --fixed-strings) in "git log" family of
commands.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-23 02:43:14 +08:00
|
|
|
revs->grep_filter.pattern_type_option = GREP_PATTERN_TYPE_FIXED;
|
2017-05-26 04:05:24 +08:00
|
|
|
} else if (!strcmp(arg, "--perl-regexp") || !strcmp(arg, "-P")) {
|
grep: further simplify setting the pattern type
When c5c31d33 (grep: move pattern-type bits support to top-level
grep.[ch], 2012-10-03) introduced grep_commit_pattern_type() helper
function, the intention was to allow the users of grep API to having
to fiddle only with .pattern_type_option (which can be set to "fixed",
"basic", "extended", and "pcre"), and then immediately before compiling
the pattern strings for use, call grep_commit_pattern_type() to have
it prepare various bits in the grep_opt structure (like .fixed,
.regflags, etc.).
However, grep_set_pattern_type_option() helper function the grep API
internally uses were left as an external function by mistake. This
function shouldn't have been made callable by the users of the API.
Later when the grep API was used in revision traversal machinery,
the caller then mistakenly started calling the function around
34a4ae55 (log --grep: use the same helper to set -E/-F options as
"git grep", 2012-10-03), instead of setting the .pattern_type_option
field and letting the grep_commit_pattern_type() to take care of the
details.
This caused an unnecessary bug that made a configured
grep.patternType take precedence over the command line options
(e.g. --basic-regexp, --fixed-strings) in "git log" family of
commands.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-23 02:43:14 +08:00
|
|
|
revs->grep_filter.pattern_type_option = GREP_PATTERN_TYPE_PCRE;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--all-match")) {
|
2008-08-25 14:15:05 +08:00
|
|
|
revs->grep_filter.all_match = 1;
|
log: teach --invert-grep option
"git log --grep=<string>" shows only commits with messages that
match the given string, but sometimes it is useful to be able to
show only commits that do *not* have certain messages (e.g. "show
me ones that are not FIXUP commits").
Originally, we had the invert-grep flag in grep_opt, but because
"git grep --invert-grep" does not make sense except in conjunction
with "--files-with-matches", which is already covered by
"--files-without-matches", it was moved it to revisions structure.
To have the flag there expresses the function to the feature better.
When the newly inserted two tests run, the history would have commits
with messages "initial", "second", "third", "fourth", "fifth", "sixth"
and "Second", committed in this order. The commits that does not match
either "th" or "Sec" is "second" and "initial". For the case insensitive
case only "initial" matches.
Signed-off-by: Christoph Junghans <ottxor@gentoo.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-01-13 09:33:32 +08:00
|
|
|
} else if (!strcmp(arg, "--invert-grep")) {
|
2021-12-18 00:48:49 +08:00
|
|
|
revs->grep_filter.no_body_match = 1;
|
2010-08-05 16:22:55 +08:00
|
|
|
} else if ((argcount = parse_long_opt("encoding", argv, &optarg))) {
|
2024-06-07 14:39:07 +08:00
|
|
|
free(git_log_output_encoding);
|
2010-08-05 16:22:55 +08:00
|
|
|
if (strcmp(optarg, "none"))
|
|
|
|
git_log_output_encoding = xstrdup(optarg);
|
2008-07-08 21:19:33 +08:00
|
|
|
else
|
2024-06-07 14:39:07 +08:00
|
|
|
git_log_output_encoding = xstrdup("");
|
2010-08-05 16:22:55 +08:00
|
|
|
return argcount;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else if (!strcmp(arg, "--reverse")) {
|
|
|
|
revs->reverse ^= 1;
|
|
|
|
} else if (!strcmp(arg, "--children")) {
|
|
|
|
revs->children.name = "children";
|
|
|
|
revs->limited = 1;
|
2011-05-19 09:08:09 +08:00
|
|
|
} else if (!strcmp(arg, "--ignore-missing")) {
|
|
|
|
revs->ignore_missing = 1;
|
2018-12-04 06:10:19 +08:00
|
|
|
} else if (opt && opt->allow_exclude_promisor_objects &&
|
2018-10-23 09:13:42 +08:00
|
|
|
!strcmp(arg, "--exclude-promisor-objects")) {
|
2017-12-08 23:27:15 +08:00
|
|
|
if (fetch_if_missing)
|
2018-05-02 17:38:39 +08:00
|
|
|
BUG("exclude_promisor_objects can only be used when fetch_if_missing is 0");
|
2017-12-08 23:27:15 +08:00
|
|
|
revs->exclude_promisor_objects = 1;
|
2008-07-08 21:19:33 +08:00
|
|
|
} else {
|
2016-01-21 19:48:44 +08:00
|
|
|
int opts = diff_opt_parse(&revs->diffopt, argv, argc, revs->prefix);
|
2008-07-08 21:19:33 +08:00
|
|
|
if (!opts)
|
|
|
|
unkv[(*unkc)++] = arg;
|
|
|
|
return opts;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2008-07-10 05:38:34 +08:00
|
|
|
void parse_revision_opt(struct rev_info *revs, struct parse_opt_ctx_t *ctx,
|
|
|
|
const struct option *options,
|
|
|
|
const char * const usagestr[])
|
|
|
|
{
|
|
|
|
int n = handle_revision_opt(revs, ctx->argc, ctx->argv,
|
2018-12-04 06:10:19 +08:00
|
|
|
&ctx->cpidx, ctx->out, NULL);
|
2008-07-10 05:38:34 +08:00
|
|
|
if (n <= 0) {
|
|
|
|
error("unknown option `%s'", ctx->argv[0]);
|
|
|
|
usage_with_options(usagestr, options);
|
|
|
|
}
|
|
|
|
ctx->argv += n;
|
|
|
|
ctx->argc -= n;
|
|
|
|
}
|
|
|
|
|
2022-02-12 00:36:25 +08:00
|
|
|
void revision_opts_finish(struct rev_info *revs)
|
|
|
|
{
|
|
|
|
if (revs->graph && revs->track_linear)
|
|
|
|
die(_("options '%s' and '%s' cannot be used together"), "--show-linear-break", "--graph");
|
|
|
|
|
|
|
|
if (revs->graph) {
|
|
|
|
revs->topo_order = 1;
|
|
|
|
revs->rewrite_parents = 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-08-23 20:36:56 +08:00
|
|
|
static int for_each_bisect_ref(struct ref_store *refs, each_ref_fn fn,
|
|
|
|
void *cb_data, const char *term)
|
|
|
|
{
|
2015-06-29 23:40:30 +08:00
|
|
|
struct strbuf bisect_refs = STRBUF_INIT;
|
|
|
|
int status;
|
|
|
|
strbuf_addf(&bisect_refs, "refs/bisect/%s", term);
|
2023-07-11 05:12:22 +08:00
|
|
|
status = refs_for_each_fullref_in(refs, bisect_refs.buf, NULL, fn, cb_data);
|
2015-06-29 23:40:30 +08:00
|
|
|
strbuf_release(&bisect_refs);
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
2017-08-23 20:36:56 +08:00
|
|
|
static int for_each_bad_bisect_ref(struct ref_store *refs, each_ref_fn fn, void *cb_data)
|
2009-10-28 02:28:07 +08:00
|
|
|
{
|
2017-08-23 20:36:56 +08:00
|
|
|
return for_each_bisect_ref(refs, fn, cb_data, term_bad);
|
2009-10-28 02:28:07 +08:00
|
|
|
}
|
|
|
|
|
2017-08-23 20:36:56 +08:00
|
|
|
static int for_each_good_bisect_ref(struct ref_store *refs, each_ref_fn fn, void *cb_data)
|
2009-10-28 02:28:07 +08:00
|
|
|
{
|
2017-08-23 20:36:56 +08:00
|
|
|
return for_each_bisect_ref(refs, fn, cb_data, term_good);
|
2009-10-28 02:28:07 +08:00
|
|
|
}
|
|
|
|
|
2021-09-10 02:47:29 +08:00
|
|
|
static int handle_revision_pseudo_opt(struct rev_info *revs,
|
2020-09-30 20:28:18 +08:00
|
|
|
const char **argv, int *flags)
|
2011-04-21 18:45:07 +08:00
|
|
|
{
|
|
|
|
const char *arg = argv[0];
|
|
|
|
const char *optarg;
|
2017-08-23 20:36:56 +08:00
|
|
|
struct ref_store *refs;
|
2011-04-21 18:45:07 +08:00
|
|
|
int argcount;
|
|
|
|
|
2021-09-10 02:47:29 +08:00
|
|
|
if (revs->repo != the_repository) {
|
2017-08-23 20:36:59 +08:00
|
|
|
/*
|
|
|
|
* We need some something like get_submodule_worktrees()
|
|
|
|
* before we can go through all worktrees of a submodule,
|
|
|
|
* .e.g with adding all HEADs from --all, which is not
|
|
|
|
* supported right now, so stick to single worktree.
|
|
|
|
*/
|
|
|
|
if (!revs->single_worktree)
|
2018-05-02 17:38:39 +08:00
|
|
|
BUG("--single-worktree cannot be used together with submodule");
|
2021-09-10 02:47:29 +08:00
|
|
|
}
|
|
|
|
refs = get_main_ref_store(revs->repo);
|
2017-08-23 20:36:56 +08:00
|
|
|
|
2011-04-21 18:48:24 +08:00
|
|
|
/*
|
|
|
|
* NOTE!
|
|
|
|
*
|
|
|
|
* Commands like "git shortlog" will not accept the options below
|
|
|
|
* unless parse_revision_opt queues them (as opposed to erroring
|
|
|
|
* out).
|
|
|
|
*
|
|
|
|
* When implementing your new pseudo-option, remember to
|
|
|
|
* register it in the list at the top of handle_revision_opt.
|
|
|
|
*/
|
2011-04-21 18:45:07 +08:00
|
|
|
if (!strcmp(arg, "--all")) {
|
2017-08-23 20:36:56 +08:00
|
|
|
handle_refs(refs, revs, *flags, refs_for_each_ref);
|
|
|
|
handle_refs(refs, revs, *flags, refs_head_ref);
|
2017-08-23 20:36:59 +08:00
|
|
|
if (!revs->single_worktree) {
|
|
|
|
struct all_refs_cb cb;
|
|
|
|
|
|
|
|
init_all_refs_cb(&cb, revs, *flags);
|
|
|
|
other_head_refs(handle_one_ref, &cb);
|
|
|
|
}
|
2022-11-17 13:46:51 +08:00
|
|
|
clear_ref_exclusions(&revs->ref_excludes);
|
2011-04-21 18:45:07 +08:00
|
|
|
} else if (!strcmp(arg, "--branches")) {
|
2022-11-17 13:46:56 +08:00
|
|
|
if (revs->ref_excludes.hidden_refs_configured)
|
2023-12-06 19:51:58 +08:00
|
|
|
return error(_("options '%s' and '%s' cannot be used together"),
|
|
|
|
"--exclude-hidden", "--branches");
|
2017-08-23 20:36:56 +08:00
|
|
|
handle_refs(refs, revs, *flags, refs_for_each_branch_ref);
|
2022-11-17 13:46:51 +08:00
|
|
|
clear_ref_exclusions(&revs->ref_excludes);
|
2011-04-21 18:45:07 +08:00
|
|
|
} else if (!strcmp(arg, "--bisect")) {
|
2015-06-29 23:40:30 +08:00
|
|
|
read_bisect_terms(&term_bad, &term_good);
|
2017-08-23 20:36:56 +08:00
|
|
|
handle_refs(refs, revs, *flags, for_each_bad_bisect_ref);
|
|
|
|
handle_refs(refs, revs, *flags ^ (UNINTERESTING | BOTTOM),
|
|
|
|
for_each_good_bisect_ref);
|
2011-04-21 18:45:07 +08:00
|
|
|
revs->bisect = 1;
|
|
|
|
} else if (!strcmp(arg, "--tags")) {
|
2022-11-17 13:46:56 +08:00
|
|
|
if (revs->ref_excludes.hidden_refs_configured)
|
2023-12-06 19:51:58 +08:00
|
|
|
return error(_("options '%s' and '%s' cannot be used together"),
|
|
|
|
"--exclude-hidden", "--tags");
|
2017-08-23 20:36:56 +08:00
|
|
|
handle_refs(refs, revs, *flags, refs_for_each_tag_ref);
|
2022-11-17 13:46:51 +08:00
|
|
|
clear_ref_exclusions(&revs->ref_excludes);
|
2011-04-21 18:45:07 +08:00
|
|
|
} else if (!strcmp(arg, "--remotes")) {
|
2022-11-17 13:46:56 +08:00
|
|
|
if (revs->ref_excludes.hidden_refs_configured)
|
2023-12-06 19:51:58 +08:00
|
|
|
return error(_("options '%s' and '%s' cannot be used together"),
|
|
|
|
"--exclude-hidden", "--remotes");
|
2017-08-23 20:36:56 +08:00
|
|
|
handle_refs(refs, revs, *flags, refs_for_each_remote_ref);
|
2022-11-17 13:46:51 +08:00
|
|
|
clear_ref_exclusions(&revs->ref_excludes);
|
2011-04-21 18:45:07 +08:00
|
|
|
} else if ((argcount = parse_long_opt("glob", argv, &optarg))) {
|
|
|
|
struct all_refs_cb cb;
|
|
|
|
init_all_refs_cb(&cb, revs, *flags);
|
2024-05-07 15:11:53 +08:00
|
|
|
refs_for_each_glob_ref(get_main_ref_store(the_repository),
|
|
|
|
handle_one_ref, optarg, &cb);
|
2022-11-17 13:46:51 +08:00
|
|
|
clear_ref_exclusions(&revs->ref_excludes);
|
revision: introduce --exclude=<glob> to tame wildcards
People often find "git log --branches" etc. that includes _all_
branches is cumbersome to use when they want to grab most but except
some. The same applies to --tags, --all and --glob.
Teach the revision machinery to remember patterns, and then upon the
next such a globbing option, exclude those that match the pattern.
With this, I can view only my integration branches (e.g. maint,
master, etc.) without topic branches, which are named after two
letters from primary authors' names, slash and topic name.
git rev-list --no-walk --exclude=??/* --branches |
git name-rev --refs refs/heads/* --stdin
This one shows things reachable from local and remote branches that
have not been merged to the integration branches.
git log --remotes --branches --not --exclude=??/* --branches
It may be a bit rough around the edges, in that the pattern to give
the exclude option depends on what globbing option follows. In
these examples, the pattern "??/*" is used, not "refs/heads/??/*",
because the globbing option that follows the -"-exclude=<pattern>"
is "--branches". As each use of globbing option resets previously
set "--exclude", this may not be such a bad thing, though.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-31 07:37:55 +08:00
|
|
|
return argcount;
|
|
|
|
} else if ((argcount = parse_long_opt("exclude", argv, &optarg))) {
|
2013-11-02 03:02:45 +08:00
|
|
|
add_ref_exclusion(&revs->ref_excludes, optarg);
|
2011-04-21 18:45:07 +08:00
|
|
|
return argcount;
|
2022-11-17 13:46:56 +08:00
|
|
|
} else if ((argcount = parse_long_opt("exclude-hidden", argv, &optarg))) {
|
|
|
|
exclude_hidden_refs(&revs->ref_excludes, optarg);
|
|
|
|
return argcount;
|
2017-06-10 02:17:33 +08:00
|
|
|
} else if (skip_prefix(arg, "--branches=", &optarg)) {
|
2011-04-21 18:45:07 +08:00
|
|
|
struct all_refs_cb cb;
|
2022-11-17 13:46:56 +08:00
|
|
|
if (revs->ref_excludes.hidden_refs_configured)
|
2023-12-06 19:51:58 +08:00
|
|
|
return error(_("options '%s' and '%s' cannot be used together"),
|
|
|
|
"--exclude-hidden", "--branches");
|
2011-04-21 18:45:07 +08:00
|
|
|
init_all_refs_cb(&cb, revs, *flags);
|
2024-05-07 15:11:53 +08:00
|
|
|
refs_for_each_glob_ref_in(get_main_ref_store(the_repository),
|
|
|
|
handle_one_ref, optarg,
|
|
|
|
"refs/heads/", &cb);
|
2022-11-17 13:46:51 +08:00
|
|
|
clear_ref_exclusions(&revs->ref_excludes);
|
2017-06-10 02:17:33 +08:00
|
|
|
} else if (skip_prefix(arg, "--tags=", &optarg)) {
|
2011-04-21 18:45:07 +08:00
|
|
|
struct all_refs_cb cb;
|
2022-11-17 13:46:56 +08:00
|
|
|
if (revs->ref_excludes.hidden_refs_configured)
|
2023-12-06 19:51:58 +08:00
|
|
|
return error(_("options '%s' and '%s' cannot be used together"),
|
|
|
|
"--exclude-hidden", "--tags");
|
2011-04-21 18:45:07 +08:00
|
|
|
init_all_refs_cb(&cb, revs, *flags);
|
2024-05-07 15:11:53 +08:00
|
|
|
refs_for_each_glob_ref_in(get_main_ref_store(the_repository),
|
|
|
|
handle_one_ref, optarg,
|
|
|
|
"refs/tags/", &cb);
|
2022-11-17 13:46:51 +08:00
|
|
|
clear_ref_exclusions(&revs->ref_excludes);
|
2017-06-10 02:17:33 +08:00
|
|
|
} else if (skip_prefix(arg, "--remotes=", &optarg)) {
|
2011-04-21 18:45:07 +08:00
|
|
|
struct all_refs_cb cb;
|
2022-11-17 13:46:56 +08:00
|
|
|
if (revs->ref_excludes.hidden_refs_configured)
|
2023-12-06 19:51:58 +08:00
|
|
|
return error(_("options '%s' and '%s' cannot be used together"),
|
|
|
|
"--exclude-hidden", "--remotes");
|
2011-04-21 18:45:07 +08:00
|
|
|
init_all_refs_cb(&cb, revs, *flags);
|
2024-05-07 15:11:53 +08:00
|
|
|
refs_for_each_glob_ref_in(get_main_ref_store(the_repository),
|
|
|
|
handle_one_ref, optarg,
|
|
|
|
"refs/remotes/", &cb);
|
2022-11-17 13:46:51 +08:00
|
|
|
clear_ref_exclusions(&revs->ref_excludes);
|
2011-04-21 18:45:07 +08:00
|
|
|
} else if (!strcmp(arg, "--reflog")) {
|
2014-10-16 06:38:31 +08:00
|
|
|
add_reflogs_to_pending(revs, *flags);
|
2014-10-17 08:44:23 +08:00
|
|
|
} else if (!strcmp(arg, "--indexed-objects")) {
|
|
|
|
add_index_objects_to_pending(revs, *flags);
|
2019-07-01 21:18:15 +08:00
|
|
|
} else if (!strcmp(arg, "--alternate-refs")) {
|
|
|
|
add_alternate_refs_to_pending(revs, *flags);
|
2011-04-21 18:45:07 +08:00
|
|
|
} else if (!strcmp(arg, "--not")) {
|
2013-05-16 23:32:38 +08:00
|
|
|
*flags ^= UNINTERESTING | BOTTOM;
|
2011-04-21 18:45:07 +08:00
|
|
|
} else if (!strcmp(arg, "--no-walk")) {
|
2021-08-05 19:25:24 +08:00
|
|
|
revs->no_walk = 1;
|
2017-06-10 02:17:33 +08:00
|
|
|
} else if (skip_prefix(arg, "--no-walk=", &optarg)) {
|
teach log --no-walk=unsorted, which avoids sorting
When 'git log' is passed the --no-walk option, no revision walk takes
place, naturally. Perhaps somewhat surprisingly, however, the provided
revisions still get sorted by commit date. So e.g 'git log --no-walk
HEAD HEAD~1' and 'git log --no-walk HEAD~1 HEAD' give the same result
(unless the two revisions share the commit date, in which case they
will retain the order given on the command line). As the commit that
introduced --no-walk (8e64006 (Teach revision machinery about
--no-walk, 2007-07-24)) points out, the sorting is intentional, to
allow things like
git log --abbrev-commit --pretty=oneline --decorate --all --no-walk
to show all refs in order by commit date.
But there are also other cases where the sorting is not wanted, such
as
<command producing revisions in order> |
git log --oneline --no-walk --stdin
To accomodate both cases, leave the decision of whether or not to sort
up to the caller, by allowing --no-walk={sorted,unsorted}, defaulting
to 'sorted' for backward-compatibility reasons.
Signed-off-by: Martin von Zweigbergk <martinvonz@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-08-29 14:15:54 +08:00
|
|
|
/*
|
|
|
|
* Detached form ("--no-walk X" as opposed to "--no-walk=X")
|
|
|
|
* not allowed, since the argument is optional.
|
|
|
|
*/
|
2021-08-05 19:25:24 +08:00
|
|
|
revs->no_walk = 1;
|
2017-06-10 02:17:33 +08:00
|
|
|
if (!strcmp(optarg, "sorted"))
|
2021-08-05 19:25:24 +08:00
|
|
|
revs->unsorted_input = 0;
|
2017-06-10 02:17:33 +08:00
|
|
|
else if (!strcmp(optarg, "unsorted"))
|
2021-08-05 19:25:24 +08:00
|
|
|
revs->unsorted_input = 1;
|
teach log --no-walk=unsorted, which avoids sorting
When 'git log' is passed the --no-walk option, no revision walk takes
place, naturally. Perhaps somewhat surprisingly, however, the provided
revisions still get sorted by commit date. So e.g 'git log --no-walk
HEAD HEAD~1' and 'git log --no-walk HEAD~1 HEAD' give the same result
(unless the two revisions share the commit date, in which case they
will retain the order given on the command line). As the commit that
introduced --no-walk (8e64006 (Teach revision machinery about
--no-walk, 2007-07-24)) points out, the sorting is intentional, to
allow things like
git log --abbrev-commit --pretty=oneline --decorate --all --no-walk
to show all refs in order by commit date.
But there are also other cases where the sorting is not wanted, such
as
<command producing revisions in order> |
git log --oneline --no-walk --stdin
To accomodate both cases, leave the decision of whether or not to sort
up to the caller, by allowing --no-walk={sorted,unsorted}, defaulting
to 'sorted' for backward-compatibility reasons.
Signed-off-by: Martin von Zweigbergk <martinvonz@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-08-29 14:15:54 +08:00
|
|
|
else
|
|
|
|
return error("invalid argument to --no-walk");
|
2011-04-21 18:45:07 +08:00
|
|
|
} else if (!strcmp(arg, "--do-walk")) {
|
|
|
|
revs->no_walk = 0;
|
2017-08-23 20:37:02 +08:00
|
|
|
} else if (!strcmp(arg, "--single-worktree")) {
|
|
|
|
revs->single_worktree = 1;
|
2022-03-23 01:28:35 +08:00
|
|
|
} else if (skip_prefix(arg, ("--filter="), &arg)) {
|
2022-03-10 00:01:40 +08:00
|
|
|
parse_list_objects_filter(&revs->filter, arg);
|
2022-03-23 01:28:35 +08:00
|
|
|
} else if (!strcmp(arg, ("--no-filter"))) {
|
2022-03-10 00:01:40 +08:00
|
|
|
list_objects_filter_set_no_filter(&revs->filter);
|
2011-04-21 18:45:07 +08:00
|
|
|
} else {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2023-06-15 22:39:51 +08:00
|
|
|
static void read_revisions_from_stdin(struct rev_info *revs,
|
revision: make pseudo-opt flags read via stdin behave consistently
When reading revisions from stdin via git-rev-list(1)'s `--stdin` option
then these revisions never honor flags like `--not` which have been
passed on the command line. Thus, an invocation like e.g. `git rev-list
--all --not --stdin` will not treat all revisions read from stdin as
uninteresting. While this behaviour may be surprising to a user, it's
been this way ever since it has been introduced via 42cabc341c4 (Teach
rev-list an option to read revs from the standard input., 2006-09-05).
With that said, in c40f0b7877 (revision: handle pseudo-opts in `--stdin`
mode, 2023-06-15) we have introduced a new mode to read pseudo opts from
standard input where this behaviour is a lot more confusing. If you pass
`--not` via stdin, it will:
- Influence subsequent revisions or pseudo-options passed on the
command line.
- Influence pseudo-options passed via standard input.
- _Not_ influence normal revisions passed via standard input.
This behaviour is extremely inconsistent and bound to cause confusion.
While it would be nice to retroactively change the behaviour for how
`--not` and `--stdin` behave together, chances are quite high that this
would break existing scripts that expect the current behaviour that has
been around for many years by now. This is thus not really a viable
option to explore to fix the inconsistency.
Instead, we change the behaviour of how pseudo-opts read via standard
input influence the flags such that the effect is fully localized. With
this change, when reading `--not` via standard input, it will:
- _Not_ influence subsequent revisions or pseudo-options passed on
the command line, which is a change in behaviour.
- Influence pseudo-options passed via standard input.
- Influence normal revisions passed via standard input, which is a
change in behaviour.
Thus, all flags read via standard input are fully self-contained to that
standard input, only.
While this is a breaking change as well, the behaviour has only been
recently introduced with Git v2.42.0. Furthermore, the current behaviour
can be regarded as a simple bug. With that in mind it feels like the
right thing to retroactively change it and make the behaviour sane.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Reported-by: Christian Couder <christian.couder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-25 21:02:00 +08:00
|
|
|
struct strvec *prune)
|
2023-06-15 22:39:51 +08:00
|
|
|
{
|
|
|
|
struct strbuf sb;
|
|
|
|
int seen_dashdash = 0;
|
2023-06-15 22:39:59 +08:00
|
|
|
int seen_end_of_options = 0;
|
2023-06-15 22:39:51 +08:00
|
|
|
int save_warning;
|
revision: make pseudo-opt flags read via stdin behave consistently
When reading revisions from stdin via git-rev-list(1)'s `--stdin` option
then these revisions never honor flags like `--not` which have been
passed on the command line. Thus, an invocation like e.g. `git rev-list
--all --not --stdin` will not treat all revisions read from stdin as
uninteresting. While this behaviour may be surprising to a user, it's
been this way ever since it has been introduced via 42cabc341c4 (Teach
rev-list an option to read revs from the standard input., 2006-09-05).
With that said, in c40f0b7877 (revision: handle pseudo-opts in `--stdin`
mode, 2023-06-15) we have introduced a new mode to read pseudo opts from
standard input where this behaviour is a lot more confusing. If you pass
`--not` via stdin, it will:
- Influence subsequent revisions or pseudo-options passed on the
command line.
- Influence pseudo-options passed via standard input.
- _Not_ influence normal revisions passed via standard input.
This behaviour is extremely inconsistent and bound to cause confusion.
While it would be nice to retroactively change the behaviour for how
`--not` and `--stdin` behave together, chances are quite high that this
would break existing scripts that expect the current behaviour that has
been around for many years by now. This is thus not really a viable
option to explore to fix the inconsistency.
Instead, we change the behaviour of how pseudo-opts read via standard
input influence the flags such that the effect is fully localized. With
this change, when reading `--not` via standard input, it will:
- _Not_ influence subsequent revisions or pseudo-options passed on
the command line, which is a change in behaviour.
- Influence pseudo-options passed via standard input.
- Influence normal revisions passed via standard input, which is a
change in behaviour.
Thus, all flags read via standard input are fully self-contained to that
standard input, only.
While this is a breaking change as well, the behaviour has only been
recently introduced with Git v2.42.0. Furthermore, the current behaviour
can be regarded as a simple bug. With that in mind it feels like the
right thing to retroactively change it and make the behaviour sane.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Reported-by: Christian Couder <christian.couder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-25 21:02:00 +08:00
|
|
|
int flags = 0;
|
2023-06-15 22:39:51 +08:00
|
|
|
|
|
|
|
save_warning = warn_on_object_refname_ambiguity;
|
|
|
|
warn_on_object_refname_ambiguity = 0;
|
|
|
|
|
|
|
|
strbuf_init(&sb, 1000);
|
|
|
|
while (strbuf_getline(&sb, stdin) != EOF) {
|
2023-06-15 22:39:55 +08:00
|
|
|
if (!sb.len)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (!strcmp(sb.buf, "--")) {
|
|
|
|
seen_dashdash = 1;
|
2023-06-15 22:39:51 +08:00
|
|
|
break;
|
|
|
|
}
|
2023-06-15 22:39:55 +08:00
|
|
|
|
2023-06-15 22:39:59 +08:00
|
|
|
if (!seen_end_of_options && sb.buf[0] == '-') {
|
|
|
|
const char *argv[] = { sb.buf, NULL };
|
|
|
|
|
|
|
|
if (!strcmp(sb.buf, "--end-of-options")) {
|
|
|
|
seen_end_of_options = 1;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
revision: make pseudo-opt flags read via stdin behave consistently
When reading revisions from stdin via git-rev-list(1)'s `--stdin` option
then these revisions never honor flags like `--not` which have been
passed on the command line. Thus, an invocation like e.g. `git rev-list
--all --not --stdin` will not treat all revisions read from stdin as
uninteresting. While this behaviour may be surprising to a user, it's
been this way ever since it has been introduced via 42cabc341c4 (Teach
rev-list an option to read revs from the standard input., 2006-09-05).
With that said, in c40f0b7877 (revision: handle pseudo-opts in `--stdin`
mode, 2023-06-15) we have introduced a new mode to read pseudo opts from
standard input where this behaviour is a lot more confusing. If you pass
`--not` via stdin, it will:
- Influence subsequent revisions or pseudo-options passed on the
command line.
- Influence pseudo-options passed via standard input.
- _Not_ influence normal revisions passed via standard input.
This behaviour is extremely inconsistent and bound to cause confusion.
While it would be nice to retroactively change the behaviour for how
`--not` and `--stdin` behave together, chances are quite high that this
would break existing scripts that expect the current behaviour that has
been around for many years by now. This is thus not really a viable
option to explore to fix the inconsistency.
Instead, we change the behaviour of how pseudo-opts read via standard
input influence the flags such that the effect is fully localized. With
this change, when reading `--not` via standard input, it will:
- _Not_ influence subsequent revisions or pseudo-options passed on
the command line, which is a change in behaviour.
- Influence pseudo-options passed via standard input.
- Influence normal revisions passed via standard input, which is a
change in behaviour.
Thus, all flags read via standard input are fully self-contained to that
standard input, only.
While this is a breaking change as well, the behaviour has only been
recently introduced with Git v2.42.0. Furthermore, the current behaviour
can be regarded as a simple bug. With that in mind it feels like the
right thing to retroactively change it and make the behaviour sane.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Reported-by: Christian Couder <christian.couder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-25 21:02:00 +08:00
|
|
|
if (handle_revision_pseudo_opt(revs, argv, &flags) > 0)
|
2023-06-15 22:39:59 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
die(_("invalid option '%s' in --stdin mode"), sb.buf);
|
|
|
|
}
|
2023-06-15 22:39:55 +08:00
|
|
|
|
revision: make pseudo-opt flags read via stdin behave consistently
When reading revisions from stdin via git-rev-list(1)'s `--stdin` option
then these revisions never honor flags like `--not` which have been
passed on the command line. Thus, an invocation like e.g. `git rev-list
--all --not --stdin` will not treat all revisions read from stdin as
uninteresting. While this behaviour may be surprising to a user, it's
been this way ever since it has been introduced via 42cabc341c4 (Teach
rev-list an option to read revs from the standard input., 2006-09-05).
With that said, in c40f0b7877 (revision: handle pseudo-opts in `--stdin`
mode, 2023-06-15) we have introduced a new mode to read pseudo opts from
standard input where this behaviour is a lot more confusing. If you pass
`--not` via stdin, it will:
- Influence subsequent revisions or pseudo-options passed on the
command line.
- Influence pseudo-options passed via standard input.
- _Not_ influence normal revisions passed via standard input.
This behaviour is extremely inconsistent and bound to cause confusion.
While it would be nice to retroactively change the behaviour for how
`--not` and `--stdin` behave together, chances are quite high that this
would break existing scripts that expect the current behaviour that has
been around for many years by now. This is thus not really a viable
option to explore to fix the inconsistency.
Instead, we change the behaviour of how pseudo-opts read via standard
input influence the flags such that the effect is fully localized. With
this change, when reading `--not` via standard input, it will:
- _Not_ influence subsequent revisions or pseudo-options passed on
the command line, which is a change in behaviour.
- Influence pseudo-options passed via standard input.
- Influence normal revisions passed via standard input, which is a
change in behaviour.
Thus, all flags read via standard input are fully self-contained to that
standard input, only.
While this is a breaking change as well, the behaviour has only been
recently introduced with Git v2.42.0. Furthermore, the current behaviour
can be regarded as a simple bug. With that in mind it feels like the
right thing to retroactively change it and make the behaviour sane.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Reported-by: Christian Couder <christian.couder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-25 21:02:00 +08:00
|
|
|
if (handle_revision_arg(sb.buf, revs, flags,
|
2023-06-15 22:39:51 +08:00
|
|
|
REVARG_CANNOT_BE_FILENAME))
|
|
|
|
die("bad revision '%s'", sb.buf);
|
|
|
|
}
|
|
|
|
if (seen_dashdash)
|
|
|
|
read_pathspec_from_stdin(&sb, prune);
|
|
|
|
|
|
|
|
strbuf_release(&sb);
|
|
|
|
warn_on_object_refname_ambiguity = save_warning;
|
|
|
|
}
|
|
|
|
|
2015-08-29 13:04:18 +08:00
|
|
|
static void NORETURN diagnose_missing_default(const char *def)
|
|
|
|
{
|
|
|
|
int flags;
|
|
|
|
const char *refname;
|
|
|
|
|
2024-05-07 15:11:53 +08:00
|
|
|
refname = refs_resolve_ref_unsafe(get_main_ref_store(the_repository),
|
|
|
|
def, 0, NULL, &flags);
|
2015-08-29 13:04:18 +08:00
|
|
|
if (!refname || !(flags & REF_ISSYMREF) || (flags & REF_ISBROKEN))
|
|
|
|
die(_("your current branch appears to be broken"));
|
|
|
|
|
|
|
|
skip_prefix(refname, "refs/heads/", &refname);
|
|
|
|
die(_("your current branch '%s' does not have any commits yet"),
|
|
|
|
refname);
|
|
|
|
}
|
|
|
|
|
2006-02-26 08:19:46 +08:00
|
|
|
/*
|
|
|
|
* Parse revision information, filling in the "rev_info" structure,
|
|
|
|
* and removing the used arguments from the argument list.
|
|
|
|
*
|
2006-03-01 07:07:20 +08:00
|
|
|
* Returns the number of arguments left that weren't recognized
|
|
|
|
* (which are also moved to the head of the argument list)
|
2006-02-26 08:19:46 +08:00
|
|
|
*/
|
2010-03-09 14:58:09 +08:00
|
|
|
int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct setup_revision_opt *opt)
|
2006-02-26 08:19:46 +08:00
|
|
|
{
|
revision: set rev_input_given in handle_revision_arg()
Commit 7ba826290a (revision: add rev_input_given flag, 2017-08-02) added
a flag to rev_info to tell whether we got any revision arguments. As
explained there, this is necessary because some revision arguments may
not produce any pending traversal objects, but should still inhibit
default behaviors (e.g., a glob that matches nothing).
However, it only set the flag in the globbing code, but not for
revisions we get on the command-line or via stdin. This leads to two
problems:
- the command-line code keeps its own separate got_rev_arg flag; this
isn't wrong, but it's confusing and an extra maintenance burden
- even specifically-named rev arguments might end up not adding any
pending objects: if --ignore-missing is set, then specifying a
missing object is a noop rather than an error.
And that leads to some user-visible bugs:
- when deciding whether a default rev like "HEAD" should kick in, we
check both got_rev_arg and rev_input_given. That means that
"--ignore-missing $ZERO_OID" works on the command-line (where we set
got_rev_arg) but not on --stdin (where we don't)
- when rev-list decides whether it should complain that it wasn't
given a starting point, it relies on rev_input_given. So it can't
even get the command-line "--ignore-missing $ZERO_OID" right
Let's consistently set the flag if we got any revision argument. That
lets us clean up the redundant got_rev_arg, and fixes both of those bugs
(but note there are three new tests: we'll confirm the already working
git-log command-line case).
A few implementation notes:
- conceptually we want to set the flag whenever handle_revision_arg()
finds an actual revision arg ("handles" it, you might say). But it
covers a ton of cases with early returns. Rather than annotating
each one, we just wrap it and use its success exit-code to set the
flag in one spot.
- the new rev-list test is in t6018, which is titled to cover globs.
This isn't exactly a glob, but it made sense to stick it with the
other tests that handle the "even though we got a rev, we have no
pending objects" case, which are globs.
- the tests check for the oid of a missing object, which it's pretty
clear --ignore-missing should ignore. You can see the same behavior
with "--ignore-missing a-ref-that-does-not-exist", because
--ignore-missing treats them both the same. That's perhaps less
clearly correct, and we may want to change that in the future. But
the way the code and tests here are written, we'd continue to do the
right thing even if it does.
Reported-by: Bryan Turner <bturner@atlassian.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-27 04:13:05 +08:00
|
|
|
int i, flags, left, seen_dashdash, revarg_opt;
|
2020-07-29 04:25:12 +08:00
|
|
|
struct strvec prune_data = STRVEC_INIT;
|
revision: allow --end-of-options to end option parsing
There's currently no robust way to tell Git that a particular option is
meant to be a revision, and not an option. So if you have a branch
"refs/heads/--foo", you cannot just say:
git rev-list --foo
You can say:
git rev-list refs/heads/--foo
But that breaks down if you don't know the refname, and in particular if
you're a script passing along a value from elsewhere. In most programs,
you can use "--" to end option parsing, like this:
some-prog -- "$revision"
But that doesn't work for the revision parser, because "--" is already
meaningful there: it separates revisions from pathspecs. So we need some
other marker to separate options from revisions.
This patch introduces "--end-of-options", which serves that purpose:
git rev-list --oneline --end-of-options "$revision"
will work regardless of what's in "$revision" (well, if you say "--" it
may fail, but it won't do something dangerous, like triggering an
unexpected option).
The name is verbose, but that's probably a good thing; this is meant to
be used for scripted invocations where readability is more important
than terseness.
One alternative would be to introduce an explicit option to mark a
revision, like:
git rev-list --oneline --revision="$revision"
That's slightly _more_ informative than this commit (because it makes
even something silly like "--" unambiguous). But the pattern of using a
separator like "--" is well established in git and in other commands,
and it makes some scripting tasks simpler like:
git rev-list --end-of-options "$@"
There's no documentation in this patch, because it will make sense to
describe the feature once it is available everywhere (and support will
be added in further patches).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-06 22:39:58 +08:00
|
|
|
int seen_end_of_options = 0;
|
2010-07-07 21:39:12 +08:00
|
|
|
|
2006-02-26 08:19:46 +08:00
|
|
|
/* First, search for "--" */
|
2012-04-15 03:04:48 +08:00
|
|
|
if (opt && opt->assume_dashdash) {
|
2006-02-26 08:19:46 +08:00
|
|
|
seen_dashdash = 1;
|
2012-04-15 03:04:48 +08:00
|
|
|
} else {
|
|
|
|
seen_dashdash = 0;
|
|
|
|
for (i = 1; i < argc; i++) {
|
|
|
|
const char *arg = argv[i];
|
|
|
|
if (strcmp(arg, "--"))
|
|
|
|
continue;
|
2022-08-02 23:33:16 +08:00
|
|
|
if (opt && opt->free_removed_argv_elements)
|
|
|
|
free((char *)argv[i]);
|
2012-04-15 03:04:48 +08:00
|
|
|
argv[i] = NULL;
|
|
|
|
argc = i;
|
|
|
|
if (argv[i + 1])
|
2020-07-29 04:25:12 +08:00
|
|
|
strvec_pushv(&prune_data, argv + i + 1);
|
2012-04-15 03:04:48 +08:00
|
|
|
seen_dashdash = 1;
|
|
|
|
break;
|
|
|
|
}
|
2006-02-26 08:19:46 +08:00
|
|
|
}
|
|
|
|
|
2008-07-08 21:19:33 +08:00
|
|
|
/* Second, deal with arguments and options */
|
|
|
|
flags = 0;
|
2012-07-03 03:43:05 +08:00
|
|
|
revarg_opt = opt ? opt->revarg_opt : 0;
|
|
|
|
if (seen_dashdash)
|
|
|
|
revarg_opt |= REVARG_CANNOT_BE_FILENAME;
|
2008-07-08 21:19:33 +08:00
|
|
|
for (left = i = 1; i < argc; i++) {
|
2006-02-26 08:19:46 +08:00
|
|
|
const char *arg = argv[i];
|
revision: allow --end-of-options to end option parsing
There's currently no robust way to tell Git that a particular option is
meant to be a revision, and not an option. So if you have a branch
"refs/heads/--foo", you cannot just say:
git rev-list --foo
You can say:
git rev-list refs/heads/--foo
But that breaks down if you don't know the refname, and in particular if
you're a script passing along a value from elsewhere. In most programs,
you can use "--" to end option parsing, like this:
some-prog -- "$revision"
But that doesn't work for the revision parser, because "--" is already
meaningful there: it separates revisions from pathspecs. So we need some
other marker to separate options from revisions.
This patch introduces "--end-of-options", which serves that purpose:
git rev-list --oneline --end-of-options "$revision"
will work regardless of what's in "$revision" (well, if you say "--" it
may fail, but it won't do something dangerous, like triggering an
unexpected option).
The name is verbose, but that's probably a good thing; this is meant to
be used for scripted invocations where readability is more important
than terseness.
One alternative would be to introduce an explicit option to mark a
revision, like:
git rev-list --oneline --revision="$revision"
That's slightly _more_ informative than this commit (because it makes
even something silly like "--" unambiguous). But the pattern of using a
separator like "--" is well established in git and in other commands,
and it makes some scripting tasks simpler like:
git rev-list --end-of-options "$@"
There's no documentation in this patch, because it will make sense to
describe the feature once it is available everywhere (and support will
be added in further patches).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-06 22:39:58 +08:00
|
|
|
if (!seen_end_of_options && *arg == '-') {
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
int opts;
|
2008-07-08 21:19:33 +08:00
|
|
|
|
2021-09-10 02:47:29 +08:00
|
|
|
opts = handle_revision_pseudo_opt(
|
2020-09-30 20:28:18 +08:00
|
|
|
revs, argv + i,
|
2011-04-21 18:45:07 +08:00
|
|
|
&flags);
|
|
|
|
if (opts > 0) {
|
|
|
|
i += opts - 1;
|
2007-07-24 07:38:40 +08:00
|
|
|
continue;
|
|
|
|
}
|
2011-04-21 18:45:07 +08:00
|
|
|
|
2009-11-03 22:59:18 +08:00
|
|
|
if (!strcmp(arg, "--stdin")) {
|
|
|
|
if (revs->disable_stdin) {
|
|
|
|
argv[left++] = arg;
|
|
|
|
continue;
|
|
|
|
}
|
rev-list: make empty --stdin not an error
When we originally did the series that contains 7ba826290a
(revision: add rev_input_given flag, 2017-08-02) the intent
was that "git rev-list --stdin </dev/null" would similarly
become a successful noop. However, an attempt at the time to
do that did not work[1]. The problem is that rev_input_given
serves two roles:
- it tells rev-list.c that it should not error out
- it tells revision.c that it should not have the "default"
ref kick (e.g., "HEAD" in "git log")
We want to trigger the former, but not the latter. This is
technically possible with a single flag, if we set the flag
only after revision.c's revs->def check. But this introduces
a rather subtle ordering dependency.
Instead, let's keep two flags: one to denote when we got
actual input (which triggers both roles) and one for when we
read stdin (which triggers only the first).
This does mean a caller interested in the first role has to
check both flags, but there's only one such caller. And any
future callers might want to make the distinction anyway
(e.g., if they care less about erroring out, and more about
whether revision.c soaked up our stdin).
In fact, we already keep such a flag internally in
revision.c for this purpose, so this is really just exposing
that to the caller (and the old function-local flag can go
away in favor of our new one).
[1] https://public-inbox.org/git/20170802223416.gwiezhbuxbdmbjzx@sigill.intra.peff.net/
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-23 05:37:23 +08:00
|
|
|
if (revs->read_from_stdin++)
|
2009-11-03 22:59:18 +08:00
|
|
|
die("--stdin given twice?");
|
revision: make pseudo-opt flags read via stdin behave consistently
When reading revisions from stdin via git-rev-list(1)'s `--stdin` option
then these revisions never honor flags like `--not` which have been
passed on the command line. Thus, an invocation like e.g. `git rev-list
--all --not --stdin` will not treat all revisions read from stdin as
uninteresting. While this behaviour may be surprising to a user, it's
been this way ever since it has been introduced via 42cabc341c4 (Teach
rev-list an option to read revs from the standard input., 2006-09-05).
With that said, in c40f0b7877 (revision: handle pseudo-opts in `--stdin`
mode, 2023-06-15) we have introduced a new mode to read pseudo opts from
standard input where this behaviour is a lot more confusing. If you pass
`--not` via stdin, it will:
- Influence subsequent revisions or pseudo-options passed on the
command line.
- Influence pseudo-options passed via standard input.
- _Not_ influence normal revisions passed via standard input.
This behaviour is extremely inconsistent and bound to cause confusion.
While it would be nice to retroactively change the behaviour for how
`--not` and `--stdin` behave together, chances are quite high that this
would break existing scripts that expect the current behaviour that has
been around for many years by now. This is thus not really a viable
option to explore to fix the inconsistency.
Instead, we change the behaviour of how pseudo-opts read via standard
input influence the flags such that the effect is fully localized. With
this change, when reading `--not` via standard input, it will:
- _Not_ influence subsequent revisions or pseudo-options passed on
the command line, which is a change in behaviour.
- Influence pseudo-options passed via standard input.
- Influence normal revisions passed via standard input, which is a
change in behaviour.
Thus, all flags read via standard input are fully self-contained to that
standard input, only.
While this is a breaking change as well, the behaviour has only been
recently introduced with Git v2.42.0. Furthermore, the current behaviour
can be regarded as a simple bug. With that in mind it feels like the
right thing to retroactively change it and make the behaviour sane.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Reported-by: Christian Couder <christian.couder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-25 21:02:00 +08:00
|
|
|
read_revisions_from_stdin(revs, &prune_data);
|
2009-11-03 22:59:18 +08:00
|
|
|
continue;
|
|
|
|
}
|
2006-09-21 04:21:56 +08:00
|
|
|
|
revision: allow --end-of-options to end option parsing
There's currently no robust way to tell Git that a particular option is
meant to be a revision, and not an option. So if you have a branch
"refs/heads/--foo", you cannot just say:
git rev-list --foo
You can say:
git rev-list refs/heads/--foo
But that breaks down if you don't know the refname, and in particular if
you're a script passing along a value from elsewhere. In most programs,
you can use "--" to end option parsing, like this:
some-prog -- "$revision"
But that doesn't work for the revision parser, because "--" is already
meaningful there: it separates revisions from pathspecs. So we need some
other marker to separate options from revisions.
This patch introduces "--end-of-options", which serves that purpose:
git rev-list --oneline --end-of-options "$revision"
will work regardless of what's in "$revision" (well, if you say "--" it
may fail, but it won't do something dangerous, like triggering an
unexpected option).
The name is verbose, but that's probably a good thing; this is meant to
be used for scripted invocations where readability is more important
than terseness.
One alternative would be to introduce an explicit option to mark a
revision, like:
git rev-list --oneline --revision="$revision"
That's slightly _more_ informative than this commit (because it makes
even something silly like "--" unambiguous). But the pattern of using a
separator like "--" is well established in git and in other commands,
and it makes some scripting tasks simpler like:
git rev-list --end-of-options "$@"
There's no documentation in this patch, because it will make sense to
describe the feature once it is available everywhere (and support will
be added in further patches).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-06 22:39:58 +08:00
|
|
|
if (!strcmp(arg, "--end-of-options")) {
|
|
|
|
seen_end_of_options = 1;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2018-12-04 06:10:19 +08:00
|
|
|
opts = handle_revision_opt(revs, argc - i, argv + i,
|
|
|
|
&left, argv, opt);
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
if (opts > 0) {
|
|
|
|
i += opts - 1;
|
|
|
|
continue;
|
|
|
|
}
|
2008-07-08 21:19:33 +08:00
|
|
|
if (opts < 0)
|
|
|
|
exit(128);
|
2006-02-26 08:19:46 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2012-07-03 03:33:52 +08:00
|
|
|
|
|
|
|
if (handle_revision_arg(arg, revs, flags, revarg_opt)) {
|
2006-09-06 12:28:36 +08:00
|
|
|
int j;
|
|
|
|
if (seen_dashdash || *arg == '^')
|
2006-02-26 08:19:46 +08:00
|
|
|
die("bad revision '%s'", arg);
|
|
|
|
|
2006-04-27 06:09:27 +08:00
|
|
|
/* If we didn't have a "--":
|
|
|
|
* (1) all filenames must exist;
|
|
|
|
* (2) all rev-args must not be interpretable
|
|
|
|
* as a valid filename.
|
|
|
|
* but the latter we have checked in the main loop.
|
|
|
|
*/
|
2006-04-27 01:15:54 +08:00
|
|
|
for (j = i; j < argc; j++)
|
2012-06-19 02:18:21 +08:00
|
|
|
verify_filename(revs->prefix, argv[j], j == i);
|
2006-04-27 01:15:54 +08:00
|
|
|
|
2020-07-29 04:25:12 +08:00
|
|
|
strvec_pushv(&prune_data, argv + i);
|
2006-02-26 08:19:46 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2022-02-12 00:36:25 +08:00
|
|
|
revision_opts_finish(revs);
|
2006-09-06 12:28:36 +08:00
|
|
|
|
2020-07-29 08:37:20 +08:00
|
|
|
if (prune_data.nr) {
|
2011-05-12 06:23:25 +08:00
|
|
|
/*
|
|
|
|
* If we need to introduce the magic "a lone ':' means no
|
|
|
|
* pathspec whatsoever", here is the place to do so.
|
|
|
|
*
|
|
|
|
* if (prune_data.nr == 1 && !strcmp(prune_data[0], ":")) {
|
|
|
|
* prune_data.nr = 0;
|
|
|
|
* prune_data.alloc = 0;
|
|
|
|
* free(prune_data.path);
|
|
|
|
* prune_data.path = NULL;
|
|
|
|
* } else {
|
|
|
|
* terminate prune_data.alloc with NULL and
|
|
|
|
* call init_pathspec() to set revs->prune_data here.
|
|
|
|
* }
|
|
|
|
*/
|
2013-07-14 16:35:31 +08:00
|
|
|
parse_pathspec(&revs->prune_data, 0, 0,
|
2020-07-29 08:37:20 +08:00
|
|
|
revs->prefix, prune_data.v);
|
2011-05-12 05:01:19 +08:00
|
|
|
}
|
2020-07-29 04:25:12 +08:00
|
|
|
strvec_clear(&prune_data);
|
2009-11-20 18:33:28 +08:00
|
|
|
|
2022-05-03 00:50:37 +08:00
|
|
|
if (!revs->def)
|
2010-03-09 14:58:09 +08:00
|
|
|
revs->def = opt ? opt->def : NULL;
|
2010-03-09 15:27:25 +08:00
|
|
|
if (opt && opt->tweak)
|
2023-07-03 14:44:16 +08:00
|
|
|
opt->tweak(revs);
|
2008-07-08 21:19:33 +08:00
|
|
|
if (revs->show_merge)
|
2006-07-03 17:59:32 +08:00
|
|
|
prepare_show_merge(revs);
|
revision: set rev_input_given in handle_revision_arg()
Commit 7ba826290a (revision: add rev_input_given flag, 2017-08-02) added
a flag to rev_info to tell whether we got any revision arguments. As
explained there, this is necessary because some revision arguments may
not produce any pending traversal objects, but should still inhibit
default behaviors (e.g., a glob that matches nothing).
However, it only set the flag in the globbing code, but not for
revisions we get on the command-line or via stdin. This leads to two
problems:
- the command-line code keeps its own separate got_rev_arg flag; this
isn't wrong, but it's confusing and an extra maintenance burden
- even specifically-named rev arguments might end up not adding any
pending objects: if --ignore-missing is set, then specifying a
missing object is a noop rather than an error.
And that leads to some user-visible bugs:
- when deciding whether a default rev like "HEAD" should kick in, we
check both got_rev_arg and rev_input_given. That means that
"--ignore-missing $ZERO_OID" works on the command-line (where we set
got_rev_arg) but not on --stdin (where we don't)
- when rev-list decides whether it should complain that it wasn't
given a starting point, it relies on rev_input_given. So it can't
even get the command-line "--ignore-missing $ZERO_OID" right
Let's consistently set the flag if we got any revision argument. That
lets us clean up the redundant got_rev_arg, and fixes both of those bugs
(but note there are three new tests: we'll confirm the already working
git-log command-line case).
A few implementation notes:
- conceptually we want to set the flag whenever handle_revision_arg()
finds an actual revision arg ("handles" it, you might say). But it
covers a ton of cases with early returns. Rather than annotating
each one, we just wrap it and use its success exit-code to set the
flag in one spot.
- the new rev-list test is in t6018, which is titled to cover globs.
This isn't exactly a glob, but it made sense to stick it with the
other tests that handle the "even though we got a rev, we have no
pending objects" case, which are globs.
- the tests check for the oid of a missing object, which it's pretty
clear --ignore-missing should ignore. You can see the same behavior
with "--ignore-missing a-ref-that-does-not-exist", because
--ignore-missing treats them both the same. That's perhaps less
clearly correct, and we may want to change that in the future. But
the way the code and tests here are written, we'd continue to do the
right thing even if it does.
Reported-by: Bryan Turner <bturner@atlassian.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-27 04:13:05 +08:00
|
|
|
if (revs->def && !revs->pending.nr && !revs->rev_input_given) {
|
2017-05-07 06:10:27 +08:00
|
|
|
struct object_id oid;
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
struct object *object;
|
2012-07-03 03:56:44 +08:00
|
|
|
struct object_context oc;
|
2019-01-12 10:13:28 +08:00
|
|
|
if (get_oid_with_context(revs->repo, revs->def, 0, &oid, &oc))
|
2015-08-29 13:04:18 +08:00
|
|
|
diagnose_missing_default(revs->def);
|
2017-05-07 06:10:27 +08:00
|
|
|
object = get_reference(revs, revs->def, &oid, 0);
|
2012-07-03 03:56:44 +08:00
|
|
|
add_pending_object_with_mode(revs, object, revs->def, oc.mode);
|
2024-06-11 17:19:59 +08:00
|
|
|
object_context_release(&oc);
|
2006-02-26 08:19:46 +08:00
|
|
|
}
|
2006-03-10 17:21:39 +08:00
|
|
|
|
2007-09-30 00:50:39 +08:00
|
|
|
/* Did the user ask for any diff output? Run the diff! */
|
|
|
|
if (revs->diffopt.output_format & ~DIFF_FORMAT_NO_OUTPUT)
|
|
|
|
revs->diff = 1;
|
|
|
|
|
2007-12-25 19:06:47 +08:00
|
|
|
/* Pickaxe, diff-filter and rename following need diffs */
|
2018-01-05 06:50:41 +08:00
|
|
|
if ((revs->diffopt.pickaxe_opts & DIFF_PICKAXE_KINDS_MASK) ||
|
2007-12-25 19:06:47 +08:00
|
|
|
revs->diffopt.filter ||
|
2017-11-01 02:19:11 +08:00
|
|
|
revs->diffopt.flags.follow_renames)
|
2007-09-30 00:50:39 +08:00
|
|
|
revs->diff = 1;
|
|
|
|
|
2018-01-05 06:50:42 +08:00
|
|
|
if (revs->diffopt.objfind)
|
|
|
|
revs->simplify_history = 0;
|
|
|
|
|
line-log: try to use generation number-based topo-ordering
The previous patch made it possible to perform line-level filtering
during history traversal instead of in an expensive preprocessing
step, but it still requires some simpler preprocessing steps, notably
topo-ordering. However, nowadays we have commit-graphs storing
generation numbers, which make it possible to incrementally traverse
the history in topological order, without the preparatory limit_list()
and sort_in_topological_order() steps; see b45424181e (revision.c:
generation-based topo-order algorithm, 2018-11-01).
This patch combines the two, so we can do both the topo-ordering and
the line-level filtering during history traversal, eliminating even
those simpler preprocessing steps, and thus further reducing the delay
before showing the first commit modifying the given line range.
The 'revs->limited' flag plays the central role in this, because, due
to limitations of the current implementation, the generation
number-based topo-ordering is only enabled when this flag remains
unset. Line-level log, however, always sets this flag in
setup_revisions() ever since the feature was introduced in 12da1d1f6f
(Implement line-history search (git log -L), 2013-03-28). The reason
for setting 'limited' is unclear, though, because the line-level log
itself doesn't directly depend on it, and it doesn't affect how the
limit_list() function limits the revision range. However, there is an
indirect dependency: the line-level log requires topo-ordering, and
the "traditional" sort_in_topological_order() requires an already
limited commit list since e6c3505b44 (Make sure we generate the whole
commit list before trying to sort it topologically, 2005-07-06). The
new, generation numbers-based topo-ordering doesn't require a limited
commit list anymore.
So don't set 'revs->limited' for line-level log, unless it is really
necessary, namely:
- The user explicitly requested parent rewriting, because that is
still done in the line_log_filter() preprocessing step (see
previous patch), which requires sort_in_topological_order() and in
turn limit_list() as well.
- A commit-graph file is not available or it doesn't yet contain
generation numbers. In these cases we had to fall back on
sort_in_topological_order() and in turn limit_list(). The
existing condition with generation_numbers_enabled() has already
ensured that the 'limited' flag is set in these cases; this patch
just makes sure that the line-level log sets 'revs->topo_order'
before that condition.
While the reduced delay before showing the first commit is measurable
in git.git, it takes a bigger repository to make it clearly noticable.
In both cases below the line ranges were chosen so that they were
modified rather close to the starting revisions, so the effect of this
change is most noticable.
# git.git
$ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0
Before:
real 0m0.107s
user 0m0.091s
sys 0m0.013s
After:
real 0m0.058s
user 0m0.050s
sys 0m0.005s
# linux.git
$ time git --no-pager log \
-L:build_restore_work_registers:arch/mips/mm/tlbex.c -1 v5.2
Before:
real 0m1.129s
user 0m1.061s
sys 0m0.069s
After:
real 0m0.096s
user 0m0.087s
sys 0m0.009s
Additional testing by Derrick Stolee: Since this patch improves
the performance for the first result, I repeated the experiment
from the previous patch on the Linux kernel repository, reporting
real time here:
Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
Before: 0.71 s
After: 0.05 s
Now, we have dropped the full topo-order of all ~910,000 commits
before reporting the first result. The remaining performance
improvements then are:
1. Update the parent-rewriting logic to be incremental similar to
how "git log --graph" behaves.
2. Use changed-path Bloom filters to reduce the time spend in the
tree-diff to see if the path(s) changed.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-11 19:56:18 +08:00
|
|
|
if (revs->line_level_traverse) {
|
|
|
|
if (want_ancestry(revs))
|
|
|
|
revs->limited = 1;
|
|
|
|
revs->topo_order = 1;
|
|
|
|
}
|
|
|
|
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
if (revs->topo_order && !generation_numbers_enabled(the_repository))
|
2006-04-02 10:38:25 +08:00
|
|
|
revs->limited = 1;
|
|
|
|
|
2010-12-17 20:43:06 +08:00
|
|
|
if (revs->prune_data.nr) {
|
2013-07-14 16:35:58 +08:00
|
|
|
copy_pathspec(&revs->pruning.pathspec, &revs->prune_data);
|
Finally implement "git log --follow"
Ok, I've really held off doing this too damn long, because I'm lazy, and I
was always hoping that somebody else would do it.
But no, people keep asking for it, but nobody actually did anything, so I
decided I might as well bite the bullet, and instead of telling people
they could add a "--follow" flag to "git log" to do what they want to do,
I decided that it looks like I just have to do it for them..
The code wasn't actually that complicated, in that the diffstat for this
patch literally says "70 insertions(+), 1 deletions(-)", but I will have
to admit that in order to get to this fairly simple patch, you did have to
know and understand the internal git diff generation machinery pretty
well, and had to really be able to follow how commit generation interacts
with generating patches and generating the log.
So I suspect that while I was right that it wasn't that hard, I might have
been expecting too much of random people - this patch does seem to be
firmly in the core "Linus or Junio" territory.
To make a long story short: I'm sorry for it taking so long until I just
did it.
I'm not going to guarantee that this works for everybody, but you really
can just look at the patch, and after the appropriate appreciative noises
("Ooh, aah") over how clever I am, you can then just notice that the code
itself isn't really that complicated.
All the real new code is in the new "try_to_follow_renames()" function. It
really isn't rocket science: we notice that the pathname we were looking
at went away, so we start a full tree diff and try to see if we can
instead make that pathname be a rename or a copy from some other previous
pathname. And if we can, we just continue, except we show *that*
particular diff, and ever after we use the _previous_ pathname.
One thing to look out for: the "rename detection" is considered to be a
singular event in the _linear_ "git log" output! That's what people want
to do, but I just wanted to point out that this patch is *not* carrying
around a "commit,pathname" kind of pair and it's *not* going to be able to
notice the file coming from multiple *different* files in earlier history.
IOW, if you use "git log --follow", then you get the stupid CVS/SVN kind
of "files have single identities" kind of semantics, and git log will just
pick the identity based on the normal move/copy heuristics _as_if_ the
history could be linearized.
Put another way: I think the model is broken, but given the broken model,
I think this patch does just about as well as you can do. If you have
merges with the same "file" having different filenames over the two
branches, git will just end up picking _one_ of the pathnames at the point
where the newer one goes away. It never looks at multiple pathnames in
parallel.
And if you understood all that, you probably didn't need it explained, and
if you didn't understand the above blathering, it doesn't really mtter to
you. What matters to you is that you can now do
git log -p --follow builtin-rev-list.c
and it will find the point where the old "rev-list.c" got renamed to
"builtin-rev-list.c" and show it as such.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-06-20 05:22:46 +08:00
|
|
|
/* Can't prune commits with rename following: the paths change.. */
|
2017-11-01 02:19:11 +08:00
|
|
|
if (!revs->diffopt.flags.follow_renames)
|
2007-11-06 05:22:34 +08:00
|
|
|
revs->prune = 1;
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
if (!revs->full_diff)
|
2013-07-14 16:35:58 +08:00
|
|
|
copy_pathspec(&revs->diffopt.pathspec,
|
|
|
|
&revs->prune_data);
|
2006-03-10 17:21:39 +08:00
|
|
|
}
|
2020-12-21 23:19:30 +08:00
|
|
|
|
2020-12-21 23:19:34 +08:00
|
|
|
diff_merges_setup_revs(revs);
|
log,diff-tree: add --combined-all-paths option
The combined diff format for merges will only list one filename, even if
rename or copy detection is active. For example, with raw format one
might see:
::100644 100644 100644 fabadb8 cc95eb0 4866510 MM describe.c
::100755 100755 100755 52b7a2d 6d1ac04 d2ac7d7 RM bar.sh
::100644 100644 100644 e07d6c5 9042e82 ee91881 RR phooey.c
This doesn't let us know what the original name of bar.sh was in the
first parent, and doesn't let us know what either of the original names
of phooey.c were in either of the parents. In contrast, for non-merge
commits, raw format does provide original filenames (and a rename score
to boot). In order to also provide original filenames for merge
commits, add a --combined-all-paths option (which must be used with
either -c or --cc, and is likely only useful with rename or copy
detection active) so that we can print tab-separated filenames when
renames are involved. This transforms the above output to:
::100644 100644 100644 fabadb8 cc95eb0 4866510 MM desc.c desc.c desc.c
::100755 100755 100755 52b7a2d 6d1ac04 d2ac7d7 RM foo.sh bar.sh bar.sh
::100644 100644 100644 e07d6c5 9042e82 ee91881 RR fooey.c fuey.c phooey.c
Further, in patch format, this changes the from/to headers so that
instead of just having one "from" header, we get one for each parent.
For example, instead of having
--- a/phooey.c
+++ b/phooey.c
we would see
--- a/fooey.c
--- a/fuey.c
+++ b/phooey.c
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-02-08 09:12:46 +08:00
|
|
|
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
revs->diffopt.abbrev = revs->abbrev;
|
Implement line-history search (git log -L)
This is a rewrite of much of Bo's work, mainly in an effort to split
it into smaller, easier to understand routines.
The algorithm is built around the struct range_set, which encodes a
series of line ranges as intervals [a,b). This is used in two
contexts:
* A set of lines we are tracking (which will change as we dig through
history).
* To encode diffs, as pairs of ranges.
The main routine is range_set_map_across_diff(). It processes the
diff between a commit C and some parent P. It determines which diff
hunks are relevant to the ranges tracked in C, and computes the new
ranges for P.
The algorithm is then simply to process history in topological order
from newest to oldest, computing ranges and (partial) diffs. At
branch points, we need to merge the ranges we are watching. We will
find that many commits do not affect the chosen ranges, and mark them
TREESAME (in addition to those already filtered by pathspec limiting).
Another pass of history simplification then gets rid of such commits.
This is wired as an extra filtering pass in the log machinery. This
currently only reduces code duplication, but should allow for other
simplifications and options to be used.
Finally, we hook a diff printer into the output chain. Ideally we
would wire directly into the diff logic, to optionally use features
like word diff. However, that will require some major reworking of
the diff chain, so we completely replace the output with our own diff
for now.
As this was a GSoC project, and has quite some history by now, many
people have helped. In no particular order, thanks go to
Jakub Narebski <jnareb@gmail.com>
Jens Lehmann <Jens.Lehmann@web.de>
Jonathan Nieder <jrnieder@gmail.com>
Junio C Hamano <gitster@pobox.com>
Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Will Palmer <wmpalmer@gmail.com>
Apologies to everyone I forgot.
Signed-off-by: Bo Yang <struggleyb.nku@gmail.com>
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-03-29 00:47:32 +08:00
|
|
|
|
2012-08-03 20:16:24 +08:00
|
|
|
diff_setup_done(&revs->diffopt);
|
2006-03-10 17:21:39 +08:00
|
|
|
|
2019-06-28 07:39:05 +08:00
|
|
|
if (!is_encoding_utf8(get_log_output_encoding()))
|
|
|
|
revs->grep_filter.ignore_locale = 1;
|
2008-08-25 14:15:05 +08:00
|
|
|
compile_grep_patterns(&revs->grep_filter);
|
2006-09-18 08:23:20 +08:00
|
|
|
|
2017-07-07 17:07:16 +08:00
|
|
|
if (revs->reflog_info && revs->limited)
|
|
|
|
die("cannot combine --walk-reflogs with history-limiting options");
|
2008-07-09 06:25:44 +08:00
|
|
|
if (revs->rewrite_parents && revs->children.name)
|
2022-01-06 04:02:16 +08:00
|
|
|
die(_("options '%s' and '%s' cannot be used together"), "--parents", "--children");
|
2022-03-10 00:01:40 +08:00
|
|
|
if (revs->filter.choice && !revs->blob_objects)
|
|
|
|
die(_("object filtering requires --objects"));
|
2007-08-20 10:33:43 +08:00
|
|
|
|
2008-05-04 18:36:54 +08:00
|
|
|
/*
|
|
|
|
* Limitations on the graph functionality
|
|
|
|
*/
|
2023-12-06 19:51:57 +08:00
|
|
|
die_for_incompatible_opt3(!!revs->graph, "--graph",
|
|
|
|
!!revs->reverse, "--reverse",
|
|
|
|
!!revs->reflog_info, "--walk-reflogs");
|
2008-05-04 18:36:54 +08:00
|
|
|
|
2015-03-11 10:13:02 +08:00
|
|
|
if (revs->no_walk && revs->graph)
|
2022-01-06 04:02:16 +08:00
|
|
|
die(_("options '%s' and '%s' cannot be used together"), "--no-walk", "--graph");
|
2012-09-30 02:59:52 +08:00
|
|
|
if (!revs->reflog_info && revs->grep_filter.use_reflog_filter)
|
2022-01-06 04:02:19 +08:00
|
|
|
die(_("the option '%s' requires '%s'"), "--grep-reflog", "--walk-reflogs");
|
2008-05-04 18:36:54 +08:00
|
|
|
|
2019-03-11 11:54:33 +08:00
|
|
|
if (revs->line_level_traverse &&
|
|
|
|
(revs->diffopt.output_format & ~(DIFF_FORMAT_PATCH | DIFF_FORMAT_NO_OUTPUT)))
|
|
|
|
die(_("-L does not yet support diff formats besides -p and -s"));
|
|
|
|
|
2016-03-30 06:49:24 +08:00
|
|
|
if (revs->expand_tabs_in_log < 0)
|
|
|
|
revs->expand_tabs_in_log = revs->expand_tabs_in_log_default;
|
|
|
|
|
2023-09-20 04:26:50 +08:00
|
|
|
if (!revs->show_notes_given && revs->show_notes_by_default) {
|
|
|
|
enable_default_display_notes(&revs->notes_opt, &revs->show_notes);
|
|
|
|
revs->show_notes_given = 1;
|
|
|
|
}
|
|
|
|
|
2006-02-26 08:19:46 +08:00
|
|
|
return left;
|
|
|
|
}
|
2006-03-01 03:24:00 +08:00
|
|
|
|
revisions API: have release_revisions() release "cmdline"
Extend the the release_revisions() function so that it frees the
"cmdline" in the "struct rev_info". This in combination with a
preceding change to free "commits" and "mailmap" means that we can
whitelist another test under "TEST_PASSES_SANITIZE_LEAK=true".
There was a proposal in [1] to do away with xstrdup()-ing this
add_rev_cmdline(), perhaps that would be worthwhile, but for now let's
just free() it.
We could also make that a "char *" in "struct rev_cmdline_entry"
itself, but since we own it let's expose it as a constant to outside
callers. I proposed that in [2] but have since changed my mind. See
14d30cdfc04 (ref-filter: fix memory leak in `free_array_item()`,
2019-07-10), c514c62a4fd (checkout: fix leak of non-existent branch
names, 2020-08-14) and other log history hits for "free((char *)" for
prior art.
This includes the tests we had false-positive passes on before my
6798b08e848 (perl Git.pm: don't ignore signalled failure in
_cmd_close(), 2022-02-01), now they pass for real.
Since there are 66 tests matching t/t[0-9]*git-svn*.sh it's easier to
list those that don't pass than to touch most of those 66. So let's
introduce a "TEST_FAILS_SANITIZE_LEAK=true", which if set in the tests
won't cause lib-git-svn.sh to set "TEST_PASSES_SANITIZE_LEAK=true.
This change also marks all the tests that we removed
"TEST_FAILS_SANITIZE_LEAK=true" from in an earlier commit due to
removing the UNLEAK() from cmd_format_patch(), we can now assert that
its API use doesn't leak any "struct rev_info" memory.
This change also made commit "t5503-tagfollow.sh" pass on current
master, but that would regress when combined with
ps/fetch-atomic-fixup's de004e848a9 (t5503: simplify setup of test
which exercises failure of backfill, 2022-03-03) (through no fault of
that topic, that change started using "git clone" in the test, which
has an outstanding leak). Let's leave that test out for now to avoid
in-flight semantic conflicts.
1. https://lore.kernel.org/git/YUj%2FgFRh6pwrZalY@carlos-mbp.lan/
2. https://lore.kernel.org/git/87o88obkb1.fsf@evledraar.gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-14 04:01:47 +08:00
|
|
|
static void release_revisions_cmdline(struct rev_cmdline_info *cmdline)
|
|
|
|
{
|
|
|
|
unsigned int i;
|
|
|
|
|
|
|
|
for (i = 0; i < cmdline->nr; i++)
|
|
|
|
free((char *)cmdline->rev[i].name);
|
|
|
|
free(cmdline->rev);
|
|
|
|
}
|
|
|
|
|
revisions API: have release_revisions() release "mailmap"
Extend the the release_revisions() function so that it frees the
"mailmap" in the "struct rev_info".
The log family of functions now calls the clear_mailmap() function
added in fa8afd18e5a (revisions API: provide and use a
release_revisions(), 2021-09-19), allowing us to whitelist some tests
with "TEST_PASSES_SANITIZE_LEAK=true".
Unfortunately having a pointer to a mailmap in "struct rev_info"
instead of an embedded member that we "own" get a bit messy, as can be
seen in the change to builtin/commit.c.
When we free() this data we won't be able to tell apart a pointer to a
"mailmap" on the heap from one on the stack. As seen in
ea57bc0d41b (log: add --use-mailmap option, 2013-01-05) the "log"
family allocates it on the heap, but in the find_author_by_nickname()
code added in ea16794e430 (commit: search author pattern against
mailmap, 2013-08-23) we allocated it on the stack instead.
Ideally we'd simply change that member to a "struct string_list
mailmap" and never free() the "mailmap" itself, but that would be a
much larger change to the revisions API.
We have code that needs to hand an existing "mailmap" to a "struct
rev_info", while we could change all of that, let's not go there
now.
The complexity isn't in the ownership of the "mailmap" per-se, but
that various things assume a "rev_info.mailmap == NULL" means "doesn't
want mailmap", if we changed that to an init'd "struct string_list
we'd need to carefully refactor things to change those assumptions.
Let's instead always free() it, and simply declare that if you add
such a "mailmap" it must be allocated on the heap. Any modern libc
will correctly panic if we free() a stack variable, so this should be
safe going forward.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-14 04:01:46 +08:00
|
|
|
static void release_revisions_mailmap(struct string_list *mailmap)
|
|
|
|
{
|
|
|
|
if (!mailmap)
|
|
|
|
return;
|
|
|
|
clear_mailmap(mailmap);
|
|
|
|
free(mailmap);
|
|
|
|
}
|
|
|
|
|
2022-04-14 13:56:39 +08:00
|
|
|
static void release_revisions_topo_walk_info(struct topo_walk_info *info);
|
|
|
|
|
2023-10-06 05:30:14 +08:00
|
|
|
static void free_void_commit_list(void *list)
|
|
|
|
{
|
|
|
|
free_commit_list(list);
|
|
|
|
}
|
|
|
|
|
revision.[ch]: provide and start using a release_revisions()
The users of the revision.[ch] API's "struct rev_info" are a major
source of memory leaks in the test suite under SANITIZE=leak, which in
turn adds a lot of noise when trying to mark up tests with
"TEST_PASSES_SANITIZE_LEAK=true".
The users of that API are largely one-shot, e.g. "git rev-list" or
"git log", or the "git checkout" and "git stash" being modified here
For these callers freeing the memory is arguably a waste of time, but
in many cases they've actually been trying to free the memory, and
just doing that in a buggy manner.
Let's provide a release_revisions() function for these users, and
start migrating them over per the plan outlined in [1]. Right now this
only handles the "pending" member of the struct, but more will be
added in subsequent commits.
Even though we only clear the "pending" member now, let's not leave a
trap in code like the pre-image of index_differs_from(), where we'd
start doing the wrong thing as soon as the release_revisions() learned
to clear its "diffopt". I.e. we need to call release_revisions() after
we've inspected any state in "struct rev_info".
This leaves in place e.g. clear_pathspec(&rev.prune_data) in
stash_working_tree() in builtin/stash.c, subsequent commits will teach
release_revisions() to free "prune_data" and other members that in
some cases are individually cleared by users of "struct rev_info" by
reaching into its members. Those subsequent commits will remove the
relevant calls to e.g. clear_pathspec().
We avoid amending code in index_differs_from() in diff-lib.c as well
as wt_status_collect_changes_index(), has_unstaged_changes() and
has_uncommitted_changes() in wt-status.c in a way that assumes that we
are already clearing the "diffopt" member. That will be handled in a
subsequent commit.
1. https://lore.kernel.org/git/87a6k8daeu.fsf@evledraar.gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-14 04:01:35 +08:00
|
|
|
void release_revisions(struct rev_info *revs)
|
|
|
|
{
|
2022-04-14 04:01:45 +08:00
|
|
|
free_commit_list(revs->commits);
|
2022-08-19 12:28:10 +08:00
|
|
|
free_commit_list(revs->ancestry_path_bottoms);
|
2024-06-11 17:19:45 +08:00
|
|
|
release_display_notes(&revs->notes_opt);
|
revision.[ch]: provide and start using a release_revisions()
The users of the revision.[ch] API's "struct rev_info" are a major
source of memory leaks in the test suite under SANITIZE=leak, which in
turn adds a lot of noise when trying to mark up tests with
"TEST_PASSES_SANITIZE_LEAK=true".
The users of that API are largely one-shot, e.g. "git rev-list" or
"git log", or the "git checkout" and "git stash" being modified here
For these callers freeing the memory is arguably a waste of time, but
in many cases they've actually been trying to free the memory, and
just doing that in a buggy manner.
Let's provide a release_revisions() function for these users, and
start migrating them over per the plan outlined in [1]. Right now this
only handles the "pending" member of the struct, but more will be
added in subsequent commits.
Even though we only clear the "pending" member now, let's not leave a
trap in code like the pre-image of index_differs_from(), where we'd
start doing the wrong thing as soon as the release_revisions() learned
to clear its "diffopt". I.e. we need to call release_revisions() after
we've inspected any state in "struct rev_info".
This leaves in place e.g. clear_pathspec(&rev.prune_data) in
stash_working_tree() in builtin/stash.c, subsequent commits will teach
release_revisions() to free "prune_data" and other members that in
some cases are individually cleared by users of "struct rev_info" by
reaching into its members. Those subsequent commits will remove the
relevant calls to e.g. clear_pathspec().
We avoid amending code in index_differs_from() in diff-lib.c as well
as wt_status_collect_changes_index(), has_unstaged_changes() and
has_uncommitted_changes() in wt-status.c in a way that assumes that we
are already clearing the "diffopt" member. That will be handled in a
subsequent commit.
1. https://lore.kernel.org/git/87a6k8daeu.fsf@evledraar.gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-14 04:01:35 +08:00
|
|
|
object_array_clear(&revs->pending);
|
2022-04-14 04:01:51 +08:00
|
|
|
object_array_clear(&revs->boundary_commits);
|
revisions API: have release_revisions() release "cmdline"
Extend the the release_revisions() function so that it frees the
"cmdline" in the "struct rev_info". This in combination with a
preceding change to free "commits" and "mailmap" means that we can
whitelist another test under "TEST_PASSES_SANITIZE_LEAK=true".
There was a proposal in [1] to do away with xstrdup()-ing this
add_rev_cmdline(), perhaps that would be worthwhile, but for now let's
just free() it.
We could also make that a "char *" in "struct rev_cmdline_entry"
itself, but since we own it let's expose it as a constant to outside
callers. I proposed that in [2] but have since changed my mind. See
14d30cdfc04 (ref-filter: fix memory leak in `free_array_item()`,
2019-07-10), c514c62a4fd (checkout: fix leak of non-existent branch
names, 2020-08-14) and other log history hits for "free((char *)" for
prior art.
This includes the tests we had false-positive passes on before my
6798b08e848 (perl Git.pm: don't ignore signalled failure in
_cmd_close(), 2022-02-01), now they pass for real.
Since there are 66 tests matching t/t[0-9]*git-svn*.sh it's easier to
list those that don't pass than to touch most of those 66. So let's
introduce a "TEST_FAILS_SANITIZE_LEAK=true", which if set in the tests
won't cause lib-git-svn.sh to set "TEST_PASSES_SANITIZE_LEAK=true.
This change also marks all the tests that we removed
"TEST_FAILS_SANITIZE_LEAK=true" from in an earlier commit due to
removing the UNLEAK() from cmd_format_patch(), we can now assert that
its API use doesn't leak any "struct rev_info" memory.
This change also made commit "t5503-tagfollow.sh" pass on current
master, but that would regress when combined with
ps/fetch-atomic-fixup's de004e848a9 (t5503: simplify setup of test
which exercises failure of backfill, 2022-03-03) (through no fault of
that topic, that change started using "git clone" in the test, which
has an outstanding leak). Let's leave that test out for now to avoid
in-flight semantic conflicts.
1. https://lore.kernel.org/git/YUj%2FgFRh6pwrZalY@carlos-mbp.lan/
2. https://lore.kernel.org/git/87o88obkb1.fsf@evledraar.gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-14 04:01:47 +08:00
|
|
|
release_revisions_cmdline(&revs->cmdline);
|
2022-04-14 04:01:48 +08:00
|
|
|
list_objects_filter_release(&revs->filter);
|
2022-04-14 04:01:50 +08:00
|
|
|
clear_pathspec(&revs->prune_data);
|
2022-04-14 13:56:38 +08:00
|
|
|
date_mode_release(&revs->date_mode);
|
revisions API: have release_revisions() release "mailmap"
Extend the the release_revisions() function so that it frees the
"mailmap" in the "struct rev_info".
The log family of functions now calls the clear_mailmap() function
added in fa8afd18e5a (revisions API: provide and use a
release_revisions(), 2021-09-19), allowing us to whitelist some tests
with "TEST_PASSES_SANITIZE_LEAK=true".
Unfortunately having a pointer to a mailmap in "struct rev_info"
instead of an embedded member that we "own" get a bit messy, as can be
seen in the change to builtin/commit.c.
When we free() this data we won't be able to tell apart a pointer to a
"mailmap" on the heap from one on the stack. As seen in
ea57bc0d41b (log: add --use-mailmap option, 2013-01-05) the "log"
family allocates it on the heap, but in the find_author_by_nickname()
code added in ea16794e430 (commit: search author pattern against
mailmap, 2013-08-23) we allocated it on the stack instead.
Ideally we'd simply change that member to a "struct string_list
mailmap" and never free() the "mailmap" itself, but that would be a
much larger change to the revisions API.
We have code that needs to hand an existing "mailmap" to a "struct
rev_info", while we could change all of that, let's not go there
now.
The complexity isn't in the ownership of the "mailmap" per-se, but
that various things assume a "rev_info.mailmap == NULL" means "doesn't
want mailmap", if we changed that to an init'd "struct string_list
we'd need to carefully refactor things to change those assumptions.
Let's instead always free() it, and simply declare that if you add
such a "mailmap" it must be allocated on the heap. Any modern libc
will correctly panic if we free() a stack variable, so this should be
safe going forward.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-14 04:01:46 +08:00
|
|
|
release_revisions_mailmap(revs->mailmap);
|
2022-04-14 04:01:49 +08:00
|
|
|
free_grep_patterns(&revs->grep_filter);
|
2022-11-09 02:17:44 +08:00
|
|
|
graph_clear(revs->graph);
|
2024-06-11 17:20:24 +08:00
|
|
|
diff_free(&revs->diffopt);
|
2022-04-14 04:01:53 +08:00
|
|
|
diff_free(&revs->pruning);
|
2022-04-14 04:01:52 +08:00
|
|
|
reflog_walk_info_release(revs->reflog_info);
|
2022-04-14 13:56:39 +08:00
|
|
|
release_revisions_topo_walk_info(revs->topo_walk_info);
|
2023-10-06 05:30:14 +08:00
|
|
|
clear_decoration(&revs->children, free_void_commit_list);
|
|
|
|
clear_decoration(&revs->merge_simplification, free);
|
|
|
|
clear_decoration(&revs->treesame, free);
|
|
|
|
line_log_free(revs);
|
2023-10-27 15:59:29 +08:00
|
|
|
oidset_clear(&revs->missing_commits);
|
2024-11-05 14:16:58 +08:00
|
|
|
|
|
|
|
for (int i = 0; i < revs->bloom_keys_nr; i++)
|
|
|
|
clear_bloom_key(&revs->bloom_keys[i]);
|
|
|
|
FREE_AND_NULL(revs->bloom_keys);
|
|
|
|
revs->bloom_keys_nr = 0;
|
revision.[ch]: provide and start using a release_revisions()
The users of the revision.[ch] API's "struct rev_info" are a major
source of memory leaks in the test suite under SANITIZE=leak, which in
turn adds a lot of noise when trying to mark up tests with
"TEST_PASSES_SANITIZE_LEAK=true".
The users of that API are largely one-shot, e.g. "git rev-list" or
"git log", or the "git checkout" and "git stash" being modified here
For these callers freeing the memory is arguably a waste of time, but
in many cases they've actually been trying to free the memory, and
just doing that in a buggy manner.
Let's provide a release_revisions() function for these users, and
start migrating them over per the plan outlined in [1]. Right now this
only handles the "pending" member of the struct, but more will be
added in subsequent commits.
Even though we only clear the "pending" member now, let's not leave a
trap in code like the pre-image of index_differs_from(), where we'd
start doing the wrong thing as soon as the release_revisions() learned
to clear its "diffopt". I.e. we need to call release_revisions() after
we've inspected any state in "struct rev_info".
This leaves in place e.g. clear_pathspec(&rev.prune_data) in
stash_working_tree() in builtin/stash.c, subsequent commits will teach
release_revisions() to free "prune_data" and other members that in
some cases are individually cleared by users of "struct rev_info" by
reaching into its members. Those subsequent commits will remove the
relevant calls to e.g. clear_pathspec().
We avoid amending code in index_differs_from() in diff-lib.c as well
as wt_status_collect_changes_index(), has_unstaged_changes() and
has_uncommitted_changes() in wt-status.c in a way that assumes that we
are already clearing the "diffopt" member. That will be handled in a
subsequent commit.
1. https://lore.kernel.org/git/87a6k8daeu.fsf@evledraar.gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-14 04:01:35 +08:00
|
|
|
}
|
|
|
|
|
2008-04-03 17:12:06 +08:00
|
|
|
static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
|
|
|
|
{
|
|
|
|
struct commit_list *l = xcalloc(1, sizeof(*l));
|
|
|
|
|
|
|
|
l->item = child;
|
|
|
|
l->next = add_decoration(&revs->children, &parent->object, l);
|
|
|
|
}
|
|
|
|
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
static int remove_duplicate_parents(struct rev_info *revs, struct commit *commit)
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
{
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
struct treesame_state *ts = lookup_decoration(&revs->treesame, &commit->object);
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
struct commit_list **pp, *p;
|
|
|
|
int surviving_parents;
|
|
|
|
|
|
|
|
/* Examine existing parents while marking ones we have seen... */
|
|
|
|
pp = &commit->parents;
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
surviving_parents = 0;
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
while ((p = *pp) != NULL) {
|
|
|
|
struct commit *parent = p->item;
|
|
|
|
if (parent->object.flags & TMP_MARK) {
|
|
|
|
*pp = p->next;
|
2024-09-30 17:14:03 +08:00
|
|
|
free(p);
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
if (ts)
|
|
|
|
compact_treesame(revs, commit, surviving_parents);
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
parent->object.flags |= TMP_MARK;
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
surviving_parents++;
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
pp = &p->next;
|
|
|
|
}
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
/* clear the temporary mark */
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
for (p = commit->parents; p; p = p->next) {
|
|
|
|
p->item->object.flags &= ~TMP_MARK;
|
|
|
|
}
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
/* no update_treesame() - removing duplicates can't affect TREESAME */
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
return surviving_parents;
|
|
|
|
}
|
|
|
|
|
2008-08-15 01:59:44 +08:00
|
|
|
struct merge_simplify_state {
|
|
|
|
struct commit *simplified;
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct merge_simplify_state *locate_simplify_state(struct rev_info *revs, struct commit *commit)
|
|
|
|
{
|
|
|
|
struct merge_simplify_state *st;
|
|
|
|
|
|
|
|
st = lookup_decoration(&revs->merge_simplification, &commit->object);
|
|
|
|
if (!st) {
|
2021-03-14 00:17:22 +08:00
|
|
|
CALLOC_ARRAY(st, 1);
|
2008-08-15 01:59:44 +08:00
|
|
|
add_decoration(&revs->merge_simplification, &commit->object, st);
|
|
|
|
}
|
|
|
|
return st;
|
|
|
|
}
|
|
|
|
|
2019-03-20 16:13:36 +08:00
|
|
|
static int mark_redundant_parents(struct commit *commit)
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
{
|
|
|
|
struct commit_list *h = reduce_heads(commit->parents);
|
|
|
|
int i = 0, marked = 0;
|
|
|
|
struct commit_list *po, *pn;
|
|
|
|
|
|
|
|
/* Want these for sanity-checking only */
|
|
|
|
int orig_cnt = commit_list_count(commit->parents);
|
|
|
|
int cnt = commit_list_count(h);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Not ready to remove items yet, just mark them for now, based
|
|
|
|
* on the output of reduce_heads(). reduce_heads outputs the reduced
|
|
|
|
* set in its original order, so this isn't too hard.
|
|
|
|
*/
|
|
|
|
po = commit->parents;
|
|
|
|
pn = h;
|
|
|
|
while (po) {
|
|
|
|
if (pn && po->item == pn->item) {
|
|
|
|
pn = pn->next;
|
|
|
|
i++;
|
|
|
|
} else {
|
|
|
|
po->item->object.flags |= TMP_MARK;
|
|
|
|
marked++;
|
|
|
|
}
|
|
|
|
po=po->next;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (i != cnt || cnt+marked != orig_cnt)
|
|
|
|
die("mark_redundant_parents %d %d %d %d", orig_cnt, cnt, i, marked);
|
|
|
|
|
|
|
|
free_commit_list(h);
|
|
|
|
|
|
|
|
return marked;
|
|
|
|
}
|
|
|
|
|
2019-03-20 16:13:36 +08:00
|
|
|
static int mark_treesame_root_parents(struct commit *commit)
|
2013-05-16 23:32:37 +08:00
|
|
|
{
|
|
|
|
struct commit_list *p;
|
|
|
|
int marked = 0;
|
|
|
|
|
|
|
|
for (p = commit->parents; p; p = p->next) {
|
|
|
|
struct commit *parent = p->item;
|
|
|
|
if (!parent->parents && (parent->object.flags & TREESAME)) {
|
|
|
|
parent->object.flags |= TMP_MARK;
|
|
|
|
marked++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return marked;
|
|
|
|
}
|
|
|
|
|
2013-05-16 23:32:36 +08:00
|
|
|
/*
|
|
|
|
* Awkward naming - this means one parent we are TREESAME to.
|
|
|
|
* cf mark_treesame_root_parents: root parents that are TREESAME (to an
|
|
|
|
* empty tree). Better name suggestions?
|
|
|
|
*/
|
|
|
|
static int leave_one_treesame_to_parent(struct rev_info *revs, struct commit *commit)
|
|
|
|
{
|
|
|
|
struct treesame_state *ts = lookup_decoration(&revs->treesame, &commit->object);
|
|
|
|
struct commit *unmarked = NULL, *marked = NULL;
|
|
|
|
struct commit_list *p;
|
|
|
|
unsigned n;
|
|
|
|
|
|
|
|
for (p = commit->parents, n = 0; p; p = p->next, n++) {
|
|
|
|
if (ts->treesame[n]) {
|
|
|
|
if (p->item->object.flags & TMP_MARK) {
|
|
|
|
if (!marked)
|
|
|
|
marked = p->item;
|
|
|
|
} else {
|
|
|
|
if (!unmarked) {
|
|
|
|
unmarked = p->item;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are TREESAME to a marked-for-deletion parent, but not to any
|
|
|
|
* unmarked parents, unmark the first TREESAME parent. This is the
|
|
|
|
* parent that the default simplify_history==1 scan would have followed,
|
|
|
|
* and it doesn't make sense to omit that path when asking for a
|
|
|
|
* simplified full history. Retaining it improves the chances of
|
|
|
|
* understanding odd missed merges that took an old version of a file.
|
|
|
|
*
|
|
|
|
* Example:
|
|
|
|
*
|
|
|
|
* I--------*X A modified the file, but mainline merge X used
|
|
|
|
* \ / "-s ours", so took the version from I. X is
|
|
|
|
* `-*A--' TREESAME to I and !TREESAME to A.
|
|
|
|
*
|
|
|
|
* Default log from X would produce "I". Without this check,
|
|
|
|
* --full-history --simplify-merges would produce "I-A-X", showing
|
|
|
|
* the merge commit X and that it changed A, but not making clear that
|
|
|
|
* it had just taken the I version. With this check, the topology above
|
|
|
|
* is retained.
|
|
|
|
*
|
|
|
|
* Note that it is possible that the simplification chooses a different
|
|
|
|
* TREESAME parent from the default, in which case this test doesn't
|
|
|
|
* activate, and we _do_ drop the default parent. Example:
|
|
|
|
*
|
|
|
|
* I------X A modified the file, but it was reverted in B,
|
|
|
|
* \ / meaning mainline merge X is TREESAME to both
|
|
|
|
* *A-*B parents.
|
|
|
|
*
|
|
|
|
* Default log would produce "I" by following the first parent;
|
|
|
|
* --full-history --simplify-merges will produce "I-A-B". But this is a
|
|
|
|
* reasonable result - it presents a logical full history leading from
|
|
|
|
* I to X, and X is not an important merge.
|
|
|
|
*/
|
|
|
|
if (!unmarked && marked) {
|
|
|
|
marked->object.flags &= ~TMP_MARK;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
static int remove_marked_parents(struct rev_info *revs, struct commit *commit)
|
|
|
|
{
|
|
|
|
struct commit_list **pp, *p;
|
|
|
|
int nth_parent, removed = 0;
|
|
|
|
|
|
|
|
pp = &commit->parents;
|
|
|
|
nth_parent = 0;
|
|
|
|
while ((p = *pp) != NULL) {
|
|
|
|
struct commit *parent = p->item;
|
|
|
|
if (parent->object.flags & TMP_MARK) {
|
|
|
|
parent->object.flags &= ~TMP_MARK;
|
|
|
|
*pp = p->next;
|
|
|
|
free(p);
|
|
|
|
removed++;
|
|
|
|
compact_treesame(revs, commit, nth_parent);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
pp = &p->next;
|
|
|
|
nth_parent++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Removing parents can only increase TREESAMEness */
|
|
|
|
if (removed && !(commit->object.flags & TREESAME))
|
|
|
|
update_treesame(revs, commit);
|
|
|
|
|
|
|
|
return nth_parent;
|
|
|
|
}
|
|
|
|
|
2008-08-15 01:59:44 +08:00
|
|
|
static struct commit_list **simplify_one(struct rev_info *revs, struct commit *commit, struct commit_list **tail)
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
{
|
|
|
|
struct commit_list *p;
|
2013-05-16 23:32:39 +08:00
|
|
|
struct commit *parent;
|
2008-08-15 01:59:44 +08:00
|
|
|
struct merge_simplify_state *st, *pst;
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
int cnt;
|
|
|
|
|
2008-08-15 01:59:44 +08:00
|
|
|
st = locate_simplify_state(revs, commit);
|
|
|
|
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
/*
|
|
|
|
* Have we handled this one?
|
|
|
|
*/
|
2008-08-15 01:59:44 +08:00
|
|
|
if (st->simplified)
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
return tail;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* An UNINTERESTING commit simplifies to itself, so does a
|
|
|
|
* root commit. We do not rewrite parents of such commit
|
|
|
|
* anyway.
|
|
|
|
*/
|
|
|
|
if ((commit->object.flags & UNINTERESTING) || !commit->parents) {
|
2008-08-15 01:59:44 +08:00
|
|
|
st->simplified = commit;
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
return tail;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2012-06-09 05:56:03 +08:00
|
|
|
* Do we know what commit all of our parents that matter
|
|
|
|
* should be rewritten to? Otherwise we are not ready to
|
|
|
|
* rewrite this one yet.
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
*/
|
|
|
|
for (cnt = 0, p = commit->parents; p; p = p->next) {
|
2008-08-15 01:59:44 +08:00
|
|
|
pst = locate_simplify_state(revs, p->item);
|
|
|
|
if (!pst->simplified) {
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
tail = &commit_list_insert(p->item, tail)->next;
|
|
|
|
cnt++;
|
|
|
|
}
|
2012-06-09 05:56:03 +08:00
|
|
|
if (revs->first_parent_only)
|
|
|
|
break;
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
}
|
2008-08-18 15:37:34 +08:00
|
|
|
if (cnt) {
|
|
|
|
tail = &commit_list_insert(commit, tail)->next;
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
return tail;
|
2008-08-18 15:37:34 +08:00
|
|
|
}
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
|
|
|
|
/*
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
* Rewrite our list of parents. Note that this cannot
|
|
|
|
* affect our TREESAME flags in any way - a commit is
|
|
|
|
* always TREESAME to its simplification.
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
*/
|
2008-08-15 01:59:44 +08:00
|
|
|
for (p = commit->parents; p; p = p->next) {
|
|
|
|
pst = locate_simplify_state(revs, p->item);
|
|
|
|
p->item = pst->simplified;
|
2012-06-09 05:56:03 +08:00
|
|
|
if (revs->first_parent_only)
|
|
|
|
break;
|
2008-08-15 01:59:44 +08:00
|
|
|
}
|
2013-01-18 06:23:03 +08:00
|
|
|
|
2013-04-09 04:10:27 +08:00
|
|
|
if (revs->first_parent_only)
|
2012-06-09 05:56:03 +08:00
|
|
|
cnt = 1;
|
2013-04-09 04:10:27 +08:00
|
|
|
else
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
cnt = remove_duplicate_parents(revs, commit);
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* It is possible that we are a merge and one side branch
|
|
|
|
* does not have any commit that touches the given paths;
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
* in such a case, the immediate parent from that branch
|
|
|
|
* will be rewritten to be the merge base.
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
*
|
|
|
|
* o----X X: the commit we are looking at;
|
|
|
|
* / / o: a commit that touches the paths;
|
|
|
|
* ---o----'
|
|
|
|
*
|
2013-05-16 23:32:37 +08:00
|
|
|
* Further, a merge of an independent branch that doesn't
|
|
|
|
* touch the path will reduce to a treesame root parent:
|
|
|
|
*
|
|
|
|
* ----o----X X: the commit we are looking at;
|
|
|
|
* / o: a commit that touches the paths;
|
|
|
|
* r r: a root commit not touching the paths
|
|
|
|
*
|
|
|
|
* Detect and simplify both cases.
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
*/
|
|
|
|
if (1 < cnt) {
|
2019-03-20 16:13:36 +08:00
|
|
|
int marked = mark_redundant_parents(commit);
|
|
|
|
marked += mark_treesame_root_parents(commit);
|
2013-05-16 23:32:36 +08:00
|
|
|
if (marked)
|
|
|
|
marked -= leave_one_treesame_to_parent(revs, commit);
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
if (marked)
|
|
|
|
cnt = remove_marked_parents(revs, commit);
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A commit simplifies to itself if it is a root, if it is
|
|
|
|
* UNINTERESTING, if it touches the given paths, or if it is a
|
2013-05-16 23:32:39 +08:00
|
|
|
* merge and its parents don't simplify to one relevant commit
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
* (the first two cases are already handled at the beginning of
|
|
|
|
* this function).
|
|
|
|
*
|
2013-05-16 23:32:39 +08:00
|
|
|
* Otherwise, it simplifies to what its sole relevant parent
|
|
|
|
* simplifies to.
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
*/
|
|
|
|
if (!cnt ||
|
|
|
|
(commit->object.flags & UNINTERESTING) ||
|
|
|
|
!(commit->object.flags & TREESAME) ||
|
revision: --show-pulls adds helpful merges
The default file history simplification of "git log -- <path>" or
"git rev-list -- <path>" focuses on providing the smallest set of
commits that first contributed a change. The revision walk greatly
restricts the set of walked commits by visiting only the first
TREESAME parent of a merge commit, when one exists. This means
that portions of the commit-graph are not walked, which can be a
performance benefit, but can also "hide" commits that added changes
but were ignored by a merge resolution.
The --full-history option modifies this by walking all commits and
reporting a merge commit as "interesting" if it has _any_ parent
that is not TREESAME. This tends to be an over-representation of
important commits, especially in an environment where most merge
commits are created by pull request completion.
Suppose we have a commit A and we create a commit B on top that
changes our file. When we merge the pull request, we create a merge
commit M. If no one else changed the file in the first-parent
history between M and A, then M will not be TREESAME to its first
parent, but will be TREESAME to B. Thus, the simplified history
will be "B". However, M will appear in the --full-history mode.
However, suppose that a number of topics T1, T2, ..., Tn were
created based on commits C1, C2, ..., Cn between A and M as
follows:
A----C1----C2--- ... ---Cn----M------P1---P2--- ... ---Pn
\ \ \ \ / / / /
\ \__.. \ \/ ..__T1 / Tn
\ \__.. /\ ..__T2 /
\_____________________B \____________________/
If the commits T1, T2, ... Tn did not change the file, then all of
P1 through Pn will be TREESAME to their first parent, but not
TREESAME to their second. This means that all of those merge commits
appear in the --full-history view, with edges that immediately
collapse into the lower history without introducing interesting
single-parent commits.
The --simplify-merges option was introduced to remove these extra
merge commits. By noticing that the rewritten parents are reachable
from their first parents, those edges can be simplified away. Finally,
the commits now look like single-parent commits that are TREESAME to
their "only" parent. Thus, they are removed and this issue does not
cause issues anymore. However, this also ends up removing the commit
M from the history view! Even worse, the --simplify-merges option
requires walking the entire history before returning a single result.
Many Git users are using Git alongside a Git service that provides
code storage alongside a code review tool commonly called "Pull
Requests" or "Merge Requests" against a target branch. When these
requests are accepted and merged, they typically create a merge
commit whose first parent is the previous branch tip and the second
parent is the tip of the topic branch used for the request. This
presents a valuable order to the parents, but also makes that merge
commit slightly special. Users may want to see not only which
commits changed a file, but which pull requests merged those commits
into their branch. In the previous example, this would mean the
users want to see the merge commit "M" in addition to the single-
parent commit "C".
Users are even more likely to want these merge commits when they
use pull requests to merge into a feature branch before merging that
feature branch into their trunk.
In some sense, users are asking for the "first" merge commit to
bring in the change to their branch. As long as the parent order is
consistent, this can be handled with the following rule:
Include a merge commit if it is not TREESAME to its first
parent, but is TREESAME to a later parent.
These merges look like the merge commits that would result from
running "git pull <topic>" on a main branch. Thus, the option to
show these commits is called "--show-pulls". This has the added
benefit of showing the commits created by closing a pull request or
merge request on any of the Git hosting and code review platforms.
To test these options, extend the standard test example to include
a merge commit that is not TREESAME to its first parent. It is
surprising that that option was not already in the example, as it
is instructive.
In particular, this extension demonstrates a common issue with file
history simplification. When a user resolves a merge conflict using
"-Xours" or otherwise ignoring one side of the conflict, they create
a TREESAME edge that probably should not be TREESAME. This leads
users to become frustrated and complain that "my change disappeared!"
In my experience, showing them history with --full-history and
--simplify-merges quickly reveals the problematic merge. As mentioned,
this option is expensive to compute. The --show-pulls option
_might_ show the merge commit (usually titled "resolving conflicts")
more quickly. Of course, this depends on the user having the correct
parent order, which is backwards when using "git pull master" from a
topic branch.
There are some special considerations when combining the --show-pulls
option with --simplify-merges. This requires adding a new PULL_MERGE
object flag to store the information from the initial TREESAME
comparisons. This helps avoid dropping those commits in later filters.
This is covered by a test, including how the parents can be simplified.
Since "struct object" has already ruined its 32-bit alignment by using
33 bits across parsed, type, and flags member, let's not make it worse.
PULL_MERGE is used in revision.c with the same value (1u<<15) as
REACHABLE in commit-graph.c. The REACHABLE flag is only used when
writing a commit-graph file, and a revision walk using --show-pulls
does not happen in the same process. Care must be taken in the future
to ensure this remains the case.
Update Documentation/rev-list-options.txt with significant details
around this option. This requires updating the example in the
History Simplification section to demonstrate some of the problems
with TREESAME second parents.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-10 20:19:43 +08:00
|
|
|
(parent = one_relevant_parent(revs, commit->parents)) == NULL ||
|
|
|
|
(revs->show_pulls && (commit->object.flags & PULL_MERGE)))
|
2008-08-15 01:59:44 +08:00
|
|
|
st->simplified = commit;
|
|
|
|
else {
|
2013-05-16 23:32:39 +08:00
|
|
|
pst = locate_simplify_state(revs, parent);
|
2008-08-15 01:59:44 +08:00
|
|
|
st->simplified = pst->simplified;
|
|
|
|
}
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
return tail;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void simplify_merges(struct rev_info *revs)
|
|
|
|
{
|
2012-06-09 05:50:22 +08:00
|
|
|
struct commit_list *list, *next;
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
struct commit_list *yet_to_do, **tail;
|
2012-06-09 05:50:22 +08:00
|
|
|
struct commit *commit;
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
|
2008-08-15 04:52:36 +08:00
|
|
|
if (!revs->prune)
|
|
|
|
return;
|
2008-08-04 08:47:16 +08:00
|
|
|
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
/* feed the list reversed */
|
|
|
|
yet_to_do = NULL;
|
2012-06-09 05:50:22 +08:00
|
|
|
for (list = revs->commits; list; list = next) {
|
|
|
|
commit = list->item;
|
|
|
|
next = list->next;
|
|
|
|
/*
|
|
|
|
* Do not free(list) here yet; the original list
|
|
|
|
* is used later in this function.
|
|
|
|
*/
|
|
|
|
commit_list_insert(commit, &yet_to_do);
|
|
|
|
}
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
while (yet_to_do) {
|
|
|
|
list = yet_to_do;
|
|
|
|
yet_to_do = NULL;
|
|
|
|
tail = &yet_to_do;
|
|
|
|
while (list) {
|
2015-10-25 00:21:31 +08:00
|
|
|
commit = pop_commit(&list);
|
2008-08-15 01:59:44 +08:00
|
|
|
tail = simplify_one(revs, commit, tail);
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* clean up the result, removing the simplified ones */
|
|
|
|
list = revs->commits;
|
|
|
|
revs->commits = NULL;
|
|
|
|
tail = &revs->commits;
|
|
|
|
while (list) {
|
2008-08-15 01:59:44 +08:00
|
|
|
struct merge_simplify_state *st;
|
2012-06-09 05:50:22 +08:00
|
|
|
|
2015-10-25 00:21:31 +08:00
|
|
|
commit = pop_commit(&list);
|
2008-08-15 01:59:44 +08:00
|
|
|
st = locate_simplify_state(revs, commit);
|
|
|
|
if (st->simplified == commit)
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
tail = &commit_list_insert(commit, tail)->next;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-04-03 17:12:06 +08:00
|
|
|
static void set_children(struct rev_info *revs)
|
|
|
|
{
|
|
|
|
struct commit_list *l;
|
|
|
|
for (l = revs->commits; l; l = l->next) {
|
|
|
|
struct commit *commit = l->item;
|
|
|
|
struct commit_list *p;
|
|
|
|
|
|
|
|
for (p = commit->parents; p; p = p->next)
|
|
|
|
add_child(revs, p->item, commit);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-03-29 15:21:21 +08:00
|
|
|
void reset_revision_walk(void)
|
|
|
|
{
|
2019-11-22 16:37:03 +08:00
|
|
|
clear_object_flags(SEEN | ADDED | SHOWN | TOPO_WALK_EXPLORED | TOPO_WALK_INDEGREE);
|
2012-03-29 15:21:21 +08:00
|
|
|
}
|
|
|
|
|
2017-12-08 23:27:15 +08:00
|
|
|
static int mark_uninteresting(const struct object_id *oid,
|
2023-02-24 14:39:24 +08:00
|
|
|
struct packed_git *pack UNUSED,
|
|
|
|
uint32_t pos UNUSED,
|
2018-09-21 23:57:39 +08:00
|
|
|
void *cb)
|
2017-12-08 23:27:15 +08:00
|
|
|
{
|
2018-09-21 23:57:39 +08:00
|
|
|
struct rev_info *revs = cb;
|
revision: avoid parsing with --exclude-promisor-objects
When --exclude-promisor-objects is given, before traversing any objects
we iterate over all of the objects in any promisor packs, marking them
as UNINTERESTING and SEEN. We turn the oid we get from iterating the
pack into an object with parse_object(), but this has two problems:
- it's slow; we are zlib inflating (and reconstructing from deltas)
every byte of every object in the packfile
- it leaves the tree buffers attached to their structs, which means
our heap usage will grow to store every uncompressed tree
simultaneously. This can be gigabytes.
We can obviously fix the second by freeing the tree buffers after we've
parsed them. But we can observe that the function doesn't look at the
object contents at all! The only reason we call parse_object() is that
we need a "struct object" on which to set the flags. There are two
options here:
- we can look up just the object type via oid_object_info(), and then
call the appropriate lookup_foo() function
- we can call lookup_unknown_object(), which gives us an OBJ_NONE
struct (which will get auto-converted later by object_as_type() via
calls to lookup_commit(), etc).
The first one is closer to the current code, but we do pay the price to
look up the type for each object. The latter should be more efficient in
CPU, though it wastes a little bit of memory (the "unknown" object
structs are a union of all object types, so some of the structs are
bigger than they need to be). It also runs the risk of triggering a
latent bug in code that calls lookup_object() directly but isn't ready
to handle OBJ_NONE (such code would already be buggy, but we use
lookup_unknown_object() infrequently enough that it might be hiding).
I went with the second option here. I don't think the risk is high (and
we'd want to find and fix any such bugs anyway), and it should be more
efficient overall.
The new tests in p5600 show off the improvement (this is on git.git):
Test HEAD^ HEAD
-------------------------------------------------------------------------------
5600.5: count commits 0.37(0.37+0.00) 0.38(0.38+0.00) +2.7%
5600.6: count non-promisor commits 11.74(11.37+0.37) 0.04(0.03+0.00) -99.7%
The improvement is particularly big in this script because _every_
object in the newly-cloned partial repo is a promisor object. So after
marking them all, there's nothing left to traverse.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-13 15:17:48 +08:00
|
|
|
struct object *o = lookup_unknown_object(revs->repo, oid);
|
2017-12-08 23:27:15 +08:00
|
|
|
o->flags |= UNINTERESTING | SEEN;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
define_commit_slab(indegree_slab, int);
|
|
|
|
define_commit_slab(author_date_slab, timestamp_t);
|
|
|
|
|
|
|
|
struct topo_walk_info {
|
2021-01-17 02:11:13 +08:00
|
|
|
timestamp_t min_generation;
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
struct prio_queue explore_queue;
|
|
|
|
struct prio_queue indegree_queue;
|
|
|
|
struct prio_queue topo_queue;
|
|
|
|
struct indegree_slab indegree;
|
|
|
|
struct author_date_slab author_date;
|
|
|
|
};
|
|
|
|
|
2020-12-30 12:31:53 +08:00
|
|
|
static int topo_walk_atexit_registered;
|
|
|
|
static unsigned int count_explore_walked;
|
|
|
|
static unsigned int count_indegree_walked;
|
|
|
|
static unsigned int count_topo_walked;
|
|
|
|
|
|
|
|
static void trace2_topo_walk_statistics_atexit(void)
|
|
|
|
{
|
|
|
|
struct json_writer jw = JSON_WRITER_INIT;
|
|
|
|
|
|
|
|
jw_object_begin(&jw, 0);
|
|
|
|
jw_object_intmax(&jw, "count_explore_walked", count_explore_walked);
|
|
|
|
jw_object_intmax(&jw, "count_indegree_walked", count_indegree_walked);
|
|
|
|
jw_object_intmax(&jw, "count_topo_walked", count_topo_walked);
|
|
|
|
jw_end(&jw);
|
|
|
|
|
|
|
|
trace2_data_json("topo_walk", the_repository, "statistics", &jw);
|
|
|
|
|
|
|
|
jw_release(&jw);
|
|
|
|
}
|
|
|
|
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
static inline void test_flag_and_insert(struct prio_queue *q, struct commit *c, int flag)
|
|
|
|
{
|
|
|
|
if (c->object.flags & flag)
|
|
|
|
return;
|
|
|
|
|
|
|
|
c->object.flags |= flag;
|
|
|
|
prio_queue_put(q, c);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void explore_walk_step(struct rev_info *revs)
|
|
|
|
{
|
|
|
|
struct topo_walk_info *info = revs->topo_walk_info;
|
|
|
|
struct commit_list *p;
|
|
|
|
struct commit *c = prio_queue_get(&info->explore_queue);
|
|
|
|
|
|
|
|
if (!c)
|
|
|
|
return;
|
|
|
|
|
2020-06-24 04:56:58 +08:00
|
|
|
if (repo_parse_commit_gently(revs->repo, c, 1) < 0)
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
return;
|
|
|
|
|
2020-12-30 12:31:53 +08:00
|
|
|
count_explore_walked++;
|
|
|
|
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
|
|
|
|
record_author_date(&info->author_date, c);
|
|
|
|
|
|
|
|
if (revs->max_age != -1 && (c->date < revs->max_age))
|
|
|
|
c->object.flags |= UNINTERESTING;
|
|
|
|
|
|
|
|
if (process_parents(revs, c, NULL, NULL) < 0)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (c->object.flags & UNINTERESTING)
|
git-rev-list: add --exclude-first-parent-only flag
It is useful to know when a branch first diverged in history
from some integration branch in order to be able to enumerate
the user's local changes. However, these local changes can
include arbitrary merges, so it is necessary to ignore this
merge structure when finding the divergence point.
In order to do this, teach the "rev-list" family to accept
"--exclude-first-parent-only", which restricts the traversal
of excluded commits to only follow first parent links.
-A-----E-F-G--main
\ / /
B-C-D--topic
In this example, the goal is to return the set {B, C, D} which
represents a topic branch that has been merged into main branch.
`git rev-list topic ^main` will end up returning no commits
since excluding main will end up traversing the commits on topic
as well. `git rev-list --exclude-first-parent-only topic ^main`
however will return {B, C, D} as desired.
Add docs for the new flag, and clarify the doc for --first-parent
to indicate that it applies to traversing the set of included
commits only.
Signed-off-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12 05:39:41 +08:00
|
|
|
mark_parents_uninteresting(revs, c);
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
|
|
|
|
for (p = c->parents; p; p = p->next)
|
|
|
|
test_flag_and_insert(&info->explore_queue, p->item, TOPO_WALK_EXPLORED);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void explore_to_depth(struct rev_info *revs,
|
2021-01-17 02:11:13 +08:00
|
|
|
timestamp_t gen_cutoff)
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
{
|
|
|
|
struct topo_walk_info *info = revs->topo_walk_info;
|
|
|
|
struct commit *c;
|
|
|
|
while ((c = prio_queue_peek(&info->explore_queue)) &&
|
2020-06-17 17:14:10 +08:00
|
|
|
commit_graph_generation(c) >= gen_cutoff)
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
explore_walk_step(revs);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void indegree_walk_step(struct rev_info *revs)
|
|
|
|
{
|
|
|
|
struct commit_list *p;
|
|
|
|
struct topo_walk_info *info = revs->topo_walk_info;
|
|
|
|
struct commit *c = prio_queue_get(&info->indegree_queue);
|
|
|
|
|
|
|
|
if (!c)
|
|
|
|
return;
|
|
|
|
|
2020-06-24 04:56:58 +08:00
|
|
|
if (repo_parse_commit_gently(revs->repo, c, 1) < 0)
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
return;
|
|
|
|
|
2020-12-30 12:31:53 +08:00
|
|
|
count_indegree_walked++;
|
|
|
|
|
2020-06-17 17:14:10 +08:00
|
|
|
explore_to_depth(revs, commit_graph_generation(c));
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
|
|
|
|
for (p = c->parents; p; p = p->next) {
|
|
|
|
struct commit *parent = p->item;
|
|
|
|
int *pi = indegree_slab_at(&info->indegree, parent);
|
|
|
|
|
2021-01-17 02:11:09 +08:00
|
|
|
if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
|
|
|
|
return;
|
|
|
|
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
if (*pi)
|
|
|
|
(*pi)++;
|
|
|
|
else
|
|
|
|
*pi = 2;
|
|
|
|
|
|
|
|
test_flag_and_insert(&info->indegree_queue, parent, TOPO_WALK_INDEGREE);
|
|
|
|
|
|
|
|
if (revs->first_parent_only)
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void compute_indegrees_to_depth(struct rev_info *revs,
|
2021-01-17 02:11:13 +08:00
|
|
|
timestamp_t gen_cutoff)
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
{
|
|
|
|
struct topo_walk_info *info = revs->topo_walk_info;
|
|
|
|
struct commit *c;
|
|
|
|
while ((c = prio_queue_peek(&info->indegree_queue)) &&
|
2020-06-17 17:14:10 +08:00
|
|
|
commit_graph_generation(c) >= gen_cutoff)
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
indegree_walk_step(revs);
|
|
|
|
}
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
|
2022-04-14 13:56:39 +08:00
|
|
|
static void release_revisions_topo_walk_info(struct topo_walk_info *info)
|
2019-11-22 16:37:04 +08:00
|
|
|
{
|
2022-04-14 13:56:39 +08:00
|
|
|
if (!info)
|
|
|
|
return;
|
2019-11-22 16:37:04 +08:00
|
|
|
clear_prio_queue(&info->explore_queue);
|
|
|
|
clear_prio_queue(&info->indegree_queue);
|
|
|
|
clear_prio_queue(&info->topo_queue);
|
|
|
|
clear_indegree_slab(&info->indegree);
|
|
|
|
clear_author_date_slab(&info->author_date);
|
2022-04-14 13:56:39 +08:00
|
|
|
free(info);
|
|
|
|
}
|
2019-11-22 16:37:04 +08:00
|
|
|
|
2022-04-14 13:56:39 +08:00
|
|
|
static void reset_topo_walk(struct rev_info *revs)
|
|
|
|
{
|
|
|
|
release_revisions_topo_walk_info(revs->topo_walk_info);
|
|
|
|
revs->topo_walk_info = NULL;
|
2019-11-22 16:37:04 +08:00
|
|
|
}
|
|
|
|
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
static void init_topo_walk(struct rev_info *revs)
|
|
|
|
{
|
|
|
|
struct topo_walk_info *info;
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
struct commit_list *list;
|
2019-11-22 16:37:04 +08:00
|
|
|
if (revs->topo_walk_info)
|
|
|
|
reset_topo_walk(revs);
|
|
|
|
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
|
|
|
|
info = revs->topo_walk_info;
|
|
|
|
memset(info, 0, sizeof(struct topo_walk_info));
|
|
|
|
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
init_indegree_slab(&info->indegree);
|
|
|
|
memset(&info->explore_queue, 0, sizeof(info->explore_queue));
|
|
|
|
memset(&info->indegree_queue, 0, sizeof(info->indegree_queue));
|
|
|
|
memset(&info->topo_queue, 0, sizeof(info->topo_queue));
|
|
|
|
|
|
|
|
switch (revs->sort_order) {
|
|
|
|
default: /* REV_SORT_IN_GRAPH_ORDER */
|
|
|
|
info->topo_queue.compare = NULL;
|
|
|
|
break;
|
|
|
|
case REV_SORT_BY_COMMIT_DATE:
|
|
|
|
info->topo_queue.compare = compare_commits_by_commit_date;
|
|
|
|
break;
|
|
|
|
case REV_SORT_BY_AUTHOR_DATE:
|
|
|
|
init_author_date_slab(&info->author_date);
|
|
|
|
info->topo_queue.compare = compare_commits_by_author_date;
|
|
|
|
info->topo_queue.cb_data = &info->author_date;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
|
|
|
|
info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;
|
|
|
|
|
|
|
|
info->min_generation = GENERATION_NUMBER_INFINITY;
|
|
|
|
for (list = revs->commits; list; list = list->next) {
|
|
|
|
struct commit *c = list->item;
|
2021-01-17 02:11:13 +08:00
|
|
|
timestamp_t generation;
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
|
2020-06-24 04:56:58 +08:00
|
|
|
if (repo_parse_commit_gently(revs->repo, c, 1))
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
test_flag_and_insert(&info->explore_queue, c, TOPO_WALK_EXPLORED);
|
|
|
|
test_flag_and_insert(&info->indegree_queue, c, TOPO_WALK_INDEGREE);
|
|
|
|
|
2020-06-17 17:14:11 +08:00
|
|
|
generation = commit_graph_generation(c);
|
|
|
|
if (generation < info->min_generation)
|
|
|
|
info->min_generation = generation;
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
|
|
|
|
*(indegree_slab_at(&info->indegree, c)) = 1;
|
|
|
|
|
|
|
|
if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
|
|
|
|
record_author_date(&info->author_date, c);
|
|
|
|
}
|
|
|
|
compute_indegrees_to_depth(revs, info->min_generation);
|
|
|
|
|
|
|
|
for (list = revs->commits; list; list = list->next) {
|
|
|
|
struct commit *c = list->item;
|
|
|
|
|
|
|
|
if (*(indegree_slab_at(&info->indegree, c)) == 1)
|
|
|
|
prio_queue_put(&info->topo_queue, c);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is unfortunate; the initial tips need to be shown
|
|
|
|
* in the order given from the revision traversal machinery.
|
|
|
|
*/
|
|
|
|
if (revs->sort_order == REV_SORT_IN_GRAPH_ORDER)
|
|
|
|
prio_queue_reverse(&info->topo_queue);
|
2020-12-30 12:31:53 +08:00
|
|
|
|
|
|
|
if (trace2_is_enabled() && !topo_walk_atexit_registered) {
|
|
|
|
atexit(trace2_topo_walk_statistics_atexit);
|
|
|
|
topo_walk_atexit_registered = 1;
|
|
|
|
}
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct commit *next_topo_commit(struct rev_info *revs)
|
|
|
|
{
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
struct commit *c;
|
|
|
|
struct topo_walk_info *info = revs->topo_walk_info;
|
|
|
|
|
|
|
|
/* pop next off of topo_queue */
|
|
|
|
c = prio_queue_get(&info->topo_queue);
|
|
|
|
|
|
|
|
if (c)
|
|
|
|
*(indegree_slab_at(&info->indegree, c)) = 0;
|
|
|
|
|
|
|
|
return c;
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
|
|
|
|
{
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
struct commit_list *p;
|
|
|
|
struct topo_walk_info *info = revs->topo_walk_info;
|
|
|
|
if (process_parents(revs, commit, NULL, NULL) < 0) {
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
if (!revs->ignore_missing_links)
|
|
|
|
die("Failed to traverse parents of commit %s",
|
|
|
|
oid_to_hex(&commit->object.oid));
|
|
|
|
}
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
|
2020-12-30 12:31:53 +08:00
|
|
|
count_topo_walked++;
|
|
|
|
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
for (p = commit->parents; p; p = p->next) {
|
|
|
|
struct commit *parent = p->item;
|
|
|
|
int *pi;
|
2021-01-17 02:11:13 +08:00
|
|
|
timestamp_t generation;
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
|
2019-05-21 21:59:53 +08:00
|
|
|
if (parent->object.flags & UNINTERESTING)
|
|
|
|
continue;
|
|
|
|
|
2020-06-24 04:56:58 +08:00
|
|
|
if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
continue;
|
|
|
|
|
2020-06-17 17:14:11 +08:00
|
|
|
generation = commit_graph_generation(parent);
|
|
|
|
if (generation < info->min_generation) {
|
|
|
|
info->min_generation = generation;
|
revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.
When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:
1. Run limit_list(), which parses all reachable commits,
adds them to a linked list, and distributes UNINTERESTING
flags. If all unprocessed commits are UNINTERESTING, then
it may terminate without walking all reachable commits.
This does not occur if we do not specify UNINTERESTING
commits.
2. Run sort_in_topological_order(), which is an implementation
of Kahn's algorithm. It first iterates through the entire
set of important commits and computes the in-degree of each
(plus one, as we use 'zero' as a special value here). Then,
we walk the commits in priority order, adding them to the
priority queue if and only if their in-degree is one. As
we remove commits from this priority queue, we decrement the
in-degree of their parents.
3. While we are peeling commits for output, get_revision_1()
uses pop_commit on the full list of commits computed by
sort_in_topological_order().
In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.
Recall that the generation number of a commit satisfies:
* If the commit has at least one parent, then the generation
number is one more than the maximum generation number among
its parents.
* If the commit has no parent, then the generation number is one.
There are two special generation numbers:
* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
indicates that the commit is not stored in the commit-graph and
the generation number was not previously calculated.
* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
to say that the commit-graph was generated by a version of Git
that does not compute generation numbers (such as v2.18.0).
Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:
If A and B are commits with generation numbers gen(A) and
gen(B) and gen(A) < gen(B), then A cannot reach B.
Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
The walks are as follows:
1. EXPLORE: using the explore_queue priority queue (ordered by
maximizing the generation number), parse each reachable
commit until all commits in the queue have generation
number strictly lower than needed. During this walk, update
the UNINTERESTING flags as necessary.
2. INDEGREE: using the indegree_queue priority queue (ordered
by maximizing the generation number), add one to the in-
degree of each parent for each commit that is walked. Since
we walk in order of decreasing generation number, we know
that discovering an in-degree value of 0 means the value for
that commit was not initialized, so should be initialized to
two. (Recall that in-degree value "1" is what we use to say a
commit is ready for output.) As we iterate the parents of a
commit during this walk, ensure the EXPLORE walk has walked
beyond their generation numbers.
3. TOPO: using the topo_queue priority queue (ordered based on
the sort_order given, which could be commit-date, author-
date, or typical topo-order which treats the queue as a LIFO
stack), remove a commit from the queue and decrement the
in-degree of each parent. If a parent has an in-degree of
one, then we add it to the topo_queue. Before we decrement
the in-degree, however, ensure the INDEGREE walk has walked
beyond that generation number.
The implementations of these walks are in the following methods:
* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk
These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.
One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.
In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s
Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
HEAD, w/ commit-graph: 0.06 s
This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.
For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:22 +08:00
|
|
|
compute_indegrees_to_depth(revs, info->min_generation);
|
|
|
|
}
|
|
|
|
|
|
|
|
pi = indegree_slab_at(&info->indegree, parent);
|
|
|
|
|
|
|
|
(*pi)--;
|
|
|
|
if (*pi == 1)
|
|
|
|
prio_queue_put(&info->topo_queue, parent);
|
|
|
|
|
|
|
|
if (revs->first_parent_only)
|
|
|
|
return;
|
|
|
|
}
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
}
|
|
|
|
|
2007-05-05 05:54:57 +08:00
|
|
|
int prepare_revision_walk(struct rev_info *revs)
|
2006-03-01 03:24:00 +08:00
|
|
|
{
|
clean up name allocation in prepare_revision_walk
When we enter prepare_revision_walk, we have zero or more
entries in our "pending" array. We disconnect that array
from the rev_info, and then process each entry:
1. If the entry is a commit and the --source option is in
effect, we keep a pointer to the object name.
2. Otherwise, we re-add the item to the pending list with
a blank name.
We then throw away the old array by freeing the array
itself, but do not touch the "name" field of each entry. For
any items of type (2), we leak the memory associated with
the name. This commit fixes that by calling object_array_clear,
which handles the cleanup for us.
That breaks (1), though, because it depends on the memory
pointed to by the name to last forever. We can solve that by
making a copy of the name. This is slightly less efficient,
but it shouldn't matter in practice, as we do it only for
the tip commits of the traversal.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:35:12 +08:00
|
|
|
int i;
|
|
|
|
struct object_array old_pending;
|
2012-04-26 04:35:41 +08:00
|
|
|
struct commit_list **next = &revs->commits;
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
|
clean up name allocation in prepare_revision_walk
When we enter prepare_revision_walk, we have zero or more
entries in our "pending" array. We disconnect that array
from the rev_info, and then process each entry:
1. If the entry is a commit and the --source option is in
effect, we keep a pointer to the object name.
2. Otherwise, we re-add the item to the pending list with
a blank name.
We then throw away the old array by freeing the array
itself, but do not touch the "name" field of each entry. For
any items of type (2), we leak the memory associated with
the name. This commit fixes that by calling object_array_clear,
which handles the cleanup for us.
That breaks (1), though, because it depends on the memory
pointed to by the name to last forever. We can solve that by
making a copy of the name. This is slightly less efficient,
but it shouldn't matter in practice, as we do it only for
the tip commits of the traversal.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:35:12 +08:00
|
|
|
memcpy(&old_pending, &revs->pending, sizeof(old_pending));
|
Add "named object array" concept
We've had this notion of a "object_list" for a long time, which eventually
grew a "name" member because some users (notably git-rev-list) wanted to
name each object as it is generated.
That object_list is great for some things, but it isn't all that wonderful
for others, and the "name" member is generally not used by everybody.
This patch splits the users of the object_list array up into two: the
traditional list users, who want the list-like format, and who don't
actually use or want the name. And another class of users that really used
the list as an extensible array, and generally wanted to name the objects.
The patch is fairly straightforward, but it's also biggish. Most of it
really just cleans things up: switching the revision parsing and listing
over to the array makes things like the builtin-diff usage much simpler
(we now see exactly how many members the array has, and we don't get the
objects reversed from the order they were on the command line).
One of the main reasons for doing this at all is that the malloc overhead
of the simple object list was actually pretty high, and the array is just
a lot denser. So this patch brings down memory usage by git-rev-list by
just under 3% (on top of all the other memory use optimizations) on the
mozilla archive.
It does add more lines than it removes, and more importantly, it adds a
whole new infrastructure for maintaining lists of objects, but on the
other hand, the new dynamic array code is pretty obvious. The change to
builtin-diff-tree.c shows a fairly good example of why an array interface
is sometimes more natural, and just much simpler for everybody.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-06-20 08:42:35 +08:00
|
|
|
revs->pending.nr = 0;
|
|
|
|
revs->pending.alloc = 0;
|
|
|
|
revs->pending.objects = NULL;
|
clean up name allocation in prepare_revision_walk
When we enter prepare_revision_walk, we have zero or more
entries in our "pending" array. We disconnect that array
from the rev_info, and then process each entry:
1. If the entry is a commit and the --source option is in
effect, we keep a pointer to the object name.
2. Otherwise, we re-add the item to the pending list with
a blank name.
We then throw away the old array by freeing the array
itself, but do not touch the "name" field of each entry. For
any items of type (2), we leak the memory associated with
the name. This commit fixes that by calling object_array_clear,
which handles the cleanup for us.
That breaks (1), though, because it depends on the memory
pointed to by the name to last forever. We can solve that by
making a copy of the name. This is slightly less efficient,
but it shouldn't matter in practice, as we do it only for
the tip commits of the traversal.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:35:12 +08:00
|
|
|
for (i = 0; i < old_pending.nr; i++) {
|
|
|
|
struct object_array_entry *e = old_pending.objects + i;
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 06:43:19 +08:00
|
|
|
struct commit *commit = handle_commit(revs, e);
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
if (commit) {
|
|
|
|
if (!(commit->object.flags & SEEN)) {
|
|
|
|
commit->object.flags |= SEEN;
|
2012-04-26 04:35:41 +08:00
|
|
|
next = commit_list_append(commit, next);
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2017-12-26 01:47:51 +08:00
|
|
|
object_array_clear(&old_pending);
|
Common option parsing for "git log --diff" and friends
This basically does a few things that are sadly somewhat interdependent,
and nontrivial to split out
- get rid of "struct log_tree_opt"
The fields in "log_tree_opt" are moved into "struct rev_info", and all
users of log_tree_opt are changed to use the rev_info struct instead.
- add the parsing for the log_tree_opt arguments to "setup_revision()"
- make setup_revision set a flag (revs->diff) if the diff-related
arguments were used. This allows "git log" to decide whether it wants
to show diffs or not.
- make setup_revision() also initialize the diffopt part of rev_info
(which we had from before, but we just didn't initialize it)
- make setup_revision() do all the "finishing touches" on it all (it will
do the proper flag combination logic, and call "diff_setup_done()")
Now, that was the easy and straightforward part.
The slightly more involved part is that some of the programs that want to
use the new-and-improved rev_info parsing don't actually want _commits_,
they may want tree'ish arguments instead. That meant that I had to change
setup_revision() to parse the arguments not into the "revs->commits" list,
but into the "revs->pending_objects" list.
Then, when we do "prepare_revision_walk()", we walk that list, and create
the sorted commit list from there.
This actually cleaned some stuff up, but it's the less obvious part of the
patch, and re-organized the "revision.c" logic somewhat. It actually paves
the way for splitting argument parsing _entirely_ out of "revision.c",
since now the argument parsing really is totally independent of the commit
walking: that didn't use to be true, since there was lots of overlap with
get_commit_reference() handling etc, now the _only_ overlap is the shared
(and trivial) "add_pending_object()" thing.
However, I didn't do that file split, just because I wanted the diff
itself to be smaller, and show the actual changes more clearly. If this
gets accepted, I'll do further cleanups then - that includes the file
split, but also using the new infrastructure to do a nicer "git diff" etc.
Even in this form, it actually ends up removing more lines than it adds.
It's nice to note how simple and straightforward this makes the built-in
"git log" command, even though it continues to support all the diff flags
too. It doesn't get much simpler that this.
I think this is worth merging soonish, because it does allow for future
cleanup and even more sharing of code. However, it obviously touches
"revision.c", which is subtle. I've tested that it passes all the tests we
have, and it passes my "looks sane" detector, but somebody else should
also give it a good look-over.
[jc: squashed the original and three "oops this too" updates, with
another fix-up.]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-15 07:52:13 +08:00
|
|
|
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
/* Signal whether we need per-parent treesame decoration */
|
2013-05-16 23:32:39 +08:00
|
|
|
if (revs->simplify_merges ||
|
|
|
|
(revs->limited && limiting_can_increase_treesame(revs)))
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
revs->treesame.name = "treesame";
|
|
|
|
|
2017-12-08 23:27:15 +08:00
|
|
|
if (revs->exclude_promisor_objects) {
|
2018-09-21 23:57:39 +08:00
|
|
|
for_each_packed_object(mark_uninteresting, revs,
|
2017-12-08 23:27:15 +08:00
|
|
|
FOR_EACH_OBJECT_PROMISOR_ONLY);
|
|
|
|
}
|
|
|
|
|
line-log: integrate with changed-path Bloom filters
The previous changes to the line-log machinery focused on making the
first result appear faster. This was achieved by no longer walking the
entire commit history before returning the early results. There is still
another way to improve the performance: walk most commits much faster.
Let's use the changed-path Bloom filters to reduce time spent computing
diffs.
Since the line-log computation requires opening blobs and checking the
content-diff, there is still a lot of necessary computation that cannot
be replaced with changed-path Bloom filters. The part that we can reduce
is most effective when checking the history of a file that is deep in
several directories and those directories are modified frequently. In
this case, the computation to check if a commit is TREESAME to its first
parent takes a large fraction of the time. That is ripe for improvement
with changed-path Bloom filters.
We must ensure that prepare_to_use_bloom_filters() is called in
revision.c so that the bloom_filter_settings are loaded into the struct
rev_info from the commit-graph. Of course, some cases are still
forbidden, but in the line-log case the pathspec is provided in a
different way than normal.
Since multiple paths and segments could be requested, we compute the
struct bloom_key data dynamically during the commit walk. This could
likely be improved, but adds code complexity that is not valuable at
this time.
There are two cases to care about: merge commits and "ordinary" commits.
Merge commits have multiple parents, but if we are TREESAME to our first
parent in every range, then pass the blame for all ranges to the first
parent. Ordinary commits have the same condition, but each is done
slightly differently in the process_ranges_[merge|ordinary]_commit()
methods. By checking if the changed-path Bloom filter can guarantee
TREESAME, we can avoid that tree-diff cost. If the filter says "probably
changed", then we need to run the tree-diff and then the blob-diff if
there was a real edit.
The Linux kernel repository is a good testing ground for the performance
improvements claimed here. There are two different cases to test. The
first is the "entire history" case, where we output the entire history
to /dev/null to see how long it would take to compute the full line-log
history. The second is the "first result" case, where we find how long
it takes to show the first value, which is an indicator of how quickly a
user would see responses when waiting at a terminal.
To test, I selected the paths that were changed most frequently in the
top 10,000 commits using this command (stolen from StackOverflow [1]):
git log --pretty=format: --name-only -n 10000 | sort | \
uniq -c | sort -rg | head -10
which results in
121 MAINTAINERS
63 fs/namei.c
60 arch/x86/kvm/cpuid.c
59 fs/io_uring.c
58 arch/x86/kvm/vmx/vmx.c
51 arch/x86/kvm/x86.c
45 arch/x86/kvm/svm.c
42 fs/btrfs/disk-io.c
42 Documentation/scsi/index.rst
(along with a bogus first result). It appears that the path
arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the
following results for the real command time:
| | Entire History | First Result |
| Path | Before | After | Before | After |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |
It is worth noting that the least speedup comes for the MAINTAINERS file
which is
* edited frequently,
* low in the directory heirarchy, and
* quite a large file.
All of those points lead to spending more time doing the blob diff and
less time doing the tree diff. Still, we see some improvement in that
case and significant improvement in other cases. A 2-4x speedup is
likely the more typical case as opposed to the small 5% change for that
file.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-11 19:56:19 +08:00
|
|
|
if (!revs->reflog_info)
|
2020-04-07 00:59:52 +08:00
|
|
|
prepare_to_use_bloom_filter(revs);
|
2021-08-05 19:25:24 +08:00
|
|
|
if (!revs->unsorted_input)
|
teach log --no-walk=unsorted, which avoids sorting
When 'git log' is passed the --no-walk option, no revision walk takes
place, naturally. Perhaps somewhat surprisingly, however, the provided
revisions still get sorted by commit date. So e.g 'git log --no-walk
HEAD HEAD~1' and 'git log --no-walk HEAD~1 HEAD' give the same result
(unless the two revisions share the commit date, in which case they
will retain the order given on the command line). As the commit that
introduced --no-walk (8e64006 (Teach revision machinery about
--no-walk, 2007-07-24)) points out, the sorting is intentional, to
allow things like
git log --abbrev-commit --pretty=oneline --decorate --all --no-walk
to show all refs in order by commit date.
But there are also other cases where the sorting is not wanted, such
as
<command producing revisions in order> |
git log --oneline --no-walk --stdin
To accomodate both cases, leave the decision of whether or not to sort
up to the caller, by allowing --no-walk={sorted,unsorted}, defaulting
to 'sorted' for backward-compatibility reasons.
Signed-off-by: Martin von Zweigbergk <martinvonz@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-08-29 14:15:54 +08:00
|
|
|
commit_list_sort_by_date(&revs->commits);
|
2006-04-16 03:09:56 +08:00
|
|
|
if (revs->no_walk)
|
2007-05-05 05:54:57 +08:00
|
|
|
return 0;
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
if (revs->limited) {
|
2007-05-05 05:54:57 +08:00
|
|
|
if (limit_list(revs) < 0)
|
|
|
|
return -1;
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
if (revs->topo_order)
|
|
|
|
sort_in_topological_order(&revs->commits, revs->sort_order);
|
|
|
|
} else if (revs->topo_order)
|
|
|
|
init_topo_walk(revs);
|
line-log: more responsive, incremental 'git log -L'
The current line-level log implementation performs a preprocessing
step in prepare_revision_walk(), during which the line_log_filter()
function filters and rewrites history to keep only commits modifying
the given line range. This preprocessing affects both responsiveness
and correctness:
- Git doesn't produce any output during this preprocessing step.
Checking whether a commit modified the given line range is
somewhat expensive, so depending on the size of the given revision
range this preprocessing can result in a significant delay before
the first commit is shown.
- Limiting the number of displayed commits (e.g. 'git log -3 -L...')
doesn't limit the amount of work during preprocessing, because
that limit is applied during history traversal. Alas, by that
point this expensive preprocessing step has already churned
through the whole revision range to find all commits modifying the
revision range, even though only a few of them need to be shown.
- It rewrites parents, with no way to turn it off. Without the user
explicitly requesting parent rewriting any parent object ID shown
should be that of the immediate parent, just like in case of a
pathspec-limited history traversal without parent rewriting.
However, after that preprocessing step rewrote history, the
subsequent "regular" history traversal (i.e. get_revision() in a
loop) only sees commits modifying the given line range.
Consequently, it can only show the object ID of the last ancestor
that modified the given line range (which might happen to be the
immediate parent, but many-many times it isn't).
This patch addresses both the correctness and, at least for the common
case, the responsiveness issues by integrating line-level log
filtering into the regular revision walking machinery:
- Make process_ranges_arbitrary_commit(), the static function in
'line-log.c' deciding whether a commit modifies the given line
range, public by removing the static keyword and adding the
'line_log_' prefix, so it can be called from other parts of the
revision walking machinery.
- If the user didn't explicitly ask for parent rewriting (which, I
believe, is the most common case):
- Call this now-public function during regular history traversal,
namely from get_commit_action() to ignore any commits not
modifying the given line range.
Note that while this check is relatively expensive, it must be
performed before other, much cheaper conditions, because the
tracked line range must be adjusted even when the commit will
end up being ignored by other conditions.
- Skip the line_log_filter() call, i.e. the expensive
preprocessing step, in prepare_revision_walk(), because, thanks
to the above points, the revision walking machinery is now able
to filter out commits not modifying the given line range while
traversing history.
This way the regular history traversal sees the unmodified
history, and is therefore able to print the object ids of the
immediate parents of the listed commits. The eliminated
preprocessing step can greatly reduce the delay before the first
commit is shown, see the numbers below.
- However, if the user did explicitly ask for parent rewriting via
'--parents' or a similar option, then stick with the current
implementation for now, i.e. perform that expensive filtering and
history rewriting in the preprocessing step just like we did
before, leaving the initial delay as long as it was.
I tried to integrate line-level log filtering with parent rewriting
into the regular history traversal, but, unfortunately, several
subtleties resisted... :) Maybe someday we'll figure out how to do
that, but until then at least the simple and common (i.e. without
parent rewriting) 'git log -L:func:file' commands can benefit from the
reduced delay.
This change makes the failing 'parent oids without parent rewriting'
test in 't4211-line-log.sh' succeed.
The reduced delay is most noticable when there's a commit modifying
the line range near the tip of a large-ish revision range:
# no parent rewriting requested, no commit-graph present
$ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0
Before:
real 0m9.570s
user 0m9.494s
sys 0m0.076s
After:
real 0m0.718s
user 0m0.674s
sys 0m0.044s
A significant part of the remaining delay is spent reading and parsing
commit objects in limit_list(). With the help of the commit-graph we
can eliminate most of that reading and parsing overhead, so here are
the timing results of the same command as above, but this time using
the commit-graph:
Before:
real 0m8.874s
user 0m8.816s
sys 0m0.057s
After:
real 0m0.107s
user 0m0.091s
sys 0m0.013s
The next patch will further reduce the remaining delay.
To be clear: this patch doesn't actually optimize the line-level log,
but merely moves most of the work from the preprocessing step to the
history traversal, so the commits modifying the line range can be
shown as soon as they are processed, and the traversal can be
terminated as soon as the given number of commits are shown.
Consequently, listing the full history of a line range, potentially
all the way to the root commit, will take the same time as before (but
at least the user might start reading the output earlier).
Furthermore, if the most recent commit modifying the line range is far
away from the starting revision, then that initial delay will still be
significant.
Additional testing by Derrick Stolee: In the Linux kernel repository,
the MAINTAINERS file was changed ~3,500 times across the ~915,000
commits. In addition to that edit frequency, the file itself is quite
large (~18,700 lines). This means that a significant portion of the
computation is taken up by computing the patch-diff of the file. This
patch improves the real time it takes to output the first result quite
a bit:
Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
Before: 3.88 s
After: 0.71 s
If we drop the "-n 1" in the command, then there is no change in
end-to-end process time. This is because the command still needs to
walk the entire commit history, which negates the point of this
patch. This is expected.
As a note for future reference, the ~4.3 seconds in the old code
spends ~2.6 seconds computing the patch-diffs, and the rest of the
time is spent walking commits and computing diffs for which paths
changed at each commit. The changed-path Bloom filters could improve
the end-to-end computation time (i.e. no "-n 1" in the command).
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-11 19:56:17 +08:00
|
|
|
if (revs->line_level_traverse && want_ancestry(revs))
|
|
|
|
/*
|
|
|
|
* At the moment we can only do line-level log with parent
|
|
|
|
* rewriting by performing this expensive pre-filtering step.
|
|
|
|
* If parent rewriting is not requested, then we rather
|
|
|
|
* perform the line-level log filtering during the regular
|
|
|
|
* history traversal.
|
|
|
|
*/
|
Implement line-history search (git log -L)
This is a rewrite of much of Bo's work, mainly in an effort to split
it into smaller, easier to understand routines.
The algorithm is built around the struct range_set, which encodes a
series of line ranges as intervals [a,b). This is used in two
contexts:
* A set of lines we are tracking (which will change as we dig through
history).
* To encode diffs, as pairs of ranges.
The main routine is range_set_map_across_diff(). It processes the
diff between a commit C and some parent P. It determines which diff
hunks are relevant to the ranges tracked in C, and computes the new
ranges for P.
The algorithm is then simply to process history in topological order
from newest to oldest, computing ranges and (partial) diffs. At
branch points, we need to merge the ranges we are watching. We will
find that many commits do not affect the chosen ranges, and mark them
TREESAME (in addition to those already filtered by pathspec limiting).
Another pass of history simplification then gets rid of such commits.
This is wired as an extra filtering pass in the log machinery. This
currently only reduces code duplication, but should allow for other
simplifications and options to be used.
Finally, we hook a diff printer into the output chain. Ideally we
would wire directly into the diff logic, to optionally use features
like word diff. However, that will require some major reworking of
the diff chain, so we completely replace the output with our own diff
for now.
As this was a GSoC project, and has quite some history by now, many
people have helped. In no particular order, thanks go to
Jakub Narebski <jnareb@gmail.com>
Jens Lehmann <Jens.Lehmann@web.de>
Jonathan Nieder <jrnieder@gmail.com>
Junio C Hamano <gitster@pobox.com>
Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Will Palmer <wmpalmer@gmail.com>
Apologies to everyone I forgot.
Signed-off-by: Bo Yang <struggleyb.nku@gmail.com>
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-03-29 00:47:32 +08:00
|
|
|
line_log_filter(revs);
|
revision traversal: show full history with merge simplification
The --full-history traversal keeps all merges in addition to non-merge
commits that touch paths in the given pathspec. This is useful to view
both sides of a merge in a topology like this:
A---M---o
/ /
---O---B
even when A and B makes identical change to the given paths. The revision
traversal without --full-history aims to come up with the simplest history
to explain the final state of the tree, and one of the side branches can
be pruned away.
The behaviour to keep all merges however is inconvenient if neither A nor
B touches the paths we are interested in. --full-history reduces the
topology to:
---O---M---o
in such a case, without removing M.
This adds a post processing phase on top of --full-history traversal to
remove needless merges from the resulting history.
The idea is to compute, for each commit in the "full history" result set,
the commit that should replace it in the simplified history. The commit
to replace it in the final history is determined as follows:
* In any case, we first figure out the replacement commits of parents of
the commit we are looking at. The commit we are looking at is
rewritten as if the replacement commits of its original parents are its
parents. While doing so, we reduce the redundant parents from the
rewritten parent list by not just removing the identical ones, but also
removing a parent that is an ancestor of another parent.
* After the above parent simplification, if the commit is a root commit,
an UNINTERESTING commit, a merge commit, or modifies the paths we are
interested in, then the replacement commit of the commit is itself. In
other words, such a commit is not dropped from the final result.
The first point above essentially means that the history is rewritten in
the bottom up direction. We can rewrite the parent list of a commit only
after we know how all of its parents are rewritten. This means that the
processing needs to happen on the full history (i.e. after limit_list()).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-07-31 16:17:41 +08:00
|
|
|
if (revs->simplify_merges)
|
|
|
|
simplify_merges(revs);
|
2008-04-03 17:12:06 +08:00
|
|
|
if (revs->children.name)
|
|
|
|
set_children(revs);
|
2020-04-07 00:59:52 +08:00
|
|
|
|
2007-05-05 05:54:57 +08:00
|
|
|
return 0;
|
2006-03-01 03:24:00 +08:00
|
|
|
}
|
|
|
|
|
revision: use a prio_queue to hold rewritten parents
This patch fixes a quadratic list insertion in rewrite_one() when
pathspec limiting is combined with --parents. What happens is something
like this:
1. We see that some commit X touches the path, so we try to rewrite
its parents.
2. rewrite_one() loops forever, rewriting parents, until it finds a
relevant parent (or hits the root and decides there are none). The
heavy lifting is done by process_parent(), which uses
try_to_simplify_commit() to drop parents.
3. process_parent() puts any intermediate parents into the
&revs->commits list, inserting by commit date as usual.
So if commit X is recent, and then there's a large chunk of history that
doesn't touch the path, we may add a lot of commits to &revs->commits.
And insertion by commit date is O(n) in the worst case, making the whole
thing quadratic.
We tried to deal with this long ago in fce87ae538 (Fix quadratic
performance in rewrite_one., 2008-07-12). In that scheme, we cache the
oldest commit in the list; if the new commit to be added is older, we
can start our linear traversal there. This often works well in practice
because parents are older than their descendants, and thus we tend to
add older and older commits as we traverse.
But this isn't guaranteed, and in fact there's a simple case where it is
not: merges. Imagine we look at the first parent of a merge and see a
very old commit (let's say 3 years old). And on the second parent, as we
go back 3 years in history, we might have many commits. That one
first-parent commit has polluted our oldest-commit cache; it will remain
the oldest while we traverse a huge chunk of history, during which we
have to fall back to the slow, linear method of adding to the list.
Naively, one might imagine that instead of caching the oldest commit,
we'd start at the last-added one. But that just makes some cases faster
while making others slower (and indeed, while it made a real-world test
case much faster, it does quite poorly in the perf test include here).
Fundamentally, these are just heuristics; our worst case is still
quadratic, and some cases will approach that.
Instead, let's use a data structure with better worst-case performance.
Swapping out revs->commits for something else would have repercussions
all over the code base, but we can take advantage of one fact: for the
rewrite_one() case, nobody actually needs to see those commits in
revs->commits until we've finished generating the whole list.
That leaves us with two obvious options:
1. We can generate the list _unordered_, which should be O(n), and
then sort it afterwards, which would be O(n log n) total. This is
"sort-after" below.
2. We can insert the commits into a separate data structure, like a
priority queue. This is "prio-queue" below.
I expected that sort-after would be the fastest (since it saves us the
extra step of copying the items into the linked list), but surprisingly
the prio-queue seems to be a bit faster.
Here are timings for the new p0001.6 for all three techniques across a
few repositories, as compared to master:
master cache-last sort-after prio-queue
--------------------------------------------------------------------------------------------
GIT_PERF_REPO=git.git
0.52(0.50+0.02) 0.53(0.51+0.02) +1.9% 0.37(0.33+0.03) -28.8% 0.37(0.32+0.04) -28.8%
GIT_PERF_REPO=linux.git
20.81(20.74+0.07) 20.31(20.24+0.07) -2.4% 0.94(0.86+0.07) -95.5% 0.91(0.82+0.09) -95.6%
GIT_PERF_REPO=llvm-project.git
83.67(83.57+0.09) 4.23(4.15+0.08) -94.9% 3.21(3.15+0.06) -96.2% 2.98(2.91+0.07) -96.4%
A few items to note:
- the cache-list tweak does improve the bad case for llvm-project.git
that started my digging into this problem. But it performs terribly
on linux.git, barely helping at all.
- the sort-after and prio-queue techniques work well. They approach
the timing for running without --parents at all, which is what you'd
expect (see below for more data).
- prio-queue just barely outperforms sort-after. As I said, I'm not
really sure why this is the case, but it is. You can see it even
more prominently in this real-world case on llvm-project.git:
git rev-list --parents 07ef786652e7 -- llvm/test/CodeGen/Generic/bswap.ll
where prio-queue routinely outperforms sort-after by about 7%. One
guess is that the prio-queue may just be more efficient because it
uses a compact array.
There are three new perf tests:
- "rev-list --parents" gives us a baseline for running with --parents.
This isn't sped up meaningfully here, because the bad case is
triggered only with simplification. But it's good to make sure we
don't screw it up (now, or in the future).
- "rev-list -- dummy" gives us a baseline for just traversing with
pathspec limiting. This gives a lower bound for the next test (and
it's also a good thing for us to be checking in general for
regressions, since we don't seem to have any existing tests).
- "rev-list --parents -- dummy" shows off the problem (and our fix)
Here are the timings for those three on llvm-project.git, before and
after the fix:
Test master prio-queue
------------------------------------------------------------------------------
0001.3: rev-list --parents 2.24(2.12+0.12) 2.22(2.11+0.11) -0.9%
0001.5: rev-list -- dummy 2.89(2.82+0.07) 2.92(2.89+0.03) +1.0%
0001.6: rev-list --parents -- dummy 83.67(83.57+0.09) 2.98(2.91+0.07) -96.4%
Changes in the first two are basically noise, and you can see we
approach our lower bound in the final one.
Note that we can't fully get rid of the list argument from
process_parents(). Other callers do have lists, and it would be hard to
convert them. They also don't seem to have this problem (probably
because they actually remove items from the list as they loop, meaning
it doesn't grow so large in the first place). So this basically just
drops the "cache_ptr" parameter (which was used only by the one caller
we're fixing here) and replaces it with a prio_queue. Callers are free
to use either data structure, depending on what they're prepared to
handle.
Reported-by: Björn Pettersson A <bjorn.a.pettersson@ericsson.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-04 09:41:09 +08:00
|
|
|
static enum rewrite_result rewrite_one_1(struct rev_info *revs,
|
|
|
|
struct commit **pp,
|
|
|
|
struct prio_queue *queue)
|
2006-03-01 07:07:20 +08:00
|
|
|
{
|
|
|
|
for (;;) {
|
|
|
|
struct commit *p = *pp;
|
Make "--parents" logs also be incremental
The parent rewriting feature caused us to create the whole history in one
go, and then simplify it later, because of how rewrite_parents() had been
written. However, with a little tweaking, it's perfectly possible to do
even that one incrementally.
Right now, this doesn't really much matter, because every user of
"--parents" will probably generally _also_ use "--topo-order", which will
cause the old non-incremental behaviour anyway. However, I'm hopeful that
we could make even the topological sort incremental, or at least
_partially_ so (for example, make it incremental up to the first merge).
In the meantime, this at least moves things in the right direction, and
removes a strange special case.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-04-09 08:05:58 +08:00
|
|
|
if (!revs->limited)
|
revision: use a prio_queue to hold rewritten parents
This patch fixes a quadratic list insertion in rewrite_one() when
pathspec limiting is combined with --parents. What happens is something
like this:
1. We see that some commit X touches the path, so we try to rewrite
its parents.
2. rewrite_one() loops forever, rewriting parents, until it finds a
relevant parent (or hits the root and decides there are none). The
heavy lifting is done by process_parent(), which uses
try_to_simplify_commit() to drop parents.
3. process_parent() puts any intermediate parents into the
&revs->commits list, inserting by commit date as usual.
So if commit X is recent, and then there's a large chunk of history that
doesn't touch the path, we may add a lot of commits to &revs->commits.
And insertion by commit date is O(n) in the worst case, making the whole
thing quadratic.
We tried to deal with this long ago in fce87ae538 (Fix quadratic
performance in rewrite_one., 2008-07-12). In that scheme, we cache the
oldest commit in the list; if the new commit to be added is older, we
can start our linear traversal there. This often works well in practice
because parents are older than their descendants, and thus we tend to
add older and older commits as we traverse.
But this isn't guaranteed, and in fact there's a simple case where it is
not: merges. Imagine we look at the first parent of a merge and see a
very old commit (let's say 3 years old). And on the second parent, as we
go back 3 years in history, we might have many commits. That one
first-parent commit has polluted our oldest-commit cache; it will remain
the oldest while we traverse a huge chunk of history, during which we
have to fall back to the slow, linear method of adding to the list.
Naively, one might imagine that instead of caching the oldest commit,
we'd start at the last-added one. But that just makes some cases faster
while making others slower (and indeed, while it made a real-world test
case much faster, it does quite poorly in the perf test include here).
Fundamentally, these are just heuristics; our worst case is still
quadratic, and some cases will approach that.
Instead, let's use a data structure with better worst-case performance.
Swapping out revs->commits for something else would have repercussions
all over the code base, but we can take advantage of one fact: for the
rewrite_one() case, nobody actually needs to see those commits in
revs->commits until we've finished generating the whole list.
That leaves us with two obvious options:
1. We can generate the list _unordered_, which should be O(n), and
then sort it afterwards, which would be O(n log n) total. This is
"sort-after" below.
2. We can insert the commits into a separate data structure, like a
priority queue. This is "prio-queue" below.
I expected that sort-after would be the fastest (since it saves us the
extra step of copying the items into the linked list), but surprisingly
the prio-queue seems to be a bit faster.
Here are timings for the new p0001.6 for all three techniques across a
few repositories, as compared to master:
master cache-last sort-after prio-queue
--------------------------------------------------------------------------------------------
GIT_PERF_REPO=git.git
0.52(0.50+0.02) 0.53(0.51+0.02) +1.9% 0.37(0.33+0.03) -28.8% 0.37(0.32+0.04) -28.8%
GIT_PERF_REPO=linux.git
20.81(20.74+0.07) 20.31(20.24+0.07) -2.4% 0.94(0.86+0.07) -95.5% 0.91(0.82+0.09) -95.6%
GIT_PERF_REPO=llvm-project.git
83.67(83.57+0.09) 4.23(4.15+0.08) -94.9% 3.21(3.15+0.06) -96.2% 2.98(2.91+0.07) -96.4%
A few items to note:
- the cache-list tweak does improve the bad case for llvm-project.git
that started my digging into this problem. But it performs terribly
on linux.git, barely helping at all.
- the sort-after and prio-queue techniques work well. They approach
the timing for running without --parents at all, which is what you'd
expect (see below for more data).
- prio-queue just barely outperforms sort-after. As I said, I'm not
really sure why this is the case, but it is. You can see it even
more prominently in this real-world case on llvm-project.git:
git rev-list --parents 07ef786652e7 -- llvm/test/CodeGen/Generic/bswap.ll
where prio-queue routinely outperforms sort-after by about 7%. One
guess is that the prio-queue may just be more efficient because it
uses a compact array.
There are three new perf tests:
- "rev-list --parents" gives us a baseline for running with --parents.
This isn't sped up meaningfully here, because the bad case is
triggered only with simplification. But it's good to make sure we
don't screw it up (now, or in the future).
- "rev-list -- dummy" gives us a baseline for just traversing with
pathspec limiting. This gives a lower bound for the next test (and
it's also a good thing for us to be checking in general for
regressions, since we don't seem to have any existing tests).
- "rev-list --parents -- dummy" shows off the problem (and our fix)
Here are the timings for those three on llvm-project.git, before and
after the fix:
Test master prio-queue
------------------------------------------------------------------------------
0001.3: rev-list --parents 2.24(2.12+0.12) 2.22(2.11+0.11) -0.9%
0001.5: rev-list -- dummy 2.89(2.82+0.07) 2.92(2.89+0.03) +1.0%
0001.6: rev-list --parents -- dummy 83.67(83.57+0.09) 2.98(2.91+0.07) -96.4%
Changes in the first two are basically noise, and you can see we
approach our lower bound in the final one.
Note that we can't fully get rid of the list argument from
process_parents(). Other callers do have lists, and it would be hard to
convert them. They also don't seem to have this problem (probably
because they actually remove items from the list as they loop, meaning
it doesn't grow so large in the first place). So this basically just
drops the "cache_ptr" parameter (which was used only by the one caller
we're fixing here) and replaces it with a prio_queue. Callers are free
to use either data structure, depending on what they're prepared to
handle.
Reported-by: Björn Pettersson A <bjorn.a.pettersson@ericsson.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-04 09:41:09 +08:00
|
|
|
if (process_parents(revs, p, NULL, queue) < 0)
|
2007-05-05 05:54:57 +08:00
|
|
|
return rewrite_one_error;
|
2007-11-13 15:16:08 +08:00
|
|
|
if (p->object.flags & UNINTERESTING)
|
|
|
|
return rewrite_one_ok;
|
|
|
|
if (!(p->object.flags & TREESAME))
|
2007-05-05 05:54:57 +08:00
|
|
|
return rewrite_one_ok;
|
2006-03-01 07:07:20 +08:00
|
|
|
if (!p->parents)
|
2007-05-05 05:54:57 +08:00
|
|
|
return rewrite_one_noparents;
|
2022-05-03 00:50:37 +08:00
|
|
|
if (!(p = one_relevant_parent(revs, p->parents)))
|
2013-05-16 23:32:39 +08:00
|
|
|
return rewrite_one_ok;
|
|
|
|
*pp = p;
|
2006-03-01 07:07:20 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
revision: use a prio_queue to hold rewritten parents
This patch fixes a quadratic list insertion in rewrite_one() when
pathspec limiting is combined with --parents. What happens is something
like this:
1. We see that some commit X touches the path, so we try to rewrite
its parents.
2. rewrite_one() loops forever, rewriting parents, until it finds a
relevant parent (or hits the root and decides there are none). The
heavy lifting is done by process_parent(), which uses
try_to_simplify_commit() to drop parents.
3. process_parent() puts any intermediate parents into the
&revs->commits list, inserting by commit date as usual.
So if commit X is recent, and then there's a large chunk of history that
doesn't touch the path, we may add a lot of commits to &revs->commits.
And insertion by commit date is O(n) in the worst case, making the whole
thing quadratic.
We tried to deal with this long ago in fce87ae538 (Fix quadratic
performance in rewrite_one., 2008-07-12). In that scheme, we cache the
oldest commit in the list; if the new commit to be added is older, we
can start our linear traversal there. This often works well in practice
because parents are older than their descendants, and thus we tend to
add older and older commits as we traverse.
But this isn't guaranteed, and in fact there's a simple case where it is
not: merges. Imagine we look at the first parent of a merge and see a
very old commit (let's say 3 years old). And on the second parent, as we
go back 3 years in history, we might have many commits. That one
first-parent commit has polluted our oldest-commit cache; it will remain
the oldest while we traverse a huge chunk of history, during which we
have to fall back to the slow, linear method of adding to the list.
Naively, one might imagine that instead of caching the oldest commit,
we'd start at the last-added one. But that just makes some cases faster
while making others slower (and indeed, while it made a real-world test
case much faster, it does quite poorly in the perf test include here).
Fundamentally, these are just heuristics; our worst case is still
quadratic, and some cases will approach that.
Instead, let's use a data structure with better worst-case performance.
Swapping out revs->commits for something else would have repercussions
all over the code base, but we can take advantage of one fact: for the
rewrite_one() case, nobody actually needs to see those commits in
revs->commits until we've finished generating the whole list.
That leaves us with two obvious options:
1. We can generate the list _unordered_, which should be O(n), and
then sort it afterwards, which would be O(n log n) total. This is
"sort-after" below.
2. We can insert the commits into a separate data structure, like a
priority queue. This is "prio-queue" below.
I expected that sort-after would be the fastest (since it saves us the
extra step of copying the items into the linked list), but surprisingly
the prio-queue seems to be a bit faster.
Here are timings for the new p0001.6 for all three techniques across a
few repositories, as compared to master:
master cache-last sort-after prio-queue
--------------------------------------------------------------------------------------------
GIT_PERF_REPO=git.git
0.52(0.50+0.02) 0.53(0.51+0.02) +1.9% 0.37(0.33+0.03) -28.8% 0.37(0.32+0.04) -28.8%
GIT_PERF_REPO=linux.git
20.81(20.74+0.07) 20.31(20.24+0.07) -2.4% 0.94(0.86+0.07) -95.5% 0.91(0.82+0.09) -95.6%
GIT_PERF_REPO=llvm-project.git
83.67(83.57+0.09) 4.23(4.15+0.08) -94.9% 3.21(3.15+0.06) -96.2% 2.98(2.91+0.07) -96.4%
A few items to note:
- the cache-list tweak does improve the bad case for llvm-project.git
that started my digging into this problem. But it performs terribly
on linux.git, barely helping at all.
- the sort-after and prio-queue techniques work well. They approach
the timing for running without --parents at all, which is what you'd
expect (see below for more data).
- prio-queue just barely outperforms sort-after. As I said, I'm not
really sure why this is the case, but it is. You can see it even
more prominently in this real-world case on llvm-project.git:
git rev-list --parents 07ef786652e7 -- llvm/test/CodeGen/Generic/bswap.ll
where prio-queue routinely outperforms sort-after by about 7%. One
guess is that the prio-queue may just be more efficient because it
uses a compact array.
There are three new perf tests:
- "rev-list --parents" gives us a baseline for running with --parents.
This isn't sped up meaningfully here, because the bad case is
triggered only with simplification. But it's good to make sure we
don't screw it up (now, or in the future).
- "rev-list -- dummy" gives us a baseline for just traversing with
pathspec limiting. This gives a lower bound for the next test (and
it's also a good thing for us to be checking in general for
regressions, since we don't seem to have any existing tests).
- "rev-list --parents -- dummy" shows off the problem (and our fix)
Here are the timings for those three on llvm-project.git, before and
after the fix:
Test master prio-queue
------------------------------------------------------------------------------
0001.3: rev-list --parents 2.24(2.12+0.12) 2.22(2.11+0.11) -0.9%
0001.5: rev-list -- dummy 2.89(2.82+0.07) 2.92(2.89+0.03) +1.0%
0001.6: rev-list --parents -- dummy 83.67(83.57+0.09) 2.98(2.91+0.07) -96.4%
Changes in the first two are basically noise, and you can see we
approach our lower bound in the final one.
Note that we can't fully get rid of the list argument from
process_parents(). Other callers do have lists, and it would be hard to
convert them. They also don't seem to have this problem (probably
because they actually remove items from the list as they loop, meaning
it doesn't grow so large in the first place). So this basically just
drops the "cache_ptr" parameter (which was used only by the one caller
we're fixing here) and replaces it with a prio_queue. Callers are free
to use either data structure, depending on what they're prepared to
handle.
Reported-by: Björn Pettersson A <bjorn.a.pettersson@ericsson.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-04 09:41:09 +08:00
|
|
|
static void merge_queue_into_list(struct prio_queue *q, struct commit_list **list)
|
|
|
|
{
|
|
|
|
while (q->nr) {
|
|
|
|
struct commit *item = prio_queue_peek(q);
|
|
|
|
struct commit_list *p = *list;
|
|
|
|
|
|
|
|
if (p && p->item->date >= item->date)
|
|
|
|
list = &p->next;
|
|
|
|
else {
|
|
|
|
p = commit_list_insert(item, list);
|
|
|
|
list = &p->next; /* skip newly added item */
|
|
|
|
prio_queue_get(q); /* pop item */
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static enum rewrite_result rewrite_one(struct rev_info *revs, struct commit **pp)
|
|
|
|
{
|
|
|
|
struct prio_queue queue = { compare_commits_by_commit_date };
|
|
|
|
enum rewrite_result ret = rewrite_one_1(revs, pp, &queue);
|
|
|
|
merge_queue_into_list(&queue, &revs->commits);
|
|
|
|
clear_prio_queue(&queue);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-03-29 00:47:31 +08:00
|
|
|
int rewrite_parents(struct rev_info *revs, struct commit *commit,
|
|
|
|
rewrite_parent_fn_t rewrite_parent)
|
2006-03-01 07:07:20 +08:00
|
|
|
{
|
|
|
|
struct commit_list **pp = &commit->parents;
|
|
|
|
while (*pp) {
|
|
|
|
struct commit_list *parent = *pp;
|
2013-03-29 00:47:31 +08:00
|
|
|
switch (rewrite_parent(revs, &parent->item)) {
|
2007-05-05 05:54:57 +08:00
|
|
|
case rewrite_one_ok:
|
|
|
|
break;
|
|
|
|
case rewrite_one_noparents:
|
2006-03-01 07:07:20 +08:00
|
|
|
*pp = parent->next;
|
2024-09-30 17:14:03 +08:00
|
|
|
free(parent);
|
2006-03-01 07:07:20 +08:00
|
|
|
continue;
|
2007-05-05 05:54:57 +08:00
|
|
|
case rewrite_one_error:
|
|
|
|
return -1;
|
2006-03-01 07:07:20 +08:00
|
|
|
}
|
|
|
|
pp = &parent->next;
|
|
|
|
}
|
revision.c: Make --full-history consider more merges
History simplification previously always treated merges as TREESAME
if they were TREESAME to any parent.
While this was consistent with the default behaviour, this could be
extremely unhelpful when searching detailed history, and could not be
overridden. For example, if a merge had ignored a change, as if by "-s
ours", then:
git log -m -p --full-history -Schange file
would successfully locate "change"'s addition but would not locate the
merge that resolved against it.
Futher, simplify_merges could drop the actual parent that a commit
was TREESAME to, leaving it as a normal commit marked TREESAME that
isn't actually TREESAME to its remaining parent.
Now redefine a commit's TREESAME flag to be true only if a commit is
TREESAME to _all_ of its parents. This doesn't affect either the default
simplify_history behaviour (because partially TREESAME merges are turned
into normal commits), or full-history with parent rewriting (because all
merges are output). But it does affect other modes. The clearest
difference is that --full-history will show more merges - sufficient to
ensure that -m -p --full-history log searches can really explain every
change to the file, including those changes' ultimate fate in merges.
Also modify simplify_merges to recalculate TREESAME after removing
a parent. This is achieved by storing per-parent TREESAME flags on the
initial scan, so the combined flag can be easily recomputed.
This fixes some t6111 failures, but creates a couple of new ones -
we are now showing some merges that don't need to be shown.
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-16 23:32:34 +08:00
|
|
|
remove_duplicate_parents(revs, commit);
|
2007-05-05 05:54:57 +08:00
|
|
|
return 0;
|
2006-03-01 07:07:20 +08:00
|
|
|
}
|
|
|
|
|
2006-09-18 06:43:40 +08:00
|
|
|
static int commit_match(struct commit *commit, struct rev_info *opt)
|
|
|
|
{
|
2012-09-29 12:41:28 +08:00
|
|
|
int retval;
|
log: re-encode commit messages before grepping
If you run "git log --grep=foo", we will run your regex on
the literal bytes of the commit message. This can provide
confusing results if the commit message is not in the same
encoding as your grep expression (or worse, you have commits
in multiple encodings, in which case your regex would need
to be written to match either encoding). On top of this, we
might also be grepping in the commit's notes, which are
already re-encoded, potentially leading to grepping in a
buffer with mixed encodings concatenated. This is insanity,
but most people never noticed, because their terminal and
their commit encodings all match.
Instead, let's massage the to-be-grepped commit into a
standardized encoding. There is not much point in adding a
flag for "this is the encoding I expect my grep pattern to
match"; the only sane choice is for it to use the log output
encoding. That is presumably what the user's terminal is
using, and it means that the patterns found by the grep will
match the output produced by git.
As a bonus, this fixes a potential segfault in commit_match
when commit->buffer is NULL, as we now build on logmsg_reencode,
which handles reading the commit buffer from disk if
necessary. The segfault can be triggered with:
git commit -m 'text1' --allow-empty
git commit -m 'text2' --allow-empty
git log --graph --no-walk --grep 'text2'
which arguably does not make any sense (--graph inherently
wants a connected history, and by --no-walk the command line
is telling us to show discrete points in history without
connectivity), and we probably should forbid the
combination, but that is a separate issue.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-12 04:59:58 +08:00
|
|
|
const char *encoding;
|
2014-06-11 05:39:30 +08:00
|
|
|
const char *message;
|
2012-09-29 12:41:28 +08:00
|
|
|
struct strbuf buf = STRBUF_INIT;
|
log: re-encode commit messages before grepping
If you run "git log --grep=foo", we will run your regex on
the literal bytes of the commit message. This can provide
confusing results if the commit message is not in the same
encoding as your grep expression (or worse, you have commits
in multiple encodings, in which case your regex would need
to be written to match either encoding). On top of this, we
might also be grepping in the commit's notes, which are
already re-encoded, potentially leading to grepping in a
buffer with mixed encodings concatenated. This is insanity,
but most people never noticed, because their terminal and
their commit encodings all match.
Instead, let's massage the to-be-grepped commit into a
standardized encoding. There is not much point in adding a
flag for "this is the encoding I expect my grep pattern to
match"; the only sane choice is for it to use the log output
encoding. That is presumably what the user's terminal is
using, and it means that the patterns found by the grep will
match the output produced by git.
As a bonus, this fixes a potential segfault in commit_match
when commit->buffer is NULL, as we now build on logmsg_reencode,
which handles reading the commit buffer from disk if
necessary. The segfault can be triggered with:
git commit -m 'text1' --allow-empty
git commit -m 'text2' --allow-empty
git log --graph --no-walk --grep 'text2'
which arguably does not make any sense (--graph inherently
wants a connected history, and by --no-walk the command line
is telling us to show discrete points in history without
connectivity), and we probably should forbid the
combination, but that is a separate issue.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-12 04:59:58 +08:00
|
|
|
|
"log --author=me --grep=it" should find intersection, not union
Historically, any grep filter in "git log" family of commands were taken
as restricting to commits with any of the words in the commit log message.
However, the user almost always want to find commits "done by this person
on that topic". With "--all-match" option, a series of grep patterns can
be turned into a requirement that all of them must produce a match, but
that makes it impossible to ask for "done by me, on either this or that"
with:
log --author=me --committer=him --grep=this --grep=that
because it will require both "this" and "that" to appear.
Change the "header" parser of grep library to treat the headers specially,
and parse it as:
(all-match-OR (HEADER-AUTHOR me)
(HEADER-COMMITTER him)
(OR
(PATTERN this)
(PATTERN that) ) )
Even though the "log" command line parser doesn't give direct access to
the extended grep syntax to group terms with parentheses, this change will
cover the majority of the case the users would want.
This incidentally revealed that one test in t7002 was bogus. It ran:
log --author=Thor --grep=Thu --format='%s'
and expected (wrongly) "Thu" to match "Thursday" in the author/committer
date, but that would never match, as the timestamp in raw commit buffer
does not have the name of the day-of-the-week.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-01-18 12:09:06 +08:00
|
|
|
if (!opt->grep_filter.pattern_list && !opt->grep_filter.header_list)
|
2006-09-18 06:43:40 +08:00
|
|
|
return 1;
|
2012-09-30 02:59:52 +08:00
|
|
|
|
|
|
|
/* Prepend "fake" headers as needed */
|
|
|
|
if (opt->grep_filter.use_reflog_filter) {
|
2012-09-29 12:41:28 +08:00
|
|
|
strbuf_addstr(&buf, "reflog ");
|
|
|
|
get_reflog_message(&buf, opt->reflog_info);
|
|
|
|
strbuf_addch(&buf, '\n');
|
|
|
|
}
|
2012-09-30 02:59:52 +08:00
|
|
|
|
log: re-encode commit messages before grepping
If you run "git log --grep=foo", we will run your regex on
the literal bytes of the commit message. This can provide
confusing results if the commit message is not in the same
encoding as your grep expression (or worse, you have commits
in multiple encodings, in which case your regex would need
to be written to match either encoding). On top of this, we
might also be grepping in the commit's notes, which are
already re-encoded, potentially leading to grepping in a
buffer with mixed encodings concatenated. This is insanity,
but most people never noticed, because their terminal and
their commit encodings all match.
Instead, let's massage the to-be-grepped commit into a
standardized encoding. There is not much point in adding a
flag for "this is the encoding I expect my grep pattern to
match"; the only sane choice is for it to use the log output
encoding. That is presumably what the user's terminal is
using, and it means that the patterns found by the grep will
match the output produced by git.
As a bonus, this fixes a potential segfault in commit_match
when commit->buffer is NULL, as we now build on logmsg_reencode,
which handles reading the commit buffer from disk if
necessary. The segfault can be triggered with:
git commit -m 'text1' --allow-empty
git commit -m 'text2' --allow-empty
git log --graph --no-walk --grep 'text2'
which arguably does not make any sense (--graph inherently
wants a connected history, and by --no-walk the command line
is telling us to show discrete points in history without
connectivity), and we probably should forbid the
combination, but that is a separate issue.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-12 04:59:58 +08:00
|
|
|
/*
|
|
|
|
* We grep in the user's output encoding, under the assumption that it
|
|
|
|
* is the encoding they are most likely to write their grep pattern
|
|
|
|
* for. In addition, it means we will match the "notes" encoding below,
|
|
|
|
* so we will not end up with a buffer that has two different encodings
|
|
|
|
* in it.
|
|
|
|
*/
|
|
|
|
encoding = get_log_output_encoding();
|
2023-03-28 21:58:48 +08:00
|
|
|
message = repo_logmsg_reencode(the_repository, commit, NULL, encoding);
|
log: re-encode commit messages before grepping
If you run "git log --grep=foo", we will run your regex on
the literal bytes of the commit message. This can provide
confusing results if the commit message is not in the same
encoding as your grep expression (or worse, you have commits
in multiple encodings, in which case your regex would need
to be written to match either encoding). On top of this, we
might also be grepping in the commit's notes, which are
already re-encoded, potentially leading to grepping in a
buffer with mixed encodings concatenated. This is insanity,
but most people never noticed, because their terminal and
their commit encodings all match.
Instead, let's massage the to-be-grepped commit into a
standardized encoding. There is not much point in adding a
flag for "this is the encoding I expect my grep pattern to
match"; the only sane choice is for it to use the log output
encoding. That is presumably what the user's terminal is
using, and it means that the patterns found by the grep will
match the output produced by git.
As a bonus, this fixes a potential segfault in commit_match
when commit->buffer is NULL, as we now build on logmsg_reencode,
which handles reading the commit buffer from disk if
necessary. The segfault can be triggered with:
git commit -m 'text1' --allow-empty
git commit -m 'text2' --allow-empty
git log --graph --no-walk --grep 'text2'
which arguably does not make any sense (--graph inherently
wants a connected history, and by --no-walk the command line
is telling us to show discrete points in history without
connectivity), and we probably should forbid the
combination, but that is a separate issue.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-12 04:59:58 +08:00
|
|
|
|
2012-09-30 02:59:52 +08:00
|
|
|
/* Copy the commit to temporary if we are using "fake" headers */
|
|
|
|
if (buf.len)
|
log: re-encode commit messages before grepping
If you run "git log --grep=foo", we will run your regex on
the literal bytes of the commit message. This can provide
confusing results if the commit message is not in the same
encoding as your grep expression (or worse, you have commits
in multiple encodings, in which case your regex would need
to be written to match either encoding). On top of this, we
might also be grepping in the commit's notes, which are
already re-encoded, potentially leading to grepping in a
buffer with mixed encodings concatenated. This is insanity,
but most people never noticed, because their terminal and
their commit encodings all match.
Instead, let's massage the to-be-grepped commit into a
standardized encoding. There is not much point in adding a
flag for "this is the encoding I expect my grep pattern to
match"; the only sane choice is for it to use the log output
encoding. That is presumably what the user's terminal is
using, and it means that the patterns found by the grep will
match the output produced by git.
As a bonus, this fixes a potential segfault in commit_match
when commit->buffer is NULL, as we now build on logmsg_reencode,
which handles reading the commit buffer from disk if
necessary. The segfault can be triggered with:
git commit -m 'text1' --allow-empty
git commit -m 'text2' --allow-empty
git log --graph --no-walk --grep 'text2'
which arguably does not make any sense (--graph inherently
wants a connected history, and by --no-walk the command line
is telling us to show discrete points in history without
connectivity), and we probably should forbid the
combination, but that is a separate issue.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-12 04:59:58 +08:00
|
|
|
strbuf_addstr(&buf, message);
|
2012-09-30 02:59:52 +08:00
|
|
|
|
log --use-mailmap: optimize for cases without --author/--committer search
When we taught the commit_match() mechanism to pay attention to the
new --use-mailmap option, we started to unconditionally copy the
commit object to a temporary buffer, just in case we need the author
and committer lines updated via the mailmap mechanism, and rewrite
author and committer using the mailmap.
It turns out that this has a rather unpleasant performance
implications. In the linux kernel repository, running
$ git log --author='Junio C Hamano' --pretty=short >/dev/null
under /usr/bin/time, with and without --use-mailmap (the .mailmap
file is 118 entries long, the particular author does not appear in
it), cost (with warm cache):
[without --use-mailmap]
5.42user 0.26system 0:05.70elapsed 99%CPU (0avgtext+0avgdata 2005936maxresident)k
0inputs+0outputs (0major+137669minor)pagefaults 0swaps
[with --use-mailmap]
6.47user 0.30system 0:06.78elapsed 99%CPU (0avgtext+0avgdata 2006288maxresident)k
0inputs+0outputs (0major+137692minor)pagefaults 0swaps
which incurs about 20% overhead. The command is doing extra work,
so the extra cost may be justified.
But it is inexcusable to pay the cost when we do not need
author/committer match. In the same repository,
$ git log --grep='fix menuconfig on debian lenny' --pretty=short >/dev/null
shows very similar numbers as the above:
[without --use-mailmap]
5.32user 0.30system 0:05.63elapsed 99%CPU (0avgtext+0avgdata 2005984maxresident)k
0inputs+0outputs (0major+137672minor)pagefaults 0swaps
[with --use-mailmap]
6.64user 0.24system 0:06.89elapsed 99%CPU (0avgtext+0avgdata 2006320maxresident)k
0inputs+0outputs (0major+137694minor)pagefaults 0swaps
The latter case is an unnecessary performance regression. We may
want to _show_ the result with mailmap applied, but we do not have
to copy and rewrite the author/committer of all commits we try to
match if we do not query for these fields.
Trivially optimize this performace regression by limiting the
rewrites for only when we are matching with author/committer fields.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-01-08 16:02:49 +08:00
|
|
|
if (opt->grep_filter.header_list && opt->mailmap) {
|
2022-07-19 03:50:59 +08:00
|
|
|
const char *commit_headers[] = { "author ", "committer ", NULL };
|
|
|
|
|
2013-01-06 05:26:45 +08:00
|
|
|
if (!buf.len)
|
log: re-encode commit messages before grepping
If you run "git log --grep=foo", we will run your regex on
the literal bytes of the commit message. This can provide
confusing results if the commit message is not in the same
encoding as your grep expression (or worse, you have commits
in multiple encodings, in which case your regex would need
to be written to match either encoding). On top of this, we
might also be grepping in the commit's notes, which are
already re-encoded, potentially leading to grepping in a
buffer with mixed encodings concatenated. This is insanity,
but most people never noticed, because their terminal and
their commit encodings all match.
Instead, let's massage the to-be-grepped commit into a
standardized encoding. There is not much point in adding a
flag for "this is the encoding I expect my grep pattern to
match"; the only sane choice is for it to use the log output
encoding. That is presumably what the user's terminal is
using, and it means that the patterns found by the grep will
match the output produced by git.
As a bonus, this fixes a potential segfault in commit_match
when commit->buffer is NULL, as we now build on logmsg_reencode,
which handles reading the commit buffer from disk if
necessary. The segfault can be triggered with:
git commit -m 'text1' --allow-empty
git commit -m 'text2' --allow-empty
git log --graph --no-walk --grep 'text2'
which arguably does not make any sense (--graph inherently
wants a connected history, and by --no-walk the command line
is telling us to show discrete points in history without
connectivity), and we probably should forbid the
combination, but that is a separate issue.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-12 04:59:58 +08:00
|
|
|
strbuf_addstr(&buf, message);
|
2013-01-06 05:26:45 +08:00
|
|
|
|
2022-07-19 03:51:01 +08:00
|
|
|
apply_mailmap_to_header(&buf, commit_headers, opt->mailmap);
|
2013-01-06 05:26:45 +08:00
|
|
|
}
|
|
|
|
|
2012-09-29 12:41:29 +08:00
|
|
|
/* Append "fake" message parts as needed */
|
|
|
|
if (opt->show_notes) {
|
|
|
|
if (!buf.len)
|
log: re-encode commit messages before grepping
If you run "git log --grep=foo", we will run your regex on
the literal bytes of the commit message. This can provide
confusing results if the commit message is not in the same
encoding as your grep expression (or worse, you have commits
in multiple encodings, in which case your regex would need
to be written to match either encoding). On top of this, we
might also be grepping in the commit's notes, which are
already re-encoded, potentially leading to grepping in a
buffer with mixed encodings concatenated. This is insanity,
but most people never noticed, because their terminal and
their commit encodings all match.
Instead, let's massage the to-be-grepped commit into a
standardized encoding. There is not much point in adding a
flag for "this is the encoding I expect my grep pattern to
match"; the only sane choice is for it to use the log output
encoding. That is presumably what the user's terminal is
using, and it means that the patterns found by the grep will
match the output produced by git.
As a bonus, this fixes a potential segfault in commit_match
when commit->buffer is NULL, as we now build on logmsg_reencode,
which handles reading the commit buffer from disk if
necessary. The segfault can be triggered with:
git commit -m 'text1' --allow-empty
git commit -m 'text2' --allow-empty
git log --graph --no-walk --grep 'text2'
which arguably does not make any sense (--graph inherently
wants a connected history, and by --no-walk the command line
is telling us to show discrete points in history without
connectivity), and we probably should forbid the
combination, but that is a separate issue.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-12 04:59:58 +08:00
|
|
|
strbuf_addstr(&buf, message);
|
2017-05-31 01:30:41 +08:00
|
|
|
format_display_notes(&commit->object.oid, &buf, encoding, 1);
|
2012-09-29 12:41:29 +08:00
|
|
|
}
|
|
|
|
|
2014-06-11 05:39:30 +08:00
|
|
|
/*
|
|
|
|
* Find either in the original commit message, or in the temporary.
|
|
|
|
* Note that we cast away the constness of "message" here. It is
|
|
|
|
* const because it may come from the cached commit buffer. That's OK,
|
|
|
|
* because we know that it is modifiable heap memory, and that while
|
|
|
|
* grep_buffer may modify it for speed, it will restore any
|
|
|
|
* changes before returning.
|
|
|
|
*/
|
2012-09-29 12:41:28 +08:00
|
|
|
if (buf.len)
|
|
|
|
retval = grep_buffer(&opt->grep_filter, buf.buf, buf.len);
|
|
|
|
else
|
|
|
|
retval = grep_buffer(&opt->grep_filter,
|
2014-06-11 05:39:30 +08:00
|
|
|
(char *)message, strlen(message));
|
2012-09-29 12:41:28 +08:00
|
|
|
strbuf_release(&buf);
|
2023-03-28 21:58:48 +08:00
|
|
|
repo_unuse_commit_buffer(the_repository, commit, message);
|
2021-12-18 00:48:49 +08:00
|
|
|
return retval;
|
2006-09-18 06:43:40 +08:00
|
|
|
}
|
|
|
|
|
log: use true parents for diff even when rewriting
When using pathspec filtering in combination with diff-based log
output, parent simplification happens before the diff is computed.
The diff is therefore against the *simplified* parents.
This works okay, arguably by accident, in the normal case:
simplification reduces to one parent as long as the commit is TREESAME
to it. So the simplified parent of any given commit must have the
same tree contents on the filtered paths as its true (unfiltered)
parent.
However, --full-diff breaks this guarantee, and indeed gives pretty
spectacular results when comparing the output of
git log --graph --stat ...
git log --graph --full-diff --stat ...
(--graph internally kicks in parent simplification, much like
--parents).
To fix it, store a copy of the parent list before simplification (in a
slab) whenever --full-diff is in effect. Then use the stored parents
instead of the simplified ones in the commit display code paths. The
latter do not actually check for --full-diff to avoid duplicated code;
they just grab the original parents if save_parents() has not been
called for this revision walk.
For ordinary commits it should be obvious that this is the right thing
to do.
Merge commits are a bit subtle. Observe that with default
simplification, merge simplification is an all-or-nothing decision:
either the merge is TREESAME to one parent and disappears, or it is
different from all parents and the parent list remains intact.
Redundant parents are not pruned, so the existing code also shows them
as a merge.
So if we do show a merge commit, the parent list just consists of the
rewrite result on each parent. Running, e.g., --cc on this in
--full-diff mode is not very useful: if any commits were skipped, some
hunks will disagree with all sides of the merge (with one side,
because commits were skipped; with the others, because they didn't
have those changes in the first place). This triggers --cc showing
these hunks spuriously.
Therefore I believe that even for merge commits it is better to show
the diffs wrt. the original parents.
Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-01 04:13:20 +08:00
|
|
|
static inline int want_ancestry(const struct rev_info *revs)
|
2008-04-03 17:12:06 +08:00
|
|
|
{
|
2008-07-09 06:25:44 +08:00
|
|
|
return (revs->rewrite_parents || revs->children.name);
|
2008-04-03 17:12:06 +08:00
|
|
|
}
|
|
|
|
|
2017-07-07 17:16:21 +08:00
|
|
|
/*
|
|
|
|
* Return a timestamp to be used for --since/--until comparisons for this
|
|
|
|
* commit, based on the revision options.
|
|
|
|
*/
|
|
|
|
static timestamp_t comparison_date(const struct rev_info *revs,
|
|
|
|
struct commit *commit)
|
|
|
|
{
|
|
|
|
return revs->reflog_info ?
|
|
|
|
get_reflog_timestamp(revs->reflog_info) :
|
|
|
|
commit->date;
|
|
|
|
}
|
|
|
|
|
2009-08-19 10:34:33 +08:00
|
|
|
enum commit_action get_commit_action(struct rev_info *revs, struct commit *commit)
|
2007-11-05 04:12:05 +08:00
|
|
|
{
|
|
|
|
if (commit->object.flags & SHOWN)
|
|
|
|
return commit_ignore;
|
2018-05-02 08:25:33 +08:00
|
|
|
if (revs->unpacked && has_object_pack(&commit->object.oid))
|
2007-11-05 04:12:05 +08:00
|
|
|
return commit_ignore;
|
revision: learn '--no-kept-objects'
A future caller will want to be able to perform a reachability traversal
which terminates when visiting an object found in a kept pack. The
closest existing option is '--honor-pack-keep', but this isn't quite
what we want. Instead of halting the traversal midway through, a full
traversal is always performed, and the results are only trimmed
afterwords.
Besides needing to introduce a new flag (since culling results
post-facto can be different than halting the traversal as it's
happening), there is an additional wrinkle handling the distinction
in-core and on-disk kept packs. That is: what kinds of kept pack should
stop the traversal?
Introduce '--no-kept-objects[=<on-disk|in-core>]' to specify which kinds
of kept packs, if any, should stop a traversal. This can be useful for
callers that want to perform a reachability analysis, but want to leave
certain packs alone (for e.g., when doing a geometric repack that has
some "large" packs which are kept in-core that it wants to leave alone).
Note that this option is not guaranteed to produce exactly the set of
objects that aren't in kept packs, since it's possible the traversal
order may end up in a situation where a non-kept ancestor was "cut off"
by a kept object (at which point we would stop traversing). But, we
don't care about absolute correctness here, since this will eventually
be used as a purely additive guide in an upcoming new repack mode.
Explicitly avoid documenting this new flag, since it is only used
internally. In theory we could avoid even adding it rev-list, but being
able to spell this option out on the command-line makes some special
cases easier to test without promising to keep it behaving consistently
forever. Those tricky cases are exercised in t6114.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 10:25:07 +08:00
|
|
|
if (revs->no_kept_objects) {
|
|
|
|
if (has_object_kept_pack(&commit->object.oid,
|
|
|
|
revs->keep_pack_cache_flags))
|
|
|
|
return commit_ignore;
|
|
|
|
}
|
2007-11-05 04:12:05 +08:00
|
|
|
if (commit->object.flags & UNINTERESTING)
|
|
|
|
return commit_ignore;
|
line-log: more responsive, incremental 'git log -L'
The current line-level log implementation performs a preprocessing
step in prepare_revision_walk(), during which the line_log_filter()
function filters and rewrites history to keep only commits modifying
the given line range. This preprocessing affects both responsiveness
and correctness:
- Git doesn't produce any output during this preprocessing step.
Checking whether a commit modified the given line range is
somewhat expensive, so depending on the size of the given revision
range this preprocessing can result in a significant delay before
the first commit is shown.
- Limiting the number of displayed commits (e.g. 'git log -3 -L...')
doesn't limit the amount of work during preprocessing, because
that limit is applied during history traversal. Alas, by that
point this expensive preprocessing step has already churned
through the whole revision range to find all commits modifying the
revision range, even though only a few of them need to be shown.
- It rewrites parents, with no way to turn it off. Without the user
explicitly requesting parent rewriting any parent object ID shown
should be that of the immediate parent, just like in case of a
pathspec-limited history traversal without parent rewriting.
However, after that preprocessing step rewrote history, the
subsequent "regular" history traversal (i.e. get_revision() in a
loop) only sees commits modifying the given line range.
Consequently, it can only show the object ID of the last ancestor
that modified the given line range (which might happen to be the
immediate parent, but many-many times it isn't).
This patch addresses both the correctness and, at least for the common
case, the responsiveness issues by integrating line-level log
filtering into the regular revision walking machinery:
- Make process_ranges_arbitrary_commit(), the static function in
'line-log.c' deciding whether a commit modifies the given line
range, public by removing the static keyword and adding the
'line_log_' prefix, so it can be called from other parts of the
revision walking machinery.
- If the user didn't explicitly ask for parent rewriting (which, I
believe, is the most common case):
- Call this now-public function during regular history traversal,
namely from get_commit_action() to ignore any commits not
modifying the given line range.
Note that while this check is relatively expensive, it must be
performed before other, much cheaper conditions, because the
tracked line range must be adjusted even when the commit will
end up being ignored by other conditions.
- Skip the line_log_filter() call, i.e. the expensive
preprocessing step, in prepare_revision_walk(), because, thanks
to the above points, the revision walking machinery is now able
to filter out commits not modifying the given line range while
traversing history.
This way the regular history traversal sees the unmodified
history, and is therefore able to print the object ids of the
immediate parents of the listed commits. The eliminated
preprocessing step can greatly reduce the delay before the first
commit is shown, see the numbers below.
- However, if the user did explicitly ask for parent rewriting via
'--parents' or a similar option, then stick with the current
implementation for now, i.e. perform that expensive filtering and
history rewriting in the preprocessing step just like we did
before, leaving the initial delay as long as it was.
I tried to integrate line-level log filtering with parent rewriting
into the regular history traversal, but, unfortunately, several
subtleties resisted... :) Maybe someday we'll figure out how to do
that, but until then at least the simple and common (i.e. without
parent rewriting) 'git log -L:func:file' commands can benefit from the
reduced delay.
This change makes the failing 'parent oids without parent rewriting'
test in 't4211-line-log.sh' succeed.
The reduced delay is most noticable when there's a commit modifying
the line range near the tip of a large-ish revision range:
# no parent rewriting requested, no commit-graph present
$ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0
Before:
real 0m9.570s
user 0m9.494s
sys 0m0.076s
After:
real 0m0.718s
user 0m0.674s
sys 0m0.044s
A significant part of the remaining delay is spent reading and parsing
commit objects in limit_list(). With the help of the commit-graph we
can eliminate most of that reading and parsing overhead, so here are
the timing results of the same command as above, but this time using
the commit-graph:
Before:
real 0m8.874s
user 0m8.816s
sys 0m0.057s
After:
real 0m0.107s
user 0m0.091s
sys 0m0.013s
The next patch will further reduce the remaining delay.
To be clear: this patch doesn't actually optimize the line-level log,
but merely moves most of the work from the preprocessing step to the
history traversal, so the commits modifying the line range can be
shown as soon as they are processed, and the traversal can be
terminated as soon as the given number of commits are shown.
Consequently, listing the full history of a line range, potentially
all the way to the root commit, will take the same time as before (but
at least the user might start reading the output earlier).
Furthermore, if the most recent commit modifying the line range is far
away from the starting revision, then that initial delay will still be
significant.
Additional testing by Derrick Stolee: In the Linux kernel repository,
the MAINTAINERS file was changed ~3,500 times across the ~915,000
commits. In addition to that edit frequency, the file itself is quite
large (~18,700 lines). This means that a significant portion of the
computation is taken up by computing the patch-diff of the file. This
patch improves the real time it takes to output the first result quite
a bit:
Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
Before: 3.88 s
After: 0.71 s
If we drop the "-n 1" in the command, then there is no change in
end-to-end process time. This is because the command still needs to
walk the entire commit history, which negates the point of this
patch. This is expected.
As a note for future reference, the ~4.3 seconds in the old code
spends ~2.6 seconds computing the patch-diffs, and the rest of the
time is spent walking commits and computing diffs for which paths
changed at each commit. The changed-path Bloom filters could improve
the end-to-end computation time (i.e. no "-n 1" in the command).
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-11 19:56:17 +08:00
|
|
|
if (revs->line_level_traverse && !want_ancestry(revs)) {
|
|
|
|
/*
|
|
|
|
* In case of line-level log with parent rewriting
|
|
|
|
* prepare_revision_walk() already took care of all line-level
|
|
|
|
* log filtering, and there is nothing left to do here.
|
|
|
|
*
|
|
|
|
* If parent rewriting was not requested, then this is the
|
|
|
|
* place to perform the line-level log filtering. Notably,
|
|
|
|
* this check, though expensive, must come before the other,
|
|
|
|
* cheaper filtering conditions, because the tracked line
|
|
|
|
* ranges must be adjusted even when the commit will end up
|
|
|
|
* being ignored based on other conditions.
|
|
|
|
*/
|
|
|
|
if (!line_log_process_ranges_arbitrary_commit(revs, commit))
|
|
|
|
return commit_ignore;
|
|
|
|
}
|
2017-07-07 17:16:21 +08:00
|
|
|
if (revs->min_age != -1 &&
|
|
|
|
comparison_date(revs, commit) > revs->min_age)
|
|
|
|
return commit_ignore;
|
2022-04-23 20:59:57 +08:00
|
|
|
if (revs->max_age_as_filter != -1 &&
|
|
|
|
comparison_date(revs, commit) < revs->max_age_as_filter)
|
|
|
|
return commit_ignore;
|
2011-03-21 18:14:06 +08:00
|
|
|
if (revs->min_parents || (revs->max_parents >= 0)) {
|
2013-05-16 23:32:40 +08:00
|
|
|
int n = commit_list_count(commit->parents);
|
2011-03-21 18:14:06 +08:00
|
|
|
if ((n < revs->min_parents) ||
|
|
|
|
((revs->max_parents >= 0) && (n > revs->max_parents)))
|
|
|
|
return commit_ignore;
|
|
|
|
}
|
2007-11-05 04:12:05 +08:00
|
|
|
if (!commit_match(commit, revs))
|
|
|
|
return commit_ignore;
|
2007-11-06 05:22:34 +08:00
|
|
|
if (revs->prune && revs->dense) {
|
2007-11-05 04:12:05 +08:00
|
|
|
/* Commit without changes? */
|
2007-11-13 15:16:08 +08:00
|
|
|
if (commit->object.flags & TREESAME) {
|
2013-05-16 23:32:40 +08:00
|
|
|
int n;
|
|
|
|
struct commit_list *p;
|
2007-11-05 04:12:05 +08:00
|
|
|
/* drop merges unless we want parenthood */
|
2008-04-03 17:12:06 +08:00
|
|
|
if (!want_ancestry(revs))
|
2007-11-05 04:12:05 +08:00
|
|
|
return commit_ignore;
|
revision: --show-pulls adds helpful merges
The default file history simplification of "git log -- <path>" or
"git rev-list -- <path>" focuses on providing the smallest set of
commits that first contributed a change. The revision walk greatly
restricts the set of walked commits by visiting only the first
TREESAME parent of a merge commit, when one exists. This means
that portions of the commit-graph are not walked, which can be a
performance benefit, but can also "hide" commits that added changes
but were ignored by a merge resolution.
The --full-history option modifies this by walking all commits and
reporting a merge commit as "interesting" if it has _any_ parent
that is not TREESAME. This tends to be an over-representation of
important commits, especially in an environment where most merge
commits are created by pull request completion.
Suppose we have a commit A and we create a commit B on top that
changes our file. When we merge the pull request, we create a merge
commit M. If no one else changed the file in the first-parent
history between M and A, then M will not be TREESAME to its first
parent, but will be TREESAME to B. Thus, the simplified history
will be "B". However, M will appear in the --full-history mode.
However, suppose that a number of topics T1, T2, ..., Tn were
created based on commits C1, C2, ..., Cn between A and M as
follows:
A----C1----C2--- ... ---Cn----M------P1---P2--- ... ---Pn
\ \ \ \ / / / /
\ \__.. \ \/ ..__T1 / Tn
\ \__.. /\ ..__T2 /
\_____________________B \____________________/
If the commits T1, T2, ... Tn did not change the file, then all of
P1 through Pn will be TREESAME to their first parent, but not
TREESAME to their second. This means that all of those merge commits
appear in the --full-history view, with edges that immediately
collapse into the lower history without introducing interesting
single-parent commits.
The --simplify-merges option was introduced to remove these extra
merge commits. By noticing that the rewritten parents are reachable
from their first parents, those edges can be simplified away. Finally,
the commits now look like single-parent commits that are TREESAME to
their "only" parent. Thus, they are removed and this issue does not
cause issues anymore. However, this also ends up removing the commit
M from the history view! Even worse, the --simplify-merges option
requires walking the entire history before returning a single result.
Many Git users are using Git alongside a Git service that provides
code storage alongside a code review tool commonly called "Pull
Requests" or "Merge Requests" against a target branch. When these
requests are accepted and merged, they typically create a merge
commit whose first parent is the previous branch tip and the second
parent is the tip of the topic branch used for the request. This
presents a valuable order to the parents, but also makes that merge
commit slightly special. Users may want to see not only which
commits changed a file, but which pull requests merged those commits
into their branch. In the previous example, this would mean the
users want to see the merge commit "M" in addition to the single-
parent commit "C".
Users are even more likely to want these merge commits when they
use pull requests to merge into a feature branch before merging that
feature branch into their trunk.
In some sense, users are asking for the "first" merge commit to
bring in the change to their branch. As long as the parent order is
consistent, this can be handled with the following rule:
Include a merge commit if it is not TREESAME to its first
parent, but is TREESAME to a later parent.
These merges look like the merge commits that would result from
running "git pull <topic>" on a main branch. Thus, the option to
show these commits is called "--show-pulls". This has the added
benefit of showing the commits created by closing a pull request or
merge request on any of the Git hosting and code review platforms.
To test these options, extend the standard test example to include
a merge commit that is not TREESAME to its first parent. It is
surprising that that option was not already in the example, as it
is instructive.
In particular, this extension demonstrates a common issue with file
history simplification. When a user resolves a merge conflict using
"-Xours" or otherwise ignoring one side of the conflict, they create
a TREESAME edge that probably should not be TREESAME. This leads
users to become frustrated and complain that "my change disappeared!"
In my experience, showing them history with --full-history and
--simplify-merges quickly reveals the problematic merge. As mentioned,
this option is expensive to compute. The --show-pulls option
_might_ show the merge commit (usually titled "resolving conflicts")
more quickly. Of course, this depends on the user having the correct
parent order, which is backwards when using "git pull master" from a
topic branch.
There are some special considerations when combining the --show-pulls
option with --simplify-merges. This requires adding a new PULL_MERGE
object flag to store the information from the initial TREESAME
comparisons. This helps avoid dropping those commits in later filters.
This is covered by a test, including how the parents can be simplified.
Since "struct object" has already ruined its 32-bit alignment by using
33 bits across parsed, type, and flags member, let's not make it worse.
PULL_MERGE is used in revision.c with the same value (1u<<15) as
REACHABLE in commit-graph.c. The REACHABLE flag is only used when
writing a commit-graph file, and a revision walk using --show-pulls
does not happen in the same process. Care must be taken in the future
to ensure this remains the case.
Update Documentation/rev-list-options.txt with significant details
around this option. This requires updating the example in the
History Simplification section to demonstrate some of the problems
with TREESAME second parents.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-10 20:19:43 +08:00
|
|
|
|
|
|
|
if (revs->show_pulls && (commit->object.flags & PULL_MERGE))
|
|
|
|
return commit_show;
|
|
|
|
|
2013-05-16 23:32:40 +08:00
|
|
|
/*
|
|
|
|
* If we want ancestry, then need to keep any merges
|
|
|
|
* between relevant commits to tie together topology.
|
|
|
|
* For consistency with TREESAME and simplification
|
|
|
|
* use "relevant" here rather than just INTERESTING,
|
|
|
|
* to treat bottom commit(s) as part of the topology.
|
|
|
|
*/
|
|
|
|
for (n = 0, p = commit->parents; p; p = p->next)
|
|
|
|
if (relevant_commit(p->item))
|
|
|
|
if (++n >= 2)
|
|
|
|
return commit_show;
|
|
|
|
return commit_ignore;
|
2007-11-05 04:12:05 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
return commit_show;
|
|
|
|
}
|
|
|
|
|
2015-01-15 06:49:24 +08:00
|
|
|
define_commit_slab(saved_parents, struct commit_list *);
|
|
|
|
|
|
|
|
#define EMPTY_PARENT_LIST ((struct commit_list *)-1)
|
|
|
|
|
|
|
|
/*
|
|
|
|
* You may only call save_parents() once per commit (this is checked
|
|
|
|
* for non-root commits).
|
|
|
|
*/
|
|
|
|
static void save_parents(struct rev_info *revs, struct commit *commit)
|
|
|
|
{
|
|
|
|
struct commit_list **pp;
|
|
|
|
|
|
|
|
if (!revs->saved_parents_slab) {
|
|
|
|
revs->saved_parents_slab = xmalloc(sizeof(struct saved_parents));
|
|
|
|
init_saved_parents(revs->saved_parents_slab);
|
|
|
|
}
|
|
|
|
|
|
|
|
pp = saved_parents_at(revs->saved_parents_slab, commit);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When walking with reflogs, we may visit the same commit
|
|
|
|
* several times: once for each appearance in the reflog.
|
|
|
|
*
|
|
|
|
* In this case, save_parents() will be called multiple times.
|
|
|
|
* We want to keep only the first set of parents. We need to
|
|
|
|
* store a sentinel value for an empty (i.e., NULL) parent
|
|
|
|
* list to distinguish it from a not-yet-saved list, however.
|
|
|
|
*/
|
|
|
|
if (*pp)
|
|
|
|
return;
|
|
|
|
if (commit->parents)
|
|
|
|
*pp = copy_commit_list(commit->parents);
|
|
|
|
else
|
|
|
|
*pp = EMPTY_PARENT_LIST;
|
|
|
|
}
|
|
|
|
|
2024-09-30 17:14:06 +08:00
|
|
|
static void free_saved_parent(struct commit_list **parents)
|
|
|
|
{
|
|
|
|
if (*parents != EMPTY_PARENT_LIST)
|
|
|
|
free_commit_list(*parents);
|
|
|
|
}
|
|
|
|
|
2015-01-15 06:49:24 +08:00
|
|
|
static void free_saved_parents(struct rev_info *revs)
|
|
|
|
{
|
2024-09-30 17:14:06 +08:00
|
|
|
if (!revs->saved_parents_slab)
|
|
|
|
return;
|
|
|
|
deep_clear_saved_parents(revs->saved_parents_slab, free_saved_parent);
|
|
|
|
FREE_AND_NULL(revs->saved_parents_slab);
|
2015-01-15 06:49:24 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
struct commit_list *get_saved_parents(struct rev_info *revs, const struct commit *commit)
|
|
|
|
{
|
|
|
|
struct commit_list *parents;
|
|
|
|
|
|
|
|
if (!revs->saved_parents_slab)
|
|
|
|
return commit->parents;
|
|
|
|
|
|
|
|
parents = *saved_parents_at(revs->saved_parents_slab, commit);
|
|
|
|
if (parents == EMPTY_PARENT_LIST)
|
|
|
|
return NULL;
|
|
|
|
return parents;
|
|
|
|
}
|
|
|
|
|
2009-08-19 10:34:33 +08:00
|
|
|
enum commit_action simplify_commit(struct rev_info *revs, struct commit *commit)
|
|
|
|
{
|
|
|
|
enum commit_action action = get_commit_action(revs, commit);
|
|
|
|
|
|
|
|
if (action == commit_show &&
|
|
|
|
revs->prune && revs->dense && want_ancestry(revs)) {
|
log: use true parents for diff even when rewriting
When using pathspec filtering in combination with diff-based log
output, parent simplification happens before the diff is computed.
The diff is therefore against the *simplified* parents.
This works okay, arguably by accident, in the normal case:
simplification reduces to one parent as long as the commit is TREESAME
to it. So the simplified parent of any given commit must have the
same tree contents on the filtered paths as its true (unfiltered)
parent.
However, --full-diff breaks this guarantee, and indeed gives pretty
spectacular results when comparing the output of
git log --graph --stat ...
git log --graph --full-diff --stat ...
(--graph internally kicks in parent simplification, much like
--parents).
To fix it, store a copy of the parent list before simplification (in a
slab) whenever --full-diff is in effect. Then use the stored parents
instead of the simplified ones in the commit display code paths. The
latter do not actually check for --full-diff to avoid duplicated code;
they just grab the original parents if save_parents() has not been
called for this revision walk.
For ordinary commits it should be obvious that this is the right thing
to do.
Merge commits are a bit subtle. Observe that with default
simplification, merge simplification is an all-or-nothing decision:
either the merge is TREESAME to one parent and disappears, or it is
different from all parents and the parent list remains intact.
Redundant parents are not pruned, so the existing code also shows them
as a merge.
So if we do show a merge commit, the parent list just consists of the
rewrite result on each parent. Running, e.g., --cc on this in
--full-diff mode is not very useful: if any commits were skipped, some
hunks will disagree with all sides of the merge (with one side,
because commits were skipped; with the others, because they didn't
have those changes in the first place). This triggers --cc showing
these hunks spuriously.
Therefore I believe that even for merge commits it is better to show
the diffs wrt. the original parents.
Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-01 04:13:20 +08:00
|
|
|
/*
|
|
|
|
* --full-diff on simplified parents is no good: it
|
|
|
|
* will show spurious changes from the commits that
|
|
|
|
* were elided. So we save the parents on the side
|
|
|
|
* when --full-diff is in effect.
|
|
|
|
*/
|
|
|
|
if (revs->full_diff)
|
|
|
|
save_parents(revs, commit);
|
2013-03-29 00:47:31 +08:00
|
|
|
if (rewrite_parents(revs, commit, rewrite_one) < 0)
|
2009-08-19 10:34:33 +08:00
|
|
|
return commit_error;
|
|
|
|
}
|
|
|
|
return action;
|
|
|
|
}
|
|
|
|
|
2014-03-25 21:23:27 +08:00
|
|
|
static void track_linear(struct rev_info *revs, struct commit *commit)
|
|
|
|
{
|
|
|
|
if (revs->track_first_time) {
|
|
|
|
revs->linear = 1;
|
|
|
|
revs->track_first_time = 0;
|
|
|
|
} else {
|
|
|
|
struct commit_list *p;
|
|
|
|
for (p = revs->previous_parents; p; p = p->next)
|
|
|
|
if (p->item == NULL || /* first commit */
|
convert "oidcmp() == 0" to oideq()
Using the more restrictive oideq() should, in the long run,
give the compiler more opportunities to optimize these
callsites. For now, this conversion should be a complete
noop with respect to the generated code.
The result is also perhaps a little more readable, as it
avoids the "zero is equal" idiom. Since it's so prevalent in
C, I think seasoned programmers tend not to even notice it
anymore, but it can sometimes make for awkward double
negations (e.g., we can drop a few !!oidcmp() instances
here).
This patch was generated almost entirely by the included
coccinelle patch. This mechanical conversion should be
completely safe, because we check explicitly for cases where
oidcmp() is compared to 0, which is what oideq() is doing
under the hood. Note that we don't have to catch "!oidcmp()"
separately; coccinelle's standard isomorphisms make sure the
two are treated equivalently.
I say "almost" because I did hand-edit the coccinelle output
to fix up a few style violations (it mostly keeps the
original formatting, but sometimes unwraps long lines).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-29 05:22:40 +08:00
|
|
|
oideq(&p->item->object.oid, &commit->object.oid))
|
2014-03-25 21:23:27 +08:00
|
|
|
break;
|
|
|
|
revs->linear = p != NULL;
|
|
|
|
}
|
|
|
|
if (revs->reverse) {
|
|
|
|
if (revs->linear)
|
|
|
|
commit->object.flags |= TRACK_LINEAR;
|
|
|
|
}
|
|
|
|
free_commit_list(revs->previous_parents);
|
|
|
|
revs->previous_parents = copy_commit_list(commit->parents);
|
|
|
|
}
|
|
|
|
|
2006-12-20 10:25:32 +08:00
|
|
|
static struct commit *get_revision_1(struct rev_info *revs)
|
2006-03-01 03:24:00 +08:00
|
|
|
{
|
2017-07-07 17:07:58 +08:00
|
|
|
while (1) {
|
reflog-walk: stop using fake parents
The reflog-walk system works by putting a ref's tip into the
pending queue, and then "traversing" the reflog by
pretending that the parent of each commit is the previous
reflog entry.
This causes a number of user-visible oddities, as documented
in t1414 (and the commit message which introduced it). We
can fix all of them in one go by replacing the fake-reflog
system with a much simpler one: just keeping a list of
reflogs to show, and walking through them entry by entry.
The implementation is fairly straight-forward, but there are
a few items to note:
1. We obviously must skip calling add_parents_to_list()
when we are traversing reflogs, since we do not want to
walk the original parents at all. As a result, we must call
try_to_simplify_commit() ourselves.
There are other parts of add_parents_to_list() we skip,
as well, but none of them should matter for a reflog
traversal:
- We do not allow UNINTERESTING commits, nor
symmetric ranges (and we bail when these are used
with "-g").
- Using --source makes no sense, since we aren't
traversing. The reflog selector shows the same
information with more detail.
- Using --first-parent is still sensible, since you
may want to see the first-parent diff for each
entry. But since we're not traversing, we don't
need to cull the parent list here.
2. Since we now just walk the reflog entries themselves,
rather than starting with the ref tip, we now look at
the "new" field of each entry rather than the "old"
(i.e., we are showing entries, not faking parents).
This removes all of the tricky logic around skipping
past root commits.
But note that we have no way to show an entry with the
null sha1 in its "new" field (because such a commit
obviously does not exist). Normally this would not
happen, since we delete reflogs along with refs, but
there is one special case. When we rename the currently
checked out branch, we write two reflog entries into
the HEAD log: one where the commit goes away, and
another where it comes back.
Prior to this commit, we show both entries with
identical reflog messages. After this commit, we show
only the "comes back" entry. See the update in t3200
which demonstrates this.
Arguably either is fine, as the whole double-entry
thing is a bit hacky in the first place. And until a
recent fix, we truncated the traversal in such a case
anyway, which was _definitely_ wrong.
3. We show individual reflogs in order, but choose which
reflog to show at each stage based on which has the
most recent timestamp. This interleaves the output
from multiple reflogs based on date order, which is
probably what you'd want with limiting like "-n 30".
Note that the implementation aims for simplicity. It
does a linear walk over the reflog queue for each
commit it pulls, which may perform badly if you
interleave an enormous number of reflogs. That seems
like an unlikely use case; if we did want to handle it,
we could probably keep a priority queue of reflogs,
ordered by the timestamp of their current tip entry.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-07 17:14:07 +08:00
|
|
|
struct commit *commit;
|
2006-03-01 03:24:00 +08:00
|
|
|
|
reflog-walk: stop using fake parents
The reflog-walk system works by putting a ref's tip into the
pending queue, and then "traversing" the reflog by
pretending that the parent of each commit is the previous
reflog entry.
This causes a number of user-visible oddities, as documented
in t1414 (and the commit message which introduced it). We
can fix all of them in one go by replacing the fake-reflog
system with a much simpler one: just keeping a list of
reflogs to show, and walking through them entry by entry.
The implementation is fairly straight-forward, but there are
a few items to note:
1. We obviously must skip calling add_parents_to_list()
when we are traversing reflogs, since we do not want to
walk the original parents at all. As a result, we must call
try_to_simplify_commit() ourselves.
There are other parts of add_parents_to_list() we skip,
as well, but none of them should matter for a reflog
traversal:
- We do not allow UNINTERESTING commits, nor
symmetric ranges (and we bail when these are used
with "-g").
- Using --source makes no sense, since we aren't
traversing. The reflog selector shows the same
information with more detail.
- Using --first-parent is still sensible, since you
may want to see the first-parent diff for each
entry. But since we're not traversing, we don't
need to cull the parent list here.
2. Since we now just walk the reflog entries themselves,
rather than starting with the ref tip, we now look at
the "new" field of each entry rather than the "old"
(i.e., we are showing entries, not faking parents).
This removes all of the tricky logic around skipping
past root commits.
But note that we have no way to show an entry with the
null sha1 in its "new" field (because such a commit
obviously does not exist). Normally this would not
happen, since we delete reflogs along with refs, but
there is one special case. When we rename the currently
checked out branch, we write two reflog entries into
the HEAD log: one where the commit goes away, and
another where it comes back.
Prior to this commit, we show both entries with
identical reflog messages. After this commit, we show
only the "comes back" entry. See the update in t3200
which demonstrates this.
Arguably either is fine, as the whole double-entry
thing is a bit hacky in the first place. And until a
recent fix, we truncated the traversal in such a case
anyway, which was _definitely_ wrong.
3. We show individual reflogs in order, but choose which
reflog to show at each stage based on which has the
most recent timestamp. This interleaves the output
from multiple reflogs based on date order, which is
probably what you'd want with limiting like "-n 30".
Note that the implementation aims for simplicity. It
does a linear walk over the reflog queue for each
commit it pulls, which may perform badly if you
interleave an enormous number of reflogs. That seems
like an unlikely use case; if we did want to handle it,
we could probably keep a priority queue of reflogs,
ordered by the timestamp of their current tip entry.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-07 17:14:07 +08:00
|
|
|
if (revs->reflog_info)
|
|
|
|
commit = next_reflog_entry(revs->reflog_info);
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
else if (revs->topo_walk_info)
|
|
|
|
commit = next_topo_commit(revs);
|
reflog-walk: stop using fake parents
The reflog-walk system works by putting a ref's tip into the
pending queue, and then "traversing" the reflog by
pretending that the parent of each commit is the previous
reflog entry.
This causes a number of user-visible oddities, as documented
in t1414 (and the commit message which introduced it). We
can fix all of them in one go by replacing the fake-reflog
system with a much simpler one: just keeping a list of
reflogs to show, and walking through them entry by entry.
The implementation is fairly straight-forward, but there are
a few items to note:
1. We obviously must skip calling add_parents_to_list()
when we are traversing reflogs, since we do not want to
walk the original parents at all. As a result, we must call
try_to_simplify_commit() ourselves.
There are other parts of add_parents_to_list() we skip,
as well, but none of them should matter for a reflog
traversal:
- We do not allow UNINTERESTING commits, nor
symmetric ranges (and we bail when these are used
with "-g").
- Using --source makes no sense, since we aren't
traversing. The reflog selector shows the same
information with more detail.
- Using --first-parent is still sensible, since you
may want to see the first-parent diff for each
entry. But since we're not traversing, we don't
need to cull the parent list here.
2. Since we now just walk the reflog entries themselves,
rather than starting with the ref tip, we now look at
the "new" field of each entry rather than the "old"
(i.e., we are showing entries, not faking parents).
This removes all of the tricky logic around skipping
past root commits.
But note that we have no way to show an entry with the
null sha1 in its "new" field (because such a commit
obviously does not exist). Normally this would not
happen, since we delete reflogs along with refs, but
there is one special case. When we rename the currently
checked out branch, we write two reflog entries into
the HEAD log: one where the commit goes away, and
another where it comes back.
Prior to this commit, we show both entries with
identical reflog messages. After this commit, we show
only the "comes back" entry. See the update in t3200
which demonstrates this.
Arguably either is fine, as the whole double-entry
thing is a bit hacky in the first place. And until a
recent fix, we truncated the traversal in such a case
anyway, which was _definitely_ wrong.
3. We show individual reflogs in order, but choose which
reflog to show at each stage based on which has the
most recent timestamp. This interleaves the output
from multiple reflogs based on date order, which is
probably what you'd want with limiting like "-n 30".
Note that the implementation aims for simplicity. It
does a linear walk over the reflog queue for each
commit it pulls, which may perform badly if you
interleave an enormous number of reflogs. That seems
like an unlikely use case; if we did want to handle it,
we could probably keep a priority queue of reflogs,
ordered by the timestamp of their current tip entry.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-07 17:14:07 +08:00
|
|
|
else
|
|
|
|
commit = pop_commit(&revs->commits);
|
Make path-limiting be incremental when possible.
This makes git-rev-list able to do path-limiting without having to parse
all of history before it starts showing the results.
This makes things like "git log -- pathname" much more pleasant to use.
This is actually a pretty small patch, and the biggest part of it is
purely cleanups (turning the "goto next" statements into "continue"), but
it's conceptually a lot bigger than it looks.
What it does is that if you do a path-limited revision list, and you do
_not_ ask for pseudo-parenthood information, it won't do all the
path-limiting up-front, but instead do it incrementally in
"get_revision()".
This is an absolutely huge deal for anything like "git log -- <pathname>",
but also for some things that we don't do yet - like the "find where
things changed" logic I've described elsewhere, where we want to find the
previous revision that changed a file.
The reason I put "RFC" in the subject line is that while I've validated it
various ways, like doing
git-rev-list HEAD -- drivers/char/ | md5sum
before-and-after on the kernel archive, it's "git-rev-list" after all. In
other words, it's that really really subtle and complex central piece of
software. So while I think this is important and should go in asap, I also
think it should get lots of testing and eyeballs looking at the code.
Btw, don't even bother testing this with the git archive. git itself is so
small that parsing the whole revision history for it takes about a second
even with path limiting. The thing that _really_ shows this off is doing
git log drivers/
on the kernel archive, or even better, on the _historic_ kernel archive.
With this change, the response is instantaneous (although seeking to the
end of the result will obviously take as long as it ever did). Before this
change, the command would think about the result for tens of seconds - or
even minutes, in the case of the bigger old kernel archive - before
starting to output the results.
NOTE NOTE NOTE! Using path limiting with things like "gitk", which uses
the "--parents" flag to actually generate a pseudo-history of the
resulting commits won't actually see the improvement in interactivity,
since that forces git-rev-list to do the whole-history thing after all.
MAYBE we can fix that too at some point, but I won't promise anything.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-31 09:05:25 +08:00
|
|
|
|
2017-07-07 17:07:58 +08:00
|
|
|
if (!commit)
|
|
|
|
return NULL;
|
|
|
|
|
reflog-walk: stop using fake parents
The reflog-walk system works by putting a ref's tip into the
pending queue, and then "traversing" the reflog by
pretending that the parent of each commit is the previous
reflog entry.
This causes a number of user-visible oddities, as documented
in t1414 (and the commit message which introduced it). We
can fix all of them in one go by replacing the fake-reflog
system with a much simpler one: just keeping a list of
reflogs to show, and walking through them entry by entry.
The implementation is fairly straight-forward, but there are
a few items to note:
1. We obviously must skip calling add_parents_to_list()
when we are traversing reflogs, since we do not want to
walk the original parents at all. As a result, we must call
try_to_simplify_commit() ourselves.
There are other parts of add_parents_to_list() we skip,
as well, but none of them should matter for a reflog
traversal:
- We do not allow UNINTERESTING commits, nor
symmetric ranges (and we bail when these are used
with "-g").
- Using --source makes no sense, since we aren't
traversing. The reflog selector shows the same
information with more detail.
- Using --first-parent is still sensible, since you
may want to see the first-parent diff for each
entry. But since we're not traversing, we don't
need to cull the parent list here.
2. Since we now just walk the reflog entries themselves,
rather than starting with the ref tip, we now look at
the "new" field of each entry rather than the "old"
(i.e., we are showing entries, not faking parents).
This removes all of the tricky logic around skipping
past root commits.
But note that we have no way to show an entry with the
null sha1 in its "new" field (because such a commit
obviously does not exist). Normally this would not
happen, since we delete reflogs along with refs, but
there is one special case. When we rename the currently
checked out branch, we write two reflog entries into
the HEAD log: one where the commit goes away, and
another where it comes back.
Prior to this commit, we show both entries with
identical reflog messages. After this commit, we show
only the "comes back" entry. See the update in t3200
which demonstrates this.
Arguably either is fine, as the whole double-entry
thing is a bit hacky in the first place. And until a
recent fix, we truncated the traversal in such a case
anyway, which was _definitely_ wrong.
3. We show individual reflogs in order, but choose which
reflog to show at each stage based on which has the
most recent timestamp. This interleaves the output
from multiple reflogs based on date order, which is
probably what you'd want with limiting like "-n 30".
Note that the implementation aims for simplicity. It
does a linear walk over the reflog queue for each
commit it pulls, which may perform badly if you
interleave an enormous number of reflogs. That seems
like an unlikely use case; if we did want to handle it,
we could probably keep a priority queue of reflogs,
ordered by the timestamp of their current tip entry.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-07 17:14:07 +08:00
|
|
|
if (revs->reflog_info)
|
reflogs: clear flags properly in corner case
The reflog-walking mechanism is based on the regular
revision traversal. We just rewrite the parents of each
commit in fake_reflog_parent to point to the commit in the
next reflog entry instead of the real parents.
However, the regular revision traversal tries not to show
the same commit twice, and so sets the SHOWN flag on each
commit it shows. In a reflog, however, we may want to see
the same commit more than once if it appears in the reflog
multiple times (which easily happens, for example, if you do
a reset to a prior state).
The fake_reflog_parent function takes care of this by
clearing flags, including SHOWN. Unfortunately, it does so
at the very end of the function, and it is possible to
return early from the function if there is no fake parent to
set up (e.g., because we are at the very first reflog entry
on the branch). In such a case the flag is not cleared, and
the entry is skipped by the revision traversal machinery as
already shown.
You can see this by walking the log of a ref which is set to
its very first commit more than once (the test below shows
such a situation). In this case the reflog walk will fail to
show the entry for the initial creation of the ref.
We don't want to simply move the flag-clearing to the top of
the function; we want to make sure flags set during the
fake-parent installation are also cleared. Instead, let's
hoist the flag-clearing out of the fake_reflog_parent
function entirely. It's not really about fake parents
anyway, and the only caller is the get_revision machinery.
Reported-by: Martin von Zweigbergk <martin.von.zweigbergk@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-11-22 12:42:53 +08:00
|
|
|
commit->object.flags &= ~(ADDED | SEEN | SHOWN);
|
2007-01-11 18:47:48 +08:00
|
|
|
|
Make path-limiting be incremental when possible.
This makes git-rev-list able to do path-limiting without having to parse
all of history before it starts showing the results.
This makes things like "git log -- pathname" much more pleasant to use.
This is actually a pretty small patch, and the biggest part of it is
purely cleanups (turning the "goto next" statements into "continue"), but
it's conceptually a lot bigger than it looks.
What it does is that if you do a path-limited revision list, and you do
_not_ ask for pseudo-parenthood information, it won't do all the
path-limiting up-front, but instead do it incrementally in
"get_revision()".
This is an absolutely huge deal for anything like "git log -- <pathname>",
but also for some things that we don't do yet - like the "find where
things changed" logic I've described elsewhere, where we want to find the
previous revision that changed a file.
The reason I put "RFC" in the subject line is that while I've validated it
various ways, like doing
git-rev-list HEAD -- drivers/char/ | md5sum
before-and-after on the kernel archive, it's "git-rev-list" after all. In
other words, it's that really really subtle and complex central piece of
software. So while I think this is important and should go in asap, I also
think it should get lots of testing and eyeballs looking at the code.
Btw, don't even bother testing this with the git archive. git itself is so
small that parsing the whole revision history for it takes about a second
even with path limiting. The thing that _really_ shows this off is doing
git log drivers/
on the kernel archive, or even better, on the _historic_ kernel archive.
With this change, the response is instantaneous (although seeking to the
end of the result will obviously take as long as it ever did). Before this
change, the command would think about the result for tens of seconds - or
even minutes, in the case of the bigger old kernel archive - before
starting to output the results.
NOTE NOTE NOTE! Using path limiting with things like "gitk", which uses
the "--parents" flag to actually generate a pseudo-history of the
resulting commits won't actually see the improvement in interactivity,
since that forces git-rev-list to do the whole-history thing after all.
MAYBE we can fix that too at some point, but I won't promise anything.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-31 09:05:25 +08:00
|
|
|
/*
|
|
|
|
* If we haven't done the list limiting, we need to look at
|
2006-04-02 08:35:06 +08:00
|
|
|
* the parents here. We also need to do the date-based limiting
|
|
|
|
* that we'd otherwise have done in limit_list().
|
Make path-limiting be incremental when possible.
This makes git-rev-list able to do path-limiting without having to parse
all of history before it starts showing the results.
This makes things like "git log -- pathname" much more pleasant to use.
This is actually a pretty small patch, and the biggest part of it is
purely cleanups (turning the "goto next" statements into "continue"), but
it's conceptually a lot bigger than it looks.
What it does is that if you do a path-limited revision list, and you do
_not_ ask for pseudo-parenthood information, it won't do all the
path-limiting up-front, but instead do it incrementally in
"get_revision()".
This is an absolutely huge deal for anything like "git log -- <pathname>",
but also for some things that we don't do yet - like the "find where
things changed" logic I've described elsewhere, where we want to find the
previous revision that changed a file.
The reason I put "RFC" in the subject line is that while I've validated it
various ways, like doing
git-rev-list HEAD -- drivers/char/ | md5sum
before-and-after on the kernel archive, it's "git-rev-list" after all. In
other words, it's that really really subtle and complex central piece of
software. So while I think this is important and should go in asap, I also
think it should get lots of testing and eyeballs looking at the code.
Btw, don't even bother testing this with the git archive. git itself is so
small that parsing the whole revision history for it takes about a second
even with path limiting. The thing that _really_ shows this off is doing
git log drivers/
on the kernel archive, or even better, on the _historic_ kernel archive.
With this change, the response is instantaneous (although seeking to the
end of the result will obviously take as long as it ever did). Before this
change, the command would think about the result for tens of seconds - or
even minutes, in the case of the bigger old kernel archive - before
starting to output the results.
NOTE NOTE NOTE! Using path limiting with things like "gitk", which uses
the "--parents" flag to actually generate a pseudo-history of the
resulting commits won't actually see the improvement in interactivity,
since that forces git-rev-list to do the whole-history thing after all.
MAYBE we can fix that too at some point, but I won't promise anything.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-31 09:05:25 +08:00
|
|
|
*/
|
2006-04-02 08:35:06 +08:00
|
|
|
if (!revs->limited) {
|
2006-10-31 09:37:49 +08:00
|
|
|
if (revs->max_age != -1 &&
|
2017-07-07 17:16:21 +08:00
|
|
|
comparison_date(revs, commit) < revs->max_age)
|
revision walker: Fix --boundary when limited
This cleans up the boundary processing in the commit walker. It
- rips out the boundary logic from the commit walker. Placing
"negative" commits in the revs->commits list was Ok if all we
cared about "boundary" was the UNINTERESTING limiting case,
but conceptually it was wrong.
- makes get_revision_1() function to walk the commits and return
the results as if there is no funny postprocessing flags such
as --reverse, --skip nor --max-count.
- makes get_revision() function the postprocessing phase:
If reverse is given, wait for get_revision_1() to give
everything that it would normally give, and then reverse it
before consuming.
If skip is given, skip that many before going further.
If max is given, stop when we gave out that many.
Now that we are about to return one positive commit, mark
the parents of that commit to be potential boundaries
before returning, iff we are doing the boundary processing.
Return the commit.
- After get_revision() finishes giving out all the positive
commits, if we are doing the boundary processing, we look at
the parents that we marked as potential boundaries earlier,
see if they are really boundaries, and give them out.
It loses more code than it adds, even when the new gc_boundary()
function, which is purely for early optimization, is counted.
Note that this patch is purely for eyeballing and discussion
only. It breaks git-bundle's verify logic because the logic
does not use BOUNDARY_SHOW flag for its internal computation
anymore. After we correct it not to attempt to affect the
boundary processing by setting the BOUNDARY_SHOW flag, we can
remove BOUNDARY_SHOW from revision.h and use that bit assignment
for the new CHILD_SHOWN flag.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-06 05:10:06 +08:00
|
|
|
continue;
|
reflog-walk: stop using fake parents
The reflog-walk system works by putting a ref's tip into the
pending queue, and then "traversing" the reflog by
pretending that the parent of each commit is the previous
reflog entry.
This causes a number of user-visible oddities, as documented
in t1414 (and the commit message which introduced it). We
can fix all of them in one go by replacing the fake-reflog
system with a much simpler one: just keeping a list of
reflogs to show, and walking through them entry by entry.
The implementation is fairly straight-forward, but there are
a few items to note:
1. We obviously must skip calling add_parents_to_list()
when we are traversing reflogs, since we do not want to
walk the original parents at all. As a result, we must call
try_to_simplify_commit() ourselves.
There are other parts of add_parents_to_list() we skip,
as well, but none of them should matter for a reflog
traversal:
- We do not allow UNINTERESTING commits, nor
symmetric ranges (and we bail when these are used
with "-g").
- Using --source makes no sense, since we aren't
traversing. The reflog selector shows the same
information with more detail.
- Using --first-parent is still sensible, since you
may want to see the first-parent diff for each
entry. But since we're not traversing, we don't
need to cull the parent list here.
2. Since we now just walk the reflog entries themselves,
rather than starting with the ref tip, we now look at
the "new" field of each entry rather than the "old"
(i.e., we are showing entries, not faking parents).
This removes all of the tricky logic around skipping
past root commits.
But note that we have no way to show an entry with the
null sha1 in its "new" field (because such a commit
obviously does not exist). Normally this would not
happen, since we delete reflogs along with refs, but
there is one special case. When we rename the currently
checked out branch, we write two reflog entries into
the HEAD log: one where the commit goes away, and
another where it comes back.
Prior to this commit, we show both entries with
identical reflog messages. After this commit, we show
only the "comes back" entry. See the update in t3200
which demonstrates this.
Arguably either is fine, as the whole double-entry
thing is a bit hacky in the first place. And until a
recent fix, we truncated the traversal in such a case
anyway, which was _definitely_ wrong.
3. We show individual reflogs in order, but choose which
reflog to show at each stage based on which has the
most recent timestamp. This interleaves the output
from multiple reflogs based on date order, which is
probably what you'd want with limiting like "-n 30".
Note that the implementation aims for simplicity. It
does a linear walk over the reflog queue for each
commit it pulls, which may perform badly if you
interleave an enormous number of reflogs. That seems
like an unlikely use case; if we did want to handle it,
we could probably keep a priority queue of reflogs,
ordered by the timestamp of their current tip entry.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-07 17:14:07 +08:00
|
|
|
|
|
|
|
if (revs->reflog_info)
|
|
|
|
try_to_simplify_commit(revs, commit);
|
revision.c: begin refactoring --topo-order logic
When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():
* revs->limited implies we run limit_list() to walk the entire
reachable set. There are some short-cuts here, such as if we
perform a range query like 'git rev-list COMPARE..HEAD' and we
can stop limit_list() when all queued commits are uninteresting.
* revs->topo_order implies we run sort_in_topological_order(). See
the implementation of that method in commit.c. It implies that
the full set of commits to order is in the given commit_list.
These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.
If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.
In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.
When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.
In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:
1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
struct. We use the presence of this struct as a signal to use the
new methods during our walk. In this patch, this method simply
calls limit_list() and sort_in_topological_order(). In the future,
this method will set up a new data structure to perform that logic
in-line.
2. next_topo_commit() provides get_revision_1() with the next topo-
ordered commit in the list. Currently, this simply pops the commit
from revs->commits.
3. expand_topo_walk() provides get_revision_1() with a way to signal
walking beyond the latest commit. Currently, this calls
add_parents_to_list() exactly like the old logic.
While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-01 21:46:20 +08:00
|
|
|
else if (revs->topo_walk_info)
|
|
|
|
expand_topo_walk(revs, commit);
|
2018-11-01 21:46:21 +08:00
|
|
|
else if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
|
add `ignore_missing_links` mode to revwalk
When pack-objects is computing the reachability bitmap to
serve a fetch request, it can erroneously die() if some of
the UNINTERESTING objects are not present. Upload-pack
throws away HAVE lines from the client for objects we do not
have, but we may have a tip object without all of its
ancestors (e.g., if the tip is no longer reachable and was
new enough to survive a `git prune`, but some of its
reachable objects did get pruned).
In the non-bitmap case, we do a revision walk with the HAVE
objects marked as UNINTERESTING. The revision walker
explicitly ignores errors in accessing UNINTERESTING commits
to handle this case (and we do not bother looking at
UNINTERESTING trees or blobs at all).
When we have bitmaps, however, the process is quite
different. The bitmap index for a pack-objects run is
calculated in two separate steps:
First, we perform an extensive walk from all the HAVEs to
find the full set of objects reachable from them. This walk
is usually optimized away because we are expected to hit an
object with a bitmap during the traversal, which allows us
to terminate early.
Secondly, we perform an extensive walk from all the WANTs,
which usually also terminates early because we hit a commit
with an existing bitmap.
Once we have the resulting bitmaps from the two walks, we
AND-NOT them together to obtain the resulting set of objects
we need to pack.
When we are walking the HAVE objects, the revision walker
does not know that we are walking it only to mark the
results as uninteresting. We strip out the UNINTERESTING flag,
because those objects _are_ interesting to us during the
first walk. We want to keep going to get a complete set of
reachable objects if we can.
We need some way to tell the revision walker that it's OK to
silently truncate the HAVE walk, just like it does for the
UNINTERESTING case. This patch introduces a new
`ignore_missing_links` flag to the `rev_info` struct, which
we set only for the HAVE walk.
It also adds tests to cover UNINTERESTING objects missing
from several positions: a missing blob, a missing tree, and
a missing parent commit. The missing blob already worked (as
we do not care about its contents at all), but the other two
cases caused us to die().
Note that there are a few cases we do not need to test:
1. We do not need to test a missing tree, with the blob
still present. Without the tree that refers to it, we
would not know that the blob is relevant to our walk.
2. We do not need to test a tip commit that is missing.
Upload-pack omits these for us (and in fact, we
complain even in the non-bitmap case if it fails to do
so).
Reported-by: Siddharth Agarwal <sid0@fb.com>
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-03-28 18:00:43 +08:00
|
|
|
if (!revs->ignore_missing_links)
|
|
|
|
die("Failed to traverse parents of commit %s",
|
2015-11-10 10:22:28 +08:00
|
|
|
oid_to_hex(&commit->object.oid));
|
add `ignore_missing_links` mode to revwalk
When pack-objects is computing the reachability bitmap to
serve a fetch request, it can erroneously die() if some of
the UNINTERESTING objects are not present. Upload-pack
throws away HAVE lines from the client for objects we do not
have, but we may have a tip object without all of its
ancestors (e.g., if the tip is no longer reachable and was
new enough to survive a `git prune`, but some of its
reachable objects did get pruned).
In the non-bitmap case, we do a revision walk with the HAVE
objects marked as UNINTERESTING. The revision walker
explicitly ignores errors in accessing UNINTERESTING commits
to handle this case (and we do not bother looking at
UNINTERESTING trees or blobs at all).
When we have bitmaps, however, the process is quite
different. The bitmap index for a pack-objects run is
calculated in two separate steps:
First, we perform an extensive walk from all the HAVEs to
find the full set of objects reachable from them. This walk
is usually optimized away because we are expected to hit an
object with a bitmap during the traversal, which allows us
to terminate early.
Secondly, we perform an extensive walk from all the WANTs,
which usually also terminates early because we hit a commit
with an existing bitmap.
Once we have the resulting bitmaps from the two walks, we
AND-NOT them together to obtain the resulting set of objects
we need to pack.
When we are walking the HAVE objects, the revision walker
does not know that we are walking it only to mark the
results as uninteresting. We strip out the UNINTERESTING flag,
because those objects _are_ interesting to us during the
first walk. We want to keep going to get a complete set of
reachable objects if we can.
We need some way to tell the revision walker that it's OK to
silently truncate the HAVE walk, just like it does for the
UNINTERESTING case. This patch introduces a new
`ignore_missing_links` flag to the `rev_info` struct, which
we set only for the HAVE walk.
It also adds tests to cover UNINTERESTING objects missing
from several positions: a missing blob, a missing tree, and
a missing parent commit. The missing blob already worked (as
we do not care about its contents at all), but the other two
cases caused us to die().
Note that there are a few cases we do not need to test:
1. We do not need to test a missing tree, with the blob
still present. Without the tree that refers to it, we
would not know that the blob is relevant to our walk.
2. We do not need to test a tip commit that is missing.
Upload-pack omits these for us (and in fact, we
complain even in the non-bitmap case if it fails to do
so).
Reported-by: Siddharth Agarwal <sid0@fb.com>
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-03-28 18:00:43 +08:00
|
|
|
}
|
2006-04-02 08:35:06 +08:00
|
|
|
}
|
2006-10-31 09:37:49 +08:00
|
|
|
|
2007-11-05 04:12:05 +08:00
|
|
|
switch (simplify_commit(revs, commit)) {
|
|
|
|
case commit_ignore:
|
2006-09-18 06:43:40 +08:00
|
|
|
continue;
|
2007-11-05 04:12:05 +08:00
|
|
|
case commit_error:
|
2009-02-11 17:27:43 +08:00
|
|
|
die("Failed to simplify parents of commit %s",
|
2015-11-10 10:22:28 +08:00
|
|
|
oid_to_hex(&commit->object.oid));
|
2007-11-05 04:12:05 +08:00
|
|
|
default:
|
2014-03-25 21:23:27 +08:00
|
|
|
if (revs->track_linear)
|
|
|
|
track_linear(revs, commit);
|
2007-11-05 04:12:05 +08:00
|
|
|
return commit;
|
2006-03-28 15:58:34 +08:00
|
|
|
}
|
2017-07-07 17:07:58 +08:00
|
|
|
}
|
2006-03-01 07:07:20 +08:00
|
|
|
}
|
2006-12-20 10:25:32 +08:00
|
|
|
|
2013-05-25 17:08:09 +08:00
|
|
|
/*
|
|
|
|
* Return true for entries that have not yet been shown. (This is an
|
|
|
|
* object_array_each_func_t.)
|
|
|
|
*/
|
2023-02-24 14:39:15 +08:00
|
|
|
static int entry_unshown(struct object_array_entry *entry, void *cb_data UNUSED)
|
revision walker: Fix --boundary when limited
This cleans up the boundary processing in the commit walker. It
- rips out the boundary logic from the commit walker. Placing
"negative" commits in the revs->commits list was Ok if all we
cared about "boundary" was the UNINTERESTING limiting case,
but conceptually it was wrong.
- makes get_revision_1() function to walk the commits and return
the results as if there is no funny postprocessing flags such
as --reverse, --skip nor --max-count.
- makes get_revision() function the postprocessing phase:
If reverse is given, wait for get_revision_1() to give
everything that it would normally give, and then reverse it
before consuming.
If skip is given, skip that many before going further.
If max is given, stop when we gave out that many.
Now that we are about to return one positive commit, mark
the parents of that commit to be potential boundaries
before returning, iff we are doing the boundary processing.
Return the commit.
- After get_revision() finishes giving out all the positive
commits, if we are doing the boundary processing, we look at
the parents that we marked as potential boundaries earlier,
see if they are really boundaries, and give them out.
It loses more code than it adds, even when the new gc_boundary()
function, which is purely for early optimization, is counted.
Note that this patch is purely for eyeballing and discussion
only. It breaks git-bundle's verify logic because the logic
does not use BOUNDARY_SHOW flag for its internal computation
anymore. After we correct it not to attempt to affect the
boundary processing by setting the BOUNDARY_SHOW flag, we can
remove BOUNDARY_SHOW from revision.h and use that bit assignment
for the new CHILD_SHOWN flag.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-06 05:10:06 +08:00
|
|
|
{
|
2013-05-25 17:08:09 +08:00
|
|
|
return !(entry->item->flags & SHOWN);
|
|
|
|
}
|
revision walker: Fix --boundary when limited
This cleans up the boundary processing in the commit walker. It
- rips out the boundary logic from the commit walker. Placing
"negative" commits in the revs->commits list was Ok if all we
cared about "boundary" was the UNINTERESTING limiting case,
but conceptually it was wrong.
- makes get_revision_1() function to walk the commits and return
the results as if there is no funny postprocessing flags such
as --reverse, --skip nor --max-count.
- makes get_revision() function the postprocessing phase:
If reverse is given, wait for get_revision_1() to give
everything that it would normally give, and then reverse it
before consuming.
If skip is given, skip that many before going further.
If max is given, stop when we gave out that many.
Now that we are about to return one positive commit, mark
the parents of that commit to be potential boundaries
before returning, iff we are doing the boundary processing.
Return the commit.
- After get_revision() finishes giving out all the positive
commits, if we are doing the boundary processing, we look at
the parents that we marked as potential boundaries earlier,
see if they are really boundaries, and give them out.
It loses more code than it adds, even when the new gc_boundary()
function, which is purely for early optimization, is counted.
Note that this patch is purely for eyeballing and discussion
only. It breaks git-bundle's verify logic because the logic
does not use BOUNDARY_SHOW flag for its internal computation
anymore. After we correct it not to attempt to affect the
boundary processing by setting the BOUNDARY_SHOW flag, we can
remove BOUNDARY_SHOW from revision.h and use that bit assignment
for the new CHILD_SHOWN flag.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-06 05:10:06 +08:00
|
|
|
|
2013-05-25 17:08:09 +08:00
|
|
|
/*
|
|
|
|
* If array is on the verge of a realloc, garbage-collect any entries
|
|
|
|
* that have already been shown to try to free up some space.
|
|
|
|
*/
|
|
|
|
static void gc_boundary(struct object_array *array)
|
|
|
|
{
|
|
|
|
if (array->nr == array->alloc)
|
|
|
|
object_array_filter(array, entry_unshown, NULL);
|
revision walker: Fix --boundary when limited
This cleans up the boundary processing in the commit walker. It
- rips out the boundary logic from the commit walker. Placing
"negative" commits in the revs->commits list was Ok if all we
cared about "boundary" was the UNINTERESTING limiting case,
but conceptually it was wrong.
- makes get_revision_1() function to walk the commits and return
the results as if there is no funny postprocessing flags such
as --reverse, --skip nor --max-count.
- makes get_revision() function the postprocessing phase:
If reverse is given, wait for get_revision_1() to give
everything that it would normally give, and then reverse it
before consuming.
If skip is given, skip that many before going further.
If max is given, stop when we gave out that many.
Now that we are about to return one positive commit, mark
the parents of that commit to be potential boundaries
before returning, iff we are doing the boundary processing.
Return the commit.
- After get_revision() finishes giving out all the positive
commits, if we are doing the boundary processing, we look at
the parents that we marked as potential boundaries earlier,
see if they are really boundaries, and give them out.
It loses more code than it adds, even when the new gc_boundary()
function, which is purely for early optimization, is counted.
Note that this patch is purely for eyeballing and discussion
only. It breaks git-bundle's verify logic because the logic
does not use BOUNDARY_SHOW flag for its internal computation
anymore. After we correct it not to attempt to affect the
boundary processing by setting the BOUNDARY_SHOW flag, we can
remove BOUNDARY_SHOW from revision.h and use that bit assignment
for the new CHILD_SHOWN flag.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-06 05:10:06 +08:00
|
|
|
}
|
|
|
|
|
2008-05-25 07:02:05 +08:00
|
|
|
static void create_boundary_commit_list(struct rev_info *revs)
|
|
|
|
{
|
|
|
|
unsigned i;
|
|
|
|
struct commit *c;
|
|
|
|
struct object_array *array = &revs->boundary_commits;
|
|
|
|
struct object_array_entry *objects = array->objects;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If revs->commits is non-NULL at this point, an error occurred in
|
|
|
|
* get_revision_1(). Ignore the error and continue printing the
|
|
|
|
* boundary commits anyway. (This is what the code has always
|
|
|
|
* done.)
|
|
|
|
*/
|
2022-04-14 04:01:34 +08:00
|
|
|
free_commit_list(revs->commits);
|
|
|
|
revs->commits = NULL;
|
2008-05-25 07:02:05 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Put all of the actual boundary commits from revs->boundary_commits
|
|
|
|
* into revs->commits
|
|
|
|
*/
|
|
|
|
for (i = 0; i < array->nr; i++) {
|
|
|
|
c = (struct commit *)(objects[i].item);
|
|
|
|
if (!c)
|
|
|
|
continue;
|
|
|
|
if (!(c->object.flags & CHILD_SHOWN))
|
|
|
|
continue;
|
|
|
|
if (c->object.flags & (SHOWN | BOUNDARY))
|
|
|
|
continue;
|
|
|
|
c->object.flags |= BOUNDARY;
|
|
|
|
commit_list_insert(c, &revs->commits);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If revs->topo_order is set, sort the boundary commits
|
|
|
|
* in topological order
|
|
|
|
*/
|
toposort: rename "lifo" field
The primary invariant of sort_in_topological_order() is that a
parent commit is not emitted until all children of it are. When
traversing a forked history like this with "git log C E":
A----B----C
\
D----E
we ensure that A is emitted after all of B, C, D, and E are done, B
has to wait until C is done, and D has to wait until E is done.
In some applications, however, we would further want to control how
these child commits B, C, D and E on two parallel ancestry chains
are shown.
Most of the time, we would want to see C and B emitted together, and
then E and D, and finally A (i.e. the --topo-order output). The
"lifo" parameter of the sort_in_topological_order() function is used
to control this behaviour. We start the traversal by knowing two
commits, C and E. While keeping in mind that we also need to
inspect E later, we pick C first to inspect, and we notice and
record that B needs to be inspected. By structuring the "work to be
done" set as a LIFO stack, we ensure that B is inspected next,
before other in-flight commits we had known that we will need to
inspect, e.g. E.
When showing in --date-order, we would want to see commits ordered
by timestamps, i.e. show C, E, B and D in this order before showing
A, possibly mixing commits from two parallel histories together.
When "lifo" parameter is set to false, the function keeps the "work
to be done" set sorted in the date order to realize this semantics.
After inspecting C, we add B to the "work to be done" set, but the
next commit we inspect from the set is E which is newer than B.
The name "lifo", however, is too strongly tied to the way how the
function implements its behaviour, and does not describe what the
behaviour _means_.
Replace this field with an enum rev_sort_order, with two possible
values: REV_SORT_IN_GRAPH_ORDER and REV_SORT_BY_COMMIT_DATE, and
update the existing code. The mechanical replacement rule is:
"lifo == 0" is equivalent to "sort_order == REV_SORT_BY_COMMIT_DATE"
"lifo == 1" is equivalent to "sort_order == REV_SORT_IN_GRAPH_ORDER"
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-06-07 07:07:14 +08:00
|
|
|
sort_in_topological_order(&revs->commits, revs->sort_order);
|
2008-05-25 07:02:05 +08:00
|
|
|
}
|
|
|
|
|
2008-05-04 18:36:54 +08:00
|
|
|
static struct commit *get_revision_internal(struct rev_info *revs)
|
2006-12-20 10:25:32 +08:00
|
|
|
{
|
|
|
|
struct commit *c = NULL;
|
revision walker: Fix --boundary when limited
This cleans up the boundary processing in the commit walker. It
- rips out the boundary logic from the commit walker. Placing
"negative" commits in the revs->commits list was Ok if all we
cared about "boundary" was the UNINTERESTING limiting case,
but conceptually it was wrong.
- makes get_revision_1() function to walk the commits and return
the results as if there is no funny postprocessing flags such
as --reverse, --skip nor --max-count.
- makes get_revision() function the postprocessing phase:
If reverse is given, wait for get_revision_1() to give
everything that it would normally give, and then reverse it
before consuming.
If skip is given, skip that many before going further.
If max is given, stop when we gave out that many.
Now that we are about to return one positive commit, mark
the parents of that commit to be potential boundaries
before returning, iff we are doing the boundary processing.
Return the commit.
- After get_revision() finishes giving out all the positive
commits, if we are doing the boundary processing, we look at
the parents that we marked as potential boundaries earlier,
see if they are really boundaries, and give them out.
It loses more code than it adds, even when the new gc_boundary()
function, which is purely for early optimization, is counted.
Note that this patch is purely for eyeballing and discussion
only. It breaks git-bundle's verify logic because the logic
does not use BOUNDARY_SHOW flag for its internal computation
anymore. After we correct it not to attempt to affect the
boundary processing by setting the BOUNDARY_SHOW flag, we can
remove BOUNDARY_SHOW from revision.h and use that bit assignment
for the new CHILD_SHOWN flag.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-06 05:10:06 +08:00
|
|
|
struct commit_list *l;
|
|
|
|
|
|
|
|
if (revs->boundary == 2) {
|
2008-05-25 07:02:05 +08:00
|
|
|
/*
|
|
|
|
* All of the normal commits have already been returned,
|
|
|
|
* and we are now returning boundary commits.
|
|
|
|
* create_boundary_commit_list() has populated
|
|
|
|
* revs->commits with the remaining commits to return.
|
|
|
|
*/
|
|
|
|
c = pop_commit(&revs->commits);
|
|
|
|
if (c)
|
|
|
|
c->object.flags |= SHOWN;
|
2007-01-21 06:04:02 +08:00
|
|
|
return c;
|
|
|
|
}
|
|
|
|
|
revision walker: Fix --boundary when limited
This cleans up the boundary processing in the commit walker. It
- rips out the boundary logic from the commit walker. Placing
"negative" commits in the revs->commits list was Ok if all we
cared about "boundary" was the UNINTERESTING limiting case,
but conceptually it was wrong.
- makes get_revision_1() function to walk the commits and return
the results as if there is no funny postprocessing flags such
as --reverse, --skip nor --max-count.
- makes get_revision() function the postprocessing phase:
If reverse is given, wait for get_revision_1() to give
everything that it would normally give, and then reverse it
before consuming.
If skip is given, skip that many before going further.
If max is given, stop when we gave out that many.
Now that we are about to return one positive commit, mark
the parents of that commit to be potential boundaries
before returning, iff we are doing the boundary processing.
Return the commit.
- After get_revision() finishes giving out all the positive
commits, if we are doing the boundary processing, we look at
the parents that we marked as potential boundaries earlier,
see if they are really boundaries, and give them out.
It loses more code than it adds, even when the new gc_boundary()
function, which is purely for early optimization, is counted.
Note that this patch is purely for eyeballing and discussion
only. It breaks git-bundle's verify logic because the logic
does not use BOUNDARY_SHOW flag for its internal computation
anymore. After we correct it not to attempt to affect the
boundary processing by setting the BOUNDARY_SHOW flag, we can
remove BOUNDARY_SHOW from revision.h and use that bit assignment
for the new CHILD_SHOWN flag.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-06 05:10:06 +08:00
|
|
|
/*
|
revision: avoid work after --max-count is reached
During a revision traversal in which --max-count has been
specified, we decrement a counter for each revision returned
by get_revision. When it hits 0, we typically return NULL
(the exception being if we still have boundary commits to
show).
However, before we check the counter, we call get_revision_1
to get the next commit. This might involve looking at a
large number of commits if we have restricted the traversal
(e.g., we might traverse until we find the next commit whose
diff actually matches a pathspec).
There's no need to make this get_revision_1 call when our
counter runs out. If we are not in --boundary mode, we will
just throw away the result and immediately return NULL. If
we are in --boundary mode, then we will still throw away the
result, and then start showing the boundary commits.
However, as git_revision_1 does not impact the boundary
list, it should not have an impact.
In most cases, avoiding this work will not be especially
noticeable. However, in some cases, it can make a big
difference:
[before]
$ time git rev-list -1 origin Documentation/RelNotes/1.7.11.2.txt
8d141a1d562abb31f27f599dbf6e10a6c06ed73e
real 0m0.301s
user 0m0.280s
sys 0m0.016s
[after]
$ time git rev-list -1 origin Documentation/RelNotes/1.7.11.2.txt
8d141a1d562abb31f27f599dbf6e10a6c06ed73e
real 0m0.010s
user 0m0.008s
sys 0m0.000s
Note that the output is produced almost instantaneously in
the first case, and then git uselessly spends a long time
looking for the next commit to touch that file (but there
isn't one, and we traverse all the way down to the roots).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-07-13 15:50:23 +08:00
|
|
|
* If our max_count counter has reached zero, then we are done. We
|
|
|
|
* don't simply return NULL because we still might need to show
|
|
|
|
* boundary commits. But we want to avoid calling get_revision_1, which
|
|
|
|
* might do a considerable amount of work finding the next commit only
|
|
|
|
* for us to throw it away.
|
|
|
|
*
|
|
|
|
* If it is non-zero, then either we don't have a max_count at all
|
|
|
|
* (-1), or it is still counting, in which case we decrement.
|
revision walker: Fix --boundary when limited
This cleans up the boundary processing in the commit walker. It
- rips out the boundary logic from the commit walker. Placing
"negative" commits in the revs->commits list was Ok if all we
cared about "boundary" was the UNINTERESTING limiting case,
but conceptually it was wrong.
- makes get_revision_1() function to walk the commits and return
the results as if there is no funny postprocessing flags such
as --reverse, --skip nor --max-count.
- makes get_revision() function the postprocessing phase:
If reverse is given, wait for get_revision_1() to give
everything that it would normally give, and then reverse it
before consuming.
If skip is given, skip that many before going further.
If max is given, stop when we gave out that many.
Now that we are about to return one positive commit, mark
the parents of that commit to be potential boundaries
before returning, iff we are doing the boundary processing.
Return the commit.
- After get_revision() finishes giving out all the positive
commits, if we are doing the boundary processing, we look at
the parents that we marked as potential boundaries earlier,
see if they are really boundaries, and give them out.
It loses more code than it adds, even when the new gc_boundary()
function, which is purely for early optimization, is counted.
Note that this patch is purely for eyeballing and discussion
only. It breaks git-bundle's verify logic because the logic
does not use BOUNDARY_SHOW flag for its internal computation
anymore. After we correct it not to attempt to affect the
boundary processing by setting the BOUNDARY_SHOW flag, we can
remove BOUNDARY_SHOW from revision.h and use that bit assignment
for the new CHILD_SHOWN flag.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-06 05:10:06 +08:00
|
|
|
*/
|
revision: avoid work after --max-count is reached
During a revision traversal in which --max-count has been
specified, we decrement a counter for each revision returned
by get_revision. When it hits 0, we typically return NULL
(the exception being if we still have boundary commits to
show).
However, before we check the counter, we call get_revision_1
to get the next commit. This might involve looking at a
large number of commits if we have restricted the traversal
(e.g., we might traverse until we find the next commit whose
diff actually matches a pathspec).
There's no need to make this get_revision_1 call when our
counter runs out. If we are not in --boundary mode, we will
just throw away the result and immediately return NULL. If
we are in --boundary mode, then we will still throw away the
result, and then start showing the boundary commits.
However, as git_revision_1 does not impact the boundary
list, it should not have an impact.
In most cases, avoiding this work will not be especially
noticeable. However, in some cases, it can make a big
difference:
[before]
$ time git rev-list -1 origin Documentation/RelNotes/1.7.11.2.txt
8d141a1d562abb31f27f599dbf6e10a6c06ed73e
real 0m0.301s
user 0m0.280s
sys 0m0.016s
[after]
$ time git rev-list -1 origin Documentation/RelNotes/1.7.11.2.txt
8d141a1d562abb31f27f599dbf6e10a6c06ed73e
real 0m0.010s
user 0m0.008s
sys 0m0.000s
Note that the output is produced almost instantaneously in
the first case, and then git uselessly spends a long time
looking for the next commit to touch that file (but there
isn't one, and we traverse all the way down to the roots).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-07-13 15:50:23 +08:00
|
|
|
if (revs->max_count) {
|
|
|
|
c = get_revision_1(revs);
|
|
|
|
if (c) {
|
2013-10-31 17:25:43 +08:00
|
|
|
while (revs->skip_count > 0) {
|
revision: avoid work after --max-count is reached
During a revision traversal in which --max-count has been
specified, we decrement a counter for each revision returned
by get_revision. When it hits 0, we typically return NULL
(the exception being if we still have boundary commits to
show).
However, before we check the counter, we call get_revision_1
to get the next commit. This might involve looking at a
large number of commits if we have restricted the traversal
(e.g., we might traverse until we find the next commit whose
diff actually matches a pathspec).
There's no need to make this get_revision_1 call when our
counter runs out. If we are not in --boundary mode, we will
just throw away the result and immediately return NULL. If
we are in --boundary mode, then we will still throw away the
result, and then start showing the boundary commits.
However, as git_revision_1 does not impact the boundary
list, it should not have an impact.
In most cases, avoiding this work will not be especially
noticeable. However, in some cases, it can make a big
difference:
[before]
$ time git rev-list -1 origin Documentation/RelNotes/1.7.11.2.txt
8d141a1d562abb31f27f599dbf6e10a6c06ed73e
real 0m0.301s
user 0m0.280s
sys 0m0.016s
[after]
$ time git rev-list -1 origin Documentation/RelNotes/1.7.11.2.txt
8d141a1d562abb31f27f599dbf6e10a6c06ed73e
real 0m0.010s
user 0m0.008s
sys 0m0.000s
Note that the output is produced almost instantaneously in
the first case, and then git uselessly spends a long time
looking for the next commit to touch that file (but there
isn't one, and we traverse all the way down to the roots).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-07-13 15:50:23 +08:00
|
|
|
revs->skip_count--;
|
|
|
|
c = get_revision_1(revs);
|
|
|
|
if (!c)
|
|
|
|
break;
|
revision: free commit buffers for skipped commits
In git-log we leave the save_commit_buffer flag set to "1", which tells
the commit parsing code to store the object content after it has parsed
it to find parents, tree, etc. That lets us reuse the contents for
pretty-printing the commit in the output. And then after printing each
commit, we call free_commit_buffer(), since we don't need it anymore.
But some options may cause us to traverse commits which are not part of
the output. And so git-log does not see them at all, and doesn't free
them. One such case is something like:
git log -n 1000 --skip=1000000
which will churn through a million commits, before showing only a
thousand. We loop through these inside get_revision(), without freeing
the contents. As a result, we end up storing the object data for those
million commits simultaneously.
We should free the stored buffers (if any) for those commits as we skip
over them, which is what this patch does. Running the above command in
linux.git drops the peak heap usage from ~1.1GB to ~200MB, according to
valgrind/massif. (I thought we might get an even bigger improvement, but
the remaining memory is going to commit/tree structs, which we do hold
on to forever).
Note that this problem doesn't occur if:
- you're running a git-rev-list without a --format parameter; it turns
off save_commit_buffer by default, since it only output the object
id
- you've built a commit-graph file, since in that case we'd use the
optimized graph data instead of the initial parse, and then do a
lazy parse for commits we're actually going to output
There are probably some other option combinations that can likewise
end up with useless stored commit buffers. For example, if you ask for
"foo..bar", then we'll have to walk down to the merge base, and
everything on the "foo" side won't be shown. Tuning the "save" behavior
to handle that might be tricky (I guess maybe drop buffers for anything
we mark as UNINTERESTING?). And in the long run, the right solution here
is probably to make sure the commit-graph is built (since it fixes the
memory problem _and_ drastically reduces CPU usage).
But since this "--skip" case is an easy one-liner, it's worth fixing in
the meantime. It should be OK to make this call even if there is no
saved buffer (e.g., because save_commit_buffer=0, or because a
commit-graph was used), since it's O(1) to look up the buffer and is a
noop if it isn't present. I verified by running the above command after
"git commit-graph write --reachable", and it takes the same time with
and without this patch.
Reported-by: Yuri Karnilaev <karnilaev@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-31 04:53:31 +08:00
|
|
|
free_commit_buffer(revs->repo->parsed_objects, c);
|
revision: avoid work after --max-count is reached
During a revision traversal in which --max-count has been
specified, we decrement a counter for each revision returned
by get_revision. When it hits 0, we typically return NULL
(the exception being if we still have boundary commits to
show).
However, before we check the counter, we call get_revision_1
to get the next commit. This might involve looking at a
large number of commits if we have restricted the traversal
(e.g., we might traverse until we find the next commit whose
diff actually matches a pathspec).
There's no need to make this get_revision_1 call when our
counter runs out. If we are not in --boundary mode, we will
just throw away the result and immediately return NULL. If
we are in --boundary mode, then we will still throw away the
result, and then start showing the boundary commits.
However, as git_revision_1 does not impact the boundary
list, it should not have an impact.
In most cases, avoiding this work will not be especially
noticeable. However, in some cases, it can make a big
difference:
[before]
$ time git rev-list -1 origin Documentation/RelNotes/1.7.11.2.txt
8d141a1d562abb31f27f599dbf6e10a6c06ed73e
real 0m0.301s
user 0m0.280s
sys 0m0.016s
[after]
$ time git rev-list -1 origin Documentation/RelNotes/1.7.11.2.txt
8d141a1d562abb31f27f599dbf6e10a6c06ed73e
real 0m0.010s
user 0m0.008s
sys 0m0.000s
Note that the output is produced almost instantaneously in
the first case, and then git uselessly spends a long time
looking for the next commit to touch that file (but there
isn't one, and we traverse all the way down to the roots).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-07-13 15:50:23 +08:00
|
|
|
}
|
2007-03-06 19:20:55 +08:00
|
|
|
}
|
revision walker: Fix --boundary when limited
This cleans up the boundary processing in the commit walker. It
- rips out the boundary logic from the commit walker. Placing
"negative" commits in the revs->commits list was Ok if all we
cared about "boundary" was the UNINTERESTING limiting case,
but conceptually it was wrong.
- makes get_revision_1() function to walk the commits and return
the results as if there is no funny postprocessing flags such
as --reverse, --skip nor --max-count.
- makes get_revision() function the postprocessing phase:
If reverse is given, wait for get_revision_1() to give
everything that it would normally give, and then reverse it
before consuming.
If skip is given, skip that many before going further.
If max is given, stop when we gave out that many.
Now that we are about to return one positive commit, mark
the parents of that commit to be potential boundaries
before returning, iff we are doing the boundary processing.
Return the commit.
- After get_revision() finishes giving out all the positive
commits, if we are doing the boundary processing, we look at
the parents that we marked as potential boundaries earlier,
see if they are really boundaries, and give them out.
It loses more code than it adds, even when the new gc_boundary()
function, which is purely for early optimization, is counted.
Note that this patch is purely for eyeballing and discussion
only. It breaks git-bundle's verify logic because the logic
does not use BOUNDARY_SHOW flag for its internal computation
anymore. After we correct it not to attempt to affect the
boundary processing by setting the BOUNDARY_SHOW flag, we can
remove BOUNDARY_SHOW from revision.h and use that bit assignment
for the new CHILD_SHOWN flag.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-06 05:10:06 +08:00
|
|
|
|
revision: avoid work after --max-count is reached
During a revision traversal in which --max-count has been
specified, we decrement a counter for each revision returned
by get_revision. When it hits 0, we typically return NULL
(the exception being if we still have boundary commits to
show).
However, before we check the counter, we call get_revision_1
to get the next commit. This might involve looking at a
large number of commits if we have restricted the traversal
(e.g., we might traverse until we find the next commit whose
diff actually matches a pathspec).
There's no need to make this get_revision_1 call when our
counter runs out. If we are not in --boundary mode, we will
just throw away the result and immediately return NULL. If
we are in --boundary mode, then we will still throw away the
result, and then start showing the boundary commits.
However, as git_revision_1 does not impact the boundary
list, it should not have an impact.
In most cases, avoiding this work will not be especially
noticeable. However, in some cases, it can make a big
difference:
[before]
$ time git rev-list -1 origin Documentation/RelNotes/1.7.11.2.txt
8d141a1d562abb31f27f599dbf6e10a6c06ed73e
real 0m0.301s
user 0m0.280s
sys 0m0.016s
[after]
$ time git rev-list -1 origin Documentation/RelNotes/1.7.11.2.txt
8d141a1d562abb31f27f599dbf6e10a6c06ed73e
real 0m0.010s
user 0m0.008s
sys 0m0.000s
Note that the output is produced almost instantaneously in
the first case, and then git uselessly spends a long time
looking for the next commit to touch that file (but there
isn't one, and we traverse all the way down to the roots).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-07-13 15:50:23 +08:00
|
|
|
if (revs->max_count > 0)
|
|
|
|
revs->max_count--;
|
2006-12-20 10:25:32 +08:00
|
|
|
}
|
revision walker: Fix --boundary when limited
This cleans up the boundary processing in the commit walker. It
- rips out the boundary logic from the commit walker. Placing
"negative" commits in the revs->commits list was Ok if all we
cared about "boundary" was the UNINTERESTING limiting case,
but conceptually it was wrong.
- makes get_revision_1() function to walk the commits and return
the results as if there is no funny postprocessing flags such
as --reverse, --skip nor --max-count.
- makes get_revision() function the postprocessing phase:
If reverse is given, wait for get_revision_1() to give
everything that it would normally give, and then reverse it
before consuming.
If skip is given, skip that many before going further.
If max is given, stop when we gave out that many.
Now that we are about to return one positive commit, mark
the parents of that commit to be potential boundaries
before returning, iff we are doing the boundary processing.
Return the commit.
- After get_revision() finishes giving out all the positive
commits, if we are doing the boundary processing, we look at
the parents that we marked as potential boundaries earlier,
see if they are really boundaries, and give them out.
It loses more code than it adds, even when the new gc_boundary()
function, which is purely for early optimization, is counted.
Note that this patch is purely for eyeballing and discussion
only. It breaks git-bundle's verify logic because the logic
does not use BOUNDARY_SHOW flag for its internal computation
anymore. After we correct it not to attempt to affect the
boundary processing by setting the BOUNDARY_SHOW flag, we can
remove BOUNDARY_SHOW from revision.h and use that bit assignment
for the new CHILD_SHOWN flag.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-06 05:10:06 +08:00
|
|
|
|
2007-03-06 10:23:57 +08:00
|
|
|
if (c)
|
|
|
|
c->object.flags |= SHOWN;
|
|
|
|
|
2013-10-31 17:25:43 +08:00
|
|
|
if (!revs->boundary)
|
2006-12-20 10:25:32 +08:00
|
|
|
return c;
|
revision walker: Fix --boundary when limited
This cleans up the boundary processing in the commit walker. It
- rips out the boundary logic from the commit walker. Placing
"negative" commits in the revs->commits list was Ok if all we
cared about "boundary" was the UNINTERESTING limiting case,
but conceptually it was wrong.
- makes get_revision_1() function to walk the commits and return
the results as if there is no funny postprocessing flags such
as --reverse, --skip nor --max-count.
- makes get_revision() function the postprocessing phase:
If reverse is given, wait for get_revision_1() to give
everything that it would normally give, and then reverse it
before consuming.
If skip is given, skip that many before going further.
If max is given, stop when we gave out that many.
Now that we are about to return one positive commit, mark
the parents of that commit to be potential boundaries
before returning, iff we are doing the boundary processing.
Return the commit.
- After get_revision() finishes giving out all the positive
commits, if we are doing the boundary processing, we look at
the parents that we marked as potential boundaries earlier,
see if they are really boundaries, and give them out.
It loses more code than it adds, even when the new gc_boundary()
function, which is purely for early optimization, is counted.
Note that this patch is purely for eyeballing and discussion
only. It breaks git-bundle's verify logic because the logic
does not use BOUNDARY_SHOW flag for its internal computation
anymore. After we correct it not to attempt to affect the
boundary processing by setting the BOUNDARY_SHOW flag, we can
remove BOUNDARY_SHOW from revision.h and use that bit assignment
for the new CHILD_SHOWN flag.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-06 05:10:06 +08:00
|
|
|
|
|
|
|
if (!c) {
|
|
|
|
/*
|
|
|
|
* get_revision_1() runs out the commits, and
|
|
|
|
* we are done computing the boundaries.
|
|
|
|
* switch to boundary commits output mode.
|
|
|
|
*/
|
|
|
|
revs->boundary = 2;
|
2008-05-25 07:02:05 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Update revs->commits to contain the list of
|
|
|
|
* boundary commits.
|
|
|
|
*/
|
|
|
|
create_boundary_commit_list(revs);
|
|
|
|
|
2008-05-25 07:02:04 +08:00
|
|
|
return get_revision_internal(revs);
|
revision walker: Fix --boundary when limited
This cleans up the boundary processing in the commit walker. It
- rips out the boundary logic from the commit walker. Placing
"negative" commits in the revs->commits list was Ok if all we
cared about "boundary" was the UNINTERESTING limiting case,
but conceptually it was wrong.
- makes get_revision_1() function to walk the commits and return
the results as if there is no funny postprocessing flags such
as --reverse, --skip nor --max-count.
- makes get_revision() function the postprocessing phase:
If reverse is given, wait for get_revision_1() to give
everything that it would normally give, and then reverse it
before consuming.
If skip is given, skip that many before going further.
If max is given, stop when we gave out that many.
Now that we are about to return one positive commit, mark
the parents of that commit to be potential boundaries
before returning, iff we are doing the boundary processing.
Return the commit.
- After get_revision() finishes giving out all the positive
commits, if we are doing the boundary processing, we look at
the parents that we marked as potential boundaries earlier,
see if they are really boundaries, and give them out.
It loses more code than it adds, even when the new gc_boundary()
function, which is purely for early optimization, is counted.
Note that this patch is purely for eyeballing and discussion
only. It breaks git-bundle's verify logic because the logic
does not use BOUNDARY_SHOW flag for its internal computation
anymore. After we correct it not to attempt to affect the
boundary processing by setting the BOUNDARY_SHOW flag, we can
remove BOUNDARY_SHOW from revision.h and use that bit assignment
for the new CHILD_SHOWN flag.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-06 05:10:06 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* boundary commits are the commits that are parents of the
|
|
|
|
* ones we got from get_revision_1() but they themselves are
|
|
|
|
* not returned from get_revision_1(). Before returning
|
|
|
|
* 'c', we need to mark its parents that they could be boundaries.
|
|
|
|
*/
|
|
|
|
|
|
|
|
for (l = c->parents; l; l = l->next) {
|
|
|
|
struct object *p;
|
|
|
|
p = &(l->item->object);
|
2007-03-06 10:23:57 +08:00
|
|
|
if (p->flags & (CHILD_SHOWN | SHOWN))
|
revision walker: Fix --boundary when limited
This cleans up the boundary processing in the commit walker. It
- rips out the boundary logic from the commit walker. Placing
"negative" commits in the revs->commits list was Ok if all we
cared about "boundary" was the UNINTERESTING limiting case,
but conceptually it was wrong.
- makes get_revision_1() function to walk the commits and return
the results as if there is no funny postprocessing flags such
as --reverse, --skip nor --max-count.
- makes get_revision() function the postprocessing phase:
If reverse is given, wait for get_revision_1() to give
everything that it would normally give, and then reverse it
before consuming.
If skip is given, skip that many before going further.
If max is given, stop when we gave out that many.
Now that we are about to return one positive commit, mark
the parents of that commit to be potential boundaries
before returning, iff we are doing the boundary processing.
Return the commit.
- After get_revision() finishes giving out all the positive
commits, if we are doing the boundary processing, we look at
the parents that we marked as potential boundaries earlier,
see if they are really boundaries, and give them out.
It loses more code than it adds, even when the new gc_boundary()
function, which is purely for early optimization, is counted.
Note that this patch is purely for eyeballing and discussion
only. It breaks git-bundle's verify logic because the logic
does not use BOUNDARY_SHOW flag for its internal computation
anymore. After we correct it not to attempt to affect the
boundary processing by setting the BOUNDARY_SHOW flag, we can
remove BOUNDARY_SHOW from revision.h and use that bit assignment
for the new CHILD_SHOWN flag.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-06 05:10:06 +08:00
|
|
|
continue;
|
|
|
|
p->flags |= CHILD_SHOWN;
|
|
|
|
gc_boundary(&revs->boundary_commits);
|
|
|
|
add_object_array(p, NULL, &revs->boundary_commits);
|
|
|
|
}
|
|
|
|
|
|
|
|
return c;
|
2006-12-20 10:25:32 +08:00
|
|
|
}
|
2008-05-04 18:36:54 +08:00
|
|
|
|
|
|
|
struct commit *get_revision(struct rev_info *revs)
|
|
|
|
{
|
2008-08-30 03:18:38 +08:00
|
|
|
struct commit *c;
|
|
|
|
struct commit_list *reversed;
|
|
|
|
|
|
|
|
if (revs->reverse) {
|
|
|
|
reversed = NULL;
|
2013-10-31 17:25:43 +08:00
|
|
|
while ((c = get_revision_internal(revs)))
|
2008-08-30 03:18:38 +08:00
|
|
|
commit_list_insert(c, &reversed);
|
2024-06-11 17:19:17 +08:00
|
|
|
free_commit_list(revs->commits);
|
2008-08-30 03:18:38 +08:00
|
|
|
revs->commits = reversed;
|
|
|
|
revs->reverse = 0;
|
|
|
|
revs->reverse_output_stage = 1;
|
|
|
|
}
|
|
|
|
|
2014-03-25 21:23:27 +08:00
|
|
|
if (revs->reverse_output_stage) {
|
|
|
|
c = pop_commit(&revs->commits);
|
|
|
|
if (revs->track_linear)
|
|
|
|
revs->linear = !!(c && c->object.flags & TRACK_LINEAR);
|
|
|
|
return c;
|
|
|
|
}
|
2008-08-30 03:18:38 +08:00
|
|
|
|
|
|
|
c = get_revision_internal(revs);
|
2008-05-04 18:36:54 +08:00
|
|
|
if (c && revs->graph)
|
|
|
|
graph_update(revs->graph, c);
|
2014-03-25 21:23:27 +08:00
|
|
|
if (!c) {
|
log: use true parents for diff even when rewriting
When using pathspec filtering in combination with diff-based log
output, parent simplification happens before the diff is computed.
The diff is therefore against the *simplified* parents.
This works okay, arguably by accident, in the normal case:
simplification reduces to one parent as long as the commit is TREESAME
to it. So the simplified parent of any given commit must have the
same tree contents on the filtered paths as its true (unfiltered)
parent.
However, --full-diff breaks this guarantee, and indeed gives pretty
spectacular results when comparing the output of
git log --graph --stat ...
git log --graph --full-diff --stat ...
(--graph internally kicks in parent simplification, much like
--parents).
To fix it, store a copy of the parent list before simplification (in a
slab) whenever --full-diff is in effect. Then use the stored parents
instead of the simplified ones in the commit display code paths. The
latter do not actually check for --full-diff to avoid duplicated code;
they just grab the original parents if save_parents() has not been
called for this revision walk.
For ordinary commits it should be obvious that this is the right thing
to do.
Merge commits are a bit subtle. Observe that with default
simplification, merge simplification is an all-or-nothing decision:
either the merge is TREESAME to one parent and disappears, or it is
different from all parents and the parent list remains intact.
Redundant parents are not pruned, so the existing code also shows them
as a merge.
So if we do show a merge commit, the parent list just consists of the
rewrite result on each parent. Running, e.g., --cc on this in
--full-diff mode is not very useful: if any commits were skipped, some
hunks will disagree with all sides of the merge (with one side,
because commits were skipped; with the others, because they didn't
have those changes in the first place). This triggers --cc showing
these hunks spuriously.
Therefore I believe that even for merge commits it is better to show
the diffs wrt. the original parents.
Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-01 04:13:20 +08:00
|
|
|
free_saved_parents(revs);
|
2022-04-14 04:01:34 +08:00
|
|
|
free_commit_list(revs->previous_parents);
|
|
|
|
revs->previous_parents = NULL;
|
2014-03-25 21:23:27 +08:00
|
|
|
}
|
2008-05-04 18:36:54 +08:00
|
|
|
return c;
|
|
|
|
}
|
2011-03-07 20:31:39 +08:00
|
|
|
|
2019-11-20 08:51:13 +08:00
|
|
|
const char *get_revision_mark(const struct rev_info *revs, const struct commit *commit)
|
2011-03-07 20:31:39 +08:00
|
|
|
{
|
|
|
|
if (commit->object.flags & BOUNDARY)
|
|
|
|
return "-";
|
|
|
|
else if (commit->object.flags & UNINTERESTING)
|
|
|
|
return "^";
|
2011-03-07 20:31:40 +08:00
|
|
|
else if (commit->object.flags & PATCHSAME)
|
|
|
|
return "=";
|
2011-03-07 20:31:39 +08:00
|
|
|
else if (!revs || revs->left_right) {
|
|
|
|
if (commit->object.flags & SYMMETRIC_LEFT)
|
|
|
|
return "<";
|
|
|
|
else
|
|
|
|
return ">";
|
|
|
|
} else if (revs->graph)
|
|
|
|
return "*";
|
2011-03-07 20:31:40 +08:00
|
|
|
else if (revs->cherry_mark)
|
|
|
|
return "+";
|
2011-03-07 20:31:39 +08:00
|
|
|
return "";
|
|
|
|
}
|
2011-03-10 22:45:03 +08:00
|
|
|
|
|
|
|
void put_revision_mark(const struct rev_info *revs, const struct commit *commit)
|
|
|
|
{
|
2019-11-20 08:51:13 +08:00
|
|
|
const char *mark = get_revision_mark(revs, commit);
|
2011-03-10 22:45:03 +08:00
|
|
|
if (!strlen(mark))
|
|
|
|
return;
|
|
|
|
fputs(mark, stdout);
|
|
|
|
putchar(' ');
|
|
|
|
}
|