2023-04-11 15:41:54 +08:00
|
|
|
#include "git-compat-util.h"
|
2020-06-05 21:00:28 +08:00
|
|
|
#include "config.h"
|
2023-03-21 14:25:54 +08:00
|
|
|
#include "gettext.h"
|
2023-02-24 08:09:27 +08:00
|
|
|
#include "hex.h"
|
2018-04-03 04:34:19 +08:00
|
|
|
#include "lockfile.h"
|
|
|
|
#include "pack.h"
|
|
|
|
#include "packfile.h"
|
|
|
|
#include "commit.h"
|
|
|
|
#include "object.h"
|
2018-06-27 21:24:45 +08:00
|
|
|
#include "refs.h"
|
2018-04-03 04:34:19 +08:00
|
|
|
#include "revision.h"
|
2020-12-31 19:56:23 +08:00
|
|
|
#include "hash-lookup.h"
|
2018-04-03 04:34:19 +08:00
|
|
|
#include "commit-graph.h"
|
2023-04-11 15:41:53 +08:00
|
|
|
#include "object-file.h"
|
2023-05-16 14:34:06 +08:00
|
|
|
#include "object-store-ll.h"
|
2023-04-11 11:00:42 +08:00
|
|
|
#include "oid-array.h"
|
2023-05-16 14:33:59 +08:00
|
|
|
#include "path.h"
|
2018-06-27 21:24:36 +08:00
|
|
|
#include "alloc.h"
|
2018-08-21 02:24:27 +08:00
|
|
|
#include "hashmap.h"
|
|
|
|
#include "replace-object.h"
|
commit-graph write: add progress output
Before this change the "commit-graph write" command didn't report any
progress. On my machine this command takes more than 10 seconds to
write the graph for linux.git, and around 1m30s on the
2015-04-03-1M-git.git[1] test repository (a test case for a large
monorepository).
Furthermore, since the gc.writeCommitGraph setting was added in
d5d5d7b641 ("gc: automatically write commit-graph files", 2018-06-27),
there was no indication at all from a "git gc" run that anything was
different. This is why one of the progress bars being added here uses
start_progress() instead of start_delayed_progress(), so that it's
guaranteed to be seen. E.g. on my tiny 867 commit dotfiles.git
repository:
$ git -c gc.writeCommitGraph=true gc
Enumerating objects: 2821, done.
[...]
Computing commit graph generation numbers: 100% (867/867), done.
On larger repositories, such as linux.git, the delayed progress bar(s)
will kick in, and we'll show what's going on instead of, as was
previously happening, printing nothing while we write the graph:
$ git -c gc.writeCommitGraph=true gc
[...]
Annotating commits in commit graph: 1565573, done.
Computing commit graph generation numbers: 100% (782484/782484), done.
Note that here we don't show "Finding commits for commit graph", this
is because under "git gc" we seed the search with the commit
references in the repository, and that set is too small to show any
progress, but would e.g. on a smaller repo such as git.git with
--stdin-commits:
$ git rev-list --all | git -c gc.writeCommitGraph=true commit-graph write --stdin-commits
Finding commits for commit graph: 100% (162576/162576), done.
Computing commit graph generation numbers: 100% (162576/162576), done.
With --stdin-packs we don't show any estimation of how much is left to
do. This is because we might be processing more than one pack. We
could be less lazy here and show progress, either by detecting that
we're only processing one pack, or by first looping over the packs to
discover how many commits they have. I don't see the point in doing
that work. So instead we get (on 2015-04-03-1M-git.git):
$ echo pack-<HASH>.idx | git -c gc.writeCommitGraph=true --exec-path=$PWD commit-graph write --stdin-packs
Finding commits for commit graph: 13064614, done.
Annotating commits in commit graph: 3001341, done.
Computing commit graph generation numbers: 100% (1000447/1000447), done.
No GC mode uses --stdin-packs. It's what they use at Microsoft to
manually compute the generation numbers for their collection of large
packs which are never coalesced.
The reason we need a "report_progress" variable passed down from "git
gc" is so that we don't report this output when we're running in the
process "git gc --auto" detaches from the terminal.
Since we write the commit graph from the "git gc" process itself (as
opposed to what we do with say the "git repack" phase), we'd end up
writing the output to .git/gc.log and reporting it to the user next
time as part of the "The last gc run reported the following[...]"
error, see 329e6e8794 ("gc: save log from daemonized gc --auto and
print it next time", 2015-09-19).
So we must keep track of whether or not we're running in that
daemonized mode, and if so print no progress.
See [2] and subsequent replies for a discussion of an approach not
taken in compute_generation_numbers(). I.e. we're saying "Computing
commit graph generation numbers", even though on an established
history we're mostly skipping over all the work we did in the
past. This is similar to the white lie we tell in the "Writing
objects" phase (not all objects are being written).
Always showing progress is considered more important than
accuracy. I.e. on a repository like 2015-04-03-1M-git.git, had we taken
the approach in [2], the second "git gc" would hang for 6 seconds with
no output if no objects had changed in the interim.
1. https://github.com/avar/2015-04-03-1M-git
2. <c6960252-c095-fb2b-e0bc-b1e6bb261614@gmail.com>
(https://public-inbox.org/git/c6960252-c095-fb2b-e0bc-b1e6bb261614@gmail.com/)
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-17 23:33:35 +08:00
|
|
|
#include "progress.h"
|
2020-03-30 08:31:28 +08:00
|
|
|
#include "bloom.h"
|
2020-03-30 08:31:29 +08:00
|
|
|
#include "commit-slab.h"
|
2020-05-01 03:48:50 +08:00
|
|
|
#include "shallow.h"
|
2020-07-01 21:27:24 +08:00
|
|
|
#include "json-writer.h"
|
|
|
|
#include "trace2.h"
|
2023-04-23 04:17:26 +08:00
|
|
|
#include "tree.h"
|
2021-02-18 22:07:25 +08:00
|
|
|
#include "chunk-format.h"
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2020-04-17 04:14:03 +08:00
|
|
|
void git_test_write_commit_graph_or_die(void)
|
|
|
|
{
|
|
|
|
int flags = 0;
|
|
|
|
if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0))
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
|
|
|
|
flags = COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
|
|
|
|
|
|
|
|
if (write_commit_graph_reachable(the_repository->objects->odb,
|
|
|
|
flags, NULL))
|
|
|
|
die("failed to write commit-graph under GIT_TEST_COMMIT_GRAPH");
|
|
|
|
}
|
|
|
|
|
2018-04-03 04:34:19 +08:00
|
|
|
#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
|
|
|
|
#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
|
|
|
|
#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
|
|
|
|
#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
|
2022-03-02 22:45:13 +08:00
|
|
|
#define GRAPH_CHUNKID_GENERATION_DATA 0x47444132 /* "GDA2" */
|
|
|
|
#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f32 /* "GDO2" */
|
commit-graph: rename "large edges" to "extra edges"
The optional 'Large Edge List' chunk of the commit graph file stores
parent information for commits with more than two parents, and the
names of most of the macros, variables, struct fields, and functions
related to this chunk contain the term "large edges", e.g.
write_graph_chunk_large_edges(). However, it's not a really great
term, as the edges to the second and subsequent parents stored in this
chunk are not any larger than the edges to the first and second
parents stored in the "main" 'Commit Data' chunk. It's the number of
edges, IOW number of parents, that is larger compared to non-merge and
"regular" two-parent merge commits. And indeed, two functions in
'commit-graph.c' have a local variable called 'num_extra_edges' that
refers to the same thing, and this "extra edges" term is much better at
describing these edges.
So let's rename all these references to "large edges" in macro,
variable, function, etc. names to "extra edges". There is a
GRAPH_OCTOPUS_EDGES_NEEDED macro as well; for the sake of consistency
rename it to GRAPH_EXTRA_EDGES_NEEDED.
We can do so safely without causing any incompatibility issues,
because the term "large edges" doesn't come up in the file format
itself in any form (the chunk's magic is {'E', 'D', 'G', 'E'}, there
is no 'L' in there), but only in the specification text. The string
"large edges", however, does come up in the output of 'git
commit-graph read' and in tests looking at its output, but that command
is explicitly documented as a debugging aid, so we can change its output
and the affected tests safely.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-20 04:21:13 +08:00
|
|
|
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
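Each chunk ID above is simply the four ASCII bytes of the chunk's name packed into a 32-bit word, most significant byte first, which is why renaming "large edges" to "extra edges" in the code leaves the on-disk magic `{'E', 'D', 'G', 'E'}` untouched. A minimal sketch of that packing follows; `chunk_id_from_name()` is a hypothetical helper for illustration and is not part of Git.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a Git function): pack a four-character chunk
 * name into the 32-bit ID used by the GRAPH_CHUNKID_* macros.
 */
uint32_t chunk_id_from_name(const char *name)
{
	return ((uint32_t)(unsigned char)name[0] << 24) |
	       ((uint32_t)(unsigned char)name[1] << 16) |
	       ((uint32_t)(unsigned char)name[2] << 8) |
	       (uint32_t)(unsigned char)name[3];
}
```

For instance, `chunk_id_from_name("EDGE")` yields the value of `GRAPH_CHUNKID_EXTRAEDGES`.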
|
2020-04-07 00:59:49 +08:00
|
|
|
#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
|
|
|
|
#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
|
2019-06-19 02:14:26 +08:00
|
|
|
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2018-11-14 12:09:35 +08:00
|
|
|
#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
|
2018-04-03 04:34:19 +08:00
|
|
|
|
|
|
|
#define GRAPH_VERSION_1 0x1
|
|
|
|
#define GRAPH_VERSION GRAPH_VERSION_1
|
|
|
|
|
2019-01-20 04:21:13 +08:00
|
|
|
#define GRAPH_EXTRA_EDGES_NEEDED 0x80000000
|
2018-04-03 04:34:19 +08:00
|
|
|
#define GRAPH_EDGE_LAST_MASK 0x7fffffff
|
|
|
|
#define GRAPH_PARENT_NONE 0x70000000
|
|
|
|
|
|
|
|
#define GRAPH_LAST_EDGE 0x80000000
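The four edge macros above describe how a commit's second-parent slot is interpreted. The sketch below is a simplified reading of that scheme, not Git's reader: `expand_second_parent()` and its parameters are hypothetical, and the extra edge list is assumed to be already converted to host byte order.

```c
#include <stddef.h>
#include <stdint.h>

#define GRAPH_EXTRA_EDGES_NEEDED 0x80000000u
#define GRAPH_EDGE_LAST_MASK     0x7fffffffu
#define GRAPH_PARENT_NONE        0x70000000u
#define GRAPH_LAST_EDGE          0x80000000u

/*
 * Hypothetical helper: expand a second-parent slot into parent positions.
 * `extra_edges` stands in for the extra edge list chunk (host byte order).
 * Returns the number of positions written to `out` (at most `out_cap`).
 */
size_t expand_second_parent(uint32_t slot, const uint32_t *extra_edges,
			    size_t nr_extra, uint32_t *out, size_t out_cap)
{
	size_t n = 0;
	size_t i;

	if (slot == GRAPH_PARENT_NONE)
		return 0;		/* no second parent at all */
	if (!(slot & GRAPH_EXTRA_EDGES_NEEDED)) {
		if (out_cap)
			out[n++] = slot;	/* ordinary two-parent merge */
		return n;
	}
	/* More than two parents: the low 31 bits index the extra edge list. */
	for (i = slot & GRAPH_EDGE_LAST_MASK; i < nr_extra && n < out_cap; i++) {
		out[n++] = extra_edges[i] & GRAPH_EDGE_LAST_MASK;
		if (extra_edges[i] & GRAPH_LAST_EDGE)
			break;		/* high bit marks the final parent */
	}
	return n;
}
```

Note that `GRAPH_PARENT_NONE` has to be tested before the high bit, since it is a distinct sentinel rather than a position.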
|
|
|
|
|
2018-06-27 21:24:28 +08:00
|
|
|
#define GRAPH_HEADER_SIZE 8
|
2018-04-03 04:34:19 +08:00
|
|
|
#define GRAPH_FANOUT_SIZE (4 * 256)
|
2021-02-18 22:07:35 +08:00
|
|
|
#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * CHUNK_TOC_ENTRY_SIZE \
|
2018-11-14 12:09:35 +08:00
|
|
|
+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
|
2018-04-03 04:34:19 +08:00
|
|
|
|
commit-graph: implement generation data chunk
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of the
prerequisites for implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.
We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.
Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.
To minimize the space required to store corrected commit dates, Git
stores corrected commit date offsets in the commit-graph file, instead
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4 bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:
We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.
We test the overflow-related code with the following repo history:
              F - N - U
             /         \
    U - N - U            N
             \          /
              N - F - N
Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.
The largest offset observed is 2 ^ 31, just large enough to overflow.
[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:15 +08:00
|
|
|
#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
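The overflow scheme described in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not Git's writer: `encode_offset()`, `decode_offset()`, and `struct overflow_list` are hypothetical names, and `GENERATION_NUMBER_V2_OFFSET_MAX` is assumed here to be the largest value that fits in 31 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed value: the largest offset that fits in 31 bits. */
#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)
#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)

/* Hypothetical stand-in for the overflow chunk; fixed capacity for brevity. */
struct overflow_list {
	uint64_t dates[16];
	size_t nr;
};

/*
 * Encode one commit's 32-bit generation data slot.
 * The caller must not add more than 16 overflowing commits in this sketch.
 */
uint32_t encode_offset(uint64_t corrected_date, uint64_t committer_date,
		       struct overflow_list *ov)
{
	uint64_t offset = corrected_date - committer_date;

	if (offset <= GENERATION_NUMBER_V2_OFFSET_MAX)
		return (uint32_t)offset;
	/* Too large: store the full date out of line, keep its index inline. */
	ov->dates[ov->nr] = corrected_date;
	return (uint32_t)(CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | ov->nr++);
}

/* Decode a slot back into a corrected commit date. */
uint64_t decode_offset(uint32_t slot, uint64_t committer_date,
		       const struct overflow_list *ov)
{
	if (slot & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW)
		return ov->dates[slot & GENERATION_NUMBER_V2_OFFSET_MAX];
	return committer_date + slot;
}
```

An offset of exactly 2 ^ 31, the largest one in the test history above, is the smallest value that takes the overflow path.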
|
|
|
|
|
commit-graph: fix writing first commit-graph during fetch
The previous commit includes a failing test for an issue around
fetch.writeCommitGraph and fetching in a repo with a submodule. Here, we
fix that bug and set the test to "test_expect_success".
The problem arises with this set of commands when the remote repo at
<url> has a submodule. Note that --recurse-submodules is not needed to
demonstrate the bug.
$ git clone <url> test
$ cd test
$ git -c fetch.writeCommitGraph=true fetch origin
Computing commit graph generation numbers: 100% (12/12), done.
BUG: commit-graph.c:886: missing parent <hash1> for commit <hash2>
Aborted (core dumped)
As an initial fix, I converted the code in builtin/fetch.c that calls
write_commit_graph_reachable() to instead launch a "git commit-graph
write --reachable --split" process. That code worked, but is not how we
want the feature to work long-term.
That test did demonstrate that the issue must be something to do with
internal state of the 'git fetch' process.
The write_commit_graph() method in commit-graph.c ensures the commits we
plan to write are "closed under reachability" using close_reachable().
This method walks from the input commits, and uses the UNINTERESTING
flag to mark which commits have already been visited. This allows the
walk to take O(N) time, where N is the number of commits, instead of
O(P) time, where P is the number of paths. (The number of paths can be
exponential in the number of commits.)
However, the UNINTERESTING flag is used in lots of places in the
codebase. This flag usually means some barrier to stop a commit walk,
such as in revision-walking to compare histories. It is not often
cleared after the walk completes because the starting points of those
walks do not have the UNINTERESTING flag, and clear_commit_marks() would
stop immediately.
This is happening during a 'git fetch' call with a remote. The fetch
negotiation is comparing the remote refs with the local refs and marking
some commits as UNINTERESTING.
I tested running clear_commit_marks_many() to clear the UNINTERESTING
flag inside close_reachable(), but the tips did not have the flag, so
that did nothing.
It turns out that the calculate_changed_submodule_paths() method is at
fault. Thanks, Peff, for pointing out this detail! More specifically,
for each submodule, the collect_changed_submodules() runs a revision
walk to essentially do file-history on the list of submodules. That
revision walk marks commits UNINTERESTING if they are simplified away
by not changing the submodule.
Instead, I finally arrived at the conclusion that I should use a flag
that is not used in any other part of the code. In commit-reach.c, a
number of flags were defined for commit walk algorithms. The REACHABLE
flag seemed like it made the most sense, and it seems it was not
actually used in the file. The REACHABLE flag was used in early versions
of commit-reach.c, but was removed by 4fbcca4 (commit-reach: make
can_all_from_reach... linear, 2018-07-20).
Add the REACHABLE flag to commit-graph.c and use it instead of
UNINTERESTING in close_reachable(). This fixes the bug in manual
testing.
Reported-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-24 21:40:42 +08:00
|
|
|
/* Remember to update object flag allocation in object.h */
|
|
|
|
#define REACHABLE (1u<<15)
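To illustrate why a private flag matters, here is a toy walk that uses `REACHABLE` as its "already visited" marker. It is only a sketch: `struct toy_commit` and `count_reachable()` are hypothetical, and the recursion is for brevity, whereas a real walk over large histories would be iterative.

```c
#include <stddef.h>

#define REACHABLE (1u << 15)

/* Hypothetical commit type, far simpler than Git's struct commit. */
struct toy_commit {
	unsigned flags;
	struct toy_commit *parents[2];	/* NULL when absent */
};

/*
 * Count commits reachable from `tip`, visiting each one exactly once.
 * Because only the REACHABLE bit is consulted, bits left behind by other
 * walks cannot cut this walk short, and each commit costs O(1) no matter
 * how many paths lead to it.
 */
size_t count_reachable(struct toy_commit *tip)
{
	size_t n = 1;
	int i;

	if (!tip || (tip->flags & REACHABLE))
		return 0;
	tip->flags |= REACHABLE;
	for (i = 0; i < 2; i++)
		n += count_reachable(tip->parents[i]);
	return n;
}
```

In a diamond-shaped history the shared ancestor is counted once even though two paths reach it, and a stale unrelated bit on that ancestor does not hide it from the walk.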
|
|
|
|
|
2021-01-17 02:11:12 +08:00
|
|
|
define_commit_slab(topo_level_slab, uint32_t);
|
|
|
|
|
2020-03-30 08:31:29 +08:00
|
|
|
/* Keep track of the order in which commits are added to our list. */
|
|
|
|
define_commit_slab(commit_pos, int);
|
|
|
|
static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
|
|
|
|
|
|
|
|
static void set_commit_pos(struct repository *r, const struct object_id *oid)
|
|
|
|
{
|
|
|
|
static int32_t max_pos;
|
|
|
|
struct commit *commit = lookup_commit(r, oid);
|
|
|
|
|
|
|
|
if (!commit)
|
|
|
|
return; /* should never happen, but be lenient */
|
|
|
|
|
|
|
|
*commit_pos_at(&commit_pos, commit) = max_pos++;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int commit_pos_cmp(const void *va, const void *vb)
|
2018-04-03 04:34:19 +08:00
|
|
|
{
|
2020-03-30 08:31:29 +08:00
|
|
|
const struct commit *a = *(const struct commit **)va;
|
|
|
|
const struct commit *b = *(const struct commit **)vb;
|
|
|
|
return commit_pos_at(&commit_pos, a) -
|
|
|
|
commit_pos_at(&commit_pos, b);
|
|
|
|
}
|
|
|
|
|
2020-06-17 17:14:09 +08:00
|
|
|
define_commit_slab(commit_graph_data_slab, struct commit_graph_data);
|
|
|
|
static struct commit_graph_data_slab commit_graph_data_slab =
|
|
|
|
COMMIT_SLAB_INIT(1, commit_graph_data_slab);
|
|
|
|
|
2021-02-26 02:19:43 +08:00
|
|
|
static int get_configured_generation_version(struct repository *r)
|
|
|
|
{
|
|
|
|
int version = 2;
|
|
|
|
repo_config_get_int(r, "commitgraph.generationversion", &version);
|
|
|
|
return version;
|
|
|
|
}
|
|
|
|
|
2020-06-17 17:14:09 +08:00
|
|
|
uint32_t commit_graph_position(const struct commit *c)
|
|
|
|
{
|
|
|
|
struct commit_graph_data *data =
|
|
|
|
commit_graph_data_slab_peek(&commit_graph_data_slab, c);
|
|
|
|
|
|
|
|
return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
|
|
|
|
}
|
|
|
|
|
2021-01-17 02:11:13 +08:00
|
|
|
timestamp_t commit_graph_generation(const struct commit *c)
|
2020-06-17 17:14:09 +08:00
|
|
|
{
|
|
|
|
struct commit_graph_data *data =
|
|
|
|
commit_graph_data_slab_peek(&commit_graph_data_slab, c);
|
|
|
|
|
2023-03-20 19:26:51 +08:00
|
|
|
if (data && data->generation)
|
|
|
|
return data->generation;
|
2020-06-17 17:14:09 +08:00
|
|
|
|
2023-03-20 19:26:51 +08:00
|
|
|
return GENERATION_NUMBER_INFINITY;
|
2020-06-17 17:14:09 +08:00
|
|
|
}
|
|
|
|
|
commit-graph: introduce `commit_graph_generation_from_graph()`
In 2ee11f7261 (commit-graph: return generation from memory, 2023-03-20),
the `commit_graph_generation()` function stopped returning zeros when
asked to locate the generation number of a given commit.
This was done at the time to prepare for a later change which set
generation values in memory, meaning that we could no longer rely on
`graph_pos` alone to tell us whether or not to trust the generation
number returned by this function.
In 2ee11f7261, it was noted that this change only impacted very old
commit-graphs, which were written with all commits having generation
number 0. Indeed, zero is not a valid generation number, so we should
never expect to see that value outside of the aforementioned case.
The test fallout in 2ee11f7261 indicated that we were no longer able to
fsck a specific old case of commit-graph corruption, where we see a
non-zero generation number after having seen a generation number of 0
earlier.
Introduce a variant of `commit_graph_generation()` which behaves like
that function did prior to 2ee11f7261, known as
`commit_graph_generation_from_graph()`. Then use this function in the
context of `verify_one_commit_graph()`, where we only want to trust the
values from the graph.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-22 05:34:34 +08:00
|
|
|
static timestamp_t commit_graph_generation_from_graph(const struct commit *c)
|
|
|
|
{
|
|
|
|
struct commit_graph_data *data =
|
|
|
|
commit_graph_data_slab_peek(&commit_graph_data_slab, c);
|
|
|
|
|
|
|
|
if (!data || data->graph_pos == COMMIT_NOT_FROM_GRAPH)
|
|
|
|
return GENERATION_NUMBER_INFINITY;
|
|
|
|
return data->generation;
|
|
|
|
}
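The difference between the two lookups above is easiest to see side by side. The sketch below mirrors their logic on a hypothetical `struct toy_data`; the constant values are assumptions chosen for the example rather than taken from this file.

```c
#include <stdint.h>

/* Assumed values for illustration; the real constants live in Git's headers. */
#define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFFu
#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)

/* Hypothetical mirror of the two fields used by both lookups. */
struct toy_data {
	uint32_t graph_pos;
	uint64_t generation;
};

/* Trust any non-zero in-memory generation, wherever it came from. */
uint64_t generation_any(const struct toy_data *d)
{
	if (d && d->generation)
		return d->generation;
	return GENERATION_NUMBER_INFINITY;
}

/* Trust only values loaded from a commit-graph file, including zero. */
uint64_t generation_from_graph(const struct toy_data *d)
{
	if (!d || d->graph_pos == COMMIT_NOT_FROM_GRAPH)
		return GENERATION_NUMBER_INFINITY;
	return d->generation;
}
```

A commit read from a very old graph (valid position, generation 0) reports infinity through the first function but 0 through the second, which is the case the verification code needs to detect.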
|
|
|
|
|
2020-06-17 17:14:09 +08:00
|
|
|
static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
|
|
|
|
{
|
|
|
|
unsigned int i, nth_slab;
|
|
|
|
struct commit_graph_data *data =
|
|
|
|
commit_graph_data_slab_peek(&commit_graph_data_slab, c);
|
|
|
|
|
|
|
|
if (data)
|
|
|
|
return data;
|
|
|
|
|
|
|
|
nth_slab = c->index / commit_graph_data_slab.slab_size;
|
|
|
|
data = commit_graph_data_slab_at(&commit_graph_data_slab, c);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* commit-slab initializes elements with zero, overwrite this with
|
|
|
|
* COMMIT_NOT_FROM_GRAPH for graph_pos.
|
|
|
|
*
|
|
|
|
* We avoid initializing generation with checking if graph position
|
|
|
|
* is not COMMIT_NOT_FROM_GRAPH.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < commit_graph_data_slab.slab_size; i++) {
|
|
|
|
commit_graph_data_slab.slab[nth_slab][i].graph_pos =
|
|
|
|
COMMIT_NOT_FROM_GRAPH;
|
|
|
|
}
|
|
|
|
|
|
|
|
return data;
|
|
|
|
}
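The pattern in `commit_graph_data_at()` is "allocate zeroed storage lazily, then overwrite the one field whose zero value is meaningful". A self-contained sketch of that idea follows. It does not use Git's commit-slab API: `toy_data_at()`, `SLAB_SIZE`, and `MAX_SLABS` are hypothetical, and the sentinel value is an assumption.

```c
#include <stdint.h>
#include <stdlib.h>

#define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFFu	/* assumed sentinel value */
#define SLAB_SIZE 4				/* tiny on purpose */
#define MAX_SLABS 8

struct toy_data {
	uint32_t graph_pos;
	uint64_t generation;
};

static struct toy_data *slabs[MAX_SLABS];

/* Return the entry for `index`, allocating and pre-filling its slab on first use. */
struct toy_data *toy_data_at(unsigned index)
{
	unsigned nth = index / SLAB_SIZE;
	unsigned i;

	if (nth >= MAX_SLABS)
		return NULL;
	if (!slabs[nth]) {
		slabs[nth] = calloc(SLAB_SIZE, sizeof(**slabs));
		if (!slabs[nth])
			return NULL;
		/* Zero is a valid graph position, so mark every entry "absent". */
		for (i = 0; i < SLAB_SIZE; i++)
			slabs[nth][i].graph_pos = COMMIT_NOT_FROM_GRAPH;
	}
	return &slabs[nth][index % SLAB_SIZE];
}
```

Filling the whole slab at allocation time means later lookups for neighbouring entries need no per-entry initialization check.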
|
|
|
|
|
commit-graph: fix regression when computing Bloom filters
Before computing Bloom filters, the commit-graph machinery uses
commit_gen_cmp to sort commits by generation order for improved diff
performance. 3d11275505 (commit-graph: examine commits by generation
number, 2020-03-30) claims that this sort can reduce the time spent to
compute Bloom filters by nearly half.
But since c49c82aa4c (commit: move members graph_pos, generation to a
slab, 2020-06-17), this optimization is broken, since asking for a
'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
while writing.
Not all hope is lost, though: 'commit_gen_cmp()' falls back to
comparing commits by their date when they have equal generation number,
and so since c49c82aa4c it has been purely a date comparison function. This
heuristic is good enough that we don't seem to lose appreciable
performance while computing Bloom filters.
Applying this patch (compared with v2.30.0) speeds up computing Bloom
filters by factors ranging from 0.40% to 5.19% on various repositories [1].
So, avoid the useless 'commit_graph_generation()' while writing by
instead accessing the slab directly. This returns the newly-computed
generation numbers, and allows us to avoid the heuristic by directly
comparing generation numbers.
[1]: https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:08 +08:00
|
|
|
/*
|
|
|
|
* Should be used only while writing commit-graph as it compares
|
|
|
|
* generation value of commits by directly accessing commit-slab.
|
|
|
|
*/
|
2020-03-30 08:31:30 +08:00
|
|
|
static int commit_gen_cmp(const void *va, const void *vb)
|
|
|
|
{
|
|
|
|
const struct commit *a = *(const struct commit **)va;
|
|
|
|
const struct commit *b = *(const struct commit **)vb;
|
|
|
|
|
2021-01-17 02:11:13 +08:00
|
|
|
const timestamp_t generation_a = commit_graph_data_at(a)->generation;
|
|
|
|
const timestamp_t generation_b = commit_graph_data_at(b)->generation;
|
2020-03-30 08:31:30 +08:00
|
|
|
/* lower generation commits first */
|
2020-06-17 17:14:11 +08:00
|
|
|
if (generation_a < generation_b)
|
2020-03-30 08:31:30 +08:00
|
|
|
return -1;
|
2020-06-17 17:14:11 +08:00
|
|
|
else if (generation_a > generation_b)
|
2020-03-30 08:31:30 +08:00
|
|
|
return 1;
|
|
|
|
|
|
|
|
/* use date as a heuristic when generations are equal */
|
|
|
|
if (a->date < b->date)
|
|
|
|
return -1;
|
|
|
|
else if (a->date > b->date)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
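The ordering that `commit_gen_cmp()` imposes (generation first, date as tie-breaker) can be tried out in isolation with `qsort()`. The types and function names below are hypothetical stand-ins, with the generation stored directly on the struct instead of in a slab.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in: generation lives on the struct, not in a slab. */
struct gen_commit {
	uint64_t generation;
	uint64_t date;
};

/* Lower generation first; on equal generations, older date first. */
int gen_commit_cmp(const void *va, const void *vb)
{
	const struct gen_commit *a = *(const struct gen_commit *const *)va;
	const struct gen_commit *b = *(const struct gen_commit *const *)vb;

	if (a->generation != b->generation)
		return a->generation < b->generation ? -1 : 1;
	if (a->date != b->date)
		return a->date < b->date ? -1 : 1;
	return 0;
}

/* Sort an array of commit pointers in place. */
void sort_by_generation(struct gen_commit **list, size_t nr)
{
	qsort(list, nr, sizeof(*list), gen_commit_cmp);
}
```

Comparing with `<` and `>` instead of subtracting avoids overflow when the 64-bit values are far apart.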
|
|
|
|
|
2020-03-30 08:31:29 +08:00
|
|
|
char *get_commit_graph_filename(struct object_directory *obj_dir)
|
2018-04-03 04:34:19 +08:00
|
|
|
{
|
2020-03-30 08:31:29 +08:00
|
|
|
return xstrfmt("%s/info/commit-graph", obj_dir->path);
|
2018-04-03 04:34:19 +08:00
|
|
|
}
|
|
|
|
|
commit-graph.c: remove path normalization, comparison
As of the previous patch, all calls to 'commit-graph.c' functions which
perform path normalization (e.g., 'get_commit_graph_filename()') are
of the form 'ctx->odb->path', which is always in normalized form.
Now that there are no callers passing non-normalized paths to these
functions, ensure that future callers are bound by the same restrictions
by making these functions take a 'struct object_directory *' instead of
a 'const char *'. To match, replace all calls with arguments of the form
'ctx->odb->path' with 'ctx->odb'. To recover the path, functions that
perform path manipulation simply use 'odb->path'.
Further, avoid string comparisons with arguments of the form
'odb->path', and instead prefer raw pointer comparisons, which
accomplish the same effect, but are far less brittle.
This has a pleasant side-effect of making these functions much more
robust to paths that cannot be normalized by 'normalize_path_copy()',
i.e., because they are outside of the current working directory.
For example, prior to this patch, Valgrind reports the following
uninitialized memory read [1]:
$ ( cd t && GIT_DIR=../.git valgrind git rev-parse HEAD^ )
because 'normalize_path_copy()' can't normalize '../.git' (since it's
relative to, but above, the current working directory) [2].
By using a 'struct object_directory *' directly,
'get_commit_graph_filename()' does not need to normalize, because all
paths are relative to the current working directory since they are
always read from the '->path' of an object directory.
[1]: https://lore.kernel.org/git/20191027042116.GA5801@sigill.intra.peff.net.
[2]: The bug here is that 'get_commit_graph_filename()' returns the
result of 'normalize_path_copy()' without checking the return
value.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04 05:18:02 +08:00
|
|
|
static char *get_split_graph_filename(struct object_directory *odb,
|
2019-06-19 02:14:25 +08:00
|
|
|
const char *oid_hex)
|
|
|
|
{
|
2020-02-04 05:18:02 +08:00
|
|
|
return xstrfmt("%s/info/commit-graphs/graph-%s.graph", odb->path,
|
|
|
|
oid_hex);
|
2019-06-19 02:14:25 +08:00
|
|
|
}
|
|
|
|
|
2020-09-18 02:11:46 +08:00
|
|
|
char *get_commit_graph_chain_filename(struct object_directory *odb)
|
2019-06-19 02:14:25 +08:00
|
|
|
{
|
2020-02-04 05:18:02 +08:00
|
|
|
return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
|
2018-04-03 04:34:19 +08:00
|
|
|
}
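The three filename helpers above differ only in their format strings. The sketch below reproduces those strings with `snprintf()` instead of Git's `xstrfmt()`, so that it compiles on its own; `struct toy_odb` and the `toy_*` functions are hypothetical stand-ins for `struct object_directory` and the real helpers.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for struct object_directory. */
struct toy_odb {
	const char *path;
};

/* Single-file layout: <objdir>/info/commit-graph */
int toy_graph_path(char *buf, size_t len, const struct toy_odb *odb)
{
	return snprintf(buf, len, "%s/info/commit-graph", odb->path);
}

/* Split layout: <objdir>/info/commit-graphs/graph-<hex>.graph */
int toy_split_graph_path(char *buf, size_t len, const struct toy_odb *odb,
			 const char *oid_hex)
{
	return snprintf(buf, len, "%s/info/commit-graphs/graph-%s.graph",
			odb->path, oid_hex);
}

/* Chain file: <objdir>/info/commit-graphs/commit-graph-chain */
int toy_chain_path(char *buf, size_t len, const struct toy_odb *odb)
{
	return snprintf(buf, len, "%s/info/commit-graphs/commit-graph-chain",
			odb->path);
}
```

Taking the directory object instead of a raw string, as the commit message argues, leaves no room for a caller to pass a path that was never normalized.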
|
|
|
|
|
2018-04-10 20:56:02 +08:00
|
|
|
static struct commit_graph *alloc_commit_graph(void)
|
|
|
|
{
|
|
|
|
struct commit_graph *g = xcalloc(1, sizeof(*g));
|
|
|
|
|
|
|
|
return g;
|
|
|
|
}
|
|
|
|
|
2018-08-21 02:24:27 +08:00
|
|
|
static int commit_graph_compatible(struct repository *r)
|
|
|
|
{
|
2018-08-21 02:24:32 +08:00
|
|
|
if (!r->gitdir)
|
|
|
|
return 0;
|
|
|
|
|
2023-06-06 21:24:36 +08:00
|
|
|
if (replace_refs_enabled(r)) {
|
2018-08-21 02:24:27 +08:00
|
|
|
prepare_replace_object(r);
|
2021-03-02 01:19:37 +08:00
|
|
|
if (hashmap_get_size(&r->objects->replace_map->map))
|
2018-08-21 02:24:27 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-08-21 02:24:30 +08:00
|
|
|
prepare_commit_graft(r);
|
commit.c: don't persist substituted parents when unshallowing
Since 37b9dcabfc (shallow.c: use '{commit,rollback}_shallow_file',
2020-04-22), Git knows how to reset stat-validity checks for the
$GIT_DIR/shallow file, allowing it to change between a shallow and
non-shallow state in the same process (e.g., in the case of 'git fetch
--unshallow').
However, when $GIT_DIR/shallow changes, Git does not alter or remove any
grafts (nor substituted parents) in memory.
This comes up in a "git fetch --unshallow" with fetch.writeCommitGraph
set to true. Ordinarily in a shallow repository (and before 37b9dcabfc,
even in this case), commit_graph_compatible() would return false,
indicating that the repository should not be used to write a
commit-graph (since commit-graph files cannot represent a shallow
history). But since 37b9dcabfc, in an --unshallow operation that check
succeeds.
Thus even though the repository isn't shallow any longer (that is, we
have all of the objects), the in-core representation of those objects
still has munged parents at the shallow boundaries. When the
commit-graph write proceeds, we use the incorrect parentage, producing
wrong results.
There are two ways for a user to work around this: either (1) set
'fetch.writeCommitGraph' to 'false', or (2) drop the commit-graph after
unshallowing.
One way to fix this would be to reset the parsed object pool entirely
(flushing the cache and thus preventing subsequent reads from modifying
their parents) after unshallowing. That would produce a problem when
callers have a now-stale reference to the old pool, and so this patch
implements a different approach. Instead, attach a new bit to the pool,
'substituted_parent', which indicates if the repository *ever* stored a
commit which had its parents modified (i.e., the shallow boundary
prior to unshallowing).
This bit needs to be sticky because all reads subsequent to modifying a
commit's parents are unreliable when unshallowing. Modify the check in
'commit_graph_compatible' to take this bit into account, and correctly
avoid generating commit-graphs in this case, thus solving the bug.
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: Jonathan Nieder <jrnieder@gmail.com>
Reported-by: Jay Conrod <jayconrod@google.com>
Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-09 05:10:53 +08:00
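The "sticky bit" idea described in the message above can be sketched as follows. This is an illustration under stated assumptions, not git's implementation: 'struct object_pool' is a reduced stand-in for the parsed object pool, and 'record_parent_substitution()', 'drop_grafts()' and 'pool_is_graph_compatible()' are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for the parsed object pool (assumption). */
struct object_pool {
	int grafts_nr;
	unsigned substituted_parent : 1;
};

/* Called whenever a commit's parents are rewritten in memory. */
static void record_parent_substitution(struct object_pool *pool)
{
	assert(pool != NULL);
	pool->substituted_parent = 1;
}

/*
 * Models unshallowing: the grafts go away, but the flag is
 * deliberately left set, because commits parsed earlier may still
 * carry substituted parents.
 */
static void drop_grafts(struct object_pool *pool)
{
	pool->grafts_nr = 0;
}

/* Returns 1 only if the in-memory parentage can be trusted. */
static int pool_is_graph_compatible(const struct object_pool *pool)
{
	return !(pool->grafts_nr || pool->substituted_parent);
}
```

Because nothing ever clears the flag, a pool that has rewritten parents once stays incompatible for the rest of the process, even after the grafts themselves are gone.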
|
|
|
if (r->parsed_objects &&
|
2021-03-02 01:19:37 +08:00
|
|
|
(r->parsed_objects->grafts_nr || r->parsed_objects->substituted_parent))
|
2018-08-21 02:24:30 +08:00
|
|
|
return 0;
|
2021-03-02 01:19:37 +08:00
|
|
|
if (is_repository_shallow(r))
|
2018-08-21 02:24:30 +08:00
|
|
|
return 0;
|
|
|
|
|
2018-08-21 02:24:27 +08:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2019-03-25 20:08:30 +08:00
|
|
|
int open_commit_graph(const char *graph_file, int *fd, struct stat *st)
|
|
|
|
{
|
|
|
|
*fd = git_open(graph_file);
|
|
|
|
if (*fd < 0)
|
|
|
|
return 0;
|
|
|
|
if (fstat(*fd, st)) {
|
|
|
|
close(*fd);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2020-09-09 23:22:56 +08:00
|
|
|
struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
|
|
|
|
int fd, struct stat *st,
|
2020-02-04 05:18:04 +08:00
|
|
|
struct object_directory *odb)
|
2018-04-10 20:56:02 +08:00
|
|
|
{
|
|
|
|
void *graph_map;
|
|
|
|
size_t graph_size;
|
2019-01-16 06:25:50 +08:00
|
|
|
struct commit_graph *ret;
|
2018-04-10 20:56:02 +08:00
|
|
|
|
2019-03-25 20:08:30 +08:00
|
|
|
graph_size = xsize_t(st->st_size);
|
2018-04-10 20:56:02 +08:00
|
|
|
|
|
|
|
if (graph_size < GRAPH_MIN_SIZE) {
|
|
|
|
close(fd);
|
2019-03-25 20:08:31 +08:00
|
|
|
error(_("commit-graph file is too small"));
|
2019-03-25 20:08:30 +08:00
|
|
|
return NULL;
|
2018-04-10 20:56:02 +08:00
|
|
|
}
|
|
|
|
graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
|
2020-04-24 05:41:13 +08:00
|
|
|
close(fd);
|
commit-graph: pass repo_settings instead of repository
The parse_commit_graph() function takes a 'struct repository *' pointer,
but it only ever accesses config settings (either directly or through
the .settings field of the repo struct). Move all relevant config
settings into the repo_settings struct, and update parse_commit_graph()
and its existing callers so that it takes 'struct repo_settings *'
instead.
Callers of parse_commit_graph() will now need to call
prepare_repo_settings() themselves, or initialize a 'struct
repo_settings' directly.
Prior to ab14d0676c (commit-graph: pass a 'struct repository *' in more
places, 2020-09-09), parsing a commit-graph was a pure function
depending only on the contents of the commit-graph itself. Commit
ab14d0676c introduced a dependency on a `struct repository` pointer, and
later commits such as b66d84756f (commit-graph: respect
'commitGraph.readChangedPaths', 2020-09-09) added dependencies on config
settings, which were accessed through the `settings` field of the
repository pointer. This field was initialized via a call to
`prepare_repo_settings()`.
Additionally, this fixes an issue in fuzz-commit-graph: In 44c7e62
(2021-12-06, repo-settings:prepare_repo_settings only in git repos),
prepare_repo_settings was changed to issue a BUG() if it is called by a
process whose CWD is not a Git repository.
The combination of commits mentioned above broke fuzz-commit-graph,
which attempts to parse arbitrary fuzzing-engine-provided bytes as a
commit graph file. Prior to this change, parse_commit_graph() called
prepare_repo_settings(), but since we run the fuzz tests without a valid
repository, we are hitting the BUG() from 44c7e62 for every test case.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-15 05:43:06 +08:00
|
|
|
prepare_repo_settings(r);
|
|
|
|
ret = parse_commit_graph(&r->settings, graph_map, graph_size);
|
2019-01-16 06:25:50 +08:00
|
|
|
|
2020-02-04 05:18:04 +08:00
|
|
|
if (ret)
|
|
|
|
ret->odb = odb;
|
2020-04-24 05:41:13 +08:00
|
|
|
else
|
2019-01-16 06:25:50 +08:00
|
|
|
munmap(graph_map, graph_size);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
commit-graph: fix segfault on e.g. "git status"
When core.commitGraph=true is set, various common commands now consult
the commit graph. Because the commit-graph code is very trusting of
its input data, it's possible to construct a graph that'll cause an
immediate segfault on e.g. "status" (and e.g. "log", "blame", ...). In
some other cases git immediately exits with a cryptic error
about the graph being broken.
The root cause of this is that while the "commit-graph verify"
sub-command exhaustively verifies the graph, other users of the graph
simply trust the graph, and will e.g. dereference data found at certain
offsets as pointers, causing segfaults.
This change does the bare minimum to ensure that we don't segfault in
the common fill_commit_in_graph() codepath called by
e.g. setup_revisions(). To do this, instrument the "commit-graph
verify" tests to always check if "status" would subsequently
segfault. This fixes the following tests which would previously
segfault:
not ok 50 - detect low chunk count
not ok 51 - detect missing OID fanout chunk
not ok 52 - detect missing OID lookup chunk
not ok 53 - detect missing commit data chunk
Those happened because with the commit-graph enabled setup_revisions()
would eventually call fill_commit_in_graph(), where e.g.
g->chunk_commit_data is used early as an offset (and will be
0x0). With this change we get far enough to detect that the graph is
broken, and show an error instead. E.g.:
$ git status; echo $?
error: commit-graph is missing the Commit Data chunk
1
That also sucks, we should *warn* and not hard-fail "status" just
because the commit-graph is corrupt, but fixing that is left to a follow-up
change.
A side-effect of changing the reporting from graph_report() to error()
is that we now have an "error: " prefix for these even for
"commit-graph verify". Pseudo-diff before/after:
$ git commit-graph verify
-commit-graph is missing the Commit Data chunk
+error: commit-graph is missing the Commit Data chunk
Changing that is OK. Various errors it emits now early on are prefixed
with "error: ", moving these over and changing the output doesn't
break anything.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-03-25 20:08:29 +08:00
|
|
|
static int verify_commit_graph_lite(struct commit_graph *g)
|
|
|
|
{
|
commit-graph: check consistency of fanout table
We use bsearch_hash() to look up items in the oid index of a
commit-graph. It also has a fanout table to reduce the initial range in
which we'll search. But since the fanout comes from the on-disk file, a
corrupted or malicious file can cause us to look outside of the
allocated index memory.
One solution here would be to pass the total table size to
bsearch_hash(), which could then bounds check the values it reads from
the fanout. But there's an inexpensive up-front check we can do, and
it's the same one used by the midx and pack idx code (both of which
likewise have fanout tables and use bsearch_hash(), but are not affected
by this bug):
1. We can check the value of the final fanout entry against the size
of the table we got from the index chunk. These must always match,
since the fanout is just slicing up the index.
As a side note, the midx and pack idx code compute it the other
way around: they use the final fanout value as the object count, and
check the index size against it. Either is valid; if they
disagree we cannot know which is wrong (a corrupted fanout value,
or a too-small table of oids).
2. We can quickly scan the fanout table to make sure it is
monotonically increasing. If it is, then we know that every value
is less than or equal to the final value, and therefore less than
or equal to the table size.
It would also be sufficient to just check that each fanout value is
smaller than the final one, but the midx and pack idx code both do
a full monotonicity check. It's the same cost, and it catches some
other corruptions (though not all; the checks done by "commit-graph
verify" are more complete but more expensive, and our goal here is
to be fast and memory-safe).
There are two new tests. One just checks the final fanout value (this is
the mirror image of the "too small oid lookup" case added for the midx
in the previous commit; it's flipped here because commit-graph considers
the oid lookup chunk to be the source of truth).
The other actually creates a fanout with many out-of-bounds entries, and
prior to this patch, it does cause the segfault you'd expect. But note
that the error is not "your fanout entry is out-of-bounds", but rather
"fanout value out of order". That's because we leave the final fanout
value in place (to get past the table size check), making the index
non-monotonic (the second-to-last entry is big, but the last one must
remain small to match the actual table).
We need adjustments to a few existing tests, as well:
- an earlier test in t5318 corrupts the fanout and runs "commit-graph
verify". Its message is now changed, since we catch the problem
earlier (during the load step, rather than the careful validation
step).
- in t5324, we test that "commit-graph verify --shallow" does not do
expensive verification on the base file of the chain. But the
corruption it uses (munging a byte at offset 1000) happens to be in
the middle of the fanout table. And now we detect that problem in
the cheaper checks that are performed for every part of the graph.
We'll push this back to offset 1500, which is only caught by the
more expensive checksum validation.
Likewise, there's a later test in t5324 which munges an offset 100
bytes into a file (also in the fanout table) that is referenced by
an alternates file. So we now find that corruption during the load
step, rather than the verification step. At the very least we need
to change the error message (like the case above in t5318). But it
is probably good to make sure we handle all parts of the
verification even for alternate graph files. So let's likewise
corrupt byte 1500 and make sure we found the invalid checksum.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-10 05:04:58 +08:00
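The two fanout checks described in the message above (monotonicity, and agreement of the final entry with the table size) can be sketched on a raw big-endian buffer. This is a self-contained illustration, not git's code: 'read_be32()', 'write_be32()' and 'check_fanout()' are hypothetical helpers, and the return codes 1 and 2 are chosen here only to distinguish the two failure modes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FANOUT_ENTRIES 256

/* Decode a big-endian 32-bit value from a raw buffer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Encode a 32-bit value as big-endian; used to build sample tables. */
static void write_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/*
 * Returns 0 if the fanout is consistent with a table of 'num_oids'
 * entries, 1 if the values are out of order, and 2 if the final
 * entry disagrees with the table size.
 */
static int check_fanout(const unsigned char *fanout, uint32_t num_oids)
{
	int i;

	assert(fanout != NULL);
	for (i = 0; i < FANOUT_ENTRIES - 1; i++) {
		uint32_t cur = read_be32(fanout + 4 * i);
		uint32_t next = read_be32(fanout + 4 * (i + 1));

		if (cur > next)
			return 1;
	}
	if (read_be32(fanout + 4 * (FANOUT_ENTRIES - 1)) != num_oids)
		return 2;
	return 0;
}
```

Once both checks pass, every fanout value is known to be at most 'num_oids', so a binary search bounded by two adjacent fanout entries cannot leave the table. A large second-to-last entry with a correct final entry is reported as "out of order", matching the behaviour the message describes for its second test.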
|
|
|
int i;
|
|
|
|
|
2019-03-25 20:08:29 +08:00
|
|
|
/*
|
|
|
|
* Basic validation shared between parse_commit_graph()
|
|
|
|
* which'll be called every time the graph is used, and the
|
|
|
|
* much more expensive verify_commit_graph() used by
|
|
|
|
* "commit-graph verify".
|
|
|
|
*
|
|
|
|
* There should only be very basic checks here to ensure that
|
|
|
|
* we don't e.g. segfault in fill_commit_in_graph(), but
|
|
|
|
* because this is a very hot codepath nothing that e.g. loops
|
|
|
|
* over g->num_commits, or runs a checksum on the commit-graph
|
|
|
|
* itself.
|
|
|
|
*/
|
|
|
|
if (!g->chunk_oid_fanout) {
|
|
|
|
error("commit-graph is missing the OID Fanout chunk");
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
if (!g->chunk_oid_lookup) {
|
|
|
|
error("commit-graph is missing the OID Lookup chunk");
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
if (!g->chunk_commit_data) {
|
|
|
|
error("commit-graph is missing the Commit Data chunk");
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2023-10-10 05:04:58 +08:00
|
|
|
for (i = 0; i < 255; i++) {
|
|
|
|
uint32_t oid_fanout1 = ntohl(g->chunk_oid_fanout[i]);
|
|
|
|
uint32_t oid_fanout2 = ntohl(g->chunk_oid_fanout[i + 1]);
|
|
|
|
|
|
|
|
if (oid_fanout1 > oid_fanout2) {
|
|
|
|
error("commit-graph fanout values out of order");
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (ntohl(g->chunk_oid_fanout[255]) != g->num_commits) {
|
|
|
|
error("commit-graph oid table and fanout disagree on size");
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2019-03-25 20:08:29 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-10-10 04:59:51 +08:00
|
|
|
static int graph_read_oid_fanout(const unsigned char *chunk_start,
|
|
|
|
size_t chunk_size, void *data)
|
|
|
|
{
|
|
|
|
struct commit_graph *g = data;
|
|
|
|
if (chunk_size != 256 * sizeof(uint32_t))
|
|
|
|
return error("commit-graph oid fanout chunk is wrong size");
|
|
|
|
g->chunk_oid_fanout = (const uint32_t *)chunk_start;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-02-18 22:07:35 +08:00
|
|
|
static int graph_read_oid_lookup(const unsigned char *chunk_start,
|
|
|
|
size_t chunk_size, void *data)
|
|
|
|
{
|
|
|
|
struct commit_graph *g = data;
|
|
|
|
g->chunk_oid_lookup = chunk_start;
|
|
|
|
g->num_commits = chunk_size / g->hash_len;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-10-10 05:05:36 +08:00
|
|
|
static int graph_read_commit_data(const unsigned char *chunk_start,
|
|
|
|
size_t chunk_size, void *data)
|
|
|
|
{
|
|
|
|
struct commit_graph *g = data;
|
|
|
|
if (chunk_size != g->num_commits * GRAPH_DATA_WIDTH)
|
|
|
|
return error("commit-graph commit data chunk is wrong size");
|
|
|
|
g->chunk_commit_data = chunk_start;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-10-10 05:05:44 +08:00
|
|
|
static int graph_read_generation_data(const unsigned char *chunk_start,
|
|
|
|
size_t chunk_size, void *data)
|
|
|
|
{
|
|
|
|
struct commit_graph *g = data;
|
|
|
|
if (chunk_size != g->num_commits * sizeof(uint32_t))
|
|
|
|
return error("commit-graph generations chunk is wrong size");
|
|
|
|
g->chunk_generation_data = chunk_start;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-10-10 05:05:53 +08:00
|
|
|
static int graph_read_bloom_index(const unsigned char *chunk_start,
|
|
|
|
size_t chunk_size, void *data)
|
|
|
|
{
|
|
|
|
struct commit_graph *g = data;
|
|
|
|
if (chunk_size != g->num_commits * 4) {
|
|
|
|
warning("commit-graph changed-path index chunk is too small");
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
g->chunk_bloom_indexes = chunk_start;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2021-02-18 22:07:35 +08:00
|
|
|
static int graph_read_bloom_data(const unsigned char *chunk_start,
|
|
|
|
size_t chunk_size, void *data)
|
|
|
|
{
|
|
|
|
struct commit_graph *g = data;
|
|
|
|
uint32_t hash_version;
|
commit-graph: check bounds when accessing BDAT chunk
When loading Bloom filters from a commit-graph file, we use the offset
values in the BIDX chunk to index into the memory mapped for the BDAT
chunk. But since we don't record how big the BDAT chunk is, we just
trust that the BIDX offsets won't cause us to read outside of the chunk
memory. A corrupted or malicious commit-graph file will cause us to
segfault (in practice this isn't a very interesting attack, since
commit-graph files are local-only, and the worst case is an
out-of-bounds read).
We can't fix this by checking the chunk size during parsing, since the
data in the BDAT chunk doesn't have a fixed size (that's why we need the
BIDX in the first place). So we'll fix it in two parts:
1. Record the BDAT chunk size during parsing, and then later check
that the BIDX offsets we look up are within bounds.
2. Because the offsets are relative to the end of the BDAT header, we
must also make sure that the BDAT chunk is at least as large as the
expected header size. Otherwise, we overflow when trying to move
past the header, even for an offset of "0". We can check this
early, during the parsing stage.
The error messages are rather verbose, but since this is not something
you'd expect to see outside of severe bugs or corruption, it makes sense
to err on the side of too many details. Sadly we can't mention the
filename during the chunk-parsing stage, as we haven't set g->filename
at this point, nor passed it down through the stack.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-10 05:05:50 +08:00
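The two-part bounds check described in the message above can be sketched as a single lookup helper. This is a minimal illustration under assumptions: 'HEADER_SIZE' is taken to be 12 bytes (three 32-bit header fields, matching the three 'get_be32()' reads in the parsing code), and 'filter_at()' is a hypothetical name, not a function in git.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the BDAT header consists of three 32-bit fields. */
#define HEADER_SIZE 12

/*
 * Return a pointer to the filter bytes that span [start, end),
 * where both offsets are relative to the end of the header, or
 * NULL if the range does not fit inside the chunk.
 */
static const unsigned char *filter_at(const unsigned char *chunk,
				      size_t chunk_size,
				      uint32_t start, uint32_t end)
{
	assert(chunk != NULL);

	/* Part 2 of the fix: the chunk must at least hold the header. */
	if (chunk_size < HEADER_SIZE)
		return NULL;

	/* Part 1 of the fix: offsets must stay within the payload. */
	if (start > end || end > chunk_size - HEADER_SIZE)
		return NULL;

	return chunk + HEADER_SIZE + start;
}
```

Checking the header size first keeps the subtraction 'chunk_size - HEADER_SIZE' from wrapping around, which is the overflow the message warns about for an offset of "0".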
|
|
|
|
|
|
|
if (chunk_size < BLOOMDATA_CHUNK_HEADER_SIZE) {
|
|
|
|
warning("ignoring too-small changed-path chunk"
|
|
|
|
" (%"PRIuMAX" < %"PRIuMAX") in commit-graph file",
|
|
|
|
(uintmax_t)chunk_size,
|
|
|
|
(uintmax_t)BLOOMDATA_CHUNK_HEADER_SIZE);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2021-02-18 22:07:35 +08:00
|
|
|
g->chunk_bloom_data = chunk_start;
|
2023-10-10 05:05:50 +08:00
|
|
|
g->chunk_bloom_data_size = chunk_size;
|
2021-02-18 22:07:35 +08:00
|
|
|
hash_version = get_be32(chunk_start);
|
|
|
|
|
|
|
|
if (hash_version != 1)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
|
|
|
|
g->bloom_filter_settings->hash_version = hash_version;
|
|
|
|
g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4);
|
|
|
|
g->bloom_filter_settings->bits_per_entry = get_be32(chunk_start + 8);
|
|
|
|
g->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2022-07-15 05:43:06 +08:00
|
|
|
struct commit_graph *parse_commit_graph(struct repo_settings *s,
|
2020-09-09 23:22:56 +08:00
|
|
|
void *graph_map, size_t graph_size)
|
2019-01-16 06:25:50 +08:00
|
|
|
{
|
2021-02-18 22:07:35 +08:00
|
|
|
const unsigned char *data;
|
2019-01-16 06:25:50 +08:00
|
|
|
struct commit_graph *graph;
|
|
|
|
uint32_t graph_signature;
|
|
|
|
unsigned char graph_version, hash_version;
|
2021-02-18 22:07:35 +08:00
|
|
|
struct chunkfile *cf = NULL;
|
2019-01-16 06:25:50 +08:00
|
|
|
|
|
|
|
if (!graph_map)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
if (graph_size < GRAPH_MIN_SIZE)
|
|
|
|
return NULL;
|
|
|
|
|
2018-04-10 20:56:02 +08:00
|
|
|
data = (const unsigned char *)graph_map;
|
|
|
|
|
|
|
|
graph_signature = get_be32(data);
|
|
|
|
if (graph_signature != GRAPH_SIGNATURE) {
|
2019-03-25 20:08:34 +08:00
|
|
|
error(_("commit-graph signature %X does not match signature %X"),
|
2018-04-10 20:56:02 +08:00
|
|
|
graph_signature, GRAPH_SIGNATURE);
|
2019-01-16 06:25:50 +08:00
|
|
|
return NULL;
|
2018-04-10 20:56:02 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
graph_version = *(unsigned char*)(data + 4);
|
|
|
|
if (graph_version != GRAPH_VERSION) {
|
2019-03-25 20:08:34 +08:00
|
|
|
error(_("commit-graph version %X does not match version %X"),
|
2018-04-10 20:56:02 +08:00
|
|
|
graph_version, GRAPH_VERSION);
|
2019-01-16 06:25:50 +08:00
|
|
|
return NULL;
|
2018-04-10 20:56:02 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
hash_version = *(unsigned char*)(data + 5);
|
2022-05-21 07:17:41 +08:00
|
|
|
if (hash_version != oid_version(the_hash_algo)) {
|
2019-03-25 20:08:34 +08:00
|
|
|
error(_("commit-graph hash version %X does not match version %X"),
|
2022-05-21 07:17:41 +08:00
|
|
|
hash_version, oid_version(the_hash_algo));
|
2019-01-16 06:25:50 +08:00
|
|
|
return NULL;
|
2018-04-10 20:56:02 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
graph = alloc_commit_graph();
|
|
|
|
|
2018-11-14 12:09:35 +08:00
|
|
|
graph->hash_len = the_hash_algo->rawsz;
|
2018-04-10 20:56:02 +08:00
|
|
|
graph->num_chunks = *(unsigned char*)(data + 6);
|
|
|
|
graph->data = graph_map;
|
|
|
|
graph->data_len = graph_size;
|
|
|
|
|
commit-graph: simplify parse_commit_graph() #1
While we iterate over all entries of the Chunk Lookup table we make
sure that we don't attempt to read past the end of the mmap-ed
commit-graph file, and check in each iteration that the chunk ID and
offset we are about to read is still within the mmap-ed memory region.
However, these checks in each iteration are not really necessary,
because the number of chunks in the commit-graph file is already known
before this loop from the just parsed commit-graph header.
So let's check that the commit-graph file is large enough for all
entries in the Chunk Lookup table before we start iterating over those
entries, and drop those per-iteration checks. While at it, take into
account the size of everything that is necessary to have a valid
commit-graph file, i.e. the size of the header, the size of the
mandatory OID Fanout chunk, and the size of the signature in the
trailer as well.
Note that this necessitates changing the error message as well and,
consequently, updating the 'detect incorrect chunk count' test in
't5318-commit-graph.sh'.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
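The up-front size check described in this message can be sketched in isolation. This is only an illustration, not the code in commit-graph.c: the SKETCH_* constants and the function name are hypothetical stand-ins for GRAPH_HEADER_SIZE, CHUNK_TOC_ENTRY_SIZE and GRAPH_FANOUT_SIZE, and their values are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for illustration; the real constants live in git's sources. */
#define SKETCH_HEADER_SIZE 8
#define SKETCH_TOC_ENTRY_SIZE 12	/* 4-byte chunk ID + 8-byte offset */
#define SKETCH_FANOUT_SIZE (4 * 256)

/*
 * Return 1 if a file of `file_size` bytes can hold the header, a table of
 * contents with `num_chunks` entries plus its terminating entry, the
 * mandatory OID fanout chunk, and a trailing hash of `hash_len` bytes.
 */
int sketch_file_can_hold_chunks(size_t file_size, unsigned char num_chunks,
				size_t hash_len)
{
	size_t min_size;

	assert(hash_len > 0);
	min_size = SKETCH_HEADER_SIZE +
		((size_t)num_chunks + 1) * SKETCH_TOC_ENTRY_SIZE +
		SKETCH_FANOUT_SIZE + hash_len;
	return file_size >= min_size;
}
```

Doing the check once before the loop means the per-entry reads of the table of contents need no further range checks of their own.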
2020-06-05 21:00:29 +08:00
|
|
|
if (graph_size < GRAPH_HEADER_SIZE +
|
2021-02-18 22:07:35 +08:00
|
|
|
(graph->num_chunks + 1) * CHUNK_TOC_ENTRY_SIZE +
|
2020-06-05 21:00:29 +08:00
|
|
|
GRAPH_FANOUT_SIZE + the_hash_algo->rawsz) {
|
|
|
|
error(_("commit-graph file is too small to hold %u chunks"),
|
|
|
|
graph->num_chunks);
|
|
|
|
free(graph);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2021-02-18 22:07:35 +08:00
|
|
|
cf = init_chunkfile(NULL);
|
2019-06-19 02:14:26 +08:00
|
|
|
|
2021-02-18 22:07:35 +08:00
|
|
|
if (read_table_of_contents(cf, graph->data, graph_size,
|
midx: enforce chunk alignment on reading
The midx reader assumes chunks are aligned to a 4-byte boundary: we
treat the fanout chunk as an array of uint32_t, indexing it to feed the
results to ntohl(). Without aligning the chunks, we may violate the
CPU's alignment constraints. Though many platforms allow this, some do
not. And certainly UBSan will complain, since it is undefined behavior.
Even though most chunks are naturally 4-byte-aligned (because they are
storing uint32_t or larger types), PNAM is not. It stores NUL-terminated
pack names, so you can have a valid chunk with any length. The writing
side handles this by 4-byte-aligning the chunk, introducing a few extra
NULs as necessary. But since we don't check this on the reading side, we
may end up with a misaligned fanout and trigger the undefined behavior.
We have two options here:
1. Swap out ntohl(fanout[i]) for get_be32(fanout+i) everywhere. The
latter handles alignment itself. It's possible that it's slightly
slower (though in practice I'm not sure how true that is,
especially for these code paths which then go on to do a binary
search).
2. Enforce the alignment when reading the chunks. This is easy to do,
since the table-of-contents reader can check it in one spot.
I went with the second option here, just because it places less burden
on maintenance going forward (it is OK to continue using ntohl), and we
know it can't have any performance impact on the actual reads.
The commit-graph code uses the same chunk API. It's usually also 4-byte
aligned, but some chunks are not (like Bloom filter BDAT chunks). So
we'll pass "1" here to allow any alignment. It doesn't suffer from the
same problem as midx with its fanout because the fanout chunk is always
the first (and the rest of the format dictates that the first chunk will
start aligned).
The new test shows the effect on a midx with a misaligned PNAM chunk.
Note that the midx-reading code treats chunk-toc errors as soft, falling
back to the non-midx path rather than calling die(), as we do for other
parsing errors. Arguably we should make all of these behave the same,
but that's out of scope for this patch. For now the test just expects
the fallback behavior.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
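The two options weighed in this message can be illustrated side by side. This is a sketch under assumptions: both function names are hypothetical, the first mirrors the idea behind get_be32() rather than its actual definition, and the second only models the kind of check a table-of-contents reader could make.

```c
#include <assert.h>
#include <stdint.h>

/* Option 1: byte-wise big-endian load, valid for any alignment of `p`. */
uint32_t sketch_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Option 2: reject misaligned chunks once, while reading the table of
 * contents. An `expected_alignment` of 1 accepts any offset.
 */
int sketch_chunk_is_aligned(uint64_t offset, unsigned expected_alignment)
{
	assert(expected_alignment > 0);
	return offset % expected_alignment == 0;
}
```

With the second option, readers that index a chunk as an array of uint32_t can keep doing so, because a misaligned chunk never reaches them.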
2023-10-10 05:05:23 +08:00
|
|
|
GRAPH_HEADER_SIZE, graph->num_chunks, 1))
|
2021-02-18 22:07:35 +08:00
|
|
|
goto free_and_return;
|
2018-04-10 20:56:02 +08:00
|
|
|
|
2023-10-10 04:59:51 +08:00
|
|
|
read_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, graph_read_oid_fanout, graph);
|
2021-02-18 22:07:35 +08:00
|
|
|
read_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, graph_read_oid_lookup, graph);
|
2023-10-10 05:05:36 +08:00
|
|
|
read_chunk(cf, GRAPH_CHUNKID_DATA, graph_read_commit_data, graph);
|
commit-graph: detect out-of-bounds extra-edges pointers
If an entry in a commit-graph file has more than 2 parents, the
fixed-size parent fields instead point to an offset within an "extra
edges" chunk. We blindly follow these, assuming that the chunk is
present and sufficiently large; this can lead to an out-of-bounds read
for a corrupt or malicious file.
We can fix this by recording the size of the chunk and adding a
bounds-check in fill_commit_in_graph(). There are a few tricky bits:
1. We'll switch from working with a pointer to an offset. This makes
some corner cases just fall out naturally:
a. If we did not find an EDGE chunk at all, our size will
correctly be zero (so everything is "out of bounds").
b. Comparing "size / 4" lets us make sure we have at least 4 bytes
to read, and we never compute a pointer more than one element
past the end of the array (computing a larger pointer is
probably OK in practice, but is technically undefined
behavior).
c. The current code casts to "uint32_t *". Replacing it with an
offset avoids any comparison between different types of pointer
(since the chunk is stored as "unsigned char *").
2. This is the first case in which fill_commit_in_graph() may return
anything but success. We need to make sure to roll back the
"parsed" flag (and any parents we might have added before running
out of buffer) so that the caller can cleanly fall back to
loading the commit object itself.
It's a little non-trivial to do this, and we might benefit from
factoring it out. But we can wait on that until we actually see a
second case where we return an error.
As a bonus, this lets us drop the st_mult() call. Since we've already
done a bounds check, we know there won't be any integer overflow (it
would imply our buffer is larger than a size_t can hold).
The included test does not actually segfault before this patch (though
you could construct a case where it does). Instead, it reads garbage
from the next chunk which results in it complaining about a bogus parent
id. This is sufficient for our needs, though (we care that the fallback
succeeds, and that stderr mentions the out-of-bounds read).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
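The offset-based bounds check from point 1 can be sketched as a small standalone function. The name and signature are hypothetical, and the real fill_commit_in_graph() does more (such as rolling back the "parsed" flag); this only shows the "size / 4" comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read the 32-bit big-endian entry at index `pos` of an extra-edges-like
 * chunk. Returns 0 and stores the value in *out on success, or -1 if
 * `pos` is out of bounds. A missing chunk is modelled as chunk_size == 0,
 * so every position is then out of bounds.
 */
int sketch_read_extra_edge(const unsigned char *chunk, size_t chunk_size,
			   uint32_t pos, uint32_t *out)
{
	const unsigned char *p;

	assert(out);
	if (pos >= chunk_size / 4)
		return -1;
	/* Cannot overflow: pos < chunk_size / 4 was checked above. */
	p = chunk + 4 * (size_t)pos;
	*out = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
	return 0;
}
```

Because the comparison happens on the index, no pointer past the end of the chunk is ever computed, and a trailing partial entry (fewer than 4 bytes) is rejected as well.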
2023-10-10 05:05:38 +08:00
|
|
|
pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges,
|
|
|
|
&graph->chunk_extra_edges_size);
|
commit-graph: bounds-check base graphs chunk
When we are loading a commit-graph chain, we check that each slice of the
chain points to the appropriate set of base graphs via its BASE chunk.
But since we don't record the size of the chunk, we may access
out-of-bounds memory if the file is corrupted.
Since we know the number of entries we expect to find (based on the
position within the commit-graph-chain file), we can just check the size
up front.
In theory this would also let us drop the st_mult() call a few lines
later when we actually access the memory, since we know that the
computed offset will fit in a size_t. But because the operands
"g->hash_len" and "n" have types "unsigned char" and "int", we'd have to
cast to size_t first. Leaving the st_mult() does that cast, and makes it
more obvious that we don't have an overflow problem.
Note that the test does not actually segfault before this patch, since
it just reads garbage from the chunk after BASE (and indeed, it even
rejects the file because that garbage does not have the expected hash
value). You could construct a file with BASE at the end that did
segfault, but corrupting the existing one is easy, and we can check
stderr for the expected message.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
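A minimal sketch of the up-front size check, assuming a hypothetical helper name; it mirrors the "size / hash_len < n" comparison but is not the code in add_graph_to_chain().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if a BASE-like chunk of `chunk_size` bytes holds at least `n`
 * hashes of `hash_len` bytes each. Dividing the size (rather than
 * multiplying n by hash_len) avoids any overflow in the comparison.
 */
int sketch_base_chunk_has_entries(size_t chunk_size, unsigned char hash_len,
				  int n)
{
	assert(hash_len > 0 && n >= 0);
	return chunk_size / hash_len >= (size_t)n;
}
```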
2023-10-10 05:05:41 +08:00
|
|
|
pair_chunk(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs,
|
|
|
|
&graph->chunk_base_graphs_size);
|
2021-02-26 02:19:43 +08:00
|
|
|
|
commit-graph: pass repo_settings instead of repository
The parse_commit_graph() function takes a 'struct repository *' pointer,
but it only ever accesses config settings (either directly or through
the .settings field of the repo struct). Move all relevant config
settings into the repo_settings struct, and update parse_commit_graph()
and its existing callers so that it takes 'struct repo_settings *'
instead.
Callers of parse_commit_graph() will now need to call
prepare_repo_settings() themselves, or initialize a 'struct
repo_settings' directly.
Prior to ab14d0676c (commit-graph: pass a 'struct repository *' in more
places, 2020-09-09), parsing a commit-graph was a pure function
depending only on the contents of the commit-graph itself. Commit
ab14d0676c introduced a dependency on a `struct repository` pointer, and
later commits such as b66d84756f (commit-graph: respect
'commitGraph.readChangedPaths', 2020-09-09) added dependencies on config
settings, which were accessed through the `settings` field of the
repository pointer. This field was initialized via a call to
`prepare_repo_settings()`.
Additionally, this fixes an issue in fuzz-commit-graph: In 44c7e62
(2021-12-06, repo-settings:prepare_repo_settings only in git repos),
prepare_repo_settings was changed to issue a BUG() if it is called by a
process whose CWD is not a Git repository.
The combination of commits mentioned above broke fuzz-commit-graph,
which attempts to parse arbitrary fuzzing-engine-provided bytes as a
commit graph file. Prior to this change, parse_commit_graph() called
prepare_repo_settings(), but since we run the fuzz tests without a valid
repository, we are hitting the BUG() from 44c7e62 for every test case.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-15 05:43:06 +08:00
|
|
|
if (s->commit_graph_generation_version >= 2) {
|
2023-10-10 05:05:44 +08:00
|
|
|
read_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
|
|
|
|
graph_read_generation_data, graph);
|
commit-graph: bounds-check generation overflow chunk
If the generation entry in a commit-graph doesn't fit, we instead insert
an offset into a generation overflow chunk. But since we don't record
the size of the chunk, we may read outside the chunk if the offset we
find on disk is malicious or corrupted.
We can't check the size of the chunk up-front; it will vary based on how
many entries need overflow. So instead, we'll do a bounds-check before
accessing the chunk memory. Unfortunately there is no error-return from
this function, so we'll just have to die(), which is what it does for
other forms of corruption.
As with other cases, we can drop the st_mult() call, since we know our
bounds-checked value will fit within a size_t.
Before this patch, the test here actually "works" because we read
garbage data from the next chunk. And since that garbage data happens
not to provide a generation number which changes the output, it appears
to work. We could construct a case that actually segfaults or produces
wrong output, but it would be a bit tricky. For our purposes it's
sufficient to check that we've detected the bounds error.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-10 05:05:47 +08:00
|
|
|
pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
|
|
|
|
&graph->chunk_generation_data_overflow,
|
|
|
|
&graph->chunk_generation_data_overflow_size);
|
commit-graph: start parsing generation v2 (again)
The 'read_generation_data' member of 'struct commit_graph' was
introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire
chain does, 2021-01-16). The intention was to avoid using corrected
commit dates if not all layers of a commit-graph had that data stored.
The logic in validate_mixed_generation_chain() at that point incorrectly
initialized read_generation_data to 1 if and only if the tip
commit-graph contained the Corrected Commit Date chunk.
This was "fixed" in 448a39e65 (commit-graph: validate layers for
generation data, 2021-02-02) to validate that read_generation_data was
either non-zero for all layers, or it would set read_generation_data to
zero for all layers.
The problem here is that read_generation_data is not initialized to be
non-zero anywhere!
This change initializes read_generation_data immediately after the chunk
is parsed, so each layer will have its value present as soon as
possible.
The read_generation_data member is used in fill_commit_graph_info() to
determine if we should use the corrected commit date or the topological
levels stored in the Commit Data chunk. Due to this bug, all previous
versions of Git were defaulting to topological levels in all cases!
This can be measured with some performance tests. Using the Linux kernel
as a testbed, I generated a complete commit-graph containing corrected
commit dates and tested the 'new' version against the previous, 'old'
version.
First, rev-list with --topo-order demonstrates a 26% improvement using
corrected commit dates:
hyperfine \
-n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \
-n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \
--warmup=10
Benchmark 1: old
Time (mean ± σ): 57.1 ms ± 3.1 ms
Range (min … max): 52.9 ms … 62.0 ms 55 runs
Benchmark 2: new
Time (mean ± σ): 45.5 ms ± 3.3 ms
Range (min … max): 39.9 ms … 51.7 ms 59 runs
Summary
'new' ran
1.26 ± 0.11 times faster than 'old'
These performance improvements are due to the algorithmic improvements
given by walking fewer commits due to the higher cutoffs from corrected
commit dates.
However, this comes at a cost. The additional I/O cost of parsing the
corrected commit dates is visible in case of merge-base commands that do
not reduce the overall number of walked commits.
hyperfine \
-n "old" "$OLD_GIT merge-base v4.8 v4.9" \
-n "new" "$NEW_GIT merge-base v4.8 v4.9" \
--warmup=10
Benchmark 1: old
Time (mean ± σ): 110.4 ms ± 6.4 ms
Range (min … max): 96.0 ms … 118.3 ms 25 runs
Benchmark 2: new
Time (mean ± σ): 150.7 ms ± 1.1 ms
Range (min … max): 149.3 ms … 153.4 ms 19 runs
Summary
'old' ran
1.36 ± 0.08 times faster than 'new'
Performance issues like this are what motivated 702110aac (commit-graph:
use config to specify generation type, 2021-02-25).
In the future, we could fix this performance problem by inserting the
corrected commit date offsets into the Commit Date chunk instead of
having that data in an extra chunk.
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-02 03:48:31 +08:00
|
|
|
|
|
|
|
if (graph->chunk_generation_data)
|
|
|
|
graph->read_generation_data = 1;
|
2021-02-26 02:19:43 +08:00
|
|
|
}
|
2021-02-18 22:07:35 +08:00
|
|
|
|
2022-07-15 05:43:06 +08:00
|
|
|
if (s->commit_graph_read_changed_paths) {
|
2023-10-10 05:05:53 +08:00
|
|
|
read_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
|
|
|
|
graph_read_bloom_index, graph);
|
2021-02-18 22:07:35 +08:00
|
|
|
read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
|
|
|
|
graph_read_bloom_data, graph);
|
2018-04-10 20:56:02 +08:00
|
|
|
}
|
|
|
|
|
2020-04-07 00:59:49 +08:00
|
|
|
if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
|
|
|
|
init_bloom_filters();
|
|
|
|
} else {
|
|
|
|
/* We need both the bloom chunks to exist together. Else ignore the data */
|
|
|
|
graph->chunk_bloom_indexes = NULL;
|
|
|
|
graph->chunk_bloom_data = NULL;
|
2020-05-05 03:13:24 +08:00
|
|
|
FREE_AND_NULL(graph->bloom_filter_settings);
|
2020-04-07 00:59:49 +08:00
|
|
|
}
|
|
|
|
|
2021-04-26 09:02:50 +08:00
|
|
|
oidread(&graph->oid, graph->data + graph->data_len - graph->hash_len);
|
2019-06-19 02:14:26 +08:00
|
|
|
|
2020-05-05 03:13:24 +08:00
|
|
|
if (verify_commit_graph_lite(graph))
|
|
|
|
goto free_and_return;
|
commit-graph: fix segfault on e.g. "git status"
When core.commitGraph=true is set, various common commands now consult
the commit graph. Because the commit-graph code is very trusting of
its input data, it's possible to construct a graph that'll cause an
immediate segfault on e.g. "status" (and e.g. "log", "blame", ...). In
some other cases git immediately exits with a cryptic error
about the graph being broken.
The root cause of this is that while the "commit-graph verify"
sub-command exhaustively verifies the graph, other users of the graph
simply trust the graph, and will e.g. dereference data found at certain
offsets as pointers, causing segfaults.
This change does the bare minimum to ensure that we don't segfault in
the common fill_commit_in_graph() codepath called by
e.g. setup_revisions(). To do this, instrument the "commit-graph
verify" tests to always check if "status" would subsequently
segfault. This fixes the following tests which would previously
segfault:
not ok 50 - detect low chunk count
not ok 51 - detect missing OID fanout chunk
not ok 52 - detect missing OID lookup chunk
not ok 53 - detect missing commit data chunk
Those happened because with the commit-graph enabled setup_revisions()
would eventually call fill_commit_in_graph(), where e.g.
g->chunk_commit_data is used early as an offset (and will be
0x0). With this change we get far enough to detect that the graph is
broken, and show an error instead. E.g.:
$ git status; echo $?
error: commit-graph is missing the Commit Data chunk
1
That also sucks: we should *warn* and not hard-fail "status" just
because the commit-graph is corrupt, but fixing that is left to a
follow-up change.
A side-effect of changing the reporting from graph_report() to error()
is that we now have an "error: " prefix for these even for
"commit-graph verify". Pseudo-diff before/after:
$ git commit-graph verify
-commit-graph is missing the Commit Data chunk
+error: commit-graph is missing the Commit Data chunk
Changing that is OK. Various errors it emits now early on are prefixed
with "error: ", moving these over and changing the output doesn't
break anything.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-03-25 20:08:29 +08:00
|
|
|
|
2021-02-18 22:07:35 +08:00
|
|
|
free_chunkfile(cf);
|
2018-04-10 20:56:02 +08:00
|
|
|
return graph;
|
2020-05-05 03:13:24 +08:00
|
|
|
|
|
|
|
free_and_return:
|
2021-02-18 22:07:35 +08:00
|
|
|
free_chunkfile(cf);
|
2020-05-05 03:13:24 +08:00
|
|
|
free(graph->bloom_filter_settings);
|
|
|
|
free(graph);
|
|
|
|
return NULL;
|
2018-04-10 20:56:02 +08:00
|
|
|
}
|
|
|
|
|
2020-09-09 23:22:56 +08:00
|
|
|
static struct commit_graph *load_commit_graph_one(struct repository *r,
|
|
|
|
const char *graph_file,
|
2020-02-04 05:18:04 +08:00
|
|
|
struct object_directory *odb)
|
2019-03-25 20:08:30 +08:00
|
|
|
{
|
|
|
|
|
|
|
|
struct stat st;
|
|
|
|
int fd;
|
2019-06-19 02:14:27 +08:00
|
|
|
struct commit_graph *g;
|
2019-03-25 20:08:30 +08:00
|
|
|
int open_ok = open_commit_graph(graph_file, &fd, &st);
|
|
|
|
|
|
|
|
if (!open_ok)
|
|
|
|
return NULL;
|
|
|
|
|
2020-09-09 23:22:56 +08:00
|
|
|
g = load_commit_graph_one_fd_st(r, fd, &st, odb);
|
2019-06-19 02:14:27 +08:00
|
|
|
|
|
|
|
if (g)
|
|
|
|
g->filename = xstrdup(graph_file);
|
|
|
|
|
|
|
|
return g;
|
2019-03-25 20:08:30 +08:00
|
|
|
}
|
|
|
|
|
2020-02-04 05:18:00 +08:00
|
|
|
static struct commit_graph *load_commit_graph_v1(struct repository *r,
|
|
|
|
struct object_directory *odb)
|
2019-06-19 02:14:25 +08:00
|
|
|
{
|
commit-graph.c: remove path normalization, comparison
As of the previous patch, all calls to 'commit-graph.c' functions which
perform path normalization (for e.g., 'get_commit_graph_filename()') are
of the form 'ctx->odb->path', which is always in normalized form.
Now that there are no callers passing non-normalized paths to these
functions, ensure that future callers are bound by the same restrictions
by making these functions take a 'struct object_directory *' instead of
a 'const char *'. To match, replace all calls with arguments of the form
'ctx->odb->path' with 'ctx->odb'. To recover the path, functions that
perform path manipulation simply use 'odb->path'.
Further, avoid string comparisons with arguments of the form
'odb->path', and instead prefer raw pointer comparisons, which
accomplish the same effect, but are far less brittle.
This has a pleasant side-effect of making these functions much more
robust to paths that cannot be normalized by 'normalize_path_copy()',
i.e., because they are outside of the current working directory.
For example, prior to this patch, Valgrind reports an uninitialized
memory read for the following command [1]:
$ ( cd t && GIT_DIR=../.git valgrind git rev-parse HEAD^ )
because 'normalize_path_copy()' can't normalize '../.git' (since it's
relative to, but above, the current working directory) [2].
By using a 'struct object_directory *' directly,
'get_commit_graph_filename()' does not need to normalize, because all
paths are relative to the current working directory since they are
always read from the '->path' of an object directory.
[1]: https://lore.kernel.org/git/20191027042116.GA5801@sigill.intra.peff.net.
[2]: The bug here is that 'get_commit_graph_filename()' returns the
result of 'normalize_path_copy()' without checking the return
value.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04 05:18:02 +08:00
|
|
|
char *graph_name = get_commit_graph_filename(odb);
|
2020-09-09 23:22:56 +08:00
|
|
|
struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
|
2019-06-19 02:14:25 +08:00
|
|
|
free(graph_name);
|
|
|
|
|
|
|
|
return g;
|
|
|
|
}
|
|
|
|
|
2023-09-28 12:38:17 +08:00
|
|
|
/*
|
|
|
|
* returns 1 if and only if all graphs in the chain have
|
|
|
|
* corrected commit dates stored in the generation_data chunk.
|
|
|
|
*/
|
|
|
|
static int validate_mixed_generation_chain(struct commit_graph *g)
|
|
|
|
{
|
|
|
|
int read_generation_data = 1;
|
|
|
|
struct commit_graph *p = g;
|
|
|
|
|
|
|
|
while (read_generation_data && p) {
|
|
|
|
read_generation_data = p->read_generation_data;
|
|
|
|
p = p->base_graph;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (read_generation_data)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
while (g) {
|
|
|
|
g->read_generation_data = 0;
|
|
|
|
g = g->base_graph;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-06-19 02:14:25 +08:00
|
|
|
static int add_graph_to_chain(struct commit_graph *g,
|
|
|
|
struct commit_graph *chain,
|
|
|
|
struct object_id *oids,
|
|
|
|
int n)
|
|
|
|
{
|
|
|
|
struct commit_graph *cur_g = chain;
|
|
|
|
|
2019-06-19 02:14:26 +08:00
|
|
|
if (n && !g->chunk_base_graphs) {
|
|
|
|
warning(_("commit-graph has no base graphs chunk"));
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-10-10 05:05:41 +08:00
|
|
|
if (g->chunk_base_graphs_size / g->hash_len < n) {
|
|
|
|
warning(_("commit-graph base graphs chunk is too small"));
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-06-19 02:14:25 +08:00
|
|
|
while (n) {
|
|
|
|
n--;
|
2019-06-19 02:14:26 +08:00
|
|
|
|
|
|
|
if (!cur_g ||
|
|
|
|
!oideq(&oids[n], &cur_g->oid) ||
|
2023-07-13 07:37:57 +08:00
|
|
|
!hasheq(oids[n].hash, g->chunk_base_graphs + st_mult(g->hash_len, n))) {
|
2019-06-19 02:14:26 +08:00
|
|
|
warning(_("commit-graph chain does not match"));
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-06-19 02:14:25 +08:00
|
|
|
cur_g = cur_g->base_graph;
|
|
|
|
}
|
|
|
|
|
|
|
|
g->base_graph = chain;
|
|
|
|
|
2023-07-13 07:37:57 +08:00
|
|
|
if (chain) {
|
|
|
|
if (unsigned_add_overflows(chain->num_commits,
|
|
|
|
chain->num_commits_in_base)) {
|
|
|
|
warning(_("commit count in base graph too high: %"PRIuMAX),
|
|
|
|
(uintmax_t)chain->num_commits_in_base);
|
|
|
|
return 0;
|
|
|
|
}
|
2019-06-19 02:14:25 +08:00
|
|
|
g->num_commits_in_base = chain->num_commits + chain->num_commits_in_base;
|
2023-07-13 07:37:57 +08:00
|
|
|
}
|
2019-06-19 02:14:25 +08:00
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
commit-graph: factor out chain opening function
The load_commit_graph_chain() function opens the chain file and all of
the slices of graph that it points to. If there is no chain file (which
is a totally normal condition), we return NULL. But if we run into
errors with the chain file or loading the actual graph data, we also
return NULL, and the caller cannot tell the difference.
The caller can check for ENOENT for the unremarkable "no such file"
case. But I'm hesitant to assume that the rest of the function would
never accidentally set errno to ENOENT itself, since it is opening the
slice files (and that would mean the caller fails to notice a real
error).
So let's break this into two functions: one to open the file, and one to
actually load it. This matches the interface we provide for the
non-chain graph file, which will also come in handy in a moment when we
fix some bugs in the "git commit-graph verify" code.
Some notes:
- I've kept the "1 is good, 0 is bad" return convention (and the weird
"fd" out-parameter) used by the matching open_commit_graph()
function and other parts of the commit-graph code. This is unlike
most of the rest of Git (which would just return the fd, with -1 for
error), but it makes sense to stay consistent with the adjacent bits
of the API here.
- The existing chain loading function will quietly return if the file
is too small to hold a single entry. I've retained that behavior
(and explicitly set ENOENT in the opener function) for now, under
the notion that it's probably valid (though I'd imagine unusual) to
have an empty chain file.
There are two small behavior changes here, but I think both are strictly
positive:
1. The original blindly did a stat() before checking if fopen()
succeeded, meaning we were making a pointless extra stat call.
2. We now use fstat() to check the file size. The previous code using
a regular stat() on the pathname meant we could technically race
with somebody updating the chain file, and end up with a size that
does not match what we just opened with fopen(). I doubt anybody
ever hit this in practice, but it may have caused an out-of-bounds
read.
We'll retain the load_commit_graph_chain() function which does both the
open and reading steps (most existing callers do not care about seeing
errors anyway, since loading commit-graphs is optimistic).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-28 12:38:07 +08:00
|
|
|
int open_commit_graph_chain(const char *chain_file,
|
|
|
|
int *fd, struct stat *st)
|
|
|
|
{
|
|
|
|
*fd = git_open(chain_file);
|
|
|
|
if (*fd < 0)
|
|
|
|
return 0;
|
|
|
|
if (fstat(*fd, st)) {
|
|
|
|
close(*fd);
|
|
|
|
return 0;
|
|
|
|
}
|
commit-graph: tighten chain size check
When we open a commit-graph-chain file, if it's smaller than a single
entry, we just quietly treat that as ENOENT. That makes some sense if the
file is truly zero bytes, but it means that "commit-graph verify" will
quietly ignore a file that contains garbage if that garbage happens to
be short.
Instead, let's only simulate ENOENT when the file is truly empty, and
otherwise return EINVAL. The normal graph-loading routines don't care,
but "commit-graph verify" will notice and complain about the difference.
It's not entirely clear to me that the 0-is-ENOENT case actually happens
in real life, so we could perhaps just eliminate this special-case
altogether. But this is how we've always behaved, so I'm preserving it
in the name of backwards compatibility (though again, it really only
matters for "verify", as the regular routines are happy to load what
they can).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-28 12:39:10 +08:00
|
|
|
if (st->st_size < the_hash_algo->hexsz) {
|
commit-graph: factor out chain opening function
2023-09-28 12:38:07 +08:00
|
|
|
close(*fd);
|
commit-graph: tighten chain size check
2023-09-28 12:39:10 +08:00
|
|
|
if (!st->st_size) {
|
|
|
|
/* treat empty files the same as missing */
|
|
|
|
errno = ENOENT;
|
|
|
|
} else {
|
|
|
|
warning("commit-graph chain file too small");
|
|
|
|
errno = EINVAL;
|
|
|
|
}
|
commit-graph: factor out chain opening function
2023-09-28 12:38:07 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
return 1;
|
|
|
|
}
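The distinction the opener draws between a missing, an empty, and a too-short chain file can be sketched outside of Git. The following is only an illustration under stated assumptions: `sketch_open_chain()`, `sketch_verify_chain()` and `SKETCH_HEXSZ` are hypothetical names, plain `open()` stands in for `git_open()`, and the `warning()` call is omitted. It keeps the "1 is good, 0 is bad" convention and the `fd` out-parameter described in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Assumed stand-in for the_hash_algo->hexsz (40 hex digits for SHA-1). */
#define SKETCH_HEXSZ 40

/*
 * Hypothetical opener built on plain POSIX calls.
 * Returns 1 on success and 0 on failure, with errno describing the failure.
 */
static int sketch_open_chain(const char *path, int *fd, struct stat *st)
{
	assert(path && fd && st);

	*fd = open(path, O_RDONLY);
	if (*fd < 0)
		return 0; /* errno comes from open(), e.g. ENOENT */
	if (fstat(*fd, st)) {
		int saved_errno = errno;
		close(*fd);
		errno = saved_errno;
		return 0;
	}
	if (st->st_size < SKETCH_HEXSZ) {
		close(*fd);
		/* an empty file is treated as missing; short garbage is invalid */
		errno = st->st_size ? EINVAL : ENOENT;
		return 0;
	}
	return 1;
}

/*
 * Hypothetical "verify"-style caller: returns 0 when there is nothing to
 * complain about (including "no chain file") and 1 when an error should
 * be reported.
 */
static int sketch_verify_chain(const char *path)
{
	struct stat st;
	int fd;

	if (!sketch_open_chain(path, &fd, &st))
		return errno == ENOENT ? 0 : 1;
	close(fd);
	return 0;
}
```

Because the size comes from `fstat()` on the descriptor that was just opened, the size and the contents read later refer to the same file, which is the race the commit message mentions for the older `stat()`-on-pathname approach.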
|
|
|
|
|
|
|
|
struct commit_graph *load_commit_graph_chain_fd_st(struct repository *r,
|
2023-09-28 12:39:51 +08:00
|
|
|
int fd, struct stat *st,
|
|
|
|
int *incomplete_chain)
|
2019-06-19 02:14:25 +08:00
|
|
|
{
|
|
|
|
struct commit_graph *graph_chain = NULL;
|
|
|
|
struct strbuf line = STRBUF_INIT;
|
|
|
|
struct object_id *oids;
|
|
|
|
int i = 0, valid = 1, count;
|
commit-graph: factor out chain opening function
2023-09-28 12:38:07 +08:00
|
|
|
FILE *fp = xfdopen(fd, "r");
|
2019-06-19 02:14:25 +08:00
|
|
|
|
commit-graph: factor out chain opening function
2023-09-28 12:38:07 +08:00
|
|
|
count = st->st_size / (the_hash_algo->hexsz + 1);
|
2021-03-14 00:17:22 +08:00
|
|
|
CALLOC_ARRAY(oids, count);
|
2019-06-19 02:14:25 +08:00
|
|
|
|
2019-06-19 02:14:30 +08:00
|
|
|
prepare_alt_odb(r);
|
|
|
|
|
|
|
|
for (i = 0; i < count; i++) {
|
|
|
|
struct object_directory *odb;
|
2019-06-19 02:14:25 +08:00
|
|
|
|
|
|
|
if (strbuf_getline_lf(&line, fp) == EOF)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (get_oid_hex(line.buf, &oids[i])) {
|
|
|
|
warning(_("invalid commit-graph chain: line '%s' not a hash"),
|
|
|
|
line.buf);
|
|
|
|
valid = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2019-06-19 02:14:30 +08:00
|
|
|
valid = 0;
|
|
|
|
for (odb = r->objects->odb; odb; odb = odb->next) {
|
commit-graph.c: remove path normalization, comparison
As of the previous patch, all calls to 'commit-graph.c' functions which
perform path normalization (for e.g., 'get_commit_graph_filename()') are
of the form 'ctx->odb->path', which is always in normalized form.
Now that there are no callers passing non-normalized paths to these
functions, ensure that future callers are bound by the same restrictions
by making these functions take a 'struct object_directory *' instead of
a 'const char *'. To match, replace all calls with arguments of the form
'ctx->odb->path' with 'ctx->odb'. To recover the path, functions that
perform path manipulation simply use 'odb->path'.
Further, avoid string comparisons with arguments of the form
'odb->path', and instead prefer raw pointer comparisons, which
accomplish the same effect, but are far less brittle.
This has a pleasant side-effect of making these functions much more
robust to paths that cannot be normalized by 'normalize_path_copy()',
i.e., because they are outside of the current working directory.
For example, prior to this patch, Valgrind reports an uninitialized
memory read [1] for the following command:
$ ( cd t && GIT_DIR=../.git valgrind git rev-parse HEAD^ )
because 'normalize_path_copy()' can't normalize '../.git' (since it's
relative to, but above, the current working directory) [2].
By using a 'struct object_directory *' directly,
'get_commit_graph_filename()' does not need to normalize, because all
paths are relative to the current working directory since they are
always read from the '->path' of an object directory.
[1]: https://lore.kernel.org/git/20191027042116.GA5801@sigill.intra.peff.net.
[2]: The bug here is that 'get_commit_graph_filename()' returns the
result of 'normalize_path_copy()' without checking the return
value.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04 05:18:02 +08:00
|
|
|
char *graph_name = get_split_graph_filename(odb, line.buf);
|
2020-09-09 23:22:56 +08:00
|
|
|
struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
|
2019-06-19 02:14:25 +08:00
|
|
|
|
2019-06-19 02:14:30 +08:00
|
|
|
free(graph_name);
|
|
|
|
|
|
|
|
if (g) {
|
|
|
|
if (add_graph_to_chain(g, graph_chain, oids, i)) {
|
|
|
|
graph_chain = g;
|
|
|
|
valid = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!valid) {
|
|
|
|
warning(_("unable to find all commit-graph files"));
|
|
|
|
break;
|
|
|
|
}
|
2019-06-19 02:14:25 +08:00
|
|
|
}
|
|
|
|
|
2023-09-28 12:38:17 +08:00
|
|
|
validate_mixed_generation_chain(graph_chain);
|
|
|
|
|
2019-06-19 02:14:25 +08:00
|
|
|
free(oids);
|
|
|
|
fclose(fp);
|
2019-08-07 19:15:02 +08:00
|
|
|
strbuf_release(&line);
|
2019-06-19 02:14:25 +08:00
|
|
|
|
2023-09-28 12:39:51 +08:00
|
|
|
*incomplete_chain = !valid;
|
2019-06-19 02:14:25 +08:00
|
|
|
return graph_chain;
|
|
|
|
}
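The entry count used by the loader above follows from the chain file layout: each entry is one hash in hex followed by a newline, so the file size divided by `hexsz + 1` bounds the number of lines to read. A minimal sketch of that arithmetic and of the line-by-line parsing loop follows. It makes several assumptions: `SKETCH_HEXSZ`, `sketch_chain_count()`, `sketch_is_hash_line()` and `sketch_read_chain()` are hypothetical names, the hex check is stricter than necessary for illustration, and the step that loads each graph slice from an object directory is left out entirely.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Assumed stand-in for the_hash_algo->hexsz (40 hex digits for SHA-1). */
#define SKETCH_HEXSZ 40

/* Upper bound on entries: SKETCH_HEXSZ hex digits plus one '\n' per line. */
static size_t sketch_chain_count(size_t file_size)
{
	return file_size / (SKETCH_HEXSZ + 1);
}

/* Returns 1 if line (newline already stripped) is exactly one hex hash. */
static int sketch_is_hash_line(const char *line)
{
	size_t i;

	if (strlen(line) != SKETCH_HEXSZ)
		return 0;
	for (i = 0; i < SKETCH_HEXSZ; i++)
		if (!isxdigit((unsigned char)line[i]))
			return 0;
	return 1;
}

/*
 * Reads at most `count` lines from fp. Returns the number of well-formed
 * hash lines seen before stopping, and sets *incomplete to 1 if a
 * malformed line ended the loop early.
 */
static size_t sketch_read_chain(FILE *fp, size_t count, int *incomplete)
{
	char buf[SKETCH_HEXSZ + 64];
	size_t i;

	assert(fp && incomplete);
	*incomplete = 0;
	for (i = 0; i < count; i++) {
		if (!fgets(buf, sizeof(buf), fp))
			break; /* EOF before `count` lines were read */
		buf[strcspn(buf, "\n")] = '\0';
		if (!sketch_is_hash_line(buf)) {
			*incomplete = 1;
			break;
		}
	}
	return i;
}
```

In the real function the "incomplete" flag also covers the case where a listed slice cannot be found in any object directory; the sketch only models the malformed-line case.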
|
|
|
|
|
commit-graph: factor out chain opening function
2023-09-28 12:38:07 +08:00
|
|
|
static struct commit_graph *load_commit_graph_chain(struct repository *r,
|
|
|
|
struct object_directory *odb)
|
|
|
|
{
|
|
|
|
char *chain_file = get_commit_graph_chain_filename(odb);
|
|
|
|
struct stat st;
|
|
|
|
int fd;
|
|
|
|
struct commit_graph *g = NULL;
|
|
|
|
|
|
|
|
if (open_commit_graph_chain(chain_file, &fd, &st)) {
|
2023-09-28 12:39:51 +08:00
|
|
|
int incomplete;
|
commit-graph: factor out chain opening function
2023-09-28 12:38:07 +08:00
|
|
|
/* ownership of fd is taken over by load function */
|
2023-09-28 12:39:51 +08:00
|
|
|
g = load_commit_graph_chain_fd_st(r, fd, &st, &incomplete);
|
commit-graph: factor out chain opening function
2023-09-28 12:38:07 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
free(chain_file);
|
|
|
|
return g;
|
|
|
|
}
|
|
|
|
|
2020-02-04 05:18:00 +08:00
|
|
|
struct commit_graph *read_commit_graph_one(struct repository *r,
|
|
|
|
struct object_directory *odb)
|
2019-06-19 02:14:25 +08:00
|
|
|
{
|
2020-02-04 05:18:00 +08:00
|
|
|
struct commit_graph *g = load_commit_graph_v1(r, odb);
|
2019-06-19 02:14:25 +08:00
|
|
|
|
|
|
|
if (!g)
|
2020-02-04 05:18:00 +08:00
|
|
|
g = load_commit_graph_chain(r, odb);
|
2019-06-19 02:14:25 +08:00
|
|
|
|
|
|
|
return g;
|
2019-03-25 20:08:30 +08:00
|
|
|
}
|
|
|
|
|
2020-02-04 05:18:00 +08:00
|
|
|
static void prepare_commit_graph_one(struct repository *r,
|
|
|
|
struct object_directory *odb)
|
2018-04-10 20:56:05 +08:00
|
|
|
{
|
|
|
|
|
2018-07-12 06:42:42 +08:00
|
|
|
if (r->objects->commit_graph)
|
2018-04-10 20:56:05 +08:00
|
|
|
return;
|
|
|
|
|
2020-02-04 05:18:00 +08:00
|
|
|
r->objects->commit_graph = read_commit_graph_one(r, odb);
|
2018-04-10 20:56:05 +08:00
|
|
|
}
|
|
|
|
|
2018-07-12 06:42:37 +08:00
|
|
|
/*
|
|
|
|
* Return 1 if commit_graph is non-NULL, and 0 otherwise.
|
|
|
|
*
|
2019-11-06 01:07:23 +08:00
|
|
|
* On the first invocation, this function attempts to load the commit
|
2018-07-12 06:42:37 +08:00
|
|
|
* graph if the repository 'r' is configured to have one.
|
|
|
|
*/
|
2018-07-12 06:42:42 +08:00
|
|
|
static int prepare_commit_graph(struct repository *r)
|
2018-04-10 20:56:05 +08:00
|
|
|
{
|
2018-11-12 22:48:47 +08:00
|
|
|
struct object_directory *odb;
|
2018-07-12 06:42:42 +08:00
|
|
|
|
upload-pack: disable commit graph more gently for shallow traversal
When the client has asked for certain shallow options like
"deepen-since", we do a custom rev-list walk that pretends to be
shallow. Before doing so, we have to disable the commit-graph, since it
is not compatible with the shallow view of the repository. That's
handled by 829a321569 (commit-graph: close_commit_graph before shallow
walk, 2018-08-20). That commit literally closes and frees our
repo->objects->commit_graph struct.
That creates an interesting problem for commits that have _already_ been
parsed using the commit graph. Their commit->object.parsed flag is set,
their commit->graph_pos is set, but their commit->maybe_tree may still
be NULL. When somebody later calls repo_get_commit_tree(), we see that
we haven't loaded the tree oid yet and try to get it from the commit
graph. But since it has been freed, we segfault!
So the root of the issue is a data dependency between the commit's
lazy-load of the tree oid and the fact that the commit graph can go
away mid-process. How can we resolve it?
There are a couple of general approaches:
1. The obvious answer is to avoid loading the tree from the graph when
we see that it's NULL. But then what do we return for the tree oid?
If we return NULL, our caller in do_traverse() will rightly
complain that we have no tree. We'd have to fall back to loading the
actual commit object and re-parsing it. That requires teaching
parse_commit_buffer() to understand re-parsing (i.e., not starting
from a clean slate and not leaking any allocated bits like parent
list pointers).
2. When we close the commit graph, walk through the set of in-memory
objects and clear any graph_pos pointers. But this means we also
have to "unparse" any such commits so that we know they still need
to open the commit object to fill in their trees. So it's no less
complicated than (1), and is more expensive (since we clear objects
we might not later need).
3. Stop freeing the commit-graph struct. Continue to let it be used
for lazy-loads of tree oids, but let upload-pack specify that it
shouldn't be used for further commit parsing.
4. Push the whole shallow rev-list out to its own sub-process, with
the commit-graph disabled from the start, giving it a clean memory
space to work from.
I've chosen (3) here. Options (1) and (2) would work, but are
non-trivial to implement. Option (4) is more expensive, and I'm not sure
how complicated it is (shelling out for the actual rev-list part is
easy, but we do then parse the resulting commits internally, and I'm not
clear which parts need to be handling shallow-ness).
The new test in t5500 triggers this segfault, but see the comments there
for how horribly intimate it has to be with how both upload-pack and
commit graphs work.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-12 22:44:45 +08:00
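Option (3) from the message above can be sketched as a flag that is checked before anything else, while the graph itself stays allocated. This is a simplified model, not Git's code: `struct sketch_repo`, `struct sketch_graph`, `sketch_prepare_graph()` and `sketch_disable_graph()` are hypothetical names, and the actual loading step is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the loaded commit-graph data. */
struct sketch_graph {
	int num_commits;
};

/* Hypothetical stand-in for the repository state involved. */
struct sketch_repo {
	struct sketch_graph *graph; /* stays allocated for lazy tree lookups */
	int graph_disabled;         /* set instead of freeing the graph */
	int graph_attempted;        /* have we already tried to load it? */
};

/*
 * Returns 1 if the graph may be used for parsing commits, 0 otherwise.
 * The disabled check comes before the "already attempted?" check so that
 * even an already-loaded graph is ignored once it has been disabled.
 */
static int sketch_prepare_graph(struct sketch_repo *r)
{
	assert(r);
	if (r->graph_disabled)
		return 0;
	if (r->graph_attempted)
		return r->graph != NULL;
	r->graph_attempted = 1;
	/* a real loader would try to read the graph files here */
	return r->graph != NULL;
}

/* Stops further use for commit parsing without freeing the struct. */
static void sketch_disable_graph(struct sketch_repo *r)
{
	assert(r);
	r->graph_disabled = 1;
}
```

Since the struct is never freed, commits that were parsed earlier can still resolve their tree lazily through it, which is what avoids the segfault described above.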
|
|
|
/*
|
2021-12-06 23:55:56 +08:00
|
|
|
* Early return if there is no git dir or if the commit graph is
|
|
|
|
* disabled.
|
|
|
|
*
|
upload-pack: disable commit graph more gently for shallow traversal
2019-09-12 22:44:45 +08:00
	 * This must come before the "already attempted?" check below, because
	 * we want to disable even an already-loaded graph file.
	 */
2021-12-06 23:55:56 +08:00
	if (!r->gitdir || r->commit_graph_disabled)
2019-09-12 22:44:45 +08:00
		return 0;
commit-graph write: don't die if the existing graph is corrupt
When the commit-graph is written we end up calling
parse_commit(). This will in turn invoke code that'll consult the
existing commit-graph about the commit, if the graph is corrupted we
die.
We thus get into a state where a failing "commit-graph verify" can't
be followed-up with a "commit-graph write" if core.commitGraph=true is
set, the graph either needs to be manually removed to proceed, or
core.commitGraph needs to be set to "false".
Change the "commit-graph write" codepath to use a new
parse_commit_no_graph() helper instead of parse_commit() to avoid
this. The latter will call repo_parse_commit_internal() with
use_commit_graph=1 as seen in 177722b344 ("commit: integrate commit
graph with commit parsing", 2018-04-10).
Not using the old graph at all slows down the writing of the new graph
by some small amount, but is a sensible way to prevent an error in the
existing commit-graph from spreading.
Just fixing the current issue would be likely to result in code that's
inadvertently broken in the future. New code might use the
commit-graph at a distance. To detect such cases introduce a
"GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD" setting used when we do our
corruption tests, and test that a "write/verify" combo works after
every one of our current test cases where we now detect commit-graph
corruption.
Some of the code changes here might be strictly unnecessary, e.g. I
was unable to find cases where the parse_commit() called from
write_graph_chunk_data() didn't exit early due to
"item->object.parsed" being true in
repo_parse_commit_internal() (before the use_commit_graph=1 has any
effect). But let's also convert those cases for good measure, we do
not have exhaustive tests for all possible types of commit-graph
corruption.
This might need to be re-visited if we learn to write the commit-graph
incrementally, but probably not. Hopefully we'll just start by finding
out what commits we have in total, then read the old graph(s) to see
what they cover, and finally write a new graph file with everything
that's missing. In that case the new graph writing code just needs to
continue to use e.g. a parse_commit() that doesn't consult the
existing commit-graphs.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-03-25 20:08:33 +08:00
2018-07-12 06:42:42 +08:00
	if (r->objects->commit_graph_attempted)
		return !!r->objects->commit_graph;
	r->objects->commit_graph_attempted = 1;

2019-08-14 02:37:43 +08:00
	prepare_repo_settings(r);

2018-08-29 20:49:04 +08:00
	if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
2019-08-14 02:37:43 +08:00
	    r->settings.core_commit_graph != 1)
2018-07-12 06:42:42 +08:00
		/*
		 * This repository is not configured to use commit graphs, so
		 * do not load one. (But report commit_graph_attempted anyway
		 * so that commit graph loading is not attempted again for this
		 * repository.)
		 */
2018-07-12 06:42:37 +08:00
		return 0;

2018-08-21 02:24:27 +08:00
	if (!commit_graph_compatible(r))
		return 0;

2018-07-12 06:42:42 +08:00
	prepare_alt_odb(r);
sha1-file: use an object_directory for the main object dir
Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:
do_something(r->objects->objdir);
for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
do_something(odb->path);
That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).
Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).
A few observations:
- we don't need r->objects->objectdir anymore, and can just
mechanically convert that to r->objects->odb->path
- object_directory's path field needs to become a real pointer rather
than a FLEX_ARRAY, in order to fill it with expand_base_dir()
- we'll call prepare_alt_odb() earlier in many functions (i.e.,
outside of the loop). This may result in us calling it even when our
function would be satisfied looking only at the main odb.
But this doesn't matter in practice. It's not a very expensive
operation in the first place, and in the majority of cases it will
be a noop. We call it already (and cache its results) in
prepare_packed_git(), and we'll generally check packs before loose
objects. So essentially every program is going to call it
immediately once per program.
Arguably we should just prepare_alt_odb() immediately upon setting
up the repository's object directory, which would save us sprinkling
calls throughout the code base (and forgetting to do so has been a
source of subtle bugs in the past). But I've stopped short of that
here, since there are already a lot of other moving parts in this
patch.
- Most call sites just get shorter. The check_and_freshen() functions
are an exception, because they have entry points to handle local and
nonlocal directories separately.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 22:50:39 +08:00
	for (odb = r->objects->odb;
2018-11-12 22:48:47 +08:00
	     !r->objects->commit_graph && odb;
	     odb = odb->next)
2020-02-04 05:18:00 +08:00
		prepare_commit_graph_one(r, odb);
2018-07-12 06:42:42 +08:00
	return !!r->objects->commit_graph;
2018-04-10 20:56:05 +08:00
}
commit-reach: use can_all_from_reach
The is_descendant_of method previously used in_merge_bases() to check if
the commit can reach any of the commits in the provided list. This had
two performance problems:
1. The performance is quadratic in the worst case.
2. A single in_merge_bases() call requires walking beyond the target
commit in order to find the full set of boundary commits that may be
merge-bases.
The can_all_from_reach method avoids this quadratic behavior and can
limit the search beyond the target commits using generation numbers. It
requires a small prototype adjustment to stop using commit-date as a
cutoff, as that optimization is no longer appropriate here.
Since in_merge_bases() uses paint_down_to_common(), is_descendant_of()
naturally found cutoffs to avoid walking the entire commit graph. Since
we want to always return the correct result, we cannot use the
min_commit_date cutoff in can_all_from_reach. We then rely on generation
numbers to provide the cutoff.
Since not all repos will have a commit-graph file, nor will we always
have generation numbers computed for a commit-graph file, create a new
method, generation_numbers_enabled(), that checks for a commit-graph
file and sees if the first commit in the file has a non-zero generation
number. In the case that we do not have generation numbers, use the old
logic for is_descendant_of().
Performance was measured on a copy of the Linux repository using the
'test-tool reach is_descendant_of' command using this input:
A:v4.9
X:v4.10
X:v4.11
X:v4.12
X:v4.13
X:v4.14
X:v4.15
X:v4.16
X:v4.17
X:v3.0
Note that this input is tailored to demonstrate the quadratic nature of
the previous method, as it will compute merge-bases for v4.9 versus all
of the later versions before checking against v4.1.
Before: 0.26 s
After: 0.21 s
Since we previously used the is_descendant_of method in the ref_newer
method, we also measured performance there using
'test-tool reach ref_newer' with this input:
A:v4.9
B:v3.19
Before: 0.10 s
After: 0.08 s
By adding a new commit with parent v3.19, we test the non-reachable case
of ref_newer:
Before: 0.09 s
After: 0.08 s
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
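The "non-zero generation number" check described above reads one field
of the first commit-data entry. The following sketch decodes that field
and the committer date packed next to it, using the same offsets and
masks that appear in this file (hash_len + 8, ">> 2", "& 0x3",
hash_len + 12). The helper names and read_be32() are hypothetical; Git
itself uses get_be32().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a big-endian 32-bit value from a buffer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Generation number: the upper 30 bits of the word at hash_len + 8. */
static uint32_t cdat_generation(const unsigned char *entry, size_t hash_len)
{
	return read_be32(entry + hash_len + 8) >> 2;
}

/*
 * Committer date: the low 2 bits of the same word are date bits 32-33,
 * and the word at hash_len + 12 holds the low 32 bits (34 bits total).
 */
static uint64_t cdat_commit_date(const unsigned char *entry, size_t hash_len)
{
	uint64_t date_high = read_be32(entry + hash_len + 8) & 0x3;
	uint64_t date_low = read_be32(entry + hash_len + 12);

	return (date_high << 32) | date_low;
}

/* Mirrors the idea of the check: does the first entry carry a generation? */
static int has_generation_numbers(const unsigned char *first_entry,
				  size_t hash_len)
{
	return cdat_generation(first_entry, hash_len) != 0;
}
```

For a 20-byte hash, an entry whose word at offset 28 is 0x00000015
decodes to generation 5 with date bit 32 set.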
2018-07-21 00:33:30 +08:00
int generation_numbers_enabled(struct repository *r)
{
	uint32_t first_generation;
	struct commit_graph *g;
	if (!prepare_commit_graph(r))
		return 0;

	g = r->objects->commit_graph;

	if (!g->num_commits)
		return 0;

	first_generation = get_be32(g->chunk_commit_data +
				    g->hash_len + 8) >> 2;

	return !!first_generation;
}
commit-reach: use corrected commit dates in paint_down_to_common()
091f4cf (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.
For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.
While Git walks nearly the same number of commits with corrected commit
dates as with commit dates, the process is slower because each comparison
has to access a commit-slab (for corrected committer date) instead of
a struct member (for committer date).
This change incidentally broke the fragile t6404-recursive-merge test.
t6404-recursive-merge sets up a unique repository where all commits have
the same committer date without a well-defined merge-base.
While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.
With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.
While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.
As this has already caused problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
within t6404-recursive-merge.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:17 +08:00
int corrected_commit_dates_enabled(struct repository *r)
{
	struct commit_graph *g;
	if (!prepare_commit_graph(r))
		return 0;

	g = r->objects->commit_graph;

	if (!g->num_commits)
		return 0;

	return g->read_generation_data;
}
commit-graph: introduce 'get_bloom_filter_settings()'
Many places in the code often need a pointer to the commit-graph's
'struct bloom_filter_settings', in which case they often take the value
from the top-most commit-graph.
In the non-split case, this works as expected. In the split case,
however, things get a little tricky. Not all layers in a chain of
incremental commit-graphs are required to themselves have Bloom data,
and so whether or not some part of the code uses Bloom filters depends
entirely on whether or not the top-most level of the commit-graph chain
has Bloom filters.
This has been the behavior since Bloom filters were introduced, and has
been codified into the tests since a759bfa9ee (t4216: add end to end
tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
requires that Bloom filters are not used in exactly the case described
earlier.
There is no reason that this needs to be the case, since it is perfectly
valid for commits in an earlier layer to have Bloom filters when commits
in a newer layer do not.
Since Bloom settings are guaranteed in practice to be the same for any
layer in a chain that has Bloom data, it is sufficient to traverse the
'->base_graph' pointer until either (1) a non-null 'struct
bloom_filter_settings *' is found, or (2) until we are at the root of
the commit-graph chain.
Introduce a 'get_bloom_filter_settings()' function that does just this,
and use it instead of purely dereferencing the top-most graph's
'->bloom_filter_settings' pointer.
While we're at it, add an additional test in t5324 to guard against code
in the commit-graph writing machinery that doesn't correctly handle a
NULL 'struct bloom_filter *'.
Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
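The split-chain case described above can be tried in isolation with a
toy model. The toy_layer and toy_settings types and find_settings() are
hypothetical; only the walk along the base pointer until a non-NULL
settings pointer is found (or the root is passed) follows the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a commit-graph layer and its Bloom settings. */
struct toy_settings {
	int bits_per_entry;
};

struct toy_layer {
	struct toy_settings *settings;	/* NULL if this layer has no Bloom data */
	struct toy_layer *base;		/* next older layer, NULL at the root */
};

/* Return the first settings found walking from the top layer to the root. */
static struct toy_settings *find_settings(struct toy_layer *top)
{
	struct toy_layer *g;

	for (g = top; g; g = g->base)
		if (g->settings)
			return g->settings;
	return NULL;
}
```

A chain whose newest layer lacks Bloom data but whose base layer has it
yields the base layer's settings, instead of NULL as a lookup limited to
the top-most layer would.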
2020-09-09 23:22:44 +08:00
struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
{
	struct commit_graph *g = r->objects->commit_graph;
	while (g) {
		if (g->bloom_filter_settings)
			return g->bloom_filter_settings;
		g = g->base_graph;
	}
	return NULL;
}
2019-06-19 02:14:24 +08:00
static void close_commit_graph_one(struct commit_graph *g)
{
	if (!g)
		return;
commit-graph: when closing the graph, also release the slab
The slab has information about the commit graph. That means that it is
meaningless (and even misleading) when the commit graph was closed.
This seems not to matter currently, but we're about to fix a
Windows-specific bug where `git pull` does not close the object store
before fetching (risking that an implicit auto-gc fails to remove the
now-obsolete pack file(s)), and once we have that bug fix in place, it
does matter: after that bug fix, we will open the object store, do some
stuff with it, then close it, fetch, and then open it again, and do more
stuff. If we close the commit graph without releasing the corresponding
slab, we're hit by a symptom like this in t5520.19:
BUG: commit-reach.c:85: bad generation skip 9223372036854775807
> 3 at 5cd378271655d43a3b4477520014f02213ad1546
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-08 16:29:30 +08:00
	clear_commit_graph_data_slab(&commit_graph_data_slab);
2019-06-19 02:14:24 +08:00
	close_commit_graph_one(g->base_graph);
	free_commit_graph(g);
}

2019-05-18 02:41:47 +08:00
void close_commit_graph(struct raw_object_store *o)
2018-04-10 20:56:05 +08:00
{
2019-06-19 02:14:24 +08:00
	close_commit_graph_one(o->commit_graph);
2019-05-18 02:41:47 +08:00
	o->commit_graph = NULL;
2018-04-10 20:56:05 +08:00
}
2021-08-09 16:11:59 +08:00
static int bsearch_graph(struct commit_graph *g, const struct object_id *oid, uint32_t *pos)
2018-04-10 20:56:05 +08:00
{
	return bsearch_hash(oid->hash, g->chunk_oid_fanout,
			    g->chunk_oid_lookup, g->hash_len, pos);
}
2019-06-19 02:14:24 +08:00
static void load_oid_from_graph(struct commit_graph *g,
				uint32_t pos,
				struct object_id *oid)
{
	uint32_t lex_index;

	while (g && pos < g->num_commits_in_base)
		g = g->base_graph;

	if (!g)
		BUG("NULL commit-graph");

	if (pos >= g->num_commits + g->num_commits_in_base)
		die(_("invalid commit position. commit-graph is likely corrupt"));

	lex_index = pos - g->num_commits_in_base;

2023-07-13 07:38:00 +08:00
	oidread(oid, g->chunk_oid_lookup + st_mult(g->hash_len, lex_index));
2019-06-19 02:14:24 +08:00
}
2018-12-15 08:09:39 +08:00
static struct commit_list **insert_parent_or_die(struct repository *r,
						 struct commit_graph *g,
2019-06-19 02:14:24 +08:00
						 uint32_t pos,
2018-04-10 20:56:05 +08:00
						 struct commit_list **pptr)
{
	struct commit *c;
	struct object_id oid;
2018-06-27 21:24:36 +08:00

2019-06-19 02:14:24 +08:00
	if (pos >= g->num_commits + g->num_commits_in_base)
		die("invalid parent position %"PRIu32, pos);
2018-06-27 21:24:38 +08:00

2019-06-19 02:14:24 +08:00
	load_oid_from_graph(g, pos, &oid);
2018-12-15 08:09:39 +08:00
	c = lookup_commit(r, &oid);
2018-04-10 20:56:05 +08:00
	if (!c)
2018-07-21 15:49:26 +08:00
		die(_("could not find commit %s"), oid_to_hex(&oid));
2020-06-17 17:14:10 +08:00
	commit_graph_data_at(c)->graph_pos = pos;
2018-04-10 20:56:05 +08:00
	return &commit_list_insert(c, pptr)->next;
}
2018-05-01 20:47:13 +08:00
static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
{
2019-06-19 02:14:24 +08:00
	const unsigned char *commit_data;
2020-06-17 17:14:11 +08:00
	struct commit_graph_data *graph_data;
commit-graph: implement generation data chunk
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
the prerequisites for implementing generation number v2 was to
distinguish between graph versions in a backwards-compatible manner.
We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.
Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.
To minimize the space required to store corrected commit dates, Git
stores corrected commit date offsets in the commit-graph file instead
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:
We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.
We test the overflow-related code with the following repo history:
          F - N - U
         /         \
U - N - U            N
         \          /
          N - F - N
Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.
The largest offset observed is 2 ^ 31, just large enough to overflow.
[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
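The overflow scheme above can be sketched as a pair of encode/decode
helpers. This is an illustration under assumptions, not the code Git
uses: the TOY_* constants, the toy_overflow_table type and both
function names are hypothetical. The constant values (2^31 - 1 as the
largest inline offset, bit 31 as the overflow flag) are chosen to match
the message's statement that an offset of 2^31 is just large enough to
overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values: largest inline offset and the MSB overflow marker. */
#define TOY_OFFSET_MAX    ((UINT64_C(1) << 31) - 1)
#define TOY_OVERFLOW_FLAG (UINT64_C(1) << 31)

/* Hypothetical in-memory stand-in for the GDOV chunk. */
struct toy_overflow_table {
	uint64_t dates[8];	/* full corrected commit dates */
	uint32_t nr;		/* number of entries in use */
};

/* Writer side: store a 32-bit offset, or a flagged index into the table. */
static uint32_t encode_gdat(uint64_t corrected_date, uint64_t committer_date,
			    struct toy_overflow_table *ov)
{
	uint64_t offset = corrected_date - committer_date;

	if (offset > TOY_OFFSET_MAX) {
		assert(ov->nr < 8);
		ov->dates[ov->nr] = corrected_date;
		return (uint32_t)(TOY_OVERFLOW_FLAG | ov->nr++);
	}
	return (uint32_t)offset;
}

/* Reader side: undo the encoding to recover the corrected commit date. */
static uint64_t decode_gdat(uint32_t stored, uint64_t committer_date,
			    const struct toy_overflow_table *ov)
{
	if (stored & TOY_OVERFLOW_FLAG) {
		uint32_t pos = stored ^ (uint32_t)TOY_OVERFLOW_FLAG;

		assert(pos < ov->nr);
		return ov->dates[pos];
	}
	return committer_date + stored;
}
```

Small offsets round-trip through the 32-bit slot directly, while an
offset of 2^31 is redirected to the first overflow entry.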
2021-01-17 02:11:15 +08:00
	uint32_t lex_index, offset_pos;
	uint64_t date_high, date_low, offset;
2019-06-19 02:14:24 +08:00

	while (pos < g->num_commits_in_base)
		g = g->base_graph;
commit-graph: consolidate fill_commit_graph_info
Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().
fill_commit_graph_info() used to not load committer date from commit data
chunk. However, with the upcoming switch to using corrected committer
date as generation number v2, we will have to load committer date to
compute generation number value anyway.
e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2^36 + 1) seconds since
EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.
The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and load committer date from the object
database, generating a tar file with the expected mtime.
However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to lower 34-bits) to populate the generation
number. Thus, Git sets date and generates tar file with the truncated
mtime.
The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.
As the date needed to overflow 12 octal digits also overflows the CDAT
chunk, while the date needed to overflow 11 octal digits does not, we
split the existing tests to test both implementations separately and add
a new explicit test for 11-digit implementation.
To test the 11-octal digit implementation, we create a future commit
with committer date of 2^34 - 1, which overflows 11-octal digits without
overflowing 34-bits of the Commit Data chunk.
To test the 12-octal digit implementation, the smallest committer date
possible is 2^36 + 1, which overflows the CDAT chunk and thus
commit-graph must be disabled for the test.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
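The arithmetic behind the two test dates can be checked directly. Each
octal digit holds 3 bits, so 11 digits cover values below 2^33 and 12
digits cover values below 2^36, while the CDAT date field covers values
below 2^34. The helper names below are hypothetical and exist only for
this check.

```c
#include <assert.h>
#include <stdint.h>

/* Does `value` fit in `digits` octal digits (3 bits per digit)? */
static int fits_in_octal_digits(uint64_t value, unsigned digits)
{
	unsigned bits = 3 * digits;

	assert(digits > 0);
	return bits >= 64 || value < (UINT64_C(1) << bits);
}

/* Does `value` fit in the 34-bit committer date field of a CDAT entry? */
static int fits_in_cdat_date(uint64_t value)
{
	return value < (UINT64_C(1) << 34);
}
```

So 2^34 - 1 overflows an 11-digit mtime field but still fits both a
12-digit field and the CDAT date, whereas 2^36 + 1 overflows the
12-digit field and the CDAT date alike.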
2021-01-17 02:11:10 +08:00
	if (pos >= g->num_commits + g->num_commits_in_base)
		die(_("invalid commit position. commit-graph is likely corrupt"));

2019-06-19 02:14:24 +08:00
	lex_index = pos - g->num_commits_in_base;
2023-07-13 07:38:03 +08:00
	commit_data = g->chunk_commit_data + st_mult(GRAPH_DATA_WIDTH, lex_index);
2020-06-17 17:14:11 +08:00

	graph_data = commit_graph_data_at(item);
	graph_data->graph_pos = pos;
2021-01-17 02:11:10 +08:00
	date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
	date_low = get_be32(commit_data + g->hash_len + 12);
	item->date = (timestamp_t)((date_high << 32) | date_low);
commit-graph: use generation v2 only if entire chain does
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:
1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
If each layer of split commit-graph is treated independently, as it was
the case before this commit, with Git inspecting only the current layer
for chunk_generation_data pointer, commits in the lower layer (one with
GDAT) would have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.
It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.
This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).
Therefore, when writing the new layer in split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
as well.
Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:16 +08:00
	if (g->read_generation_data) {
2023-07-13 07:38:03 +08:00
		offset = (timestamp_t)get_be32(g->chunk_generation_data + st_mult(sizeof(uint32_t), lex_index));
commit-graph: implement generation data chunk
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.
We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.
Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.
To minimize the space required to store corrected commit dates, Git
stores corrected commit date offsets in the commit-graph file instead
of the corrected commit dates themselves. This saves us 4 bytes per
commit, decreasing the GDAT chunk size by half, but it's possible for
the offset to overflow the 4 bytes allocated for storage. As such
overflows are and should be exceedingly rare, we use the following
overflow management scheme:
We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.
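The overflow scheme described above can be sketched as a standalone decoder. This is a minimal illustration, not the code from commit-graph.c: all `sketch_*` and `SKETCH_*` names are hypothetical, and the sketch follows the reader code in this file by adding the stored 64-bit GDOV value to the commit date.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; stands in for the "MSB of the offset" flag. */
#define SKETCH_OFFSET_OVERFLOW (UINT32_C(1) << 31)

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t sketch_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a big-endian 64-bit value from a byte buffer. */
static uint64_t sketch_be64(const unsigned char *p)
{
	return ((uint64_t)sketch_be32(p) << 32) | sketch_be32(p + 4);
}

/*
 * Decode the generation value for the commit at lex_index.
 * Returns 0 and fills *out on success, or -1 if the GDAT entry
 * points past the end of the GDOV chunk.
 */
static int sketch_decode_generation(const unsigned char *gdat,
				    const unsigned char *gdov,
				    size_t gdov_size,
				    uint32_t lex_index,
				    uint64_t commit_date,
				    uint64_t *out)
{
	uint32_t offset = sketch_be32(gdat + sizeof(uint32_t) * lex_index);

	if (offset & SKETCH_OFFSET_OVERFLOW) {
		/* Remaining 31 bits are a position in the GDOV chunk. */
		uint32_t pos = offset ^ SKETCH_OFFSET_OVERFLOW;

		if (gdov_size / sizeof(uint64_t) <= pos)
			return -1;
		*out = commit_date + sketch_be64(gdov + sizeof(uint64_t) * pos);
	} else {
		/* Small offsets are stored inline in GDAT. */
		*out = commit_date + offset;
	}
	return 0;
}
```

Returning an error code is a choice made for this sketch; the function shown in this file has no error return and calls die() instead.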
We test the overflow-related code with the following repo history:
          F - N - U
         /         \
U - N - U            N
         \          /
          N - F - N
Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.
The largest offset observed is 2 ^ 31, just large enough to overflow.
[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:15 +08:00
|
|
|
|
|
|
|
if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
|
|
|
|
if (!g->chunk_generation_data_overflow)
|
|
|
|
die(_("commit-graph requires overflow generation data but has none"));
|
|
|
|
|
|
|
|
offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
|
commit-graph: bounds-check generation overflow chunk
If the generation entry in a commit-graph doesn't fit, we instead insert
an offset into a generation overflow chunk. But since we don't record
the size of the chunk, we may read outside the chunk if the offset we
find on disk is malicious or corrupted.
We can't check the size of the chunk up-front; it will vary based on how
many entries need overflow. So instead, we'll do a bounds-check before
accessing the chunk memory. Unfortunately there is no error-return from
this function, so we'll just have to die(), which is what it does for
other forms of corruption.
As with other cases, we can drop the st_mult() call, since we know our
bounds-checked value will fit within a size_t.
Before this patch, the test here actually "works" because we read
garbage data from the next chunk. And since that garbage data happens
not to provide a generation number which changes the output, it appears
to work. We could construct a case that actually segfaults or produces
wrong output, but it would be a bit tricky. For our purposes it's
sufficient to check that we've detected the bounds error.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-10 05:05:47 +08:00
|
|
|
if (g->chunk_generation_data_overflow_size / sizeof(uint64_t) <= offset_pos)
|
|
|
|
die(_("commit-graph overflow generation data is too small"));
|
|
|
|
graph_data->generation = item->date +
|
|
|
|
get_be64(g->chunk_generation_data_overflow + sizeof(uint64_t) * offset_pos);
|
2021-01-17 02:11:15 +08:00
|
|
|
} else
|
|
|
|
graph_data->generation = item->date + offset;
|
|
|
|
} else
|
|
|
|
graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
|
2021-01-17 02:11:12 +08:00
|
|
|
|
|
|
|
if (g->topo_levels)
|
|
|
|
*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
|
2018-05-01 20:47:13 +08:00
|
|
|
}
|
|
|
|
|
2019-04-16 17:33:18 +08:00
|
|
|
static inline void set_commit_tree(struct commit *c, struct tree *t)
|
|
|
|
{
|
|
|
|
c->maybe_tree = t;
|
|
|
|
}
|
|
|
|
|
2018-12-15 08:09:39 +08:00
|
|
|
static int fill_commit_in_graph(struct repository *r,
|
|
|
|
struct commit *item,
|
|
|
|
struct commit_graph *g, uint32_t pos)
|
2018-04-10 20:56:05 +08:00
|
|
|
{
|
|
|
|
uint32_t edge_value;
|
commit-graph: detect out-of-bounds extra-edges pointers
If an entry in a commit-graph file has more than 2 parents, the
fixed-size parent fields instead point to an offset within an "extra
edges" chunk. We blindly follow these, assuming that the chunk is
present and sufficiently large; this can lead to an out-of-bounds read
for a corrupt or malicious file.
We can fix this by recording the size of the chunk and adding a
bounds-check in fill_commit_in_graph(). There are a few tricky bits:
1. We'll switch from working with a pointer to an offset. This makes
some corner cases just fall out naturally:
a. If we did not find an EDGE chunk at all, our size will
correctly be zero (so everything is "out of bounds").
b. Comparing "size / 4" lets us make sure we have at least 4 bytes
to read, and we never compute a pointer more than one element
past the end of the array (computing a larger pointer is
probably OK in practice, but is technically undefined
behavior).
c. The current code casts to "uint32_t *". Replacing it with an
offset avoids any comparison between different types of pointer
(since the chunk is stored as "unsigned char *").
2. This is the first case in which fill_commit_in_graph() may return
anything but success. We need to make sure to roll back the
"parsed" flag (and any parents we might have added before running
out of buffer) so that the caller can cleanly fall back to
loading the commit object itself.
It's a little non-trivial to do this, and we might benefit from
factoring it out. But we can wait on that until we actually see a
second case where we return an error.
As a bonus, this lets us drop the st_mult() call. Since we've already
done a bounds check, we know there won't be any integer overflow (it
would imply our buffer is larger than a size_t can hold).
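The offset-based bounds check described above can be sketched as follows. The names are hypothetical, and the two flag values are assumptions standing in for GRAPH_LAST_EDGE and GRAPH_EDGE_LAST_MASK; the sketch collects parent positions into a caller-supplied array rather than building a commit list.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values, standing in for GRAPH_LAST_EDGE / GRAPH_EDGE_LAST_MASK. */
#define SKETCH_LAST_EDGE UINT32_C(0x80000000)
#define SKETCH_EDGE_MASK UINT32_C(0x7fffffff)

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t sketch_edge_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Walk the extra-edges list starting at element "pos" and store each
 * parent position in "parents". Returns the number of parents read,
 * or -1 if the list runs past the chunk or past max_parents.
 *
 * Comparing "size / 4 <= pos" before every read means a missing chunk
 * (size 0) is always out of bounds, and the byte offset computed for
 * the read cannot overflow a size_t.
 */
static int sketch_collect_extra_parents(const unsigned char *edges,
					size_t edges_size,
					uint32_t pos,
					uint32_t *parents,
					size_t max_parents)
{
	size_t n = 0;
	uint32_t value;

	do {
		if (edges_size / sizeof(uint32_t) <= pos)
			return -1;
		if (n == max_parents)
			return -1;
		value = sketch_edge_be32(edges + sizeof(uint32_t) * pos);
		parents[n++] = value & SKETCH_EDGE_MASK;
		pos++;
	} while (!(value & SKETCH_LAST_EDGE));

	return (int)n;
}
```

On failure, a caller would discard whatever parents were already collected, which corresponds to the rollback of the "parsed" flag and parent list described in point 2.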
The included test does not actually segfault before this patch (though
you could construct a case where it does). Instead, it reads garbage
from the next chunk which results in it complaining about a bogus parent
id. This is sufficient for our needs, though (we care that the fallback
succeeds, and that stderr mentions the out-of-bounds read).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-10 05:05:38 +08:00
|
|
|
uint32_t parent_data_pos;
|
2018-04-10 20:56:05 +08:00
|
|
|
struct commit_list **pptr;
|
2019-06-19 02:14:24 +08:00
|
|
|
const unsigned char *commit_data;
|
|
|
|
uint32_t lex_index;
|
2018-04-10 20:56:05 +08:00
|
|
|
|
2019-06-19 02:14:24 +08:00
|
|
|
while (pos < g->num_commits_in_base)
|
|
|
|
g = g->base_graph;
|
|
|
|
|
commit-graph: consolidate fill_commit_graph_info
Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().
fill_commit_graph_info() previously did not load the committer date from
the commit data chunk. However, with the upcoming switch to using
corrected committer date as generation number v2, we will have to load
the committer date to compute the generation number value anyway.
e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2^36 + 1) seconds since
EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.
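To make the 34-bit limit concrete, here is a sketch of how the last eight bytes of a CDAT entry could be decoded. The `>> 2` for the topological level matches the code in this file; the date layout (two low bits of the first word, followed by the whole second word) is stated here as an assumption about the format, and the `sketch_*` names are hypothetical.

```c
#include <stdint.h>

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t sketch_cdat_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode the 8 bytes that follow the two parent fields of a CDAT
 * entry: 30 bits of topological level, then 34 bits of committer
 * date (assumed layout).
 */
static void sketch_decode_cdat_tail(const unsigned char *p,
				    uint32_t *topo_level,
				    uint64_t *date)
{
	uint32_t hi = sketch_cdat_be32(p);
	uint32_t lo = sketch_cdat_be32(p + 4);

	*topo_level = hi >> 2;
	*date = ((uint64_t)(hi & 0x3) << 32) | lo;
}
```

A date of 2^34 or more has no bits left in this layout, which is why such a value would spill into the bits used for the generation number.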
The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and load committer date from the object
database, generating a tar file with the expected mtime.
However, with corrected commit date, we will load the committer date
from the CDAT chunk (truncated to the lower 34 bits) to populate the
generation number. Thus, Git sets date and generates a tar file with the
truncated mtime.
The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.
As a date that overflows 12 octal digits also overflows the CDAT chunk,
while a date that overflows 11 octal digits need not, we split the
existing tests to test both implementations separately and add a new
explicit test for the 11-digit implementation.
To test the 11-octal digit implementation, we create a future commit
with committer date of 2^34 - 1, which overflows 11-octal digits without
overflowing 34-bits of the Commit Date chunks.
To test the 12-octal digit implementation, the smallest committer date
possible is 2^36 + 1, which overflows the CDAT chunk and thus
commit-graph must be disabled for the test.
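The arithmetic behind these two test dates can be checked with a small sketch. Each octal digit holds 3 bits, so 11 digits cover 33 bits and 12 digits cover 36 bits; the helper names below are hypothetical.

```c
#include <stdint.h>

/* Non-zero if "value" can be represented in "bits" bits. */
static int sketch_fits_in_bits(uint64_t value, unsigned bits)
{
	return bits >= 64 || (value >> bits) == 0;
}

/* Non-zero if "value" can be written with "digits" octal digits. */
static int sketch_fits_in_octal_digits(uint64_t value, unsigned digits)
{
	return sketch_fits_in_bits(value, 3 * digits);
}
```

With these helpers, 2^34 - 1 does not fit in 11 octal digits but does fit in the 34 bits of the CDAT date field, while 2^36 + 1 fits in neither 12 octal digits nor 34 bits.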
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:10 +08:00
|
|
|
fill_commit_graph_info(item, g, pos);
|
2019-06-19 02:14:24 +08:00
|
|
|
|
|
|
|
lex_index = pos - g->num_commits_in_base;
|
2023-07-13 07:38:05 +08:00
|
|
|
commit_data = g->chunk_commit_data + st_mult(g->hash_len + 16, lex_index);
|
2019-06-19 02:14:24 +08:00
|
|
|
|
|
|
|
item->object.parsed = 1;
|
2018-04-10 20:56:05 +08:00
|
|
|
|
2019-04-16 17:33:18 +08:00
|
|
|
set_commit_tree(item, NULL);
|
2018-04-10 20:56:05 +08:00
|
|
|
|
|
|
|
pptr = &item->parents;
|
|
|
|
|
|
|
|
edge_value = get_be32(commit_data + g->hash_len);
|
|
|
|
if (edge_value == GRAPH_PARENT_NONE)
|
|
|
|
return 1;
|
2018-12-15 08:09:39 +08:00
|
|
|
pptr = insert_parent_or_die(r, g, edge_value, pptr);
|
2018-04-10 20:56:05 +08:00
|
|
|
|
|
|
|
edge_value = get_be32(commit_data + g->hash_len + 4);
|
|
|
|
if (edge_value == GRAPH_PARENT_NONE)
|
|
|
|
return 1;
|
commit-graph: rename "large edges" to "extra edges"
The optional 'Large Edge List' chunk of the commit graph file stores
parent information for commits with more than two parents, and the
names of most of the macros, variables, struct fields, and functions
related to this chunk contain the term "large edges", e.g.
write_graph_chunk_large_edges(). However, it's not a really great
term, as the edges to the second and subsequent parents stored in this
chunk are not any larger than the edges to the first and second
parents stored in the "main" 'Commit Data' chunk. It's the number of
edges, IOW number of parents, that is larger compared to non-merge and
"regular" two-parent merge commits. And indeed, two functions in
'commit-graph.c' have a local variable called 'num_extra_edges' that
refer to the same thing, and this "extra edges" term is much better at
describing these edges.
So let's rename all these references to "large edges" in macro,
variable, function, etc. names to "extra edges". There is a
GRAPH_OCTOPUS_EDGES_NEEDED macro as well; for the sake of consistency
rename it to GRAPH_EXTRA_EDGES_NEEDED.
We can do so safely without causing any incompatibility issues,
because the term "large edges" doesn't come up in the file format
itself in any form (the chunk's magic is {'E', 'D', 'G', 'E'}, there
is no 'L' in there), but only in the specification text. The string
"large edges", however, does come up in the output of 'git
commit-graph read' and in tests looking at its input, but that command
is explicitly documented as debugging aid, so we can change its output
and the affected tests safely.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-20 04:21:13 +08:00
|
|
|
if (!(edge_value & GRAPH_EXTRA_EDGES_NEEDED)) {
|
2018-12-15 08:09:39 +08:00
|
|
|
pptr = insert_parent_or_die(r, g, edge_value, pptr);
|
2018-04-10 20:56:05 +08:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2023-10-10 05:05:38 +08:00
|
|
|
parent_data_pos = edge_value & GRAPH_EDGE_LAST_MASK;
|
2018-04-10 20:56:05 +08:00
|
|
|
do {
|
2023-10-10 05:05:38 +08:00
|
|
|
if (g->chunk_extra_edges_size / sizeof(uint32_t) <= parent_data_pos) {
|
|
|
|
error("commit-graph extra-edges pointer out of bounds");
|
|
|
|
free_commit_list(item->parents);
|
|
|
|
item->parents = NULL;
|
|
|
|
item->object.parsed = 0;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
edge_value = get_be32(g->chunk_extra_edges +
|
|
|
|
sizeof(uint32_t) * parent_data_pos);
|
2018-12-15 08:09:39 +08:00
|
|
|
pptr = insert_parent_or_die(r, g,
|
2018-04-10 20:56:05 +08:00
|
|
|
edge_value & GRAPH_EDGE_LAST_MASK,
|
|
|
|
pptr);
|
2023-10-10 05:05:38 +08:00
|
|
|
parent_data_pos++;
|
2018-04-10 20:56:05 +08:00
|
|
|
} while (!(edge_value & GRAPH_LAST_EDGE));
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2021-08-09 16:11:59 +08:00
|
|
|
static int search_commit_pos_in_graph(const struct object_id *id, struct commit_graph *g, uint32_t *pos)
|
|
|
|
{
|
|
|
|
struct commit_graph *cur_g = g;
|
|
|
|
uint32_t lex_index;
|
|
|
|
|
|
|
|
while (cur_g && !bsearch_graph(cur_g, id, &lex_index))
|
|
|
|
cur_g = cur_g->base_graph;
|
|
|
|
|
|
|
|
if (cur_g) {
|
|
|
|
*pos = lex_index + cur_g->num_commits_in_base;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int find_commit_pos_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
|
2018-05-01 20:47:13 +08:00
|
|
|
{
|
2020-06-17 17:14:11 +08:00
|
|
|
uint32_t graph_pos = commit_graph_position(item);
|
|
|
|
if (graph_pos != COMMIT_NOT_FROM_GRAPH) {
|
|
|
|
*pos = graph_pos;
|
2018-05-01 20:47:13 +08:00
|
|
|
return 1;
|
|
|
|
} else {
|
2021-08-09 16:11:59 +08:00
|
|
|
return search_commit_pos_in_graph(&item->object.oid, g, pos);
|
2018-05-01 20:47:13 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-07-13 07:10:31 +08:00
|
|
|
int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c,
|
|
|
|
uint32_t *pos)
|
|
|
|
{
|
|
|
|
if (!prepare_commit_graph(r))
|
|
|
|
return 0;
|
|
|
|
return find_commit_pos_in_graph(c, r->objects->commit_graph, pos);
|
|
|
|
}
|
|
|
|
|
2021-08-09 16:12:03 +08:00
|
|
|
struct commit *lookup_commit_in_graph(struct repository *repo, const struct object_id *id)
|
|
|
|
{
|
|
|
|
struct commit *commit;
|
|
|
|
uint32_t pos;
|
|
|
|
|
lookup_commit_in_graph(): use prepare_commit_graph() to check for graph
We exit early from lookup_commit_in_graph() if the commit_graph pointer
is NULL, under the assumption that we don't have a graph to look at. But
the graph pointer is lazy-loaded; if no other code happens to have
called prepare_commit_graph(), we'll incorrectly assume that one isn't
available at all.
This has a pretty small performance impact in practice, because the
fallback will generally be to call parse_object() instead. That ends up
in parse_commit_buffer(), which loads the graph data itself. So the
first commit we see won't use the graph, but subsequent ones will. Since
using the graph is just an optimization there's generally no
user-visible difference, but if you instrument rev-list like so:
diff --git a/revision.c b/revision.c
index ee702e498a..63c488ffb6 100644
--- a/revision.c
+++ b/revision.c
@@ -381,6 +381,9 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
* parsing commit data from disk.
*/
commit = lookup_commit_in_graph(revs->repo, oid);
+ warning("%s %s in commit graph",
+ commit ? "found" : "did not find",
+ name);
if (commit)
object = &commit->object;
else
and run (in git.git):
git commit-graph write --reachable
git rev-list origin/master origin/next >/dev/null
you'll see that we fail to find the first one:
warning: did not find origin/master in commit graph
warning: found origin/next in commit graph
After this patch, you'll see that we find both:
warning: found origin/master in commit graph
warning: found origin/next in commit graph
Even though the performance implication is small here, there are two
important reasons to do this:
- it's downright confusing if you are hunting a bug triggered by the
use of the commit graph. It may or may not trigger depending on the
number and ordering of tips you ask for.
- prepare_commit_graph() has other policy logic, too. In particular,
if we've loaded a commit graph and then disabled the graph via
disable_commit_graph(), that should take precedence.
I'm not sure if this can trigger bad behavior in practice. The only
caller there is upload-pack's deepen_by_rev_list(), which should be
avoiding the commit graph for its traversal tips, but probably
wasn't before this patch. Whether you could come up with a case
where that mattered is unclear. Still, this is obviously the right
thing to be doing.
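The difference between checking the raw pointer and going through a "prepare" step can be sketched with a toy model. Everything here is hypothetical (`struct repo_sketch`, `prepare_graph_sketch()`, `have_graph_unprepared()`); it only illustrates the lazy-load and disable-takes-precedence behaviour described above, not the real prepare_commit_graph().

```c
#include <stddef.h>

/* Hypothetical stand-ins for a loaded graph and a repository. */
struct graph_sketch {
	int num_commits;
};

struct repo_sketch {
	struct graph_sketch *graph;   /* NULL until lazily loaded */
	int load_attempted;           /* non-zero once a load was tried */
	int graph_disabled;           /* non-zero if the graph was disabled */
	struct graph_sketch *on_disk; /* what a load would find, or NULL */
};

/*
 * Lazy-load the graph on first use and apply the "disabled" policy.
 * Returns non-zero only if a graph is available and may be used.
 */
static int prepare_graph_sketch(struct repo_sketch *r)
{
	if (r->graph_disabled)
		return 0;
	if (!r->load_attempted) {
		r->load_attempted = 1;
		r->graph = r->on_disk;
	}
	return r->graph != NULL;
}

/*
 * The check the commit message argues against: it only looks at the
 * pointer, so it misses a graph that has not been loaded yet and
 * ignores the "disabled" policy.
 */
static int have_graph_unprepared(const struct repo_sketch *r)
{
	return r->graph != NULL;
}
```

In this model the first lookup through `have_graph_unprepared()` fails even though a graph exists on disk, which mirrors the "did not find origin/master" warning in the example output.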
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-09-07 05:02:13 +08:00
|
|
|
if (!prepare_commit_graph(repo))
|
2021-08-09 16:12:03 +08:00
|
|
|
return NULL;
|
|
|
|
if (!search_commit_pos_in_graph(id, repo->objects->commit_graph, &pos))
|
|
|
|
return NULL;
|
2022-07-01 09:34:30 +08:00
|
|
|
if (!has_object(repo, id, 0))
|
2021-08-09 16:12:03 +08:00
|
|
|
return NULL;
|
|
|
|
|
|
|
|
commit = lookup_commit(repo, id);
|
|
|
|
if (!commit)
|
|
|
|
return NULL;
|
|
|
|
if (commit->object.parsed)
|
|
|
|
return commit;
|
|
|
|
|
|
|
|
if (!fill_commit_in_graph(repo, commit, repo->objects->commit_graph, pos))
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
return commit;
|
|
|
|
}
|
|
|
|
|
2018-12-15 08:09:39 +08:00
|
|
|
static int parse_commit_in_graph_one(struct repository *r,
|
|
|
|
struct commit_graph *g,
|
|
|
|
struct commit *item)
|
2018-04-10 20:56:05 +08:00
|
|
|
{
|
2018-05-01 20:47:13 +08:00
|
|
|
uint32_t pos;
|
|
|
|
|
2018-04-10 20:56:05 +08:00
|
|
|
if (item->object.parsed)
|
|
|
|
return 1;
|
2018-06-27 21:24:29 +08:00
|
|
|
|
2021-08-09 16:11:59 +08:00
|
|
|
if (find_commit_pos_in_graph(item, g, &pos))
|
2018-12-15 08:09:39 +08:00
|
|
|
return fill_commit_in_graph(r, item, g, pos);
|
2018-06-27 21:24:29 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-07-12 06:42:42 +08:00
|
|
|
int parse_commit_in_graph(struct repository *r, struct commit *item)
|
2018-06-27 21:24:29 +08:00
|
|
|
{
|
2020-06-24 01:47:01 +08:00
|
|
|
static int checked_env = 0;
|
|
|
|
|
|
|
|
if (!checked_env &&
|
|
|
|
git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE, 0))
|
|
|
|
die("dying as requested by the '%s' variable on commit-graph parse!",
|
|
|
|
GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE);
|
|
|
|
checked_env = 1;
|
|
|
|
|
2018-07-12 06:42:42 +08:00
|
|
|
if (!prepare_commit_graph(r))
|
2018-06-27 21:24:29 +08:00
|
|
|
return 0;
|
2018-12-15 08:09:39 +08:00
|
|
|
return parse_commit_in_graph_one(r, r->objects->commit_graph, item);
|
2018-04-10 20:56:05 +08:00
|
|
|
}
|
|
|
|
|
2018-07-12 06:42:42 +08:00
|
|
|
void load_commit_graph_info(struct repository *r, struct commit *item)
|
2018-05-01 20:47:13 +08:00
|
|
|
{
|
|
|
|
uint32_t pos;
|
2022-07-13 07:10:31 +08:00
|
|
|
if (repo_find_commit_pos_in_graph(r, item, &pos))
|
2018-07-12 06:42:42 +08:00
|
|
|
fill_commit_graph_info(item, r->objects->commit_graph, pos);
|
2018-05-01 20:47:13 +08:00
|
|
|
}
|
|
|
|
|
2018-12-15 08:09:39 +08:00
|
|
|
static struct tree *load_tree_for_commit(struct repository *r,
|
|
|
|
struct commit_graph *g,
|
|
|
|
struct commit *c)
|
2018-04-07 03:09:46 +08:00
|
|
|
{
|
|
|
|
struct object_id oid;
|
2019-06-19 02:14:24 +08:00
|
|
|
const unsigned char *commit_data;
|
2020-06-17 17:14:11 +08:00
|
|
|
uint32_t graph_pos = commit_graph_position(c);
|
2019-06-19 02:14:24 +08:00
|
|
|
|
2020-06-17 17:14:11 +08:00
|
|
|
while (graph_pos < g->num_commits_in_base)
|
2019-06-19 02:14:24 +08:00
|
|
|
g = g->base_graph;
|
|
|
|
|
|
|
|
commit_data = g->chunk_commit_data +
|
2023-07-13 07:38:08 +08:00
|
|
|
st_mult(GRAPH_DATA_WIDTH, graph_pos - g->num_commits_in_base);
|
2018-04-07 03:09:46 +08:00
|
|
|
|
2021-04-26 09:02:50 +08:00
|
|
|
oidread(&oid, commit_data);
|
2019-04-16 17:33:18 +08:00
|
|
|
set_commit_tree(c, lookup_tree(r, &oid));
|
2018-04-07 03:09:46 +08:00
|
|
|
|
|
|
|
return c->maybe_tree;
|
|
|
|
}
|
|
|
|
|
2018-12-15 08:09:39 +08:00
|
|
|
static struct tree *get_commit_tree_in_graph_one(struct repository *r,
|
|
|
|
struct commit_graph *g,
|
2018-06-27 21:24:31 +08:00
|
|
|
const struct commit *c)
|
2018-04-07 03:09:46 +08:00
|
|
|
{
|
|
|
|
if (c->maybe_tree)
|
|
|
|
return c->maybe_tree;
|
2020-06-17 17:14:10 +08:00
|
|
|
if (commit_graph_position(c) == COMMIT_NOT_FROM_GRAPH)
|
2018-06-27 21:24:31 +08:00
|
|
|
BUG("get_commit_tree_in_graph_one called from non-commit-graph commit");
|
|
|
|
|
2018-12-15 08:09:39 +08:00
|
|
|
return load_tree_for_commit(r, g, (struct commit *)c);
|
2018-06-27 21:24:31 +08:00
|
|
|
}
|
2018-04-07 03:09:46 +08:00
|
|
|
|
2018-07-12 06:42:42 +08:00
|
|
|
struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit *c)
|
2018-06-27 21:24:31 +08:00
|
|
|
{
|
2018-12-15 08:09:39 +08:00
|
|
|
return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
|
2018-04-07 03:09:46 +08:00
|
|
|
}
|
|
|
|
|
2019-06-12 21:29:40 +08:00
|
|
|
struct packed_commit_list {
|
|
|
|
struct commit **list;
|
2020-12-08 03:11:08 +08:00
|
|
|
size_t nr;
|
|
|
|
size_t alloc;
|
2019-06-12 21:29:40 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
struct write_commit_graph_context {
|
|
|
|
struct repository *r;
|
2020-02-04 13:51:50 +08:00
|
|
|
struct object_directory *odb;
|
2019-06-12 21:29:40 +08:00
|
|
|
char *graph_name;
|
2020-12-08 03:11:05 +08:00
|
|
|
struct oid_array oids;
|
2019-06-12 21:29:40 +08:00
|
|
|
struct packed_commit_list commits;
|
|
|
|
int num_extra_edges;
|
commit-graph: implement generation data chunk
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.
We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.
Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.
To minimize the space required to store corrrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instea
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:
We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.
We test the overflow-related code with the following repo history:
          F - N - U
         /         \
U - N - U            N
         \          /
          N - F - N
Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.
The largest offset observed is 2 ^ 31, just large enough to overflow.
[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
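The overflow scheme described in the commit message above can be sketched roughly as follows. This is only an illustration, not the code from this file: the constant values, the `overflow_table` struct, and the function names `encode_gdat_entry` and `decode_gdat_entry` are all assumptions made for the example. The inline limit is assumed to be 2^31 - 1, which matches the statement that an offset of 2^31 is "just large enough to overflow".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; the real definitions live elsewhere. */
#define OFFSET_MAX      ((UINT32_C(1) << 31) - 1)  /* largest offset stored inline */
#define OFFSET_OVERFLOW (UINT32_C(1) << 31)        /* MSB set: look in the overflow table */
#define OVERFLOW_CAP    16                         /* hypothetical fixed capacity */

/* Hypothetical in-memory stand-in for the GDOV chunk. */
struct overflow_table {
	uint64_t dates[OVERFLOW_CAP];
	uint32_t nr;
};

/*
 * Return the 32-bit value to store for one commit. If the offset
 * (corrected date minus committer date) fits, it is stored directly.
 * Otherwise the full corrected date is appended to the overflow table
 * and its position is stored with the MSB set.
 */
static uint32_t encode_gdat_entry(uint64_t corrected_date,
				  uint64_t committer_date,
				  struct overflow_table *ov)
{
	uint64_t offset = corrected_date - committer_date;

	if (offset <= OFFSET_MAX)
		return (uint32_t)offset;

	assert(ov->nr < OVERFLOW_CAP);
	ov->dates[ov->nr] = corrected_date;
	return OFFSET_OVERFLOW | ov->nr++;
}

/* Inverse operation: recover the corrected commit date from an entry. */
static uint64_t decode_gdat_entry(uint32_t entry, uint64_t committer_date,
				  const struct overflow_table *ov)
{
	if (entry & OFFSET_OVERFLOW)
		return ov->dates[entry & ~OFFSET_OVERFLOW];
	return committer_date + entry;
}
```

A reader that does not know about the overflow table would misread entries with the MSB set, which is presumably why the scheme mirrors the existing Extra Edge List convention.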
2021-01-17 02:11:15 +08:00
|
|
|
int num_generation_data_overflows;
|
2019-06-12 21:29:40 +08:00
|
|
|
unsigned long approx_nr_objects;
|
|
|
|
struct progress *progress;
|
|
|
|
int progress_done;
|
|
|
|
uint64_t progress_cnt;
|
2019-06-19 02:14:27 +08:00
|
|
|
|
|
|
|
char *base_graph_name;
|
|
|
|
int num_commit_graphs_before;
|
|
|
|
int num_commit_graphs_after;
|
|
|
|
char **commit_graph_filenames_before;
|
|
|
|
char **commit_graph_filenames_after;
|
|
|
|
char **commit_graph_hash_after;
|
|
|
|
uint32_t new_num_commits_in_base;
|
|
|
|
struct commit_graph *new_base_graph;
|
|
|
|
|
2019-06-12 21:29:40 +08:00
|
|
|
unsigned append:1,
|
2019-06-19 02:14:27 +08:00
|
|
|
report_progress:1,
|
2019-08-05 16:02:40 +08:00
|
|
|
split:1,
|
2020-03-30 08:31:30 +08:00
|
|
|
changed_paths:1,
|
2021-01-17 02:11:15 +08:00
|
|
|
order_by_pack:1,
|
2021-02-02 11:01:22 +08:00
|
|
|
write_generation_data:1,
|
|
|
|
trust_generation_numbers:1;
|
2019-06-19 02:14:32 +08:00
|
|
|
|
2021-01-17 02:11:12 +08:00
|
|
|
struct topo_level_slab *topo_levels;
|
2020-09-18 10:59:49 +08:00
|
|
|
const struct commit_graph_opts *opts;
|
2020-03-30 08:31:28 +08:00
|
|
|
size_t total_bloom_filter_data_size;
|
2020-06-24 01:47:00 +08:00
|
|
|
const struct bloom_filter_settings *bloom_settings;
|
bloom: split 'get_bloom_filter()' in two
'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another parameter to this method, which would force all but one
caller to specify an extra 'NULL' parameter at the end.
Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.
This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).
While we're at it, instrument the new 'get_or_compute_bloom_filter()'
with counters in the 'write_commit_graph_context' struct which store
the number of filters that we did and didn't compute, as well as filters
that were truncated.
It would be nice to drop the 'compute_if_not_present' flag entirely,
since all remaining callers of 'get_or_compute_bloom_filter' pass it as
'1', but this will change in a future patch and hence cannot be removed.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 02:07:32 +08:00
|
|
|
|
|
|
|
int count_bloom_filter_computed;
|
|
|
|
int count_bloom_filter_not_computed;
|
bloom: encode out-of-bounds filters as non-empty
When a changed-path Bloom filter has either zero, or more than a
certain number (commonly 512) of entries, the commit-graph machinery
encodes it as "missing". More specifically, it sets the indices adjacent
in the BIDX chunk as equal to each other to indicate a "length 0"
filter; that is, that the filter occupies zero bytes on disk.
This has heretofore been fine, since the commit-graph machinery has no
need to care about these filters with too few or too many changed paths.
Both cases act like no filter has been generated at all, and so there is
no need to store them.
In a subsequent commit, however, the commit-graph machinery will learn
to only compute Bloom filters for some commits in the current
commit-graph layer. This is a change from the current implementation
which computes Bloom filters for all commits that are in the layer being
written. Critically for this patch, only computing some of the Bloom
filters means adding a third state for length 0 Bloom filters: zero
entries, too many entries, or "hasn't been computed".
It will be important for that future patch to distinguish between "not
representable" (i.e., zero or too-many changed paths), and "hasn't been
computed". In particular, we don't want to waste time recomputing
filters that have already been computed.
To that end, change how we store Bloom filters in the "computed but not
representable" category:
- Bloom filters with no entries are stored as a single byte with all
bits low (i.e., all queries to that Bloom filter will return
"definitely not")
- Bloom filters with too many entries are stored as a single byte with
all bits set high (i.e., all queries to that Bloom filter will
return "maybe").
These rules are sufficient to not incur a behavior change by changing
the on-disk representation of these two classes. Likewise, no
specification changes are necessary for the commit-graph format, either:
- Filters that were previously empty will be recomputed and stored
according to the new rules, and
- old clients reading filters generated by new clients will interpret
the filters correctly and be none the wiser to how they were
generated.
Clients will invoke the Bloom machinery in more cases than before, but
this can be addressed by returning a NULL filter when all bits are set
high. This can be addressed in a future patch.
Note that this does increase the size of on-disk commit-graphs, but far
less than other proposals. In particular, this is generally more
efficient than storing a bitmap for which commits haven't computed their
Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
whereas storing explicit filters as above incurs a penalty of one byte
per too-large or empty commit.
In practice, these boundary commits likely occupy a small proportion of
the overall number of commits, and so the size penalty is likely smaller
than storing a bitmap for all commits.
See, for example, these relative proportions of such boundary commits
(collected by SZEDER Gábor):
                 |     Percentage of     |   commit-graph    |           |
                 |   commits modifying   |     file size     |           |
                 ├────────┬──────────────┼───────────────────┤   pct.    |
                 | 0 path | >= 512 paths | before  | after   |  change   |
┌────────────────┼────────┼──────────────┼─────────┼─────────┼───────────┤
| android-base   | 13.20% |        0.13% | 37.468M | 37.534M | +0.1741 % |
| cmssw          |  0.15% |        0.23% | 17.118M | 17.119M | +0.0091 % |
| cpython        |  3.07% |        0.01% |  7.967M |  7.971M | +0.0423 % |
| elasticsearch  |  0.70% |        1.00% |  8.833M |  8.835M | +0.0128 % |
| gcc            |  0.00% |        0.08% | 16.073M | 16.074M | +0.0030 % |
| gecko-dev      |  0.14% |        0.64% | 59.868M | 59.874M | +0.0105 % |
| git            |  0.11% |        0.02% |  3.895M |  3.895M | +0.0020 % |
| glibc          |  0.02% |        0.10% |  3.555M |  3.555M | +0.0021 % |
| go             |  0.00% |        0.07% |  3.186M |  3.186M | +0.0018 % |
| homebrew-cask  |  0.40% |        0.02% |  7.035M |  7.035M | +0.0065 % |
| homebrew-core  |  0.01% |        0.01% | 11.611M | 11.611M | +0.0002 % |
| jdk            |  0.26% |        5.64% |  5.537M |  5.540M | +0.0590 % |
| linux          |  0.01% |        0.51% | 63.735M | 63.740M | +0.0073 % |
| llvm-project   |  0.12% |        0.03% | 25.515M | 25.516M | +0.0050 % |
| rails          |  0.10% |        0.10% |  6.252M |  6.252M | +0.0027 % |
| rust           |  0.07% |        0.17% |  9.364M |  9.364M | +0.0033 % |
| tensorflow     |  0.09% |        1.02% |  7.009M |  7.010M | +0.0158 % |
| webkit         |  0.05% |        0.31% | 17.405M | 17.406M | +0.0047 % |
(where the above increase is determined by computing a non-split
commit-graph before and after this patch).
Given that these projects are all "large" by commit count, the storage
cost by writing these filters explicitly is negligible. In the most
extreme example, android-base (which has 494,848 commits at the time of
writing) would have its commit-graph increase by a modest 68.4 KB.
Finally, a test to exercise filters which contain too many changed path
entries will be introduced in a subsequent patch.
Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
Suggested-by: Jakub Narębski <jnareb@gmail.com>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
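The two storage rules from the commit message above (an all-zero byte for filters with no entries, an all-ones byte for filters with too many) could look roughly like the sketch below. The names `MAX_CHANGED_PATHS`, `encode_boundary_filter`, and `filter_maybe_contains` are hypothetical, and the limit of 512 with a strict "more than" comparison is an assumption taken from the prose of the message, not from the actual Bloom filter code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit; the commit message mentions 512 as a common value. */
#define MAX_CHANGED_PATHS 512

/*
 * If a commit's changed-path count cannot be represented, write the
 * one-byte stand-in filter into *out and return 1. Otherwise leave *out
 * untouched and return 0, meaning a real filter should be computed.
 */
static int encode_boundary_filter(size_t nr_changed_paths, unsigned char *out)
{
	if (nr_changed_paths == 0) {
		*out = 0x00;	/* every query answers "definitely not" */
		return 1;
	}
	if (nr_changed_paths > MAX_CHANGED_PATHS) {
		*out = 0xff;	/* every query answers "maybe" */
		return 1;
	}
	return 0;
}

/* Test a single bit position, taken modulo the filter's length in bits. */
static int filter_maybe_contains(const unsigned char *filter, size_t len,
				 size_t bit_pos)
{
	size_t bit = bit_pos % (len * 8);

	assert(len > 0);
	return (filter[bit / 8] >> (bit % 8)) & 1;
}
```

Because both stand-ins are ordinary one-byte filters, a reader that knows nothing about the new rules still gets the same answers as before, which is the compatibility argument the message makes.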
2020-09-18 10:59:44 +08:00
|
|
|
int count_bloom_filter_trunc_empty;
|
2020-09-17 02:07:32 +08:00
|
|
|
int count_bloom_filter_trunc_large;
|
2019-06-12 21:29:40 +08:00
|
|
|
};
|
|
|
|
|
2020-07-01 21:27:25 +08:00
|
|
|
static int write_graph_chunk_fanout(struct hashfile *f,
|
2021-02-05 22:30:36 +08:00
|
|
|
void *data)
|
2018-04-03 04:34:19 +08:00
|
|
|
{
|
2021-02-05 22:30:36 +08:00
|
|
|
struct write_commit_graph_context *ctx = data;
|
2018-04-03 04:34:19 +08:00
|
|
|
int i, count = 0;
|
2019-06-12 21:29:40 +08:00
|
|
|
struct commit **list = ctx->commits.list;
|
2018-04-03 04:34:19 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Write the first-level table (the list is sorted,
|
|
|
|
* but we use a 256-entry lookup to be able to avoid
|
|
|
|
* having to do eight extra binary search iterations).
|
|
|
|
*/
|
|
|
|
for (i = 0; i < 256; i++) {
|
2019-06-12 21:29:40 +08:00
|
|
|
while (count < ctx->commits.nr) {
|
2018-04-03 04:34:19 +08:00
|
|
|
if ((*list)->object.oid.hash[0] != i)
|
|
|
|
break;
|
2019-06-12 21:29:40 +08:00
|
|
|
display_progress(ctx->progress, ++ctx->progress_cnt);
|
2018-04-03 04:34:19 +08:00
|
|
|
count++;
|
|
|
|
list++;
|
|
|
|
}
|
|
|
|
|
|
|
|
hashwrite_be32(f, count);
|
|
|
|
}
|
2020-07-01 21:27:25 +08:00
|
|
|
|
|
|
|
return 0;
|
2018-04-03 04:34:19 +08:00
|
|
|
}
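To illustrate why the 256-entry first-level table saves binary search iterations, here is a small sketch of both sides: building cumulative counts as the function above does, and using them on the read side to narrow the search range. It is not taken from this file. `OID_LEN`, `build_fanout`, and `lookup` are hypothetical names, and the 4-byte ids are a simplification for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OID_LEN 4 /* shortened ids for illustration; real ids are longer */

/* Build cumulative counts: fanout[b] = number of ids whose first byte <= b. */
static void build_fanout(const unsigned char (*oids)[OID_LEN], size_t nr,
			 uint32_t fanout[256])
{
	size_t count = 0;

	for (int b = 0; b < 256; b++) {
		while (count < nr && oids[count][0] == b)
			count++;
		fanout[b] = (uint32_t)count;
	}
}

/*
 * Return the index of `want` in the sorted table, or -1 if absent.
 * The fanout entry for the first byte bounds the binary search to the
 * ids that share that byte.
 */
static long lookup(const unsigned char (*oids)[OID_LEN],
		   const uint32_t fanout[256], const unsigned char *want)
{
	size_t lo = want[0] ? fanout[want[0] - 1] : 0;
	size_t hi = fanout[want[0]];

	assert(lo <= hi);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oids[mid], want, OID_LEN);

		if (!cmp)
			return (long)mid;
		if (cmp < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return -1;
}
```

With evenly distributed ids, the fanout step shrinks the candidate range by a factor of about 256, which corresponds to the eight bisection steps mentioned in the comment.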
|
|
|
|
|
2020-07-01 21:27:25 +08:00
|
|
|
static int write_graph_chunk_oids(struct hashfile *f,
|
2021-02-05 22:30:36 +08:00
|
|
|
void *data)
|
2018-04-03 04:34:19 +08:00
|
|
|
{
|
2021-02-05 22:30:36 +08:00
|
|
|
struct write_commit_graph_context *ctx = data;
|
2019-06-12 21:29:40 +08:00
|
|
|
struct commit **list = ctx->commits.list;
|
2018-04-03 04:34:19 +08:00
|
|
|
int count;
|
2019-06-12 21:29:40 +08:00
|
|
|
for (count = 0; count < ctx->commits.nr; count++, list++) {
|
|
|
|
display_progress(ctx->progress, ++ctx->progress_cnt);
|
2020-07-01 21:27:25 +08:00
|
|
|
hashwrite(f, (*list)->object.oid.hash, the_hash_algo->rawsz);
|
2019-01-20 04:21:15 +08:00
|
|
|
}
|
2020-07-01 21:27:25 +08:00
|
|
|
|
|
|
|
return 0;
|
2018-04-03 04:34:19 +08:00
|
|
|
}
|
|
|
|
|
2021-01-28 14:20:23 +08:00
|
|
|
static const struct object_id *commit_to_oid(size_t index, const void *table)
|
2018-04-03 04:34:19 +08:00
|
|
|
{
|
2021-01-28 14:20:23 +08:00
|
|
|
const struct commit * const *commits = table;
|
2021-01-28 14:19:42 +08:00
|
|
|
return &commits[index]->object.oid;
|
2018-04-03 04:34:19 +08:00
|
|
|
}
|
|
|
|
|
2020-07-01 21:27:25 +08:00
|
|
|
static int write_graph_chunk_data(struct hashfile *f,
|
2021-02-05 22:30:36 +08:00
|
|
|
void *data)
|
2018-04-03 04:34:19 +08:00
|
|
|
{
|
2021-02-05 22:30:36 +08:00
|
|
|
struct write_commit_graph_context *ctx = data;
|
2019-06-12 21:29:40 +08:00
|
|
|
struct commit **list = ctx->commits.list;
|
|
|
|
struct commit **last = ctx->commits.list + ctx->commits.nr;
|
2018-04-03 04:34:19 +08:00
|
|
|
uint32_t num_extra_edges = 0;
|
|
|
|
|
|
|
|
while (list < last) {
|
|
|
|
struct commit_list *parent;
|
2019-09-06 06:04:57 +08:00
|
|
|
struct object_id *tree;
|
2018-04-03 04:34:19 +08:00
|
|
|
int edge_value;
|
|
|
|
uint32_t packedDate[2];
|
2019-06-12 21:29:40 +08:00
|
|
|
display_progress(ctx->progress, ++ctx->progress_cnt);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2021-02-02 01:15:03 +08:00
|
|
|
if (repo_parse_commit_no_graph(ctx->r, *list))
|
commit-graph.c: handle commit parsing errors
To write a commit graph chunk, 'write_graph_chunk_data()' takes a list
of commits to write and parses each one before writing the necessary
data, and continuing on to the next commit in the list.
Since the majority of these commits are not parsed ahead of time (an
exception is made for the *last* commit in the list, which is parsed
early within 'copy_oids_to_commits'), it is possible that calling
'parse_commit_no_graph()' on them may return an error. Failing to catch
these errors before de-referencing later calls can result in an undefined
memory access and a SIGSEGV.
One such example of this is 'get_commit_tree_oid()', which expects a
parsed object as its input (in this case, the commit-graph code passes
'*list'). If '*list' causes a parse error, the subsequent call will
fail.
Prevent such an issue by checking the return value of
'parse_commit_no_graph()' to avoid passing an unparsed object to a
function which expects a parsed object, thus preventing a segfault.
It is worth noting that this fix is really skirting around the issue in
object.c's 'parse_object()', which makes it difficult to tell how
corrupt an object is without digging into it. Presumably one could
change the meaning of 'parse_object' returns, but this would require
adjusting each callsite accordingly. Instead of that, add an additional
check to the object parsed.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-06 06:04:55 +08:00
|
|
|
die(_("unable to parse commit %s"),
|
|
|
|
oid_to_hex(&(*list)->object.oid));
|
2019-09-06 06:04:57 +08:00
|
|
|
tree = get_commit_tree_oid(*list);
|
2020-07-01 21:27:25 +08:00
|
|
|
hashwrite(f, tree->hash, the_hash_algo->rawsz);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
|
|
|
parent = (*list)->parents;
|
|
|
|
|
|
|
|
if (!parent)
|
|
|
|
edge_value = GRAPH_PARENT_NONE;
|
|
|
|
else {
|
2021-01-28 14:19:42 +08:00
|
|
|
edge_value = oid_pos(&parent->item->object.oid,
|
|
|
|
ctx->commits.list,
|
|
|
|
ctx->commits.nr,
|
|
|
|
commit_to_oid);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-19 02:14:27 +08:00
|
|
|
if (edge_value >= 0)
|
|
|
|
edge_value += ctx->new_num_commits_in_base;
|
builtin/commit-graph.c: introduce split strategy 'replace'
When using split commit-graphs, it is sometimes useful to completely
replace the commit-graph chain with a new base.
For example, consider a scenario in which a repository builds a new
commit-graph incremental for each push. Occasionally (say, after some
fixed number of pushes), they may wish to rebuild the commit-graph chain
with all reachable commits.
They can do so with
$ git commit-graph write --reachable
but this removes the chain entirely and replaces it with a single
commit-graph in 'objects/info/commit-graph'. Unfortunately, this means
that the next push will have to move this commit-graph into the first
layer of a new chain, and then write its new commits on top.
Avoid such copying entirely by allowing the caller to specify that they
wish to replace the entirety of their commit-graph chain, while also
specifying that the new commit-graph should become the basis of a fresh,
length-one chain.
This addresses the above situation by making it possible for the caller
to instead write:
$ git commit-graph write --reachable --split=replace
which writes a new length-one chain to 'objects/info/commit-graphs',
making the commit-graph incremental generated by the subsequent push
relatively cheap by avoiding the aforementioned copy.
In order to do this, remove an assumption in 'write_commit_graph_file'
that chains are always at least two incrementals long.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-14 12:04:17 +08:00
|
|
|
else if (ctx->new_base_graph) {
|
2019-06-19 02:14:27 +08:00
|
|
|
uint32_t pos;
|
2021-08-09 16:11:59 +08:00
|
|
|
if (find_commit_pos_in_graph(parent->item,
|
|
|
|
ctx->new_base_graph,
|
|
|
|
&pos))
|
2019-06-19 02:14:27 +08:00
|
|
|
edge_value = pos;
|
|
|
|
}
|
|
|
|
|
2018-04-03 04:34:19 +08:00
|
|
|
if (edge_value < 0)
|
2018-12-20 04:14:07 +08:00
|
|
|
BUG("missing parent %s for commit %s",
|
|
|
|
oid_to_hex(&parent->item->object.oid),
|
|
|
|
oid_to_hex(&(*list)->object.oid));
|
2018-04-03 04:34:19 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
hashwrite_be32(f, edge_value);
|
|
|
|
|
|
|
|
if (parent)
|
|
|
|
parent = parent->next;
|
|
|
|
|
|
|
|
if (!parent)
|
|
|
|
edge_value = GRAPH_PARENT_NONE;
|
|
|
|
else if (parent->next)
|
commit-graph: rename "large edges" to "extra edges"
The optional 'Large Edge List' chunk of the commit graph file stores
parent information for commits with more than two parents, and the
names of most of the macros, variables, struct fields, and functions
related to this chunk contain the term "large edges", e.g.
write_graph_chunk_large_edges(). However, it's not a really great
term, as the edges to the second and subsequent parents stored in this
chunk are not any larger than the edges to the first and second
parents stored in the "main" 'Commit Data' chunk. It's the number of
edges, IOW number of parents, that is larger compared to non-merge and
"regular" two-parent merge commits. And indeed, two functions in
'commit-graph.c' have a local variable called 'num_extra_edges' that
refer to the same thing, and this "extra edges" term is much better at
describing these edges.
So let's rename all these references to "large edges" in macro,
variable, function, etc. names to "extra edges". There is a
GRAPH_OCTOPUS_EDGES_NEEDED macro as well; for the sake of consistency
rename it to GRAPH_EXTRA_EDGES_NEEDED.
We can do so safely without causing any incompatibility issues,
because the term "large edges" doesn't come up in the file format
itself in any form (the chunk's magic is {'E', 'D', 'G', 'E'}, there
is no 'L' in there), but only in the specification text. The string
"large edges", however, does come up in the output of 'git
commit-graph read' and in tests looking at its input, but that command
is explicitly documented as debugging aid, so we can change its output
and the affected tests safely.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-20 04:21:13 +08:00
|
|
|
edge_value = GRAPH_EXTRA_EDGES_NEEDED | num_extra_edges;
|
2018-04-03 04:34:19 +08:00
|
|
|
else {
|
2021-01-28 14:19:42 +08:00
|
|
|
edge_value = oid_pos(&parent->item->object.oid,
|
|
|
|
ctx->commits.list,
|
|
|
|
ctx->commits.nr,
|
|
|
|
commit_to_oid);
|
2019-06-19 02:14:27 +08:00
|
|
|
|
|
|
|
if (edge_value >= 0)
|
|
|
|
edge_value += ctx->new_num_commits_in_base;
|
2020-04-14 12:04:17 +08:00
|
|
|
else if (ctx->new_base_graph) {
|
2019-06-19 02:14:27 +08:00
|
|
|
uint32_t pos;
|
2021-08-09 16:11:59 +08:00
|
|
|
if (find_commit_pos_in_graph(parent->item,
|
|
|
|
ctx->new_base_graph,
|
|
|
|
&pos))
|
2019-06-19 02:14:27 +08:00
|
|
|
edge_value = pos;
|
|
|
|
}
|
|
|
|
|
2018-04-03 04:34:19 +08:00
|
|
|
if (edge_value < 0)
|
2018-12-20 04:14:07 +08:00
|
|
|
BUG("missing parent %s for commit %s",
|
|
|
|
oid_to_hex(&parent->item->object.oid),
|
|
|
|
oid_to_hex(&(*list)->object.oid));
|
2018-04-03 04:34:19 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
hashwrite_be32(f, edge_value);
|
|
|
|
|
2019-01-20 04:21:13 +08:00
|
|
|
if (edge_value & GRAPH_EXTRA_EDGES_NEEDED) {
|
2018-04-03 04:34:19 +08:00
|
|
|
do {
|
|
|
|
num_extra_edges++;
|
|
|
|
parent = parent->next;
|
|
|
|
} while (parent);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (sizeof((*list)->date) > 4)
|
|
|
|
packedDate[0] = htonl(((*list)->date >> 32) & 0x3);
|
|
|
|
else
|
|
|
|
packedDate[0] = 0;
|
|
|
|
|
2021-01-17 02:11:12 +08:00
|
|
|
packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
|
2018-05-01 20:47:09 +08:00
|
|
|
|
2018-04-03 04:34:19 +08:00
|
|
|
packedDate[1] = htonl((*list)->date);
|
|
|
|
hashwrite(f, packedDate, 8);
|
|
|
|
|
|
|
|
list++;
|
|
|
|
}
|
2020-07-01 21:27:25 +08:00
|
|
|
|
|
|
|
return 0;
|
2018-04-03 04:34:19 +08:00
|
|
|
}
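The last eight bytes written for each commit in the function above combine the topological level with the commit date: the level occupies the upper 30 bits and the date the lower 34 bits of a big-endian 64-bit field. A minimal sketch of that layout, with the matching decode step, is shown below. The function names are hypothetical, and the bytes are written by hand rather than with `htonl()` so that the example has no platform dependencies.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 30-bit level and a 34-bit date into 8 big-endian bytes. */
static void pack_level_and_date(uint32_t level, uint64_t date,
				unsigned char out[8])
{
	uint32_t hi, lo;

	assert(level < (UINT32_C(1) << 30));
	hi = (level << 2) | (uint32_t)((date >> 32) & 0x3);
	lo = (uint32_t)date;

	for (int i = 0; i < 4; i++) {
		out[i]     = (unsigned char)(hi >> (24 - 8 * i));
		out[4 + i] = (unsigned char)(lo >> (24 - 8 * i));
	}
}

/* Reverse the packing: split the first word into level and date bits. */
static void unpack_level_and_date(const unsigned char in[8],
				  uint32_t *level, uint64_t *date)
{
	uint32_t hi = 0, lo = 0;

	for (int i = 0; i < 4; i++) {
		hi = (hi << 8) | in[i];
		lo = (lo << 8) | in[4 + i];
	}
	*level = hi >> 2;
	*date = ((uint64_t)(hi & 0x3) << 32) | lo;
}
```

Dates that need more than 34 bits lose their upper bits in this layout, and levels are limited to 30 bits.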
|
|
|
|
|
2021-01-17 02:11:15 +08:00
|
|
|
static int write_graph_chunk_generation_data(struct hashfile *f,
|
2021-02-05 22:30:36 +08:00
|
|
|
void *data)
|
commit-graph: implement generation data chunk
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.
We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.
Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.
To minimize the space required to store corrrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instea
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:
We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.
We test the overflow-related code with the following repo history:
          F - N - U
         /         \
U - N - U            N
         \          /
          N - F - N
Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.
The largest offset observed is 2 ^ 31, just large enough to overflow.
[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:15 +08:00
|
|
|
{
|
2021-02-05 22:30:36 +08:00
|
|
|
struct write_commit_graph_context *ctx = data;
|
2021-01-17 02:11:15 +08:00
|
|
|
int i, num_generation_data_overflows = 0;
|
|
|
|
|
|
|
|
for (i = 0; i < ctx->commits.nr; i++) {
|
|
|
|
struct commit *c = ctx->commits.list[i];
|
2021-02-02 01:15:04 +08:00
|
|
|
timestamp_t offset;
|
|
|
|
repo_parse_commit(ctx->r, c);
|
|
|
|
offset = commit_graph_data_at(c)->generation - c->date;
|
2021-01-17 02:11:15 +08:00
|
|
|
display_progress(ctx->progress, ++ctx->progress_cnt);
|
|
|
|
|
|
|
|
if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
|
|
|
|
offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
|
|
|
|
num_generation_data_overflows++;
|
|
|
|
}
|
|
|
|
|
|
|
|
hashwrite_be32(f, offset);
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
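The overflow scheme that write_graph_chunk_generation_data() applies can be sketched in isolation. The following is a minimal sketch, not the code from commit-graph.c: the `sketch_*` names are hypothetical, and the constant values (31 usable bits, MSB as the overflow marker) are assumptions based on the description in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration: 31 usable bits, MSB marks overflow. */
#define SKETCH_OFFSET_MAX ((1ULL << 31) - 1)
#define SKETCH_OFFSET_OVERFLOW (1ULL << 31)

/*
 * Encode one GDAT entry. An offset that fits in 31 bits is stored
 * directly. Otherwise the MSB is set and the low bits hold the index
 * of the next free GDOV slot, which is then advanced.
 */
static uint32_t sketch_encode_gdat(uint64_t offset, uint32_t *next_overflow_slot)
{
	if (offset > SKETCH_OFFSET_MAX) {
		uint32_t slot = (*next_overflow_slot)++;
		assert(slot <= SKETCH_OFFSET_MAX); /* index must fit in 31 bits */
		return (uint32_t)(SKETCH_OFFSET_OVERFLOW | slot);
	}
	return (uint32_t)offset;
}

/*
 * Decode one GDAT entry. When the MSB is set, the low 31 bits index
 * into the overflow table (assumed to be in host byte order already).
 */
static uint64_t sketch_decode_gdat(uint32_t entry, const uint64_t *gdov)
{
	if (entry & SKETCH_OFFSET_OVERFLOW)
		return gdov[entry & SKETCH_OFFSET_MAX];
	return entry;
}
```

A writer would call the encoder once per commit in graph order, so that overflow slots are handed out in the same order in which the overflow chunk is later written.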
|
|
|
|
|
|
|
|
static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
|
2021-02-05 22:30:36 +08:00
|
|
|
void *data)
|
2021-01-17 02:11:15 +08:00
|
|
|
{
|
2021-02-05 22:30:36 +08:00
|
|
|
struct write_commit_graph_context *ctx = data;
|
2021-01-17 02:11:15 +08:00
|
|
|
int i;
|
|
|
|
for (i = 0; i < ctx->commits.nr; i++) {
|
|
|
|
struct commit *c = ctx->commits.list[i];
|
|
|
|
timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
|
|
|
|
display_progress(ctx->progress, ++ctx->progress_cnt);
|
|
|
|
|
|
|
|
if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
|
|
|
|
hashwrite_be32(f, offset >> 32);
|
|
|
|
hashwrite_be32(f, (uint32_t) offset);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
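Each overflow entry above is emitted as two big-endian 32-bit words, high word first, which is equivalent to one big-endian 64-bit value. A minimal sketch of that byte layout follows; the `sketch_*` helpers are hypothetical and operate on a plain buffer rather than a `struct hashfile`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in big-endian (network) byte order. */
static void sketch_put_be32(unsigned char *out, uint32_t v)
{
	assert(out != NULL);
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/* Store a 64-bit value as its high word followed by its low word. */
static void sketch_put_be64(unsigned char *out, uint64_t v)
{
	sketch_put_be32(out, (uint32_t)(v >> 32));
	sketch_put_be32(out + 4, (uint32_t)v);
}

/* Read the 8 bytes back into a 64-bit value. */
static uint64_t sketch_get_be64(const unsigned char *in)
{
	uint64_t v = 0;
	size_t i;

	assert(in != NULL);
	for (i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}
```

Writing the high word first keeps the on-disk value independent of host endianness, so a reader on any platform recovers the same 64-bit offset.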
|
|
|
|
|
2020-07-01 21:27:25 +08:00
|
|
|
static int write_graph_chunk_extra_edges(struct hashfile *f,
|
2021-02-05 22:30:36 +08:00
|
|
|
void *data)
|
2018-04-03 04:34:19 +08:00
|
|
|
{
|
2021-02-05 22:30:36 +08:00
|
|
|
struct write_commit_graph_context *ctx = data;
|
2019-06-12 21:29:40 +08:00
|
|
|
struct commit **list = ctx->commits.list;
|
|
|
|
struct commit **last = ctx->commits.list + ctx->commits.nr;
|
2018-04-03 04:34:19 +08:00
|
|
|
struct commit_list *parent;
|
|
|
|
|
|
|
|
while (list < last) {
|
|
|
|
int num_parents = 0;
|
2019-01-20 04:21:15 +08:00
|
|
|
|
2019-06-12 21:29:40 +08:00
|
|
|
display_progress(ctx->progress, ++ctx->progress_cnt);
|
2019-01-20 04:21:15 +08:00
|
|
|
|
2018-04-03 04:34:19 +08:00
|
|
|
for (parent = (*list)->parents; num_parents < 3 && parent;
|
|
|
|
parent = parent->next)
|
|
|
|
num_parents++;
|
|
|
|
|
|
|
|
if (num_parents <= 2) {
|
|
|
|
list++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Since num_parents > 2, this initializer is safe. */
|
|
|
|
for (parent = (*list)->parents->next; parent; parent = parent->next) {
|
2021-01-28 14:19:42 +08:00
|
|
|
int edge_value = oid_pos(&parent->item->object.oid,
|
|
|
|
ctx->commits.list,
|
|
|
|
ctx->commits.nr,
|
|
|
|
commit_to_oid);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-19 02:14:27 +08:00
|
|
|
if (edge_value >= 0)
|
|
|
|
edge_value += ctx->new_num_commits_in_base;
|
builtin/commit-graph.c: introduce split strategy 'replace'
When using split commit-graphs, it is sometimes useful to completely
replace the commit-graph chain with a new base.
For example, consider a scenario in which a repository builds a new
commit-graph incremental for each push. Occasionally (say, after some
fixed number of pushes), they may wish to rebuild the commit-graph chain
with all reachable commits.
They can do so with
$ git commit-graph write --reachable
but this removes the chain entirely and replaces it with a single
commit-graph in 'objects/info/commit-graph'. Unfortunately, this means
that the next push will have to move this commit-graph into the first
layer of a new chain, and then write its new commits on top.
Avoid such copying entirely by allowing the caller to specify that they
wish to replace the entirety of their commit-graph chain, while also
specifying that the new commit-graph should become the basis of a fresh,
length-one chain.
This addresses the above situation by making it possible for the caller
to instead write:
$ git commit-graph write --reachable --split=replace
which writes a new length-one chain to 'objects/info/commit-graphs',
making the commit-graph incremental generated by the subsequent push
relatively cheap by avoiding the aforementioned copy.
In order to do this, remove an assumption in 'write_commit_graph_file'
that chains are always at least two incrementals long.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-14 12:04:17 +08:00
|
|
|
else if (ctx->new_base_graph) {
|
2019-06-19 02:14:27 +08:00
|
|
|
uint32_t pos;
|
2021-08-09 16:11:59 +08:00
|
|
|
if (find_commit_pos_in_graph(parent->item,
|
|
|
|
ctx->new_base_graph,
|
|
|
|
&pos))
|
2019-06-19 02:14:27 +08:00
|
|
|
edge_value = pos;
|
|
|
|
}
|
|
|
|
|
2018-04-03 04:34:19 +08:00
|
|
|
if (edge_value < 0)
|
2018-12-20 04:14:07 +08:00
|
|
|
BUG("missing parent %s for commit %s",
|
|
|
|
oid_to_hex(&parent->item->object.oid),
|
|
|
|
oid_to_hex(&(*list)->object.oid));
|
2018-04-03 04:34:19 +08:00
|
|
|
else if (!parent->next)
|
|
|
|
edge_value |= GRAPH_LAST_EDGE;
|
|
|
|
|
|
|
|
hashwrite_be32(f, edge_value);
|
|
|
|
}
|
|
|
|
|
|
|
|
list++;
|
|
|
|
}
|
2020-07-01 21:27:25 +08:00
|
|
|
|
|
|
|
return 0;
|
2018-04-03 04:34:19 +08:00
|
|
|
}
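For commits with more than two parents, the function above writes every parent after the first as a 32-bit position and flags the final one with GRAPH_LAST_EDGE. A minimal sketch of that list encoding, working on plain arrays of positions, is given below. The `sketch_*` names are hypothetical, and the marker value (the MSB) is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed marker for illustration: MSB flags the final entry of a list. */
#define SKETCH_LAST_EDGE 0x80000000u

/*
 * Append parents[1..nr_parents-1] to `out`, flagging the last entry.
 * Commits with at most two parents contribute nothing.
 * Returns the number of entries written.
 */
static size_t sketch_write_extra_edges(const uint32_t *parents,
				       size_t nr_parents, uint32_t *out)
{
	size_t i, n = 0;

	if (nr_parents <= 2)
		return 0;
	for (i = 1; i < nr_parents; i++) {
		uint32_t v = parents[i];

		assert(!(v & SKETCH_LAST_EDGE)); /* positions use 31 bits */
		if (i + 1 == nr_parents)
			v |= SKETCH_LAST_EDGE;
		out[n++] = v;
	}
	return n;
}

/*
 * Read one list starting at `list`, stopping after the flagged entry.
 * Returns the number of positions stored in `parents_out`.
 */
static size_t sketch_read_extra_edges(const uint32_t *list,
				      uint32_t *parents_out)
{
	size_t n = 0;

	for (;;) {
		uint32_t v = list[n];

		parents_out[n] = v & ~SKETCH_LAST_EDGE;
		n++;
		if (v & SKETCH_LAST_EDGE)
			break;
	}
	return n;
}
```

Terminating each list with a flag bit means no per-commit length field is needed; a reader only needs the starting index of a commit's list.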
|
|
|
|
|
2020-07-01 21:27:25 +08:00
|
|
|
static int write_graph_chunk_bloom_indexes(struct hashfile *f,
|
2021-02-05 22:30:36 +08:00
|
|
|
void *data)
|
2020-04-07 00:59:49 +08:00
|
|
|
{
|
2021-02-05 22:30:36 +08:00
|
|
|
struct write_commit_graph_context *ctx = data;
|
2020-04-07 00:59:49 +08:00
|
|
|
struct commit **list = ctx->commits.list;
|
|
|
|
struct commit **last = ctx->commits.list + ctx->commits.nr;
|
|
|
|
uint32_t cur_pos = 0;
|
|
|
|
|
|
|
|
while (list < last) {
|
bloom: split 'get_bloom_filter()' in two
'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another parameter to this method, which would force all but one
caller to specify an extra 'NULL' parameter at the end.
Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.
This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).
While we're at it, instrument the new 'get_or_compute_bloom_filter()'
with counters in the 'write_commit_graph_context' struct which store
the number of filters that we did and didn't compute, as well as filters
that were truncated.
It would be nice to drop the 'compute_if_not_present' flag entirely,
since all remaining callers of 'get_or_compute_bloom_filter' pass it as
'1', but this will change in a future patch and hence cannot be removed.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 02:07:32 +08:00
|
|
|
struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
|
2020-07-01 21:27:23 +08:00
|
|
|
size_t len = filter ? filter->len : 0;
|
|
|
|
cur_pos += len;
|
2020-07-10 01:00:03 +08:00
|
|
|
display_progress(ctx->progress, ++ctx->progress_cnt);
|
2020-04-07 00:59:49 +08:00
|
|
|
hashwrite_be32(f, cur_pos);
|
|
|
|
list++;
|
|
|
|
}
|
|
|
|
|
2020-07-01 21:27:25 +08:00
|
|
|
return 0;
|
2020-04-07 00:59:49 +08:00
|
|
|
}
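The index chunk written above stores, for each commit, the running total of filter lengths, that is, the end offset of that commit's filter in the data chunk. A minimal sketch of building such an index and recovering one filter's byte range from it follows; the `sketch_*` names are hypothetical and the lengths are passed in as a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill ends[i] with the cumulative length of filters 0..i. */
static void sketch_bloom_index(const size_t *lens, size_t nr, uint32_t *ends)
{
	uint32_t cur = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		/* the running total must fit in 32 bits */
		assert(lens[i] <= UINT32_MAX - cur);
		cur += (uint32_t)lens[i];
		ends[i] = cur;
	}
}

/*
 * Recover the start and length of filter i: it begins where the
 * previous filter ends (or at 0 for the first one).
 */
static void sketch_filter_range(const uint32_t *ends, size_t i,
				uint32_t *start, uint32_t *len)
{
	*start = i ? ends[i - 1] : 0;
	*len = ends[i] - *start;
}
```

A commit with no filter repeats the previous end offset, so its recovered length is zero.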
|
|
|
|
|
2020-07-01 21:27:24 +08:00
|
|
|
static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
|
|
|
|
{
|
|
|
|
struct json_writer jw = JSON_WRITER_INIT;
|
|
|
|
|
|
|
|
jw_object_begin(&jw, 0);
|
|
|
|
jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
|
|
|
|
jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
|
|
|
|
jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
|
2020-09-17 21:34:42 +08:00
|
|
|
jw_object_intmax(&jw, "max_changed_paths", ctx->bloom_settings->max_changed_paths);
|
2020-07-01 21:27:24 +08:00
|
|
|
jw_end(&jw);
|
|
|
|
|
|
|
|
trace2_data_json("bloom", ctx->r, "settings", &jw);
|
|
|
|
|
|
|
|
jw_release(&jw);
|
2020-04-07 00:59:49 +08:00
|
|
|
}
|
|
|
|
|
2020-07-01 21:27:25 +08:00
|
|
|
static int write_graph_chunk_bloom_data(struct hashfile *f,
|
2021-02-05 22:30:36 +08:00
|
|
|
void *data)
|
2020-04-07 00:59:49 +08:00
|
|
|
{
|
2021-02-05 22:30:36 +08:00
|
|
|
struct write_commit_graph_context *ctx = data;
|
2020-04-07 00:59:49 +08:00
|
|
|
struct commit **list = ctx->commits.list;
|
|
|
|
struct commit **last = ctx->commits.list + ctx->commits.nr;
|
|
|
|
|
2020-07-01 21:27:24 +08:00
|
|
|
trace2_bloom_filter_settings(ctx);
|
|
|
|
|
2020-06-24 01:47:00 +08:00
|
|
|
hashwrite_be32(f, ctx->bloom_settings->hash_version);
|
|
|
|
hashwrite_be32(f, ctx->bloom_settings->num_hashes);
|
|
|
|
hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
|
2020-04-07 00:59:49 +08:00
|
|
|
|
|
|
|
while (list < last) {
|
2020-09-17 02:07:32 +08:00
|
|
|
struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
|
2020-07-01 21:27:23 +08:00
|
|
|
size_t len = filter ? filter->len : 0;
|
|
|
|
|
2020-07-10 01:00:03 +08:00
|
|
|
display_progress(ctx->progress, ++ctx->progress_cnt);
|
2020-07-01 21:27:23 +08:00
|
|
|
if (len)
|
|
|
|
hashwrite(f, filter->data, len * sizeof(unsigned char));
|
2020-04-07 00:59:49 +08:00
|
|
|
list++;
|
|
|
|
}
|
|
|
|
|
2020-07-01 21:27:25 +08:00
|
|
|
return 0;
|
2020-04-07 00:59:49 +08:00
|
|
|
}
|
|
|
|
|
2018-04-03 04:34:19 +08:00
|
|
|
static int add_packed_commits(const struct object_id *oid,
|
|
|
|
struct packed_git *pack,
|
|
|
|
uint32_t pos,
|
|
|
|
void *data)
|
|
|
|
{
|
2019-06-12 21:29:40 +08:00
|
|
|
struct write_commit_graph_context *ctx = (struct write_commit_graph_context*)data;
|
2018-04-03 04:34:19 +08:00
|
|
|
enum object_type type;
|
|
|
|
off_t offset = nth_packed_object_offset(pack, pos);
|
|
|
|
struct object_info oi = OBJECT_INFO_INIT;
|
|
|
|
|
2019-06-12 21:29:40 +08:00
|
|
|
if (ctx->progress)
|
|
|
|
display_progress(ctx->progress, ++ctx->progress_done);
|
commit-graph write: add progress output
Before this change the "commit-graph write" command didn't report any
progress. On my machine this command takes more than 10 seconds to
write the graph for linux.git, and around 1m30s on the
2015-04-03-1M-git.git[1] test repository (a test case for a large
monorepository).
Furthermore, since the gc.writeCommitGraph setting was added in
d5d5d7b641 ("gc: automatically write commit-graph files", 2018-06-27),
there was no indication at all from a "git gc" run that anything was
different. This is why one of the progress bars being added here uses
start_progress() instead of start_delayed_progress(), so that it's
guaranteed to be seen. E.g. on my tiny 867 commit dotfiles.git
repository:
$ git -c gc.writeCommitGraph=true gc
Enumerating objects: 2821, done.
[...]
Computing commit graph generation numbers: 100% (867/867), done.
On larger repositories, such as linux.git the delayed progress bar(s)
will kick in, and we'll show what's going on instead of, as was
previously happening, printing nothing while we write the graph:
$ git -c gc.writeCommitGraph=true gc
[...]
Annotating commits in commit graph: 1565573, done.
Computing commit graph generation numbers: 100% (782484/782484), done.
Note that here we don't show "Finding commits for commit graph", this
is because under "git gc" we seed the search with the commit
references in the repository, and that set is too small to show any
progress, but would e.g. on a smaller repo such as git.git with
--stdin-commits:
$ git rev-list --all | git -c gc.writeCommitGraph=true write --stdin-commits
Finding commits for commit graph: 100% (162576/162576), done.
Computing commit graph generation numbers: 100% (162576/162576), done.
With --stdin-packs we don't show any estimation of how much is left to
do. This is because we might be processing more than one pack. We
could be less lazy here and show progress, either by detecting that
we're only processing one pack, or by first looping over the packs to
discover how many commits they have. I don't see the point in doing
that work. So instead we get (on 2015-04-03-1M-git.git):
$ echo pack-<HASH>.idx | git -c gc.writeCommitGraph=true --exec-path=$PWD commit-graph write --stdin-packs
Finding commits for commit graph: 13064614, done.
Annotating commits in commit graph: 3001341, done.
Computing commit graph generation numbers: 100% (1000447/1000447), done.
No GC mode uses --stdin-packs. It's what they use at Microsoft to
manually compute the generation numbers for their collection of large
packs which are never coalesced.
The reason we need a "report_progress" variable passed down from "git
gc" is so that we don't report this output when we're running in the
process "git gc --auto" detaches from the terminal.
Since we write the commit graph from the "git gc" process itself (as
opposed to what we do with say the "git repack" phase), we'd end up
writing the output to .git/gc.log and reporting it to the user next
time as part of the "The last gc run reported the following[...]"
error, see 329e6e8794 ("gc: save log from daemonized gc --auto and
print it next time", 2015-09-19).
So we must keep track of whether or not we're running in that
daemonized mode, and if so print no progress.
See [2] and subsequent replies for a discussion of an approach not
taken in compute_generation_numbers(). I.e. we're saying "Computing
commit graph generation numbers", even though on an established
history we're mostly skipping over all the work we did in the
past. This is similar to the white lie we tell in the "Writing
objects" phase (not all objects are being written).
Always showing progress is considered more important than
accuracy. I.e. on a repository like 2015-04-03-1M-git.git we'd hang
for 6 seconds with no output on the second "git gc" if no changes were
made to any objects in the interim if we'd take the approach in [2].
1. https://github.com/avar/2015-04-03-1M-git
2. <c6960252-c095-fb2b-e0bc-b1e6bb261614@gmail.com>
(https://public-inbox.org/git/c6960252-c095-fb2b-e0bc-b1e6bb261614@gmail.com/)
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-09-17 23:33:35 +08:00
|
|
|
|
2018-04-03 04:34:19 +08:00
|
|
|
oi.typep = &type;
|
2019-06-12 21:29:40 +08:00
|
|
|
if (packed_object_info(ctx->r, pack, offset, &oi) < 0)
|
2018-07-21 15:49:26 +08:00
|
|
|
die(_("unable to get type of object %s"), oid_to_hex(oid));
|
2018-04-03 04:34:19 +08:00
|
|
|
|
|
|
|
if (type != OBJ_COMMIT)
|
|
|
|
return 0;
|
|
|
|
|
2020-12-08 03:11:05 +08:00
|
|
|
oid_array_append(&ctx->oids, oid);
|
2020-03-30 08:31:29 +08:00
|
|
|
set_commit_pos(ctx->r, oid);
|
|
|
|
|
2018-04-03 04:34:19 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-06-12 21:29:40 +08:00
|
|
|
static void add_missing_parents(struct write_commit_graph_context *ctx, struct commit *commit)
|
2018-04-10 20:56:04 +08:00
|
|
|
{
|
|
|
|
struct commit_list *parent;
|
|
|
|
for (parent = commit->parents; parent; parent = parent->next) {
|
commit-graph: fix writing first commit-graph during fetch
The previous commit includes a failing test for an issue around
fetch.writeCommitGraph and fetching in a repo with a submodule. Here, we
fix that bug and set the test to "test_expect_success".
The problem arises with this set of commands when the remote repo at
<url> has a submodule. Note that --recurse-submodules is not needed to
demonstrate the bug.
$ git clone <url> test
$ cd test
$ git -c fetch.writeCommitGraph=true fetch origin
Computing commit graph generation numbers: 100% (12/12), done.
BUG: commit-graph.c:886: missing parent <hash1> for commit <hash2>
Aborted (core dumped)
As an initial fix, I converted the code in builtin/fetch.c that calls
write_commit_graph_reachable() to instead launch a "git commit-graph
write --reachable --split" process. That code worked, but is not how we
want the feature to work long-term.
That test did demonstrate that the issue must be something to do with
internal state of the 'git fetch' process.
The write_commit_graph() method in commit-graph.c ensures the commits we
plan to write are "closed under reachability" using close_reachable().
This method walks from the input commits, and uses the UNINTERESTING
flag to mark which commits have already been visited. This allows the
walk to take O(N) time, where N is the number of commits, instead of
O(P) time, where P is the number of paths. (The number of paths can be
exponential in the number of commits.)
However, the UNINTERESTING flag is used in lots of places in the
codebase. This flag usually means some barrier to stop a commit walk,
such as in revision-walking to compare histories. It is not often
cleared after the walk completes because the starting points of those
walks do not have the UNINTERESTING flag, and clear_commit_marks() would
stop immediately.
This is happening during a 'git fetch' call with a remote. The fetch
negotiation is comparing the remote refs with the local refs and marking
some commits as UNINTERESTING.
I tested running clear_commit_marks_many() to clear the UNINTERESTING
flag inside close_reachable(), but the tips did not have the flag, so
that did nothing.
It turns out that the calculate_changed_submodule_paths() method is at
fault. Thanks, Peff, for pointing out this detail! More specifically,
for each submodule, the collect_changed_submodules() runs a revision
walk to essentially do file-history on the list of submodules. That
revision walk marks commits UNININTERESTING if they are simplified away
by not changing the submodule.
Instead, I finally arrived on the conclusion that I should use a flag
that is not used in any other part of the code. In commit-reach.c, a
number of flags were defined for commit walk algorithms. The REACHABLE
flag seemed like it made the most sense, and it seems it was not
actually used in the file. The REACHABLE flag was used in early versions
of commit-reach.c, but was removed by 4fbcca4 (commit-reach: make
can_all_from_reach... linear, 2018-07-20).
Add the REACHABLE flag to commit-graph.c and use it instead of
UNINTERESTING in close_reachable(). This fixes the bug in manual
testing.
Reported-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-24 21:40:42 +08:00
|
|
|
if (!(parent->item->object.flags & REACHABLE)) {
|
2020-12-08 03:11:05 +08:00
|
|
|
oid_array_append(&ctx->oids, &parent->item->object.oid);
|
commit-graph: fix writing first commit-graph during fetch
The previous commit includes a failing test for an issue around
fetch.writeCommitGraph and fetching in a repo with a submodule. Here, we
fix that bug and set the test to "test_expect_success".
The problem arises with this set of commands when the remote repo at
<url> has a submodule. Note that --recurse-submodules is not needed to
demonstrate the bug.
$ git clone <url> test
$ cd test
$ git -c fetch.writeCommitGraph=true fetch origin
Computing commit graph generation numbers: 100% (12/12), done.
BUG: commit-graph.c:886: missing parent <hash1> for commit <hash2>
Aborted (core dumped)
As an initial fix, I converted the code in builtin/fetch.c that calls
write_commit_graph_reachable() to instead launch a "git commit-graph
write --reachable --split" process. That code worked, but is not how we
want the feature to work long-term.
That test did demonstrate that the issue must be something to do with
internal state of the 'git fetch' process.
The write_commit_graph() method in commit-graph.c ensures the commits we
plan to write are "closed under reachability" using close_reachable().
This method walks from the input commits, and uses the UNINTERESTING
flag to mark which commits have already been visited. This allows the
walk to take O(N) time, where N is the number of commits, instead of
O(P) time, where P is the number of paths. (The number of paths can be
exponential in the number of commits.)
However, the UNINTERESTING flag is used in lots of places in the
codebase. This flag usually means some barrier to stop a commit walk,
such as in revision-walking to compare histories. It is not often
cleared after the walk completes because the starting points of those
walks do not have the UNINTERESTING flag, and clear_commit_marks() would
stop immediately.
This is happening during a 'git fetch' call with a remote. The fetch
negotiation is comparing the remote refs with the local refs and marking
some commits as UNINTERESTING.
I tested running clear_commit_marks_many() to clear the UNINTERESTING
flag inside close_reachable(), but the tips did not have the flag, so
that did nothing.
It turns out that the calculate_changed_submodule_paths() method is at
fault. Thanks, Peff, for pointing out this detail! More specifically,
for each submodule, collect_changed_submodules() runs a revision
walk to essentially do file-history on the list of submodules. That
revision walk marks commits UNINTERESTING if they are simplified away
by not changing the submodule.
Instead, I finally arrived at the conclusion that I should use a flag
that is not used in any other part of the code. In commit-reach.c, a
number of flags were defined for commit walk algorithms. The REACHABLE
flag seemed like it made the most sense, and it seems it was not
actually used in the file. The REACHABLE flag was used in early versions
of commit-reach.c, but was removed by 4fbcca4 (commit-reach: make
can_all_from_reach... linear, 2018-07-20).
Add the REACHABLE flag to commit-graph.c and use it instead of
UNINTERESTING in close_reachable(). This fixes the bug in manual
testing.
Reported-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-24 21:40:42 +08:00
|
|
|
parent->item->object.flags |= REACHABLE;
|
2018-04-10 20:56:04 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-06-12 21:29:40 +08:00
|
|
|
static void close_reachable(struct write_commit_graph_context *ctx)
|
2018-04-10 20:56:04 +08:00
|
|
|
{
|
2019-01-20 04:21:21 +08:00
|
|
|
int i;
|
2018-04-10 20:56:04 +08:00
|
|
|
struct commit *commit;
|
2020-09-18 10:59:49 +08:00
|
|
|
enum commit_graph_split_flags flags = ctx->opts ?
|
|
|
|
ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
|
2018-04-10 20:56:04 +08:00
|
|
|
|
2019-06-12 21:29:40 +08:00
|
|
|
if (ctx->report_progress)
|
|
|
|
ctx->progress = start_delayed_progress(
|
|
|
|
_("Loading known commits in commit graph"),
|
|
|
|
ctx->oids.nr);
|
|
|
|
for (i = 0; i < ctx->oids.nr; i++) {
|
|
|
|
display_progress(ctx->progress, i + 1);
|
2020-12-08 03:11:05 +08:00
|
|
|
commit = lookup_commit(ctx->r, &ctx->oids.oid[i]);
|
2018-04-10 20:56:04 +08:00
|
|
|
if (commit)
|
2019-10-24 21:40:42 +08:00
|
|
|
commit->object.flags |= REACHABLE;
|
2018-04-10 20:56:04 +08:00
|
|
|
}
|
2019-06-12 21:29:40 +08:00
|
|
|
stop_progress(&ctx->progress);
|
2018-04-10 20:56:04 +08:00
|
|
|
|
|
|
|
/*
|
2019-06-12 21:29:40 +08:00
|
|
|
* As this loop runs, ctx->oids.nr may grow, but not more
|
2018-04-10 20:56:04 +08:00
|
|
|
* than the number of missing commits in the reachable
|
|
|
|
* closure.
|
|
|
|
*/
|
2019-06-12 21:29:40 +08:00
|
|
|
if (ctx->report_progress)
|
|
|
|
ctx->progress = start_delayed_progress(
|
|
|
|
_("Expanding reachable commits in commit graph"),
|
commit-graph: don't show progress percentages while expanding reachable commits
Commit 49bbc57a57 (commit-graph write: emit a percentage for all
progress, 2019-01-19) was a bit overeager when it added progress
percentages to the "Expanding reachable commits in commit graph" phase
as well, because most of the time the number of commits that phase has
to iterate over is not known in advance and grows significantly, and,
consequently, we end up with nonsensical numbers:
$ git commit-graph write --reachable
Expanding reachable commits in commit graph: 138606% (824706/595), done.
[...]
$ git rev-parse v5.0 | git commit-graph write --stdin-commits
Expanding reachable commits in commit graph: 81264400% (812644/1), done.
[...]
Even worse, because the percentage grows so quickly, the progress code
outputs much more often than it should (because it ticks every second,
or every 1%), slowing the whole process down. My time for "git
commit-graph write --reachable" on linux.git went from 13.463s to
12.521s with this patch, ~7% savings.
Therefore, don't show progress percentages in the "Expanding reachable
commits in commit graph" phase.
Note that the current code does sometimes do the right thing, if we
picked up all commits initially (e.g., omitting "--reachable" in a
fully-packed repository would get the correct count without any parent
traversal). So it may be possible to come up with a way to tell when we
could use a percentage here. But in the meantime, let's make sure we
robustly avoid printing nonsense.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-07 13:01:33 +08:00
|
|
|
0);
|
2019-06-12 21:29:40 +08:00
|
|
|
for (i = 0; i < ctx->oids.nr; i++) {
|
|
|
|
display_progress(ctx->progress, i + 1);
|
2020-12-08 03:11:05 +08:00
|
|
|
commit = lookup_commit(ctx->r, &ctx->oids.oid[i]);
|
2018-04-10 20:56:04 +08:00
|
|
|
|
2019-06-19 02:14:27 +08:00
|
|
|
if (!commit)
|
|
|
|
continue;
|
|
|
|
if (ctx->split) {
|
2021-02-02 01:15:03 +08:00
|
|
|
if ((!repo_parse_commit(ctx->r, commit) &&
|
2020-06-17 17:14:10 +08:00
|
|
|
commit_graph_position(commit) == COMMIT_NOT_FROM_GRAPH) ||
|
builtin/commit-graph.c: introduce split strategy 'replace'
When using split commit-graphs, it is sometimes useful to completely
replace the commit-graph chain with a new base.
For example, consider a scenario in which a repository builds a new
commit-graph incremental for each push. Occasionally (say, after some
fixed number of pushes), they may wish to rebuild the commit-graph chain
with all reachable commits.
They can do so with
$ git commit-graph write --reachable
but this removes the chain entirely and replaces it with a single
commit-graph in 'objects/info/commit-graph'. Unfortunately, this means
that the next push will have to move this commit-graph into the first
layer of a new chain, and then write its new commits on top.
Avoid such copying entirely by allowing the caller to specify that they
wish to replace the entirety of their commit-graph chain, while also
specifying that the new commit-graph should become the basis of a fresh,
length-one chain.
This addresses the above situation by making it possible for the caller
to instead write:
$ git commit-graph write --reachable --split=replace
which writes a new length-one chain to 'objects/info/commit-graphs',
making the commit-graph incremental generated by the subsequent push
relatively cheap by avoiding the aforementioned copy.
In order to do this, remove an assumption in 'write_commit_graph_file'
that chains are always at least two incrementals long.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-14 12:04:17 +08:00
|
|
|
flags == COMMIT_GRAPH_SPLIT_REPLACE)
|
2019-06-19 02:14:27 +08:00
|
|
|
add_missing_parents(ctx, commit);
|
2021-02-02 01:15:03 +08:00
|
|
|
} else if (!repo_parse_commit_no_graph(ctx->r, commit))
|
2019-06-12 21:29:40 +08:00
|
|
|
add_missing_parents(ctx, commit);
|
2018-04-10 20:56:04 +08:00
|
|
|
}
|
2019-06-12 21:29:40 +08:00
|
|
|
stop_progress(&ctx->progress);
|
2018-04-10 20:56:04 +08:00
|
|
|
|
2019-06-12 21:29:40 +08:00
|
|
|
if (ctx->report_progress)
|
|
|
|
ctx->progress = start_delayed_progress(
|
|
|
|
_("Clearing commit marks in commit graph"),
|
|
|
|
ctx->oids.nr);
|
|
|
|
for (i = 0; i < ctx->oids.nr; i++) {
|
|
|
|
display_progress(ctx->progress, i + 1);
|
2020-12-08 03:11:05 +08:00
|
|
|
commit = lookup_commit(ctx->r, &ctx->oids.oid[i]);
|
2018-04-10 20:56:04 +08:00
|
|
|
|
|
|
|
if (commit)
|
2019-10-24 21:40:42 +08:00
|
|
|
commit->object.flags &= ~REACHABLE;
|
2018-04-10 20:56:04 +08:00
|
|
|
}
|
2019-06-12 21:29:40 +08:00
|
|
|
stop_progress(&ctx->progress);
|
2018-04-10 20:56:04 +08:00
|
|
|
}
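The marking scheme used by close_reachable() can be illustrated in isolation. The following is a minimal sketch, not Git's code: `struct toy_commit`, `TOY_REACHABLE` and `toy_close_reachable()` are hypothetical names, and the caller is assumed to supply an array large enough for the whole closure. It shows why a private "already seen" bit lets the list grow during iteration while each commit is still appended at most once.

```c
#include <stddef.h>

#define TOY_REACHABLE 1u	/* hypothetical flag bit, not Git's value */
#define TOY_MAX_PARENTS 2	/* toy limit, real commits have no such cap */

struct toy_commit {
	unsigned flags;
	size_t nr_parents;
	struct toy_commit *parents[TOY_MAX_PARENTS];
};

/*
 * Extend list[0..nr) with every ancestor that is not yet present and
 * return the new length. The caller must provide enough room.
 */
static size_t toy_close_reachable(struct toy_commit **list, size_t nr)
{
	size_t i, p;

	/* Mark the starting commits so they are never appended again. */
	for (i = 0; i < nr; i++)
		list[i]->flags |= TOY_REACHABLE;

	/* nr grows while we iterate; the flag bounds the work by N commits. */
	for (i = 0; i < nr; i++) {
		for (p = 0; p < list[i]->nr_parents; p++) {
			struct toy_commit *parent = list[i]->parents[p];

			if (!(parent->flags & TOY_REACHABLE)) {
				parent->flags |= TOY_REACHABLE;
				list[nr++] = parent;
			}
		}
	}

	/* Clear the marks so later walks start from clean flags. */
	for (i = 0; i < nr; i++)
		list[i]->flags &= ~TOY_REACHABLE;

	return nr;
}
```

If another walk had already set the same bit on some ancestor, that ancestor would be skipped and its parents never added, which is the failure mode the commit message above describes for UNINTERESTING.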
|
|
|
|
|
2023-03-20 19:26:49 +08:00
|
|
|
struct compute_generation_info {
|
|
|
|
struct repository *r;
|
|
|
|
struct packed_commit_list *commits;
|
|
|
|
struct progress *progress;
|
|
|
|
int progress_cnt;
|
|
|
|
|
|
|
|
timestamp_t (*get_generation)(struct commit *c, void *data);
|
|
|
|
void (*set_generation)(struct commit *c, timestamp_t gen, void *data);
|
|
|
|
void *data;
|
|
|
|
};
|
|
|
|
|
|
|
|
static timestamp_t compute_generation_from_max(struct commit *c,
|
|
|
|
timestamp_t max_gen,
|
|
|
|
int generation_version)
|
|
|
|
{
|
|
|
|
switch (generation_version) {
|
|
|
|
case 1: /* topological levels */
|
|
|
|
if (max_gen > GENERATION_NUMBER_V1_MAX - 1)
|
|
|
|
max_gen = GENERATION_NUMBER_V1_MAX - 1;
|
|
|
|
return max_gen + 1;
|
|
|
|
|
|
|
|
case 2: /* corrected commit date */
|
|
|
|
if (c->date && c->date > max_gen)
|
|
|
|
max_gen = c->date - 1;
|
|
|
|
return max_gen + 1;
|
|
|
|
|
|
|
|
default:
|
|
|
|
BUG("attempting unimplemented version");
|
|
|
|
}
|
|
|
|
}
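The two cases of the switch above can be tried out with plain integers. This is a sketch under stated assumptions rather than Git's code: `toy_topo_level()`, `toy_corrected_date()` and `TOY_V1_MAX` are hypothetical stand-ins, `uint64_t` stands in for timestamp_t, and the cap value is assumed for illustration.

```c
#include <stdint.h>

#define TOY_V1_MAX 0x3FFFFFFFu	/* assumed cap for topological levels */

/* Topological level: one more than the largest parent level, capped. */
static uint64_t toy_topo_level(uint64_t max_parent_level)
{
	if (max_parent_level > TOY_V1_MAX - 1)
		max_parent_level = TOY_V1_MAX - 1;
	return max_parent_level + 1;
}

/*
 * Corrected commit date: the larger of the commit's own date and one
 * more than the largest parent value. A date of zero falls through to
 * "parent maximum + 1", so a root commit dated zero gets the value one.
 */
static uint64_t toy_corrected_date(uint64_t date, uint64_t max_parent_date)
{
	if (date && date > max_parent_date)
		max_parent_date = date - 1;
	return max_parent_date + 1;
}
```

For example, a commit dated 100 whose parents have a maximum corrected date of 150 gets 151, while the same commit with a parent maximum of 99 keeps its own date of 100.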
|
|
|
|
|
|
|
|
static void compute_reachable_generation_numbers(
|
|
|
|
struct compute_generation_info *info,
|
|
|
|
int generation_version)
|
2018-05-01 20:47:09 +08:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct commit_list *list = NULL;
|
|
|
|
|
2023-03-20 19:26:49 +08:00
|
|
|
for (i = 0; i < info->commits->nr; i++) {
|
|
|
|
struct commit *c = info->commits->list[i];
|
|
|
|
timestamp_t gen;
|
|
|
|
repo_parse_commit(info->r, c);
|
|
|
|
gen = info->get_generation(c, info->data);
|
|
|
|
display_progress(info->progress, ++info->progress_cnt);
|
2020-06-17 17:14:09 +08:00
|
|
|
|
2023-03-20 19:26:49 +08:00
|
|
|
if (gen != GENERATION_NUMBER_ZERO && gen != GENERATION_NUMBER_INFINITY)
|
2018-05-01 20:47:09 +08:00
|
|
|
continue;
|
|
|
|
|
2021-02-02 01:15:04 +08:00
|
|
|
commit_list_insert(c, &list);
|
2018-05-01 20:47:09 +08:00
|
|
|
while (list) {
|
|
|
|
struct commit *current = list->item;
|
|
|
|
struct commit_list *parent;
|
|
|
|
int all_parents_computed = 1;
|
2023-03-20 19:26:49 +08:00
|
|
|
timestamp_t max_gen = 0;
|
2018-05-01 20:47:09 +08:00
|
|
|
|
|
|
|
for (parent = current->parents; parent; parent = parent->next) {
|
2023-03-20 19:26:49 +08:00
|
|
|
repo_parse_commit(info->r, parent->item);
|
|
|
|
gen = info->get_generation(parent->item, info->data);
|
2020-06-17 17:14:09 +08:00
|
|
|
|
2023-03-20 19:26:49 +08:00
|
|
|
if (gen == GENERATION_NUMBER_ZERO) {
|
2018-05-01 20:47:09 +08:00
|
|
|
all_parents_computed = 0;
|
|
|
|
commit_list_insert(parent->item, &list);
|
|
|
|
break;
|
|
|
|
}
|
commit-graph: implement corrected commit date
With most of the preparations done, let's implement corrected commit date.
The corrected commit date for a commit is defined as:
* A commit with no parents (a root commit) has corrected commit date
equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
the maximum of its commit date and one more than the largest corrected
commit date among its parents.
As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).
To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.
Storing a corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32 bits. This halves the size
of the GDAT chunk, which is a reduction of around 6% in the size of
the commit-graph file.
However, using offsets can be problematic if a commit is malformed but valid
and has a committer date of 0 Unix time, as the offset would be the same
as the corrected commit date and thus require 64 bits to be stored properly.
While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:14 +08:00
|
|
|
|
2023-03-20 19:26:49 +08:00
|
|
|
if (gen > max_gen)
|
|
|
|
max_gen = gen;
|
2018-05-01 20:47:09 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (all_parents_computed) {
|
|
|
|
pop_commit(&list);
|
2023-03-20 19:26:49 +08:00
|
|
|
gen = compute_generation_from_max(
|
|
|
|
current, max_gen,
|
|
|
|
generation_version);
|
|
|
|
info->set_generation(current, gen, info->data);
|
commit-graph: compute generations separately
The compute_generation_numbers() method was introduced by 3258c663
(commit-graph: compute generation numbers, 2018-05-01) to compute what
is now known as "topological levels". These are still stored in the
commit-graph file for compatibility's sake while c1a09119 (commit-graph:
implement corrected commit date, 2021-01-16) updated the method to also
compute the new version of generation numbers: corrected commit date.
It makes sense why these are grouped. They perform very similar walks of
the necessary commits and compute similar maximums over each parent.
However, having these two together conflates them in subtle ways that are
hard to separate.
In particular, the topo_level slab is used to store the topological
levels in all cases, but the commit_graph_data_at(c)->generation member
stores different values depending on the state of the existing
commit-graph file.
* If the existing commit-graph file has a "GDAT" chunk, then these
values represent corrected commit dates.
* If the existing commit-graph file doesn't have a "GDAT" chunk, then
these values are actually the topological levels.
This issue occurs only when upgrading an existing commit-graph file
into one that has the "GDAT" chunk. The current change does not resolve
this upgrade problem, but splitting the implementation into two pieces
here helps with that process, which will follow in the next change.
The important thing this helps with is the case where the
num_generation_data_overflows was being incremented incorrectly,
triggering a write of the overflow chunk.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-02 11:01:21 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2023-03-20 19:26:49 +08:00
|
|
|
}
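The walk in compute_reachable_generation_numbers() is a depth-first traversal driven by an explicit stack instead of recursion. The sketch below shows that shape on a toy graph. It makes several assumptions: `struct toy_node` and `toy_compute_level()` are hypothetical, a fixed-size array replaces the commit_list, and a level of 0 plays the role of "not computed yet".

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NODE_MAX_PARENTS 2
#define TOY_STACK_MAX 64	/* toy bound on the longest parent chain */

struct toy_node {
	uint32_t level;		/* 0 means "not computed yet" */
	size_t nr_parents;
	struct toy_node *parents[TOY_NODE_MAX_PARENTS];
};

static void toy_compute_level(struct toy_node *start)
{
	struct toy_node *stack[TOY_STACK_MAX];
	size_t depth = 0;

	if (start->level)
		return;
	stack[depth++] = start;

	while (depth) {
		struct toy_node *current = stack[depth - 1];
		uint32_t max_level = 0;
		int all_parents_computed = 1;
		size_t p;

		for (p = 0; p < current->nr_parents; p++) {
			struct toy_node *parent = current->parents[p];

			if (!parent->level) {
				/* Handle this parent first, revisit current later. */
				all_parents_computed = 0;
				stack[depth++] = parent;
				break;
			}
			if (parent->level > max_level)
				max_level = parent->level;
		}

		if (all_parents_computed) {
			/* Every parent has a value, so current can be finished. */
			depth--;
			current->level = max_level + 1;
		}
	}
}
```

Because a node stays on the stack until all of its parents have values, the stack always holds a single parent chain, and the walk never recurses on deep histories.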
|
|
|
|
|
|
|
|
static timestamp_t get_topo_level(struct commit *c, void *data)
|
|
|
|
{
|
|
|
|
struct write_commit_graph_context *ctx = data;
|
|
|
|
return *topo_level_slab_at(ctx->topo_levels, c);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void set_topo_level(struct commit *c, timestamp_t t, void *data)
|
|
|
|
{
|
|
|
|
struct write_commit_graph_context *ctx = data;
|
|
|
|
*topo_level_slab_at(ctx->topo_levels, c) = (uint32_t)t;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void compute_topological_levels(struct write_commit_graph_context *ctx)
|
|
|
|
{
|
|
|
|
struct compute_generation_info info = {
|
|
|
|
.r = ctx->r,
|
|
|
|
.commits = &ctx->commits,
|
|
|
|
.get_generation = get_topo_level,
|
|
|
|
.set_generation = set_topo_level,
|
|
|
|
.data = ctx,
|
|
|
|
};
|
|
|
|
|
|
|
|
if (ctx->report_progress)
|
|
|
|
info.progress = ctx->progress
|
|
|
|
= start_delayed_progress(
|
|
|
|
_("Computing commit graph topological levels"),
|
|
|
|
ctx->commits.nr);
|
|
|
|
|
|
|
|
compute_reachable_generation_numbers(&info, 1);
|
|
|
|
|
2021-02-02 11:01:21 +08:00
|
|
|
stop_progress(&ctx->progress);
|
|
|
|
}
|
|
|
|
|
2023-08-30 07:45:17 +08:00
|
|
|
static timestamp_t get_generation_from_graph_data(struct commit *c,
|
|
|
|
void *data UNUSED)
|
2023-03-20 19:26:50 +08:00
|
|
|
{
|
|
|
|
return commit_graph_data_at(c)->generation;
|
|
|
|
}
|
|
|
|
|
2023-08-30 07:45:17 +08:00
|
|
|
static void set_generation_v2(struct commit *c, timestamp_t t,
|
|
|
|
void *data UNUSED)
|
2023-03-20 19:26:50 +08:00
|
|
|
{
|
|
|
|
struct commit_graph_data *g = commit_graph_data_at(c);
|
2023-03-27 16:08:25 +08:00
|
|
|
g->generation = t;
|
2023-03-20 19:26:50 +08:00
|
|
|
}
|
|
|
|
|
2021-02-02 11:01:21 +08:00
|
|
|
static void compute_generation_numbers(struct write_commit_graph_context *ctx)
|
|
|
|
{
|
|
|
|
int i;
|
2023-03-20 19:26:50 +08:00
|
|
|
struct compute_generation_info info = {
|
|
|
|
.r = ctx->r,
|
|
|
|
.commits = &ctx->commits,
|
|
|
|
.get_generation = get_generation_from_graph_data,
|
|
|
|
.set_generation = set_generation_v2,
|
|
|
|
};
|
2021-02-02 11:01:21 +08:00
|
|
|
|
|
|
|
if (ctx->report_progress)
|
2023-03-20 19:26:50 +08:00
|
|
|
info.progress = ctx->progress
|
|
|
|
= start_delayed_progress(
|
2021-02-02 11:01:21 +08:00
|
|
|
_("Computing commit graph generation numbers"),
|
|
|
|
ctx->commits.nr);
|
2021-02-02 11:01:22 +08:00
|
|
|
|
|
|
|
if (!ctx->trust_generation_numbers) {
|
|
|
|
for (i = 0; i < ctx->commits.nr; i++) {
|
|
|
|
struct commit *c = ctx->commits.list[i];
|
|
|
|
repo_parse_commit(ctx->r, c);
|
|
|
|
commit_graph_data_at(c)->generation = GENERATION_NUMBER_ZERO;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-03-20 19:26:50 +08:00
|
|
|
compute_reachable_generation_numbers(&info, 2);
|
2022-03-02 03:48:30 +08:00
|
|
|
|
|
|
|
for (i = 0; i < ctx->commits.nr; i++) {
|
|
|
|
struct commit *c = ctx->commits.list[i];
|
|
|
|
timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
|
|
|
|
if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
|
|
|
|
ctx->num_generation_data_overflows++;
|
|
|
|
}
|
2019-06-12 21:29:40 +08:00
|
|
|
stop_progress(&ctx->progress);
|
2018-05-01 20:47:09 +08:00
|
|
|
}
|
|
|
|
|
2023-03-20 19:26:52 +08:00
|
|
|
static void set_generation_in_graph_data(struct commit *c, timestamp_t t,
|
2023-08-30 07:45:17 +08:00
|
|
|
void *data UNUSED)
|
2023-03-20 19:26:52 +08:00
|
|
|
{
|
|
|
|
commit_graph_data_at(c)->generation = t;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* After this method, all commits reachable from those in the given
|
|
|
|
* list will have non-zero, non-infinite generation numbers.
|
|
|
|
*/
|
|
|
|
void ensure_generations_valid(struct repository *r,
|
|
|
|
struct commit **commits, size_t nr)
|
|
|
|
{
|
|
|
|
int generation_version = get_configured_generation_version(r);
|
|
|
|
struct packed_commit_list list = {
|
|
|
|
.list = commits,
|
|
|
|
.alloc = nr,
|
|
|
|
.nr = nr,
|
|
|
|
};
|
|
|
|
struct compute_generation_info info = {
|
|
|
|
.r = r,
|
|
|
|
.commits = &list,
|
|
|
|
.get_generation = get_generation_from_graph_data,
|
|
|
|
.set_generation = set_generation_in_graph_data,
|
|
|
|
};
|
|
|
|
|
|
|
|
compute_reachable_generation_numbers(&info, generation_version);
|
|
|
|
}
|
|
|
|
|
bloom: split 'get_bloom_filter()' in two
'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another parameter to this method, which would force all but one
caller to specify an extra 'NULL' parameter at the end.
Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.
This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).
While we're at it, instrument the new 'get_or_compute_bloom_filter()'
with counters in the 'write_commit_graph_context' struct which store
the number of filters that we did and didn't compute, as well as filters
that were truncated.
It would be nice to drop the 'compute_if_not_present' flag entirely,
since all remaining callers of 'get_or_compute_bloom_filter' pass it as
'1', but this will change in a future patch and hence cannot be removed.
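
The shape of the split described above can be sketched with a toy cache. This is only an illustration under stated assumptions: `toy_filter`, `toy_computed`, `toy_get_filter()` and `toy_get_or_compute_filter()` are hypothetical stand-ins, and the "computation" is a placeholder rather than Git's changed-path logic.

```c
#include <stddef.h>

/* Hypothetical flags, loosely modelled on the out-parameter above. */
enum toy_computed {
	TOY_NOT_COMPUTED = 1 << 0,
	TOY_COMPUTED     = 1 << 1
};

/* Hypothetical one-slot cache entry. */
struct toy_filter {
	int present;
	unsigned char byte;
};

/* Lookup only: never computes a missing filter. */
struct toy_filter *toy_get_filter(struct toy_filter *slot)
{
	return slot->present ? slot : NULL;
}

/*
 * Lookup, and optionally compute when missing. "computed" may be NULL;
 * otherwise it reports whether work was done or skipped.
 */
struct toy_filter *toy_get_or_compute_filter(struct toy_filter *slot,
					     int compute_if_not_present,
					     enum toy_computed *computed)
{
	struct toy_filter *f = toy_get_filter(slot);

	if (computed)
		*computed = 0;
	if (f)
		return f;
	if (!compute_if_not_present) {
		if (computed)
			*computed |= TOY_NOT_COMPUTED;
		return NULL;
	}

	/* Placeholder for the real computation. */
	slot->present = 1;
	slot->byte = 0;
	if (computed)
		*computed |= TOY_COMPUTED;
	return slot;
}
```

Callers that only read filters use the first function and never see the extra parameters; only the writer needs the second.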
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 02:07:32 +08:00
|
|
|
static void trace2_bloom_filter_write_statistics(struct write_commit_graph_context *ctx)
|
|
|
|
{
|
|
|
|
trace2_data_intmax("commit-graph", ctx->r, "filter-computed",
|
|
|
|
ctx->count_bloom_filter_computed);
|
|
|
|
trace2_data_intmax("commit-graph", ctx->r, "filter-not-computed",
|
|
|
|
ctx->count_bloom_filter_not_computed);
|
bloom: encode out-of-bounds filters as non-empty
When a changed-path Bloom filter has either zero, or more than a
certain number (commonly 512) of entries, the commit-graph machinery
encodes it as "missing". More specifically, it sets the indices adjacent
in the BIDX chunk as equal to each other to indicate a "length 0"
filter; that is, that the filter occupies zero bytes on disk.
This has heretofore been fine, since the commit-graph machinery has no
need to care about these filters with too few or too many changed paths.
Both cases act like no filter has been generated at all, and so there is
no need to store them.
In a subsequent commit, however, the commit-graph machinery will learn
to only compute Bloom filters for some commits in the current
commit-graph layer. This is a change from the current implementation
which computes Bloom filters for all commits that are in the layer being
written. Critically for this patch, only computing some of the Bloom
filters means adding a third state for length 0 Bloom filters: zero
entries, too many entries, or "hasn't been computed".
It will be important for that future patch to distinguish between "not
representable" (i.e., zero or too-many changed paths), and "hasn't been
computed". In particular, we don't want to waste time recomputing
filters that have already been computed.
To that end, change how we store Bloom filters in the "computed but not
representable" category:
- Bloom filters with no entries are stored as a single byte with all
bits low (i.e., all queries to that Bloom filter will return
"definitely not")
- Bloom filters with too many entries are stored as a single byte with
all bits set high (i.e., all queries to that Bloom filter will
return "maybe").
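
A minimal sketch of the two storage rules above, assuming a one-byte filter whose bits are probed by a hash value. The function names, the `max_paths` parameter and the `hash % 8` bit selection are assumptions made for this example; Git's real filters use their own hashing and settings.

```c
#include <stddef.h>

/*
 * If the path count is out of bounds, write the single-byte encoding
 * to *out and return 1. Return 0 when a normal filter is needed.
 */
int toy_encode_boundary(size_t nr_paths, size_t max_paths,
			unsigned char *out)
{
	if (nr_paths == 0) {
		*out = 0x00;	/* every query answers "definitely not" */
		return 1;
	}
	if (nr_paths > max_paths) {
		*out = 0xff;	/* every query answers "maybe" */
		return 1;
	}
	return 0;
}

/* Probe one bit of a one-byte filter (toy hashing: hash modulo 8). */
int toy_maybe_contains(unsigned char byte, unsigned int hash)
{
	return (byte >> (hash % 8)) & 1;
}
```

Because both encodings occupy one byte, a reader can tell "computed but not representable" apart from a zero-length entry that was never computed.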
These rules are sufficient to not incur a behavior change by changing
the on-disk representation of these two classes. Likewise, no
specification changes are necessary for the commit-graph format, either:
- Filters that were previously empty will be recomputed and stored
according to the new rules, and
- old clients reading filters generated by new clients will interpret
the filters correctly and be none the wiser to how they were
generated.
Clients will invoke the Bloom machinery in more cases than before, but a
future patch can address this by returning a NULL filter when all bits
are set high.
Note that this does increase the size of on-disk commit-graphs, but far
less than other proposals. In particular, this is generally more
efficient than storing a bitmap for which commits haven't computed their
Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
whereas storing explicit filters as above incurs a penalty of one byte
per too-large or empty commit.
In practice, these boundary commits likely occupy a small proportion of
the overall number of commits, and so the size penalty is likely smaller
than storing a bitmap for all commits.
See, for example, these relative proportions of such boundary commits
(collected by SZEDER Gábor):
                 |     Percentage of     |   commit-graph    |           |
                 |   commits modifying   |     file size     |           |
                 ├────────┬──────────────┼───────────────────┤   pct.    |
                 | 0 path | >= 512 paths | before  | after   |  change   |
┌────────────────┼────────┼──────────────┼─────────┼─────────┼───────────┤
| android-base   | 13.20% |        0.13% | 37.468M | 37.534M | +0.1741 % |
| cmssw          |  0.15% |        0.23% | 17.118M | 17.119M | +0.0091 % |
| cpython        |  3.07% |        0.01% |  7.967M |  7.971M | +0.0423 % |
| elasticsearch  |  0.70% |        1.00% |  8.833M |  8.835M | +0.0128 % |
| gcc            |  0.00% |        0.08% | 16.073M | 16.074M | +0.0030 % |
| gecko-dev      |  0.14% |        0.64% | 59.868M | 59.874M | +0.0105 % |
| git            |  0.11% |        0.02% |  3.895M |  3.895M | +0.0020 % |
| glibc          |  0.02% |        0.10% |  3.555M |  3.555M | +0.0021 % |
| go             |  0.00% |        0.07% |  3.186M |  3.186M | +0.0018 % |
| homebrew-cask  |  0.40% |        0.02% |  7.035M |  7.035M | +0.0065 % |
| homebrew-core  |  0.01% |        0.01% | 11.611M | 11.611M | +0.0002 % |
| jdk            |  0.26% |        5.64% |  5.537M |  5.540M | +0.0590 % |
| linux          |  0.01% |        0.51% | 63.735M | 63.740M | +0.0073 % |
| llvm-project   |  0.12% |        0.03% | 25.515M | 25.516M | +0.0050 % |
| rails          |  0.10% |        0.10% |  6.252M |  6.252M | +0.0027 % |
| rust           |  0.07% |        0.17% |  9.364M |  9.364M | +0.0033 % |
| tensorflow     |  0.09% |        1.02% |  7.009M |  7.010M | +0.0158 % |
| webkit         |  0.05% |        0.31% | 17.405M | 17.406M | +0.0047 % |
(where the above increase is determined by computing a non-split
commit-graph before and after this patch).
Given that these projects are all "large" by commit count, the storage
cost of writing these filters explicitly is negligible. In the most
extreme example, android-base (which has 494,848 commits at the time of
writing) would have its commit-graph increase by a modest 68.4 KB.
Finally, a test to exercise filters which contain too many changed path
entries will be introduced in a subsequent patch.
Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
Suggested-by: Jakub Narębski <jnareb@gmail.com>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-18 10:59:44 +08:00
|
|
|
trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-empty",
|
|
|
|
ctx->count_bloom_filter_trunc_empty);
|
2020-09-17 02:07:32 +08:00
|
|
|
trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-large",
|
|
|
|
ctx->count_bloom_filter_trunc_large);
|
|
|
|
}
|
|
|
|
|
2020-03-30 08:31:28 +08:00
|
|
|
static void compute_bloom_filters(struct write_commit_graph_context *ctx)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct progress *progress = NULL;
|
2020-03-30 08:31:29 +08:00
|
|
|
struct commit **sorted_commits;
|
2020-09-18 21:27:27 +08:00
|
|
|
int max_new_filters;
|
2020-03-30 08:31:28 +08:00
|
|
|
|
|
|
|
init_bloom_filters();
|
|
|
|
|
|
|
|
if (ctx->report_progress)
|
|
|
|
progress = start_delayed_progress(
|
|
|
|
_("Computing commit changed paths Bloom filters"),
|
|
|
|
ctx->commits.nr);
|
|
|
|
|
2023-01-02 05:16:48 +08:00
|
|
|
DUP_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr);
|
2020-03-30 08:31:30 +08:00
|
|
|
|
|
|
|
if (ctx->order_by_pack)
|
|
|
|
QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
|
|
|
|
else
|
|
|
|
QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
|
2020-03-30 08:31:29 +08:00
|
|
|
|
2020-09-18 21:27:27 +08:00
|
|
|
max_new_filters = ctx->opts && ctx->opts->max_new_filters >= 0 ?
|
|
|
|
ctx->opts->max_new_filters : ctx->commits.nr;
|
|
|
|
|
2020-03-30 08:31:28 +08:00
|
|
|
for (i = 0; i < ctx->commits.nr; i++) {
|
2020-09-17 02:07:32 +08:00
|
|
|
enum bloom_filter_computed computed = 0;
|
2020-03-30 08:31:29 +08:00
|
|
|
struct commit *c = sorted_commits[i];
|
2020-09-17 02:07:32 +08:00
|
|
|
struct bloom_filter *filter = get_or_compute_bloom_filter(
|
|
|
|
ctx->r,
|
|
|
|
c,
|
2020-09-18 21:27:27 +08:00
|
|
|
ctx->count_bloom_filter_computed < max_new_filters,
|
2020-09-17 02:07:46 +08:00
|
|
|
ctx->bloom_settings,
|
2020-09-17 02:07:32 +08:00
|
|
|
&computed);
|
|
|
|
if (computed & BLOOM_COMPUTED) {
|
|
|
|
ctx->count_bloom_filter_computed++;
|
2020-09-18 10:59:44 +08:00
|
|
|
if (computed & BLOOM_TRUNC_EMPTY)
|
|
|
|
ctx->count_bloom_filter_trunc_empty++;
|
2020-09-17 02:07:32 +08:00
|
|
|
if (computed & BLOOM_TRUNC_LARGE)
|
|
|
|
ctx->count_bloom_filter_trunc_large++;
|
|
|
|
} else if (computed & BLOOM_NOT_COMPUTED)
|
|
|
|
ctx->count_bloom_filter_not_computed++;
|
2020-09-18 21:27:27 +08:00
|
|
|
ctx->total_bloom_filter_data_size += filter
|
|
|
|
? sizeof(unsigned char) * filter->len : 0;
|
2020-03-30 08:31:28 +08:00
|
|
|
display_progress(progress, i + 1);
|
|
|
|
}
|
|
|
|
|
2020-09-17 02:07:32 +08:00
|
|
|
if (trace2_is_enabled())
|
|
|
|
trace2_bloom_filter_write_statistics(ctx);
|
|
|
|
|
2020-03-30 08:31:29 +08:00
|
|
|
free(sorted_commits);
|
2020-03-30 08:31:28 +08:00
|
|
|
stop_progress(&progress);
|
|
|
|
}
|
|
|
|
|
2020-05-05 09:13:35 +08:00
|
|
|
struct refs_cb_data {
|
|
|
|
struct oidset *commits;
|
commit-graph.c: show progress of finding reachable commits
When 'git commit-graph write --reachable' is invoked, the commit-graph
machinery calls 'for_each_ref()' to discover the set of reachable
commits.
Right now the 'add_ref_to_set' callback is not doing anything other than
adding an OID to the set of known-reachable OIDs. In a subsequent
commit, 'add_ref_to_set' will presumptively peel references. This
operation should be fast for repositories with an up-to-date
'$GIT_DIR/packed-refs', but may be slow in the general case.
So that it doesn't appear that 'git commit-graph write' is idling with
'--reachable' in the slow case, add a progress meter to provide some
output in the meantime.
In general, we don't expect a progress meter to appear at all, since
peeling references with a 'packed-refs' file is quick. If it's slow and
we do show a progress meter, the subsequent 'fill_oids_from_commits()'
will be fast, since all of the calls to
'lookup_commit_reference_gently()' will be no-ops.
Both progress meters are delayed, so it is unlikely that more than one
will appear. In either case, this intermediate state will go away in a
handful of patches, at which point there will be at most one progress
meter.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-14 05:59:33 +08:00
|
|
|
struct progress *progress;
|
2020-05-05 09:13:35 +08:00
|
|
|
};
|
|
|
|
|
2022-08-26 01:09:48 +08:00
|
|
|
static int add_ref_to_set(const char *refname UNUSED,
|
2020-04-14 12:04:25 +08:00
|
|
|
const struct object_id *oid,
|
2022-08-26 01:09:48 +08:00
|
|
|
int flags UNUSED, void *cb_data)
|
2018-06-27 21:24:45 +08:00
|
|
|
{
|
2020-05-14 05:59:37 +08:00
|
|
|
struct object_id peeled;
|
2020-05-05 09:13:35 +08:00
|
|
|
struct refs_cb_data *data = (struct refs_cb_data *)cb_data;
|
2018-06-27 21:24:45 +08:00
|
|
|
|
refs: switch peel_ref() to peel_iterated_oid()
The peel_ref() interface is confusing and error-prone:
- it's typically used by ref iteration callbacks that have both a
refname and oid. But since they pass only the refname, we may load
the ref value from the filesystem again. This is inefficient, but
also means we are open to a race if somebody simultaneously updates
the ref. E.g., this:
	int some_ref_cb(const char *refname, const struct object_id *oid, ...)
	{
		struct object_id peeled;

		if (!peel_ref(refname, &peeled))
			printf("%s peels to %s",
			       oid_to_hex(oid), oid_to_hex(&peeled));
		return 0;
	}
could print nonsense. It is correct to say "refname peels to..."
(you may see the "before" value or the "after" value, either of
which is consistent), but mentioning both oids may be mixing
before/after values.
Worse, whether this is possible depends on whether the optimization
to read from the current iterator value kicks in. So it is actually
not possible with:
for_each_ref(some_ref_cb);
but it _is_ possible with:
head_ref(some_ref_cb);
which does not use the iterator mechanism (though in practice, HEAD
should never peel to anything, so this may not be triggerable).
- it must take a fully-qualified refname for the read_ref_full() code
path to work. Yet we routinely pass it partial refnames from
callbacks to for_each_tag_ref(), etc. This happens to work when
iterating because there we do not call read_ref_full() at all, and
only use the passed refname to check if it is the same as the
iterator. But the requirements for the function parameters are quite
unclear.
Instead of taking a refname, let's instead take an oid. That fixes both
problems. It's a little funny for a "ref" function not to involve refs
at all. The key thing is that it's optimizing under the hood based on
having access to the ref iterator. So let's change the name to make it
clear why you'd want this function versus just peel_object().
There are two other directions I considered but rejected:
- we could pass the peel information into the each_ref_fn callback.
However, we don't know if the caller actually wants it or not. For
packed-refs, providing it is essentially free. But for loose refs,
we actually have to peel the object, which would be wasteful in most
cases. We could likewise pass in a flag to the callback indicating
whether the peeled information is known, but that complicates those
callbacks, as they then have to decide whether to manually peel
themselves. Plus it requires changing the interface of every
callback, whether they care about peeling or not, and there are many
of them.
- we could make a function to return the peeled value of the current
iterated ref (computing it if necessary), and BUG() otherwise. I.e.:
int peel_current_iterated_ref(struct object_id *out);
Each of the current callers is an each_ref_fn callback, so they'd
mostly be happy. But:
- we use those callbacks with functions like head_ref(), which do
not use the iteration code. So we'd need to handle the fallback
case there, anyway.
- it's possible that a caller would want to call into generic code
that sometimes is used during iteration and sometimes not. This
encapsulates the logic to do the fast thing when possible, and
fallback when necessary.
The implementation is mostly obvious, but I want to call out a few
things in the patch:
- the test-tool coverage for peel_ref() is now meaningless, as it all
collapses to a single peel_object() call (arguably they were pretty
uninteresting before; the tricky part of that function is the
fast-path we see during iteration, but these calls didn't trigger
that). I've just dropped it entirely, though note that some other
tests relied on the tags we created; I've moved that creation to the
tests where it matters.
- we no longer need to take a ref_store parameter, since we'd never
look up a ref now. We do still rely on a global "current iterator"
variable which _could_ be kept per-ref-store. But in practice this
is only useful if there are multiple recursive iterations, at which
point the more appropriate solution is probably a stack of
iterators. No caller used the actual ref-store parameter anyway
(they all call the wrapper that passes the_repository).
- the original only kicked in the optimization when the "refname"
pointer matched (i.e., not string comparison). We do likewise with
the "oid" parameter here, but fall back to doing an actual oideq()
call. This in theory lets us kick in the optimization more often,
though in practice no current caller cares. It should never be
wrong, though (peeling is a property of an object, so two refs
pointing to the same object would peel identically).
- the original took care not to touch the peeled out-parameter unless
we found something to put in it. But no caller cares about this, and
anyway, it is enforced by peel_object() itself (and even in the
optimized iterator case, that's where we eventually end up). We can
shorten the code and avoid an extra copy by just passing the
out-parameter through the stack.
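A minimal sketch of the fast-path check described in the notes above. It uses hypothetical stand-in types and helpers (`struct oid`, `struct iter_state`, `peel_slow()`, `peel_iterated()`), not Git's real `struct object_id`, ref iterator, or `peel_object()`. The point is only the order of the checks: pointer identity first, then a value comparison, then the generic fallback, with the out-parameter passed straight through.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for an object ID; not Git's struct object_id. */
struct oid {
	unsigned char hash[20];
};

/* Hypothetical "current iterator" state: the ref being visited and its cached peel. */
struct iter_state {
	const struct oid *cur_oid; /* oid handed to the callback */
	struct oid cur_peeled;     /* peeled value already known to the iterator */
	bool peeled_known;
};

static struct iter_state *current_iter; /* global, as the message describes */
static int slow_calls;                  /* counts fallbacks, for illustration only */

static bool oid_equal(const struct oid *a, const struct oid *b)
{
	return !memcmp(a->hash, b->hash, sizeof(a->hash));
}

/* Placeholder for the generic (slow) peel; in this sketch it just copies the input. */
static int peel_slow(const struct oid *in, struct oid *out)
{
	slow_calls++;
	*out = *in;
	return 0;
}

/*
 * Use the iterator's cached value when the caller's oid is the iterated one:
 * first by cheap pointer identity, then by value comparison. Otherwise fall
 * back to the generic peel. Returns 0 on success.
 */
static int peel_iterated(const struct oid *oid, struct oid *out)
{
	if (current_iter && current_iter->peeled_known &&
	    (oid == current_iter->cur_oid ||
	     oid_equal(oid, current_iter->cur_oid))) {
		*out = current_iter->cur_peeled;
		return 0;
	}
	return peel_slow(oid, out);
}
```

Because peeling depends only on the object, the value comparison can only widen the set of calls that hit the cache; it cannot change the answer.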
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-21 03:44:43 +08:00
|
|
|
if (!peel_iterated_oid(oid, &peeled))
|
2020-05-14 05:59:37 +08:00
|
|
|
oid = &peeled;
|
|
|
|
if (oid_object_info(the_repository, oid, NULL) == OBJ_COMMIT)
|
|
|
|
oidset_insert(data->commits, oid);
|
commit-graph.c: show progress of finding reachable commits
When 'git commit-graph write --reachable' is invoked, the commit-graph
machinery calls 'for_each_ref()' to discover the set of reachable
commits.
Right now the 'add_ref_to_set' callback is not doing anything other than
adding an OID to the set of known-reachable OIDs. In a subsequent
commit, 'add_ref_to_set' will presumptively peel references. This
operation should be fast for repositories with an up-to-date
'$GIT_DIR/packed-refs', but may be slow in the general case.
So that it doesn't appear that 'git commit-graph write' is idling with
'--reachable' in the slow case, add a progress meter to provide some
output in the meantime.
In general, we don't expect a progress meter to appear at all, since
peeling references with a 'packed-refs' file is quick. If it's slow and
we do show a progress meter, the subsequent 'fill_oids_from_commits()'
will be fast, since all of the calls to
'lookup_commit_reference_gently()' will be no-ops.
Both progress meters are delayed, so it is unlikely that more than one
will appear. In either case, this intermediate state will go away in a
handful of patches, at which point there will be at most one progress
meter.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-14 05:59:33 +08:00
|
|
|
|
|
|
|
display_progress(data->progress, oidset_size(data->commits));
|
2018-06-27 21:24:45 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-02-04 13:51:50 +08:00
|
|
|
int write_commit_graph_reachable(struct object_directory *odb,
|
2019-08-05 16:02:39 +08:00
|
|
|
enum commit_graph_write_flags flags,
|
2020-09-18 10:59:49 +08:00
|
|
|
const struct commit_graph_opts *opts)
|
2018-06-27 21:24:45 +08:00
|
|
|
{
|
2020-04-14 12:04:25 +08:00
|
|
|
struct oidset commits = OIDSET_INIT;
|
2020-05-05 09:13:35 +08:00
|
|
|
struct refs_cb_data data;
|
2019-06-12 21:29:37 +08:00
|
|
|
int result;
|
2018-06-27 21:24:45 +08:00
|
|
|
|
2020-05-05 09:13:35 +08:00
|
|
|
memset(&data, 0, sizeof(data));
|
|
|
|
data.commits = &commits;
|
2020-05-14 05:59:33 +08:00
|
|
|
if (flags & COMMIT_GRAPH_WRITE_PROGRESS)
|
|
|
|
data.progress = start_delayed_progress(
|
|
|
|
_("Collecting referenced commits"), 0);
|
2020-05-05 09:13:35 +08:00
|
|
|
|
|
|
|
for_each_ref(add_ref_to_set, &data);
|
commit-graph: fix progress of reachable commits
To display a progress line while iterating over all refs,
d335ce8f24 (commit-graph.c: show progress of finding reachable
commits, 2020-05-13) should have added a pair of
start_delayed_progress() and stop_progress() calls around a
for_each_ref() invocation. Alas, the stop_progress() call ended up at
the wrong place, after write_commit_graph(), which does all the
commit-graph computation and writing, and has several progress lines
of its own. Consequently, that new
Collecting referenced commits: 123
progress line is overwritten by the first progress line shown by
write_commit_graph(), and its final "done" line is shown last, after
everything is finished:
Expanding reachable commits in commit graph: 344786, done.
Computing commit changed paths Bloom filters: 100% (344786/344786), done.
Collecting referenced commits: 154, done.
Move that stop_progress() call to the right place.
While at it, drop the unnecessary 'if (data.progress)' condition
protecting the stop_progress() call, because that function is prepared
to handle a NULL progress struct.
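The pairing described above can be sketched as follows. This is a simplified, hypothetical progress meter (`start_progress_sketch()`, `display_progress_sketch()`, `stop_progress_sketch()`), not Git's `progress.h` API. It only illustrates two points from the message: each phase stops its own meter before the next phase starts, and the stop function tolerates a NULL meter so callers need no guard.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, simplified progress meter; not Git's struct progress. */
struct progress {
	const char *title;
	unsigned long last;
};

static struct progress *start_progress_sketch(const char *title)
{
	struct progress *p = malloc(sizeof(*p));
	if (!p)
		return NULL;
	p->title = title;
	p->last = 0;
	return p;
}

/* Updating a NULL meter is a no-op. */
static void display_progress_sketch(struct progress *p, unsigned long n)
{
	if (p)
		p->last = n;
}

/* Tolerates a NULL meter, so callers need no 'if (progress)' guard. */
static void stop_progress_sketch(struct progress **pp)
{
	if (!pp || !*pp)
		return;
	printf("%s: %lu, done.\n", (*pp)->title, (*pp)->last);
	free(*pp);
	*pp = NULL;
}

/*
 * Each phase stops its own meter before the next phase begins, so the
 * "done" lines are printed in phase order.
 */
static void run_phases(int show_progress)
{
	struct progress *p = NULL;
	unsigned long i;

	if (show_progress)
		p = start_progress_sketch("Collecting referenced commits");
	for (i = 1; i <= 3; i++)
		display_progress_sketch(p, i);
	stop_progress_sketch(&p); /* stop here, before the write phase */

	if (show_progress)
		p = start_progress_sketch("Writing out commit graph");
	for (i = 1; i <= 5; i++)
		display_progress_sketch(p, i);
	stop_progress_sketch(&p);
}
```

If the first stop call were moved to the end of `run_phases()`, the first meter's final line would appear after the second phase's output, which is the symptom the message describes.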
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-10 00:54:32 +08:00
|
|
|
|
|
|
|
stop_progress(&data.progress);
|
|
|
|
|
2020-04-14 12:04:25 +08:00
|
|
|
result = write_commit_graph(odb, NULL, &commits,
|
2020-09-18 10:59:49 +08:00
|
|
|
flags, opts);
|
2018-10-04 01:12:15 +08:00
|
|
|
|
2020-04-14 12:04:25 +08:00
|
|
|
oidset_clear(&commits);
|
2019-06-12 21:29:37 +08:00
|
|
|
return result;
|
2018-06-27 21:24:45 +08:00
|
|
|
}
|
|
|
|
|
2019-06-12 21:29:41 +08:00
|
|
|
static int fill_oids_from_packs(struct write_commit_graph_context *ctx,
|
2022-03-05 02:32:12 +08:00
|
|
|
const struct string_list *pack_indexes)
|
2018-04-03 04:34:19 +08:00
|
|
|
{
|
2019-06-12 21:29:41 +08:00
|
|
|
uint32_t i;
|
2019-01-20 04:21:16 +08:00
|
|
|
struct strbuf progress_title = STRBUF_INIT;
|
2019-06-12 21:29:41 +08:00
|
|
|
struct strbuf packname = STRBUF_INIT;
|
|
|
|
int dirlen;
|
2022-03-05 02:32:13 +08:00
|
|
|
int ret = 0;
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2020-02-04 13:51:50 +08:00
|
|
|
strbuf_addf(&packname, "%s/pack/", ctx->odb->path);
|
2019-06-12 21:29:41 +08:00
|
|
|
dirlen = packname.len;
|
|
|
|
if (ctx->report_progress) {
|
|
|
|
strbuf_addf(&progress_title,
|
2022-03-07 23:27:08 +08:00
|
|
|
Q_("Finding commits for commit graph in %"PRIuMAX" pack",
|
|
|
|
"Finding commits for commit graph in %"PRIuMAX" packs",
|
2019-06-12 21:29:41 +08:00
|
|
|
pack_indexes->nr),
|
2022-03-07 23:27:08 +08:00
|
|
|
(uintmax_t)pack_indexes->nr);
|
2019-06-12 21:29:41 +08:00
|
|
|
ctx->progress = start_delayed_progress(progress_title.buf, 0);
|
|
|
|
ctx->progress_done = 0;
|
2018-04-10 20:56:08 +08:00
|
|
|
}
|
2019-06-12 21:29:41 +08:00
|
|
|
for (i = 0; i < pack_indexes->nr; i++) {
|
|
|
|
struct packed_git *p;
|
|
|
|
strbuf_setlen(&packname, dirlen);
|
|
|
|
strbuf_addstr(&packname, pack_indexes->items[i].string);
|
|
|
|
p = add_packed_git(packname.buf, packname.len, 1);
|
|
|
|
if (!p) {
|
2022-03-05 02:32:13 +08:00
|
|
|
ret = error(_("error adding pack %s"), packname.buf);
|
|
|
|
goto cleanup;
|
2018-04-10 20:56:08 +08:00
|
|
|
}
|
2019-06-12 21:29:41 +08:00
|
|
|
if (open_pack_index(p)) {
|
2022-03-05 02:32:13 +08:00
|
|
|
ret = error(_("error opening index for %s"), packname.buf);
|
|
|
|
goto cleanup;
|
2019-06-12 21:29:41 +08:00
|
|
|
}
|
|
|
|
for_each_object_in_pack(p, add_packed_commits, ctx,
|
|
|
|
FOR_EACH_OBJECT_PACK_ORDER);
|
|
|
|
close_pack(p);
|
|
|
|
free(p);
|
2018-04-10 20:56:08 +08:00
|
|
|
}
|
|
|
|
|
2022-03-05 02:32:13 +08:00
|
|
|
cleanup:
|
2019-06-12 21:29:41 +08:00
|
|
|
stop_progress(&ctx->progress);
|
2019-08-07 19:15:02 +08:00
|
|
|
strbuf_release(&progress_title);
|
2019-06-12 21:29:41 +08:00
|
|
|
strbuf_release(&packname);
|
|
|
|
|
2022-03-05 02:32:13 +08:00
|
|
|
return ret;
|
2019-06-12 21:29:41 +08:00
|
|
|
}
|
|
|
|
|
2020-04-14 12:04:25 +08:00
|
|
|
static int fill_oids_from_commits(struct write_commit_graph_context *ctx,
|
|
|
|
struct oidset *commits)
|
2019-06-12 21:29:42 +08:00
|
|
|
{
|
2020-04-14 12:04:25 +08:00
|
|
|
struct oidset_iter iter;
|
|
|
|
struct object_id *oid;
|
|
|
|
|
|
|
|
if (!oidset_size(commits))
|
|
|
|
return 0;
|
2019-06-12 21:29:42 +08:00
|
|
|
|
2020-04-14 12:04:25 +08:00
|
|
|
oidset_iter_init(commits, &iter);
|
|
|
|
while ((oid = oidset_iter_next(&iter))) {
|
2020-12-08 03:11:05 +08:00
|
|
|
oid_array_append(&ctx->oids, oid);
|
2018-04-10 20:56:07 +08:00
|
|
|
}
|
2020-04-14 12:04:25 +08:00
|
|
|
|
2019-08-05 16:02:40 +08:00
|
|
|
return 0;
|
2019-06-12 21:29:42 +08:00
|
|
|
}
|
2018-04-10 20:56:07 +08:00
|
|
|
|
2019-06-12 21:29:42 +08:00
|
|
|
static void fill_oids_from_all_packs(struct write_commit_graph_context *ctx)
|
|
|
|
{
|
|
|
|
if (ctx->report_progress)
|
|
|
|
ctx->progress = start_delayed_progress(
|
|
|
|
_("Finding commits for commit graph among packed objects"),
|
|
|
|
ctx->approx_nr_objects);
|
|
|
|
for_each_packed_object(add_packed_commits, ctx,
|
|
|
|
FOR_EACH_OBJECT_PACK_ORDER);
|
|
|
|
if (ctx->progress_done < ctx->approx_nr_objects)
|
|
|
|
display_progress(ctx->progress, ctx->approx_nr_objects);
|
|
|
|
stop_progress(&ctx->progress);
|
|
|
|
}
|
2018-04-10 20:56:06 +08:00
|
|
|
|
2019-06-12 21:29:44 +08:00
|
|
|
static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
|
|
|
|
{
|
|
|
|
uint32_t i;
|
2020-09-18 10:59:49 +08:00
|
|
|
enum commit_graph_split_flags flags = ctx->opts ?
|
|
|
|
ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-12 21:29:44 +08:00
|
|
|
ctx->num_extra_edges = 0;
|
|
|
|
if (ctx->report_progress)
|
|
|
|
ctx->progress = start_delayed_progress(
|
commit-graph write: add intermediate progress
Add progress output to sections of code between "Annotating[...]" and
"Computing[...]generation numbers". This can collectively take 5-10
seconds on a large enough repository.
On a test repository I have with ~7 million commits and ~50
million objects we'll now emit:
$ ~/g/git/git --exec-path=$HOME/g/git commit-graph write
Finding commits for commit graph among packed objects: 100% (124763727/124763727), done.
Loading known commits in commit graph: 100% (18989461/18989461), done.
Expanding reachable commits in commit graph: 100% (18989507/18989461), done.
Clearing commit marks in commit graph: 100% (18989507/18989507), done.
Counting distinct commits in commit graph: 100% (18989507/18989507), done.
Finding extra edges in commit graph: 100% (18989507/18989507), done.
Computing commit graph generation numbers: 100% (7250302/7250302), done.
Writing out commit graph in 4 passes: 100% (29001208/29001208), done.
Whereas on a medium-sized repository such as linux.git these new
progress bars won't have time to kick in and, as before, we'll still
emit output like:
$ ~/g/git/git --exec-path=$HOME/g/git commit-graph write
Finding commits for commit graph among packed objects: 100% (6529159/6529159), done.
Expanding reachable commits in commit graph: 815990, done.
Computing commit graph generation numbers: 100% (815983/815983), done.
Writing out commit graph in 4 passes: 100% (3263932/3263932), done.
The "Counting distinct commits in commit graph" phase will spend most
of its time paused at "0/*" as we QSORT(...) the list. That's not
optimal, but at least we don't seem to be stalling anymore most of the
time.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-20 04:21:20 +08:00
|
|
|
_("Finding extra edges in commit graph"),
|
2019-06-12 21:29:44 +08:00
|
|
|
ctx->oids.nr);
|
2020-12-08 03:11:05 +08:00
|
|
|
oid_array_sort(&ctx->oids);
|
|
|
|
for (i = 0; i < ctx->oids.nr; i = oid_array_next_unique(&ctx->oids, i)) {
|
2019-09-16 01:07:44 +08:00
|
|
|
unsigned int num_parents;
|
|
|
|
|
2019-06-12 21:29:44 +08:00
|
|
|
display_progress(ctx->progress, i + 1);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-19 02:14:27 +08:00
|
|
|
ALLOC_GROW(ctx->commits.list, ctx->commits.nr + 1, ctx->commits.alloc);
|
2020-12-08 03:11:05 +08:00
|
|
|
ctx->commits.list[ctx->commits.nr] = lookup_commit(ctx->r, &ctx->oids.oid[i]);
|
2019-06-19 02:14:27 +08:00
|
|
|
|
builtin/commit-graph.c: introduce split strategy 'replace'
When using split commit-graphs, it is sometimes useful to completely
replace the commit-graph chain with a new base.
For example, consider a scenario in which a repository builds a new
commit-graph incremental for each push. Occasionally (say, after some
fixed number of pushes), they may wish to rebuild the commit-graph chain
with all reachable commits.
They can do so with
$ git commit-graph write --reachable
but this removes the chain entirely and replaces it with a single
commit-graph in 'objects/info/commit-graph'. Unfortunately, this means
that the next push will have to move this commit-graph into the first
layer of a new chain, and then write its new commits on top.
Avoid such copying entirely by allowing the caller to specify that they
wish to replace the entirety of their commit-graph chain, while also
specifying that the new commit-graph should become the basis of a fresh,
length-one chain.
This addresses the above situation by making it possible for the caller
to instead write:
$ git commit-graph write --reachable --split=replace
which writes a new length-one chain to 'objects/info/commit-graphs',
making the commit-graph incremental generated by the subsequent push
relatively cheap by avoiding the aforementioned copy.
In order to do this, remove an assumption in 'write_commit_graph_file'
that chains are always at least two incrementals long.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-14 12:04:17 +08:00
|
|
|
if (ctx->split && flags != COMMIT_GRAPH_SPLIT_REPLACE &&
|
2020-06-17 17:14:10 +08:00
|
|
|
commit_graph_position(ctx->commits.list[ctx->commits.nr]) != COMMIT_NOT_FROM_GRAPH)
|
2019-06-19 02:14:27 +08:00
|
|
|
continue;
|
|
|
|
|
2020-04-14 12:04:17 +08:00
|
|
|
if (ctx->split && flags == COMMIT_GRAPH_SPLIT_REPLACE)
|
2021-02-02 01:15:03 +08:00
|
|
|
repo_parse_commit(ctx->r, ctx->commits.list[ctx->commits.nr]);
|
2020-04-14 12:04:17 +08:00
|
|
|
else
|
2021-02-02 01:15:03 +08:00
|
|
|
repo_parse_commit_no_graph(ctx->r, ctx->commits.list[ctx->commits.nr]);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-09-16 01:07:44 +08:00
|
|
|
num_parents = commit_list_count(ctx->commits.list[ctx->commits.nr]->parents);
|
2018-04-03 04:34:19 +08:00
|
|
|
if (num_parents > 2)
|
2019-06-12 21:29:44 +08:00
|
|
|
ctx->num_extra_edges += num_parents - 1;
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-12 21:29:44 +08:00
|
|
|
ctx->commits.nr++;
|
2018-04-03 04:34:19 +08:00
|
|
|
}
|
2019-06-12 21:29:44 +08:00
|
|
|
stop_progress(&ctx->progress);
|
|
|
|
}
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-19 02:14:27 +08:00
|
|
|
static int write_graph_chunk_base_1(struct hashfile *f,
|
|
|
|
struct commit_graph *g)
|
|
|
|
{
|
|
|
|
int num = 0;
|
|
|
|
|
|
|
|
if (!g)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
num = write_graph_chunk_base_1(f, g->base_graph);
|
|
|
|
hashwrite(f, g->oid.hash, the_hash_algo->rawsz);
|
|
|
|
return num + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int write_graph_chunk_base(struct hashfile *f,
|
2021-02-05 22:30:36 +08:00
|
|
|
void *data)
|
2019-06-19 02:14:27 +08:00
|
|
|
{
|
2021-02-05 22:30:36 +08:00
|
|
|
struct write_commit_graph_context *ctx = data;
|
2019-06-19 02:14:27 +08:00
|
|
|
int num = write_graph_chunk_base_1(f, ctx->new_base_graph);
|
|
|
|
|
|
|
|
if (num != ctx->num_commit_graphs_after - 1) {
|
|
|
|
error(_("failed to write correct number of base graph ids"));
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-06-12 21:29:45 +08:00
|
|
|
static int write_commit_graph_file(struct write_commit_graph_context *ctx)
|
2018-04-03 04:34:19 +08:00
|
|
|
{
|
2019-06-12 21:29:45 +08:00
|
|
|
uint32_t i;
|
2019-06-19 02:14:27 +08:00
|
|
|
int fd;
|
2018-04-03 04:34:19 +08:00
|
|
|
struct hashfile *f;
|
|
|
|
struct lock_file lk = LOCK_INIT;
|
2018-11-14 12:09:35 +08:00
|
|
|
const unsigned hashsz = the_hash_algo->rawsz;
|
2019-01-20 04:21:16 +08:00
|
|
|
struct strbuf progress_title = STRBUF_INIT;
|
2021-02-18 22:07:25 +08:00
|
|
|
struct chunkfile *cf;
|
2021-04-26 09:02:58 +08:00
|
|
|
unsigned char file_hash[GIT_MAX_RAWSZ];
|
2019-06-19 02:14:27 +08:00
|
|
|
|
|
|
|
if (ctx->split) {
|
|
|
|
struct strbuf tmp_file = STRBUF_INIT;
|
|
|
|
|
|
|
|
strbuf_addf(&tmp_file,
|
|
|
|
"%s/info/commit-graphs/tmp_graph_XXXXXX",
|
2020-02-04 13:51:50 +08:00
|
|
|
ctx->odb->path);
|
2019-06-19 02:14:27 +08:00
|
|
|
ctx->graph_name = strbuf_detach(&tmp_file, NULL);
|
|
|
|
} else {
|
commit-graph.c: remove path normalization, comparison
As of the previous patch, all calls to 'commit-graph.c' functions which
perform path normalization (e.g., 'get_commit_graph_filename()') are
of the form 'ctx->odb->path', which is always in normalized form.
Now that there are no callers passing non-normalized paths to these
functions, ensure that future callers are bound by the same restrictions
by making these functions take a 'struct object_directory *' instead of
a 'const char *'. To match, replace all calls with arguments of the form
'ctx->odb->path' with 'ctx->odb'. To recover the path, functions that
perform path manipulation simply use 'odb->path'.
Further, avoid string comparisons with arguments of the form
'odb->path', and instead prefer raw pointer comparisons, which
accomplish the same effect, but are far less brittle.
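A small sketch of why pointer comparison is less brittle than comparing path strings. `struct object_directory_sketch` and both helper functions are hypothetical stand-ins introduced for illustration; they are not Git's definitions.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for Git's 'struct object_directory'. */
struct object_directory_sketch {
	const char *path;
};

/* Brittle: two spellings of the same directory compare as different. */
static bool same_odb_by_path(const char *a, const char *b)
{
	return !strcmp(a, b);
}

/* Robust: identity of the directory record itself; no normalization needed. */
static bool same_odb_by_pointer(const struct object_directory_sketch *a,
				const struct object_directory_sketch *b)
{
	return a == b;
}
```

With string comparison, "../.git/objects" and ".././.git/objects" are treated as different directories unless both are normalized first. Comparing the records by address sidesteps that step entirely, under the assumption that each directory is represented by exactly one record.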
This has a pleasant side-effect of making these functions much more
robust to paths that cannot be normalized by 'normalize_path_copy()',
i.e., because they are outside of the current working directory.
For example, prior to this patch, Valgrind reports the following
uninitialized memory read [1]:
$ ( cd t && GIT_DIR=../.git valgrind git rev-parse HEAD^ )
because 'normalize_path_copy()' can't normalize '../.git' (since it's
relative to, but above, the current working directory) [2].
By using a 'struct object_directory *' directly,
'get_commit_graph_filename()' does not need to normalize, because all
paths are relative to the current working directory since they are
always read from the '->path' of an object directory.
[1]: https://lore.kernel.org/git/20191027042116.GA5801@sigill.intra.peff.net.
[2]: The bug here is that 'get_commit_graph_filename()' returns the
result of 'normalize_path_copy()' without checking the return
value.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04 05:18:02 +08:00
|
|
|
ctx->graph_name = get_commit_graph_filename(ctx->odb);
|
2019-06-19 02:14:27 +08:00
|
|
|
}
|
2019-06-12 21:29:45 +08:00
|
|
|
|
|
|
|
if (safe_create_leading_directories(ctx->graph_name)) {
|
|
|
|
UNLEAK(ctx->graph_name);
|
|
|
|
error(_("unable to create leading directories of %s"),
|
|
|
|
ctx->graph_name);
|
|
|
|
return -1;
|
2018-10-04 01:12:15 +08:00
|
|
|
}
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-19 02:14:27 +08:00
|
|
|
if (ctx->split) {
|
2020-09-18 02:11:46 +08:00
|
|
|
char *lock_name = get_commit_graph_chain_filename(ctx->odb);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2020-04-30 01:36:46 +08:00
|
|
|
hold_lock_file_for_update_mode(&lk, lock_name,
|
|
|
|
LOCK_DIE_ON_ERROR, 0444);
|
2022-03-05 02:32:14 +08:00
|
|
|
free(lock_name);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-19 02:14:27 +08:00
|
|
|
fd = git_mkstemp_mode(ctx->graph_name, 0444);
|
|
|
|
if (fd < 0) {
|
2020-04-24 05:41:02 +08:00
|
|
|
error(_("unable to create temporary graph layer"));
|
2019-06-19 02:14:27 +08:00
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2020-04-30 01:36:42 +08:00
|
|
|
if (adjust_shared_perm(ctx->graph_name)) {
|
|
|
|
error(_("unable to adjust shared permissions for '%s'"),
|
|
|
|
ctx->graph_name);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2019-06-19 02:14:27 +08:00
|
|
|
f = hashfd(fd, ctx->graph_name);
|
|
|
|
} else {
|
2020-04-30 01:36:38 +08:00
|
|
|
hold_lock_file_for_update_mode(&lk, ctx->graph_name,
|
|
|
|
LOCK_DIE_ON_ERROR, 0444);
|
2021-01-06 03:23:47 +08:00
|
|
|
fd = get_lock_file_fd(&lk);
|
|
|
|
f = hashfd(fd, get_lock_file_path(&lk));
|
2019-06-19 02:14:27 +08:00
|
|
|
}
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2021-02-18 22:07:25 +08:00
|
|
|
cf = init_chunkfile(f);
|
|
|
|
|
|
|
|
add_chunk(cf, GRAPH_CHUNKID_OIDFANOUT, GRAPH_FANOUT_SIZE,
|
|
|
|
write_graph_chunk_fanout);
|
2023-07-13 07:37:54 +08:00
|
|
|
add_chunk(cf, GRAPH_CHUNKID_OIDLOOKUP, st_mult(hashsz, ctx->commits.nr),
|
2021-02-18 22:07:25 +08:00
|
|
|
write_graph_chunk_oids);
|
2023-07-13 07:37:54 +08:00
|
|
|
add_chunk(cf, GRAPH_CHUNKID_DATA, st_mult(hashsz + 16, ctx->commits.nr),
|
2021-02-18 22:07:25 +08:00
|
|
|
write_graph_chunk_data);
|
commit-graph: implement generation data chunk
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
the prerequisites before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.
We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.
Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.
To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instead
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:
We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.
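The overflow scheme above can be sketched as a pair of encode/decode helpers. All names here (`OFFSET_MAX_SKETCH`, `OVERFLOW_BIT_SKETCH`, `struct gdov_sketch`, `encode_gdat()`, `decode_gdat()`) are hypothetical, and the constant values are assumptions chosen to agree with the description (a 4-byte entry whose MSB flags an overflow). This is not the code in commit-graph.c.

```c
#include <stdint.h>

/* Assumed values, chosen to match the description above. */
#define OFFSET_MAX_SKETCH   ((UINT64_C(1) << 31) - 1) /* largest offset stored inline */
#define OVERFLOW_BIT_SKETCH (UINT32_C(1) << 31)       /* MSB means "look in GDOV" */
#define GDOV_CAPACITY_SKETCH 16

/*
 * Hypothetical in-memory stand-in for the GDOV chunk being built.
 * The fixed capacity is for illustration; this sketch assumes callers
 * record fewer than GDOV_CAPACITY_SKETCH overflows.
 */
struct gdov_sketch {
	uint64_t dates[GDOV_CAPACITY_SKETCH];
	uint32_t nr;
};

/*
 * Encode one 4-byte GDAT entry. Assumes corrected_date >= commit_date.
 * Small offsets are stored directly; a large offset stores the full
 * corrected date in GDOV and records its position with the MSB set.
 */
static uint32_t encode_gdat(uint64_t corrected_date, uint64_t commit_date,
			    struct gdov_sketch *gdov)
{
	uint64_t offset = corrected_date - commit_date;

	if (offset > OFFSET_MAX_SKETCH) {
		uint32_t pos = gdov->nr++;
		gdov->dates[pos] = corrected_date;
		return OVERFLOW_BIT_SKETCH | pos;
	}
	return (uint32_t)offset;
}

/* Decode a GDAT entry back into a corrected commit date. */
static uint64_t decode_gdat(uint32_t entry, uint64_t commit_date,
			    const struct gdov_sketch *gdov)
{
	if (entry & OVERFLOW_BIT_SKETCH)
		return gdov->dates[entry & ~OVERFLOW_BIT_SKETCH];
	return commit_date + entry;
}
```

Under these assumptions an offset of exactly 2^31 is the smallest value that takes the overflow path, while 2^31 - 1 is still stored inline.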
We test the overflow-related code with the following repo history:
F - N - U
/ \
U - N - U N
\ /
N - F - N
Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.
The largest offset observed is 2 ^ 31, just large enough to overflow.
[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:15 +08:00
|
|
|
|
2021-02-18 22:07:25 +08:00
|
|
|
if (ctx->write_generation_data)
|
|
|
|
add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA,
|
2023-07-13 07:37:54 +08:00
|
|
|
st_mult(sizeof(uint32_t), ctx->commits.nr),
|
2021-02-18 22:07:25 +08:00
|
|
|
write_graph_chunk_generation_data);
|
|
|
|
if (ctx->num_generation_data_overflows)
|
|
|
|
add_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
|
2023-07-13 07:37:54 +08:00
|
|
|
st_mult(sizeof(timestamp_t), ctx->num_generation_data_overflows),
|
2021-02-18 22:07:25 +08:00
|
|
|
write_graph_chunk_generation_data_overflow);
|
|
|
|
if (ctx->num_extra_edges)
|
|
|
|
add_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES,
|
2023-07-13 07:37:54 +08:00
|
|
|
st_mult(4, ctx->num_extra_edges),
|
2021-02-18 22:07:25 +08:00
|
|
|
write_graph_chunk_extra_edges);
|
2020-04-07 00:59:49 +08:00
|
|
|
if (ctx->changed_paths) {
|
2021-02-18 22:07:25 +08:00
|
|
|
add_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
|
2023-07-13 07:37:54 +08:00
|
|
|
st_mult(sizeof(uint32_t), ctx->commits.nr),
|
2021-02-18 22:07:25 +08:00
|
|
|
write_graph_chunk_bloom_indexes);
|
|
|
|
add_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
|
2023-07-13 07:37:54 +08:00
|
|
|
st_add(sizeof(uint32_t) * 3,
|
|
|
|
ctx->total_bloom_filter_data_size),
|
2021-02-18 22:07:25 +08:00
|
|
|
write_graph_chunk_bloom_data);
|
2020-04-07 00:59:49 +08:00
|
|
|
}
|
2021-02-18 22:07:25 +08:00
|
|
|
if (ctx->num_commit_graphs_after > 1)
|
|
|
|
add_chunk(cf, GRAPH_CHUNKID_BASE,
|
2023-07-13 07:37:54 +08:00
|
|
|
st_mult(hashsz, ctx->num_commit_graphs_after - 1),
|
2021-02-18 22:07:25 +08:00
|
|
|
write_graph_chunk_base);
|
2019-06-19 02:14:27 +08:00
|
|
|
|
|
|
|
hashwrite_be32(f, GRAPH_SIGNATURE);
|
|
|
|
|
|
|
|
hashwrite_u8(f, GRAPH_VERSION);
|
2022-05-21 07:17:41 +08:00
|
|
|
hashwrite_u8(f, oid_version(the_hash_algo));
|
2021-02-18 22:07:25 +08:00
|
|
|
hashwrite_u8(f, get_num_chunks(cf));
|
2019-06-19 02:14:27 +08:00
|
|
|
hashwrite_u8(f, ctx->num_commit_graphs_after - 1);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-12 21:29:45 +08:00
|
|
|
if (ctx->report_progress) {
|
2019-01-20 04:21:16 +08:00
|
|
|
strbuf_addf(&progress_title,
|
|
|
|
Q_("Writing out commit graph in %d pass",
|
|
|
|
"Writing out commit graph in %d passes",
|
2021-02-25 01:12:22 +08:00
|
|
|
get_num_chunks(cf)),
|
|
|
|
get_num_chunks(cf));
|
2019-06-12 21:29:45 +08:00
|
|
|
ctx->progress = start_delayed_progress(
|
2019-01-20 04:21:16 +08:00
|
|
|
progress_title.buf,
|
2023-07-13 07:37:54 +08:00
|
|
|
st_mult(get_num_chunks(cf), ctx->commits.nr));
|
2019-01-20 04:21:16 +08:00
|
|
|
}
|
2020-07-01 21:27:26 +08:00
|
|
|
|
2021-02-18 22:07:25 +08:00
|
|
|
write_chunkfile(cf, ctx);
|
2020-07-01 21:27:26 +08:00
|
|
|
|
2019-06-12 21:29:45 +08:00
|
|
|
stop_progress(&ctx->progress);
|
2019-01-20 04:21:16 +08:00
|
|
|
strbuf_release(&progress_title);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-19 02:14:27 +08:00
|
|
|
if (ctx->split && ctx->base_graph_name && ctx->num_commit_graphs_after > 1) {
|
|
|
|
char *new_base_hash = xstrdup(oid_to_hex(&ctx->new_base_graph->oid));
|
2020-02-04 05:18:02 +08:00
|
|
|
char *new_base_name = get_split_graph_filename(ctx->new_base_graph->odb, new_base_hash);
|
2019-06-19 02:14:27 +08:00
|
|
|
|
|
|
|
free(ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2]);
|
|
|
|
free(ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2]);
|
|
|
|
ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 2] = new_base_name;
|
|
|
|
ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2] = new_base_hash;
|
|
|
|
}
|
|
|
|
|
2019-05-18 02:41:47 +08:00
|
|
|
close_commit_graph(ctx->r->objects);
|
2022-03-11 06:43:21 +08:00
|
|
|
finalize_hashfile(f, file_hash, FSYNC_COMPONENT_COMMIT_GRAPH,
|
|
|
|
CSUM_HASH_IN_STREAM | CSUM_FSYNC);
|
2021-02-18 22:07:25 +08:00
|
|
|
free_chunkfile(cf);
|
2019-06-19 02:14:27 +08:00
|
|
|
|
|
|
|
if (ctx->split) {
|
|
|
|
FILE *chainf = fdopen_lock_file(&lk, "w");
|
|
|
|
char *final_graph_name;
|
|
|
|
int result;
|
|
|
|
|
|
|
|
close(fd);
|
|
|
|
|
|
|
|
if (!chainf) {
|
|
|
|
error(_("unable to open commit-graph chain file"));
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ctx->base_graph_name) {
|
builtin/commit-graph.c: introduce split strategy 'replace'
When using split commit-graphs, it is sometimes useful to completely
replace the commit-graph chain with a new base.
For example, consider a scenario in which a repository builds a new
commit-graph incremental for each push. Occasionally (say, after some
fixed number of pushes), they may wish to rebuild the commit-graph chain
with all reachable commits.
They can do so with
$ git commit-graph write --reachable
but this removes the chain entirely and replaces it with a single
commit-graph in 'objects/info/commit-graph'. Unfortunately, this means
that the next push will have to move this commit-graph into the first
layer of a new chain, and then write its new commits on top.
Avoid such copying entirely by allowing the caller to specify that they
wish to replace the entirety of their commit-graph chain, while also
specifying that the new commit-graph should become the basis of a fresh,
length-one chain.
This addresses the above situation by making it possible for the caller
to instead write:
$ git commit-graph write --reachable --split=replace
which writes a new length-one chain to 'objects/info/commit-graphs',
making the commit-graph incremental generated by the subsequent push
relatively cheap by avoiding the aforementioned copy.
In order to do this, remove an assumption in 'write_commit_graph_file'
that chains are always at least two incrementals long.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
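The removed assumption can be illustrated with a small sketch. It mirrors the index selection in 'write_commit_graph_file' (the slot the renamed base graph is written to), but the function name `base_layer_slot` and the use of `size_t` are hypothetical and chosen only for illustration. With two or more layers the base goes into the second-to-last slot; with a length-one chain, subtracting two would underflow, so the only slot is used.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given the number of layers the chain will have
 * after the write, return the slot that holds the (possibly renamed)
 * base graph. This is a sketch, not the function used by git itself.
 */
static size_t base_layer_slot(size_t num_layers_after)
{
	size_t idx;

	/* A chain always has at least one layer after a write. */
	assert(num_layers_after >= 1);

	idx = num_layers_after - 1;	/* slot of the newly written tip */
	if (num_layers_after > 1)
		idx--;			/* base sits just below the tip */
	return idx;
}
```

For a length-one chain the helper returns slot 0 instead of wrapping around, which is the case '--split=replace' introduces.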
2020-04-14 12:04:17 +08:00
|
|
|
const char *dest;
|
|
|
|
int idx = ctx->num_commit_graphs_after - 1;
|
|
|
|
if (ctx->num_commit_graphs_after > 1)
|
|
|
|
idx--;
|
|
|
|
|
|
|
|
dest = ctx->commit_graph_filenames_after[idx];
|
2019-06-19 02:14:27 +08:00
|
|
|
|
2019-06-19 02:14:28 +08:00
|
|
|
if (strcmp(ctx->base_graph_name, dest)) {
|
|
|
|
result = rename(ctx->base_graph_name, dest);
|
|
|
|
|
|
|
|
if (result) {
|
|
|
|
error(_("failed to rename base commit-graph file"));
|
|
|
|
return -1;
|
|
|
|
}
|
2019-06-19 02:14:27 +08:00
|
|
|
}
|
|
|
|
} else {
|
2020-02-04 05:18:02 +08:00
|
|
|
char *graph_name = get_commit_graph_filename(ctx->odb);
|
2019-06-19 02:14:27 +08:00
|
|
|
unlink(graph_name);
|
2022-03-05 02:32:14 +08:00
|
|
|
free(graph_name);
|
2019-06-19 02:14:27 +08:00
|
|
|
}
|
|
|
|
|
2021-04-26 09:02:58 +08:00
|
|
|
ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 1] = xstrdup(hash_to_hex(file_hash));
|
2020-02-04 05:18:02 +08:00
|
|
|
final_graph_name = get_split_graph_filename(ctx->odb,
|
2019-06-19 02:14:27 +08:00
|
|
|
ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 1]);
|
|
|
|
ctx->commit_graph_filenames_after[ctx->num_commit_graphs_after - 1] = final_graph_name;
|
|
|
|
|
|
|
|
result = rename(ctx->graph_name, final_graph_name);
|
|
|
|
|
|
|
|
for (i = 0; i < ctx->num_commit_graphs_after; i++)
|
2021-01-06 03:23:47 +08:00
|
|
|
fprintf(get_lock_file_fp(&lk), "%s\n", ctx->commit_graph_hash_after[i]);
|
2019-06-19 02:14:27 +08:00
|
|
|
|
|
|
|
if (result) {
|
|
|
|
error(_("failed to rename temporary commit-graph file"));
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-04-03 04:34:19 +08:00
|
|
|
commit_lock_file(&lk);
|
|
|
|
|
2019-06-12 21:29:45 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
commit-graph: merge commit-graph chains
When searching for a commit in a commit-graph chain of G graphs with N
commits, the search takes O(G log N) time. If we always add a new tip
graph with every write, the linear G term will start to dominate and
slow the lookup process.
To keep lookups fast, but also keep most incremental writes fast, create
a strategy for merging levels of the commit-graph chain. The strategy is
detailed in the commit-graph design document, but is summarized by these
two conditions:
1. If the number of commits we are adding is more than half the number
of commits in the graph below, then merge with that graph.
2. If we are writing more than 64,000 commits into a single graph,
then merge with all lower graphs.
The numeric values in the conditions above are currently constant, but
can become config options in a future update.
As we merge levels of the commit-graph chain, check that the commits
still exist in the repository. A garbage-collection operation may have
removed those commits from the object store and we do not want to
persist them in the commit-graph chain. This is a non-issue if the
'git gc' process wrote a new, single-level commit-graph file.
After we merge levels, the old graph-{hash}.graph files are no longer
referenced by the commit-graph-chain file. We will expire these files in
a future change.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
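The two merge conditions can be sketched on a simplified model of the chain. The model is an assumption made for illustration: the chain is a plain array of layer sizes ordered from tip to base, and the names `layers_after_write`, `layer_sizes`, `size_mult` and `max_commits` are hypothetical. The loop condition follows the one in split_graph_merge_strategy(), using a 64-bit accumulator instead of git's overflow checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch: return how many layers the chain has after writing
 * `new_commits` commits on top of `num_layers` existing layers.
 * layer_sizes[0] is the current tip, layer_sizes[num_layers - 1] the base.
 * A max_commits of 0 disables the second condition.
 */
static size_t layers_after_write(const uint32_t *layer_sizes, size_t num_layers,
				 uint32_t new_commits, uint32_t size_mult,
				 uint32_t max_commits)
{
	uint64_t pending = new_commits;	/* commits going into the new layer */
	size_t after = num_layers + 1;	/* start by assuming a new tip layer */
	size_t i = 0;

	assert(size_mult > 0);

	/* Fold a lower layer in while it is small relative to what we write,
	 * or while the pending layer already exceeds max_commits. */
	while (i < num_layers &&
	       (layer_sizes[i] <= (uint64_t)size_mult * pending ||
		(max_commits && pending > max_commits))) {
		pending += layer_sizes[i];
		i++;
		after--;
	}
	return after;
}
```

With layers of 100, 1000 and 10000 commits (tip first) and a size multiple of 2, writing 10 commits adds a fourth layer, while writing 600 commits folds the two smallest layers into the new one and leaves a chain of length two.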
2019-06-19 02:14:29 +08:00
|
|
|
static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
|
|
|
|
{
|
2019-10-01 10:29:34 +08:00
|
|
|
struct commit_graph *g;
|
|
|
|
uint32_t num_commits;
|
builtin/commit-graph.c: introduce split strategy 'no-merge'
In the previous commit, we laid the groundwork for supporting different
splitting strategies. In this commit, we introduce the first splitting
strategy: 'no-merge'.
Passing '--split=no-merge' is useful for callers which wish to write a
new incremental commit-graph, but do not want to spend effort condensing
the incremental chain [1]. Previously, this was possible by passing
'--size-multiple=0', but this is no longer the case following 63020f175f
(commit-graph: prefer default size_mult when given zero, 2020-01-02).
When '--split=no-merge' is given, the commit-graph machinery will never
condense an existing chain, and it will always write a new incremental.
[1]: This might occur when, for example, a server administrator who runs
some program after each push wants to ensure that each job takes time
proportional to the size of the push, and does not "jump" when
the commit-graph machinery decides to trigger a merge.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-14 12:04:12 +08:00
|
|
|
enum commit_graph_split_flags flags = COMMIT_GRAPH_SPLIT_UNSPECIFIED;
|
2019-06-19 02:14:29 +08:00
|
|
|
uint32_t i;
|
|
|
|
|
2019-06-19 02:14:32 +08:00
|
|
|
int max_commits = 0;
|
|
|
|
int size_mult = 2;
|
|
|
|
|
2020-09-18 10:59:49 +08:00
|
|
|
if (ctx->opts) {
|
|
|
|
max_commits = ctx->opts->max_commits;
|
commit-graph: prefer default size_mult when given zero
In 50f26bd ("fetch: add fetch.writeCommitGraph config setting",
2019-09-02), the fetch builtin added the capability to write a
commit-graph using the "--split" feature. This feature creates
multiple commit-graph files, and those can merge based on a set
of "split options" including a size multiple. The default size
multiple is 2, which intends to provide a log_2 N depth of the
commit-graph chain where N is the number of commits.
However, I noticed during dogfooding that my commit-graph chains
were becoming quite large when left only to builds by 'git fetch'.
It turns out that in split_graph_merge_strategy(), we default the
size_mult variable to 2 except we override it with the context's
split_opts if they exist. In builtin/fetch.c, we create such a
split_opts, but do not populate it with values.
This problem is due to two failures:
1. It is unclear that we can add the flag COMMIT_GRAPH_WRITE_SPLIT
with a NULL split_opts.
2. If we have a non-NULL split_opts, then we override the default
values even if a zero value is given.
Correct both of these issues. First, do not override size_mult when
the options provide a zero value. Second, stop creating a split_opts
in the fetch builtin.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
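The first fix can be sketched as follows. The struct `split_opts_sketch` and the function `effective_size_mult` are hypothetical stand-ins, not git's actual types; the point is only that a zero field means "not set" and must not overwrite the default of 2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the split options discussed above. */
struct split_opts_sketch {
	int size_multiple;	/* 0 means "use the default" */
	int max_commits;	/* 0 means "no limit" */
};

/*
 * Sketch: pick the size multiple, keeping the default when the
 * options are missing or carry a zero value.
 */
static int effective_size_mult(const struct split_opts_sketch *opts)
{
	int size_mult = 2;	/* default stated in the message above */

	if (opts && opts->size_multiple)
		size_mult = opts->size_multiple;

	/* Sanity check specific to this sketch: negative values are
	 * assumed never to be passed in. */
	assert(size_mult > 0);
	return size_mult;
}
```

A NULL options pointer and a zero-initialized options struct both yield 2, so a caller that never fills in the struct no longer disables merging by accident.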
2020-01-03 00:14:14 +08:00
|
|
|
|
2020-09-18 10:59:49 +08:00
|
|
|
if (ctx->opts->size_multiple)
|
|
|
|
size_mult = ctx->opts->size_multiple;
|
2020-04-14 12:04:12 +08:00
|
|
|
|
2020-09-18 10:59:49 +08:00
|
|
|
flags = ctx->opts->split_flags;
|
2019-06-19 02:14:32 +08:00
|
|
|
}
|
|
|
|
|
2019-06-19 02:14:29 +08:00
|
|
|
g = ctx->r->objects->commit_graph;
|
2019-10-01 10:29:34 +08:00
|
|
|
num_commits = ctx->commits.nr;
|
2020-04-14 12:04:17 +08:00
|
|
|
if (flags == COMMIT_GRAPH_SPLIT_REPLACE)
|
|
|
|
ctx->num_commit_graphs_after = 1;
|
|
|
|
else
|
|
|
|
ctx->num_commit_graphs_after = ctx->num_commit_graphs_before + 1;
|
2019-06-19 02:14:29 +08:00
|
|
|
|
2020-04-14 12:04:17 +08:00
|
|
|
if (flags != COMMIT_GRAPH_SPLIT_MERGE_PROHIBITED &&
|
|
|
|
flags != COMMIT_GRAPH_SPLIT_REPLACE) {
|
2023-07-13 07:38:11 +08:00
|
|
|
while (g && (g->num_commits <= st_mult(size_mult, num_commits) ||
|
2020-04-14 12:04:12 +08:00
|
|
|
(max_commits && num_commits > max_commits))) {
|
|
|
|
if (g->odb != ctx->odb)
|
|
|
|
break;
|
2019-06-19 02:14:30 +08:00
|
|
|
|
2023-07-13 07:38:11 +08:00
|
|
|
if (unsigned_add_overflows(num_commits, g->num_commits))
|
|
|
|
die(_("cannot merge graphs with %"PRIuMAX", "
|
|
|
|
"%"PRIuMAX" commits"),
|
|
|
|
(uintmax_t)num_commits,
|
|
|
|
(uintmax_t)g->num_commits);
|
2020-04-14 12:04:12 +08:00
|
|
|
num_commits += g->num_commits;
|
|
|
|
g = g->base_graph;
|
2019-06-19 02:14:29 +08:00
|
|
|
|
2020-04-14 12:04:12 +08:00
|
|
|
ctx->num_commit_graphs_after--;
|
|
|
|
}
|
2019-06-19 02:14:29 +08:00
|
|
|
}
|
|
|
|
|
2020-04-14 12:04:17 +08:00
|
|
|
if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
|
|
|
|
ctx->new_base_graph = g;
|
|
|
|
else if (ctx->num_commit_graphs_after != 1)
|
|
|
|
BUG("split_graph_merge_strategy: num_commit_graphs_after "
|
|
|
|
"should be 1 with --split=replace");
|
2019-06-19 02:14:29 +08:00
2019-06-19 02:14:30 +08:00
	if (ctx->num_commit_graphs_after == 2) {
commit-graph.c: remove path normalization, comparison
As of the previous patch, all calls to 'commit-graph.c' functions which
perform path normalization (e.g., 'get_commit_graph_filename()') pass
an argument of the form 'ctx->odb->path', which is always in normalized
form.
Now that there are no callers passing non-normalized paths to these
functions, ensure that future callers are bound by the same restrictions
by making these functions take a 'struct object_directory *' instead of
a 'const char *'. To match, replace all calls with arguments of the form
'ctx->odb->path' with 'ctx->odb'. To recover the path, functions that
perform path manipulation simply use 'odb->path'.
Further, avoid string comparisons with arguments of the form
'odb->path', and instead prefer raw pointer comparisons, which
accomplish the same effect, but are far less brittle.
This has a pleasant side-effect of making these functions much more
robust to paths that cannot be normalized by 'normalize_path_copy()',
i.e., because they are outside of the current working directory.
For example, prior to this patch, Valgrind reports an uninitialized
memory read [1] when running:
$ ( cd t && GIT_DIR=../.git valgrind git rev-parse HEAD^ )
because 'normalize_path_copy()' can't normalize '../.git' (since it's
relative to, but above, the current working directory) [2].
By using a 'struct object_directory *' directly,
'get_commit_graph_filename()' does not need to normalize, because all
paths are relative to the current working directory since they are
always read from the '->path' of an object directory.
[1]: https://lore.kernel.org/git/20191027042116.GA5801@sigill.intra.peff.net.
[2]: The bug here is that 'get_commit_graph_filename()' returns the
result of 'normalize_path_copy()' without checking the return
value.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04 05:18:02 +08:00
		char *old_graph_name = get_commit_graph_filename(g->odb);
2019-06-19 02:14:30 +08:00
		if (!strcmp(g->filename, old_graph_name) &&
2020-02-04 05:18:02 +08:00
		    g->odb != ctx->odb) {
2019-06-19 02:14:30 +08:00
			ctx->num_commit_graphs_after = 1;
			ctx->new_base_graph = NULL;
		}

		free(old_graph_name);
	}
commit-graph.c: gracefully handle file descriptor exhaustion
When writing a layered commit-graph, the commit-graph machinery uses
'commit_graph_filenames_after' and 'commit_graph_hash_after' to keep
track of the layers in the chain that we are in the process of writing.
When the number of commit-graph layers shrinks, we initialize all
entries in the aforementioned arrays, because we know the structure of
the new commit-graph chain immediately (since there are no new layers,
there are no unknown hash values).
But when the number of commit-graph layers grows (i.e., when
'num_commit_graphs_after > num_commit_graphs_before'), we leave
some entries in the filenames and hashes arrays uninitialized,
because we will fill them in later as those values become available.
For instance, we rely on 'write_commit_graph_file' to store the
filename and hash of the last layer in the new chain, which is the one
that it is responsible for writing. But, it's possible that
'write_commit_graph_file' may fail, e.g., from file descriptor
exhaustion. In this case it is possible that 'git_mkstemp_mode' will
fail, and that function will return early *before* setting the values
for the last commit-graph layer's filename and hash.
This causes a number of unpleasant side-effects. For instance, trying to
'free()' each entry in 'ctx->commit_graph_filenames_after' (and
similarly for the hashes array) causes us to 'free()' uninitialized
memory, since the area is allocated with 'malloc()' and may therefore
contain garbage (which is left alone when
'write_commit_graph_file' returns early).
This can manifest in other issues, like a general protection fault,
and/or leaving a stray 'commit-graph-chain.lock' around after the
process dies. (The reasoning for this is still a mystery to me, since
we'd otherwise usually expect the kernel to run tempfile.c's 'atexit()'
handlers in the case of a normal death...)
To resolve this, initialize the memory with 'CALLOC_ARRAY' so that
uninitialized entries are filled with zeros, and can thus be 'free()'d
as a noop instead of causing a fault.
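
The effect of zero-initializing the array can be shown with a minimal standalone sketch. It uses plain 'calloc()' rather than Git's 'CALLOC_ARRAY' macro, and the function name 'fill_then_release()' is hypothetical. It assumes, as common platforms do, that an all-zero-bytes pointer is a null pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate `total` zero-initialized slots, fill only the first `filled`
 * of them (simulating an early return before the rest are set), then
 * free every slot. Returns the number of slots that were still NULL
 * when released, or (size_t)-1 if the array itself cannot be allocated.
 */
static size_t fill_then_release(size_t total, size_t filled)
{
	char **names;
	size_t i, unset = 0;

	assert(filled <= total);
	names = calloc(total, sizeof(*names));
	if (!names)
		return (size_t)-1;

	for (i = 0; i < filled; i++) {
		names[i] = malloc(8);
		if (names[i])
			strcpy(names[i], "layer");
	}

	for (i = 0; i < total; i++) {
		if (!names[i])
			unset++;
		free(names[i]); /* free(NULL) is defined to do nothing */
	}
	free(names);
	return unset;
}
```

With 'malloc()' in place of 'calloc()', the unfilled slots would hold indeterminate values and the cleanup loop would pass garbage to 'free()'.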
Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-24 05:41:09 +08:00
	CALLOC_ARRAY(ctx->commit_graph_filenames_after, ctx->num_commit_graphs_after);
	CALLOC_ARRAY(ctx->commit_graph_hash_after, ctx->num_commit_graphs_after);
2019-06-19 02:14:29 +08:00
	for (i = 0; i < ctx->num_commit_graphs_after &&
		    i < ctx->num_commit_graphs_before; i++)
		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);

	i = ctx->num_commit_graphs_before - 1;
	g = ctx->r->objects->commit_graph;

	while (g) {
		if (i < ctx->num_commit_graphs_after)
			ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
commit-graph: use generation v2 only if entire chain does
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:
1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
If each layer of split commit-graph is treated independently, as it was
the case before this commit, with Git inspecting only the current layer
for chunk_generation_data pointer, commits in the lower layer (one with
GDAT) would have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.
It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.
This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).
Therefore, when writing the new layer in a split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
as well.
Rewriting layers follows a similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.
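
The rule can be stated as a small predicate. This is a sketch under assumptions: the function name, the boolean array, and the 'default_choice' parameter (what the writer would do when no layer remains below) are all hypothetical and do not correspond to fields in 'struct write_commit_graph_context'.

```c
#include <assert.h>
#include <stddef.h>

/*
 * has_gdat[0] describes the base layer and has_gdat[kept - 1] the topmost
 * layer that remains below the layer being written. Returns nonzero if
 * the new layer should carry a GDAT chunk.
 */
static int should_write_gdat(const int *has_gdat, size_t kept,
			     int default_choice)
{
	assert(has_gdat || !kept);
	if (!kept)
		return default_choice; /* nothing below to stay consistent with */
	return default_choice && has_gdat[kept - 1];
}
```

Because every write applies the same check, a layer with a GDAT chunk can only ever sit on top of layers that also have one.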
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:16 +08:00
		/*
		 * If the topmost remaining layer has generation data chunk, the
		 * resultant layer also has generation data chunk.
		 */
		if (i == ctx->num_commit_graphs_after - 2)
			ctx->write_generation_data = !!g->chunk_generation_data;
2019-06-19 02:14:29 +08:00
		i--;
		g = g->base_graph;
	}
}

static void merge_commit_graph(struct write_commit_graph_context *ctx,
			       struct commit_graph *g)
{
	uint32_t i;
	uint32_t offset = g->num_commits_in_base;
2023-07-13 07:38:13 +08:00
	if (unsigned_add_overflows(ctx->commits.nr, g->num_commits))
		die(_("cannot merge graph %s, too many commits: %"PRIuMAX),
		    oid_to_hex(&g->oid),
		    (uintmax_t)st_add(ctx->commits.nr, g->num_commits));
2019-06-19 02:14:29 +08:00
	ALLOC_GROW(ctx->commits.list, ctx->commits.nr + g->num_commits, ctx->commits.alloc);

	for (i = 0; i < g->num_commits; i++) {
		struct object_id oid;
		struct commit *result;

		display_progress(ctx->progress, i + 1);

		load_oid_from_graph(g, i + offset, &oid);

		/* only add commits if they still exist in the repo */
		result = lookup_commit_reference_gently(ctx->r, &oid, 1);

		if (result) {
			ctx->commits.list[ctx->commits.nr] = result;
			ctx->commits.nr++;
		}
	}
}

static int commit_compare(const void *_a, const void *_b)
{
	const struct commit *a = *(const struct commit **)_a;
	const struct commit *b = *(const struct commit **)_b;
	return oidcmp(&a->object.oid, &b->object.oid);
}

static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
{
commit-graph: ignore duplicates when merging layers
Thomas reported [1] that a "git fetch" command was failing with an error
saying "unexpected duplicate commit id". The root cause is that they had
fetch.writeCommitGraph enabled which generates commit-graph chains, and
this instance was merging two layers that both contained the same commit
ID.
[1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/
The initial assumption is that Git would not write a commit ID into a
commit-graph layer if it already exists in a lower commit-graph layer.
Somehow, this specific case did get into that situation, leading to this
error.
While unexpected, this isn't actually invalid (as long as the two layers
agree on the metadata for the commit). When we parse a commit that does
not have a graph_pos in the commit_graph_data_slab, we use binary search
in the commit-graph layers to find the commit and set graph_pos. That
position is never used again in this case. However, when we parse a
commit from the commit-graph file, we load its parents from the
commit-graph and assign graph_pos at that point. If those parents were
already parsed from the commit-graph, then nothing needs to be done.
Otherwise, this graph_pos is a valid position in the commit-graph so we
can parse the parents, when necessary.
Thus, this die() is too aggressive. The easiest thing to do would be to
ignore the duplicates.
If we only ignore the duplicates, then we will produce a commit-graph
that has identical commit IDs listed in adjacent positions. This excess
data will never be removed from the commit-graph, which could cascade
into significantly bloated file sizes.
Thankfully, we can collapse the list to erase the duplicate commit
pointers. This allows us to get the end result we want without extra
memory costs and with minimal CPU time.
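
The collapse described above is the usual sort-then-compact idiom. The following standalone sketch applies it to plain integers; the function names are hypothetical, and standard 'qsort()' stands in for Git's 'QSORT' macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}

/* Sort `list`, then collapse adjacent duplicates in place. Returns the new length. */
static size_t sort_and_dedup(int *list, size_t nr)
{
	size_t i, dedup_i = 0;

	assert(list || !nr);
	if (nr)
		qsort(list, nr, sizeof(*list), cmp_int);

	for (i = 0; i < nr; i++) {
		if (i && list[i - 1] == list[i])
			continue; /* same as the previous entry: drop it */
		list[dedup_i++] = list[i]; /* dedup_i never runs ahead of i */
	}
	return dedup_i;
}
```

The write index never passes the read index, so the comparison against the previous entry always sees the original sorted value and no second buffer is needed.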
The root cause is due to disabling core.commitGraph, which prevents
parsing commits from the lower layers during a 'git commit-graph write
--split' command. Since we use the 'graph_pos' value to determine
whether a commit is in a lower layer, we never discover that those
commits are already in the commit-graph chain and add them to the top
layer. This layer is then merged down, creating duplicates.
The test added in t5324-split-commit-graph.sh fails without this change.
However, we still have not completely removed the need for this
duplicate check. That will come in a follow-up change.
Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Helped-by: Taylor Blau <me@ttaylorr.com>
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-10 04:53:51 +08:00
	uint32_t i, dedup_i = 0;
2019-06-19 02:14:29 +08:00
	if (ctx->report_progress)
		ctx->progress = start_delayed_progress(
					_("Scanning merged commits"),
					ctx->commits.nr);

	QSORT(ctx->commits.list, ctx->commits.nr, commit_compare);

	ctx->num_extra_edges = 0;
	for (i = 0; i < ctx->commits.nr; i++) {
2021-09-09 09:10:11 +08:00
		display_progress(ctx->progress, i + 1);
2019-06-19 02:14:29 +08:00
		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
			  &ctx->commits.list[i]->object.oid)) {
2020-10-10 04:53:51 +08:00
			/*
			 * Silently ignore duplicates. These were likely
			 * created due to a commit appearing in multiple
			 * layers of the chain, which is unexpected but
			 * not invalid. We should make sure there is a
			 * unique copy in the new layer.
			 */
2019-06-19 02:14:29 +08:00
		} else {
2019-09-16 01:07:44 +08:00
			unsigned int num_parents;
2019-06-19 02:14:29 +08:00
2020-10-10 04:53:51 +08:00
			ctx->commits.list[dedup_i] = ctx->commits.list[i];
			dedup_i++;
2019-09-16 01:07:44 +08:00
			num_parents = commit_list_count(ctx->commits.list[i]->parents);
commit-graph: merge commit-graph chains
When searching for a commit in a commit-graph chain of G graphs with N
commits, the search takes O(G log N) time. If we always add a new tip
graph with every write, the linear G term will start to dominate and
slow the lookup process.
To keep lookups fast, but also keep most incremental writes fast, create
a strategy for merging levels of the commit-graph chain. The strategy is
detailed in the commit-graph design document, but is summarized by these
two conditions:
1. If the number of commits we are adding is more than half the number
of commits in the graph below, then merge with that graph.
2. If we are writing more than 64,000 commits into a single graph,
then merge with all lower graphs.
The numeric values in the conditions above are currently constant, but
can become config options in a future update.
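The two conditions can be sketched as a loop that folds layers into the new tip until neither condition holds. This is a simplified reading of the summary above, not Git's implementation: `layers_kept` and `MAX_COMMITS_PER_WRITE` are hypothetical names, and the real strategy in the design document may differ in details such as boundary comparisons.

```c
#include <stddef.h>

#define MAX_COMMITS_PER_WRITE 64000  /* the constant quoted in the text */

/*
 * sizes[0] is the base layer and sizes[nr - 1] the current tip.
 * Returns how many existing layers stay untouched; every layer
 * above that count is folded into the new tip.
 */
size_t layers_kept(const size_t *sizes, size_t nr, size_t num_new)
{
	while (nr) {
		size_t below = sizes[nr - 1];
		int cond1 = num_new * 2 > below;              /* more than half */
		int cond2 = num_new > MAX_COMMITS_PER_WRITE;  /* very large write */

		if (!cond1 && !cond2)
			break;
		num_new += below;  /* the merged layer joins the new tip */
		nr--;
	}
	return nr;
}
```

For a chain with layers of 100000 and 1000 commits, adding 600 commits merges only the tip (one layer kept), while adding 100 commits leaves both layers alone. Once the running total exceeds 64,000, the second condition stays true and the loop folds every remaining layer.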
As we merge levels of the commit-graph chain, check that the commits
still exist in the repository. A garbage-collection operation may have
removed those commits from the object store and we do not want to
persist them in the commit-graph chain. This is a non-issue if the
'git gc' process wrote a new, single-level commit-graph file.
After we merge levels, the old graph-{hash}.graph files are no longer
referenced by the commit-graph-chain file. We will expire these files in
a future change.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 02:14:29 +08:00
|
|
|
if (num_parents > 2)
|
commit-graph: fix bug around octopus merges
In 1771be90 "commit-graph: merge commit-graph chains" (2019-06-18),
the method sort_and_scan_merged_commits() was added to merge the
commit lists of two commit-graph files in the incremental format.
Unfortunately, there was an off-by-one error in that method around
incrementing num_extra_edges, which leads to an incorrect offset
for the base graph chunk.
When we store an octopus merge in the commit-graph file, we store
the first parent in the normal place, but use the second parent
position to point into the "extra edges" chunk where the remaining
parents exist. This means we should be adding "num_parents - 1"
edges to this list, not "num_parents - 2". That is the basic error.
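The counting rule from the paragraph above can be written out as a small sketch. The function name and the flat array of parent counts are assumptions made for illustration; the real code walks commit structures.

```c
#include <stddef.h>

/*
 * Count the entries needed in the "extra edges" list.
 * A commit with more than two parents keeps its first parent in the
 * normal slot; every remaining parent goes to the extra list, which
 * is num_parents - 1 entries per octopus merge.
 */
size_t count_extra_edges(const unsigned *num_parents, size_t nr)
{
	size_t i, total = 0;

	for (i = 0; i < nr; i++) {
		if (num_parents[i] > 2)
			total += num_parents[i] - 1;
	}
	return total;
}
```

A three-parent merge therefore contributes two entries. Counting `num_parents - 2` would yield one, leaving every later offset into the chunk short.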
The reason this was not caught in the test suite is more subtle.
In t5324-split-commit-graph.sh, we test creating an octopus merge
and adding it to the tip of a commit-graph chain, then verify the
result. This _should_ have caught the problem, except that when
we load the commit-graph files we were overly careful to not fail
when the commit-graph chain does not match. This care was on
purpose to avoid race conditions as one process reads the chain
and another process modifies it. In such a case, the reading
process outputs the following message to stderr:
warning: commit-graph chain does not match
These warnings are output in the test suite, but ignored. By
checking that the stderr of `git commit-graph verify` includes
the expected progress output, the test will now catch this error.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-06 00:43:41 +08:00
|
|
|
ctx->num_extra_edges += num_parents - 1;
|
2019-06-19 02:14:29 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-10-10 04:53:51 +08:00
|
|
|
ctx->commits.nr = dedup_i;
|
|
|
|
|
2019-06-19 02:14:29 +08:00
|
|
|
stop_progress(&ctx->progress);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void merge_commit_graphs(struct write_commit_graph_context *ctx)
|
|
|
|
{
|
|
|
|
struct commit_graph *g = ctx->r->objects->commit_graph;
|
|
|
|
uint32_t current_graph_number = ctx->num_commit_graphs_before;
|
|
|
|
|
|
|
|
while (g && current_graph_number >= ctx->num_commit_graphs_after) {
|
|
|
|
current_graph_number--;
|
|
|
|
|
2020-02-21 02:49:18 +08:00
|
|
|
if (ctx->report_progress)
|
|
|
|
ctx->progress = start_delayed_progress(_("Merging commit-graph"), 0);
|
2019-06-19 02:14:29 +08:00
|
|
|
|
|
|
|
merge_commit_graph(ctx, g);
|
|
|
|
stop_progress(&ctx->progress);
|
|
|
|
|
|
|
|
g = g->base_graph;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (g) {
|
|
|
|
ctx->new_base_graph = g;
|
|
|
|
ctx->new_num_commits_in_base = g->num_commits + g->num_commits_in_base;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ctx->new_base_graph)
|
|
|
|
ctx->base_graph_name = xstrdup(ctx->new_base_graph->filename);
|
|
|
|
|
|
|
|
sort_and_scan_merged_commits(ctx);
|
|
|
|
}
|
|
|
|
|
2019-06-19 02:14:31 +08:00
|
|
|
static void mark_commit_graphs(struct write_commit_graph_context *ctx)
|
|
|
|
{
|
|
|
|
uint32_t i;
|
|
|
|
time_t now = time(NULL);
|
|
|
|
|
|
|
|
for (i = ctx->num_commit_graphs_after - 1; i < ctx->num_commit_graphs_before; i++) {
|
|
|
|
struct stat st;
|
|
|
|
struct utimbuf updated_time;
|
|
|
|
|
2022-05-13 06:32:17 +08:00
|
|
|
if (stat(ctx->commit_graph_filenames_before[i], &st) < 0)
|
|
|
|
continue;
|
2019-06-19 02:14:31 +08:00
|
|
|
|
|
|
|
updated_time.actime = st.st_atime;
|
|
|
|
updated_time.modtime = now;
|
|
|
|
utime(ctx->commit_graph_filenames_before[i], &updated_time);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void expire_commit_graphs(struct write_commit_graph_context *ctx)
|
|
|
|
{
|
|
|
|
struct strbuf path = STRBUF_INIT;
|
|
|
|
DIR *dir;
|
|
|
|
struct dirent *de;
|
|
|
|
size_t dirnamelen;
|
2019-06-19 02:14:32 +08:00
|
|
|
timestamp_t expire_time = time(NULL);
|
|
|
|
|
2020-09-18 10:59:49 +08:00
|
|
|
if (ctx->opts && ctx->opts->expire_time)
|
|
|
|
expire_time = ctx->opts->expire_time;
|
2019-06-19 02:14:33 +08:00
|
|
|
if (!ctx->split) {
|
2020-09-18 02:11:46 +08:00
|
|
|
char *chain_file_name = get_commit_graph_chain_filename(ctx->odb);
|
2019-06-19 02:14:33 +08:00
|
|
|
unlink(chain_file_name);
|
|
|
|
free(chain_file_name);
|
|
|
|
ctx->num_commit_graphs_after = 0;
|
|
|
|
}
|
2019-06-19 02:14:31 +08:00
|
|
|
|
2020-02-04 13:51:50 +08:00
|
|
|
strbuf_addstr(&path, ctx->odb->path);
|
2019-06-19 02:14:31 +08:00
|
|
|
strbuf_addstr(&path, "/info/commit-graphs");
|
|
|
|
dir = opendir(path.buf);
|
|
|
|
|
2019-08-07 19:15:02 +08:00
|
|
|
if (!dir)
|
|
|
|
goto out;
|
2019-06-19 02:14:31 +08:00
|
|
|
|
|
|
|
strbuf_addch(&path, '/');
|
|
|
|
dirnamelen = path.len;
|
|
|
|
while ((de = readdir(dir)) != NULL) {
|
|
|
|
struct stat st;
|
|
|
|
uint32_t i, found = 0;
|
|
|
|
|
|
|
|
strbuf_setlen(&path, dirnamelen);
|
|
|
|
strbuf_addstr(&path, de->d_name);
|
|
|
|
|
2022-05-13 06:32:17 +08:00
|
|
|
if (stat(path.buf, &st) < 0)
|
|
|
|
continue;
|
2019-06-19 02:14:31 +08:00
|
|
|
|
|
|
|
if (st.st_mtime > expire_time)
|
|
|
|
continue;
|
|
|
|
if (path.len < 6 || strcmp(path.buf + path.len - 6, ".graph"))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
for (i = 0; i < ctx->num_commit_graphs_after; i++) {
|
|
|
|
if (!strcmp(ctx->commit_graph_filenames_after[i],
|
|
|
|
path.buf)) {
|
|
|
|
found = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!found)
|
|
|
|
unlink(path.buf);
|
|
|
|
}
|
2019-08-07 19:15:02 +08:00
|
|
|
|
|
|
|
out:
|
2022-09-19 22:14:40 +08:00
|
|
|
if (dir)
|
|
|
|
closedir(dir);
|
2019-08-07 19:15:02 +08:00
|
|
|
strbuf_release(&path);
|
2019-06-19 02:14:31 +08:00
|
|
|
}
|
|
|
|
|
2020-02-04 13:51:50 +08:00
|
|
|
int write_commit_graph(struct object_directory *odb,
|
2022-03-05 02:32:12 +08:00
|
|
|
const struct string_list *const pack_indexes,
|
2020-04-14 12:04:25 +08:00
|
|
|
struct oidset *commits,
|
2019-08-05 16:02:39 +08:00
|
|
|
enum commit_graph_write_flags flags,
|
2020-09-18 10:59:49 +08:00
|
|
|
const struct commit_graph_opts *opts)
|
2019-06-12 21:29:45 +08:00
|
|
|
{
|
2021-02-26 02:19:42 +08:00
|
|
|
struct repository *r = the_repository;
|
2019-06-12 21:29:45 +08:00
|
|
|
struct write_commit_graph_context *ctx;
|
commit-graph: drop count_distinct_commits() function
When writing a commit graph, we collect a list of object ids in an
array, which we'll eventually copy into an array of "struct commit"
pointers. Before we do that, though, we count the number of distinct
commit entries. There's a subtle bug in this step, though.
We eliminate not only duplicate oids, but also in split mode, any oids
which are not commits or which are already in a graph file. However, the
loop starts at index 1, always counting index 0 as distinct. And indeed
it can't be a duplicate, since we check for those by comparing against
the previous entry, and there isn't one for index 0. But it could be a
commit that's already in a graph file, and we'd overcount the number of
commits by 1 in that case.
That turns out not to be a problem, though. The only things we do with
the count are:
- check if our count will overflow our data structures. But the limit
there is 2^31 commits, so while this is a useful check, the
off-by-one is not likely to matter.
- pre-allocate the array of commit pointers. But over-allocating by
one isn't a problem; we'll just waste a few extra bytes.
The bug would be easy enough to fix, but we can observe that neither of
those steps is necessary.
After building the actual commit array, we'll likewise check its count
for overflow. So the extra check of the distinct commit count here is
redundant.
And likewise we use ALLOC_GROW() when building the commit array, so
there's no need to preallocate it (it's possible that doing so is
slightly more efficient, but if we care we can just optimistically
allocate one slot for each oid; I didn't bother here).
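The grow-on-demand pattern that makes preallocation unnecessary can be sketched as follows. This is a stand-alone illustration rather than Git's ALLOC_GROW() macro: the `int_array` type, the function name, and the growth formula are all assumptions chosen for the example.

```c
#include <stdlib.h>

struct int_array {
	int *items;
	size_t nr, alloc;
};

/*
 * Append one value, growing the buffer only when it is full.
 * The growth formula is an assumption chosen for illustration.
 * Returns 0 on success, -1 if allocation fails.
 */
int int_array_append(struct int_array *a, int value)
{
	if (a->nr == a->alloc) {
		size_t new_alloc = (a->alloc + 16) * 3 / 2;
		int *p = realloc(a->items, new_alloc * sizeof(*a->items));

		if (!p)
			return -1;
		a->items = p;
		a->alloc = new_alloc;
	}
	a->items[a->nr++] = value;
	return 0;
}
```

Because capacity grows geometrically, appending n items triggers only O(log n) reallocations, so knowing the exact count in advance saves very little.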
So count_distinct_commits() isn't doing anything useful. Let's just get
rid of that step.
Note that a side effect of the function was that we sorted the list of
oids, which we do rely on in copy_oids_to_commits(), since it must also
skip the duplicates. So we'll move the qsort there. I didn't copy the
"TODO" about adding more progress meters. It's actually quite hard to
make a repository large enough that this qsort would take an appreciable
amount of time, so this doesn't seem like a useful note.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08 03:11:02 +08:00
|
|
|
uint32_t i;
|
2019-06-12 21:29:37 +08:00
|
|
|
int res = 0;
|
builtin/commit-graph.c: introduce split strategy 'replace'
When using split commit-graphs, it is sometimes useful to completely
replace the commit-graph chain with a new base.
For example, consider a scenario in which a repository builds a new
commit-graph incremental for each push. Occasionally (say, after some
fixed number of pushes), they may wish to rebuild the commit-graph chain
with all reachable commits.
They can do so with
$ git commit-graph write --reachable
but this removes the chain entirely and replaces it with a single
commit-graph in 'objects/info/commit-graph'. Unfortunately, this means
that the next push will have to move this commit-graph into the first
layer of a new chain, and then write its new commits on top.
Avoid such copying entirely by allowing the caller to specify that they
wish to replace the entirety of their commit-graph chain, while also
specifying that the new commit-graph should become the basis of a fresh,
length-one chain.
This addresses the above situation by making it possible for the caller
to instead write:
$ git commit-graph write --reachable --split=replace
which writes a new length-one chain to 'objects/info/commit-graphs',
making the commit-graph incremental generated by the subsequent push
relatively cheap by avoiding the aforementioned copy.
In order to do this, remove an assumption in 'write_commit_graph_file'
that chains are always at least two incrementals long.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-14 12:04:17 +08:00
|
|
|
int replace = 0;
|
2020-09-17 02:07:46 +08:00
|
|
|
struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
|
2021-01-17 02:11:12 +08:00
|
|
|
struct topo_level_slab topo_levels;
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2021-02-26 02:19:42 +08:00
|
|
|
prepare_repo_settings(r);
|
|
|
|
if (!r->settings.core_commit_graph) {
|
2020-10-10 04:53:52 +08:00
|
|
|
warning(_("attempting to write a commit-graph, but 'core.commitGraph' is disabled"));
|
|
|
|
return 0;
|
|
|
|
}
|
2021-02-26 02:19:42 +08:00
|
|
|
if (!commit_graph_compatible(r))
|
2019-06-12 21:29:37 +08:00
|
|
|
return 0;
|
2018-08-21 02:24:27 +08:00
|
|
|
|
2021-03-14 00:17:22 +08:00
|
|
|
CALLOC_ARRAY(ctx, 1);
|
2021-02-26 02:19:42 +08:00
|
|
|
ctx->r = r;
|
2020-02-04 13:51:50 +08:00
|
|
|
ctx->odb = odb;
|
2019-08-05 16:02:39 +08:00
|
|
|
ctx->append = flags & COMMIT_GRAPH_WRITE_APPEND ? 1 : 0;
|
|
|
|
ctx->report_progress = flags & COMMIT_GRAPH_WRITE_PROGRESS ? 1 : 0;
|
|
|
|
ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
|
2020-09-18 10:59:49 +08:00
|
|
|
ctx->opts = opts;
|
2020-03-30 08:31:28 +08:00
|
|
|
ctx->total_bloom_filter_data_size = 0;
|
2021-02-26 02:19:43 +08:00
|
|
|
ctx->write_generation_data = (get_configured_generation_version(r) == 2);
|
commit-graph: implement generation data chunk
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
the prerequisites for implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.
We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.
Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.
To minimize the space required to store corrected commit dates, Git
stores corrected commit date offsets in the commit-graph file instead
of the corrected commit dates themselves. This saves us 4 bytes per
commit, decreasing the GDAT chunk size by half, but it's possible for
the offset to overflow the 4 bytes allocated for storage. As such
overflows are and should be exceedingly rare, we use the following
overflow management scheme:
We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.
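The overflow scheme can be illustrated with a small encode/decode pair. This is a sketch under stated assumptions, not Git's code: the value of `OFFSET_MAX` is assumed to be 2^31 - 1 (consistent with an offset of 2^31 being just large enough to overflow), and `struct overflow_chunk`, `OVERFLOW_CAP`, and both function names are hypothetical.

```c
#include <stdint.h>

#define OFFSET_MAX   ((UINT32_C(1) << 31) - 1)  /* assumed value */
#define OVERFLOW_BIT (UINT32_C(1) << 31)        /* MSB of the 4-byte slot */
#define OVERFLOW_CAP 8                          /* hypothetical capacity */

/* Stand-in for the overflow chunk: full corrected dates, in order. */
struct overflow_chunk {
	uint64_t dates[OVERFLOW_CAP];
	uint32_t nr;
};

/* Returns the 4-byte value to store for one commit. */
uint32_t encode_offset(uint64_t corrected_date, uint64_t commit_date,
		       struct overflow_chunk *ov)
{
	uint64_t offset = corrected_date - commit_date;

	if (offset > OFFSET_MAX) {
		/* Caller must ensure ov->nr < OVERFLOW_CAP. */
		uint32_t pos = ov->nr++;

		ov->dates[pos] = corrected_date;
		return OVERFLOW_BIT | pos;  /* MSB set, low bits = position */
	}
	return (uint32_t)offset;
}

/* Recovers the corrected commit date from the stored value. */
uint64_t decode_date(uint32_t stored, uint64_t commit_date,
		     const struct overflow_chunk *ov)
{
	if (stored & OVERFLOW_BIT)
		return ov->dates[stored & OFFSET_MAX];
	return commit_date + stored;
}
```

Small offsets round-trip through the 4-byte slot directly. Only the rare overflowing commits pay for an extra 8-byte entry in the overflow table.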
We test the overflow-related code with the following repo history:
F - N - U
/ \
U - N - U N
\ /
N - F - N
Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.
The largest offset observed is 2 ^ 31, just large enough to overflow.
[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:15 +08:00
|
|
|
ctx->num_generation_data_overflows = 0;
|
2019-06-19 02:14:27 +08:00
|
|
|
|
2020-09-17 02:07:46 +08:00
|
|
|
bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
|
|
|
|
bloom_settings.bits_per_entry);
|
|
|
|
bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
|
|
|
|
bloom_settings.num_hashes);
|
|
|
|
bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
|
|
|
|
bloom_settings.max_changed_paths);
|
|
|
|
ctx->bloom_settings = &bloom_settings;
|
|
|
|
|
2021-01-17 02:11:12 +08:00
|
|
|
init_topo_level_slab(&topo_levels);
|
|
|
|
ctx->topo_levels = &topo_levels;
|
|
|
|
|
2021-02-02 11:01:23 +08:00
|
|
|
prepare_commit_graph(ctx->r);
|
2021-01-17 02:11:12 +08:00
|
|
|
if (ctx->r->objects->commit_graph) {
|
|
|
|
struct commit_graph *g = ctx->r->objects->commit_graph;
|
|
|
|
|
|
|
|
while (g) {
|
|
|
|
g->topo_levels = &topo_levels;
|
|
|
|
g = g->base_graph;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-07-01 21:27:24 +08:00
|
|
|
if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
|
|
|
|
ctx->changed_paths = 1;
|
|
|
|
if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
|
|
|
|
struct commit_graph *g;
|
|
|
|
|
|
|
|
g = ctx->r->objects->commit_graph;
|
|
|
|
|
|
|
|
/* We have changed-paths already. Keep them in the next graph */
|
|
|
|
if (g && g->chunk_bloom_data) {
|
|
|
|
ctx->changed_paths = 1;
|
|
|
|
ctx->bloom_settings = g->bloom_filter_settings;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-06-19 02:14:27 +08:00
|
|
|
if (ctx->split) {
|
2021-02-02 11:01:23 +08:00
|
|
|
struct commit_graph *g = ctx->r->objects->commit_graph;
|
2019-06-19 02:14:27 +08:00
|
|
|
|
|
|
|
while (g) {
|
|
|
|
ctx->num_commit_graphs_before++;
|
|
|
|
g = g->base_graph;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ctx->num_commit_graphs_before) {
|
|
|
|
ALLOC_ARRAY(ctx->commit_graph_filenames_before, ctx->num_commit_graphs_before);
|
|
|
|
i = ctx->num_commit_graphs_before;
|
|
|
|
g = ctx->r->objects->commit_graph;
|
|
|
|
|
|
|
|
while (g) {
|
|
|
|
ctx->commit_graph_filenames_before[--i] = xstrdup(g->filename);
|
|
|
|
g = g->base_graph;
|
|
|
|
}
|
|
|
|
}
|
2020-04-14 12:04:17 +08:00
|
|
|
|
2020-09-18 10:59:49 +08:00
|
|
|
if (ctx->opts)
|
|
|
|
replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
|
2019-06-19 02:14:27 +08:00
|
|
|
}
|
2019-06-12 21:29:40 +08:00
|
|
|
|
2023-03-28 21:58:52 +08:00
|
|
|
ctx->approx_nr_objects = repo_approximate_object_count(the_repository);
|
2019-06-12 21:29:40 +08:00
|
|
|
|
|
|
|
if (ctx->append && ctx->r->objects->commit_graph) {
|
|
|
|
struct commit_graph *g = ctx->r->objects->commit_graph;
|
|
|
|
for (i = 0; i < g->num_commits; i++) {
|
2020-12-08 03:11:05 +08:00
|
|
|
struct object_id oid;
|
2023-07-13 07:38:16 +08:00
|
|
|
oidread(&oid, g->chunk_oid_lookup + st_mult(g->hash_len, i));
|
2020-12-08 03:11:05 +08:00
|
|
|
oid_array_append(&ctx->oids, &oid);
|
2018-04-10 20:56:08 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-04-10 20:56:06 +08:00
|
|
|
if (pack_indexes) {
|
2020-03-30 08:31:30 +08:00
|
|
|
ctx->order_by_pack = 1;
|
2019-06-12 21:29:41 +08:00
|
|
|
if ((res = fill_oids_from_packs(ctx, pack_indexes)))
|
|
|
|
goto cleanup;
|
2018-04-10 20:56:07 +08:00
|
|
|
}
|
|
|
|
|
2020-04-14 12:04:25 +08:00
|
|
|
if (commits) {
|
|
|
|
if ((res = fill_oids_from_commits(ctx, commits)))
|
2019-08-05 16:02:40 +08:00
|
|
|
goto cleanup;
|
|
|
|
}
|
2018-04-10 20:56:07 +08:00
|
|
|
|
2020-05-02 04:39:53 +08:00
|
|
|
if (!pack_indexes && !commits) {
|
2020-03-30 08:31:30 +08:00
|
|
|
ctx->order_by_pack = 1;
|
2019-06-12 21:29:42 +08:00
|
|
|
fill_oids_from_all_packs(ctx);
|
2020-03-30 08:31:30 +08:00
|
|
|
}
|
2018-04-10 20:56:06 +08:00
|
|
|
|
2019-06-12 21:29:40 +08:00
|
|
|
close_reachable(ctx);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-12 21:29:44 +08:00
|
|
|
copy_oids_to_commits(ctx);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-12 21:29:40 +08:00
|
|
|
if (ctx->commits.nr >= GRAPH_EDGE_LAST_MASK) {
|
2019-06-12 21:29:37 +08:00
|
|
|
error(_("too many commits to write graph"));
|
|
|
|
res = -1;
|
|
|
|
goto cleanup;
|
|
|
|
}
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2020-04-14 12:04:17 +08:00
|
|
|
if (!ctx->commits.nr && !replace)
|
2019-06-19 02:14:27 +08:00
|
|
|
goto cleanup;
|
|
|
|
|
2019-06-19 02:14:29 +08:00
|
|
|
if (ctx->split) {
|
|
|
|
split_graph_merge_strategy(ctx);
|
|
|
|
|
2020-04-14 12:04:17 +08:00
|
|
|
if (!replace)
|
|
|
|
merge_commit_graphs(ctx);
|
commit-graph: merge commit-graph chains
When searching for a commit in a commit-graph chain of G graphs with N
commits, the search takes O(G log N) time. If we always add a new tip
graph with every write, the linear G term will start to dominate and
slow the lookup process.
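The O(G log N) cost can be illustrated with a minimal sketch. The `sketch_layer` struct, the integer ids standing in for object IDs, and both function names are hypothetical and are not taken from commit-graph.c: each layer is binary-searched, and the layers are visited one after another from the tip down.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layer: sorted commit ids plus a link to the layer below. */
struct sketch_layer {
	const uint32_t *ids; /* sorted ascending */
	size_t nr;
	const struct sketch_layer *base;
};

/* Binary search within one layer: O(log N). */
static int layer_contains(const struct sketch_layer *g, uint32_t id)
{
	size_t lo = 0, hi = g->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (g->ids[mid] == id)
			return 1;
		if (g->ids[mid] < id)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/* Walk from the tip to the base: up to G layers, so O(G log N) overall. */
static int chain_contains(const struct sketch_layer *tip, uint32_t id,
			  size_t *layers_visited)
{
	const struct sketch_layer *g;

	*layers_visited = 0;
	for (g = tip; g; g = g->base) {
		(*layers_visited)++;
		if (layer_contains(g, id))
			return 1;
	}
	return 0;
}
```

A commit that lives in the base layer costs one binary search per layer above it, which is why an ever-growing chain would slow lookups down.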
To keep lookups fast, but also keep most incremental writes fast, create
a strategy for merging levels of the commit-graph chain. The strategy is
detailed in the commit-graph design document, but is summarized by these
two conditions:
1. If the number of commits we are adding is more than half the number
of commits in the graph below, then merge with that graph.
2. If we are writing more than 64,000 commits into a single graph,
then merge with all lower graphs.
The numeric values in the conditions above are currently constant, but
can become config options in a future update.
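The two merge conditions can be sketched as a loop over the chain. This is only an illustration built from the wording of this message: `sketch_graph`, `layers_to_merge` and `SKETCH_MAX_COMMITS` are hypothetical names, and the real split_graph_merge_strategy() may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* The constant named in the second condition (assumed value for this sketch). */
#define SKETCH_MAX_COMMITS 64000

/* Hypothetical stand-in for one layer of the commit-graph chain. */
struct sketch_graph {
	uint64_t num_commits;
	const struct sketch_graph *base; /* next lower layer, or NULL */
};

/*
 * Return how many layers, counted from the tip, would be folded into the
 * graph being written. *num_written receives the resulting commit count.
 */
static size_t layers_to_merge(const struct sketch_graph *tip, uint64_t num_new,
			      uint64_t *num_written)
{
	const struct sketch_graph *g = tip;
	uint64_t num = num_new;
	size_t merged = 0;

	/*
	 * Condition 1: more than half the commits of the graph below.
	 * Condition 2: more than SKETCH_MAX_COMMITS commits in one graph.
	 */
	while (g && (2 * num > g->num_commits || num > SKETCH_MAX_COMMITS)) {
		num += g->num_commits;
		g = g->base;
		merged++;
	}
	*num_written = num;
	return merged;
}
```

Because the running total only grows, once the second condition holds it keeps holding, so the loop continues through every lower layer.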
As we merge levels of the commit-graph chain, check that the commits
still exist in the repository. A garbage-collection operation may have
removed those commits from the object store and we do not want to
persist them in the commit-graph chain. This is a non-issue if the
'git gc' process wrote a new, single-level commit-graph file.
After we merge levels, the old graph-{hash}.graph files are no longer
referenced by the commit-graph-chain file. We will expire these files in
a future change.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 02:14:29 +08:00
|
|
|
} else
|
2019-06-19 02:14:27 +08:00
|
|
|
ctx->num_commit_graphs_after = 1;
|
|
|
|
|
2021-02-02 11:01:22 +08:00
|
|
|
ctx->trust_generation_numbers = validate_mixed_generation_chain(ctx->r->objects->commit_graph);
|
commit-graph: use generation v2 only if entire chain does
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:
1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
If each layer of a split commit-graph is treated independently, as was
the case before this commit, with Git inspecting only the current layer
for chunk_generation_data pointer, commits in the lower layer (one with
GDAT) would have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.
It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.
This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).
Therefore, when writing the new layer in a split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
as well.
Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.
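The reading side of this rule can be sketched as follows. The struct and function name are hypothetical; only the `read_generation_data` field name is borrowed from the surrounding code, and this is not presented as the body of validate_mixed_generation_chain(). The idea is that generation v2 is trusted only when every layer carries it, and otherwise the whole chain falls back to topological levels.

```c
#include <stddef.h>

/* Hypothetical layer of a split commit-graph chain. */
struct sketch_gdat_layer {
	int read_generation_data; /* non-zero if this layer has a GDAT chunk */
	struct sketch_gdat_layer *base;
};

/*
 * Return 1 if every layer has generation data. Otherwise clear the flag on
 * all layers so that no part of the chain mixes v1 and v2 values, and
 * return 0.
 */
static int chain_uses_generation_v2(struct sketch_gdat_layer *tip)
{
	struct sketch_gdat_layer *g;
	int all_have_gdat = 1;

	for (g = tip; g && all_have_gdat; g = g->base)
		all_have_gdat = g->read_generation_data;

	if (all_have_gdat)
		return 1;

	for (g = tip; g; g = g->base)
		g->read_generation_data = 0;
	return 0;
}
```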
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:16 +08:00
|
|
|
|
commit-graph: compute generations separately
The compute_generation_numbers() method was introduced by 3258c663
(commit-graph: compute generation numbers, 2018-05-01) to compute what
is now known as "topological levels". These are still stored in the
commit-graph file for compatibility's sake while c1a09119 (commit-graph:
implement corrected commit date, 2021-01-16) updated the method to also
compute the new version of generation numbers: corrected commit date.
It makes sense why these are grouped. They perform very similar walks of
the necessary commits and compute similar maximums over each parent.
However, having these two together conflates them in subtle ways that are
hard to separate.
In particular, the topo_level slab is used to store the topological
levels in all cases, but the commit_graph_data_at(c)->generation member
stores different values depending on the state of the existing
commit-graph file.
* If the existing commit-graph file has a "GDAT" chunk, then these
values represent corrected commit dates.
* If the existing commit-graph file doesn't have a "GDAT" chunk, then
these values are actually the topological levels.
This issue occurs only when upgrading an existing commit-graph file
into one that has the "GDAT" chunk. The current change does not resolve
this upgrade problem, but splitting the implementation into two pieces
here helps with that process, which will follow in the next change.
The important thing this helps with is the case where the
num_generation_data_overflows was being incremented incorrectly,
triggering a write of the overflow chunk.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-02 11:01:21 +08:00
|
|
|
compute_topological_levels(ctx);
|
|
|
|
if (ctx->write_generation_data)
|
|
|
|
compute_generation_numbers(ctx);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2020-03-30 08:31:28 +08:00
|
|
|
if (ctx->changed_paths)
|
|
|
|
compute_bloom_filters(ctx);
|
|
|
|
|
2019-06-12 21:29:45 +08:00
|
|
|
res = write_commit_graph_file(ctx);
|
2018-04-03 04:34:19 +08:00
|
|
|
|
2019-06-19 02:14:33 +08:00
|
|
|
if (ctx->split)
|
2019-06-19 02:14:31 +08:00
|
|
|
mark_commit_graphs(ctx);
|
2019-06-19 02:14:33 +08:00
|
|
|
|
|
|
|
expire_commit_graphs(ctx);
|
2019-06-19 02:14:31 +08:00
|
|
|
|
2019-06-12 21:29:37 +08:00
|
|
|
cleanup:
|
2019-06-12 21:29:45 +08:00
|
|
|
free(ctx->graph_name);
|
2019-06-12 21:29:40 +08:00
|
|
|
free(ctx->commits.list);
|
2020-12-08 03:11:05 +08:00
|
|
|
oid_array_clear(&ctx->oids);
|
2021-02-20 04:13:10 +08:00
|
|
|
clear_topo_level_slab(&topo_levels);
|
2019-06-19 02:14:27 +08:00
|
|
|
|
|
|
|
if (ctx->commit_graph_filenames_after) {
|
|
|
|
for (i = 0; i < ctx->num_commit_graphs_after; i++) {
|
|
|
|
free(ctx->commit_graph_filenames_after[i]);
|
|
|
|
free(ctx->commit_graph_hash_after[i]);
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < ctx->num_commit_graphs_before; i++)
|
|
|
|
free(ctx->commit_graph_filenames_before[i]);
|
|
|
|
|
|
|
|
free(ctx->commit_graph_filenames_after);
|
|
|
|
free(ctx->commit_graph_filenames_before);
|
|
|
|
free(ctx->commit_graph_hash_after);
|
|
|
|
}
|
|
|
|
|
2019-06-12 21:29:40 +08:00
|
|
|
free(ctx);
|
2019-06-12 21:29:37 +08:00
|
|
|
|
|
|
|
return res;
|
2018-04-03 04:34:19 +08:00
|
|
|
}
|
2018-06-27 21:24:32 +08:00
|
|
|
|
2018-06-27 21:24:42 +08:00
|
|
|
#define VERIFY_COMMIT_GRAPH_ERROR_HASH 2
|
2018-06-27 21:24:32 +08:00
|
|
|
static int verify_commit_graph_error;
|
|
|
|
|
2021-07-13 16:05:18 +08:00
|
|
|
__attribute__((format (printf, 1, 2)))
|
2018-06-27 21:24:32 +08:00
|
|
|
static void graph_report(const char *fmt, ...)
|
|
|
|
{
|
|
|
|
va_list ap;
|
|
|
|
|
|
|
|
verify_commit_graph_error = 1;
|
|
|
|
va_start(ap, fmt);
|
|
|
|
vfprintf(stderr, fmt, ap);
|
|
|
|
fprintf(stderr, "\n");
|
|
|
|
va_end(ap);
|
|
|
|
}
|
|
|
|
|
2021-06-24 02:39:09 +08:00
|
|
|
static int commit_graph_checksum_valid(struct commit_graph *g)
|
|
|
|
{
|
|
|
|
return hashfile_checksum_valid(g->data, g->data_len);
|
|
|
|
}
|
|
|
|
|
2023-07-08 08:31:36 +08:00
|
|
|
static int verify_one_commit_graph(struct repository *r,
|
|
|
|
struct commit_graph *g,
|
commit-graph.c: avoid duplicated progress output during `verify`
When `git commit-graph verify` was taught how to verify commit-graph
chains in 3da4b609bb1 (commit-graph: verify chains with --shallow mode,
2019-06-18), it produced one line of progress per layer of the
commit-graph chain.
$ git.compile commit-graph verify
Verifying commits in commit graph: 100% (4356/4356), done.
Verifying commits in commit graph: 100% (131912/131912), done.
This could be somewhat confusing to users, who may wonder why there are
multiple occurrences of "Verifying commits in commit graph".
There are likely good arguments on whether or not there should be
one line of progress output per commit-graph layer. On the one hand, the
existing output shows us verifying each individual layer of the chain.
But on the other hand, the fact that a commit-graph may be stored among
multiple layers is an implementation detail that the caller need not be
aware of.
Clarify this by showing a single progress meter regardless of the number
of layers in the commit-graph chain. After this patch, the output
reflects the logical contents of a commit-graph chain, instead of
showing one line of output per commit-graph layer:
$ git.compile commit-graph verify
Verifying commits in commit graph: 100% (136268/136268), done.
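The single-meter approach can be sketched with a counter shared across layers. All names below are hypothetical and the progress API itself is left out: the total is computed over the whole chain, and each layer bumps the same counter, much as the `seen` pointer is used in verify_one_commit_graph().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layer of a commit-graph chain. */
struct sketch_chain_layer {
	uint32_t num_commits;
	const struct sketch_chain_layer *base;
};

/* Total for the one progress meter: the sum over all layers. */
static uint64_t chain_total(const struct sketch_chain_layer *tip)
{
	const struct sketch_chain_layer *g;
	uint64_t total = 0;

	for (g = tip; g; g = g->base)
		total += g->num_commits;
	return total;
}

/* Stand-in for per-layer verification: one tick per commit. */
static void verify_layer(const struct sketch_chain_layer *g, uint64_t *seen)
{
	uint32_t i;

	for (i = 0; i < g->num_commits; i++)
		++(*seen); /* a real caller would update the meter here */
}

/* Verify every layer while sharing a single counter. */
static uint64_t verify_chain(const struct sketch_chain_layer *tip)
{
	const struct sketch_chain_layer *g;
	uint64_t seen = 0;

	for (g = tip; g; g = g->base)
		verify_layer(g, &seen);
	return seen;
}
```

With layers of 4356 and 131912 commits, both the total and the final counter come to 136268, matching the single line of output shown above.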
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-08 08:31:45 +08:00
|
|
|
struct progress *progress,
|
|
|
|
uint64_t *seen)
|
2018-06-27 21:24:32 +08:00
|
|
|
{
|
2018-06-27 21:24:35 +08:00
|
|
|
uint32_t i, cur_fanout_pos = 0;
|
2021-04-26 09:02:58 +08:00
|
|
|
struct object_id prev_oid, cur_oid;
|
2023-08-22 05:34:42 +08:00
|
|
|
struct commit *seen_gen_zero = NULL;
|
|
|
|
struct commit *seen_gen_non_zero = NULL;
|
2018-06-27 21:24:32 +08:00
|
|
|
|
commit-graph: fix segfault on e.g. "git status"
When core.commitGraph=true is set, various common commands now consult
the commit graph. Because the commit-graph code is very trusting of
its input data, it's possible to construct a graph that'll cause an
immediate segfault on e.g. "status" (and e.g. "log", "blame", ...). In
some other cases git immediately exits with a cryptic error
about the graph being broken.
The root cause of this is that while the "commit-graph verify"
sub-command exhaustively verifies the graph, other users of the graph
simply trust the graph, and will e.g. dereference data found at certain
offsets as pointers, causing segfaults.
This change does the bare minimum to ensure that we don't segfault in
the common fill_commit_in_graph() codepath called by
e.g. setup_revisions(). To do this, instrument the "commit-graph
verify" tests to always check if "status" would subsequently
segfault. This fixes the following tests which would previously
segfault:
not ok 50 - detect low chunk count
not ok 51 - detect missing OID fanout chunk
not ok 52 - detect missing OID lookup chunk
not ok 53 - detect missing commit data chunk
Those happened because with the commit-graph enabled setup_revisions()
would eventually call fill_commit_in_graph(), where e.g.
g->chunk_commit_data is used early as an offset (and will be
0x0). With this change we get far enough to detect that the graph is
broken, and show an error instead. E.g.:
$ git status; echo $?
error: commit-graph is missing the Commit Data chunk
1
That also sucks, we should *warn* and not hard-fail "status" just
because the commit-graph is corrupt, but fixing that is left to a
follow-up change.
A side-effect of changing the reporting from graph_report() to error()
is that we now have an "error: " prefix for these even for
"commit-graph verify". Pseudo-diff before/after:
$ git commit-graph verify
-commit-graph is missing the Commit Data chunk
+error: commit-graph is missing the Commit Data chunk
Changing that is OK. Various errors it emits now early on are prefixed
with "error: ", moving these over and changing the output doesn't
break anything.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-03-25 20:08:29 +08:00
|
|
|
verify_commit_graph_error = verify_commit_graph_lite(g);
|
2018-06-27 21:24:35 +08:00
|
|
|
if (verify_commit_graph_error)
|
|
|
|
return verify_commit_graph_error;
|
|
|
|
|
2021-06-24 02:39:09 +08:00
|
|
|
if (!commit_graph_checksum_valid(g)) {
|
2018-06-27 21:24:42 +08:00
|
|
|
graph_report(_("the commit-graph file has incorrect checksum and is likely corrupt"));
|
|
|
|
verify_commit_graph_error = VERIFY_COMMIT_GRAPH_ERROR_HASH;
|
|
|
|
}
|
|
|
|
|
2018-06-27 21:24:35 +08:00
|
|
|
for (i = 0; i < g->num_commits; i++) {
|
2018-06-27 21:24:37 +08:00
|
|
|
struct commit *graph_commit;
|
|
|
|
|
2023-07-13 07:38:19 +08:00
|
|
|
oidread(&cur_oid, g->chunk_oid_lookup + st_mult(g->hash_len, i));
|
2018-06-27 21:24:35 +08:00
|
|
|
|
|
|
|
if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
|
2019-03-25 20:08:34 +08:00
|
|
|
graph_report(_("commit-graph has incorrect OID order: %s then %s"),
|
2018-06-27 21:24:35 +08:00
|
|
|
oid_to_hex(&prev_oid),
|
|
|
|
oid_to_hex(&cur_oid));
|
|
|
|
|
|
|
|
oidcpy(&prev_oid, &cur_oid);
|
|
|
|
|
|
|
|
while (cur_oid.hash[0] > cur_fanout_pos) {
|
|
|
|
uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
|
|
|
|
|
|
|
|
if (i != fanout_value)
|
2019-03-25 20:08:34 +08:00
|
|
|
graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
|
2018-06-27 21:24:35 +08:00
|
|
|
cur_fanout_pos, fanout_value, i);
|
|
|
|
cur_fanout_pos++;
|
|
|
|
}
|
2018-06-27 21:24:37 +08:00
|
|
|
|
2018-07-18 06:46:19 +08:00
|
|
|
graph_commit = lookup_commit(r, &cur_oid);
|
2018-12-15 08:09:39 +08:00
|
|
|
if (!parse_commit_in_graph_one(r, g, graph_commit))
|
2019-03-25 20:08:34 +08:00
|
|
|
graph_report(_("failed to parse commit %s from commit-graph"),
|
2018-06-27 21:24:37 +08:00
|
|
|
oid_to_hex(&cur_oid));
|
2018-06-27 21:24:35 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
while (cur_fanout_pos < 256) {
|
|
|
|
uint32_t fanout_value = get_be32(g->chunk_oid_fanout + cur_fanout_pos);
|
|
|
|
|
|
|
|
if (g->num_commits != fanout_value)
|
2019-03-25 20:08:34 +08:00
|
|
|
graph_report(_("commit-graph has incorrect fanout value: fanout[%d] = %u != %u"),
|
2018-06-27 21:24:35 +08:00
|
|
|
cur_fanout_pos, fanout_value, i);
|
|
|
|
|
|
|
|
cur_fanout_pos++;
|
|
|
|
}
|
|
|
|
|
2018-06-27 21:24:42 +08:00
|
|
|
if (verify_commit_graph_error & ~VERIFY_COMMIT_GRAPH_ERROR_HASH)
|
2018-06-27 21:24:36 +08:00
|
|
|
return verify_commit_graph_error;
|
|
|
|
|
|
|
|
for (i = 0; i < g->num_commits; i++) {
|
2018-06-27 21:24:37 +08:00
|
|
|
struct commit *graph_commit, *odb_commit;
|
2018-06-27 21:24:38 +08:00
|
|
|
struct commit_list *graph_parents, *odb_parents;
|
2021-01-17 02:11:13 +08:00
|
|
|
timestamp_t max_generation = 0;
|
|
|
|
timestamp_t generation;
|
2018-06-27 21:24:36 +08:00
|
|
|
|
2023-07-08 08:31:45 +08:00
|
|
|
display_progress(progress, ++(*seen));
|
2023-07-13 07:38:19 +08:00
|
|
|
oidread(&cur_oid, g->chunk_oid_lookup + st_mult(g->hash_len, i));
|
2018-06-27 21:24:36 +08:00
|
|
|
|
2018-07-18 06:46:19 +08:00
|
|
|
graph_commit = lookup_commit(r, &cur_oid);
|
2019-06-20 15:41:21 +08:00
|
|
|
odb_commit = (struct commit *)create_object(r, &cur_oid, alloc_commit_node(r));
|
libs: use "struct repository *" argument, not "the_repository"
As can easily be seen from grepping in our sources, we had these uses
of "the_repository" in various library code in cases where the
function in question was already getting a "struct repository *"
argument. Let's use that argument instead.
Out of these changes only the changes to "cache-tree.c",
"commit-reach.c", "shallow.c" and "upload-pack.c" would have cleanly
applied before the migration away from the "repo_*()" wrapper macros
in the preceding commits.
The rest aren't new, as we'd previously implicitly refer to
"the_repository", but it's now more obvious that we were doing the
wrong thing all along, and should have used the parameter instead.
The change to change "get_index_format_default(the_repository)" in
"read-cache.c" to use the "r" variable instead should arguably have
been part of [1], or in the subsequent cleanup in [2]. Let's do it
here, as can be seen from the initial code in [3] it's not important
that we use "the_repository" there, but would prefer to always use the
current repository.
This change excludes the "the_repository" use in "upload-pack.c"'s
upload_pack_advertise(), as the in-flight [4] makes that change.
1. ee1f0c242ef (read-cache: add index.skipHash config option,
2023-01-06)
2. 6269f8eaad0 (treewide: always have a valid "index_state.repo"
member, 2023-01-17)
3. 7211b9e7534 (repo-settings: consolidate some config settings,
2019-08-13)
4. <Y/hbUsGPVNAxTdmS@coredump.intra.peff.net>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-28 21:58:58 +08:00
|
|
|
if (repo_parse_commit_internal(r, odb_commit, 0, 0)) {
|
2019-03-25 20:08:34 +08:00
|
|
|
graph_report(_("failed to parse commit %s from object database for commit-graph"),
|
2018-06-27 21:24:36 +08:00
|
|
|
oid_to_hex(&cur_oid));
|
|
|
|
continue;
|
|
|
|
}
|
2018-06-27 21:24:37 +08:00
|
|
|
|
2018-12-15 08:09:39 +08:00
|
|
|
if (!oideq(&get_commit_tree_in_graph_one(r, g, graph_commit)->object.oid,
|
2018-06-27 21:24:37 +08:00
|
|
|
get_commit_tree_oid(odb_commit)))
|
2019-03-25 20:08:34 +08:00
|
|
|
graph_report(_("root tree OID for commit %s in commit-graph is %s != %s"),
|
2018-06-27 21:24:37 +08:00
|
|
|
oid_to_hex(&cur_oid),
|
|
|
|
oid_to_hex(get_commit_tree_oid(graph_commit)),
|
|
|
|
oid_to_hex(get_commit_tree_oid(odb_commit)));
|
2018-06-27 21:24:38 +08:00
|
|
|
|
|
|
|
graph_parents = graph_commit->parents;
|
|
|
|
odb_parents = odb_commit->parents;
|
|
|
|
|
|
|
|
while (graph_parents) {
|
2022-05-03 00:50:37 +08:00
|
|
|
if (!odb_parents) {
|
2019-03-25 20:08:34 +08:00
|
|
|
graph_report(_("commit-graph parent list for commit %s is too long"),
|
2018-06-27 21:24:38 +08:00
|
|
|
oid_to_hex(&cur_oid));
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2019-06-19 02:14:32 +08:00
|
|
|
/* parse parent in case it is in a base graph */
|
|
|
|
parse_commit_in_graph_one(r, g, graph_parents->item);
|
|
|
|
|
2018-08-29 05:22:48 +08:00
|
|
|
if (!oideq(&graph_parents->item->object.oid, &odb_parents->item->object.oid))
|
2019-03-25 20:08:34 +08:00
|
|
|
graph_report(_("commit-graph parent for %s is %s != %s"),
|
2018-06-27 21:24:38 +08:00
|
|
|
oid_to_hex(&cur_oid),
|
|
|
|
oid_to_hex(&graph_parents->item->object.oid),
|
|
|
|
oid_to_hex(&odb_parents->item->object.oid));
|
|
|
|
|
commit-graph: introduce `commit_graph_generation_from_graph()`
In 2ee11f7261 (commit-graph: return generation from memory, 2023-03-20),
the `commit_graph_generation()` function stopped returning zeros when
asked to locate the generation number of a given commit.
This was done at the time to prepare for a later change which set
generation values in memory, meaning that we could no longer rely on
`graph_pos` alone to tell us whether or not to trust the generation
number returned by this function.
In 2ee11f7261, it was noted that this change only impacted very old
commit-graphs, which were written with all commits having generation
number 0. Indeed, zero is not a valid generation number, so we should
never expect to see that value outside of the aforementioned case.
The test fallout in 2ee11f7261 indicated that we were no longer able to
fsck a specific old case of commit-graph corruption, where we see a
non-zero generation number after having seen a generation number of 0
earlier.
Introduce a variant of `commit_graph_generation()` which behaves like
that function did prior to 2ee11f7261, known as
`commit_graph_generation_from_graph()`. Then use this function in the
context of `verify_one_commit_graph()`, where we only want to trust the
values from the graph.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-22 05:34:34 +08:00
|
|
|
generation = commit_graph_generation_from_graph(graph_parents->item);
|
2020-06-17 17:14:11 +08:00
|
|
|
if (generation > max_generation)
|
|
|
|
max_generation = generation;
|
2018-06-27 21:24:39 +08:00
|
|
|
|
2018-06-27 21:24:38 +08:00
|
|
|
graph_parents = graph_parents->next;
|
|
|
|
odb_parents = odb_parents->next;
|
|
|
|
}
|
|
|
|
|
2022-05-03 00:50:37 +08:00
|
|
|
if (odb_parents)
|
2019-03-25 20:08:34 +08:00
|
|
|
graph_report(_("commit-graph parent list for commit %s terminates early"),
|
2018-06-27 21:24:38 +08:00
|
|
|
oid_to_hex(&cur_oid));
|
2018-06-27 21:24:39 +08:00
|
|
|
|
2023-08-22 05:34:42 +08:00
|
|
|
if (commit_graph_generation_from_graph(graph_commit))
|
|
|
|
seen_gen_non_zero = graph_commit;
|
|
|
|
else
|
|
|
|
seen_gen_zero = graph_commit;
|
2018-06-27 21:24:39 +08:00
|
|
|
|
2023-08-22 05:34:42 +08:00
|
|
|
if (seen_gen_zero)
|
2018-06-27 21:24:39 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
commit-graph: implement generation data chunk
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of the
prerequisites before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.
We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.
Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.
To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instead
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4 bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:
We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.
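The overflow scheme can be sketched as an encode/decode pair. The helper names, the fixed-size table, and the concrete constant values are assumptions made for illustration; only the idea (the MSB marks an index into the overflow storage) comes from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values for this sketch: offsets up to 2^31 - 1 fit inline. */
#define SKETCH_OFFSET_MAX ((UINT64_C(1) << 31) - 1)
#define SKETCH_OVERFLOW_BIT (UINT32_C(1) << 31)
#define SKETCH_OVERFLOW_CAP 16

/* Hypothetical in-memory stand-in for the GDOV chunk. */
struct sketch_overflow {
	uint64_t dates[SKETCH_OVERFLOW_CAP];
	uint32_t nr;
};

/*
 * Produce the 4-byte value stored for one commit. Small offsets are stored
 * directly; large ones store MSB | position and put the full corrected
 * commit date into the overflow table. The caller must ensure that the
 * table still has room.
 */
static uint32_t encode_offset(uint64_t commit_date, uint64_t corrected_date,
			      struct sketch_overflow *ovf)
{
	uint64_t offset = corrected_date - commit_date;

	if (offset > SKETCH_OFFSET_MAX) {
		ovf->dates[ovf->nr] = corrected_date;
		return SKETCH_OVERFLOW_BIT | ovf->nr++;
	}
	return (uint32_t)offset;
}

/* Recover the corrected commit date from the stored 4-byte value. */
static uint64_t decode_corrected_date(uint64_t commit_date, uint32_t stored,
				      const struct sketch_overflow *ovf)
{
	if (stored & SKETCH_OVERFLOW_BIT)
		return ovf->dates[stored ^ SKETCH_OVERFLOW_BIT];
	return commit_date + stored;
}
```

An offset of exactly 2^31, the largest one mentioned for the test history below, is the smallest value that takes the overflow path in this sketch.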
We test the overflow-related code with the following repo history:
F - N - U
/ \
U - N - U N
\ /
N - F - N
Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.
The largest offset observed is 2 ^ 31, just large enough to overflow.
[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-17 02:11:15 +08:00
|
|
|
* If we are using topological level and one of our parents has
|
|
|
|
* generation GENERATION_NUMBER_V1_MAX, then our generation is
|
|
|
|
* also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
|
|
|
|
* in the following condition.
|
2018-06-27 21:24:39 +08:00
|
|
|
*/
|
2021-01-17 02:11:16 +08:00
|
|
|
if (!g->read_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
|
2018-06-27 21:24:39 +08:00
|
|
|
max_generation--;
|
|
|
|
|
2020-06-17 17:14:11 +08:00
|
|
|
generation = commit_graph_generation(graph_commit);
|
2021-01-17 02:11:15 +08:00
|
|
|
if (generation < max_generation + 1)
|
|
|
|
graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
|
2018-06-27 21:24:39 +08:00
|
|
|
oid_to_hex(&cur_oid),
|
2020-06-17 17:14:11 +08:00
|
|
|
generation,
|
2018-06-27 21:24:39 +08:00
|
|
|
max_generation + 1);
|
2018-06-27 21:24:40 +08:00
|
|
|
|
|
|
|
if (graph_commit->date != odb_commit->date)
|
2019-03-25 20:08:34 +08:00
|
|
|
graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
|
2018-06-27 21:24:40 +08:00
|
|
|
oid_to_hex(&cur_oid),
|
|
|
|
graph_commit->date,
|
|
|
|
odb_commit->date);
|
2018-06-27 21:24:36 +08:00
|
|
|
}
|
2023-08-22 05:34:42 +08:00
|
|
|
|
|
|
|
if (seen_gen_zero && seen_gen_non_zero)
|
|
|
|
graph_report(_("commit-graph has both zero and non-zero "
|
|
|
|
"generations (e.g., commits '%s' and '%s')"),
|
|
|
|
oid_to_hex(&seen_gen_zero->object.oid),
|
|
|
|
oid_to_hex(&seen_gen_non_zero->object.oid));
|
2018-06-27 21:24:36 +08:00
|
|
|
|
2023-07-08 08:31:36 +08:00
|
|
|
return verify_commit_graph_error;
|
|
|
|
}
|
|
|
|
|
|
|
|
int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
|
|
|
|
{
|
commit-graph.c: avoid duplicated progress output during `verify`
When `git commit-graph verify` was taught how to verify commit-graph
chains in 3da4b609bb1 (commit-graph: verify chains with --shallow mode,
2019-06-18), it produced one line of progress per layer of the
commit-graph chain.
$ git.compile commit-graph verify
Verifying commits in commit graph: 100% (4356/4356), done.
Verifying commits in commit graph: 100% (131912/131912), done.
This could be somewhat confusing to users, who may wonder why there are
multiple occurrences of "Verifying commits in commit graph".
There are likely good arguments on whether or not there should be
one line of progress output per commit-graph layer. On the one hand, the
existing output shows us verifying each individual layer of the chain.
But on the other hand, the fact that a commit-graph may be stored among
multiple layers is an implementation detail that the caller need not be
aware of.
Clarify this by showing a single progress meter regardless of the number
of layers in the commit-graph chain. After this patch, the output
reflects the logical contents of a commit-graph chain, instead of
showing one line of output per commit-graph layer:
$ git.compile commit-graph verify
Verifying commits in commit graph: 100% (136268/136268), done.
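A rough sketch of the single-meter idea: compute one total up front, then let every layer advance the same counter. `struct layer`, `progress_total()` and `verify_layers()` are hypothetical stand-ins for illustration. Which of the two layers in the example output is the tip is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for struct commit_graph. */
struct layer {
	uint32_t num_commits;         /* commits stored in this layer */
	uint32_t num_commits_in_base; /* commits in all layers below it */
	struct layer *base;           /* next layer down, or NULL */
};

/* Total shown on the one progress meter. */
static uint64_t progress_total(const struct layer *tip, int shallow)
{
	uint64_t total = tip->num_commits;

	if (!shallow)
		total += tip->num_commits_in_base;
	return total;
}

/* Visit layers from tip to base, advancing one shared counter. */
static uint64_t verify_layers(const struct layer *tip, int shallow)
{
	uint64_t seen = 0;

	for (const struct layer *g = tip; g; g = g->base) {
		seen += g->num_commits; /* stands in for the per-commit checks */
		if (shallow)
			break;
	}
	/* The meter reaches 100% exactly when the walk finishes. */
	assert(seen == progress_total(tip, shallow));
	return seen;
}
```

With layers of 4356 and 131912 commits, the full walk reports 136268 commits on one line, and a shallow walk reports only the tip layer's 4356.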
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-08 08:31:45 +08:00
|
|
|
struct progress *progress = NULL;
|
2023-07-08 08:31:36 +08:00
|
|
|
int local_error = 0;
|
2023-07-08 08:31:45 +08:00
|
|
|
uint64_t seen = 0;
|
2023-07-08 08:31:36 +08:00
|
|
|
|
|
|
|
if (!g) {
|
|
|
|
graph_report("no commit-graph file loaded");
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2023-07-08 08:31:45 +08:00
|
|
|
if (flags & COMMIT_GRAPH_WRITE_PROGRESS) {
|
|
|
|
uint64_t total = g->num_commits;
|
|
|
|
if (!(flags & COMMIT_GRAPH_VERIFY_SHALLOW))
|
|
|
|
total += g->num_commits_in_base;
|
|
|
|
|
|
|
|
progress = start_progress(_("Verifying commits in commit graph"),
|
|
|
|
total);
|
|
|
|
}
|
2023-07-08 08:31:42 +08:00
|
|
|
|
2023-07-08 08:31:45 +08:00
|
|
|
for (; g; g = g->base_graph) {
|
|
|
|
local_error |= verify_one_commit_graph(r, g, progress, &seen);
|
2023-07-08 08:31:39 +08:00
|
|
|
if (flags & COMMIT_GRAPH_VERIFY_SHALLOW)
|
|
|
|
break;
|
|
|
|
}
|
2019-06-19 02:14:32 +08:00
|
|
|
|
2023-07-08 08:31:45 +08:00
|
|
|
stop_progress(&progress);
|
2019-06-19 02:14:32 +08:00
|
|
|
|
|
|
|
return local_error;
|
2018-06-27 21:24:32 +08:00
|
|
|
}
|
2018-07-12 06:42:40 +08:00
|
|
|
|
|
|
|
void free_commit_graph(struct commit_graph *g)
|
|
|
|
{
|
|
|
|
if (!g)
|
|
|
|
return;
|
2020-04-24 05:41:13 +08:00
|
|
|
if (g->data) {
|
2018-07-12 06:42:40 +08:00
|
|
|
munmap((void *)g->data, g->data_len);
|
|
|
|
g->data = NULL;
|
|
|
|
}
|
2019-06-19 02:14:27 +08:00
|
|
|
free(g->filename);
|
2020-04-07 00:59:49 +08:00
|
|
|
free(g->bloom_filter_settings);
|
2018-07-12 06:42:40 +08:00
|
|
|
free(g);
|
|
|
|
}
|
upload-pack: disable commit graph more gently for shallow traversal
When the client has asked for certain shallow options like
"deepen-since", we do a custom rev-list walk that pretends to be
shallow. Before doing so, we have to disable the commit-graph, since it
is not compatible with the shallow view of the repository. That's
handled by 829a321569 (commit-graph: close_commit_graph before shallow
walk, 2018-08-20). That commit literally closes and frees our
repo->objects->commit_graph struct.
That creates an interesting problem for commits that have _already_ been
parsed using the commit graph. Their commit->object.parsed flag is set,
their commit->graph_pos is set, but their commit->maybe_tree may still
be NULL. When somebody later calls repo_get_commit_tree(), we see that
we haven't loaded the tree oid yet and try to get it from the commit
graph. But since it has been freed, we segfault!
So the root of the issue is a data dependency between the commit's
lazy-load of the tree oid and the fact that the commit graph can go
away mid-process. How can we resolve it?
There are a couple of general approaches:
1. The obvious answer is to avoid loading the tree from the graph when
   we see that it's NULL. But then what do we return for the tree oid?
   If we return NULL, our caller in do_traverse() will rightly
   complain that we have no tree. We'd have to fall back to loading the
   actual commit object and re-parsing it. That requires teaching
   parse_commit_buffer() to understand re-parsing (i.e., not starting
   from a clean slate and not leaking any allocated bits like parent
   list pointers).
2. When we close the commit graph, walk through the set of in-memory
   objects and clear any graph_pos pointers. But this means we also
   have to "unparse" any such commits so that we know they still need
   to open the commit object to fill in their trees. So it's no less
   complicated than (1), and is more expensive (since we clear objects
   we might not later need).
3. Stop freeing the commit-graph struct. Continue to let it be used
   for lazy-loads of tree oids, but let upload-pack specify that it
   shouldn't be used for further commit parsing.
4. Push the whole shallow rev-list out to its own sub-process, with
   the commit-graph disabled from the start, giving it a clean memory
   space to work from.
I've chosen (3) here. Options (1) and (2) would work, but are
non-trivial to implement. Option (4) is more expensive, and I'm not sure
how complicated it is (shelling out for the actual rev-list part is
easy, but we do then parse the resulting commits internally, and I'm not
clear which parts need to be handling shallow-ness).
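Option (3) can be pictured with the following sketch. All types and helpers here (`struct sketch_graph`, `struct sketch_repo`, `struct sketch_commit`, `parse_from_graph()`, `get_tree()`, `disable_graph()`) are hypothetical simplifications, not git's real structures. The point is only that the graph stays allocated for lazy tree loads while a flag blocks new parses.

```c
#include <assert.h>
#include <stddef.h>

#define NOT_IN_GRAPH ((size_t)-1)

/* Hypothetical simplified types; tree oids are modelled as plain ints. */
struct sketch_graph { const int *tree_ids; size_t nr; };
struct sketch_repo { struct sketch_graph *graph; int graph_disabled; };
struct sketch_commit { int parsed; size_t graph_pos; int tree_id; int has_tree; };

/* New parses may use the graph only while it is enabled. */
static int parse_from_graph(struct sketch_repo *r, struct sketch_commit *c,
			    size_t pos)
{
	if (r->graph_disabled || !r->graph || pos >= r->graph->nr)
		return 0;
	c->parsed = 1;
	c->graph_pos = pos;
	return 1;
}

/* Lazy tree load: still valid after disabling, since nothing was freed. */
static int get_tree(struct sketch_repo *r, struct sketch_commit *c)
{
	if (!c->has_tree) {
		assert(r->graph && c->graph_pos != NOT_IN_GRAPH);
		c->tree_id = r->graph->tree_ids[c->graph_pos];
		c->has_tree = 1;
	}
	return c->tree_id;
}

/* Mark the graph as off-limits for further parsing without freeing it. */
static void disable_graph(struct sketch_repo *r)
{
	r->graph_disabled = 1;
}
```

A commit parsed before the flag is set can still resolve its tree afterwards, while a commit seen later is simply not parsed from the graph and has to be read from the object database instead.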
The new test in t5500 triggers this segfault, but see the comments there
for how horribly intimate it has to be with how both upload-pack and
commit graphs work.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-12 22:44:45 +08:00
|
|
|
|
|
|
|
void disable_commit_graph(struct repository *r)
|
|
|
|
{
|
|
|
|
r->commit_graph_disabled = 1;
|
|
|
|
}
|