git/connected.c

107 lines
3.0 KiB
C
Raw Normal View History

#include "cache.h"
#include "run-command.h"
#include "sigchain.h"
#include "connected.h"
clone: open a shortcut for connectivity check In order to make sure the cloned repository is good, we run "rev-list --objects --not --all $new_refs" on the repository. This is expensive on large repositories. This patch attempts to mitigate the impact in this special case. In the "good" clone case, we only have one pack. If all of the following are met, we can be sure that all objects reachable from the new refs exist, which is the intention of running "rev-list ...": - all refs point to an object in the pack - there are no dangling pointers in any object in the pack - no objects in the pack point to objects outside the pack The second and third checks can be done with the help of index-pack as a slight variation of --strict check (which introduces a new condition for the shortcut: pack transfer must be used and the number of objects large enough to call index-pack). The first is checked in check_everything_connected after we get an "ok" from index-pack. "index-pack + new checks" is still faster than the current "index-pack + rev-list", which is the whole point of this patch. If any of the conditions fail, we fall back to the good old but expensive "rev-list ..". In that case it's even more expensive because we have to pay for the new checks in index-pack. But that should only happen when the other side is either buggy or malicious. Cloning linux-2.6 over file:// before after real 3m25.693s 2m53.050s user 5m2.037s 4m42.396s sys 0m13.750s 0m16.574s A more realistic test with ssh:// over wireless before after real 11m26.629s 10m4.213s user 5m43.196s 5m19.444s sys 0m35.812s 0m37.630s This shortcut is not applied to shallow clones, partly because shallow clones should have no more objects than a usual fetch and the cost of rev-list is acceptable, partly to avoid dealing with corner cases when grafting is involved. This shortcut does not apply to unpack-objects code path either because the number of objects must be small in order to trigger that code path. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-26 09:16:17 +08:00
#include "transport.h"
/*
* If we feed all the commits we want to verify to this command
*
* $ git rev-list --objects --stdin --not --all
*
* and if it does not error out, that means everything reachable from
* these commits locally exists and is connected to our existing refs.
* Note that this does _not_ validate the individual objects.
*
* Returns 0 if everything is connected, non-zero otherwise.
*/
check_everything_connected: use a struct with named options The number of variants of check_everything_connected has grown over the years, so that the "real" function takes several possibly-zero, possibly-NULL arguments. We hid the complexity behind some wrapper functions, but this doesn't scale well when we want to add new options. If we add more wrapper variants to handle the new options, then we can get a combinatorial explosion when those options might be used together (right now nobody wants to use both "shallow" and "transport" together, so we get by with just a few wrappers). If instead we add new parameters to each function, each of which can have a default value, then callers who want the defaults end up with confusing invocations like: check_everything_connected(fn, 0, data, -1, 0, NULL); where it is unclear which parameter is which (and every caller needs updated when we add new options). Instead, let's add a struct to hold all of the optional parameters. This is a little more verbose for the callers (who have to declare the struct and fill it in), but it makes their code much easier to follow, because every option is named as it is set (and unused options do not have to be mentioned at all). Note that we could also stick the iteration function and its callback data into the option struct, too. But since those are required for each call, by avoiding doing so, we can let very simple callers just pass "NULL" for the options and not worry about the struct at all. While we're touching each site, let's also rename the function to check_connected(). The existing name was quite long, and not all of the wrappers even used the full name. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-15 18:30:40 +08:00
int check_connected(sha1_iterate_fn fn, void *cb_data,
struct check_connected_options *opt)
{
struct child_process rev_list = CHILD_PROCESS_INIT;
check_everything_connected: use a struct with named options The number of variants of check_everything_connected has grown over the years, so that the "real" function takes several possibly-zero, possibly-NULL arguments. We hid the complexity behind some wrapper functions, but this doesn't scale well when we want to add new options. If we add more wrapper variants to handle the new options, then we can get a combinatorial explosion when those options might be used together (right now nobody wants to use both "shallow" and "transport" together, so we get by with just a few wrappers). If instead we add new parameters to each function, each of which can have a default value, then callers who want the defaults end up with confusing invocations like: check_everything_connected(fn, 0, data, -1, 0, NULL); where it is unclear which parameter is which (and every caller needs updated when we add new options). Instead, let's add a struct to hold all of the optional parameters. This is a little more verbose for the callers (who have to declare the struct and fill it in), but it makes their code much easier to follow, because every option is named as it is set (and unused options do not have to be mentioned at all). Note that we could also stick the iteration function and its callback data into the option struct, too. But since those are required for each call, by avoiding doing so, we can let very simple callers just pass "NULL" for the options and not worry about the struct at all. While we're touching each site, let's also rename the function to check_connected(). The existing name was quite long, and not all of the wrappers even used the full name. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-15 18:30:40 +08:00
struct check_connected_options defaults = CHECK_CONNECTED_INIT;
char commit[41];
unsigned char sha1[20];
int err = 0;
clone: open a shortcut for connectivity check In order to make sure the cloned repository is good, we run "rev-list --objects --not --all $new_refs" on the repository. This is expensive on large repositories. This patch attempts to mitigate the impact in this special case. In the "good" clone case, we only have one pack. If all of the following are met, we can be sure that all objects reachable from the new refs exist, which is the intention of running "rev-list ...": - all refs point to an object in the pack - there are no dangling pointers in any object in the pack - no objects in the pack point to objects outside the pack The second and third checks can be done with the help of index-pack as a slight variation of --strict check (which introduces a new condition for the shortcut: pack transfer must be used and the number of objects large enough to call index-pack). The first is checked in check_everything_connected after we get an "ok" from index-pack. "index-pack + new checks" is still faster than the current "index-pack + rev-list", which is the whole point of this patch. If any of the conditions fail, we fall back to the good old but expensive "rev-list ..". In that case it's even more expensive because we have to pay for the new checks in index-pack. But that should only happen when the other side is either buggy or malicious. Cloning linux-2.6 over file:// before after real 3m25.693s 2m53.050s user 5m2.037s 4m42.396s sys 0m13.750s 0m16.574s A more realistic test with ssh:// over wireless before after real 11m26.629s 10m4.213s user 5m43.196s 5m19.444s sys 0m35.812s 0m37.630s This shortcut is not applied to shallow clones, partly because shallow clones should have no more objects than a usual fetch and the cost of rev-list is acceptable, partly to avoid dealing with corner cases when grafting is involved. This shortcut does not apply to unpack-objects code path either because the number of objects must be small in order to trigger that code path. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-26 09:16:17 +08:00
struct packed_git *new_pack = NULL;
check_everything_connected: use a struct with named options The number of variants of check_everything_connected has grown over the years, so that the "real" function takes several possibly-zero, possibly-NULL arguments. We hid the complexity behind some wrapper functions, but this doesn't scale well when we want to add new options. If we add more wrapper variants to handle the new options, then we can get a combinatorial explosion when those options might be used together (right now nobody wants to use both "shallow" and "transport" together, so we get by with just a few wrappers). If instead we add new parameters to each function, each of which can have a default value, then callers who want the defaults end up with confusing invocations like: check_everything_connected(fn, 0, data, -1, 0, NULL); where it is unclear which parameter is which (and every caller needs updated when we add new options). Instead, let's add a struct to hold all of the optional parameters. This is a little more verbose for the callers (who have to declare the struct and fill it in), but it makes their code much easier to follow, because every option is named as it is set (and unused options do not have to be mentioned at all). Note that we could also stick the iteration function and its callback data into the option struct, too. But since those are required for each call, by avoiding doing so, we can let very simple callers just pass "NULL" for the options and not worry about the struct at all. While we're touching each site, let's also rename the function to check_connected(). The existing name was quite long, and not all of the wrappers even used the full name. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-15 18:30:40 +08:00
struct transport *transport;
size_t base_len;
check_everything_connected: use a struct with named options The number of variants of check_everything_connected has grown over the years, so that the "real" function takes several possibly-zero, possibly-NULL arguments. We hid the complexity behind some wrapper functions, but this doesn't scale well when we want to add new options. If we add more wrapper variants to handle the new options, then we can get a combinatorial explosion when those options might be used together (right now nobody wants to use both "shallow" and "transport" together, so we get by with just a few wrappers). If instead we add new parameters to each function, each of which can have a default value, then callers who want the defaults end up with confusing invocations like: check_everything_connected(fn, 0, data, -1, 0, NULL); where it is unclear which parameter is which (and every caller needs updated when we add new options). Instead, let's add a struct to hold all of the optional parameters. This is a little more verbose for the callers (who have to declare the struct and fill it in), but it makes their code much easier to follow, because every option is named as it is set (and unused options do not have to be mentioned at all). Note that we could also stick the iteration function and its callback data into the option struct, too. But since those are required for each call, by avoiding doing so, we can let very simple callers just pass "NULL" for the options and not worry about the struct at all. While we're touching each site, let's also rename the function to check_connected(). The existing name was quite long, and not all of the wrappers even used the full name. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-15 18:30:40 +08:00
if (!opt)
opt = &defaults;
transport = opt->transport;
if (fn(cb_data, sha1)) {
if (opt->err_fd)
close(opt->err_fd);
return err;
}
clone: open a shortcut for connectivity check In order to make sure the cloned repository is good, we run "rev-list --objects --not --all $new_refs" on the repository. This is expensive on large repositories. This patch attempts to mitigate the impact in this special case. In the "good" clone case, we only have one pack. If all of the following are met, we can be sure that all objects reachable from the new refs exist, which is the intention of running "rev-list ...": - all refs point to an object in the pack - there are no dangling pointers in any object in the pack - no objects in the pack point to objects outside the pack The second and third checks can be done with the help of index-pack as a slight variation of --strict check (which introduces a new condition for the shortcut: pack transfer must be used and the number of objects large enough to call index-pack). The first is checked in check_everything_connected after we get an "ok" from index-pack. "index-pack + new checks" is still faster than the current "index-pack + rev-list", which is the whole point of this patch. If any of the conditions fail, we fall back to the good old but expensive "rev-list ..". In that case it's even more expensive because we have to pay for the new checks in index-pack. But that should only happen when the other side is either buggy or malicious. Cloning linux-2.6 over file:// before after real 3m25.693s 2m53.050s user 5m2.037s 4m42.396s sys 0m13.750s 0m16.574s A more realistic test with ssh:// over wireless before after real 11m26.629s 10m4.213s user 5m43.196s 5m19.444s sys 0m35.812s 0m37.630s This shortcut is not applied to shallow clones, partly because shallow clones should have no more objects than a usual fetch and the cost of rev-list is acceptable, partly to avoid dealing with corner cases when grafting is involved. This shortcut does not apply to unpack-objects code path either because the number of objects must be small in order to trigger that code path. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-26 09:16:17 +08:00
if (transport && transport->smart_options &&
transport->smart_options->self_contained_and_connected &&
transport->pack_lockfile &&
strip_suffix(transport->pack_lockfile, ".keep", &base_len)) {
clone: open a shortcut for connectivity check In order to make sure the cloned repository is good, we run "rev-list --objects --not --all $new_refs" on the repository. This is expensive on large repositories. This patch attempts to mitigate the impact in this special case. In the "good" clone case, we only have one pack. If all of the following are met, we can be sure that all objects reachable from the new refs exist, which is the intention of running "rev-list ...": - all refs point to an object in the pack - there are no dangling pointers in any object in the pack - no objects in the pack point to objects outside the pack The second and third checks can be done with the help of index-pack as a slight variation of --strict check (which introduces a new condition for the shortcut: pack transfer must be used and the number of objects large enough to call index-pack). The first is checked in check_everything_connected after we get an "ok" from index-pack. "index-pack + new checks" is still faster than the current "index-pack + rev-list", which is the whole point of this patch. If any of the conditions fail, we fall back to the good old but expensive "rev-list ..". In that case it's even more expensive because we have to pay for the new checks in index-pack. But that should only happen when the other side is either buggy or malicious. Cloning linux-2.6 over file:// before after real 3m25.693s 2m53.050s user 5m2.037s 4m42.396s sys 0m13.750s 0m16.574s A more realistic test with ssh:// over wireless before after real 11m26.629s 10m4.213s user 5m43.196s 5m19.444s sys 0m35.812s 0m37.630s This shortcut is not applied to shallow clones, partly because shallow clones should have no more objects than a usual fetch and the cost of rev-list is acceptable, partly to avoid dealing with corner cases when grafting is involved. This shortcut does not apply to unpack-objects code path either because the number of objects must be small in order to trigger that code path. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-26 09:16:17 +08:00
struct strbuf idx_file = STRBUF_INIT;
strbuf_add(&idx_file, transport->pack_lockfile, base_len);
clone: open a shortcut for connectivity check In order to make sure the cloned repository is good, we run "rev-list --objects --not --all $new_refs" on the repository. This is expensive on large repositories. This patch attempts to mitigate the impact in this special case. In the "good" clone case, we only have one pack. If all of the following are met, we can be sure that all objects reachable from the new refs exist, which is the intention of running "rev-list ...": - all refs point to an object in the pack - there are no dangling pointers in any object in the pack - no objects in the pack point to objects outside the pack The second and third checks can be done with the help of index-pack as a slight variation of --strict check (which introduces a new condition for the shortcut: pack transfer must be used and the number of objects large enough to call index-pack). The first is checked in check_everything_connected after we get an "ok" from index-pack. "index-pack + new checks" is still faster than the current "index-pack + rev-list", which is the whole point of this patch. If any of the conditions fail, we fall back to the good old but expensive "rev-list ..". In that case it's even more expensive because we have to pay for the new checks in index-pack. But that should only happen when the other side is either buggy or malicious. Cloning linux-2.6 over file:// before after real 3m25.693s 2m53.050s user 5m2.037s 4m42.396s sys 0m13.750s 0m16.574s A more realistic test with ssh:// over wireless before after real 11m26.629s 10m4.213s user 5m43.196s 5m19.444s sys 0m35.812s 0m37.630s This shortcut is not applied to shallow clones, partly because shallow clones should have no more objects than a usual fetch and the cost of rev-list is acceptable, partly to avoid dealing with corner cases when grafting is involved. This shortcut does not apply to unpack-objects code path either because the number of objects must be small in order to trigger that code path. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-26 09:16:17 +08:00
strbuf_addstr(&idx_file, ".idx");
new_pack = add_packed_git(idx_file.buf, idx_file.len, 1);
strbuf_release(&idx_file);
}
check_everything_connected: use a struct with named options The number of variants of check_everything_connected has grown over the years, so that the "real" function takes several possibly-zero, possibly-NULL arguments. We hid the complexity behind some wrapper functions, but this doesn't scale well when we want to add new options. If we add more wrapper variants to handle the new options, then we can get a combinatorial explosion when those options might be used together (right now nobody wants to use both "shallow" and "transport" together, so we get by with just a few wrappers). If instead we add new parameters to each function, each of which can have a default value, then callers who want the defaults end up with confusing invocations like: check_everything_connected(fn, 0, data, -1, 0, NULL); where it is unclear which parameter is which (and every caller needs updated when we add new options). Instead, let's add a struct to hold all of the optional parameters. This is a little more verbose for the callers (who have to declare the struct and fill it in), but it makes their code much easier to follow, because every option is named as it is set (and unused options do not have to be mentioned at all). Note that we could also stick the iteration function and its callback data into the option struct, too. But since those are required for each call, by avoiding doing so, we can let very simple callers just pass "NULL" for the options and not worry about the struct at all. While we're touching each site, let's also rename the function to check_connected(). The existing name was quite long, and not all of the wrappers even used the full name. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-15 18:30:40 +08:00
if (opt->shallow_file) {
argv_array_push(&rev_list.args, "--shallow-file");
check_everything_connected: use a struct with named options The number of variants of check_everything_connected has grown over the years, so that the "real" function takes several possibly-zero, possibly-NULL arguments. We hid the complexity behind some wrapper functions, but this doesn't scale well when we want to add new options. If we add more wrapper variants to handle the new options, then we can get a combinatorial explosion when those options might be used together (right now nobody wants to use both "shallow" and "transport" together, so we get by with just a few wrappers). If instead we add new parameters to each function, each of which can have a default value, then callers who want the defaults end up with confusing invocations like: check_everything_connected(fn, 0, data, -1, 0, NULL); where it is unclear which parameter is which (and every caller needs updated when we add new options). Instead, let's add a struct to hold all of the optional parameters. This is a little more verbose for the callers (who have to declare the struct and fill it in), but it makes their code much easier to follow, because every option is named as it is set (and unused options do not have to be mentioned at all). Note that we could also stick the iteration function and its callback data into the option struct, too. But since those are required for each call, by avoiding doing so, we can let very simple callers just pass "NULL" for the options and not worry about the struct at all. While we're touching each site, let's also rename the function to check_connected(). The existing name was quite long, and not all of the wrappers even used the full name. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-15 18:30:40 +08:00
argv_array_push(&rev_list.args, opt->shallow_file);
}
argv_array_push(&rev_list.args,"rev-list");
argv_array_push(&rev_list.args, "--objects");
argv_array_push(&rev_list.args, "--stdin");
argv_array_push(&rev_list.args, "--not");
argv_array_push(&rev_list.args, "--all");
argv_array_push(&rev_list.args, "--quiet");
if (opt->progress)
argv_array_pushf(&rev_list.args, "--progress=%s",
_("Checking connectivity"));
rev_list.git_cmd = 1;
rev_list.env = opt->env;
rev_list.in = -1;
rev_list.no_stdout = 1;
if (opt->err_fd)
rev_list.err = opt->err_fd;
else
rev_list.no_stderr = opt->quiet;
if (start_command(&rev_list))
return error(_("Could not run 'git rev-list'"));
sigchain_push(SIGPIPE, SIG_IGN);
commit[40] = '\n';
do {
clone: open a shortcut for connectivity check In order to make sure the cloned repository is good, we run "rev-list --objects --not --all $new_refs" on the repository. This is expensive on large repositories. This patch attempts to mitigate the impact in this special case. In the "good" clone case, we only have one pack. If all of the following are met, we can be sure that all objects reachable from the new refs exist, which is the intention of running "rev-list ...": - all refs point to an object in the pack - there are no dangling pointers in any object in the pack - no objects in the pack point to objects outside the pack The second and third checks can be done with the help of index-pack as a slight variation of --strict check (which introduces a new condition for the shortcut: pack transfer must be used and the number of objects large enough to call index-pack). The first is checked in check_everything_connected after we get an "ok" from index-pack. "index-pack + new checks" is still faster than the current "index-pack + rev-list", which is the whole point of this patch. If any of the conditions fail, we fall back to the good old but expensive "rev-list ..". In that case it's even more expensive because we have to pay for the new checks in index-pack. But that should only happen when the other side is either buggy or malicious. Cloning linux-2.6 over file:// before after real 3m25.693s 2m53.050s user 5m2.037s 4m42.396s sys 0m13.750s 0m16.574s A more realistic test with ssh:// over wireless before after real 11m26.629s 10m4.213s user 5m43.196s 5m19.444s sys 0m35.812s 0m37.630s This shortcut is not applied to shallow clones, partly because shallow clones should have no more objects than a usual fetch and the cost of rev-list is acceptable, partly to avoid dealing with corner cases when grafting is involved. This shortcut does not apply to unpack-objects code path either because the number of objects must be small in order to trigger that code path. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-05-26 09:16:17 +08:00
/*
* If index-pack already checked that:
* - there are no dangling pointers in the new pack
* - the pack is self contained
* Then if the updated ref is in the new pack, then we
* are sure the ref is good and not sending it to
* rev-list for verification.
*/
if (new_pack && find_pack_entry_one(sha1, new_pack))
continue;
memcpy(commit, sha1_to_hex(sha1), 40);
if (write_in_full(rev_list.in, commit, 41) < 0) {
if (errno != EPIPE && errno != EINVAL)
error_errno(_("failed write to rev-list"));
err = -1;
break;
}
} while (!fn(cb_data, sha1));
if (close(rev_list.in))
err = error_errno(_("failed to close rev-list's stdin"));
sigchain_pop(SIGPIPE);
return finish_command(&rev_list) || err;
}