git/t/t5502-quickfetch.sh

143 lines
2.5 KiB
Bash
Raw Normal View History

Make sure quickfetch is not fooled with a previous, incomplete fetch. This updates git-rev-list --objects to be a bit more careful when listing a blob object to make sure the blob actually exists, and uses it to make sure the quick-fetch optimization we introduced earlier is not fooled by a previous incomplete fetch. The quick-fetch optimization works by running this command: git rev-list --objects <<commit-list>> --not --all where <<commit-list>> is a list of commits that we are going to fetch from the other side. If there is any object missing to complete the <<commit-list>>, the rev-list would fail and die (say, the commit was in our repository, but its tree wasn't -- then it will barf while trying to list the blobs the tree contains because it cannot read that tree). Usually we do not have the objects (otherwise why would we fetching?), but in one important special case we do: when the remote repository is used as an alternate object store (i.e. pointed by .git/objects/info/alternates). We could check .git/objects/info/alternates to see if the remote we are interacting with is one of them (or is used as an alternate, recursively, by one of them), but that check is more cumbersome than it is worth. The above check however did not catch missing blob, because object listing code did not read nor check blob objects, knowing that blobs do not contain any further references to other objects. This commit fixes it with practically unmeasurable overhead. I've benched this with git rev-list --objects --all >/dev/null in the kernel repository, with three different implementations of the "check-blob". - Checking with has_sha1_file() has negligible (unmeasurable) performance penalty. - Checking with sha1_object_info() makes it somewhat slower, perhaps by 5%. - Checking with read_sha1_file() to cause a fully re-validation is prohibitively expensive (about 4 times as much runtime). In my original patch, I had this as a command line option, but the overhead is small enough that it is not really worth it. Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-16 15:42:29 +08:00
#!/bin/sh
test_description='test quickfetch from local'
. ./test-lib.sh
test_expect_success setup '
test_tick &&
echo ichi >file &&
git add file &&
git commit -m initial &&
cnt=$( (
git count-objects | sed -e "s/ *objects,.*//"
) ) &&
test $cnt -eq 3
'
test_expect_success 'clone without alternate' '
(
mkdir cloned &&
cd cloned &&
git init-db &&
git remote add -f origin ..
) &&
cnt=$( (
cd cloned &&
git count-objects | sed -e "s/ *objects,.*//"
) ) &&
test $cnt -eq 3
'
test_expect_success 'further commits in the original' '
test_tick &&
echo ni >file &&
git commit -a -m second &&
cnt=$( (
git count-objects | sed -e "s/ *objects,.*//"
) ) &&
test $cnt -eq 6
'
test_expect_success 'copy commit and tree but not blob by hand' '
git rev-list --objects HEAD |
git pack-objects --stdout |
(
cd cloned &&
git unpack-objects
) &&
cnt=$( (
cd cloned &&
git count-objects | sed -e "s/ *objects,.*//"
) ) &&
test $cnt -eq 6 &&
Make sure quickfetch is not fooled with a previous, incomplete fetch. This updates git-rev-list --objects to be a bit more careful when listing a blob object to make sure the blob actually exists, and uses it to make sure the quick-fetch optimization we introduced earlier is not fooled by a previous incomplete fetch. The quick-fetch optimization works by running this command: git rev-list --objects <<commit-list>> --not --all where <<commit-list>> is a list of commits that we are going to fetch from the other side. If there is any object missing to complete the <<commit-list>>, the rev-list would fail and die (say, the commit was in our repository, but its tree wasn't -- then it will barf while trying to list the blobs the tree contains because it cannot read that tree). Usually we do not have the objects (otherwise why would we fetching?), but in one important special case we do: when the remote repository is used as an alternate object store (i.e. pointed by .git/objects/info/alternates). We could check .git/objects/info/alternates to see if the remote we are interacting with is one of them (or is used as an alternate, recursively, by one of them), but that check is more cumbersome than it is worth. The above check however did not catch missing blob, because object listing code did not read nor check blob objects, knowing that blobs do not contain any further references to other objects. This commit fixes it with practically unmeasurable overhead. I've benched this with git rev-list --objects --all >/dev/null in the kernel repository, with three different implementations of the "check-blob". - Checking with has_sha1_file() has negligible (unmeasurable) performance penalty. - Checking with sha1_object_info() makes it somewhat slower, perhaps by 5%. - Checking with read_sha1_file() to cause a fully re-validation is prohibitively expensive (about 4 times as much runtime). In my original patch, I had this as a command line option, but the overhead is small enough that it is not really worth it. Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-16 15:42:29 +08:00
blob=$(git rev-parse HEAD:file | sed -e "s|..|&/|") &&
test -f "cloned/.git/objects/$blob" &&
rm -f "cloned/.git/objects/$blob" &&
cnt=$( (
cd cloned &&
git count-objects | sed -e "s/ *objects,.*//"
) ) &&
test $cnt -eq 5
'
test_expect_success 'quickfetch should not leave a corrupted repository' '
(
cd cloned &&
git fetch
) &&
cnt=$( (
cd cloned &&
git count-objects | sed -e "s/ *objects,.*//"
) ) &&
test $cnt -eq 6
'
git-fetch: avoid local fetching from alternate (again) Back in e3c6f240fd9c5bdeb33f2d47adc859f37935e2df Junio taught git-fetch to avoid copying objects when we are fetching from a repository that is already registered as an alternate object database. In such a case there is no reason to copy any objects as we can already obtain them through the alternate. However we need to ensure the objects are all reachable, so we run `git rev-list --objects $theirs --not --all` to verify this. If any object is missing or unreadable then we need to fetch/copy the objects from the remote. When a missing object is detected the git-rev-list process will exit with a non-zero exit status, making this condition quite easy to detect. Although git-fetch is currently a builtin (and so is rev-list) we cannot invoke the traverse_objects() API at this point in the transport code. The object walker within traverse_objects() calls die() as soon as it finds an object it cannot read. If that happens we want to resume the fetch process by calling do_fetch_pack(). To get around this we spawn git-rev-list into a background process to prevent a die() from killing the foreground fetch process, thus allowing the fetch process to resume into do_fetch_pack() if copying is necessary. We aren't interested in the output of rev-list (a list of SHA-1 object names that are reachable) or its errors (a "spurious" error about an object not being found as we need to copy it) so we redirect both stdout and stderr to /dev/null. We run this git-rev-list based check before any fetch as we may already have the necessary objects local from a prior fetch. If we don't then its very likely the first $theirs object listed on the command line won't exist locally and git-rev-list will die very quickly, allowing us to start the network transfer. This test even on remote URLs may save bandwidth if someone runs `git pull origin`, sees a merge conflict, resets out, then redoes the same pull just a short time later. If the remote hasn't changed between the two pulls and the local repository hasn't had git-gc run in it then there is probably no need to perform network transfer as all of the objects are local. Documentation for the new quickfetch function was suggested and written by Junio, based on his original comment in git-fetch.sh. Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
2007-11-11 15:29:47 +08:00
test_expect_success 'quickfetch should not copy from alternate' '
(
mkdir quickclone &&
cd quickclone &&
git init-db &&
(cd ../.git/objects && pwd) >.git/objects/info/alternates &&
git remote add origin .. &&
git fetch -k -k
) &&
obj_cnt=$( (
cd quickclone &&
git count-objects | sed -e "s/ *objects,.*//"
) ) &&
pck_cnt=$( (
cd quickclone &&
git count-objects -v | sed -n -e "/packs:/{
s/packs://
p
q
}"
) ) &&
origin_master=$( (
cd quickclone &&
git rev-parse origin/master
) ) &&
echo "loose objects: $obj_cnt, packfiles: $pck_cnt" &&
test $obj_cnt -eq 0 &&
test $pck_cnt -eq 0 &&
test z$origin_master = z$(git rev-parse master)
'
test_expect_success 'quickfetch should handle ~1000 refs (on Windows)' '
git gc &&
head=$(git rev-parse HEAD) &&
branchprefix="$head refs/heads/branch" &&
for i in 0 1 2 3 4 5 6 7 8 9; do
for j in 0 1 2 3 4 5 6 7 8 9; do
for k in 0 1 2 3 4 5 6 7 8 9; do
echo "$branchprefix$i$j$k" >> .git/packed-refs
done
done
done &&
(
cd cloned &&
git fetch &&
git fetch
)
'
Make sure quickfetch is not fooled with a previous, incomplete fetch. This updates git-rev-list --objects to be a bit more careful when listing a blob object to make sure the blob actually exists, and uses it to make sure the quick-fetch optimization we introduced earlier is not fooled by a previous incomplete fetch. The quick-fetch optimization works by running this command: git rev-list --objects <<commit-list>> --not --all where <<commit-list>> is a list of commits that we are going to fetch from the other side. If there is any object missing to complete the <<commit-list>>, the rev-list would fail and die (say, the commit was in our repository, but its tree wasn't -- then it will barf while trying to list the blobs the tree contains because it cannot read that tree). Usually we do not have the objects (otherwise why would we fetching?), but in one important special case we do: when the remote repository is used as an alternate object store (i.e. pointed by .git/objects/info/alternates). We could check .git/objects/info/alternates to see if the remote we are interacting with is one of them (or is used as an alternate, recursively, by one of them), but that check is more cumbersome than it is worth. The above check however did not catch missing blob, because object listing code did not read nor check blob objects, knowing that blobs do not contain any further references to other objects. This commit fixes it with practically unmeasurable overhead. I've benched this with git rev-list --objects --all >/dev/null in the kernel repository, with three different implementations of the "check-blob". - Checking with has_sha1_file() has negligible (unmeasurable) performance penalty. - Checking with sha1_object_info() makes it somewhat slower, perhaps by 5%. - Checking with read_sha1_file() to cause a fully re-validation is prohibitively expensive (about 4 times as much runtime). In my original patch, I had this as a command line option, but the overhead is small enough that it is not really worth it. Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-16 15:42:29 +08:00
test_done