mirrors/git

mirror of https://github.com/git/git.git synced 2024-11-27 20:14:30 +08:00

Author	SHA1	Message	Date
Ævar Arnfjörð Bjarmason	40286ca2fa	parse_object(): simplify blob conditional Commit `8db2dad7a0` (parse_object(): check on-disk type of suspected blob, 2022-11-17) simplified the conditional for checking if we might have a blob. But we can simplify it further. In: !obj \|\| (obj && obj->type == OBJ_BLOB) the short-circuit "OR" means "obj" will always be true on the right-hand side. The compiler almost certainly optimized that out anyway, but dropping it makes the conditional easier to understand for humans. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-11-22 10:13:54 +09:00
Jeff King	8db2dad7a0	parse_object(): check on-disk type of suspected blob In parse_object(), we try to handle blobs by streaming rather than loading them entirely into memory. The most common case here will be that we haven't seen the object yet and check oid_object_info(), which tells us we have a blob. But we trigger this code on one other case: when we have an in-memory object struct with type OBJ_BLOB (and without its "parsed" flag set, since otherwise we'd return early from the function). This indicates that some other part of the code suspected we have a blob (e.g., it was mentioned by a tree or tag) but we haven't yet looked at the on-disk copy. In this case before hitting the streaming path, we check if we have the object on-disk at all. This is mostly pointless extra work, as the streaming path would complain if it couldn't open the object (albeit with the message "hash mismatch", which is a little misleading). But it's also insufficient to catch all problems. The streaming code will only tell us "yes, the on-disk object matches the oid". But it doesn't actually confirm that what we found was indeed a blob, and neither does repo_has_object_file(). One way to improve this would be to teach stream_object_signature() to check the type (either by returning it to us to check, or taking an "expected" type). But there's an even simpler fix here: if we suspect the object is a blob, just call oid_object_info() to confirm that we have it on-disk, and that it really is a blob. This is slightly less efficient than teaching stream_object_signature() to do it (since it has to open the object already). But this case very rarely comes up. In practice, we usually don't have any clue what the type is, in which case we already call oid_object_info(). This "suspected" case happens only when some other code created an object struct but didn't actually parse the blob, which is actually tricky to trigger at all (see the discussion of the test below). I reworked the conditional a bit so that instead of: if ((suspected_blob && oid_object_info() == OBJ_BLOB) (no_clue && oid_object_info() == OBJ_BLOB) we have the simpler: if ((suspected_blob \|\| no_clue) && oid_object_info() == OBJ_BLOB) This is shorter, but also reflects what we really want say, which is "have we ruled out this being a blob; if not, check it on-disk". In either case, if oid_object_info() fails to tell us it's a blob, we'll skip the streaming code path and call repo_read_object_file(), just as before. And if we really do have a mismatch with the existing object struct, we'll eventually call lookup_commit(), etc, via parse_object_buffer(), which will complain that it doesn't match our existing obj->type. So this fixes one of the lingering expect_failure cases from `0616617c7e` (t: introduce tests for unexpected object types, 2019-04-09). That test works by peeling a tag that claims to point to a blob (triggering us to create the struct), but really points to something else, which we later discover when we call parse_object() as part of the actual traversal). Prior to this commit, we'd quietly check the sha1 and mark the blob as "parsed". Now we correctly complain about the mismatch. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com>	2022-11-18 13:59:31 -05:00
Jeff King	04fb96219a	parse_object(): drop extra "has" check before checking object type When parsing an object of unknown type, we check to see if it's a blob, so we can use our streaming code path. This uses oid_object_info() to check the type, but before doing so we call repo_has_object_file(). This latter is pointless, as oid_object_info() will already fail if the object is missing. Checking it ahead of time just complicates the code and is a waste of resources (albeit small). Let's drop the redundant check. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com>	2022-11-18 13:59:31 -05:00
Junio C Hamano	fdbfac60fd	Merge branch 'jk/fsck-on-diet' "git fsck" failed to release contents of tree objects already used from the memory, which has been fixed. * jk/fsck-on-diet: parse_object_buffer(): respect save_commit_buffer fsck: turn off save_commit_buffer fsck: free tree buffers after walking unreachable objects	2022-10-10 10:08:39 -07:00
Jeff King	51b27747e5	parse_object_buffer(): respect save_commit_buffer If the global variable "save_commit_buffer" is set to 0, then parse_commit() will throw away the commit object data after parsing it, rather than sticking it into a commit slab. This goes all the way back to `60ab26de99` ([PATCH] Avoid wasting memory in git-rev-list, 2005-09-15). But there's another code path which may similarly stash the buffer: parse_object_buffer(). This is where we end up if we parse a commit via parse_object(), and it's used directly in a few other code paths like git-fsck. The original goal of `60ab26de99` was avoiding extra memory usage for rev-list. And there it's not all that important to catch parse_object(). We use that function only for looking at the tips of the traversal, and the majority of the commits are parsed by following parent links, where we use parse_commit() directly. So we were wasting some memory, but only a small portion. It's much easier to see the effect with fsck. Since we now turn off save_commit_buffer by default there, we _should_ be able to drop the freeing of the commit buffer in fsck_obj(). But if we do so (taking the first hunk of this patch without the rest), then the peak heap of "git fsck" in a clone of git.git goes from 136MB to 194MB. Teaching parse_object_buffer() to respect save_commit_buffer brings that down to 134.5MB (it's hard to tell from massif's output, but I suspect the savings comes from avoiding the overhead of the mostly-empty commit slab). Other programs should see a small improvement. Both "rev-list --all" and "fsck --connectivity-only" improve by a few hundred kilobytes, as they'd avoid loading the tip objects of their traversals. Most importantly, no code should be hurt by doing this. Any program that turns off save_commit_buffer is already making the assumption that any commit it sees may need to have its object data loaded on demand, as it doesn't know which ones were parsed by parse_commit() versus parse_object(). Not to mention that anything parsed by the commit graph may be in the same boat, even if save_commit_buffer was not disabled. This should be the only spot that needs to be fixed. Grepping for set_commit_buffer() shows that this and parse_commit() are the only relevant calls. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-09-22 11:40:47 -07:00
Jeff King	9a8c3c4a5f	parse_object(): check commit-graph when skip_hash set If the caller told us that they don't care about us checking the object hash, then we're free to implement any optimizations that get us the parsed value more quickly. An obvious one is to check the commit graph before loading an object from disk. And in fact, both of the callers who pass in this flag are already doing so before they call parse_object()! So we can simplify those callers, as well as any possible future ones, by moving the logic into parse_object(). There are two subtle things to note in the diff, but neither has any impact in practice: - it seems least-surprising here to do the graph lookup on the git-replace'd oid, rather than the original. This is in theory a change of behavior from the earlier code, as neither caller did a replace lookup itself. But in practice it doesn't matter, as we disable the commit graph entirely if there are any replace refs. - the caller in get_reference() passes the skip_hash flag only if revs->verify_objects isn't set, whereas it would look in the commit graph unconditionally. In practice this should not matter as we should disable the commit graph entirely when using verify_objects (and that was done recently in another patch). So this should be a pure cleanup with no behavior change. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-09-07 12:27:02 -07:00
Jeff King	c868d8e91f	parse_object(): allow skipping hash check The parse_object() function checks the object hash of any object it parses. This is a nice feature, as it means we may catch bit corruption during normal use, rather than waiting for specific fsck operations. But it also can be slow. It's particularly noticeable for blobs, where except for the hash check, we could return without loading the object contents at all. Now one may wonder what is the point of calling parse_object() on a blob in the first place then, but usually it's not intentional: we were fed an oid from somewhere, don't know the type, and want an object struct. For commits and trees, the parsing is usually helpful; we're about to look at the contents anyway. But this is less true for blobs, where we may be collecting them as part of a reachability traversal, etc, and don't actually care what's in them. And blobs, of course, tend to be larger. We don't want to just throw out the hash-checks for blobs, though. We do depend on them in some circumstances (e.g., rev-list --verify-objects uses parse_object() to check them). It's only the callers that know how they're going to use the result. And so we can help them by providing a special flag to skip the hash check. We could just apply this to blobs, as they're going to be the main source of performance improvement. But if a caller doesn't care about checking the hash, we might as well skip it for other object types, too. Even though we can't avoid reading the object contents, we can still skip the actual hash computation. If this seems like it is making Git a little bit less safe against corruption, it may be. But it's part of a series of tradeoffs we're already making. For instance, "rev-list --objects" does not open the contents of blobs it prints. And when a commit graph is present, we skip opening most commits entirely. The important thing will be to use this flag in cases where it's safe to skip the check. For instance, when serving a pack for a fetch, we know the client will fully index the objects and do a connectivity check itself. There's little to be gained from the server side re-hashing a blob itself. And indeed, most of the time we don't! The revision machinery won't open up a blob reached by traversal, but only one requested directly with a "want" line. So applied properly, this new feature shouldn't make anything less safe in practice. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-09-07 12:18:57 -07:00
Ævar Arnfjörð Bjarmason	44439c1c58	object-file API: have hash_object_file() take "enum object_type" Change the hash_object_file() function to take an "enum object_type". Since a preceding commit all of its callers are passing either "{commit,tree,blob,tag}_type", or the result of a call to type_name(), the parse_object() caller that would pass NULL is now using stream_object_signature(). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-02-25 17:16:32 -08:00
Ævar Arnfjörð Bjarmason	0f156dbb04	object-file API: split up and simplify check_object_signature() Split up the check_object_signature() function into that non-streaming version (it accepts an already filled "buf"), and a new stream_object_signature() which will retrieve the object from storage, and hash it on-the-fly. All of the callers of check_object_signature() were effectively calling two different functions, if we go by cyclomatic complexity. I.e. they'd either take the early "if (map)" branch and return early, or not. This has been the case since the "if (map)" condition was added in `090ea12671` (parse_object: avoid putting whole blob in core, 2012-03-07). We can then further simplify the resulting check_object_signature() function since only one caller wanted to pass a non-NULL "buf" and a non-NULL "real_oidp". That "read_loose_object()" codepath used by "git fsck" can instead use hash_object_file() followed by oideq(). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-02-25 17:16:31 -08:00
Junio C Hamano	0dc90d954d	Merge branch 'ns/tmp-objdir' New interface into the tmp-objdir API to help in-core use of the quarantine feature. * ns/tmp-objdir: tmp-objdir: disable ref updates when replacing the primary odb tmp-objdir: new API for creating temporary writable databases	2022-01-03 16:24:15 -08:00
Neeraj Singh	b3cecf49ea	tmp-objdir: new API for creating temporary writable databases The tmp_objdir API provides the ability to create temporary object directories, but was designed with the goal of having subprocesses access these object stores, followed by the main process migrating objects from it to the main object store or just deleting it. The subprocesses would view it as their primary datastore and write to it. Here we add the tmp_objdir_replace_primary_odb function that replaces the current process's writable "main" object directory with the specified one. The previous main object directory is restored in either tmp_objdir_migrate or tmp_objdir_destroy. For the --remerge-diff usecase, add a new `will_destroy` flag in `struct object_database` to mark ephemeral object databases that do not require fsync durability. Add 'git prune' support for removing temporary object databases, and make sure that they have a name starting with tmp_ and containing an operation-specific name. Based-on-patch-by: Elijah Newren <newren@gmail.com> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-12-08 14:06:36 -08:00
Ævar Arnfjörð Bjarmason	eafd6e7e55	object.c: use BUG(...) no die("BUG: ...") in lookup_object_by_type() Adjust code added in `7463064b28` (object.h: add lookup_object_by_type() function, 2021-06-22) to use the BUG() function. Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-12-07 12:33:58 -08:00
Junio C Hamano	061a21d36d	Merge branch 'ab/fsck-unexpected-type' "git fsck" has been taught to report mismatch between expected and actual types of an object better. * ab/fsck-unexpected-type: fsck: report invalid object type-path combinations fsck: don't hard die on invalid object types object-file.c: stop dying in parse_loose_header() object-file.c: return ULHR_TOO_LONG on "header too long" object-file.c: use "enum" return type for unpack_loose_header() object-file.c: simplify unpack_loose_short_header() object-file.c: make parse_loose_header_extended() public object-file.c: return -1, not "status" from unpack_loose_header() object-file.c: don't set "typep" when returning non-zero cat-file tests: test for current --allow-unknown-type behavior cat-file tests: add corrupt loose object test cat-file tests: test for missing/bogus object with -t, -s and -p cat-file tests: move bogus_* variable declarations earlier fsck tests: test for garbage appended to a loose object fsck tests: test current hash/type mismatch behavior fsck tests: refactor one test to use a sub-repo fsck tests: add test for fsck-ing an unknown type	2021-10-25 16:06:56 -07:00
Ævar Arnfjörð Bjarmason	96e41f58fe	fsck: report invalid object type-path combinations Improve the error that's emitted in cases where we find a loose object we parse, but which isn't at the location we expect it to be. Before this change we'd prefix the error with a not-a-OID derived from the path at which the object was found, due to an emergent behavior in how we'd end up with an "OID" in these codepaths. Now we'll instead say what object we hashed, and what path it was found at. Before this patch series e.g.: $ git hash-object --stdin -w -t blob </dev/null `e69de29bb2` $ mv objects/e6/ objects/e7 Would emit ("[...]" used to abbreviate the OIDs): git fsck error: hash mismatch for ./objects/e7/9d[...] (expected e79d[...]) error: e79d[...]: object corrupt or missing: ./objects/e7/9d[...] Now we'll instead emit: error: e69d[...]: hash-path mismatch, found at: ./objects/e7/9d[...] Furthermore, we'll do the right thing when the object type and its location are bad. I.e. this case: $ git hash-object --stdin -w -t garbage --literally </dev/null 8315a83d2acc4c174aed59430f9a9c4ed926440f $ mv objects/83 objects/84 As noted in an earlier commits we'd simply die early in those cases, until preceding commits fixed the hard die on invalid object type: $ git fsck fatal: invalid object type Now we'll instead emit sensible error messages: $ git fsck error: 8315[...]: hash-path mismatch, found at: ./objects/84/15[...] error: 8315[...]: object is of unknown type 'garbage': ./objects/84/15[...] In both fsck.c and object-file.c we're using null_oid as a sentinel value for checking whether we got far enough to be certain that the issue was indeed this OID mismatch. We need to add the "object corrupt or missing" special-case to deal with cases where read_loose_object() will return an error before completing check_object_signature(), e.g. if we have an error in unpack_loose_rest() because we find garbage after the valid gzip content: $ git hash-object --stdin -w -t blob </dev/null `e69de29bb2` $ chmod 755 objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 $ echo garbage >>objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 $ git fsck error: garbage at end of loose object 'e69d[...]' error: unable to unpack contents of ./objects/e6/9d[...] error: e69d[...]: object corrupt or missing: ./objects/e6/9d[...] There is currently some weird messaging in the edge case when the two are combined, i.e. because we're not explicitly passing along an error state about this specific scenario from check_stream_oid() via read_loose_object() we'll end up printing the null OID if an object is of an unknown type and it can't be unpacked by zlib, e.g.: $ git hash-object --stdin -w -t garbage --literally </dev/null 8315a83d2acc4c174aed59430f9a9c4ed926440f $ chmod 755 objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f $ echo garbage >>objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f $ /usr/bin/git fsck fatal: invalid object type $ ~/g/git/git fsck error: garbage at end of loose object '8315a83d2acc4c174aed59430f9a9c4ed926440f' error: unable to unpack contents of ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f error: 8315a83d2acc4c174aed59430f9a9c4ed926440f: object corrupt or missing: ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f error: 0000000000000000000000000000000000000000: object is of unknown type 'garbage': ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f [...] I think it's OK to leave that for future improvements, which would involve enum-ifying more error state as we've done with "enum unpack_loose_header_result" in preceding commits. In these increasingly more obscure cases the worst that can happen is that we'll get slightly nonsensical or inapplicable error messages. There's other such potential edge cases, all of which might produce some confusing messaging, but still be handled correctly as far as passing along errors goes. E.g. if check_object_signature() returns and oideq(real_oid, null_oid()) is true, which could happen if it returns -1 due to the read_istream() call having failed. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-10-01 15:06:01 -07:00
Junio C Hamano	c9d6d8a193	Merge branch 'jk/log-decorate-optim' Optimize "git log" for cases where we wasted cycles to load ref decoration data that may not be needed. * jk/log-decorate-optim: load_ref_decorations(): fix decoration with tags add_ref_decoration(): rename s/type/deco_type/ load_ref_decorations(): avoid parsing non-tag objects object.h: add lookup_object_by_type() function object.h: expand docstring for lookup_unknown_object() log: avoid loading decorations for userformats that don't need it pretty.h: update and expand docstring for userformat_find_requirements()	2021-07-28 13:17:58 -07:00
Eric Wong	cf2dc1c238	speed up alt_odb_usable() with many alternates With many alternates, the duplicate check in alt_odb_usable() wastes many cycles doing repeated fspathcmp() on every existing alternate. Use a khash to speed up lookups by odb->path. Since the kh_put_* API uses the supplied key without duplicating it, we also take advantage of it to replace both xstrdup() and strbuf_release() in link_alt_odb_entry() with strbuf_detach() to avoid the allocation and copy. In a test repository with 50K alternates and each of those 50K alternates having one alternate each (for a total of 100K total alternates); this speeds up lookup of a non-existent blob from over 16 minutes to roughly 2.7 seconds on my busy workstation. Note: all underlying git object directories were small and unpacked with only loose objects and no packs. Having to load packs increases times significantly. Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-07-07 17:21:12 -07:00
Jeff King	7463064b28	object.h: add lookup_object_by_type() function In some cases it's useful for efficiency reasons to get the type of an object before deciding whether to parse it, but we still want an object struct. E.g., in reachable.c, bitmaps give us the type, but we just want to mark flags on each object. Likewise, we may loop over every object and only parse tags in order to peel them; checking the type first lets us avoid parsing the non-tags. But our lookup_blob(), etc, functions make getting an object struct annoying: we have to call the right function for every type. And we cannot just use the generic lookup_object(), because it only returns an already-seen object; it won't allocate a new object struct. Let's provide a function that dispatches to the correct lookup_* function based on a run-time type. In fact, reachable.c already has such a helper, so we'll just make that public. I did change the return type from "void " to "struct object ". While the former is a clever way to avoid casting inside the function, it's less safe and less informative to people reading the function declaration. The next commit will add a new caller. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-06-28 20:30:18 -07:00
Jeff King	45a187cc34	lookup_unknown_object(): take a repository argument All of the other lookup_foo() functions take a repository argument, but lookup_unknown_object() was never converted, and it uses the_repository internally. Let's fix that. We could leave a wrapper that uses the_repository, but there aren't that many calls, so we'll just convert them all. I looked briefly at each site to see if we had a repository struct (besides the_repository) we could pass, but none of them do (so this conversion to pass the_repository is a pure noop in each case, though it does take us one step closer to eventually getting rid of the_repository). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-04-13 13:18:46 -07:00
René Scharfe	ca56dadb4b	use CALLOC_ARRAY Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-03-13 16:00:09 -08:00
Jiang Xin	ce1d6d9f16	bundle: lost objects when removing duplicate pendings `git rev-list` will list one commit for the following command: $ git rev-list 'main^!' <tip-commit-of-main-branch> But providing the same rev-list args to `git bundle`, fail to create a bundle file. $ git bundle create - 'main^!' # v2 git bundle -<OID> <one-line-message> fatal: Refusing to create empty bundle. This is because when removing duplicate objects in function `object_array_remove_duplicates()`, one unique pending object which has the same name is deleted by mistake. The revision arg 'main^!' in the above example is parsed by `handle_revision_arg()`, and at lease two different objects will be appended to `revs.pending`, one points to the parent commit of the "main" branch, and the other points to the tip commit of the "main" branch. These two objects have the same name "main". Only one object is left with the name "main" after calling the function `object_array_remove_duplicates()`. And what's worse, when adding boundary commits into pending list, we use one-line commit message as names, and the arbitory names may surprise git-bundle. Only comparing objects themselves (".item") is also not good enough, because user may want to create a bundle with two identical objects but with different reference names, such as: "HEAD" and "refs/heads/main". Add new function `contains_object()` which compare both the address and the name of the object. Signed-off-by: Jiang Xin <zhiyou.jx@alibaba-inc.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-11 21:50:41 -08:00
Junio C Hamano	bf0a430f70	Merge branch 'en/strmap' A specialization of hashmap that uses a string as key has been introduced. Hopefully it will see wider use over time. * en/strmap: shortlog: use strset from strmap.h Use new HASHMAP_INIT macro to simplify hashmap initialization strmap: take advantage of FLEXPTR_ALLOC_STR when relevant strmap: enable allocations to come from a mem_pool strmap: add a strset sub-type strmap: split create_entry() out of strmap_put() strmap: add functions facilitating use as a string->int map strmap: enable faster clearing and reusing of strmaps strmap: add more utility functions strmap: new utility functions hashmap: provide deallocation function names hashmap: introduce a new hashmap_partial_clear() hashmap: allow re-use after hashmap_free() hashmap: adjust spacing to fix argument alignment hashmap: add usage documentation explaining hashmap_free[_entries]()	2020-11-21 15:14:38 -08:00
Elijah Newren	6da1a25814	hashmap: provide deallocation function names hashmap_free(), hashmap_free_entries(), and hashmap_free_() have existed for a while, but aren't necessarily the clearest names, especially with hashmap_partial_clear() being added to the mix and lazy-initialization now being supported. Peff suggested we adopt the following names[1]: - hashmap_clear() - remove all entries and de-allocate any hashmap-specific data, but be ready for reuse - hashmap_clear_and_free() - ditto, but free the entries themselves - hashmap_partial_clear() - remove all entries but don't deallocate table - hashmap_partial_clear_and_free() - ditto, but free the entries This patch provides the new names and converts all existing callers over to the new naming scheme. [1] https://lore.kernel.org/git/20201030125059.GA3277724@coredump.intra.peff.net/ Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-11-02 12:15:50 -08:00
René Scharfe	cd8888452c	object: allow clear_commit_marks_all to handle any repo Allow callers to specify the repository to use. Rename the function to repo_clear_commit_marks to document its new scope. No functional change intended. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-10-31 10:46:34 -07:00
Abhishek Kumar	6da43d937c	object: drop parsed_object_pool->commit_count `14ba97f8` (alloc: allow arbitrary repositories for alloc functions, 2018-05-15) introduced parsed_object_pool->commit_count to keep count of commits per repository and was used to assign commit->index. However, commit-slab code requires commit->index values to be unique and a global count would be correct, rather than a per-repo count. Let's introduce a static counter variable, `parsed_commits_count` to keep track of parsed commits so far. As commit_count has no use anymore, let's also drop it from the struct. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-06-17 14:37:14 -07:00
Junio C Hamano	0df82d99da	Merge branch 'jk/object-filter-with-bitmap' The object reachability bitmap machinery and the partial cloning machinery were not prepared to work well together, because some object-filtering criteria that partial clones use inherently rely on object traversal, but the bitmap machinery is an optimization to bypass that object traversal. There however are some cases where they can work together, and they were taught about them. * jk/object-filter-with-bitmap: rev-list --count: comment on the use of count_right++ pack-objects: support filters with bitmaps pack-bitmap: implement BLOB_LIMIT filtering pack-bitmap: implement BLOB_NONE filtering bitmap: add bitmap_unset() function rev-list: use bitmap filters for traversal pack-bitmap: basic noop bitmap filter infrastructure rev-list: allow commit-only bitmap traversals t5310: factor out bitmap traversal comparison rev-list: allow bitmaps when counting objects rev-list: make --count work with --objects rev-list: factor out bitmap-optimized routines pack-bitmap: refuse to do a bitmap traversal with pathspecs rev-list: fallback to non-bitmap traversal when filtering pack-bitmap: fix leak of haves/wants object lists pack-bitmap: factor out type iterator initialization	2020-03-02 15:07:18 -08:00
Junio C Hamano	78e67cda42	Merge branch 'mt/use-passed-repo-more-in-funcs' Some codepaths were given a repository instance as a parameter to work in the repository, but passed the_repository instance to its callees, which has been cleaned up (somewhat). * mt/use-passed-repo-more-in-funcs: sha1-file: allow check_object_signature() to handle any repo sha1-file: pass git_hash_algo to hash_object_file() sha1-file: pass git_hash_algo to write_object_file_prepare() streaming: allow open_istream() to handle any repo pack-check: use given repo's hash_algo at verify_packfile() cache-tree: use given repo's hash_algo at verify_one() diff: make diff_populate_filespec() honor its repo argument	2020-02-14 12:54:22 -08:00
Jeff King	acac50dd8c	pack-bitmap: fix leak of haves/wants object lists When we do a bitmap-aware revision traversal, we create an object_list for each of the "haves" and "wants" tips. After creating the result bitmaps these are no longer needed or used, but we never free the list memory. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-02-13 09:08:58 -08:00
Matheus Tavares	b98d188581	sha1-file: allow check_object_signature() to handle any repo Some callers of check_object_signature() can work on arbitrary repositories, but the repo does not get passed to this function. Instead, the_repository is always used internally. To fix possible inconsistencies, allow the function to receive a struct repository and make those callers pass on the repo being handled. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-01-31 10:45:39 -08:00
Matheus Tavares	b1fc9da1c8	replace-object: make replace operations thread-safe replace-object functions are very close to being thread-safe: the only current racy section is the lazy initialization at prepare_replace_object(). The following patches will protect some object reading operations to be called threaded, but before that, replace functions must be protected. To do so, add a mutex to struct raw_object_store and acquire it before lazy initializing the replace_map. This won't cause any noticeable performance drop as the mutex will no longer be used after the replace_map is initialized. Later, when the replace functions are called in parallel, thread debuggers might point our use of the added replace_map_initialized flag as a data race. However, as this boolean variable is initialized as false and it's only updated once, there's no real harm. It's perfectly fine if the value is updated right after a thread read it in replace-map.h:lookup_replace_object() (there'll only be a performance penalty for the affected threads at that moment). We could cease the debugger warning protecting the variable reading at the said function. However, this would negatively affect performance for all threads calling it, at any time, so it's not really worthy since the warning doesn't represent a real problem. Instead, to make sure we don't get false positives (at ThreadSanitizer, at least) an entry for the respective function is added to .tsan-suppressions. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-01-17 13:52:14 -08:00
Colin Stolley	ec48540fe8	packfile.c: speed up loading lots of packfiles When loading packfiles on start-up, we traverse the internal packfile list once per file to avoid reloading packfiles that have already been loaded. This check runs in quadratic time, so for poorly maintained repos with a large number of packfiles, it can be pretty slow. Add a hashmap containing the packfile names as we load them so that the average runtime cost of checking for already-loaded packs becomes constant. Add a perf test to p5303 to show speed-up. The existing p5303 test runtimes are dominated by other factors and do not show an appreciable speed-up. The new test in p5303 clearly exposes a speed-up in bad cases. In this test we create 10,000 packfiles and measure the start-up time of git rev-parse, which does little else besides load in the packs. Here are the numbers for the new p5303 test: Test HEAD^ HEAD --------------------------------------------------------------------- 5303.12: load 10,000 packs 1.03(0.92+0.10) 0.12(0.02+0.09) -88.3% Signed-off-by: Colin Stolley <cstolley@runbox.com> Helped-by: Jeff King <peff@peff.net> [jc: squashed the change to call hashmap in install_packed_git() by peff] Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-12-03 07:59:45 -08:00
René Scharfe	2fe44394c8	treewide: remove duplicate #include directives Found with "git grep '^#include ' '*.c' \| sort \| uniq -d". Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-10-04 08:16:00 +09:00
Junio C Hamano	a7db4c193d	Merge branch 'jk/oidhash' Code clean-up to remove hardcoded SHA-1 hash from many places. * jk/oidhash: hashmap: convert sha1hash() to oidhash() hash.h: move object_id definition from cache.h khash: rename oid helper functions khash: drop sha1-specific map types pack-bitmap: convert khash_sha1 maps into kh_oid_map delta-islands: convert island_marks khash to use oids khash: rename kh_oid_t to kh_oid_set khash: drop broken oid_map typedef object: convert create_object() to use object_id object: convert internal hash_obj() to object_id object: convert lookup_object() to use object_id object: convert lookup_unknown_object() to use object_id pack-objects: convert locate_object_entry_hash() to object_id pack-objects: convert packlist_find() to use object_id pack-bitmap-write: convert some helpers to use object_id upload-pack: rename a "sha1" variable to "oid" describe: fix accidental oid/hash type-punning	2019-07-09 15:25:43 -07:00
Jeff King	d40abc8e95	hashmap: convert sha1hash() to oidhash() There are no callers left of sha1hash() that do not simply pass the "hash" member of a "struct object_id". Let's get rid of the outdated sha1-specific function and provide one that operates on the whole struct (even though the technique, taking the first few bytes of the hash, will remain the same). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-06-20 10:44:22 -07:00
Jeff King	a378509e1c	object: convert create_object() to use object_id There are no callers left of create_object() that aren't just passing us the "hash" member of a "struct object_id". Let's take the whole struct, which gets us closer to removing all raw sha1 variables. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-06-20 10:20:51 -07:00
Jeff King	46931d3938	object: convert internal hash_obj() to object_id Now that lookup_object() has an object_id, we can consistently pass that around instead of a raw sha1. We still convert to a hash to pass to sha1hash(), but the goal is for that to go away shortly. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-06-20 10:19:37 -07:00
Jeff King	d0229abd93	object: convert lookup_object() to use object_id There are no callers left of lookup_object() that aren't just passing us the "hash" member of a "struct object_id". Let's take the whole struct, which gets us closer to removing all raw sha1 variables. It also matches the existing conversions of lookup_blob(), etc. The conversions of callers were done by hand, but they're all mechanical one-liners. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-06-20 10:18:09 -07:00
Jeff King	0ebbcf70e6	object: convert lookup_unknown_object() to use object_id There are no callers left of lookup_unknown_object() that aren't just passing us the "hash" member of a "struct object_id". Let's take the whole struct, which gets us closer to removing all raw sha1 variables. It also matches the existing conversions of lookup_blob(), etc. The conversions of callers were done by hand, but they're all mechanical one-liners. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-06-20 10:06:19 -07:00
Derrick Stolee	2d511cfc0b	packfile: rename close_all_packs to close_object_store The close_all_packs() method is now responsible for more than just pack-files. It also closes the commit-graph and the multi-pack-index. Rename the function to be more descriptive of its larger role. The name also fits because the input parameter is a raw_object_store. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-06-12 11:33:54 -07:00
Josh Steadmon	96b0710ee4	object: fix leak of shallow_stat In `eee4502baa` ("shallow: migrate shallow information into the object parser", 2018-05-17), we added a stat_validity pointer into the parsed_object_pool struct, but did not add code to free this in parsed_object_pool_clear(). This leak was found by fuzz-commit-graph. Clear the struct and then free it in parsed_object_pool_clear() to prevent the leak. Signed-off-by: Josh Steadmon <steadmon@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-02-07 13:02:28 -08:00
Junio C Hamano	cba595ab1a	Merge branch 'jk/loose-object-cache-oid' Code clean-up. * jk/loose-object-cache-oid: prefer "hash mismatch" to "sha1 mismatch" sha1-file: avoid "sha1 file" for generic use in messages sha1-file: prefer "loose object file" to "sha1 file" in messages sha1-file: drop has_sha1_file() convert has_sha1_file() callers to has_object_file() sha1-file: convert pass-through functions to object_id sha1-file: modernize loose header/stream functions sha1-file: modernize loose object file functions http: use struct object_id instead of bare sha1 update comment references to sha1_object_info() sha1-file: fix outdated sha1 comment references	2019-02-06 22:05:27 -08:00
Junio C Hamano	2ed3de439e	Merge branch 'sg/object-as-type-commit-graph-fix' The commit-graph facility did not work when in-core objects that are promoted from unknown type to commit (e.g. a commit that is accessed via a tag that refers to it) were involved, which has been corrected. * sg/object-as-type-commit-graph-fix: object_as_type: initialize commit-graph-related fields of 'struct commit'	2019-02-05 14:26:19 -08:00
Junio C Hamano	b99a579f8e	Merge branch 'sb/more-repo-in-api' The in-core repository instances are passed through more codepaths. * sb/more-repo-in-api: (23 commits) t/helper/test-repository: celebrate independence from the_repository path.h: make REPO_GIT_PATH_FUNC repository agnostic commit: prepare free_commit_buffer and release_commit_memory for any repo commit-graph: convert remaining functions to handle any repo submodule: don't add submodule as odb for push submodule: use submodule repos for object lookup pretty: prepare format_commit_message to handle arbitrary repositories commit: prepare logmsg_reencode to handle arbitrary repositories commit: prepare repo_unuse_commit_buffer to handle any repo commit: prepare get_commit_buffer to handle any repo commit-reach: prepare in_merge_bases[_many] to handle any repo commit-reach: prepare get_merge_bases to handle any repo commit-reach.c: allow get_merge_bases_many_0 to handle any repo commit-reach.c: allow remove_redundant to handle any repo commit-reach.c: allow merge_bases_many to handle any repo commit-reach.c: allow paint_down_to_common to handle any repo commit: allow parse_commit* to handle any repo object: parse_object to honor its repository argument object-store: prepare has_{sha1, object}_file to handle any repo object-store: prepare read_object_file to deal with any repo ...	2019-02-05 14:26:09 -08:00
SZEDER Gábor	4468d4435c	object_as_type: initialize commit-graph-related fields of 'struct commit' When the commit graph and generation numbers were introduced in commits `177722b344` (commit: integrate commit graph with commit parsing, 2018-04-10) and `83073cc994` (commit: add generation number to struct commit, 2018-04-25), they tried to make sure that the corresponding 'graph_pos' and 'generation' fields of 'struct commit' are initialized conservatively, as if the commit were not included in the commit-graph file. Alas, initializing those fields only in alloc_commit_node() missed the case when an object that happens to be a commit is first looked up via lookup_unknown_object(), and is then later converted to a 'struct commit' via the object_as_type() helper function (either calling it directly, or as part of a subsequent lookup_commit() call). Consequently, both of those fields incorrectly remain set to zero, which means e.g. that the commit is present in and is the first entry of the commit-graph file. This will result in wrong timestamp, parent and root tree hashes, if such a 'struct commit' instance is later filled from the commit-graph. Extract the initialization of 'struct commit's fields from alloc_commit_node() into a helper function, and call it from object_as_type() as well, to make sure that it properly initializes the two commit-graph-related fields, too. With this helper function it is hopefully less likely that any new fields added to 'struct commit' in the future would remain uninitialized. With this change alloc_commit_index() won't have any remaining callers outside of 'alloc.c', so mark it as static. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-01-27 16:55:57 -08:00
Jeff King	01f8d5948a	prefer "hash mismatch" to "sha1 mismatch" To future-proof ourselves against a change in the hash, let's use the more generic "hash mismatch" to refer to integrity problems. Note that we do advertise this exact string in git-fsck(1). However, the message itself is marked for translation, meaning we do not expect it to be machine-readable. While we're touching that documentation, let's also update it for grammar and clarity. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-01-08 09:41:06 -08:00
René Scharfe	d4e19e5163	object-store: factor out odb_clear_loose_cache() Add and use a function for emptying the loose object cache, so callers don't have to know any of its implementation details. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-01-08 09:40:19 -08:00
Stefan Beller	6a7895fd8a	commit: prepare free_commit_buffer and release_commit_memory for any repo Pass the object pool to free_commit_buffer and release_commit_memory, such that we can eliminate access to 'the_repository'. Also remove the TODO in release_commit_memory, as commit->util was removed in `9d2c97016f` (commit.h: delete 'util' field in struct commit, 2018-05-19) Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2018-12-28 10:06:33 -08:00
Stefan Beller	d1a6902265	object: parse_object to honor its repository argument In `8e4b0b6047` (object.c: allow parse_object to handle arbitrary repositories, 2018-06-28), we forgot to pass the repository down to the read_object_file. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2018-11-14 17:22:40 +09:00
Jeff King	f0eaf63819	sha1-file: use an object_directory for the main object dir Our handling of alternate object directories is needlessly different from the main object directory. As a result, many places in the code basically look like this: do_something(r->objects->objdir); for (odb = r->objects->alt_odb_list; odb; odb = odb->next) do_something(odb->path); That gets annoying when do_something() is non-trivial, and we've resorted to gross hacks like creating fake alternates (see find_short_object_filename()). Instead, let's give each raw_object_store a unified list of object_directory structs. The first will be the main store, and everything after is an alternate. Very few callers even care about the distinction, and can just loop over the whole list (and those who care can just treat the first element differently). A few observations: - we don't need r->objects->objectdir anymore, and can just mechanically convert that to r->objects->odb->path - object_directory's path field needs to become a real pointer rather than a FLEX_ARRAY, in order to fill it with expand_base_dir() - we'll call prepare_alt_odb() earlier in many functions (i.e., outside of the loop). This may result in us calling it even when our function would be satisfied looking only at the main odb. But this doesn't matter in practice. It's not a very expensive operation in the first place, and in the majority of cases it will be a noop. We call it already (and cache its results) in prepare_packed_git(), and we'll generally check packs before loose objects. So essentially every program is going to call it immediately once per program. Arguably we should just prepare_alt_odb() immediately upon setting up the repository's object directory, which would save us sprinkling calls throughout the code base (and forgetting to do so has been a source of subtle bugs in the past). But I've stopped short of that here, since there are already a lot of other moving parts in this patch. - Most call sites just get shorter. The check_and_freshen() functions are an exception, because they have entry points to handle local and nonlocal directories separately. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2018-11-13 14:22:03 +09:00
Jeff King	f3f043a103	handle alternates paths the same as the main object dir When we generate loose file paths for the main object directory, the caller provides a buffer to loose_object_path (formerly sha1_file_name). The callers generally keep their own static buffer to avoid excessive reallocations. But for alternate directories, each struct carries its own scratch buffer. This is needlessly different; let's unify them. We could go either direction here, but this patch moves the alternates struct over to the main directory style (rather than vice-versa). Technically the alternates style is more efficient, as it avoids rewriting the object directory name on each call. But this is unlikely to matter in practice, as we avoid reallocations either way (and nobody has ever noticed or complained that the main object directory is copying a few extra bytes before making a much more expensive system call). And this has the advantage that the reusable buffers are tied to particular calls, which makes the invalidation rules simpler (for example, the return value from stat_sha1_file() used to be invalidated by basically any other object call, but now it is affected only by other calls to stat_sha1_file()). We do steal the trick from alt_sha1_path() of returning a pointer to the filled buffer, which makes a few conversions more convenient. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2018-11-13 14:22:02 +09:00
Jeff King	263db403fa	rename "alternate_object_database" to "object_directory" In preparation for unifying the handling of alt odb's and the normal repo object directory, let's use a more neutral name. This patch is purely mechanical, swapping the type name, and converting any variables named "alt" to "odb". There should be no functional change, but it will reduce the noise in subsequent diffs. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2018-11-13 14:22:02 +09:00

1 2 3 4 5

214 Commits