2005-06-27 11:27:56 +08:00
|
|
|
/*
|
|
|
|
* csum-file.c
|
|
|
|
*
|
|
|
|
* Copyright (C) 2005 Linus Torvalds
|
|
|
|
*
|
|
|
|
* Simple file write infrastructure for writing SHA1-summed
|
|
|
|
* files. Useful when you write a file that you want to be
|
|
|
|
* able to verify hasn't been messed with afterwards.
|
|
|
|
*/
|
global: introduce `USE_THE_REPOSITORY_VARIABLE` macro
Use of the `the_repository` variable is deprecated nowadays, and we
slowly but steadily convert the codebase to not use it anymore. Instead,
callers should be passing down the repository to work on via parameters.
It is hard though to prove that a given code unit does not use this
variable anymore. The most trivial case, merely demonstrating that there
is no direct use of `the_repository`, is already a bit of a pain during
code reviews as the reviewer needs to manually verify claims made by the
patch author. The bigger problem though is that we have many interfaces
that implicitly rely on `the_repository`.
Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code
units to opt into usage of `the_repository`. The intent of this macro is
to demonstrate that a certain code unit does not use this variable
anymore, and to keep it from new dependencies on it in future changes,
be it explicit or implicit
For now, the macro only guards `the_repository` itself as well as
`the_hash_algo`. There are many more known interfaces where we have an
implicit dependency on `the_repository`, but those are not guarded at
the current point in time. Over time though, we should start to add
guards as required (or even better, just remove them).
Define the macro as required in our code units. As expected, most of our
code still relies on the global variable. Nearly all of our builtins
rely on the variable as there is no way yet to pass `the_repository` to
their entry point. For now, declare the macro in "biultin.h" to keep the
required changes at least a little bit more contained.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14 14:50:23 +08:00
|
|
|
|
|
|
|
#define USE_THE_REPOSITORY_VARIABLE
|
|
|
|
|
2023-03-21 14:26:04 +08:00
|
|
|
#include "git-compat-util.h"
|
2007-10-31 05:06:21 +08:00
|
|
|
#include "progress.h"
|
2005-06-27 11:27:56 +08:00
|
|
|
#include "csum-file.h"
|
2023-04-23 04:17:20 +08:00
|
|
|
#include "hash.h"
|
2005-06-27 11:27:56 +08:00
|
|
|
|
csum-file.h: increase hashfile buffer size
The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s. Since our end goal is to have the index writing
code use the hashfile API, we need to unify this buffer size to avoid a
performance regression.
There is a buffer, 'check_buffer', that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. To avoid issues with stack size, move both 'buffer' and
'check_buffer' to be heap pointers within 'struct hashfile'. The
'check_buffer' member is left as NULL unless check_fd is set in
hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
which also frees the full structure.
Since these buffers are now on the heap, we can adjust their size based
on the needs of the consumer. In particular, callers to
hashfd_throughput() are expecting to report progress indicators as the
buffer flushes. These callers would prefer the smaller 8k buffer to
avoid large delays between updates, especially for users with slower
networks. When the progress indicator is not used, the larger buffer is
preferrable.
By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19 02:32:46 +08:00
|
|
|
static void verify_buffer_or_die(struct hashfile *f,
|
|
|
|
const void *buf,
|
|
|
|
unsigned int count)
|
|
|
|
{
|
|
|
|
ssize_t ret = read_in_full(f->check_fd, f->check_buffer, count);
|
|
|
|
|
|
|
|
if (ret < 0)
|
|
|
|
die_errno("%s: sha1 file read error", f->name);
|
|
|
|
if (ret != count)
|
|
|
|
die("%s: sha1 file truncated", f->name);
|
|
|
|
if (memcmp(buf, f->check_buffer, count))
|
|
|
|
die("sha1 file '%s' validation error", f->name);
|
|
|
|
}
|
|
|
|
|
2018-02-01 10:18:46 +08:00
|
|
|
static void flush(struct hashfile *f, const void *buf, unsigned int count)
|
2005-06-27 11:27:56 +08:00
|
|
|
{
|
csum-file.h: increase hashfile buffer size
The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s. Since our end goal is to have the index writing
code use the hashfile API, we need to unify this buffer size to avoid a
performance regression.
There is a buffer, 'check_buffer', that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. To avoid issues with stack size, move both 'buffer' and
'check_buffer' to be heap pointers within 'struct hashfile'. The
'check_buffer' member is left as NULL unless check_fd is set in
hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
which also frees the full structure.
Since these buffers are now on the heap, we can adjust their size based
on the needs of the consumer. In particular, callers to
hashfd_throughput() are expecting to report progress indicators as the
buffer flushes. These callers would prefer the smaller 8k buffer to
avoid large delays between updates, especially for users with slower
networks. When the progress indicator is not used, the larger buffer is
preferrable.
By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19 02:32:46 +08:00
|
|
|
if (0 <= f->check_fd && count)
|
|
|
|
verify_buffer_or_die(f, buf, count);
|
2011-02-03 09:29:01 +08:00
|
|
|
|
2021-05-17 20:24:49 +08:00
|
|
|
if (write_in_full(f->fd, buf, count) < 0) {
|
|
|
|
if (errno == ENOSPC)
|
2005-06-27 13:01:46 +08:00
|
|
|
die("sha1 file '%s' write error. Out of diskspace", f->name);
|
2009-06-27 23:58:46 +08:00
|
|
|
die_errno("sha1 file '%s' write error", f->name);
|
2005-06-27 11:27:56 +08:00
|
|
|
}
|
2021-05-17 20:24:49 +08:00
|
|
|
|
|
|
|
f->total += count;
|
|
|
|
display_throughput(f->tp, f->total);
|
2005-06-27 11:27:56 +08:00
|
|
|
}
|
|
|
|
|
2018-02-01 10:18:46 +08:00
|
|
|
void hashflush(struct hashfile *f)
|
2005-06-27 11:27:56 +08:00
|
|
|
{
|
|
|
|
unsigned offset = f->offset;
|
2008-05-30 23:42:16 +08:00
|
|
|
|
2005-06-27 11:27:56 +08:00
|
|
|
if (offset) {
|
hashfile: allow skipping the hash function
The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.
Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.
However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.
This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.
We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.
Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.
This use of a trailing null hash would be desireable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.
A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.
[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-07 00:31:53 +08:00
|
|
|
if (!f->skip_hash)
|
csum-file.c: use unsafe SHA-1 implementation when available
Update hashwrite() and friends to use the unsafe_-variants of hashing
functions, calling for e.g., "the_hash_algo->unsafe_update_fn()" instead
of "the_hash_algo->update_fn()".
These callers only use the_hash_algo to produce a checksum, which we
depend on for data integrity, but not for cryptographic purposes, so
these callers are safe to use the unsafe (non-collision detecting) SHA-1
implementation.
To time this, I took a freshly packed copy of linux.git, and ran the
following with and without the OPENSSL_SHA1_UNSAFE=1 build-knob. Both
versions were compiled with -O3:
$ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
$ valgrind --tool=callgrind ~/src/git/git-pack-objects \
--revs --stdout --all-progress --use-bitmap-index <in >/dev/null
Without OPENSSL_SHA1_UNSAFE=1 (that is, using the collision-detecting
SHA-1 implementation for both cryptographic and non-cryptographic
purposes), we spend a significant amount of our instruction count in
hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
159,998,868,413 (79.42%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and the resulting "clone" takes 19.219 seconds of wall clock time,
18.94 seconds of user time and 0.28 seconds of system time.
Compiling with OPENSSL_SHA1_UNSAFE=1, we spend ~60% fewer instructions
in hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
59,164,001,176 (58.79%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and generate the resulting "clone" much faster, in only 11.597 seconds
of wall time, 11.37 seconds of user time, and 0.23 seconds of system
time, for a ~40% speed-up.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-26 23:22:53 +08:00
|
|
|
the_hash_algo->unsafe_update_fn(&f->ctx, f->buffer, offset);
|
2008-10-10 23:39:20 +08:00
|
|
|
flush(f, f->buffer, offset);
|
2007-05-14 02:28:19 +08:00
|
|
|
f->offset = 0;
|
2005-06-27 11:27:56 +08:00
|
|
|
}
|
2008-10-10 10:08:51 +08:00
|
|
|
}
|
|
|
|
|
2024-08-14 14:52:03 +08:00
|
|
|
void free_hashfile(struct hashfile *f)
|
csum-file.h: increase hashfile buffer size
The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s. Since our end goal is to have the index writing
code use the hashfile API, we need to unify this buffer size to avoid a
performance regression.
There is a buffer, 'check_buffer', that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. To avoid issues with stack size, move both 'buffer' and
'check_buffer' to be heap pointers within 'struct hashfile'. The
'check_buffer' member is left as NULL unless check_fd is set in
hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
which also frees the full structure.
Since these buffers are now on the heap, we can adjust their size based
on the needs of the consumer. In particular, callers to
hashfd_throughput() are expecting to report progress indicators as the
buffer flushes. These callers would prefer the smaller 8k buffer to
avoid large delays between updates, especially for users with slower
networks. When the progress indicator is not used, the larger buffer is
preferrable.
By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19 02:32:46 +08:00
|
|
|
{
|
|
|
|
free(f->buffer);
|
|
|
|
free(f->check_buffer);
|
|
|
|
free(f);
|
|
|
|
}
|
|
|
|
|
2022-03-11 06:43:21 +08:00
|
|
|
int finalize_hashfile(struct hashfile *f, unsigned char *result,
|
|
|
|
enum fsync_component component, unsigned int flags)
|
2008-10-10 10:08:51 +08:00
|
|
|
{
|
|
|
|
int fd;
|
|
|
|
|
2018-02-01 10:18:46 +08:00
|
|
|
hashflush(f);
|
hashfile: allow skipping the hash function
The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.
Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.
However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.
This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.
We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.
Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.
This use of a trailing null hash would be desireable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.
A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.
[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-07 00:31:53 +08:00
|
|
|
|
|
|
|
if (f->skip_hash)
|
2024-06-14 14:49:50 +08:00
|
|
|
hashclr(f->buffer, the_repository->hash_algo);
|
hashfile: allow skipping the hash function
The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.
Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.
However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.
This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.
We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.
Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.
This use of a trailing null hash would be desireable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.
A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.
[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-07 00:31:53 +08:00
|
|
|
else
|
csum-file.c: use unsafe SHA-1 implementation when available
Update hashwrite() and friends to use the unsafe_-variants of hashing
functions, calling for e.g., "the_hash_algo->unsafe_update_fn()" instead
of "the_hash_algo->update_fn()".
These callers only use the_hash_algo to produce a checksum, which we
depend on for data integrity, but not for cryptographic purposes, so
these callers are safe to use the unsafe (non-collision detecting) SHA-1
implementation.
To time this, I took a freshly packed copy of linux.git, and ran the
following with and without the OPENSSL_SHA1_UNSAFE=1 build-knob. Both
versions were compiled with -O3:
$ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
$ valgrind --tool=callgrind ~/src/git/git-pack-objects \
--revs --stdout --all-progress --use-bitmap-index <in >/dev/null
Without OPENSSL_SHA1_UNSAFE=1 (that is, using the collision-detecting
SHA-1 implementation for both cryptographic and non-cryptographic
purposes), we spend a significant amount of our instruction count in
hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
159,998,868,413 (79.42%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and the resulting "clone" takes 19.219 seconds of wall clock time,
18.94 seconds of user time and 0.28 seconds of system time.
Compiling with OPENSSL_SHA1_UNSAFE=1, we spend ~60% fewer instructions
in hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
59,164,001,176 (58.79%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and generate the resulting "clone" much faster, in only 11.597 seconds
of wall time, 11.37 seconds of user time, and 0.23 seconds of system
time, for a ~40% speed-up.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-26 23:22:53 +08:00
|
|
|
the_hash_algo->unsafe_final_fn(f->buffer, &f->ctx);
|
hashfile: allow skipping the hash function
The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.
Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.
However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.
This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.
We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.
Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.
This use of a trailing null hash would be desireable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.
A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.
[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-07 00:31:53 +08:00
|
|
|
|
2008-08-30 04:08:00 +08:00
|
|
|
if (result)
|
2024-06-14 14:49:50 +08:00
|
|
|
hashcpy(result, f->buffer, the_repository->hash_algo);
|
2018-04-03 04:34:15 +08:00
|
|
|
if (flags & CSUM_HASH_IN_STREAM)
|
2018-02-01 10:18:47 +08:00
|
|
|
flush(f, f->buffer, the_hash_algo->rawsz);
|
2018-04-03 04:34:15 +08:00
|
|
|
if (flags & CSUM_FSYNC)
|
2022-03-11 06:43:21 +08:00
|
|
|
fsync_component_or_die(component, f->fd, f->name);
|
2018-04-03 04:34:15 +08:00
|
|
|
if (flags & CSUM_CLOSE) {
|
2007-10-17 09:55:48 +08:00
|
|
|
if (close(f->fd))
|
2009-06-27 23:58:46 +08:00
|
|
|
die_errno("%s: sha1 file error on close", f->name);
|
2007-10-17 09:55:48 +08:00
|
|
|
fd = 0;
|
|
|
|
} else
|
|
|
|
fd = f->fd;
|
2011-02-03 09:29:01 +08:00
|
|
|
if (0 <= f->check_fd) {
|
|
|
|
char discard;
|
|
|
|
int cnt = read_in_full(f->check_fd, &discard, 1);
|
|
|
|
if (cnt < 0)
|
|
|
|
die_errno("%s: error when reading the tail of sha1 file",
|
|
|
|
f->name);
|
|
|
|
if (cnt)
|
|
|
|
die("%s: sha1 file has trailing garbage", f->name);
|
|
|
|
if (close(f->check_fd))
|
|
|
|
die_errno("%s: sha1 file error on close", f->name);
|
|
|
|
}
|
csum-file.h: increase hashfile buffer size
The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s. Since our end goal is to have the index writing
code use the hashfile API, we need to unify this buffer size to avoid a
performance regression.
There is a buffer, 'check_buffer', that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. To avoid issues with stack size, move both 'buffer' and
'check_buffer' to be heap pointers within 'struct hashfile'. The
'check_buffer' member is left as NULL unless check_fd is set in
hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
which also frees the full structure.
Since these buffers are now on the heap, we can adjust their size based
on the needs of the consumer. In particular, callers to
hashfd_throughput() are expecting to report progress indicators as the
buffer flushes. These callers would prefer the smaller 8k buffer to
avoid large delays between updates, especially for users with slower
networks. When the progress indicator is not used, the larger buffer is
preferrable.
By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19 02:32:46 +08:00
|
|
|
free_hashfile(f);
|
2007-10-17 09:55:48 +08:00
|
|
|
return fd;
|
2005-06-27 11:27:56 +08:00
|
|
|
}
|
|
|
|
|
csum-file: introduce discard_hashfile()
The hashfile API is used to write out a "hashfile", which has a
final checksum (typically SHA-1) at the end. An in-core hashfile
structure has up to two file descriptors and a few buffers that can
only be freed by calling a helper function that is private to the
csum-file implementation.
The usual flow of a user of the API is to first open a file
descriptor for writing, obtain a hashfile associated with that write
file descriptor by calling either hashfd() or hashfd_check(), call
hashwrite() number of times to write data to the file, and then call
finalize_hashfile(), which appends th checksum to the end of the
file, closes file descriptors and releases associated buffers.
But what if a caller finds some error after calling hashfd() to
start the process and/or hashwrite() to send some data to the file,
and wants to abort the operation? The underlying file descriptor is
often managed by the tempfile API, so aborting will clean the file
out of the filesystem, but the resources associated with the in-core
hashfile structure is lost.
Introduce discard_hashfile() API function to allow them to release
the resources held by a hashfile structure the callers want to
dispose of, and use that in read-cache.c:do_write_index(), which is
a central place that writes the index file.
Mark t2107 as leak-free, as this leak in "update-index --cacheinfo"
test that deliberately makes it fail is now plugged.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-07-26 07:07:28 +08:00
|
|
|
void discard_hashfile(struct hashfile *f)
|
|
|
|
{
|
|
|
|
if (0 <= f->check_fd)
|
|
|
|
close(f->check_fd);
|
|
|
|
if (0 <= f->fd)
|
|
|
|
close(f->fd);
|
|
|
|
free_hashfile(f);
|
|
|
|
}
|
|
|
|
|
2018-02-01 10:18:46 +08:00
|
|
|
void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
|
2005-06-27 11:27:56 +08:00
|
|
|
{
|
|
|
|
while (count) {
|
csum-file.h: increase hashfile buffer size
The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s. Since our end goal is to have the index writing
code use the hashfile API, we need to unify this buffer size to avoid a
performance regression.
There is a buffer, 'check_buffer', that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. To avoid issues with stack size, move both 'buffer' and
'check_buffer' to be heap pointers within 'struct hashfile'. The
'check_buffer' member is left as NULL unless check_fd is set in
hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
which also frees the full structure.
Since these buffers are now on the heap, we can adjust their size based
on the needs of the consumer. In particular, callers to
hashfd_throughput() are expecting to report progress indicators as the
buffer flushes. These callers would prefer the smaller 8k buffer to
avoid large delays between updates, especially for users with slower
networks. When the progress indicator is not used, the larger buffer is
preferrable.
By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19 02:32:46 +08:00
|
|
|
unsigned left = f->buffer_len - f->offset;
|
2005-06-27 11:27:56 +08:00
|
|
|
unsigned nr = count > left ? left : count;
|
2008-09-02 22:22:20 +08:00
|
|
|
|
|
|
|
if (f->do_crc)
|
|
|
|
f->crc32 = crc32(f->crc32, buf, nr);
|
|
|
|
|
csum-file.h: increase hashfile buffer size
The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s. Since our end goal is to have the index writing
code use the hashfile API, we need to unify this buffer size to avoid a
performance regression.
There is a buffer, 'check_buffer', that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. To avoid issues with stack size, move both 'buffer' and
'check_buffer' to be heap pointers within 'struct hashfile'. The
'check_buffer' member is left as NULL unless check_fd is set in
hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
which also frees the full structure.
Since these buffers are now on the heap, we can adjust their size based
on the needs of the consumer. In particular, callers to
hashfd_throughput() are expecting to report progress indicators as the
buffer flushes. These callers would prefer the smaller 8k buffer to
avoid large delays between updates, especially for users with slower
networks. When the progress indicator is not used, the larger buffer is
preferrable.
By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19 02:32:46 +08:00
|
|
|
if (nr == f->buffer_len) {
|
2021-03-26 20:38:11 +08:00
|
|
|
/*
|
|
|
|
* Flush a full batch worth of data directly
|
|
|
|
* from the input, skipping the memcpy() to
|
|
|
|
* the hashfile's buffer. In this block,
|
|
|
|
* f->offset is necessarily zero.
|
|
|
|
*/
|
hashfile: allow skipping the hash function
The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.
Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.
However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.
This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.
We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.
Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.
This use of a trailing null hash would be desireable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.
A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.
[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-07 00:31:53 +08:00
|
|
|
if (!f->skip_hash)
|
csum-file.c: use unsafe SHA-1 implementation when available
Update hashwrite() and friends to use the unsafe_-variants of hashing
functions, calling for e.g., "the_hash_algo->unsafe_update_fn()" instead
of "the_hash_algo->update_fn()".
These callers only use the_hash_algo to produce a checksum, which we
depend on for data integrity, but not for cryptographic purposes, so
these callers are safe to use the unsafe (non-collision detecting) SHA-1
implementation.
To time this, I took a freshly packed copy of linux.git, and ran the
following with and without the OPENSSL_SHA1_UNSAFE=1 build-knob. Both
versions were compiled with -O3:
$ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
$ valgrind --tool=callgrind ~/src/git/git-pack-objects \
--revs --stdout --all-progress --use-bitmap-index <in >/dev/null
Without OPENSSL_SHA1_UNSAFE=1 (that is, using the collision-detecting
SHA-1 implementation for both cryptographic and non-cryptographic
purposes), we spend a significant amount of our instruction count in
hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
159,998,868,413 (79.42%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and the resulting "clone" takes 19.219 seconds of wall clock time,
18.94 seconds of user time and 0.28 seconds of system time.
Compiling with OPENSSL_SHA1_UNSAFE=1, we spend ~60% fewer instructions
in hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
59,164,001,176 (58.79%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and generate the resulting "clone" much faster, in only 11.597 seconds
of wall time, 11.37 seconds of user time, and 0.23 seconds of system
time, for a ~40% speed-up.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-26 23:22:53 +08:00
|
|
|
the_hash_algo->unsafe_update_fn(&f->ctx, buf, nr);
|
2021-03-26 20:38:11 +08:00
|
|
|
flush(f, buf, nr);
|
2008-09-02 22:22:20 +08:00
|
|
|
} else {
|
2021-03-26 20:38:11 +08:00
|
|
|
/*
|
|
|
|
* Copy to the hashfile's buffer, flushing only
|
|
|
|
* if it became full.
|
|
|
|
*/
|
|
|
|
memcpy(f->buffer + f->offset, buf, nr);
|
|
|
|
f->offset += nr;
|
|
|
|
left -= nr;
|
|
|
|
if (!left)
|
|
|
|
hashflush(f);
|
2008-09-02 22:22:20 +08:00
|
|
|
}
|
2005-06-27 11:27:56 +08:00
|
|
|
|
|
|
|
count -= nr;
|
2006-06-18 23:18:09 +08:00
|
|
|
buf = (char *) buf + nr;
|
2005-06-27 11:27:56 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-02-01 10:18:46 +08:00
|
|
|
struct hashfile *hashfd_check(const char *name)
|
2011-02-03 09:29:01 +08:00
|
|
|
{
|
|
|
|
int sink, check;
|
2018-02-01 10:18:46 +08:00
|
|
|
struct hashfile *f;
|
2011-02-03 09:29:01 +08:00
|
|
|
|
2021-08-26 04:16:46 +08:00
|
|
|
sink = xopen("/dev/null", O_WRONLY);
|
|
|
|
check = xopen(name, O_RDONLY);
|
2018-02-01 10:18:46 +08:00
|
|
|
f = hashfd(sink, name);
|
2011-02-03 09:29:01 +08:00
|
|
|
f->check_fd = check;
|
csum-file.h: increase hashfile buffer size
The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s. Since our end goal is to have the index writing
code use the hashfile API, we need to unify this buffer size to avoid a
performance regression.
There is a buffer, 'check_buffer', that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. To avoid issues with stack size, move both 'buffer' and
'check_buffer' to be heap pointers within 'struct hashfile'. The
'check_buffer' member is left as NULL unless check_fd is set in
hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
which also frees the full structure.
Since these buffers are now on the heap, we can adjust their size based
on the needs of the consumer. In particular, callers to
hashfd_throughput() are expecting to report progress indicators as the
buffer flushes. These callers would prefer the smaller 8k buffer to
avoid large delays between updates, especially for users with slower
networks. When the progress indicator is not used, the larger buffer is
preferrable.
By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19 02:32:46 +08:00
|
|
|
f->check_buffer = xmalloc(f->buffer_len);
|
|
|
|
|
2011-02-03 09:29:01 +08:00
|
|
|
return f;
|
|
|
|
}
|
|
|
|
|
csum-file.h: increase hashfile buffer size
The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s. Since our end goal is to have the index writing
code use the hashfile API, we need to unify this buffer size to avoid a
performance regression.
There is a buffer, 'check_buffer', that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. To avoid issues with stack size, move both 'buffer' and
'check_buffer' to be heap pointers within 'struct hashfile'. The
'check_buffer' member is left as NULL unless check_fd is set in
hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
which also frees the full structure.
Since these buffers are now on the heap, we can adjust their size based
on the needs of the consumer. In particular, callers to
hashfd_throughput() are expecting to report progress indicators as the
buffer flushes. These callers would prefer the smaller 8k buffer to
avoid large delays between updates, especially for users with slower
networks. When the progress indicator is not used, the larger buffer is
preferrable.
By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19 02:32:46 +08:00
|
|
|
static struct hashfile *hashfd_internal(int fd, const char *name,
|
|
|
|
struct progress *tp,
|
|
|
|
size_t buffer_len)
|
2005-06-29 02:10:06 +08:00
|
|
|
{
|
2018-02-01 10:18:46 +08:00
|
|
|
struct hashfile *f = xmalloc(sizeof(*f));
|
2005-06-29 02:10:06 +08:00
|
|
|
f->fd = fd;
|
2011-02-03 09:29:01 +08:00
|
|
|
f->check_fd = -1;
|
2005-06-29 02:10:06 +08:00
|
|
|
f->offset = 0;
|
2007-11-05 11:15:41 +08:00
|
|
|
f->total = 0;
|
2007-10-31 05:06:21 +08:00
|
|
|
f->tp = tp;
|
2007-11-05 11:54:50 +08:00
|
|
|
f->name = name;
|
compute a CRC32 for each object as stored in a pack
The most important optimization for performance when repacking is the
ability to reuse data from a previous pack as is and bypass any delta
or even SHA1 computation by simply copying the raw data from one pack
to another directly.
The problem with this is that any data corruption within a copied object
would go unnoticed and the new (repacked) pack would be self-consistent
with its own checksum despite containing a corrupted object. This is a
real issue that already happened at least once in the past.
In some attempt to prevent this, we validate the copied data by inflating
it and making sure no error is signaled by zlib. But this is still not
perfect as a significant portion of a pack content is made of object
headers and references to delta base objects which are not deflated and
therefore not validated when repacking actually making the pack data reuse
still not as safe as it could be.
Of course a full SHA1 validation could be performed, but that implies
full data inflating and delta replaying which is extremely costly, which
cost the data reuse optimization was designed to avoid in the first place.
So the best solution to this is simply to store a CRC32 of the raw pack
data for each object in the pack index. This way any object in a pack can
be validated before being copied as is in another pack, including header
and any other non deflated data.
Why CRC32 instead of a faster checksum like Adler32? Quoting Wikipedia:
Jonathan Stone discovered in 2001 that Adler-32 has a weakness for very
short messages. He wrote "Briefly, the problem is that, for very short
packets, Adler32 is guaranteed to give poor coverage of the available
bits. Don't take my word for it, ask Mark Adler. :-)" The problem is
that sum A does not wrap for short messages. The maximum value of A for
a 128-byte message is 32640, which is below the value 65521 used by the
modulo operation. An extended explanation can be found in RFC 3309,
which mandates the use of CRC32 instead of Adler-32 for SCTP, the
Stream Control Transmission Protocol.
In the context of a GIT pack, we have lots of small objects, especially
deltas, which are likely to be quite small and in a size range for which
Adler32 is dimed not to be sufficient. Another advantage of CRC32 is the
possibility for recovery from certain types of small corruptions like
single bit errors which are the most probable type of corruptions.
OK what this patch does is to compute the CRC32 of each object written to
a pack within pack-objects. It is not written to the index yet and it is
obviously not validated when reusing pack data yet either.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-09 13:06:31 +08:00
|
|
|
f->do_crc = 0;
|
hashfile: allow skipping the hash function
The hashfile API is useful for generating files that include a trailing
hash of the file's contents up to that point. Using such a hash is
helpful for verifying the file for corruption-at-rest, such as a faulty
drive causing flipped bits.
Git's index file includes this trailing hash, so it uses a 'struct
hashfile' to handle the I/O to the file. This was very convenient to
allow using the hashfile methods during these operations.
However, hashing the file contents during write comes at a performance
penalty. It's slower to hash the bytes on their way to the disk than
without that step. This problem is made worse by the replacement of
hardware-accelerated SHA1 computations with the software-based sha1dc
computation.
This write cost is significant, and the checksum capability is likely
not worth that cost for such a short-lived file. The index is rewritten
frequently and the only time the checksum is checked is during 'git
fsck'. Thus, it would be helpful to allow a user to opt-out of the hash
computation.
We first need to allow Git to opt-out of the hash computation in the
hashfile API. The buffered writes of the API are still helpful, so it
makes sense to make the change here.
Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the
update_fn and final_fn members of the_hash_algo are skipped. When
finalizing the hashfile, the trailing hash is replaced with the null
hash.
This use of a trailing null hash would be desireable in either case,
since we do not want to special case a file format to have a different
length depending on whether it was hashed or not. When the final bytes
of a file are all zero, we can infer that it was written without
hashing, and thus that verification is not available as a check for file
consistency. This also means that we could easily toggle hashing for any
file format we desire.
A version of this patch has existed in the microsoft/git fork since
2017 [1] (the linked commit was rebased in 2018, but the original dates
back to January 2017). Here, the change to make the index use this fast
path is delayed until a later change.
[1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201
Co-authored-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Kevin Willford <kewillf@microsoft.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-07 00:31:53 +08:00
|
|
|
f->skip_hash = 0;
|
csum-file.c: use unsafe SHA-1 implementation when available
Update hashwrite() and friends to use the unsafe_-variants of hashing
functions, calling for e.g., "the_hash_algo->unsafe_update_fn()" instead
of "the_hash_algo->update_fn()".
These callers only use the_hash_algo to produce a checksum, which we
depend on for data integrity, but not for cryptographic purposes, so
these callers are safe to use the unsafe (non-collision detecting) SHA-1
implementation.
To time this, I took a freshly packed copy of linux.git, and ran the
following with and without the OPENSSL_SHA1_UNSAFE=1 build-knob. Both
versions were compiled with -O3:
$ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
$ valgrind --tool=callgrind ~/src/git/git-pack-objects \
--revs --stdout --all-progress --use-bitmap-index <in >/dev/null
Without OPENSSL_SHA1_UNSAFE=1 (that is, using the collision-detecting
SHA-1 implementation for both cryptographic and non-cryptographic
purposes), we spend a significant amount of our instruction count in
hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
159,998,868,413 (79.42%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and the resulting "clone" takes 19.219 seconds of wall clock time,
18.94 seconds of user time and 0.28 seconds of system time.
Compiling with OPENSSL_SHA1_UNSAFE=1, we spend ~60% fewer instructions
in hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
59,164,001,176 (58.79%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and generate the resulting "clone" much faster, in only 11.597 seconds
of wall time, 11.37 seconds of user time, and 0.23 seconds of system
time, for a ~40% speed-up.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-26 23:22:53 +08:00
|
|
|
the_hash_algo->unsafe_init_fn(&f->ctx);
|
csum-file.h: increase hashfile buffer size
The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s. Since our end goal is to have the index writing
code use the hashfile API, we need to unify this buffer size to avoid a
performance regression.
There is a buffer, 'check_buffer', that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. To avoid issues with stack size, move both 'buffer' and
'check_buffer' to be heap pointers within 'struct hashfile'. The
'check_buffer' member is left as NULL unless check_fd is set in
hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
which also frees the full structure.
Since these buffers are now on the heap, we can adjust their size based
on the needs of the consumer. In particular, callers to
hashfd_throughput() are expecting to report progress indicators as the
buffer flushes. These callers would prefer the smaller 8k buffer to
avoid large delays between updates, especially for users with slower
networks. When the progress indicator is not used, the larger buffer is
preferrable.
By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19 02:32:46 +08:00
|
|
|
|
|
|
|
f->buffer_len = buffer_len;
|
|
|
|
f->buffer = xmalloc(buffer_len);
|
|
|
|
f->check_buffer = NULL;
|
|
|
|
|
2005-06-29 02:10:06 +08:00
|
|
|
return f;
|
|
|
|
}
|
|
|
|
|
csum-file.h: increase hashfile buffer size
The hashfile API uses a hard-coded buffer size of 8KB and has ever since
it was introduced in c38138c (git-pack-objects: write the pack files
with a SHA1 csum, 2005-06-26). It performs a similar function to the
hashing buffers in read-cache.c, but that code was updated from 8KB to
128KB in f279894 (read-cache: make the index write buffer size 128K,
2021-02-18). The justification there was that do_write_index() improves
from 1.02s to 0.72s. Since our end goal is to have the index writing
code use the hashfile API, we need to unify this buffer size to avoid a
performance regression.
There is a buffer, 'check_buffer', that is used to verify the check_fd
file descriptor. When this buffer increases to 128K to fit the data
being flushed, it causes the stack to overflow the limits placed in the
test suite. To avoid issues with stack size, move both 'buffer' and
'check_buffer' to be heap pointers within 'struct hashfile'. The
'check_buffer' member is left as NULL unless check_fd is set in
hashfd_check(). Both buffers are cleared as part of finalize_hashfile()
which also frees the full structure.
Since these buffers are now on the heap, we can adjust their size based
on the needs of the consumer. In particular, callers to
hashfd_throughput() are expecting to report progress indicators as the
buffer flushes. These callers would prefer the smaller 8k buffer to
avoid large delays between updates, especially for users with slower
networks. When the progress indicator is not used, the larger buffer is
preferrable.
By adding a new trace2 region in the chunk-format API, we can see that
the writing portion of 'git multi-pack-index write' lowers from ~1.49s
to ~1.47s on a Linux machine. These effects may be more pronounced or
diminished on other filesystems. The end-to-end timing is too noisy to
have a definitive change either way.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19 02:32:46 +08:00
|
|
|
struct hashfile *hashfd(int fd, const char *name)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Since we are not going to use a progress meter to
|
|
|
|
* measure the rate of data passing through this hashfile,
|
|
|
|
* use a larger buffer size to reduce fsync() calls.
|
|
|
|
*/
|
|
|
|
return hashfd_internal(fd, name, NULL, 128 * 1024);
|
|
|
|
}
|
|
|
|
|
|
|
|
struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Since we are expecting to report progress of the
|
|
|
|
* write into this hashfile, use a smaller buffer
|
|
|
|
* size so the progress indicators arrive at a more
|
|
|
|
* frequent rate.
|
|
|
|
*/
|
|
|
|
return hashfd_internal(fd, name, tp, 8 * 1024);
|
|
|
|
}
|
|
|
|
|
2018-02-01 10:18:46 +08:00
|
|
|
void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
|
2011-11-18 08:26:54 +08:00
|
|
|
{
|
2018-02-01 10:18:46 +08:00
|
|
|
hashflush(f);
|
2011-11-18 08:26:54 +08:00
|
|
|
checkpoint->offset = f->total;
|
csum-file.c: use unsafe SHA-1 implementation when available
Update hashwrite() and friends to use the unsafe_-variants of hashing
functions, calling for e.g., "the_hash_algo->unsafe_update_fn()" instead
of "the_hash_algo->update_fn()".
These callers only use the_hash_algo to produce a checksum, which we
depend on for data integrity, but not for cryptographic purposes, so
these callers are safe to use the unsafe (non-collision detecting) SHA-1
implementation.
To time this, I took a freshly packed copy of linux.git, and ran the
following with and without the OPENSSL_SHA1_UNSAFE=1 build-knob. Both
versions were compiled with -O3:
$ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
$ valgrind --tool=callgrind ~/src/git/git-pack-objects \
--revs --stdout --all-progress --use-bitmap-index <in >/dev/null
Without OPENSSL_SHA1_UNSAFE=1 (that is, using the collision-detecting
SHA-1 implementation for both cryptographic and non-cryptographic
purposes), we spend a significant amount of our instruction count in
hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
159,998,868,413 (79.42%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and the resulting "clone" takes 19.219 seconds of wall clock time,
18.94 seconds of user time and 0.28 seconds of system time.
Compiling with OPENSSL_SHA1_UNSAFE=1, we spend ~60% fewer instructions
in hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
59,164,001,176 (58.79%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and generate the resulting "clone" much faster, in only 11.597 seconds
of wall time, 11.37 seconds of user time, and 0.23 seconds of system
time, for a ~40% speed-up.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-26 23:22:53 +08:00
|
|
|
the_hash_algo->unsafe_clone_fn(&checkpoint->ctx, &f->ctx);
|
2011-11-18 08:26:54 +08:00
|
|
|
}
|
|
|
|
|
2018-02-01 10:18:46 +08:00
|
|
|
int hashfile_truncate(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
|
2011-11-18 08:26:54 +08:00
|
|
|
{
|
|
|
|
off_t offset = checkpoint->offset;
|
|
|
|
|
|
|
|
if (ftruncate(f->fd, offset) ||
|
|
|
|
lseek(f->fd, offset, SEEK_SET) != offset)
|
|
|
|
return -1;
|
|
|
|
f->total = offset;
|
csum-file.c: use unsafe SHA-1 implementation when available
Update hashwrite() and friends to use the unsafe_-variants of hashing
functions, calling for e.g., "the_hash_algo->unsafe_update_fn()" instead
of "the_hash_algo->update_fn()".
These callers only use the_hash_algo to produce a checksum, which we
depend on for data integrity, but not for cryptographic purposes, so
these callers are safe to use the unsafe (non-collision detecting) SHA-1
implementation.
To time this, I took a freshly packed copy of linux.git, and ran the
following with and without the OPENSSL_SHA1_UNSAFE=1 build-knob. Both
versions were compiled with -O3:
$ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
$ valgrind --tool=callgrind ~/src/git/git-pack-objects \
--revs --stdout --all-progress --use-bitmap-index <in >/dev/null
Without OPENSSL_SHA1_UNSAFE=1 (that is, using the collision-detecting
SHA-1 implementation for both cryptographic and non-cryptographic
purposes), we spend a significant amount of our instruction count in
hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
159,998,868,413 (79.42%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and the resulting "clone" takes 19.219 seconds of wall clock time,
18.94 seconds of user time and 0.28 seconds of system time.
Compiling with OPENSSL_SHA1_UNSAFE=1, we spend ~60% fewer instructions
in hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
59,164,001,176 (58.79%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and generate the resulting "clone" much faster, in only 11.597 seconds
of wall time, 11.37 seconds of user time, and 0.23 seconds of system
time, for a ~40% speed-up.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-26 23:22:53 +08:00
|
|
|
the_hash_algo->unsafe_clone_fn(&f->ctx, &checkpoint->ctx);
|
2018-02-01 10:18:46 +08:00
|
|
|
f->offset = 0; /* hashflush() was called in checkpoint */
|
2011-11-18 08:26:54 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-02-01 10:18:46 +08:00
|
|
|
void crc32_begin(struct hashfile *f)
|
compute a CRC32 for each object as stored in a pack
The most important optimization for performance when repacking is the
ability to reuse data from a previous pack as is and bypass any delta
or even SHA1 computation by simply copying the raw data from one pack
to another directly.
The problem with this is that any data corruption within a copied object
would go unnoticed and the new (repacked) pack would be self-consistent
with its own checksum despite containing a corrupted object. This is a
real issue that already happened at least once in the past.
In some attempt to prevent this, we validate the copied data by inflating
it and making sure no error is signaled by zlib. But this is still not
perfect as a significant portion of a pack content is made of object
headers and references to delta base objects which are not deflated and
therefore not validated when repacking actually making the pack data reuse
still not as safe as it could be.
Of course a full SHA1 validation could be performed, but that implies
full data inflating and delta replaying which is extremely costly, which
cost the data reuse optimization was designed to avoid in the first place.
So the best solution to this is simply to store a CRC32 of the raw pack
data for each object in the pack index. This way any object in a pack can
be validated before being copied as is in another pack, including header
and any other non deflated data.
Why CRC32 instead of a faster checksum like Adler32? Quoting Wikipedia:
Jonathan Stone discovered in 2001 that Adler-32 has a weakness for very
short messages. He wrote "Briefly, the problem is that, for very short
packets, Adler32 is guaranteed to give poor coverage of the available
bits. Don't take my word for it, ask Mark Adler. :-)" The problem is
that sum A does not wrap for short messages. The maximum value of A for
a 128-byte message is 32640, which is below the value 65521 used by the
modulo operation. An extended explanation can be found in RFC 3309,
which mandates the use of CRC32 instead of Adler-32 for SCTP, the
Stream Control Transmission Protocol.
In the context of a GIT pack, we have lots of small objects, especially
deltas, which are likely to be quite small and in a size range for which
Adler32 is dimed not to be sufficient. Another advantage of CRC32 is the
possibility for recovery from certain types of small corruptions like
single bit errors which are the most probable type of corruptions.
OK what this patch does is to compute the CRC32 of each object written to
a pack within pack-objects. It is not written to the index yet and it is
obviously not validated when reusing pack data yet either.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-09 13:06:31 +08:00
|
|
|
{
|
2011-04-03 15:06:54 +08:00
|
|
|
f->crc32 = crc32(0, NULL, 0);
|
compute a CRC32 for each object as stored in a pack
The most important optimization for performance when repacking is the
ability to reuse data from a previous pack as is and bypass any delta
or even SHA1 computation by simply copying the raw data from one pack
to another directly.
The problem with this is that any data corruption within a copied object
would go unnoticed and the new (repacked) pack would be self-consistent
with its own checksum despite containing a corrupted object. This is a
real issue that already happened at least once in the past.
In some attempt to prevent this, we validate the copied data by inflating
it and making sure no error is signaled by zlib. But this is still not
perfect as a significant portion of a pack content is made of object
headers and references to delta base objects which are not deflated and
therefore not validated when repacking actually making the pack data reuse
still not as safe as it could be.
Of course a full SHA1 validation could be performed, but that implies
full data inflating and delta replaying which is extremely costly, which
cost the data reuse optimization was designed to avoid in the first place.
So the best solution to this is simply to store a CRC32 of the raw pack
data for each object in the pack index. This way any object in a pack can
be validated before being copied as is in another pack, including header
and any other non deflated data.
Why CRC32 instead of a faster checksum like Adler32? Quoting Wikipedia:
Jonathan Stone discovered in 2001 that Adler-32 has a weakness for very
short messages. He wrote "Briefly, the problem is that, for very short
packets, Adler32 is guaranteed to give poor coverage of the available
bits. Don't take my word for it, ask Mark Adler. :-)" The problem is
that sum A does not wrap for short messages. The maximum value of A for
a 128-byte message is 32640, which is below the value 65521 used by the
modulo operation. An extended explanation can be found in RFC 3309,
which mandates the use of CRC32 instead of Adler-32 for SCTP, the
Stream Control Transmission Protocol.
In the context of a GIT pack, we have lots of small objects, especially
deltas, which are likely to be quite small and in a size range for which
Adler32 is dimed not to be sufficient. Another advantage of CRC32 is the
possibility for recovery from certain types of small corruptions like
single bit errors which are the most probable type of corruptions.
OK what this patch does is to compute the CRC32 of each object written to
a pack within pack-objects. It is not written to the index yet and it is
obviously not validated when reusing pack data yet either.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-09 13:06:31 +08:00
|
|
|
f->do_crc = 1;
|
|
|
|
}
|
2005-06-27 11:27:56 +08:00
|
|
|
|
2018-02-01 10:18:46 +08:00
|
|
|
uint32_t crc32_end(struct hashfile *f)
|
compute a CRC32 for each object as stored in a pack
The most important optimization for performance when repacking is the
ability to reuse data from a previous pack as is and bypass any delta
or even SHA1 computation by simply copying the raw data from one pack
to another directly.
The problem with this is that any data corruption within a copied object
would go unnoticed and the new (repacked) pack would be self-consistent
with its own checksum despite containing a corrupted object. This is a
real issue that already happened at least once in the past.
In some attempt to prevent this, we validate the copied data by inflating
it and making sure no error is signaled by zlib. But this is still not
perfect as a significant portion of a pack content is made of object
headers and references to delta base objects which are not deflated and
therefore not validated when repacking actually making the pack data reuse
still not as safe as it could be.
Of course a full SHA1 validation could be performed, but that implies
full data inflating and delta replaying which is extremely costly, which
cost the data reuse optimization was designed to avoid in the first place.
So the best solution to this is simply to store a CRC32 of the raw pack
data for each object in the pack index. This way any object in a pack can
be validated before being copied as is in another pack, including header
and any other non deflated data.
Why CRC32 instead of a faster checksum like Adler32? Quoting Wikipedia:
Jonathan Stone discovered in 2001 that Adler-32 has a weakness for very
short messages. He wrote "Briefly, the problem is that, for very short
packets, Adler32 is guaranteed to give poor coverage of the available
bits. Don't take my word for it, ask Mark Adler. :-)" The problem is
that sum A does not wrap for short messages. The maximum value of A for
a 128-byte message is 32640, which is below the value 65521 used by the
modulo operation. An extended explanation can be found in RFC 3309,
which mandates the use of CRC32 instead of Adler-32 for SCTP, the
Stream Control Transmission Protocol.
In the context of a GIT pack, we have lots of small objects, especially
deltas, which are likely to be quite small and in a size range for which
Adler32 is dimed not to be sufficient. Another advantage of CRC32 is the
possibility for recovery from certain types of small corruptions like
single bit errors which are the most probable type of corruptions.
OK what this patch does is to compute the CRC32 of each object written to
a pack within pack-objects. It is not written to the index yet and it is
obviously not validated when reusing pack data yet either.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-09 13:06:31 +08:00
|
|
|
{
|
|
|
|
f->do_crc = 0;
|
|
|
|
return f->crc32;
|
|
|
|
}
|
2021-06-24 02:39:07 +08:00
|
|
|
|
|
|
|
int hashfile_checksum_valid(const unsigned char *data, size_t total_len)
|
|
|
|
{
|
|
|
|
unsigned char got[GIT_MAX_RAWSZ];
|
|
|
|
git_hash_ctx ctx;
|
|
|
|
size_t data_len = total_len - the_hash_algo->rawsz;
|
|
|
|
|
|
|
|
if (total_len < the_hash_algo->rawsz)
|
|
|
|
return 0; /* say "too short"? */
|
|
|
|
|
csum-file.c: use unsafe SHA-1 implementation when available
Update hashwrite() and friends to use the unsafe_-variants of hashing
functions, calling for e.g., "the_hash_algo->unsafe_update_fn()" instead
of "the_hash_algo->update_fn()".
These callers only use the_hash_algo to produce a checksum, which we
depend on for data integrity, but not for cryptographic purposes, so
these callers are safe to use the unsafe (non-collision detecting) SHA-1
implementation.
To time this, I took a freshly packed copy of linux.git, and ran the
following with and without the OPENSSL_SHA1_UNSAFE=1 build-knob. Both
versions were compiled with -O3:
$ git for-each-ref --format='%(objectname)' refs/heads refs/tags >in
$ valgrind --tool=callgrind ~/src/git/git-pack-objects \
--revs --stdout --all-progress --use-bitmap-index <in >/dev/null
Without OPENSSL_SHA1_UNSAFE=1 (that is, using the collision-detecting
SHA-1 implementation for both cryptographic and non-cryptographic
purposes), we spend a significant amount of our instruction count in
hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
159,998,868,413 (79.42%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and the resulting "clone" takes 19.219 seconds of wall clock time,
18.94 seconds of user time and 0.28 seconds of system time.
Compiling with OPENSSL_SHA1_UNSAFE=1, we spend ~60% fewer instructions
in hashwrite():
$ callgrind_annotate --inclusive=yes | grep hashwrite | head -n1
59,164,001,176 (58.79%) /home/ttaylorr/src/git/csum-file.c:hashwrite [/home/ttaylorr/src/git/git-pack-objects]
, and generate the resulting "clone" much faster, in only 11.597 seconds
of wall time, 11.37 seconds of user time, and 0.23 seconds of system
time, for a ~40% speed-up.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-26 23:22:53 +08:00
|
|
|
the_hash_algo->unsafe_init_fn(&ctx);
|
|
|
|
the_hash_algo->unsafe_update_fn(&ctx, data, data_len);
|
|
|
|
the_hash_algo->unsafe_final_fn(got, &ctx);
|
2021-06-24 02:39:07 +08:00
|
|
|
|
2024-06-14 14:49:50 +08:00
|
|
|
return hasheq(got, data + data_len, the_repository->hash_algo);
|
2021-06-24 02:39:07 +08:00
|
|
|
}
|