Add has_symlink_leading_path() function.
When we are applying a patch that creates a blob at a path, or
when we are switching from a branch that does not have a blob at
the path to another branch that has one, we need to make sure
that there is nothing at the path in the working tree, as such a
file is a local modification made by the user that would be lost
by the operation.
Normally, lstat() on the path and making sure ENOENT is returned
is good enough for that purpose. However there is a twist. We
may be creating a regular file arch/x86_64/boot/Makefile, while
removing an existing symbolic link at arch/x86_64/boot that
points at existing ../i386/boot directory that has Makefile in
it. We always first check without touching filesystem and then
perform the actual operation, so when we verify the new file,
arch/x86_64/boot/Makefile, does not exist, we haven't removed
the symbolic link arc/x86_64/boot symbolic link yet. lstat() on
the file sees through the symbolic link and reports the file is
there, which is not what we want.
The function has_symlink_leading_path() function takes a path,
and sees if any of the leading directory component is a symbolic
link.
When files in a new directory are created, we tend to process
them together because both index and tree are sorted. The
function takes advantage of this and allows the caller to cache
and reuse which symbolic link on the filesystem caused the
function to return true.
The calling sequence would be:
char last_symlink[PATH_MAX];
*last_symlink = '\0';
for each index entry {
if (!lose)
continue;
if (lstat(it))
if (errno == ENOENT)
; /* happy */
else
error;
else if (has_symlink_leading_path(it, last_symlink))
; /* happy */
else
error; /* would lose local changes */
unlink_entry(it, last_symlink);
}
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-12 13:11:07 +08:00
|
|
|
#include "cache.h"
|
|
|
|
|
2012-09-16 13:38:28 +08:00
|
|
|
static int threaded_check_leading_path(struct cache_def *cache, const char *name, int len);
|
|
|
|
static int threaded_has_dirs_only_path(struct cache_def *cache, const char *name, int len, int prefix_len);
|
|
|
|
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
/*
|
|
|
|
* Returns the length (on a path component basis) of the longest
|
2009-02-10 04:54:05 +08:00
|
|
|
* common prefix match of 'name_a' and 'name_b'.
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
*/
|
2009-02-10 04:54:05 +08:00
|
|
|
static int longest_path_match(const char *name_a, int len_a,
|
|
|
|
const char *name_b, int len_b,
|
|
|
|
int *previous_slash)
|
Optimize symlink/directory detection
This is the base for making symlink detection in the middle fo a pathname
saner and (much) more efficient.
Under various loads, we want to verify that the full path leading up to a
filename is a real directory tree, and that when we successfully do an
'lstat()' on a filename, we don't get a false positive due to a symlink in
the middle of the path that git should have seen as a symlink, not as a
normal path component.
The 'has_symlink_leading_path()' function already did this, and cached
a single level of symlink information, but didn't cache the _lack_ of a
symlink, so the normal behaviour was actually the wrong way around, and we
ended up doing an 'lstat()' on each path component to check that it was a
real directory.
This caches the last detected full directory and symlink entries, and
speeds up especially deep directory structures a lot by avoiding to
lstat() all the directories leading up to each entry in the index.
[ This can - and should - probably be extended upon so that we eventually
never do a bare 'lstat()' on any path entries at *all* when checking the
index, but always check the full path carefully. Right now we do not
generally check the whole path for all our normal quick index
revalidation.
We should also make sure that we're careful about all the invalidation,
ie when we remove a link and replace it by a directory we should
invalidate the symlink cache if it matches (and vice versa for the
directory cache).
But regardless, the basic function needs to be sane to do that. The old
'has_symlink_leading_path()' was not capable enough - or indeed the code
readable enough - to really do that sanely. So I'm pushing this as not
just an optimization, but as a base for further work. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-05-10 00:21:07 +08:00
|
|
|
{
|
2009-01-18 23:14:53 +08:00
|
|
|
int max_len, match_len = 0, match_len_prev = 0, i = 0;
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
|
2009-02-10 04:54:05 +08:00
|
|
|
max_len = len_a < len_b ? len_a : len_b;
|
|
|
|
while (i < max_len && name_a[i] == name_b[i]) {
|
|
|
|
if (name_a[i] == '/') {
|
2009-01-18 23:14:53 +08:00
|
|
|
match_len_prev = match_len;
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
match_len = i;
|
2009-01-18 23:14:53 +08:00
|
|
|
}
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
i++;
|
|
|
|
}
|
2009-02-10 04:54:04 +08:00
|
|
|
/*
|
2009-02-10 04:54:05 +08:00
|
|
|
* Is 'name_b' a substring of 'name_a', the other way around,
|
|
|
|
* or is 'name_a' and 'name_b' the exact same string?
|
2009-02-10 04:54:04 +08:00
|
|
|
*/
|
2009-02-10 04:54:05 +08:00
|
|
|
if (i >= max_len && ((len_a > len_b && name_a[len_b] == '/') ||
|
|
|
|
(len_a < len_b && name_b[len_a] == '/') ||
|
|
|
|
(len_a == len_b))) {
|
2009-01-18 23:14:53 +08:00
|
|
|
match_len_prev = match_len;
|
2009-02-10 04:54:04 +08:00
|
|
|
match_len = i;
|
2009-01-18 23:14:53 +08:00
|
|
|
}
|
|
|
|
*previous_slash = match_len_prev;
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
return match_len;
|
Optimize symlink/directory detection
This is the base for making symlink detection in the middle fo a pathname
saner and (much) more efficient.
Under various loads, we want to verify that the full path leading up to a
filename is a real directory tree, and that when we successfully do an
'lstat()' on a filename, we don't get a false positive due to a symlink in
the middle of the path that git should have seen as a symlink, not as a
normal path component.
The 'has_symlink_leading_path()' function already did this, and cached
a single level of symlink information, but didn't cache the _lack_ of a
symlink, so the normal behaviour was actually the wrong way around, and we
ended up doing an 'lstat()' on each path component to check that it was a
real directory.
This caches the last detected full directory and symlink entries, and
speeds up especially deep directory structures a lot by avoiding to
lstat() all the directories leading up to each entry in the index.
[ This can - and should - probably be extended upon so that we eventually
never do a bare 'lstat()' on any path entries at *all* when checking the
index, but always check the full path carefully. Right now we do not
generally check the whole path for all our normal quick index
revalidation.
We should also make sure that we're careful about all the invalidation,
ie when we remove a link and replace it by a directory we should
invalidate the symlink cache if it matches (and vice versa for the
directory cache).
But regardless, the basic function needs to be sane to do that. The old
'has_symlink_leading_path()' was not capable enough - or indeed the code
readable enough - to really do that sanely. So I'm pushing this as not
just an optimization, but as a base for further work. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-05-10 00:21:07 +08:00
|
|
|
}
|
|
|
|
|
2014-07-05 06:41:46 +08:00
|
|
|
static struct cache_def default_cache = CACHE_DEF_INIT;
|
2009-02-10 04:54:05 +08:00
|
|
|
|
2009-07-10 04:23:59 +08:00
|
|
|
static inline void reset_lstat_cache(struct cache_def *cache)
|
Optimize symlink/directory detection
This is the base for making symlink detection in the middle fo a pathname
saner and (much) more efficient.
Under various loads, we want to verify that the full path leading up to a
filename is a real directory tree, and that when we successfully do an
'lstat()' on a filename, we don't get a false positive due to a symlink in
the middle of the path that git should have seen as a symlink, not as a
normal path component.
The 'has_symlink_leading_path()' function already did this, and cached
a single level of symlink information, but didn't cache the _lack_ of a
symlink, so the normal behaviour was actually the wrong way around, and we
ended up doing an 'lstat()' on each path component to check that it was a
real directory.
This caches the last detected full directory and symlink entries, and
speeds up especially deep directory structures a lot by avoiding to
lstat() all the directories leading up to each entry in the index.
[ This can - and should - probably be extended upon so that we eventually
never do a bare 'lstat()' on any path entries at *all* when checking the
index, but always check the full path carefully. Right now we do not
generally check the whole path for all our normal quick index
revalidation.
We should also make sure that we're careful about all the invalidation,
ie when we remove a link and replace it by a directory we should
invalidate the symlink cache if it matches (and vice versa for the
directory cache).
But regardless, the basic function needs to be sane to do that. The old
'has_symlink_leading_path()' was not capable enough - or indeed the code
readable enough - to really do that sanely. So I'm pushing this as not
just an optimization, but as a base for further work. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-05-10 00:21:07 +08:00
|
|
|
{
|
2014-07-05 06:41:46 +08:00
|
|
|
strbuf_reset(&cache->path);
|
2009-07-10 04:23:59 +08:00
|
|
|
cache->flags = 0;
|
2009-02-10 04:54:04 +08:00
|
|
|
/*
|
|
|
|
* The track_flags and prefix_len_stat_func members is only
|
|
|
|
* set by the safeguard rule inside lstat_cache()
|
|
|
|
*/
|
Optimize symlink/directory detection
This is the base for making symlink detection in the middle fo a pathname
saner and (much) more efficient.
Under various loads, we want to verify that the full path leading up to a
filename is a real directory tree, and that when we successfully do an
'lstat()' on a filename, we don't get a false positive due to a symlink in
the middle of the path that git should have seen as a symlink, not as a
normal path component.
The 'has_symlink_leading_path()' function already did this, and cached
a single level of symlink information, but didn't cache the _lack_ of a
symlink, so the normal behaviour was actually the wrong way around, and we
ended up doing an 'lstat()' on each path component to check that it was a
real directory.
This caches the last detected full directory and symlink entries, and
speeds up especially deep directory structures a lot by avoiding to
lstat() all the directories leading up to each entry in the index.
[ This can - and should - probably be extended upon so that we eventually
never do a bare 'lstat()' on any path entries at *all* when checking the
index, but always check the full path carefully. Right now we do not
generally check the whole path for all our normal quick index
revalidation.
We should also make sure that we're careful about all the invalidation,
ie when we remove a link and replace it by a directory we should
invalidate the symlink cache if it matches (and vice versa for the
directory cache).
But regardless, the basic function needs to be sane to do that. The old
'has_symlink_leading_path()' was not capable enough - or indeed the code
readable enough - to really do that sanely. So I'm pushing this as not
just an optimization, but as a base for further work. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-05-10 00:21:07 +08:00
|
|
|
}
|
|
|
|
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
#define FL_DIR (1 << 0)
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
#define FL_NOENT (1 << 1)
|
|
|
|
#define FL_SYMLINK (1 << 2)
|
|
|
|
#define FL_LSTATERR (1 << 3)
|
|
|
|
#define FL_ERR (1 << 4)
|
2009-01-18 23:14:52 +08:00
|
|
|
#define FL_FULLPATH (1 << 5)
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Check if name 'name' of length 'len' has a symlink leading
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
* component, or if the directory exists and is real, or not.
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
*
|
|
|
|
* To speed up the check, some information is allowed to be cached.
|
2009-01-18 23:14:52 +08:00
|
|
|
* This can be indicated by the 'track_flags' argument, which also can
|
|
|
|
* be used to indicate that we should check the full path.
|
|
|
|
*
|
|
|
|
* The 'prefix_len_stat_func' parameter can be used to set the length
|
|
|
|
* of the prefix, where the cache should use the stat() function
|
|
|
|
* instead of the lstat() function to test each path component.
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
*/
|
2010-10-09 21:52:59 +08:00
|
|
|
static int lstat_cache_matchlen(struct cache_def *cache,
|
|
|
|
const char *name, int len,
|
|
|
|
int *ret_flags, int track_flags,
|
|
|
|
int prefix_len_stat_func)
|
Optimize symlink/directory detection
This is the base for making symlink detection in the middle fo a pathname
saner and (much) more efficient.
Under various loads, we want to verify that the full path leading up to a
filename is a real directory tree, and that when we successfully do an
'lstat()' on a filename, we don't get a false positive due to a symlink in
the middle of the path that git should have seen as a symlink, not as a
normal path component.
The 'has_symlink_leading_path()' function already did this, and cached
a single level of symlink information, but didn't cache the _lack_ of a
symlink, so the normal behaviour was actually the wrong way around, and we
ended up doing an 'lstat()' on each path component to check that it was a
real directory.
This caches the last detected full directory and symlink entries, and
speeds up especially deep directory structures a lot by avoiding to
lstat() all the directories leading up to each entry in the index.
[ This can - and should - probably be extended upon so that we eventually
never do a bare 'lstat()' on any path entries at *all* when checking the
index, but always check the full path carefully. Right now we do not
generally check the whole path for all our normal quick index
revalidation.
We should also make sure that we're careful about all the invalidation,
ie when we remove a link and replace it by a directory we should
invalidate the symlink cache if it matches (and vice versa for the
directory cache).
But regardless, the basic function needs to be sane to do that. The old
'has_symlink_leading_path()' was not capable enough - or indeed the code
readable enough - to really do that sanely. So I'm pushing this as not
just an optimization, but as a base for further work. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-05-10 00:21:07 +08:00
|
|
|
{
|
2009-01-18 23:14:53 +08:00
|
|
|
int match_len, last_slash, last_slash_dir, previous_slash;
|
2014-07-05 06:41:46 +08:00
|
|
|
int save_flags, ret;
|
Optimize symlink/directory detection
This is the base for making symlink detection in the middle fo a pathname
saner and (much) more efficient.
Under various loads, we want to verify that the full path leading up to a
filename is a real directory tree, and that when we successfully do an
'lstat()' on a filename, we don't get a false positive due to a symlink in
the middle of the path that git should have seen as a symlink, not as a
normal path component.
The 'has_symlink_leading_path()' function already did this, and cached
a single level of symlink information, but didn't cache the _lack_ of a
symlink, so the normal behaviour was actually the wrong way around, and we
ended up doing an 'lstat()' on each path component to check that it was a
real directory.
This caches the last detected full directory and symlink entries, and
speeds up especially deep directory structures a lot by avoiding to
lstat() all the directories leading up to each entry in the index.
[ This can - and should - probably be extended upon so that we eventually
never do a bare 'lstat()' on any path entries at *all* when checking the
index, but always check the full path carefully. Right now we do not
generally check the whole path for all our normal quick index
revalidation.
We should also make sure that we're careful about all the invalidation,
ie when we remove a link and replace it by a directory we should
invalidate the symlink cache if it matches (and vice versa for the
directory cache).
But regardless, the basic function needs to be sane to do that. The old
'has_symlink_leading_path()' was not capable enough - or indeed the code
readable enough - to really do that sanely. So I'm pushing this as not
just an optimization, but as a base for further work. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-05-10 00:21:07 +08:00
|
|
|
struct stat st;
|
Add has_symlink_leading_path() function.
When we are applying a patch that creates a blob at a path, or
when we are switching from a branch that does not have a blob at
the path to another branch that has one, we need to make sure
that there is nothing at the path in the working tree, as such a
file is a local modification made by the user that would be lost
by the operation.
Normally, lstat() on the path and making sure ENOENT is returned
is good enough for that purpose. However there is a twist. We
may be creating a regular file arch/x86_64/boot/Makefile, while
removing an existing symbolic link at arch/x86_64/boot that
points at existing ../i386/boot directory that has Makefile in
it. We always first check without touching filesystem and then
perform the actual operation, so when we verify the new file,
arch/x86_64/boot/Makefile, does not exist, we haven't removed
the symbolic link arc/x86_64/boot symbolic link yet. lstat() on
the file sees through the symbolic link and reports the file is
there, which is not what we want.
The function has_symlink_leading_path() function takes a path,
and sees if any of the leading directory component is a symbolic
link.
When files in a new directory are created, we tend to process
them together because both index and tree are sorted. The
function takes advantage of this and allows the caller to cache
and reuse which symbolic link on the filesystem caused the
function to return true.
The calling sequence would be:
char last_symlink[PATH_MAX];
*last_symlink = '\0';
for each index entry {
if (!lose)
continue;
if (lstat(it))
if (errno == ENOENT)
; /* happy */
else
error;
else if (has_symlink_leading_path(it, last_symlink))
; /* happy */
else
error; /* would lose local changes */
unlink_entry(it, last_symlink);
}
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-12 13:11:07 +08:00
|
|
|
|
2009-07-10 04:23:59 +08:00
|
|
|
if (cache->track_flags != track_flags ||
|
|
|
|
cache->prefix_len_stat_func != prefix_len_stat_func) {
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
/*
|
2009-02-10 04:54:04 +08:00
|
|
|
* As a safeguard rule we clear the cache if the
|
|
|
|
* values of track_flags and/or prefix_len_stat_func
|
|
|
|
* does not match with the last supplied values.
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
*/
|
2009-07-10 04:23:59 +08:00
|
|
|
reset_lstat_cache(cache);
|
|
|
|
cache->track_flags = track_flags;
|
|
|
|
cache->prefix_len_stat_func = prefix_len_stat_func;
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
match_len = last_slash = 0;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Check to see if we have a match from the cache for
|
|
|
|
* the 2 "excluding" path types.
|
|
|
|
*/
|
2009-01-18 23:14:53 +08:00
|
|
|
match_len = last_slash =
|
2014-07-05 06:41:46 +08:00
|
|
|
longest_path_match(name, len, cache->path.buf,
|
|
|
|
cache->path.len, &previous_slash);
|
2010-10-09 21:52:59 +08:00
|
|
|
*ret_flags = cache->flags & track_flags & (FL_NOENT|FL_SYMLINK);
|
2009-06-14 21:08:28 +08:00
|
|
|
|
|
|
|
if (!(track_flags & FL_FULLPATH) && match_len == len)
|
|
|
|
match_len = last_slash = previous_slash;
|
|
|
|
|
2014-07-05 06:41:46 +08:00
|
|
|
if (*ret_flags && match_len == cache->path.len)
|
2010-10-09 21:52:59 +08:00
|
|
|
return match_len;
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
/*
|
|
|
|
* If we now have match_len > 0, we would know that
|
|
|
|
* the matched part will always be a directory.
|
|
|
|
*
|
|
|
|
* Also, if we are tracking directories and 'name' is
|
|
|
|
* a substring of the cache on a path component basis,
|
|
|
|
* we can return immediately.
|
|
|
|
*/
|
2010-10-09 21:52:59 +08:00
|
|
|
*ret_flags = track_flags & FL_DIR;
|
|
|
|
if (*ret_flags && len == match_len)
|
|
|
|
return match_len;
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
}
|
Optimize symlink/directory detection
This is the base for making symlink detection in the middle fo a pathname
saner and (much) more efficient.
Under various loads, we want to verify that the full path leading up to a
filename is a real directory tree, and that when we successfully do an
'lstat()' on a filename, we don't get a false positive due to a symlink in
the middle of the path that git should have seen as a symlink, not as a
normal path component.
The 'has_symlink_leading_path()' function already did this, and cached
a single level of symlink information, but didn't cache the _lack_ of a
symlink, so the normal behaviour was actually the wrong way around, and we
ended up doing an 'lstat()' on each path component to check that it was a
real directory.
This caches the last detected full directory and symlink entries, and
speeds up especially deep directory structures a lot by avoiding to
lstat() all the directories leading up to each entry in the index.
[ This can - and should - probably be extended upon so that we eventually
never do a bare 'lstat()' on any path entries at *all* when checking the
index, but always check the full path carefully. Right now we do not
generally check the whole path for all our normal quick index
revalidation.
We should also make sure that we're careful about all the invalidation,
ie when we remove a link and replace it by a directory we should
invalidate the symlink cache if it matches (and vice versa for the
directory cache).
But regardless, the basic function needs to be sane to do that. The old
'has_symlink_leading_path()' was not capable enough - or indeed the code
readable enough - to really do that sanely. So I'm pushing this as not
just an optimization, but as a base for further work. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-05-10 00:21:07 +08:00
|
|
|
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
/*
|
|
|
|
* Okay, no match from the cache so far, so now we have to
|
|
|
|
* check the rest of the path components.
|
|
|
|
*/
|
2010-10-09 21:52:59 +08:00
|
|
|
*ret_flags = FL_DIR;
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
last_slash_dir = last_slash;
|
2014-07-05 06:41:46 +08:00
|
|
|
if (len > cache->path.len)
|
|
|
|
strbuf_grow(&cache->path, len - cache->path.len);
|
|
|
|
while (match_len < len) {
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
do {
|
2014-07-05 06:41:46 +08:00
|
|
|
cache->path.buf[match_len] = name[match_len];
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
match_len++;
|
2014-07-05 06:41:46 +08:00
|
|
|
} while (match_len < len && name[match_len] != '/');
|
|
|
|
if (match_len >= len && !(track_flags & FL_FULLPATH))
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
break;
|
|
|
|
last_slash = match_len;
|
2014-07-05 06:41:46 +08:00
|
|
|
cache->path.buf[last_slash] = '\0';
|
Add has_symlink_leading_path() function.
When we are applying a patch that creates a blob at a path, or
when we are switching from a branch that does not have a blob at
the path to another branch that has one, we need to make sure
that there is nothing at the path in the working tree, as such a
file is a local modification made by the user that would be lost
by the operation.
Normally, lstat() on the path and making sure ENOENT is returned
is good enough for that purpose. However there is a twist. We
may be creating a regular file arch/x86_64/boot/Makefile, while
removing an existing symbolic link at arch/x86_64/boot that
points at existing ../i386/boot directory that has Makefile in
it. We always first check without touching filesystem and then
perform the actual operation, so when we verify the new file,
arch/x86_64/boot/Makefile, does not exist, we haven't removed
the symbolic link arc/x86_64/boot symbolic link yet. lstat() on
the file sees through the symbolic link and reports the file is
there, which is not what we want.
The function has_symlink_leading_path() function takes a path,
and sees if any of the leading directory component is a symbolic
link.
When files in a new directory are created, we tend to process
them together because both index and tree are sorted. The
function takes advantage of this and allows the caller to cache
and reuse which symbolic link on the filesystem caused the
function to return true.
The calling sequence would be:
char last_symlink[PATH_MAX];
*last_symlink = '\0';
for each index entry {
if (!lose)
continue;
if (lstat(it))
if (errno == ENOENT)
; /* happy */
else
error;
else if (has_symlink_leading_path(it, last_symlink))
; /* happy */
else
error; /* would lose local changes */
unlink_entry(it, last_symlink);
}
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-12 13:11:07 +08:00
|
|
|
|
2009-01-18 23:14:52 +08:00
|
|
|
if (last_slash <= prefix_len_stat_func)
|
2014-07-05 06:41:46 +08:00
|
|
|
ret = stat(cache->path.buf, &st);
|
2009-01-18 23:14:52 +08:00
|
|
|
else
|
2014-07-05 06:41:46 +08:00
|
|
|
ret = lstat(cache->path.buf, &st);
|
2009-01-18 23:14:52 +08:00
|
|
|
|
|
|
|
if (ret) {
|
2010-10-09 21:52:59 +08:00
|
|
|
*ret_flags = FL_LSTATERR;
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
if (errno == ENOENT)
|
2010-10-09 21:52:59 +08:00
|
|
|
*ret_flags |= FL_NOENT;
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
} else if (S_ISDIR(st.st_mode)) {
|
|
|
|
last_slash_dir = last_slash;
|
Optimize symlink/directory detection
This is the base for making symlink detection in the middle fo a pathname
saner and (much) more efficient.
Under various loads, we want to verify that the full path leading up to a
filename is a real directory tree, and that when we successfully do an
'lstat()' on a filename, we don't get a false positive due to a symlink in
the middle of the path that git should have seen as a symlink, not as a
normal path component.
The 'has_symlink_leading_path()' function already did this, and cached
a single level of symlink information, but didn't cache the _lack_ of a
symlink, so the normal behaviour was actually the wrong way around, and we
ended up doing an 'lstat()' on each path component to check that it was a
real directory.
This caches the last detected full directory and symlink entries, and
speeds up especially deep directory structures a lot by avoiding to
lstat() all the directories leading up to each entry in the index.
[ This can - and should - probably be extended upon so that we eventually
never do a bare 'lstat()' on any path entries at *all* when checking the
index, but always check the full path carefully. Right now we do not
generally check the whole path for all our normal quick index
revalidation.
We should also make sure that we're careful about all the invalidation,
ie when we remove a link and replace it by a directory we should
invalidate the symlink cache if it matches (and vice versa for the
directory cache).
But regardless, the basic function needs to be sane to do that. The old
'has_symlink_leading_path()' was not capable enough - or indeed the code
readable enough - to really do that sanely. So I'm pushing this as not
just an optimization, but as a base for further work. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-05-10 00:21:07 +08:00
|
|
|
continue;
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
} else if (S_ISLNK(st.st_mode)) {
|
2010-10-09 21:52:59 +08:00
|
|
|
*ret_flags = FL_SYMLINK;
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
} else {
|
2010-10-09 21:52:59 +08:00
|
|
|
*ret_flags = FL_ERR;
|
Add has_symlink_leading_path() function.
When we are applying a patch that creates a blob at a path, or
when we are switching from a branch that does not have a blob at
the path to another branch that has one, we need to make sure
that there is nothing at the path in the working tree, as such a
file is a local modification made by the user that would be lost
by the operation.
Normally, lstat() on the path and making sure ENOENT is returned
is good enough for that purpose. However there is a twist. We
may be creating a regular file arch/x86_64/boot/Makefile, while
removing an existing symbolic link at arch/x86_64/boot that
points at existing ../i386/boot directory that has Makefile in
it. We always first check without touching filesystem and then
perform the actual operation, so when we verify the new file,
arch/x86_64/boot/Makefile, does not exist, we haven't removed
the symbolic link arc/x86_64/boot symbolic link yet. lstat() on
the file sees through the symbolic link and reports the file is
there, which is not what we want.
The function has_symlink_leading_path() function takes a path,
and sees if any of the leading directory component is a symbolic
link.
When files in a new directory are created, we tend to process
them together because both index and tree are sorted. The
function takes advantage of this and allows the caller to cache
and reuse which symbolic link on the filesystem caused the
function to return true.
The calling sequence would be:
char last_symlink[PATH_MAX];
*last_symlink = '\0';
for each index entry {
if (!lose)
continue;
if (lstat(it))
if (errno == ENOENT)
; /* happy */
else
error;
else if (has_symlink_leading_path(it, last_symlink))
; /* happy */
else
error; /* would lose local changes */
unlink_entry(it, last_symlink);
}
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-12 13:11:07 +08:00
|
|
|
}
|
Optimize symlink/directory detection
This is the base for making symlink detection in the middle fo a pathname
saner and (much) more efficient.
Under various loads, we want to verify that the full path leading up to a
filename is a real directory tree, and that when we successfully do an
'lstat()' on a filename, we don't get a false positive due to a symlink in
the middle of the path that git should have seen as a symlink, not as a
normal path component.
The 'has_symlink_leading_path()' function already did this, and cached
a single level of symlink information, but didn't cache the _lack_ of a
symlink, so the normal behaviour was actually the wrong way around, and we
ended up doing an 'lstat()' on each path component to check that it was a
real directory.
This caches the last detected full directory and symlink entries, and
speeds up especially deep directory structures a lot by avoiding to
lstat() all the directories leading up to each entry in the index.
[ This can - and should - probably be extended upon so that we eventually
never do a bare 'lstat()' on any path entries at *all* when checking the
index, but always check the full path carefully. Right now we do not
generally check the whole path for all our normal quick index
revalidation.
We should also make sure that we're careful about all the invalidation,
ie when we remove a link and replace it by a directory we should
invalidate the symlink cache if it matches (and vice versa for the
directory cache).
But regardless, the basic function needs to be sane to do that. The old
'has_symlink_leading_path()' was not capable enough - or indeed the code
readable enough - to really do that sanely. So I'm pushing this as not
just an optimization, but as a base for further work. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-05-10 00:21:07 +08:00
|
|
|
break;
|
Add has_symlink_leading_path() function.
When we are applying a patch that creates a blob at a path, or
when we are switching from a branch that does not have a blob at
the path to another branch that has one, we need to make sure
that there is nothing at the path in the working tree, as such a
file is a local modification made by the user that would be lost
by the operation.
Normally, lstat() on the path and making sure ENOENT is returned
is good enough for that purpose. However there is a twist. We
may be creating a regular file arch/x86_64/boot/Makefile, while
removing an existing symbolic link at arch/x86_64/boot that
points at existing ../i386/boot directory that has Makefile in
it. We always first check without touching filesystem and then
perform the actual operation, so when we verify the new file,
arch/x86_64/boot/Makefile, does not exist, we haven't removed
the symbolic link arc/x86_64/boot symbolic link yet. lstat() on
the file sees through the symbolic link and reports the file is
there, which is not what we want.
The function has_symlink_leading_path() function takes a path,
and sees if any of the leading directory component is a symbolic
link.
When files in a new directory are created, we tend to process
them together because both index and tree are sorted. The
function takes advantage of this and allows the caller to cache
and reuse which symbolic link on the filesystem caused the
function to return true.
The calling sequence would be:
char last_symlink[PATH_MAX];
*last_symlink = '\0';
for each index entry {
if (!lose)
continue;
if (lstat(it))
if (errno == ENOENT)
; /* happy */
else
error;
else if (has_symlink_leading_path(it, last_symlink))
; /* happy */
else
error; /* would lose local changes */
unlink_entry(it, last_symlink);
}
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-12 13:11:07 +08:00
|
|
|
}
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
|
|
|
|
/*
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
* At the end update the cache. Note that max 3 different
|
|
|
|
* path types, FL_NOENT, FL_SYMLINK and FL_DIR, can be cached
|
|
|
|
* for the moment!
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
*/
|
2010-10-09 21:52:59 +08:00
|
|
|
save_flags = *ret_flags & track_flags & (FL_NOENT|FL_SYMLINK);
|
2014-07-05 06:41:46 +08:00
|
|
|
if (save_flags && last_slash > 0) {
|
|
|
|
cache->path.buf[last_slash] = '\0';
|
|
|
|
cache->path.len = last_slash;
|
2009-07-10 04:23:59 +08:00
|
|
|
cache->flags = save_flags;
|
2014-07-05 06:41:46 +08:00
|
|
|
} else if ((track_flags & FL_DIR) && last_slash_dir > 0) {
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
/*
|
|
|
|
* We have a separate test for the directory case,
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
* since it could be that we have found a symlink or a
|
|
|
|
* non-existing directory and the track_flags says
|
|
|
|
* that we cannot cache this fact, so the cache would
|
|
|
|
* then have been left empty in this case.
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
*
|
|
|
|
* But if we are allowed to track real directories, we
|
|
|
|
* can still cache the path components before the last
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
* one (the found symlink or non-existing component).
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
*/
|
2014-07-05 06:41:46 +08:00
|
|
|
cache->path.buf[last_slash_dir] = '\0';
|
|
|
|
cache->path.len = last_slash_dir;
|
2009-07-10 04:23:59 +08:00
|
|
|
cache->flags = FL_DIR;
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
} else {
|
2009-07-10 04:23:59 +08:00
|
|
|
reset_lstat_cache(cache);
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
}
|
2010-10-09 21:52:59 +08:00
|
|
|
return match_len;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int lstat_cache(struct cache_def *cache, const char *name, int len,
|
|
|
|
int track_flags, int prefix_len_stat_func)
|
|
|
|
{
|
|
|
|
int flags;
|
|
|
|
(void)lstat_cache_matchlen(cache, name, len, &flags, track_flags,
|
|
|
|
prefix_len_stat_func);
|
|
|
|
return flags;
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
}
|
|
|
|
|
2009-01-18 23:14:52 +08:00
|
|
|
#define USE_ONLY_LSTAT 0
|
|
|
|
|
2009-07-10 04:35:31 +08:00
|
|
|
/*
|
|
|
|
* Return non-zero if path 'name' has a leading symlink component
|
|
|
|
*/
|
|
|
|
int threaded_has_symlink_leading_path(struct cache_def *cache, const char *name, int len)
|
|
|
|
{
|
|
|
|
return lstat_cache(cache, name, len, FL_SYMLINK|FL_DIR, USE_ONLY_LSTAT) & FL_SYMLINK;
|
|
|
|
}
|
|
|
|
|
2009-02-10 04:54:06 +08:00
|
|
|
int has_symlink_leading_path(const char *name, int len)
|
lstat_cache(): more cache effective symlink/directory detection
Make the cache functionality more effective. Previously when A/B/C/D
was in the cache and A/B/C/E/file.c was called for, there was no match
at all from the cache. Now we use the fact that the paths "A", "A/B"
and "A/B/C" are already tested, and we only need to do an lstat() call
on "A/B/C/E".
We only cache/store the last path regardless of its type. Since the
cache functionality is always used with alphabetically sorted names
(at least it seems so for me), there is no need to store both the last
symlink-leading path and the last real-directory path. Note that if
the cache is not called with (mostly) alphabetically sorted names,
neither the old, nor this new one, would be very effective.
Previously, when symlink A/B/C/S was cached/stored in the symlink-
leading path, and A/B/C/file.c was called for, it was not easy to use
the fact that we already knew that the paths "A", "A/B" and "A/B/C"
are real directories.
Avoid copying the first path components of the name 2 zillion times
when we test new path components. Since we always cache/store the
last path, we can copy each component as we test those directly into
the cache. Previously we ended up doing a memcpy() for the full
path/name right before each lstat() call, and when updating the cache
for each time we have tested a new path component.
We also use less memory, that is, PATH_MAX bytes less memory on the
stack and PATH_MAX bytes less memory on the heap.
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:50 +08:00
|
|
|
{
|
2009-07-10 04:35:31 +08:00
|
|
|
return threaded_has_symlink_leading_path(&default_cache, name, len);
|
Add has_symlink_leading_path() function.
When we are applying a patch that creates a blob at a path, or
when we are switching from a branch that does not have a blob at
the path to another branch that has one, we need to make sure
that there is nothing at the path in the working tree, as such a
file is a local modification made by the user that would be lost
by the operation.
Normally, lstat() on the path and making sure ENOENT is returned
is good enough for that purpose. However there is a twist. We
may be creating a regular file arch/x86_64/boot/Makefile, while
removing an existing symbolic link at arch/x86_64/boot that
points at existing ../i386/boot directory that has Makefile in
it. We always first check without touching filesystem and then
perform the actual operation, so when we verify the new file,
arch/x86_64/boot/Makefile, does not exist, we haven't removed
the symbolic link arc/x86_64/boot symbolic link yet. lstat() on
the file sees through the symbolic link and reports the file is
there, which is not what we want.
The function has_symlink_leading_path() function takes a path,
and sees if any of the leading directory component is a symbolic
link.
When files in a new directory are created, we tend to process
them together because both index and tree are sorted. The
function takes advantage of this and allows the caller to cache
and reuse which symbolic link on the filesystem caused the
function to return true.
The calling sequence would be:
char last_symlink[PATH_MAX];
*last_symlink = '\0';
for each index entry {
if (!lose)
continue;
if (lstat(it))
if (errno == ENOENT)
; /* happy */
else
error;
else if (has_symlink_leading_path(it, last_symlink))
; /* happy */
else
error; /* would lose local changes */
unlink_entry(it, last_symlink);
}
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-12 13:11:07 +08:00
|
|
|
}
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
|
2010-10-09 21:53:00 +08:00
|
|
|
int check_leading_path(const char *name, int len)
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
{
|
2018-12-06 23:42:06 +08:00
|
|
|
return threaded_check_leading_path(&default_cache, name, len);
|
2012-03-03 10:31:15 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2021-03-19 02:43:46 +08:00
|
|
|
* Return zero if some leading path component of 'name' does not exist.
|
2012-03-03 10:31:15 +08:00
|
|
|
*
|
|
|
|
* Return -1 if leading path exists and is a directory.
|
|
|
|
*
|
2021-03-19 02:43:46 +08:00
|
|
|
* Return the length of a leading component if it either exists but it's not a
|
|
|
|
* directory, or if we were unable to lstat() it.
|
2012-03-03 10:31:15 +08:00
|
|
|
*/
|
2012-09-16 13:38:28 +08:00
|
|
|
static int threaded_check_leading_path(struct cache_def *cache, const char *name, int len)
|
2012-03-03 10:31:15 +08:00
|
|
|
{
|
2010-10-09 21:53:00 +08:00
|
|
|
int flags;
|
|
|
|
int match_len = lstat_cache_matchlen(cache, name, len, &flags,
|
|
|
|
FL_SYMLINK|FL_NOENT|FL_DIR, USE_ONLY_LSTAT);
|
2011-02-20 20:13:43 +08:00
|
|
|
if (flags & FL_NOENT)
|
2010-10-09 21:53:00 +08:00
|
|
|
return 0;
|
|
|
|
else if (flags & FL_DIR)
|
|
|
|
return -1;
|
|
|
|
else
|
|
|
|
return match_len;
|
lstat_cache(): introduce has_symlink_or_noent_leading_path() function
In some cases, especially inside the unpack-trees.c file, and inside
the verify_absent() function, we can avoid some unnecessary calls to
lstat(), if the lstat_cache() function can also be told to keep track
of non-existing directories.
So we update the lstat_cache() function to handle this new fact,
introduce a new wrapper function, and the result is that we save lots
of lstat() calls for a removed directory which previously contained
lots of files, when we call this new wrapper of lstat_cache() instead
of the old one.
We do similar changes inside the unlink_entry() function, since if we
can already say that the leading directory component of a pathname
does not exist, it is not necessary to try to remove a pathname below
it!
Thanks to Junio C Hamano, Linus Torvalds and Rene Scharfe for valuable
comments to this patch!
Signed-off-by: Kjetil Barvik <barvik@broadpark.no>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-18 23:14:51 +08:00
|
|
|
}
|
2009-01-18 23:14:52 +08:00
|
|
|
|
2009-02-10 04:54:06 +08:00
|
|
|
int has_dirs_only_path(const char *name, int len, int prefix_len)
|
2009-01-18 23:14:52 +08:00
|
|
|
{
|
2012-03-03 10:31:15 +08:00
|
|
|
return threaded_has_dirs_only_path(&default_cache, name, len, prefix_len);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return non-zero if all path components of 'name' exists as a
|
|
|
|
* directory. If prefix_len > 0, we will test with the stat()
|
|
|
|
* function instead of the lstat() function for a prefix length of
|
|
|
|
* 'prefix_len', thus we then allow for symlinks in the prefix part as
|
|
|
|
* long as those points to real existing directories.
|
|
|
|
*/
|
2012-09-16 13:38:28 +08:00
|
|
|
static int threaded_has_dirs_only_path(struct cache_def *cache, const char *name, int len, int prefix_len)
|
2012-03-03 10:31:15 +08:00
|
|
|
{
|
checkout: fix bug that makes checkout follow symlinks in leading path
Before checking out a file, we have to confirm that all of its leading
components are real existing directories. And to reduce the number of
lstat() calls in this process, we cache the last leading path known to
contain only directories. However, when a path collision occurs (e.g.
when checking out case-sensitive files in case-insensitive file
systems), a cached path might have its file type changed on disk,
leaving the cache on an invalid state. Normally, this doesn't bring
any bad consequences as we usually check out files in index order, and
therefore, by the time the cached path becomes outdated, we no longer
need it anyway (because all files in that directory would have already
been written).
But, there are some users of the checkout machinery that do not always
follow the index order. In particular: checkout-index writes the paths
in the same order that they appear on the CLI (or stdin); and the
delayed checkout feature -- used when a long-running filter process
replies with "status=delayed" -- postpones the checkout of some entries,
thus modifying the checkout order.
When we have to check out an out-of-order entry and the lstat() cache is
invalid (due to a previous path collision), checkout_entry() may end up
using the invalid data and thrusting that the leading components are
real directories when, in reality, they are not. In the best case
scenario, where the directory was replaced by a regular file, the user
will get an error: "fatal: unable to create file 'foo/bar': Not a
directory". But if the directory was replaced by a symlink, checkout
could actually end up following the symlink and writing the file at a
wrong place, even outside the repository. Since delayed checkout is
affected by this bug, it could be used by an attacker to write
arbitrary files during the clone of a maliciously crafted repository.
Some candidate solutions considered were to disable the lstat() cache
during unordered checkouts or sort the entries before passing them to
the checkout machinery. But both ideas include some performance penalty
and they don't future-proof the code against new unordered use cases.
Instead, we now manually reset the lstat cache whenever we successfully
remove a directory. Note: We are not even checking whether the directory
was the same as the lstat cache points to because we might face a
scenario where the paths refer to the same location but differ due to
case folding, precomposed UTF-8 issues, or the presence of `..`
components in the path. Two regression tests, with case-collisions and
utf8-collisions, are also added for both checkout-index and delayed
checkout.
Note: to make the previously mentioned clone attack unfeasible, it would
be sufficient to reset the lstat cache only after the remove_subtree()
call inside checkout_entry(). This is the place where we would remove a
directory whose path collides with the path of another entry that we are
currently trying to check out (possibly a symlink). However, in the
interest of a thorough fix that does not leave Git open to
similar-but-not-identical attack vectors, we decided to intercept
all `rmdir()` calls in one fell swoop.
This addresses CVE-2021-21300.
Co-authored-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
2020-12-10 21:27:55 +08:00
|
|
|
/*
|
|
|
|
* Note: this function is used by the checkout machinery, which also
|
|
|
|
* takes care to properly reset the cache when it performs an operation
|
|
|
|
* that would leave the cache outdated. If this function starts caching
|
|
|
|
* anything else besides FL_DIR, remember to also invalidate the cache
|
|
|
|
* when creating or deleting paths that might be in the cache.
|
|
|
|
*/
|
2009-07-10 04:23:59 +08:00
|
|
|
return lstat_cache(cache, name, len,
|
2009-01-18 23:14:52 +08:00
|
|
|
FL_DIR|FL_FULLPATH, prefix_len) &
|
|
|
|
FL_DIR;
|
|
|
|
}
|
2009-02-10 04:54:07 +08:00
|
|
|
|
2014-07-05 06:41:46 +08:00
|
|
|
static struct strbuf removal = STRBUF_INIT;
|
2009-02-10 04:54:07 +08:00
|
|
|
|
|
|
|
static void do_remove_scheduled_dirs(int new_len)
|
|
|
|
{
|
|
|
|
while (removal.len > new_len) {
|
2014-07-05 06:41:46 +08:00
|
|
|
removal.buf[removal.len] = '\0';
|
|
|
|
if (rmdir(removal.buf))
|
2009-02-10 04:54:07 +08:00
|
|
|
break;
|
|
|
|
do {
|
|
|
|
removal.len--;
|
|
|
|
} while (removal.len > new_len &&
|
2014-07-05 06:41:46 +08:00
|
|
|
removal.buf[removal.len] != '/');
|
2009-02-10 04:54:07 +08:00
|
|
|
}
|
|
|
|
removal.len = new_len;
|
|
|
|
}
|
|
|
|
|
|
|
|
void schedule_dir_for_removal(const char *name, int len)
|
|
|
|
{
|
|
|
|
int match_len, last_slash, i, previous_slash;
|
|
|
|
|
|
|
|
match_len = last_slash = i =
|
2014-07-05 06:41:46 +08:00
|
|
|
longest_path_match(name, len, removal.buf, removal.len,
|
2009-02-10 04:54:07 +08:00
|
|
|
&previous_slash);
|
|
|
|
/* Find last slash inside 'name' */
|
|
|
|
while (i < len) {
|
|
|
|
if (name[i] == '/')
|
|
|
|
last_slash = i;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are about to go down the directory tree, we check if
|
|
|
|
* we must first go upwards the tree, such that we then can
|
|
|
|
* remove possible empty directories as we go upwards.
|
|
|
|
*/
|
|
|
|
if (match_len < last_slash && match_len < removal.len)
|
|
|
|
do_remove_scheduled_dirs(match_len);
|
|
|
|
/*
|
|
|
|
* If we go deeper down the directory tree, we only need to
|
|
|
|
* save the new path components as we go down.
|
|
|
|
*/
|
2014-07-05 06:41:46 +08:00
|
|
|
if (match_len < last_slash)
|
|
|
|
strbuf_add(&removal, &name[match_len], last_slash - match_len);
|
2009-02-10 04:54:07 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
void remove_scheduled_dirs(void)
|
|
|
|
{
|
|
|
|
do_remove_scheduled_dirs(0);
|
|
|
|
}
|
checkout: fix bug that makes checkout follow symlinks in leading path
Before checking out a file, we have to confirm that all of its leading
components are real existing directories. And to reduce the number of
lstat() calls in this process, we cache the last leading path known to
contain only directories. However, when a path collision occurs (e.g.
when checking out case-sensitive files in case-insensitive file
systems), a cached path might have its file type changed on disk,
leaving the cache on an invalid state. Normally, this doesn't bring
any bad consequences as we usually check out files in index order, and
therefore, by the time the cached path becomes outdated, we no longer
need it anyway (because all files in that directory would have already
been written).
But, there are some users of the checkout machinery that do not always
follow the index order. In particular: checkout-index writes the paths
in the same order that they appear on the CLI (or stdin); and the
delayed checkout feature -- used when a long-running filter process
replies with "status=delayed" -- postpones the checkout of some entries,
thus modifying the checkout order.
When we have to check out an out-of-order entry and the lstat() cache is
invalid (due to a previous path collision), checkout_entry() may end up
using the invalid data and thrusting that the leading components are
real directories when, in reality, they are not. In the best case
scenario, where the directory was replaced by a regular file, the user
will get an error: "fatal: unable to create file 'foo/bar': Not a
directory". But if the directory was replaced by a symlink, checkout
could actually end up following the symlink and writing the file at a
wrong place, even outside the repository. Since delayed checkout is
affected by this bug, it could be used by an attacker to write
arbitrary files during the clone of a maliciously crafted repository.
Some candidate solutions considered were to disable the lstat() cache
during unordered checkouts or sort the entries before passing them to
the checkout machinery. But both ideas include some performance penalty
and they don't future-proof the code against new unordered use cases.
Instead, we now manually reset the lstat cache whenever we successfully
remove a directory. Note: We are not even checking whether the directory
was the same as the lstat cache points to because we might face a
scenario where the paths refer to the same location but differ due to
case folding, precomposed UTF-8 issues, or the presence of `..`
components in the path. Two regression tests, with case-collisions and
utf8-collisions, are also added for both checkout-index and delayed
checkout.
Note: to make the previously mentioned clone attack unfeasible, it would
be sufficient to reset the lstat cache only after the remove_subtree()
call inside checkout_entry(). This is the place where we would remove a
directory whose path collides with the path of another entry that we are
currently trying to check out (possibly a symlink). However, in the
interest of a thorough fix that does not leave Git open to
similar-but-not-identical attack vectors, we decided to intercept
all `rmdir()` calls in one fell swoop.
This addresses CVE-2021-21300.
Co-authored-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
2020-12-10 21:27:55 +08:00
|
|
|
|
|
|
|
void invalidate_lstat_cache(void)
|
|
|
|
{
|
|
|
|
reset_lstat_cache(&default_cache);
|
|
|
|
}
|
|
|
|
|
|
|
|
#undef rmdir
|
|
|
|
int lstat_cache_aware_rmdir(const char *path)
|
|
|
|
{
|
|
|
|
/* Any change in this function must be made also in `mingw_rmdir()` */
|
|
|
|
int ret = rmdir(path);
|
|
|
|
|
|
|
|
if (!ret)
|
|
|
|
invalidate_lstat_cache();
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|