git/t/t8014-blame-ignore-fuzzy.sh

438 lines
7.8 KiB
Bash
Raw Normal View History

blame: add a fingerprint heuristic to match ignored lines This algorithm will replace the heuristic used to identify lines from ignored commits with one that finds likely candidate lines in the parent's version of the file. The actual replacement occurs in an upcoming commit. The old heuristic simply assigned lines in the target to the same line number (plus offset) in the parent. The new function uses a fingerprinting algorithm to detect similarity between lines. The new heuristic is designed to accurately match changes made mechanically by formatting tools such as clang-format and clang-tidy. These tools make changes such as breaking up lines to fit within a character limit or changing identifiers to fit with a naming convention. The heuristic is not intended to match more extensive refactoring changes and may give misleading results in such cases. In most cases formatting tools preserve line ordering, so the heuristic is optimised for such cases. (Some types of changes do reorder lines e.g. sorting keep the line content identical, the git blame -M option can already be used to address this). The reason that it is advantageous to rely on ordering is due to source code repeating the same character sequences often e.g. declaring an identifier on one line and using that identifier on several subsequent lines. This means that lines can look very similar to each other which presents a problem when doing fuzzy matching. Relying on ordering gives us extra clues to point towards the true match. The heuristic operates on a single diff chunk change at a time. It creates a “fingerprint” for each line on each side of the change. Fingerprints are described in detail in the comment for `struct fingerprint`, but essentially are a multiset of the character pairs in a line. The heuristic first identifies the line in the target entry whose fingerprint is most clearly matched to a line fingerprint in the parent entry. Where fingerprints match identically, the position of the lines is used as a tie-break. The heuristic locks in the best match, and subtracts the fingerprint of the line in the target entry from the fingerprint of the line in the parent entry to prevent other lines being matched on the same parts of that line. It then repeats the process recursively on the section of the chunk before the match, and then the section of the chunk after the match. Here's an example of the difference the fingerprinting makes. Consider a file with two commits: commit-a 1) void func_1(void *x, void *y); commit-b 2) void func_2(void *x, void *y); After a commit 'X', we have: commit-X 1) void func_1(void *x, commit-X 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); When we blame-ignored with the old algorithm, we get: commit-a 1) void func_1(void *x, commit-b 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); Where commit-b is blamed for 2 instead of 3. With the fingerprint algorithm, we get: commit-a 1) void func_1(void *x, commit-a 2) void *y); commit-b 3) void func_2(void *x, commit-b 4) void *y); Note line 2 could be matched with either commit-a or commit-b as it is equally similar to both lines, but is matched with commit-a because its position as a fraction of the new line range is more similar to commit-a as a fraction of the old line range. Line 4 is also equally similar to both lines, but as it appears after line 3 which will be matched first it cannot be matched with an earlier line. For many more examples, see t/t8014-blame-ignore-fuzzy.sh which contains example parent and target files and the line numbers in the parent that must be matched. Signed-off-by: Michael Platings <michael@platin.gs> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-21 00:38:18 +08:00
#!/bin/sh
test_description='git blame ignore fuzzy heuristic'
. ./test-lib.sh
pick_author='s/^[0-9a-f^]* *(\([^ ]*\) .*/\1/'
# Each test is composed of 4 variables:
# titleN - the test name
# aN - the initial content
# bN - the final content
# expectedN - the line numbers from aN that we expect git blame
# on bN to identify, or "Final" if bN itself should
# be identified as the origin of that line.
# We start at test 2 because setup will show as test 1
title2="Regression test for partially overlapping search ranges"
cat <<EOF >a2
1
2
3
abcdef
5
6
7
ijkl
9
10
11
pqrs
13
14
15
wxyz
17
18
19
EOF
cat <<EOF >b2
abcde
ijk
pqr
wxy
EOF
cat <<EOF >expected2
4
8
12
16
EOF
title3="Combine 3 lines into 2"
cat <<EOF >a3
if ((maxgrow==0) ||
( single_line_field && (field->dcols < maxgrow)) ||
(!single_line_field && (field->drows < maxgrow)))
EOF
cat <<EOF >b3
if ((maxgrow == 0) || (single_line_field && (field->dcols < maxgrow)) ||
(!single_line_field && (field->drows < maxgrow))) {
EOF
cat <<EOF >expected3
2
3
EOF
title4="Add curly brackets"
cat <<EOF >a4
if (rows) *rows = field->rows;
if (cols) *cols = field->cols;
if (frow) *frow = field->frow;
if (fcol) *fcol = field->fcol;
EOF
cat <<EOF >b4
if (rows) {
*rows = field->rows;
}
if (cols) {
*cols = field->cols;
}
if (frow) {
*frow = field->frow;
}
if (fcol) {
*fcol = field->fcol;
}
EOF
cat <<EOF >expected4
1
1
Final
2
2
Final
3
3
Final
4
4
Final
EOF
title5="Combine many lines and change case"
cat <<EOF >a5
for(row=0,pBuffer=field->buf;
row<height;
row++,pBuffer+=width )
{
if ((len = (int)( After_End_Of_Data( pBuffer, width ) - pBuffer )) > 0)
{
wmove( win, row, 0 );
waddnstr( win, pBuffer, len );
EOF
cat <<EOF >b5
for (Row = 0, PBuffer = field->buf; Row < Height; Row++, PBuffer += Width) {
if ((Len = (int)(afterEndOfData(PBuffer, Width) - PBuffer)) > 0) {
wmove(win, Row, 0);
waddnstr(win, PBuffer, Len);
EOF
cat <<EOF >expected5
1
5
7
8
EOF
title6="Rename and combine lines"
cat <<EOF >a6
bool need_visual_update = ((form != (FORM *)0) &&
(form->status & _POSTED) &&
(form->current==field));
if (need_visual_update)
Synchronize_Buffer(form);
if (single_line_field)
{
growth = field->cols * amount;
if (field->maxgrow)
growth = Minimum(field->maxgrow - field->dcols,growth);
field->dcols += growth;
if (field->dcols == field->maxgrow)
EOF
cat <<EOF >b6
bool NeedVisualUpdate = ((Form != (FORM *)0) && (Form->status & _POSTED) &&
(Form->current == field));
if (NeedVisualUpdate) {
synchronizeBuffer(Form);
}
if (SingleLineField) {
Growth = field->cols * amount;
if (field->maxgrow) {
Growth = Minimum(field->maxgrow - field->dcols, Growth);
}
field->dcols += Growth;
if (field->dcols == field->maxgrow) {
EOF
cat <<EOF >expected6
1
3
4
5
6
Final
7
8
10
11
12
Final
13
14
EOF
# Both lines match identically so position must be used to tie-break.
title7="Same line twice"
cat <<EOF >a7
abc
abc
EOF
cat <<EOF >b7
abcd
abcd
EOF
cat <<EOF >expected7
1
2
EOF
title8="Enforce line order"
cat <<EOF >a8
abcdef
ghijkl
ab
EOF
cat <<EOF >b8
ghijk
abcd
EOF
cat <<EOF >expected8
2
3
EOF
title9="Expand lines and rename variables"
cat <<EOF >a9
int myFunction(int ArgumentOne, Thing *ArgTwo, Blah XuglyBug) {
Squiggle FabulousResult = squargle(ArgumentOne, *ArgTwo,
XuglyBug) + EwwwGlobalWithAReallyLongNameYepTooLong;
return FabulousResult * 42;
}
EOF
cat <<EOF >b9
int myFunction(int argument_one, Thing *arg_asdfgh,
Blah xugly_bug) {
Squiggle fabulous_result = squargle(argument_one,
*arg_asdfgh, xugly_bug)
+ g_ewww_global_with_a_really_long_name_yep_too_long;
return fabulous_result * 42;
}
EOF
cat <<EOF >expected9
1
1
2
3
3
4
5
EOF
title10="Two close matches versus one less close match"
cat <<EOF >a10
abcdef
abcdef
ghijkl
EOF
cat <<EOF >b10
gh
abcdefx
EOF
cat <<EOF >expected10
Final
2
EOF
# The first line of b matches best with the last line of a, but the overall
# match is better if we match it with the first line of a.
blame: add a fingerprint heuristic to match ignored lines This algorithm will replace the heuristic used to identify lines from ignored commits with one that finds likely candidate lines in the parent's version of the file. The actual replacement occurs in an upcoming commit. The old heuristic simply assigned lines in the target to the same line number (plus offset) in the parent. The new function uses a fingerprinting algorithm to detect similarity between lines. The new heuristic is designed to accurately match changes made mechanically by formatting tools such as clang-format and clang-tidy. These tools make changes such as breaking up lines to fit within a character limit or changing identifiers to fit with a naming convention. The heuristic is not intended to match more extensive refactoring changes and may give misleading results in such cases. In most cases formatting tools preserve line ordering, so the heuristic is optimised for such cases. (Some types of changes do reorder lines e.g. sorting keep the line content identical, the git blame -M option can already be used to address this). The reason that it is advantageous to rely on ordering is due to source code repeating the same character sequences often e.g. declaring an identifier on one line and using that identifier on several subsequent lines. This means that lines can look very similar to each other which presents a problem when doing fuzzy matching. Relying on ordering gives us extra clues to point towards the true match. The heuristic operates on a single diff chunk change at a time. It creates a “fingerprint” for each line on each side of the change. Fingerprints are described in detail in the comment for `struct fingerprint`, but essentially are a multiset of the character pairs in a line. The heuristic first identifies the line in the target entry whose fingerprint is most clearly matched to a line fingerprint in the parent entry. Where fingerprints match identically, the position of the lines is used as a tie-break. The heuristic locks in the best match, and subtracts the fingerprint of the line in the target entry from the fingerprint of the line in the parent entry to prevent other lines being matched on the same parts of that line. It then repeats the process recursively on the section of the chunk before the match, and then the section of the chunk after the match. Here's an example of the difference the fingerprinting makes. Consider a file with two commits: commit-a 1) void func_1(void *x, void *y); commit-b 2) void func_2(void *x, void *y); After a commit 'X', we have: commit-X 1) void func_1(void *x, commit-X 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); When we blame-ignored with the old algorithm, we get: commit-a 1) void func_1(void *x, commit-b 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); Where commit-b is blamed for 2 instead of 3. With the fingerprint algorithm, we get: commit-a 1) void func_1(void *x, commit-a 2) void *y); commit-b 3) void func_2(void *x, commit-b 4) void *y); Note line 2 could be matched with either commit-a or commit-b as it is equally similar to both lines, but is matched with commit-a because its position as a fraction of the new line range is more similar to commit-a as a fraction of the old line range. Line 4 is also equally similar to both lines, but as it appears after line 3 which will be matched first it cannot be matched with an earlier line. For many more examples, see t/t8014-blame-ignore-fuzzy.sh which contains example parent and target files and the line numbers in the parent that must be matched. Signed-off-by: Michael Platings <michael@platin.gs> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-21 00:38:18 +08:00
title11="Piggy in the middle"
cat <<EOF >a11
abcdefg
ijklmn
abcdefgh
EOF
cat <<EOF >b11
abcdefghx
ijklm
EOF
cat <<EOF >expected11
1
2
EOF
title12="No trailing newline"
printf "abc\ndef" >a12
printf "abx\nstu" >b12
cat <<EOF >expected12
1
Final
EOF
title13="Reorder includes"
cat <<EOF >a13
#include "c.h"
#include "b.h"
#include "a.h"
#include "e.h"
#include "d.h"
EOF
cat <<EOF >b13
#include "a.h"
#include "b.h"
#include "c.h"
#include "d.h"
#include "e.h"
EOF
cat <<EOF >expected13
3
2
1
5
4
EOF
last_test=13
test_expect_success setup '
for i in $(test_seq 2 $last_test)
blame: add a fingerprint heuristic to match ignored lines This algorithm will replace the heuristic used to identify lines from ignored commits with one that finds likely candidate lines in the parent's version of the file. The actual replacement occurs in an upcoming commit. The old heuristic simply assigned lines in the target to the same line number (plus offset) in the parent. The new function uses a fingerprinting algorithm to detect similarity between lines. The new heuristic is designed to accurately match changes made mechanically by formatting tools such as clang-format and clang-tidy. These tools make changes such as breaking up lines to fit within a character limit or changing identifiers to fit with a naming convention. The heuristic is not intended to match more extensive refactoring changes and may give misleading results in such cases. In most cases formatting tools preserve line ordering, so the heuristic is optimised for such cases. (Some types of changes do reorder lines e.g. sorting keep the line content identical, the git blame -M option can already be used to address this). The reason that it is advantageous to rely on ordering is due to source code repeating the same character sequences often e.g. declaring an identifier on one line and using that identifier on several subsequent lines. This means that lines can look very similar to each other which presents a problem when doing fuzzy matching. Relying on ordering gives us extra clues to point towards the true match. The heuristic operates on a single diff chunk change at a time. It creates a “fingerprint” for each line on each side of the change. Fingerprints are described in detail in the comment for `struct fingerprint`, but essentially are a multiset of the character pairs in a line. The heuristic first identifies the line in the target entry whose fingerprint is most clearly matched to a line fingerprint in the parent entry. Where fingerprints match identically, the position of the lines is used as a tie-break. The heuristic locks in the best match, and subtracts the fingerprint of the line in the target entry from the fingerprint of the line in the parent entry to prevent other lines being matched on the same parts of that line. It then repeats the process recursively on the section of the chunk before the match, and then the section of the chunk after the match. Here's an example of the difference the fingerprinting makes. Consider a file with two commits: commit-a 1) void func_1(void *x, void *y); commit-b 2) void func_2(void *x, void *y); After a commit 'X', we have: commit-X 1) void func_1(void *x, commit-X 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); When we blame-ignored with the old algorithm, we get: commit-a 1) void func_1(void *x, commit-b 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); Where commit-b is blamed for 2 instead of 3. With the fingerprint algorithm, we get: commit-a 1) void func_1(void *x, commit-a 2) void *y); commit-b 3) void func_2(void *x, commit-b 4) void *y); Note line 2 could be matched with either commit-a or commit-b as it is equally similar to both lines, but is matched with commit-a because its position as a fraction of the new line range is more similar to commit-a as a fraction of the old line range. Line 4 is also equally similar to both lines, but as it appears after line 3 which will be matched first it cannot be matched with an earlier line. For many more examples, see t/t8014-blame-ignore-fuzzy.sh which contains example parent and target files and the line numbers in the parent that must be matched. Signed-off-by: Michael Platings <michael@platin.gs> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-21 00:38:18 +08:00
do
# Append each line in a separate commit to make it easy to
# check which original line the blame output relates to.
line_count=0 &&
while IFS= read line
blame: add a fingerprint heuristic to match ignored lines This algorithm will replace the heuristic used to identify lines from ignored commits with one that finds likely candidate lines in the parent's version of the file. The actual replacement occurs in an upcoming commit. The old heuristic simply assigned lines in the target to the same line number (plus offset) in the parent. The new function uses a fingerprinting algorithm to detect similarity between lines. The new heuristic is designed to accurately match changes made mechanically by formatting tools such as clang-format and clang-tidy. These tools make changes such as breaking up lines to fit within a character limit or changing identifiers to fit with a naming convention. The heuristic is not intended to match more extensive refactoring changes and may give misleading results in such cases. In most cases formatting tools preserve line ordering, so the heuristic is optimised for such cases. (Some types of changes do reorder lines e.g. sorting keep the line content identical, the git blame -M option can already be used to address this). The reason that it is advantageous to rely on ordering is due to source code repeating the same character sequences often e.g. declaring an identifier on one line and using that identifier on several subsequent lines. This means that lines can look very similar to each other which presents a problem when doing fuzzy matching. Relying on ordering gives us extra clues to point towards the true match. The heuristic operates on a single diff chunk change at a time. It creates a “fingerprint” for each line on each side of the change. Fingerprints are described in detail in the comment for `struct fingerprint`, but essentially are a multiset of the character pairs in a line. The heuristic first identifies the line in the target entry whose fingerprint is most clearly matched to a line fingerprint in the parent entry. Where fingerprints match identically, the position of the lines is used as a tie-break. The heuristic locks in the best match, and subtracts the fingerprint of the line in the target entry from the fingerprint of the line in the parent entry to prevent other lines being matched on the same parts of that line. It then repeats the process recursively on the section of the chunk before the match, and then the section of the chunk after the match. Here's an example of the difference the fingerprinting makes. Consider a file with two commits: commit-a 1) void func_1(void *x, void *y); commit-b 2) void func_2(void *x, void *y); After a commit 'X', we have: commit-X 1) void func_1(void *x, commit-X 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); When we blame-ignored with the old algorithm, we get: commit-a 1) void func_1(void *x, commit-b 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); Where commit-b is blamed for 2 instead of 3. With the fingerprint algorithm, we get: commit-a 1) void func_1(void *x, commit-a 2) void *y); commit-b 3) void func_2(void *x, commit-b 4) void *y); Note line 2 could be matched with either commit-a or commit-b as it is equally similar to both lines, but is matched with commit-a because its position as a fraction of the new line range is more similar to commit-a as a fraction of the old line range. Line 4 is also equally similar to both lines, but as it appears after line 3 which will be matched first it cannot be matched with an earlier line. For many more examples, see t/t8014-blame-ignore-fuzzy.sh which contains example parent and target files and the line numbers in the parent that must be matched. Signed-off-by: Michael Platings <michael@platin.gs> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-21 00:38:18 +08:00
do
line_count=$((line_count+1)) &&
echo "$line" >>"$i" &&
git add "$i" &&
test_tick &&
GIT_AUTHOR_NAME="$line_count" git commit -m "$line_count"
done <"a$i"
done &&
blame: add a fingerprint heuristic to match ignored lines This algorithm will replace the heuristic used to identify lines from ignored commits with one that finds likely candidate lines in the parent's version of the file. The actual replacement occurs in an upcoming commit. The old heuristic simply assigned lines in the target to the same line number (plus offset) in the parent. The new function uses a fingerprinting algorithm to detect similarity between lines. The new heuristic is designed to accurately match changes made mechanically by formatting tools such as clang-format and clang-tidy. These tools make changes such as breaking up lines to fit within a character limit or changing identifiers to fit with a naming convention. The heuristic is not intended to match more extensive refactoring changes and may give misleading results in such cases. In most cases formatting tools preserve line ordering, so the heuristic is optimised for such cases. (Some types of changes do reorder lines e.g. sorting keep the line content identical, the git blame -M option can already be used to address this). The reason that it is advantageous to rely on ordering is due to source code repeating the same character sequences often e.g. declaring an identifier on one line and using that identifier on several subsequent lines. This means that lines can look very similar to each other which presents a problem when doing fuzzy matching. Relying on ordering gives us extra clues to point towards the true match. The heuristic operates on a single diff chunk change at a time. It creates a “fingerprint” for each line on each side of the change. Fingerprints are described in detail in the comment for `struct fingerprint`, but essentially are a multiset of the character pairs in a line. The heuristic first identifies the line in the target entry whose fingerprint is most clearly matched to a line fingerprint in the parent entry. Where fingerprints match identically, the position of the lines is used as a tie-break. The heuristic locks in the best match, and subtracts the fingerprint of the line in the target entry from the fingerprint of the line in the parent entry to prevent other lines being matched on the same parts of that line. It then repeats the process recursively on the section of the chunk before the match, and then the section of the chunk after the match. Here's an example of the difference the fingerprinting makes. Consider a file with two commits: commit-a 1) void func_1(void *x, void *y); commit-b 2) void func_2(void *x, void *y); After a commit 'X', we have: commit-X 1) void func_1(void *x, commit-X 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); When we blame-ignored with the old algorithm, we get: commit-a 1) void func_1(void *x, commit-b 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); Where commit-b is blamed for 2 instead of 3. With the fingerprint algorithm, we get: commit-a 1) void func_1(void *x, commit-a 2) void *y); commit-b 3) void func_2(void *x, commit-b 4) void *y); Note line 2 could be matched with either commit-a or commit-b as it is equally similar to both lines, but is matched with commit-a because its position as a fraction of the new line range is more similar to commit-a as a fraction of the old line range. Line 4 is also equally similar to both lines, but as it appears after line 3 which will be matched first it cannot be matched with an earlier line. For many more examples, see t/t8014-blame-ignore-fuzzy.sh which contains example parent and target files and the line numbers in the parent that must be matched. Signed-off-by: Michael Platings <michael@platin.gs> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-21 00:38:18 +08:00
for i in $(test_seq 2 $last_test)
blame: add a fingerprint heuristic to match ignored lines This algorithm will replace the heuristic used to identify lines from ignored commits with one that finds likely candidate lines in the parent's version of the file. The actual replacement occurs in an upcoming commit. The old heuristic simply assigned lines in the target to the same line number (plus offset) in the parent. The new function uses a fingerprinting algorithm to detect similarity between lines. The new heuristic is designed to accurately match changes made mechanically by formatting tools such as clang-format and clang-tidy. These tools make changes such as breaking up lines to fit within a character limit or changing identifiers to fit with a naming convention. The heuristic is not intended to match more extensive refactoring changes and may give misleading results in such cases. In most cases formatting tools preserve line ordering, so the heuristic is optimised for such cases. (Some types of changes do reorder lines e.g. sorting keep the line content identical, the git blame -M option can already be used to address this). The reason that it is advantageous to rely on ordering is due to source code repeating the same character sequences often e.g. declaring an identifier on one line and using that identifier on several subsequent lines. This means that lines can look very similar to each other which presents a problem when doing fuzzy matching. Relying on ordering gives us extra clues to point towards the true match. The heuristic operates on a single diff chunk change at a time. It creates a “fingerprint” for each line on each side of the change. Fingerprints are described in detail in the comment for `struct fingerprint`, but essentially are a multiset of the character pairs in a line. The heuristic first identifies the line in the target entry whose fingerprint is most clearly matched to a line fingerprint in the parent entry. Where fingerprints match identically, the position of the lines is used as a tie-break. The heuristic locks in the best match, and subtracts the fingerprint of the line in the target entry from the fingerprint of the line in the parent entry to prevent other lines being matched on the same parts of that line. It then repeats the process recursively on the section of the chunk before the match, and then the section of the chunk after the match. Here's an example of the difference the fingerprinting makes. Consider a file with two commits: commit-a 1) void func_1(void *x, void *y); commit-b 2) void func_2(void *x, void *y); After a commit 'X', we have: commit-X 1) void func_1(void *x, commit-X 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); When we blame-ignored with the old algorithm, we get: commit-a 1) void func_1(void *x, commit-b 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); Where commit-b is blamed for 2 instead of 3. With the fingerprint algorithm, we get: commit-a 1) void func_1(void *x, commit-a 2) void *y); commit-b 3) void func_2(void *x, commit-b 4) void *y); Note line 2 could be matched with either commit-a or commit-b as it is equally similar to both lines, but is matched with commit-a because its position as a fraction of the new line range is more similar to commit-a as a fraction of the old line range. Line 4 is also equally similar to both lines, but as it appears after line 3 which will be matched first it cannot be matched with an earlier line. For many more examples, see t/t8014-blame-ignore-fuzzy.sh which contains example parent and target files and the line numbers in the parent that must be matched. Signed-off-by: Michael Platings <michael@platin.gs> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-21 00:38:18 +08:00
do
# Overwrite the files with the final content.
cp b$i $i &&
git add $i
done &&
blame: add a fingerprint heuristic to match ignored lines This algorithm will replace the heuristic used to identify lines from ignored commits with one that finds likely candidate lines in the parent's version of the file. The actual replacement occurs in an upcoming commit. The old heuristic simply assigned lines in the target to the same line number (plus offset) in the parent. The new function uses a fingerprinting algorithm to detect similarity between lines. The new heuristic is designed to accurately match changes made mechanically by formatting tools such as clang-format and clang-tidy. These tools make changes such as breaking up lines to fit within a character limit or changing identifiers to fit with a naming convention. The heuristic is not intended to match more extensive refactoring changes and may give misleading results in such cases. In most cases formatting tools preserve line ordering, so the heuristic is optimised for such cases. (Some types of changes do reorder lines e.g. sorting keep the line content identical, the git blame -M option can already be used to address this). The reason that it is advantageous to rely on ordering is due to source code repeating the same character sequences often e.g. declaring an identifier on one line and using that identifier on several subsequent lines. This means that lines can look very similar to each other which presents a problem when doing fuzzy matching. Relying on ordering gives us extra clues to point towards the true match. The heuristic operates on a single diff chunk change at a time. It creates a “fingerprint” for each line on each side of the change. Fingerprints are described in detail in the comment for `struct fingerprint`, but essentially are a multiset of the character pairs in a line. The heuristic first identifies the line in the target entry whose fingerprint is most clearly matched to a line fingerprint in the parent entry. Where fingerprints match identically, the position of the lines is used as a tie-break. The heuristic locks in the best match, and subtracts the fingerprint of the line in the target entry from the fingerprint of the line in the parent entry to prevent other lines being matched on the same parts of that line. It then repeats the process recursively on the section of the chunk before the match, and then the section of the chunk after the match. Here's an example of the difference the fingerprinting makes. Consider a file with two commits: commit-a 1) void func_1(void *x, void *y); commit-b 2) void func_2(void *x, void *y); After a commit 'X', we have: commit-X 1) void func_1(void *x, commit-X 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); When we blame-ignored with the old algorithm, we get: commit-a 1) void func_1(void *x, commit-b 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); Where commit-b is blamed for 2 instead of 3. With the fingerprint algorithm, we get: commit-a 1) void func_1(void *x, commit-a 2) void *y); commit-b 3) void func_2(void *x, commit-b 4) void *y); Note line 2 could be matched with either commit-a or commit-b as it is equally similar to both lines, but is matched with commit-a because its position as a fraction of the new line range is more similar to commit-a as a fraction of the old line range. Line 4 is also equally similar to both lines, but as it appears after line 3 which will be matched first it cannot be matched with an earlier line. For many more examples, see t/t8014-blame-ignore-fuzzy.sh which contains example parent and target files and the line numbers in the parent that must be matched. Signed-off-by: Michael Platings <michael@platin.gs> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-21 00:38:18 +08:00
test_tick &&
# Commit the final content all at once so it can all be
# referred to with the same commit ID.
GIT_AUTHOR_NAME=Final git commit -m Final &&
IGNOREME=$(git rev-parse HEAD)
'
for i in $(test_seq 2 $last_test); do
eval title="\$title$i"
test_expect_success "$title" \
"git blame -M9 --ignore-rev $IGNOREME $i >output &&
sed -e \"$pick_author\" output >actual &&
test_cmp expected$i actual"
done
# This invoked a null pointer dereference when the chunk callback was called
# with a zero length parent chunk and there were no more suspects.
test_expect_success 'Diff chunks with no suspects' '
test_write_lines xy1 A B C xy1 >file &&
git add file &&
test_tick &&
GIT_AUTHOR_NAME=1 git commit -m 1 &&
test_write_lines xy2 A B xy2 C xy2 >file &&
git add file &&
test_tick &&
GIT_AUTHOR_NAME=2 git commit -m 2 &&
REV_2=$(git rev-parse HEAD) &&
test_write_lines xy3 A >file &&
git add file &&
test_tick &&
GIT_AUTHOR_NAME=3 git commit -m 3 &&
REV_3=$(git rev-parse HEAD) &&
test_write_lines 1 1 >expected &&
git blame --ignore-rev $REV_2 --ignore-rev $REV_3 file >output &&
sed -e "$pick_author" output >actual &&
test_cmp expected actual
'
test_expect_success 'position matching' '
test_write_lines abc def >file2 &&
git add file2 &&
test_tick &&
GIT_AUTHOR_NAME=1 git commit -m 1 &&
test_write_lines abc def abc def >file2 &&
git add file2 &&
test_tick &&
GIT_AUTHOR_NAME=2 git commit -m 2 &&
test_write_lines abcx defx abcx defx >file2 &&
git add file2 &&
test_tick &&
GIT_AUTHOR_NAME=3 git commit -m 3 &&
REV_3=$(git rev-parse HEAD) &&
test_write_lines abcy defy abcx defx >file2 &&
git add file2 &&
test_tick &&
GIT_AUTHOR_NAME=4 git commit -m 4 &&
REV_4=$(git rev-parse HEAD) &&
test_write_lines 1 1 2 2 >expected &&
git blame --ignore-rev $REV_3 --ignore-rev $REV_4 file2 >output &&
sed -e "$pick_author" output >actual &&
test_cmp expected actual
'
# This fails if each blame entry is processed independently instead of
# processing each diff change in full.
test_expect_success 'preserve order' '
test_write_lines bcde >file3 &&
git add file3 &&
test_tick &&
GIT_AUTHOR_NAME=1 git commit -m 1 &&
test_write_lines bcde fghij >file3 &&
git add file3 &&
test_tick &&
GIT_AUTHOR_NAME=2 git commit -m 2 &&
test_write_lines bcde fghij abcd >file3 &&
git add file3 &&
test_tick &&
GIT_AUTHOR_NAME=3 git commit -m 3 &&
test_write_lines abcdx fghijx bcdex >file3 &&
git add file3 &&
test_tick &&
GIT_AUTHOR_NAME=4 git commit -m 4 &&
REV_4=$(git rev-parse HEAD) &&
test_write_lines abcdx fghijy bcdex >file3 &&
git add file3 &&
test_tick &&
GIT_AUTHOR_NAME=5 git commit -m 5 &&
REV_5=$(git rev-parse HEAD) &&
test_write_lines 1 2 3 >expected &&
git blame --ignore-rev $REV_4 --ignore-rev $REV_5 file3 >output &&
sed -e "$pick_author" output >actual &&
test_cmp expected actual
'
test_done