* [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection
@ 2021-02-06 22:52 Elijah Newren via GitGitGadget
  2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
                   ` (4 more replies)
  0 siblings, 5 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-06 22:52 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren

This series depends on ort-perf-batch-6[1], which has not yet appeared in
seen despite being reviewed by both Junio and Stolee.

This series uses file basenames in a basic fashion to guide rename
detection. It represents "Optimization #3" from my Git Merge 2020 talk[2],
and is based on the fact that real world repositories tend to have a large
majority of the renames they have done in history be ones that do not affect
the basenames of the renamed files (in other words, they are simply moving
files into different directories). For the testcases mentioned in commit
557ac0350d ("merge-ort: begin performance work; instrument with
trace2_region_* calls", 2020-10-28), the changes in just this series
improve the performance as follows:

                     Before Series           After Series
no-renames:       13.815 s ±  0.062 s    13.138 s ±  0.086 s
mega-renames:   1799.937 s ±  0.493 s   169.488 s ±  0.494 s
just-one-mega:    51.289 s ±  0.019 s     5.061 s ±  0.017 s
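
As a rough reading aid for the table above, the speedup factor is just the
"before" time divided by the "after" time. The helper below is a hypothetical
illustration (the name speedup() is not part of Git or of this series):

```c
#include <assert.h>

/*
 * Hypothetical helper: how many times faster the "after" run is
 * compared to the "before" run.  Both arguments are in seconds.
 */
double speedup(double before_s, double after_s)
{
	assert(after_s > 0.0);
	return before_s / after_s;
}
```

With the mean timings from the table, speedup(1799.937, 169.488) comes out a
little above 10.6 for mega-renames, speedup(51.289, 5.061) a little above 10.1
for just-one-mega, and speedup(13.815, 13.138) about 1.05 for no-renames.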


As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with (as noted in the same commit message)
were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


[1] https://lore.kernel.org/git/xmqqlfc4byt6.fsf@gitster.c.googlers.com/
[2] https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

Elijah Newren (3):
  diffcore-rename: compute basenames of all source and dest candidates
  diffcore-rename: complete find_basename_matches()
  diffcore-rename: guide inexact rename detection based on basenames

 diffcore-rename.c | 181 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 177 insertions(+), 4 deletions(-)


base-commit: 7ae9460d3dba84122c2674b46e4339b9d42bdedd
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-843%2Fnewren%2Fort-perf-batch-7-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/843
-- 
gitgitgadget


* [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates
  2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
@ 2021-02-06 22:52 ` Elijah Newren via GitGitGadget
  2021-02-06 22:52 ` [PATCH 2/3] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-06 22:52 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

We want to make use of unique basenames to help inform rename detection,
so that more likely pairings can be checked first.  Add a new function,
not yet used, which creates a map of the unique basenames within
rename_src and another within rename_dst, together with the indices
within rename_src/rename_dst where those basenames show up.  Non-unique
basenames still show up in the map, but have an invalid index (-1).
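
To make the idea concrete outside of Git, here is a minimal standalone sketch
of such a map. It is an illustration only: it uses a linear-scan table instead
of Git's strintmap, and the names basename_entry, build_basename_table(),
lookup_basename() and NOT_FOUND are hypothetical. Only the -1 sentinel for
non-unique basenames mirrors the patch.

```c
#include <assert.h>
#include <string.h>

#define NOT_UNIQUE (-1)	/* same sentinel value the patch stores */
#define NOT_FOUND  (-2)	/* hypothetical: basename is not in the table */

struct basename_entry {
	const char *base;	/* points into the caller's path string */
	int index;		/* index into paths[], or NOT_UNIQUE */
};

/* Return the part of path after the last '/', or path itself. */
const char *path_basename(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/*
 * Fill table[] with one entry per distinct basename in paths[0..n-1].
 * table must have room for n entries.  Returns the number of entries used.
 */
int build_basename_table(const char **paths, int n,
			 struct basename_entry *table)
{
	int used = 0;

	for (int i = 0; i < n; i++) {
		const char *base = path_basename(paths[i]);
		int j;

		for (j = 0; j < used; j++)
			if (!strcmp(table[j].base, base))
				break;
		if (j < used) {
			/* seen before: keep the key, invalidate the index */
			table[j].index = NOT_UNIQUE;
		} else {
			table[used].base = base;
			table[used].index = i;
			used++;
		}
	}
	return used;
}

/* Look up a basename; returns its index, NOT_UNIQUE, or NOT_FOUND. */
int lookup_basename(const struct basename_entry *table, int used,
		    const char *base)
{
	for (int j = 0; j < used; j++)
		if (!strcmp(table[j].base, base))
			return table[j].index;
	return NOT_FOUND;
}
```

The linear scan makes this quadratic in the number of paths, which is fine for
a sketch; the patch relies on a hash-based map for the same bookkeeping.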

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 74930716e70d..1c52077b04e5 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,59 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
+MAYBE_UNUSED
+static int find_basename_matches(struct diff_options *options,
+				 int minimum_score,
+				 int num_src)
+{
+	int i;
+	struct strintmap sources;
+	struct strintmap dests;
+
+	/* Create maps of basename -> fullname(s) for sources and dests */
+	strintmap_init_with_options(&sources, -1, NULL, 0);
+	strintmap_init_with_options(&dests, -1, NULL, 0);
+	for (i = 0; i < num_src; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		char *base;
+
+		/* exact renames removed in remove_unneeded_paths_from_src() */
+		assert(!rename_src[i].p->one->rename_used);
+
+		base = strrchr(filename, '/');
+		base = (base ? base+1 : filename);
+
+		/* Record index within rename_src (i) if basename is unique */
+		if (strintmap_contains(&sources, base))
+			strintmap_set(&sources, base, -1);
+		else
+			strintmap_set(&sources, base, i);
+	}
+	for (i = 0; i < rename_dst_nr; ++i) {
+		char *filename = rename_dst[i].p->two->path;
+		char *base;
+
+		if (rename_dst[i].is_rename)
+			continue; /* involved in exact match already. */
+
+		base = strrchr(filename, '/');
+		base = (base ? base+1 : filename);
+
+		/* Record index within rename_dst (i) if basename is unique */
+		if (strintmap_contains(&dests, base))
+			strintmap_set(&dests, base, -1);
+		else
+			strintmap_set(&dests, base, i);
+	}
+
+	/* TODO: Make use of basenames source and destination basenames */
+
+	strintmap_clear(&sources);
+	strintmap_clear(&dests);
+
+	return 0;
+}
+
 #define NUM_CANDIDATE_PER_DST 4
 static void record_if_better(struct diff_score m[], struct diff_score *o)
 {
-- 
gitgitgadget



* [PATCH 2/3] diffcore-rename: complete find_basename_matches()
  2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
  2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
@ 2021-02-06 22:52 ` Elijah Newren via GitGitGadget
  2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-06 22:52 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories.  We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames.  If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.
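
The preliminary pass described above can be sketched in standalone C as
follows. This is a simplified illustration under stated assumptions, not the
code in this patch: struct file, find_unique() and match_by_basename() are
hypothetical names, toy_similarity() is a crude stand-in for Git's
estimate_similarity(), and the sketch ignores destinations already claimed by
exact renames.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct file {
	const char *path;
	const char *content;
};

/* Return the part of path after the last '/', or path itself. */
const char *base_of(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/* Toy score: percentage of byte positions at which a and b agree. */
int toy_similarity(const char *a, const char *b)
{
	size_t la = strlen(a), lb = strlen(b);
	size_t longer = la > lb ? la : lb;
	size_t shorter = la < lb ? la : lb;
	size_t same = 0;

	if (!longer)
		return 100;
	for (size_t i = 0; i < shorter; i++)
		if (a[i] == b[i])
			same++;
	return (int)(same * 100 / longer);
}

/* Index of the only file with this basename, or -1 if none or several. */
int find_unique(const struct file *files, int n, const char *base)
{
	int found = -1;

	for (int i = 0; i < n; i++) {
		if (strcmp(base_of(files[i].path), base) != 0)
			continue;
		if (found >= 0)
			return -1;	/* second hit: not unique */
		found = i;
	}
	return found;
}

/*
 * For each source whose basename is unique among both sources and
 * destinations, score only that one pair.  match[i] receives the
 * destination index, or -1 if source i is left for the exhaustive
 * comparison.  Returns the number of pairs recorded.
 */
int match_by_basename(const struct file *srcs, int nsrc,
		      const struct file *dsts, int ndst,
		      int minimum_score, int *match)
{
	int renames = 0;

	assert(nsrc >= 0 && ndst >= 0);
	for (int i = 0; i < nsrc; i++) {
		const char *base = base_of(srcs[i].path);
		int d;

		match[i] = -1;
		if (find_unique(srcs, nsrc, base) != i)
			continue;	/* basename shared by several sources */
		d = find_unique(dsts, ndst, base);
		if (d < 0)
			continue;	/* no dest, or several, with this name */
		if (toy_similarity(srcs[i].content, dsts[d].content) <
		    minimum_score)
			continue;	/* not similar enough; try matrix later */
		match[i] = d;
		renames++;
	}
	return renames;
}
```

Every source still marked -1 afterwards, together with every unclaimed
destination, would then go through the usual all-pairs comparison.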

Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename.  I think
that is okay for four reasons: (1) That seems somewhat unlikely in
practice, (2) it's easy to explain to the users what happened if it does
ever occur (or even for them to intuitively figure out), and (3) as the
next patch will show it provides such a large performance boost that
it's worth the tradeoff.  Reason (4) takes a full paragraph to
explain...

If the previous three reasons aren't enough, consider what rename
detection already does.  Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar.  In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity.  Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents.  This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well.  Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.

We do not use this heuristic together with either break or copy
detection.  The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity.  The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities.  So the idea behind this
optimization goes against both of those features, and will be turned off
for both.

Note that this optimization is not yet effective in any situation, as
the function is still unused.  The next commit will hook it into the
code so that it is used when rename detection is wanted, but neither
copy nor break detection is.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 94 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 91 insertions(+), 3 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 1c52077b04e5..b1dda41de9b1 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -372,10 +372,48 @@ static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 int num_src)
 {
-	int i;
+	/*
+	 * When I checked, over 76% of file renames in linux just moved
+	 * files to a different directory but kept the same basename.  gcc
+	 * did that with over 64% of renames, gecko did it with over 79%,
+	 * and WebKit did it with over 89%.
+	 *
+	 * Therefore we can bypass the normal exhaustive NxM matrix
+	 * comparison of similarities between all potential rename sources
+	 * and destinations by instead using file basename as a hint, checking
+	 * for similarity between files with the same basename, and if we
+	 * find a pair that are sufficiently similar, record the rename
+	 * pair and exclude those two from the NxM matrix.
+	 *
+	 * This *might* cause us to find a less than optimal pairing (if
+	 * there is another file that we are even more similar to but has a
+	 * different basename).  Given the huge performance advantage
+	 * basename matching provides, and given the frequency with which
+	 * people use the same basename in real world projects, that's a
+	 * trade-off we are willing to accept when doing just rename
+	 * detection.  However, if someone wants copy detection that
+	 * implies they are willing to spend more cycles to find
+	 * similarities between files, so it may be less likely that this
+	 * heuristic is wanted.
+	 */
+
+	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
 
+	/*
+	 * The prefetching stuff wants to know if it can skip prefetching blobs
+	 * that are unmodified.  Unmodified blobs are only relevant when doing
+	 * copy detection.  find_basename_matches() is only used when detecting
+	 * renames, not when detecting copies, so it'll only be used when a file
+	 * only existed in the source.  Since we already know that the file
+	 * won't be unmodified, there's no point checking for it; that's just a
+	 * waste of resources.  So set skip_unmodified to 0 so that
+	 * estimate_similarity() and prefetch() won't waste resources checking
+	 * for something we already know is false.
+	 */
+	int skip_unmodified = 0;
+
 	/* Create maps of basename -> fullname(s) for sources and dests */
 	strintmap_init_with_options(&sources, -1, NULL, 0);
 	strintmap_init_with_options(&dests, -1, NULL, 0);
@@ -412,12 +450,62 @@ static int find_basename_matches(struct diff_options *options,
 			strintmap_set(&dests, base, i);
 	}
 
-	/* TODO: Make use of basenames source and destination basenames */
+	/* Now look for basename matchups and do similarity estimation */
+	for (i = 0; i < num_src; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		char *base = NULL;
+		intptr_t src_index;
+		intptr_t dst_index;
+
+		/* Get the basename */
+		base = strrchr(filename, '/');
+		base = (base ? base+1 : filename);
+
+		/* Find out if this basename is unique among sources */
+		src_index = strintmap_get(&sources, base);
+		if (src_index == -1)
+			continue; /* not a unique basename; skip it */
+		assert(src_index == i);
+
+		if (strintmap_contains(&dests, base)) {
+			struct diff_filespec *one, *two;
+			int score;
+
+			/* Find out if this basename is unique among dests */
+			dst_index = strintmap_get(&dests, base);
+			if (dst_index == -1)
+				continue; /* not a unique basename; skip it */
+
+			/* Ignore this dest if already used in a rename */
+			if (rename_dst[dst_index].is_rename)
+				continue; /* already used previously */
+
+			/* Estimate the similarity */
+			one = rename_src[src_index].p->one;
+			two = rename_dst[dst_index].p->two;
+			score = estimate_similarity(options->repo, one, two,
+						    minimum_score, skip_unmodified);
+
+			/* If sufficiently similar, record as rename pair */
+			if (score < minimum_score)
+				continue;
+			record_rename_pair(dst_index, src_index, score);
+			renames++;
+
+			/*
+			 * Found a rename so don't need text anymore; if we
+			 * didn't find a rename, the filespec_blob would get
+			 * re-used when doing the matrix of comparisons.
+			 */
+			diff_free_filespec_blob(one);
+			diff_free_filespec_blob(two);
+		}
+	}
 
 	strintmap_clear(&sources);
 	strintmap_clear(&dests);
 
-	return 0;
+	return renames;
 }
 
 #define NUM_CANDIDATE_PER_DST 4
-- 
gitgitgadget



* [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
  2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
  2021-02-06 22:52 ` [PATCH 2/3] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
@ 2021-02-06 22:52 ` Elijah Newren via GitGitGadget
  2021-02-07 14:38   ` Derrick Stolee
  2021-02-07  5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
  2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
  4 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-06 22:52 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       13.815 s ±  0.062 s    13.138 s ±  0.086 s
    mega-renames:   1799.937 s ±  0.493 s   169.488 s ±  0.494 s
    just-one-mega:    51.289 s ±  0.019 s     5.061 s ±  0.017 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index b1dda41de9b1..206c0bbdcdfb 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,7 +367,6 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
-MAYBE_UNUSED
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 int num_src)
@@ -718,12 +717,45 @@ void diffcore_rename(struct diff_options *options)
 	if (minimum_score == MAX_SCORE)
 		goto cleanup;
 
+	num_sources = rename_src_nr;
+
+	if (want_copies || break_idx) {
+		/*
+		 * Cull sources:
+		 *   - remove ones corresponding to exact renames
+		 */
+		trace2_region_enter("diff", "cull after exact", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull after exact", options->repo);
+	} else {
+		/*
+		 * Cull sources:
+		 *   - remove ones involved in renames (found via exact match)
+		 */
+		trace2_region_enter("diff", "cull exact", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull exact", options->repo);
+
+		/* Utilize file basenames to quickly find renames. */
+		trace2_region_enter("diff", "basename matches", options->repo);
+		rename_count += find_basename_matches(options, minimum_score,
+						      rename_src_nr);
+		trace2_region_leave("diff", "basename matches", options->repo);
+
+		/*
+		 * Cull sources, again:
+		 *   - remove ones involved in renames (found via basenames)
+		 */
+		trace2_region_enter("diff", "cull basename", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull basename", options->repo);
+	}
+
 	/*
-	 * Calculate how many renames are left
+	 * Calculate how many rename destinations are left
 	 */
 	num_destinations = (rename_dst_nr - rename_count);
-	remove_unneeded_paths_from_src(want_copies);
-	num_sources = rename_src_nr;
+	num_sources = rename_src_nr; /* rename_src_nr reflects lower number */
 
 	/* All done? */
 	if (!num_destinations || !num_sources)
@@ -755,7 +787,7 @@ void diffcore_rename(struct diff_options *options)
 		struct diff_score *m;
 
 		if (rename_dst[i].is_rename)
-			continue; /* dealt with exact match already. */
+			continue; /* exact or basename match already handled */
 
 		m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
 		for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
-- 
gitgitgadget


* Re: [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection
  2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
@ 2021-02-07  5:19 ` Junio C Hamano
  2021-02-07  6:05   ` Elijah Newren
  2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
  4 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2021-02-07  5:19 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This series depends on ort-perf-batch-6[1], which has not yet appeared in
> seen despite being reviewed by both Junio and Stolee.

It is because that one depends on something not in, but soon about
to go in, 'master' and I didn't want to add to "this topic depends
on top of that other topic" mess.


* Re: [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection
  2021-02-07  5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
@ 2021-02-07  6:05   ` Elijah Newren
  0 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren @ 2021-02-07  6:05 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King

On Sat, Feb 6, 2021 at 9:19 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > This series depends on ort-perf-batch-6[1], which has not yet appeared in
> > seen despite being reviewed by both Junio and Stolee.
>
> It is because that one depends on something not in, but soon about
> to go in, 'master' and I didn't want to add to "this topic depends
> on top of that other topic" mess.

Ah, makes sense.


* Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
@ 2021-02-07 14:38   ` Derrick Stolee
  2021-02-07 19:51     ` Junio C Hamano
  2021-02-08  8:27     ` Elijah Newren
  0 siblings, 2 replies; 71+ messages in thread
From: Derrick Stolee @ 2021-02-07 14:38 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren

On 2/6/21 5:52 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> Make use of the new find_basename_matches() function added in the last
> two patches, to find renames more rapidly in cases where we can match up
> files based on basenames.

This is a valuable heuristic.

> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> performance work; instrument with trace2_region_* calls", 2020-10-28),
> this change improves the performance as follows:
> 
>                             Before                  After
>     no-renames:       13.815 s ±  0.062 s    13.138 s ±  0.086 s
>     mega-renames:   1799.937 s ±  0.493 s   169.488 s ±  0.494 s
>     just-one-mega:    51.289 s ±  0.019 s     5.061 s ±  0.017 s

These numbers are very impressive.

Before I get too deep into reviewing these patches, I do want
to make it clear that the speed-up is coming at the cost of
a behavior change. We are restricting the "best match" search
to be first among files with common base name (although maybe
I would use 'suffix'?). If we search for a rename among all
additions and deletions ending in ".txt" we might find a
similarity match that is 60% and declare that a rename, even
if there is a ".txt" -> ".md" pair that has a 70% match.

This could be documented in a test case, to demonstrate that
we are making this choice explicitly.

For example, here is a test that passes now, but would
start failing with your patches (if I understand them
correctly):

diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index c16486a9d41..e4c71fcf3be 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -262,4 +262,21 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
 	grep "myotherfile.*myfile" actual
 '
 
+test_expect_success 'multiple similarity choices' '
+	test_write_lines line1 line2 line3 line4 line5 \
+			 line6 line7 line8 line9 line10 >delete.txt &&
+	git add delete.txt &&
+	git commit -m "base txt" &&
+
+	rm delete.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 >add.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 line9 >add.md &&
+	git add add.txt add.md &&
+	git commit -a -m "rename" &&
+	git diff-tree -M HEAD HEAD^ >actual &&
+	grep "add.md	delete.txt" actual
+'
+
 test_done

Personally, I'm fine with making this assumption. All of
our renames are based on heuristics, so any opportunity
to reduce the number of content comparisons is a win in
my mind. We also don't report a rename unless there _is_
an add/delete pair that is sufficiently close in content.

So, in this way, we are changing the optimization function
that is used to determine the "best" rename available. It
might be good to update documentation for how we choose
renames:

  An add/delete pair is marked as a rename based on the
  following similarity function:

  1. If the blob content is identical, then those files
     are marked as a rename. (Should we break ties here
     based on the basename?)

  2. Among pairs whose content matches the minimum
     similarity limit, we optimize for:

     i. among files with the same basename (trailer
        after final '.') select pairs with highest
        similarity.

    ii. if no files with the same basename have the
        minimum similarity, then select pairs with
        highest similarity across all filenames.

The above was written quickly as an attempt, so it will
require careful editing to actually make sense to end
users.

Thanks,
-Stolee


* Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-07 14:38   ` Derrick Stolee
@ 2021-02-07 19:51     ` Junio C Hamano
  2021-02-08  8:38       ` Elijah Newren
  2021-02-08  8:27     ` Elijah Newren
  1 sibling, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2021-02-07 19:51 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, git, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Elijah Newren

Derrick Stolee <stolee@gmail.com> writes:

> Before I get too deep into reviewing these patches, I do want
> to make it clear that the speed-up is coming at the cost of
> a behavior change. We are restricting the "best match" search
> to be first among files with common base name (although maybe
> I would use 'suffix'?). If we search for a rename among all
> additions and deletions ending in ".txt" we might find a
> similarity match that is 60% and declare that a rename, even
> if there is a ".txt" -> ".md" pair that has a 70% match.

Yes, my initial reaction to the idea was that "yuck, our rename
detection lost its purity".  diffcore-rename strived to base its
decision purely on content similarity, primarily because it is one
of the oldest parts of Git where the guiding principle has always
been that the content is the king.  I think my aversion to the "all
of my neighbors are relocating, so should I move to the same place"
(aka "directory rename") comes from a similar place, but in a sense
this was worse.

At least, until I got over the initial bump.  I do not think the
suffix match is necessarily a bad idea, but it adds more "magically
doing a wrong thing" failure modes (e.g. the ".txt" to ".md" example
would probably have more variants that impact the real life
projects; ".C" vs ".cc" vs ".cxx" vs ".cpp" immediately comes to
mind), and a tool that silently does a wrong thing because it uses
more magic would be a tool that is hard to explain why it did the
wrong thing when it does.

> This could be documented in a test case, to demonstrate that
> we are making this choice explicitly.

Yes.  I wonder if we can solve it by requiring a lot better than
minimum match when trying the "suffix match" first, or something?

Provided we agree that it is a good idea to insert this between
"exact contents match" and "full matrix", I have one question for
Elijah on what the code does.

To me, it seems that the "full matrix" part still uses the remaining
src and dst candidates fully.  But if "A.txt" and "B.txt" are still
surviving in the src/dst at that stage, shouldn't we be saying that
"no way these can be similar enough---we've checked in the middle
stage where only the ones with the same suffix are considered and
this pair didn't turn into a rename"?

Thanks.


* Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-07 14:38   ` Derrick Stolee
  2021-02-07 19:51     ` Junio C Hamano
@ 2021-02-08  8:27     ` Elijah Newren
  2021-02-08 11:31       ` Derrick Stolee
  1 sibling, 1 reply; 71+ messages in thread
From: Elijah Newren @ 2021-02-08  8:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King

Hi,

On Sun, Feb 7, 2021 at 6:38 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/6/21 5:52 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > Make use of the new find_basename_matches() function added in the last
> > two patches, to find renames more rapidly in cases where we can match up
> > files based on basenames.
>
> This is a valuable heuristic.
>
> > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> > performance work; instrument with trace2_region_* calls", 2020-10-28),
> > this change improves the performance as follows:
> >
> >                             Before                  After
> >     no-renames:       13.815 s ±  0.062 s    13.138 s ±  0.086 s
> >     mega-renames:   1799.937 s ±  0.493 s   169.488 s ±  0.494 s
> >     just-one-mega:    51.289 s ±  0.019 s     5.061 s ±  0.017 s
>
> These numbers are very impressive.
>
> Before I get too deep into reviewing these patches, I do want
> to make it clear that the speed-up is coming at the cost of
> a behavior change. We are restricting the "best match" search
> to be first among files with common base name (although maybe
> I would use 'suffix'?). If we search for a rename among all
> additions and deletions ending in ".txt" we might find a
> similarity match that is 60% and declare that a rename, even
> if there is a ".txt" -> ".md" pair that has a 70% match.

I'm glad you all are open to possible behavioral changes, but I was
proposing a much smaller behavioral change that is quite different
than what you have suggested here.  Perhaps my wording was poor; I
apologize for forgetting that "basename" has different meanings in
different contexts.  Let me try again; I am not treating the filename
extension as special in any manner here; by "basename" I just mean the
portion of the path ignoring any leading directories.  Thus
    src/foo.txt
might be a good match against
    source/foo.txt
but this optimization as a preliminary step would not consider
matching src/foo.txt against any of
    source/bar.txt
    source/foo.md
since the basenames ('bar.txt' and 'foo.md') do not match our original
file's basename ('foo.txt').

Of course, if this preliminary optimization step fails to find another
"foo.txt" to match src/foo.txt against (or finds more than one and
thus doesn't compare against any of them), then the fallback inexact
rename detection matrix might match it against either of those two
latter paths, as it always has.
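
The distinction can be shown with a small standalone C sketch. The helper
names path_basename() and same_basename() are hypothetical and exist only for
this illustration; the strrchr() idiom is the one the patches use to find the
basename.

```c
#include <assert.h>
#include <string.h>

/* Return the part of path after the last '/', or path itself. */
const char *path_basename(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/*
 * Nonzero if both paths end in the same filename.  The extension gets
 * no special treatment: the whole final path component must match.
 */
int same_basename(const char *a, const char *b)
{
	assert(a && b);
	return !strcmp(path_basename(a), path_basename(b));
}
```

Under this definition same_basename("src/foo.txt", "source/foo.txt") is
nonzero, while comparing src/foo.txt against source/bar.txt or source/foo.md
yields 0, so those two would only ever be paired by the fallback matrix.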

> This could be documented in a test case, to demonstrate that
> we are making this choice explicitly.
>
> For example, here is a test that passes now, but would
> start failing with your patches (if I understand them
> correctly):
>
> diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
> index c16486a9d41..e4c71fcf3be 100755
> --- a/t/t4001-diff-rename.sh
> +++ b/t/t4001-diff-rename.sh
> @@ -262,4 +262,21 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
>         grep "myotherfile.*myfile" actual
>  '
>
> +test_expect_success 'multiple similarity choices' '
> +       test_write_lines line1 line2 line3 line4 line5 \
> +                        line6 line7 line8 line9 line10 >delete.txt &&
> +       git add delete.txt &&
> +       git commit -m "base txt" &&
> +
> +       rm delete.txt &&
> +       test_write_lines line1 line2 line3 line4 line5 \
> +                         line6 line7 line8 >add.txt &&
> +       test_write_lines line1 line2 line3 line4 line5 \
> +                         line6 line7 line8 line9 >add.md &&
> +       git add add.txt add.md &&
> +       git commit -a -m "rename" &&
> +       git diff-tree -M HEAD HEAD^ >actual &&
> +       grep "add.md    delete.txt" actual
> +'
> +
>  test_done
>
> Personally, I'm fine with making this assumption. All of
> our renames are based on heuristics, so any opportunity
> to reduce the number of content comparisons is a win in
> my mind. We also don't report a rename unless there _is_
> an add/delete pair that is sufficiently close in content.
>
> So, in this way, we are changing the optimization function
> that is used to determine the "best" rename available. It
> might be good to update documentation for how we choose
> renames:

Seems reasonable; I'll add some commentary below on the rules...

>
>   An add/delete pair is marked as a rename based on the
>   following similarity function:
>

0. Unless break detection is on, files with the same fullname are
considered the same file even if their content is completely
different.  (With break detection turned on, we can have e.g. both
src/foo.txt -> src/bar.txt and otherdir/baz.txt -> src/foo.txt, i.e.
src/foo.txt can be both a source and a destination of a rename.)

[The merge machinery never turns break detection on, but
diffcore-rename is used by git diff, git log, etc. too, so if we're
documenting the rules we should cover all the cases.]

>   1. If the blob content is identical, then those files
>      are marked as a rename. (Should we break ties here
>      based on the basename?)

find_identical_files() already breaks ties based on basename_same(),
yes.  So there's another area of the code that uses basenames to guide
rename detection already, just in a much more limited fashion.

>   2. Among pairs whose content matches the minimum
>      similarity limit, we optimize for:
>
>      i. among files with the same basename (trailer
>         after final '.') select pairs with highest
>         similarity.

This is an interesting idea, but is not what I implemented.  It is
possible that your suggestion is also a useful optimization; it'd be
hard to know without trying.  However, as noted in optimization batch
8 that I'll be submitting later, I'm worried about having any
optimization pre-steps doing more than O(1) comparisons per path (and
here you suggest comparing each .txt file with all other .txt files);
doing that can interact badly with optimization batch 9.
Additionally, unless we do something to avoid re-comparing files again
when doing the later all-unmatched-files-against-each-other check,
then worst case behavior can approach twice as slow as the original
code.

Anyway, the explanation I'd use for the optimization I've added in
this series is:

       i. if, looking through the two sets (of add pairs and of delete
          pairs), there is exactly one file with a given basename in
          each set, and the two files meet the minimum similarity, then
          mark them as a rename

Optimization batch 8 will extend this particular rule.
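
The rule above can be sketched roughly as follows. This is a hedged Python illustration, not the find_basename_matches() implementation: the names `unique_by_basename` and `basename_guided_renames`, the `similarity` callback, and the toy score table are all assumptions made for the example, and `min_score=0.5` simply mirrors Git's default 50% rename threshold.

```python
import posixpath
from collections import defaultdict


def unique_by_basename(paths):
    """Map basename -> path, keeping only basenames that occur exactly once."""
    groups = defaultdict(list)
    for p in paths:
        groups[posixpath.basename(p)].append(p)
    return {base: ps[0] for base, ps in groups.items() if len(ps) == 1}


def basename_guided_renames(deleted, added, similarity, min_score=0.5):
    """Pair a deleted and an added path when their basename is unique on
    both sides and the caller-supplied similarity meets the minimum."""
    src = unique_by_basename(deleted)
    dst = unique_by_basename(added)
    renames = []
    for base in sorted(src.keys() & dst.keys()):
        # At most one content comparison per basename, i.e. O(1) per path.
        if similarity(src[base], dst[base]) >= min_score:
            renames.append((src[base], dst[base]))
    return renames


# Toy similarity scores standing in for real content comparison.
scores = {
    ("src/foo.txt", "source/foo.txt"): 0.9,
    ("src/bar.txt", "source/bar.txt"): 0.2,
}
deleted = ["src/foo.txt", "src/bar.txt", "a/dup.txt", "b/dup.txt"]
added = ["source/foo.txt", "source/bar.txt", "c/dup.txt"]
print(basename_guided_renames(deleted, added,
                              lambda s, d: scores.get((s, d), 0.0)))
```

In this toy input only src/foo.txt -> source/foo.txt is paired: bar.txt falls below the threshold and dup.txt is not unique among the deleted paths, so both would be left for the later full-matrix comparison.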

Optimization batches 9 and 10 will optimize the rename detection more,
but instead of using rule changes, will instead pass in a list of
"irrelevant" sources that can be skipped.  The trick is in determining
source files that are irrelevant and why.  I'm not sure if we want to
also mention in the rules that different areas of the code (the merge
machinery, log --follow, etc.) can make the rename detection focus
just on some "relevant" subset of files.  (Which will also touch on
optimization batch 12.)

>     ii. if no files with the same basename have the
>         minimum similarity, then select pairs with
>         highest similarity across all filenames.

Yes, this will remain as the fallback at the very end.

> The above was written quickly as an attempt, so it will
> require careful editing to actually make sense to end
> users.

Yeah, and we probably also need to mention copy detection above
somehow too, and add more precise wording about how break detection is
involved.


* Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-07 19:51     ` Junio C Hamano
@ 2021-02-08  8:38       ` Elijah Newren
  2021-02-08 11:43         ` Derrick Stolee
  2021-02-08 17:37         ` Junio C Hamano
  0 siblings, 2 replies; 71+ messages in thread
From: Elijah Newren @ 2021-02-08  8:38 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King

On Sun, Feb 7, 2021 at 11:51 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Derrick Stolee <stolee@gmail.com> writes:
>
> > Before I get too deep into reviewing these patches, I do want
> > to make it clear that the speed-up is coming at the cost of
> > a behavior change. We are restricting the "best match" search
> > to be first among files with common base name (although maybe
> > I would use 'suffix'?). If we search for a rename among all
> > additions and deletions ending the ".txt" we might find a
> > similarity match that is 60% and declare that a rename, even
> > if there is a ".txt" -> ".md" pair that has a 70% match.
>
> Yes, my initial reaction to the idea was that "yuck, our rename
> detection lost its purity".  diffcore-rename strived to base its
> decision purely on content similarity, primarily because it is one
> of the oldest part of Git where the guiding principle has always
> been that the content is the king.  I think my aversion to the "all
> of my neighbors are relocating, so should I move to the same place"
> (aka "directory rename") comes from a similar place, but in a sense
> this was worse.
>
> At least, until I got over the initial bump.  I do not think the
> suffix match is necessarily a bad idea, but it adds more "magically
> doing a wrong thing" failure modes (e.g. the ".txt" to ".md" example
> would probably have more variants that impact the real life
> projects; ".C" vs ".cc" vs ".cxx" vs ".cpp" immediately comes to
> mind), and a tool that silently does a wrong thing because it uses
> more magic would be a tool that is hard to explain why it did the
> wrong thing when it does.

Stolee described an algorithm different from the one I proposed, I
think because "basename" means different things in different contexts.
I tried to clarify that in response to his email, but I wanted to
clarify one additional thing here too:

diffcore-rename has not in the past based its decision solely on
content similarity.  It only does that when break detection is on.
Otherwise, 0% content similarity is trumped by sufficient filename
similarity (namely, with a filename similarity of 100%).  If the
filename similarity wasn't sufficiently high (anything less than an
exact match), then it completely ignored filename similarity and
looked only at content similarity.  It thus jumped from one extreme to
another.

My optimization is adding an in-between state.  When the basename (the
part of the path excluding the leading directory) matches the basename
of another file (and those basenames are unique on each side), then
compare content similarity and mark the files as a rename if the two
are sufficiently similar.  It is thus a position that considers both
filename similarity (basename match) and content similarity together.

> > This could be documented in a test case, to demonstrate that
> > we are making this choice explicitly.
>
> Yes.  I wonder if we can solve it by requiring a lot better than
> minimum match when trying the "suffix match" first, or something?

This may still be a useful idea, and was something I had considered,
but more in the context of more generic filename similarity
comparisons.  We could still discuss it even when basenames match, but
basenames matching seems strong enough to me that I wasn't sure extra
configuration knobs were warranted.

> Provided if we agree that it is a good idea to insert this between
> "exact contents match" and "full matrix", I have one question to
> Elijah on what the code does.
>
> To me, it seems that the "full matrix" part still uses the remaining
> src and dst candidates fully.  But if "A.txt" and "B.txt" are still
> surviving in the src/dst at that stage, shouldn't we be saying that
> "no way these can be similar enough---we've checked in the middle
> stage where only the ones with the same suffix are considered and
> this pair didn't turn into a rename"?

This is a very good point.  A.txt and B.txt will not have been
compared previously since their basenames do not match, but the basic
idea is still possible.  For example, A.txt could have been compared
to source/some-module/A.txt.  And I don't do anything in the final
"full matrix" stage to avoid re-comparing those two files again.
However, it is worth noting that A.txt will have been compared to at
most one other file, not N files.  And thus while we are wasting some
re-comparisons, it is at most O(N) duplicated comparisons, not O(N^2).
I thought about that, but decided to not bother, based on the
following thinking:

1) The most expensive comparison is the first one, because when we do
that one, we first have to populate the list of integers that lines in
the file hash to.  Subsequent comparisons are relatively cheap since
this list of integers has already been computed.

2) This would only save us from at most N comparisons in the N x M
matrix (since no file in this optimization is compared to more than
one other)

3) Checking if two files have previously been compared requires more
code, in what is already a tight nested loop.  My experience
attempting to modify that tight loop for extra conditions (e.g. don't
compare files that are too large), is that it's easy to accidentally
make the code slower.  In fact, this is in part what led to the
addition of the remove_unneeded_paths_from_src() function.

4) There were plenty of other interesting ideas and maybe I was a tad lazy.  :-)

I think removing these already-compared cases could be done, but I
just avoided it.  If we were to do the "attempt to match files with
the same extension" optimization that Stolee outlines/invents above,
then we'd definitely need to consider it.  Otherwise, it's just a
minor additional optimization that someone could add to my patches.
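
As a rough back-of-the-envelope check of the O(N) versus O(N^2) argument, the following Python sketch compares the two counts. It assumes the worst case where no basename-step comparison produced a rename, so every file is still present in the full matrix; the function name `comparison_counts` is made up for illustration.

```python
def comparison_counts(num_sources, num_dests):
    """Return (upper bound on basename-step comparisons, full-matrix comparisons)."""
    # Each file is compared with at most one other file in the basename step.
    basename_step = min(num_sources, num_dests)
    full_matrix = num_sources * num_dests
    return basename_step, full_matrix


for n in (10, 1000, 30000):
    repeated, matrix = comparison_counts(n, n)
    print(f"N=M={n}: at most {repeated} repeated comparisons "
          f"out of {matrix} in the matrix ({repeated / matrix:.4%})")
```

The fraction of possibly repeated comparisons shrinks as 1/N, which is why skipping them is only a minor saving.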


* Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-08  8:27     ` Elijah Newren
@ 2021-02-08 11:31       ` Derrick Stolee
  2021-02-08 16:09         ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Derrick Stolee @ 2021-02-08 11:31 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King

On 2/8/2021 3:27 AM, Elijah Newren wrote:
> Hi,
> 
> On Sun, Feb 7, 2021 at 6:38 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 2/6/21 5:52 PM, Elijah Newren via GitGitGadget wrote:
>>> From: Elijah Newren <newren@gmail.com>
>>>
>>> Make use of the new find_basename_matches() function added in the last
>>> two patches, to find renames more rapidly in cases where we can match up
>>> files based on basenames.
>>
>> This is a valuable heuristic.
>>
>>> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
>>> performance work; instrument with trace2_region_* calls", 2020-10-28),
>>> this change improves the performance as follows:
>>>
>>>                             Before                  After
>>>     no-renames:       13.815 s ±  0.062 s    13.138 s ±  0.086 s
>>>     mega-renames:   1799.937 s ±  0.493 s   169.488 s ±  0.494 s
>>>     just-one-mega:    51.289 s ±  0.019 s     5.061 s ±  0.017 s
>>
>> These numbers are very impressive.
>>
>> Before I get too deep into reviewing these patches, I do want
>> to make it clear that the speed-up is coming at the cost of
>> a behavior change. We are restricting the "best match" search
>> to be first among files with common base name (although maybe
>> I would use 'suffix'?). If we search for a rename among all
>> additions and deletions ending the ".txt" we might find a
>> similarity match that is 60% and declare that a rename, even
>> if there is a ".txt" -> ".md" pair that has a 70% match.
> 
> I'm glad you all are open to possible behavioral changes, but I was
> proposing a much smaller behavioral change that is quite different
> than what you have suggested here.  Perhaps my wording was poor; I
> apologize for forgetting that "basename" has different meanings in
> different contexts.  Let me try again; I am not treating the filename
> extension as special in any manner here; by "basename" I just mean the
> portion of the path ignoring any leading directories.  Thus
>     src/foo.txt
> might be a good match against
>     source/foo.txt
> but this optimization as a preliminary step would not consider
> matching src/foo.txt against any of
>     source/bar.txt
>     source/foo.md
> since the basenames ('bar.txt' and 'foo.md') do not match our original
> file's basename ('foo.txt').
> 
> Of course, if this preliminary optimization step fails to find another
> "foo.txt" to match src/foo.txt against (or finds more than one and
> thus doesn't compare against any of them), then the fallback inexact
> rename detection matrix might match it against either of those two
> latter paths, as it always has.

Thank you for making it clear that I had misunderstood what the
optimization is actually doing. A much more narrow scope makes
more sense, and avoids the quadratic problem even when many files
of the same suffix are renamed.

>> This could be documented in a test case, to demonstrate that
>> we are making this choice explicitly.

My test is thus bogus, but you could have a similar one for
your actual optimization.

>> So, in this way, we are changing the optimization function
>> that is used to determine the "best" rename available. It
>> might be good to update documentation for how we choose
>> renames:
> 
> Seems reasonable; I'll add some commentary below on the rules...

Your commentary is helpful. I look forward to reading your
carefully-written docs in the next version ;).

>>      i. among files with the same basename (trailer
>>         after final '.') select pairs with highest
>>         similarity.
> 
> This is an interesting idea, but is not what I implemented.

That's what I get for reading the commit messages quickly and
commenting on what I _think_ is going on instead of actually
reading the code carefully. Sorry about that.

>  It is
> possible that your suggestion is also a useful optimization; it'd be
> hard to know without trying.  However, as noted in optimization batch
> 8 that I'll be submitting later, I'm worried about having any
> optimization pre-steps doing more than O(1) comparisons per path (and
> here you suggest comparing each .txt file with all other .txt files);
> doing that can interact badly with optimization batch 9.
> Additionally, unless we do something to avoid re-comparing files again
> when doing the later all-unmatched-files-against-each-other check,
> then worst case behavior can approach twice as slow as the original
> code.

Right. If Git decides to reorganize all of its *.c files in one
commit, we would still get quadratic behavior in rename detection.
Maybe it's not _that_ much of an improvement.

Thanks,
-Stolee


* Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-08  8:38       ` Elijah Newren
@ 2021-02-08 11:43         ` Derrick Stolee
  2021-02-08 16:25           ` Elijah Newren
  2021-02-08 17:37         ` Junio C Hamano
  1 sibling, 1 reply; 71+ messages in thread
From: Derrick Stolee @ 2021-02-08 11:43 UTC (permalink / raw)
  To: Elijah Newren, Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King

On 2/8/2021 3:38 AM, Elijah Newren wrote:
> On Sun, Feb 7, 2021 at 11:51 AM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> Before I get too deep into reviewing these patches, I do want
>>> to make it clear that the speed-up is coming at the cost of
>>> a behavior change. We are restricting the "best match" search
>>> to be first among files with common base name (although maybe
>>> I would use 'suffix'?). If we search for a rename among all
>>> additions and deletions ending the ".txt" we might find a
>>> similarity match that is 60% and declare that a rename, even
>>> if there is a ".txt" -> ".md" pair that has a 70% match.
>>
>> Yes, my initial reaction to the idea was that "yuck, our rename
>> detection lost its purity".  diffcore-rename strived to base its
>> decision purely on content similarity, primarily because it is one
>> of the oldest part of Git where the guiding principle has always
>> been that the content is the king.  I think my aversion to the "all
>> of my neighbors are relocating, so should I move to the same place"
>> (aka "directory rename") comes from a similar place, but in a sense
>> this was worse.
>>
>> At least, until I got over the initial bump.  I do not think the
>> suffix match is necessarily a bad idea, but it adds more "magically
>> doing a wrong thing" failure modes (e.g. the ".txt" to ".md" example
>> would probably have more variants that impact the real life
>> projects; ".C" vs ".cc" vs ".cxx" vs ".cpp" immediately comes to
>> mind), and a tool that silently does a wrong thing because it uses
>> more magic would be a tool that is hard to explain why it did the
>> wrong thing when it does.
> 
> Stolee explained a new algorithm different than what I have proposed,

Yes, sorry for adding noise. The point stands that we are changing
the behavior in some cases, so that must be agreed upon. What you
are _actually_ proposing is a much smaller change than I thought,
but it is still worth pointing out the behavior change.

> I think based on the apparent different meanings of "basename" that
> exist.  I tried to clarify that in response to his email, but I wanted
> to clarify one additional thing here too:
>
> diffcore-rename has not in the past based its decision solely on
> content similarity.  It only does that when break detection is on.
> Otherwise, 0% content similarity is trumped by sufficient filename
> similarity (namely, with a filename similarity of 100%).  If the
> filename similarity wasn't sufficiently high (anything less than an
> exact match), then it completely ignored filename similarity and
> looked only at content similarity.  It thus jumped from one extreme to
> another.

This idea of optimizing first for 100% filename similarity is a
good perspective on Git's rename detection algorithm. The canonical
example of this 100% filename similarity is a rename cycle:

	A -> B
	B -> C
	C -> A

Even if the OIDs are distinct and exactly match across these renames,
we see that there are no adds or deletes, so we do not even trigger
rename detection and report A, B, and C as edited instead.

A "rename path" (not cycle) such as:

	A -> B
	B -> C

does trigger rename detection, but B will never be considered. Instead,
"A -> C" will be checked for similarity to see if it is within the
threshold.

Of course, I am _not_ advocating that we change this behavior. These
situations are incredibly rare and we should not sacrifice performance
in the typical case to handle them.
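
The cycle and path examples above can be sketched in a few lines of Python. This is only an illustration of which paths show up as adds, deletes, and modifications when break detection is off; the function `classify_paths` and the placeholder blob ids are assumptions for the example, not Git code.

```python
def classify_paths(old_tree, new_tree):
    """Trees map path -> blob id. Return (deleted, added, modified) path lists."""
    deleted = sorted(p for p in old_tree if p not in new_tree)
    added = sorted(p for p in new_tree if p not in old_tree)
    modified = sorted(p for p in old_tree
                      if p in new_tree and old_tree[p] != new_tree[p])
    return deleted, added, modified


# Rename cycle: A -> B, B -> C, C -> A. Every path exists on both sides.
old = {"A": "oid1", "B": "oid2", "C": "oid3"}
cycle = {"B": "oid1", "C": "oid2", "A": "oid3"}
print(classify_paths(old, cycle))

# Rename path: A -> B, B -> C. Only A is deleted and only C is added.
print(classify_paths({"A": "oid1", "B": "oid2"},
                     {"B": "oid1", "C": "oid2"}))
```

In the cycle there is nothing in the deleted or added lists, so there are no rename candidates at all; in the path case the only candidate pair is A -> C, and B is reported as modified.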

> My optimization is adding an in-between state.  When the basename (the
> part of the path excluding the leading directory) matches the basename
> of another file (and those basenames are unique on each side), then
> compare content similarity and mark the files as a rename if the two
> are sufficiently similar.  It is thus a position that considers both
> filename similarity (basename match) and content similarity together.
> 
>>> This could be documented in a test case, to demonstrate that
>>> we are making this choice explicitly.
>>
>> Yes.  I wonder if we can solve it by requiring a lot better than
>> minimum match when trying the "suffix match" first, or something?
> 
> This may still be a useful idea, and was something I had considered,
> but more in the context of more generic filename similarity
> comparisons.  We could still discuss it even when basenames match, but
> basenames matching seems strong enough to me that I wasn't sure extra
> configuration knobs were warranted.

I think this is a complication that we might not want to add to the
heuristic, at least not at first. We might want to have a follow-up
that adjusts that value to be higher. A natural way would be through
a config option, so users can select something incredibly high like
99%. Another option would be to take a minimum that is halfway between
the existing similarity minimum and 100%.
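
For the "halfway" option, the arithmetic would look something like this Python sketch. The function name `halfway_threshold` is hypothetical and nothing like it exists in the patches; similarities are expressed as fractions between 0 and 1.

```python
def halfway_threshold(min_similarity):
    """Midpoint between the existing minimum similarity and 100%."""
    return (min_similarity + 1.0) / 2


# With Git's default 50% rename threshold, the basename step would
# require 75% similarity under this scheme.
print(halfway_threshold(0.5))
```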

>> Provided if we agree that it is a good idea to insert this between
>> "exact contents match" and "full matrix", I have one question to
>> Elijah on what the code does.
>>
>> To me, it seems that the "full matrix" part still uses the remaining
>> src and dst candidates fully.  But if "A.txt" and "B.txt" are still
>> surviving in the src/dst at that stage, shouldn't we be saying that
>> "no way these can be similar enough---we've checked in the middle
>> stage where only the ones with the same suffix are considered and
>> this pair didn't turn into a rename"?
> 
> This is a very good point.  A.txt and B.txt will not have been
> compared previously since their basenames do not match, but the basic
> idea is still possible.  For example, A.txt could have been compared
> to source/some-module/A.txt.  And I don't do anything in the final
> "full matrix" stage to avoid re-comparing those two files again.
> However, it is worth noting that A.txt will have been compared to at
> most one other file, not N files.  And thus while we are wasting some
> re-comparisons, it is at most O(N) duplicated comparisons, not O(N^2).
> I thought about that, but decided to not bother, based on the
> following thinking:
> 
> 1) The most expensive comparison is the first one, because when we do
> that one, we first have to populate the list of integers that lines in
> the file hash to.  Subsequent comparisons are relatively cheap since
> this list of integers has already been computed.
> 
> 2) This would only save us from at most N comparisons in the N x M
> matrix (since no file in this optimization is compared to more than
> one other)
> 
> 3) Checking if two files have previously been compared requires more
> code, in what is already a tight nested loop.  My experience
> attempting to modify that tight loop for extra conditions (e.g. don't
> compare files that are too large), is that it's easy to accidentally
> make the code slower.  In fact, this is in part what led to the
> addition of the remove_unneeded_paths_from_src() function.

Even storing a single bit to say "these were already compared" takes
quadratic space. The hope is to not have quadratic behavior if it
can be avoided.

> 4) There were plenty of other interesting ideas and maybe I was a tad lazy.  :-)
> 
> I think removing these already-compared cases could be done, but I
> just avoided it.  If we were to do the "attempt to match files with
> the same extension" optimization that Stolee outlines/invents above,
> then we'd definitely need to consider it.  Otherwise, it's just a
> minor additional optimization that someone could add to my patches.

The more I think about it, the less my idea makes sense. I'm sorry
for adding noise to the thread.

Thanks,
-Stolee


* Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-08 11:31       ` Derrick Stolee
@ 2021-02-08 16:09         ` Elijah Newren
  0 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren @ 2021-02-08 16:09 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King

On Mon, Feb 8, 2021 at 3:31 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/8/2021 3:27 AM, Elijah Newren wrote:
> > Hi,
> >
> > On Sun, Feb 7, 2021 at 6:38 AM Derrick Stolee <stolee@gmail.com> wrote:
> >>
> >> On 2/6/21 5:52 PM, Elijah Newren via GitGitGadget wrote:
> >>> From: Elijah Newren <newren@gmail.com>
> >>>
> >>> Make use of the new find_basename_matches() function added in the last
> >>> two patches, to find renames more rapidly in cases where we can match up
> >>> files based on basenames.
> >>
> >> This is a valuable heuristic.
> >>
> >>> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> >>> performance work; instrument with trace2_region_* calls", 2020-10-28),
> >>> this change improves the performance as follows:
> >>>
> >>>                             Before                  After
> >>>     no-renames:       13.815 s ±  0.062 s    13.138 s ±  0.086 s
> >>>     mega-renames:   1799.937 s ±  0.493 s   169.488 s ±  0.494 s
> >>>     just-one-mega:    51.289 s ±  0.019 s     5.061 s ±  0.017 s
> >>
> >> These numbers are very impressive.
> >>
> >> Before I get too deep into reviewing these patches, I do want
> >> to make it clear that the speed-up is coming at the cost of
> >> a behavior change. We are restricting the "best match" search
> >> to be first among files with common base name (although maybe
> >> I would use 'suffix'?). If we search for a rename among all
> >> additions and deletions ending the ".txt" we might find a
> >> similarity match that is 60% and declare that a rename, even
> >> if there is a ".txt" -> ".md" pair that has a 70% match.
> >
> > I'm glad you all are open to possible behavioral changes, but I was
> > proposing a much smaller behavioral change that is quite different
> > than what you have suggested here.  Perhaps my wording was poor; I
> > apologize for forgetting that "basename" has different meanings in
> > different contexts.  Let me try again; I am not treating the filename
> > extension as special in any manner here; by "basename" I just mean the
> > portion of the path ignoring any leading directories.  Thus
> >     src/foo.txt
> > might be a good match against
> >     source/foo.txt
> > but this optimization as a preliminary step would not consider
> > matching src/foo.txt against any of
> >     source/bar.txt
> >     source/foo.md
> > since the basenames ('bar.txt' and 'foo.md') do not match our original
> > file's basename ('foo.txt').
> >
> > Of course, if this preliminary optimization step fails to find another
> > "foo.txt" to match src/foo.txt against (or finds more than one and
> > thus doesn't compare against any of them), then the fallback inexact
> > rename detection matrix might match it against either of those two
> > latter paths, as it always has.
>
> Thank you for making it clear that I had misunderstood what the
> optimization is actually doing. A much more narrow scope makes
> more sense, and avoids the quadratic problem even when many files
> of the same suffix are renamed.
>
> >> This could be documented in a test case, to demonstrate that
> >> we are making this choice explicitly.
>
> My test is thus bogus, but you could have a similar one for
> your actual optimization.

Yes, good point.

> >> So, in this way, we are changing the optimization function
> >> that is used to determine the "best" rename available. It
> >> might be good to update documentation for how we choose
> >> renames:
> >
> > Seems reasonable; I'll add some commentary below on the rules...
>
> Your commentary is helpful. I look forward to reading your
> carefully-written docs in the next version ;).

:-)

> >>      i. among files with the same basename (trailer
> >>         after final '.') select pairs with highest
> >>         similarity.
> >
> > This is an interesting idea, but is not what I implemented.
>
> That's what I get for reading the commit messages quickly and
> commenting on what I _think_ is going on instead of actually
> reading the code carefully. Sorry about that.

There's absolutely no need to apologize.  If you read all three commit
messages and you didn't understand the idea, then clearly there's a
bug in my commit messages.  Thanks for highlighting it; I'll figure
out how to reword or add extra verbiage to make it clear.  Something
in these follow-up emails seemed to work, so I'll try to incorporate
stuff from them.

> >  It is
> > possible that your suggestion is also a useful optimization; it'd be
> > hard to know without trying.  However, as noted in optimization batch
> > 8 that I'll be submitting later, I'm worried about having any
> > optimization pre-steps doing more than O(1) comparisons per path (and
> > here you suggest comparing each .txt file with all other .txt files);
> > doing that can interact badly with optimization batch 9.
> > Additionally, unless we do something to avoid re-comparing files again
> > when doing the later all-unmatched-files-against-each-other check,
> > then worst case behavior can approach twice as slow as the original
> > code.
>
> Right. If Git decides to reorganize all of its *.c files in one
> commit, we would still get quadratic behavior in rename detection.
> Maybe it's not _that_ much of an improvement.
>
> Thanks,
> -Stolee


* Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-08 11:43         ` Derrick Stolee
@ 2021-02-08 16:25           ` Elijah Newren
  0 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren @ 2021-02-08 16:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King

On Mon, Feb 8, 2021 at 3:44 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/8/2021 3:38 AM, Elijah Newren wrote:
> > On Sun, Feb 7, 2021 at 11:51 AM Junio C Hamano <gitster@pobox.com> wrote:
> >>
> >> Derrick Stolee <stolee@gmail.com> writes:
> >>
> >>> Before I get too deep into reviewing these patches, I do want
> >>> to make it clear that the speed-up is coming at the cost of
> >>> a behavior change. We are restricting the "best match" search
> >>> to be first among files with common base name (although maybe
> >>> I would use 'suffix'?). If we search for a rename among all
> >>> additions and deletions ending the ".txt" we might find a
> >>> similarity match that is 60% and declare that a rename, even
> >>> if there is a ".txt" -> ".md" pair that has a 70% match.
> >>
> >> Yes, my initial reaction to the idea was that "yuck, our rename
> >> detection lost its purity".  diffcore-rename strived to base its
> >> decision purely on content similarity, primarily because it is one
> >> of the oldest part of Git where the guiding principle has always
> >> been that the content is the king.  I think my aversion to the "all
> >> of my neighbors are relocating, so should I move to the same place"
> >> (aka "directory rename") comes from a similar place, but in a sense
> >> this was worse.
> >>
> >> At least, until I got over the initial bump.  I do not think the
> >> suffix match is necessarily a bad idea, but it adds more "magically
> >> doing a wrong thing" failure modes (e.g. the ".txt" to ".md" example
> >> would probably have more variants that impact the real life
> >> projects; ".C" vs ".cc" vs ".cxx" vs ".cpp" immediately comes to
> >> mind), and a tool that silently does a wrong thing because it uses
> >> more magic would be a tool that is hard to explain why it did the
> >> wrong thing when it does.
> >
> > Stolee explained a new algorithm different than what I have proposed,
>
> Yes, sorry for adding noise. The point stands that we are changing
> the behavior in some cases, so that must be agreed upon. What you
> are _actually_ proposing is a much smaller change than I thought,
> but it is still worth pointing out the behavior change.

Again, no need to apologize; if my commit messages weren't clear, they
need to be fixed.  I am much more surprised here by your repeated
point that it's worth pointing out the behavior change.  I totally
agree with that, and it's why I spent two paragraphs on the second
commit message explicitly covering this and listing four enumerated
reasons to argue for the change.  So, I thought I had done that, but
it apparently isn't very clear...and I'm left wondering how to clarify
it further.  So, a question for you: How should I change it?  Should I
just modify the third commit message to re-highlight that it does
change behavior and refer to the second commit message for details?
Because repeating the point is the only way I can think of to make it
clearer.  Is there anything else I can or should do?

> > I think based on the apparent different meanings of "basename" that
> > exist.  I tried to clarify that in response to his email, but I wanted
> > to clarify one additional thing here too:
> >
> > diffcore-rename has not in the past based its decision solely on
> > content similarity.  It only does that when break detection is on.
> > Otherwise, 0% content similarity is trumped by sufficient filename
> > similarity (namely, with a filename similarity of 100%).  If the
> > filename similarity wasn't sufficiently high (anything less than an
> > exact match), then it completely ignored filename similarity and
> > looked only at content similarity.  It thus jumped from one extreme to
> > another.
>
> This idea of optimizing first for 100% filename similarity is a
> good perspective on Git's rename detection algorithm. The canonical
> example of this 100% filename similarity is a rename cycle:
>
>         A -> B
>         B -> C
>         C -> A
>
> Even if the OIDs are distinct and exactly match across these renames,
> we see that there are no adds or deletes, so we do not even trigger
> rename detection and report A, B, and C as edited instead.
>
> A "rename path" (not cycle) such as:
>
>         A -> B
>         B -> C
>
> does trigger rename detection, but B will never be considered. Instead,
> "A -> C" will be checked for similarity to see if it is within the
> threshold.
>
> Of course, I am _not_ advocating that we change this behavior. These
> situations are incredibly rare and we should not sacrifice performance
> in the typical case to handle them.
>
> > My optimization is adding an in-between state.  When the basename (the
> > part of the path excluding the leading directory) matches the basename
> > of another file (and those basenames are unique on each side), then
> > compare content similarity and mark the files as a rename if the two
> > are sufficiently similar.  It is thus a position that considers both
> > filename similarity (basename match) and content similarity together.
> >
> >>> This could be documented in a test case, to demonstrate that
> >>> we are making this choice explicitly.
> >>
> >> Yes.  I wonder if we can solve it by requiring a lot better than
> >> minimum match when trying the "suffix match" first, or something?
> >
> > This may still be a useful idea, and was something I had considered,
> > but more in the context of more generic filename similarity
> > comparisons.  We could still discuss it even when basenames match, but
> > basenames matching seems strong enough to me that I wasn't sure extra
> > configuration knobs were warranted.
>
> I think this is a complication that we might not want to add to the
> heuristic, at least not at first. We might want to have a follow-up
> that adjusts that value to be higher. A natural way would be through
> a config option, so users can select something incredibly high like
> 99%. Another option would be to take a minimum that is halfway between
> the existing similarity minimum and 100%.
>
> >> Provided if we agree that it is a good idea to insert this between
> >> "exact contents match" and "full matrix", I have one question to
> >> Elijah on what the code does.
> >>
> >> To me, it seems that the "full matrix" part still uses the remaining
> >> src and dst candidates fully.  But if "A.txt" and "B.txt" are still
> >> surviving in the src/dst at that stage, shouldn't we be saying that
> >> "no way these can be similar enough---we've checked in the middle
> >> stage where only the ones with the same suffix are considered and
> >> this pair didn't turn into a rename"?
> >
> > This is a very good point.  A.txt and B.txt will not have been
> > compared previously since their basenames do not match, but the basic
> > idea is still possible.  For example, A.txt could have been compared
> > to source/some-module/A.txt.  And I don't do anything in the final
> > "full matrix" stage to avoid re-comparing those two files again.
> > However, it is worth noting that A.txt will have been compared to at
> > most one other file, not N files.  And thus while we are wasting some
> > re-comparisons, it is at most O(N) duplicated comparisons, not O(N^2).
> > I thought about that, but decided to not bother, based on the
> > following thinking:
> >
> > 1) The most expensive comparison is the first one, because when we do
> > that one, we first have to populate the list of integers that lines in
> > the file hash to.  Subsequent comparisons are relatively cheap since
> > this list of integers has already been computed.
> >
> > 2) This would only save us from at most N comparisons in the N x M
> > matrix (since no file in this optimization is compared to more than
> > one other)
> >
> > 3) Checking if two files have previously been compared requires more
> > code, in what is already a tight nested loop.  My experience
> > attempting to modify that tight loop for extra conditions (e.g. don't
> > compare files that are too large), is that it's easy to accidentally
> > make the code slower.  In fact, this is in part what led to the
> > addition of the remove_unneeded_paths_from_src() function.
>
> Even storing a single bit to say "these were already compared" takes
> quadratic space. The hope is to not have quadratic behavior if it
> can be avoided.
>
> > 4) There were plenty of other interesting ideas and maybe I was a tad lazy.  :-)
> >
> > I think removing these already-compared cases could be done, but I
> > just avoided it.  If we were to do the "attempt to match files with
> > the same extension" optimization that Stolee outlines/invents above,
> > then we'd definitely need to consider it.  Otherwise, it's just a
> > minor additional optimization that someone could add to my patches.
>
> The more I think about it, the less my idea makes sense. I'm sorry
> for adding noise to the thread.
>
> Thanks,
> -Stolee


* Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-08  8:38       ` Elijah Newren
  2021-02-08 11:43         ` Derrick Stolee
@ 2021-02-08 17:37         ` Junio C Hamano
  2021-02-08 22:00           ` Elijah Newren
  1 sibling, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2021-02-08 17:37 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King

Elijah Newren <newren@gmail.com> writes:

> idea is still possible.  For example, A.txt could have been compared
> to source/some-module/A.txt.  And I don't do anything in the final
> "full matrix" stage to avoid re-comparing those two files again.
> However, it is worth noting that A.txt will have been compared to at
> most one other file, not N files.

Sorry, but where does this "at most one other file" come from?  "It
is rare to remove source/some-other-module/A.txt at the same time
while the above is happening"?  If so, yes, that sounds like a
sensible thing.

> 1) The most expensive comparison is the first one,...

Yes, we keep the spanhash table across comparisons.

> 2) This would only save us from at most N comparisons in the N x M
> matrix (since no file in this optimization is compared to more than
> one other)

True, but don't the rename_src[] and rename_dst[] entries have the
original pathname, where you can see A.txt and some-module/A.txt
share the same filename part cheaply?  Is that more expensive than
comparing spanhash tables?

Having asked these, I do think it is not worth pursuing, especially
because I agree with Derrick that this "we see a new file whose name
is the same as the one deleted from a different directory, so if
they are similar enough, let's declare victory and not bother
finding a better match" needs to be used with higher similarity bar
than the normal one.  If -M60 says "only consider pairs with a
similarity index of at least 60%", finding one at 60% similarity
and stopping there, rejecting other pairings, only because the pair
looks like a file moved from one directory to another while keeping
the same name feels like too crude a heuristic.  And if we require
a higher similarity level to short-circuit, the later full matrix
stage won't be helped by "we must have already rejected" logic.
A.txt and some-module/A.txt may not have been similar enough to
short-circuit and reject others in the earlier part, but the
full-matrix part works at a lower bar, and may consider the pair
good enough to keep as a match candidate.

Thanks.



* Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-08 17:37         ` Junio C Hamano
@ 2021-02-08 22:00           ` Elijah Newren
  2021-02-08 23:43             ` Junio C Hamano
  0 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren @ 2021-02-08 22:00 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King

Hi,

On Mon, Feb 8, 2021 at 9:37 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > idea is still possible.  For example, A.txt could have been compared
> > to source/some-module/A.txt.  And I don't do anything in the final
> > "full matrix" stage to avoid re-comparing those two files again.
> > However, it is worth noting that A.txt will have been compared to at
> > most one other file, not N files.
>
> Sorry, but where does this "at most one other file" come from?  "It
> is rare to remove source/some-other-module/A.txt at the same time
> while the above is happening"?  If so, yes, that sounds like a
> sensible thing.

It comes from the current implementation.  If both src/module1/A.txt
and src/module2/A.txt were removed, then I don't have a unique 'A.txt'
that was deleted.  In such a case, all 'A.txt' files are excluded from
this optimization step -- both sources and destinations.  This
optimization only kicks in for basenames where there was exactly one
of them deleted somewhere, and exactly one of them added somewhere.
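
As a rough illustration of the "exactly one deleted, exactly one added"
rule described above, here is a minimal standalone sketch.  It is not the
code from the patches (which build basename maps rather than scanning
linearly); the helper names base_of(), unique_index() and
basename_candidate() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the part of path after the last '/', or path itself if none. */
static const char *base_of(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/*
 * Return the index of the only entry in paths[] whose basename is
 * `name`, or -1 if there is no such entry or more than one.
 */
static int unique_index(const char **paths, int nr, const char *name)
{
	int i, found = -1;

	for (i = 0; i < nr; i++) {
		if (strcmp(base_of(paths[i]), name))
			continue;
		if (found >= 0)
			return -1; /* non-unique basename: excluded */
		found = i;
	}
	return found;
}

/*
 * For source index s, return the single destination index it would be
 * compared with in the preliminary step, or -1 if its basename is not
 * unique on both the deleted and the added side.
 */
int basename_candidate(const char **src, int src_nr,
		       const char **dst, int dst_nr, int s)
{
	const char *name;

	assert(s >= 0 && s < src_nr);
	name = base_of(src[s]);
	if (unique_index(src, src_nr, name) != s)
		return -1;
	return unique_index(dst, dst_nr, name);
}
```

With src/module1/A.txt and src/module2/A.txt both deleted, neither gets a
candidate, no matter how many A.txt files were added; a lone deleted
docs/extensions.txt gets at most one candidate.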

One could consider trying to compare all deleted 'A.txt' files with
all added 'A.txt' files.  I tried that, but it interacted badly with
optimization batch 9, so I tossed it aside; I will not be submitting
such a method.

Naturally, this "unique basename" limitation presents problems for
file basenames like .gitignore, Makefile, build.gradle, or even
ObjectFactory.java, setup.c, etc. that tend to appear in several
locations throughout the tree.  As of this series, we have to fall
back to the full N x M matrix comparison to detect the renames for
such non-unique basenames.  The next series I am planning on
submitting will do something smarter for those files while still
ensuring that the preliminary step only compares any given file to at
most one other file.

> > 1) The most expensive comparison is the first one,...
>
> Yes. we keep the spanhash table across comparison.
>
> > 2) This would only save us from at most N comparisons in the N x M
> > matrix (since no file in this optimization is compared to more than
> > one other)
>
> True, but don't the rename_src[] and rename_dst[] entries have the
> original pathname, where you can see A.txt and some-module/A.txt
> share the same filename part cheaply?  Is that more expensive than
> comparing spanhash tables?

For a small enough number of renames, no, it won't be more expensive.
But I don't want to optimize for low numbers of renames; the code is
fast enough for those.  And with a large enough number of renames,
yes, the basename comparisons in aggregate will be more expensive than
the number of spanhash array comparisons you avoid redoing.  The
preliminary step from this optimization did at most O(N) spanhash
comparisons, because it would only compare any given file to at most
one other file.  (Any file that didn't have a matching basename on the
other side, or wasn't a unique basename, wouldn't have been compared
to anything.)  So, at most, we save O(N) spanhash comparisons.  In
order to avoid repeating those O(N) comparisons, you are adding O(NxM)
basename comparisons.  Once M is large enough, the O(NxM) basename
comparisons you added will be more expensive than the O(N) spanhash
comparisons you are saving.  Recall that my testcase used N and M of
approximately 26,000.  The real world repository I based it on had
over 30K renames.  And if I know of a repository with 30K renames with
only 50K files (at the time), I think we shouldn't be using that as an
upper bound either.
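
To put rough numbers on that argument, here is a small sketch of the two
counts being traded off.  The function names are made up for this
example, and it counts operations only; it says nothing about the
relative cost of one basename check versus one spanhash comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on spanhash comparisons done by the preliminary step. */
uint64_t max_saved_spanhash_comparisons(uint64_t n)
{
	return n; /* each source is compared to at most one destination */
}

/* Basename checks added if every cell of the N x M matrix tests one. */
uint64_t added_basename_checks(uint64_t n, uint64_t m)
{
	assert(m == 0 || n <= UINT64_MAX / m); /* guard against overflow */
	return n * m;
}
```

For N = M = 26,000 this is 676,000,000 added checks against at most
26,000 avoided comparisons.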

> Having asked these, I do think it is not worth pursuing, especially
> because I agree with Derrick that this "we see a new file whose name
> is the same as the one deleted from a different directory, so if
> they are similar enough, let's declare victory and not bother
> finding a better match" needs to be used with higher similarity bar
> than the normal one.

You say you agree with Stolee, but that's not what I understood Stolee
as saying at all.  He said he thought it wasn't worth the complication
of trying to use a different value for the basename minimum similarity
than the normal minimum similarity, at least not at first.  He
suggested we could add that in the future at some time, and then
talked a bit about how to add it if we do.

> If -M60 says "only consider pairs with a
> similarity index of at least 60%", finding one at 60% similarity
> and stopping there, rejecting other pairings, only because the pair
> looks like a file moved from one directory to another while keeping
> the same name feels like too crude a heuristic.  And if we require
> a higher similarity level to short-circuit, the later full matrix
> stage won't be helped by "we must have already rejected" logic.
> A.txt and some-module/A.txt may not have been similar enough to
> short-circuit and reject others in the earlier part, but the
> full-matrix part works at a lower bar, and may consider the pair
> good enough to keep as a match candidate.

I'm sorry, but I'm not following you.  As best I can tell, you seem to
be suggesting that if we were to use a higher similarity bar for
checking same-basename files, that such a difference would end up not
accelerating the diffcore-rename algorithm at all?  Is that correct?
If not, I don't understand what you're saying.

If by chance my restatement is an accurate summary of your claim, then
allow me to disabuse you of your assumptions here; you're way off.  I
wrote find_basename_matches() to take a similarity score, so that it
could take a different one than is used elsewhere in the algorithm.  I
didn't think it was necessary, but it does make it easy to test your
hypothesis.  Here are some results:

Original, not using basename-guided rename detection:
    no-renames:       13.815 s ±  0.062 s
    mega-renames:   1799.937 s ±  0.493 s
    just-one-mega:    51.289 s ±  0.019 s

Using basename_min_score = minimum_score, i.e. 50%:
    no-renames:       13.428 s ±  0.119 s
    mega-renames:    172.137 s ±  0.958 s
    just-one-mega:     5.154 s ±  0.025 s

Using basename_min_score = 0.5 * (minimum_score + MAX_SCORE), i.e. 75%:
    no-renames:       13.543 s ±  0.094 s
    mega-renames:    189.598 s ±  0.726 s
    just-one-mega:     5.647 s ±  0.016 s

Using basename_min_score = 0.1 * (minimum_score + 9*MAX_SCORE), i.e. 95%:
    no-renames:       13.733 s ±  0.086 s
    mega-renames:    353.479 s ±  2.574 s
    just-one-mega:    10.351 s ±  0.030 s


So, when we bump the bar for basename similarity much past your
hypothetical 60% all the way up to 75% (i.e. just take a simple
average of minimum score and MAX_SCORE), we see almost identical
speedups (factor of 9 or so instead of 10 or so).  And even when we go
to the extreme of requiring a 95% or greater similarity in order to
pair up basenames, we still see a speed-up factor of 5-6; that's less
than the factor of 10 we could get by allowing basename_min_score to
match minimum_score at 50%, but it's still _most_ of the speedup.
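
For reference, the three settings above are all points on the line
between minimum_score and MAX_SCORE.  A minimal sketch of that
interpolation follows; it assumes MAX_SCORE is 60000 and uses integer
percentages, and the function name and its pct_toward_max parameter are
hypothetical, not part of the patches.

```c
#include <assert.h>

#define MAX_SCORE 60000 /* assumption: the scale used for similarity scores */

/*
 * Move pct_toward_max percent of the way from minimum_score to
 * MAX_SCORE: 0 keeps minimum_score, 100 yields MAX_SCORE.
 */
int basename_min_score(int minimum_score, int pct_toward_max)
{
	assert(pct_toward_max >= 0 && pct_toward_max <= 100);
	return minimum_score +
	       (MAX_SCORE - minimum_score) * pct_toward_max / 100;
}
```

With minimum_score at 30000 (50%), pct_toward_max values of 0, 50 and 90
give 30000, 45000 and 57000, i.e. the 50%, 75% and 95% rows above.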

Granted, this is just one testcase.  It's going to vary a lot between
testcases and repositories and how far back or forward in history you
are rebasing or merging, etc.  The fact that this particular testcase
was obtained by doing a "git mv drivers/ pilots/" in the linux kernel
and then finding a topic to rebase across that rename boundary makes
it a bit special.  But....even if we were only able to pair 50% of the
files due to basename similarity, that would save 75% of the spanhash
comparisons.  Even in git.git, where only 16% of the renames keep the
basename, if we could pair up 16% of the files based on basenames it'd
save roughly 30% of the spanhash comparisons.  The numbers are
probably a lot better than either of those, though.  Since 76% of
renames in the linux kernel don't change the basename, 64% of the
renames in gcc don't, over 79% of the renames in the gecko repository
don't, and over 89% of the renames in the WebKit repository don't, I
think this is a really valuable heuristic to use.
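
The 75% and 30% figures follow from the matrix shrinking on both sides at
once.  Here is a back-of-the-envelope sketch, assuming a square N x N
matrix, the same fraction paired on each side, and ignoring the O(N)
preliminary comparisons; matrix_savings() is a name invented for this
example.

```c
#include <assert.h>

/*
 * Fraction of full-matrix cells avoided when a fraction `paired` of the
 * sources and of the destinations is matched up before the matrix step.
 */
double matrix_savings(double paired)
{
	double remaining;

	assert(paired >= 0.0 && paired <= 1.0);
	remaining = 1.0 - paired;
	return 1.0 - remaining * remaining;
}
```

Pairing 50% leaves a quarter of the matrix (75% saved); pairing 16%
leaves 0.84 * 0.84 = 0.7056 of it (about 30% saved).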


* Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-08 22:00           ` Elijah Newren
@ 2021-02-08 23:43             ` Junio C Hamano
  2021-02-08 23:52               ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2021-02-08 23:43 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King

Elijah Newren <newren@gmail.com> writes:

> I'm sorry, but I'm not following you.  As best I can tell, you seem to
> be suggesting that if we were to use a higher similarity bar for
> checking same-basename files, that such a difference would end up not
> accelerating the diffcore-rename algorithm at all?

No.  If we assume we use the minimum similarity threshold in the
new middle step that considers only the files that were moved across
directories without changing their names, and the last "full matrix"
step sees that a src that did *not* pair with a dst of the same name
in a different directory survives, we know without comparing them
again that the pair would not be similar enough (because we are using
the same "minimum similarity" in the middle step and the full matrix
step).  But if we used a higher similarity in the middle step, the
fact that such a src/dst pair survived the middle step without
producing a match only means that the pair was not similar enough
for the raised bar used in the middle, and the full-matrix step needs
to consider the possibility that they may still be similar enough
under the "minimum similarity" used for all the other pairs.

And because I was assuming that requiring higher similarity in the
middle step would be a prudent thing to do to avoid false matches
that discard better matches elsewhere, my conclusion was that it
would not be a useful optimization to do in the final full-matrix
step to see if a pair is something that was a candidate in the
middle step but did not match well enough (because the fact that the
pair did not compare well enough with higher bar does not mean it
would not compare well to pass the lower "minimum" bar).
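
The two-threshold situation described above can be sketched as follows.
This is only an illustration of the reasoning, not code from the series;
the enum and the classify() helper are hypothetical names.

```c
#include <assert.h>

enum verdict {
	PAIRED_EARLY,    /* accepted by the basename (middle) step */
	STILL_CANDIDATE, /* rejected early, may still pass the full matrix */
	NEVER_A_RENAME   /* below the bar used by the full matrix too */
};

/* Classify a same-basename pair given its similarity score. */
enum verdict classify(int score, int basename_min_score, int minimum_score)
{
	assert(basename_min_score >= minimum_score);
	if (score >= basename_min_score)
		return PAIRED_EARLY;
	if (score >= minimum_score)
		return STILL_CANDIDATE;
	return NEVER_A_RENAME;
}
```

When both bars are equal, STILL_CANDIDATE can never be returned, so an
early rejection would be final; once the middle bar is raised, a rejected
pair may land in that band and has to be compared again.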




* Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-08 23:43             ` Junio C Hamano
@ 2021-02-08 23:52               ` Elijah Newren
  0 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren @ 2021-02-08 23:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King

On Mon, Feb 8, 2021 at 3:43 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > I'm sorry, but I'm not following you.  As best I can tell, you seem to
> > be suggesting that if we were to use a higher similarity bar for
> > checking same-basename files, that such a difference would end up not
> > accelerating the diffcore-rename algorithm at all?
>
> No.  If we assume we use the minimum similarity threshold in the
> new middle step that considers only the files that were moved across
> directories without changing their names, and the last "full matrix"
> step sees that a src that did *not* pair with a dst of the same name
> in a different directory survives, we know without comparing them
> again that the pair would not be similar enough (because we are using
> the same "minimum similarity" in the middle step and the full matrix
> step).  But if we used a higher similarity in the middle step, the
> fact that such a src/dst pair survived the middle step without
> producing a match only means that the pair was not similar enough
> for the raised bar used in the middle, and the full-matrix step needs
> to consider the possibility that they may still be similar enough
> under the "minimum similarity" used for all the other pairs.
>
> And because I was assuming that requiring higher similarity in the
> middle step would be a prudent thing to do to avoid false matches
> that discard better matches elsewhere, my conclusion was that it
> would not be a useful optimization to do in the final full-matrix
> step to see if a pair is something that was a candidate in the
> middle step but did not match well enough (because the fact that the
> pair did not compare well enough with higher bar does not mean it
> would not compare well to pass the lower "minimum" bar).

Ah, gotcha!  Thanks for clarifying.  Yes, yet another reason to not
even try to avoid "redoing" the O(N) spanhash comparisons.


* [PATCH v2 0/4] Optimization batch 7: use file basenames to guide rename detection
  2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-02-07  5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
@ 2021-02-09 11:32 ` Elijah Newren via GitGitGadget
  2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
                     ` (4 more replies)
  4 siblings, 5 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-09 11:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren

This series depends on ort-perf-batch-6[1].

This series uses file basenames in a basic fashion to guide rename
detection.

Changes since v1:

 * Significant edits to the commit messages to make them explain the basic
   idea better and (hopefully) prevent misunderstandings.
 * Add a commit at the end that updates the documentation and includes a new
   testcase
 * Modify the code to make it clearer that it can already handle using a
   different score for basename comparison similarity, even if we don't use
   that ability yet (see below)

Changes not yet included -- I need input about what is wanted:

 * Stolee suggested not creating a separate score for basename comparisons,
   at least not yet. Junio suggested it may be prudent to use a higher score
   for that than whatever -M option the user provided for normal
   comparisons...but didn't suggest whether it should be a separate
   user-specified option, or some kind of weighted average of the -M option
   and MAX_SCORE (e.g. use 60% if -M is 50%, or use 80% if -M is 75%). I
   tweaked the code to make it clearer that it already is able to handle
   such a score difference, but I'm not sure whether we want an
   automatically computed higher value or a user-controlled possibly higher
   value.

[1] https://lore.kernel.org/git/xmqqlfc4byt6.fsf@gitster.c.googlers.com/
[2] https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

Elijah Newren (4):
  diffcore-rename: compute basenames of all source and dest candidates
  diffcore-rename: complete find_basename_matches()
  diffcore-rename: guide inexact rename detection based on basenames
  gitdiffcore doc: mention new preliminary step for rename detection

 Documentation/gitdiffcore.txt |  15 +++
 diffcore-rename.c             | 185 +++++++++++++++++++++++++++++++++-
 t/t4001-diff-rename.sh        |  24 +++++
 3 files changed, 220 insertions(+), 4 deletions(-)


base-commit: 7ae9460d3dba84122c2674b46e4339b9d42bdedd
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-843%2Fnewren%2Fort-perf-batch-7-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/843

Range-diff vs v1:

 1:  377a4a39fa86 ! 1:  381a45d239bb diffcore-rename: compute basenames of all source and dest candidates
     @@ Commit message
          diffcore-rename: compute basenames of all source and dest candidates
      
          We want to make use of unique basenames to help inform rename detection,
     -    so that more likely pairings can be checked first.  Add a new function,
     +    so that more likely pairings can be checked first.  (src/moduleA/foo.txt
     +    and source/module/A/foo.txt are likely related if there are no other
     +    'foo.txt' files among the deleted and added files.)  Add a new function,
          not yet used, which creates a map of the unique basenames within
          rename_src and another within rename_dst, together with the indices
          within rename_src/rename_dst where those basenames show up.  Non-unique
          basenames still show up in the map, but have an invalid index (-1).
      
     +    This function was inspired by the fact that in real world repositories,
     +    most renames often do not involve a basename change.  Here are some
     +    sample repositories and the percentage of their historical renames (as of
     +    early 2020) that did not involve a basename change:
     +      * linux: 76%
     +      * gcc: 64%
     +      * gecko: 79%
     +      * webkit: 89%
     +
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
 2:  5fb4493247ff ! 2:  dcd0175229aa diffcore-rename: complete find_basename_matches()
     @@ Commit message
          sufficiently similar, we record the rename; if not, we include those
          files in the more exhaustive matrix comparison.
      
     +    This means we are adding a set of preliminary additional comparisons,
     +    but for each file we only compare it with at most one other file.  For
     +    example, if there was a include/media/device.h that was deleted and a
     +    src/module/media/device.h that was added, and there were no other
     +    device.h files added or deleted between the commits being compared,
     +    then these two files would be compared in the preliminary step.
     +
     +    This commit does not yet actually employ this new optimization, it
     +    merely adds a function which can be used for this purpose.  The next
     +    commit will do the necessary plumbing to make use of it.
     +
          Note that this optimization might give us different results than without
          the optimization, because it's possible that despite files with the same
          basename being sufficiently similar to be considered a rename, there's
          an even better match between files without the same basename.  I think
     -    that is okay for four reasons: (1) That seems somewhat unlikely in
     -    practice, (2) it's easy to explain to the users what happened if it does
     -    ever occur (or even for them to intuitively figure out), and (3) as the
     -    next patch will show it provides such a large performance boost that
     -    it's worth the tradeoff.  Reason (4) takes a full paragraph to
     +    that is okay for four reasons: (1) it's easy to explain to the users
     +    what happened if it does ever occur (or even for them to intuitively
     +    figure out), (2) as the next patch will show it provides such a large
     +    performance boost that it's worth the tradeoff, and (3) it's somewhat
     +    unlikely that despite having unique matching basenames that other files
     +    serve as better matches.  Reason (4) takes a full paragraph to
          explain...
      
          If the previous three reasons aren't enough, consider what rename
     @@ Commit message
          basename and are sufficiently similar to be considered a rename, mark
          them as such without comparing the two to all other rename candidates.
      
     -    We do not use this heuristic together with either break or copy
     -    detection.  The point of break detection is to say that filename
     -    similarity does not imply file content similarity, and we only want to
     -    know about file content similarity.  The point of copy detection is to
     -    use more resources to check for additional similarities, while this is
     -    an optimization that uses far less resources but which might also result
     -    in finding slightly fewer similarities.  So the idea behind this
     -    optimization goes against both of those features, and will be turned off
     -    for both.
     -
     -    Note that this optimization is not yet effective in any situation, as
     -    the function is still unused.  The next commit will hook it into the
     -    code so that it is used when rename detection is wanted, but neither
     -    copy nor break detection are.
     -
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
 3:  1d941c35076e ! 3:  ce2173aa1fb7 diffcore-rename: guide inexact rename detection based on basenames
     @@ Commit message
      
          Make use of the new find_basename_matches() function added in the last
          two patches, to find renames more rapidly in cases where we can match up
     -    files based on basenames.
     +    files based on basenames.  As a quick reminder (see the last two commit
     +    messages for more details), this means for example that
     +    docs/extensions.txt and docs/config/extensions.txt are considered likely
     +    renames if there are no 'extensions.txt' files elsewhere among the added
     +    and deleted files, and if a similarity check confirms they are similar,
     +    then they are marked as a rename without looking for a better similarity
     +    match among other files.  This is a behavioral change, as covered in
     +    more detail in the previous commit message.
     +
     +    We do not use this heuristic together with either break or copy
     +    detection.  The point of break detection is to say that filename
     +    similarity does not imply file content similarity, and we only want to
     +    know about file content similarity.  The point of copy detection is to
     +    use more resources to check for additional similarities, while this is
     +    an optimization that uses far less resources but which might also result
     +    in finding slightly fewer similarities.  So the idea behind this
     +    optimization goes against both of those features, and will be turned off
     +    for both.
      
          For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
          performance work; instrument with trace2_region_* calls", 2020-10-28),
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +		remove_unneeded_paths_from_src(want_copies);
      +		trace2_region_leave("diff", "cull after exact", options->repo);
      +	} else {
     ++		/* Determine minimum score to match basenames */
     ++		int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5;
     ++
      +		/*
      +		 * Cull sources:
      +		 *   - remove ones involved in renames (found via exact match)
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +
      +		/* Utilize file basenames to quickly find renames. */
      +		trace2_region_enter("diff", "basename matches", options->repo);
     -+		rename_count += find_basename_matches(options, minimum_score,
     ++		rename_count += find_basename_matches(options,
     ++						      min_basename_score,
      +						      rename_src_nr);
      +		trace2_region_leave("diff", "basename matches", options->repo);
      +
 -:  ------------ > 4:  a0e75d8cd6bd gitdiffcore doc: mention new preliminary step for rename detection

-- 
gitgitgadget


* [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates
  2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
@ 2021-02-09 11:32   ` Elijah Newren via GitGitGadget
  2021-02-09 13:17     ` Derrick Stolee
  2021-02-09 11:32   ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-09 11:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

We want to make use of unique basenames to help inform rename detection,
so that more likely pairings can be checked first.  (src/moduleA/foo.txt
and source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the deleted and added files.)  Add a new function,
not yet used, which creates a map of the unique basenames within
rename_src and another within rename_dst, together with the indices
within rename_src/rename_dst where those basenames show up.  Non-unique
basenames still show up in the map, but have an invalid index (-1).

This function was inspired by the fact that in real world repositories,
most renames often do not involve a basename change.  Here are some
sample repositories and the percentage of their historical renames (as of
early 2020) that did not involve a basename change:
  * linux: 76%
  * gcc: 64%
  * gecko: 79%
  * webkit: 89%

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 74930716e70d..1c52077b04e5 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,59 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
+MAYBE_UNUSED
+static int find_basename_matches(struct diff_options *options,
+				 int minimum_score,
+				 int num_src)
+{
+	int i;
+	struct strintmap sources;
+	struct strintmap dests;
+
+	/* Create maps of basename -> fullname(s) for sources and dests */
+	strintmap_init_with_options(&sources, -1, NULL, 0);
+	strintmap_init_with_options(&dests, -1, NULL, 0);
+	for (i = 0; i < num_src; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		char *base;
+
+		/* exact renames removed in remove_unneeded_paths_from_src() */
+		assert(!rename_src[i].p->one->rename_used);
+
+		base = strrchr(filename, '/');
+		base = (base ? base+1 : filename);
+
+		/* Record index within rename_src (i) if basename is unique */
+		if (strintmap_contains(&sources, base))
+			strintmap_set(&sources, base, -1);
+		else
+			strintmap_set(&sources, base, i);
+	}
+	for (i = 0; i < rename_dst_nr; ++i) {
+		char *filename = rename_dst[i].p->two->path;
+		char *base;
+
+		if (rename_dst[i].is_rename)
+			continue; /* involved in exact match already. */
+
+		base = strrchr(filename, '/');
+		base = (base ? base+1 : filename);
+
+		/* Record index within rename_dst (i) if basename is unique */
+		if (strintmap_contains(&dests, base))
+			strintmap_set(&dests, base, -1);
+		else
+			strintmap_set(&dests, base, i);
+	}
+
+	/* TODO: Make use of basenames source and destination basenames */
+
+	strintmap_clear(&sources);
+	strintmap_clear(&dests);
+
+	return 0;
+}
+
 #define NUM_CANDIDATE_PER_DST 4
 static void record_if_better(struct diff_score m[], struct diff_score *o)
 {
-- 
gitgitgadget



* [PATCH v2 2/4] diffcore-rename: complete find_basename_matches()
  2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
  2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
@ 2021-02-09 11:32   ` Elijah Newren via GitGitGadget
  2021-02-09 13:25     ` Derrick Stolee
  2021-02-09 11:32   ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-09 11:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories.  We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames.  If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.

This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file.  For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there were no other
device.h files added or deleted between the commits being compared,
then these two files would be compared in the preliminary step.

This commit does not yet actually employ this new optimization; it
merely adds a function which can be used for this purpose.  The next
commit will do the necessary plumbing to make use of it.

Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename.  I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show, it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that, despite having unique matching basenames, other files
serve as better matches.  Reason (4) takes a full paragraph to
explain...

If the previous three reasons aren't enough, consider what rename
detection already does.  Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar.  In fact, in such a case, we don't even
bother comparing the files to see if they are similar, let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity.  Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents.  This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well.  Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 94 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 91 insertions(+), 3 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 1c52077b04e5..b1dda41de9b1 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -372,10 +372,48 @@ static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 int num_src)
 {
-	int i;
+	/*
+	 * When I checked, over 76% of file renames in linux just moved
+	 * files to a different directory but kept the same basename.  gcc
+	 * did that with over 64% of renames, gecko did it with over 79%,
+	 * and WebKit did it with over 89%.
+	 *
+	 * Therefore we can bypass the normal exhaustive NxM matrix
+	 * comparison of similarities between all potential rename sources
+	 * and destinations by instead using file basename as a hint, checking
+	 * for similarity between files with the same basename, and if we
+	 * find a pair that are sufficiently similar, record the rename
+	 * pair and exclude those two from the NxM matrix.
+	 *
+	 * This *might* cause us to find a less than optimal pairing (if
+	 * there is another file that we are even more similar to but has a
+	 * different basename).  Given the huge performance advantage
+	 * basename matching provides, and given the frequency with which
+	 * people use the same basename in real world projects, that's a
+	 * trade-off we are willing to accept when doing just rename
+	 * detection.  However, if someone wants copy detection that
+	 * implies they are willing to spend more cycles to find
+	 * similarities between files, so it may be less likely that this
+	 * heuristic is wanted.
+	 */
+
+	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
 
+	/*
+	 * The prefetching stuff wants to know if it can skip prefetching blobs
+	 * that are unmodified.  Unmodified blobs are only relevant when doing
+	 * copy detection.  find_basename_matches() is only used when detecting
+	 * renames, not when detecting copies, so it'll only be used when a file
+	 * only existed in the source.  Since we already know that the file
+	 * won't be unmodified, there's no point checking for it; that's just a
+	 * waste of resources.  So set skip_unmodified to 0 so that
+	 * estimate_similarity() and prefetch() won't waste resources checking
+	 * for something we already know is false.
+	 */
+	int skip_unmodified = 0;
+
 	/* Create maps of basename -> fullname(s) for sources and dests */
 	strintmap_init_with_options(&sources, -1, NULL, 0);
 	strintmap_init_with_options(&dests, -1, NULL, 0);
@@ -412,12 +450,62 @@ static int find_basename_matches(struct diff_options *options,
 			strintmap_set(&dests, base, i);
 	}
 
-	/* TODO: Make use of basenames source and destination basenames */
+	/* Now look for basename matchups and do similarity estimation */
+	for (i = 0; i < num_src; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		char *base = NULL;
+		intptr_t src_index;
+		intptr_t dst_index;
+
+		/* Get the basename */
+		base = strrchr(filename, '/');
+		base = (base ? base+1 : filename);
+
+		/* Find out if this basename is unique among sources */
+		src_index = strintmap_get(&sources, base);
+		if (src_index == -1)
+			continue; /* not a unique basename; skip it */
+		assert(src_index == i);
+
+		if (strintmap_contains(&dests, base)) {
+			struct diff_filespec *one, *two;
+			int score;
+
+			/* Find out if this basename is unique among dests */
+			dst_index = strintmap_get(&dests, base);
+			if (dst_index == -1)
+				continue; /* not a unique basename; skip it */
+
+			/* Ignore this dest if already used in a rename */
+			if (rename_dst[dst_index].is_rename)
+				continue; /* already used previously */
+
+			/* Estimate the similarity */
+			one = rename_src[src_index].p->one;
+			two = rename_dst[dst_index].p->two;
+			score = estimate_similarity(options->repo, one, two,
+						    minimum_score, skip_unmodified);
+
+			/* If sufficiently similar, record as rename pair */
+			if (score < minimum_score)
+				continue;
+			record_rename_pair(dst_index, src_index, score);
+			renames++;
+
+			/*
+			 * Found a rename so don't need text anymore; if we
+			 * didn't find a rename, the filespec_blob would get
+			 * re-used when doing the matrix of comparisons.
+			 */
+			diff_free_filespec_blob(one);
+			diff_free_filespec_blob(two);
+		}
+	}
 
 	strintmap_clear(&sources);
 	strintmap_clear(&dests);
 
-	return 0;
+	return renames;
 }
 
 #define NUM_CANDIDATE_PER_DST 4
-- 
gitgitgadget



* [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
  2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
  2021-02-09 11:32   ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
@ 2021-02-09 11:32   ` Elijah Newren via GitGitGadget
  2021-02-09 13:33     ` Derrick Stolee
  2021-02-09 11:32   ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
  2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
  4 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-09 11:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames.  As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no 'extensions.txt' files elsewhere among the added
and deleted files, and if a similarity check confirms they are similar,
then they are marked as a rename without looking for a better similarity
match among other files.  This is a behavioral change, as covered in
more detail in the previous commit message.

We do not use this heuristic together with either break or copy
detection.  The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity.  The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities.  So the idea behind this
optimization goes against both of those features, and will be turned off
for both.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       13.815 s ±  0.062 s    13.138 s ±  0.086 s
    mega-renames:   1799.937 s ±  0.493 s   169.488 s ±  0.494 s
    just-one-mega:    51.289 s ±  0.019 s     5.061 s ±  0.017 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 41 insertions(+), 5 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index b1dda41de9b1..048a6186fd21 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,7 +367,6 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
-MAYBE_UNUSED
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 int num_src)
@@ -718,12 +717,49 @@ void diffcore_rename(struct diff_options *options)
 	if (minimum_score == MAX_SCORE)
 		goto cleanup;
 
+	num_sources = rename_src_nr;
+
+	if (want_copies || break_idx) {
+		/*
+		 * Cull sources:
+		 *   - remove ones corresponding to exact renames
+		 */
+		trace2_region_enter("diff", "cull after exact", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull after exact", options->repo);
+	} else {
+		/* Determine minimum score to match basenames */
+		int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5;
+
+		/*
+		 * Cull sources:
+		 *   - remove ones involved in renames (found via exact match)
+		 */
+		trace2_region_enter("diff", "cull exact", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull exact", options->repo);
+
+		/* Utilize file basenames to quickly find renames. */
+		trace2_region_enter("diff", "basename matches", options->repo);
+		rename_count += find_basename_matches(options,
+						      min_basename_score,
+						      rename_src_nr);
+		trace2_region_leave("diff", "basename matches", options->repo);
+
+		/*
+		 * Cull sources, again:
+		 *   - remove ones involved in renames (found via basenames)
+		 */
+		trace2_region_enter("diff", "cull basename", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull basename", options->repo);
+	}
+
 	/*
-	 * Calculate how many renames are left
+	 * Calculate how many rename destinations are left
 	 */
 	num_destinations = (rename_dst_nr - rename_count);
-	remove_unneeded_paths_from_src(want_copies);
-	num_sources = rename_src_nr;
+	num_sources = rename_src_nr; /* rename_src_nr reflects lower number */
 
 	/* All done? */
 	if (!num_destinations || !num_sources)
@@ -755,7 +791,7 @@ void diffcore_rename(struct diff_options *options)
 		struct diff_score *m;
 
 		if (rename_dst[i].is_rename)
-			continue; /* dealt with exact match already. */
+			continue; /* exact or basename match already handled */
 
 		m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
 		for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
-- 
gitgitgadget



* [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection
  2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-02-09 11:32   ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
@ 2021-02-09 11:32   ` Elijah Newren via GitGitGadget
  2021-02-09 12:59     ` Derrick Stolee
  2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
  4 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-09 11:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

The last few patches have introduced a new preliminary step when rename
detection is on but both break detection and copy detection are off.
Document this new step.  While we're at it, add a testcase that checks
the new behavior as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/gitdiffcore.txt | 15 +++++++++++++++
 t/t4001-diff-rename.sh        | 24 ++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c970d9fe438a..954ae3ef1082 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -168,6 +168,21 @@ a similarity score different from the default of 50% by giving a
 number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
 8/10 = 80%).
 
+Note that when rename detection is on but both copy and break
+detection are off, rename detection adds a preliminary step that first
+checks files with the same basename.  If files with the same basename
+are sufficiently similar, it will mark them as renames and exclude
+them from the later quadratic step (the one that pairwise compares all
+unmatched files to find the "best" matches, determined by the highest
+content similarity).  So, for example, if docs/extensions.txt and
+docs/config/extensions.txt have similar content, then they will be
+marked as a rename even if it turns out that docs/extensions.txt was
+more similar to src/extension-checks.c.  At most, one comparison is
+done per file in this preliminary pass; so if there are several
+extensions.txt files throughout the directory hierarchy that were
+added and deleted, this preliminary step will be skipped for those
+files.
+
 Note.  When the "-C" option is used with `--find-copies-harder`
 option, 'git diff-{asterisk}' commands feed unmodified filepairs to
 diffcore mechanism as well as modified ones.  This lets the copy
diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index c16486a9d41a..bf62537c29a0 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -262,4 +262,28 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
 	grep "myotherfile.*myfile" actual
 '
 
+test_expect_success 'basename similarity vs best similarity' '
+	mkdir subdir &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			 line6 line7 line8 line9 line10 >subdir/file.txt &&
+	git add subdir/file.txt &&
+	git commit -m "base txt" &&
+
+	git rm subdir/file.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 >file.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 line9 >file.md &&
+	git add file.txt file.md &&
+	git commit -a -m "rename" &&
+	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
+	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
+	# but since same basenames are checked first...
+	cat >expected <<-\EOF &&
+	A	file.md
+	R078	subdir/file.txt	file.txt
+	EOF
+	test_cmp expected actual
+'
+
 test_done
-- 
gitgitgadget


* Re: [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection
  2021-02-09 11:32   ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
@ 2021-02-09 12:59     ` Derrick Stolee
  2021-02-09 17:03       ` Junio C Hamano
  0 siblings, 1 reply; 71+ messages in thread
From: Derrick Stolee @ 2021-02-09 12:59 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren

On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> The last few patches have introduced a new preliminary step when rename
> detection is on but both break detection and copy detection are off.
> Document this new step.  While we're at it, add a testcase that checks
> the new behavior as well.

Thanks for adding this documentation and test.

> +Note that when rename detection is on but both copy and break
> +detection are off, rename detection adds a preliminary step that first
> +checks files with the same basename.  If files with the same basename

I find myself wanting a definition of 'basename' here, but perhaps I'm
just being pedantic. A quick search clarifies this as a standard term [1]
of which I was just ignorant.

[1] https://man7.org/linux/man-pages/man3/basename.3.html

> +are sufficiently similar, it will mark them as renames and exclude
> +them from the later quadratic step (the one that pairwise compares all
> +unmatched files to find the "best" matches, determined by the highest
> +content similarity).  So, for example, if docs/extensions.txt and
> +docs/config/extensions.txt have similar content, then they will be
> +marked as a rename even if it turns out that docs/extensions.txt was
> +more similar to src/extension-checks.c.  At most, one comparison is
> +done per file in this preliminary pass; so if there are several
> +extensions.txt files throughout the directory hierarchy that were
> +added and deleted, this preliminary step will be skipped for those
> +files.

> +test_expect_success 'basename similarity vs best similarity' '
> +	mkdir subdir &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			 line6 line7 line8 line9 line10 >subdir/file.txt &&
> +	git add subdir/file.txt &&
> +	git commit -m "base txt" &&
> +
> +	git rm subdir/file.txt &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			  line6 line7 line8 >file.txt &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			  line6 line7 line8 line9 >file.md &&
> +	git add file.txt file.md &&
> +	git commit -a -m "rename" &&
> +	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
> +	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
> +	# but since same basenames are checked first...
> +	cat >expected <<-\EOF &&
> +	A	file.md
> +	R078	subdir/file.txt	file.txt
> +	EOF
> +	test_cmp expected actual
> +'
> +

I appreciate the additional comments in this test to make it clear
what you are testing. A minor nit is that the test could have been
added at the start of the series to document the _old_ behavior.
The 'expected' file would have this content:

+	cat >expected <<-\EOF &&
+	A	file.txt
+	R078	subdir/file.txt	file.md
+	EOF

Then, this test case would change the expected output in the same
patch that introduces the behavior change.
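
For anyone checking the percentages in the test comment by hand, here is
a rough sketch.  It rests on a simplifying assumption: similarity is
modeled as the bytes shared with the source divided by the size of the
larger file.  The real number comes from estimate_similarity(), which
this does not reproduce, and the names lines_size() and shared_percent()
are made up for illustration.

```c
#include <assert.h>

/* Bytes written by test_write_lines for line<first>..line<last> */
static int lines_size(int first, int last)
{
	int n, size = 0;

	assert(first >= 1 && first <= last && last < 100);
	for (n = first; n <= last; n++)
		size += 4 + (n < 10 ? 1 : 2) + 1; /* "line" + digits + '\n' */
	return size;
}

/*
 * Simplified model (an assumption, not git's algorithm): percentage of
 * the larger file's bytes that are shared, truncated to an integer.
 */
static int shared_percent(int shared_bytes, int src_size, int dst_size)
{
	int max_size = src_size > dst_size ? src_size : dst_size;

	assert(max_size > 0);
	return shared_bytes * 100 / max_size;
}
```

Under this model subdir/file.txt is 61 bytes, file.txt shares 48 of them
and file.md shares 54.  That gives 78 for file.txt, matching the R078 in
the expected output, and 88 for file.md (88.5 before truncation), in
line with the roughly 89% quoted in the comment.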

Thanks,
-Stolee



* Re: [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates
  2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
@ 2021-02-09 13:17     ` Derrick Stolee
  2021-02-09 16:56       ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Derrick Stolee @ 2021-02-09 13:17 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren

On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> We want to make use of unique basenames to help inform rename detection,
> so that more likely pairings can be checked first.  (src/moduleA/foo.txt
> and source/module/A/foo.txt are likely related if there are no other
> 'foo.txt' files among the deleted and added files.)  Add a new function,
> not yet used, which creates a map of the unique basenames within
> rename_src and another within rename_dst, together with the indices
> within rename_src/rename_dst where those basenames show up.  Non-unique
> basenames still show up in the map, but have an invalid index (-1).
> 
> This function was inspired by the fact that in real world repositories,
> most renames often do not involve a basename change.  Here are some
> sample repositories and the percentage of their historical renames (as of
> early 2020) that did not involve a basename change:

I found this difficult to parse. Perhaps instead

  "the percentage of their renames that preserved basenames".

We might also need a stronger statistic, though: what percentage of renames
preserved the basename and also had no other file with that basename among
all the adds and deletes?

Is this reproducible from a shell command that could be documented here?

> +MAYBE_UNUSED
> +static int find_basename_matches(struct diff_options *options,
> +				 int minimum_score,
> +				 int num_src)
> +{
> +	int i;
> +	struct strintmap sources;
> +	struct strintmap dests;
> +
> +	/* Create maps of basename -> fullname(s) for sources and dests */
> +	strintmap_init_with_options(&sources, -1, NULL, 0);
> +	strintmap_init_with_options(&dests, -1, NULL, 0);

Initially, I was wondering why we need the map for each side, but we will need
to enforce uniqueness in each set, so OK.
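
To make the per-side uniqueness concrete, here is a minimal standalone
sketch, not the patch's code.  It swaps strintmap for a tiny linear-scan
table, and struct base_entry, basename_of(), record_basename() and
lookup_basename() are hypothetical names used only for illustration.  A
basename seen once keeps its index; a repeated basename is demoted to -1.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for strintmap: basename -> index, -1 if not unique */
struct base_entry {
	const char *base;
	int index;
};

/* Return the part of path after the last '/', or path itself */
static const char *basename_of(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/*
 * Record idx for the basename of path.  If that basename was already
 * recorded, demote it to -1 so later lookups treat it as non-unique.
 * map must have room for one more entry.
 */
static void record_basename(struct base_entry *map, int *nr,
			    const char *path, int idx)
{
	const char *base = basename_of(path);
	int i;

	assert(idx >= 0);
	for (i = 0; i < *nr; i++) {
		if (!strcmp(map[i].base, base)) {
			map[i].index = -1;
			return;
		}
	}
	map[*nr].base = base;
	map[*nr].index = idx;
	(*nr)++;
}

/* Look up a basename; -1 means "absent or not unique" */
static int lookup_basename(const struct base_entry *map, int nr,
			   const char *base)
{
	int i;

	for (i = 0; i < nr; i++)
		if (!strcmp(map[i].base, base))
			return map[i].index;
	return -1;
}
```

Sources and dests each get their own table in this sketch, since a
basename can be unique among the deleted files while appearing several
times among the added ones (or the reverse).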

> +	for (i = 0; i < num_src; ++i) {
> +		char *filename = rename_src[i].p->one->path;
> +		char *base;
> +
> +		/* exact renames removed in remove_unneeded_paths_from_src() */
> +		assert(!rename_src[i].p->one->rename_used);
> +
> +		base = strrchr(filename, '/');
> +		base = (base ? base+1 : filename);

nit: "base + 1"

Also, this is used here and below. Perhaps it's worth pulling out as a
helper? I see similar code being duplicated in these existing spots:

* diff-no-index.c:append_basename()
* help.c:append_similar_ref()
* packfile.c:pack_basename()
* replace-object.c:register_replace_ref()
* setup.c:read_gitfile_gently()
* builtin/rebase.c:cmd_rebase()
* builtin/stash.c:do_create_stash()
* builtin/worktree.c:add_worktree()
* contrib/credential/gnome-keyring/git-credential-gnome-keyring.c:usage()
* contrib/credential/libsecret/git-credential-libsecret.c:usage()
* trace2/tr2_dst.c:tr2_dst_try_auto_path()

There are other places that use strchr(_, '/') but they seem to be related
to peeling basenames off of paths and using the leading portion of the path.
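
Such a helper could look roughly like the sketch below.  The name
get_basename() is hypothetical (nothing in this version of the series
defines it), and the sketch takes and returns a const char * for
simplicity rather than matching the char * used in the patch.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: return a pointer to the part of path after the
 * last '/', or path itself if it contains no '/'.  The result points
 * into path; nothing is allocated.
 */
static const char *get_basename(const char *path)
{
	const char *slash;

	assert(path);
	slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}
```

With something like this, each of the two loops would reduce to a single
"base = get_basename(filename);" line.  Unlike basename(3), this sketch
returns an empty string for a path ending in '/', which should not
matter for the file paths seen here.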

> +		/* Record index within rename_src (i) if basename is unique */
> +		if (strintmap_contains(&sources, base))
> +			strintmap_set(&sources, base, -1);
> +		else
> +			strintmap_set(&sources, base, i);
> +	}
> +	for (i = 0; i < rename_dst_nr; ++i) {
> +		char *filename = rename_dst[i].p->two->path;
> +		char *base;
> +
> +		if (rename_dst[i].is_rename)
> +			continue; /* involved in exact match already. */
> +
> +		base = strrchr(filename, '/');
> +		base = (base ? base+1 : filename);
> +
> +		/* Record index within rename_dst (i) if basename is unique */
> +		if (strintmap_contains(&dests, base))
> +			strintmap_set(&dests, base, -1);
> +		else
> +			strintmap_set(&dests, base, i);
> +	}
> +
> +	/* TODO: Make use of basenames source and destination basenames */
> +
> +	strintmap_clear(&sources);
> +	strintmap_clear(&dests);
> +
> +	return 0;
> +}

Thanks,
-Stolee


* Re: [PATCH v2 2/4] diffcore-rename: complete find_basename_matches()
  2021-02-09 11:32   ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
@ 2021-02-09 13:25     ` Derrick Stolee
  2021-02-09 17:17       ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Derrick Stolee @ 2021-02-09 13:25 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren

On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
> +	/*
> +	 * When I checked, over 76% of file renames in linux just moved

Perhaps "In late 2020," instead of "When I checked".

> +	 * files to a different directory but kept the same basename.  gcc
> +	 * did that with over 64% of renames, gecko did it with over 79%,
> +	 * and WebKit did it with over 89%.
> +	 *
> +	 * Therefore we can bypass the normal exhaustive NxM matrix
> +	 * comparison of similarities between all potential rename sources
> +	 * and destinations by instead using file basename as a hint, checking
> +	 * for similarity between files with the same basename, and if we
> +	 * find a pair that are sufficiently similar, record the rename
> +	 * pair and exclude those two from the NxM matrix.
> +	 *
> +	 * This *might* cause us to find a less than optimal pairing (if
> +	 * there is another file that we are even more similar to but has a
> +	 * different basename).  Given the huge performance advantage
> +	 * basename matching provides, and given the frequency with which
> +	 * people use the same basename in real world projects, that's a
> +	 * trade-off we are willing to accept when doing just rename
> +	 * detection.  However, if someone wants copy detection that
> +	 * implies they are willing to spend more cycles to find
> +	 * similarities between files, so it may be less likely that this
> +	 * heuristic is wanted.
> +	 */
> +
> +	int i, renames = 0;
>  	struct strintmap sources;
>  	struct strintmap dests; 

...

> +	 * copy detection.  find_basename_matches() is only used when detecting
> +	 * renames, not when detecting copies, so it'll only be used when a file
> +	 * only existed in the source.  Since we already know that the file

There are two "only"s in this sentence. Just awkward, not wrong.

> +	 * won't be unmodified, there's no point checking for it; that's just a
> +	 * waste of resources.  So set skip_unmodified to 0 so that
> +	 * estimate_similarity() and prefetch() won't waste resources checking
> +	 * for something we already know is false.
> +	 */
> +	int skip_unmodified = 0;
> +



> -	/* TODO: Make use of basenames source and destination basenames */
> +	/* Now look for basename matchups and do similarity estimation */
> +	for (i = 0; i < num_src; ++i) {
> +		char *filename = rename_src[i].p->one->path;
> +		char *base = NULL;
> +		intptr_t src_index;
> +		intptr_t dst_index;
> +
> +		/* Get the basename */
> +		base = strrchr(filename, '/');
> +		base = (base ? base+1 : filename);

Here is the third instance of this in the same function. At minimum we should
extract a helper for you to consume.

> +		/* Find out if this basename is unique among sources */
> +		src_index = strintmap_get(&sources, base);
> +		if (src_index == -1)
> +			continue; /* not a unique basename; skip it */
> +		assert(src_index == i);
> +
> +		if (strintmap_contains(&dests, base)) {
> +			struct diff_filespec *one, *two;
> +			int score;
> +
> +			/* Find out if this basename is unique among dests */
> +			dst_index = strintmap_get(&dests, base);
> +			if (dst_index == -1)
> +				continue; /* not a unique basename; skip it */
> +
> +			/* Ignore this dest if already used in a rename */
> +			if (rename_dst[dst_index].is_rename)
> +				continue; /* already used previously */
> +
> +			/* Estimate the similarity */
> +			one = rename_src[src_index].p->one;
> +			two = rename_dst[dst_index].p->two;
> +			score = estimate_similarity(options->repo, one, two,
> +						    minimum_score, skip_unmodified);
> +
> +			/* If sufficiently similar, record as rename pair */
> +			if (score < minimum_score)
> +				continue;
> +			record_rename_pair(dst_index, src_index, score);
> +			renames++;
> +
> +			/*
> +			 * Found a rename so don't need text anymore; if we
> +			 * didn't find a rename, the filespec_blob would get
> +			 * re-used when doing the matrix of comparisons.
> +			 */
> +			diff_free_filespec_blob(one);
> +			diff_free_filespec_blob(two);
> +		}
> +	}

Makes sense to me.
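
To make the pairing rule concrete, here is a minimal standalone sketch of
the "unique basename on both sides" idea. It is not the patch's code: it
uses plain arrays and a linear scan instead of strintmap, and the names
base_of(), unique_index() and pair_by_unique_basename() are hypothetical.
It only finds candidate pairs; the real function still has to pass each
candidate through estimate_similarity() before recording a rename.

```c
#include <string.h>

/* Return the part of `path` after the last '/', or `path` itself. */
static const char *base_of(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/*
 * Return the index of the only path in paths[0..n) whose basename is
 * `base`, or -1 if there is no such path or more than one.
 */
static int unique_index(const char **paths, int n, const char *base)
{
	int i, found = -1;

	for (i = 0; i < n; i++) {
		if (strcmp(base_of(paths[i]), base))
			continue;
		if (found != -1)
			return -1; /* basename is not unique */
		found = i;
	}
	return found;
}

/*
 * For each source whose basename is unique among sources and also
 * unique among dests, store the dest index in pair[i]; otherwise
 * store -1.  Returns the number of candidate pairs found.
 */
static int pair_by_unique_basename(const char **src, int nsrc,
				   const char **dst, int ndst, int *pair)
{
	int i, count = 0;

	for (i = 0; i < nsrc; i++) {
		const char *base = base_of(src[i]);

		pair[i] = -1;
		if (unique_index(src, nsrc, base) != i)
			continue; /* not unique among sources */
		pair[i] = unique_index(dst, ndst, base);
		if (pair[i] != -1)
			count++;
	}
	return count;
}
```

With sources "src/moduleA/foo.txt", "a/Makefile", "b/Makefile" and dests
"c/Makefile", "source/module/A/foo.txt", only foo.txt is a candidate,
because Makefile appears twice among the sources.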

Thanks,
-Stolee


* Re: [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-09 11:32   ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
@ 2021-02-09 13:33     ` Derrick Stolee
  2021-02-09 17:41       ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Derrick Stolee @ 2021-02-09 13:33 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren

On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
> +	num_sources = rename_src_nr;
> +
> +	if (want_copies || break_idx) {
> +		/*
> +		 * Cull sources:
> +		 *   - remove ones corresponding to exact renames
> +		 */
> +		trace2_region_enter("diff", "cull after exact", options->repo);
> +		remove_unneeded_paths_from_src(want_copies);
> +		trace2_region_leave("diff", "cull after exact", options->repo);

Isn't this the same as

> +	} else {
> +		/* Determine minimum score to match basenames */
> +		int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5;
> +
> +		/*
> +		 * Cull sources:
> +		 *   - remove ones involved in renames (found via exact match)
> +		 */
> +		trace2_region_enter("diff", "cull exact", options->repo);
> +		remove_unneeded_paths_from_src(want_copies);
> +		trace2_region_leave("diff", "cull exact", options->repo);

...this? (except the regions are renamed)

Could this be simplified as:

+	num_sources = rename_src_nr;
+
+	trace2_region_enter("diff", "cull after exact", options->repo);
+	remove_unneeded_paths_from_src(want_copies);
+	trace2_region_leave("diff", "cull after exact", options->repo);
+
+	if (!want_copies && !break_idx) {
+		/* Determine minimum score to match basenames */

I realize you probably changed the region names on purpose to distinguish
that there are two "cull" regions in the case of no copies, but I think
that isn't really worth different names. Better to have a consistent region
name around the same activity in both cases.

> +		int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5;

Did you intend for this to be 5*min + 0*MAX? This seems wrong if you want
this value to be different from minimum_score.

> +
> +		/* Utilize file basenames to quickly find renames. */
> +		trace2_region_enter("diff", "basename matches", options->repo);
> +		rename_count += find_basename_matches(options,
> +						      min_basename_score,
> +						      rename_src_nr);
> +		trace2_region_leave("diff", "basename matches", options->repo);
> +
> +		/*
> +		 * Cull sources, again:
> +		 *   - remove ones involved in renames (found via basenames)
> +		 */
> +		trace2_region_enter("diff", "cull basename", options->repo);
> +		remove_unneeded_paths_from_src(want_copies);
> +		trace2_region_leave("diff", "cull basename", options->repo);
> +	}
> +

Thanks,
-Stolee


* Re: [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates
  2021-02-09 13:17     ` Derrick Stolee
@ 2021-02-09 16:56       ` Elijah Newren
  2021-02-09 17:02         ` Derrick Stolee
  0 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren @ 2021-02-09 16:56 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King

Hi,

On Tue, Feb 9, 2021 at 5:17 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > We want to make use of unique basenames to help inform rename detection,
> > so that more likely pairings can be checked first.  (src/moduleA/foo.txt
> > and source/module/A/foo.txt are likely related if there are no other
> > 'foo.txt' files among the deleted and added files.)  Add a new function,
> > not yet used, which creates a map of the unique basenames within
> > rename_src and another within rename_dst, together with the indices
> > within rename_src/rename_dst where those basenames show up.  Non-unique
> > basenames still show up in the map, but have an invalid index (-1).
> >
> > This function was inspired by the fact that in real world repositories,
> > most renames often do not involve a basename change.  Here are some
> > sample repositories and the percentage of their historical renames (as of
> > early 2020) that did not involve a basename change:
>
> I found this difficult to parse. Perhaps instead
>
>   "the percentage of their renames that preserved basenames".

Ooh, I like it; happy to make that change.

> We might also need something stronger, though: which percentage of renames
> preserved the basename but also had no other copy of that basename in the
> scope of all add/deletes?

I don't think it's useful to try to prove that this idea can save time
or how much time we can save before we try it; I think the only
purpose of these numbers should be to motivate the idea behind why it
was worth trying.  If we attempt to prove how much we'll save a priori,
then what you have is also too weak.  We would need "percentage of
total adds/deletes that are renames that preserved the basename but
also had no other copy of that basename in the scope of all
add/deletes".  But that is also wrong, actually; we need "for any
given two commits that we are likely to diff, what is the average
percentage of total adds/deletes between them that are renames that
preserved the basename but also had no other copy of that basename in
the scope of all add/deletes".  In particular, my script did not look
at the "any two given likely-to-be-diffed commits" viewpoint, I simply
added the number of renames within individual commits that preserved
basenames, and divided by the total number of renames in individual
commits.  But even if we could calculate the "any two given
likely-to-be-diffed commits" viewpoint in some sane manner, it'd still
be misleading.  The next series is going to change the "no other copy
of that basename in the scope of all adds/deletes" caveat, by adding a
way to match up _some_ of those files (when it can find a way to
compare any given file to exactly one of the other files with the same
basename).  And even if you consider all the above and calculated it
in order to try to show how much could be saved, you might need to
start worrying about details like the fact that the first comparison
between files in diffcore-rename.c is _much_ more expensive than
subsequent comparisons (due to the fact that the spanhash is cached).

Trying to account for all these details and describe them fully is
completely beside the point, though; I didn't bother to check any of
this before implementing the algorithm -- I just looked up these very
rough numbers and felt they provided sufficient motivation that there
was an optimization worth trying.

> Is this reproducible from a shell command that could be documented here?

No, trying to parse log output with full handling of proper quoting in
the case of filenames with funny characters is too complex to attempt
in shell.  I was surprised by how long it turned out to be in python.
(And I dread attempting to calculate "something stronger" in any
accurate way given how involved just this rough calculation was.  That
idea seems harder to me than actually implementing this series.)

If you're curious, though, and don't care about
quickly-hacked-together-script-not-designed-for-reuse:
https://github.com/newren/git/blob/ort/rebase-testcase/count-renames.py

> > +MAYBE_UNUSED
> > +static int find_basename_matches(struct diff_options *options,
> > +                              int minimum_score,
> > +                              int num_src)
> > +{
> > +     int i;
> > +     struct strintmap sources;
> > +     struct strintmap dests;
> > +
> > +     /* Create maps of basename -> fullname(s) for sources and dests */
> > +     strintmap_init_with_options(&sources, -1, NULL, 0);
> > +     strintmap_init_with_options(&dests, -1, NULL, 0);
>
> Initially, I was wondering why we need the map for each side, but we will need
> to enforce uniqueness in each set, so OK.
>
> > +     for (i = 0; i < num_src; ++i) {
> > +             char *filename = rename_src[i].p->one->path;
> > +             char *base;
> > +
> > +             /* exact renames removed in remove_unneeded_paths_from_src() */
> > +             assert(!rename_src[i].p->one->rename_used);
> > +
> > +             base = strrchr(filename, '/');
> > +             base = (base ? base+1 : filename);
>
> nit: "base + 1"

Will fix.

> Also, this is used here and below. Perhaps it's worth pulling out as a
> helper? I see similar code being duplicated in these existing spots:
>
> * diff-no-index.c:append_basename()
> * help.c:append_similar_ref()
> * packfile.c:pack_basename()
> * replace-object.c:register_replace_ref()
> * setup.c:read_gitfile_gently()
> * builtin/rebase.c:cmd_rebase()
> * builtin/stash.c:do_create_stash()
> * builtin/worktree.c:add_worktree()
> * contrib/credential/gnome-keyring/git-credential-gnome-keyring.c:usage()
> * contrib/credential/libsecret/git-credential-libsecret.c:usage()
> * trace2/tr2_dst.c:tr2_dst_try_auto_path()

Honestly asking: would anyone ever search for such a two-line helper
function?  I wouldn't have even thought to look, since it seems so
simple.

However, my real concern here is that this type of change would risk
introducing conflicts with unrelated series.  This series is the
second in what will be a 9-series deep dependency chain of
optimizations[1], and the later series are going to be longer than
these first two were (the latter ones are 6-11 patches each).  We've
already discussed previously whether we possibly want to hold the
first couple optimization series out of the upcoming git-2.31 release
in order to keep the optimizations all together, but that might
increase the risk of conflicts with unrelated patches if we try a
bigger tree refactor like this.  (Junio never commented on that,
though.)  It might be better to keep the series touching only
merge-ort.c & diffcore-rename.c, and then do cleanups like the one you
suggest here after the whole series.

That said, it's not a difficult initial change, so I'm mostly
expressing this concern to avoid making things harder for Junio.  It'd
be best to get his opinion -- Junio, your thoughts?

[1] https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Anewren+Optimization+batch

> There are other places that use strchr(_, '/') but they seem to be related
> to peeling basenames off of paths and using the leading portion of the path.
>
> > +             /* Record index within rename_src (i) if basename is unique */
> > +             if (strintmap_contains(&sources, base))
> > +                     strintmap_set(&sources, base, -1);
> > +             else
> > +                     strintmap_set(&sources, base, i);
> > +     }
> > +     for (i = 0; i < rename_dst_nr; ++i) {
> > +             char *filename = rename_dst[i].p->two->path;
> > +             char *base;
> > +
> > +             if (rename_dst[i].is_rename)
> > +                     continue; /* involved in exact match already. */
> > +
> > +             base = strrchr(filename, '/');
> > +             base = (base ? base+1 : filename);
> > +
> > +             /* Record index within rename_dst (i) if basename is unique */
> > +             if (strintmap_contains(&dests, base))
> > +                     strintmap_set(&dests, base, -1);
> > +             else
> > +                     strintmap_set(&dests, base, i);
> > +     }
> > +
> > +     /* TODO: Make use of basenames source and destination basenames */
> > +
> > +     strintmap_clear(&sources);
> > +     strintmap_clear(&dests);
> > +
> > +     return 0;
> > +}
>
> Thanks,
> -Stolee

Thanks for the review!


* Re: [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates
  2021-02-09 16:56       ` Elijah Newren
@ 2021-02-09 17:02         ` Derrick Stolee
  2021-02-09 17:42           ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Derrick Stolee @ 2021-02-09 17:02 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King

On 2/9/2021 11:56 AM, Elijah Newren wrote:
>> Also, this is used here and below. Perhaps it's worth pulling out as a
>> helper? I see similar code being duplicated in these existing spots:
>>
>> * diff-no-index.c:append_basename()
>> * help.c:append_similar_ref()
>> * packfile.c:pack_basename()
>> * replace-object.c:register_replace_ref()
>> * setup.c:read_gitfile_gently()
>> * builtin/rebase.c:cmd_rebase()
>> * builtin/stash.c:do_create_stash()
>> * builtin/worktree.c:add_worktree()
>> * contrib/credential/gnome-keyring/git-credential-gnome-keyring.c:usage()
>> * contrib/credential/libsecret/git-credential-libsecret.c:usage()
>> * trace2/tr2_dst.c:tr2_dst_try_auto_path()
> Honestly asking: would anyone ever search for such a two-line helper
> function?  I wouldn't have even thought to look, since it seems so
> simple.
> 
> However, my real concern here is that this type of change would risk
> introducing conflicts with unrelated series.  This series is the
> second in what will be a 9-series deep dependency chain of
> optimizations[1], and the later series are going to be longer than
> these first two were (the latter ones are 6-11 patches each).  We've
> already discussed previously whether we possibly want to hold the
> first couple optimization series out of the upcoming git-2.31 release
> in order to keep the optimizations all together, but that might
> increase the risk of conflicts with unrelated patches if we try a
> bigger tree refactor like this.  (Junio never commented on that,
> though.)  It might be better to keep the series touching only
> merge-ort.c & diffcore-rename.c, and then do cleanups like the one you
> suggest here after the whole series.
> 
> That said, it's not a difficult initial change, so I'm mostly
> expressing this concern to avoid making things harder for Junio.  It'd
> be best to get his opinion -- Junio, your thoughts?
> 
> [1] https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Anewren+Optimization+batch
 
I don't consider the step of "go put the helper in all these other
places" necessary for the current series. However, the "get basename"
code appears a total of three times in this series, so it would be
good to at least extract it to a static inline method to reduce
the duplication isolated to this change.

Thanks,
-Stolee



* Re: [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection
  2021-02-09 12:59     ` Derrick Stolee
@ 2021-02-09 17:03       ` Junio C Hamano
  2021-02-09 17:44         ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2021-02-09 17:03 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, git, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Elijah Newren

Derrick Stolee <stolee@gmail.com> writes:

>> +Note that when rename detection is on but both copy and break
>> +detection are off, rename detection adds a preliminary step that first
>> +checks files with the same basename.  If files with the same basename
>
> I find myself wanting a definition of 'basename' here, but perhaps I'm
> just being pedantic. A quick search clarifies this as a standard term [1]
> of which I was just ignorant.
>
> [1] https://man7.org/linux/man-pages/man3/basename.3.html
>
>> +are sufficiently similar, it will mark them as renames and exclude
>> +them from the later quadratic step (the one that pairwise compares all
>> +unmatched files to find the "best" matches, determined by the highest
>> +content similarity).

While I do not think `basename` is unacceptably bad, we should aim
to do better.  For "direc/tory/hello.txt", both "hello.txt" or
"hello" are what would come up to people's mind with the technical
term "basename" (i.e. basename as opposed to dirname, vs basename as
opposed to filename with .extension).

Avoiding this ambiguity and using a word understandable by those not
versed well with UNIX/POSIX lingo may be done at the same time,
hopefully.

For example, can we frame the description around this key sentence:

    The heuristic is based on an observation that a file is often
    moved across directories while keeping its filename the same.

The term "filename" alone can be ambiguous (i.e. both "hello.txt"
and "direc/tory/hello.txt" are valid interpretations in the earlier
example), but in the context of a sentence that talks about "moved
across directories", the former would become the only valid one.  We
can even say just "name" and there is no ambiguity in the above "key
sentence".

Then keeping that in mind, we can rewrite the above you quoted like
so without going technical and without risking ambiguity, like this:

    ... a preliminary step that checks if files are moved across
    directories while keeping their filenames the same.  If there is
    a file added to a directory whose contents are sufficiently
    similar to a file with the same name that got deleted from a
    different directory, ...



* Re: [PATCH v2 2/4] diffcore-rename: complete find_basename_matches()
  2021-02-09 13:25     ` Derrick Stolee
@ 2021-02-09 17:17       ` Elijah Newren
  2021-02-09 17:34         ` Derrick Stolee
  0 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren @ 2021-02-09 17:17 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King

On Tue, Feb 9, 2021 at 5:25 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
> > +     /*
> > +      * When I checked, over 76% of file renames in linux just moved
>
> Perhaps "In late 2020," instead of "When I checked".

In early 2020 (in fact, it might have been 2019, but I have no records
to verify the actual year), but sure I can change that.

> > +      * files to a different directory but kept the same basename.  gcc
> > +      * did that with over 64% of renames, gecko did it with over 79%,
> > +      * and WebKit did it with over 89%.
> > +      *
> > +      * Therefore we can bypass the normal exhaustive NxM matrix
> > +      * comparison of similarities between all potential rename sources
> > +      * and destinations by instead using file basename as a hint, checking
> > +      * for similarity between files with the same basename, and if we
> > +      * find a pair that are sufficiently similar, record the rename
> > +      * pair and exclude those two from the NxM matrix.
> > +      *
> > +      * This *might* cause us to find a less than optimal pairing (if
> > +      * there is another file that we are even more similar to but has a
> > +      * different basename).  Given the huge performance advantage
> > +      * basename matching provides, and given the frequency with which
> > +      * people use the same basename in real world projects, that's a
> > +      * trade-off we are willing to accept when doing just rename
> > +      * detection.  However, if someone wants copy detection that
> > +      * implies they are willing to spend more cycles to find
> > +      * similarities between files, so it may be less likely that this
> > +      * heuristic is wanted.
> > +      */
> > +
> > +     int i, renames = 0;
> >       struct strintmap sources;
> >       struct strintmap dests;
>
> ...
>
> > +      * copy detection.  find_basename_matches() is only used when detecting
> > +      * renames, not when detecting copies, so it'll only be used when a file
> > +      * only existed in the source.  Since we already know that the file
>
> There are two "only"s in this sentence. Just awkward, not wrong.
>
> > +      * won't be unmodified, there's no point checking for it; that's just a
> > +      * waste of resources.  So set skip_unmodified to 0 so that
> > +      * estimate_similarity() and prefetch() won't waste resources checking
> > +      * for something we already know is false.
> > +      */
> > +     int skip_unmodified = 0;
> > +
>
>
>
> > -     /* TODO: Make use of basenames source and destination basenames */
> > +     /* Now look for basename matchups and do similarity estimation */
> > +     for (i = 0; i < num_src; ++i) {
> > +             char *filename = rename_src[i].p->one->path;
> > +             char *base = NULL;
> > +             intptr_t src_index;
> > +             intptr_t dst_index;
> > +
> > +             /* Get the basename */
> > +             base = strrchr(filename, '/');
> > +             base = (base ? base+1 : filename);
>
> Here is the third instance of this in the same function. At minimum we should
> extract a helper for you to consume.

Where by "this" you mean these last two lines, right?

And perhaps explain why I'm not using either basename(3) or
gitbasename() from git-compat-util.h?  (The latter of which I just
learned about while responding to the review of this patch.)

Or maybe gitbasename can do the job, but the skip_dos_drive_prefix()
and the munging of the string passed in both worry me.  And the
is_dir_sep() looks inefficient since I know I'm dealing with filenames
as stored in git internally, and thus can only use '/' characters.
Hmm...

Yeah, I think I'll add my own helper in this file, since you want one,
and just use it.

> > +             /* Find out if this basename is unique among sources */
> > +             src_index = strintmap_get(&sources, base);
> > +             if (src_index == -1)
> > +                     continue; /* not a unique basename; skip it */
> > +             assert(src_index == i);
> > +
> > +             if (strintmap_contains(&dests, base)) {
> > +                     struct diff_filespec *one, *two;
> > +                     int score;
> > +
> > +                     /* Find out if this basename is unique among dests */
> > +                     dst_index = strintmap_get(&dests, base);
> > +                     if (dst_index == -1)
> > +                             continue; /* not a unique basename; skip it */
> > +
> > +                     /* Ignore this dest if already used in a rename */
> > +                     if (rename_dst[dst_index].is_rename)
> > +                             continue; /* already used previously */
> > +
> > +                     /* Estimate the similarity */
> > +                     one = rename_src[src_index].p->one;
> > +                     two = rename_dst[dst_index].p->two;
> > +                     score = estimate_similarity(options->repo, one, two,
> > +                                                 minimum_score, skip_unmodified);
> > +
> > +                     /* If sufficiently similar, record as rename pair */
> > +                     if (score < minimum_score)
> > +                             continue;
> > +                     record_rename_pair(dst_index, src_index, score);
> > +                     renames++;
> > +
> > +                     /*
> > +                      * Found a rename so don't need text anymore; if we
> > +                      * didn't find a rename, the filespec_blob would get
> > +                      * re-used when doing the matrix of comparisons.
> > +                      */
> > +                     diff_free_filespec_blob(one);
> > +                     diff_free_filespec_blob(two);
> > +             }
> > +     }
>
> Makes sense to me.
>
> Thanks,
> -Stolee


* Re: [PATCH v2 2/4] diffcore-rename: complete find_basename_matches()
  2021-02-09 17:17       ` Elijah Newren
@ 2021-02-09 17:34         ` Derrick Stolee
  0 siblings, 0 replies; 71+ messages in thread
From: Derrick Stolee @ 2021-02-09 17:34 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King

On 2/9/2021 12:17 PM, Elijah Newren wrote:
> On Tue, Feb 9, 2021 at 5:25 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
>>> +             /* Get the basename */
>>> +             base = strrchr(filename, '/');
>>> +             base = (base ? base+1 : filename);
>>
>> Here is the third instance of this in the same function. At minimum we should
>> extract a helper for you to consume.
> 
> Where by "this" you mean these last two lines, right?

Correct. The reason to use a helper is to ease cognitive load when
reading the code. These lines are identical and serve the same
purpose. Making a "get_basename()" helper and using it as

	base = get_basename(filename);

makes it easy to understand what is happening without needing
to think carefully about it. For example, I had to remember
that strrchr() returns NULL when '/' is not found, not the first
character of the string.
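
For illustration, a minimal sketch of such a helper could look like the
following. The const-qualified signature is an assumption made for this
sketch, not something the series has to adopt, and it relies on paths
being in git's internal form, where '/' is the only separator.

```c
#include <string.h>

/*
 * Hypothetical helper: return the part of `path` after the last '/',
 * or `path` itself when it contains no '/'.  The result points into
 * `path`; nothing is allocated or modified.
 */
static const char *get_basename(const char *path)
{
	/* strrchr() returns NULL, not `path`, when '/' is absent */
	const char *base = strrchr(path, '/');

	return base ? base + 1 : path;
}
```

Unlike basename(3), this never modifies its argument, and a path ending
in '/' yields an empty string rather than the last directory component.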

> And perhaps explain why I'm not using either basename(3) or
> gitbasename() from git-compat-util.h?  (The latter of which I just
> learned about while responding to the review of this patch.)
> 
> or maybe gitbasename can do the job, but the skip_dos_drive_prefix()
> and the munging of the string passed in both worry me.  And the
> is_dir_sep() looks inefficient since I know I'm dealing with filenames
> as stored in git internally, and thus can only use '/' characters.
> Hmm...
> 
> Yeah, I think I'll add my own helper in this file, since you want one,
> and just use it.

Right. I almost made a point to say "Don't use find_last_dir_sep()"
because it uses platform-specific directory separators. Your helper
is based on in-memory representation that always uses Unix-style paths.

Thanks,
-Stolee



* Re: [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-09 13:33     ` Derrick Stolee
@ 2021-02-09 17:41       ` Elijah Newren
  2021-02-09 18:59         ` Junio C Hamano
  0 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren @ 2021-02-09 17:41 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King

On Tue, Feb 9, 2021 at 5:33 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
> > +     num_sources = rename_src_nr;
> > +
> > +     if (want_copies || break_idx) {
> > +             /*
> > +              * Cull sources:
> > +              *   - remove ones corresponding to exact renames
> > +              */
> > +             trace2_region_enter("diff", "cull after exact", options->repo);
> > +             remove_unneeded_paths_from_src(want_copies);
> > +             trace2_region_leave("diff", "cull after exact", options->repo);
>
> Isn't this the same as
>
> > +     } else {
> > +             /* Determine minimum score to match basenames */
> > +             int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5;
> > +
> > +             /*
> > +              * Cull sources:
> > +              *   - remove ones involved in renames (found via exact match)
> > +              */
> > +             trace2_region_enter("diff", "cull exact", options->repo);
> > +             remove_unneeded_paths_from_src(want_copies);
> > +             trace2_region_leave("diff", "cull exact", options->repo);
>
> ...this? (except the regions are renamed)
>
> Could this be simplified as:
>
> +       num_sources = rename_src_nr;
> +
> +       trace2_region_enter("diff", "cull after exact", options->repo);
> +       remove_unneeded_paths_from_src(want_copies);
> +       trace2_region_leave("diff", "cull after exact", options->repo);
> +
> +       if (!want_copies && !break_idx) {
> +               /* Determine minimum score to match basenames */
>
> I realize you probably changed the region names on purpose to distinguish
> that there are two "cull" regions in the case of no copies, but I think
> that isn't really worth different names. Better to have a consistent region
> name around the same activity in both cases.

Actually, they were split because a later series has to
call remove_unneeded_paths_from_src() differently for the two
branches.  The patch history was so dirty that the easiest way to
clean things up was just to create completely new patches pulling off
relevant chunks of code and touching them up; while doing that, I
didn't notice that the changes I made to split out this early series
resulted in this near-duplication.

So, I can join them...but they would just need to be split back out in
my "Optimization batch 9" series.

I'm happy to fix the region names to make them the same.  Is that good
enough, or would you rather have these common code regions combined for
this patch and then split out later?

>
> > +             int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5;
>
> Did you intend for this to be 5*min + 0*MAX? This seems wrong if you want
> this value to be different from minimum_score.

In my cover letter I noted that I didn't know what to set this to and
wanted input; yesterday you said it wasn't worth worrying about using
a different value, but Junio suggested we should use one (but didn't
state how much higher it should be or whether it should be user input
driven).  This weird construct was here just to show that it is easy
to feed a different score into the basename comparison than what is
used elsewhere; I can fix it up once I get word on what Junio wants to
see.

Since I didn't know what to use, though, and I didn't want to get a
different set of numbers for the final commit message on the speedup
achieved if I'm just going to throw them away and recompute once I
find out what Junio wants here, I did intentionally set the
computation to just give us minimum_score, for now.
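
To make the arithmetic concrete, here is a minimal sketch of that
weighted-average form.  The helper name and its weight parameters are
hypothetical (they are not in the patch), and MAX_SCORE is assumed to be
60000.0 as in git's diffcore.h:

```c
#define MAX_SCORE 60000.0	/* assumed value, as in git's diffcore.h */

/*
 * Hypothetical helper, not part of the patch: weighted average of
 * minimum_score and MAX_SCORE, written the same way as the expression
 * in the hunk above (cast the sum to int, then integer-divide).
 */
static int weighted_basename_score(int minimum_score, int w_min, int w_max)
{
	return (int)(w_min * minimum_score + w_max * MAX_SCORE) / (w_min + w_max);
}
```

With weights (5, 0) this collapses to minimum_score, which is what the
patch currently computes; equal weights would land halfway between
minimum_score and MAX_SCORE.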

> > +
> > +             /* Utilize file basenames to quickly find renames. */
> > +             trace2_region_enter("diff", "basename matches", options->repo);
> > +             rename_count += find_basename_matches(options,
> > +                                                   min_basename_score,
> > +                                                   rename_src_nr);
> > +             trace2_region_leave("diff", "basename matches", options->repo);
> > +
> > +             /*
> > +              * Cull sources, again:
> > +              *   - remove ones involved in renames (found via basenames)
> > +              */
> > +             trace2_region_enter("diff", "cull basename", options->repo);
> > +             remove_unneeded_paths_from_src(want_copies);
> > +             trace2_region_leave("diff", "cull basename", options->repo);
> > +     }
> > +
>
> Thanks,
> -Stolee

* Re: [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates
  2021-02-09 17:02         ` Derrick Stolee
@ 2021-02-09 17:42           ` Elijah Newren
  0 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren @ 2021-02-09 17:42 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King

On Tue, Feb 9, 2021 at 9:02 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/9/2021 11:56 AM, Elijah Newren wrote:
> >> Also, this is used here and below. Perhaps it's worth pulling out as a
> >> helper? I see similar code being duplicated in these existing spots:
> >>
> >> * diff-no-index.c:append_basename()
> >> * help.c:append_similar_ref()
> >> * packfile.c:pack_basename()
> >> * replace-object.c:register_replace_ref()
> >> * setup.c:read_gitfile_gently()
> >> * builtin/rebase.c:cmd_rebase()
> >> * builtin/stash.c:do_create_stash()
> >> * builtin/worktree.c:add_worktree()
> >> * contrib/credential/gnome-keyring/git-credential-gnome-keyring.c:usage()
> >> * contrib/credential/libsecret/git-credential-libsecret.c:usage()
> >> * trace2/tr2_dst.c:tr2_dst_try_auto_path()
> > Honestly asking: would anyone ever search for such a two-line helper
> > function?  I wouldn't have even thought to look, since it seems so
> > simple.
> >
> > However, my real concern here is that this type of change would risk
> > introducing conflicts with unrelated series.  This series is the
> > second in what will be a 9-series deep dependency chain of
> > optimizations[1], and the later series are going to be longer than
> > these first two were (the latter ones are 6-11 patches each).  We've
> > already discussed previously whether we possibly want to hold the
> > first couple optimization series out of the upcoming git-2.31 release
> > in order to keep the optimizations all together, but that might
> > increase the risk of conflicts with unrelated patches if we try a
> > bigger tree refactor like this.  (Junio never commented on that,
> > though.)  It might be better to keep the series touching only
> > merge-ort.c & diffcore-rename.c, and then do cleanups like the one you
> > suggest here after the whole series.
> >
> > That said, it's not a difficult initial change, so I'm mostly
> > raising this concern to avoid making things harder for Junio.  It'd
> > be best to get his opinion -- Junio, your thoughts?
> >
> > [1] https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Anewren+Optimization+batch
>
> I don't consider the step of "go put the helper in all these other
> places" necessary for the current series. However, the "get basename"
> code appears a total of three times in this series, so it would be
> good to at least extract it to a static inline method to reduce
> the duplication isolated to this change.

Sounds good; will do.
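
For reference, a minimal sketch of what such a static helper could look
like.  It assumes paths as git stores them: '/' is the only separator,
there is no trailing slash, and the string is never NULL or empty.

```c
#include <string.h>

/*
 * Sketch only: return the part of the path after the last '/', or the
 * whole path if there is no '/'.  The result points into the input
 * string, so nothing is allocated.
 */
static const char *get_basename(const char *filename)
{
	const char *base = strrchr(filename, '/');
	return base ? base + 1 : filename;
}
```

Returning a pointer into the original string keeps the helper free of
allocations, which matters when it is called once per rename candidate.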

* Re: [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection
  2021-02-09 17:03       ` Junio C Hamano
@ 2021-02-09 17:44         ` Elijah Newren
  0 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren @ 2021-02-09 17:44 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King

On Tue, Feb 9, 2021 at 9:03 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Derrick Stolee <stolee@gmail.com> writes:
>
> >> +Note that when rename detection is on but both copy and break
> >> +detection are off, rename detection adds a preliminary step that first
> >> +checks files with the same basename.  If files with the same basename
> >
> > I find myself wanting a definition of 'basename' here, but perhaps I'm
> > just being pedantic. A quick search clarifies this as a standard term [1]
> > of which I was just ignorant.
> >
> > [1] https://man7.org/linux/man-pages/man3/basename.3.html
> >
> >> +are sufficiently similar, it will mark them as renames and exclude
> >> +them from the later quadratic step (the one that pairwise compares all
> >> +unmatched files to find the "best" matches, determined by the highest
> >> +content similarity).
>
> While I do not think `basename` is unacceptably bad, we should aim
> to do better.  For "direc/tory/hello.txt", both "hello.txt" and
> "hello" are what could come to people's minds with the technical
> term "basename" (i.e. basename as opposed to dirname, vs basename as
> opposed to filename with .extension).
>
> Avoiding this ambiguity and using a word understandable by those not
> well versed in UNIX/POSIX lingo may be done at the same time,
> hopefully.
>
> For example, can we frame the description around this key sentence:
>
>     The heuristic is based on the observation that a file is often
>     moved across directories while keeping its filename the same.
>
> The term "filename" alone can be ambiguous (i.e. both "hello.txt"
> and "direc/tory/hello.txt" are valid interpretations in the earlier
> example), but in the context of a sentence that talks about "moved
> across directories", the former would become the only valid one.  We
> can even say just "name" and there is no ambiguity in the above "key
> sentence".
>
> Then keeping that in mind, we can rewrite the above you quoted like
> so without going technical and without risking ambiguity, like this:
>
>     ... a preliminary step that checks if files are moved across
>     directories while keeping their filenames the same.  If there is
>     a file added to a directory whose contents are sufficiently
>     similar to a file with the same name that got deleted from a
>     different directory, ...

Nice, I like it!

* Re: [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-09 17:41       ` Elijah Newren
@ 2021-02-09 18:59         ` Junio C Hamano
  0 siblings, 0 replies; 71+ messages in thread
From: Junio C Hamano @ 2021-02-09 18:59 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King

Elijah Newren <newren@gmail.com> writes:

> Since I didn't know what to use, though, and I didn't want to get a
> different set of numbers for the final commit message on the speedup
> achieved if I'm just going to throw them away and recompute once I
> find out what Junio wants here, I did intentionally set the
> computation to just give us minimum_score, for now.

I thought Derrick earlier suggested "half-way", which I found was
probably a reasonable starting point.  So instead of dividing by 5,
divide by 8 and multiply both terms by 4 or something.  Perhaps also
allow a debugging knob, so we can tweak it and see what works best on
real histories during the refinement phase of the feature?




* [PATCH v3 0/5] Optimization batch 7: use file basenames to guide rename detection
  2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-02-09 11:32   ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
@ 2021-02-10 15:15   ` Elijah Newren via GitGitGadget
  2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
                       ` (5 more replies)
  4 siblings, 6 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-10 15:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren

This series depends on ort-perf-batch-6[1].

This series uses file basenames (the portion of the path after the final
'/', including the extension) in a basic fashion to guide rename detection.

Changes since v2:

 * insert a new patch at the beginning to test expected old behavior, then
   have later patches change the test expectation
 * tweak commit message wording to clarify that rename stats merely
   motivated the optimization
 * factor out a simple get_basename() helper; have it document why we don't
   use gitbasename()
 * use a higher required threshold to mark same-basename files as a rename,
   defaulting to the average of minimum_score and MAX_SCORE (since the
   default rename threshold is 50%, this implies a default basename
   threshold of 75%)
 * provide an environment variable (undocumented) that can be used to
   experiment with the threshold for basename-sameness
 * updated the timings based on the new threshold
 * modify the documentation with Junio's suggested simpler and clearer
   wording
 * clean up the wording on a few comments
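
As a rough illustration of the new threshold rule, here is a sketch
modeled on the hunk in patch 4.  The helper name is hypothetical, the
factor string is passed in as a parameter rather than read via getenv()
so the sketch stays self-contained, MAX_SCORE is assumed to be 60000.0
as in git's diffcore.h, and out-of-range factors return -1 here whereas
the patch uses assert():

```c
#include <stdlib.h>

#define MAX_SCORE 60000.0	/* assumed value, as in git's diffcore.h */

/*
 * Hypothetical helper: compute the score a same-basename pair must
 * reach.  factor_str is a percentage such as "50" (what the caller
 * would get from getenv("GIT_BASENAME_FACTOR")), or NULL for the
 * default of halfway between minimum_score and MAX_SCORE.
 */
static int basename_min_score(int minimum_score, const char *factor_str)
{
	double factor = 0.5;

	if (factor_str)
		factor = strtol(factor_str, NULL, 10) / 100.0;
	if (factor < 0.0 || factor > 1.0)
		return -1;	/* the patch asserts instead */
	return minimum_score + (int)(factor * (MAX_SCORE - minimum_score));
}
```

With the default rename threshold of 50% (minimum_score of 30000 under
the assumption above), the default factor gives 45000, i.e. 75%.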

[1] https://lore.kernel.org/git/xmqqlfc4byt6.fsf@gitster.c.googlers.com/
[2] https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

Elijah Newren (5):
  t4001: add a test comparing basename similarity and content similarity
  diffcore-rename: compute basenames of all source and dest candidates
  diffcore-rename: complete find_basename_matches()
  diffcore-rename: guide inexact rename detection based on basenames
  gitdiffcore doc: mention new preliminary step for rename detection

 Documentation/gitdiffcore.txt |  17 +++
 diffcore-rename.c             | 202 +++++++++++++++++++++++++++++++++-
 t/t4001-diff-rename.sh        |  24 ++++
 3 files changed, 239 insertions(+), 4 deletions(-)


base-commit: 7ae9460d3dba84122c2674b46e4339b9d42bdedd
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-843%2Fnewren%2Fort-perf-batch-7-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/843

Range-diff vs v2:

 4:  a0e75d8cd6bd ! 1:  3e6af929d135 gitdiffcore doc: mention new preliminary step for rename detection
     @@ Metadata
      Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
     -    gitdiffcore doc: mention new preliminary step for rename detection
     +    t4001: add a test comparing basename similarity and content similarity
      
     -    The last few patches have introduced a new preliminary step when rename
     -    detection is on but both break detection and copy detection are off.
     -    Document this new step.  While we're at it, add a testcase that checks
     -    the new behavior as well.
     +    Add a simple test where a removed file is similar to two different added
     +    files; one of them has the same basename, and the other has a slightly
     +    higher content similarity.  Without break detection, filename similarity
     +    of 100% trumps content similarity for pairing up related files.  For
     +    any filename similarity less than 100%, the opposite is true -- content
     +    similarity is all that matters.  Add a testcase that documents this.
      
     -    Signed-off-by: Elijah Newren <newren@gmail.com>
     +    Subsequent commits will add a new rule that includes an in-between state,
     +    where a mixture of filename similarity and content similarity is
     +    weighed, and which will change the outcome of this testcase.
      
     - ## Documentation/gitdiffcore.txt ##
     -@@ Documentation/gitdiffcore.txt: a similarity score different from the default of 50% by giving a
     - number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
     - 8/10 = 80%).
     - 
     -+Note that when rename detection is on but both copy and break
     -+detection are off, rename detection adds a preliminary step that first
     -+checks files with the same basename.  If files with the same basename
     -+are sufficiently similar, it will mark them as renames and exclude
     -+them from the later quadratic step (the one that pairwise compares all
     -+unmatched files to find the "best" matches, determined by the highest
     -+content similarity).  So, for example, if docs/extensions.txt and
     -+docs/config/extensions.txt have similar content, then they will be
     -+marked as a rename even if it turns out that docs/extensions.txt was
     -+more similar to src/extension-checks.c.  At most, one comparison is
     -+done per file in this preliminary pass; so if there are several
     -+extensions.txt files throughout the directory hierarchy that were
     -+added and deleted, this preliminary step will be skipped for those
     -+files.
     -+
     - Note.  When the "-C" option is used with `--find-copies-harder`
     - option, 'git diff-{asterisk}' commands feed unmodified filepairs to
     - diffcore mechanism as well as modified ones.  This lets the copy
     +    Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## t/t4001-diff-rename.sh ##
      @@ t/t4001-diff-rename.sh: test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
     @@ t/t4001-diff-rename.sh: test_expect_success 'diff-tree -l0 defaults to a big ren
      +	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
      +	# but since same basenames are checked first...
      +	cat >expected <<-\EOF &&
     -+	A	file.md
     -+	R078	subdir/file.txt	file.txt
     ++	R088	subdir/file.txt	file.md
     ++	A	file.txt
      +	EOF
      +	test_cmp expected actual
      +'
 1:  381a45d239bb ! 2:  4fff9b1ff57b diffcore-rename: compute basenames of all source and dest candidates
     @@ Commit message
          basenames still show up in the map, but have an invalid index (-1).
      
          This function was inspired by the fact that in real world repositories,
     -    most renames often do not involve a basename change.  Here are some
     -    sample repositories and the percentage of their historical renames (as of
     -    early 2020) that did not involve a basename change:
     +    files are often moved across directories without changing names.  Here
     +    are some sample repositories and the percentage of their historical
     +    renames (as of early 2020) that preserved basenames:
            * linux: 76%
            * gcc: 64%
            * gecko: 79%
            * webkit: 89%
     +    These statistics alone don't prove that an optimization in this area
     +    will help or how much it will help, since there are also unpaired adds
     +    and deletes, restrictions on which basenames we consider, etc., but it
     +    certainly motivated the idea to try something in this area.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
     @@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
       	return renames;
       }
       
     ++static const char *get_basename(const char *filename)
     ++{
     ++	/*
     ++	 * gitbasename() has to worry about special drivers, multiple
     ++	 * directory separator characters, trailing slashes, NULL or
     ++	 * empty strings, etc.  We only work on filenames as stored in
     ++	 * git, and thus get to ignore all those complications.
     ++	 */
     ++	const char *base = strrchr(filename, '/');
     ++	return base ? base + 1 : filename;
     ++}
     ++
      +MAYBE_UNUSED
      +static int find_basename_matches(struct diff_options *options,
      +				 int minimum_score,
     @@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
      +	strintmap_init_with_options(&dests, -1, NULL, 0);
      +	for (i = 0; i < num_src; ++i) {
      +		char *filename = rename_src[i].p->one->path;
     -+		char *base;
     ++		const char *base;
      +
      +		/* exact renames removed in remove_unneeded_paths_from_src() */
      +		assert(!rename_src[i].p->one->rename_used);
      +
     -+		base = strrchr(filename, '/');
     -+		base = (base ? base+1 : filename);
     -+
      +		/* Record index within rename_src (i) if basename is unique */
     ++		base = get_basename(filename);
      +		if (strintmap_contains(&sources, base))
      +			strintmap_set(&sources, base, -1);
      +		else
     @@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
      +	}
      +	for (i = 0; i < rename_dst_nr; ++i) {
      +		char *filename = rename_dst[i].p->two->path;
     -+		char *base;
     ++		const char *base;
      +
      +		if (rename_dst[i].is_rename)
      +			continue; /* involved in exact match already. */
      +
     -+		base = strrchr(filename, '/');
     -+		base = (base ? base+1 : filename);
     -+
      +		/* Record index within rename_dst (i) if basename is unique */
     ++		base = get_basename(filename);
      +		if (strintmap_contains(&dests, base))
      +			strintmap_set(&dests, base, -1);
      +		else
 2:  dcd0175229aa ! 3:  dc26881e4ed3 diffcore-rename: complete find_basename_matches()
     @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options
       {
      -	int i;
      +	/*
     -+	 * When I checked, over 76% of file renames in linux just moved
     -+	 * files to a different directory but kept the same basename.  gcc
     -+	 * did that with over 64% of renames, gecko did it with over 79%,
     -+	 * and WebKit did it with over 89%.
     ++	 * When I checked in early 2020, over 76% of file renames in linux
     ++	 * just moved files to a different directory but kept the same
     ++	 * basename.  gcc did that with over 64% of renames, gecko did it
     ++	 * with over 79%, and WebKit did it with over 89%.
      +	 *
      +	 * Therefore we can bypass the normal exhaustive NxM matrix
      +	 * comparison of similarities between all potential rename sources
     -+	 * and destinations by instead using file basename as a hint, checking
     -+	 * for similarity between files with the same basename, and if we
     -+	 * find a pair that are sufficiently similar, record the rename
     -+	 * pair and exclude those two from the NxM matrix.
     ++	 * and destinations by instead using file basename as a hint (i.e.
     ++	 * the portion of the filename after the last '/'), checking for
     ++	 * similarity between files with the same basename, and if we find
     ++	 * a pair that are sufficiently similar, record the rename pair and
     ++	 * exclude those two from the NxM matrix.
      +	 *
      +	 * This *might* cause us to find a less than optimal pairing (if
      +	 * there is another file that we are even more similar to but has a
     @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options
      +	 * basename matching provides, and given the frequency with which
      +	 * people use the same basename in real world projects, that's a
      +	 * trade-off we are willing to accept when doing just rename
     -+	 * detection.  However, if someone wants copy detection that
     -+	 * implies they are willing to spend more cycles to find
     -+	 * similarities between files, so it may be less likely that this
     -+	 * heuristic is wanted.
     ++	 * detection.
     ++	 *
     ++	 * If someone wants copy detection that implies they are willing to
     ++	 * spend more cycles to find similarities between files, so it may
     ++	 * be less likely that this heuristic is wanted.  If someone is
     ++	 * doing break detection, that means they do not want filename
     ++	 * similarity to imply any form of content similarity, and thus
     ++	 * this heuristic would definitely be incompatible.
      +	 */
      +
      +	int i, renames = 0;
     @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options
       	struct strintmap dests;
       
      +	/*
     -+	 * The prefetching stuff wants to know if it can skip prefetching blobs
     -+	 * that are unmodified.  unmodified blobs are only relevant when doing
     -+	 * copy detection.  find_basename_matches() is only used when detecting
     -+	 * renames, not when detecting copies, so it'll only be used when a file
     -+	 * only existed in the source.  Since we already know that the file
     -+	 * won't be unmodified, there's no point checking for it; that's just a
     -+	 * waste of resources.  So set skip_unmodified to 0 so that
     -+	 * estimate_similarity() and prefetch() won't waste resources checking
     -+	 * for something we already know is false.
     ++	 * The prefetching stuff wants to know if it can skip prefetching
     ++	 * blobs that are unmodified...and will then do a little extra work
     ++	 * to verify that the oids are indeed different before prefetching.
     ++	 * Unmodified blobs are only relevant when doing copy detection;
     ++	 * when limiting to rename detection, diffcore_rename[_extended]()
     ++	 * will never be called with unmodified source paths fed to us, so
     ++	 * the extra work necessary to check if rename_src entries are
     ++	 * unmodified would be a small waste.
      +	 */
      +	int skip_unmodified = 0;
      +
     @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options
      +	/* Now look for basename matchups and do similarity estimation */
      +	for (i = 0; i < num_src; ++i) {
      +		char *filename = rename_src[i].p->one->path;
     -+		char *base = NULL;
     ++		const char *base = NULL;
      +		intptr_t src_index;
      +		intptr_t dst_index;
      +
     -+		/* Get the basename */
     -+		base = strrchr(filename, '/');
     -+		base = (base ? base+1 : filename);
     -+
      +		/* Find out if this basename is unique among sources */
     ++		base = get_basename(filename);
      +		src_index = strintmap_get(&sources, base);
      +		if (src_index == -1)
      +			continue; /* not a unique basename; skip it */
 3:  ce2173aa1fb7 ! 4:  2493f4b2f55d diffcore-rename: guide inexact rename detection based on basenames
     @@ Commit message
          this change improves the performance as follows:
      
                                      Before                  After
     -        no-renames:       13.815 s ±  0.062 s    13.138 s ±  0.086 s
     -        mega-renames:   1799.937 s ±  0.493 s   169.488 s ±  0.494 s
     -        just-one-mega:    51.289 s ±  0.019 s     5.061 s ±  0.017 s
     +        no-renames:       13.815 s ±  0.062 s    13.294 s ±  0.103 s
     +        mega-renames:   1799.937 s ±  0.493 s   187.248 s ±  0.882 s
     +        just-one-mega:    51.289 s ±  0.019 s     5.557 s ±  0.017 s
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
     -@@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
     - 	return renames;
     +@@ diffcore-rename.c: static const char *get_basename(const char *filename)
     + 	return base ? base + 1 : filename;
       }
       
      -MAYBE_UNUSED
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +		trace2_region_leave("diff", "cull after exact", options->repo);
      +	} else {
      +		/* Determine minimum score to match basenames */
     -+		int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5;
     ++		double factor = 0.5;
     ++		char *basename_factor = getenv("GIT_BASENAME_FACTOR");
     ++		int min_basename_score;
     ++
     ++		if (basename_factor)
     ++			factor = strtol(basename_factor, NULL, 10)/100.0;
     ++		assert(factor >= 0.0 && factor <= 1.0);
     ++		min_basename_score = minimum_score +
     ++			(int)(factor * (MAX_SCORE - minimum_score));
      +
      +		/*
      +		 * Cull sources:
      +		 *   - remove ones involved in renames (found via exact match)
      +		 */
     -+		trace2_region_enter("diff", "cull exact", options->repo);
     ++		trace2_region_enter("diff", "cull after exact", options->repo);
      +		remove_unneeded_paths_from_src(want_copies);
     -+		trace2_region_leave("diff", "cull exact", options->repo);
     ++		trace2_region_leave("diff", "cull after exact", options->repo);
      +
      +		/* Utilize file basenames to quickly find renames. */
      +		trace2_region_enter("diff", "basename matches", options->repo);
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       
       		m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
       		for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
     +
     + ## t/t4001-diff-rename.sh ##
     +@@ t/t4001-diff-rename.sh: test_expect_success 'basename similarity vs best similarity' '
     + 	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
     + 	# but since same basenames are checked first...
     + 	cat >expected <<-\EOF &&
     +-	R088	subdir/file.txt	file.md
     +-	A	file.txt
     ++	A	file.md
     ++	R078	subdir/file.txt	file.txt
     + 	EOF
     + 	test_cmp expected actual
     + '
 -:  ------------ > 5:  fc72d24a3358 gitdiffcore doc: mention new preliminary step for rename detection

-- 
gitgitgadget

* [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity
  2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
@ 2021-02-10 15:15     ` Elijah Newren via GitGitGadget
  2021-02-13  1:15       ` Junio C Hamano
  2021-02-10 15:15     ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
                       ` (4 subsequent siblings)
  5 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-10 15:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add a simple test where a removed file is similar to two different added
files; one of them has the same basename, and the other has a slightly
higher content similarity.  Without break detection, filename similarity
of 100% trumps content similarity for pairing up related files.  For
any filename similarity less than 100%, the opposite is true -- content
similarity is all that matters.  Add a testcase that documents this.

Subsequent commits will add a new rule that includes an in-between state,
where a mixture of filename similarity and content similarity is
weighed, and which will change the outcome of this testcase.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t4001-diff-rename.sh | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index c16486a9d41a..797343b38106 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -262,4 +262,28 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
 	grep "myotherfile.*myfile" actual
 '
 
+test_expect_success 'basename similarity vs best similarity' '
+	mkdir subdir &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			 line6 line7 line8 line9 line10 >subdir/file.txt &&
+	git add subdir/file.txt &&
+	git commit -m "base txt" &&
+
+	git rm subdir/file.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 >file.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 line9 >file.md &&
+	git add file.txt file.md &&
+	git commit -a -m "rename" &&
+	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
+	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
+	# but since same basenames are checked first...
+	cat >expected <<-\EOF &&
+	R088	subdir/file.txt	file.md
+	A	file.txt
+	EOF
+	test_cmp expected actual
+'
+
 test_done
-- 
gitgitgadget


* [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates
  2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
  2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
@ 2021-02-10 15:15     ` Elijah Newren via GitGitGadget
  2021-02-13  1:32       ` Junio C Hamano
  2021-02-10 15:15     ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
                       ` (3 subsequent siblings)
  5 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-10 15:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

We want to make use of unique basenames to help inform rename detection,
so that more likely pairings can be checked first.  (src/moduleA/foo.txt
and source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the deleted and added files.)  Add a new function,
not yet used, which creates a map of the unique basenames within
rename_src and another within rename_dst, together with the indices
within rename_src/rename_dst where those basenames show up.  Non-unique
basenames still show up in the map, but have an invalid index (-1).
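
For readers who want to see the bookkeeping in isolation, here is a
minimal sketch of the "unique basename -> index" idea.  It is not the
code from this patch: struct basename_map, map_fill(), map_get() and the
NOT_FOUND sentinel are hypothetical stand-ins for strintmap, and the
linear scan is only there to keep the example short.

```c
#include <string.h>

#define NOT_UNIQUE (-1)	/* basename seen more than once */
#define NOT_FOUND  (-2)	/* basename never seen (sketch-only sentinel) */
#define MAP_CAPACITY 64	/* arbitrary limit for this sketch */

/* Portion of the path after the last '/', or the whole path if none. */
static const char *basename_of(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/* Hypothetical stand-in for strintmap: a tiny linear-scan table. */
struct basename_map {
	const char *key[MAP_CAPACITY];
	int value[MAP_CAPACITY];
	int nr;
};

/* Slot holding this basename, or NOT_FOUND. */
static int map_find(const struct basename_map *map, const char *base)
{
	int i;

	for (i = 0; i < map->nr; i++)
		if (!strcmp(map->key[i], base))
			return i;
	return NOT_FOUND;
}

/* Index into paths[] for a unique basename, NOT_UNIQUE, or NOT_FOUND. */
static int map_get(const struct basename_map *map, const char *base)
{
	int slot = map_find(map, base);

	return slot == NOT_FOUND ? NOT_FOUND : map->value[slot];
}

/* Record each basename once; a repeated basename is marked NOT_UNIQUE. */
static void map_fill(struct basename_map *map, const char **paths, int nr)
{
	int i;

	map->nr = 0;
	for (i = 0; i < nr; i++) {
		const char *base = basename_of(paths[i]);
		int slot = map_find(map, base);

		if (slot != NOT_FOUND) {
			map->value[slot] = NOT_UNIQUE;
		} else if (map->nr < MAP_CAPACITY) {
			/* entries beyond MAP_CAPACITY are dropped in this sketch */
			map->key[map->nr] = base;
			map->value[map->nr] = i;
			map->nr++;
		}
	}
}
```

Calling map_fill() once on the deleted paths and once on the added paths
gives the two maps described above.  Unlike this sketch, the patch uses
-1 as the strintmap default as well, so it relies on strintmap_contains()
to tell "absent" apart from "not unique".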

This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names.  Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
  * linux: 76%
  * gcc: 64%
  * gecko: 79%
  * webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 74930716e70d..3eb49a098adf 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,67 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
+static const char *get_basename(const char *filename)
+{
+	/*
+	 * gitbasename() has to worry about special drives, multiple
+	 * directory separator characters, trailing slashes, NULL or
+	 * empty strings, etc.  We only work on filenames as stored in
+	 * git, and thus get to ignore all those complications.
+	 */
+	const char *base = strrchr(filename, '/');
+	return base ? base + 1 : filename;
+}
+
+MAYBE_UNUSED
+static int find_basename_matches(struct diff_options *options,
+				 int minimum_score,
+				 int num_src)
+{
+	int i;
+	struct strintmap sources;
+	struct strintmap dests;
+
+	/* Create maps of basename -> fullname(s) for sources and dests */
+	strintmap_init_with_options(&sources, -1, NULL, 0);
+	strintmap_init_with_options(&dests, -1, NULL, 0);
+	for (i = 0; i < num_src; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base;
+
+		/* exact renames removed in remove_unneeded_paths_from_src() */
+		assert(!rename_src[i].p->one->rename_used);
+
+		/* Record index within rename_src (i) if basename is unique */
+		base = get_basename(filename);
+		if (strintmap_contains(&sources, base))
+			strintmap_set(&sources, base, -1);
+		else
+			strintmap_set(&sources, base, i);
+	}
+	for (i = 0; i < rename_dst_nr; ++i) {
+		char *filename = rename_dst[i].p->two->path;
+		const char *base;
+
+		if (rename_dst[i].is_rename)
+			continue; /* involved in exact match already. */
+
+		/* Record index within rename_dst (i) if basename is unique */
+		base = get_basename(filename);
+		if (strintmap_contains(&dests, base))
+			strintmap_set(&dests, base, -1);
+		else
+			strintmap_set(&dests, base, i);
+	}
+
+	/* TODO: Make use of basenames source and destination basenames */
+
+	strintmap_clear(&sources);
+	strintmap_clear(&dests);
+
+	return 0;
+}
+
 #define NUM_CANDIDATE_PER_DST 4
 static void record_if_better(struct diff_score m[], struct diff_score *o)
 {
-- 
gitgitgadget



* [PATCH v3 3/5] diffcore-rename: complete find_basename_matches()
  2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
  2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
  2021-02-10 15:15     ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
@ 2021-02-10 15:15     ` Elijah Newren via GitGitGadget
  2021-02-13  1:48       ` Junio C Hamano
  2021-02-10 15:15     ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
                       ` (2 subsequent siblings)
  5 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-10 15:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories.  We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames.  If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.

This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file.  For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there were no other
device.h files added or deleted between the commits being compared,
then these two files would be compared in the preliminary step.
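
The "at most one comparison per file" property can be sketched as
follows.  This is an illustration under stated assumptions rather than
the function added by this patch: match_by_basename() and unique_index()
are hypothetical names, the similarity scores are passed in as a
precomputed matrix instead of calling estimate_similarity(), and
uniqueness is checked with a plain scan rather than the strintmaps the
patch uses.

```c
#include <string.h>

/* Portion of the path after the last '/', or the whole path if none. */
static const char *basename_of(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/* Index of the only path with this basename; -1 if none or several. */
static int unique_index(const char **paths, int nr, const char *base)
{
	int i, found = -1;

	for (i = 0; i < nr; i++) {
		if (strcmp(basename_of(paths[i]), base))
			continue;
		if (found != -1)
			return -1; /* second hit: not unique */
		found = i;
	}
	return found;
}

/*
 * scores[s * nr_dst + d] is the (hypothetical, precomputed) similarity
 * of src[s] and dst[d].  On return, paired_dst[s] is the destination
 * matched to src[s], or -1.  Returns the number of renames recorded.
 */
static int match_by_basename(const char **src, int nr_src,
			     const char **dst, int nr_dst,
			     const int *scores, int min_score,
			     int *paired_dst)
{
	int s, renames = 0;

	for (s = 0; s < nr_src; s++) {
		const char *base = basename_of(src[s]);
		int d;

		paired_dst[s] = -1;
		if (unique_index(src, nr_src, base) != s)
			continue; /* basename not unique among sources */
		d = unique_index(dst, nr_dst, base);
		if (d == -1)
			continue; /* absent or not unique among dests */
		if (scores[s * nr_dst + d] < min_score)
			continue; /* left for the full matrix comparison */
		paired_dst[s] = d;
		renames++;
	}
	return renames;
}
```

Each source consults at most one score, namely the one for the single
destination sharing its basename.  Anything left unpaired would still be
handled by the exhaustive comparison.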

This commit does not yet actually employ this new optimization; it
merely adds a function which can be used for this purpose.  The next
commit will do the necessary plumbing to make use of it.

Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename.  I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show, it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that, despite having unique matching basenames, other files
serve as better matches.  Reason (4) takes a full paragraph to
explain...

If the previous three reasons aren't enough, consider what rename
detection already does.  Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar.  In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity.  Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents.  This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well.  Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 95 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 92 insertions(+), 3 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 3eb49a098adf..001645624e71 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -384,10 +384,52 @@ static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 int num_src)
 {
-	int i;
+	/*
+	 * When I checked in early 2020, over 76% of file renames in linux
+	 * just moved files to a different directory but kept the same
+	 * basename.  gcc did that with over 64% of renames, gecko did it
+	 * with over 79%, and WebKit did it with over 89%.
+	 *
+	 * Therefore we can bypass the normal exhaustive NxM matrix
+	 * comparison of similarities between all potential rename sources
+	 * and destinations by instead using file basename as a hint (i.e.
+	 * the portion of the filename after the last '/'), checking for
+	 * similarity between files with the same basename, and if we find
+	 * a pair that are sufficiently similar, record the rename pair and
+	 * exclude those two from the NxM matrix.
+	 *
+	 * This *might* cause us to find a less than optimal pairing (if
+	 * there is another file that we are even more similar to but has a
+	 * different basename).  Given the huge performance advantage
+	 * basename matching provides, and given the frequency with which
+	 * people use the same basename in real world projects, that's a
+	 * trade-off we are willing to accept when doing just rename
+	 * detection.
+	 *
+	 * If someone wants copy detection that implies they are willing to
+	 * spend more cycles to find similarities between files, so it may
+	 * be less likely that this heuristic is wanted.  If someone is
+	 * doing break detection, that means they do not want filename
+	 * similarity to imply any form of content similarity, and thus
+	 * this heuristic would definitely be incompatible.
+	 */
+
+	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
 
+	/*
+	 * The prefetching stuff wants to know if it can skip prefetching
+	 * blobs that are unmodified...and will then do a little extra work
+	 * to verify that the oids are indeed different before prefetching.
+	 * Unmodified blobs are only relevant when doing copy detection;
+	 * when limiting to rename detection, diffcore_rename[_extended]()
+	 * will never be called with unmodified source paths fed to us, so
+	 * the extra work necessary to check if rename_src entries are
+	 * unmodified would be a small waste.
+	 */
+	int skip_unmodified = 0;
+
 	/* Create maps of basename -> fullname(s) for sources and dests */
 	strintmap_init_with_options(&sources, -1, NULL, 0);
 	strintmap_init_with_options(&dests, -1, NULL, 0);
@@ -420,12 +462,59 @@ static int find_basename_matches(struct diff_options *options,
 			strintmap_set(&dests, base, i);
 	}
 
-	/* TODO: Make use of basenames source and destination basenames */
+	/* Now look for basename matchups and do similarity estimation */
+	for (i = 0; i < num_src; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base = NULL;
+		intptr_t src_index;
+		intptr_t dst_index;
+
+		/* Find out if this basename is unique among sources */
+		base = get_basename(filename);
+		src_index = strintmap_get(&sources, base);
+		if (src_index == -1)
+			continue; /* not a unique basename; skip it */
+		assert(src_index == i);
+
+		if (strintmap_contains(&dests, base)) {
+			struct diff_filespec *one, *two;
+			int score;
+
+			/* Find out if this basename is unique among dests */
+			dst_index = strintmap_get(&dests, base);
+			if (dst_index == -1)
+				continue; /* not a unique basename; skip it */
+
+			/* Ignore this dest if already used in a rename */
+			if (rename_dst[dst_index].is_rename)
+				continue; /* already used previously */
+
+			/* Estimate the similarity */
+			one = rename_src[src_index].p->one;
+			two = rename_dst[dst_index].p->two;
+			score = estimate_similarity(options->repo, one, two,
+						    minimum_score, skip_unmodified);
+
+			/* If sufficiently similar, record as rename pair */
+			if (score < minimum_score)
+				continue;
+			record_rename_pair(dst_index, src_index, score);
+			renames++;
+
+			/*
+			 * Found a rename so don't need text anymore; if we
+			 * didn't find a rename, the filespec_blob would get
+			 * re-used when doing the matrix of comparisons.
+			 */
+			diff_free_filespec_blob(one);
+			diff_free_filespec_blob(two);
+		}
+	}
 
 	strintmap_clear(&sources);
 	strintmap_clear(&dests);
 
-	return 0;
+	return renames;
 }
 
 #define NUM_CANDIDATE_PER_DST 4
-- 
gitgitgadget



* [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-02-10 15:15     ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
@ 2021-02-10 15:15     ` Elijah Newren via GitGitGadget
  2021-02-13  1:49       ` Junio C Hamano
  2021-02-10 15:15     ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
  2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
  5 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-10 15:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames.  As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no 'extensions.txt' files elsewhere among the added
and deleted files, and if a similarity check confirms they are similar,
then they are marked as a rename without looking for a better similarity
match among other files.  This is a behavioral change, as covered in
more detail in the previous commit message.
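
The similarity check in this preliminary step uses a stricter threshold
than the normal one: the diff computes min_basename_score by moving the
minimum score part of the way towards MAX_SCORE.  A small sketch of that
arithmetic follows; the helper names are hypothetical, and the value
60000 for the maximum score is an assumption based on diffcore.h rather
than something stated in this message.

```c
#include <stdlib.h>

/* Assumed to mirror MAX_SCORE in diffcore.h; treat as illustrative. */
#define SKETCH_MAX_SCORE 60000

/* Convert a percentage string such as "50" into a factor such as 0.5. */
static double percent_to_factor(const char *percent)
{
	return strtol(percent, NULL, 10) / 100.0;
}

/*
 * Raise the bar for basename matches: factor 0.0 keeps minimum_score,
 * factor 1.0 demands SKETCH_MAX_SCORE, values in between interpolate.
 */
static int basename_min_score(int minimum_score, double factor)
{
	return minimum_score +
		(int)(factor * (SKETCH_MAX_SCORE - minimum_score));
}
```

Under these assumptions, a minimum score of 30000 (50%) with the factor
of 0.5 used in the diff gives 45000, i.e. same-basename pairs would need
to be 75% similar to be accepted in the preliminary step.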

We do not use this heuristic together with either break or copy
detection.  The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity.  The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far fewer resources but which might also
result in finding slightly fewer similarities.  So the idea behind this
optimization goes against both of those features, and will be turned off
for both.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       13.815 s ±  0.062 s    13.294 s ±  0.103 s
    mega-renames:   1799.937 s ±  0.493 s   187.248 s ±  0.882 s
    just-one-mega:    51.289 s ±  0.019 s     5.557 s ±  0.017 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c      | 54 ++++++++++++++++++++++++++++++++++++++----
 t/t4001-diff-rename.sh |  4 ++--
 2 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 001645624e71..df76e475c710 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -379,7 +379,6 @@ static const char *get_basename(const char *filename)
 	return base ? base + 1 : filename;
 }
 
-MAYBE_UNUSED
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 int num_src)
@@ -727,12 +726,57 @@ void diffcore_rename(struct diff_options *options)
 	if (minimum_score == MAX_SCORE)
 		goto cleanup;
 
+	num_sources = rename_src_nr;
+
+	if (want_copies || break_idx) {
+		/*
+		 * Cull sources:
+		 *   - remove ones corresponding to exact renames
+		 */
+		trace2_region_enter("diff", "cull after exact", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull after exact", options->repo);
+	} else {
+		/* Determine minimum score to match basenames */
+		double factor = 0.5;
+		char *basename_factor = getenv("GIT_BASENAME_FACTOR");
+		int min_basename_score;
+
+		if (basename_factor)
+			factor = strtol(basename_factor, NULL, 10)/100.0;
+		assert(factor >= 0.0 && factor <= 1.0);
+		min_basename_score = minimum_score +
+			(int)(factor * (MAX_SCORE - minimum_score));
+
+		/*
+		 * Cull sources:
+		 *   - remove ones involved in renames (found via exact match)
+		 */
+		trace2_region_enter("diff", "cull after exact", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull after exact", options->repo);
+
+		/* Utilize file basenames to quickly find renames. */
+		trace2_region_enter("diff", "basename matches", options->repo);
+		rename_count += find_basename_matches(options,
+						      min_basename_score,
+						      rename_src_nr);
+		trace2_region_leave("diff", "basename matches", options->repo);
+
+		/*
+		 * Cull sources, again:
+		 *   - remove ones involved in renames (found via basenames)
+		 */
+		trace2_region_enter("diff", "cull basename", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull basename", options->repo);
+	}
+
 	/*
-	 * Calculate how many renames are left
+	 * Calculate how many rename destinations are left
 	 */
 	num_destinations = (rename_dst_nr - rename_count);
-	remove_unneeded_paths_from_src(want_copies);
-	num_sources = rename_src_nr;
+	num_sources = rename_src_nr; /* rename_src_nr reflects lower number */
 
 	/* All done? */
 	if (!num_destinations || !num_sources)
@@ -764,7 +808,7 @@ void diffcore_rename(struct diff_options *options)
 		struct diff_score *m;
 
 		if (rename_dst[i].is_rename)
-			continue; /* dealt with exact match already. */
+			continue; /* exact or basename match already handled */
 
 		m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
 		for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index 797343b38106..bf62537c29a0 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -280,8 +280,8 @@ test_expect_success 'basename similarity vs best similarity' '
 	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
 	# but since same basenames are checked first...
 	cat >expected <<-\EOF &&
-	R088	subdir/file.txt	file.md
-	A	file.txt
+	A	file.md
+	R078	subdir/file.txt	file.txt
 	EOF
 	test_cmp expected actual
 '
-- 
gitgitgadget



* [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection
  2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
                       ` (3 preceding siblings ...)
  2021-02-10 15:15     ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
@ 2021-02-10 15:15     ` Elijah Newren via GitGitGadget
  2021-02-10 16:41       ` Junio C Hamano
  2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
  5 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-10 15:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

The last few patches have introduced a new preliminary step when rename
detection is on but both break detection and copy detection are off.
Document this new step.  While we're at it, add a testcase that checks
the new behavior as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/gitdiffcore.txt | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c970d9fe438a..36ebe364d874 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -168,6 +168,23 @@ a similarity score different from the default of 50% by giving a
 number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
 8/10 = 80%).
 
+Note that when rename detection is on but both copy and break
+detection are off, rename detection adds a preliminary step that first
+checks if files are moved across directories while keeping their
+filename the same.  If there is a file added to a directory whose
+contents is sufficiently similar to a file with the same name that got
+deleted from a different directory, it will mark them as renames and
+exclude them from the later quadratic step (the one that pairwise
+compares all unmatched files to find the "best" matches, determined by
+the highest content similarity).  So, for example, if
+docs/extensions.txt and docs/config/extensions.txt have similar
+content, then they will be marked as a rename even if it turns out
+that docs/extensions.txt was more similar to src/extension-checks.c.
+At most, one comparison is done per file in this preliminary pass; so
+if there are several extensions.txt files throughout the directory
+hierarchy that were added and deleted, this preliminary step will be
+skipped for those files.
+
 Note.  When the "-C" option is used with `--find-copies-harder`
 option, 'git diff-{asterisk}' commands feed unmodified filepairs to
 diffcore mechanism as well as modified ones.  This lets the copy
-- 
gitgitgadget


* Re: [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection
  2021-02-10 15:15     ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
@ 2021-02-10 16:41       ` Junio C Hamano
  2021-02-10 17:20         ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2021-02-10 16:41 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King,
	Elijah Newren, Derrick Stolee

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>
>
> The last few patches have introduced a new preliminary step when rename
> detection is on but both break detection and copy detection are off.
> Document this new step.  While we're at it, add a testcase that checks
> the new behavior as well.
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  Documentation/gitdiffcore.txt | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> index c970d9fe438a..36ebe364d874 100644
> --- a/Documentation/gitdiffcore.txt
> +++ b/Documentation/gitdiffcore.txt
> @@ -168,6 +168,23 @@ a similarity score different from the default of 50% by giving a
>  number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
>  8/10 = 80%).
>  
> +Note that when rename detection is on but both copy and break
> +detection are off, rename detection adds a preliminary step that first
> +checks if files are moved across directories while keeping their
> +filename the same.  If there is a file added to a directory whose
> +contents is sufficiently similar to a file with the same name that got
> +deleted from a different directory, it will mark them as renames and
> +exclude them from the later quadratic step (the one that pairwise
> +compares all unmatched files to find the "best" matches, determined by
> +the highest content similarity).  So, for example, if
> +docs/extensions.txt and docs/config/extensions.txt have similar
> +content, then they will be marked as a rename even if it turns out
> +that docs/extensions.txt was more similar to src/extension-checks.c.

I'd rather use docs/extensions.md instead of src/extension-checks.c;
it would be more realistic for .md to be similar to .txt than .c.

With a raised bar for this step, the equation changes a bit, no?  

    So, for example, if a deleted docs/ext.txt and an added
    docs/config/ext.txt are similar enough, they will be marked as a
    rename and prevent an added docs/ext.md that may be even similar
    to the deleted docs/ext.txt from being considered as the rename
    destination in the later step.  For this reason, the preliminary
    "match same filename" step uses a bit higher threshold to mark a
    file pair as a rename and stop considering other candidates for
    better matches.

or something?

> +At most, one comparison is done per file in this preliminary pass; so
> +if there are several extensions.txt files throughout the directory
> +hierarchy that were added and deleted, this preliminary step will be
> +skipped for those files.

Other than that, the whole series looked sensible to my cursory
read.

Thanks.



* Re: [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection
  2021-02-10 16:41       ` Junio C Hamano
@ 2021-02-10 17:20         ` Elijah Newren
  0 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren @ 2021-02-10 17:20 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Derrick Stolee

On Wed, Feb 10, 2021 at 8:41 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > The last few patches have introduced a new preliminary step when rename
> > detection is on but both break detection and copy detection are off.
> > Document this new step.  While we're at it, add a testcase that checks
> > the new behavior as well.
> >
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> >  Documentation/gitdiffcore.txt | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> >
> > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> > index c970d9fe438a..36ebe364d874 100644
> > --- a/Documentation/gitdiffcore.txt
> > +++ b/Documentation/gitdiffcore.txt
> > @@ -168,6 +168,23 @@ a similarity score different from the default of 50% by giving a
> >  number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
> >  8/10 = 80%).
> >
> > +Note that when rename detection is on but both copy and break
> > +detection are off, rename detection adds a preliminary step that first
> > +checks if files are moved across directories while keeping their
> > +filename the same.  If there is a file added to a directory whose
> > +contents is sufficiently similar to a file with the same name that got
> > +deleted from a different directory, it will mark them as renames and
> > +exclude them from the later quadratic step (the one that pairwise
> > +compares all unmatched files to find the "best" matches, determined by
> > +the highest content similarity).  So, for example, if
> > +docs/extensions.txt and docs/config/extensions.txt have similar
> > +content, then they will be marked as a rename even if it turns out
> > +that docs/extensions.txt was more similar to src/extension-checks.c.
>
> I'd rather use docs/extensions.md instead of src/extension-checks.c;
> it would be more realistic for .md to be similar to .txt than .c.
>
> With a raised bar for this step, the equation changes a bit, no?
>
>     So, for example, if a deleted docs/ext.txt and an added
>     docs/config/ext.txt are similar enough, they will be marked as a
>     rename and prevent an added docs/ext.md that may be even similar
>     to the deleted docs/ext.txt from being considered as the rename
>     destination in the later step.  For this reason, the preliminary
>     "match same filename" step uses a bit higher threshold to mark a
>     file pair as a rename and stop considering other candidates for
>     better matches.
>
> or something?

Good points; I've updated the docs locally to reflect your
suggestions, I'll wait a bit for any other feedback and then send out
a new round with this update.

> > +At most, one comparison is done per file in this preliminary pass; so
> > +if there are several extensions.txt files throughout the directory
> > +hierarchy that were added and deleted, this preliminary step will be
> > +skipped for those files.
>
> Other than that, the whole series looked sensible to my cursory
> read.

Thanks.


* [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection
  2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
                       ` (4 preceding siblings ...)
  2021-02-10 15:15     ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
@ 2021-02-11  8:15     ` Elijah Newren via GitGitGadget
  2021-02-11  8:15       ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
                         ` (7 more replies)
  5 siblings, 8 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-11  8:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren

This series depends on ort-perf-batch-6[1].

This series uses file basenames (portion of the path after final '/',
including extension) in a basic fashion to guide rename detection.

Changes since v3:

 * update documentation as suggested by Junio
 * NEW: add another patch at the end, to simplify patch series that will be
   submitted later (please review!)

[1] https://lore.kernel.org/git/xmqqlfc4byt6.fsf@gitster.c.googlers.com/

Elijah Newren (6):
  t4001: add a test comparing basename similarity and content similarity
  diffcore-rename: compute basenames of all source and dest candidates
  diffcore-rename: complete find_basename_matches()
  diffcore-rename: guide inexact rename detection based on basenames
  gitdiffcore doc: mention new preliminary step for rename detection
  merge-ort: call diffcore_rename() directly

 Documentation/gitdiffcore.txt |  20 ++++
 diffcore-rename.c             | 202 +++++++++++++++++++++++++++++++++-
 merge-ort.c                   |  66 +++++++++--
 t/t4001-diff-rename.sh        |  24 ++++
 4 files changed, 301 insertions(+), 11 deletions(-)


base-commit: 7ae9460d3dba84122c2674b46e4339b9d42bdedd
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-843%2Fnewren%2Fort-perf-batch-7-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/843

Range-diff vs v3:

 1:  3e6af929d135 = 1:  3e6af929d135 t4001: add a test comparing basename similarity and content similarity
 2:  4fff9b1ff57b = 2:  4fff9b1ff57b diffcore-rename: compute basenames of all source and dest candidates
 3:  dc26881e4ed3 = 3:  dc26881e4ed3 diffcore-rename: complete find_basename_matches()
 4:  2493f4b2f55d = 4:  2493f4b2f55d diffcore-rename: guide inexact rename detection based on basenames
 5:  fc72d24a3358 ! 5:  4e86ed3f29d4 gitdiffcore doc: mention new preliminary step for rename detection
     @@ Documentation/gitdiffcore.txt: a similarity score different from the default of
      +deleted from a different directory, it will mark them as renames and
      +exclude them from the later quadratic step (the one that pairwise
      +compares all unmatched files to find the "best" matches, determined by
     -+the highest content similarity).  So, for example, if
     -+docs/extensions.txt and docs/config/extensions.txt have similar
     -+content, then they will be marked as a rename even if it turns out
     -+that docs/extensions.txt was more similar to src/extension-checks.c.
     -+At most, one comparison is done per file in this preliminary pass; so
     -+if there are several extensions.txt files throughout the directory
     -+hierarchy that were added and deleted, this preliminary step will be
     -+skipped for those files.
     ++the highest content similarity).  So, for example, if a deleted
     ++docs/ext.txt and an added docs/config/ext.txt are similar enough, they
     ++will be marked as a rename and prevent an added docs/ext.md that may
     ++be even more similar to the deleted docs/ext.txt from being considered
     ++as the rename destination in the later step.  For this reason, the
     ++preliminary "match same filename" step uses a bit higher threshold to
     ++mark a file pair as a rename and stop considering other candidates for
     ++better matches.  At most, one comparison is done per file in this
     ++preliminary pass; so if there are several ext.txt files throughout the
     ++directory hierarchy that were added and deleted, this preliminary step
     ++will be skipped for those files.
      +
       Note.  When the "-C" option is used with `--find-copies-harder`
       option, 'git diff-{asterisk}' commands feed unmodified filepairs to
 -:  ------------ > 6:  fedb3d323d94 merge-ort: call diffcore_rename() directly

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity
  2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
@ 2021-02-11  8:15       ` Elijah Newren via GitGitGadget
  2021-02-11  8:15       ` [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
                         ` (6 subsequent siblings)
  7 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-11  8:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add a simple test where a removed file is similar to two different added
files; one of them has the same basename, and the other has a slightly
higher content similarity.  Without break detection, filename similarity
of 100% trumps content similarity for pairing up related files.  For
any filename similarity less than 100%, the opposite is true -- content
similarity is all that matters.  Add a testcase that documents this.
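
The percentages in this test can be checked by hand.  What follows is a
minimal sketch, not the real estimate_similarity(): it assumes every byte
of the smaller file is also found in the larger one (true here, since the
added files are just truncated copies), and percent_similarity() is a
hypothetical helper name.  MAX_SCORE is assumed to be 60000.0 as in
diffcore.h.

```c
#define MAX_SCORE 60000.0 /* assumed to match git's diffcore.h */

/*
 * Simplified score: bytes shared with the other file, scaled by the
 * size of the larger file, then truncated to a whole percentage the
 * way the R<nnn> status column reports it.
 */
static int percent_similarity(unsigned long copied, unsigned long max_size)
{
	int score = (int)(copied * MAX_SCORE / max_size);

	return (int)(score * 100 / MAX_SCORE);
}
```

subdir/file.txt is 61 bytes (nine 6-byte lines plus the 7-byte "line10"
line), file.md is 54 bytes and file.txt is 48 bytes, so
percent_similarity(54, 61) and percent_similarity(48, 61) give the two
scores the test compares.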

Subsequent commits will add a new rule that includes an in-between state,
where a mixture of filename similarity and content similarity is
weighed, and which will change the outcome of this testcase.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t4001-diff-rename.sh | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index c16486a9d41a..797343b38106 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -262,4 +262,28 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
 	grep "myotherfile.*myfile" actual
 '
 
+test_expect_success 'basename similarity vs best similarity' '
+	mkdir subdir &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			 line6 line7 line8 line9 line10 >subdir/file.txt &&
+	git add subdir/file.txt &&
+	git commit -m "base txt" &&
+
+	git rm subdir/file.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 >file.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 line9 >file.md &&
+	git add file.txt file.md &&
+	git commit -a -m "rename" &&
+	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
+	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
+	# but since same basenames are checked first...
+	cat >expected <<-\EOF &&
+	R088	subdir/file.txt	file.md
+	A	file.txt
+	EOF
+	test_cmp expected actual
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates
  2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
  2021-02-11  8:15       ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
@ 2021-02-11  8:15       ` Elijah Newren via GitGitGadget
  2021-02-11  8:15       ` [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
                         ` (5 subsequent siblings)
  7 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-11  8:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

We want to make use of unique basenames to help inform rename detection,
so that more likely pairings can be checked first.  (src/moduleA/foo.txt
and source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the deleted and added files.)  Add a new function,
not yet used, which creates a map of the unique basenames within
rename_src and another within rename_dst, together with the indices
within rename_src/rename_dst where those basenames show up.  Non-unique
basenames still show up in the map, but have an invalid index (-1).
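
The "unique basename or -1" bookkeeping can be sketched without
strintmap.  This is only an illustration under simplifying assumptions:
it rescans a plain array instead of building a map, and
unique_basename_index() and its -2 "absent" return value are
hypothetical, not part of the patch.

```c
#include <string.h>

/* Same idea as the patch's get_basename(): text after the last '/'. */
static const char *basename_of(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? slash + 1 : path;
}

/*
 * Return the index of the only entry in paths[] whose basename equals
 * base, -1 if several entries share that basename, or -2 if none does.
 */
static int unique_basename_index(const char *const *paths, int n,
				 const char *base)
{
	int i, found = -2;

	for (i = 0; i < n; i++) {
		if (strcmp(basename_of(paths[i]), base))
			continue;
		if (found != -2)
			return -1; /* second hit: basename is not unique */
		found = i;
	}
	return found;
}
```

The patch gets the same answer in one pass per side by storing -1 in the
map the second time a basename is seen.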

This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names.  Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
  * linux: 76%
  * gcc: 64%
  * gecko: 79%
  * webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 74930716e70d..3eb49a098adf 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,67 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
+static const char *get_basename(const char *filename)
+{
+	/*
+	 * gitbasename() has to worry about special drives, multiple
+	 * directory separator characters, trailing slashes, NULL or
+	 * empty strings, etc.  We only work on filenames as stored in
+	 * git, and thus get to ignore all those complications.
+	 */
+	const char *base = strrchr(filename, '/');
+	return base ? base + 1 : filename;
+}
+
+MAYBE_UNUSED
+static int find_basename_matches(struct diff_options *options,
+				 int minimum_score,
+				 int num_src)
+{
+	int i;
+	struct strintmap sources;
+	struct strintmap dests;
+
+	/* Create maps of basename -> fullname(s) for sources and dests */
+	strintmap_init_with_options(&sources, -1, NULL, 0);
+	strintmap_init_with_options(&dests, -1, NULL, 0);
+	for (i = 0; i < num_src; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base;
+
+		/* exact renames removed in remove_unneeded_paths_from_src() */
+		assert(!rename_src[i].p->one->rename_used);
+
+		/* Record index within rename_src (i) if basename is unique */
+		base = get_basename(filename);
+		if (strintmap_contains(&sources, base))
+			strintmap_set(&sources, base, -1);
+		else
+			strintmap_set(&sources, base, i);
+	}
+	for (i = 0; i < rename_dst_nr; ++i) {
+		char *filename = rename_dst[i].p->two->path;
+		const char *base;
+
+		if (rename_dst[i].is_rename)
+			continue; /* involved in exact match already. */
+
+		/* Record index within rename_dst (i) if basename is unique */
+		base = get_basename(filename);
+		if (strintmap_contains(&dests, base))
+			strintmap_set(&dests, base, -1);
+		else
+			strintmap_set(&dests, base, i);
+	}
+
+	/* TODO: Make use of basenames source and destination basenames */
+
+	strintmap_clear(&sources);
+	strintmap_clear(&dests);
+
+	return 0;
+}
+
 #define NUM_CANDIDATE_PER_DST 4
 static void record_if_better(struct diff_score m[], struct diff_score *o)
 {
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v4 3/6] diffcore-rename: complete find_basename_matches()
  2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
  2021-02-11  8:15       ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
  2021-02-11  8:15       ` [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
@ 2021-02-11  8:15       ` Elijah Newren via GitGitGadget
  2021-02-11  8:15       ` [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
                         ` (4 subsequent siblings)
  7 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-11  8:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories.  We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames.  If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.

This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file.  For
example, if there was an include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there were no other
device.h files added or deleted between the commits being compared,
then these two files would be compared in the preliminary step.
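
The decision made for each basename in this preliminary step can be
summarized as a small function.  This is a sketch of the control flow
only; the enum labels and basename_step() are hypothetical names, and
the score is assumed to come from a single similarity estimate of the
one candidate pair.

```c
/* Hypothetical outcome labels; not names used by the patch. */
enum basename_outcome {
	NOT_UNIQUE,      /* basename missing or ambiguous on one side */
	RECORD_RENAME,   /* similar enough: pair them, skip the matrix */
	DEFER_TO_MATRIX  /* compared once, but not similar enough */
};

/*
 * src_idx/dst_idx: index of the unique source/destination carrying
 * the basename, or -1 when it is missing or not unique on that side.
 * score: similarity estimate for that single pair.
 */
static enum basename_outcome basename_step(int src_idx, int dst_idx,
					   int score, int minimum_score)
{
	if (src_idx < 0 || dst_idx < 0)
		return NOT_UNIQUE;
	if (score < minimum_score)
		return DEFER_TO_MATRIX;
	return RECORD_RENAME;
}
```

Only the RECORD_RENAME case removes files from the later exhaustive
comparison; the other two leave them as ordinary candidates.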

This commit does not yet actually employ this new optimization; it
merely adds a function which can be used for this purpose.  The next
commit will do the necessary plumbing to make use of it.

Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename.  I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show, it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches.  Reason (4) takes a full paragraph to
explain...

If the previous three reasons aren't enough, consider what rename
detection already does.  Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar.  In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity.  Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents.  This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well.  Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 95 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 92 insertions(+), 3 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 3eb49a098adf..001645624e71 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -384,10 +384,52 @@ static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 int num_src)
 {
-	int i;
+	/*
+	 * When I checked in early 2020, over 76% of file renames in linux
+	 * just moved files to a different directory but kept the same
+	 * basename.  gcc did that with over 64% of renames, gecko did it
+	 * with over 79%, and WebKit did it with over 89%.
+	 *
+	 * Therefore we can bypass the normal exhaustive NxM matrix
+	 * comparison of similarities between all potential rename sources
+	 * and destinations by instead using file basename as a hint (i.e.
+	 * the portion of the filename after the last '/'), checking for
+	 * similarity between files with the same basename, and if we find
+	 * a pair that are sufficiently similar, record the rename pair and
+	 * exclude those two from the NxM matrix.
+	 *
+	 * This *might* cause us to find a less than optimal pairing (if
+	 * there is another file that we are even more similar to but has a
+	 * different basename).  Given the huge performance advantage
+	 * basename matching provides, and given the frequency with which
+	 * people use the same basename in real world projects, that's a
+	 * trade-off we are willing to accept when doing just rename
+	 * detection.
+	 *
+	 * If someone wants copy detection that implies they are willing to
+	 * spend more cycles to find similarities between files, so it may
+	 * be less likely that this heuristic is wanted.  If someone is
+	 * doing break detection, that means they do not want filename
+	 * similarity to imply any form of content similarity, and thus
+	 * this heuristic would definitely be incompatible.
+	 */
+
+	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
 
+	/*
+	 * The prefetching stuff wants to know if it can skip prefetching
+	 * blobs that are unmodified...and will then do a little extra work
+	 * to verify that the oids are indeed different before prefetching.
+	 * Unmodified blobs are only relevant when doing copy detection;
+	 * when limiting to rename detection, diffcore_rename[_extended]()
+	 * will never be called with unmodified source paths fed to us, so
+	 * the extra work necessary to check if rename_src entries are
+	 * unmodified would be a small waste.
+	 */
+	int skip_unmodified = 0;
+
 	/* Create maps of basename -> fullname(s) for sources and dests */
 	strintmap_init_with_options(&sources, -1, NULL, 0);
 	strintmap_init_with_options(&dests, -1, NULL, 0);
@@ -420,12 +462,59 @@ static int find_basename_matches(struct diff_options *options,
 			strintmap_set(&dests, base, i);
 	}
 
-	/* TODO: Make use of basenames source and destination basenames */
+	/* Now look for basename matchups and do similarity estimation */
+	for (i = 0; i < num_src; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base = NULL;
+		intptr_t src_index;
+		intptr_t dst_index;
+
+		/* Find out if this basename is unique among sources */
+		base = get_basename(filename);
+		src_index = strintmap_get(&sources, base);
+		if (src_index == -1)
+			continue; /* not a unique basename; skip it */
+		assert(src_index == i);
+
+		if (strintmap_contains(&dests, base)) {
+			struct diff_filespec *one, *two;
+			int score;
+
+			/* Find out if this basename is unique among dests */
+			dst_index = strintmap_get(&dests, base);
+			if (dst_index == -1)
+				continue; /* not a unique basename; skip it */
+
+			/* Ignore this dest if already used in a rename */
+			if (rename_dst[dst_index].is_rename)
+				continue; /* already used previously */
+
+			/* Estimate the similarity */
+			one = rename_src[src_index].p->one;
+			two = rename_dst[dst_index].p->two;
+			score = estimate_similarity(options->repo, one, two,
+						    minimum_score, skip_unmodified);
+
+			/* If sufficiently similar, record as rename pair */
+			if (score < minimum_score)
+				continue;
+			record_rename_pair(dst_index, src_index, score);
+			renames++;
+
+			/*
+			 * Found a rename so don't need text anymore; if we
+			 * didn't find a rename, the filespec_blob would get
+			 * re-used when doing the matrix of comparisons.
+			 */
+			diff_free_filespec_blob(one);
+			diff_free_filespec_blob(two);
+		}
+	}
 
 	strintmap_clear(&sources);
 	strintmap_clear(&dests);
 
-	return 0;
+	return renames;
 }
 
 #define NUM_CANDIDATE_PER_DST 4
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
                         ` (2 preceding siblings ...)
  2021-02-11  8:15       ` [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
@ 2021-02-11  8:15       ` Elijah Newren via GitGitGadget
  2021-02-11  8:15       ` [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
                         ` (3 subsequent siblings)
  7 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-11  8:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames.  As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered a
likely rename if there are no other 'extensions.txt' files among the
added and deleted files.  If a similarity check confirms they are
similar, they are marked as a rename without looking for a better
similarity match among other files.  This is a behavioral change, as
covered in more detail in the previous commit message.
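
The similarity check for basename matches uses a raised threshold, which
the diff in this patch computes from the regular minimum score and a
factor.  A minimal sketch of that arithmetic follows; the function name
min_basename_score() is hypothetical (the patch computes the value
inline), and MAX_SCORE is assumed to be 60000.0 as in diffcore.h.

```c
#define MAX_SCORE 60000.0 /* assumed to match git's diffcore.h */

/*
 * Move the threshold from minimum_score toward MAX_SCORE by the given
 * factor (0.0 keeps the regular threshold, 1.0 demands an exact match).
 */
static int min_basename_score(int minimum_score, double factor)
{
	return minimum_score + (int)(factor * (MAX_SCORE - minimum_score));
}
```

With the default 50% rename threshold (30000) and the default factor of
0.5, this gives 45000, i.e. 75%.  That is why the 78% similar
subdir/file.txt -> file.txt pair in t4001 is accepted by the basename
step.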

We do not use this heuristic together with either break or copy
detection.  The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity.  The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far fewer resources but which might also result
in finding slightly fewer similarities.  So the idea behind this
optimization goes against both of those features, and will be turned off
for both.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       13.815 s ±  0.062 s    13.294 s ±  0.103 s
    mega-renames:   1799.937 s ±  0.493 s   187.248 s ±  0.882 s
    just-one-mega:    51.289 s ±  0.019 s     5.557 s ±  0.017 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c      | 54 ++++++++++++++++++++++++++++++++++++++----
 t/t4001-diff-rename.sh |  4 ++--
 2 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 001645624e71..df76e475c710 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -379,7 +379,6 @@ static const char *get_basename(const char *filename)
 	return base ? base + 1 : filename;
 }
 
-MAYBE_UNUSED
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 int num_src)
@@ -727,12 +726,57 @@ void diffcore_rename(struct diff_options *options)
 	if (minimum_score == MAX_SCORE)
 		goto cleanup;
 
+	num_sources = rename_src_nr;
+
+	if (want_copies || break_idx) {
+		/*
+		 * Cull sources:
+		 *   - remove ones corresponding to exact renames
+		 */
+		trace2_region_enter("diff", "cull after exact", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull after exact", options->repo);
+	} else {
+		/* Determine minimum score to match basenames */
+		double factor = 0.5;
+		char *basename_factor = getenv("GIT_BASENAME_FACTOR");
+		int min_basename_score;
+
+		if (basename_factor)
+			factor = strtol(basename_factor, NULL, 10)/100.0;
+		assert(factor >= 0.0 && factor <= 1.0);
+		min_basename_score = minimum_score +
+			(int)(factor * (MAX_SCORE - minimum_score));
+
+		/*
+		 * Cull sources:
+		 *   - remove ones involved in renames (found via exact match)
+		 */
+		trace2_region_enter("diff", "cull after exact", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull after exact", options->repo);
+
+		/* Utilize file basenames to quickly find renames. */
+		trace2_region_enter("diff", "basename matches", options->repo);
+		rename_count += find_basename_matches(options,
+						      min_basename_score,
+						      rename_src_nr);
+		trace2_region_leave("diff", "basename matches", options->repo);
+
+		/*
+		 * Cull sources, again:
+		 *   - remove ones involved in renames (found via basenames)
+		 */
+		trace2_region_enter("diff", "cull basename", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull basename", options->repo);
+	}
+
 	/*
-	 * Calculate how many renames are left
+	 * Calculate how many rename destinations are left
 	 */
 	num_destinations = (rename_dst_nr - rename_count);
-	remove_unneeded_paths_from_src(want_copies);
-	num_sources = rename_src_nr;
+	num_sources = rename_src_nr; /* rename_src_nr reflects lower number */
 
 	/* All done? */
 	if (!num_destinations || !num_sources)
@@ -764,7 +808,7 @@ void diffcore_rename(struct diff_options *options)
 		struct diff_score *m;
 
 		if (rename_dst[i].is_rename)
-			continue; /* dealt with exact match already. */
+			continue; /* exact or basename match already handled */
 
 		m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
 		for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index 797343b38106..bf62537c29a0 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -280,8 +280,8 @@ test_expect_success 'basename similarity vs best similarity' '
 	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
 	# but since same basenames are checked first...
 	cat >expected <<-\EOF &&
-	R088	subdir/file.txt	file.md
-	A	file.txt
+	A	file.md
+	R078	subdir/file.txt	file.txt
 	EOF
 	test_cmp expected actual
 '
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection
  2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
                         ` (3 preceding siblings ...)
  2021-02-11  8:15       ` [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
@ 2021-02-11  8:15       ` Elijah Newren via GitGitGadget
  2021-02-11  8:15       ` [PATCH v4 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
                         ` (2 subsequent siblings)
  7 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-11  8:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

The last few patches have introduced a new preliminary step when rename
detection is on but both break detection and copy detection are off.
Document this new step.  While we're at it, add a testcase that checks
the new behavior as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/gitdiffcore.txt | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c970d9fe438a..edf92d988c8f 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -168,6 +168,26 @@ a similarity score different from the default of 50% by giving a
 number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
 8/10 = 80%).
 
+Note that when rename detection is on but both copy and break
+detection are off, rename detection adds a preliminary step that first
+checks if files are moved across directories while keeping their
+filename the same.  If there is a file added to a directory whose
+contents are sufficiently similar to a file with the same name that got
+deleted from a different directory, it will mark them as renames and
+exclude them from the later quadratic step (the one that pairwise
+compares all unmatched files to find the "best" matches, determined by
+the highest content similarity).  So, for example, if a deleted
+docs/ext.txt and an added docs/config/ext.txt are similar enough, they
+will be marked as a rename and prevent an added docs/ext.md that may
+be even more similar to the deleted docs/ext.txt from being considered
+as the rename destination in the later step.  For this reason, the
+preliminary "match same filename" step uses a bit higher threshold to
+mark a file pair as a rename and stop considering other candidates for
+better matches.  At most, one comparison is done per file in this
+preliminary pass; so if there are several ext.txt files throughout the
+directory hierarchy that were added and deleted, this preliminary step
+will be skipped for those files.
+
 Note.  When the "-C" option is used with `--find-copies-harder`
 option, 'git diff-{asterisk}' commands feed unmodified filepairs to
 diffcore mechanism as well as modified ones.  This lets the copy
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH v4 6/6] merge-ort: call diffcore_rename() directly
  2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
                         ` (4 preceding siblings ...)
  2021-02-11  8:15       ` [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
@ 2021-02-11  8:15       ` Elijah Newren via GitGitGadget
  2021-02-13  1:53       ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
  2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
  7 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-11  8:15 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

We want to pass additional information to diffcore_rename() (or some
variant thereof) without plumbing that extra information through
diff_tree_oid() and diffcore_std().  Further, since we will need to
gather additional special information related to diffs and are walking
the trees anyway in collect_merge_info(), it seems odd to have
diff_tree_oid()/diffcore_std() repeat those tree walks.  And there may
be times where we can avoid traversing into a subtree in
collect_merge_info() (based on additional information at our disposal),
that the basic diff logic would be unable to take advantage of.  For all
these reasons, just create the add and delete pairs ourselves and then
call diffcore_rename() directly.
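
How the add and delete pairs fall out of the file mask can be sketched
in isolation.  The sketch assumes the mask layout used by the diff below
(bit 0 = merge base, bit 1 = side 1, bit 2 = side 2); pair_kind() is a
hypothetical helper that only reports what would be queued, rather than
calling add_pair().

```c
/*
 * Return 'D' if the path was deleted on the given side (present in the
 * base, absent on that side), 'A' if it was added there (absent in the
 * base, present on that side), or 0 if no pair is queued.
 * side is 1 or 2.
 */
static char pair_kind(unsigned filemask, unsigned side)
{
	unsigned side_mask = 1u << side;

	if (filemask == 0 || filemask == 7)
		return 0; /* a file nowhere, or a file everywhere */
	if ((filemask & 1) && !(filemask & side_mask))
		return 'D';
	if (!(filemask & 1) && (filemask & side_mask))
		return 'A';
	return 0;
}
```

For example, a filemask of 3 (file in the base and on side 1 only)
queues a delete for side 2 and nothing for side 1.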

This change is primarily about enabling future optimizations; the
advantage of avoiding extra tree traversals is small compared to the
cost of rename detection, and the advantage of avoiding the extra tree
traversals is somewhat offset by the extra time spent in
collect_merge_info() collecting the additional data anyway.  However...

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       13.294 s ±  0.103 s    12.775 s ±  0.062 s
    mega-renames:    187.248 s ±  0.882 s   188.754 s ±  0.284 s
    just-one-mega:     5.557 s ±  0.017 s     5.599 s ±  0.019 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 59 insertions(+), 7 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 931b91438cf1..603d30c52170 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -535,6 +535,23 @@ static void setup_path_info(struct merge_options *opt,
 	result->util = mi;
 }
 
+static void add_pair(struct merge_options *opt,
+		     struct name_entry *names,
+		     const char *pathname,
+		     unsigned side,
+		     unsigned is_add /* if false, is_delete */)
+{
+	struct diff_filespec *one, *two;
+	struct rename_info *renames = &opt->priv->renames;
+	int names_idx = is_add ? side : 0;
+
+	one = alloc_filespec(pathname);
+	two = alloc_filespec(pathname);
+	fill_filespec(is_add ? two : one,
+		      &names[names_idx].oid, 1, names[names_idx].mode);
+	diff_queue(&renames->pairs[side], one, two);
+}
+
 static void collect_rename_info(struct merge_options *opt,
 				struct name_entry *names,
 				const char *dirname,
@@ -544,6 +561,7 @@ static void collect_rename_info(struct merge_options *opt,
 				unsigned match_mask)
 {
 	struct rename_info *renames = &opt->priv->renames;
+	unsigned side;
 
 	/* Update dirs_removed, as needed */
 	if (dirmask == 1 || dirmask == 3 || dirmask == 5) {
@@ -554,6 +572,21 @@ static void collect_rename_info(struct merge_options *opt,
 		if (sides & 2)
 			strset_add(&renames->dirs_removed[2], fullname);
 	}
+
+	if (filemask == 0 || filemask == 7)
+		return;
+
+	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; ++side) {
+		unsigned side_mask = (1 << side);
+
+		/* Check for deletion on side */
+		if ((filemask & 1) && !(filemask & side_mask))
+			add_pair(opt, names, fullname, side, 0 /* delete */);
+
+		/* Check for addition on side */
+		if (!(filemask & 1) && (filemask & side_mask))
+			add_pair(opt, names, fullname, side, 1 /* add */);
+	}
 }
 
 static int collect_merge_info_callback(int n,
@@ -2079,6 +2112,27 @@ static int process_renames(struct merge_options *opt,
 	return clean_merge;
 }
 
+static void resolve_diffpair_statuses(struct diff_queue_struct *q)
+{
+	/*
+	 * A simplified version of diff_resolve_rename_copy(); would probably
+	 * just use that function but it's static...
+	 */
+	int i;
+	struct diff_filepair *p;
+
+	for (i = 0; i < q->nr; ++i) {
+		p = q->queue[i];
+		p->status = 0; /* undecided */
+		if (!DIFF_FILE_VALID(p->one))
+			p->status = DIFF_STATUS_ADDED;
+		else if (!DIFF_FILE_VALID(p->two))
+			p->status = DIFF_STATUS_DELETED;
+		else if (DIFF_PAIR_RENAME(p))
+			p->status = DIFF_STATUS_RENAMED;
+	}
+}
+
 static int compare_pairs(const void *a_, const void *b_)
 {
 	const struct diff_filepair *a = *((const struct diff_filepair **)a_);
@@ -2089,8 +2143,6 @@ static int compare_pairs(const void *a_, const void *b_)
 
 /* Call diffcore_rename() to compute which files have changed on given side */
 static void detect_regular_renames(struct merge_options *opt,
-				   struct tree *merge_base,
-				   struct tree *side,
 				   unsigned side_index)
 {
 	struct diff_options diff_opts;
@@ -2108,11 +2160,11 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
 	diff_setup_done(&diff_opts);
 
+	diff_queued_diff = renames->pairs[side_index];
 	trace2_region_enter("diff", "diffcore_rename", opt->repo);
-	diff_tree_oid(&merge_base->object.oid, &side->object.oid, "",
-		      &diff_opts);
-	diffcore_std(&diff_opts);
+	diffcore_rename(&diff_opts);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
+	resolve_diffpair_statuses(&diff_queued_diff);
 
 	if (diff_opts.needed_rename_limit > renames->needed_limit)
 		renames->needed_limit = diff_opts.needed_rename_limit;
@@ -2212,8 +2264,8 @@ static int detect_and_process_renames(struct merge_options *opt,
 	memset(&combined, 0, sizeof(combined));
 
 	trace2_region_enter("merge", "regular renames", opt->repo);
-	detect_regular_renames(opt, merge_base, side1, MERGE_SIDE1);
-	detect_regular_renames(opt, merge_base, side2, MERGE_SIDE2);
+	detect_regular_renames(opt, MERGE_SIDE1);
+	detect_regular_renames(opt, MERGE_SIDE2);
 	trace2_region_leave("merge", "regular renames", opt->repo);
 
 	trace2_region_enter("merge", "directory renames", opt->repo);
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity
  2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
@ 2021-02-13  1:15       ` Junio C Hamano
  2021-02-13  4:50         ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2021-02-13  1:15 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King,
	Elijah Newren, Derrick Stolee

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>
>
> Add a simple test where a removed file is similar to two different added
> files; one of them has the same basename, and the other has a slightly
> higher content similarity.  Without break detection, filename similarity
> of 100% trumps content similarity for pairing up related files.  For
> any filename similarity less than 100%, the opposite is true -- content
> similarity is all that matters.  Add a testcase that documents this.

I am not sure why it is the "opposite".  When contents are similar
to the same degree of 100%, we tiebreak with the filename.  We never
favor a pair between the same filename over a pair between different
filenames with better content similarity.

And when contents are similar to the same degree of less than 100%,
we do not favor a pair between the same filename over a pair between
different filenames, as long as they are similar to the same degree.

So, I do not think "opposite" is helping readers to understand what
is going on.

> +test_expect_success 'basename similarity vs best similarity' '
> +	mkdir subdir &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			 line6 line7 line8 line9 line10 >subdir/file.txt &&
> +	git add subdir/file.txt &&
> +	git commit -m "base txt" &&
> +
> +	git rm subdir/file.txt &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			  line6 line7 line8 >file.txt &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			  line6 line7 line8 line9 >file.md &&
> +	git add file.txt file.md &&
> +	git commit -a -m "rename" &&
> +	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
> +	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
> +	# but since same basenames are checked first...

I am not sure what the second line of this comment wants to imply
with the ellipses here.  Care to finish the sentence?

Or was the second line planned to be added when we start applying
the "check only the same filename first and see if we find a
better-than-reasonable match" heuristics but somehow survived
"rebase -i" and ended up here?

> +	cat >expected <<-\EOF &&
> +	R088	subdir/file.txt	file.md
> +	A	file.txt
> +	EOF
> +	test_cmp expected actual

Thanks.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates
  2021-02-10 15:15     ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
@ 2021-02-13  1:32       ` Junio C Hamano
  0 siblings, 0 replies; 71+ messages in thread
From: Junio C Hamano @ 2021-02-13  1:32 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King,
	Elijah Newren, Derrick Stolee

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +MAYBE_UNUSED
> +static int find_basename_matches(struct diff_options *options,
> +				 int minimum_score,
> +				 int num_src)
> +{
> +	int i;
> +	struct strintmap sources;
> +	struct strintmap dests;
> +
> +	/* Create maps of basename -> fullname(s) for sources and dests */
> +	strintmap_init_with_options(&sources, -1, NULL, 0);
> +	strintmap_init_with_options(&dests, -1, NULL, 0);
> +	for (i = 0; i < num_src; ++i) {
> +		char *filename = rename_src[i].p->one->path;
> +		const char *base;
> +
> +		/* exact renames removed in remove_unneeded_paths_from_src() */
> +		assert(!rename_src[i].p->one->rename_used);
> +
> +		/* Record index within rename_src (i) if basename is unique */
> +		base = get_basename(filename);
> +		if (strintmap_contains(&sources, base))
> +			strintmap_set(&sources, base, -1);
> +		else
> +			strintmap_set(&sources, base, i);
> +	}
> +	for (i = 0; i < rename_dst_nr; ++i) {
> +		char *filename = rename_dst[i].p->two->path;
> +		const char *base;
> +
> +		if (rename_dst[i].is_rename)
> +			continue; /* involved in exact match already. */
> +
> +		/* Record index within rename_dst (i) if basename is unique */
> +		base = get_basename(filename);
> +		if (strintmap_contains(&dests, base))
> +			strintmap_set(&dests, base, -1);
> +		else
> +			strintmap_set(&dests, base, i);
> +	}
> +
> +	/* TODO: Make use of basenames source and destination basenames */

;-)

So at this point sources and dests can be used to quickly look up,
given a filename, if there is a single src among all sources, and a
single dst among all dests, that have the filename.

I wonder if the second loop over destinations can be "optimized"
further by using the sources map, though.  The reason you quash
entries with -1 when you see second instance of the same name is
because you intend to limit the heuristics only to a uniquely named
file among the removed files going to a uniquely named file among
the added files, right?  So even if a name is unique among dests,
if that name has duplicates on the source side, there is no point
recording its location.  i.e.

	/* record index within dst if it is unique in both dst and src */
	base = get_basename(filename);
	if (strintmap_contains(&sources, base) ||
	    strintmap_contains(&dests, base))
		strintmap_set(&dests, base, -1);
	else
		strintmap_set(&dests, base, i);

perhaps?

I guess it depends on what actually will be written in this "TODO"
space how effective such a change would be.  Presumably, you'd
iterate over &sources while skipping entries that record -1, to
learn (basename, i), and use the basename found there to consult
&dests to see if it yields a non-negative integer j, to notice that
rename_src[i] is a good candidate to match rename_dst[j].  If that
is the case, then such a change won't help as an optimization at
all, as we'd need to consult &dests map with the basename anyway,
so let's scratch the above idea.

In any case, after we walk over rename_src[] and rename_dst[] once,
the number of entries in &sources would be smaller than rename_src[]
so iterating over &sources, hunting for entries that record
non-negative index into rename_src[] would hopefully be cheaper than
the naive loop we've been working with.  I like the idea of using the
strintmap for this part of the code.
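
For illustration, the "record the index only if the basename is unique"
rule can be sketched without git's strintmap.  This is a minimal
standalone sketch, not the patch's code: sketch_basename() and
unique_basename_index() are hypothetical names, a linear scan stands in
for the map, and the is_rename skip from the destination loop is left out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for get_basename(): the part after the last '/'. */
static const char *sketch_basename(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? slash + 1 : path;
}

/*
 * Return the index of the only path whose basename equals `base`, or -1
 * if no path or more than one path has it.  This gives the same answer
 * as "store i on the first sighting, overwrite with -1 on any later one"
 * combined with a default value of -1 for absent keys.
 */
static int unique_basename_index(const char *const *paths, int nr,
				 const char *base)
{
	int i, found = -1;

	assert(nr >= 0);
	for (i = 0; i < nr; i++) {
		if (strcmp(sketch_basename(paths[i]), base))
			continue;
		if (found != -1)
			return -1; /* second sighting: not unique */
		found = i;
	}
	return found;
}
```

With sources "a/Makefile", "b/Makefile" and "subdir/file.txt", a lookup
of "file.txt" yields index 2, while "Makefile" yields -1 because it
appears twice.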

Thanks.


> +	strintmap_clear(&sources);
> +	strintmap_clear(&dests);
> +
> +	return 0;
> +}
> +
>  #define NUM_CANDIDATE_PER_DST 4
>  static void record_if_better(struct diff_score m[], struct diff_score *o)
>  {

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 3/5] diffcore-rename: complete find_basename_matches()
  2021-02-10 15:15     ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
@ 2021-02-13  1:48       ` Junio C Hamano
  2021-02-13 18:34         ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2021-02-13  1:48 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King,
	Elijah Newren, Derrick Stolee

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +	/* Now look for basename matchups and do similarity estimation */
> +	for (i = 0; i < num_src; ++i) {
> +		char *filename = rename_src[i].p->one->path;
> +		const char *base = NULL;
> +		intptr_t src_index;
> +		intptr_t dst_index;
> +
> +		/* Find out if this basename is unique among sources */
> +		base = get_basename(filename);
> +		src_index = strintmap_get(&sources, base);
> +		if (src_index == -1)
> +			continue; /* not a unique basename; skip it */
> +		assert(src_index == i);
> +
> +		if (strintmap_contains(&dests, base)) {
> +			struct diff_filespec *one, *two;
> +			int score;
> +
> +			/* Find out if this basename is unique among dests */
> +			dst_index = strintmap_get(&dests, base);
> +			if (dst_index == -1)
> +				continue; /* not a unique basename; skip it */

It would be a lot easier to read if "we must have the same singleton
in dests" were expressed in a single if condition, I suspect.  I.e.

		if (strintmap_contains(&dests, base) &&
		    0 <= (dst_index = (strintmap_get(&dests, base)))) {

It is a bit sad that we iterate over the rename_src[] array, even though
we now have a map that presumably has fewer entries than
the original array, though.

> +			/* Ignore this dest if already used in a rename */
> +			if (rename_dst[dst_index].is_rename)
> +				continue; /* already used previously */

Since we will only be matching between unique entries in src and
dst, this "this has been used, so we cannot use it" will not change
during this loop.  I wonder if the preparation done in the previous
step, i.e. [PATCH v3 2/5], can take advantage of this fact, i.e.  a
dst that has already been used (in the previous "exact" step) would
not even have to be in &dests map, so that the strintmap_contains()
check can reject it much earlier.

Stepping back a bit, it appears to me that [2/5] and [3/5] consider
a source file to have a unique basename among the sources even if there
are many such files with the same basename, as long as all the other
files with the same basename have been matched in the previous
"exact" phase.  It probably does the same thing for the destination
side.

Intended?

It feels incompatible with the spirit of what these two steps aim for
(i.e. only use this optimization on a pair of src/dst with UNIQUE
basenames).  For the purpose of "we only handle unique ones",
shouldn't the paths that have already been matched participate in
deciding if the files that survived the "exact" phase have a unique
basename among the original input?

Thanks.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-10 15:15     ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
@ 2021-02-13  1:49       ` Junio C Hamano
  0 siblings, 0 replies; 71+ messages in thread
From: Junio C Hamano @ 2021-02-13  1:49 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King,
	Elijah Newren, Derrick Stolee

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
> index 797343b38106..bf62537c29a0 100755
> --- a/t/t4001-diff-rename.sh
> +++ b/t/t4001-diff-rename.sh
> @@ -280,8 +280,8 @@ test_expect_success 'basename similarity vs best similarity' '
>  	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
>  	# but since same basenames are checked first...

Here lies the answer to my earlier question ;-)

>  	cat >expected <<-\EOF &&
> -	R088	subdir/file.txt	file.md
> -	A	file.txt
> +	A	file.md
> +	R078	subdir/file.txt	file.txt
>  	EOF
>  	test_cmp expected actual
>  '
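
For readers checking the numbers in this test, here is a minimal sketch
of the arithmetic.  It assumes (this thread does not confirm it) that
the score is the shared byte count scaled by a maximum score of 60000
over the size of the larger file, truncated to an integer, and that the
number after "R" is that score rescaled to a percentage, again
truncated.  similarity_percent() and SKETCH_MAX_SCORE are hypothetical
names.  Each of "line1" through "line9" plus its newline is 6 bytes and
"line10" plus its newline is 7, so subdir/file.txt is 61 bytes, of
which file.md shares 54 and file.txt shares 48.

```c
#include <assert.h>

/* Assumed scale for similarity scores; an assumption of this sketch. */
#define SKETCH_MAX_SCORE 60000UL

/*
 * Hypothetical helper: the percentage printed after "R", given the
 * number of bytes shared with the source and the size of the larger
 * of the two files.  Both divisions truncate.
 */
static unsigned long similarity_percent(unsigned long copied,
					unsigned long max_size)
{
	unsigned long score;

	assert(max_size > 0 && copied <= max_size);
	score = copied * SKETCH_MAX_SCORE / max_size;
	return score * 100 / SKETCH_MAX_SCORE;
}
```

Under those assumptions, 54 of 61 bytes comes out as 88 and 48 of 61
bytes as 78, which agrees with the R088 and R078 lines in the expected
output.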

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection
  2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
                         ` (5 preceding siblings ...)
  2021-02-11  8:15       ` [PATCH v4 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
@ 2021-02-13  1:53       ` Junio C Hamano
  2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
  7 siblings, 0 replies; 71+ messages in thread
From: Junio C Hamano @ 2021-02-13  1:53 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King,
	Elijah Newren, Derrick Stolee

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This series depends on ort-perf-batch-6[1].
>
> This series uses file basenames (portion of the path after final '/',
> including extension) in a basic fashion to guide rename detection.
>
> Changes since v3:
>
>  * update documentation as suggested by Junio
>  * NEW: add another patch at the end, to simplify patch series that will be
>    submitted later (please review!)

Sorry, by mistake I somehow read v4 and sent some comments on v3,
but as the above says, they are on the part that hadn't changed at
all, and should still be relevant.

Thanks.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity
  2021-02-13  1:15       ` Junio C Hamano
@ 2021-02-13  4:50         ` Elijah Newren
  2021-02-13 23:56           ` Junio C Hamano
  0 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren @ 2021-02-13  4:50 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Derrick Stolee

On Fri, Feb 12, 2021 at 5:15 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > Add a simple test where a removed file is similar to two different added
> > files; one of them has the same basename, and the other has a slightly
> > higher content similarity.  Without break detection, filename similarity
> > of 100% trumps content similarity for pairing up related files.  For
> > any filename similarity less than 100%, the opposite is true -- content
> > similarity is all that matters.  Add a testcase that documents this.
>
> I am not sure why it is the "opposite".  When contents are similar
> to the same degree of 100%, we tiebreak with the filename.  We never
> favor a pair between the same filename over a pair between different
> filenames with better content similarity.

This is not true.  If src/main.c is 99% similar to src/foo.c, and is
0% similar to the src/main.c in the new commit, we match the old
src/main.c to the new src/main.c despite it being far more similar
to src/foo.c.  Unless break detection is turned on, we do not allow
content similarity to trump (full) filename equality.

> And when contents are similar to the same degree of less than 100%,
> we do not favor a pair between the same filename over a pair between
> different filenames, as long as they are similar to the same degree.

This is also not true; we tiebreak with filenames for inexact renames
just like we do for exact renames (note that basename_same() is called
both from find_identical_files() and from the nested loop where
inexact rename detection is done).

> So, I do not think "opposite" is helping readers to understand what
> is going on.
>
> > +test_expect_success 'basename similarity vs best similarity' '
> > +     mkdir subdir &&
> > +     test_write_lines line1 line2 line3 line4 line5 \
> > +                      line6 line7 line8 line9 line10 >subdir/file.txt &&
> > +     git add subdir/file.txt &&
> > +     git commit -m "base txt" &&
> > +
> > +     git rm subdir/file.txt &&
> > +     test_write_lines line1 line2 line3 line4 line5 \
> > +                       line6 line7 line8 >file.txt &&
> > +     test_write_lines line1 line2 line3 line4 line5 \
> > +                       line6 line7 line8 line9 >file.md &&
> > +     git add file.txt file.md &&
> > +     git commit -a -m "rename" &&
> > +     git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
> > +     # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
> > +     # but since same basenames are checked first...
>
> I am not sure what the second line of this comment wants to imply
> with the ellipses here.  Care to finish the sentence?
>
> Or was the second line planned to be added when we start applying
> the "check only the same filename first and see if we find a
> better-than-reasonable match" heuristics but somehow survived
> "rebase -i" and ended up here?

Oops, indeed; that is precisely what happened.  Will fix.

> > +     cat >expected <<-\EOF &&
> > +     R088    subdir/file.txt file.md
> > +     A       file.txt
> > +     EOF
> > +     test_cmp expected actual
>
> Thanks.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 3/5] diffcore-rename: complete find_basename_matches()
  2021-02-13  1:48       ` Junio C Hamano
@ 2021-02-13 18:34         ` Elijah Newren
  2021-02-13 23:55           ` Junio C Hamano
  0 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren @ 2021-02-13 18:34 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Derrick Stolee

On Fri, Feb 12, 2021 at 5:48 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +     /* Now look for basename matchups and do similarity estimation */
> > +     for (i = 0; i < num_src; ++i) {
> > +             char *filename = rename_src[i].p->one->path;
> > +             const char *base = NULL;
> > +             intptr_t src_index;
> > +             intptr_t dst_index;
> > +
> > +             /* Find out if this basename is unique among sources */
> > +             base = get_basename(filename);
> > +             src_index = strintmap_get(&sources, base);
> > +             if (src_index == -1)
> > +                     continue; /* not a unique basename; skip it */
> > +             assert(src_index == i);
> > +
> > +             if (strintmap_contains(&dests, base)) {
> > +                     struct diff_filespec *one, *two;
> > +                     int score;
> > +
> > +                     /* Find out if this basename is unique among dests */
> > +                     dst_index = strintmap_get(&dests, base);
> > +                     if (dst_index == -1)
> > +                             continue; /* not a unique basename; skip it */
>
> It would be a lot easier to read if "we must have the same singleton
> in dests" were expressed in a single if condition, I suspect.  I.e.
>
>                 if (strintmap_contains(&dests, base) &&
>                     0 <= (dst_index = (strintmap_get(&dests, base)))) {

I can change that.  I can also simplify it further to

        if (0 <= (dst_index = (strintmap_get(&dests, base)))) {

since dests uses a default value of -1.  That will decrease the number
of strmap lookups here from 2 to 1.
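
To spell out why one lookup is enough: a minimal standalone sketch,
assuming a map whose get operation returns a caller-chosen default for
absent keys.  sketch_entry, sketch_get() and find_unique_dest() are
hypothetical names and the linear scan only stands in for the real
strintmap.

```c
#include <assert.h>
#include <string.h>

struct sketch_entry {
	const char *key;
	int value; /* index into the dest array, or -1 if not unique */
};

/* Return the value stored for `key`, or `default_value` if absent. */
static int sketch_get(const struct sketch_entry *map, int nr,
		      const char *key, int default_value)
{
	int i;

	assert(nr >= 0);
	for (i = 0; i < nr; i++)
		if (!strcmp(map[i].key, key))
			return map[i].value;
	return default_value;
}

/*
 * With a default of -1, "absent" and "present but not unique" both
 * come back negative, so a single lookup answers both questions.
 */
static int find_unique_dest(const struct sketch_entry *dests, int nr,
			    const char *base, int *dst_index)
{
	return 0 <= (*dst_index = sketch_get(dests, nr, base, -1));
}
```

A basename mapped to 3 is accepted with dst_index set to 3; one mapped
to -1, or one that is not in the map at all, is rejected the same way.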

> It is a bit sad that we iterate over the rename_src[] array, even though
> we now have a map that presumably has fewer entries than
> the original array, though.

Oh, interesting; I forgot all about that.  I just looked up my
original implementation from February of last year and indeed I had
done exactly that
(https://github.com/newren/git/commit/43eaec6007c92b6af05e0ef0fcc047c1d1ba1de8).
However, when I added a later optimization that pairs up non-unique
basenames, I had to switch to looping over rename_src.

For various reasons (mostly starting with the fact that I had lots of
experimental ideas that were tried and thrown out but with pieces kept
around for ideas), I wasn't even close to having a clean history in my
original implementation of merge-ort and the diffcore-rename
optimizations.  And it was far, far easier to achieve the goal of a
clean history by picking out chunks of code from the end-state and
creating entirely new commits than attempting to use my existing
history.  But, of course, that method made me lose this intermediate
state.

>
> > +                     /* Ignore this dest if already used in a rename */
> > +                     if (rename_dst[dst_index].is_rename)
> > +                             continue; /* already used previously */
>
> Since we will only be matching between unique entries in src and
> dst, this "this has been used, so we cannot use it" will not change
> during this loop.  I wonder if the preparation done in the previous
> step, i.e. [PATCH v3 2/5], can take advantage of this fact, i.e.  a
> dst that has already been used (in the previous "exact" step) would
> not even have to be in &dests map, so that the strintmap_contains()
> check can reject it much earlier.

Good catch again.  The previous step (v4 2/5) actually did already
check this, so this if-condition will always be false at this point.
Looking at the link above, this if-condition check wasn't there in the
original, but again was added due to altered state introduced by a
later optimization.  So, I should pull this check out of this patch
and add it back in to the later patch.

> Stepping back a bit, it appears to me that [2/5] and [3/5] consider
> a source file to have a unique basename among the sources even if there
> are many such files with the same basename, as long as all the other
> files with the same basename have been matched in the previous
> "exact" phase.  It probably does the same thing for the destination
> side.
>
> Intended?
>
> It feels incompatible with the spirit of what these two steps aim for
> (i.e. only use this optimization on a pair of src/dst with UNIQUE
> basenames).  For the purpose of "we only handle unique ones",
> shouldn't the paths that have already been matched participate in
> deciding if the files that survived the "exact" phase have a unique
> basename among the original input?

Yeah, I should have been more careful with my wording.  Stated a
different way, what confidence can we associate with an exact rename?
Obviously, the confidence is high as we mark them as renames.  But if
the confidence is less than 100%, and enough less than 100% that it
casts a doubt on "related" inexact renames, then yes the basenames of
the exact renames should also be computed so that we can determine
what basenames are truly unique.  By the exact same argument, you
could take this a step further and say that we should calculate the
basenames of *all* files in the tree, not just add/delete pairs, and
only match up the ones via basename that are *truly* unique.  After
all, break detection exists, so perhaps we don't have full confidence
that files with an unchanged fullname are actually related.

From my view, though, both are too cautious and throw out valuable
heuristics for common cases.  Starting with break detection, it is off
for a reason: we think unchanged filename is a strong enough heuristic
to just match up those files and consider the confidence of the match
in effect 100%.  Similarly, we put a lot of confidence in exact rename
detection.  If there are multiple adds/deletes with the same basename,
and all but one on each side are paired up by exact rename detection,
aren't the remaining two files a (very) likely rename pair?  I think
so, and believe they're worth including in the basename-based rename
detection step.  We do require basename-based matches to meet a much
higher similarity scoring threshold now, which I feel already
adequately adjusts for not doing full content similarity against all
other files.

Also, in the next series, I find an additional way to match up files
by basename when basenames are not unique, and which doesn't involve
pairwise comparing all the files with the same basename.  I only pick
at most one other file to compare to (and the selection is not
random).  So, my overall strategy for these two series is "find which
basenames are likely matches" even if I didn't word it very well.

I do agree, though, that I should add some more careful wording about
this in the series.  I'll include it in a re-roll.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 3/5] diffcore-rename: complete find_basename_matches()
  2021-02-13 18:34         ` Elijah Newren
@ 2021-02-13 23:55           ` Junio C Hamano
  2021-02-14  3:08             ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2021-02-13 23:55 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Derrick Stolee

Elijah Newren <newren@gmail.com> writes:

> I can change that.  I can also simplify it further to
>
>         if (0 <= (dst_index = (strintmap_get(&dests, base)))) {
>
> since dests uses a default value of -1.  That will decrease the number
> of strmap lookups here from 2 to 1.

Which would be a real win, unlike what I said in the message you are
responding to.

>> It feels incompatible with the spirit of what these two steps aim for
>> (i.e. only use this optimization on a pair of src/dst with UNIQUE
>> basenames).  For the purpose of "we only handle unique ones",
>> shouldn't the paths that have already been matched participate in
>> deciding if the files that survived the "exact" phase have a unique
>> basename among the original input?
>
> Yeah, I should have been more careful with my wording.  Stated a
> different way, what confidence can we associate with an exact rename?

Suppose you start with a/Makefile, b/Makefile and c/Makefile and
then they all disappeared while a1/Makefile, b1/Makefile and
c1/Makefile now are in the tree.  The contents a/Makefile used to
have appears without any difference in a1/Makefile, the same for b
and b1, but c/Makefile and c1/Makefile are different.  The c vs c1
pair may be worth investigating, so it goes through the "same basename"
phase.

Now, in a slightly different situation, a vs a1 are still identical,
but b vs b1 have only one blank line removal but without any other
change.  It looks odd that such a change has to pessimize c vs c1
optimization opportunity, but an interesting part of the story is
that we can only say "such a change", not "such a minuscule change",
because we have just finished the "exact" phase, and we do not know
how big a difference b vs b1 pair actually had.

That makes me feel that this whole "we must treat the unique one that
remains specially" idea is incoherent.  If the optimization were
"because we have only a small number of removed and added Makefiles
spread across the trees, doing full-matrix matching among them first,
without anything else and with a higher bar, may be worthwhile", then
I would understand and support the design to omit those that have
already been matched in the "exact" phase.

But IIRC, limiting this "same basename" phase to unique add/del pairs
was sold as a way to make it less likely for the heuristics to make
mistakes, yet the definition of "unique", as shown above, is not all
that solid.  I find that rather unsatisfactory.

In other words, it is not "what confidence do we have in exact
phase?"  "exact" matching may have found a perfect matching pair.  But
the found pair should be happy just between themselves, and should
not have undue effect on how _other_ pairs are compared.  Stopping
the "exact" pair from participating in the "uniqueness" definition
gives the "exact" phase too much weight in how other filepairs
are found.
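
To make the two notions of "unique" in the Makefile example concrete,
here is a minimal standalone sketch.  sketch_path, sketch_basename()
and count_basename() are hypothetical names; the only point is how the
count changes depending on whether paths already paired in the "exact"
phase are still counted.

```c
#include <assert.h>
#include <string.h>

struct sketch_path {
	const char *path;
	int matched_exactly; /* nonzero if paired up in the "exact" phase */
};

/* Hypothetical stand-in for get_basename(): the part after the last '/'. */
static const char *sketch_basename(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? slash + 1 : path;
}

/*
 * Count paths whose basename equals `base`.  With count_matched == 0,
 * paths already paired in the "exact" phase are ignored (the behaviour
 * under discussion); with count_matched != 0 they still count.
 */
static int count_basename(const struct sketch_path *paths, int nr,
			  const char *base, int count_matched)
{
	int i, count = 0;

	assert(nr >= 0);
	for (i = 0; i < nr; i++) {
		if (!count_matched && paths[i].matched_exactly)
			continue;
		if (!strcmp(sketch_basename(paths[i].path), base))
			count++;
	}
	return count;
}
```

In the first scenario (a and b matched exactly) the count for
"Makefile" is 1 when matched paths are ignored, so c/Makefile looks
unique, but 3 when they are counted.  In the second scenario (only a
matched exactly) it is 2 versus 3, so c/Makefile is not unique under
either definition.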

> By the exact same argument, you
> could take this a step further and say that we should calculate the
> basenames of *all* files in the tree, not just add/delete pairs, and
> only match up the ones via basename that are *truly* unique.  After
> all, break detection exists, so perhaps we don't have full confidence
> that files with an unchanged fullname are actually related.

Sorry, but you are not making sense.  These optimizations are done
only when we are not using copies and breaks, no?  What _other_
changes that kept the paths the same, or modified in place, have any
effect on matching added and deleted pairs?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity
  2021-02-13  4:50         ` Elijah Newren
@ 2021-02-13 23:56           ` Junio C Hamano
  2021-02-14  1:24             ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2021-02-13 23:56 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Derrick Stolee

Elijah Newren <newren@gmail.com> writes:

> This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> 0% similar to the src/main.c in the new commit, we match the old
> src/main.c to the new src/main.c despite it being far more similar
> to src/foo.c.  Unless break detection is turned on, we do not allow
> content similarity to trump (full) filename equality.

Absolutely.  And we are talking about a new optimization that kicks
in only when there is no break or no copy detection going on, no?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity
  2021-02-13 23:56           ` Junio C Hamano
@ 2021-02-14  1:24             ` Elijah Newren
  2021-02-14  1:32               ` Junio C Hamano
  0 siblings, 1 reply; 71+ messages in thread
From: Elijah Newren @ 2021-02-14  1:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Derrick Stolee

On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> > 0% similar to the src/main.c in the new commit, we match the old
> > src/main.c to the new src/main.c despite it being far more similar
> > to src/foo.c.  Unless break detection is turned on, we do not allow
> > content similarity to trump (full) filename equality.
>
> Absolutely.  And we are talking about a new optimization that kicks
> in only when there is no break or no copy detection going on, no?

Yes, precisely, we are only considering cases without break
detection...and thus we are considering cases where for the last 15
years or more, sufficiently large filename similarity (an exact
fullname match) trumps any level of content similarity.  I think it is
useful to note that while my optimization is adding more
considerations that can overrule maximal content similarity, it is not
the first such code choice to do that.

But let me back up a bit...

When I submitted the series, you and Stolee went into a long
discussion about an optimization that I didn't submit, one that feels
looser on "matching" than anything I submitted, and which I think
might counter-intuitively reduce performance rather than aid it.  (The
performance side only comes into view in combination with later
series, but it was why I harped so much since then on only comparing
against at most one other file in the steps before full inexact rename
detection.)  I was quite surprised by the diversion, but it made it
clear to me that my descriptions and commit messages were far too
vague and could be read to imply a completely different algorithm than
I intended.  So, I tried to be far more careful in subsequent
iterations by adding wider context and contrasts.

Further, after I wrote various things to try to clarify the
misunderstandings, I noticed that Stolee picked out one thing and
stated that "This idea of optimizing first for 100% filename
similarity is a good perspective on Git's rename detection algorithm."
(see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/)
 So, that particular point seemed to help him understand more, and
thus might be useful extra context for others reading along now or in
the future.

Given all the above, I was trying to address earlier misunderstandings
and provide more context.  Perhaps I swung the pendulum too far and
talked too much about other cases, or perhaps I just worded things
poorly again.  All I was attempting to do in the commit message was
point out the multiple basic rules with filename and content
similarity, to lay the groundwork for new rules that do alternative
weightings.

Anyway, I've added a few more tweaks to try to improve the wording for
the next round I'll submit today.  Given my track record so far, it
would not be surprising if it still needed more tweaks.


* Re: [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity
  2021-02-14  1:24             ` Elijah Newren
@ 2021-02-14  1:32               ` Junio C Hamano
  2021-02-14  3:14                 ` Elijah Newren
  0 siblings, 1 reply; 71+ messages in thread
From: Junio C Hamano @ 2021-02-14  1:32 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Derrick Stolee

I do not consider "the same file changed in place" the same as "we
seem to have lost a file in the old tree, ah, we found one that has
the same basename in a different directory" at all, so your argument
still does not make any sense to me, sorry.

On Sat, Feb 13, 2021 at 17:25 Elijah Newren <newren@gmail.com> wrote:
>
> On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Elijah Newren <newren@gmail.com> writes:
> >
> > > This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> > > 0% similar to the src/main.c in the new commit, we match the old
> > > src/main.c to the new src/main.c despite being far more similar to
> > > src/foo.c.  Unless break detection is turned on, we do not allow
> > > content similarity to trump (full) filename equality.
> >
> > Absolutely.  And we are talking about a new optimization that kicks
> > in only when there is no break or no copy detection going on, no?
>
> Yes, precisely, we are only considering cases without break
> detection...and thus we are considering cases where for the last 15
> years or more, sufficiently large filename similarity (an exact
> fullname match) trumps any level of content similarity.  I think it is
> useful to note that while my optimization is adding more
> considerations that can overrule maximal content similarity, it is not
> the first such code choice to do that.
>
> But let me back up a bit...
>
> When I submitted the series, you and Stolee went into a long
> discussion about an optimization that I didn't submit, one that feels
> looser on "matching" than anything I submitted, and which I think
> might counter-intuitively reduce performance rather than aid it.  (The
> performance side only comes into view in combination with later
> series, but it was why I harped so much since then on only comparing
> against at most one other file in the steps before full inexact rename
> detection.)  I was quite surprised by the diversion, but it made it
> clear to me that my descriptions and commit messages were far too
> vague and could be read to imply a completely different algorithm than
> I intended.  So, I tried to be far more careful in subsequent
> iterations by adding wider context and contrasts.
>
> Further, after I wrote various things to try to clarify the
> misunderstandings, I noticed that Stolee picked out one thing and
> stated that "This idea of optimizing first for 100% filename
> similarity is a good perspective on Git's rename detection algorithm."
> (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/)
>  So, that particular point seemed to help him understand more, and
> thus might be useful extra context for others reading along now or in
> the future.
>
> Given all the above, I was trying to address earlier misunderstandings
> and provide more context.  Perhaps I swung the pendulum too far and
> talked too much about other cases, or perhaps I just worded things
> poorly again.  All I was attempting to do in the commit message was
> point out the multiple basic rules with filename and content
> similarity, to lay the groundwork for new rules that do alternative
> weightings.
>
> Anyway, I've added a few more tweaks to try to improve the wording for
> the next round I'll submit today.  Given my track record so far, it
> would not be surprising if it still needed more tweaks.


* Re: [PATCH v3 3/5] diffcore-rename: complete find_basename_matches()
  2021-02-13 23:55           ` Junio C Hamano
@ 2021-02-14  3:08             ` Elijah Newren
  0 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren @ 2021-02-14  3:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Derrick Stolee

On Sat, Feb 13, 2021 at 3:55 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > I can change that.  I can also simplify it further to
> >
> >         if (0 <= (dst_index = (strintmap_get(&dests, base)))) {
> >
> > since dests uses a default value of -1.  That will decrease the number
> > of strmap lookups here from 2 to 1.
>
> Which would be a real win, unlike what I said in the message you are
> responding to.

Sadly, while it's a real win, it's very temporary.  The next series
I'll submit needs to separate the two checks back out for other
reasons.
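To make the "2 lookups to 1" point concrete, here is a minimal sketch using a toy map. It is not git's strintmap; the struct, the function names, and the lookup counter are all hypothetical and exist only for illustration. The idea it shows is that when the map's default value is -1, a missing basename and a non-unique basename (stored as -1) look the same to the caller, so a separate "contains" check is redundant.

```c
#include <assert.h>
#include <string.h>

/*
 * Toy stand-in for a string -> int map with a default value.
 * This is NOT git's strintmap; names and layout are hypothetical.
 */
struct toy_intmap {
	const char *keys[8];
	int values[8];
	int nr;
	int default_value;	/* returned by toy_get() for missing keys */
	int lookups;		/* counts how many searches were performed */
};

/* Linear search; every call counts as one lookup. */
static int toy_find(struct toy_intmap *map, const char *key)
{
	int i;

	assert(map && key);
	map->lookups++;
	for (i = 0; i < map->nr; i++)
		if (!strcmp(map->keys[i], key))
			return i;
	return -1;
}

static int toy_contains(struct toy_intmap *map, const char *key)
{
	return toy_find(map, key) >= 0;
}

static int toy_get(struct toy_intmap *map, const char *key)
{
	int pos = toy_find(map, key);

	return pos >= 0 ? map->values[pos] : map->default_value;
}

/* Two searches: one for the "contains" check, one for the "get". */
static int dest_index_two_lookups(struct toy_intmap *dests, const char *base)
{
	if (toy_contains(dests, base))
		return toy_get(dests, base);
	return -1;
}

/* One search: a missing key and a non-unique key both yield -1. */
static int dest_index_one_lookup(struct toy_intmap *dests, const char *base)
{
	return toy_get(dests, base);
}
```

Both helpers return the same index for every key; only the number of searches differs.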

> >> It feels incompatible with the spirit of what these two steps aim for
> >> (i.e. only use this optimization on a pair of src/dst with UNIQUE
> >> basenames).  For the purpose of "we only handle unique ones", the
> >> paths that already have matched should participate in deciding if
> >> the files that survived "exact" phase have unique basename among
> >> the original input?
> >
> > Yeah, I should have been more careful with my wording.  Stated a
> > different way, what confidence can we associate with an exact rename?
>
> Suppose you start with a/Makefile, b/Makefile and c/Makefile and
> then they all disappeared while a1/Makefile, b1/Makefile and
> c1/Makefile now are in the tree.  The contents a/Makefile used to
> have appears without any difference in a1/Makefile, the same for b
> and b1, but c/Makefile and c1/Makefile are different.  The c vs c1
> pair may be worth investigating, so it goes through the "same basename"
> phase.
>
> Now, in a slightly different situation, a vs a1 are still identical,
> but b vs b1 have only one blank line removal but without any other
> change.  It looks odd that such a change has to pessimize c vs c1
> optimization opportunity, but an interesting part of the story is
> that we can only say "such a change", not "such a minuscule change",
> because we have just finished the "exact" phase, and we do not know
> how big a difference b vs b1 pair actually had.
>
> That makes me feel that this whole "we must treat unique one that
> remains specially" is being incoherent.

It's really not that special; the pessimization is not in my mind due
to correctness reasons, but performance reasons.

I need to only compare any given file to at most one other file in the
preliminary steps.  When there are multiple remaining possibilities to
compare, I need a method for selecting which ones to compare.  I have
such a method, but it's a lot more code.  It was easier to submit a
series that was only 3 patches long and only considered the pairs that
just happened to uniquely match up so we could talk about the general
idea of basename matching.  The next series finds ways to match up
more files with similar basenames.
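The "only pairs that happen to match up uniquely" rule can be sketched roughly as follows. This is an illustration only, not the code from the series: the function names are hypothetical, and it uses plain linear scans for brevity where the patches build string-to-index maps. The inputs are assumed to be the sources and destinations still unmatched after exact rename detection.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: part of a git-style path after the final '/'. */
static const char *path_basename(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? slash + 1 : path;
}

/*
 * Return the index of the only path whose basename equals base,
 * or -1 if there is no such path or more than one.
 */
static int unique_index(const char **paths, int nr, const char *base)
{
	int i, found = -1;

	for (i = 0; i < nr; i++) {
		if (strcmp(path_basename(paths[i]), base))
			continue;
		if (found != -1)
			return -1;	/* second hit: not unique */
		found = i;
	}
	return found;
}

/*
 * For each remaining source i, set pair[i] to the single remaining
 * destination it should be compared against, or -1 for "skip".
 * Returns the number of candidate pairs; each file appears in at
 * most one of them.
 */
static int pair_unique_basenames(const char **src, int nsrc,
				 const char **dst, int ndst, int *pair)
{
	int i, candidates = 0;

	assert(src && dst && pair);
	for (i = 0; i < nsrc; i++) {
		const char *base = path_basename(src[i]);

		pair[i] = -1;
		if (unique_index(src, nsrc, base) != i)
			continue;	/* basename not unique among sources */
		pair[i] = unique_index(dst, ndst, base);
		if (pair[i] != -1)
			candidates++;
	}
	return candidates;
}
```

With the Makefile example from this thread: if only c/Makefile and c1/Makefile remain, they form one candidate pair. If b/Makefile and b1/Makefile also remain, no Makefile is unique on either side, so the preliminary step compares none of them and leaves them to the full inexact pass.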

>  If "because we have only
> small number of removed and added Makefiles spread across the trees,
> first full-matrix matching among them without anything else with
> higher bar may be worth an optimization" were the optimization, then

This optimization was indeed considered...and fully implemented.
Let's give it a name, so I can refer to it more below.  How about the
"preliminary-matrix-of-basenames" optimization?

> I would understand and support the design to omit those that have
> already been matched in the "exact" phase.
>
> But IIRC, limiting this "same basename" phase to unique add/del pair
> was sold as a way to make it less likely for the heuristics to make
> mistakes, yet the definition of "unique", as shown above, is not all
> that solid.  That I find rather unsatisfactory.

No, I never sold it as a way to make it less likely for the heuristics
to make mistakes.  If I implied that anywhere, it was by accident.

I certainly emphasized only doing one comparison per file, but not for
that reason.  I had three reasons for mentioning
one-comparison-per-file: (1) I was trying to contrast with Stolee's
original assumption about what this series was doing, to try to avoid
a repeat of the misunderstandings about the current optimization being
suggested.  (2) The preliminary-matrix-of-basenames optimization has
worst-case performance nearly twice as bad as without such an
optimization.  (For example, with preliminary-matrix-of-basenames, if
nearly all unmatched files have the same basename, we end up basically
doing inexact rename detection on all files twice).  I believe
Stolee's original assumption of what was being proposed also has such
twice-as-slow-as-normal worst-case performance behavior.  Even though
the worst case performance would be fairly rare, making an algorithm
twice as slow by introducing an optimization felt like something I
should avoid.  (3) Despite the theoretical problems with worst-case
performance, I implemented the preliminary-matrix-of-basenames
optimization anyway.  I threw the code away, because even in cases
with a wide variety of basenames, it slowed things down when other
optimizations were also involved.  The one clear way to work well with
other optimizations I was working with was to only allow the
preliminary step to compare any given file to at most one other file.

> In other words, it is not "what confidence do we have in exact
> phase?"  "exact" matching may have found perfect matching pair.  But
> the found pair should be happy just between themselves, and should
> not have undue effect on how _other_ pairs are compared.  Stopping
> the "exact" pair from participating in the "uniqueness" definition
> gives the "exact" phase too much weight in how other filepairs
> are found.

I guess I look at this quite a bit differently.  Here's my view:

  * If we have a reasonable and cheap way to determine that two
particular files are likely potential rename pairs,
  * AND checking their similarity confirms they are sufficiently
similar (perhaps with a higher bar)
  * then we've found a way to avoid quadratic comparisons.

We will give up "optimal" matches, but as long as what we provide are
"reasonable" matches I think that should suffice.  I personally
believe "reasonable" at O(N) cost trumps "optimal" at O(N^2).
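As a rough way to see the cost difference, the sketch below counts pairwise similarity comparisons under a simplified, hypothetical model: the remaining files are grouped by basename, and each strategy's preliminary cost is computed from the group sizes. The struct and function names are made up for illustration, and real costs also depend on file sizes and on how many pairs clear the similarity bar.

```c
#include <assert.h>

/*
 * Hypothetical cost model: count pairwise similarity comparisons.
 * src_count[g] / dst_count[g] are the numbers of remaining deleted /
 * added files sharing basename g after exact rename detection.
 */
struct comparison_counts {
	long unique_basename;	/* preliminary step: at most one per file */
	long basename_matrix;	/* preliminary step: full matrix per basename */
	long full_matrix;	/* ordinary inexact detection over all files */
};

static struct comparison_counts count_comparisons(const long *src_count,
						  const long *dst_count,
						  int nr_groups)
{
	struct comparison_counts c = { 0, 0, 0 };
	long total_src = 0, total_dst = 0;
	int g;

	assert(src_count && dst_count && nr_groups >= 0);
	for (g = 0; g < nr_groups; g++) {
		/* Only a 1-to-1 basename group is compared up front. */
		if (src_count[g] == 1 && dst_count[g] == 1)
			c.unique_basename++;
		/* The matrix variant compares every pair in the group. */
		c.basename_matrix += src_count[g] * dst_count[g];
		total_src += src_count[g];
		total_dst += dst_count[g];
	}
	c.full_matrix = total_src * total_dst;
	return c;
}
```

In this model, when every remaining file shares one basename, the per-basename matrix costs as much as the full pass, so running it first and then falling back to the full pass roughly doubles the work if few pairs match. The unique-basename step never does more than one comparison per file.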

There are several different ways to find "likely potential rename pairs":
  * The preliminary-matrix-of-basenames is one that I tried (but
interacts badly performance-wise with other optimizations).
  * https://github.com/gitgitgadget/git/issues/519 has multiple ideas.
  * Stolee's misunderstanding of my series is another
  * unique basenames among remaining pairs after exact renames is a
really simple one that lets me introduce "reasonable" matches so we
can discuss
  * my next series adds another

That leaves us with a big question.  Are we happy with higher
sufficient similarity bar being enough of a constraint for
"reasonable" matches?  If so, each of the above ideas might be able to
help us.  If not, we may be able to rule some of them out a priori and
avoid working on them (well, working on them any more; I've already
implemented three, and we have an intern who picked a project to look
at one).

> > By the exact same argument, you
> > could take this a step further and say that we should calculate the
> > basenames of *all* files in the tree, not just add/delete pairs, and
> > only match up the ones via basename that are *truly* unique.  After
> > all, break detection exists, so perhaps we don't have full confidence
> > that files with an unchanged fullname are actually related.
>
> Sorry, but you are not making sense.  These optimizations are done
> only when we are not using copies and breaks, no?  What _other_
> changes that kept the paths the same, or modified in place, have any
> effect on matching added and deleted pairs?

If the optimization is presented to users as "only compare basenames
in a preliminary step when they are unique", which is what I was
understanding you to say, and if the user has a/Makefile and
d/Makefile in the source tree, and a1/Makefile and d/Makefile in the
destination tree, then a/Makefile is not the unique "Makefile" in the
source tree.

I think you're trying to make an argument about uniqueness and why it
matters for correctness, but I'm not following it.

The only reason uniqueness is important to me is because I was using
it with future optimizations in mind, and knew it to be related to an
important performance criterion.  I tried to avoid mentioning
uniqueness at all in the user-facing documentation, though I did try
to explain why some files with the same basename might not be matched
up by that step (and my next series modifies those docs a bit.)


* Re: [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity
  2021-02-14  1:32               ` Junio C Hamano
@ 2021-02-14  3:14                 ` Elijah Newren
  0 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren @ 2021-02-14  3:14 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Derrick Stolee

On Sat, Feb 13, 2021 at 5:32 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> I do not consider "the same file changed in place" the same as "we
> seem to have lost a file in the old tree, ah, we found one that has
> the same basename in a different directory" at all, so your argument
> still does not make any sense to me, sorry.

I'm not set on the commit message wording; you asked why I had used
the terms I did and I tried to explain.  I also explained how the
wording seemed to have helped Stolee understand.

If you'd like to suggest an alternative commit message, I'm happy to take it.

> On Sat, Feb 13, 2021 at 17:25 Elijah Newren <newren@gmail.com> wrote:
> >
> > On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote:
> > >
> > > Elijah Newren <newren@gmail.com> writes:
> > >
> > > > This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> > > > 0% similar to the src/main.c in the new commit, we match the old
> > > > src/main.c to the new src/main.c despite being far more similar to
> > > > src/foo.c.  Unless break detection is turned on, we do not allow
> > > > content similarity to trump (full) filename equality.
> > >
> > > Absolutely.  And we are talking about a new optimization that kicks
> > > in only when there is no break or no copy detection going on, no?
> >
> > Yes, precisely, we are only considering cases without break
> > detection...and thus we are considering cases where for the last 15
> > years or more, sufficiently large filename similarity (an exact
> > fullname match) trumps any level of content similarity.  I think it is
> > useful to note that while my optimization is adding more
> > considerations that can overrule maximal content similarity, it is not
> > the first such code choice to do that.
> >
> > But let me back up a bit...
> >
> > When I submitted the series, you and Stolee went into a long
> > discussion about an optimization that I didn't submit, one that feels
> > looser on "matching" than anything I submitted, and which I think
> > might counter-intuitively reduce performance rather than aid it.  (The
> > performance side only comes into view in combination with later
> > series, but it was why I harped so much since then on only comparing
> > against at most one other file in the steps before full inexact rename
> > detection.)  I was quite surprised by the diversion, but it made it
> > clear to me that my descriptions and commit messages were far too
> > vague and could be read to imply a completely different algorithm than
> > I intended.  So, I tried to be far more careful in subsequent
> > iterations by adding wider context and contrasts.
> >
> > Further, after I wrote various things to try to clarify the
> > misunderstandings, I noticed that Stolee picked out one thing and
> > stated that "This idea of optimizing first for 100% filename
> > similarity is a good perspective on Git's rename detection algorithm."
> > (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/)
> >  So, that particular point seemed to help him understand more, and
> > thus might be useful extra context for others reading along now or in
> > the future.
> >
> > Given all the above, I was trying to address earlier misunderstandings
> > and provide more context.  Perhaps I swung the pendulum too far and
> > talked too much about other cases, or perhaps I just worded things
> > poorly again.  All I was attempting to do in the commit message was
> > point out the multiple basic rules with filename and content
> > similarity, to lay the groundwork for new rules that do alternative
> > weightings.
> >
> > Anyway, I've added a few more tweaks to try to improve the wording for
> > the next round I'll submit today.  Given my track record so far, it
> > would not be surprising if it still needed more tweaks.


* [PATCH v5 0/6] Optimization batch 7: use file basenames to guide rename detection
  2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
                         ` (6 preceding siblings ...)
  2021-02-13  1:53       ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
@ 2021-02-14  7:51       ` Elijah Newren via GitGitGadget
  2021-02-14  7:51         ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
                           ` (5 more replies)
  7 siblings, 6 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:51 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren

This series depends on ort-perf-batch-6. It appears Junio has appended an
earlier round of this series on that one and called the combined series
en/diffcore-rename. I'm still resubmitting the two separately to preserve
the threaded discussion in the archives and because gitgitgadget can provide
proper range-diffs that way.

This series uses file basenames (portion of the path after final '/',
including extension) in a basic fashion to guide rename detection.
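For readers who want the definition of "basename" spelled out, a minimal sketch follows. The helper name is hypothetical, and it assumes paths as git stores them: '/' as the only separator and no trailing slash.

```c
#include <assert.h>
#include <string.h>

/*
 * Illustrative helper (name is hypothetical): return the part of a
 * git-style path after the final '/', or the whole path if it
 * contains no '/'.  The extension is kept as part of the basename.
 */
static const char *basename_of(const char *path)
{
	const char *slash;

	assert(path != NULL);
	slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}
```

So src/moduleA/foo.txt and source/module/A/foo.txt share the basename foo.txt, while a top-level Makefile is its own basename.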

Changes since v4:

 * add wording to make it clearer that we are considering remaining
   basenames after exact rename detection
 * add three minor optimizations to patch 3. (All three will have to be
   undone by the next series, but this series is probably clearer with
   them.)
 * a typo fix or two
 * v2 of ort-perf-batch-6 added some changes around consistency of
   rename_src_nr; make similar changes in using this variable in
   find_basename_matches() for consistency
 * fix the testcase so the expected comments about the change in behavior
   only show up after we change the behavior
 * attempt a rewrite of the commit message for the new testcase, who knows
   if I'll get it right this time.

Elijah Newren (6):
  t4001: add a test comparing basename similarity and content similarity
  diffcore-rename: compute basenames of source and dest candidates
  diffcore-rename: complete find_basename_matches()
  diffcore-rename: guide inexact rename detection based on basenames
  gitdiffcore doc: mention new preliminary step for rename detection
  merge-ort: call diffcore_rename() directly

 Documentation/gitdiffcore.txt |  20 ++++
 diffcore-rename.c             | 190 +++++++++++++++++++++++++++++++++-
 merge-ort.c                   |  66 ++++++++++--
 t/t4001-diff-rename.sh        |  24 +++++
 4 files changed, 289 insertions(+), 11 deletions(-)


base-commit: dd6595b45640ee9894293e8b729ef9a254564a49
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-843%2Fnewren%2Fort-perf-batch-7-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/843

Range-diff vs v4:

 1:  3e6af929d135 ! 1:  6848422e47e8 t4001: add a test comparing basename similarity and content similarity
     @@ Commit message
      
          Add a simple test where a removed file is similar to two different added
          files; one of them has the same basename, and the other has a slightly
     -    higher content similarity.  Without break detection, filename similarity
     -    of 100% trumps content similarity for pairing up related files.  For
     -    any filename similarity less than 100%, the opposite is true -- content
     -    similarity is all that matters.  Add a testcase that documents this.
     +    higher content similarity.  In the current test, content similarity is
     +    weighted higher than filename similarity.
      
     -    Subsequent commits will add a new rule that includes an inbetween state,
     -    where a mixture of filename similarity and content similarity are
     -    weighed, and which will change the outcome of this testcase.
     +    Subsequent commits will add a new rule that weighs a mixture of filename
     +    similarity and content similarity in a manner that will change the
     +    outcome of this testcase.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
     @@ t/t4001-diff-rename.sh: test_expect_success 'diff-tree -l0 defaults to a big ren
      +	git add file.txt file.md &&
      +	git commit -a -m "rename" &&
      +	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
     -+	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
     -+	# but since same basenames are checked first...
     ++	# subdir/file.txt is 88% similar to file.md and 78% similar to file.txt
      +	cat >expected <<-\EOF &&
      +	R088	subdir/file.txt	file.md
      +	A	file.txt
 2:  4fff9b1ff57b ! 2:  73baae2535d0 diffcore-rename: compute basenames of all source and dest candidates
     @@ Metadata
      Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
     -    diffcore-rename: compute basenames of all source and dest candidates
     +    diffcore-rename: compute basenames of source and dest candidates
      
     -    We want to make use of unique basenames to help inform rename detection,
     -    so that more likely pairings can be checked first.  (src/moduleA/foo.txt
     -    and source/module/A/foo.txt are likely related if there are no other
     -    'foo.txt' files among the deleted and added files.)  Add a new function,
     -    not yet used, which creates a map of the unique basenames within
     -    rename_src and another within rename_dst, together with the indices
     -    within rename_src/rename_dst where those basenames show up.  Non-unique
     -    basenames still show up in the map, but have an invalid index (-1).
     +    We want to make use of unique basenames among remaining source and
     +    destination files to help inform rename detection, so that more likely
     +    pairings can be checked first.  (src/moduleA/foo.txt and
     +    source/module/A/foo.txt are likely related if there are no other
     +    'foo.txt' files among the remaining deleted and added files.)  Add a new
     +    function, not yet used, which creates a map of the unique basenames
     +    within rename_src and another within rename_dst, together with the
     +    indices within rename_src/rename_dst where those basenames show up.
     +    Non-unique basenames still show up in the map, but have an invalid index
     +    (-1).
      
          This function was inspired by the fact that in real world repositories,
          files are often moved across directories without changing names.  Here
     @@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
      +static const char *get_basename(const char *filename)
      +{
      +	/*
     -+	 * gitbasename() has to worry about special drivers, multiple
     ++	 * gitbasename() has to worry about special drives, multiple
      +	 * directory separator characters, trailing slashes, NULL or
      +	 * empty strings, etc.  We only work on filenames as stored in
      +	 * git, and thus get to ignore all those complications.
     @@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
      +
      +MAYBE_UNUSED
      +static int find_basename_matches(struct diff_options *options,
     -+				 int minimum_score,
     -+				 int num_src)
     ++				 int minimum_score)
      +{
      +	int i;
      +	struct strintmap sources;
      +	struct strintmap dests;
      +
     -+	/* Create maps of basename -> fullname(s) for sources and dests */
     ++	/*
     ++	 * Create maps of basename -> fullname(s) for remaining sources and
     ++	 * dests.
     ++	 */
      +	strintmap_init_with_options(&sources, -1, NULL, 0);
      +	strintmap_init_with_options(&dests, -1, NULL, 0);
     -+	for (i = 0; i < num_src; ++i) {
     ++	for (i = 0; i < rename_src_nr; ++i) {
      +		char *filename = rename_src[i].p->one->path;
      +		const char *base;
      +
 3:  dc26881e4ed3 ! 3:  ece76429dc35 diffcore-rename: complete find_basename_matches()
     @@ Commit message
          This means we are adding a set of preliminary additional comparisons,
          but for each file we only compare it with at most one other file.  For
          example, if there was a include/media/device.h that was deleted and a
     -    src/module/media/device.h that was added, and there were no other
     -    device.h files added or deleted between the commits being compared,
     -    then these two files would be compared in the preliminary step.
     +    src/module/media/device.h that was added, and there are no other
     +    device.h files in the remaining sets of added and deleted files after
     +    exact rename detection, then these two files would be compared in the
     +    preliminary step.
      
          This commit does not yet actually employ this new optimization, it
          merely adds a function which can be used for this purpose.  The next
     @@ Commit message
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
     -@@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options,
     - 				 int minimum_score,
     - 				 int num_src)
     +@@ diffcore-rename.c: MAYBE_UNUSED
     + static int find_basename_matches(struct diff_options *options,
     + 				 int minimum_score)
       {
      -	int i;
      +	/*
     @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options
      +	int i, renames = 0;
       	struct strintmap sources;
       	struct strintmap dests;
     - 
     ++	struct hashmap_iter iter;
     ++	struct strmap_entry *entry;
     ++
      +	/*
      +	 * The prefeteching stuff wants to know if it can skip prefetching
      +	 * blobs that are unmodified...and will then do a little extra work
     @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options
      +	 * unmodified would be a small waste.
      +	 */
      +	int skip_unmodified = 0;
     -+
     - 	/* Create maps of basename -> fullname(s) for sources and dests */
     - 	strintmap_init_with_options(&sources, -1, NULL, 0);
     - 	strintmap_init_with_options(&dests, -1, NULL, 0);
     + 
     + 	/*
     + 	 * Create maps of basename -> fullname(s) for remaining sources and
      @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options,
       			strintmap_set(&dests, base, i);
       	}
       
      -	/* TODO: Make use of basenames source and destination basenames */
      +	/* Now look for basename matchups and do similarity estimation */
     -+	for (i = 0; i < num_src; ++i) {
     -+		char *filename = rename_src[i].p->one->path;
     -+		const char *base = NULL;
     -+		intptr_t src_index;
     ++	strintmap_for_each_entry(&sources, &iter, entry) {
     ++		const char *base = entry->key;
     ++		intptr_t src_index = (intptr_t)entry->value;
      +		intptr_t dst_index;
     -+
     -+		/* Find out if this basename is unique among sources */
     -+		base = get_basename(filename);
     -+		src_index = strintmap_get(&sources, base);
      +		if (src_index == -1)
     -+			continue; /* not a unique basename; skip it */
     -+		assert(src_index == i);
     ++			continue;
      +
     -+		if (strintmap_contains(&dests, base)) {
     ++		if (0 <= (dst_index = strintmap_get(&dests, base))) {
      +			struct diff_filespec *one, *two;
      +			int score;
      +
     -+			/* Find out if this basename is unique among dests */
     -+			dst_index = strintmap_get(&dests, base);
     -+			if (dst_index == -1)
     -+				continue; /* not a unique basename; skip it */
     -+
     -+			/* Ignore this dest if already used in a rename */
     -+			if (rename_dst[dst_index].is_rename)
     -+				continue; /* already used previously */
     -+
      +			/* Estimate the similarity */
      +			one = rename_src[src_index].p->one;
      +			two = rename_dst[dst_index].p->two;
 4:  2493f4b2f55d ! 4:  122902e2706f diffcore-rename: guide inexact rename detection based on basenames
     @@ Commit message
          files based on basenames.  As a quick reminder (see the last two commit
          messages for more details), this means for example that
          docs/extensions.txt and docs/config/extensions.txt are considered likely
     -    renames if there are no 'extensions.txt' files elsewhere among the added
     -    and deleted files, and if a similarity check confirms they are similar,
     -    then they are marked as a rename without looking for a better similarity
     -    match among other files.  This is a behavioral change, as covered in
     -    more detail in the previous commit message.
     +    renames if there are no remaining 'extensions.txt' files elsewhere among
     +    the added and deleted files, and if a similarity check confirms they are
     +    similar, then they are marked as a rename without looking for a better
     +    similarity match among other files.  This is a behavioral change, as
     +    covered in more detail in the previous commit message.
      
          We do not use this heuristic together with either break or copy
          detection.  The point of break detection is to say that filename
     @@ diffcore-rename.c: static const char *get_basename(const char *filename)
       
      -MAYBE_UNUSED
       static int find_basename_matches(struct diff_options *options,
     - 				 int minimum_score,
     - 				 int num_src)
     + 				 int minimum_score)
     + {
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       	if (minimum_score == MAX_SCORE)
       		goto cleanup;
       
     -+	num_sources = rename_src_nr;
     -+
     +-	/* Calculate how many renames are left */
     +-	num_destinations = (rename_dst_nr - rename_count);
     +-	remove_unneeded_paths_from_src(want_copies);
     + 	num_sources = rename_src_nr;
     + 
      +	if (want_copies || break_idx) {
      +		/*
      +		 * Cull sources:
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +		/* Utilize file basenames to quickly find renames. */
      +		trace2_region_enter("diff", "basename matches", options->repo);
      +		rename_count += find_basename_matches(options,
     -+						      min_basename_score,
     -+						      rename_src_nr);
     ++						      min_basename_score);
      +		trace2_region_leave("diff", "basename matches", options->repo);
      +
      +		/*
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +		trace2_region_leave("diff", "cull basename", options->repo);
      +	}
      +
     - 	/*
     --	 * Calculate how many renames are left
     -+	 * Calculate how many rename destinations are left
     - 	 */
     - 	num_destinations = (rename_dst_nr - rename_count);
     --	remove_unneeded_paths_from_src(want_copies);
     --	num_sources = rename_src_nr;
     ++	/* Calculate how many rename destinations are left */
     ++	num_destinations = (rename_dst_nr - rename_count);
      +	num_sources = rename_src_nr; /* rename_src_nr reflects lower number */
     - 
     ++
       	/* All done? */
       	if (!num_destinations || !num_sources)
     + 		goto cleanup;
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       		struct diff_score *m;
       
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      
       ## t/t4001-diff-rename.sh ##
      @@ t/t4001-diff-rename.sh: test_expect_success 'basename similarity vs best similarity' '
     - 	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
     - 	# but since same basenames are checked first...
     + 	git add file.txt file.md &&
     + 	git commit -a -m "rename" &&
     + 	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
     +-	# subdir/file.txt is 88% similar to file.md and 78% similar to file.txt
     ++	# subdir/file.txt is 88% similar to file.md, 78% similar to file.txt,
     ++	# but since same basenames are checked first...
       	cat >expected <<-\EOF &&
      -	R088	subdir/file.txt	file.md
      -	A	file.txt
 5:  4e86ed3f29d4 ! 5:  6f5584f61350 gitdiffcore doc: mention new preliminary step for rename detection
     @@ Documentation/gitdiffcore.txt: a similarity score different from the default of
      +preliminary "match same filename" step uses a bit higher threshold to
      +mark a file pair as a rename and stop considering other candidates for
      +better matches.  At most, one comparison is done per file in this
     -+preliminary pass; so if there are several ext.txt files throughout the
     -+directory hierarchy that were added and deleted, this preliminary step
     -+will be skipped for those files.
     ++preliminary pass; so if there are several remaining ext.txt files
     ++throughout the directory hierarchy after exact rename detection, this
     ++preliminary step will be skipped for those files.
      +
       Note.  When the "-C" option is used with `--find-copies-harder`
       option, 'git diff-{asterisk}' commands feed unmodified filepairs to
 6:  fedb3d323d94 = 6:  aeca14f748af merge-ort: call diffcore_rename() directly

-- 
gitgitgadget


* [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity
  2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
@ 2021-02-14  7:51         ` Elijah Newren via GitGitGadget
  2021-02-14  7:51         ` [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates Elijah Newren via GitGitGadget
                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:51 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add a simple test where a removed file is similar to two different added
files; one of them has the same basename, and the other has a slightly
higher content similarity.  In the current test, content similarity is
weighted higher than filename similarity.

Subsequent commits will add a new rule that weighs a mixture of filename
similarity and content similarity in a manner that will change the
outcome of this testcase.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t4001-diff-rename.sh | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index c16486a9d41a..0f97858197e1 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -262,4 +262,27 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
 	grep "myotherfile.*myfile" actual
 '
 
+test_expect_success 'basename similarity vs best similarity' '
+	mkdir subdir &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			 line6 line7 line8 line9 line10 >subdir/file.txt &&
+	git add subdir/file.txt &&
+	git commit -m "base txt" &&
+
+	git rm subdir/file.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 >file.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 line9 >file.md &&
+	git add file.txt file.md &&
+	git commit -a -m "rename" &&
+	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
+	# subdir/file.txt is 88% similar to file.md and 78% similar to file.txt
+	cat >expected <<-\EOF &&
+	R088	subdir/file.txt	file.md
+	A	file.txt
+	EOF
+	test_cmp expected actual
+'
+
 test_done
-- 
gitgitgadget
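
The 88% and 78% figures in the test comment can be reproduced by hand. The sketch below assumes the similarity score is roughly "bytes shared with the source, divided by the size of the larger file", scaled to a maximum score of 60000 (believed to match MAX_SCORE in diffcore.h) and then truncated to a percentage. The sketch_* names are hypothetical; the real estimate_similarity() derives the copied byte count from hashed chunks of content rather than taking it as an argument.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed to mirror MAX_SCORE in git's diffcore.h. */
#define SKETCH_MAX_SCORE 60000.0

/* Hypothetical helper: score from bytes copied and the larger file's size. */
int sketch_score(unsigned long copied, unsigned long max_size)
{
	assert(max_size > 0);
	return (int)(copied * SKETCH_MAX_SCORE / max_size);
}

/* Percentage as printed after the "R" in --name-status output (truncated). */
int sketch_percent(int score)
{
	return (int)(score * 100 / SKETCH_MAX_SCORE);
}

/*
 * subdir/file.txt holds line1..line10: nine 6-byte lines plus one
 * 7-byte line, i.e. 61 bytes.  file.md keeps the first nine lines
 * (54 bytes) and file.txt keeps the first eight (48 bytes).
 */
void sketch_print_test_scores(void)
{
	printf("file.md:  %d%%\n", sketch_percent(sketch_score(54, 61)));
	printf("file.txt: %d%%\n", sketch_percent(sketch_score(48, 61)));
}
```

Under these assumptions file.md is the better content match, which is why the test currently expects R088 for it.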



* [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates
  2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
  2021-02-14  7:51         ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
@ 2021-02-14  7:51         ` Elijah Newren via GitGitGadget
  2021-02-14  7:51         ` [PATCH v5 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:51 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first.  (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.)  Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).

This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names.  Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
  * linux: 76%
  * gcc: 64%
  * gecko: 79%
  * webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6fd0c4a2f485..e51f33a2184a 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,69 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
+static const char *get_basename(const char *filename)
+{
+	/*
+	 * gitbasename() has to worry about special drives, multiple
+	 * directory separator characters, trailing slashes, NULL or
+	 * empty strings, etc.  We only work on filenames as stored in
+	 * git, and thus get to ignore all those complications.
+	 */
+	const char *base = strrchr(filename, '/');
+	return base ? base + 1 : filename;
+}
+
+MAYBE_UNUSED
+static int find_basename_matches(struct diff_options *options,
+				 int minimum_score)
+{
+	int i;
+	struct strintmap sources;
+	struct strintmap dests;
+
+	/*
+	 * Create maps of basename -> fullname(s) for remaining sources and
+	 * dests.
+	 */
+	strintmap_init_with_options(&sources, -1, NULL, 0);
+	strintmap_init_with_options(&dests, -1, NULL, 0);
+	for (i = 0; i < rename_src_nr; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base;
+
+		/* exact renames removed in remove_unneeded_paths_from_src() */
+		assert(!rename_src[i].p->one->rename_used);
+
+		/* Record index within rename_src (i) if basename is unique */
+		base = get_basename(filename);
+		if (strintmap_contains(&sources, base))
+			strintmap_set(&sources, base, -1);
+		else
+			strintmap_set(&sources, base, i);
+	}
+	for (i = 0; i < rename_dst_nr; ++i) {
+		char *filename = rename_dst[i].p->two->path;
+		const char *base;
+
+		if (rename_dst[i].is_rename)
+			continue; /* involved in exact match already. */
+
+		/* Record index within rename_dst (i) if basename is unique */
+		base = get_basename(filename);
+		if (strintmap_contains(&dests, base))
+			strintmap_set(&dests, base, -1);
+		else
+			strintmap_set(&dests, base, i);
+	}
+
+	/* TODO: Make use of basenames source and destination basenames */
+
+	strintmap_clear(&sources);
+	strintmap_clear(&dests);
+
+	return 0;
+}
+
 #define NUM_CANDIDATE_PER_DST 4
 static void record_if_better(struct diff_score m[], struct diff_score *o)
 {
-- 
gitgitgadget
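
The "unique basename, else -1" bookkeeping described in the commit message can be tried outside of git with a small stand-alone sketch. It replaces strintmap with a toy linear-search map, so the sketch_* names and the fixed capacity are assumptions made for illustration only; just the idea (first sighting records the index, a second sighting overwrites it with -1) comes from the patch.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX_PATHS 64

/* Same idea as get_basename() in the patch: text after the last '/'. */
const char *sketch_basename(const char *filename)
{
	const char *base = strrchr(filename, '/');
	return base ? base + 1 : filename;
}

/* Toy stand-in for a strintmap: parallel arrays searched linearly. */
struct sketch_map {
	const char *key[SKETCH_MAX_PATHS];
	int value[SKETCH_MAX_PATHS];
	int nr;
};

/* Return the stored index, or -1 if the basename is absent or not unique. */
int sketch_map_get(const struct sketch_map *map, const char *base)
{
	int i;

	for (i = 0; i < map->nr; i++)
		if (!strcmp(map->key[i], base))
			return map->value[i];
	return -1;
}

/* Record the index of each basename; a second sighting marks it with -1. */
void sketch_map_fill(struct sketch_map *map, const char **paths, int nr)
{
	int i, j;

	assert(nr <= SKETCH_MAX_PATHS);
	map->nr = 0;
	for (i = 0; i < nr; i++) {
		const char *base = sketch_basename(paths[i]);

		for (j = 0; j < map->nr; j++)
			if (!strcmp(map->key[j], base))
				break;
		if (j < map->nr) {
			map->value[j] = -1; /* seen before: not unique */
		} else {
			map->key[j] = base;
			map->value[j] = i;
			map->nr++;
		}
	}
}
```

Filling one such map from the source paths and another from the destination paths gives the two lookups that the next patch consumes.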



* [PATCH v5 3/6] diffcore-rename: complete find_basename_matches()
  2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
  2021-02-14  7:51         ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
  2021-02-14  7:51         ` [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates Elijah Newren via GitGitGadget
@ 2021-02-14  7:51         ` Elijah Newren via GitGitGadget
  2021-02-14  7:51         ` [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
                           ` (2 subsequent siblings)
  5 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:51 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories.  We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames.  If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.

This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file.  For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.

This commit does not yet actually employ this new optimization; it
merely adds a function which can be used for this purpose.  The next
commit will do the necessary plumbing to make use of it.

Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename.  I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show, it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that, despite having unique matching basenames, other files
serve as better matches.  Reason (4) takes a full paragraph to
explain...

If the previous three reasons aren't enough, consider what rename
detection already does.  Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar.  In fact, in such a case, we don't even
bother comparing the files to see if they are similar, let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity.  Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum in the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents.  This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well.  Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 82 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index e51f33a2184a..266d4fae48c7 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -383,9 +383,53 @@ MAYBE_UNUSED
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score)
 {
-	int i;
+	/*
+	 * When I checked in early 2020, over 76% of file renames in linux
+	 * just moved files to a different directory but kept the same
+	 * basename.  gcc did that with over 64% of renames, gecko did it
+	 * with over 79%, and WebKit did it with over 89%.
+	 *
+	 * Therefore we can bypass the normal exhaustive NxM matrix
+	 * comparison of similarities between all potential rename sources
+	 * and destinations by instead using file basename as a hint (i.e.
+	 * the portion of the filename after the last '/'), checking for
+	 * similarity between files with the same basename, and if we find
+	 * a pair that are sufficiently similar, record the rename pair and
+	 * exclude those two from the NxM matrix.
+	 *
+	 * This *might* cause us to find a less than optimal pairing (if
+	 * there is another file that we are even more similar to but has a
+	 * different basename).  Given the huge performance advantage
+	 * basename matching provides, and given the frequency with which
+	 * people use the same basename in real world projects, that's a
+	 * trade-off we are willing to accept when doing just rename
+	 * detection.
+	 *
+	 * If someone wants copy detection that implies they are willing to
+	 * spend more cycles to find similarities between files, so it may
+	 * be less likely that this heuristic is wanted.  If someone is
+	 * doing break detection, that means they do not want filename
+	 * similarity to imply any form of content similarity, and thus
+	 * this heuristic would definitely be incompatible.
+	 */
+
+	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	/*
+	 * The prefetching stuff wants to know if it can skip prefetching
+	 * blobs that are unmodified...and will then do a little extra work
+	 * to verify that the oids are indeed different before prefetching.
+	 * Unmodified blobs are only relevant when doing copy detection;
+	 * when limiting to rename detection, diffcore_rename[_extended]()
+	 * will never be called with unmodified source paths fed to us, so
+	 * the extra work necessary to check if rename_src entries are
+	 * unmodified would be a small waste.
+	 */
+	int skip_unmodified = 0;
 
 	/*
 	 * Create maps of basename -> fullname(s) for remaining sources and
@@ -422,12 +466,44 @@ static int find_basename_matches(struct diff_options *options,
 			strintmap_set(&dests, base, i);
 	}
 
-	/* TODO: Make use of basenames source and destination basenames */
+	/* Now look for basename matchups and do similarity estimation */
+	strintmap_for_each_entry(&sources, &iter, entry) {
+		const char *base = entry->key;
+		intptr_t src_index = (intptr_t)entry->value;
+		intptr_t dst_index;
+		if (src_index == -1)
+			continue;
+
+		if (0 <= (dst_index = strintmap_get(&dests, base))) {
+			struct diff_filespec *one, *two;
+			int score;
+
+			/* Estimate the similarity */
+			one = rename_src[src_index].p->one;
+			two = rename_dst[dst_index].p->two;
+			score = estimate_similarity(options->repo, one, two,
+						    minimum_score, skip_unmodified);
+
+			/* If sufficiently similar, record as rename pair */
+			if (score < minimum_score)
+				continue;
+			record_rename_pair(dst_index, src_index, score);
+			renames++;
+
+			/*
+			 * Found a rename so don't need text anymore; if we
+			 * didn't find a rename, the filespec_blob would get
+			 * re-used when doing the matrix of comparisons.
+			 */
+			diff_free_filespec_blob(one);
+			diff_free_filespec_blob(two);
+		}
+	}
 
 	strintmap_clear(&sources);
 	strintmap_clear(&dests);
 
-	return 0;
+	return renames;
 }
 
 #define NUM_CANDIDATE_PER_DST 4
-- 
gitgitgadget
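
The "at most one comparison per file" property of the preliminary pass can be seen in a stand-alone sketch. Everything named sketch_* is hypothetical: uniqueness is checked by a linear scan instead of the strintmaps, the scorer is a callback standing in for estimate_similarity(), and results go into a plain array rather than through record_rename_pair().

```c
#include <assert.h>
#include <string.h>

const char *sketch_basename(const char *path)
{
	const char *base = strrchr(path, '/');
	return base ? base + 1 : path;
}

/* Index of the only path with this basename, or -1 if none or several. */
int sketch_unique_index(const char **paths, int nr, const char *base)
{
	int i, found = -1;

	for (i = 0; i < nr; i++) {
		if (strcmp(sketch_basename(paths[i]), base))
			continue;
		if (found >= 0)
			return -1; /* second hit: not unique */
		found = i;
	}
	return found;
}

/* Hypothetical scoring callback standing in for estimate_similarity(). */
typedef int (*sketch_score_fn)(int src_index, int dst_index);

/* Toy scorer for trying the sketch out: every pair scores 47213 (~78%). */
int sketch_toy_score(int src_index, int dst_index)
{
	(void)src_index;
	(void)dst_index;
	return 47213;
}

/*
 * For each source whose basename is unique among sources and also
 * unique among destinations, do a single comparison.  dst_for_src[i]
 * receives the matched destination index, or -1.  Returns the number
 * of pairs recorded.
 */
int sketch_basename_matches(const char **srcs, int src_nr,
			    const char **dsts, int dst_nr,
			    sketch_score_fn score, int minimum_score,
			    int *dst_for_src)
{
	int i, renames = 0;

	assert(score);
	for (i = 0; i < src_nr; i++) {
		const char *base = sketch_basename(srcs[i]);
		int dst;

		dst_for_src[i] = -1;
		if (sketch_unique_index(srcs, src_nr, base) != i)
			continue; /* basename not unique among sources */
		dst = sketch_unique_index(dsts, dst_nr, base);
		if (dst < 0)
			continue; /* no unique destination with this name */
		if (score(i, dst) < minimum_score)
			continue; /* left for the full matrix comparison */
		dst_for_src[i] = dst;
		renames++;
	}
	return renames;
}
```

Pairs that fail the threshold are simply left unmatched, which corresponds to the patch leaving them for the later matrix comparison.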



* [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames
  2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
                           ` (2 preceding siblings ...)
  2021-02-14  7:51         ` [PATCH v5 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
@ 2021-02-14  7:51         ` Elijah Newren via GitGitGadget
  2021-02-14  7:51         ` [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
  2021-02-14  7:51         ` [PATCH v5 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
  5 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:51 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames.  As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files.  This is a behavioral change, as
covered in more detail in the previous commit message.

We do not use this heuristic together with either break or copy
detection.  The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity.  The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities.  So the idea behind this
optimization goes against both of those features, and will be turned off
for both.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       13.815 s ±  0.062 s    13.294 s ±  0.103 s
    mega-renames:   1799.937 s ±  0.493 s   187.248 s ±  0.882 s
    just-one-mega:    51.289 s ±  0.019 s     5.557 s ±  0.017 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c      | 53 ++++++++++++++++++++++++++++++++++++++----
 t/t4001-diff-rename.sh |  7 +++---
 2 files changed, 52 insertions(+), 8 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 266d4fae48c7..41558185ae1d 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -379,7 +379,6 @@ static const char *get_basename(const char *filename)
 	return base ? base + 1 : filename;
 }
 
-MAYBE_UNUSED
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score)
 {
@@ -716,11 +715,55 @@ void diffcore_rename(struct diff_options *options)
 	if (minimum_score == MAX_SCORE)
 		goto cleanup;
 
-	/* Calculate how many renames are left */
-	num_destinations = (rename_dst_nr - rename_count);
-	remove_unneeded_paths_from_src(want_copies);
 	num_sources = rename_src_nr;
 
+	if (want_copies || break_idx) {
+		/*
+		 * Cull sources:
+		 *   - remove ones corresponding to exact renames
+		 */
+		trace2_region_enter("diff", "cull after exact", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull after exact", options->repo);
+	} else {
+		/* Determine minimum score to match basenames */
+		double factor = 0.5;
+		char *basename_factor = getenv("GIT_BASENAME_FACTOR");
+		int min_basename_score;
+
+		if (basename_factor)
+			factor = strtol(basename_factor, NULL, 10)/100.0;
+		assert(factor >= 0.0 && factor <= 1.0);
+		min_basename_score = minimum_score +
+			(int)(factor * (MAX_SCORE - minimum_score));
+
+		/*
+		 * Cull sources:
+		 *   - remove ones involved in renames (found via exact match)
+		 */
+		trace2_region_enter("diff", "cull after exact", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull after exact", options->repo);
+
+		/* Utilize file basenames to quickly find renames. */
+		trace2_region_enter("diff", "basename matches", options->repo);
+		rename_count += find_basename_matches(options,
+						      min_basename_score);
+		trace2_region_leave("diff", "basename matches", options->repo);
+
+		/*
+		 * Cull sources, again:
+		 *   - remove ones involved in renames (found via basenames)
+		 */
+		trace2_region_enter("diff", "cull basename", options->repo);
+		remove_unneeded_paths_from_src(want_copies);
+		trace2_region_leave("diff", "cull basename", options->repo);
+	}
+
+	/* Calculate how many rename destinations are left */
+	num_destinations = (rename_dst_nr - rename_count);
+	num_sources = rename_src_nr; /* rename_src_nr reflects lower number */
+
 	/* All done? */
 	if (!num_destinations || !num_sources)
 		goto cleanup;
@@ -751,7 +794,7 @@ void diffcore_rename(struct diff_options *options)
 		struct diff_score *m;
 
 		if (rename_dst[i].is_rename)
-			continue; /* dealt with exact match already. */
+			continue; /* exact or basename match already handled */
 
 		m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
 		for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index 0f97858197e1..99a5d1bd1c3a 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -277,10 +277,11 @@ test_expect_success 'basename similarity vs best similarity' '
 	git add file.txt file.md &&
 	git commit -a -m "rename" &&
 	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
-	# subdir/file.txt is 88% similar to file.md and 78% similar to file.txt
+	# subdir/file.txt is 88% similar to file.md, 78% similar to file.txt,
+	# but since same basenames are checked first...
 	cat >expected <<-\EOF &&
-	R088	subdir/file.txt	file.md
-	A	file.txt
+	A	file.md
+	R078	subdir/file.txt	file.txt
 	EOF
 	test_cmp expected actual
 '
-- 
gitgitgadget
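
The threshold arithmetic in the else branch above can be checked in isolation. The sketch assumes a maximum score of 60000 and a default rename score of 30000 (50%), which is what diffcore.h is believed to define; sketch_min_basename_score() is a hypothetical helper that takes the factor string as a parameter instead of reading GIT_BASENAME_FACTOR from the environment.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed to mirror MAX_SCORE in git's diffcore.h. */
#define SKETCH_MAX_SCORE 60000.0

/*
 * factor_percent is a string such as "50", or NULL to use the default
 * factor of 0.5.  The result lies between minimum_score (factor 0)
 * and SKETCH_MAX_SCORE (factor 1).
 */
int sketch_min_basename_score(int minimum_score, const char *factor_percent)
{
	double factor = 0.5;

	if (factor_percent)
		factor = strtol(factor_percent, NULL, 10) / 100.0;
	assert(factor >= 0.0 && factor <= 1.0);
	return minimum_score +
		(int)(factor * (SKETCH_MAX_SCORE - minimum_score));
}
```

With the assumed default of 30000 and the default factor, the basename threshold comes out at 45000, i.e. 75%.  That would be consistent with the updated t4001 expectation: the 78% match on file.txt clears the bar in the preliminary pass, so the 88% match on file.md is never considered.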



* [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection
  2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
                           ` (3 preceding siblings ...)
  2021-02-14  7:51         ` [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
@ 2021-02-14  7:51         ` Elijah Newren via GitGitGadget
  2021-02-14  7:51         ` [PATCH v5 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
  5 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:51 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

The last few patches have introduced a new preliminary step when rename
detection is on but both break detection and copy detection are off.
Document this new step.  While we're at it, add a testcase that checks
the new behavior as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/gitdiffcore.txt | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c970d9fe438a..80fcf9542441 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -168,6 +168,26 @@ a similarity score different from the default of 50% by giving a
 number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
 8/10 = 80%).
 
+Note that when rename detection is on but both copy and break
+detection are off, rename detection adds a preliminary step that first
+checks if files are moved across directories while keeping their
+filename the same.  If there is a file added to a directory whose
+contents are sufficiently similar to a file with the same name that got
+deleted from a different directory, it will mark them as renames and
+exclude them from the later quadratic step (the one that pairwise
+compares all unmatched files to find the "best" matches, determined by
+the highest content similarity).  So, for example, if a deleted
+docs/ext.txt and an added docs/config/ext.txt are similar enough, they
+will be marked as a rename and prevent an added docs/ext.md that may
+be even more similar to the deleted docs/ext.txt from being considered
+as the rename destination in the later step.  For this reason, the
+preliminary "match same filename" step uses a bit higher threshold to
+mark a file pair as a rename and stop considering other candidates for
+better matches.  At most, one comparison is done per file in this
+preliminary pass; so if there are several remaining ext.txt files
+throughout the directory hierarchy after exact rename detection, this
+preliminary step will be skipped for those files.
+
 Note.  When the "-C" option is used with `--find-copies-harder`
 option, 'git diff-{asterisk}' commands feed unmodified filepairs to
 diffcore mechanism as well as modified ones.  This lets the copy
-- 
gitgitgadget



* [PATCH v5 6/6] merge-ort: call diffcore_rename() directly
  2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
                           ` (4 preceding siblings ...)
  2021-02-14  7:51         ` [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
@ 2021-02-14  7:51         ` Elijah Newren via GitGitGadget
  5 siblings, 0 replies; 71+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:51 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Elijah Newren, Derrick Stolee, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

We want to pass additional information to diffcore_rename() (or some
variant thereof) without plumbing that extra information through
diff_tree_oid() and diffcore_std().  Further, since we will need to
gather additional special information related to diffs and are walking
the trees anyway in collect_merge_info(), it seems odd to have
diff_tree_oid()/diffcore_std() repeat those tree walks.  And there may
be times where we can avoid traversing into a subtree in
collect_merge_info() (based on additional information at our disposal)
in ways that the basic diff logic would be unable to take advantage of.
For all these reasons, just create the add and delete pairs ourselves
and then call diffcore_rename() directly.

This change is primarily about enabling future optimizations; the
advantage of avoiding extra tree traversals is small compared to the
cost of rename detection, and it is somewhat offset by the extra time
spent in collect_merge_info() collecting the additional data anyway.
However...

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       13.294 s ±  0.103 s    12.775 s ±  0.062 s
    mega-renames:    187.248 s ±  0.882 s   188.754 s ±  0.284 s
    just-one-mega:     5.557 s ±  0.017 s     5.599 s ±  0.019 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 59 insertions(+), 7 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 931b91438cf1..603d30c52170 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -535,6 +535,23 @@ static void setup_path_info(struct merge_options *opt,
 	result->util = mi;
 }
 
+static void add_pair(struct merge_options *opt,
+		     struct name_entry *names,
+		     const char *pathname,
+		     unsigned side,
+		     unsigned is_add /* if false, is_delete */)
+{
+	struct diff_filespec *one, *two;
+	struct rename_info *renames = &opt->priv->renames;
+	int names_idx = is_add ? side : 0;
+
+	one = alloc_filespec(pathname);
+	two = alloc_filespec(pathname);
+	fill_filespec(is_add ? two : one,
+		      &names[names_idx].oid, 1, names[names_idx].mode);
+	diff_queue(&renames->pairs[side], one, two);
+}
+
 static void collect_rename_info(struct merge_options *opt,
 				struct name_entry *names,
 				const char *dirname,
@@ -544,6 +561,7 @@ static void collect_rename_info(struct merge_options *opt,
 				unsigned match_mask)
 {
 	struct rename_info *renames = &opt->priv->renames;
+	unsigned side;
 
 	/* Update dirs_removed, as needed */
 	if (dirmask == 1 || dirmask == 3 || dirmask == 5) {
@@ -554,6 +572,21 @@ static void collect_rename_info(struct merge_options *opt,
 		if (sides & 2)
 			strset_add(&renames->dirs_removed[2], fullname);
 	}
+
+	if (filemask == 0 || filemask == 7)
+		return;
+
+	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; ++side) {
+		unsigned side_mask = (1 << side);
+
+		/* Check for deletion on side */
+		if ((filemask & 1) && !(filemask & side_mask))
+			add_pair(opt, names, fullname, side, 0 /* delete */);
+
+		/* Check for addition on side */
+		if (!(filemask & 1) && (filemask & side_mask))
+			add_pair(opt, names, fullname, side, 1 /* add */);
+	}
 }
 
 static int collect_merge_info_callback(int n,
@@ -2079,6 +2112,27 @@ static int process_renames(struct merge_options *opt,
 	return clean_merge;
 }
 
+static void resolve_diffpair_statuses(struct diff_queue_struct *q)
+{
+	/*
+	 * A simplified version of diff_resolve_rename_copy(); would probably
+	 * just use that function but it's static...
+	 */
+	int i;
+	struct diff_filepair *p;
+
+	for (i = 0; i < q->nr; ++i) {
+		p = q->queue[i];
+		p->status = 0; /* undecided */
+		if (!DIFF_FILE_VALID(p->one))
+			p->status = DIFF_STATUS_ADDED;
+		else if (!DIFF_FILE_VALID(p->two))
+			p->status = DIFF_STATUS_DELETED;
+		else if (DIFF_PAIR_RENAME(p))
+			p->status = DIFF_STATUS_RENAMED;
+	}
+}
+
 static int compare_pairs(const void *a_, const void *b_)
 {
 	const struct diff_filepair *a = *((const struct diff_filepair **)a_);
@@ -2089,8 +2143,6 @@ static int compare_pairs(const void *a_, const void *b_)
 
 /* Call diffcore_rename() to compute which files have changed on given side */
 static void detect_regular_renames(struct merge_options *opt,
-				   struct tree *merge_base,
-				   struct tree *side,
 				   unsigned side_index)
 {
 	struct diff_options diff_opts;
@@ -2108,11 +2160,11 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
 	diff_setup_done(&diff_opts);
 
+	diff_queued_diff = renames->pairs[side_index];
 	trace2_region_enter("diff", "diffcore_rename", opt->repo);
-	diff_tree_oid(&merge_base->object.oid, &side->object.oid, "",
-		      &diff_opts);
-	diffcore_std(&diff_opts);
+	diffcore_rename(&diff_opts);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
+	resolve_diffpair_statuses(&diff_queued_diff);
 
 	if (diff_opts.needed_rename_limit > renames->needed_limit)
 		renames->needed_limit = diff_opts.needed_rename_limit;
@@ -2212,8 +2264,8 @@ static int detect_and_process_renames(struct merge_options *opt,
 	memset(&combined, 0, sizeof(combined));
 
 	trace2_region_enter("merge", "regular renames", opt->repo);
-	detect_regular_renames(opt, merge_base, side1, MERGE_SIDE1);
-	detect_regular_renames(opt, merge_base, side2, MERGE_SIDE2);
+	detect_regular_renames(opt, MERGE_SIDE1);
+	detect_regular_renames(opt, MERGE_SIDE2);
 	trace2_region_leave("merge", "regular renames", opt->repo);
 
 	trace2_region_enter("merge", "directory renames", opt->repo);
-- 
gitgitgadget

end of thread, other threads:[~2021-02-14  7:55 UTC | newest]

Thread overview: 71+ messages
2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 2/3] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-07 14:38   ` Derrick Stolee
2021-02-07 19:51     ` Junio C Hamano
2021-02-08  8:38       ` Elijah Newren
2021-02-08 11:43         ` Derrick Stolee
2021-02-08 16:25           ` Elijah Newren
2021-02-08 17:37         ` Junio C Hamano
2021-02-08 22:00           ` Elijah Newren
2021-02-08 23:43             ` Junio C Hamano
2021-02-08 23:52               ` Elijah Newren
2021-02-08  8:27     ` Elijah Newren
2021-02-08 11:31       ` Derrick Stolee
2021-02-08 16:09         ` Elijah Newren
2021-02-07  5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-07  6:05   ` Elijah Newren
2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-09 13:17     ` Derrick Stolee
2021-02-09 16:56       ` Elijah Newren
2021-02-09 17:02         ` Derrick Stolee
2021-02-09 17:42           ` Elijah Newren
2021-02-09 11:32   ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-09 13:25     ` Derrick Stolee
2021-02-09 17:17       ` Elijah Newren
2021-02-09 17:34         ` Derrick Stolee
2021-02-09 11:32   ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-09 13:33     ` Derrick Stolee
2021-02-09 17:41       ` Elijah Newren
2021-02-09 18:59         ` Junio C Hamano
2021-02-09 11:32   ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-09 12:59     ` Derrick Stolee
2021-02-09 17:03       ` Junio C Hamano
2021-02-09 17:44         ` Elijah Newren
2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-13  1:15       ` Junio C Hamano
2021-02-13  4:50         ` Elijah Newren
2021-02-13 23:56           ` Junio C Hamano
2021-02-14  1:24             ` Elijah Newren
2021-02-14  1:32               ` Junio C Hamano
2021-02-14  3:14                 ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-13  1:32       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-13  1:48       ` Junio C Hamano
2021-02-13 18:34         ` Elijah Newren
2021-02-13 23:55           ` Junio C Hamano
2021-02-14  3:08             ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-13  1:49       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-10 16:41       ` Junio C Hamano
2021-02-10 17:20         ` Elijah Newren
2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
2021-02-13  1:53       ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget