All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] diffcore-rename improvements
@ 2020-12-06  2:54 Elijah Newren via GitGitGadget
  2020-12-06  2:54 ` [PATCH 1/7] diffcore-rename: avoid usage of global in too_many_rename_candidates() Elijah Newren via GitGitGadget
                   ` (8 more replies)
  0 siblings, 9 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-06  2:54 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

This patch series include 4 small code cleanups, 1 small bug fix (or perhaps
just a UI improvement -- it's a very minor issue either way), and 2
performance improvements. The first 6 patches are relatively easy to review,
while the last one...is a bit more involved (but provided the biggest
performance boost).

These patches came out of the merge-ort work, and are related, but there is
no dependency between any of the series.

Elijah Newren (7):
  diffcore-rename: avoid usage of global in too_many_rename_candidates()
  diffcore-rename: remove unnecessary if-clause
  diffcore-rename: rename num_create to num_targets
  diffcore-rename: change a few comments to use 'add' instead of
    'create'
  diffcore-rename: reduce jumpiness in progress counters
  diffcore-rename: simplify and accelerate register_rename_src()
  Accelerate rename_dst setup

 diffcore-rename.c | 225 ++++++++++++++++++++--------------------------
 1 file changed, 96 insertions(+), 129 deletions(-)


base-commit: 3a0b884caba2752da0af626fb2de7d597c844e8b
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-929%2Fnewren%2Fdiffcore-rename-improvements-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-929/newren/diffcore-rename-improvements-v1
Pull-Request: https://github.com/git/git/pull/929
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/7] diffcore-rename: avoid usage of global in too_many_rename_candidates()
  2020-12-06  2:54 [PATCH 0/7] diffcore-rename improvements Elijah Newren via GitGitGadget
@ 2020-12-06  2:54 ` Elijah Newren via GitGitGadget
  2020-12-09 22:06   ` Taylor Blau
  2020-12-06  2:54 ` [PATCH 2/7] diffcore-rename: remove unnecessary if-clause Elijah Newren via GitGitGadget
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-06  2:54 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

too_many_rename_candidates() got the number of rename targets via an
argument to the function, but the number of rename sources via a global
variable.  That felt rather inconsistent.  Pass in the number of rename
sources as an argument as well.

While we are at it... We had a local variable, num_src, that served two
purposes.  Initially it was set to this global value, but later was used
for counting a subset of the number of sources.  Since we now have a
function argument for the former usage, introduce a clearer variable
name for the latter usage.

This patch has no behavioral changes; it's just renaming and passing an
argument instead of grabbing it from the global namespace.  (You may
find it easier to view the patch using git diff's --color-words option.)

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index d367a6d2443..68ddf51a2a1 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -434,12 +434,11 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
  * 1 if we need to disable inexact rename detection;
  * 2 if we would be under the limit if we were given -C instead of -C -C.
  */
-static int too_many_rename_candidates(int num_create,
+static int too_many_rename_candidates(int num_targets, int num_sources,
 				      struct diff_options *options)
 {
 	int rename_limit = options->rename_limit;
-	int num_src = rename_src_nr;
-	int i;
+	int i, limited_sources;
 
 	options->needed_rename_limit = 0;
 
@@ -447,30 +446,30 @@ static int too_many_rename_candidates(int num_create,
 	 * This basically does a test for the rename matrix not
 	 * growing larger than a "rename_limit" square matrix, ie:
 	 *
-	 *    num_create * num_src > rename_limit * rename_limit
+	 *    num_targets * num_sources > rename_limit * rename_limit
 	 */
 	if (rename_limit <= 0)
 		rename_limit = 32767;
-	if ((num_create <= rename_limit || num_src <= rename_limit) &&
-	    ((uint64_t)num_create * (uint64_t)num_src
+	if ((num_targets <= rename_limit || num_sources <= rename_limit) &&
+	    ((uint64_t)num_targets * (uint64_t)num_sources
 	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
 		return 0;
 
 	options->needed_rename_limit =
-		num_src > num_create ? num_src : num_create;
+		num_sources > num_targets ? num_sources : num_targets;
 
 	/* Are we running under -C -C? */
 	if (!options->flags.find_copies_harder)
 		return 1;
 
 	/* Would we bust the limit if we were running under -C? */
-	for (num_src = i = 0; i < rename_src_nr; i++) {
+	for (limited_sources = i = 0; i < num_sources; i++) {
 		if (diff_unmodified_pair(rename_src[i].p))
 			continue;
-		num_src++;
+		limited_sources++;
 	}
-	if ((num_create <= rename_limit || num_src <= rename_limit) &&
-	    ((uint64_t)num_create * (uint64_t)num_src
+	if ((num_targets <= rename_limit || limited_sources <= rename_limit) &&
+	    ((uint64_t)num_targets * (uint64_t)limited_sources
 	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
 		return 2;
 	return 1;
@@ -576,7 +575,7 @@ void diffcore_rename(struct diff_options *options)
 	if (!num_create)
 		goto cleanup;
 
-	switch (too_many_rename_candidates(num_create, options)) {
+	switch (too_many_rename_candidates(num_create, rename_src_nr, options)) {
 	case 1:
 		goto cleanup;
 	case 2:
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/7] diffcore-rename: remove unnecessary if-clause
  2020-12-06  2:54 [PATCH 0/7] diffcore-rename improvements Elijah Newren via GitGitGadget
  2020-12-06  2:54 ` [PATCH 1/7] diffcore-rename: avoid usage of global in too_many_rename_candidates() Elijah Newren via GitGitGadget
@ 2020-12-06  2:54 ` Elijah Newren via GitGitGadget
  2020-12-09 22:10   ` Taylor Blau
  2020-12-06  2:54 ` [PATCH 3/7] diffcore-rename: rename num_create to num_targets Elijah Newren via GitGitGadget
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-06  2:54 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

diffcore-rename had two different checks of the form

    if ((a < limit || b < limit) &&
        a * b <= limit * limit)

Since these are all non-negative integers, this can be simplified to

    if (a * b <= limit * limit)

The only advantage of the former would be in avoiding a couple
multiplications in the rare case that both a and b are BOTH very large.
I see no reason for such an optimization given that this code is not in
any kind of loop.  Prefer code simplicity here and change to the latter
form.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 68ddf51a2a1..0f8fce9293e 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -450,9 +450,8 @@ static int too_many_rename_candidates(int num_targets, int num_sources,
 	 */
 	if (rename_limit <= 0)
 		rename_limit = 32767;
-	if ((num_targets <= rename_limit || num_sources <= rename_limit) &&
-	    ((uint64_t)num_targets * (uint64_t)num_sources
-	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
+	if ((uint64_t)num_targets * (uint64_t)num_sources
+	    <= (uint64_t)rename_limit * (uint64_t)rename_limit)
 		return 0;
 
 	options->needed_rename_limit =
@@ -468,9 +467,8 @@ static int too_many_rename_candidates(int num_targets, int num_sources,
 			continue;
 		limited_sources++;
 	}
-	if ((num_targets <= rename_limit || limited_sources <= rename_limit) &&
-	    ((uint64_t)num_targets * (uint64_t)limited_sources
-	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
+	if ((uint64_t)num_targets * (uint64_t)limited_sources
+	    <= (uint64_t)rename_limit * (uint64_t)rename_limit)
 		return 2;
 	return 1;
 }
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 3/7] diffcore-rename: rename num_create to num_targets
  2020-12-06  2:54 [PATCH 0/7] diffcore-rename improvements Elijah Newren via GitGitGadget
  2020-12-06  2:54 ` [PATCH 1/7] diffcore-rename: avoid usage of global in too_many_rename_candidates() Elijah Newren via GitGitGadget
  2020-12-06  2:54 ` [PATCH 2/7] diffcore-rename: remove unnecessary if-clause Elijah Newren via GitGitGadget
@ 2020-12-06  2:54 ` Elijah Newren via GitGitGadget
  2020-12-10  2:20   ` Junio C Hamano
  2020-12-06  2:54 ` [PATCH 4/7] diffcore-rename: change a few comments to use 'add' instead of 'create' Elijah Newren via GitGitGadget
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-06  2:54 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Files added since the base commit serve as targets for rename detection.
While it is true that added files can be thought of as being "created"
when they are added IF they have no pairing file that they were renamed
from, and it is true we start out not knowing what the pairings are, it
seems a little odd to think in terms of "file creation" when we are
looking for "file renames".  Rename the variable to avoid this minor
point of confusion.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 0f8fce9293e..655a67759c8 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -502,7 +502,7 @@ void diffcore_rename(struct diff_options *options)
 	struct diff_queue_struct outq;
 	struct diff_score *mx;
 	int i, j, rename_count, skip_unmodified = 0;
-	int num_create, dst_cnt;
+	int num_targets, dst_cnt;
 	struct progress *progress = NULL;
 
 	if (!minimum_score)
@@ -567,13 +567,13 @@ void diffcore_rename(struct diff_options *options)
 	 * Calculate how many renames are left (but all the source
 	 * files still remain as options for rename/copies!)
 	 */
-	num_create = (rename_dst_nr - rename_count);
+	num_targets = (rename_dst_nr - rename_count);
 
 	/* All done? */
-	if (!num_create)
+	if (!num_targets)
 		goto cleanup;
 
-	switch (too_many_rename_candidates(num_create, rename_src_nr, options)) {
+	switch (too_many_rename_candidates(num_targets, rename_src_nr, options)) {
 	case 1:
 		goto cleanup;
 	case 2:
@@ -590,7 +590,7 @@ void diffcore_rename(struct diff_options *options)
 				(uint64_t)rename_dst_nr * (uint64_t)rename_src_nr);
 	}
 
-	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_create), sizeof(*mx));
+	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_targets), sizeof(*mx));
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
 		struct diff_filespec *two = rename_dst[i].two;
 		struct diff_score *m;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 4/7] diffcore-rename: change a few comments to use 'add' instead of 'create'
  2020-12-06  2:54 [PATCH 0/7] diffcore-rename improvements Elijah Newren via GitGitGadget
                   ` (2 preceding siblings ...)
  2020-12-06  2:54 ` [PATCH 3/7] diffcore-rename: rename num_create to num_targets Elijah Newren via GitGitGadget
@ 2020-12-06  2:54 ` Elijah Newren via GitGitGadget
  2020-12-10  2:29   ` Junio C Hamano
  2020-12-06  2:54 ` [PATCH 5/7] diffcore-rename: reduce jumpiness in progress counters Elijah Newren via GitGitGadget
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-06  2:54 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

As with the last commit, it feels odd to talk about "creation" (which to
me implies a brand new file, i.e. one not associated with an existing
file, and thus one that could not have been a rename) when we are using
the files to look for renames.  Use the term "addition" instead.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 655a67759c8..7270eb6af48 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -173,7 +173,7 @@ static int estimate_similarity(struct repository *r,
 {
 	/* src points at a file that existed in the original tree (or
 	 * optionally a file in the destination tree) and dst points
-	 * at a newly created file.  They may be quite similar, in which
+	 * at a newly added file.  They may be quite similar, in which
 	 * case we want to say src is renamed to dst or src is copied into
 	 * dst, and then some edit has been applied to dst.
 	 *
@@ -652,9 +652,9 @@ void diffcore_rename(struct diff_options *options)
 		}
 		else if (!DIFF_FILE_VALID(p->one) && DIFF_FILE_VALID(p->two)) {
 			/*
-			 * Creation
+			 * Addition
 			 *
-			 * We would output this create record if it has
+			 * We would output this add record if it has
 			 * not been turned into a rename/copy already.
 			 */
 			struct diff_rename_dst *dst = locate_rename_dst(p->two);
@@ -664,7 +664,7 @@ void diffcore_rename(struct diff_options *options)
 			}
 			else
 				/* no matching rename/copy source, so
-				 * record this as a creation.
+				 * record this as an addition.
 				 */
 				diff_q(&outq, p);
 		}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 5/7] diffcore-rename: reduce jumpiness in progress counters
  2020-12-06  2:54 [PATCH 0/7] diffcore-rename improvements Elijah Newren via GitGitGadget
                   ` (3 preceding siblings ...)
  2020-12-06  2:54 ` [PATCH 4/7] diffcore-rename: change a few comments to use 'add' instead of 'create' Elijah Newren via GitGitGadget
@ 2020-12-06  2:54 ` Elijah Newren via GitGitGadget
  2020-12-09 22:24   ` Taylor Blau
  2020-12-10  2:36   ` Junio C Hamano
  2020-12-06  2:54 ` [PATCH 6/7] diffcore-rename: simplify and accelerate register_rename_src() Elijah Newren via GitGitGadget
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-06  2:54 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Inexact rename detection works by comparing all sources to all
destinations, computing similarities, and then finding the best matches
among those that are sufficiently similar.

However, it is preceded by exact rename detection that works by
checking if there are files with identical hashes.  If exact renames are
found, we can exclude some files from inexact rename detection.

The inexact rename detection loops over the full set of files, but
immediately skips those for which rename_dst[i].is_rename is true and
thus doesn't compare any sources to that destination.  As such, these
paths shouldn't be included in the progress counter.

For the eagle eyed, this change hints at an actual optimization -- the
first one I presented at Git Merge 2020.  I'll be submitting that
optimization later, once the basic merge-ort algorithm has merged.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 7270eb6af48..3d637ba4645 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -587,7 +587,7 @@ void diffcore_rename(struct diff_options *options)
 	if (options->show_rename_progress) {
 		progress = start_delayed_progress(
 				_("Performing inexact rename detection"),
-				(uint64_t)rename_dst_nr * (uint64_t)rename_src_nr);
+				(uint64_t)num_targets * (uint64_t)rename_src_nr);
 	}
 
 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_targets), sizeof(*mx));
@@ -626,7 +626,8 @@ void diffcore_rename(struct diff_options *options)
 			diff_free_filespec_blob(two);
 		}
 		dst_cnt++;
-		display_progress(progress, (uint64_t)(i+1)*(uint64_t)rename_src_nr);
+		display_progress(progress,
+				 (uint64_t)dst_cnt * (uint64_t)rename_src_nr);
 	}
 	stop_progress(&progress);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 6/7] diffcore-rename: simplify and accelerate register_rename_src()
  2020-12-06  2:54 [PATCH 0/7] diffcore-rename improvements Elijah Newren via GitGitGadget
                   ` (4 preceding siblings ...)
  2020-12-06  2:54 ` [PATCH 5/7] diffcore-rename: reduce jumpiness in progress counters Elijah Newren via GitGitGadget
@ 2020-12-06  2:54 ` Elijah Newren via GitGitGadget
  2020-12-09 22:40   ` Taylor Blau
  2020-12-10  2:51   ` Junio C Hamano
  2020-12-06  2:54 ` [PATCH 7/7] Accelerate rename_dst setup Elijah Newren via GitGitGadget
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-06  2:54 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

reigster_rename_src() took pains to create an array in rename_src which
was sorted by pathname of the contained diff_filepair.  However, the
fact that this array was sorted was not needed anywhere, and thus
represented wasted time.  Simply append to the end of the array, which
in a usecase of note saved 45% of diffcore_rename() setup time for me.

Technically, there is one difference in the end result for the code.  IF
the caller of diffcore-rename provides a set of pairs with a duplicate
filename in the sources (an erroneous input condition), the old code
would simply ignore the duplicate (without warning or error).  The new
code will simply swallow it and thus allow multiple pairings for the
"same" source file.  Checking for this error condition is expensive
(basically undoing the optimization) and the only existing callers
already avoid passing such an invalid input.  If someone really wants to
add some error checking here on this input condition, I believe they
should add a separate function which can be called before
diffcore_rename(), and then other callers that don't need additional
checks because of their design (such as merge-ort) can avoid the extra
overhead.

Also, note that I dropped the return type on the function since it was
unconditionally discarded anyway.

This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection,
diffcore_rename() setup time was a sizeable chunk of overall runtime.
This patch dropped execution time of rebasing 35 commits with lots of
renames by 2% overall.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 30 +++---------------------------
 1 file changed, 3 insertions(+), 27 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 3d637ba4645..816d2fbac44 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -76,36 +76,12 @@ static struct diff_rename_src {
 } *rename_src;
 static int rename_src_nr, rename_src_alloc;
 
-static struct diff_rename_src *register_rename_src(struct diff_filepair *p)
+static void register_rename_src(struct diff_filepair *p)
 {
-	int first, last;
-	struct diff_filespec *one = p->one;
-	unsigned short score = p->score;
-
-	first = 0;
-	last = rename_src_nr;
-	while (last > first) {
-		int next = first + ((last - first) >> 1);
-		struct diff_rename_src *src = &(rename_src[next]);
-		int cmp = strcmp(one->path, src->p->one->path);
-		if (!cmp)
-			return src;
-		if (cmp < 0) {
-			last = next;
-			continue;
-		}
-		first = next+1;
-	}
-
-	/* insert to make it at "first" */
 	ALLOC_GROW(rename_src, rename_src_nr + 1, rename_src_alloc);
+	rename_src[rename_src_nr].p = p;
+	rename_src[rename_src_nr].score = p->score;
 	rename_src_nr++;
-	if (first < rename_src_nr)
-		MOVE_ARRAY(rename_src + first + 1, rename_src + first,
-			   rename_src_nr - first - 1);
-	rename_src[first].p = p;
-	rename_src[first].score = score;
-	return &(rename_src[first]);
 }
 
 static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 7/7] Accelerate rename_dst setup
  2020-12-06  2:54 [PATCH 0/7] diffcore-rename improvements Elijah Newren via GitGitGadget
                   ` (5 preceding siblings ...)
  2020-12-06  2:54 ` [PATCH 6/7] diffcore-rename: simplify and accelerate register_rename_src() Elijah Newren via GitGitGadget
@ 2020-12-06  2:54 ` Elijah Newren via GitGitGadget
  2020-12-09 23:01   ` Taylor Blau
  2020-12-06  3:01 ` [PATCH 0/7] diffcore-rename improvements Elijah Newren
  2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
  8 siblings, 1 reply; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-06  2:54 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

register_rename_src() simply references the passed pair inside
rename_src.  In contrast, add_rename_dst() did something entirely
different for rename_dst.  Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL.  Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs.  This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary.  Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src.  Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.

Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go.  However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on).  Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index.  Doing that allows us to still have a
fast lookup without sorting the rename_dst array.  Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.

There is still a strcmp() in add_rename_dst() that I feel is
unnecessary, because I disagree with a regression test in t4058.
However, I'm trying to increase performance without changing behavior,
so I'll leave that fight for a different day (a TODO has been left in
the code about it).

This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime.  This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 161 ++++++++++++++++++++++------------------------
 1 file changed, 77 insertions(+), 84 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 816d2fbac44..fb3c2817c6b 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -5,67 +5,64 @@
 #include "cache.h"
 #include "diff.h"
 #include "diffcore.h"
-#include "object-store.h"
 #include "hashmap.h"
+#include "object-store.h"
 #include "progress.h"
 #include "promisor-remote.h"
+#include "strmap.h"
 
 /* Table of rename/copy destinations */
 
 static struct diff_rename_dst {
-	struct diff_filespec *two;
-	struct diff_filepair *pair;
+	struct diff_filepair *p;
+	struct diff_filespec *filespec_to_free;
+	int is_rename; /* false -> just a create; true -> rename or copy */
 } *rename_dst;
 static int rename_dst_nr, rename_dst_alloc;
+/* Mapping from break source pathname to break destination index */
+static struct strintmap *break_idx = NULL;
 
-static int find_rename_dst(struct diff_filespec *two)
-{
-	int first, last;
-
-	first = 0;
-	last = rename_dst_nr;
-	while (last > first) {
-		int next = first + ((last - first) >> 1);
-		struct diff_rename_dst *dst = &(rename_dst[next]);
-		int cmp = strcmp(two->path, dst->two->path);
-		if (!cmp)
-			return next;
-		if (cmp < 0) {
-			last = next;
-			continue;
-		}
-		first = next+1;
-	}
-	return -first - 1;
-}
-
-static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two)
+static struct diff_rename_dst *locate_rename_dst(struct diff_filepair *p)
 {
-	int ofs = find_rename_dst(two);
-	return ofs < 0 ? NULL : &rename_dst[ofs];
+	/* Lookup by p->ONE->path */
+	int idx = break_idx ? strintmap_get(break_idx, p->one->path) : -1;
+	return (idx == -1) ? NULL : &rename_dst[idx];
 }
 
 /*
  * Returns 0 on success, -1 if we found a duplicate.
  */
-static int add_rename_dst(struct diff_filespec *two)
+static int add_rename_dst(struct diff_filepair *p)
 {
-	int first = find_rename_dst(two);
-
-	if (first >= 0)
+	/*
+	 * See t4058; trees might have duplicate entries.  I think
+	 * trees with duplicate entries should be ignored and we
+	 * should just leave rename detection on; while the results
+	 * may be slightly harder to understand, that's merely a
+	 * result of the underlying tree making no sense.  But I
+	 * believe everything still works fine, the results do still
+	 * make sense, and the extra overhead of doing this checking
+	 * for a few historical weird trees from long ago seems like
+	 * the dog wagging the tail to me.
+	 *
+	 * However: I don't feel like fighting that battle right now.
+	 * For now, to keep the regression test passing, we have to
+	 * detect it.  Since the diff machinery passes these to us in
+	 * adjacent pairs, we just need to check to see if our name
+	 * matches the previous one.
+	 *
+	 * TODO: Dispense with this test, rip out the test in t4058, and
+	 * lobby folks for the change.
+	 */
+	if (rename_dst_nr > 0 &&
+	    !strcmp(rename_dst[rename_dst_nr-1].p->two->path, p->two->path))
 		return -1;
-	first = -first - 1;
 
-	/* insert to make it at "first" */
 	ALLOC_GROW(rename_dst, rename_dst_nr + 1, rename_dst_alloc);
+	rename_dst[rename_dst_nr].p = p;
+	rename_dst[rename_dst_nr].filespec_to_free = NULL;
+	rename_dst[rename_dst_nr].is_rename = 0;
 	rename_dst_nr++;
-	if (first < rename_dst_nr)
-		MOVE_ARRAY(rename_dst + first + 1, rename_dst + first,
-			   rename_dst_nr - first - 1);
-	rename_dst[first].two = alloc_filespec(two->path);
-	fill_filespec(rename_dst[first].two, &two->oid, two->oid_valid,
-		      two->mode);
-	rename_dst[first].pair = NULL;
 	return 0;
 }
 
@@ -78,6 +75,14 @@ static int rename_src_nr, rename_src_alloc;
 
 static void register_rename_src(struct diff_filepair *p)
 {
+	if (p->broken_pair) {
+		if (!break_idx) {
+			break_idx = xmalloc(sizeof(*break_idx));
+			strintmap_init(break_idx, -1);
+		}
+		strintmap_set(break_idx, p->one->path, rename_dst_nr);
+	}
+
 	ALLOC_GROW(rename_src, rename_src_nr + 1, rename_src_alloc);
 	rename_src[rename_src_nr].p = p;
 	rename_src[rename_src_nr].score = p->score;
@@ -117,14 +122,14 @@ static void prefetch(void *prefetch_options)
 	struct oid_array to_fetch = OID_ARRAY_INIT;
 
 	for (i = 0; i < rename_dst_nr; i++) {
-		if (rename_dst[i].pair)
+		if (rename_dst[i].p->renamed_pair)
 			/*
 			 * The loop in diffcore_rename() will not need these
 			 * blobs, so skip prefetching.
 			 */
 			continue; /* already found exact match */
 		diff_add_if_missing(options->repo, &to_fetch,
-				    rename_dst[i].two);
+				    rename_dst[i].p->two);
 	}
 	for (i = 0; i < rename_src_nr; i++) {
 		if (options->skip_unmodified &&
@@ -234,26 +239,24 @@ static int estimate_similarity(struct repository *r,
 
 static void record_rename_pair(int dst_index, int src_index, int score)
 {
-	struct diff_filespec *src, *dst;
-	struct diff_filepair *dp;
+	struct diff_filepair *src = rename_src[src_index].p;
+	struct diff_filepair *dst = rename_dst[dst_index].p;
 
-	if (rename_dst[dst_index].pair)
+	if (dst->renamed_pair)
 		die("internal error: dst already matched.");
 
-	src = rename_src[src_index].p->one;
-	src->rename_used++;
-	src->count++;
+	src->one->rename_used++;
+	src->one->count++;
 
-	dst = rename_dst[dst_index].two;
-	dst->count++;
+	rename_dst[dst_index].filespec_to_free = dst->one;
+	rename_dst[dst_index].is_rename = 1;
 
-	dp = diff_queue(NULL, src, dst);
-	dp->renamed_pair = 1;
-	if (!strcmp(src->path, dst->path))
-		dp->score = rename_src[src_index].score;
+	dst->one = src->one;
+	dst->renamed_pair = 1;
+	if (!strcmp(dst->one->path, dst->two->path))
+		dst->score = rename_src[src_index].score;
 	else
-		dp->score = score;
-	rename_dst[dst_index].pair = dp;
+		dst->score = score;
 }
 
 /*
@@ -299,7 +302,7 @@ static int find_identical_files(struct hashmap *srcs,
 				struct diff_options *options)
 {
 	int renames = 0;
-	struct diff_filespec *target = rename_dst[dst_index].two;
+	struct diff_filespec *target = rename_dst[dst_index].p->two;
 	struct file_similarity *p, *best = NULL;
 	int i = 100, best_score = -1;
 	unsigned int hash = hash_filespec(options->repo, target);
@@ -460,7 +463,7 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
 		    (mx[i].score < minimum_score))
 			break; /* there is no more usable pair. */
 		dst = &rename_dst[mx[i].dst];
-		if (dst->pair)
+		if (dst->is_rename)
 			continue; /* already done, either exact or fuzzy. */
 		if (!copies && rename_src[mx[i].src].p->one->rename_used)
 			continue;
@@ -495,7 +498,7 @@ void diffcore_rename(struct diff_options *options)
 			else if (!options->flags.rename_empty &&
 				 is_empty_blob_oid(&p->two->oid))
 				continue;
-			else if (add_rename_dst(p->two) < 0) {
+			else if (add_rename_dst(p) < 0) {
 				warning("skipping rename detection, detected"
 					" duplicate destination '%s'",
 					p->two->path);
@@ -568,10 +571,10 @@ void diffcore_rename(struct diff_options *options)
 
 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_targets), sizeof(*mx));
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
-		struct diff_filespec *two = rename_dst[i].two;
+		struct diff_filespec *two = rename_dst[i].p->two;
 		struct diff_score *m;
 
-		if (rename_dst[i].pair)
+		if (rename_dst[i].is_rename)
 			continue; /* dealt with exact match already. */
 
 		m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
@@ -628,22 +631,8 @@ void diffcore_rename(struct diff_options *options)
 			diff_q(&outq, p);
 		}
 		else if (!DIFF_FILE_VALID(p->one) && DIFF_FILE_VALID(p->two)) {
-			/*
-			 * Addition
-			 *
-			 * We would output this add record if it has
-			 * not been turned into a rename/copy already.
-			 */
-			struct diff_rename_dst *dst = locate_rename_dst(p->two);
-			if (dst && dst->pair) {
-				diff_q(&outq, dst->pair);
-				pair_to_free = p;
-			}
-			else
-				/* no matching rename/copy source, so
-				 * record this as an addition.
-				 */
-				diff_q(&outq, p);
+			/* Addition */
+			diff_q(&outq, p);
 		}
 		else if (DIFF_FILE_VALID(p->one) && !DIFF_FILE_VALID(p->two)) {
 			/*
@@ -664,8 +653,9 @@ void diffcore_rename(struct diff_options *options)
 			 */
 			if (DIFF_PAIR_BROKEN(p)) {
 				/* broken delete */
-				struct diff_rename_dst *dst = locate_rename_dst(p->one);
-				if (dst && dst->pair)
+				struct diff_rename_dst *dst = locate_rename_dst(p);
+				assert(dst);
+				if (dst->is_rename)
 					/* counterpart is now rename/copy */
 					pair_to_free = p;
 			}
@@ -675,16 +665,14 @@ void diffcore_rename(struct diff_options *options)
 					pair_to_free = p;
 			}
 
-			if (pair_to_free)
-				;
-			else
+			if (!pair_to_free)
 				diff_q(&outq, p);
 		}
 		else if (!diff_unmodified_pair(p))
 			/* all the usual ones need to be kept */
 			diff_q(&outq, p);
 		else
-			/* no need to keep unmodified pairs */
+			/* no need to keep unmodified pairs; FIXME: remove earlier? */
 			pair_to_free = p;
 
 		if (pair_to_free)
@@ -697,11 +685,16 @@ void diffcore_rename(struct diff_options *options)
 	diff_debug_queue("done collapsing", q);
 
 	for (i = 0; i < rename_dst_nr; i++)
-		free_filespec(rename_dst[i].two);
+		if (rename_dst[i].filespec_to_free)
+			free_filespec(rename_dst[i].filespec_to_free);
 
 	FREE_AND_NULL(rename_dst);
 	rename_dst_nr = rename_dst_alloc = 0;
 	FREE_AND_NULL(rename_src);
 	rename_src_nr = rename_src_alloc = 0;
+	if (break_idx) {
+		strintmap_clear(break_idx);
+		FREE_AND_NULL(break_idx);
+	}
 	return;
 }
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/7] diffcore-rename improvements
  2020-12-06  2:54 [PATCH 0/7] diffcore-rename improvements Elijah Newren via GitGitGadget
                   ` (6 preceding siblings ...)
  2020-12-06  2:54 ` [PATCH 7/7] Accelerate rename_dst setup Elijah Newren via GitGitGadget
@ 2020-12-06  3:01 ` Elijah Newren
  2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
  8 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren @ 2020-12-06  3:01 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: Git Mailing List, Christian Couder, Kaartic Sivaraam, Sangeeta NB

Hi Christian, Kaartic, and Sangeeta,

On Sat, Dec 5, 2020 at 6:54 PM Elijah Newren via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This patch series include 4 small code cleanups, 1 small bug fix (or perhaps
> just a UI improvement -- it's a very minor issue either way), and 2
> performance improvements. The first 6 patches are relatively easy to review,
> while the last one...is a bit more involved (but provided the biggest
> performance boost).

If I've heard correctly, you three are now interested in rename
detection and thus diffcore-rename.c.  You might find this patch
series interesting; even if you don't want to review the details, just
looking over some of the commit messages might help orient you to the
codebase that you'll be working with in the upcoming months.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/7] diffcore-rename: avoid usage of global in too_many_rename_candidates()
  2020-12-06  2:54 ` [PATCH 1/7] diffcore-rename: avoid usage of global in too_many_rename_candidates() Elijah Newren via GitGitGadget
@ 2020-12-09 22:06   ` Taylor Blau
  0 siblings, 0 replies; 55+ messages in thread
From: Taylor Blau @ 2020-12-09 22:06 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren

On Sun, Dec 06, 2020 at 02:54:30AM +0000, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
>
> too_many_rename_candidates() got the number of rename targets via an
> argument to the function, but the number of rename sources via a global
> variable.  That felt rather inconsistent.  Pass in the number of rename
> sources as an argument as well.
>
> While we are at it... We had a local variable, num_src, that served two
> purposes.  Initially it was set to this global value, but later was used
> for counting a subset of the number of sources.  Since we now have a
> function argument for the former usage, introduce a clearer variable
> name for the latter usage.
>
> This patch has no behavioral changes; it's just renaming and passing an
> argument instead of grabbing it from the global namespace.  (You may
> find it easier to view the patch using git diff's --color-words option.)
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  diffcore-rename.c | 23 +++++++++++------------
>  1 file changed, 11 insertions(+), 12 deletions(-)
>
> diff --git a/diffcore-rename.c b/diffcore-rename.c
> index d367a6d2443..68ddf51a2a1 100644
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -434,12 +434,11 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
>   * 1 if we need to disable inexact rename detection;
>   * 2 if we would be under the limit if we were given -C instead of -C -C.
>   */
> -static int too_many_rename_candidates(int num_create,
> +static int too_many_rename_candidates(int num_targets, int num_sources,

OK, so num_create becomes num_targets, and the global num_src becomes a
new parameter named num_sources.

Makes sense, but it took me a second to figure all of that out. I don't
think that you need to split this across two patches (e.g., one to
rename num_targets, and another to introduce the num_sources parameter),
but including the new names for each of these variables in the patch
message might make this clearer to follow.

>  				      struct diff_options *options)
>  {
>  	int rename_limit = options->rename_limit;
> -	int num_src = rename_src_nr;
> -	int i;
> +	int i, limited_sources;
> [...]

Everything else from here looks fine, and indeed viewing this with
--color-words made it much easier to verify.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/7] diffcore-rename: remove unnecessary if-clause
  2020-12-06  2:54 ` [PATCH 2/7] diffcore-rename: remove unnecessary if-clause Elijah Newren via GitGitGadget
@ 2020-12-09 22:10   ` Taylor Blau
  2020-12-10  0:32     ` Elijah Newren
  2020-12-10  2:03     ` Junio C Hamano
  0 siblings, 2 replies; 55+ messages in thread
From: Taylor Blau @ 2020-12-09 22:10 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren

On Sun, Dec 06, 2020 at 02:54:31AM +0000, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
>
> diffcore-rename had two different checks of the form
>
>     if ((a < limit || b < limit) &&
>         a * b <= limit * limit)
>
> Since these are all non-negative integers, this can be simplified to
>
>     if (a * b <= limit * limit)

Makes sense.

> The only advantage of the former would be in avoiding a couple
> multiplications in the rare case that both a and b are BOTH very large.
> I see no reason for such an optimization given that this code is not in
> any kind of loop.  Prefer code simplicity here and change to the latter
> form.

If you were really paranoid, you could perform these checks with
unsigned_mult_overflows(), but I don't think that it's worth doing so
here.

> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  diffcore-rename.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/diffcore-rename.c b/diffcore-rename.c
> index 68ddf51a2a1..0f8fce9293e 100644
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -450,9 +450,8 @@ static int too_many_rename_candidates(int num_targets, int num_sources,
>  	 */
>  	if (rename_limit <= 0)
>  		rename_limit = 32767;
> -	if ((num_targets <= rename_limit || num_sources <= rename_limit) &&
> -	    ((uint64_t)num_targets * (uint64_t)num_sources
> -	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
> +	if ((uint64_t)num_targets * (uint64_t)num_sources
> +	    <= (uint64_t)rename_limit * (uint64_t)rename_limit)

One small nit here (and below) is that not all of these need casting.
Only one operand of each multiplication needs to be widened for the
compiler to widen the other, too. So, I'd write this instead as:

> +	if ((num_targets * (uint64_t)num_sources) <=
> +     (rename_limit * (uint64_t)rename_limit))

or something. (I tend to prefer the cast on the right-most operand,
since it makes clear that there's no need to cast the "first" operand,
and that casting either will do the trick).

>  		return 0;
>
>  	options->needed_rename_limit =
> @@ -468,9 +467,8 @@ static int too_many_rename_candidates(int num_targets, int num_sources,
>  			continue;
>  		limited_sources++;
>  	}
> -	if ((num_targets <= rename_limit || limited_sources <= rename_limit) &&
> -	    ((uint64_t)num_targets * (uint64_t)limited_sources
> -	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
> +	if ((uint64_t)num_targets * (uint64_t)limited_sources
> +	    <= (uint64_t)rename_limit * (uint64_t)rename_limit)

Same notes here, of course. I was hoping that we could only do this
multiplication once, but it looks like limited_sources grows between the
two checks, so we have to repeat it here, I suppose.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 5/7] diffcore-rename: reduce jumpiness in progress counters
  2020-12-06  2:54 ` [PATCH 5/7] diffcore-rename: reduce jumpiness in progress counters Elijah Newren via GitGitGadget
@ 2020-12-09 22:24   ` Taylor Blau
  2020-12-10  2:36   ` Junio C Hamano
  1 sibling, 0 replies; 55+ messages in thread
From: Taylor Blau @ 2020-12-09 22:24 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren

On Sun, Dec 06, 2020 at 02:54:34AM +0000, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
>
> Inexact rename detection works by comparing all sources to all
> destinations, computing similarities, and then finding the best matches
> among those that are sufficiently similar.
>
> However, it is preceded by exact rename detection that works by
> checking if there are files with identical hashes.  If exact renames are
> found, we can exclude some files from inexact rename detection.
>
> The inexact rename detection loops over the full set of files, but
> immediately skips those for which rename_dst[i].is_rename is true and
> thus doesn't compare any sources to that destination.  As such, these
> paths shouldn't be included in the progress counter.
>
> For the eagle eyed, this change hints at an actual optimization -- the
> first one I presented at Git Merge 2020.  I'll be submitting that
> optimization later, once the basic merge-ort algorithm has merged.

;-) I was hoping that that was the case.

> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  diffcore-rename.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/diffcore-rename.c b/diffcore-rename.c
> index 7270eb6af48..3d637ba4645 100644
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -587,7 +587,7 @@ void diffcore_rename(struct diff_options *options)
>  	if (options->show_rename_progress) {
>  		progress = start_delayed_progress(
>  				_("Performing inexact rename detection"),
> -				(uint64_t)rename_dst_nr * (uint64_t)rename_src_nr);
> +				(uint64_t)num_targets * (uint64_t)rename_src_nr);
>  	}
>
>  	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_targets), sizeof(*mx));
> @@ -626,7 +626,8 @@ void diffcore_rename(struct diff_options *options)
>  			diff_free_filespec_blob(two);
>  		}
>  		dst_cnt++;
> -		display_progress(progress, (uint64_t)(i+1)*(uint64_t)rename_src_nr);
> +		display_progress(progress,
> +				 (uint64_t)dst_cnt * (uint64_t)rename_src_nr);

Both of these spots need only one cast, but the patch itself looks
correct (with or without dropping a cast on one of the operands in each
expression).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/7] diffcore-rename: simplify and accelerate register_rename_src()
  2020-12-06  2:54 ` [PATCH 6/7] diffcore-rename: simplify and accelerate register_rename_src() Elijah Newren via GitGitGadget
@ 2020-12-09 22:40   ` Taylor Blau
  2020-12-10  0:25     ` Elijah Newren
  2020-12-10  2:51   ` Junio C Hamano
  1 sibling, 1 reply; 55+ messages in thread
From: Taylor Blau @ 2020-12-09 22:40 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren

On Sun, Dec 06, 2020 at 02:54:35AM +0000, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
>
> reigster_rename_src() took pains to create an array in rename_src which
> was sorted by pathname of the contained diff_filepair.  However, the
> fact that this array was sorted was not needed anywhere, and thus
> represented wasted time.  Simply append to the end of the array, which
> in a usecase of note saved 45% of diffcore_rename() setup time for me.
>
> Technically, there is one difference in the end result for the code.  IF

s/IF/If ?

> the caller of diffcore-rename provides a set of pairs with a duplicate
> filename in the sources (an erroneous input condition), the old code
> would simply ignore the duplicate (without warning or error).  The new
> code will simply swallow it and thus allow multiple pairings for the
> "same" source file.  Checking for this error condition is expensive
> (basically undoing the optimization) and the only existing callers
> already avoid passing such an invalid input.  If someone really wants to
> add some error checking here on this input condition, I believe they
> should add a separate function which can be called before
> diffcore_rename(), and then other callers that don't need additional
> checks because of their design (such as merge-ort) can avoid the extra
> overhead.

It seems like this is currently impossible to trigger, making any extra
(expensive) checking of it worthless, no?

> Also, note that I dropped the return type on the function since it was
> unconditionally discarded anyway.
>
> This patch is being submitted in a different order than its original
> development, but in a large rebase of many commits with lots of renames
> and with several optimizations to inexact rename detection,
> diffcore_rename() setup time was a sizeable chunk of overall runtime.
> This patch dropped execution time of rebasing 35 commits with lots of
> renames by 2% overall.

Neat!

> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  diffcore-rename.c | 30 +++---------------------------
>  1 file changed, 3 insertions(+), 27 deletions(-)
>
> diff --git a/diffcore-rename.c b/diffcore-rename.c
> index 3d637ba4645..816d2fbac44 100644
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -76,36 +76,12 @@ static struct diff_rename_src {
> [...]

This looks obviously correct.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 7/7] Accelerate rename_dst setup
  2020-12-06  2:54 ` [PATCH 7/7] Accelerate rename_dst setup Elijah Newren via GitGitGadget
@ 2020-12-09 23:01   ` Taylor Blau
  2020-12-10  0:57     ` Elijah Newren
  0 siblings, 1 reply; 55+ messages in thread
From: Taylor Blau @ 2020-12-09 23:01 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren

On Sun, Dec 06, 2020 at 02:54:36AM +0000, Elijah Newren via GitGitGadget wrote:
> diff --git a/diffcore-rename.c b/diffcore-rename.c
> index 816d2fbac44..fb3c2817c6b 100644
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -5,67 +5,64 @@
>  #include "cache.h"
>  #include "diff.h"
>  #include "diffcore.h"
> -#include "object-store.h"
>  #include "hashmap.h"
> +#include "object-store.h"

Shuffling around this object-store.h include looks like noise to me, but
maybe there's a good reason for doing it.

>  #include "progress.h"
>  #include "promisor-remote.h"
> +#include "strmap.h"
>
>  /* Table of rename/copy destinations */
>
>  static struct diff_rename_dst {
> -	struct diff_filespec *two;
> -	struct diff_filepair *pair;
> +	struct diff_filepair *p;
> +	struct diff_filespec *filespec_to_free;
> +	int is_rename; /* false -> just a create; true -> rename or copy */
>  } *rename_dst;
>  static int rename_dst_nr, rename_dst_alloc;
> +/* Mapping from break source pathname to break destination index */
> +static struct strintmap *break_idx = NULL;
>
> -static int find_rename_dst(struct diff_filespec *two)
> -{
> -	int first, last;
> -
> -	first = 0;
> -	last = rename_dst_nr;
> -	while (last > first) {
> -		int next = first + ((last - first) >> 1);
> -		struct diff_rename_dst *dst = &(rename_dst[next]);
> -		int cmp = strcmp(two->path, dst->two->path);
> -		if (!cmp)
> -			return next;
> -		if (cmp < 0) {
> -			last = next;
> -			continue;
> -		}
> -		first = next+1;
> -	}
> -	return -first - 1;
> -}
> -
> -static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two)
> +static struct diff_rename_dst *locate_rename_dst(struct diff_filepair *p)
>  {
> -	int ofs = find_rename_dst(two);
> -	return ofs < 0 ? NULL : &rename_dst[ofs];
> +	/* Lookup by p->ONE->path */
> +	int idx = break_idx ? strintmap_get(break_idx, p->one->path) : -1;

I had to lookup the behavior of strintmap_get() to realize that it
returns the map's "default value" to figure out why this double
ternary was necessary, but I think that it is.

Ideally, if break_idx is non-NULL, then we ought to be able to immediately
return NULL, but it's possible that break_idx is non-NULL and simply
doesn't contain p->one->path, in which case we also want to return NULL.

So, I think this is as clear as it can be.

> +	return (idx == -1) ? NULL : &rename_dst[idx];
>  }
>
>  /*
>   * Returns 0 on success, -1 if we found a duplicate.
>   */
> -static int add_rename_dst(struct diff_filespec *two)
> +static int add_rename_dst(struct diff_filepair *p)
>  {
> -	int first = find_rename_dst(two);
> -
> -	if (first >= 0)
> +	/*
> +	 * See t4058; trees might have duplicate entries.  I think
> +	 * trees with duplicate entries should be ignored and we
> +	 * should just leave rename detection on; while the results
> +	 * may be slightly harder to understand, that's merely a
> +	 * result of the underlying tree making no sense.  But I
> +	 * believe everything still works fine, the results do still
> +	 * make sense, and the extra overhead of doing this checking
> +	 * for a few historical weird trees from long ago seems like
> +	 * the dog wagging the tail to me.
> +	 *
> +	 * However: I don't feel like fighting that battle right now.
> +	 * For now, to keep the regression test passing, we have to
> +	 * detect it.  Since the diff machinery passes these to us in
> +	 * adjacent pairs, we just need to check to see if our name
> +	 * matches the previous one.
> +	 *
> +	 * TODO: Dispense with this test, rip out the test in t4058, and
> +	 * lobby folks for the change.
> +	 */
> +	if (rename_dst_nr > 0 &&
> +	    !strcmp(rename_dst[rename_dst_nr-1].p->two->path, p->two->path))
>  		return -1;
> -	first = -first - 1;
>
> -	/* insert to make it at "first" */
>  	ALLOC_GROW(rename_dst, rename_dst_nr + 1, rename_dst_alloc);
> +	rename_dst[rename_dst_nr].p = p;
> +	rename_dst[rename_dst_nr].filespec_to_free = NULL;
> +	rename_dst[rename_dst_nr].is_rename = 0;
>  	rename_dst_nr++;
> -	if (first < rename_dst_nr)
> -		MOVE_ARRAY(rename_dst + first + 1, rename_dst + first,
> -			   rename_dst_nr - first - 1);
> -	rename_dst[first].two = alloc_filespec(two->path);
> -	fill_filespec(rename_dst[first].two, &two->oid, two->oid_valid,
> -		      two->mode);
> -	rename_dst[first].pair = NULL;
>  	return 0;

Very nice, this is much simpler than what was here before.

> @@ -78,6 +75,14 @@ static int rename_src_nr, rename_src_alloc;
>
>  static void register_rename_src(struct diff_filepair *p)
>  {
> +	if (p->broken_pair) {
> +		if (!break_idx) {
> +			break_idx = xmalloc(sizeof(*break_idx));
> +			strintmap_init(break_idx, -1);
> +		}
> +		strintmap_set(break_idx, p->one->path, rename_dst_nr);
> +	}
> +

Makes sense.

> @@ -664,8 +653,9 @@ void diffcore_rename(struct diff_options *options)
>  			 */
>  			if (DIFF_PAIR_BROKEN(p)) {
>  				/* broken delete */
> -				struct diff_rename_dst *dst = locate_rename_dst(p->one);
> -				if (dst && dst->pair)
> +				struct diff_rename_dst *dst = locate_rename_dst(p);
> +				assert(dst);

You're not hurting anything, but I'm not convinced that this assert() is
buying you anything that the line immediately below it isn't already
doing.

The rest of the patch looks good to me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/7] diffcore-rename: simplify and accelerate register_rename_src()
  2020-12-09 22:40   ` Taylor Blau
@ 2020-12-10  0:25     ` Elijah Newren
  2020-12-10  0:41       ` Taylor Blau
  0 siblings, 1 reply; 55+ messages in thread
From: Elijah Newren @ 2020-12-10  0:25 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Elijah Newren via GitGitGadget, Git Mailing List

On Wed, Dec 9, 2020 at 2:40 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Sun, Dec 06, 2020 at 02:54:35AM +0000, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > reigster_rename_src() took pains to create an array in rename_src which
> > was sorted by pathname of the contained diff_filepair.  However, the
> > fact that this array was sorted was not needed anywhere, and thus
> > represented wasted time.  Simply append to the end of the array, which
> > in a usecase of note saved 45% of diffcore_rename() setup time for me.
> >
> > Technically, there is one difference in the end result for the code.  IF
>
> s/IF/If ?

Indeed; will fix.

> > the caller of diffcore-rename provides a set of pairs with a duplicate
> > filename in the sources (an erroneous input condition), the old code
> > would simply ignore the duplicate (without warning or error).  The new
> > code will simply swallow it and thus allow multiple pairings for the
> > "same" source file.  Checking for this error condition is expensive
> > (basically undoing the optimization) and the only existing callers
> > already avoid passing such an invalid input.  If someone really wants to
> > add some error checking here on this input condition, I believe they
> > should add a separate function which can be called before
> > diffcore_rename(), and then other callers that don't need additional
> > checks because of their design (such as merge-ort) can avoid the extra
> > overhead.
>
> It seems like this is currently impossible to trigger, making any extra
> (expensive) checking of it worthless, no?

I believe that's what it currently amounts to, and I debated just
ripping the paragraph out.

However, a natural question that can easily arise for current
reviewers or future readers of the patch is why was there ever sorting
in the first place if the sorting isn't used?  That question came up
for me, and I dug into it.  Sorting is also used with rename_dst in
nearby code in the file, and there are a few reasons for it there.
The quick indexing that rename_dst needs doesn't apply to rename_src,
but it's not as obvious whether the
broken-trees-with-duplicate-entries rationale applies or not.  I spent
a while digging into it.  (And it's possible that I didn't correctly
check the callers or that in the seven months since I wrote this
message someone added another caller of this code that does pass
multiple diff_filepair-s for the "same" source file.)  Anyway, this
paragraph exists because I had to go down a goose chase to answer this
natural question, and I wanted to provide an answer to anyone else
asking the same question.  Also, in the off chance that anyone did
want to add callers that passed multiple copies of any source file, I
wanted to point out that this modified algorithm would result in a
slight behavioral difference, but that otherwise the modified
algorithm gives identical results.  (And if some future reader
stumbled on the paragraph because they had made such a change, I
wanted to provide a quick suggestion of how to get what they wanted
without adversely affecting performance.)

I hope that helps.  I'm sorry if my worrying about these cases and
discussing them made the patch harder to read or review, but I feel
like diffcore-rename is one of those low-level components you want to
be careful with.  diffcore-rename is one of those parts of the code
where a bug in it might not result in a directly observable breakage
to users but in some secondary or tertiary side-effect showing weird
results (e.g. in a merge you won't necessarily see that A was renamed
to B, instead you get a three-way content merge of original A, other
A, and new B -- or don't get a three-way content merge you might
expect).

> > Also, note that I dropped the return type on the function since it was
> > unconditionally discarded anyway.
> >
> > This patch is being submitted in a different order than its original
> > development, but in a large rebase of many commits with lots of renames
> > and with several optimizations to inexact rename detection,
> > diffcore_rename() setup time was a sizeable chunk of overall runtime.
> > This patch dropped execution time of rebasing 35 commits with lots of
> > renames by 2% overall.
>
> Neat!
>
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> >  diffcore-rename.c | 30 +++---------------------------
> >  1 file changed, 3 insertions(+), 27 deletions(-)
> >
> > diff --git a/diffcore-rename.c b/diffcore-rename.c
> > index 3d637ba4645..816d2fbac44 100644
> > --- a/diffcore-rename.c
> > +++ b/diffcore-rename.c
> > @@ -76,36 +76,12 @@ static struct diff_rename_src {
> > [...]
>
> This looks obviously correct.

Thanks for taking a look!

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/7] diffcore-rename: remove unnecessary if-clause
  2020-12-09 22:10   ` Taylor Blau
@ 2020-12-10  0:32     ` Elijah Newren
  2020-12-10  2:03     ` Junio C Hamano
  1 sibling, 0 replies; 55+ messages in thread
From: Elijah Newren @ 2020-12-10  0:32 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Elijah Newren via GitGitGadget, Git Mailing List

On Wed, Dec 9, 2020 at 2:10 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Sun, Dec 06, 2020 at 02:54:31AM +0000, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > diffcore-rename had two different checks of the form
> >
> >     if ((a < limit || b < limit) &&
> >         a * b <= limit * limit)
> >
> > Since these are all non-negative integers, this can be simplified to
> >
> >     if (a * b <= limit * limit)
>
> Makes sense.
>
> > The only advantage of the former would be in avoiding a couple
> > multiplications in the rare case that both a and b are BOTH very large.
> > I see no reason for such an optimization given that this code is not in
> > any kind of loop.  Prefer code simplicity here and change to the latter
> > form.
>
> If you were really paranoid, you could perform these checks with
> unsigned_mult_overflows(), but I don't think that it's worth doing so
> here.
>
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> >  diffcore-rename.c | 10 ++++------
> >  1 file changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/diffcore-rename.c b/diffcore-rename.c
> > index 68ddf51a2a1..0f8fce9293e 100644
> > --- a/diffcore-rename.c
> > +++ b/diffcore-rename.c
> > @@ -450,9 +450,8 @@ static int too_many_rename_candidates(int num_targets, int num_sources,
> >        */
> >       if (rename_limit <= 0)
> >               rename_limit = 32767;
> > -     if ((num_targets <= rename_limit || num_sources <= rename_limit) &&
> > -         ((uint64_t)num_targets * (uint64_t)num_sources
> > -          <= (uint64_t)rename_limit * (uint64_t)rename_limit))
> > +     if ((uint64_t)num_targets * (uint64_t)num_sources
> > +         <= (uint64_t)rename_limit * (uint64_t)rename_limit)
>
> One small nit here (and below) is that not all of these need casting.
> Only one operand of each multiplication needs to be widened for the
> compiler to widen the other, too. So, I'd write this instead as:
>
> > +     if ((num_targets * (uint64_t)num_sources) <=
> > +     (rename_limit * (uint64_t)rename_limit))
>
> or something. (I tend to prefer the cast on the right-most operand,
> since it makes clear that there's no need to cast the "first" operand,
> and that casting either will do the trick).

Yeah, I think that reads slightly better too, but on the other hand it
does make the diff slightly harder to read.  *shrug*.

> >               return 0;
> >
> >       options->needed_rename_limit =
> > @@ -468,9 +467,8 @@ static int too_many_rename_candidates(int num_targets, int num_sources,
> >                       continue;
> >               limited_sources++;
> >       }
> > -     if ((num_targets <= rename_limit || limited_sources <= rename_limit) &&
> > -         ((uint64_t)num_targets * (uint64_t)limited_sources
> > -          <= (uint64_t)rename_limit * (uint64_t)rename_limit))
> > +     if ((uint64_t)num_targets * (uint64_t)limited_sources
> > +         <= (uint64_t)rename_limit * (uint64_t)rename_limit)
>
> Same notes here, of course. I was hoping that we could only do this
> multiplication once, but it looks like limited_sources grows between the
> two checks, so we have to repeat it here, I suppose.

The right hand side of the expression is the same -- rename_limit *
rename_limit -- so it could be stored off (though I don't think
there's much point in doing so unless it made the code clearer; this
is not remotely close to a hot path).  However, in the left hand side,
it's not so much that one of the variables has changed since the last
check, it's that it uses a different set of variables in the check.
In particular, it replaces 'num_sources' with 'limited_sources'.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/7] diffcore-rename: simplify and accelerate register_rename_src()
  2020-12-10  0:25     ` Elijah Newren
@ 2020-12-10  0:41       ` Taylor Blau
  0 siblings, 0 replies; 55+ messages in thread
From: Taylor Blau @ 2020-12-10  0:41 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Taylor Blau, Elijah Newren via GitGitGadget, Git Mailing List

On Wed, Dec 09, 2020 at 04:25:18PM -0800, Elijah Newren wrote:
> On Wed, Dec 9, 2020 at 2:40 PM Taylor Blau <me@ttaylorr.com> wrote:
> > > Technically, there is one difference in the end result for the code.  IF
> >
> > s/IF/If ?
>
> Indeed; will fix.

:-). In fairness, I did read it with _emphasis_, so it at least gave me
a laugh!

> > > the caller of diffcore-rename provides a set of pairs with a duplicate
> > > filename in the sources (an erroneous input condition), the old code
> > > would simply ignore the duplicate (without warning or error).  The new
> > > code will simply swallow it and thus allow multiple pairings for the
> > > "same" source file.  Checking for this error condition is expensive
> > > (basically undoing the optimization) and the only existing callers
> > > already avoid passing such an invalid input.  If someone really wants to
> > > add some error checking here on this input condition, I believe they
> > > should add a separate function which can be called before
> > > diffcore_rename(), and then other callers that don't need additional
> > > checks because of their design (such as merge-ort) can avoid the extra
> > > overhead.
> >
> > It seems like this is currently impossible to trigger, making any extra
> > (expensive) checking of it worthless, no?
>
> I believe that's what it currently amounts to, and I debated just
> ripping the paragraph out.
>
> [ ... ]

No, no need to be sorry: this is exactly the sort of discussion I had
hoped to provoke by sanity checking my understanding of what you had
written. I think that the paragraph does belong in the patch message,
and interested readers can find our discussion here.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 7/7] Accelerate rename_dst setup
  2020-12-09 23:01   ` Taylor Blau
@ 2020-12-10  0:57     ` Elijah Newren
  2020-12-10  1:43       ` Junio C Hamano
  0 siblings, 1 reply; 55+ messages in thread
From: Elijah Newren @ 2020-12-10  0:57 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Elijah Newren via GitGitGadget, Git Mailing List

On Wed, Dec 9, 2020 at 3:01 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Sun, Dec 06, 2020 at 02:54:36AM +0000, Elijah Newren via GitGitGadget wrote:
> > diff --git a/diffcore-rename.c b/diffcore-rename.c
> > index 816d2fbac44..fb3c2817c6b 100644
> > --- a/diffcore-rename.c
> > +++ b/diffcore-rename.c
> > @@ -5,67 +5,64 @@
> >  #include "cache.h"
> >  #include "diff.h"
> >  #include "diffcore.h"
> > -#include "object-store.h"
> >  #include "hashmap.h"
> > +#include "object-store.h"
>
> Shuffling around this object-store.h include looks like noise to me, but
> maybe there's a good reason for doing it.

Um, well...I guess I couldn't help myself from alphabetizing the
#include list (at least the portion after the initial "cache.h" or
"git-compat-util.h" must come first).  ;-)

I probably should have done it in a separate patch.

> >  #include "progress.h"
> >  #include "promisor-remote.h"
> > +#include "strmap.h"
> >
> >  /* Table of rename/copy destinations */
> >
> >  static struct diff_rename_dst {
> > -     struct diff_filespec *two;
> > -     struct diff_filepair *pair;
> > +     struct diff_filepair *p;
> > +     struct diff_filespec *filespec_to_free;
> > +     int is_rename; /* false -> just a create; true -> rename or copy */
> >  } *rename_dst;
> >  static int rename_dst_nr, rename_dst_alloc;
> > +/* Mapping from break source pathname to break destination index */
> > +static struct strintmap *break_idx = NULL;
> >
> > -static int find_rename_dst(struct diff_filespec *two)
> > -{
> > -     int first, last;
> > -
> > -     first = 0;
> > -     last = rename_dst_nr;
> > -     while (last > first) {
> > -             int next = first + ((last - first) >> 1);
> > -             struct diff_rename_dst *dst = &(rename_dst[next]);
> > -             int cmp = strcmp(two->path, dst->two->path);
> > -             if (!cmp)
> > -                     return next;
> > -             if (cmp < 0) {
> > -                     last = next;
> > -                     continue;
> > -             }
> > -             first = next+1;
> > -     }
> > -     return -first - 1;
> > -}
> > -
> > -static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two)
> > +static struct diff_rename_dst *locate_rename_dst(struct diff_filepair *p)
> >  {
> > -     int ofs = find_rename_dst(two);
> > -     return ofs < 0 ? NULL : &rename_dst[ofs];
> > +     /* Lookup by p->ONE->path */
> > +     int idx = break_idx ? strintmap_get(break_idx, p->one->path) : -1;
>
> I had to lookup the behavior of strintmap_get() to realize that it
> returns the map's "default value" to figure out why this double
> ternary was necessary, but I think that it is.
>
> Ideally, if break_idx is non-NULL, then we ought to be able to immediately
> return NULL, but it's possible that break_idx is non-NULL and simply
> doesn't contain p->one->path, in which case we also want to return NULL.
>
> So, I think this is as clear as it can be.
>
> > +     return (idx == -1) ? NULL : &rename_dst[idx];
> >  }
> >
> >  /*
> >   * Returns 0 on success, -1 if we found a duplicate.
> >   */
> > -static int add_rename_dst(struct diff_filespec *two)
> > +static int add_rename_dst(struct diff_filepair *p)
> >  {
> > -     int first = find_rename_dst(two);
> > -
> > -     if (first >= 0)
> > +     /*
> > +      * See t4058; trees might have duplicate entries.  I think
> > +      * trees with duplicate entries should be ignored and we
> > +      * should just leave rename detection on; while the results
> > +      * may be slightly harder to understand, that's merely a
> > +      * result of the underlying tree making no sense.  But I
> > +      * believe everything still works fine, the results do still
> > +      * make sense, and the extra overhead of doing this checking
> > +      * for a few historical weird trees from long ago seems like
> > +      * the dog wagging the tail to me.
> > +      *
> > +      * However: I don't feel like fighting that battle right now.
> > +      * For now, to keep the regression test passing, we have to
> > +      * detect it.  Since the diff machinery passes these to us in
> > +      * adjacent pairs, we just need to check to see if our name
> > +      * matches the previous one.
> > +      *
> > +      * TODO: Dispense with this test, rip out the test in t4058, and
> > +      * lobby folks for the change.
> > +      */
> > +     if (rename_dst_nr > 0 &&
> > +         !strcmp(rename_dst[rename_dst_nr-1].p->two->path, p->two->path))
> >               return -1;
> > -     first = -first - 1;
> >
> > -     /* insert to make it at "first" */
> >       ALLOC_GROW(rename_dst, rename_dst_nr + 1, rename_dst_alloc);
> > +     rename_dst[rename_dst_nr].p = p;
> > +     rename_dst[rename_dst_nr].filespec_to_free = NULL;
> > +     rename_dst[rename_dst_nr].is_rename = 0;
> >       rename_dst_nr++;
> > -     if (first < rename_dst_nr)
> > -             MOVE_ARRAY(rename_dst + first + 1, rename_dst + first,
> > -                        rename_dst_nr - first - 1);
> > -     rename_dst[first].two = alloc_filespec(two->path);
> > -     fill_filespec(rename_dst[first].two, &two->oid, two->oid_valid,
> > -                   two->mode);
> > -     rename_dst[first].pair = NULL;
> >       return 0;
>
> Very nice, this is much simpler than what was here before.
>
> > @@ -78,6 +75,14 @@ static int rename_src_nr, rename_src_alloc;
> >
> >  static void register_rename_src(struct diff_filepair *p)
> >  {
> > +     if (p->broken_pair) {
> > +             if (!break_idx) {
> > +                     break_idx = xmalloc(sizeof(*break_idx));
> > +                     strintmap_init(break_idx, -1);
> > +             }
> > +             strintmap_set(break_idx, p->one->path, rename_dst_nr);
> > +     }
> > +
>
> Makes sense.
>
> > @@ -664,8 +653,9 @@ void diffcore_rename(struct diff_options *options)
> >                        */
> >                       if (DIFF_PAIR_BROKEN(p)) {
> >                               /* broken delete */
> > -                             struct diff_rename_dst *dst = locate_rename_dst(p->one);
> > -                             if (dst && dst->pair)
> > +                             struct diff_rename_dst *dst = locate_rename_dst(p);
> > +                             assert(dst);
>
> You're not hurting anything, but I'm not convinced that this assert() is
> buying you anything that the line immediately below it isn't already
> doing.

It's identical for runtime correctness, yes.  But the primary point
isn't what happens at runtime, but what happens when future code
readers come along.  If I only keep the "if (dst->is_rename)" line
that comes after without the assert, then someone in the future will
come along (maybe even myself) and think "the original author wasn't
being careful here; they should change this to a check on (dst &&
dst->is_rename)" (because honestly, it is the kind of mistake I'm
prone to make).  However, if they were to do so, and some bug gets
introduced so that locate_rename_dst() returns a NULL for a pair of
interest, then they've masked a bug in the algorithm and made it fail
in harder-to-detect-and-track-down ways.  I wanted it to be clear that
dst == NULL is unacceptable.  I guess I could have marked it with
BUG(), but between an assertion and a NULL-pointer indirection, I
figured that code aborting under that condition was pretty well
covered.  :-)

> The rest of the patch looks good to me.
>
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 7/7] Accelerate rename_dst setup
  2020-12-10  0:57     ` Elijah Newren
@ 2020-12-10  1:43       ` Junio C Hamano
  0 siblings, 0 replies; 55+ messages in thread
From: Junio C Hamano @ 2020-12-10  1:43 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Taylor Blau, Elijah Newren via GitGitGadget, Git Mailing List

Elijah Newren <newren@gmail.com> writes:

> It's identical for runtime correctness, yes.  But the primary point
> isn't what happens at runtime, but what happens when future code
> readers come along.  If I only keep the "if (dst->is_rename)" line
> that comes after without the assert, then someone in the future will
> come along (maybe even myself) and think "the original author wasn't
> being careful here; they should change this to a check on (dst &&
> dst->is_rename)" (because honestly, it is the kind of mistake I'm
> prone to make).  However, if they were to do so, and some bug gets
> introduced so that locate_rename_dst() returns a NULL for a pair of
> interest, then they've masked a bug in the algorithm and made it fail
> in harder-to-detect-and-track-down ways.  I wanted it to be clear that
> dst == NULL is unacceptable.  I guess I could have marked it with
> BUG(), but between an assertion and a NULL-pointer indirection, I
> figured that code aborting under that condition was pretty well
> covered.  :-)

assert() often gets turned into no-op, so it is more like leaving a
comment in the code.  A comment or BUG("text") can describe not just
the fact that reaching this point with dst==NULL is a bug in the
caller but also why it is a bug (e.g. how NULL-dst would screw up
the computation downstream from this point), but an assert() cannot.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/7] diffcore-rename: remove unnecessary if-clause
  2020-12-09 22:10   ` Taylor Blau
  2020-12-10  0:32     ` Elijah Newren
@ 2020-12-10  2:03     ` Junio C Hamano
  2020-12-10  2:17       ` Elijah Newren
  1 sibling, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2020-12-10  2:03 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Elijah Newren via GitGitGadget, git, Elijah Newren

Taylor Blau <me@ttaylorr.com> writes:

> On Sun, Dec 06, 2020 at 02:54:31AM +0000, Elijah Newren via GitGitGadget wrote:
>> From: Elijah Newren <newren@gmail.com>
>>
>> diffcore-rename had two different checks of the form
>>
>>     if ((a < limit || b < limit) &&
>>         a * b <= limit * limit)
>>
>> Since these are all non-negative integers, this can be simplified to
>>
>>     if (a * b <= limit * limit)
>
> Makes sense.

I've always assumed that the original was for correctness (if a and
b are both larger than limit, a*b could end up being smaller than
limit*limit when the result of multiplication of the former wraps
around but not the latter) ...

>> The only advantage of the former would be in avoiding a couple
>> multiplications in the rare case that both a and b are BOTH very large.
>> I see no reason for such an optimization given that this code is not in
>> any kind of loop.  Prefer code simplicity here and change to the latter
>> form.
>
> If you were really paranoid, you could perform these checks with
> unsigned_mult_overflows(), but I don't think that it's worth doing so
> here.

... and in no way as an optimization.

So, I dunno.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/7] diffcore-rename: remove unnecessary if-clause
  2020-12-10  2:03     ` Junio C Hamano
@ 2020-12-10  2:17       ` Elijah Newren
  2020-12-10  6:56         ` Junio C Hamano
  0 siblings, 1 reply; 55+ messages in thread
From: Elijah Newren @ 2020-12-10  2:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Elijah Newren via GitGitGadget, Git Mailing List

On Wed, Dec 9, 2020 at 6:03 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Taylor Blau <me@ttaylorr.com> writes:
>
> > On Sun, Dec 06, 2020 at 02:54:31AM +0000, Elijah Newren via GitGitGadget wrote:
> >> From: Elijah Newren <newren@gmail.com>
> >>
> >> diffcore-rename had two different checks of the form
> >>
> >>     if ((a < limit || b < limit) &&
> >>         a * b <= limit * limit)
> >>
> >> Since these are all non-negative integers, this can be simplified to
> >>
> >>     if (a * b <= limit * limit)
> >
> > Makes sense.
>
> I've always assumed that the original was for correctness (if a and
> b are both larger than limit, a*b could end up being smaller than
> limit*limit when the result of multiplication of the former wraps
> around but not the latter) ...
>
> >> The only advantage of the former would be in avoiding a couple
> >> multiplications in the rare case that both a and b are BOTH very large.
> >> I see no reason for such an optimization given that this code is not in
> >> any kind of loop.  Prefer code simplicity here and change to the latter
> >> form.
> >
> > If you were really paranoid, you could perform these checks with
> > unsigned_mult_overflows(), but I don't think that it's worth doing so
> > here.
>
> ... and in no way as an optimization.
>
> So, I dunno.

Ah, so would you be okay replacing these with

   if (st_mult(num_targets, limited_sources) <= st_mult(rename_limit,
rename_limit))

?

That'd make the correctness intent far clearer, and allow us to drop
the casting as well since st_mult converts to size_t.  (If, on the off
chance you're on a platform where size_t is 32-bits and the
multiplications don't fit in that size, then the requested matrices
for computing rename detection won't fit in memory for you.)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/7] diffcore-rename: rename num_create to num_targets
  2020-12-06  2:54 ` [PATCH 3/7] diffcore-rename: rename num_create to num_targets Elijah Newren via GitGitGadget
@ 2020-12-10  2:20   ` Junio C Hamano
  2020-12-10  2:25     ` Elijah Newren
  0 siblings, 1 reply; 55+ messages in thread
From: Junio C Hamano @ 2020-12-10  2:20 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>
>
> Files added since the base commit serve as targets for rename detection.
> While it is true that added files can be thought of as being "created"
> when they are added IF they have no pairing file that they were renamed
> from, and it is true we start out not knowing what the pairings are, it
> seems a little odd to think in terms of "file creation" when we are
> looking for "file renames".  Rename the variable to avoid this minor
> point of confusion.

This is probably subjective.  

I've always viewed the rename detection as first collecting a set of
deleted paths and a set of created paths, and then trying to find a
good mapping from the former into the latter, so I find num_create a
lot more intuitive than num_targets.

But the remaining elements in the latter set are counted in the
counter "rename_dst_nr", so we clearly are OK to call the elements
of the latter set "the destination" (of a rename), which contrasts
very well with "the source" (of a rename), which is how the deleted
paths are counted with rename_src_nr.

When doing -C and -C -C, the "source" set has not just deleted but
also the preimage of the modified paths, so "source" is a more
appropriate name than "delete".  From that point of view,
"destination" is a more appropriate name for "create" because it
contrasts well with "source".

You silently renamed num_create to num_targets in 1/7 without
justification while adding num_sources.  Perhaps we should go back
to that step and use num_destinations to match?  The result would be
using words <dst, src> that pair with each other much better than
introducing "target" to an existing mix of <create==dst, src> to
make it <target==dst, src>.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/7] diffcore-rename: rename num_create to num_targets
  2020-12-10  2:20   ` Junio C Hamano
@ 2020-12-10  2:25     ` Elijah Newren
  0 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren @ 2020-12-10  2:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Elijah Newren via GitGitGadget, Git Mailing List

On Wed, Dec 9, 2020 at 6:20 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > Files added since the base commit serve as targets for rename detection.
> > While it is true that added files can be thought of as being "created"
> > when they are added IF they have no pairing file that they were renamed
> > from, and it is true we start out not knowing what the pairings are, it
> > seems a little odd to think in terms of "file creation" when we are
> > looking for "file renames".  Rename the variable to avoid this minor
> > point of confusion.
>
> This is probably subjective.
>
> I've always viewed the rename detection as first collecting a set of
> deleted paths and a set of created paths, and then trying to find a
> good mapping from the former into the latter, so I find num_create a
> lot more intuitive than num_targets.
>
> But the remaining elements in the latter set are counted in the
> counter "rename_dst_nr", so we clearly are OK to call the elements
> of the latter set "the destination" (of a rename), which contrasts
> very well with "the source" (of a rename), which is how the deleted
> paths are counted with rename_src_nr.
>
> When doing -C and -C -C, the "source" set has not just deleted but
> also the preimage of the modified paths, so "source" is a more
> appropriate name than "delete".  From that point of view,
> "destination" is a more appropriate name for "create" because it
> contrasts well with "source".
>
> You silently renamed num_create to num_targets in 1/7 without
> justification while adding num_sources.  Perhaps we should go back
> to that step and use num_destinations to match?  The result would be
> using words <dst, src> that pair with each other much better than
> introducing "target" to an existing mix of <create==dst, src> to
> make it <target==dst, src>.

Ooh, I like it.  num_destinations it is.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/7] diffcore-rename: change a few comments to use 'add' instead of 'create'
  2020-12-06  2:54 ` [PATCH 4/7] diffcore-rename: change a few comments to use 'add' instead of 'create' Elijah Newren via GitGitGadget
@ 2020-12-10  2:29   ` Junio C Hamano
  0 siblings, 0 replies; 55+ messages in thread
From: Junio C Hamano @ 2020-12-10  2:29 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>
>
> As with the last commit, it feels odd to talk about "creation" (which to
> me implies a brand new file, i.e. one not associated with an existing
> file, and thus one that could not have been a rename) when we are using
> the files to look for renames.  Use the term "addition" instead.

Hmph, to me "create" and "add" mean the same thing in all the
sentences this patch updates.

We first notice the differences in two trees, some paths have
contents in both trees so they cannot be creation or deletion.  Some
paths only exist in one of the trees and they are initially labelled
as "created" or "deleted".  If there were no request by the caller
to detect renames, then that's the end of the story.

Only after we match up a path in the new tree that did not exist in
the old tree with another path in the old tree that no longer exists
in the new tree, what initially got labelled as creation of one and
deletion of another may become a rename.

And in the above two paragraphs, if I replace "to create" with "to
add" and "to delete" with "to remove", I do not see how the story
becomes more natural or logical or intuitive.  Perhaps I am not
getting the subtle difference you find in these two words?  This
looks like a totally meaningless churn to me.

> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  diffcore-rename.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/diffcore-rename.c b/diffcore-rename.c
> index 655a67759c8..7270eb6af48 100644
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -173,7 +173,7 @@ static int estimate_similarity(struct repository *r,
>  {
>  	/* src points at a file that existed in the original tree (or
>  	 * optionally a file in the destination tree) and dst points
> -	 * at a newly created file.  They may be quite similar, in which
> +	 * at a newly added file.  They may be quite similar, in which
>  	 * case we want to say src is renamed to dst or src is copied into
>  	 * dst, and then some edit has been applied to dst.
>  	 *
> @@ -652,9 +652,9 @@ void diffcore_rename(struct diff_options *options)
>  		}
>  		else if (!DIFF_FILE_VALID(p->one) && DIFF_FILE_VALID(p->two)) {
>  			/*
> -			 * Creation
> +			 * Addition
>  			 *
> -			 * We would output this create record if it has
> +			 * We would output this add record if it has
>  			 * not been turned into a rename/copy already.
>  			 */
>  			struct diff_rename_dst *dst = locate_rename_dst(p->two);
> @@ -664,7 +664,7 @@ void diffcore_rename(struct diff_options *options)
>  			}
>  			else
>  				/* no matching rename/copy source, so
> -				 * record this as a creation.
> +				 * record this as an addition.
>  				 */
>  				diff_q(&outq, p);
>  		}

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 5/7] diffcore-rename: reduce jumpiness in progress counters
  2020-12-06  2:54 ` [PATCH 5/7] diffcore-rename: reduce jumpiness in progress counters Elijah Newren via GitGitGadget
  2020-12-09 22:24   ` Taylor Blau
@ 2020-12-10  2:36   ` Junio C Hamano
  1 sibling, 0 replies; 55+ messages in thread
From: Junio C Hamano @ 2020-12-10  2:36 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>
>
> Inexact rename detection works by comparing all sources to all
> destinations, computing similarities, and then finding the best matches
> among those that are sufficiently similar.

Here, you are using <sources, destinations> as contrasting pair of
words, which supports my comment on 1/7 and 3/7 ;-)

> However, it is preceded by exact rename detection that works by
> checking if there are files with identical hashes.  If exact renames are
> found, we can exclude some files from inexact rename detection.
>
> The inexact rename detection loops over the full set of files, but
> immediately skips those for which rename_dst[i].is_rename is true and
> thus doesn't compare any sources to that destination.  As such, these
> paths shouldn't be included in the progress counter.
> ...
> diff --git a/diffcore-rename.c b/diffcore-rename.c
> index 7270eb6af48..3d637ba4645 100644
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -587,7 +587,7 @@ void diffcore_rename(struct diff_options *options)
>  	if (options->show_rename_progress) {
>  		progress = start_delayed_progress(
>  				_("Performing inexact rename detection"),
> -				(uint64_t)rename_dst_nr * (uint64_t)rename_src_nr);
> +				(uint64_t)num_targets * (uint64_t)rename_src_nr);
>  	}

The num_targets (number of destinations) holds the "remaining"
candidates after exact renames are taken care of, so this reduces
the size of the matrix used to count the progress meter.  OK.

>  	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_targets), sizeof(*mx));
> @@ -626,7 +626,8 @@ void diffcore_rename(struct diff_options *options)
>  			diff_free_filespec_blob(two);
>  		}
>  		dst_cnt++;
> -		display_progress(progress, (uint64_t)(i+1)*(uint64_t)rename_src_nr);
> +		display_progress(progress,
> +				 (uint64_t)dst_cnt * (uint64_t)rename_src_nr);

And this fills the progress meter by using the number of
destinations that are actually considered.  Between the two hunks,
there is a "if the source of this destination is already known, move
to next 'i'" continue, that does not update the progress meter.
Changing (i+1) to dst_cnt here compensates for the reduction of the
matrix size we see above.

Makes sense.  This looks good.

>  	}
>  	stop_progress(&progress);

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/7] diffcore-rename: simplify and accelerate register_rename_src()
  2020-12-06  2:54 ` [PATCH 6/7] diffcore-rename: simplify and accelerate register_rename_src() Elijah Newren via GitGitGadget
  2020-12-09 22:40   ` Taylor Blau
@ 2020-12-10  2:51   ` Junio C Hamano
  1 sibling, 0 replies; 55+ messages in thread
From: Junio C Hamano @ 2020-12-10  2:51 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>
>
> reigster_rename_src() took pains to create an array in rename_src which

register?

> was sorted by pathname of the contained diff_filepair.  However, the
> fact that this array was sorted was not needed anywhere, and thus
> represented wasted time.  Simply append to the end of the array, which
> in a usecase of note saved 45% of diffcore_rename() setup time for me.

I originally started writing "I do not recall when the sortedness
stopped mattering", until I realized you wrote "anywhere" not
"anymore".

I do not think of any other reason than we wanted to notice and deal
with the duplicated input.  We do not look up the list of rename
sources by pathname.  So if we were sorting it, it is to prevent
such a bug from breaking the rename machinery.  What you call
"technically the behaviour is different" is "removing the safety".

I do not offhand know which caller might give us such an input in
the current code, so it may be entirely a safe thing to do.
Besides, we use a hashmap of rename sources when computing exact
renames, so even if we need to notice and/or avoid duplicates, we do
not have to have register_rename_src() build the table with an
insertion sort as a way to do so.

> diff --git a/diffcore-rename.c b/diffcore-rename.c
> index 3d637ba4645..816d2fbac44 100644
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -76,36 +76,12 @@ static struct diff_rename_src {
>  } *rename_src;
>  static int rename_src_nr, rename_src_alloc;
>  
> -static struct diff_rename_src *register_rename_src(struct diff_filepair *p)
> +static void register_rename_src(struct diff_filepair *p)
>  {
> -	int first, last;
> -	struct diff_filespec *one = p->one;
> -	unsigned short score = p->score;
> -
> -	first = 0;
> -	last = rename_src_nr;
> -	while (last > first) {
> -		int next = first + ((last - first) >> 1);
> -		struct diff_rename_src *src = &(rename_src[next]);
> -		int cmp = strcmp(one->path, src->p->one->path);
> -		if (!cmp)
> -			return src;
> -		if (cmp < 0) {
> -			last = next;
> -			continue;
> -		}
> -		first = next+1;
> -	}
> -
> -	/* insert to make it at "first" */
>  	ALLOC_GROW(rename_src, rename_src_nr + 1, rename_src_alloc);
> +	rename_src[rename_src_nr].p = p;
> +	rename_src[rename_src_nr].score = p->score;
>  	rename_src_nr++;
> -	if (first < rename_src_nr)
> -		MOVE_ARRAY(rename_src + first + 1, rename_src + first,
> -			   rename_src_nr - first - 1);
> -	rename_src[first].p = p;
> -	rename_src[first].score = score;
> -	return &(rename_src[first]);
>  }
>  
>  static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/7] diffcore-rename: remove unnecessary if-clause
  2020-12-10  2:17       ` Elijah Newren
@ 2020-12-10  6:56         ` Junio C Hamano
  0 siblings, 0 replies; 55+ messages in thread
From: Junio C Hamano @ 2020-12-10  6:56 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Taylor Blau, Elijah Newren via GitGitGadget, Git Mailing List

Elijah Newren <newren@gmail.com> writes:

> Ah, so would you be okay replacing these with
>
>    if (st_mult(num_targets, limited_sources) <= st_mult(rename_limit,
> rename_limit))

It's subtle that the use of size_t is a sensible thing to do, and
definitely deserves an in-line comment, but I think that is a good
way to go.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v2 0/9] diffcore-rename improvements
  2020-12-06  2:54 [PATCH 0/7] diffcore-rename improvements Elijah Newren via GitGitGadget
                   ` (7 preceding siblings ...)
  2020-12-06  3:01 ` [PATCH 0/7] diffcore-rename improvements Elijah Newren
@ 2020-12-11  9:08 ` Elijah Newren via GitGitGadget
  2020-12-11  9:08   ` [PATCH v2 1/9] diffcore-rename: rename num_create to num_destinations Elijah Newren via GitGitGadget
                     ` (9 more replies)
  8 siblings, 10 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Taylor Blau, Elijah Newren

This patch series includes 3 small code cleanups, 1 small bug fix (or
perhaps just a UI improvement -- it's a very minor issue either way), more
testcases, and 3 performance improvements. The first 7 patches are
relatively easy to review, while the second to last one...is a bit more
involved (but provided the biggest performance boost).

Changes since v1:

 * Inserted a patch at the beginning to rename num_create to
   num_destinations, as suggested by Junio. Also simplifies patch 2.
 * Used st_mult() to simplify the limit check
 * Dropped a patch Junio didn't care for
 * Inserted two new patches, 5 & 6, to add new testcases motivated by
   Junio's comments about register_rename_src() and correctness in the face
   of duplicate tree entries.
 * Modified patch 7 so that register_rename_src() behaves identically to
   current code, even when trees have duplicate entries
 * Added a new patch 9 which rips out the unnecessary duplicate entry checks
   for BOTH rename_src and rename_dst, and documents why they are safe to
   remove. This is also supported by the new testcases in patches 5 and 6. I
   also ran t4058 (which patches 5 and 6 add to) under valgrind afterwards
   and verified that no errors from diffcore-rename were reported (valgrind
   did report problems from cache-tree and unpack-trees, as documented in
   patch 6, but that was true before this patch series and probably has been
   that way for years or decades).
 * Miscellaneous small cleanups highlighted by Taylor and Junio

Elijah Newren (9):
  diffcore-rename: rename num_create to num_destinations
  diffcore-rename: avoid usage of global in too_many_rename_candidates()
  diffcore-rename: simplify limit check
  diffcore-rename: reduce jumpiness in progress counters
  t4058: add more tests and documentation for duplicate tree entry
    handling
  t4058: explore duplicate tree entry handling in a bit more detail
  diffcore-rename: simplify and accelerate register_rename_src()
  diffcore-rename: accelerate rename_dst setup
  diffcore-rename: remove unneccessary duplicate entry checks

 diffcore-rename.c          | 209 ++++++++++++++-----------------------
 t/t4058-diff-duplicates.sh | 114 +++++++++++++++++++-
 2 files changed, 192 insertions(+), 131 deletions(-)


base-commit: 3a0b884caba2752da0af626fb2de7d597c844e8b
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-929%2Fnewren%2Fdiffcore-rename-improvements-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-929/newren/diffcore-rename-improvements-v2
Pull-Request: https://github.com/git/git/pull/929

Range-diff vs v1:

  3:  30381addc5c !  1:  428d8204894 diffcore-rename: rename num_create to num_targets
     @@ Metadata
      Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
     -    diffcore-rename: rename num_create to num_targets
     +    diffcore-rename: rename num_create to num_destinations
      
     -    Files added since the base commit serve as targets for rename detection.
     -    While it is true that added files can be thought of as being "created"
     -    when they are added IF they have no pairing file that they were renamed
     -    from, and it is true we start out not knowing what the pairings are, it
     -    seems a little odd to think in terms of "file creation" when we are
     -    looking for "file renames".  Rename the variable to avoid this minor
     -    point of confusion.
     +    Our main data structures are rename_src and rename_dst.  For counters of
     +    these data structures, num_sources and num_destinations seem natural;
     +    definitely more so than using num_create for the latter.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
     +@@ diffcore-rename.c: static void record_if_better(struct diff_score m[], struct diff_score *o)
     +  * 1 if we need to disable inexact rename detection;
     +  * 2 if we would be under the limit if we were given -C instead of -C -C.
     +  */
     +-static int too_many_rename_candidates(int num_create,
     ++static int too_many_rename_candidates(int num_destinations,
     + 				      struct diff_options *options)
     + {
     + 	int rename_limit = options->rename_limit;
     +@@ diffcore-rename.c: static int too_many_rename_candidates(int num_create,
     + 	 * This basically does a test for the rename matrix not
     + 	 * growing larger than a "rename_limit" square matrix, ie:
     + 	 *
     +-	 *    num_create * num_src > rename_limit * rename_limit
     ++	 *    num_destinations * num_src > rename_limit * rename_limit
     + 	 */
     + 	if (rename_limit <= 0)
     + 		rename_limit = 32767;
     +-	if ((num_create <= rename_limit || num_src <= rename_limit) &&
     +-	    ((uint64_t)num_create * (uint64_t)num_src
     ++	if ((num_destinations <= rename_limit || num_src <= rename_limit) &&
     ++	    ((uint64_t)num_destinations * (uint64_t)num_src
     + 	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
     + 		return 0;
     + 
     + 	options->needed_rename_limit =
     +-		num_src > num_create ? num_src : num_create;
     ++		num_src > num_destinations ? num_src : num_destinations;
     + 
     + 	/* Are we running under -C -C? */
     + 	if (!options->flags.find_copies_harder)
     +@@ diffcore-rename.c: static int too_many_rename_candidates(int num_create,
     + 			continue;
     + 		num_src++;
     + 	}
     +-	if ((num_create <= rename_limit || num_src <= rename_limit) &&
     +-	    ((uint64_t)num_create * (uint64_t)num_src
     ++	if ((num_destinations <= rename_limit || num_src <= rename_limit) &&
     ++	    ((uint64_t)num_destinations * (uint64_t)num_src
     + 	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
     + 		return 2;
     + 	return 1;
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       	struct diff_queue_struct outq;
       	struct diff_score *mx;
       	int i, j, rename_count, skip_unmodified = 0;
      -	int num_create, dst_cnt;
     -+	int num_targets, dst_cnt;
     ++	int num_destinations, dst_cnt;
       	struct progress *progress = NULL;
       
       	if (!minimum_score)
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       	 * files still remain as options for rename/copies!)
       	 */
      -	num_create = (rename_dst_nr - rename_count);
     -+	num_targets = (rename_dst_nr - rename_count);
     ++	num_destinations = (rename_dst_nr - rename_count);
       
       	/* All done? */
      -	if (!num_create)
     -+	if (!num_targets)
     ++	if (!num_destinations)
       		goto cleanup;
       
     --	switch (too_many_rename_candidates(num_create, rename_src_nr, options)) {
     -+	switch (too_many_rename_candidates(num_targets, rename_src_nr, options)) {
     +-	switch (too_many_rename_candidates(num_create, options)) {
     ++	switch (too_many_rename_candidates(num_destinations, options)) {
       	case 1:
       		goto cleanup;
       	case 2:
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       	}
       
      -	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_create), sizeof(*mx));
     -+	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_targets), sizeof(*mx));
     ++	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
     ++		     sizeof(*mx));
       	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
       		struct diff_filespec *two = rename_dst[i].two;
       		struct diff_score *m;
  1:  95a3fcb8be0 !  2:  fc62f4c4f89 diffcore-rename: avoid usage of global in too_many_rename_candidates()
     @@ Metadata
       ## Commit message ##
          diffcore-rename: avoid usage of global in too_many_rename_candidates()
      
     -    too_many_rename_candidates() got the number of rename targets via an
     -    argument to the function, but the number of rename sources via a global
     -    variable.  That felt rather inconsistent.  Pass in the number of rename
     -    sources as an argument as well.
     +    too_many_rename_candidates() got the number of rename destinations via
     +    an argument to the function, but the number of rename sources via a
     +    global variable.  That felt rather inconsistent.  Pass in the number of
     +    rename sources as an argument as well.
      
          While we are at it... We had a local variable, num_src, that served two
     -    purposes.  Initially it was set to this global value, but later was used
     +    purposes.  Initially it was set to the global value, but later was used
          for counting a subset of the number of sources.  Since we now have a
          function argument for the former usage, introduce a clearer variable
          name for the latter usage.
     @@ diffcore-rename.c: static void record_if_better(struct diff_score m[], struct di
        * 1 if we need to disable inexact rename detection;
        * 2 if we would be under the limit if we were given -C instead of -C -C.
        */
     --static int too_many_rename_candidates(int num_create,
     -+static int too_many_rename_candidates(int num_targets, int num_sources,
     +-static int too_many_rename_candidates(int num_destinations,
     ++static int too_many_rename_candidates(int num_destinations, int num_sources,
       				      struct diff_options *options)
       {
       	int rename_limit = options->rename_limit;
     @@ diffcore-rename.c: static void record_if_better(struct diff_score m[], struct di
       
       	options->needed_rename_limit = 0;
       
     -@@ diffcore-rename.c: static int too_many_rename_candidates(int num_create,
     +@@ diffcore-rename.c: static int too_many_rename_candidates(int num_destinations,
       	 * This basically does a test for the rename matrix not
       	 * growing larger than a "rename_limit" square matrix, ie:
       	 *
     --	 *    num_create * num_src > rename_limit * rename_limit
     -+	 *    num_targets * num_sources > rename_limit * rename_limit
     +-	 *    num_destinations * num_src > rename_limit * rename_limit
     ++	 *    num_destinations * num_sources > rename_limit * rename_limit
       	 */
       	if (rename_limit <= 0)
       		rename_limit = 32767;
     --	if ((num_create <= rename_limit || num_src <= rename_limit) &&
     --	    ((uint64_t)num_create * (uint64_t)num_src
     -+	if ((num_targets <= rename_limit || num_sources <= rename_limit) &&
     -+	    ((uint64_t)num_targets * (uint64_t)num_sources
     +-	if ((num_destinations <= rename_limit || num_src <= rename_limit) &&
     +-	    ((uint64_t)num_destinations * (uint64_t)num_src
     ++	if ((num_destinations <= rename_limit || num_sources <= rename_limit) &&
     ++	    ((uint64_t)num_destinations * (uint64_t)num_sources
       	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
       		return 0;
       
       	options->needed_rename_limit =
     --		num_src > num_create ? num_src : num_create;
     -+		num_sources > num_targets ? num_sources : num_targets;
     +-		num_src > num_destinations ? num_src : num_destinations;
     ++		num_sources > num_destinations ? num_sources : num_destinations;
       
       	/* Are we running under -C -C? */
       	if (!options->flags.find_copies_harder)
     @@ diffcore-rename.c: static int too_many_rename_candidates(int num_create,
      -		num_src++;
      +		limited_sources++;
       	}
     --	if ((num_create <= rename_limit || num_src <= rename_limit) &&
     --	    ((uint64_t)num_create * (uint64_t)num_src
     -+	if ((num_targets <= rename_limit || limited_sources <= rename_limit) &&
     -+	    ((uint64_t)num_targets * (uint64_t)limited_sources
     +-	if ((num_destinations <= rename_limit || num_src <= rename_limit) &&
     +-	    ((uint64_t)num_destinations * (uint64_t)num_src
     ++	if ((num_destinations <= rename_limit || limited_sources <= rename_limit) &&
     ++	    ((uint64_t)num_destinations * (uint64_t)limited_sources
       	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
       		return 2;
       	return 1;
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
     - 	if (!num_create)
     + 	if (!num_destinations)
       		goto cleanup;
       
     --	switch (too_many_rename_candidates(num_create, options)) {
     -+	switch (too_many_rename_candidates(num_create, rename_src_nr, options)) {
     +-	switch (too_many_rename_candidates(num_destinations, options)) {
     ++	switch (too_many_rename_candidates(num_destinations, rename_src_nr,
     ++					   options)) {
       	case 1:
       		goto cleanup;
       	case 2:
  2:  f96bb36930a !  3:  7214fa73fdd diffcore-rename: remove unnecessary if-clause
     @@ Metadata
      Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
     -    diffcore-rename: remove unnecessary if-clause
     +    diffcore-rename: simplify limit check
      
          diffcore-rename had two different checks of the form
      
              if ((a < limit || b < limit) &&
                  a * b <= limit * limit)
      
     -    Since these are all non-negative integers, this can be simplified to
     +    This can be simplified to
      
     -        if (a * b <= limit * limit)
     +        if (st_mult(a, b) <= st_mult(limit, limit))
      
     -    The only advantage of the former would be in avoiding a couple
     -    multiplications in the rare case that both a and b are BOTH very large.
     -    I see no reason for such an optimization given that this code is not in
     -    any kind of loop.  Prefer code simplicity here and change to the latter
     -    form.
     +    which makes it clearer how we are checking for overflow, and makes it
     +    much easier to parse given the drop from 8 to 4 variable appearances.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
     -@@ diffcore-rename.c: static int too_many_rename_candidates(int num_targets, int num_sources,
     +@@ diffcore-rename.c: static int too_many_rename_candidates(int num_destinations, int num_sources,
     + 	 * growing larger than a "rename_limit" square matrix, ie:
     + 	 *
     + 	 *    num_destinations * num_sources > rename_limit * rename_limit
     ++	 *
     ++	 * We use st_mult() to check overflow conditions; in the
     ++	 * exceptional circumstance that size_t isn't large enough to hold
     ++	 * the multiplication, the system won't be able to allocate enough
     ++	 * memory for the matrix anyway.
       	 */
       	if (rename_limit <= 0)
       		rename_limit = 32767;
     --	if ((num_targets <= rename_limit || num_sources <= rename_limit) &&
     --	    ((uint64_t)num_targets * (uint64_t)num_sources
     +-	if ((num_destinations <= rename_limit || num_sources <= rename_limit) &&
     +-	    ((uint64_t)num_destinations * (uint64_t)num_sources
      -	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
     -+	if ((uint64_t)num_targets * (uint64_t)num_sources
     -+	    <= (uint64_t)rename_limit * (uint64_t)rename_limit)
     ++	if (st_mult(num_destinations, num_sources)
     ++	    <= st_mult(rename_limit, rename_limit))
       		return 0;
       
       	options->needed_rename_limit =
     -@@ diffcore-rename.c: static int too_many_rename_candidates(int num_targets, int num_sources,
     +@@ diffcore-rename.c: static int too_many_rename_candidates(int num_destinations, int num_sources,
       			continue;
       		limited_sources++;
       	}
     --	if ((num_targets <= rename_limit || limited_sources <= rename_limit) &&
     --	    ((uint64_t)num_targets * (uint64_t)limited_sources
     +-	if ((num_destinations <= rename_limit || limited_sources <= rename_limit) &&
     +-	    ((uint64_t)num_destinations * (uint64_t)limited_sources
      -	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
     -+	if ((uint64_t)num_targets * (uint64_t)limited_sources
     -+	    <= (uint64_t)rename_limit * (uint64_t)rename_limit)
     ++	if (st_mult(num_destinations, limited_sources)
     ++	    <= st_mult(rename_limit, rename_limit))
       		return 2;
       	return 1;
       }
  4:  77ca4656ed4 <  -:  ----------- diffcore-rename: change a few comments to use 'add' instead of 'create'
  5:  e8257516c1f !  4:  bf3581b1ac1 diffcore-rename: reduce jumpiness in progress counters
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       		progress = start_delayed_progress(
       				_("Performing inexact rename detection"),
      -				(uint64_t)rename_dst_nr * (uint64_t)rename_src_nr);
     -+				(uint64_t)num_targets * (uint64_t)rename_src_nr);
     ++				(uint64_t)num_destinations * (uint64_t)rename_src_nr);
       	}
       
     - 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_targets), sizeof(*mx));
     + 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       			diff_free_filespec_blob(two);
       		}
  -:  ----------- >  5:  9a4a3159acf t4058: add more tests and documentation for duplicate tree entry handling
  -:  ----------- >  6:  8db27892c59 t4058: explore duplicate tree entry handling in a bit more detail
  6:  306a48820dd !  7:  a58639b2927 diffcore-rename: simplify and accelerate register_rename_src()
     @@ Metadata
       ## Commit message ##
          diffcore-rename: simplify and accelerate register_rename_src()
      
     -    reigster_rename_src() took pains to create an array in rename_src which
     -    was sorted by pathname of the contained diff_filepair.  However, the
     -    fact that this array was sorted was not needed anywhere, and thus
     -    represented wasted time.  Simply append to the end of the array, which
     -    in a usecase of note saved 45% of diffcore_rename() setup time for me.
     -
     -    Technically, there is one difference in the end result for the code.  IF
     -    the caller of diffcore-rename provides a set of pairs with a duplicate
     -    filename in the sources (an erroneous input condition), the old code
     -    would simply ignore the duplicate (without warning or error).  The new
     -    code will simply swallow it and thus allow multiple pairings for the
     -    "same" source file.  Checking for this error condition is expensive
     -    (basically undoing the optimization) and the only existing callers
     -    already avoid passing such an invalid input.  If someone really wants to
     -    add some error checking here on this input condition, I believe they
     -    should add a separate function which can be called before
     -    diffcore_rename(), and then other callers that don't need additional
     -    checks because of their design (such as merge-ort) can avoid the extra
     -    overhead.
     +    register_rename_src() took pains to create an array in rename_src which
     +    was sorted by pathname of the contained diff_filepair.  The sorting was
     +    entirely unnecessary since callers pass filepairs to us in sorted
     +    order.  We can simply append to the end of the rename_src array,
     +    speeding up diffcore_rename() setup time.
      
          Also, note that I dropped the return type on the function since it was
          unconditionally discarded anyway.
     @@ diffcore-rename.c: static struct diff_rename_src {
      -		}
      -		first = next+1;
      -	}
     --
     ++	/*
     ++	 * If we have multiple entries at the same path in the source tree
     ++	 * (an invalid tree, to be sure), avoid using more more than one
     ++	 * such entry in rename detection.  Once upon a time, doing so
     ++	 * caused segfaults; see commit 25d5ea410f ("[PATCH] Redo
     ++	 * rename/copy detection logic.", 2005-05-24).
     ++	 */
     ++	if (rename_src_nr > 0 &&
     ++	    !strcmp(rename_src[rename_src_nr-1].p->one->path, p->one->path))
     ++		return;
     + 
      -	/* insert to make it at "first" */
       	ALLOC_GROW(rename_src, rename_src_nr + 1, rename_src_alloc);
      +	rename_src[rename_src_nr].p = p;
  7:  3e83b51628b !  8:  85714e1583d Accelerate rename_dst setup
     @@ Metadata
      Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
     -    Accelerate rename_dst setup
     +    diffcore-rename: accelerate rename_dst setup
      
          register_rename_src() simply references the passed pair inside
          rename_src.  In contrast, add_rename_dst() did something entirely
     @@ Commit message
          been done in a weakly quadratic manner, when many renames are involved
          this time could add up.
      
     -    There is still a strcmp() in add_rename_dst() that I feel is
     -    unnecessary, because I disagree with a regression test in t4058.
     -    However, I'm trying to increase performance without changing behavior,
     -    so I'll leave that fight for a different day (a TODO has been left in
     -    the code about it).
     +    There is still a strcmp() in add_rename_dst() that I have left in place
     +    to make it easier to verify that the algorithm has the same results.
     +    This strcmp() is there to check for duplicate destination entries (which
     +    was the easiest way at the time to avoid segfaults in the
     +    diffcore-rename code when trees had multiple entries at a given path).
     +    The underlying double free()s are no longer an issue with the new
     +    algorithm, but that can be addressed in a subsequent commit.
      
          This patch is being submitted in a different order than its original
          development, but in a large rebase of many commits with lots of renames
     @@ Commit message
      
       ## diffcore-rename.c ##
      @@
     - #include "cache.h"
     - #include "diff.h"
     - #include "diffcore.h"
     --#include "object-store.h"
       #include "hashmap.h"
     -+#include "object-store.h"
       #include "progress.h"
       #include "promisor-remote.h"
      +#include "strmap.h"
     @@ diffcore-rename.c
      -
      -	if (first >= 0)
      +	/*
     -+	 * See t4058; trees might have duplicate entries.  I think
     -+	 * trees with duplicate entries should be ignored and we
     -+	 * should just leave rename detection on; while the results
     -+	 * may be slightly harder to understand, that's merely a
     -+	 * result of the underlying tree making no sense.  But I
     -+	 * believe everything still works fine, the results do still
     -+	 * make sense, and the extra overhead of doing this checking
     -+	 * for a few historical weird trees from long ago seems like
     -+	 * the dog wagging the tail to me.
     -+	 *
     -+	 * However: I don't feel like fighting that battle right now.
     -+	 * For now, to keep the regression test passing, we have to
     -+	 * detect it.  Since the diff machinery passes these to us in
     -+	 * adjacent pairs, we just need to check to see if our name
     -+	 * matches the previous one.
     -+	 *
     -+	 * TODO: Dispense with this test, rip out the test in t4058, and
     -+	 * lobby folks for the change.
     ++	 * If we have multiple entries at the same path in the destination
     ++	 * tree (an invalid tree, to be sure), turn off rename detection
     ++	 * entirely.  Once upon a time, diffcore-rename had double free()
     ++	 * issues due to such duplicate paths, resulting in segfaults.  See
     ++	 * commit 4d6be03b95 ("diffcore-rename: avoid processing duplicate
     ++	 * destinations", 2015-02-26).
      +	 */
      +	if (rename_dst_nr > 0 &&
      +	    !strcmp(rename_dst[rename_dst_nr-1].p->two->path, p->two->path))
     @@ diffcore-rename.c
       	return 0;
       }
       
     -@@ diffcore-rename.c: static int rename_src_nr, rename_src_alloc;
     +@@ diffcore-rename.c: static void register_rename_src(struct diff_filepair *p)
     + 	    !strcmp(rename_src[rename_src_nr-1].p->one->path, p->one->path))
     + 		return;
       
     - static void register_rename_src(struct diff_filepair *p)
     - {
      +	if (p->broken_pair) {
      +		if (!break_idx) {
      +			break_idx = xmalloc(sizeof(*break_idx));
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       					" duplicate destination '%s'",
       					p->two->path);
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
     - 
     - 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_targets), sizeof(*mx));
     + 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
     + 		     sizeof(*mx));
       	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
      -		struct diff_filespec *two = rename_dst[i].two;
      +		struct diff_filespec *two = rename_dst[i].p->two;
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       		}
       		else if (!DIFF_FILE_VALID(p->one) && DIFF_FILE_VALID(p->two)) {
      -			/*
     --			 * Addition
     +-			 * Creation
      -			 *
     --			 * We would output this add record if it has
     +-			 * We would output this create record if it has
      -			 * not been turned into a rename/copy already.
      -			 */
      -			struct diff_rename_dst *dst = locate_rename_dst(p->two);
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      -			}
      -			else
      -				/* no matching rename/copy source, so
     --				 * record this as an addition.
     +-				 * record this as a creation.
      -				 */
      -				diff_q(&outq, p);
     -+			/* Addition */
     ++			/* Creation */
      +			diff_q(&outq, p);
       		}
       		else if (DIFF_FILE_VALID(p->one) && !DIFF_FILE_VALID(p->two)) {
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      -				struct diff_rename_dst *dst = locate_rename_dst(p->one);
      -				if (dst && dst->pair)
      +				struct diff_rename_dst *dst = locate_rename_dst(p);
     -+				assert(dst);
     ++				if (!dst)
     ++					BUG("tracking failed somehow; failed to find associated dst for broken pair");
      +				if (dst->is_rename)
       					/* counterpart is now rename/copy */
       					pair_to_free = p;
  -:  ----------- >  9:  af256b1c534 diffcore-rename: remove unneccessary duplicate entry checks

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v2 1/9] diffcore-rename: rename num_create to num_destinations
  2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
@ 2020-12-11  9:08   ` Elijah Newren via GitGitGadget
  2020-12-11  9:08   ` [PATCH v2 2/9] diffcore-rename: avoid usage of global in too_many_rename_candidates() Elijah Newren via GitGitGadget
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Our main data structures are rename_src and rename_dst.  For counters of
these data structures, num_sources and num_destinations seem natural;
definitely more so than using num_create for the latter.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index d367a6d2443..15a98f566e4 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -434,7 +434,7 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
  * 1 if we need to disable inexact rename detection;
  * 2 if we would be under the limit if we were given -C instead of -C -C.
  */
-static int too_many_rename_candidates(int num_create,
+static int too_many_rename_candidates(int num_destinations,
 				      struct diff_options *options)
 {
 	int rename_limit = options->rename_limit;
@@ -447,17 +447,17 @@ static int too_many_rename_candidates(int num_create,
 	 * This basically does a test for the rename matrix not
 	 * growing larger than a "rename_limit" square matrix, ie:
 	 *
-	 *    num_create * num_src > rename_limit * rename_limit
+	 *    num_destinations * num_src > rename_limit * rename_limit
 	 */
 	if (rename_limit <= 0)
 		rename_limit = 32767;
-	if ((num_create <= rename_limit || num_src <= rename_limit) &&
-	    ((uint64_t)num_create * (uint64_t)num_src
+	if ((num_destinations <= rename_limit || num_src <= rename_limit) &&
+	    ((uint64_t)num_destinations * (uint64_t)num_src
 	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
 		return 0;
 
 	options->needed_rename_limit =
-		num_src > num_create ? num_src : num_create;
+		num_src > num_destinations ? num_src : num_destinations;
 
 	/* Are we running under -C -C? */
 	if (!options->flags.find_copies_harder)
@@ -469,8 +469,8 @@ static int too_many_rename_candidates(int num_create,
 			continue;
 		num_src++;
 	}
-	if ((num_create <= rename_limit || num_src <= rename_limit) &&
-	    ((uint64_t)num_create * (uint64_t)num_src
+	if ((num_destinations <= rename_limit || num_src <= rename_limit) &&
+	    ((uint64_t)num_destinations * (uint64_t)num_src
 	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
 		return 2;
 	return 1;
@@ -505,7 +505,7 @@ void diffcore_rename(struct diff_options *options)
 	struct diff_queue_struct outq;
 	struct diff_score *mx;
 	int i, j, rename_count, skip_unmodified = 0;
-	int num_create, dst_cnt;
+	int num_destinations, dst_cnt;
 	struct progress *progress = NULL;
 
 	if (!minimum_score)
@@ -570,13 +570,13 @@ void diffcore_rename(struct diff_options *options)
 	 * Calculate how many renames are left (but all the source
 	 * files still remain as options for rename/copies!)
 	 */
-	num_create = (rename_dst_nr - rename_count);
+	num_destinations = (rename_dst_nr - rename_count);
 
 	/* All done? */
-	if (!num_create)
+	if (!num_destinations)
 		goto cleanup;
 
-	switch (too_many_rename_candidates(num_create, options)) {
+	switch (too_many_rename_candidates(num_destinations, options)) {
 	case 1:
 		goto cleanup;
 	case 2:
@@ -593,7 +593,8 @@ void diffcore_rename(struct diff_options *options)
 				(uint64_t)rename_dst_nr * (uint64_t)rename_src_nr);
 	}
 
-	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_create), sizeof(*mx));
+	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
+		     sizeof(*mx));
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
 		struct diff_filespec *two = rename_dst[i].two;
 		struct diff_score *m;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 2/9] diffcore-rename: avoid usage of global in too_many_rename_candidates()
  2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
  2020-12-11  9:08   ` [PATCH v2 1/9] diffcore-rename: rename num_create to num_destinations Elijah Newren via GitGitGadget
@ 2020-12-11  9:08   ` Elijah Newren via GitGitGadget
  2020-12-11  9:08   ` [PATCH v2 3/9] diffcore-rename: simplify limit check Elijah Newren via GitGitGadget
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

too_many_rename_candidates() got the number of rename destinations via
an argument to the function, but the number of rename sources via a
global variable.  That felt rather inconsistent.  Pass in the number of
rename sources as an argument as well.

While we are at it... We had a local variable, num_src, that served two
purposes.  Initially it was set to the global value, but later was used
for counting a subset of the number of sources.  Since we now have a
function argument for the former usage, introduce a clearer variable
name for the latter usage.

This patch has no behavioral changes; it's just renaming and passing an
argument instead of grabbing it from the global namespace.  (You may
find it easier to view the patch using git diff's --color-words option.)

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 15a98f566e4..1d6675c040d 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -434,12 +434,11 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
  * 1 if we need to disable inexact rename detection;
  * 2 if we would be under the limit if we were given -C instead of -C -C.
  */
-static int too_many_rename_candidates(int num_destinations,
+static int too_many_rename_candidates(int num_destinations, int num_sources,
 				      struct diff_options *options)
 {
 	int rename_limit = options->rename_limit;
-	int num_src = rename_src_nr;
-	int i;
+	int i, limited_sources;
 
 	options->needed_rename_limit = 0;
 
@@ -447,30 +446,30 @@ static int too_many_rename_candidates(int num_destinations,
 	 * This basically does a test for the rename matrix not
 	 * growing larger than a "rename_limit" square matrix, ie:
 	 *
-	 *    num_destinations * num_src > rename_limit * rename_limit
+	 *    num_destinations * num_sources > rename_limit * rename_limit
 	 */
 	if (rename_limit <= 0)
 		rename_limit = 32767;
-	if ((num_destinations <= rename_limit || num_src <= rename_limit) &&
-	    ((uint64_t)num_destinations * (uint64_t)num_src
+	if ((num_destinations <= rename_limit || num_sources <= rename_limit) &&
+	    ((uint64_t)num_destinations * (uint64_t)num_sources
 	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
 		return 0;
 
 	options->needed_rename_limit =
-		num_src > num_destinations ? num_src : num_destinations;
+		num_sources > num_destinations ? num_sources : num_destinations;
 
 	/* Are we running under -C -C? */
 	if (!options->flags.find_copies_harder)
 		return 1;
 
 	/* Would we bust the limit if we were running under -C? */
-	for (num_src = i = 0; i < rename_src_nr; i++) {
+	for (limited_sources = i = 0; i < num_sources; i++) {
 		if (diff_unmodified_pair(rename_src[i].p))
 			continue;
-		num_src++;
+		limited_sources++;
 	}
-	if ((num_destinations <= rename_limit || num_src <= rename_limit) &&
-	    ((uint64_t)num_destinations * (uint64_t)num_src
+	if ((num_destinations <= rename_limit || limited_sources <= rename_limit) &&
+	    ((uint64_t)num_destinations * (uint64_t)limited_sources
 	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
 		return 2;
 	return 1;
@@ -576,7 +575,8 @@ void diffcore_rename(struct diff_options *options)
 	if (!num_destinations)
 		goto cleanup;
 
-	switch (too_many_rename_candidates(num_destinations, options)) {
+	switch (too_many_rename_candidates(num_destinations, rename_src_nr,
+					   options)) {
 	case 1:
 		goto cleanup;
 	case 2:
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 3/9] diffcore-rename: simplify limit check
  2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
  2020-12-11  9:08   ` [PATCH v2 1/9] diffcore-rename: rename num_create to num_destinations Elijah Newren via GitGitGadget
  2020-12-11  9:08   ` [PATCH v2 2/9] diffcore-rename: avoid usage of global in too_many_rename_candidates() Elijah Newren via GitGitGadget
@ 2020-12-11  9:08   ` Elijah Newren via GitGitGadget
  2020-12-11  9:08   ` [PATCH v2 4/9] diffcore-rename: reduce jumpiness in progress counters Elijah Newren via GitGitGadget
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

diffcore-rename had two different checks of the form

    if ((a < limit || b < limit) &&
        a * b <= limit * limit)

This can be simplified to

    if (st_mult(a, b) <= st_mult(limit, limit))

which makes it clearer how we are checking for overflow, and makes it
much easier to parse given the drop from 8 to 4 variable appearances.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 1d6675c040d..16553ab259f 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -447,12 +447,16 @@ static int too_many_rename_candidates(int num_destinations, int num_sources,
 	 * growing larger than a "rename_limit" square matrix, ie:
 	 *
 	 *    num_destinations * num_sources > rename_limit * rename_limit
+	 *
+	 * We use st_mult() to check overflow conditions; in the
+	 * exceptional circumstance that size_t isn't large enough to hold
+	 * the multiplication, the system won't be able to allocate enough
+	 * memory for the matrix anyway.
 	 */
 	if (rename_limit <= 0)
 		rename_limit = 32767;
-	if ((num_destinations <= rename_limit || num_sources <= rename_limit) &&
-	    ((uint64_t)num_destinations * (uint64_t)num_sources
-	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
+	if (st_mult(num_destinations, num_sources)
+	    <= st_mult(rename_limit, rename_limit))
 		return 0;
 
 	options->needed_rename_limit =
@@ -468,9 +472,8 @@ static int too_many_rename_candidates(int num_destinations, int num_sources,
 			continue;
 		limited_sources++;
 	}
-	if ((num_destinations <= rename_limit || limited_sources <= rename_limit) &&
-	    ((uint64_t)num_destinations * (uint64_t)limited_sources
-	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
+	if (st_mult(num_destinations, limited_sources)
+	    <= st_mult(rename_limit, rename_limit))
 		return 2;
 	return 1;
 }
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 4/9] diffcore-rename: reduce jumpiness in progress counters
  2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
                     ` (2 preceding siblings ...)
  2020-12-11  9:08   ` [PATCH v2 3/9] diffcore-rename: simplify limit check Elijah Newren via GitGitGadget
@ 2020-12-11  9:08   ` Elijah Newren via GitGitGadget
  2020-12-11  9:08   ` [PATCH v2 5/9] t4058: add more tests and documentation for duplicate tree entry handling Elijah Newren via GitGitGadget
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Inexact rename detection works by comparing all sources to all
destinations, computing similarities, and then finding the best matches
among those that are sufficiently similar.

However, it is preceded by exact rename detection that works by
checking if there are files with identical hashes.  If exact renames are
found, we can exclude some files from inexact rename detection.

The inexact rename detection loops over the full set of files, but
immediately skips those for which rename_dst[i].is_rename is true and
thus doesn't compare any sources to that destination.  As such, these
paths shouldn't be included in the progress counter.

For the eagle eyed, this change hints at an actual optimization -- the
first one I presented at Git Merge 2020.  I'll be submitting that
optimization later, once the basic merge-ort algorithm has merged.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 16553ab259f..55a188abcc3 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -593,7 +593,7 @@ void diffcore_rename(struct diff_options *options)
 	if (options->show_rename_progress) {
 		progress = start_delayed_progress(
 				_("Performing inexact rename detection"),
-				(uint64_t)rename_dst_nr * (uint64_t)rename_src_nr);
+				(uint64_t)num_destinations * (uint64_t)rename_src_nr);
 	}
 
 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
@@ -633,7 +633,8 @@ void diffcore_rename(struct diff_options *options)
 			diff_free_filespec_blob(two);
 		}
 		dst_cnt++;
-		display_progress(progress, (uint64_t)(i+1)*(uint64_t)rename_src_nr);
+		display_progress(progress,
+				 (uint64_t)dst_cnt * (uint64_t)rename_src_nr);
 	}
 	stop_progress(&progress);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 5/9] t4058: add more tests and documentation for duplicate tree entry handling
  2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
                     ` (3 preceding siblings ...)
  2020-12-11  9:08   ` [PATCH v2 4/9] diffcore-rename: reduce jumpiness in progress counters Elijah Newren via GitGitGadget
@ 2020-12-11  9:08   ` Elijah Newren via GitGitGadget
  2020-12-11  9:08   ` [PATCH v2 6/9] t4058: explore duplicate tree entry handling in a bit more detail Elijah Newren via GitGitGadget
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Commit 4d6be03b95 ("diffcore-rename: avoid processing duplicate
destinations", 2015-02-26) added t4058 to demonstrate that a workaround
it added to avoid double frees (namely to just turn off rename detection
when trees had duplicate entries) would indeed avoid segfaults.  The
tests, though, give the impression that the expected diffs are "correct"
when in reality they are just "don't segfault, and do something
semi-reasonable under the circumstances".  Add some notes to make this
clearer.

Also, commit 25d5ea410f ("[PATCH] Redo rename/copy detection logic.",
2005-05-24) added a similar workaround to avoid segfaults, but for
rename_src rather than rename_dst.  I do not see any tests in the
testsuite to cover the collision detection of entries limited to the
source side, so add a couple.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t4058-diff-duplicates.sh | 47 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/t/t4058-diff-duplicates.sh b/t/t4058-diff-duplicates.sh
index c24ee175ef0..bd685089561 100755
--- a/t/t4058-diff-duplicates.sh
+++ b/t/t4058-diff-duplicates.sh
@@ -1,5 +1,14 @@
 #!/bin/sh
 
+# NOTICE:
+#   This testsuite does a number of diffs and checks that the output match.
+#   However, it is a "garbage in, garbage out" situation; the trees have
+#   duplicate entries for individual paths, and it results in diffs that do
+#   not make much sense.  As such, it is not clear that the diffs are
+#   "correct".  The primary purpose of these tests was to verify that
+#   diff-tree does not segfault, but there is perhaps some value in ensuring
+#   that the diff output isn't wildly unreasonable.
+
 test_description='test tree diff when trees have duplicate entries'
 . ./test-lib.sh
 
@@ -57,7 +66,16 @@ test_expect_success 'create trees with duplicate entries' '
 	git tag two $outer_two
 '
 
-test_expect_success 'diff-tree between trees' '
+test_expect_success 'create tree without duplicate entries' '
+	blob_one=$(echo one | git hash-object -w --stdin) &&
+	outer_three=$(make_tree \
+		100644 renamed $blob_one
+	) &&
+	git tag three $outer_three
+'
+
+test_expect_success 'diff-tree between duplicate trees' '
+	# See NOTICE at top of file
 	{
 		printf ":000000 100644 $ZERO_OID $blob_two A\touter/inner\n" &&
 		printf ":000000 100644 $ZERO_OID $blob_two A\touter/inner\n" &&
@@ -71,9 +89,34 @@ test_expect_success 'diff-tree between trees' '
 '
 
 test_expect_success 'diff-tree with renames' '
-	# same expectation as above, since we disable rename detection
+	# See NOTICE at top of file.
 	git diff-tree -M -r --no-abbrev one two >actual &&
 	test_cmp expect actual
 '
 
+test_expect_success 'diff-tree FROM duplicate tree' '
+	# See NOTICE at top of file.
+	{
+		printf ":100644 000000 $blob_one $ZERO_OID D\touter/inner\n" &&
+		printf ":100644 000000 $blob_two $ZERO_OID D\touter/inner\n" &&
+		printf ":100644 000000 $blob_two $ZERO_OID D\touter/inner\n" &&
+		printf ":100644 000000 $blob_two $ZERO_OID D\touter/inner\n" &&
+		printf ":000000 100644 $ZERO_OID $blob_one A\trenamed\n"
+	} >expect &&
+	git diff-tree -r --no-abbrev one three >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'diff-tree FROM duplicate tree, with renames' '
+	# See NOTICE at top of file.
+	{
+		printf ":100644 000000 $blob_two $ZERO_OID D\touter/inner\n" &&
+		printf ":100644 000000 $blob_two $ZERO_OID D\touter/inner\n" &&
+		printf ":100644 000000 $blob_two $ZERO_OID D\touter/inner\n" &&
+		printf ":100644 100644 $blob_one $blob_one R100\touter/inner\trenamed\n"
+	} >expect &&
+	git diff-tree -M -r --no-abbrev one three >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 6/9] t4058: explore duplicate tree entry handling in a bit more detail
  2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
                     ` (4 preceding siblings ...)
  2020-12-11  9:08   ` [PATCH v2 5/9] t4058: add more tests and documentation for duplicate tree entry handling Elijah Newren via GitGitGadget
@ 2020-12-11  9:08   ` Elijah Newren via GitGitGadget
  2021-04-21 12:29     ` Ævar Arnfjörð Bjarmason
  2020-12-11  9:08   ` [PATCH v2 7/9] diffcore-rename: simplify and accelerate register_rename_src() Elijah Newren via GitGitGadget
                     ` (3 subsequent siblings)
  9 siblings, 1 reply; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

While creating the last commit, I found a number of other cases where
git would segfault when faced with trees that have duplicate entries.
None of these segfaults are in the diffcore-rename code (they all occur
in cache-tree and unpack-trees).  Further, to my knowledge, no one has
ever been adversely affected by these bugs, and given that it has been
15 years and folks have fixed a few other issues with historical
duplicate entries (as noted in the last commit), I am not sure we will
ever run into anyone having problems with these.  So I am not sure these
are worth fixing, but it doesn't hurt to at least document these
failures in the same test file that is concerned with duplicate tree
entries.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t4058-diff-duplicates.sh | 67 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/t/t4058-diff-duplicates.sh b/t/t4058-diff-duplicates.sh
index bd685089561..ad3f37d388d 100755
--- a/t/t4058-diff-duplicates.sh
+++ b/t/t4058-diff-duplicates.sh
@@ -119,4 +119,71 @@ test_expect_success 'diff-tree FROM duplicate tree, with renames' '
 	test_cmp expect actual
 '
 
+test_expect_success 'create a few commits' '
+	git commit-tree -m "Duplicate Entries" two^{tree} >commit_id &&
+	git branch base $(cat commit_id) &&
+
+	git commit-tree -p $(cat commit_id) -m "Just one" three^{tree} >up &&
+	git branch update $(cat up) &&
+
+	git commit-tree -p $(cat up) -m "Back to weird" two^{tree} >final &&
+	git branch final $(cat final) &&
+
+	rm commit_id up final
+'
+
+test_expect_failure 'git read-tree does not segfault' '
+	test_when_finished rm .git/index.lock &&
+	test_might_fail git read-tree --reset base
+'
+
+test_expect_failure 'reset --hard does not segfault' '
+	test_when_finished rm .git/index.lock &&
+	git checkout base &&
+	test_might_fail git reset --hard
+'
+
+test_expect_failure 'git diff HEAD does not segfault' '
+	git checkout base &&
+	GIT_TEST_CHECK_CACHE_TREE=false &&
+	git reset --hard &&
+	test_might_fail git diff HEAD
+'
+
+test_expect_failure 'can switch to another branch when status is empty' '
+	git clean -ffdqx &&
+	git status --porcelain -uno >actual &&
+	test_must_be_empty actual &&
+	git checkout update
+'
+
+test_expect_success 'forcibly switch to another branch, verify status empty' '
+	git checkout -f update &&
+	git status --porcelain -uno >actual &&
+	test_must_be_empty actual
+'
+
+test_expect_success 'fast-forward from non-duplicate entries to duplicate' '
+	git merge final
+'
+
+test_expect_failure 'clean status, switch branches, status still clean' '
+	git status --porcelain -uno >actual &&
+	test_must_be_empty actual &&
+	git checkout base &&
+	git status --porcelain -uno >actual &&
+	test_must_be_empty actual
+'
+
+test_expect_success 'switch to base branch and force status to be clean' '
+	git checkout base &&
+	GIT_TEST_CHECK_CACHE_TREE=false git reset --hard &&
+	git status --porcelain -uno >actual &&
+	test_must_be_empty actual
+'
+
+test_expect_failure 'fast-forward from duplicate entries to non-duplicate' '
+	git merge update
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 7/9] diffcore-rename: simplify and accelerate register_rename_src()
  2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
                     ` (5 preceding siblings ...)
  2020-12-11  9:08   ` [PATCH v2 6/9] t4058: explore duplicate tree entry handling in a bit more detail Elijah Newren via GitGitGadget
@ 2020-12-11  9:08   ` Elijah Newren via GitGitGadget
  2020-12-11  9:08   ` [PATCH v2 8/9] diffcore-rename: accelerate rename_dst setup Elijah Newren via GitGitGadget
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

register_rename_src() took pains to create an array in rename_src which
was sorted by pathname of the contained diff_filepair.  The sorting was
entirely unnecessary since callers pass filepairs to us in sorted
order.  We can simply append to the end of the rename_src array,
speeding up diffcore_rename() setup time.

Also, note that I dropped the return type on the function since it was
unconditionally discarded anyway.

This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection,
diffcore_rename() setup time was a sizeable chunk of overall runtime.
This patch dropped execution time of rebasing 35 commits with lots of
renames by 2% overall.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 39 +++++++++++++--------------------------
 1 file changed, 13 insertions(+), 26 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 55a188abcc3..a215421a9cb 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -76,36 +76,23 @@ static struct diff_rename_src {
 } *rename_src;
 static int rename_src_nr, rename_src_alloc;
 
-static struct diff_rename_src *register_rename_src(struct diff_filepair *p)
+static void register_rename_src(struct diff_filepair *p)
 {
-	int first, last;
-	struct diff_filespec *one = p->one;
-	unsigned short score = p->score;
-
-	first = 0;
-	last = rename_src_nr;
-	while (last > first) {
-		int next = first + ((last - first) >> 1);
-		struct diff_rename_src *src = &(rename_src[next]);
-		int cmp = strcmp(one->path, src->p->one->path);
-		if (!cmp)
-			return src;
-		if (cmp < 0) {
-			last = next;
-			continue;
-		}
-		first = next+1;
-	}
+	/*
+	 * If we have multiple entries at the same path in the source tree
+	 * (an invalid tree, to be sure), avoid using more more than one
+	 * such entry in rename detection.  Once upon a time, doing so
+	 * caused segfaults; see commit 25d5ea410f ("[PATCH] Redo
+	 * rename/copy detection logic.", 2005-05-24).
+	 */
+	if (rename_src_nr > 0 &&
+	    !strcmp(rename_src[rename_src_nr-1].p->one->path, p->one->path))
+		return;
 
-	/* insert to make it at "first" */
 	ALLOC_GROW(rename_src, rename_src_nr + 1, rename_src_alloc);
+	rename_src[rename_src_nr].p = p;
+	rename_src[rename_src_nr].score = p->score;
 	rename_src_nr++;
-	if (first < rename_src_nr)
-		MOVE_ARRAY(rename_src + first + 1, rename_src + first,
-			   rename_src_nr - first - 1);
-	rename_src[first].p = p;
-	rename_src[first].score = score;
-	return &(rename_src[first]);
 }
 
 static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 8/9] diffcore-rename: accelerate rename_dst setup
  2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
                     ` (6 preceding siblings ...)
  2020-12-11  9:08   ` [PATCH v2 7/9] diffcore-rename: simplify and accelerate register_rename_src() Elijah Newren via GitGitGadget
@ 2020-12-11  9:08   ` Elijah Newren via GitGitGadget
  2020-12-11  9:08   ` [PATCH v2 9/9] diffcore-rename: remove unneccessary duplicate entry checks Elijah Newren via GitGitGadget
  2020-12-29 20:05   ` [PATCH v3 0/9] diffcore-rename improvements Elijah Newren via GitGitGadget
  9 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

register_rename_src() simply references the passed pair inside
rename_src.  In contrast, add_rename_dst() did something entirely
different for rename_dst.  Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL.  Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs.  This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary.  Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src.  Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.

Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go.  However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on).  Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index.  Doing that allows us to still have a
fast lookup without sorting the rename_dst array.  Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.

There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.

This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime.  This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 148 ++++++++++++++++++++--------------------------
 1 file changed, 65 insertions(+), 83 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index a215421a9cb..2a8fcd52928 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -9,63 +9,48 @@
 #include "hashmap.h"
 #include "progress.h"
 #include "promisor-remote.h"
+#include "strmap.h"
 
 /* Table of rename/copy destinations */
 
 static struct diff_rename_dst {
-	struct diff_filespec *two;
-	struct diff_filepair *pair;
+	struct diff_filepair *p;
+	struct diff_filespec *filespec_to_free;
+	int is_rename; /* false -> just a create; true -> rename or copy */
 } *rename_dst;
 static int rename_dst_nr, rename_dst_alloc;
+/* Mapping from break source pathname to break destination index */
+static struct strintmap *break_idx = NULL;
 
-static int find_rename_dst(struct diff_filespec *two)
-{
-	int first, last;
-
-	first = 0;
-	last = rename_dst_nr;
-	while (last > first) {
-		int next = first + ((last - first) >> 1);
-		struct diff_rename_dst *dst = &(rename_dst[next]);
-		int cmp = strcmp(two->path, dst->two->path);
-		if (!cmp)
-			return next;
-		if (cmp < 0) {
-			last = next;
-			continue;
-		}
-		first = next+1;
-	}
-	return -first - 1;
-}
-
-static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two)
+static struct diff_rename_dst *locate_rename_dst(struct diff_filepair *p)
 {
-	int ofs = find_rename_dst(two);
-	return ofs < 0 ? NULL : &rename_dst[ofs];
+	/* Lookup by p->ONE->path */
+	int idx = break_idx ? strintmap_get(break_idx, p->one->path) : -1;
+	return (idx == -1) ? NULL : &rename_dst[idx];
 }
 
 /*
  * Returns 0 on success, -1 if we found a duplicate.
  */
-static int add_rename_dst(struct diff_filespec *two)
+static int add_rename_dst(struct diff_filepair *p)
 {
-	int first = find_rename_dst(two);
-
-	if (first >= 0)
+	/*
+	 * If we have multiple entries at the same path in the destination
+	 * tree (an invalid tree, to be sure), turn off rename detection
+	 * entirely.  Once upon a time, diffcore-rename had double free()
+	 * issues due to such duplicate paths, resulting in segfaults.  See
+	 * commit 4d6be03b95 ("diffcore-rename: avoid processing duplicate
+	 * destinations", 2015-02-26).
+	 */
+	if (rename_dst_nr > 0 &&
+	    !strcmp(rename_dst[rename_dst_nr-1].p->two->path, p->two->path))
 		return -1;
-	first = -first - 1;
 
-	/* insert to make it at "first" */
 	ALLOC_GROW(rename_dst, rename_dst_nr + 1, rename_dst_alloc);
+	rename_dst[rename_dst_nr].p = p;
+	rename_dst[rename_dst_nr].filespec_to_free = NULL;
+	rename_dst[rename_dst_nr].is_rename = 0;
 	rename_dst_nr++;
-	if (first < rename_dst_nr)
-		MOVE_ARRAY(rename_dst + first + 1, rename_dst + first,
-			   rename_dst_nr - first - 1);
-	rename_dst[first].two = alloc_filespec(two->path);
-	fill_filespec(rename_dst[first].two, &two->oid, two->oid_valid,
-		      two->mode);
-	rename_dst[first].pair = NULL;
 	return 0;
 }
 
@@ -89,6 +74,14 @@ static void register_rename_src(struct diff_filepair *p)
 	    !strcmp(rename_src[rename_src_nr-1].p->one->path, p->one->path))
 		return;
 
+	if (p->broken_pair) {
+		if (!break_idx) {
+			break_idx = xmalloc(sizeof(*break_idx));
+			strintmap_init(break_idx, -1);
+		}
+		strintmap_set(break_idx, p->one->path, rename_dst_nr);
+	}
+
 	ALLOC_GROW(rename_src, rename_src_nr + 1, rename_src_alloc);
 	rename_src[rename_src_nr].p = p;
 	rename_src[rename_src_nr].score = p->score;
@@ -128,14 +121,14 @@ static void prefetch(void *prefetch_options)
 	struct oid_array to_fetch = OID_ARRAY_INIT;
 
 	for (i = 0; i < rename_dst_nr; i++) {
-		if (rename_dst[i].pair)
+		if (rename_dst[i].p->renamed_pair)
 			/*
 			 * The loop in diffcore_rename() will not need these
 			 * blobs, so skip prefetching.
 			 */
 			continue; /* already found exact match */
 		diff_add_if_missing(options->repo, &to_fetch,
-				    rename_dst[i].two);
+				    rename_dst[i].p->two);
 	}
 	for (i = 0; i < rename_src_nr; i++) {
 		if (options->skip_unmodified &&
@@ -245,26 +238,24 @@ static int estimate_similarity(struct repository *r,
 
 static void record_rename_pair(int dst_index, int src_index, int score)
 {
-	struct diff_filespec *src, *dst;
-	struct diff_filepair *dp;
+	struct diff_filepair *src = rename_src[src_index].p;
+	struct diff_filepair *dst = rename_dst[dst_index].p;
 
-	if (rename_dst[dst_index].pair)
+	if (dst->renamed_pair)
 		die("internal error: dst already matched.");
 
-	src = rename_src[src_index].p->one;
-	src->rename_used++;
-	src->count++;
+	src->one->rename_used++;
+	src->one->count++;
 
-	dst = rename_dst[dst_index].two;
-	dst->count++;
+	rename_dst[dst_index].filespec_to_free = dst->one;
+	rename_dst[dst_index].is_rename = 1;
 
-	dp = diff_queue(NULL, src, dst);
-	dp->renamed_pair = 1;
-	if (!strcmp(src->path, dst->path))
-		dp->score = rename_src[src_index].score;
+	dst->one = src->one;
+	dst->renamed_pair = 1;
+	if (!strcmp(dst->one->path, dst->two->path))
+		dst->score = rename_src[src_index].score;
 	else
-		dp->score = score;
-	rename_dst[dst_index].pair = dp;
+		dst->score = score;
 }
 
 /*
@@ -310,7 +301,7 @@ static int find_identical_files(struct hashmap *srcs,
 				struct diff_options *options)
 {
 	int renames = 0;
-	struct diff_filespec *target = rename_dst[dst_index].two;
+	struct diff_filespec *target = rename_dst[dst_index].p->two;
 	struct file_similarity *p, *best = NULL;
 	int i = 100, best_score = -1;
 	unsigned int hash = hash_filespec(options->repo, target);
@@ -476,7 +467,7 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
 		    (mx[i].score < minimum_score))
 			break; /* there is no more usable pair. */
 		dst = &rename_dst[mx[i].dst];
-		if (dst->pair)
+		if (dst->is_rename)
 			continue; /* already done, either exact or fuzzy. */
 		if (!copies && rename_src[mx[i].src].p->one->rename_used)
 			continue;
@@ -511,7 +502,7 @@ void diffcore_rename(struct diff_options *options)
 			else if (!options->flags.rename_empty &&
 				 is_empty_blob_oid(&p->two->oid))
 				continue;
-			else if (add_rename_dst(p->two) < 0) {
+			else if (add_rename_dst(p) < 0) {
 				warning("skipping rename detection, detected"
 					" duplicate destination '%s'",
 					p->two->path);
@@ -586,10 +577,10 @@ void diffcore_rename(struct diff_options *options)
 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
 		     sizeof(*mx));
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
-		struct diff_filespec *two = rename_dst[i].two;
+		struct diff_filespec *two = rename_dst[i].p->two;
 		struct diff_score *m;
 
-		if (rename_dst[i].pair)
+		if (rename_dst[i].is_rename)
 			continue; /* dealt with exact match already. */
 
 		m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
@@ -646,22 +637,8 @@ void diffcore_rename(struct diff_options *options)
 			diff_q(&outq, p);
 		}
 		else if (!DIFF_FILE_VALID(p->one) && DIFF_FILE_VALID(p->two)) {
-			/*
-			 * Creation
-			 *
-			 * We would output this create record if it has
-			 * not been turned into a rename/copy already.
-			 */
-			struct diff_rename_dst *dst = locate_rename_dst(p->two);
-			if (dst && dst->pair) {
-				diff_q(&outq, dst->pair);
-				pair_to_free = p;
-			}
-			else
-				/* no matching rename/copy source, so
-				 * record this as a creation.
-				 */
-				diff_q(&outq, p);
+			/* Creation */
+			diff_q(&outq, p);
 		}
 		else if (DIFF_FILE_VALID(p->one) && !DIFF_FILE_VALID(p->two)) {
 			/*
@@ -682,8 +659,10 @@ void diffcore_rename(struct diff_options *options)
 			 */
 			if (DIFF_PAIR_BROKEN(p)) {
 				/* broken delete */
-				struct diff_rename_dst *dst = locate_rename_dst(p->one);
-				if (dst && dst->pair)
+				struct diff_rename_dst *dst = locate_rename_dst(p);
+				if (!dst)
+					BUG("tracking failed somehow; failed to find associated dst for broken pair");
+				if (dst->is_rename)
 					/* counterpart is now rename/copy */
 					pair_to_free = p;
 			}
@@ -693,16 +672,14 @@ void diffcore_rename(struct diff_options *options)
 					pair_to_free = p;
 			}
 
-			if (pair_to_free)
-				;
-			else
+			if (!pair_to_free)
 				diff_q(&outq, p);
 		}
 		else if (!diff_unmodified_pair(p))
 			/* all the usual ones need to be kept */
 			diff_q(&outq, p);
 		else
-			/* no need to keep unmodified pairs */
+			/* no need to keep unmodified pairs; FIXME: remove earlier? */
 			pair_to_free = p;
 
 		if (pair_to_free)
@@ -715,11 +692,16 @@ void diffcore_rename(struct diff_options *options)
 	diff_debug_queue("done collapsing", q);
 
 	for (i = 0; i < rename_dst_nr; i++)
-		free_filespec(rename_dst[i].two);
+		if (rename_dst[i].filespec_to_free)
+			free_filespec(rename_dst[i].filespec_to_free);
 
 	FREE_AND_NULL(rename_dst);
 	rename_dst_nr = rename_dst_alloc = 0;
 	FREE_AND_NULL(rename_src);
 	rename_src_nr = rename_src_alloc = 0;
+	if (break_idx) {
+		strintmap_clear(break_idx);
+		FREE_AND_NULL(break_idx);
+	}
 	return;
 }
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v2 9/9] diffcore-rename: remove unneccessary duplicate entry checks
  2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
                     ` (7 preceding siblings ...)
  2020-12-11  9:08   ` [PATCH v2 8/9] diffcore-rename: accelerate rename_dst setup Elijah Newren via GitGitGadget
@ 2020-12-11  9:08   ` Elijah Newren via GitGitGadget
  2020-12-29  8:31     ` Christian Couder
  2020-12-29 20:05   ` [PATCH v3 0/9] diffcore-rename improvements Elijah Newren via GitGitGadget
  9 siblings, 1 reply; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-11  9:08 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Commit 25d5ea410f ("[PATCH] Redo rename/copy detection logic.",
2005-05-24) added a duplicate entry check on rename_src in order to
avoid segfaults; the code at the time was prone to double free()s and an
easy way to avoid it was just to turn off rename detection for any
duplicate entries.  Note that the form of the check was modified two
commits ago in this series.

Similarly, commit 4d6be03b95 ("diffcore-rename: avoid processing
duplicate destinations", 2015-02-26) added a duplicate entry check
on rename_dst for the exact same reason -- the code was prone to double
free()s, and an easy way to avoid it was just to turn off rename
detection entirely.  Note that the form of the check was modified in the
commit just before this one.

In the original code in both places, the code was dealing with
individual diff_filespecs and trying to match things up, instead of just
keeping the original diff_filepairs around as we do now.  The
intervening change in structure has fixed the accounting problems and
the associated double free()s that used to occur, and thus we already
have a better fix.  As such, we can remove the band-aid checks for
duplicate entries.

Due to the last two patches, the diffcore_rename() setup is no longer a
sizeable chunk of overall runtime.  Thus, in a large rebase of many
commits with lots of renames and several optimizations to inexact rename
detection, this patch only speeds up the overall code by about half a
percent or so and is pretty close to the run-to-run variability making
it hard to get an exact measurement.  However, with some trace2 regions
around the setup code in diffcore_rename() so that I can focus on just
it, I measure that this patch consistently saves almost a third of the
remaining time spent in diffcore_rename() setup.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c          | 23 -----------------------
 t/t4058-diff-duplicates.sh |  2 +-
 2 files changed, 1 insertion(+), 24 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 2a8fcd52928..90db9ebd6d0 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -34,18 +34,6 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filepair *p)
  */
 static int add_rename_dst(struct diff_filepair *p)
 {
-	/*
-	 * If we have multiple entries at the same path in the destination
-	 * tree (an invalid tree, to be sure), turn off rename detection
-	 * entirely.  Once upon a time, diffcore-rename had double free()
-	 * issues due to such duplicate paths, resulting in segfaults.  See
-	 * commit 4d6be03b95 ("diffcore-rename: avoid processing duplicate
-	 * destinations", 2015-02-26).
-	 */
-	if (rename_dst_nr > 0 &&
-	    !strcmp(rename_dst[rename_dst_nr-1].p->two->path, p->two->path))
-		return -1;
-
 	ALLOC_GROW(rename_dst, rename_dst_nr + 1, rename_dst_alloc);
 	rename_dst[rename_dst_nr].p = p;
 	rename_dst[rename_dst_nr].filespec_to_free = NULL;
@@ -63,17 +51,6 @@ static int rename_src_nr, rename_src_alloc;
 
 static void register_rename_src(struct diff_filepair *p)
 {
-	/*
-	 * If we have multiple entries at the same path in the source tree
-	 * (an invalid tree, to be sure), avoid using more more than one
-	 * such entry in rename detection.  Once upon a time, doing so
-	 * caused segfaults; see commit 25d5ea410f ("[PATCH] Redo
-	 * rename/copy detection logic.", 2005-05-24).
-	 */
-	if (rename_src_nr > 0 &&
-	    !strcmp(rename_src[rename_src_nr-1].p->one->path, p->one->path))
-		return;
-
 	if (p->broken_pair) {
 		if (!break_idx) {
 			break_idx = xmalloc(sizeof(*break_idx));
diff --git a/t/t4058-diff-duplicates.sh b/t/t4058-diff-duplicates.sh
index ad3f37d388d..54614b814db 100755
--- a/t/t4058-diff-duplicates.sh
+++ b/t/t4058-diff-duplicates.sh
@@ -91,7 +91,7 @@ test_expect_success 'diff-tree between duplicate trees' '
 test_expect_success 'diff-tree with renames' '
 	# See NOTICE at top of file.
 	git diff-tree -M -r --no-abbrev one two >actual &&
-	test_cmp expect actual
+	test_must_be_empty actual
 '
 
 test_expect_success 'diff-tree FROM duplicate tree' '
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 9/9] diffcore-rename: remove unneccessary duplicate entry checks
  2020-12-11  9:08   ` [PATCH v2 9/9] diffcore-rename: remove unneccessary duplicate entry checks Elijah Newren via GitGitGadget
@ 2020-12-29  8:31     ` Christian Couder
  2020-12-29 18:09       ` Elijah Newren
  0 siblings, 1 reply; 55+ messages in thread
From: Christian Couder @ 2020-12-29  8:31 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren, Taylor Blau

About the subject:

s/unneccessary/unnecessary/

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 9/9] diffcore-rename: remove unneccessary duplicate entry checks
  2020-12-29  8:31     ` Christian Couder
@ 2020-12-29 18:09       ` Elijah Newren
  0 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren @ 2020-12-29 18:09 UTC (permalink / raw)
  To: Christian Couder; +Cc: Elijah Newren via GitGitGadget, git, Taylor Blau

On Tue, Dec 29, 2020 at 12:31 AM Christian Couder
<christian.couder@gmail.com> wrote:
>
> About the subject:
>
> s/unneccessary/unnecessary/

Thanks, I'll remove the unnecessary double c.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v3 0/9] diffcore-rename improvements
  2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
                     ` (8 preceding siblings ...)
  2020-12-11  9:08   ` [PATCH v2 9/9] diffcore-rename: remove unneccessary duplicate entry checks Elijah Newren via GitGitGadget
@ 2020-12-29 20:05   ` Elijah Newren via GitGitGadget
  2020-12-29 20:05     ` [PATCH v3 1/9] diffcore-rename: rename num_create to num_destinations Elijah Newren via GitGitGadget
                       ` (8 more replies)
  9 siblings, 9 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-29 20:05 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Taylor Blau, Christian Couder, Elijah Newren

This patch series includes 3 small code cleanups, 1 small bug fix (or
perhaps just a UI improvement -- it's a very minor issue either way), more
testcases, and 3 performance improvements. The first 7 patches are
relatively easy to review, while the second to last one...is a bit more
involved (but provided the biggest performance boost).

Changes since v2:

 * Fixed spelling error: remove unnecessary double c in 'unneccessary'
   (thanks to Christian for spotting)

Elijah Newren (9):
  diffcore-rename: rename num_create to num_destinations
  diffcore-rename: avoid usage of global in too_many_rename_candidates()
  diffcore-rename: simplify limit check
  diffcore-rename: reduce jumpiness in progress counters
  t4058: add more tests and documentation for duplicate tree entry
    handling
  t4058: explore duplicate tree entry handling in a bit more detail
  diffcore-rename: simplify and accelerate register_rename_src()
  diffcore-rename: accelerate rename_dst setup
  diffcore-rename: remove unnecessary duplicate entry checks

 diffcore-rename.c          | 209 ++++++++++++++-----------------------
 t/t4058-diff-duplicates.sh | 114 +++++++++++++++++++-
 2 files changed, 192 insertions(+), 131 deletions(-)


base-commit: 3a0b884caba2752da0af626fb2de7d597c844e8b
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-929%2Fnewren%2Fdiffcore-rename-improvements-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-929/newren/diffcore-rename-improvements-v3
Pull-Request: https://github.com/git/git/pull/929

Range-diff vs v2:

  1:  428d8204894 =  1:  428d8204894 diffcore-rename: rename num_create to num_destinations
  2:  fc62f4c4f89 =  2:  fc62f4c4f89 diffcore-rename: avoid usage of global in too_many_rename_candidates()
  3:  7214fa73fdd =  3:  7214fa73fdd diffcore-rename: simplify limit check
  4:  bf3581b1ac1 =  4:  bf3581b1ac1 diffcore-rename: reduce jumpiness in progress counters
  5:  9a4a3159acf =  5:  9a4a3159acf t4058: add more tests and documentation for duplicate tree entry handling
  6:  8db27892c59 =  6:  8db27892c59 t4058: explore duplicate tree entry handling in a bit more detail
  7:  a58639b2927 =  7:  a58639b2927 diffcore-rename: simplify and accelerate register_rename_src()
  8:  85714e1583d =  8:  85714e1583d diffcore-rename: accelerate rename_dst setup
  9:  af256b1c534 !  9:  723b0e55e75 diffcore-rename: remove unneccessary duplicate entry checks
     @@ Metadata
      Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
     -    diffcore-rename: remove unneccessary duplicate entry checks
     +    diffcore-rename: remove unnecessary duplicate entry checks
      
          Commit 25d5ea410f ("[PATCH] Redo rename/copy detection logic.",
          2005-05-24) added a duplicate entry check on rename_src in order to

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v3 1/9] diffcore-rename: rename num_create to num_destinations
  2020-12-29 20:05   ` [PATCH v3 0/9] diffcore-rename improvements Elijah Newren via GitGitGadget
@ 2020-12-29 20:05     ` Elijah Newren via GitGitGadget
  2020-12-29 20:05     ` [PATCH v3 2/9] diffcore-rename: avoid usage of global in too_many_rename_candidates() Elijah Newren via GitGitGadget
                       ` (7 subsequent siblings)
  8 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-29 20:05 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Taylor Blau, Christian Couder, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Our main data structures are rename_src and rename_dst.  For counters of
these data structures, num_sources and num_destinations seem natural;
definitely more so than using num_create for the latter.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index d367a6d2443..15a98f566e4 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -434,7 +434,7 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
  * 1 if we need to disable inexact rename detection;
  * 2 if we would be under the limit if we were given -C instead of -C -C.
  */
-static int too_many_rename_candidates(int num_create,
+static int too_many_rename_candidates(int num_destinations,
 				      struct diff_options *options)
 {
 	int rename_limit = options->rename_limit;
@@ -447,17 +447,17 @@ static int too_many_rename_candidates(int num_create,
 	 * This basically does a test for the rename matrix not
 	 * growing larger than a "rename_limit" square matrix, ie:
 	 *
-	 *    num_create * num_src > rename_limit * rename_limit
+	 *    num_destinations * num_src > rename_limit * rename_limit
 	 */
 	if (rename_limit <= 0)
 		rename_limit = 32767;
-	if ((num_create <= rename_limit || num_src <= rename_limit) &&
-	    ((uint64_t)num_create * (uint64_t)num_src
+	if ((num_destinations <= rename_limit || num_src <= rename_limit) &&
+	    ((uint64_t)num_destinations * (uint64_t)num_src
 	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
 		return 0;
 
 	options->needed_rename_limit =
-		num_src > num_create ? num_src : num_create;
+		num_src > num_destinations ? num_src : num_destinations;
 
 	/* Are we running under -C -C? */
 	if (!options->flags.find_copies_harder)
@@ -469,8 +469,8 @@ static int too_many_rename_candidates(int num_create,
 			continue;
 		num_src++;
 	}
-	if ((num_create <= rename_limit || num_src <= rename_limit) &&
-	    ((uint64_t)num_create * (uint64_t)num_src
+	if ((num_destinations <= rename_limit || num_src <= rename_limit) &&
+	    ((uint64_t)num_destinations * (uint64_t)num_src
 	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
 		return 2;
 	return 1;
@@ -505,7 +505,7 @@ void diffcore_rename(struct diff_options *options)
 	struct diff_queue_struct outq;
 	struct diff_score *mx;
 	int i, j, rename_count, skip_unmodified = 0;
-	int num_create, dst_cnt;
+	int num_destinations, dst_cnt;
 	struct progress *progress = NULL;
 
 	if (!minimum_score)
@@ -570,13 +570,13 @@ void diffcore_rename(struct diff_options *options)
 	 * Calculate how many renames are left (but all the source
 	 * files still remain as options for rename/copies!)
 	 */
-	num_create = (rename_dst_nr - rename_count);
+	num_destinations = (rename_dst_nr - rename_count);
 
 	/* All done? */
-	if (!num_create)
+	if (!num_destinations)
 		goto cleanup;
 
-	switch (too_many_rename_candidates(num_create, options)) {
+	switch (too_many_rename_candidates(num_destinations, options)) {
 	case 1:
 		goto cleanup;
 	case 2:
@@ -593,7 +593,8 @@ void diffcore_rename(struct diff_options *options)
 				(uint64_t)rename_dst_nr * (uint64_t)rename_src_nr);
 	}
 
-	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_create), sizeof(*mx));
+	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
+		     sizeof(*mx));
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
 		struct diff_filespec *two = rename_dst[i].two;
 		struct diff_score *m;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v3 2/9] diffcore-rename: avoid usage of global in too_many_rename_candidates()
  2020-12-29 20:05   ` [PATCH v3 0/9] diffcore-rename improvements Elijah Newren via GitGitGadget
  2020-12-29 20:05     ` [PATCH v3 1/9] diffcore-rename: rename num_create to num_destinations Elijah Newren via GitGitGadget
@ 2020-12-29 20:05     ` Elijah Newren via GitGitGadget
  2020-12-29 20:05     ` [PATCH v3 3/9] diffcore-rename: simplify limit check Elijah Newren via GitGitGadget
                       ` (6 subsequent siblings)
  8 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-29 20:05 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Taylor Blau, Christian Couder, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

too_many_rename_candidates() got the number of rename destinations via
an argument to the function, but the number of rename sources via a
global variable.  That felt rather inconsistent.  Pass in the number of
rename sources as an argument as well.

While we are at it... We had a local variable, num_src, that served two
purposes.  Initially it was set to the global value, but later was used
for counting a subset of the number of sources.  Since we now have a
function argument for the former usage, introduce a clearer variable
name for the latter usage.

This patch has no behavioral changes; it's just renaming and passing an
argument instead of grabbing it from the global namespace.  (You may
find it easier to view the patch using git diff's --color-words option.)

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 15a98f566e4..1d6675c040d 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -434,12 +434,11 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
  * 1 if we need to disable inexact rename detection;
  * 2 if we would be under the limit if we were given -C instead of -C -C.
  */
-static int too_many_rename_candidates(int num_destinations,
+static int too_many_rename_candidates(int num_destinations, int num_sources,
 				      struct diff_options *options)
 {
 	int rename_limit = options->rename_limit;
-	int num_src = rename_src_nr;
-	int i;
+	int i, limited_sources;
 
 	options->needed_rename_limit = 0;
 
@@ -447,30 +446,30 @@ static int too_many_rename_candidates(int num_destinations,
 	 * This basically does a test for the rename matrix not
 	 * growing larger than a "rename_limit" square matrix, ie:
 	 *
-	 *    num_destinations * num_src > rename_limit * rename_limit
+	 *    num_destinations * num_sources > rename_limit * rename_limit
 	 */
 	if (rename_limit <= 0)
 		rename_limit = 32767;
-	if ((num_destinations <= rename_limit || num_src <= rename_limit) &&
-	    ((uint64_t)num_destinations * (uint64_t)num_src
+	if ((num_destinations <= rename_limit || num_sources <= rename_limit) &&
+	    ((uint64_t)num_destinations * (uint64_t)num_sources
 	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
 		return 0;
 
 	options->needed_rename_limit =
-		num_src > num_destinations ? num_src : num_destinations;
+		num_sources > num_destinations ? num_sources : num_destinations;
 
 	/* Are we running under -C -C? */
 	if (!options->flags.find_copies_harder)
 		return 1;
 
 	/* Would we bust the limit if we were running under -C? */
-	for (num_src = i = 0; i < rename_src_nr; i++) {
+	for (limited_sources = i = 0; i < num_sources; i++) {
 		if (diff_unmodified_pair(rename_src[i].p))
 			continue;
-		num_src++;
+		limited_sources++;
 	}
-	if ((num_destinations <= rename_limit || num_src <= rename_limit) &&
-	    ((uint64_t)num_destinations * (uint64_t)num_src
+	if ((num_destinations <= rename_limit || limited_sources <= rename_limit) &&
+	    ((uint64_t)num_destinations * (uint64_t)limited_sources
 	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
 		return 2;
 	return 1;
@@ -576,7 +575,8 @@ void diffcore_rename(struct diff_options *options)
 	if (!num_destinations)
 		goto cleanup;
 
-	switch (too_many_rename_candidates(num_destinations, options)) {
+	switch (too_many_rename_candidates(num_destinations, rename_src_nr,
+					   options)) {
 	case 1:
 		goto cleanup;
 	case 2:
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v3 3/9] diffcore-rename: simplify limit check
  2020-12-29 20:05   ` [PATCH v3 0/9] diffcore-rename improvements Elijah Newren via GitGitGadget
  2020-12-29 20:05     ` [PATCH v3 1/9] diffcore-rename: rename num_create to num_destinations Elijah Newren via GitGitGadget
  2020-12-29 20:05     ` [PATCH v3 2/9] diffcore-rename: avoid usage of global in too_many_rename_candidates() Elijah Newren via GitGitGadget
@ 2020-12-29 20:05     ` Elijah Newren via GitGitGadget
  2021-11-09 21:14       ` Başar Uğur
  2020-12-29 20:05     ` [PATCH v3 4/9] diffcore-rename: reduce jumpiness in progress counters Elijah Newren via GitGitGadget
                       ` (5 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-29 20:05 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Taylor Blau, Christian Couder, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

diffcore-rename had two different checks of the form

    if ((a < limit || b < limit) &&
        a * b <= limit * limit)

This can be simplified to

    if (st_mult(a, b) <= st_mult(limit, limit))

which makes it clearer how we are checking for overflow, and makes it
much easier to parse given the drop from 8 to 4 variable appearances.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 1d6675c040d..16553ab259f 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -447,12 +447,16 @@ static int too_many_rename_candidates(int num_destinations, int num_sources,
 	 * growing larger than a "rename_limit" square matrix, ie:
 	 *
 	 *    num_destinations * num_sources > rename_limit * rename_limit
+	 *
+	 * We use st_mult() to check overflow conditions; in the
+	 * exceptional circumstance that size_t isn't large enough to hold
+	 * the multiplication, the system won't be able to allocate enough
+	 * memory for the matrix anyway.
 	 */
 	if (rename_limit <= 0)
 		rename_limit = 32767;
-	if ((num_destinations <= rename_limit || num_sources <= rename_limit) &&
-	    ((uint64_t)num_destinations * (uint64_t)num_sources
-	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
+	if (st_mult(num_destinations, num_sources)
+	    <= st_mult(rename_limit, rename_limit))
 		return 0;
 
 	options->needed_rename_limit =
@@ -468,9 +472,8 @@ static int too_many_rename_candidates(int num_destinations, int num_sources,
 			continue;
 		limited_sources++;
 	}
-	if ((num_destinations <= rename_limit || limited_sources <= rename_limit) &&
-	    ((uint64_t)num_destinations * (uint64_t)limited_sources
-	     <= (uint64_t)rename_limit * (uint64_t)rename_limit))
+	if (st_mult(num_destinations, limited_sources)
+	    <= st_mult(rename_limit, rename_limit))
 		return 2;
 	return 1;
 }
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v3 4/9] diffcore-rename: reduce jumpiness in progress counters
  2020-12-29 20:05   ` [PATCH v3 0/9] diffcore-rename improvements Elijah Newren via GitGitGadget
                       ` (2 preceding siblings ...)
  2020-12-29 20:05     ` [PATCH v3 3/9] diffcore-rename: simplify limit check Elijah Newren via GitGitGadget
@ 2020-12-29 20:05     ` Elijah Newren via GitGitGadget
  2020-12-29 20:05     ` [PATCH v3 5/9] t4058: add more tests and documentation for duplicate tree entry handling Elijah Newren via GitGitGadget
                       ` (4 subsequent siblings)
  8 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-29 20:05 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Taylor Blau, Christian Couder, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Inexact rename detection works by comparing all sources to all
destinations, computing similarities, and then finding the best matches
among those that are sufficiently similar.

However, it is preceded by exact rename detection that works by
checking if there are files with identical hashes.  If exact renames are
found, we can exclude some files from inexact rename detection.

The inexact rename detection loops over the full set of files, but
immediately skips those for which rename_dst[i].is_rename is true and
thus doesn't compare any sources to that destination.  As such, these
paths shouldn't be included in the progress counter.

For the eagle eyed, this change hints at an actual optimization -- the
first one I presented at Git Merge 2020.  I'll be submitting that
optimization later, once the basic merge-ort algorithm has merged.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 16553ab259f..55a188abcc3 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -593,7 +593,7 @@ void diffcore_rename(struct diff_options *options)
 	if (options->show_rename_progress) {
 		progress = start_delayed_progress(
 				_("Performing inexact rename detection"),
-				(uint64_t)rename_dst_nr * (uint64_t)rename_src_nr);
+				(uint64_t)num_destinations * (uint64_t)rename_src_nr);
 	}
 
 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
@@ -633,7 +633,8 @@ void diffcore_rename(struct diff_options *options)
 			diff_free_filespec_blob(two);
 		}
 		dst_cnt++;
-		display_progress(progress, (uint64_t)(i+1)*(uint64_t)rename_src_nr);
+		display_progress(progress,
+				 (uint64_t)dst_cnt * (uint64_t)rename_src_nr);
 	}
 	stop_progress(&progress);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v3 5/9] t4058: add more tests and documentation for duplicate tree entry handling
  2020-12-29 20:05   ` [PATCH v3 0/9] diffcore-rename improvements Elijah Newren via GitGitGadget
                       ` (3 preceding siblings ...)
  2020-12-29 20:05     ` [PATCH v3 4/9] diffcore-rename: reduce jumpiness in progress counters Elijah Newren via GitGitGadget
@ 2020-12-29 20:05     ` Elijah Newren via GitGitGadget
  2020-12-29 20:05     ` [PATCH v3 6/9] t4058: explore duplicate tree entry handling in a bit more detail Elijah Newren via GitGitGadget
                       ` (3 subsequent siblings)
  8 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-29 20:05 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Taylor Blau, Christian Couder, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Commit 4d6be03b95 ("diffcore-rename: avoid processing duplicate
destinations", 2015-02-26) added t4058 to demonstrate that a workaround
it added to avoid double frees (namely to just turn off rename detection
when trees had duplicate entries) would indeed avoid segfaults.  The
tests, though, give the impression that the expected diffs are "correct"
when in reality they are just "don't segfault, and do something
semi-reasonable under the circumstances".  Add some notes to make this
clearer.

Also, commit 25d5ea410f ("[PATCH] Redo rename/copy detection logic.",
2005-05-24) added a similar workaround to avoid segfaults, but for
rename_src rather than rename_dst.  I do not see any tests in the
testsuite to cover the collision detection of entries limited to the
source side, so add a couple.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t4058-diff-duplicates.sh | 47 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/t/t4058-diff-duplicates.sh b/t/t4058-diff-duplicates.sh
index c24ee175ef0..bd685089561 100755
--- a/t/t4058-diff-duplicates.sh
+++ b/t/t4058-diff-duplicates.sh
@@ -1,5 +1,14 @@
 #!/bin/sh
 
+# NOTICE:
+#   This testsuite does a number of diffs and checks that the output match.
+#   However, it is a "garbage in, garbage out" situation; the trees have
+#   duplicate entries for individual paths, and it results in diffs that do
+#   not make much sense.  As such, it is not clear that the diffs are
+#   "correct".  The primary purpose of these tests was to verify that
+#   diff-tree does not segfault, but there is perhaps some value in ensuring
+#   that the diff output isn't wildly unreasonable.
+
 test_description='test tree diff when trees have duplicate entries'
 . ./test-lib.sh
 
@@ -57,7 +66,16 @@ test_expect_success 'create trees with duplicate entries' '
 	git tag two $outer_two
 '
 
-test_expect_success 'diff-tree between trees' '
+test_expect_success 'create tree without duplicate entries' '
+	blob_one=$(echo one | git hash-object -w --stdin) &&
+	outer_three=$(make_tree \
+		100644 renamed $blob_one
+	) &&
+	git tag three $outer_three
+'
+
+test_expect_success 'diff-tree between duplicate trees' '
+	# See NOTICE at top of file
 	{
 		printf ":000000 100644 $ZERO_OID $blob_two A\touter/inner\n" &&
 		printf ":000000 100644 $ZERO_OID $blob_two A\touter/inner\n" &&
@@ -71,9 +89,34 @@ test_expect_success 'diff-tree between trees' '
 '
 
 test_expect_success 'diff-tree with renames' '
-	# same expectation as above, since we disable rename detection
+	# See NOTICE at top of file.
 	git diff-tree -M -r --no-abbrev one two >actual &&
 	test_cmp expect actual
 '
 
+test_expect_success 'diff-tree FROM duplicate tree' '
+	# See NOTICE at top of file.
+	{
+		printf ":100644 000000 $blob_one $ZERO_OID D\touter/inner\n" &&
+		printf ":100644 000000 $blob_two $ZERO_OID D\touter/inner\n" &&
+		printf ":100644 000000 $blob_two $ZERO_OID D\touter/inner\n" &&
+		printf ":100644 000000 $blob_two $ZERO_OID D\touter/inner\n" &&
+		printf ":000000 100644 $ZERO_OID $blob_one A\trenamed\n"
+	} >expect &&
+	git diff-tree -r --no-abbrev one three >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'diff-tree FROM duplicate tree, with renames' '
+	# See NOTICE at top of file.
+	{
+		printf ":100644 000000 $blob_two $ZERO_OID D\touter/inner\n" &&
+		printf ":100644 000000 $blob_two $ZERO_OID D\touter/inner\n" &&
+		printf ":100644 000000 $blob_two $ZERO_OID D\touter/inner\n" &&
+		printf ":100644 100644 $blob_one $blob_one R100\touter/inner\trenamed\n"
+	} >expect &&
+	git diff-tree -M -r --no-abbrev one three >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v3 6/9] t4058: explore duplicate tree entry handling in a bit more detail
  2020-12-29 20:05   ` [PATCH v3 0/9] diffcore-rename improvements Elijah Newren via GitGitGadget
                       ` (4 preceding siblings ...)
  2020-12-29 20:05     ` [PATCH v3 5/9] t4058: add more tests and documentation for duplicate tree entry handling Elijah Newren via GitGitGadget
@ 2020-12-29 20:05     ` Elijah Newren via GitGitGadget
  2020-12-29 20:05     ` [PATCH v3 7/9] diffcore-rename: simplify and accelerate register_rename_src() Elijah Newren via GitGitGadget
                       ` (2 subsequent siblings)
  8 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-29 20:05 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Taylor Blau, Christian Couder, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

While creating the last commit, I found a number of other cases where
git would segfault when faced with trees that have duplicate entries.
None of these segfaults are in the diffcore-rename code (they all occur
in cache-tree and unpack-trees).  Further, to my knowledge, no one has
ever been adversely affected by these bugs, and given that it has been
15 years and folks have fixed a few other issues with historical
duplicate entries (as noted in the last commit), I am not sure we will
ever run into anyone having problems with these.  So I am not sure these
are worth fixing, but it doesn't hurt to at least document these
failures in the same test file that is concerned with duplicate tree
entries.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t4058-diff-duplicates.sh | 67 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/t/t4058-diff-duplicates.sh b/t/t4058-diff-duplicates.sh
index bd685089561..ad3f37d388d 100755
--- a/t/t4058-diff-duplicates.sh
+++ b/t/t4058-diff-duplicates.sh
@@ -119,4 +119,71 @@ test_expect_success 'diff-tree FROM duplicate tree, with renames' '
 	test_cmp expect actual
 '
 
+test_expect_success 'create a few commits' '
+	git commit-tree -m "Duplicate Entries" two^{tree} >commit_id &&
+	git branch base $(cat commit_id) &&
+
+	git commit-tree -p $(cat commit_id) -m "Just one" three^{tree} >up &&
+	git branch update $(cat up) &&
+
+	git commit-tree -p $(cat up) -m "Back to weird" two^{tree} >final &&
+	git branch final $(cat final) &&
+
+	rm commit_id up final
+'
+
+test_expect_failure 'git read-tree does not segfault' '
+	test_when_finished rm .git/index.lock &&
+	test_might_fail git read-tree --reset base
+'
+
+test_expect_failure 'reset --hard does not segfault' '
+	test_when_finished rm .git/index.lock &&
+	git checkout base &&
+	test_might_fail git reset --hard
+'
+
+test_expect_failure 'git diff HEAD does not segfault' '
+	git checkout base &&
+	GIT_TEST_CHECK_CACHE_TREE=false &&
+	git reset --hard &&
+	test_might_fail git diff HEAD
+'
+
+test_expect_failure 'can switch to another branch when status is empty' '
+	git clean -ffdqx &&
+	git status --porcelain -uno >actual &&
+	test_must_be_empty actual &&
+	git checkout update
+'
+
+test_expect_success 'forcibly switch to another branch, verify status empty' '
+	git checkout -f update &&
+	git status --porcelain -uno >actual &&
+	test_must_be_empty actual
+'
+
+test_expect_success 'fast-forward from non-duplicate entries to duplicate' '
+	git merge final
+'
+
+test_expect_failure 'clean status, switch branches, status still clean' '
+	git status --porcelain -uno >actual &&
+	test_must_be_empty actual &&
+	git checkout base &&
+	git status --porcelain -uno >actual &&
+	test_must_be_empty actual
+'
+
+test_expect_success 'switch to base branch and force status to be clean' '
+	git checkout base &&
+	GIT_TEST_CHECK_CACHE_TREE=false git reset --hard &&
+	git status --porcelain -uno >actual &&
+	test_must_be_empty actual
+'
+
+test_expect_failure 'fast-forward from duplicate entries to non-duplicate' '
+	git merge update
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v3 7/9] diffcore-rename: simplify and accelerate register_rename_src()
  2020-12-29 20:05   ` [PATCH v3 0/9] diffcore-rename improvements Elijah Newren via GitGitGadget
                       ` (5 preceding siblings ...)
  2020-12-29 20:05     ` [PATCH v3 6/9] t4058: explore duplicate tree entry handling in a bit more detail Elijah Newren via GitGitGadget
@ 2020-12-29 20:05     ` Elijah Newren via GitGitGadget
  2020-12-29 20:05     ` [PATCH v3 8/9] diffcore-rename: accelerate rename_dst setup Elijah Newren via GitGitGadget
  2020-12-29 20:05     ` [PATCH v3 9/9] diffcore-rename: remove unnecessary duplicate entry checks Elijah Newren via GitGitGadget
  8 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-29 20:05 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Taylor Blau, Christian Couder, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

register_rename_src() took pains to create an array in rename_src which
was sorted by pathname of the contained diff_filepair.  The sorting was
entirely unnecessary since callers pass filepairs to us in sorted
order.  We can simply append to the end of the rename_src array,
speeding up diffcore_rename() setup time.

Also, note that I dropped the return type on the function since it was
unconditionally discarded anyway.

This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection,
diffcore_rename() setup time was a sizeable chunk of overall runtime.
This patch dropped execution time of rebasing 35 commits with lots of
renames by 2% overall.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 39 +++++++++++++--------------------------
 1 file changed, 13 insertions(+), 26 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 55a188abcc3..a215421a9cb 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -76,36 +76,23 @@ static struct diff_rename_src {
 } *rename_src;
 static int rename_src_nr, rename_src_alloc;
 
-static struct diff_rename_src *register_rename_src(struct diff_filepair *p)
+static void register_rename_src(struct diff_filepair *p)
 {
-	int first, last;
-	struct diff_filespec *one = p->one;
-	unsigned short score = p->score;
-
-	first = 0;
-	last = rename_src_nr;
-	while (last > first) {
-		int next = first + ((last - first) >> 1);
-		struct diff_rename_src *src = &(rename_src[next]);
-		int cmp = strcmp(one->path, src->p->one->path);
-		if (!cmp)
-			return src;
-		if (cmp < 0) {
-			last = next;
-			continue;
-		}
-		first = next+1;
-	}
+	/*
+	 * If we have multiple entries at the same path in the source tree
+	 * (an invalid tree, to be sure), avoid using more more than one
+	 * such entry in rename detection.  Once upon a time, doing so
+	 * caused segfaults; see commit 25d5ea410f ("[PATCH] Redo
+	 * rename/copy detection logic.", 2005-05-24).
+	 */
+	if (rename_src_nr > 0 &&
+	    !strcmp(rename_src[rename_src_nr-1].p->one->path, p->one->path))
+		return;
 
-	/* insert to make it at "first" */
 	ALLOC_GROW(rename_src, rename_src_nr + 1, rename_src_alloc);
+	rename_src[rename_src_nr].p = p;
+	rename_src[rename_src_nr].score = p->score;
 	rename_src_nr++;
-	if (first < rename_src_nr)
-		MOVE_ARRAY(rename_src + first + 1, rename_src + first,
-			   rename_src_nr - first - 1);
-	rename_src[first].p = p;
-	rename_src[first].score = score;
-	return &(rename_src[first]);
 }
 
 static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v3 8/9] diffcore-rename: accelerate rename_dst setup
  2020-12-29 20:05   ` [PATCH v3 0/9] diffcore-rename improvements Elijah Newren via GitGitGadget
                       ` (6 preceding siblings ...)
  2020-12-29 20:05     ` [PATCH v3 7/9] diffcore-rename: simplify and accelerate register_rename_src() Elijah Newren via GitGitGadget
@ 2020-12-29 20:05     ` Elijah Newren via GitGitGadget
  2020-12-29 20:05     ` [PATCH v3 9/9] diffcore-rename: remove unnecessary duplicate entry checks Elijah Newren via GitGitGadget
  8 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-29 20:05 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Taylor Blau, Christian Couder, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

register_rename_src() simply references the passed pair inside
rename_src.  In contrast, add_rename_dst() did something entirely
different for rename_dst.  Instead of copying the passed pair, it made a
copy of the second diff_filespec from the passed pair, referenced it,
and then set the diff_rename_dst.pair field to NULL.  Later, when a
pairing is found, record_rename_pair() allocated a full diff_filepair
via diff_queue() and pointed its src and dst fields at the appropriate
diff_filespecs.  This contrast between register_rename_src() for the
rename_src data structure and add_rename_dst() for the rename_dst data
structure is oddly inconsistent and requires more memory and work than
necessary.  Let's just reference the original diff_filepair in
rename_dst as-is, just as we do with rename_src.  Add a new
rename_dst.is_rename field, since the rename_dst.p field is never NULL
unlike the old rename_dst.pair field.

Taking advantage of this change and the fact that same-named paths will
be adjacent, we can get rid of the sorting of the array and most of the
lookups on it, allowing us to instead just append as we go.  However,
there is one remaining reason to still keep locate_rename_dst():
handling broken pairs (i.e. when break detection is on).  Those are
somewhat rare, but we can set up a simple strintmap to get the map
between the source and the index.  Doing that allows us to still have a
fast lookup without sorting the rename_dst array.  Since the sorting had
been done in a weakly quadratic manner, when many renames are involved
this time could add up.

There is still a strcmp() in add_rename_dst() that I have left in place
to make it easier to verify that the algorithm has the same results.
This strcmp() is there to check for duplicate destination entries (which
was the easiest way at the time to avoid segfaults in the
diffcore-rename code when trees had multiple entries at a given path).
The underlying double free()s are no longer an issue with the new
algorithm, but that can be addressed in a subsequent commit.

This patch is being submitted in a different order than its original
development, but in a large rebase of many commits with lots of renames
and with several optimizations to inexact rename detection, both setup
time and write back to output queue time from diffcore_rename() were
sizeable chunks of overall runtime.  This patch accelerated the setup
time by about 65%, and final write back to the output queue time by
about 50%, resulting in an overall drop of 3.5% on the execution time of
rebasing a few dozen patches.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 148 ++++++++++++++++++++--------------------------
 1 file changed, 65 insertions(+), 83 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index a215421a9cb..2a8fcd52928 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -9,63 +9,48 @@
 #include "hashmap.h"
 #include "progress.h"
 #include "promisor-remote.h"
+#include "strmap.h"
 
 /* Table of rename/copy destinations */
 
 static struct diff_rename_dst {
-	struct diff_filespec *two;
-	struct diff_filepair *pair;
+	struct diff_filepair *p;
+	struct diff_filespec *filespec_to_free;
+	int is_rename; /* false -> just a create; true -> rename or copy */
 } *rename_dst;
 static int rename_dst_nr, rename_dst_alloc;
+/* Mapping from break source pathname to break destination index */
+static struct strintmap *break_idx = NULL;
 
-static int find_rename_dst(struct diff_filespec *two)
-{
-	int first, last;
-
-	first = 0;
-	last = rename_dst_nr;
-	while (last > first) {
-		int next = first + ((last - first) >> 1);
-		struct diff_rename_dst *dst = &(rename_dst[next]);
-		int cmp = strcmp(two->path, dst->two->path);
-		if (!cmp)
-			return next;
-		if (cmp < 0) {
-			last = next;
-			continue;
-		}
-		first = next+1;
-	}
-	return -first - 1;
-}
-
-static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two)
+static struct diff_rename_dst *locate_rename_dst(struct diff_filepair *p)
 {
-	int ofs = find_rename_dst(two);
-	return ofs < 0 ? NULL : &rename_dst[ofs];
+	/* Lookup by p->ONE->path */
+	int idx = break_idx ? strintmap_get(break_idx, p->one->path) : -1;
+	return (idx == -1) ? NULL : &rename_dst[idx];
 }
 
 /*
  * Returns 0 on success, -1 if we found a duplicate.
  */
-static int add_rename_dst(struct diff_filespec *two)
+static int add_rename_dst(struct diff_filepair *p)
 {
-	int first = find_rename_dst(two);
-
-	if (first >= 0)
+	/*
+	 * If we have multiple entries at the same path in the destination
+	 * tree (an invalid tree, to be sure), turn off rename detection
+	 * entirely.  Once upon a time, diffcore-rename had double free()
+	 * issues due to such duplicate paths, resulting in segfaults.  See
+	 * commit 4d6be03b95 ("diffcore-rename: avoid processing duplicate
+	 * destinations", 2015-02-26).
+	 */
+	if (rename_dst_nr > 0 &&
+	    !strcmp(rename_dst[rename_dst_nr-1].p->two->path, p->two->path))
 		return -1;
-	first = -first - 1;
 
-	/* insert to make it at "first" */
 	ALLOC_GROW(rename_dst, rename_dst_nr + 1, rename_dst_alloc);
+	rename_dst[rename_dst_nr].p = p;
+	rename_dst[rename_dst_nr].filespec_to_free = NULL;
+	rename_dst[rename_dst_nr].is_rename = 0;
 	rename_dst_nr++;
-	if (first < rename_dst_nr)
-		MOVE_ARRAY(rename_dst + first + 1, rename_dst + first,
-			   rename_dst_nr - first - 1);
-	rename_dst[first].two = alloc_filespec(two->path);
-	fill_filespec(rename_dst[first].two, &two->oid, two->oid_valid,
-		      two->mode);
-	rename_dst[first].pair = NULL;
 	return 0;
 }
 
@@ -89,6 +74,14 @@ static void register_rename_src(struct diff_filepair *p)
 	    !strcmp(rename_src[rename_src_nr-1].p->one->path, p->one->path))
 		return;
 
+	if (p->broken_pair) {
+		if (!break_idx) {
+			break_idx = xmalloc(sizeof(*break_idx));
+			strintmap_init(break_idx, -1);
+		}
+		strintmap_set(break_idx, p->one->path, rename_dst_nr);
+	}
+
 	ALLOC_GROW(rename_src, rename_src_nr + 1, rename_src_alloc);
 	rename_src[rename_src_nr].p = p;
 	rename_src[rename_src_nr].score = p->score;
@@ -128,14 +121,14 @@ static void prefetch(void *prefetch_options)
 	struct oid_array to_fetch = OID_ARRAY_INIT;
 
 	for (i = 0; i < rename_dst_nr; i++) {
-		if (rename_dst[i].pair)
+		if (rename_dst[i].p->renamed_pair)
 			/*
 			 * The loop in diffcore_rename() will not need these
 			 * blobs, so skip prefetching.
 			 */
 			continue; /* already found exact match */
 		diff_add_if_missing(options->repo, &to_fetch,
-				    rename_dst[i].two);
+				    rename_dst[i].p->two);
 	}
 	for (i = 0; i < rename_src_nr; i++) {
 		if (options->skip_unmodified &&
@@ -245,26 +238,24 @@ static int estimate_similarity(struct repository *r,
 
 static void record_rename_pair(int dst_index, int src_index, int score)
 {
-	struct diff_filespec *src, *dst;
-	struct diff_filepair *dp;
+	struct diff_filepair *src = rename_src[src_index].p;
+	struct diff_filepair *dst = rename_dst[dst_index].p;
 
-	if (rename_dst[dst_index].pair)
+	if (dst->renamed_pair)
 		die("internal error: dst already matched.");
 
-	src = rename_src[src_index].p->one;
-	src->rename_used++;
-	src->count++;
+	src->one->rename_used++;
+	src->one->count++;
 
-	dst = rename_dst[dst_index].two;
-	dst->count++;
+	rename_dst[dst_index].filespec_to_free = dst->one;
+	rename_dst[dst_index].is_rename = 1;
 
-	dp = diff_queue(NULL, src, dst);
-	dp->renamed_pair = 1;
-	if (!strcmp(src->path, dst->path))
-		dp->score = rename_src[src_index].score;
+	dst->one = src->one;
+	dst->renamed_pair = 1;
+	if (!strcmp(dst->one->path, dst->two->path))
+		dst->score = rename_src[src_index].score;
 	else
-		dp->score = score;
-	rename_dst[dst_index].pair = dp;
+		dst->score = score;
 }
 
 /*
@@ -310,7 +301,7 @@ static int find_identical_files(struct hashmap *srcs,
 				struct diff_options *options)
 {
 	int renames = 0;
-	struct diff_filespec *target = rename_dst[dst_index].two;
+	struct diff_filespec *target = rename_dst[dst_index].p->two;
 	struct file_similarity *p, *best = NULL;
 	int i = 100, best_score = -1;
 	unsigned int hash = hash_filespec(options->repo, target);
@@ -476,7 +467,7 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
 		    (mx[i].score < minimum_score))
 			break; /* there is no more usable pair. */
 		dst = &rename_dst[mx[i].dst];
-		if (dst->pair)
+		if (dst->is_rename)
 			continue; /* already done, either exact or fuzzy. */
 		if (!copies && rename_src[mx[i].src].p->one->rename_used)
 			continue;
@@ -511,7 +502,7 @@ void diffcore_rename(struct diff_options *options)
 			else if (!options->flags.rename_empty &&
 				 is_empty_blob_oid(&p->two->oid))
 				continue;
-			else if (add_rename_dst(p->two) < 0) {
+			else if (add_rename_dst(p) < 0) {
 				warning("skipping rename detection, detected"
 					" duplicate destination '%s'",
 					p->two->path);
@@ -586,10 +577,10 @@ void diffcore_rename(struct diff_options *options)
 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
 		     sizeof(*mx));
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
-		struct diff_filespec *two = rename_dst[i].two;
+		struct diff_filespec *two = rename_dst[i].p->two;
 		struct diff_score *m;
 
-		if (rename_dst[i].pair)
+		if (rename_dst[i].is_rename)
 			continue; /* dealt with exact match already. */
 
 		m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
@@ -646,22 +637,8 @@ void diffcore_rename(struct diff_options *options)
 			diff_q(&outq, p);
 		}
 		else if (!DIFF_FILE_VALID(p->one) && DIFF_FILE_VALID(p->two)) {
-			/*
-			 * Creation
-			 *
-			 * We would output this create record if it has
-			 * not been turned into a rename/copy already.
-			 */
-			struct diff_rename_dst *dst = locate_rename_dst(p->two);
-			if (dst && dst->pair) {
-				diff_q(&outq, dst->pair);
-				pair_to_free = p;
-			}
-			else
-				/* no matching rename/copy source, so
-				 * record this as a creation.
-				 */
-				diff_q(&outq, p);
+			/* Creation */
+			diff_q(&outq, p);
 		}
 		else if (DIFF_FILE_VALID(p->one) && !DIFF_FILE_VALID(p->two)) {
 			/*
@@ -682,8 +659,10 @@ void diffcore_rename(struct diff_options *options)
 			 */
 			if (DIFF_PAIR_BROKEN(p)) {
 				/* broken delete */
-				struct diff_rename_dst *dst = locate_rename_dst(p->one);
-				if (dst && dst->pair)
+				struct diff_rename_dst *dst = locate_rename_dst(p);
+				if (!dst)
+					BUG("tracking failed somehow; failed to find associated dst for broken pair");
+				if (dst->is_rename)
 					/* counterpart is now rename/copy */
 					pair_to_free = p;
 			}
@@ -693,16 +672,14 @@ void diffcore_rename(struct diff_options *options)
 					pair_to_free = p;
 			}
 
-			if (pair_to_free)
-				;
-			else
+			if (!pair_to_free)
 				diff_q(&outq, p);
 		}
 		else if (!diff_unmodified_pair(p))
 			/* all the usual ones need to be kept */
 			diff_q(&outq, p);
 		else
-			/* no need to keep unmodified pairs */
+			/* no need to keep unmodified pairs; FIXME: remove earlier? */
 			pair_to_free = p;
 
 		if (pair_to_free)
@@ -715,11 +692,16 @@ void diffcore_rename(struct diff_options *options)
 	diff_debug_queue("done collapsing", q);
 
 	for (i = 0; i < rename_dst_nr; i++)
-		free_filespec(rename_dst[i].two);
+		if (rename_dst[i].filespec_to_free)
+			free_filespec(rename_dst[i].filespec_to_free);
 
 	FREE_AND_NULL(rename_dst);
 	rename_dst_nr = rename_dst_alloc = 0;
 	FREE_AND_NULL(rename_src);
 	rename_src_nr = rename_src_alloc = 0;
+	if (break_idx) {
+		strintmap_clear(break_idx);
+		FREE_AND_NULL(break_idx);
+	}
 	return;
 }
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v3 9/9] diffcore-rename: remove unnecessary duplicate entry checks
  2020-12-29 20:05   ` [PATCH v3 0/9] diffcore-rename improvements Elijah Newren via GitGitGadget
                       ` (7 preceding siblings ...)
  2020-12-29 20:05     ` [PATCH v3 8/9] diffcore-rename: accelerate rename_dst setup Elijah Newren via GitGitGadget
@ 2020-12-29 20:05     ` Elijah Newren via GitGitGadget
  8 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren via GitGitGadget @ 2020-12-29 20:05 UTC (permalink / raw)
  To: git
  Cc: Elijah Newren, Taylor Blau, Christian Couder, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Commit 25d5ea410f ("[PATCH] Redo rename/copy detection logic.",
2005-05-24) added a duplicate entry check on rename_src in order to
avoid segfaults; the code at the time was prone to double free()s and an
easy way to avoid it was just to turn off rename detection for any
duplicate entries.  Note that the form of the check was modified two
commits ago in this series.

Similarly, commit 4d6be03b95 ("diffcore-rename: avoid processing
duplicate destinations", 2015-02-26) added a duplicate entry check
on rename_dst for the exact same reason -- the code was prone to double
free()s, and an easy way to avoid it was just to turn off rename
detection entirely.  Note that the form of the check was modified in the
commit just before this one.

In the original code in both places, the code was dealing with
individual diff_filespecs and trying to match things up, instead of just
keeping the original diff_filepairs around as we do now.  The
intervening change in structure has fixed the accounting problems and
the associated double free()s that used to occur, and thus we already
have a better fix.  As such, we can remove the band-aid checks for
duplicate entries.

Due to the last two patches, the diffcore_rename() setup is no longer a
sizeable chunk of overall runtime.  Thus, in a large rebase of many
commits with lots of renames and several optimizations to inexact rename
detection, this patch only speeds up the overall code by about half a
percent or so and is pretty close to the run-to-run variability making
it hard to get an exact measurement.  However, with some trace2 regions
around the setup code in diffcore_rename() so that I can focus on just
it, I measure that this patch consistently saves almost a third of the
remaining time spent in diffcore_rename() setup.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c          | 23 -----------------------
 t/t4058-diff-duplicates.sh |  2 +-
 2 files changed, 1 insertion(+), 24 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 2a8fcd52928..90db9ebd6d0 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -34,18 +34,6 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filepair *p)
  */
 static int add_rename_dst(struct diff_filepair *p)
 {
-	/*
-	 * If we have multiple entries at the same path in the destination
-	 * tree (an invalid tree, to be sure), turn off rename detection
-	 * entirely.  Once upon a time, diffcore-rename had double free()
-	 * issues due to such duplicate paths, resulting in segfaults.  See
-	 * commit 4d6be03b95 ("diffcore-rename: avoid processing duplicate
-	 * destinations", 2015-02-26).
-	 */
-	if (rename_dst_nr > 0 &&
-	    !strcmp(rename_dst[rename_dst_nr-1].p->two->path, p->two->path))
-		return -1;
-
 	ALLOC_GROW(rename_dst, rename_dst_nr + 1, rename_dst_alloc);
 	rename_dst[rename_dst_nr].p = p;
 	rename_dst[rename_dst_nr].filespec_to_free = NULL;
@@ -63,17 +51,6 @@ static int rename_src_nr, rename_src_alloc;
 
 static void register_rename_src(struct diff_filepair *p)
 {
-	/*
-	 * If we have multiple entries at the same path in the source tree
-	 * (an invalid tree, to be sure), avoid using more more than one
-	 * such entry in rename detection.  Once upon a time, doing so
-	 * caused segfaults; see commit 25d5ea410f ("[PATCH] Redo
-	 * rename/copy detection logic.", 2005-05-24).
-	 */
-	if (rename_src_nr > 0 &&
-	    !strcmp(rename_src[rename_src_nr-1].p->one->path, p->one->path))
-		return;
-
 	if (p->broken_pair) {
 		if (!break_idx) {
 			break_idx = xmalloc(sizeof(*break_idx));
diff --git a/t/t4058-diff-duplicates.sh b/t/t4058-diff-duplicates.sh
index ad3f37d388d..54614b814db 100755
--- a/t/t4058-diff-duplicates.sh
+++ b/t/t4058-diff-duplicates.sh
@@ -91,7 +91,7 @@ test_expect_success 'diff-tree between duplicate trees' '
 test_expect_success 'diff-tree with renames' '
 	# See NOTICE at top of file.
 	git diff-tree -M -r --no-abbrev one two >actual &&
-	test_cmp expect actual
+	test_must_be_empty actual
 '
 
 test_expect_success 'diff-tree FROM duplicate tree' '
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 6/9] t4058: explore duplicate tree entry handling in a bit more detail
  2020-12-11  9:08   ` [PATCH v2 6/9] t4058: explore duplicate tree entry handling in a bit more detail Elijah Newren via GitGitGadget
@ 2021-04-21 12:29     ` Ævar Arnfjörð Bjarmason
  2021-04-21 17:38       ` Elijah Newren
  0 siblings, 1 reply; 55+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-04-21 12:29 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Taylor Blau, Elijah Newren


On Fri, Dec 11 2020, Elijah Newren via GitGitGadget wrote:

> While creating the last commit, I found a number of other cases where
> git would segfault when faced with trees that have duplicate entries.
> None of these segfaults are in the diffcore-rename code (they all occur
> in cache-tree and unpack-trees).  Further, to my knowledge, no one has
> ever been adversely affected by these bugs, and given that it has been
> 15 years and folks have fixed a few other issues with historical
> duplicate entries (as noted in the last commit), I am not sure we will
> ever run into anyone having problems with these.  So I am not sure these
> are worth fixing, but it doesn't hurt to at least document these
> failures in the same test file that is concerned with duplicate tree
> entries.
> [...]
> +test_expect_failure 'fast-forward from duplicate entries to non-duplicate' '
> +	git merge update
> +'
> +
>  test_done

Per https://lore.kernel.org/git/87lf9b3mth.fsf@evledraar.gmail.com/
isn't the noise of having a segfault from "git" worth fixing in itself
though? I.e. something like this, so we at least se why it fails:

diff --git a/t/t4058-diff-duplicates.sh b/t/t4058-diff-duplicates.sh
index 54614b814db..ed91d9f7fe9 100755
--- a/t/t4058-diff-duplicates.sh
+++ b/t/t4058-diff-duplicates.sh
@@ -182,8 +182,10 @@ test_expect_success 'switch to base branch and force status to be clean' '
 	test_must_be_empty actual
 '
 
-test_expect_failure 'fast-forward from duplicate entries to non-duplicate' '
-	git merge update
+test_expect_success 'fast-forward from duplicate entries to non-duplicate' '
+	! git merge update 2>err &&
+	grep "^BUG: " err &&
+	grep -F "should have entry at o->src_index->cache[1]" err
 '
 
 test_done
diff --git a/unpack-trees.c b/unpack-trees.c
index 8a1afbc1e49..230cb073fe1 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -789,8 +789,11 @@ static int traverse_by_cache_tree(int pos, int nr_entries, int nr_names,
 	 */
 	for (i = 0; i < nr_entries; i++) {
 		int new_ce_len, len, rc;
+		int j = pos + i;
 
-		src[0] = o->src_index->cache[pos + i];
+		src[0] = o->src_index->cache[j];
+		if (!src[0])
+			BUG("should have entry at o->src_index->cache[%d]", j);
 
 		len = ce_namelen(src[0]);
 		new_ce_len = cache_entry_size(len);


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 6/9] t4058: explore duplicate tree entry handling in a bit more detail
  2021-04-21 12:29     ` Ævar Arnfjörð Bjarmason
@ 2021-04-21 17:38       ` Elijah Newren
  0 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren @ 2021-04-21 17:38 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Taylor Blau

On Wed, Apr 21, 2021 at 5:29 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Fri, Dec 11 2020, Elijah Newren via GitGitGadget wrote:
>
> > While creating the last commit, I found a number of other cases where
> > git would segfault when faced with trees that have duplicate entries.
> > None of these segfaults are in the diffcore-rename code (they all occur
> > in cache-tree and unpack-trees).  Further, to my knowledge, no one has
> > ever been adversely affected by these bugs, and given that it has been
> > 15 years and folks have fixed a few other issues with historical
> > duplicate entries (as noted in the last commit), I am not sure we will
> > ever run into anyone having problems with these.  So I am not sure these
> > are worth fixing, but it doesn't hurt to at least document these
> > failures in the same test file that is concerned with duplicate tree
> > entries.
> > [...]
> > +test_expect_failure 'fast-forward from duplicate entries to non-duplicate' '
> > +     git merge update
> > +'
> > +
> >  test_done
>
> Per https://lore.kernel.org/git/87lf9b3mth.fsf@evledraar.gmail.com/
> isn't the noise of having a segfault from "git" worth fixing in itself
> though? I.e. something like this, so we at least se why it fails:
>
> diff --git a/t/t4058-diff-duplicates.sh b/t/t4058-diff-duplicates.sh
> index 54614b814db..ed91d9f7fe9 100755
> --- a/t/t4058-diff-duplicates.sh
> +++ b/t/t4058-diff-duplicates.sh
> @@ -182,8 +182,10 @@ test_expect_success 'switch to base branch and force status to be clean' '
>         test_must_be_empty actual
>  '
>
> -test_expect_failure 'fast-forward from duplicate entries to non-duplicate' '
> -       git merge update
> +test_expect_success 'fast-forward from duplicate entries to non-duplicate' '
> +       ! git merge update 2>err &&
> +       grep "^BUG: " err &&
> +       grep -F "should have entry at o->src_index->cache[1]" err
>  '
>
>  test_done
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 8a1afbc1e49..230cb073fe1 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -789,8 +789,11 @@ static int traverse_by_cache_tree(int pos, int nr_entries, int nr_names,
>          */
>         for (i = 0; i < nr_entries; i++) {
>                 int new_ce_len, len, rc;
> +               int j = pos + i;
>
> -               src[0] = o->src_index->cache[pos + i];
> +               src[0] = o->src_index->cache[j];
> +               if (!src[0])
> +                       BUG("should have entry at o->src_index->cache[%d]", j);
>
>                 len = ce_namelen(src[0]);
>                 new_ce_len = cache_entry_size(len);
>

Seems reasonable to me.  Are you planning to add a commit message and
turn it into a proper patch?  If so, I'll give my Thumbs-up-by or
whatever we need.  :-)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v3 3/9] diffcore-rename: simplify limit check
  2020-12-29 20:05     ` [PATCH v3 3/9] diffcore-rename: simplify limit check Elijah Newren via GitGitGadget
@ 2021-11-09 21:14       ` Başar Uğur
  2021-11-10 20:06         ` Elijah Newren
  0 siblings, 1 reply; 55+ messages in thread
From: Başar Uğur @ 2021-11-09 21:14 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Elijah Newren, Taylor Blau, Christian Couder

Hi all,

First post on Git mailing list, to provide a comment on a patch. Hope
this works.

In cases where the `rename_limit` is already greater than 65535, the
`st_mult(rename_limit, rename_limit)` multiplication overflows and
process halts. But I don't think an 'overflow error' is very helpful
for the users in understanding what is wrong with their configuration;
i.e. `diff.renameLimit` documentation says nothing about a 'maximum
allowed value'. I would either clamp it to a reasonable range, or
inform the users about the limits, or maybe both.

Cheers,

-- 
Basar

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v3 3/9] diffcore-rename: simplify limit check
  2021-11-09 21:14       ` Başar Uğur
@ 2021-11-10 20:06         ` Elijah Newren
  2021-11-11  9:02           ` Başar Uğur
  0 siblings, 1 reply; 55+ messages in thread
From: Elijah Newren @ 2021-11-10 20:06 UTC (permalink / raw)
  To: Başar Uğur
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Taylor Blau,
	Christian Couder

On Tue, Nov 9, 2021 at 1:15 PM Başar Uğur <basarugur@gmail.com> wrote:
>
> Hi all,
>
> First post on Git mailing list, to provide a comment on a patch. Hope
> this works.
>
> In cases where the `rename_limit` is already greater than 65535, the
> `st_mult(rename_limit, rename_limit)` multiplication overflows and
> process halts.

Out of curiosity, what system are you on?  And how many renames do you
actually have?

We used to clamp to 32767, but one specific repository needed values
larger than that, in the range of ~50000.  However, due to a number of
rename detection related optimizations added since then, the git of
today can handle that same particular repository and support full
rename detection with a rename limit under 1000 for merge/cherry-pick
(sorry, forgot the exact numbers), and probably less than 10000 for
diff (just guestimating; I don't want to go and check).

Anyway, all that aside, I don't see any such overflow for rename_limit
being 2**16; we can push it much farther:

    size_t a = 4294967295;   /*  2**32 -1  */
    size_t b = a;
    size_t c = st_mult(a, b);
    printf("%"PRIuMAX" = %"PRIuMAX" * %"PRIuMAX"\n", c, a, b);

Output:

    18446744065119617025 = 4294967295 * 4294967295

Adding one to the value of a results in:

    fatal: size_t overflow: 4294967296 * 4294967296

> But I don't think an 'overflow error' is very helpful
> for the users in understanding what is wrong with their configuration;
> i.e. `diff.renameLimit` documentation says nothing about a 'maximum
> allowed value'. I would either clamp it to a reasonable range, or
> inform the users about the limits, or maybe both.

That sounds reasonable, but I'm not sure it's worth the effort in
practice.  2**32 - 1 is so impractically large for a rename_limit that
I don't see why anyone would need a value even remotely close to that
level (and I wouldn't at all be surprised if other things in git
scaling broke before we even got to that point).

Perhaps you're dealing with a 32-bit system?  Even then, the
repository that hit this was about 6.5GB packed .git history; and you
might need to be a lot larger than that given the optimization since
then in order to hit this.  Can 32-bit systems even handle that size
of repository without dying in several other ways?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v3 3/9] diffcore-rename: simplify limit check
  2021-11-10 20:06         ` Elijah Newren
@ 2021-11-11  9:02           ` Başar Uğur
  2021-11-11 16:19             ` Elijah Newren
  0 siblings, 1 reply; 55+ messages in thread
From: Başar Uğur @ 2021-11-11  9:02 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Taylor Blau,
	Christian Couder

On Wed, Nov 10, 2021 at 9:06 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, Nov 9, 2021 at 1:15 PM Başar Uğur <basarugur@gmail.com> wrote:
> >
> > Hi all,
> >
> > First post on Git mailing list, to provide a comment on a patch. Hope
> > this works.
> >
> > In cases where the `rename_limit` is already greater than 65535, the
> > `st_mult(rename_limit, rename_limit)` multiplication overflows and
> > process halts.
>
> Out of curiosity, what system are you on?  And how many renames do you
> actually have?

I am on a 64-bit Windows 10; but following up on your question, it
became obvious that these limits have something to do with 'which git
executable' I was dealing with. This problem surfaced when I was
working on Visual Studio 2019, and was trying to rename not more than
10 files. My config had 999999 as the renameLimit, and VS2019 showed
the 'fatal error' in its git output. However, git bash was all fine
with listing the renamed files. And the difference between these
scenarios turned out to be, yes, different git executables. VS2019 has
its own copy of git which is 32-bit, and it hits this 999999 * 999999
overflow; whereas *my* copy of git used in bash is 64-bit which does
not have that overflow problem.

>
> We used to clamp to 32767, but one specific repository needed values
> larger than that, in the range of ~50000.  However, due to a number of
> rename detection related optimizations added since then, the git of
> today can handle that same particular repository and support full
> rename detection with a rename limit under 1000 for merge/cherry-pick
> (sorry, forgot the exact numbers), and probably less than 10000 for
> diff (just guestimating; I don't want to go and check).
>
> Anyway, all that aside, I don't see any such overflow for rename_limit
> being 2**16; we can push it much farther:
>
>     size_t a = 4294967295;   /*  2**32 -1  */
>     size_t b = a;
>     size_t c = st_mult(a, b);
>     printf("%"PRIuMAX" = %"PRIuMAX" * %"PRIuMAX"\n", c, a, b);
>
> Output:
>
>     18446744065119617025 = 4294967295 * 4294967295
>
> Adding one to the value of a results in:
>
>     fatal: size_t overflow: 4294967296 * 4294967296
>
> > But I don't think an 'overflow error' is very helpful
> > for the users in understanding what is wrong with their configuration;
> > i.e. `diff.renameLimit` documentation says nothing about a 'maximum
> > allowed value'. I would either clamp it to a reasonable range, or
> > inform the users about the limits, or maybe both.
>
> That sounds reasonable, but I'm not sure it's worth the effort in
> practice.  2**32 - 1 is so impractically large for a rename_limit that
> I don't see why anyone would need a value even remotely close to that
> level (and I wouldn't at all be surprised if other things in git
> scaling broke before we even got to that point).
>
> Perhaps you're dealing with a 32-bit system?  Even then, the
> repository that hit this was about 6.5GB packed .git history; and you
> might need to be a lot larger than that given the optimization since
> then in order to hit this.  Can 32-bit systems even handle that size
> of repository without dying in several other ways?

Good point, but the system aside, 2**16 - 1 = 65535 would remain to be
the limit for the 32-bit git executables, wherever they are used.
Therefore, maybe there is a point to curb it, or mention this
somewhere, as I have said before.

-- 
Basar

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v3 3/9] diffcore-rename: simplify limit check
  2021-11-11  9:02           ` Başar Uğur
@ 2021-11-11 16:19             ` Elijah Newren
  0 siblings, 0 replies; 55+ messages in thread
From: Elijah Newren @ 2021-11-11 16:19 UTC (permalink / raw)
  To: Başar Uğur
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Taylor Blau,
	Christian Couder

On Thu, Nov 11, 2021 at 1:03 AM Başar Uğur <basarugur@gmail.com> wrote:
>
> On Wed, Nov 10, 2021 at 9:06 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Tue, Nov 9, 2021 at 1:15 PM Başar Uğur <basarugur@gmail.com> wrote:
> > >
> > > Hi all,
> > >
> > > First post on Git mailing list, to provide a comment on a patch. Hope
> > > this works.
> > >
> > > In cases where the `rename_limit` is already greater than 65535, the
> > > `st_mult(rename_limit, rename_limit)` multiplication overflows and
> > > process halts.
> >
> > Out of curiosity, what system are you on?  And how many renames do you
> > actually have?
>
> I am on a 64-bit Windows 10; but following up on your question, it
> became obvious that these limits have something to do with 'which git
> executable' I was dealing with. This problem surfaced when I was
> working on Visual Studio 2019, and was trying to rename not more than
> 10 files. My config had 999999 as the renameLimit, and VS2019 showed
> the 'fatal error' in its git output. However, git bash was all fine
> with listing the renamed files. And the difference between these
> scenarios turned out to be, yes, different git executables. VS2019 has
> its own copy of git which is 32-bit, and it hits this 999999 * 999999
> overflow; whereas *my* copy of git used in bash is 64-bit which does
> not have that overflow problem.

Ah, thanks for the extra info.

> >
> > We used to clamp to 32767, but one specific repository needed values
> > larger than that, in the range of ~50000.  However, due to a number of
> > rename detection related optimizations added since then, the git of
> > today can handle that same particular repository and support full
> > rename detection with a rename limit under 1000 for merge/cherry-pick
> > (sorry, forgot the exact numbers), and probably less than 10000 for
> > diff (just guestimating; I don't want to go and check).
> >
> > Anyway, all that aside, I don't see any such overflow for rename_limit
> > being 2**16; we can push it much farther:
> >
> >     size_t a = 4294967295;   /*  2**32 -1  */
> >     size_t b = a;
> >     size_t c = st_mult(a, b);
> >     printf("%"PRIuMAX" = %"PRIuMAX" * %"PRIuMAX"\n", c, a, b);
> >
> > Output:
> >
> >     18446744065119617025 = 4294967295 * 4294967295
> >
> > Adding one to the value of a results in:
> >
> >     fatal: size_t overflow: 4294967296 * 4294967296
> >
> > > But I don't think an 'overflow error' is very helpful
> > > for the users in understanding what is wrong with their configuration;
> > > i.e. `diff.renameLimit` documentation says nothing about a 'maximum
> > > allowed value'. I would either clamp it to a reasonable range, or
> > > inform the users about the limits, or maybe both.
> >
> > That sounds reasonable, but I'm not sure it's worth the effort in
> > practice.  2**32 - 1 is so impractically large for a rename_limit that
> > I don't see why anyone would need a value even remotely close to that
> > level (and I wouldn't at all be surprised if other things in git
> > scaling broke before we even got to that point).
> >
> > Perhaps you're dealing with a 32-bit system?  Even then, the
> > repository that hit this was about 6.5GB packed .git history; and you
> > might need to be a lot larger than that given the optimization since
> > then in order to hit this.  Can 32-bit systems even handle that size
> > of repository without dying in several other ways?
>
> Good point, but the system aside, 2**16 - 1 = 65535 would remain to be
> the limit for the 32-bit git executables, wherever they are used.
> Therefore, maybe there is a point to curb it, or mention this
> somewhere, as I have said before.

Fair enough.  If someone wants to contribute a patch to either provide
a nicer error message if the value is set too high, or to clamp it
with a warning, and in either case to make sure the too-large check is
specific to 32-bit systems (or uses appropriately different limits on
32-bit and 64-bit systems), then that would be a welcome contribution.

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2021-11-11 16:19 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-06  2:54 [PATCH 0/7] diffcore-rename improvements Elijah Newren via GitGitGadget
2020-12-06  2:54 ` [PATCH 1/7] diffcore-rename: avoid usage of global in too_many_rename_candidates() Elijah Newren via GitGitGadget
2020-12-09 22:06   ` Taylor Blau
2020-12-06  2:54 ` [PATCH 2/7] diffcore-rename: remove unnecessary if-clause Elijah Newren via GitGitGadget
2020-12-09 22:10   ` Taylor Blau
2020-12-10  0:32     ` Elijah Newren
2020-12-10  2:03     ` Junio C Hamano
2020-12-10  2:17       ` Elijah Newren
2020-12-10  6:56         ` Junio C Hamano
2020-12-06  2:54 ` [PATCH 3/7] diffcore-rename: rename num_create to num_targets Elijah Newren via GitGitGadget
2020-12-10  2:20   ` Junio C Hamano
2020-12-10  2:25     ` Elijah Newren
2020-12-06  2:54 ` [PATCH 4/7] diffcore-rename: change a few comments to use 'add' instead of 'create' Elijah Newren via GitGitGadget
2020-12-10  2:29   ` Junio C Hamano
2020-12-06  2:54 ` [PATCH 5/7] diffcore-rename: reduce jumpiness in progress counters Elijah Newren via GitGitGadget
2020-12-09 22:24   ` Taylor Blau
2020-12-10  2:36   ` Junio C Hamano
2020-12-06  2:54 ` [PATCH 6/7] diffcore-rename: simplify and accelerate register_rename_src() Elijah Newren via GitGitGadget
2020-12-09 22:40   ` Taylor Blau
2020-12-10  0:25     ` Elijah Newren
2020-12-10  0:41       ` Taylor Blau
2020-12-10  2:51   ` Junio C Hamano
2020-12-06  2:54 ` [PATCH 7/7] Accelerate rename_dst setup Elijah Newren via GitGitGadget
2020-12-09 23:01   ` Taylor Blau
2020-12-10  0:57     ` Elijah Newren
2020-12-10  1:43       ` Junio C Hamano
2020-12-06  3:01 ` [PATCH 0/7] diffcore-rename improvements Elijah Newren
2020-12-11  9:08 ` [PATCH v2 0/9] " Elijah Newren via GitGitGadget
2020-12-11  9:08   ` [PATCH v2 1/9] diffcore-rename: rename num_create to num_destinations Elijah Newren via GitGitGadget
2020-12-11  9:08   ` [PATCH v2 2/9] diffcore-rename: avoid usage of global in too_many_rename_candidates() Elijah Newren via GitGitGadget
2020-12-11  9:08   ` [PATCH v2 3/9] diffcore-rename: simplify limit check Elijah Newren via GitGitGadget
2020-12-11  9:08   ` [PATCH v2 4/9] diffcore-rename: reduce jumpiness in progress counters Elijah Newren via GitGitGadget
2020-12-11  9:08   ` [PATCH v2 5/9] t4058: add more tests and documentation for duplicate tree entry handling Elijah Newren via GitGitGadget
2020-12-11  9:08   ` [PATCH v2 6/9] t4058: explore duplicate tree entry handling in a bit more detail Elijah Newren via GitGitGadget
2021-04-21 12:29     ` Ævar Arnfjörð Bjarmason
2021-04-21 17:38       ` Elijah Newren
2020-12-11  9:08   ` [PATCH v2 7/9] diffcore-rename: simplify and accelerate register_rename_src() Elijah Newren via GitGitGadget
2020-12-11  9:08   ` [PATCH v2 8/9] diffcore-rename: accelerate rename_dst setup Elijah Newren via GitGitGadget
2020-12-11  9:08   ` [PATCH v2 9/9] diffcore-rename: remove unneccessary duplicate entry checks Elijah Newren via GitGitGadget
2020-12-29  8:31     ` Christian Couder
2020-12-29 18:09       ` Elijah Newren
2020-12-29 20:05   ` [PATCH v3 0/9] diffcore-rename improvements Elijah Newren via GitGitGadget
2020-12-29 20:05     ` [PATCH v3 1/9] diffcore-rename: rename num_create to num_destinations Elijah Newren via GitGitGadget
2020-12-29 20:05     ` [PATCH v3 2/9] diffcore-rename: avoid usage of global in too_many_rename_candidates() Elijah Newren via GitGitGadget
2020-12-29 20:05     ` [PATCH v3 3/9] diffcore-rename: simplify limit check Elijah Newren via GitGitGadget
2021-11-09 21:14       ` Başar Uğur
2021-11-10 20:06         ` Elijah Newren
2021-11-11  9:02           ` Başar Uğur
2021-11-11 16:19             ` Elijah Newren
2020-12-29 20:05     ` [PATCH v3 4/9] diffcore-rename: reduce jumpiness in progress counters Elijah Newren via GitGitGadget
2020-12-29 20:05     ` [PATCH v3 5/9] t4058: add more tests and documentation for duplicate tree entry handling Elijah Newren via GitGitGadget
2020-12-29 20:05     ` [PATCH v3 6/9] t4058: explore duplicate tree entry handling in a bit more detail Elijah Newren via GitGitGadget
2020-12-29 20:05     ` [PATCH v3 7/9] diffcore-rename: simplify and accelerate register_rename_src() Elijah Newren via GitGitGadget
2020-12-29 20:05     ` [PATCH v3 8/9] diffcore-rename: accelerate rename_dst setup Elijah Newren via GitGitGadget
2020-12-29 20:05     ` [PATCH v3 9/9] diffcore-rename: remove unnecessary duplicate entry checks Elijah Newren via GitGitGadget

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.