All of lore.kernel.org
 help / color / mirror / Atom feed
* --find-copies-harder finds fewer copies/renames than -C does
@ 2011-01-05 17:46 Stefan Haller
  2011-01-05 18:59 ` Junio C Hamano
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Haller @ 2011-01-05 17:46 UTC (permalink / raw)
  To: git

I was surprised to find out that --find-copies-harder can, under certain
circumstances, detect fewer renames than -C does for a given commit.
After some digging I think I understand now why this is so, but I find
it extremely unintuitive, and would like to know what other people think
about it.  I had expected --find-copies-harder to always detect at least
as many copies/renames as -C, and possibly more, but never less.

The case I had is this: I have a repo with about 10.000 files, and a
commit with 14 files being moved to a different folder; half of them
where unmodified moves, the other half had one-line modifications
(similarity index 97% or so).

"git show -C --name-status <commit>" detects all 14 files as renames;
"git show --find-copies-harder --name-status <commit>" only shows the
unmodified moves as renames, the ones with modifications are shown as
delete plus add.

The reason for this seems to be the condition
"num_create * num_src > rename_limit * rename_limit" in diffcore_rename;
--find-copies-harder exeeds the limit, so it turns off inexact rename
dectection *completely*, while -C stays well below the limit.

I had expected --find-copies-harder to still do inexact rename detection
among the changed files in the commit in this case, and turn it off only
for the unmodified files; I'm not familiar enough with the code to tell
whether that would be easy to implement though.

Any thoughts?


-- 
Stefan Haller
Berlin, Germany
http://www.haller-berlin.de/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: --find-copies-harder finds fewer copies/renames than -C does
  2011-01-05 17:46 --find-copies-harder finds fewer copies/renames than -C does Stefan Haller
@ 2011-01-05 18:59 ` Junio C Hamano
  2011-01-06 16:54   ` Stefan Haller
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2011-01-05 18:59 UTC (permalink / raw)
  To: Stefan Haller; +Cc: git

lists@haller-berlin.de (Stefan Haller) writes:

> The reason for this seems to be the condition
> "num_create * num_src > rename_limit * rename_limit" in diffcore_rename;
> --find-copies-harder exeeds the limit, so it turns off inexact rename
> dectection *completely*, while -C stays well below the limit.
>
> I had expected --find-copies-harder to still do inexact rename detection
> among the changed files in the commit in this case, and turn it off only
> for the unmodified files; I'm not familiar enough with the code to tell
> whether that would be easy to implement though.
>
> Any thoughts?

Two.  When you can spend unlimited amount of resources, it would feel more
intuitive if -C -C lifted rename-limit altogether.  On the other hand, in
a project where the difference does matter (i.e. you have far too many
candidate sources), it is likely that -C -C without rename limit would run
out of memory and not produce any result, so automatic lifting of rename
limit is unacceptable as the default.

Setting rename-limit to match the size of your environment (perhaps do
this once in the config) would make this a non-issue, so coming up with an
automated way to do so might be an interesting exercise.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: --find-copies-harder finds fewer copies/renames than -C does
  2011-01-05 18:59 ` Junio C Hamano
@ 2011-01-06 16:54   ` Stefan Haller
  2011-01-06 20:42     ` Junio C Hamano
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Haller @ 2011-01-06 16:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> wrote:

> lists@haller-berlin.de (Stefan Haller) writes:
> 
> > I had expected --find-copies-harder to still do inexact rename detection
> > among the changed files in the commit in this case, and turn it off only
> > for the unmodified files; I'm not familiar enough with the code to tell
> > whether that would be easy to implement though.
> >
> > Any thoughts?
> 
> Two.  When you can spend unlimited amount of resources, it would feel more
> intuitive if -C -C lifted rename-limit altogether.  On the other hand, in
> a project where the difference does matter (i.e. you have far too many
> candidate sources), it is likely that -C -C without rename limit would run
> out of memory and not produce any result, so automatic lifting of rename
> limit is unacceptable as the default.

But what about the suggestion of falling back to -C if
--find-copies-harder exceeds the rename limit, but -C does not?
Wouldn't that be the desired behaviour?


-- 
Stefan Haller
Berlin, Germany
http://www.haller-berlin.de/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: --find-copies-harder finds fewer copies/renames than -C does
  2011-01-06 16:54   ` Stefan Haller
@ 2011-01-06 20:42     ` Junio C Hamano
  2011-01-06 21:50       ` [PATCH 0/3] Falling back "diff -C -C" to "diff -C" more gracefully Junio C Hamano
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2011-01-06 20:42 UTC (permalink / raw)
  To: Stefan Haller; +Cc: git

lists@haller-berlin.de (Stefan Haller) writes:

> But what about the suggestion of falling back to -C if
> --find-copies-harder exceeds the rename limit, but -C does not?
> Wouldn't that be the desired behaviour?

Maybe.  Patches welcome ;-)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 0/3] Falling back "diff -C -C" to "diff -C" more gracefully
  2011-01-06 20:42     ` Junio C Hamano
@ 2011-01-06 21:50       ` Junio C Hamano
  2011-01-06 21:50         ` [PATCH 1/3] diffcore-rename: refactor "too many candidates" logic Junio C Hamano
                           ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Junio C Hamano @ 2011-01-06 21:50 UTC (permalink / raw)
  To: git; +Cc: Stefan Haller

In a project with many paths, "diff -C -C" may have too many rename source
candidates (as it tries to use all existing paths) but the rename source
candidates "diff -C" would use may still fit under the rename detection
limit.  Currently, we punt and disable the inexact rename detection
altogether even in such a case.

This weatherballoon series illustrates how diffcore-rename can be tweaked
to allow "-C -C" to fall back to "-C".  Somebody should write a test, but
not today ;-).

Junio C Hamano (3):
  diffcore-rename: refactor "too many candidates" logic
  diffcore-rename: record filepair for rename src
  diffcore-rename: fall back to -C when -C -C busts the rename limit

 diffcore-rename.c |   97 ++++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 71 insertions(+), 26 deletions(-)

-- 
1.7.4.rc1.214.g2a4f9

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] diffcore-rename: refactor "too many candidates" logic
  2011-01-06 21:50       ` [PATCH 0/3] Falling back "diff -C -C" to "diff -C" more gracefully Junio C Hamano
@ 2011-01-06 21:50         ` Junio C Hamano
  2011-01-06 21:50         ` [PATCH 2/3] diffcore-rename: record filepair for rename src Junio C Hamano
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2011-01-06 21:50 UTC (permalink / raw)
  To: git

Move the logic to a separate function, to be enhanced by later patches in
the series.

While at it, swap the condition used in the if statement from "if it is
too big then do this" to "if it would fit then do this".

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 diffcore-rename.c |   39 +++++++++++++++++++++++++--------------
 1 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index df41be5..6ab050d 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -414,11 +414,34 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
 		m[worst] = *o;
 }
 
+static int too_many_rename_candidates(int num_create,
+				      struct diff_options *options)
+{
+	int rename_limit = options->rename_limit;
+	int num_src = rename_src_nr;
+
+	/*
+	 * This basically does a test for the rename matrix not
+	 * growing larger than a "rename_limit" square matrix, ie:
+	 *
+	 *    num_create * num_src > rename_limit * rename_limit
+	 *
+	 * but handles the potential overflow case specially (and we
+	 * assume at least 32-bit integers)
+	 */
+	if (rename_limit <= 0 || rename_limit > 32767)
+		rename_limit = 32767;
+	if ((num_create <= rename_limit || num_src <= rename_limit) &&
+	    (num_create * num_src <= rename_limit * rename_limit))
+		return 0;
+
+	return 1;
+}
+
 void diffcore_rename(struct diff_options *options)
 {
 	int detect_rename = options->detect_rename;
 	int minimum_score = options->rename_score;
-	int rename_limit = options->rename_limit;
 	struct diff_queue_struct *q = &diff_queued_diff;
 	struct diff_queue_struct outq;
 	struct diff_score *mx;
@@ -484,19 +507,7 @@ void diffcore_rename(struct diff_options *options)
 	if (!num_create)
 		goto cleanup;
 
-	/*
-	 * This basically does a test for the rename matrix not
-	 * growing larger than a "rename_limit" square matrix, ie:
-	 *
-	 *    num_create * num_src > rename_limit * rename_limit
-	 *
-	 * but handles the potential overflow case specially (and we
-	 * assume at least 32-bit integers)
-	 */
-	if (rename_limit <= 0 || rename_limit > 32767)
-		rename_limit = 32767;
-	if ((num_create > rename_limit && num_src > rename_limit) ||
-	    (num_create * num_src > rename_limit * rename_limit)) {
+	if (too_many_rename_candidates(num_create, options)) {
 		if (options->warn_on_too_large_rename)
 			warning("too many files (created: %d deleted: %d), skipping inexact rename detection", num_create, num_src);
 		goto cleanup;
-- 
1.7.4.rc1.214.g2a4f9

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/3] diffcore-rename: record filepair for rename src
  2011-01-06 21:50       ` [PATCH 0/3] Falling back "diff -C -C" to "diff -C" more gracefully Junio C Hamano
  2011-01-06 21:50         ` [PATCH 1/3] diffcore-rename: refactor "too many candidates" logic Junio C Hamano
@ 2011-01-06 21:50         ` Junio C Hamano
  2011-01-06 21:50         ` [PATCH 3/3] diffcore-rename: fall back to -C when -C -C busts the rename limit Junio C Hamano
  2011-01-06 22:28         ` [PATCH 0/3] Falling back "diff -C -C" to "diff -C" more gracefully Jeff King
  3 siblings, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2011-01-06 21:50 UTC (permalink / raw)
  To: git

This will allow us to later skip unmodified entries added due to "-C -C".
We might also want to do something similar to rename_dst side, but that
would only be for the sake of symmetry and not necessary for this series.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 diffcore-rename.c |   23 ++++++++++++-----------
 1 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6ab050d..9ce81b6 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -54,22 +54,23 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
 
 /* Table of rename/copy src files */
 static struct diff_rename_src {
-	struct diff_filespec *one;
+	struct diff_filepair *p;
 	unsigned short score; /* to remember the break score */
 } *rename_src;
 static int rename_src_nr, rename_src_alloc;
 
-static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
-						   unsigned short score)
+static struct diff_rename_src *register_rename_src(struct diff_filepair *p)
 {
 	int first, last;
+	struct diff_filespec *one = p->one;
+	unsigned short score = p->score;
 
 	first = 0;
 	last = rename_src_nr;
 	while (last > first) {
 		int next = (last + first) >> 1;
 		struct diff_rename_src *src = &(rename_src[next]);
-		int cmp = strcmp(one->path, src->one->path);
+		int cmp = strcmp(one->path, src->p->one->path);
 		if (!cmp)
 			return src;
 		if (cmp < 0) {
@@ -89,7 +90,7 @@ static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
 	if (first < rename_src_nr)
 		memmove(rename_src + first + 1, rename_src + first,
 			(rename_src_nr - first - 1) * sizeof(*rename_src));
-	rename_src[first].one = one;
+	rename_src[first].p = p;
 	rename_src[first].score = score;
 	return &(rename_src[first]);
 }
@@ -204,7 +205,7 @@ static void record_rename_pair(int dst_index, int src_index, int score)
 	if (rename_dst[dst_index].pair)
 		die("internal error: dst already matched.");
 
-	src = rename_src[src_index].one;
+	src = rename_src[src_index].p->one;
 	src->rename_used++;
 	src->count++;
 
@@ -384,7 +385,7 @@ static int find_exact_renames(void)
 
 	init_hash(&file_table);
 	for (i = 0; i < rename_src_nr; i++)
-		insert_file_table(&file_table, -1, i, rename_src[i].one);
+		insert_file_table(&file_table, -1, i, rename_src[i].p->one);
 
 	for (i = 0; i < rename_dst_nr; i++)
 		insert_file_table(&file_table, 1, i, rename_dst[i].two);
@@ -472,7 +473,7 @@ void diffcore_rename(struct diff_options *options)
 			 */
 			if (p->broken_pair && !p->score)
 				p->one->rename_used++;
-			register_rename_src(p->one, p->score);
+			register_rename_src(p);
 		}
 		else if (detect_rename == DIFF_DETECT_COPY) {
 			/*
@@ -480,7 +481,7 @@ void diffcore_rename(struct diff_options *options)
 			 * one, to indicate ourselves as a user.
 			 */
 			p->one->rename_used++;
-			register_rename_src(p->one, p->score);
+			register_rename_src(p);
 		}
 	}
 	if (rename_dst_nr == 0 || rename_src_nr == 0)
@@ -526,7 +527,7 @@ void diffcore_rename(struct diff_options *options)
 			m[j].dst = -1;
 
 		for (j = 0; j < rename_src_nr; j++) {
-			struct diff_filespec *one = rename_src[j].one;
+			struct diff_filespec *one = rename_src[j].p->one;
 			struct diff_score this_src;
 			this_src.score = estimate_similarity(one, two,
 							     minimum_score);
@@ -556,7 +557,7 @@ void diffcore_rename(struct diff_options *options)
 		dst = &rename_dst[mx[i].dst];
 		if (dst->pair)
 			continue; /* already done, either exact or fuzzy. */
-		if (rename_src[mx[i].src].one->rename_used)
+		if (rename_src[mx[i].src].p->one->rename_used)
 			continue;
 		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
 		rename_count++;
-- 
1.7.4.rc1.214.g2a4f9

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/3] diffcore-rename: fall back to -C when -C -C busts the rename limit
  2011-01-06 21:50       ` [PATCH 0/3] Falling back "diff -C -C" to "diff -C" more gracefully Junio C Hamano
  2011-01-06 21:50         ` [PATCH 1/3] diffcore-rename: refactor "too many candidates" logic Junio C Hamano
  2011-01-06 21:50         ` [PATCH 2/3] diffcore-rename: record filepair for rename src Junio C Hamano
@ 2011-01-06 21:50         ` Junio C Hamano
  2011-01-06 22:28         ` [PATCH 0/3] Falling back "diff -C -C" to "diff -C" more gracefully Jeff King
  3 siblings, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2011-01-06 21:50 UTC (permalink / raw)
  To: git

When there are too many paths in the project, the number of rename source
candidates "git diff -C -C" finds will exceed the rename detection limit,
and no inexact rename detection is performed.  We however could fall back
to "git diff -C" if the number of paths modified are sufficiently small.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 diffcore-rename.c |   37 +++++++++++++++++++++++++++++++++++--
 1 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 9ce81b6..4851af3 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -415,11 +415,18 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
 		m[worst] = *o;
 }
 
+/*
+ * Returns:
+ * 0 if we are under the limit;
+ * 1 if we need to disable inexact rename detection;
+ * 2 if we would be under the limit if we were given -C instead of -C -C.
+ */
 static int too_many_rename_candidates(int num_create,
 				      struct diff_options *options)
 {
 	int rename_limit = options->rename_limit;
 	int num_src = rename_src_nr;
+	int i;
 
 	/*
 	 * This basically does a test for the rename matrix not
@@ -436,6 +443,19 @@ static int too_many_rename_candidates(int num_create,
 	    (num_create * num_src <= rename_limit * rename_limit))
 		return 0;
 
+	/* Are we running under -C -C? */
+	if (!DIFF_OPT_TST(options, FIND_COPIES_HARDER))
+		return 1;
+
+	/* Would we bust the limit if we were running under -C? */
+	for (num_src = i = 0; i < rename_src_nr; i++) {
+		if (diff_unmodified_pair(rename_src[i].p))
+			continue;
+		num_src++;
+	}
+	if ((num_create <= rename_limit || num_src <= rename_limit) &&
+	    (num_create * num_src <= rename_limit * rename_limit))
+		return 2;
 	return 1;
 }
 
@@ -446,7 +466,7 @@ void diffcore_rename(struct diff_options *options)
 	struct diff_queue_struct *q = &diff_queued_diff;
 	struct diff_queue_struct outq;
 	struct diff_score *mx;
-	int i, j, rename_count;
+	int i, j, rename_count, skip_unmodified = 0;
 	int num_create, num_src, dst_cnt;
 
 	if (!minimum_score)
@@ -508,10 +528,18 @@ void diffcore_rename(struct diff_options *options)
 	if (!num_create)
 		goto cleanup;
 
-	if (too_many_rename_candidates(num_create, options)) {
+	switch (too_many_rename_candidates(num_create, options)) {
+	case 1:
 		if (options->warn_on_too_large_rename)
 			warning("too many files (created: %d deleted: %d), skipping inexact rename detection", num_create, num_src);
 		goto cleanup;
+	case 2:
+		if (options->warn_on_too_large_rename)
+			warning("too many files, falling back to -C");
+		skip_unmodified = 1;
+		break;
+	default:
+		break;
 	}
 
 	mx = xcalloc(num_create * NUM_CANDIDATE_PER_DST, sizeof(*mx));
@@ -529,6 +557,11 @@ void diffcore_rename(struct diff_options *options)
 		for (j = 0; j < rename_src_nr; j++) {
 			struct diff_filespec *one = rename_src[j].p->one;
 			struct diff_score this_src;
+
+			if (skip_unmodified &&
+			    diff_unmodified_pair(rename_src[j].p))
+				continue;
+
 			this_src.score = estimate_similarity(one, two,
 							     minimum_score);
 			this_src.name_score = basename_same(one, two);
-- 
1.7.4.rc1.214.g2a4f9

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Falling back "diff -C -C" to "diff -C" more gracefully
  2011-01-06 21:50       ` [PATCH 0/3] Falling back "diff -C -C" to "diff -C" more gracefully Junio C Hamano
                           ` (2 preceding siblings ...)
  2011-01-06 21:50         ` [PATCH 3/3] diffcore-rename: fall back to -C when -C -C busts the rename limit Junio C Hamano
@ 2011-01-06 22:28         ` Jeff King
  3 siblings, 0 replies; 14+ messages in thread
From: Jeff King @ 2011-01-06 22:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Stefan Haller

On Thu, Jan 06, 2011 at 01:50:03PM -0800, Junio C Hamano wrote:

> Junio C Hamano (3):
>   diffcore-rename: refactor "too many candidates" logic
>   diffcore-rename: record filepair for rename src
>   diffcore-rename: fall back to -C when -C -C busts the rename limit

I read through the series, and it looks good to me. I didn't do any
testing, though. :)

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] diffcore-rename: fall back to -C when -C -C busts the rename limit
  2011-03-23 15:58     ` Jeff King
  2011-03-23 16:41       ` Junio C Hamano
@ 2011-03-23 18:17       ` Jeff King
  1 sibling, 0 replies; 14+ messages in thread
From: Jeff King @ 2011-03-23 18:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, Mar 23, 2011 at 11:58:54AM -0400, Jeff King wrote:

> Also, on a somewhat related note: which commands should have rename
> progress reporting turned on? It makes sense for "git diff" to do it to
> me. And probably even "git show". Probably not for "git log", though.
> I think we'd also have to look at how that interacts with the
> stderr-to-stdout pager thing. We obviously don't want progress going to
> the pager.

This quick-and-dirty series makes rename progress work properly for "git
show". Patch 1/3 would also be the building block for having "git log"
show per-commit warnings when stderr and stdout are mixed, but a single
warning otherwise. I built it on top of your patches, but I think there
is no reason it could not be done in parallel.

  [1/3]: pager: save the original stderr when redirecting to pager
  [2/3]: progress: use pager's original_stderr if available
  [3/3]: show: turn on rename progress

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] diffcore-rename: fall back to -C when -C -C busts the rename limit
  2011-03-23 16:41       ` Junio C Hamano
@ 2011-03-23 16:50         ` Jeff King
  0 siblings, 0 replies; 14+ messages in thread
From: Jeff King @ 2011-03-23 16:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, Mar 23, 2011 at 09:41:19AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > "find the highest limit needed and report once" strategy you used above?
> 
> Wasn't "find the highest limit" your invention in merge-recursive?  The

Hmm, you're right. I was thinking the code you were modifying was called
per-diff, but it's not. Though looking at it again, I think we could
technically still warn several times, one for each recursive call to
merge_recursive. Your call to show() does fix that (because it checks
o->call_depth).

So I think the code after your patch does the right thing.

> As to the warnings in "log" output, I actually prefer leaving saved_* out
> and showing the warning per internal diff invocation. My initial iteration
> was indeed coded that way, and I did the "find the highest" only to mimic
> what was already in merge-recursive.

I think they are two different cases, because the user never gets to see
the intermediate results of merge-recursive. That is, at the end of the
merge we tell them "here's the result, and by the way, we limited
renames." But for something like "log" or "diff-tree --stdin", it is
about doing N independent diffs, and the user gets to see the result of
each. So if we can match the warnings to the actual diffs in the output,
that is much more useful.

But in the case of something like:

  git rev-list HEAD | git diff-tree --stdin >foo.out

I don't think there is a way to match the warnings to their respective
commits. They are on two different streams, and putting the warning to
stdout would pollute the output.

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] diffcore-rename: fall back to -C when -C -C busts the rename limit
  2011-03-23 15:58     ` Jeff King
@ 2011-03-23 16:41       ` Junio C Hamano
  2011-03-23 16:50         ` Jeff King
  2011-03-23 18:17       ` Jeff King
  1 sibling, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2011-03-23 16:41 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> "find the highest limit needed and report once" strategy you used above?

Wasn't "find the highest limit" your invention in merge-recursive?  The
only deliberate fix in merge-recursive codepath on top of what you did is
to allow squelching the warning when verbosity is set to more quiet than
the default, and removal of the "set var to this value" advice when the
value we are going to suggest is too high already and will not help.

As to the warnings in "log" output, I actually prefer leaving saved_* out
and showing the warning per internal diff invocation. My initial iteration
was indeed coded that way, and I did the "find the highest" only to mimic
what was already in merge-recursive.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] diffcore-rename: fall back to -C when -C -C busts the rename limit
  2011-03-22 21:50   ` [PATCH 3/3] diffcore-rename: fall back to -C when -C -C busts the rename limit Junio C Hamano
@ 2011-03-23 15:58     ` Jeff King
  2011-03-23 16:41       ` Junio C Hamano
  2011-03-23 18:17       ` Jeff King
  0 siblings, 2 replies; 14+ messages in thread
From: Jeff King @ 2011-03-23 15:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Mar 22, 2011 at 02:50:49PM -0700, Junio C Hamano wrote:

> When there are too many paths in the project, the number of rename source
> candidates "git diff -C -C" finds will exceed the rename detection limit,
> and no inexact rename detection is performed.  We however could fall back
> to "git diff -C" if the number of modified paths is sufficiently small.

Hmm. Compared to the previous iteration, this also seems to turn on
warnings for the diff and log families. It would have been a little
easier to review as a separate patch, I think. In particular, I'm not
sure I like the "just warn once at the end" strategy when we show
multiple diffs.

For example, in the trash repo created by your new t4001, I did:

  git commit -a -m foo &&
  git config diff.renamelimit 4 &&
  git log --raw -C -C

And like most users, I read through the commits in the order they're
presented. So I get to the second one and see that it has a bunch of
additions, but no copies. Then I keep reading and finally, after the
third commit, I see a warning that we didn't do copy detection.

Here's what I see wrong with that from the user's perspective:

  1. You give me the warning at some unspecified time after I've already
     read and digested the results of the commit it affects. So at best
     I've wasted my time looking at results that you later tell me might
     be bogus.

  2. The warning is at the very end of the potentially long output. With
     three commits, I'm likely to actually scroll all the way to the end.
     But how often do you run "git log" in git.git and then kill the
     pager after finding the answer you want, never seeing the bottom of
     git's output (and potentially before git even generates it!). So I
     may miss the warning entirely.

  3. Even if I do see the warning, I have no idea which commits it
     applies to. I have to bump the rename limit and re-run just to find
     out whether the commits I cared about were affected.

So I think it would be much more sensible to show something like:

  commit f3ba51f2e2025d15c84873cb24dcf51b77f6b8e1
  Author: A U Thor <author@example.com>
  Date:   Thu Apr 7 15:13:13 2005 -0700

      hundred

  warning: only found copies from modified paths due to too many files.
  warning: you may want to set your diff.renamelimit variable to at
           least 102 and retry the command.
  :000000 100644 0000000... 4daddb7... A  path00
  :000000 100644 0000000... 8a0f05e... A  path01
  [etc]

One argument against that is that we may end up printing many such
warnings. For the pager case, where stderr and stdout are combined, that
is probably fine. Any commit which blows the rename limit is going to
have a ton of diff output anyway, so a few extra lines doesn't hurt
(with some careful flushing, those lines are placed in the right spot).

When stderr is separated, though, it's going to be annoyingly verbose,
and multiple warnings wouldn't match up with their respective commits,
anyway.

So maybe it is worth detecting the pager case and behaving differently
based on whether stderr and stdout are combined.

Another alternative would be to embed the warning in the stdout stream,
but I don't see a clean way of doing that. The best I could come up with
is something like:

    commit f3ba51f2e2025d15c84873cb24dcf51b77f6b8e1
    Author: A U Thor <author@example.com>
    Date:   Thu Apr 7 15:13:13 2005 -0700
    X-Diff-Warning: only found copies...

which just seems a little too hack-ish.

Also, on a somewhat related note: which commands should have rename
progress reporting turned on? It makes sense for "git diff" to do it to
me. And probably even "git show". Probably not for "git log", though.
I think we'd also have to look at how that interacts with the
stderr-to-stdout pager thing. We obviously don't want progress going to
the pager.

> +static const char rename_limit_warning[] =
> +"inexact rename detection was skipped due to too many files.";
> +
> +static const char degrade_cc_to_c_warning[] =
> +"only found copies from modified paths due to too many files.";

Somehow the phrase "due to too many files" seems awkward. The
merge-recursive warning this replaces used "because there were too many
files". Which is longer, of course, but seems more natural.

> +static const char rename_limit_advice[] =
> +"you may want to set your %s variable to at least "
> +"%d and retry the command.";

And this one ends up overflowing reasonably-sized terminals, because it
gets prefixed with "warning: ", and because the variable name is
something like "diff.renamelimit".

> @@ -445,6 +452,20 @@ static int too_many_rename_candidates(int num_create,
>  
>  	options->needed_rename_limit =
>  		num_src > num_create ? num_src : num_create;

This is obviously not introduced by your patch, but I noticed once again
during testing that this number is conservatively high, isn't it? I
think the number we want is actually:

  ceil(sqrt(num_create * num_src))

right? I don't know if it is worth being more accurate.

> @@ -1656,8 +1651,9 @@ int merge_recursive(struct merge_options *o,
>  		commit_list_insert(h2, &(*result)->parents->next);
>  	}
>  	flush_output(o);
> -	if (o->needed_rename_limit)
> -		warning(rename_limit_advice, o->needed_rename_limit);
> +	if (show(o, 2))
> +		diff_warn_rename_limit("merge.renamelimit",
> +				       o->needed_rename_limit, 0);

With the call to show(), we are showing the warning only for the
outermost diff. Which is probably way better than generating a bunch of
warnings, one for each rename-detect we do. But aren't we now failing to
mention limits we hit in recursive invocations, even though they can
affect the results? In other words, isn't this a candidate for the
"find the highest limit needed and report once" strategy you used above?

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 3/3] diffcore-rename: fall back to -C when -C -C busts the rename limit
  2011-03-22 21:50 ` [PATCH 1/3] diffcore-rename: refactor "too many candidates" logic Junio C Hamano
@ 2011-03-22 21:50   ` Junio C Hamano
  2011-03-23 15:58     ` Jeff King
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2011-03-22 21:50 UTC (permalink / raw)
  To: git; +Cc: Jeff King

When there are too many paths in the project, the number of rename source
candidates "git diff -C -C" finds will exceed the rename detection limit,
and no inexact rename detection is performed.  We however could fall back
to "git diff -C" if the number of modified paths is sufficiently small.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/diff-tree.c    |   12 +++++++++++-
 builtin/log.c          |    9 +++++++++
 diff.c                 |   26 ++++++++++++++++++++++++++
 diff.h                 |    2 ++
 diffcore-rename.c      |   38 ++++++++++++++++++++++++++++++++++++--
 merge-recursive.c      |   10 +++-------
 t/t4001-diff-rename.sh |   25 +++++++++++++++++++++++++
 7 files changed, 112 insertions(+), 10 deletions(-)

diff --git a/builtin/diff-tree.c b/builtin/diff-tree.c
index 0d2a3e9..be6417d 100644
--- a/builtin/diff-tree.c
+++ b/builtin/diff-tree.c
@@ -163,6 +163,9 @@ int cmd_diff_tree(int argc, const char **argv, const char *prefix)
 	}
 
 	if (read_stdin) {
+		int saved_nrl = 0;
+		int saved_dcctc = 0;
+
 		if (opt->diffopt.detect_rename)
 			opt->diffopt.setup |= (DIFF_SETUP_USE_SIZE_CACHE |
 					       DIFF_SETUP_USE_CACHE);
@@ -173,9 +176,16 @@ int cmd_diff_tree(int argc, const char **argv, const char *prefix)
 				fputs(line, stdout);
 				fflush(stdout);
 			}
-			else
+			else {
 				diff_tree_stdin(line);
+				if (saved_nrl < opt->diffopt.needed_rename_limit)
+					saved_nrl = opt->diffopt.needed_rename_limit;
+				if (opt->diffopt.degraded_cc_to_c)
+					saved_dcctc = 1;
+			}
 		}
+		opt->diffopt.degraded_cc_to_c = saved_dcctc;
+		opt->diffopt.needed_rename_limit = saved_nrl;
 	}
 
 	return diff_result_code(&opt->diffopt, 0);
diff --git a/builtin/log.c b/builtin/log.c
index 796e9e5..707fd57 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -247,6 +247,8 @@ static void finish_early_output(struct rev_info *rev)
 static int cmd_log_walk(struct rev_info *rev)
 {
 	struct commit *commit;
+	int saved_nrl = 0;
+	int saved_dcctc = 0;
 
 	if (rev->early_output)
 		setup_early_output(rev);
@@ -277,7 +279,14 @@ static int cmd_log_walk(struct rev_info *rev)
 		}
 		free_commit_list(commit->parents);
 		commit->parents = NULL;
+		if (saved_nrl < rev->diffopt.needed_rename_limit)
+			saved_nrl = rev->diffopt.needed_rename_limit;
+		if (rev->diffopt.degraded_cc_to_c)
+			saved_dcctc = 1;
 	}
+	rev->diffopt.degraded_cc_to_c = saved_dcctc;
+	rev->diffopt.needed_rename_limit = saved_nrl;
+
 	if (rev->diffopt.output_format & DIFF_FORMAT_CHECKDIFF &&
 	    DIFF_OPT_TST(&rev->diffopt, CHECK_FAILED)) {
 		return 02;
diff --git a/diff.c b/diff.c
index 9b3eb99..13f322b 100644
--- a/diff.c
+++ b/diff.c
@@ -3956,6 +3956,28 @@ static int is_summary_empty(const struct diff_queue_struct *q)
 	return 1;
 }
 
+static const char rename_limit_warning[] =
+"inexact rename detection was skipped due to too many files.";
+
+static const char degrade_cc_to_c_warning[] =
+"only found copies from modified paths due to too many files.";
+
+static const char rename_limit_advice[] =
+"you may want to set your %s variable to at least "
+"%d and retry the command.";
+
+void diff_warn_rename_limit(const char *varname, int needed, int degraded_cc)
+{
+	if (degraded_cc)
+		warning(degrade_cc_to_c_warning);
+	else if (needed)
+		warning(rename_limit_warning);
+	else
+		return;
+	if (0 < needed && needed < 32767)
+		warning(rename_limit_advice, varname, needed);
+}
+
 void diff_flush(struct diff_options *options)
 {
 	struct diff_queue_struct *q = &diff_queued_diff;
@@ -4237,6 +4259,10 @@ void diffcore_std(struct diff_options *options)
 int diff_result_code(struct diff_options *opt, int status)
 {
 	int result = 0;
+
+	diff_warn_rename_limit("diff.renamelimit",
+			       opt->needed_rename_limit,
+			       opt->degraded_cc_to_c);
 	if (!DIFF_OPT_TST(opt, EXIT_WITH_STATUS) &&
 	    !(opt->output_format & DIFF_FORMAT_CHECKDIFF))
 		return status;
diff --git a/diff.h b/diff.h
index 007a055..b8dde8a 100644
--- a/diff.h
+++ b/diff.h
@@ -111,6 +111,7 @@ struct diff_options {
 	int rename_score;
 	int rename_limit;
 	int needed_rename_limit;
+	int degraded_cc_to_c;
 	int show_rename_progress;
 	int dirstat_percent;
 	int setup;
@@ -273,6 +274,7 @@ extern void diffcore_fix_diff_index(struct diff_options *);
 
 extern int diff_queue_is_empty(void);
 extern void diff_flush(struct diff_options*);
+extern void diff_warn_rename_limit(const char *varname, int needed, int degraded_cc);
 
 /* diff-raw status letters */
 #define DIFF_STATUS_ADDED		'A'
diff --git a/diffcore-rename.c b/diffcore-rename.c
index a932f76..f62587e 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -420,11 +420,18 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
 		m[worst] = *o;
 }
 
+/*
+ * Returns:
+ * 0 if we are under the limit;
+ * 1 if we need to disable inexact rename detection;
+ * 2 if we would be under the limit if we were given -C instead of -C -C.
+ */
 static int too_many_rename_candidates(int num_create,
 				      struct diff_options *options)
 {
 	int rename_limit = options->rename_limit;
 	int num_src = rename_src_nr;
+	int i;
 
 	options->needed_rename_limit = 0;
 
@@ -445,6 +452,20 @@ static int too_many_rename_candidates(int num_create,
 
 	options->needed_rename_limit =
 		num_src > num_create ? num_src : num_create;
+
+	/* Are we running under -C -C? */
+	if (!DIFF_OPT_TST(options, FIND_COPIES_HARDER))
+		return 1;
+
+	/* Would we bust the limit if we were running under -C? */
+	for (num_src = i = 0; i < rename_src_nr; i++) {
+		if (diff_unmodified_pair(rename_src[i].p))
+			continue;
+		num_src++;
+	}
+	if ((num_create <= rename_limit || num_src <= rename_limit) &&
+	    (num_create * num_src <= rename_limit * rename_limit))
+		return 2;
 	return 1;
 }
 
@@ -476,7 +497,7 @@ void diffcore_rename(struct diff_options *options)
 	struct diff_queue_struct *q = &diff_queued_diff;
 	struct diff_queue_struct outq;
 	struct diff_score *mx;
-	int i, j, rename_count;
+	int i, j, rename_count, skip_unmodified = 0;
 	int num_create, num_src, dst_cnt;
 	struct progress *progress = NULL;
 
@@ -539,8 +560,16 @@ void diffcore_rename(struct diff_options *options)
 	if (!num_create)
 		goto cleanup;
 
-	if (too_many_rename_candidates(num_create, options))
+	switch (too_many_rename_candidates(num_create, options)) {
+	case 1:
 		goto cleanup;
+	case 2:
+		options->degraded_cc_to_c = 1;
+		skip_unmodified = 1;
+		break;
+	default:
+		break;
+	}
 
 	if (options->show_rename_progress) {
 		progress = start_progress_delay(
@@ -563,6 +592,11 @@ void diffcore_rename(struct diff_options *options)
 		for (j = 0; j < rename_src_nr; j++) {
 			struct diff_filespec *one = rename_src[j].p->one;
 			struct diff_score this_src;
+
+			if (skip_unmodified &&
+			    diff_unmodified_pair(rename_src[j].p))
+				continue;
+
 			this_src.score = estimate_similarity(one, two,
 							     minimum_score);
 			this_src.name_score = basename_same(one, two);
diff --git a/merge-recursive.c b/merge-recursive.c
index 8e82a8b..3ae4d53 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -22,11 +22,6 @@
 #include "dir.h"
 #include "submodule.h"
 
-static const char rename_limit_advice[] =
-"inexact rename detection was skipped because there were too many\n"
-"  files. You may want to set your merge.renamelimit variable to at least\n"
-"  %d and retry this merge.";
-
 static struct tree *shift_tree_object(struct tree *one, struct tree *two,
 				      const char *subtree_shift)
 {
@@ -1656,8 +1651,9 @@ int merge_recursive(struct merge_options *o,
 		commit_list_insert(h2, &(*result)->parents->next);
 	}
 	flush_output(o);
-	if (o->needed_rename_limit)
-		warning(rename_limit_advice, o->needed_rename_limit);
+	if (show(o, 2))
+		diff_warn_rename_limit("merge.renamelimit",
+				       o->needed_rename_limit, 0);
 	return clean;
 }
 
diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index 71bac83..301f3a0 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -77,4 +77,29 @@ test_expect_success  'favour same basenames even with minor differences' '
 	git show HEAD:path1 | sed "s/15/16/" > subdir/path1 &&
 	git status | grep "renamed: .*path1 -> subdir/path1"'
 
+test_expect_success 'setup for many rename source candidates' '
+	git reset --hard &&
+	for i in 0 1 2 3 4 5 6 7 8 9;
+	do
+		for j in 0 1 2 3 4 5 6 7 8 9;
+		do
+			echo "$i$j" >"path$i$j"
+		done
+	done &&
+	git add "path??" &&
+	test_tick &&
+	git commit -m "hundred" &&
+	(cat path1; echo new) >new-path &&
+	echo old >>path1 &&
+	git add new-path path1 &&
+	git diff -l 4 -C -C --cached --name-status >actual 2>actual.err &&
+	sed -e "s/^\([CM]\)[0-9]*	/\1	/" actual >actual.munged &&
+	cat >expect <<-EOF &&
+	C	path1	new-path
+	M	path1
+	EOF
+	test_cmp expect actual.munged &&
+	grep warning actual.err
+'
+
 test_done
-- 
1.7.4.1.554.gfdad8

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2011-03-23 18:18 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-05 17:46 --find-copies-harder finds fewer copies/renames than -C does Stefan Haller
2011-01-05 18:59 ` Junio C Hamano
2011-01-06 16:54   ` Stefan Haller
2011-01-06 20:42     ` Junio C Hamano
2011-01-06 21:50       ` [PATCH 0/3] Falling back "diff -C -C" to "diff -C" more gracefully Junio C Hamano
2011-01-06 21:50         ` [PATCH 1/3] diffcore-rename: refactor "too many candidates" logic Junio C Hamano
2011-01-06 21:50         ` [PATCH 2/3] diffcore-rename: record filepair for rename src Junio C Hamano
2011-01-06 21:50         ` [PATCH 3/3] diffcore-rename: fall back to -C when -C -C busts the rename limit Junio C Hamano
2011-01-06 22:28         ` [PATCH 0/3] Falling back "diff -C -C" to "diff -C" more gracefully Jeff King
2011-03-22 21:45 [PATCH] builtin/diff.c: remove duplicated call to diff_result_code() Junio C Hamano
2011-03-22 21:50 ` [PATCH 1/3] diffcore-rename: refactor "too many candidates" logic Junio C Hamano
2011-03-22 21:50   ` [PATCH 3/3] diffcore-rename: fall back to -C when -C -C busts the rename limit Junio C Hamano
2011-03-23 15:58     ` Jeff King
2011-03-23 16:41       ` Junio C Hamano
2011-03-23 16:50         ` Jeff King
2011-03-23 18:17       ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.