All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Rename detection: Avoid repeated filespec population
@ 2009-01-20 15:59 Björn Steinbrink
  2009-01-20 21:27 ` Jeff King
  2009-01-21 10:27 ` Junio C Hamano
  0 siblings, 2 replies; 6+ messages in thread
From: Björn Steinbrink @ 2009-01-20 15:59 UTC (permalink / raw)
  To: Junio C Hamano, Linus Torvalds; +Cc: git

In diffcore_rename, we assume that the blob contents in the filespec
aren't required anymore after estimate_similarity has been called and thus
we free it. But estimate_similarity might return early when the file sizes
differ too much. In that case, cnt_data is never set and the next call to
estimate_similarity will populate the filespec again, eventually rereading
the same blob over and over again.

To fix that, we first get the blob sizes and only when the blob contents
are actually required, and when cnt_data will be set, the full filespec is
populated, once.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
---
This actually affects the copy detection way more than the pure
rename detection, due to the larger number of candidates, but it's the
same code, and I only realized that when I reran the stuff to get some
numbers to show off. ;-)

Test with linux-2.6.git and hot caches:
git diff-tree -p db30c705758^2..db30c705758

	| unpatched | patched
------------------------------
plain	|        3s |     3s
-M	|        4s |     3s
-C	|    2m 25s |     9s
-C -C	|    4m 30s |    24s


 diffcore-rename.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 168a95b..0b0d6b8 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -153,9 +153,9 @@ static int estimate_similarity(struct diff_filespec *src,
 	 * is a possible size - we really should have a flag to
 	 * say whether the size is valid or not!)
 	 */
-	if (!src->cnt_data && diff_populate_filespec(src, 0))
+	if (!src->cnt_data && diff_populate_filespec(src, 1))
 		return 0;
-	if (!dst->cnt_data && diff_populate_filespec(dst, 0))
+	if (!dst->cnt_data && diff_populate_filespec(dst, 1))
 		return 0;
 
 	max_size = ((src->size > dst->size) ? src->size : dst->size);
@@ -173,6 +173,11 @@ static int estimate_similarity(struct diff_filespec *src,
 	if (base_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
 		return 0;
 
+	if (!src->cnt_data && diff_populate_filespec(src, 0))
+		return 0;
+	if (!dst->cnt_data && diff_populate_filespec(dst, 0))
+		return 0;
+
 	delta_limit = (unsigned long)
 		(base_size * (MAX_SCORE-minimum_score) / MAX_SCORE);
 	if (diffcore_count_changes(src, dst,
-- 
1.6.1.135.g3cf3b

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] Rename detection: Avoid repeated filespec population
  2009-01-20 15:59 [PATCH] Rename detection: Avoid repeated filespec population Björn Steinbrink
@ 2009-01-20 21:27 ` Jeff King
  2009-01-20 22:12   ` Jeff King
  2009-01-21 12:56   ` Björn Steinbrink
  2009-01-21 10:27 ` Junio C Hamano
  1 sibling, 2 replies; 6+ messages in thread
From: Jeff King @ 2009-01-20 21:27 UTC (permalink / raw)
  To: Björn Steinbrink; +Cc: Junio C Hamano, Linus Torvalds, git

On Tue, Jan 20, 2009 at 04:59:57PM +0100, Björn Steinbrink wrote:

> In diffcore_rename, we assume that the blob contents in the filespec
> aren't required anymore after estimate_similarity has been called and thus
> we free it. But estimate_similarity might return early when the file sizes
> differ too much. In that case, cnt_data is never set and the next call to
> estimate_similarity will populate the filespec again, eventually rereading
> the same blob over and over again.
> 
> To fix that, we first get the blob sizes and only when the blob contents
> are actually required, and when cnt_data will be set, the full filespec is
> populated, once.

I think this is a sane thing to do, and obviously your numbers show an
impressive improvement.

However, I found your explanation a little confusing. Yes, repeatedly
loading those blobs is a problem. But what this is really about is that
estimate_similarity has two levels of checks: cheap checks involving the
size, and expensive checks involving the content. What is happening now
is that we are doing extra work for the expensive checks, even if the
cheap checks are going to let us fail early.

And that's just stupid, and a waste of processing time even if we
_don't_ ever look at the same filespec again.

But what makes it even worse is that we have a system in place to cache
the expensive work, but because we only do part of it, we don't bother
to cache the result. And that's what your patch description is about.

So I think your patch is absolutely the right thing to do. But I think
from the commit message it isn't clear that it would not be equally
correct to follow through on generating cnt_data instead of an early
return (which _isn't_ right, because you might not need to generate
cnt_data at all).

> This actually affects the copy detection way more than the pure
> rename detection, due to the larger number of candidates, but it's the
> same code, and I only realized that when I reran the stuff to get some
> numbers to show off. ;-)

I have a pathological real-world rename test case from a long time ago.
It's so awful on rename because almost the whole repo was reorganized,
but almost every path got its content adjusted, too. I was hoping it
would be better with your patch, but it isn't: as it turns out, it is
_too_ uniform. The cheap size checks don't work because all of the files
are almost the same size (they're all jpgs from a camera).

But I think for non-pathological cases, your improvement makes sense.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Rename detection: Avoid repeated filespec population
  2009-01-20 21:27 ` Jeff King
@ 2009-01-20 22:12   ` Jeff King
  2009-01-21 12:56   ` Björn Steinbrink
  1 sibling, 0 replies; 6+ messages in thread
From: Jeff King @ 2009-01-20 22:12 UTC (permalink / raw)
  To: Björn Steinbrink; +Cc: Junio C Hamano, Linus Torvalds, git

On Tue, Jan 20, 2009 at 04:27:23PM -0500, Jeff King wrote:

> I have a pathological real-world rename test case from a long time ago.
> It's so awful on rename because almost the whole repo was reorganized,
> but almost every path got its content adjusted, too. I was hoping it
> would be better with your patch, but it isn't: as it turns out, it is
> _too_ uniform. The cheap size checks don't work because all of the files
> are almost the same size (they're all jpgs from a camera).

Nevermind, I'm stupid. Your patch _does_ yield a significant improvement
for my test case. I was accidentally comparing your patch to some of my
previous optimization attempts, not to vanilla git.

Without your patch, I get:

  $ time git show -l0 06d28867 >/dev/null
  841.88user 146.95system 17:18.99elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
  2476880inputs+0outputs (19372major+27092797minor)pagefaults 0swaps

With it, I get:

  $ time git show -l0 06d28867 >/dev/null
  100.51user 5.76system 2:31.07elapsed 70%CPU (0avgtext+0avgdata 0maxresident)k
  2020176inputs+0outputs (15771major+836127minor)pagefaults 0swaps

So much much better.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Rename detection: Avoid repeated filespec population
  2009-01-20 15:59 [PATCH] Rename detection: Avoid repeated filespec population Björn Steinbrink
  2009-01-20 21:27 ` Jeff King
@ 2009-01-21 10:27 ` Junio C Hamano
  1 sibling, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2009-01-21 10:27 UTC (permalink / raw)
  To: Björn Steinbrink; +Cc: Linus Torvalds, git

Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Rename detection: Avoid repeated filespec population
  2009-01-20 21:27 ` Jeff King
  2009-01-20 22:12   ` Jeff King
@ 2009-01-21 12:56   ` Björn Steinbrink
  2009-01-21 13:32     ` Jeff King
  1 sibling, 1 reply; 6+ messages in thread
From: Björn Steinbrink @ 2009-01-21 12:56 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Linus Torvalds, git

On 2009.01.20 16:27:23 -0500, Jeff King wrote:
> So I think your patch is absolutely the right thing to do. But I think
> from the commit message it isn't clear that it would not be equally
> correct to follow through on generating cnt_data instead of an early
> return (which _isn't_ right, because you might not need to generate
> cnt_data at all).

Yeah, the commit message wasn't exactly great, but after the fifth
attempt I decided to just sent the damn thing, to see whether at least
the patch itself makes sense.

Another possible solution would be to free the blob data only after the
loop in diffcore_rename has finished, but that's obviously quite bad WRT
memory consumption.  :-)

Anyway, too late, yesterday's attempts 6 to 10 at writing a better
commit message didn't work out either, and Junio has applied the patch
by now.

Björn

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Rename detection: Avoid repeated filespec population
  2009-01-21 12:56   ` Björn Steinbrink
@ 2009-01-21 13:32     ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2009-01-21 13:32 UTC (permalink / raw)
  To: Björn Steinbrink; +Cc: Junio C Hamano, Linus Torvalds, git

On Wed, Jan 21, 2009 at 01:56:19PM +0100, Björn Steinbrink wrote:

> Another possible solution would be to free the blob data only after the
> loop in diffcore_rename has finished, but that's obviously quite bad WRT
> memory consumption.  :-)

Yeah, and it doesn't have the (admittedly smaller, but still there)
optimization of not loading the blob data at all if we will never need
to.

> Anyway, too late, yesterday's attempts 6 to 10 at writing a better
> commit message didn't work out either, and Junio has applied the patch
> by now.

I think that's fine. My message was more about convincing myself that
your change was the right thing (and hopefully helped convince others,
too).

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-01-21 13:33 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-20 15:59 [PATCH] Rename detection: Avoid repeated filespec population Björn Steinbrink
2009-01-20 21:27 ` Jeff King
2009-01-20 22:12   ` Jeff King
2009-01-21 12:56   ` Björn Steinbrink
2009-01-21 13:32     ` Jeff King
2009-01-21 10:27 ` Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.