Git has bad performance when traversing change-sets with large PPM files

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Git has bad performance when traversing change-sets with large PPM  files
@ 2010-02-19 21:24 Laine Walker-Avina
  2010-02-19 23:30 ` [PATCH] rebase -i: avoid --cherry-pick when rebasing to a direct ancestor Thomas Rast
  0 siblings, 1 reply; 8+ messages in thread
From: Laine Walker-Avina @ 2010-02-19 21:24 UTC (permalink / raw)
  To: git

Steps to reproduce:

1. Create a new git repo and add a 640x480 PPM image in ASCII format
2. Add a .gitattributes file with the line "*.ppm binary"
3. Commit the PPM file
4. Change the PPM image's contents
5. Add and commit
6. Try to do a git rebase -i to the initial commit
7. Watch the CPU peg at 100%

I'm using git version 1.6.3.3

Please CC me, I'm not on the list.

Thanks,
-- 
Laine Walker-Avina
Firmware Engineer
PASCO scientific

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH] rebase -i: avoid --cherry-pick when rebasing to a direct ancestor
  2010-02-19 21:24 Git has bad performance when traversing change-sets with large PPM files Laine Walker-Avina
@ 2010-02-19 23:30 ` Thomas Rast
  2010-02-20  0:02   ` Thomas Rast
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Rast @ 2010-02-19 23:30 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Laine Walker-Avina

Ordinary 'rebase -i' reads the commits to rebase with (roughly)

  git rev-list --left-right --cherry-pick $upstream...

which gives it the feature of skipping commits that are already
present in $upstream.  However, in the common use-case of rewriting a
few commits up to an ancestor, as in 'git rebase -i HEAD~3', the
--cherry-pick is useless since there are no commits to compare to.

Add a check if $upstream is a direct ancestor of HEAD, and leave away
the --cherry-pick if so.  Since the --cherry-pick is already in
$MERGES_OPTION, we need to decide this before setting the latter.

For simplicity we skip --root mode.  In theory we could do the same
optimization, but using --root --onto <ancestor> is probably even more
rare than having performance issues with --cherry-pick.

Signed-off-by: Thomas Rast <trast@student.ethz.ch>
---

The --cherry-pick mechanism itself could get a similar optimization,
but I don't know that code.


 git-rebase--interactive.sh |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
index 1fda620..10b0ed8 100755
--- a/git-rebase--interactive.sh
+++ b/git-rebase--interactive.sh
@@ -870,7 +870,12 @@ first and then run 'git rebase --continue' again."
 			MERGES_OPTION=
 			first_after_upstream="$(git rev-list --reverse --first-parent $UPSTREAM..$HEAD | head -n 1)"
 		else
-			MERGES_OPTION="--no-merges --cherry-pick"
+			if test -z "$REBASE_ROOT" &&
+				test $(git merge-base $UPSTREAM $HEAD) = $UPSTREAM; then
+				MERGES_OPTION="--no-merges"
+			else
+				MERGES_OPTION="--no-merges --cherry-pick"
+			fi
 		fi
 
 		SHORTHEAD=$(git rev-parse --short $HEAD)
-- 
1.7.0.139.gd1a75

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] rebase -i: avoid --cherry-pick when rebasing to a direct ancestor
  2010-02-19 23:30 ` [PATCH] rebase -i: avoid --cherry-pick when rebasing to a direct ancestor Thomas Rast
@ 2010-02-20  0:02   ` Thomas Rast
  2010-02-20  7:27     ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Rast @ 2010-02-20  0:02 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Laine Walker-Avina

On Saturday 20 February 2010 00:30:52 Thomas Rast wrote:
> Ordinary 'rebase -i' reads the commits to rebase with (roughly)
> 
>   git rev-list --left-right --cherry-pick $upstream...
> 
> which gives it the feature of skipping commits that are already
> present in $upstream.  However, in the common use-case of rewriting a
> few commits up to an ancestor, as in 'git rebase -i HEAD~3', the
> --cherry-pick is useless since there are no commits to compare to.
[...]
> The --cherry-pick mechanism itself could get a similar optimization,
> but I don't know that code.

Or maybe it's as simple as this?

diff --git i/revision.c w/revision.c
index 438cc87..29721ec 100644
--- i/revision.c
+++ w/revision.c
@@ -547,6 +547,9 @@ static void cherry_pick_list(struct commit_list *list, struct rev_info *revs)
 			right_count++;
 	}
 
+	if (!left_count || !right_count)
+		return;
+
 	left_first = left_count < right_count;
 	init_patch_ids(&ids);
 	if (revs->diffopt.nr_paths) {

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] rebase -i: avoid --cherry-pick when rebasing to a direct ancestor
  2010-02-20  0:02   ` Thomas Rast
@ 2010-02-20  7:27     ` Jeff King
  2010-02-20 11:42       ` [PATCH] cherry_pick_list: quit early if one side is empty Thomas Rast
  2010-02-20 13:38       ` [PATCH] rebase -i: avoid --cherry-pick when rebasing to a direct ancestor Thomas Rast
  0 siblings, 2 replies; 8+ messages in thread
From: Jeff King @ 2010-02-20  7:27 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git, Junio C Hamano, Laine Walker-Avina

On Sat, Feb 20, 2010 at 01:02:15AM +0100, Thomas Rast wrote:

> > the --cherry-pick is useless since there are no commits to compare to.
> [...]
> Or maybe it's as simple as this?
> 
> diff --git i/revision.c w/revision.c
> index 438cc87..29721ec 100644
> --- i/revision.c
> +++ w/revision.c
> @@ -547,6 +547,9 @@ static void cherry_pick_list(struct commit_list *list, struct rev_info *revs)
>  			right_count++;
>  	}
>  
> +	if (!left_count || !right_count)
> +		return;
> +
>  	left_first = left_count < right_count;
>  	init_patch_ids(&ids);
>  	if (revs->diffopt.nr_paths) {

>From my reading of the code, that is right, but this is the first time I
ever looked at it, so who knows? :)

Also, I was curious whether we correctly respected gitattributes when
calculating patch ids (specifically, do we actually treat this thing as
binary). But it looks like patch-id's don't make any exceptions for
binary files, and they get fed to xdiff. I think this is not buggy (the
xdiff consumer seems to get the diff content with a length, so we are
not treating the potentially binary data as a C-string anywhere that I
see). But it is probably the source of the slowness to xdiff that
gigantic files.

Does it really make sense to treat binary files as anything other than a
blob for generating patch id? That is, should we simply turn it into:

  binary diff
  $from_sha1
  $to_sha1

and hash that for the patch id?

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH] cherry_pick_list: quit early if one side is empty
  2010-02-20  7:27     ` Jeff King
@ 2010-02-20 11:42       ` Thomas Rast
  2010-02-21  6:50         ` Jeff King
  2010-02-20 13:38       ` [PATCH] rebase -i: avoid --cherry-pick when rebasing to a direct ancestor Thomas Rast
  1 sibling, 1 reply; 8+ messages in thread
From: Thomas Rast @ 2010-02-20 11:42 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Junio C Hamano, Laine Walker-Avina

The --cherry-pick logic starts by counting the commits on each side,
so that it can filter away commits on the bigger one.  However, so
far it missed an opportunity for optimization: it doesn't need to do
any work if either side is empty.

This in particular helps the common use-case 'git rebase -i HEAD~$n':
it internally uses --cherry-pick, but since HEAD~$n is a direct
ancestor the left side is always empty.

Signed-off-by: Thomas Rast <trast@student.ethz.ch>
---

On Saturday 20 February 2010 08:27:28 Jeff King wrote:
> 
> From my reading of the code, that is right, but this is the first time I
> ever looked at it, so who knows? :)

Junio agreed too, so here's a real patch.  To keep the complexity of
git-rebase--interactive from exploding even further, you can drop the
earlier one.

> Does it really make sense to treat binary files as anything other than a
> blob for generating patch id? That is, should we simply turn it into:
> 
>   binary diff
>   $from_sha1
>   $to_sha1
> 
> and hash that for the patch id?

I tend to agree, but I can't seem to find out what flags to flip :-(


 revision.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/revision.c b/revision.c
index 438cc87..29721ec 100644
--- a/revision.c
+++ b/revision.c
@@ -547,6 +547,9 @@ static void cherry_pick_list(struct commit_list *list, struct rev_info *revs)
 			right_count++;
 	}
 
+	if (!left_count || !right_count)
+		return;
+
 	left_first = left_count < right_count;
 	init_patch_ids(&ids);
 	if (revs->diffopt.nr_paths) {
-- 
1.7.0.139.gd1a75.dirty

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] rebase -i: avoid --cherry-pick when rebasing to a direct ancestor
  2010-02-20  7:27     ` Jeff King
  2010-02-20 11:42       ` [PATCH] cherry_pick_list: quit early if one side is empty Thomas Rast
@ 2010-02-20 13:38       ` Thomas Rast
  2010-02-21  7:46         ` Jeff King
  1 sibling, 1 reply; 8+ messages in thread
From: Thomas Rast @ 2010-02-20 13:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano, Laine Walker-Avina

On Saturday 20 February 2010 08:27:28 Jeff King wrote:
> But it is probably the source of the slowness to xdiff that
> gigantic files.

BTW, here's a weird data point:

$ ls -l a b
-rw-r--r-- 1 thomas users 3300765 2010-02-20 12:48 a
-rw-r--r-- 1 thomas users 3253762 2010-02-20 12:48 b
$ time diff -u a b | wc -l
54530

real    0m0.644s
user    0m0.562s
sys     0m0.044s
$ time git diff --no-index a b >/dev/null

real    0m22.848s
user    0m21.956s
sys     0m0.137s
$ time git diff --no-index --patience a b >/dev/null

real    0m19.508s
user    0m18.673s
sys     0m0.273s

'a' and 'b' are two pnm's as per the OPs specification, I made 'a' a
gradient and 'b' the same with two crosses drawn over it.  You can
find them at

  http://thomasrast.ch/download/slow-diff-pnms.zip

if you want to reproduce.

So what on earth does 'diff' do that makes it 35 times as fast?

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] cherry_pick_list: quit early if one side is empty
  2010-02-20 11:42       ` [PATCH] cherry_pick_list: quit early if one side is empty Thomas Rast
@ 2010-02-21  6:50         ` Jeff King
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2010-02-21  6:50 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git, Junio C Hamano, Laine Walker-Avina

On Sat, Feb 20, 2010 at 12:42:04PM +0100, Thomas Rast wrote:

> > Does it really make sense to treat binary files as anything other than a
> > blob for generating patch id? That is, should we simply turn it into:
> > 
> >   binary diff
> >   $from_sha1
> >   $to_sha1
> > 
> > and hash that for the patch id?
> 
> I tend to agree, but I can't seem to find out what flags to flip :-(

You would need to call diff_filespec_is_binary when flushing the diff
queue, which handles both attributes and autodetection. Something like:

diff --git a/diff.c b/diff.c
index 989dbc5..97ce40a 100644
--- a/diff.c
+++ b/diff.c
@@ -3381,6 +3381,14 @@ static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1)

 		diff_fill_sha1_info(p->one);
 		diff_fill_sha1_info(p->two);
+
+		if (diff_filespec_is_binary(p->one) ||
+		    diff_filespec_is_binary(p->two)) {
+			/* TODO: write binary patch into buffer */
+			git_SHA1_Update(&ctx, buffer, len);
+			continue;
+		}
+
 		if (fill_mmfile(&mf1, p->one) < 0 ||
 				fill_mmfile(&mf2, p->two) < 0)
 			return error("unable to read files to diff");

However, thinking on it more, it is a bit more complicated than that.
The patch-id we generate for --cherry-pick is not purely internal. You
can also generate patch-id's by handing the patch text to git-patch-id,
which strips out whitespace and line numbers.  Which means that whatever
we generate should probably match the binary patch format output by the
regular diff.

On the other hand, what we do now for cherry-pick totally does not match
what "git log" would output, and nobody has complained, so perhaps it is
not a big deal.

And I guess they don't technically need to match. The --cherry-pick
thing is internal (i.e., as long as you use the same format for both
sides, you are fine). But I suspect there are other code paths that use
that patch-id code.

-Peff

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] rebase -i: avoid --cherry-pick when rebasing to a direct ancestor
  2010-02-20 13:38       ` [PATCH] rebase -i: avoid --cherry-pick when rebasing to a direct ancestor Thomas Rast
@ 2010-02-21  7:46         ` Jeff King
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2010-02-21  7:46 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git, Junio C Hamano, Laine Walker-Avina

On Sat, Feb 20, 2010 at 02:38:29PM +0100, Thomas Rast wrote:

> BTW, here's a weird data point:
> 
> $ ls -l a b
> -rw-r--r-- 1 thomas users 3300765 2010-02-20 12:48 a
> -rw-r--r-- 1 thomas users 3253762 2010-02-20 12:48 b
> $ time diff -u a b | wc -l
> 54530
> 
> real    0m0.644s
> user    0m0.562s
> sys     0m0.044s
> $ time git diff --no-index a b >/dev/null
> 
> real    0m22.848s
> user    0m21.956s
> sys     0m0.137s
> $ time git diff --no-index --patience a b >/dev/null
> 
> real    0m19.508s
> user    0m18.673s
> sys     0m0.273s

I was able to reproduce here. According to 'perf', git spends 70% of its
time in xdl_split, which says:

/*
 * See "An O(ND) Difference Algorithm and its Variations", by Eugene Myers.
 * Basically considers a "box" (off1, off2, lim1, lim2) and scan from both
 * the forward diagonal starting from (off1, off2) and the backward diagonal
 * starting from (lim1, lim2). If the K values on the same diagonal crosses
 * returns the furthest point of reach. We might end up having to expensive
 * cases using this algorithm is full, so a little bit of heuristic is needed
 * to cut the search and to return a suboptimal point.
 */

GNU diff also seems to use the same basic Myers algorithm, but says:

   The basic algorithm is described in:
   "An O(ND) Difference Algorithm and its Variations", Eugene Myers,
   Algorithmica Vol. 1 No. 2, 1986, pp. 251-266;
   see especially section 4.2, which describes the variation used below.

   The basic algorithm was independently discovered as described in:
   "Algorithms for Approximate String Matching", E. Ukkonen,
   Information and Control Vol. 64, 1985, pp. 100-118.

   Unless the 'find_minimal' flag is set, this code uses the TOO_EXPENSIVE
   heuristic, by Paul Eggert, to limit the cost to O(N**1.5 log N)
   at the price of producing suboptimal output for large inputs with
   many differences.

So it may be that TOO_EXPENSIVE heuristic. I don't know enough about the
diff code (git's or otherwise) to say how difficult it would be to adapt
GNU diff's tricks.

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2010-02-21  7:46 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-19 21:24 Git has bad performance when traversing change-sets with large PPM files Laine Walker-Avina
2010-02-19 23:30 ` [PATCH] rebase -i: avoid --cherry-pick when rebasing to a direct ancestor Thomas Rast
2010-02-20  0:02   ` Thomas Rast
2010-02-20  7:27     ` Jeff King
2010-02-20 11:42       ` [PATCH] cherry_pick_list: quit early if one side is empty Thomas Rast
2010-02-21  6:50         ` Jeff King
2010-02-20 13:38       ` [PATCH] rebase -i: avoid --cherry-pick when rebasing to a direct ancestor Thomas Rast
2010-02-21  7:46         ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).