* [RFC] Speeding up a null fetch
@ 2007-02-11 23:32 Julian Phillips
  2007-02-11 23:49 ` Johannes Schindelin
  2007-02-11 23:52 ` Shawn O. Pearce
  0 siblings, 2 replies; 5+ messages in thread
From: Julian Phillips @ 2007-02-11 23:32 UTC
  To: git

I was investigating replacing an existing subversion setup with git, and
was mostly pleased with the results - until it came to trying to update a
clone ... which took very much longer than the original clone.

An artificial test repository that has similar features (~25000 commits,
~8000 tags, ~900 branches and a 2.5GB packfile) when running locally
takes ~20m to clone and ~48m to fetch (with no new commits in the
original repository - i.e. the fetch does not update anything) with a
current code base (i.e. newer than 1.5.0-rc4).  As a side note,
performance was actually better with an older version - packed refs
make things quite a bit worse (clone was only ~30m with 1.4 IIRC).

Investigation showed that the main culprit seemed to be show-ref
having to build a sorted list of all refs for every ref that was being
checked.  So I used the patch below to reduce this to a single call to
show-ref (unless the ref had been updated).  With this patch the fetch
time dropped to just under 1m - obviously quite a lot faster (better
than I expected in fact).

However, this seems more band-aid than fix, and I wondered if someone
more familiar with the git internals could point me in the right
direction for a better fix, e.g. should I look at rewriting fetch in C?
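
The idea behind the patch can be sketched in isolation: read the local ref
list once, keep it in a dictionary, and annotate each fetched ref with the
SHA-1 it already has locally, so that unchanged refs can be skipped without
another show-ref call. This is only a minimal sketch; the function names
(parse_show_ref, annotate, needs_update) are hypothetical and are not part
of git or of the patch below.

```python
def parse_show_ref(output):
    """Map ref name -> sha from show-ref style lines ("<sha> <ref>")."""
    refs = {}
    for line in output.splitlines():
        sha, _, name = line.strip().partition(" ")
        if name:
            refs[name] = sha
    return refs


def annotate(fetched, ref_map, local_refs):
    """Yield (sha, remote_ref, local_sha) for each fetched ref.

    local_sha is "-" when the mapped local ref does not exist yet.
    """
    for sha, remote_ref in fetched:
        local_name = ref_map.get(remote_ref)
        yield sha, remote_ref, local_refs.get(local_name, "-")


def needs_update(sha, local_sha):
    """A ref needs the slow update path only if its value changed."""
    return sha != local_sha
```

With N refs this does one pass over the ref list plus N dictionary lookups,
instead of rebuilding the sorted ref list once per ref.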

diff --git a/Makefile b/Makefile
index 5d31e6d..6baf043 100644
--- a/Makefile
+++ b/Makefile
@@ -120,7 +120,7 @@ ALL_CFLAGS = $(CFLAGS)
 ALL_LDFLAGS = $(LDFLAGS)
 STRIP ?= strip
 
-prefix = $(HOME)
+prefix = $(HOME)/git
 bindir = $(prefix)/bin
 gitexecdir = $(bindir)
 template_dir = $(prefix)/share/git-core/templates/
@@ -188,7 +188,7 @@ SCRIPT_PERL = \
 
 SCRIPTS = $(patsubst %.sh,%,$(SCRIPT_SH)) \
 	  $(patsubst %.perl,%,$(SCRIPT_PERL)) \
-	  git-cherry-pick git-status git-instaweb
+	  git-cherry-pick git-status git-instaweb git-ref-diff.py
 
 # ... and all the rest that could be moved out of bindir to gitexecdir
 PROGRAMS = \
diff --git a/git-fetch.sh b/git-fetch.sh
index 357cac2..ce135a5 100755
--- a/git-fetch.sh
+++ b/git-fetch.sh
@@ -108,11 +108,12 @@ ls_remote_result=$(git ls-remote $exec "$remote") ||
 
 append_fetch_head () {
     head_="$1"
-    remote_="$2"
-    remote_name_="$3"
-    remote_nick_="$4"
-    local_name_="$5"
-    case "$6" in
+    local_head_="$2"
+    remote_="$3"
+    remote_name_="$4"
+    remote_nick_="$5"
+    local_name_="$6"
+    case "$7" in
     t) not_for_merge_='not-for-merge' ;;
     '') not_for_merge_= ;;
     esac
@@ -151,10 +152,15 @@ append_fetch_head () {
 	echo "$head_	not-for-merge	$note_" >>"$GIT_DIR/FETCH_HEAD"
     fi
 
-    update_local_ref "$local_name_" "$head_" "$note_"
+    update_local_ref "$local_name_" "$head_" "$note_" "$local_head_"
 }
 
 update_local_ref () {
+    if [ "$2" = "$4" ]; then
+	[ "$verbose" ] && echo >&2 "* $1: same as $3"
+	return 0
+    fi
+
     # If we are storing the head locally make sure that it is
     # a fast forward (aka "reverse push").
 
@@ -392,7 +398,7 @@ fetch_main () {
       (
 	  git-fetch-pack --thin $exec $keep $shallow_depth "$remote" $rref ||
 	  echo failed "$remote"
-      ) |
+      ) | git-ref-diff.py "$reflist" |
       (
 	trap '
 		if test -n "$keepfile" && test -f "$keepfile"
@@ -402,7 +408,7 @@ fetch_main () {
 	' 0
 
         keepfile=
-	while read sha1 remote_name
+	while read sha1 remote_name local_sha1
 	do
 	  case "$sha1" in
 	  failed)
@@ -441,7 +447,7 @@ fetch_main () {
 	      esac
 	  done
 	  local_name=$(expr "z$found" : 'z[^:]*:\(.*\)')
-	  append_fetch_head "$sha1" "$remote" \
+	  append_fetch_head "$sha1" "$local_sha1" "$remote" \
 		  "$remote_name" "$remote_nick" "$local_name" \
 		  "$not_for_merge" || exit
         done
diff --git a/git-ref-diff.py b/git-ref-diff.py
new file mode 100755
index 0000000..2b30e4c
--- /dev/null
+++ b/git-ref-diff.py
@@ -0,0 +1,33 @@
+#!/usr/bin/python
+
+import os
+import re
+import sys
+
+ref_map_re = re.compile("^\.?\+?(?P<remote>.*?):(?P<local>.*)$")
+
+refs = {}
+refsp = os.popen("git-show-ref")
+for ref in refsp.readlines():
+    (sha, ref) = ref.strip().split(' ')
+    refs[ref] = sha
+refsp.close()
+
+ref_map = {}
+for line in sys.argv[1].split('\n'):
+    ref_map_m = ref_map_re.search(line)
+    if ref_map_m:
+        remote = ref_map_m.group('remote')
+        local = ref_map_m.group('local')
+        ref_map[remote] = local
+
+while True:
+    try:
+        (sha, ref) = raw_input().split(' ')
+    except EOFError:
+        sys.exit(0)
+    lref = ref_map.get(ref, None)
+    if refs.has_key(lref):
+        print "%s %s %s" % (sha, ref, refs[lref])
+    else:
+        print "%s %s -" % (sha, ref)

-- 
Julian

 --- 
Why bother building any more nuclear warheads until we use the ones we have?


* Re: [RFC] Speeding up a null fetch
  2007-02-11 23:32 [RFC] Speeding up a null fetch Julian Phillips
@ 2007-02-11 23:49 ` Johannes Schindelin
  2007-02-12  0:14   ` Julian Phillips
  2007-02-11 23:52 ` Shawn O. Pearce
  1 sibling, 1 reply; 5+ messages in thread
From: Johannes Schindelin @ 2007-02-11 23:49 UTC
  To: Julian Phillips; +Cc: git

Hi,

On Sun, 11 Feb 2007, Julian Phillips wrote:

> An artificial test repository that has similar features (~25000 commits,
> ~8000 tags, ~900 branches and a 2.5GB packfile) when running locally
> takes ~20m to clone and ~48m to fetch (with no new commits in the
> original repository - i.e. the fetch does not update anything) with a
> current code base (i.e. newer than 1.5.0-rc4).

Ouch.

I hope you packed the refs?

BTW your patch
- was not minimal (and therefore it takes longer than necessary to find 
  what you actually fixed),
- it does not show where and how the call to show-ref is avoided (I 
  eventually understand that you avoid calling update_local_ref early, but 
  you sure could have made that easier), and
- it uses Pythong.

Also, it touches a quite core part of git, which will hopefully be 
replaced by a builtin _after_ 1.5.0.

> However, this seems more band-aid than fix, and I wondered if someone 
> more familiar with the git internals could point me in the right 
> direction for a better fix, e.g. should I look at rewriting fetch in C?

Look into the "pu" branch of git. There are the beginnings of a builtin 
(written in C) fetch.

But this _will_ have to wait until after 1.5.0.

Ciao,
Dscho


* Re: [RFC] Speeding up a null fetch
  2007-02-11 23:32 [RFC] Speeding up a null fetch Julian Phillips
  2007-02-11 23:49 ` Johannes Schindelin
@ 2007-02-11 23:52 ` Shawn O. Pearce
  2007-02-12  0:18   ` Julian Phillips
  1 sibling, 1 reply; 5+ messages in thread
From: Shawn O. Pearce @ 2007-02-11 23:52 UTC
  To: Julian Phillips; +Cc: git

Julian Phillips <julian@quantumfyre.co.uk> wrote:
> Investigation showed that the main culprit seemed to be show-ref
> having to build a sorted list of all refs for every ref that was being
> checked.  So I used the patch below to reduce this to a single call to
> show-ref (unless the ref had been updated).  With this patch the fetch
> time dropped to just under 1m - obviously quite a lot faster (better
> than I expected in fact).

Have a look at the `pu` branch in git.git.  Junio has done some
work in this area to handle 1000 refs better:

  ...
  commit 58fef67cb067b6dee8f94b7b0e0c1a2d324e3505
  Author: Junio C Hamano <junkio@cox.net>
  Date:   Tue Jan 16 02:31:36 2007 -0800

    git-fetch: rewrite another shell loop in C
    
    Move another shell loop that canonicalizes the list of refs for
    underlying git-fetch-pack and fetch-native-store into C.
    
    This seems to shave the runtime for the same 1000 branch
    repository from 30 seconds down to 15 seconds (it used to be 2
    and half minutes with the original version).
    
    Signed-off-by: Junio C Hamano <junkio@cox.net>

  commit 3fc3729cd08e9d40dad54ccdd4db53900eca197b
  Author: Junio C Hamano <junkio@cox.net>
  Date:   Tue Jan 16 01:53:29 2007 -0800

    git-fetch: move more code into C.
    
    This adds "native-store" subcommand to git-fetch--tool to
    move a huge loop implemented in shell into C.  This shaves about
    70% of the runtime to fetch and update 1000 tracking branches
    with a single fetch.
    
    Signed-off-by: Junio C Hamano <junkio@cox.net>
  ...

> However, this seems more band-aid than fix, and I wondered if someone
> more familiar with the git internals could point me in the right
> direction for a better fix, e.g. should I look at rewriting fetch in C?

Rewriting fetch in C is a lot of work, not just in developing it,
but in testing that all existing functionality is preserved and no
new bugs are introduced.  Rewriting some of the performance critical
parts perhaps makes sense.  Rewriting them in Python doesn't, as
we no longer have any Python dependency, and would like to keep it
that way (actually, some folks are also trying to remove the Perl
dependency from some of our critical tools).

-- 
Shawn.


* Re: [RFC] Speeding up a null fetch
  2007-02-11 23:49 ` Johannes Schindelin
@ 2007-02-12  0:14   ` Julian Phillips
  0 siblings, 0 replies; 5+ messages in thread
From: Julian Phillips @ 2007-02-12  0:14 UTC
  To: Johannes Schindelin; +Cc: git

On Mon, 12 Feb 2007, Johannes Schindelin wrote:

> Hi,
>
> On Sun, 11 Feb 2007, Julian Phillips wrote:
>
>> An artificial test repository that has similar features (~25000 commits,
>> ~8000 tags, ~900 branches and a 2.5GB packfile) when running locally
>> takes ~20m to clone and ~48m to fetch (with no new commits in the
>> original repository - i.e. the fetch does not update anything) with a
>> current code base (i.e. newer than 1.5.0-rc4).
>
> Ouch.
>
> I hope you packed the refs?

Unfortunately packing only makes things slower ... as it then becomes
impossible to access a particular ref directly, which some of the
calls to show-ref do.
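
To illustrate the difference: a loose ref is a single small file that can be
opened by name, while a ref stored in packed-refs has to be found by reading
that file. The following is a simplified model of the two lookups, not git's
actual implementation; the function names are hypothetical, symbolic refs
("ref: ...") are ignored, and the only assumption about packed-refs is its
"<sha> <refname>" line format with "#" header and "^" peeled-tag lines.

```python
import os


def read_loose_ref(git_dir, name):
    """Return the sha stored in a loose ref file, or None if it is absent."""
    path = os.path.join(git_dir, *name.split("/"))
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def read_packed_ref(git_dir, name):
    """Scan packed-refs line by line; cost grows with the number of refs."""
    try:
        with open(os.path.join(git_dir, "packed-refs")) as f:
            for line in f:
                # Skip the header comment and peeled-tag lines.
                if line.startswith(("#", "^")):
                    continue
                sha, _, ref = line.rstrip("\n").partition(" ")
                if ref == name:
                    return sha
    except OSError:
        pass
    return None


def resolve(git_dir, name):
    """Prefer the loose ref; fall back to scanning packed-refs."""
    sha = read_loose_ref(git_dir, name)
    return sha if sha is not None else read_packed_ref(git_dir, name)
```

In this model, looking up each of ~9000 packed refs one at a time means
scanning the packed file ~9000 times, which is the kind of repeated work
that a single cached read of the ref list avoids.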

>
> BTW your patch
> - was not minimal (and therefore it takes longer than necessary to find
>  what you actually fixed),
> - it does not show where and how the call to show-ref is avoided (I
>  eventually understand that you avoid calling update_local_ref early, but
>  you sure could have made that easier), and

Ah yes, sorry.  I seem to have managed to forget to include the paragraph 
explaining what I had done ... :$

(That'll teach me to try doing too many things at once.)

> - it uses Pythong.
>
> Also, it touches a quite core part of git, which will hopefully be
> replaced by a builtin _after_ 1.5.0.

Indeed, I would never propose what I have done so far as a fix.  I am 
definitely still in the investigation phase.

>
>> However, this seems more band-aid than fix, and I wondered if someone
>> more familiar with the git internals could point me in the right
>> direction for a better fix, e.g. should I look at rewriting fetch in C?
>
> Look into the "pu" branch of git. There are the beginnings of a builtin
> (written in C) fetch.

Ah - this I didn't know.  I shall have to have a play with that, I did 
notice that there is internal caching of the ref list that might magically 
solve the problem if fetch was a builtin (but I have a feeling that it 
won't be that simple).

>
> But this _will_ have to wait until after 1.5.0.

I hope so.  1.5 is looking very nice, and I really don't think that many 
people have such a stupidly large repository ...

>
> Ciao,
> Dscho
>

-- 
Julian

  ---
You are in a maze of little twisting passages, all alike.


* Re: [RFC] Speeding up a null fetch
  2007-02-11 23:52 ` Shawn O. Pearce
@ 2007-02-12  0:18   ` Julian Phillips
  0 siblings, 0 replies; 5+ messages in thread
From: Julian Phillips @ 2007-02-12  0:18 UTC
  To: Shawn O. Pearce; +Cc: git

On Sun, 11 Feb 2007, Shawn O. Pearce wrote:

> Julian Phillips <julian@quantumfyre.co.uk> wrote:
>> Investigation showed that the main culprit seemed to be show-ref
>> having to build a sorted list of all refs for every ref that was being
>> checked.  So I used the patch below to reduce this to a single call to
>> show-ref (unless the ref had been updated).  With this patch the fetch
>> time dropped to just under 1m - obviously quite a lot faster (better
>> than I expected in fact).
>
> Have a look at the `pu` branch in git.git.  Junio has done some
> work in this area to handle 1000 refs better:
>
>  ...
>  commit 58fef67cb067b6dee8f94b7b0e0c1a2d324e3505
>  Author: Junio C Hamano <junkio@cox.net>
>  Date:   Tue Jan 16 02:31:36 2007 -0800
>
>    git-fetch: rewrite another shell loop in C
>
>    Move another shell loop that canonicalizes the list of refs for
>    underlying git-fetch-pack and fetch-native-store into C.
>
>    This seems to shave the runtime for the same 1000 branch
>    repository from 30 seconds down to 15 seconds (it used to be 2
>    and half minutes with the original version).
>
>    Signed-off-by: Junio C Hamano <junkio@cox.net>
>
>  commit 3fc3729cd08e9d40dad54ccdd4db53900eca197b
>  Author: Junio C Hamano <junkio@cox.net>
>  Date:   Tue Jan 16 01:53:29 2007 -0800
>
>    git-fetch: move more code into C.
>
>    This adds "native-store" subcommand to git-fetch--tool to
>    move a huge loop implemented in shell into C.  This shaves about
>    70% of the runtime to fetch and update 1000 tracking branches
>    with a single fetch.
>
>    Signed-off-by: Junio C Hamano <junkio@cox.net>
>  ...
>

I shall have to see how this work fares with ~9000 refs ... but it 
certainly sounds good.

>> However, this seems more band-aid than fix, and I wondered if someone
>> more familiar with the git internals could point me in the right
>> direction for a better fix, e.g. should I look at rewriting fetch in C?
>
> Rewriting fetch in C is a lot of work, not just in developing it,
> but in testing that all existing functionality is preserved and no
> new bugs are introduced.  Rewriting some of the performance critical
> parts perhaps makes sense.

Indeed - this is why I asked rather than just diving in.

> Rewriting them in Python doesn't, as
> we no longer have any Python dependency, and would like to keep it
> that way (actually, some folks are also trying to remove the Perl
> dependency from some of our critical tools).

I only used python for speed of development, I was simply trying to verify 
my suspicions.  I certainly wouldn't expect a python script to get added 
(having seen all the python scripts get replaced).

-- 
Julian

  ---
There are no accidents whatsoever in the universe.
 		-- Baba Ram Dass
