* [RFC] Add --create-cache to repack
@ 2011-01-28  8:06 Shawn O. Pearce
  2011-01-28  9:08 ` Johannes Sixt
From: Shawn O. Pearce @ 2011-01-28  8:06 UTC (permalink / raw)
  To: git, Junio C Hamano, Nicolas Pitre; +Cc: John Hawley

A cache pack is all objects reachable from a single commit that is
part of the project's stable history, won't disappear, and is
accessible to all readers of the repository.  Because the pack contains
only that commit and its contents, if the commit is reachable from a
reference we know immediately that the entire pack is also reachable.
To help ensure this holds, the --create-cache flag looks for a commit
along refs/heads and refs/tags that is at least 1 month old, on the
assumption that a commit this old won't be rebased or pruned.
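
Concretely, that tip selection is a single rev-list query, mirroring
what the script below does; wrapped here as a helper whose name is
mine, not part of the patch:

```shell
# Pick the cache pack's tip: the newest commit on any branch or tag
# whose commit date is at least 1 month old (mirrors git-repack.sh).
pick_cache_tip() {
	git rev-list -n 1 --until=1.month.ago --branches --tags --
}
```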

During a clone request, if a commit is discovered that matches the
cache pack, all newer objects can be enumerated using the normal rules
and sent to the client, and then the cache pack can simply be appended
onto the end of the stream.  There is no need to enumerate its objects,
as the object count is in the header of the cache pack.  There is also
no need to allocate all of those objects in the pack-objects process,
which reduces its working set size and its impact on busy servers.
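
The count is readable without parsing the rest of the pack because a
pack starts with a fixed 12-byte header: the 4-byte magic "PACK", a
4-byte version, and a 4-byte big-endian object count.  A sketch using
only POSIX od/awk (the helper name is illustrative):

```shell
# Read the object count from a pack file's fixed 12-byte header:
# bytes 8-11 are a 32-bit big-endian count.
pack_object_count() {
	od -An -tu1 -j8 -N4 "$1" |
	awk '{ print $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 }'
}
```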

By keeping the pack with a standard .keep file, later repacks of the
repository won't include these objects, which permits disk usage to
stay within a reasonable factor of the repository size.

Because newer packed objects are not delta compressed against the
older cached pack, clients may receive a larger data transfer when the
cached pack is simply appended onto the stream.  pack-objects could
work around this by constructing a thin pack, and adding the cache
pack's tip commit as the uninteresting/common base for the thin pack.
The references for the newer objects will point to older data behind
them so they will automatically use the larger REF_DELTA format.
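
A sketch of that workaround using stock pack-objects flags; the helper
name is mine, and its argument stands for the cached pack's tip commit:

```shell
# Build a thin pack of everything newer than the cached pack's tip
# commit ($1).  "^$1" marks the tip as a common base, so newer objects
# may be stored as REF_DELTA against objects inside the cached pack.
thin_pack_on_top() {
	{ echo "^$1"; git rev-parse --branches --tags; } |
	git pack-objects --revs --thin --stdout
}
```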

This commit only adds the logic to git-repack to construct the cached
pack.  For example on a Linux kernel repository:

  # Construct the initial cache pack
  $ git repack --create-cache --cache-include=v2.6.11-tree

  # Remove duplicated objects
  $ git repack -a -d

If this is actually a good idea, pack-objects can later learn how to
use $GIT_DIR/objects/info/cached during revision traversal to know
when a cached pack is found, and switch to the thin pack + cached pack
transfer method described above.
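
For reference, the $GIT_DIR/objects/info/cached file written by the
script below uses one record per line: "+" lines name the commit tips
the cache covers, and "P" lines name the pack(s) holding them, e.g.:

  + <sha1 of the cache tip commit>
  + <sha1 of each --cache-include object>
  P <pack name printed by pack-objects>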

The cached pack is only useful for initial clones of a repository, and
only if object enumeration takes more than a few seconds.  However,
initial clones of big projects like linux-2.6.git are killing some
common mirror sites, so this could be one way to help them out.

Later fetch-pack/upload-pack protocol could learn how to more
intelligently use the cached pack in the data stream, allowing a
client whose connection has been broken to resume with a byte range
request within the cached pack, assuming the pack is still present on
the server.  This can be validated by giving the client both the SHA-1
pack name, and the SHA-1 trailer of the pack content, and requiring
these to match on a byte range request.
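
That trailer check is cheap to sketch, since every pack ends with a
20-byte SHA-1 of all bytes before it; a helper (name mine) using POSIX
tools plus sha1sum:

```shell
# Validate a pack's stored 20-byte SHA-1 trailer against a recomputed
# checksum of everything before it; a mismatch means the server's pack
# changed and a byte-range resume must restart from scratch.
pack_trailer_ok() {
	size=$(wc -c < "$1")
	computed=$(head -c $((size - 20)) "$1" | sha1sum | cut -d' ' -f1)
	stored=$(tail -c 20 "$1" | od -An -tx1 | tr -d ' \n')
	test "$computed" = "$stored"
}
```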

Repository owners may also enjoy having the cached pack, as frequent
`git gc` invocations will now have lower IO and CPU requirements due
to the large pack having a .keep file.  In the future, `git gc --auto`
could learn to suggest removing the .keep file and regenerating the
cached pack once there is enough new content to make creating a
new pack worthwhile.
---
 git-repack.sh |   57 +++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/git-repack.sh b/git-repack.sh
index 624feec..7a7984c 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -15,6 +15,9 @@ F               pass --no-reuse-object to git-pack-objects
 n               do not run git-update-server-info
 q,quiet         be quiet
 l               pass --local to git-pack-objects
+create-cache    create a cached pack for older history
+cache-include=  other objects to include in the cache
+cache-age=      how old to start caching from
  Packing constraints
 window=         size of the window used for delta compression
 window-memory=  same as the above, but limit memory size instead of entries count
@@ -26,6 +29,7 @@ SUBDIRECTORY_OK='Yes'
 
 no_update_info= all_into_one= remove_redundant= unpack_unreachable=
 local= no_reuse= extra=
+create_cache= cache_include= cache_age=1.month.ago
 while test $# != 0
 do
 	case "$1" in
@@ -38,6 +42,11 @@ do
 	-f)	no_reuse=--no-reuse-delta ;;
 	-F)	no_reuse=--no-reuse-object ;;
 	-l)	local=--local ;;
+	--create-cache) create_cache=t ;;
+	--cache-age) cache_age=$2; shift ;;
+	--cache-include)
+		name=$(git rev-parse --verify $2)
+		cache_include="$cache_include $name"; shift ;;
 	--max-pack-size|--window|--window-memory|--depth)
 		extra="$extra $1=$2"; shift ;;
 	--) shift; break;;
@@ -52,16 +61,19 @@ true)
 esac
 
 PACKDIR="$GIT_OBJECT_DIRECTORY/pack"
+INFODIR="$GIT_OBJECT_DIRECTORY/info"
 PACKTMP="$PACKDIR/.tmp-$$-pack"
 rm -f "$PACKTMP"-*
 trap 'rm -f "$PACKTMP"-*' 0 1 2 3 15
 
 # There will be more repacking strategies to come...
-case ",$all_into_one," in
-,,)
+case ",$create_cache,$all_into_one," in
+,t,,)
+	;;
+,,,)
 	args='--unpacked --incremental'
 	;;
-,t,)
+,,t,)
 	args= existing=
 	if [ -d "$PACKDIR" ]; then
 		for e in `cd "$PACKDIR" && find . -type f -name '*.pack' \
@@ -84,9 +96,22 @@ esac
 
 mkdir -p "$PACKDIR" || exit
 
-args="$args $local ${GIT_QUIET:+-q} $no_reuse$extra"
-names=$(git pack-objects --keep-true-parents --honor-pack-keep --non-empty --all --reflog $args </dev/null "$PACKTMP") ||
-	exit 1
+if [ -n "$create_cache" ]; then
+	root=$(git rev-list -n 1 --until=$cache_age --branches --tags --)
+	args="$args ${GIT_QUIET:+-q} $no_reuse$extra"
+	names=$( ( echo "$root";
+		       for name in $cache_include
+		       do
+		         echo "$name"
+		       done ) |
+		git pack-objects --keep-true-parents --non-empty $args --revs \
+		"$PACKTMP") ||
+		exit 1
+else
+	args="$args $local ${GIT_QUIET:+-q} $no_reuse$extra"
+	names=$(git pack-objects --keep-true-parents --honor-pack-keep --non-empty --all --reflog $args </dev/null "$PACKTMP") ||
+		exit 1
+fi
 if [ -z "$names" ]; then
 	say Nothing new to pack.
 fi
@@ -151,6 +176,10 @@ do
 	mv -f "$PACKTMP-$name.pack" "$PACKDIR/pack-$name.pack" &&
 	mv -f "$PACKTMP-$name.idx"  "$PACKDIR/pack-$name.idx" ||
 	exit
+
+	if [ -n "$create_cache" ]; then
+		echo "cache $root$cache_include" >"$PACKDIR/pack-$name.keep"
+	fi
 done
 
 # Remove the "old-" files
@@ -162,6 +191,22 @@ done
 
 # End of pack replacement.
 
+# Update the cache list
+if [ -n "$create_cache" ]; then
+	mkdir -p "$INFODIR" || exit
+	( echo "+ $root" &&
+	  for name in $cache_include
+	  do
+	    echo "+ $name"
+	  done
+	  for name in $names
+	  do
+	    echo "P $name"
+	  done ) >"$INFODIR/cached"
+	echo "Cached from:"
+	git log --pretty=format:'  [%h] %cd%n  %s' -1 "$root" --
+fi
+
 if test "$remove_redundant" = t
 then
 	# We know $existing are all redundant.
-- 
1.7.4.rc1.253.gb7420


* Re: [RFC] Add --create-cache to repack
  2011-01-28  8:06 [RFC] Add --create-cache to repack Shawn O. Pearce
@ 2011-01-28  9:08 ` Johannes Sixt
  2011-01-28 14:37   ` Shawn Pearce
From: Johannes Sixt @ 2011-01-28  9:08 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git, Junio C Hamano, Nicolas Pitre, John Hawley

On 1/28/2011 9:06, Shawn O. Pearce wrote:
> A cache pack is all objects reachable from a single commit that is
> part of the project's stable history and won't disappear, and is
> accessible to all readers of the repository.  By containing only that
> commit and its contents, if the commit is reached from a reference we
> know immediately that the entire pack is also reachable.  To help
> ensure this is true, the --create-cache flag looks for a commit along
> refs/heads and refs/tags that is at least 1 month old, working under
> the assumption that a commit this old won't be rebased or pruned.

In one of my repositories, I have two stable branches and a good score of
topic branches of various ages (a few hours up to two years 8). The topic
branches will either be dropped eventually, or rebased.

What are the odds that this choice of a tip commit picks one that is in a
topic branch? Or is there no point in using --create-cache in a repository
like this?

-- Hannes


* Re: [RFC] Add --create-cache to repack
  2011-01-28  9:08 ` Johannes Sixt
@ 2011-01-28 14:37   ` Shawn Pearce
  2011-01-28 15:33     ` Johannes Sixt
  2011-01-28 18:46     ` Nicolas Pitre
From: Shawn Pearce @ 2011-01-28 14:37 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git, Junio C Hamano, Nicolas Pitre, John Hawley

On Fri, Jan 28, 2011 at 01:08, Johannes Sixt <j.sixt@viscovery.net> wrote:
> On 1/28/2011 9:06, Shawn O. Pearce wrote:
>> A cache pack is all objects reachable from a single commit that is
>> part of the project's stable history and won't disappear, and is
>> accessible to all readers of the repository.  By containing only that
>> commit and its contents, if the commit is reached from a reference we
>> know immediately that the entire pack is also reachable.  To help
>> ensure this is true, the --create-cache flag looks for a commit along
>> refs/heads and refs/tags that is at least 1 month old, working under
>> the assumption that a commit this old won't be rebased or pruned.
>
> In one of my repositories, I have two stable branches and a good score of
> topic branches of various ages (a few hours up to two years 8). The topic
> branches will either be dropped eventually, or rebased.
>
> What are the odds that this choice of a tip commit picks one that is in a
> topic branch? Or is there no point in using --create-cache in a repository
> like this?

Argh, you are right.  It's quite likely this would pick a topic
branch... and that isn't really what is desired.

My original concept here was for distribution point repositories,
which are less likely to have these topic branches that will rebase
and disappear.  Though git.git has one called "pu".  *sigh*

A simple fix is to use --heads --tags by default like I do here, but
make the actual parameters we feed to rev-list configurable.  A
repository owner could select only the master branch as input to
rev-list, making it less likely the topic branches would be
considered.  Unfortunately that requires direct access to the
repository.  It fails for a site like GitHub, where you don't manage
the repository at all.

git.git is also problematic because of the man, html and todo
branches.  Branches that are disconnected from the main history but
are very small (e.g. todo) might be selected instead and create a
nearly useless cache file.  Fortunately, disconnected branches could
each have their own cache file (with only the inode overhead of an
additional 3 files per disconnected branch), and pack-objects could
concatenate all of those packs together when sending.  It's just a
challenge to identify these branches and keep them from being used
for the main project pack.


This started because I was looking for a way to speed up clones coming
from a JGit server.  Cloning the linux-2.6 repository is painful; it
takes a long time to enumerate the 1.8 million objects.  So I tried
adding a cached list of objects reachable from a given commit, which
speeds up the enumeration phase, but JGit still needs to allocate all
of the working set to track those objects, then go find them in packs
and slice out each compressed form and reformat the headers on the
wire.  It's a lot of redundant work when your kernel repository has
360MB of data that you know a client needs if they have asked for your
master branch with no "have" set.

Later I realized we can get rid of that cached list of objects and
just use the pack itself.  It's far cleaner, as there is no redundant
cache.  But either way (object list or pack) it's a bit of a challenge
to automatically identify the right starting points to use.  Linus
Torvalds' linux-2.6 repository is the perfect case for the RFC I
posted: it's one branch with all of the history, and it never rewinds.
But maybe Linus is just very unique in this world.  :-)

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-28 14:37   ` Shawn Pearce
@ 2011-01-28 15:33     ` Johannes Sixt
  2011-01-28 18:22       ` Shawn Pearce
  2011-01-28 19:15       ` Jay Soffian
  2011-01-28 18:46     ` Nicolas Pitre
From: Johannes Sixt @ 2011-01-28 15:33 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Junio C Hamano, Nicolas Pitre, John Hawley

On 1/28/2011 15:37, Shawn Pearce wrote:
> A simple fix is to use --heads --tags by default like I do here, but
> make the actual parameters we feed to rev-list configurable.  A
> repository owner could select only the master branch as input to
> rev-list, making it less likely the topic branches would be
> considered.  Unfortunately that requires direct access to the
> repository.  It fails for a site like GitHub, where you don't manage
> the repository at all.

Let's define a ref hierarchy, refs/cache-pack, that names the cache pack
tips. A cache pack would be generated for each ref found in that
hierarchy. Then these commits are under user control even on github,
because you can just push the refs. Junio would perhaps choose a release
tag, and corresponding commits in the man and html histories. The choice
would not be completely automatic, though.

-- Hannes


* Re: [RFC] Add --create-cache to repack
  2011-01-28 15:33     ` Johannes Sixt
@ 2011-01-28 18:22       ` Shawn Pearce
  2011-01-28 19:15       ` Jay Soffian
From: Shawn Pearce @ 2011-01-28 18:22 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git, Junio C Hamano, Nicolas Pitre, John Hawley

On Fri, Jan 28, 2011 at 07:33, Johannes Sixt <j.sixt@viscovery.net> wrote:
> On 1/28/2011 15:37, Shawn Pearce wrote:
>> A simple fix is to use --heads --tags by default like I do here, but
>> make the actual parameters we feed to rev-list configurable.  A
>> repository owner could select only the master branch as input to
>> rev-list, making it less likely the topic branches would be
>> considered.  Unfortunately that requires direct access to the
>> repository.  It fails for a site like GitHub, where you don't manage
>> the repository at all.
>
> Let's define a ref hierarchy, refs/cache-pack, that names the cache pack
> tips. A cache pack would be generated for each ref found in that
> hierarchy. Then these commits are under user control even on github,
> because you can just push the refs. Junio would perhaps choose a release
> tag, and corresponding commits in the man and html histories. The choice
> would not be completely automatic, though.

This is a good idea.  Perhaps we go slightly further and say:

  refs/cache-pack/name-without-slash

    This packs into its own pack file, as a single tip.

  refs/cache-pack/group/a
  refs/cache-pack/group/b

    These pack into a pack file together.

If you have direct repository access, you can also just make one of
these a symbolic reference to a branch, e.g. refs/heads/master, and
then periodic `git repack --create-cache` invocations would pick up
the latest point.
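
That last trick is just a symbolic ref; a hypothetical setup (the ref
name refs/cache-pack/main and the helper name are illustrative):

```shell
# Hypothetical: make refs/cache-pack/main a symbolic ref to the branch
# to track, so periodic `git repack --create-cache` runs follow its tip.
make_cache_pack_symref() {
	git symbolic-ref refs/cache-pack/main refs/heads/master &&
	git symbolic-ref refs/cache-pack/main   # prints refs/heads/master
}
```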

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-28 14:37   ` Shawn Pearce
  2011-01-28 15:33     ` Johannes Sixt
@ 2011-01-28 18:46     ` Nicolas Pitre
  2011-01-28 19:15       ` Shawn Pearce
From: Nicolas Pitre @ 2011-01-28 18:46 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, 28 Jan 2011, Shawn Pearce wrote:

> This started because I was looking for a way to speed up clones coming
> from a JGit server.  Cloning the linux-2.6 repository is painful, it
> takes a long time to enumerate the 1.8 million objects.  So I tried
> adding a cached list of objects reachable from a given commit, which
> speeds up the enumeration phase, but JGit still needs to allocate all
> of the working set to track those objects, then go find them in packs
> and slice out each compressed form and reformat the headers on the
> wire.  Its a lot of redundant work when your kernel repository has
> 360MB of data that you know a client needs if they have asked for your
> master branch with no "have" set.
> 
> Later I realized, we can get rid of that cached list of objects and
> just use the pack itself.  Its far cleaner, as there is no redundant
> cache.  But either way (object list or pack) its a bit of a challenge
> to automatically identify the right starting points to use.  Linus
> Torvalds' linux-2.6 repository is the perfect case for the RFC I
> posted, its one branch with all of the history, and it never rewinds.
> But maybe Linus is just very unique in this world.  :-)

Playing my old record again... I know.  But pack v4 should solve a big 
part of this enumeration cost.

I've changed the format slightly again in my WIP branch.  The idea is to:

1) Have a non compressed yet still really dense representation for tree 
   objects;

2) do the same thing for the first part of commit objects, and only 
   deflate the free form text part.

There is nothing new here.  However, it should be possible to:

3) replace all SHA1 references by an offset into the pack file directly, 
   just like we do for OFS_DELTA objects.  If the SHA1 is actually 
   needed, we can obtain it with a reverse lookup by object offset in 
   the pack index file, but in practice that is not required very often.

So walking the history graph and enumerating objects would require 
nothing more than simply following straight pointers in the pack data in 
99% of the cases.  No object decompression, no memory buffer 
allocation/deallocation to perform that decompression, no string parsing 
in the tree object case, etc. Only cross pack references would require a 
full SHA1 based lookup like we do now.

I still have to sit down and figure out the implications of this, 
especially with forward references, meaning that the offset might have 
to be an object index so as to allow for variable-length encoding, and 
also to make sure index-pack can reconstruct the pack index.  But that 
would only be an indirect lookup, which shouldn't be significantly 
costly.

So that's the idea.  Keep the exact same functionality as we have now, 
without any need for cache management, but making the data structure in 
a form that should improve object enumeration by some magnitude.


Nicolas


* Re: [RFC] Add --create-cache to repack
  2011-01-28 15:33     ` Johannes Sixt
  2011-01-28 18:22       ` Shawn Pearce
@ 2011-01-28 19:15       ` Jay Soffian
  2011-01-28 19:19         ` Shawn Pearce
From: Jay Soffian @ 2011-01-28 19:15 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Shawn Pearce, git, Junio C Hamano, Nicolas Pitre, John Hawley

On Fri, Jan 28, 2011 at 10:33 AM, Johannes Sixt <j.sixt@viscovery.net> wrote:
> Let's define a ref hierarchy, refs/cache-pack, that names the cache pack
> tips. A cache pack would be generated for each ref found in that
> hierarchy. Then these commits are under user control even on github,
> because you can just push the refs. Junio would perhaps choose a release
> tag, and corresponding commits in the man and html histories. The choice
> would not be completely automatic, though.

This is just for bare repos, right? Why not just use HEAD?

j.


* Re: [RFC] Add --create-cache to repack
  2011-01-28 18:46     ` Nicolas Pitre
@ 2011-01-28 19:15       ` Shawn Pearce
  2011-01-28 21:09         ` Nicolas Pitre
From: Shawn Pearce @ 2011-01-28 19:15 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Fri, 28 Jan 2011, Shawn Pearce wrote:
>
>> This started because I was looking for a way to speed up clones coming
>> from a JGit server.  Cloning the linux-2.6 repository is painful,
...
>> Later I realized, we can get rid of that cached list of objects and
>> just use the pack itself.
...
> Playing my old record again... I know.  But pack v4 should solve a big
> part of this enumeration cost.

I've said the same thing for years myself.  As much as it would be
nice to fix some of the decompression costs with pack v2/v3, v2/v3 is
very common in the wild, and a new pack encoding is going to be a
fairly complex thing to get added to C Git.  And pack v4 doesn't
eliminate the enumeration; it just makes it faster.

> So that's the idea.  Keep the exact same functionality as we have now,
> without any need for cache management, but making the data structure in
> a form that should improve object enumeration by some magnitude.

That's what I also liked about my --create-cache flag.  It's keeping
the same data we already have, in the same format we already have it
in.  We're just making a more explicit statement that everything in
some pack is about as tightly compressed as it ever will be for a
client, and it isn't going to change anytime soon.  Thus we might as
well tag it with .keep to prevent repack from mucking with it, and we
can take advantage of this to serve the pack to clients very fast.

Over breakfast this morning I made the point to Junio that with the
cached pack and a slight network protocol change (enabled by a
capability of course) we could stop using pkt-line framing when
sending the cached pack part of the stream, and just send the pack
directly down the socket.  That changes the clone of a 400 MB project
like linux-2.6 from being a lot of user space stuff, to just being a
sendfile() call for the bulk of the content.  I think we can just hand
off the major streaming to the kernel.  (Part of the protocol change
is we would need to use multiple SHA-1 checksums in the stream, so we
don't have to re-checksum the existing cached pack.)


I love the idea of some of the concepts in pack v4.  I really do.  But
this sounds a lot simpler to implement, and it lets us completely
eliminate a massive amount of server processing (even under pack v4
you still have object enumeration), in exchange for what might be a
few extra MBs on the wire to the client due to slightly less good
deltas and the use of REF_DELTA in the thin pack used for the most
recent objects.  I don't envision this being used on projects smaller
than git.git itself, if you can gc --aggressive the whole thing in a
minute the cached pack is probably pointless.  But if you have 400+
MB, you want that to be network bound, and have almost no CPU impact
on the server.

Plus we can safely do byte range requests for resumable clone within
the cached pack part of the stream.  And when pack v4 comes along, we
can use this same strategy for an equally large pack v4 pack.

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-28 19:15       ` Jay Soffian
@ 2011-01-28 19:19         ` Shawn Pearce
From: Shawn Pearce @ 2011-01-28 19:19 UTC (permalink / raw)
  To: Jay Soffian
  Cc: Johannes Sixt, git, Junio C Hamano, Nicolas Pitre, John Hawley

On Fri, Jan 28, 2011 at 11:15, Jay Soffian <jaysoffian@gmail.com> wrote:
> On Fri, Jan 28, 2011 at 10:33 AM, Johannes Sixt <j.sixt@viscovery.net> wrote:
>> Let's define a ref hierarchy, refs/cache-pack, that names the cache pack
>> tips. A cache pack would be generated for each ref found in that
>> hierarchy. Then these commits are under user control even on github,
>> because you can just push the refs. Junio would perhaps choose a release
>> tag, and corresponding commits in the man and html histories. The choice
>> would not be completely automatic, though.
>
> This is just for bare repos, right? Why not just use HEAD?

Even on a bare repository a user might rewind his/her HEAD frequently.
 Caching from today's HEAD might not be ideal if you are about to
rewrite the last 10 commits and push those again to the repository.
That's actually where the "1.month.ago" guess came from in the patch.
If we go back a little in history, the odds of a rewrite are reduced,
and we're more likely to be able to reuse this pack.

HEAD - X commits/X days might be a good approximation if there are no
refs/cache-pack *and* gc --auto notices there is "enough" content to
suggest creating a cached pack.  But I do like Johannes Sixt's
refs/cache-pack ref hierarchy as a way to configure this explicitly.

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-28 19:15       ` Shawn Pearce
@ 2011-01-28 21:09         ` Nicolas Pitre
  2011-01-29  1:32           ` Shawn Pearce
From: Nicolas Pitre @ 2011-01-28 21:09 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley


On Fri, 28 Jan 2011, Shawn Pearce wrote:

> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@fluxnic.net> wrote:
> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >
> >> This started because I was looking for a way to speed up clones coming
> >> from a JGit server.  Cloning the linux-2.6 repository is painful,
> ...
> >> Later I realized, we can get rid of that cached list of objects and
> >> just use the pack itself.
> ...
> > Playing my old record again... I know.  But pack v4 should solve a big
> > part of this enumeration cost.
> 
> I've said the same thing for years myself.  As much as it would be
> nice to fix some of the decompression costs with pack v2/v3, v2/v3 is
> very common in the wild, and a new pack encoding is going to be a
> fairly complex thing to get added to C Git.  And pack v4 doesn't
> eliminate the enumeration, it just makes it faster.

Well, you don't necessarily need pack v4 to be widely deployed for 
people to benefit from it.  If it is available on servers such as 
git.kernel.org then everybody will see their clone requests go faster.  
Same principle as for the cache packs.

And yes it doesn't eliminate the enumeration, but you can't eliminate it 
entirely either as many other operations do require object enumeration 
too, and those would be sped up as well.

But this is in fact orthogonal to the cache pack concept.

> That's what I also liked about my --create-cache flag.  Its keeping
> the same data we already have, in the same format we already have it
> in.  We're just making a more explicit statement that everything in
> some pack is about as tightly compressed as it ever would be for a
> client, and it isn't going to change anytime soon.  Thus we might as
> well tag it with .keep to prevent repack of mucking with it, and we
> can take advantage of this to serve the pack to clients very fast.

I do agree on that point.  And I like it too.  However, I'd prefer if 
the whole thing wasn't created "automatically".  It's probably best if 
the repository administrator decides explicitly what should go in such 
cached packs, choosing commit thresholds and branches according to 
actual purpose and usage.  Only a human can make that decision.

I'd also recommend _not_ using the ref namespace for that.  Let's not 
mix up branching/tagging with what is effectively a storage 
implementation issue.  Linking the ref namespace with the actual packs 
they refer to would be highly inelegant if the SHA1 of the pack has to 
be part of the ref name.  Instead, I'd suggest simply listing all the 
commit tips a cache pack contains in the .keep file directly.  That 
would also make it much easier to use with object alternates, as the 
alternate mechanism points to the object store of a foreign repo and 
not to its refs.

> Over breakfast this morning I made the point to Junio that with the
> cached pack and a slight network protocol change (enabled by a
> capability of course) we could stop using pkt-line framing when
> sending the cached pack part of the stream, and just send the pack
> directly down the socket.  That changes the clone of a 400 MB project
> like linux-2.6 from being a lot of user space stuff, to just being a
> sendfile() call for the bulk of the content.  I think we can just hand
> off the major streaming to the kernel. 

While this might look like a good idea in theory, did you actually 
profile it to see if that would make a noticeable difference?  The 
pkt-line framing allows for asynchronous messages to be sent over a 
sideband, which you wouldn't be able to do anymore until the full 
400 MB is received by the remote side.  Without concrete performance 
numbers I'm not convinced it is worth the maintenance cost of creating 
a deviation in the protocol like this.

> (Part of the protocol change
> is we would need to use multiple SHA-1 checksums in the stream, so we
> don't have to re-checksum the existing cached pack.)

?? I don't follow you here.

> I love the idea of some of the concepts in pack v4.  I really do.  But
> this sounds a lot simpler to implement, and it lets us completely
> eliminate a massive amount of server processing (even under pack v4
> you still have object enumeration), in exchange for what might be a
> few extra MBs on the wire to the client due to slightly less good
> deltas and the use of REF_DELTA in the thin pack used for the most
> recent objects.

I agree.  And what I personally like the most is the fact that this can 
be made transparent to clients using the existing network protocol 
unchanged.

> Plus we can safely do byte range requests for resumable clone within
> the cached pack part of the stream.

That part I'm not sure of.  We are still facing the same old issues 
here, as some mirrors might have the same commit edges for a cache pack 
but not necessarily the same packing result, etc.  So I'd keep that out 
of the picture for now.  The idea of being able to resume the transfer 
of a cache pack is good; however, I'd make it a totally separate 
service outside git-upload-pack, where the issue of validating and 
updating content on both sides can be done efficiently without 
impacting the upload-pack protocol.  There would be more than just the 
cache pack in play during a typical clone.

> And when pack v4 comes along, we
> can use this same strategy for an equally large pack v4 pack.

Absolutely.


Nicolas


* Re: [RFC] Add --create-cache to repack
  2011-01-28 21:09         ` Nicolas Pitre
@ 2011-01-29  1:32           ` Shawn Pearce
  2011-01-29  2:34             ` Shawn Pearce
                               ` (4 more replies)
From: Shawn Pearce @ 2011-01-29  1:32 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, Jan 28, 2011 at 13:09, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Fri, 28 Jan 2011, Shawn Pearce wrote:
>
>> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@fluxnic.net> wrote:
>> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
>> >
>> >> This started because I was looking for a way to speed up clones coming
>> >> from a JGit server.  Cloning the linux-2.6 repository is painful,

Well, scratch the idea in this thread.  I think.

I retested JGit vs. CGit on an identical linux-2.6 repository.  The
repository was fully packed, but had two pack files.  362M and 57M,
and was created by packing a 1 month old master, marking it .keep, and
then repacking -a -d to get most recent last month into another pack.
This results in some files that should be delta compressed together
being stored whole in the two packs (obviously).

The two implementations take about the same amount of time to generate
the clone: 3m28s / 3m22s for JGit, 3m23s for C Git.  The JGit-created
pack is actually smaller, 376.30 MiB vs. C Git's 380.59 MiB.  I point
out this data because improvements made to JGit may translate into
similar improvements to CGit, given how close they are in running time.

I fully implemented the reuse of a cached pack behind a thin pack, the
idea I was trying to describe in this thread.  It saved 1m7s off the
JGit running time, but increased the data transfer by 25 MiB.  I didn't
expect this much of an increase; I honestly expected the thin pack
portion to be, well, thinner.  The issue is the thin pack cannot delta
against all of the history; it is only delta compressing against the
tip of the cached pack.  So long-lived side branches that forked off an
older part of the history aren't delta compressing well, or at all,
and that is significantly bloating the thin pack.  (It's also why that
"newer" pack is 57M, but should be 14M if correctly combined with the
cached pack.)  If I were to consider all of the objects in the cached
pack as potential delta base candidates for the thin pack, the entire
benefit of the cached pack disappears.


Which leaves me with dropping this idea.  I started it because I was
actually looking for a way to speed up JGit.  But we're already
roughly on par with CGit performance.  Dropping 1m7s on a clone is
great, but not at the expense of a 6.5% larger network transfer.  For
most clients, 25 MiB of additional data transfer may cost much more
time than the 1m7s saved in server-side computation.

>> That's what I also liked about my --create-cache flag.
>
> I do agree on that point.   And I like it too.

I'm not sure I like it so much anymore.  :-)

The idea was half-baked, and came at the end of a long day, after
putting my cranky infant son down to sleep way past his normal bed
time.  I claim I was a sleep-deprived new parent who wasn't thinking
things through enough before writing an email to git@vger.

>> sendfile() call for the bulk of the content.  I think we can just hand
>> off the major streaming to the kernel.
>
> While this might look like a good idea in theory, did you actually
> profile it to see if that would make a noticeable difference?  The
> pkt-line framing allows for asynchronous messages to be sent over a
> sideband,

No, of course not.  The pkt-line framing is pretty low overhead, but
copying from a kernel buffer to userspace and back to a kernel buffer
sort of sucks for 400 MiB of data.  sendfile() on 400 MiB to a network
socket is much easier when it's all kernel space.  I figured, if it
already worked out to just dump the pack to the wire as-is, then we
probably should also try to go for broke and reduce the userspace
copying.  It might not matter to your desktop, but ask John Hawley
(CC'd) about kernel.org and the git traffic volume he is serving.
They are doing more than 1 million git:// requests per day now.
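For illustration, a zero-copy send of a pack byte range using Linux
sendfile(2), via Python's os.sendfile.  This is a hedged sketch of the
idea being discussed, not how any git server actually streams data,
and the helper name is invented:

```python
import os

def send_pack_range(sock, pack_path, offset, count):
    """Stream `count` bytes of a pack file, starting at `offset`,
    into a connected socket (hypothetical helper, Linux-only)."""
    with open(pack_path, "rb") as f:
        sent = 0
        while sent < count:
            # sendfile() moves bytes from the page cache straight to
            # the socket, avoiding the kernel->user->kernel copies a
            # read()/write() loop pays on hundreds of MiB of pack data.
            n = os.sendfile(sock.fileno(), f.fileno(), offset + sent,
                            count - sent)
            if n == 0:  # hit EOF in the pack before `count` bytes
                break
            sent += n
    return sent
```

The trade-off raised later in the thread still applies: git checksums
the outgoing stream, so the bytes must pass through userspace anyway
unless the trailer is precomputed for the cached portion.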

>> Plus we can safely do byte range requests for resumable clone within
>> the cached pack part of the stream.
>
> That part I'm not sure of.  We are still facing the same old issues
> here, as some mirrors might have the same commit edges for a cache pack
> but not necessarily the same packing result, etc.  So I'd keep that out
> of the picture for now.

I don't think it's that hard.  If we modify the transfer protocol to
allow the server to denote boundaries between packs, the server can
send the pack name (as in pack-$name.pack) and the pack SHA-1 trailer
to the client.  A client asking to resume a cached pack presents its
original want list, these two SHA-1s, and the byte offset it wants
to restart from.  The server validates that the want set is still
reachable, that the cached pack exists, and that the cached pack tips
are reachable from current refs.  If all of that is true, it validates
that the trailing SHA-1 in the pack matches what the client gave it.
If that matches, it should be OK to resume the transfer from where the
client asked.
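The validation steps above can be sketched as follows.  This is a toy
illustration of the proposed checks, not real upload-pack code; the
repository and pack lookup helpers are hypothetical stand-ins:

```python
def can_resume(repo, wants, pack_name, pack_trailer, offset):
    """Decide whether a broken clone may resume mid-pack.
    `repo` is a hypothetical object exposing the lookups named below."""
    # 1. Every object the client originally asked for must still be
    #    reachable from the current refs.
    if not all(repo.is_reachable(w) for w in wants):
        return False
    # 2. The cached pack named by the client must still exist on disk.
    pack = repo.find_pack(pack_name)
    if pack is None:
        return False
    # 3. Its tip commits must still be reachable, and the trailing
    #    SHA-1 must match what the client recorded, proving the bytes
    #    on disk are the same ones it partially downloaded.
    if not all(repo.is_reachable(t) for t in pack.tips):
        return False
    if pack.trailer_sha1 != pack_trailer:
        return False
    # 4. The requested restart offset must fall inside the pack.
    return 0 <= offset < pack.size
```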

Then it's up to the server administrators of a round-robin serving
cluster to ensure that the same cached pack is available on all nodes,
so that a resuming client is likely to have its request succeed.  This
isn't impossible.  If the server operator cares, they can keep the
prior cached pack for several weeks after creating a newer cached
pack, giving clients plenty of time to resume a broken clone.  Disk is
fairly inexpensive these days.

But it's perhaps pointless, see above.  :-)

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-29  1:32           ` Shawn Pearce
@ 2011-01-29  2:34             ` Shawn Pearce
  2011-01-30  8:05               ` Junio C Hamano
  2011-01-29  4:08             ` Nicolas Pitre
                               ` (3 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Shawn Pearce @ 2011-01-29  2:34 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
>
> Well, scratch the idea in this thread.  I think.
>
> I retested JGit vs. CGit on an identical linux-2.6 repository.  The
> repository was fully packed, but had two pack files.  362M and 57M,
> and was created by packing a 1 month old master, marking it .keep, and
> then repacking -a -d to get most recent last month into another pack.
> This results in some files that should be delta compressed together
> being stored whole in the two packs (obviously).
>
> The two implementations take the same amount of time to generate the
> clone.  3m28s / 3m22s for JGit, 3m23s for C Git.  The JGit created
> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB.

I just tried caching only the object list of what is reachable from a
particular commit.  The file starts with a small 20-byte header:

  4 byte magic
  4 byte version
  4 byte number of commits (C)
  4 byte number of trees (T)
  4 byte number of blobs (B)

Then come C commit SHA-1s, followed by T entries of tree SHA-1 plus a
4-byte path_hash, followed by B entries of blob SHA-1 plus a 4-byte
path_hash.  For any project the size is basically on par with the .idx
file in the version 1 index format, so ~41 MB for linux-2.6.  The file
is stored as $GIT_OBJECT_DIRECTORY/cache/$COMMIT_SHA1.list, and is
completely pack-independent.
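A minimal sketch of writing and reading that header with Python's
struct module.  The byte order, magic value, and version number are
assumptions for illustration; the post only specifies the five 4-byte
fields:

```python
import struct

# Assumed layout from the description above: 4-byte magic, 4-byte
# version, then counts of commits (C), trees (T) and blobs (B).
HEADER = struct.Struct(">4sIIII")
MAGIC = b"OLST"  # hypothetical magic; the post does not name one

def write_header(num_commits, num_trees, num_blobs):
    return HEADER.pack(MAGIC, 1, num_commits, num_trees, num_blobs)

def read_header(data):
    magic, version, c, t, b = HEADER.unpack_from(data, 0)
    if magic != MAGIC or version != 1:
        raise ValueError("not a cached object list")
    return c, t, b
```

After the header come the C 20-byte commit SHA-1s, then the T tree and
B blob entries of 20-byte SHA-1 plus 4-byte path hash, which is why
the whole file lands near the size of a v1 .idx.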

Using this for object enumeration shaves almost 1 minute off server
packing time; the clone dropped from 3m28s to 2m29s.  That is close to
what I was getting with the cached pack idea, but the network transfer
stayed at the small 376 MiB.  I think this supports your pack v4
work... if we can make object enumeration this simple (scan down a
list of objects with their types declared inline, or implied by
location), we can cut a full minute of CPU time off the server side.

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-29  1:32           ` Shawn Pearce
  2011-01-29  2:34             ` Shawn Pearce
@ 2011-01-29  4:08             ` Nicolas Pitre
  2011-01-29  4:35               ` Shawn Pearce
  2011-01-30  6:51             ` Junio C Hamano
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Nicolas Pitre @ 2011-01-29  4:08 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley


On Fri, 28 Jan 2011, Shawn Pearce wrote:

> On Fri, Jan 28, 2011 at 13:09, Nicolas Pitre <nico@fluxnic.net> wrote:
> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >
> >> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@fluxnic.net> wrote:
> >> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >> >
> >> >> This started because I was looking for a way to speed up clones coming
> >> >> from a JGit server.  Cloning the linux-2.6 repository is painful,
> 
> Well, scratch the idea in this thread.  I think.
> 
> I retested JGit vs. CGit on an identical linux-2.6 repository.  The
> repository was fully packed, but had two pack files.  362M and 57M,
> and was created by packing a 1 month old master, marking it .keep, and
> then repacking -a -d to get most recent last month into another pack.
> This results in some files that should be delta compressed together
> being stored whole in the two packs (obviously).
> 
> The two implementations take the same amount of time to generate the
> clone.  3m28s / 3m22s for JGit, 3m23s for C Git.  The JGit created
> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB.  I point
> out this data because improvements made to JGit may show similar
> improvements to CGit given how close they are in running time.

What are those improvements?

Now, the fact that JGit is so close to CGit must mean the actual cost
is outside of both, such as within zlib; otherwise the C code should
normally always be faster, right?

Looking at the profile for "git rev-list --objects --all > /dev/null" 
for the object enumeration phase, we have:

# Samples: 1814637
#
# Overhead          Command  Shared Object  Symbol
# ........  ...............  .............  ......
#
    28.81%              git  /home/nico/bin/git  [.] lookup_object
    12.21%              git  /lib64/libz.so.1.2.3  [.] inflate
    10.49%              git  /lib64/libz.so.1.2.3  [.] inflate_fast
     7.47%              git  /lib64/libz.so.1.2.3  [.] inflate_table
     6.66%              git  /lib64/libc-2.11.2.so  [.] __GI_memcpy
     5.66%              git  /home/nico/bin/git  [.] find_pack_entry_one
     2.98%              git  /home/nico/bin/git  [.] decode_tree_entry
     2.73%              git  /lib64/libc-2.11.2.so  [.] _int_malloc
     2.71%              git  /lib64/libz.so.1.2.3  [.] adler32
     2.63%              git  /home/nico/bin/git  [.] process_tree
     1.58%              git  [kernel]       [k] 0xffffffff8112fc0c
     1.44%              git  /lib64/libc-2.11.2.so  [.] __strlen_sse2
     1.31%              git  /home/nico/bin/git  [.] tree_entry
     1.10%              git  /lib64/libc-2.11.2.so  [.] _int_free
     0.96%              git  /home/nico/bin/git  [.] patch_delta
     0.92%              git  /lib64/libc-2.11.2.so  [.] malloc_consolidate
     0.86%              git  /lib64/libc-2.11.2.so  [.] __GI_vfprintf
     0.80%              git  /home/nico/bin/git  [.] create_object
     0.80%              git  /home/nico/bin/git  [.] lookup_blob
     0.63%              git  /home/nico/bin/git  [.] update_tree_entry
[...]

So we've got lookup_object() clearly at the top.  I suspect the 
hashcmp() in there, which probably gets inlined, is responsible for most 
of the cycles.  There is certainly a better way here, and probably in 
JGit you rely on some optimized facility provided by the 
language/library to perform that lookup.  So there are probably some 
easy improvements that can be made here.

Otherwise, at least 12.21 + 10.49 + 7.47 + 2.71 = 32.88% is spent 
directly in the zlib code, making it the biggest cost.  This is rather 
unavoidable unless the data structure is changed.  And pack v4 would 
probably move things such as find_pack_entry_one, decode_tree_entry, 
process_tree and tree_entry off the radar as well.

The object writeout phase should pretty much be network bound.

> I fully implemented the reuse of a cached pack behind a thin pack idea
> I was trying to describe in this thread.  It saved 1m7s off the JGit
> running time, but increased the data transfer by 25 MiB.  I didn't
> expect this much of an increase, I honestly expected the thin pack
> portion to be well, thinner.  The issue is the thin pack cannot delta
> against all of the history, its only delta compressing against the tip
> of the cached pack.  So long-lived side branches that forked off an
> older part of the history aren't delta compressing well, or at all,
> and that is significantly bloating the thin pack.  (Its also why that
> "newer" pack is 57M, but should be 14M if correctly combined with the
> cached pack.)  If I were to consider all of the objects in the cached
> pack as potential delta base candidates for the thin pack, the entire
> benefit of the cached pack disappears.

Yeah... this sucks.

> I'm not sure I like it so much anymore.  :-)
> 
> The idea was half-baked, and came at the end of a long day, and after
> putting my cranky infant son down to sleep way past his normal bed
> time.  I claim I was a sleep deprived new parent who wasn't thinking
> things through enough before writing an email to git@vger.

Well, this is still valuable information to archive.

And I wish I had been able to still write such quality emails when I was 
a new parent.  ;-)

> >> sendfile() call for the bulk of the content.  I think we can just hand
> >> off the major streaming to the kernel.
> >
> > While this might look like a good idea in theory, did you actually
> > profile it to see if that would make a noticeable difference?  The
> > pkt-line framing allows for asynchronous messages to be sent over a
> > sideband,
> 
> No, of course not.  The pkt-line framing is pretty low overhead, but
> copying kernel buffer to userspace back to kernel buffer sort of sucks
> for 400 MiB of data.  sendfile() on 400 MiB to a network socket is
> much easier when its all kernel space.

Of course.  But still... if you save 0.5 second by avoiding the copy to 
and from user space of that 400 MiB (based on my machine, which can do 
1670 MB/s), that's pretty much insignificant compared to the total time 
for the clone, and therefore the wrong thing to optimize given the 
required protocol changes.

> I figured, if it all worked
> out already to just dump the pack to the wire as-is, then we probably
> should also try to go for broke and reduce the userspace copying.  It
> might not matter to your desktop, but ask John Hawley (CC'd) about
> kernel.org and the git traffic volume he is serving.  They are doing
> more than 1 million git:// requests per day now.

Impressive.  However, I suspect that the vast majority of those requests 
are from clients making a connection just to realize they're already up 
to date.  I don't think the user space copying is really a problem.

Of course, if we could have used sendfile() freely in, say, 
copy_pack_data(), then we would have done so long ago.  But we are 
checksumming on the fly the data we create together with the data we 
reuse from disk, so this is not necessarily a gain.

> >> Plus we can safely do byte range requests for resumable clone within
> >> the cached pack part of the stream.
> >
> > That part I'm not sure of.  We are still facing the same old issues
> > here, as some mirrors might have the same commit edges for a cache pack
> > but not necessarily the same packing result, etc.  So I'd keep that out
> > of the picture for now.
> 
> I don't think its that hard.  If we modify the transfer protocol to
> allow the server to denote boundaries between packs, the server can
> send the pack name (as in pack-$name.pack) and the pack SHA-1 trailer
> to the client.  A client asking for resume of a cached pack presents
> its original want list, these two SHA-1s, and the byte offset he wants
> to restart from.  The server validates the want set is still
> reachable, that the cached pack exists, and that the cached pack tips
> are reachable from current refs.  If all of that is true, it validates
> the trailing SHA-1 in the pack matches what the client gave it.  If
> that matches, it should be OK to resume transfer from where the client
> asked for.

This is still a half solution.  If your network connection drops after 
the first 52 MiB of transfer in the scenario you provided, then you're 
still screwed.


Nicolas


* Re: [RFC] Add --create-cache to repack
  2011-01-29  4:08             ` Nicolas Pitre
@ 2011-01-29  4:35               ` Shawn Pearce
  0 siblings, 0 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-29  4:35 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, Jan 28, 2011 at 20:08, Nicolas Pitre <nico@fluxnic.net> wrote:
>> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB.  I point
>> out this data because improvements made to JGit may show similar
>> improvements to CGit given how close they are in running time.
>
> What are those improvements?

None right now.  JGit is similar to CGit algorithm-wise.  (Actually it
looks like JGit has a faster diff implementation, but that's a
different email.)

If you are asking about why JGit created a slightly smaller pack
file... it splits the delta window during the threaded delta search
differently than CGit does, and we align our blocks slightly
differently when comparing two objects to generate a delta sequence
for them.  These two variations mean JGit produces different deltas
than CGit does.  Sometimes we are smaller, sometimes we are larger.
But it's a small difference, on the order of 1-4 MiB for something like
linux-2.6.  I don't think it's worthwhile trying to analyze the
specific differences between the implementations and retrofit them
into the other one.

What I was trying to say was, _if_ we made a change to JGit and it
dropped the running time, the same change in CGit should see _at
least_ the same running time improvement, if not better.  I was
pointing out that this cached-pack change dropped the running time by
1 minute, so CGit should also see a similar improvement (if not
better).  I would prefer to test against CGit for this sort of thing,
but it's been too long since I last poked pack-objects.c and the
revision code in CGit, while the JGit equivalents are really fresh in
my head.

> Now, the fact that JGit is so close to CGit must be because the actual
> cost is outside of them such as within zlib, otherwise the C code should
> normally always be faster, right?

Yup, I mostly agree with this statement.  CGit does a lot of
malloc/free activity when reading objects in.  JGit does too, but we
often fit into the young generation of the GC, which can sometimes be
faster at cleaning and recycling memory.  We're not too far off from C
code.

But yes... our profile looks like this too:

> Looking at the profile for "git rev-list --objects --all > /dev/null"
> for the object enumeration phase, we have:
>
> # Samples: 1814637
> #
> # Overhead          Command  Shared Object  Symbol
> # ........  ...............  .............  ......
> #
>    28.81%              git  /home/nico/bin/git  [.] lookup_object
>    12.21%              git  /lib64/libz.so.1.2.3  [.] inflate
>    10.49%              git  /lib64/libz.so.1.2.3  [.] inflate_fast
>     7.47%              git  /lib64/libz.so.1.2.3  [.] inflate_table
>     6.66%              git  /lib64/libc-2.11.2.so  [.] __GI_memcpy
>     5.66%              git  /home/nico/bin/git  [.] find_pack_entry_one
>     2.98%              git  /home/nico/bin/git  [.] decode_tree_entry
> [...]
>
> So we've got lookup_object() clearly at the top.

Isn't this the hash table lookup inside the revision pool, to see if
the object has already been visited?  That seems horrible; 28% of the
CPU is going to probing that table.

>  I suspect the
> hashcmp() in there, which probably gets inlined, is responsible for most
> cycles.

Probably true.  I know our hashcmp() is inlined; it's actually written
by hand as five word compares, and is marked final, so the JIT is
rather likely to inline it.
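The hand-written compare is roughly the following, assuming the
20-byte SHA-1 is held as five 32-bit words.  This is a hedged sketch
of the approach in Python, not the actual JGit code:

```python
def to_words(sha1):
    # Split a 20-byte SHA-1 into five 32-bit big-endian words, the
    # representation compared instead of twenty individual bytes.
    return tuple(int.from_bytes(sha1[i:i + 4], "big")
                 for i in range(0, 20, 4))

def hash_eq(a, b):
    # Five word compares instead of a generic loop; in Java this is
    # a small final method the JIT readily inlines.
    return (a[0] == b[0] and a[1] == b[1] and a[2] == b[2]
            and a[3] == b[3] and a[4] == b[4])
```

The early-out on the first mismatching word is what makes this cheap
in a hash table full of mostly-distinct SHA-1s.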

>  There is certainly a better way here, and probably in JGit you
> rely on some optimized facility provided by the language/library to
> perform that lookup.  So there is probably some easy improvements that
> can be made here.

Nope.  Actually we have to bend over backwards and work against the
language to get anything even reasonably sane for performance.  Our
"solution" in JGit has actually been used by Rob Pike to promote his
Go programming language and to argue why Java sucks as a language.
It's a great quote of mine that someone dragged up off the git@vger
mailing list and started using to promote Go.

At least once a week I envy how easy it is to use hashcmp() and
hashcpy() inside of CGit.  JGit's management of hashes is sh*t because
we have to bend so hard around the language.

> Otherwise it is at least 12.21 + 10.49 + 7.47 + 2.71 = 32.88% spent
> directly in the zlib code, making it the biggest cost.

Yeah, that's what we have too, about 33% inside of zlib code... which
is the same implementation that CGit uses.

>  This is rather
> unavoidable unless the data structure is changed.

We already knew this from our pack v4 experiments years ago.

>  And pack v4 would
> probably move things such as find_pack_entry_one, decode_tree_entry,
> process_tree and tree_entry off the radar as well.

This is hard to do inside of CGit if I recall... but yes, changing the
way trees are handled would really improve things.

> The object writeout phase should pretty much be network bound.

Yes.

>> I fully implemented the reuse of a cached pack behind a thin pack idea
>> I was trying to describe in this thread.  It saved 1m7s off the JGit
>> running time, but increased the data transfer by 25 MiB.
>
> Yeah... this sucks.

Very much.  :-(

But this is a fundamental issue with our incremental fetch support
anyway.  In this exact case, if the client was at that 1 month old
commit and fetched current master, he would pull 25 MiB of data...
but he only needed about 4-6 MiB worth of deltas if it was properly
delta compressed against the content we know he already has.  Our
server side optimization of only pushing the client's immediate
"have" list into the delta search window limits how much we can
compress the data we are sending.  If we were willing to push more in
on the server side, we could shrink the incremental fetch more.  But
that's a CPU problem on the server.

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-29  1:32           ` Shawn Pearce
  2011-01-29  2:34             ` Shawn Pearce
  2011-01-29  4:08             ` Nicolas Pitre
@ 2011-01-30  6:51             ` Junio C Hamano
  2011-01-30 17:14               ` Nicolas Pitre
  2011-01-30 19:29               ` Shawn Pearce
  2011-01-30 22:13             ` Shawn Pearce
  2011-01-31 18:47             ` Shawn Pearce
  4 siblings, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2011-01-30  6:51 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley

Shawn Pearce <spearce@spearce.org> writes:

> I fully implemented the reuse of a cached pack behind a thin pack idea
> I was trying to describe in this thread.  It saved 1m7s off the JGit
> running time, but increased the data transfer by 25 MiB.  I didn't
> expect this much of an increase, I honestly expected the thin pack
> portion to be well, thinner.  The issue is the thin pack cannot delta
> against all of the history, its only delta compressing against the tip
> of the cached pack.  So long-lived side branches that forked off an
> older part of the history aren't delta compressing well, or at all,
> and that is significantly bloating the thin pack.  (Its also why that
> "newer" pack is 57M, but should be 14M if correctly combined with the
> cached pack.)  If I were to consider all of the objects in the cached
> pack as potential delta base candidates for the thin pack, the entire
> benefit of the cached pack disappears.

What if you instead use the cached pack this way?

 0. You perform the proposed pre-traversal until you hit the tip of cached
    pack(s), and realize that you will end up sending everything.

 1. Instead of sending the new part of the history first and then sending
    the cached pack(s), you send the contents of cached pack(s), but also
    note what objects you sent;

 2. Then you send the new part of the history, taking full advantage of
    what you have already sent, perhaps doing only half of the reuse-delta
    logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
    object that is not a delta in an existing pack).
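The three-step flow can be sketched on toy objects.  This is a hedged
illustration only: `make_delta` stands in for a real delta encoder,
objects are plain byte strings, and real pack-objects works with
object ids and a bounded delta window rather than whole contents:

```python
def make_delta(base, target):
    # Toy "delta": drop the longest common prefix.  A stand-in for a
    # real delta encoder, which this sketch does not implement.
    i = 0
    while i < min(len(base), len(target)) and base[i] == target[i]:
        i += 1
    return target[i:]

def send_clone(cached_pack, new_objects):
    """cached_pack / new_objects: dicts of id -> bytes, new history
    last.  Returns the stream as a list of records."""
    stream, sent = [], {}
    # Steps 0-1: ship the cached pack verbatim, noting every object
    # that went over the wire.
    for oid, data in cached_pack.items():
        stream.append(("full", oid, data))
        sent[oid] = data
    # Step 2: send the new history, free to delta against *anything*
    # already sent, not just the cached pack's tip commit.
    for oid, data in new_objects.items():
        base = min(sent, key=lambda b: len(make_delta(sent[b], data)),
                   default=None)
        delta = make_delta(sent[base], data) if base else data
        if base is not None and len(delta) < len(data):
            stream.append(("delta", oid, base, delta))
        else:
            stream.append(("full", oid, data))
        sent[oid] = data
    return stream
```

Note that picking the best base by scanning everything already sent is
exactly the CPU cost concern raised in the replies; a real
implementation would have to bound the candidate set.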


* Re: [RFC] Add --create-cache to repack
  2011-01-29  2:34             ` Shawn Pearce
@ 2011-01-30  8:05               ` Junio C Hamano
  2011-01-30 19:43                 ` Shawn Pearce
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-01-30  8:05 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley

Shawn Pearce <spearce@spearce.org> writes:

> Using this for object enumeration shaves almost 1 minute off server
> packing time; the clone dropped from 3m28s to 2m29s.  That is close to
> what I was getting with the cached pack idea, but the network transfer
> stayed the small 376 MiB.

I like this result.

The amount of transfer being that small was something I didn't quite
expect, though.  Doesn't it indicate that our pathname-based object
clustering heuristic is not as effective as we hoped?


* Re: [RFC] Add --create-cache to repack
  2011-01-30  6:51             ` Junio C Hamano
@ 2011-01-30 17:14               ` Nicolas Pitre
  2011-01-30 17:41                 ` A Large Angry SCM
  2011-01-30 19:29               ` Shawn Pearce
  1 sibling, 1 reply; 26+ messages in thread
From: Nicolas Pitre @ 2011-01-30 17:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn Pearce, Johannes Sixt, git, John Hawley

On Sat, 29 Jan 2011, Junio C Hamano wrote:

> Shawn Pearce <spearce@spearce.org> writes:
> 
> > I fully implemented the reuse of a cached pack behind a thin pack idea
> > I was trying to describe in this thread.  It saved 1m7s off the JGit
> > running time, but increased the data transfer by 25 MiB.  I didn't
> > expect this much of an increase, I honestly expected the thin pack
> > portion to be well, thinner.  The issue is the thin pack cannot delta
> > against all of the history, its only delta compressing against the tip
> > of the cached pack.  So long-lived side branches that forked off an
> > older part of the history aren't delta compressing well, or at all,
> > and that is significantly bloating the thin pack.  (Its also why that
> > "newer" pack is 57M, but should be 14M if correctly combined with the
> > cached pack.)  If I were to consider all of the objects in the cached
> > pack as potential delta base candidates for the thin pack, the entire
> > benefit of the cached pack disappears.
> 
> What if you instead use the cached pack this way?
> 
>  0. You perform the proposed pre-traversal until you hit the tip of cached
>     pack(s), and realize that you will end up sending everything.
> 
>  1. Instead of sending the new part of the history first and then sending
>     the cached pack(s), you send the contents of cached pack(s), but also
>     note what objects you sent;
> 
>  2. Then you send the new part of the history, taking full advantage of
>     what you have already sent, perhaps doing only half of the reuse-delta
>     logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
>     object that is not a delta in an existing pack).

The problem is determining the best base object to delta against.  If 
you end up listing all the already sent objects and performing delta 
attempts against them for the remaining non-delta objects to find the 
best match, then you might end up taking more CPU time than the current 
enumeration phase.


Nicolas


* Re: [RFC] Add --create-cache to repack
  2011-01-30 17:14               ` Nicolas Pitre
@ 2011-01-30 17:41                 ` A Large Angry SCM
  0 siblings, 0 replies; 26+ messages in thread
From: A Large Angry SCM @ 2011-01-30 17:41 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, Shawn Pearce, Johannes Sixt, git, John Hawley

On 01/30/2011 12:14 PM, Nicolas Pitre wrote:
> On Sat, 29 Jan 2011, Junio C Hamano wrote:
>
>> Shawn Pearce<spearce@spearce.org>  writes:
>>
>>> I fully implemented the reuse of a cached pack behind a thin pack idea
>>> I was trying to describe in this thread.  It saved 1m7s off the JGit
>>> running time, but increased the data transfer by 25 MiB.  I didn't
>>> expect this much of an increase, I honestly expected the thin pack
>>> portion to be well, thinner.  The issue is the thin pack cannot delta
>>> against all of the history, its only delta compressing against the tip
>>> of the cached pack.  So long-lived side branches that forked off an
>>> older part of the history aren't delta compressing well, or at all,
>>> and that is significantly bloating the thin pack.  (Its also why that
>>> "newer" pack is 57M, but should be 14M if correctly combined with the
>>> cached pack.)  If I were to consider all of the objects in the cached
>>> pack as potential delta base candidates for the thin pack, the entire
>>> benefit of the cached pack disappears.
>>
>> What if you instead use the cached pack this way?
>>
>>   0. You perform the proposed pre-traversal until you hit the tip of cached
>>      pack(s), and realize that you will end up sending everything.
>>
>>   1. Instead of sending the new part of the history first and then sending
>>      the cached pack(s), you send the contents of cached pack(s), but also
>>      note what objects you sent;
>>
>>   2. Then you send the new part of the history, taking full advantage of
>>      what you have already sent, perhaps doing only half of the reuse-delta
>>      logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
>>      object that is not a delta in an existing pack).
>
> The problem is to determine the best base object to delta against.  If
> you end up listing all the already sent objects and perform delta
> attempts against them for the remaining non delta objects to find the
> best match then you might end up taking more CPU time than the current
> enumeration phase.

Why worry about the best here? Just add the object (or one of the 
objects) with the same path from the commit you found in step 0, above, 
to the delta base search for each object to pack.


* Re: [RFC] Add --create-cache to repack
  2011-01-30  6:51             ` Junio C Hamano
  2011-01-30 17:14               ` Nicolas Pitre
@ 2011-01-30 19:29               ` Shawn Pearce
  1 sibling, 0 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-30 19:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley

On Sat, Jan 29, 2011 at 22:51, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> I fully implemented the reuse of a cached pack behind a thin pack idea
>> I was trying to describe in this thread.  It saved 1m7s off the JGit
>> running time, but increased the data transfer by 25 MiB.  I didn't
>> expect this much of an increase, I honestly expected the thin pack
>> portion to be well, thinner.  The issue is the thin pack cannot delta
>> against all of the history, its only delta compressing against the tip
>> of the cached pack.  So long-lived side branches that forked off an
>> older part of the history aren't delta compressing well, or at all,
>> and that is significantly bloating the thin pack.  (Its also why that
>> "newer" pack is 57M, but should be 14M if correctly combined with the
>> cached pack.)  If I were to consider all of the objects in the cached
>> pack as potential delta base candidates for the thin pack, the entire
>> benefit of the cached pack disappears.
>
> What if you instead use the cached pack this way?
>
>  0. You perform the proposed pre-traversal until you hit the tip of cached
>    pack(s), and realize that you will end up sending everything.
>
>  1. Instead of sending the new part of the history first and then sending
>    the cached pack(s), you send the contents of cached pack(s), but also
>    note what objects you sent;

This is the part I was trying to avoid.  Making this list of objects
from the cached pack(s) costs working set inside of the pack-objects
process.  I had hoped that the cached packs would let me skip this
step.

But let's say that's an acceptable cost.  We cannot efficiently make a
useful list of objects from the pack.  Scanning the .idx file only
tells us the SHA-1.  It does not tell us the type, nor does it tell us
what the path hash code would be for the object if it were a tree or
blob.  So we cannot efficiently use this pack listing to construct the
delta window.
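To make that concrete: a version-2 pack .idx records, per object, only its name, CRC, and pack offset.  A rough, purely illustrative Python sketch (the real file also ends with pack/idx SHA-1 trailers and has a 64-bit offset table for huge packs, both omitted here):

```python
import struct

def build_idx_v2(shas):
    """Build a minimal synthetic pack .idx (version 2) over sorted SHA-1s.
    Simplified: trailer checksums and large-offset table are omitted."""
    fanout = [0] * 256
    for sha in shas:
        fanout[sha[0]] += 1
    for i in range(1, 256):
        fanout[i] += fanout[i - 1]          # cumulative counts by first byte
    nr = len(shas)
    body = b'\xfftOc' + struct.pack('>I', 2)
    body += struct.pack('>256I', *fanout)
    body += b''.join(shas)                  # 20-byte object names, sorted
    body += struct.pack('>%dI' % nr, *([0] * nr))          # CRC32 table
    body += struct.pack('>%dI' % nr, *range(12, 12 + nr))  # pack offsets
    return body

def idx_object_names(idx):
    """Everything a .idx tells us per object is its name (and offset).
    There is no type code and no path hash here, which is why scanning
    the index alone cannot seed a useful delta search window."""
    assert idx[:4] == b'\xfftOc'
    assert struct.unpack('>I', idx[4:8])[0] == 2
    fanout = struct.unpack('>256I', idx[8:8 + 1024])
    nr = fanout[255]
    names = idx[8 + 1024:8 + 1024 + 20 * nr]
    return [names[i * 20:(i + 1) * 20] for i in range(nr)]
```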

>  2. Then you send the new part of the history, taking full advantage of
>    what you have already sent, perhaps doing only half of the reuse-delta
>    logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
>    object that is not a delta in an existing pack).

Well, I guess we could go half-way.  We could try to use only
non-delta objects from the cached pack as potential delta bases for
this delta search.

To do that we would build the reverse index for the cached pack, then
check each object's type code just before we send that part of the
cached pack.  If it's non-delta, we can get its SHA-1 from the reverse
index, toss the object into the delta search list, and copy out the
length of the object until the next object starts.
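That reverse index is just the forward .idx resorted by offset; sketched in illustrative Python (here pack_size stands in for the pack length minus its 20-byte trailer):

```python
def build_reverse_index(entries, pack_size):
    """entries: (sha, offset) pairs as read from the forward .idx.
    Sorting by offset tells us where each object ends (the next one's
    start), so we can copy its raw bytes straight out of the cached
    pack and know its stored length."""
    by_offset = sorted(entries, key=lambda e: e[1])
    rindex = []
    for i, (sha, off) in enumerate(by_offset):
        end = by_offset[i + 1][1] if i + 1 < len(by_offset) else pack_size
        rindex.append((off, end - off, sha))
    return rindex
```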

However... I suspect our delta results would be the same as the thin
pack before cached pack test I did earlier.  The objects that are
non-delta in the cached pack are (in theory) approximately the objects
immediately reachable from the cached pack's tip.  That was already
put into the delta window as the base candidates for the thin pack.
This may be a faster way to find that thin pack edge, but the data
transfer will still be sub-optimal because we cannot consider deltas
as bases.

-- 
Shawn.

* Re: [RFC] Add --create-cache to repack
  2011-01-30  8:05               ` Junio C Hamano
@ 2011-01-30 19:43                 ` Shawn Pearce
  2011-01-30 20:02                   ` Junio C Hamano
  2011-01-30 22:26                   ` Nicolas Pitre
  0 siblings, 2 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-30 19:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley

On Sun, Jan 30, 2011 at 00:05, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> Using this for object enumeration shaves almost 1 minute off server
>> packing time; the clone dropped from 3m28s to 2m29s.  That is close to
>> what I was getting with the cached pack idea, but the network transfer
>> stayed the small 376 MiB.
>
> I like this result.

I'm really leaning towards putting this cached object list into JGit.

I need to shave that 1 minute off server CPU time. I can afford the 41
MiB disk (and kernel buffer cache), but I cannot really continue to
pay the 1 minute of CPU on each clone request for large repositories.
The object list of what is reachable from commit X isn't ever going to
change, and the path hash function is reasonably stable.  With a
version code in the file we can desupport old files if the path hash
function changes.  10% more disk/kernel memory is cheap for some of my
servers compared to 1 minute of CPU, even counting the explicit cache
management the server administrator must do to construct the file.
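A minimal sketch of such a versioned cache file (the magic, version field, and record layout here are invented for illustration, not an actual format):

```python
import struct

MAGIC = b'OBJCACHE'   # hypothetical magic for the cached object list
VERSION = 1           # bump whenever the path hash function changes

def write_cache(path, entries):
    """entries: (sha1_bytes, type_code, path_hash32) triples."""
    with open(path, 'wb') as f:
        f.write(MAGIC + struct.pack('>II', VERSION, len(entries)))
        for sha, tcode, phash in entries:
            f.write(sha + struct.pack('>BI', tcode, phash))

def read_cache(path):
    """Return the cached list, or None to desupport a stale version."""
    with open(path, 'rb') as f:
        hdr = f.read(16)
        if hdr[:8] != MAGIC:
            raise ValueError('not a cache file')
        version, nr = struct.unpack('>II', hdr[8:16])
        if version != VERSION:
            return None            # fall back to a full object walk
        recs = [f.read(25) for _ in range(nr)]
    return [(r[:20],) + struct.unpack('>BI', r[20:25]) for r in recs]
```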

> The amount of transfer being that small was something I didn't quite
> expect, though.  Doesn't it indicate that our pathname based object
> clustering heuristics is not as effective as we hoped?

I'm not sure I follow your question.

I think the problem here is old side branches that got recently
merged.  Their _best_ delta base was some old revision, possibly close
to where they branched off from.  Using a newer version of the file
for the delta base created a much larger delta.  E.g. consider a file
where in more recent revisions a function was completely rewritten.
If you have to delta compress against that new version, but you use
the older definition of the function, you need to use insert
instructions for the entire content of that old function.  But if you
can delta compress against the version you branched from (or one much
closer to it in time), your delta would be very small as that function
is handled by the smaller copy instruction.
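The effect is easy to see with a toy delta cost model (illustrative Python; git's real delta encoding differs, but the copy-vs-insert economics are the same):

```python
from difflib import SequenceMatcher

def toy_delta_size(base, target):
    """Rough delta cost: each copy op costs ~7 bytes, inserts cost their
    literal length.  A stand-in for git's delta encoding, only meant to
    show how much the choice of base matters."""
    sm = SequenceMatcher(None, base, target, autojunk=False)
    cost = 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == 'equal':
            cost += 7              # one copy instruction
        else:
            cost += (j2 - j1)      # literal insert of target bytes
    return cost

old_fn = b'int f(void) { return legacy_helper(); }\n' * 8
new_fn = b'int f(void) { return shiny_rewrite(); }\n' * 8
common = b'/* unrelated stable code */\n' * 40
branch_version = common + old_fn + b'/* side-branch tweak */\n'

# Delta against the old base (what the branch forked from) is tiny...
small = toy_delta_size(common + old_fn, branch_version)
# ...but against the rewritten tip we must insert the old function whole.
large = toy_delta_size(common + new_fn, branch_version)
assert small < large
```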

Our clustering heuristics work fine.

Our thin-pack selection of potential delta base candidates is not
working well.  We are not very aggressive in loading the delta base
window with potential candidates, which means we miss some really good
compression opportunities.


Ooooh.

I think my test was flawed.  I injected the cached pack's tip as the
edge for the new stuff to delta compress against.  I should have
injected all of the merge bases between the cached pack's tip and the
new stuff.  Although the cached pack tip is one of the merge bases,
it's not all of them.  If we inject all of the merge bases, we can find
the revision that this old side branch is based on, and possibly get a
better delta candidate for it.

IIRC, upload-pack would have walked backwards further and found the
merge base for that side branch, and it would have been part of the
delta base candidates.  I think I need to re-do my cached pack test.
Good thing I have history of my source code saved in this fancy
revision control thingy called "git".  :-)

-- 
Shawn.

* Re: [RFC] Add --create-cache to repack
  2011-01-30 19:43                 ` Shawn Pearce
@ 2011-01-30 20:02                   ` Junio C Hamano
  2011-01-30 20:20                     ` Shawn Pearce
  2011-01-30 22:26                   ` Nicolas Pitre
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-01-30 20:02 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley

Shawn Pearce <spearce@spearce.org> writes:

>> The amount of transfer being that small was something I didn't quite
>> expect, though.  Doesn't it indicate that our pathname based object
>> clustering heuristics is not as effective as we hoped?
>
> I'm not sure I follow your question.

I didn't see path information in your cachefile that contains C commits, T
trees, etc., yet it sped up the object enumeration and you didn't observe
much transfer inflation over stock git.

> Ooooh.
>
> I think my test was flawed.  I injected the cached pack's tip as the
> edge for the new stuff to delta compress against.

That is one of the things I was wondering.  I manually created a thin pack
with only the 1-month-old tip as boundary, and another with all the
boundaries that can be found by rev-list.  I didn't find much difference
in the result, though, as "rev-list --boundary --all --not $onemontholdtip"
had only a few boundary entries in my test.

* Re: [RFC] Add --create-cache to repack
  2011-01-30 20:02                   ` Junio C Hamano
@ 2011-01-30 20:20                     ` Shawn Pearce
  0 siblings, 0 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-30 20:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley

On Sun, Jan 30, 2011 at 12:02, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>>> The amount of transfer being that small was something I didn't quite
>>> expect, though.  Doesn't it indicate that our pathname based object
>>> clustering heuristics is not as effective as we hoped?
>>
>> I'm not sure I follow your question.
>
> I didn't see path information in your cachefile that contains C commits, T
> trees, etc. that sped up the object enumeration, but you didn't observe
> much transfer inflation over the stock git.

I didn't store the path itself, I stored the path hash as a 4 byte
int.  It's smaller, but still helps to schedule the object into the
right position in the delta search.
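For reference, that hash is in the spirit of git's pack-objects name hash: it folds the last handful of non-whitespace characters into 32 bits, weighting the end of the path most, so names sharing a suffix (all the foo.c files) land near each other in the delta search ordering.  A Python rendering of the idea:

```python
def path_name_hash(path):
    """A 32-bit path hash in the spirit of git's pack name hash:
    effectively a sortable number built from the tail of the name,
    skipping whitespace, so "a/foo.c" and "b/foo.c" hash close
    together and get sorted near each other for the delta search."""
    h = 0
    for c in path:
        if c.isspace():
            continue
        h = ((h >> 2) + (ord(c) << 24)) & 0xFFFFFFFF
    return h
```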

-- 
Shawn.

* Re: [RFC] Add --create-cache to repack
  2011-01-29  1:32           ` Shawn Pearce
                               ` (2 preceding siblings ...)
  2011-01-30  6:51             ` Junio C Hamano
@ 2011-01-30 22:13             ` Shawn Pearce
  2011-01-31 18:47             ` Shawn Pearce
  4 siblings, 0 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-30 22:13 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
>
> I fully implemented the reuse of a cached pack behind a thin pack idea
> I was trying to describe in this thread.  It saved 1m7s off the JGit
> running time, but increased the data transfer by 25 MiB.  I didn't
> expect this much of an increase, I honestly expected the thin pack
> portion to be well, thinner.

JGit's thin pack creation is crap.  For example, this is the same fetch:

$ git fetch ../tmp_linux26
remote: Counting objects: 61521, done.
remote: Compressing objects: 100% (12096/12096), done.
remote: Total 50275 (delta 42578), reused 45220 (delta 37524)
Receiving objects: 100% (50275/50275), 11.13 MiB | 7.29 MiB/s, done.
Resolving deltas: 100% (42578/42578), completed with 4968 local objects.

$ git fetch git://localhost/tmp_linux26
remote: Counting objects: 144190, done
remote: Finding sources: 100% (50275/50275)
remote: Compressing objects: 100% (106568/106568)
remote: Compressing objects: 100% (12750/12750)
Receiving objects: 100% (50275/50275), 24.66 MiB | 10.93 MiB/s, done.
Resolving deltas: 100% (40345/40345), completed with 2218 local objects.


JGit produced an extra 13.53 MiB for this pack, because it missed
about 2,233 delta opportunities.  It turns out we are too aggressive
at pushing objects from the edges into the delta windows.  JGit pushes
*everything* in the edge commits, rather than only the paths that are
actually used by the objects we need to send.  This floods the delta
search window with garbage, and makes it less likely that an object to
be sent will find a relevant delta base in the search window.
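The fix amounts to filtering the edge objects by path before seeding the window; a sketch in illustrative Python (the dict fields are invented for the example):

```python
def seed_delta_window(objects_to_send, edge_objects):
    """Admit an edge object as a delta base candidate only if its path
    is actually carried by something we must send; pushing *everything*
    from the edge commits floods the window and evicts useful bases."""
    needed_paths = {o['path'] for o in objects_to_send}
    return [o for o in edge_objects if o['path'] in needed_paths]
```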

-- 
Shawn.

* Re: [RFC] Add --create-cache to repack
  2011-01-30 19:43                 ` Shawn Pearce
  2011-01-30 20:02                   ` Junio C Hamano
@ 2011-01-30 22:26                   ` Nicolas Pitre
  1 sibling, 0 replies; 26+ messages in thread
From: Nicolas Pitre @ 2011-01-30 22:26 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Junio C Hamano, Johannes Sixt, git, John Hawley

On Sun, 30 Jan 2011, Shawn Pearce wrote:

> On Sun, Jan 30, 2011 at 00:05, Junio C Hamano <gitster@pobox.com> wrote:
> > Shawn Pearce <spearce@spearce.org> writes:
> >
> >> Using this for object enumeration shaves almost 1 minute off server
> >> packing time; the clone dropped from 3m28s to 2m29s.  That is close to
> >> what I was getting with the cached pack idea, but the network transfer
> >> stayed the small 376 MiB.
> >
> > I like this result.
> 
> I'm really leaning towards putting this cached object list into JGit.
> 
> I need to shave that 1 minute off server CPU time. I can afford the 41
> MiB disk (and kernel buffer cache), but I cannot really continue to
> pay the 1 minute of CPU on each clone request for large repositories.
> The object list of what is reachable from commit X isn't ever going to
> change, and the path hash function is reasonably stable.  With a
> version code in the file we can desupport old files if the path hash
> function changes.  10% more disk/kernel memory is cheap for some of my
> servers compared to 1 minute of CPU, and some explicit cache
> management by the server administrator to construct the file.

Yep, I think this is probably the best short term solution.  Just walk 
the commit graph as usual, and whenever the commit tip from the cache is 
matched then just shove the entire cache content in the object list.

And let's hope that eventually some future developments will make this 
cache redundant and obsolete.


Nicolas

* Re: [RFC] Add --create-cache to repack
  2011-01-29  1:32           ` Shawn Pearce
                               ` (3 preceding siblings ...)
  2011-01-30 22:13             ` Shawn Pearce
@ 2011-01-31 18:47             ` Shawn Pearce
  2011-01-31 21:48               ` Nicolas Pitre
  4 siblings, 1 reply; 26+ messages in thread
From: Shawn Pearce @ 2011-01-31 18:47 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
>>> >
>>> >> This started because I was looking for a way to speed up clones coming
>>> >> from a JGit server.  Cloning the linux-2.6 repository is painful,
>
> Well, scratch the idea in this thread.  I think.

Nope, I'm back in favor with this after fixing JGit's thin pack
generation.  Here's why.

Take linux-2.6.git as of Jan 12th, with the cache root as of Dec 28th:

  $ git update-ref HEAD f878133bf022717b880d0e0995b8f91436fd605c
  $ git-repack.sh --create-cache \
      --cache-root=b52e2a6d6d05421dea6b6a94582126af8cd5cca2 \
      --cache-include=v2.6.11-tree
  $ git repack -a -d

  $ ls -lh objects/pack/
  total 456M
  1.4M pack-74af5edca80797736fe4de7279b2a81af98470a5.idx
  38M pack-74af5edca80797736fe4de7279b2a81af98470a5.pack

  49M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.idx
  89 pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.keep
  368M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.pack

Our "recent history" is 38M, and our "cached pack" is 368M.  It's a bit
more disk than is strictly necessary; this should be ~380M.  Call it
~26M of wasted disk.  The "cached object list" I proposed elsewhere in
this thread would cost about 41M of disk and is utterly useless except
for initial clones.  Here we are wasting about 26M of disk to have
slightly shorter delta chains in the cached pack (otherwise known as
our ancient history).  So it's a slightly smaller waste, and we get
some (minor) benefit.


Clone without pack caching:

  $ time git clone --bare git://localhost/tmp_linux26_withTag tmp_in.git
  Cloning into bare repository tmp_in.git...
  remote: Counting objects: 1861830, done
  remote: Finding sources: 100% (1861830/1861830)
  remote: Getting sizes: 100% (88243/88243)
  remote: Compressing objects: 100% (88184/88184)
  Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
  remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
  Resolving deltas: 100% (1564621/1564621), done.

  real	3m19.005s
  user	1m36.250s
  sys	0m10.290s


Clone with pack caching:

  $ time git clone --bare git://localhost/tmp_linux26_withTag tmp_in.git
  Cloning into bare repository tmp_in.git...
  remote: Counting objects: 1601, done
  remote: Counting objects: 1828460, done
  remote: Finding sources: 100% (50475/50475)
  remote: Getting sizes: 100% (18843/18843)
  remote: Compressing objects: 100% (7585/7585)
  remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
  Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
  Resolving deltas: 100% (1559477/1559477), done.

  real	2m2.938s
  user	1m35.890s
  sys	0m9.830s


Using the cached pack increased our total data transfer by 2.39 MiB,
but saved 1m17s on server computation time.  If we go back and look at
our cached pack size (368M), the leading thin-pack should be about
10.4 MiB (378.40M - 368M = 10.4M).  If I modify the tmp_in.git client
to have only the cached pack's tip and fetch using CGit, we see the
thin pack to bring ourselves current is 11.07 MiB (JGit does this in
10.96 MiB):

  $ cd tmp_in.git
  $ git update-ref HEAD b52e2a6d6d05421dea6b6a94582126af8cd5cca2
  $ git repack -a -d  ; # yay we are at ~1 month ago

  $ time git fetch ../tmp_linux26_withTag
  remote: Counting objects: 60570, done.
  remote: Compressing objects: 100% (11924/11924), done.
  remote: Total 49804 (delta 42196), reused 44837 (delta 37231)
  Receiving objects: 100% (49804/49804), 11.07 MiB | 7.37 MiB/s, done.
  Resolving deltas: 100% (42196/42196), completed with 4956 local objects.
  From ../tmp_linux26_withTag
   * branch            HEAD       -> FETCH_HEAD

  real	0m35.083s
  user	0m25.710s
  sys	0m1.190s


The pack caching feature is *no worse* in transfer size than if the
client copied the pack from 1 month ago, and then did an incremental
fetch to bring themselves current.  Compared to the naive clone, it
saves an incredible amount of working set space and CPU time.  The
server only needs to keep track of the incremental thin pack, and can
completely ignore the ancient history objects.  It's a great
alternative for projects that want users to rsync/http dumb transport
down a large stable repository, then incremental fetch themselves
current.  Or busy mirror sites that are willing to trade some small
bandwidth for server CPU and memory.

In this particular example, there is ~11 MiB of data that cannot be
safely resumed, or the first 2.9%.  At 56 KiB/s, a client needs to get
through the first 3 minutes of transfer before they can reach the
resumable checkpoint (where the thin pack ends, and the cached pack
starts).  It would be better if we could resume anywhere in the
stream, but being able to resume the last 97% is infinitely better
than being able to resume nothing.  If someone wants to really go
crazy, this is where a "gittorrent" client could start up and handle
the remaining 97% of the transfer.  :-)
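Checking the arithmetic on those figures:

```python
# Non-resumable lead-in: the ~11 MiB thin pack, out of a 378.40 MiB clone.
thin_mib, total_mib, rate_kib_s = 11.07, 378.40, 56
seconds_to_checkpoint = thin_mib * 1024 / rate_kib_s
percent_unresumable = 100 * thin_mib / total_mib
assert round(seconds_to_checkpoint / 60, 1) == 3.4   # ~3 minutes at 56 KiB/s
assert round(percent_unresumable, 1) == 2.9          # the first ~2.9%
```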


I think this is worthwhile.  If we are afraid of the extra 2.39 MiB
data transfer this forces on the client when the repository owner
enables the feature, we should go back and improve our thin-pack code.
Transferring 11 MiB to catch up a kernel from Dec 28th to Jan 12th
sounds like a lot of data, and any improvements in the general
thin-pack code would shrink the leading thin-pack, possibly getting us
that 2.39 MiB back.

-- 
Shawn.

* Re: [RFC] Add --create-cache to repack
  2011-01-31 18:47             ` Shawn Pearce
@ 2011-01-31 21:48               ` Nicolas Pitre
  0 siblings, 0 replies; 26+ messages in thread
From: Nicolas Pitre @ 2011-01-31 21:48 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Mon, 31 Jan 2011, Shawn Pearce wrote:

> On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
> >>> >
> >>> >> This started because I was looking for a way to speed up clones coming
> >>> >> from a JGit server.  Cloning the linux-2.6 repository is painful,
> >
> > Well, scratch the idea in this thread.  I think.
> 
> Nope, I'm back in favor with this after fixing JGit's thin pack
> generation.  Here's why.
> 
> Take linux-2.6.git as of Jan 12th, with the cache root as of Dec 28th:
> 
>   $ git update-ref HEAD f878133bf022717b880d0e0995b8f91436fd605c
>   $ git-repack.sh --create-cache \
>       --cache-root=b52e2a6d6d05421dea6b6a94582126af8cd5cca2 \
>       --cache-include=v2.6.11-tree
>   $ git repack -a -d
> 
>   $ ls -lh objects/pack/
>   total 456M
>   1.4M pack-74af5edca80797736fe4de7279b2a81af98470a5.idx
>   38M pack-74af5edca80797736fe4de7279b2a81af98470a5.pack
> 
>   49M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.idx
>   89 pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.keep
>   368M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.pack
> 
> Our "recent history" is 38M, and our "cached pack" is 368M.  It's a bit
> more disk than is strictly necessary; this should be ~380M.  Call it
> ~26M of wasted disk.

This is fine.  When doing an incremental fetch, the thin pack does
minimize the transfer size, but it does increase the stored pack size by
appending a bunch of non-delta objects to make the pack complete.

What happens, though, is that when gc kicks in, the wasted space is
collected back.  Here, with a single pack, we wouldn't reclaim that
space, as our current heuristic is to reuse delta (non-)pairing by
default.  Maybe in that case we could simply not reuse deltas if they're
of the REF_DELTA type.

> The "cached object list" I proposed elsewhere in
> this thread would cost about 41M of disk and is utterly useless except
> for initial clones.  Here we are wasting about 26M of disk to have
> slightly shorter delta chains in the cached pack (otherwise known as
> our ancient history).  So its a slightly smaller waste, and we get
> some (minor) benefit.

Well, of course the ancient history you're willing to keep stable for a 
while could be repacked even more aggressively than usual.

> Using the cached pack increased our total data transfer by 2.39 MiB,

That's more than acceptable IMHO. That's less than 1% of the total 
transfer.

> I think this is worthwhile.  If we are afraid of the extra 2.39 MiB
> data transfer this forces on the client when the repository owner
> enables the feature, we should go back and improve our thin-pack code.
>  Transferring 11 MiB to catch up a kernel from Dec 28th to Jan 12th
> sounds like a lot of data, 

Well, your timing for this test corresponds with the 2.6.38 merge
window, which is a high-activity peak for this repository.  Still, that
would probably fit the usage scenario in practice pretty well, where the
cached pack would be produced on a tagged release, which happens right
before the merge window.


> and any improvements in the general
> thin-pack code would shrink the leading thin-pack, possibly getting us
> that 2.39 MiB back.

Any improvement to the thin pack would require more CPU cycles, possibly
a lot more.  So given this transfer overhead is already less than 1%, I
don't think we need to bother.


Nicolas

end of thread, other threads:[~2011-01-31 21:48 UTC | newest]

Thread overview: 26+ messages
2011-01-28  8:06 [RFC] Add --create-cache to repack Shawn O. Pearce
2011-01-28  9:08 ` Johannes Sixt
2011-01-28 14:37   ` Shawn Pearce
2011-01-28 15:33     ` Johannes Sixt
2011-01-28 18:22       ` Shawn Pearce
2011-01-28 19:15       ` Jay Soffian
2011-01-28 19:19         ` Shawn Pearce
2011-01-28 18:46     ` Nicolas Pitre
2011-01-28 19:15       ` Shawn Pearce
2011-01-28 21:09         ` Nicolas Pitre
2011-01-29  1:32           ` Shawn Pearce
2011-01-29  2:34             ` Shawn Pearce
2011-01-30  8:05               ` Junio C Hamano
2011-01-30 19:43                 ` Shawn Pearce
2011-01-30 20:02                   ` Junio C Hamano
2011-01-30 20:20                     ` Shawn Pearce
2011-01-30 22:26                   ` Nicolas Pitre
2011-01-29  4:08             ` Nicolas Pitre
2011-01-29  4:35               ` Shawn Pearce
2011-01-30  6:51             ` Junio C Hamano
2011-01-30 17:14               ` Nicolas Pitre
2011-01-30 17:41                 ` A Large Angry SCM
2011-01-30 19:29               ` Shawn Pearce
2011-01-30 22:13             ` Shawn Pearce
2011-01-31 18:47             ` Shawn Pearce
2011-01-31 21:48               ` Nicolas Pitre
