* [RFC] Add --create-cache to repack
@ 2011-01-28  8:06 Shawn O. Pearce
  2011-01-28  9:08 ` Johannes Sixt
From: Shawn O. Pearce @ 2011-01-28  8:06 UTC (permalink / raw)
  To: git, Junio C Hamano, Nicolas Pitre; +Cc: John Hawley

A cache pack is all objects reachable from a single commit that is
part of the project's stable history, won't disappear, and is
accessible to all readers of the repository.  Because the pack contains
only that commit and its contents, if the commit is reachable from a
reference we know immediately that the entire pack is also reachable.
To help ensure this holds, the --create-cache flag looks for a commit
along refs/heads and refs/tags that is at least 1 month old, on the
assumption that a commit this old won't be rebased or pruned.
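
Concretely, that tip selection is a single rev-list query, mirroring
what the script below does; wrapped here as a helper whose name is
mine, not part of the patch:

```shell
# Pick the cache pack's tip: the newest commit on any branch or tag
# whose commit date is at least 1 month old (mirrors git-repack.sh).
pick_cache_tip() {
	git rev-list -n 1 --until=1.month.ago --branches --tags --
}
```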

During a clone request, if a commit is discovered that matches the
cache pack, all newer objects can be enumerated using the normal rules
and sent to the client, and then the cache pack can simply be appended
onto the end of the stream.  There is no need to enumerate its objects,
as the object count is in the header of the cache pack.  There is also
no need to allocate all of those objects in the pack-objects process,
which reduces its working set size and its impact on busy servers.
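
The count is readable without parsing the rest of the pack because a
pack starts with a fixed 12-byte header: the 4-byte magic "PACK", a
4-byte version, and a 4-byte big-endian object count.  A sketch using
only POSIX od/awk (the helper name is illustrative):

```shell
# Read the object count from a pack file's fixed 12-byte header:
# bytes 8-11 are a 32-bit big-endian count.
pack_object_count() {
	od -An -tu1 -j8 -N4 "$1" |
	awk '{ print $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 }'
}
```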

By keeping the pack with a standard .keep file, later repacks of the
repository won't include these objects, which permits disk usage to
stay within a reasonable factor of the repository size.

Because newer packed objects are not delta compressed against the
older cached pack, clients may receive a larger data transfer when the
cached pack is simply appended onto the stream.  pack-objects could
work around this by constructing a thin pack, and adding the cache
pack's tip commit as the uninteresting/common base for the thin pack.
The references for the newer objects will point to older data behind
them so they will automatically use the larger REF_DELTA format.
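
A sketch of that workaround using stock pack-objects flags; the helper
name is mine, and its argument stands for the cached pack's tip commit:

```shell
# Build a thin pack of everything newer than the cached pack's tip
# commit ($1).  "^$1" marks the tip as a common base, so newer objects
# may be stored as REF_DELTA against objects inside the cached pack.
thin_pack_on_top() {
	{ echo "^$1"; git rev-parse --branches --tags; } |
	git pack-objects --revs --thin --stdout
}
```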

This commit only adds the logic to git-repack to construct the cached
pack.  For example on a Linux kernel repository:

  # Construct the initial cache pack
  $ git repack --create-cache --cache-include=v2.6.11-tree

  # Remove duplicated objects
  $ git repack -a -d

If this is actually a good idea, pack-objects can later learn how to
use $GIT_DIR/objects/info/cached during revision traversal to know
when a cached pack is found, and switch to the thin pack + cached pack
transfer method described above.
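
For reference, the $GIT_DIR/objects/info/cached file written by the
script below uses one record per line: "+" lines name the commit tips
the cache covers, and "P" lines name the pack(s) holding them, e.g.:

  + <sha1 of the cache tip commit>
  + <sha1 of each --cache-include object>
  P <pack name printed by pack-objects>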

The cached pack is only useful for initial clones of a repository, and
only if object enumeration takes more than a few seconds.  However,
initial clones of big projects like linux-2.6.git are killing some
common mirror sites, so this could be one way to help them out.

Later fetch-pack/upload-pack protocol could learn how to more
intelligently use the cached pack in the data stream, allowing a
client whose connection has been broken to resume with a byte range
request within the cached pack, assuming the pack is still present on
the server.  This can be validated by giving the client both the SHA-1
pack name, and the SHA-1 trailer of the pack content, and requiring
these to match on a byte range request.
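
That trailer check is cheap to sketch, since every pack ends with a
20-byte SHA-1 of all bytes before it; a helper (name mine) using POSIX
tools plus sha1sum:

```shell
# Validate a pack's stored 20-byte SHA-1 trailer against a recomputed
# checksum of everything before it; a mismatch means the server's pack
# changed and a byte-range resume must restart from scratch.
pack_trailer_ok() {
	size=$(wc -c < "$1")
	computed=$(head -c $((size - 20)) "$1" | sha1sum | cut -d' ' -f1)
	stored=$(tail -c 20 "$1" | od -An -tx1 | tr -d ' \n')
	test "$computed" = "$stored"
}
```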

Repository owners may also enjoy having the cached pack, as frequent
`git gc` invocations will now have lower IO and CPU requirements due
to the large pack having a .keep file.  In the future, `git gc --auto`
could learn to suggest removing the .keep file and regenerating the
cached pack once there is enough new content to make creating a
new pack worthwhile.
---
 git-repack.sh |   57 +++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/git-repack.sh b/git-repack.sh
index 624feec..7a7984c 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -15,6 +15,9 @@ F               pass --no-reuse-object to git-pack-objects
 n               do not run git-update-server-info
 q,quiet         be quiet
 l               pass --local to git-pack-objects
+create-cache    create a cached pack for older history
+cache-include=  other objects to include in the cache
+cache-age=      how old to start caching from
  Packing constraints
 window=         size of the window used for delta compression
 window-memory=  same as the above, but limit memory size instead of entries count
@@ -26,6 +29,7 @@ SUBDIRECTORY_OK='Yes'
 
 no_update_info= all_into_one= remove_redundant= unpack_unreachable=
 local= no_reuse= extra=
+create_cache= cache_include= cache_age=1.month.ago
 while test $# != 0
 do
 	case "$1" in
@@ -38,6 +42,11 @@ do
 	-f)	no_reuse=--no-reuse-delta ;;
 	-F)	no_reuse=--no-reuse-object ;;
 	-l)	local=--local ;;
+	--create-cache) create_cache=t ;;
+	--cache-age) cache_age=$2; shift ;;
+	--cache-include)
+		name=$(git rev-parse --verify $2)
+		cache_include="$cache_include $name"; shift ;;
 	--max-pack-size|--window|--window-memory|--depth)
 		extra="$extra $1=$2"; shift ;;
 	--) shift; break;;
@@ -52,16 +61,19 @@ true)
 esac
 
 PACKDIR="$GIT_OBJECT_DIRECTORY/pack"
+INFODIR="$GIT_OBJECT_DIRECTORY/info"
 PACKTMP="$PACKDIR/.tmp-$$-pack"
 rm -f "$PACKTMP"-*
 trap 'rm -f "$PACKTMP"-*' 0 1 2 3 15
 
 # There will be more repacking strategies to come...
-case ",$all_into_one," in
-,,)
+case ",$create_cache,$all_into_one," in
+,t,,)
+	;;
+,,,)
 	args='--unpacked --incremental'
 	;;
-,t,)
+,,t,)
 	args= existing=
 	if [ -d "$PACKDIR" ]; then
 		for e in `cd "$PACKDIR" && find . -type f -name '*.pack' \
@@ -84,9 +96,22 @@ esac
 
 mkdir -p "$PACKDIR" || exit
 
-args="$args $local ${GIT_QUIET:+-q} $no_reuse$extra"
-names=$(git pack-objects --keep-true-parents --honor-pack-keep --non-empty --all --reflog $args </dev/null "$PACKTMP") ||
-	exit 1
+if [ -n "$create_cache" ]; then
+	root=$(git rev-list -n 1 --until=$cache_age --branches --tags --)
+	args="$args ${GIT_QUIET:+-q} $no_reuse$extra"
+	names=$( ( echo "$root";
+		       for name in $cache_include
+		       do
+		         echo "$name"
+		       done ) |
+		git pack-objects --keep-true-parents --non-empty $args --revs \
+		"$PACKTMP") ||
+		exit 1
+else
+	args="$args $local ${GIT_QUIET:+-q} $no_reuse$extra"
+	names=$(git pack-objects --keep-true-parents --honor-pack-keep --non-empty --all --reflog $args </dev/null "$PACKTMP") ||
+		exit 1
+fi
 if [ -z "$names" ]; then
 	say Nothing new to pack.
 fi
@@ -151,6 +176,10 @@ do
 	mv -f "$PACKTMP-$name.pack" "$PACKDIR/pack-$name.pack" &&
 	mv -f "$PACKTMP-$name.idx"  "$PACKDIR/pack-$name.idx" ||
 	exit
+
+	if [ -n "$create_cache" ]; then
+		echo "cache $root$cache_include" >"$PACKDIR/pack-$name.keep"
+	fi
 done
 
 # Remove the "old-" files
@@ -162,6 +191,22 @@ done
 
 # End of pack replacement.
 
+# Update the cache list
+if [ -n "$create_cache" ]; then
+	mkdir -p "$INFODIR" || exit
+	( echo "+ $root" &&
+	  for name in $cache_include
+	  do
+	    echo "+ $name"
+	  done
+	  for name in $names
+	  do
+	    echo "P $name"
+	  done ) >"$INFODIR/cached"
+	echo "Cached from:"
+	git log --pretty=format:'  [%h] %cd%n  %s' -1 "$root" --
+fi
+
 if test "$remove_redundant" = t
 then
 	# We know $existing are all redundant.
-- 
1.7.4.rc1.253.gb7420


* Re: [RFC] Add --create-cache to repack
  2011-01-28  8:06 [RFC] Add --create-cache to repack Shawn O. Pearce
@ 2011-01-28  9:08 ` Johannes Sixt
  2011-01-28 14:37   ` Shawn Pearce
From: Johannes Sixt @ 2011-01-28  9:08 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git, Junio C Hamano, Nicolas Pitre, John Hawley

On 1/28/2011 9:06, Shawn O. Pearce wrote:
> A cache pack is all objects reachable from a single commit that is
> part of the project's stable history and won't disappear, and is
> accessible to all readers of the repository.  By containing only that
> commit and its contents, if the commit is reached from a reference we
> know immediately that the entire pack is also reachable.  To help
> ensure this is true, the --create-cache flag looks for a commit along
> refs/heads and refs/tags that is at least 1 month old, working under
> the assumption that a commit this old won't be rebased or pruned.

In one of my repositories, I have two stable branches and a good score of
topic branches of various ages (a few hours up to two years 8). The topic
branches will either be dropped eventually, or rebased.

What are the odds that this choice of a tip commit picks one that is in a
topic branch? Or is there no point in using --create-cache in a repository
like this?

-- Hannes


* Re: [RFC] Add --create-cache to repack
  2011-01-28  9:08 ` Johannes Sixt
@ 2011-01-28 14:37   ` Shawn Pearce
  2011-01-28 15:33     ` Johannes Sixt
  2011-01-28 18:46     ` Nicolas Pitre
From: Shawn Pearce @ 2011-01-28 14:37 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git, Junio C Hamano, Nicolas Pitre, John Hawley

On Fri, Jan 28, 2011 at 01:08, Johannes Sixt <j.sixt@viscovery.net> wrote:
> On 1/28/2011 9:06, Shawn O. Pearce wrote:
>> A cache pack is all objects reachable from a single commit that is
>> part of the project's stable history and won't disappear, and is
>> accessible to all readers of the repository.  By containing only that
>> commit and its contents, if the commit is reached from a reference we
>> know immediately that the entire pack is also reachable.  To help
>> ensure this is true, the --create-cache flag looks for a commit along
>> refs/heads and refs/tags that is at least 1 month old, working under
>> the assumption that a commit this old won't be rebased or pruned.
>
> In one of my repositories, I have two stable branches and a good score of
> topic branches of various ages (a few hours up to two years 8). The topic
> branches will either be dropped eventually, or rebased.
>
> What are the odds that this choice of a tip commit picks one that is in a
> topic branch? Or is there no point in using --create-cache in a repository
> like this?

Argh, you are right.  It's quite likely this would pick a topic
branch... and that isn't really what is desired.

My original concept here was for distribution point repositories,
which are less likely to have these topic branches that will rebase
and disappear.  Though git.git has one called "pu".  *sigh*

A simple fix is to use --heads --tags by default like I do here, but
make the actual parameters we feed to rev-list configurable.  A
repository owner could select only the master branch as input to
rev-list, making it less likely the topic branches would be
considered.  Unfortunately that requires direct access to the
repository.  It fails for a site like GitHub, where you don't manage
the repository at all.

git.git is also problematic because of the man, html and todo
branches.  Branches that are disconnected from the main history but
are very small (e.g. todo) might be selected instead and create a
nearly useless cache file.  Fortunately, disconnected branches could
each have their own cache file (with only the inode overhead of an
additional 3 files per disconnected branch), and pack-objects could
concatenate all of those packs together when sending.  It's just a
challenge to identify these branches and keep them from being used
for the main project pack.


This started because I was looking for a way to speed up clones coming
from a JGit server.  Cloning the linux-2.6 repository is painful; it
takes a long time to enumerate the 1.8 million objects.  So I tried
adding a cached list of objects reachable from a given commit, which
speeds up the enumeration phase, but JGit still needs to allocate all
of the working set to track those objects, then go find them in packs
and slice out each compressed form and reformat the headers on the
wire.  It's a lot of redundant work when your kernel repository has
360MB of data that you know a client needs if they have asked for your
master branch with no "have" set.

Later I realized we can get rid of that cached list of objects and
just use the pack itself.  It's far cleaner, as there is no redundant
cache.  But either way (object list or pack) it's a bit of a challenge
to automatically identify the right starting points to use.  Linus
Torvalds' linux-2.6 repository is the perfect case for the RFC I
posted: it's one branch with all of the history, and it never rewinds.
But maybe Linus is just very unique in this world.  :-)

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-28 14:37   ` Shawn Pearce
@ 2011-01-28 15:33     ` Johannes Sixt
  2011-01-28 18:22       ` Shawn Pearce
  2011-01-28 19:15       ` Jay Soffian
  2011-01-28 18:46     ` Nicolas Pitre
From: Johannes Sixt @ 2011-01-28 15:33 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Junio C Hamano, Nicolas Pitre, John Hawley

On 1/28/2011 15:37, Shawn Pearce wrote:
> A simple fix is to use --heads --tags by default like I do here, but
> make the actual parameters we feed to rev-list configurable.  A
> repository owner could select only the master branch as input to
> rev-list, making it less likely the topic branches would be
> considered.  Unfortunately that requires direct access to the
> repository.  It fails for a site like GitHub, where you don't manage
> the repository at all.

Let's define a ref hierarchy, refs/cache-pack, that names the cache pack
tips. A cache pack would be generated for each ref found in that
hierarchy. Then these commits are under user control even on github,
because you can just push the refs. Junio would perhaps choose a release
tag, and corresponding commits in the man and html histories. The choice
would not be completely automatic, though.

-- Hannes


* Re: [RFC] Add --create-cache to repack
  2011-01-28 15:33     ` Johannes Sixt
@ 2011-01-28 18:22       ` Shawn Pearce
  2011-01-28 19:15       ` Jay Soffian
From: Shawn Pearce @ 2011-01-28 18:22 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git, Junio C Hamano, Nicolas Pitre, John Hawley

On Fri, Jan 28, 2011 at 07:33, Johannes Sixt <j.sixt@viscovery.net> wrote:
> On 1/28/2011 15:37, Shawn Pearce wrote:
>> A simple fix is to use --heads --tags by default like I do here, but
>> make the actual parameters we feed to rev-list configurable.  A
>> repository owner could select only the master branch as input to
>> rev-list, making it less likely the topic branches would be
>> considered.  Unfortunately that requires direct access to the
>> repository.  It fails for a site like GitHub, where you don't manage
>> the repository at all.
>
> Let's define a ref hierarchy, refs/cache-pack, that names the cache pack
> tips. A cache pack would be generated for each ref found in that
> hierarchy. Then these commits are under user control even on github,
> because you can just push the refs. Junio would perhaps choose a release
> tag, and corresponding commits in the man and html histories. The choice
> would not be completely automatic, though.

This is a good idea.  Perhaps we go slightly further and say:

  refs/cache-pack/name-without-slash

    This packs into its own pack file, as a single tip.

  refs/cache-pack/group/a
  refs/cache-pack/group/b

    These pack into a pack file together.

If you have direct repository access, you can also just make one of
these a symbolic reference to a branch, e.g. refs/heads/master, and
then periodic `git repack --create-cache` invocations would pick up
the latest point.
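
That last trick is just a symbolic ref; a hypothetical setup (the ref
name refs/cache-pack/main and the helper name are illustrative):

```shell
# Hypothetical: make refs/cache-pack/main a symbolic ref to the branch
# to track, so periodic `git repack --create-cache` runs follow its tip.
make_cache_pack_symref() {
	git symbolic-ref refs/cache-pack/main refs/heads/master &&
	git symbolic-ref refs/cache-pack/main   # prints refs/heads/master
}
```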

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-28 14:37   ` Shawn Pearce
  2011-01-28 15:33     ` Johannes Sixt
@ 2011-01-28 18:46     ` Nicolas Pitre
  2011-01-28 19:15       ` Shawn Pearce
From: Nicolas Pitre @ 2011-01-28 18:46 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, 28 Jan 2011, Shawn Pearce wrote:

> This started because I was looking for a way to speed up clones coming
> from a JGit server.  Cloning the linux-2.6 repository is painful, it
> takes a long time to enumerate the 1.8 million objects.  So I tried
> adding a cached list of objects reachable from a given commit, which
> speeds up the enumeration phase, but JGit still needs to allocate all
> of the working set to track those objects, then go find them in packs
> and slice out each compressed form and reformat the headers on the
> wire.  Its a lot of redundant work when your kernel repository has
> 360MB of data that you know a client needs if they have asked for your
> master branch with no "have" set.
> 
> Later I realized, we can get rid of that cached list of objects and
> just use the pack itself.  Its far cleaner, as there is no redundant
> cache.  But either way (object list or pack) its a bit of a challenge
> to automatically identify the right starting points to use.  Linus
> Torvalds' linux-2.6 repository is the perfect case for the RFC I
> posted, its one branch with all of the history, and it never rewinds.
> But maybe Linus is just very unique in this world.  :-)

Playing my old record again... I know.  But pack v4 should solve a big 
part of this enumeration cost.

I've changed the format slightly again in my WIP branch.  The idea is to:

1) Have a non compressed yet still really dense representation for tree 
   objects;

2) do the same thing for the first part of commit objects, and only 
   deflate the free form text part.

There is nothing new here.  However, it should be possible to:

3) replace all SHA1 references by an offset into the pack file directly, 
   just like we do for OFS_DELTA objects.  If the SHA1 is actually 
   needed, we can obtain it with a reverse lookup by object offset in 
   the pack index file, but in practice that is not required very often.

So walking the history graph and enumerating objects would require 
nothing more than simply following straight pointers in the pack data in 
99% of the cases.  No object decompression, no memory buffer 
allocation/deallocation to perform that decompression, no string parsing 
in the tree object case, etc. Only cross pack references would require a 
full SHA1 based lookup like we do now.

I still have to sit down and figure out the implications of this, 
especially with forward references, meaning that the offset might have 
to be an object index so as to allow for variable-length encoding, and 
also to make sure index-pack can reconstruct the pack index.  But that 
would only be an indirect lookup, which shouldn't be significantly 
costly.

So that's the idea.  Keep the exact same functionality as we have now, 
without any need for cache management, but making the data structure in 
a form that should improve object enumeration by some magnitude.


Nicolas


* Re: [RFC] Add --create-cache to repack
  2011-01-28 15:33     ` Johannes Sixt
  2011-01-28 18:22       ` Shawn Pearce
@ 2011-01-28 19:15       ` Jay Soffian
  2011-01-28 19:19         ` Shawn Pearce
From: Jay Soffian @ 2011-01-28 19:15 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Shawn Pearce, git, Junio C Hamano, Nicolas Pitre, John Hawley

On Fri, Jan 28, 2011 at 10:33 AM, Johannes Sixt <j.sixt@viscovery.net> wrote:
> Let's define a ref hierarchy, refs/cache-pack, that names the cache pack
> tips. A cache pack would be generated for each ref found in that
> hierarchy. Then these commits are under user control even on github,
> because you can just push the refs. Junio would perhaps choose a release
> tag, and corresponding commits in the man and html histories. The choice
> would not be completely automatic, though.

This is just for bare repos, right? Why not just use HEAD?

j.


* Re: [RFC] Add --create-cache to repack
  2011-01-28 18:46     ` Nicolas Pitre
@ 2011-01-28 19:15       ` Shawn Pearce
  2011-01-28 21:09         ` Nicolas Pitre
From: Shawn Pearce @ 2011-01-28 19:15 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Fri, 28 Jan 2011, Shawn Pearce wrote:
>
>> This started because I was looking for a way to speed up clones coming
>> from a JGit server.  Cloning the linux-2.6 repository is painful,
...
>> Later I realized, we can get rid of that cached list of objects and
>> just use the pack itself.
...
> Playing my old record again... I know.  But pack v4 should solve a big
> part of this enumeration cost.

I've said the same thing for years myself.  As much as it would be
nice to fix some of the decompression costs with pack v2/v3, v2/v3 is
very common in the wild, and a new pack encoding is going to be a
fairly complex thing to get added to C Git.  And pack v4 doesn't
eliminate the enumeration; it just makes it faster.

> So that's the idea.  Keep the exact same functionality as we have now,
> without any need for cache management, but making the data structure in
> a form that should improve object enumeration by some magnitude.

That's what I also liked about my --create-cache flag.  It's keeping
the same data we already have, in the same format we already have it
in.  We're just making a more explicit statement that everything in
some pack is about as tightly compressed as it ever will be for a
client, and it isn't going to change anytime soon.  Thus we might as
well tag it with .keep to prevent repack from mucking with it, and we
can take advantage of this to serve the pack to clients very fast.

Over breakfast this morning I made the point to Junio that with the
cached pack and a slight network protocol change (enabled by a
capability of course) we could stop using pkt-line framing when
sending the cached pack part of the stream, and just send the pack
directly down the socket.  That changes the clone of a 400 MB project
like linux-2.6 from being a lot of user space stuff, to just being a
sendfile() call for the bulk of the content.  I think we can just hand
off the major streaming to the kernel.  (Part of the protocol change
is we would need to use multiple SHA-1 checksums in the stream, so we
don't have to re-checksum the existing cached pack.)


I love the idea of some of the concepts in pack v4.  I really do.  But
this sounds a lot simpler to implement, and it lets us completely
eliminate a massive amount of server processing (even under pack v4
you still have object enumeration), in exchange for what might be a
few extra MBs on the wire to the client due to slightly less good
deltas and the use of REF_DELTA in the thin pack used for the most
recent objects.  I don't envision this being used on projects smaller
than git.git itself, if you can gc --aggressive the whole thing in a
minute the cached pack is probably pointless.  But if you have 400+
MB, you want that to be network bound, and have almost no CPU impact
on the server.

Plus we can safely do byte range requests for resumable clone within
the cached pack part of the stream.  And when pack v4 comes along, we
can use this same strategy for an equally large pack v4 pack.

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-28 19:15       ` Jay Soffian
@ 2011-01-28 19:19         ` Shawn Pearce
From: Shawn Pearce @ 2011-01-28 19:19 UTC (permalink / raw)
  To: Jay Soffian
  Cc: Johannes Sixt, git, Junio C Hamano, Nicolas Pitre, John Hawley

On Fri, Jan 28, 2011 at 11:15, Jay Soffian <jaysoffian@gmail.com> wrote:
> On Fri, Jan 28, 2011 at 10:33 AM, Johannes Sixt <j.sixt@viscovery.net> wrote:
>> Let's define a ref hierarchy, refs/cache-pack, that names the cache pack
>> tips. A cache pack would be generated for each ref found in that
>> hierarchy. Then these commits are under user control even on github,
>> because you can just push the refs. Junio would perhaps choose a release
>> tag, and corresponding commits in the man and html histories. The choice
>> would not be completely automatic, though.
>
> This is just for bare repos, right? Why not just use HEAD?

Even on a bare repository a user might rewind his/her HEAD frequently.
 Caching from today's HEAD might not be ideal if you are about to
rewrite the last 10 commits and push those again to the repository.
That's actually where the "1.month.ago" guess came from in the patch.
If we go back a little in history, the odds of a rewrite are reduced,
and we're more likely to be able to reuse this pack.

HEAD - X commits/X days might be a good approximation if there are no
refs/cache-pack *and* gc --auto notices there is "enough" content to
suggest creating a cached pack.  But I do like Johannes Sixt's
refs/cache-pack ref hierarchy as a way to configure this explicitly.

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-28 19:15       ` Shawn Pearce
@ 2011-01-28 21:09         ` Nicolas Pitre
  2011-01-29  1:32           ` Shawn Pearce
From: Nicolas Pitre @ 2011-01-28 21:09 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley


On Fri, 28 Jan 2011, Shawn Pearce wrote:

> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@fluxnic.net> wrote:
> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >
> >> This started because I was looking for a way to speed up clones coming
> >> from a JGit server.  Cloning the linux-2.6 repository is painful,
> ...
> >> Later I realized, we can get rid of that cached list of objects and
> >> just use the pack itself.
> ...
> > Playing my old record again... I know.  But pack v4 should solve a big
> > part of this enumeration cost.
> 
> I've said the same thing for years myself.  As much as it would be
> nice to fix some of the decompression costs with pack v2/v3, v2/v3 is
> very common in the wild, and a new pack encoding is going to be a
> fairly complex thing to get added to C Git.  And pack v4 doesn't
> eliminate the enumeration, it just makes it faster.

Well, you don't necessarily need pack v4 to be widely deployed for 
people to benefit from it.  If it is available on servers such as 
git.kernel.org then everybody will see their clone requests go faster.  
Same principle as for the cache packs.

And yes it doesn't eliminate the enumeration, but you can't eliminate it 
entirely either as many other operations do require object enumeration 
too, and those would be sped up as well.

But this is in fact orthogonal to the cache pack concept.

> That's what I also liked about my --create-cache flag.  Its keeping
> the same data we already have, in the same format we already have it
> in.  We're just making a more explicit statement that everything in
> some pack is about as tightly compressed as it ever would be for a
> client, and it isn't going to change anytime soon.  Thus we might as
> well tag it with .keep to prevent repack of mucking with it, and we
> can take advantage of this to serve the pack to clients very fast.

I do agree on that point.  And I like it too.  However, I'd prefer if 
the whole thing wasn't created "automatically".  It's probably best if 
the repository administrator decides explicitly what should go in such 
cached packs, choosing commit thresholds and branches according to 
actual purpose and usage.  Only a human can make that decision.

I'd also recommend _not_ using the ref namespace for that.  Let's not 
mix up branching/tagging with what is effectively a storage 
implementation issue.  Linking the ref namespace with the actual packs 
they refer to would be highly inelegant if the SHA1 of the pack has to 
be part of the ref name.  Instead, I'd suggest simply listing all the 
commit tips a cache pack contains in the .keep file directly.  That 
would also make it much easier to use with object alternates, as the 
alternate mechanism points to the object store of a foreign repo and 
not to its refs.

> Over breakfast this morning I made the point to Junio that with the
> cached pack and a slight network protocol change (enabled by a
> capability of course) we could stop using pkt-line framing when
> sending the cached pack part of the stream, and just send the pack
> directly down the socket.  That changes the clone of a 400 MB project
> like linux-2.6 from being a lot of user space stuff, to just being a
> sendfile() call for the bulk of the content.  I think we can just hand
> off the major streaming to the kernel. 

While this might look like a good idea in theory, did you actually 
profile it to see if that would make a noticeable difference?  The 
pkt-line framing allows for asynchronous messages to be sent over a 
sideband, which you wouldn't be able to do anymore until the full 
400 MB is received by the remote side.  Without concrete performance 
numbers I'm not convinced it is worth the maintenance cost of creating 
a deviation in the protocol like this.

> (Part of the protocol change
> is we would need to use multiple SHA-1 checksums in the stream, so we
> don't have to re-checksum the existing cached pack.)

?? I don't follow you here.

> I love the idea of some of the concepts in pack v4.  I really do.  But
> this sounds a lot simpler to implement, and it lets us completely
> eliminate a massive amount of server processing (even under pack v4
> you still have object enumeration), in exchange for what might be a
> few extra MBs on the wire to the client due to slightly less good
> deltas and the use of REF_DELTA in the thin pack used for the most
> recent objects.

I agree.  And what I personally like the most is the fact that this can 
be made transparent to clients using the existing network protocol 
unchanged.

> Plus we can safely do byte range requests for resumable clone within
> the cached pack part of the stream.

That part I'm not sure of.  We are still facing the same old issues 
here, as some mirrors might have the same commit edges for a cache pack 
but not necessarily the same packing result, etc.  So I'd keep that out 
of the picture for now.  The idea of being able to resume the transfer 
of a cache pack is good; however, I'd make it a totally separate 
service outside git-upload-pack, where the issue of validating and 
updating content on both sides can be done efficiently without 
impacting the upload-pack protocol.  There would be more than just the 
cache pack in play during a typical clone.

> And when pack v4 comes along, we
> can use this same strategy for an equally large pack v4 pack.

Absolutely.


Nicolas


* Re: [RFC] Add --create-cache to repack
  2011-01-28 21:09         ` Nicolas Pitre
@ 2011-01-29  1:32           ` Shawn Pearce
  2011-01-29  2:34             ` Shawn Pearce
                               ` (4 more replies)
From: Shawn Pearce @ 2011-01-29  1:32 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, Jan 28, 2011 at 13:09, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Fri, 28 Jan 2011, Shawn Pearce wrote:
>
>> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@fluxnic.net> wrote:
>> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
>> >
>> >> This started because I was looking for a way to speed up clones coming
>> >> from a JGit server.  Cloning the linux-2.6 repository is painful,

Well, scratch the idea in this thread.  I think.

I retested JGit vs. CGit on an identical linux-2.6 repository.  The
repository was fully packed, but had two pack files.  362M and 57M,
and was created by packing a 1 month old master, marking it .keep, and
then repacking -a -d to get most recent last month into another pack.
This results in some files that should be delta compressed together
being stored whole in the two packs (obviously).

The two implementations take about the same amount of time to generate
the clone: 3m28s / 3m22s for JGit, 3m23s for C Git.  The JGit-created
pack is actually smaller, 376.30 MiB vs. C Git's 380.59 MiB.  I point
out this data because improvements made to JGit may translate into
similar improvements to CGit, given how close they are in running time.

I fully implemented the reuse of a cached pack behind a thin pack, the
idea I was trying to describe in this thread.  It saved 1m7s off the
JGit running time, but increased the data transfer by 25 MiB.  I didn't
expect this much of an increase; I honestly expected the thin pack
portion to be, well, thinner.  The issue is the thin pack cannot delta
against all of the history; it is only delta compressing against the
tip of the cached pack.  So long-lived side branches that forked off an
older part of the history aren't delta compressing well, or at all,
and that is significantly bloating the thin pack.  (It's also why that
"newer" pack is 57M, but should be 14M if correctly combined with the
cached pack.)  If I were to consider all of the objects in the cached
pack as potential delta base candidates for the thin pack, the entire
benefit of the cached pack disappears.


Which leaves me with dropping this idea.  I started it because I was
actually looking for a way to speed up JGit.  But we're already
roughly on par with CGit performance.  Dropping 1m7s on a clone is
great, but not at the expense of a 6.5% larger network transfer.  For
most clients, 25 MiB of additional data transfer may cost much more
time than the 1m7s saved in server-side computation.

>> That's what I also liked about my --create-cache flag.
>
> I do agree on that point.   And I like it too.

I'm not sure I like it so much anymore.  :-)

The idea was half-baked, and came at the end of a long day, after
putting my cranky infant son down to sleep way past his normal bed
time.  I claim I was a sleep-deprived new parent who wasn't thinking
things through enough before writing an email to git@vger.

>> sendfile() call for the bulk of the content.  I think we can just hand
>> off the major streaming to the kernel.
>
> While this might look like a good idea in theory, did you actually
> profile it to see if that would make a noticeable difference?  The
> pkt-line framing allows for asynchronous messages to be sent over a
> sideband,

No, of course not.  The pkt-line framing is pretty low overhead, but
copying from a kernel buffer to userspace and back to a kernel buffer
sort of sucks for 400 MiB of data.  sendfile() on 400 MiB to a network
socket is much easier when it's all kernel space.  I figured, if it
already worked out to just dump the pack to the wire as-is, then we
probably should also try to go for broke and reduce the userspace
copying.  It might not matter to your desktop, but ask John Hawley
(CC'd) about kernel.org and the git traffic volume he is serving.
They are doing more than 1 million git:// requests per day now.
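For illustration, a zero-copy send of a pack byte range using Linux
sendfile(2), via Python's os.sendfile.  This is a hedged sketch of the
idea being discussed, not how any git server actually streams data,
and the helper name is invented:

```python
import os

def send_pack_range(sock, pack_path, offset, count):
    """Stream `count` bytes of a pack file, starting at `offset`,
    into a connected socket (hypothetical helper, Linux-only)."""
    with open(pack_path, "rb") as f:
        sent = 0
        while sent < count:
            # sendfile() moves bytes from the page cache straight to
            # the socket, avoiding the kernel->user->kernel copies a
            # read()/write() loop pays on hundreds of MiB of pack data.
            n = os.sendfile(sock.fileno(), f.fileno(), offset + sent,
                            count - sent)
            if n == 0:  # hit EOF in the pack before `count` bytes
                break
            sent += n
    return sent
```

The trade-off raised later in the thread still applies: git checksums
the outgoing stream, so the bytes must pass through userspace anyway
unless the trailer is precomputed for the cached portion.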

>> Plus we can safely do byte range requests for resumable clone within
>> the cached pack part of the stream.
>
> That part I'm not sure of.  We are still facing the same old issues
> here, as some mirrors might have the same commit edges for a cache pack
> but not necessarily the same packing result, etc.  So I'd keep that out
> of the picture for now.

I don't think it's that hard.  If we modify the transfer protocol to
allow the server to denote boundaries between packs, the server can
send the pack name (as in pack-$name.pack) and the pack SHA-1 trailer
to the client.  A client asking to resume a cached pack presents its
original want list, these two SHA-1s, and the byte offset it wants
to restart from.  The server validates that the want set is still
reachable, that the cached pack exists, and that the cached pack tips
are reachable from current refs.  If all of that is true, it validates
that the trailing SHA-1 in the pack matches what the client gave it.
If that matches, it should be OK to resume the transfer from where the
client asked.
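The validation steps above can be sketched as follows.  This is a toy
illustration of the proposed checks, not real upload-pack code; the
repository and pack lookup helpers are hypothetical stand-ins:

```python
def can_resume(repo, wants, pack_name, pack_trailer, offset):
    """Decide whether a broken clone may resume mid-pack.
    `repo` is a hypothetical object exposing the lookups named below."""
    # 1. Every object the client originally asked for must still be
    #    reachable from the current refs.
    if not all(repo.is_reachable(w) for w in wants):
        return False
    # 2. The cached pack named by the client must still exist on disk.
    pack = repo.find_pack(pack_name)
    if pack is None:
        return False
    # 3. Its tip commits must still be reachable, and the trailing
    #    SHA-1 must match what the client recorded, proving the bytes
    #    on disk are the same ones it partially downloaded.
    if not all(repo.is_reachable(t) for t in pack.tips):
        return False
    if pack.trailer_sha1 != pack_trailer:
        return False
    # 4. The requested restart offset must fall inside the pack.
    return 0 <= offset < pack.size
```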

Then it's up to the server administrators of a round-robin serving
cluster to ensure that the same cached pack is available on all nodes,
so that a resuming client is likely to have its request succeed.  This
isn't impossible.  If the server operator cares, they can keep the
prior cached pack for several weeks after creating a newer cached
pack, giving clients plenty of time to resume a broken clone.  Disk is
fairly inexpensive these days.

But it's perhaps pointless, see above.  :-)

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-29  1:32           ` Shawn Pearce
@ 2011-01-29  2:34             ` Shawn Pearce
  2011-01-30  8:05               ` Junio C Hamano
  2011-01-29  4:08             ` Nicolas Pitre
                               ` (3 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Shawn Pearce @ 2011-01-29  2:34 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
>
> Well, scratch the idea in this thread.  I think.
>
> I retested JGit vs. CGit on an identical linux-2.6 repository.  The
> repository was fully packed, but had two pack files.  362M and 57M,
> and was created by packing a 1 month old master, marking it .keep, and
> then repacking -a -d to get most recent last month into another pack.
> This results in some files that should be delta compressed together
> being stored whole in the two packs (obviously).
>
> The two implementations take the same amount of time to generate the
> clone.  3m28s / 3m22s for JGit, 3m23s for C Git.  The JGit created
> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB.

I just tried caching only the object list of what is reachable from a
particular commit.  The file starts with a small 20-byte header:

  4 byte magic
  4 byte version
  4 byte number of commits (C)
  4 byte number of trees (T)
  4 byte number of blobs (B)

Then come C commit SHA-1s, followed by T entries of tree SHA-1 plus a
4-byte path_hash, followed by B entries of blob SHA-1 plus a 4-byte
path_hash.  For any project the size is basically on par with the .idx
file in the version 1 index format, so ~41 MB for linux-2.6.  The file
is stored as $GIT_OBJECT_DIRECTORY/cache/$COMMIT_SHA1.list, and is
completely pack-independent.
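A minimal sketch of writing and reading that header with Python's
struct module.  The byte order, magic value, and version number are
assumptions for illustration; the post only specifies the five 4-byte
fields:

```python
import struct

# Assumed layout from the description above: 4-byte magic, 4-byte
# version, then counts of commits (C), trees (T) and blobs (B).
HEADER = struct.Struct(">4sIIII")
MAGIC = b"OLST"  # hypothetical magic; the post does not name one

def write_header(num_commits, num_trees, num_blobs):
    return HEADER.pack(MAGIC, 1, num_commits, num_trees, num_blobs)

def read_header(data):
    magic, version, c, t, b = HEADER.unpack_from(data, 0)
    if magic != MAGIC or version != 1:
        raise ValueError("not a cached object list")
    return c, t, b
```

After the header come the C 20-byte commit SHA-1s, then the T tree and
B blob entries of 20-byte SHA-1 plus 4-byte path hash, which is why
the whole file lands near the size of a v1 .idx.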

Using this for object enumeration shaves almost 1 minute off server
packing time; the clone dropped from 3m28s to 2m29s.  That is close to
what I was getting with the cached pack idea, but the network transfer
stayed at the small 376 MiB.  I think this supports your pack v4
work... if we can make object enumeration this simple (scan down a
list of objects with their types declared inline, or implied by
location), we can cut a full minute of CPU time off the server side.

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-29  1:32           ` Shawn Pearce
  2011-01-29  2:34             ` Shawn Pearce
@ 2011-01-29  4:08             ` Nicolas Pitre
  2011-01-29  4:35               ` Shawn Pearce
  2011-01-30  6:51             ` Junio C Hamano
                               ` (2 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Nicolas Pitre @ 2011-01-29  4:08 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley


On Fri, 28 Jan 2011, Shawn Pearce wrote:

> On Fri, Jan 28, 2011 at 13:09, Nicolas Pitre <nico@fluxnic.net> wrote:
> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >
> >> On Fri, Jan 28, 2011 at 10:46, Nicolas Pitre <nico@fluxnic.net> wrote:
> >> > On Fri, 28 Jan 2011, Shawn Pearce wrote:
> >> >
> >> >> This started because I was looking for a way to speed up clones coming
> >> >> from a JGit server.  Cloning the linux-2.6 repository is painful,
> 
> Well, scratch the idea in this thread.  I think.
> 
> I retested JGit vs. CGit on an identical linux-2.6 repository.  The
> repository was fully packed, but had two pack files.  362M and 57M,
> and was created by packing a 1 month old master, marking it .keep, and
> then repacking -a -d to get most recent last month into another pack.
> This results in some files that should be delta compressed together
> being stored whole in the two packs (obviously).
> 
> The two implementations take the same amount of time to generate the
> clone.  3m28s / 3m22s for JGit, 3m23s for C Git.  The JGit created
> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB.  I point
> out this data because improvements made to JGit may show similar
> improvements to CGit given how close they are in running time.

What are those improvements?

Now, the fact that JGit is so close to CGit must mean the actual cost
is outside of both, such as within zlib; otherwise the C code should
normally always be faster, right?

Looking at the profile for "git rev-list --objects --all > /dev/null" 
for the object enumeration phase, we have:

# Samples: 1814637
#
# Overhead          Command  Shared Object  Symbol
# ........  ...............  .............  ......
#
    28.81%              git  /home/nico/bin/git  [.] lookup_object
    12.21%              git  /lib64/libz.so.1.2.3  [.] inflate
    10.49%              git  /lib64/libz.so.1.2.3  [.] inflate_fast
     7.47%              git  /lib64/libz.so.1.2.3  [.] inflate_table
     6.66%              git  /lib64/libc-2.11.2.so  [.] __GI_memcpy
     5.66%              git  /home/nico/bin/git  [.] find_pack_entry_one
     2.98%              git  /home/nico/bin/git  [.] decode_tree_entry
     2.73%              git  /lib64/libc-2.11.2.so  [.] _int_malloc
     2.71%              git  /lib64/libz.so.1.2.3  [.] adler32
     2.63%              git  /home/nico/bin/git  [.] process_tree
     1.58%              git  [kernel]       [k] 0xffffffff8112fc0c
     1.44%              git  /lib64/libc-2.11.2.so  [.] __strlen_sse2
     1.31%              git  /home/nico/bin/git  [.] tree_entry
     1.10%              git  /lib64/libc-2.11.2.so  [.] _int_free
     0.96%              git  /home/nico/bin/git  [.] patch_delta
     0.92%              git  /lib64/libc-2.11.2.so  [.] malloc_consolidate
     0.86%              git  /lib64/libc-2.11.2.so  [.] __GI_vfprintf
     0.80%              git  /home/nico/bin/git  [.] create_object
     0.80%              git  /home/nico/bin/git  [.] lookup_blob
     0.63%              git  /home/nico/bin/git  [.] update_tree_entry
[...]

So we've got lookup_object() clearly at the top.  I suspect the 
hashcmp() in there, which probably gets inlined, is responsible for most 
of the cycles.  There is certainly a better way here, and probably in 
JGit you rely on some optimized facility provided by the 
language/library to perform that lookup.  So there are probably some 
easy improvements that can be made here.

Otherwise, at least 12.21 + 10.49 + 7.47 + 2.71 = 32.88% is spent 
directly in the zlib code, making it the biggest cost.  This is rather 
unavoidable unless the data structure is changed.  And pack v4 would 
probably move things such as find_pack_entry_one, decode_tree_entry, 
process_tree and tree_entry off the radar as well.

The object writeout phase should pretty much be network bound.

> I fully implemented the reuse of a cached pack behind a thin pack idea
> I was trying to describe in this thread.  It saved 1m7s off the JGit
> running time, but increased the data transfer by 25 MiB.  I didn't
> expect this much of an increase, I honestly expected the thin pack
> portion to be well, thinner.  The issue is the thin pack cannot delta
> against all of the history, its only delta compressing against the tip
> of the cached pack.  So long-lived side branches that forked off an
> older part of the history aren't delta compressing well, or at all,
> and that is significantly bloating the thin pack.  (Its also why that
> "newer" pack is 57M, but should be 14M if correctly combined with the
> cached pack.)  If I were to consider all of the objects in the cached
> pack as potential delta base candidates for the thin pack, the entire
> benefit of the cached pack disappears.

Yeah... this sucks.

> I'm not sure I like it so much anymore.  :-)
> 
> The idea was half-baked, and came at the end of a long day, and after
> putting my cranky infant son down to sleep way past his normal bed
> time.  I claim I was a sleep deprived new parent who wasn't thinking
> things through enough before writing an email to git@vger.

Well, this is still valuable information to archive.

And I wish I had been able to still write such quality emails when I was 
a new parent.  ;-)

> >> sendfile() call for the bulk of the content.  I think we can just hand
> >> off the major streaming to the kernel.
> >
> > While this might look like a good idea in theory, did you actually
> > profile it to see if that would make a noticeable difference?  The
> > pkt-line framing allows for asynchronous messages to be sent over a
> > sideband,
> 
> No, of course not.  The pkt-line framing is pretty low overhead, but
> copying kernel buffer to userspace back to kernel buffer sort of sucks
> for 400 MiB of data.  sendfile() on 400 MiB to a network socket is
> much easier when its all kernel space.

Of course.  But still... if you save 0.5 second by avoiding the copy to 
and from user space of that 400 MiB (based on my machine, which can do 
1670 MB/s), that's pretty much insignificant compared to the total time 
for the clone, and therefore the wrong thing to optimize given the 
required protocol changes.

> I figured, if it all worked
> out already to just dump the pack to the wire as-is, then we probably
> should also try to go for broke and reduce the userspace copying.  It
> might not matter to your desktop, but ask John Hawley (CC'd) about
> kernel.org and the git traffic volume he is serving.  They are doing
> more than 1 million git:// requests per day now.

Impressive.  However, I suspect that the vast majority of those requests 
are from clients making a connection just to realize they're already up 
to date.  I don't think the user space copying is really a problem.

Of course, if we could have used sendfile() freely in, say, 
copy_pack_data(), then we would have done so long ago.  But we are 
checksumming on the fly the data we create together with the data we 
reuse from disk, so this is not necessarily a gain.

> >> Plus we can safely do byte range requests for resumable clone within
> >> the cached pack part of the stream.
> >
> > That part I'm not sure of.  We are still facing the same old issues
> > here, as some mirrors might have the same commit edges for a cache pack
> > but not necessarily the same packing result, etc.  So I'd keep that out
> > of the picture for now.
> 
> I don't think its that hard.  If we modify the transfer protocol to
> allow the server to denote boundaries between packs, the server can
> send the pack name (as in pack-$name.pack) and the pack SHA-1 trailer
> to the client.  A client asking for resume of a cached pack presents
> its original want list, these two SHA-1s, and the byte offset he wants
> to restart from.  The server validates the want set is still
> reachable, that the cached pack exists, and that the cached pack tips
> are reachable from current refs.  If all of that is true, it validates
> the trailing SHA-1 in the pack matches what the client gave it.  If
> that matches, it should be OK to resume transfer from where the client
> asked for.

This is still a half solution.  If your network connection drops after 
the first 52 MiB of transfer in the scenario you provided, then you're 
still screwed.


Nicolas


* Re: [RFC] Add --create-cache to repack
  2011-01-29  4:08             ` Nicolas Pitre
@ 2011-01-29  4:35               ` Shawn Pearce
  0 siblings, 0 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-29  4:35 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, Jan 28, 2011 at 20:08, Nicolas Pitre <nico@fluxnic.net> wrote:
>> pack is actually smaller 376.30 MiB vs. C Git's 380.59 MiB.  I point
>> out this data because improvements made to JGit may show similar
>> improvements to CGit given how close they are in running time.
>
> What are those improvements?

None right now.  JGit is similar to CGit algorithm-wise.  (Actually it
looks like JGit has a faster diff implementation, but that's a
different email.)

If you are asking about why JGit created a slightly smaller pack
file... it splits the delta window during the threaded delta search
differently than CGit does, and we align our blocks slightly
differently when comparing two objects to generate a delta sequence
for them.  These two variations mean JGit produces different deltas
than CGit does.  Sometimes we are smaller, sometimes we are larger.
But it's a small difference, on the order of 1-4 MiB for something like
linux-2.6.  I don't think it's worthwhile trying to analyze the
specific differences between the implementations and retrofit them
into the other one.

What I was trying to say was, _if_ we made a change to JGit and it
dropped the running time, the same change in CGit should see _at
least_ the same running time improvement, if not better.  I was
pointing out that this cached-pack change dropped the running time by
1 minute, so CGit should also see a similar improvement (if not
better).  I would prefer to test against CGit for this sort of thing,
but it's been too long since I last poked pack-objects.c and the
revision code in CGit, while the JGit equivalents are really fresh in
my head.

> Now, the fact that JGit is so close to CGit must be because the actual
> cost is outside of them such as within zlib, otherwise the C code should
> normally always be faster, right?

Yup, I mostly agree with this statement.  CGit does a lot of
malloc/free activity when reading objects in.  JGit does too, but we
often fit into the young generation of the GC, which can sometimes be
faster at cleaning and recycling memory.  We're not too far off from C
code.

But yes... our profile looks like this too:

> Looking at the profile for "git rev-list --objects --all > /dev/null"
> for the object enumeration phase, we have:
>
> # Samples: 1814637
> #
> # Overhead          Command  Shared Object  Symbol
> # ........  ...............  .............  ......
> #
>    28.81%              git  /home/nico/bin/git  [.] lookup_object
>    12.21%              git  /lib64/libz.so.1.2.3  [.] inflate
>    10.49%              git  /lib64/libz.so.1.2.3  [.] inflate_fast
>     7.47%              git  /lib64/libz.so.1.2.3  [.] inflate_table
>     6.66%              git  /lib64/libc-2.11.2.so  [.] __GI_memcpy
>     5.66%              git  /home/nico/bin/git  [.] find_pack_entry_one
>     2.98%              git  /home/nico/bin/git  [.] decode_tree_entry
> [...]
>
> So we've got lookup_object() clearly at the top.

Isn't this the hash table lookup inside the revision pool, to see if
the object has already been visited?  That seems horrible; 28% of the
CPU is going to probing that table.

>  I suspect the
> hashcmp() in there, which probably gets inlined, is responsible for most
> cycles.

Probably true.  I know our hashcmp() is inlined; it's actually written
by hand as five word compares, and is marked final, so the JIT is
rather likely to inline it.
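The hand-written compare is roughly the following, assuming the
20-byte SHA-1 is held as five 32-bit words.  This is a hedged sketch
of the approach in Python, not the actual JGit code:

```python
def to_words(sha1):
    # Split a 20-byte SHA-1 into five 32-bit big-endian words, the
    # representation compared instead of twenty individual bytes.
    return tuple(int.from_bytes(sha1[i:i + 4], "big")
                 for i in range(0, 20, 4))

def hash_eq(a, b):
    # Five word compares instead of a generic loop; in Java this is
    # a small final method the JIT readily inlines.
    return (a[0] == b[0] and a[1] == b[1] and a[2] == b[2]
            and a[3] == b[3] and a[4] == b[4])
```

The early-out on the first mismatching word is what makes this cheap
in a hash table full of mostly-distinct SHA-1s.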

>  There is certainly a better way here, and probably in JGit you
> rely on some optimized facility provided by the language/library to
> perform that lookup.  So there is probably some easy improvements that
> can be made here.

Nope.  Actually we have to bend over backwards and work against the
language to get anything even reasonably sane for performance.  Our
"solution" in JGit has actually been used by Rob Pike to promote his
Go programming language and to argue why Java sucks as a language.
It's a great quote of mine that someone dragged up off the git@vger
mailing list and started using to promote Go.

At least once a week I envy how easy it is to use hashcmp() and
hashcpy() inside of CGit.  JGit's management of hashes is sh*t because
we have to bend so hard around the language.

> Otherwise it is at least 12.21 + 10.49 + 7.47 + 2.71 = 32.88% spent
> directly in the zlib code, making it the biggest cost.

Yeah, that's what we have too, about 33% inside of zlib code... which
is the same implementation that CGit uses.

>  This is rather
> unavoidable unless the data structure is changed.

We already knew this from our pack v4 experiments years ago.

>  And pack v4 would
> probably move things such as find_pack_entry_one, decode_tree_entry,
> process_tree and tree_entry off the radar as well.

This is hard to do inside of CGit if I recall... but yes, changing the
way trees are handled would really improve things.

> The object writeout phase should pretty much be network bound.

Yes.

>> I fully implemented the reuse of a cached pack behind a thin pack idea
>> I was trying to describe in this thread.  It saved 1m7s off the JGit
>> running time, but increased the data transfer by 25 MiB.
>
> Yeah... this sucks.

Very much.  :-(

But this is a fundamental issue with our incremental fetch support
anyway.  In this exact case, if the client was at that 1 month old
commit and fetched current master, he would pull 25 MiB of data...
but he only needed about 4-6 MiB worth of deltas if it was properly
delta compressed against the content we know he already has.  Our
server side optimization of only pushing the client's immediate
"have" list into the delta search window limits how much we can
compress the data we are sending.  If we were willing to push more in
on the server side, we could shrink the incremental fetch more.  But
that's a CPU problem on the server.

-- 
Shawn.


* Re: [RFC] Add --create-cache to repack
  2011-01-29  1:32           ` Shawn Pearce
  2011-01-29  2:34             ` Shawn Pearce
  2011-01-29  4:08             ` Nicolas Pitre
@ 2011-01-30  6:51             ` Junio C Hamano
  2011-01-30 17:14               ` Nicolas Pitre
  2011-01-30 19:29               ` Shawn Pearce
  2011-01-30 22:13             ` Shawn Pearce
  2011-01-31 18:47             ` Shawn Pearce
  4 siblings, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2011-01-30  6:51 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley

Shawn Pearce <spearce@spearce.org> writes:

> I fully implemented the reuse of a cached pack behind a thin pack idea
> I was trying to describe in this thread.  It saved 1m7s off the JGit
> running time, but increased the data transfer by 25 MiB.  I didn't
> expect this much of an increase, I honestly expected the thin pack
> portion to be well, thinner.  The issue is the thin pack cannot delta
> against all of the history, its only delta compressing against the tip
> of the cached pack.  So long-lived side branches that forked off an
> older part of the history aren't delta compressing well, or at all,
> and that is significantly bloating the thin pack.  (Its also why that
> "newer" pack is 57M, but should be 14M if correctly combined with the
> cached pack.)  If I were to consider all of the objects in the cached
> pack as potential delta base candidates for the thin pack, the entire
> benefit of the cached pack disappears.

What if you instead use the cached pack this way?

 0. You perform the proposed pre-traversal until you hit the tip of cached
    pack(s), and realize that you will end up sending everything.

 1. Instead of sending the new part of the history first and then sending
    the cached pack(s), you send the contents of cached pack(s), but also
    note what objects you sent;

 2. Then you send the new part of the history, taking full advantage of
    what you have already sent, perhaps doing only half of the reuse-delta
    logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
    object that is not a delta in an existing pack).
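The three-step flow can be sketched on toy objects.  This is a hedged
illustration only: `make_delta` stands in for a real delta encoder,
objects are plain byte strings, and real pack-objects works with
object ids and a bounded delta window rather than whole contents:

```python
def make_delta(base, target):
    # Toy "delta": drop the longest common prefix.  A stand-in for a
    # real delta encoder, which this sketch does not implement.
    i = 0
    while i < min(len(base), len(target)) and base[i] == target[i]:
        i += 1
    return target[i:]

def send_clone(cached_pack, new_objects):
    """cached_pack / new_objects: dicts of id -> bytes, new history
    last.  Returns the stream as a list of records."""
    stream, sent = [], {}
    # Steps 0-1: ship the cached pack verbatim, noting every object
    # that went over the wire.
    for oid, data in cached_pack.items():
        stream.append(("full", oid, data))
        sent[oid] = data
    # Step 2: send the new history, free to delta against *anything*
    # already sent, not just the cached pack's tip commit.
    for oid, data in new_objects.items():
        base = min(sent, key=lambda b: len(make_delta(sent[b], data)),
                   default=None)
        delta = make_delta(sent[base], data) if base else data
        if base is not None and len(delta) < len(data):
            stream.append(("delta", oid, base, delta))
        else:
            stream.append(("full", oid, data))
        sent[oid] = data
    return stream
```

Note that picking the best base by scanning everything already sent is
exactly the CPU cost concern raised in the replies; a real
implementation would have to bound the candidate set.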


* Re: [RFC] Add --create-cache to repack
  2011-01-29  2:34             ` Shawn Pearce
@ 2011-01-30  8:05               ` Junio C Hamano
  2011-01-30 19:43                 ` Shawn Pearce
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-01-30  8:05 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley

Shawn Pearce <spearce@spearce.org> writes:

> Using this for object enumeration shaves almost 1 minute off server
> packing time; the clone dropped from 3m28s to 2m29s.  That is close to
> what I was getting with the cached pack idea, but the network transfer
> stayed the small 376 MiB.

I like this result.

The amount of transfer being that small was something I didn't quite
expect, though.  Doesn't it indicate that our pathname-based object
clustering heuristic is not as effective as we hoped?


* Re: [RFC] Add --create-cache to repack
  2011-01-30  6:51             ` Junio C Hamano
@ 2011-01-30 17:14               ` Nicolas Pitre
  2011-01-30 17:41                 ` A Large Angry SCM
  2011-01-30 19:29               ` Shawn Pearce
  1 sibling, 1 reply; 26+ messages in thread
From: Nicolas Pitre @ 2011-01-30 17:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn Pearce, Johannes Sixt, git, John Hawley

On Sat, 29 Jan 2011, Junio C Hamano wrote:

> Shawn Pearce <spearce@spearce.org> writes:
> 
> > I fully implemented the reuse of a cached pack behind a thin pack idea
> > I was trying to describe in this thread.  It saved 1m7s off the JGit
> > running time, but increased the data transfer by 25 MiB.  I didn't
> > expect this much of an increase, I honestly expected the thin pack
> > portion to be well, thinner.  The issue is the thin pack cannot delta
> > against all of the history, its only delta compressing against the tip
> > of the cached pack.  So long-lived side branches that forked off an
> > older part of the history aren't delta compressing well, or at all,
> > and that is significantly bloating the thin pack.  (Its also why that
> > "newer" pack is 57M, but should be 14M if correctly combined with the
> > cached pack.)  If I were to consider all of the objects in the cached
> > pack as potential delta base candidates for the thin pack, the entire
> > benefit of the cached pack disappears.
> 
> What if you instead use the cached pack this way?
> 
>  0. You perform the proposed pre-traversal until you hit the tip of cached
>     pack(s), and realize that you will end up sending everything.
> 
>  1. Instead of sending the new part of the history first and then sending
>     the cached pack(s), you send the contents of cached pack(s), but also
>     note what objects you sent;
> 
>  2. Then you send the new part of the history, taking full advantage of
>     what you have already sent, perhaps doing only half of the reuse-delta
>     logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
>     object that is not a delta in an existing pack).

The problem is determining the best base object to delta against.  If 
you end up listing all the already sent objects and performing delta 
attempts against them for the remaining non-delta objects to find the 
best match, then you might end up taking more CPU time than the current 
enumeration phase.


Nicolas


* Re: [RFC] Add --create-cache to repack
  2011-01-30 17:14               ` Nicolas Pitre
@ 2011-01-30 17:41                 ` A Large Angry SCM
  0 siblings, 0 replies; 26+ messages in thread
From: A Large Angry SCM @ 2011-01-30 17:41 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, Shawn Pearce, Johannes Sixt, git, John Hawley

On 01/30/2011 12:14 PM, Nicolas Pitre wrote:
> On Sat, 29 Jan 2011, Junio C Hamano wrote:
>
>> Shawn Pearce<spearce@spearce.org>  writes:
>>
>>> I fully implemented the reuse of a cached pack behind a thin pack idea
>>> I was trying to describe in this thread.  It saved 1m7s off the JGit
>>> running time, but increased the data transfer by 25 MiB.  I didn't
>>> expect this much of an increase, I honestly expected the thin pack
>>> portion to be well, thinner.  The issue is the thin pack cannot delta
>>> against all of the history, its only delta compressing against the tip
>>> of the cached pack.  So long-lived side branches that forked off an
>>> older part of the history aren't delta compressing well, or at all,
>>> and that is significantly bloating the thin pack.  (Its also why that
>>> "newer" pack is 57M, but should be 14M if correctly combined with the
>>> cached pack.)  If I were to consider all of the objects in the cached
>>> pack as potential delta base candidates for the thin pack, the entire
>>> benefit of the cached pack disappears.
>>
>> What if you instead use the cached pack this way?
>>
>>   0. You perform the proposed pre-traversal until you hit the tip of cached
>>      pack(s), and realize that you will end up sending everything.
>>
>>   1. Instead of sending the new part of the history first and then sending
>>      the cached pack(s), you send the contents of cached pack(s), but also
>>      note what objects you sent;
>>
>>   2. Then you send the new part of the history, taking full advantage of
>>      what you have already sent, perhaps doing only half of the reuse-delta
>>      logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
>>      object that is not a delta in an existing pack).
>
> The problem is to determine the best base object to delta against.  If
> you end up listing all the already sent objects and perform delta
> attempts against them for the remaining non delta objects to find the
> best match then you might end up taking more CPU time than the current
> enumeration phase.

Why worry about the best here? Just add the object (or one of the 
objects) with the same path from the commit you found in step 0, above, 
to the delta base search for each object to pack.


* Re: [RFC] Add --create-cache to repack
  2011-01-30  6:51             ` Junio C Hamano
  2011-01-30 17:14               ` Nicolas Pitre
@ 2011-01-30 19:29               ` Shawn Pearce
  1 sibling, 0 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-30 19:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley

On Sat, Jan 29, 2011 at 22:51, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> I fully implemented the reuse of a cached pack behind a thin pack idea
>> I was trying to describe in this thread.  It saved 1m7s off the JGit
>> running time, but increased the data transfer by 25 MiB.  I didn't
>> expect this much of an increase, I honestly expected the thin pack
>> portion to be well, thinner.  The issue is the thin pack cannot delta
>> against all of the history, its only delta compressing against the tip
>> of the cached pack.  So long-lived side branches that forked off an
>> older part of the history aren't delta compressing well, or at all,
>> and that is significantly bloating the thin pack.  (Its also why that
>> "newer" pack is 57M, but should be 14M if correctly combined with the
>> cached pack.)  If I were to consider all of the objects in the cached
>> pack as potential delta base candidates for the thin pack, the entire
>> benefit of the cached pack disappears.
>
> What if you instead use the cached pack this way?
>
>  0. You perform the proposed pre-traversal until you hit the tip of cached
>    pack(s), and realize that you will end up sending everything.
>
>  1. Instead of sending the new part of the history first and then sending
>    the cached pack(s), you send the contents of cached pack(s), but also
>    note what objects you sent;

This is the part I was trying to avoid.  Making this list of objects
from the cached pack(s) costs working set inside of the pack-objects
process.  I had hoped that the cached packs would let me skip this
step.

But let's say that's an acceptable cost.  We cannot efficiently make a
useful list of objects from the pack.  Scanning the .idx file only
tells us the SHA-1.  It does not tell us the type, nor does it tell us
what the path hash code would be for the object if it were a tree or
blob.  So we cannot efficiently use this pack listing to construct the
delta window.
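To make that concrete: a version-2 pack .idx records, per object, only its name, CRC, and pack offset.  A rough, purely illustrative Python sketch (the real file also ends with pack/idx SHA-1 trailers and has a 64-bit offset table for huge packs, both omitted here):

```python
import struct

def build_idx_v2(shas):
    """Build a minimal synthetic pack .idx (version 2) over sorted SHA-1s.
    Simplified: trailer checksums and large-offset table are omitted."""
    fanout = [0] * 256
    for sha in shas:
        fanout[sha[0]] += 1
    for i in range(1, 256):
        fanout[i] += fanout[i - 1]          # cumulative counts by first byte
    nr = len(shas)
    body = b'\xfftOc' + struct.pack('>I', 2)
    body += struct.pack('>256I', *fanout)
    body += b''.join(shas)                  # 20-byte object names, sorted
    body += struct.pack('>%dI' % nr, *([0] * nr))          # CRC32 table
    body += struct.pack('>%dI' % nr, *range(12, 12 + nr))  # pack offsets
    return body

def idx_object_names(idx):
    """Everything a .idx tells us per object is its name (and offset).
    There is no type code and no path hash here, which is why scanning
    the index alone cannot seed a useful delta search window."""
    assert idx[:4] == b'\xfftOc'
    assert struct.unpack('>I', idx[4:8])[0] == 2
    fanout = struct.unpack('>256I', idx[8:8 + 1024])
    nr = fanout[255]
    names = idx[8 + 1024:8 + 1024 + 20 * nr]
    return [names[i * 20:(i + 1) * 20] for i in range(nr)]
```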

>  2. Then you send the new part of the history, taking full advantage of
>    what you have already sent, perhaps doing only half of the reuse-delta
>    logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
>    object that is not a delta in an existing pack).

Well, I guess we could go half-way.  We could try to use only
non-delta objects from the cached pack as potential delta bases for
this delta search.

To do that we would build the reverse index for the cached pack, then
check each object's type code just before we send that part of the
cached pack.  If it's non-delta, we can get its SHA-1 from the reverse
index, toss the object into the delta search list, and copy out the
length of the object until the next object starts.
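That reverse index is just the forward .idx resorted by offset; sketched in illustrative Python (here pack_size stands in for the pack length minus its 20-byte trailer):

```python
def build_reverse_index(entries, pack_size):
    """entries: (sha, offset) pairs as read from the forward .idx.
    Sorting by offset tells us where each object ends (the next one's
    start), so we can copy its raw bytes straight out of the cached
    pack and know its stored length."""
    by_offset = sorted(entries, key=lambda e: e[1])
    rindex = []
    for i, (sha, off) in enumerate(by_offset):
        end = by_offset[i + 1][1] if i + 1 < len(by_offset) else pack_size
        rindex.append((off, end - off, sha))
    return rindex
```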

However... I suspect our delta results would be the same as the thin
pack before cached pack test I did earlier.  The objects that are
non-delta in the cached pack are (in theory) approximately the objects
immediately reachable from the cached pack's tip.  That was already
put into the delta window as the base candidates for the thin pack.
This may be a faster way to find that thin pack edge, but the data
transfer will still be sub-optimal because we cannot consider deltas
as bases.

-- 
Shawn.

* Re: [RFC] Add --create-cache to repack
  2011-01-30  8:05               ` Junio C Hamano
@ 2011-01-30 19:43                 ` Shawn Pearce
  2011-01-30 20:02                   ` Junio C Hamano
  2011-01-30 22:26                   ` Nicolas Pitre
  0 siblings, 2 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-30 19:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley

On Sun, Jan 30, 2011 at 00:05, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> Using this for object enumeration shaves almost 1 minute off server
>> packing time; the clone dropped from 3m28s to 2m29s.  That is close to
>> what I was getting with the cached pack idea, but the network transfer
>> stayed the small 376 MiB.
>
> I like this result.

I'm really leaning towards putting this cached object list into JGit.

I need to shave that 1 minute off server CPU time. I can afford the 41
MiB disk (and kernel buffer cache), but I cannot really continue to
pay the 1 minute of CPU on each clone request for large repositories.
The object list of what is reachable from commit X isn't ever going to
change, and the path hash function is reasonably stable.  With a
version code in the file we can desupport old files if the path hash
function changes.  10% more disk/kernel memory is cheap for some of my
servers compared to 1 minute of CPU, even counting the explicit cache
management the server administrator must do to construct the file.
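A minimal sketch of such a versioned cache file (the magic, version field, and record layout here are invented for illustration, not an actual format):

```python
import struct

MAGIC = b'OBJCACHE'   # hypothetical magic for the cached object list
VERSION = 1           # bump whenever the path hash function changes

def write_cache(path, entries):
    """entries: (sha1_bytes, type_code, path_hash32) triples."""
    with open(path, 'wb') as f:
        f.write(MAGIC + struct.pack('>II', VERSION, len(entries)))
        for sha, tcode, phash in entries:
            f.write(sha + struct.pack('>BI', tcode, phash))

def read_cache(path):
    """Return the cached list, or None to desupport a stale version."""
    with open(path, 'rb') as f:
        hdr = f.read(16)
        if hdr[:8] != MAGIC:
            raise ValueError('not a cache file')
        version, nr = struct.unpack('>II', hdr[8:16])
        if version != VERSION:
            return None            # fall back to a full object walk
        recs = [f.read(25) for _ in range(nr)]
    return [(r[:20],) + struct.unpack('>BI', r[20:25]) for r in recs]
```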

> The amount of transfer being that small was something I didn't quite
> expect, though.  Doesn't it indicate that our pathname based object
> clustering heuristics is not as effective as we hoped?

I'm not sure I follow your question.

I think the problem here is old side branches that got recently
merged.  Their _best_ delta base was some old revision, possibly close
to where they branched off from.  Using a newer version of the file
for the delta base created a much larger delta.  E.g. consider a file
where in more recent revisions a function was completely rewritten.
If you have to delta compress against that new version, but you use
the older definition of the function, you need to use insert
instructions for the entire content of that old function.  But if you
can delta compress against the version you branched from (or one much
closer to it in time), your delta would be very small as that function
is handled by the smaller copy instruction.
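The effect is easy to see with a toy delta cost model (illustrative Python; git's real delta encoding differs, but the copy-vs-insert economics are the same):

```python
from difflib import SequenceMatcher

def toy_delta_size(base, target):
    """Rough delta cost: each copy op costs ~7 bytes, inserts cost their
    literal length.  A stand-in for git's delta encoding, only meant to
    show how much the choice of base matters."""
    sm = SequenceMatcher(None, base, target, autojunk=False)
    cost = 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == 'equal':
            cost += 7              # one copy instruction
        else:
            cost += (j2 - j1)      # literal insert of target bytes
    return cost

old_fn = b'int f(void) { return legacy_helper(); }\n' * 8
new_fn = b'int f(void) { return shiny_rewrite(); }\n' * 8
common = b'/* unrelated stable code */\n' * 40
branch_version = common + old_fn + b'/* side-branch tweak */\n'

# Delta against the old base (what the branch forked from) is tiny...
small = toy_delta_size(common + old_fn, branch_version)
# ...but against the rewritten tip we must insert the old function whole.
large = toy_delta_size(common + new_fn, branch_version)
assert small < large
```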

Our clustering heuristics work fine.

Our thin-pack selection of potential delta base candidates is not
working well.  We are not very aggressive in loading the delta base
window with potential candidates, which means we miss some really good
compression opportunities.


Ooooh.

I think my test was flawed.  I injected the cached pack's tip as the
edge for the new stuff to delta compress against.  I should have
injected all of the merge bases between the cached pack's tip and the
new stuff.  Although the cached pack tip is one of the merge bases,
it's not all of them.  If we inject all of the merge bases, we can find
the revision that this old side branch is based on, and possibly get a
better delta candidate for it.

IIRC, upload-pack would have walked backwards further and found the
merge base for that side branch, and it would have been part of the
delta base candidates.  I think I need to re-do my cached pack test.
Good thing I have history of my source code saved in this fancy
revision control thingy called "git".  :-)

-- 
Shawn.

* Re: [RFC] Add --create-cache to repack
  2011-01-30 19:43                 ` Shawn Pearce
@ 2011-01-30 20:02                   ` Junio C Hamano
  2011-01-30 20:20                     ` Shawn Pearce
  2011-01-30 22:26                   ` Nicolas Pitre
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2011-01-30 20:02 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley

Shawn Pearce <spearce@spearce.org> writes:

>> The amount of transfer being that small was something I didn't quite
>> expect, though.  Doesn't it indicate that our pathname based object
>> clustering heuristics is not as effective as we hoped?
>
> I'm not sure I follow your question.

I didn't see path information in your cachefile that contains C commits, T
trees, etc., yet it sped up the object enumeration and you didn't observe
much transfer inflation over stock git.

> Ooooh.
>
> I think my test was flawed.  I injected the cached pack's tip as the
> edge for the new stuff to delta compress against.

That is one of the things I was wondering.  I manually created a thin pack
with only the 1-month-old tip as boundary, and another with all the
boundaries that can be found by rev-list.  I didn't find much difference
in the result, though, as "rev-list --boundary --all --not $onemontholdtip"
had only a few boundary entries in my test.

* Re: [RFC] Add --create-cache to repack
  2011-01-30 20:02                   ` Junio C Hamano
@ 2011-01-30 20:20                     ` Shawn Pearce
  0 siblings, 0 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-30 20:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, Johannes Sixt, git, John Hawley

On Sun, Jan 30, 2011 at 12:02, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>>> The amount of transfer being that small was something I didn't quite
>>> expect, though.  Doesn't it indicate that our pathname based object
>>> clustering heuristics is not as effective as we hoped?
>>
>> I'm not sure I follow your question.
>
> I didn't see path information in your cachefile that contains C commits, T
> trees, etc. that sped up the object enumeration, but you didn't observe
> much transfer inflation over the stock git.

I didn't store the path itself, I stored the path hash as a 4 byte
int.  It's smaller, but still helps to schedule the object into the
right position in the delta search.
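For reference, that hash is in the spirit of git's pack-objects name hash: it folds the last handful of non-whitespace characters into 32 bits, weighting the end of the path most, so names sharing a suffix (all the foo.c files) land near each other in the delta search ordering.  A Python rendering of the idea:

```python
def path_name_hash(path):
    """A 32-bit path hash in the spirit of git's pack name hash:
    effectively a sortable number built from the tail of the name,
    skipping whitespace, so "a/foo.c" and "b/foo.c" hash close
    together and get sorted near each other for the delta search."""
    h = 0
    for c in path:
        if c.isspace():
            continue
        h = ((h >> 2) + (ord(c) << 24)) & 0xFFFFFFFF
    return h
```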

-- 
Shawn.

* Re: [RFC] Add --create-cache to repack
  2011-01-29  1:32           ` Shawn Pearce
                               ` (2 preceding siblings ...)
  2011-01-30  6:51             ` Junio C Hamano
@ 2011-01-30 22:13             ` Shawn Pearce
  2011-01-31 18:47             ` Shawn Pearce
  4 siblings, 0 replies; 26+ messages in thread
From: Shawn Pearce @ 2011-01-30 22:13 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
>
> I fully implemented the reuse of a cached pack behind a thin pack idea
> I was trying to describe in this thread.  It saved 1m7s off the JGit
> running time, but increased the data transfer by 25 MiB.  I didn't
> expect this much of an increase, I honestly expected the thin pack
> portion to be well, thinner.

JGit's thin pack creation is crap.  For example, this is the same fetch:

$ git fetch ../tmp_linux26
remote: Counting objects: 61521, done.
remote: Compressing objects: 100% (12096/12096), done.
remote: Total 50275 (delta 42578), reused 45220 (delta 37524)
Receiving objects: 100% (50275/50275), 11.13 MiB | 7.29 MiB/s, done.
Resolving deltas: 100% (42578/42578), completed with 4968 local objects.

$ git fetch git://localhost/tmp_linux26
remote: Counting objects: 144190, done
remote: Finding sources: 100% (50275/50275)
remote: Compressing objects: 100% (106568/106568)
remote: Compressing objects: 100% (12750/12750)
Receiving objects: 100% (50275/50275), 24.66 MiB | 10.93 MiB/s, done.
Resolving deltas: 100% (40345/40345), completed with 2218 local objects.


JGit produced an extra 13.53 MiB for this pack, because it missed
about 2,233 delta opportunities.  It turns out we are too aggressive
at pushing objects from the edges into the delta windows.  JGit pushes
*everything* in the edge commits, rather than only the paths that are
actually used by the objects we need to send.  This floods the delta
search window with garbage, and makes it less likely that an object to
be sent will find a relevant delta base in the search window.
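The fix amounts to filtering the edge objects by path before seeding the window; a sketch in illustrative Python (the dict fields are invented for the example):

```python
def seed_delta_window(objects_to_send, edge_objects):
    """Admit an edge object as a delta base candidate only if its path
    is actually carried by something we must send; pushing *everything*
    from the edge commits floods the window and evicts useful bases."""
    needed_paths = {o['path'] for o in objects_to_send}
    return [o for o in edge_objects if o['path'] in needed_paths]
```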

-- 
Shawn.

* Re: [RFC] Add --create-cache to repack
  2011-01-30 19:43                 ` Shawn Pearce
  2011-01-30 20:02                   ` Junio C Hamano
@ 2011-01-30 22:26                   ` Nicolas Pitre
  1 sibling, 0 replies; 26+ messages in thread
From: Nicolas Pitre @ 2011-01-30 22:26 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Junio C Hamano, Johannes Sixt, git, John Hawley

On Sun, 30 Jan 2011, Shawn Pearce wrote:

> On Sun, Jan 30, 2011 at 00:05, Junio C Hamano <gitster@pobox.com> wrote:
> > Shawn Pearce <spearce@spearce.org> writes:
> >
> >> Using this for object enumeration shaves almost 1 minute off server
> >> packing time; the clone dropped from 3m28s to 2m29s.  That is close to
> >> what I was getting with the cached pack idea, but the network transfer
> >> stayed the small 376 MiB.
> >
> > I like this result.
> 
> I'm really leaning towards putting this cached object list into JGit.
> 
> I need to shave that 1 minute off server CPU time. I can afford the 41
> MiB disk (and kernel buffer cache), but I cannot really continue to
> pay the 1 minute of CPU on each clone request for large repositories.
> The object list of what is reachable from commit X isn't ever going to
> change, and the path hash function is reasonably stable.  With a
> version code in the file we can desupport old files if the path hash
> function changes.  10% more disk/kernel memory is cheap for some of my
> servers compared to 1 minute of CPU, and some explicit cache
> management by the server administrator to construct the file.

Yep, I think this is probably the best short term solution.  Just walk 
the commit graph as usual, and whenever the commit tip from the cache is 
matched then just shove the entire cache content in the object list.

And let's hope that eventually some future developments will make this 
cache redundant and obsolete.


Nicolas

* Re: [RFC] Add --create-cache to repack
  2011-01-29  1:32           ` Shawn Pearce
                               ` (3 preceding siblings ...)
  2011-01-30 22:13             ` Shawn Pearce
@ 2011-01-31 18:47             ` Shawn Pearce
  2011-01-31 21:48               ` Nicolas Pitre
  4 siblings, 1 reply; 26+ messages in thread
From: Shawn Pearce @ 2011-01-31 18:47 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
>>> >
>>> >> This started because I was looking for a way to speed up clones coming
>>> >> from a JGit server.  Cloning the linux-2.6 repository is painful,
>
> Well, scratch the idea in this thread.  I think.

Nope, I'm back in favor with this after fixing JGit's thin pack
generation.  Here's why.

Take linux-2.6.git as of Jan 12th, with the cache root as of Dec 28th:

  $ git update-ref HEAD f878133bf022717b880d0e0995b8f91436fd605c
  $ git-repack.sh --create-cache \
      --cache-root=b52e2a6d6d05421dea6b6a94582126af8cd5cca2 \
      --cache-include=v2.6.11-tree
  $ git repack -a -d

  $ ls -lh objects/pack/
  total 456M
  1.4M pack-74af5edca80797736fe4de7279b2a81af98470a5.idx
  38M pack-74af5edca80797736fe4de7279b2a81af98470a5.pack

  49M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.idx
  89 pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.keep
  368M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.pack

Our "recent history" is 38M, and our "cached pack" is 368M.  It's a bit
more disk than is strictly necessary; this should be ~380M.  Call it
~26M of wasted disk.  The "cached object list" I proposed elsewhere in
this thread would cost about 41M of disk and is utterly useless except
for initial clones.  Here we are wasting about 26M of disk to have
slightly shorter delta chains in the cached pack (otherwise known as
our ancient history).  So it's a slightly smaller waste, and we get
some (minor) benefit.


Clone without pack caching:

  $ time git clone --bare git://localhost/tmp_linux26_withTag tmp_in.git
  Cloning into bare repository tmp_in.git...
  remote: Counting objects: 1861830, done
  remote: Finding sources: 100% (1861830/1861830)
  remote: Getting sizes: 100% (88243/88243)
  remote: Compressing objects: 100% (88184/88184)
  Receiving objects: 100% (1861830/1861830), 376.01 MiB | 19.01 MiB/s, done.
  remote: Total 1861830 (delta 4706), reused 1851053 (delta 1553844)
  Resolving deltas: 100% (1564621/1564621), done.

  real	3m19.005s
  user	1m36.250s
  sys	0m10.290s


Clone with pack caching:

  $ time git clone --bare git://localhost/tmp_linux26_withTag tmp_in.git
  Cloning into bare repository tmp_in.git...
  remote: Counting objects: 1601, done
  remote: Counting objects: 1828460, done
  remote: Finding sources: 100% (50475/50475)
  remote: Getting sizes: 100% (18843/18843)
  remote: Compressing objects: 100% (7585/7585)
  remote: Total 1861830 (delta 2407), reused 1856197 (delta 37510)
  Receiving objects: 100% (1861830/1861830), 378.40 MiB | 31.31 MiB/s, done.
  Resolving deltas: 100% (1559477/1559477), done.

  real	2m2.938s
  user	1m35.890s
  sys	0m9.830s


Using the cached pack increased our total data transfer by 2.39 MiB,
but saved 1m17s on server computation time.  If we go back and look at
our cached pack size (368M), the leading thin-pack should be about
10.4 MiB (378.40M - 368M = 10.4M).  If I modify the tmp_in.git client
to have only the cached pack's tip and fetch using CGit, we see the
thin pack to bring ourselves current is 11.07 MiB (JGit does this in
10.96 MiB):

  $ cd tmp_in.git
  $ git update-ref HEAD b52e2a6d6d05421dea6b6a94582126af8cd5cca2
  $ git repack -a -d  ; # yay we are at ~1 month ago

  $ time git fetch ../tmp_linux26_withTag
  remote: Counting objects: 60570, done.
  remote: Compressing objects: 100% (11924/11924), done.
  remote: Total 49804 (delta 42196), reused 44837 (delta 37231)
  Receiving objects: 100% (49804/49804), 11.07 MiB | 7.37 MiB/s, done.
  Resolving deltas: 100% (42196/42196), completed with 4956 local objects.
  From ../tmp_linux26_withTag
   * branch            HEAD       -> FETCH_HEAD

  real	0m35.083s
  user	0m25.710s
  sys	0m1.190s


The pack caching feature is *no worse* in transfer size than if the
client copied the pack from 1 month ago, and then did an incremental
fetch to bring themselves current.  Compared to the naive clone, it
saves an incredible amount of working set space and CPU time.  The
server only needs to keep track of the incremental thin pack, and can
completely ignore the ancient history objects.  It's a great
alternative for projects that want users to rsync/http dumb transport
down a large stable repository, then incremental fetch themselves
current.  Or busy mirror sites that are willing to trade some small
bandwidth for server CPU and memory.

In this particular example, there is ~11 MiB of data that cannot be
safely resumed, or the first 2.9%.  At 56 KiB/s, a client needs to get
through the first 3 minutes of transfer before they can reach the
resumable checkpoint (where the thin pack ends, and the cached pack
starts).  It would be better if we could resume anywhere in the
stream, but being able to resume the last 97% is infinitely better
than being able to resume nothing.  If someone wants to really go
crazy, this is where a "gittorrent" client could start up and handle
the remaining 97% of the transfer.  :-)
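Checking the arithmetic on those figures:

```python
# Non-resumable lead-in: the ~11 MiB thin pack, out of a 378.40 MiB clone.
thin_mib, total_mib, rate_kib_s = 11.07, 378.40, 56
seconds_to_checkpoint = thin_mib * 1024 / rate_kib_s
percent_unresumable = 100 * thin_mib / total_mib
assert round(seconds_to_checkpoint / 60, 1) == 3.4   # ~3 minutes at 56 KiB/s
assert round(percent_unresumable, 1) == 2.9          # the first ~2.9%
```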


I think this is worthwhile.  If we are afraid of the extra 2.39 MiB
data transfer this forces on the client when the repository owner
enables the feature, we should go back and improve our thin-pack code.
Transferring 11 MiB to catch up a kernel from Dec 28th to Jan 12th
sounds like a lot of data, and any improvements in the general
thin-pack code would shrink the leading thin-pack, possibly getting us
that 2.39 MiB back.

-- 
Shawn.

* Re: [RFC] Add --create-cache to repack
  2011-01-31 18:47             ` Shawn Pearce
@ 2011-01-31 21:48               ` Nicolas Pitre
  0 siblings, 0 replies; 26+ messages in thread
From: Nicolas Pitre @ 2011-01-31 21:48 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Johannes Sixt, git, Junio C Hamano, John Hawley

On Mon, 31 Jan 2011, Shawn Pearce wrote:

> On Fri, Jan 28, 2011 at 17:32, Shawn Pearce <spearce@spearce.org> wrote:
> >>> >
> >>> >> This started because I was looking for a way to speed up clones coming
> >>> >> from a JGit server.  Cloning the linux-2.6 repository is painful,
> >
> > Well, scratch the idea in this thread.  I think.
> 
> Nope, I'm back in favor with this after fixing JGit's thin pack
> generation.  Here's why.
> 
> Take linux-2.6.git as of Jan 12th, with the cache root as of Dec 28th:
> 
>   $ git update-ref HEAD f878133bf022717b880d0e0995b8f91436fd605c
>   $ git-repack.sh --create-cache \
>       --cache-root=b52e2a6d6d05421dea6b6a94582126af8cd5cca2 \
>       --cache-include=v2.6.11-tree
>   $ git repack -a -d
> 
>   $ ls -lh objects/pack/
>   total 456M
>   1.4M pack-74af5edca80797736fe4de7279b2a81af98470a5.idx
>   38M pack-74af5edca80797736fe4de7279b2a81af98470a5.pack
> 
>   49M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.idx
>   89 pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.keep
>   368M pack-d3e77c8b3045c7f54fa2fb6bbfd4dceca1e2b9fa.pack
> 
> Our "recent history" is 38M, and our "cached pack" is 368M.  It's a bit
> more disk than is strictly necessary; this should be ~380M.  Call it
> ~26M of wasted disk.

This is fine.  When doing an incremental fetch, the thin pack does
minimize the transfer size, but it does increase the stored pack size by
appending a bunch of non-delta objects to make the pack complete.

What happens, though, is that when gc kicks in, the wasted space is
collected back.  Here, with a single pack, we wouldn't reclaim that
space, as our current heuristic is to reuse delta (non-)pairing by
default.  Maybe in that case we could simply not reuse deltas if they're
of the REF_DELTA type.

> The "cached object list" I proposed elsewhere in
> this thread would cost about 41M of disk and is utterly useless except
> for initial clones.  Here we are wasting about 26M of disk to have
> slightly shorter delta chains in the cached pack (otherwise known as
> our ancient history).  So its a slightly smaller waste, and we get
> some (minor) benefit.

Well, of course the ancient history you're willing to keep stable for a 
while could be repacked even more aggressively than usual.

> Using the cached pack increased our total data transfer by 2.39 MiB,

That's more than acceptable IMHO. That's less than 1% of the total 
transfer.

> I think this is worthwhile.  If we are afraid of the extra 2.39 MiB
> data transfer this forces on the client when the repository owner
> enables the feature, we should go back and improve our thin-pack code.
>  Transferring 11 MiB to catch up a kernel from Dec 28th to Jan 12th
> sounds like a lot of data, 

Well, your timing for this test corresponds with the 2.6.38 merge
window, which is a high-activity peak for this repository.  Still, that
would probably fit the usage scenario in practice pretty well, where the
cached pack would be produced on a tagged release, which happens right
before the merge window.


> and any improvements in the general
> thin-pack code would shrink the leading thin-pack, possibly getting us
> that 2.39 MiB back.

Any improvement to the thin pack would require more CPU cycles, possibly
a lot more.  So given this transfer overhead is already less than 1%, I
don't think we need to bother.


Nicolas

end of thread, other threads:[~2011-01-31 21:48 UTC | newest]

Thread overview: 26+ messages
2011-01-28  8:06 [RFC] Add --create-cache to repack Shawn O. Pearce
2011-01-28  9:08 ` Johannes Sixt
2011-01-28 14:37   ` Shawn Pearce
2011-01-28 15:33     ` Johannes Sixt
2011-01-28 18:22       ` Shawn Pearce
2011-01-28 19:15       ` Jay Soffian
2011-01-28 19:19         ` Shawn Pearce
2011-01-28 18:46     ` Nicolas Pitre
2011-01-28 19:15       ` Shawn Pearce
2011-01-28 21:09         ` Nicolas Pitre
2011-01-29  1:32           ` Shawn Pearce
2011-01-29  2:34             ` Shawn Pearce
2011-01-30  8:05               ` Junio C Hamano
2011-01-30 19:43                 ` Shawn Pearce
2011-01-30 20:02                   ` Junio C Hamano
2011-01-30 20:20                     ` Shawn Pearce
2011-01-30 22:26                   ` Nicolas Pitre
2011-01-29  4:08             ` Nicolas Pitre
2011-01-29  4:35               ` Shawn Pearce
2011-01-30  6:51             ` Junio C Hamano
2011-01-30 17:14               ` Nicolas Pitre
2011-01-30 17:41                 ` A Large Angry SCM
2011-01-30 19:29               ` Shawn Pearce
2011-01-30 22:13             ` Shawn Pearce
2011-01-31 18:47             ` Shawn Pearce
2011-01-31 21:48               ` Nicolas Pitre
