All of lore.kernel.org
 help / color / mirror / Atom feed
* thin packs ending up fat
@ 2012-01-12 22:15 Jeff King
  2012-01-12 22:32 ` Jeff King
  2012-01-13  8:28 ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 12+ messages in thread
From: Jeff King @ 2012-01-12 22:15 UTC (permalink / raw)
  To: git; +Cc: Nicolas Pitre

[This ended up long; the gist of it is that we often fail to find thin
 deltas, making small fetches much bigger than they could be. Read on
 for details].

Recently I was doing some work related to bundles containing thin packs,
and I wanted to generate some thin packs. Much to my surprise, all of my
packs ended up having no thin objects!

It turns out that when packing a subset of a fully packed repo (as we do
for a bundle or for a fetch), we tend not to make thin packs at all.
The culprit is this logic in try_delta:

        /*
         * We do not bother to try a delta that we discarded
         * on an earlier try, but only when reusing delta data.
         */
        if (reuse_delta && trg_entry->in_pack &&
            trg_entry->in_pack == src_entry->in_pack &&
            trg_entry->in_pack_type != OBJ_REF_DELTA &&
            trg_entry->in_pack_type != OBJ_OFS_DELTA)
                return 0;

This does not try to delta objects which are found in the same pack,
under the assumption that we would have tried them last time and found
that they didn't work. For a complete repack, it works well. We don't
miss useful deltas, and it saves a lot of CPU time.

For a thin pack, though, it means we'll typically fail to actually use
the boundary objects. I think what happens is this:

  1. You start at commit C which contains a file with content A. You
     build commit C', which contains content A'.

  2. You repack the repository, putting A and A' in the pack. Our delta
     rules tend to create deltas in reverse-recency order, so A
     typically is stored as a delta based on A', and A' is stored
     whole.

  3. You try to create a thin pack (e.g., with "git bundle create
     foo.bundle C..C'"). We know that we want to send A', and that we
     can assume the other side has A to use as a base in our thin pack.

     But the logic above says "A and A' are both in the same pack, and
     we didn't find a delta for A'. They must not be a good match". But
     in this case it is because our deltas go in the reverse direction,
     and we never tried them.

So I did a little experimenting (scripts and inputs are at the end of
the email). I generated a thin pack for a subset of objects from my
fully packed repo, measuring the size of the resulting pack and the CPU
time used. I did this both with stock git, and a version of git with the
above optimization removed.

I used two datasets to generate the input to pack-objects. In the first,
I used the reflog entries from my refs/remotes/origin/master ref. That
simulates the workload of upload-packs performed for me by kernel.org
over the last month or so. In the second dataset, I took packed the
objects between git v1.0.0 and v1.1.0, between v1.1.0 and v1.2.0, and so
on. This was an attempt to simulate somebody who fetches much less
frequently, or perhaps somebody who generates bundles to sneaker-net
individual versions.

For each dataset, I calculated the average size (in bytes) over all of
the generated packs in the dataset and the average CPU time used per
pack (in seconds).

                  dataset
            | fetches | tags
---------------------------------
     before | 53358   | 2750977
size  after | 32315   | 2652939
     change |   -39%  |      -4%
---------------------------------
     before |  0.18   | 1.12
CPU   after |  0.19   | 1.50
     change |    +5%  |     +33%

For the case of small, frequent packs, we saved a lot of space for a
little bit of extra CPU. But for larger packs, our savings were much
more limited, and we spent a lot more CPU time.  That makes sense; this
is really beneficial for finding extra matches at the boundaries, and
for cases where you can avoid sending an entire undeltified object at
all for a given path.

So I think this is worth pursuing as an optimization to save bandwidth
on transfers, but it would require figuring out the right times to turn
it on. My initial idea was that we would want it off whenever we were
doing a thin pack. But the "tags" dataset shows that it's probably not
worth the trouble for large packs due to the extra CPU time.

I have a feeling that the case it is generally catching is this "reverse
delta" case. I.e., A is based off of B, we are including B, but A is
available as a thin base. But that can probably be extended into a delta
chain (A is based from B which is based on C; we want to include B and
C, but can build off of A as a base. We'll keep the delta from B to C,
but we would want to delta C off of A). So detecting that they are part
of the same chain might require walking the chains.

Maybe it is enough to simply turn off this optimization if the potential
delta source is not being included in the pack (i.e., we are using
--thin and it is a boundary object). Because if both objects are being
sent, we will just end up reusing the delta that goes in the reverse
direction anyway.

Or maybe there is some other clever way of detecting this situation. I'm
open to suggestions.

-Peff

---
This is the test script I used (feed the dataset files on stdin, get
lines of sizes and CPU times on stdout):

-- >8 --
#!/bin/sh

while read revs; do
	for i in $revs; do
		echo $i
	done >input
	echo >&2 "==> $revs"
	time -o time.out -f "%e" \
		git pack-objects --thin --revs --stdout <input >pack.out
	echo "`stat --format=%s pack.out` `cat time.out`"
done
-- 8< --

and the "fetch" dataset:

-- >8 --
8b0e15fa95e11965f18c8d2585dc8ffd9bfc9356 ^7f41b6bbe3181dc4d1687db036bf22316997a1bf
34c4461ae3b353e8c931565d5527b98ed12e3735 ^8b0e15fa95e11965f18c8d2585dc8ffd9bfc9356
463b0ea22b5b9a882e8140d0308433d8cbd0d1fe ^34c4461ae3b353e8c931565d5527b98ed12e3735
288396994f077eec7e7db0017838a5afbfbf81e3 ^463b0ea22b5b9a882e8140d0308433d8cbd0d1fe
05f6edcd2a418a88eeb953d51408a6aeef312f5c ^288396994f077eec7e7db0017838a5afbfbf81e3
08cfdbb88cd6225b4fc4b8a3cecd0e01758c835d ^05f6edcd2a418a88eeb953d51408a6aeef312f5c
87009edcbd0b4987ccb7ba050a1efe368a315753 ^08cfdbb88cd6225b4fc4b8a3cecd0e01758c835d
8963314c77af9a4eda5dcbdbab3d4001af83ad81 ^87009edcbd0b4987ccb7ba050a1efe368a315753
10b2a48113b8ab6b8f48229eb40fc3637ce025ae ^8963314c77af9a4eda5dcbdbab3d4001af83ad81
997a1946a55cafb992c4ba8e5e0795aa73f5a4a9 ^10b2a48113b8ab6b8f48229eb40fc3637ce025ae
e8e1c29021da446d0c50573ef9619bf74f515c20 ^997a1946a55cafb992c4ba8e5e0795aa73f5a4a9
87bf9a7048c623b3567f612ca3e67a0d412fc83d ^e8e1c29021da446d0c50573ef9619bf74f515c20
ee6dfb2d83ba1b057943e705f707fa27e34e47f9 ^87bf9a7048c623b3567f612ca3e67a0d412fc83d
4cb6764227173a6483edbdad09121651bc0b01c3 ^ee6dfb2d83ba1b057943e705f707fa27e34e47f9
8a042478967679b0dab3137f5f18367a0ffdc48a ^4cb6764227173a6483edbdad09121651bc0b01c3
248dbbe83256202f0edd6e1468d01cfbe27fd733 ^8a042478967679b0dab3137f5f18367a0ffdc48a
bc1bbe0c19a6ff39522b4fa3259f34150e308e1f ^248dbbe83256202f0edd6e1468d01cfbe27fd733
4d2440fe0daa9ad1556dfd220af8b3a883cf849d ^bc1bbe0c19a6ff39522b4fa3259f34150e308e1f
f56ef114eeefee755f422e6cbef2d83974cb90f1 ^4d2440fe0daa9ad1556dfd220af8b3a883cf849d
e14d63198867c545d0662afc00bf7be048bf2231 ^f56ef114eeefee755f422e6cbef2d83974cb90f1
017d1e134545db0d162908f3538077eaa1f34fb6 ^e14d63198867c545d0662afc00bf7be048bf2231
fc14b89a7e6899b8ac3b5751ec2d8c98203ea4c2 ^017d1e134545db0d162908f3538077eaa1f34fb6
406da7803217998ff6bf5dc69c55b1613556c2f4 ^fc14b89a7e6899b8ac3b5751ec2d8c98203ea4c2
7e02a6c63a183270b726bb21640059ae16fa48ae ^406da7803217998ff6bf5dc69c55b1613556c2f4
4cb5d10b14dcbe0155bed9c45ccb94e83bd4c599 ^7e02a6c63a183270b726bb21640059ae16fa48ae
9859a023fef30ffebdd22ad9639c587ac720b8b6 ^4cb5d10b14dcbe0155bed9c45ccb94e83bd4c599
57526fde5df201a99afa6d122c3266b3a1c5673a ^9859a023fef30ffebdd22ad9639c587ac720b8b6
73c6b3575bc638b7096ec913bd91193707e2265d ^57526fde5df201a99afa6d122c3266b3a1c5673a
10f4eb652ee4e592f91f638e579d1afcb96c0408 ^73c6b3575bc638b7096ec913bd91193707e2265d
ee228024933069b93ce23a1bd5eeb7ae12c792f2 ^10f4eb652ee4e592f91f638e579d1afcb96c0408
d16520499d2652b5b59dfb25f9cf2d56a4c6913a ^ee228024933069b93ce23a1bd5eeb7ae12c792f2
876a6f4991abdd72ea707b193b4f2b831096ad3c ^d16520499d2652b5b59dfb25f9cf2d56a4c6913a
8d68493f20db71abeb77adc251b2a116fe45cdaa ^876a6f4991abdd72ea707b193b4f2b831096ad3c
3daff7c31998faedbe0dd7e2b8651e351be40d64 ^8d68493f20db71abeb77adc251b2a116fe45cdaa
e443bdfe1e8e1ef8b3665cfd1c1295bd73e13773 ^3daff7c31998faedbe0dd7e2b8651e351be40d64
5d6dfc7cb140a6eb90138334fab2245b69bc8bc4 ^e443bdfe1e8e1ef8b3665cfd1c1295bd73e13773
ec330158ec04849fe5ff2cb8749797cd63ae592b ^5d6dfc7cb140a6eb90138334fab2245b69bc8bc4
17b4e93d5b849293e6a3659bbc4075ed8a6e97e2 ^ec330158ec04849fe5ff2cb8749797cd63ae592b
4570aeb0d85f3b5ff274b6d5a651c2ee06d25d76 ^17b4e93d5b849293e6a3659bbc4075ed8a6e97e2
247f9d23da8cfd255533433ad2aa07d172afac0b ^4570aeb0d85f3b5ff274b6d5a651c2ee06d25d76
eac2d83247ea0a265d923518c26873bb12c33778 ^247f9d23da8cfd255533433ad2aa07d172afac0b
beecc7ab65b31c5471331e64acaa3f722125ea67 ^eac2d83247ea0a265d923518c26873bb12c33778
7e521640c80b4bb871bca7a9259621a7abb303e7 ^beecc7ab65b31c5471331e64acaa3f722125ea67
0e1cfc52de002e2d9b0e6562e8672fee3bf45a67 ^7e521640c80b4bb871bca7a9259621a7abb303e7
-- 8< --

and the "tags" dataset:

-- >8 --
v1.1.0 ^v1.0.0
v1.2.0 ^v1.1.0
v1.3.0 ^v1.2.0
v1.4.0 ^v1.3.0
v1.5.0 ^v1.4.0
v1.6.0 ^v1.5.0
v1.7.0 ^v1.6.0
-- 8< --

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thin packs ending up fat
  2012-01-12 22:15 thin packs ending up fat Jeff King
@ 2012-01-12 22:32 ` Jeff King
  2012-01-12 23:54   ` Nicolas Pitre
                     ` (2 more replies)
  2012-01-13  8:28 ` Ævar Arnfjörð Bjarmason
  1 sibling, 3 replies; 12+ messages in thread
From: Jeff King @ 2012-01-12 22:32 UTC (permalink / raw)
  To: git; +Cc: Nicolas Pitre

On Thu, Jan 12, 2012 at 05:15:23PM -0500, Jeff King wrote:

> It turns out that when packing a subset of a fully packed repo (as we do
> for a bundle or for a fetch), we tend not to make thin packs at all.
> The culprit is this logic in try_delta:
> 
>         /*
>          * We do not bother to try a delta that we discarded
>          * on an earlier try, but only when reusing delta data.
>          */
>         if (reuse_delta && trg_entry->in_pack &&
>             trg_entry->in_pack == src_entry->in_pack &&
>             trg_entry->in_pack_type != OBJ_REF_DELTA &&
>             trg_entry->in_pack_type != OBJ_OFS_DELTA)
>                 return 0;
> [...]
> Maybe it is enough to simply turn off this optimization if the potential
> delta source is not being included in the pack (i.e., we are using
> --thin and it is a boundary object). Because if both objects are being
> sent, we will just end up reusing the delta that goes in the reverse
> direction anyway.

Hmm. It turns out this is really easy, because we have already marked
such objects as preferred bases.

So with this patch:

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 96c1680..d05e228 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1439,6 +1439,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
 	 */
 	if (reuse_delta && trg_entry->in_pack &&
 	    trg_entry->in_pack == src_entry->in_pack &&
+	    !src_entry->preferred_base &&
 	    trg_entry->in_pack_type != OBJ_REF_DELTA &&
 	    trg_entry->in_pack_type != OBJ_OFS_DELTA)
 		return 0;

here are the numbers I get:

                  dataset
            | fetches | tags
---------------------------------
     before | 53358   | 2750977
size  after | 32398   | 2668479
     change |   -39%  |      -3%
---------------------------------
     before |  0.18   | 1.12
CPU   after |  0.18   | 1.15
     change |    +0%  |      +3%

So nearly all of the size benefit, but very little CPU change (even the
3% on the larger-pack case is close to the levels of run-to-run noise).
Obviously the size benefit in the larger-pack case isn't impressive, but
I think the "fetches" case is much more indicative of a real server
load.

-Peff

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: thin packs ending up fat
  2012-01-12 22:32 ` Jeff King
@ 2012-01-12 23:54   ` Nicolas Pitre
  2012-01-13  0:14   ` Junio C Hamano
  2012-01-13  1:31   ` Junio C Hamano
  2 siblings, 0 replies; 12+ messages in thread
From: Nicolas Pitre @ 2012-01-12 23:54 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, 12 Jan 2012, Jeff King wrote:

> On Thu, Jan 12, 2012 at 05:15:23PM -0500, Jeff King wrote:
> 
> > It turns out that when packing a subset of a fully packed repo (as we do
> > for a bundle or for a fetch), we tend not to make thin packs at all.
> > The culprit is this logic in try_delta:
> > 
> >         /*
> >          * We do not bother to try a delta that we discarded
> >          * on an earlier try, but only when reusing delta data.
> >          */
> >         if (reuse_delta && trg_entry->in_pack &&
> >             trg_entry->in_pack == src_entry->in_pack &&
> >             trg_entry->in_pack_type != OBJ_REF_DELTA &&
> >             trg_entry->in_pack_type != OBJ_OFS_DELTA)
> >                 return 0;
> > [...]
> > Maybe it is enough to simply turn off this optimization if the potential
> > delta source is not being included in the pack (i.e., we are using
> > --thin and it is a boundary object). Because if both objects are being
> > sent, we will just end up reusing the delta that goes in the reverse
> > direction anyway.
> 
> Hmm. It turns out this is really easy, because we have already marked
> such objects as preferred bases.

That's exactly what I was about to suggest after reading your first 
email.

> So with this patch:
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 96c1680..d05e228 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1439,6 +1439,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
>  	 */
>  	if (reuse_delta && trg_entry->in_pack &&
>  	    trg_entry->in_pack == src_entry->in_pack &&
> +	    !src_entry->preferred_base &&
>  	    trg_entry->in_pack_type != OBJ_REF_DELTA &&
>  	    trg_entry->in_pack_type != OBJ_OFS_DELTA)
>  		return 0;

Acked-by: Nicolas Pitre <nico@fluxnic.net>

> here are the numbers I get:
> 
>                   dataset
>             | fetches | tags
> ---------------------------------
>      before | 53358   | 2750977
> size  after | 32398   | 2668479
>      change |   -39%  |      -3%
> ---------------------------------
>      before |  0.18   | 1.12
> CPU   after |  0.18   | 1.15
>      change |    +0%  |      +3%
> 
> So nearly all of the size benefit, but very little CPU change (even the
> 3% on the larger-pack case is close to the levels of run-to-run noise).
> Obviously the size benefit in the larger-pack case isn't impressive, but
> I think the "fetches" case is much more indicative of a real server
> load.

Indeed.  Please make sure to capture those numbers in the commit log.


Nicolas

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thin packs ending up fat
  2012-01-12 22:32 ` Jeff King
  2012-01-12 23:54   ` Nicolas Pitre
@ 2012-01-13  0:14   ` Junio C Hamano
  2012-01-13  1:31   ` Junio C Hamano
  2 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2012-01-13  0:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Nicolas Pitre

Jeff King <peff@peff.net> writes:

> Hmm. It turns out this is really easy, because we have already marked
> such objects as preferred bases.
>
> So with this patch:
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 96c1680..d05e228 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1439,6 +1439,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
>  	 */
>  	if (reuse_delta && trg_entry->in_pack &&
>  	    trg_entry->in_pack == src_entry->in_pack &&
> +	    !src_entry->preferred_base &&
>  	    trg_entry->in_pack_type != OBJ_REF_DELTA &&
>  	    trg_entry->in_pack_type != OBJ_OFS_DELTA)
>  		return 0;

Yeah, I was wondering why this one-liner was not in the original message
;-)

> here are the numbers I get:
>
>                   dataset
>             | fetches | tags
> ---------------------------------
>      before | 53358   | 2750977
> size  after | 32398   | 2668479
>      change |   -39%  |      -3%
> ---------------------------------
>      before |  0.18   | 1.12
> CPU   after |  0.18   | 1.15
>      change |    +0%  |      +3%

Looks good.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thin packs ending up fat
  2012-01-12 22:32 ` Jeff King
  2012-01-12 23:54   ` Nicolas Pitre
  2012-01-13  0:14   ` Junio C Hamano
@ 2012-01-13  1:31   ` Junio C Hamano
  2012-01-13  1:51     ` Jeff King
  2012-01-13  2:19     ` Nicolas Pitre
  2 siblings, 2 replies; 12+ messages in thread
From: Junio C Hamano @ 2012-01-13  1:31 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Nicolas Pitre

From: Jeff King <peff@peff.net>
Subject: [PATCH] thin-pack: try harder to create delta against preferred base

When creating a pack using objects that reside in existing packs, we try
to avoid recomputing futile delta between an object (trg) and a candidate
for its base object (src) if they are stored in the same packfile, and trg
is not recorded as a delta already. This heuristics makes sense because it
is likely that we tried to express trg as a delta based on src but it did
not produce a good delta when we created the existing pack.

As the pack heuristics prefer producing delta to remove data, and Linus's
law dictates that the size of a file grows over time, we tend to record
the newest version of the file as inflated, and older ones as delta
against it.

When creating a thin-pack to transfer recent history, it is likely that we
will try to send an object that is recorded in full, as it is newer.  But
the heuristics to avoid recomputing futile delta effectively forbids us
from attempting to express such an object as a delta based on another
object. Sending an object in full is often more expensive than sending a
suboptimal delta based on other objects, and it is even more so if we
could use an object we know the receiving end already has (i.e. referred
base object) as the delta base.

Tweak the recomputation avoidance logic, so that we do not punt on
computing delta against a preferred base object.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 builtin/pack-objects.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c6e2d87..8bfe3a6 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1248,11 +1248,16 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
 		return -1;
 
 	/*
-	 * We do not bother to try a delta that we discarded
-	 * on an earlier try, but only when reusing delta data.
+	 * We do not bother to try a delta that we discarded on an
+	 * earlier try, but only when reusing delta data.  Note that
+	 * src_entry that is marked as the preferred_base should always
+	 * be considered, as even if we produce a suboptimal delta against
+	 * it, we will still save the transfer cost, as we already know
+	 * the other side has it and we won't send src_entry at all.
 	 */
 	if (reuse_delta && trg_entry->in_pack &&
 	    trg_entry->in_pack == src_entry->in_pack &&
+	    !src_entry->preferred_base &&
 	    trg_entry->in_pack_type != OBJ_REF_DELTA &&
 	    trg_entry->in_pack_type != OBJ_OFS_DELTA)
 		return 0;
-- 
1.7.9.rc0.53.gc8bc2

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: thin packs ending up fat
  2012-01-13  1:31   ` Junio C Hamano
@ 2012-01-13  1:51     ` Jeff King
  2012-01-13  1:59       ` Jeff King
  2012-01-13  7:19       ` Junio C Hamano
  2012-01-13  2:19     ` Nicolas Pitre
  1 sibling, 2 replies; 12+ messages in thread
From: Jeff King @ 2012-01-13  1:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Nicolas Pitre

On Thu, Jan 12, 2012 at 05:31:42PM -0800, Junio C Hamano wrote:

> From: Jeff King <peff@peff.net>
> Subject: [PATCH] thin-pack: try harder to create delta against preferred base

I just sat down to write a nicer commit message, and behold, it was done
for me. Thanks.

> When creating a thin-pack to transfer recent history, it is likely that we
> will try to send an object that is recorded in full, as it is newer.  But
> the heuristics to avoid recomputing futile delta effectively forbids us
> from attempting to express such an object as a delta based on another
> object. Sending an object in full is often more expensive than sending a
> suboptimal delta based on other objects, and it is even more so if we
> could use an object we know the receiving end already has (i.e. referred
> base object) as the delta base.

s/referred/preferred/

> Tweak the recomputation avoidance logic, so that we do not punt on
> computing delta against a preferred base object.
> 
> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Other than that, it looks good to me.

Signed-off-by: Jeff King <peff@peff.net>

I'll try to deploy this to GitHub in the near future. I doubt we'll see
much of a dent in our bandwidth, though, as small fetches that this
helps are probably lost in the noise of clones.

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thin packs ending up fat
  2012-01-13  1:51     ` Jeff King
@ 2012-01-13  1:59       ` Jeff King
  2012-01-13  7:19       ` Junio C Hamano
  1 sibling, 0 replies; 12+ messages in thread
From: Jeff King @ 2012-01-13  1:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Nicolas Pitre

On Thu, Jan 12, 2012 at 08:51:17PM -0500, Jeff King wrote:

> > Tweak the recomputation avoidance logic, so that we do not punt on
> > computing delta against a preferred base object.
> > 
> > Signed-off-by: Junio C Hamano <gitster@pobox.com>
> 
> Other than that, it looks good to me.
> 
> Signed-off-by: Jeff King <peff@peff.net>

Actually, Nicolas asked for the numbers to go into the
commit message (and gave his acked-by).  So maybe append:

The effect of this patch can be seen on two simulated
upload-pack workloads. The first is based on 44 reflog
entries from my git.git origin/master reflog, and represents
the packs that kernel.org produced to send me git updates
for the past month or two. The second workload represents
much larger fetches, going from git's v1.0.0 tag to v1.1.0,
then v1.1.0 to v1.2.0, and so on.

The table below shows the average generated pack size and
the average CPU time consumed for each dataset, both before
and after the patch:

                  dataset
            | reflog | tags
---------------------------------
     before | 53358  | 2750977
size  after | 32398  | 2668479
     change |   -39% |      -3%
---------------------------------
     before |  0.18  | 1.12
CPU   after |  0.18  | 1.15
     change |    +0% |      +3%

This patch makes a much bigger difference for packs with a
shorter slice of history (since its effect is seen at the
boundaries of the pack) though it has some benefit even for
larger packs.

Acked-by: Nicolas Pitre <nico@fluxnic.net>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thin packs ending up fat
  2012-01-13  1:31   ` Junio C Hamano
  2012-01-13  1:51     ` Jeff King
@ 2012-01-13  2:19     ` Nicolas Pitre
  1 sibling, 0 replies; 12+ messages in thread
From: Nicolas Pitre @ 2012-01-13  2:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git

On Thu, 12 Jan 2012, Junio C Hamano wrote:

> From: Jeff King <peff@peff.net>
> Subject: [PATCH] thin-pack: try harder to create delta against preferred base
> 
> When creating a pack using objects that reside in existing packs, we try
> to avoid recomputing futile delta between an object (trg) and a candidate
> for its base object (src) if they are stored in the same packfile, and trg
> is not recorded as a delta already. This heuristics makes sense because it
> is likely that we tried to express trg as a delta based on src but it did
> not produce a good delta when we created the existing pack.
> 
> As the pack heuristics prefer producing delta to remove data, and Linus's
> law dictates that the size of a file grows over time, we tend to record
> the newest version of the file as inflated, and older ones as delta
> against it.
> 
> When creating a thin-pack to transfer recent history, it is likely that we
> will try to send an object that is recorded in full, as it is newer.  But
> the heuristics to avoid recomputing futile delta effectively forbids us
> from attempting to express such an object as a delta based on another
> object. Sending an object in full is often more expensive than sending a
> suboptimal delta based on other objects, and it is even more so if we
> could use an object we know the receiving end already has (i.e. referred
> base object) as the delta base.
> 
> Tweak the recomputation avoidance logic, so that we do not punt on
> computing delta against a preferred base object.
> 
> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Acked-by: Nicolas Pitre <nico@fluxnic.net>

> ---
> 
>  builtin/pack-objects.c |    9 +++++++--
>  1 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index c6e2d87..8bfe3a6 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1248,11 +1248,16 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
>  		return -1;
>  
>  	/*
> -	 * We do not bother to try a delta that we discarded
> -	 * on an earlier try, but only when reusing delta data.
> +	 * We do not bother to try a delta that we discarded on an
> +	 * earlier try, but only when reusing delta data.  Note that
> +	 * src_entry that is marked as the preferred_base should always
> +	 * be considered, as even if we produce a suboptimal delta against
> +	 * it, we will still save the transfer cost, as we already know
> +	 * the other side has it and we won't send src_entry at all.
>  	 */
>  	if (reuse_delta && trg_entry->in_pack &&
>  	    trg_entry->in_pack == src_entry->in_pack &&
> +	    !src_entry->preferred_base &&
>  	    trg_entry->in_pack_type != OBJ_REF_DELTA &&
>  	    trg_entry->in_pack_type != OBJ_OFS_DELTA)
>  		return 0;
> -- 
> 1.7.9.rc0.53.gc8bc2
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thin packs ending up fat
  2012-01-13  1:51     ` Jeff King
  2012-01-13  1:59       ` Jeff King
@ 2012-01-13  7:19       ` Junio C Hamano
  2012-01-13 15:15         ` Jeff King
  1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2012-01-13  7:19 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Nicolas Pitre

Jeff King <peff@peff.net> writes:

> On Thu, Jan 12, 2012 at 05:31:42PM -0800, Junio C Hamano wrote:
>
>> From: Jeff King <peff@peff.net>
>> Subject: [PATCH] thin-pack: try harder to create delta against preferred base
>
> I just sat down to write a nicer commit message, and behold, it was done
> for me. Thanks.

Thank _you_ for a quick response and correction.

I wanted to make sure I understood the root cause of the issue and the
approach the patch takes to address it, instead of committing something
that smelled correct. And the only way I know to do so is to write it
down.

Especiallly, before coming up with the description, I was wondering if
this kind of symptom appears in non-thin cases, but after writing down the
justification for this patch, it became clear that we wouldn't have to
worry too much about that case. In a non-thin pack, we need to record one
object at least in a delta family in inflated base form, so we may as well
send that one near the tip that is already in that form for that, letting
the existing "avoid futile delta" heuristics to kick in. Other objects in
the same delta family will delta against it.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thin packs ending up fat
  2012-01-12 22:15 thin packs ending up fat Jeff King
  2012-01-12 22:32 ` Jeff King
@ 2012-01-13  8:28 ` Ævar Arnfjörð Bjarmason
  2012-01-13 15:55   ` Jeff King
  1 sibling, 1 reply; 12+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2012-01-13  8:28 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Nicolas Pitre

On Thu, Jan 12, 2012 at 23:15, Jeff King <peff@peff.net> wrote:

I'd be interested in testing this on some other repos I have. How
exactly did you get this:

> -- >8 --
> 8b0e15fa95e11965f18c8d2585dc8ffd9bfc9356 ^7f41b6bbe3181dc4d1687db036bf22316997a1bf
> 34c4461ae3b353e8c931565d5527b98ed12e3735 ^8b0e15fa95e11965f18c8d2585dc8ffd9bfc9356
> 463b0ea22b5b9a882e8140d0308433d8cbd0d1fe ^34c4461ae3b353e8c931565d5527b98ed12e3735
> 288396994f077eec7e7db0017838a5afbfbf81e3 ^463b0ea22b5b9a882e8140d0308433d8cbd0d1fe
> 05f6edcd2a418a88eeb953d51408a6aeef312f5c ^288396994f077eec7e7db0017838a5afbfbf81e3
> 08cfdbb88cd6225b4fc4b8a3cecd0e01758c835d ^05f6edcd2a418a88eeb953d51408a6aeef312f5c

>From this:

> In the first, I used the reflog entries from my
> refs/remotes/origin/master ref.

I can't make "git reflog refs/remotes/..." show me anything similar.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thin packs ending up fat
  2012-01-13  7:19       ` Junio C Hamano
@ 2012-01-13 15:15         ` Jeff King
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2012-01-13 15:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Nicolas Pitre

On Thu, Jan 12, 2012 at 11:19:37PM -0800, Junio C Hamano wrote:

> I wanted to make sure I understood the root cause of the issue and the
> approach the patch takes to address it, instead of committing something
> that smelled correct. And the only way I know to do so is to write it
> down.

I use something of the same technique myself...

> Especiallly, before coming up with the description, I was wondering if
> this kind of symptom appears in non-thin cases, but after writing down the
> justification for this patch, it became clear that we wouldn't have to
> worry too much about that case. In a non-thin pack, we need to record one
> object at least in a delta family in inflated base form, so we may as well
> send that one near the tip that is already in that form for that, letting
> the existing "avoid futile delta" heuristics to kick in. Other objects in
> the same delta family will delta against it.

Exactly. You asked earlier why my one-liner patch was not in the first
message. It's because it was only through writing the first message that
I came to the same realization. :)

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thin packs ending up fat
  2012-01-13  8:28 ` Ævar Arnfjörð Bjarmason
@ 2012-01-13 15:55   ` Jeff King
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2012-01-13 15:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Nicolas Pitre

On Fri, Jan 13, 2012 at 09:28:43AM +0100, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Jan 12, 2012 at 23:15, Jeff King <peff@peff.net> wrote:
> 
> I'd be interested in testing this on some other repos I have. How
> exactly did you get this:
> 
> > -- >8 --
> > 8b0e15fa95e11965f18c8d2585dc8ffd9bfc9356 ^7f41b6bbe3181dc4d1687db036bf22316997a1bf
> > 34c4461ae3b353e8c931565d5527b98ed12e3735 ^8b0e15fa95e11965f18c8d2585dc8ffd9bfc9356
> > 463b0ea22b5b9a882e8140d0308433d8cbd0d1fe ^34c4461ae3b353e8c931565d5527b98ed12e3735
> > 288396994f077eec7e7db0017838a5afbfbf81e3 ^463b0ea22b5b9a882e8140d0308433d8cbd0d1fe
> > 05f6edcd2a418a88eeb953d51408a6aeef312f5c ^288396994f077eec7e7db0017838a5afbfbf81e3
> > 08cfdbb88cd6225b4fc4b8a3cecd0e01758c835d ^05f6edcd2a418a88eeb953d51408a6aeef312f5c
> 
> From this:
> 
> > In the first, I used the reflog entries from my
> > refs/remotes/origin/master ref.
> 
> I can't make "git reflog refs/remotes/..." show me anything similar.

I just looked in the reflog file directly:

  perl -alne 'print "$F[1] ^$F[0]"' \
    .git/logs/refs/remotes/origin/master

Note that these are just an approximation of what was fetched each time.
The real fetches packed objects for other refs, too. But with respect to
this patch, the result should be the same (since the interesting thing
is the graph boundaries that are packed, so the size of the slice of
history is what's important).

The other slight inaccuracy is that my tests repack the whole repo, and
then simulate old fetches from that repo. When I actually fetched
7f415b6..8b0e15fa from upstream, all of those later commits didn't
actually exist yet. And therefore the blobs in those tip commits were
actually more likely to be non-delta, and therefore this optimization is
more likely to help (because we already re-tried deltas if the target
object was delta'd against something else).

At least in theory. In practice, the variance of the improvement it
provides to each pack is so quite high. Some packs don't benefit at all,
because the preferred bases simply aren't that useful, or are perhaps
already being used. Some packs can have improvements in the 70-80%
range.

The numbers I reported were averaged over all of the packs in the
dataset. You can see individual packs with the script I provided
earlier:

  # before patch
  ./test-pack <dataset-fetch >output-fetch-before
  # after patch
  ./test-pack <dataset-fetch >output-fetch-after

  # collate results
  paste output-fetch-{before,after} |
  perl -alne 'print "$F[0] $F[2] " .
                int(0.5 + (($F[0] - $F[2]) / $F[0] * 100))'

The output is below. You can also feed the percentages through this poor
man's graph generator to see just how wildly it varies:

  perl -alne 'print "*" x $F[2]'

-Peff

-- >8 --
154471 92495 40
41970 24681 41
102816 71062 31
21089 21089 0
57900 13345 77
75666 61518 19
163516 107785 34
44913 15795 65
85452 77832 9
102034 93336 9
10095 8283 18
86078 65495 24
1854 1854 0
16079 8125 49
13830 7119 49
1283 1283 0
4697 2633 44
4584 3818 17
21535 18953 12
3800 3800 0
13963 10215 27
4453 4073 9
1029 598 42
111233 68267 39
879 879 0
96628 86206 11
3287 3287 0
71212 20832 71
132916 100268 25
170716 40959 76
1830 1262 31
243950 133932 45
64588 31546 51
1108 1108 0
6488 6488 0
101179 64014 37
114760 18969 83
23443 19445 17
28207 28080 0
2317 2317 0
110890 53555 52
5660 5543 2
18683 18683 0
4709 4709 0

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2012-01-13 15:56 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-12 22:15 thin packs ending up fat Jeff King
2012-01-12 22:32 ` Jeff King
2012-01-12 23:54   ` Nicolas Pitre
2012-01-13  0:14   ` Junio C Hamano
2012-01-13  1:31   ` Junio C Hamano
2012-01-13  1:51     ` Jeff King
2012-01-13  1:59       ` Jeff King
2012-01-13  7:19       ` Junio C Hamano
2012-01-13 15:15         ` Jeff King
2012-01-13  2:19     ` Nicolas Pitre
2012-01-13  8:28 ` Ævar Arnfjörð Bjarmason
2012-01-13 15:55   ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.