* gc --aggressive
@ 2012-04-17 16:16 Jay Soffian
  2012-04-17 17:53 ` Jay Soffian
  2012-04-17 22:08 ` Jeff King
  0 siblings, 2 replies; 26+ messages in thread
From: Jay Soffian @ 2012-04-17 16:16 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Shawn Pearce

For a couple years now I've had a maintenance script which repacks all
the repos at @dayjob thusly:

  git config repack.usedeltabaseoffset true
  git config pack.compression 9
  git config pack.indexversion 2
  git config gc.autopacklimit 4
  git config gc.packrefs true
  git config gc.reflogexpire never
  git config gc.reflogexpireunreachable never
  git gc --auto --aggressive --prune

This has worked fine on repos large and small. However, starting a
couple days ago git started running out of memory on a relatively
modest repo[*] while repacking on a Linux box with 12GB memory (+ 12GB
swap). I am able to gc the repo by either removing --aggressive or
.keep'ing the oldest pack.
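
For reference, the .keep workaround is just a marker file dropped next to the
pack; a rough sketch (pack name made up, bare repo assumed):

  touch objects/pack/pack-1234abcd.keep   # this pack will be left alone by repack
  git gc --auto --aggressive --prune
  rm objects/pack/pack-1234abcd.keep      # remove the marker when done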

[*] Stats:

  du -hs objects
  141M	objects

  git count-objects -v
  count: 0
  size: 0
  in-pack: 57656
  packs: 37
  size-pack: 143811
  prune-packable: 0
  garbage: 0

  git version 1.7.10

I've since found a message from Shawn recommending against using --aggressive:

  http://groups.google.com/group/repo-discuss/msg/d2462eed67813571

> Junio Hamano and I looked at things a few weeks ago; it turns out the
> --aggressive flag doesn't generally provide a benefit like we thought
> it would. It would be safe to remove from your GC script, and will
> speed things up considerably.

A couple questions:

1) If --aggressive does not generally provide a benefit, should it be
made a no-op?

2) Is it expected that gc --aggressive would run out of memory on this repo?

I've posted the repo in case anyone wants to take a look:

  http://dl.dropbox.com/u/2138120/WebKit-trimmed.git.zip

j.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-17 16:16 gc --aggressive Jay Soffian
@ 2012-04-17 17:53 ` Jay Soffian
  2012-04-17 20:52   ` Matthieu Moy
  2012-04-28 16:56   ` Nicolas Pitre
  2012-04-17 22:08 ` Jeff King
  1 sibling, 2 replies; 26+ messages in thread
From: Jay Soffian @ 2012-04-17 17:53 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Shawn Pearce, Jeff King

On Tue, Apr 17, 2012 at 12:16 PM, Jay Soffian <jaysoffian@gmail.com> wrote:
> This has worked fine on repos large and small. However, starting a
> couple days ago git started running out of memory on a relatively
> modest repo[*] while repacking on a Linux box with 12GB memory (+ 12GB
> swap). I am able to gc the repo by either removing --aggressive or
> .keep'ing the oldest pack.

Experimentally, setting pack.windowMemory = 256m keeps git memory
usage < 4.5 GB during an aggressive repack.

Ironically I end up with a slightly worse pack (63115590 bytes vs
61518628 bytes) than not using --aggressive. I assume this is because
pack-objects found a better delta chain during the previous aggressive
repack when windowMemory was not set.
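
One way to check would be another full repack without delta reuse but under
the same memory cap, e.g.:

  git repack -adf --window-memory=256m

If that comes out larger than the pack made without --aggressive, the smaller
size really did come from reusing the older, unbounded-window deltas.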

> 1) If --aggressive does not generally provide a benefit, should it be
> made a no-op?

I guess I'll revise this question: perhaps --aggressive should be
better explained/discouraged. I found a message from Jeff last month
and stole his words for this patch:

<snip>
diff --git i/Documentation/git-gc.txt w/Documentation/git-gc.txt
index 815afcb922..ca5bf8b51e 100644
--- i/Documentation/git-gc.txt
+++ w/Documentation/git-gc.txt
@@ -37,9 +37,8 @@ OPTIONS
 	Usually 'git gc' runs very quickly while providing good disk
 	space utilization and performance.  This option will cause
 	'git gc' to more aggressively optimize the repository at the expense
-	of taking much more time.  The effects of this optimization are
-	persistent, so this option only needs to be used occasionally; every
-	few hundred changesets or so.
+	of taking much more time and potentially using greater memory. This
+	option is rarely needed. See Repacking below.

 --auto::
 	With this option, 'git gc' checks whether any housekeeping is
@@ -138,6 +137,39 @@ If you are expecting some objects to be collected
and they aren't, check
 all of those locations and decide whether it makes sense in your case to
 remove those references.

+Repacking
+---------
+
+Under the covers 'git gc' calls several commands to optimize the repository.
+The most significant of these with respect to repository size and general
+performance is linkgit:git-repack[1]. There are basically three levels of
+'gc' with respect to repacking:
+
+ 1. `git gc --auto`; if there are too many loose objects (`gc.auto`), they
+    all go into a new incremental pack. If there are already too many
+    packs (`gc.autopacklimit`), all of the existing packs are re-packed
+    together.
+
+    Making an incremental pack is by far the fastest because the speed is
+    independent of the existing repository history. If git packs
+    everything together, it should be more or less the same as (2).
+
+ 2. `git gc`; this packs everything into a single pack. It uses default
+    window and depth parameters, but importantly, it reuses existing
+    deltas. Doing so makes the delta compression phase much faster, and it
+    often makes the writing phase faster (because for older objects, git
+    is primarily streaming them right out of the existing pack). On a big
+    repository though, this does do a lot of I/O, because git has to
+    rewrite the whole pack.
+
+ 3. `git gc --aggressive`; this is often much slower than (2) because git
+    throws out all of the existing deltas and recomputes them from
+    scratch. It uses a higher window parameter meaning it will spend
+    more time computing, and it may end up with a smaller pack. However,
+    unless the repository is known to have initially been poorly packed,
+    this option is not needed and will just cause git to perform
+    extra work.
+
 HOOKS
 -----

@@ -147,6 +179,7 @@ linkgit:githooks[5] for more information.

 SEE ALSO
 --------
+linkgit:git-pack-refs[1]
 linkgit:git-prune[1]
 linkgit:git-reflog[1]
 linkgit:git-repack[1]
</snip>

Thoughts?

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-17 17:53 ` Jay Soffian
@ 2012-04-17 20:52   ` Matthieu Moy
  2012-04-17 21:58     ` Jeff King
  2012-04-28 12:25     ` Jeff King
  2012-04-28 16:56   ` Nicolas Pitre
  1 sibling, 2 replies; 26+ messages in thread
From: Matthieu Moy @ 2012-04-17 20:52 UTC (permalink / raw)
  To: Jay Soffian; +Cc: git, Junio C Hamano, Shawn Pearce, Jeff King

Jay Soffian <jaysoffian@gmail.com> writes:

> + 3. `git gc --aggressive`; this is often much slower than (2) because git
> +    throws out all of the existing deltas and recomputes them from
> +    scratch. It uses a higher window parameter meaning it will spend
> +    more time computing, and it may end up with a smaller pack. However,
> +    unless the repository is known to have initially been poorly packed,
> +    this option is not needed and will just cause git to perform
> +    extra work.

I like your patch.

Maybe you should elaborate on "unless the repository is known to have
initially been poorly packed". My understanding is that --aggressive was
implemented to be called after an import from another VCS that would
have computed very poor deltas, but I'm not sure about the details.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-17 20:52   ` Matthieu Moy
@ 2012-04-17 21:58     ` Jeff King
  2012-04-28 12:25     ` Jeff King
  1 sibling, 0 replies; 26+ messages in thread
From: Jeff King @ 2012-04-17 21:58 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: Jay Soffian, git, Junio C Hamano, Shawn Pearce

On Tue, Apr 17, 2012 at 10:52:03PM +0200, Matthieu Moy wrote:

> Jay Soffian <jaysoffian@gmail.com> writes:
> 
> > + 3. `git gc --aggressive`; this is often much slower than (2) because git
> > +    throws out all of the existing deltas and recomputes them from
> > +    scratch. It uses a higher window parameter meaning it will spend
> > +    more time computing, and it may end up with a smaller pack. However,
> > +    unless the repository is known to have initially been poorly packed,
> > +    this option is not needed and will just cause git to perform
> > +    extra work.
> 
> I like your patch.

Me too. I guess it is not surprising since I wrote the initial draft. ;)

> Maybe you should elaborate on "unless the repository is known to have
> initially been poorly packed". My understanding is that --aggressive was
> implemented to be called after an import from another VCS that would
> have computed very poor deltas, but I'm not sure about the details.

Yes, that's exactly it. fast-import will generate packs, but they are
often not optimal. So if you have done a big import, you should
definitely "git gc --aggressive" as the final step. I don't know how
something like a remote-helper would work, where it is fast-importing
little bits at a time. Probably a regular repack would be fine, since it
will be considering deltas between objects in different packs anyway.
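
Roughly, for a one-shot import (stream name made up):

  git init converted && cd converted
  git fast-import <../project.fi   # fast-import's pack is usually poorly deltified
  git gc --aggressive              # redo all the deltas from scratch, once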

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-17 16:16 gc --aggressive Jay Soffian
  2012-04-17 17:53 ` Jay Soffian
@ 2012-04-17 22:08 ` Jeff King
  2012-04-17 22:17   ` Junio C Hamano
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2012-04-17 22:08 UTC (permalink / raw)
  To: Jay Soffian; +Cc: git, Junio C Hamano, Shawn Pearce

On Tue, Apr 17, 2012 at 12:16:15PM -0400, Jay Soffian wrote:

> For a couple years now I've had a maintenance script which repacks all
> the repos at @dayjob thusly:
> 
>   git config repack.usedeltabaseoffset true
>   git config pack.compression 9
>   git config pack.indexversion 2
>   git config gc.autopacklimit 4
>   git config gc.packrefs true
>   git config gc.reflogexpire never
>   git config gc.reflogexpireunreachable never
>   git gc --auto --aggressive --prune
> 
> This has worked fine on repos large and small. However, starting a
> couple days ago git started running out of memory on a relatively
> modest repo[*] while repacking on a Linux box with 12GB memory (+ 12GB
> swap). I am able to gc the repo by either removing --aggressive or
> .keep'ing the oldest pack.

I wonder where the memory is going. In theory, the memory consumption
for packing comes from keeping all of the objects for a given window in
memory (so we are looking for a delta for object X, and we have a window
of Y[0]..Y[$window] objects that we will consider). And for a
multi-threaded pack, that's per-thread.

How many cores are there on this box? Have you tried setting
pack.windowMemory to (12 / # of cores) or thereabouts?
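
E.g. something like this, assuming 8 cores (numbers picked for illustration):

  git config pack.threads 8
  git config pack.windowMemory 1500m   # ~12GB split across 8 threads
  git gc --aggressive --prune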

> 1) If --aggressive does not generally provide a benefit, should it be
> made a no-op?

In your case, I think it is overkill. But it seems lame that git _can't_
do a full repack on such a beefy machine. You don't want to do it all
the time, but you might want to do it at least once.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-17 22:08 ` Jeff King
@ 2012-04-17 22:17   ` Junio C Hamano
  2012-04-17 22:18     ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2012-04-17 22:17 UTC (permalink / raw)
  To: Jeff King; +Cc: Jay Soffian, git, Shawn Pearce

Jeff King <peff@peff.net> writes:

> ...
> I wonder where the memory is going. In theory, the memory consumption
> for packing comes from keeping all of the objects for a given window in
> memory (so we are looking for a delta for object X, and we have a window
> of Y[0]..Y[$window] objects that we will consider). And for a
> multi-threaded pack, that's per-thread.
>
> How many cores are there on this box? Have you tried setting
> pack.windowMemory to (12 / # of cores) or thereabouts?

Hrm, from the end-user's point of view, it appears that pack.windowMemory
ought to mean the total without having to worry about the division of it
across threads (which the implementation should be responsible for).

> ... But it seems lame that git _can't_
> do a full repack on such a beefy machine. You don't want to do it all
> the time, but you might want to do it at least once.

True.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-17 22:17   ` Junio C Hamano
@ 2012-04-17 22:18     ` Jeff King
  2012-04-17 22:34       ` Junio C Hamano
  2012-04-18  8:49       ` Andreas Ericsson
  0 siblings, 2 replies; 26+ messages in thread
From: Jeff King @ 2012-04-17 22:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jay Soffian, git, Shawn Pearce

On Tue, Apr 17, 2012 at 03:17:28PM -0700, Junio C Hamano wrote:

> > How many cores are there on this box? Have you tried setting
> > pack.windowMemory to (12 / # of cores) or thereabouts?
> 
> Hrm, from the end-user's point of view, it appears that pack.windowMemory
> ought to mean the total without having to worry about the division of it
> across threads (which the implementation should be responsible for).

Agreed. I had to look in the code to check which it meant. I'm not sure
we can change it without regressing existing users, though.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-17 22:18     ` Jeff King
@ 2012-04-17 22:34       ` Junio C Hamano
  2012-04-28 16:42         ` Nicolas Pitre
  2012-04-18  8:49       ` Andreas Ericsson
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2012-04-17 22:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Jay Soffian, git, Shawn Pearce

Jeff King <peff@peff.net> writes:

> On Tue, Apr 17, 2012 at 03:17:28PM -0700, Junio C Hamano wrote:
>
>> > How many cores are there on this box? Have you tried setting
>> > pack.windowMemory to (12 / # of cores) or thereabouts?
>> 
>> Hrm, from the end-user's point of view, it appears that pack.windowMemory
>> ought to mean the total without having to worry about the division of it
>> across threads (which the implementation should be responsible for).
>
> Agreed. I had to look in the code to check which it meant. I'm not sure
> we can change it without regressing existing users, though.

This is a tangent, but I noticed that the canned settings for "aggressive"
use an arbitrarily hardcoded value of depth=250 and window=250 (tweakable
with gc.aggressiveWindow).

Even though a shallower depth does cause base candidates whose delta chains
have grown too long to be evicted prematurely while still in the window, and
thus leads to smaller memory consumption, I do not think the value of
"depth" affects the pack-time memory consumption very much.  But the
runtime performance of the resulting pack may not be great (in the worst
case you would have to undelta 249 times to get to the object data).  We
may want to loosen it a bit.

Also it might make sense to make the window size a bit more flexible
depending on the nature of your history (you would get a bigger benefit from
a larger window when your history has fine-grained commits; if there are not
many few-liner commits, a larger window may not help you that much).
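
For illustration, that sort of loosening can already be approximated by hand,
keeping the aggressive window but a much shallower chain:

  git repack -adf --window=250 --depth=50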

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-17 22:18     ` Jeff King
  2012-04-17 22:34       ` Junio C Hamano
@ 2012-04-18  8:49       ` Andreas Ericsson
  1 sibling, 0 replies; 26+ messages in thread
From: Andreas Ericsson @ 2012-04-18  8:49 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Jay Soffian, git, Shawn Pearce

On 04/18/2012 12:18 AM, Jeff King wrote:
> On Tue, Apr 17, 2012 at 03:17:28PM -0700, Junio C Hamano wrote:
> 
>>> How many cores are there on this box? Have you tried setting
>>> pack.windowMemory to (12 / # of cores) or thereabouts?
>>
>> Hrm, from the end-user's point of view, it appears that pack.windowMemory
>> ought to mean the total without having to worry about the division of it
>> across threads (which the implementation should be responsible for).
> 
> Agreed. I had to look in the code to check which it meant. I'm not sure
> we can change it without regressing existing users, though.
> 

Introduce a new one.

core.maxmemoryusage = <something>, which acts as an upper bound in all
codepaths where we actually keep track.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-17 20:52   ` Matthieu Moy
  2012-04-17 21:58     ` Jeff King
@ 2012-04-28 12:25     ` Jeff King
  2012-04-28 17:11       ` Nicolas Pitre
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2012-04-28 12:25 UTC (permalink / raw)
  To: git
  Cc: Matthieu Moy, Nicolas Pitre, Jay Soffian, Junio C Hamano, Shawn Pearce

On Tue, Apr 17, 2012 at 10:52:03PM +0200, Matthieu Moy wrote:

> Jay Soffian <jaysoffian@gmail.com> writes:
> 
> > + 3. `git gc --aggressive`; this is often much slower than (2) because git
> > +    throws out all of the existing deltas and recomputes them from
> > +    scratch. It uses a higher window parameter meaning it will spend
> > +    more time computing, and it may end up with a smaller pack. However,
> > +    unless the repository is known to have initially been poorly packed,
> > +    this option is not needed and will just cause git to perform
> > +    extra work.
> 
> I like your patch.
> 
> Maybe you should elaborate on "unless the repository is known to have
> initially been poorly packed". My understanding is that --aggressive was
> implemented to be called after an import from another VCS that would
> have computed very poor deltas, but I'm not sure about the details.

Coincidentally, I came across a case last week that shows --aggressive
providing a large improvement. And it's a public repo, so I was able to
grab a snapshot of the pre-packed state to experiment on and share.

The current packfile is ~246M. It was produced over time by pushes into
the repository, which were then eventually grouped into a single pack by
"git gc" (I'm not sure of the exact history, but this may even have been
a set of "gc --auto" calls over time).

Here's a list of commands and the pack sizes they yield on the repo:

  1. `git repack -ad`: 246M
  2. `git repack -ad -f`: 376M
  3. `git repack -ad --window=250`: 246M
  4. `git repack -ad -f --window=250`: 145M

The most interesting thing is (4): repacking with a larger window size
yields a 100M (40%) space improvement. The other commands show that it
is not that the current pack is simply bad; command (2) repacks from
scratch and actually ends up with a worse pack. So the increased window
size really is important.

I haven't been able to figure out what it is about this dataset that
makes the bigger window so much better. Certainly doing the same
commands on git.git does not yield as impressive an improvement.

If anybody is interested in the repository as a packing test case, you
can download it from:

  https://gist.github.com/raw/2518074/d9c0244bf0ced690fee1edb2c88c522ecce102e4/phpmyadmin-network.tar

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-17 22:34       ` Junio C Hamano
@ 2012-04-28 16:42         ` Nicolas Pitre
  0 siblings, 0 replies; 26+ messages in thread
From: Nicolas Pitre @ 2012-04-28 16:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Jay Soffian, git, Shawn Pearce

[ coming late to this thread -- thanks to peff for drawing my attention to it ]

On Tue, 17 Apr 2012, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > On Tue, Apr 17, 2012 at 03:17:28PM -0700, Junio C Hamano wrote:
> >
> >> > How many cores are there on this box? Have you tried setting
> >> > pack.windowMemory to (12 / # of cores) or thereabouts?
> >> 
> >> Hrm, from the end-user's point of view, it appears that pack.windowMemory
> >> ought to mean the total without having to worry about the division of it
> >> across threads (which the implementation should be responsible for).
> >
> > Agreed. I had to look in the code to check which it meant. I'm not sure
> > we can change it without regressing existing users, though.
> 
> This is a tangent, but I noticed that the canned settings for "aggressive"
> use an arbitrarily hardcoded value of depth=250 and window=250 (tweakable
> with gc.aggressiveWindow).
> 
> Even though a shallower depth does cause base candidates whose delta chains
> have grown too long to be evicted prematurely while still in the window, and
> thus leads to smaller memory consumption, I do not think the value of
> "depth" affects the pack-time memory consumption very much.  But the
> runtime performance of the resulting pack may not be great (in the worst
> case you would have to undelta 249 times to get to the object data).  We
> may want to loosen it a bit.

I think people are having misconceptions about the definition of the 
word "aggressive".

This option is, well, aggressive.  By definition this is not meant to be 
"nice".  This is not meant to be fast, or light on memory usage, etc.  
This means "achieve as much damage you can" to reduce the pack size.

If people are using it every night then they must be masochists, or 
attracted by violence, or getting a bit too casual with word 
definitions.

So if being --aggressive hurts, then don't do it.

If people want a loosened version, it would be more appropriate to 
introduce a --mild, or --bold, or --disruptive option.  In the same 
vein, an --insane option could be introduced to go even further 
than --aggressive.

This being said, this is no excuse for regressions though.  If git is 
eating up much more memory than it used to, provided with the same 
repository and repacking parameters as before, then this certainly 
needs fixing.  But making --aggressive less so is not a fix.

> Also it might make sense to make the window size a bit more flexible
> depending on the nature of your history (you would get a bigger benefit from
> a larger window when your history has fine-grained commits; if there are not
> many few-liner commits, a larger window may not help you that much).

How do you detect the nature of a repository's history?  That's the hard 
part, because it should be auto-detected: most users won't make a good 
guess at the best parameter value to use.

Anyway, I think that the window size in terms of objects is a bad 
parameter.  Historically that is the first thing we implemented. But the 
window _memory_ usage is probably a better setting to use.  The delta 
search cost is directly proportional to the amount of data to process 
and that can be controlled with --window-memory, with the ability to 
scale up and down the number of objects in the window.  Keeping the 
number of objects constant makes memory usage totally random since this 
depends on the repository content, and the computing cost to process it 
is highly unpredictable. This is very counter-intuitive for users.

Right now the window is limited by default to 10 objects, and window 
memory usage is unlimited.  This could be reworked so object number, 
while still being limited to avoid pathological cases, could be much 
higher, and the window memory usage always limited by default.  That 
default memory usage could be scaled according to the available 
resources on the system.  But if the user wants to play with this, then 
using a memory usage parameter is much easier to understand with more 
directly observable system load influence.
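
One can already experiment in that direction today by raising the object
limit and bounding the memory instead, e.g. (numbers picked arbitrarily):

  git repack -adf --window=1000 --window-memory=64m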


Nicolas

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-17 17:53 ` Jay Soffian
  2012-04-17 20:52   ` Matthieu Moy
@ 2012-04-28 16:56   ` Nicolas Pitre
  1 sibling, 0 replies; 26+ messages in thread
From: Nicolas Pitre @ 2012-04-28 16:56 UTC (permalink / raw)
  To: Jay Soffian; +Cc: git, Junio C Hamano, Shawn Pearce, Jeff King

On Tue, 17 Apr 2012, Jay Soffian wrote:

> On Tue, Apr 17, 2012 at 12:16 PM, Jay Soffian <jaysoffian@gmail.com> wrote:
> > This has worked fine on repos large and small. However, starting a
> > couple days ago git started running out of memory on a relatively
> > modest repo[*] while repacking on a Linux box with 12GB memory (+ 12GB
> > swap). I am able to gc the repo by either removing --aggressive or
> > .keep'ing the oldest pack.
> 
> Experimentally, setting pack.windowMemory = 256m keeps git memory
> usage < 4.5 GB during an aggressive repack.

How many threads are used?  As mentioned elsewhere, the memory usage 
parameter should probably be made global rather than per thread, 
especially with the ever-growing number of CPU cores in a system.  But 
this also poses a balancing problem of optimally distributing memory 
between threads.

> Ironically I end up with a slightly worse pack (63115590 bytes vs
> 61518628 bytes) than not using --aggressive. I assume this is because
> pack-objects found a better delta chain during the previous aggressive
> repack when windowMemory was not set.

Exactly.  When reusing delta data, you inherit the quality of the repack 
run that created them in the first place.

> > 1) If --aggressive does not generally provide a benefit, should it be
> > made a no-op?

Absolutely not.  It does provide benefits, but it comes with a cost in 
resources.  If you don't pay that cost then results won't be there.

> I guess I'll revise this question: perhaps --aggressive should be
> better explained/discouraged. I found a message from Jeff last month
> and stole his words for this patch:
> 
> <snip>
> diff --git i/Documentation/git-gc.txt w/Documentation/git-gc.txt
> index 815afcb922..ca5bf8b51e 100644
> --- i/Documentation/git-gc.txt
> +++ w/Documentation/git-gc.txt
> @@ -37,9 +37,8 @@ OPTIONS
>  	Usually 'git gc' runs very quickly while providing good disk
>  	space utilization and performance.  This option will cause
>  	'git gc' to more aggressively optimize the repository at the expense
> -	of taking much more time.  The effects of this optimization are
> -	persistent, so this option only needs to be used occasionally; every
> -	few hundred changesets or so.
> +	of taking much more time and potentially using greater memory. This

Scratch "potentially" here. It definitely uses more memory.

> +	option is rarely needed. See Repacking below.
> 
>  --auto::
>  	With this option, 'git gc' checks whether any housekeeping is
> @@ -138,6 +137,39 @@ If you are expecting some objects to be collected
> and they aren't, check
>  all of those locations and decide whether it makes sense in your case to
>  remove those references.
> 
> +Repacking
> +---------
> +
> +Under the covers 'git gc' calls several commands to optimize the repository.
> +The most significant of these with respect to repository size and general
> +performance is linkgit:git-repack[1]. There are basically three levels of
> +'gc' with respect to repacking:
> +
> + 1. `git gc --auto`; if there are too many loose objects (`gc.auto`), they
> +    all go into a new incremental pack. If there are already too many
> +    packs (`gc.autopacklimit`), all of the existing packs are re-packed
> +    together.
> +
> +    Making an incremental pack is by far the fastest because the speed is
> +    independent of the existing repository history. If git packs
> +    everything together, it should be more or less the same as (2).
> +
> + 2. `git gc`; this packs everything into a single pack. It uses default
> +    window and depth parameters, but importantly, it reuses existing
> +    deltas. Doing so makes the delta compression phase much faster, and it
> +    often makes the writing phase faster (because for older objects, git
> +    is primarily streaming them right out of the existing pack). On a big
> +    repository though, this does do a lot of I/O, because git has to
> +    rewrite the whole pack.
> +
> + 3. `git gc --aggressive`; this is often much slower than (2) because git
> +    throws out all of the existing deltas and recomputes them from
> +    scratch. It uses a higher window parameter meaning it will spend
> +    more time computing, and it may end up with a smaller pack. However,
> +    unless the repository is known to have initially been poorly packed,
> +    this option is not needed and will just cause git to perform
> +    extra work.
> +
>  HOOKS
>  -----
> 
> @@ -147,6 +179,7 @@ linkgit:githooks[5] for more information.
> 
>  SEE ALSO
>  --------
> +linkgit:git-pack-refs[1]
>  linkgit:git-prune[1]
>  linkgit:git-reflog[1]
>  linkgit:git-repack[1]
> </snip>
> 
> Thoughts?

FWIW, Acked-by: Nicolas Pitre <nico@fluxnic.net>


Nicolas

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-28 12:25     ` Jeff King
@ 2012-04-28 17:11       ` Nicolas Pitre
  2012-04-29 11:34         ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Nicolas Pitre @ 2012-04-28 17:11 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Sat, 28 Apr 2012, Jeff King wrote:

> On Tue, Apr 17, 2012 at 10:52:03PM +0200, Matthieu Moy wrote:
> 
> > Jay Soffian <jaysoffian@gmail.com> writes:
> > 
> > > + 3. `git gc --aggressive`; this is often much slower than (2) because git
> > > +    throws out all of the existing deltas and recomputes them from
> > > +    scratch. It uses a higher window parameter meaning it will spend
> > > +    more time computing, and it may end up with a smaller pack. However,
> > > +    unless the repository is known to have initially been poorly packed,
> > > +    this option is not needed and will just cause git to perform
> > > +    extra work.
> > 
> > I like your patch.
> > 
> > Maybe you should elaborate on "unless the repository is known to have
> > initially been poorly packed". My understanding is that --aggressive was
> > implemented to be called after an import from another VCS that would
> > have computed very poor deltas, but I'm not sure about the details.

This is somewhat subjective of course.  But to be effective, you need 
sufficient resources to repack with --aggressive, otherwise you may 
end up with a worse pack.

> Coincidentally, I came across a case last week that shows --aggressive
> providing a large improvement. And it's a public repo, so I was able to
> grab a snapshot of the pre-packed state to experiment on and share.
> 
> The current packfile is ~246M. It was produced over time by pushes into
> the repository, which were then eventually grouped into a single pack by
> "git gc" (I'm not sure of the exact history, but this may even have been
> a set of "gc --auto" calls over time).
> 
> Here's a list of commands and the pack sizes they yield on the repo:
> 
>   1. `git repack -ad`: 246M
>   2. `git repack -ad -f`: 376M
>   3. `git repack -ad --window=250`: 246M
>   4. `git repack -ad -f --window=250`: 145M
> 
> The most interesting thing is (4): repacking with a larger window size
> yields a 100M (40%) space improvement. The other commands show that it
> is not that the current pack is simply bad; command (2) repacks from
> scratch and actually ends up with a worse pack. So the increased window
> size really is important.

Absolutely.  This doesn't surprise me.

> I haven't been able to figure out what it is about this dataset that
> makes the bigger window so much better. Certainly doing the same
> commands on git.git does not yield as impressive an improvement.

The default window size of 10 objects is really really small (yet if 
your objects are 150MB in size then it is probably too big, but I 
digress).  When doing an incremental repack, the window is also limited 
by the fact that we don't redelta those already packed objects.

Many things could explain the improvements with a larger window.  If a 
lot of files were renamed for example, the larger window would allow 
similar objects to still delta against each other, even though they enter 
the search window ordered by file names that no longer match.  With a 
smaller window this opportunity would be missed.


Nicolas

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-28 17:11       ` Nicolas Pitre
@ 2012-04-29 11:34         ` Jeff King
  2012-04-29 13:53           ` Nicolas Pitre
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2012-04-29 11:34 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Sat, Apr 28, 2012 at 01:11:48PM -0400, Nicolas Pitre wrote:

> > Here's a list of commands and the pack sizes they yield on the repo:
> > 
> >   1. `git repack -ad`: 246M
> >   2. `git repack -ad -f`: 376M
> >   3. `git repack -ad --window=250`: 246M
> >   4. `git repack -ad -f --window=250`: 145M
> > 
> > The most interesting thing is (4): repacking with a larger window size
> > yields a 100M (40%) space improvement. The other commands show that it
> > is not that the current pack is simply bad; command (2) repacks from
> > scratch and actually ends up with a worse pack. So the increased window
> > size really is important.
> 
> Absolutely.  This doesn't surprise me.

I was somewhat surprised, because this repo behaves very differently
from other ones as the window size increases. Our default window of 10
is somewhat arbitrary, but I think there was a sense from early tests
that you got diminishing returns from increasing it (this is my vague
recollection; I didn't actually search for old discussions). But here
are some charts showing "repack -adf" with various window sizes on a few
repositories. The first column is the window size; the second is the
resulting pack size (and its percentage of the window=10 case); the
third is the number of seconds of CPU time (and again, the percentage of
the window=10 case).

Here's git.git:

  10 | 31.3M (100%) |   54s (100%)
  20 | 28.8M ( 92%) |   72s (133%)
  40 | 27.4M ( 87%) |  101s (187%)
  80 | 26.3M ( 84%) |  153s (282%)
 160 | 25.7M ( 82%) |  247s (455%)
 320 | 25.4M ( 81%) |  415s (763%)

You can see we get some benefit from increasing window size to 20 or
even 40, but we hit an asymptote around 80%. Meanwhile, CPU time keeps
jumping. Something like 20 or 40 seems like it might be a nice
compromise.

Here's linux-2.6:

  10 | 564M (100%) |  990s (100%)
  20 | 521M ( 92%) | 1323s (134%)
  40 | 495M ( 88%) | 1855s (187%)
  80 | 479M ( 85%) | 2743s (277%)
 160 | 470M ( 83%) | 4284s (432%)
 320 | 463M ( 82%) | 7064s (713%)

It's quite similar, asymptotically heading towards ~80%. And the CPU
numbers look quite similar, too.

And here's the phpmyadmin repository (the one I linked to earlier):

  10 | 386M (100%) | 1592s (100%)
  20 | 280M ( 72%) | 1947s (122%)
  40 | 209M ( 54%) | 2514s (158%)
  80 | 169M ( 44%) | 3386s (213%)
 160 | 151M ( 39%) | 4822s (303%)
 320 | 142M ( 37%) | 6948s (436%)

The packfile size improvements go on for much longer as we increase the
window size. For this repo, a window size of 80-100 is probably a good
spot.

That leads me to a few questions:

  1. Should we bump our default window size? The numbers above show that
     typical repos would benefit from jumping to 20 or even 40.

  2. Is there a heuristic or other metric we can figure out to
     differentiate the first two repositories from the third, and use a
     larger window size on the latter?

  3. Does the phpmyadmin case give us any insight into whether we can
     improve our window sorting algorithm? Looking at the repo, ~55K of
     the ~75K commits are small changes in the po/ directory (it looks
     like they were using a web-based tool to let non-committers tweak
     the translation files). In particular, I see a lot of commits in
     which most of the changes are simply line number changes as the po
     files are refreshed from the source. I wonder if that is making the
     size-sorting heuristics perform poorly, as we end up with many
     files of the same size, and the good deltas get pushed further
     along the window.

  4. What is typical? I suspect that git.git and linux-2.6 are typical,
     and the weird po-files in the phpmyadmin repository are not. But
     I'd be happy to test more repos if people have suggestions. And the
     scripts that generated the charts are included below if anybody
     wants to try it themselves.

-Peff

-- >8 --
cat >collect <<\EOF
#!/bin/sh
# usage: collect /path/to/repo >foo.out

windows='10 20 40 80 160 320'

for i in $windows; do
  echo >&2 "Repacking with window $i..."
  rm -rf tmp && cp -a "$1" tmp && (
    cd tmp &&
    time=`time -f %U -o /dev/stdout git repack -adf --window=$i`
    size=`du -bc objects/pack/pack-*.pack | tail -1 | awk '{print $1}'`
    echo "$i $size $time"
  )
done
EOF

cat >chart <<\EOF
#!/usr/bin/perl
# usage: chart <foo.out

use strict;

my @base;
while (<>) {
  chomp;
  my ($window, $size, $time) = split;

  @base = ($size, $time) unless @base;

  printf '%4s', $window;
  print ' | ', humanize($size);
  printf ' (%3d%%)', int($size / $base[0] * 100 + 0.5);
  printf ' | %4ds', $time;
  printf ' (%d%%)', int($time / $base[1] * 100 + 0.5);
  print "\n";
}

sub human_digits {
  my $n = shift;
  my $digits = $n >= 100 ? 0 :
               $n >=  10 ? 1 :
               2;
  return sprintf '%.*f', $digits, $n;
}

sub humanize {
  my $n = shift;
  my $u;
  foreach $u ('', qw(K M G)) {
    return human_digits($n) . $u if $n < 900;
    $n /= 1024;
  }
  return human_digits($n) . $u;
}
EOF

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-29 11:34         ` Jeff King
@ 2012-04-29 13:53           ` Nicolas Pitre
  2012-05-01 16:28             ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Nicolas Pitre @ 2012-04-29 13:53 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Sun, 29 Apr 2012, Jeff King wrote:

> On Sat, Apr 28, 2012 at 01:11:48PM -0400, Nicolas Pitre wrote:
> 
> > > Here's a list of commands and the pack sizes they yield on the repo:
> > > 
> > >   1. `git repack -ad`: 246M
> > >   2. `git repack -ad -f`: 376M
> > >   3. `git repack -ad --window=250`: 246M
> > >   4. `git repack -ad -f --window=250`: 145M
> > > 
> > > The most interesting thing is (4): repacking with a larger window size
> > > yields a 100M (40%) space improvement. The other commands show that it
> > > is not that the current pack is simply bad; command (2) repacks from
> > > scratch and actually ends up with a worse pack. So the increased window
> > > size really is important.
> > 
> > Absolutely.  This doesn't surprise me.
> 
> I was somewhat surprised, because this repo behaves very differently
> from other ones as the window size increases. Our default window of 10
> is somewhat arbitrary, but I think there was a sense from early tests
> that you got diminishing returns from increasing it (this is my vague
> recollection; I didn't actually search for old discussions). 

Yes, your numbers are very interesting.

But my remark was related to the fact that you need to double the 
affected resources to gain marginal improvements at some point.  This is 
true about computing hardware too: eventually you need way more gates 
and spend much more $$$ to gain some performance, and the added 
performance is never linear with the spending.

> But here are some charts showing "repack -adf" with various window 
> sizes on a few repositories. The first column is the window size; the 
> second is the resulting pack size (and its percentage of the window=10 
> case); the third is the number of seconds of CPU time (and again, the 
> percentage of the window=10 case).
> 
> Here's git.git:
> 
>   10 | 31.3M (100%) |   54s (100%)
>   20 | 28.8M ( 92%) |   72s (133%)
>   40 | 27.4M ( 87%) |  101s (187%)
>   80 | 26.3M ( 84%) |  153s (282%)
>  160 | 25.7M ( 82%) |  247s (455%)
>  320 | 25.4M ( 81%) |  415s (763%)
> 
> You can see we get some benefit from increasing window size to 20 or
> even 40, but we hit an asymptote around 80%. Meanwhile, CPU time keeps
> jumping. Something like 20 or 40 seems like it might be a nice
> compromise.
> 
> Here's linux-2.6:
> 
>   10 | 564M (100%) |  990s (100%)
>   20 | 521M ( 92%) | 1323s (134%)
>   40 | 495M ( 88%) | 1855s (187%)
>   80 | 479M ( 85%) | 2743s (277%)
>  160 | 470M ( 83%) | 4284s (432%)
>  320 | 463M ( 82%) | 7064s (713%)
> 
> It's quite similar, asymptotically heading towards ~80%. And the CPU
> numbers look quite similar, too.
> 
> And here's the phpmyadmin repository (the one I linked to earlier):
> 
>   10 | 386M (100%) | 1592s (100%)
>   20 | 280M ( 72%) | 1947s (122%)
>   40 | 209M ( 54%) | 2514s (158%)
>   80 | 169M ( 44%) | 3386s (213%)
>  160 | 151M ( 39%) | 4822s (303%)
>  320 | 142M ( 37%) | 6948s (436%)
> 
> The packfile size improvements go on for much longer as we increase the
> window size. For this repo, a window size of 80-100 is probably a good
> spot.
> 
> That leads me to a few questions:
> 
>   1. Should we bump our default window size? The numbers above show that
>      typical repos would benefit from jumping to 20 or even 40.

I think this might be a good indication that the number of objects is a 
bad metric to size the window, as I mentioned previously.

Given that you have the test repos already, could you re-run it with 
--window=1000 and play with --window-memory instead?  I would be curious 
to see if this provides more predictable results.

>   2. Is there a heuristic or other metric we can figure out to
>      differentiate the first two repositories from the third, and use a
>      larger window size on the latter?

Maybe we could look at the size reduction within the delta search loop.  
If the reduction quickly diminishes as tested objects are further away 
from the target one then the window doesn't have to be very large, 
whereas if the reduction remains more or less constant then it might be 
worth searching further.  That could be used to dynamically size the 
window at run time.

>   3. Does the phpmyadmin case give us any insight into whether we can
>      improve our window sorting algorithm? Looking at the repo, ~55K of
>      the ~75K commits are small changes in the po/ directory (it looks
>      like they were using a web-based tool to let non-committers tweak
>      the translation files). In particular, I see a lot of commits in
>      which most of the changes are simply line number changes as the po
>      files are refreshed from the source. I wonder if that is making the
>      size-sorting heuristics perform poorly, as we end up with many
>      files of the same size, and the good deltas get pushed further
>      along the window.

You could test this theory by commenting out the size comparisons in 
type_size_sort() and re-run the test.  Linus initially introduced that 
criterion thinking that newer files tend to grow and it is cheaper to 
create a delta that removes data than one that adds data.  And given 
that we wanted to prefer delta chains to start from newer objects this 
all made sense.  However the last comparison in that function is meant 
to handle the recency ordering, and therefore the size comparison 
might be skewing things here.

>   4. What is typical? I suspect that git.git and linux-2.6 are typical,
>      and the weird po-files in the phpmyadmin repository are not. But
>      I'd be happy to test more repos if people have suggestions. And the
>      scripts that generated the charts are included below if anybody
>      wants to try it themselves.

I wouldn't give up on "non typical" data sets just yet though.


Nicolas

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-04-29 13:53           ` Nicolas Pitre
@ 2012-05-01 16:28             ` Jeff King
  2012-05-01 17:16               ` Jeff King
  2012-05-01 17:17               ` Nicolas Pitre
  0 siblings, 2 replies; 26+ messages in thread
From: Jeff King @ 2012-05-01 16:28 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Sun, Apr 29, 2012 at 09:53:31AM -0400, Nicolas Pitre wrote:

> But my remark was related to the fact that you need to double the 
> affected resources to gain marginal improvements at some point.  This is 
> true about computing hardware too: eventually you need way more gates 
> and spend much more $$$ to gain some performance, and the added 
> performance is never linear with the spending.

Right, I agree with that. The trick is just finding the right spot on
that curve for each repo to maximize the reward/effort ratio.

> >   1. Should we bump our default window size? The numbers above show that
> >      typical repos would benefit from jumping to 20 or even 40.
> 
> I think this might be a good indication that the number of objects is a 
> bad metric to size the window, as I mentioned previously.
> 
> Given that you have the test repos already, could you re-run it with 
> --window=1000 and play with --window-memory instead?  I would be curious 
> to see if this provides more predictable results.

It doesn't help. The git.git repo does well with about a 1m window
limit. linux-2.6 is somewhere between 1m and 2m. But the phpmyadmin repo
wants more like 16m. So it runs into the same issue as using object
counts.

But it's much, much worse than that. Here are the actual numbers (same
format as before; left-hand column is either window size (if no unit) or
window-memory limit (if k/m unit), followed by resulting pack size, its
percentage of baseline --window=10 pack, the user CPU time and finally
its percentage of the baseline):

  git:
    10 | 31.4M (100%) |   54s (100%)
    20 | 28.8M ( 92%) |   72s (133%)
  128k | 81.4M (260%) |   77s (142%)
  256k | 59.1M (188%) |  106s (195%)
  512k | 44.5M (142%) |  166s (306%)
    1m | 28.7M ( 91%) |  267s (491%)
    2m | 27.0M ( 86%) |  347s (637%)
    4m | 26.0M ( 83%) |  417s (767%)

  linux-2.6:
    10 |  564M (100%) |  990s (100%)
    20 |  521M ( 92%) | 1323s (134%)
  128k | 1.41G (256%) | 1322s (133%)
  256k | 1.08G (196%) | 1810s (183%)
  512k |  783M (139%) | 2775s (280%)
    1m |  579M (103%) | 4620s (466%)
    2m |  504M ( 89%) | 6786s (685%)
    4m |  479M ( 85%) | 8119s (819%)

  phpmyadmin:
    10 |  380M (100%) | 1617s (100%)
    80 |  163M ( 43%) | 3410s (211%)
  128k | 3.42G (921%) | 2367s (146%)
  256k | 3.36G (904%) | 2437s (151%)
  512k | 3.22G (865%) | 2589s (160%)
    1m | 3.10G (833%) | 2746s (170%)
    2m |  436M (115%) | 1674s (104%)
    4m |  299M ( 78%) | 2140s (132%)
    8m |  222M ( 58%) | 2751s (170%)
   16m |  178M ( 47%) | 3334s (206%)

I intentionally started with a too-small memory limit so we could see
the effect as the window size approached something reasonable. You can
see the pack sizes become comparable to --window=20 at around
--window-memory=1m for the git and linux-2.6 cases. But look at the CPU
usage. For a comparable resulting pack size, limiting the window memory
uses 4-5x as much CPU.

I'm not sure what is causing that behavior. I guess maybe for small
objects we end up with a really huge window (in terms of number of
objects), but it doesn't end up actually saving us a lot of space
because there is not as much space to be saved with small objects. So we
spend a lot of extra time looking at objects that don't yield big space
savings.

For some of the really tiny limits, the "writing" phase ended up
dominating. For example, linux-2.6 at 128k ends up with a horribly large
pack that takes even longer to run than --window=10. These numbers don't
reflect the split between the compression and writing phases, but I
noticed while watching the progress meter that the writing phase was
quite slow in such cases. Mostly because we end up having to zlib
deflate a lot more data (which I confirmed via perf).

Interestingly, the phpmyadmin repo does not have the same issue. The
CPU usage for the object and memory limits is about the same (probably
this is due to the fact that the history is dominated by similar-sized
.po files, so the two limits end up equating to each other).

> >   2. Is there a heuristic or other metric we can figure out to
> >      differentiate the first two repositories from the third, and use a
> >      larger window size on the latter?
> 
> Maybe we could look at the size reduction within the delta search loop.  
> If the reduction quickly diminishes as tested objects are further away 
> from the target one then the window doesn't have to be very large, 
> whereas if the reduction remains more or less constant then it might be 
> worth searching further.  That could be used to dynamically size the 
> window at run time.

I really like the idea of dynamically sizing the window based on what we
find. If it works. I don't think there's any reason you couldn't have 50
absolutely terrible delta candidates followed by one really amazing
delta candidate. But maybe in practice the window tends to get
progressively worse due to the heuristics, and outliers are unlikely. I
guess we'd have to experiment.

> >   3. Does the phpmyadmin case give us any insight into whether we can
> >      improve our window sorting algorithm?
> [...]
> 
>  You could test this theory by commenting out the size comparisons in 
> type_size_sort() and re-run the test.

I'll try this next.

-Peff

PS Here's my updated collection script, just for reference.

-- >8 --
#!/bin/sh

windows='10 20 128k 256k 512k 1m 2m 4m'
repo=$1; shift
test $# -gt 0 && windows="$*"

cd "$repo" || exit
for i in $windows; do
  case "$i" in
  *[kmg])
    opts="--window=1000 --window-memory=$i" ;;
  *)
    opts="--window=$i" ;;
  esac

  echo >&2 "Repacking $repo with $opts..."

  time -f %U -o time.out \
    git pack-objects --stdout --all-progress-implied --all \
      --no-reuse-delta $opts </dev/null |
    wc -c >size.out
  echo "$i `cat size.out` `cat time.out`"
  rm -f size.out time.out
done

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-05-01 16:28             ` Jeff King
@ 2012-05-01 17:16               ` Jeff King
  2012-05-01 17:59                 ` Nicolas Pitre
  2012-05-01 17:17               ` Nicolas Pitre
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2012-05-01 17:16 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Tue, May 01, 2012 at 12:28:06PM -0400, Jeff King wrote:

> >  You could test this theory by commenting out the size comparisons in 
> > type_size_sort() and re-run the test.
> 
> I'll try this next.

Wow, it behaves horribly. I didn't even let the bigger tests run to
completion. Here is the output for git.git (the first line is from the
original, unmodified version of git with --window=10):

  orig | 31.4M (100%) |   54s (100%)
    10 | 44.0M (140%) |  169s (310%)
    20 | 37.7M (120%) |  232s (428%)
    40 | 33.6M (107%) |  331s (608%)
    80 | 30.9M ( 99%) |  473s (868%)
   160 | 29.4M ( 94%) |  696s (1279%)

Unless the window is increased a lot, the packs end up quite a bit
larger (and even still we spend a lot more CPU time).

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-05-01 16:28             ` Jeff King
  2012-05-01 17:16               ` Jeff King
@ 2012-05-01 17:17               ` Nicolas Pitre
  2012-05-01 17:22                 ` Jeff King
  1 sibling, 1 reply; 26+ messages in thread
From: Nicolas Pitre @ 2012-05-01 17:17 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Tue, 1 May 2012, Jeff King wrote:

> On Sun, Apr 29, 2012 at 09:53:31AM -0400, Nicolas Pitre wrote:
> 
> > But my remark was related to the fact that you need to double the 
> > affected resources to gain marginal improvements at some point.  This is 
> > true about computing hardware too: eventually you need way more gates 
> > and spend much more $$$ to gain some performance, and the added 
> > performance is never linear with the spending.
> 
> Right, I agree with that. The trick is just finding the right spot on
> that curve for each repo to maximize the reward/effort ratio.

Absolutely, at least for the default settings.  However this is not what 
--aggressive is meant to be.

> > >   1. Should we bump our default window size? The numbers above show that
> > >      typical repos would benefit from jumping to 20 or even 40.
> > 
> > I think this might be a good indication that the number of objects is a 
> > bad metric to size the window, as I mentioned previously.
> > 
> > Given that you have the test repos already, could you re-run it with 
> > --window=1000 and play with --window-memory instead?  I would be curious 
> > to see if this provides more predictable results.
> 
> It doesn't help. The git.git repo does well with about a 1m window
> limit. linux-2.6 is somewhere between 1m and 2m. But the phpmyadmin repo
> wants more like 16m. So it runs into the same issue as using object
> counts.
> 
> But it's much, much worse than that. Here are the actual numbers (same
> format as before; left-hand column is either window size (if no unit) or
> window-memory limit (if k/m unit), followed by resulting pack size, its
> percentage of baseline --window=10 pack, the user CPU time and finally
> its percentage of the baseline):
> [...]

Ouch!  Well... so much for good theory.  I'm still really surprised and 
disappointed as I didn't expect such damage at all.

However, this is possibly a good baseline to determine a default value 
for window-memory.  Given your numbers, we clearly see that good 
packing can be achieved with relatively little memory, and therefore it 
might be a good idea not to leave this parameter unbounded by default in 
order to catch potential pathological cases.  Maybe 64M would be a good 
default value?  Having a repack process eating up more than 16GB of RAM 
because its RAM usage is unbounded is certainly not nice.
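
Until something like that is decided, the bound can of course be set
explicitly, e.g.:

  git config pack.windowMemory 64m   # note: currently applied per thread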

> > Maybe we could look at the size reduction within the delta search loop.  
> > If the reduction quickly diminishes as tested objects are further away 
> > from the target one then the window doesn't have to be very large, 
> > whereas if the reduction remains more or less constant then it might be 
> > worth searching further.  That could be used to dynamically size the 
> > window at run time.
> 
> I really like the idea of dynamically sizing the window based on what we
> find. If it works. I don't think there's any reason you couldn't have 50
> absolutely terrible delta candidates followed by one really amazing
> delta candidate. But maybe in practice the window tends to get
> progressively worse due to the heuristics, and outliers are unlikely. I
> guess we'd have to experiment.

Yes.  The idea is to continue searching if results are not progressively 
becoming worse fast enough.  Coming up with a good way to infer that is 
far from obvious though.


Nicolas

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-05-01 17:17               ` Nicolas Pitre
@ 2012-05-01 17:22                 ` Jeff King
  2012-05-01 17:47                   ` Nicolas Pitre
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2012-05-01 17:22 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Tue, May 01, 2012 at 01:17:03PM -0400, Nicolas Pitre wrote:

> However, this is possibly a good baseline to determine a default value 
> for window-memory.  Given your numbers, we clearly see that good 
> packing can be achieved with relatively little memory, and therefore it 
> might be a good idea not to leave this parameter unbounded by default in 
> order to catch potential pathological cases.  Maybe 64M would be a good 
> default value?  Having a repack process eating up more than 16GB of RAM 
> because its RAM usage is unbounded is certainly not nice.

Would that preclude a 65M object from being delta'd at all? If you are
putting a limit of 64M and it means we look at 50 delta candidates
instead of 60, then that is probably not going to hurt our size too
much. But if you have large objects that do happen to delta, the
difference between looking at 0 and 1 objects could have a big impact.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-05-01 17:22                 ` Jeff King
@ 2012-05-01 17:47                   ` Nicolas Pitre
  0 siblings, 0 replies; 26+ messages in thread
From: Nicolas Pitre @ 2012-05-01 17:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Tue, 1 May 2012, Jeff King wrote:

> On Tue, May 01, 2012 at 01:17:03PM -0400, Nicolas Pitre wrote:
> 
> > However, this is possibly a good baseline for determining a default value
> > for window-memory.  Given your numbers, we clearly see that good
> > packing can be achieved with relatively little memory and therefore it 
> > might be a good idea not to leave this parameter unbounded by default in 
> > order to catch potential pathological cases.  Maybe 64M would be a good 
> > default value?  Having a repack process eating up more than 16GB of RAM 
> > because its RAM usage is unbounded is certainly not nice.
> 
> Would that preclude a 65M object from being delta'd at all? If you are
> putting a limit of 64M and it means we look at 50 delta candidates
> instead of 60, then that is probably not going to hurt our size too
> much. But if you have large objects that do happen to delta, the
> difference between looking at 0 and 1 objects could have a big impact.

If I remember correctly (it has been a long time since I wrote that code),
there is always a minimum of one object in the search window.  So with 32M
objects you'd have only 2 candidates; with objects of 33M and bigger there
would be only one candidate.  Obviously the window could still be filled
with smaller objects if the total object size is less than 64M.
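
As a rough illustration of that accounting (this is not the actual
pack-objects code, just the principle, with made-up object sizes checked
against a 64M budget):

/* Toy model of filling the delta search window under a memory budget:
 * candidates are added while they fit, but at least one candidate is
 * always kept so that big objects are never skipped entirely. */
#include <stdio.h>

#define MB (1024 * 1024)

static int candidates(unsigned long obj_size, unsigned long budget)
{
        unsigned long used = 0;
        int n = 0;

        while (used + obj_size <= budget || n == 0) {
                used += obj_size;
                n++;
                if (used >= budget)
                        break;
        }
        return n;
}

int main(void)
{
        unsigned long limit = 64 * MB;
        unsigned long sizes[] = { 1 * MB, 32 * MB, 33 * MB, 65 * MB };
        int i;

        for (i = 0; i < 4; i++)
                printf("%2luM objects -> %d in the window\n",
                       sizes[i] / MB, candidates(sizes[i], limit));
        return 0;
}

So a 65M blob would still get one delta attempt, just never more than one.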


Nicolas

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-05-01 17:16               ` Jeff King
@ 2012-05-01 17:59                 ` Nicolas Pitre
  2012-05-01 18:47                   ` Junio C Hamano
  2012-05-01 19:35                   ` Jeff King
  0 siblings, 2 replies; 26+ messages in thread
From: Nicolas Pitre @ 2012-05-01 17:59 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Tue, 1 May 2012, Jeff King wrote:

> On Tue, May 01, 2012 at 12:28:06PM -0400, Jeff King wrote:
> 
> > >  You could test this theory by commenting out the size comparisons in 
> > > type_size_sort() and re-run the test.
> > 
> > I'll try this next.
> 
> Wow, it behaves horribly. I didn't even let the bigger tests run to
> completion. Here is the output for git.git (the first line is from the
> original, unmodified version of git with --window=10):
> 
>   orig | 31.4M (100%) |   54s (100%)
>     10 | 44.0M (140%) |  169s (310%)
>     20 | 37.7M (120%) |  232s (428%)
>     40 | 33.6M (107%) |  331s (608%)
>     80 | 30.9M ( 99%) |  473s (868%)
>    160 | 29.4M ( 94%) |  696s (1279%)
> 
> Unless the window is increased a lot, the packs end up quite a bit
> larger (and even still we spend a lot more CPU time).

Bleh.  Allright.

One final quick test if you feel like it: I've never been sure that 
the last comparison in type_size_sort() is correct.  Maybe it should be 
the other way around.  Currently it reads:

	return a < b ? -1 : (a > b);

While keeping the size comparison commented out, you could try to 
replace this line with:

	return b < a ? -1 : (b > a);

If this doesn't improve things then it would be clear that this avenue 
should be abandoned.


Nicolas

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-05-01 17:59                 ` Nicolas Pitre
@ 2012-05-01 18:47                   ` Junio C Hamano
  2012-05-01 19:22                     ` Nicolas Pitre
  2012-05-01 20:01                     ` Jeff King
  2012-05-01 19:35                   ` Jeff King
  1 sibling, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2012-05-01 18:47 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jeff King, git, Matthieu Moy, Jay Soffian, Shawn Pearce

Nicolas Pitre <nico@fluxnic.net> writes:

> One final quick test if you feel like it: I've never been sure that 
> the last comparison in type_size_sort() is correct.  Maybe it should be 
> the other way around.  Currently it reads:
>
> 	return a < b ? -1 : (a > b);
>
> While keeping the size comparison commented out, you could try to 
> replace this line with:
>
> 	return b < a ? -1 : (b > a);
>
> If this doesn't improve things then it would be clear that this avenue 
> should be abandoned.

Very interesting.  The difference between the two should only matter if
there are many blobs with exactly the same size, and most of them delta
horribly with each other.  Does the problematic repository exhibit such
a characteristic?

The original tie-breaks based on the address (the earlier object we read
in the original input comes earlier in the output) and yours make the
objects later we read (which in turn are from older parts of the history)
come early, but adjacency between two objects of the same type and the
same size would not change (if A and B were next to each other in this
order, your updated sorter will give B and then A still next to each
other), so I suspect not much would change in the candidate selection.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-05-01 18:47                   ` Junio C Hamano
@ 2012-05-01 19:22                     ` Nicolas Pitre
  2012-05-01 20:01                     ` Jeff King
  1 sibling, 0 replies; 26+ messages in thread
From: Nicolas Pitre @ 2012-05-01 19:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Matthieu Moy, Jay Soffian, Shawn Pearce

On Tue, 1 May 2012, Junio C Hamano wrote:

> Nicolas Pitre <nico@fluxnic.net> writes:
> 
> > One final quick test if you feel like it: I've never been sure that 
> > the last comparison in type_size_sort() is correct.  Maybe it should be 
> > the other way around.  Currently it reads:
> >
> > 	return a < b ? -1 : (a > b);
> >
> > While keeping the size comparison commented out, you could try to 
> > replace this line with:
> >
> > 	return b < a ? -1 : (b > a);
> >
> > If this doesn't improve things then it would be clear that this avenue 
> > should be abandoned.
> 
> Very interesting.  The difference between the two should only matter if
> there are many blobs with exactly the same size, and most of them delta
> horribly with each other.  Does the problematic repository exhibit such
> a characteristic?

Not precisely.  This is just to verify some hypothesis that could 
explain the difference in behavior with the phpmyadmin repo.

My hypothesis was that recency order could be skewed by the object size 
when many small changes are made to the same files without varying their 
size much.  So I suggested that a repack run be performed with the 
object size removed from the sort criteria.  However it is important 
that the last comparison be done in the right direction.  Hence my 
suggestion above.

> The original tie-breaks based on the address (the earlier object we read
> in the original input comes earlier in the output) and yours make the
> objects later we read (which in turn are from older parts of the history)
> come early, but adjacency between two objects of the same type and the
> same size would not change (if A and B were next to each other in this
> order, your updated sorter will give B and then A still next to each
> other), so I suspect not much would change in the candidate selection.

Note that the size comparison is commented out in those tests.  The idea 
was to get pure recency order.

Even for objects of the same size, the delta orientation would change,
which might or might not provide a clue.

But this is really just a wild guess without much thinking at this 
point, before giving up on this approach.


Nicolas

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-05-01 17:59                 ` Nicolas Pitre
  2012-05-01 18:47                   ` Junio C Hamano
@ 2012-05-01 19:35                   ` Jeff King
  2012-05-01 20:02                     ` Nicolas Pitre
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2012-05-01 19:35 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Tue, May 01, 2012 at 01:59:08PM -0400, Nicolas Pitre wrote:

> One final quick test if you feel like it: I've never been sure that 
> the last comparison in type_size_sort() is correct.  Maybe it should be 
> the other way around.  Currently it reads:
> 
> 	return a < b ? -1 : (a > b);

I think it is right. At least it should put recent things near the
front of the array, just as we are putting bigger things there.
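
To make that concrete, here is a toy model of the tie-break as I
understand it (not the real type_size_sort(), which also orders by type
and name hash first; the sizes and the newest-first array order are made
up for the example):

/* Toy sort: bigger objects first; on equal size, keep the more recently
 * added entry (lower index, assuming entries were created newest-first)
 * ahead -- the "a < b ? -1 : (a > b)" direction. */
#include <stdio.h>
#include <stdlib.h>

struct entry {
        unsigned long size;
        int idx;        /* position in the original array; lower = newer */
};

static int size_recency_cmp(const void *va, const void *vb)
{
        const struct entry *a = va, *b = vb;

        if (a->size != b->size)
                return a->size < b->size ? 1 : -1;      /* bigger first */
        return a->idx < b->idx ? -1 : (a->idx > b->idx);
}

int main(void)
{
        struct entry e[] = {
                { 2222, 0 }, { 4393, 1 }, { 2222, 2 },
                { 2222, 3 }, { 4393, 4 },
        };
        int i, n = sizeof(e) / sizeof(e[0]);

        qsort(e, n, sizeof(e[0]), size_recency_cmp);
        for (i = 0; i < n; i++)
                printf("size %lu (originally #%d)\n", e[i].size, e[i].idx);
        return 0;
}

Flipping that last comparison only swaps which of two same-size entries
comes first; they stay next to each other either way, which is probably
why it changes so little in practice.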

> >   orig | 31.4M (100%) |   54s (100%)
> >     10 | 44.0M (140%) |  169s (310%)
> >     20 | 37.7M (120%) |  232s (428%)
> >     40 | 33.6M (107%) |  331s (608%)
> >     80 | 30.9M ( 99%) |  473s (868%)
> >    160 | 29.4M ( 94%) |  696s (1279%)
> [...]
> While keeping the size comparison commented out, you could try to 
> replace this line with:
> 
> 	return b < a ? -1 : (b > a);

No, it's not better. A few of the pack sizes are better, but some of
them are worse. And the CPU times are still quite bad. Here are the
numbers:

  orig | 31.4M (100%) |   54s (100%)
    10 | 45.6M (145%) |  158s (292%)
    20 | 39.2M (125%) |  205s (377%)
    40 | 35.1M (112%) |  275s (505%)
    80 | 32.4M (103%) |  388s (713%)
   160 | 30.6M ( 98%) |  581s (1067%)

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-05-01 18:47                   ` Junio C Hamano
  2012-05-01 19:22                     ` Nicolas Pitre
@ 2012-05-01 20:01                     ` Jeff King
  1 sibling, 0 replies; 26+ messages in thread
From: Jeff King @ 2012-05-01 20:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, git, Matthieu Moy, Jay Soffian, Shawn Pearce

On Tue, May 01, 2012 at 11:47:26AM -0700, Junio C Hamano wrote:

> > While keeping the size comparison commented out, you could try to 
> > replace this line with:
> >
> > 	return b < a ? -1 : (b > a);
> >
> > If this doesn't improve things then it would be clear that this avenue 
> > should be abandoned.
> 
> Very interesting.  The difference between the two should only matter if
> there are many blobs with exactly the same size, and most of them delta
> horribly with each other.  Does the problematic repository exhibit such
> a characteristic?

No. Here are the objects with the same sizes:

  $ git rev-list --objects --all |
    cut -d' ' -f1 |
    git cat-file --batch-check |
    cut -d' ' -f2,3 |
    sort | uniq -c | sort -rn | head

  19722 tree 2222
  14068 tree 4393
  11418 tree 2156
   9994 tree 4676
   9479 tree 2189
   7944 tree 2255
   6454 commit 251
   6437 tree 4611
   5328 tree 4439
   4586 commit 254

So it's mostly trees and commits (the first repeated blob size is on
line 332 of the output). The commits aren't all that big even without
deltafication, but the trees are. They should be sorted by name_hash,
but within a single name, there are going to be a lot of repetitions (I
think each of those size clusters is just a repetition of the same "po"
directory getting lots of tiny modifications).

So we are triggering that part of the sort quite a bit. But by your
reasoning here:

> The original tie-breaks based on the address (the earlier object we read
> in the original input comes earlier in the output) and yours make the
> objects later we read (which in turn are from older parts of the history)
> come early, but adjacency between two objects of the same type and the
> same size would not change (if A and B were next to each other in this
> order, your updated sorter will give B and then A still next to each
> other), so I suspect not much would change in the candidate selection.

I don't think it makes a big difference (and indeed, switching it and
repacking the phpmyadmin repository yields the same-size pack, although
a lot more CPU time is spent).

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: gc --aggressive
  2012-05-01 19:35                   ` Jeff King
@ 2012-05-01 20:02                     ` Nicolas Pitre
  0 siblings, 0 replies; 26+ messages in thread
From: Nicolas Pitre @ 2012-05-01 20:02 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Tue, 1 May 2012, Jeff King wrote:

> On Tue, May 01, 2012 at 01:59:08PM -0400, Nicolas Pitre wrote:
> 
> > One final quick test if you feel like it: I've never been sure that 
> > the last comparison in type_size_sort() is correct.  Maybe it should be 
> > the other way around.  Currently it reads:
> > 
> > 	return a < b ? -1 : (a > b);
> 
> I think it is right. At least it should put recent things near the
> front of the array, just as we are putting bigger things there.

Right.  In fact, it seems that _I_ did think about it *five* years ago 
(man... time flies by) given commit adcc70950e, and then I reversed the 
whole order in commit b904166ccb to get what we have today.

> > replace this line with:
> > 
> > 	return b < a ? -1 : (b > a);
> 
> No, it's not better. A few of the pack sizes are better, but some of
> them are worse. And the CPU times are still quite bad.

OK.  We can scratch that.


Nicolas

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2012-05-01 20:03 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-17 16:16 gc --aggressive Jay Soffian
2012-04-17 17:53 ` Jay Soffian
2012-04-17 20:52   ` Matthieu Moy
2012-04-17 21:58     ` Jeff King
2012-04-28 12:25     ` Jeff King
2012-04-28 17:11       ` Nicolas Pitre
2012-04-29 11:34         ` Jeff King
2012-04-29 13:53           ` Nicolas Pitre
2012-05-01 16:28             ` Jeff King
2012-05-01 17:16               ` Jeff King
2012-05-01 17:59                 ` Nicolas Pitre
2012-05-01 18:47                   ` Junio C Hamano
2012-05-01 19:22                     ` Nicolas Pitre
2012-05-01 20:01                     ` Jeff King
2012-05-01 19:35                   ` Jeff King
2012-05-01 20:02                     ` Nicolas Pitre
2012-05-01 17:17               ` Nicolas Pitre
2012-05-01 17:22                 ` Jeff King
2012-05-01 17:47                   ` Nicolas Pitre
2012-04-28 16:56   ` Nicolas Pitre
2012-04-17 22:08 ` Jeff King
2012-04-17 22:17   ` Junio C Hamano
2012-04-17 22:18     ` Jeff King
2012-04-17 22:34       ` Junio C Hamano
2012-04-28 16:42         ` Nicolas Pitre
2012-04-18  8:49       ` Andreas Ericsson
