* git gc --aggressive led to about 40 times slower "git log --raw"
@ 2014-02-18  7:25 Christian Jaeger
  2014-02-18  8:55 ` David Kastrup
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Jaeger @ 2014-02-18  7:25 UTC (permalink / raw)
  To: git

Hi

I've got a repository where "git log --raw > _somefile" took a few
seconds in the past, but after an attempt at merging some commits that
were collected in a clone of the same repo that was created about a
year ago, I noticed that this command was now taking 3 minutes 7
seconds. "git gc", "git fsck", "git clone file:///the/repo/.git" also
now each took between ~4-10 minutes, also "git log --raw somefile" got
equally unusably slow. With the help of the people on the IRC, I
tracked it down to my recent use of "git gc --aggressive" in this
repo. Running "git repack -a -d -f" solved it, now it's again taking
4-5 seconds. After running "git gc --aggressive" again for
confirmation, "git log --raw > _somefile" was again slowed down,
although now 'only' to 1 minute 34 seconds; did perhaps my "git remote
add -f other-repo", which I remember was also running rather slowly,
exacerbate the problem (to the > 3 minutes I was seeing)?

The repo has about 6000 commits, about 12'000 files in the current
HEAD, and about 43 MB packed .git contents. The files are (almost) all
plain text, about half of them are about 42 bytes long, the rest up to
about 2 MB although most of them are just around 5-50 KB. Most files
mostly grow at the end. The biggest files (500KB-2MB) are quite
long-lived and don't stop growing, again mostly at the end. Also,
about 2*5K files are each in the same directory, meaning that the tree
objects representing those 2 directories are big but changing only in
a few places.

I've now learned to avoid "git gc --aggressive". Perhaps there are
some other conclusions to be drawn, I don't know.

Christian.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-18  7:25 git gc --aggressive led to about 40 times slower "git log --raw" Christian Jaeger
@ 2014-02-18  8:55 ` David Kastrup
  2014-02-18  9:45   ` Duy Nguyen
  0 siblings, 1 reply; 31+ messages in thread
From: David Kastrup @ 2014-02-18  8:55 UTC (permalink / raw)
  To: Christian Jaeger; +Cc: git

Christian Jaeger <chrjae@gmail.com> writes:

> I've got a repository where "git log --raw > _somefile" took a few
> seconds in the past, but after an attempt at merging some commits that
> were collected in a clone of the same repo that was created about a
> year ago, I noticed that this command was now taking 3 minutes 7
> seconds. "git gc", "git fsck", "git clone file:///the/repo/.git" also
> now each took between ~4-10 minutes, also "git log --raw somefile" got
> equally unusably slow. With the help of the people on the IRC, I
> tracked it down to my recent use of "git gc --aggressive" in this
> repo. Running "git repack -a -d -f" solved it, now it's again taking
> 4-5 seconds. After running "git gc --aggressive" again for
> confirmation, "git log --raw > _somefile" was again slowed down,
> although now 'only' to 1 minute 34 seconds;

[...]

> I've now learned to avoid "git gc --aggressive". Perhaps there are
> some other conclusions to be drawn, I don't know.

I've seen the same with my ongoing work on git-blame with the current
Emacs Git mirror.  Aggressive packing reduces the repository size to
about a quarter, but it blows up the system time (mainly I/O)
significantly, quite reducing the total benefits of my algorithmic
improvements there.

There is also some quite visible additional time spent in zlib, so a
wild guess would be that zlib is not really suited to the massive amount
of directory entries of a Git object store.  Since the system time still
dominates, this guess would only make sense if Git over zlib kept
rereading the directory section of whatever compressed file we are
talking about.  But that's really a rather handwavy wild guess without
anything better than a hunch to back it up.  I don't even know what kind
of compression and/or packs are used: I've only ever messed myself with
the delta coding of the normal "unpacked" operation (there are a few
older commits from me on that).

-- 
David Kastrup

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-18  8:55 ` David Kastrup
@ 2014-02-18  9:45   ` Duy Nguyen
  2014-02-18 10:25     ` David Kastrup
  2014-02-18 16:43     ` Christian Jaeger
  0 siblings, 2 replies; 31+ messages in thread
From: Duy Nguyen @ 2014-02-18  9:45 UTC (permalink / raw)
  To: David Kastrup; +Cc: Christian Jaeger, Git Mailing List

On Tue, Feb 18, 2014 at 3:55 PM, David Kastrup <dak@gnu.org> wrote:
> Christian Jaeger <chrjae@gmail.com> writes:
>
>> I've got a repository where "git log --raw > _somefile" took a few
>> seconds in the past, but after an attempt at merging some commits that
>> were collected in a clone of the same repo that was created about a
>> year ago, I noticed that this command was now taking 3 minutes 7
>> seconds. "git gc", "git fsck", "git clone file:///the/repo/.git" also
>> now each took between ~4-10 minutes, also "git log --raw somefile" got
>> equally unusably slow. With the help of the people on the IRC, I
>> tracked it down to my recent use of "git gc --aggressive" in this
>> repo. Running "git repack -a -d -f" solved it, now it's again taking
>> 4-5 seconds. After running "git gc --aggressive" again for
>> confirmation, "git log --raw > _somefile" was again slowed down,
>> although now 'only' to 1 minute 34 seconds;
>
> [...]
>
>> I've now learned to avoid "git gc --aggressive". Perhaps there are
>> some other conclusions to be drawn, I don't know.
>
> I've seen the same with my ongoing work on git-blame with the current
> Emacs Git mirror.  Aggressive packing reduces the repository size to
> about a quarter, but it blows up the system time (mainly I/O)
> significantly, quite reducing the total benefits of my algorithmic
> improvements there.

Likely because --aggressive passes --depth=250 to pack-objects. Long
delta chains could reduce pack size and increase I/O as well as zlib
processing significantly. Christian can try "git repack -adf" which is
really close to --aggressive (except it uses default --depth=50) and
see if it makes any difference.

> There is also some quite visible additional time spent in zlib, so a
> wild guess would be that zlib is not really suited to the massive amount
> of directory entries of a Git object store.  Since the system time still
> dominates, this guess would only make sense if Git over zlib kept
> rereading the directory section of whatever compressed file we are
> talking about.  But that's really a rather handwavy wild guess without
> anything better than a hunch to back it up.  I don't even know what kind
> of compression and/or packs are used: I've only ever messed myself with
> the delta coding of the normal "unpacked" operation (there are a few
> older commits from me on that).
>
> --
> David Kastrup

-- 
Duy

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-18  9:45   ` Duy Nguyen
@ 2014-02-18 10:25     ` David Kastrup
  2014-02-18 15:59       ` Jonathan Nieder
  2014-02-18 16:43     ` Christian Jaeger
  1 sibling, 1 reply; 31+ messages in thread
From: David Kastrup @ 2014-02-18 10:25 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Christian Jaeger, Git Mailing List

Duy Nguyen <pclouds@gmail.com> writes:

> On Tue, Feb 18, 2014 at 3:55 PM, David Kastrup <dak@gnu.org> wrote:
>
>> I've seen the same with my ongoing work on git-blame with the current
>> Emacs Git mirror.  Aggressive packing reduces the repository size to
>> about a quarter, but it blows up the system time (mainly I/O)
>> significantly, quite reducing the total benefits of my algorithmic
>> improvements there.
>
> Likely because --aggressive passes --depth=250 to pack-objects. Long
> delta chains could reduce pack size and increase I/O as well as zlib
> processing significantly.

Increased zlib processing time is one thing, but if it _increases_ I/O,
then it would seem there is a serious impedance mismatch between the
compression scheme and the code relying on it, leading to repeated reads
of blocks only needed for reconstructing dynamic compression
dictionaries.

Compression should reduce rather than increase the total amount of
reads.  So it would seem that what is needed is better caching and/or
smaller independent block sizes and/or strategies for sorting the delta
chain so that its resolution requires mostly linear reads, done in a
manner that does not reinitialize the decompression for accessing each
delta that happens to be more or less "in sequence".

Of course, this is assuming that the additional time is spent
uncompressing data rather than navigating directories.

It's actually conceivable that there is quite a bit of potential to get
better performance from unchanged readers by packing stuff in a
different order while still using the same delta chain depth.

-- 
David Kastrup

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-18 10:25     ` David Kastrup
@ 2014-02-18 15:59       ` Jonathan Nieder
  2014-02-18 20:59         ` Junio C Hamano
  0 siblings, 1 reply; 31+ messages in thread
From: Jonathan Nieder @ 2014-02-18 15:59 UTC (permalink / raw)
  To: David Kastrup; +Cc: Duy Nguyen, Christian Jaeger, Git Mailing List

David Kastrup wrote:
> Duy Nguyen <pclouds@gmail.com> writes:

>> Likely because --aggressive passes --depth=250 to pack-objects. Long
>> delta chains could reduce pack size and increase I/O as well as zlib
>> processing significantly.
[...]
> Compression should reduce rather than increase the total amount of
> reads.

--depth=250 means to allow chains of "To get this object, first
inflate this object, then apply this delta" of length 250.

That's absurdly long, and doesn't even help compression much in
practice (many short chains referring to the same objects tend to
work fine).  We probably shouldn't make --aggressive do that.
Something like --depth=10 would make more sense.

Hoping that clarifies,
Jonathan
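
To make the cost of such a chain concrete, here is a minimal Python
sketch of what resolving the object at the far end of a delta chain
involves. It is not Git's code: the Entry type, the resolve() and
build_chain() helpers and the toy "append these bytes to the base"
delta format are all assumptions made for illustration. Real packs
store zlib-compressed binary deltas.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Entry:
    """Toy pack entry: a full object, or a delta against `base`."""
    payload: bytes               # zlib-compressed data
    base: Optional[str] = None   # name of the base entry; None for a full object


def resolve(pack: Dict[str, Entry], name: str) -> Tuple[bytes, int]:
    """Return (content, number of inflate calls) needed to rebuild `name`."""
    chain = []
    cur: Optional[str] = name
    while cur is not None:           # walk back until a full object is found
        chain.append(pack[cur])
        cur = pack[cur].base
    content = b""
    inflates = 0
    for entry in reversed(chain):    # full object first, then each delta in turn
        content += zlib.decompress(entry.payload)   # toy delta: append to base
        inflates += 1
    return content, inflates


def build_chain(depth: int) -> Dict[str, Entry]:
    """Build one full object followed by `depth` chained deltas."""
    pack = {"v0": Entry(zlib.compress(b"line 0\n"))}
    for i in range(1, depth + 1):
        pack[f"v{i}"] = Entry(zlib.compress(f"line {i}\n".encode()),
                              base=f"v{i - 1}")
    return pack


if __name__ == "__main__":
    for depth in (10, 50, 250):
        _, inflates = resolve(build_chain(depth), f"v{depth}")
        print(f"depth {depth}: {inflates} inflate calls for one object")
```

In this toy model the work to read a single object grows linearly with
the chain depth, unless intermediate results are kept in a cache.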

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-18  9:45   ` Duy Nguyen
  2014-02-18 10:25     ` David Kastrup
@ 2014-02-18 16:43     ` Christian Jaeger
  1 sibling, 0 replies; 31+ messages in thread
From: Christian Jaeger @ 2014-02-18 16:43 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: David Kastrup, Git Mailing List

2014-02-18 9:45 GMT+00:00 Duy Nguyen <pclouds@gmail.com>:
> Christian can try "git repack -adf"

As I already mentioned in my first mail, that's what I used to fix
the problem.

Here are some 'hard' numbers, FWIW:

- both ~/scr and swap are on the same SSD;

$ free
             total       used       free     shared    buffers     cached
Mem:       3996748    3800828     195920          0     377176    1078848
-/+ buffers/cache:    2344804    1651944
Swap:      2097148     169760    1927388

git only used up to about 100 MB of VIRT or RSS when I checked, there
was an ulimit of "-S -v 1200000".

- this is git version 1.7.10.4 (1:1.7.10.4-1+wheezy1 i386 Debian)

- after my attempted merge (which had conflicts and I had then
cancelled by way of git reset --hard), and then a "git gc", the times
were:

~/scr$ time git log --raw > _THELOG

real 3m7.002s
user 2m0.252s
sys 1m6.008s

- on a copy:

/dev/shm/scr$ time git repack -a -d -f
Counting objects: 34917, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (27038/27038), done.
Writing objects: 100% (34917/34917), done.
Total 34917 (delta 13928), reused 0 (delta 0)

real 4m33.193s
user 3m42.950s
sys 1m13.821s

/dev/shm/scr$ time git log --raw > _THELOG2

real 0m8.276s
user 0m7.192s
sys 0m1.052s

(not sure why it took 8s here, perhaps I had another process running
at the same time? Compare with the "0m4.913s" below.)

/dev/shm/scr$ time g-gc --aggressive
Counting objects: 36066, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (27812/27812), done.
Writing objects: 100% (36066/36066), done.
Total 36066 (delta 14367), reused 21699 (delta 0)
Checking connectivity: 36066, done.

real 5m52.013s
user 8m28.652s
sys 1m4.308s

/dev/shm/scr$ time git log --raw > _THELOG2

real 1m34.430s
user 0m47.291s
sys 0m46.615s

/dev/shm/scr$ time git repack -adf
Counting objects: 36066, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (27812/27812), done.
Writing objects: 100% (36066/36066), done.
Total 36066 (delta 14256), reused 21699 (delta 0)

real 2m32.083s
user 1m51.295s
sys 1m4.940s

/dev/shm/scr$ time git log --raw > _THELOG3

real 0m4.913s
user 0m3.944s
sys 0m0.944s

/dev/shm/scr$ du -s .git
43728 .git

- back in the original place:

~/scr$ time git repack -a -d -f
Counting objects: 36066, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (27812/27812), done.
Writing objects: 100% (36066/36066), done.
Total 36066 (delta 14257), reused 21700 (delta 0)

real 4m6.503s
user 3m16.568s
sys 1m11.640s

~/scr$ time git log --raw > _THELOG2

real 0m5.002s
user 0m4.032s
sys 0m0.952s
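
As a cross-check of the "about 40 times" in the subject line, the
ratios between the slow and fast "real" figures above can be computed
with a few lines of Python. This is only an illustrative sketch: the
helper names parse_real() and slowdown() are made up here, and the
inputs are copied from the timings in this mail.

```python
import re


def parse_real(text: str) -> float:
    """Convert a `time` figure such as "3m7.002s" to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def slowdown(slow: str, fast: str) -> float:
    """Return how many times longer `slow` took than `fast`."""
    return parse_real(slow) / parse_real(fast)


if __name__ == "__main__":
    # ~/scr: after the merge attempt and "git gc" vs. after "git repack -a -d -f"
    print(f"~/scr:        {slowdown('3m7.002s', '0m5.002s'):.1f}x")
    # /dev/shm/scr: after "gc --aggressive" vs. after "git repack -adf"
    print(f"/dev/shm/scr: {slowdown('1m34.430s', '0m4.913s'):.1f}x")
```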

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-18 15:59       ` Jonathan Nieder
@ 2014-02-18 20:59         ` Junio C Hamano
  2014-02-18 22:46           ` Duy Nguyen
  2014-02-22  0:36           ` Duy Nguyen
  0 siblings, 2 replies; 31+ messages in thread
From: Junio C Hamano @ 2014-02-18 20:59 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: David Kastrup, Duy Nguyen, Christian Jaeger, Git Mailing List

Jonathan Nieder <jrnieder@gmail.com> writes:

> David Kastrup wrote:
>> Duy Nguyen <pclouds@gmail.com> writes:
>
>>> Likely because --aggressive passes --depth=250 to pack-objects. Long
>>> delta chains could reduce pack size and increase I/O as well as zlib
>>> processing significantly.
> [...]
>> Compression should reduce rather than increase the total amount of
>> reads.
>
> --depth=250 means to allow chains of "To get this object, first
> inflate this object, then apply this delta" of length 250.
>
> That's absurdly long, and doesn't even help compression much in
> practice (many short chains referring to the same objects tend to
> work fine).  We probably shouldn't make --aggressive do that.
> Something like --depth=10 would make more sense.

Yes, my thinking indeed.

I didn't know --aggressive was so aggressive myself, as I personally
never use it. "git repack -a -d -f --depth=32 --window=4000" is what I
often use, but I suspect most people would not be patient enough for
that 4k window.

Let's do something like this first and then later make --depth
configurable just like --window, perhaps?  For "aggressive", I think
the default window (hardcoded to 250 but configurable) is a bit too
narrow.

 builtin/gc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 6be6c8d..0d010f0 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -204,7 +204,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 
 	if (aggressive) {
 		argv_array_push(&repack, "-f");
-		argv_array_push(&repack, "--depth=250");
+		argv_array_push(&repack, "--depth=20");
 		if (aggressive_window > 0)
 			argv_array_pushf(&repack, "--window=%d", aggressive_window);
 	}

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-18 20:59         ` Junio C Hamano
@ 2014-02-18 22:46           ` Duy Nguyen
  2014-02-19  0:10             ` Junio C Hamano
  2014-02-22  0:36           ` Duy Nguyen
  1 sibling, 1 reply; 31+ messages in thread
From: Duy Nguyen @ 2014-02-18 22:46 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, David Kastrup, Christian Jaeger, Git Mailing List

On Wed, Feb 19, 2014 at 3:59 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Let's do something like this first and then later make --depth
> configurable just like --window, perhaps?  For "aggressive", I think
> the default window (hardcoded to 250 but configurable) is a bit too
> narrow.
>
>  builtin/gc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 6be6c8d..0d010f0 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -204,7 +204,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>
>         if (aggressive) {
>                 argv_array_push(&repack, "-f");
> -               argv_array_push(&repack, "--depth=250");
> +               argv_array_push(&repack, "--depth=20");
>                 if (aggressive_window > 0)
>                         argv_array_pushf(&repack, "--window=%d", aggressive_window);
>         }

Lower depth than default (50) does not sound "aggressive" to me, at
least from disk space utilization. I agree it should be configurable
though.
-- 
Duy

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-18 22:46           ` Duy Nguyen
@ 2014-02-19  0:10             ` Junio C Hamano
  2014-02-19  0:33               ` Duy Nguyen
  0 siblings, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2014-02-19  0:10 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Jonathan Nieder, David Kastrup, Christian Jaeger, Git Mailing List

Duy Nguyen <pclouds@gmail.com> writes:

> Lower depth than default (50) does not sound "aggressive" to me, at
> least from disk space utilization. I agree it should be configurable
> though.

Do you mean you want to keep "--aggressive" to mean "too aggressive
in resulting size, to the point that it is not useful to anybody"?

Shallow and wide will give us, with a large window, the most
aggressively efficient packfiles that are useful, and we would
rather want to fix it to be usable, I would think.

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-19  0:10             ` Junio C Hamano
@ 2014-02-19  0:33               ` Duy Nguyen
  2014-02-19  8:38                 ` Philippe Vaucher
  0 siblings, 1 reply; 31+ messages in thread
From: Duy Nguyen @ 2014-02-19  0:33 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, David Kastrup, Christian Jaeger, Git Mailing List

On Wed, Feb 19, 2014 at 7:10 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Duy Nguyen <pclouds@gmail.com> writes:
>
>> Lower depth than default (50) does not sound "aggressive" to me, at
>> least from disk space utilization. I agree it should be configurable
>> though.
>
> Do you mean you want to keep "--aggressive" to mean "too aggressive
> in resulting size, to the point that it is not useful to anybody"?

git-gc.txt is pretty vague about this --aggressive. I assume we would
want both, better disk utilization and performance. But if it produces
a tiny pack that takes forever to access, then it's definitely bad
aggression.

> Shallow and wide will give us, with a large window, the most
> aggressively efficient packfiles that are useful, and we would
> rather want to fix it to be usable, I would think.

fwiw this is the thread that added --depth=250

http://thread.gmane.org/gmane.comp.gcc.devel/94565/focus=94626

yes, if reducing depth leads to better performance and does not use
much disk in general case, then of course we should do it. "General
case" may be hard to define though. It'd be best if we have some sort
of heuristics to try out different combinations on a specific repo and
return the "best" combination of parameters. It could even take longer
time, but once we have good parameters, they should remain good for a
long time, I think.
-- 
Duy

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-19  0:33               ` Duy Nguyen
@ 2014-02-19  8:38                 ` Philippe Vaucher
  2014-02-19  9:01                   ` David Kastrup
                                     ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Philippe Vaucher @ 2014-02-19  8:38 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Junio C Hamano, Jonathan Nieder, David Kastrup, Christian Jaeger,
	Git Mailing List

> fwiw this is the thread that added --depth=250
>
> http://thread.gmane.org/gmane.comp.gcc.devel/94565/focus=94626

This post is quite interesting:
http://article.gmane.org/gmane.comp.gcc.devel/94637

Philippe

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-19  8:38                 ` Philippe Vaucher
@ 2014-02-19  9:01                   ` David Kastrup
  2014-02-19 10:24                     ` Duy Nguyen
  2014-02-19 10:14                   ` Duy Nguyen
  2014-02-19 18:59                   ` Junio C Hamano
  2 siblings, 1 reply; 31+ messages in thread
From: David Kastrup @ 2014-02-19  9:01 UTC (permalink / raw)
  To: Philippe Vaucher
  Cc: Duy Nguyen, Junio C Hamano, Jonathan Nieder, Christian Jaeger,
	Git Mailing List

Philippe Vaucher <philippe.vaucher@gmail.com> writes:

>> fwiw this is the thread that added --depth=250
>>
>> http://thread.gmane.org/gmane.comp.gcc.devel/94565/focus=94626
>
> This post is quite interesting:
> http://article.gmane.org/gmane.comp.gcc.devel/94637

Yes.  Of course I am prejudiced because I volunteered fixing git-blame
on the Emacs developer list in order to make it more feasible to
transfer the Emacs repository to Git.

Calling git blame via C-x v g is a rather important part of the
workflow, and it's currently intolerable to work with on a number of
files.

While I'm fixing the basic shortcomings in builtin/blame.c itself, the
operation "fetch the objects" is necessary for all objects at least
once.  It's conceivable that some nice caching strategy would help with
avoiding the repeated traversal of long delta chain tails.  That could
also help defusing the operation of basic stuff like git-log.

But the short and long end of it is that there are valid operations
accessing a large amount of past history, and one point of having a
distributed version control system with non-shallow repository by
default is to have history and ways of working with it at one's hand.

And git's default modus of operation is _not_ to store things like
copies and moves and renames in commits, but deduce them from looking at
the stored data.  So making looking at stored data including old data
expensive means that Git does not work well in the way it is designed to
operate.

-- 
David Kastrup

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-19  8:38                 ` Philippe Vaucher
  2014-02-19  9:01                   ` David Kastrup
@ 2014-02-19 10:14                   ` Duy Nguyen
  2014-02-20  4:09                     ` Christian Jaeger
  2014-02-20 16:48                     ` David Kastrup
  2014-02-19 18:59                   ` Junio C Hamano
  2 siblings, 2 replies; 31+ messages in thread
From: Duy Nguyen @ 2014-02-19 10:14 UTC (permalink / raw)
  To: Philippe Vaucher
  Cc: Junio C Hamano, Jonathan Nieder, David Kastrup, Christian Jaeger,
	Git Mailing List

On Wed, Feb 19, 2014 at 3:38 PM, Philippe Vaucher
<philippe.vaucher@gmail.com> wrote:
>> fwiw this is the thread that added --depth=250
>>
>> http://thread.gmane.org/gmane.comp.gcc.devel/94565/focus=94626
>
> This post is quite interesting:
> http://article.gmane.org/gmane.comp.gcc.devel/94637

Especially this part

-- 8< --
And quite frankly, a delta depth
of 250 is likely going to cause overflows in the delta cache (which is
only 256 entries in size *and* it's a hash, so it's going to start having
hash conflicts long before hitting the 250 depth limit).
-- 8< --

So in order to get file A's content, we go through its 250 level chain
(and fill the cache), then we get to file B and do the same, which
evicts nearly everything from A. By the time we go to the next commit,
we have to go through 250 levels for A again because the cache is
pretty much useless.
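
The thrashing described above can be reproduced with a toy model. The
sketch below is not the cache from sha1_file.c: it assumes a plain LRU
cache keyed by (file, chain level), and it assumes that every commit
touches every file and needs each object of that file's chain. The
function name simulate() and its parameters are hypothetical.

```python
from collections import OrderedDict


def simulate(depth: int, files: int, commits: int, capacity: int) -> float:
    """Return the hit rate of a toy delta-base cache.

    Every commit touches every file, and touching a file needs each of
    the `depth` objects in that file's delta chain.
    """
    cache = OrderedDict()   # keys ordered from least to most recently used
    hits = misses = 0
    for _ in range(commits):
        for f in range(files):
            for level in range(depth):
                key = (f, level)
                if key in cache:
                    hits += 1
                    cache.move_to_end(key)          # mark as recently used
                else:
                    misses += 1
                    cache[key] = None
                    if len(cache) > capacity:
                        cache.popitem(last=False)   # evict least recently used
    return hits / (hits + misses)


if __name__ == "__main__":
    for depth, capacity in ((250, 256), (50, 256), (250, 512)):
        rate = simulate(depth=depth, files=2, commits=10, capacity=capacity)
        print(f"depth={depth} capacity={capacity}: hit rate {rate:.2f}")
```

With two files at depth 250 and 256 slots, the working set (500
objects) is larger than the cache, so in this model every lookup
misses. With depth 50, or with 512 slots, only the first pass misses.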

I can think of two improvements we could make, either increase cache
size dynamically (within limits) or make it configurable. If we have N
entries in worktree (both trees and blobs) and depth M, then we might
need to cache N*M objects for it to be effective. Christian, if you
want to experiment with this, update MAX_DELTA_CACHE in sha1_file.c and
rebuild.

The other is smarter eviction, instead of throwing all A's cached
items out (based on recent order), keep the last few items of A and
evict B's oldest cached items. Hopefully by the next commit, we can
still reuse some cache for A and other files/trees. Delta cache needs
to learn about grouping to achieve this.
-- 
Duy

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-19  9:01                   ` David Kastrup
@ 2014-02-19 10:24                     ` Duy Nguyen
  0 siblings, 0 replies; 31+ messages in thread
From: Duy Nguyen @ 2014-02-19 10:24 UTC (permalink / raw)
  To: David Kastrup
  Cc: Philippe Vaucher, Junio C Hamano, Jonathan Nieder,
	Christian Jaeger, Git Mailing List

On Wed, Feb 19, 2014 at 4:01 PM, David Kastrup <dak@gnu.org> wrote:
> Calling git blame via C-x v g is a rather important part of the
> workflow, and it's currently intolerable to work with on a number of
> files.
>
> While I'm fixing the basic shortcomings in builtin/blame.c itself, the
> operation "fetch the objects" is necessary for all objects at least
> once.  It's conceivable that some nice caching strategy would help with
> avoiding the repeated traversal of long delta chain tails.  That could
> also help defusing the operation of basic stuff like git-log.

Pack v4 is supposed to tackle this delta chain thing, but its future
is a bit uncertain (you can give a hand btw). If you often do "git
blame", you might consider unpacking the most accessed objects (making
it part of the "blame" process), which would function exactly like a
cache with no extra code. The downside is that git-gc --auto is more
likely to kick in because of too many loose objects and pack
everything up again.
-- 
Duy

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-19  8:38                 ` Philippe Vaucher
  2014-02-19  9:01                   ` David Kastrup
  2014-02-19 10:14                   ` Duy Nguyen
@ 2014-02-19 18:59                   ` Junio C Hamano
  2014-02-20 23:35                     ` Duy Nguyen
  2 siblings, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2014-02-19 18:59 UTC (permalink / raw)
  To: Philippe Vaucher
  Cc: Duy Nguyen, Jonathan Nieder, David Kastrup, Christian Jaeger,
	Git Mailing List

Philippe Vaucher <philippe.vaucher@gmail.com> writes:

>> fwiw this is the thread that added --depth=250
>>
>> http://thread.gmane.org/gmane.comp.gcc.devel/94565/focus=94626
>
> This post is quite interesting:
> http://article.gmane.org/gmane.comp.gcc.devel/94637

Yes, it most clearly says that --depth=250 was *not* a
recommendation, with technical background to explain why such a long
delta chain is a bad idea.

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-19 10:14                   ` Duy Nguyen
@ 2014-02-20  4:09                     ` Christian Jaeger
  2014-02-20 16:48                     ` David Kastrup
  1 sibling, 0 replies; 31+ messages in thread
From: Christian Jaeger @ 2014-02-20  4:09 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Philippe Vaucher, Junio C Hamano, Jonathan Nieder, David Kastrup,
	Git Mailing List

2014-02-19 10:14 GMT+00:00 Duy Nguyen <pclouds@gmail.com>:
> Christian, if you
> want to experiment with this, update MAX_DELTA_CACHE in sha1_file.c and
> rebuild.

I don't have the time right now. (Perhaps next week?)

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-19 10:14                   ` Duy Nguyen
  2014-02-20  4:09                     ` Christian Jaeger
@ 2014-02-20 16:48                     ` David Kastrup
  2014-02-20 17:06                       ` David Kastrup
  1 sibling, 1 reply; 31+ messages in thread
From: David Kastrup @ 2014-02-20 16:48 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Philippe Vaucher, Junio C Hamano, Jonathan Nieder,
	Christian Jaeger, Git Mailing List

Duy Nguyen <pclouds@gmail.com> writes:

> I can think of two improvements we could make, either increase cache
> size dynamically (within limits) or make it configurable. If we have N
> entries in worktree (both trees and blobs) and depth M, then we might
> need to cache N*M objects for it to be effective. Christian, if you
> want to experiment with this, update MAX_DELTA_CACHE in sha1_file.c and
> rebuild.

Well, my optimized "git-blame" code is considerably hit by an
aggressively packed Emacs repository so I took a look at it with the
MAX_DELTA_CACHE value set to the default 256, and then 512, 1024, 2048.

Here are the results:

dak@lola:/usr/local/tmp/emacs$ time ../git/git blame src/xdisp.c >/dev/null

real	1m17.496s
user	0m30.552s
sys	0m46.496s
dak@lola:/usr/local/tmp/emacs$ time ../git/git blame src/xdisp.c >/dev/null

real	1m13.888s
user	0m30.060s
sys	0m43.420s
dak@lola:/usr/local/tmp/emacs$ time ../git/git blame src/xdisp.c >/dev/null

real	1m16.415s
user	0m31.436s
sys	0m44.564s
dak@lola:/usr/local/tmp/emacs$ time ../git/git blame src/xdisp.c >/dev/null

real	1m24.732s
user	0m34.416s
sys	0m49.808s

So using a value of 512 helps a bit (7% or so), but further increases
already cause a hit.  My machine has 4G of memory (32bit x86), so it is
unlikely that memory is running out.  I have no idea why this would be
so: either memory locality plays a role here, or the cache for some
reason gets reinitialized or scanned/copied/accessed as a whole
repeatedly, defeating the idea of a cache.  Or the access patterns are
such that it's entirely useless as a cache even at this size.

Trying with 16384:
dak@lola:/usr/local/tmp/emacs$ time ../git/git blame src/xdisp.c >/dev/null

real	2m8.000s
user	0m54.968s
sys	1m12.624s

And memory consumption did not exceed about 200m all the while, so is
far lower than what would have been available.

Something's _really_ fishy about that cache behavior.  Note that the
_system_ time goes up considerably, not just user time.  Since the packs
are zlib-packed, it's reasonable that more I/O time is also associated
with more user time and it is well possible that the user time increase
is entirely explainable by the larger amount of compressed data to
access.

But this stinks.  I doubt that the additional time is spent in memory
allocation: most of that would register only as user time.  And the
total allocated memory is not large enough that one can explain this
away with fewer available disk buffers for the kernel: the aggressively
packed repo takes about 300m so it would fit into memory together with
the git process.

-- 
David Kastrup

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-20 16:48                     ` David Kastrup
@ 2014-02-20 17:06                       ` David Kastrup
  2014-02-20 18:07                         ` David Kastrup
  0 siblings, 1 reply; 31+ messages in thread
From: David Kastrup @ 2014-02-20 17:06 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Philippe Vaucher, Junio C Hamano, Jonathan Nieder,
	Christian Jaeger, Git Mailing List

David Kastrup <dak@gnu.org> writes:

> Duy Nguyen <pclouds@gmail.com> writes:
>
>> I can think of two improvements we could make, either increase cache
>> size dynamically (within limits) or make it configurable. If we have N
>> entries in worktree (both trees and blobs) and depth M, then we might
>> need to cache N*M objects for it to be effective. Christian, if you
>> want to experiment this, update MAX_DELTA_CACHE in sha1_file.c and
>> rebuild.
>
> Well, my optimized "git-blame" code is considerably hit by an
> aggressively packed Emacs repository so I took a look at it with the
> MAX_DELTA_CACHE value set to the default 256, and then 512, 1024, 2048.

[...]

> Trying with 16384:
> dak@lola:/usr/local/tmp/emacs$ time ../git/git blame src/xdisp.c >/dev/null
>
> real	2m8.000s
> user	0m54.968s
> sys	1m12.624s
>
> And memory consumption did not exceed about 200m all the while, so is
> far lower than what would have been available.

Of course, this has to do with delta_base_cache_limit defaulting to 16m.

> Something's _really_ fishy about that cache behavior.  Note that the
> _system_ time goes up considerably, not just user time.  Since the
> packs are zlib-packed, it's reasonable that more I/O time is also
> associated with more user time and it is well possible that the user
> time increase is entirely explainable by the larger amount of
> compressed data to access.
>
> But this stinks.

And an obvious contender for the stinking is that the "LRU" scheme used
here is _strictly_ freeing memory based on which cache entry has been
_created_ the longest time ago, not which cache entry has been
_accessed_ the longest time ago.  Which means a pure round-robin
strategy for freeing memory rather than LRU.

Let's see what happens when changing this.
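
For illustration, the difference between the two eviction orders can be
sketched with a toy model. This is a minimal Python sketch, not git's
delta base cache: the TinyCache class, the two-slot size and the access
trace are all assumptions made up for the example.

```python
from collections import OrderedDict


class TinyCache:
    """Toy fixed-slot cache; not git's actual delta base cache."""

    def __init__(self, slots, refresh_on_hit):
        self.slots = slots
        # True  -> evict the least recently *accessed* entry (LRU)
        # False -> evict the oldest *created* entry (round-robin)
        self.refresh_on_hit = refresh_on_hit
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, key):
        if key in self.entries:
            self.hits += 1
            if self.refresh_on_hit:
                self.entries.move_to_end(key)  # mark as most recently used
            return
        self.misses += 1
        if len(self.entries) >= self.slots:
            self.entries.popitem(last=False)  # drop the entry at the front
        self.entries[key] = True


def run(trace, slots, refresh_on_hit):
    """Replay a trace and return (hits, misses)."""
    cache = TinyCache(slots, refresh_on_hit)
    for key in trace:
        cache.access(key)
    return cache.hits, cache.misses


# Hypothetical trace: base "A" is reused between other objects.
trace = ["A", "B", "A", "C", "A", "D", "A", "B"]
fifo = run(trace, slots=2, refresh_on_hit=False)
lru = run(trace, slots=2, refresh_on_hit=True)
print("creation order:", fifo)
print("access order:  ", lru)
```

On this made-up trace the access-ordered variant gets 3 hits and the
creation-ordered one 2, because the repeatedly used entry "A" is kept
alive. How much that matters for a real pack depends entirely on the
access pattern.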

-- 
David Kastrup

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-20 17:06                       ` David Kastrup
@ 2014-02-20 18:07                         ` David Kastrup
  0 siblings, 0 replies; 31+ messages in thread
From: David Kastrup @ 2014-02-20 18:07 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Philippe Vaucher, Junio C Hamano, Jonathan Nieder,
	Christian Jaeger, Git Mailing List

David Kastrup <dak@gnu.org> writes:

> David Kastrup <dak@gnu.org> writes:
>
>> Duy Nguyen <pclouds@gmail.com> writes:
>>
>> Something's _really_ fishy about that cache behavior.  Note that the
>> _system_ time goes up considerably, not just user time.  Since the
>> packs are zlib-packed, it's reasonable that more I/O time is also
>> associated with more user time and it is well possible that the user
>> time increase is entirely explainable by the larger amount of
>> compressed data to access.
>>
>> But this stinks.
>
> And an obvious contender for the stinking is that the "LRU" scheme used
> here is _strictly_ freeing memory based on which cache entry has been
> _created_ the longest time ago, not which cache entry has been
> _accessed_ the longest time ago.  Which means a pure round-robin
> strategy for freeing memory rather than LRU.
>
> Let's see what happens when changing this.

Not much.  With any cache size, using a "true" LRU scheme does not buy
more than 2%.  On the other hand, increasing core.deltaBaseCacheLimit
from its default of 16m to 128m in the config file results in the
following difference (with default #define MAX_DELTA_CACHE (256)):

dak@lola:/usr/local/tmp/emacs$ time ../git/git blame src/xdisp.c >/dev/null

real	1m17.446s
user	0m30.696s
sys	0m46.332s
dak@lola:/usr/local/tmp/emacs$ time ../git/git blame src/xdisp.c >/dev/null

real	0m27.519s
user	0m20.248s
sys	0m7.156s

So it would seem that the default available cache slots are not utilized
anyway when operating on this file (about 1MB in size) with the default
of core.deltaBaseCacheLimit.
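
A back-of-the-envelope check of that reading, as a minimal Python
sketch. It assumes that every cached base is about 1 MiB (roughly the
size of the blamed file) and that the cache is bounded by both a slot
count and a byte budget. usable_slots is a hypothetical helper, not a
git function.

```python
MIB = 1024 * 1024


def usable_slots(byte_limit, slot_count, entry_size):
    """Entries that fit when a slot count and a byte budget both apply."""
    return min(slot_count, byte_limit // entry_size)


# 256 slots (MAX_DELTA_CACHE); entries assumed to be about 1 MiB each.
with_16m = usable_slots(16 * MIB, 256, 1 * MIB)
with_128m = usable_slots(128 * MIB, 256, 1 * MIB)
print(with_16m, with_128m)
```

Under these assumptions the 16m default leaves room for only 16 of the
256 slots, and 128m for 128 of them.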

It is still irritating that the performance drops quite a bit with a
considerably larger number of cache slots.

-- 
David Kastrup

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-19 18:59                   ` Junio C Hamano
@ 2014-02-20 23:35                     ` Duy Nguyen
  2014-02-21  0:32                       ` Christian Jaeger
                                         ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Duy Nguyen @ 2014-02-20 23:35 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Philippe Vaucher, Jonathan Nieder, David Kastrup,
	Christian Jaeger, Git Mailing List

On Thu, Feb 20, 2014 at 1:59 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Philippe Vaucher <philippe.vaucher@gmail.com> writes:
>
>>> fwiw this is the thread that added --depth=250
>>>
>>> http://thread.gmane.org/gmane.comp.gcc.devel/94565/focus=94626
>>
>> This post is quite interesting:
>> http://article.gmane.org/gmane.comp.gcc.devel/94637
>
> Yes, it most clearly says that --depth=250 was *not* a
> recommendation, with technical background to explain why such a long
> delta chain is a bad idea.

On the other hand, the size reduction is really nice (320MB vs 500MB).
I don't know if we can do this, but does it make sense to apply
--depth=250 for old commits only and shallow depth for recent commits?

For old projects, commits older than 1-2 years are probably accessed
less often and could use some aggressive packing. This still hits
git-blame badly. We could even make sure all objects "on the blame
surface" have short delta chain. But that may be pushing pack-objects
too much.
-- 
Duy

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-20 23:35                     ` Duy Nguyen
@ 2014-02-21  0:32                       ` Christian Jaeger
  2014-02-21 17:36                         ` Junio C Hamano
  2014-02-21  5:09                       ` Duy Nguyen
  2014-02-21 17:47                       ` Junio C Hamano
  2 siblings, 1 reply; 31+ messages in thread
From: Christian Jaeger @ 2014-02-21  0:32 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Junio C Hamano, Philippe Vaucher, Jonathan Nieder, David Kastrup,
	Git Mailing List

2014-02-20 23:35 GMT+00:00 Duy Nguyen <pclouds@gmail.com>:
> does it make sense to apply
> --depth=250 for old commits only

Just wondering: would it be difficult to fix the problems that lead to
worse than linear slowdown with the --depth? (I.e. adaptive cache/hash
table size.) If the performance difference between say --depth=25 and
--depth=250 could be reduced from a factor 40 to 10 (or better if
things are back to other things taking more time than the object
access), that would seem like a nice gain in any case.
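
To make the "worse than linear" intuition concrete, here is a toy cost
model in Python. It is not a measurement of git: it assumes a single
chain of `depth` deltas on top of a full base, counts one unit of work
per object reconstructed, and compares reading every object in the
chain with no base cache against a cache that can hold every
intermediate result. Both function names are made up.

```python
def work_without_cache(depth):
    """Read every object of a chain once; nothing is kept between reads."""
    # The object at position k needs the base plus k delta applications.
    return sum(k + 1 for k in range(depth + 1))


def work_with_full_cache(depth):
    """Same reads, but every reconstructed object stays cached."""
    # Each of the depth + 1 objects is reconstructed exactly once.
    return depth + 1


for depth in (25, 250):
    print(depth, work_without_cache(depth), work_with_full_cache(depth))
```

In this model, going from depth 25 to 250 multiplies the uncached work
by about 90 (351 vs 31626 units), while the fully cached work grows
only tenfold.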

Also, "man git-gc" should document that --aggressive leads to slower
*read* performance after the gc. I remember having read that option's
docs when I ran it, and since they didn't mention that it makes reads
slower, I didn't expect it to, and thus didn't remember this as the
source of the problem when I noticed that things were slow.

(But, I took from the discussion that increasing the gzip window size
(?) would make things smaller anyway, so perhaps all that isn't even
necessary?)

I can test next week if you have particular suggestions to test.

Christian.

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-20 23:35                     ` Duy Nguyen
  2014-02-21  0:32                       ` Christian Jaeger
@ 2014-02-21  5:09                       ` Duy Nguyen
  2014-02-21 17:47                       ` Junio C Hamano
  2 siblings, 0 replies; 31+ messages in thread
From: Duy Nguyen @ 2014-02-21  5:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Philippe Vaucher, Jonathan Nieder, David Kastrup,
	Christian Jaeger, Git Mailing List

On Fri, Feb 21, 2014 at 06:35:06AM +0700, Duy Nguyen wrote:
> On the other hand, the size reduction is really nice (320MB vs 500MB).
> I don't know if we can do this, but does it make sense to apply
> --depth=250 for old commits only and shallow depth for recent commits?
> 
> For old projects, commits older than 1-2 years is probably less often
> accessed and could use some aggressive packing. This still hits
> git-blame badly. We could even make sure all objects "on the blame
> surface" have short delta chain. But that may be pushing pack-objects
> too much.

We can have a "moderately aggressive" mode like this. With the patch
below, first you repack all and remove all loose objects. Then replay
your favourite use cases with GIT_LOOSE_THEM=1. For example, if I'm
most interested in commits from a year ago

$ GIT_LOOSE_THEM=1 ../git log --raw --since=1.year.ago >/dev/null

all relevant trees will be unpacked. Put --stat there too if you want
to unpack blobs. blame-heavy users may want to blame a few (or all)
files here too to unpack more. Now we can repack aggressively all
non-loose objects:

$ git repack -adf --exclude-loose --depth=250

and repack again, this time with normal depth, which would only affect
loose objects

$ git repack -ad

The end result is a pack with ancient history in potentially long,
tightly packed delta chains, and nearer history with shorter chains.
You will not notice any performance degradation (unless, in my case,
you go back past 1 year of history). And the resulting pack of git.git
is 39M rather than 64M with standard depth.

The use of loose objects to mark recent objects is not efficient (but
fast for this prototype). We could store an SHA-1 map instead.

-- 8< --
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 541667f..0e9dc8c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -82,6 +82,7 @@ static int num_preferred_base;
 static struct progress *progress_state;
 static int pack_compression_level = Z_DEFAULT_COMPRESSION;
 static int pack_compression_seen;
+static int no_loose;
 
 static unsigned long delta_cache_size = 0;
 static unsigned long max_delta_cache_size = 256 * 1024 * 1024;
@@ -2204,7 +2205,12 @@ static void show_object(struct object *obj,
 			const struct name_path *path, const char *last,
 			void *data)
 {
-	char *name = path_name(path, last);
+	char *name;
+
+	if (no_loose && has_loose_object(obj->sha1))
+		return;
+
+	name = path_name(path, last);
 
 	add_preferred_base_object(name);
 	add_object_entry(obj->sha1, obj->type, name, 0);
@@ -2487,6 +2493,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		{ OPTION_SET_INT, 0, "reflog", &rev_list_reflog, NULL,
 		  N_("include objects referred by reflog entries"),
 		  PARSE_OPT_NOARG | PARSE_OPT_NONEG, NULL, 1 },
+		OPT_BOOL(0, "exclude-loose", &no_loose, ""),
 		OPT_BOOL(0, "stdout", &pack_to_stdout,
 			 N_("output pack to stdout")),
 		OPT_BOOL(0, "include-tag", &include_tag,
diff --git a/builtin/repack.c b/builtin/repack.c
index bb2314c..9b8bb35 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -137,6 +137,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int no_update_server_info = 0;
 	int quiet = 0;
 	int local = 0;
+	int no_loose = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -152,6 +153,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("pass --no-reuse-object to git-pack-objects")),
 		OPT_BOOL('n', NULL, &no_update_server_info,
 				N_("do not run git-update-server-info")),
+		OPT_BOOL(0, "exclude-loose", &no_loose, ""),
 		OPT__QUIET(&quiet, N_("be quiet")),
 		OPT_BOOL('l', "local", &local,
 				N_("pass --local to git-pack-objects")),
@@ -184,6 +186,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	argv_array_push(&cmd_args, "--non-empty");
 	argv_array_push(&cmd_args, "--all");
 	argv_array_push(&cmd_args, "--reflog");
+	if (no_loose)
+		argv_array_push(&cmd_args, "--exclude-loose");
 	if (window)
 		argv_array_pushf(&cmd_args, "--window=%s", window);
 	if (window_memory)
diff --git a/sha1_file.c b/sha1_file.c
index 6e8c05d..d0988f2 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -454,7 +454,7 @@ int has_loose_object_nonlocal(const unsigned char *sha1)
 	return 0;
 }
 
-static int has_loose_object(const unsigned char *sha1)
+int has_loose_object(const unsigned char *sha1)
 {
 	return has_loose_object_local(sha1) ||
 	       has_loose_object_nonlocal(sha1);
@@ -2114,6 +2114,11 @@ struct unpack_entry_stack_ent {
 	unsigned long size;
 };
 
+static void write_sha1_file_prepare(const void *buf, unsigned long len,
+				    const char *type, unsigned char *sha1,
+				    char *hdr, int *hdrlen);
+static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
+			      const void *buf, unsigned long len, time_t mtime);
 void *unpack_entry(struct packed_git *p, off_t obj_offset,
 		   enum object_type *final_type, unsigned long *final_size)
 {
@@ -2126,6 +2131,7 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset,
 	struct unpack_entry_stack_ent *delta_stack = small_delta_stack;
 	int delta_stack_nr = 0, delta_stack_alloc = UNPACK_ENTRY_STACK_PREALLOC;
 	int base_from_cache = 0;
+	static int let_them_loose = -1;
 
 	if (log_pack_access != no_log_pack_access)
 		write_pack_access_log(p, obj_offset);
@@ -2288,6 +2294,17 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset,
 	*final_type = type;
 	*final_size = size;
 
+	if (let_them_loose == -1)
+		let_them_loose = getenv("GIT_LOOSE_THEM") != NULL;
+	if (let_them_loose && (type == OBJ_TREE || type == OBJ_BLOB)) {
+		unsigned char sha1[20];
+		char hdr[32];
+		int hdrlen;
+		write_sha1_file_prepare(data, size, typename(type), sha1, hdr, &hdrlen);
+		if (!has_loose_object(sha1))
+			write_loose_object(sha1, hdr, hdrlen, data, size, 0);
+	}
+
 	unuse_pack(&w_curs);
 	return data;
 }
-- 8< --

--
Duy

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-21  0:32                       ` Christian Jaeger
@ 2014-02-21 17:36                         ` Junio C Hamano
  0 siblings, 0 replies; 31+ messages in thread
From: Junio C Hamano @ 2014-02-21 17:36 UTC (permalink / raw)
  To: Christian Jaeger
  Cc: Duy Nguyen, Philippe Vaucher, Jonathan Nieder, David Kastrup,
	Git Mailing List

Christian Jaeger <chrjae@gmail.com> writes:

> Also, in "man git-gc" document --aggressive that it leads to slower
> *read* performance after the gc, I remember having red that option's
> docs when I ran it, and since it didn't mention that it makes reads
> slower, I didn't expect it to, and thus didn't remember this as the
> source of the problem when I noticed that things were slow.

Good point. We would at least need such a documentation update to
warn users.

> (But, I took from the discussion that increasing the gzip window size
> (?) would make things smaller anyway, so perhaps all that isn't even
> necessary?)

If you are talking about "--window" in "git repack --window=xxxx",
that is not related to gzip.  It is the number of other "similar"
objects each object is tried as a delta against, in order to find the
smallest delta that can represent it in the pack.  A better delta, if found,
can give you a packfile with a smaller depth that is as small as
another packfile created with a larger depth, which is an overall
win, and using a wider window is a way to achieve such a result.
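
As a rough illustration of what the window does, here is a toy packer
in Python. It is only a sketch under loud assumptions: objects are
modelled as sets of line numbers, the "delta size" is the number of
lines added or removed, and candidates are simply the previous `window`
objects in list order. Real pack-objects sorts and compresses quite
differently; toy_pack is a made-up name.

```python
def toy_pack(objects, window, max_depth):
    """Greedy toy packer; returns (total_cost, depth_of_each_object)."""
    depths = []
    total = 0
    for i, obj in enumerate(objects):
        # Fallback: store the object whole, at depth 0.
        best_cost, best_depth = len(obj), 0
        # Only the previous `window` objects are tried as delta bases.
        for j in range(max(0, i - window), i):
            if depths[j] + 1 > max_depth:
                continue  # this base would make the chain too deep
            cost = len(obj ^ objects[j])  # lines added or removed
            if cost < best_cost:
                best_cost, best_depth = cost, depths[j] + 1
        depths.append(best_depth)
        total += best_cost
    return total, depths


# Two unrelated "files", each with a slightly longer later version.
objects = [
    {1, 2, 3, 4, 5, 6},
    {10, 11, 12, 13, 14, 15},
    {1, 2, 3, 4, 5, 6, 7},
    {10, 11, 12, 13, 14, 15, 16},
]
narrow = toy_pack(objects, window=1, max_depth=10)
wide = toy_pack(objects, window=2, max_depth=10)
print("window=1:", narrow)
print("window=2:", wide)
```

With window=1 no useful base is within reach and everything is stored
whole (total cost 26); with window=2 each later version finds its
predecessor and the total drops to 14 at a chain depth of only 1.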

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-20 23:35                     ` Duy Nguyen
  2014-02-21  0:32                       ` Christian Jaeger
  2014-02-21  5:09                       ` Duy Nguyen
@ 2014-02-21 17:47                       ` Junio C Hamano
  2014-02-24  9:27                         ` Philippe Vaucher
  2 siblings, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2014-02-21 17:47 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Philippe Vaucher, Jonathan Nieder, David Kastrup,
	Christian Jaeger, Git Mailing List

Duy Nguyen <pclouds@gmail.com> writes:

> For old projects, commits older than 1-2 years is probably less often
> accessed and could use some aggressive packing.

I used to repack the older part of history manually with a deeper depth,
mark the result with the .keep bit, and then repack the whole thing
again to have the remainder in a shallower depth.  Something like:

	git rev-list --objects v1.5.3 |
        git pack-objects --depth=128 --delta-base-offset pack

would give me the first pack (in real life, I would use a larger
window size like 4096).  I would then place the resulting .pack and
.idx files along with a .keep file in .git/objects/pack/ and run
"git repack -a -d" to pack the rest.
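
The whole sequence could be scripted roughly as follows. This is a
minimal Python sketch, not a tested tool: keep_pack_commands and
run_keep_pack are made-up helper names, and it assumes it is run from
the top of a non-bare repository while nothing else is repacking. It
writes the pack straight into .git/objects/pack/ instead of moving it
there afterwards.

```python
import subprocess
from pathlib import Path

PACK_DIR = ".git/objects/pack"  # assumes a non-bare repo, run from its top


def keep_pack_commands(old_tip, depth=128, window=4096):
    """Build the three command lines; nothing is executed here."""
    rev_list = ["git", "rev-list", "--objects", old_tip]
    pack_objects = [
        "git", "pack-objects",
        f"--depth={depth}", f"--window={window}",
        "--delta-base-offset", f"{PACK_DIR}/pack",
    ]
    repack_rest = ["git", "repack", "-a", "-d"]
    return rev_list, pack_objects, repack_rest


def run_keep_pack(old_tip, depth=128, window=4096):
    """Pack history up to old_tip deeply, mark it .keep, repack the rest."""
    rev_list, pack_objects, repack_rest = keep_pack_commands(
        old_tip, depth, window)
    listing = subprocess.run(rev_list, check=True,
                             capture_output=True).stdout
    # pack-objects prints the hash used in the pack file name on stdout.
    out = subprocess.run(pack_objects, input=listing, check=True,
                         capture_output=True).stdout
    pack_hash = out.decode().strip()
    Path(f"{PACK_DIR}/pack-{pack_hash}.keep").touch()
    subprocess.run(repack_rest, check=True)
    return pack_hash


# Only show the commands; call run_keep_pack("v1.5.3") to execute them.
for command in keep_pack_commands("v1.5.3"):
    print(" ".join(command))
```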

> This still hits git-blame badly. We could even make sure all
> objects "on the blame surface" have short delta chain. But that
> may be pushing pack-objects too much.

Yes, you can do a similar trick by blaming all the paths that ever
existed in the project, parse its --porcelain output to learn all
the commits and paths involved, to find the objects that need
quicker access.  Pack such objects in a pack with a shallow depth,
tentatively mark that pack with .keep, repack the remainder with a
deep depth, remove .keep from the first pack and mark the new pack
with .keep to prevent it from getting repacked, or something like
that.
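
The parsing step could look roughly like this. A minimal Python sketch,
assuming 40-hex SHA-1 object names and the usual --porcelain layout (a
header line per line group, "filename" among the commit details, and
content lines prefixed with a tab). blame_surface is a made-up helper
name, and the sample text is fabricated for illustration, not real
blame output.

```python
import re

# "<40-hex commit> <orig line> <final line> [<lines in group>]"
HEADER = re.compile(r"^([0-9a-f]{40}) \d+ \d+(?: \d+)?$")


def blame_surface(porcelain_text):
    """Collect (commit, path) pairs named in blame --porcelain output."""
    pairs = set()
    current = None
    for line in porcelain_text.splitlines():
        if line.startswith("\t"):
            continue  # file content, never metadata
        match = HEADER.match(line)
        if match:
            current = match.group(1)
        elif line.startswith("filename ") and current is not None:
            pairs.add((current, line[len("filename "):]))
    return pairs


# Fabricated sample, shaped like porcelain output.
A, B = "a" * 40, "b" * 40
sample = (
    f"{A} 1 1 2\nauthor A U Thor\nfilename src/xdisp.c\n\tline one\n"
    f"{A} 2 2\n\tline two\n"
    f"{B} 7 3 1\nauthor Someone Else\nfilename src/dispnew.c\n"
    "\tfilename is just text here\n"
)
for commit, path in sorted(blame_surface(sample)):
    print(f"{commit}:{path}")
```

Each "<commit>:<path>" pair names a blob that "git rev-parse" can
resolve to an object name, and those names could then be fed to
pack-objects for the shallow pack.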

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-18 20:59         ` Junio C Hamano
  2014-02-18 22:46           ` Duy Nguyen
@ 2014-02-22  0:36           ` Duy Nguyen
  2014-02-22  6:20             ` David Kastrup
  1 sibling, 1 reply; 31+ messages in thread
From: Duy Nguyen @ 2014-02-22  0:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, David Kastrup, Christian Jaeger, Git Mailing List

On Wed, Feb 19, 2014 at 3:59 AM, Junio C Hamano <gitster@pobox.com> wrote:
> I didn't know --agressive was so aggressive myself, as I personally
> never use it. "git repack -a -d -f --depth=32 window=4000" is what I
> often use, but I suspect most people would not be patient enough for
> that 4k window.
>
> Let's do something like this first and then later make --depth
> configurable just like --width, perhaps?  For "aggressive", I think
> the default width (hardcoded to 250 but configurable) is a bit too
> narrow.

OK with git://git.savannah.gnu.org/emacs.git we have

 - a 209MB pack with --aggressive
 - 1.3GB with --depth=50
 - 1.3GB with --window=4000 --depth=32
 - 1.3GB with --depth=20
 - 821MB with --depth=250 for commits --before=2.years.ago, --depth=50
   for the rest

So I don't think we should go with your following patch because the
size explosion is just too much no matter how much faster it could be. An
immediate action could be just make --depth=250 configurable and let
people deal with it. A better option is something like "3 repack
steps" you described where we pack deep depth first, mark .keep, pack
shallower depth and combine them all into one.

I'm not really happy with --depth=250 producing 209MB while
--depth=250 --before=2.year.ago produces an 800MB pack. It looks wrong
(or maybe I did something wrong).

>
>  builtin/gc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 6be6c8d..0d010f0 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -204,7 +204,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>
>         if (aggressive) {
>                 argv_array_push(&repack, "-f");
> -               argv_array_push(&repack, "--depth=250");
> +               argv_array_push(&repack, "--depth=20");
>                 if (aggressive_window > 0)
>                         argv_array_pushf(&repack, "--window=%d", aggressive_window);
>         }



-- 
Duy

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-22  0:36           ` Duy Nguyen
@ 2014-02-22  6:20             ` David Kastrup
  2014-02-22  8:53               ` David Kastrup
  2014-02-22  9:57               ` Andreas Schwab
  0 siblings, 2 replies; 31+ messages in thread
From: David Kastrup @ 2014-02-22  6:20 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Junio C Hamano, Jonathan Nieder, Christian Jaeger, Git Mailing List

Duy Nguyen <pclouds@gmail.com> writes:

> OK with git://git.savannah.gnu.org/emacs.git we have
>
>  - a 209MB pack with --aggressive
>  - 1.3GB with --depth=50
>  - 1.3GB with --window=4000 --depth=32
>  - 1.3GB with --depth=20
>  - 821MB with --depth=250 for commits --before=2.years.ago, --depth=50
> for the rest
>
> So I don't think we should go with your following patch because the
> size explosion is just too much no matter how faster it could be. An
> immediate action could be just make --depth=250 configurable and let
> people deal with it. A better option is something like "3 repack
> steps" you described where we pack deep depth first, mark .keep, pack
> shallower depth and combine them all into one.
>
> I'm not really happy with --depth=250 producing 209MB while
> --depth=250 --before=2.year.ago a 800MB pack. It looks wrong (or maybe
> I did something wrong)

That does look strange: Emacs has a history of more than 30 years.  But
the Git mirror is quite a bit younger.  Maybe one needs to make sure to use
the author date rather than the commit date here?

-- 
David Kastrup

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-22  6:20             ` David Kastrup
@ 2014-02-22  8:53               ` David Kastrup
  2014-02-22  9:14                 ` Duy Nguyen
  2014-02-22  9:57               ` Andreas Schwab
  1 sibling, 1 reply; 31+ messages in thread
From: David Kastrup @ 2014-02-22  8:53 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Junio C Hamano, Jonathan Nieder, Christian Jaeger, Git Mailing List

David Kastrup <dak@gnu.org> writes:

> Duy Nguyen <pclouds@gmail.com> writes:
>
>> OK with git://git.savannah.gnu.org/emacs.git we have
>>
>>  - a 209MB pack with --aggressive
>>  - 1.3GB with --depth=50
>>  - 1.3GB with --window=4000 --depth=32
>>  - 1.3GB with --depth=20
>>  - 821MB with --depth=250 for commits --before=2.years.ago, --depth=50
>> for the rest
>>
>> So I don't think we should go with your following patch because the
>> size explosion is just too much no matter how faster it could be. An
>> immediate action could be just make --depth=250 configurable and let
>> people deal with it. A better option is something like "3 repack
>> steps" you described where we pack deep depth first, mark .keep, pack
>> shallower depth and combine them all into one.
>>
>> I'm not really happy with --depth=250 producing 209MB while
>> --depth=250 --before=2.year.ago a 800MB pack. It looks wrong (or maybe
>> I did something wrong)
>
> That does look strange: Emacs has a history of more than 30 years.  But
> the Git mirror is quite younger.  Maybe one needs to make sure to use
> the author date rather than the commit date here?

Another thing: did you really use --depth=250 here or did you use
--aggressive?  It may be that the latter also sets other options?

-- 
David Kastrup

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-22  8:53               ` David Kastrup
@ 2014-02-22  9:14                 ` Duy Nguyen
  2014-02-22 13:00                   ` Duy Nguyen
  0 siblings, 1 reply; 31+ messages in thread
From: Duy Nguyen @ 2014-02-22  9:14 UTC (permalink / raw)
  To: David Kastrup
  Cc: Junio C Hamano, Jonathan Nieder, Christian Jaeger, Git Mailing List

On Sat, Feb 22, 2014 at 3:53 PM, David Kastrup <dak@gnu.org> wrote:
> David Kastrup <dak@gnu.org> writes:
>
>> Duy Nguyen <pclouds@gmail.com> writes:
>>
>>> OK with git://git.savannah.gnu.org/emacs.git we have
>>>
>>>  - a 209MB pack with --aggressive
>>>  - 1.3GB with --depth=50
>>>  - 1.3GB with --window=4000 --depth=32
>>>  - 1.3GB with --depth=20
>>>  - 821MB with --depth=250 for commits --before=2.years.ago, --depth=50
>>> for the rest
>>>
>>> So I don't think we should go with your following patch because the
>>> size explosion is just too much no matter how faster it could be. An
>>> immediate action could be just make --depth=250 configurable and let
>>> people deal with it. A better option is something like "3 repack
>>> steps" you described where we pack deep depth first, mark .keep, pack
>>> shallower depth and combine them all into one.
>>>
>>> I'm not really happy with --depth=250 producing 209MB while
>>> --depth=250 --before=2.year.ago a 800MB pack. It looks wrong (or maybe
>>> I did something wrong)
>>
>> That does look strange: Emacs has a history of more than 30 years.  But
>> the Git mirror is quite younger.  Maybe one needs to make sure to use
>> the author date rather than the commit date here?

I think commit date is fine because it covers a large portion of
objects (649946 out of a total of 739990) and it does not (or should
not) affect object ordering in pack-objects/rev-list.

> Another thing: did you really use --depth=250 here or did you use
> --aggressive?  It may be that the latter also sets other options?

I can't use --aggressive because I need to feed revisions directly to
pack-objects. --aggressive also sets --window=250. Thanks for
checking. My machine will have another workout session.
-- 
Duy

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-22  6:20             ` David Kastrup
  2014-02-22  8:53               ` David Kastrup
@ 2014-02-22  9:57               ` Andreas Schwab
  1 sibling, 0 replies; 31+ messages in thread
From: Andreas Schwab @ 2014-02-22  9:57 UTC (permalink / raw)
  To: David Kastrup
  Cc: Duy Nguyen, Junio C Hamano, Jonathan Nieder, Christian Jaeger,
	Git Mailing List

David Kastrup <dak@gnu.org> writes:

> That does look strange: Emacs has a history of more than 30 years.  But
> the Git mirror is quite younger.  Maybe one needs to make sure to use
> the author date rather than the commit date here?

There is no difference between commit and author date in the Emacs git
mirror since bzr doesn't keep that distinction (and cvs didn't either).

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-22  9:14                 ` Duy Nguyen
@ 2014-02-22 13:00                   ` Duy Nguyen
  0 siblings, 0 replies; 31+ messages in thread
From: Duy Nguyen @ 2014-02-22 13:00 UTC (permalink / raw)
  To: David Kastrup
  Cc: Junio C Hamano, Jonathan Nieder, Christian Jaeger, Git Mailing List

On Sat, Feb 22, 2014 at 4:14 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Sat, Feb 22, 2014 at 3:53 PM, David Kastrup <dak@gnu.org> wrote:
>> David Kastrup <dak@gnu.org> writes:
>>
>>> Duy Nguyen <pclouds@gmail.com> writes:
>>>
>>>> OK with git://git.savannah.gnu.org/emacs.git we have
>>>>
>>>>  - a 209MB pack with --aggressive
>>>>  - 1.3GB with --depth=50
>>>>  - 1.3GB with --window=4000 --depth=32
>>>>  - 1.3GB with --depth=20
>>>>  - 821MB with --depth=250 for commits --before=2.years.ago, --depth=50
>>>> for the rest
...
>>>>
>>>> I'm not really happy with --depth=250 producing 209MB while
>>>> --depth=250 --before=2.year.ago a 800MB pack. It looks wrong (or maybe
>>>> I did something wrong)
....
>> Another thing: did you really use --depth=250 here or did you use
>> --aggressive?  It may be that the latter also sets other options?
>
> I can't use --aggressive because I need to feed revisions directly to
> pack-objects. --aggressive also sets --window=250. Thanks for
> checking. My machine will have another workout session.

And 800MB is reduced to 177MB, containing history older than 2 years.
The final pack is 199MB, within the size range of current --aggressive
and should be reasonably fast on most operations. Again blame could
still hit long delta chains but I think we should just unpack some
trees/blobs when we hit long delta chains.

I think we should update --aggressive to do it this way. So

 - gc.aggressiveDepth defaults to 50 (or 20?); this is used for recent
   history
 - gc.aggressiveDeepDepth defaults to 250 (or smaller??); this is used
   for ancient history
 - gc.aggressiveDeepOption is a rev-list option that defines "ancient
   history", defaulting to --before=2.years.ago. This option could be
   specified multiple times.

Both packing phases use the same gc.aggressiveWindow. We could add
gc.aggressiveDeepWindow too.
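
The intended split can be sketched in a few lines of Python. This is
only an illustration of the proposal: the "Deep" keys are proposed
names, not existing options; proposed_depth is a made-up helper; two
years is approximated as 730 days; and a real implementation would
select objects through rev-list rather than decide per commit.

```python
from datetime import datetime, timedelta

# Values from the proposal above; the "Deep" key is a proposed name only.
PROPOSED = {
    "gc.aggressiveDepth": 50,       # recent history
    "gc.aggressiveDeepDepth": 250,  # ancient history
}


def proposed_depth(commit_date, now, ancient_after=timedelta(days=2 * 365)):
    """Pick the delta depth limit for a commit under the proposed split."""
    if now - commit_date > ancient_after:
        return PROPOSED["gc.aggressiveDeepDepth"]
    return PROPOSED["gc.aggressiveDepth"]


now = datetime(2014, 2, 22)
print(proposed_depth(datetime(2010, 1, 1), now))   # ancient commit
print(proposed_depth(datetime(2013, 12, 1), now))  # recent commit
```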

GSoC project?
-- 
Duy

* Re: git gc --aggressive led to about 40 times slower "git log --raw"
  2014-02-21 17:47                       ` Junio C Hamano
@ 2014-02-24  9:27                         ` Philippe Vaucher
  0 siblings, 0 replies; 31+ messages in thread
From: Philippe Vaucher @ 2014-02-24  9:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Duy Nguyen, Jonathan Nieder, David Kastrup, Christian Jaeger,
	Git Mailing List

> I used to repack older part of history manually with a deeper depth,
> mark the result with the .keep bit, and then repack the whole thing
> again to have the remainder in a shallower depth.  Something like:
>
>         git rev-list --objects v1.5.3 |
>         git pack-objects --depth=128 --delta-base-offset pack
>
> would give me the first pack (in real life, I would use a larger
> window size like 4096), and then after placing the resulting .pack
> and .idx files along with a .keep file in .git/objects/pack/,
> running "git repack -a -d" to pack the rest.

I'm curious: after this repacking, how do you publish these packs?
With git push? If so, by what criteria does the remote repo know
which pack it should fetch?

Or maybe it's only a local operation and thus you cannot do it on the
remote without ssh access?

Philippe

end of thread, other threads:[~2014-02-24  9:28 UTC | newest]

Thread overview: 31+ messages
2014-02-18  7:25 git gc --aggressive led to about 40 times slower "git log --raw" Christian Jaeger
2014-02-18  8:55 ` David Kastrup
2014-02-18  9:45   ` Duy Nguyen
2014-02-18 10:25     ` David Kastrup
2014-02-18 15:59       ` Jonathan Nieder
2014-02-18 20:59         ` Junio C Hamano
2014-02-18 22:46           ` Duy Nguyen
2014-02-19  0:10             ` Junio C Hamano
2014-02-19  0:33               ` Duy Nguyen
2014-02-19  8:38                 ` Philippe Vaucher
2014-02-19  9:01                   ` David Kastrup
2014-02-19 10:24                     ` Duy Nguyen
2014-02-19 10:14                   ` Duy Nguyen
2014-02-20  4:09                     ` Christian Jaeger
2014-02-20 16:48                     ` David Kastrup
2014-02-20 17:06                       ` David Kastrup
2014-02-20 18:07                         ` David Kastrup
2014-02-19 18:59                   ` Junio C Hamano
2014-02-20 23:35                     ` Duy Nguyen
2014-02-21  0:32                       ` Christian Jaeger
2014-02-21 17:36                         ` Junio C Hamano
2014-02-21  5:09                       ` Duy Nguyen
2014-02-21 17:47                       ` Junio C Hamano
2014-02-24  9:27                         ` Philippe Vaucher
2014-02-22  0:36           ` Duy Nguyen
2014-02-22  6:20             ` David Kastrup
2014-02-22  8:53               ` David Kastrup
2014-02-22  9:14                 ` Duy Nguyen
2014-02-22 13:00                   ` Duy Nguyen
2014-02-22  9:57               ` Andreas Schwab
2014-02-18 16:43     ` Christian Jaeger
