* git repack command on larger pack file
@ 2015-10-26  5:57 Sivakumar Selvam
  2015-10-26  6:41 ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Sivakumar Selvam @ 2015-10-26  5:57 UTC (permalink / raw)
  To: git

Hi,
   I ran git repack on a single large repository, abc.git, whose pack
file is 34 GB. Repacking usually takes 20-25 minutes on my server.
During repacking I noticed high disk usage, so I thought of splitting
the pack file into 4 GB chunks. I used the following command to repack:
   git repack -A -b -d -q --depth=50 --window=10 abc.git

   I then added --max-pack-size=4g to the command and ran it again to
split the pack file:
   git repack -A -b -d -q --depth=50 --window=10 --max-pack-size=4g abc.git
 
   When it finished, I found 12 pack files of 4 GB each, 48 GB in total,
so my disk usage has increased by 14 GB. I ran it again to check the
performance: the size was still 48 GB and the repack took 35 minutes
more. What causes this issue? If we split a large pack file, repacking
takes more time and the pack files use more disk space. Any thoughts on
why this happens?

Thanks,
Sivakumar Selvam.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: git repack command on larger pack file
  2015-10-26  5:57 git repack command on larger pack file Sivakumar Selvam
@ 2015-10-26  6:41 ` Junio C Hamano
  2015-10-26  7:11   ` Junio C Hamano
  2015-10-27 23:47   ` Jeff King
  0 siblings, 2 replies; 9+ messages in thread
From: Junio C Hamano @ 2015-10-26  6:41 UTC (permalink / raw)
  To: Sivakumar Selvam; +Cc: git

Sivakumar Selvam <gerritcode@gmail.com> writes:

>    I ran git repack on a single large repository, abc.git, whose pack
> file is 34 GB. Repacking usually takes 20-25 minutes on my server.
> During repacking I noticed high disk usage, so I thought of splitting
> the pack file into 4 GB chunks. I used the following command to repack:
>    git repack -A -b -d -q --depth=50 --window=10 abc.git
>
>    I then added --max-pack-size=4g to the command and ran it again to
> split the pack file:
>    git repack -A -b -d -q --depth=50 --window=10 --max-pack-size=4g abc.git
>  
>    When it finished, I found 12 pack files of 4 GB each, 48 GB in total,
> so my disk usage has increased by 14 GB. I ran it again to check the
> performance: the size was still 48 GB and the repack took 35 minutes
> more. What causes this issue?

Hmmm, what is "this issue"?  I do not see anything surprising.

If you have N objects and run repack with window=10, you would
(roughly speaking, without taking the various optimizations we have
and bootstrap conditions into account) check each of these N objects
against 10 other objects to find a good delta base, no matter how big
your max pack-size is set.  And that takes the bulk of the time in the
repack process.  Also it has to write more data to disk (see below),
it has to find a good place to split, it has to adjust bookkeeping
data at the pack boundary, in general it has to do more, not less,
to produce split packs.  It would be surprising if it took less
time.

Each pack by definition has to be self-sufficient; every delta in the
pack must have its base object in the same pack.  Now, imagine that
an object (call it X) would have been expressed as a delta derived
from another object (call it Y) if you were producing a single pack,
and imagine that the pack has grown to be 4 GB big just before you
write object X out.  The current pack (which contains the base
object Y already) needs to be closed and then a new pack is opened.
Imagine how you would write X now into that new pack.  You have to
discard the deltified representation of X (which by definition is
much smaller, because it is an instruction to reconstitute X given
an object Y whose contents are very similar to X) and write the base
representation of X to the pack, because X can no longer be
expressed as a delta derived from Y.  That is why you would need to
write more.
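
One way to observe this effect is to count how many objects the packs
store whole and how many they store as deltas, before and after
splitting.  What follows is only a minimal sketch and was not part of
the original message: the function name pack_delta_summary is made up
for illustration, it assumes a git recent enough to have
"rev-parse --git-path", and it relies on the documented per-object
output of "git verify-pack -v" (five fields for an object stored whole,
seven for a deltified one).

```shell
# pack_delta_summary [repo]
# Hypothetical helper: prints "whole=N delta=M" summed over every pack
# in the repository (defaults to the current directory).
pack_delta_summary() {
    (
        cd "${1:-.}" || exit 1
        packdir=$(git rev-parse --git-path objects/pack) || exit 1
        for idx in "$packdir"/pack-*.idx; do
            [ -e "$idx" ] || continue        # repository has no packs
            LC_ALL=C git verify-pack -v "$idx"
        done | awk '
            # Object lines start with a hex object name; the histogram
            # and "ok" lines printed at the end do not.
            $1 ~ /^[0-9a-f]+$/ && length($1) >= 40 {
                if (NF == 5) whole++         # stored in full
                if (NF == 7) delta++         # stored as a delta (depth + base)
            }
            END { printf "whole=%d delta=%d\n", whole, delta }
        '
    )
}
```

Running it once on the single-pack repository and once after the
--max-pack-size repack should show objects moving from the "delta"
count to the "whole" count if the explanation above applies.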


* Re: git repack command on larger pack file
  2015-10-26  6:41 ` Junio C Hamano
@ 2015-10-26  7:11   ` Junio C Hamano
  2015-10-27  2:04     ` Sivakumar Selvam
  2015-10-27  8:52     ` Philip Oakley
  2015-10-27 23:47   ` Jeff King
  1 sibling, 2 replies; 9+ messages in thread
From: Junio C Hamano @ 2015-10-26  7:11 UTC (permalink / raw)
  To: Sivakumar Selvam; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Sivakumar Selvam <gerritcode@gmail.com> writes:
>
>> ... So
>> I thought of splitting the pack file into 4 GB chunks.
> ...
> Hmmm, what is "this issue"?  I do not see anything surprising.

While the explanation might have been enlightening, the knowledge
conveyed by the explanation by itself would not be of much practical
use, and enlightenment without practical use is never fun.

So let's do another tangent that may be more useful.

In many repositories, older parts of the history often hold the bulk
of objects that do not change, and it is wasteful to repack them
over and over.  If your project is at around v40.0 today, and it was
at around v36.0 6 months ago, for example, you may want to pack
everything that happened before v36.0 into a single pack just once,
pack them really well, and have your "repack" not touch that old
part of the history.

  $ git rev-list --objects v36.0 |
    git pack-objects --window=200 --depth=128 pack

would produce such a pack [*1*]

The standard output from the above pipeline will give you a 40-hex
string (e.g. 51c472761b4690a331c02c90ec364e47cca1b3ac, call it
$HEX), and in the current directory you will find two files,
pack-$HEX.pack and pack-$HEX.idx.

You can then do this:

  $ echo "v36.0 with W/D 200/128" >pack-$HEX.keep
  $ mv pack-$HEX.* .git/objects/pack/.
  $ git repack -a -d

A pack that has an accompanying .keep file is exempt from
repacking, so once you do this, your future "git repack" will only
repack objects that are not in the kept packs.
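
The steps above can be strung together in a small script.  This is a
sketch under stated assumptions rather than a recommended tool: the
name make_keep_pack is hypothetical, the 200/128 values are just the
ones used in the example above, and "rev-parse --git-path" is used so
that the same code finds objects/pack in both bare and non-bare
repositories.

```shell
# make_keep_pack <rev> <note>
# Hypothetical helper: packs everything reachable from <rev> once,
# marks the pack as kept, and prints the pack's hex name.
# Run it from inside the repository.
make_keep_pack() {
    rev=$1
    note=$2
    packdir=$(git rev-parse --git-path objects/pack) || return 1
    tmpdir=$(mktemp -d) || return 1

    # pack-objects prints the hex name of the pack it wrote on stdout.
    hex=$(git rev-list --objects "$rev" |
          git pack-objects -q --window=200 --depth=128 "$tmpdir/pack")
    if [ -z "$hex" ]; then
        rm -rf "$tmpdir"
        return 1
    fi

    # Create the .keep file before the pack is moved into place.
    printf '%s\n' "$note" >"$tmpdir/pack-$hex.keep"
    mv "$tmpdir"/pack-"$hex".* "$packdir"/ || return 1
    rm -rf "$tmpdir"
    printf '%s\n' "$hex"
}
```

A possible invocation would be
make_keep_pack v36.0 "v36.0 with W/D 200/128" followed by
git repack -a -d.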



[Footnote]

*1* I won't say 200/128 gives you a good pack; you would need to
experiment.  In general, larger depth will result in smaller pack
but it will result in bigger overhead while you use the repository
every day.  Larger window will spend a lot of cycles while packing,
but will result in a smaller pack.


* Re: git repack command on larger pack file
  2015-10-26  7:11   ` Junio C Hamano
@ 2015-10-27  2:04     ` Sivakumar Selvam
  2015-10-27 23:44       ` Jeff King
  2015-10-27  8:52     ` Philip Oakley
  1 sibling, 1 reply; 9+ messages in thread
From: Sivakumar Selvam @ 2015-10-27  2:04 UTC (permalink / raw)
  To: git

Junio C Hamano <gitster@pobox.com> writes:

> 
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Sivakumar Selvam <gerritcode@gmail.com> writes:
> >
> >> ... So
> >> I thought of splitting the pack file into 4 GB chunks.
> > ...
> > Hmmm, what is "this issue"?  I do not see anything surprising.
> 
> While the explanation might have been enlightening, the knowledge
> conveyed by the explanation by itself would not be of much practical
> use, and enlightenment without practical use is never fun.
> 
> So let's do another tangent that may be more useful.
> 
> In many repositories, older parts of the history often hold the bulk
> of objects that do not change, and it is wasteful to repack them
> over and over.  If your project is at around v40.0 today, and it was
> at around v36.0 6 months ago, for example, you may want to pack
> everything that happened before v36.0 into a single pack just once,
> pack them really well, and have your "repack" not touch that old
> part of the history.
> 
>   $ git rev-list --objects v36.0 |
>     git pack-objects --window=200 --depth=128 pack
> 
> would produce such a pack [*1*]
> 
> The standard output from the above pipeline will give you a 40-hex
> string (e.g. 51c472761b4690a331c02c90ec364e47cca1b3ac, call it
> $HEX), and in the current directory you will find two files,
> pack-$HEX.pack and pack-$HEX.idx.
> 
> You can then do this:
> 
>   $ echo "v36.0 with W/D 200/128" >pack-$HEX.keep
>   $ mv pack-$HEX.* .git/objects/pack/.
>   $ git repack -a -d
> 
> A pack that has an accompanying .keep file is exempt from
> repacking, so once you do this, your future "git repack" will only
> repack objects that are not in the kept packs.
> 
> [Footnote]
> 
> *1* I won't say 200/128 gives you a good pack; you would need to
> experiment.  In general, larger depth will result in smaller pack
> but it will result in bigger overhead while you use the repository
> every day.  Larger window will spend a lot of cycles while packing,
> but will result in a smaller pack.
> 


Hi Junio,

   When the repack finished, I found 12 pack files of 4 GB each, 48 GB in
total. I then ran the same git repack command again with only the
--max-pack-size= parameter removed, and the resulting single pack file
is 66 GB.

git repack -A -b -d -q --depth=50 --window=10 abc.git

So abc.git has now grown to 66 GB. Initially it was 34 GB; after using
--max-pack-size=4g it became 48 GB; and after removing
--max-pack-size=4g to create a single pack file again, it became 66 GB.
   
It looks like once we repack into multiple pack files, we can't get
back to the original size.

Thanks,
Sivakumar Selvam.


* Re: git repack command on larger pack file
  2015-10-26  7:11   ` Junio C Hamano
  2015-10-27  2:04     ` Sivakumar Selvam
@ 2015-10-27  8:52     ` Philip Oakley
  1 sibling, 0 replies; 9+ messages in thread
From: Philip Oakley @ 2015-10-27  8:52 UTC (permalink / raw)
  To: Junio C Hamano, Sivakumar Selvam; +Cc: git

From: "Junio C Hamano" <gitster@pobox.com>
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Sivakumar Selvam <gerritcode@gmail.com> writes:
>>
>>> ... So
>>> I thought of splitting the pack file into 4 GB chunks.
>> ...
>> Hmmm, what is "this issue"?  I do not see anything surprising.
>
> While the explanation might have been enlightening, the knowledge
> conveyed by the explanation by itself would not be of much practical
> use, and enlightenment without practical use is never fun.
>
> So let's do another tangent that may be more useful.
>
> In many repositories, older parts of the history often hold the bulk
> of objects that do not change, and it is wasteful to repack them
> over and over.  If your project is at around v40.0 today, and it was
> at around v36.0 6 months ago, for example, you may want to pack
> everything that happened before v36.0 into a single pack just once,
> pack them really well, and have your "repack" not touch that old
> part of the history.
>
>  $ git rev-list --objects v36.0 |
>    git pack-objects --window=200 --depth=128 pack
>
> would produce such a pack [*1*]
>
> The standard output from the above pipeline will give you a 40-hex
> string (e.g. 51c472761b4690a331c02c90ec364e47cca1b3ac, call it
> $HEX), and in the current directory you will find two files,
> pack-$HEX.pack and pack-$HEX.idx.
>
> You can then do this:
>
>  $ echo "v36.0 with W/D 200/128" >pack-$HEX.keep
>  $ mv pack-$HEX.* .git/objects/pack/.
>  $ git repack -a -d
>
> A pack that has an accompanying .keep file is exempt from
> repacking, so once you do this, your future "git repack" will only
> repack objects that are not in the kept packs.
>

I had a quick look at the man pages and couldn't find an explanation (such
as this one) of the purpose of .keep packs, how to use them, and how to
create them.

Could this form the basis of a short section on .keep packs? (Or did I miss
something?)

>
>
> [Footnote]
>
> *1* I won't say 200/128 gives you a good pack; you would need to
> experiment.  In general, larger depth will result in smaller pack
> but it will result in bigger overhead while you use the repository
> every day.  Larger window will spend a lot of cycles while packing,
> but will result in a smaller pack.
> --

Philip 


* Re: git repack command on larger pack file
  2015-10-27  2:04     ` Sivakumar Selvam
@ 2015-10-27 23:44       ` Jeff King
  2015-10-28  6:23         ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff King @ 2015-10-27 23:44 UTC (permalink / raw)
  To: Sivakumar Selvam; +Cc: git

On Tue, Oct 27, 2015 at 02:04:23AM +0000, Sivakumar Selvam wrote:

>    When the repack finished, I found 12 pack files of 4 GB each, 48 GB in
> total. I then ran the same git repack command again with only the
> --max-pack-size= parameter removed, and the resulting single pack file
> is 66 GB.
> 
> git repack -A -b -d -q --depth=50 --window=10 abc.git
> 
> So abc.git has now grown to 66 GB. Initially it was 34 GB; after using
> --max-pack-size=4g it became 48 GB; and after removing
> --max-pack-size=4g to create a single pack file again, it became 66 GB.
>    
> It looks like once we repack into multiple pack files, we can't get
> back to the original size.

Git tries to take some shortcuts when repacking: if two objects are in
the same pack but not deltas, it will not consider making deltas out of
them. The logic is we would already have tried that while making the
original pack. But of course when you are doing weird things with the
packing parameters, that is not always a good assumption.

When doing experiments like this, add "-f" to your repack command-line
to avoid reusing deltas. The result should be much smaller (at the
expense of more CPU time to do the repack).

I'd also recommend increasing "--window" if you can afford the extra CPU
during the repack. It can often produce smaller packs. And it has less
cost than you might think (e.g., window=20 is not twice as expensive as
window=10, because the work to access the objects is cached).  You can
also increase --depth, but I have never found it to be particularly
helpful for decreasing size[1].
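
To try this with different settings, something along these lines may
help.  It is a rough sketch only: repack_from_scratch is a made-up
name, the defaults simply mirror the --window=10 --depth=50 values from
the original report, and -b is left out here because bitmaps are a
separate concern from pack size.

```shell
# repack_from_scratch [repo] [window] [depth]
# Hypothetical helper: repacks everything into one pack while ignoring
# existing deltas (-f), then prints the resulting pack size in KiB.
repack_from_scratch() {
    repo=${1:-.}
    window=${2:-10}
    depth=${3:-50}
    git -C "$repo" repack -a -d -f -q \
        --window="$window" --depth="$depth" || return 1
    # count-objects -v reports "size-pack" in KiB.
    git -C "$repo" count-objects -v | awk '/^size-pack:/ { print $2 }'
}
```

For example, repack_from_scratch abc.git 10 50 and then
repack_from_scratch abc.git 20 50, each prefixed with "time", would
give a size and a wall-clock figure to compare for the two window
values.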

-Peff

[1] This is all theory, and I don't know how well git actually finds
    such deltas, but it is probably better to have a dense tree of
    deltas rather than long chains.  If you have a chain of N objects
    and want to add object N+1 to it, you are probably not much worse
    off to base it on object N-1, creating a "fork" at N. The resulting
    objects should be less expensive to access for subsequent operations
    (as any time you want the Nth object, you have to resolve all parts
    of the chain, so shorter chains are better, and the delta cache
    is more likely to get a hit on that N-1 object).


* Re: git repack command on larger pack file
  2015-10-26  6:41 ` Junio C Hamano
  2015-10-26  7:11   ` Junio C Hamano
@ 2015-10-27 23:47   ` Jeff King
  1 sibling, 0 replies; 9+ messages in thread
From: Jeff King @ 2015-10-27 23:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Sivakumar Selvam, git

On Sun, Oct 25, 2015 at 11:41:23PM -0700, Junio C Hamano wrote:

> Also it has to write more data to disk (see below), it has to find a
> good place to split, it has to adjust bookkeeping data at the pack
> boundary, in general it has to do more, not less, to produce split
> packs.  It would be surprising if it took less time.

This may go without saying, but the main cost in the write is that we
have to zlib deflate the output. I don't have any numbers at hand, but
when I've benchmarked serving fetches, it is often a balance game
between CPU time spent on a more aggressive delta search and CPU time
that goes into deflating the results of the search. Spending more CPU on
the former may yield more and smaller deltas which pay for themselves in
time spent on the latter.

There's definitely a balance point, and it varies from repo to repo, and
even within repos from fetch to fetch. I wish I had better heuristics to
report, but it's an ongoing thing I'm exploring. :)

-Peff


* Re: git repack command on larger pack file
  2015-10-27 23:44       ` Jeff King
@ 2015-10-28  6:23         ` Junio C Hamano
  2015-10-28  6:47           ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2015-10-28  6:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Sivakumar Selvam, git

Jeff King <peff@peff.net> writes:

> Git tries to take some shortcuts when repacking: if two objects are in
> the same pack but not deltas, it will not consider making deltas out of
> them. The logic is we would already have tried that while making the
> original pack. But of course when you are doing weird things with the
> packing parameters, that is not always a good assumption.

Yup, that is http://thread.gmane.org/gmane.comp.version-control.git/16223/focus=16267

> [1] This is all theory, and I don't know how well git actually finds
>     such deltas, but it is probably better to have a dense tree of
>     deltas rather than long chains.  If you have a chain of N objects
>     and want to add object N+1 to it, you are probably not much worse
>     off to base it on object N-1, creating a "fork" at N.

Yes, your guess is perfectly correct here, and indeed we did
extensive work along that line in 2006/2007.  For an example, see
http://thread.gmane.org/gmane.comp.version-control.git/51949/focus=52003

The histogram "verify-pack -v" produces was in fact done primarily
in order to make it easy to check the distribution of delta depth.
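
For anyone who wants to look at that distribution, a small sketch
follows.  The function name delta_depth_histogram is hypothetical; the
"non delta:" and "chain length = N:" lines it filters for are the
summary lines that "verify-pack -v" prints after the per-object
listing, and LC_ALL=C is set on the assumption that these lines may
otherwise be translated.

```shell
# delta_depth_histogram [repo]
# Hypothetical helper: prints the delta-chain histogram of every pack
# in the repository (defaults to the current directory).
delta_depth_histogram() {
    (
        cd "${1:-.}" || exit 1
        packdir=$(git rev-parse --git-path objects/pack) || exit 1
        for idx in "$packdir"/pack-*.idx; do
            [ -e "$idx" ] || continue        # repository has no packs
            echo "== $idx"
            # Keep only the summary lines, drop the per-object listing.
            LC_ALL=C git verify-pack -v "$idx" |
                grep -E '^(non delta|chain length = [0-9]+): '
        done
    )
}
```

Comparing the output for a pack made with a small --depth against one
made with a large --depth shows how far the chains actually extend.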

Thanks.


* Re: git repack command on larger pack file
  2015-10-28  6:23         ` Junio C Hamano
@ 2015-10-28  6:47           ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2015-10-28  6:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Sivakumar Selvam, git

Junio C Hamano <gitster@pobox.com> writes:

>> [1] This is all theory, and I don't know how well git actually finds
>>     such deltas, but it is probably better to have a dense tree of
>>     deltas rather than long chains.  If you have a chain of N objects
>>     and want to add object N+1 to it, you are probably not much worse
>>     off to base it on object N-1, creating a "fork" at N.
>
> Yes, your guess is perfectly correct here, and indeed we did
> extensive work along that line in 2006/2007.  For an example, see
> http://thread.gmane.org/gmane.comp.version-control.git/51949/focus=52003

And here is another, which is probably one of the most important
threads on pack-objects from before bitmaps were introduced:

http://thread.gmane.org/gmane.comp.version-control.git/20056/focus=20134
