* There should have be git gc --repack-arguments
@ 2021-04-07 12:10 Bagas Sanjaya
  2021-04-07 19:37 ` Jeff King
  2021-04-07 19:38 ` Bryan Turner
  0 siblings, 2 replies; 10+ messages in thread
From: Bagas Sanjaya @ 2021-04-07 12:10 UTC (permalink / raw)
  To: git

Hi,

I would like to request that git gc gain a --repack-arguments option,
whose value would be passed on to git repack.

The use case is very large repos (such as GCC and the Linux kernel) on a
server with little RAM (1-2 GB). When running gc on such a repo, the
repack step may hang because git-repack has to create a single large
packfile, which can be larger than the available memory (RAM+swap). It
is then necessary to run git repack --window-memory=<desired memory usage>
--max-pack-size=<desired pack size> to create several smaller packs
instead.

There should also be a git config item, gc.repackArguments, which has the
same effect as git gc --repack-arguments, with the command-line option
taking precedence over the config.

-- 
An old man doll... just what I always wanted! - Clara


* Re: There should have be git gc --repack-arguments
  2021-04-07 12:10 There should have be git gc --repack-arguments Bagas Sanjaya
@ 2021-04-07 19:37 ` Jeff King
  2021-04-07 20:40   ` Junio C Hamano
  2021-04-07 19:38 ` Bryan Turner
  1 sibling, 1 reply; 10+ messages in thread
From: Jeff King @ 2021-04-07 19:37 UTC (permalink / raw)
  To: Bagas Sanjaya; +Cc: git

On Wed, Apr 07, 2021 at 07:10:43PM +0700, Bagas Sanjaya wrote:

> I would like to request that git gc gain a --repack-arguments option,
> whose value would be passed on to git repack.

I think in general we prefer to make individual options configurable,
rather than having a blanket "pass along these options" argument, for
two reasons:

  - some options may cause the sub-program to behave unexpectedly. E.g.,
    if you put "-a" in the repack-arguments, that may be subverting
    git-gc's assumptions about how repack will behave

  - arguments are a list, not a string. So you have to provide some
    mechanism for splitting them (presumably on whitespace, but what if
    we need quoting?)
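
The splitting problem in the second point can be sketched in a few lines
of shell. The value below is hypothetical (gc.repackArguments does not
exist, and --example-option is an invented name used only to get an
argument with a space in it):

```shell
# Hypothetical value for the proposed gc.repackArguments setting; the second
# argument is meant to be one word that happens to contain a space.
repack_args='--window-memory=64m --example-option=two words'

# Naive splitting on whitespace cannot tell that "two words" belongs together,
# so the two intended arguments come out as three.
set -- $repack_args
naive_count=$#
printf 'naive split produced %s arguments\n' "$naive_count"
```

Any real implementation would need a quoting rule on top of this, which
is the extra mechanism the point above refers to.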

> The use case is very large repos (such as GCC and the Linux kernel) on a
> server with little RAM (1-2 GB). When running gc on such a repo, the
> repack step may hang because git-repack has to create a single large
> packfile, which can be larger than the available memory (RAM+swap). It
> is then necessary to run git repack --window-memory=<desired memory usage>
> --max-pack-size=<desired pack size> to create several smaller packs
> instead.
> 
> There should also be a git config item, gc.repackArguments, which has the
> same effect as git gc --repack-arguments, with the command-line option
> taking precedence over the config.

You can set pack.windowMemory in your config already, to solve the first
part.
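
A minimal sketch of setting that option, run against a throwaway
repository so it is safe to try. The 64m value is only an illustration,
not a figure recommended anywhere in this thread:

```shell
# Create a scratch repository so the setting does not touch a real one.
repo=$(mktemp -d)
git init -q "$repo"

# Store the delta-window memory limit in the repository's own config file.
git -C "$repo" config pack.windowMemory 64m

# Read the value back to confirm it was recorded.
git -C "$repo" config --get pack.windowMemory
```

Later git gc or git repack runs in that repository should pick the value
up without any extra command-line arguments.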

You can also set pack.packSizeLimit for the latter, though I do not
recommend it. It will not help with memory usage (neither while
repacking nor for later commands). We do mmap() the resulting packfiles,
but we rely on the operating system to manage the actual in-RAM working
set (but that is also true with multiple packfiles; we are happy to map
several of them at once). And it may make your on-disk size much larger.
We don't allow deltas between on-disk packs, which means some objects
which could be stored as deltas won't be. That in turn hurts on a
memory-starved system because we'll need more block cache to perform the
same task. It also results in extra CPU when serving fetches or pushing,
since we'll try to find new deltas between the packs on the fly.

-Peff


* Re: There should have be git gc --repack-arguments
  2021-04-07 12:10 There should have be git gc --repack-arguments Bagas Sanjaya
  2021-04-07 19:37 ` Jeff King
@ 2021-04-07 19:38 ` Bryan Turner
  2021-04-08 13:31   ` Bagas Sanjaya
  1 sibling, 1 reply; 10+ messages in thread
From: Bryan Turner @ 2021-04-07 19:38 UTC (permalink / raw)
  To: Bagas Sanjaya; +Cc: Git Users

On Wed, Apr 7, 2021 at 5:10 AM Bagas Sanjaya <bagasdotme@gmail.com> wrote:
>
> Hi,
>
> I would like to request that git gc gain a --repack-arguments option,
> whose value would be passed on to git repack.
>
> The use case is very large repos (such as GCC and the Linux kernel) on a
> server with little RAM (1-2 GB). When running gc on such a repo, the
> repack step may hang because git-repack has to create a single large
> packfile, which can be larger than the available memory (RAM+swap). It
> is then necessary to run git repack --window-memory=<desired memory usage>
> --max-pack-size=<desired pack size> to create several smaller packs
> instead.

I can't speak to the feature request, but since there are
configuration knobs already for both of those, that implies you can
use git -c pack.windowMemory=... -c pack.packSizeLimit=... gc and
those configuration settings will be propagated to the git repack
process that git gc runs.
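
As a rough sketch of that invocation with a concrete value filled in
(64m is an assumed example, and the scratch repository and its single
commit exist only to give gc something to pack):

```shell
# Build a small scratch repository with one commit.
repo=$(mktemp -d)
git init -q "$repo"
echo 'hello' >"$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name='Example' -c user.email='example@example.com' \
    commit -q -m 'initial commit'

# Pass the setting for this one gc run only; nothing is written to a config file.
git -C "$repo" -c pack.windowMemory=64m gc --quiet

# Show how many packs and loose objects remain afterwards.
git -C "$repo" count-objects -v
```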

>
> There should also be a git config item, gc.repackArguments, which has the
> same effect as git gc --repack-arguments, with the command-line option
> taking precedence over the config.

Passing configuration settings as I show above would already take
precedence over any config file, since config from the command line is
higher priority.
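
The precedence can be checked directly in a scratch repository. Both
values here (1g and 64m) are arbitrary examples chosen to be easy to
tell apart:

```shell
# Scratch repository with a value stored in its config file.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config pack.windowMemory 1g

# Value as seen without any command-line override.
from_file=$(git -C "$repo" config --get pack.windowMemory)

# Value as seen when the same key is also given with -c.
from_cli=$(git -C "$repo" -c pack.windowMemory=64m config --get pack.windowMemory)

printf 'config file:  %s\ncommand line: %s\n' "$from_file" "$from_cli"
```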

Hope this helps!
Bryan


* Re: There should have be git gc --repack-arguments
  2021-04-07 19:37 ` Jeff King
@ 2021-04-07 20:40   ` Junio C Hamano
  2021-04-07 21:37     ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2021-04-07 20:40 UTC (permalink / raw)
  To: Jeff King; +Cc: Bagas Sanjaya, git

Jeff King <peff@peff.net> writes:

>> ... git repack ...  --max-pack-size=<desired pack size> to create split and
>> smaller packs instead.
> ...
> You can also set pack.packSizeLimit for the latter, though I do not
> recommend it. It will not help with memory usage (neither while
> repacking nor for later commands).

In other words, passing --max-pack-size, whether it is done with a
new --repack-arguments option or it is done with the existing
pack.packSizeLimit configuration, would make things worse.

So in conclusion:

 - attempting to repack everything into one pack on a memory starved
   box would be helped with reduced window memory size.

 - on a small box, it may make sense to avoid repacking everything
   into one in the first place, but we do not want the number of
   packs to grow unbounded.

Would the new geometric repack feature help here, especially for the
latter?



* Re: There should have be git gc --repack-arguments
  2021-04-07 20:40   ` Junio C Hamano
@ 2021-04-07 21:37     ` Jeff King
  2021-04-07 22:13       ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2021-04-07 21:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, Bagas Sanjaya, git

On Wed, Apr 07, 2021 at 01:40:16PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >> ... git repack ...  --max-pack-size=<desired pack size> to create split and
> >> smaller packs instead.
> > ...
> > You can also set pack.packSizeLimit for the latter, though I do not
> > recommend it. It will not help with memory usage (neither while
> > repacking nor for later commands).
> 
> In other words, passing --max-pack-size, whether it is done with a
> new --repack-arguments option or it is done with the existing
> pack.packSizeLimit configuration, would make things worse.

Right. I wish we didn't have --max-pack-size at all. I do not think it
is ever a good idea, and it complicates the packing code quite a bit.

These days we have index v2 to let us address more than 4GB in a
packfile. I suppose it's possible you could have a filesystem whose max
file size is smaller than your total packfile, but that seems pretty
unlikely these days (even 32-bit systems tend to have large file
support).

But that's all a tangent. :)

> So in conclusion:
> 
>  - attempting to repack everything into one pack on a memory starved
>    box would be helped with reduced window memory size.

Yes, though less than you might think. It is only trying to keep the
memory used by delta compression at bay. The per-object book-keeping
tends to be quite high by itself. If you are under memory pressure
during delta compression, you may also be better off reducing the number
of threads (since each thread is simultaneously using windowMemory
bytes).
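
A sketch of how the two settings interact, assuming an illustrative
budget of 512 MB for delta compression and 2 threads. The numbers are
made up, and this bounds only the delta window memory, not the
per-object book-keeping mentioned above:

```shell
# Assumed budget for delta-window memory, split evenly across threads.
budget_mb=512
threads=2
window_mb=$((budget_mb / threads))

# Scratch repository with one commit to repack.
repo=$(mktemp -d)
git init -q "$repo"
echo 'hello' >"$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" -c user.name='Example' -c user.email='example@example.com' \
    commit -q -m 'initial commit'

# Each thread may use up to pack.windowMemory, so the total stays near the budget.
git -C "$repo" -c pack.threads="$threads" \
    -c pack.windowMemory="${window_mb}m" repack -a -d -q
printf 'threads=%s window=%sm\n' "$threads" "$window_mb"
```

Dropping to a single thread lets one window use the whole budget, at the
cost of a slower repack.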

>  - on a small box, it may make sense to avoid repacking everything
>    into one in the first place, but we do not want the number of
>    packs to grow unbounded.
> 
> Would the new geometric repack feature help here, especially for the
> latter?

Yes, I think it would. You'd perhaps want to generate a multi-pack-index
file, too, to avoid having to look for objects in multiple packs
sequentially (we have a "git repack --write-midx" option on the way, as
well).

-Peff


* Re: There should have be git gc --repack-arguments
  2021-04-07 21:37     ` Jeff King
@ 2021-04-07 22:13       ` Junio C Hamano
  2021-04-07 22:22         ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2021-04-07 22:13 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Bagas Sanjaya, git

Jeff King <peff@peff.net> writes:

> On Wed, Apr 07, 2021 at 01:40:16PM -0700, Junio C Hamano wrote:
>
>> Jeff King <peff@peff.net> writes:
>> 
>> >> ... git repack ...  --max-pack-size=<desired pack size> to create split and
>> >> smaller packs instead.
>> > ...
>> > You can also set pack.packSizeLimit for the latter, though I do not
>> > recommend it. It will not help with memory usage (neither while
>> > repacking nor for later commands).
>> 
>> In other words, passing --max-pack-size, whether it is done with a
>> new --repack-arguments option or it is done with the existing
>> pack.packSizeLimit configuration, would make things worse.
>
> Right. I wish we didn't have --max-pack-size at all. I do not think it
> is ever a good idea, and it complicates the packing code quite a bit.

I suspect that the original motivation was sneaker-netting on
multiple floppy disks ;-)

>>  - on a small box, it may make sense to avoid repacking everything
>>    into one in the first place, but we do not want the number of
>>    packs to grow unbounded.
>> 
>> Would the new geometric repack feature help here, especially for the
>> latter?
>
> Yes, I think it would. You'd perhaps want to generate a multi-pack-index
> file, too, to avoid having to look for objects in multiple packs
> sequentially (we have a "git repack --write-midx" option on the way, as
> well).

Thanks.


* Re: There should have be git gc --repack-arguments
  2021-04-07 22:13       ` Junio C Hamano
@ 2021-04-07 22:22         ` Jeff King
  2021-04-09  9:58           ` Bagas Sanjaya
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2021-04-07 22:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, Bagas Sanjaya, git

On Wed, Apr 07, 2021 at 03:13:39PM -0700, Junio C Hamano wrote:

> >> > You can also set pack.packSizeLimit for the latter, though I do not
> >> > recommend it. It will not help with memory usage (neither while
> >> > repacking nor for later commands).
> >> 
> >> In other words, passing --max-pack-size, whether it is done with a
> >> new --repack-arguments option or it is done with the existing
> >> pack.packSizeLimit configuration, would make things worse.
> >
> > Right. I wish we didn't have --max-pack-size at all. I do not think it
> > is ever a good idea, and it complicates the packing code quite a bit.
> 
> I suspect that the original motivation was sneaker-netting on
> multiple floppy disks ;-)

That had always been my impression, too. But when I looked in the
archive while writing my earlier reply, most of the discussion near
--max-pack-size had to do with the early index limitations.

If you are sneaker-netting, you are probably better off to just split
the pack at byte boundaries with an external tool anyway, for two
reasons:

  - our max-pack-size is just a guideline. It only splits at object
    boundaries so if you have an object bigger than the max, we'll
    exceed it.

  - dedicated splitting tools often have useful extra features, like
    k-of-n error correction.

Besides, if you are sneaker netting you'd want to use a bundle, and I
don't think bundles support max-pack-size. :)
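
For reference, a bundle round trip might look like the sketch below. The
repository contents and the repo.bundle file name are invented for the
example:

```shell
# Source repository with a single commit.
src=$(mktemp -d)
git init -q "$src"
echo 'hello' >"$src/file.txt"
git -C "$src" add file.txt
git -C "$src" -c user.name='Example' -c user.email='example@example.com' \
    commit -q -m 'initial commit'

# Write every ref and the objects it needs into one file for transport.
work=$(mktemp -d)
git -C "$src" bundle create "$work/repo.bundle" --all

# On the receiving side, the bundle can be cloned like a remote.
git clone -q "$work/repo.bundle" "$work/clone"
git -C "$work/clone" log --oneline
```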

Anyway, all off-topic but an interesting diversion.

-Peff


* Re: There should have be git gc --repack-arguments
  2021-04-07 19:38 ` Bryan Turner
@ 2021-04-08 13:31   ` Bagas Sanjaya
  0 siblings, 0 replies; 10+ messages in thread
From: Bagas Sanjaya @ 2021-04-08 13:31 UTC (permalink / raw)
  To: Bryan Turner; +Cc: Git Users

On 08/04/21 02.38, Bryan Turner wrote:
> I can't speak to the feature request, but since there are
> configuration knobs already for both of those, that implies you can
> use git -c pack.windowMemory=... -c pack.packSizeLimit=... gc and
> those configuration settings will be propagated to the git repack
> process that git gc runs.

Oops, I overlooked that. Thanks for reminding me!

-- 
An old man doll... just what I always wanted! - Clara


* Re: There should have be git gc --repack-arguments
  2021-04-07 22:22         ` Jeff King
@ 2021-04-09  9:58           ` Bagas Sanjaya
  2021-04-09 15:49             ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Bagas Sanjaya @ 2021-04-09  9:58 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: Taylor Blau, git

On 08/04/21 05.22, Jeff King wrote:
> If you are sneaker-netting, you are probably better off to just split
> the pack at byte boundaries with an external tool anyway, for two
> reasons:
> 
>    - our max-pack-size is just a guideline. It only splits at object
>      boundaries so if you have an object bigger than the max, we'll
>      exceed it.
> 
>    - dedicated splitting tools often have useful extra features, like
>      k-of-n error correction.
> 
What external tools are there for splitting packs? Can packs split by
such tools still be used by Git?

-- 
An old man doll... just what I always wanted! - Clara


* Re: There should have be git gc --repack-arguments
  2021-04-09  9:58           ` Bagas Sanjaya
@ 2021-04-09 15:49             ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2021-04-09 15:49 UTC (permalink / raw)
  To: Bagas Sanjaya; +Cc: Junio C Hamano, Taylor Blau, git

On Fri, Apr 09, 2021 at 04:58:32PM +0700, Bagas Sanjaya wrote:

> On 08/04/21 05.22, Jeff King wrote:
> > If you are sneaker-netting, you are probably better off to just split
> > the pack at byte boundaries with an external tool anyway, for two
> > reasons:
> > 
> >    - our max-pack-size is just a guideline. It only splits at object
> >      boundaries so if you have an object bigger than the max, we'll
> >      exceed it.
> > 
> >    - dedicated splitting tools often have useful extra features, like
> >      k-of-n error correction.
> > 
> What external tools are there for splitting packs? Can packs split by
> such tools still be used by Git?

No, but you can reassemble the parts at the destination before feeding
them to Git. On a system with normal posix tools, you can split like:

  git pack-objects --stdout --all </dev/null |
  split -b 1m - split-pack-

and then after transferring split-pack-* (which are individual 1
megabyte files) to the destination, you can do:

  cat split-pack-* |
  git index-pack -v --stdin

(There's no error correction in split; tools like rar will do that, and
probably others, but it has been ages since I've had to split a file to
meet transfer requirements).
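
The whole round trip can be tried end to end in scratch directories. In
this sketch the 64-byte piece size is chosen only so that a tiny example
pack splits into several files, and the destination is a freshly
initialized repository because index-pack --stdin needs one to write
into:

```shell
# Source repository with a single commit.
src=$(mktemp -d)
git init -q "$src"
echo 'hello' >"$src/file.txt"
git -C "$src" add file.txt
git -C "$src" -c user.name='Example' -c user.email='example@example.com' \
    commit -q -m 'initial commit'

# Stream one pack of everything and cut it into fixed-size pieces.
pieces=$(mktemp -d)
git -C "$src" pack-objects -q --stdout --all </dev/null |
split -b 64 - "$pieces/split-pack-"

# At the destination, glue the pieces back together in name order and index them.
dest=$(mktemp -d)
git init -q "$dest"
cat "$pieces"/split-pack-* |
git -C "$dest" index-pack --stdin

# The commit from the source should now exist in the destination's object store.
commit=$(git -C "$src" rev-parse HEAD)
git -C "$dest" cat-file -t "$commit"
```

Note that this transfers objects only; no refs are created at the
destination, so a branch would still have to be pointed at the commit.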

-Peff
