* [RFC] repack vs re-clone
@ 2008-02-10  8:25 Marco Costalba
  2008-02-10 20:50 ` Nicolas Pitre
  2008-02-11 18:45 ` Jakub Narebski
  0 siblings, 2 replies; 9+ messages in thread
From: Marco Costalba @ 2008-02-10  8:25 UTC (permalink / raw)
  To: git mailing list

Sometimes I find myself re-cloning a repository entirely, for example
the Linux tree, instead of repacking my local copy.

The reason is that the published Linux repository is very tightly packed,
and to reach the same level of compression on my local copy I would need
to leave my laptop running all night.

So it is often just faster to re-clone the whole thing from upstream.

Also, repacking a big repo in the optimal way is not trivial: you need
to understand fairly advanced stuff like window and depth settings, and
the pack parameters used upstream are probably better than anything you
could 'guess' by trying yourself. Or you simply don't have as much RAM
as would be needed.
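
To give an idea of the kind of invocation I mean (the window/depth
numbers below are only my guess, not what kernel.org actually uses):

  git repack -a -d -f --window=100 --depth=100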

On the other hand, it would be interesting to know, before starting a
new clone, what the real advantage would be, i.e. what the repository
size is upstream.

So I would like to ask whether anyone would consider the following useful:

- A command like 'git info', or something along those lines, that prints
the size of the local and upstream repositories (among possibly other things)

- An option like 'git repack --clone' to instruct git to download and use
the current upstream packs instead of trying to recreate new ones.


Marco


* Re: [RFC] repack vs re-clone
  2008-02-10  8:25 [RFC] repack vs re-clone Marco Costalba
@ 2008-02-10 20:50 ` Nicolas Pitre
  2008-02-11 18:45 ` Jakub Narebski
  1 sibling, 0 replies; 9+ messages in thread
From: Nicolas Pitre @ 2008-02-10 20:50 UTC (permalink / raw)
  To: Marco Costalba; +Cc: git mailing list

On Sun, 10 Feb 2008, Marco Costalba wrote:

> Sometimes I find myself re-cloning a repository entirely, for example
> the Linux tree, instead of repacking my local copy.
> 
> The reason is that the published Linux repository is very tightly packed,
> and to reach the same level of compression on my local copy I would need
> to leave my laptop running all night.

No.  I really doubt the public Linux repository is compressed with 
anything but the default repack settings.

And on my PC, which is average by today's standards (a P4 @ 3GHz with 1GB
of memory), repacking the Linux repo takes less than 6.5 minutes, with a
peak RSS of around 450MB.
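
(That is with nothing fancier than a plain full repack, roughly:

  git repack -a -d

the exact invocation is from memory, so treat it as illustrative.)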

> So it is often just faster to re-clone the whole thing from upstream.

Only if you're lucky enough to have a fast net connection with a high
transfer quota.

> Also, repacking a big repo in the optimal way is not trivial: you need
> to understand fairly advanced stuff like window and depth settings, and
> the pack parameters used upstream are probably better than anything you
> could 'guess' by trying yourself. Or you simply don't have as much RAM
> as would be needed.

If such is your case, why would you fully repack your repo in the first 
place?  Simply running 'git gc' should be quite good enough for people 
uninterested in the "advanced" stuff.  The repack that 'git gc' uses 
will happily reuse existing packed data from upstream.
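
In other words (and these are just the plain defaults, nothing exotic):

  git gc

or, if you really do want to spend the CPU on redoing all deltas:

  git gc --aggressive

should cover both cases without having to think about window/depth at all.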



> On the other hand, it would be interesting to know, before starting a
> new clone, what the real advantage would be, i.e. what the repository
> size is upstream.

You can already query the remote repository's directory listing and figure 
it out.  For example:

lftp -c 'open http://kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git/objects/pack && ls'

And you'll note that even upstream isn't always fully packed in advance.
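
For the local side of the comparison, assuming a normal (non-bare) clone,
a plain

  du -sh .git/objects/pack

(or 'git count-objects -v' for a more detailed breakdown) already gives
you the number to compare against.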

> So I would like to ask whether anyone would consider the following useful:
> 
> - A command like 'git info', or something along those lines, that prints
> the size of the local and upstream repositories (among possibly other things)
> 
> - An option like 'git repack --clone' to instruct git to download and use
> the current upstream packs instead of trying to recreate new ones.

I think that would be a very bad idea.  Not only is this rather 
unnecessary (either you can afford to repack locally, or you live with 
the upstream-provided packing and repack incrementally, which is pretty 
good enough), but it is also really bad resource-wise (it would end up 
only wasting net bandwidth and CPU cycles on the server).


Nicolas


* Re: [RFC] repack vs re-clone
  2008-02-10  8:25 [RFC] repack vs re-clone Marco Costalba
  2008-02-10 20:50 ` Nicolas Pitre
@ 2008-02-11 18:45 ` Jakub Narebski
  2008-02-11 19:20   ` Marco Costalba
  2008-02-11 19:40   ` Nicolas Pitre
  1 sibling, 2 replies; 9+ messages in thread
From: Jakub Narebski @ 2008-02-11 18:45 UTC (permalink / raw)
  To: Marco Costalba; +Cc: git mailing list

"Marco Costalba" <mcostalba@gmail.com> writes:

> Sometimes I find myself re-cloning a repository entirely, for example
> the Linux tree, instead of repacking my local copy.
> 
> The reason is that the published Linux repository is very tightly packed,
> and to reach the same level of compression on my local copy I would need
> to leave my laptop running all night.

Repacking without '--force' (for gc) or '--no-reuse-delta' (low level)
would always reuse this tight packing. Only with '--force' would you
waste CPU trying to find a better deltaification[*1*].
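
Concretely, the difference I mean is between something like:

  git repack -a -d       # reuses existing deltas from the upstream pack
  git repack -a -d -f    # -f = --no-reuse-delta, recomputes everything

(the flags are from the git-repack documentation as I read it, so correct
me if I got this wrong).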
 
> So it is often just faster to re-clone the whole thing from upstream.

So what you are doing is passing the work, unnecessary work I'd say,
to some poor server. Not nice.


[*1*] I hope that '--no-reuse-delta' means _try_ to find a better delta,
but keep the current one as a candidate, not stupidly forget about the
current deltaification altogether...
-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: [RFC] repack vs re-clone
  2008-02-11 18:45 ` Jakub Narebski
@ 2008-02-11 19:20   ` Marco Costalba
  2008-02-11 19:50     ` Johannes Schindelin
  2008-02-11 19:51     ` Jakub Narebski
  2008-02-11 19:40   ` Nicolas Pitre
  1 sibling, 2 replies; 9+ messages in thread
From: Marco Costalba @ 2008-02-11 19:20 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git mailing list

On Feb 11, 2008 7:45 PM, Jakub Narebski <jnareb@gmail.com> wrote:
> "Marco Costalba" <mcostalba@gmail.com> writes:
>
> > So it is often just faster to re-clone the whole thing from upstream.
>
> So what you are doing is passing the work, unnecessary work I'd say,
> to some poor server. Not nice.
>

To poor net bandwidth, I would say, because cloning from scratch just
downloads the packs.


* Re: [RFC] repack vs re-clone
  2008-02-11 18:45 ` Jakub Narebski
  2008-02-11 19:20   ` Marco Costalba
@ 2008-02-11 19:40   ` Nicolas Pitre
  2008-02-11 19:53     ` Jakub Narebski
  1 sibling, 1 reply; 9+ messages in thread
From: Nicolas Pitre @ 2008-02-11 19:40 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Marco Costalba, git mailing list

On Mon, 11 Feb 2008, Jakub Narebski wrote:

> [*1*] I hope that '--no-reuse-delta' means _try_ to find a better delta,
> but keep the current one as a candidate, not stupidly forget about the
> current deltaification altogether...

It is really "forget about everything".  And by the time you look for 
the best delta from scratch, remembering what was the best delta before 
won't give you much performance gain, plus it has nasty issues like 
making sure no delta cycles are created if you reuse an old delta, etc.


Nicolas


* Re: [RFC] repack vs re-clone
  2008-02-11 19:20   ` Marco Costalba
@ 2008-02-11 19:50     ` Johannes Schindelin
  2008-02-11 19:51     ` Jakub Narebski
  1 sibling, 0 replies; 9+ messages in thread
From: Johannes Schindelin @ 2008-02-11 19:50 UTC (permalink / raw)
  To: Marco Costalba; +Cc: Jakub Narebski, git mailing list

Hi,

On Mon, 11 Feb 2008, Marco Costalba wrote:

> On Feb 11, 2008 7:45 PM, Jakub Narebski <jnareb@gmail.com> wrote:
> > "Marco Costalba" <mcostalba@gmail.com> writes:
> >
> > > So it is often just faster to re-clone the whole thing from 
> > > upstream.
> >
> > So what you are doing is passing the work, unnecessary work I'd say, 
> > to some poor server. Not nice.
> 
> To poor net bandwidth, I would say, because cloning from scratch just 
> downloads the packs.

Not exactly.  Remember, if you did not "repack" via clone, the server 
would spend fewer cycles _and_ less bandwidth on you.  Namely zero.

Ciao,
Dscho


* Re: [RFC] repack vs re-clone
  2008-02-11 19:20   ` Marco Costalba
  2008-02-11 19:50     ` Johannes Schindelin
@ 2008-02-11 19:51     ` Jakub Narebski
  2008-02-11 20:44       ` Nicolas Pitre
  1 sibling, 1 reply; 9+ messages in thread
From: Jakub Narebski @ 2008-02-11 19:51 UTC (permalink / raw)
  To: Marco Costalba; +Cc: git mailing list

Marco Costalba wrote:
> On Feb 11, 2008 7:45 PM, Jakub Narebski <jnareb@gmail.com> wrote:
> > "Marco Costalba" <mcostalba@gmail.com> writes:
> >
> > > So it is often just faster to re-clone the whole thing from upstream.
> >
> > So what you are doing is passing the work, unnecessary work I'd say,
> > to some poor server. Not nice.
> 
> To poor net bandwidth, I would say, because cloning from scratch just
> downloads the packs.

Cloning from scratch over http, https and rsync (and ftp) just downloads
the packfiles. Cloning over git or ssh, if I understand correctly[*1*],
generates a single pack for the transfer. And that generates load on the
server.

[*1*] If I understand correctly from discussions here on the git mailing
      list, the pack transfer protocol can currently transfer only a
      _single_ pack; the proposed multi-pack extension didn't get
      implemented.
-- 
Jakub Narebski
Poland


* Re: [RFC] repack vs re-clone
  2008-02-11 19:40   ` Nicolas Pitre
@ 2008-02-11 19:53     ` Jakub Narebski
  0 siblings, 0 replies; 9+ messages in thread
From: Jakub Narebski @ 2008-02-11 19:53 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Marco Costalba, git mailing list

Nicolas Pitre wrote:
> On Mon, 11 Feb 2008, Jakub Narebski wrote:
> 
> > [*1*] I hope that '--no-reuse-delta' means _try_ to find better delta,
> > but use current one as possible delta, not stupid forget about current
> > deltaification at all...
> 
> It is really "forget about everything".  And by the time you look for 
> the best delta from scratch, remembering what was the best delta before 
> won't give you much performance gain, plus it has nasty issues like 
> making sure no delta cycles are created if you reuse an old delta, etc.

So we have either: don't try to find a better delta (the default), or
totally forget about the old delta (--no-reuse-delta), with no middle
ground? That's bad...

Or is the default to try to find a better delta, but reuse the old one if
it is better?

-- 
Jakub Narebski
Poland


* Re: [RFC] repack vs re-clone
  2008-02-11 19:51     ` Jakub Narebski
@ 2008-02-11 20:44       ` Nicolas Pitre
  0 siblings, 0 replies; 9+ messages in thread
From: Nicolas Pitre @ 2008-02-11 20:44 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Marco Costalba, git mailing list

On Mon, 11 Feb 2008, Jakub Narebski wrote:

> Marco Costalba wrote:
> > On Feb 11, 2008 7:45 PM, Jakub Narebski <jnareb@gmail.com> wrote:
> > > "Marco Costalba" <mcostalba@gmail.com> writes:
> > >
> > > > So it is often just faster to re-clone the whole thing from upstream.
> > >
> > > So what you are doing is passing the work, unnecessary work I'd say,
> > > to some poor server. Not nice.
> > 
> > To poor net bandwidth, I would say, because cloning from scratch just
> > downloads the packs.
> 
> Cloning from scratch over http, https and rsync (and ftp) just downloads
> the packfiles. Cloning over git or ssh, if I understand correctly[*1*],
> generates a single pack for the transfer. And that generates load on the
> server.

The created pack will always reuse existing deltas, so the load is mostly 
about making sure the sent pack contains only the objects needed for the 
requested branches -- something that dumb protocols cannot do.


Nicolas

