On Mon, 6 Apr 2009, Jon Smirl wrote:

> On Mon, Apr 6, 2009 at 10:19 AM, Nicolas Pitre wrote:
> > On Mon, 6 Apr 2009, Jon Smirl wrote:
> >
> >> First thing an initial clone does is copy all of the pack files
> >> from the server to the client without even looking at them.
> >
> > This is a no-go for reasons already stated many times.  There are
> > security implications (those packs might contain stuff that you
> > didn't intend to be publicly accessible) and there might be
> > efficiency reasons as well (you might have a shared object store
> > with lots of stuff unrelated to the particular clone).
>
> How do you deal with dense history packs? These packs take many hours
> to make (on a server class machine) and can be half the size of a
> regular pack. Shouldn't there be a way to copy these packs intact on
> an initial clone? It's ok if these packs are specially marked as
> being ok to copy.

[sigh]  Let me explain it all again.

There are basically two ways to create a new pack: the intelligent
way, and the brute-force way.

When creating a new pack the intelligent way, we enumerate all the
needed objects and look them up in the object store.  When a
particular object is found, we create a record for it and note which
pack it is located in, at what offset in that pack, how much space it
occupies in its compressed form within that pack, and whether or not
it is a delta.  When that object is indeed a delta (the majority of
objects usually are), we also keep a pointer to the record for that
delta's base object.

Next, for all objects in delta form whose base object is also part of
the object enumeration (and obviously part of the same pack), we
simply flag those objects as directly reusable without any further
processing.  This means that when those objects are about to be
stored in the new pack, their raw data is simply copied straight from
the original pack using the offset and size noted above.  In other
words, those objects are never re-deltified nor re-deflated at all,
and all the work that was previously done to find the best delta
match is preserved at no extra cost.

Of course, when your repository is tightly packed into a single pack,
all enumerated objects fall into the reusable category, and what is
sent over the wire is therefore effectively a copy of the original
pack.  One exception is with older git clients which don't support
the delta base offset encoding, in which case the delta reference
encoding is substituted on the fly at almost no cost (this is, btw,
another reason why a dumb copy of an existing pack may not work
universally either).  But in the common case you might as well
consider the above the same as copying the pack file, because git
really only reads some data from a pack and immediately writes that
data back out.

The brute-force repacking is different because it simply doesn't
concern itself with existing deltas at all.  It instead starts
everything from scratch and performs the whole delta search all over
again for all objects.  This is what takes lots of resources and CPU
cycles and, as you may guess, it is never used for fetch/clone
requests.
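
To make the above more concrete, here is a heavily simplified sketch
of the per-object record that the intelligent path builds.  This is
illustrative C only, not the actual code in builtin-pack-objects.c,
and all the names are made up:

#include <sys/types.h>	/* off_t */

struct packed_git;	/* an existing pack file (opaque here) */

/* Sketch: a minimal version of the record created for each
 * enumerated object. */
struct object_entry {
	unsigned char sha1[20];			/* object name */
	struct packed_git *in_pack;		/* pack the object lives in */
	off_t in_pack_offset;			/* where in that pack */
	unsigned long in_pack_size;		/* compressed size there */
	int is_delta;				/* stored as a delta? */
	struct object_entry *delta_base;	/* record of the base, if any */
	int reusable;				/* raw data can be copied as-is */
};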
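
Given such records, the reuse test and the write-out boil down to
something like this (again only a sketch, continuing from the struct
above):

#include <stdio.h>

/* Sketch: an object is directly reusable when it is not a delta, or
 * when its base is also enumerated (i.e. has a record) and lives in
 * the same pack. */
static void mark_reusable(struct object_entry *e)
{
	if (!e->is_delta)
		e->reusable = 1;
	else if (e->delta_base && e->delta_base->in_pack == e->in_pack)
		e->reusable = 1;
}

/* Sketch: writing a reusable object is a plain byte copy from the old
 * pack at the recorded offset -- no inflation, no delta search. */
static void copy_reused(FILE *src_pack, FILE *dst_pack,
			off_t offset, unsigned long size)
{
	char buf[8192];

	fseeko(src_pack, offset, SEEK_SET);
	while (size) {
		size_t n = size < sizeof(buf) ? size : sizeof(buf);
		n = fread(buf, 1, n, src_pack);
		if (!n)
			break;	/* truncated pack; the real code errors out */
		fwrite(buf, 1, n, dst_pack);
		size -= n;
	}
}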
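
The one case where the copied bytes are not identical is the fallback
for clients without delta base offset support: only the short object
header changes, from OFS_DELTA form to REF_DELTA form, while the
deflated delta payload itself is passed through untouched.
Schematically (the varint header layout is the real pack format; the
function names are mine):

/* Pack object header: the type sits in bits 4-6 of the first byte and
 * the size is a base-128 varint starting with the low 4 bits. */
static int encode_obj_header(unsigned char *hdr, int type,
			     unsigned long size)
{
	int n = 0;
	unsigned char c = (type << 4) | (size & 15);

	size >>= 4;
	while (size) {
		hdr[n++] = c | 0x80;	/* continuation bit */
		c = size & 0x7f;
		size >>= 7;
	}
	hdr[n++] = c;
	return n;
}

#define OBJ_REF_DELTA 7

/* Sketch: emit a REF_DELTA header, i.e. type + size followed by the
 * 20-byte SHA-1 of the base; the compressed delta data that follows
 * is then copied verbatim. */
static void emit_ref_delta(FILE *out, unsigned long size,
			   const unsigned char base_sha1[20])
{
	unsigned char hdr[10];
	int n = encode_obj_header(hdr, OBJ_REF_DELTA, size);

	fwrite(hdr, 1, n, out);
	fwrite(base_sha1, 1, 20, out);
}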
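
The brute-force path, by contrast, ignores all of the above and redoes
the delta search from scratch, conceptually something like the loop
below.  The real code first sorts objects by type, path and size so
that good candidates end up close together, and delta_size() here is a
hypothetical stand-in for the actual delta computation; this inner
loop is where all the CPU goes:

/* Sketch: try each object against the previous `window' objects and
 * remember the base giving the smallest delta.  The real code also
 * compares against the cost of storing the object whole. */
static void find_deltas(struct object_entry *obj, int nr, int window)
{
	int i, j;

	for (i = 1; i < nr; i++) {
		unsigned long best = (unsigned long)-1;

		for (j = i - 1; j >= 0 && j > i - window; j--) {
			unsigned long d = delta_size(&obj[j], &obj[i]);

			if (d < best) {
				best = d;
				obj[i].delta_base = &obj[j];
			}
		}
	}
}

Nicolas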