Re: [kernel.org users] Re: auto-packing on kernel.org? please?

From: Junio C Hamano <junkio@cox.net>
To: Nick Hengeveld <nickh@reactrix.com>
Cc: git@vger.kernel.org
Subject: Re: [kernel.org users] Re: auto-packing on kernel.org? please?
Date: Sun, 16 Oct 2005 15:12:03 -0700
Message-ID: <7vwtkd6rik.fsf@assigned-by-dhcp.cox.net>
In-Reply-To: <20051016213341.GF5509@reactrix.com> (Nick Hengeveld's message of "Sun, 16 Oct 2005 14:33:41 -0700")

Nick Hengeveld <nickh@reactrix.com> writes:

> On Sun, Oct 16, 2005 at 09:56:49AM -0700, Junio C Hamano wrote:
>
>> That's what the .idx file is for, except that after you fetch
>> the range, you may find you would need something else that the
>> object is delta against.
>
> Would it make sense to load the pack indexes for each base up front,
> and then fetch individual objects from a pack if they exist in one of
> a base's pack indexes?  In such a case, it may not even make sense to
> try fetching the object directly first.
>
> What are the circumstances under which it makes more sense to fetch the
> whole pack rather than fetching individual objects from it?

It would make sense if we end up needing most of them anyway, I
think.

We are probably far from this, but ideally, we should be able to
set up something like this.

We encourage the server side to prepare packs this way [*1*].

 -- development --> time --> flows --> this --> way -->

 (optional)
 full ------------------------------------------------

 base ---------------
 6mo                 ---------------------------------
 3mo                               -------------------
 1mo                                    --------------
 2wk                                        ----------
 1wk                                             -----
                                                      ^
                                 last pack optimization

That is, a big base pack (say v2.6.12), and multiple packs to
bring people who were in sync at various times up to date with
the time when the set of packs was last optimized.  Any objects
created after the last pack optimization time are left unpacked
until the next pack optimization time.  It might not be a bad
idea to also have a "full" pack.

For example, if you were in sync 5 months ago, fetching the 3mo
pack would not be enough and you would need to get the 6mo pack
to become up to date wrt the last pack optimization (say 3 days
ago).  You would already have obtained the objects not in any
pack, created within the last 3 days, as individual objects
before realizing that you need to fetch some pack.
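
To make the layout concrete, here is a rough sketch in Python of
which pack covers a given downloader.  The window lengths in days
and the name packs_to_fetch are made up for illustration; none of
this is existing git code, and a real fetcher would decide from
the pack indices rather than from dates.

```python
# Hypothetical pack windows, in days before the last pack optimization.
# Each pack covers everything from its start up to the optimization time.
WINDOWS = [("1wk", 7), ("2wk", 14), ("1mo", 30), ("3mo", 90), ("6mo", 180)]


def packs_to_fetch(days_behind):
    """Return the pack names needed to catch up to the last optimization.

    days_behind is how long before the last pack optimization the
    downloader was last in sync.
    """
    if days_behind <= 0:
        return []  # nothing packed is missing
    for name, span in WINDOWS:
        if days_behind <= span:
            return [name]  # smallest pack that reaches back far enough
    # Older than every incremental pack: the base pack is needed as well.
    return ["base", "6mo"]
```

Somebody last in sync 5 months ago, with the optimization 3 days
ago, is roughly 147 days behind it, so packs_to_fetch(147) gives
["6mo"].  The optional "full" pack could stand in for the
base + 6mo case.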

Then, we can teach git-http-fetch to do:

 - If an object is unavailable unpacked, get all the indices
   from that repository (and probably its alternates while we
   are at it).

 - Among the set of packs that contain the object we are
   currently interested in, try to find the "best" pack.  The
   definition of "best" would be a balancing act between finding
   the one that contains the fewest objects we already have and
   the one that contains the most objects we do not have yet.
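
The second step could be sketched roughly like this in Python.
The function name best_pack, the use of plain sets of object
names, and the scoring rule (most missing objects first, fewest
already-present objects as the tie-break) are all assumptions for
illustration; the "balancing act" is not pinned down any further
than that, and this is not what git-http-fetch does today.

```python
def best_pack(wanted, have, pack_indexes):
    """Pick a pack name for the object `wanted`, or None.

    have         -- set of object names already in the local repository
    pack_indexes -- dict mapping pack name to the set of object names
                    listed in that pack's .idx file
    """
    # Only packs that actually contain the wanted object are candidates.
    candidates = {
        name: objects
        for name, objects in pack_indexes.items()
        if wanted in objects
    }
    if not candidates:
        return None

    def score(name):
        objects = candidates[name]
        missing = len(objects - have)    # objects we would gain
        redundant = len(objects & have)  # objects downloaded for nothing
        # Prefer more missing objects; break ties on fewer redundant ones.
        return (missing, -redundant)

    return max(candidates, key=score)
```

With packs = {"1wk": {"a", "b"}, "1mo": {"a", "b", "c", "d"}},
a downloader that already has c and d gets "1wk" for object a,
while one that has none of them gets "1mo".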

The commit walker always goes from present to past, so you would
start by fetching the latest, presumably unpacked objects, and
as soon as you hit the last pack optimization boundary, you have
a choice of multiple packs.  If you are relatively up to date,
you would find that the 1mo pack has more things you already
have than the 1wk pack, although both of them would fit the bill
-- at that point you choose to download the 1wk pack.  On the
other hand, if you are behind, you may find that the 3mo pack
has more things you do not have than the 1wk or 2wk or 1mo pack,
and using the 3mo pack would become the right choice for you.

I think most repositories have a few related heads and their
heads almost never rewind, so favoring the pack that contains
the most objects we do not have would be the right strategy in
practice for the downloader.

[Footnote]

*1* This is different from a proposal posted on the list earlier
by somebody (I think it was Pasky but I may be mistaken) which
looked like this:

 -- development --> time --> flows --> this --> way -->

 base ---------------
 6mo                 --------------
 3mo                               -----
 1mo                                    ----
 2wk                                        -----
 1wk                                             -----

The thing is, the sum of the 3mo+1mo+2wk+1wk packs in the latter
scheme tends to be a lot bigger than the size of the 3mo pack in
the former scheme.
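
To see which packs are being summed, here is a rough Python
sketch of the latter scheme, where each pack stops where the
next newer one starts.  The boundaries in days and the name
segmented_scheme are invented for illustration, and the sketch
says nothing about actual pack sizes.

```python
# Hypothetical boundaries, in days before the newest packed object.
BOUNDS = [("1wk", 7), ("2wk", 14), ("1mo", 30), ("3mo", 90), ("6mo", 180)]


def segmented_scheme(days_behind):
    """List the packs, oldest first, needed under the latter scheme."""
    needed = []
    newer_start = 0  # where the next newer pack begins
    for name, start in BOUNDS:
        # A segment is needed if we were last in sync before it ended.
        if days_behind > newer_start:
            needed.append(name)
        newer_start = start
    if days_behind > newer_start:
        needed.append("base")
    return list(reversed(needed))
```

For somebody 90 days behind, segmented_scheme(90) gives
["3mo", "1mo", "2wk", "1wk"], where the former scheme needs only
the 3mo pack.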