From mboxrd@z Thu Jan 1 00:00:00 1970 From: Junio C Hamano Subject: Re: [kernel.org users] Re: auto-packing on kernel.org? please? Date: Sun, 16 Oct 2005 15:12:03 -0700 Message-ID: <7vwtkd6rik.fsf@assigned-by-dhcp.cox.net> References: <434EABFD.5070604@zytor.com> <434EC07C.30505@pobox.com> <435264B1.2010204@de.bosch.com> <20051016161244.GE5509@reactrix.com> <43527E86.8000907@didntduck.org> <7vzmp9xuwe.fsf@assigned-by-dhcp.cox.net> <20051016213341.GF5509@reactrix.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: git@vger.kernel.org X-From: git-owner@vger.kernel.org Mon Oct 17 00:13:02 2005 Return-path: Received: from vger.kernel.org ([209.132.176.167]) by ciao.gmane.org with esmtp (Exim 4.43) id 1ERGk4-0005iy-NA for gcvg-git@gmane.org; Mon, 17 Oct 2005 00:12:33 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932065AbVJPWMH (ORCPT ); Sun, 16 Oct 2005 18:12:07 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932066AbVJPWMG (ORCPT ); Sun, 16 Oct 2005 18:12:06 -0400 Received: from fed1rmmtao05.cox.net ([68.230.241.34]:38561 "EHLO fed1rmmtao05.cox.net") by vger.kernel.org with ESMTP id S932065AbVJPWMF (ORCPT ); Sun, 16 Oct 2005 18:12:05 -0400 Received: from assigned-by-dhcp.cox.net ([68.4.9.127]) by fed1rmmtao05.cox.net (InterMail vM.6.01.05.02 201-2131-123-102-20050715) with ESMTP id <20051016221142.FLHZ29333.fed1rmmtao05.cox.net@assigned-by-dhcp.cox.net>; Sun, 16 Oct 2005 18:11:42 -0400 To: Nick Hengeveld In-Reply-To: <20051016213341.GF5509@reactrix.com> (Nick Hengeveld's message of "Sun, 16 Oct 2005 14:33:41 -0700") User-Agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux) Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org Archived-At: Nick Hengeveld writes: > On Sun, Oct 16, 2005 at 09:56:49AM -0700, Junio C Hamano wrote: > >> That's what the .idx file is for, except that after you fetch >> the range, you may find you would need something else that the >> object is delta against. > > Would it make sense to load the pack indexes for each base up front, > and then fetch individual objects from a pack if they exist in one of > a base's pack indexes? In such a case, it may not even make sense to > try fetching the object directly first. > > What are the circumstances under which it makes more sense to fetch the > whole pack rather than fetching individual objects from it? It would make sense if we end up needing most them anyway, I think. We are probably far from this, but ideally, we should be able to set up something like this. We encourage the server side to prepare packs this way [*1*]. -- development --> time --> flows --> this --> way --> (optional) full ------------------------------------------------ base --------------- 6mo --------------------------------- 3mo ------------------- 1mo -------------- 2wk ---------- 1wk ----- ^ last pack optimization That is, a big base pack (say v2.6.12), and multiple packs to bring people that were in-sync at various time up-to-date to the time when the set of packs were last optimized. Any objects created after the last pack optimization time are left unpacked until the next pack optimization time. It might not be a bad idea to also have a "full" pack. For example, if you were in-sync 5-months ago, fetching 3mo pack would not be enough and you would need to get 6mo pack to become up-to-date wrt the last pack optimization (say 3 days ago). You would have obtained the objects not in pack, created within the last 3 days, already as individual objects before realizing that you would need to fetch some pack. Then, we can teach git-http-fetch to do: - If an object is unavailable unpacked, get all the indices from that repository (and probably its alternates while we are at it). - Among the set of packs that contain the object we are currently interested in, try to find the "best" pack. The definition of "best" would be a balancing act of finding the one that contains the least number of objects we already have, and the one that contains the most number of objects we do not have yet. The commit walker always goes from present to past, so you would start from fetching the latest, presumably unpacked objects, and as soon as you hit the last pack optimization boundary, you have choices of multiple packs. If you are relatively up-to-date, you would find that 1mo pack has more things you already have than 1wk pack, although both of them would fit the bill -- at that point you choose to download 1wk pack. On the other hand, if you are behind, you may find that 3mo pack has more things you do not have than 1wk or 2wk or 1mo pack, and using 3mo pack would become the right choice for you. I think most repositories have a few related heads and their heads almost never rewind, so favoring the pack that contains the most number of objects we do not have would be the right strategy in practice for the downloader. [Footnote] *1* This is different from a proposal posted on the list earlier by somebody (I think it was Pasky but I may be mistaken) which looked like this: -- development --> time --> flows --> this --> way --> base --------------- 6mo -------------- 3mo ----- 1mo ---- 2wk ----- 1wk ----- The thing is, sum of 3mo+1mo+2wk+1wk packs in the latter scheme tends to be a lot bigger than the size of 3mo pack in the former scheme.