All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jon Smirl <jonsmirl@gmail.com>
To: Nicolas Pitre <nico@cam.org>
Cc: Junio C Hamano <gitster@pobox.com>,
	david@lang.hm, Nicolas Sebrecht <nicolas.s-dev@laposte.net>,
	"Robin H. Johnson" <robbat2@gentoo.org>,
	Git Mailing List <git@vger.kernel.org>,
	"Shawn O. Pearce" <spearce@spearce.org>
Subject: Re: Performance issue: initial git clone causes massive repack
Date: Mon, 6 Apr 2009 11:28:39 -0400	[thread overview]
Message-ID: <9e4733910904060828m414dfe7v66b19f7b4c5b670e@mail.gmail.com> (raw)
In-Reply-To: <alpine.LFD.2.00.0904061042300.6741@xanadu.home>

On Mon, Apr 6, 2009 at 11:14 AM, Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 6 Apr 2009, Jon Smirl wrote:
>
>> On Mon, Apr 6, 2009 at 10:19 AM, Nicolas Pitre <nico@cam.org> wrote:
>> > On Mon, 6 Apr 2009, Jon Smirl wrote:
>> >
>> >> First thing an initial clone does is copy all of the pack files from
>> >> the server to the client without even looking at them.
>> >
>> > This is a no go for reasons already stated many times.  There are
>> > security implications (those packs might contain stuff that you didn't
>> > intend to be publically accessible) and there might be efficiency
>> > reasons as well (you might have a shared object store with lots of stuff
>> > unrelated to the particular clone).
>>
>> How do you deal with dense history packs? These packs take many hours
>> to make (on a server class machine) and can be half the size of a
>> regular pack. Shouldn't there be a way to copy these packs intact on
>> an initial clone? It's ok if these packs are specially marked as being
>> ok to copy.
>
> [sigh]
>
> Let me explain it all again.
>
> There is basically two ways to create a new pack: the intelligent way,
> and the bruteforce way.
>
> When creating a new pack the intelligent way, what we do is to enumerate
> all the needed object and look them up in the object store.  When a
> particular object is found, we create a record for that object and note
> in which pack it is located, at what offset in that pack, how much space
> it occupies in its compressed form within that pack, , and if whether it
> is a delta or not.  When that object is indeed a delta (the majority of
> objects usually are) then we also keep a pointer on the record for the
> base object for that delta.
>
> Next, for all objects in delta form which base object is also part of
> the object enumeration and obviously part of the same pack, we simply
> flag those objects as directly reusable without any further processing.
> This means that, when those objects are about to be stored in the new
> pack, their raw data is simply copied straight from the original pack
> using the offset and size noted above.  In other words, those objects
> are simply never redeltified nor redeflated at all, and all the work
> that was previously done to find the best delta match is preserved with
> no extra cost.

Does this process cause random reads all over a 2GB pack file? Busy
servers can't keep a 2GB pack in memory.
sendfile() the 2GB pack to client is way more efficient. (assuming the
pack is marked as being ok to send).

>
> Of course, when your repository is tightly packed into a single pack,
> then all enumerated objects fall into the reusable category and
> therefore a copy of the original pack is indeed sent over the wire.
> One exception is with older git clients which don't support the delta
> base offset encoding, in which case the delta reference encoding is
> substituted on the fly with almost no cost (this is btw another reason
> why a dumb copy of existing pack may not work universally either).  But
> in the common case, you might see the above as just the same as if git
> did copy the pack file because it really only reads some data from a
> pack and immediately writes that data out.
>
> The bruteforce repacking is different because it simply doesn't concern
> itself with existing deltas at all.  It instead start everything from
> scratch and perform the whole delta search all over for all objects.
> This is what takes lots of resources and CPU cycles, and as you may
> guess, is never used for fetch/clone requests.
>
>
> Nicolas
>



-- 
Jon Smirl
jonsmirl@gmail.com

  reply	other threads:[~2009-04-06 15:30 UTC|newest]

Thread overview: 97+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-04-04 22:07 Performance issue: initial git clone causes massive repack Robin H. Johnson
2009-04-05  0:05 ` Nicolas Sebrecht
2009-04-05  0:37   ` Robin H. Johnson
2009-04-05  3:54     ` Nicolas Sebrecht
2009-04-05  4:08       ` Nicolas Sebrecht
2009-04-05  7:04       ` Robin H. Johnson
2009-04-05 19:02         ` Nicolas Sebrecht
2009-04-05 19:17           ` Shawn O. Pearce
2009-04-05 23:02             ` Robin H. Johnson
2009-04-05 20:43           ` Robin H. Johnson
2009-04-05 21:08             ` Shawn O. Pearce
2009-04-05 21:28           ` david
2009-04-05 21:36             ` Sverre Rabbelier
2009-04-06  3:24               ` Nicolas Pitre
2009-04-07  8:10                 ` Björn Steinbrink
2009-04-07  9:45                   ` Jakub Narebski
2009-04-07 13:13                     ` Nicolas Pitre
2009-04-07 13:37                       ` Jakub Narebski
2009-04-07 14:03                         ` Jon Smirl
2009-04-07 17:59                         ` Nicolas Pitre
2009-04-07 14:21                       ` Björn Steinbrink
2009-04-07 17:48                         ` Nicolas Pitre
2009-04-07 18:12                           ` Björn Steinbrink
2009-04-07 18:56                             ` Nicolas Pitre
2009-04-07 20:27                               ` Björn Steinbrink
2009-04-08  4:52                                 ` Nicolas Pitre
2009-04-10 20:38                                   ` Robin H. Johnson
2009-04-11  1:58                                     ` Nicolas Pitre
2009-04-11  7:06                                       ` Mike Hommey
2009-04-14 15:52                                     ` Johannes Schindelin
2009-04-14 20:17                                       ` Nicolas Pitre
2009-04-14 20:27                                         ` Robin H. Johnson
2009-04-14 21:02                                           ` Nicolas Pitre
2009-04-15  3:09                                           ` Nguyen Thai Ngoc Duy
2009-04-15  5:53                                             ` Robin H. Johnson
2009-04-15  5:54                                             ` Junio C Hamano
2009-04-15 11:51                                               ` Nicolas Pitre
2009-04-22  1:15                                           ` Sam Vilain
2009-04-22  9:55                                             ` Mike Ralphson
2009-04-22 11:24                                               ` Pieter de Bie
2009-04-22 13:19                                               ` Johannes Schindelin
2009-04-22 14:35                                                 ` Shawn O. Pearce
2009-04-22 16:40                                                   ` Andreas Ericsson
2009-04-22 17:06                                                     ` Johannes Schindelin
2009-04-23 19:30                                               ` Christian Couder
2009-04-22 14:14                                             ` Nicolas Pitre
2009-04-22 22:01                                               ` Sam Vilain
2009-04-22 22:50                                                 ` Björn Steinbrink
2009-04-22 23:07                                                 ` Nicolas Pitre
2009-04-22 23:30                                                   ` Johannes Schindelin
2009-04-23  3:16                                                     ` Nicolas Pitre
2009-04-14 20:30                                         ` Johannes Schindelin
2009-04-07 20:29                             ` Jeff King
2009-04-07 20:35                               ` Björn Steinbrink
2009-04-08 11:28                       ` [PATCH] process_{tree,blob}: Remove useless xstrdup calls Björn Steinbrink
2009-04-10 22:20                         ` Linus Torvalds
2009-04-11  0:27                           ` Linus Torvalds
2009-04-11  1:15                             ` Linus Torvalds
2009-04-11  1:34                               ` Nicolas Pitre
2009-04-11 13:41                               ` Björn Steinbrink
2009-04-11 14:07                                 ` Björn Steinbrink
2009-04-11 18:06                                   ` Linus Torvalds
2009-04-11 18:22                                     ` Linus Torvalds
2009-04-11 19:22                                       ` Björn Steinbrink
2009-04-11 20:50                                     ` Björn Steinbrink
2009-04-11 21:43                                       ` Linus Torvalds
2009-04-11 23:24                                         ` Björn Steinbrink
2009-04-11 18:19                                   ` Linus Torvalds
2009-04-11 19:40                                     ` Björn Steinbrink
2009-04-11 19:58                                       ` Linus Torvalds
2009-04-05 22:59             ` Performance issue: initial git clone causes massive repack Nicolas Sebrecht
2009-04-05 23:20               ` david
2009-04-05 23:28                 ` Robin Rosenberg
2009-04-06  3:34                 ` Nicolas Pitre
2009-04-06  5:15                   ` Junio C Hamano
2009-04-06 13:12                     ` Nicolas Pitre
2009-04-06 13:52                     ` Jon Smirl
2009-04-06 14:19                       ` Nicolas Pitre
2009-04-06 14:37                         ` Jon Smirl
2009-04-06 14:48                           ` Shawn O. Pearce
2009-04-06 15:14                           ` Nicolas Pitre
2009-04-06 15:28                             ` Jon Smirl [this message]
2009-04-06 16:14                               ` Nicolas Pitre
2009-04-06 11:22                   ` Matthieu Moy
2009-04-06 13:29                     ` Nicolas Pitre
2009-04-06 14:03                       ` Robin H. Johnson
2009-04-06 14:14                         ` Nicolas Pitre
2009-04-07 10:11               ` Martin Langhoff
2009-04-05 19:57 ` Jeff King
2009-04-05 23:38   ` Robin H. Johnson
2009-04-05 23:42     ` Robin H. Johnson
     [not found]     ` <0015174c150e49b5740466d7d2c2@google.com>
2009-04-06  0:29       ` Robin H. Johnson
2009-04-06  3:10     ` Nguyen Thai Ngoc Duy
2009-04-06  4:09       ` Nicolas Pitre
2009-04-06  4:06     ` Nicolas Pitre
2009-04-06 14:20       ` Robin H. Johnson
2009-04-11 17:24 ` Mark Levedahl

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9e4733910904060828m414dfe7v66b19f7b4c5b670e@mail.gmail.com \
    --to=jonsmirl@gmail.com \
    --cc=david@lang.hm \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=nico@cam.org \
    --cc=nicolas.s-dev@laposte.net \
    --cc=robbat2@gentoo.org \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.