All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bo Chen <chen@chenirvine.org>
To: Jeff King <peff@peff.net>
Cc: Sergio <sergio.callegari@gmail.com>, git@vger.kernel.org
Subject: Re: GSoC - Some questions on the idea of
Date: Fri, 30 Mar 2012 19:08:42 -0400	[thread overview]
Message-ID: <CA+M5ThR6jtxqs0-Kz-8fcRuOFRbLr-GvsJcTmrOQ7_geNspDLg@mail.gmail.com> (raw)
In-Reply-To: <20120330203430.GB20376@sigill.intra.peff.net>

I appreciate for the instant reply.

My comments are inline below.

On Fri, Mar 30, 2012 at 4:34 PM, Jeff King <peff@peff.net> wrote:
> On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote:
>
>> The sub-problems of "delta for large file" problem.
>>
>> 1 large file
>>
>> 1.1 text file (always delta well? need to be confirmed)
>
> They often do, but text files don't tend to be large. There are some
> exceptions (e.g., genetic data is often kept in line-oriented text
> files, but is very large).
>
> But let's take a step back for a moment. Forget about whether a file is
> binary or not. Imagine you want to store a very large file in git.
>
> What are the operations that will perform badly? How can we make them
> perform acceptably, and what tradeoffs must we make? E.g., the way the
> diff code is written, it would be very difficult to run "git diff" on a
> 2 gigabyte file. But is that actually a problem? Answering that means
> talking about the characteristics of 2 gigabyte files, and what we
> expect to see, and to what degree our tradeoffs will impact them.
>
> Here's a more concrete example. At first, even storing a 2 gigabyte file
> with "git add" was painful, because we would load the whole thing in
> memory. Repacking the repository was painful, because we had to rewrite
> the whole 2G file into a packfile. Nowadays, we stream large files
> directly into their own packfiles, and we have to pay the I/O only once
> (and the memory cost never). As a tradeoff, we no longer get delta
> compression of large objects. That's OK for some large objects, like
> movie files (which don't tend to delta well, anyway). But it's not for
> other objects, like virtual machine images, which do tend to delta well.

It seems that we should first provide some kind of mechanism which can
distinguish the delta-friendly objects and non delta-friendly objects.
I am wondering whether this algorithm is available now or will be
devised.



>
> So can we devise a solution which efficiently stores these
> delta-friendly objects, without losing the performance improvements we
> got with the stream-directly-to-packfile approach?

Ah, I see. Design efficient solution for storing the delta-friendly
objects is the main concern. Thank you for helping me clarify this
point.

>
> One possible solution is breaking large files into smaller chunks using
> something like the bupsplit algorithm (and I won't go into the details
> here, as links to bup have already been mentioned elsewhere, and Junio's
> patches make a start at this sort of splitting).
>
> Note that there are other problem areas with big files that can be
> worked on, too. For example, some people want to store 100 gigabytes in
> a repository. Because git is distributed, that means 100G in the repo
> database, and 100G in the working directory, for a total of 200G. People
> in this situation may want to be able to store part of the repository
> database in a network-accessible location, trading some of the
> convenience of being fully distributed for the space savings. So another
> project could be designing a network-based alternate object storage
> system.

>From the architecture point of view, CVS is fully centralized, and Git
is fully distributed. It seems that for big repo, the architecture
described above is in the middle now ^-^.

>
> -Peff

Bo

  reply	other threads:[~2012-03-30 23:08 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-03-28  4:38 GSoC - Some questions on the idea of "Better big-file support" Bo Chen
2012-03-28  6:19 ` Nguyen Thai Ngoc Duy
2012-03-28 11:33   ` GSoC - Some questions on the idea of Sergio
2012-03-30 19:44     ` Bo Chen
2012-03-30 19:51     ` Bo Chen
2012-03-30 20:34       ` Jeff King
2012-03-30 23:08         ` Bo Chen [this message]
2012-03-31 11:02           ` Sergio Callegari
2012-03-31 16:18             ` Neal Kreitzinger
2012-04-02 21:07               ` Jeff King
2012-04-03  9:58                 ` Sergio Callegari
2012-04-11  1:24                 ` Neal Kreitzinger
2012-04-11  6:04                   ` Jonathan Nieder
2012-04-11 16:29                     ` Neal Kreitzinger
2012-04-11 22:09                       ` Jeff King
2012-04-11 16:35                     ` Neal Kreitzinger
2012-04-11 16:44                     ` Neal Kreitzinger
2012-04-11 17:20                       ` Jonathan Nieder
2012-04-11 18:51                         ` Junio C Hamano
2012-04-11 19:03                           ` Jonathan Nieder
2012-04-11 18:23                     ` Neal Kreitzinger
2012-04-11 21:35                   ` Jeff King
2012-04-12 19:29                     ` Neal Kreitzinger
2012-04-12 21:03                       ` Jeff King
     [not found]                         ` <4F8A2EBD.1070407@gmail.com>
2012-04-15  2:15                           ` Jeff King
2012-04-15  2:33                             ` Neal Kreitzinger
2012-04-16 14:54                               ` Jeff King
2012-05-10 21:43                             ` Neal Kreitzinger
2012-05-10 22:39                               ` Jeff King
2012-04-12 21:08                       ` Neal Kreitzinger
2012-04-13 21:36                       ` Bo Chen
2012-03-31 15:19         ` Neal Kreitzinger
2012-04-02 21:40           ` Jeff King
2012-04-02 22:19             ` Junio C Hamano
2012-04-03 10:07               ` Jeff King
2012-03-31 16:49         ` Neal Kreitzinger
2012-03-31 20:28         ` Neal Kreitzinger
2012-03-31 21:27           ` Bo Chen
2012-04-01  4:22             ` Nguyen Thai Ngoc Duy
2012-04-01 23:30               ` Bo Chen
2012-04-02  1:00                 ` Nguyen Thai Ngoc Duy
2012-03-30 19:11   ` GSoC - Some questions on the idea of "Better big-file support" Bo Chen
2012-03-30 19:54     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+M5ThR6jtxqs0-Kz-8fcRuOFRbLr-GvsJcTmrOQ7_geNspDLg@mail.gmail.com \
    --to=chen@chenirvine.org \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    --cc=sergio.callegari@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.