From: Bo Chen <chen@chenirvine.org>
To: Neal Kreitzinger <nkreitzinger@gmail.com>
Cc: Jeff King <peff@peff.net>, Sergio <sergio.callegari@gmail.com>,
	git@vger.kernel.org
Subject: Re: GSoC - Some questions on the idea of
Date: Sat, 31 Mar 2012 17:27:49 -0400
Message-ID: <CA+M5ThTKtSFPq8A3oc1wvc9i0vG1NMyHCRE+poYaq+65FQWOxw@mail.gmail.com>
In-Reply-To: <4F7768D6.3010400@gmail.com>

On Sat, Mar 31, 2012 at 4:28 PM, Neal Kreitzinger
<nkreitzinger@gmail.com> wrote:
> On 3/30/2012 3:34 PM, Jeff King wrote:
>>
>> On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote:
>>
>>> The sub-problems of "delta for large file" problem.
>>>
>>> 1 large file
>>>
>> But let's take a step back for a moment. Forget about whether a file is
>> binary or not. Imagine you want to store a very large file in git.
>>
>> What are the operations that will perform badly? How can we make them
>> perform acceptably, and what tradeoffs must we make? E.g., the way the
>> diff code is written, it would be very difficult to run "git diff" on a
>> 2 gigabyte file. But is that actually a problem? Answering that means
>> talking about the characteristics of 2 gigabyte files, and what we
>> expect to see, and to what degree our tradeoffs will impact them.
>>
>> Here's a more concrete example. At first, even storing a 2 gigabyte file
>> with "git add" was painful, because we would load the whole thing in
>> memory. Repacking the repository was painful, because we had to rewrite
>> the whole 2G file into a packfile. Nowadays, we stream large files
>> directly into their own packfiles, and we have to pay the I/O only once
>> (and the memory cost never). As a tradeoff, we no longer get delta
>> compression of large objects. That's OK for some large objects, like
>> movie files (which don't tend to delta well, anyway). But it's not for
>> other objects, like virtual machine images, which do tend to delta well.
>>
>> So can we devise a solution which efficiently stores these
>> delta-friendly objects, without losing the performance improvements we
>> got with the stream-directly-to-packfile approach?
>>
>> One possible solution is breaking large files into smaller chunks using
>> something like the bupsplit algorithm (and I won't go into the details
>> here, as links to bup have already been mentioned elsewhere, and Junio's
>> patches make a start at this sort of splitting).
>>
> (I'm no expert on "big-files" in git or elsewhere, but this thread is
> immensely interesting to me as a git user who wants to track all sorts of
> binary files and possibly large text files in the very near future, ie. all
> components tied to a server build and upgrades beyond the linux-distro/rpms
> and perhaps including them also.)
>
> Let's take an even bigger step back for a moment.  Who determines if a file
> shall be a big-file or not?  Git or the user?  How is it determined if a
> file shall be a "big-file" or not?
>
> Who decides bigness:
> Bigness seems to be relative to system resources.  Does the user crunch the
> numbers to determine if a file is big-file, or does git?  If the numbers are
> relative then should git query the system and make the determination?
> Either way, once the system-resources are upgraded and formerly "big-files"
> are no longer considered "big" how is the previous history refactored to
> behave "non-big-file-like"?  Conversely, if the system-resources are
> re-distributed so that formerly non-big files are now relatively big (ie,
> moved from powerful central server login to laptops), how is the history
> refactored to accommodate the newly-relative-bigness?
>

Common sense says that a file of tens of MBs should not be considered
a big file, while a file of tens of GBs definitely should. I think one
simple, workable solution is to let the user set the big-file
threshold. A more complicated but smarter solution is to let git
auto-configure the threshold by evaluating the computing resources
available on the running platform (a physical machine or just a VM).
As for the problem of moving a repo between platforms with different
computing power, the git repo should also keep track of the big-file
threshold under which each specific file was handled.
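
To make the two options above concrete, here is a minimal Python sketch.
It is only an illustration, not how git implements anything: the
function names, the 25% RAM fraction, and the floor and ceiling values
are all assumptions made up for this example. (Git does already expose
a user-set threshold as the core.bigFileThreshold config variable; the
auto-configuration and the per-file record below are hypothetical.)

```python
from dataclasses import dataclass
from typing import Optional

MIB = 1024 * 1024
GIB = 1024 * MIB


def choose_threshold(total_ram_bytes: int,
                     user_threshold: Optional[int] = None,
                     ram_fraction: float = 0.25,
                     floor: int = 64 * MIB,
                     ceiling: int = 2 * GIB) -> int:
    """Pick a big-file threshold in bytes.

    A user-set value always wins. Otherwise take a fraction of RAM
    (an assumed heuristic) and clamp it between floor and ceiling.
    """
    if user_threshold is not None:
        return user_threshold
    auto = int(total_ram_bytes * ram_fraction)
    return max(floor, min(auto, ceiling))


@dataclass(frozen=True)
class StoredFileInfo:
    """Hypothetical per-file record of the threshold in effect when stored."""
    path: str
    size: int
    threshold_used: int

    @property
    def handled_as_big(self) -> bool:
        return self.size >= self.threshold_used


def needs_rehandling(info: StoredFileInfo, current_threshold: int) -> bool:
    """True if the file would be classified differently on this platform."""
    return info.handled_as_big != (info.size >= current_threshold)


if __name__ == "__main__":
    laptop = choose_threshold(1 * GIB)   # small machine
    server = choose_threshold(8 * GIB)   # larger machine
    image = StoredFileInfo("vm.img", 1 * GIB, laptop)
    print(laptop // MIB, server // MIB, needs_rehandling(image, server))
```

Recording threshold_used next to each file is what lets a later platform
notice that its own threshold classifies the file differently, which is
the migration case raised in the quoted question.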


> How bigness is decided:
> There seems to be two basic types of big-files:  big-worktree-files, and
> big-history-files.  A big-worktree-file that is delta-friendly is not a
> big-history-file.  A non-big-worktree-file that is delta-unfriendly is a
> big-file-history problem.  If you are working alone on an old computer you
> are probably more concerned about big-worktree-files (memory).  If you are
> working in a large group making lots of changes to the same files on a
> powerful server then you are probably more concerned about
> big-history-file-size (diskspace).  Of course, all are concerned about
> big-worktree-files that are delta-unfriendly.
>
> At what point is a delta-friendly file considered a "big-file"?  I assume
> that may depend on the degree delta-friendliness.  I imagine that a text
> file and vm-image differ in delta-friendliness by several degrees.
>
> At what point(s) is a delta-unfriendly file considered a "big-file"?  I
> assume that may depend on the degree(s) of delta-unfriendliness.  I imagine
> a compiled program and compressed-container differ in delta-unfriendliness
> by several degrees.
>
> My understanding is that git does not ever delta-compress binary files.
> That would mean even a small-worktree-binary-file becomes a
> big-history-file over time.
>
> v/r,
> neal
