Re: GSoC - Some questions on the idea of

From: Bo Chen <chen@chenirvine.org>
To: Neal Kreitzinger <nkreitzinger@gmail.com>
Cc: Jeff King <peff@peff.net>, Sergio <sergio.callegari@gmail.com>,
	git@vger.kernel.org
Subject: Re: GSoC - Some questions on the idea of
Date: Sat, 31 Mar 2012 17:27:49 -0400
Message-ID: <CA+M5ThTKtSFPq8A3oc1wvc9i0vG1NMyHCRE+poYaq+65FQWOxw@mail.gmail.com>
In-Reply-To: <4F7768D6.3010400@gmail.com>

On Sat, Mar 31, 2012 at 4:28 PM, Neal Kreitzinger
<nkreitzinger@gmail.com> wrote:
> On 3/30/2012 3:34 PM, Jeff King wrote:
>>
>> On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote:
>>
>>> The sub-problems of "delta for large file" problem.
>>>
>>> 1 large file
>>>
>> But let's take a step back for a moment. Forget about whether a file is
>> binary or not. Imagine you want to store a very large file in git.
>>
>> What are the operations that will perform badly? How can we make them
>> perform acceptably, and what tradeoffs must we make? E.g., the way the
>> diff code is written, it would be very difficult to run "git diff" on a
>> 2 gigabyte file. But is that actually a problem? Answering that means
>> talking about the characteristics of 2 gigabyte files, and what we
>> expect to see, and to what degree our tradeoffs will impact them.
>>
>> Here's a more concrete example. At first, even storing a 2 gigabyte file
>> with "git add" was painful, because we would load the whole thing in
>> memory. Repacking the repository was painful, because we had to rewrite
>> the whole 2G file into a packfile. Nowadays, we stream large files
>> directly into their own packfiles, and we have to pay the I/O only once
>> (and the memory cost never). As a tradeoff, we no longer get delta
>> compression of large objects. That's OK for some large objects, like
>> movie files (which don't tend to delta well, anyway). But it's not for
>> other objects, like virtual machine images, which do tend to delta well.
>>
>> So can we devise a solution which efficiently stores these
>> delta-friendly objects, without losing the performance improvements we
>> got with the stream-directly-to-packfile approach?
>>
>> One possible solution is breaking large files into smaller chunks using
>> something like the bupsplit algorithm (and I won't go into the details
>> here, as links to bup have already been mentioned elsewhere, and Junio's
>> patches make a start at this sort of splitting).
>>
> (I'm no expert on "big-files" in git or elsewhere, but this thread is
> immensely interesting to me as a git user who wants to track all sorts of
> binary files and possibly large text files in the very near future, ie. all
> components tied to a server build and upgrades beyond the linux-distro/rpms
> and perhaps including them also.)
>
> Let's take an even bigger step back for a moment.  Who determines if a file
> shall be a big-file or not?  Git or the user?  How is it determined if a
> file shall be a "big-file" or not?
>
> Who decides bigness:
> Bigness seems to be relative to system resources.  Does the user crunch the
> numbers to determine if a file is big-file, or does git?  If the numbers are
> relative then should git query the system and make the determination?
> Either way, once the system-resources are upgraded and formerly "big-files"
> are no longer considered "big" how is the previous history refactored to
> behave "non-big-file-like"?  Conversely, if the system-resources are
> re-distributed so that formerly non-big files are now relatively big (ie,
> moved from powerful central server login to laptops), how is the history
> refactored to accommodate the newly-relative-bigness?
>

Intuitively, a file of tens of MBs should not count as a big file,
while a file of tens of GBs definitely should. I think one simple,
workable solution is to let the user set the big-file threshold. A
more complicated but smarter solution is to let git configure the
threshold automatically by evaluating the computing resources of the
platform it is running on (a physical machine or just a VM). As for
moving a repo between platforms with different computing power, the
git repo should also keep track of the big-file threshold under which
each file was handled.
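
To make the two options concrete, here is a minimal sketch in Python
(not git code). The function names, the divisor of 8, and the "512m"
fallback are assumptions for illustration only; the k/m/g suffixes
follow the way sizes are written in git config values such as
core.bigFileThreshold. An explicit user setting always wins; otherwise
the threshold is derived from the RAM the platform reports.

```python
import os

# Suffixes are powers of 1024, as in git config size values.
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Parse sizes such as '512m' or '2g'; a bare number means bytes."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def physical_memory():
    """Total RAM in bytes, or None if sysconf cannot report it here."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


def big_file_threshold(user_setting=None, ram_bytes=None,
                       divisor=8, fallback="512m"):
    """Pick the size (in bytes) above which a file is treated as big.

    divisor and fallback are hypothetical tuning knobs, not git options.
    """
    if user_setting is not None:          # option 1: the user decides
        return parse_size(user_setting)
    if ram_bytes is None:                 # option 2: ask the platform
        ram_bytes = physical_memory()
    if ram_bytes is None:                 # platform gave no answer
        return parse_size(fallback)
    return ram_bytes // divisor


if __name__ == "__main__":
    print("auto threshold:", big_file_threshold(), "bytes")
    print("user threshold:", big_file_threshold("2g"), "bytes")
```

Recording the value returned here next to each stored object would be
one way to remember which threshold a file was handled under when the
repo later moves to a weaker or stronger machine.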

> How bigness is decided:
> There seem to be two basic types of big-files:  big-worktree-files, and
> big-history-files.  A big-worktree-file that is delta-friendly is not a
> big-history-file.  A non-big-worktree-file that is delta-unfriendly is a
> big-file-history problem.  If you are working alone on an old computer you
> are probably more concerned about big-worktree-files (memory).  If you are
> working in a large group making lots of changes to the same files on a
> powerful server then you are probably more concerned about
> big-history-file-size (diskspace).  Of course, all are concerned about
> big-worktree-files that are delta-unfriendly.
>
> At what point is a delta-friendly file considered a "big-file"?  I assume
> that may depend on the degree of delta-friendliness.  I imagine that a text
> file and vm-image differ in delta-friendliness by several degrees.
>
> At what point(s) is a delta-unfriendly file considered a "big-file"?  I
> assume that may depend on the degree(s) of delta-unfriendliness.  I imagine
> a compiled program and compressed-container differ in delta-unfriendliness
> by several degrees.
>
> My understanding is that git does not ever delta-compress binary files.
> That would mean even a small-worktree-binary-file becomes a
> big-history-file over time.
>
> v/r,
> neal
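
On the chunking idea Jeff mentions above, here is a minimal sketch of
content-defined splitting in Python. It is not the bupsplit algorithm
itself, only a generic polynomial rolling hash that shows the same
principle: a chunk boundary depends on the last few bytes of content,
not on the byte offset, so an insertion near the start of a large file
leaves most later chunks identical. WINDOW, MASK, BASE and the function
names are assumptions chosen for illustration.

```python
import hashlib
import random

WINDOW = 48                    # bytes hashed at each position (assumed)
MASK = (1 << 10) - 1           # cut when the low 10 bits are all set
BASE = 1000003                 # multiplier of the polynomial hash (assumed)
MOD = 1 << 32
DROP = pow(BASE, WINDOW, MOD)  # weight of the byte leaving the window


def split_chunks(data):
    """Cut data wherever the hash of the last WINDOW bytes matches MASK."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD                  # new byte enters
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * DROP) % MOD  # old byte leaves
        if (h & MASK) == MASK:                       # content-defined cut
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):                            # trailing remainder
        chunks.append(data[start:])
    return chunks


def chunk_ids(data):
    """Name each chunk by its SHA-1 so identical chunks are stored once."""
    return [hashlib.sha1(c).hexdigest() for c in split_chunks(data)]


if __name__ == "__main__":
    rng = random.Random(0)
    old = bytes(rng.getrandbits(8) for _ in range(200_000))
    new = b"X" + old                                 # one byte inserted
    old_ids, new_ids = chunk_ids(old), chunk_ids(new)
    fresh = set(new_ids) - set(old_ids)
    print(len(new_ids), "chunks,", len(fresh), "not already stored")
```

With fixed-size chunks the same one-byte insertion would shift every
boundary and change every chunk. A real implementation would also
enforce minimum and maximum chunk sizes, which this sketch leaves out.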