Re: GSoC - Some questions on the idea of

From: Neal Kreitzinger <nkreitzinger@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Bo Chen <chen@chenirvine.org>,
	Sergio <sergio.callegari@gmail.com>,
	git@vger.kernel.org
Subject: Re: GSoC - Some questions on the idea of
Date: Sat, 31 Mar 2012 15:28:06 -0500	[thread overview]
Message-ID: <4F7768D6.3010400@gmail.com> (raw)
In-Reply-To: <20120330203430.GB20376@sigill.intra.peff.net>

On 3/30/2012 3:34 PM, Jeff King wrote:
> On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote:
>
>> The sub-problems of "delta for large file" problem.
>>
>> 1 large file
>>
> But let's take a step back for a moment. Forget about whether a file is
> binary or not. Imagine you want to store a very large file in git.
>
> What are the operations that will perform badly? How can we make them
> perform acceptably, and what tradeoffs must we make? E.g., the way the
> diff code is written, it would be very difficult to run "git diff" on a
> 2 gigabyte file. But is that actually a problem? Answering that means
> talking about the characteristics of 2 gigabyte files, and what we
> expect to see, and to what degree our tradeoffs will impact them.
>
> Here's a more concrete example. At first, even storing a 2 gigabyte file
> with "git add" was painful, because we would load the whole thing in
> memory. Repacking the repository was painful, because we had to rewrite
> the whole 2G file into a packfile. Nowadays, we stream large files
> directly into their own packfiles, and we have to pay the I/O only once
> (and the memory cost never). As a tradeoff, we no longer get delta
> compression of large objects. That's OK for some large objects, like
> movie files (which don't tend to delta well, anyway). But it's not for
> other objects, like virtual machine images, which do tend to delta well.
>
> So can we devise a solution which efficiently stores these
> delta-friendly objects, without losing the performance improvements we
> got with the stream-directly-to-packfile approach?
>
> One possible solution is breaking large files into smaller chunks using
> something like the bupsplit algorithm (and I won't go into the details
> here, as links to bup have already been mentioned elsewhere, and Junio's
> patches make a start at this sort of splitting).
>
(I'm no expert on "big-files" in git or elsewhere, but this thread is 
immensely interesting to me as a git user who wants to track all sorts 
of binary files and possibly large text files in the very near future, 
ie. all components tied to a server build and upgrades beyond the 
linux-distro/rpms and perhaps including them also.)

Let's take an even bigger step back for a moment.  Who determines if a 
file shall be a big-file or not?  Git or the user?  How is it determined 
if a file shall be a "big-file" or not?

Who decides bigness:
Bigness seems to be relative to system resources.  Does the user crunch 
the numbers to determine if a file is big-file, or does git?  If the 
numbers are relative then should git query the system and make the 
determination?  Either way, once the system-resources are upgraded and 
formerly "big-files" are no longer considered "big" how is the previous 
history refactored to behave "non-big-file-like"?  Conversely, if the 
system-resources are re-distributed so that formerly non-big files are 
now relatively big (ie, moved from powerful central server login to 
laptops), how is the history refactored to accommodate the 
newly-relative-bigness?

How bigness is decided:
There seems to be two basic types of big-files:  big-worktree-files, and 
big-history-files.  A big-worktree-file that is delta-friendly is not a 
big-history-file.  A non-big-worktree-file that is delta-unfriendly is a 
big-file-history problem.  If you are working alone on an old computer 
you are probably more concerned about big-worktree-files (memory).  If 
you are working in a large group making lots of changes to the same 
files on a powerful server then you are probably more concerned about 
big-history-file-size (diskspace).  Of course, all are concerned about 
big-worktree-files that are delta-unfriendly.

At what point is a delta-friendly file considered a "big-file"?  I 
assume that may depend on the degree delta-friendliness.  I imagine that 
a text file and vm-image differ in delta-friendliness by several degrees.

At what point(s) is a delta-unfriendly file considered a "big-file"?  I 
assume that may depend on the degree(s) of delta-unfriendliness.  I 
imagine a compiled program and compressed-container differ in 
delta-unfriendliness by several degrees.

My understanding is that git does not ever delta-compress binary files. 
  That would mean even a small-worktree-binary-file becomes a 
big-history-file over time.

v/r,
neal