From: Jeff King <peff@peff.net>
To: Neal Kreitzinger <nkreitzinger@gmail.com>
Cc: Bo Chen <chen@chenirvine.org>,
	Sergio <sergio.callegari@gmail.com>,
	git@vger.kernel.org
Subject: Re: GSoC - Some questions on the idea of
Date: Mon, 2 Apr 2012 17:40:49 -0400
Message-ID: <20120402214049.GB28926@sigill.intra.peff.net>
In-Reply-To: <4F77209A.8050607@gmail.com>

On Sat, Mar 31, 2012 at 10:19:54AM -0500, Neal Kreitzinger wrote:

> >Note that there are other problem areas with big files that can be
> >worked on, too. For example, some people want to store 100 gigabytes
> >in a repository.
> 
> I take it that you have in mind a 100G set of files comprised entirely
> of big-files that cannot be logically separated into smaller submodules?

Not exactly. Two scenarios I'm thinking of are:

  1. You really have 100G of data in the current version that doesn't
     compress well (e.g., you are storing your music collection). You
     can't afford to store two copies on your laptop (because you have a
     fancy SSD, and 100G is expensive again).  You need the working tree
     version, but it's OK to stream the repo version of a blob from the
     network when you actually need it (mostly "checkout", assuming you
     have marked the file as "-diff"; see the gitattributes example
     after this list).

  2. You have a 100G repository, but only 10G in the most recent
     version (e.g., because you are doing game development and storing
     the media assets). You want your clones to be faster and take less
     space. You can do a shallow clone, but then you're never allowed to
     look at old history. Instead, it would be nice to clone all of the
     commits, trees, and small blobs, and then stream large blobs from
     the network as-needed (again, mostly "checkout").
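
As an aside, the "-diff" bit in case 1 is just a gitattributes entry; a
minimal example (the patterns here are only for illustration):

    # .gitattributes: treat these big binaries as opaque;
    # git will not try to generate text diffs for them
    *.flac -diff
    *.mp4 -diff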

> My understanding is that a main strategy for "big files" is to separate
> your big-files logically into their own submodule(s) to keep them from
> bogging down the not-big-file repo(s).

That helps people who want to work on the not-big parts by not forcing
them into the big parts (another solution would be partial clone, but
more on that in a minute). But it doesn't help people who actually want
to work on the big parts; they would still have to fetch the whole
big-parts repository.
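
For reference, that split is plain submodule usage; something like this
(URL and path invented for the example):

    # in the not-big repo, track the big assets as a submodule
    git submodule add git://example.com/project-assets.git assets
    git commit -m 'track big assets as a submodule'

    # only the people who care about the big parts run:
    git submodule update --init assets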

For splitting the big-parts people from the non-big-parts people, there
have been two suggestions: partial checkout (you have all the objects in
the repo, but only checkout some of them) and partial clone (you don't
have some of the objects in the repo). Partial checkout is a much easier
problem, as it is mostly about marking index entries as "do not bother
to check this out, and pretend that it is simply unmodified". Partial
clone is much harder, because it violates git's usual reachability
rules. During a fetch, a client will say "I have commit X", and the
server may then assume the client has all of the ancestors of X, plus
all of the trees and blobs referenced by X and its ancestors.
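
The "pretend it is unmodified" part maps naturally onto the index's
existing skip-worktree bit; e.g. (path invented):

    # mark the entry so git neither checks it out nor
    # compares it against the working tree
    git update-index --skip-worktree assets/intro.mov

    # and to undo it later
    git update-index --no-skip-worktree assets/intro.mov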

But if a client can say "yes, I have these objects, but I just don't
want to get them because it's expensive", then partial checkout is
sufficient. The non-big-parts people will clone, omitting the big
objects, and then do a partial checkout (to avoid fetching the objects
even once).

Note that some protocol extension is still needed for the client to tell
the server "don't bother including objects X, Y, and Z in the packfile;
I'll get them from my alternate big-object repo". That can either be a
list of objects, or it can simply be "don't bother with objects bigger
than N".

> >Because git is distributed, that means 100G in the repo database,
> >and 100G in the working directory, for a total of 200G.
> 
> I take it that you are implying that the 100G object-store size is due
> to the notion that binary files cannot-be/are-not compressed well?

In this case, yes. But you could easily tweak the numbers to be 100G and
150G. The point is that the data is stored twice, and even the
compressed version may be big.

> >People in this situation may want to be able to store part of the
> >repository database in a network-accessible location, trading some
> >of the convenience of being fully distributed for the space savings.
> >So another project could be designing a network-based alternate
> >object storage system.
> >
> I take it you are implying a local area network with users' git repos
> on workstations?

Not necessarily. Obviously if you are doing a lot of active work on the
big files, the faster your network, the better. But it could work at
internet scale, too, if you don't actually fetch the big files
frequently. So part of a scheme like this would be making sure we avoid
accessing big objects whenever we can; in practice, this is pretty easy,
as git already tries to avoid touching objects unnecessarily, because
doing so is expensive even on the local end.

You can also cache a certain number of fetched objects locally. Assuming
there is some locality in the objects you ask for (e.g., because you are
switching back and forth between two branches with "git checkout"), this
can help.
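
E.g., with such a cache the third command here would not need to touch
the network at all (hypothetical behavior, obviously):

    git checkout topic    # streams topic's big blobs from the network
    git checkout master   # streams master's versions
    git checkout topic    # served from the local cache this time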

> Some setups login to a linux server and have all their repos there.
> The "alternate objects" does not need to network-based in that case.
> It is "local", but local does not mean 20 people cloning the
> alternate objects to their workstations.  It means one copy of
> alternate objects, and twenty repos referencing that one copy.

Right. This is the same concept, except over the network. So people's
working repositories are on their own workstations instead of a central
server. You could even do it today by network-mounting a filesystem and
pointing your alternates file at it (there is a sketch of that after
this list). However, I think it's worth making git aware that the
objects are on the network for a few reasons:

  1. Git can be more careful about how it handles the objects, including
     when to fetch, when to stream, and when to cache. For example,
     you'd want to fetch the manifest of objects and cache it in your
     local repository, because you want fast lookups of "do I have this
     object".

  2. Providing remote filesystems on an Internet scale is a management
     pain (and it's a pain for the user, too). My thought was that this
     would be implemented on top of http (the connection setup cost is
     negligible, since these objects would generally be large).

  3. Usually alternate repositories are full repositories that meet the
     connectivity requirements (so you could run "git fsck" in them).
     But this is explicitly about taking just a few disconnected large
     blobs out of the repository and putting them elsewhere. So it needs
     a new set of tools for managing the upstream repository.
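
For completeness, the "do it today" workaround mentioned before the list
is just a network mount plus an alternates entry; e.g. (server name and
paths invented):

    # mount the shared object store
    mount -t nfs bigserver:/srv/git/objects /mnt/big-objects

    # tell the local repo to look there for objects it lacks
    echo /mnt/big-objects >> .git/objects/info/alternates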

-Peff
