Re: Avery Pennarun's git-subtree?

From: Avery Pennarun <apenwarr@gmail.com>
To: Jakub Narebski <jnareb@gmail.com>
Cc: "Marc Branchaud" <marcnarc@xiplink.com>,
	"Jens Lehmann" <Jens.Lehmann@web.de>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Bryan Larsen" <bryan.larsen@gmail.com>,
	git <git@vger.kernel.org>, "Junio C Hamano" <gitster@pobox.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>
Subject: Re: Avery Pennarun's git-subtree?
Date: Tue, 27 Jul 2010 15:15:08 -0400
Message-ID: <AANLkTi=6SDQ2A0Zxf8DiSSNzSfUS43M7wmCkKKraOd8w@mail.gmail.com>
In-Reply-To: <201007261051.41663.jnareb@gmail.com>

On Mon, Jul 26, 2010 at 4:51 AM, Jakub Narebski <jnareb@gmail.com> wrote:
> On Sat, 24 Jul 2010 00:50, Avery Pennarun wrote:
>> My bup project (http://github.com/apenwarr/bup) is all about huge
>> repositories.  It handles repositories with hundreds of gigabytes, and
>> trees containing millions of files (entire filesystems), quite nicely.
>>  Of course, it's not a version control system, so it won't solve your
>> problems.  It's just evidence that large repositories are actually
>> quite manageable without changing the fundamentals of git.
>
> There is also the git-bigfiles project, although it is more about large
> [binary] files than large repositories per se (many files, long history).

Right.  git-bigfiles is valuable, but it's valuable with or without
submodules.  (If you have large blobs, submodules won't save you.)

bup happens to have its own way of dealing with large files too, but
it may not be applicable to git.  It does result in lots and lots of
smaller objects, though, which is why I know git repositories are
fundamentally capable of handling lots and lots of smaller objects :)

> Note that with 'bup' you might not see problems with large repositories
> because it does not examine code paths that are slow in large repositories
> (gc, log, path-delimited log).

gc is a huge problem.  bup avoids it entirely (it forgoes delta
compression); git gc fails completely on such large repositories (100+
GB).  There's no reason this has to be true forever, but yes, to
support really big repos, git gc would need to be improved somewhat.
For most reasonably sane repos (a few GB) you can get reasonable
performance by just giving your biggest packfiles a .keep file so they
don't keep getting repacked all the time.
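
A rough sketch of the .keep trick, assuming Python 3 and a normal
repository layout.  The function name keep_big_packs and the 1 GB
default threshold are made up for illustration; the only git behaviour
relied on is that repack leaves pack-<id>.pack alone when a
pack-<id>.keep file sits next to it.

```python
from pathlib import Path


def keep_big_packs(git_dir, min_bytes=1 << 30):
    """Create a .keep file next to every packfile of at least min_bytes.

    git repack skips a pack when pack-<id>.keep exists beside
    pack-<id>.pack.  Returns the list of .keep files created.
    """
    pack_dir = Path(git_dir) / "objects" / "pack"
    created = []
    for pack in sorted(pack_dir.glob("pack-*.pack")):
        if pack.stat().st_size < min_bytes:
            continue  # small packs stay eligible for repacking
        keep = pack.with_suffix(".keep")
        if not keep.exists():
            keep.write_text("large pack, skip on repack\n")
            created.append(keep)
    return created
```

Calling keep_big_packs(".git") from the top of a work tree would mark
every pack of 1 GB or more; deleting the .keep files undoes it.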

Compared to that, log feels like not a problem at all :)  At least
performance-wise.  The thing that sucks about log using git-subtree,
of course, is that you get all these log messages from multiple
projects jammed together into a single repo, which is rarely what you
want, even if it's fast.  I think the "best" solution is a single repo
with all your objects, but still keeping the histories of each
submodule separate.

>> IMHO, the correct answer here is to have an inotify-based daemon prod
>> at the .git/index automatically when files get updated, so that git
>> itself doesn't have to stat/readdir through the entire tree in order
>> to do any of its operations.  (Windows also has something like inotify
>> that would work.)  If you had this, then git
>> status/diff/checkout/commit would be just as fast with zillions of
>> files as with 10 files.  Sooner or later, if nobody implements this, I
>> promise I'll get around to it since inotify is actually easy to code
>> for :)
>
> IIUC the problem is that inotify is not automatically recursive, so
> the daemon would have to take care of adding an inotify watch to each
> newly created subdirectory.

Yeah, the inotify API is kind of gross that way.  But it can be done,
and people do it (e.g. the beagle project).
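
The bookkeeping involved looks roughly like the sketch below.  It is
not a real daemon: the class name RecursiveWatcher is invented, and
add_watch is a stand-in for inotify_add_watch(2) that the caller must
supply (Python's standard library has no inotify binding).  The mask
constants are the ones from <sys/inotify.h>.

```python
import os

# Mask bits from <sys/inotify.h>.
IN_MOVED_TO = 0x00000080
IN_CREATE = 0x00000100
IN_ISDIR = 0x40000000


class RecursiveWatcher:
    """Keep one watch per directory, since inotify does not recurse.

    add_watch stands in for inotify_add_watch(2): it takes a directory
    path and returns a watch descriptor (an int).
    """

    def __init__(self, root, add_watch):
        self._add_watch = add_watch
        self.paths = {}  # watch descriptor -> directory path
        self._watch_tree(root)

    def _watch_tree(self, top):
        # Watch top and every directory already below it.  A new
        # directory can gain children before its watch is in place
        # (think mkdir -p), so new directories are walked as well.
        for dirpath, _dirnames, _filenames in os.walk(top):
            self.paths[self._add_watch(dirpath)] = dirpath

    def handle(self, wd, mask, name):
        """Process one event and return the path it refers to."""
        path = os.path.join(self.paths[wd], name)
        if mask & IN_ISDIR and mask & (IN_CREATE | IN_MOVED_TO):
            self._watch_tree(path)
        return path
```

A real daemon would also have to rescan each newly watched directory
for files that appeared before the watch was added, and drop entries
when directories are deleted or moved away; both are left out here.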

>> Also note that the only reason submodules are faster here is that
>> they're ignoring possibly important changes.  Notably, when you do
>> 'git status' from the top level, it won't warn you if you have any
>> not-yet-committed files in any of your submodules.  Personally, I
>> consider that to be really important information, but to obtain it
>> would make 'git status' take just as long as without submodules, so
>> you wouldn't get any benefit.  (I think nowadays there's a way to get
>> this recursive status information if you want it, but it'll be slow of
>> course.)
>
> Errr... didn't it get improved in recent git?  I think git-status now
> includes information about submodules if configured so / unless configured
> otherwise.  Doesn't it?

Yes, but you're still left with the choice between slow (checks all
files in all submodules) and not slow (might miss stuff).  This isn't
a submodule question, really, it's an overall performance question
with huge checkouts with or without submodules.
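
To make the two choices concrete, here is a small sketch, assuming a
git recent enough to have 'git status --ignore-submodules=<when>'.
The helper names status_argv and run_status are made up.

```python
import subprocess


def status_argv(thorough):
    """Build a 'git status' command line for a superproject.

    thorough=True  -> look inside every submodule work tree (slow)
    thorough=False -> only compare the recorded submodule commits
                      (fast, but uncommitted submodule changes are
                      not reported)
    """
    when = "none" if thorough else "dirty"
    return ["git", "status", "--ignore-submodules=" + when]


def run_status(repo, thorough=False):
    # Run git inside repo and return whatever it prints.
    result = subprocess.run(status_argv(thorough), cwd=repo,
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Either way the top-level work tree itself still gets scanned, which is
the part that hurts with huge checkouts.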

>>> We chose git-submodule over git-subtree mainly because git-submodule lets us
>>> selectively checkout different parts of our code.  (AFAIK sparse checkouts
>>> aren't yet an option.)
>
> Sparse checkouts are here, IIRC, but they do not solve the problem of
> disk space (the files are still in the repository, even if not checked
> out), or of speed (they still need to be fetched, even if not checked out).

Hmm, don't mix up bandwidth usage (and thus the slowness of fetch) with
slowness during everyday usage.  I don't mind a slow fetch now and
then, but 'git status' should be fast. AFAIK, sparse checkouts
*should* make git status faster.  If they don't, it's probably just a
bug.

Have fun,

Avery