* large(25G) repository in git
@ 2009-03-23 21:10 Adam Heath
  2009-03-24  1:19 ` Nicolas Pitre
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Adam Heath @ 2009-03-23 21:10 UTC (permalink / raw)
  To: git

We maintain a website in git.  This website has a bunch of backend
server code and a bunch of data files.  A lot of these files are full
videos.

We use git so that the distributed nature of website development can
be supported.  Quite often, you'll have a production server, with
online changes occurring (we support in-browser editing of content), a
preview server, where large-scale code changes can be previewed, then
a development server, one per programmer (or more).

Last Friday, I was doing a checkin on the production server, and found
1.6G of new files.  git was quite capable of committing that.  However,
pushing was problematic.  I was pushing over ssh; so, a new ssh
connection was opened to the preview server.  After doing so, git tried
to create a new pack file.  This took *ages*, and the ssh connection
died.  So did git, when it finally got done with the new pack and
discovered the ssh connection was gone.

So, to work around that, I ran git gc.  When done, I discovered that
git repacked the *entire* repository.  While not something I care for,
I can understand that, and live with it.  It just took *hours* to do so.

Then, what really annoys me, is that when I finally did the push, it
tried sending the single 27G pack file, when the remote already had
25G of the repository in several different packs (the site was an
hg->git conversion).  This part is just unacceptable.

So, here are my questions/observations:

1: Handle the case of the ssh connection dying during git push (seems
simple).

2: Is there an option to tell git to *not* be so thorough when trying
to find similar files?  Videos/docs/PDFs/etc. aren't always very
deltafiable, so I'd be happy to just do full content compares.

3: Delta packs seem to be poorly done.  It seems that if one repo gets
repacked completely, the entire new pack gets sent, even when the
target already has most of the objects.

4: Are there any config options I can set to help in this?  There are
tons of options, and some documentation as to what each one does, but
no recommended-practices doc that describes what should be done
for different kinds of workflows.

ps: Thank you for your time.  I hope that someone has answers for me.

pps: I'm not subscribed, please cc me.  If I need to be subscribed,
I'll do so, if told.


* Re: large(25G) repository in git
  2009-03-23 21:10 large(25G) repository in git Adam Heath
@ 2009-03-24  1:19 ` Nicolas Pitre
  2009-03-24 17:59   ` Adam Heath
  2009-03-24  8:59 ` Andreas Ericsson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-03-24  1:19 UTC (permalink / raw)
  To: Adam Heath; +Cc: git

On Mon, 23 Mar 2009, Adam Heath wrote:

> Last friday, I was doing a checkin on the production server, and found
> 1.6G of new files.  git was quite able at committing that.  However,
> pushing was problematic.  I was pushing over ssh; so, a new ssh
> connection was open to the preview server.  After doing so, git tried
> to create a new pack file.  This took *ages*, and the ssh connection
> died.  So did git, when it finally got done with the new pack, and
> discovered the ssh connection was gone.

Strange.  You could instruct ssh to keep the connection up with the 
ServerAliveInterval option (see the ssh_config man page).
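
For example, a minimal sketch of such an entry in ~/.ssh/config (the
"preview" host alias is just a placeholder for your preview server):

Host preview
    # probe the server every 60 seconds so the connection stays active
    # while git is busy locally; allow up to 120 unanswered probes
    ServerAliveInterval 60
    ServerAliveCountMax 120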

> So, to work around that, I ran git gc.  When done, I discovered that
> git repacked the *entire* repository.  While not something I care for,
> I can understand that, and live with it.  It just took *hours* to do so.
> 
> Then, what really annoys me, is that when I finally did the push, it
> tried sending the single 27G pack file, when the remote already had
> 25G of the repository in several different packs(the site was an
> hg->git conversion).  This part is just unacceptable.

This shouldn't happen either.  When pushing, git reconstructs a pack
containing only the objects that need to be transmitted.  Are you sure
it was really trying to send a 27G pack?

> So, here are my questions/observations:
> 
> 1: Handle the case of the ssh connection dying during git push(seems
> simple).

See above.

> 2: Is there an option to tell git to *not* be so thorough when trying
> to find similiar files.  videos/doc/pdf/etc aren't always very
> deltafiable, so I'd be happy to just do full content compares.

Look at the gitattributes documentation.  One thing the doc appears 
to be missing is information about the "delta" attribute.  You can 
disable delta compression on a file pattern that way.
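
For instance, a rough sketch of what such entries could look like in
.gitattributes or .git/info/attributes (the patterns here are only
placeholders for your own media files):

# already-compressed media; don't waste time looking for deltas
*.mp4	-delta
*.avi	-delta
*.pdf	-delta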

> 3: delta packs seem to be poorly done.  it seems that if one repo gets
> repacked completely, that the entire new pack gets sent, when the
> target has most of the objects already.

This is not supposed to happen.  Please provide more details if you can.


Nicolas


* Re: large(25G) repository in git
  2009-03-23 21:10 large(25G) repository in git Adam Heath
  2009-03-24  1:19 ` Nicolas Pitre
@ 2009-03-24  8:59 ` Andreas Ericsson
  2009-03-24 22:35   ` Adam Heath
  2009-03-24 21:04 ` Sam Hocevar
  2009-03-26 15:43 ` Marcel M. Cary
  3 siblings, 1 reply; 16+ messages in thread
From: Andreas Ericsson @ 2009-03-24  8:59 UTC (permalink / raw)
  To: Adam Heath; +Cc: git

Adam Heath wrote:
> We maintain a website in git.  This website has a bunch of backend
> server code, and a bunch of data files.  Alot of these files are full
> videos.
> 

First of all, I'm going to hint that you would be far better off
keeping the media files in a separate repository, linked into git as a
submodule, with configuration settings tweaked specifically for
handling huge files.
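
(As a rough sketch of that layout -- the URL and the "media" path are
placeholders for your own setup:)

# in the website repository:
git submodule add ssh://server/path/to/media.git media
git commit -m "track media files as a submodule"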

The basis of such a repository is probably the following config
settings, since media files very rarely compress enough to be
worth the effort, and their own compressed formats make them
very unsuitable delta candidates:
[pack]
   # disable delta-based packing
   depth = 1
   # disable compression
   compression = 0

[gc]
   # don't auto-pack, ever
   auto = 0
   # never automatically consolidate un-.keep'd packs
   autopacklimit = 0

You will have to manually repack this repository from time to
time, and it's almost certainly a good idea to mark the
resulting packs with .keep to avoid copying tons of data.
When packs are being created, objects can be copied from
existing packs, and send-pack will make use of that so that what
goes over the wire will simply be copied from the existing packs.
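
A rough sketch of that manual repack step, run inside the media
repository (the loop simply marks every pack present afterwards as
kept):

git repack -a -d
for p in .git/objects/pack/pack-*.pack; do
    touch "${p%.pack}.keep"
done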

YMMV. If you do come up with settings that work fine for huge
repos made up of mostly media files, please share your findings.

> We use git, so that the distributed nature of website development can
> be supported.  Quite often, you'll have a production server, with
> online changes occurring(we support in-browser editting of content), a
> preview server, where large-scale code changes can be previewed, then
> a development server, one per programmer(or more).
> 
> Last friday, I was doing a checkin on the production server, and found
> 1.6G of new files.  git was quite able at committing that.  However,
> pushing was problematic.  I was pushing over ssh; so, a new ssh
> connection was open to the preview server.  After doing so, git tried
> to create a new pack file.  This took *ages*, and the ssh connection
> died.  So did git, when it finally got done with the new pack, and
> discovered the ssh connection was gone.
> 
> So, to work around that, I ran git gc.  When done, I discovered that
> git repacked the *entire* repository.  While not something I care for,
> I can understand that, and live with it.  It just took *hours* to do so.
> 

I'm not sure what, if any, magic "git gc" applies before spawning
"git repack", but running "git repack" directly would almost certainly
have produced an incremental pack. Perhaps we need to make gc less
magic.

> Then, what really annoys me, is that when I finally did the push, it
> tried sending the single 27G pack file, when the remote already had
> 25G of the repository in several different packs(the site was an
> hg->git conversion).  This part is just unacceptable.
> 

Agreed. I've never run across that problem, so I can only assume it
has something to do with many huge files being in the pack.

> So, here are my questions/observations:
> 
> 1: Handle the case of the ssh connection dying during git push(seems
> simple).
> 

Not necessarily all that simple (we do not want to touch the ssh
password if we can possibly avoid it, but the user shouldn't have
to type it more than once), but certainly doable. Easier would
probably be to recommend adding the proper SSH config variables,
as has been stated elsewhere.

> 2: Is there an option to tell git to *not* be so thorough when trying
> to find similiar files.  videos/doc/pdf/etc aren't always very
> deltafiable, so I'd be happy to just do full content compares.
> 

See above. I *think* you can also do this with git-attributes, but
I'm not sure. However, keeping the large media files in a sub-module
would nicely solve that problem anyway, and is probably a good idea
even with git-attributes support for pack delta- and compression
settings.

> 3: delta packs seem to be poorly done.  it seems that if one repo gets
> repacked completely, that the entire new pack gets sent, when the
> target has most of the objects already.
> 

This is certainly not the case for most repositories. I believe there's
something being triggered from repositories with many huge files though.

> 4: Are there any config options I can set to help in this?  There are
> tons of options, and some documentation as to what each one does, but
> no recommended practices type doc, that describes what should be done
> for different kinds of workflows.
> 

http://www.thousandparsec.net/~tim/media+git.pdf probably holds all the
relevant information when it comes to storing large media files with
git. I have not checked and have no inclination to do so.

> ps: Thank you for your time.  I hope that someone has answers for me.
> 

Answers aplenty, I hope. I have neither time nor interest in developing
this though, so the task of creating patches and/or documentation will
have to fall to someone else.

> pps: I'm not subscribed, please cc me.  If I need to be subscribed,
> I'll do so, if told.

Subscribing won't be necessary. The custom on git@vger is to always Cc
all who participate in the discussion, and only cull those who state
they're no longer interested in the topic.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.


* Re: large(25G) repository in git
  2009-03-24  1:19 ` Nicolas Pitre
@ 2009-03-24 17:59   ` Adam Heath
  2009-03-24 18:31     ` Nicolas Pitre
  2009-03-24 18:33     ` david
  0 siblings, 2 replies; 16+ messages in thread
From: Adam Heath @ 2009-03-24 17:59 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre wrote:

> Strange.  You could instruct ssh to keep the connection up with the 
> ServerAliveInterval option (see the ssh_config man page).

Sure, could do that.  Already have a separate ssh config entry for
this host.  But why should a connection be kept open for that long?
Why not close and re-open?

Consider the case of other protocol access.  http/git/ssh.  Should
they *all* be changed to allow for this?  Wouldn't it be simpler to
just make git smarter?

>> So, to work around that, I ran git gc.  When done, I discovered that
>> git repacked the *entire* repository.  While not something I care for,
>> I can understand that, and live with it.  It just took *hours* to do so.
>>
>> Then, what really annoys me, is that when I finally did the push, it
>> tried sending the single 27G pack file, when the remote already had
>> 25G of the repository in several different packs(the site was an
>> hg->git conversion).  This part is just unacceptable.
> 
> This shouldn't happen either.  When pushing, git reconstruct a pack with 
> only the necessary objects to transmit.  Are you sure it was really 
> trying to send a 27G pack?

Of course I'm sure.  I wouldn't have sent the email if it didn't
happen.  And, I have the bandwidthd graph and lost time to prove it.

After I ran git push, ssh timed out, the temp pack that was created
was then removed, as git complained about the connection being gone.

I then decided to do a 'git gc', which collapsed all the separate
packs into one.  This allowed git push to proceed quickly, but at that
point, it started sending the entire pack.

It's entirely possible that the temp pack created by git push was
incremental; it just took too long to create it, so it got aborted.

But, doing git gc shouldn't cause things to be resent.

The machines in question have done pushes before, even small ones;
just the set of objects that are newer.  It's just that this time, when
the 1.6G of new data was added, git ended up creating a new pack file
that contained the entire repo, and then tried sending that.

I forgot to mention previously, that the source machine was running
git 1.5.6.5, and was pushing to 1.5.6.3.

I've tried duplicating this problem on a machine with 1.6.1.3, but
either I don't fully understand the issue enough to replicate it, or
the newer git doesn't have the problem.

>> 2: Is there an option to tell git to *not* be so thorough when trying
>> to find similiar files.  videos/doc/pdf/etc aren't always very
>> deltafiable, so I'd be happy to just do full content compares.
> 
> Look at the gitattribute documentation.  One thing that the doc appears 
> to be missing is information about the "delta" attribute.  You can 
> disable delta compression on a file pattern that way.

Um, if it's missing documentation, then how am I supposed to know
about it?  Google does give me info, though.  Thanks for the pointer.

> 
>> 3: delta packs seem to be poorly done.  it seems that if one repo gets
>> repacked completely, that the entire new pack gets sent, when the
>> target has most of the objects already.
> 
> This is not supposed to happen.  Please provide more details if you can.

Well, I haven't been able to replicate it with a script.  I might have
to actually clone this huge repo, do history removal, and reapply the
changes, just to see if I can get it to fail.  But that will take time.


* Re: large(25G) repository in git
  2009-03-24 17:59   ` Adam Heath
@ 2009-03-24 18:31     ` Nicolas Pitre
  2009-03-24 20:55       ` Adam Heath
  2009-03-24 18:33     ` david
  1 sibling, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-03-24 18:31 UTC (permalink / raw)
  To: Adam Heath; +Cc: git

On Tue, 24 Mar 2009, Adam Heath wrote:

> Nicolas Pitre wrote:
> 
> > Strange.  You could instruct ssh to keep the connection up with the 
> > ServerAliveInterval option (see the ssh_config man page).
> 
> Sure, could do that.  Already have a separate ssh config entry for
> this host.  But why should a connection be kept open for that long?
> Why not close and re-open?

Because it is way more complex for git to do that than for ssh to keep 
the connection alive.  And normally there is no need as git is supposed 
to be faster than that.

> Consider the case of other protocol access.  http/git/ssh.  Should
> they *all* be changed to allow for this?  Wouldn't it be simpler to
> just make git smarter?

Making git faster is the solution, not working around the issue.

> >> So, to work around that, I ran git gc.  When done, I discovered that
> >> git repacked the *entire* repository.  While not something I care for,
> >> I can understand that, and live with it.  It just took *hours* to do so.
> >>
> >> Then, what really annoys me, is that when I finally did the push, it
> >> tried sending the single 27G pack file, when the remote already had
> >> 25G of the repository in several different packs(the site was an
> >> hg->git conversion).  This part is just unacceptable.
> > 
> > This shouldn't happen either.  When pushing, git reconstruct a pack with 
> > only the necessary objects to transmit.  Are you sure it was really 
> > trying to send a 27G pack?
> 
> Of course I'm sure.  I wouldn't have sent the email if it didn't
> happen.  And, I have the bandwidthd graph and lost time to prove it.

As much as I would like to believe you, this doesn't help fix the 
problem if you don't provide more information about it.  For example, 
the output from git during the whole operation might give us the 
beginning of a clue.  Otherwise, all I can tell you is that such a thing 
is not supposed to happen.

> After I ran git push, ssh timed out, the temp pack that was created
> was then removed, as git complained about the connection being gone.

On a push, there is no creation of a temp pack.  It is always produced 
on the fly and pushed straight via the ssh connection.

> I then decided to do a 'git gc', which collapsed all the separate
> packs into one.  This allowed git push to proceed quickly, but at that
> point, it started sending the entire pack.

If this was really the case, then this is definitely a bug.  Please take 
a snapshot of your screen with git messages if this ever happens again.

> It's entirely possible that the temp pack created by git push was
> incremental; it just took too long to create it, so it got aborted.

The push operation has multiple phases.  You should see "counting 
objects", "compressing objects" and "writing objects".  Could you give 
us an approximation of how long each of those phases took?

> But, doing git gc shouldn't cause things to be resent.

Indeed.

> The machines in question have done push before.  Even small amounts;
> just the set of objects that are newer.  It's just this time, when the
> 1.6G of new data was added, git ended up creating a new pack file,
> that contained the entire repo, and then tried sending that.

And this is wrong.

> I forgot to mention previously, that the source machine was running
> git 1.5.6.5, and was pushing to 1.5.6.3.
> 
> I've tried duplicating this problem on a machine with 1.6.1.3, but
> either I don't fully understand the issue enough to replicate it, or
> the newer git doesn't have the problem.

That's possible.  Maybe others on the list might recall possible issues 
related to this that might have been fixed during that time.

> >> 2: Is there an option to tell git to *not* be so thorough when trying
> >> to find similiar files.  videos/doc/pdf/etc aren't always very
> >> deltafiable, so I'd be happy to just do full content compares.
> > 
> > Look at the gitattribute documentation.  One thing that the doc appears 
> > to be missing is information about the "delta" attribute.  You can 
> > disable delta compression on a file pattern that way.
> 
> Um, if it's missing documentation, then how am I supposed to know
> about it?

Asking on the list, like you did.  However, this attribute should of 
course be documented as well.  I even think that someone posted a patch 
for it a while ago which might have been dropped.


Nicolas


* Re: large(25G) repository in git
  2009-03-24 17:59   ` Adam Heath
  2009-03-24 18:31     ` Nicolas Pitre
@ 2009-03-24 18:33     ` david
  1 sibling, 0 replies; 16+ messages in thread
From: david @ 2009-03-24 18:33 UTC (permalink / raw)
  To: Adam Heath; +Cc: Nicolas Pitre, git

On Tue, 24 Mar 2009, Adam Heath wrote:

> Nicolas Pitre wrote:
>
>> Strange.  You could instruct ssh to keep the connection up with the
>> ServerAliveInterval option (see the ssh_config man page).
>
> Sure, could do that.  Already have a separate ssh config entry for
> this host.  But why should a connection be kept open for that long?
> Why not close and re-open?

What if the server you are connecting to is behind a load balancer?  How do 
you know that your new connection will go to the same server?  If the 
client never reconnects, how long should the server keep its resources 
tied up 'just in case'?  If something connects to the server, how does it 
know whether it's something reconnecting or connecting for the first time 
(or someone connecting with the intent of messing up someone else's fetch)?

Having the client reconnect to finish a single transaction starts getting 
_really_ ugly.

David Lang


* Re: large(25G) repository in git
  2009-03-24 18:31     ` Nicolas Pitre
@ 2009-03-24 20:55       ` Adam Heath
  2009-03-25  1:21         ` Nicolas Pitre
  0 siblings, 1 reply; 16+ messages in thread
From: Adam Heath @ 2009-03-24 20:55 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre wrote:
> Because it is way more complex for git to do that than for ssh to keep 
> the connection alive.  And normally there is no need as git is supposed 
> to be faster than that.

Sure, I'll buy that.

>>>> So, to work around that, I ran git gc.  When done, I discovered that
>>>> git repacked the *entire* repository.  While not something I care for,
>>>> I can understand that, and live with it.  It just took *hours* to do so.
>>>>
>>>> Then, what really annoys me, is that when I finally did the push, it
>>>> tried sending the single 27G pack file, when the remote already had
>>>> 25G of the repository in several different packs(the site was an
>>>> hg->git conversion).  This part is just unacceptable.
>>> This shouldn't happen either.  When pushing, git reconstruct a pack with 
>>> only the necessary objects to transmit.  Are you sure it was really 
>>> trying to send a 27G pack?
>> Of course I'm sure.  I wouldn't have sent the email if it didn't
>> happen.  And, I have the bandwidthd graph and lost time to prove it.
> 
> As much as I would like to believe you, this doesn't help fixing the 
> problem if you don't provide more information about this.  For example, 
> the output from git during the whole operation might give us the 
> beginning of a clue.  Otherwise, all I can tell you is that such thing 
> is not supposed to happen.

First off, you've put a bad tone on this.  It appears that you are
saying I'm mistaken, and it didn't send all that data.  "It can't
happen, so it didn't happen."  Believe me, if it hadn't resent all
this data, I wouldn't have even sent the email.

In any event, we got lucky.  I *do* have a log of the push side of
this problem.  I doubt it's enough to figure out the actual cause tho.

==
ofbiz@lnxwww10:/job/@anon-site@> git push bf-yum
Counting objects: 96637, done.

Compressing objects:   6% (2413/34478)   478)
Read from remote host @anon-site-dev@.brainfood.com: Connection reset
by peer
Compressing objects:  27% (9458/34478)

Compressing objects: 100% (34478/34478), done.
error: pack-objects died with strange error
error: failed to push some refs to 'ssh://bf-yum/@anon-site@'
ofbiz@lnxwww10:/job/@anon-site@>
ofbiz@lnxwww10:/job/@anon-site@>
ofbiz@lnxwww10:/job/@anon-site@>
ofbiz@lnxwww10:/job/@anon-site@> git push bf-yum
Counting objects: 96637, done.
Killed by signal 2.:   5% (1866/34478)

ofbiz@lnxwww10:/job/@anon-site@> git gc
Counting objects: 96637, done.
Compressing objects:  27% (9453/34478)

Compressing objects: 100% (34478/34478), done.
Writing objects: 100% (96637/96637), done.
Total 96637 (delta 48713), reused 88929 (delta 43905)
Removing duplicate objects: 100% (256/256), done.
ofbiz@lnxwww10:/job/@anon-site@>
ofbiz@lnxwww10:/job/@anon-site@>
ofbiz@lnxwww10:/job/@anon-site@> du .git -sc
26797788        .git
26797788        total
ofbiz@lnxwww10:/job/@anon-site@> git push bf-yum
Counting objects: 96637, done.
Compressing objects: 100% (29670/29670), done.
Writing objects: 100% (96637/96637), 25.49 GiB | 226 KiB/s, done.
Total 96637 (delta 48713), reused 96637 (delta 48713)
To ssh://bf-yum/@anon-site@
 * [new branch]      master -> lnxwww10
==
ofbiz@lnxwww10:/job/@anon-site@> ls .git/objects/pack/ -l
total 26762436
-r--r--r-- 1 ofbiz users     3452052 2009-03-21 23:11
pack-0d7b399006ae0a57ff3df07fdcaedbaeb7e63d0a.idx
-r--r--r-- 1 ofbiz users 27374508409 2009-03-21 23:11
pack-0d7b399006ae0a57ff3df07fdcaedbaeb7e63d0a.pack
==

I have a bf-yum remote defined that pushes to the remote branch; once
it gets there, I then do a merge on the target machine.

The 'killed by signal 2' is from when I hit Ctrl-C.

The second group was done from another window.  There's only a single
pack file now.

The @anon-site@ stuff is me removing client identifiers.  It's the
only editing I did to the screen log.

> 
>> After I ran git push, ssh timed out, the temp pack that was created
>> was then removed, as git complained about the connection being gone.
> 
> On a push, there is no creation of a temp pack.  It is always produced 
> on the fly and pushed straight via the ssh connection.

No.  I saw a temp file in strace.  It *was* created on the local disk,
and *not* sent on the fly.

>> I then decided to do a 'git gc', which collapsed all the separate
>> packs into one.  This allowed git push to proceed quickly, but at that
>> point, it started sending the entire pack.
> 
> If this was really the case, then this is definitely a bug.  Please take 
> a snapshot of your screen with git messages if this ever happens again.

See above.

> 
>> It's entirely possible that the temp pack created by git push was
>> incremental; it just took too long to create it, so it got aborted.
> 
> The push operation has multiple phases.  You should see "counting 
> objects", "compressing objects" and "writing objects".  Could you give 
> us an approximation of how long each of those phases took?

Well, counting was quick enough.  Compression took at *least* 2 hours,
might have been 4 or more.  This all started Friday evening.  I was
watching it a bit at the beginning, but then went out, and it died
after I got back to it.

>> I forgot to mention previously, that the source machine was running
>> git 1.5.6.5, and was pushing to 1.5.6.3.
>>
>> I've tried duplicating this problem on a machine with 1.6.1.3, but
>> either I don't fully understand the issue enough to replicate it, or
>> the newer git doesn't have the problem.
> 
> That's possible.  Maybe others on the list might recall possible issues 
> related to this that might have been fixed during that time.

Well, I looked at the release notes between all these versions.
Nothing stands out, but I'm aware that the changelog/release note
entry for some change doesn't always describe the actual bug that
caused the change to occur.

>> Um, if it's missing documentation, then how am I supposed to know
>> about it?
> 
> Asking on the list, like you did.  However this attribute should be 
> documented as well of course.  I even think that someone posted a patch 
> for it a while ago which might have been dropped.

What I'd like is a way to say that a certain pattern of files should only
be deduped, and not deltafied.  This would handle the case of exact
copies or renames, which would still be a win for us; but generally,
when a new video (or doc or pdf) is uploaded, it's a lot of work to try
to deltify it, for very little benefit.


* Re: large(25G) repository in git
  2009-03-23 21:10 large(25G) repository in git Adam Heath
  2009-03-24  1:19 ` Nicolas Pitre
  2009-03-24  8:59 ` Andreas Ericsson
@ 2009-03-24 21:04 ` Sam Hocevar
  2009-03-24 21:44   ` Adam Heath
  2009-03-26 15:43 ` Marcel M. Cary
  3 siblings, 1 reply; 16+ messages in thread
From: Sam Hocevar @ 2009-03-24 21:04 UTC (permalink / raw)
  To: Adam Heath; +Cc: git

On Mon, Mar 23, 2009, Adam Heath wrote:
> We maintain a website in git.  This website has a bunch of backend
> server code, and a bunch of data files.  Alot of these files are full
> videos.
> 
> [...]
> 
> Last friday, I was doing a checkin on the production server, and found
> 1.6G of new files.  git was quite able at committing that.  However,
> pushing was problematic.  I was pushing over ssh; so, a new ssh
> connection was open to the preview server.  After doing so, git tried
> to create a new pack file.  This took *ages*, and the ssh connection
> died.  So did git, when it finally got done with the new pack, and
> discovered the ssh connection was gone.

   As stated several times by Linus and others, Git was not designed
to handle large files. My stance on the issue is that before trying
to optimise operations so that they perform well on large files, too,
Git should usually avoid such operations, especially deltification.
One notable exception would be someone storing their mailbox in Git,
where deltification is a major space saver. But usually, these large
files are binary blobs that do not benefit from delta search (or even
compression).

   Since I also need to handle large files (80 GiB repository), I am
cleaning up some fixes I did, which can be seen in the git-bigfiles
project (http://caca.zoy.org/wiki/git-bigfiles). I have not yet tried
to change git-push (because I submit through git-p4), but I hope to
address it, too. As time goes on, I believe some of them could make it
into mainstream Git.

   In your particular case, I would suggest setting pack.packSizeLimit
to something lower. This would reduce the time spent generating a new
pack file if the problem were to happen again.
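
   Something along these lines, for instance (the 2g value is only an
example; pick whatever limit suits your setup):

git config pack.packSizeLimit 2g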

Regards,
-- 
Sam.


* Re: large(25G) repository in git
  2009-03-24 21:04 ` Sam Hocevar
@ 2009-03-24 21:44   ` Adam Heath
  2009-03-25  0:28     ` Nicolas Pitre
  0 siblings, 1 reply; 16+ messages in thread
From: Adam Heath @ 2009-03-24 21:44 UTC (permalink / raw)
  To: Sam Hocevar; +Cc: git

Sam Hocevar wrote:
>    As stated several times by Linus and others, Git was not designed
> to handle large files. My stance on the issue is that before trying
> to optimise operations so that they perform well on large files, too,
> Git should usually avoid such operations, especially deltification.
> One notable exception would be someone storing their mailbox in Git,
> where deltification is a major space saver. But usually, these large
> files are binary blobs that do not benefit from delta search (or even
> compression).

Yeah, in this case, I *know* that my binary blobs are completely
different, and it's just a waste of time for git to come to the same
conclusion.  I'd be perfectly willing to have some knob I could turn
that would tell git this.

>    Since I also need to handle large files (80 GiB repository), I am
> cleaning up some fixes I did, which can be seen in the git-bigfiles
> project (http://caca.zoy.org/wiki/git-bigfiles). I have not yet tried
> to change git-push (because I submit through git-p4), but I hope to
> address it, too. As time goes I believe some of them could make it into
> mainstream Git.

I'd almost be willing to help.  I know the basic premise of how git
works, but the devil is in the details, and I don't have time right
now to learn the internals.

Yet another thing to add to my todo list.

>    In your particular case, I would suggest setting pack.packSizeLimit
> to something lower. This would reduce the time spent generating a new
> pack file if the problem were to happen again.

Yeah, saw that one, but *after* I had this problem.  The default, if
not set, is unlimited, which in this case is definitely *not* what we
want.


* Re: large(25G) repository in git
  2009-03-24  8:59 ` Andreas Ericsson
@ 2009-03-24 22:35   ` Adam Heath
  0 siblings, 0 replies; 16+ messages in thread
From: Adam Heath @ 2009-03-24 22:35 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: git

Andreas Ericsson wrote:
> First of all, I'm going to hint that you would be far better off
> keeping the media files in a separate repository, linked in as a
> submodule in git and with tweaked configuration settings with the
> specific aim of handling huge files.

Already do that.  We have a custom overlay/union-type filesystem that
makes use of a small base directory, where the code resides; each
sub-website is where the content lives.

It's just that finding documentation through Google that describes the
workflow we are using is difficult.

> The basis of such a repository is probably the following config
> settings, since media files very rarely compress enough to be
> worth the effort, and their own compressed formats make them
> very unsuitable delta candidates:
> [pack]
>   # disable delta-based packing
>   depth = 1
>   # disable compression
>   compression = 0
> 
> [gc]
>   # don't auto-pack, ever
>   auto = 0
>   # never automatically consolidate un-.keep'd packs
>   autopacklimit = 0

Thanks for the pointers!

> You will have to manually repack this repository from time to
> time, and it's almost certainly a good idea to mark the
> resulting packs with .keep to avoid copying tons of data.
> When packs are being created, objects can be copied from
> existing packs, and send-pack will make use of that so that what
> goes over the wire will simply be copied from the existing packs.
> 
> YMMV. If you do come up with settings that work fine for huge
> repos made up of mostly media files, please share your findings.

I'll use these as a basis.

>> So, to work around that, I ran git gc.  When done, I discovered that
>> git repacked the *entire* repository.  While not something I care for,
>> I can understand that, and live with it.  It just took *hours* to do so.
>>
> 
> I'm not sure what, if any, magic "git gc" applies before spawning
> "git repack", but running "git repack" directly would almost certainly
> have produced an incremental pack. Perhaps we need to make gc less
> magic.

The repo should only be collapsed into a single pack if the user
explicitly wants it.  Any automatic gc call, or one made without args,
should just take any loose objects and pack them up.  But that's my
opinion.

> Not necessarily all that simple (we do not want to touch the ssh
> password if we can possibly avoid it, but the user shouldn't have
> to type it more than once), but certainly doable. Easier would
> probably be to recommend adding the proper SSH config variables,
> as has been stated elsewhere.

ssh-agent, or password-less anonymous ssh (I've got a custom login
script inside authorized_keys on the remote).
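
(Purely for illustration, a forced-command entry in ~/.ssh/authorized_keys
looks roughly like this -- the wrapper path, key material, and comment
here are all made up:)

command="/usr/local/bin/site-git-login",no-port-forwarding,no-X11-forwarding ssh-rsa AAAAB3Nza... push-key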

> See above. I *think* you can also do this with git-attributes, but
> I'm not sure. However, keeping the large media files in a sub-module
> would nicely solve that problem anyway, and is probably a good idea
> even with git-attributes support for pack delta- and compression
> settings.

The site would *still* be > 25G in size, at the least, and constantly
getting bigger.  This site contains copies of ad videos from their
competitors, plus their own, and is used to market their international
company.

> http://www.thousandparsec.net/~tim/media+git.pdf probably holds all the
> relevant information when it comes to storing large media files with
> git. I have not checked and have no inclination to do so.

http://caca.zoy.org/wiki/git-bigfiles is another one.


* Re: large(25G) repository in git
  2009-03-24 21:44   ` Adam Heath
@ 2009-03-25  0:28     ` Nicolas Pitre
  2009-03-25  0:57       ` Adam Heath
  0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-03-25  0:28 UTC (permalink / raw)
  To: Adam Heath; +Cc: Sam Hocevar, git

On Tue, 24 Mar 2009, Adam Heath wrote:

> Sam Hocevar wrote:
> >    In your particular case, I would suggest setting pack.packSizeLimit
> > to something lower. This would reduce the time spent generating a new
> > pack file if the problem were to happen again.
> 
> Yeah, saw that one, but *after* I had this problem.  The default, if
> not set, is unlimited, which in this case, is definately *not* what we
> want.

In your particular case, if the problem is actually what I think it is, 
pack.packSizeLimit wouldn't have made any difference.  This setting 
affects local repacking only and has no effect whatsoever on the push 
operation.


Nicolas


* Re: large(25G) repository in git
  2009-03-25  0:28     ` Nicolas Pitre
@ 2009-03-25  0:57       ` Adam Heath
  2009-03-25  1:47         ` Nicolas Pitre
  0 siblings, 1 reply; 16+ messages in thread
From: Adam Heath @ 2009-03-25  0:57 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Sam Hocevar, git

Nicolas Pitre wrote:
> On Tue, 24 Mar 2009, Adam Heath wrote:
> 
>> Sam Hocevar wrote:
>>>    In your particular case, I would suggest setting pack.packSizeLimit
>>> to something lower. This would reduce the time spent generating a new
>>> pack file if the problem were to happen again.
>> Yeah, saw that one, but *after* I had this problem.  The default, if
>> not set, is unlimited, which in this case, is definately *not* what we
>> want.
> 
> In your particular case, if the problem is actually what I think it is, 
> the pack.packSizeLimit wouldn't have made any difference.  This setting 
> affects local repacking only and has no effect what so ever on the push 
> operation.

Ooh.  Care to enlighten those of us not blessed with git internal
knowledge?

On another note, anyone have a goat I can buy, for the sacrifice?


* Re: large(25G) repository in git
  2009-03-24 20:55       ` Adam Heath
@ 2009-03-25  1:21         ` Nicolas Pitre
  0 siblings, 0 replies; 16+ messages in thread
From: Nicolas Pitre @ 2009-03-25  1:21 UTC (permalink / raw)
  To: Adam Heath; +Cc: git

On Tue, 24 Mar 2009, Adam Heath wrote:

> Nicolas Pitre wrote:
> > As much as I would like to believe you, this doesn't help fixing the 
> > problem if you don't provide more information about this.  For example, 
> > the output from git during the whole operation might give us the 
> > beginning of a clue.  Otherwise, all I can tell you is that such thing 
> > is not supposed to happen.
> 
> First off, you've put a bad tone on this.  It appears that you are
> saying I'm mistaken, and it didn't send all that data.  "It can't
> happen, so it didn't happen."  Believe me, if it hadn't resent all
> this data, I wouldn't have even sent the email.

I don't know you.  All I had was the information you provided, which was 
rather incomplete.  So don't be offended if I ask for more.  I'm trying 
to help you, after all.

And especially in this case, the problem seems not to be about 
packing...

> In any event, we got lucky.  I *do* have a log of the push side of
> this problem.  I doubt it's enough to figure out the actual cause tho.

Well, I think it might.

> ==
> Counting objects: 96637, done.
> Compressing objects: 100% (29670/29670), done.
> Writing objects: 100% (96637/96637), 25.49 GiB | 226 KiB/s, done.
> Total 96637 (delta 48713), reused 96637 (delta 48713)
> To ssh://bf-yum/@anon-site@
>  * [new branch]      master -> lnxwww10

Was that branch really new on the remote side?  If not, then this is 
highly suspicious.  If somehow the previously aborted push attempt 
screwed up the remote refs, then the local client would think that the 
remote is empty and conclude that all commits have to be pushed.

> >> After I ran git push, ssh timed out, the temp pack that was created
> >> was then removed, as git complained about the connection being gone.
> > 
> > On a push, there is no creation of a temp pack.  It is always produced 
> > on the fly and pushed straight via the ssh connection.
> 
> No.  I saw a temp file in strace.  It *was* created on the local disk,
> and *not* sent on the fly.

A temp pack is created on the receiving side, not the sending side, 
though.  The sending side pipes the pack data to its standard output, 
which is connected to ssh's standard input.
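
(Roughly the moral equivalent of the following sketch, with the refs as
placeholders; the real code talks to the remote's receive-pack over that
same connection instead of shelling out like this:)

# stream a pack of everything in master that origin/master lacks to
# stdout and just measure its size -- no temporary pack file is written
printf 'master\n--not\norigin/master\n' | git pack-objects --revs --stdout | wc -c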

> >> Um, if it's missing documentation, then how am I supposed to know
> >> about it?
> > 
> > Asking on the list, like you did.  However this attribute should be 
> > documented as well of course.  I even think that someone posted a patch 
> > for it a while ago which might have been dropped.
> 
> What I'd like, is a way to say a certain pattern of files should only
> be deduped, and not deltafied.  This would handle the case of exact
> copies, or renames, which would still be a win for us, but generally
> when a new video(or doc or pdf) is uploaded, it's alot of work to try
> and deltafy, for very little benefit.

Renamed/duplicated files are always stored uniquely by design.  Git 
stores file data in objects which are named after the SHA1 of their 
content.

In order not to attempt any delta on PDF files, for example, you need to 
add a negative delta attribute line such as:

*.pdf	-delta

either in a file called .gitattributes, which gets versioned 
and distributed, or in .git/info/attributes, in which case it'll remain 
local.  Any file matching *.pdf won't be delta compressed.
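
(To check that the attribute is being picked up, something like this
should do -- the file name is only an example:)

$ git check-attr delta -- somefile.pdf
somefile.pdf: delta: unset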


Nicolas


* Re: large(25G) repository in git
  2009-03-25  0:57       ` Adam Heath
@ 2009-03-25  1:47         ` Nicolas Pitre
  0 siblings, 0 replies; 16+ messages in thread
From: Nicolas Pitre @ 2009-03-25  1:47 UTC (permalink / raw)
  To: Adam Heath; +Cc: Sam Hocevar, git

On Tue, 24 Mar 2009, Adam Heath wrote:

> Nicolas Pitre wrote:
> > On Tue, 24 Mar 2009, Adam Heath wrote:
> > 
> >> Sam Hocevar wrote:
> >>>    In your particular case, I would suggest setting pack.packSizeLimit
> >>> to something lower. This would reduce the time spent generating a new
> >>> pack file if the problem were to happen again.
> >> Yeah, saw that one, but *after* I had this problem.  The default, if
> >> not set, is unlimited, which in this case, is definately *not* what we
> >> want.
> > 
> > In your particular case, if the problem is actually what I think it is, 
> > the pack.packSizeLimit wouldn't have made any difference.  This setting 
> > affects local repacking only and has no effect what so ever on the push 
> > operation.
> 
> Ooh.  Care to enlighten those of us not blessed with git internal
> knowledge?

See my previous email for a likely explanation about your issue.

As to the pack.packSizeLimit setting: it is used when repacking only in 
order to avoid big packs on systems that might have issues dealing with 
large files.  During a repack, if the currently produced pack is about 
to get over that limit, then the pack is closed and a new one is 
started.  You therefore end up with many packs.

The transfer protocol used during a fetch or a push uses the pack format 
streamed over the network, but only one pack can be transferred that 
way.  Maybe the reception of a pack during a network transfer should be 
split according to pack.packSizeLimit as well, but this is currently not 
implemented at all.  No one complained about that either, so I'm guessing 
that splitting a large pack, if needed, by using 'git repack' after a 
clone/fetch is good enough.
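
(For example, a sketch of splitting an already-fetched big pack locally,
with 2g as an arbitrary limit:)

git config pack.packSizeLimit 2g
git repack -a -d   # rewrites the one big pack as several packs under the limit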

Personally, I don't think actively splitting packs into smaller ones is 
that useful, unless you wish to archive them on a file system which 
cannot handle files larger than 2GB or the like.

> On another note, anyone have a goat I can buy, for the sacrifice?

Beware the wrath of Git...


Nicolas


* Re: large(25G) repository in git
  2009-03-23 21:10 large(25G) repository in git Adam Heath
                   ` (2 preceding siblings ...)
  2009-03-24 21:04 ` Sam Hocevar
@ 2009-03-26 15:43 ` Marcel M. Cary
  2009-03-26 16:35   ` Adam Heath
  3 siblings, 1 reply; 16+ messages in thread
From: Marcel M. Cary @ 2009-03-26 15:43 UTC (permalink / raw)
  To: Adam Heath; +Cc: git

Adam Heath wrote:
> We maintain a website in git.  This website has a bunch of backend
> server code, and a bunch of data files.  Alot of these files are full
> videos.
>
> We use git, so that the distributed nature of website development can
> be supported.  Quite often, you'll have a production server, with
> online changes occurring(we support in-browser editting of content), a
> preview server, where large-scale code changes can be previewed, then
> a development server, one per programmer(or more).

My company manages code in a similar way, except we avoid this kind of
issue (with 100 gigabytes of user-uploaded images and other data) by not
checking in the data.  We even went so far as to halve the size of
our repository by removing 2GB of non-user-supplied images -- rounded
corners, background gradients, logos, etc.  This made Git
noticeably faster.

While I'd love to be able to handle your kind of use case and data size
with Git in that way, it's a little beyond the intended usage to handle
hundreds of gigabytes of binary data, I think.

I imagine as your web site grows, which I'm assuming is your goal, your
problems with scaling Git will continue to be a challenge.

Maybe you can find a way to:

* Get along with less data in your non-production environments; we're
hoping to be able to do this eventually

* Find other ways to copy it; we use rsync even though it does take
forever to crawl over the file system

* Put your data files in a separate Git repository, at least, assuming
you check in, update, and release code more often than your video files.
That way you'll experience pain less often, and maybe even be able to
tune your repository differently.

Marcel


* Re: large(25G) repository in git
  2009-03-26 15:43 ` Marcel M. Cary
@ 2009-03-26 16:35   ` Adam Heath
  0 siblings, 0 replies; 16+ messages in thread
From: Adam Heath @ 2009-03-26 16:35 UTC (permalink / raw)
  To: Marcel M. Cary; +Cc: git

Marcel M. Cary wrote:
> My company manages code in a similar way, except we avoid this kind of
> issue (with 100 gigabytes of user-uploaded images and other data) by not
> checking in the data.  We even went so far is as to halve the size of
> our repository by removing 2GB of non-user-supplied images -- rounded
> corners, background gradients, logos, etc, etc.  This made Git
> noticeably faster.

Disk space is cheap.

> While I'd love to be able to handle your kind of use case and data size
> with Git in that way, it's a little beyond the intended usage to handle
> hundreds of gigabytes of binary data, I think.
> 
> I imagine as your web site grows, which I'm assuming is your goal, your
> problems with scaling Git will continue to be a challenge.
> 
> Maybe you can find a way to:
> 
> * Get along with less data in your non-production environments; we're
> hoping to be able to do this eventually

We do that by only cloning/checking out certain modules.

However, as is always the case, sometimes a bug occurs with production
data, and you need to use the real data to track it down.

> * Find other ways to copy it; we use rsync even though it does take
> forever to crawl over the file system
> 
> * Put your data files in a separate Git repository, at least, assuming
> your checkin, update, and release code more often than your video files.
>  That way you'll experience pain less often, and maybe even be able to
> tune your repository differently.

As already mentioned, our sub-sites *are* in separate repos.  There's
a base repository that has just the event/backend code, then 32
*other* repositories where the actual websites live.

We want to use *some* kind of versioning system.  Being able to have
a history of *all* changes is extremely useful, not to mention being
able to track what each separate user does as they modify their files
through their browser.

Subversion is just right out.  It's centralized.  It leaves poop all
over the place.

Mercurial is just right out.  If you do several *separate* commits of
*separate* files, but don't push for some time period, then eventually
do a push/pull where the sum total of the changes is larger than some
value, Mercurial will fail when it then tries to update the local
directory.  The limit is 2G, a hard-coded Python limit (even
on a 64-bit host), because Mercurial reads the entire set of changes
into a Python string.

Git mmaps files and does window scanning of the pack files.  It *might*
read a single file entirely into memory for compression purposes; I'm not
certain about this.  We certainly haven't hit any limits that cause it to
fail outright.

I haven't tried any others.

