* large(25G) repository in git
@ 2009-03-23 21:10 Adam Heath
  2009-03-24  1:19 ` Nicolas Pitre
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Adam Heath @ 2009-03-23 21:10 UTC (permalink / raw)
  To: git

We maintain a website in git.  This website has a bunch of backend
server code and a bunch of data files.  A lot of these files are full
videos.

We use git so that the distributed nature of website development can
be supported.  Quite often, you'll have a production server, with
online changes occurring (we support in-browser editing of content), a
preview server, where large-scale code changes can be previewed, then
a development server, one per programmer (or more).

Last Friday, I was doing a checkin on the production server, and found
1.6G of new files.  git was quite capable of committing that.  However,
pushing was problematic.  I was pushing over ssh; so, a new ssh
connection was opened to the preview server.  After doing so, git tried
to create a new pack file.  This took *ages*, and the ssh connection
died.  So did git, when it finally got done with the new pack and
discovered the ssh connection was gone.

So, to work around that, I ran git gc.  When done, I discovered that
git repacked the *entire* repository.  While not something I care for,
I can understand that, and live with it.  It just took *hours* to do so.

Then, what really annoys me, is that when I finally did the push, it
tried sending the single 27G pack file, when the remote already had
25G of the repository in several different packs (the site was an
hg->git conversion).  This part is just unacceptable.

So, here are my questions/observations:

1: Handle the case of the ssh connection dying during git push (seems
simple).

2: Is there an option to tell git to *not* be so thorough when trying
to find similar files?  Videos/docs/PDFs/etc. aren't always very
deltafiable, so I'd be happy to just do full content compares.

3: Delta packs seem to be poorly done.  It seems that if one repo gets
repacked completely, the entire new pack gets sent, even when the
target already has most of the objects.

4: Are there any config options I can set to help in this?  There are
tons of options, and some documentation as to what each one does, but
no recommended-practices doc that describes what should be done
for different kinds of workflows.

ps: Thank you for your time.  I hope that someone has answers for me.

pps: I'm not subscribed, please cc me.  If I need to be subscribed,
I'll do so, if told.


* Re: large(25G) repository in git
  2009-03-23 21:10 large(25G) repository in git Adam Heath
@ 2009-03-24  1:19 ` Nicolas Pitre
  2009-03-24 17:59   ` Adam Heath
  2009-03-24  8:59 ` Andreas Ericsson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-03-24  1:19 UTC (permalink / raw)
  To: Adam Heath; +Cc: git

On Mon, 23 Mar 2009, Adam Heath wrote:

> Last friday, I was doing a checkin on the production server, and found
> 1.6G of new files.  git was quite able at committing that.  However,
> pushing was problematic.  I was pushing over ssh; so, a new ssh
> connection was open to the preview server.  After doing so, git tried
> to create a new pack file.  This took *ages*, and the ssh connection
> died.  So did git, when it finally got done with the new pack, and
> discovered the ssh connection was gone.

Strange.  You could instruct ssh to keep the connection up with the 
ServerAliveInterval option (see the ssh_config man page).
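
For example, a minimal sketch of such an entry in ~/.ssh/config (the
"preview" host alias is just a placeholder for your preview server):

Host preview
    # probe the server every 60 seconds so the connection stays active
    # while git is busy locally; allow up to 120 unanswered probes
    ServerAliveInterval 60
    ServerAliveCountMax 120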

> So, to work around that, I ran git gc.  When done, I discovered that
> git repacked the *entire* repository.  While not something I care for,
> I can understand that, and live with it.  It just took *hours* to do so.
> 
> Then, what really annoys me, is that when I finally did the push, it
> tried sending the single 27G pack file, when the remote already had
> 25G of the repository in several different packs(the site was an
> hg->git conversion).  This part is just unacceptable.

This shouldn't happen either.  When pushing, git reconstructs a pack
containing only the objects that need to be transmitted.  Are you sure
it was really trying to send a 27G pack?

> So, here are my questions/observations:
> 
> 1: Handle the case of the ssh connection dying during git push(seems
> simple).

See above.

> 2: Is there an option to tell git to *not* be so thorough when trying
> to find similiar files.  videos/doc/pdf/etc aren't always very
> deltafiable, so I'd be happy to just do full content compares.

Look at the gitattributes documentation.  One thing the doc appears 
to be missing is information about the "delta" attribute.  You can 
disable delta compression on a file pattern that way.
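
For instance, a rough sketch of what such entries could look like in
.gitattributes or .git/info/attributes (the patterns here are only
placeholders for your own media files):

# already-compressed media; don't waste time looking for deltas
*.mp4	-delta
*.avi	-delta
*.pdf	-delta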

> 3: delta packs seem to be poorly done.  it seems that if one repo gets
> repacked completely, that the entire new pack gets sent, when the
> target has most of the objects already.

This is not supposed to happen.  Please provide more details if you can.


Nicolas


* Re: large(25G) repository in git
  2009-03-23 21:10 large(25G) repository in git Adam Heath
  2009-03-24  1:19 ` Nicolas Pitre
@ 2009-03-24  8:59 ` Andreas Ericsson
  2009-03-24 22:35   ` Adam Heath
  2009-03-24 21:04 ` Sam Hocevar
  2009-03-26 15:43 ` Marcel M. Cary
  3 siblings, 1 reply; 16+ messages in thread
From: Andreas Ericsson @ 2009-03-24  8:59 UTC (permalink / raw)
  To: Adam Heath; +Cc: git

Adam Heath wrote:
> We maintain a website in git.  This website has a bunch of backend
> server code, and a bunch of data files.  Alot of these files are full
> videos.
> 

First of all, I'm going to hint that you would be far better off
keeping the media files in a separate repository, linked into git as a
submodule, with configuration settings tweaked specifically for
handling huge files.
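
(As a rough sketch of that layout -- the URL and the "media" path are
placeholders for your own setup:)

# in the website repository:
git submodule add ssh://server/path/to/media.git media
git commit -m "track media files as a submodule"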

The basis of such a repository is probably the following config
settings, since media files very rarely compress enough to be
worth the effort, and their own compressed formats make them
very unsuitable delta candidates:
[pack]
   # disable delta-based packing
   depth = 1
   # disable compression
   compression = 0

[gc]
   # don't auto-pack, ever
   auto = 0
   # never automatically consolidate un-.keep'd packs
   autopacklimit = 0

You will have to manually repack this repository from time to
time, and it's almost certainly a good idea to mark the
resulting packs with .keep to avoid copying tons of data.
When packs are being created, objects can be copied from
existing packs, and send-pack will make use of that so that what
goes over the wire will simply be copied from the existing packs.
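
A rough sketch of that manual repack step, run inside the media
repository (the loop simply marks every pack present afterwards as
kept):

git repack -a -d
for p in .git/objects/pack/pack-*.pack; do
    touch "${p%.pack}.keep"
done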

YMMV. If you do come up with settings that work fine for huge
repos made up of mostly media files, please share your findings.

> We use git, so that the distributed nature of website development can
> be supported.  Quite often, you'll have a production server, with
> online changes occurring(we support in-browser editting of content), a
> preview server, where large-scale code changes can be previewed, then
> a development server, one per programmer(or more).
> 
> Last friday, I was doing a checkin on the production server, and found
> 1.6G of new files.  git was quite able at committing that.  However,
> pushing was problematic.  I was pushing over ssh; so, a new ssh
> connection was open to the preview server.  After doing so, git tried
> to create a new pack file.  This took *ages*, and the ssh connection
> died.  So did git, when it finally got done with the new pack, and
> discovered the ssh connection was gone.
> 
> So, to work around that, I ran git gc.  When done, I discovered that
> git repacked the *entire* repository.  While not something I care for,
> I can understand that, and live with it.  It just took *hours* to do so.
> 

I'm not sure what, if any, magic "git gc" applies before spawning
"git repack", but running "git repack" directly would almost certainly
have produced an incremental pack. Perhaps we need to make gc less
magic.

> Then, what really annoys me, is that when I finally did the push, it
> tried sending the single 27G pack file, when the remote already had
> 25G of the repository in several different packs(the site was an
> hg->git conversion).  This part is just unacceptable.
> 

Agreed. I've never run across that problem, so I can only assume it
has something to do with many huge files being in the pack.

> So, here are my questions/observations:
> 
> 1: Handle the case of the ssh connection dying during git push(seems
> simple).
> 

Not necessarily all that simple (we do not want to touch the ssh
password if we can possibly avoid it, but the user shouldn't have
to type it more than once), but certainly doable. Easier would
probably be to recommend adding the proper SSH config variables,
as has been stated elsewhere.

> 2: Is there an option to tell git to *not* be so thorough when trying
> to find similiar files.  videos/doc/pdf/etc aren't always very
> deltafiable, so I'd be happy to just do full content compares.
> 

See above. I *think* you can also do this with git-attributes, but
I'm not sure. However, keeping the large media files in a sub-module
would nicely solve that problem anyway, and is probably a good idea
even with git-attributes support for pack delta- and compression
settings.

> 3: delta packs seem to be poorly done.  it seems that if one repo gets
> repacked completely, that the entire new pack gets sent, when the
> target has most of the objects already.
> 

This is certainly not the case for most repositories. I believe there's
something being triggered from repositories with many huge files though.

> 4: Are there any config options I can set to help in this?  There are
> tons of options, and some documentation as to what each one does, but
> no recommended practices type doc, that describes what should be done
> for different kinds of workflows.
> 

http://www.thousandparsec.net/~tim/media+git.pdf probably holds all the
relevant information when it comes to storing large media files with
git. I have not checked and have no inclination to do so.

> ps: Thank you for your time.  I hope that someone has answers for me.
> 

Answers aplenty, I hope. I have neither time nor interest in developing
this though, so the task of creating patches and/or documentation will
have to fall to someone else.

> pps: I'm not subscribed, please cc me.  If I need to be subscribed,
> I'll do so, if told.

Subscribing won't be necessary. The custom on git@vger is to always Cc
all who participate in the discussion, and only cull those who state
they're no longer interested in the topic.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.


* Re: large(25G) repository in git
  2009-03-24  1:19 ` Nicolas Pitre
@ 2009-03-24 17:59   ` Adam Heath
  2009-03-24 18:31     ` Nicolas Pitre
  2009-03-24 18:33     ` david
  0 siblings, 2 replies; 16+ messages in thread
From: Adam Heath @ 2009-03-24 17:59 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre wrote:

> Strange.  You could instruct ssh to keep the connection up with the 
> ServerAliveInterval option (see the ssh_config man page).

Sure, could do that.  Already have a separate ssh config entry for
this host.  But why should a connection be kept open for that long?
Why not close and re-open?

Consider the case of other protocol access.  http/git/ssh.  Should
they *all* be changed to allow for this?  Wouldn't it be simpler to
just make git smarter?

>> So, to work around that, I ran git gc.  When done, I discovered that
>> git repacked the *entire* repository.  While not something I care for,
>> I can understand that, and live with it.  It just took *hours* to do so.
>>
>> Then, what really annoys me, is that when I finally did the push, it
>> tried sending the single 27G pack file, when the remote already had
>> 25G of the repository in several different packs(the site was an
>> hg->git conversion).  This part is just unacceptable.
> 
> This shouldn't happen either.  When pushing, git reconstruct a pack with 
> only the necessary objects to transmit.  Are you sure it was really 
> trying to send a 27G pack?

Of course I'm sure.  I wouldn't have sent the email if it didn't
happen.  And, I have the bandwidthd graph and lost time to prove it.

After I ran git push, ssh timed out, the temp pack that was created
was then removed, as git complained about the connection being gone.

I then decided to do a 'git gc', which collapsed all the separate
packs into one.  This allowed git push to proceed quickly, but at that
point, it started sending the entire pack.

It's entirely possible that the temp pack created by git push was
incremental; it just took too long to create it, so it got aborted.

But, doing git gc shouldn't cause things to be resent.

The machines in question have done pushes before, even small ones;
just the set of objects that are newer.  It's just that this time, when
the 1.6G of new data was added, git ended up creating a new pack file
that contained the entire repo, and then tried sending that.

I forgot to mention previously, that the source machine was running
git 1.5.6.5, and was pushing to 1.5.6.3.

I've tried duplicating this problem on a machine with 1.6.1.3, but
either I don't fully understand the issue enough to replicate it, or
the newer git doesn't have the problem.

>> 2: Is there an option to tell git to *not* be so thorough when trying
>> to find similiar files.  videos/doc/pdf/etc aren't always very
>> deltafiable, so I'd be happy to just do full content compares.
> 
> Look at the gitattribute documentation.  One thing that the doc appears 
> to be missing is information about the "delta" attribute.  You can 
> disable delta compression on a file pattern that way.

Um, if it's missing documentation, then how am I supposed to know
about it?  Google does give me info, though.  Thanks for the pointer.

> 
>> 3: delta packs seem to be poorly done.  it seems that if one repo gets
>> repacked completely, that the entire new pack gets sent, when the
>> target has most of the objects already.
> 
> This is not supposed to happen.  Please provide more details if you can.

Well, I haven't been able to replicate it with a script.  I might have
to actually clone this huge repo, do history removal, and reapply the
changes, just to see if I can get it to fail.  But that will take time.


* Re: large(25G) repository in git
  2009-03-24 17:59   ` Adam Heath
@ 2009-03-24 18:31     ` Nicolas Pitre
  2009-03-24 20:55       ` Adam Heath
  2009-03-24 18:33     ` david
  1 sibling, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-03-24 18:31 UTC (permalink / raw)
  To: Adam Heath; +Cc: git

On Tue, 24 Mar 2009, Adam Heath wrote:

> Nicolas Pitre wrote:
> 
> > Strange.  You could instruct ssh to keep the connection up with the 
> > ServerAliveInterval option (see the ssh_config man page).
> 
> Sure, could do that.  Already have a separate ssh config entry for
> this host.  But why should a connection be kept open for that long?
> Why not close and re-open?

Because it is way more complex for git to do that than for ssh to keep 
the connection alive.  And normally there is no need as git is supposed 
to be faster than that.

> Consider the case of other protocol access.  http/git/ssh.  Should
> they *all* be changed to allow for this?  Wouldn't it be simpler to
> just make git smarter?

Making git faster is the solution, not working around the issue.

> >> So, to work around that, I ran git gc.  When done, I discovered that
> >> git repacked the *entire* repository.  While not something I care for,
> >> I can understand that, and live with it.  It just took *hours* to do so.
> >>
> >> Then, what really annoys me, is that when I finally did the push, it
> >> tried sending the single 27G pack file, when the remote already had
> >> 25G of the repository in several different packs(the site was an
> >> hg->git conversion).  This part is just unacceptable.
> > 
> > This shouldn't happen either.  When pushing, git reconstruct a pack with 
> > only the necessary objects to transmit.  Are you sure it was really 
> > trying to send a 27G pack?
> 
> Of course I'm sure.  I wouldn't have sent the email if it didn't
> happen.  And, I have the bandwidthd graph and lost time to prove it.

As much as I would like to believe you, this doesn't help fix the 
problem if you don't provide more information about it.  For example, 
the output from git during the whole operation might give us the 
beginning of a clue.  Otherwise, all I can tell you is that such a thing 
is not supposed to happen.

> After I ran git push, ssh timed out, the temp pack that was created
> was then removed, as git complained about the connection being gone.

On a push, there is no creation of a temp pack.  It is always produced 
on the fly and pushed straight via the ssh connection.

> I then decided to do a 'git gc', which collapsed all the separate
> packs into one.  This allowed git push to proceed quickly, but at that
> point, it started sending the entire pack.

If this was really the case, then this is definitely a bug.  Please take 
a snapshot of your screen with git messages if this ever happens again.

> It's entirely possible that the temp pack created by git push was
> incremental; it just took too long to create it, so it got aborted.

The push operation has multiple phases.  You should see "counting 
objects", "compressing objects" and "writing objects".  Could you give 
us an approximation of how long each of those phases took?

> But, doing git gc shouldn't cause things to be resent.

Indeed.

> The machines in question have done push before.  Even small amounts;
> just the set of objects that are newer.  It's just this time, when the
> 1.6G of new data was added, git ended up creating a new pack file,
> that contained the entire repo, and then tried sending that.

And this is wrong.

> I forgot to mention previously, that the source machine was running
> git 1.5.6.5, and was pushing to 1.5.6.3.
> 
> I've tried duplicating this problem on a machine with 1.6.1.3, but
> either I don't fully understand the issue enough to replicate it, or
> the newer git doesn't have the problem.

That's possible.  Maybe others on the list might recall possible issues 
related to this that might have been fixed during that time.

> >> 2: Is there an option to tell git to *not* be so thorough when trying
> >> to find similiar files.  videos/doc/pdf/etc aren't always very
> >> deltafiable, so I'd be happy to just do full content compares.
> > 
> > Look at the gitattribute documentation.  One thing that the doc appears 
> > to be missing is information about the "delta" attribute.  You can 
> > disable delta compression on a file pattern that way.
> 
> Um, if it's missing documentation, then how am I supposed to know
> about it?

Asking on the list, like you did.  However, this attribute should of 
course be documented as well.  I even think that someone posted a patch 
for it a while ago which might have been dropped.


Nicolas


* Re: large(25G) repository in git
  2009-03-24 17:59   ` Adam Heath
  2009-03-24 18:31     ` Nicolas Pitre
@ 2009-03-24 18:33     ` david
  1 sibling, 0 replies; 16+ messages in thread
From: david @ 2009-03-24 18:33 UTC (permalink / raw)
  To: Adam Heath; +Cc: Nicolas Pitre, git

On Tue, 24 Mar 2009, Adam Heath wrote:

> Nicolas Pitre wrote:
>
>> Strange.  You could instruct ssh to keep the connection up with the
>> ServerAliveInterval option (see the ssh_config man page).
>
> Sure, could do that.  Already have a separate ssh config entry for
> this host.  But why should a connection be kept open for that long?
> Why not close and re-open?

What if the server you are connecting to is behind a load balancer?  How do 
you know that your new connection will go to the same server?  If the 
client never reconnects, how long should the server keep its resources 
tied up 'just in case'?  If something connects to the server, how does it 
know whether it's something reconnecting or connecting for the first time 
(or someone connecting with the intent of messing up someone else's fetch)?

Having the client reconnect to finish a single transaction starts getting 
_really_ ugly.

David Lang


* Re: large(25G) repository in git
  2009-03-24 18:31     ` Nicolas Pitre
@ 2009-03-24 20:55       ` Adam Heath
  2009-03-25  1:21         ` Nicolas Pitre
  0 siblings, 1 reply; 16+ messages in thread
From: Adam Heath @ 2009-03-24 20:55 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre wrote:
> Because it is way more complex for git to do that than for ssh to keep 
> the connection alive.  And normally there is no need as git is supposed 
> to be faster than that.

Sure, I'll buy that.

>>>> So, to work around that, I ran git gc.  When done, I discovered that
>>>> git repacked the *entire* repository.  While not something I care for,
>>>> I can understand that, and live with it.  It just took *hours* to do so.
>>>>
>>>> Then, what really annoys me, is that when I finally did the push, it
>>>> tried sending the single 27G pack file, when the remote already had
>>>> 25G of the repository in several different packs(the site was an
>>>> hg->git conversion).  This part is just unacceptable.
>>> This shouldn't happen either.  When pushing, git reconstruct a pack with 
>>> only the necessary objects to transmit.  Are you sure it was really 
>>> trying to send a 27G pack?
>> Of course I'm sure.  I wouldn't have sent the email if it didn't
>> happen.  And, I have the bandwidthd graph and lost time to prove it.
> 
> As much as I would like to believe you, this doesn't help fixing the 
> problem if you don't provide more information about this.  For example, 
> the output from git during the whole operation might give us the 
> beginning of a clue.  Otherwise, all I can tell you is that such thing 
> is not supposed to happen.

First off, you've put a bad tone on this.  It appears that you are
saying I'm mistaken, and it didn't send all that data.  "It can't
happen, so it didn't happen."  Believe me, if it hadn't resent all
this data, I wouldn't have even sent the email.

In any event, we got lucky.  I *do* have a log of the push side of
this problem.  I doubt it's enough to figure out the actual cause tho.

==
ofbiz@lnxwww10:/job/@anon-site@> git push bf-yum
Counting objects: 96637, done.

Compressing objects:   6% (2413/34478)   478)
Read from remote host @anon-site-dev@.brainfood.com: Connection reset
by peer
Compressing objects:  27% (9458/34478)

Compressing objects: 100% (34478/34478), done.
error: pack-objects died with strange error
error: failed to push some refs to 'ssh://bf-yum/@anon-site@'
ofbiz@lnxwww10:/job/@anon-site@>
ofbiz@lnxwww10:/job/@anon-site@>
ofbiz@lnxwww10:/job/@anon-site@>
ofbiz@lnxwww10:/job/@anon-site@> git push bf-yum
Counting objects: 96637, done.
Killed by signal 2.:   5% (1866/34478)

ofbiz@lnxwww10:/job/@anon-site@> git gc
Counting objects: 96637, done.
Compressing objects:  27% (9453/34478)

Compressing objects: 100% (34478/34478), done.
Writing objects: 100% (96637/96637), done.
Total 96637 (delta 48713), reused 88929 (delta 43905)
Removing duplicate objects: 100% (256/256), done.
ofbiz@lnxwww10:/job/@anon-site@>
ofbiz@lnxwww10:/job/@anon-site@>
ofbiz@lnxwww10:/job/@anon-site@> du .git -sc
26797788        .git
26797788        total
ofbiz@lnxwww10:/job/@anon-site@> git push bf-yum
Counting objects: 96637, done.
Compressing objects: 100% (29670/29670), done.
Writing objects: 100% (96637/96637), 25.49 GiB | 226 KiB/s, done.
Total 96637 (delta 48713), reused 96637 (delta 48713)
To ssh://bf-yum/@anon-site@
 * [new branch]      master -> lnxwww10
==
ofbiz@lnxwww10:/job/@anon-site@> ls .git/objects/pack/ -l
total 26762436
-r--r--r-- 1 ofbiz users     3452052 2009-03-21 23:11
pack-0d7b399006ae0a57ff3df07fdcaedbaeb7e63d0a.idx
-r--r--r-- 1 ofbiz users 27374508409 2009-03-21 23:11
pack-0d7b399006ae0a57ff3df07fdcaedbaeb7e63d0a.pack
==

I have a bf-yum remote defined that pushes to the remote branch; once
it gets there, I then do a merge on the target machine.

The 'killed by signal 2' is from when I hit Ctrl-C.

The second group was done from another window.  There's only a single
pack file now.

The @anon-site@ stuff is me removing client identifiers.  It's the
only editing I did to the screen log.

> 
>> After I ran git push, ssh timed out, the temp pack that was created
>> was then removed, as git complained about the connection being gone.
> 
> On a push, there is no creation of a temp pack.  It is always produced 
> on the fly and pushed straight via the ssh connection.

No.  I saw a temp file in strace.  It *was* created on the local disk,
and *not* sent on the fly.

>> I then decided to do a 'git gc', which collapsed all the separate
>> packs into one.  This allowed git push to proceed quickly, but at that
>> point, it started sending the entire pack.
> 
> If this was really the case, then this is definitely a bug.  Please take 
> a snapshot of your screen with git messages if this ever happens again.

See above.

> 
>> It's entirely possible that the temp pack created by git push was
>> incremental; it just took too long to create it, so it got aborted.
> 
> The push operation has multiple phases.  You should see "counting 
> objects", "compressing objects" and "writing objects".  Could you give 
> us an approximation of how long each of those phases took?

Well, counting was quick enough.  Compression took at *least* 2 hours,
might have been 4 or more.  This all started Friday evening.  I was
watching it a bit at the beginning, but then went out, and it died
after I got back to it.

>> I forgot to mention previously, that the source machine was running
>> git 1.5.6.5, and was pushing to 1.5.6.3.
>>
>> I've tried duplicating this problem on a machine with 1.6.1.3, but
>> either I don't fully understand the issue enough to replicate it, or
>> the newer git doesn't have the problem.
> 
> That's possible.  Maybe others on the list might recall possible issues 
> related to this that might have been fixed during that time.

Well, I looked at the release notes between all these versions.
Nothing stands out, but I'm aware that the changelog/release note
entry for some change doesn't always describe the actual bug that
caused the change to occur.

>> Um, if it's missing documentation, then how am I supposed to know
>> about it?
> 
> Asking on the list, like you did.  However this attribute should be 
> documented as well of course.  I even think that someone posted a patch 
> for it a while ago which might have been dropped.

What I'd like is a way to say that a certain pattern of files should only
be deduped, and not deltafied.  This would handle the case of exact
copies or renames, which would still be a win for us; but generally,
when a new video (or doc or pdf) is uploaded, it's a lot of work to try
to deltify it, for very little benefit.


* Re: large(25G) repository in git
  2009-03-23 21:10 large(25G) repository in git Adam Heath
  2009-03-24  1:19 ` Nicolas Pitre
  2009-03-24  8:59 ` Andreas Ericsson
@ 2009-03-24 21:04 ` Sam Hocevar
  2009-03-24 21:44   ` Adam Heath
  2009-03-26 15:43 ` Marcel M. Cary
  3 siblings, 1 reply; 16+ messages in thread
From: Sam Hocevar @ 2009-03-24 21:04 UTC (permalink / raw)
  To: Adam Heath; +Cc: git

On Mon, Mar 23, 2009, Adam Heath wrote:
> We maintain a website in git.  This website has a bunch of backend
> server code, and a bunch of data files.  Alot of these files are full
> videos.
> 
> [...]
> 
> Last friday, I was doing a checkin on the production server, and found
> 1.6G of new files.  git was quite able at committing that.  However,
> pushing was problematic.  I was pushing over ssh; so, a new ssh
> connection was open to the preview server.  After doing so, git tried
> to create a new pack file.  This took *ages*, and the ssh connection
> died.  So did git, when it finally got done with the new pack, and
> discovered the ssh connection was gone.

   As stated several times by Linus and others, Git was not designed
to handle large files. My stance on the issue is that before trying
to optimise operations so that they perform well on large files, too,
Git should usually avoid such operations, especially deltification.
One notable exception would be someone storing their mailbox in Git,
where deltification is a major space saver. But usually, these large
files are binary blobs that do not benefit from delta search (or even
compression).

   Since I also need to handle large files (80 GiB repository), I am
cleaning up some fixes I did, which can be seen in the git-bigfiles
project (http://caca.zoy.org/wiki/git-bigfiles). I have not yet tried
to change git-push (because I submit through git-p4), but I hope to
address it, too. As time goes on, I believe some of them could make it
into mainstream Git.

   In your particular case, I would suggest setting pack.packSizeLimit
to something lower. This would reduce the time spent generating a new
pack file if the problem were to happen again.
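
   Something along these lines, for instance (the 2g value is only an
example; pick whatever limit suits your setup):

git config pack.packSizeLimit 2g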

Regards,
-- 
Sam.


* Re: large(25G) repository in git
  2009-03-24 21:04 ` Sam Hocevar
@ 2009-03-24 21:44   ` Adam Heath
  2009-03-25  0:28     ` Nicolas Pitre
  0 siblings, 1 reply; 16+ messages in thread
From: Adam Heath @ 2009-03-24 21:44 UTC (permalink / raw)
  To: Sam Hocevar; +Cc: git

Sam Hocevar wrote:
>    As stated several times by Linus and others, Git was not designed
> to handle large files. My stance on the issue is that before trying
> to optimise operations so that they perform well on large files, too,
> Git should usually avoid such operations, especially deltification.
> One notable exception would be someone storing their mailbox in Git,
> where deltification is a major space saver. But usually, these large
> files are binary blobs that do not benefit from delta search (or even
> compression).

Yeah, in this case, I *know* that my binary blobs are completely
different, and it's just a waste of time for git to come to the same
conclusion.  I'd be perfectly willing to have some knob I could turn
that would tell git this.

>    Since I also need to handle large files (80 GiB repository), I am
> cleaning up some fixes I did, which can be seen in the git-bigfiles
> project (http://caca.zoy.org/wiki/git-bigfiles). I have not yet tried
> to change git-push (because I submit through git-p4), but I hope to
> address it, too. As time goes I believe some of them could make it into
> mainstream Git.

I'd almost be willing to help.  I know the basic premise of how git
works, but the devil is in the details, and I don't have time right
now to learn the internals.

Yet another thing to add to my todo list.

>    In your particular case, I would suggest setting pack.packSizeLimit
> to something lower. This would reduce the time spent generating a new
> pack file if the problem were to happen again.

Yeah, saw that one, but *after* I had this problem.  The default, if
not set, is unlimited, which in this case is definitely *not* what we
want.


* Re: large(25G) repository in git
  2009-03-24  8:59 ` Andreas Ericsson
@ 2009-03-24 22:35   ` Adam Heath
  0 siblings, 0 replies; 16+ messages in thread
From: Adam Heath @ 2009-03-24 22:35 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: git

Andreas Ericsson wrote:
> First of all, I'm going to hint that you would be far better off
> keeping the media files in a separate repository, linked in as a
> submodule in git and with tweaked configuration settings with the
> specific aim of handling huge files.

Already do that.  We have a custom overlay/union-type filesystem that
makes use of a small base directory, where the code resides; each
sub-website is where the content lives.

It's just that finding documentation through Google that describes the
workflow we are using is difficult.

> The basis of such a repository is probably the following config
> settings, since media files very rarely compress enough to be
> worth the effort, and their own compressed formats make them
> very unsuitable delta candidates:
> [pack]
>   # disable delta-based packing
>   depth = 1
>   # disable compression
>   compression = 0
> 
> [gc]
>   # don't auto-pack, ever
>   auto = 0
>   # never automatically consolidate un-.keep'd packs
>   autopacklimit = 0

Thanks for the pointers!

> You will have to manually repack this repository from time to
> time, and it's almost certainly a good idea to mark the
> resulting packs with .keep to avoid copying tons of data.
> When packs are being created, objects can be copied from
> existing packs, and send-pack will make use of that so that what
> goes over the wire will simply be copied from the existing packs.
> 
> YMMV. If you do come up with settings that work fine for huge
> repos made up of mostly media files, please share your findings.

I'll use these as a basis.

>> So, to work around that, I ran git gc.  When done, I discovered that
>> git repacked the *entire* repository.  While not something I care for,
>> I can understand that, and live with it.  It just took *hours* to do so.
>>
> 
> I'm not sure what, if any, magic "git gc" applies before spawning
> "git repack", but running "git repack" directly would almost certainly
> have produced an incremental pack. Perhaps we need to make gc less
> magic.

The repo should only be collapsed into a single pack if the user
explicitly wants it.  Any automatic gc call, or one made without args,
should just take any loose objects and pack them up.  But that's my
opinion.

> Not necessarily all that simple (we do not want to touch the ssh
> password if we can possibly avoid it, but the user shouldn't have
> to type it more than once), but certainly doable. Easier would
> probably be to recommend adding the proper SSH config variables,
> as has been stated elsewhere.

ssh-agent, or password-less anonymous ssh (I've got a custom login
script inside authorized_keys on the remote).
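
(Purely for illustration, a forced-command entry in ~/.ssh/authorized_keys
looks roughly like this -- the wrapper path, key material, and comment
here are all made up:)

command="/usr/local/bin/site-git-login",no-port-forwarding,no-X11-forwarding ssh-rsa AAAAB3Nza... push-key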

> See above. I *think* you can also do this with git-attributes, but
> I'm not sure. However, keeping the large media files in a sub-module
> would nicely solve that problem anyway, and is probably a good idea
> even with git-attributes support for pack delta- and compression
> settings.

The site would *still* be > 25G in size, at the least, and constantly
getting bigger.  This site contains copies of ad videos from their
competitors, plus their own, and is used to market their international
company.

> http://www.thousandparsec.net/~tim/media+git.pdf probably holds all the
> relevant information when it comes to storing large media files with
> git. I have not checked and have no inclination to do so.

http://caca.zoy.org/wiki/git-bigfiles is another one.


* Re: large(25G) repository in git
  2009-03-24 21:44   ` Adam Heath
@ 2009-03-25  0:28     ` Nicolas Pitre
  2009-03-25  0:57       ` Adam Heath
  0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2009-03-25  0:28 UTC (permalink / raw)
  To: Adam Heath; +Cc: Sam Hocevar, git

On Tue, 24 Mar 2009, Adam Heath wrote:

> Sam Hocevar wrote:
> >    In your particular case, I would suggest setting pack.packSizeLimit
> > to something lower. This would reduce the time spent generating a new
> > pack file if the problem were to happen again.
> 
> Yeah, saw that one, but *after* I had this problem.  The default, if
> not set, is unlimited, which in this case, is definately *not* what we
> want.

In your particular case, if the problem is actually what I think it is, 
pack.packSizeLimit wouldn't have made any difference.  This setting 
affects local repacking only and has no effect whatsoever on the push 
operation.


Nicolas


* Re: large(25G) repository in git
  2009-03-25  0:28     ` Nicolas Pitre
@ 2009-03-25  0:57       ` Adam Heath
  2009-03-25  1:47         ` Nicolas Pitre
  0 siblings, 1 reply; 16+ messages in thread
From: Adam Heath @ 2009-03-25  0:57 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Sam Hocevar, git

Nicolas Pitre wrote:
> On Tue, 24 Mar 2009, Adam Heath wrote:
> 
>> Sam Hocevar wrote:
>>>    In your particular case, I would suggest setting pack.packSizeLimit
>>> to something lower. This would reduce the time spent generating a new
>>> pack file if the problem were to happen again.
>> Yeah, saw that one, but *after* I had this problem.  The default, if
>> not set, is unlimited, which in this case, is definately *not* what we
>> want.
> 
> In your particular case, if the problem is actually what I think it is, 
> the pack.packSizeLimit wouldn't have made any difference.  This setting 
> affects local repacking only and has no effect what so ever on the push 
> operation.

Ooh.  Care to enlighten those of us not blessed with git internal
knowledge?

On another note, anyone have a goat I can buy, for the sacrifice?


* Re: large(25G) repository in git
  2009-03-24 20:55       ` Adam Heath
@ 2009-03-25  1:21         ` Nicolas Pitre
  0 siblings, 0 replies; 16+ messages in thread
From: Nicolas Pitre @ 2009-03-25  1:21 UTC (permalink / raw)
  To: Adam Heath; +Cc: git

On Tue, 24 Mar 2009, Adam Heath wrote:

> Nicolas Pitre wrote:
> > As much as I would like to believe you, this doesn't help fixing the 
> > problem if you don't provide more information about this.  For example, 
> > the output from git during the whole operation might give us the 
> > beginning of a clue.  Otherwise, all I can tell you is that such thing 
> > is not supposed to happen.
> 
> First off, you've put a bad tone on this.  It appears that you are
> saying I'm mistaken, and it didn't send all that data.  "It can't
> happen, so it didn't happen."  Believe me, if it hadn't resent all
> this data, I wouldn't have even sent the email.

I don't know you.  All I had was the information you provided, which was 
rather incomplete.  So don't be offended if I ask for more.  I'm trying 
to help you, after all.

And especially in this case, the problem seems not to be about 
packing...

> In any event, we got lucky.  I *do* have a log of the push side of
> this problem.  I doubt it's enough to figure out the actual cause tho.

Well, I think it might.

> ==
> Counting objects: 96637, done.
> Compressing objects: 100% (29670/29670), done.
> Writing objects: 100% (96637/96637), 25.49 GiB | 226 KiB/s, done.
> Total 96637 (delta 48713), reused 96637 (delta 48713)
> To ssh://bf-yum/@anon-site@
>  * [new branch]      master -> lnxwww10

Was that branch really new on the remote side?  If not, then this is 
highly suspicious.  If somehow the previously aborted push attempt 
screwed up the remote refs, then the local client would think that the 
remote is empty and conclude that all commits have to be pushed.

> >> After I ran git push, ssh timed out, the temp pack that was created
> >> was then removed, as git complained about the connection being gone.
> > 
> > On a push, there is no creation of a temp pack.  It is always produced 
> > on the fly and pushed straight via the ssh connection.
> 
> No.  I saw a temp file in strace.  It *was* created on the local disk,
> and *not* sent on the fly.

A temp pack is created on the receiving side, not the sending side, 
though.  The sending side pipes the pack data to its standard output, 
which is connected to ssh's standard input.
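
(Roughly the moral equivalent of the following sketch, with the refs as
placeholders; the real code talks to the remote's receive-pack over that
same connection instead of shelling out like this:)

# stream a pack of everything in master that origin/master lacks to
# stdout and just measure its size -- no temporary pack file is written
printf 'master\n--not\norigin/master\n' | git pack-objects --revs --stdout | wc -c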

> >> Um, if it's missing documentation, then how am I supposed to know
> >> about it?
> > 
> > Asking on the list, like you did.  However this attribute should be 
> > documented as well of course.  I even think that someone posted a patch 
> > for it a while ago which might have been dropped.
> 
> What I'd like, is a way to say a certain pattern of files should only
> be deduped, and not deltafied.  This would handle the case of exact
> copies, or renames, which would still be a win for us, but generally
> when a new video(or doc or pdf) is uploaded, it's alot of work to try
> and deltafy, for very little benefit.

Renamed/duplicated files are always stored uniquely by design.  Git 
stores file data in objects which are named after the SHA1 of their 
content.

In order not to attempt any delta on PDF files, for example, you need to 
add a negative delta attribute line such as:

*.pdf	-delta

either in a file called .gitattributes, which gets versioned 
and distributed, or in .git/info/attributes, in which case it'll remain 
local.  Any file matching *.pdf won't be delta compressed.
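
(To check that the attribute is being picked up, something like this
should do -- the file name is only an example:)

$ git check-attr delta -- somefile.pdf
somefile.pdf: delta: unset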


Nicolas


* Re: large(25G) repository in git
  2009-03-25  0:57       ` Adam Heath
@ 2009-03-25  1:47         ` Nicolas Pitre
  0 siblings, 0 replies; 16+ messages in thread
From: Nicolas Pitre @ 2009-03-25  1:47 UTC (permalink / raw)
  To: Adam Heath; +Cc: Sam Hocevar, git

On Tue, 24 Mar 2009, Adam Heath wrote:

> Nicolas Pitre wrote:
> > On Tue, 24 Mar 2009, Adam Heath wrote:
> > 
> >> Sam Hocevar wrote:
> >>>    In your particular case, I would suggest setting pack.packSizeLimit
> >>> to something lower. This would reduce the time spent generating a new
> >>> pack file if the problem were to happen again.
> >> Yeah, saw that one, but *after* I had this problem.  The default, if
> >> not set, is unlimited, which in this case, is definately *not* what we
> >> want.
> > 
> > In your particular case, if the problem is actually what I think it is, 
> > the pack.packSizeLimit wouldn't have made any difference.  This setting 
> > affects local repacking only and has no effect what so ever on the push 
> > operation.
> 
> Ooh.  Care to enlighten those of us not blessed with git internal
> knowledge?

See my previous email for a likely explanation about your issue.

As to the pack.packSizeLimit setting: it is used when repacking only in 
order to avoid big packs on systems that might have issues dealing with 
large files.  During a repack, if the currently produced pack is about 
to get over that limit, then the pack is closed and a new one is 
started.  You therefore end up with many packs.

The transfer protocol used during a fetch or a push uses the pack format 
streamed over the network, but only one pack can be transferred that 
way.  Maybe the reception of a pack during a network transfer should be 
split according to pack.packSizeLimit as well, but this is currently not 
implemented at all.  No one complained about that either, so I'm guessing 
that splitting a large pack, if needed, by using 'git repack' after a 
clone/fetch is good enough.
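
(For example, a sketch of splitting an already-fetched big pack locally,
with 2g as an arbitrary limit:)

git config pack.packSizeLimit 2g
git repack -a -d   # rewrites the one big pack as several packs under the limit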

Personally, I don't think actively splitting packs into smaller ones is 
that useful, unless you wish to archive them on a file system which 
cannot handle files larger than 2GB or the like.

> On another note, anyone have a goat I can buy, for the sacrifice?

Beware the wrath of Git...


Nicolas


* Re: large(25G) repository in git
  2009-03-23 21:10 large(25G) repository in git Adam Heath
                   ` (2 preceding siblings ...)
  2009-03-24 21:04 ` Sam Hocevar
@ 2009-03-26 15:43 ` Marcel M. Cary
  2009-03-26 16:35   ` Adam Heath
  3 siblings, 1 reply; 16+ messages in thread
From: Marcel M. Cary @ 2009-03-26 15:43 UTC (permalink / raw)
  To: Adam Heath; +Cc: git

Adam Heath wrote:
> We maintain a website in git.  This website has a bunch of backend
> server code, and a bunch of data files.  Alot of these files are full
> videos.
>
> We use git, so that the distributed nature of website development can
> be supported.  Quite often, you'll have a production server, with
> online changes occurring(we support in-browser editting of content), a
> preview server, where large-scale code changes can be previewed, then
> a development server, one per programmer(or more).

My company manages code in a similar way, except we avoid this kind of
issue (with 100 gigabytes of user-uploaded images and other data) by not
checking in the data.  We even went so far as to halve the size of
our repository by removing 2GB of non-user-supplied images -- rounded
corners, background gradients, logos, etc.  This made Git
noticeably faster.

While I'd love to be able to handle your kind of use case and data size
with Git in that way, it's a little beyond the intended usage to handle
hundreds of gigabytes of binary data, I think.

I imagine as your web site grows, which I'm assuming is your goal, your
problems with scaling Git will continue to be a challenge.

Maybe you can find a way to:

* Get along with less data in your non-production environments; we're
hoping to be able to do this eventually

* Find other ways to copy it; we use rsync even though it does take
forever to crawl over the file system

* Put your data files in a separate Git repository, at least, assuming
you check in, update, and release code more often than your video files.
That way you'll experience pain less often, and maybe even be able to
tune your repository differently.

Marcel


* Re: large(25G) repository in git
  2009-03-26 15:43 ` Marcel M. Cary
@ 2009-03-26 16:35   ` Adam Heath
  0 siblings, 0 replies; 16+ messages in thread
From: Adam Heath @ 2009-03-26 16:35 UTC (permalink / raw)
  To: Marcel M. Cary; +Cc: git

Marcel M. Cary wrote:
> My company manages code in a similar way, except we avoid this kind of
> issue (with 100 gigabytes of user-uploaded images and other data) by not
> checking in the data.  We even went so far is as to halve the size of
> our repository by removing 2GB of non-user-supplied images -- rounded
> corners, background gradients, logos, etc, etc.  This made Git
> noticeably faster.

Disk space is cheap.

> While I'd love to be able to handle your kind of use case and data size
> with Git in that way, it's a little beyond the intended usage to handle
> hundreds of gigabytes of binary data, I think.
> 
> I imagine as your web site grows, which I'm assuming is your goal, your
> problems with scaling Git will continue to be a challenge.
> 
> Maybe you can find a way to:
> 
> * Get along with less data in your non-production environments; we're
> hoping to be able to do this eventually

We do that by only cloning/checking out certain modules.

However, as is always the case, sometimes a bug occurs with production
data, and you need to use the real data to track it down.

> * Find other ways to copy it; we use rsync even though it does take
> forever to crawl over the file system
> 
> * Put your data files in a separate Git repository, at least, assuming
> your checkin, update, and release code more often than your video files.
>  That way you'll experience pain less often, and maybe even be able to
> tune your repository differently.

As already mentioned, our sub-sites *are* in separate repos.  There's
a base repository that has just the event/backend code, then 32
*other* repositories where the actual websites live.

We want to use *some* kind of versioning system.  Being able to have
a history of *all* changes is extremely useful, not to mention being
able to track what each separate user does as they modify their files
through their browser.

Subversion is just right out.  It's centralized.  It leaves poop all
over the place.

Mercurial is just right out.  If you do several *separate* commits of
*separate* files, but don't push for some time period, then eventually
do a push/pull where the sum total of the changes is larger than some
value, Mercurial will fail when it then tries to update the local
directory.  The limit is 2G, a hard-coded Python limit (even
on a 64-bit host), because Mercurial reads the entire set of changes
into a Python string.

Git mmaps files and does window scanning of the pack files.  It *might*
read a single file entirely into memory for compression purposes; I'm not
certain about this.  We certainly haven't hit any limits that cause it to
fail outright.

I haven't tried any others.

