* Git Large Object Support Proposal
From: Scott Chacon @ 2009-03-19 22:14 UTC (permalink / raw)
  To: git list

I have been thinking about this for a while, so I wanted to get some
feedback. I've been seeing a number of people interested in using Git
for game development and whatnot, or otherwise committing huge files.
This will occasionally wreak some havoc on our servers (GitHub)
because of the memory mapping involved.  Thus, we would really like to
see a nicer way for Git to handle big files.

There are two proposals on the GSoC page to deal with this - the
'remote alternates/lazy clone' idea and the 'sparse/narrow clone'
idea.  I'm wondering if instead it might be an interesting idea to
concentrate on the 'stub objects' for large blobs that Jakub was
talking about a few months ago:

http://markmail.org/message/my4kvrhsza2yjmlt

But where Git instead stores a stub object and the large binary object
is pulled in via a separate mechanism. I was thinking that the client
could set a max file size and when a binary object larger than that is
staged, Git instead writes a stub blob like:

==
blob [size]\0
[sha of large blob]
==
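
To make that concrete, here is a rough sketch (Python, purely
illustrative - the helper names and the .git/media stash are my
assumptions, not anything Git does today) of how a client-side helper
could park the real bytes locally and produce the stub payload:

==
import hashlib, os, shutil

MAX_FILE_SIZE = 10 * 1024 * 1024             # the max-file-size setting (10M)
MEDIA_CACHE = os.path.join(".git", "media")  # assumed local stash for big content

def git_blob_sha1(path):
    # SHA-1 the way git names a blob: "blob <size>\0" header plus raw content
    sha = hashlib.sha1()
    sha.update(b"blob %d\0" % os.path.getsize(path))
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest()

def write_stub(path):
    # park the real bytes under .git/media/<sha>; the returned payload is what
    # the stub blob would carry (git would add the usual object header itself)
    digest = git_blob_sha1(path)
    os.makedirs(MEDIA_CACHE, exist_ok=True)
    shutil.copyfile(path, os.path.join(MEDIA_CACHE, digest))
    return digest + "\n"

# e.g. at 'git add' time:
#   if os.path.getsize("huge-file.mpg") > MAX_FILE_SIZE:
#       print(write_stub("huge-file.mpg"))
==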

Then in the tree, we give the stubbed large file a special mode or type:

==
100644 blob 3bb0e8592a41ae3185ee32266c860714980dbed7 README
040000 tree 557b70d2374ae77869711cb583e6d59b8aad5e8b lib
150000 blob 502feb557e2097d38a643e336f722525bc7ea077 big-ass-file.mpeg
==

Sort of like a symlink, but instead of the blob it points to
containing the link path, it just contains the SHA of the real blob.
Then we can have a command like 'git media' or something that helps
manage those, pull them down from a specified server (specified in a
.gitmedia file) and transfer new ones up before a push is allowed,
etc.  This makes it sort of a cross between a symlink and a submodule.

== .git/config
[media]
    push-url = [aws/scp/sftp/etc server]
    password = [write password]
    token = [write token]

== .gitmedia
[server]
    pull-url = [aws/scp/sftp/etc read only url]
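
A 'git media push' helper could then be little more than glue around
'git config' and whatever transfer tool you pick.  A minimal sketch,
assuming the config keys above and scp as the transport (the helper
itself is hypothetical):

==
import os, subprocess

def config_value(*args):
    # read a value via `git config`; returns None if the key is unset
    try:
        return subprocess.check_output(("git", "config") + args, text=True).strip()
    except subprocess.CalledProcessError:
        return None

def media_push():
    # upload everything parked in .git/media before the real `git push`;
    # readers would take server.pull-url from .gitmedia the same way:
    #   config_value("-f", ".gitmedia", "--get", "server.pull-url")
    push_url = config_value("--get", "media.push-url")   # from .git/config
    if push_url is None:
        raise SystemExit("no media.push-url configured")
    cache = os.path.join(".git", "media")
    for name in sorted(os.listdir(cache)):
        # scp is only a stand-in; sftp/s3 would slot in the same way
        subprocess.check_call(["scp", os.path.join(cache, name),
                               "%s/%s" % (push_url, name)])
==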

This might be nice because all the objects would be local, so most of
the changes to tools should be rather small - we can't really
merge/diff/blame large binary stuff anyhow, right?  Also, the
really large files could be written and served over protocols that are
better for large file transfer (scp, sftp, etc) - the media server
could be different from the git server.  Then our servers can stop
choking when someone tries to add and push a 2 gig file.

If two users have different settings, one would simply have the stub
and the other not; 'git media update' could check the local db
first before fetching.  If you change the max-file-size at some point,
the trees would just either stop using the stubs (if you raised it)
for anything that now fits under the size limit, or start using stubs
for files that are now over it (if you lowered it).

The workflow may go something like this:

$ cd git-repo
$ cp ~/huge-file.mpg .
$ git media add s3://chacon-media
# wrote new media server url to .gitmedia
$ git add .
# huge-file.mpg is larger than max-file-size (10M) and will be added
as media (see 'git media')
$ git status
# On branch master
#
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#	new file:   .gitmedia
#	new media:   huge-file.mpg
#
$ git push
Uploading new media to s3://chacon-media
Uploading media files 100% (5/5), done.
New media uploaded, pushing to Git server
Counting objects: 14, done.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (10/10), 1.04 KiB, done.
Total 10 (delta 4), reused 0 (delta 0)
To git@github.com:schacon/mediaproject.git
 + dbb5d00...9647674 master -> master


On the client side we would have something like this:

$ git clone git://github.com/schacon/mediaproject.git
Initialized empty Git repository in /private/tmp/simplegit/.git/
remote: Counting objects: 270, done.
remote: Compressing objects: 100% (148/148), done.
remote: Total 270 (delta 103), reused 198 (delta 77)
Receiving objects: 100% (270/270), 24.31 KiB, done.
Resolving deltas: 100% (103/103), done.
# You have unfetched media, run 'git media update' to get large media files
$ git status
# On branch master
#
# Media files to be fetched:
#   (use "git media update <file>..." to fetch)
#
#	unfetched:   huge-file.mpg
#
$ git media update
Fetching media from s3://chacon-media
Fetching media files 100% (1/1), done.


Anyhow, you get the picture.  I would be happy to try to get a proof
of concept of this done, but I wanted to know if there are any serious
objections to this approach to large media.


* Re: Git Large Object Support Proposal
From: Junio C Hamano @ 2009-03-19 22:31 UTC (permalink / raw)
  To: Scott Chacon; +Cc: git list

Scott Chacon <schacon@gmail.com> writes:

> But where Git instead stores a stub object and the large binary object
> is pulled in via a separate mechanism. I was thinking that the client
> could set a max file size and when a binary object larger than that is
> staged, Git instead writes a stub blob like:
>
> ==
> blob [size]\0
> [sha of large blob]
> ==

An immediate pair of questions are, if you can solve the issue by
delegating large media to somebody else (i.e. "media server"), and that
somebody else can solve the issues you are having, (1) what happens if you
lower that "large" threashold to "0 byte"?  Does that somebody else still
work fine, and does the git that uses indirection also still work fine?
If so why are you using git instead of that somebody else altogether?  and
(2) what prevents us from stealing the trick that somebody else uses so
that git itself can natively handle large blobs without indirection?

Without thinking the ramifications through myself, this sounds pretty much
like a band-aid and will end up hitting the same "blob is larger than we
can handle" issue when you follow the indirection eventually, but that is
just my gut feeling.

This is an off-topic "By the way", but has the other topic addressed to you
on git-scm.com/about been resolved in any way yet?


* Re: Git Large Object Support Proposal
From: Scott Chacon @ 2009-03-19 23:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git list

Hey,

On Thu, Mar 19, 2009 at 3:31 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Scott Chacon <schacon@gmail.com> writes:
>
>> But where Git instead stores a stub object and the large binary object
>> is pulled in via a separate mechanism. I was thinking that the client
>> could set a max file size and when a binary object larger than that is
>> staged, Git instead writes a stub blob like:
>>
>> ==
>> blob [size]\0
>> [sha of large blob]
>> ==
>
> An immediate pair of questions are, if you can solve the issue by
> delegating large media to somebody else (i.e. "media server"), and that
> somebody else can solve the issues you are having, (1) what happens if you
> lower that "large" threshold to "0 byte"?  Does that somebody else still
> work fine, and does the git that uses indirection also still work fine?
> If so why are you using git instead of that somebody else altogether?  and

In theory it would work fine, with all the commits/trees
transferred over git and all the blobs basically stored elsewhere,
but I would assume it would be much slower for the end user and so
nobody would do that.  I would imagine users would only use/enable
this at all if they have large media files that they don't want
every version of cloned every time.  I can't imagine that
this would be used by more than a small percentage of users,
but when large media does need to be under source control, they will
not use Git (they will use Perforce or SVN), or they will put it in
there and then kill their (or our) servers when upload-pack tries to
mmap it (twice, yes?).  I thought it would be much more efficient for
Git to have the ability to simply mark files that don't make sense to
pack up, and to keep track of and transfer them via a more appropriate
protocol.

> (2) what prevents us from stealing the trick that somebody else uses so
> that git itself can natively handle large blobs without indirection?
>

Actually, I'm fine with that - phase two of this project, if it made
sense at all, would be to have another set of git transfer commands
that allowed large blobs to be uploaded/downloaded separately,
importantly not passing them in the packfile and keeping them loose,
uncompressed and headerless on disk so they can simply be streamed
when requested.  I am thinking entirely of movies and images that
are already compressed, where there is simply no need to load them
entirely into memory.  I simply thought that taking advantage of
services that already do this (scp, sftp, s3) would be quicker than
building another set of transfer protocols into Git.

> Without thinking the ramifications through myself, this sounds pretty much
> like a band-aid and will end up hitting the same "blob is larger than we
> can handle" issue when you follow the indirection eventually, but that is
> just my gut feeling.

The point is that we don't keep this data as 'blob's - we don't try to
compress them or add the header to them, they're too big and already
compressed, it's a waste of time and often outside the memory
tolerance of many systems. We keep only the stub in our db and stream
the large media content directly to and from disk.  If we do a
'checkout' or something that would switch it out, we could store the
data in '.git/media' or the equivalent until it's uploaded elsewhere.
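
All we need for that is a dumb chunked copy, something like this
(sketch only, paths and buffer size arbitrary):

==
import shutil

def stream_media(src_path, dst_path, bufsize=1 << 20):
    # copy in fixed-size chunks; nothing close to the whole file is ever
    # resident in memory, unlike mmap'ing a 2 gig blob to rewrite its header
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, bufsize)
==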

>
> This is an off-topic "By the way", but has another topic addressed to you
> on git-scm.com/about resolved in any way yet?
>

Thanks for pointing that out, I missed that thread.  I actually just
pushed out some changes over the last few days - I added the Gnome
project since they just announced they're moving to Git, added a link
to the new O'Reilly book that was just released, and pulled in some
validation and other misc changes that had been contributed.

Currently I have to re-gen the Authors data manually, so I do it every
once in a while - I just pushed up new data.  Doing it per release is
a good idea; I'll try to get that into the release script.


* Re: Git Large Object Support Proposal
From: david @ 2009-03-19 23:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Scott Chacon, git list

On Thu, 19 Mar 2009, Junio C Hamano wrote:

> Scott Chacon <schacon@gmail.com> writes:
>
>> But where Git instead stores a stub object and the large binary object
>> is pulled in via a separate mechanism. I was thinking that the client
>> could set a max file size and when a binary object larger than that is
>> staged, Git instead writes a stub blob like:
>>
>> ==
>> blob [size]\0
>> [sha of large blob]
>> ==
>
> An immediate pair of questions are, if you can solve the issue by
> delegating large media to somebody else (i.e. "media server"), and that
> somebody else can solve the issues you are having, (1) what happens if you
> lower that "large" threshold to "0 byte"?  Does that somebody else still
> work fine, and does the git that uses indirection also still work fine?
> If so why are you using git instead of that somebody else altogether?

ideally the difference between using git with 'large' set to 0 and git 
with no pack file should be an extra lookup for the indirection.

it may be that some other file manipulation may not be possible for 
'large' files, resulting in some reduced functionality.

in any case, the added efficiency of using pack files (both for local 
storage and for network transport) means that handling the 'large' files 
this way will be worse than handling same-size files through git normally 
(assuming that they can benefit from delta compression)

> and
> (2) what prevents us from stealing the trick that somebody else uses so
> that git itself can natively handle large blobs without indirection?

the key thing is that large files do not get mmaped or considered for 
inclusion in pack files (including cloning and pulling pack files)

to make them full first-class citizens you would need to make alternate 
code paths for everything that currently does mmap, making those paths 
process the file a different way. in the long run that may be the best 
thing to do, but that's a lot of change compared to the proposed 
change.

> Without thinking the ramifications through myself, this sounds pretty much
> like a band-aid and will end up hitting the same "blob is larger than we
> can handle" issue when you follow the indirection eventually, but that is
> just my gut feeling.

it depends on what you are doing with that file when you get to it. if you 
have to mmap it you may run into the same problem. but if the file is a 
streaming video, you can transport it around (with rsync, http, etc) 
without a problem, and using the file (playing the video) never keeps much 
of the file in memory, so it will be very useful on systems that would 
never have a chance of accessing the entire file through mmap.

David Lang


* Re: Git Large Object Support Proposal
From: Junio C Hamano @ 2009-03-19 23:44 UTC (permalink / raw)
  To: Scott Chacon; +Cc: git list

Scott Chacon <schacon@gmail.com> writes:

> The point is that we don't keep this data as 'blob's - we don't try to
> compress them or add the header to them, they're too big and already
> compressed, it's a waste of time and often outside the memory
> tolerance of many systems. We keep only the stub in our db and stream
> the large media content directly to and from disk.  If we do a
> 'checkout' or something that would switch it out, we could store the
> data in '.git/media' or the equivalent until it's uploaded elsewhere.

Aha, that sounds like you can just maintain a set of out-of-tree symbolic
links that you keep track of, and let other people (e.g. rsync) deal with
the complexity of managing that side of the world.

And I think you can start experimenting it without any change to the core
datastructures.  In your single-page web site in which its sole html file
embeds an mpeg movie, you keep track of these two things in git:

	porn-of-the-day.html
        porn-of-the-day.mpg -> ../media/6066f5ae75ec.mpg

and any time you want to feed a new movie, you update the symlink to a
different one that lives outside the source-controlled tree, while
arranging the link target to be updated out-of-band.


* Re: Git Large Object Support Proposal
From: david @ 2009-03-19 23:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Scott Chacon, git list

On Thu, 19 Mar 2009, Junio C Hamano wrote:

> Scott Chacon <schacon@gmail.com> writes:
>
>> The point is that we don't keep this data as 'blob's - we don't try to
>> compress them or add the header to them, they're too big and already
>> compressed, it's a waste of time and often outside the memory
>> tolerance of many systems. We keep only the stub in our db and stream
>> the large media content directly to and from disk.  If we do a
>> 'checkout' or something that would switch it out, we could store the
>> data in '.git/media' or the equivalent until it's uploaded elsewhere.
>
> Aha, that sounds like you can just maintain a set of out-of-tree symbolic
> links that you keep track of, and let other people (e.g. rsync) deal with
> the complexity of managing that side of the world.
>
> And I think you can start experimenting it without any change to the core
> datastructures.  In your single-page web site in which its sole html file
> embeds an mpeg movie, you keep track of these two things in git:
>
> 	porn-of-the-day.html
>        porn-of-the-day.mpg -> ../media/6066f5ae75ec.mpg
>
> and any time you want to feed a new movie, you update the symlink to a
> different one that lives outside the source-controlled tree, while
> arranging the link target to be updated out-of-band.

that would work, but the proposed change has some advantages

1. you store the sha1 of the real mpg in the 'large file' blob so you can 
detect problems

2. since it knows the sha1 of the real file, it can auto-create the real 
file as needed, without wasting space on too many copies of it.

David Lang


* Re: Git Large Object Support Proposal
From: Junio C Hamano @ 2009-03-20  0:11 UTC (permalink / raw)
  To: david; +Cc: Scott Chacon, git list

david@lang.hm writes:

> On Thu, 19 Mar 2009, Junio C Hamano wrote:
>
>> Scott Chacon <schacon@gmail.com> writes:
>>
>>> The point is that we don't keep this data as 'blob's - we don't try to
>>> compress them or add the header to them, they're too big and already
>>> compressed, it's a waste of time and often outside the memory
>>> tolerance of many systems. We keep only the stub in our db and stream
>>> the large media content directly to and from disk.  If we do a
>>> 'checkout' or something that would switch it out, we could store the
>>> data in '.git/media' or the equivalent until it's uploaded elsewhere.
>>
>> Aha, that sounds like you can just maintain a set of out-of-tree symbolic
>> links that you keep track of, and let other people (e.g. rsync) deal with
>> the complexity of managing that side of the world.
>>
>> And I think you can start experimenting it without any change to the core
>> datastructures.  In your single-page web site in which its sole html file
>> embeds an mpeg movie, you keep track of these two things in git:
>>
>> 	porn-of-the-day.html
>>        porn-of-the-day.mpg -> ../media/6066f5ae75ec.mpg
>>
>> and any time you want to feed a new movie, you update the symlink to a
>> different one that lives outside the source-controlled tree, while
>> arranging the link target to be updated out-of-band.
>
> that would work, but the proposed change has some advantages
>
> 1. you store the sha1 of the real mpg in the 'large file' blob so you
> can detect problems

You store the unique identifier of the real mpg in the symbolic link
target which is a blob payload, so you can detect problems already.  I
deliberately said "unique identifier"; you seem to think saying SHA-1
brings something magical but I do not think it needs to be even blob's
SHA-1.  Hashing that much data costs.

In any case, you can have a script (or client-side hook) that does:

    (1) find the out-of-tree symlinks in the index (or in the work tree);

    (2) if it is dangling, and if you have definition of where to get that
        hierarchy from (e.g ../media), run rsync or wget or whatever
        external means to grab it.

and call it after "git pull" updates from some other place.  The "git
media" of Scott's message could be an alias to such a command.

Adding a new type "external-blob" would be an unwelcome pain.  Reusing
"blob" so that existing "blob" codepath now needs to notice special "0"
that is not length "0" is even bigger pain than that.

And that is a pain for unknown benefit, especially when you can start
experimenting without any changes to the existing data structure.  In the
worst case, the experiment may not pan out as well as you hoped and if
that is the end of the story, so be it.  It is not a great loss.  If it
works well enough and we can have the external large media support without
any changes to the data structure, that would be really great.  If it
sort-of works but hits limitation, we can analyze how best to overcome
that limitation, and at that time it _might_ turn out to be the best
approach to introduce a new blob type.

But I do not think we know that yet.

In the longer run, as you speculated in your message, I think the native
blob codepaths need to be updated to tolerate large, unmappable objects
better.  With that goal in mind, I think it is a huge mistake to
prematurely introduce arbitrarily distinct "blob" and "large blob" types,
if in the end they need to be merged back again; it would force the future
code indefinitely to care about the historical "large blob" type that was
once supported.

> 2. since it knows the sha1 of the real file, it can auto-create the
> real file as needed, without wasting space on too many copies of it.

Hmm, since when is SHA-1 reversible?


* Re: Git Large Object Support Proposal
From: Scott Chacon @ 2009-03-20  0:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: david, git list

Hey,

On Thu, Mar 19, 2009 at 5:11 PM, Junio C Hamano <gitster@pobox.com> wrote:
> david@lang.hm writes:
>
>> On Thu, 19 Mar 2009, Junio C Hamano wrote:
>>
>>> Scott Chacon <schacon@gmail.com> writes:
>>>
>>>> The point is that we don't keep this data as 'blob's - we don't try to
>>>> compress them or add the header to them, they're too big and already
>>>> compressed, it's a waste of time and often outside the memory
>>>> tolerance of many systems. We keep only the stub in our db and stream
>>>> the large media content directly to and from disk.  If we do a
>>>> 'checkout' or something that would switch it out, we could store the
>>>> data in '.git/media' or the equivalent until it's uploaded elsewhere.
>>>
>>> Aha, that sounds like you can just maintain a set of out-of-tree symbolic
>>> links that you keep track of, and let other people (e.g. rsync) deal with
>>> the complexity of managing that side of the world.
>>>
>>> And I think you can start experimenting it without any change to the core
>>> datastructures.  In your single-page web site in which its sole html file
>>> embeds an mpeg movie, you keep track of these two things in git:
>>>
>>>      porn-of-the-day.html
>>>        porn-of-the-day.mpg -> ../media/6066f5ae75ec.mpg
>>>
>>> and any time you want to feed a new movie, you update the symlink to a
>>> different one that lives outside the source-controlled tree, while
>>> arranging the link target to be updated out-of-band.

It seems like the main problem here would be that most operations in
the working directory would overwrite not the symlink but the
file it points to.  If you do a simple 'cp ~/generated_file.mpg
porn-of-the-day.mpg' (to upload your newest and bestest porn), it will
overwrite the '../media/6066f5ae75ec.mpg' file rather than replacing
the symlink, so there is no new symlink for us to generate.  Then if
we haven't uploaded the '../media/6066f5ae75ec.mpg' file anywhere yet,
it's a goner.  Right?
What you are proposing is almost exactly what I want to do, but I'm
concerned with this issue of the symlink reference not working right
for normal working directory operations.  If a file is never
overwritten, however, this is basically identical to what I wanted to
do.

Scott


>>
>> that would work, but the proposed change has some advantages
>>
>> 1. you store the sha1 of the real mpg in the 'large file' blob so you
>> can detect problems
>
> You store the unique identifier of the real mpg in the symbolic link
> target which is a blob payload, so you can detect problems already.  I
> deliberately said "unique identifier"; you seem to think saying SHA-1
> brings something magical but I do not think it needs to be even blob's
> SHA-1.  Hashing that much data costs.
>
> In any case, you can have a script (or client-side hook) that does:
>
>    (1) find the out-of-tree symlinks in the index (or in the work tree);
>
>    (2) if it is dangling, and if you have definition of where to get that
>        hierarchy from (e.g ../media), run rsync or wget or whatever
>        external means to grab it.
>
> and call it after "git pull" updates from some other place.  The "git
> media" of Scott's message could be an alias to such a command.
>
> Adding a new type "external-blob" would be an unwelcome pain.  Reusing
> "blob" so that existing "blob" codepath now needs to notice special "0"
> that is not length "0" is even bigger pain than that.
>
> And that is a pain for unknown benefit, especially when you can start
> experimenting without any changes to the existing data structure.  In the
> worst case, the experiment may not pan out as well as you hoped and if
> that is the end of the story, so be it.  It is not a great loss.  If it
> works well enough and we can have the external large media support without
> any changes to the data structure, that would be really great.  If it
> sort-of works but hits limitation, we can analyze how best to overcome
> that limitation, and at that time it _might_ turn out to be the best
> approach to introduce a new blob type.
>
> But I do not think we know that yet.
>
> In the longer run, as you speculated in your message, I think the native
> blob codepaths need to be updated to tolerate a large, unmappable objects
> better.  With that goal in mind, I think it is a huge mistake to
> prematurely introduce an arbitrary distinct "blob" and "large blob" types,
> if in the end they need to be merged back again; it would force the future
> code indefinitely to care about the historical "large blob" types that was
> once supported.
>
>> 2. since it knows the sha1 of the real file, it can auto-create the
>> real file as needed, without wasting space on too many copies of it.
>
> Hmm, since when is SHA-1 reversible?
>


* Re: Git Large Object Support Proposal
From: david @ 2009-03-20  0:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Scott Chacon, git list

On Thu, 19 Mar 2009, Junio C Hamano wrote:

> david@lang.hm writes:
>
>> On Thu, 19 Mar 2009, Junio C Hamano wrote:
>>
>>> Scott Chacon <schacon@gmail.com> writes:
>>>
>>>> The point is that we don't keep this data as 'blob's - we don't try to
>>>> compress them or add the header to them, they're too big and already
>>>> compressed, it's a waste of time and often outside the memory
>>>> tolerance of many systems. We keep only the stub in our db and stream
>>>> the large media content directly to and from disk.  If we do a
>>>> 'checkout' or something that would switch it out, we could store the
>>>> data in '.git/media' or the equivalent until it's uploaded elsewhere.
>>>
>>> Aha, that sounds like you can just maintain a set of out-of-tree symbolic
>>> links that you keep track of, and let other people (e.g. rsync) deal with
>>> the complexity of managing that side of the world.
>>>
>>> And I think you can start experimenting it without any change to the core
>>> datastructures.  In your single-page web site in which its sole html file
>>> embeds an mpeg movie, you keep track of these two things in git:
>>>
>>> 	porn-of-the-day.html
>>>        porn-of-the-day.mpg -> ../media/6066f5ae75ec.mpg
>>>
>>> and any time you want to feed a new movie, you update the symlink to a
>>> different one that lives outside the source-controlled tree, while
>>> arranging the link target to be updated out-of-band.
>>
>> that would work, but the proposed change has some advantages
>>
>> 1. you store the sha1 of the real mpg in the 'large file' blob so you
>> can detect problems
>
> You store the unique identifier of the real mpg in the symbolic link
> target which is a blob payload, so you can detect problems already.  I
> deliberately said "unique identifier"; you seem to think saying SHA-1
> brings something magical but I do not think it needs to be even blob's
> SHA-1.  Hashing that much data costs.

but hashing the data and using that as the unique identifier gives you 
some advantages.

1. you can detect file corruption

2. you can trivially detect duplicates (even if the duplicates come from 
different sources)

3. it's repeatable (you will always get the same hash from the same input)

> In any case, you can have a script (or client-side hook) that does:
>
>    (1) find the out-of-tree symlinks in the index (or in the work tree);
>
>    (2) if it is dangling, and if you have definition of where to get that
>        hierarchy from (e.g ../media), run rsync or wget or whatever
>        external means to grab it.
>
> and call it after "git pull" updates from some other place.  The "git
> media" of Scott's message could be an alias to such a command.
>
> Adding a new type "external-blob" would be an unwelcome pain.  Reusing
> "blob" so that existing "blob" codepath now needs to notice special "0"
> that is not length "0" is even bigger pain than that.
>
> And that is a pain for unknown benefit, especially when you can start
> experimenting without any changes to the existing data structure.  In the
> worst case, the experiment may not pan out as well as you hoped and if
> that is the end of the story, so be it.  It is not a great loss.  If it
> works well enough and we can have the external large media support without
> any changes to the data structure, that would be really great.  If it
> sort-of works but hits limitation, we can analyze how best to overcome
> that limitation, and at that time it _might_ turn out to be the best
> approach to introduce a new blob type.
>
> But I do not think we know that yet.
>
> In the longer run, as you speculated in your message, I think the native
> blob codepaths need to be updated to tolerate a large, unmappable objects
> better.  With that goal in mind, I think it is a huge mistake to
> prematurely introduce an arbitrary distinct "blob" and "large blob" types,
> if in the end they need to be merged back again; it would force the future
> code indefinitely to care about the historical "large blob" types that was
> once supported.

valid point.

keep in mind that what's a "large, unmappable object" on one system may be 
no problem on another.

>> 2. since it knows the sha1 of the real file, it can auto-create the
>> real file as needed, without wasting space on too many copies of it.
>
> Hmm, since when is SHA-1 reversible?

when it is processing a new, unknown file it can hash it, and look to see 
if a file with that hash already exists. if so, the work is done; if not, 
it can create a file named by that hash.

by far the best long-term option would be to make all the codepaths handle 
unmappable files; the question is how large a task that would be.

David Lang


* Re: Git Large Object Support Proposal
From: Junio C Hamano @ 2009-03-20  0:41 UTC (permalink / raw)
  To: Scott Chacon; +Cc: git list

Junio C Hamano <gitster@pobox.com> writes:

> Scott Chacon <schacon@gmail.com> writes:
>
>> The point is that we don't keep this data as 'blob's - we don't try to
>> compress them or add the header to them, they're too big and already
>> compressed, it's a waste of time and often outside the memory
>> tolerance of many systems. We keep only the stub in our db and stream
>> the large media content directly to and from disk.  If we do a
>> 'checkout' or something that would switch it out, we could store the
>> data in '.git/media' or the equivalent until it's uploaded elsewhere.
>
> Aha, that sounds like you can just maintain a set of out-of-tree symbolic
> links that you keep track of, and let other people (e.g. rsync) deal with
> the complexity of managing that side of the world.
>
> And I think you can start experimenting it without any change to the core
> datastructures.  In your single-page web site in which its sole html file
> embeds an mpeg movie, you keep track of these two things in git:
>
> 	porn-of-the-day.html
>       porn-of-the-day.mpg -> ../media/6066f5ae75ec.mpg
>
> and any time you want to feed a new movie, you update the symlink to a
> different one that lives outside the source-controlled tree, while
> arranging the link target to be updated out-of-band.

I wasn't thinking clearly.

This is not really a new "huge blob" type but is just a slightly different
flavor of symbolic link.  Its link target name may resemble a SHA-1 object
name, but it does not participate in the reachability computation.  It
won't be fetched or pushed, and if you ever get one via the usual git
codepath into your object store, it will be subject to "git gc", but you
are unlikely to place it inside your object store to begin with.  You have
something like:

    100644 2222222222222222222222222222222222222222 porn-of-the-day.html
    120001 5ed22400803161de2f49331d005be424b7f6d036 porn-of-the-day.mpg

where 5ed22400803161de2f49331d005be424b7f6d036 is a blob that stores the
name of a regular blob object, 6ff87c4664981e4397625791c8ea3bbb5f2279a3,
in your tree object (and in the index), and:

 * When running "git media", you have a configuration to tell it where the
   external media files are kept (e.g. ../media in the previous example),
   and it rsyncs to ../media/6ff87c4664981e4397625791c8ea3bbb5f2279a3 in
   some unspecified way from some unspecified place;

 * When checking out porn-of-the-day.mpg, it becomes a symbolic link that
   points at ../media/6ff87c4664981e4397625791c8ea3bbb5f2279a3 (because it
   follows the same site-specific configuration);

 * When comparing the index (that records the 120001 "slightly different
   symbolic link" entry with the shell blob object) and the work tree
   (that has a symbolic link that points at ../media/6ff87c46649...), you
   do not look at the contents of the ../media/6ff87c46649... file, but
   you do look at its name, apply the reverse of the mapping the "checkout"
   codepath did to arrive at the 6ff87c4664981e4397625791c8ea3bbb5f2279a3
   SHA-1, and compare that with what the shell blob object records.  If you
   updated the symbolic link in the work tree, "git add" would result in
   creating a new shell object (just like when you change the link target
   for a normal symbolic link) that records the external blob.
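
In code, the two directions of that mapping would be something as small
as this (a sketch only; the names and the ../media location come from
the example above, nothing more):

    import os

    MEDIA_DIR = "../media"   # site-specific location, as in the example above

    def checkout_media_link(shell_payload, link_path):
        # forward mapping: the shell blob's payload (the external blob name)
        # becomes a symlink pointing into the out-of-tree media area
        target = os.path.join(MEDIA_DIR, shell_payload.strip())
        if os.path.lexists(link_path):
            os.unlink(link_path)
        os.symlink(target, link_path)

    def media_name_from_worktree(link_path):
        # reverse mapping for index vs. work tree comparison: recover the
        # external blob name from the link target without reading the media
        return os.path.basename(os.readlink(link_path))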

It still is bothersome that we need to introduce a new tree nodetype
(rather, a new blob subtype similar to "regular file blob", "symlink
blob"), but it is of much less impact than what I originally
misunderstood.

Having said that, if that is what is happening, I do not see the need for
the payload to be even a blob SHA-1 name.  Any identifier that is
convenient to generate in the application domain could do.

But that is a minor detail that immediately popped out at me; there may be
other minor details I may find objectionable later.  But overall, I think
your proposal makes sense.

I still think a large part of the preliminary experiments to see the benefit
of this approach can and should be done without and before touching the
core part (like the introduction of the slightly different symlink 120001
mode), though.


* Re: Git Large Object Support Proposal
From: Jeff King @ 2009-03-20  4:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Scott Chacon, git list

On Thu, Mar 19, 2009 at 04:44:49PM -0700, Junio C Hamano wrote:

> Aha, that sounds like you can just maintain a set of out-of-tree symbolic
> links that you keep track of, and let other people (e.g. rsync) deal with
> the complexity of managing that side of the world.
> 
> And I think you can start experimenting it without any change to the core
> datastructures.  In your single-page web site in which its sole html file
> embeds an mpeg movie, you keep track of these two things in git:
> 
> 	porn-of-the-day.html
>         porn-of-the-day.mpg -> ../media/6066f5ae75ec.mpg
> 
> and any time you want to feed a new movie, you update the symlink to a
> different one that lives outside the source-controlled tree, while
> arranging the link target to be updated out-of-band.

I have a repo like this (not porn, but large files :) ) and I use a
similar solution. Instead of large blobs, I have stub files containing a
URL, and the make process pulls them as necessary. It works pretty well
in practice. I don't bother with naming the files by sha-1 but instead
give them human-readable names, since in my case they are generally
immutable (i.e., once a name is assigned, the content doesn't change).
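
The fetch step is tiny; a sketch along these lines (the media/*.url
layout is just for illustration, not exactly my setup):

import glob, os, urllib.request

def fetch_stubs(pattern="media/*.url"):
    # each <name>.url stub holds the download URL for <name>; fetch any
    # that are not present yet (the layout and naming are illustrative)
    for stub in glob.glob(pattern):
        real = stub[:-len(".url")]
        if os.path.exists(real):
            continue
        with open(stub) as f:
            url = f.read().strip()
        urllib.request.urlretrieve(url, real)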

-Peff

