* GSoC - Some questions on the idea of "Better big-file support"
  From: Bo Chen
  Date: 2012-03-28 4:38 UTC
  To: git; Cc: peff

Hi, everyone. This is Bo Chen. I am interested in the idea of "Better
big-file support".

As described on the ideas page: "Many large files (like media) do not
delta very well. However, some do (like VM disk images). Git could
split large objects into smaller chunks, similar to bup, and find
deltas between these much more manageable chunks. There are some
preliminary patches in this direction, but they are in need of review
and expansion."

Can anyone elaborate a little on why many large files do not delta
well? Is it a general problem, or one specific to Git? I am really new
to Git; can anyone give me some hints on which source files I should
read to learn about the current delta code? It is said that "there are
some preliminary patches in this direction"; where can I find these
patches?

I will appreciate any help. Thanks.

Bo Chen
* Re: GSoC - Some questions on the idea of "Better big-file support"
  From: Nguyen Thai Ngoc Duy
  Date: 2012-03-28 6:19 UTC
  To: Bo Chen; Cc: git, peff

On Wed, Mar 28, 2012 at 11:38 AM, Bo Chen <chen@chenirvine.org> wrote:
> Can anyone elaborate a little on why many large files do not delta
> well?

Large files are usually binary. Depending on the type of binary, they
may or may not delta well. Those that are compressed or encrypted
obviously don't delta well, because one change can make the final
result completely different.

Another problem with delta-ing large files in git is that the current
code needs to load both files into memory. Consuming 4GB to delta two
2GB files does not sound good.

> Is it a general problem, or one specific to Git? I am really new to
> Git; can anyone give me some hints on which source files I should
> read to learn about the current delta code? It is said that "there
> are some preliminary patches in this direction"; where can I find
> these patches?

Read about the rsync algorithm [2]. Bup [1] implements the same (I
think) algorithm, but on top of git. For the preliminary patches, have
a look at the jc/split-blob series at commit 4a1242d in git.git.

[1] https://github.com/apenwarr/bup
[2] http://en.wikipedia.org/wiki/Rsync#Algorithm
--
Duy
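(A small experiment makes Duy's point about compressed data concrete.
This is only a sketch; it assumes GNU coreutils and the standalone
xdelta3 tool, which is unrelated to git's internal delta code but
convenient for illustration.)

  # make a compressible file and a copy with one changed line
  seq 1 200000 >base.txt
  sed '100s/.*/CHANGED/' base.txt >changed.txt

  # compress both, then delta the plain pair and the gzipped pair
  gzip -cn <base.txt >base.gz
  gzip -cn <changed.txt >changed.gz
  xdelta3 -e -f -s base.txt changed.txt plain.delta
  xdelta3 -e -f -s base.gz changed.gz gz.delta

  # the plain delta is tiny; the gzipped delta is roughly the size of
  # the whole compressed file, because the one change perturbs the
  # compressed stream from that point onward
  ls -l plain.delta gz.delta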
* Re: GSoC - Some questions on the idea of
  From: Sergio
  Date: 2012-03-28 11:33 UTC
  To: git

Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:
> Large files are usually binary. Depending on the type of binary,
> they may or may not delta well. Those that are compressed or
> encrypted obviously don't delta well, because one change can make
> the final result completely different.

I would add that the larger a file, the larger the temptation to use a
compressed format for it, so large files are often compressed
binaries.

For these, a trick to obtain good deltas can be to decompress before
splitting into chunks with the rsync algorithm. Git filters can
already be used for this, but it can be tricky to ensure that the
decompress-recompress roundtrip re-creates the original compressed
file.

Furthermore, some compressed binaries are internally composed of
multiple streams (think of a zip archive containing multiple files,
but this is by no means limited to zip). In this case it is common to
have many possible orderings of the streams. If so, the best deltas
can be obtained by sorting the streams into some 'canonical' order and
decompressing. Even without decompressing, sorting alone can give good
results, as long as the changes are confined to a single stream of the
container. Personally, I know of no example of git filters used to
perform this sorting, which can be extremely tricky in assuring the
possibility of recovering the file in its original stream order.

Maybe (but this is just speculation), once the bup-inspired file
chunking support is in place, people will start contributing filters
to improve the management of many types of standard files (obviously
'improve' in terms of space efficiency, as filters can be quite slow).

Sergio
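(A minimal sketch of the filter trick Sergio describes, for gzipped
files: store them decompressed so they delta, and recompress on
checkout. The filter name "gz" is made up for this example, and his
roundtrip caveat applies: gzip only reproduces the original bytes if
the file was compressed with the same tool and settings.)

  git config filter.gz.clean 'gzip -cd'    # check-in: decompress
  git config filter.gz.smudge 'gzip -cn'   # checkout: recompress
  echo '*.gz filter=gz' >>.gitattributes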
* Re: GSoC - Some questions on the idea of
  From: Bo Chen
  Date: 2012-03-30 19:44 UTC
  To: Sergio; Cc: git

The following is a list of sub-problems, according to my understanding
of the "big file support" problem. Can anyone give some feedback and
help refine it? Thanks.

  large file -+- text file (always deltas well? needs to be confirmed)
              |
              +- binary file -+- general binary file (no encryption or
                              |  compression, nor other transforms
                              |  that definitely prevent good deltas)
                              |   +- deltas well (ok)
                              |   +- does not delta well (improvement?)
                              |
                              +- encrypted file (improvement? one
                              |  straightforward method is to decrypt
                              |  the file before delta-ing it, but we
                              |  don't always have the key. Other?)
                              |
                              +- compressed file (improvement?
                                 decompress before delta-ing? Other?)

Bo

On Wed, Mar 28, 2012 at 7:33 AM, Sergio <sergio.callegari@gmail.com> wrote:
> I would add that the larger a file, the larger the temptation to use
> a compressed format for it, so large files are often compressed
> binaries.
> [...]
* Re: GSoC - Some questions on the idea of
  From: Bo Chen
  Date: 2012-03-30 19:51 UTC
  To: Sergio; Cc: git

Please disregard my last email; here is a more readable version.

The sub-problems of the "delta for large files" problem:

1 large file
  1.1 text file (always deltas well? needs to be confirmed)
  1.2 binary file
    1.2.1 general binary file (no encryption or compression, nor
          other transforms that definitely prevent good deltas)
      1.2.1.1 deltas well (ok)
      1.2.1.2 does not delta well (improvement?)
    1.2.2 encrypted file (improvement? one straightforward method is
          to decrypt the file before delta-ing it, but we don't
          always have the key for decryption. Other?)
    1.2.3 compressed file (improvement? decompress before delta-ing
          it? Other?)

Can anyone give me feedback to further refine this? Thanks.

Bo

On Wed, Mar 28, 2012 at 7:33 AM, Sergio <sergio.callegari@gmail.com> wrote:
> I would add that the larger a file, the larger the temptation to use
> a compressed format for it, so large files are often compressed
> binaries.
> [...]
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-03-30 20:34 UTC
  To: Bo Chen; Cc: Sergio, git

On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote:
> The sub-problems of the "delta for large files" problem:
>
> 1 large file
>   1.1 text file (always deltas well? needs to be confirmed)

They often do, but text files don't tend to be large. There are some
exceptions (e.g., genetic data is often kept in line-oriented text
files, but is very large).

But let's take a step back for a moment. Forget about whether a file
is binary or not. Imagine you want to store a very large file in git.
What are the operations that will perform badly? How can we make them
perform acceptably, and what tradeoffs must we make? E.g., the way the
diff code is written, it would be very difficult to run "git diff" on
a 2 gigabyte file. But is that actually a problem? Answering that
means talking about the characteristics of 2 gigabyte files, what we
expect to see, and to what degree our tradeoffs will impact them.

Here's a more concrete example. At first, even storing a 2 gigabyte
file with "git add" was painful, because we would load the whole thing
in memory. Repacking the repository was painful, because we had to
rewrite the whole 2G file into a packfile. Nowadays, we stream large
files directly into their own packfiles, and we pay the I/O only once
(and the memory cost never). As a tradeoff, we no longer get delta
compression of large objects. That's OK for some large objects, like
movie files (which don't tend to delta well anyway). But it's not for
other objects, like virtual machine images, which do tend to delta
well.

So can we devise a solution which efficiently stores these
delta-friendly objects, without losing the performance improvements we
got with the stream-directly-to-packfile approach?

One possible solution is breaking large files into smaller chunks
using something like the bupsplit algorithm (and I won't go into the
details here, as links to bup have already been mentioned elsewhere,
and Junio's patches make a start at this sort of splitting).

Note that there are other problem areas with big files that can be
worked on, too. For example, some people want to store 100 gigabytes
in a repository. Because git is distributed, that means 100G in the
repo database and 100G in the working directory, for a total of 200G.
People in this situation may want to store part of the repository
database in a network-accessible location, trading some of the
convenience of being fully distributed for the space savings. So
another project could be designing a network-based alternate object
storage system.

-Peff
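(The streaming behavior Peff describes can be seen directly. A sketch,
assuming git v1.7.8 or later, where blobs over core.bigFileThreshold
are streamed into a pack at "git add" time, and a scratch directory:)

  git init bigdemo && cd bigdemo
  git config core.bigFileThreshold 1m
  dd if=/dev/urandom of=disk.img bs=1M count=16
  git add disk.img           # streamed, never held whole in memory
  ls .git/objects/pack/      # the blob already sits in its own pack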
* Re: GSoC - Some questions on the idea of
  From: Bo Chen
  Date: 2012-03-30 23:08 UTC
  To: Jeff King; Cc: Sergio, git

Thanks for the quick reply. My comments are inline below.

On Fri, Mar 30, 2012 at 4:34 PM, Jeff King <peff@peff.net> wrote:
> Nowadays, we stream large files directly into their own packfiles,
> and we pay the I/O only once (and the memory cost never). As a
> tradeoff, we no longer get delta compression of large objects.
> That's OK for some large objects, like movie files (which don't tend
> to delta well anyway). But it's not for other objects, like virtual
> machine images, which do tend to delta well.

It seems that we should first provide some mechanism to distinguish
delta-friendly objects from non-delta-friendly objects. I am wondering
whether such an algorithm is available now, or still needs to be
devised.

> So can we devise a solution which efficiently stores these
> delta-friendly objects, without losing the performance improvements
> we got with the stream-directly-to-packfile approach?

Ah, I see. Designing an efficient solution for storing delta-friendly
objects is the main concern. Thank you for helping me clarify this
point.

> Note that there are other problem areas with big files that can be
> worked on, too. For example, some people want to store 100 gigabytes
> in a repository. [...] So another project could be designing a
> network-based alternate object storage system.

From the architecture point of view, CVS is fully centralized and git
is fully distributed. It seems that for big repos, the architecture
described above now sits in the middle ^-^.

Bo
* Re: GSoC - Some questions on the idea of
  From: Sergio Callegari
  Date: 2012-03-31 11:02 UTC
  To: Bo Chen; Cc: Jeff King, git

I wonder if it could make sense to have some pluggable mechanism for
file splitting, something along the lines of filters, so to say.
Bupsplit can be a rather general mechanism, but large binaries that
are containers (zip, jar, docx, tgz, pdf - seen as a collection of
streams) may be more conveniently split by their inherent components.
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-03-31 16:18 UTC
  To: Sergio Callegari; Cc: Bo Chen, Jeff King, git

On 3/31/2012 6:02 AM, Sergio Callegari wrote:
> I wonder if it could make sense to have some pluggable mechanism for
> file splitting, something along the lines of filters, so to say.
> Bupsplit can be a rather general mechanism, but large binaries that
> are containers (zip, jar, docx, tgz, pdf - seen as a collection of
> streams) may be more conveniently split by their inherent
> components.

gitattributes or gitconfig could configure the big-file handler for
specified files. Known/supported filetypes like gif, png, zip, pdf,
etc., could be auto-configured by git. Any yet-unknown or
yet-unsupported filetypes could be configured manually by the user,
e.g.:

  *.zgp=bigcontainer

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-04-02 21:07 UTC
  To: Neal Kreitzinger; Cc: Sergio Callegari, Bo Chen, git

On Sat, Mar 31, 2012 at 11:18:16AM -0500, Neal Kreitzinger wrote:
> gitattributes or gitconfig could configure the big-file handler for
> specified files. Known/supported filetypes like gif, png, zip, pdf,
> etc., could be auto-configured by git. Any yet-unknown or
> yet-unsupported filetypes could be configured manually by the user,
> e.g.:
>
>   *.zgp=bigcontainer

This is a tempting route (and one I've even suggested myself before),
but I think ultimately it is a bad way to go. The problem is that
splitting is only half of the equation. Once you have split contents,
you have to use them intelligently, which means looking at the sha1s
of each split chunk and discarding whole chunks as "the same" without
even looking at the contents.

Which means that it is very important that your chunking algorithm
remain stable from version to version. A change in the algorithm is
going to completely negate the benefits of chunking in the first
place. So something configurable, or something that is not applied
consistently (because it depends on each user's git config, or even on
the specific version of a tool used), can end up being no help at all.

Properly applied, I think a content-aware chunking algorithm could
out-perform a generic one. But I think we need to first find out
exactly how well the generic algorithm can perform. It may be "good
enough" compared to the hassle that inconsistent application of a
content-aware algorithm will cause. So I wouldn't rule it out, but I'd
rather try the bup-style splitting first, and see how good (or bad) it
is.

-Peff
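(An illustration of why boundary stability matters. Fixed-size
splitting stands in here for a real content-defined chunker; the point
is only that two different splitting schemes share no chunks at all:)

  seq 1 100000 >big.txt
  split -b 8192 big.txt a- && sha1sum a-* | sort >chunks-a
  split -b 4096 big.txt b- && sha1sum b-* | sort >chunks-b
  # count chunk ids common to both splits; prints 0, so a repository
  # that switched algorithms would store and send everything again
  comm -12 <(cut -d' ' -f1 chunks-a) <(cut -d' ' -f1 chunks-b) | wc -l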
* Re: GSoC - Some questions on the idea of
  From: Sergio Callegari
  Date: 2012-04-03 9:58 UTC
  To: Jeff King; Cc: Neal Kreitzinger, Bo Chen, git

On 02/04/2012 23:07, Jeff King wrote:
> Which means that it is very important that your chunking algorithm
> remain stable from version to version. A change in the algorithm is
> going to completely negate the benefits of chunking in the first
> place. So something configurable, or something that is not applied
> consistently (because it depends on each user's git config, or even
> on the specific version of a tool used), can end up being no help at
> all.

Isn't this the same with filters? The clean algorithms should remain
stable from version to version. Filters are often perceived as
simpler, so this stability seems easier to achieve, but that is not
necessarily the case.

> Properly applied, I think a content-aware chunking algorithm could
> out-perform a generic one. But I think we need to first find out
> exactly how well the generic algorithm can perform. It may be "good
> enough" compared to the hassle that inconsistent application of a
> content-aware algorithm will cause.

Absolutely true, but why not give the user the freedom to choose? Git
could provide the bupsplit mechanism and at the same time have a means
for the user to plug in a different machinery for specific file types.
In this case, it is the user's responsibility to do it right.

One could have a special 'filter' for splitting/unsplitting. Say:

  [splitfilter "XXX"]
      split = xxx
      unsplit = uxxx

xxx is given the file to split on stdin and returns on stdout a stream
made of an index header and the concatenation of the parts into which
the file should be split. For unsplitting, uxxx is given on stdin the
index and the concatenation of parts, and returns the binary file on
stdout. bupsplit and bupunsplit could be built in, with other tools
being user-provided. If the user gets them wrong, it is ultimately
his/her responsibility; in the end, the user is given even 'rm', isn't
he/she?

Git could provide a header file defining the index header format, to
help the coding of alternative, more specific splitters. If people
devise some that look promising, they can probably be collected in
contrib.

Possibly, the index header could comprise starting positions for the
various parts in the stream, but also 'names' for them. This would
allow reusing blob and tree objects to physically store the various
parts. For bupsplit, names could be flat (e.g. sequence numbers like
0000, 0001). For files that are containers, they could reflect the
inner names. Prospectively, one could even devise specific diff tools
for these 'special' trees of split-object components. With this, when
storing, say, a very large zip file in git, these tools could help say
things like "from version x to version y, only that specific part of
the zip file has changed".

Sergio
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-11 1:24 UTC
  To: Jeff King; Cc: Sergio Callegari, Bo Chen, git

On 4/2/2012 4:07 PM, Jeff King wrote:
> ...I think we need to first find out exactly how well the generic
> algorithm can perform. It may be "good enough" compared to the
> hassle that inconsistent application of a content-aware algorithm
> will cause. So I wouldn't rule it out, but I'd rather try the
> bup-style splitting first, and see how good (or bad) it is.

(I read the bup DESIGN doc to see what bup-style splitting is.) When
you use bup delta technology in git.git, I take it that you will use
it for big-worktree-files *and* big-history-files (not-big worktree
files that are not xdelta-friendly)? IOW, all binaries plus big text
worktree files. Otherwise, small binaries will grow large histories.

If small binaries are not going to be bup-delta-compressed, then what
about using xxd to convert the binary to text and then
xdelta-compressing the hex dump to achieve efficient delta compression
in the pack file? You could convert the hex dump back to binary with
xxd for checkout and such.

Maybe small binaries xdelta well and the above is a moot point.

This is all theory to me, but the reality is looming over my head,
since most of the components I should be tracking are binaries, small
(large history?) and big. I am not yet tracking them because of
"big-file" concerns: I don't want to have to refactor my vast git
ecosystem with filter-branch later because I slammed binaries into the
main project or superproject without proper systems programming. (I'm
not sure what the c/linux term is for 'systems programming', but in
the mainframe world it meant making sure everything was configured for
efficient performance.)

Now that I say that out loud, I guess a superproject with binaries in
separate repos could be easily refactored by creating new, efficient
repos and making a new commit that points to them instead of the old,
inefficient repos. That way, when someone checks out the binary repo
(submodule) into their worktree, they get the new efficiency instead
of the old inefficiency. Over time, as folks become less likely to
check out old stuff, the old inefficiency goes away on its own. I
think. (Submodules are mostly theory to me at this point, also.)

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jonathan Nieder
  Date: 2012-04-11 6:04 UTC
  To: Neal Kreitzinger; Cc: Jeff King, Sergio Callegari, Bo Chen, git

Neal Kreitzinger wrote:
> Maybe small binaries xdelta well and the above is a moot point.

If I am reading it correctly, diff-delta copes fine with smallish
binary files that have not changed much. Converting to hex would only
hurt.

I would suggest tracking source code instead of binaries if possible,
though.

Jonathan
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-11 16:29 UTC
  To: Jonathan Nieder; Cc: Jeff King, Sergio Callegari, Bo Chen, git

On 4/11/2012 1:04 AM, Jonathan Nieder wrote:
> If I am reading it correctly, diff-delta copes fine with smallish
> binary files that have not changed much. Converting to hex would
> only hurt.

How do I check the history size of a binary? IOW, how do I check the
total size of all the delta compressions plus the root blob of a
binary? That way I can sample different binary types to get a
symptomatic idea of how well they delta-compress. I suspect that
compiled binaries will compress well (efficient history) and graphics
files may not compress well (large history).

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-04-11 22:09 UTC
  To: Neal Kreitzinger; Cc: Jonathan Nieder, Sergio Callegari, Bo Chen, git

On Wed, Apr 11, 2012 at 11:29:50AM -0500, Neal Kreitzinger wrote:
> How do I check the history size of a binary? IOW, how do I check the
> total size of all the delta compressions plus the root blob of a
> binary? That way I can sample different binary types to get a
> symptomatic idea of how well they delta-compress.

I don't think there is a simple command to do it. You have to
correlate blobs at a given path with objects in the packs yourself.
You can script it like:

  # Get the delta stats from every pack; you only need to do this
  # part once for a given history state. And obviously you would want
  # to repack before doing it.
  for i in .git/objects/pack/*.pack; do
      git verify-pack -v $i
  done |
  perl -lne '
      # format is: sha1 type size size-in-pack offset; pick out only
      # the thing we care about: size in pack
      /^([0-9a-f]{40}) \S+\s+\d+ (\d+)/ and print "$1 $2";
  ' |
  sort >delta-stats

  # Then you can do this for every path you are interested in.
  # First, get the list of blobs at that path (and follow renames,
  # too). The second line picks the "after" sha1 out of the --raw
  # output.
  git log --follow --raw --no-abbrev $path |
  perl -lne '/:\S+ \S+ \S{40} (\S{40})/ and print $1' |
  sort -u >blobs

  # Then find the delta stats for those blobs.
  join blobs delta-stats

which should give you the stored size of each version of a file.

-Peff
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-11 16:35 UTC
  To: Jonathan Nieder; Cc: Jeff King, Sergio Callegari, Bo Chen, git

On 4/11/2012 1:04 AM, Jonathan Nieder wrote:
> If I am reading it correctly, diff-delta copes fine with smallish
> binary files that have not changed much. Converting to hex would
> only hurt.
>
> I would suggest tracking source code instead of binaries if
> possible, though.

Is there documentation out there that lists the common binary formats
(e.g., pdf, docx, gif, jpg, png, bmp, mpeg, mp3, zip, c binaries, java
stuff, website stuff, etc.), explains their nature (container,
compressed, encrypted, etc.), and describes how well they currently
delta in git.git within specified size boundaries and use cases (pdfs
with only plain text vs. pdfs with graphics, tables, etc.), so that
git users can reference it when making repo/superproject design
decisions?

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-11 16:44 UTC
  To: Jonathan Nieder; Cc: Jeff King, Sergio Callegari, Bo Chen, git

On 4/11/2012 1:04 AM, Jonathan Nieder wrote:
> I would suggest tracking source code instead of binaries if
> possible, though.

I suppose the original "source" in git (the linux kernel) was so low
level that it had no graphics files. However, most projects are
end-user projects and have graphics, so I would think that tracking
them is a normal, expected use of git to version your software. If
you're going to do that, then there shouldn't be a problem tracking
other binaries that are static constants across all servers (as
opposed to user-edited content like databases). I would consider this
subset of "binaries" to be the expected domain of git revision control
for software, i.e., gui software.

Graphics files for your app are "source". The binary is all you have.
It's the "source" that you edit to make changes. Maybe I'm missing
something here. Maybe graphics files are "container" files, and that
makes them a problem.

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jonathan Nieder
  Date: 2012-04-11 17:20 UTC
  To: Neal Kreitzinger; Cc: Jeff King, Sergio Callegari, Bo Chen, git

Neal Kreitzinger wrote:
> Graphics files for your app are "source". The binary is all you
> have.

Often there is source in SVG or some other simple editable format that
gets lossily compiled to PNG or JPEG compressed raster graphics.
* Re: GSoC - Some questions on the idea of
  From: Junio C Hamano
  Date: 2012-04-11 18:51 UTC
  To: Jonathan Nieder; Cc: Neal Kreitzinger, Jeff King, Sergio Callegari, Bo Chen, git

Jonathan Nieder <jrnieder@gmail.com> writes:
> Often there is source in SVG or some other simple editable format
> that gets lossily compiled to PNG or JPEG compressed raster
> graphics.

You could have just underlined the "if possible" part in your earlier
message and ended this thread, which seems to be needlessly
continuing.
* Re: GSoC - Some questions on the idea of
  From: Jonathan Nieder
  Date: 2012-04-11 19:03 UTC
  To: Junio C Hamano; Cc: Neal Kreitzinger, Jeff King, Sergio Callegari, Bo Chen, git

Junio C Hamano wrote:
> You could have just underlined the "if possible" part in your
> earlier message

Yes, that's a more important point than the one I responded to. :)
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-11 18:23 UTC
  To: Jonathan Nieder; Cc: Jeff King, Sergio Callegari, Bo Chen, git

On 4/11/2012 1:04 AM, Jonathan Nieder wrote:
> I would suggest tracking source code instead of binaries if
> possible, though.

Reasons why we want to track binaries:

(1) Standard targets: our deployment is assembly-line style, because
our target servers are under our control.

(2) Copy vs. recompile: we run certain "supported" linux distro
versions on our target servers, so we can just put our binaries on
them instead of recompiling.

(3) In-house-source compiled binaries: for our particular proprietary
(third-party) source language, the binaries run on top of a runtime
that runs on top of the O/S, so the need to recompile on a server is a
non-issue. We use xxd and compile listings to "diff" our compiled
binaries to detect missing copybook and data-dictionary dependencies
(missed recompiles), unnecessary recompiles (you didn't really change
what you thought you changed), and miscompiles. We do this
compiled-binary validation in git branches and then diff the branches
to detect the discrepancies.

(4) Proprietary third-party binaries (no source): for our third-party
binaries we don't have the source. They are distributed as
self-extracting executables. Changes to third-party binaries are
relatively infrequent, but frequent enough to cause confusion, and
therefore need to be tracked.

(5) Graphics "source" versioning: our graphics files are part of our
software, and changes to them need to be tracked.

(6) O/S versioning: our linux distro is tracked in a bazaar repo, so
I'm thinking we should be able to track it in a git repo instead. The
assembly line just deploys the payload to a new server instead of
doing a manual install.

(7) Superproject tracking of a "super-release": the above subsystems
are related in varying degrees (dependent). A superproject can
associate all the versions that comprise a "super" release of the
various subsystem version dependencies.

While some of the reasons above may be non-normative for some
git-users, I think a large portion (if not the majority) of git-users
will find some subset of them (namely reasons 5 and 4) normative for
their use-cases, making the need for binary tracking normative for
git-users in general.

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-04-11 21:35 UTC
  To: Neal Kreitzinger; Cc: Sergio Callegari, Bo Chen, git

On Tue, Apr 10, 2012 at 08:24:48PM -0500, Neal Kreitzinger wrote:
> (I read the bup DESIGN doc to see what bup-style splitting is.) When
> you use bup delta technology in git.git, I take it that you will use
> it for big-worktree-files *and* big-history-files

I'm not sure what those terms mean. We are talking about files at the
blob level. So they are either big or not big. We don't know how they
will delta, or what their histories will be like.

> (not-big worktree files that are not xdelta-friendly)? IOW, all
> binaries plus big text worktree files. Otherwise, small binaries
> will grow large histories.

Files that don't delta won't be helped by splitting, as it is just
another form of finding deltas (in fact, it should produce worse
results than xdelta, because it works at a larger granularity; its
advantage is that it is not as memory- or CPU-hungry as something like
xdelta). So you really only want to use this for files that are too
big to practically run through the regular delta algorithm. And if you
can avoid it on files that will never delta well, you are better off
(because it adds storage overhead over a straight blob).

The first part is easy: only do it for files that are so big that you
can't run the regular delta algorithm. Since your only alternative is
doing nothing, you only have to perform better than nothing. :)

The second part is harder. We generally don't know that a file doesn't
delta well until we have two versions of it to try[1]. And that's
where some domain-specific knowledge can come in (e.g., knowing that a
file is compressed video, and that future versions are likely to
differ in the video content). But sometimes the results can be
surprising. I keep a repository of photos and videos, carefully
annotated via exif tags. If the media content changes, the results
won't delta well. But if I change the exif tags, they _do_ delta very
well. So whether something like bupsplit is a win depends on the exact
update patterns.

[1] I wonder if you could do some statistical analysis on the
randomness of the file content to determine this. That is, things
which look very random are probably already heavily compressed, and
are not going to compress further. You might guess that to mean they
will not delta well, either. And sometimes that is true. But the
example I gave above violates it (most of the file is random, but the
_changes_ from version to version are not random, and that is what the
delta is compressing).

> If small binaries are not going to be bup-delta-compressed, then
> what about using xxd to convert the binary to text and then
> xdelta-compressing the hex dump to achieve efficient delta
> compression in the pack file? You could convert the hex dump back to
> binary with xxd for checkout and such.

That wouldn't help. You are only trading the binary representation for
a less efficient one. The data patterns will not change. The
redundancy you introduced in the first step may mostly come out via
compression, but it will never be a net win. I'm sure if I were a
better computer scientist I could write you some proof involving
Shannon entropy. But here's a fun experiment:

  # create two files, one very compressible and one not very
  # compressible
  dd if=/dev/zero of=boring.bin bs=1M count=1
  dd if=/dev/urandom of=rand.bin bs=1M count=1

  # now make hex dumps of each, and compress the original and the
  # hex dump
  for i in boring rand; do
      xxd <$i.bin >$i.hex
      for j in bin hex; do
          gzip -c <$i.$j >$i.$j.gz
      done
  done

  # and look at the results
  du {boring,rand}.*

I get:

  1024  boring.bin
  4     boring.bin.gz
  4288  boring.hex
  188   boring.hex.gz
  1024  rand.bin
  1028  rand.bin.gz
  4288  rand.hex
  2324  rand.hex.gz

So you can see that the thing that compresses well will do so in
either representation, but the end result is a net loss with the less
efficient representation. Whereas the thing that does not compress
well will achieve a better compression ratio in its text form, but
will still be a net loss. The reason is that you are just compressing
out all of the redundant bits.

You might observe that this is using gzip, not xdelta. But I think
from an information theory standpoint, they are two sides of the same
coin (e.g., you could consider a delta between two things to be
equivalent to concatenating them and compressing the result). You
should be able to design a similar experiment with xdelta.

> Maybe small binaries xdelta well and the above is a moot point.

Some will and some will not. But it has nothing to do with whether
they are binary, and everything to do with the type of content they
store (or, if binariness does matter, then our delta algorithms should
be improved).

> This is all theory to me, but the reality is looming over my head,
> since most of the components I should be tracking are binaries,
> small (large history?) and big. [...] I don't want to have to
> refactor my vast git ecosystem with filter-branch later because I
> slammed binaries into the main project or superproject without
> proper systems programming.

One of the things that makes bup not usable as-is for git is that it
fundamentally changes the object identities. It would be very easy for
"git add" to bupsplit a file into a tree and store that tree using git
(in fact, that is more or less how bup works). But that means the
resulting object sha1 is going to depend on the splitting choices
made.

Instead, we want to consider the split version of an object to be
simply an on-disk representation detail, just as it is a
representation detail that some objects are stored delta-encoded
inside packs versus as loose objects; the sha1 of the object is the
same, and we can reconstruct it byte-for-byte when we want to.

So properly implemented, no, you would not ever have to filter-branch
to tweak these settings. You might have to repack to see the gains
(because you want to delete the old, non-split representation in your
pack and replace it with a split one), but that is transparent to
git's abstract data model.

-Peff
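(One possible version of the xdelta experiment Peff suggests, reusing
the boring.bin/rand.bin files and their hex dumps from his script
above; it assumes the standalone xdelta3 tool is installed:)

  for i in boring rand; do
      # make a second version differing by a single byte
      cp $i.bin $i-2.bin
      printf x | dd of=$i-2.bin bs=1 seek=1000 conv=notrunc
      xxd <$i-2.bin >$i-2.hex
      xdelta3 -e -f -s $i.bin $i-2.bin $i.bin.delta
      xdelta3 -e -f -s $i.hex $i-2.hex $i.hex.delta
  done
  du {boring,rand}.{bin,hex}.delta

  # all four deltas come out small; the hex form still loses overall,
  # because the 4x larger base version must be stored alongside them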
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-12 19:29 UTC
  To: Jeff King; Cc: Sergio Callegari, Bo Chen, git

On 4/11/2012 4:35 PM, Jeff King wrote:
> So properly implemented, no, you would not ever have to
> filter-branch to tweak these settings. You might have to repack to
> see the gains (because you want to delete the old, non-split
> representation in your pack and replace it with a split one), but
> that is transparent to git's abstract data model.

I'm likely going to have to slam graphics files into the main repo in
the very near future. It sounds like once git.git is updated for
big-file optimization, I can just upgrade to that git version and
repack to get the benefits. Any idea when that version of git will
come out, release-number-wise and calendar-wise?

(Don't read this next part if you just ate or are eating or drinking;
you may throw up from nausea or choke from laughing. I am forced to
deal with a mandated, micromanaged change-control menu design from the
powers-that-be that is based on cvs workflow and wipes your nose for
you. It can't even cope with branches, much less submodules, so in
that context there isn't time to implement the graphics tracking as a
submodule. This change-control menu is designed to replace cvs
commands with git-command sequences that produce equivalent results.
While many git users import from svn into git, do their work in git,
and then export back into svn to get work done, ironically I am
probably the only git user who has to import from git (a
powers-that-be-mandated, cvs-style, menu-controlled git repo) into git
(a separate, normal git repo and command line), do the work in normal
git, and then export it back into git (the cvs-style, menu-controlled
git repo).)

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-04-12 21:03 UTC
  To: Neal Kreitzinger; Cc: Sergio Callegari, Bo Chen, git

On Thu, Apr 12, 2012 at 02:29:40PM -0500, Neal Kreitzinger wrote:
> I'm likely going to have to slam graphics files into the main repo
> in the very near future. It sounds like once git.git is updated for
> big-file optimization, I can just upgrade to that git version and
> repack to get the benefits.

Depending on the size and number of the files, git may handle them
just fine. They don't delta well, which means they will bloat your
object db a bit, but if you are talking about hundreds of megabytes
total, it is probably not that big a deal.

> Any idea when that version of git will come out, release-number-wise
> and calendar-wise?

No idea. This is still in the discussion and experimenting stage. It
may not even happen.

-Peff
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-04-15 2:15 UTC
  To: Neal Kreitzinger; Cc: Sergio Callegari, Bo Chen, git

On Sat, Apr 14, 2012 at 09:13:17PM -0500, Neal Kreitzinger wrote:
> Does a file's delta-compression efficiency in the pack-file directly
> correlate to its efficiency of transmission size/bandwidth in a
> git-fetch and git-push? IOW, are big files also a problem for
> git-fetch and git-push, by taking too long in a remote transfer?

Yes. The on-the-wire format is a packfile. We create a new packfile on
the fly, so we may find new deltas (e.g., between objects that were
stored on disk in two different packs), but we will mostly be reusing
deltas from the existing packs.

So any time you improve the on-disk representation, you are also
improving the network bandwidth utilization.

-Peff
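(The reuse is visible in the counters pack-objects reports during a
clone; the output here is abbreviated and the numbers illustrative:)

  git clone --no-local src-repo dst-repo
  #   remote: Total 1234 (delta 800), reused 1200 (delta 790)
  # "reused" deltas were copied straight from the existing on-disk
  # pack rather than recomputed for the wire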
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-15 2:33 UTC
  To: Jeff King; Cc: Sergio Callegari, Bo Chen, git

On 4/14/2012 9:15 PM, Jeff King wrote:
> So any time you improve the on-disk representation, you are also
> improving the network bandwidth utilization.

We use git to transfer database files from the dev server to
qa-servers. Sometimes these transfers barf for some reason, and I get
called to remediate. I assumed the user closed their session
prematurely because it was "taking too long". However, now I'm
wondering if the git-pull --ff-only is dying on its own due to the big
files. It could be that a qa-server that hasn't updated database files
in a while is pulling far more than another qa-server that does its
git-pull more frequently. How would I go about troubleshooting this?
Are there some log files I should look at? (I'm using git 1.7.1
compiled with the git makefile on rhel6.)

When I go to remediate, I do git-reset --hard to clear out the barfed
worktree/index and then run git-pull --ff-only manually, and it always
works. I'm not sure that proves it wasn't git that barfed the first
time. Maybe the first time git brought some stuff over and barfed
because it bit off more than it could chew, but the second time it has
less to chew because it already chewed some of it the first time, and
therefore it works.

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-04-16 14:54 UTC
  To: Neal Kreitzinger; Cc: git

On Sat, Apr 14, 2012 at 09:33:37PM -0500, Neal Kreitzinger wrote:
> We use git to transfer database files from the dev server to
> qa-servers. Sometimes these transfers barf for some reason, and I
> get called to remediate. [...] How would I go about troubleshooting
> this? Are there some log files I should look at? (I'm using git
> 1.7.1 compiled with the git makefile on rhel6.)

No, git doesn't keep logfiles. Errors go to stderr, so look wherever
the stderr for your git sessions is going (if you are doing this via a
cron job or something, then that is outside the scope of git).

> When I go to remediate, I do git-reset --hard to clear out the
> barfed worktree/index and then run git-pull --ff-only manually, and
> it always works.

Try "git pull --no-progress" and see if it still works. If the server
has a very long delta-compression phase, there will be no output
generated for a while, which could cause intermediate servers to hang
up (git won't do this, but if, for example, you are pulling over
git-over-http and there is a reverse proxy in the middle, it may hit a
timeout). If the automated pulls are happening from a cron job, then
they won't have a terminal, and progress reporting will be off by
default.

-Peff
* Re: GSoC - Some questions on the idea of 2012-04-15 2:15 ` Jeff King 2012-04-15 2:33 ` Neal Kreitzinger @ 2012-05-10 21:43 ` Neal Kreitzinger 2012-05-10 22:39 ` Jeff King 1 sibling, 1 reply; 43+ messages in thread From: Neal Kreitzinger @ 2012-05-10 21:43 UTC (permalink / raw) To: Jeff King; +Cc: Sergio Callegari, Bo Chen, git On 4/14/2012 9:15 PM, Jeff King wrote: > On Sat, Apr 14, 2012 at 09:13:17PM -0500, Neal Kreitzinger wrote: > >> Does a file's delta-compression efficiency in the pack-file directly >> correlate to its efficiency of transmission size/bandwidth in a >> git-fetch and git-push? IOW, are big-files also a problem for >> git-fetch and git-push by taking too long in a remote transfer? > Yes. The on-the-wire format is a packfile. We create a new packfile on > the fly, so we may find new deltas (e.g., between objects that were > stored on disk in two different packs), but we will mostly be reusing > deltas from the existing packs. > > So any time you improve the on-disk representation, you are also > improving the network bandwidth utilization. > The git-clone manpage says you can use the rsync protocol for the url. If you use rsync:// as your url for your remote, does that get you the rsync delta-transfer algorithm efficiency for the network bandwidth utilization part (as opposed to the on-disk representation part)? (I'm new to rsync.) v/r, neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-05-10 21:43 ` Neal Kreitzinger @ 2012-05-10 22:39 ` Jeff King 0 siblings, 0 replies; 43+ messages in thread From: Jeff King @ 2012-05-10 22:39 UTC (permalink / raw) To: Neal Kreitzinger; +Cc: Sergio Callegari, Bo Chen, git On Thu, May 10, 2012 at 04:43:26PM -0500, Neal Kreitzinger wrote: > >Yes. The on-the-wire format is a packfile. We create a new packfile on > >the fly, so we may find new deltas (e.g., between objects that were > >stored on disk in two different packs), but we will mostly be reusing > >deltas from the existing packs. > > > >So any time you improve the on-disk representation, you are also > >improving the network bandwidth utilization. > > > The git-clone manpage says you can use the rsync protocol for the > url. If you use rsync:// as your url for your remote does that get > you the rsync delta-transfer algorithm efficiency for the network > bandwidth utilization part (as opposed to the on-disk representation > part)? (I'm new to rsync.) Well, yes. If you use the rsync transport, it literally runs rsync, which will use the regular rsync algorithm. But it won't be better than the git protocol (and in fact will be much worse) for a few reasons: 1. The object db files are all named after the sha1 of their content (the object sha1 for loose objects, and the sha1 of the whole pack for packfiles). Rsync will not run its comparison algorithm between files with different names. It will not re-transfer existing loose objects, but it will delete obsolete packfiles and retransfer new ones in their entirety. So it's like re-cloning over again for any fetch after an upstream repack. 2. Even if you could use the rsync delta algorithm, it will never be as efficient as git. Git understands the structure of the packfile and can tell the other side "Hey, I have these objects". Whereas rsync must guess from the bytes in the packfiles. Which is much less efficient to compute, and can be wrong if the representation has changed (e.g., something used to be a whole object, but is now stored as a delta). 3. Even if you could get the exact right set of objects to transfer, and then use the rsync delta algorithm on them, git would still do better. Git's job is much easier: one side has both sets of objects (those to be sent and those not), and is generating and sending efficient deltas for the other side to apply to their objects. Rsync assumes a harder job: you have one set, and the remote side has the other set, and you must agree on a delta by comparing checksums. So it will fundamentally never do as well. -Peff ^ permalink raw reply [flat|nested] 43+ messages in thread
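Point 1 is easy to see on disk, since pack names are derived from content; a rough illustration (the hashes are placeholders):

    $ ls .git/objects/pack/
    pack-<sha1-A>.idx  pack-<sha1-A>.pack
    $ git repack -a -d      # what an upstream repack effectively does
    $ ls .git/objects/pack/
    pack-<sha1-B>.idx  pack-<sha1-B>.pack

The repacked objects are the same, but the pack file has a brand-new name, so rsync treats it as an unrelated file and retransfers the whole thing.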
* Re: GSoC - Some questions on the idea of 2012-04-12 19:29 ` Neal Kreitzinger 2012-04-12 21:03 ` Jeff King @ 2012-04-12 21:08 ` Neal Kreitzinger 2012-04-13 21:36 ` Bo Chen 2 siblings, 0 replies; 43+ messages in thread From: Neal Kreitzinger @ 2012-04-12 21:08 UTC (permalink / raw) To: Jeff King; +Cc: Sergio Callegari, Bo Chen, git On 4/12/2012 2:29 PM, Neal Kreitzinger wrote: > > ...ironically I am probably the only git user who has to import from > git (powers-that-be mandated cvs-style menu controlled git-repo) into > git (separate normal git repo and commandline), do the work in normal > git, and then export it back into git (cvs-style menu controlled > git-repo).) > > aka, git-cotton-picking ;-) v/r, neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-04-12 19:29 ` Neal Kreitzinger 2012-04-12 21:03 ` Jeff King 2012-04-12 21:08 ` Neal Kreitzinger @ 2012-04-13 21:36 ` Bo Chen 2 siblings, 0 replies; 43+ messages in thread From: Bo Chen @ 2012-04-13 21:36 UTC (permalink / raw) To: Neal Kreitzinger; +Cc: Jeff King, Sergio Callegari, git On Thu, Apr 12, 2012 at 3:29 PM, Neal Kreitzinger <nkreitzinger@gmail.com> wrote: > On 4/11/2012 4:35 PM, Jeff King wrote: >> >> On Tue, Apr 10, 2012 at 08:24:48PM -0500, Neal Kreitzinger wrote: >> >>> This is all theory to me, but the reality is looming over my head >>> since most of the components I should be tracking are binaries small >>> (large history?) and big (but am not yet because of "big-file" >>> concerns -- I don't want to have to refactor my vast git ecosystem >>> with filter branch later because I slammed binaries into the main >>> project or superproject without proper systems programming (I'm not >>> sure what the c/linux term is for 'systems programming', but in the >>> mainframe world it meant making sure everything was configured for >>> efficient performance)). >> >> So properly implemented, no, you would not have to ever filter-branch to >> >> tweak these settings. You might have to do a repack to see the gains >> (because you want to delete the old non-split representation you have in >> your pack and replace it with a split representation), but that is >> transparent to git's abstract data model. >> > I'm likely going to have to slam graphics files into the main repo in the > very near future. It sounds like once git.git is updated for big-file > optimization I can just upgrade to that git version and repack to get the > benefits. Any idea when that version of git will come out release number > wise and calendar wise? > > (Don't read this next part if you just ate or are eating or drinking. You > may throw-up from nausea or choke from laughing.) > (I am forced to deal with a mandated/micromanaged change control menu design > from the powers-that-be that is based on cvs workflow and to > wipe-your-nose-for-you. It can't even cope with branches much less > submodules so in that context there isn't time to implement the graphics > tracking as a submodule. This change control menu is designed to replace > cvs commands with equivalent-results git-command sequences. While there are > many git users who import from svn into git, do their work in git, and then > export back into svn to get work done, ironically I am probably the only git > user who has to import from git (powers-that-be mandated cvs-style menu > controlled git-repo) into git (separate normal git repo and commandline), do > the work in normal git, and then export it back into git (cvs-style menu > controlled git-repo).) It seems that this is not directly related to the big-file support issue. Maybe it is better to discuss it in a new thread ^-^ > > v/r, > neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-03-30 20:34 ` Jeff King 2012-03-30 23:08 ` Bo Chen @ 2012-03-31 15:19 ` Neal Kreitzinger 2012-04-02 21:40 ` Jeff King 2012-03-31 16:49 ` Neal Kreitzinger 2012-03-31 20:28 ` Neal Kreitzinger 3 siblings, 1 reply; 43+ messages in thread From: Neal Kreitzinger @ 2012-03-31 15:19 UTC (permalink / raw) To: Jeff King; +Cc: Bo Chen, Sergio, git On 3/30/2012 3:34 PM, Jeff King wrote: > On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote: > >> The sub-problems of "delta for large file" problem. >> >> 1 large file >> > Note that there are other problem areas with big files that can be > worked on, too. For example, some people want to store 100 gigabytes > in a repository. I take it that you have in mind a 100G set of files comprised entirely of big-files that cannot be logically separated into smaller submodules? My understanding is that a main strategy for "big files" is to separate your big-files logically into their own submodule(s) to keep them from bogging down the not-big-file repo(s). Is one of the goals of big-file-support to make submodule strategizing unconcerned about big-file groupings and only concerned about logical-file groupings? Big-file groupings are not necessarily logical file groupings, but perhaps a technical file grouping subset of a logical file grouping that is necessitated by big-file performance considerations. IOW, is the goal of big-file-support to make big-files "just work" so that users don't have to think about graphics files, binaries, etc, and just treat them like everything else? Obviously, a 100G database file will always be a 'big-file' for the foreseeable future, but a 0.5G graphics file is not a "big file" generally speaking (as opposed to git-speaking). > Because git is distributed, that means 100G in the repo database, > and 100G in the working directory, for a total of 200G. I take it that you are implying that the 100G object-store size is due to the notion that binary files cannot-be/are-not compressed well? > People in this situation may want to be able to store part of the > repository database in a network-accessible location, trading some > of the convenience of being fully distributed for the space savings. > So another project could be designing a network-based alternate > object storage system. > I take it you are implying a local area network with users' git repos on workstations? In regard to "network-based alternate objects" that are in fact on the internet, they would first need to be cloned onto the local area network. Or are you imagining this would work for internet "network-based alternate objects"? Some setups log in to a linux server and have all their repos there. The "alternate objects" does not need to be network-based in that case. It is "local", but local does not mean 20 people cloning the alternate objects to their workstations. It means one copy of alternate objects, and twenty repos referencing that one copy. v/r, neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-03-31 15:19 ` Neal Kreitzinger @ 2012-04-02 21:40 ` Jeff King 2012-04-02 22:19 ` Junio C Hamano 0 siblings, 1 reply; 43+ messages in thread From: Jeff King @ 2012-04-02 21:40 UTC (permalink / raw) To: Neal Kreitzinger; +Cc: Bo Chen, Sergio, git On Sat, Mar 31, 2012 at 10:19:54AM -0500, Neal Kreitzinger wrote: > >Note that there are other problem areas with big files that can be > >worked on, too. For example, some people want to store 100 gigabytes > >in a repository. > > I take it that you have in mind a 100G set of files comprised entirely > of big-files that cannot be logically separated into smaller submodules? Not exactly. Two scenarios I'm thinking of are: 1. You really have 100G of data in the current version that doesn't compress well (e.g., you are storing your music collection). You can't afford to store two copies on your laptop (because you have a fancy SSD, and 100G is expensive again). You need the working tree version, but it's OK to stream the repo version of a blob from the network when you actually need it (mostly "checkout", assuming you have marked the file as "-diff"). 2. You have a 100G repository, but only 10G in the most recent version (e.g., because you are doing game development and storing the media assets). You want your clones to be faster and take less space. You can do a shallow clone, but then you're never allowed to look at old history. Instead, it would be nice to clone all of the commits, trees, and small blobs, and then stream large blobs from the network as-needed (again, mostly "checkout"). > My understanding is that a main strategy for "big files" is to separate > your big-files logically into their own submodule(s) to keep them from > bogging down the not-big-file repo(s). That helps people who want to work on the not-big parts by not forcing them into the big parts (another solution would be partial clone, but more on that in a minute). But it doesn't help people who actually want to work on the big parts; they would still have to fetch the whole big-parts repository. For splitting the big-parts people from the non-big-parts people, there have been two suggestions: partial checkout (you have all the objects in the repo, but only checkout some of them) and partial clone (you don't have some of the objects in the repo). Partial checkout is a much easier problem, as it is mostly about marking index entries as "do not bother to check this out, and pretend that it is simply unmodified". Partial clone is much harder, because it violates git's usual reachability rules. During a fetch, a client will say "I have commit X", which the server can then assume means they have all of the ancestors of X, and all of the tree and blobs referenced by X and its ancestors. But if a client can say "yes, I have these objects, but I just don't want to get them because it's expensive", then partial checkout is sufficient. The non-big-parts people will clone, omitting the big objects, and then do a partial checkout (to avoid fetching the objects even once). Note that some protocol extension is still needed for the client to tell the server "don't bother including objects X, Y, and Z in the packfile; I'll get them from my alternate big-object repo". That can either be a list of objects, or it can simply be "don't bother with objects bigger than N". > >Because git is distributed, that means 100G in the repo database, > >and 100G in the working directory, for a total of 200G. 
> > I take it that you are implying that the 100G object-store size is due > to the notion that binary files cannot-be/are-not compressed well? In this case, yes. But you could easily tweak the numbers to be 100G and 150G. The point is that the data is stored twice, and even the compressed version may be big. > >People in this situation may want to be able to store part of the > >repository database in a network-accessible location, trading some > >of the convenience of being fully distributed for the space savings. > >So another project could be designing a network-based alternate > >object storage system. > > > I take it you are implying a local area network with users' git repos > on workstations? Not necessarily. Obviously if you are doing a lot of active work on the big files, the faster your network, the better. But it could work at the internet scale, too, if you don't actually fetch the big files frequently (so part of a scheme like this would be making sure we avoid accessing big objects whenever we can; in practice, this is pretty easy, as git already tries to avoid accessing objects unnecessarily, because it's expensive even on the local end). You can also cache a certain number of fetched objects locally. Assuming there is some locality of the objects you ask about (e.g., because you are doing "git checkout" back and forth between two branches), this can help. > Some setups log in to a linux server and have all their repos there. > The "alternate objects" does not need to be network-based in that case. > It is "local", but local does not mean 20 people cloning the > alternate objects to their workstations. It means one copy of > alternate objects, and twenty repos referencing that one copy. Right. This is the same concept, except over the network. So people's working repositories are on their own workstations instead of a central server. You could even do it today by network-mounting a filesystem and pointing your alternates file at it. However, I think it's worth making git aware that the objects are on the network for a few reasons: 1. Git can be more careful about how it handles the objects, including when to fetch, when to stream, and when to cache. For example, you'd want to fetch the manifest of objects and cache it in your local repository, because you want fast lookups of "do I have this object". 2. Providing remote filesystems on an Internet scale is a management pain (and it's a pain for the user, too). My thought was that this would be implemented on top of http (the connection setup cost is negligible, since these objects would generally be large). 3. Usually alternate repositories are full repositories that meet the connectivity requirements (so you could run "git fsck" in them). But this is explicitly about taking just a few disconnected large blobs out of the repository and putting them elsewhere. So it needs a new set of tools for managing the upstream repository. -Peff ^ permalink raw reply [flat|nested] 43+ messages in thread
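The do-it-today version needs no new code at all, since it is just the stock alternates mechanism; a sketch, assuming /mnt/bigrepo is the network mount (the path is hypothetical):

    $ echo /mnt/bigrepo/objects > .git/objects/info/alternates
    # git now falls back to /mnt/bigrepo/objects for any object
    # it cannot find in its own object database

The three points above are about what this naive setup gets wrong: git has no idea those objects are slow to reach, so it cannot decide when to fetch, stream, or cache.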
* Re: GSoC - Some questions on the idea of 2012-04-02 21:40 ` Jeff King @ 2012-04-02 22:19 ` Junio C Hamano 2012-04-03 10:07 ` Jeff King 0 siblings, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2012-04-02 22:19 UTC (permalink / raw) To: Jeff King; +Cc: Neal Kreitzinger, Bo Chen, Sergio, git Jeff King <peff@peff.net> writes: > 1. You really have 100G of data in the current version that doesn't > compress well (e.g., you are storing your music collection). You > can't afford to store two copies on your laptop (because you have a > fancy SSD, and 100G is expensive again). You need the working tree > version, but it's OK to stream the repo version of a blob from the > network when you actually need it (mostly "checkout", assuming you > have marked the file as "-diff"). This feels like a good candidate for an independent project that allows you fuse-mount from a remote repository to give you an illusion that you have a checkout of a specific version. Such a remote fuse-server would be an application that is built using Git, but I do not think we are in any business on the client end in such a setup. So I'll write it off as a "non-Git" issue for now. The other parts of your message are much more interesting. > Right. This is the same concept, except over the network. So people's > working repositories are on their own workstations instead of a central > server. You could even do it today by network-mounting a filesystem and > pointing your alternates file at it. However, I think it's worth making > git aware that the objects are on the network for a few reasons: > > 1. Git can be more careful about how it handles the objects, including > when to fetch, when to stream, and when to cache. For example, > you'd want to fetch the manifest of objects and cache it in your > local repository, because you want fast lookups of "do I have this > object". > > 2. Providing remote filesystems on an Internet scale is a management > pain (and it's a pain for the user, too). My thought was that this > would be implemented on top of http (the connection setup cost is > negligible, since these objects would generally be large). > > 3. Usually alternate repositories are full repositories that meet the > connectivity requirements (so you could run "git fsck" in them). > But this is explicitly about taking just a few disconnected large > blobs out of the repository and putting them elsewhere. So it needs > a new set of tools for managing the upstream repository. Or you can split out the really large write-only blobs out of SCM control. Every time you introduce a new blob, throw it verbatim in an append-only directory on a networked filesystem under some unique ID as its filename, and maintain a symlink into that networked filesystem under SCM control. I think git-annex already does something like that... ^ permalink raw reply [flat|nested] 43+ messages in thread
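A toy version of the scheme Junio sketches, with hypothetical paths (and with exactly the limitations Peff describes next):

    id=$(sha1sum big.iso | cut -d' ' -f1)
    mv big.iso "/mnt/blobstore/$id"       # append-only store, keyed by content
    ln -s "/mnt/blobstore/$id" big.iso    # the worktree keeps only a symlink
    git add big.iso                       # git versions the symlink, not the blob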
* Re: GSoC - Some questions on the idea of 2012-04-02 22:19 ` Junio C Hamano @ 2012-04-03 10:07 ` Jeff King 0 siblings, 0 replies; 43+ messages in thread From: Jeff King @ 2012-04-03 10:07 UTC (permalink / raw) To: Junio C Hamano; +Cc: Neal Kreitzinger, Bo Chen, Sergio, git On Mon, Apr 02, 2012 at 03:19:35PM -0700, Junio C Hamano wrote: > > 1. You really have 100G of data in the current version that doesn't > > compress well (e.g., you are storing your music collection). You > > can't afford to store two copies on your laptop (because you have a > > fancy SSD, and 100G is expensive again). You need the working tree > > version, but it's OK to stream the repo version of a blob from the > > network when you actually need it (mostly "checkout", assuming you > > have marked the file as "-diff"). > > This feels like a good candidate for an independent project that allows > you fuse-mount from a remote repository to give you an illusion that you > have a checkout of a specific version. Such a remote fuse-server would be > an application that is built using Git, but I do not think we are in any > business on the client end in such a setup. I think this is backwards. The primary item you want on the laptop is the working directory, because you will be accessing and manipulating the files. That must always work, whether the network is connected or not. You occasionally will want to perform git operations. Most of these should succeed when disconnected, but it's OK for some operations (like checking out an older version of a large blob) to fail. But if you are mounting a remote repository and pretending that you have a local checkout, then just accessing the files either requires a network, or you end up caching most of the remote repository. It would make more sense to me to clone a bare repository of what's upstream, and then fuse-mount the local bare repository to provide a fake working directory. And I believe somebody made such a fuse filesystem in the early days of git. However, I recall that it was read-only. I'm not sure how you would handle writing to the git-mounted directory. > Or you can split out the really large write-only blobs out of SCM control. > Every time you introduce a new blob, throw it verbatim in an append-only > directory on a networked filesystem under some unique ID as its filename, > and maintain a symlink into that networked filesystem under SCM control. > > I think git-annex already does something like that... Yes, and git-media basically does this, too. But it's awful to use, because the user has to be constantly aware of these special links and managing them. You can't just store a symlink into the networked filesystem. For one thing, the path may be different on each client machine, so a simple symlink doesn't work. For another, symlinks into a blob repository mean that the files must be read-only (since they are basically blob-equivalents). So you don't really get your own copy of the file; you can _replace_ it and update the symlink, but you can't actually modify it. So what things like git-media end up doing is to try to insert themselves between git and the user, and transparently convert the file into its unique ID on "git add" and tweak the working directory to contain the actual file on checkout. And it kind of works, but there are a lot of rough edges (I don't recall the details, but they came up in past discussions; clean and smudge filters almost get you there, but not quite). 
Basically what I'm proposing to do is to just move that logic into git itself, so it can just happen at the blob storage level. I don't think it would even be that much code inside git; you'd want the interface to be pluggable, so all of the heavy lifting would happen inside of a helper (so really, this isn't necessarily even "network alternates" as much as "pluggable alternates"). -Peff ^ permalink raw reply [flat|nested] 43+ messages in thread
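For reference, the clean/smudge route that git-media takes looks roughly like this; the bigfile-store and bigfile-fetch helpers are hypothetical stand-ins for scripts that swap a file's content for its unique ID on the way in and back on the way out:

    $ cat .gitattributes
    *.bin filter=bigfile -diff
    $ cat .git/config
    [filter "bigfile"]
        clean = bigfile-store     # run on add: content on stdin, ID on stdout
        smudge = bigfile-fetch    # run on checkout: ID on stdin, content on stdout

Everything else (fetching, caching, knowing when not to bother) has to live inside those helpers, which is where the rough edges tend to show up.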
* Re: GSoC - Some questions on the idea of 2012-03-30 20:34 ` Jeff King 2012-03-30 23:08 ` Bo Chen 2012-03-31 15:19 ` Neal Kreitzinger @ 2012-03-31 16:49 ` Neal Kreitzinger 2012-03-31 20:28 ` Neal Kreitzinger 3 siblings, 0 replies; 43+ messages in thread From: Neal Kreitzinger @ 2012-03-31 16:49 UTC (permalink / raw) To: Jeff King; +Cc: Bo Chen, Sergio, git On 3/30/2012 3:34 PM, Jeff King wrote: > On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote: > >> The sub-problems of "delta for large file" problem. >> >> 1 large file >> >> 1.1 text file (always delta well? need to be confirmed) > > ...But let's take a step back for a moment. Forget about whether a file > is binary or not. Imagine you want to store a very large file in > git. > > ...Nowadays, we stream large files directly into their own packfiles, > and we have to pay the I/O only once (and the memory cost never). As > a tradeoff, we no longer get delta compression of large objects. > That's OK for some large objects, like movie files (which don't tend > to delta well, anyway). But it's not for other objects, like virtual > machine images, which do tend to delta well. > > So can we devise a solution which efficiently stores these > delta-friendly objects, without losing the performance improvements > we got with the stream-directly-to-packfile approach? > gitconfig or gitattributes could specify big-file handlers for filetypes. It seems a bit ridiculous to expect git to autoconfigure big-file handlers for everything from gif's to vm-images. In the case of vm-images you would need to read the "big-files" man-page and then configure your git for the "vm image handler" for whatever your vm-image wildcards are for those files. For movie files you would also read the big-file man-page and configure "movie file 'x' big file handler' for whatever your movie file wildcards are. Movie files and vm-images are very expectable (version control) but not very normative (source code management) so you need to configure those as needed. More widely-tracked-by-the-public-at-large files like gif, png, etc, could be autoconfigured by git to used the correct big-file handler. v/r, neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-03-30 20:34 ` Jeff King ` (2 preceding siblings ...) 2012-03-31 16:49 ` Neal Kreitzinger @ 2012-03-31 20:28 ` Neal Kreitzinger 2012-03-31 21:27 ` Bo Chen 3 siblings, 1 reply; 43+ messages in thread From: Neal Kreitzinger @ 2012-03-31 20:28 UTC (permalink / raw) To: Jeff King; +Cc: Bo Chen, Sergio, git On 3/30/2012 3:34 PM, Jeff King wrote: > On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote: > >> The sub-problems of "delta for large file" problem. >> >> 1 large file >> > But let's take a step back for a moment. Forget about whether a file is > binary or not. Imagine you want to store a very large file in git. > > What are the operations that will perform badly? How can we make them > perform acceptably, and what tradeoffs must we make? E.g., the way the > diff code is written, it would be very difficult to run "git diff" on a > 2 gigabyte file. But is that actually a problem? Answering that means > talking about the characteristics of 2 gigabyte files, and what we > expect to see, and to what degree our tradeoffs will impact them. > > Here's a more concrete example. At first, even storing a 2 gigabyte file > with "git add" was painful, because we would load the whole thing in > memory. Repacking the repository was painful, because we had to rewrite > the whole 2G file into a packfile. Nowadays, we stream large files > directly into their own packfiles, and we have to pay the I/O only once > (and the memory cost never). As a tradeoff, we no longer get delta > compression of large objects. That's OK for some large objects, like > movie files (which don't tend to delta well, anyway). But it's not for > other objects, like virtual machine images, which do tend to delta well. > > So can we devise a solution which efficiently stores these > delta-friendly objects, without losing the performance improvements we > got with the stream-directly-to-packfile approach? > > One possible solution is breaking large files into smaller chunks using > something like the bupsplit algorithm (and I won't go into the details > here, as links to bup have already been mentioned elsewhere, and Junio's > patches make a start at this sort of splitting). > (I'm no expert on "big-files" in git or elsewhere, but this thread is immensely interesting to me as a git user who wants to track all sorts of binary files and possibly large text files in the very near future, ie. all components tied to a server build and upgrades beyond the linux-distro/rpms and perhaps including them also.) Let's take an even bigger step back for a moment. Who determines if a file shall be a big-file or not? Git or the user? How is it determined if a file shall be a "big-file" or not? Who decides bigness: Bigness seems to be relative to system resources. Does the user crunch the numbers to determine if a file is a big-file, or does git? If the numbers are relative then should git query the system and make the determination? Either way, once the system-resources are upgraded and formerly "big-files" are no longer considered "big" how is the previous history refactored to behave "non-big-file-like"? Conversely, if the system-resources are re-distributed so that formerly non-big files are now relatively big (ie, moved from powerful central server login to laptops), how is the history refactored to accommodate the newly-relative-bigness? How bigness is decided: There seem to be two basic types of big-files: big-worktree-files and big-history-files.
A big-worktree-file that is delta-friendly is not a big-history-file. A non-big-worktree-file that is delta-unfriendly is a big-history-file problem. If you are working alone on an old computer you are probably more concerned about big-worktree-files (memory). If you are working in a large group making lots of changes to the same files on a powerful server then you are probably more concerned about big-history-file-size (diskspace). Of course, all are concerned about big-worktree-files that are delta-unfriendly. At what point is a delta-friendly file considered a "big-file"? I assume that may depend on the degree of delta-friendliness. I imagine that a text file and a vm-image differ in delta-friendliness by several degrees. At what point(s) is a delta-unfriendly file considered a "big-file"? I assume that may depend on the degree(s) of delta-unfriendliness. I imagine a compiled program and a compressed container differ in delta-unfriendliness by several degrees. My understanding is that git does not ever delta-compress binary files. That would mean even a small-worktree-binary-file becomes a big-history-file over time. v/r, neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-03-31 20:28 ` Neal Kreitzinger @ 2012-03-31 21:27 ` Bo Chen 2012-04-01 4:22 ` Nguyen Thai Ngoc Duy 0 siblings, 1 reply; 43+ messages in thread From: Bo Chen @ 2012-03-31 21:27 UTC (permalink / raw) To: Neal Kreitzinger; +Cc: Jeff King, Sergio, git On Sat, Mar 31, 2012 at 4:28 PM, Neal Kreitzinger <nkreitzinger@gmail.com> wrote: > On 3/30/2012 3:34 PM, Jeff King wrote: >> >> On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote: >> >>> The sub-problems of "delta for large file" problem. >>> >>> 1 large file >>> >> But let's take a step back for a moment. Forget about whether a file is >> binary or not. Imagine you want to store a very large file in git. >> >> What are the operations that will perform badly? How can we make them >> perform acceptably, and what tradeoffs must we make? E.g., the way the >> diff code is written, it would be very difficult to run "git diff" on a >> 2 gigabyte file. But is that actually a problem? Answering that means >> talking about the characteristics of 2 gigabyte files, and what we >> expect to see, and to what degree our tradeoffs will impact them. >> >> Here's a more concrete example. At first, even storing a 2 gigabyte file >> with "git add" was painful, because we would load the whole thing in >> memory. Repacking the repository was painful, because we had to rewrite >> the whole 2G file into a packfile. Nowadays, we stream large files >> directly into their own packfiles, and we have to pay the I/O only once >> (and the memory cost never). As a tradeoff, we no longer get delta >> compression of large objects. That's OK for some large objects, like >> movie files (which don't tend to delta well, anyway). But it's not for >> other objects, like virtual machine images, which do tend to delta well. >> >> So can we devise a solution which efficiently stores these >> delta-friendly objects, without losing the performance improvements we >> got with the stream-directly-to-packfile approach? >> >> One possible solution is breaking large files into smaller chunks using >> something like the bupsplit algorithm (and I won't go into the details >> here, as links to bup have already been mentioned elsewhere, and Junio's >> patches make a start at this sort of splitting). >> > (I'm no expert on "big-files" in git or elsewhere, but this thread is > immensely interesting to me as a git user who wants to track all sorts of > binary files and possibly large text files in the very near future, ie. all > components tied to a server build and upgrades beyond the linux-distro/rpms > and perhaps including them also.) > > Let's take an even bigger step back for a moment. Who determines if a file > shall be a big-file or not? Git or the user? How is it determined if a > file shall be a "big-file" or not? > > Who decides bigness: > Bigness seems to be relative to system resources. Does the user crunch the > numbers to determine if a file is a big-file, or does git? If the numbers are > relative then should git query the system and make the determination? > Either way, once the system-resources are upgraded and formerly "big-files" > are no longer considered "big" how is the previous history refactored to > behave "non-big-file-like"? Conversely, if the system-resources are > re-distributed so that formerly non-big files are now relatively big (ie, > moved from powerful central server login to laptops), how is the history > refactored to accommodate the newly-relative-bigness?
> In common sense, a file of tens of MBs should not be considered a big file, but a file of tens of GBs should definitely be considered a big file. I think one simple workable solution is to let the user set the threshold of the big file. One complicated but intelligent solution is to let git auto-config the threshold by evaluating current computing resources in the running platform (a physical machine or just a VM). As to the problem of migrating git across different platforms which are equipped with different computing power, the git repo should also keep track of under what big file threshold a specific file is handled. > How bigness is decided: > There seem to be two basic types of big-files: big-worktree-files and > big-history-files. A big-worktree-file that is delta-friendly is not a > big-history-file. A non-big-worktree-file that is delta-unfriendly is a > big-history-file problem. If you are working alone on an old computer you > are probably more concerned about big-worktree-files (memory). If you are > working in a large group making lots of changes to the same files on a > powerful server then you are probably more concerned about > big-history-file-size (diskspace). Of course, all are concerned about > big-worktree-files that are delta-unfriendly. > > At what point is a delta-friendly file considered a "big-file"? I assume > that may depend on the degree of delta-friendliness. I imagine that a text > file and a vm-image differ in delta-friendliness by several degrees. > > At what point(s) is a delta-unfriendly file considered a "big-file"? I > assume that may depend on the degree(s) of delta-unfriendliness. I imagine > a compiled program and a compressed container differ in delta-unfriendliness > by several degrees. > > My understanding is that git does not ever delta-compress binary files. > That would mean even a small-worktree-binary-file becomes a > big-history-file over time. > > v/r, > neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-03-31 21:27 ` Bo Chen @ 2012-04-01 4:22 ` Nguyen Thai Ngoc Duy 2012-04-01 23:30 ` Bo Chen 0 siblings, 1 reply; 43+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-04-01 4:22 UTC (permalink / raw) To: Bo Chen; +Cc: Neal Kreitzinger, Jeff King, Sergio, git On Sun, Apr 1, 2012 at 4:27 AM, Bo Chen <chen@chenirvine.org> wrote: >> Who decides bigness: >> Bigness seems to be relative to system resources. Does the user crunch the >> numbers to determine if a file is a big-file, or does git? If the numbers are >> relative then should git query the system and make the determination? >> Either way, once the system-resources are upgraded and formerly "big-files" >> are no longer considered "big" how is the previous history refactored to >> behave "non-big-file-like"? Conversely, if the system-resources are >> re-distributed so that formerly non-big files are now relatively big (ie, >> moved from powerful central server login to laptops), how is the history >> refactored to accommodate the newly-relative-bigness? >> > > In common sense, a file of tens of MBs should not be considered a > big file, but a file of tens of GBs should definitely be considered > a big file. I think one simple workable solution is to let the user > set the threshold of the big file. We currently have core.bigFileThreshold = 512MB. > One complicated but intelligent > solution is to let git auto-config the threshold by evaluating current > computing resources in the running platform (a physical machine or > just a VM). As to the problem of migrating git across different platforms > which are equipped with different computing power, the git repo should also > keep track of under what big file threshold a specific file is > handled. -- Duy ^ permalink raw reply [flat|nested] 43+ messages in thread
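That knob can already be tuned per-repository; a quick sketch (the 1m value is only an example, and sizes accept the usual k/m/g suffixes):

    $ git config core.bigFileThreshold 1m
    # blobs over the threshold are written out deflated as-is,
    # with no attempt at delta compression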
* Re: GSoC - Some questions on the idea of 2012-04-01 4:22 ` Nguyen Thai Ngoc Duy @ 2012-04-01 23:30 ` Bo Chen 2012-04-02 1:00 ` Nguyen Thai Ngoc Duy 0 siblings, 1 reply; 43+ messages in thread From: Bo Chen @ 2012-04-01 23:30 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Neal Kreitzinger, Jeff King, Sergio, git One question; can anyone help me clear this up? My .git/objects has 3 blobs, a, b, and c. a is a unique file, b and c are two sequential versions of the same file. When I run "git gc", what exactly happens here, e.g., how exactly does git (in the latest version) delta-compress the blobs here? Any help will be appreciated. Bo On Sun, Apr 1, 2012 at 12:22 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: > On Sun, Apr 1, 2012 at 4:27 AM, Bo Chen <chen@chenirvine.org> wrote: >>> Who decides bigness: >>> Bigness seems to be relative to system resources. Does the user crunch the >>> numbers to determine if a file is a big-file, or does git? If the numbers are >>> relative then should git query the system and make the determination? >>> Either way, once the system-resources are upgraded and formerly "big-files" >>> are no longer considered "big" how is the previous history refactored to >>> behave "non-big-file-like"? Conversely, if the system-resources are >>> re-distributed so that formerly non-big files are now relatively big (ie, >>> moved from powerful central server login to laptops), how is the history >>> refactored to accommodate the newly-relative-bigness? >>> >> >> In common sense, a file of tens of MBs should not be considered a >> big file, but a file of tens of GBs should definitely be considered >> a big file. I think one simple workable solution is to let the user >> set the threshold of the big file. > > We currently have core.bigFileThreshold = 512MB. > >> One complicated but intelligent >> solution is to let git auto-config the threshold by evaluating current >> computing resources in the running platform (a physical machine or >> just a VM). As to the problem of migrating git across different platforms >> which are equipped with different computing power, the git repo should also >> keep track of under what big file threshold a specific file is >> handled. > -- > Duy ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-04-01 23:30 ` Bo Chen @ 2012-04-02 1:00 ` Nguyen Thai Ngoc Duy 0 siblings, 0 replies; 43+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-04-02 1:00 UTC (permalink / raw) To: Bo Chen; +Cc: Neal Kreitzinger, Jeff King, Sergio, git On Mon, Apr 2, 2012 at 6:30 AM, Bo Chen <chen@chenirvine.org> wrote: > One question; can anyone help me clear this up? > > My .git/objects has 3 blobs, a, b, and c. a is a unique file, b and c > are two sequential versions of the same file. When I run "git gc", what > exactly happens here, e.g., how exactly does git (in the latest version) > delta-compress the blobs here? See Documentation/technical/pack-heuristics.txt for how pack-objects (called by "git gc") decides to delta either b or c based on the other one. Once it chooses, say, b to be a delta against c, it generates the delta using diff-delta.c, then stores the delta in either ref-delta or ofs-delta format. The former stores the sha-1 of c, the latter the offset of c in the pack. -- Duy ^ permalink raw reply [flat|nested] 43+ messages in thread
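You can watch that choice being made; a rough sketch of inspecting the result (the hashes are placeholders and the sizes are made up, but the trailing "depth base-sha1" columns on delta entries are the point):

    $ git gc
    $ git verify-pack -v .git/objects/pack/pack-*.idx
    <sha1-of-c> blob 1048576 524288 12
    <sha1-of-b> blob     200     90 524300 1 <sha1-of-c>

Here c is stored whole, while b is stored as a depth-1 delta whose base is c.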
* Re: GSoC - Some questions on the idea of "Better big-file support". 2012-03-28 6:19 ` Nguyen Thai Ngoc Duy 2012-03-28 11:33 ` GSoC - Some questions on the idea of Sergio @ 2012-03-30 19:11 ` Bo Chen 2012-03-30 19:54 ` Jeff King 1 sibling, 1 reply; 43+ messages in thread From: Bo Chen @ 2012-03-30 19:11 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: git, peff Sorry for replying late. My questions are inline in the following. On Wed, Mar 28, 2012 at 2:19 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: > On Wed, Mar 28, 2012 at 11:38 AM, Bo Chen <chen@chenirvine.org> wrote: >> Hi, Everyone. This is Bo Chen. I am interested in the idea of "Better >> big-file support". >> >> As it is described in the idea page, >> "Many large files (like media) do not delta very well. However, some >> do (like VM disk images). Git could split large objects into smaller >> chunks, similar to bup, and find deltas between these much more >> manageable chunks. There are some preliminary patches in this >> direction, but they are in need of review and expansion." >> >> Can anyone elaborate a little bit why many large files do not delta >> very well? > > Large files are usually binary. Depends on the type of binary, they > may or may not delta well. Those that are compressed/encrypted > obviously don't delta well because one change can make the final > result completely different. Just make clear one of my confusions. Delta operation is to find out the differences between different versions of the same file, right? As I know, delta encoding is to re-encode a file based on the differences between neighboring blocks, thus can help compress a file since after delta encoding, we will have more similar data within the file. Can anyone elaborate a little bit what is the relation between delta operation in git and delta encoding listed above? Thanks. > > Another problem with delta-ing large files with git is, current code > needs to load two files in memory for delta. Consuming 4G for delta 2 > 2GB files does not sound good. I am wondering why we cannot divide the 2 2GB files into chunks and delta chunks by chunks. Is that any difference, except a little more IOs? > >> Is it a general problem or a specific problem just for Git? >> I am really new to Git, can anyone give me some hints on which source >> codes I should read to learn more about the current code on delta >> operation? It is said that "there are some preliminary patches in this >> direction", where can I find these patches? > > Read about rsync algorithm [2]. Bup [1] implements the same (I think) > algorithm, but on top of git. For preliminary patches, have a look at > jc/split-blob series at commit 4a1242d in git.git. Make clear my another confusion. The file which has been updated (added, deleted, and modified) is first delta-compressed, and then synchronize to the remote repo by some mechanism (rsync?). I am wondering what is the the relationship between delta operation and rsync. > > [1] https://github.com/apenwarr/bup > [2] http://en.wikipedia.org/wiki/Rsync#Algorithm > -- > Duy Bo ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of "Better big-file support". 2012-03-30 19:11 ` GSoC - Some questions on the idea of "Better big-file support" Bo Chen @ 2012-03-30 19:54 ` Jeff King 0 siblings, 0 replies; 43+ messages in thread From: Jeff King @ 2012-03-30 19:54 UTC (permalink / raw) To: Bo Chen; +Cc: Nguyen Thai Ngoc Duy, git On Fri, Mar 30, 2012 at 03:11:40PM -0400, Bo Chen wrote: > Just make clear one of my confusions. Delta operation is to find out > the differences between different versions of the same file, right? > As I know, delta encoding is to re-encode a file based on the > differences between neighboring blocks, thus can help compress a file > since after delta encoding, we will have more similar data within the > file. Can anyone elaborate a little bit what is the relation between > delta operation in git and delta encoding listed above? Thanks. Sort of. Git is snapshot based. So each version of a file is its own "object", and from a high-level view, we store all objects. But we store the logical objects themselves in packfiles, in which the actual representation of the object may be stored as a difference to another object (which is likely to be a different version of the same file, but does not have to be). Here's some background reading: http://progit.org/book/ch1-3.html http://progit.org/book/ch9-4.html > I am wondering why we cannot divide the 2 2GB files into chunks and > delta chunks by chunks. Is that any difference, except a little more > IOs? It's more complicated than that. What if the file is re-ordered? You would want to compare early chunks in one version against later chunks in the other. So yes, you can reduce memory pressure by doing more I/O, but doing too much I/O will be very slow. Coming up with a solution is part of what this project is about. And chunking is part of that solution. > > Read about rsync algorithm [2]. Bup [1] implements the same (I think) > > algorithm, but on top of git. For preliminary patches, have a look at > > jc/split-blob series at commit 4a1242d in git.git. > > Make clear my another confusion. The file which has been updated > (added, deleted, and modified) is first delta-compressed, and then > synchronize to the remote repo by some mechanism (rsync?). I am > wondering what is the the relationship between delta operation and > rsync. No, the updated file is delta compressed into a packfile, and the packfile is transmitted. Rsync comes into play because it uses a novel chunking algorithm, which was copied by bup (and is referred to as the "bupsplit" algorithm). Read up on how bup works and why it was invented. -Peff ^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2012-05-10 22:39 UTC | newest] Thread overview: 43+ messages -- 2012-03-28 4:38 GSoC - Some questions on the idea of "Better big-file support" Bo Chen 2012-03-28 6:19 ` Nguyen Thai Ngoc Duy 2012-03-28 11:33 ` GSoC - Some questions on the idea of Sergio 2012-03-30 19:44 ` Bo Chen 2012-03-30 19:51 ` Bo Chen 2012-03-30 20:34 ` Jeff King 2012-03-30 23:08 ` Bo Chen 2012-03-31 11:02 ` Sergio Callegari 2012-03-31 16:18 ` Neal Kreitzinger 2012-04-02 21:07 ` Jeff King 2012-04-03 9:58 ` Sergio Callegari 2012-04-11 1:24 ` Neal Kreitzinger 2012-04-11 6:04 ` Jonathan Nieder 2012-04-11 16:29 ` Neal Kreitzinger 2012-04-11 22:09 ` Jeff King 2012-04-11 16:35 ` Neal Kreitzinger 2012-04-11 16:44 ` Neal Kreitzinger 2012-04-11 17:20 ` Jonathan Nieder 2012-04-11 18:51 ` Junio C Hamano 2012-04-11 19:03 ` Jonathan Nieder 2012-04-11 18:23 ` Neal Kreitzinger 2012-04-11 21:35 ` Jeff King 2012-04-12 19:29 ` Neal Kreitzinger 2012-04-12 21:03 ` Jeff King [not found] ` <4F8A2EBD.1070407@gmail.com> 2012-04-15 2:15 ` Jeff King 2012-04-15 2:33 ` Neal Kreitzinger 2012-04-16 14:54 ` Jeff King 2012-05-10 21:43 ` Neal Kreitzinger 2012-05-10 22:39 ` Jeff King 2012-04-12 21:08 ` Neal Kreitzinger 2012-04-13 21:36 ` Bo Chen 2012-03-31 15:19 ` Neal Kreitzinger 2012-04-02 21:40 ` Jeff King 2012-04-02 22:19 ` Junio C Hamano 2012-04-03 10:07 ` Jeff King 2012-03-31 16:49 ` Neal Kreitzinger 2012-03-31 20:28 ` Neal Kreitzinger 2012-03-31 21:27 ` Bo Chen 2012-04-01 4:22 ` Nguyen Thai Ngoc Duy 2012-04-01 23:30 ` Bo Chen 2012-04-02 1:00 ` Nguyen Thai Ngoc Duy 2012-03-30 19:11 ` GSoC - Some questions on the idea of "Better big-file support" Bo Chen 2012-03-30 19:54 ` Jeff King