* GSoC - Some questions on the idea of "Better big-file support"
  From: Bo Chen
  Date: 2012-03-28 4:38 UTC
  To: git; Cc: peff

Hi, everyone. This is Bo Chen. I am interested in the idea of "Better
big-file support".

As described on the ideas page: "Many large files (like media) do not
delta very well. However, some do (like VM disk images). Git could
split large objects into smaller chunks, similar to bup, and find
deltas between these much more manageable chunks. There are some
preliminary patches in this direction, but they are in need of review
and expansion."

Can anyone elaborate a little on why many large files do not delta
well? Is it a general problem, or one specific to Git? I am really new
to Git; can anyone give me some hints on which source files I should
read to learn about the current delta code? It is said that "there are
some preliminary patches in this direction"; where can I find these
patches?

I will appreciate any help. Thanks.

Bo Chen
* Re: GSoC - Some questions on the idea of "Better big-file support"
  From: Nguyen Thai Ngoc Duy
  Date: 2012-03-28 6:19 UTC
  To: Bo Chen; Cc: git, peff

On Wed, Mar 28, 2012 at 11:38 AM, Bo Chen <chen@chenirvine.org> wrote:
> Can anyone elaborate a little on why many large files do not delta
> well?

Large files are usually binary. Depending on the type of binary, they
may or may not delta well. Those that are compressed or encrypted
obviously don't delta well, because one change can make the final
result completely different.

Another problem with delta-ing large files in git is that the current
code needs to load both files into memory. Consuming 4GB to delta two
2GB files does not sound good.

> Is it a general problem, or one specific to Git? I am really new to
> Git; can anyone give me some hints on which source files I should
> read to learn about the current delta code? It is said that "there
> are some preliminary patches in this direction"; where can I find
> these patches?

Read about the rsync algorithm [2]. Bup [1] implements the same (I
think) algorithm, but on top of git. For the preliminary patches, have
a look at the jc/split-blob series at commit 4a1242d in git.git.

[1] https://github.com/apenwarr/bup
[2] http://en.wikipedia.org/wiki/Rsync#Algorithm
--
Duy
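(A small experiment makes Duy's point about compressed data concrete.
This is only a sketch; it assumes GNU coreutils and the standalone
xdelta3 tool, which is unrelated to git's internal delta code but
convenient for illustration.)

  # make a compressible file and a copy with one changed line
  seq 1 200000 >base.txt
  sed '100s/.*/CHANGED/' base.txt >changed.txt

  # compress both, then delta the plain pair and the gzipped pair
  gzip -cn <base.txt >base.gz
  gzip -cn <changed.txt >changed.gz
  xdelta3 -e -f -s base.txt changed.txt plain.delta
  xdelta3 -e -f -s base.gz changed.gz gz.delta

  # the plain delta is tiny; the gzipped delta is roughly the size of
  # the whole compressed file, because the one change perturbs the
  # compressed stream from that point onward
  ls -l plain.delta gz.delta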
* Re: GSoC - Some questions on the idea of
  From: Sergio
  Date: 2012-03-28 11:33 UTC
  To: git

Nguyen Thai Ngoc Duy <pclouds <at> gmail.com> writes:
> Large files are usually binary. Depending on the type of binary,
> they may or may not delta well. Those that are compressed or
> encrypted obviously don't delta well, because one change can make
> the final result completely different.

I would add that the larger a file, the larger the temptation to use a
compressed format for it, so large files are often compressed
binaries.

For these, a trick to obtain good deltas can be to decompress before
splitting into chunks with the rsync algorithm. Git filters can
already be used for this, but it can be tricky to ensure that the
decompress-recompress roundtrip re-creates the original compressed
file.

Furthermore, some compressed binaries are internally composed of
multiple streams (think of a zip archive containing multiple files,
but this is by no means limited to zip). In this case it is common to
have many possible orderings of the streams. If so, the best deltas
can be obtained by sorting the streams into some 'canonical' order and
decompressing. Even without decompressing, sorting alone can give good
results, as long as the changes are confined to a single stream of the
container. Personally, I know of no example of git filters used to
perform this sorting, which can be extremely tricky in assuring the
possibility of recovering the file in its original stream order.

Maybe (but this is just speculation), once the bup-inspired file
chunking support is in place, people will start contributing filters
to improve the management of many types of standard files (obviously
'improve' in terms of space efficiency, as filters can be quite slow).

Sergio
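(A minimal sketch of the filter trick Sergio describes, for gzipped
files: store them decompressed so they delta, and recompress on
checkout. The filter name "gz" is made up for this example, and his
roundtrip caveat applies: gzip only reproduces the original bytes if
the file was compressed with the same tool and settings.)

  git config filter.gz.clean 'gzip -cd'    # check-in: decompress
  git config filter.gz.smudge 'gzip -cn'   # checkout: recompress
  echo '*.gz filter=gz' >>.gitattributes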
* Re: GSoC - Some questions on the idea of
  From: Bo Chen
  Date: 2012-03-30 19:44 UTC
  To: Sergio; Cc: git

The following is a list of sub-problems, according to my understanding
of the "big file support" problem. Can anyone give some feedback and
help refine it? Thanks.

  large file -+- text file (always deltas well? needs to be confirmed)
              |
              +- binary file -+- general binary file (no encryption or
                              |  compression, nor other transforms
                              |  that definitely prevent good deltas)
                              |   +- deltas well (ok)
                              |   +- does not delta well (improvement?)
                              |
                              +- encrypted file (improvement? one
                              |  straightforward method is to decrypt
                              |  the file before delta-ing it, but we
                              |  don't always have the key. Other?)
                              |
                              +- compressed file (improvement?
                                 decompress before delta-ing? Other?)

Bo

On Wed, Mar 28, 2012 at 7:33 AM, Sergio <sergio.callegari@gmail.com> wrote:
> I would add that the larger a file, the larger the temptation to use
> a compressed format for it, so large files are often compressed
> binaries.
> [...]
* Re: GSoC - Some questions on the idea of
  From: Bo Chen
  Date: 2012-03-30 19:51 UTC
  To: Sergio; Cc: git

Please disregard my last email; here is a more readable version.

The sub-problems of the "delta for large files" problem:

1 large file
  1.1 text file (always deltas well? needs to be confirmed)
  1.2 binary file
    1.2.1 general binary file (no encryption or compression, nor
          other transforms that definitely prevent good deltas)
      1.2.1.1 deltas well (ok)
      1.2.1.2 does not delta well (improvement?)
    1.2.2 encrypted file (improvement? one straightforward method is
          to decrypt the file before delta-ing it, but we don't
          always have the key for decryption. Other?)
    1.2.3 compressed file (improvement? decompress before delta-ing
          it? Other?)

Can anyone give me feedback to further refine this? Thanks.

Bo

On Wed, Mar 28, 2012 at 7:33 AM, Sergio <sergio.callegari@gmail.com> wrote:
> I would add that the larger a file, the larger the temptation to use
> a compressed format for it, so large files are often compressed
> binaries.
> [...]
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-03-30 20:34 UTC
  To: Bo Chen; Cc: Sergio, git

On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote:
> The sub-problems of the "delta for large files" problem:
>
> 1 large file
>   1.1 text file (always deltas well? needs to be confirmed)

They often do, but text files don't tend to be large. There are some
exceptions (e.g., genetic data is often kept in line-oriented text
files, but is very large).

But let's take a step back for a moment. Forget about whether a file
is binary or not. Imagine you want to store a very large file in git.
What are the operations that will perform badly? How can we make them
perform acceptably, and what tradeoffs must we make? E.g., the way the
diff code is written, it would be very difficult to run "git diff" on
a 2 gigabyte file. But is that actually a problem? Answering that
means talking about the characteristics of 2 gigabyte files, what we
expect to see, and to what degree our tradeoffs will impact them.

Here's a more concrete example. At first, even storing a 2 gigabyte
file with "git add" was painful, because we would load the whole thing
in memory. Repacking the repository was painful, because we had to
rewrite the whole 2G file into a packfile. Nowadays, we stream large
files directly into their own packfiles, and we pay the I/O only once
(and the memory cost never). As a tradeoff, we no longer get delta
compression of large objects. That's OK for some large objects, like
movie files (which don't tend to delta well anyway). But it's not for
other objects, like virtual machine images, which do tend to delta
well.

So can we devise a solution which efficiently stores these
delta-friendly objects, without losing the performance improvements we
got with the stream-directly-to-packfile approach?

One possible solution is breaking large files into smaller chunks
using something like the bupsplit algorithm (and I won't go into the
details here, as links to bup have already been mentioned elsewhere,
and Junio's patches make a start at this sort of splitting).

Note that there are other problem areas with big files that can be
worked on, too. For example, some people want to store 100 gigabytes
in a repository. Because git is distributed, that means 100G in the
repo database and 100G in the working directory, for a total of 200G.
People in this situation may want to store part of the repository
database in a network-accessible location, trading some of the
convenience of being fully distributed for the space savings. So
another project could be designing a network-based alternate object
storage system.

-Peff
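(The streaming behavior Peff describes can be seen directly. A sketch,
assuming git v1.7.8 or later, where blobs over core.bigFileThreshold
are streamed into a pack at "git add" time, and a scratch directory:)

  git init bigdemo && cd bigdemo
  git config core.bigFileThreshold 1m
  dd if=/dev/urandom of=disk.img bs=1M count=16
  git add disk.img           # streamed, never held whole in memory
  ls .git/objects/pack/      # the blob already sits in its own pack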
* Re: GSoC - Some questions on the idea of
  From: Bo Chen
  Date: 2012-03-30 23:08 UTC
  To: Jeff King; Cc: Sergio, git

Thanks for the quick reply. My comments are inline below.

On Fri, Mar 30, 2012 at 4:34 PM, Jeff King <peff@peff.net> wrote:
> Nowadays, we stream large files directly into their own packfiles,
> and we pay the I/O only once (and the memory cost never). As a
> tradeoff, we no longer get delta compression of large objects.
> That's OK for some large objects, like movie files (which don't tend
> to delta well anyway). But it's not for other objects, like virtual
> machine images, which do tend to delta well.

It seems that we should first provide some mechanism to distinguish
delta-friendly objects from non-delta-friendly objects. I am wondering
whether such an algorithm is available now, or still needs to be
devised.

> So can we devise a solution which efficiently stores these
> delta-friendly objects, without losing the performance improvements
> we got with the stream-directly-to-packfile approach?

Ah, I see. Designing an efficient solution for storing delta-friendly
objects is the main concern. Thank you for helping me clarify this
point.

> Note that there are other problem areas with big files that can be
> worked on, too. For example, some people want to store 100 gigabytes
> in a repository. [...] So another project could be designing a
> network-based alternate object storage system.

From the architecture point of view, CVS is fully centralized and git
is fully distributed. It seems that for big repos, the architecture
described above now sits in the middle ^-^.

Bo
* Re: GSoC - Some questions on the idea of
  From: Sergio Callegari
  Date: 2012-03-31 11:02 UTC
  To: Bo Chen; Cc: Jeff King, git

I wonder if it could make sense to have some pluggable mechanism for
file splitting, something along the lines of filters, so to say.
Bupsplit can be a rather general mechanism, but large binaries that
are containers (zip, jar, docx, tgz, pdf - seen as a collection of
streams) may be more conveniently split by their inherent components.
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-03-31 16:18 UTC
  To: Sergio Callegari; Cc: Bo Chen, Jeff King, git

On 3/31/2012 6:02 AM, Sergio Callegari wrote:
> I wonder if it could make sense to have some pluggable mechanism for
> file splitting, something along the lines of filters, so to say.
> Bupsplit can be a rather general mechanism, but large binaries that
> are containers (zip, jar, docx, tgz, pdf - seen as a collection of
> streams) may be more conveniently split by their inherent
> components.

gitattributes or gitconfig could configure the big-file handler for
specified files. Known/supported filetypes like gif, png, zip, pdf,
etc., could be auto-configured by git. Any yet-unknown or
yet-unsupported filetypes could be configured manually by the user,
e.g.:

  *.zgp=bigcontainer

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-04-02 21:07 UTC
  To: Neal Kreitzinger; Cc: Sergio Callegari, Bo Chen, git

On Sat, Mar 31, 2012 at 11:18:16AM -0500, Neal Kreitzinger wrote:
> gitattributes or gitconfig could configure the big-file handler for
> specified files. Known/supported filetypes like gif, png, zip, pdf,
> etc., could be auto-configured by git. Any yet-unknown or
> yet-unsupported filetypes could be configured manually by the user,
> e.g.:
>
>   *.zgp=bigcontainer

This is a tempting route (and one I've even suggested myself before),
but I think ultimately it is a bad way to go. The problem is that
splitting is only half of the equation. Once you have split contents,
you have to use them intelligently, which means looking at the sha1s
of each split chunk and discarding whole chunks as "the same" without
even looking at the contents.

Which means that it is very important that your chunking algorithm
remain stable from version to version. A change in the algorithm is
going to completely negate the benefits of chunking in the first
place. So something configurable, or something that is not applied
consistently (because it depends on each user's git config, or even on
the specific version of a tool used), can end up being no help at all.

Properly applied, I think a content-aware chunking algorithm could
out-perform a generic one. But I think we need to first find out
exactly how well the generic algorithm can perform. It may be "good
enough" compared to the hassle that inconsistent application of a
content-aware algorithm will cause. So I wouldn't rule it out, but I'd
rather try the bup-style splitting first, and see how good (or bad) it
is.

-Peff
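(An illustration of why boundary stability matters. Fixed-size
splitting stands in here for a real content-defined chunker; the point
is only that two different splitting schemes share no chunks at all:)

  seq 1 100000 >big.txt
  split -b 8192 big.txt a- && sha1sum a-* | sort >chunks-a
  split -b 4096 big.txt b- && sha1sum b-* | sort >chunks-b
  # count chunk ids common to both splits; prints 0, so a repository
  # that switched algorithms would store and send everything again
  comm -12 <(cut -d' ' -f1 chunks-a) <(cut -d' ' -f1 chunks-b) | wc -l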
* Re: GSoC - Some questions on the idea of
  From: Sergio Callegari
  Date: 2012-04-03 9:58 UTC
  To: Jeff King; Cc: Neal Kreitzinger, Bo Chen, git

On 02/04/2012 23:07, Jeff King wrote:
> Which means that it is very important that your chunking algorithm
> remain stable from version to version. A change in the algorithm is
> going to completely negate the benefits of chunking in the first
> place. So something configurable, or something that is not applied
> consistently (because it depends on each user's git config, or even
> on the specific version of a tool used), can end up being no help at
> all.

Isn't this the same with filters? The clean algorithms should remain
stable from version to version. Filters are often perceived as
simpler, so this stability seems easier to achieve, but that is not
necessarily the case.

> Properly applied, I think a content-aware chunking algorithm could
> out-perform a generic one. But I think we need to first find out
> exactly how well the generic algorithm can perform. It may be "good
> enough" compared to the hassle that inconsistent application of a
> content-aware algorithm will cause.

Absolutely true, but why not give the user the freedom to choose? Git
could provide the bupsplit mechanism and at the same time have a means
for the user to plug in a different machinery for specific file types.
In this case, it is the user's responsibility to do it right.

One could have a special 'filter' for splitting/unsplitting. Say:

  [splitfilter "XXX"]
      split = xxx
      unsplit = uxxx

xxx is given the file to split on stdin and returns on stdout a stream
made of an index header and the concatenation of the parts into which
the file should be split. For unsplitting, uxxx is given on stdin the
index and the concatenation of parts, and returns the binary file on
stdout. bupsplit and bupunsplit could be built in, with other tools
being user-provided. If the user gets them wrong, it is ultimately
his/her responsibility; in the end, the user is given even 'rm', isn't
he/she?

Git could provide a header file defining the index header format, to
help the coding of alternative, more specific splitters. If people
devise some that look promising, they can probably be collected in
contrib.

Possibly, the index header could comprise starting positions for the
various parts in the stream, but also 'names' for them. This would
allow reusing blob and tree objects to physically store the various
parts. For bupsplit, names could be flat (e.g. sequence numbers like
0000, 0001). For files that are containers, they could reflect the
inner names. Prospectively, one could even devise specific diff tools
for these 'special' trees of split-object components. With this, when
storing, say, a very large zip file in git, these tools could help say
things like "from version x to version y, only that specific part of
the zip file has changed".

Sergio
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-11 1:24 UTC
  To: Jeff King; Cc: Sergio Callegari, Bo Chen, git

On 4/2/2012 4:07 PM, Jeff King wrote:
> ...I think we need to first find out exactly how well the generic
> algorithm can perform. It may be "good enough" compared to the
> hassle that inconsistent application of a content-aware algorithm
> will cause. So I wouldn't rule it out, but I'd rather try the
> bup-style splitting first, and see how good (or bad) it is.

(I read the bup DESIGN doc to see what bup-style splitting is.) When
you use bup delta technology in git.git, I take it that you will use
it for big-worktree-files *and* big-history-files (not-big worktree
files that are not xdelta-friendly)? IOW, all binaries plus big text
worktree files. Otherwise, small binaries will grow large histories.

If small binaries are not going to be bup-delta-compressed, then what
about using xxd to convert the binary to text and then
xdelta-compressing the hex dump to achieve efficient delta compression
in the pack file? You could convert the hex dump back to binary with
xxd for checkout and such.

Maybe small binaries xdelta well and the above is a moot point.

This is all theory to me, but the reality is looming over my head,
since most of the components I should be tracking are binaries, small
(large history?) and big. I am not yet tracking them because of
"big-file" concerns: I don't want to have to refactor my vast git
ecosystem with filter-branch later because I slammed binaries into the
main project or superproject without proper systems programming. (I'm
not sure what the c/linux term is for 'systems programming', but in
the mainframe world it meant making sure everything was configured for
efficient performance.)

Now that I say that out loud, I guess a superproject with binaries in
separate repos could be easily refactored by creating new, efficient
repos and making a new commit that points to them instead of the old,
inefficient repos. That way, when someone checks out the binary repo
(submodule) into their worktree, they get the new efficiency instead
of the old inefficiency. Over time, as folks become less likely to
check out old stuff, the old inefficiency goes away on its own. I
think. (Submodules are mostly theory to me at this point, also.)

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jonathan Nieder
  Date: 2012-04-11 6:04 UTC
  To: Neal Kreitzinger; Cc: Jeff King, Sergio Callegari, Bo Chen, git

Neal Kreitzinger wrote:
> Maybe small binaries xdelta well and the above is a moot point.

If I am reading it correctly, diff-delta copes fine with smallish
binary files that have not changed much. Converting to hex would only
hurt.

I would suggest tracking source code instead of binaries if possible,
though.

Jonathan
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-11 16:29 UTC
  To: Jonathan Nieder; Cc: Jeff King, Sergio Callegari, Bo Chen, git

On 4/11/2012 1:04 AM, Jonathan Nieder wrote:
> If I am reading it correctly, diff-delta copes fine with smallish
> binary files that have not changed much. Converting to hex would
> only hurt.

How do I check the history size of a binary? IOW, how do I check the
total size of all the delta compressions plus the root blob of a
binary? That way I can sample different binary types to get a
symptomatic idea of how well they delta-compress. I suspect that
compiled binaries will compress well (efficient history) and graphics
files may not compress well (large history).

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-04-11 22:09 UTC
  To: Neal Kreitzinger; Cc: Jonathan Nieder, Sergio Callegari, Bo Chen, git

On Wed, Apr 11, 2012 at 11:29:50AM -0500, Neal Kreitzinger wrote:
> How do I check the history size of a binary? IOW, how do I check the
> total size of all the delta compressions plus the root blob of a
> binary? That way I can sample different binary types to get a
> symptomatic idea of how well they delta-compress.

I don't think there is a simple command to do it. You have to
correlate blobs at a given path with objects in the packs yourself.
You can script it like:

  # Get the delta stats from every pack; you only need to do this
  # part once for a given history state. And obviously you would want
  # to repack before doing it.
  for i in .git/objects/pack/*.pack; do
      git verify-pack -v $i
  done |
  perl -lne '
      # format is: sha1 type size size-in-pack offset; pick out only
      # the thing we care about: size in pack
      /^([0-9a-f]{40}) \S+\s+\d+ (\d+)/ and print "$1 $2";
  ' |
  sort >delta-stats

  # Then you can do this for every path you are interested in.
  # First, get the list of blobs at that path (and follow renames,
  # too). The second line picks the "after" sha1 out of the --raw
  # output.
  git log --follow --raw --no-abbrev $path |
  perl -lne '/:\S+ \S+ \S{40} (\S{40})/ and print $1' |
  sort -u >blobs

  # Then find the delta stats for those blobs.
  join blobs delta-stats

which should give you the stored size of each version of a file.

-Peff
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-11 16:35 UTC
  To: Jonathan Nieder; Cc: Jeff King, Sergio Callegari, Bo Chen, git

On 4/11/2012 1:04 AM, Jonathan Nieder wrote:
> If I am reading it correctly, diff-delta copes fine with smallish
> binary files that have not changed much. Converting to hex would
> only hurt.
>
> I would suggest tracking source code instead of binaries if
> possible, though.

Is there documentation out there that lists the common binary formats
(e.g., pdf, docx, gif, jpg, png, bmp, mpeg, mp3, zip, c binaries, java
stuff, website stuff, etc.), explains their nature (container,
compressed, encrypted, etc.), and describes how well they currently
delta in git.git within specified size boundaries and use cases (pdfs
with only plain text vs. pdfs with graphics, tables, etc.), so that
git users can reference it when making repo/superproject design
decisions?

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-11 16:44 UTC
  To: Jonathan Nieder; Cc: Jeff King, Sergio Callegari, Bo Chen, git

On 4/11/2012 1:04 AM, Jonathan Nieder wrote:
> I would suggest tracking source code instead of binaries if
> possible, though.

I suppose the original "source" in git (the linux kernel) was so low
level that it had no graphics files. However, most projects are
end-user projects and have graphics, so I would think that tracking
them is a normal, expected use of git to version your software. If
you're going to do that, then there shouldn't be a problem tracking
other binaries that are static constants across all servers (as
opposed to user-edited content like databases). I would consider this
subset of "binaries" to be the expected domain of git revision control
for software, i.e., gui software.

Graphics files for your app are "source". The binary is all you have.
It's the "source" that you edit to make changes. Maybe I'm missing
something here. Maybe graphics files are "container" files, and that
makes them a problem.

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jonathan Nieder
  Date: 2012-04-11 17:20 UTC
  To: Neal Kreitzinger; Cc: Jeff King, Sergio Callegari, Bo Chen, git

Neal Kreitzinger wrote:
> Graphics files for your app are "source". The binary is all you
> have.

Often there is source in SVG or some other simple editable format that
gets lossily compiled to PNG or JPEG compressed raster graphics.
* Re: GSoC - Some questions on the idea of
  From: Junio C Hamano
  Date: 2012-04-11 18:51 UTC
  To: Jonathan Nieder; Cc: Neal Kreitzinger, Jeff King, Sergio Callegari, Bo Chen, git

Jonathan Nieder <jrnieder@gmail.com> writes:
> Often there is source in SVG or some other simple editable format
> that gets lossily compiled to PNG or JPEG compressed raster
> graphics.

You could have just underlined the "if possible" part in your earlier
message and ended this thread, which seems to be needlessly
continuing.
* Re: GSoC - Some questions on the idea of
  From: Jonathan Nieder
  Date: 2012-04-11 19:03 UTC
  To: Junio C Hamano; Cc: Neal Kreitzinger, Jeff King, Sergio Callegari, Bo Chen, git

Junio C Hamano wrote:
> You could have just underlined the "if possible" part in your
> earlier message

Yes, that's a more important point than the one I responded to. :)
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-11 18:23 UTC
  To: Jonathan Nieder; Cc: Jeff King, Sergio Callegari, Bo Chen, git

On 4/11/2012 1:04 AM, Jonathan Nieder wrote:
> I would suggest tracking source code instead of binaries if
> possible, though.

Reasons why we want to track binaries:

(1) Standard targets: our deployment is assembly-line style, because
our target servers are under our control.

(2) Copy vs. recompile: we run certain "supported" linux distro
versions on our target servers, so we can just put our binaries on
them instead of recompiling.

(3) In-house-source compiled binaries: for our particular proprietary
(third-party) source language, the binaries run on top of a runtime
that runs on top of the O/S, so the need to recompile on a server is a
non-issue. We use xxd and compile listings to "diff" our compiled
binaries to detect missing copybook and data-dictionary dependencies
(missed recompiles), unnecessary recompiles (you didn't really change
what you thought you changed), and miscompiles. We do this
compiled-binary validation in git branches and then diff the branches
to detect the discrepancies.

(4) Proprietary third-party binaries (no source): for our third-party
binaries we don't have the source. They are distributed as
self-extracting executables. Changes to third-party binaries are
relatively infrequent, but frequent enough to cause confusion, and
therefore need to be tracked.

(5) Graphics "source" versioning: our graphics files are part of our
software, and changes to them need to be tracked.

(6) O/S versioning: our linux distro is tracked in a bazaar repo, so
I'm thinking we should be able to track it in a git repo instead. The
assembly line just deploys the payload to a new server instead of
doing a manual install.

(7) Superproject tracking of a "super-release": the above subsystems
are related in varying degrees (dependent). A superproject can
associate all the versions that comprise a "super" release of the
various subsystem version dependencies.

While some of the reasons above may be non-normative for some
git-users, I think a large portion (if not the majority) of git-users
will find some subset of them (namely reasons 5 and 4) normative for
their use-cases, making the need for binary tracking normative for
git-users in general.

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-04-11 21:35 UTC
  To: Neal Kreitzinger; Cc: Sergio Callegari, Bo Chen, git

On Tue, Apr 10, 2012 at 08:24:48PM -0500, Neal Kreitzinger wrote:
> (I read the bup DESIGN doc to see what bup-style splitting is.) When
> you use bup delta technology in git.git, I take it that you will use
> it for big-worktree-files *and* big-history-files

I'm not sure what those terms mean. We are talking about files at the
blob level. So they are either big or not big. We don't know how they
will delta, or what their histories will be like.

> (not-big worktree files that are not xdelta-friendly)? IOW, all
> binaries plus big text worktree files. Otherwise, small binaries
> will grow large histories.

Files that don't delta won't be helped by splitting, as it is just
another form of finding deltas (in fact, it should produce worse
results than xdelta, because it works at a larger granularity; its
advantage is that it is not as memory- or CPU-hungry as something like
xdelta). So you really only want to use this for files that are too
big to practically run through the regular delta algorithm. And if you
can avoid it on files that will never delta well, you are better off
(because it adds storage overhead over a straight blob).

The first part is easy: only do it for files that are so big that you
can't run the regular delta algorithm. Since your only alternative is
doing nothing, you only have to perform better than nothing. :)

The second part is harder. We generally don't know that a file doesn't
delta well until we have two versions of it to try[1]. And that's
where some domain-specific knowledge can come in (e.g., knowing that a
file is compressed video, and that future versions are likely to
differ in the video content). But sometimes the results can be
surprising. I keep a repository of photos and videos, carefully
annotated via exif tags. If the media content changes, the results
won't delta well. But if I change the exif tags, they _do_ delta very
well. So whether something like bupsplit is a win depends on the exact
update patterns.

[1] I wonder if you could do some statistical analysis on the
randomness of the file content to determine this. That is, things
which look very random are probably already heavily compressed, and
are not going to compress further. You might guess that to mean they
will not delta well, either. And sometimes that is true. But the
example I gave above violates it (most of the file is random, but the
_changes_ from version to version are not random, and that is what the
delta is compressing).

> If small binaries are not going to be bup-delta-compressed, then
> what about using xxd to convert the binary to text and then
> xdelta-compressing the hex dump to achieve efficient delta
> compression in the pack file? You could convert the hex dump back to
> binary with xxd for checkout and such.

That wouldn't help. You are only trading the binary representation for
a less efficient one. The data patterns will not change. The
redundancy you introduced in the first step may mostly come out via
compression, but it will never be a net win. I'm sure if I were a
better computer scientist I could write you some proof involving
Shannon entropy. But here's a fun experiment:

  # create two files, one very compressible and one not very
  # compressible
  dd if=/dev/zero of=boring.bin bs=1M count=1
  dd if=/dev/urandom of=rand.bin bs=1M count=1

  # now make hex dumps of each, and compress the original and the
  # hex dump
  for i in boring rand; do
      xxd <$i.bin >$i.hex
      for j in bin hex; do
          gzip -c <$i.$j >$i.$j.gz
      done
  done

  # and look at the results
  du {boring,rand}.*

I get:

  1024  boring.bin
  4     boring.bin.gz
  4288  boring.hex
  188   boring.hex.gz
  1024  rand.bin
  1028  rand.bin.gz
  4288  rand.hex
  2324  rand.hex.gz

So you can see that the thing that compresses well will do so in
either representation, but the end result is a net loss with the less
efficient representation. Whereas the thing that does not compress
well will achieve a better compression ratio in its text form, but
will still be a net loss. The reason is that you are just compressing
out all of the redundant bits.

You might observe that this is using gzip, not xdelta. But I think
from an information theory standpoint, they are two sides of the same
coin (e.g., you could consider a delta between two things to be
equivalent to concatenating them and compressing the result). You
should be able to design a similar experiment with xdelta.

> Maybe small binaries xdelta well and the above is a moot point.

Some will and some will not. But it has nothing to do with whether
they are binary, and everything to do with the type of content they
store (or, if binariness does matter, then our delta algorithms should
be improved).

> This is all theory to me, but the reality is looming over my head,
> since most of the components I should be tracking are binaries,
> small (large history?) and big. [...] I don't want to have to
> refactor my vast git ecosystem with filter-branch later because I
> slammed binaries into the main project or superproject without
> proper systems programming.

One of the things that makes bup not usable as-is for git is that it
fundamentally changes the object identities. It would be very easy for
"git add" to bupsplit a file into a tree and store that tree using git
(in fact, that is more or less how bup works). But that means the
resulting object sha1 is going to depend on the splitting choices
made.

Instead, we want to consider the split version of an object to be
simply an on-disk representation detail, just as it is a
representation detail that some objects are stored delta-encoded
inside packs versus as loose objects; the sha1 of the object is the
same, and we can reconstruct it byte-for-byte when we want to.

So properly implemented, no, you would not ever have to filter-branch
to tweak these settings. You might have to repack to see the gains
(because you want to delete the old, non-split representation in your
pack and replace it with a split one), but that is transparent to
git's abstract data model.

-Peff
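(One possible version of the xdelta experiment Peff suggests, reusing
the boring.bin/rand.bin files and their hex dumps from his script
above; it assumes the standalone xdelta3 tool is installed:)

  for i in boring rand; do
      # make a second version differing by a single byte
      cp $i.bin $i-2.bin
      printf x | dd of=$i-2.bin bs=1 seek=1000 conv=notrunc
      xxd <$i-2.bin >$i-2.hex
      xdelta3 -e -f -s $i.bin $i-2.bin $i.bin.delta
      xdelta3 -e -f -s $i.hex $i-2.hex $i.hex.delta
  done
  du {boring,rand}.{bin,hex}.delta

  # all four deltas come out small; the hex form still loses overall,
  # because the 4x larger base version must be stored alongside them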
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-12 19:29 UTC
  To: Jeff King; Cc: Sergio Callegari, Bo Chen, git

On 4/11/2012 4:35 PM, Jeff King wrote:
> So properly implemented, no, you would not ever have to
> filter-branch to tweak these settings. You might have to repack to
> see the gains (because you want to delete the old, non-split
> representation in your pack and replace it with a split one), but
> that is transparent to git's abstract data model.

I'm likely going to have to slam graphics files into the main repo in
the very near future. It sounds like once git.git is updated for
big-file optimization, I can just upgrade to that git version and
repack to get the benefits. Any idea when that version of git will
come out, release-number-wise and calendar-wise?

(Don't read this next part if you just ate or are eating or drinking;
you may throw up from nausea or choke from laughing. I am forced to
deal with a mandated, micromanaged change-control menu design from the
powers-that-be that is based on cvs workflow and wipes your nose for
you. It can't even cope with branches, much less submodules, so in
that context there isn't time to implement the graphics tracking as a
submodule. This change-control menu is designed to replace cvs
commands with git-command sequences that produce equivalent results.
While many git users import from svn into git, do their work in git,
and then export back into svn to get work done, ironically I am
probably the only git user who has to import from git (a
powers-that-be-mandated, cvs-style, menu-controlled git repo) into git
(a separate, normal git repo and command line), do the work in normal
git, and then export it back into git (the cvs-style, menu-controlled
git repo).)

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-04-12 21:03 UTC
  To: Neal Kreitzinger; Cc: Sergio Callegari, Bo Chen, git

On Thu, Apr 12, 2012 at 02:29:40PM -0500, Neal Kreitzinger wrote:
> I'm likely going to have to slam graphics files into the main repo
> in the very near future. It sounds like once git.git is updated for
> big-file optimization, I can just upgrade to that git version and
> repack to get the benefits.

Depending on the size and number of the files, git may handle them
just fine. They don't delta well, which means they will bloat your
object db a bit, but if you are talking about hundreds of megabytes
total, it is probably not that big a deal.

> Any idea when that version of git will come out, release-number-wise
> and calendar-wise?

No idea. This is still in the discussion and experimenting stage. It
may not even happen.

-Peff
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-04-15 2:15 UTC
  To: Neal Kreitzinger; Cc: Sergio Callegari, Bo Chen, git

On Sat, Apr 14, 2012 at 09:13:17PM -0500, Neal Kreitzinger wrote:
> Does a file's delta-compression efficiency in the pack-file directly
> correlate to its efficiency of transmission size/bandwidth in a
> git-fetch and git-push? IOW, are big files also a problem for
> git-fetch and git-push, by taking too long in a remote transfer?

Yes. The on-the-wire format is a packfile. We create a new packfile on
the fly, so we may find new deltas (e.g., between objects that were
stored on disk in two different packs), but we will mostly be reusing
deltas from the existing packs.

So any time you improve the on-disk representation, you are also
improving the network bandwidth utilization.

-Peff
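(The reuse is visible in the counters pack-objects reports during a
clone; the output here is abbreviated and the numbers illustrative:)

  git clone --no-local src-repo dst-repo
  #   remote: Total 1234 (delta 800), reused 1200 (delta 790)
  # "reused" deltas were copied straight from the existing on-disk
  # pack rather than recomputed for the wire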
* Re: GSoC - Some questions on the idea of
  From: Neal Kreitzinger
  Date: 2012-04-15 2:33 UTC
  To: Jeff King; Cc: Sergio Callegari, Bo Chen, git

On 4/14/2012 9:15 PM, Jeff King wrote:
> So any time you improve the on-disk representation, you are also
> improving the network bandwidth utilization.

We use git to transfer database files from the dev server to
qa-servers. Sometimes these transfers barf for some reason, and I get
called to remediate. I assumed the user closed their session
prematurely because it was "taking too long". However, now I'm
wondering if the git-pull --ff-only is dying on its own due to the big
files. It could be that a qa-server that hasn't updated database files
in a while is pulling far more than another qa-server that does its
git-pull more frequently. How would I go about troubleshooting this?
Are there some log files I should look at? (I'm using git 1.7.1
compiled with the git makefile on rhel6.)

When I go to remediate, I do git-reset --hard to clear out the barfed
worktree/index and then run git-pull --ff-only manually, and it always
works. I'm not sure that proves it wasn't git that barfed the first
time. Maybe the first time git brought some stuff over and barfed
because it bit off more than it could chew, but the second time it has
less to chew because it already chewed some of it the first time, and
therefore it works.

v/r,
neal
* Re: GSoC - Some questions on the idea of
  From: Jeff King
  Date: 2012-04-16 14:54 UTC
  To: Neal Kreitzinger; Cc: git

On Sat, Apr 14, 2012 at 09:33:37PM -0500, Neal Kreitzinger wrote:
> We use git to transfer database files from the dev server to
> qa-servers. Sometimes these transfers barf for some reason, and I
> get called to remediate. [...] How would I go about troubleshooting
> this? Are there some log files I should look at? (I'm using git
> 1.7.1 compiled with the git makefile on rhel6.)

No, git doesn't keep logfiles. Errors go to stderr, so look wherever
the stderr for your git sessions is going (if you are doing this via a
cron job or something, then that is outside the scope of git).

> When I go to remediate, I do git-reset --hard to clear out the
> barfed worktree/index and then run git-pull --ff-only manually, and
> it always works.

Try "git pull --no-progress" and see if it still works. If the server
has a very long delta-compression phase, there will be no output
generated for a while, which could cause intermediate servers to hang
up (git won't do this, but if, for example, you are pulling over
git-over-http and there is a reverse proxy in the middle, it may hit a
timeout). If the automated pulls are happening from a cron job, then
they won't have a terminal, and progress reporting will be off by
default.

-Peff
* Re: GSoC - Some questions on the idea of 2012-04-15 2:15 ` Jeff King 2012-04-15 2:33 ` Neal Kreitzinger @ 2012-05-10 21:43 ` Neal Kreitzinger 2012-05-10 22:39 ` Jeff King 1 sibling, 1 reply; 43+ messages in thread From: Neal Kreitzinger @ 2012-05-10 21:43 UTC (permalink / raw) To: Jeff King; +Cc: Sergio Callegari, Bo Chen, git On 4/14/2012 9:15 PM, Jeff King wrote: > On Sat, Apr 14, 2012 at 09:13:17PM -0500, Neal Kreitzinger wrote: > >> Does a file's delta-compression efficiency in the pack-file directly >> correlate to its efficiency of transmission size/bandwidth in a >> git-fetch and git-push? IOW, are big-files also a problem for >> git-fetch and git-push by taking too long in a remote transfer? > Yes. The on-the-wire format is a packfile. We create a new packfile on > the fly, so we may find new deltas (e.g., between objects that were > stored on disk in two different packs), but we will mostly be reusing > deltas from the existing packs. > > So any time you improve the on-disk representation, you are also > improving the network bandwidth utilization. > The git-clone manpage says you can use the rsync protocol for the url. If you use rsync:// as your url for your remote, does that get you the rsync delta-transfer algorithm efficiency for the network bandwidth utilization part (as opposed to the on-disk representation part)? (I'm new to rsync.) v/r, neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-05-10 21:43 ` Neal Kreitzinger @ 2012-05-10 22:39 ` Jeff King 0 siblings, 0 replies; 43+ messages in thread From: Jeff King @ 2012-05-10 22:39 UTC (permalink / raw) To: Neal Kreitzinger; +Cc: Sergio Callegari, Bo Chen, git On Thu, May 10, 2012 at 04:43:26PM -0500, Neal Kreitzinger wrote: > >Yes. The on-the-wire format is a packfile. We create a new packfile on > >the fly, so we may find new deltas (e.g., between objects that were > >stored on disk in two different packs), but we will mostly be reusing > >deltas from the existing packs. > > > >So any time you improve the on-disk representation, you are also > >improving the network bandwidth utilization. > > > The git-clone manpage says you can use the rsync protocol for the > url. If you use rsync:// as your url for your remote does that get > you the rsync delta-transfer algorithm efficiency for the network > bandwidth utilization part (as opposed to the on-disk representation > part)? (I'm new to rsync.) Well, yes. If you use the rsync transport, it literally runs rsync, which will use the regular rsync algorithm. But it won't be better than the git protocol (and in fact will be much worse) for a few reasons: 1. The object db files are all named after the sha1 of their content (the object sha1 for loose objects, and the sha1 of the whole pack for packfiles). Rsync will not run its comparison algorithm between files with different names. It will not re-transfer existing loose objects, but it will delete obsolete packfiles and retransfer new ones in their entirety. So it's like re-cloning over again for any fetch after an upstream repack. 2. Even if you could use the rsync delta algorithm, it will never be as efficient as git. Git understands the structure of the packfile and can tell the other side "Hey, I have these objects". Whereas rsync must guess from the bytes in the packfiles. Which is much less efficient to compute, and can be wrong if the representation has changed (e.g., something used to be a whole object, but is now stored as a delta). 3. Even if you could get the exact right set of objects to transfer, and then use the rsync delta algorithm on them, git would still do better. Git's job is much easier: one side has both sets of objects (those to be sent and those not), and is generating and sending efficient deltas for the other side to apply to their objects. Rsync assumes a harder job: you have one set, and the remote side has the other set, and you must agree on a delta by comparing checksums. So it will fundamentally never do as well. -Peff ^ permalink raw reply [flat|nested] 43+ messages in thread
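Point 1 is easy to see on disk, since pack names are derived from content; a rough illustration (the hashes are placeholders):

    $ ls .git/objects/pack/
    pack-<sha1-A>.idx  pack-<sha1-A>.pack
    $ git repack -a -d      # what an upstream repack effectively does
    $ ls .git/objects/pack/
    pack-<sha1-B>.idx  pack-<sha1-B>.pack

The repacked objects are the same, but the pack file has a brand-new name, so rsync treats it as an unrelated file and retransfers the whole thing.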
* Re: GSoC - Some questions on the idea of 2012-04-12 19:29 ` Neal Kreitzinger 2012-04-12 21:03 ` Jeff King @ 2012-04-12 21:08 ` Neal Kreitzinger 2012-04-13 21:36 ` Bo Chen 2 siblings, 0 replies; 43+ messages in thread From: Neal Kreitzinger @ 2012-04-12 21:08 UTC (permalink / raw) To: Jeff King; +Cc: Sergio Callegari, Bo Chen, git On 4/12/2012 2:29 PM, Neal Kreitzinger wrote: > > ...ironically I am probably the only git user who has to import from > git (powers-that-be mandated cvs-style menu controlled git-repo) into > git (separate normal git repo and commandline), do the work in normal > git, and then export it back into git (cvs-style menu controlled > git-repo).) > > aka, git-cotton-picking ;-) v/r, neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-04-12 19:29 ` Neal Kreitzinger 2012-04-12 21:03 ` Jeff King 2012-04-12 21:08 ` Neal Kreitzinger @ 2012-04-13 21:36 ` Bo Chen 2 siblings, 0 replies; 43+ messages in thread From: Bo Chen @ 2012-04-13 21:36 UTC (permalink / raw) To: Neal Kreitzinger; +Cc: Jeff King, Sergio Callegari, git On Thu, Apr 12, 2012 at 3:29 PM, Neal Kreitzinger <nkreitzinger@gmail.com> wrote: > On 4/11/2012 4:35 PM, Jeff King wrote: >> >> On Tue, Apr 10, 2012 at 08:24:48PM -0500, Neal Kreitzinger wrote: >> >>> This is all theory to me, but the reality is looming over my head >>> since most of the components I should be tracking are binaries small >>> (large history?) and big (but am not yet because of "big-file" >>> concerns -- I don't want to have to refactor my vast git ecosystem >>> with filter branch later because I slammed binaries into the main >>> project or superproject without proper systems programming (I'm not >>> sure what the c/linux term is for 'systems programming', but in the >>> mainframe world it meant making sure everything was configured for >>> efficient performance)). >> >> So properly implemented, no, you would not have to ever filter-branch to >> >> tweak these settings. You might have to do a repack to see the gains >> (because you want to delete the old non-split representation you have in >> your pack and replace it with a split representation), but that is >> transparent to git's abstract data model. >> > I'm likely going to have to slam graphics files into the main repo in the > very near future. It sounds like once git.git is updated for big-file > optimization I can just upgrade to that git version and repack to get the > benefits. Any idea when that version of git will come out release number > wise and calendar wise? > > (Don't read this next part if you just ate or are eating or drinking. You > may throw-up from nausea or choke from laughing.) > (I am forced to deal with a mandated/micromanaged change control menu design > from the powers-that-be that is based on cvs workflow and to > wipe-your-nose-for-you. It can't even cope with branches much less > submodules so in that context there isn't time to implement the graphics > tracking as a submodule. This change control menu is designed to replace > cvs commands with equivalent-results git-command sequences. While there are > many git users who import from svn into git, do their work in git, and then > export back into svn to get work done, ironically I am probably the only git > user who has to import from git (powers-that-be mandated cvs-style menu > controlled git-repo) into git (separate normal git repo and commandline), do > the work in normal git, and then export it back into git (cvs-style menu > controlled git-repo).) It seems that this is not directly related to the big-file support issue. Maybe it is better to discuss it in a new thread ^-^ > > v/r, > neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-03-30 20:34 ` Jeff King 2012-03-30 23:08 ` Bo Chen @ 2012-03-31 15:19 ` Neal Kreitzinger 2012-04-02 21:40 ` Jeff King 2012-03-31 16:49 ` Neal Kreitzinger 2012-03-31 20:28 ` Neal Kreitzinger 3 siblings, 1 reply; 43+ messages in thread From: Neal Kreitzinger @ 2012-03-31 15:19 UTC (permalink / raw) To: Jeff King; +Cc: Bo Chen, Sergio, git On 3/30/2012 3:34 PM, Jeff King wrote: > On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote: > >> The sub-problems of "delta for large file" problem. >> >> 1 large file >> > Note that there are other problem areas with big files that can be > worked on, too. For example, some people want to store 100 gigabytes > in a repository. I take it that you have in mind a 100G set of files comprised entirely of big-files that cannot be logically separated into smaller submodules? My understanding is that a main strategy for "big files" is to separate your big-files logically into their own submodule(s) to keep them from bogging down the not-big-file repo(s). Is one of the goals of big-file-support to make submodule strategizing unconcerned about big-file groupings and only concerned about logical-file groupings? Big-file groupings are not necessarily logical file groupings, but perhaps a technical file grouping subset of a logical file grouping that is necessitated by big-file performance considerations. IOW, is the goal of big-file-support to make big-files "just work" so that users don't have to think about graphics files, binaries, etc, and just treat them like everything else? Obviously, a 100G database file will always be a 'big-file' for the foreseeable future, but a 0.5G graphics file is not a "big file" generally speaking (as opposed to git-speaking). > Because git is distributed, that means 100G in the repo database, > and 100G in the working directory, for a total of 200G. I take it that you are implying that the 100G object-store size is due to the notion that binary files cannot-be/are-not compressed well? > People in this situation may want to be able to store part of the > repository database in a network-accessible location, trading some > of the convenience of being fully distributed for the space savings. > So another project could be designing a network-based alternate > object storage system. > I take it you are implying a local area network with users' git repos on workstations? In regard to "network-based alternate objects" that are in fact on the internet, they would first need to be cloned onto the local area network. Or are you imagining this would work for internet "network-based alternate objects"? Some setups log in to a linux server and have all their repos there. The "alternate objects" does not need to be network-based in that case. It is "local", but local does not mean 20 people cloning the alternate objects to their workstations. It means one copy of alternate objects, and twenty repos referencing that one copy. v/r, neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-03-31 15:19 ` Neal Kreitzinger @ 2012-04-02 21:40 ` Jeff King 2012-04-02 22:19 ` Junio C Hamano 0 siblings, 1 reply; 43+ messages in thread From: Jeff King @ 2012-04-02 21:40 UTC (permalink / raw) To: Neal Kreitzinger; +Cc: Bo Chen, Sergio, git On Sat, Mar 31, 2012 at 10:19:54AM -0500, Neal Kreitzinger wrote: > >Note that there are other problem areas with big files that can be > >worked on, too. For example, some people want to store 100 gigabytes > >in a repository. > > I take it that you have in mind a 100G set of files comprised entirely > of big-files that cannot be logically separated into smaller submodules? Not exactly. Two scenarios I'm thinking of are: 1. You really have 100G of data in the current version that doesn't compress well (e.g., you are storing your music collection). You can't afford to store two copies on your laptop (because you have a fancy SSD, and 100G is expensive again). You need the working tree version, but it's OK to stream the repo version of a blob from the network when you actually need it (mostly "checkout", assuming you have marked the file as "-diff"). 2. You have a 100G repository, but only 10G in the most recent version (e.g., because you are doing game development and storing the media assets). You want your clones to be faster and take less space. You can do a shallow clone, but then you're never allowed to look at old history. Instead, it would be nice to clone all of the commits, trees, and small blobs, and then stream large blobs from the network as-needed (again, mostly "checkout"). > My understanding is that a main strategy for "big files" is to separate > your big-files logically into their own submodule(s) to keep them from > bogging down the not-big-file repo(s). That helps people who want to work on the not-big parts by not forcing them into the big parts (another solution would be partial clone, but more on that in a minute). But it doesn't help people who actually want to work on the big parts; they would still have to fetch the whole big-parts repository. For splitting the big-parts people from the non-big-parts people, there have been two suggestions: partial checkout (you have all the objects in the repo, but only checkout some of them) and partial clone (you don't have some of the objects in the repo). Partial checkout is a much easier problem, as it is mostly about marking index entries as "do not bother to check this out, and pretend that it is simply unmodified". Partial clone is much harder, because it violates git's usual reachability rules. During a fetch, a client will say "I have commit X", which the server can then assume means they have all of the ancestors of X, and all of the tree and blobs referenced by X and its ancestors. But if a client can say "yes, I have these objects, but I just don't want to get them because it's expensive", then partial checkout is sufficient. The non-big-parts people will clone, omitting the big objects, and then do a partial checkout (to avoid fetching the objects even once). Note that some protocol extension is still needed for the client to tell the server "don't bother including objects X, Y, and Z in the packfile; I'll get them from my alternate big-object repo". That can either be a list of objects, or it can simply be "don't bother with objects bigger than N". > >Because git is distributed, that means 100G in the repo database, > >and 100G in the working directory, for a total of 200G. 
> > I take it that you are implying that the 100G object-store size is due > to the notion that binary files cannot-be/are-not compressed well? In this case, yes. But you could easily tweak the numbers to be 100G and 150G. The point is that the data is stored twice, and even the compressed version may be big. > >People in this situation may want to be able to store part of the > >repository database in a network-accessible location, trading some > >of the convenience of being fully distributed for the space savings. > >So another project could be designing a network-based alternate > >object storage system. > > > I take it you are implying a local area network with users' git repos > on workstations? Not necessarily. Obviously if you are doing a lot of active work on the big files, the faster your network, the better. But it could work at the internet scale, too, if you don't actually fetch the big files frequently (so part of a scheme like this would be making sure we avoid accessing big objects whenever we can; in practice, this is pretty easy, as git already tries to avoid accessing objects unnecessarily, because it's expensive even on the local end). You can also cache a certain number of fetched objects locally. Assuming there is some locality of the objects you ask about (e.g., because you are doing "git checkout" back and forth between two branches), this can help. > Some setups log in to a linux server and have all their repos there. > The "alternate objects" does not need to be network-based in that case. > It is "local", but local does not mean 20 people cloning the > alternate objects to their workstations. It means one copy of > alternate objects, and twenty repos referencing that one copy. Right. This is the same concept, except over the network. So people's working repositories are on their own workstations instead of a central server. You could even do it today by network-mounting a filesystem and pointing your alternates file at it. However, I think it's worth making git aware that the objects are on the network for a few reasons: 1. Git can be more careful about how it handles the objects, including when to fetch, when to stream, and when to cache. For example, you'd want to fetch the manifest of objects and cache it in your local repository, because you want fast lookups of "do I have this object". 2. Providing remote filesystems on an Internet scale is a management pain (and it's a pain for the user, too). My thought was that this would be implemented on top of http (the connection setup cost is negligible, since these objects would generally be large). 3. Usually alternate repositories are full repositories that meet the connectivity requirements (so you could run "git fsck" in them). But this is explicitly about taking just a few disconnected large blobs out of the repository and putting them elsewhere. So it needs a new set of tools for managing the upstream repository. -Peff ^ permalink raw reply [flat|nested] 43+ messages in thread
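The do-it-today version needs no new code at all, since it is just the stock alternates mechanism; a sketch, assuming /mnt/bigrepo is the network mount (the path is hypothetical):

    $ echo /mnt/bigrepo/objects > .git/objects/info/alternates
    # git now falls back to /mnt/bigrepo/objects for any object
    # it cannot find in its own object database

The three points above are about what this naive setup gets wrong: git has no idea those objects are slow to reach, so it cannot decide when to fetch, stream, or cache.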
* Re: GSoC - Some questions on the idea of 2012-04-02 21:40 ` Jeff King @ 2012-04-02 22:19 ` Junio C Hamano 2012-04-03 10:07 ` Jeff King 0 siblings, 1 reply; 43+ messages in thread From: Junio C Hamano @ 2012-04-02 22:19 UTC (permalink / raw) To: Jeff King; +Cc: Neal Kreitzinger, Bo Chen, Sergio, git Jeff King <peff@peff.net> writes: > 1. You really have 100G of data in the current version that doesn't > compress well (e.g., you are storing your music collection). You > can't afford to store two copies on your laptop (because you have a > fancy SSD, and 100G is expensive again). You need the working tree > version, but it's OK to stream the repo version of a blob from the > network when you actually need it (mostly "checkout", assuming you > have marked the file as "-diff"). This feels like a good candidate for an independent project that allows you fuse-mount from a remote repository to give you an illusion that you have a checkout of a specific version. Such a remote fuse-server would be an application that is built using Git, but I do not think we are in any business on the client end in such a setup. So I'll write it off as a "non-Git" issue for now. The other parts of your message are much more interesting. > Right. This is the same concept, except over the network. So people's > working repositories are on their own workstations instead of a central > server. You could even do it today by network-mounting a filesystem and > pointing your alternates file at it. However, I think it's worth making > git aware that the objects are on the network for a few reasons: > > 1. Git can be more careful about how it handles the objects, including > when to fetch, when to stream, and when to cache. For example, > you'd want to fetch the manifest of objects and cache it in your > local repository, because you want fast lookups of "do I have this > object". > > 2. Providing remote filesystems on an Internet scale is a management > pain (and it's a pain for the user, too). My thought was that this > would be implemented on top of http (the connection setup cost is > negligible, since these objects would generally be large). > > 3. Usually alternate repositories are full repositories that meet the > connectivity requirements (so you could run "git fsck" in them). > But this is explicitly about taking just a few disconnected large > blobs out of the repository and putting them elsewhere. So it needs > a new set of tools for managing the upstream repository. Or you can split out the really large write-only blobs out of SCM control. Every time you introduce a new blob, throw it verbatim in an append-only directory on a networked filesystem under some unique ID as its filename, and maintain a symlink into that networked filesystem under SCM control. I think git-annex already does something like that... ^ permalink raw reply [flat|nested] 43+ messages in thread
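A toy version of the scheme Junio sketches, with hypothetical paths (and with exactly the limitations Peff describes next):

    id=$(sha1sum big.iso | cut -d' ' -f1)
    mv big.iso "/mnt/blobstore/$id"       # append-only store, keyed by content
    ln -s "/mnt/blobstore/$id" big.iso    # the worktree keeps only a symlink
    git add big.iso                       # git versions the symlink, not the blob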
* Re: GSoC - Some questions on the idea of 2012-04-02 22:19 ` Junio C Hamano @ 2012-04-03 10:07 ` Jeff King 0 siblings, 0 replies; 43+ messages in thread From: Jeff King @ 2012-04-03 10:07 UTC (permalink / raw) To: Junio C Hamano; +Cc: Neal Kreitzinger, Bo Chen, Sergio, git On Mon, Apr 02, 2012 at 03:19:35PM -0700, Junio C Hamano wrote: > > 1. You really have 100G of data in the current version that doesn't > > compress well (e.g., you are storing your music collection). You > > can't afford to store two copies on your laptop (because you have a > > fancy SSD, and 100G is expensive again). You need the working tree > > version, but it's OK to stream the repo version of a blob from the > > network when you actually need it (mostly "checkout", assuming you > > have marked the file as "-diff"). > > This feels like a good candidate for an independent project that allows > you fuse-mount from a remote repository to give you an illusion that you > have a checkout of a specific version. Such a remote fuse-server would be > an application that is built using Git, but I do not think we are in any > business on the client end in such a setup. I think this is backwards. The primary item you want on the laptop is the working directory, because you will be accessing and manipulating the files. That must always work, whether the network is connected or not. You occasionally will want to perform git operations. Most of these should succeed when disconnected, but it's OK for some operations (like checking out an older version of a large blob) to fail. But if you are mounting a remote repository and pretending that you have a local checkout, then just accessing the files either requires a network, or you end up caching most of the remote repository. It would make more sense to me to clone a bare repository of what's upstream, and then fuse-mount the local bare repository to provide a fake working directory. And I believe somebody made such a fuse filesystem in the early days of git. However, I recall that it was read-only. I'm not sure how you would handle writing to the git-mounted directory. > Or you can split out the really large write-only blobs out of SCM control. > Every time you introduce a new blob, throw it verbatim in an append-only > directory on a networked filesystem under some unique ID as its filename, > and maintain a symlink into that networked filesystem under SCM control. > > I think git-annex already does something like that... Yes, and git-media basically does this, too. But it's awful to use, because the user has to be constantly aware of these special links and managing them. You can't just store a symlink into the networked filesystem. For one thing, the path may be different on each client machine, so a simple symlink doesn't work. For another, symlinks into a blob repository mean that the files must be read-only (since they are basically blob-equivalents). So you don't really get your own copy of the file; you can _replace_ it and update the symlink, but you can't actually modify it. So what things like git-media end up doing is to try to insert themselves between git and the user, and transparently convert the file into its unique ID on "git add" and tweak the working directory to contain the actual file on checkout. And it kind of works, but there are a lot of rough edges (I don't recall the details, but they came up in past discussions; clean and smudge filters almost get you there, but not quite). 
Basically what I'm proposing to do is to just move that logic into git itself, so it can just happen at the blob storage level. I don't think it would even be that much code inside git; you'd want the interface to be pluggable, so all of the heavy lifting would happen inside of a helper (so really, this isn't necessarily even "network alternates" as much as "pluggable alternates"). -Peff ^ permalink raw reply [flat|nested] 43+ messages in thread
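For reference, the clean/smudge route that git-media takes looks roughly like this; the bigfile-store and bigfile-fetch helpers are hypothetical stand-ins for scripts that swap a file's content for its unique ID on the way in and back on the way out:

    $ cat .gitattributes
    *.bin filter=bigfile -diff
    $ cat .git/config
    [filter "bigfile"]
        clean = bigfile-store     # run on add: content on stdin, ID on stdout
        smudge = bigfile-fetch    # run on checkout: ID on stdin, content on stdout

Everything else (fetching, caching, knowing when not to bother) has to live inside those helpers, which is where the rough edges tend to show up.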
* Re: GSoC - Some questions on the idea of 2012-03-30 20:34 ` Jeff King 2012-03-30 23:08 ` Bo Chen 2012-03-31 15:19 ` Neal Kreitzinger @ 2012-03-31 16:49 ` Neal Kreitzinger 2012-03-31 20:28 ` Neal Kreitzinger 3 siblings, 0 replies; 43+ messages in thread From: Neal Kreitzinger @ 2012-03-31 16:49 UTC (permalink / raw) To: Jeff King; +Cc: Bo Chen, Sergio, git On 3/30/2012 3:34 PM, Jeff King wrote: > On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote: > >> The sub-problems of "delta for large file" problem. >> >> 1 large file >> >> 1.1 text file (always delta well? need to be confirmed) > > ...But let's take a step back for a moment. Forget about whether a file > is binary or not. Imagine you want to store a very large file in > git. > > ...Nowadays, we stream large files directly into their own packfiles, > and we have to pay the I/O only once (and the memory cost never). As > a tradeoff, we no longer get delta compression of large objects. > That's OK for some large objects, like movie files (which don't tend > to delta well, anyway). But it's not for other objects, like virtual > machine images, which do tend to delta well. > > So can we devise a solution which efficiently stores these > delta-friendly objects, without losing the performance improvements > we got with the stream-directly-to-packfile approach? > gitconfig or gitattributes could specify big-file handlers for filetypes. It seems a bit ridiculous to expect git to autoconfigure big-file handlers for everything from gif's to vm-images. In the case of vm-images you would need to read the "big-files" man-page and then configure your git for the "vm image handler" for whatever your vm-image wildcards are for those files. For movie files you would also read the big-file man-page and configure "movie file 'x' big file handler' for whatever your movie file wildcards are. Movie files and vm-images are very expectable (version control) but not very normative (source code management) so you need to configure those as needed. More widely-tracked-by-the-public-at-large files like gif, png, etc, could be autoconfigured by git to used the correct big-file handler. v/r, neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-03-30 20:34 ` Jeff King ` (2 preceding siblings ...) 2012-03-31 16:49 ` Neal Kreitzinger @ 2012-03-31 20:28 ` Neal Kreitzinger 2012-03-31 21:27 ` Bo Chen 3 siblings, 1 reply; 43+ messages in thread From: Neal Kreitzinger @ 2012-03-31 20:28 UTC (permalink / raw) To: Jeff King; +Cc: Bo Chen, Sergio, git On 3/30/2012 3:34 PM, Jeff King wrote: > On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote: > >> The sub-problems of "delta for large file" problem. >> >> 1 large file >> > But let's take a step back for a moment. Forget about whether a file is > binary or not. Imagine you want to store a very large file in git. > > What are the operations that will perform badly? How can we make them > perform acceptably, and what tradeoffs must we make? E.g., the way the > diff code is written, it would be very difficult to run "git diff" on a > 2 gigabyte file. But is that actually a problem? Answering that means > talking about the characteristics of 2 gigabyte files, and what we > expect to see, and to what degree our tradeoffs will impact them. > > Here's a more concrete example. At first, even storing a 2 gigabyte file > with "git add" was painful, because we would load the whole thing in > memory. Repacking the repository was painful, because we had to rewrite > the whole 2G file into a packfile. Nowadays, we stream large files > directly into their own packfiles, and we have to pay the I/O only once > (and the memory cost never). As a tradeoff, we no longer get delta > compression of large objects. That's OK for some large objects, like > movie files (which don't tend to delta well, anyway). But it's not for > other objects, like virtual machine images, which do tend to delta well. > > So can we devise a solution which efficiently stores these > delta-friendly objects, without losing the performance improvements we > got with the stream-directly-to-packfile approach? > > One possible solution is breaking large files into smaller chunks using > something like the bupsplit algorithm (and I won't go into the details > here, as links to bup have already been mentioned elsewhere, and Junio's > patches make a start at this sort of splitting). > (I'm no expert on "big-files" in git or elsewhere, but this thread is immensely interesting to me as a git user who wants to track all sorts of binary files and possibly large text files in the very near future, ie. all components tied to a server build and upgrades beyond the linux-distro/rpms and perhaps including them also.) Let's take an even bigger step back for a moment. Who determines if a file shall be a big-file or not? Git or the user? How is it determined if a file shall be a "big-file" or not? Who decides bigness: Bigness seems to be relative to system resources. Does the user crunch the numbers to determine if a file is a big-file, or does git? If the numbers are relative then should git query the system and make the determination? Either way, once the system-resources are upgraded and formerly "big-files" are no longer considered "big" how is the previous history refactored to behave "non-big-file-like"? Conversely, if the system-resources are re-distributed so that formerly non-big files are now relatively big (ie, moved from powerful central server login to laptops), how is the history refactored to accommodate the newly-relative-bigness? How bigness is decided: There seem to be two basic types of big-files: big-worktree-files and big-history-files.
A big-worktree-file that is delta-friendly is not a big-history-file. A non-big-worktree-file that is delta-unfriendly is a big-history-file problem. If you are working alone on an old computer you are probably more concerned about big-worktree-files (memory). If you are working in a large group making lots of changes to the same files on a powerful server then you are probably more concerned about big-history-file-size (diskspace). Of course, all are concerned about big-worktree-files that are delta-unfriendly. At what point is a delta-friendly file considered a "big-file"? I assume that may depend on the degree of delta-friendliness. I imagine that a text file and a vm-image differ in delta-friendliness by several degrees. At what point(s) is a delta-unfriendly file considered a "big-file"? I assume that may depend on the degree(s) of delta-unfriendliness. I imagine a compiled program and a compressed container differ in delta-unfriendliness by several degrees. My understanding is that git does not ever delta-compress binary files. That would mean even a small-worktree-binary-file becomes a big-history-file over time. v/r, neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-03-31 20:28 ` Neal Kreitzinger @ 2012-03-31 21:27 ` Bo Chen 2012-04-01 4:22 ` Nguyen Thai Ngoc Duy 0 siblings, 1 reply; 43+ messages in thread From: Bo Chen @ 2012-03-31 21:27 UTC (permalink / raw) To: Neal Kreitzinger; +Cc: Jeff King, Sergio, git On Sat, Mar 31, 2012 at 4:28 PM, Neal Kreitzinger <nkreitzinger@gmail.com> wrote: > On 3/30/2012 3:34 PM, Jeff King wrote: >> >> On Fri, Mar 30, 2012 at 03:51:20PM -0400, Bo Chen wrote: >> >>> The sub-problems of "delta for large file" problem. >>> >>> 1 large file >>> >> But let's take a step back for a moment. Forget about whether a file is >> binary or not. Imagine you want to store a very large file in git. >> >> What are the operations that will perform badly? How can we make them >> perform acceptably, and what tradeoffs must we make? E.g., the way the >> diff code is written, it would be very difficult to run "git diff" on a >> 2 gigabyte file. But is that actually a problem? Answering that means >> talking about the characteristics of 2 gigabyte files, and what we >> expect to see, and to what degree our tradeoffs will impact them. >> >> Here's a more concrete example. At first, even storing a 2 gigabyte file >> with "git add" was painful, because we would load the whole thing in >> memory. Repacking the repository was painful, because we had to rewrite >> the whole 2G file into a packfile. Nowadays, we stream large files >> directly into their own packfiles, and we have to pay the I/O only once >> (and the memory cost never). As a tradeoff, we no longer get delta >> compression of large objects. That's OK for some large objects, like >> movie files (which don't tend to delta well, anyway). But it's not for >> other objects, like virtual machine images, which do tend to delta well. >> >> So can we devise a solution which efficiently stores these >> delta-friendly objects, without losing the performance improvements we >> got with the stream-directly-to-packfile approach? >> >> One possible solution is breaking large files into smaller chunks using >> something like the bupsplit algorithm (and I won't go into the details >> here, as links to bup have already been mentioned elsewhere, and Junio's >> patches make a start at this sort of splitting). >> > (I'm no expert on "big-files" in git or elsewhere, but this thread is > immensely interesting to me as a git user who wants to track all sorts of > binary files and possibly large text files in the very near future, ie. all > components tied to a server build and upgrades beyond the linux-distro/rpms > and perhaps including them also.) > > Let's take an even bigger step back for a moment. Who determines if a file > shall be a big-file or not? Git or the user? How is it determined if a > file shall be a "big-file" or not? > > Who decides bigness: > Bigness seems to be relative to system resources. Does the user crunch the > numbers to determine if a file is a big-file, or does git? If the numbers are > relative then should git query the system and make the determination? > Either way, once the system-resources are upgraded and formerly "big-files" > are no longer considered "big" how is the previous history refactored to > behave "non-big-file-like"? Conversely, if the system-resources are > re-distributed so that formerly non-big files are now relatively big (ie, > moved from powerful central server login to laptops), how is the history > refactored to accommodate the newly-relative-bigness?
> In common sense, a file of tens of MBs should not be considered a big file, but a file of tens of GBs should definitely be considered a big file. I think one simple workable solution is to let the user set the threshold of the big file. One complicated but intelligent solution is to let git auto-config the threshold by evaluating current computing resources in the running platform (a physical machine or just a VM). As to the problem of migrating git across different platforms which are equipped with different computing power, the git repo should also keep track of under what big file threshold a specific file is handled. > How bigness is decided: > There seem to be two basic types of big-files: big-worktree-files and > big-history-files. A big-worktree-file that is delta-friendly is not a > big-history-file. A non-big-worktree-file that is delta-unfriendly is a > big-history-file problem. If you are working alone on an old computer you > are probably more concerned about big-worktree-files (memory). If you are > working in a large group making lots of changes to the same files on a > powerful server then you are probably more concerned about > big-history-file-size (diskspace). Of course, all are concerned about > big-worktree-files that are delta-unfriendly. > > At what point is a delta-friendly file considered a "big-file"? I assume > that may depend on the degree of delta-friendliness. I imagine that a text > file and a vm-image differ in delta-friendliness by several degrees. > > At what point(s) is a delta-unfriendly file considered a "big-file"? I > assume that may depend on the degree(s) of delta-unfriendliness. I imagine > a compiled program and a compressed container differ in delta-unfriendliness > by several degrees. > > My understanding is that git does not ever delta-compress binary files. > That would mean even a small-worktree-binary-file becomes a > big-history-file over time. > > v/r, > neal ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-03-31 21:27 ` Bo Chen @ 2012-04-01 4:22 ` Nguyen Thai Ngoc Duy 2012-04-01 23:30 ` Bo Chen 0 siblings, 1 reply; 43+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-04-01 4:22 UTC (permalink / raw) To: Bo Chen; +Cc: Neal Kreitzinger, Jeff King, Sergio, git On Sun, Apr 1, 2012 at 4:27 AM, Bo Chen <chen@chenirvine.org> wrote: >> Who decides bigness: >> Bigness seems to be relative to system resources. Does the user crunch the >> numbers to determine if a file is a big-file, or does git? If the numbers are >> relative then should git query the system and make the determination? >> Either way, once the system-resources are upgraded and formerly "big-files" >> are no longer considered "big" how is the previous history refactored to >> behave "non-big-file-like"? Conversely, if the system-resources are >> re-distributed so that formerly non-big files are now relatively big (ie, >> moved from powerful central server login to laptops), how is the history >> refactored to accommodate the newly-relative-bigness? >> > > In common sense, a file of tens of MBs should not be considered a > big file, but a file of tens of GBs should definitely be considered > a big file. I think one simple workable solution is to let the user > set the threshold of the big file. We currently have core.bigFileThreshold = 512MB. > One complicated but intelligent > solution is to let git auto-config the threshold by evaluating current > computing resources in the running platform (a physical machine or > just a VM). As to the problem of migrating git across different platforms > which are equipped with different computing power, the git repo should also > keep track of under what big file threshold a specific file is > handled. -- Duy ^ permalink raw reply [flat|nested] 43+ messages in thread
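That knob can already be tuned per-repository; a quick sketch (the 1m value is only an example, and sizes accept the usual k/m/g suffixes):

    $ git config core.bigFileThreshold 1m
    # blobs over the threshold are written out deflated as-is,
    # with no attempt at delta compression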
* Re: GSoC - Some questions on the idea of 2012-04-01 4:22 ` Nguyen Thai Ngoc Duy @ 2012-04-01 23:30 ` Bo Chen 2012-04-02 1:00 ` Nguyen Thai Ngoc Duy 0 siblings, 1 reply; 43+ messages in thread From: Bo Chen @ 2012-04-01 23:30 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Neal Kreitzinger, Jeff King, Sergio, git One question; can anyone help me clear this up? My .git/objects has 3 blobs, a, b, and c. a is a unique file, b and c are two sequential versions of the same file. When I run "git gc", what exactly happens here, e.g., how exactly does git (in the latest version) delta-compress the blobs here? Any help will be appreciated. Bo On Sun, Apr 1, 2012 at 12:22 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: > On Sun, Apr 1, 2012 at 4:27 AM, Bo Chen <chen@chenirvine.org> wrote: >>> Who decides bigness: >>> Bigness seems to be relative to system resources. Does the user crunch the >>> numbers to determine if a file is a big-file, or does git? If the numbers are >>> relative then should git query the system and make the determination? >>> Either way, once the system-resources are upgraded and formerly "big-files" >>> are no longer considered "big" how is the previous history refactored to >>> behave "non-big-file-like"? Conversely, if the system-resources are >>> re-distributed so that formerly non-big files are now relatively big (ie, >>> moved from powerful central server login to laptops), how is the history >>> refactored to accommodate the newly-relative-bigness? >>> >> >> In common sense, a file of tens of MBs should not be considered a >> big file, but a file of tens of GBs should definitely be considered >> a big file. I think one simple workable solution is to let the user >> set the threshold of the big file. > > We currently have core.bigFileThreshold = 512MB. > >> One complicated but intelligent >> solution is to let git auto-config the threshold by evaluating current >> computing resources in the running platform (a physical machine or >> just a VM). As to the problem of migrating git across different platforms >> which are equipped with different computing power, the git repo should also >> keep track of under what big file threshold a specific file is >> handled. > -- > Duy ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of 2012-04-01 23:30 ` Bo Chen @ 2012-04-02 1:00 ` Nguyen Thai Ngoc Duy 0 siblings, 0 replies; 43+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-04-02 1:00 UTC (permalink / raw) To: Bo Chen; +Cc: Neal Kreitzinger, Jeff King, Sergio, git On Mon, Apr 2, 2012 at 6:30 AM, Bo Chen <chen@chenirvine.org> wrote: > One question; can anyone help me clear this up? > > My .git/objects has 3 blobs, a, b, and c. a is a unique file, b and c > are two sequential versions of the same file. When I run "git gc", what > exactly happens here, e.g., how exactly does git (in the latest version) > delta-compress the blobs here? See Documentation/technical/pack-heuristics.txt for how pack-objects (called by "git gc") decides to delta either b or c based on the other one. Once it chooses, say, b to be a delta against c, it generates the delta using diff-delta.c, then stores the delta in either ref-delta or ofs-delta format. The former stores the sha-1 of c, the latter the offset of c in the pack. -- Duy ^ permalink raw reply [flat|nested] 43+ messages in thread
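You can watch that choice being made; a rough sketch of inspecting the result (the hashes are placeholders and the sizes are made up, but the trailing "depth base-sha1" columns on delta entries are the point):

    $ git gc
    $ git verify-pack -v .git/objects/pack/pack-*.idx
    <sha1-of-c> blob 1048576 524288 12
    <sha1-of-b> blob     200     90 524300 1 <sha1-of-c>

Here c is stored whole, while b is stored as a depth-1 delta whose base is c.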
* Re: GSoC - Some questions on the idea of "Better big-file support". 2012-03-28 6:19 ` Nguyen Thai Ngoc Duy 2012-03-28 11:33 ` GSoC - Some questions on the idea of Sergio @ 2012-03-30 19:11 ` Bo Chen 2012-03-30 19:54 ` Jeff King 1 sibling, 1 reply; 43+ messages in thread From: Bo Chen @ 2012-03-30 19:11 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: git, peff Sorry for replying late. My questions are inline in the following. On Wed, Mar 28, 2012 at 2:19 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: > On Wed, Mar 28, 2012 at 11:38 AM, Bo Chen <chen@chenirvine.org> wrote: >> Hi, Everyone. This is Bo Chen. I am interested in the idea of "Better >> big-file support". >> >> As it is described in the idea page, >> "Many large files (like media) do not delta very well. However, some >> do (like VM disk images). Git could split large objects into smaller >> chunks, similar to bup, and find deltas between these much more >> manageable chunks. There are some preliminary patches in this >> direction, but they are in need of review and expansion." >> >> Can anyone elaborate a little bit why many large files do not delta >> very well? > > Large files are usually binary. Depends on the type of binary, they > may or may not delta well. Those that are compressed/encrypted > obviously don't delta well because one change can make the final > result completely different. Just make clear one of my confusions. Delta operation is to find out the differences between different versions of the same file, right? As I know, delta encoding is to re-encode a file based on the differences between neighboring blocks, thus can help compress a file since after delta encoding, we will have more similar data within the file. Can anyone elaborate a little bit what is the relation between delta operation in git and delta encoding listed above? Thanks. > > Another problem with delta-ing large files with git is, current code > needs to load two files in memory for delta. Consuming 4G for delta 2 > 2GB files does not sound good. I am wondering why we cannot divide the 2 2GB files into chunks and delta chunks by chunks. Is that any difference, except a little more IOs? > >> Is it a general problem or a specific problem just for Git? >> I am really new to Git, can anyone give me some hints on which source >> codes I should read to learn more about the current code on delta >> operation? It is said that "there are some preliminary patches in this >> direction", where can I find these patches? > > Read about rsync algorithm [2]. Bup [1] implements the same (I think) > algorithm, but on top of git. For preliminary patches, have a look at > jc/split-blob series at commit 4a1242d in git.git. Make clear my another confusion. The file which has been updated (added, deleted, and modified) is first delta-compressed, and then synchronize to the remote repo by some mechanism (rsync?). I am wondering what is the the relationship between delta operation and rsync. > > [1] https://github.com/apenwarr/bup > [2] http://en.wikipedia.org/wiki/Rsync#Algorithm > -- > Duy Bo ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: GSoC - Some questions on the idea of "Better big-file support". 2012-03-30 19:11 ` GSoC - Some questions on the idea of "Better big-file support" Bo Chen @ 2012-03-30 19:54 ` Jeff King 0 siblings, 0 replies; 43+ messages in thread From: Jeff King @ 2012-03-30 19:54 UTC (permalink / raw) To: Bo Chen; +Cc: Nguyen Thai Ngoc Duy, git On Fri, Mar 30, 2012 at 03:11:40PM -0400, Bo Chen wrote: > Just make clear one of my confusions. Delta operation is to find out > the differences between different versions of the same file, right? > As I know, delta encoding is to re-encode a file based on the > differences between neighboring blocks, thus can help compress a file > since after delta encoding, we will have more similar data within the > file. Can anyone elaborate a little bit what is the relation between > delta operation in git and delta encoding listed above? Thanks. Sort of. Git is snapshot based. So each version of a file is its own "object", and from a high-level view, we store all objects. But we store the logical objects themselves in packfiles, in which the actual representation of the object may be stored as a difference to another object (which is likely to be a different version of the same file, but does not have to be). Here's some background reading: http://progit.org/book/ch1-3.html http://progit.org/book/ch9-4.html > I am wondering why we cannot divide the 2 2GB files into chunks and > delta chunks by chunks. Is that any difference, except a little more > IOs? It's more complicated than that. What if the file is re-ordered? You would want to compare early chunks in one version against later chunks in the other. So yes, you can reduce memory pressure by doing more I/O, but doing too much I/O will be very slow. Coming up with a solution is part of what this project is about. And chunking is part of that solution. > > Read about rsync algorithm [2]. Bup [1] implements the same (I think) > > algorithm, but on top of git. For preliminary patches, have a look at > > jc/split-blob series at commit 4a1242d in git.git. > > Make clear my another confusion. The file which has been updated > (added, deleted, and modified) is first delta-compressed, and then > synchronize to the remote repo by some mechanism (rsync?). I am > wondering what is the the relationship between delta operation and > rsync. No, the updated file is delta compressed into a packfile, and the packfile is transmitted. Rsync comes into play because it uses a novel chunking algorithm, which was copied by bup (and is referred to as the "bupsplit" algorithm). Read up on how bup works and why it was invented. -Peff ^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2012-05-10 22:39 UTC | newest] Thread overview: 43+ messages -- 2012-03-28 4:38 GSoC - Some questions on the idea of "Better big-file support" Bo Chen 2012-03-28 6:19 ` Nguyen Thai Ngoc Duy 2012-03-28 11:33 ` GSoC - Some questions on the idea of Sergio 2012-03-30 19:44 ` Bo Chen 2012-03-30 19:51 ` Bo Chen 2012-03-30 20:34 ` Jeff King 2012-03-30 23:08 ` Bo Chen 2012-03-31 11:02 ` Sergio Callegari 2012-03-31 16:18 ` Neal Kreitzinger 2012-04-02 21:07 ` Jeff King 2012-04-03 9:58 ` Sergio Callegari 2012-04-11 1:24 ` Neal Kreitzinger 2012-04-11 6:04 ` Jonathan Nieder 2012-04-11 16:29 ` Neal Kreitzinger 2012-04-11 22:09 ` Jeff King 2012-04-11 16:35 ` Neal Kreitzinger 2012-04-11 16:44 ` Neal Kreitzinger 2012-04-11 17:20 ` Jonathan Nieder 2012-04-11 18:51 ` Junio C Hamano 2012-04-11 19:03 ` Jonathan Nieder 2012-04-11 18:23 ` Neal Kreitzinger 2012-04-11 21:35 ` Jeff King 2012-04-12 19:29 ` Neal Kreitzinger 2012-04-12 21:03 ` Jeff King [not found] ` <4F8A2EBD.1070407@gmail.com> 2012-04-15 2:15 ` Jeff King 2012-04-15 2:33 ` Neal Kreitzinger 2012-04-16 14:54 ` Jeff King 2012-05-10 21:43 ` Neal Kreitzinger 2012-05-10 22:39 ` Jeff King 2012-04-12 21:08 ` Neal Kreitzinger 2012-04-13 21:36 ` Bo Chen 2012-03-31 15:19 ` Neal Kreitzinger 2012-04-02 21:40 ` Jeff King 2012-04-02 22:19 ` Junio C Hamano 2012-04-03 10:07 ` Jeff King 2012-03-31 16:49 ` Neal Kreitzinger 2012-03-31 20:28 ` Neal Kreitzinger 2012-03-31 21:27 ` Bo Chen 2012-04-01 4:22 ` Nguyen Thai Ngoc Duy 2012-04-01 23:30 ` Bo Chen 2012-04-02 1:00 ` Nguyen Thai Ngoc Duy 2012-03-30 19:11 ` GSoC - Some questions on the idea of "Better big-file support" Bo Chen 2012-03-30 19:54 ` Jeff King