* Delta compression not so effective
@ 2017-03-01 13:51 Marius Storm-Olsen
From: Marius Storm-Olsen
To: git

I have just converted an SVN repo to Git (using SubGit), where I feel
delta compression has let me down :)

Suffice it to say, this is a "traditional" SVN repo, with an extern/
blown out of proportion with many binary check-ins. BUT, even still, I
would expect Git's delta compression to be quite effective, compared to
the compression present in SVN. In this case however, the Git repo ends
up being 46% larger than the SVN DB.

Details - SVN:
  Commits: 32988
  DB (server) size: 139GB
  Branches: 103
  Tags: 1088

Details - Git:
  $ git count-objects -v
  count: 0
  size: 0
  in-pack: 666515
  packs: 1
  size-pack: 211933109
  prune-packable: 0
  garbage: 0
  size-garbage: 0

  $ du -sh .
  203G .

  $ java -jar ~/sources/bfg/bfg.jar --delete-folders extern --no-blob-protection && \
      git reflog expire --expire=now --all && \
      git gc --prune=now --aggressive

  $ git count-objects -v
  count: 0
  size: 0
  in-pack: 495070
  packs: 1
  size-pack: 5765365
  prune-packable: 0
  garbage: 0
  size-garbage: 0

  $ du -sh .
  5.6G .

When first importing, I disabled gc to avoid any repacking until
completed. When done importing, there was 209GB of all loose objects
(~670k files). With the hopes of quick consolidation, I did a

  git -c gc.autoDetach=0 -c gc.reflogExpire=0 \
      -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
      -c gc.rerereunresolved=0 -c gc.pruneExpire=now \
      gc --prune

which brought it down to 206GB in a single pack. I then ran

  git repack -a -d -F --window=350 --depth=250

which took it down to 203GB, where I'm at right now.

However, this is still miles away from the 139GB in SVN's DB. Any ideas
what's going on, and why my results are so terrible, compared to SVN?

Thanks!

--
.marius

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Delta compression not so effective
  2017-03-01 16:06 ` Junio C Hamano
From: Junio C Hamano
To: Marius Storm-Olsen; +Cc: Git Mailing List

On Wed, Mar 1, 2017 at 5:51 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
> ...
> which brought it down to 206GB in a single pack. I then ran
>   git repack -a -d -F --window=350 --depth=250
> which took it down to 203GB, where I'm at right now.

Just a hunch. s/F/f/ perhaps? "-F" does not allow Git to recover from
poor delta-base choice the original importer may have made (and if the
original importer used fast-import, it is known that its choice of the
delta-base is suboptimal).
* Re: Delta compression not so effective
  2017-03-01 16:17 ` Junio C Hamano
From: Junio C Hamano
To: Marius Storm-Olsen; +Cc: Git Mailing List

On Wed, Mar 1, 2017 at 8:06 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Just a hunch. s/F/f/ perhaps? "-F" does not allow Git to recover from poor

Nah, sorry for the noise. Between -F and -f there shouldn't be any
difference.
* Re: Delta compression not so effective
  2017-03-01 17:36 ` Linus Torvalds
From: Linus Torvalds
To: Marius Storm-Olsen; +Cc: Git Mailing List

On Wed, Mar 1, 2017 at 5:51 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>
> When first importing, I disabled gc to avoid any repacking until completed.
> When done importing, there was 209GB of all loose objects (~670k files).
> With the hopes of quick consolidation, I did a
>   git -c gc.autoDetach=0 -c gc.reflogExpire=0 \
>       -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
>       -c gc.rerereunresolved=0 -c gc.pruneExpire=now \
>       gc --prune
> which brought it down to 206GB in a single pack. I then ran
>   git repack -a -d -F --window=350 --depth=250
> which took it down to 203GB, where I'm at right now.

Considering that it was 209GB in loose objects, I don't think it
delta-packed the big objects at all.

I wonder if the big objects end up hitting some size limit that causes
the delta creation to fail.

For example, we have that HASH_LIMIT that limits how many hashes we'll
create for the same hash bucket, because there's some quadratic
behavior in the delta algorithm. It triggered with things like big
files that have lots of repeated content.

We also have various memory limits, in particular
'window_memory_limit'. That one should default to 0, but maybe you
limited it at some point in a config file and forgot about it?

              Linus
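The bucket-capping idea Linus mentions can be sketched in a few lines. This is a toy reimplementation, not git's actual diff-delta.c; the block hashing and the exact limit are simplified stand-ins, with only the capping behavior carried over:

```python
from collections import defaultdict

# Mirrors the idea of HASH_LIMIT in git's diff-delta.c; the value and
# the hashing below are simplified assumptions, not git's actual code.
HASH_LIMIT = 64

def index_blocks(data, block=16):
    """Index fixed-size blocks of `data` by a cheap hash, capping each
    bucket so highly repetitive input cannot make delta search quadratic."""
    buckets = defaultdict(list)
    for off in range(0, len(data) - block + 1, block):
        h = hash(data[off:off + block]) & 0xffff
        bucket = buckets[h]
        if len(bucket) < HASH_LIMIT:  # drop excess identical candidates
            bucket.append(off)
    return buckets

# 1000 identical blocks all land in one bucket, which gets truncated:
repetitive_index = index_blocks(b"A" * 16 * 1000)
```

Without the cap, every block of a repetitive file matches every other block, and comparing all candidate pairs is quadratic; the cap trades some delta quality for bounded work, which is one way large repetitive binaries can end up poorly deltified.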
* Re: Delta compression not so effective
  2017-03-01 17:57 ` Marius Storm-Olsen
From: Marius Storm-Olsen
To: Linus Torvalds; +Cc: Git Mailing List

On 3/1/2017 11:36, Linus Torvalds wrote:
> On Wed, Mar 1, 2017 at 5:51 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>>
>> When first importing, I disabled gc to avoid any repacking until completed.
>> When done importing, there was 209GB of all loose objects (~670k files).
>> With the hopes of quick consolidation, I did a
>>   git -c gc.autoDetach=0 -c gc.reflogExpire=0 \
>>       -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
>>       -c gc.rerereunresolved=0 -c gc.pruneExpire=now \
>>       gc --prune
>> which brought it down to 206GB in a single pack. I then ran
>>   git repack -a -d -F --window=350 --depth=250
>> which took it down to 203GB, where I'm at right now.
>
> Considering that it was 209GB in loose objects, I don't think it
> delta-packed the big objects at all.
>
> I wonder if the big objects end up hitting some size limit that causes
> the delta creation to fail.

You're likely on to something here. I just ran

  git verify-pack --verbose \
      objects/pack/pack-9473815bc36d20fbcd38021d7454fbe09f791931.idx | \
      sort -k3n | tail -n15

and got no blobs with deltas in them.

  feb35d6dc7af8463e038c71cc3893d163d47c31c blob 36841958 36461935 3259424358
  007b65e603cdcec6644ddc25c2a729a394534927 blob 36845345 36462120 3341677889
  0727a97f68197c99c63fcdf7254e5867f8512f14 blob 37368646 36983862 3677338718
  576ce2e0e7045ee36d0370c2365dc730cb435f40 blob 37399203 37014740 3639613780
  7f6e8b22eed5d8348467d9b0180fc4ae01129052 blob 125296632 83609223 5045853543
  014b9318d2d969c56d46034a70223554589b3dc4 blob 170113524 6124878 1118227958
  22d83cb5240872006c01651eb1166c8db62c62d8 blob 170113524 65941491 1257435955
  292ac84f48a3d5c4de8d12bfb2905e055f9a33b1 blob 170113524 67770601 1323377446
  2b9329277e379dfbdcd0b452b39c6b0bf3549005 blob 170113524 7656690 1110571268
  37517efb4818a15ad7bba79b515170b3ee18063b blob 170113524 133083119 1124352836
  55a4a70500eb3b99735677d0025f33b1bb78624a blob 170113524 6592386 1398975989
  e669421ea5bf2e733d5bf10cf505904d168de749 blob 170113524 7827942 1391148047
  e9916da851962265a9d5b099e72f60659a74c144 blob 170113524 73514361 966299538
  f7bf1313752deb1bae592cc7fc54289aea87ff19 blob 170113524 70756581 1039814687
  8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob 248959314 237612609 606692699

In fact, I don't see a single "deltified" blob until the 6355th-last
line!

> For example, we have that HASH_LIMIT that limits how many hashes
> we'll create for the same hash bucket, because there's some quadratic
> behavior in the delta algorithm. It triggered with things like big
> files that have lots of repeated content.
>
> We also have various memory limits, in particular
> 'window_memory_limit'. That one should default to 0, but maybe you
> limited it at some point in a config file and forgot about it?

Indeed, I did do a

  -c pack.threads=20 --window-memory=6g

to 'git repack', since the machine is a 20-core (40 threads) machine
with 126GB of RAM.

So I guess with these sized objects, even at 6GB per thread, it's not
enough to get a big enough window for proper delta-packing?

This repo took >14hr to repack on 20 threads though ("compression" step
was very fast, but stuck 95% of the time in "writing objects"), so I
can only imagine how long a pack.threads=1 run will take :)

But aren't the blobs sorted by some metric for reasonable delta-pack
locality, so even with a 6GB window it should have seen ~25 similar
objects to deltify against?

--
.marius
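Counting deltified blobs in a whole pack, rather than eyeballing the tail of the sorted list, is easy to script. A sketch that relies on the documented `git verify-pack --verbose` line format (`SHA-1 type size size-in-packfile offset-in-packfile`, plus `depth base-SHA-1` for deltified objects) and skips the trailer lines:

```python
import sys

def count_deltas(lines):
    """Tally full vs. deltified blobs in `git verify-pack --verbose` output."""
    full = delta = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or fields[1] != "blob":
            continue  # header/trailer/chain-length summary lines
        if len(fields) >= 7:
            delta += 1  # depth + base SHA-1 present -> stored as a delta
        else:
            full += 1
    return full, delta

if __name__ == "__main__":
    full, delta = count_deltas(sys.stdin)
    print(f"{full} full blobs, {delta} deltified blobs")
```

Usage would be something like `git verify-pack --verbose objects/pack/pack-*.idx | python count_deltas.py`.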
* Re: Delta compression not so effective
  2017-03-01 18:30 ` Linus Torvalds
From: Linus Torvalds
To: Marius Storm-Olsen; +Cc: Git Mailing List

On Wed, Mar 1, 2017 at 9:57 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>
> Indeed, I did do a
>   -c pack.threads=20 --window-memory=6g
> to 'git repack', since the machine is a 20-core (40 threads) machine with
> 126GB of RAM.
>
> So I guess with these sized objects, even at 6GB per thread, it's not enough
> to get a big enough window for proper delta-packing?

Hmm. The 6GB window should be plenty good enough, unless your blobs
are in the gigabyte range too.

> This repo took >14hr to repack on 20 threads though ("compression" step was
> very fast, but stuck 95% of the time in "writing objects"), so I can only
> imagine how long a pack.threads=1 run will take :)

Actually, it's usually the compression phase that should be slow - but
if something is limiting finding deltas (so that we abort early), then
that would certainly tend to speed up compression.

The "writing objects" phase should be mainly about the actual IO,
which should be much faster *if* you actually find deltas.

> But aren't the blobs sorted by some metric for reasonable delta-pack
> locality, so even with a 6GB window it should have seen ~25 similar objects
> to deltify against?

Yes they are. The sorting for delta packing tries to make sure that
the window is effective. However, the sorting is also just a
heuristic, and it may well be that your repository layout ends up
screwing up the sorting, so that the windows just work very badly.

For example, the sorting code thinks that objects with the same name
across the history are good sources of deltas. But it may be that for
your case, the binary blobs that you have don't tend to actually
change in the history, so that heuristic doesn't end up doing
anything.

The sorting does use the size and the type too, but the "filename
hash" (which isn't really a hash, it's something nasty to give
reasonable results for the case where files get renamed) is the main
sort key.

So you might well want to look at the sorting code too. If filenames
(particularly the ends of filenames) for the blobs aren't good hints
for the sorting code, that sort might end up spreading all the blobs
out rather than sorting them by size.

And again, if that happens, the "can I delta these two objects" code
will notice that the sizes of the objects are wildly different and
won't even bother trying. Which speeds up the "compressing" phase, of
course, but then because you don't get any good deltas, the "writing
out" phase sucks donkey balls because it does zlib compression on big
objects and writes them out to disk.

So there are certainly multiple possible reasons for the
deltification to not work well for you.

How sensitive is your material? Could you make a smaller repo with
some of the blobs that still show the symptoms? I don't think I want
to download 206GB of data even if my internet access is good.

              Linus
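For reference, the "filename hash" Linus describes can be transcribed into Python. This is based on the pack_name_hash() helper in git's pack-objects; treat the exact constants as an assumption that may differ between git versions. The key property is that each later character shifts earlier ones down, so only roughly the last 16 characters of a path influence the result:

```python
def pack_name_hash(name):
    """Sketch of git's pack_name_hash(): skip whitespace, then
    hash = (hash >> 2) + (c << 24), kept to 32 bits.  Each later
    character shifts earlier contributions right, so long shared
    suffixes dominate the sort key."""
    h = 0
    for ch in name or "":
        if ch.isspace():
            continue
        h = ((h >> 2) + (ord(ch) << 24)) & 0xFFFFFFFF
    return h

# Renames that keep a long basename hash identically, so such blobs
# stay adjacent in the delta window:
a = pack_name_hash("extern/win/old/FlameProxyLibD.lib")
b = pack_name_hash("other/prefix/FlameProxyLibD.lib")
```

Since the shared suffix here is 18 characters, the differing prefixes are shifted out entirely and `a == b`; paths whose *endings* differ, by contrast, land far apart in the sort even when their contents are similar.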
* Re: Delta compression not so effective
  2017-03-01 21:08 ` Martin Langhoff
From: Martin Langhoff
To: Linus Torvalds; +Cc: Marius Storm-Olsen, Git Mailing List

On Wed, Mar 1, 2017 at 1:30 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> For example, the sorting code thinks that objects with the same name
> across the history are good sources of deltas.

Marius has indicated he is working with jar files. IME jar and war
files, which are zipfiles containing Java bytecode, range from not
delta-ing in a useful fashion to pretty good deltas. Depending on the
build process (hi Maven!) there can be enough variance in the build
metadata to throw all the compression machinery off.

On a simple Maven-driven project I have at hand, two .war files
compiled from the same codebase compressed really well in git. I've
also seen projects where storage space is ~101% of the "uncompressed"
size.

my 2c,

m
--
 martin.langhoff@gmail.com       - ask interesting questions
 ~ http://linkedin.com/in/martinlanghoff - don't be distracted
 ~ http://github.com/martin-langhoff      by shiny stuff
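The zip-container effect Martin describes is easy to demonstrate: because the payload is deflated inside the archive, a tiny source change can rewrite the compressed stream from that point on, leaving little byte-level overlap for a delta engine. A toy illustration (not real Maven output; entry name and timestamp are made up):

```python
import io
import zipfile

def make_jar(payload: bytes) -> bytes:
    """Build a one-entry deflated archive with a fixed timestamp, so the
    payload is the only thing that varies between builds."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("Foo.class", date_time=(2017, 3, 1, 0, 0, 0))
        zf.writestr(info, payload, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()

base = bytes(range(256)) * 64          # 16 KiB of stand-in "bytecode"
tweaked = b"\xff" + base[1:]           # change a single leading byte
jar_a, jar_b = make_jar(base), make_jar(tweaked)
```

Even with identical metadata, the two archives' compressed streams diverge; add varying build timestamps inside MANIFEST.MF and the divergence starts at byte one.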
* Re: Delta compression not so effective
  2017-03-02  0:12 ` Marius Storm-Olsen
From: Marius Storm-Olsen
To: Linus Torvalds; +Cc: Git Mailing List

On 3/1/2017 12:30, Linus Torvalds wrote:
> On Wed, Mar 1, 2017 at 9:57 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>>
>> Indeed, I did do a
>>   -c pack.threads=20 --window-memory=6g
>> to 'git repack', since the machine is a 20-core (40 threads) machine with
>> 126GB of RAM.
>>
>> So I guess with these sized objects, even at 6GB per thread, it's not enough
>> to get a big enough window for proper delta-packing?
>
> Hmm. The 6GB window should be plenty good enough, unless your blobs
> are in the gigabyte range too.

No, the git verify-pack list in the previous post was from the bottom of
the sorted list, so those are the largest blobs, ~249MB.

>> This repo took >14hr to repack on 20 threads though ("compression" step was
>> very fast, but stuck 95% of the time in "writing objects"), so I can only
>> imagine how long a pack.threads=1 run will take :)
>
> Actually, it's usually the compression phase that should be slow - but
> if something is limiting finding deltas (so that we abort early), then
> that would certainly tend to speed up compression.
>
> The "writing objects" phase should be mainly about the actual IO.
> Which should be much faster *if* you actually find deltas.

So, this repo must be knocking several parts of Git's insides. I was
curious about why it was so slow in the writing-objects part, since the
whole repo is on a 4x RAID 5, 7k-spindle array. Now, they are not SSDs,
sure, but the thing has ~400MB/s continuous throughput available.

iostat -m 5 showed trickle read/write to the process, and 80-100% CPU
on a single thread (since the "write objects" stage is single-threaded,
obviously). The failing delta must be triggering other negative
behavior.

> For example, the sorting code thinks that objects with the same name
> across the history are good sources of deltas. But it may be that for
> your case, the binary blobs that you have don't tend to actually
> change in the history, so that heuristic doesn't end up doing
> anything.

These are generally just DLLs (debug & release), whose content is
updated due to upstream project updates. So filenames/paths tend to
stay identical, while content changes throughout history.

> The sorting does use the size and the type too, but the "filename
> hash" (which isn't really a hash, it's something nasty to give
> reasonable results for the case where files get renamed) is the main
> sort key.
>
> So you might well want to look at the sorting code too. If filenames
> (particularly the end of filenames) for the blobs aren't good hints
> for the sorting code, that sort might end up spreading all the blobs
> out rather than sort them by size.

Filenames are fairly static, and the bulk of the 6000 biggest
non-delta'ed blobs are the same DLLs (multiple of them).

> And again, if that happens, the "can I delta these two objects" code
> will notice that the size of the objects are wildly different and
> won't even bother trying. Which speeds up the "compressing" phase, of
> course, but then because you don't get any good deltas, the "writing
> out" phase sucks donkey balls because it does zlib compression on big
> objects and writes them out to disk.

Right; now on this machine, I really didn't notice much difference
between the standard zlib level and doing -9. The 203GB version was
actually with zlib=9.

> So there are certainly multiple possible reasons for the deltification
> to not work well for you.
>
> How sensitive is your material? Could you make a smaller repo with
> some of the blobs that still show the symptoms? I don't think I want
> to download 206GB of data even if my internet access is good.

Pretty sensitive, and I'm not sure how I can reproduce this reasonably
well. However, I can easily recompile git with any recommended
instrumentation/printfs, if you have any suggestions of good places to
start? If anyone has good file/line numbers, I'll give that a go and
report back?

Thanks!

--
.marius
* Re: Delta compression not so effective
  2017-03-02  0:43 ` Linus Torvalds
From: Linus Torvalds
To: Marius Storm-Olsen; +Cc: Git Mailing List

On Wed, Mar 1, 2017 at 4:12 PM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>
> No, the git verify-pack list in the previous post was from the bottom
> of the sorted list, so those are the largest blobs, ~249MB.

.. so with a 6GB window, you should easily still have 20+ objects. Not
a huge window, but it should find some deltas.

But a smaller window - _together_ with a suboptimal sorting choice -
could then result in lack of successful delta matches.

> So, this repo must be knocking several parts of Git's insides. I was curious
> about why it was so slow in the writing-objects part, since the whole repo
> is on a 4x RAID 5, 7k-spindle array. Now, they are not SSDs, sure, but the
> thing has ~400MB/s continuous throughput available.
>
> iostat -m 5 showed trickle read/write to the process, and 80-100% CPU on a
> single thread (since the "write objects" stage is single-threaded, obviously).

So the writing phase isn't multi-threaded because it's not expected to
matter. But if you can't even generate deltas, you aren't just
*writing* much more data, you're compressing all that data with zlib
too.

So even with a fast disk subsystem, you won't even be able to saturate
the disk, simply because the compression will be slower (and
single-threaded).

> Filenames are fairly static, and the bulk of the 6000 biggest non-delta'ed
> blobs are the same DLLs (multiple of them).

I think the first thing you should test is to repack with fewer
threads, and a bigger pack window. Do something like

  -c pack.threads=4 --window-memory=30g

instead. Just to see if that starts finding deltas.

> Right; now on this machine, I really didn't notice much difference between
> the standard zlib level and doing -9. The 203GB version was actually with
> zlib=9.

Don't. zlib has *horrible* scaling with higher compression levels. It
doesn't actually improve the end result very much, and it makes things
*much* slower.

zlib was a reasonable choice when git started - well-known, stable,
easy to use. But realistically it's a relatively horrible choice
today, just because there are better alternatives now.

>> How sensitive is your material? Could you make a smaller repo with
>> some of the blobs that still show the symptoms? I don't think I want
>> to download 206GB of data even if my internet access is good.
>
> Pretty sensitive, and I'm not sure how I can reproduce this reasonably well.
> However, I can easily recompile git with any recommended
> instrumentation/printfs, if you have any suggestions of good places to
> start? If anyone has good file/line numbers, I'll give that a go and
> report back?

So the first thing you might want to do is to just print out the
objects after sorting them, and before it starts trying to find
deltas.

See prepare_pack() in builtin/pack-objects.c, where it does something
like this:

        if (nr_deltas && n > 1) {
                unsigned nr_done = 0;
                if (progress)
                        progress_state = start_progress(_("Compressing objects"),
                                                        nr_deltas);
                QSORT(delta_list, n, type_size_sort);
                ll_find_deltas(delta_list, n, window+1, depth, &nr_done);
                stop_progress(&progress_state);

and notice that QSORT() line: that's what sorts the objects. You can
do something like

        for (i = 0; i < n; i++)
                show_object_entry_details(delta_list[i]);

right after that QSORT(), and make that print out the object hash,
filename hash, and size (we don't have the filename that the object
was associated with any more at that stage - they take too much
space).

Save off that array for off-line processing: when you have the object
hash, you can see what the contents are, and match it up with the file
in the git history using something like

  git log --oneline --raw -R --abbrev=40

which shows you the log, but also the "diff" in the form of "this
filename changed from SHA1 to SHA1", so you can match up the object
hashes with where they are in the tree (and where they are in
history).

So then you could try to figure out if that type_size_sort() heuristic
is just particularly horrible for you.

In fact, if your data is not *so* sensitive, and you're ok with making
the one-line commit logs and the filenames public, you could make just
those things available, and maybe I'll have time to look at it. I'm in
the middle of the kernel merge window, but I'm in the last stretch,
and because of the SHA1 thing I've been looking at git lately. No
promises, though.

              Linus
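The off-line matching step Linus outlines can be scripted. A sketch that assumes the three-column `sha1 name-hash size` dump produced by a printf like the one suggested above, plus plain `git log --oneline --raw --abbrev=40` output (raw diff lines are `:100644 100644 <old-sha1> <new-sha1> M<TAB>path`):

```python
def sha1_to_paths(log_lines):
    """Map object SHA-1s to the paths they appeared at, from
    `git log --oneline --raw --abbrev=40` output."""
    paths = {}
    for line in log_lines:
        if not line.startswith(":"):
            continue  # commit one-liners, blank lines, etc.
        meta, _, path = line.rstrip("\n").partition("\t")
        fields = meta.split()
        for sha1 in fields[2:4]:          # old and new object names
            if set(sha1) != {"0"}:        # skip the all-zero SHA-1
                paths.setdefault(sha1, set()).add(path)
    return paths

def annotate(delta_list_lines, paths):
    """Attach paths to each `sha1 name-hash size` line of the dump."""
    for line in delta_list_lines:
        sha1, name_hash, size = line.split()
        yield sha1, int(name_hash), int(size), sorted(paths.get(sha1, ()))
```

With the dump annotated this way, one can scan for runs of huge blobs whose neighbors in the sorted order have unrelated paths, which would point at the type_size_sort() heuristic misfiring.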
* Re: Delta compression not so effective
  2017-03-04  8:27 ` Marius Storm-Olsen
From: Marius Storm-Olsen
To: Linus Torvalds; +Cc: Git Mailing List

On 3/1/2017 18:43, Linus Torvalds wrote:
>> So, this repo must be knocking several parts of Git's insides. I was curious
>> about why it was so slow in the writing-objects part, since the whole repo
>> is on a 4x RAID 5, 7k-spindle array. Now, they are not SSDs, sure, but the
>> thing has ~400MB/s continuous throughput available.
>>
>> iostat -m 5 showed trickle read/write to the process, and 80-100% CPU on a
>> single thread (since the "write objects" stage is single-threaded, obviously).
>
> So the writing phase isn't multi-threaded because it's not expected to
> matter. But if you can't even generate deltas, you aren't just
> *writing* much more data, you're compressing all that data with zlib
> too.
>
> So even with a fast disk subsystem, you won't even be able to saturate
> the disk, simply because the compression will be slower (and
> single-threaded).

I did a simple

  $ time zip -r repo.zip repo/
  ...
  total bytes=219353596620, compressed=214310715074 -> 2% savings

  real    154m6.323s
  user    133m5.209s
  sys     5m5.338s

also using a single thread + the same disk as git repack. But if you
compare it to the numbers below, it's 2.6hrs with zip vs 14.2hrs
(1:5.5). So it can't just be the overhead of having to compress the
full blobs, due to the lacking deltas.

>> Filenames are fairly static, and the bulk of the 6000 biggest non-delta'ed
>> blobs are the same DLLs (multiple of them).
>
> I think the first thing you should test is to repack with fewer
> threads, and a bigger pack window. Do something like
>
>   -c pack.threads=4 --window-memory=30g
>
> instead. Just to see if that starts finding deltas.

I reran the repack with the options above (dropping the zlib=9, as you
suggested):

  $ time git -c pack.threads=4 repack -a -d -F \
        --window=350 --depth=250 --window-memory=30g
  Delta compression using up to 4 threads.
  Compressing objects: 100% (609413/609413)
  Writing objects: 100% (666515/666515), done.
  Total 666515 (delta 499585), reused 0 (delta 0)

  real    850m3.473s
  user    897m36.280s
  sys     10m8.824s

and ended up with

  $ du -sh .
  205G .

In other words, going from a 6G to a 30G window didn't help a lick on
finding deltas for those binaries. (205G was what I had with the
non-aggressive 'git gc', before the zlib=9 repack.)

BUT, oddly enough, even if the new size is almost identical to the
previous version without zlib=9,

  git verify-pack --verbose \
      objects/pack/pack-29b06ae4d458ac03efd98b330702d30e851b2933.idx | \
      sort -k3n | tail -n15

gives me a VERY different list than before:

  17e5b2146311256dc8317d6e0ed1291363c31a76 blob 673399562 110248747 190398904084
  04c881d9069eab3bd0d50dd48a047a60f79cc415 blob 673863358 111710559 188818868865
  fdcabd75aeda86ce234d6e43b54d27d993acddcd blob 674523614 111956017 185706433825
  d8815033d1b00b151ae762be8a69ffa35f55c4b4 blob 675286758 112099638 185153570292
  997e0b9d3bcf440af10c7bbe535a597ca46c492c blob 678274978 112654668 184041692883
  dfed141679e5c33caaa921cbe1595a24967a3c2c blob 681692132 113121410 186753502634
  76a4000e71cd5b85f2265e02eb876acf1f33cc55 blob 682673430 112743915 184563542298
  81e7292c4d2da2d2d236fbfaa572b6c4e8d787f4 blob 684543130 112797325 181805773038
  991184c60e1fc6b2721bf40f181012b72b10d02d blob 684543130 112796892 182344388066
  0e9269f4abd1440addd05d4f964c96d74d11cd89 blob 684547270 112809074 181070719237
  6019b6d09759cf5adeac678c8b56d177803a0486 blob 684547270 112809336 180517242193
  70a5f70bd205329472d6f9c660eb3f7d207a596e blob 686852038 112873611 183520467528
  e86a0064d9652be9f5e3a877b11a665f64198ecd blob 686852038 112874133 182893219377
  bae8de0555be5b1ffa0988cbc6cba698f6745c26 blob 894041802 137223252 2355250324
  94dc773600e03ac1e6f3ab077b70b8297325ad77 blob 945197364 145219485 16560137220

compared to the last 3 entries of the previous pack:

  e9916da851962265a9d5b099e72f60659a74c144 blob 170113524 73514361 966299538
  f7bf1313752deb1bae592cc7fc54289aea87ff19 blob 170113524 70756581 1039814687
  8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob 248959314 237612609 606692699

> So the first thing you might want to do is to just print out the
> objects after sorting them, and before it starts trying to find
> deltas.
...
> and notice that QSORT() line: that's what sorts the objects. You can
> do something like
>
>         for (i = 0; i < n; i++)
>                 show_object_entry_details(delta_list[i]);

I did

  fprintf(stderr, "%s %u %lu\n",
          sha1_to_hex(delta_list[i]->idx.sha1),
          delta_list[i]->hash,
          delta_list[i]->size);

I assume that's correct?

> In fact, if your data is not *so* sensitive, and you're ok with making
> the one-line commit logs and the filenames public, you could make just
> those things available, and maybe I'll have time to look at it.

I've removed all commit messages, and "sanitized" some filepaths etc,
so name hashes won't match what's reported, but that should be fine.
(The object_entry->hash seems to be just a trivial uint32 hash for
sorting anyway.)

I really don't want the files on the mailing list, so I'll send you a
link directly. However, small snippets for public discussions about
potential issues would be fine, obviously.

BUT, if I look at the last 3 entries of the sorted git verify-pack
output, and look for them in the 'git log --oneline --raw -R
--abbrev=40' output, I get:

  :100644 100644 991184c60e1fc6b2721bf40f181012b72b10d02d e86a0064d9652be9f5e3a877b11a665f64198ecd M extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib
  :100644 000000 bae8de0555be5b1ffa0988cbc6cba698f6745c26 0000000000000000000000000000000000000000 D extern/win/gdal-2.0.0/lib/x64/Debug/libgdal.lib
  :000000 100644 0000000000000000000000000000000000000000 94dc773600e03ac1e6f3ab077b70b8297325ad77 A extern/win/gdal-2.0.0/lib/x64/Debug/gdal.lib

while I cannot find ANY of them in the delta_list output??

Shouldn't delta_list contain all objects, sorted by some heuristics?
Or is the delta_list already here limited by some other metric, before
the QSORT?

Also note that the 'git log --oneline --raw -R --abbrev=40' only gave
me the log for trunk, so the second-last object must have been added
in a branch, and deleted on trunk; so I could only see the deletion of
that object in the output.

You might get an idea for how to easily create a repo which reproduces
the issue, and which would highlight it more easily for the ML. I was
thinking of maybe scripting up

  make install prefix=extern

for each Git release, and rewriting trunk history with extern/ binary
commits at the time of each tag; maybe that would show the same
behavior? But then again, most of the binaries are just copies of each
other, and only ~10M, so probably not a big win.

Thanks!

--
.marius
* Re: Delta compression not so effective
  2017-03-06  1:14 ` Linus Torvalds
From: Linus Torvalds
To: Marius Storm-Olsen; +Cc: Git Mailing List

On Sat, Mar 4, 2017 at 12:27 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>
> I reran the repack with the options above (dropping the zlib=9, as you
> suggested)
>
>   $ time git -c pack.threads=4 repack -a -d -F \
>         --window=350 --depth=250 --window-memory=30g
>
> and ended up with
>   $ du -sh .
>   205G .
>
> In other words, going from a 6G to a 30G window didn't help a lick on
> finding deltas for those binaries.

Ok.

> I did
>   fprintf(stderr, "%s %u %lu\n",
>           sha1_to_hex(delta_list[i]->idx.sha1),
>           delta_list[i]->hash,
>           delta_list[i]->size);
>
> I assume that's correct?

Looks good.

> I've removed all commit messages, and "sanitized" some filepaths etc, so
> name hashes won't match what's reported, but that should be fine. (The
> object_entry->hash seems to be just a trivial uint32 hash for sorting
> anyway.)

Yes. I see your name list and your pack-file index.

> BUT, if I look at the last 3 entries of the sorted git verify-pack output,
> and look for them in the 'git log --oneline --raw -R --abbrev=40' output, I
> get:
...
> while I cannot find ANY of them in the delta_list output??

Yes. You have a lot of object names in that log file you sent in
private that aren't in the delta list.

Now, objects smaller than 50 bytes we don't ever try to even delta. I
can't see the object sizes when they don't show up in the delta list,
but looking at some of those filenames I'd expect them to not fall in
that category.

I guess you could do the printout a bit earlier (on the
"to_pack.objects[]" array - to_pack.nr_objects is the count there).
That should show all of them. But the small objects shouldn't matter.

But if you have a file like

  extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib

I would have assumed that it has a size that is > 50. Unless those
"extern" things are placeholders?

> You might get an idea for how to easily create a repo which reproduces the
> issue, and which would highlight it more easily for the ML.

Looking at your sorted object list ready for packing, it doesn't look
horrible. When sorting for size, it still shows a lot of those large
files with the same name hash, so they sorted together in that form
too.

I do wonder if your dll data just simply is absolutely horrible for
xdelta. We've also limited the delta finding a bit, simply because it
had some O(m*n) behavior that gets very expensive on some patterns.
Maybe your blobs trigger some of those cases.

The diff-delta work all goes back to 2005 and 2006, so it's a long
time ago.

What I'd ask you to do is try to find if you could make a repository
of just one of the bigger DLLs with its history, particularly if you
can find some that you don't think is _that_ sensitive.

Looking at it, for example, I see that you have that file

  extern/redhat-5/FlammableV3/x64/plugins/libFlameCUDA-3.0.703.so

that seems to have changed several times, and is a largish blob.

Could you try creating a repository with git fast-import that *only*
contains that file (or pick another one), and see if that deltas well?

And if you find some case that doesn't xdelta well, and that you feel
you could make available outside, we could have a test-case...

              Linus
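The single-file test repo Linus asks for can be built by generating a fast-import stream. A sketch under the assumption that each revision rewrites the whole file on a linear branch (the path and revision contents here are placeholders):

```python
def fast_import_stream(path, revisions):
    """Build a `git fast-import` stream replaying `revisions` (a list of
    bytes) as successive commits that each rewrite only `path`."""
    parts = []
    # Emit every revision as a marked blob first...
    for n, blob in enumerate(revisions, start=1):
        parts.append(f"blob\nmark :{n}\ndata {len(blob)}\n".encode())
        parts.append(blob + b"\n")
    # ...then one commit per revision, pointing the path at that blob.
    for n in range(1, len(revisions) + 1):
        msg = f"rev {n} of {path}".encode()
        parts.append((f"commit refs/heads/master\n"
                      f"committer T <t@example.com> {1488000000 + n} +0000\n"
                      f"data {len(msg)}\n").encode())
        parts.append(msg + b"\n")
        parts.append(f"M 100644 :{n} {path}\n\n".encode())
    return b"".join(parts)

stream = fast_import_stream("libFlameCUDA-3.0.703.so",
                            [b"\x7fELFv1" * 100, b"\x7fELFv2" * 100])
```

Feeding each historical version of the blob in (extracted with `git cat-file blob <sha1>`), then piping the stream into `git fast-import` inside a fresh `git init` repo and running `git repack -a -d -F --window=350 --depth=250`, would show whether those blobs delta at all in isolation.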
* Re: Delta compression not so effective
  2017-03-06  1:14 ` Linus Torvalds
@ 2017-03-06 13:36   ` Marius Storm-Olsen
  0 siblings, 0 replies; 15+ messages in thread
From: Marius Storm-Olsen @ 2017-03-06 13:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List

On 3/5/2017 19:14, Linus Torvalds wrote:
> On Sat, Mar 4, 2017 at 12:27 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
> I guess you could do the printout a bit earlier (on the
> "to_pack.objects[]" array - to_pack.nr_objects is the count there).
> That should show all of them. But the small objects shouldn't matter.
>
> But if you have a file like
>
>     extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib
>
> I would have assumed that it has a size that is > 50. Unless those
> "extern" things are placeholders?

No placeholders; the FlameProxyLibD.lib is a debug lib, and probably
the largest in the whole repo (with a replace count > 5).

> I do wonder if your dll data just simply is absolutely horrible for
> xdelta. We've also limited the delta finding a bit, simply because it
> had some O(m*n) behavior that gets very expensive on some patterns.
> Maybe your blobs trigger some of those cases.

Ok, but given that the SVN delta compression, which is forward-linear
only, is ~45% better, perhaps that particular search could be done
fairly cheaply? Although, I bet time(stamps) are out of the loop at
that point, so they're not a factor anymore. Even if they were, I'm
not sure it would solve anything, if there are other factors also
limiting deltafication.

> The diff-delta work all goes back to 2005 and 2006, so it's a long time ago.
>
> What I'd ask you to do is try to find out whether you could make a
> repository of just one of the bigger DLLs with its history,
> particularly if you can find one that you don't think is _that_
> sensitive.
>
> Looking at it, for example, I see that you have the file
>
>     extern/redhat-5/FlammableV3/x64/plugins/libFlameCUDA-3.0.703.so
>
> which seems to have changed several times, and is a largish blob.
> Could you try creating a repository with git fast-import that *only*
> contains that file (or pick another one), and see if that deltas well?

I'll filter-branch to extern/ only; however, the whole FlammableV3
needs to go too, I'm afraid (extern for that project, but internal to
$WORK). I'll do some rewrites and see what comes up.

> And if you find some case that doesn't xdelta well, and that you feel
> you could make available outside, we could have a test case...

I'll try with this repo first; if not, I'll see if I can construct one.

Thanks!

-- 
.marius
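[Editor's note: the experiment Linus proposes above — a repository containing just one file's history — can be scripted. A hedged sketch that builds a git fast-import stream for successive versions of a single hypothetical blob; piping the output into `git fast-import` inside an empty repository, then running `git repack -a -d -f`, would show how well those versions delta in isolation:

```python
def fast_import_stream(path, versions):
    """Build a git fast-import stream that commits each version of a
    single file to refs/heads/master, one commit per version.

    Blobs get marks 1..N; commits get marks N+1..2N and chain via
    'from'.  The path and contents here are illustrative only.
    """
    out = []
    # Emit every version as a marked blob first.
    for i, data in enumerate(versions, start=1):
        out.append(b"blob\nmark :%d\ndata %d\n" % (i, len(data)) + data + b"\n")
    n = len(versions)
    for i in range(1, n + 1):
        out.append(b"commit refs/heads/master\n")
        out.append(b"mark :%d\n" % (n + i))
        out.append(b"committer Test <t@example.com> %d +0000\n" % (1488000000 + i))
        msg = b"version %d" % i
        out.append(b"data %d\n%s\n" % (len(msg), msg))
        if i > 1:  # first commit has no parent
            out.append(b"from :%d\n" % (n + i - 1))
        out.append(b"M 100644 :%d %s\n\n" % (i, path.encode()))
    return b"".join(out)


# Three synthetic versions of one binary-ish file (placeholder name).
versions = [b"\x00" * 1000 + bytes([i]) * 50 for i in range(3)]
stream = fast_import_stream("libFlame.so", versions)
```

Writing `stream` to `git fast-import` in a fresh repo gives a minimal, shareable reproduction if the resulting pack fails to delta those blobs.]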
* Re: Delta compression not so effective
  2017-03-04  8:27 ` Marius Storm-Olsen
  2017-03-06  1:14 ` Linus Torvalds
@ 2017-03-07  9:07 ` Thomas Braun
  1 sibling, 0 replies; 15+ messages in thread
From: Thomas Braun @ 2017-03-07 9:07 UTC (permalink / raw)
To: Marius Storm-Olsen, Linus Torvalds; +Cc: Git Mailing List

> Marius Storm-Olsen <mstormo@gmail.com> wrote on 4 March 2017 at 09:27:
[...]
> I really don't want the files on the mailing list, so I'll send you a
> link directly. However, small snippets for public discussion about
> potential issues would be fine, obviously.

git fast-export can anonymize a repository [1]. Maybe an anonymized
repository still shows the issue you are seeing.

[1]: https://www.git-scm.com/docs/git-fast-export#_anonymizing
* Re: Delta compression not so effective
  2017-03-01 13:51 Delta compression not so effective Marius Storm-Olsen
  2017-03-01 16:06 ` Junio C Hamano
  2017-03-01 17:36 ` Linus Torvalds
@ 2017-03-01 20:19 ` Martin Langhoff
  2017-03-01 23:59   ` Marius Storm-Olsen
  2 siblings, 1 reply; 15+ messages in thread
From: Martin Langhoff @ 2017-03-01 20:19 UTC (permalink / raw)
To: Marius Storm-Olsen; +Cc: Git Mailing List

On Wed, Mar 1, 2017 at 8:51 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
> BUT, even still, I would expect Git's delta compression to be quite
> effective, compared to the compression present in SVN.

jar files are zipfiles. They don't delta in any useful form, and in
fact they differ even if they contain identical binary files inside.

> Commits: 32988
> DB (server) size: 139GB

Are you certain of the on-disk storage at the SVN server? Ideally,
you've taken the size with a low-level tool like
`du -sh /path/to/SVNRoot`.

Even with no delta compression (as per Junio and Linus' discussion),
based on past experience importing jars/wars/binaries from SVN into
git... I'd expect git's worst case to be on par with SVN, perhaps ~5%
larger due to compression headers on uncompressible data.

cheers,

m
-- 
 martin.langhoff@gmail.com  - ask interesting questions
 http://linkedin.com/in/martinlanghoff - don't be distracted
 http://github.com/martin-langhoff - by shiny stuff
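[Editor's note: Martin's point about jars generalizes to any deflate-compressed container. A tiny change in the uncompressed payload reshuffles the compressed byte stream, so a byte-wise delta between two compressed blobs saves almost nothing, even when a delta between the uncompressed payloads would be a few bytes. A small self-contained demonstration, with plain zlib standing in for a jar's deflate streams:

```python
import zlib

# Two payloads differing in exactly one early byte.
v1 = b"A" * 4096 + b"some payload tail" * 32
v2 = b"B" + v1[1:]

z1 = zlib.compress(v1)
z2 = zlib.compress(v2)

# Both round-trip, but the compressed streams diverge near the point
# of the change and stay different from there on: the Huffman coding
# and back-references downstream of the edit no longer line up, so a
# delta of z2 against z1 is nearly as big as z2 itself -- while the
# uncompressed delta is a single byte.
assert zlib.decompress(z1) == v1
assert zlib.decompress(z2) == v2
uncompressed_diff = sum(a != b for a, b in zip(v1, v2))  # == 1
```

This is why storing compressed artifacts (jars, zips, some .lib/.so packagings) defeats both git's and SVN's delta machinery, and why rezipping with identical contents still produces different blobs.]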
* Re: Delta compression not so effective
  2017-03-01 20:19 ` Martin Langhoff
@ 2017-03-01 23:59   ` Marius Storm-Olsen
  0 siblings, 0 replies; 15+ messages in thread
From: Marius Storm-Olsen @ 2017-03-01 23:59 UTC (permalink / raw)
To: Martin Langhoff; +Cc: Git Mailing List

On 3/1/2017 14:19, Martin Langhoff wrote:
> On Wed, Mar 1, 2017 at 8:51 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>> BUT, even still, I would expect Git's delta compression to be quite
>> effective, compared to the compression present in SVN.
>
> jar files are zipfiles. They don't delta in any useful form, and in
> fact they differ even if they contain identical binary files inside.

If you look through the initial post, you'll see that the jar in
question is in fact a tool (BFG) by Roberto Tyley, which is basically
git filter-branch on steroids. I used it to quickly filter out the
extern/ folder, just to prove that most of the original size stems
from that particular folder. That's all.

The repo does not contain zip or jar files. A few images and other
compressed formats (except a few 100MBs of proprietary files, which
never change), but nothing unusual.

>> Commits: 32988
>> DB (server) size: 139GB
>
> Are you certain of the on-disk storage at the SVN server? Ideally,
> you've taken the size with a low-level tool like
> `du -sh /path/to/SVNRoot`.

139GB is from 'du -sh' on the SVN server. I imported (via SubGit)
directly from the (hotcopied) SVN folder on the server. So true SVN
size.

> Even with no delta compression (as per Junio and Linus' discussion),
> based on past experience importing jars/wars/binaries from SVN into
> git... I'd expect git's worst case to be on par with SVN, perhaps ~5%
> larger due to compression headers on uncompressible data.

Yes, I was expecting a Git repo <139GB, but like Linus mentioned,
something must be knocking the delta search off its feet, so it bails
out. Loose objects -> 'hard' repack didn't show that much difference.

Thanks!
-- 
.marius
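[Editor's note: the "delta search bailing out" suspected above happens inside pack-objects' windowed search: objects are sorted (by name hash, then size), each object is compared only against a sliding window of recent candidates, and a base is kept only if the delta is small enough. A toy model of that loop — difflib stands in for git's diff-delta, and the window/threshold numbers are illustrative, not git's:

```python
import difflib


def toy_delta_size(src, tgt):
    """Bytes of tgt that cannot be copied from src, i.e. the literal
    data a copy/insert delta would have to carry (copy-op overhead
    ignored for simplicity)."""
    sm = difflib.SequenceMatcher(None, src, tgt, autojunk=False)
    copied = sum(size for _, _, size in sm.get_matching_blocks())
    return len(tgt) - copied


def pick_bases(objs, window=10):
    """objs: list of (name, data) pre-sorted the way git does it --
    similar names adjacent, larger versions first.  For each object,
    try the previous `window` entries as delta bases and keep the best
    one, but 'bail out' (store the object whole, base None) when no
    candidate saves at least half the size."""
    out = {}
    for i, (name, data) in enumerate(objs):
        best = None
        for j in range(max(0, i - window), i):
            cost = toy_delta_size(objs[j][1], data)
            if cost < len(data) // 2 and (best is None or cost < best[1]):
                best = (objs[j][0], cost)
        out[name] = best
    return out
```

The bail-out is the key failure mode: if every candidate in the window deltas poorly (as with compressed or heavily rearranged binaries), the object is stored whole, and a bigger `--window` only helps if a good base exists somewhere in the sorted neighborhood at all.]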
end of thread, other threads:[~2017-03-07 9:07 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-01 13:51 Delta compression not so effective Marius Storm-Olsen
2017-03-01 16:06 ` Junio C Hamano
2017-03-01 16:17 ` Junio C Hamano
2017-03-01 17:36 ` Linus Torvalds
2017-03-01 17:57 ` Marius Storm-Olsen
2017-03-01 18:30 ` Linus Torvalds
2017-03-01 21:08 ` Martin Langhoff
2017-03-02 0:12 ` Marius Storm-Olsen
2017-03-02 0:43 ` Linus Torvalds
2017-03-04 8:27 ` Marius Storm-Olsen
2017-03-06 1:14 ` Linus Torvalds
2017-03-06 13:36 ` Marius Storm-Olsen
2017-03-07 9:07 ` Thomas Braun
2017-03-01 20:19 ` Martin Langhoff
2017-03-01 23:59 ` Marius Storm-Olsen