git.vger.kernel.org archive mirror
From: Marius Storm-Olsen <mstormo@gmail.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: Delta compression not so effective
Date: Wed, 1 Mar 2017 18:12:10 -0600	[thread overview]
Message-ID: <603afdf2-159c-6bed-0e85-2824391185d1@gmail.com> (raw)
In-Reply-To: <CA+55aFx7QFqrHw4e72vOdM5z0rw1CCkL2-UX8ej5CLSBWjLNLA@mail.gmail.com>

On 3/1/2017 12:30, Linus Torvalds wrote:
> On Wed, Mar 1, 2017 at 9:57 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>>
>> Indeed, I did do a
>>     -c pack.threads=20 --window-memory=6g
>> to 'git repack', since the machine is a 20-core (40 threads) machine with
>> 126GB of RAM.
>>
>> So I guess with these sized objects, even at 6GB per thread, it's not enough
>> to get a big enough Window for proper delta-packing?
>
> Hmm. The 6GB window should be plenty good enough, unless your blobs
> are in the gigabyte range too.

No, the git verify-pack listing in the previous post was taken from the 
bottom of the sorted output, so those are the largest blobs, at ~249MB.
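
For reference, that listing came from a pipeline along these lines (a 
sketch, shown here against a throwaway demo repo; in practice, point 
verify-pack at the real repository's pack .idx files):

```shell
# Demonstration in a throwaway repo; run the same pipeline against the
# big repository's pack index files in practice.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
printf 'example payload' > big.dll
git add big.dll
git -c user.name=demo -c user.email=demo@example.com commit -qm 'add blob'
git repack -adq

# Largest blobs last: column 3 of 'verify-pack -v' is the object size.
git verify-pack -v .git/objects/pack/pack-*.idx |
  awk '$2 == "blob"' |
  sort -k3 -n |
  tail -20
```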


>> This repo took >14hr to repack on 20 threads though ("compression" step was
>> very fast, but stuck 95% of the time in "writing objects"), so I can only
>> imagine how long a pack.threads=1 will take :)
>
> Actually, it's usually the compression phase that should be slow - but
> if something is limiting finding deltas (so that we abort early), then
> that would certainly tend to speed up compression.
>
> The "writing objects" phase should be mainly about the actual IO.
> Which should be much faster *if* you actually find deltas.

So, this repo must be hitting several of Git's internal corner cases at 
once. I was curious why the "writing objects" part was so slow, since 
the whole repo sits on a 4-disk RAID 5 of 7200 RPM spindles. They are 
not SSDs, sure, but the array sustains ~400MB/s of sequential throughput.
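
(For what it's worth, that throughput figure is easy to re-verify with a 
quick sequential-write test; the /var/tmp target below is a stand-in, in 
practice put the test file on the array itself:)

```shell
# Rough sequential-write sanity check; conv=fdatasync makes dd include
# the final flush in its reported timing. Target path is illustrative.
target=$(mktemp /var/tmp/ddtest.XXXXXX)
dd if=/dev/zero of="$target" bs=1M count=256 conv=fdatasync
rm -f "$target"
```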

iostat -m 5 showed only a trickle of read/write traffic from the 
process, yet 80-100% CPU on a single thread (since the "writing objects" 
stage is single-threaded, obviously).

The failing delta search must be triggering some other pathological 
behavior.


> For example, the sorting code thinks that objects with the same name
> across the history are good sources of deltas. But it may be that for
> your case, the binary blobs that you have don't tend to actually
> change in the history, so that heuristic doesn't end up doing
> anything.

These are generally just DLLs (debug & release) whose content is 
updated due to upstream project updates. So filenames/paths tend to 
stay identical, while the content changes throughout history.


> The sorting does use the size and the type too, but the "filename
> hash" (which isn't really a hash, it's something nasty to give
> reasonable results for the case where files get renamed) is the main
> sort key.
>
> So you might well want to look at the sorting code too. If filenames
> (particularly the end of filenames) for the blobs aren't good hints
> for the sorting code, that sort might end up spreading all the blobs
> out rather than sort them by size.

Filenames are fairly static, and the bulk of the 6000 biggest 
non-delta'ed blobs are the same DLLs (multiple copies of them).
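
That's checkable straight from verify-pack output: deltified entries 
carry two extra fields (depth and base SHA-1), so the non-delta blobs 
are the 5-field lines, and rev-list can map the SHA-1s back to paths. A 
sketch (again against a throwaway repo):

```shell
# Demonstration repo; run the same pipeline in the real repository.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
printf 'release build' > a.dll
printf 'debug build..' > b.dll
git add . && git -c user.name=d -c user.email=d@e commit -qm 'dlls'
git repack -adq

# Non-delta blobs: 'verify-pack -v' lines without depth/base (NF == 5).
# Join against rev-list output to recover the path behind each SHA-1.
git rev-list --objects --all | sort > objmap      # "<sha1> <path>"
git verify-pack -v .git/objects/pack/pack-*.idx |
  awk '$2 == "blob" && NF == 5 {print $1}' |
  sort > nodelta
join nodelta objmap                               # un-deltified blobs + paths
```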


> And again, if that happens, the "can I delta these two objects" code
> will notice that the size of the objects are wildly different and
> won't even bother trying. Which speeds up the "compressing" phase, of
> course, but then because you don't get any good deltas, the "writing
> out" phase sucks donkey balls because it does zlib compression on big
> objects and writes them out to disk.

Right. On this machine I really didn't notice much difference between 
the standard zlib level and -9; the 203GB version was actually packed 
with zlib=9.
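
The comparison was effectively between runs like these (sketch with a 
demo repo holding incompressible random data; pack.compression is the 
knob repack honors, and with already-compressed payloads the du numbers 
barely move between levels):

```shell
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
head -c 1048576 /dev/urandom > blob.bin          # incompressible payload
git add . && git -c user.name=d -c user.email=d@e commit -qm 'blob'

git -c pack.compression=9 repack -adfq           # max zlib effort
du -sh .git/objects/pack
git -c pack.compression=1 repack -adfq           # fastest zlib
du -sh .git/objects/pack
```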


> So there are certainly multiple possible reasons for the deltification
> to not work well for you.
>
> How sensitive is your material? Could you make a smaller repo with
> some of the blobs that still show the symptoms? I don't think I want
> to download 206GB of data even if my internet access is good.

Pretty sensitive, and I'm not sure how I can reproduce this reasonably 
well. However, I can easily recompile Git with any recommended 
instrumentation/printfs, if you have suggestions for good places to 
start. If anyone has good file/line numbers, I'll give that a go and 
report back.
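
Short of patching, Git's built-in tracing may already say where the time 
goes, something like this (demo repo for illustration; on the real repo 
I'd keep the original thread/window-memory settings and read the log):

```shell
# Throwaway repo; no recompilation needed for this level of tracing.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
echo payload > f.dll && git add . && git -c user.name=d -c user.email=d@e commit -qm x

# GIT_TRACE_PERFORMANCE prints per-command timings to stderr.
GIT_TRACE=1 GIT_TRACE_PERFORMANCE=1 \
  git repack -adf --threads=2 --window-memory=6g 2>repack-trace.log
tail repack-trace.log
```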

Thanks!

-- 
.marius

Thread overview: 15+ messages
2017-03-01 13:51 Delta compression not so effective Marius Storm-Olsen
2017-03-01 16:06 ` Junio C Hamano
2017-03-01 16:17   ` Junio C Hamano
2017-03-01 17:36 ` Linus Torvalds
2017-03-01 17:57   ` Marius Storm-Olsen
2017-03-01 18:30     ` Linus Torvalds
2017-03-01 21:08       ` Martin Langhoff
2017-03-02  0:12       ` Marius Storm-Olsen [this message]
2017-03-02  0:43         ` Linus Torvalds
2017-03-04  8:27           ` Marius Storm-Olsen
2017-03-06  1:14             ` Linus Torvalds
2017-03-06 13:36               ` Marius Storm-Olsen
2017-03-07  9:07             ` Thomas Braun
2017-03-01 20:19 ` Martin Langhoff
2017-03-01 23:59   ` Marius Storm-Olsen
