From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Ulrich Windl <Ulrich.Windl@rz.uni-regensburg.de>, git@vger.kernel.org
Subject: Re: non-smooth progress indication for git fsck and git gc
Date: Sat, 01 Sep 2018 14:53:28 +0200 [thread overview]
Message-ID: <87y3clbcqf.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <87bma2qcba.fsf@evledraar.gmail.com>
On Thu, Aug 16 2018, Ævar Arnfjörð Bjarmason wrote:
> On Thu, Aug 16 2018, Jeff King wrote:
>
>> On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote:
>>
>>> I'd like to point out some minor issue observed while processing some
>>> 50000-object repository with many binary objects, but most are rather
>>> small:
>>>
>>> Between the two phases of "git fsck" (checking directories and
>>> checking objects) there was a break of several seconds where no
>>> progress was indicated.
>>>
>>> During "git gc" the writing objects phase did not update for some
>>> seconds, but then the percentage counter jumped like from 15% to 42%.
>>>
>>> I understand that updating the progress output too often can be a
>>> performance bottleneck, while upating it too rarely might only bore
>>> the user... ;-)
>>
>> We update the counter integer for every object we process, and then
>> actually update the display whenever the percentage increases or a
>> second has elapsed, whichever comes first.
>>
>> What you're seeing is likely an artifact of _what_ we're counting:
>> written objects. Not all objects are the same size, so it's not uncommon
>> to see thousands/sec when dealing with small ones, and then several
>> seconds for one giant blob.
>>
>> The only way to solve that is to count bytes. We don't have a total byte
>> count in most cases, and it wouldn't always make sense (e.g., the
>> "Compressing objects" meter can show the same issue, but it's not really
>> putting through bytes in a linear way). In some cases we do show
>> transmitted size and throughput, but that's just for network operations.
>> We could do the same for "gc" with the patch below. But usually
>> throughput isn't all that interesting for a filesystem write, because
>> bandwidth isn't the bottleneck.
>>
>> Possibly we could have a "half throughput" mode that counts up the total
>> size written, but omits the speed indicator. That's not an unreasonable
>> thing to show for a local pack, since you end up with the final pack
>> size. The object counter would still be jumpy, but you'd at least have
>> one number updated at least once per second as you put through a large
>> blob.
>>
>> If you really want a smooth percentage, then we'd have to start counting
>> bytes instead of objects. Two reasons we don't do that are:
>>
>> - we often don't know the total byte size exactly. E.g., for a
>> packfile write, it depends on the result of deflating each object.
>> You can make an approximation and just pretend at the end that you
>> hit 100%. Or you can count the pre-deflate sizes, but then your
>> meter doesn't match the bytes from the throughput counter.
>>
>> - for something like fsck, we're not actually writing out bytes. So I
>> guess you'd be measuring "here's how many bytes of objects I
>> fsck-ed". But is that on-disk compressed bytes, or decompressed
>> bytes?
>>
>> If the former, that's only marginally better as a measure of effort,
>> since delta compression means that a small number of on-disk bytes
>> may require a big effort (imagine processing a 100 byte blob versus
>> a 100 byte delta off of a 100MB blob).
>>
>> The latter is probably more accurate. But it's also not free to
>> pre-generate the total. We can get the number of objects or the size
>> of the packfile in constant-time, but totaling up the uncompressed
>> size of all objects is O(n). So that's extra computation, but it
>> also means a potential lag before we can start the progress meter.
>>
>> I'm also not sure how meaningful a byte count is for a user there.
>>
>> So there. That's probably more than you wanted to know about Git's
>> progress code. I think it probably _could_ be improved by counting
>> more/different things, but I also think it can be a bit of a rabbit
>> hole. Which is why AFAIK nobody's really looked too seriously into
>> changing it.
>>
>> -Peff
>
> This is all interesting, but I think unrelated to what Ulrich is talking
> about. Quote:
>
> Between the two phases of "git fsck" (checking directories and
> checking objects) there was a break of several seconds where no
> progress was indicated
>
> I.e. it's not about the pause you get with your testcase (which is
> certainly another issue) but the break between the two progress bars.
>
> Here's a test case you can clone:
> https://github.com/avar/2015-04-03-1M-git (or might already have
> "locally" :)
>
> If you fsck this repository it'll take around (on my spinning rust
> server) 30 seconds between 100% of "Checking object directories" before
> you get any output from "Checking objects".
>
> The breakdown of that is (this is from approximate eyeballing):
>
> * We spend 1-3 seconds just on this:
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181
>
> * We spend the majority of the ~30s on this:
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79
>
> * Wes spend another 3-5 seconds on this QSORT:
> https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105
>
> I.e. it's not about objects v.s. bytes, but the structural problem with
> the code that we pass a progress bar down to verify_pack() which does a
> lot of work in verify_pack_index() and verify_packfile() before we even
> get to iterating over the objects in the file, and only then do we start
> displaying progress.
This patch almost entirely fixes the problem, except for the 1-3 seconds
we take on the qsort:
diff --git a/sha1dc/sha1.c b/sha1dc/sha1.c
index df0630bc6d..6d6b9b54c4 100644
--- a/sha1dc/sha1.c
+++ b/sha1dc/sha1.c
@@ -25,6 +25,7 @@
#include "sha1.h"
#include "ubc_check.h"
+#include "../progress.h"
#if (defined(__amd64__) || defined(__amd64) || defined(__x86_64__) || defined(__x86_64) || \
defined(i386) || defined(__i386) || defined(__i386__) || defined(__i486__) || \
@@ -1820,6 +1821,8 @@ void SHA1DCSetCallback(SHA1_CTX* ctx, collision_block_callback callback)
void SHA1DCUpdate(SHA1_CTX* ctx, const char* buf, size_t len)
{
unsigned left, fill;
+ struct progress *progress;
+ size_t len_cp = len, done = 0;
if (len == 0)
return;
@@ -1836,6 +1839,8 @@ void SHA1DCUpdate(SHA1_CTX* ctx, const char* buf, size_t len)
len -= fill;
left = 0;
}
+
+ progress = start_delayed_progress(_("Hashing"), len_cp);
while (len >= 64)
{
ctx->total += 64;
@@ -1848,7 +1853,12 @@ void SHA1DCUpdate(SHA1_CTX* ctx, const char* buf, size_t len)
#endif /* defined(SHA1DC_ALLOW_UNALIGNED_ACCESS) */
buf += 64;
len -= 64;
+ done += 64;
+ display_progress(progress, done);
}
+ display_progress(progress, len_cp);
+ stop_progress(&progress);
+
if (len > 0)
{
ctx->total += len;
With this we'll get output like:
$ ~/g/git/git -C ~/g/2015-04-03-1M-git/ --exec-path=$PWD fsck
Checking object directories: 100% (256/256), done.
Hashing: 100% (452634108/452634108), done.
Hashing: 100% (1073741824/1073741824), done.
Hashing: 100% (1073741824/1073741824), done.
Hashing: 100% (1008001572/1008001572), done.
Checking objects: 2% (262144/13064614)
^C
All tests pass with this. Isn't it awesome? Except it's of course a
massive hack, we wouldn't want to just hook into SHA1DC like this.
The problem comes down to us needing to call git_hash_sha1_update() with
some really huge input, that function is going to take a *long* time,
and the only way we're getting incremental progress is:
1) If we ourselves split the input into N chunks
2) If we hack into the SHA1 library itself
This patch does #2, but for this to be acceptable we'd need to do
something like #1.
The least hacky way I can think of doing this is something like the
following: We'd define two git_hash_algo entries for SHA-1, one that could give you
progress reports, and that one would only ever hash in chunks of at most
size N, and would wrap git_SHA1_Update() in something similar to the
progress code above.
We'd also need to extend the git_hash_algo struct to e.g. have a char *
we'd use as a description, instead of just "Hashing".
Then the caller here could just switch out the hash algo if
show_progress was true, temporarily some "int do_progress" and "char
*progress_description" entries in the struct, and off we go:
https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L69-L80
Still a hack, but I can't see another way of making it work without
compromising all hashing performance with some progress reporting.
next prev parent reply other threads:[~2018-09-01 12:53 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-08-16 6:54 non-smooth progress indication for git fsck and git gc Ulrich Windl
2018-08-16 15:18 ` Duy Nguyen
2018-08-16 16:05 ` Jeff King
2018-08-20 8:27 ` Antw: " Ulrich Windl
2018-08-16 15:57 ` Jeff King
2018-08-16 20:02 ` Jeff King
2018-08-16 22:10 ` Junio C Hamano
2018-08-16 20:35 ` Ævar Arnfjörð Bjarmason
2018-08-16 20:55 ` Jeff King
2018-08-16 21:06 ` Jeff King
2018-08-17 14:39 ` Duy Nguyen
2018-08-20 8:33 ` Antw: " Ulrich Windl
2018-08-20 8:57 ` Ævar Arnfjörð Bjarmason
2018-08-20 9:37 ` Ulrich Windl
2018-08-21 1:07 ` Jeff King
2018-08-21 6:20 ` Ulrich Windl
2018-08-21 15:21 ` Duy Nguyen
2018-09-01 12:53 ` Ævar Arnfjörð Bjarmason [this message]
2018-09-01 13:52 ` Ævar Arnfjörð Bjarmason
2018-09-02 7:46 ` Jeff King
2018-09-02 7:55 ` Jeff King
2018-09-02 8:55 ` Jeff King
2018-09-03 16:48 ` Ævar Arnfjörð Bjarmason
2018-09-07 3:30 ` Jeff King
2018-09-04 15:53 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87y3clbcqf.fsf@evledraar.gmail.com \
--to=avarab@gmail.com \
--cc=Ulrich.Windl@rz.uni-regensburg.de \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).