Re: [GSOC] How to improve the performance of git cat-file --batch

From: ZheNing Hu <adlternative@gmail.com>
To: Christian Couder <christian.couder@gmail.com>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Git List" <git@vger.kernel.org>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Hariom verma" <hariom18599@gmail.com>
Subject: Re: [GSOC] How to improve the performance of git cat-file --batch
Date: Wed, 28 Jul 2021 21:38:33 +0800
Message-ID: <CAOLTT8QvtJ70X8mQx4K4gD0T=7i-ryd0QL81-QeSTqSWyHuWLQ@mail.gmail.com>
In-Reply-To: <CAP8UFD1WtSX59AqfG=d0Ge2BcK+8LdyZk0mQuftpu=FKX-877Q@mail.gmail.com>

Christian Couder <christian.couder@gmail.com> wrote on Wed, Jul 28, 2021 at 3:34 PM:
>
> On Tue, Jul 27, 2021 at 3:37 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > Christian Couder <christian.couder@gmail.com> wrote on Mon, Jul 26, 2021 at 5:38 PM:
> > >
> > > On Sun, Jul 25, 2021 at 2:04 PM ZheNing Hu <adlternative@gmail.com> wrote:
> > > > Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote on Sun, Jul 25, 2021 at 5:23 AM:
> > >
> > > > > Having skimmed it I'm a bit confused about this in reference to
> > > > > performance generally. I haven't looked into the case you're discussing,
> > > > > but as I noted in
> > > > > https://lore.kernel.org/git/87im1p6x34.fsf@evledraar.gmail.com/ the
> > > > > profiling clearly shows that the main problem is that you've added
> > > > > object lookups we skipped before.
> > > >
> > > > Yeah, you showed me last time that lookup_object() took up a lot of time.
> > >
> > > Could the document explain with some details why there are more calls
> > > to lookup_object()?
>
> Please note that here we are looking for the number of times the
> lookup_object() function is called. This means that to measure that
> properly, it might actually be better to have some way to count this
> number of times the lookup_object() function is called, rather than
> count the time spent in the function.
>

OK, so we need an accurate count of the calls to lookup_object(),
although the conclusion is obvious: 0 (upstream/master) and a big
number (with my patch).

> For example you could add a trace_printf(...) call in the
> lookup_object() function, set GIT_TRACE=/tmp/git_trace.log, and then
> just run `git cat-file --batch ...` and count the number of times the
> new trace from lookup_object() appears in the log file.
>

After adding a trace_printf() in lookup_object(), here is the result:

Check out d3b5272a94 ([GSOC] cat-file: reuse ref-filter logic):

$ GIT_TRACE=/tmp/git_trace.log ggg cat-file --batch --batch-all-objects
$ cat /tmp/git_trace.log | wc -l
522710

Check out eb27b338a3e7 (upstream/master):

$ rm /tmp/git_trace.log
$ GIT_TRACE=/tmp/git_trace.log ggg cat-file --batch --batch-all-objects
$ cat /tmp/git_trace.log | wc -l
1

The one remaining line is printed by git.c, not by lookup_object().
So upstream/master never calls lookup_object() with the --batch
option, while with my patch it is called roughly 522,000 times.
According to the results of the previous gprof test, lookup_object()
takes 8.72% of the total time. (Though below you seem to think that
the gprof results are not reliable enough.) This may be a place worth
optimizing.
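
The counting step can be sketched as below. This is only a minimal
sketch: it assumes the trace_printf() added to lookup_object() prints
a line containing the string "lookup_object". That marker is
hypothetical and must match whatever the local debugging patch
actually prints. Counting only the matching lines, instead of running
wc -l on the whole log, keeps the line printed by git.c out of the
total.

```shell
#!/bin/sh
# Count the trace lines that contain a given marker.
# The default marker "lookup_object" is an assumption; pass the text
# that the added trace_printf() call really emits.
count_trace_calls () {
	trace_file=$1
	marker=${2:-lookup_object}
	# grep -c prints the number of matching lines. It exits non-zero
	# when nothing matches, so ignore the status and keep the "0".
	grep -c -- "$marker" "$trace_file" || true
}

# Usage (GIT_TRACE appends to an existing file, so remove it first):
#   rm -f /tmp/git_trace.log
#   GIT_TRACE=/tmp/git_trace.log git cat-file --batch --batch-all-objects >/dev/null
#   count_trace_calls /tmp/git_trace.log lookup_object
```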

> > > For example it could take an example `git cat-file
> > > --batch ...` command (if possible a simple one), and say which
> > > functions like lookup_object() it was using (and how many times) to
> > > get the data it needs before using the ref-filter logic, and then the
> > > same information after using the ref-filter logic.
> >
> > Sorry but this time I use gprof but can’t observe the same effect as before.
> > lookup_object() is indeed a part of the time overhead, but its proportion is
> > not very large this time.
>
> I am not sure gprof is a good tool for this. It looks like it tries to
> attribute spent times to functions by splitting time between many low
> level functions, and it doesn't seem like the right approach to me.
> For example if lookup_object() is called 5% more often, it could be
> that the excess time is attributed to some low level functions and not
> to lookup_object() itself.
>

Maybe you are thinking of the "cumulative seconds" column of gprof?
Its "self seconds" column is the number of seconds accounted for by
the function alone.

> That's why we might get a more accurate view of what happens by just
> counting the number of time the function is called.
>

According to https://sourceware.org/binutils/docs/gprof/Flat-Profile.html :

calls
    This is the total number of times the function was called. If the
    function was never called, or the number of times it was called
    cannot be determined (probably because the function was not
    compiled with profiling enabled), the calls field is blank.

Therefore, the previous gprof test results already contain accurate
call counts.
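
To read that "calls" field for a single function, something like the
following could be used. It is a sketch under assumptions: the helper
name gprof_calls is made up for illustration, the input is expected to
be a gprof flat profile (for example from `gprof -b -p`), and data
rows are assumed to have the seven columns described on the page
above, with a function name that contains no spaces.

```shell
#!/bin/sh
# Print the "calls" column of a gprof flat profile for one function.
# Reads the flat profile text on stdin; exits non-zero if the function
# is not listed or its calls field is blank.
gprof_calls () {
	awk -v fn="$1" '
		# Rows with a known call count have 7 fields:
		# % time, cumulative seconds, self seconds, calls,
		# self per call, total per call, name.
		# Rows with a blank calls field have fewer fields and
		# are skipped.
		NF == 7 && $7 == fn { print $4; found = 1; exit }
		END { if (!found) exit 1 }
	'
}

# Usage, assuming git was built with -pg and gmon.out exists:
#   gprof -b -p ./git gmon.out | gprof_calls lookup_object
```

Running this against the profiles from both versions gives two
numbers that can be compared directly, without relying on how gprof
attributes time.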

> > > It could be nice if there were also some data about how much time used
> > > to be spent in lookup_object() and how much time is now spent there,
> > > and how this compares with the whole slowdown we are seeing. If Ævar
> > > already showed that, you can of course reuse what he already did.
>
> Now I regret having wrote the above, sorry, as it might not be the
> best way to look at this.
>

Yes, because his earlier results were based on an older version of my
patch set, which has changed since then.

> > This is my test for git cat-file --batch --batch-all-objects >/dev/null:
>
> [...]
>
> > Because we called parse_object_buffer() in get_object(), lookup_object()
> > is called indirectly...
>
> It would be nice if you could add a bit more details about how
> lookup_object() is called (both before and after the changes that
> degrade performance).
>

After we let git cat-file --batch reuse the ref-filter logic, we use
get_object() to grab the object's data. Since we use the %(raw) atom,
which requires the raw data of the object, oi->info.contentp is set,
so parse_object_buffer() is called in get_object().
parse_object_buffer() calls lookup_commit(), lookup_blob(),
lookup_tree() and lookup_tag(), and each of them calls
lookup_object(). As we have seen, lookup_object() seems to take a lot
of time.

So let us think: can we skip this parse_object_buffer() in some
scenarios? parse_object_buffer() parses the object's data into a
"struct object *obj", which we then feed to grab_values().
grab_values() in turn feeds obj to grab_tag_values() or
grab_commit_values() to handle the %(tag), %(type), %(object),
%(tree), %(parent) and %(numparent) atoms.

But git cat-file --batch does not need to handle these atoms with its
default format.

Therefore, maybe we can skip parsing the object buffer when we really
don't care about these atoms.

> > We can see that some functions are called the same times:
>
> When you say "the same times" I guess you mean that the same amount of
> time is spent in these functions.

No... What I mean is that the number of calls to these functions is
the same. See the "calls" column of gprof for patch_delta(): it is
1968866 in both runs.

>
> > patch_delta(),
> > unpack_entry(), hashmap_remove()... But after using my patch,
> > format_ref_array_item(), grab_sub_body_contents(), get_object(), lookup_object()
> > begin to occupy a certain proportion.
>
> Thanks!

Thanks for the hint!!!
--
ZheNing Hu