Re: [Question] Can git cat-file have a type filtering option?

From: ZheNing Hu <adlternative@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Taylor Blau <me@ttaylorr.com>, Junio C Hamano <gitster@pobox.com>,
	Git List <git@vger.kernel.org>,
	johncai86@gmail.com
Subject: Re: [Question] Can git cat-file have a type filtering option?
Date: Tue, 11 Apr 2023 22:09:33 +0800
Message-ID: <CAOLTT8T9pJFr94acvUo-8EYriST1gOAkXaDZBxHk54o=Zm5=Sg@mail.gmail.com>
In-Reply-To: <20230410201414.GC104097@coredump.intra.peff.net>

Jeff King <peff@peff.net> wrote on Tue, Apr 11, 2023 at 04:14:
>
> On Sun, Apr 09, 2023 at 02:47:30PM +0800, ZheNing Hu wrote:
>
> > > Perhaps slightly so, since there is naturally going to be some
> > > duplicated effort spawning processes, loading any shared libraries,
> > > initializing the repository and reading its configuration, etc.
> > >
> > > But I'd wager that these are all a negligible cost when compared to the
> > > time we'll have to spend reading, inflating, and printing out all of the
> > > objects in your repository.
> >
> > What you said makes sense. I implemented the --type-filter option for
> > git cat-file and compared the performance of outputting all blobs in the
> > git repository with and without using the type-filter. I found that the
> > difference was not significant.
> >
> > time git cat-file --batch-all-objects \
> > --batch-check="%(objectname) %(objecttype)" | \
> > awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null
> > 17.10s user 0.27s system 102% cpu 16.987 total
> >
> > time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null
> > 16.74s user 0.19s system 95% cpu 17.655 total
> >
> > At first, I thought the processes that provide all blob oids by using
> > git rev-list or git cat-file --batch-all-objects --batch-check might waste
> > cpu, io, memory resources because they need to read a large number
> > of objects, and then they are read again by git cat-file --batch.
> > However, it seems that this is not actually the bottleneck in performance.
>
> Yeah, I think most of your time there is spent on the --batch command
> itself, which is just putting through a lot of bytes. You might also try
> with "--unordered". The default ordering for --batch-all-objects is in
> sha1 order, which has pretty bad locality characteristics for delta
> caching. Using --unordered goes in pack-order, which should be optimal.
>
> E.g., in git.git, running:
>
>   time \
>     git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
>     perl -lne 'print $1 if /^blob (.*)/' |
>     git cat-file --batch >/dev/null
>
> takes:
>
>   real  0m29.961s
>   user  0m29.128s
>   sys   0m1.461s
>
> Adding "--unordered" to the initial cat-file gives:
>
>   real  0m1.970s
>   user  0m2.170s
>   sys   0m0.126s
>
> So reducing the size of the actual --batch printing may make the
> relative cost of using multiple processes much higher (I didn't apply
> your --type-filter patches to test myself).
>

You are right. Adding the --unordered option keeps the cost of the
default hash-ordered traversal from skewing the test results.

time git cat-file --unordered --batch-all-objects \
--batch-check="%(objectname) %(objecttype)" | \
awk '{ if ($2 == "blob") print $1 }' | git cat-file --batch > /dev/null

4.17s user 0.23s system 109% cpu 4.025 total

time git cat-file --unordered --batch-all-objects --batch \
--type-filter=blob >/dev/null

3.84s user 0.17s system 97% cpu 4.099 total

It looks like the difference is not significant either.

After all, the truly time-consuming part is reading the full
contents of each blob, whereas git cat-file --batch-check
only has to read the first few bytes of each object.
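
To make the "first few bytes" point concrete, here is a minimal sketch
in Python. It is not git's code. It only assumes the loose object
layout, zlib("<type> <size>\0" + contents), and the helper names
make_loose_object and read_header are made up for illustration. Packed
objects keep type and size in a separate entry header and are not
modeled here.

```python
import zlib


def make_loose_object(obj_type: str, payload: bytes) -> bytes:
    """Build loose-object bytes: zlib("<type> <size>" + NUL + payload)."""
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return zlib.compress(header + payload)


def read_header(compressed: bytes, chunk: int = 32):
    """Return (type, size, compressed_bytes_fed) without inflating the body."""
    inflater = zlib.decompressobj()
    out = b""
    fed = 0
    while b"\0" not in out:
        fresh = compressed[fed:fed + chunk]
        pending = inflater.unconsumed_tail + fresh
        if not pending:
            raise ValueError("no NUL-terminated header found")
        fed += len(fresh)
        # Cap the output so a large body is never inflated in full.
        out += inflater.decompress(pending, 64)
    header = out.split(b"\0", 1)[0]
    obj_type, size = header.split(b" ")
    return obj_type.decode(), int(size), fed


if __name__ == "__main__":
    # Illustrative payload only: the decimal digits of 0..49999 concatenated.
    payload = b"".join(str(i).encode() for i in range(50_000))
    loose = make_loose_object("blob", payload)
    obj_type, size, fed = read_header(loose)
    print(obj_type, size, f"fed {fed} of {len(loose)} compressed bytes")
```

The header reader stops as soon as it sees the NUL terminator, so the
work it does is independent of the blob size, while printing the blob
requires inflating everything.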

> In general, I do think having a processing pipeline like this is OK, as
> it's pretty flexible. But especially for smaller queries (even ones that
> don't ask for the whole object contents), the per-object lookup costs
> can start to dominate (especially in a repository that hasn't been
> recently packed). Right now, even your "--batch --type-filter" example
> is probably making at least two lookups per object, because we don't
> have a way to open a "handle" to an object to check its type, and then
> extract the contents conditionally. And of course with multiple
> processes, we're naturally doing a separate lookup in each one.
>
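
The "handle" idea in the quoted paragraph can be sketched as follows.
This is a toy model under stated assumptions, not git's internal API:
ToyStore, ObjectHandle and cat_blobs_single_lookup are hypothetical
names, objects are held in memory in the loose object layout, and a
counter stands in for the cost of locating an object.

```python
import zlib


class ObjectHandle:
    """Hypothetical handle: parses the header now, inflates the body on demand."""

    def __init__(self, compressed: bytes):
        self._inflater = zlib.decompressobj()
        # Inflate at most 64 bytes: enough for "<type> <size>" plus NUL.
        prefix = self._inflater.decompress(compressed, 64)
        header, _, self._body_start = prefix.partition(b"\0")
        obj_type, size = header.split(b" ")
        self.type = obj_type.decode()
        self.size = int(size)

    def read_content(self) -> bytes:
        """Finish inflating from where the header read stopped."""
        if self._inflater.eof:
            return self._body_start
        rest = self._inflater.decompress(self._inflater.unconsumed_tail)
        return self._body_start + rest + self._inflater.flush()


class ToyStore:
    """Hypothetical in-memory object store that counts lookups."""

    def __init__(self):
        self._objects = {}
        self.lookups = 0

    def add(self, oid: str, obj_type: str, payload: bytes) -> None:
        header = f"{obj_type} {len(payload)}".encode() + b"\0"
        self._objects[oid] = zlib.compress(header + payload)

    def oids(self):
        return list(self._objects)

    def open(self, oid: str) -> ObjectHandle:
        self.lookups += 1  # one lookup per handle
        return ObjectHandle(self._objects[oid])


def cat_blobs_single_lookup(store: ToyStore) -> dict:
    """Check the type and, only for blobs, read the contents via one handle."""
    blobs = {}
    for oid in store.oids():
        handle = store.open(oid)
        if handle.type == "blob":
            blobs[oid] = handle.read_content()
    return blobs


if __name__ == "__main__":
    store = ToyStore()
    store.add("a1", "blob", b"hello\n" * 1000)
    store.add("b2", "tree", b"\x00" * 40)
    store.add("c3", "blob", b"")
    blobs = cat_blobs_single_lookup(store)
    print(sorted(blobs), "lookups:", store.lookups)
```

With a type check and a content read done as two independent calls, each
blob would cost two lookups. With a handle, each object is located once
regardless of its type.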

Yes, the object type is stored in the header of a loose object file or
in the entry header of a packed object, so we have to read it to learn
the type. This is a question I have had for a while: why does git put
the type/size in the object data instead of storing it as some kind of
metadata elsewhere?

> So a nice thing about being able to do the filtering in one process is
> that we could _eventually_ do it all with one object lookup. But I'd
> probably wait on adding something like --type-filter until we have an
> internal single-lookup API, and then we could time it to see how much
> speedup we can get.
>

I am highly skeptical of this "internal single-lookup API". Do we really
need an extra metadata table to record all objects?
Something like: metadata: {oid: type, size}?

> -Peff

ZheNing Hu