Re: GSoC Git Proposal Draft - ZheNing Hu

From: ZheNing Hu <adlternative@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Git List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>,
	Eric Sunshine <sunshine@sunshineco.com>,
	Christian Couder <christian.couder@gmail.com>,
	Hariom verma <hariom18599@gmail.com>,
	Shourya Shukla <periperidip@gmail.com>,
	olyatelezhnaya@gmail.com
Subject: Re: GSoC Git Proposal Draft - ZheNing Hu
Date: Sat, 3 Apr 2021 22:27:39 +0800	[thread overview]
Message-ID: <CAOLTT8R_kmdNhJaMjj60H0SEzs6-KrzTQhBHzShQ82aoDa5vzw@mail.gmail.com> (raw)
In-Reply-To: <YGc6ybE1wD1ck0uB@coredump.intra.peff.net>

Hi, Peff,

Jeff King <peff@peff.net> 于2021年4月2日周五 下午11:39写道：
>
> On Fri, Apr 02, 2021 at 05:03:17PM +0800, ZheNing Hu wrote:
>
> > * Because part of the feature of `git for-each-ref` is very similar to
> > that of `git cat-file`, I think `git cat-file` can learn some feasible
> > solutions from it.
> >
> > #### My possible solutions:
> >
> > 1. Same [solution](https://github.com/git/git/pull/568/commits/cc40c464e813fc7a6bd93a01661646114d694d76)
> > as Olga, add member `struct ref_format format` in `struct
> > batch_options`.
> > 2. Use the function
> > [`verify_ref_format()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L904)
> > to replace the first `expand_format()` for parsing format strings.
> > 3. Write a function like
> > [`format_ref_array_item()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L2392),
> > get information about objects, and use `get_object()` to grub the
> > information which we prefer (or just use `grab_common_value()`).
> > 4. The migration of `%(rest)` may require learning the handling of
> > `%(if)` ,`%(else)`.
>
> I think one thing to keep an eye on here is the performance of cat-file.
> The formatting code used by for-each-ref is rather slow (it may load
> more of the object details than is necessary, it is too eager to
> allocate intermediate strings, and so on). That's usually not _too_ big
> a problem for ref-filter, because the number of refs tends to be much
> smaller than the number of total objects. But I'd expect that moving to
> the ref-filter code would make something like:
>
>  git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)'
>
> measurably slower.
>

Forgive me for thinking about the whole question too simple. It seems that
there are a lot of points to think about in this project.

> IMHO the right path forward is not to try porting cat-file to use the
> ref-filter code, but to start first with writing a universal formatting
> module that takes the best of both implementations (and the commit
> pretty-printer):
>
>   - separate the format-parsing stage from formatting actual items, as
>     ref-filter does. This lets us have more complex formats without
>     paying a per-item runtime cost while formatting. This should also
>     allow us to handle multiple syntaxes for the same thing (e.g.,
>     ref-filter %(authorname) vs pretty.c %an).
>

This is a good suggestion.
Olga seems to have wanted to remove `mark_query` in `struct expand_data`,
I think she also wanted to separate the two parts.

The ref-filter uses `used_atom` as the result of parsing `%(atom)`, It’s
really worth learning.

>   - figure out which data will be needed for each item based on the
>     parsed format, and then do the minimum amount of work to get that
>     data (using "oid_object_info_extended()" helps here, because it
>     likewise tries to do as little work as possible to satisfy the
>     request, but there are many elements that it doesn't know about)
>

I have indeed noticed that `oid_object_info_extended()`
can get information about the object which we actually want.
In `cat-file.c`, It has been used in `batch_object_write()`, and
`expanding_atom()` specify what data we need.
In `ref-filter.c`, It has been used in `get_object()`.
I am not sure what you mean about "many elements that it
doesn't know about", For the time being, `cat-file` can get 5
kind of objects info it need.

Maybe you think that `cat-file` can learn some features in
`ref-filter` to extend the function of `cat-file --batch`?
E.g. %(objectname:short)? I think I may have a better
understanding of the topic of this mini-project now.
We may not want to port the logic of cat-file,but to learn some
design in `ref-filter`, right?

>   - likewise avoid doing any intermediate work we can; as much as
>     possible, format the result directly into a result strbuf, rather
>     than allocating many sub-strings and assembling them (as cat-file
>     does).
>

I guess you mean `scratch` in `batch_object_write()`
every time new content is added after `strbuf_reset`,
but refilter just append messages to `final_buf`.

>   - handle formats where the necessary item data may or may not be
>     present. E.g., if we're given a refname, then "%(refname)" makes
>     sense. But in cat-file we'd not have a refname, and just an object.
>     We should still be able to use the same formatting code to handle
>     "%(objecttype)", etc. Likewise for formats which require a specific
>     type (say %(authorname) for a commit, but the object is a blob).
>     Ref-filter does this to some degree for things like authorname, but
>     we'd be extending it to the case that we don't even have a refname.
>

I may not have a very deep understanding of some details.
On this issue, I think we can use `info_source` to invalidate interfaces that
are not of the same type(only allow SOUCR_OTHER)

Let me summarize:
First part : Parsing any type of atoms, whether it is %an or %(authorname).
Second part : Find all functions that get objects information (which
`oid_object_info_extended()` can't get)
Third part : Optimize multiple small strings in cat-file into one `finnal_buf`.
Forth part : Error handling for unsuitable atoms.

Que:
These task sound like the logic of `ref-filter`, so if I can
participate in this project
later, should I start with optimizing the logic of `ref-filter`, right?

> -Peff

Thank you so much!

--
ZheNing Hu