git.vger.kernel.org archive mirror
* GSoC Git Proposal Draft - ZheNing Hu
@ 2021-04-02  9:03 ZheNing Hu
  2021-04-02 14:57 ` Christian Couder
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: ZheNing Hu @ 2021-04-02  9:03 UTC (permalink / raw)
  To: Git List
  Cc: Junio C Hamano, Eric Sunshine, Christian Couder, Hariom verma,
	Jeff King, Shourya Shukla, olyatelezhnaya, ZheNing Hu

Hello, Git,
I'm ZheNing Hu.
Here is my GSoC 2021 proposal draft.
A web version is available here:
https://docs.google.com/document/d/119k-Xa4CKOt5rC1gg1cqPr6H3MvdgTUizndJGAo1Erk/edit

Any comments and corrections are welcome :)

----8<----
## Use ref-filter formats in git cat-file

### About Me
| Name | ZheNing Hu |
| ---------- | ------------------------------------------ |
| Major | Computer Science And Technology |
| Mobile no. | +86 15058356458 |
| Email | adlternative@gmail.com |
| IRC | adlternative (on #git-devel/#git@freenode) |
| Github | https://github.com/adlternative/ |
| Blogs | https://adlternative.github.io/ |
| Time Zone | CST (UTC +08:00) |

### Education & Background
* I am currently a second-year student majoring in computer science and
technology at Xi'an University of Posts & Telecommunications (China).
* In my freshman year, I joined the XiYou Linux Group of the
university and learned how to use Git to submit my own code to GitHub.
I have learned C, C++, Python and shell over the past two years. I know
how to debug with gdb, and I am familiar with Linux system programming
and Linux network programming.
* I started studying the Git source code and contributing to Git
in December 2020.

### Me & Git
Around last November, I found a couple of projects
[build-your-own-git](https://github.com/danistefanovic/build-your-own-x#build-your-own-git)
on GitHub that taught me how to write a simple Git. The mechanics of Git
are very interesting:

1. There are four types of objects in Git: BLOB, TREE, COMMIT, TAG
2. Loose objects are stored at `.git/objects/sha1[0-1]/sha1[2-39]`,
with the SHA-1 value of the object used as the storage address.
3. All branches are just references to commits.
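
As a rough illustration of point 2, the sketch below computes the
SHA-1 object ID of a blob and the loose-object path derived from it.
It is not taken from the Git source; the helper names
`blob_object_id()` and `loose_object_path()` are made up for this
example, and it assumes a SHA-1 repository.

```python
import hashlib


def blob_object_id(data: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git hashes a header "<type> <size>\0" followed by the raw content.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def loose_object_path(oid: str) -> str:
    """Map an object ID to its loose-object path (hypothetical helper)."""
    # The first two hex digits name the directory, the remaining 38 the file.
    return f".git/objects/{oid[:2]}/{oid[2:]}"


if __name__ == "__main__":
    oid = blob_object_id(b"")  # the empty blob
    print(oid)
    print(loose_object_path(oid))
```

The file at that path holds the zlib-compressed header and content,
which this sketch does not write.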

Then I read `《Pro Git》` and Jiang Xin's `《Git Authoritative Guide》`,
and learned how to use most Git subcommands.

Later, I started studying the Git source code. I found that Git has
at least 200,000 lines of C code and 200,000 lines of shell script
code, which left me a little confused about where to start.

But after I submitted my first patch, many people in the Git
community gave me very enthusiastic guidance, which gave me the
courage to keep studying the Git source code. I then started making
my own contributions. You can find them here:
[gitgitgadget](https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Aadlternative+)
or
[git.kernel.org](https://git.kernel.org/pub/scm/git/git.git/log/?qt=grep&q=ZheNing+Hu)


These patches have been merged into the "master" branch:

#### [master]
* difftool.c: learn a new way start at specified file [(mail
list)](https://lore.kernel.org/git/pull.870.v6.git.1613739235241.gitgitgadget@gmail.com/)
* ls-files.c: add --deduplicate option
[(mail list)](https://lore.kernel.org/git/384f77a4c188456854bd86335e9bdc8018097a5f.1611485667.git.gitgitgadget@gmail.com/)
* ls_files.c: consolidate two for loops into one
[(mail list)](https://lore.kernel.org/git/f9d5e44d2c08b9e3d05a73b0a6e520ef7bb889c9.1611485667.git.gitgitgadget@gmail.com/)
* ls_files.c: bugfix for --deleted and --modified
[(mail list)](https://lore.kernel.org/git/8b02367a359e62d7721b9078ac8393a467d83724.1611485667.git.gitgitgadget@gmail.com/)
* builtin/*: update usage format
[(mail list)](https://lore.kernel.org/git/d3eb6dcff1468645560c16e1d8753002cbd7f143.1609944243.git.gitgitgadget@gmail.com/)

These patches are in the queue:

#### [next]

* format-patch: allow a non-integral version numbers
[(mail list)](https://lore.kernel.org/git/pull.885.v10.git.1616497946427.gitgitgadget@gmail.com/)
* [GSOC] commit: add --trailer option
[(mail list)](https://lore.kernel.org/git/pull.901.v14.git.1616507757999.gitgitgadget@gmail.com/)

#### [WIP]

* gitk: add right-click context menu for tags
[(mail list)](https://lore.kernel.org/git/pull.866.v5.git.1614227923637.gitgitgadget@gmail.com/)
* [GSOC] trailer: pass arg as positional parameter
[(mail list)](https://lore.kernel.org/git/5894d8c4b36466326b0427bfda0d6981e52a0907.1617185147.git.gitgitgadget@gmail.com/)

### Proposed Project

* Git used to have an old problem of duplicated implementations of
some logic. For example, Git had at least 4 different implementations
to format command output for different commands.

* `git cat-file` is a git subcommand used to see information about a git object.

* `git cat-file --batch` reads object names from stdin and prints each
object's information and contents to stdout. The only difference
between `--batch-check` and `--batch` is that `--batch-check` does not
print the contents of the object.
* `--batch-all-objects` shows all objects when combined with `--batch`
or `--batch-check`.
* `--batch-check` and `--batch` both accept a custom format that can
contain placeholders such as:
  * `%(objectname)`: The full hex object name (40 hex digits for SHA-1)
  * `%(objecttype)`: The object type: blob, tree, commit, or tag
  * `%(objectsize)`: The size of the object's content
  * `%(objectsize:disk)`: The size the object takes up on disk
  * `%(deltabase)`: If the object is stored as a delta, the object
    name of its delta base
  * `%(rest)`: The input line is split at the first whitespace; the
    part before it is treated as the object name, and the part after
    it is printed in place of `%(rest)`
* In the original design, `expand_format()` is called twice: first in
`batch_objects()` to parse the format string, and then in
`batch_object_write()` to format the object information into a string
buffer, whose contents are eventually printed to standard output.
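
The two-pass idea described above can be sketched as follows. This is
only an illustration in Python, not the C code in `builtin/cat-file.c`;
the names `needed_atoms()`, `expand()` and `split_input_line()` are
hypothetical, and the object information is passed in as a plain
dictionary instead of being read from the object database.

```python
import re

# Matches placeholders of the form %(name).
ATOM = re.compile(r"%\(([^)]+)\)")

KNOWN_ATOMS = {
    "objectname", "objecttype", "objectsize",
    "objectsize:disk", "deltabase", "rest",
}


def needed_atoms(fmt):
    """First pass: validate the format and record which atoms it uses."""
    atoms = ATOM.findall(fmt)
    for atom in atoms:
        if atom not in KNOWN_ATOMS:
            raise ValueError(f"unknown format element: {atom}")
    return set(atoms)


def expand(fmt, info):
    """Second pass: replace each placeholder with the object's value."""
    return ATOM.sub(lambda m: str(info[m.group(1)]), fmt)


def split_input_line(line):
    """Split an input line into (object name, rest) at the first whitespace."""
    parts = line.split(None, 1)
    name = parts[0] if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    return name, rest


if __name__ == "__main__":
    fmt = "%(objectname) %(objecttype) %(rest)"
    print(sorted(needed_atoms(fmt)))
    name, rest = split_input_line("HEAD\tsome note")
    # Made-up values standing in for a real object lookup.
    info = {"objectname": name, "objecttype": "commit", "rest": rest}
    print(expand(fmt, info))
```

Knowing the set of atoms before any object is read is what would let
the first pass decide which pieces of object information need to be
looked up at all.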


* [Olga](olyatelezhnaya@gmail.com) has worked on integrating
`ref-filter` logic into `cat-file`
[(link)](https://github.com/git/git/pull/568). The problems with her
patches at that time were:
  1. The patch series was too long, which made it difficult to adjust
  and merge.
  2. I don't think it's a good idea to use `struct ref_array_item`
  in place of `struct expand_data` to make `cat-file` fit the
  `ref-filter` logic, because the two structs are not closely related.
  [(link)](https://github.com/git/git/pull/568/commits/e0aafaa76476ba5528f84b794043531ebd4633c7#diff-d03110606a7ed8cb9832bbcc572f1093435cc6115c4e58d7a7750af3c33319a7R238)

* Because some features of `git for-each-ref` are very similar to
those of `git cat-file`, I think `git cat-file` can borrow some
workable solutions from it.

#### My possible solutions:

1. Same [solution](https://github.com/git/git/pull/568/commits/cc40c464e813fc7a6bd93a01661646114d694d76)
as Olga: add a member `struct ref_format format` to
`struct batch_options`.
2. Use the function
[`verify_ref_format()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L904)
to replace the first `expand_format()` call for parsing format strings.
3. Write a function like
[`format_ref_array_item()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L2392)
to get information about objects, and use `get_object()` to grab the
information we need (or just use `grab_common_values()`).
4. The migration of `%(rest)` may require learning from the handling of
`%(if)` and `%(else)`.

### Are you applying for other Projects?

No, Git is the only one.

### Blogging about Git

In fact, while studying the Git source code, I often write
[blog posts](https://adlternative.github.io/tags/git/) to record what
I have learned; this helps me recall things after I forget them. Most
of the posts so far were written in Chinese, but during GSoC, I
promise all my posts will be written in English.

### TimeLine
* May 18 ~ June 8
  * Look for a scheme to make `git cat-file` and `ref-filter` more
    compatible, and start the integration attempt.
  * *Stretch Goal*: move `%(objectsize)`, `%(objecttype)`, `%(objectname)`.

* June 8 ~ July 8
  * Move the main body of `git cat-file` over to the `ref-filter`
    logic, and complete the basic functionality.
  * *Stretch Goal*: move `%(deltabase)`, `%(objectsize:disk)`, `%(rest)`.

* July 8 ~ August 17
  * Analyze the performance of ref-filter and try to reduce the
    cost of its heavy string matching. If I have some spare time,
    I could also work on other interesting patches.
  * *Stretch Goal*: Optimize ref-filter performance.

### Availability
My exams are expected to end in June, but the time before the finals
when I have no classes, as well as the summer vacation afterwards, is
basically my self-study time. Although I am taking many other
courses, I have enough time and energy to complete daily tasks. I
stay active on the Git mailing list; you can reach me at any time as
long as I am not sleeping. :)


### Post GSoC
* I love the open source philosophy, and I am willing to spread the
spirit of openness and freedom and to research technology with
like-minded people.
* In my contact with the Git community over the past few months, many
people gave me great encouragement. I hope I can keep my passion for
Git alive, contribute my own code, and pass this cool thing on.
* I am willing to keep contributing code to the Git community for a
long time after the end of GSoC.
* I hope the Git community can give me a chance to participate in
GSoC. I sincerely thank GSoC and the Git community!


* Re: GSoC Git Proposal Draft - ZheNing Hu
  2021-04-02  9:03 GSoC Git Proposal Draft - ZheNing Hu ZheNing Hu
@ 2021-04-02 14:57 ` Christian Couder
  2021-04-03 13:23   ` ZheNing Hu
  2021-04-02 15:39 ` Jeff King
  2021-04-11  6:11 ` ZheNing Hu
  2 siblings, 1 reply; 11+ messages in thread
From: Christian Couder @ 2021-04-02 14:57 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Git List, Junio C Hamano, Eric Sunshine, Hariom verma, Jeff King,
	Shourya Shukla, Оля Тележная

Hi,

On Fri, Apr 2, 2021 at 11:03 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> Hello, Git,
> I'm ZheNing Hu,
> Here is my GSoC 2021 Proposal draft.
> And website version is there :
> https://docs.google.com/document/d/119k-Xa4CKOt5rC1gg1cqPr6H3MvdgTUizndJGAo1Erk/edit
>
> Welcome any Comments and Correct :)

Thanks!

> ----8<----
> ## Use ref-filter formats in git cat-file
>
> ### About Me
> | Name | ZheNing Hu |
> | ---------- | ------------------------------------------ |
> | Major | Computer Science And Technology |
> | Mobile no. | +86 15058356458 |
> | Email | adlternative@gmail.com |
> | IRC | adlternative (on #git-devel/#git@freenode) |
> | Github | https://github.com/adlternative/ |
> | Blogs | https://adlternative.github.io/ |
> | Time Zone | CST (UTC +08:00) |
>
> ### Education & Background
> * I am currently a 2nd Year Student majoring in computer science and
> technology in Xi'an University of Posts & Telecommunications (China).
> * In my freshman year, I joined the XiYou Linux Group of the
> university and learned how to use Git to submit my own code to GitHub.
> I have learned C, C++, Python and shell in two years, I know how to
> use gdb debugging, and I am familiar with relevant knowledge of Linux
> System Programming and Linux Network Programming.
> * I started learning Git source code and made contributions to Git
> from December of 2020.
>
> ### Me & Git
> Around last November, I found a couple of projects
> [build-your-own-git](https://github.com/danistefanovic/build-your-own-x#build-your-own-git)
> on GitHub teaching me how to write a simple git, the mechanics of Git
> are very interesting:
>
> 1. There are four types of objects in Git: BLOB, TREE, COMMIT, TAG
> 2. The (loose)objects are stored in `.git/object/sha1[0-1]/sha1[2-39]`
> with the sha1 value of the data as the storage address.
> 3. All branches are just references to commits.
>
> Then I read`《Pro Git》`and Jiang Xin's `《Git Authoritative Guide》`,
> learned the use of most Git subcommands.
>
> Later, I started learning some of the Git source code, I found Git has
> at least 200,000 lines of C code and 200,000 lines of shell script
> code, which leaves me a little confused about where to start.
>
> But then, after I submitted my first patch, a lot of people in the Git
> community came over and gave me very enthusiastic guidance, which gave
> me the courage to learn the Git source code, and then I started making
> my own contributions, You can find them here:
> [gitgitgadget](https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Aadlternative+)
> or
> [git.kernel.org](https://git.kernel.org/pub/scm/git/git.git/log/?qt=grep&q=ZheNing+Hu)
>
>
> These patches have been merged into the "master" branch:
>
> #### [master]
> * difftool.c: learn a new way start at specified file [(mail
> list)](https://lore.kernel.org/git/pull.870.v6.git.1613739235241.gitgitgadget@gmail.com/)
> * ls-files.c: add --deduplicate option
> [(mail list)](https://lore.kernel.org/git/384f77a4c188456854bd86335e9bdc8018097a5f.1611485667.git.gitgitgadget@gmail.com/)
> * ls_files.c: consolidate two for loops into one
> [(mail list)](https://lore.kernel.org/git/f9d5e44d2c08b9e3d05a73b0a6e520ef7bb889c9.1611485667.git.gitgitgadget@gmail.com/)
> * ls_files.c: bugfix for --deleted and --modified
> [(mail list)](https://lore.kernel.org/git/8b02367a359e62d7721b9078ac8393a467d83724.1611485667.git.gitgitgadget@gmail.com/)
> * builtin/*: update usage format
> [(mail list)](https://lore.kernel.org/git/d3eb6dcff1468645560c16e1d8753002cbd7f143.1609944243.git.gitgitgadget@gmail.com/)
>
> And These patches are in the queue:
>
> #### [next]
>
> * format-patch: allow a non-integral version numbers
> [(mail list)](https://lore.kernel.org/git/pull.885.v10.git.1616497946427.gitgitgadget@gmail.com/)
> * [GSOC] commit: add --trailer option
> [(mail list)](https://lore.kernel.org/git/pull.901.v14.git.1616507757999.gitgitgadget@gmail.com/)
>
> #### [WIP]
>
> * gitk: add right-click context menu for tags
> [(mail list)](https://lore.kernel.org/git/pull.866.v5.git.1614227923637.gitgitgadget@gmail.com/)
> * [GSOC] trailer: pass arg as positional parameter
> [(mail list)](https://lore.kernel.org/git/5894d8c4b36466326b0427bfda0d6981e52a0907.1617185147.git.gitgitgadget@gmail.com/)

Great!

> ### Proposed Project
>
> * Git used to have an old problem of duplicated implementations of
> some logic. For example, Git had at least 4 different implementations
> to format command output for different commands.

What's the current status? Which implementations have been merged
together since that time?

> * `git cat-file` is a git subcommand used to see information about a git object.
>
> * `git cat-file --batch` can print object information and contents on
> stdin.

It reads from stdin and prints on stdout.

> The only difference between `--batch-check` and `--batch` is
> that `--batch-check` does not print the contents of the object.
> * `--batch-all-objects` will show all objects with `--batch` or `--batch-check`.
> * `--batch-check` and `--batch` both accept formatted strings:

It might be better to say that they accept a custom format that can
have placeholders like the following:

> * `%(objectname)`: 40-bit SHA1 string of Git object

Git is being worked on to be able to use SHA-256 as well as SHA1.

> * `%(objecttype)`: Object Type blob,tree,commit,tag
> * `%(objectsize)`: Size of the object's content
> * `%(objectsize:disk)`: The size of the object itself on disk
> * `%(delatbase)`: If the object is stored incrementally in Git,

s/delatbase/deltabase/

> Returns the SHA1 string for its delabase

s/delabase/deltabase/

Also see above about SHA1 and SHA256.

> * `%(rest)`: Anything before the space and TAB in the input
> line is treated as an object, and anything after
> that will be printed as usual

In general it's ok to copy some parts of the doc if they are important
for your proposal, as long as you say that they come from the doc. It's
also ok to rephrase parts of it, to adapt them or to make sure you
understand them, though.

> * In the original design, the first time use `expand_format()` in
> `batch_objects()` is to parsing formatted messages, the second time

s/parsing/to parse/

I am not sure what you call "formatted messages".

> use `expand_format()` in `batch_object_write()` is to format the
> object information and store it in a string buffer, eventually the
> contents of this buffer will be printed to standard output.
>
>
> * [Olga](olyatelezhnaya@gmail.com) have been involved in integrating
> `ref-filter` logic into `cat-file`
> [(link)](https://github.com/git/git/pull/568), the problem with her
> patches at that time:
> 1. Too long patch series, difficult to adjust and merge.
> 2. I don't think it's a good idea for her to use `struct
> ref_array_item` instead of `struct expand_data` for `cat-file` to fit
> `ref-filter` logic, because `struct ref_array_item` and `struct
> expand_data` are not very related.
> [(link)](https://github.com/git/git/pull/568/commits/e0aafaa76476ba5528f84b794043531ebd4633c7#diff-d03110606a7ed8cb9832bbcc572f1093435cc6115c4e58d7a7750af3c33319a7R238)

Olga also sent patch series to the mailing list. Could you find them
and tell what happened to them?

Also Hariom Verma worked on a related project recently. Could you talk
a bit about it?

> * Because part of the feature of `git for-each-ref` is very similar to
> that of `git cat-file`, I think `git cat-file` can learn some feasible
> solutions from it.
>
> #### My possible solutions:
>
> 1. Same [solution](https://github.com/git/git/pull/568/commits/cc40c464e813fc7a6bd93a01661646114d694d76)
> as Olga, add member `struct ref_format format` in `struct
> batch_options`.
> 2. Use the function
> [`verify_ref_format()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L904)
> to replace the first `expand_format()` for parsing format strings.
> 3. Write a function like
> [`format_ref_array_item()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L2392),
> get information about objects, and use `get_object()` to grub the
> information which we prefer (or just use `grab_common_value()`).
> 4. The migration of `%(rest)` may require learning the handling of
> `%(if)` ,`%(else)`.

I will look at this later.

> ### Are you applying for other Projects?
>
> No, Git is the only one.
>
> ### Blogging about Git
>
> In fact, while I am studying Git source code, I often write some
> [blogs](https://adlternative.github.io/tags/git/) to record my
> learning content, this helps me to recall some content after
> forgetting it. Most of the blogs were written in Chinese previously,
> but during the GSoC, I promise all my blogs will be written in
> English.
>
> ### TimeLine
> * May 18 ~ June 8
> * Look for a scheme to make `git cat-file` and `ref-filter` more
> compatible, and start the integration attempt.
> * *Stretch Goal*: move `%(objectsize)`,`%(objecttype)`,`%(objectname)` .
>
> * June 8 ~ July 8
> * Move the body of the `git cat-file` attempt to the `ref-filter`
> logic, complete the basic function realization.
> * *Stretch Goal*: move `%(deltabase)`,`%(objectsize:disk)`,`%(rest)` .
>
> * July 8 ~ August 17
> * Analyze the performance of ref-filter and try to reduce the
> performance cost of a lot of string matching. I thought if I had some
> spare time, I could work on some other interesting patches.
> * *Stretch Goal*: Optimize ref-filter performance.

I will also look at the timeline later.

> ### Availability
> My exam is expected to end in June, but the time I don't have classes
> before the final exam, as well as the summer vacation after that, is
> basically my self-learning time. Although I am studying many other
> courses, I have enough time and energy to complete daily tasks. I'm
> staying active on the Git mailing list, you can find me at any time as
> long as I am not sleeping. :)
>
>
> ### Post GSoC
> * I love open source philosophy, willing to spread the spirit of
> openness, freedom and willing to research technology with like-minded
> people.
> * In my previous contact with the Git community in the past few
> months, many people in the Git community gave me great encouragement.
> I hope I can keep my passion for Git alive, contribute my own code,
> and pass this cool thing on.
> * I am willing to contribute code to the Git community for a long time
> after the end of GSoC.
> * I hope the Git community can give me a chance to participate in
> GSoC. I sincerely thank GSoC and the Git community!

Thanks!


* Re: GSoC Git Proposal Draft - ZheNing Hu
  2021-04-02  9:03 GSoC Git Proposal Draft - ZheNing Hu ZheNing Hu
  2021-04-02 14:57 ` Christian Couder
@ 2021-04-02 15:39 ` Jeff King
  2021-04-03 14:27   ` ZheNing Hu
  2021-04-11  6:11 ` ZheNing Hu
  2 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2021-04-02 15:39 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Git List, Junio C Hamano, Eric Sunshine, Christian Couder,
	Hariom verma, Shourya Shukla, olyatelezhnaya

On Fri, Apr 02, 2021 at 05:03:17PM +0800, ZheNing Hu wrote:

> * Because part of the feature of `git for-each-ref` is very similar to
> that of `git cat-file`, I think `git cat-file` can learn some feasible
> solutions from it.
> 
> #### My possible solutions:
> 
> 1. Same [solution](https://github.com/git/git/pull/568/commits/cc40c464e813fc7a6bd93a01661646114d694d76)
> as Olga, add member `struct ref_format format` in `struct
> batch_options`.
> 2. Use the function
> [`verify_ref_format()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L904)
> to replace the first `expand_format()` for parsing format strings.
> 3. Write a function like
> [`format_ref_array_item()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L2392),
> get information about objects, and use `get_object()` to grub the
> information which we prefer (or just use `grab_common_value()`).
> 4. The migration of `%(rest)` may require learning the handling of
> `%(if)` ,`%(else)`.

I think one thing to keep an eye on here is the performance of cat-file.
The formatting code used by for-each-ref is rather slow (it may load
more of the object details than is necessary, it is too eager to
allocate intermediate strings, and so on). That's usually not _too_ big
a problem for ref-filter, because the number of refs tends to be much
smaller than the number of total objects. But I'd expect that moving to
the ref-filter code would make something like:

 git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)'

measurably slower.

IMHO the right path forward is not to try porting cat-file to use the
ref-filter code, but to start first with writing a universal formatting
module that takes the best of both implementations (and the commit
pretty-printer):

  - separate the format-parsing stage from formatting actual items, as
    ref-filter does. This lets us have more complex formats without
    paying a per-item runtime cost while formatting. This should also
    allow us to handle multiple syntaxes for the same thing (e.g.,
    ref-filter %(authorname) vs pretty.c %an).

  - figure out which data will be needed for each item based on the
    parsed format, and then do the minimum amount of work to get that
    data (using "oid_object_info_extended()" helps here, because it
    likewise tries to do as little work as possible to satisfy the
    request, but there are many elements that it doesn't know about)

  - likewise avoid doing any intermediate work we can; as much as
    possible, format the result directly into a result strbuf, rather
    than allocating many sub-strings and assembling them (as cat-file
    does).

  - handle formats where the necessary item data may or may not be
    present. E.g., if we're given a refname, then "%(refname)" makes
    sense. But in cat-file we'd not have a refname, and just an object.
    We should still be able to use the same formatting code to handle
    "%(objecttype)", etc. Likewise for formats which require a specific
    type (say %(authorname) for a commit, but the object is a blob).
    Ref-filter does this to some degree for things like authorname, but
    we'd be extending it to the case that we don't even have a refname.
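
To make the points above a bit more concrete, here is a minimal sketch
of the "parse once, format many items" shape, written in Python rather
than Git's C. Nothing in it is an existing Git API: `compile_format()`,
`format_item()` and the `loaders` table are hypothetical names, and
expanding an atom with no available data to an empty string is an
assumption made for the example, not a settled design decision.

```python
import re

ATOM = re.compile(r"%\(([^)]+)\)")


def compile_format(fmt):
    """Parse the format once into (segments, set of atoms it needs)."""
    segments, needed, pos = [], set(), 0
    for m in ATOM.finditer(fmt):
        if m.start() > pos:
            segments.append(("literal", fmt[pos:m.start()]))
        segments.append(("atom", m.group(1)))
        needed.add(m.group(1))
        pos = m.end()
    if pos < len(fmt):
        segments.append(("literal", fmt[pos:]))
    return segments, needed


def format_item(segments, needed, loaders, item):
    """Format one item, loading only the data the parsed format asks for."""
    values = {}
    for name in needed:
        loader = loaders.get(name)
        values[name] = loader(item) if loader else None
    out = []  # stands in for a single result strbuf
    for kind, text in segments:
        if kind == "literal":
            out.append(text)
        else:
            # Assumption: data that is not available expands to "".
            out.append(values[text] or "")
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical loaders; a real implementation would query the object store.
    loaders = {
        "objectname": lambda it: it["oid"],
        "objecttype": lambda it: it["type"],
        "refname": lambda it: it.get("refname"),
    }
    segments, needed = compile_format("%(objectname) %(objecttype) %(refname)")
    # An object with no ref attached, as cat-file would see it.
    print(format_item(segments, needed, loaders, {"oid": "abc123", "type": "blob"}))
```

The per-item loop does no parsing; it walks the precomputed segments
and calls only the loaders named in `needed`.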

-Peff


* Re: GSoC Git Proposal Draft - ZheNing Hu
  2021-04-02 14:57 ` Christian Couder
@ 2021-04-03 13:23   ` ZheNing Hu
  0 siblings, 0 replies; 11+ messages in thread
From: ZheNing Hu @ 2021-04-03 13:23 UTC (permalink / raw)
  To: Christian Couder
  Cc: Git List, Junio C Hamano, Eric Sunshine, Hariom verma, Jeff King,
	Shourya Shukla, Оля Тележная

Hi, Christian,

Christian Couder <christian.couder@gmail.com> wrote on Fri, Apr 2, 2021 at 10:57 PM:
>
> Hi,
>
> On Fri, Apr 2, 2021 at 11:03 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > Hello, Git,
> > I'm ZheNing Hu,
> > Here is my GSoC 2021 Proposal draft.
> > And website version is there :
> > https://docs.google.com/document/d/119k-Xa4CKOt5rC1gg1cqPr6H3MvdgTUizndJGAo1Erk/edit
> >
> > Welcome any Comments and Correct :)
>
> Thanks!
>
> > ----8<----
> > ## Use ref-filter formats in git cat-file
> >
> > ### About Me
> > | Name | ZheNing Hu |
> > | ---------- | ------------------------------------------ |
> > | Major | Computer Science And Technology |
> > | Mobile no. | +86 15058356458 |
> > | Email | adlternative@gmail.com |
> > | IRC | adlternative (on #git-devel/#git@freenode) |
> > | Github | https://github.com/adlternative/ |
> > | Blogs | https://adlternative.github.io/ |
> > | Time Zone | CST (UTC +08:00) |
> >
> > ### Education & Background
> > * I am currently a 2nd Year Student majoring in computer science and
> > technology in Xi'an University of Posts & Telecommunications (China).
> > * In my freshman year, I joined the XiYou Linux Group of the
> > university and learned how to use Git to submit my own code to GitHub.
> > I have learned C, C++, Python and shell in two years, I know how to
> > use gdb debugging, and I am familiar with relevant knowledge of Linux
> > System Programming and Linux Network Programming.
> > * I started learning Git source code and made contributions to Git
> > from December of 2020.
> >
> > ### Me & Git
> > Around last November, I found a couple of projects
> > [build-your-own-git](https://github.com/danistefanovic/build-your-own-x#build-your-own-git)
> > on GitHub teaching me how to write a simple git, the mechanics of Git
> > are very interesting:
> >
> > 1. There are four types of objects in Git: BLOB, TREE, COMMIT, TAG
> > 2. The (loose)objects are stored in `.git/object/sha1[0-1]/sha1[2-39]`
> > with the sha1 value of the data as the storage address.
> > 3. All branches are just references to commits.
> >
> > Then I read`《Pro Git》`and Jiang Xin's `《Git Authoritative Guide》`,
> > learned the use of most Git subcommands.
> >
> > Later, I started learning some of the Git source code, I found Git has
> > at least 200,000 lines of C code and 200,000 lines of shell script
> > code, which leaves me a little confused about where to start.
> >
> > But then, after I submitted my first patch, a lot of people in the Git
> > community came over and gave me very enthusiastic guidance, which gave
> > me the courage to learn the Git source code, and then I started making
> > my own contributions, You can find them here:
> > [gitgitgadget](https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Aadlternative+)
> > or
> > [git.kernel.org](https://git.kernel.org/pub/scm/git/git.git/log/?qt=grep&q=ZheNing+Hu)
> >
> >
> > These patches have been merged into the "master" branch:
> >
> > #### [master]
> > * difftool.c: learn a new way start at specified file [(mail
> > list)](https://lore.kernel.org/git/pull.870.v6.git.1613739235241.gitgitgadget@gmail.com/)
> > * ls-files.c: add --deduplicate option
> > [(mail list)](https://lore.kernel.org/git/384f77a4c188456854bd86335e9bdc8018097a5f.1611485667.git.gitgitgadget@gmail.com/)
> > * ls_files.c: consolidate two for loops into one
> > [(mail list)](https://lore.kernel.org/git/f9d5e44d2c08b9e3d05a73b0a6e520ef7bb889c9.1611485667.git.gitgitgadget@gmail.com/)
> > * ls_files.c: bugfix for --deleted and --modified
> > [(mail list)](https://lore.kernel.org/git/8b02367a359e62d7721b9078ac8393a467d83724.1611485667.git.gitgitgadget@gmail.com/)
> > * builtin/*: update usage format
> > [(mail list)](https://lore.kernel.org/git/d3eb6dcff1468645560c16e1d8753002cbd7f143.1609944243.git.gitgitgadget@gmail.com/)
> >
> > And These patches are in the queue:
> >
> > #### [next]
> >
> > * format-patch: allow a non-integral version numbers
> > [(mail list)](https://lore.kernel.org/git/pull.885.v10.git.1616497946427.gitgitgadget@gmail.com/)
> > * [GSOC] commit: add --trailer option
> > [(mail list)](https://lore.kernel.org/git/pull.901.v14.git.1616507757999.gitgitgadget@gmail.com/)
> >
> > #### [WIP]
> >
> > * gitk: add right-click context menu for tags
> > [(mail list)](https://lore.kernel.org/git/pull.866.v5.git.1614227923637.gitgitgadget@gmail.com/)
> > * [GSOC] trailer: pass arg as positional parameter
> > [(mail list)](https://lore.kernel.org/git/5894d8c4b36466326b0427bfda0d6981e52a0907.1617185147.git.gitgitgadget@gmail.com/)
>
> Great!
>
> > ### Proposed Project
> >
> > * Git used to have an old problem of duplicated implementations of
> > some logic. For example, Git had at least 4 different implementations
> > to format command output for different commands.
>
> What's the current status? Which implementations have been merged
> together since that time?
>

Currently, `git cat-file` uses `expand_format()` for format parsing,
`git for-each-ref` uses `format_ref_array_item()`, and
`git log --pretty` uses `format_commit_one()`. There may be more.

As far as I understand now, the following `cat-file` atoms already
have corresponding implementations in `ref-filter`:

%(objectsize) %(objecttype) %(objectname) %(deltabase) %(objectsize:disk)

`cat-file --batch` has an implicit %(contents), which is already
implemented in `ref-filter`.

So all of them can now be used by `git for-each-ref`.

At the same time, some of the features in 'pretty.c' can also be found
in 'ref-filter.c':

`--pretty=%s`  to %(subject)
`--pretty=%f`  to %(subject:sanitized)
`--pretty=%aN` to %(authorname)
`--pretty=%b`  to %(body)
...
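
As a small illustration of the correspondence listed above, the sketch
below rewrites a `--pretty` style format into the `ref-filter` syntax.
It covers only the four placeholders from the list; the table
`PRETTY_TO_REF_FILTER` and the function `to_ref_filter()` are names
made up for this example and do not exist in Git.

```python
import re

# Only the pairs listed above; other placeholders are left untouched.
PRETTY_TO_REF_FILTER = {
    "%s": "%(subject)",
    "%f": "%(subject:sanitized)",
    "%aN": "%(authorname)",
    "%b": "%(body)",
}

# Try longer placeholders first so "%aN" is matched as one unit.
_PATTERN = re.compile(
    "|".join(
        re.escape(key)
        for key in sorted(PRETTY_TO_REF_FILTER, key=len, reverse=True)
    )
)


def to_ref_filter(pretty_format):
    """Rewrite known pretty placeholders into ref-filter atoms."""
    return _PATTERN.sub(lambda m: PRETTY_TO_REF_FILTER[m.group(0)], pretty_format)


if __name__ == "__main__":
    print(to_ref_filter("%aN: %s"))
```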

On the other hand, after Olga's solution was rejected, `git cat-file`
did not directly use the logic in 'ref-filter'. So we can now see two
similar 'struct expand_data' definitions in 'cat-file.c' and
'ref-filter.c'. But Olga still made many useful changes in ref-filter,
such as making `grab_common_values()` support a variety of different
atoms.

> > * `git cat-file` is a git subcommand used to see information about a git object.
> >
> > * `git cat-file --batch` can print object information and contents on
> > stdin.
>
> It reads from stdin and prints on stdout.
>
> > The only difference between `--batch-check` and `--batch` is
> > that `--batch-check` does not print the contents of the object.
> > * `--batch-all-objects` will show all objects with `--batch` or `--batch-check`.
> > * `--batch-check` and `--batch` both accept formatted strings:
>
> It might be better to say that they accept a custom format that can
> have placeholders like the following:
>
> > * `%(objectname)`: 40-bit SHA1 string of Git object
>
> Git is being worked on to be able to use SHA-256 as well as SHA1.
>

Yes, one of my classmates was worried about the security of Git's use
of SHA1, and I told him Git is already making changes.

> > * `%(objecttype)`: Object Type blob,tree,commit,tag
> > * `%(objectsize)`: Size of the object's content
> > * `%(objectsize:disk)`: The size of the object itself on disk
> > * `%(delatbase)`: If the object is stored incrementally in Git,
>
> s/delatbase/deltabase/
>
> > Returns the SHA1 string for its delabase
>
> s/delabase/deltabase/
>

Thanks for the corrections above.

> Also see above about SHA1 and SHA256.
>
> > * `%(rest)`: Anything before the space and TAB in the input
> > line is treated as an object, and anything after
> > that will be printed as usual
>
> In general it's ok to copy some parts of the doc if they are important
> for your proposal as long as you say that it comes from the doc. It's
> also ok with rephrasing parts of it, to adapt them or make sure you
> understand them though.
>

Maybe it would be better to use the descriptions from the documentation.

> > * In the original design, the first time use `expand_format()` in
> > `batch_objects()` is to parsing formatted messages, the second time
>
> s/parsing/to parse/
>
> I am not sure what you call "formatted messages".
>

I'm not good at expressing myself. As you say, it's a custom format that
can have placeholders such as '%(atom)'.

> > use `expand_format()` in `batch_object_write()` is to format the
> > object information and store it in a string buffer, eventually the
> > contents of this buffer will be printed to standard output.
> >
> >
> > * [Olga](olyatelezhnaya@gmail.com) have been involved in integrating
> > `ref-filter` logic into `cat-file`
> > [(link)](https://github.com/git/git/pull/568), the problem with her
> > patches at that time:
> > 1. Too long patch series, difficult to adjust and merge.
> > 2. I don't think it's a good idea for her to use `struct
> > ref_array_item` instead of `struct expand_data` for `cat-file` to fit
> > `ref-filter` logic, because `struct ref_array_item` and `struct
> > expand_data` are not very related.
> > [(link)](https://github.com/git/git/pull/568/commits/e0aafaa76476ba5528f84b794043531ebd4633c7#diff-d03110606a7ed8cb9832bbcc572f1093435cc6115c4e58d7a7750af3c33319a7R238)
>
> Olga also sent patch series to the mailing list. Could you find them
> and tell what happened to them?
>

Peff tested the performance of Olga's `cat-file`. The performance of
`cat-file` appears to be degraded by using the logic of ref-filter,
because ref-filter is "very eager to allocate lots of separate strings".
[(link)](https://lore.kernel.org/git/20190228214112.GK12723@sigill.intra.peff.net/)

Olga added %(rest) to `for-each-ref`.
Peff said he is not sure that for-each-ref should support %(rest).
[(link)](https://lore.kernel.org/git/20190228211122.GD12723@sigill.intra.peff.net/)

%(rest) does not seem useful for `for-each-ref`.
Peff thinks we should add some option to ref-filter to enable/disable
placeholders like "%(rest)" in places where they are not needed at all.
[(link)](https://lore.kernel.org/git/20190228210753.GC12723@sigill.intra.peff.net/)

Olga made `struct expand_data` global and put it in ref-filter.h.
Peff said `struct expand_data` may need a more descriptive name in the
global namespace.
[(link)](https://lore.kernel.org/git/20190228213015.GI12723@sigill.intra.peff.net/)

Regarding `mark_query`,
Peff thinks `split_on_whitespace` or `mark_query` can be deleted from
`struct expand_data` right away.
[(link)](https://lore.kernel.org/git/20190228212540.GF12723@sigill.intra.peff.net/)

> Also Hariom Verma worked on a related project recently. Could you talk
> a bit about it?
>

Hariom's work is to reuse the `ref-filter` logic in pretty.c|h.
I admit I might have neglected to look at his patches, but it seems
that he also once proposed
a "pretty-lib.c|h" for using ref-filter features.
[(link)](https://public-inbox.org/git/a83270485be2bebb1ce77be55ff73d136b735922.1592218662.git.gitgitgadget@gmail.com/)
I may need more time to check what's going on here.

> > * Because part of the feature of `git for-each-ref` is very similar to
> > that of `git cat-file`, I think `git cat-file` can learn some feasible
> > solutions from it.
> >
> > #### My possible solutions:
> >
> > 1. Same [solution](https://github.com/git/git/pull/568/commits/cc40c464e813fc7a6bd93a01661646114d694d76)
> > as Olga, add member `struct ref_format format` in `struct
> > batch_options`.
> > 2. Use the function
> > [`verify_ref_format()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L904)
> > to replace the first `expand_format()` for parsing format strings.
> > 3. Write a function like
> > [`format_ref_array_item()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L2392),
> > get information about objects, and use `get_object()` to grub the
> > information which we prefer (or just use `grab_common_value()`).
> > 4. The migration of `%(rest)` may require learning the handling of
> > `%(if)` ,`%(else)`.
>
> I will look at this later.
>
> > ### Are you applying for other Projects?
> >
> > No, Git is the only one.
> >
> > ### Blogging about Git
> >
> > In fact, while I am studying Git source code, I often write some
> > [blogs](https://adlternative.github.io/tags/git/) to record my
> > learning content, this helps me to recall some content after
> > forgetting it. Most of the blogs were written in Chinese previously,
> > but during the GSoC, I promise all my blogs will be written in
> > English.
> >
> > ### TimeLine
> > * May 18 ~ June 8
> > * Look for a scheme to make `git cat-file` and `ref-filter` more
> > compatible, and start the integration attempt.
> > * *Stretch Goal*: move `%(objectsize)`,`%(objecttype)`,`%(objectname)` .
> >
> > * June 8 ~ July 8
> > * Move the body of the `git cat-file` attempt to the `ref-filter`
> > logic, complete the basic function realization.
> > * *Stretch Goal*: move `%(deltabase)`,`%(objectsize:disk)`,`%(rest)` .
> >
> > * July 8 ~ August 17
> > * Analyze the performance of ref-filter and try to reduce the
> > performance cost of a lot of string matching. I thought if I had some
> > spare time, I could work on some other interesting patches.
> > * *Stretch Goal*: Optimize ref-filter performance.
>
> I will also look at the timeline later.
>
> > ### Availability
> > My exam is expected to end in June, but the time I don't have classes
> > before the final exam, as well as the summer vacation after that, is
> > basically my self-learning time. Although I am studying many other
> > courses, I have enough time and energy to complete daily tasks. I'm
> > staying active on the Git mailing list, you can find me at any time as
> > long as I am not sleeping. :)
> >
> >
> > ### Post GSoC
> > * I love open source philosophy, willing to spread the spirit of
> > openness, freedom and willing to research technology with like-minded
> > people.
> > * In my previous contact with the Git community in the past few
> > months, many people in the Git community gave me great encouragement.
> > I hope I can keep my passion for Git alive, contribute my own code,
> > and pass this cool thing on.
> > * I am willing to contribute code to the Git community for a long time
> > after the end of GSoC.
> > * I hope the Git community can give me a chance to participate in
> > GSoC. I sincerely thank GSoC and the Git community!
>
> Thanks!

Thanks :)


* Re: GSoC Git Proposal Draft - ZheNing Hu
  2021-04-02 15:39 ` Jeff King
@ 2021-04-03 14:27   ` ZheNing Hu
  2021-04-07 19:28     ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: ZheNing Hu @ 2021-04-03 14:27 UTC (permalink / raw)
  To: Jeff King
  Cc: Git List, Junio C Hamano, Eric Sunshine, Christian Couder,
	Hariom verma, Shourya Shukla, olyatelezhnaya

Hi, Peff,

Jeff King <peff@peff.net> wrote on Fri, Apr 2, 2021 at 11:39 PM:
>
> On Fri, Apr 02, 2021 at 05:03:17PM +0800, ZheNing Hu wrote:
>
> > * Because part of the feature of `git for-each-ref` is very similar to
> > that of `git cat-file`, I think `git cat-file` can learn some feasible
> > solutions from it.
> >
> > #### My possible solutions:
> >
> > 1. Same [solution](https://github.com/git/git/pull/568/commits/cc40c464e813fc7a6bd93a01661646114d694d76)
> > as Olga, add member `struct ref_format format` in `struct
> > batch_options`.
> > 2. Use the function
> > [`verify_ref_format()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L904)
> > to replace the first `expand_format()` for parsing format strings.
> > 3. Write a function like
> > [`format_ref_array_item()`](https://github.com/gitgitgadget/git/blob/84d06cdc06389ae7c462434cb7b1db0980f63860/ref-filter.c#L2392),
> > get information about objects, and use `get_object()` to grub the
> > information which we prefer (or just use `grab_common_value()`).
> > 4. The migration of `%(rest)` may require learning the handling of
> > `%(if)` ,`%(else)`.
>
> I think one thing to keep an eye on here is the performance of cat-file.
> The formatting code used by for-each-ref is rather slow (it may load
> more of the object details than is necessary, it is too eager to
> allocate intermediate strings, and so on). That's usually not _too_ big
> a problem for ref-filter, because the number of refs tends to be much
> smaller than the number of total objects. But I'd expect that moving to
> the ref-filter code would make something like:
>
>  git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)'
>
> measurably slower.
>

Forgive me for thinking about the whole question too simply. It seems that
there are a lot of points to think about in this project.

> IMHO the right path forward is not to try porting cat-file to use the
> ref-filter code, but to start first with writing a universal formatting
> module that takes the best of both implementations (and the commit
> pretty-printer):
>
>   - separate the format-parsing stage from formatting actual items, as
>     ref-filter does. This lets us have more complex formats without
>     paying a per-item runtime cost while formatting. This should also
>     allow us to handle multiple syntaxes for the same thing (e.g.,
>     ref-filter %(authorname) vs pretty.c %an).
>

This is a good suggestion.
Olga seems to have wanted to remove `mark_query` from `struct expand_data`;
I think she also wanted to separate the two parts.

ref-filter uses `used_atom` to hold the result of parsing `%(atom)`. It is
really worth learning from.

>   - figure out which data will be needed for each item based on the
>     parsed format, and then do the minimum amount of work to get that
>     data (using "oid_object_info_extended()" helps here, because it
>     likewise tries to do as little work as possible to satisfy the
>     request, but there are many elements that it doesn't know about)
>

I have indeed noticed that `oid_object_info_extended()`
can get just the information about the object that we actually want.
In `cat-file.c`, it is used in `batch_object_write()`, and
`expand_atom()` specifies what data we need.
In `ref-filter.c`, it is used in `get_object()`.
I am not sure what you mean by "many elements that it
doesn't know about". For the time being, `cat-file` can get the 5
kinds of object info it needs.

Maybe you think that `cat-file` can learn some features from
`ref-filter` to extend the functionality of `cat-file --batch`,
e.g. %(objectname:short)? I think I may have a better
understanding of the topic of this mini-project now.
We may not want to port the logic of cat-file, but rather to learn from
some of the design in `ref-filter`, right?

>   - likewise avoid doing any intermediate work we can; as much as
>     possible, format the result directly into a result strbuf, rather
>     than allocating many sub-strings and assembling them (as cat-file
>     does).
>

I guess you mean that `scratch` in `batch_object_write()` is refilled
with new content after `strbuf_reset()` every time,
while ref-filter just appends messages to `final_buf`.

>   - handle formats where the necessary item data may or may not be
>     present. E.g., if we're given a refname, then "%(refname)" makes
>     sense. But in cat-file we'd not have a refname, and just an object.
>     We should still be able to use the same formatting code to handle
>     "%(objecttype)", etc. Likewise for formats which require a specific
>     type (say %(authorname) for a commit, but the object is a blob).
>     Ref-filter does this to some degree for things like authorname, but
>     we'd be extending it to the case that we don't even have a refname.
>

I may not have a very deep understanding of some details.
On this issue, I think we can use `info_source` to reject atoms that
are not of the right source type (only allow SOURCE_OTHER).

Let me summarize:
First part: Parse any type of atom, whether it is %an or %(authorname).
Second part: Find all functions that get object information (which
`oid_object_info_extended()` can't get).
Third part: Merge the multiple small strings in cat-file into one `final_buf`.
Fourth part: Error handling for unsuitable atoms.

Question:
These tasks sound like the logic of `ref-filter`, so if I can
participate in this project later, should I start by optimizing the
logic of `ref-filter`?

> -Peff

Thank you so much!

--
ZheNing Hu


* Re: GSoC Git Proposal Draft - ZheNing Hu
  2021-04-03 14:27   ` ZheNing Hu
@ 2021-04-07 19:28     ` Jeff King
  2021-04-08 13:29       ` ZheNing Hu
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2021-04-07 19:28 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Git List, Junio C Hamano, Eric Sunshine, Christian Couder,
	Hariom verma, Shourya Shukla, olyatelezhnaya

On Sat, Apr 03, 2021 at 10:27:39PM +0800, ZheNing Hu wrote:

> >   - figure out which data will be needed for each item based on the
> >     parsed format, and then do the minimum amount of work to get that
> >     data (using "oid_object_info_extended()" helps here, because it
> >     likewise tries to do as little work as possible to satisfy the
> >     request, but there are many elements that it doesn't know about)
> >
> 
> I have indeed noticed that `oid_object_info_extended()`
> can get information about the object which we actually want.
> In `cat-file.c`, It has been used in `batch_object_write()`, and
> `expanding_atom()` specify what data we need.
> In `ref-filter.c`, It has been used in `get_object()`.
> I am not sure what you mean about "many elements that it
> doesn't know about", For the time being, `cat-file` can get 5
> kind of objects info it need.

I think there are things one might want to format that
oid_object_info_extended() does not know about. For example, if you are
asking about %(authorname), it can't provide that. But we want to do as
little work as possible to satisfy the request. So for example, with the
format "%(objectsize)", we'd prefer _not_ to load the contents of each
object, and just ask oid_object_info_extended() for the size. But if we
are asked for "%(authorname)", we know we'll have to read and parse the
object contents.

So this notion of "figure out the least amount of work" will have to be
part of the format code (and ref-filter and the pretty.c formatters do
make an attempt at this; I'm saying that a universal formatter will want
to keep this behavior).
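
As a rough illustration of this idea (a toy Python model, not Git's C code), a formatter could decide once, from the parsed atoms, whether object contents have to be read at all. The sets `CHEAP_ATOMS` and `CONTENT_ATOMS` and the function `plan_work()` are made-up names for this sketch, and the atom lists are deliberately incomplete.

```python
import re

# Atoms assumed answerable from object metadata alone (illustrative subset).
CHEAP_ATOMS = {"objectname", "objecttype", "objectsize",
               "objectsize:disk", "deltabase"}
# Atoms assumed to require reading and parsing the object contents.
CONTENT_ATOMS = {"authorname", "subject", "body", "contents"}

def plan_work(fmt):
    """Return (atoms, need_contents) for a %(atom)-style format string."""
    atoms = re.findall(r"%\(([^)]+)\)", fmt)
    unknown = [a for a in atoms
               if a not in CHEAP_ATOMS and a not in CONTENT_ATOMS]
    if unknown:
        raise ValueError("unknown atom(s): " + ", ".join(unknown))
    # One content-dependent atom is enough to force a full object read.
    need_contents = any(a in CONTENT_ATOMS for a in atoms)
    return atoms, need_contents

print(plan_work("%(objectname) %(objectsize)"))
print(plan_work("%(objectname) %(authorname)"))
```

The point of the sketch is that the decision is made once per format, not once per object, so a format made only of cheap atoms never pays for parsing.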

> Maybe you think that `cat-file` can learn some features in
> `ref-filter` to extend the function of `cat-file --batch`?
> E.g. %(objectname:short)? I think I may have a better
> understanding of the topic of this mini-project now.
> We may not want to port the logic of cat-file,but to learn some
> design in `ref-filter`, right?

Yes, I think the goal is for all of the commands that allow format
specifiers to support the same set (at least where it makes sense;
obviously you cannot ask for %(refname) in cat-file).

And IMHO the best way to do that is to write a new universal formatting
API that takes the best parts from all of the existing ones. It _could_
also be done by choosing ref-filter as the best implementation, slowly
teaching it formats the other commands know (which is what Olga had
started with), and then cleaning up any performance deficiencies. But I
think that last part would actually be easier when starting from scratch
(e.g., I think it would help to actually produce an abstract syntax tree
of the parsed format, and then walk that tree to fill in the values).
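
To make the "parse once, then walk" idea concrete, here is a minimal Python sketch under stated assumptions: `Literal`, `Atom`, `parse_format()` and `render()` are hypothetical names, and the "tree" is only a flat list of nodes. A real design would need nested nodes for constructs such as `%(if)`...`%(end)`, which this sketch does not attempt.

```python
from dataclasses import dataclass

@dataclass
class Literal:
    text: str

@dataclass
class Atom:
    name: str

def parse_format(fmt):
    """Parse a %(atom)-style format once into a flat list of nodes."""
    nodes = []
    pos = 0
    while pos < len(fmt):
        start = fmt.find("%(", pos)
        if start < 0:
            # No more atoms: the remainder is literal text.
            nodes.append(Literal(fmt[pos:]))
            break
        if start > pos:
            nodes.append(Literal(fmt[pos:start]))
        end = fmt.find(")", start)
        if end < 0:
            raise ValueError("unterminated atom in format")
        nodes.append(Atom(fmt[start + 2:end]))
        pos = end + 1
    return nodes

def render(nodes, values):
    """Walk the parsed nodes and fill in values for one item."""
    parts = []
    for node in nodes:
        if isinstance(node, Literal):
            parts.append(node.text)
        else:
            parts.append(str(values[node.name]))
    return "".join(parts)

# Parse once, then reuse the node list for every item.
nodes = parse_format("%(objectname) is a %(objecttype)")
for item in ({"objectname": "abc123", "objecttype": "blob"},
             {"objectname": "def456", "objecttype": "tree"}):
    print(render(nodes, item))
```

The object names and values in the usage loop are placeholders, not real object IDs.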

-Peff


* Re: GSoC Git Proposal Draft - ZheNing Hu
  2021-04-07 19:28     ` Jeff King
@ 2021-04-08 13:29       ` ZheNing Hu
  0 siblings, 0 replies; 11+ messages in thread
From: ZheNing Hu @ 2021-04-08 13:29 UTC (permalink / raw)
  To: Jeff King
  Cc: Git List, Junio C Hamano, Eric Sunshine, Christian Couder,
	Hariom verma, Shourya Shukla,
	Оля
	Тележная

Jeff King <peff@peff.net> wrote on Thu, Apr 8, 2021 at 3:28 AM:
>
> On Sat, Apr 03, 2021 at 10:27:39PM +0800, ZheNing Hu wrote:
>
> > >   - figure out which data will be needed for each item based on the
> > >     parsed format, and then do the minimum amount of work to get that
> > >     data (using "oid_object_info_extended()" helps here, because it
> > >     likewise tries to do as little work as possible to satisfy the
> > >     request, but there are many elements that it doesn't know about)
> > >
> >
> > I have indeed noticed that `oid_object_info_extended()`
> > can get information about the object which we actually want.
> > In `cat-file.c`, It has been used in `batch_object_write()`, and
> > `expanding_atom()` specify what data we need.
> > In `ref-filter.c`, It has been used in `get_object()`.
> > I am not sure what you mean about "many elements that it
> > doesn't know about", For the time being, `cat-file` can get 5
> > kind of objects info it need.
>
> I think there are things one might want to format that
> oid_object_info_extended() does not know about. For example, if you are
> asking about %(authorname), it can't provide that. But we want to do as
> little work as possible to satisfy the request. So for example, with the
> format "%(objectsize)", we'd prefer _not_ to load the contents of each
> object, and just ask oid_object_info_extended() for the size. But if we
> are asked for "%(authorname)", we know we'll have to read and parse the
> object contents.
>

OK, I understand it now. `%(authorname)` needs to grab info from the object
content, so the content must be parsed. If we want to let cat-file learn
`%(authorname)`,
it takes extra work to extract it from the object.

> So this notion of "figure out the least amount of work" will have to be
> part of the format code (and ref-filter and the pretty.c formatters do
> make an attempt at this; I'm saying that a universal formatter will want
> to keep this behavior).
>

You're right. %(tree), %(parent) ... rely on commit object info, and
%(tagger), %(taggername) ... rely on tag object info. But for
something like %(objectsize) or %(objectname), we do not need
to parse the content of the objects. In future work we should also
keep avoiding the parsing of info that the format does not depend on.

> > Maybe you think that `cat-file` can learn some features in
> > `ref-filter` to extend the function of `cat-file --batch`?
> > E.g. %(objectname:short)? I think I may have a better
> > understanding of the topic of this mini-project now.
> > We may not want to port the logic of cat-file,but to learn some
> > design in `ref-filter`, right?
>
> Yes, I think the goal is for all of the commands that allow format
> specifiers to support the same set (at least where it makes sense;
> obviously you cannot ask for %(refname) in cat-file).
>

The future new API may need to deny such access.

> And IMHO the best way to do that is to write a new universal formatting
> API that takes the best parts from all of the existing ones. It _could_
> also be done by choosing ref-filter as the best implementation, slowly
> teaching it formats the other commands know (which is what Olga had
> started with), and then cleaning up any performance deficiencies. But I
> think that last part would actually be easier when starting from scratch
> (e.g., I think it would help to actually produce an abstract syntax tree
> of the parsed format, and then walk that tree to fill in the values).
>
> -Peff

This is the unification of "%an" and "%(authorname)" that you mentioned last time.
I think Olga and Hariom might have done similar things:
calling `ref-filter` results in slower speed.

And you said we might refactor this into an abstract syntax tree. This is
a good idea, and it may be a big project. In particular, some background
in compiler principles is required, and we may also need to deal with
each different atom carefully.

Thanks.
--
ZheNing Hu


* Re: GSoC Git Proposal Draft - ZheNing Hu
  2021-04-02  9:03 GSoC Git Proposal Draft - ZheNing Hu ZheNing Hu
  2021-04-02 14:57 ` Christian Couder
  2021-04-02 15:39 ` Jeff King
@ 2021-04-11  6:11 ` ZheNing Hu
  2021-04-11 15:34   ` ZheNing Hu
  2 siblings, 1 reply; 11+ messages in thread
From: ZheNing Hu @ 2021-04-11  6:11 UTC (permalink / raw)
  To: Git List
  Cc: Junio C Hamano, Eric Sunshine, Christian Couder, Hariom verma,
	Jeff King, Shourya Shukla,
	Оля
	Тележная,
	Jiang Xin

Here is my GSoC 2021 Proposal draft v2.
The web version is here:
https://docs.google.com/document/d/119k-Xa4CKOt5rC1gg1cqPr6H3MvdgTUizndJGAo1Erk/edit
Any comments and corrections are welcome :)

-------8<---------
Use ref-filter formats in git cat-file

About Me

Name:       ZheNing Hu
Major:      Computer Science And Technology
Mobile no.: +86 15058356458
Email:      adlternative@gmail.com
IRC:        adlternative (on #git-devel/#git@freenode)
Github:     https://github.com/adlternative/
Blogs:      https://adlternative.github.io/
Time Zone:  CST (UTC +08:00)

Education & Background

I am currently a 2nd-year student majoring in
computer science and technology at Xi'an University
of Posts & Telecommunications (China). In my freshman
year, I joined the XiYou Linux Group of the university
and learned how to use Git to submit my own code to GitHub.
I have learned C, C++, Python and shell in two years.
I know how to debug with gdb, and I am familiar with
Linux system programming and Linux network programming.
I started learning the Git source code
and have been contributing to Git since December 2020.


Me & Git

Around last November, I found a couple of projects on
GitHub[1] that teach how to write a simple Git. The mechanics
of Git are very interesting:
1. There are four types of objects in Git: blob, tree, commit, tag.
2. Loose objects (with the SHA-1 hash algorithm) are stored in
".git/objects/sha1[0-1]/sha1[2-39]", with the SHA-1 value of the object
data as the storage address.
3. All branches are just references to commits.
...
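
Point 2 can be checked with a few lines of Python. This is a minimal sketch, not Git's implementation: `loose_object_path()` is a hypothetical helper that only computes the object name and the path where a loose object would live. It relies on the fact that Git hashes a "<type> <size>\0" header together with the data; the file stored at that path is zlib-compressed, which the sketch does not do.

```python
import hashlib

def loose_object_path(data, obj_type="blob"):
    """Return (hex object name, loose-object path) for raw object data."""
    # Git hashes a header "<type> <size>\0" followed by the data.
    header = f"{obj_type} {len(data)}".encode() + b"\0"
    name = hashlib.sha1(header + data).hexdigest()
    # First two hex digits name the directory, the remaining 38 the file.
    return name, f".git/objects/{name[:2]}/{name[2:]}"

name, path = loose_object_path(b"hello world\n")
print(name)
print(path)
```

The printed name should match what `echo 'hello world' | git hash-object --stdin` reports in a SHA-1 repository.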
Then I read "Pro Git" and Jiang Xin's "Git Authoritative Guide",
and learned how to use most Git subcommands.
Later, I started learning some of the Git source code. I found that Git has
at least 200,000 lines of C code and 200,000 lines of shell script code,
which left me a little confused about where to start. But then, after I
submitted my first patch, a lot of people in the Git community came over and
gave me very enthusiastic guidance, which gave me the courage to learn the
Git source code, and then I started making my own contributions. You can find
them here: [2][3]
These patches are in the Git master branch:
[master]
difftool.c: learn a new way start at specified file
https://lore.kernel.org/git/pull.870.v6.git.1613739235241.gitgitgadget@gmail.com/

ls-files.c: add --deduplicate option
https://lore.kernel.org/git/384f77a4c188456854bd86335e9bdc8018097a5f.1611485667.git.gitgitgadget@gmail.com/

ls_files.c: consolidate two for loops into one
https://lore.kernel.org/git/f9d5e44d2c08b9e3d05a73b0a6e520ef7bb889c9.1611485667.git.gitgitgadget@gmail.com/

ls_files.c: bugfix for --deleted and --modified
https://lore.kernel.org/git/8b02367a359e62d7721b9078ac8393a467d83724.1611485667.git.gitgitgadget@gmail.com/

builtin/*: update usage format
https://lore.kernel.org/git/d3eb6dcff1468645560c16e1d8753002cbd7f143.1609944243.git.gitgitgadget@gmail.com/

format-patch: allow a non-integral version numbers
https://lore.kernel.org/git/pull.885.v10.git.1616497946427.gitgitgadget@gmail.com/

[GSOC] commit: add --trailer option
https://lore.kernel.org/git/pull.901.v14.git.1616507757999.gitgitgadget@gmail.com/

And these patches are still in progress:
[wip]
gitk: add right-click context menu for tags
https://lore.kernel.org/git/pull.866.v5.git.1614227923637.gitgitgadget@gmail.com/

[GSOC] trailer: add new .cmd config option
https://lore.kernel.org/git/3dc8983a47020fb417bb8c6c3d835e609b13c155.1617975462.git.gitgitgadget@gmail.com/

[GSOC] docs: correct descript of trailer.<token>.command
https://lore.kernel.org/git/505903811df83cf26f4dd70c5b811dde169896a2.1617975462.git.gitgitgadget@gmail.com/

[GSOC] ref-filter: get rid of show_ref_array_item
https://lore.kernel.org/git/pull.927.v2.git.1617809209164.gitgitgadget@gmail.com/

Proposed Project

Current situation
Git has long had the problem of duplicated implementations of some logic.
For example, Git had at least 4 different implementations to format command
output for different commands, e.g. `git cat-file --batch="%(objectname)"`,
`git log --pretty="%aN"`, `git for-each-ref --format="%(refname)"`.

Which implementations have been merged together?

2018 ~ 2019
Olga Telezhnaia: Reuse ref-filter formatting logic in `git cat-file`
Olga integrated some `git cat-file` logic into `ref-filter`; now almost
all format atoms of `git cat-file` are available in `git for-each-ref`,
e.g. `git for-each-ref --format="%(objectsize:disk) %(deltabase)"`.

2020 ~ 2021
Hariom Verma: Unify ref-filter formats with other --pretty formats
Hariom migrated some of the '--pretty' logic to 'ref-filter',
e.g. `git for-each-ref --format="%(trailers:key=Signed-off-by)"`
or `git for-each-ref --format="%(subject:sanitize)"`.

What's git cat-file?
`git cat-file` is a Git subcommand used to see information about a
Git object.
`git cat-file --batch` reads object names from stdin and prints each
object's information and contents to stdout.
`git cat-file --batch-check` reads object names from stdin and prints each
object's information to stdout.
`--batch-all-objects` shows all objects in the Git repo
when used with `--batch` or `--batch-check`.
`--batch-check` and `--batch` both accept a custom format that can
have placeholders like the following (taken from the documentation[4]):

%(objectname)      The full hex representation of the object name.
%(objecttype)      The type of the object.
%(objectsize)      The size, in bytes, of the object.
%(objectsize:disk) The size, in bytes, that the object takes up on disk.
%(deltabase)       If the object is stored as a delta on-disk, this expands to
                   the full hex representation of the delta base object name.
                   Otherwise, expands to the null OID (all zeroes).
%(rest)            If this atom is used in the output string, input lines are
                   split at the first whitespace boundary. All characters before
                   that whitespace are considered to be the object name; characters
                   after that first run of whitespace (i.e., the "rest" of the line)
                   are output in place of the `%(rest)` atom.
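
The `%(rest)` rule is easy to get subtly wrong, so here is a small Python sketch of the splitting as the documentation describes it. `split_input_line()` is a hypothetical helper for illustration, not a Git function, and treating only space and TAB as whitespace is an assumption of this sketch.

```python
import re

def split_input_line(line, use_rest):
    """Split one input line into (object name, rest)."""
    line = line.rstrip("\n")
    if not use_rest:
        # Without %(rest) the whole line is taken as the object name.
        return line, ""
    # Find the first run of spaces/TABs; everything after it is "rest".
    m = re.search(r"[ \t]+", line)
    if m is None:
        return line, ""
    return line[:m.start()], line[m.end():]

print(split_input_line("HEAD:Makefile  some note\n", use_rest=True))
print(split_input_line("HEAD:Makefile  some note\n", use_rest=False))
```

Note that only the first run of whitespace is consumed; later whitespace stays part of the rest.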

What's the original design of git cat-file --batch?
1. The first call to `expand_format()`, in `batch_objects()`, parses the
format atoms; this determines what data we need to capture.
2. Read the object name from standard input, and use it to get the
object's oid via `get_oid_with_context()`.
3. In `batch_object_write()`, `oid_object_info_extended()` obtains
the object information that we need.
4. The second call to `expand_format()`, in `batch_object_write()`,
formats the actual items and stores the result in a string buffer; eventually
the contents of this buffer are printed to standard output.
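
The four steps above can be modelled in a few lines of Python. This is a toy model under stated assumptions, not a translation of `builtin/cat-file.c`: the class `ExpandData`, the `lookup()` function and its fake data are invented for the sketch, and only the idea of running the same expansion routine twice (once to mark what is needed, once to print) is taken from the description.

```python
import re

ATOM_RE = re.compile(r"%\(([^)]+)\)")

class ExpandData:
    """Toy stand-in for cat-file's per-object state (names are made up)."""
    def __init__(self):
        self.mark_query = False   # True during the first, "what is needed" pass
        self.wanted = set()       # atoms requested by the format
        self.info = {}            # values filled in for the current object

def expand_format(fmt, data):
    """Expand fmt; in mark_query mode only record which atoms are used."""
    def expand_atom(match):
        atom = match.group(1)
        if data.mark_query:
            data.wanted.add(atom)
            return ""
        return str(data.info[atom])
    return ATOM_RE.sub(expand_atom, fmt)

def lookup(name, wanted):
    """Hypothetical object lookup that returns only the requested fields."""
    fake_db = {"HEAD": {"objectname": "abc123", "objecttype": "commit",
                        "objectsize": 245}}
    return {k: v for k, v in fake_db[name].items() if k in wanted}

fmt = "%(objectname) %(objecttype)"
data = ExpandData()

# Pass 1: run the format once to learn which atoms it needs.
data.mark_query = True
expand_format(fmt, data)
data.mark_query = False

# Pass 2: per object, fetch only those fields and expand for real.
for name in ["HEAD"]:
    data.info = lookup(name, data.wanted)
    print(expand_format(fmt, data))
```

In this model the format string is re-scanned for every object in pass 2, which is exactly the per-item cost that a separate parsing stage would avoid.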


What are the disadvantages of git cat-file --batch?
The atom format-parsing stage and the stage that formats actual items are not
separated yet. This limits the ability of `git cat-file --batch` to
support richer formats like `git for-each-ref` or `git log --pretty`.


Why was Olga's solution rejected?
1. Olga's solution was to let `git cat-file` use the `ref-filter` interface;
the performance of `cat-file` appeared to degrade because `ref-filter` is
"very eager to allocate lots of separate strings", among other reasons.
2. Olga then tried optimizing `ref-filter`, but the
performance of `git cat-file` was still not as good as before.
3. The patch series was too long, which made it difficult to adjust and merge.
4. Is "%(rest)" worth migrating? "%(rest)" is for `git cat-file --batch`,
which reads from standard input: anything after the first whitespace on
each line is printed in place of the atom. This is quite unnecessary for
`git for-each-ref`, which does not read standard input.


My possible solution
1. Analyze how to get the data that `oid_object_info_extended()` can't
provide directly, and analyze the minimum amount of data required for each
step of atom format parsing.
2. Find a uniform way to parse formats such as `%an` in `log --pretty`
and `%(authorname)` in `ref-filter` (it might learn something from
`git config`, or could try abstract syntax trees for parsing format atoms).
3. Apply the new interface to `git cat-file`, and then we could add
richer options to `git cat-file`.
4. (Optional optimization) Change the strbuf allocation strategy of `ref-filter`:
use a single strbuf for the output of all refs. This improves performance by
reducing the overhead of allocating large numbers of small strbufs.
5. (Optional optimization) In addition, if we migrate `cat-file` to `ref-filter`
only after improving the performance of `ref-filter`, we need to isolate
the atoms that are not applicable to `cat-file`. For example, `refname` is
not useful for `git cat-file`; we can exit the program with `die()` or just
print an error message.
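
The isolation described in point 5 could look roughly like the following Python sketch. Everything here is an assumption made for illustration: `NEEDS_REF`, `NEEDS_OBJECT`, `FormatError` and `verify_format()` are hypothetical names, the atom sets are incomplete, and raising an exception stands in for calling `die()`.

```python
import re

# Hypothetical classification of atoms by the input they need.
NEEDS_REF = {"refname", "upstream", "HEAD"}
NEEDS_OBJECT = {"objectname", "objecttype", "objectsize", "deltabase"}

class FormatError(Exception):
    """Stands in for die() in this sketch."""

def verify_format(fmt, have_ref):
    """Check every atom in fmt against what the calling command can provide."""
    # Capture the atom name and ignore an optional ":modifier" suffix.
    atoms = re.findall(r"%\(([^):]+)(?::[^)]*)?\)", fmt)
    for atom in atoms:
        if atom in NEEDS_REF and not have_ref:
            raise FormatError(f"%({atom}) is not available without a ref")
        if atom not in NEEDS_REF and atom not in NEEDS_OBJECT:
            raise FormatError(f"unknown atom %({atom})")
    return atoms

# A cat-file-like caller has objects but no refs.
print(verify_format("%(objectname) %(objectsize:disk)", have_ref=False))
try:
    verify_format("%(refname)", have_ref=False)
except FormatError as err:
    print("error:", err)
```

Doing this check once, when the format is parsed, keeps the per-object path free of validity tests.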

Are you applying for other Projects?

No, Git is the only one.

Blogging about Git

In fact, while I am studying Git source code, I often write some
blogs[5] to record my learning content; this helps me to recall some
content after forgetting it. Most of the blogs were written in Chinese
previously, but during the GSoC, I promise all my blogs will be written
in English.


Time Line

May 18 ~ June 18
1. Learn the details of atom format parsing in `ref-filter.c` and `pretty.c`.
Think about how to parse the two different atom syntaxes in a unified way.
(For example, how can we use abstract syntax trees to organize different atoms?)
2. Analyze and optimize `ref-filter` performance.
3. Discuss a reasonable solution for uniform format parsing with the mentors,
and then start coding it.
June 18 ~ July 18
1. Continue to integrate the atom format parsing and apply it to `pretty.c` and
`ref-filter.c`.
2. Make sure that the performance of `git for-each-ref` and `git log --pretty`
under the new format parsing interface is better than before.
July 18 ~ August 17
1. Let `git cat-file` use the new and better format parsing interface.
2. Support more options for `cat-file --batch` and ensure isolation from
atoms that do not apply to it.

Availability

I have plenty of time before and after my final exam, and I have enough
energy to complete daily tasks. I'm staying active on the Git mailing
list; you can find me at any time as long as I am not sleeping. :)

Post GSoC

I love the open source philosophy, and I am willing to spread the spirit of
openness and freedom and to research technology with like-minded people.
In my contact with the Git community over the past few months, many people
in the Git community gave me great encouragement. I hope I can keep my
passion for Git alive, contribute my own code, and pass this cool thing on.
I am willing to contribute code to the Git community for a long time after
the end of GSoC.
I hope the Git community can give me a chance to participate in GSoC.
I sincerely thank GSoC and the Git community!


________________
[1] https://github.com/danistefanovic/build-your-own-x#build-your-own-git
[2] https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Aadlternative+
[3] https://git.kernel.org/pub/scm/git/git.git/log/?qt=grep&q=ZheNing+Hu
[4] https://github.com/gitgitgadget/git/blob/89b43f80a514aee58b662ad606e6352e03eaeee4/Documentation/git-cat-file.txt#L189
[5] https://adlternative.github.io/tags/git/

* Re: GSoC Git Proposal Draft - ZheNing Hu
  2021-04-11  6:11 ` ZheNing Hu
@ 2021-04-11 15:34   ` ZheNing Hu
  2021-04-13  6:40     ` Jeff King
  0 siblings, 1 reply; 11+ messages in thread
From: ZheNing Hu @ 2021-04-11 15:34 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Eric Sunshine, Christian Couder, Hariom verma,
	Shourya Shukla, Оля Тележная, Jiang Xin, Git List

Hi, Peff,

> Why is Olga’s solution rejected?
> 1. Olga's solution is to let `git cat-file` use the `ref-filter` interface,
> the performance of `cat-file` appears to be degraded due "very eager to
> allocate lots of separate strings" in `ref-filter` and other reasons.

I have been thinking today about whether we can append some object information
directly to `&state->stack->output`, instead of assigning it to `v->s` first.

But `cmp_ref_sorting()` uses `get_ref_atom_value()`, so it may compare the
`v->s` of two different refs, and I would still have to fill the object info into `v->s`.

So I think this is one of the reasons why `ref-filter` needs to allocate a
large number of strings, right?
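
To make the two styles concrete, here is a minimal sketch. It does not use
Git's real code: `struct outbuf` is a hypothetical stand-in for `strbuf`, and
`value_as_string()` / `append_value()` are illustrative names only. The first
function allocates a separate string per value (the `v->s` style); the second
writes straight into a buffer that is reused from one object to the next.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for Git's strbuf: a growable, reusable buffer. */
struct outbuf {
	char *buf;
	size_t len, alloc;
};

static void outbuf_add(struct outbuf *ob, const char *s)
{
	size_t n = strlen(s);

	assert(ob != NULL);
	if (ob->len + n + 1 > ob->alloc) {
		size_t want = (ob->len + n + 1) * 2;
		char *p = realloc(ob->buf, want);

		if (!p)
			abort();
		ob->buf = p;
		ob->alloc = want;
	}
	memcpy(ob->buf + ob->len, s, n + 1); /* copies the trailing NUL too */
	ob->len += n;
}

/* Forget the contents but keep the memory for the next object. */
static void outbuf_reset(struct outbuf *ob)
{
	ob->len = 0;
	if (ob->buf)
		ob->buf[0] = '\0';
}

/* Eager style: each value becomes its own heap string (caller frees). */
static char *value_as_string(const char *type, unsigned long size)
{
	int n = snprintf(NULL, 0, "%s %lu", type, size);
	char *s;

	if (n < 0)
		abort();
	s = malloc((size_t)n + 1);
	if (!s)
		abort();
	snprintf(s, (size_t)n + 1, "%s %lu", type, size);
	return s;
}

/* Direct style: format straight into the shared output buffer. */
static void append_value(struct outbuf *ob, const char *type,
			 unsigned long size)
{
	char num[32];

	snprintf(num, sizeof(num), "%lu", size);
	outbuf_add(ob, type);
	outbuf_add(ob, " ");
	outbuf_add(ob, num);
}
```

With the direct style, a loop over many objects can call `outbuf_reset()`
between objects and usually never allocates again, whereas the eager style
pays one `malloc()` and one `free()` per value.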

--
ZheNing Hu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: GSoC Git Proposal Draft - ZheNing Hu
  2021-04-11 15:34   ` ZheNing Hu
@ 2021-04-13  6:40     ` Jeff King
  2021-04-13 14:51       ` ZheNing Hu
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff King @ 2021-04-13  6:40 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Junio C Hamano, Eric Sunshine, Christian Couder, Hariom verma,
	Shourya Shukla,
	Оля
	Тележная,
	Jiang Xin, Git List

On Sun, Apr 11, 2021 at 11:34:35PM +0800, ZheNing Hu wrote:

> > Why is Olga’s solution rejected?
> > 1. Olga's solution is to let `git cat-file` use the `ref-filter` interface,
> > the performance of `cat-file` appears to be degraded due "very eager to
> > allocate lots of separate strings" in `ref-filter` and other reasons.
> 
> I have been thinking today about whether we can append some object information
> directly to `&state->stack->output`, instead of assigning it to `v->s` first.

Yes, that's the direction I think we'd want to go.

> But `cmp_ref_sorting()` uses `get_ref_atom_value()`, so it may compare the
> `v->s` of two different refs, and I would still have to fill the object info into `v->s`.
> 
> So I think this is one of the reasons why `ref-filter` needs to allocate a
> large number of strings, right?

Yeah, I think sorting in general is a bit tricky, because it inherently
requires collecting the value for each item. Just thinking about what
properties an ideal solution would have (which we might not be able to
get all of):

  - if we're sorting by something numeric (e.g., a committer
    timestamp), we should avoid forming it into a string at all

  - if the sort item requires work to extract that overlaps with the
    output format (e.g., sorting by authordate and showing author name
    in the format, both of which require parsing the author ident line
    of a commit), ideally we'd just do that work once per ref/object.

  - if we are sorting, obviously we have to hold some amount of data for
    each item in memory all at once (since we have to get data on the
    sort properties for each, and then sort the result). So we'd
    probably need at least some allocation per ref anyway, and an extra
    string isn't too bad. But if we're not sorting, then it would be
    nice to consider one ref/object at a time, which lets us keep our
    peak memory usage lower, reuse output buffers, etc.

I think some of those are in competition with each other. Minimizing
work shared between the sorting and format steps means keeping more data
in memory. So it might be sensible to just treat them totally
independently, and not worry about sharing work (I haven't looked at how
ref-filter does this now).  TBH, I care a lot less about making the
"sorting" case fast than I do about making sure that if we _aren't_
sorting, we go as fast as possible.
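
As an illustration of the first point only, here is a minimal sketch of
sorting on a numeric key without ever formatting it as a string. `struct item`,
`cmp_by_time()` and `sort_by_time()` are hypothetical names for this example;
Git's real `ref_array_item` and `cmp_ref_sorting()` look different.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-ref record, not Git's real ref_array_item. */
struct item {
	const char *refname;
	long long committer_time; /* seconds since the epoch, kept numeric */
};

static int cmp_by_time(const void *a, const void *b)
{
	const struct item *x = a, *y = b;

	/* Compare as integers; plain subtraction could overflow. */
	if (x->committer_time < y->committer_time)
		return -1;
	if (x->committer_time > y->committer_time)
		return 1;
	return 0;
}

static void sort_by_time(struct item *items, size_t nr)
{
	assert(items != NULL || nr == 0);
	if (nr > 1)
		qsort(items, nr, sizeof(*items), cmp_by_time);
}
```

The timestamp would only be turned into text later, and only if the output
format actually asks for it.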

-Peff

* Re: GSoC Git Proposal Draft - ZheNing Hu
  2021-04-13  6:40     ` Jeff King
@ 2021-04-13 14:51       ` ZheNing Hu
  0 siblings, 0 replies; 11+ messages in thread
From: ZheNing Hu @ 2021-04-13 14:51 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Eric Sunshine, Christian Couder, Hariom verma,
	Shourya Shukla,
	Оля
	Тележная,
	Jiang Xin, Git List

Jeff King <peff@peff.net> wrote on Tue, Apr 13, 2021 at 2:40 PM:
>
> On Sun, Apr 11, 2021 at 11:34:35PM +0800, ZheNing Hu wrote:
>
> > > Why is Olga’s solution rejected?
> > > 1. Olga's solution is to let `git cat-file` use the `ref-filter` interface,
> > > the performance of `cat-file` appears to be degraded due "very eager to
> > > allocate lots of separate strings" in `ref-filter` and other reasons.
> >
> > I have been thinking today about whether we can append some object information
> > directly to `&state->stack->output`, instead of assigning it to `v->s` first.
>
> Yes, that's the direction I think we'd want to go.
>
> > But `cmp_ref_sorting()` uses `get_ref_atom_value()`, so it may compare the
> > `v->s` of two different refs, and I would still have to fill the object info into `v->s`.
> >
> > So I think this is one of the reasons why `ref-filter` needs to allocate a
> > large number of strings, right?
>
> Yeah, I think sorting in general is a bit tricky, because it inherently
> requires collecting the value for each item. Just thinking about what
> properties an ideal solution would have (which we might not be able to
> get all of):
>
>   - if we're sorting by something numeric (e.g., a committer
>     timestamp), we should avoid forming it into a string at all
>
>   - if the sort item requires work to extract that overlaps with the
>     output format (e.g., sorting by authordate and showing author name
>     in the format, both of which require parsing the author ident line
>     of a commit), ideally we'd just do that work once per ref/object.
>

Yes, I can understand.

>   - if we are sorting, obviously we have to hold some amount of data for
>     each item in memory all at once (since we have to get data on the
>     sort properties for each, and then sort the result). So we'd
>     probably need at least some allocation per ref anyway, and an extra
>     string isn't too bad. But if we're not sorting, then it would be
>     nice to consider one ref/object at a time, which lets us keep our
>     peak memory usage lower, reuse output buffers, etc.
>

Yes, storing these strings in memory is beneficial for sorting.

> I think some of those are in competition with each other. Minimizing
> work shared between the sorting and format steps means keeping more data
> in memory. So it might be sensible to just treat them totally
> independently, and not worry about sharing work (I haven't looked at how
> ref-filter does this now).  TBH, I care a lot less about making the
> "sorting" case fast than I do about making sure that if we _aren't_
> sorting, we go as fast as possible.
>

Okay, so we just focus on the "nosort" case.
I am thinking about finding the cases where git does not need to sort, so that
we can set a flag like "nosort = 1" and then use this "nosort" flag in
`ref-filter` to do the string copy optimization that we want.
But the problem now is that `git for-each-ref` itself does not support
`--un-sort`, and it sorts by `refname` by default. I suspect that there is no
unsorted situation here for us to improve. (Any other command that calls
`ref_array_sort()` will be in a similar situation. It also seems to cause a
small memory leak: the ref_sorting entries in sorting_tail aren't freed, right?)
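
A rough sketch of what I mean by the flag. Everything here is hypothetical
(`struct sink`, `sink_emit()`, `show_names()` and the `nosort` parameter are
made up for illustration and are not `ref-filter` APIs): with `nosort` set,
each name is emitted as soon as it is seen; otherwise all names are collected
and sorted first.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed-size output sink, just enough for a demo. */
struct sink {
	char buf[256];
	size_t len;
};

static void sink_emit(struct sink *s, const char *name)
{
	size_t n = strlen(name);

	assert(s != NULL);
	if (s->len + n + 2 > sizeof(s->buf))
		return; /* demo only: drop output that does not fit */
	memcpy(s->buf + s->len, name, n);
	s->len += n;
	s->buf[s->len++] = '\n';
	s->buf[s->len] = '\0';
}

static int cmp_names(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Returns how many names had to be held in memory at the same time. */
static size_t show_names(const char **names, size_t nr, int nosort,
			 struct sink *out)
{
	const char **copy;
	size_t i;

	if (!nr)
		return 0;
	if (nosort) {
		/* Streaming path: one item at a time, nothing collected. */
		for (i = 0; i < nr; i++)
			sink_emit(out, names[i]);
		return 1;
	}

	/* Sorting path: every item must be collected before output. */
	copy = malloc(nr * sizeof(*copy));
	if (!copy)
		abort();
	memcpy(copy, names, nr * sizeof(*copy));
	qsort(copy, nr, sizeof(*copy), cmp_names);
	for (i = 0; i < nr; i++)
		sink_emit(out, copy[i]);
	free(copy);
	return nr;
}
```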

> -Peff

--
ZheNing Hu

Thread overview: 11+ messages
2021-04-02  9:03 GSoC Git Proposal Draft - ZheNing Hu ZheNing Hu
2021-04-02 14:57 ` Christian Couder
2021-04-03 13:23   ` ZheNing Hu
2021-04-02 15:39 ` Jeff King
2021-04-03 14:27   ` ZheNing Hu
2021-04-07 19:28     ` Jeff King
2021-04-08 13:29       ` ZheNing Hu
2021-04-11  6:11 ` ZheNing Hu
2021-04-11 15:34   ` ZheNing Hu
2021-04-13  6:40     ` Jeff King
2021-04-13 14:51       ` ZheNing Hu