From: ZheNing Hu <adlternative@gmail.com>
To: Git List <git@vger.kernel.org>
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Eric Sunshine" <sunshine@sunshineco.com>,
	"Christian Couder" <christian.couder@gmail.com>,
	"Hariom verma" <hariom18599@gmail.com>,
	"Jeff King" <peff@peff.net>,
	"Shourya Shukla" <periperidip@gmail.com>,
	"Оля Тележная" <olyatelezhnaya@gmail.com>,
	"Jiang Xin" <worldhello.net@gmail.com>
Subject: Re: GSoC Git Proposal Draft - ZheNing Hu
Date: Sun, 11 Apr 2021 14:11:49 +0800
Message-ID: <CAOLTT8RTA0inxgxbd3qDToKYxwgXGKvJikXWsXg7oQ4asFj+HQ@mail.gmail.com>
In-Reply-To: <CAOLTT8RfE4nn5NnjZh7xuF09-5=+K+_j_2kP0327HVdR4x_wAQ@mail.gmail.com>

Here is draft v2 of my GSoC 2021 proposal.
A web version is available here:
https://docs.google.com/document/d/119k-Xa4CKOt5rC1gg1cqPr6H3MvdgTUizndJGAo1Erk/edit
Comments and corrections are welcome :)

-------8<---------
Use ref-filter formats in git cat-file

About Me

Name: ZheNing Hu
Major: Computer Science And Technology
Mobile no.: +86 15058356458
Email: adlternative@gmail.com
IRC: adlternative (on #git-devel/#git@freenode)
Github: https://github.com/adlternative/
Blogs: https://adlternative.github.io/
Time Zone: CST (UTC +08:00)

Education & Background

I am currently a second-year student majoring in computer science
and technology at Xi'an University of Posts & Telecommunications
(China). In my freshman year, I joined the university's XiYou Linux
Group and learned how to use Git to submit my own code to GitHub.
Over the past two years I have learned C, C++, Python and shell,
I know how to debug with gdb, and I am familiar with Linux system
programming and Linux network programming. I started studying the
Git source code and contributing to Git in December 2020.


Me & Git

Around last November, I found a couple of projects on GitHub[1] that
taught me how to write a simple Git. The mechanics of Git are very
interesting:
1. There are four types of objects in Git: blob, tree, commit, tag.
2. Loose objects (with the SHA-1 hash algorithm) are stored in
".git/objects/sha1[0-1]/sha1[2-39]", with the SHA-1 value of the object
data serving as the storage address.
3. All branches are just references to commits.
...
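
The storage layout in item 2 can be sketched in C. This is only an
illustration, not Git's own code: `loose_object_path()` is a
hypothetical helper, and it assumes a 40-character SHA-1 hex name and
a repository rooted at ".git".

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the loose-object path for a 40-character
 * SHA-1 hex name. The first two hex digits name the directory and the
 * remaining 38 name the file.
 * Returns 0 on success, -1 if the name or the buffer has the wrong size.
 */
int loose_object_path(const char *hex, char *out, size_t outlen)
{
	int n;

	if (strlen(hex) != 40)
		return -1;
	n = snprintf(out, outlen, ".git/objects/%.2s/%s", hex, hex + 2);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Given a 64-byte buffer and the name
"e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", the helper fills in
".git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391".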
Then I read 《Pro Git》 and Jiang Xin's 《Git Authoritative Guide》
and learned how to use most Git subcommands.
Later, I started studying some of the Git source code. I found that Git
has at least 200,000 lines of C code and 200,000 lines of shell script
code, which left me a little confused about where to start. But after I
submitted my first patch, many people in the Git community gave me very
enthusiastic guidance, which gave me the courage to keep learning the
Git source code, and then I started making my own contributions. You
can find them here: [2][3]
These patches are in Git's master branch:
[master]
difftool.c: learn a new way start at specified file
https://lore.kernel.org/git/pull.870.v6.git.1613739235241.gitgitgadget@gmail.com/

ls-files.c: add --deduplicate option
https://lore.kernel.org/git/384f77a4c188456854bd86335e9bdc8018097a5f.1611485667.git.gitgitgadget@gmail.com/

ls_files.c: consolidate two for loops into one
https://lore.kernel.org/git/f9d5e44d2c08b9e3d05a73b0a6e520ef7bb889c9.1611485667.git.gitgitgadget@gmail.com/

ls_files.c: bugfix for --deleted and --modified
https://lore.kernel.org/git/8b02367a359e62d7721b9078ac8393a467d83724.1611485667.git.gitgitgadget@gmail.com/

builtin/*: update usage format
https://lore.kernel.org/git/d3eb6dcff1468645560c16e1d8753002cbd7f143.1609944243.git.gitgitgadget@gmail.com/

format-patch: allow a non-integral version numbers
https://lore.kernel.org/git/pull.885.v10.git.1616497946427.gitgitgadget@gmail.com/

[GSOC] commit: add --trailer option
https://lore.kernel.org/git/pull.901.v14.git.1616507757999.gitgitgadget@gmail.com/

These patches are still in progress:
[wip]
gitk: add right-click context menu for tags
https://lore.kernel.org/git/pull.866.v5.git.1614227923637.gitgitgadget@gmail.com/

[GSOC] trailer: add new .cmd config option
https://lore.kernel.org/git/3dc8983a47020fb417bb8c6c3d835e609b13c155.1617975462.git.gitgitgadget@gmail.com/

[GSOC] docs: correct descript of trailer.<token>.command
https://lore.kernel.org/git/505903811df83cf26f4dd70c5b811dde169896a2.1617975462.git.gitgitgadget@gmail.com/

[GSOC] ref-filter: get rid of show_ref_array_item
https://lore.kernel.org/git/pull.927.v2.git.1617809209164.gitgitgadget@gmail.com/

Proposed Project

Current situation
Git has long had the problem of duplicated implementations of the same
logic. For example, Git had at least 4 different implementations for
formatting command output for different commands, e.g.
`git cat-file --batch="%(objectname)"`, `git log --pretty="%aN"`,
`git for-each-ref --format="%(refname)"`.

Which implementations have been merged together?

2018 ~ 2019
Olga Telezhnaia: Reuse ref-filter formatting logic in `git cat-file`
Olga integrated some of the `git cat-file` logic into `ref-filter`, so
almost all format atoms of `git cat-file` are now available in
`git for-each-ref`,
e.g. `git for-each-ref --format="%(objectsize:disk) %(deltabase)"`.

2020 ~ 2021
Hariom Verma: Unify ref-filter formats with other --pretty formats
Hariom migrated some of the `--pretty` logic to `ref-filter`,
e.g. `git for-each-ref --format="%(trailers:key=Signed-off-by)"`
or `git for-each-ref --format="%(subject:sanitize)"`.

What’s git cat-file?
`git cat-file` is a Git subcommand used to show information about a
Git object.
`git cat-file --batch` reads object names from stdin and prints each
object's information and contents to stdout.
`git cat-file --batch-check` reads object names from stdin and prints
each object's information to stdout.
`--batch-all-objects`, combined with `--batch` or `--batch-check`,
shows all objects in the Git repository.
`--batch-check` and `--batch` both accept a custom format that can
contain placeholders like the following (see [4]):

%(objectname)      The full hex representation of the object name.
%(objecttype)      The type of the object.
%(objectsize)      The size, in bytes, of the object.
%(objectsize:disk) The size, in bytes, that the object takes up on disk.
%(deltabase)       If the object is stored as a delta on-disk, this
                   expands to the full hex representation of the delta
                   base object name. Otherwise, expands to the null OID
                   (all zeroes).
%(rest)            If this atom is used in the output string, input
                   lines are split at the first whitespace boundary.
                   All characters before that whitespace are considered
                   to be the object name; characters after that first
                   run of whitespace (i.e., the "rest" of the line) are
                   output in place of the `%(rest)` atom.
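
The `%(rest)` splitting rule can be sketched in C as follows. This is a
minimal illustration under stated assumptions, not the code in
`cat-file`: `split_object_name_and_rest()` is a hypothetical name, and
"whitespace" is assumed to mean spaces and tabs on a line whose
trailing newline has already been removed.

```c
#include <string.h>

/*
 * Hypothetical helper: split one input line in place.
 * Everything before the first whitespace is the object name; everything
 * after that first run of whitespace is the "rest".
 * Returns a pointer to the rest, or NULL if the line has no whitespace.
 */
char *split_object_name_and_rest(char *line)
{
	char *p = line + strcspn(line, " \t");

	if (!*p)
		return NULL;	/* only an object name on this line */
	*p++ = '\0';		/* terminate the object name */
	p += strspn(p, " \t");	/* skip the rest of the whitespace run */
	return p;
}
```

For the input line "HEAD~1  my label", the object name becomes
"HEAD~1" and the rest becomes "my label", which is what a format such
as "%(objectname) %(rest)" would print back after the object name.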

What’s the original design of git cat-file --batch?
1. `batch_objects()` first calls `expand_format()` to parse the format
atoms; this determines what data we need to capture.
2. Read the object name from standard input, and use
`get_oid_with_context()` to get the object's oid.
3. In `batch_object_write()`, `oid_object_info_extended()` obtains
the object information that we need.
4. `batch_object_write()` calls `expand_format()` a second time to
format the actual items and store them in a string buffer; eventually
the contents of this buffer are printed to standard output.
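
The two-pass idea in steps 1 and 4 can be sketched in C. This is a
simplified model, not Git's implementation: `struct obj_data`, the
`mark_only` flag, and this `expand_format()` signature are assumptions
made for illustration, only three atoms are handled, and unknown atoms
are silently dropped where a real command would report an error.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the per-object data; not Git's real struct. */
struct obj_data {
	const char *name;		/* hex object name */
	const char *type;		/* "blob", "tree", ... */
	unsigned long size;
	int need_type, need_size;	/* set by the first pass */
	int mark_only;	/* 1: only record what is needed, 0: emit values */
};

/* Append n bytes of s to the NUL-terminated out, truncating if full. */
static void append(char *out, size_t outlen, const char *s, size_t n)
{
	size_t used = strlen(out);

	if (used + n >= outlen)
		n = outlen - used - 1;
	memcpy(out + used, s, n);
	out[used + n] = '\0';
}

static int atom_is(const char *atom, size_t len, const char *name)
{
	return strlen(name) == len && !strncmp(atom, name, len);
}

/* The same walk over fmt serves both passes, switched by d->mark_only. */
void expand_format(const char *fmt, struct obj_data *d,
		   char *out, size_t outlen)
{
	out[0] = '\0';
	while (*fmt) {
		const char *end;

		if (fmt[0] == '%' && fmt[1] == '(' &&
		    (end = strchr(fmt + 2, ')')) != NULL) {
			const char *atom = fmt + 2;
			size_t len = (size_t)(end - atom);
			char num[32];

			if (atom_is(atom, len, "objectname")) {
				if (!d->mark_only)
					append(out, outlen, d->name, strlen(d->name));
			} else if (atom_is(atom, len, "objecttype")) {
				if (d->mark_only)
					d->need_type = 1;
				else
					append(out, outlen, d->type, strlen(d->type));
			} else if (atom_is(atom, len, "objectsize")) {
				if (d->mark_only) {
					d->need_size = 1;
				} else {
					snprintf(num, sizeof(num), "%lu", d->size);
					append(out, outlen, num, strlen(num));
				}
			}
			fmt = end + 1;
		} else {
			if (!d->mark_only)
				append(out, outlen, fmt, 1);	/* literal text */
			fmt++;
		}
	}
}
```

A caller would run the function once with `mark_only = 1` to learn
which fields to look up, fetch only those fields, and then run it again
with `mark_only = 0` to produce the output line. Because parsing and
formatting share one code path, the format string is re-parsed for
every object, which is the coupling described in the next section.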


What are the disadvantages in git cat-file --batch?
The atom format-parsing stage and the stage that formats the actual
items are not yet separated. This limits the ability of
`git cat-file --batch` to support richer formats like those of
`git for-each-ref` or `git log --pretty`.


Why was Olga’s solution rejected?
1. Olga's solution was to let `git cat-file` use the `ref-filter`
interface, but the performance of `cat-file` appeared to degrade because
`ref-filter` is "very eager to allocate lots of separate strings", among
other reasons.
2. Olga then tried optimizing `ref-filter`, but the performance of
`git cat-file` was still not as good as before.
3. The patch series was too long, which made it difficult to adjust and
merge.
4. Is "%(rest)" worth migrating? "%(rest)" exists for
`git cat-file --batch`, which reads from standard input: anything after
the whitespace on each line is printed back. This atom is quite
unnecessary for `git for-each-ref`, which does not read standard input.
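
The allocation concern in item 1 can be illustrated with a minimal C
sketch. It is not `ref-filter` code and does not use Git's strbuf API:
`struct outbuf`, `outbuf_add()` and `format_item()` are hypothetical
names. The sketch formats every item into one buffer that is reset and
reused, so memory is allocated when the buffer first grows rather than
once per item.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer, standing in for a reused output buffer. */
struct outbuf {
	char *buf;
	size_t len, alloc;
	unsigned grows;	/* how many times the buffer had to (re)allocate */
};

static void outbuf_add(struct outbuf *ob, const char *s)
{
	size_t n = strlen(s);

	if (ob->len + n + 1 > ob->alloc) {
		size_t want = (ob->len + n + 1) * 2;
		char *p = realloc(ob->buf, want);

		if (!p)
			abort();
		ob->buf = p;
		ob->alloc = want;
		ob->grows++;
	}
	memcpy(ob->buf + ob->len, s, n + 1);	/* copies the NUL too */
	ob->len += n;
}

/* Format "<name> <type>\n" into ob, overwriting the previous item. */
void format_item(struct outbuf *ob, const char *name, const char *type)
{
	ob->len = 0;	/* reset the contents but keep the allocation */
	outbuf_add(ob, name);
	outbuf_add(ob, " ");
	outbuf_add(ob, type);
	outbuf_add(ob, "\n");
}
```

A caller would zero-initialize one `struct outbuf`, call
`format_item()` for each object, write out `ob.buf` after each call,
and free the buffer once at the end. Formatting many items of similar
length then costs only a handful of allocations in total.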


My possible solution
1. Analyze how to get the data that `oid_object_info_extended()` can't
provide directly, and analyze the minimum amount of data required for
each step of atom format parsing.
2. Find a uniform way to parse formats such as `%an` in `log --pretty`
or `%(authorname)` in `ref-filter` (it might learn something from
`git config`, or could try using abstract syntax trees to parse format
atoms).
3. Apply the new interface to `git cat-file`, and then we could add
richer options for `git cat-file`.
4. (Optional optimization) Change the strbuf allocation strategy of
`ref-filter`: use a single strbuf for the output of all refs. This
improves performance by reducing the overhead of allocating large
numbers of small strbufs.
5. (Optional optimization) In addition, if we migrate `cat-file` to
`ref-filter` relying only on the improved performance of `ref-filter`,
we need to isolate the atoms that are not applicable to `cat-file`.
For example, `refname` is not useful for `git cat-file`; when it is
requested, we can exit the program with `die()` or just print an error
message.
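
The atom isolation in item 5 could look roughly like the following C
sketch. All names here (`enum atom_source`, `struct atom_info`,
`atom_allowed()`) are hypothetical and the atom table is deliberately
tiny; it only illustrates tagging each atom with the kind of data it
needs, so that a command which has no ref can reject `refname` before
any formatting starts.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: the kinds of data an atom may need. */
enum atom_source {
	SRC_OBJECT = 1,	/* needs an object (available in cat-file) */
	SRC_REF = 2	/* needs a ref (available in for-each-ref) */
};

struct atom_info {
	const char *name;
	unsigned needs;
};

static const struct atom_info atoms[] = {
	{ "objectname", SRC_OBJECT },
	{ "objecttype", SRC_OBJECT },
	{ "objectsize", SRC_OBJECT },
	{ "refname", SRC_REF },
};

/*
 * Return 1 if the atom can be used with the available data, 0 if it
 * needs data the calling command does not have, -1 if it is unknown.
 */
int atom_allowed(const char *name, unsigned available)
{
	size_t i;

	for (i = 0; i < sizeof(atoms) / sizeof(atoms[0]); i++) {
		if (!strcmp(atoms[i].name, name))
			return (atoms[i].needs & ~available) ? 0 : 1;
	}
	return -1;
}
```

A `cat-file`-like caller would pass `SRC_OBJECT` and treat a result of
0 or -1 as the point where it calls `die()` or prints an error message,
while a `for-each-ref`-like caller would pass `SRC_OBJECT | SRC_REF`.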

Are you applying for other Projects?

No, Git is the only one.

Blogging about Git

While studying the Git source code, I often write blog posts[5] to
record what I have learned; this helps me recall things after I have
forgotten them. Most of the posts so far were written in Chinese, but
during GSoC I promise all my posts will be written in English.


Time Line

May 18 ~ June 18
1. Learn the details of atom format parsing in `ref-filter.c` and `pretty.c`.
Think about how to parse the two different atom formats in a single way
(for example, how we could use abstract syntax trees to organize different
atoms).
2. Analyze and optimize `ref-filter` performance.
3. Discuss with mentors to find a reasonable solution for uniform format
parsing, and then start coding it.
June 18 ~ July 18
1. Continue to integrate the atom format parsing and apply it to `pretty.c` and
`ref-filter.c`.
2. Make sure that the performance of `git for-each-ref` and `git log --pretty`
under the new format parsing interface is better than before.
July 18 ~ August 17
1. Let `git cat-file` use the new and better format parsing interface.
2. Support more options for `cat-file --batch` and ensure that atoms which
do not apply to `cat-file` stay isolated.

Availability

I have plenty of time before and after my final exams, and I have enough
energy to complete daily tasks. I stay active on the Git mailing list; you
can find me at any time as long as I am not sleeping. :)

Post GSoC

I love the open source philosophy, and I am willing to spread the spirit
of openness and freedom and to research technology with like-minded people.
In my contact with the Git community over the past few months, many people
in the community gave me great encouragement. I hope I can keep my passion
for Git alive, contribute my own code, and pass this cool thing on. I am
willing to contribute code to the Git community for a long time after the
end of GSoC.
I hope the Git community can give me a chance to participate in GSoC.
I sincerely thank GSoC and the Git community!


________________
[1] https://github.com/danistefanovic/build-your-own-x#build-your-own-git
[2] https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Aadlternative+
[3] https://git.kernel.org/pub/scm/git/git.git/log/?qt=grep&q=ZheNing+Hu
[4] https://github.com/gitgitgadget/git/blob/89b43f80a514aee58b662ad606e6352e03eaeee4/Documentation/git-cat-file.txt#L189
[5] https://adlternative.github.io/tags/git/
