All of lore.kernel.org
 help / color / mirror / Atom feed
* [GSoC][PROPOSAL v1] Implement consistency check for refs
@ 2024-03-23 10:03 shejialuo
  2024-03-25  8:06 ` Patrick Steinhardt
  0 siblings, 1 reply; 3+ messages in thread
From: shejialuo @ 2024-03-23 10:03 UTC (permalink / raw)
  To: git; +Cc: Patrick Steinhardt, Karthik Nayak, Junio C Hamano

Hi everyone,

This is my proposal for the project "Implement consistency checks for
refs". The web version can be viewed at
https://docs.google.com/document/d/1pWnnyykGmJIN-wyosZ3PtueFfs_BRdvJq-cwroRorBI/

----------

# Implement consistency check for refs

## Personal Information

+ Name: Jialuo She
+ Email: shejialuo@gmail.com
+ Mobile no.: (+86) 13019582712
+ Education: Xidian University, Shanxi, China
+ Year: Final Year
+ Degree: Master in Software Engineer
+ GitHub: https://github.com/shejialuo
+ Blog: https://luolibrary.com/

## About Me

I am someone who loves coding, and my favorite quote is from "The
Shawshank Redemption": "Get busy living, or get busy dying". In my final
year as a student, I hope to continue doing valuable things.

I have some basic open-source experience, such as improving
documentation for Angular [1]-[2], adding some features to the
community [3]. I also maintain some open-source projects, mainly course
codes, book answers, and some practical in-action tutorials to help
others [4]. Additionaly, I write blogs to document and help others with
some of the public courses I’ve taken [5]. I have written a toy git by
myself due to my interest [6].

I am currently interning at NVIDIA, where I started in the software
department working on CUDA Test Development. My current position is in
the hardware department as a Tegra System Architect, and I will become
a full-time employee this July.

I hope to use this opportunity to continue participating in the git
community and contribute to open source in the future.

## Overview and Background

As [7] shows, when the "packed-refs" file gets corrupted, the
git-fsck(1) command cannot detect this situation where the "packed-refs"
file contains corrupted binary zeros.

The current mechanism for git-fsck(1) command can be classified into
two parts. The first part is to check the object database. For every
loose file which is under `$GIT_DIR/objects`, it will use the internal
method `for_each_loose_file_in_objdir` to check whether the result of the
content-hash is consistent. The second part is to check the "packed-refs"
file, it will open and parse it to iterate the refs which provides an
implicit consistency check. However, as [8] shows, when there are some
null bytes, maybe the read file operation will just simply ignore
those null bytes.

It turns out that the current git-fsck(1) command mainly focuses on
checking the consistency of the object database. It lacks the
functionality to check the consistency ref database explicitly. The
upcoming retable backend will soon be released which is a binary
format which would be hard for users to check the corruption.

And this project aims at implementing consistency check for refs
suggested by [8] and [9]. And there are two backends for git: one
is file and another is retable. Initially these checks may only apply
to the "file" backend and then apply to the "retable" backend.

The expected project difficulty is medium and the expected project
size would be 175 hours or 350 hours depending on whether to implement
consistency check for "retable" backend.


## Pre-GSoC

I first got in touch with Git codebase in February, 2024. I have only
submitted a minor patch as the microproject to get familiar with the
workflow of the Git community.


### MicroProject

t9117: prefer test_path_* helper functions
  Mailing list thread:
  https://lore.kernel.org/git/20240304095436.56399-2-shejialuo@gmail.com/
  Status: merged into master
  Description:
  test -(e|d) does not provide a nice error message when we hit test
  failures, so use test_path_exists, test_path_is_dir instead.

### Discussion

I have discussed with the community for some implementation ideas in
[10] which is the fundamental of the "Proposed Project" part.

### Learning the Source Code

During the time from completing the microproject until now, I have
primarily read the source code related to `fsck.c` and ref internals.
I have summarized the implementation diagram of the ref mechanism as
shown in [11].

### Previous Work

This project might be a new idea here. It is proposed mainly due to
the new-released retable backend [12].

## Proposed Project

As [10] shows, Patrick wrote:

> Some ideas from the top of my head:
>
>  - generic
>    - Ensure that all ref names are conformant.
>    - Ensure that there are no directory/file conflicts for the ref
>      names.
>  - files
>    - Ensure that "packed-refs" is well-formatted.
>    - Ensure that refs in "packed-refs" are ordered lexicographically.
>    - Check for corrupted loose refs in "refs/".
>  - reftable
>    - Ensure that there are no garbage files in "reftable/".
>    - Ensure that "tables.list" is well-formatted.
>    - Ensure that each table is well-formatted.
>    - Ensure that refs in each table are ordered correctly.

### Generic

It's easy to ensure that all ref names are conformant. The
git-check-ref-format(1) provides the functionality to check whether
the name of a ref is ok. It eventually calls the `check_refname_component`
in the `ref.c` file. And I have understood the rules of the ref name
as the following:

- Hierarchial Grouping:
    - Valid: `refs/heads/feature/new-feature`
    - Invalid: `refs/heads/.hidden-branch` (begins with a dot)
    - Invalid: `refs/heads/feature.lock` (ends with `.lock`)
- Contain at Least One Slash
    - Valid: `refs/heads/main`
    - Invalid: `main` (no slash)
- No Consecutive Dots
    - Valid: `refs/heads/feature/new.feature`
    - Invalid: `refs/heads/feature..new-feature` (contains `..`)
- No ASCII Control Characters, Space, Tilde, Caret, or Colon
    - Valid: `refs/heads/feature/new-feature`
    - Invalid: `refs/heads/feature^new` (contains `^`)
    - Invalid: `refs/heads/feature: new` (contains `:` and space)
- No Question-Mark, Asterisk, or Open Bracket
    - Valid: `refs/heads/feature/new-feature`
    - Invalid: `refs/heads/feature/new?feature`
    - Invalid: `refs/heads/feature/*` (contains `*`)
- No Leading or Trailing Slash, No Multiple Consecutive Slashes
    - Valid: `refs/heads/feature/new-feature`
    - Invalid: `/refs/heads/feature/new-feature` (begins with `/`)
    - Invalid: `refs/heads//new-feature` (contains `//`)
- No Ending with a Dot
    - Valid: `refs/heads/feature/new-feature`
    - Invalid: `refs/heads/feature.` (ends with `.`)
- No Sequence `@{`
    - Valid: `refs/heads/feature/new-feature`
    - Invalid: `refs/heads/feature@{time}` (contains `@{`).
- Not the Single Character `@`
    - Valid: `refs/heads/feature/@new-feature`
    - Invalid: `@` (single character `@`)
- No Backslash
    - Valid: `refs/heads/feature/new-feature`
    - Invalid: `refs/heads/feature\new-feature`

So for ref names, we could reuse this function for every backend.

### The Interface For Two Backends

From the discussion in [10], Patrick wrote:

> For what it's worth, not all of the checks need to be implemented as
> part of GSoC. At a minimum, it should result in the infra to allow for
> backend-specific checks and a couple of checks for at least one of the
> backends.

I think the most important thing for this project is to setup the
infrastructure with good flexibility which allows backend-specific check
and programmer can add checks easily. This is the reason why I dive into
ref internal, I want to understand how Git handles these two interfaces.
And I want to reuse this idea to add ref consistency checks.

As picture [11] shows, the `ref_store` is the core abstraction provided
by the `ref_internal.h`, it is the base class which contains the `be`
pointer which provides backend-specific functions whose interfaces are
defined in the `ref_storage_be`.

In my view, I think we could reuse this polymorphism. We could define
some common interfaces or only one interface in the `ref_storage_be`.
For every backend, it needs to provide its own function pointer. And
we could reuse `ref_store` structure to do this.

At last, we can use this polymorphism in `fsck.c` to elegantly setup
the infrastructure. And it has the following advantages:

- Reuse the current polymorphism provided by `ref_store`.
- We can add checks easily for future.

### The Concrete Checks

At current, this project proposal cannot determine the concrete checks.
It should be discussed to make the checks concrete and clear.

### Timeline

Pre-GSoC
(Until May 1)
- Continue to read the source code especially for retable backend.
  Understand the retable format to determine the concrete checks
  with mentors.

Community Bonding
(May 1 - May 26)
- Talk with mentors to determine the detailed infrastructure should be
  implemented to provide flexibility for programmers. Also, determine
  the checks for the generic part and continue understanding the retable
  format and code.

Phase I
(May 27 - July 7)
- Implement the infrastructure to support both file-backend and
  retable-backend. Write tests to make sure the correctness of the
  infrastructure. And then add the some generic ref checks both for
  file-backend and retable-backend (such as use existing
  `check_refname_component`) and write tests.

Phase II
(July 7 - Aug 19)
- Talk with mentors to determine the ref checks for file-backend and
  continue to add ref checks for file-backend. Also add write unit tests.

Final Week
(Aug 19 - Aug 26)
- Fix any bugs.
- Write the final report.

However, in my perspective, I hope this project could last 22 weeks and
I could add ref checks for retable-backend. But due to uncertainly,
I omit the timeline here.

### Availability

Currently, I am engaged in a remote internship, which allows me to
work 2-3 hours each evening from Monday to Friday. On weekends, I
am able to dedicate 8 hours each day to work. As I am in my final year
and do not have exams, I can commit to 20-30 hours of work per week.
Even after becoming a full-time employee in July, my schedule will remain
similar to my internship period. Additionally, I am able to continue
working until November (22 Week Project).

## Post-GSoC

After completing the GSoC, I plan to continue working on the code related
to retable and read other modules in Git. This was my original intention
for participating in GSoC, to integrate into the community through this
opportunity and, if possible, to become a mentor in the future, igniting
the flame in others' hearts. Most importantly, to "Get busy living."

## Closing Remarks

Thanks for every review for reviewing this proposal. It’s my honor to
work with the community.

## References

[1] https://github.com/angular/angular/pull/44674
[2] https://github.com/angular/angular/pull/44628
[3] https://github.com/ehForwarderBot/efb-qq-plugin-go-cqhttp/pull/68
[4] https://github.com/shejialuo/database-systems-complete-book-solutions
[5] https://luolibrary.com/categories/CS144/
[6] https://github.com/shejialuo/ugit-cpp
[7] https://lore.kernel.org/git/6cfee0e4-3285-4f18-91ff-d097da9de737@rd10.de/
[8] https://lore.kernel.org/git/20240120010055.GC117170@coredump.intra.peff.net/
[9] https://lore.kernel.org/git/ZakIPEytlxHGCB9Y@tanuki/
[10] https://lore.kernel.org/git/Ze2E_xgfwTUzsQ92@ArchLinux/
[11] https://s2.loli.net/2024/03/23/3YGoBgZzHNtOfVw.png
[12] https://lore.kernel.org/git/cover.1706601199.git.ps@pks.im/

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [GSoC][PROPOSAL v1] Implement consistency check for refs
  2024-03-23 10:03 [GSoC][PROPOSAL v1] Implement consistency check for refs shejialuo
@ 2024-03-25  8:06 ` Patrick Steinhardt
  2024-03-26 11:52   ` shejialuo
  0 siblings, 1 reply; 3+ messages in thread
From: Patrick Steinhardt @ 2024-03-25  8:06 UTC (permalink / raw)
  To: shejialuo; +Cc: git, Karthik Nayak, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 12912 bytes --]

On Sat, Mar 23, 2024 at 06:03:30PM +0800, shejialuo wrote:
> Hi everyone,
> 
> This is my proposal for the project "Implement consistency checks for
> refs". The web version can be viewed at
> https://docs.google.com/document/d/1pWnnyykGmJIN-wyosZ3PtueFfs_BRdvJq-cwroRorBI/
> 
> ----------
> 
> # Implement consistency check for refs
> 
> ## Personal Information
> 
> + Name: Jialuo She
> + Email: shejialuo@gmail.com
> + Mobile no.: (+86) 13019582712
> + Education: Xidian University, Shanxi, China
> + Year: Final Year
> + Degree: Master in Software Engineer
> + GitHub: https://github.com/shejialuo
> + Blog: https://luolibrary.com/
> 
> ## About Me
> 
> I am someone who loves coding, and my favorite quote is from "The
> Shawshank Redemption": "Get busy living, or get busy dying". In my final
> year as a student, I hope to continue doing valuable things.
> 
> I have some basic open-source experience, such as improving
> documentation for Angular [1]-[2], adding some features to the
> community [3]. I also maintain some open-source projects, mainly course
> codes, book answers, and some practical in-action tutorials to help
> others [4]. Additionaly, I write blogs to document and help others with
> some of the public courses I’ve taken [5]. I have written a toy git by
> myself due to my interest [6].
> 
> I am currently interning at NVIDIA, where I started in the software
> department working on CUDA Test Development. My current position is in
> the hardware department as a Tegra System Architect, and I will become
> a full-time employee this July.

Given that this overlaps with GSoC, how do you expect to allocate time
for both a full-time job and participating in the GSoC at the same time?

> I hope to use this opportunity to continue participating in the git
> community and contribute to open source in the future.
> 
> ## Overview and Background
> 
> As [7] shows, when the "packed-refs" file gets corrupted, the
> git-fsck(1) command cannot detect this situation where the "packed-refs"
> file contains corrupted binary zeros.
> 
> The current mechanism for git-fsck(1) command can be classified into
> two parts. The first part is to check the object database. For every
> loose file which is under `$GIT_DIR/objects`, it will use the internal
> method `for_each_loose_file_in_objdir` to check whether the result of the
> content-hash is consistent. The second part is to check the "packed-refs"
> file, it will open and parse it to iterate the refs which provides an
> implicit consistency check. However, as [8] shows, when there are some
> null bytes, maybe the read file operation will just simply ignore
> those null bytes.

git-fsck(1) does a lot more nowadays: it also checks packfiles, commit
graphs, multi-pack indices, reverse indices and bitmaps.

> It turns out that the current git-fsck(1) command mainly focuses on
> checking the consistency of the object database. It lacks the
> functionality to check the consistency ref database explicitly. The
> upcoming retable backend will soon be released which is a binary
> format which would be hard for users to check the corruption.

This part is true though. Most of the above data structures are closely
tied to the objects that exist in the repository, while we have nothing
that explicitly checks the refdb.

> And this project aims at implementing consistency check for refs
> suggested by [8] and [9]. And there are two backends for git: one
> is file and another is retable. Initially these checks may only apply
> to the "file" backend and then apply to the "retable" backend.
> 
> The expected project difficulty is medium and the expected project
> size would be 175 hours or 350 hours depending on whether to implement
> consistency check for "retable" backend.
> 
> 
> ## Pre-GSoC
> 
> I first got in touch with Git codebase in February, 2024. I have only
> submitted a minor patch as the microproject to get familiar with the
> workflow of the Git community.
> 
> 
> ### MicroProject
> 
> t9117: prefer test_path_* helper functions
>   Mailing list thread:
>   https://lore.kernel.org/git/20240304095436.56399-2-shejialuo@gmail.com/
>   Status: merged into master
>   Description:
>   test -(e|d) does not provide a nice error message when we hit test
>   failures, so use test_path_exists, test_path_is_dir instead.
> 
> ### Discussion
> 
> I have discussed with the community for some implementation ideas in
> [10] which is the fundamental of the "Proposed Project" part.
> 
> ### Learning the Source Code
> 
> During the time from completing the microproject until now, I have
> primarily read the source code related to `fsck.c` and ref internals.
> I have summarized the implementation diagram of the ref mechanism as
> shown in [11].
> 
> ### Previous Work
> 
> This project might be a new idea here. It is proposed mainly due to
> the new-released retable backend [12].
> 
> ## Proposed Project
> 
> As [10] shows, Patrick wrote:
> 
> > Some ideas from the top of my head:
> >
> >  - generic
> >    - Ensure that all ref names are conformant.
> >    - Ensure that there are no directory/file conflicts for the ref
> >      names.
> >  - files
> >    - Ensure that "packed-refs" is well-formatted.
> >    - Ensure that refs in "packed-refs" are ordered lexicographically.
> >    - Check for corrupted loose refs in "refs/".
> >  - reftable
> >    - Ensure that there are no garbage files in "reftable/".
> >    - Ensure that "tables.list" is well-formatted.
> >    - Ensure that each table is well-formatted.
> >    - Ensure that refs in each table are ordered correctly.
> 
> ### Generic
> 
> It's easy to ensure that all ref names are conformant. The
> git-check-ref-format(1) provides the functionality to check whether
> the name of a ref is ok. It eventually calls the `check_refname_component`
> in the `ref.c` file. And I have understood the rules of the ref name
> as the following:
> 
> - Hierarchial Grouping:
>     - Valid: `refs/heads/feature/new-feature`
>     - Invalid: `refs/heads/.hidden-branch` (begins with a dot)
>     - Invalid: `refs/heads/feature.lock` (ends with `.lock`)
> - Contain at Least One Slash
>     - Valid: `refs/heads/main`
>     - Invalid: `main` (no slash)
> - No Consecutive Dots
>     - Valid: `refs/heads/feature/new.feature`
>     - Invalid: `refs/heads/feature..new-feature` (contains `..`)
> - No ASCII Control Characters, Space, Tilde, Caret, or Colon
>     - Valid: `refs/heads/feature/new-feature`
>     - Invalid: `refs/heads/feature^new` (contains `^`)
>     - Invalid: `refs/heads/feature: new` (contains `:` and space)
> - No Question-Mark, Asterisk, or Open Bracket
>     - Valid: `refs/heads/feature/new-feature`
>     - Invalid: `refs/heads/feature/new?feature`
>     - Invalid: `refs/heads/feature/*` (contains `*`)
> - No Leading or Trailing Slash, No Multiple Consecutive Slashes
>     - Valid: `refs/heads/feature/new-feature`
>     - Invalid: `/refs/heads/feature/new-feature` (begins with `/`)
>     - Invalid: `refs/heads//new-feature` (contains `//`)
> - No Ending with a Dot
>     - Valid: `refs/heads/feature/new-feature`
>     - Invalid: `refs/heads/feature.` (ends with `.`)
> - No Sequence `@{`
>     - Valid: `refs/heads/feature/new-feature`
>     - Invalid: `refs/heads/feature@{time}` (contains `@{`).
> - Not the Single Character `@`
>     - Valid: `refs/heads/feature/@new-feature`
>     - Invalid: `@` (single character `@`)
> - No Backslash
>     - Valid: `refs/heads/feature/new-feature`
>     - Invalid: `refs/heads/feature\new-feature`
> 
> So for ref names, we could reuse this function for every backend.
> 
> ### The Interface For Two Backends
> 
> From the discussion in [10], Patrick wrote:
> 
> > For what it's worth, not all of the checks need to be implemented as
> > part of GSoC. At a minimum, it should result in the infra to allow for
> > backend-specific checks and a couple of checks for at least one of the
> > backends.
> 
> I think the most important thing for this project is to setup the
> infrastructure with good flexibility which allows backend-specific check
> and programmer can add checks easily. This is the reason why I dive into
> ref internal, I want to understand how Git handles these two interfaces.
> And I want to reuse this idea to add ref consistency checks.
> 
> As picture [11] shows, the `ref_store` is the core abstraction provided
> by the `ref_internal.h`, it is the base class which contains the `be`
> pointer which provides backend-specific functions whose interfaces are
> defined in the `ref_storage_be`.
> 
> In my view, I think we could reuse this polymorphism. We could define
> some common interfaces or only one interface in the `ref_storage_be`.
> For every backend, it needs to provide its own function pointer. And
> we could reuse `ref_store` structure to do this.
> 
> At last, we can use this polymorphism in `fsck.c` to elegantly setup
> the infrastructure. And it has the following advantages:
> 
> - Reuse the current polymorphism provided by `ref_store`.
> - We can add checks easily for future.

Sounds sensible.

> ### The Concrete Checks
> 
> At current, this project proposal cannot determine the concrete checks.
> It should be discussed to make the checks concrete and clear.
> 
> ### Timeline
> 
> Pre-GSoC
> (Until May 1)
> - Continue to read the source code especially for retable backend.
>   Understand the retable format to determine the concrete checks
>   with mentors.
> 
> Community Bonding
> (May 1 - May 26)
> - Talk with mentors to determine the detailed infrastructure should be
>   implemented to provide flexibility for programmers. Also, determine
>   the checks for the generic part and continue understanding the retable
>   format and code.
> 
> Phase I
> (May 27 - July 7)
> - Implement the infrastructure to support both file-backend and
>   retable-backend. Write tests to make sure the correctness of the
>   infrastructure. And then add the some generic ref checks both for
>   file-backend and retable-backend (such as use existing
>   `check_refname_component`) and write tests.
> 
> Phase II
> (July 7 - Aug 19)
> - Talk with mentors to determine the ref checks for file-backend and
>   continue to add ref checks for file-backend. Also add write unit tests.
> 
> Final Week
> (Aug 19 - Aug 26)
> - Fix any bugs.
> - Write the final report.
> 
> However, in my perspective, I hope this project could last 22 weeks and
> I could add ref checks for retable-backend. But due to uncertainly,
> I omit the timeline here.

That's fine. Adding consistency checks to the "reftable" backend can
easily be defined as a stretch goal or even not at all.

> ### Availability
> 
> Currently, I am engaged in a remote internship, which allows me to
> work 2-3 hours each evening from Monday to Friday. On weekends, I
> am able to dedicate 8 hours each day to work. As I am in my final year
> and do not have exams, I can commit to 20-30 hours of work per week.
> Even after becoming a full-time employee in July, my schedule will remain
> similar to my internship period. Additionally, I am able to continue
> working until November (22 Week Project).

Hum, okay. This does sound extremely taxing and I do wonder whether this
is going to be a sustainable mode of working for multiple months in a
row.

Thanks for your application!

Patrick

> ## Post-GSoC
> 
> After completing the GSoC, I plan to continue working on the code related
> to retable and read other modules in Git. This was my original intention
> for participating in GSoC, to integrate into the community through this
> opportunity and, if possible, to become a mentor in the future, igniting
> the flame in others' hearts. Most importantly, to "Get busy living."
> 
> ## Closing Remarks
> 
> Thanks for every review for reviewing this proposal. It’s my honor to
> work with the community.
> 
> ## References
> 
> [1] https://github.com/angular/angular/pull/44674
> [2] https://github.com/angular/angular/pull/44628
> [3] https://github.com/ehForwarderBot/efb-qq-plugin-go-cqhttp/pull/68
> [4] https://github.com/shejialuo/database-systems-complete-book-solutions
> [5] https://luolibrary.com/categories/CS144/
> [6] https://github.com/shejialuo/ugit-cpp
> [7] https://lore.kernel.org/git/6cfee0e4-3285-4f18-91ff-d097da9de737@rd10.de/
> [8] https://lore.kernel.org/git/20240120010055.GC117170@coredump.intra.peff.net/
> [9] https://lore.kernel.org/git/ZakIPEytlxHGCB9Y@tanuki/
> [10] https://lore.kernel.org/git/Ze2E_xgfwTUzsQ92@ArchLinux/
> [11] https://s2.loli.net/2024/03/23/3YGoBgZzHNtOfVw.png
> [12] https://lore.kernel.org/git/cover.1706601199.git.ps@pks.im/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [GSoC][PROPOSAL v1] Implement consistency check for refs
  2024-03-25  8:06 ` Patrick Steinhardt
@ 2024-03-26 11:52   ` shejialuo
  0 siblings, 0 replies; 3+ messages in thread
From: shejialuo @ 2024-03-26 11:52 UTC (permalink / raw)
  To: ps; +Cc: git, gitster, karthik.188

> Hum, okay. This does sound extremely taxing and I do wonder whether this
> is going to be a sustainable mode of working for multiple months in a
> row.

Such a long schedule might indeed be too intense, but I think it's very
flexible, and I can continue to improve this aspect of the work after
GSoC.


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2024-03-26 11:52 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-23 10:03 [GSoC][PROPOSAL v1] Implement consistency check for refs shejialuo
2024-03-25  8:06 ` Patrick Steinhardt
2024-03-26 11:52   ` shejialuo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.