git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][GSoC] Implement Generation Number v2
@ 2020-03-22  9:35 Abhishek Kumar
  2020-03-22 20:05 ` Jakub Narebski
  0 siblings, 1 reply; 12+ messages in thread
From: Abhishek Kumar @ 2020-03-22  9:35 UTC (permalink / raw)
  To: git; +Cc: jnareb, stolee, christian.couder

Hello everyone,

Following the discussions between Jakub and Stolee [16], it was decided that
implementing "Corrected Commit Date With Strictly Monotonic Offset" [17] was
more appropriate for a GSoC project.

I would like to work on it. The proposal could still use some work but I
thought it would be best to get early feedback.

Having been over most of the related codebase, I think the project can be
reasonably completed in 8 weeks or so.

Transitioning from SHA1 to SHA3-256 (or at least adding support for it) has
been on the waitlist for a while. Is that something we could pursue as a
part of this project?

If not, are there any other related components/ideas that I could work on.

Thank you Jakub for proposing this project and the other contributors for
continued interest and discussions.

Note: You can alternatively read this proposal on Github [18].

[16]: https://lore.kernel.org/git/86eeto5x1j.fsf@gmail.com/
[17]: https://lore.kernel.org/git/86o8ziatb2.fsf_-_@gmail.com/
[18]: https://github.com/abhishekkumar2718/GSoC20/blob/master/generation_number_v2.md

===============================================================================

# Implement Generation Number v2

## Abstract

Git uses various clever methods for making operations on very large repositories
faster, from bitmap indices for git fetch [1], to generation numbers (also known
as topological levels) in the commit-graph file for commit graph traversal
operations like git log --graph [2].

However, generation numbers do not always improve performance. Stolee has
previously explored alternatives for a better generation number [3]. Backward
compatible corrected commit date was chosen because of its performance, local
computability, and backward compatibility.

[1]: https://githubengineering.com/counting-objects/
[2]: https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/
[3]: https://lore.kernel.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@gmail.com/

## Goals

This project aims to:
- Move `generation`, `graph_pos` members of struct commit into a commit slab.
- Update generation number to v2, modifying all places that reference or
compute generation numbers.
- Use generation numbers to speed up implementation of `--ancestry-path` and
possibly other commands.
- Extend Stolee`s performance test suite.
- Examine performance improvements.

## Overview

Git walks the commit graph for many reasons, including listing and filtering
commit history and computing merge bases.

These operations can become slow as the commit count grows. The two main costs
here are decompressing and parsing commits and walking the entire graphs to
satisfy topological order constraints.

Stolee introduced commit-graph file to accelerate this process. The commit-graph
file stores commit graph structure along extra metadata to speed up graph walks.
A chunk includes commit OID, list of parents, commit date, root tree OID and 
generation number.

Generation number v1 was defined as one more than the length of the longest path
from A to the root commit. Then,

    If A and B are commits with generation numbers N and M, and N <= M, then
    A cannot reach B.

This property reduces the time to walk commits and determine topological
relationships significantly.

### Performance Regression

While testing git v2.19.0-rc1, Stolee came across a performance regression due
to using topological levels [4]. The discussion can be summarized as:

Topological level changes the order in which the commits are inspected and
gives a performance regression if it chooses to walk a longer path than
compared to date-based heuristic.

For example, consider the performance before and after using topological levels
on `git merge-base v4.8 v4.9`:

  before topological levels: 0.122s
   after topological levels: 0.547s


The topology of this can be described as:
```
   v4.9
    |  \
    |   \
   v4.8  \
    | \   \
    |  \   |
   ...  A  B
    |  /  /
    | /  /
    |/__/
    C
```

Here, the "..." means a "very long line of commits". A and B have generation
one more than C.

If we order commits by generation, then commit date, B is removed from the queue
only after many commits reachable from v4.8 have been explored.  However, if
ordering by commit date removes A and B from the queue early and C is inserted.
As C is processed, the rest of queue is marked as STALE and loop terminates.

[4]: https://lore.kernel.org/git/pull.28.git.gitgitgadget@gmail.com/

### Alternatives to Topological Level

To improve generation numbers, Stolee investigated other alternative
definitions for generation numbers. The extended exploration and report are
available here [5].

Corrected commit date was chosen as the next generation number by overall
consensus because of its performance, local computability, and immutability.

However, it was not backward compatible with current implementation i.e., older
clients do not report correct values if they ignore reachability index version.
Existing clients would fail altogether with commit-graph v2 instead of
switching commit-graph feature off, as discovered by Ævar Arnfjörð Bjarmason [6].

[5]: https://lore.kernel.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@gmail.com/
[6]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

### Corrected Commit Date With Strictly Monotonic Offset

Jakub later proposed a modification to corrected commit date, _Corrected Commit
Date with Strictly Monotonic Offset_ defined as follows [7]:

For a commit C, let its _offset commit date_ (denoted by odate(C))
be a commit date plus some offset, i.e. odate(C) = date(C) + offset(C),
such that:

1. Offset commit date is greater than the maximum of the commit date of
   C and the offset commit dates of its parents.

    If odate(A) < odate(B), then A cannot reach B.

2. The offset of a commit is one more than the maximum offset of a parent, or
   more.

    If offset(A) < offset(B), then A cannot reach B.

Since the offset are strictly greater than the offset of a parent, the old
clients give correct values for the odate as well. `git commit-graph verify`
would fail since the offsets are not generation numbers in all cases.

This is an acceptable trade off because users can re-write the commit graph
after verify fails.
 
A rough implementation by Jakub can be read here [8].

[7]: https://lore.kernel.org/git/86o8ziatb2.fsf_-_@gmail.com/
[8]: https://github.com/derrickstolee/git/pull/10

## Plan

### Community bonding period (April 27 - May 18, 4 weeks)

1. Extend Stolee`s performance test suite.

To investigate replacements for generation numbers, Stolee implemented
prototypes and compared performance on a range of queries.

It currently tests:
- `git log --topo-order -N`: measures commits walked to determine all possible
ways of arriving at given commit.
- `git log --topo-order -10 A..B`: measures commits walked to determine commits
reachable from B but not from A.
- `git merge-base A B`: measures commits walked to find the lowest (closest)
ancestor for both A and B.

We can also add tests for:
- `git branch --[no-]contains A`: measures commits walked to determine if A is
reachable from the tip of a given branch for all branches (many to one query).
- `git branch --[no-]merged A`: measures commits walked to determine if the tip
of a given branch is reachable from A for all branches (one to many query).
- `git push/fetch`: measures commits walked to compute HAVES/WANTS of a client
(many to many query) [9].

2. Find other operations that can be optimized using generation numbers.

As Jakub pointed out in the discussion on generation numbers, --ancestry-path
can be optimized using generation numbers [10]. Discuss and find any other
operation which could use generation numbers.

3. Explore commit, commit-graph, commit-reach, revision

[9]: https://github.blog/2015-09-22-counting-objects/
[10]: https://lore.kernel.org/git/861s82wdbp.fsf@gmail.com/

### Move `generation`, `graph_pos` into a slab (May 18 - June 8, 3 weeks)

struct commit is used in many contexts. However, the struct members
`generation` and `graph_pos` are only used for commit-graph related operations
and otherwise waste memory.

Using a commit slab reduces space of an individual commit object, which becomes
crucial as we load thousands of them. We can dynamically associate commits
with their generation/graph positions when needed.

Commit slab for `generation` is 64 bits wide as it stores odate instead
of offset [11]. Commit slab for `graph_pos` is simpler, with a straightforward
allocation of 32 bits.

Note: Commit-graph chunks would still store offset and will be used to compute
corrected date on initialization.

[11]: https://lore.kernel.org/git/cfa2c367-5cd7-add5-0293-caa75b103f34@gmail.com/

### Implement backward compatible corrected commit date (June 8 - July 20, 6 weeks)

1. Implement backward compatible corrected commit date (June 8 - June 22, 2 weeks)

Topological levels are computed using a DFS in `compute_generation_numbers` of
commit-graph.c. The condition

  generation(C) = max(generation(P)) + 1,
  where C is a commit and P are its parent(s).

needs to be redefined as:

  generation(C) = max(max(generation(P)), correction(C)) + 1,
  where correction(C) is max(cdate(P) - cdate(C), 0). 

Definitions of `GENERATION_NUMBER_{INFINITY, MAX, ZERO}` remain the same.

The functions `compare_commits_by_gen_then_commit_date` and
`compare_commits_by_commit_date` can be replaced by `compare_commits_by_gen` as
all three *should* return the same order.

--> First evaluation (June 15 - June 19)

2. Update existing references (June 22 - July 20, 4 weeks)

The following functions use generation number or committer date for cutoffs:
- `contains_tag_algo`
- `can_all_from_reach_with_flag`
- `paint_down_to_common`
- `remove_redundant`

All of these can be updated with little to no changes.

--> End Semester Exams (June 29 - July 4)

--> Second evaluation (July 13 - July 17)

### Extend generation number to --ancestry-path (July 20 - August 3, 2 weeks)

`--ancestry-path A..B` returns the commits that are both descendants of A and
ancestors of B. The algorithm computes commits that are reachable by B but not
by A, reverse the graph and walk backwards from A [12].

We can use generation number to limit walks to find commits reachable by B.

[12]: https://lore.kernel.org/git/0c6b42e4-e825-ff70-a528-e118abf4c435@gmail.com/

### Examine performance improvements (August 3 - August 10, 1 week)

- Compare the performance of generation number v2 versus topological level using
  performance test suite.
- Wrap up any existing patches.

The performance report would be a nice summary my work over the summer.

## Contributions

[Microproject] Consolidate test_cmp_graph logic
-----
Log graph comparison logic is duplicated many times. This patch consolidates
comparison and sanitation logic in lib-log-graph.

Status: Merged

Patch: https://lore.kernel.org/git/20200216134750.18947-1-abhishekkumar8222@gmail.com/

Commit: https://github.com/git/git/commit/46703057c1a0f85e24c0144b38c226c6a9ccb737

I have also reviewed patches and discussed queries with other contributors:
- https://lore.kernel.org/git/CAHk66fskrfcJ0YFDhfimVBTJZB4um7r=GdQuM8heJdZtF8D7UQ@mail.gmail.com/
- https://lore.kernel.org/git/CAHk66fvt-1RaLK8E7SDpocWM9OMAcA-gP5hjHq6r5N_FbATNgA@mail.gmail.com/
- https://github.com/git/git/pull/647#issuecomment-591978405

and others.

## Post GSoC

I would love to keep contributing to git after the GSoC period ends. There`s so
much to learn from the community.

Hannes`s comment on checks as a penalty that should be paid only by constant
strbufs was a perspective I had not considered [13].

Interacting with Kyagi made me rethink the justifications _emphasizing commit
messages_. I was at my wit`s end, which makes me appreciate my patient mentors
more and want to give back to the community.

[13]: https://lore.kernel.org/git/467c035f-c7cd-01e1-e64c-2c915610de01@kdbg.org/

## About Me

I am Abhishek Kumar, a second-year CSE student at National Institute of
Technology Karnataka, India. I have a blog where I talk about my interests -
programming, fiction, and literature [14].

I primarily work with C/C++ and Ruby on Rails. I am a member of my institute`s
Open Source Club and student-built University Management System, _IRIS_. I have
some experience of mentoring - Creating their code style guide and being an
active reviewer [15].

[14]: https://abhishekkumar2718.github.io/

[15]: https://iris.nitk.ac.in/about_us

## Availablity

The official GSoC coding period runs from April 27 to August 17.

Due to the outbreak of COVID-19 in my country, my college has pre-emptively
announced summer vacations from March 17 to June 1. Unfortunately,
I would have classes for a large part of the coding period. However, I can still
contribute 35-40 hours every week due to a low course load (~20 hours a week).

I would not be able to contribute from June 29 to July 4 due to end semester exams.
It would be easily compensated during the subsequent week "semester break" from
July 5 to July 27.

## Contact Information

```
     Name: Abhishek Kumar
    Email: abhishekkumar8222@gmail.com
    Major: Computer Science And Engineering
   Degree: Bachelor of Technology
Institute: National Institute of Technology Karnataka
   Github: abhishekkumar2718
 Timezone: UTC+5:30 (IST)
```

Thank you for taking the time to review my proposal!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][GSoC] Implement Generation Number v2
  2020-03-22  9:35 [RFC][GSoC] Implement Generation Number v2 Abhishek Kumar
@ 2020-03-22 20:05 ` Jakub Narebski
  2020-03-23  4:25   ` Abhishek Kumar
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Narebski @ 2020-03-22 20:05 UTC (permalink / raw)
  To: Abhishek Kumar; +Cc: git, Derrick Stolee, Christian Couder

A few comments.

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:

> Hello everyone,
>
> Following the discussions between Jakub and Stolee [16], it was decided that
> implementing "Corrected Commit Date With Strictly Monotonic Offset" [17] was
> more appropriate for a GSoC project.
>
> I would like to work on it. The proposal could still use some work but I
> thought it would be best to get early feedback.

Thank you for your interest in this idea.

[...]
> ### Corrected Commit Date With Strictly Monotonic Offset
>
> Jakub later proposed a modification to corrected commit date, _Corrected Commit
> Date with Strictly Monotonic Offset_ defined as follows [7]:

I don't remember who originally proposed this idea, but it was not me.
I have only given it the name you use.

> For a commit C, let its _offset commit date_ (denoted by odate(C))
> be a commit date plus some offset, i.e. odate(C) = date(C) + offset(C),
> such that:
>
> 1. Offset commit date is greater than the maximum of the commit date of
>    C and the offset commit dates of its parents.
>
>     If odate(A) < odate(B), then A cannot reach B.
>
> 2. The offset of a commit is one more than the maximum offset of a parent, or
>    more.
>
>     If offset(A) < offset(B), then A cannot reach B.
>
> Since the offset are strictly greater than the offset of a parent, the old
> clients give correct values for the odate as well. `git commit-graph verify`
> would fail since the offsets are not generation numbers in all cases.
>
> This is an acceptable trade off because users can re-write the commit graph
> after verify fails.

One thing to solve is find and implement a way to distinguish between
commit-graph with generation number v1 (legacy), and commit-graph with
generation number v2.

Unfortunately for the time being we cannot use commit-graph format
version; the idea that was proposed on the mailing list (when we found
about the bug in handling commit-graph versioning, during incremental
commit-graph implementation), was to create and use metadata chunk or
versioning chunk (the final version of incremental format do not use
this mechanism).  This could be used by gen2 compatibile Git to
distinguish between situation where old commit-graph file to be updated
uses generation number v1, and when it uses v2.

If you have a better idea, please say so.

[...]
> ## Contributions
>
> [Microproject] Consolidate test_cmp_graph logic
> -----
> Log graph comparison logic is duplicated many times. This patch consolidates
> comparison and sanitation logic in lib-log-graph.
>
> Status: Merged
>
> Patch: https://lore.kernel.org/git/20200216134750.18947-1-abhishekkumar8222@gmail.com/
>
> Commit: https://github.com/git/git/commit/46703057c1a0f85e24c0144b38c226c6a9ccb737

Nice, this is related work.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][GSoC] Implement Generation Number v2
  2020-03-22 20:05 ` Jakub Narebski
@ 2020-03-23  4:25   ` Abhishek Kumar
  2020-03-23  5:32     ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Abhishek Kumar @ 2020-03-23  4:25 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, christian.couder

On Sun, Mar 22, 2020 at 09:05:26PM +0100, Jakub Narebski wrote:
> A few comments.
> 
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> 
> > Hello everyone,
> >
> > Following the discussions between Jakub and Stolee [16], it was decided that
> > implementing "Corrected Commit Date With Strictly Monotonic Offset" [17] was
> > more appropriate for a GSoC project.
> >
> > I would like to work on it. The proposal could still use some work but I
> > thought it would be best to get early feedback.
> 
> Thank you for your interest in this idea.
> 
> [...]
> > ### Corrected Commit Date With Strictly Monotonic Offset
> >
> > Jakub later proposed a modification to corrected commit date, _Corrected Commit
> > Date with Strictly Monotonic Offset_ defined as follows [7]:
> 
> I don't remember who originally proposed this idea, but it was not me.
> I have only given it the name you use.
> 

Thanks, noted. Will update it.

> > For a commit C, let its _offset commit date_ (denoted by odate(C))
> > be a commit date plus some offset, i.e. odate(C) = date(C) + offset(C),
> > such that:
> >
> > 1. Offset commit date is greater than the maximum of the commit date of
> >    C and the offset commit dates of its parents.
> >
> >     If odate(A) < odate(B), then A cannot reach B.
> >
> > 2. The offset of a commit is one more than the maximum offset of a parent, or
> >    more.
> >
> >     If offset(A) < offset(B), then A cannot reach B.
> >
> > Since the offset are strictly greater than the offset of a parent, the old
> > clients give correct values for the odate as well. `git commit-graph verify`
> > would fail since the offsets are not generation numbers in all cases.
> >
> > This is an acceptable trade off because users can re-write the commit graph
> > after verify fails.
> 
> One thing to solve is find and implement a way to distinguish between
> commit-graph with generation number v1 (legacy), and commit-graph with
> generation number v2.
> 
> Unfortunately for the time being we cannot use commit-graph format
> version; the idea that was proposed on the mailing list (when we found
> about the bug in handling commit-graph versioning, during incremental
> commit-graph implementation), was to create and use metadata chunk or
> versioning chunk (the final version of incremental format do not use
> this mechanism).  This could be used by gen2 compatibile Git to
> distinguish between situation where old commit-graph file to be updated
> uses generation number v1, and when it uses v2.
> 
> If you have a better idea, please say so.
> 

We could also use a flag file. Here's how it works:

If the file `.git/info/generation-number-v2` exists, use gen2.
Otherwise use gen1.

We can, of course generalize this to `<dir>/info/generation-number-v2`
if needed.

1. It is independent of commit-graph format. 

2. Switching between versions requires creating/deleting a file, which
is simpler than reading commit-graph file and modifying (or removing) a
metadata chunk.

Johannes used a similar flag file during the conversion of difftool [1].

[1]: https://lore.kernel.org/git/598dcfdbeef4e15d2d439053a0423589182e5f30.1479834051.git.johannes.schindelin@gmx.de/

> [...]
> > ## Contributions
> >
> > [Microproject] Consolidate test_cmp_graph logic
> > -----
> > Log graph comparison logic is duplicated many times. This patch consolidates
> > comparison and sanitation logic in lib-log-graph.
> >
> > Status: Merged
> >
> > Patch: https://lore.kernel.org/git/20200216134750.18947-1-abhishekkumar8222@gmail.com/
> >
> > Commit: https://github.com/git/git/commit/46703057c1a0f85e24c0144b38c226c6a9ccb737
> 
> Nice, this is related work.
> 
> Best,
> -- 
> Jakub Narębski

Regards
Abhishek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][GSoC] Implement Generation Number v2
  2020-03-23  4:25   ` Abhishek Kumar
@ 2020-03-23  5:32     ` Junio C Hamano
  2020-03-23 11:32       ` Abhishek Kumar
  2020-03-23 13:43       ` Jakub Narebski
  0 siblings, 2 replies; 12+ messages in thread
From: Junio C Hamano @ 2020-03-23  5:32 UTC (permalink / raw)
  To: Abhishek Kumar; +Cc: Jakub Narebski, git, christian.couder

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:

>> Unfortunately for the time being we cannot use commit-graph format
>> version; the idea that was proposed on the mailing list (when we found
>> about the bug in handling commit-graph versioning, during incremental
>> commit-graph implementation), was to create and use metadata chunk or
>> versioning chunk (the final version of incremental format do not use
>> this mechanism).  This could be used by gen2 compatibile Git to
>> distinguish between situation where old commit-graph file to be updated
>> uses generation number v1, and when it uses v2.
>> 
>> If you have a better idea, please say so.
>
> We could also use a flag file. Here's how it works:
>
> If the file `.git/info/generation-number-v2` exists, use gen2.
> Otherwise use gen1.

If the file is lost then we will try to read the other file that has
the commit-graph data as if it were in old format?  And if such a
file was created (say, with "touch .git/info/generation-number-v2"),
a file in the original format will be read as if it is in new
format?  If that is the case, it is likely that we'd see a segfault;
sounds too brittle to me.

It appears that the format of "CDAT", and the fact that generation
is represented as higher 30-bit of a be32 integer, is very much
hardcoded in the design and is hard to change, but your new version
of graph file can be designed not to use "CDAT" chunk at all, and
instead have the commit data with new version of generation numbers
stored in a different chunk (say "CDA2") to force older version of
Git not to use the new graph file---would that work?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][GSoC] Implement Generation Number v2
  2020-03-23  5:32     ` Junio C Hamano
@ 2020-03-23 11:32       ` Abhishek Kumar
  2020-03-23 13:43       ` Jakub Narebski
  1 sibling, 0 replies; 12+ messages in thread
From: Abhishek Kumar @ 2020-03-23 11:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: jnareb, git

On Sun, Mar 22, 2020 at 10:32:32PM -0700, Junio C Hamano wrote:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> 
> >> Unfortunately for the time being we cannot use commit-graph format
> >> version; the idea that was proposed on the mailing list (when we found
> >> about the bug in handling commit-graph versioning, during incremental
> >> commit-graph implementation), was to create and use metadata chunk or
> >> versioning chunk (the final version of incremental format do not use
> >> this mechanism).  This could be used by gen2 compatibile Git to
> >> distinguish between situation where old commit-graph file to be updated
> >> uses generation number v1, and when it uses v2.
> >> 
> >> If you have a better idea, please say so.
> >
> > We could also use a flag file. Here's how it works:
> >
> > If the file `.git/info/generation-number-v2` exists, use gen2.
> > Otherwise use gen1.
> 
> If the file is lost then we will try to read the other file that has
> the commit-graph data as if it were in old format?  And if such a
> file was created (say, with "touch .git/info/generation-number-v2"),
> a file in the original format will be read as if it is in new
> format?  If that is the case, it is likely that we'd see a segfault;
> sounds too brittle to me.

Agreed. Flag file has too many issues to be pursued further.

> 
> It appears that the format of "CDAT", and the fact that generation
> is represented as higher 30-bit of a be32 integer, is very much
> hardcoded in the design and is hard to change, but your new version
> of graph file can be designed not to use "CDAT" chunk at all, and
> instead have the commit data with new version of generation numbers
> stored in a different chunk (say "CDA2") to force older version of
> Git not to use the new graph file---would that work?

A commit-graph without "CDAT" chunk will hard fail on older versions of
Git. verify_commit_graph_lite() errors out if chunk_commit_data is null.

Metadata chunk seems the way to go.

Regards
Abhishek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][GSoC] Implement Generation Number v2
  2020-03-23  5:32     ` Junio C Hamano
  2020-03-23 11:32       ` Abhishek Kumar
@ 2020-03-23 13:43       ` Jakub Narebski
  2020-03-23 15:54         ` Derrick Stolee
                           ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Jakub Narebski @ 2020-03-23 13:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Abhishek Kumar, git, Christian Couder, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>> Jakub Narębski <jnareb@gmail.com> writes:
[...]
>>> Unfortunately for the time being we cannot use commit-graph format
>>> version; the idea that was proposed on the mailing list (when we found
>>> about the bug in handling commit-graph versioning, during incremental
>>> commit-graph implementation), was to create and use metadata chunk or
>>> versioning chunk (the final version of incremental format do not use
>>> this mechanism).  This could be used by gen2 compatibile Git to
>>> distinguish between situation where old commit-graph file to be updated
>>> uses generation number v1, and when it uses v2.
>>> 
>>> If you have a better idea, please say so.
>>
>> We could also use a flag file. Here's how it works:
>>
>> If the file `.git/info/generation-number-v2` exists, use gen2.
>> Otherwise use gen1.
>
> If the file is lost then we will try to read the other file that has
> the commit-graph data as if it were in old format?  And if such a
> file was created (say, with "touch .git/info/generation-number-v2"),
> a file in the original format will be read as if it is in new
> format?  If that is the case, it is likely that we'd see a segfault;
> sounds too brittle to me.
>
> It appears that the format of "CDAT", and the fact that generation
> is represented as higher 30-bit of a be32 integer, is very much
> hardcoded in the design and is hard to change, but your new version
> of graph file can be designed not to use "CDAT" chunk at all, and
> instead have the commit data with new version of generation numbers
> stored in a different chunk (say "CDA2") to force older version of
> Git not to use the new graph file---would that work?

It looks like there are a few possible ways of handling introduction of
generation numbers v2.  Let's consider them one by one.

The problem we need to solve is co-existence of old Git (that does not
understand v2, and that hard fails on commit-graph format version bump),
and new Git (that understands and writes v2, and that I assume soft
fails that is it simply doesn't use commit-graph if it of unknown
version).


If the commit-graph file was written by new Git, and includes generation
numbers v2, we want old Git to at least do not crash, possibly do not
use commit-graph, best if it can use commit-graph in suboptimal way.  We
also need to handle old Git trying to update (in incremental or
non-incremental way) the commit-graph file.

If the commit-graph file was written by old Git, and includes generation
nmbers v1 (topological levels), we want new Git to recognize this and at
best use those old generation numbers in a correct way.  We want new Git
to be able to update commit-graph file (in incremental or
non-incremental way).

Did I miss anything?


Proposed solutions are:
 - metadata / versioning chunk,
 - flag file: `.git/info/generation-number-v2`,
 - new chunk for commit data: "CDA2".

I would like to propose yet another solution: putting generation number
v2 data in a separate chunk (and possibly keeping generation number v1
in CDAT commit data chunk).  In this case we could even use ordinary
corrected commit date as generation number v2 (storing offsets as 32-bit
unsigned values), instead of backward-compatibile corrected commit date
with monotonic offsets.

Each solution has its advantages and disadvantages.


With the flag file, the problem is (as Junio noticed) that if file gets
accidentally deleted, new Git would think incorrectly that commit-graph
uses generation number v1... which while suboptimal should not be bad
thanks to backward compatibility.  But I think the flag file should have
some kind of checksum as its contents (perhaps simply a copy of
commit-graph file checksum, or one checksum per file in chain with
incremental commit-graph), so that it old Git rewrites commit-graph file
leaving flag file present, new Git would notice this.

Metadata or versioning chunk cannot be deleted by mistake; if old Git
copies unknown chunks to new updated commit-graph file instead of
skipping them we would need to add some kind of checksum (similarly to
the case for flag file).  The problem to be solved is what to do if some
files in the chain of commit-graph files have v2 (and this chunk), and
some have v1 generation number (and do not have this chunk).

About moving commit data with generation number v2 to "CDA2" chunk: if
"CDAT" chunk is missing then (I think) old Git would simply not use
commit-graph file at all; it may crash, but I don't think so.  If "CDAT"
chunk has zero length... I don't know what would happen then, possibly
also old Git would simply not use commit-graph data at all.

Putting generation number v2 into separate chunk (which might be called
"GEN2" or "OFFS"/"DOFF") has the disadvantage of increasing the on disk
size of the commit graph, and possibly also increasing memory
consumption (the latter depends on how it would be handled), but has the
advantage of being fullly backward compatibile.  Old Git would simply
use generation numbers v1 in "CDAT", new Git would use generation
numbers v2 in "OFFS" -- combining commit creation date from "CDAT" and
offset from "OFFS"), and there should be no problems with updating
commit-graph file (either rewriting, or adding new commit-graph to the
chain).

I think that's all.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][GSoC] Implement Generation Number v2
  2020-03-23 13:43       ` Jakub Narebski
@ 2020-03-23 15:54         ` Derrick Stolee
  2020-03-24  9:24           ` Jakub Narebski
  2020-03-23 16:04         ` Junio C Hamano
  2020-03-26 10:15         ` [GSoC][Proposal v2] " Abhishek Kumar
  2 siblings, 1 reply; 12+ messages in thread
From: Derrick Stolee @ 2020-03-23 15:54 UTC (permalink / raw)
  To: Jakub Narebski, Junio C Hamano; +Cc: Abhishek Kumar, git, Christian Couder

On 3/23/2020 9:43 AM, Jakub Narebski wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>>> Jakub Narębski <jnareb@gmail.com> writes:
> [...]
>>>> Unfortunately for the time being we cannot use commit-graph format
>>>> version; the idea that was proposed on the mailing list (when we found
>>>> about the bug in handling commit-graph versioning, during incremental
>>>> commit-graph implementation), was to create and use metadata chunk or
>>>> versioning chunk (the final version of incremental format do not use
>>>> this mechanism).  This could be used by gen2 compatibile Git to
>>>> distinguish between situation where old commit-graph file to be updated
>>>> uses generation number v1, and when it uses v2.
>>>>
>>>> If you have a better idea, please say so.
>>>
>>> We could also use a flag file. Here's how it works:
>>>
>>> If the file `.git/info/generation-number-v2` exists, use gen2.
>>> Otherwise use gen1.
>>
>> If the file is lost then we will try to read the other file that has
>> the commit-graph data as if it were in old format?  And if such a
>> file was created (say, with "touch .git/info/generation-number-v2"),
>> a file in the original format will be read as if it is in new
>> format?  If that is the case, it is likely that we'd see a segfault;
>> sounds too brittle to me.
>>
>> It appears that the format of "CDAT", and the fact that generation
>> is represented as higher 30-bit of a be32 integer, is very much
>> hardcoded in the design and is hard to change, but your new version
>> of graph file can be designed not to use "CDAT" chunk at all, and
>> instead have the commit data with new version of generation numbers
>> stored in a different chunk (say "CDA2") to force older version of
>> Git not to use the new graph file---would that work?
> 
> It looks like there are a few possible ways of handling introduction of
> generation numbers v2.  Let's consider them one by one.
> 
> The problem we need to solve is co-existence of old Git (that does not
> understand v2, and that hard fails on commit-graph format version bump),
> and new Git (that understands and writes v2, and that I assume soft
> fails that is it simply doesn't use commit-graph if it of unknown
> version).
> 
> 
> If the commit-graph file was written by new Git, and includes generation
> numbers v2, we want old Git to at least do not crash, possibly do not
> use commit-graph, best if it can use commit-graph in suboptimal way.  We
> also need to handle old Git trying to update (in incremental or
> non-incremental way) the commit-graph file.
> 
> If the commit-graph file was written by old Git, and includes generation
> nmbers v1 (topological levels), we want new Git to recognize this and at
> best use those old generation numbers in a correct way.  We want new Git
> to be able to update commit-graph file (in incremental or
> non-incremental way).
> 
> Did I miss anything?
> 
> 
> Proposed solutions are:
>  - metadata / versioning chunk,
>  - flag file: `.git/info/generation-number-v2`,
>  - new chunk for commit data: "CDA2".
> 
> I would like to propose yet another solution: putting generation number
> v2 data in a separate chunk (and possibly keeping generation number v1
> in CDAT commit data chunk).  In this case we could even use ordinary
> corrected commit date as generation number v2 (storing offsets as 32-bit
> unsigned values), instead of backward-compatibile corrected commit date
> with monotonic offsets.

I agree that if we are creating a new (optional) chunk, then that gets
around our versioning issues and could store just the offsets to get
the "corrected commit date" option instead of the backwards-compatible one.
By including yet another version number at the beginning of that chunk,
we could present a way to update this "second reachability index chunk"
with things like your interval mechanism with very little cost.

> Each solution has its advantages and disadvantages.
> 
> 
> With the flag file, the problem is (as Junio noticed) that if file gets
> accidentally deleted, new Git would think incorrectly that commit-graph
> uses generation number v1... which while suboptimal should not be bad
> thanks to backward compatibility.  But I think the flag file should have
> some kind of checksum as its contents (perhaps simply a copy of
> commit-graph file checksum, or one checksum per file in chain with
> incremental commit-graph), so that it old Git rewrites commit-graph file
> leaving flag file present, new Git would notice this.

I'm not a fan of the flag file idea. Optional chunks are a good way forward.
That _could_ mean the metadata chunk, whose length can grow in the future
if/when we add more fixed-width metadata values.

> Metadata or versioning chunk cannot be deleted by mistake; if old Git
> copies unknown chunks to new updated commit-graph file instead of
> skipping them we would need to add some kind of checksum (similarly to
> the case for flag file).  The problem to be solved is what to do if some
> files in the chain of commit-graph files have v2 (and this chunk), and> some have v1 generation number (and do not have this chunk).

The incremental commit-graph format is newer than our previous tests
for generation number v2, which will be a big reason why that old code
cannot be immediately adapted here.

The simplest thing to do is usually right: if we try to write a
generation number version that doesn't match the current commit-graph,
then we need to flatten the entire chain into one layer and recompute
the values from scratch. While it is _technically_ possible to mix
the backwards-compatible corrected commit date with generation number
v1, it requires taking the "lowest version" when doing comparisons and
that may behave very strangely. Better to avoid that complication.

> About moving commit data with generation number v2 to "CDA2" chunk: if
> "CDAT" chunk is missing then (I think) old Git would simply not use
> commit-graph file at all; it may crash, but I don't think so.  If "CDAT"
> chunk has zero length... I don't know what would happen then, possibly
> also old Git would simply not use commit-graph data at all.

CDAT is required as it contains more than just generation numbers. It
has the commit date, parent int-ids, and root tree oid. The generation
numbers _could_ be left as all zeroes, which is a special case for the
format before generation numbers were introduced, but it would be better
to have values there.

> Putting generation number v2 into separate chunk (which might be called
> "GEN2" or "OFFS"/"DOFF") has the disadvantage of increasing the on disk
> size of the commit graph, and possibly also increasing memory
> consumption (the latter depends on how it would be handled), but has the
> advantage of being fullly backward compatibile.  Old Git would simply
> use generation numbers v1 in "CDAT", new Git would use generation
> numbers v2 in "OFFS" -- combining commit creation date from "CDAT" and
> offset from "OFFS"), and there should be no problems with updating
> commit-graph file (either rewriting, or adding new commit-graph to the
> chain).

I share these concerns but also the locality of the data within the file.
As we parse commits, we need the parent and commit date information out
of the CDAT chunk anyway, so it is not difficult to grab the nearby
generation number. If we put that data further away in a separate chunk,
then it can be more expensive to flip between the CDAT chunk and the
GEN2 chunk.

In terms of your prototyping for performance checks, it may be good to
have a number of "GEN<X>" chunks so you can compute one commit-graph
that stores all of the candidate reachability indexes, then use one
of the chunks based on a config value or environment variable. I think
that would only be appropriate for testing if you are evaluating which
to build, so if you are focusing entirely on backwards-compatible
corrected commit date, this is not worth spending time on.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][GSoC] Implement Generation Number v2
  2020-03-23 13:43       ` Jakub Narebski
  2020-03-23 15:54         ` Derrick Stolee
@ 2020-03-23 16:04         ` Junio C Hamano
  2020-03-24 15:44           ` Jakub Narebski
  2020-03-26 10:15         ` [GSoC][Proposal v2] " Abhishek Kumar
  2 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2020-03-23 16:04 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Abhishek Kumar, git, Christian Couder, Derrick Stolee

Jakub Narebski <jnareb@gmail.com> writes:

> About moving commit data with generation number v2 to "CDA2" chunk: if
> "CDAT" chunk is missing then (I think) old Git would simply not use
> commit-graph file at all; it may crash, but I don't think so.  If "CDAT"
> chunk has zero length... I don't know what would happen then, possibly
> also old Git would simply not use commit-graph data at all.

Yeah, if it makes it crash, then we cannot use that "missing CDAT"
approach.

> Putting generation number v2 into separate chunk (which might be called
> "GEN2" or "OFFS"/"DOFF") has the disadvantage of increasing the on disk
> size of the commit graph, and possibly also increasing memory
> consumption (the latter depends on how it would be handled), but has the
> advantage of being fullly backward compatibile.  Old Git would simply
> use generation numbers v1 in "CDAT", new Git would use generation
> numbers v2 in "OFFS" -- combining commit creation date from "CDAT" and
> offset from "OFFS"),

Do we have an option *not* to record meaningful generation numbers
in CDAT and have the current Git binaries understand and still use
the rest of the graph file, while not using the optimizations that
rely on having generation numbers?  If not, then the new version of
Git that tries to be compatible with old one needs to compute both
generation numbers, and we would need to keep the topological number
for quite some time.

> and there should be no problems with updating
> commit-graph file (either rewriting, or adding new commit-graph to the
> chain).

Would merging by the current Git also work well (meaning, would
"GEN2" or whatever it does not understand be omitted)?

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][GSoC] Implement Generation Number v2
  2020-03-23 15:54         ` Derrick Stolee
@ 2020-03-24  9:24           ` Jakub Narebski
  0 siblings, 0 replies; 12+ messages in thread
From: Jakub Narebski @ 2020-03-24  9:24 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Junio C Hamano, Abhishek Kumar, git, Christian Couder

Derrick Stolee <stolee@gmail.com> writes:
> On 3/23/2020 9:43 AM, Jakub Narebski wrote:
>> Junio C Hamano <gitster@pobox.com> writes:
>>> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>>>> Jakub Narębski <jnareb@gmail.com> writes:
>> [...]
>> Proposed solutions are:
>>  - metadata / versioning chunk,
>>  - flag file: `.git/info/generation-number-v2`,
>>  - new chunk for commit data: "CDA2".
>> 
>> I would like to propose yet another solution: putting generation number
>> v2 data in a separate chunk (and possibly keeping generation number v1
>> in CDAT commit data chunk).  In this case we could even use ordinary
>> corrected commit date as generation number v2 (storing offsets as 32-bit
>> unsigned values), instead of backward-compatibile corrected commit date
>> with monotonic offsets.
>
> I agree that if we are creating a new (optional) chunk, then that gets
> around our versioning issues and could store just the offsets to get
> the "corrected commit date" option instead of the backwards-compatible one.
> By including yet another version number at the beginning of that chunk,
> we could present a way to update this "second reachability index chunk"
> with things like your interval mechanism with very little cost.

All this may be just a transitory phase, waiting until Git versions that
fail hard on commit-graph format version change die out... then we will
be able to use version number as intended (though it has the
disadvantage of turning off commit-graph wholly for older Git versions).

>> Each solution has its advantages and disadvantages.
>> 
>> 
>> With the flag file, the problem is (as Junio noticed) that if file gets
>> accidentally deleted, new Git would think incorrectly that commit-graph
>> uses generation number v1... which while suboptimal should not be bad
>> thanks to backward compatibility.  But I think the flag file should have
>> some kind of checksum as its contents (perhaps simply a copy of
>> commit-graph file checksum, or one checksum per file in chain with
>> incremental commit-graph), so that it old Git rewrites commit-graph file
>> leaving flag file present, new Git would notice this.
>
> I'm not a fan of the flag file idea. Optional chunks are a good way forward.
> That _could_ mean the metadata chunk, whose length can grow in the future
> if/when we add more fixed-width metadata values.

So it looks like metadata / versioning chunk would be the best solution,
isn't it?

>> Metadata or versioning chunk cannot be deleted by mistake; if old Git
>> copies unknown chunks to new updated commit-graph file instead of
>> skipping them we would need to add some kind of checksum (similarly to
>> the case for flag file).  The problem to be solved is what to do if some
>> files in the chain of commit-graph files have v2 (and this chunk),
>> and> some have v1 generation number (and do not have this chunk).
>
> The incremental commit-graph format is newer than our previous tests
> for generation number v2, which will be a big reason why that old code
> cannot be immediately adapted here.
>
> The simplest thing to do is usually right: if we try to write a
> generation number version that doesn't match the current commit-graph,
> then we need to flatten the entire chain into one layer and recompute
> the values from scratch. While it is _technically_ possible to mix
> the backwards-compatible corrected commit date with generation number
> v1, it requires taking the "lowest version" when doing comparisons and
> that may behave very strangely. Better to avoid that complication.

Right.

>> About moving commit data with generation number v2 to "CDA2" chunk: if
>> "CDAT" chunk is missing then (I think) old Git would simply not use
>> commit-graph file at all; it may crash, but I don't think so.  If "CDAT"
>> chunk has zero length... I don't know what would happen then, possibly
>> also old Git would simply not use commit-graph data at all.
>
> CDAT is required as it contains more than just generation numbers. It
> has the commit date, parent int-ids, and root tree oid. The generation
> numbers _could_ be left as all zeroes, which is a special case for the
> format before generation numbers were introduced, but it would be better
> to have values there.

I think (but I might be wrong) that the most expensive part of
calculating generation numbers is actually walking the commits.  Because
both generation number v1 (topological level) and generation number v2
(corrected committerdate, with or without monotonic restriction on
offsets) can be computed at the same time, during the same walk,
possibly with negligible cost compared to computing single geneation
number.

But this should be perhaps benchmarked.

>> Putting generation number v2 into separate chunk (which might be called
>> "GEN2" or "OFFS"/"DOFF") has the disadvantage of increasing the on disk
>> size of the commit graph, and possibly also increasing memory
>> consumption (the latter depends on how it would be handled), but has the
>> advantage of being fullly backward compatibile.  Old Git would simply
>> use generation numbers v1 in "CDAT", new Git would use generation
>> numbers v2 in "OFFS" -- combining commit creation date from "CDAT" and
>> offset from "OFFS"), and there should be no problems with updating
>> commit-graph file (either rewriting, or adding new commit-graph to the
>> chain).
>
> I share these concerns but also the locality of the data within the file.
> As we parse commits, we need the parent and commit date information out
> of the CDAT chunk anyway, so it is not difficult to grab the nearby
> generation number. If we put that data further away in a separate chunk,
> then it can be more expensive to flip between the CDAT chunk and the
> GEN2 chunk.

Right, I forgot about this issue, that Git is lazily parsing
commit-graph data, so keeping all [possible] commit data close is a good
idea from the performance point of view.

So it looks like metadata / versioning chunk would be the best solution
to the backward-compatibility interoperability problem.

> In terms of your prototyping for performance checks, it may be good to
> have a number of "GEN<X>" chunks so you can compute one commit-graph
> that stores all of the candidate reachability indexes, then use one
> of the chunks based on a config value or environment variable. I think
> that would only be appropriate for testing if you are evaluating which
> to build, so if you are focusing entirely on backwards-compatible
> corrected commit date, this is not worth spending time on.

Right, but it looks like there is nobody taking new labelings for GSoC
2020.

Good idea for prototyping, true.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][GSoC] Implement Generation Number v2
  2020-03-23 16:04         ` Junio C Hamano
@ 2020-03-24 15:44           ` Jakub Narebski
  2020-03-24 21:13             ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Narebski @ 2020-03-24 15:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Abhishek Kumar, git, Christian Couder, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:
> Jakub Narebski <jnareb@gmail.com> writes:
>
>> About moving commit data with generation number v2 to "CDA2" chunk: if
>> "CDAT" chunk is missing then (I think) old Git would simply not use
>> commit-graph file at all; it may crash, but I don't think so.  If "CDAT"
>> chunk has zero length... I don't know what would happen then, possibly
>> also old Git would simply not use commit-graph data at all.
>
> Yeah, if it makes it crash, then we cannot use that "missing CDAT"
> approach.

I have not tested this, but from reading the code it looks like "missing
CDAT" makes Git fail softly -- it would return NULL for the
commit-graph, and thus not use commit-graph data at all... which might
be too high a price (too high performance penalty for old Git).

>> Putting generation number v2 into separate chunk (which might be called
>> "GEN2" or "OFFS"/"DOFF") has the disadvantage of increasing the on disk
>> size of the commit graph, and possibly also increasing memory
>> consumption (the latter depends on how it would be handled), but has the
>> advantage of being fullly backward compatibile.  Old Git would simply
>> use generation numbers v1 in "CDAT", new Git would use generation
>> numbers v2 in "OFFS" -- combining commit creation date from "CDAT" and
>> offset from "OFFS"),
>
> Do we have an option *not* to record meaningful generation numbers
> in CDAT and have the current Git binaries understand and still use
> the rest of the graph file, while not using the optimizations that
> rely on having generation numbers?  If not, then the new version of
> Git that tries to be compatible with old one needs to compute both
> generation numbers, and we would need to keep the topological number
> for quite some time.

We can, as Derrick Stolee wrote, put zero (GENERATION_NUMBER_ZERO) for
generation number.  Without generation number data we lose some of
performance improvements, though.

On the other hand computing generation number v1 (topological level) and
generation number v2 ([monotonic] offset for corrected commit date)
should not be much more costly than calculating single generation
number, assuming that most of the cost is walking the commit graph.  But
this would need benchmarking.

Also, as Stolee wrote, with generation number v2 in separate chunk we
have commit data not together, but split into two areas.

>> and there should be no problems with updating
>> commit-graph file (either rewriting, or adding new commit-graph to the
>> chain).
>
> Would merging by the current Git also work well (meaning, would
> "GEN2" or whatever it does not understand be omitted)?

From the analysis of write_commit_graph_file(), it looks like unknown
chunks are simply skipped (ommitted), but I have not checked it in
practice.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC][GSoC] Implement Generation Number v2
  2020-03-24 15:44           ` Jakub Narebski
@ 2020-03-24 21:13             ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2020-03-24 21:13 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Abhishek Kumar, git, Christian Couder, Derrick Stolee

Jakub Narebski <jnareb@gmail.com> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>> Jakub Narebski <jnareb@gmail.com> writes:
>>
>>> About moving commit data with generation number v2 to "CDA2" chunk: if
>>> "CDAT" chunk is missing then (I think) old Git would simply not use
>>> commit-graph file at all; it may crash, but I don't think so.  If "CDAT"
>>> chunk has zero length... I don't know what would happen then, possibly
>>> also old Git would simply not use commit-graph data at all.
>>
>> Yeah, if it makes it crash, then we cannot use that "missing CDAT"
>> approach.
>
> I have not tested this, but from reading the code it looks like "missing
> CDAT" makes Git fail softly -- it would return NULL for the
> commit-graph, and thus not use commit-graph data at all... which might
> be too high a price (too high performance penalty for old Git).

Oh, that would not make me worried an iota.  If anything, it
encourages adoption of the new version ;-) As long as the
correctness is not violated, it is fine as a fallback behaviour.

> On the other hand computing generation number v1 (topological level) and
> generation number v2 ([monotonic] offset for corrected commit date)
> should not be much more costly than calculating single generation
> number, assuming that most of the cost is walking the commit graph.  But
> this would need benchmarking.

Good point.  If we can do so cheaply enough, there is no reason to
omit the v1 data from the file and stuff placeholder 0 in these
fields.

> Also, as Stolee wrote, with generation number v2 in separate chunk we
> have commit data not together, but split into two areas.

Yes, that is a problem.  The "empty CDAT with everything not just
generation numbers in CDA2" approach would help us move forward with
v2, while not harming older version of Git.  Just like pack bitmap,
commit-graph is designed to be a local optimization matter, and
worrying about older Git not being able to use the optimization
hints left by newer version of Git is not useful.  Quite honestly,
the only two things we should make sure are that the repository does
not appear corrupt and unusable by current Git when it has graph
files with new generation number scheme, and that manipulations by
current Git of a repository with newer graph files will not corrupt
the repository for current or newer Git.

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [GSoC][Proposal v2] Implement Generation Number v2
  2020-03-23 13:43       ` Jakub Narebski
  2020-03-23 15:54         ` Derrick Stolee
  2020-03-23 16:04         ` Junio C Hamano
@ 2020-03-26 10:15         ` Abhishek Kumar
  2 siblings, 0 replies; 12+ messages in thread
From: Abhishek Kumar @ 2020-03-26 10:15 UTC (permalink / raw)
  To: abhishekkumar8222; +Cc: git, jnareb, stolee

Hello,

Thanks to Jakub for the discussion on differentiating between generation
numbers and the review.

Changes in v2:
- Add section on distinguishing generation numbers v1 and v2.
- Update 'Plan' with new GSoC timeline.
- Fix incorrect attribution of devising backwards compatible corrected
  commit dates to Jakub.

You can alternatively read this with markdown rendered:
https://github.com/abhishekkumar2718/GSoC20/blob/master/generation_number_v2.md

Thanks
Abhishek
---

# Implement Generation Number v2

## Abstract

Git uses various clever methods for making operations on very large repositories
faster, from bitmap indices for git fetch [1], to generation numbers (also known
as topological levels) in the commit-graph file for commit graph traversal
operations like git log --graph [2].

However, generation numbers do not always improve performance. Stolee has
previously explored alternatives for a better generation number [3]. Backward
compatible corrected commit date was chosen because of its performance, local
computability, and backward compatibility.

[1]: https://githubengineering.com/counting-objects/
[2]: https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/
[3]: https://lore.kernel.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@gmail.com/

## Goals

This project aims to:
- Move `generation`, `graph_pos` members of struct commit into a commit slab.
- Examine methods to distinguish commit-graph with legacy generation number and
generation number v2.
- Update generation number to v2, modifying all places that reference or
compute generation numbers.
- Use generation numbers to speed up implementation of `--ancestry-path` and
possibly other commands.
- Extend Stolee`s performance test suite.
- Examine performance improvements.

## Overview

Git walks the commit graph for many reasons, including listing and filtering
commit history and computing merge bases.

These operations can become slow as the commit count grows. The two main costs
here are decompressing and parsing commits and walking the entire graphs to
satisfy topological order constraints.

Stolee introduced commit-graph file to accelerate this process. The commit-graph
file stores commit graph structure along extra metadata to speed up graph walks.
A chunk includes commit OID, list of parents, commit date, root tree OID and 
generation number.

Generation number v1 was defined as one more than the length of the longest path
from A to the root commit. Then,

    If A and B are commits with generation numbers N and M, and N <= M, then
    A cannot reach B.

This property reduces the time to walk commits and determine topological
relationships significantly.

### Performance Regression

While testing git v2.19.0-rc1, Stolee came across a performance regression due
to using topological levels [4]. The discussion can be summarized as:

Topological level changes the order in which the commits are inspected and
gives a performance regression if it chooses to walk a longer path than
compared to date-based heuristic.

For example, consider the performance before and after using topological levels
on `git merge-base v4.8 v4.9`:

  before topological levels: 0.122s
   after topological levels: 0.547s


The topology of this can be described as:
```
   v4.9
    |  \
    |   \
   v4.8  \
    | \   \
    |  \   |
   ...  A  B
    |  /  /
    | /  /
    |/__/
    C
```

Here, the "..." means a "very long line of commits". A and B have generation
one more than C.

If we order commits by generation, then commit date, B is removed from the queue
only after many commits reachable from v4.8 have been explored.  However, if
ordering by commit date removes A and B from the queue early and C is inserted.
As C is processed, the rest of queue is marked as STALE and loop terminates.

[4]: https://lore.kernel.org/git/pull.28.git.gitgitgadget@gmail.com/

### Alternatives to Topological Level

To improve generation numbers, Stolee investigated other alternative
definitions for generation numbers. The extended exploration and report are
available here [5].

Corrected commit date was chosen as the next generation number by overall
consensus because of its performance, local computability, and immutability.

However, it was not backward compatible with current implementation i.e., older
clients do not report correct values if they ignore reachability index version.
Existing clients would fail altogether with commit-graph v2 instead of
switching commit-graph feature off, as discovered by Ævar Arnfjörð Bjarmason [6].

[5]: https://lore.kernel.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@gmail.com/
[6]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

### Corrected Commit Date With Strictly Monotonic Offset

Corrected commit date was later modified to _Corrected Commit Date With
Strictly Monotonic Offset_, defined as follows [7]:

For a commit C, let its _offset commit date_ (denoted by odate(C))
be a commit date plus some offset, i.e. odate(C) = date(C) + offset(C),
such that:

1. Offset commit date is greater than the maximum of the commit date of
   C and the offset commit dates of its parents.

    If odate(A) < odate(B), then A cannot reach B.

2. The offset of a commit is one more than the maximum offset of a parent, or
   more.

    If offset(A) < offset(B), then A cannot reach B.

Since the offset are strictly greater than the offset of a parent, the old
clients give correct values for the odate as well. `git commit-graph verify`
would fail since the offsets are not generation numbers in all cases.

This is an acceptable trade off because users can re-write the commit graph
after verify fails.
 
A rough implementation by Jakub can be read here [8].

[7]: https://lore.kernel.org/git/86o8ziatb2.fsf_-_@gmail.com/
[8]: https://github.com/derrickstolee/git/pull/10

## Distinguishing Generation Numbers

As explained above, commit-graph format version cannot be used to distinguish
between generation numbers. We want new Git to recognize if commit-graph uses
generation v1, use those generation numbers correctly and update commit-graph
file. The solutions suggested are:

- metadata / versioning chunk: Introduce an optional chunk to store generation
number as metadata and store offsets in CDAT. For a chain of commit-graph files
with both versions, best choice would be flatten the chain and recompute values
from scratch.

- new chunk for generation number: Introduce an optional chunk to store the
offset. Fill in CDAT with either generation zero or compute topological level.
This removes the restriction of storing generation number in CDAT. However,
since Git lazily parses commit data, seperating generation number from CDAT
would increase time spent in I/O.

Metadata chunk seems like the better solution for corrected commit date.
However, generation number chunk should be revisited if later versions of
generation numbers require more than 30 bits.

## Plan

### Community bonding period (May 4 - June 1)

1. Extend Stolee`s performance test suite.

To investigate replacements for generation numbers, Stolee implemented
prototypes and compared performance on a range of queries.

It currently tests:
- `git log --topo-order -N`: measures commits walked to determine all possible
ways of arriving at given commit.
- `git log --topo-order -10 A..B`: measures commits walked to determine commits
reachable from B but not from A.
- `git merge-base A B`: measures commits walked to find the lowest (closest)
ancestor for both A and B.

We can also add tests for:
- `git branch --[no-]contains A`: measures commits walked to determine if A is
reachable from the tip of a given branch for all branches (many to one query).
- `git branch --[no-]merged A`: measures commits walked to determine if the tip
of a given branch is reachable from A for all branches (one to many query).
- `git push/fetch`: measures commits walked to compute HAVES/WANTS of a client
(many to many query) [9].

2. Find other operations that can be optimized using generation numbers.

As Jakub pointed out in the discussion on generation numbers, --ancestry-path
can be optimized using generation numbers [10]. Discuss and find any other
operation which could use generation numbers.

3. Explore commit, commit-graph, commit-reach, revision

[9]: https://github.blog/2015-09-22-counting-objects/
[10]: https://lore.kernel.org/git/861s82wdbp.fsf@gmail.com/

### Move `generation`, `graph_pos` into a slab (June 1 - June 15, 2 weeks)

struct commit is used in many contexts. However, the struct members
`generation` and `graph_pos` are only used for commit-graph related operations
and otherwise waste memory.

Using a commit slab reduces space of an individual commit object, which becomes
crucial as we load thousands of them. We can dynamically associate commits
with their generation/graph positions when needed.

Commit slab for `generation` is 64 bits wide as it stores odate instead
of offset [11]. Commit slab for `graph_pos` is simpler, with a straightforward
allocation of 32 bits.

Note: Commit-graph chunks would still store offset and will be used to compute
corrected date on initialization.

[11]: https://lore.kernel.org/git/cfa2c367-5cd7-add5-0293-caa75b103f34@gmail.com/

### Implement backward compatible corrected commit date (June 15 - August 2, 7 weeks)

1. Implement metadata chunk (June 15 - June 29, 2 week)

2. Implement backward compatible corrected commit date (June 29 - July 13, 2 weeks)

Topological levels are computed using a DFS in `compute_generation_numbers` of
commit-graph.c. The condition

  generation(C) = max(generation(P)) + 1,
  where C is a commit and P are its parent(s).

needs to be redefined as:

  generation(C) = max(max(generation(P)), correction(C)) + 1,
  where correction(C) is max(cdate(P) - cdate(C), 0). 

Definitions of `GENERATION_NUMBER_{INFINITY, MAX, ZERO}` remain the same.

The functions `compare_commits_by_gen_then_commit_date` and
`compare_commits_by_commit_date` can be replaced by `compare_commits_by_gen` as
all three *should* return the same order.

--> First evaluation (June 29 - July 3)

--> End Semester Exams (June 29 - July 4)

3. Update existing references (July 13 - August 2 , 3 weeks)

The following functions use generation number or committer date for cutoffs:
- `contains_tag_algo`
- `can_all_from_reach_with_flag`
- `paint_down_to_common`
- `remove_redundant`

All of these can be updated with little to no changes.

--> Second evaluation (July 27 - July 31)

### Extend generation number to --ancestry-path (August 3 - August 16, 2 weeks)

`--ancestry-path A..B` returns the commits that are both descendants of A and
ancestors of B. The algorithm computes commits that are reachable by B but not
by A, reverse the graph and walk backwards from A [12].

We can use generation number to limit walks to find commits reachable by B.

[12]: https://lore.kernel.org/git/0c6b42e4-e825-ff70-a528-e118abf4c435@gmail.com/

### Examine performance improvements (August 17 - August 24, 1 week)

- Compare the performance of generation number v2 versus topological level using
  performance test suite.
- Wrap up any existing patches.

The performance report would be a nice summary my work over the summer.

## Contributions

[Microproject] Consolidate test_cmp_graph logic
-----
Log graph comparison logic is duplicated many times. This patch consolidates
comparison and sanitation logic in lib-log-graph.

Status: Merged

Patch: https://lore.kernel.org/git/20200216134750.18947-1-abhishekkumar8222@gmail.com/

Commit: https://github.com/git/git/commit/46703057c1a0f85e24c0144b38c226c6a9ccb737

I have also reviewed patches and discussed queries with other contributors:
- https://lore.kernel.org/git/CAHk66fskrfcJ0YFDhfimVBTJZB4um7r=GdQuM8heJdZtF8D7UQ@mail.gmail.com/
- https://lore.kernel.org/git/CAHk66fvt-1RaLK8E7SDpocWM9OMAcA-gP5hjHq6r5N_FbATNgA@mail.gmail.com/
- https://github.com/git/git/pull/647#issuecomment-591978405

and others.

## Post GSoC

I would love to keep contributing to git after the GSoC period ends. There`s so
much to learn from the community.

Hannes`s comment on checks as a penalty that should be paid only by constant
strbufs was a perspective I had not considered [13].

Interacting with Kyagi made me rethink the justifications _emphasizing commit
messages_. I was at my wit`s end, which makes me appreciate my patient mentors
more and want to give back to the community.

[13]: https://lore.kernel.org/git/467c035f-c7cd-01e1-e64c-2c915610de01@kdbg.org/

## About Me

I am Abhishek Kumar, a second-year CSE student at National Institute of
Technology Karnataka, India. I have a blog where I talk about my interests -
programming, fiction, and literature [14].

I primarily work with C/C++ and Ruby on Rails. I am a member of my institute`s
Open Source Club and student-built University Management System, _IRIS_. I have
some experience of mentoring - Creating their code style guide and being an
active reviewer [15].

[14]: https://abhishekkumar2718.github.io/

[15]: https://iris.nitk.ac.in/about_us

## Availablity

The official GSoC coding period runs from June 1 to August 24.

Due to the outbreak of COVID-19 in my country, my college has pre-emptively
announced summer vacations from March 17 to June 1. Unfortunately,
I would have classes for a large part of the coding period. However, I can still
contribute 35-40 hours every week due to a low course load (~20 hours a week).

I would not be able to contribute from June 29 to July 4 due to end semester
exams.  It would be easily compensated during the subsequent "semester break"
from July 5 to July 27.

## Contact Information

```
     Name: Abhishek Kumar
    Email: abhishekkumar8222@gmail.com
    Major: Computer Science And Engineering
   Degree: Bachelor of Technology
Institute: National Institute of Technology Karnataka
   Github: abhishekkumar2718
 Timezone: UTC+5:30 (IST)
```

Thank you for taking the time to review my proposal!

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-03-26 10:18 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-22  9:35 [RFC][GSoC] Implement Generation Number v2 Abhishek Kumar
2020-03-22 20:05 ` Jakub Narebski
2020-03-23  4:25   ` Abhishek Kumar
2020-03-23  5:32     ` Junio C Hamano
2020-03-23 11:32       ` Abhishek Kumar
2020-03-23 13:43       ` Jakub Narebski
2020-03-23 15:54         ` Derrick Stolee
2020-03-24  9:24           ` Jakub Narebski
2020-03-23 16:04         ` Junio C Hamano
2020-03-24 15:44           ` Jakub Narebski
2020-03-24 21:13             ` Junio C Hamano
2020-03-26 10:15         ` [GSoC][Proposal v2] " Abhishek Kumar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).