Git Mailing List Archive on lore.kernel.org
 help / color / Atom feed
From: jnareb@gmail.com (Jakub Narębski)
To: "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, "Derrick Stolee" <stolee@gmail.com>,
	"Taylor Blau" <me@ttaylorr.com>,
	"Abhishek Kumar" <abhishekkumar8222@gmail.com>,
	"Jakub Narębski" <jnareb@gmail.com>
Subject: Re: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date
Date: Sun, 23 Aug 2020 17:27:46 +0200
Message-ID: <85blj1e619.fsf@gmail.com> (raw)
In-Reply-To: <85zh6uxh7l.fsf@gmail.com> ("Jakub \=\?utf-8\?Q\?Nar\=C4\=99bski\=22\?\= \=\?utf-8\?Q\?'s\?\= message of "Mon, 17 Aug 2020 02:13:18 +0200")

Hello,

Here is a summary of my comments and thoughts after carefully reviewing
all patches in the series.

Jakub Narębski <jnareb@gmail.com> writes:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]
>> The Corrected Commit Date is defined as:
>>
>> For a commit C, let its corrected commit date be the maximum of the commit
>> date of C and the corrected commit dates of its parents.
>
> Actually it needs to be "corrected commit dates of its parents plus 1"
> to fulfill the reachability condition for a generation number for a
> commit:
>
>       A can reach B   =>  gen(A) < gen(B)
>
> Of course it can be computed in simpler way, because
>
>   max_P (gen(P) + 1)  ==  max_P (gen(P)) + 1

I see that it is defined correctly in the documentation, which is more
important than cover letter (which would not be stored in the repository
for memory, even in the commit message for the merge).

[...]
> However I think the cover letter should also describe what should happen
> in a mixed version environment (for example new Git on command line,
> copy of old Git used by GUI client), and in particular what should
> happen in a mixed-chain case - both for reading and for writing the
> commit-graph file.
>
> For *writing*: because old Git would create commit-graph layers without
> the GDAT chunk, to simplify the behavior and make easy to reason about
> commit-graph data (the situation should be not that common, and
> transient -- it should get more rare as the time goes), we want the
> following behavior from new Git:
>
> - If top layer contains the GDAT chunk, or we are rewriting commit-graph
>   file (--split=replace), or we are merging layers and there are no
>   layers without GDAT chunk below set of layers that are merged, then
>
>      write commit-graph file or commit-graph layer with GDAT chunk,

Actually we can simplify handling of merging layers, while still
retaining the property that in mixed-version chain all GDAT-full layers
are before / below GDAT-less layers.

Namely, if merging layers, and at least one layer being merged doesn't
have GDAT chunk, then the result of merge wouldn't have either.

We can always switch to slightly more complicated behavior proposed
above in quoted part later, perhaps as a followup commit.

>
>   otherwise
>
>      write commit-graph layer without GDAT chunk.
>
>   This means that there are commit-graph layers without GDAT chunk if
>   and only if the top layer is also without GDAT chunk.

This might be not necessary if Git would always check the whole chain of
split commit-graph layers for presence of GDAT-less layers.

But I still think it is a good idea to avoid having GDAT-less "holes".

>
> For *reading* we want to use generation number v2 (corrected commit
> date) if possible, and fall back to generation number v1 (topological
> levels).
>
> - If the top layer contains the GDAT chunk (or maybe even if the topmost
>   layer that involves all commits in question, not necessarily the top
>   layer in the full commit-graph chain), then use generation number v2
>
>   - commit_graph_data->generation stores corrected commit date,
>     computed as sum of committer date (from CDAT) and offset (from GDAT)

Or stored directly in GDAT, at the cost of increasing the file size by
at most 7% (if I have done my math correctly).

See also the issue with clamping offsets (GENERATION_NUMBER_V2_OFFSET_MAX).

>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - there is no need for committer date heuristics, and no need for
>     limiting use of generation number to where there is a cutoff (to not
>     hamper performance).
>
> - If there are layers without GDAT chunks, which thanks to the write
>   behavior means simply top layer without GDAT chunk, we need to turn
>   off use of generation numbers or fall back to using topological levels
>
>   - commit_graph_data->generation stores topological levels,
>     taken from CDAT chunk (30-bits)
>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - we probably want to keep tie-breaking of sorting by generation
>     number via committer date, and limit use of generation number as
>     opposed to using committer date heuristics (with slop) to not make
>     performance worse.

And this is being done in this patch series.  Good!

The thing I was worrying about turned out to be non-issue, as the
comparison function in question is used only when writing, and in this
case we have corrected commit date computer - though perhaps not being
written (as far as I understand it, but I might be mistaken).

[...]
>>
>> Abhishek Kumar (11):
>>   commit-graph: fix regression when computing bloom filter

No problems, maybe just expanding a commit message and/or adding a comment.

>>   revision: parse parent in indegree_walk_step()

Looks good to me.

>>   commit-graph: consolidate fill_commit_graph_info

I think this commit could be split into three:
- fix to the 'generate tar with future mtime' test
  as it is a hidden bug (when using commit-graph)
- simplify implementation of fill_commit_in_graph()
  by using fill_commit_graph_info()
- move loading date into fill_commit_graph_info()
  that uncovers the issue with 'generate tar with future mtime'

On the other hand because they are inter-related, those changes might be
kept in a single commit.

In commit message greater care needs to be taken with
fill_commit_graph_info() and fill_commit_in_graph(), when to use one and
when to use the other. For example it is fill_commit_graph_info() that
changes its behavior, and it is fill_commit_in_graph() that is getting
simplified.

>>   commit-graph: consolidate compare_commits_by_gen

Looks good to me, though it might be good idea to add comments about the
sorting order (inside comment) to appropriate header files.

>>   commit-graph: return 64-bit generation number

This conversion misses one place, though it would be changed to use
topological levels slab in next patch.

This patch also unnecessary introduces GENERATION_NUMBER_V1_INFINITY.
There is no need for it: `generation` field can always simply use
GENERATION_NUMBER_INFINITY for commits not in commit-graph.

>>   commit-graph: add a slab to store topological levels

This is the patch that needs GENERATION_NUMBER_V1_INFINITY (or
TOPOLOGICAL_LEVEL_INFINITY), not the previous patch.

Detailed analysis uncovered hidden bug in the code of
compute_generation_numbers() that works only because of historical
reasons (that topological levels in Git start from 1, not from 0). The
problem is that the 'level' / 'generation' variable for commits not in
graph, and therefore ones that needs to have its generation number
computed, is 0 (default value on commit slab) and not
GENERATION_NUMBER*_INFINITY as it should.

This issue is present since moving commit graph info data to
commit-slab.  We can simply document it and ignore (it works, if by
accident), or try to fix it.

We need to handle GENERATION_NUMBER*_MAX clamping carefully
in the future patches.

I think also that this patch needs a bit more detailed commit message.

>>   commit-graph: implement corrected commit date

Looks good, though verify_commit_graph() now verifies *a* generation
number used, be it v1 (topological level) or v2 (corrected commit date),
so the variable rename is unnecessary.  We verify that they fulfill the
reachability condition promise, that is gen(parent) <= gen(commit),
(the '=' is to include GENERATION_NUMBER*_MAX case), as it is what is
used to speed up graph traversal.

We probably want to verify both topological level values in CDAT, and if
they exist also corrected commit date values in GDAT.  But that might be
left for the future commit.

>>   commit-graph: implement generation data chunk

To save up to 6-7% on commit-graph file size we store 32-bits corrected
commit date offsets, instead of storing 64-bits corrected commit date.

However, as far as I understand it, using non-monotonic values for
on-disk storage with limited field size, like 32-bits corrected
commit date offset, leads to problems with GENERATION_NUMBER*_MAX.
Namely, as I have written in detail in my reply to patch 11/11 in this
series, there is no way to fulfill the reachability condition promise if
we have to store offset which true value do not fit in 32-bits reserved
for it.

This is extremly unlikely to happen in practice, but we need to be able
to handle it somehow.  We can store 64-bit corrected commit date, which
has graph-monotonic values, and the problem goes away in theory and in
practice (we would never have values that do not fit).  We can keep
storing 32-bit offsets, and simply do not use GDAT chunk if there is
offset value that do not fit.

All this of course, provided that I am not wrong about this issue...

>>   commit-graph: use generation v2 only if entire chain does

The commit message do not say anything about the *writing* side.

However, if we want to keep the requirement that GDAT-less layers in the
split commit-graph chain must all be above any GDAT-containing layers,
we need to consider how we want layer merging to behave.  We could
either opt for using GDAT whenever possible, or similify code and skip
using GDAT if we are unsure.

The first approach would mean that if the topmost layer below set of
layers being rewritten (in the split commit-graph chain) exists, and it
does not contain GDAT chunk, then and only then the result of rewrite
should not have GDAT chunk either.

The second approach is, I think, simpler: if any of layers that is being
rewritten is GDAT-less (we need to check only the top layer, though),
and we are not running full rewrite ('--split=replace'), then the result
of rewrite should not have GDAT chunk either.  We can switch to the
first algorithm in later commit.

Whether we choose one or the other, we need test that doing layer
merging do not break GDAT-inness requirement stated above.


Also, we can probably test that we are not using v1 and v2 at the same
time with tests involving --is-ancestor, or --contains / --merged.

>>   commit-reach: use corrected commit dates in paint_down_to_common()

I think this commit could be split into two:
- disable commit graph for entire t6024-recursive-merge test
- use corrected commit dates in paint_down_to_common()

On the other hand because they are inter-related, those changes might be
better kept in a single commit.

It would be nice to have some benchmark data, or at least stating that
performance does not change (within the error bounds), using for example
'git merge-base v4.8 v4.9' in Linux kernel repository.

>>   doc: add corrected commit date info

Looks good, there are a few places where 'generation number' (referring
to the v1 version) should have been replaced with 'topological level'.

I am also unsure how the GDAT chunk can be "later modified to store
future generation number related data.".  I'd like to have an example,
or for this statement to be removed (if it turns out to not be true, not
without introducing yet another chunk type).

>>  18 files changed, 396 insertions(+), 185 deletions(-)

Thanks for all your work.

Best regards,
--
Jakub Narębski

  parent reply index

Thread overview: 129+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-28  9:13 [PATCH 0/6] " Abhishek Kumar via GitGitGadget
2020-07-28  9:13 ` [PATCH 1/6] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
2020-07-28 15:28   ` Taylor Blau
2020-07-30  5:24     ` Abhishek Kumar
2020-08-04  0:46   ` Jakub Narębski
2020-08-04  0:56     ` Taylor Blau
2020-08-04 10:10       ` Jakub Narębski
2020-08-04  7:55     ` Jakub Narębski
2020-07-28  9:13 ` [PATCH 2/6] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-07-28 13:00   ` Derrick Stolee
2020-07-28 15:30     ` Taylor Blau
2020-08-05 23:16   ` Jakub Narębski
2020-07-28  9:13 ` [PATCH 3/6] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-07-28 13:14   ` Derrick Stolee
2020-07-28 15:19     ` René Scharfe
2020-07-28 15:58       ` Derrick Stolee
2020-07-28 16:01     ` Taylor Blau
2020-07-30  6:07     ` Abhishek Kumar
2020-07-28  9:13 ` [PATCH 4/6] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
2020-07-28 16:03   ` Taylor Blau
2020-07-28  9:13 ` [PATCH 5/6] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-07-28 16:12   ` Taylor Blau
2020-07-30  6:52     ` Abhishek Kumar
2020-07-28  9:13 ` [PATCH 6/6] commit-graph: implement corrected commit date offset Abhishek Kumar via GitGitGadget
2020-07-28 15:55   ` Derrick Stolee
2020-07-28 16:23     ` Taylor Blau
2020-07-30  7:27     ` Abhishek Kumar
2020-07-28 14:54 ` [PATCH 0/6] [GSoC] Implement Corrected Commit Date Taylor Blau
2020-07-30  7:47   ` Abhishek Kumar
2020-07-28 16:35 ` Derrick Stolee
2020-08-09  2:53 ` [PATCH v2 00/10] " Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 01/10] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 04/10] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 05/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-08-10 16:28     ` Derrick Stolee
2020-08-11 11:03       ` Abhishek Kumar
2020-08-11 12:27         ` Derrick Stolee
2020-08-11 18:58           ` Taylor Blau
2020-08-09  2:53   ` [PATCH v2 06/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 07/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
2020-08-10 14:23     ` Derrick Stolee
2020-08-14  4:59       ` Abhishek Kumar
2020-08-14 12:24         ` Derrick Stolee
2020-08-09  2:53   ` [PATCH v2 08/10] commit-graph: handle mixed generation commit chains Abhishek Kumar via GitGitGadget
2020-08-10 16:42     ` Derrick Stolee
2020-08-11 11:36       ` Abhishek Kumar
2020-08-11 12:43         ` Derrick Stolee
2020-08-09  2:53   ` [PATCH v2 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2020-08-09  2:53   ` [PATCH v2 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
2020-08-10 16:47   ` [PATCH v2 00/10] [GSoC] Implement Corrected Commit Date Derrick Stolee
2020-08-15 16:39   ` [PATCH v3 00/11] " Abhishek Kumar via GitGitGadget
2020-08-15 16:39     ` [PATCH v3 01/11] commit-graph: fix regression when computing bloom filter Abhishek Kumar via GitGitGadget
2020-08-17 22:30       ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-08-18 14:18       ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-08-19 17:54       ` Jakub Narębski
2020-08-21  4:11         ` Abhishek Kumar
2020-08-25 11:11           ` Jakub Narębski
2020-09-01 11:35             ` Abhishek Kumar
2020-08-15 16:39     ` [PATCH v3 04/11] commit-graph: consolidate compare_commits_by_gen Abhishek Kumar via GitGitGadget
2020-08-17 13:22       ` Derrick Stolee
2020-08-21 11:05       ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 05/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
2020-08-21 13:14       ` Jakub Narębski
2020-08-25  5:04         ` Abhishek Kumar
2020-08-25 12:18           ` Jakub Narębski
2020-09-01 12:06             ` Abhishek Kumar
2020-09-03 13:42               ` Jakub Narębski
2020-09-05 17:21                 ` Abhishek Kumar
2020-09-13 15:39                   ` Jakub Narębski
2020-09-28 21:48                     ` Jakub Narębski
2020-10-05  5:25                       ` Abhishek Kumar
2020-08-15 16:39     ` [PATCH v3 06/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
2020-08-21 18:43       ` Jakub Narębski
2020-08-25  6:14         ` Abhishek Kumar
2020-08-25  7:33           ` Jakub Narębski
2020-08-25  7:56             ` Jakub Narębski
2020-09-01 10:26               ` Abhishek Kumar
2020-09-03  9:25                 ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
2020-08-22  0:05       ` Jakub Narębski
2020-08-25  6:49         ` Abhishek Kumar
2020-08-25 10:07           ` Jakub Narębski
2020-09-01 11:01             ` Abhishek Kumar
2020-08-15 16:39     ` [PATCH v3 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-08-22 13:09       ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
2020-08-22 17:14       ` Jakub Narębski
2020-08-26  7:15         ` Abhishek Kumar
2020-08-26 10:38           ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2020-08-22 19:09       ` Jakub Narębski
2020-09-01 10:08         ` Abhishek Kumar
2020-09-03 19:11           ` Jakub Narębski
2020-08-15 16:39     ` [PATCH v3 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
2020-08-22 22:20       ` Jakub Narębski
2020-08-27  6:39         ` Abhishek Kumar
2020-08-27 12:43           ` Jakub Narębski
2020-08-27 13:15           ` Derrick Stolee
2020-09-01 13:01             ` Abhishek Kumar
2020-08-17  0:13     ` [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date Jakub Narębski
     [not found]       ` <CANQwDwdKp7oKy9BeKdvKhwPUiq0R5MS8TCw-eWGCYCoMGv=G-g@mail.gmail.com>
2020-08-17  1:32         ` Fwd: " Taylor Blau
2020-08-17  7:56           ` Jakub Narębski
2020-08-18  6:12       ` Abhishek Kumar
2020-08-23 15:27       ` Jakub Narębski [this message]
2020-08-24  2:49         ` Abhishek Kumar
2020-10-07 14:09     ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
2020-10-07 14:09       ` [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
2020-10-24 23:16         ` Jakub Narębski
2020-10-25 20:58           ` Taylor Blau
2020-10-07 14:09       ` [PATCH v4 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
2020-10-24 23:41         ` Jakub Narębski
2020-10-07 14:09       ` [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
2020-10-25 10:52         ` Jakub Narębski
2020-10-27  6:33           ` Abhishek Kumar
2020-10-07 14:09       ` [PATCH v4 04/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
2020-10-25 13:48         ` Jakub Narębski
2020-10-07 14:09       ` [PATCH v4 05/10] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
2020-10-25 22:17         ` Jakub Narębski
2020-10-07 14:09       ` [PATCH v4 06/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
2020-10-27 18:53         ` Jakub Narębski
2020-10-07 14:09       ` [PATCH v4 07/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
2020-10-30 12:45         ` Jakub Narębski
2020-10-07 14:09       ` [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
2020-10-07 14:09       ` [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2020-10-07 14:09       ` [PATCH v4 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=85blj1e619.fsf@gmail.com \
    --to=jnareb@gmail.com \
    --cc=abhishekkumar8222@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Git Mailing List Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/git/0 git/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 git git/ https://lore.kernel.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.git


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git