* [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters
2020-10-07 14:09 ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09 ` Abhishek Kumar via GitGitGadget
2020-10-24 23:16 ` Jakub Narębski
2020-10-07 14:09 ` [PATCH v4 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
` (10 subsequent siblings)
11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
commit_gen_cmp is used when writing a commit-graph to sort commits in
generation order before computing Bloom filters. Since c49c82aa (commit:
move members graph_pos, generation to a slab, 2020-06-17) made it so
that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
during writing, we cannot call it within this function. Instead, access
the generation number directly through the slab (i.e., by calling
'commit_graph_data_at(c)->generation') in order to access it while
writing.
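A minimal sketch of the difference between the getter and direct slab access, written as a toy model rather than git's actual code. The struct, the TOY_ constants and the function names are all hypothetical, and the assumption that the getter refuses to trust commits that were not read from an on-disk graph is my reading of the description above, not something this patch states.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sentinel values, chosen only for this sketch. */
#define TOY_GENERATION_INFINITY 0xFFFFFFFFu
#define TOY_NOT_FROM_GRAPH      0xFFFFFFFFu

/* Toy stand-in for a per-commit slab entry (not git's real layout). */
struct toy_graph_data {
	uint32_t graph_pos;  /* position in an on-disk graph, if any */
	uint32_t generation; /* generation number computed so far */
};

/* Models the getter: only trusts commits that came from a graph file. */
static uint32_t toy_commit_graph_generation(const struct toy_graph_data *d)
{
	if (!d || d->graph_pos == TOY_NOT_FROM_GRAPH)
		return TOY_GENERATION_INFINITY;
	return d->generation;
}

/* Models direct slab access: returns whatever was computed during the write. */
static uint32_t toy_slab_generation(const struct toy_graph_data *d)
{
	return d->generation;
}

/* A commit being written: generation already computed, no graph position yet. */
void toy_getter_demo(void)
{
	struct toy_graph_data d = { TOY_NOT_FROM_GRAPH, 7 };

	assert(toy_commit_graph_generation(&d) == TOY_GENERATION_INFINITY);
	assert(toy_slab_generation(&d) == 7);
}
```

Under these assumptions, a comparator built on the getter sees the same value for every commit being written, while one built on the slab value sees the generation numbers that were just computed.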
While measuring performance with `git commit-graph write --reachable
--changed-paths` on the linux repository led to around 1m40s for both
HEAD and master (and could be due to fault in my measurements), it is
still the "right" thing to do.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..94503e584b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
const struct commit *a = *(const struct commit **)va;
const struct commit *b = *(const struct commit **)vb;
- uint32_t generation_a = commit_graph_generation(a);
- uint32_t generation_b = commit_graph_generation(b);
+ uint32_t generation_a = commit_graph_data_at(a)->generation;
+ uint32_t generation_b = commit_graph_data_at(b)->generation;
/* lower generation commits first */
if (generation_a < generation_b)
return -1;
--
gitgitgadget
^ permalink raw reply related [flat|nested] 211+ messages in thread
* Re: [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters
2020-10-07 14:09 ` [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
@ 2020-10-24 23:16 ` Jakub Narębski
2020-10-25 20:58 ` Taylor Blau
0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-10-24 23:16 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar, Garima Singh,
Jeff King
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> commit_gen_cmp is used when writing a commit-graph to sort commits in
> generation order before computing Bloom filters. Since c49c82aa (commit:
> move members graph_pos, generation to a slab, 2020-06-17) made it so
> that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
> during writing, we cannot call it within this function. Instead, access
> the generation number directly through the slab (i.e., by calling
> 'commit_graph_data_at(c)->generation') in order to access it while
> writing.
This description is all right, but I think it can be made more clear:
When running `git commit-graph write --reachable --changed-paths` to
compute Bloom filters for changed paths, commits are first sorted by
generation number using 'commit_gen_cmp()'. Commits with similar
generation are more likely to have many trees in common, making the
diff faster, see 3d112755.
However, since c49c82aa (commit: move members graph_pos, generation to
a slab, 2020-06-17) made it so that 'commit_graph_generation()'
returns 'GENERATION_NUMBER_INFINITY' during writing, we cannot call it
within this function. Instead, access the generation number directly
through the slab (i.e., by calling 'commit_graph_data_at(c)->generation')
in order to access it while writing.
Or something like that.
We should also add an explanation of why avoiding the getter is safe here,
perhaps by adding the following lines to the second paragraph:
It is safe to do because 'commit_gen_cmp()' from commit-graph.c is
static and used only when writing Bloom filters, and because writing
changed-paths filters is done after computing generation numbers (if
necessary).
Or something like that.
>
> While measuring performance with `git commit-graph write --reachable
> --changed-paths` on the linux repository led to around 1m40s for both
> HEAD and master (and could be due to fault in my measurements), it is
> still the "right" thing to do.
I had to read the above paragraph several times to understand it,
possibly because I have expected here to be a fix for a performance
regression. The commit message for 3d112755 (commit-graph: examine
commits by generation number) describes reduction of computation time
from 3m00s to 1m37s. So I would expect performance with HEAD (i.e.
before those changes) to be around 3m, not the same before and after
changes being around 1m40s.
Can anyone recheck this before-and-after benchmark, please?
Anyway, it might be more clear to write it as the following:
On the Linux kernel repository, this patch didn't reduce the
computation time for 'git commit-graph write --reachable
--changed-paths', which is around 1m40s both before and after this
change. This could be a fault in my measurements; it is still the
"right" thing to do.
Or something like that.
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
Anyway, it is nice and clear change.
> commit-graph.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index cb042bdba8..94503e584b 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
> const struct commit *a = *(const struct commit **)va;
> const struct commit *b = *(const struct commit **)vb;
>
> - uint32_t generation_a = commit_graph_generation(a);
> - uint32_t generation_b = commit_graph_generation(b);
> + uint32_t generation_a = commit_graph_data_at(a)->generation;
> + uint32_t generation_b = commit_graph_data_at(b)->generation;
> /* lower generation commits first */
> if (generation_a < generation_b)
> return -1;
Best,
--
Jakub Narębski
* Re: [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters
2020-10-24 23:16 ` Jakub Narębski
@ 2020-10-25 20:58 ` Taylor Blau
2020-11-03 5:36 ` Abhishek Kumar
0 siblings, 1 reply; 211+ messages in thread
From: Taylor Blau @ 2020-10-25 20:58 UTC (permalink / raw)
To: Jakub Narębski
Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee,
Taylor Blau, Abhishek Kumar, Garima Singh, Jeff King
On Sun, Oct 25, 2020 at 01:16:48AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > While measuring performance with `git commit-graph write --reachable
> > --changed-paths` on the linux repository led to around 1m40s for both
> > HEAD and master (and could be due to fault in my measurements), it is
> > still the "right" thing to do.
>
> I had to read the above paragraph several times to understand it,
> possibly because I have expected here to be a fix for a performance
> regression. The commit message for 3d112755 (commit-graph: examine
> commits by generation number) describes reduction of computation time
> from 3m00s to 1m37s. So I would expect performance with HEAD (i.e.
> before those changes) to be around 3m, not the same before and after
> changes being around 1m40s.
>
> Can anyone recheck this before-and-after benchmark, please?
My hunch is that our heuristic to fall back to the commit's 'date'
value is saving us here. commit_gen_cmp() first compares the generation
numbers, breaking ties by 'date' as a heuristic. But since all
generation number queries return GENERATION_NUMBER_INFINITY during
writing, we're relying on our heuristic entirely.
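The fallback described above can be sketched with a toy comparator. This is an illustration under assumptions: the struct and function names are hypothetical, and only the ordering logic (generation first, date as tie-breaker) follows the description in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy commit carrying only the two fields the comparator looks at. */
struct toy_commit {
	uint32_t generation;
	uint64_t date;
};

/* Lower generation first; equal generations are ordered by date. */
static int toy_gen_cmp(const void *va, const void *vb)
{
	const struct toy_commit *a = *(const struct toy_commit *const *)va;
	const struct toy_commit *b = *(const struct toy_commit *const *)vb;

	if (a->generation < b->generation)
		return -1;
	if (a->generation > b->generation)
		return 1;

	/* Generations tie: use the date as a heuristic. */
	if (a->date < b->date)
		return -1;
	if (a->date > b->date)
		return 1;
	return 0;
}

/* When every generation is the same sentinel, the sort is purely by date. */
void toy_gen_cmp_demo(void)
{
	struct toy_commit x = { 0xFFFFFFFFu, 300 };
	struct toy_commit y = { 0xFFFFFFFFu, 100 };
	struct toy_commit z = { 0xFFFFFFFFu, 200 };
	const struct toy_commit *list[] = { &x, &y, &z };

	qsort(list, 3, sizeof(list[0]), toy_gen_cmp);
	assert(list[0] == &y && list[1] == &z && list[2] == &x);
}
```

If all three commits report the same generation, as in the demo, the first two branches never fire and the result is a plain date sort, which would be consistent with the small timing difference reported here.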
I haven't looked much further than that, other than to see that I could
get about a ~4sec speed-up with this patch as compared to v2.29.1 in the
computing Bloom filters region on the kernel.
> Anyway, it might be more clear to write it as the following:
>
> On the Linux kernel repository, this patch didn't reduce the
> computation time for 'git commit-graph write --reachable
> --changed-paths', which is around 1m40s both before and after this
> change. This could be a fault in my measurements; it is still the
> "right" thing to do.
>
> Or something like that.
Assuming that we are in fact being saved by the "date" heuristic, I'd
probably write the following commit message instead:
Before computing Bloom filters, the commit-graph machinery uses
commit_gen_cmp to sort commits by generation order for improved diff
performance. 3d11275505 (commit-graph: examine commits by generation
number, 2020-03-30) claims that this sort can reduce the time spent to
compute Bloom filters by nearly half.
But since c49c82aa4c (commit: move members graph_pos, generation to a
slab, 2020-06-17), this optimization is broken, since asking for
'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
while writing.
Not all hope is lost, though: 'commit_gen_cmp()' falls
back to comparing commits by their date when they have equal generation
number, and so since c49c82aa4c it is purely a date comparison function.
This heuristic is good enough that we don't seem to lose appreciable
performance while computing Bloom filters. [Benchmark that we lose
about ~4sec before/after c49c82aa4c9...]
So, avoid the useless 'commit_graph_generation()' while writing by
instead accessing the slab directly. This returns the newly-computed
generation numbers, and allows us to avoid the heuristic by directly
comparing generation numbers.
Thanks,
Taylor
* Re: [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters
2020-10-25 20:58 ` Taylor Blau
@ 2020-11-03 5:36 ` Abhishek Kumar
0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-03 5:36 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, gitgitgadget, jnareb, abhishekkumar8222
On Sun, Oct 25, 2020 at 04:58:14PM -0400, Taylor Blau wrote:
> On Sun, Oct 25, 2020 at 01:16:48AM +0200, Jakub Narębski wrote:
> > "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> > > While measuring performance with `git commit-graph write --reachable
> > > --changed-paths` on the linux repository led to around 1m40s for both
> > > HEAD and master (and could be due to fault in my measurements), it is
> > > still the "right" thing to do.
> >
> > I had to read the above paragraph several times to understand it,
> > possibly because I have expected here to be a fix for a performance
> > regression. The commit message for 3d112755 (commit-graph: examine
> > commits by generation number) describes reduction of computation time
> > from 3m00s to 1m37s. So I would expect performance with HEAD (i.e.
> > before those changes) to be around 3m, not the same before and after
> > changes being around 1m40s.
> >
> > Can anyone recheck this before-and-after benchmark, please?
>
> My hunch is that our heuristic to fall back to the commit's 'date'
> value is saving us here. commit_gen_cmp() first compares the generation
> numbers, breaking ties by 'date' as a heuristic. But since all
> generation number queries return GENERATION_NUMBER_INFINITY during
> writing, we're relying on our heuristic entirely.
>
> I haven't looked much further than that, other than to see that I could
> get about a ~4sec speed-up with this patch as compared to v2.29.1 in the
> computing Bloom filters region on the kernel.
>
Thanks for benchmarking it. I wasn't sure if I am testing it correctly
or the patch made no difference.
> > Anyway, it might be more clear to write it as the following:
> >
> > On the Linux kernel repository, this patch didn't reduce the
> > computation time for 'git commit-graph write --reachable
> > --changed-paths', which is around 1m40s both before and after this
> > change. This could be a fault in my measurements; it is still the
> > "right" thing to do.
> >
> > Or something like that.
>
> Assuming that we are in fact being saved by the "date" heuristic, I'd
> probably write the following commit message instead:
>
> Before computing Bloom filters, the commit-graph machinery uses
> commit_gen_cmp to sort commits by generation order for improved diff
> performance. 3d11275505 (commit-graph: examine commits by generation
> number, 2020-03-30) claims that this sort can reduce the time spent to
> compute Bloom filters by nearly half.
>
> But since c49c82aa4c (commit: move members graph_pos, generation to a
> slab, 2020-06-17), this optimization is broken, since asking for
> 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
> while writing.
>
> Not all hope is lost, though: 'commit_gen_cmp()' falls
> back to comparing commits by their date when they have equal generation
> number, and so since c49c82aa4c it is purely a date comparison function.
> This heuristic is good enough that we don't seem to lose appreciable
> performance while computing Bloom filters. [Benchmark that we lose
> about ~4sec before/after c49c82aa4c9...]
>
> So, avoid the useless 'commit_graph_generation()' while writing by
> instead accessing the slab directly. This returns the newly-computed
> generation numbers, and allows us to avoid the heuristic by directly
> comparing generation numbers.
>
That's a lot better, will change.
> Thanks,
> Taylor
* [PATCH v4 02/10] revision: parse parent in indegree_walk_step()
2020-10-07 14:09 ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
2020-10-07 14:09 ` [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09 ` Abhishek Kumar via GitGitGadget
2020-10-24 23:41 ` Jakub Narębski
2020-10-07 14:09 ` [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
` (9 subsequent siblings)
11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
revision.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/revision.c b/revision.c
index aa62212040..c97abcdde1 100644
--- a/revision.c
+++ b/revision.c
@@ -3381,6 +3381,9 @@ static void indegree_walk_step(struct rev_info *revs)
struct commit *parent = p->item;
int *pi = indegree_slab_at(&info->indegree, parent);
+ if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
+ return;
+
if (*pi)
(*pi)++;
else
--
gitgitgadget
* Re: [PATCH v4 02/10] revision: parse parent in indegree_walk_step()
2020-10-07 14:09 ` [PATCH v4 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-10-24 23:41 ` Jakub Narębski
0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-10-24 23:41 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In indegree_walk_step(), we add unvisited parents to the indegree queue.
> However, parents are not guaranteed to be parsed. As the indegree queue
> sorts by generation number, let's parse parents before inserting them to
> ensure the correct priority order.
All right, we need to ensure the parent commit is parsed to know its
generation number, to insert it into the priority queue in the correct order.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Looks good.
> ---
> revision.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/revision.c b/revision.c
> index aa62212040..c97abcdde1 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3381,6 +3381,9 @@ static void indegree_walk_step(struct rev_info *revs)
> struct commit *parent = p->item;
> int *pi = indegree_slab_at(&info->indegree, parent);
>
> + if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
> + return;
> +
> if (*pi)
> (*pi)++;
> else
Best,
--
Jakub Narębski
* [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info
2020-10-07 14:09 ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
2020-10-07 14:09 ` [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
2020-10-07 14:09 ` [PATCH v4 02/10] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09 ` Abhishek Kumar via GitGitGadget
2020-10-25 10:52 ` Jakub Narębski
2020-10-07 14:09 ` [PATCH v4 04/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
` (8 subsequent siblings)
11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().
fill_commit_graph_info() used to not load committer data from commit data
chunk. However, with the corrected committer date, we have to load
committer date to calculate generation number value.
e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2 ^ 36 + 1) seconds since
EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.
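A small sketch of the 34-bit date layout, assuming host-order words in place of get_be32() reads. The pack helpers are hypothetical and mask the high bits, so the sketch only illustrates truncation on read; it makes no claim about what the real writer does with out-of-range dates. The unpack logic mirrors the masks and shifts used in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packer: generation in the upper 30 bits, date bits 32..33 below. */
static uint32_t toy_pack_high(uint32_t generation, uint64_t date)
{
	return (generation << 2) | (uint32_t)((date >> 32) & 0x3);
}

/* Hypothetical packer: the low 32 bits of the date. */
static uint32_t toy_pack_low(uint64_t date)
{
	return (uint32_t)date;
}

/* Same masks and shifts as the reader in the diff. */
static uint64_t toy_unpack_date(uint32_t high_word, uint32_t low_word)
{
	uint64_t date_high = high_word & 0x3;
	uint64_t date_low = low_word;

	return (date_high << 32) | date_low;
}

static uint32_t toy_unpack_generation(uint32_t high_word)
{
	return high_word >> 2;
}

void toy_cdat_demo(void)
{
	uint64_t fits = UINT64_C(17179869183);    /* 2^34 - 1 */
	uint64_t too_big = UINT64_C(68719476737); /* 2^36 + 1 */

	/* A 34-bit date survives the round trip, and so does the generation. */
	assert(toy_unpack_date(toy_pack_high(5, fits), toy_pack_low(fits)) == fits);
	assert(toy_unpack_generation(toy_pack_high(5, fits)) == 5);

	/* Only the low 34 bits of 2^36 + 1 survive, leaving a date of 1. */
	assert(toy_unpack_date(toy_pack_high(5, too_big), toy_pack_low(too_big)) == 1);
}
```

The second case is why a date read back from the commit data chunk cannot reproduce the (2 ^ 36 + 1) timestamp used by the existing test.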
The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and loads committer date from the object
database, generating a tar file with the expected mtime.
However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to lower 34-bits) to populate the generation
number. Thus, Git sets date and generates tar file with the truncated
mtime.
The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.
Thus, setting a timestamp of 2 ^ 33 + 1 would overflow the 11-octal
digit implementations while still fitting into commit data chunk.
Since we want to test 12-octal digit implementations of ustar as well,
let's modify the existing test to no longer use commit-graph file.
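The digit counts behind this reasoning can be checked with a short sketch. The helper names are hypothetical; the two timestamps are the ones used by the tests in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Number of octal digits needed to print value (at least one). */
static int toy_octal_digits(uint64_t value)
{
	int digits = 1;

	while ((value >>= 3) != 0)
		digits++;
	return digits;
}

/* True if value fits in the 34 bits available for the committer date. */
static int toy_fits_in_34_bits(uint64_t value)
{
	return value < (UINT64_C(1) << 34);
}

void toy_ustar_demo(void)
{
	/* 2^33 - 1 is the largest value that fits in 11 octal digits. */
	assert(toy_octal_digits(UINT64_C(8589934591)) == 11);

	/* New test: needs 12 octal digits, still fits in 34 bits. */
	assert(toy_octal_digits(UINT64_C(17179869183)) == 12);
	assert(toy_fits_in_34_bits(UINT64_C(17179869183)));

	/* Existing test: needs 13 octal digits, does not fit in 34 bits. */
	assert(toy_octal_digits(UINT64_C(68719476737)) == 13);
	assert(!toy_fits_in_34_bits(UINT64_C(68719476737)));
}
```

So the new timestamp overflows an 11-digit mtime field while remaining representable in the commit data chunk, and the existing one overflows a 12-digit field but needs the commit-graph turned off to be read back intact.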
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 27 ++++++++++-----------------
t/t5000-tar-tree.sh | 20 +++++++++++++++++++-
2 files changed, 29 insertions(+), 18 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 94503e584b..e8362e144e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -749,15 +749,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
const unsigned char *commit_data;
struct commit_graph_data *graph_data;
uint32_t lex_index;
+ uint64_t date_high, date_low;
while (pos < g->num_commits_in_base)
g = g->base_graph;
+ if (pos >= g->num_commits + g->num_commits_in_base)
+ die(_("invalid commit position. commit-graph is likely corrupt"));
+
lex_index = pos - g->num_commits_in_base;
commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
graph_data = commit_graph_data_at(item);
graph_data->graph_pos = pos;
+
+ date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+ date_low = get_be32(commit_data + g->hash_len + 12);
+ item->date = (timestamp_t)((date_high << 32) | date_low);
+
graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
}
@@ -772,38 +781,22 @@ static int fill_commit_in_graph(struct repository *r,
{
uint32_t edge_value;
uint32_t *parent_data_ptr;
- uint64_t date_low, date_high;
struct commit_list **pptr;
- struct commit_graph_data *graph_data;
const unsigned char *commit_data;
uint32_t lex_index;
while (pos < g->num_commits_in_base)
g = g->base_graph;
- if (pos >= g->num_commits + g->num_commits_in_base)
- die(_("invalid commit position. commit-graph is likely corrupt"));
+ fill_commit_graph_info(item, g, pos);
- /*
- * Store the "full" position, but then use the
- * "local" position for the rest of the calculation.
- */
- graph_data = commit_graph_data_at(item);
- graph_data->graph_pos = pos;
lex_index = pos - g->num_commits_in_base;
-
commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
item->object.parsed = 1;
set_commit_tree(item, NULL);
- date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
- date_low = get_be32(commit_data + g->hash_len + 12);
- item->date = (timestamp_t)((date_high << 32) | date_low);
-
- graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
pptr = &item->parents;
edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 3ebb0d3b65..8f41cdc509 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -431,11 +431,29 @@ test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
test_cmp expect actual
'
+test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
+ rm -f .git/index &&
+ echo foo >file &&
+ git add file &&
+ GIT_COMMITTER_DATE="@17179869183 +0000" \
+ git commit -m "tempori parendum"
+'
+
+test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
+ git archive HEAD >future.tar
+'
+
+test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
+ echo 2514 >expect &&
+ tar_info future.tar | cut -d" " -f2 >actual &&
+ test_cmp expect actual
+'
+
test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
rm -f .git/index &&
echo content >file &&
git add file &&
- GIT_COMMITTER_DATE="@68719476737 +0000" \
+ GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
git commit -m "tempori parendum"
'
--
gitgitgadget
* Re: [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info
2020-10-07 14:09 ` [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-10-25 10:52 ` Jakub Narębski
2020-10-27 6:33 ` Abhishek Kumar
0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-10-25 10:52 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar
Hi Abhishek,
In short: everything is all right, except for the now duplicated test
names in t5000 after this commit.
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Both fill_commit_graph_info() and fill_commit_in_graph() parse
> information present in commit data chunk. Let's simplify the
> implementation by calling fill_commit_graph_info() within
> fill_commit_in_graph().
>
> fill_commit_graph_info() used to not load committer data from commit data
> chunk. However, with the corrected committer date, we have to load
> committer date to calculate generation number value.
Nice writeup, however the last sentence would in my opinion read better
in the future tense: we don't use generation number v2 yet. For
example:
However, with upcoming switch to using corrected committer date as
generation number v2, we will have to load committer date to compute
generation number value anyway.
Or something like that - notice the minor addition and changes.
The following is slightly unrelated change, but we agreed that it would
be better to not separate them; the need for change to the t5000 test is
caused by the change described above.
>
> e51217e15 (t5000: test tar files that overflow ustar headers,
> 30-06-2016) introduced a test 'generate tar with future mtime' that
> creates a commit with committer date of (2 ^ 36 + 1) seconds since
> EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
> committer time overflows into generation number (within CDAT chunk) and
> has undefined behavior.
>
> The test used to pass as fill_commit_graph_info() would not set struct
> member `date` of struct commit and loads committer date from the object
> database, generating a tar file with the expected mtime.
I think it should be s/loads/load/, as in "would load", but I am not a
native English speaker.
>
> However, with corrected commit date, we will load the committer date
> from CDAT chunk (truncated to lower 34-bits) to populate the generation
> number. Thus, Git sets date and generates tar file with the truncated
> mtime.
>
> The ustar format (the header format used by most modern tar programs)
> only has room for 11 (or 12, depending on some implementations) octal
> digits for the size and mtime of each file.
>
> Thus, setting a timestamp of 2 ^ 33 + 1 would overflow the 11-octal
> digit implementations while still fitting into commit data chunk.
>
> Since we want to test 12-octal digit implementations of ustar as well,
> let's modify the existing test to no longer use commit-graph file.
The description above does not make it entirely clear to me that we
add a new test for handling possible 11-octal digit overflow, nearly
identical to the existing one, and turn off use of the commit-graph file
for the test that checks handling of 12-octal digit overflow.
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
> commit-graph.c | 27 ++++++++++-----------------
> t/t5000-tar-tree.sh | 20 +++++++++++++++++++-
> 2 files changed, 29 insertions(+), 18 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 94503e584b..e8362e144e 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -749,15 +749,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> const unsigned char *commit_data;
> struct commit_graph_data *graph_data;
> uint32_t lex_index;
> + uint64_t date_high, date_low;
>
> while (pos < g->num_commits_in_base)
> g = g->base_graph;
>
> + if (pos >= g->num_commits + g->num_commits_in_base)
> + die(_("invalid commit position. commit-graph is likely corrupt"));
> +
> lex_index = pos - g->num_commits_in_base;
> commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
>
> graph_data = commit_graph_data_at(item);
> graph_data->graph_pos = pos;
> +
> + date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> + date_low = get_be32(commit_data + g->hash_len + 12);
> + item->date = (timestamp_t)((date_high << 32) | date_low);
> +
> graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> }
>
> @@ -772,38 +781,22 @@ static int fill_commit_in_graph(struct repository *r,
> {
> uint32_t edge_value;
> uint32_t *parent_data_ptr;
> - uint64_t date_low, date_high;
> struct commit_list **pptr;
> - struct commit_graph_data *graph_data;
> const unsigned char *commit_data;
> uint32_t lex_index;
>
> while (pos < g->num_commits_in_base)
> g = g->base_graph;
>
> - if (pos >= g->num_commits + g->num_commits_in_base)
> - die(_("invalid commit position. commit-graph is likely corrupt"));
> + fill_commit_graph_info(item, g, pos);
>
> - /*
> - * Store the "full" position, but then use the
> - * "local" position for the rest of the calculation.
> - */
> - graph_data = commit_graph_data_at(item);
> - graph_data->graph_pos = pos;
> lex_index = pos - g->num_commits_in_base;
> -
> commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
>
> item->object.parsed = 1;
>
> set_commit_tree(item, NULL);
>
> - date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> - date_low = get_be32(commit_data + g->hash_len + 12);
> - item->date = (timestamp_t)((date_high << 32) | date_low);
> -
> - graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> -
> pptr = &item->parents;
>
> edge_value = get_be32(commit_data + g->hash_len);
All right, looks good for me.
Here second change begins.
> diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> index 3ebb0d3b65..8f41cdc509 100755
> --- a/t/t5000-tar-tree.sh
> +++ b/t/t5000-tar-tree.sh
> @@ -431,11 +431,29 @@ test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
> test_cmp expect actual
> '
>
> +test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> + rm -f .git/index &&
> + echo foo >file &&
> + git add file &&
> + GIT_COMMITTER_DATE="@17179869183 +0000" \
> + git commit -m "tempori parendum"
> +'
> +
> +test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
> + git archive HEAD >future.tar
> +'
> +
> +test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> + echo 2514 >expect &&
> + tar_info future.tar | cut -d" " -f2 >actual &&
> + test_cmp expect actual
> +'
> +
Everything is all right, except we now have duplicated test names.
Perhaps in the three following tests we should use 'far-far-future
commit' and 'far future mtime' in place of the current 'far-future commit'
and 'future mtime' for tests checking handling of 12-octal-digit
overflow, or add a description of how far the future is, for example
'far-future commit (2^11 + 1)', etc.
> test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> rm -f .git/index &&
> echo content >file &&
> git add file &&
> - GIT_COMMITTER_DATE="@68719476737 +0000" \
> + GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
> git commit -m "tempori parendum"
> '
Best,
--
Jakub Narębski
* Re: [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info
2020-10-25 10:52 ` Jakub Narębski
@ 2020-10-27 6:33 ` Abhishek Kumar
0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-10-27 6:33 UTC (permalink / raw)
To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee, me
Hello Dr. Narębski,
On Sun, Oct 25, 2020 at 11:52:42AM +0100, Jakub Narębski wrote:
> Hi Abhishek,
>
> In short: everything is all right, except for the now duplicated test
> names in t5000 after this commit.
>
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > Both fill_commit_graph_info() and fill_commit_in_graph() parse
> > information present in commit data chunk. Let's simplify the
> > implementation by calling fill_commit_graph_info() within
> > fill_commit_in_graph().
> >
> > fill_commit_graph_info() used to not load committer data from commit data
> > chunk. However, with the corrected committer date, we have to load
> > committer date to calculate generation number value.
>
> Nice writeup, however the last sentence would in my opinion read better
> in the future tense: we don't use generation number v2 yet. For
> example:
>
> However, with upcoming switch to using corrected committer date as
> generation number v2, we will have to load committer date to compute
> generation number value anyway.
>
> Or something like that - notice the minor addition and changes.
>
Thanks for the change, it looks better!
> The following is slightly unrelated change, but we agreed that it would
> be better to not separate them; the need for change to the t5000 test is
> caused by the change described above.
>
> >
> > e51217e15 (t5000: test tar files that overflow ustar headers,
> > 30-06-2016) introduced a test 'generate tar with future mtime' that
> > creates a commit with committer date of (2 ^ 36 + 1) seconds since
> > EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
> > committer time overflows into generation number (within CDAT chunk) and
> > has undefined behavior.
> >
> > The test used to pass as fill_commit_graph_info() would not set struct
> > member `date` of struct commit and loads committer date from the object
> > database, generating a tar file with the expected mtime.
>
> I think it should be s/loads/load/, as in "would load", but I am not a
> native English speaker.
>
That's correct - since I have used "would not set" in the first half of
the sentence, the latter half should follow suit too.
> >
> > However, with corrected commit date, we will load the committer date
> > from CDAT chunk (truncated to lower 34-bits) to populate the generation
> > number. Thus, Git sets date and generates tar file with the truncated
> > mtime.
> >
> > The ustar format (the header format used by most modern tar programs)
> > only has room for 11 (or 12, depending on some implementations) octal
> > digits for the size and mtime of each file.
> >
> > Thus, setting a timestamp of 2 ^ 33 + 1 would overflow the 11-octal
> > digit implementations while still fitting into commit data chunk.
> >
> > Since we want to test 12-octal digit implementations of ustar as well,
> > let's modify the existing test to no longer use commit-graph file.
>
> The description above does not make it entirely clear to me that we
> add a new test for handling possible 11-octal-digit overflow, nearly
> identical to the existing one, and turn off use of the commit-graph file
> for the test that checks handling of 12-octal-digit overflow.
>
Revised the last paragraphs to:
The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.
To test the 11-octal digit implementation, we create a future commit
with committer date of 2^34 - 1, which overflows 11-octal digits
without overflowing 34-bits of the Commit Data chunk.
To test the 12-octal digit implementation, the smallest committer date
possible is 2^36, which overflows the Commit Data chunk and thus
commit-graph must be disabled for the test.
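The arithmetic behind those two dates can be checked with a small sketch.
The helper and macro names below are hypothetical and not part of Git; the
33- and 36-bit limits simply follow from 11 and 12 octal digits at 3 bits
each, and the 34-bit width is the one quoted above for the Commit Data chunk.

```c
#include <stdint.h>

/* Hypothetical helper, not part of Git: does `value` fit in `bits` bits? */
static int fits_in_bits(uint64_t value, unsigned bits)
{
	return bits >= 64 || (value >> bits) == 0;
}

/* Each octal digit carries 3 bits. */
#define USTAR_11_DIGIT_BITS (11 * 3)	/* 33 bits */
#define USTAR_12_DIGIT_BITS (12 * 3)	/* 36 bits */
#define CDAT_DATE_BITS 34		/* committer date width in the Commit Data chunk */

/* Committer dates used by the two t5000 setups discussed in this thread. */
static const uint64_t date_11 = (UINT64_C(1) << 34) - 1;	/* 17179869183 */
static const uint64_t date_12 = (UINT64_C(1) << 36) + 1;	/* 68719476737 */
```

With these definitions, date_11 does not fit in 33 bits but does fit in 34,
so the commit-graph can stay enabled for that test, while date_12 fits in
neither 36 nor 34 bits.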
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> > commit-graph.c | 27 ++++++++++-----------------
> > t/t5000-tar-tree.sh | 20 +++++++++++++++++++-
> > 2 files changed, 29 insertions(+), 18 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 94503e584b..e8362e144e 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -749,15 +749,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> > const unsigned char *commit_data;
> > struct commit_graph_data *graph_data;
> > uint32_t lex_index;
> > + uint64_t date_high, date_low;
> >
> > while (pos < g->num_commits_in_base)
> > g = g->base_graph;
> >
> > + if (pos >= g->num_commits + g->num_commits_in_base)
> > + die(_("invalid commit position. commit-graph is likely corrupt"));
> > +
> > lex_index = pos - g->num_commits_in_base;
> > commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
> >
> > graph_data = commit_graph_data_at(item);
> > graph_data->graph_pos = pos;
> > +
> > + date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > + date_low = get_be32(commit_data + g->hash_len + 12);
> > + item->date = (timestamp_t)((date_high << 32) | date_low);
> > +
> > graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > }
> >
> > @@ -772,38 +781,22 @@ static int fill_commit_in_graph(struct repository *r,
> > {
> > uint32_t edge_value;
> > uint32_t *parent_data_ptr;
> > - uint64_t date_low, date_high;
> > struct commit_list **pptr;
> > - struct commit_graph_data *graph_data;
> > const unsigned char *commit_data;
> > uint32_t lex_index;
> >
> > while (pos < g->num_commits_in_base)
> > g = g->base_graph;
> >
> > - if (pos >= g->num_commits + g->num_commits_in_base)
> > - die(_("invalid commit position. commit-graph is likely corrupt"));
> > + fill_commit_graph_info(item, g, pos);
> >
> > - /*
> > - * Store the "full" position, but then use the
> > - * "local" position for the rest of the calculation.
> > - */
> > - graph_data = commit_graph_data_at(item);
> > - graph_data->graph_pos = pos;
> > lex_index = pos - g->num_commits_in_base;
> > -
> > commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
> >
> > item->object.parsed = 1;
> >
> > set_commit_tree(item, NULL);
> >
> > - date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
> > - date_low = get_be32(commit_data + g->hash_len + 12);
> > - item->date = (timestamp_t)((date_high << 32) | date_low);
> > -
> > - graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > -
> > pptr = &item->parents;
> >
> > edge_value = get_be32(commit_data + g->hash_len);
>
> All right, looks good for me.
>
> Here second change begins.
>
> > diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
> > index 3ebb0d3b65..8f41cdc509 100755
> > --- a/t/t5000-tar-tree.sh
> > +++ b/t/t5000-tar-tree.sh
> > @@ -431,11 +431,29 @@ test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
> > test_cmp expect actual
> > '
> >
> > +test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> > + rm -f .git/index &&
> > + echo foo >file &&
> > + git add file &&
> > + GIT_COMMITTER_DATE="@17179869183 +0000" \
> > + git commit -m "tempori parendum"
> > +'
> > +
> > +test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
> > + git archive HEAD >future.tar
> > +'
> > +
> > +test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
> > + echo 2514 >expect &&
> > + tar_info future.tar | cut -d" " -f2 >actual &&
> > + test_cmp expect actual
> > +'
> > +
>
> Everything is all right, except we now have duplicated test names.
>
> Perhaps in the three following tests we should use 'far-far-future
> commit' and 'far future mtime' in place of current 'far-future commit'
> and 'future mtime' for tests checking handling of 12-octal-digit
> overflow, or add a description of how far the future is, for example
> 'far-future commit (2^11 + 1)', etc.
>
Changed, thanks for pointing this out.
> > test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
> > rm -f .git/index &&
> > echo content >file &&
> > git add file &&
> > - GIT_COMMITTER_DATE="@68719476737 +0000" \
> > + GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
> > git commit -m "tempori parendum"
> > '
>
> Best,
> --
> Jakub Narębski
Thanks
- Abhishek
* [PATCH v4 04/10] commit-graph: return 64-bit generation number
2020-10-07 14:09 ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
` (2 preceding siblings ...)
2020-10-07 14:09 ` [PATCH v4 03/10] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09 ` Abhishek Kumar via GitGitGadget
2020-10-25 13:48 ` Jakub Narębski
2020-10-07 14:09 ` [PATCH v4 05/10] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
` (7 subsequent siblings)
11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In a preparatory step, let's return timestamp_t values from
commit_graph_generation(), use timestamp_t for local variables and
define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.
We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.
With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 22 +++++++++++-----------
commit-graph.h | 4 ++--
commit-reach.c | 36 ++++++++++++++++++------------------
commit-reach.h | 2 +-
commit.c | 4 ++--
commit.h | 4 ++--
revision.c | 10 +++++-----
upload-pack.c | 2 +-
8 files changed, 42 insertions(+), 42 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index e8362e144e..bfc532de6f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -99,7 +99,7 @@ uint32_t commit_graph_position(const struct commit *c)
return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
}
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
{
struct commit_graph_data *data =
commit_graph_data_slab_peek(&commit_graph_data_slab, c);
@@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
const struct commit *a = *(const struct commit **)va;
const struct commit *b = *(const struct commit **)vb;
- uint32_t generation_a = commit_graph_data_at(a)->generation;
- uint32_t generation_b = commit_graph_data_at(b)->generation;
+ const timestamp_t generation_a = commit_graph_data_at(a)->generation;
+ const timestamp_t generation_b = commit_graph_data_at(b)->generation;
/* lower generation commits first */
if (generation_a < generation_b)
return -1;
@@ -1350,7 +1350,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
_("Computing commit graph generation numbers"),
ctx->commits.nr);
for (i = 0; i < ctx->commits.nr; i++) {
- uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+ timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
display_progress(ctx->progress, i + 1);
if (generation != GENERATION_NUMBER_INFINITY &&
@@ -1383,8 +1383,8 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
data->generation = max_generation + 1;
pop_commit(&list);
- if (data->generation > GENERATION_NUMBER_MAX)
- data->generation = GENERATION_NUMBER_MAX;
+ if (data->generation > GENERATION_NUMBER_V1_MAX)
+ data->generation = GENERATION_NUMBER_V1_MAX;
}
}
}
@@ -2404,8 +2404,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
for (i = 0; i < g->num_commits; i++) {
struct commit *graph_commit, *odb_commit;
struct commit_list *graph_parents, *odb_parents;
- uint32_t max_generation = 0;
- uint32_t generation;
+ timestamp_t max_generation = 0;
+ timestamp_t generation;
display_progress(progress, i + 1);
hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2469,11 +2469,11 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
continue;
/*
- * If one of our parents has generation GENERATION_NUMBER_MAX, then
- * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
+ * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
+ * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
* extra logic in the following condition.
*/
- if (max_generation == GENERATION_NUMBER_MAX)
+ if (max_generation == GENERATION_NUMBER_V1_MAX)
max_generation--;
generation = commit_graph_generation(graph_commit);
diff --git a/commit-graph.h b/commit-graph.h
index f8e92500c6..8be247fa35 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -144,12 +144,12 @@ void disable_commit_graph(struct repository *r);
struct commit_graph_data {
uint32_t graph_pos;
- uint32_t generation;
+ timestamp_t generation;
};
/*
* Commits should be parsed before accessing generation, graph positions.
*/
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
uint32_t commit_graph_position(const struct commit *);
#endif
diff --git a/commit-reach.c b/commit-reach.c
index 50175b159e..20b48b872b 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
static struct commit_list *paint_down_to_common(struct repository *r,
struct commit *one, int n,
struct commit **twos,
- int min_generation)
+ timestamp_t min_generation)
{
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
- uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+ timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
if (!min_generation)
queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
struct commit *commit = prio_queue_get(&queue);
struct commit_list *parents;
int flags;
- uint32_t generation = commit_graph_generation(commit);
+ timestamp_t generation = commit_graph_generation(commit);
if (min_generation && generation > last_gen)
- BUG("bad generation skip %8x > %8x at %s",
+ BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
generation, last_gen,
oid_to_hex(&commit->object.oid));
last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
repo_parse_commit(r, array[i]);
for (i = 0; i < cnt; i++) {
struct commit_list *common;
- uint32_t min_generation = commit_graph_generation(array[i]);
+ timestamp_t min_generation = commit_graph_generation(array[i]);
if (redundant[i])
continue;
for (j = filled = 0; j < cnt; j++) {
- uint32_t curr_generation;
+ timestamp_t curr_generation;
if (i == j || redundant[j])
continue;
filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
{
struct commit_list *bases;
int ret = 0, i;
- uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
+ timestamp_t generation, max_generation = GENERATION_NUMBER_INFINITY;
if (repo_parse_commit(r, commit))
return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
static enum contains_result contains_test(struct commit *candidate,
const struct commit_list *want,
struct contains_cache *cache,
- uint32_t cutoff)
+ timestamp_t cutoff)
{
enum contains_result *cached = contains_cache_at(cache, candidate);
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
{
struct contains_stack contains_stack = { 0, 0, NULL };
enum contains_result result;
- uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+ timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
const struct commit_list *p;
for (p = want; p; p = p->next) {
- uint32_t generation;
+ timestamp_t generation;
struct commit *c = p->item;
load_commit_graph_info(the_repository, c);
generation = commit_graph_generation(c);
@@ -566,8 +566,8 @@ static int compare_commits_by_gen(const void *_a, const void *_b)
const struct commit *a = *(const struct commit * const *)_a;
const struct commit *b = *(const struct commit * const *)_b;
- uint32_t generation_a = commit_graph_generation(a);
- uint32_t generation_b = commit_graph_generation(b);
+ timestamp_t generation_a = commit_graph_generation(a);
+ timestamp_t generation_b = commit_graph_generation(b);
if (generation_a < generation_b)
return -1;
@@ -580,7 +580,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
unsigned int with_flag,
unsigned int assign_flag,
time_t min_commit_date,
- uint32_t min_generation)
+ timestamp_t min_generation)
{
struct commit **list = NULL;
int i;
@@ -681,13 +681,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
struct commit_list *from_iter = from, *to_iter = to;
int result;
- uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+ timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
while (from_iter) {
add_object_array(&from_iter->item->object, NULL, &from_objs);
if (!parse_commit(from_iter->item)) {
- uint32_t generation;
+ timestamp_t generation;
if (from_iter->item->date < min_commit_date)
min_commit_date = from_iter->item->date;
@@ -701,7 +701,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
while (to_iter) {
if (!parse_commit(to_iter->item)) {
- uint32_t generation;
+ timestamp_t generation;
if (to_iter->item->date < min_commit_date)
min_commit_date = to_iter->item->date;
@@ -741,13 +741,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
struct commit_list *found_commits = NULL;
struct commit **to_last = to + nr_to;
struct commit **from_last = from + nr_from;
- uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+ timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
int num_to_find = 0;
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
for (item = to; item < to_last; item++) {
- uint32_t generation;
+ timestamp_t generation;
struct commit *c = *item;
parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a31..148b56fea5 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
unsigned int with_flag,
unsigned int assign_flag,
time_t min_commit_date,
- uint32_t min_generation);
+ timestamp_t min_generation);
int can_all_from_reach(struct commit_list *from, struct commit_list *to,
int commit_date_cutoff);
diff --git a/commit.c b/commit.c
index f53429c0ac..3b488381d5 100644
--- a/commit.c
+++ b/commit.c
@@ -731,8 +731,8 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
{
const struct commit *a = a_, *b = b_;
- const uint32_t generation_a = commit_graph_generation(a),
- generation_b = commit_graph_generation(b);
+ const timestamp_t generation_a = commit_graph_generation(a),
+ generation_b = commit_graph_generation(b);
/* newer commits first */
if (generation_a < generation_b)
diff --git a/commit.h b/commit.h
index 5467786c7b..33c66b2177 100644
--- a/commit.h
+++ b/commit.h
@@ -11,8 +11,8 @@
#include "commit-slab.h"
#define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
-#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0
struct commit_list {
diff --git a/revision.c b/revision.c
index c97abcdde1..2861f1c45c 100644
--- a/revision.c
+++ b/revision.c
@@ -3308,7 +3308,7 @@ define_commit_slab(indegree_slab, int);
define_commit_slab(author_date_slab, timestamp_t);
struct topo_walk_info {
- uint32_t min_generation;
+ timestamp_t min_generation;
struct prio_queue explore_queue;
struct prio_queue indegree_queue;
struct prio_queue topo_queue;
@@ -3354,7 +3354,7 @@ static void explore_walk_step(struct rev_info *revs)
}
static void explore_to_depth(struct rev_info *revs,
- uint32_t gen_cutoff)
+ timestamp_t gen_cutoff)
{
struct topo_walk_info *info = revs->topo_walk_info;
struct commit *c;
@@ -3397,7 +3397,7 @@ static void indegree_walk_step(struct rev_info *revs)
}
static void compute_indegrees_to_depth(struct rev_info *revs,
- uint32_t gen_cutoff)
+ timestamp_t gen_cutoff)
{
struct topo_walk_info *info = revs->topo_walk_info;
struct commit *c;
@@ -3455,7 +3455,7 @@ static void init_topo_walk(struct rev_info *revs)
info->min_generation = GENERATION_NUMBER_INFINITY;
for (list = revs->commits; list; list = list->next) {
struct commit *c = list->item;
- uint32_t generation;
+ timestamp_t generation;
if (repo_parse_commit_gently(revs->repo, c, 1))
continue;
@@ -3516,7 +3516,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
for (p = commit->parents; p; p = p->next) {
struct commit *parent = p->item;
int *pi;
- uint32_t generation;
+ timestamp_t generation;
if (parent->object.flags & UNINTERESTING)
continue;
diff --git a/upload-pack.c b/upload-pack.c
index 3b858eb457..fdb82885b6 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -497,7 +497,7 @@ static int got_oid(struct upload_pack_data *data,
static int ok_to_give_up(struct upload_pack_data *data)
{
- uint32_t min_generation = GENERATION_NUMBER_ZERO;
+ timestamp_t min_generation = GENERATION_NUMBER_ZERO;
if (!data->have_obj.nr)
return 0;
--
gitgitgadget
* Re: [PATCH v4 04/10] commit-graph: return 64-bit generation number
2020-10-07 14:09 ` [PATCH v4 04/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2020-10-25 13:48 ` Jakub Narębski
2020-11-03 6:40 ` Abhishek Kumar
0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-10-25 13:48 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar
Hi Abhishek,
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In a preparatory step, let's return timestamp_t values from
> commit_graph_generation(), use timestamp_t for local variables and
> define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.
I think it would be easier to understand if it was explicitly said what
this preparatory step prepares for, e.g.:
In a preparatory step for introducing corrected commit dates as
generation number, let's return timestamp_t values from...
Or even
generation number, let's change the return type of
commit_graph_generation() to timestamp_t, and use ...
Otherwise it looks good.
>
> We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
> represent the largest topological level we can store in the commit data
> chunk.
>
> With corrected commit dates implemented, we will have two such *_MAX
> variables to denote the largest offset and largest topological level
> that can be stored.
All right, nice explanation.
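For readers following along, here is a rough sketch of why the largest
storable topological level is 0x3FFFFFFF: the level shares a 32-bit word
with the top two bits of the committer date. The decoding mirrors the
get_be32() lines in fill_commit_graph_info(); the struct and function names
are hypothetical, and masking the date's high bits on the packing side is an
assumption of this sketch, not a statement about the actual writer.

```c
#include <stdint.h>

#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF	/* 2^30 - 1, as in commit.h */

/*
 * Hypothetical stand-in for the two 32-bit words read at offsets
 * hash_len + 8 and hash_len + 12 of a Commit Data entry.
 */
struct cdat_words {
	uint32_t gen_and_date_high;	/* (level << 2) | date bits 33..32 */
	uint32_t date_low;		/* date bits 31..0 */
};

/* Clamp a topological level so that it fits the 30-bit field. */
static uint32_t clamp_level(uint64_t level)
{
	return level > GENERATION_NUMBER_V1_MAX ?
		GENERATION_NUMBER_V1_MAX : (uint32_t)level;
}

/* Assumption: the date is truncated to its low 34 bits when packed. */
static struct cdat_words cdat_pack(uint64_t level, uint64_t date)
{
	struct cdat_words w;
	w.gen_and_date_high = (clamp_level(level) << 2) |
			      (uint32_t)((date >> 32) & 0x3);
	w.date_low = (uint32_t)date;
	return w;
}

/* Same bit operations as the decoding in the quoted diff. */
static uint64_t cdat_date(struct cdat_words w)
{
	uint64_t date_high = w.gen_and_date_high & 0x3;
	return (date_high << 32) | w.date_low;
}

static uint32_t cdat_level(struct cdat_words w)
{
	return w.gen_and_date_high >> 2;
}
```

Under these assumptions a date of 2^34 - 1 round-trips unchanged, a date of
2^36 + 1 comes back as 1, and any level above GENERATION_NUMBER_V1_MAX is
stored as GENERATION_NUMBER_V1_MAX.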
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Note that there are two changes that are not mentioned in the commit
message, namely adding 'const'-ness to generation_a/b local variables in
commit_gen_cmp() from commit-graph.c, and switching from
GENERATION_NUMBER_ZERO to GENERATION_NUMBER_INFINITY as the default
(initial) value for 'max_generation' in repo_in_merge_bases_many().
While the former is a simple "while-at-it" change that shouldn't affect
correctness, the latter needs an explanation (or fixing if it is wrong).
> ---
> commit-graph.c | 22 +++++++++++-----------
> commit-graph.h | 4 ++--
> commit-reach.c | 36 ++++++++++++++++++------------------
> commit-reach.h | 2 +-
> commit.c | 4 ++--
> commit.h | 4 ++--
> revision.c | 10 +++++-----
> upload-pack.c | 2 +-
> 8 files changed, 42 insertions(+), 42 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index e8362e144e..bfc532de6f 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -99,7 +99,7 @@ uint32_t commit_graph_position(const struct commit *c)
> return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
> }
>
> -uint32_t commit_graph_generation(const struct commit *c)
> +timestamp_t commit_graph_generation(const struct commit *c)
All right.
> {
> struct commit_graph_data *data =
> commit_graph_data_slab_peek(&commit_graph_data_slab, c);
> @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
> const struct commit *a = *(const struct commit **)va;
> const struct commit *b = *(const struct commit **)vb;
>
> - uint32_t generation_a = commit_graph_data_at(a)->generation;
> - uint32_t generation_b = commit_graph_data_at(b)->generation;
> + const timestamp_t generation_a = commit_graph_data_at(a)->generation;
> + const timestamp_t generation_b = commit_graph_data_at(b)->generation;
All right... but this also adds 'const' qualifier. I understand that
you don't want to create separate commit for this "while at it"
change...
> /* lower generation commits first */
> if (generation_a < generation_b)
> return -1;
> @@ -1350,7 +1350,7 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> _("Computing commit graph generation numbers"),
> ctx->commits.nr);
> for (i = 0; i < ctx->commits.nr; i++) {
> - uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> + timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
All right.
>
> display_progress(ctx->progress, i + 1);
> if (generation != GENERATION_NUMBER_INFINITY &&
> @@ -1383,8 +1383,8 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> data->generation = max_generation + 1;
> pop_commit(&list);
>
> - if (data->generation > GENERATION_NUMBER_MAX)
> - data->generation = GENERATION_NUMBER_MAX;
> + if (data->generation > GENERATION_NUMBER_V1_MAX)
> + data->generation = GENERATION_NUMBER_V1_MAX;
All right, this is the other mentioned change.
> }
> }
> }
> @@ -2404,8 +2404,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> for (i = 0; i < g->num_commits; i++) {
> struct commit *graph_commit, *odb_commit;
> struct commit_list *graph_parents, *odb_parents;
> - uint32_t max_generation = 0;
> - uint32_t generation;
> + timestamp_t max_generation = 0;
> + timestamp_t generation;
All right.
>
> display_progress(progress, i + 1);
> hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> @@ -2469,11 +2469,11 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> continue;
>
> /*
> - * If one of our parents has generation GENERATION_NUMBER_MAX, then
> - * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
> + * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
> + * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
> * extra logic in the following condition.
> */
> - if (max_generation == GENERATION_NUMBER_MAX)
> + if (max_generation == GENERATION_NUMBER_V1_MAX)
> max_generation--;
All right. Nice fixing a comment too.
>
> generation = commit_graph_generation(graph_commit);
> diff --git a/commit-graph.h b/commit-graph.h
> index f8e92500c6..8be247fa35 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -144,12 +144,12 @@ void disable_commit_graph(struct repository *r);
>
> struct commit_graph_data {
> uint32_t graph_pos;
> - uint32_t generation;
> + timestamp_t generation;
> };
All right.
>
> /*
> * Commits should be parsed before accessing generation, graph positions.
> */
> -uint32_t commit_graph_generation(const struct commit *);
> +timestamp_t commit_graph_generation(const struct commit *);
> uint32_t commit_graph_position(const struct commit *);
> #endif
All right.
> diff --git a/commit-reach.c b/commit-reach.c
> index 50175b159e..20b48b872b 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
> static struct commit_list *paint_down_to_common(struct repository *r,
> struct commit *one, int n,
> struct commit **twos,
> - int min_generation)
> + timestamp_t min_generation)
> {
> struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
> struct commit_list *result = NULL;
> int i;
> - uint32_t last_gen = GENERATION_NUMBER_INFINITY;
> + timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
All right.
>
> if (!min_generation)
> queue.compare = compare_commits_by_commit_date;
> @@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
> struct commit *commit = prio_queue_get(&queue);
> struct commit_list *parents;
> int flags;
> - uint32_t generation = commit_graph_generation(commit);
> + timestamp_t generation = commit_graph_generation(commit);
All right.
>
> if (min_generation && generation > last_gen)
> - BUG("bad generation skip %8x > %8x at %s",
> + BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
All right; nice of you noticing this issue.
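To spell out why the format had to change: "%8x" expects an unsigned int, so
passing a 64-bit value to it is a mismatch. A minimal sketch follows,
assuming timestamp_t is uintmax_t and PRItime is PRIuMAX (which is how I
understand Git's compat header to define them); format_bad_skip() is a
hypothetical helper used only for illustration.

```c
#include <inttypes.h>
#include <stdio.h>

typedef uintmax_t timestamp_t;	/* assumption: mirrors Git's definition */
#define PRItime PRIuMAX		/* assumption: mirrors Git's definition */

/*
 * Hypothetical helper: format the message into buf and return
 * snprintf()'s result (the length of the full message).
 */
static int format_bad_skip(char *buf, size_t len,
			   timestamp_t generation, timestamp_t last_gen)
{
	return snprintf(buf, len,
			"bad generation skip %"PRItime" > %"PRItime,
			generation, last_gen);
}
```

The format macro keeps the conversion specifier in step with the width of
timestamp_t, so the full 64-bit value is printed in decimal.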
> generation, last_gen,
> oid_to_hex(&commit->object.oid));
> last_gen = generation;
> @@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
> repo_parse_commit(r, array[i]);
> for (i = 0; i < cnt; i++) {
> struct commit_list *common;
> - uint32_t min_generation = commit_graph_generation(array[i]);
> + timestamp_t min_generation = commit_graph_generation(array[i]);
>
> if (redundant[i])
> continue;
> for (j = filled = 0; j < cnt; j++) {
> - uint32_t curr_generation;
> + timestamp_t curr_generation;
> if (i == j || redundant[j])
> continue;
> filled_index[filled] = j;
All right.
> @@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
> {
> struct commit_list *bases;
> int ret = 0, i;
> - uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
> + timestamp_t generation, max_generation = GENERATION_NUMBER_INFINITY;
The change of type from uint32_t to timestamp_t is expected, but the
change from GENERATION_NUMBER_ZERO to GENERATION_NUMBER_INFINITY is not.
This might be caused by the fact that repo_in_merge_bases_many()
switched from using min_generation and GENERATION_NUMBER_INFINITY to
using max_generation and GENERATION_NUMBER_ZERO. Or the reverse: I see
one version on https://github.com/git/git, and another version in 'master'
pulled from https://github.com/git-for-windows/git
Certainly max_generation should be paired with GENERATION_NUMBER_ZERO,
and min_generation with GENERATION_NUMBER_INFINITY.
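To illustrate the pairing, here is a minimal sketch; it is not the body of
repo_in_merge_bases_many(), the two function names are hypothetical, and
timestamp_t is assumed to be a 64-bit unsigned type. A running maximum has
to start from the smallest value and a running minimum from the largest,
otherwise no input can ever change the accumulator.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t timestamp_t;	/* assumption: stands in for Git's timestamp_t */

#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
#define GENERATION_NUMBER_ZERO 0

/* Running maximum: start at ZERO so that any generation can raise it. */
static timestamp_t max_generation_of(const timestamp_t *gens, size_t nr)
{
	timestamp_t max_generation = GENERATION_NUMBER_ZERO;
	size_t i;

	for (i = 0; i < nr; i++)
		if (gens[i] > max_generation)
			max_generation = gens[i];
	return max_generation;
}

/* Running minimum: start at INFINITY so that any generation can lower it. */
static timestamp_t min_generation_of(const timestamp_t *gens, size_t nr)
{
	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
	size_t i;

	for (i = 0; i < nr; i++)
		if (gens[i] < min_generation)
			min_generation = gens[i];
	return min_generation;
}
```

If max_generation started at GENERATION_NUMBER_INFINITY instead, the
comparison in the first loop could never succeed for finite generations and
the result would always be INFINITY.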
>
> if (repo_parse_commit(r, commit))
> return ret;
> @@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
> static enum contains_result contains_test(struct commit *candidate,
> const struct commit_list *want,
> struct contains_cache *cache,
> - uint32_t cutoff)
> + timestamp_t cutoff)
All right.
Sidenote: this parameter should probably be named gen_cutoff, for
consistency and better readability (but that was the existing state),
but this would also mean more changes.
> {
> enum contains_result *cached = contains_cache_at(cache, candidate);
>
> @@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
> {
> struct contains_stack contains_stack = { 0, 0, NULL };
> enum contains_result result;
> - uint32_t cutoff = GENERATION_NUMBER_INFINITY;
> + timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
Sidenote: this variable should probably be named gen_cutoff, for
consistency and better readability (but that was the existing state).
However changing it would pollute this commit with unrelated changes;
it is not that big of an issue that it *requires* fixing.
> const struct commit_list *p;
>
> for (p = want; p; p = p->next) {
> - uint32_t generation;
> + timestamp_t generation;
> struct commit *c = p->item;
> load_commit_graph_info(the_repository, c);
> generation = commit_graph_generation(c);
All right.
> @@ -566,8 +566,8 @@ static int compare_commits_by_gen(const void *_a, const void *_b)
> const struct commit *a = *(const struct commit * const *)_a;
> const struct commit *b = *(const struct commit * const *)_b;
>
> - uint32_t generation_a = commit_graph_generation(a);
> - uint32_t generation_b = commit_graph_generation(b);
> + timestamp_t generation_a = commit_graph_generation(a);
> + timestamp_t generation_b = commit_graph_generation(b);
All right.
>
> if (generation_a < generation_b)
> return -1;
> @@ -580,7 +580,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
> unsigned int with_flag,
> unsigned int assign_flag,
> time_t min_commit_date,
> - uint32_t min_generation)
> + timestamp_t min_generation)
> {
> struct commit **list = NULL;
> int i;
All right.
> @@ -681,13 +681,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
> time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
> struct commit_list *from_iter = from, *to_iter = to;
> int result;
> - uint32_t min_generation = GENERATION_NUMBER_INFINITY;
> + timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
>
> while (from_iter) {
> add_object_array(&from_iter->item->object, NULL, &from_objs);
>
> if (!parse_commit(from_iter->item)) {
> - uint32_t generation;
> + timestamp_t generation;
> if (from_iter->item->date < min_commit_date)
> min_commit_date = from_iter->item->date;
>
All right.
> @@ -701,7 +701,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>
> while (to_iter) {
> if (!parse_commit(to_iter->item)) {
> - uint32_t generation;
> + timestamp_t generation;
> if (to_iter->item->date < min_commit_date)
> min_commit_date = to_iter->item->date;
>
All right.
> @@ -741,13 +741,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
> struct commit_list *found_commits = NULL;
> struct commit **to_last = to + nr_to;
> struct commit **from_last = from + nr_from;
> - uint32_t min_generation = GENERATION_NUMBER_INFINITY;
> + timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
> int num_to_find = 0;
>
> struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>
> for (item = to; item < to_last; item++) {
> - uint32_t generation;
> + timestamp_t generation;
> struct commit *c = *item;
>
> parse_commit(c);
All right.
> diff --git a/commit-reach.h b/commit-reach.h
> index b49ad71a31..148b56fea5 100644
> --- a/commit-reach.h
> +++ b/commit-reach.h
> @@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
> unsigned int with_flag,
> unsigned int assign_flag,
> time_t min_commit_date,
> - uint32_t min_generation);
> + timestamp_t min_generation);
> int can_all_from_reach(struct commit_list *from, struct commit_list *to,
> int commit_date_cutoff);
>
All right.
> diff --git a/commit.c b/commit.c
> index f53429c0ac..3b488381d5 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -731,8 +731,8 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
> int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
> {
> const struct commit *a = a_, *b = b_;
> - const uint32_t generation_a = commit_graph_generation(a),
> - generation_b = commit_graph_generation(b);
> + const timestamp_t generation_a = commit_graph_generation(a),
> + generation_b = commit_graph_generation(b);
>
All right (assuming that the indent after the change looks all right; but
even if it doesn't, it would be a very minor issue).
> /* newer commits first */
> if (generation_a < generation_b)
> diff --git a/commit.h b/commit.h
> index 5467786c7b..33c66b2177 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -11,8 +11,8 @@
> #include "commit-slab.h"
>
> #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
> -#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
> -#define GENERATION_NUMBER_MAX 0x3FFFFFFF
> +#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
> +#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
> #define GENERATION_NUMBER_ZERO 0
>
All right, we redefine GENERATION_NUMBER_INFINITY and rename
GENERATION_NUMBER_MAX.
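To illustrate why the type changes have to accompany the new value of
GENERATION_NUMBER_INFINITY, here is a minimal standalone sketch (not code
from the patch). It assumes timestamp_t is a 64-bit unsigned type; the
demo_ and DEMO_ names are hypothetical and exist only for this example:

```c
#include <stdint.h>

/* Assumption: timestamp_t is 64 bits wide, as on most platforms. */
typedef uint64_t demo_timestamp_t;

/* Same value as the new GENERATION_NUMBER_INFINITY in the quoted hunk. */
#define DEMO_GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)

/* Returns 1 if storing 'value' in a uint32_t would lose information. */
int demo_truncates_in_uint32(demo_timestamp_t value)
{
	uint32_t narrow = (uint32_t)value;

	return (demo_timestamp_t)narrow != value;
}
```

A variable left as uint32_t would keep only the low 32 bits of the new
infinity, so a later comparison against the 64-bit constant would never
match. That is why every holder of a generation number needs the wider type.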
> struct commit_list {
> diff --git a/revision.c b/revision.c
> index c97abcdde1..2861f1c45c 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3308,7 +3308,7 @@ define_commit_slab(indegree_slab, int);
> define_commit_slab(author_date_slab, timestamp_t);
>
> struct topo_walk_info {
> - uint32_t min_generation;
> + timestamp_t min_generation;
> struct prio_queue explore_queue;
> struct prio_queue indegree_queue;
> struct prio_queue topo_queue;
All right.
> @@ -3354,7 +3354,7 @@ static void explore_walk_step(struct rev_info *revs)
> }
>
> static void explore_to_depth(struct rev_info *revs,
> - uint32_t gen_cutoff)
> + timestamp_t gen_cutoff)
> {
> struct topo_walk_info *info = revs->topo_walk_info;
> struct commit *c;
All right.
> @@ -3397,7 +3397,7 @@ static void indegree_walk_step(struct rev_info *revs)
> }
>
> static void compute_indegrees_to_depth(struct rev_info *revs,
> - uint32_t gen_cutoff)
> + timestamp_t gen_cutoff)
> {
> struct topo_walk_info *info = revs->topo_walk_info;
> struct commit *c;
All right.
> @@ -3455,7 +3455,7 @@ static void init_topo_walk(struct rev_info *revs)
> info->min_generation = GENERATION_NUMBER_INFINITY;
> for (list = revs->commits; list; list = list->next) {
> struct commit *c = list->item;
> - uint32_t generation;
> + timestamp_t generation;
>
> if (repo_parse_commit_gently(revs->repo, c, 1))
> continue;
All right.
> @@ -3516,7 +3516,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
> for (p = commit->parents; p; p = p->next) {
> struct commit *parent = p->item;
> int *pi;
> - uint32_t generation;
> + timestamp_t generation;
>
> if (parent->object.flags & UNINTERESTING)
> continue;
All right.
> diff --git a/upload-pack.c b/upload-pack.c
> index 3b858eb457..fdb82885b6 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -497,7 +497,7 @@ static int got_oid(struct upload_pack_data *data,
>
> static int ok_to_give_up(struct upload_pack_data *data)
> {
> - uint32_t min_generation = GENERATION_NUMBER_ZERO;
> + timestamp_t min_generation = GENERATION_NUMBER_ZERO;
>
> if (!data->have_obj.nr)
> return 0;
All right.
The only thing to check is whether you have changed the type in all the
places that need it. My cursory examination shows that those are all
the places that need fixing.
Note that the 'generation' variable in git-name-rev, git-fsck and in
git-show-branch (and sha1-name.c) means something different.
Also, the 'first_generation' variable in generation_numbers_enabled() (part
of commit-graph.c) examines and will keep examining generation number v1, i.e.
topological levels, and does not need a type change... though it may require
a name change some time in the future; the generation number
computation path also does not require a type change, though its variables
will be renamed in a future commit.
Best,
--
Jakub Narębski
* Re: [PATCH v4 04/10] commit-graph: return 64-bit generation number
2020-10-25 13:48 ` Jakub Narębski
@ 2020-11-03 6:40 ` Abhishek Kumar
0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-03 6:40 UTC (permalink / raw)
To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee
On Sun, Oct 25, 2020 at 02:48:27PM +0100, Jakub Narębski wrote:
> Hi Abhishek,
>
> Note that there are two changes that are not mentioned in the commit
> message, namely adding 'const'-ness to generation_a/b local variables in
> commit_gen_cmp() from commit-graph.c, and switching from
> GENERATION_NUMBER_ZERO to GENERATION_NUMBER_INFINITY as the default
> (initial) value for 'max_generation' in repo_in_merge_bases_many().
>
> While the former is a simple "while-at-it" change that shouldn't affect
> correctness, the latter needs an explanation (or fixing if it is wrong).
>
The change from GENERATION_NUMBER_ZERO to GENERATION_NUMBER_INFINITY was
incorrect. While fixing merge conflicts on rebasing to master again, I
didn't notice that repo_in_merge_bases_many() switched from using
min_generation and GENERATION_NUMBER_INFINITY to max_generation and
GENERATION_NUMBER_ZERO.
Thanks for noticing!
> ...
>
> Best,
> --
> Jakub Narębski
* [PATCH v4 05/10] commit-graph: add a slab to store topological levels
2020-10-07 14:09 ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
` (3 preceding siblings ...)
2020-10-07 14:09 ` [PATCH v4 04/10] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09 ` Abhishek Kumar via GitGitGadget
2020-10-25 22:17 ` Jakub Narębski
2020-10-07 14:09 ` [PATCH v4 06/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
` (6 subsequent siblings)
11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In a later commit we will introduce corrected commit date as the
generation number v2. This value will be stored in the new separate
Generation Data chunk. However, to ensure backwards compatibility with
"Old" Git we need to continue to write generation number v1, which is
topological level, to the commit data chunk. This means that we need to
compute both versions of generation numbers when writing the
commit-graph file. Therefore, let's introduce a commit-slab to store
topological levels; corrected commit date will be stored in the member
`generation` of struct commit_graph_data.
When Git creates a split commit-graph, it takes advantage of the
generation values that have been computed already and present in
existing commit-graph files.
So, let's add a pointer to struct commit_graph as well as struct
write_commit_graph_context to the topological level commit-slab
and populate it with topological levels while writing a commit-graph
file.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
commit-graph.h | 1 +
2 files changed, 33 insertions(+), 15 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index bfc532de6f..cedd311024 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)
+define_commit_slab(topo_level_slab, uint32_t);
+
/* Keep track of the order in which commits are added to our list. */
define_commit_slab(commit_pos, int);
static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -768,6 +770,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
item->date = (timestamp_t)((date_high << 32) | date_low);
graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
+ if (g->topo_levels)
+ *topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
}
static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -962,6 +967,7 @@ struct write_commit_graph_context {
changed_paths:1,
order_by_pack:1;
+ struct topo_level_slab *topo_levels;
const struct commit_graph_opts *opts;
size_t total_bloom_filter_data_size;
const struct bloom_filter_settings *bloom_settings;
@@ -1108,7 +1114,7 @@ static int write_graph_chunk_data(struct hashfile *f,
else
packedDate[0] = 0;
- packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+ packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);
@@ -1350,11 +1356,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
_("Computing commit graph generation numbers"),
ctx->commits.nr);
for (i = 0; i < ctx->commits.nr; i++) {
- timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+ timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
display_progress(ctx->progress, i + 1);
- if (generation != GENERATION_NUMBER_INFINITY &&
- generation != GENERATION_NUMBER_ZERO)
+ if (level != GENERATION_NUMBER_INFINITY &&
+ level != GENERATION_NUMBER_ZERO)
continue;
commit_list_insert(ctx->commits.list[i], &list);
@@ -1362,29 +1368,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
struct commit *current = list->item;
struct commit_list *parent;
int all_parents_computed = 1;
- uint32_t max_generation = 0;
+ uint32_t max_level = 0;
for (parent = current->parents; parent; parent = parent->next) {
- generation = commit_graph_data_at(parent->item)->generation;
+ level = *topo_level_slab_at(ctx->topo_levels, parent->item);
- if (generation == GENERATION_NUMBER_INFINITY ||
- generation == GENERATION_NUMBER_ZERO) {
+ if (level == GENERATION_NUMBER_INFINITY ||
+ level == GENERATION_NUMBER_ZERO) {
all_parents_computed = 0;
commit_list_insert(parent->item, &list);
break;
- } else if (generation > max_generation) {
- max_generation = generation;
+ } else if (level > max_level) {
+ max_level = level;
}
}
if (all_parents_computed) {
- struct commit_graph_data *data = commit_graph_data_at(current);
-
- data->generation = max_generation + 1;
pop_commit(&list);
- if (data->generation > GENERATION_NUMBER_V1_MAX)
- data->generation = GENERATION_NUMBER_V1_MAX;
+ if (max_level > GENERATION_NUMBER_V1_MAX - 1)
+ max_level = GENERATION_NUMBER_V1_MAX - 1;
+ *topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
}
}
}
@@ -2142,6 +2146,7 @@ int write_commit_graph(struct object_directory *odb,
int res = 0;
int replace = 0;
struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+ struct topo_level_slab topo_levels;
if (!commit_graph_compatible(the_repository))
return 0;
@@ -2163,6 +2168,18 @@ int write_commit_graph(struct object_directory *odb,
bloom_settings.max_changed_paths);
ctx->bloom_settings = &bloom_settings;
+ init_topo_level_slab(&topo_levels);
+ ctx->topo_levels = &topo_levels;
+
+ if (ctx->r->objects->commit_graph) {
+ struct commit_graph *g = ctx->r->objects->commit_graph;
+
+ while (g) {
+ g->topo_levels = &topo_levels;
+ g = g->base_graph;
+ }
+ }
+
if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
ctx->changed_paths = 1;
if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/commit-graph.h b/commit-graph.h
index 8be247fa35..2e9aa7824e 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -73,6 +73,7 @@ struct commit_graph {
const unsigned char *chunk_bloom_indexes;
const unsigned char *chunk_bloom_data;
+ struct topo_level_slab *topo_levels;
struct bloom_filter_settings *bloom_filter_settings;
};
--
gitgitgadget
* Re: [PATCH v4 05/10] commit-graph: add a slab to store topological levels
2020-10-07 14:09 ` [PATCH v4 05/10] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2020-10-25 22:17 ` Jakub Narębski
0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-10-25 22:17 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> In a later commit we will introduce corrected commit date as the
> generation number v2. This value will be stored in the new separate
> Generation Data chunk. However, to ensure backwards compatibility with
> "Old" Git we need to continue to write generation number v1, which is
> topological level, to the commit data chunk. This means that we need to
> compute both versions of generation numbers when writing the
> commit-graph file. Therefore, let's introduce a commit-slab to store
> topological levels; corrected commit date will be stored in the member
> `generation` of struct commit_graph_data.
>
> When Git creates a split commit-graph, it takes advantage of the
> generation values that have been computed already and present in
> existing commit-graph files.
>
> So, let's add a pointer to struct commit_graph as well as struct
> write_commit_graph_context to the topological level commit-slab
> and populate it with topological levels while writing a commit-graph
> file.
I think you meant here "add a pointer in `struct commit_graph` as well
as in `struct write_commit_graph_context`...".
Perhaps we should add the information that it is done that way to be
able to allocate topo_level_slab only when needed, in
write_commit_graph(), and that adding a new member to those structs is
required to pass it through the call chain (modifying `struct commit_graph`
is needed for fill_commit_graph_info()). But that might be too much detail
to put in the commit message.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
> commit-graph.c | 47 ++++++++++++++++++++++++++++++++---------------
> commit-graph.h | 1 +
> 2 files changed, 33 insertions(+), 15 deletions(-)
>
Let me reorder those files for easier review.
> diff --git a/commit-graph.h b/commit-graph.h
> index 8be247fa35..2e9aa7824e 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -73,6 +73,7 @@ struct commit_graph {
> const unsigned char *chunk_bloom_indexes;
> const unsigned char *chunk_bloom_data;
>
> + struct topo_level_slab *topo_levels;
> struct bloom_filter_settings *bloom_filter_settings;
> };
All right, here we add new member to `struct commit_graph` type.
> diff --git a/commit-graph.c b/commit-graph.c
> index bfc532de6f..cedd311024 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -962,6 +967,7 @@ struct write_commit_graph_context {
> changed_paths:1,
> order_by_pack:1;
>
> + struct topo_level_slab *topo_levels;
> const struct commit_graph_opts *opts;
> size_t total_bloom_filter_data_size;
> const struct bloom_filter_settings *bloom_settings;
All right, here we add new member to `struct write_commit_graph_context`
type, which is local to commit-graph.c.
> @@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
> /* Remember to update object flag allocation in object.h */
> #define REACHABLE (1u<<15)
>
> +define_commit_slab(topo_level_slab, uint32_t);
> +
All right, here we define new slab for storing topological levels; this
just defines new type. Note that we do not define any setters and
getters to handle non-zero initialization, like we have for
commit_graph_data_slab.
> /* Keep track of the order in which commits are added to our list. */
> define_commit_slab(commit_pos, int);
> static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
> @@ -768,6 +770,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> item->date = (timestamp_t)((date_high << 32) | date_low);
>
> graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +
> + if (g->topo_levels)
> + *topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
I guess using get_be32() is repeated in this newly added part of code
because previous part would be changed to read in generation number v2,
if available, and we won't be then able to use
*topo_level_slab_at(g->topo_levels, item) = graph_data->generation;
All right, that's smart.
I guess that in fill_commit_graph_info() we don't know if we are reading
commit-graph, when topo levels slab is not present, or whether we are
extending and writing the commit-graph file, when we need to fill it
with current commit-graph data.
The fact that fill_commit_graph_info() takes 'struct commit_graph' also
explains why we need to add pointer to a topo_levels slab to both
structs.
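For readers unfamiliar with the on-disk layout being decoded here, a
minimal sketch follows (not code from the patch). It assumes the layout
implied by the quoted hunk: the first big-endian 32-bit word holds the
topological level in its upper 30 bits and the two highest date bits in its
lower 2 bits, and the second word holds the low 32 bits of the date. All
demo_ names are hypothetical, and demo_get_be32() merely stands in for
Git's get_be32():

```c
#include <stdint.h>

/* Read a big-endian 32-bit value; a stand-in for Git's get_be32(). */
uint32_t demo_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* The topological level occupies the upper 30 bits of the first word. */
uint32_t demo_topo_level(const unsigned char *gen_and_date)
{
	return demo_get_be32(gen_and_date) >> 2;
}

/*
 * The date uses the low 2 bits of the first word as bits 32-33,
 * followed by all 32 bits of the second word.
 */
uint64_t demo_commit_date(const unsigned char *gen_and_date)
{
	uint64_t date_high = demo_get_be32(gen_and_date) & 0x3;
	uint64_t date_low = demo_get_be32(gen_and_date + 4);

	return (date_high << 32) | date_low;
}
```

Because the level is read straight from the commit data chunk, this decoding
keeps working even after graph_data->generation starts holding a different
kind of value.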
> }
>
> static inline void set_commit_tree(struct commit *c, struct tree *t)
[...]
> @@ -2142,6 +2146,7 @@ int write_commit_graph(struct object_directory *odb,
> int res = 0;
> int replace = 0;
> struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> + struct topo_level_slab topo_levels;
>
> if (!commit_graph_compatible(the_repository))
> return 0;
> @@ -2163,6 +2168,18 @@ int write_commit_graph(struct object_directory *odb,
> bloom_settings.max_changed_paths);
> ctx->bloom_settings = &bloom_settings;
>
> + init_topo_level_slab(&topo_levels);
> + ctx->topo_levels = &topo_levels;
> +
> + if (ctx->r->objects->commit_graph) {
> + struct commit_graph *g = ctx->r->objects->commit_graph;
> +
> + while (g) {
> + g->topo_levels = &topo_levels;
> + g = g->base_graph;
> + }
> + }
> +
> if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
> ctx->changed_paths = 1;
> if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
All right, we need topo_level_slab only for writing the commit-graph, so
we allocate it with init_*_slab() in write_commit_graph(), and set
pointers to it in `struct write_commit_graph_context *ctx` and in
`struct commit_graph` for each layer in the commit graph. This is
needed to pass it down the call-chain.
Looks good to me.
> @@ -1108,7 +1114,7 @@ static int write_graph_chunk_data(struct hashfile *f,
> else
> packedDate[0] = 0;
>
> - packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
> + packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
>
All right, write_graph_chunk_data() is called from write_commit_graph(),
so we know that ctx->topo_levels is not NULL.
> packedDate[1] = htonl((*list)->date);
> hashwrite(f, packedDate, 8);
> @@ -1350,11 +1356,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> _("Computing commit graph generation numbers"),
> ctx->commits.nr);
> for (i = 0; i < ctx->commits.nr; i++) {
> - timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
> + timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>
All right, we know that compute_generation_numbers() is called by
write_commit_graph(), so we know that ctx->topo_levels is not NULL.
Also, we rename 'generation' to 'level' in preparation for the time when
we would be computing *both* topological level (for backward
compatibility) and corrected committer date (to be used as generation
number v2). All right.
> display_progress(ctx->progress, i + 1);
> - if (generation != GENERATION_NUMBER_INFINITY &&
> - generation != GENERATION_NUMBER_ZERO)
> + if (level != GENERATION_NUMBER_INFINITY &&
> + level != GENERATION_NUMBER_ZERO)
> continue;
Same here, the results of renaming of 'generation' local variable to
'level'.
>
> commit_list_insert(ctx->commits.list[i], &list);
> @@ -1362,29 +1368,27 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> struct commit *current = list->item;
> struct commit_list *parent;
> int all_parents_computed = 1;
> - uint32_t max_generation = 0;
> + uint32_t max_level = 0;
Similarly, we rename 'max_generation' to 'max_level'.
>
> for (parent = current->parents; parent; parent = parent->next) {
> - generation = commit_graph_data_at(parent->item)->generation;
> + level = *topo_level_slab_at(ctx->topo_levels, parent->item);
>
> - if (generation == GENERATION_NUMBER_INFINITY ||
> - generation == GENERATION_NUMBER_ZERO) {
> + if (level == GENERATION_NUMBER_INFINITY ||
> + level == GENERATION_NUMBER_ZERO) {
> all_parents_computed = 0;
> commit_list_insert(parent->item, &list);
> break;
> - } else if (generation > max_generation) {
> - max_generation = generation;
> + } else if (level > max_level) {
> + max_level = level;
> }
> }
Continuation of those renames.
>
> if (all_parents_computed) {
> - struct commit_graph_data *data = commit_graph_data_at(current);
> -
> - data->generation = max_generation + 1;
> pop_commit(&list);
>
> - if (data->generation > GENERATION_NUMBER_V1_MAX)
> - data->generation = GENERATION_NUMBER_V1_MAX;
> + if (max_level > GENERATION_NUMBER_V1_MAX - 1)
> + max_level = GENERATION_NUMBER_V1_MAX - 1;
> + *topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
This is a somewhat safer way to handle possible overflow: instead of
    final = max_found + 1; /* set to maximum plus 1 */
    if (final > MAX_POSSIBLE_VALUE) /* handle overflow */
        final = MAX_POSSIBLE_VALUE;
where `max_found + 1` can wrap around before the check if
MAX_POSSIBLE_VALUE is the largest value of the type, we use the
following pattern:
    if (max_found > MAX_POSSIBLE_VALUE - 1) /* handle overflow */
        max_found = MAX_POSSIBLE_VALUE - 1;
    final = max_found + 1; /* set to maximum plus 1 */
It is just a bit obscured by the variable rename and the switch to using
the commit slab.
It is not that important for topological level, where
GENERATION_NUMBER_V1_MAX is smaller than maximum possible value, but it
would be important for generation number v2.
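The difference between the two patterns can be seen in a small standalone
sketch (not code from the patch; the demo_ function names are hypothetical,
and 'max_possible' is a parameter here only so that both limits can be
tried):

```c
#include <stdint.h>

/* Increment first, clamp afterwards: wraps if max_found == UINT32_MAX. */
uint32_t demo_next_naive(uint32_t max_found, uint32_t max_possible)
{
	uint32_t final = max_found + 1; /* may wrap around to 0 */

	if (final > max_possible)
		final = max_possible;
	return final;
}

/* Clamp first, increment afterwards; requires max_possible >= 1. */
uint32_t demo_next_safe(uint32_t max_found, uint32_t max_possible)
{
	if (max_found > max_possible - 1)
		max_found = max_possible - 1;
	return max_found + 1; /* cannot exceed max_possible */
}
```

With a limit of 0x3FFFFFFF both functions agree for every input below
UINT32_MAX. With a limit of UINT32_MAX, demo_next_naive(UINT32_MAX, ...)
wraps to 0 while demo_next_safe() saturates at the limit.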
> }
> }
> }
Best,
--
Jakub Narębski
* [PATCH v4 06/10] commit-graph: implement corrected commit date
2020-10-07 14:09 ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
` (4 preceding siblings ...)
2020-10-07 14:09 ` [PATCH v4 05/10] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09 ` Abhishek Kumar via GitGitGadget
2020-10-27 18:53 ` Jakub Narębski
2020-10-07 14:09 ` [PATCH v4 07/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
` (5 subsequent siblings)
11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
With most of preparations done, let's implement corrected commit date.
The corrected commit date for a commit is defined as:
* A commit with no parents (a root commit) has corrected commit date
equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
the maximum of its commit date and one more than the largest corrected
commit date among its parents.
As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).
To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.
Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32-bits. This halves the size
of GDAT chunk, which is a reduction of around 6% in the size of
commit-graph file.
However, using offsets can be problematic if one of the commits is
malformed but valid and has a committer date of 0 Unix time, as the offset
would be the same as the corrected commit date and thus require 64 bits to
be stored properly.
While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 43 +++++++++++++++++++++++--------------------
1 file changed, 23 insertions(+), 20 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index cedd311024..03948adfce 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
else if (generation_a > generation_b)
return 1;
- /* use date as a heuristic when generations are equal */
- if (a->date < b->date)
- return -1;
- else if (a->date > b->date)
- return 1;
return 0;
}
@@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
ctx->commits.nr);
for (i = 0; i < ctx->commits.nr; i++) {
timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+ timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
display_progress(ctx->progress, i + 1);
if (level != GENERATION_NUMBER_INFINITY &&
- level != GENERATION_NUMBER_ZERO)
+ level != GENERATION_NUMBER_ZERO &&
+ corrected_commit_date != GENERATION_NUMBER_INFINITY &&
+ corrected_commit_date != GENERATION_NUMBER_ZERO
+ )
continue;
commit_list_insert(ctx->commits.list[i], &list);
@@ -1369,17 +1368,25 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
struct commit_list *parent;
int all_parents_computed = 1;
uint32_t max_level = 0;
+ timestamp_t max_corrected_commit_date = 0;
for (parent = current->parents; parent; parent = parent->next) {
level = *topo_level_slab_at(ctx->topo_levels, parent->item);
-
+ corrected_commit_date = commit_graph_data_at(parent->item)->generation;
if (level == GENERATION_NUMBER_INFINITY ||
- level == GENERATION_NUMBER_ZERO) {
+ level == GENERATION_NUMBER_ZERO ||
+ corrected_commit_date == GENERATION_NUMBER_INFINITY ||
+ corrected_commit_date == GENERATION_NUMBER_ZERO
+ ) {
all_parents_computed = 0;
commit_list_insert(parent->item, &list);
break;
- } else if (level > max_level) {
- max_level = level;
+ } else {
+ if (level > max_level)
+ max_level = level;
+
+ if (corrected_commit_date > max_corrected_commit_date)
+ max_corrected_commit_date = corrected_commit_date;
}
}
@@ -1389,6 +1396,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (max_level > GENERATION_NUMBER_V1_MAX - 1)
max_level = GENERATION_NUMBER_V1_MAX - 1;
*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+
+ if (current->date && current->date > max_corrected_commit_date)
+ max_corrected_commit_date = current->date - 1;
+ commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
}
}
}
@@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
if (generation_zero == GENERATION_ZERO_EXISTS)
continue;
- /*
- * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
- * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
- * extra logic in the following condition.
- */
- if (max_generation == GENERATION_NUMBER_V1_MAX)
- max_generation--;
-
generation = commit_graph_generation(graph_commit);
- if (generation != max_generation + 1)
- graph_report(_("commit-graph generation for commit %s is %u != %u"),
+ if (generation < max_generation + 1)
+ graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
oid_to_hex(&cur_oid),
generation,
max_generation + 1);
--
gitgitgadget
* Re: [PATCH v4 06/10] commit-graph: implement corrected commit date
2020-10-07 14:09 ` [PATCH v4 06/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-10-27 18:53 ` Jakub Narębski
2020-11-03 11:44 ` Abhishek Kumar
0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-10-27 18:53 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With most of preparations done, let's implement corrected commit date.
>
> The corrected commit date for a commit is defined as:
>
> * A commit with no parents (a root commit) has corrected commit date
> equal to its committer date.
> * A commit with at least one parent has corrected commit date equal to
> the maximum of its commit date and one more than the largest corrected
> commit date among its parents.
All right. We might want to say that it fulfills the same reachability
criteria as topological level, but perhaps this level of detail is not
necessary here.
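The definition quoted above can be sketched as follows. This is only an
illustration under simplifying assumptions, not the code from the patch:
'struct demo_commit' and demo_corrected_date() are hypothetical, the
function recurses for brevity (the patch walks an explicit commit list
instead), and a stored value of 0 means "not computed yet":

```c
#include <stddef.h>
#include <stdint.h>

struct demo_commit {
	uint64_t date;                /* committer date */
	struct demo_commit **parents; /* array of nr_parents parents */
	size_t nr_parents;
	uint64_t corrected_date;      /* 0 means "not computed yet" */
};

/*
 * Corrected commit date: the maximum of the commit's own date and one
 * more than the largest corrected commit date among its parents.
 */
uint64_t demo_corrected_date(struct demo_commit *c)
{
	uint64_t result = c->date;
	size_t i;

	if (c->corrected_date)
		return c->corrected_date;

	for (i = 0; i < c->nr_parents; i++) {
		uint64_t from_parent = demo_corrected_date(c->parents[i]) + 1;

		if (from_parent > result)
			result = from_parent;
	}

	/* Special case: a root commit dated 0 gets 1, never 0. */
	if (!result)
		result = 1;

	c->corrected_date = result;
	return result;
}
```

For example, a commit dated 50 whose only parent is a root dated 100 gets a
corrected commit date of 101, so the value still increases along every
parent-to-child edge even though the committer dates are skewed.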
> As a special case, a root commit with timestamp of zero (01.01.1970
> 00:00:00Z) has corrected commit date of one, to be able to distinguish
> from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
> date).
I'm not sure if this special case is really necessary, but it makes for
cleaner reasoning.
> To minimize the space required to store corrected commit date, Git
> stores corrected commit date offsets into the commit-graph file. The
> corrected commit date offset for a commit is defined as the difference
> between its corrected commit date and actual commit date.
>
> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
> in most cases is 64 bits (uintmax_t). However, corrected commit date
> offsets can be safely stored using only 32-bits. This halves the size
> of GDAT chunk, which is a reduction of around 6% in the size of
> commit-graph file.
>
> However, using offsets can be problematic if one of the commits is
> malformed but valid and has a committer date of 0 Unix time, as the offset
> would be the same as the corrected commit date and thus require 64 bits to
> be stored properly.
>
> While Git does not write out offsets at this stage, Git stores the
> corrected commit dates in member generation of struct commit_graph_data.
> It will begin writing commit date offsets with the introduction of
> generation data chunk.
All right.
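The offset scheme described above can be sketched as follows. This is only an illustration under stated assumptions: `corrected_date_to_offset()` is a hypothetical helper, `ts_t` stands in for `timestamp_t`, and how Git itself reacts to an offset that does not fit is not shown in this patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for Git's timestamp_t. */
typedef uint64_t ts_t;

/*
 * Compute the offset between corrected commit date and commit date.
 * Returns 0 and stores the offset when it fits into 32 bits, and -1
 * when it does not (or when corrected < commit_date, which the
 * definition rules out).
 */
static int corrected_date_to_offset(ts_t corrected, ts_t commit_date,
				    uint32_t *offset)
{
	ts_t diff;

	if (corrected < commit_date)
		return -1;
	diff = corrected - commit_date;
	if (diff > UINT32_MAX)
		return -1;
	*offset = (uint32_t)diff;
	return 0;
}
```

A commit with committer date 0 whose corrected date is a present-day timestamp still fits; the problematic case from the commit message needs a corrected date beyond 2^32 - 1 seconds on top of the zero date.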
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Somewhere in the commit message we should also describe that this commit
changes how commit-graph is verified: from checking that the generation
number agrees with _topological level definition_, that is that for a
given commit it is 1 more than maximum of its parents (with the caveat
that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
checking that slightly weaker condition fulfilled by both topological
levels (generation number v1) and by corrected commit date (generation
number v2) that for a given commit its generation number is 1 more than
maximum of its parents or larger.
But, as far as I understand it, current code does not handle correctly
GENERATION_NUMBER_V1_MAX case (if we use generation number v1).
On the other hand we could simply use a functional check, that the
generation number used (which can be v1 or v2, or any similar other)
fulfills the reachability condition for each edge, which can be
simplified to checking that generation(parents) <= generation(commit).
If the reachability condition is true for each edge, then it is true for
each path, and for each commit.
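A minimal sketch of such a per-edge check, assuming the parents' generation numbers have already been loaded into an array (`gen_t` and `first_bad_parent()` are hypothetical names, not part of commit-graph.c):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type wide enough for either generation number version. */
typedef uint64_t gen_t;

/*
 * Check the reachability condition gen(parent) <= gen(commit) for
 * every edge leaving one commit.  Returns the index of the first
 * offending parent, or -1 when all edges are fine.
 */
static int first_bad_parent(gen_t gen_commit, const gen_t *parent_gens,
			    size_t nr_parents)
{
	size_t i;

	for (i = 0; i < nr_parents; i++)
		if (parent_gens[i] > gen_commit)
			return (int)i;
	return -1;
}
```

Because the comparison is non-strict, a parent and a child that both sit at a saturated value pass the check without any special casing.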
> ---
> commit-graph.c | 43 +++++++++++++++++++++++--------------------
> 1 file changed, 23 insertions(+), 20 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index cedd311024..03948adfce 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
> else if (generation_a > generation_b)
> return 1;
>
> - /* use date as a heuristic when generations are equal */
> - if (a->date < b->date)
> - return -1;
> - else if (a->date > b->date)
> - return 1;
Why this change? It is not described in the commit message.
Note that while this tie-breaking fallback doesn't make much sense for
corrected committer date generation number v2, this tie-breaking helps
if we have to use topological levels (generation number v1).
> return 0;
> }
>
> @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> ctx->commits.nr);
> for (i = 0; i < ctx->commits.nr; i++) {
> timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
enough; no need for 'timestamp_t' for 'level' variable.
> + timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
>
All right, we compute both generation numbers: topological levels and
corrected commit date.
I guess we use 'corrected_commit_date' instead of simply 'generation' to
make it easier to remember which is which.
> display_progress(ctx->progress, i + 1);
> if (level != GENERATION_NUMBER_INFINITY &&
> - level != GENERATION_NUMBER_ZERO)
> + level != GENERATION_NUMBER_ZERO &&
> + corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> + corrected_commit_date != GENERATION_NUMBER_ZERO
Straightforward addition.
> + )
Why is this closing parenthesis now on a separate line?
> continue;
>
> commit_list_insert(ctx->commits.list[i], &list);
> @@ -1369,17 +1368,25 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> struct commit_list *parent;
> int all_parents_computed = 1;
> uint32_t max_level = 0;
> + timestamp_t max_corrected_commit_date = 0;
All right, straightforward addition.
>
> for (parent = current->parents; parent; parent = parent->next) {
> level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> -
Why have we removed this empty line?
> + corrected_commit_date = commit_graph_data_at(parent->item)->generation;
All right.
> if (level == GENERATION_NUMBER_INFINITY ||
> - level == GENERATION_NUMBER_ZERO) {
> + level == GENERATION_NUMBER_ZERO ||
> + corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> + corrected_commit_date == GENERATION_NUMBER_ZERO
> + ) {
All right, same as above.
> all_parents_computed = 0;
> commit_list_insert(parent->item, &list);
> break;
> - } else if (level > max_level) {
> - max_level = level;
> + } else {
> + if (level > max_level)
> + max_level = level;
> +
> + if (corrected_commit_date > max_corrected_commit_date)
> + max_corrected_commit_date = corrected_commit_date;
> }
All right, reasonable and straightforward.
> }
>
> @@ -1389,6 +1396,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> if (max_level > GENERATION_NUMBER_V1_MAX - 1)
> max_level = GENERATION_NUMBER_V1_MAX - 1;
> *topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> +
> + if (current->date && current->date > max_corrected_commit_date)
> + max_corrected_commit_date = current->date - 1;
> + commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
All right.
Here we use the same trick as in previous commit (and as above) to avoid
any possible overflow, to minimize number of conditionals. The fact
that max_corrected_commit_date might store incorrect value doesn't
matter, as it is reset at beginning of this loop.
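The equivalence between this form and the textbook max() can be sketched like this. It is a standalone illustration, not the patch itself; `ts_t`, `corrected_date_naive()` and `corrected_date_patch()` are hypothetical names, and `max_parent` is assumed to hold the largest corrected date among the parents (0 for a root commit).

```c
#include <stdint.h>

/* Hypothetical stand-in for Git's timestamp_t. */
typedef uint64_t ts_t;

/* Textbook form: max(date, max_parent + 1). */
static ts_t corrected_date_naive(ts_t date, ts_t max_parent)
{
	return date > max_parent + 1 ? date : max_parent + 1;
}

/*
 * Form used in the hunk above: lower the commit date by one into the
 * running maximum, then add one unconditionally.  "date - 1" cannot
 * wrap around because the branch is only taken for a non-zero date.
 */
static ts_t corrected_date_patch(ts_t date, ts_t max_parent)
{
	if (date && date > max_parent)
		max_parent = date - 1;
	return max_parent + 1;
}
```

Both forms give 1 for a root commit dated 0, which is the special case called out in the commit message.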
> }
> }
> }
> @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> if (generation_zero == GENERATION_ZERO_EXISTS)
> continue;
>
> - /*
> - * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
> - * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
> - * extra logic in the following condition.
> - */
> - if (max_generation == GENERATION_NUMBER_V1_MAX)
> - max_generation--;
> -
Perhaps in the future we should check both topological levels, and
also corrected committer date (if it exists) for correctness according
to their definition. Then the above removed part would be restored (but
with s/max_generation/max_level/).
> generation = commit_graph_generation(graph_commit);
> - if (generation != max_generation + 1)
> - graph_report(_("commit-graph generation for commit %s is %u != %u"),
> + if (generation < max_generation + 1)
> + graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
All right, so we relaxed the check so that it will be fulfilled by
generation number v2 (and also by generation number v1, as it implies
the more strict check for v1).
What would happen however if generation holds topological levels, and it
is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
GENERATION_NUMBER_V1_MAX for a commit? As you can check, the condition
would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
so the `git commit-graph verify` would incorrectly say that there is
a problem with generation number, while there isn't one (false positive
detection of error).
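The false positive can be reproduced in isolation. In the sketch below `SKETCH_V1_MAX` is assumed to have the value 0x3FFFFFFF, and both function names are hypothetical; the second one simply puts back the decrement that this hunk removes.

```c
#include <stdint.h>

/* Assumed value of GENERATION_NUMBER_V1_MAX, for illustration only. */
#define SKETCH_V1_MAX UINT64_C(0x3FFFFFFF)

/* The relaxed condition from this hunk; non-zero means "report error". */
static int relaxed_check_fails(uint64_t generation, uint64_t max_generation)
{
	return generation < max_generation + 1;
}

/* Same condition, with the removed decrement for saturated levels. */
static int clamped_check_fails(uint64_t generation, uint64_t max_generation)
{
	if (max_generation == SKETCH_V1_MAX)
		max_generation--;
	return generation < max_generation + 1;
}
```

With a parent and a child both at the saturated value, the first function flags an error and the second does not.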
Sidenote: I think we don't have to worry about having to introduce
GENERATION_NUMBER_V2_MAX, as the in-memory size of the corrected committer
date (reconstructed from the on-disk representation) is the same as that of
the committer date itself, plus some, and I don't see us coming close to the 64-bit limit
of timestamp_t for commit dates.
> oid_to_hex(&cur_oid),
> generation,
> max_generation + 1);
Best,
--
Jakub Narębski
* Re: [PATCH v4 06/10] commit-graph: implement corrected commit date
2020-10-27 18:53 ` Jakub Narębski
@ 2020-11-03 11:44 ` Abhishek Kumar
2020-11-04 16:45 ` Jakub Narębski
0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-03 11:44 UTC (permalink / raw)
To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee
On Tue, Oct 27, 2020 at 07:53:23PM +0100, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ...
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Somewhere in the commit message we should also describe that this commit
> changes how commit-graph is verified: from checking that the generation
> number agrees with _topological level definition_, that is that for a
> given commit it is 1 more than maximum of its parents (with the caveat
> that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
> checking that slightly weaker condition fulfilled by both topological
> levels (generation number v1) and by corrected commit date (generation
> number v2) that for a given commit its generation number is 1 more than
> maximum of its parents or larger.
Sure, that makes sense. Will add.
>
> But, as far as I understand it, current code does not handle correctly
> GENERATION_NUMBER_V1_MAX case (if we use generation number v1).
>
> On the other hand we could simply use a functional check, that the
> generation number used (which can be v1 or v2, or any similar other)
> fulfills the reachability condition for each edge, which can be
> simplified to checking that generation(parents) <= generation(commit).
> If the reachability condition is true for each edge, then it is true for
> each path, and for each commit.
>
> > ---
> > commit-graph.c | 43 +++++++++++++++++++++++--------------------
> > 1 file changed, 23 insertions(+), 20 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index cedd311024..03948adfce 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
> > else if (generation_a > generation_b)
> > return 1;
> >
> > - /* use date as a heuristic when generations are equal */
> > - if (a->date < b->date)
> > - return -1;
> > - else if (a->date > b->date)
> > - return 1;
>
> Why this change? It is not described in the commit message.
>
> Note that while this tie-breaking fallback doesn't make much sense for
> corrected committer date generation number v2, this tie-breaking helps
> if we have to use topological levels (generation number v1).
>
Right, I should have mentioned this change (and it's not something that
makes a difference either way).
We call commit_gen_cmp() only when we are sorting commits by generation
to speed up computation of Bloom filters i.e. while writing a commit
graph (either split commit-graph or a simple commit-graph).
Since we are always computing and storing corrected commit date when we
are writing (whether we write a GDAT chunk or not), using date as
heuristic is no longer required.
> > return 0;
> > }
> >
> > @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> > ctx->commits.nr);
> > for (i = 0; i < ctx->commits.nr; i++) {
> > timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>
> Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
> enough; no need for 'timestamp_t' for 'level' variable.
>
> > + timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
> >
We need the 'timestamp_t' as we are comparing level with the now 64-bits
GENERATION_NUMBER_INFINITY. I thought uint32_t would be promoted to
timestamp_t. I have a hunch that since we are explicitly using a fixed
width data type, compiler is unwilling to type coerce into broader data
types.
Advice on this appreciated.
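On the promotion question, a small sketch of what the C conversion rules do here. As far as the language goes, comparing a uint32_t with a wider unsigned value converts the narrower operand (zero-extending it) before the comparison, fixed-width type or not. What a 32-bit variable cannot do is hold a sentinel that needs more than 32 bits. Both sentinel values below are assumptions for illustration, not taken from this series, and `level_is_sentinel()` is a hypothetical helper.

```c
#include <stdint.h>

/* Assumed sentinels, for illustration only. */
#define SKETCH_INFINITY_32 UINT32_C(0xFFFFFFFF)
#define SKETCH_INFINITY_64 ((UINT64_C(1) << 63) - 1)

/*
 * "level" is converted to uint64_t by the usual arithmetic conversions
 * before the comparison, so this compiles and is well defined.
 */
static int level_is_sentinel(uint32_t level, uint64_t sentinel)
{
	return level == sentinel;
}
```

So a 32-bit `level` holding the old all-ones value compares equal to the 32-bit sentinel but never to the 64-bit one, whichever type the local variable is declared with.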
>
> All right, we compute both generation numbers: topological levels and
> corrected commit date.
>
> I guess we use 'corrected_commit_date' instead of simply 'generation' to
> make it easier to remember which is which.
>
> > display_progress(ctx->progress, i + 1);
> > if (level != GENERATION_NUMBER_INFINITY &&
> > - level != GENERATION_NUMBER_ZERO)
> > + level != GENERATION_NUMBER_ZERO &&
> > + corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> > + corrected_commit_date != GENERATION_NUMBER_ZERO
>
> Straightforward addition.
>
> > + )
>
> Why is this closing parenthesis now on a separate line?
>
> > continue;
> >
> > commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1369,17 +1368,25 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> > struct commit_list *parent;
> > int all_parents_computed = 1;
> > uint32_t max_level = 0;
> > + timestamp_t max_corrected_commit_date = 0;
>
> All right, straightforward addition.
>
> >
> > for (parent = current->parents; parent; parent = parent->next) {
> > level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> > -
>
> Why have we removed this empty line?
>
> > + corrected_commit_date = commit_graph_data_at(parent->item)->generation;
>
> All right.
>
> > if (level == GENERATION_NUMBER_INFINITY ||
> > - level == GENERATION_NUMBER_ZERO) {
> > + level == GENERATION_NUMBER_ZERO ||
> > + corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> > + corrected_commit_date == GENERATION_NUMBER_ZERO
> > + ) {
>
> All right, same as above.
>
> > all_parents_computed = 0;
> > commit_list_insert(parent->item, &list);
> > break;
> > - } else if (level > max_level) {
> > - max_level = level;
> > + } else {
> > + if (level > max_level)
> > + max_level = level;
> > +
> > + if (corrected_commit_date > max_corrected_commit_date)
> > + max_corrected_commit_date = corrected_commit_date;
> > }
>
> All right, reasonable and straightforward.
>
> > }
> >
> > @@ -1389,6 +1396,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> > if (max_level > GENERATION_NUMBER_V1_MAX - 1)
> > max_level = GENERATION_NUMBER_V1_MAX - 1;
> > *topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> > +
> > + if (current->date && current->date > max_corrected_commit_date)
> > + max_corrected_commit_date = current->date - 1;
> > + commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
>
> All right.
>
> Here we use the same trick as in previous commit (and as above) to avoid
> any possible overflow, to minimize number of conditionals. The fact
> that max_corrected_commit_date might store incorrect value doesn't
> matter, as it is reset at beginning of this loop.
>
> > }
> > }
> > }
> > @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> > if (generation_zero == GENERATION_ZERO_EXISTS)
> > continue;
> >
> > - /*
> > - * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
> > - * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
> > - * extra logic in the following condition.
> > - */
> > - if (max_generation == GENERATION_NUMBER_V1_MAX)
> > - max_generation--;
> > -
>
> Perhaps in the future we should check both topological levels, and
> also corrected committer date (if it exists) for correctness according
> to their definition. Then the above removed part would be restored (but
> with s/max_generation/max_level/).
>
> > generation = commit_graph_generation(graph_commit);
> > - if (generation != max_generation + 1)
> > - graph_report(_("commit-graph generation for commit %s is %u != %u"),
> > + if (generation < max_generation + 1)
> > + graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>
> All right, so we relaxed the check so that it will be fulfilled by
> generation number v2 (and also by generation number v1, as it implies
> the more strict check for v1).
>
> What would happen however if generation holds topological levels, and it
> is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
> GENERATION_NUMBER_V1_MAX for a commit? As you can check, the condition
> would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
> so the `git commit-graph verify` would incorrectly say that there is
> a problem with generation number, while there isn't one (false positive
> detection of error).
Alright, so the above block still makes sense if we are working with
topological levels but not with corrected commit dates. Instead of
removing it, I will modify the condition to check that one of our parents
has GENERATION_NUMBER_V1_MAX and the graph uses topological levels.
Surprised that no test breaks with this change.
I have also moved changes in the verify function to the next patch, as
we cannot write or read corrected commit dates yet - so little sense in
modifying verify.
>
> Sidenote: I think we don't have to worry about having to introduce
> GENERATION_NUMBER_V2_MAX, as the in-memory size of the corrected committer
> date (reconstructed from the on-disk representation) is the same as that of
> the committer date itself, plus some, and I don't see us coming close to the 64-bit limit
> of timestamp_t for commit dates.
>
> > oid_to_hex(&cur_oid),
> > generation,
> > max_generation + 1);
>
> Best,
> --
> Jakub Narębski
Thanks
- Abhishek
* Re: [PATCH v4 06/10] commit-graph: implement corrected commit date
2020-11-03 11:44 ` Abhishek Kumar
@ 2020-11-04 16:45 ` Jakub Narębski
2020-11-05 14:05 ` Philip Oakley
0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-11-04 16:45 UTC (permalink / raw)
To: Abhishek Kumar
Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau
Hello Abhishek,
Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Tue, Oct 27, 2020 at 07:53:23PM +0100, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>> ...
>>> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>
>> Somewhere in the commit message we should also describe that this commit
>> changes how commit-graph is verified: from checking that the generation
>> number agrees with _topological level definition_, that is that for a
>> given commit it is 1 more than maximum of its parents (with the caveat
>> that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
>> checking that slightly weaker condition fulfilled by both topological
>> levels (generation number v1) and by corrected commit date (generation
>> number v2) that for a given commit its generation number is 1 more than
>> maximum of its parents or larger.
>
> Sure, that makes sense. Will add.
Actually this description should match whatever we decide about
mechanism for verifying correctness of generation numbers (see below).
Because we have to choose one.
>>
>> But, as far as I understand it, current code does not handle correctly
>> GENERATION_NUMBER_V1_MAX case (if we use generation number v1).
>>
>> On the other hand we could simply use a functional check, that the
>> generation number used (which can be v1 or v2, or any similar other)
>> fulfills the reachability condition for each edge, which can be
>> simplified to checking that generation(parents) <= generation(commit).
>> If the reachability condition is true for each edge, then it is true for
>> each path, and for each commit.
See below.
>>> ---
>>> commit-graph.c | 43 +++++++++++++++++++++++--------------------
>>> 1 file changed, 23 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/commit-graph.c b/commit-graph.c
>>> index cedd311024..03948adfce 100644
>>> --- a/commit-graph.c
>>> +++ b/commit-graph.c
>>> @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>> else if (generation_a > generation_b)
>>> return 1;
>>>
>>> - /* use date as a heuristic when generations are equal */
>>> - if (a->date < b->date)
>>> - return -1;
>>> - else if (a->date > b->date)
>>> - return 1;
>>
>> Why this change? It is not described in the commit message.
>>
>> Note that while this tie-breaking fallback doesn't make much sense for
>> corrected committer date generation number v2, this tie-breaking helps
>> if we have to use topological levels (generation number v1).
>>
>
> Right, I should have mentioned this change (and it's not something that
> makes a difference either way).
>
> We call commit_gen_cmp() only when we are sorting commits by generation
> to speed up computation of Bloom filters i.e. while writing a commit
> graph (either split commit-graph or a simple commit-graph).
>
> Since we are always computing and storing corrected commit date when we
> are writing (whether we write a GDAT chunk or not), using date as
> heuristic is no longer required.
Thanks. This description really should be added to the commit message,
because (yet again?) I was confused by this change.
Sidenote: it is not obvious at least to me that this function is used
only for sorting commits to speed up computation of Bloom filters while
writing the commit-graph (`git commit-graph write --changed-paths [other
options]`).
>>> return 0;
>>> }
>>>
>>> @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>> ctx->commits.nr);
>>> for (i = 0; i < ctx->commits.nr; i++) {
>>> timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>>
>> Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
>> enough; no need for 'timestamp_t' for 'level' variable.
>>
>>> + timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
>>>
>
> We need the 'timestamp_t' as we are comparing level with the now 64-bits
> GENERATION_NUMBER_INFINITY. I thought uint32_t would be promoted to
> timestamp_t. I have a hunch that since we are explicitly using a fixed
> width data type, compiler is unwilling to type coerce into broader data
> types.
>
> Advice on this appreciated.
All right, so the wider type is used because of comparison with
wide-uint GENERATION_NUMBER_INFINITY. I stand corrected.
[...]
>>> @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>>> if (generation_zero == GENERATION_ZERO_EXISTS)
>>> continue;
>>>
>>> - /*
>>> - * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
>>> - * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
>>> - * extra logic in the following condition.
>>> - */
>>> - if (max_generation == GENERATION_NUMBER_V1_MAX)
>>> - max_generation--;
>>> -
>>
>> Perhaps in the future we should check both topological levels, and
>> also corrected committer date (if it exists) for correctness according
>> to their definition. Then the above removed part would be restored (but
>> with s/max_generation/max_level/).
>>
>>> generation = commit_graph_generation(graph_commit);
>>> - if (generation != max_generation + 1)
>>> - graph_report(_("commit-graph generation for commit %s is %u != %u"),
>>> + if (generation < max_generation + 1)
>>> + graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>>
>> All right, so we relaxed the check so that it will be fulfilled by
>> generation number v2 (and also by generation number v1, as it implies
>> the more strict check for v1).
>>
>> What would happen however if generation holds topological levels, and it
>> is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
>> GENERATION_NUMBER_V1_MAX for a commit? As you can check, the condition
>> would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
>> so the `git commit-graph verify` would incorrectly say that there is
>> a problem with generation number, while there isn't one (false positive
>> detection of error).
>
> Alright, so the above block still makes sense if we are working with
> topological levels but not with corrected commit dates. Instead of
> removing it, I will modify the condition to check that one of our parents
> has GENERATION_NUMBER_V1_MAX and the graph uses topological levels.
That is one of the 3 possible solutions I can think of.
I. First solution is to switch from checking that generation number
matches its definition to checking that the [weaker] reachability
condition for the generation number is true, that is:
if (generation < max_generation)
graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
The [weaker] reachability condition for generation numbers states that
A reachable from B => gen(A) <= gen(B)
This condition is true even if one or more generation numbers is
GENERATION_NUMBER_ZERO (uninitialized or written by old git version),
GENERATION_NUMBER_V1_MAX (we hit storage limitations, can happen only
for generation number v1), or GENERATION_NUMBER_INFINITY (for commits
outside of the serialized commit-graph, doesn't matter and cannot happen
during verification of the commit-graph data by definition).
This means that if P* is the parent of C with the maximal generation
number, and gen(C) < gen(P*) is true (while gen(P*) <= gen(C) should be
true), then there is a problem with generation number.
This is what I thought you were going for, and what I have proposed.
Advantages:
- we are testing what actually matters for speeding up reachability
queries, namely that the reachability property holds true
- the test works for generation number v1, generation number v2,
and any possible future use-compatible generation number
(not that I think we would need any)
- least complicated solution
Disadvantages:
- weaker test than we have had for generation number v1 (topological
levels), and weaker than the possible test for generation number v2
that we could have (see below)
II. Verify corrected committer date (generation number v2) if available,
and verify topological levels (generation number v1) otherwise, checking
that it matches the definition of it -- using version-specific checks.
This would probably mean adding a conditional around the code verifying
that given generation number is correct, possibly:
if (g->read_generation_data) {
/* verify corrected commit date */
} else {
/* current code for verifying topological levels */
}
II.a. For topological levels (generation number v1) we would continue
checking that it matches the definition, that is that the following
condition holds:
gen(C) = max_{P: P ∈ parents(C)} gen(P) + 1
This includes code for handling the case where `max_generation`, holding
max_{P: P ∈ parents(C)} gen(P), is GENERATION_NUMBER_V1_MAX.
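A sketch of the II.a check as a predicate, assuming `max_parent_level` already holds the maximum over the parents (0 for a root commit). `SKETCH_V1_MAX` is assumed to be 0x3FFFFFFF and `topo_level_matches_definition()` is a hypothetical name.

```c
#include <stdint.h>

/* Assumed value of GENERATION_NUMBER_V1_MAX, for illustration only. */
#define SKETCH_V1_MAX UINT32_C(0x3FFFFFFF)

/*
 * Returns 1 when "level" agrees with the topological level definition,
 * treating a saturated parent as "one below saturation" so that the
 * child is expected to be saturated as well.
 */
static int topo_level_matches_definition(uint32_t level,
					 uint32_t max_parent_level)
{
	if (max_parent_level == SKETCH_V1_MAX)
		max_parent_level--;
	return level == max_parent_level + 1;
}
```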
II.b. For corrected committer dates (generation number v2) we can use the
code proposed by this revision of this commit, namely we check if the
following condition holds:
gen(P) + 1 <= gen(C) for each P \in parents(C)
or, in other words:
max_{P: P ∈ parents(C)} { gen(P) } + 1 <= gen(C)
Which could be checked using the following code (i.e. current state
after this revision of this patch):
if (generation < max_generation + 1)
graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
This is what I think you are proposing now.
Additionally, theoretically we could also check that the following
condition holds for corrected committer date:
committer_date(C) <= gen_v2(C)
but this is automatically fulfilled because we use non-negative offsets
to store corrected committer date info.
Alternatively we can check for compliance with the definition of the
corrected committer date:
if (max_generation + 1 <= graph_commit->date) {
/* commit date does not need correction */
if (generation != graph_commit->date)
graph_report(_("commit-graph corrected commit date for commit %s "
"is %"PRItime" != %"PRItime" commit date"),
...);
} else {
if (generation != max_generation + 1)
graph_report(_("commit-graph generation v2 for commit %s is %"PRItime" != %"PRItime),
...);
}
Though I think it might be overkill.
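For reference, the same strict check condensed into a standalone predicate. This is a sketch only: `ts_t` stands in for `timestamp_t`, `corrected_date_matches_definition()` is a hypothetical name, and `max_generation` is assumed to be the largest corrected date among the parents (0 for a root commit).

```c
#include <stdint.h>

/* Hypothetical stand-in for Git's timestamp_t. */
typedef uint64_t ts_t;

/*
 * Returns 1 when "generation" equals max(commit_date, max_generation + 1),
 * i.e. when it matches the definition of the corrected commit date.
 */
static int corrected_date_matches_definition(ts_t generation,
					     ts_t commit_date,
					     ts_t max_generation)
{
	if (max_generation + 1 <= commit_date)
		/* commit date does not need correction */
		return generation == commit_date;
	return generation == max_generation + 1;
}
```

A root commit dated 0 takes the second branch and is expected to have generation 1, matching the special case in the commit message.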
Advantages:
- more strict tests, checking generation numbers (v2 if present, v1
otherwise) against their definition
- if there is no GDAT chunk, verify works just like it did before
Disadvantages:
- more complicated code
- possibly measurable performance degradation due to extra conditional
III. Like II., but if there is generation numbers chunk (GDAT chunk), we
verify *both* topological levels (v1) and corrected commit date (v2)
against their definition. If GDAT chunk is not present, it reduces to
current code (before this patch series).
Advantages:
- if there is no GDAT chunk, verify works just like it did before
- most strict tests, verifying all the data: both generation number v1
and generation number v2 -- if possible
Disadvantages:
- most complex code; we need to somehow extract topological levels
if the GDAT chunk is present (they are not on graph data slab in this
case); I have not even started to think how it could be done
- slower verification
> Surprised that no test breaks with this change.
I don't think we have any test that creates a commit graph with
topological levels greater than GENERATION_NUMBER_V1_MAX; this would be
expensive and have to be of course protected by GIT_TEST_LONG aka
EXPENSIVE prerequisite.
# GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 is here to force verification of topological levels
test_expect_success EXPENSIVE 'verify handles topological levels > GENERATION_NUMBER_V1_MAX' '
rm -rf long_chain &&
git init long_chain &&
test_commit_bulk -C long_chain 1073741824 &&
(
cd long_chain &&
GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write &&
git commit-graph verify
)
'
This however lies slightly outside the scope of this patch series,
though if you could add this test (in a separate patch), after testing
it, it would be very nice.
>
> I have also moved changes in the verify function to the next patch, as
> we cannot write or read corrected commit dates yet - so little sense in
> modifying verify.
I think putting changes to the verify function in a separate patch, be
it before or after this one (depending on the choice of the algorithm
for verification, see above) would be a good idea.
>>
>> Sidenote: I think we don't have to worry about having to introduce
>> GENERATION_NUMBER_V2_MAX, as the in-memory size of the corrected committer
>> date (reconstructed from the on-disk representation) is the same as that of
>> the committer date itself, plus some, and I don't see us coming close to the 64-bit limit
>> of timestamp_t for commit dates.
>>
>>> oid_to_hex(&cur_oid),
>>> generation,
>>> max_generation + 1);
Best,
--
Jakub Narębski
* Re: [PATCH v4 06/10] commit-graph: implement corrected commit date
2020-11-04 16:45 ` Jakub Narębski
@ 2020-11-05 14:05 ` Philip Oakley
2020-11-05 18:22 ` Junio C Hamano
0 siblings, 1 reply; 211+ messages in thread
From: Philip Oakley @ 2020-11-05 14:05 UTC (permalink / raw)
To: Jakub Narębski, Abhishek Kumar
Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau
Hi Abhishek,
On 04/11/2020 16:45, Jakub Narębski wrote:
> Hello Abhishek,
>
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>> On Tue, Oct 27, 2020 at 07:53:23PM +0100, Jakub Narębski wrote:
>>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>
>>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>>> ...
>>>> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>> Somewhere in the commit message we should also describe that this commit
>>> changes how commit-graph is verified: from checking that the generation
>>> number agrees with _topological level definition_, that is that for a
>>> given commit it is 1 more than maximum of its parents (with the caveat
>>> that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
>>> checking that slightly weaker condition fulfilled by both topological
>>> levels (generation number v1) and by corrected commit date (generation
>>> number v2) that for a given commit its generation number is 1 more than
>>> maximum of its parents or larger.
>> Sure, that makes sense. Will add.
> Actually this description should match whatever we decide about
> mechanism for verifying correctness of generation numbers (see below).
> Because we have to choose one.
This may not be part of the main project, but could you consider, if
time permits, also adding some entries into the Git Glossary (`git help
glossary`) for the various terms we are using here and elsewhere, e.g.
'topological levels', 'generation number', 'corrected commit date' (and
its fancy technical name for the use of date heuristics e.g. the
'chronological ordering';).
The glossary can provide a reference, once the issues are resolved. The
History Simplification and Commit Ordering section of git-log may be a
useful guide to some of the terms that would link to the glossary.
--
Philip
>>> But, as far as I understand it, current code does not handle correctly
>>> GENERATION_NUMBER_V1_MAX case (if we use generation number v1).
>>>
>>> On the other hand we could simply use a functional check, that the
>>> generation number used (which can be v1 or v2, or any similar other)
>>> fulfills the reachability condition for each edge, which can be
>>> simplified to checking that generation(parents) <= generation(commit).
>>> If the reachability condition is true for each edge, then it is true for
>>> each path, and for each commit.
> See below.
>
>>>> ---
>>>> commit-graph.c | 43 +++++++++++++++++++++++--------------------
>>>> 1 file changed, 23 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/commit-graph.c b/commit-graph.c
>>>> index cedd311024..03948adfce 100644
>>>> --- a/commit-graph.c
>>>> +++ b/commit-graph.c
>>>> @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>>> else if (generation_a > generation_b)
>>>> return 1;
>>>>
>>>> - /* use date as a heuristic when generations are equal */
>>>> - if (a->date < b->date)
>>>> - return -1;
>>>> - else if (a->date > b->date)
>>>> - return 1;
>>> Why this change? It is not described in the commit message.
>>>
>>> Note that while this tie-breaking fallback doesn't make much sense for
>>> corrected committer date generation number v2, this tie-breaking helps
>>> if we have to use topological levels (generation number v1).
>>>
>> Right, I should have mentioned this change (and it's not something that
>> makes a difference either way).
>>
>> We call commit_gen_cmp() only when we are sorting commits by generation
>> to speed up computation of Bloom filters i.e. while writing a commit
>> graph (either split commit-graph or a simple commit-graph).
>>
>> Since we are always computing and storing corrected commit date when we
>> are writing (whether we write a GDAT chunk or not), using date as
>> heuristic is no longer required.
> Thanks. This description really should be added to the commit message,
> because (yet again?) I was confused by this change.
>
> Sidenote: it is not obvious at least to me that this function is used
> only for sorting commits to speed up computation of Bloom filters while
> writing the commit-graph (`git commit-graph write --changed-paths [other
> options]`).
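To make the point about sorting concrete, here is a minimal sketch of a
generation-only comparator used with qsort(). It is an illustration under
assumptions, not the code in commit-graph.c: `struct toy_commit`,
`toy_gen_cmp()` and `sort_by_generation()` are hypothetical names, and the
generation is stored directly in the struct rather than in a slab.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for a commit; NOT Git's struct commit. */
struct toy_commit {
	uint64_t generation;
};

/* Compare two "pointer to commit" array elements by generation only. */
int toy_gen_cmp(const void *va, const void *vb)
{
	const struct toy_commit *a = *(const struct toy_commit *const *)va;
	const struct toy_commit *b = *(const struct toy_commit *const *)vb;

	if (a->generation < b->generation)
		return -1;
	if (a->generation > b->generation)
		return 1;
	return 0; /* no date tie-breaker */
}

/* Sort an array of commit pointers into ascending generation order. */
void sort_by_generation(const struct toy_commit **list, size_t nr)
{
	assert(list != NULL || nr == 0);
	if (nr)
		qsort(list, nr, sizeof(*list), toy_gen_cmp);
}
```

Note that qsort() is not guaranteed to be stable, so commits with equal
generation may end up in any relative order once the tie-breaker is gone.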
>
>>>> return 0;
>>>> }
>>>>
>>>> @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>>> ctx->commits.nr);
>>>> for (i = 0; i < ctx->commits.nr; i++) {
>>>> timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>>> Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
>>> enough; no need for 'timestamp_t' for 'level' variable.
>>>
>>>> + timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
>>>>
>> We need the 'timestamp_t' as we are comparing level with the now 64-bits
>> GENERATION_NUMBER_INFINITY. I thought uint32_t would be promoted to
>> timestamp_t. I have a hunch that since we are explicitly using a fixed
>> width data type, compiler is unwilling to type coerce into broader data
>> types.
>>
>> Advice on this appreciated.
> All right, so the wider type is used because of comparison with
> wide-uint GENERATION_NUMBER_INFINITY. I stand corrected.
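On the type question, a small sketch of what the C conversion rules do in
a mixed-width comparison may help. Everything named here is an assumption
for illustration: `TOY_INFINITY` is a made-up 64-bit sentinel (the real
GENERATION_NUMBER_INFINITY may have a different value), and
`equals_wide()` and `narrow_checked()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit sentinel, used only for this illustration. */
#define TOY_INFINITY UINT64_MAX

/*
 * The 32-bit operand is converted to 64 bits before the comparison,
 * so a 32-bit value can never compare equal to a sentinel that does
 * not fit in 32 bits.
 */
int equals_wide(uint32_t level, uint64_t wide)
{
	return level == wide;
}

/* Narrow a 64-bit value only after checking that it fits in 32 bits. */
uint32_t narrow_checked(uint64_t wide)
{
	assert(wide <= UINT32_MAX);
	return (uint32_t)wide;
}
```

In other words, the comparison itself is well defined for a 'uint32_t'
variable; it just can never be true against a sentinel that needs more
than 32 bits, which is one reason to keep the variable wide.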
>
> [...]
>>>> @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>>>> if (generation_zero == GENERATION_ZERO_EXISTS)
>>>> continue;
>>>>
>>>> - /*
>>>> - * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
>>>> - * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
>>>> - * extra logic in the following condition.
>>>> - */
>>>> - if (max_generation == GENERATION_NUMBER_V1_MAX)
>>>> - max_generation--;
>>>> -
>>> Perhaps in the future we should check that both topological levels, and
>>> also corrected committer date (if it exists) for correctness according
>>> to their definition. Then the above removed part would be restored (but
>>> with s/max_generation/max_level/).
>>>
>>>> generation = commit_graph_generation(graph_commit);
>>>> - if (generation != max_generation + 1)
>>>> - graph_report(_("commit-graph generation for commit %s is %u != %u"),
>>>> + if (generation < max_generation + 1)
>>>> + graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>>> All right, so we relaxed the check so that it will be fulfilled by
>>> generation number v2 (and also by generation number v1, as it implies
>>> the more strict check for v1).
>>>
>>> What would happen however if generation holds topological levels, and it
>>> is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
>>> GENERATION_NUMBER_V1_MAX for a commit? As you can check, the condition
>>> would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
>>> so the `git commit-graph verify` would incorrectly say that there is
>>> a problem with generation number, while there isn't one (false positive
>>> detection of error).
>> Alright, so the above block still makes sense if we are working with
>> topological levels but not with corrected commit dates. Instead of
>> removing it, I will modify the condition to check that one of our parents
>> has GENERATION_NUMBER_V1_MAX and the graph uses topological levels.
> That is one of the 3 possible solutions I can think of.
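A minimal sketch of the modified condition described above, under stated
assumptions: `generation_too_small()` and its `uses_topo_levels` flag are
hypothetical, and `TOY_V1_MAX` is assumed to mirror the value of
GENERATION_NUMBER_V1_MAX (0x3FFFFFFF), which is not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror GENERATION_NUMBER_V1_MAX; an assumption, not quoted code. */
#define TOY_V1_MAX 0x3FFFFFFFu

/*
 * Return 1 if the relaxed check would report an error, 0 otherwise.
 * When the graph stores topological levels and a parent is already at
 * the cap, the commit is also at the cap, so decrement the maximum
 * before applying "generation < max + 1".
 */
int generation_too_small(uint64_t generation,
			 uint64_t max_parent_generation,
			 int uses_topo_levels)
{
	assert(max_parent_generation < UINT64_MAX); /* avoid wrap-around */
	if (uses_topo_levels && max_parent_generation == TOY_V1_MAX)
		max_parent_generation--;
	return generation < max_parent_generation + 1;
}
```

Passing uses_topo_levels = 0 for a commit and parent that both sit at the
cap reproduces the false positive discussed above.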
>
>
> I. First solution is to switch from checking that generation number
> matches its definition to checking that the [weaker] reachability
> condition for the generation number is true, that is:
>
> if (generation < max_generation)
> graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>
> The [weaker] reachability condition for generation numbers states that
>
> A reachable from B => gen(A) <= gen(B)
>
> This condition is true even if one or more generation numbers is
> GENERATION_NUMBER_ZERO (uninitialized or written by old git version),
> GENERATION_NUMBER_V1_MAX (we hit storage limitations, can happen only
> for generation number v1), or GENERATION_NUMBER_INFINITY (for commits
> outside of the serialized commit-graph, doesn't matter and cannot happen
> during verification of the commit-graph data by definition).
>
> This means that if P* is the parent of C with the maximal generation
> number, and gen(C) < gen(P*) is true (while gen(P*) <= gen(C) should be
> true), then there is a problem with generation number.
>
> This is what I thought you were going for, and what I have proposed.
>
> Advantages:
> - we are testing what actually matters for speeding up reachability
> queries, namely that the reachability property holds true
> - the test works for generation number v1, generation number v2,
> and any possible future use-compatible generation number
> (not that I think we would need any)
> - least complicated solution
>
> Disadvantages:
> - weaker test than we have had for generation number v1 (topological
> levels), and weaker than the possible test for generation number v2
> that we could have (see below)
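Solution I can be sketched in a few lines. This is an illustration under
assumptions, not a patch: `struct toy_commit` and `reachability_holds()`
are hypothetical names, and the generation is kept in the struct instead
of being read through commit_graph_generation().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a commit; NOT Git's struct commit. */
struct toy_commit {
	uint64_t generation;               /* v1 or v2, the check does not care */
	const struct toy_commit **parents; /* may be NULL when nr_parents == 0 */
	size_t nr_parents;
};

/* Return 1 if gen(P) <= gen(C) holds for every parent P of c, 0 otherwise. */
int reachability_holds(const struct toy_commit *c)
{
	size_t i;

	assert(c != NULL);
	for (i = 0; i < c->nr_parents; i++)
		if (c->parents[i]->generation > c->generation)
			return 0;
	return 1;
}
```

Because the condition is checked per edge, a commit whose generation
equals that of its parent passes here, which is exactly what makes this
test weaker than the definition-based ones.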
>
>
> II. Verify corrected committer date (generation number v2) if available,
> and verify topological levels (generation number v1) otherwise, checking
> that it matches the definition of it -- using version-specific checks.
>
> This would probably mean adding a conditional around the code verifying
> that given generation number is correct, possibly:
>
> if (g->read_generation_data) {
> /* verify corrected commit date */
> } else {
> /* current code for verifying topological levels */
> }
>
> II.a. For topological levels (generation number v1) we would continue
> checking that it matches the definition, that is that the following
> condition holds:
>
> gen(C) = max_{P: P ∈ parents(C)} gen(P) + 1
>
> This includes code for handling the case where `max_generation`, holding
> max_{P: P ∈ parents(C)} gen(P), is GENERATION_NUMBER_V1_MAX.
>
> II.b. For corrected committer dates (generation number v2) we can use the
> code proposed by this revision of this commit, namely we check if the
> following condition holds:
>
> gen(P) + 1 <= gen(C) for each P \in parents(C)
>
> or, in other words:
>
> max_{P: P ∈ parents(C)} { gen(P) } + 1 <= gen(C)
>
> Which could be checked using the following code (i.e. current state
> after this revision of this patch):
>
> if (generation < max_generation + 1)
> graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>
> This is what I think you are proposing now.
>
> Additionally, theoretically we could also check that the following
> condition holds for corrected committer date:
>
> committer_date(C) <= gen_v2(C)
>
> but this is automatically fulfilled because we use non-negative offsets
> to store corrected committer date info.
>
> Alternatively we can check for compliance with the definition of the
> corrected committer date:
>
> if (max_generation + 1 <= graph_commit->date) {
> /* commit date does not need correction */
> if (generation != graph_commit->date)
> graph_report(_("commit-graph corrected commit date for commit %s "
> "is %"PRItime" != %"PRItime" commit date"),
> ...);
> } else {
> if (generation != max_generation + 1)
> graph_report(_("commit-graph generation v2 for commit %s is %"PRItime" != %"PRItime),
> ...);
> }
>
> Though I think it might be overkill.
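The definition being checked above can be sketched as a pure function.
This is only an illustration under assumptions: `corrected_date()` and
`matches_definition()` are hypothetical helpers, plain uint64_t stands in
for timestamp_t, and a commit without parents is modelled by passing 0 as
the maximum parent generation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Corrected commit date per the definition discussed above: the commit
 * date itself, unless that would not exceed the largest corrected date
 * among the parents, in which case it is that maximum plus one.
 */
uint64_t corrected_date(uint64_t commit_date, uint64_t max_parent_generation)
{
	uint64_t floor;

	assert(max_parent_generation < UINT64_MAX); /* avoid wrap-around */
	floor = max_parent_generation + 1;
	return commit_date >= floor ? commit_date : floor;
}

/* Return 1 if a stored generation agrees with the definition, else 0. */
int matches_definition(uint64_t generation, uint64_t commit_date,
		       uint64_t max_parent_generation)
{
	return generation == corrected_date(commit_date, max_parent_generation);
}
```

Both branches of the verification code quoted above collapse into the
single comparison in matches_definition().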
>
> Advantages:
> - more strict tests, checking generation numbers (v2 if present, v1
> otherwise) against their definition
> - if there is no GDAT chunk, verify works just like it did before
>
> Disadvantages:
> - more complicated code
> - possibly measurable performance degradation due to extra conditional
>
>
> III. Like II., but if there is generation numbers chunk (GDAT chunk), we
> verify *both* topological levels (v1) and corrected commit date (v2)
> against their definition. If GDAT chunk is not present, it reduces to
> current code (before this patch series).
>
> Advantages:
> - if there is no GDAT chunk, verify works just like it did before
> - most strict tests, verifying all the data: both generation number v1
> and generation number v2 -- if possible
>
> Disadvantages:
> - most complex code; we need to somehow extract topological levels
> if the GDAT chunk is present (they are not on graph data slab in this
> case); I have not even started to think how it could be done
> - slower verification
>
>> Surprised that no test breaks by this change.
> I don't think we have any test that creates a commit graph with
> topological levels greater than GENERATION_NUMBER_V1_MAX; this would be
> expensive and have to be of course protected by GIT_TEST_LONG aka
> EXPENSIVE prerequisite.
>
> # GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 is here to force verification of topological levels
> test_expect_success EXPENSIVE 'verify handles topological levels > GENERATION_NUMBER_V1_MAX' '
> rm -rf long_chain &&
> git init long_chain &&
> test_commit_bulk -C long_chain 1073741824 &&
> (
> cd long_chain &&
> GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write &&
> git commit-graph verify
> )
> '
>
> This however lies slightly outside the scope of this patch series,
> though if you could add this test (in a separate patch), after testing
> it, it would be very nice.
>
>> I have also moved changes in the verify function to the next patch, as
>> we cannot write or read corrected commit dates yet - so little sense in
>> modifying verify.
> I think putting changes to the verify function in a separate patch, be
> it before or after this one (depending on the choice of the algorithm
> for verification, see above) would be a good idea.
>
>>> Sidenote: I think we don't have to worry about having to introduce
>>> GENERATION_NUMBER_V2_MAX, as the in-memory size of the corrected
>>> committer date (reconstructed from its disk representation) is the
>>> same as that of the committer
>>> of timestamp_t for commit dates.
>>>
>>>> oid_to_hex(&cur_oid),
>>>> generation,
>>>> max_generation + 1);
> Best,
* Re: [PATCH v4 06/10] commit-graph: implement corrected commit date
2020-11-05 14:05 ` Philip Oakley
@ 2020-11-05 18:22 ` Junio C Hamano
2020-11-06 18:26 ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Jakub Narębski
0 siblings, 1 reply; 211+ messages in thread
From: Junio C Hamano @ 2020-11-05 18:22 UTC (permalink / raw)
To: Philip Oakley
Cc: Jakub Narębski, Abhishek Kumar, git,
Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau
Philip Oakley <philipoakley@iee.email> writes:
> This may not be part of the main project, but could you consider, if
> time permits, also adding some entries into the Git Glossary (`git help
> glossary`) for the various terms we are using here and elsewhere, e.g.
> 'topological levels', 'generation number', 'corrected commit date' (and
> its fancy technical name for the use of date heuristics, e.g.
> 'chronological ordering').
>
> The glossary can provide a reference, once the issues are resolved. The
> History Simplification and Commit Ordering sections of git-log may be a
> useful guide to some of the terms that would link to the glossary.
Ah, I first thought that Documentation/rev-list-options.txt (which
is the relevant part of "git log" documentation you mention here)
already has references to deep technical terms explained in the
glossary and you are suggesting Abhishek to mimic the arrangement by
adding new and agreed-upon terms to the glossary and referring to
them from the commit-graph documentation updated by this series.
But sadly that is not the case. What you are saying is that you
noticed that rev-list-options.txt needs a similar "the terms we use
to explain these two sections should be defined and explained in the
glossary (if they are not) and new references to glossary should be
added there" update.
In any case, that is a very good suggestion. I agree that updating
"git log" doc may be outside the scope of Abhishek's theme, but it
would be very good to have such an update by anybody ;-)
Thanks
* Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date)
2020-11-05 18:22 ` Junio C Hamano
@ 2020-11-06 18:26 ` Jakub Narębski
2020-11-06 19:33 ` Extending and updating gitglossary Junio C Hamano
2020-11-08 17:23 ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Philip Oakley
0 siblings, 2 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-11-06 18:26 UTC (permalink / raw)
To: Junio C Hamano
Cc: Philip Oakley, Abhishek Kumar, git,
Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau
Junio C Hamano <gitster@pobox.com> writes:
> Philip Oakley <philipoakley@iee.email> writes:
>
>> This may not be part of the main project, but could you consider, if
>> time permits, also adding some entries into the Git Glossary (`git help
>> glossary`) for the various terms we are using here and elsewhere, e.g.
>> 'topological levels', 'generation number', 'corrected commit date' (and
>> its fancy technical name for the use of date heuristics, e.g.
>> 'chronological ordering').
>>
>> The glossary can provide a reference, once the issues are resolved. The
>> History Simplification and Commit Ordering sections of git-log may be a
>> useful guide to some of the terms that would link to the glossary.
>
> Ah, I first thought that Documentation/rev-list-options.txt (which
> is the relevant part of "git log" documentation you mention here)
> already have references to deep technical terms explained in the
> glossary and you are suggesting Abhishek to mimic the arrangement by
> adding new and agreed-upon terms to the glossary and referring to
> them from the commit-graph documentation updated by this series.
>
> But sadly that is not the case. What you are saying is that you
> noticed that rev-list-options.txt needs a similar "the terms we use
> to explain these two sections should be defined and explained in the
> glossary (if they are not) and new references to glossary should be
> added there" update.
>
> In any case, that is a very good suggestion. I agree that updating
> "git log" doc may be outside the scope of Abhishek's theme, but it
> would be very good to have such an update by anybody ;-)
The only possible problem I see with this suggestion is that some of
those terms (like 'topological levels' and 'corrected commit date') are
technical terms that should not be of concern to Git users, only to
developers working on Git. (However, one could encounter the term
"generation number" in `git commit-graph verify` output.)
I don't think adding technical terms that the user won't encounter in
the documentation or among messages that Git outputs would be a good
idea. It could confuse users, rather than help them.
Conversely, perhaps we should add Documentation/technical/glossary.txt
to help developers.
P.S. By the way, when looking at Documentation/glossary-content.txt, I
have noticed a few obsolescent entries, like "Git archive", and a few
whose descriptions are, or soon could be, obsolete and would need
updating, like "master" (when the default branch switches to "main"), or
"object identifier" and "SHA-1" (when Git switches away from SHA-1 as its
hash function).
Best,
--
Jakub Narębski
* Re: Extending and updating gitglossary
2020-11-06 18:26 ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Jakub Narębski
@ 2020-11-06 19:33 ` Junio C Hamano
2020-11-08 17:23 ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Philip Oakley
1 sibling, 0 replies; 211+ messages in thread
From: Junio C Hamano @ 2020-11-06 19:33 UTC (permalink / raw)
To: Jakub Narębski
Cc: Philip Oakley, Abhishek Kumar, git,
Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau
Jakub Narębski <jnareb@gmail.com> writes:
> I don't think adding technical terms that the user won't encounter in
> the documentation or among messages that Git outputs would be a good
> idea. It could confuse users, rather than help them.
>
> Conversely, perhaps we should add Documentation/technical/glossary.txt
> to help developers.
Thanks for a thoughtful suggestion to help the target audience. I
agree 100% with the above two paragraphs.
* Re: Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date)
2020-11-06 18:26 ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Jakub Narębski
2020-11-06 19:33 ` Extending and updating gitglossary Junio C Hamano
@ 2020-11-08 17:23 ` Philip Oakley
2020-11-10 1:35 ` Extending and updating gitglossary Jakub Narębski
1 sibling, 1 reply; 211+ messages in thread
From: Philip Oakley @ 2020-11-08 17:23 UTC (permalink / raw)
To: Jakub Narębski, Junio C Hamano
Cc: Abhishek Kumar, git, Abhishek Kumar via GitGitGadget,
Derrick Stolee, Taylor Blau
Hi Jakub,
On 06/11/2020 18:26, Jakub Narębski wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>> Philip Oakley <philipoakley@iee.email> writes:
>>
>>> This may not be part of the main project, but could you consider, if
>>> time permits, also adding some entries into the Git Glossary (`git help
>>> glossary`) for the various terms we are using here and elsewhere, e.g.
>>> 'topological levels', 'generation number', 'corrected commit date' (and
>>> its fancy technical name for the use of date heuristics, e.g.
>>> 'chronological ordering').
>>>
>>> The glossary can provide a reference, once the issues are resolved. The
>>> History Simplification and Commit Ordering sections of git-log may be a
>>> useful guide to some of the terms that would link to the glossary.
>> Ah, I first thought that Documentation/rev-list-options.txt (which
>> is the relevant part of "git log" documentation you mention here)
>> already have references to deep technical terms explained in the
>> glossary and you are suggesting Abhishek to mimic the arrangement by
>> adding new and agreed-upon terms to the glossary and referring to
>> them from the commit-graph documentation updated by this series.
>>
>> But sadly that is not the case. What you are saying is that you
>> noticed that rev-list-options.txt needs a similar "the terms we use
>> to explain these two sections should be defined and explained in the
>> glossary (if they are not) and new references to glossary should be
>> added there" update.
>>
>> In any case, that is a very good suggestion. I agree that updating
>> "git log" doc may be outside the scope of Abhishek's theme, but it
>> would be very good to have such an update by anybody ;-)
> The only possible problem I see with this suggestion is that some of
> those terms (like 'topological levels' and 'corrected commit date') are
> technical terms that should be not of concern for Git user, only for
> developers working on Git. (However one could encounter the term
> "generation number" in `git commit-graph verify` output.)
However we do mention "topolog*" in a number of the manual pages, and
rather less, as yet, in the technical pages.
"Lexicographic" and "chronological" are in the same group of fancy
technical words ;-)
>
> I don't think adding technical terms that the user won't encounter in
> the documentation or among messages that Git outputs would be a good
> idea. It could confuse users, rather than help them.
>
> Conversely, perhaps we should add Documentation/technical/glossary.txt
> to help developers.
I would agree that the Glossary probably ought to be split into the
primary, secondary and background terms so that the core concepts are
separated from the academic/developer style terms.
Git does rip up most of what folks think about version "control",
usually based on the imperfect replication of physical artefacts.
>
> P.S. By the way, when looking at Documentation/glossary-content.txt, I
> have noticed few obsolescent entries, like "Git archive", few that have
> description that soon could be or is obsolete and would need updating,
> like "master" (when default branch switch to "main"), or "object
> identifier" and "SHA-1" (when Git switches away from SHA-1 as hash
> function).
The obsolescent items can be updated. I'm expecting that the 'main' and
'SHA-' changes will eventually be picked up as part of the respective
patch series, hopefully as part of the global replacements.
--
Philip
* Re: Extending and updating gitglossary
2020-11-08 17:23 ` Extending and updating gitglossary (was: Re: [PATCH v4 06/10] commit-graph: implement corrected commit date) Philip Oakley
@ 2020-11-10 1:35 ` Jakub Narębski
2020-11-10 14:04 ` Philip Oakley
0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-11-10 1:35 UTC (permalink / raw)
To: Philip Oakley
Cc: Junio C Hamano, Abhishek Kumar, git,
Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau
Hello Philip,
Philip Oakley <philipoakley@iee.email> writes:
> On 06/11/2020 18:26, Jakub Narębski wrote:
>> Junio C Hamano <gitster@pobox.com> writes:
>>> Philip Oakley <philipoakley@iee.email> writes:
>>>
>>>> This may not be part of the main project, but could you consider, if
>>>> time permits, also adding some entries into the Git Glossary (`git help
>>>> glossary`) for the various terms we are using here and elsewhere, e.g.
>>>> 'topological levels', 'generation number', 'corrected commit date' (and
>>>> its fancy technical name for the use of date heuristics, e.g.
>>>> 'chronological ordering').
>>>>
>>>> The glossary can provide a reference, once the issues are resolved. The
>>>> History Simplification and Commit Ordering sections of git-log may be a
>>>> useful guide to some of the terms that would link to the glossary.
>>>
>>> Ah, I first thought that Documentation/rev-list-options.txt (which
>>> is the relevant part of "git log" documentation you mention here)
>>> already have references to deep technical terms explained in the
>>> glossary and you are suggesting Abhishek to mimic the arrangement by
>>> adding new and agreed-upon terms to the glossary and referring to
>>> them from the commit-graph documentation updated by this series.
>>>
>>> But sadly that is not the case. What you are saying is that you
>>> noticed that rev-list-options.txt needs a similar "the terms we use
>>> to explain these two sections should be defined and explained in the
>>> glossary (if they are not) and new references to glossary should be
>>> added there" update.
Which terms do you feel need a glossary entry?
>>> In any case, that is a very good suggestion. I agree that updating
>>> "git log" doc may be outside the scope of Abhishek's theme, but it
>>> would be very good to have such an update by anybody ;-)
>>
>> The only possible problem I see with this suggestion is that some of
>> those terms (like 'topological levels' and 'corrected commit date') are
>> technical terms that should be not of concern for Git user, only for
>> developers working on Git. (However one could encounter the term
>> "generation number" in `git commit-graph verify` output.)
To be more precise, I think that user-facing glossary should include
only terms that appear in user-facing documentation and in output
messages of Git commands (with the possible exception of maybe output
messages of some low-level plumbing).
I think that the developer-facing glossary should include terms that
appear in technical documentation, and in commit messages in Git
history.
> However we do mention "topolog*" in a number of the manual pages, and
> rather less, as yet, in the technical pages.
>
> "Lexicographic" and "chronological" are in the same group of fancy
> technical words ;-)
I think that 'topological level' would appear only in technical
documentation; if that is the case, then there is no reason to add it
to the user-facing glossary (the gitglossary manpage).
'Topological order' or 'topological sort', 'lexicographical order' and
'chronological order' are not Git-specific terms, and there are no
Git-specific ambiguities. I am therefore a bit unsure about adding them
to *Git* glossary.
- In computer science, a _topological sort_ or _topological ordering_ of
a directed graph is a linear ordering of its vertices such that for
every directed edge uv from vertex u to vertex v, u comes before v in
the ordering.
For Git it means that top to bottom, commits always appear before
their parents. With `--graph` or `--topo-order` Git also avoids
showing commits on multiple lines of history intermixed.
- In mathematics, the _lexicographic_ or _lexicographical order_ (also
known as lexical order, dictionary order, etc.) is a generalization of
the alphabetical order.
For Git it is simply alphabetical order.
- _Chronological order_ is the arrangement of things following one after
another in time; or in other words date order.
Note that with `git log --date-order` commits also always appear before
their parents, but otherwise commits are shown in commit timestamp
order (committer date order).
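As a small illustration of the first definition, here is a sketch that
checks whether a given display order shows every commit before its
parents. It is a toy under assumptions: commits are plain indices,
`struct toy_edge` and `is_topo_order()` are hypothetical names, and
`order` is assumed to be a permutation of 0..nr_commits-1.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_COMMITS 64

/* One parent link: `child` must be shown before `parent`. */
struct toy_edge {
	size_t child;
	size_t parent;
};

/*
 * order[k] is the commit shown at position k (top to bottom).
 * Return 1 if every child appears before each of its parents, else 0.
 */
int is_topo_order(const size_t *order, size_t nr_commits,
		  const struct toy_edge *edges, size_t nr_edges)
{
	size_t pos[TOY_MAX_COMMITS] = { 0 };
	size_t i;

	assert(nr_commits <= TOY_MAX_COMMITS);
	for (i = 0; i < nr_commits; i++) {
		assert(order[i] < nr_commits);
		pos[order[i]] = i; /* remember where each commit is shown */
	}
	for (i = 0; i < nr_edges; i++) {
		assert(edges[i].child < nr_commits);
		assert(edges[i].parent < nr_commits);
		if (pos[edges[i].child] > pos[edges[i].parent])
			return 0;
	}
	return 1;
}
```

For a root 0, two children 1 and 2, and a merge 3 of both, the order
3, 1, 2, 0 passes, while 1, 3, 2, 0 fails because commit 1 is shown before
the merge that has it as a parent.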
>>
>> I don't think adding technical terms that the user won't encounter in
>> the documentation or among messages that Git outputs would be a good
>> idea. It could confuse users, rather than help them.
>>
>> Conversely, perhaps we should add Documentation/technical/glossary.txt
>> to help developers.
>
> I would agree that the Glossary probably ought to be split into the
> primary, secondary and background terms so that the core concepts are
> separated from the academic/developer style terms.
I don't think we need three separate layers; in my opinion separating
terms that user of Git might encounter from terms that somebody working
on developing Git may encounter would be enough.
The technical glossary / dictionary could also help onboarding...
>
> Git does rip up most of what folks think about version "control",
> usually based on the imperfect replication of physical artefacts.
I don't quite understand what you wanted to say there. Could you
explain in more detail, please?
>> P.S. By the way, when looking at Documentation/glossary-content.txt, I
>> have noticed few obsolescent entries, like "Git archive", few that have
>> description that soon could be or is obsolete and would need updating,
>> like "master" (when default branch switch to "main"), or "object
>> identifier" and "SHA-1" (when Git switches away from SHA-1 as hash
>> function).
>
> The obsolescent items can be updated. I'm expecting that the 'main' and
> 'SHA-' changes will eventually be picked up as part of the respective
> patch series, hopefully as part of the global replacements.
Here I meant that the "Git archive" entry is not important anymore, as I
think there are no active users of the GNU arch version control system (no
"arch people"); arch's last release was in 2006, and its replacement,
Bazaar (or 'bzr'), doesn't use this term. So I think it can be safely
removed in 2020, 14 years after the last release of arch.
In most cases "SHA-1" in the descriptions of terms in glossary should be
replaced by "object identifier" (to be more generic). This can be
safely done before the switch to NewHash is ready and announced.
Best,
--
Jakub Narębski
* Re: Extending and updating gitglossary
2020-11-10 1:35 ` Extending and updating gitglossary Jakub Narębski
@ 2020-11-10 14:04 ` Philip Oakley
2020-11-10 23:52 ` Jakub Narębski
0 siblings, 1 reply; 211+ messages in thread
From: Philip Oakley @ 2020-11-10 14:04 UTC (permalink / raw)
To: Jakub Narębski
Cc: Junio C Hamano, Abhishek Kumar, git,
Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau
Hi Jakub,
On 10/11/2020 01:35, Jakub Narębski wrote:
> Hello Philip,
>
> Philip Oakley <philipoakley@iee.email> writes:
>> On 06/11/2020 18:26, Jakub Narębski wrote:
>>> Junio C Hamano <gitster@pobox.com> writes:
>>>> Philip Oakley <philipoakley@iee.email> writes:
>>>>
>>>>> This may not be part of the main project, but could you consider, if
>>>>> time permits, also adding some entries into the Git Glossary (`git help
>>>>> glossary`) for the various terms we are using here and elsewhere, e.g.
>>>>> 'topological levels', 'generation number', 'corrected commit date' (and
>>>>> its fancy technical name for the use of date heuristics, e.g.
>>>>> 'chronological ordering').
>>>>>
>>>>> The glossary can provide a reference, once the issues are resolved. The
>>>>> History Simplification and Commit Ordering sections of git-log may be a
>>>>> useful guide to some of the terms that would link to the glossary.
>>>> Ah, I first thought that Documentation/rev-list-options.txt (which
>>>> is the relevant part of "git log" documentation you mention here)
>>>> already have references to deep technical terms explained in the
>>>> glossary and you are suggesting Abhishek to mimic the arrangement by
>>>> adding new and agreed-upon terms to the glossary and referring to
>>>> them from the commit-graph documentation updated by this series.
>>>>
>>>> But sadly that is not the case. What you are saying is that you
>>>> noticed that rev-list-options.txt needs a similar "the terms we use
>>>> to explain these two sections should be defined and explained in the
>>>> glossary (if they are not) and new references to glossary should be
>>>> added there" update.
> Which terms do you feel need a glossary entry?
While it was Junio that made the comment, I'd agree that we should be
using the glossary to explain, in a general sense, the terms that are
used in a specialist sense. As the user community expands, their natural
understanding of some of the terms diminishes.
>
>>>> In any case, that is a very good suggestion. I agree that updating
>>>> "git log" doc may be outside the scope of Abhishek's theme, but it
>>>> would be very good to have such an update by anybody ;-)
>>> The only possible problem I see with this suggestion is that some of
>>> those terms (like 'topological levels' and 'corrected commit date') are
>>> technical terms that should not be of concern to Git users, only to
>>> developers working on Git. (However one could encounter the term
>>> "generation number" in `git commit-graph verify` output.)
> To be more precise, I think that user-facing glossary should include
> only terms that appear in user-facing documentation and in output
> messages of Git commands (with the possible exception of maybe output
> messages of some low-level plumbing).
And where implied, the underlying concepts when they aren't obvious, or
lack general terms (e.g. the 'staging area' discussions)
>
> I think that the developer-facing glossary should include terms that
> appear in technical documentation, and in commit messages in Git
> history.
>
>> However we do mention "topolog*" in a number of the manual pages, and
>> rather less, as yet, in the technical pages.
>>
>> "Lexicographic" and "chronological" are in the same group of fancy
>> technical words ;-)
> I think that 'topological level' would appear only in technical
> documentation; if it would be the case then there is no reason to add it
> to user-facing glossary (to gitglossary manpage).
>
> 'Topological order' or 'topological sort', 'lexicographical order' and
> 'chronological order' are not Git-specific terms, and there are no
> Git-specific ambiguities. I am therefore a bit unsure about adding them
> to *Git* glossary.
It is that they aren't terms used in normal speech, so many folks do not
comprehend the implied precision that the docs assume, nor the problems
they may hide.
>
> - In computer science, a _topological sort_ or _topological_ ordering of
> a directed graph is a linear ordering of its vertices such that for
> every directed edge uv from vertex u to vertex v, u comes before v in
> the ordering.
Does this imply that those who aren't computer scientists shouldn't be
using Git?
>
> For Git it means that top to bottom, commits always appear before
> their parents. With `--graph` or `--topo-order` Git also avoids
> showing commits on multiple lines of history intermixed.
>
> - In mathematics, the _lexicographic_ or _lexicographical order_ (also
> known as lexical order, dictionary order, etc.) is a generalization of
> the alphabetical order.
>
> For Git it is simply alphabetical order.
ASCII order, Case sensitivity, Special characters, etc.
>
> - _Chronological order_ is the arrangement of things following one after
> another in time; or in other words date order.
Given that most résumés (the thing most folk see that asks for date
order) are latest first, does this clarify which way chronological is? (I
see this regularly in my other volunteer work).
>
> Note that `git log --date-order` commits also always appear before
> their parents, but otherwise commits are shown in the commit timestamp
> order (committer date order)
>
>>> I don't think adding technical terms that the user won't encounter in
>>> the documentation or among messages that Git outputs would be a good
>>> idea. It could confuse users, rather than help them.
>>>
>>> Conversely, perhaps we should add Documentation/technical/glossary.txt
>>> to help developers.
>> I would agree that the Glossary probably ought to be split into the
>> primary, secondary and background terms so that the core concepts are
>> separated from the academic/developer style terms.
> I don't think we need three separate layers; in my opinion separating
> terms that user of Git might encounter from terms that somebody working
> on developing Git may encounter would be enough.
>
> The technical glossary / dictionary could also help onboarding...
>
>> Git does rip up most of what folks think about version "control",
>> usually based on the imperfect replication of physical artefacts.
> I don't quite understand what you wanted to say there. Could you
> explain in more detail, please?
Background: I see Git & Version Control from an engineer's viewpoint,
rather than a developer's view.
In the "real" world there are no perfect copies, we serialise key items
so that we can track their degradation, and replace them when required.
We attempt to "Control" what is happening. Our documentation and
monitoring systems have layers of control to ensure only suitably
qualified persons may access and inspect critical items, can record and
access previous status reports, etc. There is only one "Mona Lisa", with
critical access controls, even though there are 'copies'
https://en.wikipedia.org/wiki/Mona_Lisa#Early_versions_and_copies.
Almost all of our terminology for configuration control comes from the
'real' world, i.e. pre-modern computing.
Git turns all that on its head. We can make perfect duplicates (they're
not copies, not replicas..). The Object name is immutable. It's either
right or wrong (except for the SHAttered SHA-1 breakage; we're moving to
SHA-256). Git does *not* provide any access control. It supports the
'software freedoms' by distributing the control to the user. The
repository is a version storage system, and the OIDs allow easy
authentication between folks that they are looking at the same object,
and all its implied descendants.
Git has ripped up classical 'real' world version control. In many areas
we need new or alternative terms, and documents that explain them to
screen writers(*) and the many other non CS-major users of Git (and some
engineers;-)
(*) there's a diff pattern for them, IIRC, or at least one was proposed.
>
>>> P.S. By the way, when looking at Documentation/glossary-content.txt, I
>>> have noticed few obsolescent entries, like "Git archive", few that have
>>> description that soon could be or is obsolete and would need updating,
>>> like "master" (when default branch switch to "main"), or "object
>>> identifier" and "SHA-1" (when Git switches away from SHA-1 as hash
>>> function).
>> The obsolescent items can be updated. I'm expecting that the 'main' and
>> 'SHA-' changes will eventually be picked up as part of the respective
>> patch series, hopefully as part of the global replacements.
> Here I meant that "Git archive" entry is not important anymore, as I
> think there are no active users of GNU arch version control system (no
> "arch people"); arch's last release was in 2006, and its replacement,
> Bazaar (or 'bzr') doesn't use this term. So I think it can be safely
> removed in 2020, 14 years after the last release of arch.
>
> In most cases "SHA-1" in the descriptions of terms in glossary should be
> replaced by "object identifier" (to be more generic). This can be
> safely done before switch to NewHash is ready and announced.
>
> Best,
--
Philip
* Re: Extending and updating gitglossary
2020-11-10 14:04 ` Philip Oakley
@ 2020-11-10 23:52 ` Jakub Narębski
0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-11-10 23:52 UTC (permalink / raw)
To: Philip Oakley
Cc: Junio C Hamano, Abhishek Kumar, git,
Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau
Hello Philip,
Philip Oakley <philipoakley@iee.email> writes:
> On 10/11/2020 01:35, Jakub Narębski wrote:
>> Philip Oakley <philipoakley@iee.email> writes:
>>> On 06/11/2020 18:26, Jakub Narębski wrote:
>>>> Junio C Hamano <gitster@pobox.com> writes:
>>>>> Philip Oakley <philipoakley@iee.email> writes:
>>>>>
>>>>>> This may not be part of the main project, but could you consider, if
>>>>>> time permits, also adding some entries into the Git Glossary (`git help
>>>>>> glossary`) for the various terms we are using here and elsewhere, e.g.
>>>>>> 'topological levels', 'generation number', 'corrected commit date' (and
>>>>>> its fancy technical name for the use of date heuristics e.g. the
>>>>>> 'chronological ordering';).
>>>>>>
>>>>>> The glossary can provide a reference, once the issues are resolved. The
>>>>>> History Simplification and Commit Ordering section of git-log may be a
>>>>>> useful guide to some of the terms that would link to the glossary.
[...]
>> What terms you feel need glossary entry?
>
> While it was Junio that made the comment, I'd agree that we should be
> using the glossary to explain, in a general sense, the terms that are
> used in a specialist sense. As the user community expands, their natural
> understanding of some of the terms diminishes.
I was hoping for a list of terms from the abovementioned sections of
the git-log manpage that you feel need an entry in gitglossary(7).
[...]
>> To be more precise, I think that user-facing glossary should include
>> only terms that appear in user-facing documentation and in output
>> messages of Git commands (with the possible exception of maybe output
>> messages of some low-level plumbing).
>
> And where implied, the underlying concepts when they aren't obvious, or
> lack general terms (e.g. the 'staging area' discussions)
True, 'staging area' should IMVHO be in glossary (replacing or in
addition to older less specific term 'index', previous name for 'staging
area' term).
>> I think that the developer-facing glossary should include terms that
>> appear in technical documentation, and in commit messages in Git
>> history.
Such as 'topological levels', 'commit slab' / 'on the slab', etc.
>>> However we do mention "topolog*" in a number of the manual pages, and
>>> rather less, as yet, in the technical pages.
>>>
>>> "Lexicographic" and "chronological" are in the same group of fancy
>>> technical words ;-)
>>
>> I think that 'topological level' would appear only in technical
>> documentation; if it would be the case then there is no reason to add it
>> to user-facing glossary (to gitglossary manpage).
>>
>> 'Topological order' or 'topological sort', 'lexicographical order' and
>> 'chronological order' are not Git-specific terms, and there are no
>> Git-specific ambiguities. I am therefore a bit unsure about adding them
>> to *Git* glossary.
>
> It is that they aren't terms used in normal speech, so many folks do not
> comprehend the implied precision that the docs assume, nor the problems
> they may hide.
Right.
>> - In computer science, a _topological sort_ or _topological_ ordering of
>> a directed graph is a linear ordering of its vertices such that for
>> every directed edge uv from vertex u to vertex v, u comes before v in
>> the ordering.
>
> Does this imply that those who aren't computer scientists shouldn't be
> using Git?
I think that in most cases where we refer to topological order in the
documentation we describe it there. It might be a good idea to add it to
the glossary, especially because Git uses it often in a very specific
sense.
On the other hand, should we define 'topology' or 'graph' as well? Or
'glossary' ;-) ? Those don't have any special meaning in Git, and can be
as well found in the dictionary or Wikipedia.
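To make the quoted definition concrete, here is a minimal sketch in C of
one way to produce such an ordering (Kahn's algorithm), with edges
pointing from a commit to its parents. 'struct toy_commit' and
'toy_topo_order()' are hypothetical names made up for this illustration;
this is not how Git's revision walk is implemented, and it does not try
to keep lines of history from being intermixed the way `--topo-order`
does.

```c
#include <assert.h>

#define MAX_COMMITS 16

/* Hypothetical toy commit: indices of its parents within the same array. */
struct toy_commit {
	int parents[2];
	int nr_parents;
};

/*
 * Fill 'order' with commit indices so that every commit appears before
 * all of its parents (Kahn's algorithm). Returns the number of commits
 * emitted, which equals 'nr' when the graph has no cycles.
 */
static int toy_topo_order(const struct toy_commit *commits, int nr, int *order)
{
	int pending_children[MAX_COMMITS] = { 0 };
	int emitted = 0, scan = 0, i, j;

	assert(nr <= MAX_COMMITS);

	/* Count, for each commit, how many children still have to be shown. */
	for (i = 0; i < nr; i++)
		for (j = 0; j < commits[i].nr_parents; j++)
			pending_children[commits[i].parents[j]]++;

	/* Commits without children (branch tips) can be shown immediately. */
	for (i = 0; i < nr; i++)
		if (!pending_children[i])
			order[emitted++] = i;

	/* Showing a commit releases each parent whose children are all shown. */
	while (scan < emitted) {
		const struct toy_commit *c = &commits[order[scan++]];

		for (j = 0; j < c->nr_parents; j++)
			if (!--pending_children[c->parents[j]])
				order[emitted++] = c->parents[j];
	}
	return emitted;
}
```

For a diamond-shaped history (a root, two children, and a merge of both)
the merge is emitted first and the root last, whichever of the two
middle commits comes out first.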
>> For Git it means that top to bottom, commits always appear before
>> their parents. With `--graph` or `--topo-order` Git also avoids
>> showing commits on multiple lines of history intermixed.
>>
>> - In mathematics, the _lexicographic_ or _lexicographical order_ (also
>> known as lexical order, dictionary order, etc.) is a generalization of
>> the alphabetical order.
>>
>> For Git it is simply alphabetical order.
>
> ASCII order, Case sensitivity, Special characters, etc.
Actually I don't know. Let me check: the only place this term appears in
the documentation is in git-tag(1) manpage and related documentation.
It simply uses strcmp(), or strcasecmp() when the `--ignore-case`
option is given; so it is case-sensitive by default.
It looks like it does not take locale-specific rules into account.
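As a rough illustration of the difference between the two comparisons,
here is a small C sketch that sorts a list of names either way. The
helper names ('cmp_case_sensitive', 'cmp_ignore_case', 'sort_names') are
made up for this example and are not taken from Git's source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* qsort() callback: byte-wise, case-sensitive comparison via strcmp(). */
static int cmp_case_sensitive(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* qsort() callback: case-insensitive comparison via strcasecmp(). */
static int cmp_ignore_case(const void *a, const void *b)
{
	return strcasecmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort an array of names in place; 'ignore_case' picks the comparison. */
static void sort_names(const char **names, size_t nr, int ignore_case)
{
	assert(names != NULL || nr == 0);
	qsort(names, nr, sizeof(*names),
	      ignore_case ? cmp_ignore_case : cmp_case_sensitive);
}
```

With the names "beta", "Zeta" and "alpha", the strcmp() ordering puts
"Zeta" first, because uppercase ASCII letters sort before lowercase
ones, while the strcasecmp() ordering puts it last.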
>> - _Chronological order_ is the arrangement of things following one after
>> another in time; or in other words date order.
>
> Given that most résumés (the thing most folk see that asks for date
> order) are latest first, does this clarify which way chronological is? (I
> see this regularly in my other volunteer work).
Right, it might not be obvious at first glance that Git outputs most
recent commits first, that is newest commits are on top. Though if you
think about it in more detail, it is the only ordering that makes sense,
especially for projects with a long history; first, it is newest commits
that are most interesting, and second Git always walks the history from
child to parent.
>> Note that `git log --date-order` commits also always appear before
>> their parents, but otherwise commits are shown in the commit timestamp
>> order (committer date order)
[...]
>>> Git does rip up most of what folks think about version "control",
>>> usually based on the imperfect replication of physical artefacts.
>>
>> I don't quite understand what you wanted to say there. Could you
>> explain in more detail, please?
>
> Background: I see Git & Version Control from an engineer's viewpoint,
> rather than a developer's view.
>
> In the "real" world there are no perfect copies, we serialise key items
> so that we can track their degradation, and replace them when required.
> We attempt to "Control" what is happening. Our documentation and
> monitoring systems have layers of control to ensure only suitably
> qualified persons may access and inspect critical items, can record and
> access previous status reports, etc. There is only one "Mona Lisa", with
> critical access controls, even though there are 'copies'
> https://en.wikipedia.org/wiki/Mona_Lisa#Early_versions_and_copies.
> Almost all of our terminology for configuration control comes from the
> 'real' world, i.e. pre-modern computing.
>
> Git turns all that on its head. We can make perfect duplicates (they're
> not copies, not replicas..). The Object name is immutable. It's either
> right or wrong (except for the SHAttered SHA-1 breakage; we're moving to
> SHA-256). Git does *not* provide any access control. It supports the
> 'software freedoms' by distributing the control to the user. The
> repository is a version storage system, and the OIDs allow easy
> authentication between folks that they are looking at the same object,
> and all its implied descendants.
>
> Git has ripped up classical 'real' world version control. In many areas
> we need new or alternative terms, and documents that explain them to
> screen writers(*) and the many other non CS-major users of Git (and some
> engineers;-)
>
> (*) there's a diff pattern for them, IIRC, or at least one was proposed.
Right, though for me the concept of 'version control' was by default
always about the digital, usually the source code.
There are different editions of books, changes to non-digital technical
drawings and plans (AFAIK often in the form of physical foil overlays as
subsequent layers, if done well; overdrawing on the same layer if not),
amendment and changes to laws, etc.
Anyway, the question is what level of knowledge can we assume from the
average Git user -- this would affect the spread of terms that should be
considered for the Git glossary.
Best,
--
Jakub Narębski
* [PATCH v4 07/10] commit-graph: implement generation data chunk
2020-10-07 14:09 ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
` (5 preceding siblings ...)
2020-10-07 14:09 ` [PATCH v4 06/10] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09 ` Abhishek Kumar via GitGitGadget
2020-10-30 12:45 ` Jakub Narębski
2020-10-07 14:09 ` [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
` (4 subsequent siblings)
11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
the prerequisites for implementing generation number v2 was to distinguish
between graph versions in a backwards-compatible manner.
We are going to introduce a new chunk called Generation Data chunk (or
GDAT). GDAT stores corrected committer date offsets whereas CDAT will
still store topological level.
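For reference, the chunk IDs defined in the diff below are just the four
ASCII characters of the chunk name packed into a 32-bit value, first
character in the most significant byte. A minimal sketch of that
mapping, where 'chunk_id_from_name()' is a hypothetical helper written
for illustration and not a function in commit-graph.c:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Pack a four-character chunk name into a 32-bit ID, first character
 * in the most significant byte. Hypothetical helper for illustration
 * only; commit-graph.c spells the IDs out as constants instead.
 */
static uint32_t chunk_id_from_name(const char *name)
{
	assert(strlen(name) == 4);
	return ((uint32_t)(unsigned char)name[0] << 24) |
	       ((uint32_t)(unsigned char)name[1] << 16) |
	       ((uint32_t)(unsigned char)name[2] << 8) |
	       (uint32_t)(unsigned char)name[3];
}
```

Under this mapping "GDAT" becomes 0x47444154 and "GDOV" becomes
0x47444f56, matching the two new #defines.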
Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit graph written
by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.
While storing corrected commit date offset instead of the corrected
commit date saves us 4 bytes per commit, it's possible for the offsets
to overflow the 4 bytes allocated. As such overflows are exceedingly
rare, we use the following overflow management scheme:
We introduce a new commit-graph chunk, GENERATION_DATA_OVERFLOW ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.
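A minimal sketch of this scheme as described above, using in-memory
arrays in place of the on-disk big-endian chunks. The names
'encode_gdat_entry' and 'decode_gdat_entry' and the two macros are
invented for this illustration; they only mirror the constants in the
patch and are not the functions the patch adds.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors GENERATION_NUMBER_V2_OFFSET_MAX from the patch. */
#define OFFSET_MAX ((UINT64_C(1) << 31) - 1)
/* Mirrors CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW from the patch. */
#define OVERFLOW_FLAG (UINT64_C(1) << 31)

/*
 * Encode one GDAT entry. A small offset is stored directly; otherwise
 * the corrected commit date is appended to the overflow table and the
 * entry becomes the flag bit combined with the position in that table.
 */
static uint32_t encode_gdat_entry(uint64_t commit_date, uint64_t corrected_date,
				  uint64_t *overflow, uint32_t *nr_overflow)
{
	uint64_t offset = corrected_date - commit_date;

	assert(corrected_date >= commit_date);
	if (offset <= OFFSET_MAX)
		return (uint32_t)offset;
	overflow[*nr_overflow] = corrected_date;
	return (uint32_t)(OVERFLOW_FLAG | (*nr_overflow)++);
}

/* Decode one GDAT entry back into the corrected commit date. */
static uint64_t decode_gdat_entry(uint32_t entry, uint64_t commit_date,
				  const uint64_t *overflow)
{
	if (entry & OVERFLOW_FLAG)
		return overflow[entry ^ OVERFLOW_FLAG];
	return commit_date + entry;
}
```

An offset of 5 round-trips through the 32-bit entry unchanged, while an
offset of 2 ^ 31 is stored as 0x80000000 (flag set, position 0) with the
full corrected date kept in the overflow table.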
We test the overflow-related code with the following repo history:
          F - N - U
         /         \
U - N - U           N
         \         /
          N - F - N
Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.
The largest offset observed is 2 ^ 31, just large enough to overflow.
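The arithmetic behind that number can be checked with a small sketch
that walks the upper line of the history (U - N - U - F - N - U, i.e.
commits 1 to 6 in the test), assuming the corrected commit date of a
commit is the larger of its own committer date and its parent's
corrected commit date plus one. 'max_offset_on_chain()' is a
hypothetical helper for a linear chain only, not the code in
compute_generation_numbers().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Corrected commit dates along a linear chain, oldest commit first,
 * assuming corrected date = max(own date, parent's corrected date + 1).
 * Stores the corrected dates in 'corrected' and returns the largest
 * offset (corrected date - own date) seen on the chain.
 */
static uint64_t max_offset_on_chain(const uint64_t *dates, int nr,
				    uint64_t *corrected)
{
	uint64_t max_offset = 0;
	int i;

	assert(nr > 0);
	for (i = 0; i < nr; i++) {
		corrected[i] = dates[i];
		if (i > 0 && corrected[i - 1] + 1 > corrected[i])
			corrected[i] = corrected[i - 1] + 1;
		if (corrected[i] - dates[i] > max_offset)
			max_offset = corrected[i] - dates[i];
	}
	return max_offset;
}
```

With the dates 0, 1112354055, 0, 2147483646, 1112354055, 0 the last
commit gets a corrected commit date of 2147483648 while its own date is
0, which gives the offset of 2 ^ 31.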
[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 98 +++++++++++++++++++++++++++++++++--
commit-graph.h | 3 ++
commit.h | 1 +
t/README | 3 ++
t/helper/test-read-graph.c | 4 ++
t/t4216-log-bloom.sh | 4 +-
t/t5318-commit-graph.sh | 70 ++++++++++++++++++++-----
t/t5324-split-commit-graph.sh | 12 ++---
t/t6600-test-reach.sh | 68 +++++++++++++-----------
9 files changed, 206 insertions(+), 57 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 03948adfce..71d0b243db 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 9
#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
@@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
+#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
+
/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)
@@ -385,6 +389,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
graph->chunk_commit_data = data + chunk_offset;
break;
+ case GRAPH_CHUNKID_GENERATION_DATA:
+ if (graph->chunk_generation_data)
+ chunk_repeated = 1;
+ else
+ graph->chunk_generation_data = data + chunk_offset;
+ break;
+
+ case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
+ if (graph->chunk_generation_data_overflow)
+ chunk_repeated = 1;
+ else
+ graph->chunk_generation_data_overflow = data + chunk_offset;
+ break;
+
case GRAPH_CHUNKID_EXTRAEDGES:
if (graph->chunk_extra_edges)
chunk_repeated = 1;
@@ -745,8 +763,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
{
const unsigned char *commit_data;
struct commit_graph_data *graph_data;
- uint32_t lex_index;
- uint64_t date_high, date_low;
+ uint32_t lex_index, offset_pos;
+ uint64_t date_high, date_low, offset;
while (pos < g->num_commits_in_base)
g = g->base_graph;
@@ -764,7 +782,16 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
- graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+ if (g->chunk_generation_data) {
+ offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+
+ if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
+ offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
+ graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
+ } else
+ graph_data->generation = item->date + offset;
+ } else
+ graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
if (g->topo_levels)
*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
@@ -942,6 +969,7 @@ struct write_commit_graph_context {
struct packed_oid_list oids;
struct packed_commit_list commits;
int num_extra_edges;
+ int num_generation_data_overflows;
unsigned long approx_nr_objects;
struct progress *progress;
int progress_done;
@@ -960,7 +988,8 @@ struct write_commit_graph_context {
report_progress:1,
split:1,
changed_paths:1,
- order_by_pack:1;
+ order_by_pack:1,
+ write_generation_data:1;
struct topo_level_slab *topo_levels;
const struct commit_graph_opts *opts;
@@ -1120,6 +1149,44 @@ static int write_graph_chunk_data(struct hashfile *f,
return 0;
}
+static int write_graph_chunk_generation_data(struct hashfile *f,
+ struct write_commit_graph_context *ctx)
+{
+ int i, num_generation_data_overflows = 0;
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+ display_progress(ctx->progress, ++ctx->progress_cnt);
+
+ if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+ offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
+ num_generation_data_overflows++;
+ }
+
+ hashwrite_be32(f, offset);
+ }
+
+ return 0;
+}
+
+static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
+ struct write_commit_graph_context *ctx)
+{
+ int i;
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+ display_progress(ctx->progress, ++ctx->progress_cnt);
+
+ if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+ hashwrite_be32(f, offset >> 32);
+ hashwrite_be32(f, (uint32_t) offset);
+ }
+ }
+
+ return 0;
+}
+
static int write_graph_chunk_extra_edges(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
@@ -1399,7 +1466,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (current->date && current->date > max_corrected_commit_date)
max_corrected_commit_date = current->date - 1;
+
commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
+
+ if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
+ ctx->num_generation_data_overflows++;
}
}
}
@@ -1765,6 +1836,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
chunks[2].id = GRAPH_CHUNKID_DATA;
chunks[2].size = (hashsz + 16) * ctx->commits.nr;
chunks[2].write_fn = write_graph_chunk_data;
+
+ if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
+ ctx->write_generation_data = 0;
+ if (ctx->write_generation_data) {
+ chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
+ chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+ chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
+ num_chunks++;
+ }
+ if (ctx->num_generation_data_overflows) {
+ chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
+ chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
+ chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
+ num_chunks++;
+ }
if (ctx->num_extra_edges) {
chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
chunks[num_chunks].size = 4 * ctx->num_extra_edges;
@@ -2170,6 +2256,8 @@ int write_commit_graph(struct object_directory *odb,
ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
ctx->opts = opts;
ctx->total_bloom_filter_data_size = 0;
+ ctx->write_generation_data = 1;
+ ctx->num_generation_data_overflows = 0;
bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
bloom_settings.bits_per_entry);
diff --git a/commit-graph.h b/commit-graph.h
index 2e9aa7824e..19a02001fd 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,6 +6,7 @@
#include "oidset.h"
#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
@@ -68,6 +69,8 @@ struct commit_graph {
const uint32_t *chunk_oid_fanout;
const unsigned char *chunk_oid_lookup;
const unsigned char *chunk_commit_data;
+ const unsigned char *chunk_generation_data;
+ const unsigned char *chunk_generation_data_overflow;
const unsigned char *chunk_extra_edges;
const unsigned char *chunk_base_graphs;
const unsigned char *chunk_bloom_indexes;
diff --git a/commit.h b/commit.h
index 33c66b2177..251d877fcf 100644
--- a/commit.h
+++ b/commit.h
@@ -14,6 +14,7 @@
#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0
+#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)
struct commit_list {
struct commit *item;
diff --git a/t/README b/t/README
index 2adaf7c2d2..975c054bc9 100644
--- a/t/README
+++ b/t/README
@@ -379,6 +379,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
be written after every 'git commit' command, and overrides the
'core.commitGraph' setting to true.
+GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
+commit-graph to be written without generation data chunk.
+
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
commit-graph write to compute and write changed path Bloom filters for
every 'git commit-graph write', as if the `--changed-paths` option was
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 5f585a1725..75927b2c81 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -33,6 +33,10 @@ int cmd__read_graph(int argc, const char **argv)
printf(" oid_lookup");
if (graph->chunk_commit_data)
printf(" commit_metadata");
+ if (graph->chunk_generation_data)
+ printf(" generation_data");
+ if (graph->chunk_generation_data_overflow)
+ printf(" generation_data_overflow");
if (graph->chunk_extra_edges)
printf(" extra_edges");
if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index d11040ce41..dbde016188 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -40,11 +40,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
'
graph_read_expect () {
- NUM_CHUNKS=5
+ NUM_CHUNKS=6
cat >expect <<- EOF
header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
num_commits: $1
- chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+ chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
EOF
test-tool read-graph >actual &&
test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2ed0c1544d..0328e98564 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -76,7 +76,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
graph_read_expect() {
OPTIONAL=""
NUM_CHUNKS=3
- if test ! -z $2
+ if test ! -z "$2"
then
OPTIONAL=" $2"
NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
@@ -103,14 +103,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
# valid commit and tree OID
git rev-parse HEAD HEAD^{tree} >in &&
git commit-graph write --stdin-commits <in &&
- graph_read_expect 3
+ graph_read_expect 3 generation_data
'
test_expect_success 'write graph' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "3"
+ graph_read_expect "3" generation_data
'
test_expect_success POSIXPERM 'write graph has correct permissions' '
@@ -219,7 +219,7 @@ test_expect_success 'write graph with merges' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "10" "extra_edges"
+ graph_read_expect "10" "generation_data extra_edges"
'
graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
@@ -254,7 +254,7 @@ test_expect_success 'write graph with new commit' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -264,7 +264,7 @@ test_expect_success 'write graph with nothing new' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -274,7 +274,7 @@ test_expect_success 'build graph from latest pack with closure' '
cd "$TRASH_DIRECTORY/full" &&
cat new-idx | git commit-graph write --stdin-packs &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "9" "extra_edges"
+ graph_read_expect "9" "generation_data extra_edges"
'
graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
@@ -287,7 +287,7 @@ test_expect_success 'build graph from commits with closure' '
git rev-parse merge/1 >>commits-in &&
cat commits-in | git commit-graph write --stdin-commits &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "6"
+ graph_read_expect "6" "generation_data"
'
graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
@@ -297,7 +297,7 @@ test_expect_success 'build graph from commits with append' '
cd "$TRASH_DIRECTORY/full" &&
git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "10" "extra_edges"
+ graph_read_expect "10" "generation_data extra_edges"
'
graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -307,7 +307,7 @@ test_expect_success 'build graph using --reachable' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write --reachable &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -328,7 +328,7 @@ test_expect_success 'write graph in bare repo' '
cd "$TRASH_DIRECTORY/bare" &&
git commit-graph write &&
test_path_is_file $baredir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
@@ -454,8 +454,9 @@ test_expect_success 'warn on improper hash version' '
test_expect_success 'git commit-graph verify' '
cd "$TRASH_DIRECTORY/full" &&
- git rev-parse commits/8 | git commit-graph write --stdin-commits &&
- git commit-graph verify >output
+ git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
+ git commit-graph verify >output &&
+ graph_read_expect 9 extra_edges
'
NUM_COMMITS=9
@@ -741,4 +742,47 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
)
'
+test_commit_with_date() {
+ file="$1.t" &&
+ echo "$1" >"$file" &&
+ git add "$file" &&
+ GIT_COMMITTER_DATE="$2" GIT_AUTHOR_DATE="$2" git commit -m "$1"
+ git tag "$1"
+}
+
+test_expect_success 'overflow corrected commit date offset' '
+ objdir=".git/objects" &&
+ UNIX_EPOCH_ZERO="1970-01-01 00:00 +0000" &&
+ FUTURE_DATE="@2147483646 +0000" &&
+ test_oid_cache <<-EOF &&
+ oid_version sha1:1
+ oid_version sha256:2
+ EOF
+ cd "$TRASH_DIRECTORY" &&
+ mkdir repo &&
+ cd repo &&
+ git init &&
+ test_commit_with_date 1 "$UNIX_EPOCH_ZERO" &&
+ test_commit 2 &&
+ test_commit_with_date 3 "$UNIX_EPOCH_ZERO" &&
+ git commit-graph write --reachable &&
+ graph_read_expect 3 generation_data &&
+ test_commit_with_date 4 "$FUTURE_DATE" &&
+ test_commit 5 &&
+ test_commit_with_date 6 "$UNIX_EPOCH_ZERO" &&
+ git branch left &&
+ git reset --hard 3 &&
+ test_commit 7 &&
+ test_commit_with_date 8 "$FUTURE_DATE" &&
+ test_commit 9 &&
+ git branch right &&
+ git reset --hard 3 &&
+ git merge left right &&
+ git commit-graph write --reachable &&
+ graph_read_expect 10 "generation_data generation_data_overflow" &&
+ git commit-graph verify
+'
+
+graph_git_behavior 'overflow corrected commit date offset' repo left right
+
test_done
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index c334ee9155..651df89ab2 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
infodir=".git/objects/info" &&
graphdir="$infodir/commit-graphs" &&
test_oid_cache <<-EOM
- shallow sha1:1760
- shallow sha256:2064
+ shallow sha1:2132
+ shallow sha256:2436
- base sha1:1376
- base sha256:1496
+ base sha1:1408
+ base sha256:1528
oid_version sha1:1
oid_version sha256:2
@@ -31,9 +31,9 @@ graph_read_expect() {
NUM_BASE=$2
fi
cat >expect <<- EOF
- header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
+ header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
num_commits: $1
- chunks: oid_fanout oid_lookup commit_metadata
+ chunks: oid_fanout oid_lookup commit_metadata generation_data
EOF
test-tool read-graph >output &&
test_cmp expect output
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index f807276337..e2d33a8a4c 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,10 +55,13 @@ test_expect_success 'setup' '
git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
mv .git/objects/info/commit-graph commit-graph-half &&
chmod u+w commit-graph-half &&
+ GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
+ mv .git/objects/info/commit-graph commit-graph-no-gdat &&
+ chmod u+w commit-graph-no-gdat &&
git config core.commitGraph true
'
-run_three_modes () {
+run_all_modes () {
test_when_finished rm -rf .git/objects/info/commit-graph &&
"$@" <input >actual &&
test_cmp expect actual &&
@@ -67,11 +70,14 @@ run_three_modes () {
test_cmp expect actual &&
cp commit-graph-half .git/objects/info/commit-graph &&
"$@" <input >actual &&
+ test_cmp expect actual &&
+ cp commit-graph-no-gdat .git/objects/info/commit-graph &&
+ "$@" <input >actual &&
test_cmp expect actual
}
-test_three_modes () {
- run_three_modes test-tool reach "$@"
+test_all_modes () {
+ run_all_modes test-tool reach "$@"
}
test_expect_success 'ref_newer:miss' '
@@ -80,7 +86,7 @@ test_expect_success 'ref_newer:miss' '
B:commit-4-9
EOF
echo "ref_newer(A,B):0" >expect &&
- test_three_modes ref_newer
+ test_all_modes ref_newer
'
test_expect_success 'ref_newer:hit' '
@@ -89,7 +95,7 @@ test_expect_success 'ref_newer:hit' '
B:commit-2-3
EOF
echo "ref_newer(A,B):1" >expect &&
- test_three_modes ref_newer
+ test_all_modes ref_newer
'
test_expect_success 'in_merge_bases:hit' '
@@ -98,7 +104,7 @@ test_expect_success 'in_merge_bases:hit' '
B:commit-8-8
EOF
echo "in_merge_bases(A,B):1" >expect &&
- test_three_modes in_merge_bases
+ test_all_modes in_merge_bases
'
test_expect_success 'in_merge_bases:miss' '
@@ -107,7 +113,7 @@ test_expect_success 'in_merge_bases:miss' '
B:commit-5-9
EOF
echo "in_merge_bases(A,B):0" >expect &&
- test_three_modes in_merge_bases
+ test_all_modes in_merge_bases
'
test_expect_success 'in_merge_bases_many:hit' '
@@ -117,7 +123,7 @@ test_expect_success 'in_merge_bases_many:hit' '
X:commit-5-7
EOF
echo "in_merge_bases_many(A,X):1" >expect &&
- test_three_modes in_merge_bases_many
+ test_all_modes in_merge_bases_many
'
test_expect_success 'in_merge_bases_many:miss' '
@@ -127,7 +133,7 @@ test_expect_success 'in_merge_bases_many:miss' '
X:commit-8-6
EOF
echo "in_merge_bases_many(A,X):0" >expect &&
- test_three_modes in_merge_bases_many
+ test_all_modes in_merge_bases_many
'
test_expect_success 'in_merge_bases_many:miss-heuristic' '
@@ -137,7 +143,7 @@ test_expect_success 'in_merge_bases_many:miss-heuristic' '
X:commit-6-6
EOF
echo "in_merge_bases_many(A,X):0" >expect &&
- test_three_modes in_merge_bases_many
+ test_all_modes in_merge_bases_many
'
test_expect_success 'is_descendant_of:hit' '
@@ -148,7 +154,7 @@ test_expect_success 'is_descendant_of:hit' '
X:commit-1-1
EOF
echo "is_descendant_of(A,X):1" >expect &&
- test_three_modes is_descendant_of
+ test_all_modes is_descendant_of
'
test_expect_success 'is_descendant_of:miss' '
@@ -159,7 +165,7 @@ test_expect_success 'is_descendant_of:miss' '
X:commit-7-6
EOF
echo "is_descendant_of(A,X):0" >expect &&
- test_three_modes is_descendant_of
+ test_all_modes is_descendant_of
'
test_expect_success 'get_merge_bases_many' '
@@ -174,7 +180,7 @@ test_expect_success 'get_merge_bases_many' '
git rev-parse commit-5-6 \
commit-4-7 | sort
} >expect &&
- test_three_modes get_merge_bases_many
+ test_all_modes get_merge_bases_many
'
test_expect_success 'reduce_heads' '
@@ -196,7 +202,7 @@ test_expect_success 'reduce_heads' '
commit-2-8 \
commit-1-10 | sort
} >expect &&
- test_three_modes reduce_heads
+ test_all_modes reduce_heads
'
test_expect_success 'can_all_from_reach:hit' '
@@ -219,7 +225,7 @@ test_expect_success 'can_all_from_reach:hit' '
Y:commit-8-1
EOF
echo "can_all_from_reach(X,Y):1" >expect &&
- test_three_modes can_all_from_reach
+ test_all_modes can_all_from_reach
'
test_expect_success 'can_all_from_reach:miss' '
@@ -241,7 +247,7 @@ test_expect_success 'can_all_from_reach:miss' '
Y:commit-8-5
EOF
echo "can_all_from_reach(X,Y):0" >expect &&
- test_three_modes can_all_from_reach
+ test_all_modes can_all_from_reach
'
test_expect_success 'can_all_from_reach_with_flag: tags case' '
@@ -264,7 +270,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
Y:commit-8-1
EOF
echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
- test_three_modes can_all_from_reach_with_flag
+ test_all_modes can_all_from_reach_with_flag
'
test_expect_success 'commit_contains:hit' '
@@ -280,8 +286,8 @@ test_expect_success 'commit_contains:hit' '
X:commit-9-3
EOF
echo "commit_contains(_,A,X,_):1" >expect &&
- test_three_modes commit_contains &&
- test_three_modes commit_contains --tag
+ test_all_modes commit_contains &&
+ test_all_modes commit_contains --tag
'
test_expect_success 'commit_contains:miss' '
@@ -297,8 +303,8 @@ test_expect_success 'commit_contains:miss' '
X:commit-9-3
EOF
echo "commit_contains(_,A,X,_):0" >expect &&
- test_three_modes commit_contains &&
- test_three_modes commit_contains --tag
+ test_all_modes commit_contains &&
+ test_all_modes commit_contains --tag
'
test_expect_success 'rev-list: basic topo-order' '
@@ -310,7 +316,7 @@ test_expect_success 'rev-list: basic topo-order' '
commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>expect &&
- run_three_modes git rev-list --topo-order commit-6-6
+ run_all_modes git rev-list --topo-order commit-6-6
'
test_expect_success 'rev-list: first-parent topo-order' '
@@ -322,7 +328,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
commit-6-2 \
commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>expect &&
- run_three_modes git rev-list --first-parent --topo-order commit-6-6
+ run_all_modes git rev-list --first-parent --topo-order commit-6-6
'
test_expect_success 'rev-list: range topo-order' '
@@ -334,7 +340,7 @@ test_expect_success 'rev-list: range topo-order' '
commit-6-2 commit-5-2 commit-4-2 \
commit-6-1 commit-5-1 commit-4-1 \
>expect &&
- run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+ run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
'
test_expect_success 'rev-list: range topo-order' '
@@ -346,7 +352,7 @@ test_expect_success 'rev-list: range topo-order' '
commit-6-2 commit-5-2 commit-4-2 \
commit-6-1 commit-5-1 commit-4-1 \
>expect &&
- run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+ run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
'
test_expect_success 'rev-list: first-parent range topo-order' '
@@ -358,7 +364,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
commit-6-2 \
commit-6-1 commit-5-1 commit-4-1 \
>expect &&
- run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+ run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
'
test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -368,7 +374,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
commit-6-3 commit-5-3 commit-4-3 \
>expect &&
- run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+ run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
'
test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -382,7 +388,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
commit-3-8 commit-2-8 commit-1-8 \
commit-3-7 commit-2-7 commit-1-7 \
>expect &&
- run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+ run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
'
test_expect_success 'get_reachable_subset:all' '
@@ -402,7 +408,7 @@ test_expect_success 'get_reachable_subset:all' '
commit-1-7 \
commit-5-6 | sort
) >expect &&
- test_three_modes get_reachable_subset
+ test_all_modes get_reachable_subset
'
test_expect_success 'get_reachable_subset:some' '
@@ -420,7 +426,7 @@ test_expect_success 'get_reachable_subset:some' '
git rev-parse commit-3-3 \
commit-1-7 | sort
) >expect &&
- test_three_modes get_reachable_subset
+ test_all_modes get_reachable_subset
'
test_expect_success 'get_reachable_subset:none' '
@@ -434,7 +440,7 @@ test_expect_success 'get_reachable_subset:none' '
Y:commit-2-8
EOF
echo "get_reachable_subset(X,Y)" >expect &&
- test_three_modes get_reachable_subset
+ test_all_modes get_reachable_subset
'
test_done
--
gitgitgadget
* Re: [PATCH v4 07/10] commit-graph: implement generation data chunk
2020-10-07 14:09 ` [PATCH v4 07/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-10-30 12:45 ` Jakub Narębski
2020-11-06 11:25 ` Abhishek Kumar
0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-10-30 12:45 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar
TL;DR: the code writing the GDOV chunk could be made more performant
(I think), but that could be left for a future commit.
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> As discovered by Ævar, we cannot increment graph version to
> distinguish between generation numbers v1 and v2 [1]. Thus, one of
> pre-requistes before implementing generation number was to distinguish
> between graph versions in a backwards compatible manner.
Minor nitpick: I think you meant "implementing generation number v2",
to be more precise.
>
> We are going to introduce a new chunk called Generation Data chunk (or
Very minor nitpick: perhaps s/Generation Data/Generation DATa/, to provide
mnemonics for chunk name.
> GDAT). GDAT stores corrected committer date offsets whereas CDAT will
> still store topological level.
Minor nitpick: I think the second sentence should use consistent
grammatical tense (but I am not a native English speaker); also
s/level/levels/:
GDAT will store corrected committer date offsets, whereas CDAT will
still store topological levels.
But it is perfectly understandable as it is.
>
> Old Git does not understand GDAT chunk and would ignore it, reading
> topological levels from CDAT. New Git can parse GDAT and take advantage
> of newer generation numbers, falling back to topological levels when
> GDAT chunk is missing (as it would happen with a commit graph written
> by old Git).
Minor nitpick: I think we use commit-graph with dash when writing about
the commit-graph file, like below.
>
> We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
> which forces commit-graph file to be written without generation data
> chunk to emulate a commit-graph file written by old Git.
All right.
>
> While storing corrected commit date offset instead of the corrected
> commit date saves us 4 bytes per commit, it's possible for the offsets
> to overflow the 4-bytes allocated. As such overflows are exceedingly
> rare, we use the following overflow management scheme:
Perhaps it would be a good idea to describe the scheme in full from the
start, as the commit message is intended to be read standalone and not in
the context of the patch series. On the other hand, it might be too much
detail in an already [necessarily] lengthy commit message.
Perhaps something like the following proposal would read better.
To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file,
instead of corrected commit dates themselves. This saves us 4 bytes
per commit, decreasing the GDAT chunk size by half, but it's possible
for the offset to overflow the 4-bytes allocated for storage. As such
overflows are and should be exceedingly rare, we use the following
overflow management scheme:
NOTE: this overflow handling is *new* code (or new-ish code, as it is
inspired by and similar to EDGE chunk data handling), so it needs more
careful review.
>
> We introduce a new commit-graph chunk, GENERATION_DATA_OVERFLOW ('GDOV')
Minor issue: why GENERATION_DATA_OVERFLOW and not Generation Data
OVerflow, like for the GDAT chunk?
> to store corrected commit dates for commits with offsets greater than
> GENERATION_NUMBER_V2_OFFSET_MAX.
>
> If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
> the MSB of the offset and the other bits store the position of corrected
> commit date in GDOV chunk, similar to how Extra Edge List is maintained.
>
> We test the overflow-related code with the following repo history:
>
> F - N - U
> / \
> U - N - U N
> \ /
> N - F - N
Do we need such complex history? I guess we need to test the handling of
merge commits too.
>
> Where the commits denoted by U have committer date of zero seconds
> since Unix epoch, the commits denoted by N have committer date of
> 1112354055 (default committer date for the test suite) seconds since
> Unix epoch and the commits denoted by F have committer date of
> (2 ^ 31 - 2) seconds since Unix epoch.
>
> The largest offset observed is 2 ^ 31, just large enough to overflow.
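The figure can be checked by hand. What follows is a minimal sketch, assuming the rule used by compute_generation_numbers() later in this patch (corrected date = max(committer date, largest parent corrected date + 1)) and following only the upper branch of the history, U - N - U - F - N - U. It ignores the few seconds the test suite adds per commit, and the names corrected_date() and max_offset_on_chain() are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Corrected commit date: never below the committer date, and strictly
 * greater than the largest corrected date among the parents. */
static uint64_t corrected_date(uint64_t committer_date,
			       uint64_t max_parent_corrected)
{
	uint64_t lower_bound = max_parent_corrected + 1;

	return committer_date > lower_bound ? committer_date : lower_bound;
}

/* Walk a linear chain (root first) and return the largest offset
 * (corrected date minus committer date) seen along the way. */
static uint64_t max_offset_on_chain(const uint64_t *dates, size_t n)
{
	uint64_t parent = 0, max = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t gen = corrected_date(dates[i], parent);
		uint64_t offset;

		assert(gen >= dates[i]); /* offsets are never negative */
		offset = gen - dates[i];
		if (offset > max)
			max = offset;
		parent = gen;
	}
	return max;
}
```

With F = 2^31 - 2, the N commit after it gets corrected date 2^31 - 1 and the final U commit gets 2^31; as its committer date is zero, its offset is 2^31, one more than GENERATION_NUMBER_V2_OFFSET_MAX.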
>
> [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
> commit-graph.c | 98 +++++++++++++++++++++++++++++++++--
> commit-graph.h | 3 ++
> commit.h | 1 +
> t/README | 3 ++
> t/helper/test-read-graph.c | 4 ++
> t/t4216-log-bloom.sh | 4 +-
> t/t5318-commit-graph.sh | 70 ++++++++++++++++++++-----
> t/t5324-split-commit-graph.sh | 12 ++---
> t/t6600-test-reach.sh | 68 +++++++++++++-----------
> 9 files changed, 206 insertions(+), 57 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 03948adfce..71d0b243db 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
> #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
> #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
> +#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
> #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
> #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
> #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
> #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> -#define MAX_NUM_CHUNKS 7
> +#define MAX_NUM_CHUNKS 9
All right.
>
> #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>
> @@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
> #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
> + GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
>
> +#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
> +
All right, though the naming convention is different from the one used
for EDGE chunk: GRAPH_EXTRA_EDGES_NEEDED and GRAPH_EDGE_LAST_MASK.
> /* Remember to update object flag allocation in object.h */
> #define REACHABLE (1u<<15)
>
> @@ -385,6 +389,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
> graph->chunk_commit_data = data + chunk_offset;
> break;
>
> + case GRAPH_CHUNKID_GENERATION_DATA:
> + if (graph->chunk_generation_data)
> + chunk_repeated = 1;
> + else
> + graph->chunk_generation_data = data + chunk_offset;
> + break;
> +
> + case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
> + if (graph->chunk_generation_data_overflow)
> + chunk_repeated = 1;
> + else
> + graph->chunk_generation_data_overflow = data + chunk_offset;
> + break;
> +
Necessary but unavoidable boilerplate for adding new chunks to the
commit-graph file format. All right.
> case GRAPH_CHUNKID_EXTRAEDGES:
> if (graph->chunk_extra_edges)
> chunk_repeated = 1;
> @@ -745,8 +763,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> {
> const unsigned char *commit_data;
> struct commit_graph_data *graph_data;
> - uint32_t lex_index;
> - uint64_t date_high, date_low;
> + uint32_t lex_index, offset_pos;
> + uint64_t date_high, date_low, offset;
All right, we are adding two new variables: `offset` to read data stored
in the GDAT chunk, and `offset_pos` to help read data from the GDOV chunk
if necessary, i.e. to handle overflow in corrected commit date offset
storage.
>
> while (pos < g->num_commits_in_base)
> g = g->base_graph;
> @@ -764,7 +782,16 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> date_low = get_be32(commit_data + g->hash_len + 12);
> item->date = (timestamp_t)((date_high << 32) | date_low);
>
> - graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> + if (g->chunk_generation_data) {
> + offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
Style: why space after the `(timestamp_t)` cast operator?
Though CodingGuidelines do not say anything on this topic... perhaps the
space after cast operator makes it more readable?
> +
> + if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
All right, so CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW is the equivalent of
GRAPH_EXTRA_EDGES_NEEDED.
> + offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
Hmmm... instead of using bitwise AND with an equivalent of
GRAPH_EDGE_LAST_MASK, we utilize the fact that we know the MSB is set,
so we can clear it with bitwise XOR. Clever trick.
> + graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
> + } else
> + graph_data->generation = item->date + offset;
All right, this handles the case when we have generation number v2, with
or without overflow.
> + } else
> + graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
All right, this handles the case where we have only generation number
v1, like for commit-graph file written by old Git.
>
> if (g->topo_levels)
> *topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
> @@ -942,6 +969,7 @@ struct write_commit_graph_context {
> struct packed_oid_list oids;
> struct packed_commit_list commits;
> int num_extra_edges;
> + int num_generation_data_overflows;
> unsigned long approx_nr_objects;
> struct progress *progress;
> int progress_done;
> @@ -960,7 +988,8 @@ struct write_commit_graph_context {
> report_progress:1,
> split:1,
> changed_paths:1,
> - order_by_pack:1;
> + order_by_pack:1,
> + write_generation_data:1;
>
> struct topo_level_slab *topo_levels;
> const struct commit_graph_opts *opts;
All right, this adds necessary fields to `struct write_commit_graph_context`.
> @@ -1120,6 +1149,44 @@ static int write_graph_chunk_data(struct hashfile *f,
> return 0;
> }
>
> +static int write_graph_chunk_generation_data(struct hashfile *f,
> + struct write_commit_graph_context *ctx)
> +{
> + int i, num_generation_data_overflows = 0;
Minor nitpick: in my opinion there should be an empty line here, between
the variable declarations and the code... however, not all
write_graph_chunk_*() functions have it.
> + for (i = 0; i < ctx->commits.nr; i++) {
> + struct commit *c = ctx->commits.list[i];
> + timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
> + display_progress(ctx->progress, ++ctx->progress_cnt);
All right.
> +
> + if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
> + offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
> + num_generation_data_overflows++;
> + }
Hmmm... shouldn't we store these commits that need overflow handling
(with corrected commit date offset greater than GENERATION_NUMBER_V2_OFFSET_MAX)
in a list or a queue, to remember them for writing GDOV chunk?
We could store oids, or we could store commits themselves, for example:
if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
num_generation_data_overflows++;
ALLOC_GROW(ctx->gdov_commits.list, ctx->gdov_commits.nr + 1, ctx->gdov_commits.alloc);
ctx->gdov_commits.list[ctx->gdov_commits.nr] = c;
ctx->gdov_commits.nr++;
}
Though in the above proposal we could get rid of `num_generation_data_overflows`,
as it should be the same as `ctx->gdov_commits.nr`.
I have called the extra commit list member of write_commit_graph_context
`gdov_commits`, but perhaps a better name would be `commits_gen_v2_overflow`,
or similar more descriptive name.
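As a self-contained illustration of the idea (not a drop-in for commit-graph.c), the sketch below encodes a plain array of offsets into 32-bit GDAT entries and records which commits overflowed, so that a second pass can visit only those. The function encode_gdat(), the caller-provided `overflow_idx` array and the two macros are assumptions made for the example; real code would use ALLOC_GROW() on a list in the write context as proposed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFFSET_MAX          ((1ULL << 31) - 1) /* GENERATION_NUMBER_V2_OFFSET_MAX */
#define OFFSET_OVERFLOW_BIT (1ULL << 31)

/* Fill gdat[0..n-1] from offsets[0..n-1]. For each overflowing offset,
 * store MSB | position-in-GDOV and remember the commit's index in
 * overflow_idx, which must have room for n entries. Returns the number
 * of overflowing commits, i.e. the number of GDOV entries needed. */
static size_t encode_gdat(const uint64_t *offsets, size_t n,
			  uint32_t *gdat, size_t *overflow_idx)
{
	size_t nr_overflow = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (offsets[i] > OFFSET_MAX) {
			/* the position must fit in the low 31 bits */
			assert(nr_overflow < OFFSET_OVERFLOW_BIT);
			gdat[i] = (uint32_t)(OFFSET_OVERFLOW_BIT | nr_overflow);
			overflow_idx[nr_overflow++] = i;
		} else {
			gdat[i] = (uint32_t)offsets[i];
		}
	}
	return nr_overflow;
}
```

The returned count doubles as the counter, which is the point made above about dropping `num_generation_data_overflows`.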
> +
> + hashwrite_be32(f, offset);
> + }
> +
> + return 0;
> +}
All right.
> +
> +static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
> + struct write_commit_graph_context *ctx)
> +{
> + int i;
> + for (i = 0; i < ctx->commits.nr; i++) {
Here we loop over *all* commits again, instead of looping over only those
very rare commits that need overflow handling for their corrected commit
date offsets.
Though this possible performance issue^* could be fixed in a future commit.
*) It needs to be actually benchmarked which version is faster.
With the change proposed above (and required changes to the `struct
write_commit_graph_context`) it could look like this:
for (i = 0; i < ctx->gdov_commits.nr; i++) {
> + struct commit *c = ctx->commits.list[i];
> + timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
> + display_progress(ctx->progress, ++ctx->progress_cnt);
> +
> + if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
> + hashwrite_be32(f, offset >> 32);
> + hashwrite_be32(f, (uint32_t) offset);
> + }
> + }
The above would be as simple as the following:
struct commit *c = ctx->gdov_commits.list[i];
timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
display_progress(ctx->progress, ++ctx->progress_cnt);
hashwrite_be64(f, offset);
That assumes there is a hashwrite_be64(); otherwise it would be the
following:
hashwrite_be32(f, offset >> 32);
hashwrite_be32(f, (uint32_t)offset);
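The two forms produce the same bytes. A minimal sketch on a plain buffer, assuming nothing about the hashfile API (put_be32_sketch(), put_be64_as_two_be32() and get_be64_sketch() are names invented for the example):

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit value in big-endian (network) byte order. */
static void put_be32_sketch(unsigned char *p, uint32_t v)
{
	assert(p);
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* A 64-bit big-endian store is the high word followed by the low word. */
static void put_be64_as_two_be32(unsigned char *p, uint64_t v)
{
	put_be32_sketch(p, (uint32_t)(v >> 32));
	put_be32_sketch(p + 4, (uint32_t)v);
}

/* Read the eight bytes back as one big-endian 64-bit value. */
static uint64_t get_be64_sketch(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}
```

Reading the buffer back with get_be64_sketch() returns the value that was written, which is what the get_be64() call on the reading side relies on.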
> +
> + return 0;
> +}
> +
> static int write_graph_chunk_extra_edges(struct hashfile *f,
> struct write_commit_graph_context *ctx)
> {
> @@ -1399,7 +1466,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>
> if (current->date && current->date > max_corrected_commit_date)
> max_corrected_commit_date = current->date - 1;
> +
This is a bit unrelated change, adding this empty line.
> commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
> +
> + if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
> + ctx->num_generation_data_overflows++;
All right, we need to track the number of commits that need overflow
handling for generation number v2 to know what size the GDOV chunk would
need to be.
> }
> }
> }
> @@ -1765,6 +1836,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> chunks[2].id = GRAPH_CHUNKID_DATA;
> chunks[2].size = (hashsz + 16) * ctx->commits.nr;
> chunks[2].write_fn = write_graph_chunk_data;
> +
> + if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
> + ctx->write_generation_data = 0;
All right, here we handle GIT_TEST_COMMIT_GRAPH_NO_GDAT.
> + if (ctx->write_generation_data) {
> + chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
> + chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
> + chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
> + num_chunks++;
> + }
All right, the GDAT chunk consists of <number of commits> entries.
> + if (ctx->num_generation_data_overflows) {
> + chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
> + chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
> + chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
> + num_chunks++;
> + }
All right, that's what num_generation_data_overflows was for.
> if (ctx->num_extra_edges) {
> chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
> chunks[num_chunks].size = 4 * ctx->num_extra_edges;
> @@ -2170,6 +2256,8 @@ int write_commit_graph(struct object_directory *odb,
> ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
> ctx->opts = opts;
> ctx->total_bloom_filter_data_size = 0;
> + ctx->write_generation_data = 1;
> + ctx->num_generation_data_overflows = 0;
>
> bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
> bloom_settings.bits_per_entry);
> diff --git a/commit-graph.h b/commit-graph.h
> index 2e9aa7824e..19a02001fd 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -6,6 +6,7 @@
> #include "oidset.h"
>
> #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
> +#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
> #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
> #define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
>
> @@ -68,6 +69,8 @@ struct commit_graph {
> const uint32_t *chunk_oid_fanout;
> const unsigned char *chunk_oid_lookup;
> const unsigned char *chunk_commit_data;
> + const unsigned char *chunk_generation_data;
> + const unsigned char *chunk_generation_data_overflow;
All right, two new chunks: GDAT and GDOV.
> const unsigned char *chunk_extra_edges;
> const unsigned char *chunk_base_graphs;
> const unsigned char *chunk_bloom_indexes;
> diff --git a/commit.h b/commit.h
> index 33c66b2177..251d877fcf 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -14,6 +14,7 @@
> #define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
> #define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
> #define GENERATION_NUMBER_ZERO 0
> +#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)
Should we use this form, or a hexadecimal constant?
#define GENERATION_NUMBER_V2_OFFSET_MAX 0x7FFFFFFF
But I think the current definition is more explicit: all bits of the
32-bit entry set to one except for the most significant bit. All right.
>
> struct commit_list {
> struct commit *item;
> diff --git a/t/README b/t/README
> index 2adaf7c2d2..975c054bc9 100644
> --- a/t/README
> +++ b/t/README
> @@ -379,6 +379,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
> be written after every 'git commit' command, and overrides the
> 'core.commitGraph' setting to true.
>
> +GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
> +commit-graph to be written without generation data chunk.
> +
All right. Nice to have it documented.
> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
> commit-graph write to compute and write changed path Bloom filters for
> every 'git commit-graph write', as if the `--changed-paths` option was
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index 5f585a1725..75927b2c81 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -33,6 +33,10 @@ int cmd__read_graph(int argc, const char **argv)
> printf(" oid_lookup");
> if (graph->chunk_commit_data)
> printf(" commit_metadata");
> + if (graph->chunk_generation_data)
> + printf(" generation_data");
> + if (graph->chunk_generation_data_overflow)
> + printf(" generation_data_overflow");
> if (graph->chunk_extra_edges)
> printf(" extra_edges");
> if (graph->chunk_bloom_indexes)
All right, updating `test-tool read-graph` with new chunks.
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index d11040ce41..dbde016188 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -40,11 +40,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
> '
>
> graph_read_expect () {
> - NUM_CHUNKS=5
> + NUM_CHUNKS=6
> cat >expect <<- EOF
Sidenote: I have just noticed this, and as I see it is not something you
wrote, but usually we write it with no space after the dash and before
'EOF':
cat >expect <<-EOF
> header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
> num_commits: $1
> - chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
> + chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
> EOF
> test-tool read-graph >actual &&
> test_cmp expect actual
All right, updating expect value for `test-tool read-graph` in the usual
case, with generation number chunk.
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 2ed0c1544d..0328e98564 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -76,7 +76,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
> graph_read_expect() {
> OPTIONAL=""
> NUM_CHUNKS=3
> - if test ! -z $2
> + if test ! -z "$2"
All right, that is a straightforward fix, which is now needed.
> then
> OPTIONAL=" $2"
> NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
> @@ -103,14 +103,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
> # valid commit and tree OID
> git rev-parse HEAD HEAD^{tree} >in &&
> git commit-graph write --stdin-commits <in &&
> - graph_read_expect 3
> + graph_read_expect 3 generation_data
> '
>
> test_expect_success 'write graph' '
> cd "$TRASH_DIRECTORY/full" &&
> git commit-graph write &&
> test_path_is_file $objdir/info/commit-graph &&
> - graph_read_expect "3"
> + graph_read_expect "3" generation_data
> '
>
> test_expect_success POSIXPERM 'write graph has correct permissions' '
> @@ -219,7 +219,7 @@ test_expect_success 'write graph with merges' '
> cd "$TRASH_DIRECTORY/full" &&
> git commit-graph write &&
> test_path_is_file $objdir/info/commit-graph &&
> - graph_read_expect "10" "extra_edges"
> + graph_read_expect "10" "generation_data extra_edges"
> '
>
> graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
> @@ -254,7 +254,7 @@ test_expect_success 'write graph with new commit' '
> cd "$TRASH_DIRECTORY/full" &&
> git commit-graph write &&
> test_path_is_file $objdir/info/commit-graph &&
> - graph_read_expect "11" "extra_edges"
> + graph_read_expect "11" "generation_data extra_edges"
> '
>
> graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -264,7 +264,7 @@ test_expect_success 'write graph with nothing new' '
> cd "$TRASH_DIRECTORY/full" &&
> git commit-graph write &&
> test_path_is_file $objdir/info/commit-graph &&
> - graph_read_expect "11" "extra_edges"
> + graph_read_expect "11" "generation_data extra_edges"
> '
>
> graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -274,7 +274,7 @@ test_expect_success 'build graph from latest pack with closure' '
> cd "$TRASH_DIRECTORY/full" &&
> cat new-idx | git commit-graph write --stdin-packs &&
> test_path_is_file $objdir/info/commit-graph &&
> - graph_read_expect "9" "extra_edges"
> + graph_read_expect "9" "generation_data extra_edges"
> '
>
> graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
> @@ -287,7 +287,7 @@ test_expect_success 'build graph from commits with closure' '
> git rev-parse merge/1 >>commits-in &&
> cat commits-in | git commit-graph write --stdin-commits &&
> test_path_is_file $objdir/info/commit-graph &&
> - graph_read_expect "6"
> + graph_read_expect "6" "generation_data"
> '
>
> graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
> @@ -297,7 +297,7 @@ test_expect_success 'build graph from commits with append' '
> cd "$TRASH_DIRECTORY/full" &&
> git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
> test_path_is_file $objdir/info/commit-graph &&
> - graph_read_expect "10" "extra_edges"
> + graph_read_expect "10" "generation_data extra_edges"
> '
>
> graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -307,7 +307,7 @@ test_expect_success 'build graph using --reachable' '
> cd "$TRASH_DIRECTORY/full" &&
> git commit-graph write --reachable &&
> test_path_is_file $objdir/info/commit-graph &&
> - graph_read_expect "11" "extra_edges"
> + graph_read_expect "11" "generation_data extra_edges"
> '
>
> graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
> @@ -328,7 +328,7 @@ test_expect_success 'write graph in bare repo' '
> cd "$TRASH_DIRECTORY/bare" &&
> git commit-graph write &&
> test_path_is_file $baredir/info/commit-graph &&
> - graph_read_expect "11" "extra_edges"
> + graph_read_expect "11" "generation_data extra_edges"
> '
All those just add "generation_data" (aka GDAT) to expected chunks. All
right.
>
> graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
> @@ -454,8 +454,9 @@ test_expect_success 'warn on improper hash version' '
>
> test_expect_success 'git commit-graph verify' '
> cd "$TRASH_DIRECTORY/full" &&
> - git rev-parse commits/8 | git commit-graph write --stdin-commits &&
> - git commit-graph verify >output
> + git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
> + git commit-graph verify >output &&
All right, this simply adds GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. I assume
this is needed because this test also serves as setup for the following
tests _without_ even saying so in the test name (bad practice, in my
opinion), and the comment above this test says the following:
# the verify tests below expect the commit-graph to contain
# exactly the commits reachable from the commits/8 branch.
# If the file changes the set of commits in the list, then the
# offsets into the binary file will result in different edits
# and the tests will likely break.
So the following tests are fragile (though perhaps unavoidably fragile),
and without this change they would not work, I assume.
> + graph_read_expect 9 extra_edges
I guess that this is here to check that GIT_TEST_COMMIT_GRAPH_NO_GDAT=1
works as intended, and that the following "verify" tests won't break.
I understand its necessity, even if I don't quite like having a test
that checks multiple things. This is a minor issue, though.
All right.
We might want to have a separate test that checks that we get a
commit-graph with or without the GDAT chunk depending on whether we use
GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. On the other hand, this environment
variable is there purely for tests, so the question is: should we test
the test infrastructure?
> '
>
> NUM_COMMITS=9
> @@ -741,4 +742,47 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
> )
> '
>
> +test_commit_with_date() {
> + file="$1.t" &&
> + echo "$1" >"$file" &&
> + git add "$file" &&
> + GIT_COMMITTER_DATE="$2" GIT_AUTHOR_DATE="$2" git commit -m "$1"
> + git tag "$1"
> +}
Here we add a helper function. All right.
I wonder though if it wouldn't be a better idea to add a `--date <date>`
option to the test_commit() function in test-lib-functions.sh (the
option would set GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, and also
set notick=yes).
For example:
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index f1ae935fee..a1f9a2b09b 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -202,6 +202,12 @@ test_commit () {
--signoff)
signoff="$1"
;;
+ --date)
+ notick=yes
+ GIT_COMMITTER_DATE="$2"
+ GIT_AUTHOR_DATE="$2"
+ shift
+ ;;
-C)
indir="$2"
shift
> +
It would be nice to have a comment here describing the shape of the
revision history we generate, which currently is present only in the
commit message.
# We test the overflow-related code with the following repo history:
#
# 4:F - 5:N - 6:U
# / \
# 1:U - 2:N - 3:U M:N
# \ /
# 7:N - 8:F - 9:N
#
# Here the commits denoted by U have committer date of zero seconds
# since Unix epoch, the commits denoted by N have committer date
# starting from 1112354055 seconds since Unix epoch (default committer
# date for the test suite), and the commits denoted by F have committer
# date of (2 ^ 31 - 2) seconds since Unix epoch.
#
# The largest offset observed is 2 ^ 31, just large enough to overflow.
#
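To double-check the "just large enough to overflow" claim, here is a
back-of-the-envelope sketch in plain shell arithmetic. It assumes that a
corrected commit date is the larger of the committer date and one more
than the largest corrected commit date among the parents (that rule is
not spelled out in this message), and it simplifies by giving every N
commit the same date instead of ticking it forward. The helper name
corrected_date is made up for the illustration, and only the upper
branch (commits 1 through 6) is followed.

```shell
# Assumed rule: corrected date = max(committer date,
#                                    largest parent corrected date + 1).
# $1 = committer date, $2 = largest parent corrected date (0 if no parent)
corrected_date () {
	if test "$1" -gt "$(($2 + 1))"
	then
		echo "$1"
	else
		echo "$(($2 + 1))"
	fi
}

U=0                       # Unix epoch zero
N=1112354055              # simplification: all N commits share this date
F=$(( (1 << 31) - 2 ))    # far-future date

c1=$(corrected_date $U 0)
c2=$(corrected_date $N $c1)
c3=$(corrected_date $U $c2)
c4=$(corrected_date $F $c3)
c5=$(corrected_date $N $c4)
c6=$(corrected_date $U $c5)

# The stored offset is the corrected date minus the committer date.
offset6=$((c6 - U))
echo "offset of commit 6: $offset6"
```

Under these assumptions commit 6 ends up with corrected commit date
2 ^ 31 and committer date 0, so its offset is 2 ^ 31, one more than
fits into 31 bits.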
> +test_expect_success 'overflow corrected commit date offset' '
> + objdir=".git/objects" &&
> + UNIX_EPOCH_ZERO="1970-01-01 00:00 +0000" &&
> + FUTURE_DATE="@2147483646 +0000" &&
It is a bit funny to see UNIX_EPOCH_ZERO spelled one way, and
FUTURE_DATE another way.
Wouldn't it be more readable to use UNIX_EPOCH_ZERO="@0 +0000"?
> + test_oid_cache <<-EOF &&
> + oid_version sha1:1
> + oid_version sha256:2
> + EOF
> + cd "$TRASH_DIRECTORY" &&
> + mkdir repo &&
> + cd repo &&
> + git init &&
> + test_commit_with_date 1 "$UNIX_EPOCH_ZERO" &&
> + test_commit 2 &&
> + test_commit_with_date 3 "$UNIX_EPOCH_ZERO" &&
> + git commit-graph write --reachable &&
> + graph_read_expect 3 generation_data &&
> + test_commit_with_date 4 "$FUTURE_DATE" &&
> + test_commit 5 &&
> + test_commit_with_date 6 "$UNIX_EPOCH_ZERO" &&
> + git branch left &&
> + git reset --hard 3 &&
> + test_commit 7 &&
> + test_commit_with_date 8 "$FUTURE_DATE" &&
> + test_commit 9 &&
> + git branch right &&
> + git reset --hard 3 &&
> + git merge left right &&
We have a test_merge() function in test-lib-functions.sh; perhaps we
should use it here.
> + git commit-graph write --reachable &&
> + graph_read_expect 10 "generation_data generation_data_overflow" &&
All right, we write the commit-graph and check that it has both GDAT and
GDOV chunks present.
> + git commit-graph verify
All right, we check that the created commit-graph with GDAT and GDOV
passes `git commit-graph verify` checks.
> +'
> +
> +graph_git_behavior 'overflow corrected commit date offset' repo left right
All right, here we compare the Git behavior with the commit-graph to the
behavior without it... however I think that those two tests really
should have distinct (different) test names. Currently they both use
'overflow corrected commit date offset'.
> +
> test_done
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index c334ee9155..651df89ab2 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
> infodir=".git/objects/info" &&
> graphdir="$infodir/commit-graphs" &&
> test_oid_cache <<-EOM
> - shallow sha1:1760
> - shallow sha256:2064
> + shallow sha1:2132
> + shallow sha256:2436
>
> - base sha1:1376
> - base sha256:1496
> + base sha1:1408
> + base sha256:1528
>
> oid_version sha1:1
> oid_version sha256:2
> @@ -31,9 +31,9 @@ graph_read_expect() {
> NUM_BASE=$2
> fi
> cat >expect <<- EOF
> - header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
> + header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
> num_commits: $1
> - chunks: oid_fanout oid_lookup commit_metadata
> + chunks: oid_fanout oid_lookup commit_metadata generation_data
> EOF
> test-tool read-graph >output &&
> test_cmp expect output
All right, we now expect the commit graph to include the GDAT chunk...
though shouldn't the old expected value for the no-GDAT case be kept
for future tests? But perhaps this is not necessary.
Note that I have not checked the details, but it looks OK to me.
> diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
> index f807276337..e2d33a8a4c 100755
> --- a/t/t6600-test-reach.sh
> +++ b/t/t6600-test-reach.sh
> @@ -55,10 +55,13 @@ test_expect_success 'setup' '
> git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
> mv .git/objects/info/commit-graph commit-graph-half &&
> chmod u+w commit-graph-half &&
> + GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
> + mv .git/objects/info/commit-graph commit-graph-no-gdat &&
> + chmod u+w commit-graph-no-gdat &&
All right, this prepares for testing one more mode. The run_all_modes()
function would test the following cases:
- no commit-graph
- commit-graph for all commits, with GDAT
- commit-graph with half of commits, with GDAT
- commit-graph for all commits, without GDAT
> git config core.commitGraph true
> '
>
> -run_three_modes () {
> +run_all_modes () {
> test_when_finished rm -rf .git/objects/info/commit-graph &&
> "$@" <input >actual &&
> test_cmp expect actual &&
> @@ -67,11 +70,14 @@ run_three_modes () {
> test_cmp expect actual &&
> cp commit-graph-half .git/objects/info/commit-graph &&
> "$@" <input >actual &&
> + test_cmp expect actual &&
> + cp commit-graph-no-gdat .git/objects/info/commit-graph &&
> + "$@" <input >actual &&
> test_cmp expect actual
> }
>
> -test_three_modes () {
> - run_three_modes test-tool reach "$@"
> +test_all_modes () {
> + run_all_modes test-tool reach "$@"
> }
All right.
Though to reduce "noise" in this patch, the rename of run_three_modes()
to run_all_modes() and test_three_modes() to test_all_modes() could have
been done in a separate preparatory patch. It would be a pure refactoring
patch, without introducing any new functionality.
>
> test_expect_success 'ref_newer:miss' '
> @@ -80,7 +86,7 @@ test_expect_success 'ref_newer:miss' '
> B:commit-4-9
> EOF
> echo "ref_newer(A,B):0" >expect &&
> - test_three_modes ref_newer
> + test_all_modes ref_newer
> '
>
> test_expect_success 'ref_newer:hit' '
> @@ -89,7 +95,7 @@ test_expect_success 'ref_newer:hit' '
> B:commit-2-3
> EOF
> echo "ref_newer(A,B):1" >expect &&
> - test_three_modes ref_newer
> + test_all_modes ref_newer
> '
>
> test_expect_success 'in_merge_bases:hit' '
> @@ -98,7 +104,7 @@ test_expect_success 'in_merge_bases:hit' '
> B:commit-8-8
> EOF
> echo "in_merge_bases(A,B):1" >expect &&
> - test_three_modes in_merge_bases
> + test_all_modes in_merge_bases
> '
>
> test_expect_success 'in_merge_bases:miss' '
> @@ -107,7 +113,7 @@ test_expect_success 'in_merge_bases:miss' '
> B:commit-5-9
> EOF
> echo "in_merge_bases(A,B):0" >expect &&
> - test_three_modes in_merge_bases
> + test_all_modes in_merge_bases
> '
>
> test_expect_success 'in_merge_bases_many:hit' '
> @@ -117,7 +123,7 @@ test_expect_success 'in_merge_bases_many:hit' '
> X:commit-5-7
> EOF
> echo "in_merge_bases_many(A,X):1" >expect &&
> - test_three_modes in_merge_bases_many
> + test_all_modes in_merge_bases_many
> '
>
> test_expect_success 'in_merge_bases_many:miss' '
> @@ -127,7 +133,7 @@ test_expect_success 'in_merge_bases_many:miss' '
> X:commit-8-6
> EOF
> echo "in_merge_bases_many(A,X):0" >expect &&
> - test_three_modes in_merge_bases_many
> + test_all_modes in_merge_bases_many
> '
>
> test_expect_success 'in_merge_bases_many:miss-heuristic' '
> @@ -137,7 +143,7 @@ test_expect_success 'in_merge_bases_many:miss-heuristic' '
> X:commit-6-6
> EOF
> echo "in_merge_bases_many(A,X):0" >expect &&
> - test_three_modes in_merge_bases_many
> + test_all_modes in_merge_bases_many
> '
>
> test_expect_success 'is_descendant_of:hit' '
> @@ -148,7 +154,7 @@ test_expect_success 'is_descendant_of:hit' '
> X:commit-1-1
> EOF
> echo "is_descendant_of(A,X):1" >expect &&
> - test_three_modes is_descendant_of
> + test_all_modes is_descendant_of
> '
>
> test_expect_success 'is_descendant_of:miss' '
> @@ -159,7 +165,7 @@ test_expect_success 'is_descendant_of:miss' '
> X:commit-7-6
> EOF
> echo "is_descendant_of(A,X):0" >expect &&
> - test_three_modes is_descendant_of
> + test_all_modes is_descendant_of
> '
>
> test_expect_success 'get_merge_bases_many' '
> @@ -174,7 +180,7 @@ test_expect_success 'get_merge_bases_many' '
> git rev-parse commit-5-6 \
> commit-4-7 | sort
> } >expect &&
> - test_three_modes get_merge_bases_many
> + test_all_modes get_merge_bases_many
> '
>
> test_expect_success 'reduce_heads' '
> @@ -196,7 +202,7 @@ test_expect_success 'reduce_heads' '
> commit-2-8 \
> commit-1-10 | sort
> } >expect &&
> - test_three_modes reduce_heads
> + test_all_modes reduce_heads
> '
>
> test_expect_success 'can_all_from_reach:hit' '
> @@ -219,7 +225,7 @@ test_expect_success 'can_all_from_reach:hit' '
> Y:commit-8-1
> EOF
> echo "can_all_from_reach(X,Y):1" >expect &&
> - test_three_modes can_all_from_reach
> + test_all_modes can_all_from_reach
> '
>
> test_expect_success 'can_all_from_reach:miss' '
> @@ -241,7 +247,7 @@ test_expect_success 'can_all_from_reach:miss' '
> Y:commit-8-5
> EOF
> echo "can_all_from_reach(X,Y):0" >expect &&
> - test_three_modes can_all_from_reach
> + test_all_modes can_all_from_reach
> '
>
> test_expect_success 'can_all_from_reach_with_flag: tags case' '
> @@ -264,7 +270,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
> Y:commit-8-1
> EOF
> echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
> - test_three_modes can_all_from_reach_with_flag
> + test_all_modes can_all_from_reach_with_flag
> '
>
> test_expect_success 'commit_contains:hit' '
> @@ -280,8 +286,8 @@ test_expect_success 'commit_contains:hit' '
> X:commit-9-3
> EOF
> echo "commit_contains(_,A,X,_):1" >expect &&
> - test_three_modes commit_contains &&
> - test_three_modes commit_contains --tag
> + test_all_modes commit_contains &&
> + test_all_modes commit_contains --tag
> '
>
> test_expect_success 'commit_contains:miss' '
> @@ -297,8 +303,8 @@ test_expect_success 'commit_contains:miss' '
> X:commit-9-3
> EOF
> echo "commit_contains(_,A,X,_):0" >expect &&
> - test_three_modes commit_contains &&
> - test_three_modes commit_contains --tag
> + test_all_modes commit_contains &&
> + test_all_modes commit_contains --tag
> '
>
> test_expect_success 'rev-list: basic topo-order' '
> @@ -310,7 +316,7 @@ test_expect_success 'rev-list: basic topo-order' '
> commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
> commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
> >expect &&
> - run_three_modes git rev-list --topo-order commit-6-6
> + run_all_modes git rev-list --topo-order commit-6-6
> '
>
> test_expect_success 'rev-list: first-parent topo-order' '
> @@ -322,7 +328,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
> commit-6-2 \
> commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
> >expect &&
> - run_three_modes git rev-list --first-parent --topo-order commit-6-6
> + run_all_modes git rev-list --first-parent --topo-order commit-6-6
> '
>
> test_expect_success 'rev-list: range topo-order' '
> @@ -334,7 +340,7 @@ test_expect_success 'rev-list: range topo-order' '
> commit-6-2 commit-5-2 commit-4-2 \
> commit-6-1 commit-5-1 commit-4-1 \
> >expect &&
> - run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
> + run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
> '
>
> test_expect_success 'rev-list: range topo-order' '
> @@ -346,7 +352,7 @@ test_expect_success 'rev-list: range topo-order' '
> commit-6-2 commit-5-2 commit-4-2 \
> commit-6-1 commit-5-1 commit-4-1 \
> >expect &&
> - run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
> + run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
> '
>
> test_expect_success 'rev-list: first-parent range topo-order' '
> @@ -358,7 +364,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
> commit-6-2 \
> commit-6-1 commit-5-1 commit-4-1 \
> >expect &&
> - run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
> + run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
> '
>
> test_expect_success 'rev-list: ancestry-path topo-order' '
> @@ -368,7 +374,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
> commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
> commit-6-3 commit-5-3 commit-4-3 \
> >expect &&
> - run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
> + run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
> '
>
> test_expect_success 'rev-list: symmetric difference topo-order' '
> @@ -382,7 +388,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
> commit-3-8 commit-2-8 commit-1-8 \
> commit-3-7 commit-2-7 commit-1-7 \
> >expect &&
> - run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
> + run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
> '
>
> test_expect_success 'get_reachable_subset:all' '
> @@ -402,7 +408,7 @@ test_expect_success 'get_reachable_subset:all' '
> commit-1-7 \
> commit-5-6 | sort
> ) >expect &&
> - test_three_modes get_reachable_subset
> + test_all_modes get_reachable_subset
> '
>
> test_expect_success 'get_reachable_subset:some' '
> @@ -420,7 +426,7 @@ test_expect_success 'get_reachable_subset:some' '
> git rev-parse commit-3-3 \
> commit-1-7 | sort
> ) >expect &&
> - test_three_modes get_reachable_subset
> + test_all_modes get_reachable_subset
> '
>
> test_expect_success 'get_reachable_subset:none' '
> @@ -434,7 +440,7 @@ test_expect_success 'get_reachable_subset:none' '
> Y:commit-2-8
> EOF
> echo "get_reachable_subset(X,Y)" >expect &&
> - test_three_modes get_reachable_subset
> + test_all_modes get_reachable_subset
All those are pure renames of test_three_modes() to test_all_modes() (and
of run_three_modes() to run_all_modes()), which now test one more mode --
without GDAT.
All right.
> '
>
> test_done
Best,
--
Jakub Narębski
* Re: [PATCH v4 07/10] commit-graph: implement generation data chunk
2020-10-30 12:45 ` Jakub Narębski
@ 2020-11-06 11:25 ` Abhishek Kumar
2020-11-06 17:56 ` Jakub Narębski
0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-06 11:25 UTC (permalink / raw)
To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee
On Fri, Oct 30, 2020 at 01:45:29PM +0100, Jakub Narębski wrote:
>
> ...
>
> >
> > While storing corrected commit date offset instead of the corrected
> > commit date saves us 4 bytes per commit, it's possible for the offsets
> > to overflow the 4-bytes allocated. As such overflows are exceedingly
> > rare, we use the following overflow management scheme:
>
> Perhaps it would be a good idea to write the idea in full from the
> start, as the commit message is intended to be read standalone and not
> in the context of the patch series. On the other hand it might be too
> much detail in an already [necessarily] lengthy commit message.
>
> Perhaps something like the following proposal would read better.
>
> To minimize the space required to store corrected commit date, Git
> stores corrected commit date offsets into the commit-graph file,
> instead of corrected commit dates themselves. This saves us 4 bytes
> per commit, decreasing the GDAT chunk size by half, but it's possible
> for the offset to overflow the 4-bytes allocated for storage. As such
> overflows are and should be exceedingly rare, we use the following
> overflow management scheme:
>
Thanks, that's better.
>
> ...
>
> > We test the overflow-related code with the following repo history:
> >
> > F - N - U
> > / \
> > U - N - U N
> > \ /
> > N - F - N
>
> Do we need such complex history? I guess we need to test the handling of
> merge commits too.
>
I wanted to test three cases - a root epoch zero commit, a commit that's
far enough in the past to overflow the offset and a commit that's far
enough in the future to overflow the offset.
> >
> > Where the commits denoted by U have committer date of zero seconds
> > since Unix epoch, the commits denoted by N have committer date of
> > 1112354055 (default committer date for the test suite) seconds since
> > Unix epoch and the commits denoted by F have committer date of
> > (2 ^ 31 - 2) seconds since Unix epoch.
> >
> > The largest offset observed is 2 ^ 31, just large enough to overflow.
> >
> > [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> > commit-graph.c | 98 +++++++++++++++++++++++++++++++++--
> > commit-graph.h | 3 ++
> > commit.h | 1 +
> > t/README | 3 ++
> > t/helper/test-read-graph.c | 4 ++
> > t/t4216-log-bloom.sh | 4 +-
> > t/t5318-commit-graph.sh | 70 ++++++++++++++++++++-----
> > t/t5324-split-commit-graph.sh | 12 ++---
> > t/t6600-test-reach.sh | 68 +++++++++++++-----------
> > 9 files changed, 206 insertions(+), 57 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 03948adfce..71d0b243db 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
> > #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> > #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
> > #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
> > +#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
> > +#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
> > #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
> > #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
> > #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
> > #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> > -#define MAX_NUM_CHUNKS 7
> > +#define MAX_NUM_CHUNKS 9
>
> All right.
>
> >
> > #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
> >
> > @@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
> > #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
> > + GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
> >
> > +#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
> > +
>
> All right, though the naming convention is different from the one used
> for EDGE chunk: GRAPH_EXTRA_EDGES_NEEDED and GRAPH_EDGE_LAST_MASK.
>
> > /* Remember to update object flag allocation in object.h */
> > #define REACHABLE (1u<<15)
> >
> > @@ -385,6 +389,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
> > graph->chunk_commit_data = data + chunk_offset;
> > break;
> >
> > + case GRAPH_CHUNKID_GENERATION_DATA:
> > + if (graph->chunk_generation_data)
> > + chunk_repeated = 1;
> > + else
> > + graph->chunk_generation_data = data + chunk_offset;
> > + break;
> > +
> > + case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
> > + if (graph->chunk_generation_data_overflow)
> > + chunk_repeated = 1;
> > + else
> > + graph->chunk_generation_data_overflow = data + chunk_offset;
> > + break;
> > +
>
> Necessary but unavoidable boilerplate for adding new chunks to the
> commit-graph file format. All right.
>
> > case GRAPH_CHUNKID_EXTRAEDGES:
> > if (graph->chunk_extra_edges)
> > chunk_repeated = 1;
> > @@ -745,8 +763,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> > {
> > const unsigned char *commit_data;
> > struct commit_graph_data *graph_data;
> > - uint32_t lex_index;
> > - uint64_t date_high, date_low;
> > + uint32_t lex_index, offset_pos;
> > + uint64_t date_high, date_low, offset;
>
> All right, we are adding two new variables: `offset` to read data stored
> in the GDAT chunk, and `offset_pos` to help read data from the GDOV chunk
> if necessary, i.e. to handle overflow in corrected commit date offset
> storage.
>
> >
> > while (pos < g->num_commits_in_base)
> > g = g->base_graph;
> > @@ -764,7 +782,16 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> > date_low = get_be32(commit_data + g->hash_len + 12);
> > item->date = (timestamp_t)((date_high << 32) | date_low);
> >
> > - graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> > + if (g->chunk_generation_data) {
> > + offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
>
> Style: why space after the `(timestamp_t)` cast operator?
>
> Though CodingGuidelines do not say anything on this topic... perhaps the
> space after cast operator makes it more readable?
>
> > +
> > + if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
>
> All right, so the CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW is equivalent of
> GRAPH_EXTRA_EDGES_NEEDED.
>
> > + offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
>
> Hmmm... instead of using bitwise and on an equivalent to the
> GRAPH_EDGE_LAST_MASK, we utilize the fact that we know that the MSB bit
> is set, so we can clear it with bitwise xor. Clever trick.
>
>
> > + graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
> > + } else
> > + graph_data->generation = item->date + offset;
>
> All right, this handles the case when we have generation number v2, with
> or without overflow.
>
> > + } else
> > + graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>
> All right, this handles the case where we have only generation number
> v1, like for commit-graph file written by old Git.
>
> >
> > if (g->topo_levels)
> > *topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
> > @@ -942,6 +969,7 @@ struct write_commit_graph_context {
> > struct packed_oid_list oids;
> > struct packed_commit_list commits;
> > int num_extra_edges;
> > + int num_generation_data_overflows;
> > unsigned long approx_nr_objects;
> > struct progress *progress;
> > int progress_done;
> > @@ -960,7 +988,8 @@ struct write_commit_graph_context {
> > report_progress:1,
> > split:1,
> > changed_paths:1,
> > - order_by_pack:1;
> > + order_by_pack:1,
> > + write_generation_data:1;
> >
> > struct topo_level_slab *topo_levels;
> > const struct commit_graph_opts *opts;
>
> All right, this adds necessary fields to `struct write_commit_graph_context`.
>
> > @@ -1120,6 +1149,44 @@ static int write_graph_chunk_data(struct hashfile *f,
> > return 0;
> > }
> >
> > +static int write_graph_chunk_generation_data(struct hashfile *f,
> > + struct write_commit_graph_context *ctx)
> > +{
> > + int i, num_generation_data_overflows = 0;
>
> Minor nitpick: in my opinion there should be empty line here, between
> the variables declaration and the code... however not all
> write_graph_chunk_*() functions have it.
>
> > + for (i = 0; i < ctx->commits.nr; i++) {
> > + struct commit *c = ctx->commits.list[i];
> > + timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
> > + display_progress(ctx->progress, ++ctx->progress_cnt);
>
> All right.
>
> > +
> > + if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
> > + offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
> > + num_generation_data_overflows++;
> > + }
>
> Hmmm... shouldn't we store these commits that need overflow handling
> (with corrected commit date offset greater than GENERATION_NUMBER_V2_OFFSET_MAX)
> in a list or a queue, to remember them for writing GDOV chunk?
>
We could, although write_graph_chunk_extra_edges() (just like this function)
prefers to iterate over all commits again. Both octopus merges and
overflowing corrected commit dates are exceedingly rare, so it might be
worthwhile to trade some memory to avoid looping again.
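As a rough sketch of that trade (plain shell with made-up names, not the
commit-graph.c code): a single pass produces the 32-bit entries and at
the same time remembers the values that did not fit, so the second table
can be written without walking all commits again. What exactly the
second table should hold is left to the real patch; the sketch simply
keeps the overflowing values.

```shell
OFFSET_MAX=$(( (1 << 31) - 1 ))   # largest value that fits in 31 bits
OVERFLOW_FLAG=$(( 1 << 31 ))      # most significant bit of a 32-bit entry

# encode_offsets OFFSET...
# Sets $gdat to the list of 32-bit entries and $gdov to the list of
# values that overflowed, both collected in one pass.
encode_offsets () {
	gdat= gdov= num_overflows=0
	for offset
	do
		if test "$offset" -gt "$OFFSET_MAX"
		then
			gdov="${gdov:+$gdov }$offset"
			offset=$((OVERFLOW_FLAG | num_overflows))
			num_overflows=$((num_overflows + 1))
		fi
		gdat="${gdat:+$gdat }$offset"
	done
}

# decode_offset ENTRY
# Prints the value that one $gdat entry stands for.
decode_offset () {
	if test "$(($1 & OVERFLOW_FLAG))" -ne 0
	then
		# The flag is known to be set, so xor clears it and leaves
		# the zero-based position in $gdov.
		pos=$(($1 ^ OVERFLOW_FLAG))
		set -- $gdov
		shift "$pos"
	fi
	echo "$1"
}

encode_offsets 1 3000000000 0 2147483648
echo "$gdat"    # prints 1 2147483648 0 2147483649
echo "$gdov"    # prints 3000000000 2147483648
```

Reading back, `decode_offset 2147483649` yields the second remembered
value, 2147483648, while a small entry such as 7 is returned unchanged.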
> We could store oids, or we could store commits themselves, for example:
>
> if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
> offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
> num_generation_data_overflows++;
>
> ALLOC_GROW(ctx->gdov_commits.list, ctx->gdov_commits.nr + 1, ctx->gdov_commits.alloc);
> ctx->gdov_commits.list[ctx->gdov_commits.nr] = c;
> ctx->gdov_commits.nr++;
> }
>
> Though in the above proposal we could get rid of `num_generation_data_overflows`,
> as it should be the same as `ctx->gdov_commits.nr`.
>
> I have called the extra commit list member of write_commit_graph_context
> `gdov_commits`, but perhaps a better name would be `commits_gen_v2_overflow`,
> or similar more descriptive name.
>
> > +
> > + hashwrite_be32(f, offset);
> > + }
> > +
> > + return 0;
> > +}
>
> All right.
>
> > +
> > +static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
> > + struct write_commit_graph_context *ctx)
> > +{
> > + int i;
> > + for (i = 0; i < ctx->commits.nr; i++) {
>
> Here we loop over *all* commits again, instead of looping over those
> very rare commits that need overflow handling for their corrected commit
> date data.
>
> Though this possible performance issue^* could be fixed in the future commit.
>
> *) It needs to be actually benchmarked which version is faster.
>
> ...
>
> >
> > graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
> > @@ -454,8 +454,9 @@ test_expect_success 'warn on improper hash version' '
> >
> > test_expect_success 'git commit-graph verify' '
> > cd "$TRASH_DIRECTORY/full" &&
> > - git rev-parse commits/8 | git commit-graph write --stdin-commits &&
> > - git commit-graph verify >output
> > + git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
> > + git commit-graph verify >output &&
>
> All right, this simply adds GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. I assume
> this is needed because this test is also setup for the following commits
> _without_ even saying that in the test name (bad practice, in my
> opinion), and the comment above this test says the following:
>
> # the verify tests below expect the commit-graph to contain
> # exactly the commits reachable from the commits/8 branch.
> # If the file changes the set of commits in the list, then the
> # offsets into the binary file will result in different edits
> # and the tests will likely break.
>
> So the following tests are fragile (though perhaps unavoidably fragile),
> and without this change they would not work, I assume.
>
> > + graph_read_expect 9 extra_edges
>
> I guess that this is here to check that GIT_TEST_COMMIT_GRAPH_NO_GDAT=1
> work as intended, and that the following "verify" tests wouldn't break.
> I understand its necessity, even if I don't quite like having a test
> that checks multiple things. This is a minor issue, though.
>
> All right.
>
>
> We might want to have a separate test that checks that we get
> commit-graph with and without GDAT chunk depending on whether we use
> GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. On the other hand, this environment
> variable is there purely for tests, so the question is should we test
> the test infrastructure?
>
> > '
> >
> > NUM_COMMITS=9
> > @@ -741,4 +742,47 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
> > )
> > '
> >
> > +test_commit_with_date() {
> > + file="$1.t" &&
> > + echo "$1" >"$file" &&
> > + git add "$file" &&
> > + GIT_COMMITTER_DATE="$2" GIT_AUTHOR_DATE="$2" git commit -m "$1"
> > + git tag "$1"
> > +}
>
> Here we add a helper function. All right.
>
> I wonder though if it wouldn't be a better idea to add `--date <date>`
> option to the test_commit() function in test-lib-functions.sh (which
> option would set GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, and also
> set notick=yes).
>
Yes, that's a better idea - I didn't know how to change test_commit()
well enough to tinker with what's working.
> For example:
>
> diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
> index f1ae935fee..a1f9a2b09b 100644
> --- a/t/test-lib-functions.sh
> +++ b/t/test-lib-functions.sh
> @@ -202,6 +202,12 @@ test_commit () {
> --signoff)
> signoff="$1"
> ;;
> + --date)
> + notick=yes
> + GIT_COMMITTER_DATE="$2"
> + GIT_AUTHOR_DATE="$2"
> + shift
> + ;;
> -C)
> indir="$2"
> shift
>
>
> > +
>
> It would be nice to have a comment there describing the shape of the
> revision history we generate here, which currently is present only in
> the commit message.
>
> # We test the overflow-related code with the following repo history:
> #
> # 4:F - 5:N - 6:U
> # / \
> # 1:U - 2:N - 3:U M:N
> # \ /
> # 7:N - 8:F - 9:N
> #
> # Here the commits denoted by U have committer date of zero seconds
> # since Unix epoch, the commits denoted by N have committer date
> # starting from 1112354055 seconds since Unix epoch (default committer
> # date for the test suite), and the commits denoted by F have committer
> # date of (2 ^ 31 - 2) seconds since Unix epoch.
> #
> # The largest offset observed is 2 ^ 31, just large enough to overflow.
> #
Yes, it would. Added.
>
> > +test_expect_success 'overflow corrected commit date offset' '
> > + objdir=".git/objects" &&
> > + UNIX_EPOCH_ZERO="1970-01-01 00:00 +0000" &&
> > + FUTURE_DATE="@2147483646 +0000" &&
>
> It is a bit funny to see UNIX_EPOCH_ZERO spelled one way, and
> FUTURE_DATE other way.
>
> Wouldn't it be more readable to use UNIX_EPOCH_ZERO="@0 +0000"?
It would. For some reason I couldn't figure out the valid format for
this. Changed.
>
> > + test_oid_cache <<-EOF &&
> > + oid_version sha1:1
> > + oid_version sha256:2
> > + EOF
> > + cd "$TRASH_DIRECTORY" &&
> > + mkdir repo &&
> > + cd repo &&
> > + git init &&
> > + test_commit_with_date 1 "$UNIX_EPOCH_ZERO" &&
> > + test_commit 2 &&
> > + test_commit_with_date 3 "$UNIX_EPOCH_ZERO" &&
> > + git commit-graph write --reachable &&
> > + graph_read_expect 3 generation_data &&
> > + test_commit_with_date 4 "$FUTURE_DATE" &&
> > + test_commit 5 &&
> > + test_commit_with_date 6 "$UNIX_EPOCH_ZERO" &&
> > + git branch left &&
> > + git reset --hard 3 &&
> > + test_commit 7 &&
> > + test_commit_with_date 8 "$FUTURE_DATE" &&
> > + test_commit 9 &&
> > + git branch right &&
> > + git reset --hard 3 &&
> > + git merge left right &&
>
> We have test_merge() function in test-lib-functions.sh, perhaps we
> should use it here.
>
> > + git commit-graph write --reachable &&
> > + graph_read_expect 10 "generation_data generation_data_overflow" &&
>
> All right, we write the commit-graph and check that it has both GDAT and
> GDOV chunks present.
>
> > + git commit-graph verify
>
> All right, we check that the created commit graph with GDAT and GDOV
> passes the 'git commit-graph verify' checks.
>
> > +'
> > +
> > +graph_git_behavior 'overflow corrected commit date offset' repo left right
>
> All right, here we compare the Git behavior with the commit-graph to the
> behavior without it... however I think that those two tests really
> should have distinct (different) test names. Currently they both use
> 'overflow corrected commit date offset'.
>
Following the earlier tests, the first test could be "set up and verify
repo with generation data overflow chunk" and the git behavior test can
be "generation data overflow chunk repo"
> > +
> > test_done
> > diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> > index c334ee9155..651df89ab2 100755
> > --- a/t/t5324-split-commit-graph.sh
> > +++ b/t/t5324-split-commit-graph.sh
> > @@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
> > infodir=".git/objects/info" &&
> > graphdir="$infodir/commit-graphs" &&
> > test_oid_cache <<-EOM
> > - shallow sha1:1760
> > - shallow sha256:2064
> > + shallow sha1:2132
> > + shallow sha256:2436
> >
> > - base sha1:1376
> > - base sha256:1496
> > + base sha1:1408
> > + base sha256:1528
> >
> > oid_version sha1:1
> > oid_version sha256:2
> > @@ -31,9 +31,9 @@ graph_read_expect() {
> > NUM_BASE=$2
> > fi
> > cat >expect <<- EOF
> > - header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
> > + header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
> > num_commits: $1
> > - chunks: oid_fanout oid_lookup commit_metadata
> > + chunks: oid_fanout oid_lookup commit_metadata generation_data
> > EOF
> > test-tool read-graph >output &&
> > test_cmp expect output
>
> All right, we now expect the commit graph to include the GDAT chunk...
> though shouldn't the old expected value for no GDAT still be there, for
> future tests? But perhaps this is not necessary.
>
> Note that I have not checked the details, but it looks OK to me.
>
> > diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
> > index f807276337..e2d33a8a4c 100755
> > --- a/t/t6600-test-reach.sh
> > +++ b/t/t6600-test-reach.sh
> > @@ -55,10 +55,13 @@ test_expect_success 'setup' '
> > git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
> > mv .git/objects/info/commit-graph commit-graph-half &&
> > chmod u+w commit-graph-half &&
> > + GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
> > + mv .git/objects/info/commit-graph commit-graph-no-gdat &&
> > + chmod u+w commit-graph-no-gdat &&
>
> All right, this prepares for testing one more mode. The run_all_modes()
> function would test the following cases:
> - no commit-graph
> - commit-graph for all commits, with GDAT
> - commit-graph with half of commits, with GDAT
> - commit-graph for all commits, without GDAT
>
> > git config core.commitGraph true
> > '
> >
> > -run_three_modes () {
> > +run_all_modes () {
> > test_when_finished rm -rf .git/objects/info/commit-graph &&
> > "$@" <input >actual &&
> > test_cmp expect actual &&
> > @@ -67,11 +70,14 @@ run_three_modes () {
> > test_cmp expect actual &&
> > cp commit-graph-half .git/objects/info/commit-graph &&
> > "$@" <input >actual &&
> > + test_cmp expect actual &&
> > + cp commit-graph-no-gdat .git/objects/info/commit-graph &&
> > + "$@" <input >actual &&
> > test_cmp expect actual
> > }
> >
> > -test_three_modes () {
> > - run_three_modes test-tool reach "$@"
> > +test_all_modes () {
> > + run_all_modes test-tool reach "$@"
> > }
>
> All right.
>
> Though to reduce "noise" in this patch, the rename of run_three_modes()
> to run_all_modes() and test_three_modes() to test_all_modes() could have
> been done in a separate preparatory patch. It would be pure refactoring
> patch, without introducing any new functionality.
>
Sure, that makes sense to me - this patch is over 200 lines long
already.
> ...
Thanks
- Abhishek
* Re: [PATCH v4 07/10] commit-graph: implement generation data chunk
2020-11-06 11:25 ` Abhishek Kumar
@ 2020-11-06 17:56 ` Jakub Narębski
0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-11-06 17:56 UTC (permalink / raw)
To: Abhishek Kumar
Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar via GitGitGadget
In short: I think that because the current implementation of writing the
GDOV chunk follows the example of writing the EDGE chunk, it should be
left as it is now (simple), and possible performance improvements be
postponed to some future commit.
Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Fri, Oct 30, 2020 at 01:45:29PM +0100, Jakub Narębski wrote:
[...]
>>> We test the overflow-related code with the following repo history:
>>>
>>> F - N - U
>>> / \
>>> U - N - U N
>>> \ /
>>> N - F - N
>>
>> Do we need such complex history? I guess we need to test the handling of
>> merge commits too.
>>
>
> I wanted to test three cases - a root epoch zero commit, a commit that's
> far enough in past to overflow the offset and a commit that's far enough
> in the future to overflow the offset.
All right, if I understand this correctly this would be U as root, U-F
pair of commits and N-F pair of commits, respectively. Did I get it
right?
Anyway, it might be a good idea to put this explanation in the commit
message.
>>>
>>> Where the commits denoted by U have committer date of zero seconds
>>> since Unix epoch, the commits denoted by N have committer date of
>>> 1112354055 (default committer date for the test suite) seconds since
>>> Unix epoch and the commits denoted by F have committer date of
>>> (2 ^ 31 - 2) seconds since Unix epoch.
>>>
>>> The largest offset observed is 2 ^ 31, just large enough to overflow.
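To spell out where the 2 ^ 31 comes from, here is a minimal sketch of the
arithmetic along the F - N - U branch. It assumes that the corrected
commit date is the larger of the commit's own date and one more than the
largest corrected commit date among its parents, and that an offset fits
as long as it is at most 2 ^ 31 - 1. The names OFFSET_MAX,
corrected_date() and offset_overflows() are made up for this
illustration and are not part of the patch.

```c
#include <stdint.h>

/* Assumed limit for this sketch: the offset must fit in 31 bits. */
#define OFFSET_MAX ((UINT64_C(1) << 31) - 1)

/*
 * Corrected commit date under the assumption stated above: never smaller
 * than the commit's own date, and strictly larger than the corrected
 * commit date of every parent.
 */
uint64_t corrected_date(uint64_t date, uint64_t max_parent_corrected)
{
	uint64_t lower_bound = max_parent_corrected + 1;

	return date > lower_bound ? date : lower_bound;
}

/* Nonzero if (corrected date - commit date) no longer fits in 31 bits. */
int offset_overflows(uint64_t date, uint64_t corrected)
{
	return corrected - date > OFFSET_MAX;
}
```

Taking N's date as exactly 1112354055: F dated 2 ^ 31 - 2 keeps its own
date as corrected date, its N child gets 2 ^ 31 - 1 (offset 1035129592,
which still fits), and the U child dated 0 gets 2 ^ 31, so its offset is
2 ^ 31 and overflows.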
[...]
>>> @@ -1120,6 +1149,44 @@ static int write_graph_chunk_data(struct hashfile *f,
>>> return 0;
>>> }
>>>
>>> +static int write_graph_chunk_generation_data(struct hashfile *f,
>>> + struct write_commit_graph_context *ctx)
>>> +{
>>> + int i, num_generation_data_overflows = 0;
>>
>> Minor nitpick: in my opinion there should be empty line here, between
>> the variables declaration and the code... however not all
>> write_graph_chunk_*() functions have it.
>>
>>> + for (i = 0; i < ctx->commits.nr; i++) {
>>> + struct commit *c = ctx->commits.list[i];
>>> + timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
>>> + display_progress(ctx->progress, ++ctx->progress_cnt);
>>
>> All right.
>>
>>> +
>>> + if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
>>> + offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
>>> + num_generation_data_overflows++;
>>> + }
>>
>> Hmmm... shouldn't we store these commits that need overflow handling
>> (with corrected commit date offset greater than GENERATION_NUMBER_V2_OFFSET_MAX)
>> in a list or a queue, to remember them for writing GDOV chunk?
>>
>
> We could, although write_graph_chunk_extra_edges() (just like this function)
> prefers to iterate over all commits again. Both octopus merges and
> overflowing corrected commit dates are exceedingly rare, might be
> worthwhile to trade some memory to avoid looping again.
I'm sorry, I have not looked at what write_graph_chunk_extra_edges() does,
or rather how it does what it does -- it is a good idea to pattern your
solution on similar existing code.
For me this is an even stronger hint that we should strive for
simplicity first, and leave possible performance improvements for a
future commit, especially since you already perform the most significant
optimization for this overflow handling: ensuring that we do not perform
any work if there are no commits with generation data overflow.
Maybe we should mention the similarity between
write_graph_chunk_generation_data_overflow() and write_graph_chunk_extra_edges()
in the commit message. I am unsure...
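For reference, the "iterate over all commits again" shape could be
sketched as below. This is only an illustration under my own
assumptions: it works on a plain array of offsets instead of the commit
list, OFFSET_MAX and OVERFLOW_FLAG stand in for
GENERATION_NUMBER_V2_OFFSET_MAX and CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW
with assumed values, and encode_offsets() and collect_overflows() are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define OFFSET_MAX    ((UINT64_C(1) << 31) - 1)  /* assumed limit */
#define OVERFLOW_FLAG (UINT32_C(1) << 31)        /* assumed marker bit */

/*
 * Pass 1: one 32-bit entry per commit. An offset that does not fit is
 * replaced by the marker bit plus its index into the overflow list.
 * Returns the number of overflowing offsets.
 */
size_t encode_offsets(const uint64_t *offsets, size_t nr, uint32_t *gdat)
{
	size_t i, nr_overflow = 0;

	for (i = 0; i < nr; i++) {
		if (offsets[i] > OFFSET_MAX)
			gdat[i] = OVERFLOW_FLAG | (uint32_t)nr_overflow++;
		else
			gdat[i] = (uint32_t)offsets[i];
	}
	return nr_overflow;
}

/*
 * Pass 2: walk all commits again and emit only the oversized offsets,
 * in the same order in which pass 1 numbered them.
 */
size_t collect_overflows(const uint64_t *offsets, size_t nr, uint64_t *gdov)
{
	size_t i, nr_overflow = 0;

	for (i = 0; i < nr; i++)
		if (offsets[i] > OFFSET_MAX)
			gdov[nr_overflow++] = offsets[i];
	return nr_overflow;
}
```

The second pass can be skipped entirely when the first one reports zero
overflows, which is the optimization that matters most here; remembering
the overflowing commits in a list would only save the second loop.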
>> We could store oids, or we could store commits themselves, for example:
>>
>> if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
>> offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
>> num_generation_data_overflows++;
>>
>> ALLOC_GROW(ctx->gdov_commits.list, ctx->gdov_commits.nr + 1, ctx->gdov_commits.alloc);
>> ctx->gdov_commits.list[ctx->gdov_commits.nr] = c;
>> ctx->gdov_commits.nr++;
>> }
>>
>> Though in the above proposal we could get rid of `num_generation_data_overflows`,
>> as it should be the same as `ctx->gdov_commits.nr`.
>>
>> I have called the extra commit list member of write_commit_graph_context
>> `gdov_commits`, but perhaps a better name would be `commits_gen_v2_overflow`,
>> or similar more descriptive name.
[...]
>>> @@ -741,4 +742,47 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
>>> )
>>> '
>>>
>>> +test_commit_with_date() {
>>> + file="$1.t" &&
>>> + echo "$1" >"$file" &&
>>> + git add "$file" &&
>>> + GIT_COMMITTER_DATE="$2" GIT_AUTHOR_DATE="$2" git commit -m "$1"
>>> + git tag "$1"
>>> +}
>>
>> Here we add a helper function. All right.
>>
>> I wonder though if it wouldn't be a better idea to add `--date <date>`
>> option to the test_commit() function in test-lib-functions.sh (which
>> option would set GIT_COMMITTER_DATE and GIT_AUTHOR_DATE, and also
>> set notick=yes).
>>
>
> Yes, that's a better idea - I didn't know how to change test_commit()
> well enough to tinker with what's working.
>
>> For example:
>>
>> diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
>> index f1ae935fee..a1f9a2b09b 100644
>> --- a/t/test-lib-functions.sh
>> +++ b/t/test-lib-functions.sh
>> @@ -202,6 +202,12 @@ test_commit () {
>> --signoff)
>> signoff="$1"
>> ;;
>> + --date)
>> + notick=yes
>> + GIT_COMMITTER_DATE="$2"
>> + GIT_AUTHOR_DATE="$2"
>> + shift
>> + ;;
>> -C)
>> indir="$2"
>> shift
Note however that while I have followed the example of other options
(namely '-C <directory>'), I have not actually tested this proposed
implementation in tests; I have only checked that it looks like it works
OK.
[...]
>>> +test_expect_success 'overflow corrected commit date offset' '
>>> + objdir=".git/objects" &&
>>> + UNIX_EPOCH_ZERO="1970-01-01 00:00 +0000" &&
>>> + FUTURE_DATE="@2147483646 +0000" &&
>>
>> It is a bit funny to see UNIX_EPOCH_ZERO spelled one way, and
>> FUTURE_DATE other way.
>>
>> Wouldn't it be more readable to use UNIX_EPOCH_ZERO="@0 +0000"?
>
> It would. For some reason I couldn't figure out the valid format for
> this. Changed.
Well, if "@2147483646 +0000" works (i.e. "@<Unix epoch/timestamp> <offset>"),
why wouldn't the same work for timestamp 0, i.e. "@0 +0000"?
[...]
>>> +graph_git_behavior 'overflow corrected commit date offset' repo left right
>>
>> All right, here we compare the Git behavior with the commit-graph to the
>> behavior without it... however I think that those two tests really
>> should have distinct (different) test names. Currently they both use
>> 'overflow corrected commit date offset'.
>>
>
> Following the earlier tests, the first test could be "set up and verify
> repo with generation data overflow chunk" and the git behavior test can
> be "generation data overflow chunk repo"
First is OK, the second could possibly be improved but is all right.
[...]
>> Though to reduce "noise" in this patch, the rename of run_three_modes()
>> to run_all_modes() and test_three_modes() to test_all_modes() could have
>> been done in a separate preparatory patch. It would be pure refactoring
>> patch, without introducing any new functionality.
>>
>
> Sure, that makes sense to me - this patch is over 200 lines long
> already.
Thanks in advance.
Best,
--
Jakub Narębski
* [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does
2020-10-07 14:09 ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
` (6 preceding siblings ...)
2020-10-07 14:09 ` [PATCH v4 07/10] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09 ` Abhishek Kumar via GitGitGadget
2020-11-01 0:55 ` Jakub Narębski
2020-10-07 14:09 ` [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
` (3 subsequent siblings)
11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:
1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
Because of the current use of inspecting the current layer for a
chunk_generation_data pointer, the commits in the lower layer will be
interpreted as having very large generation values (commit date plus
offset) compared to the generation numbers in the top layer (topological
level). This violates the expectation that the generation of a parent is
strictly smaller than the generation of a child.
It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.
This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).
When writing the new layer in split commit-graph, we write a GDAT chunk
only if the topmost layer has a GDAT chunk. This guarantees that if a
layer has GDAT chunk, all lower layers must have a GDAT chunk as well.
Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 29 +++++++++++-
commit-graph.h | 1 +
t/t5324-split-commit-graph.sh | 86 +++++++++++++++++++++++++++++++++++
3 files changed, 115 insertions(+), 1 deletion(-)
diff --git a/commit-graph.c b/commit-graph.c
index 71d0b243db..5d15a1399b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -605,6 +605,21 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
return graph_chain;
}
+static void validate_mixed_generation_chain(struct commit_graph *g)
+{
+ int read_generation_data;
+
+ if (!g)
+ return;
+
+ read_generation_data = !!g->chunk_generation_data;
+
+ while (g) {
+ g->read_generation_data = read_generation_data;
+ g = g->base_graph;
+ }
+}
+
struct commit_graph *read_commit_graph_one(struct repository *r,
struct object_directory *odb)
{
@@ -613,6 +628,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
if (!g)
g = load_commit_graph_chain(r, odb);
+ validate_mixed_generation_chain(g);
+
return g;
}
@@ -782,7 +799,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
- if (g->chunk_generation_data) {
+ if (g->chunk_generation_data && g->read_generation_data) {
offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
@@ -2030,6 +2047,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
}
}
+ if (!ctx->write_generation_data && g->chunk_generation_data)
+ ctx->write_generation_data = 1;
+
if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
ctx->new_base_graph = g;
else if (ctx->num_commit_graphs_after != 1)
@@ -2274,6 +2294,7 @@ int write_commit_graph(struct object_directory *odb,
struct commit_graph *g = ctx->r->objects->commit_graph;
while (g) {
+ g->read_generation_data = 1;
g->topo_levels = &topo_levels;
g = g->base_graph;
}
@@ -2300,6 +2321,9 @@ int write_commit_graph(struct object_directory *odb,
g = ctx->r->objects->commit_graph;
+ if (g && !g->chunk_generation_data)
+ ctx->write_generation_data = 0;
+
while (g) {
ctx->num_commit_graphs_before++;
g = g->base_graph;
@@ -2318,6 +2342,9 @@ int write_commit_graph(struct object_directory *odb,
if (ctx->opts)
replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
+
+ if (replace)
+ ctx->write_generation_data = 1;
}
ctx->approx_nr_objects = approximate_object_count();
diff --git a/commit-graph.h b/commit-graph.h
index 19a02001fd..ad52130883 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -64,6 +64,7 @@ struct commit_graph {
struct object_directory *odb;
uint32_t num_commits_in_base;
+ unsigned int read_generation_data;
struct commit_graph *base_graph;
const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 651df89ab2..d0949a9eb8 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -440,4 +440,90 @@ test_expect_success '--split=replace with partial Bloom data' '
verify_chain_files_exist $graphdir
'
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+ mkdir mixed &&
+ graphdir=".git/objects/info/commit-graphs" &&
+ test_oid_cache <<-EOM &&
+ oid_version sha1:1
+ oid_version sha256:2
+ EOM
+ cd "$TRASH_DIRECTORY/mixed" &&
+ git init &&
+ git config core.commitGraph true &&
+ git config gc.writeCommitGraph false &&
+ for i in $(test_seq 3)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git reset --hard commits/1 &&
+ for i in $(test_seq 4 5)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git reset --hard commits/2 &&
+ for i in $(test_seq 6 10)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split &&
+ git reset --hard commits/2 &&
+ git merge commits/4 &&
+ git branch merge/1 &&
+ git reset --hard commits/4 &&
+ git merge commits/6 &&
+ git branch merge/2 &&
+ GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 4 1
+ num_commits: 2
+ chunks: oid_fanout oid_lookup commit_metadata
+ EOF
+ test_cmp expect output &&
+ git commit-graph verify
+'
+
+test_expect_success 'does not write generation data chunk if not present on existing tip' '
+ cd "$TRASH_DIRECTORY/mixed" &&
+ git reset --hard commits/3 &&
+ git merge merge/1 &&
+ git merge commits/5 &&
+ git merge merge/2 &&
+ git branch merge/3 &&
+ git commit-graph write --reachable --split=no-merge &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 4 2
+ num_commits: 3
+ chunks: oid_fanout oid_lookup commit_metadata
+ EOF
+ test_cmp expect output &&
+ git commit-graph verify
+'
+
+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
+ cd "$TRASH_DIRECTORY/mixed" &&
+ git commit-graph write --reachable --split=replace &&
+ test_path_is_file $graphdir/commit-graph-chain &&
+ test_line_count = 1 $graphdir/commit-graph-chain &&
+ verify_chain_files_exist $graphdir &&
+ graph_read_expect 15 &&
+ git commit-graph verify
+'
+
+test_expect_success 'add one commit, write a tip graph' '
+ cd "$TRASH_DIRECTORY/mixed" &&
+ test_commit 11 &&
+ git branch commits/11 &&
+ git commit-graph write --reachable --split &&
+ test_path_is_missing $infodir/commit-graph &&
+ test_path_is_file $graphdir/commit-graph-chain &&
+ ls $graphdir/graph-*.graph >graph-files &&
+ test_line_count = 2 graph-files &&
+ verify_chain_files_exist $graphdir
+'
+
test_done
--
gitgitgadget
* Re: [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does
2020-10-07 14:09 ` [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2020-11-01 0:55 ` Jakub Narębski
2020-11-12 10:01 ` Abhishek Kumar
0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-11-01 0:55 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar
Hi Abhishek,
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Since there are released versions of Git that understand generation
> numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> chunk, the following scenario is possible:
>
> 1. "New" Git writes a commit-graph with the GDAT chunk.
> 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
All right.
>
> Because of the current use of inspecting the current layer for a
> chunk_generation_data pointer, the commits in the lower layer will be
> interpreted as having very large generation values (commit date plus
> offset) compared to the generation numbers in the top layer (topological
> level). This violates the expectation that the generation of a parent is
> strictly smaller than the generation of a child.
I think this paragraph tries too hard to be concise, with the result that
it is less clear than it could be. Perhaps it would be better to separate
the "what-if" from the current behavior.
If each layer of the split commit-graph is treated independently, as was
the case before this commit, with Git inspecting only the current
layer for the chunk_generation_data pointer, commits in the lower layer
(the one with GDAT) would have corrected commit dates as their generation
numbers, while commits in the upper layer would have topological levels
as their generation numbers. Corrected commit dates usually have much
larger values than topological levels. This means that if we take two
commits, one from the upper layer, and one reachable from it in the
lower layer, then the expectation that the generation of a parent is
smaller than the generation of a child would be violated.
>
> It is difficult to expose this issue in a test. Since we _start_ with
> artificially low generation numbers, any commit walk that prioritizes
> generation numbers will walk all of the commits with high generation
> number before walking the commits with low generation number. In all the
> cases I tried, the commit-graph layers themselves "protect" any
> incorrect behavior since none of the commits in the lower layer can
> reach the commits in the upper layer.
I don't quite understand the issue here. Unless all of the following
query commands walk the whole commit graph regardless of what generation
numbers tell them (that is, none of them short-circuits), they should
give different results with and without the commit graph if we take two
commits, one from the lower layer of the split commit graph (with GDAT)
and one from the higher layer (without GDAT), with the lower one
reachable from the higher one.
We have the following query commands that we can check:
$ git merge-base --is-ancestor <lower> <higher>
$ git merge-base --independent <lower> <higher>
$ git tag --contains <tag-to-lower>
$ git tag --merged <tag-to-higher>
$ git branch --contains <branch-to-lower>
$ git branch --merged <branch-to-higher>
The tag and branch queries require those commits to be tagged, or to
have a branch pointing at them, respectively.
Also, shouldn't `git commit-graph verify` fail with split commit graph
where the top layer is created with GIT_TEST_COMMIT_GRAPH_NO_GDAT=1?
Let's assume that we have the following history, with newer commits
shown on top like in `git log --graph --oneline --all`:
              topological   corrected      generation
              level         commit date    number^*
      d       3                            3
      |
   c  |       3                            3
   |  |                                              without GDAT
 ..|..|.....[layer.boundary]........................................
   |  |                                              with GDAT
   |  b       2             1112912113     1112912113
   |  |
   a  |       2             1112912053     1112912053
   | /
   |/
   r          1             1112911993     1112911993
 *) each layer inspected individually.
With such a history, we can for example reach 'a' from 'c', thus
`git merge-base --is-ancestor a c` should return a true value, but
without this commit gen(a) > gen(c), instead of gen(a) <= gen(c).
I use the weaker reachability condition here, but it is the one that
also works for commits outside the commit-graph (and for those whose
generation numbers overflow).
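To make the violated condition concrete, here is a small sketch using
the numbers for 'a' and 'c' from the history above. The struct and the
three functions are hypothetical and only model which value gets picked
as the generation number; they are not the code from commit-graph.c.

```c
#include <stdint.h>

struct commit_info {
	uint32_t topo_level;      /* generation number v1 */
	uint64_t corrected_date;  /* generation number v2, unused without GDAT */
	int layer_has_gdat;       /* does the commit's layer carry a GDAT chunk? */
};

/* Each layer inspected individually: mixes v1 and v2 values. */
uint64_t generation_per_layer(const struct commit_info *c)
{
	return c->layer_has_gdat ? c->corrected_date : c->topo_level;
}

/* Chain-wide decision: fall back to v1 everywhere unless the chain allows v2. */
uint64_t generation_chain_wide(const struct commit_info *c, int chain_reads_gdat)
{
	if (chain_reads_gdat && c->layer_has_gdat)
		return c->corrected_date;
	return c->topo_level;
}

/* Necessary (not sufficient) condition for reaching the ancestor. */
int may_reach(uint64_t gen_ancestor, uint64_t gen_descendant)
{
	return gen_ancestor <= gen_descendant;
}
```

With a = { 2, 1112912053, 1 } and c = { 3, 0, 0 }, the per-layer choice
gives gen(a) = 1112912053 and gen(c) = 3, so may_reach() wrongly rules
out reaching 'a' from 'c'; the chain-wide choice with v2 disabled gives
2 and 3, and the condition holds.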
>
> This issue would manifest itself as a performance problem in this case,
> especially with something like "git log --graph" since the low
> generation numbers would cause the in-degree queue to walk all of the
> commits in the lower layer before allowing the topo-order queue to write
> anything to output (depending on the size of the upper layer).
All right, that's a good explanation.
>
> When writing the new layer in split commit-graph, we write a GDAT chunk
> only if the topmost layer has a GDAT chunk. This guarantees that if a
> layer has GDAT chunk, all lower layers must have a GDAT chunk as well.
>
> Rewriting layers follows similar approach: if the topmost layer below
> the set of layers being rewritten (in the split commit-graph chain)
> exists, and it does not contain GDAT chunk, then the result of rewrite
> does not have GDAT chunks either.
All right, very good explanation; the only minor suggestion would be to
add some 'intro' to the first of those two paragraphs, for example:
Therefore, when writing the new layer in split commit-graph...
Though I am not sure if it is necessary.
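The guarantee described in those two paragraphs (a layer with GDAT never
sits on top of a layer without it) can be phrased as a simple check over
the chain. The sketch below is only an illustration: `struct layer` is a
simplified stand-in for struct commit_graph, and
chain_respects_gdat_invariant() is a hypothetical name, not something
the patch adds.

```c
#include <stddef.h>

struct layer {
	int has_gdat;        /* stands in for chunk_generation_data != NULL */
	struct layer *base;  /* next lower layer, NULL at the bottom */
};

/*
 * Walk from the topmost layer down. Once a layer with GDAT has been
 * seen, every layer below it must have GDAT as well.
 * Returns 1 if the chain is consistent, 0 otherwise.
 */
int chain_respects_gdat_invariant(const struct layer *top)
{
	int seen_gdat = 0;

	for (; top; top = top->base) {
		if (top->has_gdat)
			seen_gdat = 1;
		else if (seen_gdat)
			return 0;
	}
	return 1;
}
```

A chain written by "new" Git with an "old" Git layer on top passes this
check (GDAT-less layers only at the top), while a GDAT layer written on
top of a GDAT-less one would fail it, which is what the write-side
changes are meant to prevent.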
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
> commit-graph.c | 29 +++++++++++-
> commit-graph.h | 1 +
> t/t5324-split-commit-graph.sh | 86 +++++++++++++++++++++++++++++++++++
> 3 files changed, 115 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 71d0b243db..5d15a1399b 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -605,6 +605,21 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
> return graph_chain;
> }
>
> +static void validate_mixed_generation_chain(struct commit_graph *g)
> +{
> + int read_generation_data;
> +
> + if (!g)
> + return;
> +
> + read_generation_data = !!g->chunk_generation_data;
> +
> + while (g) {
> + g->read_generation_data = read_generation_data;
> + g = g->base_graph;
> + }
> +}
All right, this function checks whether the (presumably) topmost layer
has a GDAT chunk, and propagates this status down the layers of the
split commit graph. This is needed because if we have a mixed-generation
commit graph, then for each and every layer we need to use topological
levels as the generation number.
The only minor issue is the name of this function (it does not hint that
it propagates the GDAT status downwards), but I don't have a better
idea, unfortunately. And it does reflect what this function is used for.
> +
> struct commit_graph *read_commit_graph_one(struct repository *r,
> struct object_directory *odb)
> {
> @@ -613,6 +628,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
> if (!g)
> g = load_commit_graph_chain(r, odb);
>
> + validate_mixed_generation_chain(g);
> +
All right, this looks like a good place to put this new check: just
after reading commit-graph chain.
> return g;
> }
>
> @@ -782,7 +799,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> date_low = get_be32(commit_data + g->hash_len + 12);
> item->date = (timestamp_t)((date_high << 32) | date_low);
>
> - if (g->chunk_generation_data) {
> + if (g->chunk_generation_data && g->read_generation_data) {
> offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
All right, instead of simply checking if the current layer has
generation data chunk, we need to also check if the whole graph allows
for it (if there are no mixed-generation layers).
The g->read_generation_data should be filled correctly, because
fill_commit_graph_info() is always preceded by read_commit_graph(), if I
understand it correctly.
>
> if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
> @@ -2030,6 +2047,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
> }
> }
>
> + if (!ctx->write_generation_data && g->chunk_generation_data)
> + ctx->write_generation_data = 1;
> +
This needs more careful examination, and looking at larger context of
those lines.
At this point, unless `--split=replace` option is used, 'g' points to
the bottom layer out of all topmost layers being merged. We know that if
there are GDAT-less layers then these must be top layers, so this means
that we can write GDAT chunk in the result of the merge -- because we
would be replacing all possible GDAT-less layers (and maybe some with
GDAT) with a single layer with the GDAT chunk.
Note that ctx->write_generation_data is set to true unless the
environment variable GIT_TEST_COMMIT_GRAPH_NO_GDAT is true, and that in
write_commit_graph() it is set to false if the topmost layer doesn't
have a GDAT chunk, and to true if the `--split=replace` option is used;
see below.
Looks good to me.
NOTE that this means that GIT_TEST_COMMIT_GRAPH_NO_GDAT prevents
writing the GDAT chunk with generation data v2 unless we are merging layers,
or replacing all of them with a single layer: then it is _ignored_.
Should we clarify this fact in the description of GIT_TEST_COMMIT_GRAPH_NO_GDAT
in t/README? Currently it reads:
GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
commit-graph to be written without generation data chunk.
> if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
> ctx->new_base_graph = g;
> else if (ctx->num_commit_graphs_after != 1)
> @@ -2274,6 +2294,7 @@ int write_commit_graph(struct object_directory *odb,
> struct commit_graph *g = ctx->r->objects->commit_graph;
>
> while (g) {
> + g->read_generation_data = 1;
> g->topo_levels = &topo_levels;
> g = g->base_graph;
> }
All right, when writing the commit graph we want to make use of existing
generation data chunks. This is safe, because when computing generation
numbers for writing we have a separate place to store topological levels
(`topo_levels`), so they would not be mixed with corrected commit dates:
generation numbers v1 and v2 are kept separate.
> @@ -2300,6 +2321,9 @@ int write_commit_graph(struct object_directory *odb,
>
> g = ctx->r->objects->commit_graph;
>
> + if (g && !g->chunk_generation_data)
> + ctx->write_generation_data = 0;
> +
All right, if the current (topmost) layer does not have GDAT, then when
creating a new layer do not write the GDAT chunk either (merging layers and
rewriting the commit-graph is handled separately).
> while (g) {
> ctx->num_commit_graphs_before++;
> g = g->base_graph;
> @@ -2318,6 +2342,9 @@ int write_commit_graph(struct object_directory *odb,
>
> if (ctx->opts)
> replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
> +
> + if (replace)
> + ctx->write_generation_data = 1;
All right, when replacing all layers (`git commit-graph write --split=replace`),
then we can safely write the GDAT chunk.
Note however that here we don't take into account the value of the
environment variable GIT_TEST_COMMIT_GRAPH_NO_GDAT. Which maybe is what
we want...
> }
>
> ctx->approx_nr_objects = approximate_object_count();
> diff --git a/commit-graph.h b/commit-graph.h
> index 19a02001fd..ad52130883 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -64,6 +64,7 @@ struct commit_graph {
> struct object_directory *odb;
>
> uint32_t num_commits_in_base;
> + unsigned int read_generation_data;
> struct commit_graph *base_graph;
All right, this new field is here to propagate to each layer the
information whether we can read from the generation number v2 data
chunk.
Though I am not sure whether this field should be added here, and
whether it should be `unsigned int` (we don't have to be that careful
about saving space for this type).
>
> const uint32_t *chunk_oid_fanout;
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 651df89ab2..d0949a9eb8 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -440,4 +440,90 @@ test_expect_success '--split=replace with partial Bloom data' '
> verify_chain_files_exist $graphdir
> '
>
> +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> + mkdir mixed &&
This should probably go just before cd-ing into just created
subdirectory.
> + graphdir=".git/objects/info/commit-graphs" &&
> + test_oid_cache <<-EOM &&
> + oid_version sha1:1
> + oid_version sha256:2
> + EOM
Minor nitpick: Why use "EOM", which is used only twice in the Git test
suite, and not the conventional "EOF" (used at least 4000 times)?
> + cd "$TRASH_DIRECTORY/mixed" &&
The t/README says:
- Don't chdir around in tests. It is not sufficient to chdir to
somewhere and then chdir back to the original location later in
the test, as any intermediate step can fail and abort the test,
causing the next test to start in an unexpected directory. Do so
inside a subshell if necessary.
Though I am not sure if it should apply also to this situation.
> + git init &&
> + git config core.commitGraph true &&
> + git config gc.writeCommitGraph false &&
All right.
> + for i in $(test_seq 3)
> + do
> + test_commit $i &&
> + git branch commits/$i || return 1
> + done &&
> + git reset --hard commits/1 &&
> + for i in $(test_seq 4 5)
> + do
> + test_commit $i &&
> + git branch commits/$i || return 1
> + done &&
> + git reset --hard commits/2 &&
> + for i in $(test_seq 6 10)
> + do
> + test_commit $i &&
> + git branch commits/$i || return 1
> + done &&
> + git commit-graph write --reachable --split &&
Is there a reason why we do not check just written commit-graph file
with `test-tool read-graph >output-layer-1`?
> + git reset --hard commits/2 &&
> + git merge commits/4 &&
Shouldn't we use `test_merge` instead of `git merge`; I am not sure when
to use one or the other?
> + git branch merge/1 &&
> + git reset --hard commits/4 &&
> + git merge commits/6 &&
> + git branch merge/2 &&
It would be nice to have ASCII-art of the history (of the graph of
revisions) created here for subsequent tests:
/- 6 <-- 7 <-- 8 <-- 9 <-- 10*
/ \-\
/ \
1 <-- 2 <-- 3* \--\
| \ \
| \-----\ \
\ \ \
\-- 4*<------ M/1 M/2
|\ /
| \-- 5* /
\ /
\------------/
* - 1st layer
Though I am not sure if what I have created is readable; I think a
better way to draw this graph is possible, for example:
/- 3*
/
/
1 <------ 2 <---- 6 <-- 7 <-- 8 <-- 9 <-- 10*
\ \ \
\ \ \
\ \ \
\- 4* <-- M/1 \
|\ \
| \------------- M/2
\
\---- 5*
Edit: as I see the history gets even more complicated, so perhaps
ASCII-art diagram of the history with layers marked would be too
complicated, and wouldn't bring much.
Why do we need such shape of the history in the repository?
> + GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
> + test-tool read-graph >output &&
> + cat >expect <<-EOF &&
> + header: 43475048 1 $(test_oid oid_version) 4 1
> + num_commits: 2
> + chunks: oid_fanout oid_lookup commit_metadata
> + EOF
> + test_cmp expect output &&
All right, we check that we have 2 commits, and that there is no GDAT
chunk.
> + git commit-graph verify
All right, we verify commit-graph as a whole (both layers).
> +'
> +
> +test_expect_success 'does not write generation data chunk if not present on existing tip' '
Hmmm... I wonder if we can come up with a better name for this test;
for example should it be "does not write" or "do not write"?
> + cd "$TRASH_DIRECTORY/mixed" &&
> + git reset --hard commits/3 &&
> + git merge merge/1 &&
> + git merge commits/5 &&
> + git merge merge/2 &&
> + git branch merge/3 &&
The commit graph gets complicated, so it would not be easy to visualize
it with ASCII-art diagram without any crossed lines. Maybe `git log
--graph --oneline --all` would help:
* (merge/3) Merge branch 'merge/2'
|\
| * (merge/2) Merge branch 'commits/6'
| |\
* | \ Merge branch 'commits/5'
|\ \ \
| * | | (commits/5) 5
| |/ /
* | | Merge branch 'merge/1'
|\ \ \
| * | | (merge/1) Merge branch 'commits/4'
| |\| |
| | * | (commits/4) 4
* | | | (commits/3) 3
|/ / /
| | | * (commits/10) 10
| | | * (commits/9) 9
| | | * (commits/8) 8
| | | * (commits/7) 7
| | |/
| | * (commits/6) 6
| |/
|/|
* | (commits/2) 2
|/
* (commits/1) 1
> + git commit-graph write --reachable --split=no-merge &&
> + test-tool read-graph >output &&
> + cat >expect <<-EOF &&
> + header: 43475048 1 $(test_oid oid_version) 4 2
> + num_commits: 3
> + chunks: oid_fanout oid_lookup commit_metadata
> + EOF
> + test_cmp expect output &&
> + git commit-graph verify
All right, so here we check that if we have a layer without GDAT at the top,
and we request not to merge layers (so a new layer is created), then
the new layer also does not have the GDAT chunk (and has 3 commits).
Minor nitpick: shouldn't those tests be indented?
> +'
> +
> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> + cd "$TRASH_DIRECTORY/mixed" &&
> + git commit-graph write --reachable --split=replace &&
> + test_path_is_file $graphdir/commit-graph-chain &&
> + test_line_count = 1 $graphdir/commit-graph-chain &&
> + verify_chain_files_exist $graphdir &&
All right, this checks that we have a split commit-graph chain that
consists of a single layer, and that the commit-graph file for this
single layer exists.
> + graph_read_expect 15 &&
Shouldn't we use `test-tool read-graph` to check whether generation_data
chunk is present... ah, sorry, I have realized that after previous
patches `graph_read_expect 15` implicitly checks the latter, because in
its use of `test-tool read-graph` it does expect generation_data chunk.
So we use `test-tool read-graph` manually to check that generation_data
chunk is absent, and we use graph_read_expect to check that it is
present (and in both cases that the number of commits matches). I
wonder if it would be possible to simplify that...
> + git commit-graph verify
All right.
> +'
> +
> +test_expect_success 'add one commit, write a tip graph' '
> + cd "$TRASH_DIRECTORY/mixed" &&
> + test_commit 11 &&
> + git branch commits/11 &&
> + git commit-graph write --reachable --split &&
> + test_path_is_missing $infodir/commit-graph &&
> + test_path_is_file $graphdir/commit-graph-chain &&
> + ls $graphdir/graph-*.graph >graph-files &&
> + test_line_count = 2 graph-files &&
> + verify_chain_files_exist $graphdir
> +'
What is it meant to test? That adding a single commit to a 15-commit
commit-graph file in split mode does not result in layers merging, and
actually adds a new layer: we check that we have exactly two layers and
that they are all OK.
We don't check here that the newly created top layer commit-graph does
have the GDAT chunk, as it should if the top layer (in this case the only
layer) has the GDAT chunk.
> +
> test_done
One test we are missing is testing that merging layers is done
correctly, namely that if we are merging layers in split commit-graph
file, and the layer below the ones we are merging lacks GDAT chunk, then
the result of the merge should also be without GDAT chunk. This would
require at least two GDAT-less layers in a setup.
I'm not sure how difficult writing such test should be.
Best,
--
Jakub Narębski
* Re: [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does
2020-11-01 0:55 ` Jakub Narębski
@ 2020-11-12 10:01 ` Abhishek Kumar
2020-11-13 9:59 ` Jakub Narębski
0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-12 10:01 UTC (permalink / raw)
To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee
On Sun, Nov 01, 2020 at 01:55:11AM +0100, Jakub Narębski wrote:
> Hi Abhishek,
>
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > Since there are released versions of Git that understand generation
> > numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> > chunk, the following scenario is possible:
> >
> > 1. "New" Git writes a commit-graph with the GDAT chunk.
> > 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
>
> All right.
>
> >
> > Because of the current use of inspecting the current layer for a
> > chunk_generation_data pointer, the commits in the lower layer will be
> > interpreted as having very large generation values (commit date plus
> > offset) compared to the generation numbers in the top layer (topological
> > level). This violates the expectation that the generation of a parent is
> > strictly smaller than the generation of a child.
>
> I think this paragraphs tries too much to be concise, with the result it
> is less clear than it could be. Perhaps it would be better to separate
> "what-if" from the current behavior.
>
> If each layer of split commit-graph is treated independently, as it
> were the case before this commit, with Git inspecting only the current
> layer for chunk_generation_data pointer, commits in the lower layer
> (one with GDAT) would have corrected commit date as their generation
> number, while commits in the upper layer would have topological levels
> as their generation. Corrected commit dates have usually much larger
> values than topological levels. This means that if we take two
> commits, one from the upper layer, and one reachable from it in the
> lower layer, then the expectation that the generation of a parent is
> smaller than the generation of a child would be violated.
>
Thanks, that's better.
> >
> > It is difficult to expose this issue in a test. Since we _start_ with
> > artificially low generation numbers, any commit walk that prioritizes
> > generation numbers will walk all of the commits with high generation
> > number before walking the commits with low generation number. In all the
> > cases I tried, the commit-graph layers themselves "protect" any
> > incorrect behavior since none of the commits in the lower layer can
> > reach the commits in the upper layer.
>
> I don't quite understand the issue here. Unless none of the following
> query commands short-circuit and all walk the commit graph regardless of
> what generation numbers tell them, they should give different results
> with and without the commit graph, if we take two commits one from lower
> layer of split commit graph with GDAT, and one commit from the higher
> layer without GDAT, one lower reachable from the other higher.
>
> We have the following query commands that we can check:
> $ git merge-base --is-ancestor <lower> <higher>
> $ git merge-base --independent <lower> <higher>
>
> $ git tag --contains <tag-to-lower>
> $ git tag --merged <tag-to-higher>
> $ git branch --contains <branch-to-lower>
> $ git branch --merged <branch-to-higher>
>
> The second set of queries require for those commits to be tagged, or
> have branch pointing at them, respectively.
>
> Also, shouldn't `git commit-graph verify` fail with split commit graph
> where the top layer is created with GIT_TEST_COMMIT_GRAPH_NO_GDAT=1?
>
> Let's assume that we have the following history, with newer commits
> shown on top like in `git log --graph --oneline --all`:
>
>                 topological     corrected         generation
>                 level           commit date       number^*
>
>       d         3                                 3
>       |
>    c  |         3                                 3
>    |  |                                                         without GDAT
>  ..|..|.....[layer.boundary]........................................
>    |  |                                                         with GDAT
>    |  b         2               1112912113        1112912113
>    |  |
>    a  |         2               1112912053        1112912053
>    | /
>    |/
>    r            1               1112911993        1112911993
>
> *) each layer inspected individually.
>
> With such history, we can for example reach 'a' from 'c', thus
> `git merge-base --is-ancestor a c` should return true value, but
> without this commit gen(a) > gen(c), instead of gen(a) <= gen(c);
> I use here weaker reachability condition, but the one that works
> also for commits outside the commit-graph (and those for which
> generation numbers overflow).
>
The original explanation was given by Dr. Stolee and he might not have
thought exhaustively about the issue.
In any case, your explanation and the history make sense to me. I will
try to add a test and report back to the mailing list if something goes
wrong.
Thank you for clarifying in such detail.
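
As a sanity check of the example, the generation cutoff can be sketched
in C. This is only an illustration: cutoff_rules_out() and
check_example() are hypothetical helpers, not functions in Git, and the
numbers are the ones given for commits 'a' and 'c' in the table above.

```c
#include <assert.h>
#include <stdint.h>

/* Values for commits 'a' and 'c', taken from the table above. */
#define A_TOPO_LEVEL     2
#define A_CORRECTED_DATE 1112912053
#define C_TOPO_LEVEL     3

/*
 * Generation-based cutoff: if gen(ancestor) > gen(descendant), the walk
 * can stop early, because 'ancestor' cannot be reachable from
 * 'descendant'. Returns 1 when the cutoff rules reachability out and 0
 * when the history still has to be walked.
 */
int cutoff_rules_out(uint64_t gen_ancestor, uint64_t gen_descendant)
{
	return gen_ancestor > gen_descendant;
}

/* Walks through the example; aborts if the numbers behave differently. */
void check_example(void)
{
	/* Layers inspected individually: 'a' is wrongly ruled out. */
	assert(cutoff_rules_out(A_CORRECTED_DATE, C_TOPO_LEVEL));
	/* Topological levels for the whole chain: the walk goes on. */
	assert(!cutoff_rules_out(A_TOPO_LEVEL, C_TOPO_LEVEL));
}
```

So with mixed generation numbers the cutoff would claim that 'a' is not
reachable from 'c', although it is; with a single kind of generation
number across the chain the cutoff does not fire.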
> >
> > This issue would manifest itself as a performance problem in this case,
> > especially with something like "git log --graph" since the low
> > generation numbers would cause the in-degree queue to walk all of the
> > commits in the lower layer before allowing the topo-order queue to write
> > anything to output (depending on the size of the upper layer).
>
> All right, that's good explanation.
>
> ...
>
> > @@ -2030,6 +2047,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
> > }
> > }
> >
> > + if (!ctx->write_generation_data && g->chunk_generation_data)
> > + ctx->write_generation_data = 1;
> > +
>
> This needs more careful examination, and looking at larger context of
> those lines.
>
> At this point, unless `--split=replace` option is used, 'g' points to
> the bottom layer out of all topmost layers being merged. We know that if
> there are GDAT-less layers then these must be top layers, so this means
> that we can write GDAT chunk in the result of the merge -- because we
> would be replacing all possible GDAT-less layers (and maybe some with
> GDAT) with a single layer with the GDAT chunk.
>
> The ctx->write_generation_data is set to true unless environment
> variable GIT_TEST_COMMIT_GRAPH_NO_GDAT is true, and that in
> write_commit_graph() it would be set to false if topmost layer doesn't
> have GDAT chunk, and to true if `--split=replace` option is used; see
> below.
>
> Looks good to me.
>
>
> NOTE that this means that GIT_TEST_COMMIT_GRAPH_NO_GDAT prevents from
> writing GDAT chunk with generation data v2 unless we are merging layers,
> or replacing all of them with a single layer: then it is _ignored_.
>
> Should we clarify this fact in the description of GIT_TEST_COMMIT_GRAPH_NO_GDAT
> in t/README? Currently it reads:
>
> GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
> commit-graph to be written without generation data chunk.
I think it's better to *not* write generation data chunk if
GIT_TEST_COMMIT_GRAPH_NO_GDAT is set even though all GDAT-less layers
are merged, that is:
if (!ctx->write_generation_data &&
g->chunk_generation_data &&
!git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
ctx->write_generation_data = 1;
With this change, we would have a method to force-write commit-graph
without generation data chunk regardless of the shape of split
commit-graph files.
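
To compare the two behaviours side by side, here is a rough sketch in C
of how the decision could be modelled. `struct gdat_inputs` and
should_write_gdat() are hypothetical and only summarize the checks
discussed in this thread; the real logic is spread over
write_commit_graph() and split_graph_merge_strategy(). The
`honor_env_when_merging` argument selects between the patch as posted
(0) and the variant proposed above (1).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of the inputs that influence the decision. */
struct gdat_inputs {
	int env_no_gdat;  /* GIT_TEST_COMMIT_GRAPH_NO_GDAT is true */
	int tip_has_gdat; /* topmost existing layer has GDAT (1 if no chain) */
	int replace;      /* --split=replace was requested */
	int merging;      /* layers get merged in split_graph_merge_strategy() */
	int g_has_gdat;   /* the layer 'g' inspected during the merge has GDAT */
};

int should_write_gdat(const struct gdat_inputs *in, int honor_env_when_merging)
{
	int write_gdat;

	assert(in != NULL);

	/* Default: write GDAT unless the test knob forbids it. */
	write_gdat = !in->env_no_gdat;

	/* Do not put a layer with GDAT on top of a GDAT-less tip. */
	if (!in->tip_has_gdat)
		write_gdat = 0;

	/* Replacing the whole chain: the knob is not consulted here. */
	if (in->replace)
		write_gdat = 1;

	/* Merging layers: re-enable GDAT, optionally honoring the knob. */
	if (in->merging && !write_gdat && in->g_has_gdat &&
	    !(honor_env_when_merging && in->env_no_gdat))
		write_gdat = 1;

	return write_gdat;
}
```

Note that in this model the proposed variant changes only the merge
case; `--split=replace` would still write the GDAT chunk even with the
knob set.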
>
> ...
>
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 19a02001fd..ad52130883 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -64,6 +64,7 @@ struct commit_graph {
> > struct object_directory *odb;
> >
> > uint32_t num_commits_in_base;
> > + unsigned int read_generation_data;
> > struct commit_graph *base_graph;
>
> All right, this new field is here to propagate to each layer the
> information whether we can read from the generation number v2 data
> chunk.
>
> Though I am not sure whether this field should be added here, and
> whether it should be `unsigned int` (we don't have to be that careful
> about saving space for this type).
>
I cannot think of a more appropriate struct than `struct commit_graph`.
Any particular suggestions?
> >
> > const uint32_t *chunk_oid_fanout;
> > diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> > index 651df89ab2..d0949a9eb8 100755
> > --- a/t/t5324-split-commit-graph.sh
> > +++ b/t/t5324-split-commit-graph.sh
> > @@ -440,4 +440,90 @@ test_expect_success '--split=replace with partial Bloom data' '
> > verify_chain_files_exist $graphdir
> > '
> >
> > +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> > + mkdir mixed &&
>
> This should probably go just before cd-ing into just created
> subdirectory.
>
> > + graphdir=".git/objects/info/commit-graphs" &&
> > + test_oid_cache <<-EOM &&
> > + oid_version sha1:1
> > + oid_version sha256:2
> > + EOM
>
> Minor nitpick: Why use "EOM", which is used only twice in the Git test
> suite, and not the conventional "EOF" (used at least 4000 times)?
Right, both instances of "EOM" are actually my own. I looked up some
test script for oid cache that did use EOM when I first wrote the tests
but it's changed now. Will replace.
>
> > + cd "$TRASH_DIRECTORY/mixed" &&
>
> The t/README says:
>
> - Don't chdir around in tests. It is not sufficient to chdir to
> somewhere and then chdir back to the original location later in
> the test, as any intermediate step can fail and abort the test,
> causing the next test to start in an unexpected directory. Do so
> inside a subshell if necessary.
>
> Though I am not sure if it should apply also to this situation.
While I cannot avoid changing directory, using a subshell would be best
to avoid causing the later tests to start in unexpected directories.
>
> > + git init &&
> > + git config core.commitGraph true &&
> > + git config gc.writeCommitGraph false &&
>
> All right.
>
> > + for i in $(test_seq 3)
> > + do
> > + test_commit $i &&
> > + git branch commits/$i || return 1
> > + done &&
> > + git reset --hard commits/1 &&
> > + for i in $(test_seq 4 5)
> > + do
> > + test_commit $i &&
> > + git branch commits/$i || return 1
> > + done &&
> > + git reset --hard commits/2 &&
> > + for i in $(test_seq 6 10)
> > + do
> > + test_commit $i &&
> > + git branch commits/$i || return 1
> > + done &&
> > + git commit-graph write --reachable --split &&
>
> Is there a reason why we do not check just written commit-graph file
> with `test-tool read-graph >output-layer-1`?
We could check the written commit-graph file at this point, but that
would be the same as the existing tests above.
>
> > + git reset --hard commits/2 &&
> > + git merge commits/4 &&
>
> Shouldn't we use `test_merge` instead of `git merge`; I am not sure when
> to use one or the other?
`test_merge` is used in 26 places whereas `git merge` is used in over a
thousand places. `test_merge` is just not widely adopted and this lack
of adoption prevents further use.
>
> > + git branch merge/1 &&
> > + git reset --hard commits/4 &&
> > + git merge commits/6 &&
> > + git branch merge/2 &&
>
> It would be nice to have ASCII-art of the history (of the graph of
> revisions) created here for subsequent tests:
>
>
> /- 6 <-- 7 <-- 8 <-- 9 <-- 10*
> / \-\
> / \
> 1 <-- 2 <-- 3* \--\
> | \ \
> | \-----\ \
> \ \ \
> \-- 4*<------ M/1 M/2
> |\ /
> | \-- 5* /
> \ /
> \------------/
>
> * - 1st layer
>
> Though I am not sure if what I have created is readable; I think a
> better way to draw this graph is possible, for example:
>
> /- 3*
> /
> /
> 1 <------ 2 <---- 6 <-- 7 <-- 8 <-- 9 <-- 10*
> \ \ \
> \ \ \
> \ \ \
> \- 4* <-- M/1 \
> |\ \
> | \------------- M/2
> \
> \---- 5*
>
> Edit: as I see the history gets even more complicated, so perhaps
> ASCII-art diagram of the history with layers marked would be too
> complicated, and wouldn't bring much.
>
> Why do we need such shape of the history in the repository?
We don't need such a complicated shape. Any commit-graph file with 2-3
layers regardless of how commits are related should suffice. Will
simplify.
>
> > + GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
> > + test-tool read-graph >output &&
> > + cat >expect <<-EOF &&
> > + header: 43475048 1 $(test_oid oid_version) 4 1
> > + num_commits: 2
> > + chunks: oid_fanout oid_lookup commit_metadata
> > + EOF
> > + test_cmp expect output &&
>
> All right, we check that we have 2 commits, and that there is no GDAT
> chunk.
>
> > + git commit-graph verify
>
> All right, we verify commit-graph as a whole (both layers).
>
> > +'
> > +
> > +test_expect_success 'does not write generation data chunk if not present on existing tip' '
>
> Hmmm... I wonder if we can come up with a better name for this test;
> for example should it be "does not write" or "do not write"?
That's better.
>
> > + cd "$TRASH_DIRECTORY/mixed" &&
> > + git reset --hard commits/3 &&
> > + git merge merge/1 &&
> > + git merge commits/5 &&
> > + git merge merge/2 &&
> > + git branch merge/3 &&
>
> The commit graph gets complicated, so it would not be easy to visualize
> it with ASCII-art diagram without any crossed lines. Maybe `git log
> --graph --oneline --all` would help:
>
> * (merge/3) Merge branch 'merge/2'
> |\
> | * (merge/2) Merge branch 'commits/6'
> | |\
> * | \ Merge branch 'commits/5'
> |\ \ \
> | * | | (commits/5) 5
> | |/ /
> * | | Merge branch 'merge/1'
> |\ \ \
> | * | | (merge/1) Merge branch 'commits/4'
> | |\| |
> | | * | (commits/4) 4
> * | | | (commits/3) 3
> |/ / /
> | | | * (commits/10) 10
> | | | * (commits/9) 9
> | | | * (commits/8) 8
> | | | * (commits/7) 7
> | | |/
> | | * (commits/6) 6
> | |/
> |/|
> * | (commits/2) 2
> |/
> * (commits/1) 1
>
>
> > + git commit-graph write --reachable --split=no-merge &&
> > + test-tool read-graph >output &&
> > + cat >expect <<-EOF &&
> > + header: 43475048 1 $(test_oid oid_version) 4 2
> > + num_commits: 3
> > + chunks: oid_fanout oid_lookup commit_metadata
> > + EOF
> > + test_cmp expect output &&
> > + git commit-graph verify
>
> All right, so here we check that we have layer without GDAT at the top,
> and we request not to merge layers thus new layer will be created, then
> the new layer also does not have GDAT chunk (and has 3 commits).
>
> Minor nitpick: shouldn't those test be indented?
>
The tests look indented to me and `git diff HEAD^ --check` gives nothing.
Did you mean the lines enclosed by EOF delimiter?
> > +'
> > +
> > +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> > + cd "$TRASH_DIRECTORY/mixed" &&
> > + git commit-graph write --reachable --split=replace &&
> > + test_path_is_file $graphdir/commit-graph-chain &&
> > + test_line_count = 1 $graphdir/commit-graph-chain &&
> > + verify_chain_files_exist $graphdir &&
>
> All right, this checks that we have split commit-graph chain that
> consist of a single layer, and that the commit-graph file for this
> single layer exists.
>
> > + graph_read_expect 15 &&
>
> Shouldn't we use `test-tool read-graph` to check whether generation_data
> chunk is present... ah, sorry, I have realized that after previous
> patches `graph_read_expect 15` implicitly checks the latter, because in
> its use of `test-tool read-graph` it does expect generation_data chunk.
>
> So we use `test-tool read-graph` manually to check that generation_data
> chunk is absent, and we use graph_read_expect to check that it is
> present (and in both cases that the number of commits matches). I
> wonder if it would be possible to simplify that...
>
The problem here is that graph_read_expect(), as defined in
t5324-split-commit-graph, takes two parameters: the number of commits and
the number of base graphs. If the number of base graphs is not passed to
the function call, it's assumed to be zero. Using a default parameter
is tricky; I can fix it by manually adding a zero to each
graph_read_expect() call in an additional preparatory patch.
Any other suggestions are welcome too.
>
> > + git commit-graph verify
>
> All right.
>
> > +'
> > +
> > +test_expect_success 'add one commit, write a tip graph' '
> > + cd "$TRASH_DIRECTORY/mixed" &&
> > + test_commit 11 &&
> > + git branch commits/11 &&
> > + git commit-graph write --reachable --split &&
> > + test_path_is_missing $infodir/commit-graph &&
> > + test_path_is_file $graphdir/commit-graph-chain &&
> > + ls $graphdir/graph-*.graph >graph-files &&
> > + test_line_count = 2 graph-files &&
> > + verify_chain_files_exist $graphdir
> > +'
>
> What it is meant to test? That adding single-commit to a 15 commit
> commit-graph file in split mode does not result in layers merging, and
> actually adds a new layer: we check that we have exactly two layers and
> that they are all OK.
This test is meant to check writing to a split graph in "normal"
conditions (i.e. all existing layers have generation data chunk). The
above tests are special cases as they involve merging layers with mixed
generation number versions.
>
> We don't check here that the newly created top layer commit-graph does
> have GDAT chunk, as it should be if the top layer (in this case the only
> layer) has GDAT chunk.
> > +
> > test_done
>
> One test we are missing is testing that merging layers is done
> correctly, namely that if we are merging layers in split commit-graph
> file, and the layer below the ones we are merging lacks GDAT chunk, then
> the result of the merge should also be without GDAT chunk. This would
> require at least two GDAT-less layers in a setup.
>
> I'm not sure how difficult writing such test should be.
It wouldn't be too hard.
After the last test, I can write some more commits and write split
commit-graph file without GDAT chunk. Then write some more commits
and merge layers using `git commit-graph write --max-commits=<nr>`.
Thanks for pointing this out!
>
> Best,
> --
> Jakub Narębski
Thanks
- Abhishek
* Re: [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does
2020-11-12 10:01 ` Abhishek Kumar
@ 2020-11-13 9:59 ` Jakub Narębski
0 siblings, 0 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-11-13 9:59 UTC (permalink / raw)
To: Abhishek Kumar
Cc: git, Abhishek Kumar via GitGitGadget, Derrick Stolee, Taylor Blau
Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sun, Nov 01, 2020 at 01:55:11AM +0100, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
[...]
>>> It is difficult to expose this issue in a test. Since we _start_ with
>>> artificially low generation numbers, any commit walk that prioritizes
>>> generation numbers will walk all of the commits with high generation
>>> number before walking the commits with low generation number. In all the
>>> cases I tried, the commit-graph layers themselves "protect" any
>>> incorrect behavior since none of the commits in the lower layer can
>>> reach the commits in the upper layer.
>>
>> I don't quite understand the issue here. Unless none of the following
>> query commands short-circuit and all walk the commit graph regardless of
>> what generation numbers tell them, they should give different results
>> with and without the commit graph, if we take two commits one from lower
>> layer of split commit graph with GDAT, and one commit from the higher
>> layer without GDAT, one lower reachable from the other higher.
>>
>> We have the following query commands that we can check:
>> $ git merge-base --is-ancestor <lower> <higher>
>> $ git merge-base --independent <lower> <higher>
>>
>> $ git tag --contains <tag-to-lower>
>> $ git tag --merged <tag-to-higher>
>> $ git branch --contains <branch-to-lower>
>> $ git branch --merged <branch-to-higher>
>>
>> The second set of queries require for those commits to be tagged, or
>> have branch pointing at them, respectively.
>>
>> Also, shouldn't `git commit-graph verify` fail with split commit graph
>> where the top layer is created with GIT_TEST_COMMIT_GRAPH_NO_GDAT=1?
>>
>> Let's assume that we have the following history, with newer commits
>> shown on top like in `git log --graph --oneline --all`:
>>
>>                 topological     corrected         generation
>>                 level           commit date       number^*
>>
>>       d         3                                 3
>>       |
>>    c  |         3                                 3
>>    |  |                                                         without GDAT
>>  ..|..|.....[layer.boundary]........................................
>>    |  |                                                         with GDAT
>>    |  b         2               1112912113        1112912113
>>    |  |
>>    a  |         2               1112912053        1112912053
>>    | /
>>    |/
>>    r            1               1112911993        1112911993
>>
>> *) each layer inspected individually.
>>
>> With such history, we can for example reach 'a' from 'c', thus
>> `git merge-base --is-ancestor a c` should return true value, but
>> without this commit gen(a) > gen(c), instead of gen(a) <= gen(c);
>> I use here weaker reachability condition, but the one that works
>> also for commits outside the commit-graph (and those for which
>> generation numbers overflow).
>>
>
> The original explanation was given by Dr. Stolee and he might not have
> thought exhaustively about the issue.
>
> In any case, your explanation and the history make sense to me. I will
> try to add test and report back to the mailing list if something goes
> wrong.
>
> Thank you for clarifying in such detail.
I don't think you need to add any new test. It should be enough to check
that the first test introduced in this patch, namely 'setup repo for
mixed generation commit-graph-chain', fails without the change in this
patch -- as I think it does. This is because `git commit-graph verify`
should fail with mixed-version split commit-graph with GDAT-less layer
on top without this change.
Reporting this (in anything from one sentence to one paragraph in the
commit message) would be enough, in my opinion.
[...]
>> NOTE that this means that GIT_TEST_COMMIT_GRAPH_NO_GDAT prevents from
>> writing GDAT chunk with generation data v2 unless we are merging layers,
>> or replacing all of them with a single layer: then it is _ignored_.
>>
>> Should we clarify this fact in the description of GIT_TEST_COMMIT_GRAPH_NO_GDAT
>> in t/README? Currently it reads:
>>
>> GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
>> commit-graph to be written without generation data chunk.
>
> I think it's better to *not* write generation data chunk if
> GIT_TEST_COMMIT_GRAPH_NO_GDAT is set even though all GDAT-less layers
> are merged, that is:
>
> if (!ctx->write_generation_data &&
> g->chunk_generation_data &&
> !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
> ctx->write_generation_data = 1;
>
> With this change, we would have a method to force-write commit-graph
> without generation data chunk regardless of the shape of split
> commit-graph files.
While it would be more consistent to always behave like the old Git with
GIT_TEST_COMMIT_GRAPH_NO_GDAT=1, it is in my opinion not necessary.
The only thing we need to test the mixed-version commit-graph chain is
the ability to add a new layer on top without GDAT. It does not matter if
this layer is created from new commits or is the result of a partial or
full merge of layers.
So the alternative to extending what GIT_TEST_COMMIT_GRAPH_NO_GDAT does,
as you propose here, would be simply to improve its description in
t/README, e.g.
GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
commit-graph, or new layer in split commit-graph chain, to be written
without generation data chunk. It does not affect merging of layers.
For me either solution is fine.
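To make the difference between the two alternatives concrete, here is a
minimal sketch in C. It is not the code in commit-graph.c: the function
name, its parameters and the `honor_env_when_merging` switch are made up
for illustration, and the environment variable is passed in as a plain
int rather than read with git_env_bool(). With the switch off it models
my reading of the current behaviour (the variable is ignored when the
layer being built upon has GDAT); with the switch on it models the
condition you quote above.

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * write_generation_data:  what the write context asked for so far
 * layer_has_gdat:         does the layer we build upon carry a GDAT chunk?
 * no_gdat_env:            value of GIT_TEST_COMMIT_GRAPH_NO_GDAT (assumed
 *                         to be already parsed into 0 or 1)
 * honor_env_when_merging: 0 = variable is ignored here,
 *                         1 = variable also suppresses GDAT here
 */
static int gdat_after_merge(int write_generation_data, int layer_has_gdat,
			    int no_gdat_env, int honor_env_when_merging)
{
	if (!write_generation_data && layer_has_gdat &&
	    !(honor_env_when_merging && no_gdat_env))
		write_generation_data = 1;
	return write_generation_data;
}

/* The two alternatives differ only when the variable is set. */
void gdat_after_merge_example(void)
{
	assert(gdat_after_merge(0, 1, 1, 0) == 1); /* variable ignored */
	assert(gdat_after_merge(0, 1, 1, 1) == 0); /* variable honoured */
	assert(gdat_after_merge(0, 1, 0, 1) == 1); /* variable unset */
}
```

Either way a mixed-version chain can still be produced for testing, which
is why I consider the README wording the smaller of the two changes.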
[...]
>>> diff --git a/commit-graph.h b/commit-graph.h
>>> index 19a02001fd..ad52130883 100644
>>> --- a/commit-graph.h
>>> +++ b/commit-graph.h
>>> @@ -64,6 +64,7 @@ struct commit_graph {
>>> struct object_directory *odb;
>>>
>>> uint32_t num_commits_in_base;
>>> + unsigned int read_generation_data;
>>> struct commit_graph *base_graph;
>>
>> All right, this new field is here to propagate to each layer the
>> information whether we can read from the generation number v2 data
>> chunk.
>>
>> Though I am not sure whether this field should be added here, and
>> whether it should be `unsigned int` (we don't have to be that careful
>> about saving space for this type).
>
> I cannot think of a more appropriate struct than `struct commit_graph`.
> Any particular suggestions?
After thinking about it a bit more, I think it is fine to have it here
in `struct commit_graph`; it is better than using a global variable
(which would make the code non-reentrant; not that we use multiple
threads for reading multiple layers of the commit graph, but we might
want to in the future).
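As an illustration of why a per-layer field composes better than a
global, here is a hedged sketch. `struct layer` and
`propagate_read_generation_data()` are hypothetical simplifications, not
the definitions from commit-graph.h; the rule they encode (generation
data is read only if every layer in the chain has a GDAT chunk) is my
understanding of what the previous patch in the series intends.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down layer of a split commit-graph chain. */
struct layer {
	int has_gdat_chunk;                /* stand-in for "GDAT chunk present" */
	unsigned int read_generation_data; /* set for the whole chain at once */
	struct layer *base;                /* next layer down, NULL at the bottom */
};

/*
 * Store in every layer whether generation data may be read.
 * All state lives in the chain itself, so two independent chains
 * can be processed without interfering with each other.
 */
static unsigned int propagate_read_generation_data(struct layer *top)
{
	struct layer *g;
	unsigned int all = 1;

	if (!top)
		return 0;
	for (g = top; g; g = g->base)
		if (!g->has_gdat_chunk)
			all = 0;
	for (g = top; g; g = g->base)
		g->read_generation_data = all;
	return all;
}

/* A GDAT-less top layer disables generation data for the whole chain. */
void propagate_example(void)
{
	struct layer bottom = { 1, 0, NULL };
	struct layer top = { 0, 0, &bottom };

	assert(propagate_read_generation_data(&top) == 0);
	assert(bottom.read_generation_data == 0);
}
```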
[...]
>>> + cd "$TRASH_DIRECTORY/mixed" &&
>>
>> The t/README says:
>>
>> - Don't chdir around in tests. It is not sufficient to chdir to
>> somewhere and then chdir back to the original location later in
>> the test, as any intermediate step can fail and abort the test,
>> causing the next test to start in an unexpected directory. Do so
>> inside a subshell if necessary.
>>
>> Though I am not sure if it should apply also to this situation.
>
> While I cannot avoid changing directory, using a subshell would be best
> to avoid causing the later tests to start in unexpected directories.
A subshell would make it easier to skip individual tests, and a failed
test would no longer break the tests after it by leaving them to start
in an unexpected directory.
[...]
>>> + for i in $(test_seq 3)
>>> + do
>>> + test_commit $i &&
>>> + git branch commits/$i || return 1
>>> + done &&
>>> + git reset --hard commits/1 &&
>>> + for i in $(test_seq 4 5)
>>> + do
>>> + test_commit $i &&
>>> + git branch commits/$i || return 1
>>> + done &&
>>> + git reset --hard commits/2 &&
>>> + for i in $(test_seq 6 10)
>>> + do
>>> + test_commit $i &&
>>> + git branch commits/$i || return 1
>>> + done &&
>>> + git commit-graph write --reachable --split &&
>>
>> Is there a reason why we do not check just written commit-graph file
>> with `test-tool read-graph >output-layer-1`?
>
> We could check the written commit-graph file at this point but it's same
> as existing tests as above.
All right, thanks for the explanation.
>>
>>> + git reset --hard commits/2 &&
>>> + git merge commits/4 &&
>>
>> Shouldn't we use `test_merge` instead of `git merge`; I am not sure when
>> to use one or the other?
>
> `test_merge` is used in 26 places whereas `git merge` is used in over a
> thousand places. `test_merge` is just not widely adopted and this lack
> of adoption prevents further use.
All right then.
>>> + git branch merge/1 &&
>>> + git reset --hard commits/4 &&
>>> + git merge commits/6 &&
>>> + git branch merge/2 &&
>>
>> It would be nice to have ASCII-art of the history (of the graph of
>> revisions) created here for subsequent tests:
>>
>>
>> /- 6 <-- 7 <-- 8 <-- 9 <-- 10*
>> / \-\
>> / \
>> 1 <-- 2 <-- 3* \--\
>> | \ \
>> | \-----\ \
>> \ \ \
>> \-- 4*<------ M/1 M/2
>> |\ /
>> | \-- 5* /
>> \ /
>> \------------/
>>
>> * - 1st layer
>>
>> Though I am not sure if what I have created is readable; I think a
>> better way to draw this graph is possible, for example:
>>
>> /- 3*
>> /
>> /
>> 1 <------ 2 <---- 6 <-- 7 <-- 8 <-- 9 <-- 10*
>> \ \ \
>> \ \ \
>> \ \ \
>> \- 4* <-- M/1 \
>> |\ \
>> | \------------- M/2
>> \
>> \---- 5*
>>
>> Edit: as I see the history gets even more complicated, so perhaps
>> ASCII-art diagram of the history with layers marked would be too
>> complicated, and wouldn't bring much.
>>
>> Why do we need such shape of the history in the repository?
>
> We don't need such a complicated shape. Any commit-graph file with 2-3
> layers regardless of how commits are related should suffice. Will
> simplify.
If you are unsure whether we need this shape of history to properly test
all corner cases of the algorithm, or whether a simple history would be
enough, you can simply compare code coverage. Git's Makefile has the
'coverage' target (which requires the 'gcov' tool).
NOTE: if you are able to run 'make coverage', it can be used
to check if there are any parts of the new code that are not tested.
[...]
>>> + git commit-graph write --reachable --split=no-merge &&
>>> + test-tool read-graph >output &&
>>> + cat >expect <<-EOF &&
>>> + header: 43475048 1 $(test_oid oid_version) 4 2
>>> + num_commits: 3
>>> + chunks: oid_fanout oid_lookup commit_metadata
>>> + EOF
>>> + test_cmp expect output &&
>>> + git commit-graph verify
>>
>> All right, so here we check that we have layer without GDAT at the top,
>> and we request not to merge layers thus new layer will be created, then
>> the new layer also does not have GDAT chunk (and has 3 commits).
>>
>> Minor nitpick: shouldn't those test be indented?
>>
>
> The tests look indented to me and `git diff HEAD^ --check` gives nothing.
>
> Did you mean the lines enclosed by EOF delimiter?
I'm sorry, that was my mistake -- tabs are used for indentation, and the
tab stops in my newsreader made the quoted lines look as if they were
not indented.
[...]
>>> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
>>> + cd "$TRASH_DIRECTORY/mixed" &&
>>> + git commit-graph write --reachable --split=replace &&
>>> + test_path_is_file $graphdir/commit-graph-chain &&
>>> + test_line_count = 1 $graphdir/commit-graph-chain &&
>>> + verify_chain_files_exist $graphdir &&
>>
>> All right, this checks that we have split commit-graph chain that
>> consist of a single layer, and that the commit-graph file for this
>> single layer exists.
>>
>>> + graph_read_expect 15 &&
>>
>> Shouldn't we use `test-tool read-graph` to check whether generation_data
>> chunk is present... ah, sorry, I have realized that after previous
>> patches `graph_read_expect 15` implicitly checks the latter, because in
>> its' use of `test-tool read-graph` it does expect generation_data chunk.
>>
>> So we use `test-tool read-graph` manually to check that generation_data
>> chunk is absent, and we use graph_read_expect to check that it is
>> present (and in both cases that the number of commits matches). I
>> wonder if it would be possible to simplify that...
What I wanted to say is that it might be better to have a second variant
of graph_read_expect() for GDAT-less layers -- but this might be an
unnecessary complication.
> The problem here is graph_read_expect() as defined in
> t5324-split-commit-graph takes two parameters - number of commits and
> number of base graphs. If the number of base graphs is not passed to
> the function call, it's assumed to be zero. Using a default parameter
> is tricky - I can fix it by manually adding a zero to each of
> graph_read_expect() in an additional preparatory patch.
All right, thanks for the explanation. I should have examined
graph_read_expect() in more detail.
> Any other suggestions are welcome too.
[...]
>>> +test_expect_success 'add one commit, write a tip graph' '
>>> + cd "$TRASH_DIRECTORY/mixed" &&
>>> + test_commit 11 &&
>>> + git branch commits/11 &&
>>> + git commit-graph write --reachable --split &&
>>> + test_path_is_missing $infodir/commit-graph &&
>>> + test_path_is_file $graphdir/commit-graph-chain &&
>>> + ls $graphdir/graph-*.graph >graph-files &&
>>> + test_line_count = 2 graph-files &&
>>> + verify_chain_files_exist $graphdir
>>> +'
>>
>> What it is meant to test? That adding single-commit to a 15 commit
>> commit-graph file in split mode does not result in layers merging, and
>> actually adds a new layer: we check that we have exactly two layers and
>> that they are all OK.
>
> This test is meant to check writing to a split graph in "normal"
> conditions (i.e. all existing layers have generation data chunk). The
> above tests are special cases as they involve merging layers with mixed
> generation number versions.
All right.
>>
>> We don't check here that the newly created top layer commit-graph does
>> have GDAT chunk, as it should be if the top layer (in this case the only
>> layer) has GDAT chunk.
>>> +
>>> test_done
>>
>> One test we are missing is testing that merging layers is done
>> correctly, namely that if we are merging layers in split commit-graph
>> file, and the layer below the ones we are merging lacks GDAT chunk, then
>> the result of the merge should also be without GDAT chunk. This would
>> require at least two GDAT-less layers in a setup.
>>
>> I'm not sure how difficult writing such test should be.
>
> It wouldn't be too hard.
>
> After the last test, I can write some more commits and write split
> commit-graph file without GDAT chunk. Then write some more commits
> and merge layers using `git commit-graph write --max-commits=<nr>`.
>
> Thanks for pointing this out!
Good.
Best,
--
Jakub Narębski
* [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common()
2020-10-07 14:09 ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
` (7 preceding siblings ...)
2020-10-07 14:09 ` [PATCH v4 08/10] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09 ` Abhishek Kumar via GitGitGadget
2020-11-03 17:59 ` Jakub Narębski
2020-10-07 14:09 ` [PATCH v4 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
` (2 subsequent siblings)
11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
With corrected commit dates implemented, we no longer have to rely on
commit date as a heuristic in paint_down_to_common().
While Git walks nearly the same number of commits with corrected commit
dates as with commit dates, the process is slower, as for each comparison
we have to access a commit-slab (for corrected committer date) instead of
accessing a struct member (for committer date).
For example, the command `git merge-base v4.8 v4.9` on the linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.
t6404-recursive-merge setups a unique repository where all commits have
the same committer date without well-defined merge-base.
While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.
With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.
While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.
As this has already caused problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
within t6404-recursive-merge.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 14 ++++++++++++++
commit-graph.h | 8 +++++++-
commit-reach.c | 2 +-
t/t6404-recursive-merge.sh | 5 ++++-
4 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 5d15a1399b..3de1933ede 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -705,6 +705,20 @@ int generation_numbers_enabled(struct repository *r)
return !!first_generation;
}
+int corrected_commit_dates_enabled(struct repository *r)
+{
+ struct commit_graph *g;
+ if (!prepare_commit_graph(r))
+ return 0;
+
+ g = r->objects->commit_graph;
+
+ if (!g->num_commits)
+ return 0;
+
+ return g->read_generation_data;
+}
+
struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
{
struct commit_graph *g = r->objects->commit_graph;
diff --git a/commit-graph.h b/commit-graph.h
index ad52130883..d2c048dc64 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -89,13 +89,19 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
struct commit_graph *parse_commit_graph(struct repository *r,
void *graph_map, size_t graph_size);
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+
/*
* Return 1 if and only if the repository has a commit-graph
* file and generation numbers are computed in that file.
*/
int generation_numbers_enabled(struct repository *r);
-struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
enum commit_graph_write_flags {
COMMIT_GRAPH_WRITE_APPEND = (1 << 0),
diff --git a/commit-reach.c b/commit-reach.c
index 20b48b872b..46f5a9e638 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
int i;
timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
- if (!min_generation)
+ if (!min_generation && !corrected_commit_dates_enabled(r))
queue.compare = compare_commits_by_commit_date;
one->object.flags |= PARENT1;
diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
index 332cfc53fd..7055771b62 100755
--- a/t/t6404-recursive-merge.sh
+++ b/t/t6404-recursive-merge.sh
@@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
export GIT_COMMITTER_DATE
test_expect_success 'setup tests' '
+ GIT_TEST_COMMIT_GRAPH=0 &&
+ export GIT_TEST_COMMIT_GRAPH &&
echo 1 >a1 &&
git add a1 &&
GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
'
test_expect_success 'combined merge conflicts' '
- test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+ test_must_fail git merge -m final G
'
test_expect_success 'result contains a conflict' '
@@ -82,6 +84,7 @@ test_expect_success 'result contains a conflict' '
'
test_expect_success 'virtual trees were processed' '
+ # TODO: fragile test, relies on ambigious merge-base resolution
git ls-files --stage >out &&
cat >expect <<-EOF &&
--
gitgitgadget
* Re: [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common()
2020-10-07 14:09 ` [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2020-11-03 17:59 ` Jakub Narębski
2020-11-03 18:19 ` Junio C Hamano
2020-11-20 10:33 ` Abhishek Kumar
0 siblings, 2 replies; 211+ messages in thread
From: Jakub Narębski @ 2020-11-03 17:59 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Taylor Blau, Eric Sunshine, Abhishek Kumar
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With corrected commit dates implemented, we no longer have to rely on
> commit date as a heuristic in paint_down_to_common().
>
> While using corrected commit dates Git walks nearly the same number of
> commits as commit date, the process is slower as for each comparision we
> have to access a commit-slab (for corrected committer date) instead of
> accessing struct member (for committer date).
Something for the future: I wonder if it would be worth it to bring back
generation number from the commit-slab into `struct commit`.
>
> For example, the command `git merge-base v4.8 v4.9` on the linux
> repository walks 167468 commits, taking 0.135s for committer date and
> 167496 commits, taking 0.157s for corrected committer date respectively.
I think it would be a good idea to explicitly refer to the commit that
changed paint_down_to_common() to *not* use generation numbers v1
(topological levels) in cases such as this, namely 091f4cf3 (commit:
don't use generation numbers if not needed). In this commit we have the
following:
This change makes a concrete difference depending on the topology
of the commit graph. For instance, computing the merge-base between
consecutive versions of the Linux kernel has no effect for versions
after v4.9, but 'git merge-base v4.8 v4.9' presents a performance
regression:
v2.18.0: 0.122s
v2.19.0-rc1: 0.547s
HEAD: 0.127s
To determine that this was simply an ordering issue, I inserted
a counter within the while loop of paint_down_to_common() and
found that the loop runs 167,468 times in v2.18.0 and 635,579
times in v2.19.0-rc1.
The times you report (0.135s and 0.157s) are close to the 0.122s / 0.127s
reported in 091f4cf3; the remaining gap is most probably due to
differences in system performance (hardware, operating system, load, etc.).
The number of commits walked with the committer date heuristic, 167,468,
agrees with your result; the 167,496 (+28) walked with corrected commit
dates (generation number v2) is significantly smaller (-468,083) than the
635,579 reported for topological levels (generation number v1).
I suspect that there are cases (with date skew) where corrected commit
date gives better performance than the committer date heuristic, and I am
quite sure that generation number v2 can give better performance in the
cases where paint_down_to_common() uses generation numbers.
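To show what I mean by date skew, here is a small sketch of how a
corrected commit date is derived, using the definition as I understand
it: the maximum of the commit's own committer date and one more than the
largest corrected commit date among its parents. `struct toy_commit` and
`compute_corrected_date()` are illustrative names only, and all dates are
invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy commit with at most two parents; not Git's struct commit. */
struct toy_commit {
	uint64_t date;      /* committer date */
	uint64_t corrected; /* filled in by compute_corrected_date() */
	const struct toy_commit *parents[2];
	size_t nr_parents;
};

/*
 * Parents must have been processed before their children,
 * e.g. by visiting commits in topological order.
 */
static uint64_t compute_corrected_date(struct toy_commit *c)
{
	uint64_t result = c->date;
	size_t i;

	for (i = 0; i < c->nr_parents; i++)
		if (c->parents[i]->corrected + 1 > result)
			result = c->parents[i]->corrected + 1;
	c->corrected = result;
	return result;
}

/* A child committed with a clock running behind its parent. */
void corrected_date_example(void)
{
	struct toy_commit parent = { 160, 0, { NULL, NULL }, 0 };
	struct toy_commit skewed = { 90, 0, { &parent, NULL }, 1 };

	assert(compute_corrected_date(&parent) == 160);
	assert(compute_corrected_date(&skewed) == 161);
}
```

With plain committer dates the skewed child (90) sorts before its parent
(160); with corrected dates the child (161) always sorts after it, which
is the property the walk can rely on.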
.................................................................
Here begins a separate second change, which is not put into a separate
commit because it is fairly tightly connected to the change described
above. It would be a good idea, in my opinion, to add a sentence that
explicitly marks this switch, for example:
This change accidentally broke fragile t6404-recursive-merge test.
t6404-recursive-merge setups a unique repository...
Maybe with s/accidentally/incidentally/.
Or add some other way of connecting those two parts of the commit
message.
> t6404-recursive-merge setups a unique repository where all commits have
> the same committer date without well-defined merge-base.
>
> While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
> date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
> conflicts' merges commits in the order:
> - Merge C with B to form a intermediate commit.
> - Merge the intermediate commit with A.
>
> With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
> use the corrected committer date, which changes the order in which
> commits are merged:
> - Merge A with B to form a intermediate commit.
> - Merge the intermediate commit with C.
>
> While resulting repositories are equivalent, 6404.4 'virtual trees were
> processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
> different merge-bases and thus have different object ids for the
> intermediate commits.
>
> As this has already causes problems (as noted in 859fdc0 (commit-graph:
> define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
> within t6404-recursive-merge.
Very nice explanation.
Perhaps in the future we could make this test less fragile.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
> commit-graph.c | 14 ++++++++++++++
> commit-graph.h | 8 +++++++-
> commit-reach.c | 2 +-
> t/t6404-recursive-merge.sh | 5 ++++-
> 4 files changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 5d15a1399b..3de1933ede 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -705,6 +705,20 @@ int generation_numbers_enabled(struct repository *r)
> return !!first_generation;
> }
>
> +int corrected_commit_dates_enabled(struct repository *r)
> +{
> + struct commit_graph *g;
> + if (!prepare_commit_graph(r))
> + return 0;
> +
> + g = r->objects->commit_graph;
> +
> + if (!g->num_commits)
> + return 0;
> +
> + return g->read_generation_data;
> +}
Very nice abstraction.
Minor issue: I wonder if it would be better to use _available() or
"_present()" rather than _enabled() suffix.
> +
> struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
> {
> struct commit_graph *g = r->objects->commit_graph;
> diff --git a/commit-graph.h b/commit-graph.h
> index ad52130883..d2c048dc64 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -89,13 +89,19 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
> struct commit_graph *parse_commit_graph(struct repository *r,
> void *graph_map, size_t graph_size);
>
> +struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
> +
> /*
> * Return 1 if and only if the repository has a commit-graph
> * file and generation numbers are computed in that file.
> */
> int generation_numbers_enabled(struct repository *r);
>
> -struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
Moving get_bloom_filter_settings() before generation_numbers_enabled()
looks like an accidental change. If not, why is it here?
> +/*
> + * Return 1 if and only if the repository has a commit-graph
> + * file and generation data chunk has been written for the file.
> + */
> +int corrected_commit_dates_enabled(struct repository *r);
>
All right, nice to have documentation for the public function.
> enum commit_graph_write_flags {
> COMMIT_GRAPH_WRITE_APPEND = (1 << 0),
> diff --git a/commit-reach.c b/commit-reach.c
> index 20b48b872b..46f5a9e638 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
> int i;
> timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
>
> - if (!min_generation)
> + if (!min_generation && !corrected_commit_dates_enabled(r))
> queue.compare = compare_commits_by_commit_date;
>
> one->object.flags |= PARENT1;
All right, this is the meat of the first change.
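As a hedged sketch of what this one-line change selects between: the
types and function names below are stand-ins, not Git's prio_queue or its
compare_commits_by_*() functions, and the sign convention is simplified
to a plain ascending comparison. I assume here that the queue's default
comparator orders by generation number first and commit date second.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the two values the comparators look at. */
struct sketch_commit {
	uint64_t date;
	uint64_t generation;
};

typedef int (*commit_cmp_fn)(const struct sketch_commit *,
			     const struct sketch_commit *);

static int cmp_by_date(const struct sketch_commit *a,
		       const struct sketch_commit *b)
{
	return (a->date > b->date) - (a->date < b->date);
}

static int cmp_by_gen_then_date(const struct sketch_commit *a,
				const struct sketch_commit *b)
{
	if (a->generation != b->generation)
		return a->generation < b->generation ? -1 : 1;
	return cmp_by_date(a, b);
}

/*
 * Fall back to the commit date heuristic only when the caller gave no
 * generation cutoff AND corrected commit dates are not available.
 */
static commit_cmp_fn pick_compare(uint64_t min_generation,
				  int corrected_dates_enabled)
{
	if (!min_generation && !corrected_dates_enabled)
		return cmp_by_date;
	return cmp_by_gen_then_date;
}

void pick_compare_example(void)
{
	assert(pick_compare(0, 0) == cmp_by_date);
	assert(pick_compare(0, 1) == cmp_by_gen_then_date);
}
```

Before the patch only `min_generation` took part in the decision, so a
zero cutoff always meant ordering by commit date.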
> diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
> index 332cfc53fd..7055771b62 100755
> --- a/t/t6404-recursive-merge.sh
> +++ b/t/t6404-recursive-merge.sh
> @@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
> export GIT_COMMITTER_DATE
>
> test_expect_success 'setup tests' '
> + GIT_TEST_COMMIT_GRAPH=0 &&
> + export GIT_TEST_COMMIT_GRAPH &&
> echo 1 >a1 &&
> git add a1 &&
> GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
All right, we turn off running this test with commit-graph for the whole
script, not only for a single test. As this is a setup, it would be run
even if we are skipping some tests.
> @@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
> '
>
> test_expect_success 'combined merge conflicts' '
> - test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
> + test_must_fail git merge -m final G
> '
All right, it is no longer necessary to run this specific test with
GIT_TEST_COMMIT_GRAPH=0 as now the whole script is run with this
setting.
>
> test_expect_success 'result contains a conflict' '
> @@ -82,6 +84,7 @@ test_expect_success 'result contains a conflict' '
> '
>
> test_expect_success 'virtual trees were processed' '
> + # TODO: fragile test, relies on ambigious merge-base resolution
> git ls-files --stage >out &&
>
> cat >expect <<-EOF &&
Good call! It is nice to leave a TODO comment for the future.
Best,
--
Jakub Narębski
* Re: [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common()
2020-11-03 17:59 ` Jakub Narębski
@ 2020-11-03 18:19 ` Junio C Hamano
2020-11-20 10:33 ` Abhishek Kumar
1 sibling, 0 replies; 211+ messages in thread
From: Junio C Hamano @ 2020-11-03 18:19 UTC (permalink / raw)
To: Jakub Narębski
Cc: Abhishek Kumar via GitGitGadget, git, Derrick Stolee,
Taylor Blau, Eric Sunshine, Abhishek Kumar
jnareb@gmail.com (Jakub Narębski) writes:
> I suspect that there are cases (with date skew) where corrected commit
> date gives better performance than committer date heuristics, and I am
> quite sure that generation number v2 can give better performance in case
> where paint_down_to_common() uses generation numbers.
Thanks for a well reasoned review.
>
> .................................................................
>
> Here begins separate second change, which is not put into separate
> commit because it is fairly tightly connected to the change described
> above. It would be good idea, in my opinion, to add a sentence that
> explicitely marks this switch, for example:
>
> This change accidentally broke fragile t6404-recursive-merge test.
> t6404-recursive-merge setups a unique repository...
>
> Maybe with s/accidentaly/incidentally/.
Also "setup" is not a verb. "... sets up a unique repository".
> Or add some other way of connection those two parts of the commit
> messages.
> ...
>> As this has already causes problems (as noted in 859fdc0 (commit-graph:
>> define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
>> within t6404-recursive-merge.
>
> Very nice explanation.
>
> Perhaps in the future we could make this test less fragile.
If "separate second change" is distracting, would it be an option to
fix the test before this step, perhaps?
Thanks.
* Re: [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common()
2020-11-03 17:59 ` Jakub Narębski
2020-11-03 18:19 ` Junio C Hamano
@ 2020-11-20 10:33 ` Abhishek Kumar
1 sibling, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-20 10:33 UTC (permalink / raw)
To: Jakub Narębski
Cc: abhishekkumar8222, git, gitgitgadget, stolee, sunshine
On Tue, Nov 03, 2020 at 06:59:03PM +0100, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With corrected commit dates implemented, we no longer have to rely on
> > commit date as a heuristic in paint_down_to_common().
> >
> > While using corrected commit dates Git walks nearly the same number of
> > commits as commit date, the process is slower as for each comparision we
> > have to access a commit-slab (for corrected committer date) instead of
> > accessing struct member (for committer date).
>
> Something for the future: I wonder if it would be worth it to bring back
> generation number from the commit-slab into `struct commit`.
>
> >
> > For example, the command `git merge-base v4.8 v4.9` on the linux
> > repository walks 167468 commits, taking 0.135s for committer date and
> > 167496 commits, taking 0.157s for corrected committer date respectively.
>
> I think it would be good idea to explicitly refer to the commit that
> changed paint_down_to_common() to *not* use generation numbers v1
> (topological levels) in the cases such as this, namely 091f4cf3 (commit:
> don't use generation numbers if not needed). In this commit we have the
> following:
> ...
>
I have re-arranged the first half of commit message:
091f4cf3 (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.
For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.
While Git walks nearly the same number of commits with corrected commit
dates as with commit dates, the process is slower, as for each comparison
we have to access a commit-slab (for corrected committer date) instead of
accessing a struct member (for committer date).
>
> The times you report (0.135s and 0.157s) are close to 0.122s / 0.127s
> reported in 091f4cf3 - that is most probably because of the differences
> in the system performance (hardware, operating system, load, etc.).
> Numbers of commits walked for the committed date heuristics, that is
> 167,468 agrees with your results; 167,496 (+28) for corrected commit
> date (generation number v2) is significantly smaller (-468,083) than
> 635,579 reported for topological levels (generation number v1).
>
> I suspect that there are cases (with date skew) where corrected commit
> date gives better performance than committer date heuristics, and I am
> quite sure that generation number v2 can give better performance in case
> where paint_down_to_common() uses generation numbers.
>
> .................................................................
>
> Here begins separate second change, which is not put into separate
> commit because it is fairly tightly connected to the change described
> above. It would be good idea, in my opinion, to add a sentence that
> explicitely marks this switch, for example:
>
> This change accidentally broke fragile t6404-recursive-merge test.
> t6404-recursive-merge setups a unique repository...
>
> Maybe with s/accidentaly/incidentally/.
>
Thanks, will add.
> Or add some other way of connection those two parts of the commit
> messages.
> ...
> >
> > +int corrected_commit_dates_enabled(struct repository *r)
> > +{
> > + struct commit_graph *g;
> > + if (!prepare_commit_graph(r))
> > + return 0;
> > +
> > + g = r->objects->commit_graph;
> > +
> > + if (!g->num_commits)
> > + return 0;
> > +
> > + return g->read_generation_data;
> > +}
>
> Very nice abstraction.
>
> Minor issue: I wonder if it would be better to use _available() or
> "_present()" rather than _enabled() suffix.
>
We could, but that breaks conformity with `generation_numbers_enabled()`.
I see both functions as similar in nature: each answers whether the
commit-graph has X, where X is either topological levels or corrected
commit dates.
> > +
> > struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
> > {
> > struct commit_graph *g = r->objects->commit_graph;
> > diff --git a/commit-graph.h b/commit-graph.h
> > index ad52130883..d2c048dc64 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -89,13 +89,19 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
> > struct commit_graph *parse_commit_graph(struct repository *r,
> > void *graph_map, size_t graph_size);
> >
> > +struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
> > +
> > /*
> > * Return 1 if and only if the repository has a commit-graph
> > * file and generation numbers are computed in that file.
> > */
> > int generation_numbers_enabled(struct repository *r);
> >
> > -struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
>
> Moving get_bloom_filter_settings() before generation_numbers_enabled()
> looks like an accidental change. If not, why is it here?
Right, that's an accidental change. I wanted to group
generation_numbers_enabled() and corrected_commit_dates_enabled()
together.
>
> ...
>
> Best,
> --
> Jakub Narębski
Thanks
- Abhishek
* [PATCH v4 10/10] doc: add corrected commit date info
2020-10-07 14:09 ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
` (8 preceding siblings ...)
2020-10-07 14:09 ` [PATCH v4 09/10] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2020-10-07 14:09 ` Abhishek Kumar via GitGitGadget
2020-11-04 1:37 ` Jakub Narębski
2020-11-04 23:37 ` [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date Jakub Narębski
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
11 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-10-07 14:09 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
With generation data chunk and corrected commit dates implemented, let's
update the technical documentation for commit-graph.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
.../technical/commit-graph-format.txt | 21 +++++--
Documentation/technical/commit-graph.txt | 62 ++++++++++++++++---
2 files changed, 69 insertions(+), 14 deletions(-)
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b3b58880b9..08d9026ad4 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
The Git commit graph stores a list of commit OIDs and some associated
metadata, including:
-- The generation number of the commit. Commits with no parents have
- generation number 1; commits with parents have generation number
- one more than the maximum generation number of its parents. We
- reserve zero as special, and can be used to mark a generation
- number invalid or as "not computed".
+- The generation number of the commit.
- The root tree OID.
@@ -86,13 +82,26 @@ CHUNK DATA:
position. If there are more than two parents, the second value
has its most-significant bit on and the other bits store an array
position into the Extra Edge List chunk.
- * The next 8 bytes store the generation number of the commit and
+ * The next 8 bytes store the topological level (generation number v1)
+ of the commit and
the commit time in seconds since EPOCH. The generation number
uses the higher 30 bits of the first 4 bytes, while the commit
time uses the 32 bits of the second 4 bytes, along with the lowest
2 bits of the lowest byte, storing the 33rd and 34th bit of the
commit time.
+ Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes)
+ * This list of 4-byte values store corrected commit date offsets for the
+ commits, arranged in the same order as commit data chunk.
+ * If the corrected commit date offset cannot be stored within 31 bits,
+ the value has its most-significant bit on and the other bits store
+ the position of corrected commit date into the Generation Data Overflow
+ chunk.
+
+ Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
+ * This list of 8-byte values stores the corrected commit dates for commits
+ with corrected commit date offsets that cannot be stored within 31 bits.
+
Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
This list of 4-byte values store the second through nth parents for
all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index f14a7659aa..75f71c4c7b 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
Values 1-4 satisfy the requirements of parse_commit_gently().
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates (generation number v2)
+2. Topological levels (generation number v1)
- * A commit with no parents (a root commit) has generation number one.
+Define "corrected committer date" of a commit recursively as follows:
- * A commit with at least one parent has generation number one more than
- the largest generation number among its parents.
+ * A commit with no parents (a root commit) has corrected committer date
+ equal to its committer date.
-Equivalently, the generation number of a commit A is one more than the
+ * A commit with at least one parent has corrected committer date equal to
+ the maximum of its committer date and one more than the largest corrected
+ committer date among its parents.
+
+ * As a special case, a root commit with timestamp zero has corrected commit
+ date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+ (that is, an uncomputed corrected commit date).
+
+Define the "topological level" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has topological level of one.
+
+ * A commit with at least one parent has topological level one more than
+ the largest topological level among its parents.
+
+Equivalently, the topological level of a commit A is one more than the
length of a longest path from A to a root commit. The recursive definition
is easier to use for computation and observing the following property:
@@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
generation numbers, then we always expand the boundary commit with highest
generation number and can easily detect the stopping condition.
+The properties applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
This property can be used to significantly reduce the time it takes to
walk commits and determine topological relationships. Without generation
numbers, the general heuristic is the following:
@@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
If A and B are commits with commit time X and Y, respectively, and
X < Y, then A _probably_ cannot reach B.
-This heuristic is currently used whenever the computation is allowed to
+In absence of corrected commit dates (for example, old versions of Git or
+mixed generation graph chains),
+this heuristic is currently used whenever the computation is allowed to
violate topological relationships due to clock skew (such as "git log"
with default order), but is not used when the topological order is
required (such as merge base calculations, "git log --graph").
@@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
generation number and walk until reaching commits with known generation
number.
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
in the commit-graph file. If a commit-graph file was written by a version
of Git that did not compute generation numbers, then those commits will
have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,7 +115,7 @@ fully-computed generation numbers. Using strict inequality may result in
walking a few extra commits, but the simplicity in dealing with commits
with generation number *_INFINITY or *_ZERO is valuable.
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
+We use the macro GENERATION_NUMBER_MAX for commits whose
generation numbers are computed to be at least this value. We limit at
this value since it is the largest value that can be stored in the
commit-graph file using the 30 bits available to generation numbers. This
@@ -267,6 +289,30 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
number of commits) could be extracted into config settings for full
flexibility.
+## Handling Mixed Generation Number Chains
+
+With the introduction of generation number v2 and generation data chunk, the
+following scenario is possible:
+
+1. "New" Git writes a commit-graph with the corrected commit dates.
+2. "Old" Git writes a split commit-graph on top without corrected commit dates.
+
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
+levels of the higher layer. For this reason, Git inspects each layer to
+see if any layer is missing corrected commit dates. In such a case, Git
+only uses topological level
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
+guarantees that if a layer has corrected commit dates, all lower layers
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
+commit dates. Instead, the new layer will have corrected commit dates if and
+only if all existing layers below the new layer have corrected commit dates.
+
## Deleting graph-{hash} files
After a new tip file is written, some `graph-{hash}` files may no longer
--
gitgitgadget
* Re: [PATCH v4 10/10] doc: add corrected commit date info
2020-10-07 14:09 ` [PATCH v4 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2020-11-04 1:37 ` Jakub Narębski
2020-11-21 6:30 ` Abhishek Kumar
0 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-11-04 1:37 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With generation data chunk and corrected commit dates implemented, let's
> update the technical documentation for commit-graph.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Nice.
> ---
> .../technical/commit-graph-format.txt | 21 +++++--
> Documentation/technical/commit-graph.txt | 62 ++++++++++++++++---
> 2 files changed, 69 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index b3b58880b9..08d9026ad4 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -4,11 +4,7 @@ Git commit graph format
> The Git commit graph stores a list of commit OIDs and some associated
> metadata, including:
>
> -- The generation number of the commit. Commits with no parents have
> - generation number 1; commits with parents have generation number
> - one more than the maximum generation number of its parents. We
> - reserve zero as special, and can be used to mark a generation
> - number invalid or as "not computed".
> +- The generation number of the commit.
All right, because we could store both generation number v1 and
generation number v2 in the commit-graph file, and we need to describe
both, the description is now consolidated and in only one place.
>
> - The root tree OID.
>
> @@ -86,13 +82,26 @@ CHUNK DATA:
> position. If there are more than two parents, the second value
> has its most-significant bit on and the other bits store an array
> position into the Extra Edge List chunk.
> - * The next 8 bytes store the generation number of the commit and
> + * The next 8 bytes store the topological level (generation number v1)
> + of the commit and
All right, this is updated information about CDAT chunk.
> the commit time in seconds since EPOCH. The generation number
> uses the higher 30 bits of the first 4 bytes, while the commit
> time uses the 32 bits of the second 4 bytes, along with the lowest
> 2 bits of the lowest byte, storing the 33rd and 34th bit of the
> commit time.
>
> + Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes)
Should we mark this chunk as "[Optional]"? Its absence is not an error.
> + * This list of 4-byte values store corrected commit date offsets for the
> + commits, arranged in the same order as commit data chunk.
> + * If the corrected commit date offset cannot be stored within 31 bits,
> + the value has its most-significant bit on and the other bits store
> + the position of corrected commit date into the Generation Data Overflow
> + chunk.
All right.
> +
> + Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
> + * This list of 8-byte values stores the corrected commit dates for commits
> + with corrected commit date offsets that cannot be stored within 31 bits.
A question: do we store 8-byte / 64-bit corrected commit date *directly*,
or do we store corrected commit date *offset* as 8-byte / 64-bit value?
Perhaps we should add the information that [like the EDGE chunk] it is
present only when necessary, and that it is present only when GDAT chunk
is present (it might be obvious, but it could be better to state
this explicitly).
> +
All right, this is the information about the two new chunks (with the
above-mentioned caveat about the clarity of the description of the
overflow-handling chunk).
> Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> This list of 4-byte values store the second through nth parents for
> all octopus merges. The second parent value in the commit data stores
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index f14a7659aa..75f71c4c7b 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
>
> Values 1-4 satisfy the requirements of parse_commit_gently().
>
> -Define the "generation number" of a commit recursively as follows:
> +There are two definitions of generation number:
> +1. Corrected committer dates (generation number v2)
> +2. Topological levels (generation number v1)
All right.
>
> - * A commit with no parents (a root commit) has generation number one.
> +Define "corrected committer date" of a commit recursively as follows:
>
> - * A commit with at least one parent has generation number one more than
> - the largest generation number among its parents.
> + * A commit with no parents (a root commit) has corrected committer date
> + equal to its committer date.
Minor nitpick: the above point has been accidentally indented one space
more than necessary, and more than in other places. Or maybe
that fixes / unifies the formatting... I am not sure.
>
> -Equivalently, the generation number of a commit A is one more than the
> + * A commit with at least one parent has corrected committer date equal to
> + the maximum of its committer date and one more than the largest corrected
> + committer date among its parents.
> +
> + * As a special case, a root commit with timestamp zero has corrected commit
> + date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
> + (that is, an uncomputed corrected commit date).
All right. Looks good.
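For illustration, the recursive definition quoted above could be sketched in C roughly as follows. This is only a sketch under stated assumptions: `struct toy_commit` and `toy_corrected_date()` are hypothetical names and not Git's actual data structures, and the commits are assumed to be visited parents-first so that every parent already has its value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified commit record; this is not Git's struct commit. */
struct toy_commit {
	uint64_t committer_date;      /* seconds since the epoch */
	uint64_t corrected_date;      /* 0 means "not computed yet" */
	size_t nr_parents;
	struct toy_commit **parents;
};

/*
 * Compute the corrected committer date of `c`, assuming every parent
 * already has its corrected date computed (parents are visited first).
 */
static uint64_t toy_corrected_date(struct toy_commit *c)
{
	uint64_t result = c->committer_date;
	size_t i;

	for (i = 0; i < c->nr_parents; i++) {
		uint64_t parent = c->parents[i]->corrected_date;

		assert(parent != 0); /* parents must be computed first */
		if (parent + 1 > result)
			result = parent + 1;
	}

	/* Special case: a root with timestamp zero gets 1, keeping 0 reserved. */
	if (result == 0)
		result = 1;

	c->corrected_date = result;
	return result;
}
```

With this sketch, a commit whose committer date is older than its parent's (clock skew) ends up with the parent's corrected date plus one, which is what keeps the value monotonic along every path.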
> +
> +Define the "topological level" of a commit recursively as follows:
> +
> + * A commit with no parents (a root commit) has topological level of one.
> +
> + * A commit with at least one parent has topological level one more than
> + the largest topological level among its parents.
> +
All right, this just repeats what was written before, or in other words
moves the existing content lower/later, just with 'generation number'
replaced by 'topological level' (though it might not be obvious from the
patch because of the latter change).
> +Equivalently, the topological level of a commit A is one more than the
> length of a longest path from A to a root commit. The recursive definition
> is easier to use for computation and observing the following property:
>
> @@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
> generation numbers, then we always expand the boundary commit with highest
> generation number and can easily detect the stopping condition.
>
> +The properties applies to both versions of generation number, that is both
> +corrected committer dates and topological levels.
> +
I think it should be "This property" or "The property", not "The
properties"; it is a single property, a single condition.
We can alternatively say "This condition is fulfilled by both versions...",
or "This condition is true for both versions...".
> This property can be used to significantly reduce the time it takes to
> walk commits and determine topological relationships. Without generation
> numbers, the general heuristic is the following:
> @@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
> If A and B are commits with commit time X and Y, respectively, and
> X < Y, then A _probably_ cannot reach B.
>
> -This heuristic is currently used whenever the computation is allowed to
> +In absence of corrected commit dates (for example, old versions of Git or
> +mixed generation graph chains),
> +this heuristic is currently used whenever the computation is allowed to
> violate topological relationships due to clock skew (such as "git log"
> with default order), but is not used when the topological order is
> required (such as merge base calculations, "git log --graph").
All right, this explains when the commit date heuristic is used (which is
less often than before).
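As an aside, the difference between the heuristic and the generation-number property is that the latter is a guarantee, so a walk can prune on it. A minimal C sketch, assuming 64-bit generation numbers; all `toy_`/`TOY_` names are hypothetical and not Git's API, and treating a zero (uncomputed) generation as "inconclusive" is a conservative choice made for this sketch only:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the macros discussed in the documentation. */
#define TOY_GENERATION_NUMBER_ZERO     UINT64_C(0)
#define TOY_GENERATION_NUMBER_INFINITY UINT64_MAX

/*
 * Return 1 when the generation numbers prove that A cannot reach B,
 * and 0 when they are inconclusive and the walk has to continue.
 */
static int toy_cannot_reach(uint64_t gen_a, uint64_t gen_b)
{
	/* Uncomputed generation numbers carry no information. */
	if (gen_a == TOY_GENERATION_NUMBER_ZERO ||
	    gen_b == TOY_GENERATION_NUMBER_ZERO)
		return 0;

	/* Strict inequality: equal values never allow pruning. */
	return gen_a < gen_b;
}
```

A commit outside the commit-graph (generation `TOY_GENERATION_NUMBER_INFINITY`) is never pruned as a starting point, because no value is larger than infinity.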
> @@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
> generation number and walk until reaching commits with known generation
> number.
>
> -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> +We use the macro GENERATION_NUMBER_INFINITY to mark commits not
All right, 64-bit GENERATION_NUMBER_INFINITY = 0xFFFFFFFFFFFFFFFF is a
bit unwieldy...
> in the commit-graph file. If a commit-graph file was written by a version
> of Git that did not compute generation numbers, then those commits will
> have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> @@ -93,7 +115,7 @@ fully-computed generation numbers. Using strict inequality may result in
> walking a few extra commits, but the simplicity in dealing with commits
> with generation number *_INFINITY or *_ZERO is valuable.
>
> -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> +We use the macro GENERATION_NUMBER_MAX for commits whose
This should be
+We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
+topological levels (generation number v1) are computed to be at least this value. We limit at
this value since it is the largest value that can be stored in the
+commit-graph file using the 30 bits available to topological levels. This
We need to use "topological levels" or "generation numbers v1" throughout
the rest of this section.
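To make the 30-bit limit concrete, here is a small C sketch of clamping a topological level and packing it next to the high bits of the commit time, following the CDAT layout described in commit-graph-format.txt. The function and macro names are hypothetical (the `toy_` prefix marks them as such), and the values are kept in host byte order for simplicity:

```c
#include <assert.h>
#include <stdint.h>

/* Largest topological level that fits into 30 bits (value from the docs). */
#define TOY_GENERATION_NUMBER_V1_MAX UINT32_C(0x3FFFFFFF)

/*
 * Topological level of a commit whose largest parent level is
 * `max_parent_level` (pass 0 for a root commit), clamped to the maximum.
 */
static uint32_t toy_topo_level(uint32_t max_parent_level)
{
	assert(max_parent_level <= TOY_GENERATION_NUMBER_V1_MAX);
	if (max_parent_level >= TOY_GENERATION_NUMBER_V1_MAX)
		return TOY_GENERATION_NUMBER_V1_MAX;
	return max_parent_level + 1;
}

/*
 * First 4 bytes of the 8-byte field: the level in the upper 30 bits,
 * bits 33 and 34 of the commit time in the lowest 2 bits.
 */
static uint32_t toy_pack_first_word(uint32_t level, uint64_t commit_time)
{
	return (level << 2) | (uint32_t)((commit_time >> 32) & 0x3);
}
```

Once a level reaches the maximum, all descendants share that value, so comparisons among them stay correct but stop being useful for pruning.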
> generation numbers are computed to be at least this value. We limit at
> this value since it is the largest value that can be stored in the
> commit-graph file using the 30 bits available to generation numbers. This
> @@ -267,6 +289,30 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
> number of commits) could be extracted into config settings for full
> flexibility.
>
All right, I agree that we don't need to write about overflow handling
for storing corrected committer dates (generation number v2) as offsets;
this is something format-specific, and this documentation is more about
using commit-graph data. What is present in commit-graph-format.txt
should be enough information.
Sidenote: I wonder if other Git implementations such as JGit, Dulwich,
Gitoxide (gix), go-git have support for the commit-graph file...
> +## Handling Mixed Generation Number Chains
> +
> +With the introduction of generation number v2 and generation data chunk, the
> +following scenario is possible:
> +
> +1. "New" Git writes a commit-graph with the corrected commit dates.
> +2. "Old" Git writes a split commit-graph on top without corrected commit dates.
> +
> +A naive approach of using the newest available generation number from
> +each layer would lead to violated expectations: the lower layer would
> +use corrected commit dates which are much larger than the topological
> +levels of the higher layer. For this reason, Git inspects each layer to
> +see if any layer is missing corrected commit dates. In such a case, Git
> +only uses topological level
This should end in full stop:
+only uses topological levels.
Or maybe we should expand the last sentence a bit:
+only uses topological levels for generation numbers.
Sidenote: it is a good explanation, even if Git can make use of the
property described below that, by construction, only the topmost layers
might be missing corrected commit dates (so it needs to check only
the top layer).
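The "inspect each layer" rule from the quoted text could be sketched like this. It is a simplified illustration, not the actual implementation: `struct toy_layer` and `toy_chain_has_corrected_dates()` are hypothetical names, and an empty chain is assumed to mean "no corrected commit dates".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one layer of a split commit-graph chain. */
struct toy_layer {
	int has_generation_data;       /* layer carries corrected commit dates */
	const struct toy_layer *base;  /* next lower layer, NULL at the bottom */
};

/*
 * Return 1 only if every layer in the chain has corrected commit dates;
 * otherwise the caller should fall back to topological levels.
 */
static int toy_chain_has_corrected_dates(const struct toy_layer *top)
{
	const struct toy_layer *layer;

	if (!top)
		return 0;

	for (layer = top; layer; layer = layer->base)
		if (!layer->has_generation_data)
			return 0;

	return 1;
}
```

If the writing rule described below holds, checking only the top layer would give the same answer; walking the whole chain simply does not rely on that invariant.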
> +
> +When writing a new layer in split commit-graph, we write corrected commit
> +dates if the topmost layer has corrected commit dates written. This
> +guarantees that if a layer has corrected commit dates, all lower layers
> +must have corrected commit dates as well.
> +
> +When merging layers, we do not consider whether the merged layers had corrected
> +commit dates. Instead, the new layer will have corrected commit dates if and
> +only if all existing layers below the new layer have corrected commit dates.
> +
Perhaps we should explicitly say that when rewriting a split commit-graph
as a single file (`--split=replace`), the newly created single layer
would store corrected commit dates.
> ## Deleting graph-{hash} files
>
> After a new tip file is written, some `graph-{hash}` files may no longer
Best,
--
Jakub Narębski
* Re: [PATCH v4 10/10] doc: add corrected commit date info
2020-11-04 1:37 ` Jakub Narębski
@ 2020-11-21 6:30 ` Abhishek Kumar
0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-21 6:30 UTC (permalink / raw)
To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee
On Wed, Nov 04, 2020 at 02:37:41AM +0100, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With generation data chunk and corrected commit dates implemented, let's
> > update the technical documentation for commit-graph.
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Nice.
>
> > ---
> > .../technical/commit-graph-format.txt | 21 +++++--
> > Documentation/technical/commit-graph.txt | 62 ++++++++++++++++---
> > 2 files changed, 69 insertions(+), 14 deletions(-)
> >
> > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> > index b3b58880b9..08d9026ad4 100644
> > --- a/Documentation/technical/commit-graph-format.txt
> > +++ b/Documentation/technical/commit-graph-format.txt
> > @@ -4,11 +4,7 @@ Git commit graph format
> > The Git commit graph stores a list of commit OIDs and some associated
> > metadata, including:
> >
> > -- The generation number of the commit. Commits with no parents have
> > - generation number 1; commits with parents have generation number
> > - one more than the maximum generation number of its parents. We
> > - reserve zero as special, and can be used to mark a generation
> > - number invalid or as "not computed".
> > +- The generation number of the commit.
>
> All right, because we could store both generation number v1 and
> generation number v2 in the commit-graph file, and we need to describe
> both, the description is now consolidated and in only one place.
>
> >
> > - The root tree OID.
> >
> > @@ -86,13 +82,26 @@ CHUNK DATA:
> > position. If there are more than two parents, the second value
> > has its most-significant bit on and the other bits store an array
> > position into the Extra Edge List chunk.
> > - * The next 8 bytes store the generation number of the commit and
> > + * The next 8 bytes store the topological level (generation number v1)
> > + of the commit and
>
> All right, this is updated information about CDAT chunk.
>
> > the commit time in seconds since EPOCH. The generation number
> > uses the higher 30 bits of the first 4 bytes, while the commit
> > time uses the 32 bits of the second 4 bytes, along with the lowest
> > 2 bits of the lowest byte, storing the 33rd and 34th bit of the
> > commit time.
> >
> > + Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes)
>
> Should we mark this chunk as "[Optional]"? Its absence is not an error.
I think we should mark it as "optional", although "optional" might not
be the best choice of word.
Optional (for me) implies that it is configurable and decided by the end-user
directly. However, it is *conditional*: it depends on the existing commit-graph
file(s) (if any) and the version of Git.
> > + * This list of 4-byte values store corrected commit date offsets for the
> > + commits, arranged in the same order as commit data chunk.
> > + * If the corrected commit date offset cannot be stored within 31 bits,
> > + the value has its most-significant bit on and the other bits store
> > + the position of corrected commit date into the Generation Data Overflow
> > + chunk.
>
> All right.
>
> > +
> > + Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
> > + * This list of 8-byte values stores the corrected commit dates for commits
> > + with corrected commit date offsets that cannot be stored within 31 bits.
>
> A question: do we store 8-byte / 64-bit corrected commit date *directly*,
> or do we store corrected commit date *offset* as 8-byte / 64-bit value?
>
We store the dates directly rather than as 8-byte offsets. Will clarify.
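As a rough sketch of what a reader of the two chunks would do under that description (the overflow chunk holding the full corrected commit date, as in this iteration of the series; the final on-disk format should be checked against commit-graph-format.txt). The names below are hypothetical, and the values are assumed to be already converted from the file's network byte order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The most-significant bit of a 4-byte GDAT value flags an overflow entry. */
#define TOY_GDAT_OVERFLOW_BIT UINT32_C(0x80000000)

/*
 * Decode one commit's corrected commit date from its GDAT value.
 * Returns 0 on success, -1 if the overflow position is out of range.
 */
static int toy_decode_corrected_date(uint32_t gdat_value,
				     uint64_t committer_date,
				     const uint64_t *gdov, size_t gdov_nr,
				     uint64_t *out)
{
	assert(out != NULL);

	if (gdat_value & TOY_GDAT_OVERFLOW_BIT) {
		/* Remaining 31 bits are a position into the overflow chunk. */
		uint32_t pos = gdat_value & UINT32_C(0x7FFFFFFF);

		if (pos >= gdov_nr)
			return -1;
		*out = gdov[pos]; /* full date, stored directly */
		return 0;
	}

	/* Common case: the value is an offset from the committer date. */
	*out = committer_date + gdat_value;
	return 0;
}
```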
> Perhaps we should add the information that [like the EDGE chunk] it is
> present only when necessary, and that it is present only when GDAT chunk
> is present (it might be obvious, but it could be better to state
> this explicitly).
>
It's always better to be explicit. Thanks for the detailed review.
> > +
>
> All right, this is the information about the two new chunks (with the
> above-mentioned caveat about the clarity of the description of the
> overflow-handling chunk).
>
> > Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> > This list of 4-byte values store the second through nth parents for
> > all octopus merges. The second parent value in the commit data stores
> > diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> > index f14a7659aa..75f71c4c7b 100644
> > --- a/Documentation/technical/commit-graph.txt
> > +++ b/Documentation/technical/commit-graph.txt
> > @@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
> >
> > Values 1-4 satisfy the requirements of parse_commit_gently().
> >
> > -Define the "generation number" of a commit recursively as follows:
> > +There are two definitions of generation number:
> > +1. Corrected committer dates (generation number v2)
> > +2. Topological levels (generation number v1)
>
> All right.
>
> >
> > - * A commit with no parents (a root commit) has generation number one.
> > +Define "corrected committer date" of a commit recursively as follows:
> >
> > - * A commit with at least one parent has generation number one more than
> > - the largest generation number among its parents.
> > + * A commit with no parents (a root commit) has corrected committer date
> > + equal to its committer date.
>
> Minor nitpick: the above point has been accidentally indented one space
> more than necessary, and more than in other places. Or maybe
> that fixes / unifies the formatting... I am not sure.
>
That's a force of habit - I like to write markdown with greater
indentation. Should have been indented with one space instead of two.
> >
> > -Equivalently, the generation number of a commit A is one more than the
> > + * A commit with at least one parent has corrected committer date equal to
> > + the maximum of its committer date and one more than the largest corrected
> > + committer date among its parents.
> > +
> > + * As a special case, a root commit with timestamp zero has corrected commit
> > + date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
> > + (that is, an uncomputed corrected commit date).
>
> All right. Looks good.
>
> > +
> > +Define the "topological level" of a commit recursively as follows:
> > +
> > + * A commit with no parents (a root commit) has topological level of one.
> > +
> > + * A commit with at least one parent has topological level one more than
> > + the largest topological level among its parents.
> > +
>
> All right, this just repeats what was written before, or in other words
> moves the existing content lower/later, just with 'generation number'
> replaced by 'topological level' (though it might not be obvious from the
> patch because of the latter change).
>
> > +Equivalently, the topological level of a commit A is one more than the
> > length of a longest path from A to a root commit. The recursive definition
> > is easier to use for computation and observing the following property:
> >
> > @@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
> > generation numbers, then we always expand the boundary commit with highest
> > generation number and can easily detect the stopping condition.
> >
> > +The properties applies to both versions of generation number, that is both
> > +corrected committer dates and topological levels.
> > +
>
> I think it should be "This property" or "The property", not "The
> properties"; it is a single property, a single condition.
>
> We can alternatively say "This condition is fulfilled by both versions...",
> or "This condition is true for both versions...".
>
> > This property can be used to significantly reduce the time it takes to
> > walk commits and determine topological relationships. Without generation
> > numbers, the general heuristic is the following:
> > @@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
> > If A and B are commits with commit time X and Y, respectively, and
> > X < Y, then A _probably_ cannot reach B.
> >
> > -This heuristic is currently used whenever the computation is allowed to
> > +In absence of corrected commit dates (for example, old versions of Git or
> > +mixed generation graph chains),
> > +this heuristic is currently used whenever the computation is allowed to
> > violate topological relationships due to clock skew (such as "git log"
> > with default order), but is not used when the topological order is
> > required (such as merge base calculations, "git log --graph").
>
> All right, this explains when the commit date heuristic is used (which is
> less often than before).
>
> > @@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
> > generation number and walk until reaching commits with known generation
> > number.
> >
> > -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> > +We use the macro GENERATION_NUMBER_INFINITY to mark commits not
>
> All right, 64-bit GENERATION_NUMBER_INFINITY = 0xFFFFFFFFFFFFFFFF is a
> bit unwieldy...
>
> > in the commit-graph file. If a commit-graph file was written by a version
> > of Git that did not compute generation numbers, then those commits will
> > have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> > @@ -93,7 +115,7 @@ fully-computed generation numbers. Using strict inequality may result in
> > walking a few extra commits, but the simplicity in dealing with commits
> > with generation number *_INFINITY or *_ZERO is valuable.
> >
> > -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> > +We use the macro GENERATION_NUMBER_MAX for commits whose
>
> This should be
>
> +We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
> +topological levels (generation number v1) are computed to be at least this value. We limit at
> this value since it is the largest value that can be stored in the
> +commit-graph file using the 30 bits available to topological levels. This
>
> We need to use "topological levels" or "generation numbers v1" throughout
> the rest of this section.
>
> > generation numbers are computed to be at least this value. We limit at
> > this value since it is the largest value that can be stored in the
> > commit-graph file using the 30 bits available to generation numbers. This
> > @@ -267,6 +289,30 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
> > number of commits) could be extracted into config settings for full
> > flexibility.
> >
>
> All right, I agree that we don't need to write about overflow handling
> for storing corrected committer dates (generation number v2) as offsets;
> this is something format-specific, and this documentation is more about
> using commit-graph data. What is present in commit-graph-format.txt
> should be enough information.
>
> Sidenote: I wonder if other Git implementations such as JGit, Dulwich,
> Gitoxide (gix), go-git have support for the commit-graph file...
>
> > +## Handling Mixed Generation Number Chains
> > +
> > +With the introduction of generation number v2 and generation data chunk, the
> > +following scenario is possible:
> > +
> > +1. "New" Git writes a commit-graph with the corrected commit dates.
> > +2. "Old" Git writes a split commit-graph on top without corrected commit dates.
> > +
> > +A naive approach of using the newest available generation number from
> > +each layer would lead to violated expectations: the lower layer would
> > +use corrected commit dates which are much larger than the topological
> > +levels of the higher layer. For this reason, Git inspects each layer to
> > +see if any layer is missing corrected commit dates. In such a case, Git
> > +only uses topological level
>
> This should end in full stop:
>
> +only uses topological levels.
>
> Or maybe we should expand the last sentence a bit:
>
> +only uses topological levels for generation numbers.
>
> Sidenote: it is a good explanation, even if Git can make use of the
> property described below that only topmost layers might be missing
> corrected commit dates by construction (so it needs to check only
> the top layer).
>
> > +
> > +When writing a new layer in split commit-graph, we write corrected commit
> > +dates if the topmost layer has corrected commit dates written. This
> > +guarantees that if a layer has corrected commit dates, all lower layers
> > +must have corrected commit dates as well.
> > +
> > +When merging layers, we do not consider whether the merged layers had corrected
> > +commit dates. Instead, the new layer will have corrected commit dates if and
> > +only if all existing layers below the new layer have corrected commit dates.
> > +
>
> Perhaps we should explicitly say that when rewriting split commit-graph
> as a single file (`--split=replace`) then the newly created single layer
> would store corrected commit dates.
>
Rewriting a split commit-graph as a single file is a case where there are
no "existing layers below the new layer". We should clarify that if the
new layer is the only layer, it will always have corrected commit dates
when written by compatible versions of Git.
I have appended a paragraph at the end:
While writing or merging layers, if the new layer is the only layer,
it will have corrected commit dates when written by compatible
versions of Git. Thus, rewriting split commit-graph as a single file
(`--split=replace`) creates a single layer with corrected commit
dates.
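To make the write-side rule concrete, here is a minimal sketch (in Python for brevity, not Git's C). The function names and the list-of-booleans model of a chain are assumptions for illustration only, not code from the patch; it models a writer that understands generation number v2.

```python
from typing import List


def new_layer_has_gdat(remaining_layers: List[bool]) -> bool:
    """Decide whether a newly written layer stores corrected commit dates.

    remaining_layers[i] is True if layer i (bottom first) has a GDAT
    chunk. Only the layers that stay below the new layer are passed in.
    """
    # all([]) is True: a layer with nothing below it always gets GDAT.
    return all(remaining_layers)


def write_layer(chain: List[bool], merged: int = 0) -> List[bool]:
    """Return the chain after writing a layer that absorbs the top
    `merged` existing layers (merged == len(chain) models replace)."""
    remaining = chain[: len(chain) - merged]
    return remaining + [new_layer_has_gdat(remaining)]
```

With a chain of [True, False], appending yields [True, False, False], merging the top layer yields [True, True], and replacing the whole chain yields [True].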
> > ## Deleting graph-{hash} files
> >
> > After a new tip file is written, some `graph-{hash}` files may no longer
>
> Best,
> --
> Jakub Narębski
Thanks
- Abhishek
* Re: [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date
2020-10-07 14:09 ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
` (9 preceding siblings ...)
2020-10-07 14:09 ` [PATCH v4 10/10] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2020-11-04 23:37 ` Jakub Narębski
2020-11-22 5:31 ` Abhishek Kumar
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
11 siblings, 1 reply; 211+ messages in thread
From: Jakub Narębski @ 2020-11-04 23:37 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Taylor Blau, Abhishek Kumar
Hi Abhishek,
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.
Thanks a lot for your continued work on this patch series.
>
> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.
Very minor nitpick: it would make it easier to read if the commands
themselves would be put inside single quotes or backticks, e.g. `git log
--graph` and `git merge-base v4.8 v4.9`.
I wonder if it is worth mentioning (probably not) that this performance
hit was the reason why since 091f4cf3 `git merge-base` uses committer
date heuristics unless there is a cutoff and using topological levels
(generation number v1) is expected to give better performance.
>
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem 1
Minor issue: this looks a bit strange; is there an error in formatting
this part?
> [https://public-inbox.org/git/87a7gdspo4.fsf@evledraar.gmail.com/], we also
> needed a way to distinguish between the old and new generation number
> without incrementing graph version.
>
> Various candidates were examined (https://github.com/derrickstolee/gen-test,
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Monotonically Increasing Offsets
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
>
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk.
Nice writeup about the history of generation number v2, much appreciated.
> Thus, Corrected Commit Date was chosen as generation
> number v2. The Corrected Commit Date is defined as:
Minor nitpick: it would probably be better to use "is defined as
follows." instead of "is defined as:".
>
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents plus 1. Then
> corrected commit date offset is the difference between corrected commit date
> of C and commit date of C. As a special case, a root commit with timestamp
> zero has corrected commit date of 1 to be able distinguish it from
> GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date).
Very minor nitpick: s/with timestamp/with *the* timestamp/, and
s/to be able distinguish/to be able *to* distinguish/ (without the '*'
used to mark the additions).
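As an aside, the definition quoted above can be written down as a short sketch. This is Python for illustration rather than Git's C, and the input model (a dict from a commit id to its commit date and parent ids) and the function names are assumptions, not anything from the patch series.

```python
from typing import Dict, List, Tuple

# Hypothetical input model: commit id -> (commit date, parent ids).
History = Dict[str, Tuple[int, List[str]]]


def corrected_commit_dates(history: History) -> Dict[str, int]:
    """Return commit id -> corrected commit date for an acyclic history."""
    corrected: Dict[str, int] = {}
    for start in history:
        stack = [start]
        while stack:
            cid = stack[-1]
            if cid in corrected:
                stack.pop()
                continue
            date, parents = history[cid]
            pending = [p for p in parents if p not in corrected]
            if pending:
                # Parents must be computed before their child.
                stack.extend(pending)
                continue
            value = max([date] + [corrected[p] + 1 for p in parents])
            # A root commit with timestamp zero gets 1, so that 0 can
            # keep meaning "not computed".
            corrected[cid] = max(value, 1)
            stack.pop()
    return corrected


def corrected_commit_date_offsets(history: History) -> Dict[str, int]:
    """Return commit id -> (corrected commit date - commit date)."""
    corrected = corrected_commit_dates(history)
    return {cid: corrected[cid] - history[cid][0] for cid in history}
```

For example, a commit dated 50 whose only parent has corrected commit date 100 gets corrected commit date 101 and offset 51, while a commit whose date is already later than all of its parents' corrected dates has offset 0.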
>
> We will introduce an additional commit-graph chunk, Generation Data chunk,
Or "Generation DATa chunk", if we want to emphasize where its name came
from, or even "Generation DATa (GDAT) chunk". But it is fine as it is
now, though it would be a good idea to write "Generation Data (GDAT)
chunk" to explicitly state its name / shortcut.
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.
Nice writeup of handling the backward compatibility.
>
> While storing corrected commit date offsets saves us 4 bytes per commit (as
> compared with storing corrected commit dates directly), it's possible for
> the offset to overflow the space allocated. To handle such cases, we
> introduce a new chunk, Generation Data Overflow (GDOV) that stores the
> corrected commit date. For overflowing offsets, we set MSB and store the
> position into the GDOV chunk, in a mechanism similar to the Extra Edges list
> chunk.
Very minor suggestion: perhaps it would be better to use "it's however
possible".
Very minor suggestion: "it's possible for the offset to overflow" could
be simplified to just "the offset can overflow"... though the simplified
version loses a bit of hint that the overflow should be very rare in
real repositories.
But it is just fine as it is now; I am not a native English speaker to
judge which version is better.
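To check my understanding of the overflow mechanism, here is a rough sketch of it in Python (not the C from the patch). The constant names, the 31-bit inline limit, and the plain lists standing in for the GDAT and GDOV chunks are assumptions made for illustration; following the description above, the overflow list holds the corrected commit date itself.

```python
from typing import List, Tuple

OFFSET_MAX = (1 << 31) - 1  # assumed: largest offset stored inline
OVERFLOW_FLAG = 1 << 31     # most significant bit of the 32-bit slot


def encode_generation_data(
    commit_dates: List[int], corrected_dates: List[int]
) -> Tuple[List[int], List[int]]:
    """Build (gdat, gdov) lists, one gdat slot per commit."""
    gdat: List[int] = []
    gdov: List[int] = []
    for date, corrected in zip(commit_dates, corrected_dates):
        offset = corrected - date
        if offset > OFFSET_MAX:
            # Store the position in the overflow list, with the MSB set.
            gdat.append(OVERFLOW_FLAG | len(gdov))
            gdov.append(corrected)
        else:
            gdat.append(offset)
    return gdat, gdov


def decode_corrected_date(
    index: int, commit_dates: List[int], gdat: List[int], gdov: List[int]
) -> int:
    """Recover the corrected commit date of the commit at `index`."""
    slot = gdat[index]
    if slot & OVERFLOW_FLAG:
        return gdov[slot & OFFSET_MAX]
    return commit_dates[index] + slot
```

Since the writer already knows which commits overflow at the time it fills the per-commit slots, the overflow list here is built in the same pass.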
>
> For mixed generation number environment (for example new Git on the command
> line, old Git used by GUI client), we can encounter a mixed-chain
> commit-graph (a commit-graph chain where some of split commit-graph files
> have GDAT chunk and others do not). As backward compatibility is one of the
> goals, we can define the following behavior:
>
> While reading a mixed-chain commit-graph version, we fall back on
> topological levels as corrected commit dates and topological levels cannot
> be compared directly.
>
> While writing on top of a split commit-graph, we check if the tip of the
> chain has a GDAT chunk. If it does, we append to the chain, writing GDAT
> chunk. Thus, we guarantee if the topmost split commit-graph file has a GDAT
> chunk, rest of the chain does too.
>
> If the topmost split commit-graph file does not have a GDAT chunk (meaning
> it has been appended by the old Git), we write without GDAT chunk. We do
> write a GDAT chunk when the existing chain does not have GDAT chunk - when
> we are writing to the commit-graph chain with the 'replace' strategy.
I think the last paragraph can be simplified (or added to) by explicitly
stating the goal:
When adding a new layer to the split commit-graph file, and when merging
some or all layers (replacing them in the latter case), the new layer
will have a GDAT chunk if and only if in the final result there would be
no layer without a GDAT chunk just below it.
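The read-side rule quoted earlier (fall back to topological levels for a mixed chain) complements this write-side goal. A minimal sketch of it, in Python rather than Git's C, where the Layer type and the function name are hypothetical and each layer is modelled as plain dictionaries:

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class Layer(NamedTuple):
    topo_levels: Dict[str, int]                # always present (CDAT)
    corrected_dates: Optional[Dict[str, int]]  # None if no GDAT chunk


def generation_lookup(chain: List[Layer]) -> Callable[[str], Optional[int]]:
    """Return a function mapping a commit id to its generation number."""
    # Corrected commit dates are used only if every layer has them,
    # because they cannot be compared with topological levels.
    use_v2 = all(layer.corrected_dates is not None for layer in chain)

    def generation(cid: str) -> Optional[int]:
        for layer in chain:
            if cid in layer.topo_levels:
                if use_v2:
                    return layer.corrected_dates[cid]
                return layer.topo_levels[cid]
        return None  # commit is not in the commit-graph

    return generation
```

With a lower layer holding corrected dates around 1000 and an upper layer written without GDAT, every lookup returns a topological level, so values from the two layers stay comparable.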
>
> Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews.
You are welcome.
>
> I look forward to everyone's reviews!
>
> Thanks
>
> * Abhishek
>
>
> ----------------------------------------------------------------------------
>
> Changes in version 4:
>
> * Added GDOV to handle overflows in generation data.
> * Added a test for writing tip graph for a generation number v2 graph chain
> in t5324-split-commit-graph.sh
> * Added a section on how mixed generation number chains are handled in
> Documentation/technical/commit-graph-format.txt
> * Reverted unimportant whitespace style changes in commit-graph.c
> * Added header comments about the order of comparison for
> compare_commits_by_gen_then_commit_date in commit.h,
> compare_commits_by_gen in commit-graph.h
> * Elaborated on why t6404 fails with corrected commit date and must be run
> with GIT_TEST_COMMIT_GRAPH=1 in the commit "commit-reach: use corrected
> commit dates in paint_down_to_common()"
> * Elaborated on write behavior for mixed generation number chains in the
> commit "commit-graph: use generation v2 only if entire chain does"
> * Added notes about adding the topo_level slab to struct
> write_commit_graph_context as well as struct commit_graph.
> * Clarified commit message for "commit-graph: consolidate
> fill_commit_graph_info"
> * Removed the claim "GDAT can store future generation numbers" because it
> hasn't been tested yet.
>
> Changes in version 3:
>
> * Reordered patches as discussed in 2
> [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/]
> .
> * Split "implement corrected commit date" into two patches - one
> introducing the topo level slab and other implementing corrected commit
> dates.
> * Extended split-commit-graph tests to verify at the end of test.
> * Use topological levels as generation number if any of split commit-graph
> files do not have generation data chunk.
>
> Changes in version 2:
>
> * Add tests for generation data chunk.
> * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
> generation data chunk.
> * Compare commits with corrected commit dates if present in
> paint_down_to_common().
> * Update technical documentation.
> * Handle mixed generation commit chains.
> * Improve commit messages for "commit-graph: fix regression when computing
> bloom filter", "commit-graph: consolidate fill_commit_graph_info",
> * Revert unnecessary whitespace changes.
> * Split uint_32 -> timestamp_t change into a new commit.
After careful review of those 10 patches it looks like the series is
close to being ready, requiring only small changes to progress.
> Abhishek Kumar (10):
> commit-graph: fix regression when computing Bloom filters
All good, beside possible improvement to the commit message.
Thanks to Taylor Blau for discovering a possible reason for the strange
lack of change in performance.
> revision: parse parent in indegree_walk_step()
Looks good.
> commit-graph: consolidate fill_commit_graph_info
Needs a fix for the now-duplicated test names (minor change).
Proposed possible improvement to the commit message.
> commit-graph: return 64-bit generation number
Needs fixing due to mismerge: there should be no switch from
using GENERATION_NUMBER_ZERO to using GENERATION_NUMBER_INFINITY.
Possible minor improvement to the commit message.
> commit-graph: add a slab to store topological levels
Possible minor improvement to the commit message.
There is also a not very important issue, but something that would be
nice to explain, namely that checks for GENERATION_NUMBER_INFINITY
can never be true, as topo_level_slab_at() returns 0 for commits
outside the commit-graph, not GENERATION_NUMBER_INFINITY. It works
but it is not obvious why.
> commit-graph: implement corrected commit date
The change to commit-graph verification needs fixing, and we need to
decide how verifying generation numbers should work. Perhaps a test
for handling topological level of GENERATION_NUMBER_V1_MAX could be
added (though this might be left for later).
The changes to `git commit-graph verify` code could be put into
a separate patch, either before or after this one.
> commit-graph: implement generation data chunk
Proposed possible improvement to the commit message.
The commit message does not explain why the given shape of history is
needed to test handling corrected commit date offset overflow.
Proposed minor corrections to the coding style.
Instead of looping again through all commits when handling overflow
in corrected commit date offsets, while there should be at most a
few commits needing it, why not save those commits on a list and loop
only through those commits? Though this _possible_ performance
improvement could be left to the followup...
test_commit_with_date() could be instead implemented via adding
`--date <date>` option to test_commit() in test-lib-functions.sh.
Also, to reduce "noise" in this patch, the rename of
run_three_modes() to run_all_modes() and test_three_modes() to
test_all_modes() could have been done in a separate preparatory
patch. It would be a pure refactoring patch, without introducing any
new functionality. But it is not something that is necessary.
> commit-graph: use generation v2 only if entire chain does
Proposed possible improvement to the commit message.
Proposed minor corrections to the coding style (also in tests).
There is a question whether merging layers or replacing them should
honor GIT_TEST_COMMIT_GRAPH_NO_GDAT.
Tests possibly could be made more strict, and check more things
explicitly. One test we are missing is testing that merging layers
is done correctly, namely that if we are merging layers in split
commit-graph file, and the layer below the ones we are merging lacks
GDAT chunk, then the result of the merge should also be without GDAT
chunk -- but that might be left for later.
> commit-reach: use corrected commit dates in paint_down_to_common()
This patch consists of two slightly interleaved changes, which
possibly could be separated: change to paint_down_to_common() and
change to t6404-recursive-merge test.
In the commit message for the paint_down_to_common() we should
explicitly mention 091f4cf3, which this one partially reverts.
Possible accidental change, question about function naming.
> doc: add corrected commit date info
Needs further improvements to the documentation, like adding
"[Optional]" to chunk description, and leftover switching from
"generation numbers" to "topological levels" in one place.
>
> .../technical/commit-graph-format.txt | 21 +-
> Documentation/technical/commit-graph.txt | 62 ++++-
> commit-graph.c | 256 ++++++++++++++----
> commit-graph.h | 17 +-
> commit-reach.c | 38 +--
> commit-reach.h | 2 +-
> commit.c | 4 +-
> commit.h | 5 +-
> revision.c | 13 +-
> t/README | 3 +
> t/helper/test-read-graph.c | 4 +
> t/t4216-log-bloom.sh | 4 +-
> t/t5000-tar-tree.sh | 20 +-
> t/t5318-commit-graph.sh | 70 ++++-
> t/t5324-split-commit-graph.sh | 98 ++++++-
> t/t6404-recursive-merge.sh | 5 +-
> t/t6600-test-reach.sh | 68 ++---
> upload-pack.c | 2 +-
> 18 files changed, 534 insertions(+), 158 deletions(-)
>
>
> base-commit: d98273ba77e1ab9ec755576bc86c716a97bf59d7
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v4
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v4
> Pull-Request: https://github.com/gitgitgadget/git/pull/676
[...]
Best,
--
Jakub Narębski
* Re: [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date
2020-11-04 23:37 ` [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date Jakub Narębski
@ 2020-11-22 5:31 ` Abhishek Kumar
0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2020-11-22 5:31 UTC (permalink / raw)
To: Jakub Narębski; +Cc: abhishekkumar8222, git, gitgitgadget, stolee
On Thu, Nov 05, 2020 at 12:37:49AM +0100, Jakub Narębski wrote:
> Hi Abhishek,
>
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
>
> Thanks a lot for continued working on this patch series.
Thank you so much for the careful review of the series.
>
> >
> > Git uses topological levels in the commit-graph file for commit-graph
> > traversal operations like git log --graph. Unfortunately, using topological
> > levels can result in a worse performance than without them when compared
> > with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> > on the Linux repository walks 635,579 commits using topological levels and
> > walks 167,468 using committer date.
>
> Very minor nitpick: it would make it easier to read if the commands
> themselves would be put inside single quotes or backticks, e.g. `git log
> --graph` and `git merge-base v4.8 v4.9`.
That's unexpected - I wrote the commands within single quotes in the pull
request. Since backticks are rendered as "code-tags" on GitHub, let me
try single quotes.
>
> I wonder if it is worth mentioning (probably not) that this performance
> hit was the reason why since 091f4cf3 `git merge-base` uses committer
> date heuristics unless there is a cutoff and using topological levels
> (generation number v1) is expected to give better performance.
>
I think that's useful context for someone wondering whether we continue
to take the performance hit with topological levels, have abandoned
topological levels, or have chosen another alternative altogether.
> >
> > Thus, the need for generation number v2 was born. New generation number
> > needed to provide good performance, increment updates, and backward
> > compatibility. Due to an unfortunate problem 1
>
> Minor issue: this looks a bit strange; is there an error in formatting
> this part?
Yes. The plaintext in pull request description reads as follows:
Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem [1], we also needed a way
to distinguish between the old and new generation number without
incrementing graph version.
[1]: https://public-inbox.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
I have been reviewing other pull request descriptions to match their
style (and hope the cover letter renders correctly). Dr. Stolee, in
his patch series to "add --literal value" option to configuration, has
written:
As reported [1], 'git maintenance unregister' fails when a repository
is located in a directory with regex glob characters.
[1] https://lore.kernel.org/git/2c2db228-069a-947d-8446-89f4d3f6181a@gmail.com/T/#mb96fa4187a0d6aeda097cd95804a8aafc0273022
(Note the lack of a colon after [1].)
>
> > [https://public-inbox.org/git/87a7gdspo4.fsf@evledraar.gmail.com/], we also
> > needed a way to distinguish between the old and new generation number
> > without incrementing graph version.
> >
> > Various candidates were examined (https://github.com/derrickstolee/gen-test,
> > https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> > number v2, Corrected Commit Date with Monotonically Increasing Offsets
> > performed much worse than committer date (506,577 vs. 167,468 commits walked
> > for git merge-base v4.8 v4.9) and was dropped.
> >
> > Using Generation Data chunk (GDAT) relieves the requirement of backward
> > compatibility as we would continue to store topological levels in Commit
> > Data (CDAT) chunk.
>
> Nice writeup about the history of generation number v2, much appreciated.
>
> > Thus, Corrected Commit Date was chosen as generation
> > number v2. The Corrected Commit Date is defined as:
>
> Minor nitpick: it would be probably better to use "is defined as
> follows." instead of "is defined as:".
>
> >
> > For a commit C, let its corrected commit date be the maximum of the commit
> > date of C and the corrected commit dates of its parents plus 1. Then
> > corrected commit date offset is the difference between corrected commit date
> > of C and commit date of C. As a special case, a root commit with timestamp
> > zero has corrected commit date of 1 to be able distinguish it from
> > GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date).
>
> Very minor nitpick: s/with timestamp/with *the* timestamp/, and
> s/to be able distinguish/to be able *to* distinguish/ (without the '*'
> used to mark the additions).
>
> >
> > We will introduce an additional commit-graph chunk, Generation Data chunk,
>
> Or "Generation DATa chunk", if we want to emphasize where its name came
> from, or even "Generation DATa (GDAT) chunk". But it is fine as it is
> now, though it would be good idea to write "Generation Data (GDAT)
> chunk" to explicitly state its name / shortcut.
>
> > and store corrected commit date offsets in GDAT chunk while storing
> > topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> > chunk, using topological levels from CDAT chunk. In contrast, new versions
> > of Git would use corrected commit dates, falling back to topological level
> > if the generation data chunk is absent in the commit-graph file.
>
> Nice writeup of handling the backward compatibility.
>
> >
> > While storing corrected commit date offsets saves us 4 bytes per commit (as
> > compared with storing corrected commit dates directly), it's possible for
> > the offset to overflow the space allocated. To handle such cases, we
> > introduce a new chunk, Generation Data Overflow (GDOV) that stores the
> > corrected commit date. For overflowing offsets, we set MSB and store the
> > position into the GDOV chunk, in a mechanism similar to the Extra Edges list
> > chunk.
>
> Very minor suggestion: perhaps it would be better to use "it's however
> possible".
>
> Very minor suggestion: "it's possible for the offset to overflow" could
> be simplified to just "the offset can overflow"... though the simplified
> version loses a bit of hint that the overflow should be very rare in
> real repositories.
>
> But it is just fine as it is now; I am not a native English speaker to
> judge which version is better.
>
I think it is better to indicate the rareness of overflows.
> >
> > For mixed generation number environment (for example new Git on the command
> > line, old Git used by GUI client), we can encounter a mixed-chain
> > commit-graph (a commit-graph chain where some of split commit-graph files
> > have GDAT chunk and others do not). As backward compatibility is one of the
> > goals, we can define the following behavior:
> >
> > While reading a mixed-chain commit-graph version, we fall back on
> > topological levels as corrected commit dates and topological levels cannot
> > be compared directly.
> >
> > While writing on top of a split commit-graph, we check if the tip of the
> > chain has a GDAT chunk. If it does, we append to the chain, writing GDAT
> > chunk. Thus, we guarantee if the topmost split commit-graph file has a GDAT
> > chunk, rest of the chain does too.
> >
> > If the topmost split commit-graph file does not have a GDAT chunk (meaning
> > it has been appended by the old Git), we write without GDAT chunk. We do
> > write a GDAT chunk when the existing chain does not have GDAT chunk - when
> > we are writing to the commit-graph chain with the 'replace' strategy.
>
> I think the last paragraph can be simplified (or added to) by explicitly
> stating the goal:
>
> When adding new layer to the split commit-graph file, and when merging
> some or all layers (replacing them in the latter case), the new layer
> will have GDAT chunk if and only if in the final result there would be
> no layer without GDAT chunk just below it.
>
Thanks, that is much clearer to understand.
> ...
>
> After careful review of those 10 patches it looks like the series is
> close to being ready, requiring only small changes to progress.
>
Thank you for writing this handy reference for changes.
> > Abhishek Kumar (10):
> > commit-graph: fix regression when computing Bloom filters
>
> All good, beside possible improvement to the commit message.
> Thanks to Taylor Blau for discovering possible reason for strange
> no change in performance.
>
> > revision: parse parent in indegree_walk_step()
>
> Looks good.
>
> > commit-graph: consolidate fill_commit_graph_info
>
> Needs to fix now duplicated test names (minor change).
> Proposed possible improvement to the commit message.
>
> > commit-graph: return 64-bit generation number
>
> Needs fixing due to mismerge: there should be no switch from
> using GENERATION_NUMBER_ZERO to using GENERATION_NUMBER_INFINITY.
> Possible minor improvement to the commit message.
>
> > commit-graph: add a slab to store topological levels
>
> Possible minor improvement to the commit message.
>
> There is also not very important issue, but something that would be
> nice to explain, namely that checks for GENERATION_NUMBER_INFINITY
> can never be true, as topo_level_slab_at() returns 0 for commits
> outside the commit-graph, not GENERATION_NUMBER_INFINITY. It works
> but it is not obvious why.
>
> > commit-graph: implement corrected commit date
>
> The change to commit-graph verification needs fixing, and we need to
> decide how verifying generation numbers should work. Perhaps a test
> for handling topological level of GENERATION_NUMBER_V1_MAX could be
> added (though this might be left for later).
>
> The changes to `git commit-graph verify` code could be put into
> separate patch, either before or after this one.
>
> > commit-graph: implement generation data chunk
>
> Proposed possible improvement to the commit message.
> The commit message does not explain why given shape of history is
> needed to test handling corrected commit date offset overflow.
>
> Proposed minor corrections to the coding style.
>
> Instead of looping again through all commits when handling overflow
> in corrected commit date offsets, while there should be at most a
> few commits needing it, why not save those commits on list and loop
> only through those commits? Though this _possible_ performance
> improvement could be left to the followup...
Since the improvement can be applied to both
`write_graph_chunk_generation_data_overflow()` and
`write_graph_chunk_extra_edges()`, I am planning to cover this in a
followup.
>
> test_commit_with_date() could be instead implemented via adding
> `--date <date>` option to test_commit() in test-lib-functions.sh.
>
> Also, to reduce "noise" in this patch, the rename of
> run_three_modes() to run_all_modes() and test_three_modes() to
> test_all_modes() could have been done in a separate preparatory
> patch. It would be pure refactoring patch, without introducing any
> new functionality. But it is not something that is necessary.
>
> > commit-graph: use generation v2 only if entire chain does
>
> Proposed possible improvement to the commit message.
> Proposed minor corrections to the coding style (also in tests).
>
> There is a question whether merging layers or replacing them should
> honor GIT_TEST_COMMIT_GRAPH_NO_GDAT.
>
> Tests possibly could be made more strict, and check more things
> explicitly. One test we are missing is testing that merging layers
> is done correctly, namely that if we are merging layers in split
> commit-graph file, and the layer below the ones we are merging lacks
> GDAT chunk, then the result of the merge should also be without GDAT
> chunk -- but that might be left for later.
>
> > commit-reach: use corrected commit dates in paint_down_to_common()
>
> This patch consist of two slightly interleaved changes, which
> possibly could be separated: change to paint_down_to_common() and
> change to t6404-recursive-merge test.
>
> In the commit message for the paint_down_to_common() we should
> explicitly mention 091f4cf3, which this one partially reverts.
>
> Possible accidental change, question about function naming.
>
> > doc: add corrected commit date info
>
> Needs further improvements to the documentation, like adding
> "[Optional]" to chunk description, and leftover switching from
> "generation numbers" to "topological levels" in one place.
>
> ...
>
> Best,
> --
> Jakub Narębski
Thanks
- Abhishek
* [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date
2020-10-07 14:09 ` [PATCH v4 00/10] " Abhishek Kumar via GitGitGadget
` (10 preceding siblings ...)
2020-11-04 23:37 ` [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date Jakub Narębski
@ 2020-12-28 11:15 ` Abhishek Kumar via GitGitGadget
2020-12-28 11:15 ` [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
` (12 more replies)
11 siblings, 13 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:15 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar
This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.
Git uses topological levels in the commit-graph file for commit-graph
traversal operations like 'git log --graph'. Unfortunately, using
topological levels can perform worse than not using them at all, that is,
worse than the committer date heuristic. For example, 'git merge-base
v4.8 v4.9' on the Linux repository walks 635,579 commits using topological
levels and walks 167,468 using committer date. Since 091f4cf3 (commit: don't
use generation numbers if not needed, 2018-08-30), 'git merge-base' uses
committer date heuristic unless there is a cutoff because of the performance
hit.
Thus, the need for generation number v2 was born. The new generation number
needed to provide good performance, incremental updates, and backward
compatibility. Due to an unfortunate problem [1], we also needed a way to
distinguish between the old and new generation number without incrementing
the graph version.
[1] https://public-inbox.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Various candidates were examined (https://github.com/derrickstolee/gen-test,
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Monotonically Increasing Offsets,
performed much worse than committer date (506,577 vs. 167,468 commits walked
for 'git merge-base v4.8 v4.9') and was dropped.
Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as follows:
For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents plus 1. Then
corrected commit date offset is the difference between corrected commit date
of C and commit date of C. As a special case, a root commit with the
timestamp zero has corrected commit date of 1 to be able to distinguish it
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date).
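As an illustration only (this is not code from the series), here is a minimal
sketch of the definition above. It assumes a hypothetical 'struct toy_commit'
stored in an array where every parent appears before its children; the real
implementation walks the commit list in commit-graph.c instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy commit, NOT Git's struct commit. */
struct toy_commit {
	uint64_t date;           /* committer date */
	size_t nr_parents;
	const size_t *parents;   /* indices of parents; each is smaller than our own index */
	uint64_t corrected_date; /* output: corrected commit date */
	uint64_t offset;         /* output: corrected_date - date */
};

/*
 * Compute corrected commit dates for commits stored parents-first.
 * corrected_date = max(date, max(parent corrected dates) + 1)
 */
static void compute_corrected_dates(struct toy_commit *commits, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		struct toy_commit *c = &commits[i];
		uint64_t max_parent = 0;

		for (size_t j = 0; j < c->nr_parents; j++) {
			uint64_t p = commits[c->parents[j]].corrected_date;
			if (p > max_parent)
				max_parent = p;
		}

		/*
		 * A root commit with date 0 ends up with corrected date 1,
		 * so it never collides with the "uncomputed" value 0.
		 */
		if (c->date > max_parent)
			c->corrected_date = c->date;
		else
			c->corrected_date = max_parent + 1;
		c->offset = c->corrected_date - c->date;
	}
}
```

With a chain of commits dated 0, 100 and 50 (in that order), the sketch yields
corrected dates 1, 100 and 101, so the third commit has an offset of 51.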
We will introduce an additional commit-graph chunk, Generation DATa (GDAT)
chunk, and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.
While storing corrected commit date offsets saves us 4 bytes per commit (as
compared with storing corrected commit dates directly), it is possible for
the offset to overflow the space allocated. To handle such cases, we
introduce a new chunk, Generation Data Overflow (GDOV), that stores the
corrected commit date. For overflowing offsets, we set the MSB and store the
position into the GDOV chunk, in a mechanism similar to the Extra Edges list
chunk.
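To make the overflow scheme concrete, here is a rough sketch (again, not the
code from the series). The names 'TOY_OFFSET_OVERFLOW_BIT', 'TOY_OFFSET_MAX',
'encode_generation()' and 'decode_generation()' are made up for illustration,
and the 31-bit limit is an assumption based on "set MSB" of a 4-byte value.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: the top bit of the 32-bit slot flags an overflow. */
#define TOY_OFFSET_OVERFLOW_BIT (UINT32_C(1) << 31)
#define TOY_OFFSET_MAX (TOY_OFFSET_OVERFLOW_BIT - 1)

/*
 * Return the 32-bit value to store for a commit. If the offset does
 * not fit in 31 bits, append the full corrected commit date to the
 * caller-provided overflow array and store its position with the MSB set.
 */
static uint32_t encode_generation(uint64_t corrected_date, uint64_t commit_date,
				  uint64_t *overflow, size_t *nr_overflow)
{
	uint64_t offset = corrected_date - commit_date;

	if (offset > TOY_OFFSET_MAX) {
		uint32_t pos = (uint32_t)*nr_overflow;

		overflow[pos] = corrected_date;
		*nr_overflow += 1;
		return TOY_OFFSET_OVERFLOW_BIT | pos;
	}
	return (uint32_t)offset;
}

/* Recover the corrected commit date from the stored 32-bit value. */
static uint64_t decode_generation(uint32_t stored, uint64_t commit_date,
				  const uint64_t *overflow)
{
	if (stored & TOY_OFFSET_OVERFLOW_BIT)
		return overflow[stored ^ TOY_OFFSET_OVERFLOW_BIT];
	return commit_date + stored;
}
```

The overflow array plays the role of the GDOV chunk here: it is only touched
for the rare commits whose offset needs more than 31 bits.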
In a mixed generation number environment (for example, new Git on the command
line, old Git used by a GUI client), we can encounter a mixed-chain
commit-graph (a commit-graph chain where some of the split commit-graph files
have a GDAT chunk and others do not). As backward compatibility is one of the
goals, we define the following behavior:
While reading a mixed-chain commit-graph, we fall back on topological
levels, as corrected commit dates and topological levels cannot be compared
directly.
When adding a new layer to the split commit-graph file, and when merging some
or all layers (replacing them in the latter case), the new layer will have
a GDAT chunk if and only if in the final result there would be no layer
without a GDAT chunk just below it.
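The two rules above could be sketched as follows. This is only my reading of
the behavior, expressed over a hypothetical array of per-layer flags
(index 0 being the bottom-most layer); the function names are made up and
do not exist in commit-graph.c.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Reading: corrected commit dates are used only if every layer in the
 * chain has a GDAT chunk; otherwise we fall back to topological levels.
 * (For an empty chain this is vacuously true.)
 */
static bool chain_uses_generation_v2(const bool *layers_have_gdat, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (!layers_have_gdat[i])
			return false;
	return true;
}

/*
 * Writing: 'nr_kept' is the number of lower layers that remain in the
 * chain after adding or merging. The new layer gets a GDAT chunk only
 * if the layer just below it has one (or if there is no layer below).
 */
static bool new_layer_gets_gdat(const bool *layers_have_gdat, size_t nr_kept)
{
	if (!nr_kept)
		return true;
	return layers_have_gdat[nr_kept - 1];
}
```

For example, with layers { GDAT, no GDAT, GDAT }, merging only the top layer
leaves a layer without GDAT just below, so the merged layer is written
without GDAT; merging all three layers into one leaves nothing below, so the
result can carry GDAT.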
Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews.
I look forward to everyone's reviews!
Thanks
- Abhishek
----------------------------------------------------------------------------
Improvements left for a future series:
* Save commits with generation data overflow and extra edge commits instead
of looping over all commits. cf. 858sbel67n.fsf@gmail.com
* Verify both topological levels and corrected commit dates when present.
cf. 85pn4tnk8u.fsf@gmail.com
Changes in version 5:
* Explained a possible reason for no change in performance for
"commit-graph: fix regression when computing bloom-filters"
* Clarified the addition of a new test for 11-digit octal
implementations of ustar.
* Fixed duplicate test names in "commit-graph: consolidate
fill_commit_graph_info".
* Swapped the order "commit-graph: return 64-bit generation number",
"commit-graph: add a slab to store topological levels" to minimize lines
changed.
* Fixed the mismerge in "commit-graph: return 64-bit generation number"
* Clarified the preparatory steps are for the larger goal of implementing
generation number v2 in "commit-graph: return 64-bit generation number".
* Moved the rename of "run_three_modes()" to "run_all_modes()" into a new
patch "t6600-test-reach: generalize *_three_modes".
* Explained and removed the checks for GENERATION_NUMBER_INFINITY that can
never be true in "commit-graph: add a slab to store topological levels".
* Fixed incorrect logic for verifying commit-graph in "commit-graph:
implement corrected commit date".
* Added minor improvements to commit message of "commit-graph: implement
generation data chunk".
* Added '--date' option to test_commit() in 'test-lib-functions.sh' in
"commit-graph: implement generation data chunk".
* Improved coding style (also in tests) for "commit-graph: use generation
v2 only if entire chain does".
* Simplified test repository structure in "commit-graph: use generation v2
only if entire chain does" as only the number of commits in a split
commit-graph layer are relevant.
* Added a new test in "commit-graph: use generation v2 only if entire chain
does" to check if the layers are merged correctly.
* Explicitly mentioned commit "091f4cf3" in the commit-message of
"commit-graph: use corrected commit dates in paint_down_to_common()".
* Minor corrections to documentation in "doc: add corrected commit date
info".
* Minor corrections to coding style.
Changes in version 4:
* Added GDOV to handle overflows in generation data.
* Added a test for writing tip graph for a generation number v2 graph chain
in t5324-split-commit-graph.sh
* Added a section on how mixed generation number chains are handled in
Documentation/technical/commit-graph-format.txt
* Reverted unimportant whitespace, style changes in commit-graph.c
* Added header comments about the order of comparison for
compare_commits_by_gen_then_commit_date in commit.h,
compare_commits_by_gen in commit-graph.h
* Elaborated on why t6404 fails with corrected commit date and must be run
with GIT_TEST_COMMIT_GRAPH=1 in the commit "commit-reach: use corrected
commit dates in paint_down_to_common()"
* Elaborated on write behavior for mixed generation number chains in the
commit "commit-graph: use generation v2 only if entire chain does"
* Added notes about adding the topo_level slab to struct
write_commit_graph_context as well as struct commit_graph.
* Clarified commit message for "commit-graph: consolidate
fill_commit_graph_info"
* Removed the claim "GDAT can store future generation numbers" because it
hasn't been tested yet.
Changes in version 3:
* Reordered patches as discussed in 2
[https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/].
* Split "implement corrected commit date" into two patches - one
introducing the topo level slab and other implementing corrected commit
dates.
* Extended split-commit-graph tests to verify at the end of test.
* Use topological levels as generation number if any of split commit-graph
files do not have generation data chunk.
Changes in version 2:
* Add tests for generation data chunk.
* Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
generation data chunk.
* Compare commits with corrected commit dates if present in
paint_down_to_common().
* Update technical documentation.
* Handle mixed generation commit chains.
* Improve commit messages for "commit-graph: fix regression when computing
bloom filter" and "commit-graph: consolidate fill_commit_graph_info".
* Revert unnecessary whitespace changes.
* Split uint32_t -> timestamp_t change into a new commit.
Abhishek Kumar (11):
commit-graph: fix regression when computing Bloom filters
revision: parse parent in indegree_walk_step()
commit-graph: consolidate fill_commit_graph_info
t6600-test-reach: generalize *_three_modes
commit-graph: add a slab to store topological levels
commit-graph: return 64-bit generation number
commit-graph: implement corrected commit date
commit-graph: implement generation data chunk
commit-graph: use generation v2 only if entire chain does
commit-reach: use corrected commit dates in paint_down_to_common()
doc: add corrected commit date info
.../technical/commit-graph-format.txt | 28 +-
Documentation/technical/commit-graph.txt | 77 +++++-
commit-graph.c | 243 ++++++++++++++----
commit-graph.h | 15 +-
commit-reach.c | 38 +--
commit-reach.h | 2 +-
commit.c | 4 +-
commit.h | 5 +-
revision.c | 13 +-
t/README | 3 +
t/helper/test-read-graph.c | 4 +
t/t4216-log-bloom.sh | 4 +-
t/t5000-tar-tree.sh | 24 +-
t/t5318-commit-graph.sh | 79 +++++-
t/t5324-split-commit-graph.sh | 193 +++++++++++++-
t/t6404-recursive-merge.sh | 5 +-
t/t6600-test-reach.sh | 68 ++---
t/test-lib-functions.sh | 6 +
upload-pack.c | 2 +-
19 files changed, 659 insertions(+), 154 deletions(-)
base-commit: 4a0de43f4923993377dbbc42cfc0a1054b6c5ccf
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/676
Range-diff vs v4:
1: fae81b534b1 ! 1: c4e817abf7d commit-graph: fix regression when computing Bloom filters
@@ Metadata
## Commit message ##
commit-graph: fix regression when computing Bloom filters
- commit_gen_cmp is used when writing a commit-graph to sort commits in
- generation order before computing Bloom filters. Since c49c82aa (commit:
- move members graph_pos, generation to a slab, 2020-06-17) made it so
- that 'commit_graph_generation()' returns 'GENERATION_NUMBER_INFINITY'
- during writing, we cannot call it within this function. Instead, access
- the generation number directly through the slab (i.e., by calling
- 'commit_graph_data_at(c)->generation') in order to access it while
- writing.
+ Before computing Bloom filters, the commit-graph machinery uses
+ commit_gen_cmp to sort commits by generation order for improved diff
+ performance. 3d11275505 (commit-graph: examine commits by generation
+ number, 2020-03-30) claims that this sort can reduce the time spent to
+ compute Bloom filters by nearly half.
- While measuring performance with `git commit-graph write --reachable
- --changed-paths` on the linux repository led to around 1m40s for both
- HEAD and master (and could be due to fault in my measurements), it is
- still the "right" thing to do.
+ But since c49c82aa4c (commit: move members graph_pos, generation to a
+ slab, 2020-06-17), this optimization is broken, since asking for a
+ 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
+ while writing.
+
+ Not all hope is lost, though: 'commit_gen_cmp()' falls back to
+ comparing commits by their date when they have equal generation number,
+ and so since c49c82aa4c is purely a date comparison function. This
+ heuristic is good enough that we don't seem to lose appreciable
+ performance while computing Bloom filters. Applying this patch (compared
+ with v2.29.1) speeds up computing Bloom filters by around 4
+ seconds.
+
+ So, avoid the useless 'commit_graph_generation()' while writing by
+ instead accessing the slab directly. This returns the newly-computed
+ generation numbers, and allows us to avoid the heuristic by directly
+ comparing generation numbers.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
2: 4470d916428 = 2: 7645e0bcef0 revision: parse parent in indegree_walk_step()
3: 18bb3318a12 ! 3: ca646912b2b commit-graph: consolidate fill_commit_graph_info
@@ Commit message
fill_commit_in_graph().
fill_commit_graph_info() used to not load committer data from commit data
- chunk. However, with the corrected committer date, we have to load
- committer date to calculate generation number value.
+ chunk. However, with the upcoming switch to using corrected committer
+ date as generation number v2, we will have to load committer date to
+ compute generation number value anyway.
e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016) introduced a test 'generate tar with future mtime' that
- creates a commit with committer date of (2 ^ 36 + 1) seconds since
+ creates a commit with committer date of (2^36 + 1) seconds since
EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.
The test used to pass as fill_commit_graph_info() would not set struct
- member `date` of struct commit and loads committer date from the object
+ member `date` of struct commit and load committer date from the object
database, generating a tar file with the expected mtime.
However, with corrected commit date, we will load the committer date
@@ Commit message
mtime.
The ustar format (the header format used by most modern tar programs)
- only has room for 11 (or 12, depending om some implementations) octal
- digits for the size and mtime of each files.
+ only has room for 11 (or 12, depending on some implementations) octal
+ digits for the size and mtime of each file.
- Thus, setting a timestamp of 2 ^ 33 + 1 would overflow the 11-octal
- digit implementations while still fitting into commit data chunk.
+ As the CDAT chunk is overflowed by 12-octal digits but not 11-octal
+ digits, we split the existing tests to test both implementations
+ separately and add a new explicit test for 11-digit implementation.
- Since we want to test 12-octal digit implementations of ustar as well,
- let's modify the existing test to no longer use commit-graph file.
+ To test the 11-octal digit implementation, we create a future commit
+ with committer date of 2^34 - 1, which overflows 11-octal digits without
+ overflowing 34-bits of the Commit Data chunk.
+
+ To test the 12-octal digit implementation, the smallest committer date
+ possible is 2^36 + 1, which overflows the CDAT chunk and thus
+ commit-graph must be disabled for the test.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
@@ t/t5000-tar-tree.sh: test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can
test_cmp expect actual
'
-+test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
+-test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
++test_expect_success TIME_IS_64BIT 'set up repository with far-future (2^34 - 1) commit' '
+ rm -f .git/index &&
+ echo foo >file &&
+ git add file &&
@@ t/t5000-tar-tree.sh: test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can
+ git commit -m "tempori parendum"
+'
+
-+test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
++test_expect_success TIME_IS_64BIT 'generate tar with far-future mtime' '
+ git archive HEAD >future.tar
+'
+
@@ t/t5000-tar-tree.sh: test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can
+ test_cmp expect actual
+'
+
- test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
++test_expect_success TIME_IS_64BIT 'set up repository with far-far-future (2^36 + 1) commit' '
rm -f .git/index &&
echo content >file &&
git add file &&
@@ t/t5000-tar-tree.sh: test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can
git commit -m "tempori parendum"
'
+-test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
++test_expect_success TIME_IS_64BIT 'generate tar with far-far-future mtime' '
+ git archive HEAD >future.tar
+ '
+
-: ----------- > 4: 591935075f1 t6600-test-reach: generalize *_three_modes
5: e067f653ad5 ! 5: baae7006764 commit-graph: add a slab to store topological levels
@@ Commit message
commit-graph: add a slab to store topological levels
In a later commit we will introduce corrected commit date as the
- generation number v2. This value will be stored in the new seperate
- Generation Data chunk. However, to ensure backwards compatibility with
- "Old" Git we need to continue to write generation number v1, which is
- topological level, to the commit data chunk. This means that we need to
- compute both versions of generation numbers when writing the
- commit-graph file. Therefore, let's introduce a commit-slab to store
+ generation number v2. Corrected commit dates will be stored in the new
+ separate Generation Data chunk. However, to ensure backwards
+ compatibility with "Old" Git we need to continue to write generation
+ number v1 (topological levels) to the commit data chunk. Thus, we need
+ to compute and store both versions of generation numbers to write the
+ commit-graph file.
+
+ Therefore, let's introduce a commit-slab `topo_level_slab` to store
topological levels; corrected commit date will be stored in the member
`generation` of struct commit_graph_data.
- When Git creates a split commit-graph, it takes advantage of the
- generation values that have been computed already and present in
- existing commit-graph files.
+ The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO`
+ mark commits not in the commit-graph file and commits written by a
+ version of Git that did not compute generation numbers respectively.
+ Generation numbers are computed identically for both kinds of commits.
+
+ A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit
+ is not in the commit-graph file. However, since the slab is
+ zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`).
+ Thus, we no longer need to check if the topological level of a commit is
+ `GENERATION_NUMBER_INFINITY`.
- So, let's add a pointer to struct commit_graph as well as struct
- write_commit_graph_context to the topological level commit-slab
- and populate it with topological levels while writing a commit-graph
- file.
+ We will add a pointer to the slab in `struct write_commit_graph_context`
+ and `struct commit_graph` to populate the slab in
+ `fill_commit_graph_info` if the commit has a pre-computed topological
+ level as in case of split commit-graphs.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
_("Computing commit graph generation numbers"),
ctx->commits.nr);
for (i = 0; i < ctx->commits.nr; i++) {
-- timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
-+ timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+- uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
++ uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
display_progress(ctx->progress, i + 1);
- if (generation != GENERATION_NUMBER_INFINITY &&
- generation != GENERATION_NUMBER_ZERO)
-+ if (level != GENERATION_NUMBER_INFINITY &&
-+ level != GENERATION_NUMBER_ZERO)
++ if (level != GENERATION_NUMBER_ZERO)
continue;
commit_list_insert(ctx->commits.list[i], &list);
@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
- if (generation == GENERATION_NUMBER_INFINITY ||
- generation == GENERATION_NUMBER_ZERO) {
-+ if (level == GENERATION_NUMBER_INFINITY ||
-+ level == GENERATION_NUMBER_ZERO) {
++ if (level == GENERATION_NUMBER_ZERO) {
all_parents_computed = 0;
commit_list_insert(parent->item, &list);
break;
@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
- data->generation = max_generation + 1;
pop_commit(&list);
-- if (data->generation > GENERATION_NUMBER_V1_MAX)
-- data->generation = GENERATION_NUMBER_V1_MAX;
-+ if (max_level > GENERATION_NUMBER_V1_MAX - 1)
-+ max_level = GENERATION_NUMBER_V1_MAX - 1;
+- if (data->generation > GENERATION_NUMBER_MAX)
+- data->generation = GENERATION_NUMBER_MAX;
++ if (max_level > GENERATION_NUMBER_MAX - 1)
++ max_level = GENERATION_NUMBER_MAX - 1;
+ *topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
}
}
@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+ struct topo_level_slab topo_levels;
- if (!commit_graph_compatible(the_repository))
- return 0;
+ prepare_repo_settings(the_repository);
+ if (!the_repository->settings.core_commit_graph) {
@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
bloom_settings.max_changed_paths);
ctx->bloom_settings = &bloom_settings;
4: 011b0aa497d ! 6: 26bd6f49100 commit-graph: return 64-bit generation number
@@ Metadata
## Commit message ##
commit-graph: return 64-bit generation number
- In a preparatory step, let's return timestamp_t values from
- commit_graph_generation(), use timestamp_t for local variables and
- define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead.
+ In a preparatory step for introducing corrected commit dates, let's
+ return timestamp_t values from commit_graph_generation(), use
+ timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
+ as (2 ^ 63 - 1) instead.
We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
@@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
if (generation_a < generation_b)
return -1;
@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
- _("Computing commit graph generation numbers"),
- ctx->commits.nr);
- for (i = 0; i < ctx->commits.nr; i++) {
-- uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
-+ timestamp_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
-
- display_progress(ctx->progress, i + 1);
- if (generation != GENERATION_NUMBER_INFINITY &&
-@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
- data->generation = max_generation + 1;
+ if (all_parents_computed) {
pop_commit(&list);
-- if (data->generation > GENERATION_NUMBER_MAX)
-- data->generation = GENERATION_NUMBER_MAX;
-+ if (data->generation > GENERATION_NUMBER_V1_MAX)
-+ data->generation = GENERATION_NUMBER_V1_MAX;
+- if (max_level > GENERATION_NUMBER_MAX - 1)
+- max_level = GENERATION_NUMBER_MAX - 1;
++ if (max_level > GENERATION_NUMBER_V1_MAX - 1)
++ max_level = GENERATION_NUMBER_V1_MAX - 1;
+ *topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
}
}
- }
@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
for (i = 0; i < g->num_commits; i++) {
struct commit *graph_commit, *odb_commit;
@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_grap
max_generation--;
generation = commit_graph_generation(graph_commit);
+ if (generation != max_generation + 1)
+- graph_report(_("commit-graph generation for commit %s is %u != %u"),
++ graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
+ oid_to_hex(&cur_oid),
+ generation,
+ max_generation + 1);
## commit-graph.h ##
@@ commit-graph.h: void disable_commit_graph(struct repository *r);
@@ commit-reach.c: int repo_in_merge_bases_many(struct repository *r, struct commit
struct commit_list *bases;
int ret = 0, i;
- uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
-+ timestamp_t generation, max_generation = GENERATION_NUMBER_INFINITY;
++ timestamp_t generation, max_generation = GENERATION_NUMBER_ZERO;
if (repo_parse_commit(r, commit))
return ret;
6: 694ef1ec08d ! 7: 859c39eff52 commit-graph: implement corrected commit date
@@ Commit message
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
## commit-graph.c ##
-@@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
- else if (generation_a > generation_b)
- return 1;
-
-- /* use date as a heuristic when generations are equal */
-- if (a->date < b->date)
-- return -1;
-- else if (a->date > b->date)
-- return 1;
- return 0;
- }
-
@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
ctx->commits.nr);
for (i = 0; i < ctx->commits.nr; i++) {
- timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+ uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+ timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
display_progress(ctx->progress, i + 1);
- if (level != GENERATION_NUMBER_INFINITY &&
-- level != GENERATION_NUMBER_ZERO)
-+ level != GENERATION_NUMBER_ZERO &&
-+ corrected_commit_date != GENERATION_NUMBER_INFINITY &&
-+ corrected_commit_date != GENERATION_NUMBER_ZERO
-+ )
+- if (level != GENERATION_NUMBER_ZERO)
++ if (level != GENERATION_NUMBER_ZERO &&
++ corrected_commit_date != GENERATION_NUMBER_ZERO)
continue;
commit_list_insert(ctx->commits.list[i], &list);
@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
for (parent = current->parents; parent; parent = parent->next) {
level = *topo_level_slab_at(ctx->topo_levels, parent->item);
--
+ corrected_commit_date = commit_graph_data_at(parent->item)->generation;
- if (level == GENERATION_NUMBER_INFINITY ||
-- level == GENERATION_NUMBER_ZERO) {
-+ level == GENERATION_NUMBER_ZERO ||
-+ corrected_commit_date == GENERATION_NUMBER_INFINITY ||
-+ corrected_commit_date == GENERATION_NUMBER_ZERO
-+ ) {
+
+- if (level == GENERATION_NUMBER_ZERO) {
++ if (level == GENERATION_NUMBER_ZERO ||
++ corrected_commit_date == GENERATION_NUMBER_ZERO) {
all_parents_computed = 0;
commit_list_insert(parent->item, &list);
break;
@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
}
}
}
-@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
- if (generation_zero == GENERATION_ZERO_EXISTS)
- continue;
-
-- /*
-- * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
-- * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
-- * extra logic in the following condition.
-- */
-- if (max_generation == GENERATION_NUMBER_V1_MAX)
-- max_generation--;
--
- generation = commit_graph_generation(graph_commit);
-- if (generation != max_generation + 1)
-- graph_report(_("commit-graph generation for commit %s is %u != %u"),
-+ if (generation < max_generation + 1)
-+ graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
- oid_to_hex(&cur_oid),
- generation,
- max_generation + 1);
7: b903efe2ea1 ! 8: 8403c4d0257 commit-graph: implement generation data chunk
@@ Commit message
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
- pre-requistes before implementing generation number was to distinguish
- between graph versions in a backwards compatible manner.
+ prerequisites before implementing generation number v2 was to
+ distinguish between graph versions in a backwards compatible manner.
- We are going to introduce a new chunk called Generation Data chunk (or
- GDAT). GDAT stores corrected committer date offsets whereas CDAT will
- still store topological level.
+ We are going to introduce a new chunk called Generation DATa chunk (or
+ GDAT). GDAT will store corrected committer date offsets whereas CDAT
+ will still store topological level.
Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
- GDAT chunk is missing (as it would happen with a commit graph written
+ GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.
- While storing corrected commit date offset instead of the corrected
- commit date saves us 4 bytes per commit, it's possible for the offsets
- to overflow the 4-bytes allocated. As such overflows are exceedingly
- rare, we use the following overflow management scheme:
+ To minimize the space required to store corrected commit date, Git
+ stores corrected commit date offsets into the commit-graph file, instead
+ of corrected commit dates. This saves us 4 bytes per commit, decreasing
+ the GDAT chunk size by half, but it's possible for the offset to
+ overflow the 4-bytes allocated for storage. As such overflows are and
+ should be exceedingly rare, we use the following overflow management
+ scheme:
- We introduce a new commit-graph chunk, GENERATION_DATA_OVERFLOW ('GDOV')
+ We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
@@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
- graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+ if (g->chunk_generation_data) {
-+ offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
++ offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+
+ if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
+ offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
@@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
if (g->topo_levels)
*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
@@ commit-graph.c: struct write_commit_graph_context {
- struct packed_oid_list oids;
+ struct oid_array oids;
struct packed_commit_list commits;
int num_extra_edges;
+ int num_generation_data_overflows;
@@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
+ struct write_commit_graph_context *ctx)
+{
+ int i, num_generation_data_overflows = 0;
++
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
@@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
-
if (current->date && current->date > max_corrected_commit_date)
max_corrected_commit_date = current->date - 1;
-+
commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
+
+ if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
bloom_settings.bits_per_entry);
+@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
+ continue;
+
+ /*
+- * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
+- * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
+- * extra logic in the following condition.
++ * If we are using topological level and one of our parents has
++ * generation GENERATION_NUMBER_V1_MAX, then our generation is
++ * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
++ * in the following condition.
+ */
+- if (max_generation == GENERATION_NUMBER_V1_MAX)
++ if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
+ max_generation--;
+
+ generation = commit_graph_generation(graph_commit);
+- if (generation != max_generation + 1)
+- graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
++ if (generation < max_generation + 1)
++ graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
+ oid_to_hex(&cur_oid),
+ generation,
+ max_generation + 1);
## commit-graph.h ##
@@
@@ t/t5318-commit-graph.sh: test_expect_success 'corrupt commit-graph write (missin
)
'
-+test_commit_with_date() {
-+ file="$1.t" &&
-+ echo "$1" >"$file" &&
-+ git add "$file" &&
-+ GIT_COMMITTER_DATE="$2" GIT_AUTHOR_DATE="$2" git commit -m "$1"
-+ git tag "$1"
-+}
++# We test the overflow-related code with the following repo history:
++#
++# 4:F - 5:N - 6:U
++# / \
++# 1:U - 2:N - 3:U M:N
++# \ /
++# 7:N - 8:F - 9:N
++#
++# Here the commits denoted by U have committer date of zero seconds
++# since Unix epoch, the commits denoted by N have committer date
++# starting from 1112354055 seconds since Unix epoch (default committer
++# date for the test suite), and the commits denoted by F have committer
++# date of (2 ^ 31 - 2) seconds since Unix epoch.
++#
++# The largest offset observed is 2 ^ 31, just large enough to overflow.
++#
+
-+test_expect_success 'overflow corrected commit date offset' '
++test_expect_success 'set up and verify repo with generation data overflow chunk' '
+ objdir=".git/objects" &&
-+ UNIX_EPOCH_ZERO="1970-01-01 00:00 +0000" &&
++ UNIX_EPOCH_ZERO="@0 +0000" &&
+ FUTURE_DATE="@2147483646 +0000" &&
+ test_oid_cache <<-EOF &&
+ oid_version sha1:1
@@ t/t5318-commit-graph.sh: test_expect_success 'corrupt commit-graph write (missin
+ mkdir repo &&
+ cd repo &&
+ git init &&
-+ test_commit_with_date 1 "$UNIX_EPOCH_ZERO" &&
++ test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
+ test_commit 2 &&
-+ test_commit_with_date 3 "$UNIX_EPOCH_ZERO" &&
++ test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
+ git commit-graph write --reachable &&
+ graph_read_expect 3 generation_data &&
-+ test_commit_with_date 4 "$FUTURE_DATE" &&
++ test_commit --date "$FUTURE_DATE" 4 &&
+ test_commit 5 &&
-+ test_commit_with_date 6 "$UNIX_EPOCH_ZERO" &&
++ test_commit --date "$UNIX_EPOCH_ZERO" 6 &&
+ git branch left &&
+ git reset --hard 3 &&
+ test_commit 7 &&
-+ test_commit_with_date 8 "$FUTURE_DATE" &&
++ test_commit --date "$FUTURE_DATE" 8 &&
+ test_commit 9 &&
+ git branch right &&
+ git reset --hard 3 &&
-+ git merge left right &&
++ test_merge M left right &&
+ git commit-graph write --reachable &&
+ graph_read_expect 10 "generation_data generation_data_overflow" &&
+ git commit-graph verify
+'
+
-+graph_git_behavior 'overflow corrected commit date offset' repo left right
++graph_git_behavior 'generation data overflow chunk repo' repo left right
+
test_done
@@ t/t6600-test-reach.sh: test_expect_success 'setup' '
git config core.commitGraph true
'
--run_three_modes () {
-+run_all_modes () {
- test_when_finished rm -rf .git/objects/info/commit-graph &&
- "$@" <input >actual &&
- test_cmp expect actual &&
-@@ t/t6600-test-reach.sh: run_three_modes () {
+@@ t/t6600-test-reach.sh: run_all_modes () {
test_cmp expect actual &&
cp commit-graph-half .git/objects/info/commit-graph &&
"$@" <input >actual &&
@@ t/t6600-test-reach.sh: run_three_modes () {
test_cmp expect actual
}
--test_three_modes () {
-- run_three_modes test-tool reach "$@"
-+test_all_modes () {
-+ run_all_modes test-tool reach "$@"
- }
-
- test_expect_success 'ref_newer:miss' '
-@@ t/t6600-test-reach.sh: test_expect_success 'ref_newer:miss' '
- B:commit-4-9
- EOF
- echo "ref_newer(A,B):0" >expect &&
-- test_three_modes ref_newer
-+ test_all_modes ref_newer
- '
-
- test_expect_success 'ref_newer:hit' '
-@@ t/t6600-test-reach.sh: test_expect_success 'ref_newer:hit' '
- B:commit-2-3
- EOF
- echo "ref_newer(A,B):1" >expect &&
-- test_three_modes ref_newer
-+ test_all_modes ref_newer
- '
-
- test_expect_success 'in_merge_bases:hit' '
-@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases:hit' '
- B:commit-8-8
- EOF
- echo "in_merge_bases(A,B):1" >expect &&
-- test_three_modes in_merge_bases
-+ test_all_modes in_merge_bases
- '
-
- test_expect_success 'in_merge_bases:miss' '
-@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases:miss' '
- B:commit-5-9
- EOF
- echo "in_merge_bases(A,B):0" >expect &&
-- test_three_modes in_merge_bases
-+ test_all_modes in_merge_bases
- '
-
- test_expect_success 'in_merge_bases_many:hit' '
-@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases_many:hit' '
- X:commit-5-7
- EOF
- echo "in_merge_bases_many(A,X):1" >expect &&
-- test_three_modes in_merge_bases_many
-+ test_all_modes in_merge_bases_many
- '
-
- test_expect_success 'in_merge_bases_many:miss' '
-@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases_many:miss' '
- X:commit-8-6
- EOF
- echo "in_merge_bases_many(A,X):0" >expect &&
-- test_three_modes in_merge_bases_many
-+ test_all_modes in_merge_bases_many
- '
-
- test_expect_success 'in_merge_bases_many:miss-heuristic' '
-@@ t/t6600-test-reach.sh: test_expect_success 'in_merge_bases_many:miss-heuristic' '
- X:commit-6-6
- EOF
- echo "in_merge_bases_many(A,X):0" >expect &&
-- test_three_modes in_merge_bases_many
-+ test_all_modes in_merge_bases_many
- '
-
- test_expect_success 'is_descendant_of:hit' '
-@@ t/t6600-test-reach.sh: test_expect_success 'is_descendant_of:hit' '
- X:commit-1-1
- EOF
- echo "is_descendant_of(A,X):1" >expect &&
-- test_three_modes is_descendant_of
-+ test_all_modes is_descendant_of
- '
-
- test_expect_success 'is_descendant_of:miss' '
-@@ t/t6600-test-reach.sh: test_expect_success 'is_descendant_of:miss' '
- X:commit-7-6
- EOF
- echo "is_descendant_of(A,X):0" >expect &&
-- test_three_modes is_descendant_of
-+ test_all_modes is_descendant_of
- '
-
- test_expect_success 'get_merge_bases_many' '
-@@ t/t6600-test-reach.sh: test_expect_success 'get_merge_bases_many' '
- git rev-parse commit-5-6 \
- commit-4-7 | sort
- } >expect &&
-- test_three_modes get_merge_bases_many
-+ test_all_modes get_merge_bases_many
- '
-
- test_expect_success 'reduce_heads' '
-@@ t/t6600-test-reach.sh: test_expect_success 'reduce_heads' '
- commit-2-8 \
- commit-1-10 | sort
- } >expect &&
-- test_three_modes reduce_heads
-+ test_all_modes reduce_heads
- '
-
- test_expect_success 'can_all_from_reach:hit' '
-@@ t/t6600-test-reach.sh: test_expect_success 'can_all_from_reach:hit' '
- Y:commit-8-1
- EOF
- echo "can_all_from_reach(X,Y):1" >expect &&
-- test_three_modes can_all_from_reach
-+ test_all_modes can_all_from_reach
- '
-
- test_expect_success 'can_all_from_reach:miss' '
-@@ t/t6600-test-reach.sh: test_expect_success 'can_all_from_reach:miss' '
- Y:commit-8-5
- EOF
- echo "can_all_from_reach(X,Y):0" >expect &&
-- test_three_modes can_all_from_reach
-+ test_all_modes can_all_from_reach
- '
-
- test_expect_success 'can_all_from_reach_with_flag: tags case' '
-@@ t/t6600-test-reach.sh: test_expect_success 'can_all_from_reach_with_flag: tags case' '
- Y:commit-8-1
- EOF
- echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
-- test_three_modes can_all_from_reach_with_flag
-+ test_all_modes can_all_from_reach_with_flag
- '
-
- test_expect_success 'commit_contains:hit' '
-@@ t/t6600-test-reach.sh: test_expect_success 'commit_contains:hit' '
- X:commit-9-3
- EOF
- echo "commit_contains(_,A,X,_):1" >expect &&
-- test_three_modes commit_contains &&
-- test_three_modes commit_contains --tag
-+ test_all_modes commit_contains &&
-+ test_all_modes commit_contains --tag
- '
-
- test_expect_success 'commit_contains:miss' '
-@@ t/t6600-test-reach.sh: test_expect_success 'commit_contains:miss' '
- X:commit-9-3
- EOF
- echo "commit_contains(_,A,X,_):0" >expect &&
-- test_three_modes commit_contains &&
-- test_three_modes commit_contains --tag
-+ test_all_modes commit_contains &&
-+ test_all_modes commit_contains --tag
- '
-
- test_expect_success 'rev-list: basic topo-order' '
-@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: basic topo-order' '
- commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
- commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
- >expect &&
-- run_three_modes git rev-list --topo-order commit-6-6
-+ run_all_modes git rev-list --topo-order commit-6-6
- '
-
- test_expect_success 'rev-list: first-parent topo-order' '
-@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: first-parent topo-order' '
- commit-6-2 \
- commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
- >expect &&
-- run_three_modes git rev-list --first-parent --topo-order commit-6-6
-+ run_all_modes git rev-list --first-parent --topo-order commit-6-6
- '
-
- test_expect_success 'rev-list: range topo-order' '
-@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: range topo-order' '
- commit-6-2 commit-5-2 commit-4-2 \
- commit-6-1 commit-5-1 commit-4-1 \
- >expect &&
-- run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
-+ run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
- '
-
- test_expect_success 'rev-list: range topo-order' '
-@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: range topo-order' '
- commit-6-2 commit-5-2 commit-4-2 \
- commit-6-1 commit-5-1 commit-4-1 \
- >expect &&
-- run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
-+ run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
- '
-
- test_expect_success 'rev-list: first-parent range topo-order' '
-@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: first-parent range topo-order' '
- commit-6-2 \
- commit-6-1 commit-5-1 commit-4-1 \
- >expect &&
-- run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
-+ run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
- '
-
- test_expect_success 'rev-list: ancestry-path topo-order' '
-@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: ancestry-path topo-order' '
- commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
- commit-6-3 commit-5-3 commit-4-3 \
- >expect &&
-- run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
-+ run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
- '
-
- test_expect_success 'rev-list: symmetric difference topo-order' '
-@@ t/t6600-test-reach.sh: test_expect_success 'rev-list: symmetric difference topo-order' '
- commit-3-8 commit-2-8 commit-1-8 \
- commit-3-7 commit-2-7 commit-1-7 \
- >expect &&
-- run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
-+ run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
- '
-
- test_expect_success 'get_reachable_subset:all' '
-@@ t/t6600-test-reach.sh: test_expect_success 'get_reachable_subset:all' '
- commit-1-7 \
- commit-5-6 | sort
- ) >expect &&
-- test_three_modes get_reachable_subset
-+ test_all_modes get_reachable_subset
- '
-
- test_expect_success 'get_reachable_subset:some' '
-@@ t/t6600-test-reach.sh: test_expect_success 'get_reachable_subset:some' '
- git rev-parse commit-3-3 \
- commit-1-7 | sort
- ) >expect &&
-- test_three_modes get_reachable_subset
-+ test_all_modes get_reachable_subset
- '
-
- test_expect_success 'get_reachable_subset:none' '
-@@ t/t6600-test-reach.sh: test_expect_success 'get_reachable_subset:none' '
- Y:commit-2-8
- EOF
- echo "get_reachable_subset(X,Y)" >expect &&
-- test_three_modes get_reachable_subset
-+ test_all_modes get_reachable_subset
- '
-
- test_done
+
+ ## t/test-lib-functions.sh ##
+@@ t/test-lib-functions.sh: test_commit () {
+ --signoff)
+ signoff="$1"
+ ;;
++ --date)
++ notick=yes
++ GIT_COMMITTER_DATE="$2"
++ GIT_AUTHOR_DATE="$2"
++ shift
++ ;;
+ -C)
+ indir="$2"
+ shift
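A note on the overflow handling touched by the range-diff entry above: the encoding can be sketched as a small stand-alone C example. This is a minimal sketch, not the code from the patch. The helper names encode_gdat_entry() and decode_gdat_entry() are hypothetical, and the numeric values given to the two macros are assumptions chosen to match the "31 bits" wording in the format documentation and the test comment saying that an offset of 2 ^ 31 is just large enough to overflow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Macro names are taken from the diff; the numeric values are assumptions
 * based on the "cannot be stored within 31 bits" wording.
 */
#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (UINT32_C(1) << 31)
#define GENERATION_NUMBER_V2_OFFSET_MAX ((UINT64_C(1) << 31) - 1)

/*
 * Hypothetical helper: produce the 4-byte Generation Data entry for one
 * commit. Offsets that fit in 31 bits are stored directly. Larger offsets
 * are replaced by the overflow bit plus the position of the next free slot
 * in the Generation Data Overflow chunk; the caller stores the 8-byte
 * value in that slot.
 */
static uint32_t encode_gdat_entry(uint64_t offset, uint32_t *next_overflow_pos)
{
	if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
		uint32_t pos = (*next_overflow_pos)++;

		/* The position itself must fit in the remaining 31 bits. */
		assert(!(pos & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW));
		return CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | pos;
	}
	return (uint32_t)offset;
}

/*
 * Hypothetical helper: recover the offset from a 4-byte entry, looking it
 * up in the overflow array when the most-significant bit is set.
 */
static uint64_t decode_gdat_entry(uint32_t entry, const uint64_t *overflow)
{
	if (entry & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
		uint32_t pos = entry ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;

		return overflow[pos];
	}
	return entry;
}
```

The reader side mirrors the `offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW` step visible in fill_commit_graph_info(); the writer side is only an illustration of the counting done through num_generation_data_overflows.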
8: 8ec119edc66 ! 9: a3a70a1edd0 commit-graph: use generation v2 only if entire chain does
@@ Commit message
1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
- Because of the current use of inspecting the current layer for a
- chunk_generation_data pointer, the commits in the lower layer will be
- interpreted as having very large generation values (commit date plus
- offset) compared to the generation numbers in the top layer (topological
- level). This violates the expectation that the generation of a parent is
- strictly smaller than the generation of a child.
+ If each layer of split commit-graph is treated independently, as was
+ the case before this commit, with Git inspecting only the current layer
+ for chunk_generation_data pointer, commits in the lower layer (one with
+ GDAT) would have corrected commit date as their generation number,
+ while commits in the upper layer would have topological levels as their
+ generation. Corrected commit dates usually have much larger values than
+ topological levels. This means that if we take two commits, one from the
+ upper layer, and one reachable from it in the lower layer, then the
+ expectation that the generation of a parent is smaller than the
+ generation of a child would be violated.
It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
@@ Commit message
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).
- When writing the new layer in split commit-graph, we write a GDAT chunk
- only if the topmost layer has a GDAT chunk. This guarantees that if a
- layer has GDAT chunk, all lower layers must have a GDAT chunk as well.
+ Therefore, when writing the new layer in split commit-graph, we write a
+ GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
+ that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
+ as well.
Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
@@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
item->date = (timestamp_t)((date_high << 32) | date_low);
- if (g->chunk_generation_data) {
-+ if (g->chunk_generation_data && g->read_generation_data) {
- offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
++ if (g->read_generation_data) {
+ offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
@@ commit-graph.c: static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
- }
- }
+ if (i < ctx->num_commit_graphs_after)
+ ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
-+ if (!ctx->write_generation_data && g->chunk_generation_data)
-+ ctx->write_generation_data = 1;
++ /*
++ * If the topmost remaining layer has generation data chunk, the
++ * resultant layer also has generation data chunk.
++ */
++ if (i == ctx->num_commit_graphs_after - 2)
++ ctx->write_generation_data = !!g->chunk_generation_data;
+
- if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
- ctx->new_base_graph = g;
- else if (ctx->num_commit_graphs_after != 1)
+ i--;
+ g = g->base_graph;
+ }
@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
struct commit_graph *g = ctx->r->objects->commit_graph;
@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
g->topo_levels = &topo_levels;
g = g->base_graph;
}
-@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
-
- g = ctx->r->objects->commit_graph;
+@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
+ * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
+ * in the following condition.
+ */
+- if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
++ if (!g->read_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
+ max_generation--;
-+ if (g && !g->chunk_generation_data)
-+ ctx->write_generation_data = 0;
-+
- while (g) {
- ctx->num_commit_graphs_before++;
- g = g->base_graph;
-@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
-
- if (ctx->opts)
- replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
-+
-+ if (replace)
-+ ctx->write_generation_data = 1;
- }
-
- ctx->approx_nr_objects = approximate_object_count();
+ generation = commit_graph_generation(graph_commit);
## commit-graph.h ##
@@ commit-graph.h: struct commit_graph {
@@ commit-graph.h: struct commit_graph {
const uint32_t *chunk_oid_fanout;
## t/t5324-split-commit-graph.sh ##
-@@ t/t5324-split-commit-graph.sh: test_expect_success '--split=replace with partial Bloom data' '
- verify_chain_files_exist $graphdir
+@@ t/t5324-split-commit-graph.sh: test_expect_success 'prevent regression for duplicate commits across layers' '
+ git -C dup commit-graph verify
'
++NUM_FIRST_LAYER_COMMITS=64
++NUM_SECOND_LAYER_COMMITS=16
++NUM_THIRD_LAYER_COMMITS=7
++NUM_FOURTH_LAYER_COMMITS=8
++NUM_FIFTH_LAYER_COMMITS=16
++SECOND_LAYER_SEQUENCE_START=$(($NUM_FIRST_LAYER_COMMITS + 1))
++SECOND_LAYER_SEQUENCE_END=$(($SECOND_LAYER_SEQUENCE_START + $NUM_SECOND_LAYER_COMMITS - 1))
++THIRD_LAYER_SEQUENCE_START=$(($SECOND_LAYER_SEQUENCE_END + 1))
++THIRD_LAYER_SEQUENCE_END=$(($THIRD_LAYER_SEQUENCE_START + $NUM_THIRD_LAYER_COMMITS - 1))
++FOURTH_LAYER_SEQUENCE_START=$(($THIRD_LAYER_SEQUENCE_END + 1))
++FOURTH_LAYER_SEQUENCE_END=$(($FOURTH_LAYER_SEQUENCE_START + $NUM_FOURTH_LAYER_COMMITS - 1))
++FIFTH_LAYER_SEQUENCE_START=$(($FOURTH_LAYER_SEQUENCE_END + 1))
++FIFTH_LAYER_SEQUENCE_END=$(($FIFTH_LAYER_SEQUENCE_START + $NUM_FIFTH_LAYER_COMMITS - 1))
++
++# Current split graph chain:
++#
++# 16 commits (No GDAT)
++# ------------------------
++# 64 commits (GDAT)
++#
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
-+ mkdir mixed &&
+ graphdir=".git/objects/info/commit-graphs" &&
-+ test_oid_cache <<-EOM &&
++ test_oid_cache <<-EOF &&
+ oid_version sha1:1
+ oid_version sha256:2
-+ EOM
-+ cd "$TRASH_DIRECTORY/mixed" &&
-+ git init &&
-+ git config core.commitGraph true &&
-+ git config gc.writeCommitGraph false &&
-+ for i in $(test_seq 3)
-+ do
-+ test_commit $i &&
-+ git branch commits/$i || return 1
-+ done &&
-+ git reset --hard commits/1 &&
-+ for i in $(test_seq 4 5)
-+ do
-+ test_commit $i &&
-+ git branch commits/$i || return 1
-+ done &&
-+ git reset --hard commits/2 &&
-+ for i in $(test_seq 6 10)
-+ do
-+ test_commit $i &&
-+ git branch commits/$i || return 1
-+ done &&
-+ git commit-graph write --reachable --split &&
-+ git reset --hard commits/2 &&
-+ git merge commits/4 &&
-+ git branch merge/1 &&
-+ git reset --hard commits/4 &&
-+ git merge commits/6 &&
-+ git branch merge/2 &&
-+ GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
-+ test-tool read-graph >output &&
-+ cat >expect <<-EOF &&
-+ header: 43475048 1 $(test_oid oid_version) 4 1
-+ num_commits: 2
-+ chunks: oid_fanout oid_lookup commit_metadata
+ EOF
-+ test_cmp expect output &&
-+ git commit-graph verify
++ git init mixed &&
++ (
++ cd mixed &&
++ git config core.commitGraph true &&
++ git config gc.writeCommitGraph false &&
++ for i in $(test_seq $NUM_FIRST_LAYER_COMMITS)
++ do
++ test_commit $i &&
++ git branch commits/$i || return 1
++ done &&
++ git commit-graph write --reachable --split &&
++ graph_read_expect $NUM_FIRST_LAYER_COMMITS &&
++ test_line_count = 1 $graphdir/commit-graph-chain &&
++ for i in $(test_seq $SECOND_LAYER_SEQUENCE_START $SECOND_LAYER_SEQUENCE_END)
++ do
++ test_commit $i &&
++ git branch commits/$i || return 1
++ done &&
++ GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
++ test_line_count = 2 $graphdir/commit-graph-chain &&
++ test-tool read-graph >output &&
++ cat >expect <<-EOF &&
++ header: 43475048 1 $(test_oid oid_version) 4 1
++ num_commits: $NUM_SECOND_LAYER_COMMITS
++ chunks: oid_fanout oid_lookup commit_metadata
++ EOF
++ test_cmp expect output &&
++ git commit-graph verify &&
++ cat $graphdir/commit-graph-chain
++ )
+'
+
-+test_expect_success 'does not write generation data chunk if not present on existing tip' '
-+ cd "$TRASH_DIRECTORY/mixed" &&
-+ git reset --hard commits/3 &&
-+ git merge merge/1 &&
-+ git merge commits/5 &&
-+ git merge merge/2 &&
-+ git branch merge/3 &&
-+ git commit-graph write --reachable --split=no-merge &&
-+ test-tool read-graph >output &&
-+ cat >expect <<-EOF &&
-+ header: 43475048 1 $(test_oid oid_version) 4 2
-+ num_commits: 3
-+ chunks: oid_fanout oid_lookup commit_metadata
-+ EOF
-+ test_cmp expect output &&
-+ git commit-graph verify
++# The new layer will be added without generation data chunk as it was not
++# present on the layer underneath it.
++#
++# 7 commits (No GDAT)
++# ------------------------
++# 16 commits (No GDAT)
++# ------------------------
++# 64 commits (GDAT)
++#
++test_expect_success 'do not write generation data chunk if not present on existing tip' '
++ git clone mixed mixed-no-gdat &&
++ (
++ cd mixed-no-gdat &&
++ for i in $(test_seq $THIRD_LAYER_SEQUENCE_START $THIRD_LAYER_SEQUENCE_END)
++ do
++ test_commit $i &&
++ git branch commits/$i || return 1
++ done &&
++ git commit-graph write --reachable --split=no-merge &&
++ test_line_count = 3 $graphdir/commit-graph-chain &&
++ test-tool read-graph >output &&
++ cat >expect <<-EOF &&
++ header: 43475048 1 $(test_oid oid_version) 4 2
++ num_commits: $NUM_THIRD_LAYER_COMMITS
++ chunks: oid_fanout oid_lookup commit_metadata
++ EOF
++ test_cmp expect output &&
++ git commit-graph verify
++ )
++'
++
++# Number of commits in each layer of the split-commit graph before merge:
++#
++# 8 commits (No GDAT)
++# ------------------------
++# 7 commits (No GDAT)
++# ------------------------
++# 16 commits (No GDAT)
++# ------------------------
++# 64 commits (GDAT)
++#
++# The top two layers are merged and do not have generation data chunk as layer below them does
++# not have generation data chunk.
++#
++# 15 commits (No GDAT)
++# ------------------------
++# 16 commits (No GDAT)
++# ------------------------
++# 64 commits (GDAT)
++#
++test_expect_success 'do not write generation data chunk if the topmost remaining layer does not have generation data chunk' '
++ git clone mixed-no-gdat mixed-merge-no-gdat &&
++ (
++ cd mixed-merge-no-gdat &&
++ for i in $(test_seq $FOURTH_LAYER_SEQUENCE_START $FOURTH_LAYER_SEQUENCE_END)
++ do
++ test_commit $i &&
++ git branch commits/$i || return 1
++ done &&
++ git commit-graph write --reachable --split --size-multiple 1 &&
++ test_line_count = 3 $graphdir/commit-graph-chain &&
++ test-tool read-graph >output &&
++ cat >expect <<-EOF &&
++ header: 43475048 1 $(test_oid oid_version) 4 2
++ num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
++ chunks: oid_fanout oid_lookup commit_metadata
++ EOF
++ test_cmp expect output &&
++ git commit-graph verify
++ )
+'
+
-+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
-+ cd "$TRASH_DIRECTORY/mixed" &&
-+ git commit-graph write --reachable --split=replace &&
-+ test_path_is_file $graphdir/commit-graph-chain &&
-+ test_line_count = 1 $graphdir/commit-graph-chain &&
-+ verify_chain_files_exist $graphdir &&
-+ graph_read_expect 15 &&
-+ git commit-graph verify
++# Number of commits in each layer of the split-commit graph before merge:
++#
++# 16 commits (No GDAT)
++# ------------------------
++# 15 commits (No GDAT)
++# ------------------------
++# 16 commits (No GDAT)
++# ------------------------
++# 64 commits (GDAT)
++#
++# The top three layers are merged and have generation data chunk as the topmost remaining layer
++# has generation data chunk.
++#
++# 47 commits (GDAT)
++# ------------------------
++# 64 commits (GDAT)
++#
++test_expect_success 'write generation data chunk if topmost remaining layer has generation data chunk' '
++ git clone mixed-merge-no-gdat mixed-merge-gdat &&
++ (
++ cd mixed-merge-gdat &&
++ for i in $(test_seq $FIFTH_LAYER_SEQUENCE_START $FIFTH_LAYER_SEQUENCE_END)
++ do
++ test_commit $i &&
++ git branch commits/$i || return 1
++ done &&
++ git commit-graph write --reachable --split --size-multiple 1 &&
++ test_line_count = 2 $graphdir/commit-graph-chain &&
++ test-tool read-graph >output &&
++ cat >expect <<-EOF &&
++ header: 43475048 1 $(test_oid oid_version) 5 1
++ num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
++ chunks: oid_fanout oid_lookup commit_metadata generation_data
++ EOF
++ test_cmp expect output
++ )
+'
+
-+test_expect_success 'add one commit, write a tip graph' '
-+ cd "$TRASH_DIRECTORY/mixed" &&
-+ test_commit 11 &&
-+ git branch commits/11 &&
-+ git commit-graph write --reachable --split &&
-+ test_path_is_missing $infodir/commit-graph &&
-+ test_path_is_file $graphdir/commit-graph-chain &&
-+ ls $graphdir/graph-*.graph >graph-files &&
-+ test_line_count = 2 graph-files &&
-+ verify_chain_files_exist $graphdir
++test_expect_success 'write generation data chunk when commit-graph chain is replaced' '
++ git clone mixed mixed-replace &&
++ (
++ cd mixed-replace &&
++ git commit-graph write --reachable --split=replace &&
++ test_path_is_file $graphdir/commit-graph-chain &&
++ test_line_count = 1 $graphdir/commit-graph-chain &&
++ verify_chain_files_exist $graphdir &&
++ graph_read_expect $(($NUM_FIRST_LAYER_COMMITS + $NUM_SECOND_LAYER_COMMITS)) &&
++ git commit-graph verify
++ )
+'
+
test_done
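The layering rule exercised by the tests above can be sketched in stand-alone C. This is a minimal sketch under stated assumptions: `struct layer`, use_generation_data_if_whole_chain_has_it() and should_write_gdat() are hypothetical names introduced for illustration and are not the functions from the patch. The sketch only models the two rules stated in the commit message: corrected commit dates are used only if every layer of the chain has a GDAT chunk, and a newly written layer gets a GDAT chunk only if the topmost remaining layer has one (or no layer remains, as with --split=replace).

```c
#include <stddef.h>

/* Hypothetical stand-in for one layer of a split commit-graph chain. */
struct layer {
	int has_gdat;             /* layer file contains a GDAT chunk */
	int read_generation_data; /* reader may use corrected commit dates */
	struct layer *base;       /* next lower layer, NULL at the bottom */
};

/*
 * Reader side: enable corrected commit dates only if the entire chain
 * has GDAT; otherwise every layer falls back to topological levels.
 */
static void use_generation_data_if_whole_chain_has_it(struct layer *top)
{
	struct layer *g;
	int all_have_gdat = 1;

	for (g = top; g; g = g->base)
		if (!g->has_gdat)
			all_have_gdat = 0;

	for (g = top; g; g = g->base)
		g->read_generation_data = all_have_gdat;
}

/*
 * Writer side: the new layer carries GDAT only when the topmost layer
 * that remains underneath it does. With no remaining layer (the whole
 * chain is rewritten) GDAT is written.
 */
static int should_write_gdat(const struct layer *topmost_remaining)
{
	return !topmost_remaining || topmost_remaining->has_gdat;
}
```

With the chain from the first test (64 commits with GDAT below 16 commits without), should_write_gdat() is false for a layer stacked on top, and true once the merge reaches down to the 64-commit layer.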
9: bb9b02af32d ! 10: 093101f908b commit-reach: use corrected commit dates in paint_down_to_common()
@@ Metadata
## Commit message ##
commit-reach: use corrected commit dates in paint_down_to_common()
- With corrected commit dates implemented, we no longer have to rely on
- commit date as a heuristic in paint_down_to_common().
+ 091f4cf (commit: don't use generation numbers if not needed,
+ 2018-08-30) changed paint_down_to_common() to use commit dates instead
+ of generation numbers v1 (topological levels) as the performance
+ regressed on certain topologies. With generation number v2 (corrected
+ commit dates) implemented, we no longer have to rely on commit dates and
+ can use generation numbers.
- While using corrected commit dates Git walks nearly the same number of
+ For example, the command `git merge-base v4.8 v4.9` on the Linux
+ repository walks 167468 commits, taking 0.135s for committer date and
+ 167496 commits, taking 0.157s for corrected committer date respectively.
+
+ While using corrected commit dates, Git walks nearly the same number of
commits as commit date, the process is slower as for each comparison we
have to access a commit-slab (for corrected committer date) instead of
accessing struct member (for committer date).
- For example, the command `git merge-base v4.8 v4.9` on the linux
- repository walks 167468 commits, taking 0.135s for committer date and
- 167496 commits, taking 0.157s for corrected committer date respectively.
-
- t6404-recursive-merge setups a unique repository where all commits have
- the same committer date without well-defined merge-base.
+ This change incidentally broke the fragile t6404-recursive-merge test.
+ t6404-recursive-merge sets up a unique repository where all commits have
+ the same committer date without a well-defined merge-base.
While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- - Merge C with B to form a intermediate commit.
+ - Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.
With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- - Merge A with B to form a intermediate commit.
+ - Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.
While resulting repositories are equivalent, 6404.4 'virtual trees were
@@ commit-graph.c: int generation_numbers_enabled(struct repository *r)
struct commit_graph *g = r->objects->commit_graph;
## commit-graph.h ##
-@@ commit-graph.h: struct commit_graph *read_commit_graph_one(struct repository *r,
- struct commit_graph *parse_commit_graph(struct repository *r,
- void *graph_map, size_t graph_size);
-
-+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
-+
- /*
- * Return 1 if and only if the repository has a commit-graph
- * file and generation numbers are computed in that file.
+@@ commit-graph.h: struct commit_graph *parse_commit_graph(struct repository *r,
*/
int generation_numbers_enabled(struct repository *r);
--struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
++
+ struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
enum commit_graph_write_flags {
- COMMIT_GRAPH_WRITE_APPEND = (1 << 0),
## commit-reach.c ##
@@ commit-reach.c: static struct commit_list *paint_down_to_common(struct repository *r,
10: 9ada43967d2 ! 11: 20299e57457 doc: add corrected commit date info
@@ Documentation/technical/commit-graph-format.txt: CHUNK DATA:
2 bits of the lowest byte, storing the 33rd and 34th bit of the
commit time.
-+ Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes)
++ Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+ * This list of 4-byte values stores corrected commit date offsets for the
+ commits, arranged in the same order as commit data chunk.
+ * If the corrected commit date offset cannot be stored within 31 bits,
+ the value has its most-significant bit on and the other bits store
+ the position of corrected commit date into the Generation Data Overflow
+ chunk.
++ * Generation Data chunk is present only when commit-graph file is written
++ by compatible versions of Git and in case of split commit-graph chains,
++ the topmost layer also has Generation Data chunk.
+
+ Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
-+ * This list of 8-byte values stores the corrected commit dates for commits
-+ with corrected commit date offsets that cannot be stored within 31 bits.
++ * This list of 8-byte values stores the corrected commit date offsets
++ for commits with corrected commit date offsets that cannot be
++ stored within 31 bits.
++ * Generation Data Overflow chunk is present only when Generation Data
++ chunk is present and at least one corrected commit date offset cannot
++ be stored within 31 bits.
+
Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
This list of 4-byte values store the second through nth parents for
@@ Documentation/technical/commit-graph.txt: A consumer may load the following info
- * A commit with at least one parent has generation number one more than
- the largest generation number among its parents.
-+ * A commit with no parents (a root commit) has corrected committer date
++ * A commit with no parents (a root commit) has corrected committer date
+ equal to its committer date.
-Equivalently, the generation number of a commit A is one more than the
-+ * A commit with at least one parent has corrected committer date equal to
++ * A commit with at least one parent has corrected committer date equal to
+ the maximum of its committer date and one more than the largest corrected
+ committer date among its parents.
+
-+ * As a special case, a root commit with timestamp zero has corrected commit
++ * As a special case, a root commit with timestamp zero has corrected commit
+ date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+ (that is, an uncomputed corrected commit date).
+
@@ Documentation/technical/commit-graph.txt: is easier to use for computation and o
generation numbers, then we always expand the boundary commit with highest
generation number and can easily detect the stopping condition.
-+The properties applies to both versions of generation number, that is both
++The property applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
This property can be used to significantly reduce the time it takes to
@@ Documentation/technical/commit-graph.txt: fully-computed generation numbers. Usi
with generation number *_INFINITY or *_ZERO is valuable.
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-+We use the macro GENERATION_NUMBER_MAX for commits whose
- generation numbers are computed to be at least this value. We limit at
- this value since it is the largest value that can be stored in the
- commit-graph file using the 30 bits available to generation numbers. This
+-generation numbers are computed to be at least this value. We limit at
+-this value since it is the largest value that can be stored in the
+-commit-graph file using the 30 bits available to generation numbers. This
+-presents another case where a commit can have generation number equal to
+-that of a parent.
++We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
++topological levels (generation number v1) are computed to be at least
++this value. We limit at this value since it is the largest value that
++can be stored in the commit-graph file using the 30 bits available
++to topological levels. This presents another case where a commit can
++have generation number equal to that of a parent.
+
+ Design Details
+ --------------
@@ Documentation/technical/commit-graph.txt: The merge strategy values (2 for the size multiple, 64,000 for the maximum
number of commits) could be extracted into config settings for full
flexibility.
@@ Documentation/technical/commit-graph.txt: The merge strategy values (2 for the s
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
-+levels of the higher layer. For this reason, Git inspects each layer to
-+see if any layer is missing corrected commit dates. In such a case, Git
-+only uses topological level
++levels of the higher layer. For this reason, Git inspects the topmost
++layer to see if the layer is missing corrected commit dates. In such a case
++Git only uses topological level for generation numbers.
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
@@ Documentation/technical/commit-graph.txt: The merge strategy values (2 for the s
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
-+commit dates. Instead, the new layer will have corrected commit dates if and
-+only if all existing layers below the new layer have corrected commit dates.
++commit dates. Instead, the new layer will have corrected commit dates if the
++layer below the new layer has corrected commit dates.
++
++While writing or merging layers, if the new layer is the only layer, it will
++have corrected commit dates when written by compatible versions of Git. Thus,
++rewriting split commit-graph as a single file (`--split=replace`) creates a
++single layer with corrected commit dates.
+
## Deleting graph-{hash} files
--
gitgitgadget
^ permalink raw reply [flat|nested] 211+ messages in thread
* [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:15 ` Abhishek Kumar via GitGitGadget
2020-12-30 1:35 ` Derrick Stolee
2021-01-05 9:45 ` SZEDER Gábor
2020-12-28 11:15 ` [PATCH v5 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
` (11 subsequent siblings)
12 siblings, 2 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:15 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Before computing Bloom fitlers, the commit-graph machinery uses
commit_gen_cmp to sort commits by generation order for improved diff
performance. 3d11275505 (commit-graph: examine commits by generation
number, 2020-03-30) claims that this sort can reduce the time spent to
compute Bloom filters by nearly half.
But since c49c82aa4c (commit: move members graph_pos, generation to a
slab, 2020-06-17), this optimization is broken, since asking for a
'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
while writing.
Not all hope is lost, though: 'commit_graph_generation()' falls back to
comparing commits by their date when they have equal generation number,
and so since c49c82aa4c is purely a date comparision function. This
heuristic is good enough that we don't seem to loose appreciable
performance while computing Bloom filters. Applying this patch (compared
with v2.29.1) speeds up computing Bloom filters by around ~4
seconds.
So, avoid the useless 'commit_graph_generation()' while writing by
instead accessing the slab directly. This returns the newly-computed
generation numbers, and allows us to avoid the heuristic by directly
comparing generation numbers.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 06f8dc1d896..caf823295f4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
const struct commit *a = *(const struct commit **)va;
const struct commit *b = *(const struct commit **)vb;
- uint32_t generation_a = commit_graph_generation(a);
- uint32_t generation_b = commit_graph_generation(b);
+ uint32_t generation_a = commit_graph_data_at(a)->generation;
+ uint32_t generation_b = commit_graph_data_at(b)->generation;
/* lower generation commits first */
if (generation_a < generation_b)
return -1;
--
gitgitgadget
^ permalink raw reply related [flat|nested] 211+ messages in thread
* Re: [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters
2020-12-28 11:15 ` [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
@ 2020-12-30 1:35 ` Derrick Stolee
2021-01-08 5:45 ` Abhishek Kumar
2021-01-05 9:45 ` SZEDER Gábor
1 sibling, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-12-30 1:35 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget, git
Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar
On 12/28/2020 6:15 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Before computing Bloom fitlers, the commit-graph machinery uses
s/fitlers/filters/
> commit_gen_cmp to sort commits by generation order for improved diff
> performance. 3d11275505 (commit-graph: examine commits by generation
> number, 2020-03-30) claims that this sort can reduce the time spent to
> compute Bloom filters by nearly half.
>
> But since c49c82aa4c (commit: move members graph_pos, generation to a
> slab, 2020-06-17), this optimization is broken, since asking for a
> 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
> while writing.
>
> Not all hope is lost, though: 'commit_graph_generation()' falls back to
> comparing commits by their date when they have equal generation number,
> and so since c49c82aa4c is purely a date comparision function. This
s/comparision/comparison/
> heuristic is good enough that we don't seem to loose appreciable
> performance while computing Bloom filters. Applying this patch (compared
> with v2.29.1) speeds up computing Bloom filters by around ~4
> seconds.
Using "~4 seconds" here is odd since there is no baseline. Which
repository did you use?
Previous discussion used relative terms. Something like "speeds up by
a factor of 1.25" or something might be interesting.
> So, avoid the useless 'commit_graph_generation()' while writing by
> instead accessing the slab directly. This returns the newly-computed
> generation numbers, and allows us to avoid the heuristic by directly
> comparing generation numbers.
This introduces some timing restrictions to the ability for this
comparison function. It would be dangerous if someone extracted
the method for another purpose. A comment above these lines could
warn future developers from making that mistake, but they would
probably use the comparison functions in commit.c instead.
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
> commit-graph.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 06f8dc1d896..caf823295f4 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
> const struct commit *a = *(const struct commit **)va;
> const struct commit *b = *(const struct commit **)vb;
>
> - uint32_t generation_a = commit_graph_generation(a);
> - uint32_t generation_b = commit_graph_generation(b);
> + uint32_t generation_a = commit_graph_data_at(a)->generation;
> + uint32_t generation_b = commit_graph_data_at(b)->generation;
> /* lower generation commits first */
> if (generation_a < generation_b)
> return -1;
>
^ permalink raw reply [flat|nested] 211+ messages in thread
* Re: [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters
2020-12-30 1:35 ` Derrick Stolee
@ 2021-01-08 5:45 ` Abhishek Kumar
0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-08 5:45 UTC (permalink / raw)
To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me
On Tue, Dec 29, 2020 at 08:35:56PM -0500, Derrick Stolee wrote:
> On 12/28/2020 6:15 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > Before computing Bloom fitlers, the commit-graph machinery uses
>
> s/fitlers/filters/
>
> > commit_gen_cmp to sort commits by generation order for improved diff
> > performance. 3d11275505 (commit-graph: examine commits by generation
> > number, 2020-03-30) claims that this sort can reduce the time spent to
> > compute Bloom filters by nearly half.
> >
> > But since c49c82aa4c (commit: move members graph_pos, generation to a
> > slab, 2020-06-17), this optimization is broken, since asking for a
> > 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
> > while writing.
> >
> > Not all hope is lost, though: 'commit_graph_generation()' falls back to
> > comparing commits by their date when they have equal generation number,
> > and so since c49c82aa4c is purely a date comparision function. This
>
> s/comparision/comparison/
>
> > heuristic is good enough that we don't seem to loose appreciable
> > performance while computing Bloom filters. Applying this patch (compared
> > with v2.29.1) speeds up computing Bloom filters by around ~4
> > seconds.
>
> Using "~4 seconds" here is odd since there is no baseline. Which
> repository did you use?
>
I used the linux repository, will mention that.
> Previous discussion used relative terms. Something like "speeds up by
> a factor of 1.25" or something might be interesting.
>
As SZEDER Gábor found, the improvements are rather minor - ranging from
0.40% to 5.19% [1]. I want to make sure this is the correct way to word
in the commit message:
Applying this patch (compared with v2.30.0) speeds up computing Bloom
filters by factors ranging from 0.40% to 5.19% on various
repositories.
[1] https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/
> > So, avoid the useless 'commit_graph_generation()' while writing by
> > instead accessing the slab directly. This returns the newly-computed
> > generation numbers, and allows us to avoid the heuristic by directly
> > comparing generation numbers.
>
> This introduces some timing restrictions to the ability for this
> comparison function. It would be dangerous if someone extracted
> the method for another purpose. A comment above these lines could
> warn future developers from making that mistake, but they would
> probably use the comparison functions in commit.c instead.
>
Sure, will add a comment above.
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> > commit-graph.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 06f8dc1d896..caf823295f4 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
> > const struct commit *a = *(const struct commit **)va;
> > const struct commit *b = *(const struct commit **)vb;
> >
> > - uint32_t generation_a = commit_graph_generation(a);
> > - uint32_t generation_b = commit_graph_generation(b);
> > + uint32_t generation_a = commit_graph_data_at(a)->generation;
> > + uint32_t generation_b = commit_graph_data_at(b)->generation;
> > /* lower generation commits first */
> > if (generation_a < generation_b)
> > return -1;
> >
>
^ permalink raw reply [flat|nested] 211+ messages in thread
* Re: [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters
2020-12-28 11:15 ` [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
2020-12-30 1:35 ` Derrick Stolee
@ 2021-01-05 9:45 ` SZEDER Gábor
2021-01-05 9:47 ` SZEDER Gábor
2021-01-08 5:51 ` Abhishek Kumar
1 sibling, 2 replies; 211+ messages in thread
From: SZEDER Gábor @ 2021-01-05 9:45 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar
On Mon, Dec 28, 2020 at 11:15:58AM +0000, Abhishek Kumar via GitGitGadget wrote:
> Before computing Bloom fitlers, the commit-graph machinery uses
> commit_gen_cmp to sort commits by generation order for improved diff
> performance. 3d11275505 (commit-graph: examine commits by generation
> number, 2020-03-30) claims that this sort can reduce the time spent to
> compute Bloom filters by nearly half.
That's true, though there are repositories where it has basically no
effect. Alas we can't directly test it, because in 3d11275505 there
is no '--changed-paths' option yet... one has to revert 3d11275505 on
top of d38e07b8c4 (commit-graph: add --changed-paths option to write
subcommand, 2020-04-06) to make any runtime comparisons ('git
commit-graph write --reachable --changed-paths', best of five):
Sorting by
pack | generation
position |
-------------------+------------
gcc 114.821s | 38.963s
git 8.896s | 5.620s
linux 209.984s | 104.900s
webkit 35.193s | 35.482s
Note the almost 3x speedup in the gcc repository, and the basically
negligible slowdown in the webkit repo.
> But since c49c82aa4c (commit: move members graph_pos, generation to a
> slab, 2020-06-17), this optimization is broken, since asking for a
> 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
> while writing.
I wouldn't say that c49c82aa4c broke this optimisation, because:
did not break that optimization. Though, sadly, it's not
mentioned in 3d11275505's commit message, when commit_gen_cmp()
compares two commits with identical generation numbers, then it
doesn't leave them unsorted, but falls back to use their committer
date as a tie-braker. This means that after c49c82aa4c the commits
are sorted by committer date, which appears to be so good a heuristic
for Bloom filter computation that there is barely any slowdown
compared to sorting by generation numbers:
> Not all hope is lost, though: 'commit_graph_generation()' falls back to
You mean commit_gen_cmp() here.
> comparing commits by their date when they have equal generation number,
> and so since c49c82aa4c is purely a date comparision function. This
> heuristic is good enough that we don't seem to loose appreciable
> performance while computing Bloom filters.
Indeed, c49c82aa4c barely caused any runtime difference in the
repositories I usually use to test modified path Bloom filter
performance:
c49c82aa4c^ c49c82aa4c
---------------------------------------------
android-base 43.057s 43.091s 0.07%
cmssw 21.781s 21.856s 0.34%
cpython 9.626s 9.724s 1.01%
elasticsearch 18.049s 18.224s 0.96%
gcc 40.312s 40.255s -0.14%
gecko-dev 104.515s 104.740s 0.21%
git 5.559s 5.570s 0.19%
glibc 4.455s 4.468s 0.29%
go 4.009s 4.016s 0.17%
homebrew-cask 30.759s 30.523s -0.76%
homebrew-core 57.122s 56.553s -0.99%
jdk 18.297s 18.364s 0.36%
linux 104.499s 105.302s 0.76%
llvm-project 34.074s 34.446s 1.09%
rails 6.472s 6.486s 0.21%
rust 14.943s 14.947s 0.02%
tensorflow 13.362s 13.477s 0.86%
webkit 34.583s 34.601s 0.05%
> Applying this patch (compared
> with v2.29.1) speeds up computing Bloom filters by around ~4
> seconds.
Without a baseline and knowing which repo, this "~4 seconds" is
meaningless.
Here are my results comparing this fix to v2.30.0, best of five:
v2.30.0 +
v2.30.0 this fix
---------------------------------------------
android-base 42.786s 42.933s 0.34%
cmssw 20.229s 20.160s -0.34%
cpython 9.616s 9.647s 0.32%
elasticsearch 16.859s 16.936s 0.45%
gcc 38.909s 36.889s -5.19%
gecko-dev 99.417s 98.558s -0.86%
git 5.620s 5.509s -1.97%
glibc 4.307s 4.301s -0.13%
go 3.971s 3.938s -0.83%
homebrew-cask 31.262s 30.283s -3.13%
homebrew-core 57.842s 55.663s -3.76%
jdk 12.557s 12.251s -2.43%
linux 94.335s 94.760s 0.45%
llvm-project 34.432s 33.988s -1.28%
rails 6.481s 6.454s -0.41%
rust 14.772s 14.601s -1.15%
tensorflow 11.759s 11.711s -0.40%
webkit 33.917s 33.759s -0.46%
> So, avoid the useless 'commit_graph_generation()' while writing by
> instead accessing the slab directly. This returns the newly-computed
> generation numbers, and allows us to avoid the heuristic by directly
> comparing generation numbers.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
> commit-graph.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 06f8dc1d896..caf823295f4 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
> const struct commit *a = *(const struct commit **)va;
> const struct commit *b = *(const struct commit **)vb;
>
> - uint32_t generation_a = commit_graph_generation(a);
> - uint32_t generation_b = commit_graph_generation(b);
> + uint32_t generation_a = commit_graph_data_at(a)->generation;
> + uint32_t generation_b = commit_graph_data_at(b)->generation;
> /* lower generation commits first */
> if (generation_a < generation_b)
> return -1;
> --
> gitgitgadget
>
^ permalink raw reply [flat|nested] 211+ messages in thread
* Re: [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters
2021-01-05 9:45 ` SZEDER Gábor
@ 2021-01-05 9:47 ` SZEDER Gábor
2021-01-08 5:51 ` Abhishek Kumar
1 sibling, 0 replies; 211+ messages in thread
From: SZEDER Gábor @ 2021-01-05 9:47 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar
On Tue, Jan 05, 2021 at 10:45:35AM +0100, SZEDER Gábor wrote:
> > But since c49c82aa4c (commit: move members graph_pos, generation to a
> > slab, 2020-06-17), this optimization is broken, since asking for a
> > 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
> > while writing.
>
> I wouldn't say that c49c82aa4c broke this optimisation, because:
>
> did not break that optimization. Though, sadly, it's not
> mentioned in 3d11275505's commit message, when commit_gen_cmp()
> compares two commits with identical generation numbers, then it
> doesn't leave them unsorted, but falls back to use their committer
> date as a tie-braker. This means that after c49c82aa4c the commits
> are sorted by committer date, which appears to be so good a heuristic
> for Bloom filter computation that there is barely any slowdown
> compared to sorting by generation numbers:
Gaah, scratch this paragraph; I first misunderstood what you wrote in
the paragraph below, but then forgot to remove it.
> > Not all hope is lost, though: 'commit_graph_generation()' falls back to
>
> You mean commit_gen_cmp() here.
>
> > comparing commits by their date when they have equal generation number,
> > and so since c49c82aa4c is purely a date comparision function. This
> > heuristic is good enough that we don't seem to loose appreciable
> > performance while computing Bloom filters.
^ permalink raw reply [flat|nested] 211+ messages in thread
* Re: [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters
2021-01-05 9:45 ` SZEDER Gábor
2021-01-05 9:47 ` SZEDER Gábor
@ 2021-01-08 5:51 ` Abhishek Kumar
1 sibling, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-08 5:51 UTC (permalink / raw)
To: SZEDER Gábor
Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me, stolee
On Tue, Jan 05, 2021 at 10:45:35AM +0100, SZEDER Gábor wrote:
> On Mon, Dec 28, 2020 at 11:15:58AM +0000, Abhishek Kumar via GitGitGadget wrote:
> > Before computing Bloom fitlers, the commit-graph machinery uses
> > commit_gen_cmp to sort commits by generation order for improved diff
> > performance. 3d11275505 (commit-graph: examine commits by generation
> > number, 2020-03-30) claims that this sort can reduce the time spent to
> > compute Bloom filters by nearly half.
>
> That's true, though there are repositories where it has basically no
> effect. Alas we can't directly test it, because in 3d11275505 there
> is no '--changed-paths' option yet... one has to revert 3d11275505 on
> top of d38e07b8c4 (commit-graph: add --changed-paths option to write
> subcommand, 2020-04-06) to make any runtime comparisons ('git
> commit-graph write --reachable --changed-paths', best of five):
>
> Sorting by
> pack | generation
> position |
> -------------------+------------
> gcc 114.821s | 38.963s
> git 8.896s | 5.620s
> linux 209.984s | 104.900s
> webkit 35.193s | 35.482s
>
> Note the almost 3x speedup in the gcc repository, and the basically
> negligible slowdown in the webkit repo.
>
> > But since c49c82aa4c (commit: move members graph_pos, generation to a
> > slab, 2020-06-17), this optimization is broken, since asking for a
> > 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
> > while writing.
>
> I wouldn't say that c49c82aa4c broke this optimisation, because:
>
> did not break that optimization. Though, sadly, it's not
> mentioned in 3d11275505's commit message, when commit_gen_cmp()
> compares two commits with identical generation numbers, then it
> doesn't leave them unsorted, but falls back to use their committer
> date as a tie-braker. This means that after c49c82aa4c the commits
> are sorted by committer date, which appears to be so good a heuristic
> for Bloom filter computation that there is barely any slowdown
> compared to sorting by generation numbers:
>
> > Not all hope is lost, though: 'commit_graph_generation()' falls back to
>
> You mean commit_gen_cmp() here.
>
Yes, fixed.
> > comparing commits by their date when they have equal generation number,
> > and so since c49c82aa4c is purely a date comparision function. This
> > heuristic is good enough that we don't seem to loose appreciable
> > performance while computing Bloom filters.
>
> Indeed, c49c82aa4c barely caused any runtime difference in the
> repositories I usually use to test modified path Bloom filter
> performance:
>
> c49c82aa4c^ c49c82aa4c
> ---------------------------------------------
> android-base 43.057s 43.091s 0.07%
> cmssw 21.781s 21.856s 0.34%
> cpython 9.626s 9.724s 1.01%
> elasticsearch 18.049s 18.224s 0.96%
> gcc 40.312s 40.255s -0.14%
> gecko-dev 104.515s 104.740s 0.21%
> git 5.559s 5.570s 0.19%
> glibc 4.455s 4.468s 0.29%
> go 4.009s 4.016s 0.17%
> homebrew-cask 30.759s 30.523s -0.76%
> homebrew-core 57.122s 56.553s -0.99%
> jdk 18.297s 18.364s 0.36%
> linux 104.499s 105.302s 0.76%
> llvm-project 34.074s 34.446s 1.09%
> rails 6.472s 6.486s 0.21%
> rust 14.943s 14.947s 0.02%
> tensorflow 13.362s 13.477s 0.86%
> webkit 34.583s 34.601s 0.05%
>
> > Applying this patch (compared
> > with v2.29.1) speeds up computing Bloom filters by around ~4
> > seconds.
>
> Without a baseline and knowing which repo, this "~4 seconds" is
> meaningless.
>
> Here are my results comparing this fix to v2.30.0, best of five:
>
> v2.30.0 +
> v2.30.0 this fix
> ---------------------------------------------
> android-base 42.786s 42.933s 0.34%
> cmssw 20.229s 20.160s -0.34%
> cpython 9.616s 9.647s 0.32%
> elasticsearch 16.859s 16.936s 0.45%
> gcc 38.909s 36.889s -5.19%
> gecko-dev 99.417s 98.558s -0.86%
> git 5.620s 5.509s -1.97%
> glibc 4.307s 4.301s -0.13%
> go 3.971s 3.938s -0.83%
> homebrew-cask 31.262s 30.283s -3.13%
> homebrew-core 57.842s 55.663s -3.76%
> jdk 12.557s 12.251s -2.43%
> linux 94.335s 94.760s 0.45%
> llvm-project 34.432s 33.988s -1.28%
> rails 6.481s 6.454s -0.41%
> rust 14.772s 14.601s -1.15%
> tensorflow 11.759s 11.711s -0.40%
> webkit 33.917s 33.759s -0.46%
>
Thank you for the detailed performance benchmarking.
>
> > So, avoid the useless 'commit_graph_generation()' while writing by
> > instead accessing the slab directly. This returns the newly-computed
> > generation numbers, and allows us to avoid the heuristic by directly
> > comparing generation numbers.
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> > commit-graph.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 06f8dc1d896..caf823295f4 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -144,8 +144,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
> > const struct commit *a = *(const struct commit **)va;
> > const struct commit *b = *(const struct commit **)vb;
> >
> > - uint32_t generation_a = commit_graph_generation(a);
> > - uint32_t generation_b = commit_graph_generation(b);
> > + uint32_t generation_a = commit_graph_data_at(a)->generation;
> > + uint32_t generation_b = commit_graph_data_at(b)->generation;
> > /* lower generation commits first */
> > if (generation_a < generation_b)
> > return -1;
> > --
> > gitgitgadget
> >
Thanks
- Abhishek
^ permalink raw reply [flat|nested] 211+ messages in thread
* [PATCH v5 02/11] revision: parse parent in indegree_walk_step()
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
2020-12-28 11:15 ` [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:15 ` Abhishek Kumar via GitGitGadget
2020-12-28 11:16 ` [PATCH v5 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
` (10 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:15 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
revision.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/revision.c b/revision.c
index 9dff845bed6..de8e45f462f 100644
--- a/revision.c
+++ b/revision.c
@@ -3373,6 +3373,9 @@ static void indegree_walk_step(struct rev_info *revs)
struct commit *parent = p->item;
int *pi = indegree_slab_at(&info->indegree, parent);
+ if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
+ return;
+
if (*pi)
(*pi)++;
else
--
gitgitgadget
^ permalink raw reply related [flat|nested] 211+ messages in thread
* [PATCH v5 03/11] commit-graph: consolidate fill_commit_graph_info
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
2020-12-28 11:15 ` [PATCH v5 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
2020-12-28 11:15 ` [PATCH v5 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16 ` Abhishek Kumar via GitGitGadget
2020-12-28 11:16 ` [PATCH v5 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
` (9 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().
fill_commit_graph_info() used to not load the committer date from the commit data
chunk. However, with the upcoming switch to using corrected committer
date as generation number v2, we will have to load committer date to
compute generation number value anyway.
e51217e15 (t5000: test tar files that overflow ustar headers,
2016-06-30) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2^36 + 1) seconds since the
epoch. The CDAT chunk provides 34 bits for storing the committer date, thus
the committer time overflows into the generation number (within the CDAT
chunk) and has undefined behavior.
The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and load committer date from the object
database, generating a tar file with the expected mtime.
However, with corrected commit date, we will load the committer date
from the CDAT chunk (truncated to the lower 34 bits) to populate the
generation number. Thus, Git sets date and generates a tar file with the
truncated mtime.
The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.
Since a committer date that overflows 12 octal digits also overflows the
CDAT chunk, while one that overflows only 11 octal digits need not, we
split the existing tests to test both implementations separately and add
a new explicit test for the 11-digit implementation.
To test the 11-octal digit implementation, we create a future commit
with committer date of 2^34 - 1, which overflows 11 octal digits without
overflowing the 34 bits of the Commit Data chunk.
To test the 12-octal digit implementation, the smallest committer date
possible is 2^36 + 1, which overflows the CDAT chunk and thus
commit-graph must be disabled for the test.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 27 ++++++++++-----------------
t/t5000-tar-tree.sh | 24 +++++++++++++++++++++---
2 files changed, 31 insertions(+), 20 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index caf823295f4..d5b33b4f7ac 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -749,15 +749,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
const unsigned char *commit_data;
struct commit_graph_data *graph_data;
uint32_t lex_index;
+ uint64_t date_high, date_low;
while (pos < g->num_commits_in_base)
g = g->base_graph;
+ if (pos >= g->num_commits + g->num_commits_in_base)
+ die(_("invalid commit position. commit-graph is likely corrupt"));
+
lex_index = pos - g->num_commits_in_base;
commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
graph_data = commit_graph_data_at(item);
graph_data->graph_pos = pos;
+
+ date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+ date_low = get_be32(commit_data + g->hash_len + 12);
+ item->date = (timestamp_t)((date_high << 32) | date_low);
+
graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
}
@@ -772,38 +781,22 @@ static int fill_commit_in_graph(struct repository *r,
{
uint32_t edge_value;
uint32_t *parent_data_ptr;
- uint64_t date_low, date_high;
struct commit_list **pptr;
- struct commit_graph_data *graph_data;
const unsigned char *commit_data;
uint32_t lex_index;
while (pos < g->num_commits_in_base)
g = g->base_graph;
- if (pos >= g->num_commits + g->num_commits_in_base)
- die(_("invalid commit position. commit-graph is likely corrupt"));
+ fill_commit_graph_info(item, g, pos);
- /*
- * Store the "full" position, but then use the
- * "local" position for the rest of the calculation.
- */
- graph_data = commit_graph_data_at(item);
- graph_data->graph_pos = pos;
lex_index = pos - g->num_commits_in_base;
-
commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
item->object.parsed = 1;
set_commit_tree(item, NULL);
- date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
- date_low = get_be32(commit_data + g->hash_len + 12);
- item->date = (timestamp_t)((date_high << 32) | date_low);
-
- graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
pptr = &item->parents;
edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 3ebb0d3b652..7204799a0b5 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -431,15 +431,33 @@ test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
test_cmp expect actual
'
-test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
+test_expect_success TIME_IS_64BIT 'set up repository with far-future (2^34 - 1) commit' '
+ rm -f .git/index &&
+ echo foo >file &&
+ git add file &&
+ GIT_COMMITTER_DATE="@17179869183 +0000" \
+ git commit -m "tempori parendum"
+'
+
+test_expect_success TIME_IS_64BIT 'generate tar with far-future mtime' '
+ git archive HEAD >future.tar
+'
+
+test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
+ echo 2514 >expect &&
+ tar_info future.tar | cut -d" " -f2 >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success TIME_IS_64BIT 'set up repository with far-far-future (2^36 + 1) commit' '
rm -f .git/index &&
echo content >file &&
git add file &&
- GIT_COMMITTER_DATE="@68719476737 +0000" \
+ GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
git commit -m "tempori parendum"
'
-test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
+test_expect_success TIME_IS_64BIT 'generate tar with far-far-future mtime' '
git archive HEAD >future.tar
'
--
gitgitgadget
* [PATCH v5 04/11] t6600-test-reach: generalize *_three_modes
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
` (2 preceding siblings ...)
2020-12-28 11:16 ` [PATCH v5 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16 ` Abhishek Kumar via GitGitGadget
2020-12-28 11:16 ` [PATCH v5 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
` (8 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In a preparatory step to implement generation number v2, we add tests to
ensure Git can read and parse commit-graph files without a Generation
Data chunk. These files represent commit-graph files written by old Git
and are necessary for backward compatibility.
We extend run_three_modes() and test_three_modes() to *_all_modes() with
the fourth mode being "commit-graph without generation data chunk".
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
t/t6600-test-reach.sh | 62 +++++++++++++++++++++----------------------
1 file changed, 31 insertions(+), 31 deletions(-)
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index f807276337d..af10f0dc090 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -58,7 +58,7 @@ test_expect_success 'setup' '
git config core.commitGraph true
'
-run_three_modes () {
+run_all_modes () {
test_when_finished rm -rf .git/objects/info/commit-graph &&
"$@" <input >actual &&
test_cmp expect actual &&
@@ -70,8 +70,8 @@ run_three_modes () {
test_cmp expect actual
}
-test_three_modes () {
- run_three_modes test-tool reach "$@"
+test_all_modes () {
+ run_all_modes test-tool reach "$@"
}
test_expect_success 'ref_newer:miss' '
@@ -80,7 +80,7 @@ test_expect_success 'ref_newer:miss' '
B:commit-4-9
EOF
echo "ref_newer(A,B):0" >expect &&
- test_three_modes ref_newer
+ test_all_modes ref_newer
'
test_expect_success 'ref_newer:hit' '
@@ -89,7 +89,7 @@ test_expect_success 'ref_newer:hit' '
B:commit-2-3
EOF
echo "ref_newer(A,B):1" >expect &&
- test_three_modes ref_newer
+ test_all_modes ref_newer
'
test_expect_success 'in_merge_bases:hit' '
@@ -98,7 +98,7 @@ test_expect_success 'in_merge_bases:hit' '
B:commit-8-8
EOF
echo "in_merge_bases(A,B):1" >expect &&
- test_three_modes in_merge_bases
+ test_all_modes in_merge_bases
'
test_expect_success 'in_merge_bases:miss' '
@@ -107,7 +107,7 @@ test_expect_success 'in_merge_bases:miss' '
B:commit-5-9
EOF
echo "in_merge_bases(A,B):0" >expect &&
- test_three_modes in_merge_bases
+ test_all_modes in_merge_bases
'
test_expect_success 'in_merge_bases_many:hit' '
@@ -117,7 +117,7 @@ test_expect_success 'in_merge_bases_many:hit' '
X:commit-5-7
EOF
echo "in_merge_bases_many(A,X):1" >expect &&
- test_three_modes in_merge_bases_many
+ test_all_modes in_merge_bases_many
'
test_expect_success 'in_merge_bases_many:miss' '
@@ -127,7 +127,7 @@ test_expect_success 'in_merge_bases_many:miss' '
X:commit-8-6
EOF
echo "in_merge_bases_many(A,X):0" >expect &&
- test_three_modes in_merge_bases_many
+ test_all_modes in_merge_bases_many
'
test_expect_success 'in_merge_bases_many:miss-heuristic' '
@@ -137,7 +137,7 @@ test_expect_success 'in_merge_bases_many:miss-heuristic' '
X:commit-6-6
EOF
echo "in_merge_bases_many(A,X):0" >expect &&
- test_three_modes in_merge_bases_many
+ test_all_modes in_merge_bases_many
'
test_expect_success 'is_descendant_of:hit' '
@@ -148,7 +148,7 @@ test_expect_success 'is_descendant_of:hit' '
X:commit-1-1
EOF
echo "is_descendant_of(A,X):1" >expect &&
- test_three_modes is_descendant_of
+ test_all_modes is_descendant_of
'
test_expect_success 'is_descendant_of:miss' '
@@ -159,7 +159,7 @@ test_expect_success 'is_descendant_of:miss' '
X:commit-7-6
EOF
echo "is_descendant_of(A,X):0" >expect &&
- test_three_modes is_descendant_of
+ test_all_modes is_descendant_of
'
test_expect_success 'get_merge_bases_many' '
@@ -174,7 +174,7 @@ test_expect_success 'get_merge_bases_many' '
git rev-parse commit-5-6 \
commit-4-7 | sort
} >expect &&
- test_three_modes get_merge_bases_many
+ test_all_modes get_merge_bases_many
'
test_expect_success 'reduce_heads' '
@@ -196,7 +196,7 @@ test_expect_success 'reduce_heads' '
commit-2-8 \
commit-1-10 | sort
} >expect &&
- test_three_modes reduce_heads
+ test_all_modes reduce_heads
'
test_expect_success 'can_all_from_reach:hit' '
@@ -219,7 +219,7 @@ test_expect_success 'can_all_from_reach:hit' '
Y:commit-8-1
EOF
echo "can_all_from_reach(X,Y):1" >expect &&
- test_three_modes can_all_from_reach
+ test_all_modes can_all_from_reach
'
test_expect_success 'can_all_from_reach:miss' '
@@ -241,7 +241,7 @@ test_expect_success 'can_all_from_reach:miss' '
Y:commit-8-5
EOF
echo "can_all_from_reach(X,Y):0" >expect &&
- test_three_modes can_all_from_reach
+ test_all_modes can_all_from_reach
'
test_expect_success 'can_all_from_reach_with_flag: tags case' '
@@ -264,7 +264,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
Y:commit-8-1
EOF
echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
- test_three_modes can_all_from_reach_with_flag
+ test_all_modes can_all_from_reach_with_flag
'
test_expect_success 'commit_contains:hit' '
@@ -280,8 +280,8 @@ test_expect_success 'commit_contains:hit' '
X:commit-9-3
EOF
echo "commit_contains(_,A,X,_):1" >expect &&
- test_three_modes commit_contains &&
- test_three_modes commit_contains --tag
+ test_all_modes commit_contains &&
+ test_all_modes commit_contains --tag
'
test_expect_success 'commit_contains:miss' '
@@ -297,8 +297,8 @@ test_expect_success 'commit_contains:miss' '
X:commit-9-3
EOF
echo "commit_contains(_,A,X,_):0" >expect &&
- test_three_modes commit_contains &&
- test_three_modes commit_contains --tag
+ test_all_modes commit_contains &&
+ test_all_modes commit_contains --tag
'
test_expect_success 'rev-list: basic topo-order' '
@@ -310,7 +310,7 @@ test_expect_success 'rev-list: basic topo-order' '
commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>expect &&
- run_three_modes git rev-list --topo-order commit-6-6
+ run_all_modes git rev-list --topo-order commit-6-6
'
test_expect_success 'rev-list: first-parent topo-order' '
@@ -322,7 +322,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
commit-6-2 \
commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>expect &&
- run_three_modes git rev-list --first-parent --topo-order commit-6-6
+ run_all_modes git rev-list --first-parent --topo-order commit-6-6
'
test_expect_success 'rev-list: range topo-order' '
@@ -334,7 +334,7 @@ test_expect_success 'rev-list: range topo-order' '
commit-6-2 commit-5-2 commit-4-2 \
commit-6-1 commit-5-1 commit-4-1 \
>expect &&
- run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+ run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
'
test_expect_success 'rev-list: range topo-order' '
@@ -346,7 +346,7 @@ test_expect_success 'rev-list: range topo-order' '
commit-6-2 commit-5-2 commit-4-2 \
commit-6-1 commit-5-1 commit-4-1 \
>expect &&
- run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+ run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
'
test_expect_success 'rev-list: first-parent range topo-order' '
@@ -358,7 +358,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
commit-6-2 \
commit-6-1 commit-5-1 commit-4-1 \
>expect &&
- run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+ run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
'
test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -368,7 +368,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
commit-6-3 commit-5-3 commit-4-3 \
>expect &&
- run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+ run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
'
test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -382,7 +382,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
commit-3-8 commit-2-8 commit-1-8 \
commit-3-7 commit-2-7 commit-1-7 \
>expect &&
- run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+ run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
'
test_expect_success 'get_reachable_subset:all' '
@@ -402,7 +402,7 @@ test_expect_success 'get_reachable_subset:all' '
commit-1-7 \
commit-5-6 | sort
) >expect &&
- test_three_modes get_reachable_subset
+ test_all_modes get_reachable_subset
'
test_expect_success 'get_reachable_subset:some' '
@@ -420,7 +420,7 @@ test_expect_success 'get_reachable_subset:some' '
git rev-parse commit-3-3 \
commit-1-7 | sort
) >expect &&
- test_three_modes get_reachable_subset
+ test_all_modes get_reachable_subset
'
test_expect_success 'get_reachable_subset:none' '
@@ -434,7 +434,7 @@ test_expect_success 'get_reachable_subset:none' '
Y:commit-2-8
EOF
echo "get_reachable_subset(X,Y)" >expect &&
- test_three_modes get_reachable_subset
+ test_all_modes get_reachable_subset
'
test_done
--
gitgitgadget
* [PATCH v5 05/11] commit-graph: add a slab to store topological levels
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
` (3 preceding siblings ...)
2020-12-28 11:16 ` [PATCH v5 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16 ` Abhishek Kumar via GitGitGadget
2020-12-28 11:16 ` [PATCH v5 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
` (7 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In a later commit we will introduce corrected commit date as the
generation number v2. Corrected commit dates will be stored in the new
separate Generation Data chunk. However, to ensure backwards
compatibility with "Old" Git we need to continue to write generation
number v1 (topological levels) to the commit data chunk. Thus, we need
to compute and store both versions of generation numbers to write the
commit-graph file.
Therefore, let's introduce a commit-slab `topo_level_slab` to store
topological levels; corrected commit date will be stored in the member
`generation` of struct commit_graph_data.
The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO`
mark commits not in the commit-graph file and commits written by a
version of Git that did not compute generation numbers, respectively.
Generation numbers are computed identically for both kinds of commits.
A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit
is not in the commit-graph file. However, since the slab is
zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`).
Thus, we no longer need to check if the topological level of a commit is
`GENERATION_NUMBER_INFINITY`.
We will add a pointer to the slab in `struct write_commit_graph_context`
and `struct commit_graph` to populate the slab in
`fill_commit_graph_info` if the commit has a pre-computed topological
level, as in the case of split commit-graphs.
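As a rough illustration of why a zero-initialized slab lets the code test
only for "zero", here is a minimal, self-contained sketch of the level
computation on a toy DAG. It is not the code from this patch: the struct,
the array-based "slab" and the function name are hypothetical stand-ins,
and an explicit index stack replaces the commit_list.

```c
#include <assert.h>
#include <stdint.h>

#define LEVEL_UNCOMPUTED 0     /* plays the role of GENERATION_NUMBER_ZERO */
#define LEVEL_MAX 0x3FFFFFFF   /* same value as GENERATION_NUMBER_MAX */
#define MAX_PARENTS 2
#define NO_PARENT (-1)

/* Toy commit: parents are indices into the commit array, or NO_PARENT. */
struct toy_commit {
	int parents[MAX_PARENTS];
};

/*
 * Fill levels[] for all n commits. levels[] must be zero-initialized
 * (like a fresh slab), so "0" means "not computed yet". stack[] is
 * scratch space with room for n entries.
 */
static void compute_levels(const struct toy_commit *commits, int n,
			   uint32_t *levels, int *stack)
{
	int i, p, top;

	for (i = 0; i < n; i++) {
		if (levels[i] != LEVEL_UNCOMPUTED)
			continue;

		top = 0;
		stack[top++] = i;
		while (top > 0) {
			int current = stack[top - 1];
			int all_parents_computed = 1;
			uint32_t max_level = 0;

			for (p = 0; p < MAX_PARENTS; p++) {
				int parent = commits[current].parents[p];

				if (parent == NO_PARENT)
					continue;
				if (levels[parent] == LEVEL_UNCOMPUTED) {
					/* Visit the parent first, then retry. */
					all_parents_computed = 0;
					stack[top++] = parent;
					break;
				}
				if (levels[parent] > max_level)
					max_level = levels[parent];
			}

			if (all_parents_computed) {
				top--;
				/* Clamp so the stored level never exceeds LEVEL_MAX. */
				if (max_level > LEVEL_MAX - 1)
					max_level = LEVEL_MAX - 1;
				levels[current] = max_level + 1;
			}
		}
	}
}
```

Because every computed level is at least 1, a lookup that returns 0 can
only mean "not computed", whether the entry was never touched or came
from a file without levels.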
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 45 ++++++++++++++++++++++++++++++---------------
commit-graph.h | 1 +
2 files changed, 31 insertions(+), 15 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index d5b33b4f7ac..c98e8910fe2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)
+define_commit_slab(topo_level_slab, uint32_t);
+
/* Keep track of the order in which commits are added to our list. */
define_commit_slab(commit_pos, int);
static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -768,6 +770,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
item->date = (timestamp_t)((date_high << 32) | date_low);
graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
+ if (g->topo_levels)
+ *topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
}
static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -956,6 +961,7 @@ struct write_commit_graph_context {
changed_paths:1,
order_by_pack:1;
+ struct topo_level_slab *topo_levels;
const struct commit_graph_opts *opts;
size_t total_bloom_filter_data_size;
const struct bloom_filter_settings *bloom_settings;
@@ -1102,7 +1108,7 @@ static int write_graph_chunk_data(struct hashfile *f,
else
packedDate[0] = 0;
- packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+ packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);
@@ -1332,11 +1338,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
_("Computing commit graph generation numbers"),
ctx->commits.nr);
for (i = 0; i < ctx->commits.nr; i++) {
- uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+ uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
display_progress(ctx->progress, i + 1);
- if (generation != GENERATION_NUMBER_INFINITY &&
- generation != GENERATION_NUMBER_ZERO)
+ if (level != GENERATION_NUMBER_ZERO)
continue;
commit_list_insert(ctx->commits.list[i], &list);
@@ -1344,29 +1349,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
struct commit *current = list->item;
struct commit_list *parent;
int all_parents_computed = 1;
- uint32_t max_generation = 0;
+ uint32_t max_level = 0;
for (parent = current->parents; parent; parent = parent->next) {
- generation = commit_graph_data_at(parent->item)->generation;
+ level = *topo_level_slab_at(ctx->topo_levels, parent->item);
- if (generation == GENERATION_NUMBER_INFINITY ||
- generation == GENERATION_NUMBER_ZERO) {
+ if (level == GENERATION_NUMBER_ZERO) {
all_parents_computed = 0;
commit_list_insert(parent->item, &list);
break;
- } else if (generation > max_generation) {
- max_generation = generation;
+ } else if (level > max_level) {
+ max_level = level;
}
}
if (all_parents_computed) {
- struct commit_graph_data *data = commit_graph_data_at(current);
-
- data->generation = max_generation + 1;
pop_commit(&list);
- if (data->generation > GENERATION_NUMBER_MAX)
- data->generation = GENERATION_NUMBER_MAX;
+ if (max_level > GENERATION_NUMBER_MAX - 1)
+ max_level = GENERATION_NUMBER_MAX - 1;
+ *topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
}
}
}
@@ -2102,6 +2104,7 @@ int write_commit_graph(struct object_directory *odb,
int res = 0;
int replace = 0;
struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+ struct topo_level_slab topo_levels;
prepare_repo_settings(the_repository);
if (!the_repository->settings.core_commit_graph) {
@@ -2128,6 +2131,18 @@ int write_commit_graph(struct object_directory *odb,
bloom_settings.max_changed_paths);
ctx->bloom_settings = &bloom_settings;
+ init_topo_level_slab(&topo_levels);
+ ctx->topo_levels = &topo_levels;
+
+ if (ctx->r->objects->commit_graph) {
+ struct commit_graph *g = ctx->r->objects->commit_graph;
+
+ while (g) {
+ g->topo_levels = &topo_levels;
+ g = g->base_graph;
+ }
+ }
+
if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
ctx->changed_paths = 1;
if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/commit-graph.h b/commit-graph.h
index f8e92500c6e..00f00745b79 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -73,6 +73,7 @@ struct commit_graph {
const unsigned char *chunk_bloom_indexes;
const unsigned char *chunk_bloom_data;
+ struct topo_level_slab *topo_levels;
struct bloom_filter_settings *bloom_filter_settings;
};
--
gitgitgadget
* [PATCH v5 06/11] commit-graph: return 64-bit generation number
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
` (4 preceding siblings ...)
2020-12-28 11:16 ` [PATCH v5 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16 ` Abhishek Kumar via GitGitGadget
2020-12-28 11:16 ` [PATCH v5 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
` (6 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.
We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.
With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.
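One way to see why the 32-bit sentinel has to go is the following
standalone sketch. It assumes an unsigned 64-bit timestamp type, and all
TOY_* names are hypothetical; only the constant values are taken from
this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: stands in for timestamp_t on a 64-bit platform. */
typedef uint64_t toy_timestamp_t;

#define TOY_OLD_INFINITY   UINT32_C(0xFFFFFFFF)       /* previous sentinel */
#define TOY_NEW_INFINITY   ((UINT64_C(1) << 63) - 1)  /* sentinel after this patch */
#define TOY_V1_MAX         UINT32_C(0x3FFFFFFF)       /* largest topological level */

/* Three-way comparison of two generation numbers, lower first. */
static int toy_gen_cmp(toy_timestamp_t a, toy_timestamp_t b)
{
	if (a < b)
		return -1;
	if (a > b)
		return 1;
	return 0;
}
```

A date-based generation such as 2^34 - 1 compares greater than the old
32-bit sentinel, so the old value could no longer act as "infinity",
whereas every topological level up to TOY_V1_MAX and every such date
stays below the new sentinel.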
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 22 +++++++++++-----------
commit-graph.h | 4 ++--
commit-reach.c | 36 ++++++++++++++++++------------------
commit-reach.h | 2 +-
commit.c | 4 ++--
commit.h | 4 ++--
revision.c | 10 +++++-----
upload-pack.c | 2 +-
8 files changed, 42 insertions(+), 42 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index c98e8910fe2..1b2a015f92f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -101,7 +101,7 @@ uint32_t commit_graph_position(const struct commit *c)
return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
}
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
{
struct commit_graph_data *data =
commit_graph_data_slab_peek(&commit_graph_data_slab, c);
@@ -146,8 +146,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
const struct commit *a = *(const struct commit **)va;
const struct commit *b = *(const struct commit **)vb;
- uint32_t generation_a = commit_graph_data_at(a)->generation;
- uint32_t generation_b = commit_graph_data_at(b)->generation;
+ const timestamp_t generation_a = commit_graph_data_at(a)->generation;
+ const timestamp_t generation_b = commit_graph_data_at(b)->generation;
/* lower generation commits first */
if (generation_a < generation_b)
return -1;
@@ -1366,8 +1366,8 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (all_parents_computed) {
pop_commit(&list);
- if (max_level > GENERATION_NUMBER_MAX - 1)
- max_level = GENERATION_NUMBER_MAX - 1;
+ if (max_level > GENERATION_NUMBER_V1_MAX - 1)
+ max_level = GENERATION_NUMBER_V1_MAX - 1;
*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
}
}
@@ -2363,8 +2363,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
for (i = 0; i < g->num_commits; i++) {
struct commit *graph_commit, *odb_commit;
struct commit_list *graph_parents, *odb_parents;
- uint32_t max_generation = 0;
- uint32_t generation;
+ timestamp_t max_generation = 0;
+ timestamp_t generation;
display_progress(progress, i + 1);
hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2428,16 +2428,16 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
continue;
/*
- * If one of our parents has generation GENERATION_NUMBER_MAX, then
- * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
+ * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
+ * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
* extra logic in the following condition.
*/
- if (max_generation == GENERATION_NUMBER_MAX)
+ if (max_generation == GENERATION_NUMBER_V1_MAX)
max_generation--;
generation = commit_graph_generation(graph_commit);
if (generation != max_generation + 1)
- graph_report(_("commit-graph generation for commit %s is %u != %u"),
+ graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
oid_to_hex(&cur_oid),
generation,
max_generation + 1);
diff --git a/commit-graph.h b/commit-graph.h
index 00f00745b79..2e9aa7824ee 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -145,12 +145,12 @@ void disable_commit_graph(struct repository *r);
struct commit_graph_data {
uint32_t graph_pos;
- uint32_t generation;
+ timestamp_t generation;
};
/*
* Commits should be parsed before accessing generation, graph positions.
*/
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
uint32_t commit_graph_position(const struct commit *);
#endif
diff --git a/commit-reach.c b/commit-reach.c
index 50175b159e7..9b24b0378d5 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
static struct commit_list *paint_down_to_common(struct repository *r,
struct commit *one, int n,
struct commit **twos,
- int min_generation)
+ timestamp_t min_generation)
{
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
- uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+ timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
if (!min_generation)
queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
struct commit *commit = prio_queue_get(&queue);
struct commit_list *parents;
int flags;
- uint32_t generation = commit_graph_generation(commit);
+ timestamp_t generation = commit_graph_generation(commit);
if (min_generation && generation > last_gen)
- BUG("bad generation skip %8x > %8x at %s",
+ BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
generation, last_gen,
oid_to_hex(&commit->object.oid));
last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
repo_parse_commit(r, array[i]);
for (i = 0; i < cnt; i++) {
struct commit_list *common;
- uint32_t min_generation = commit_graph_generation(array[i]);
+ timestamp_t min_generation = commit_graph_generation(array[i]);
if (redundant[i])
continue;
for (j = filled = 0; j < cnt; j++) {
- uint32_t curr_generation;
+ timestamp_t curr_generation;
if (i == j || redundant[j])
continue;
filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
{
struct commit_list *bases;
int ret = 0, i;
- uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
+ timestamp_t generation, max_generation = GENERATION_NUMBER_ZERO;
if (repo_parse_commit(r, commit))
return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
static enum contains_result contains_test(struct commit *candidate,
const struct commit_list *want,
struct contains_cache *cache,
- uint32_t cutoff)
+ timestamp_t cutoff)
{
enum contains_result *cached = contains_cache_at(cache, candidate);
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
{
struct contains_stack contains_stack = { 0, 0, NULL };
enum contains_result result;
- uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+ timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
const struct commit_list *p;
for (p = want; p; p = p->next) {
- uint32_t generation;
+ timestamp_t generation;
struct commit *c = p->item;
load_commit_graph_info(the_repository, c);
generation = commit_graph_generation(c);
@@ -566,8 +566,8 @@ static int compare_commits_by_gen(const void *_a, const void *_b)
const struct commit *a = *(const struct commit * const *)_a;
const struct commit *b = *(const struct commit * const *)_b;
- uint32_t generation_a = commit_graph_generation(a);
- uint32_t generation_b = commit_graph_generation(b);
+ timestamp_t generation_a = commit_graph_generation(a);
+ timestamp_t generation_b = commit_graph_generation(b);
if (generation_a < generation_b)
return -1;
@@ -580,7 +580,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
unsigned int with_flag,
unsigned int assign_flag,
time_t min_commit_date,
- uint32_t min_generation)
+ timestamp_t min_generation)
{
struct commit **list = NULL;
int i;
@@ -681,13 +681,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
struct commit_list *from_iter = from, *to_iter = to;
int result;
- uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+ timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
while (from_iter) {
add_object_array(&from_iter->item->object, NULL, &from_objs);
if (!parse_commit(from_iter->item)) {
- uint32_t generation;
+ timestamp_t generation;
if (from_iter->item->date < min_commit_date)
min_commit_date = from_iter->item->date;
@@ -701,7 +701,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
while (to_iter) {
if (!parse_commit(to_iter->item)) {
- uint32_t generation;
+ timestamp_t generation;
if (to_iter->item->date < min_commit_date)
min_commit_date = to_iter->item->date;
@@ -741,13 +741,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
struct commit_list *found_commits = NULL;
struct commit **to_last = to + nr_to;
struct commit **from_last = from + nr_from;
- uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+ timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
int num_to_find = 0;
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
for (item = to; item < to_last; item++) {
- uint32_t generation;
+ timestamp_t generation;
struct commit *c = *item;
parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a317..148b56fea50 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
unsigned int with_flag,
unsigned int assign_flag,
time_t min_commit_date,
- uint32_t min_generation);
+ timestamp_t min_generation);
int can_all_from_reach(struct commit_list *from, struct commit_list *to,
int commit_date_cutoff);
diff --git a/commit.c b/commit.c
index fe1fa3dc41f..17abf92a2d2 100644
--- a/commit.c
+++ b/commit.c
@@ -731,8 +731,8 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
{
const struct commit *a = a_, *b = b_;
- const uint32_t generation_a = commit_graph_generation(a),
- generation_b = commit_graph_generation(b);
+ const timestamp_t generation_a = commit_graph_generation(a),
+ generation_b = commit_graph_generation(b);
/* newer commits first */
if (generation_a < generation_b)
diff --git a/commit.h b/commit.h
index 5467786c7be..33c66b2177c 100644
--- a/commit.h
+++ b/commit.h
@@ -11,8 +11,8 @@
#include "commit-slab.h"
#define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
-#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0
struct commit_list {
diff --git a/revision.c b/revision.c
index de8e45f462f..d55c2e4d566 100644
--- a/revision.c
+++ b/revision.c
@@ -3300,7 +3300,7 @@ define_commit_slab(indegree_slab, int);
define_commit_slab(author_date_slab, timestamp_t);
struct topo_walk_info {
- uint32_t min_generation;
+ timestamp_t min_generation;
struct prio_queue explore_queue;
struct prio_queue indegree_queue;
struct prio_queue topo_queue;
@@ -3346,7 +3346,7 @@ static void explore_walk_step(struct rev_info *revs)
}
static void explore_to_depth(struct rev_info *revs,
- uint32_t gen_cutoff)
+ timestamp_t gen_cutoff)
{
struct topo_walk_info *info = revs->topo_walk_info;
struct commit *c;
@@ -3389,7 +3389,7 @@ static void indegree_walk_step(struct rev_info *revs)
}
static void compute_indegrees_to_depth(struct rev_info *revs,
- uint32_t gen_cutoff)
+ timestamp_t gen_cutoff)
{
struct topo_walk_info *info = revs->topo_walk_info;
struct commit *c;
@@ -3447,7 +3447,7 @@ static void init_topo_walk(struct rev_info *revs)
info->min_generation = GENERATION_NUMBER_INFINITY;
for (list = revs->commits; list; list = list->next) {
struct commit *c = list->item;
- uint32_t generation;
+ timestamp_t generation;
if (repo_parse_commit_gently(revs->repo, c, 1))
continue;
@@ -3508,7 +3508,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
for (p = commit->parents; p; p = p->next) {
struct commit *parent = p->item;
int *pi;
- uint32_t generation;
+ timestamp_t generation;
if (parent->object.flags & UNINTERESTING)
continue;
diff --git a/upload-pack.c b/upload-pack.c
index 3b66bf92ba8..b87607e0dd4 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -500,7 +500,7 @@ static int got_oid(struct upload_pack_data *data,
static int ok_to_give_up(struct upload_pack_data *data)
{
- uint32_t min_generation = GENERATION_NUMBER_ZERO;
+ timestamp_t min_generation = GENERATION_NUMBER_ZERO;
if (!data->have_obj.nr)
return 0;
--
gitgitgadget
* [PATCH v5 07/11] commit-graph: implement corrected commit date
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
` (5 preceding siblings ...)
2020-12-28 11:16 ` [PATCH v5 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16 ` Abhishek Kumar via GitGitGadget
2020-12-30 1:53 ` Derrick Stolee
2020-12-28 11:16 ` [PATCH v5 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
` (5 subsequent siblings)
12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
With most of preparations done, let's implement corrected commit date.
The corrected commit date for a commit is defined as:
* A commit with no parents (a root commit) has corrected commit date
equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
the maximum of its commit date and one more than the largest corrected
commit date among its parents.
As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).
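The definition above, including the special case for a root commit
dated at the epoch, can be sketched as a small standalone function.
This is only an illustration under stated assumptions: the type
sketch_timestamp and the function corrected_commit_date() are
hypothetical names that do not exist in Git, and the parents'
corrected commit dates are assumed to be computed already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Git's timestamp_t; not the real typedef. */
typedef uint64_t sketch_timestamp;

/*
 * Return the corrected commit date of a commit with committer date
 * `date`, given the already computed corrected commit dates of its
 * parents. Zero is reserved for "not yet computed", so the result is
 * always at least one.
 */
sketch_timestamp corrected_commit_date(sketch_timestamp date,
				       const sketch_timestamp *parents,
				       size_t nr_parents)
{
	sketch_timestamp result = date;
	size_t i;

	assert(parents || !nr_parents);

	/* One more than the largest corrected commit date among parents. */
	for (i = 0; i < nr_parents; i++)
		if (parents[i] + 1 > result)
			result = parents[i] + 1;

	/* Special case: a root commit dated at the epoch gets one, not zero. */
	return result ? result : 1;
}
```

For example, a commit dated 50 whose only parent has corrected commit
date 100 ends up with 101, while a commit dated 200 with the same
parent keeps 200.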
To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.
Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32 bits. This halves the size
of the GDAT chunk, which is a reduction of around 6% in the size of the
commit-graph file.
However, using 32-bit offsets is problematic if a commit is malformed
but valid and has a committer date of 0 Unix time, as the offset would
be the same as the corrected commit date and thus require 64 bits to be
stored properly.
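One plausible way to arrive at the "around 6%" figure is a rough
per-commit byte count. The sketch below is an assumption-laden
estimate, not a measurement: it assumes a SHA-1 repository (20-byte
hash), counts only the OIDL, CDAT and GDAT chunks, and ignores the
fixed-size fanout table as well as the optional extra-edge and Bloom
filter chunks. The function names are hypothetical.

```c
#include <assert.h>

/*
 * Approximate per-commit footprint, in bytes, of the chunks that grow
 * linearly with the number of commits.
 */
unsigned per_commit_bytes(unsigned hash_len, unsigned gdat_entry_len)
{
	unsigned oid_lookup = hash_len;       /* OIDL: one hash per commit */
	unsigned commit_data = hash_len + 16; /* CDAT: GRAPH_DATA_WIDTH */

	return oid_lookup + commit_data + gdat_entry_len;
}

/* Saving from 32-bit offsets, in hundredths of a percent. */
unsigned offset_saving_bp(unsigned hash_len)
{
	unsigned wide = per_commit_bytes(hash_len, 8);   /* full 64-bit dates */
	unsigned narrow = per_commit_bytes(hash_len, 4); /* 32-bit offsets */

	assert(wide > narrow);
	return (wide - narrow) * 10000 / wide;
}
```

Under these assumptions each commit costs 64 bytes with 64-bit dates
and 60 bytes with 32-bit offsets, a saving of 4 bytes in 64, that is
6.25%.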
While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 1b2a015f92f..bfc3aae5f93 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1339,9 +1339,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
ctx->commits.nr);
for (i = 0; i < ctx->commits.nr; i++) {
uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+ timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
display_progress(ctx->progress, i + 1);
- if (level != GENERATION_NUMBER_ZERO)
+ if (level != GENERATION_NUMBER_ZERO &&
+ corrected_commit_date != GENERATION_NUMBER_ZERO)
continue;
commit_list_insert(ctx->commits.list[i], &list);
@@ -1350,16 +1352,23 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
struct commit_list *parent;
int all_parents_computed = 1;
uint32_t max_level = 0;
+ timestamp_t max_corrected_commit_date = 0;
for (parent = current->parents; parent; parent = parent->next) {
level = *topo_level_slab_at(ctx->topo_levels, parent->item);
+ corrected_commit_date = commit_graph_data_at(parent->item)->generation;
- if (level == GENERATION_NUMBER_ZERO) {
+ if (level == GENERATION_NUMBER_ZERO ||
+ corrected_commit_date == GENERATION_NUMBER_ZERO) {
all_parents_computed = 0;
commit_list_insert(parent->item, &list);
break;
- } else if (level > max_level) {
- max_level = level;
+ } else {
+ if (level > max_level)
+ max_level = level;
+
+ if (corrected_commit_date > max_corrected_commit_date)
+ max_corrected_commit_date = corrected_commit_date;
}
}
@@ -1369,6 +1378,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (max_level > GENERATION_NUMBER_V1_MAX - 1)
max_level = GENERATION_NUMBER_V1_MAX - 1;
*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+
+ if (current->date && current->date > max_corrected_commit_date)
+ max_corrected_commit_date = current->date - 1;
+ commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
}
}
}
--
gitgitgadget
* Re: [PATCH v5 07/11] commit-graph: implement corrected commit date
2020-12-28 11:16 ` [PATCH v5 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-12-30 1:53 ` Derrick Stolee
2021-01-10 12:21 ` Abhishek Kumar
0 siblings, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-12-30 1:53 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget, git
Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar
On 12/28/2020 6:16 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With most of preparations done, let's implement corrected commit date.
>
> The corrected commit date for a commit is defined as:
>
> * A commit with no parents (a root commit) has corrected commit date
> equal to its committer date.
> * A commit with at least one parent has corrected commit date equal to
> the maximum of its commit date and one more than the largest corrected
> commit date among its parents.
>
> As a special case, a root commit with timestamp of zero (01.01.1970
> 00:00:00Z) has corrected commit date of one, to be able to distinguish
> from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
> date).
>
> To minimize the space required to store corrected commit date, Git
> stores corrected commit date offsets into the commit-graph file. The
> corrected commit date offset for a commit is defined as the difference
> between its corrected commit date and actual commit date.
>
> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
> in most cases is 64 bits (uintmax_t). However, corrected commit date
> offsets can be safely stored using only 32-bits. This halves the size
> of GDAT chunk, which is a reduction of around 6% in the size of
> commit-graph file.
>
> However, using offsets be problematic if one of commits is malformed but
However, using 32-bit offsets is problematic if a commit is malformed...
> valid and has committerdate of 0 Unix time, as the offset would be the
s/committerdate/committer date/
> same as corrected commit date and thus require 64-bits to be stored
> properly.
>
> While Git does not write out offsets at this stage, Git stores the
> corrected commit dates in member generation of struct commit_graph_data.
> It will begin writing commit date offsets with the introduction of
> generation data chunk.
>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
> commit-graph.c | 21 +++++++++++++++++----
> 1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1b2a015f92f..bfc3aae5f93 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1339,9 +1339,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> ctx->commits.nr);
> for (i = 0; i < ctx->commits.nr; i++) {
> uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> + timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
>
> display_progress(ctx->progress, i + 1);
> - if (level != GENERATION_NUMBER_ZERO)
> + if (level != GENERATION_NUMBER_ZERO &&
> + corrected_commit_date != GENERATION_NUMBER_ZERO)
> continue;
>
> commit_list_insert(ctx->commits.list[i], &list);
> @@ -1350,16 +1352,23 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> struct commit_list *parent;
> int all_parents_computed = 1;
> uint32_t max_level = 0;
> + timestamp_t max_corrected_commit_date = 0;
>
> for (parent = current->parents; parent; parent = parent->next) {
> level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> + corrected_commit_date = commit_graph_data_at(parent->item)->generation;
>
> - if (level == GENERATION_NUMBER_ZERO) {
> + if (level == GENERATION_NUMBER_ZERO ||
> + corrected_commit_date == GENERATION_NUMBER_ZERO) {
> all_parents_computed = 0;
> commit_list_insert(parent->item, &list);
> break;
> - } else if (level > max_level) {
> - max_level = level;
> + } else {
> + if (level > max_level)
> + max_level = level;
> +
> + if (corrected_commit_date > max_corrected_commit_date)
> + max_corrected_commit_date = corrected_commit_date;
nit: the "break" in the first case makes it so this large else block
is unnecessary.
- if (level == GENERATION_NUMBER_ZERO) {
+ if (level == GENERATION_NUMBER_ZERO ||
+ corrected_commit_date == GENERATION_NUMBER_ZERO) {
all_parents_computed = 0;
commit_list_insert(parent->item, &list);
break;
- } else if (level > max_level) {
- max_level = level;
+
+ if (level > max_level)
+ max_level = level;
+
+ if (corrected_commit_date > max_corrected_commit_date)
+ max_corrected_commit_date = corrected_commit_date;
- }
}
Thanks,
-Stolee
* Re: [PATCH v5 07/11] commit-graph: implement corrected commit date
2020-12-30 1:53 ` Derrick Stolee
@ 2021-01-10 12:21 ` Abhishek Kumar
0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-10 12:21 UTC (permalink / raw)
To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me
On Tue, Dec 29, 2020 at 08:53:11PM -0500, Derrick Stolee wrote:
> On 12/28/2020 6:16 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With most of preparations done, let's implement corrected commit date.
> >
> > The corrected commit date for a commit is defined as:
> >
> > * A commit with no parents (a root commit) has corrected commit date
> > equal to its committer date.
> > * A commit with at least one parent has corrected commit date equal to
> > the maximum of its commit date and one more than the largest corrected
> > commit date among its parents.
> >
> > As a special case, a root commit with timestamp of zero (01.01.1970
> > 00:00:00Z) has corrected commit date of one, to be able to distinguish
> > from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
> > date).
> >
> > To minimize the space required to store corrected commit date, Git
> > stores corrected commit date offsets into the commit-graph file. The
> > corrected commit date offset for a commit is defined as the difference
> > between its corrected commit date and actual commit date.
> >
> > Storing corrected commit date requires sizeof(timestamp_t) bytes, which
> > in most cases is 64 bits (uintmax_t). However, corrected commit date
> > offsets can be safely stored using only 32-bits. This halves the size
> > of GDAT chunk, which is a reduction of around 6% in the size of
> > commit-graph file.
> >
> > However, using offsets be problematic if one of commits is malformed but
>
> However, using 32-bit offsets is problematic if a commit is malformed...
>
> > valid and has committerdate of 0 Unix time, as the offset would be the
>
> s/committerdate/committer date/
>
> > same as corrected commit date and thus require 64-bits to be stored
> > properly.
> >
> > While Git does not write out offsets at this stage, Git stores the
> > corrected commit dates in member generation of struct commit_graph_data.
> > It will begin writing commit date offsets with the introduction of
> > generation data chunk.
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> > commit-graph.c | 21 +++++++++++++++++----
> > 1 file changed, 17 insertions(+), 4 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 1b2a015f92f..bfc3aae5f93 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -1339,9 +1339,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> > ctx->commits.nr);
> > for (i = 0; i < ctx->commits.nr; i++) {
> > uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> > + timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
> >
> > display_progress(ctx->progress, i + 1);
> > - if (level != GENERATION_NUMBER_ZERO)
> > + if (level != GENERATION_NUMBER_ZERO &&
> > + corrected_commit_date != GENERATION_NUMBER_ZERO)
> > continue;
> >
> > commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1350,16 +1352,23 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> > struct commit_list *parent;
> > int all_parents_computed = 1;
> > uint32_t max_level = 0;
> > + timestamp_t max_corrected_commit_date = 0;
> >
> > for (parent = current->parents; parent; parent = parent->next) {
> > level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> > + corrected_commit_date = commit_graph_data_at(parent->item)->generation;
> >
> > - if (level == GENERATION_NUMBER_ZERO) {
> > + if (level == GENERATION_NUMBER_ZERO ||
> > + corrected_commit_date == GENERATION_NUMBER_ZERO) {
> > all_parents_computed = 0;
> > commit_list_insert(parent->item, &list);
> > break;
> > - } else if (level > max_level) {
> > - max_level = level;
> > + } else {
> > + if (level > max_level)
> > + max_level = level;
> > +
> > + if (corrected_commit_date > max_corrected_commit_date)
> > + max_corrected_commit_date = corrected_commit_date;
>
> nit: the "break" in the first case makes it so this large else block
> is unnecessary.
Thanks, removed.
>
> - if (level == GENERATION_NUMBER_ZERO) {
> + if (level == GENERATION_NUMBER_ZERO ||
> + corrected_commit_date == GENERATION_NUMBER_ZERO) {
> all_parents_computed = 0;
> commit_list_insert(parent->item, &list);
> break;
> - } else if (level > max_level) {
> - max_level = level;
> +
> + if (level > max_level)
> + max_level = level;
> +
> + if (corrected_commit_date > max_corrected_commit_date)
> + max_corrected_commit_date = corrected_commit_date;
> - }
> }
>
> Thanks,
> -Stolee
>
Thanks
- Abhishek
* [PATCH v5 08/11] commit-graph: implement generation data chunk
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
` (6 preceding siblings ...)
2020-12-28 11:16 ` [PATCH v5 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16 ` Abhishek Kumar via GitGitGadget
2020-12-28 11:16 ` [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
` (4 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
As discovered by Ævar, we cannot increment the graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of the
prerequisites for implementing generation number v2 was to distinguish
between graph versions in a backwards compatible manner.
We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.
Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.
To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instead
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4 bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:
We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.
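The MSB scheme described above can be sketched for a single 32-bit
GDAT entry as follows. This is a minimal illustration, not the code
in this patch: the SKETCH_* macros mirror
GENERATION_NUMBER_V2_OFFSET_MAX and
CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW, and gdat_encode() and
gdat_is_overflow() are hypothetical helpers. What is stored at the
referenced GDOV position is left out of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Largest offset that fits directly into a GDAT entry: 2^31 - 1. */
#define SKETCH_OFFSET_MAX ((UINT64_C(1) << 31) - 1)
/* MSB of a GDAT entry: set when the entry points into GDOV. */
#define SKETCH_OVERFLOW_BIT (UINT32_C(1) << 31)

/*
 * Encode the 32-bit GDAT entry for one commit. Small offsets are stored
 * directly; otherwise the MSB is set and the low 31 bits hold the
 * position of the commit's entry in the overflow chunk.
 */
uint32_t gdat_encode(uint64_t offset, uint32_t *next_overflow_index)
{
	if (offset <= SKETCH_OFFSET_MAX)
		return (uint32_t)offset;

	assert(*next_overflow_index < SKETCH_OVERFLOW_BIT);
	return SKETCH_OVERFLOW_BIT | (*next_overflow_index)++;
}

/* Return 1 and set *index if the entry points into the overflow chunk. */
int gdat_is_overflow(uint32_t entry, uint32_t *index)
{
	if (!(entry & SKETCH_OVERFLOW_BIT))
		return 0;
	*index = entry & ~SKETCH_OVERFLOW_BIT;
	return 1;
}
```

Overflow positions are handed out in the order commits are written,
so the n-th overflowing commit refers to the n-th GDOV entry, much
like the way positions in the Extra Edge List are assigned.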
We test the overflow-related code with the following repo history:
F - N - U
/ \
U - N - U N
\ /
N - F - N
Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.
The largest offset observed is 2 ^ 31, just large enough to overflow.
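The 2 ^ 31 figure can be checked by walking the upper path of the
history (U - N - U - F - N - U, oldest first). The sketch below is a
simplification under stated assumptions: it handles only a linear
chain, every N commit uses exactly 1112354055 as its date, and
largest_offset() and the DATE_* macros are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DATE_U UINT64_C(0)               /* "U": zero seconds since the epoch */
#define DATE_N UINT64_C(1112354055)      /* "N": default test-suite date */
#define DATE_F ((UINT64_C(1) << 31) - 2) /* "F": 2^31 - 2 */

/*
 * Walk a linear chain of committer dates, oldest first, and return the
 * largest (corrected commit date - committer date) seen along it.
 */
uint64_t largest_offset(const uint64_t *dates, size_t nr)
{
	uint64_t corrected = 0, largest = 0;
	size_t i;

	assert(dates || !nr);
	for (i = 0; i < nr; i++) {
		/* max(own date, parent's corrected date + 1), never zero */
		corrected = dates[i] > corrected ? dates[i] : corrected + 1;
		if (corrected - dates[i] > largest)
			largest = corrected - dates[i];
	}
	return largest;
}
```

Along that path F gets corrected commit date 2^31 - 2, the following
N gets 2^31 - 1, and the final U gets 2^31. Since the final U has a
committer date of zero, its offset is 2^31, one more than
GENERATION_NUMBER_V2_OFFSET_MAX.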
[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 111 ++++++++++++++++++++++++++++++----
commit-graph.h | 3 +
commit.h | 1 +
t/README | 3 +
t/helper/test-read-graph.c | 4 ++
t/t4216-log-bloom.sh | 4 +-
t/t5318-commit-graph.sh | 79 ++++++++++++++++++++----
t/t5324-split-commit-graph.sh | 12 ++--
t/t6600-test-reach.sh | 6 ++
t/test-lib-functions.sh | 6 ++
10 files changed, 197 insertions(+), 32 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index bfc3aae5f93..629b2f17fbc 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 9
#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
@@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
+#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
+
/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)
@@ -390,6 +394,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
graph->chunk_commit_data = data + chunk_offset;
break;
+ case GRAPH_CHUNKID_GENERATION_DATA:
+ if (graph->chunk_generation_data)
+ chunk_repeated = 1;
+ else
+ graph->chunk_generation_data = data + chunk_offset;
+ break;
+
+ case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
+ if (graph->chunk_generation_data_overflow)
+ chunk_repeated = 1;
+ else
+ graph->chunk_generation_data_overflow = data + chunk_offset;
+ break;
+
case GRAPH_CHUNKID_EXTRAEDGES:
if (graph->chunk_extra_edges)
chunk_repeated = 1;
@@ -750,8 +768,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
{
const unsigned char *commit_data;
struct commit_graph_data *graph_data;
- uint32_t lex_index;
- uint64_t date_high, date_low;
+ uint32_t lex_index, offset_pos;
+ uint64_t date_high, date_low, offset;
while (pos < g->num_commits_in_base)
g = g->base_graph;
@@ -769,7 +787,16 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
- graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+ if (g->chunk_generation_data) {
+ offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+
+ if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
+ offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
+ graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
+ } else
+ graph_data->generation = item->date + offset;
+ } else
+ graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
if (g->topo_levels)
*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
@@ -941,6 +968,7 @@ struct write_commit_graph_context {
struct oid_array oids;
struct packed_commit_list commits;
int num_extra_edges;
+ int num_generation_data_overflows;
unsigned long approx_nr_objects;
struct progress *progress;
int progress_done;
@@ -959,7 +987,8 @@ struct write_commit_graph_context {
report_progress:1,
split:1,
changed_paths:1,
- order_by_pack:1;
+ order_by_pack:1,
+ write_generation_data:1;
struct topo_level_slab *topo_levels;
const struct commit_graph_opts *opts;
@@ -1119,6 +1148,45 @@ static int write_graph_chunk_data(struct hashfile *f,
return 0;
}
+static int write_graph_chunk_generation_data(struct hashfile *f,
+ struct write_commit_graph_context *ctx)
+{
+ int i, num_generation_data_overflows = 0;
+
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+ display_progress(ctx->progress, ++ctx->progress_cnt);
+
+ if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+ offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
+ num_generation_data_overflows++;
+ }
+
+ hashwrite_be32(f, offset);
+ }
+
+ return 0;
+}
+
+static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
+ struct write_commit_graph_context *ctx)
+{
+ int i;
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+ display_progress(ctx->progress, ++ctx->progress_cnt);
+
+ if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+ hashwrite_be32(f, offset >> 32);
+ hashwrite_be32(f, (uint32_t) offset);
+ }
+ }
+
+ return 0;
+}
+
static int write_graph_chunk_extra_edges(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
@@ -1382,6 +1450,9 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (current->date && current->date > max_corrected_commit_date)
max_corrected_commit_date = current->date - 1;
commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
+
+ if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
+ ctx->num_generation_data_overflows++;
}
}
}
@@ -1715,6 +1786,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
chunks[2].id = GRAPH_CHUNKID_DATA;
chunks[2].size = (hashsz + 16) * ctx->commits.nr;
chunks[2].write_fn = write_graph_chunk_data;
+
+ if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
+ ctx->write_generation_data = 0;
+ if (ctx->write_generation_data) {
+ chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
+ chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+ chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
+ num_chunks++;
+ }
+ if (ctx->num_generation_data_overflows) {
+ chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
+ chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
+ chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
+ num_chunks++;
+ }
if (ctx->num_extra_edges) {
chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
chunks[num_chunks].size = 4 * ctx->num_extra_edges;
@@ -2135,6 +2221,8 @@ int write_commit_graph(struct object_directory *odb,
ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
ctx->opts = opts;
ctx->total_bloom_filter_data_size = 0;
+ ctx->write_generation_data = 1;
+ ctx->num_generation_data_overflows = 0;
bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
bloom_settings.bits_per_entry);
@@ -2441,16 +2529,17 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
continue;
/*
- * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
- * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
- * extra logic in the following condition.
+ * If we are using topological level and one of our parents has
+ * generation GENERATION_NUMBER_V1_MAX, then our generation is
+ * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
+ * in the following condition.
*/
- if (max_generation == GENERATION_NUMBER_V1_MAX)
+ if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
max_generation--;
generation = commit_graph_generation(graph_commit);
- if (generation != max_generation + 1)
- graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
+ if (generation < max_generation + 1)
+ graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
oid_to_hex(&cur_oid),
generation,
max_generation + 1);
diff --git a/commit-graph.h b/commit-graph.h
index 2e9aa7824ee..19a02001fde 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,6 +6,7 @@
#include "oidset.h"
#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
@@ -68,6 +69,8 @@ struct commit_graph {
const uint32_t *chunk_oid_fanout;
const unsigned char *chunk_oid_lookup;
const unsigned char *chunk_commit_data;
+ const unsigned char *chunk_generation_data;
+ const unsigned char *chunk_generation_data_overflow;
const unsigned char *chunk_extra_edges;
const unsigned char *chunk_base_graphs;
const unsigned char *chunk_bloom_indexes;
diff --git a/commit.h b/commit.h
index 33c66b2177c..251d877fcf6 100644
--- a/commit.h
+++ b/commit.h
@@ -14,6 +14,7 @@
#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0
+#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)
struct commit_list {
struct commit *item;
diff --git a/t/README b/t/README
index c730a707705..8a121487279 100644
--- a/t/README
+++ b/t/README
@@ -393,6 +393,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
be written after every 'git commit' command, and overrides the
'core.commitGraph' setting to true.
+GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
+commit-graph to be written without generation data chunk.
+
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
commit-graph write to compute and write changed path Bloom filters for
every 'git commit-graph write', as if the `--changed-paths` option was
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 5f585a17256..75927b2c81d 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -33,6 +33,10 @@ int cmd__read_graph(int argc, const char **argv)
printf(" oid_lookup");
if (graph->chunk_commit_data)
printf(" commit_metadata");
+ if (graph->chunk_generation_data)
+ printf(" generation_data");
+ if (graph->chunk_generation_data_overflow)
+ printf(" generation_data_overflow");
if (graph->chunk_extra_edges)
printf(" extra_edges");
if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index d11040ce41c..dbde0161882 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -40,11 +40,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
'
graph_read_expect () {
- NUM_CHUNKS=5
+ NUM_CHUNKS=6
cat >expect <<- EOF
header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
num_commits: $1
- chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+ chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
EOF
test-tool read-graph >actual &&
test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2ed0c1544da..fa27df579a5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -76,7 +76,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
graph_read_expect() {
OPTIONAL=""
NUM_CHUNKS=3
- if test ! -z $2
+ if test ! -z "$2"
then
OPTIONAL=" $2"
NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
@@ -103,14 +103,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
# valid commit and tree OID
git rev-parse HEAD HEAD^{tree} >in &&
git commit-graph write --stdin-commits <in &&
- graph_read_expect 3
+ graph_read_expect 3 generation_data
'
test_expect_success 'write graph' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "3"
+ graph_read_expect "3" generation_data
'
test_expect_success POSIXPERM 'write graph has correct permissions' '
@@ -219,7 +219,7 @@ test_expect_success 'write graph with merges' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "10" "extra_edges"
+ graph_read_expect "10" "generation_data extra_edges"
'
graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
@@ -254,7 +254,7 @@ test_expect_success 'write graph with new commit' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -264,7 +264,7 @@ test_expect_success 'write graph with nothing new' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -274,7 +274,7 @@ test_expect_success 'build graph from latest pack with closure' '
cd "$TRASH_DIRECTORY/full" &&
cat new-idx | git commit-graph write --stdin-packs &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "9" "extra_edges"
+ graph_read_expect "9" "generation_data extra_edges"
'
graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
@@ -287,7 +287,7 @@ test_expect_success 'build graph from commits with closure' '
git rev-parse merge/1 >>commits-in &&
cat commits-in | git commit-graph write --stdin-commits &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "6"
+ graph_read_expect "6" "generation_data"
'
graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
@@ -297,7 +297,7 @@ test_expect_success 'build graph from commits with append' '
cd "$TRASH_DIRECTORY/full" &&
git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "10" "extra_edges"
+ graph_read_expect "10" "generation_data extra_edges"
'
graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -307,7 +307,7 @@ test_expect_success 'build graph using --reachable' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write --reachable &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -328,7 +328,7 @@ test_expect_success 'write graph in bare repo' '
cd "$TRASH_DIRECTORY/bare" &&
git commit-graph write &&
test_path_is_file $baredir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
@@ -454,8 +454,9 @@ test_expect_success 'warn on improper hash version' '
test_expect_success 'git commit-graph verify' '
cd "$TRASH_DIRECTORY/full" &&
- git rev-parse commits/8 | git commit-graph write --stdin-commits &&
- git commit-graph verify >output
+ git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
+ git commit-graph verify >output &&
+ graph_read_expect 9 extra_edges
'
NUM_COMMITS=9
@@ -741,4 +742,56 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
)
'
+# We test the overflow-related code with the following repo history:
+#
+# 4:F - 5:N - 6:U
+# / \
+# 1:U - 2:N - 3:U M:N
+# \ /
+# 7:N - 8:F - 9:N
+#
+# Here the commits denoted by U have committer date of zero seconds
+# since Unix epoch, the commits denoted by N have committer date
+# starting from 1112354055 seconds since Unix epoch (default committer
+# date for the test suite), and the commits denoted by F have committer
+# date of (2 ^ 31 - 2) seconds since Unix epoch.
+#
+# The largest offset observed is 2 ^ 31, just large enough to overflow.
+#
+
+test_expect_success 'set up and verify repo with generation data overflow chunk' '
+ objdir=".git/objects" &&
+ UNIX_EPOCH_ZERO="@0 +0000" &&
+ FUTURE_DATE="@2147483646 +0000" &&
+ test_oid_cache <<-EOF &&
+ oid_version sha1:1
+ oid_version sha256:2
+ EOF
+ cd "$TRASH_DIRECTORY" &&
+ mkdir repo &&
+ cd repo &&
+ git init &&
+ test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
+ test_commit 2 &&
+ test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
+ git commit-graph write --reachable &&
+ graph_read_expect 3 generation_data &&
+ test_commit --date "$FUTURE_DATE" 4 &&
+ test_commit 5 &&
+ test_commit --date "$UNIX_EPOCH_ZERO" 6 &&
+ git branch left &&
+ git reset --hard 3 &&
+ test_commit 7 &&
+ test_commit --date "$FUTURE_DATE" 8 &&
+ test_commit 9 &&
+ git branch right &&
+ git reset --hard 3 &&
+ test_merge M left right &&
+ git commit-graph write --reachable &&
+ graph_read_expect 10 "generation_data generation_data_overflow" &&
+ git commit-graph verify
+'
+
+graph_git_behavior 'generation data overflow chunk repo' repo left right
+
test_done
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 4d3842b83b9..587757b62d9 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
infodir=".git/objects/info" &&
graphdir="$infodir/commit-graphs" &&
test_oid_cache <<-EOM
- shallow sha1:1760
- shallow sha256:2064
+ shallow sha1:2132
+ shallow sha256:2436
- base sha1:1376
- base sha256:1496
+ base sha1:1408
+ base sha256:1528
oid_version sha1:1
oid_version sha256:2
@@ -31,9 +31,9 @@ graph_read_expect() {
NUM_BASE=$2
fi
cat >expect <<- EOF
- header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
+ header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
num_commits: $1
- chunks: oid_fanout oid_lookup commit_metadata
+ chunks: oid_fanout oid_lookup commit_metadata generation_data
EOF
test-tool read-graph >output &&
test_cmp expect output
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index af10f0dc090..e2d33a8a4c4 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,6 +55,9 @@ test_expect_success 'setup' '
git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
mv .git/objects/info/commit-graph commit-graph-half &&
chmod u+w commit-graph-half &&
+ GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
+ mv .git/objects/info/commit-graph commit-graph-no-gdat &&
+ chmod u+w commit-graph-no-gdat &&
git config core.commitGraph true
'
@@ -67,6 +70,9 @@ run_all_modes () {
test_cmp expect actual &&
cp commit-graph-half .git/objects/info/commit-graph &&
"$@" <input >actual &&
+ test_cmp expect actual &&
+ cp commit-graph-no-gdat .git/objects/info/commit-graph &&
+ "$@" <input >actual &&
test_cmp expect actual
}
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 999982fe4a9..3ad712c3acc 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -202,6 +202,12 @@ test_commit () {
--signoff)
signoff="$1"
;;
+ --date)
+ notick=yes
+ GIT_COMMITTER_DATE="$2"
+ GIT_AUTHOR_DATE="$2"
+ shift
+ ;;
-C)
indir="$2"
shift
--
gitgitgadget
* [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
` (7 preceding siblings ...)
2020-12-28 11:16 ` [PATCH v5 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16 ` Abhishek Kumar via GitGitGadget
2020-12-30 3:23 ` Derrick Stolee
2020-12-28 11:16 ` [PATCH v5 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
` (3 subsequent siblings)
12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:
1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
If each layer of a split commit-graph is treated independently, as was
the case before this commit, with Git inspecting only the current layer
for the chunk_generation_data pointer, commits in the lower layer (the
one with GDAT) would have corrected commit dates as their generation
number, while commits in the upper layer would have topological levels
as their generation number. Corrected commit dates usually have much
larger values than topological levels. This means that if we take two
commits, one from the upper layer and one reachable from it in the lower
layer, then the expectation that the generation of a parent is smaller
than the generation of a child would be violated.
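The violated expectation can be illustrated with a small, self-contained
sketch. This is not Git's code: struct toy_commit and both helper
functions are hypothetical names made up for this illustration, and the
dates are arbitrary sample values.

```c
#include <stdint.h>

/* Hypothetical, simplified commit: only the fields needed here. */
struct toy_commit {
	uint64_t committer_date; /* seconds since the Unix epoch */
	uint32_t topo_level;     /* generation number v1 */
	uint64_t corrected_date; /* generation number v2 */
};

/*
 * Generation as seen by a reader that inspects only the layer the
 * commit lives in: corrected commit date if that layer has GDAT,
 * topological level otherwise.
 */
static uint64_t toy_generation(const struct toy_commit *c, int layer_has_gdat)
{
	return layer_has_gdat ? c->corrected_date : c->topo_level;
}

/* Returns 1 if generation(parent) < generation(child), 0 otherwise. */
static int toy_invariant_holds(const struct toy_commit *parent,
			       int parent_layer_has_gdat,
			       const struct toy_commit *child,
			       int child_layer_has_gdat)
{
	return toy_generation(parent, parent_layer_has_gdat) <
	       toy_generation(child, child_layer_has_gdat);
}
```

With a parent in a GDAT layer (corrected date around 1.1 billion) and a
child in a non-GDAT layer (topological level 65), the check fails; it
holds again once both commits are read with the same kind of generation
number.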
It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" against any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.
This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).
Therefore, when writing the new layer in a split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has a GDAT chunk, all lower layers must have a GDAT chunk
as well.
Rewriting layers follows a similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain a GDAT chunk, then the result of the
rewrite does not have a GDAT chunk either.
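The read-side and write-side rules described above can be sketched as
follows. This is a simplified model, not the patch itself: struct
toy_layer and both functions are hypothetical, and the sketch ignores
the GIT_TEST_COMMIT_GRAPH_NO_GDAT test knob.

```c
#include <stddef.h>

/* Hypothetical model of one layer in a split commit-graph chain. */
struct toy_layer {
	int has_gdat;           /* 1 if this layer carries a GDAT chunk */
	struct toy_layer *base; /* next lower layer, or NULL at the bottom */
};

/*
 * Read side: the whole chain uses corrected commit dates only if the
 * topmost layer has GDAT. Given the write rule below, that implies
 * every lower layer has GDAT as well.
 */
static int toy_chain_reads_gdat(const struct toy_layer *top)
{
	return top ? !!top->has_gdat : 0;
}

/*
 * Write side: remaining_top is the topmost layer that is kept below the
 * layer being written, or NULL if no layer is kept (first write, or the
 * whole chain is replaced). The new layer gets GDAT unless the layer it
 * sits on lacks it.
 */
static int toy_should_write_gdat(const struct toy_layer *remaining_top)
{
	if (!remaining_top)
		return 1;
	return !!remaining_top->has_gdat;
}
```

For the 64-commit GDAT base with a 16-commit non-GDAT layer on top, a
third layer written on top gets no GDAT, while merging everything above
the base yields a layer with GDAT.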
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 29 +++++-
commit-graph.h | 1 +
t/t5324-split-commit-graph.sh | 181 ++++++++++++++++++++++++++++++++++
3 files changed, 209 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 629b2f17fbc..41a65d98738 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -610,6 +610,21 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
return graph_chain;
}
+static void validate_mixed_generation_chain(struct commit_graph *g)
+{
+ int read_generation_data;
+
+ if (!g)
+ return;
+
+ read_generation_data = !!g->chunk_generation_data;
+
+ while (g) {
+ g->read_generation_data = read_generation_data;
+ g = g->base_graph;
+ }
+}
+
struct commit_graph *read_commit_graph_one(struct repository *r,
struct object_directory *odb)
{
@@ -618,6 +633,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
if (!g)
g = load_commit_graph_chain(r, odb);
+ validate_mixed_generation_chain(g);
+
return g;
}
@@ -787,7 +804,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
- if (g->chunk_generation_data) {
+ if (g->read_generation_data) {
offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
@@ -2012,6 +2029,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
if (i < ctx->num_commit_graphs_after)
ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
+ /*
+ * If the topmost remaining layer has generation data chunk, the
+ * resultant layer also has generation data chunk.
+ */
+ if (i == ctx->num_commit_graphs_after - 2)
+ ctx->write_generation_data = !!g->chunk_generation_data;
+
i--;
g = g->base_graph;
}
@@ -2239,6 +2263,7 @@ int write_commit_graph(struct object_directory *odb,
struct commit_graph *g = ctx->r->objects->commit_graph;
while (g) {
+ g->read_generation_data = 1;
g->topo_levels = &topo_levels;
g = g->base_graph;
}
@@ -2534,7 +2559,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
* also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
* in the following condition.
*/
- if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
+ if (!g->read_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
max_generation--;
generation = commit_graph_generation(graph_commit);
diff --git a/commit-graph.h b/commit-graph.h
index 19a02001fde..ad52130883b 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -64,6 +64,7 @@ struct commit_graph {
struct object_directory *odb;
uint32_t num_commits_in_base;
+ unsigned int read_generation_data;
struct commit_graph *base_graph;
const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 587757b62d9..8e90f3423b8 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -453,4 +453,185 @@ test_expect_success 'prevent regression for duplicate commits across layers' '
git -C dup commit-graph verify
'
+NUM_FIRST_LAYER_COMMITS=64
+NUM_SECOND_LAYER_COMMITS=16
+NUM_THIRD_LAYER_COMMITS=7
+NUM_FOURTH_LAYER_COMMITS=8
+NUM_FIFTH_LAYER_COMMITS=16
+SECOND_LAYER_SEQUENCE_START=$(($NUM_FIRST_LAYER_COMMITS + 1))
+SECOND_LAYER_SEQUENCE_END=$(($SECOND_LAYER_SEQUENCE_START + $NUM_SECOND_LAYER_COMMITS - 1))
+THIRD_LAYER_SEQUENCE_START=$(($SECOND_LAYER_SEQUENCE_END + 1))
+THIRD_LAYER_SEQUENCE_END=$(($THIRD_LAYER_SEQUENCE_START + $NUM_THIRD_LAYER_COMMITS - 1))
+FOURTH_LAYER_SEQUENCE_START=$(($THIRD_LAYER_SEQUENCE_END + 1))
+FOURTH_LAYER_SEQUENCE_END=$(($FOURTH_LAYER_SEQUENCE_START + $NUM_FOURTH_LAYER_COMMITS - 1))
+FIFTH_LAYER_SEQUENCE_START=$(($FOURTH_LAYER_SEQUENCE_END + 1))
+FIFTH_LAYER_SEQUENCE_END=$(($FIFTH_LAYER_SEQUENCE_START + $NUM_FIFTH_LAYER_COMMITS - 1))
+
+# Current split graph chain:
+#
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+ graphdir=".git/objects/info/commit-graphs" &&
+ test_oid_cache <<-EOF &&
+ oid_version sha1:1
+ oid_version sha256:2
+ EOF
+ git init mixed &&
+ (
+ cd mixed &&
+ git config core.commitGraph true &&
+ git config gc.writeCommitGraph false &&
+ for i in $(test_seq $NUM_FIRST_LAYER_COMMITS)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split &&
+ graph_read_expect $NUM_FIRST_LAYER_COMMITS &&
+ test_line_count = 1 $graphdir/commit-graph-chain &&
+ for i in $(test_seq $SECOND_LAYER_SEQUENCE_START $SECOND_LAYER_SEQUENCE_END)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+ test_line_count = 2 $graphdir/commit-graph-chain &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 4 1
+ num_commits: $NUM_SECOND_LAYER_COMMITS
+ chunks: oid_fanout oid_lookup commit_metadata
+ EOF
+ test_cmp expect output &&
+ git commit-graph verify &&
+ cat $graphdir/commit-graph-chain
+ )
+'
+
+# The new layer will be added without generation data chunk as it was not
+# present on the layer underneath it.
+#
+# 7 commits (No GDAT)
+# ------------------------
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+test_expect_success 'do not write generation data chunk if not present on existing tip' '
+ git clone mixed mixed-no-gdat &&
+ (
+ cd mixed-no-gdat &&
+ for i in $(test_seq $THIRD_LAYER_SEQUENCE_START $THIRD_LAYER_SEQUENCE_END)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split=no-merge &&
+ test_line_count = 3 $graphdir/commit-graph-chain &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 4 2
+ num_commits: $NUM_THIRD_LAYER_COMMITS
+ chunks: oid_fanout oid_lookup commit_metadata
+ EOF
+ test_cmp expect output &&
+ git commit-graph verify
+ )
+'
+
+# Number of commits in each layer of the split-commit graph before merge:
+#
+# 8 commits (No GDAT)
+# ------------------------
+# 7 commits (No GDAT)
+# ------------------------
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+# The top two layers are merged and do not have generation data chunk as layer below them does
+# not have generation data chunk.
+#
+# 15 commits (No GDAT)
+# ------------------------
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+test_expect_success 'do not write generation data chunk if the topmost remaining layer does not have generation data chunk' '
+ git clone mixed-no-gdat mixed-merge-no-gdat &&
+ (
+ cd mixed-merge-no-gdat &&
+ for i in $(test_seq $FOURTH_LAYER_SEQUENCE_START $FOURTH_LAYER_SEQUENCE_END)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split --size-multiple 1 &&
+ test_line_count = 3 $graphdir/commit-graph-chain &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 4 2
+ num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
+ chunks: oid_fanout oid_lookup commit_metadata
+ EOF
+ test_cmp expect output &&
+ git commit-graph verify
+ )
+'
+
+# Number of commits in each layer of the split-commit graph before merge:
+#
+# 16 commits (No GDAT)
+# ------------------------
+# 15 commits (No GDAT)
+# ------------------------
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+# The top three layers are merged and the result has a generation data chunk, as the topmost
+# remaining layer has a generation data chunk.
+#
+# 47 commits (GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+test_expect_success 'write generation data chunk if topmost remaining layer has generation data chunk' '
+ git clone mixed-merge-no-gdat mixed-merge-gdat &&
+ (
+ cd mixed-merge-gdat &&
+ for i in $(test_seq $FIFTH_LAYER_SEQUENCE_START $FIFTH_LAYER_SEQUENCE_END)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split --size-multiple 1 &&
+ test_line_count = 2 $graphdir/commit-graph-chain &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 5 1
+ num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
+ chunks: oid_fanout oid_lookup commit_metadata generation_data
+ EOF
+ test_cmp expect output
+ )
+'
+
+test_expect_success 'write generation data chunk when commit-graph chain is replaced' '
+ git clone mixed mixed-replace &&
+ (
+ cd mixed-replace &&
+ git commit-graph write --reachable --split=replace &&
+ test_path_is_file $graphdir/commit-graph-chain &&
+ test_line_count = 1 $graphdir/commit-graph-chain &&
+ verify_chain_files_exist $graphdir &&
+ graph_read_expect $(($NUM_FIRST_LAYER_COMMITS + $NUM_SECOND_LAYER_COMMITS)) &&
+ git commit-graph verify
+ )
+'
+
test_done
--
gitgitgadget
* Re: [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does
2020-12-28 11:16 ` [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2020-12-30 3:23 ` Derrick Stolee
2021-01-10 13:13 ` Abhishek Kumar
0 siblings, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-12-30 3:23 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget, git
Cc: Jakub Narębski, Abhishek Kumar, Taylor Blau
On 12/28/2020 6:16 AM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
...
> +static void validate_mixed_generation_chain(struct commit_graph *g)
> +{
> + int read_generation_data;
> +
> + if (!g)
> + return;
> +
> + read_generation_data = !!g->chunk_generation_data;
> +
> + while (g) {
> + g->read_generation_data = read_generation_data;
> + g = g->base_graph;
> + }
> +}
> +
This method exists to say "use generation v2 if the top layer has it"
and that helps with the future layer checks.
> @@ -2239,6 +2263,7 @@ int write_commit_graph(struct object_directory *odb,
> struct commit_graph *g = ctx->r->objects->commit_graph;
>
> while (g) {
> + g->read_generation_data = 1;
> g->topo_levels = &topo_levels;
> g = g->base_graph;
> }
However, here you just turn them on automatically.
I think the diff you want is here:
struct commit_graph *g = ctx->r->objects->commit_graph;
+ validate_mixed_generation_chain(g);
+
while (g) {
g->topo_levels = &topo_levels;
g = g->base_graph;
}
But maybe you have a good reason for what you already have.
I paid attention to this because I hit a problem in my local testing.
After trying to reproduce it, I think the root cause is that I had a
commit-graph that was written by an older version of your series, so
it caused an unexpected pairing of an "offset required" bit but no
offset chunk.
Perhaps this diff is required in the proper place to avoid the
segfault I hit, in the case of a malformed commit-graph file:
diff --git a/commit-graph.c b/commit-graph.c
index c8d7ed1330..d264c90868 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -822,6 +822,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
+ if (!g->chunk_generation_data_overflow)
+ die(_("commit-graph requires overflow generation data but has none"));
+
offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
} else
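The guard suggested above can be seen in isolation in the following
sketch of decoding one GDAT entry. It is a standalone approximation
under stated assumptions: TOY_OFFSET_OVERFLOW stands in for
CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW and is assumed to be the top bit
of the 32-bit entry, the non-overflow branch is assumed to add the
offset to the committer date, values are taken in host byte order rather
than read with get_be32()/get_be64(), and the bounds check on the
overflow table is an extra precaution that is not part of the diff.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed flag bit marking "look in the overflow table". */
#define TOY_OFFSET_OVERFLOW (UINT32_C(1) << 31)

/*
 * Decode one generation-data entry.
 * Returns 0 and stores the corrected commit date in *generation on
 * success, or -1 if the entry points into an overflow table that is
 * missing or too short (the malformed-file case).
 */
static int toy_decode_generation(uint64_t committer_date, uint32_t raw,
				 const uint64_t *overflow, size_t overflow_len,
				 uint64_t *generation)
{
	if (raw & TOY_OFFSET_OVERFLOW) {
		uint32_t pos = raw ^ TOY_OFFSET_OVERFLOW;

		if (!overflow || pos >= overflow_len)
			return -1;
		/* The overflow table stores the full 64-bit value. */
		*generation = overflow[pos];
		return 0;
	}
	/* Small offsets are stored inline, relative to the committer date. */
	*generation = committer_date + raw;
	return 0;
}
```

Returning an error code instead of calling die() keeps the sketch
testable on its own; inside Git the suggested die() serves the same
purpose of refusing to dereference a missing chunk.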
Your tests in this patch seem very thorough, covering all the cases
I could think to create this strange situation. I even tried creating
cases where the overflow would be necessary. The following test actually
fails on the "graph_read_expect 6" due to the extra chunk, not the 'write'
process I was trying to trick into failure.
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 8e90f3423b..cfef8e52b9 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -453,6 +453,20 @@ test_expect_success 'prevent regression for duplicate commits across layers' '
git -C dup commit-graph verify
'
+test_expect_success 'upgrade to generation data succeeds when there was none' '
+ (
+ cd dup &&
+ rm -rf .git/objects/info/commit-graph* &&
+ GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph \
+ write --reachable &&
+ GIT_COMMITTER_DATE="1980-01-01 00:00" git commit --allow-empty -m one &&
+ GIT_COMMITTER_DATE="2090-01-01 00:00" git commit --allow-empty -m two &&
+ GIT_COMMITTER_DATE="2000-01-01 00:00" git commit --allow-empty -m three &&
+ git commit-graph write --reachable &&
+ graph_read_expect 6
+ )
+'
+
NUM_FIRST_LAYER_COMMITS=64
NUM_SECOND_LAYER_COMMITS=16
NUM_THIRD_LAYER_COMMITS=7
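To see why the three dated commits in the test above need the overflow
chunk, here is a rough sketch of the arithmetic for a linear history.
It assumes the usual definition (corrected commit date is the larger of
the committer date and one more than the parent's corrected commit date)
and that an offset of 2^31 or more no longer fits inline; both function
names and the macro are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed smallest offset that no longer fits in the inline 31 bits. */
#define TOY_OFFSET_OVERFLOW_THRESHOLD (UINT64_C(1) << 31)

/*
 * For a linear history with committer dates dates[0..n-1] (oldest
 * first), fill corrected[i] = max(dates[i], corrected[i - 1] + 1).
 */
static void toy_corrected_dates(const uint64_t *dates, size_t n,
				uint64_t *corrected)
{
	uint64_t prev = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t c = dates[i];

		if (i > 0 && prev + 1 > c)
			c = prev + 1;
		corrected[i] = c;
		prev = c;
	}
}

/* Returns 1 if the offset (corrected - date) must go to the overflow table. */
static int toy_offset_needs_overflow(uint64_t date, uint64_t corrected)
{
	return corrected - date >= TOY_OFFSET_OVERFLOW_THRESHOLD;
}
```

Taking the dates as UTC (an assumption), 1980, 2090 and 2000 map to
315532800, 3786912000 and 946684800. The last commit then gets a
corrected commit date of 3786912001, an offset of 2840227201, which is
above 2^31 and therefore produces the extra chunk.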
Thanks,
-Stolee
* Re: [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does
2020-12-30 3:23 ` Derrick Stolee
@ 2021-01-10 13:13 ` Abhishek Kumar
2021-01-11 12:43 ` Derrick Stolee
0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-10 13:13 UTC (permalink / raw)
To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me
On Tue, Dec 29, 2020 at 10:23:54PM -0500, Derrick Stolee wrote:
> On 12/28/2020 6:16 AM, Abhishek Kumar via GitGitGadget wrote:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> ...
>
> > +static void validate_mixed_generation_chain(struct commit_graph *g)
> > +{
> > + int read_generation_data;
> > +
> > + if (!g)
> > + return;
> > +
> > + read_generation_data = !!g->chunk_generation_data;
> > +
> > + while (g) {
> > + g->read_generation_data = read_generation_data;
> > + g = g->base_graph;
> > + }
> > +}
> > +
>
> This method exists to say "use generation v2 if the top layer has it"
> and that helps with the future layer checks.
>
> > @@ -2239,6 +2263,7 @@ int write_commit_graph(struct object_directory *odb,
> > struct commit_graph *g = ctx->r->objects->commit_graph;
> >
> > while (g) {
> > + g->read_generation_data = 1;
> > g->topo_levels = &topo_levels;
> > g = g->base_graph;
> > }
>
> However, here you just turn them on automatically.
>
> I think the diff you want is here:
>
> struct commit_graph *g = ctx->r->objects->commit_graph;
>
> + validate_mixed_generation_chain(g);
> +
> while (g) {
> g->topo_levels = &topo_levels;
> g = g->base_graph;
> }
>
> But maybe you have a good reason for what you already have.
>
Thanks, that was an oversight.
My (incorrect) reasoning at the time was:
Since we are computing both topological levels and corrected commit
dates, we can read corrected commit dates from layers with a GDAT chunk
hidden below non-GDAT layer.
But we end up storing both corrected commit date offsets (for layers with
a GDAT chunk) and topological levels (for layers without a GDAT chunk) in
the same slab with no way to distinguish between the two!
> I paid attention to this because I hit a problem in my local testing.
> After trying to reproduce it, I think the root cause is that I had a
> commit-graph that was written by an older version of your series, so
> it caused an unexpected pairing of an "offset required" bit but no
> offset chunk.
>
> Perhaps this diff is required in the proper place to avoid the
> segfault I hit, in the case of a malformed commit-graph file:
>
> diff --git a/commit-graph.c b/commit-graph.c
> index c8d7ed1330..d264c90868 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -822,6 +822,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
> offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
>
> if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
> + if (!g->chunk_generation_data_overflow)
> + die(_("commit-graph requires overflow generation data but has none"));
> +
> offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
> graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
> } else
>
> Your tests in this patch seem very thorough, covering all the cases
> I could think to create this strange situation. I even tried creating
> cases where the overflow would be necessary. The following test actually
> fails on the "graph_read_expect 6" due to the extra chunk, not the 'write'
> process I was trying to trick into failure.
>
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 8e90f3423b..cfef8e52b9 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -453,6 +453,20 @@ test_expect_success 'prevent regression for duplicate commits across layers' '
> git -C dup commit-graph verify
> '
>
> +test_expect_success 'upgrade to generation data succeeds when there was none' '
> + (
> + cd dup &&
> + rm -rf .git/objects/info/commit-graph* &&
> + GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph \
> + write --reachable &&
> + GIT_COMMITTER_DATE="1980-01-01 00:00" git commit --allow-empty -m one &&
> + GIT_COMMITTER_DATE="2090-01-01 00:00" git commit --allow-empty -m two &&
> + GIT_COMMITTER_DATE="2000-01-01 00:00" git commit --allow-empty -m three &&
> + git commit-graph write --reachable &&
> + graph_read_expect 6
> + )
> +'
I am not sure what this test adds over the existing generation data
overflow related tests added in t5318-commit-graph.sh
> +
> NUM_FIRST_LAYER_COMMITS=64
> NUM_SECOND_LAYER_COMMITS=16
> NUM_THIRD_LAYER_COMMITS=7
>
> Thanks,
> -Stolee
Thanks
- Abhishek
* Re: [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does
2021-01-10 13:13 ` Abhishek Kumar
@ 2021-01-11 12:43 ` Derrick Stolee
0 siblings, 0 replies; 211+ messages in thread
From: Derrick Stolee @ 2021-01-11 12:43 UTC (permalink / raw)
To: 2e89c6e1-e8e8-0d51-5670-038b4e296d93
Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me
On 1/10/2021 8:13 AM, Abhishek Kumar wrote:
> On Tue, Dec 29, 2020 at 10:23:54PM -0500, Derrick Stolee wrote:
>> Your tests in this patch seem very thorough, covering all the cases
>> I could think to create this strange situation. I even tried creating
>> cases where the overflow would be necessary. The following test actually
>> fails on the "graph_read_expect 6" due to the extra chunk, not the 'write'
>> process I was trying to trick into failure.
>>
>> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
>> index 8e90f3423b..cfef8e52b9 100755
>> --- a/t/t5324-split-commit-graph.sh
>> +++ b/t/t5324-split-commit-graph.sh
>> @@ -453,6 +453,20 @@ test_expect_success 'prevent regression for duplicate commits across layers' '
>> git -C dup commit-graph verify
>> '
>>
>> +test_expect_success 'upgrade to generation data succeeds when there was none' '
>> + (
>> + cd dup &&
>> + rm -rf .git/objects/info/commit-graph* &&
>> + GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph \
>> + write --reachable &&
>> + GIT_COMMITTER_DATE="1980-01-01 00:00" git commit --allow-empty -m one &&
>> + GIT_COMMITTER_DATE="2090-01-01 00:00" git commit --allow-empty -m two &&
>> + GIT_COMMITTER_DATE="2000-01-01 00:00" git commit --allow-empty -m three &&
>> + git commit-graph write --reachable &&
>> + graph_read_expect 6
>> + )
>> +'
>
> I am not sure what this test adds over the existing generation data
> overflow related tests added in t5318-commit-graph.sh
Good point.
-Stolee
* [PATCH v5 10/11] commit-reach: use corrected commit dates in paint_down_to_common()
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
` (8 preceding siblings ...)
2020-12-28 11:16 ` [PATCH v5 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16 ` Abhishek Kumar via GitGitGadget
2020-12-28 11:16 ` [PATCH v5 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
` (2 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
091f4cf (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.
For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits, taking 0.135s, when using committer
dates, and 167496 commits, taking 0.157s, when using corrected committer
dates. Although Git walks nearly the same number of commits in both
cases, using corrected commit dates is slower because each comparison
has to access a commit-slab (for the corrected committer date) instead
of a struct member (for the committer date).
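The difference in access pattern can be sketched with two toy
comparators. This only illustrates the extra indirection and is not
Git's implementation: struct toy_commit, the flat generation_table that
stands in for the commit-slab, and both function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical commit with its date inline and its generation elsewhere. */
struct toy_commit {
	uint32_t index; /* position of this commit's entry in the side table */
	uint64_t date;  /* committer date, stored in the struct itself */
};

/* Newer committer date sorts first; reads only struct members. */
static int toy_compare_by_date(const struct toy_commit *a,
			       const struct toy_commit *b)
{
	if (a->date > b->date)
		return -1;
	if (a->date < b->date)
		return 1;
	return 0;
}

/*
 * Higher generation sorts first, ties broken by committer date. Each
 * comparison needs two side-table lookups before any value is compared.
 */
static int toy_compare_by_generation(const uint64_t *generation_table,
				     const struct toy_commit *a,
				     const struct toy_commit *b)
{
	uint64_t ga = generation_table[a->index];
	uint64_t gb = generation_table[b->index];

	if (ga > gb)
		return -1;
	if (ga < gb)
		return 1;
	return toy_compare_by_date(a, b);
}
```

In this model the second comparator touches memory outside the commit
structs on every call, which is one plausible reason for the slightly
higher run time reported above even though the number of walked commits
is almost identical.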
This change incidentally broke the fragile t6404-recursive-merge test.
t6404-recursive-merge sets up a unique repository where all commits have
the same committer date without a well-defined merge-base.
While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.
With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.
While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.
As this has already caused problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable the commit-graph
within t6404-recursive-merge.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 14 ++++++++++++++
commit-graph.h | 6 ++++++
commit-reach.c | 2 +-
t/t6404-recursive-merge.sh | 5 ++++-
4 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 41a65d98738..c8d7ed13302 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -710,6 +710,20 @@ int generation_numbers_enabled(struct repository *r)
return !!first_generation;
}
+int corrected_commit_dates_enabled(struct repository *r)
+{
+ struct commit_graph *g;
+ if (!prepare_commit_graph(r))
+ return 0;
+
+ g = r->objects->commit_graph;
+
+ if (!g->num_commits)
+ return 0;
+
+ return g->read_generation_data;
+}
+
struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
{
struct commit_graph *g = r->objects->commit_graph;
diff --git a/commit-graph.h b/commit-graph.h
index ad52130883b..97f3497c279 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -95,6 +95,12 @@ struct commit_graph *parse_commit_graph(struct repository *r,
*/
int generation_numbers_enabled(struct repository *r);
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
+
struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
enum commit_graph_write_flags {
diff --git a/commit-reach.c b/commit-reach.c
index 9b24b0378d5..e38771ca5a1 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
int i;
timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
- if (!min_generation)
+ if (!min_generation && !corrected_commit_dates_enabled(r))
queue.compare = compare_commits_by_commit_date;
one->object.flags |= PARENT1;
diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
index b1c3d4dda49..86f74ae5847 100755
--- a/t/t6404-recursive-merge.sh
+++ b/t/t6404-recursive-merge.sh
@@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
export GIT_COMMITTER_DATE
test_expect_success 'setup tests' '
+ GIT_TEST_COMMIT_GRAPH=0 &&
+ export GIT_TEST_COMMIT_GRAPH &&
echo 1 >a1 &&
git add a1 &&
GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
'
test_expect_success 'combined merge conflicts' '
- test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+ test_must_fail git merge -m final G
'
test_expect_success 'result contains a conflict' '
@@ -82,6 +84,7 @@ test_expect_success 'result contains a conflict' '
'
test_expect_success 'virtual trees were processed' '
+ # TODO: fragile test, relies on ambiguous merge-base resolution
git ls-files --stage >out &&
cat >expect <<-EOF &&
--
gitgitgadget
* [PATCH v5 11/11] doc: add corrected commit date info
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
` (9 preceding siblings ...)
2020-12-28 11:16 ` [PATCH v5 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2020-12-28 11:16 ` Abhishek Kumar via GitGitGadget
2020-12-30 4:35 ` [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2020-12-28 11:16 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
With generation data chunk and corrected commit dates implemented, let's
update the technical documentation for commit-graph.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
.../technical/commit-graph-format.txt | 28 +++++--
Documentation/technical/commit-graph.txt | 77 +++++++++++++++----
2 files changed, 86 insertions(+), 19 deletions(-)
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b3b58880b92..b6658eff188 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
The Git commit graph stores a list of commit OIDs and some associated
metadata, including:
-- The generation number of the commit. Commits with no parents have
- generation number 1; commits with parents have generation number
- one more than the maximum generation number of its parents. We
- reserve zero as special, and can be used to mark a generation
- number invalid or as "not computed".
+- The generation number of the commit.
- The root tree OID.
@@ -86,13 +82,33 @@ CHUNK DATA:
position. If there are more than two parents, the second value
has its most-significant bit on and the other bits store an array
position into the Extra Edge List chunk.
- * The next 8 bytes store the generation number of the commit and
+ * The next 8 bytes store the topological level (generation number v1)
+ of the commit and
the commit time in seconds since EPOCH. The generation number
uses the higher 30 bits of the first 4 bytes, while the commit
time uses the 32 bits of the second 4 bytes, along with the lowest
2 bits of the lowest byte, storing the 33rd and 34th bit of the
commit time.
+ Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+ * This list of 4-byte values stores corrected commit date offsets for the
+ commits, arranged in the same order as commit data chunk.
+ * If the corrected commit date offset cannot be stored within 31 bits,
+ the value has its most-significant bit on and the other bits store
+ the position of corrected commit date into the Generation Data Overflow
+ chunk.
+ * Generation Data chunk is present only when commit-graph file is written
+ by compatible versions of Git and in case of split commit-graph chains,
+ the topmost layer also has Generation Data chunk.
+
+ Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
+ * This list of 8-byte values stores the corrected commit date offsets
+ for commits with corrected commit date offsets that cannot be
+ stored within 31 bits.
+ * Generation Data Overflow chunk is present only when Generation Data
+ chunk is present and at least one corrected commit date offset cannot
+ be stored within 31 bits.
+
Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
This list of 4-byte values store the second through nth parents for
all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index f14a7659aa8..f05e7bda1a9 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
Values 1-4 satisfy the requirements of parse_commit_gently().
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates (generation number v2)
+2. Topological levels (generation number v1)
- * A commit with no parents (a root commit) has generation number one.
+Define "corrected committer date" of a commit recursively as follows:
- * A commit with at least one parent has generation number one more than
- the largest generation number among its parents.
+ * A commit with no parents (a root commit) has corrected committer date
+ equal to its committer date.
-Equivalently, the generation number of a commit A is one more than the
+ * A commit with at least one parent has corrected committer date equal to
+ the maximum of its committer date and one more than the largest corrected
+ committer date among its parents.
+
+ * As a special case, a root commit with timestamp zero has corrected commit
+ date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+ (that is, an uncomputed corrected commit date).
+
+Define the "topological level" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has topological level of one.
+
+ * A commit with at least one parent has topological level one more than
+ the largest topological level among its parents.
+
+Equivalently, the topological level of a commit A is one more than the
length of a longest path from A to a root commit. The recursive definition
is easier to use for computation and observing the following property:
@@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
generation numbers, then we always expand the boundary commit with highest
generation number and can easily detect the stopping condition.
+The property applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
This property can be used to significantly reduce the time it takes to
walk commits and determine topological relationships. Without generation
numbers, the general heuristic is the following:
@@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
If A and B are commits with commit time X and Y, respectively, and
X < Y, then A _probably_ cannot reach B.
-This heuristic is currently used whenever the computation is allowed to
+In absence of corrected commit dates (for example, old versions of Git or
+mixed generation graph chains),
+this heuristic is currently used whenever the computation is allowed to
violate topological relationships due to clock skew (such as "git log"
with default order), but is not used when the topological order is
required (such as merge base calculations, "git log --graph").
@@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
generation number and walk until reaching commits with known generation
number.
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
in the commit-graph file. If a commit-graph file was written by a version
of Git that did not compute generation numbers, then those commits will
have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
walking a few extra commits, but the simplicity in dealing with commits
with generation number *_INFINITY or *_ZERO is valuable.
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
+topological levels (generation number v1) are computed to be at least
+this value. We limit at this value since it is the largest value that
+can be stored in the commit-graph file using the 30 bits available
+to topological levels. This presents another case where a commit can
+have generation number equal to that of a parent.
Design Details
--------------
@@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
number of commits) could be extracted into config settings for full
flexibility.
+## Handling Mixed Generation Number Chains
+
+With the introduction of generation number v2 and generation data chunk, the
+following scenario is possible:
+
+1. "New" Git writes a commit-graph with the corrected commit dates.
+2. "Old" Git writes a split commit-graph on top without corrected commit dates.
+
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
+levels of the higher layer. For this reason, Git inspects the topmost
+layer to see if the layer is missing corrected commit dates. In such a case
+Git only uses topological level for generation numbers.
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
+guarantees that if a layer has corrected commit dates, all lower layers
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
+commit dates. Instead, the new layer will have corrected commit dates if the
+layer below the new layer has corrected commit dates.
+
+While writing or merging layers, if the new layer is the only layer, it will
+have corrected commit dates when written by compatible versions of Git. Thus,
+rewriting split commit-graph as a single file (`--split=replace`) creates a
+single layer with corrected commit dates.
+
## Deleting graph-{hash} files
After a new tip file is written, some `graph-{hash}` files may no longer
--
gitgitgadget
* Re: [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
` (10 preceding siblings ...)
2020-12-28 11:16 ` [PATCH v5 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2020-12-30 4:35 ` Derrick Stolee
2021-01-10 14:06 ` Abhishek Kumar
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
12 siblings, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2020-12-30 4:35 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget, git
Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar
On 12/28/2020 6:15 AM, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.
Abhishek,
Thank you for this version. I appreciate your hard work on this topic,
especially after GSoC ended and you returned to being a full-time student.
My hope was that I could completely approve this series and only provide
forward-fixes from here on out, as necessary. I think there are a few minor
typos that you might want to address, but I was also able to understand your
intention.
I did make a particular case about a SEGFAULT I hit that I have been unable
to replicate. I saw it both in my copy of torvalds/linux and of
chromium/chromium. I have the file for chromium/chromium that is in a bad
state where a GDAT value includes the bit saying it should be in the long
offsets chunk, but that chunk doesn't exist. Further, that chunk doesn't
exist in a from-scratch write.
I'm now taking backups of my existing commit-graph files before any later
test, but it doesn't repro for my Git repository or any other repo I try on
purpose.
However, I did some performance testing to double-check your numbers. I sent
a patch [1] that helps with some of the hard numbers.
[1] https://lore.kernel.org/git/pull.828.git.1609302714183.gitgitgadget@gmail.com/
The big question is whether the overhead from using a slab to store the
generation values is worth it. I still think it is, for these reasons:
1. Generation number v2 is measurably better than v1 in most user cases.
2. Generation number v2 is slower than using committer date due to the
overhead, but _guarantees correctness_.
I like to use "git log --graph -<N>" to compare against topological levels
(v1), for various levels of <N>. When <N> is small, we hope to minimize
the amount we need to walk using the extra commit-date information as an
assistance. Repos like git/git and torvalds/linux use the philosophy of
"base your changes on oldest applicable commit" enough that v1 struggles
sometimes.
git/git: N=1000
Benchmark #1: baseline
Time (mean ± σ): 100.3 ms ± 4.2 ms [User: 89.0 ms, System: 11.3 ms]
Range (min … max): 94.5 ms … 105.1 ms 28 runs
Benchmark #2: test
Time (mean ± σ): 35.8 ms ± 3.1 ms [User: 29.6 ms, System: 6.2 ms]
Range (min … max): 29.8 ms … 40.6 ms 81 runs
Summary
'test' ran
2.80 ± 0.27 times faster than 'baseline'
This is a dramatic improvement! Using my topo-walk stats commit, I see that
v1 walks 58,805 commits as part of the in-degree walk while v2 only walks
4,335 commits!
torvalds/linux: N=1000 (starting at v5.10)
Benchmark #1: baseline
Time (mean ± σ): 90.8 ms ± 3.7 ms [User: 75.2 ms, System: 15.6 ms]
Range (min … max): 85.2 ms … 96.2 ms 31 runs
Benchmark #2: test
Time (mean ± σ): 49.2 ms ± 3.5 ms [User: 36.9 ms, System: 12.3 ms]
Range (min … max): 42.9 ms … 54.0 ms 61 runs
Summary
'test' ran
1.85 ± 0.15 times faster than 'baseline'
Similarly, v1 walked 38,161 commits compared to 4,340 by v2.
If I increase N to something like 10,000, then usually these values get
washed out due to the width of the parallel topics.
The place we were still using commit-date as a heuristic was paint_down_to_common
which caused a regression the first time we used v1, at least for certain cases.
Specifically, computing the merge-base in torvalds/linux between v4.8 and v4.9
hit a strangeness about a pair of recent commits both based on a very old commit,
but the generation numbers forced walking farther than necessary. This doesn't
happen with v2, but we see the overhead cost of the slabs:
Benchmark #1: baseline
Time (mean ± σ): 112.9 ms ± 2.8 ms [User: 96.5 ms, System: 16.3 ms]
Range (min … max): 107.7 ms … 118.0 ms 26 runs
Benchmark #2: test
Time (mean ± σ): 147.1 ms ± 5.2 ms [User: 132.7 ms, System: 14.3 ms]
Range (min … max): 141.4 ms … 162.2 ms 18 runs
Summary
'baseline' ran
1.30 ± 0.06 times faster than 'test'
The overhead still exists for a more recent pair of versions (v5.0 and v5.1):
Benchmark #1: baseline
Time (mean ± σ): 25.1 ms ± 3.2 ms [User: 18.6 ms, System: 6.5 ms]
Range (min … max): 19.0 ms … 32.8 ms 99 runs
Benchmark #2: test
Time (mean ± σ): 33.3 ms ± 3.3 ms [User: 26.5 ms, System: 6.9 ms]
Range (min … max): 27.0 ms … 38.4 ms 105 runs
Summary
'baseline' ran
1.33 ± 0.22 times faster than 'test'
I still think this overhead is worth it. In case not everyone agrees, it _might_
be worth a command-line option to skip the GDAT chunk. That also prevents an
ability to eventually wean entirely off generation number v1 and allow the commit
date to take the full 64-bit column (instead of only 34 bits, saving 30 for
topo-levels).
Again, such a modification should not be considered required for this series.
> ----------------------------------------------------------------------------
>
> Improvements left for a future series:
>
> * Save commits with generation data overflow and extra edge commits instead
> of looping over all commits. cf. 858sbel67n.fsf@gmail.com
> * Verify both topological levels and corrected commit dates when present.
> cf. 85pn4tnk8u.fsf@gmail.com
These seem like reasonable things to delay for a later series
or for #leftoverbits
Thanks,
-Stolee
* Re: [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date
2020-12-30 4:35 ` [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
@ 2021-01-10 14:06 ` Abhishek Kumar
0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-10 14:06 UTC (permalink / raw)
To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me
On Tue, Dec 29, 2020 at 11:35:56PM -0500, Derrick Stolee wrote:
> On 12/28/2020 6:15 AM, Abhishek Kumar via GitGitGadget wrote:
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
>
> Abhishek,
>
> Thank you for this version. I appreciate your hard work on this topic,
> especially after GSoC ended and you returned to being a full-time student.
>
> My hope was that I could completely approve this series and only provide
> forward-fixes from here on out, as necessary. I think there are a few minor
> typos that you might want to address, but I was also able to understand your
> intention.
>
> I did make a particular case about a SEGFAULT I hit that I have been unable
> to replicate. I saw it both in my copy of torvalds/linux and of
> chromium/chromium. I have the file for chromium/chromium that is in a bad
> state where a GDAT value includes the bit saying it should be in the long
> offsets chunk, but that chunk doesn't exist. Further, that chunk doesn't
> exist in a from-scratch write.
I hope validating mixed generation chain while writing as well was
enough to fix the SEGFAULT.
>
> I'm now taking backups of my existing commit-graph files before any later
> test, but it doesn't repro for my Git repository or any other repo I try on
> purpose.
>
> However, I did some performance testing to double-check your numbers. I sent
> a patch [1] that helps with some of the hard numbers.
>
> [1] https://lore.kernel.org/git/pull.828.git.1609302714183.gitgitgadget@gmail.com/
>
> The big question is whether the overhead from using a slab to store the
> generation values is worth it. I still think it is, for these reasons:
>
> 1. Generation number v2 is measurably better than v1 in most user cases.
>
> 2. Generation number v2 is slower than using committer date due to the
> overhead, but _guarantees correctness_.
>
> I like to use "git log --graph -<N>" to compare against topological levels
> (v1), for various levels of <N>. When <N> is small, we hope to minimize
> the amount we need to walk using the extra commit-date information as an
> assistance. Repos like git/git and torvalds/linux use the philosophy of
> "base your changes on oldest applicable commit" enough that v1 struggles
> sometimes.
>
> git/git: N=1000
>
> Benchmark #1: baseline
> Time (mean ± σ): 100.3 ms ± 4.2 ms [User: 89.0 ms, System: 11.3 ms]
> Range (min … max): 94.5 ms … 105.1 ms 28 runs
>
> Benchmark #2: test
> Time (mean ± σ): 35.8 ms ± 3.1 ms [User: 29.6 ms, System: 6.2 ms]
> Range (min … max): 29.8 ms … 40.6 ms 81 runs
>
> Summary
> 'test' ran
> 2.80 ± 0.27 times faster than 'baseline'
>
> This is a dramatic improvement! Using my topo-walk stats commit, I see that
> v1 walks 58,805 commits as part of the in-degree walk while v2 only walks
> 4,335 commits!
>
> torvalds/linux: N=1000 (starting at v5.10)
>
> Benchmark #1: baseline
> Time (mean ± σ): 90.8 ms ± 3.7 ms [User: 75.2 ms, System: 15.6 ms]
> Range (min … max): 85.2 ms … 96.2 ms 31 runs
>
> Benchmark #2: test
> Time (mean ± σ): 49.2 ms ± 3.5 ms [User: 36.9 ms, System: 12.3 ms]
> Range (min … max): 42.9 ms … 54.0 ms 61 runs
>
> Summary
> 'test' ran
> 1.85 ± 0.15 times faster than 'baseline'
>
> Similarly, v1 walked 38,161 commits compared to 4,340 by v2.
>
> If I increase N to something like 10,000, then usually these values get
> washed out due to the width of the parallel topics.
That's not too bad, as large N would be needed rather infrequently.
>
> The place we were still using commit-date as a heuristic was paint_down_to_common
> which caused a regression the first time we used v1, at least for certain cases.
>
> Specifically, computing the merge-base in torvalds/linux between v4.8 and v4.9
> hit a strangeness about a pair of recent commits both based on a very old commit,
> but the generation numbers forced walking farther than necessary. This doesn't
> happen with v2, but we see the overhead cost of the slabs:
>
> Benchmark #1: baseline
> Time (mean ± σ): 112.9 ms ± 2.8 ms [User: 96.5 ms, System: 16.3 ms]
> Range (min … max): 107.7 ms … 118.0 ms 26 runs
>
> Benchmark #2: test
> Time (mean ± σ): 147.1 ms ± 5.2 ms [User: 132.7 ms, System: 14.3 ms]
> Range (min … max): 141.4 ms … 162.2 ms 18 runs
>
> Summary
> 'baseline' ran
> 1.30 ± 0.06 times faster than 'test'
>
> The overhead still exists for a more recent pair of versions (v5.0 and v5.1):
>
> Benchmark #1: baseline
> Time (mean ± σ): 25.1 ms ± 3.2 ms [User: 18.6 ms, System: 6.5 ms]
> Range (min … max): 19.0 ms … 32.8 ms 99 runs
>
> Benchmark #2: test
> Time (mean ± σ): 33.3 ms ± 3.3 ms [User: 26.5 ms, System: 6.9 ms]
> Range (min … max): 27.0 ms … 38.4 ms 105 runs
>
> Summary
> 'baseline' ran
> 1.33 ± 0.22 times faster than 'test'
>
> I still think this overhead is worth it. In case not everyone agrees, it _might_
> be worth a command-line option to skip the GDAT chunk. That also prevents an
> ability to eventually wean entirely off generation number v1 and allow the commit
> date to take the full 64-bit column (instead of only 34 bits, saving 30 for
> topo-levels).
Thank you for the detailed benchmarking and discussion.
I don't think there is any disagreement on utility of corrected commit
dates so far.
We will run out of 34-bits for the commit date by the year 2514, so I
am not exactly worried about weaning off generation number v1 anytime
soon.
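The year estimate can be checked with a few lines of C. This is only a
sanity-check sketch: `toy_overflow_year()` is a hypothetical helper, and it
assumes an unsigned timestamp counted from 1970 and a mean Gregorian year of
31,556,952 seconds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: the calendar year in which an unsigned timestamp
 * of the given width (seconds since 1970) runs out of bits.
 */
static unsigned toy_overflow_year(unsigned bits)
{
	/* Assumption: mean Gregorian year, 365.2425 days of 86400 seconds. */
	const uint64_t seconds_per_year = 31556952;

	assert(bits > 0 && bits < 64);
	return 1970 + (unsigned)((UINT64_C(1) << bits) / seconds_per_year);
}
```

With 34 bits this gives 2514, and with 32 bits it gives 2106.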
>
> Again, such a modification should not be considered required for this series.
>
> > ----------------------------------------------------------------------------
> >
> > Improvements left for a future series:
> >
> > * Save commits with generation data overflow and extra edge commits instead
> > of looping over all commits. cf. 858sbel67n.fsf@gmail.com
> > * Verify both topological levels and corrected commit dates when present.
> > cf. 85pn4tnk8u.fsf@gmail.com
>
> These seem like reasonable things to delay for a later series
> or for #leftoverbits
>
> Thanks,
> -Stolee
>
Thanks
- Abhishek
* [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date
2020-12-28 11:15 ` [PATCH v5 00/11] " Abhishek Kumar via GitGitGadget
` (11 preceding siblings ...)
2020-12-30 4:35 ` [PATCH v5 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
@ 2021-01-16 18:11 ` Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
` (12 more replies)
12 siblings, 13 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
SZEDER Gábor, Abhishek Kumar
This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.
Git uses topological levels in the commit-graph file for commit-graph
traversal operations like 'git log --graph'. Unfortunately, using
topological levels can result in worse performance than using committer
date as a heuristic. For example, 'git merge-base v4.8 v4.9' on the Linux
repository walks 635,579 commits using topological levels and walks 167,468
using committer date. Since 091f4cf3 (commit: don't use generation numbers
if not needed, 2018-08-30), 'git merge-base' uses the committer date
heuristic unless there is a cutoff, because of the performance hit.
Thus, the need for generation number v2 was born. The new generation number
needed to provide good performance, incremental updates, and backward
compatibility. Due to an unfortunate problem [1], we also needed a way to
distinguish between the old and new generation number without incrementing
the graph version.
[1] https://public-inbox.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Various candidates were examined (https://github.com/derrickstolee/gen-test,
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Monotonically Increasing Offsets,
performed much worse than committer date (506,577 vs. 167,468 commits walked
for 'git merge-base v4.8 v4.9') and was dropped.
Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as follows:
For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents plus 1. Then
corrected commit date offset is the difference between corrected commit date
of C and commit date of C. As a special case, a root commit with the
timestamp zero has corrected commit date of 1 to be able to distinguish it
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date).
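The definition above can be sketched in C as follows. This is an
illustration under stated assumptions, not the code from this series:
`struct toy_commit` and both functions are hypothetical stand-ins, and
parents are assumed to have been computed before their children.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a commit; only the fields the sketch needs. */
struct toy_commit {
	uint64_t commit_date;              /* committer date, seconds since epoch */
	const struct toy_commit **parents; /* array of parent pointers */
	size_t nr_parents;
	uint64_t corrected_date;           /* 0 means "not computed yet" */
};

/*
 * Corrected commit date: the maximum of the commit's own date and one more
 * than the largest corrected date among its parents. Because the running
 * maximum starts at 0, a root commit with timestamp zero ends up with 1,
 * which keeps it distinct from the "uncomputed" value 0.
 */
static uint64_t toy_corrected_date(struct toy_commit *c)
{
	uint64_t max_parent = 0;
	uint64_t corrected = c->commit_date;
	size_t i;

	for (i = 0; i < c->nr_parents; i++) {
		const struct toy_commit *p = c->parents[i];

		/* Assumption of the sketch: parents are already computed. */
		assert(p->corrected_date != 0);
		if (p->corrected_date > max_parent)
			max_parent = p->corrected_date;
	}

	if (corrected < max_parent + 1)
		corrected = max_parent + 1;

	c->corrected_date = corrected;
	return corrected;
}

/* Offset = corrected commit date minus commit date (never negative). */
static uint64_t toy_date_offset(const struct toy_commit *c)
{
	return c->corrected_date - c->commit_date;
}
```

A commit whose clock is not skewed relative to its parents gets offset 0,
which is why offsets tend to be small.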
We will introduce an additional commit-graph chunk, Generation DATa (GDAT)
chunk, and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.
While storing corrected commit date offsets saves us 4 bytes per commit (as
compared with storing corrected commit dates directly), it is possible for
the offset to overflow the space allocated. To handle such
cases, we introduce a new chunk, Generation Data Overflow (GDOV) that stores
the corrected commit date. For overflowing offsets, we set MSB and store the
position into the GDOV chunk, in a mechanism similar to the Extra Edges list
chunk.
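The MSB-plus-position scheme can be sketched like this. The names
(`toy_gdat_encode`, `toy_gdat_decode`, the `TOY_*` macros) are hypothetical,
the in-memory array stands in for the on-disk GDOV chunk, and the sketch
assumes the overflow entry holds the full-width offset; treat it as an
illustration of the encoding rather than the series' actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_GDAT_OVERFLOW_BIT 0x80000000u /* MSB of the 4-byte GDAT value */
#define TOY_GDAT_MAX_OFFSET   0x7fffffffu /* largest value that fits in 31 bits */

/*
 * Encode one offset as a 4-byte GDAT value. Offsets that fit in 31 bits
 * are stored directly; larger ones are appended to the overflow array and
 * the GDAT value becomes MSB | position. The caller must provide enough
 * room in gdov[] (assumption of the sketch).
 */
static uint32_t toy_gdat_encode(uint64_t offset, uint64_t *gdov,
				uint32_t *gdov_nr)
{
	if (offset <= TOY_GDAT_MAX_OFFSET)
		return (uint32_t)offset;

	assert(*gdov_nr < TOY_GDAT_MAX_OFFSET);
	gdov[*gdov_nr] = offset;
	return TOY_GDAT_OVERFLOW_BIT | (*gdov_nr)++;
}

/*
 * Decode a GDAT value. Returns 0 on success and -1 if the value points
 * into an overflow array that is missing or too short.
 */
static int toy_gdat_decode(uint32_t value, const uint64_t *gdov,
			   uint32_t gdov_nr, uint64_t *out)
{
	uint32_t pos;

	if (!(value & TOY_GDAT_OVERFLOW_BIT)) {
		*out = value;
		return 0;
	}

	pos = value & TOY_GDAT_MAX_OFFSET;
	if (!gdov || pos >= gdov_nr)
		return -1;

	*out = gdov[pos];
	return 0;
}
```

The error return in the decoder corresponds to the situation where a GDAT
value has its overflow bit set but no overflow chunk is available.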
For a mixed generation number environment (for example, new Git on the command
line, old Git used by GUI client), we can encounter a mixed-chain
commit-graph (a commit-graph chain where some of split commit-graph files
have GDAT chunk and others do not). As backward compatibility is one of the
goals, we can define the following behavior:
While reading a mixed-chain commit-graph, we fall back on topological
levels, as corrected commit dates and topological levels cannot be compared
directly.
When adding a new layer to the split commit-graph file, and when merging
some or all layers (replacing them in the latter case), the new layer will
have a GDAT chunk if and only if, in the final result, there would be no
layer without a GDAT chunk just below it.
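The read and write rules above can be sketched as two small predicates.
`struct toy_layer` and both function names are hypothetical; the sketch
models each layer only by whether it carries a GDAT chunk and by a pointer
to the layer below it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one layer in a split commit-graph chain. */
struct toy_layer {
	int has_gdat;                 /* non-zero if the layer has a GDAT chunk */
	const struct toy_layer *base; /* layer just below, or NULL for the bottom */
};

/*
 * Read side: corrected commit dates are used only if every layer in the
 * chain has a GDAT chunk; otherwise fall back on topological levels.
 */
static int toy_chain_uses_corrected_dates(const struct toy_layer *top)
{
	for (; top; top = top->base)
		if (!top->has_gdat)
			return 0;
	return 1;
}

/*
 * Write side: a new layer gets a GDAT chunk if there is no layer below it,
 * or if the layer just below has one.
 */
static int toy_new_layer_writes_gdat(const struct toy_layer *below)
{
	/* Invariant from the write rule: GDAT in a layer implies GDAT below. */
	assert(!below || !below->has_gdat ||
	       toy_chain_uses_corrected_dates(below));

	return !below || below->has_gdat;
}
```

Checking only the layer just below is enough on the write side because the
rule itself maintains the invariant that a layer with GDAT never sits on top
of a layer without it.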
Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews.
I look forward to everyone's reviews!
Thanks
* Abhishek
----------------------------------------------------------------------------
Improvements left for a future series:
* Save commits with generation data overflow and extra edge commits instead
of looping over all commits. cf. 858sbel67n.fsf@gmail.com
* Verify both topological levels and corrected commit dates when present.
cf. 85pn4tnk8u.fsf@gmail.com
Changes in version 6:
* Fixed typos in commit message for "commit-graph: implement corrected
commit date".
* Removed an unnecessary else-block in "commit-graph: implement corrected
commit date".
* Validate mixed generation chain correctly while writing in "commit-graph:
use generation v2 only if the entire chain does".
* Die if the GDAT chunk indicates data has overflowed but there is no
generation data overflow chunk.
Changes in version 5:
* Explained a possible reason for no change in performance for
"commit-graph: fix regression when computing bloom-filters"
* Clarified about the addition of a new test for 11-digit octal
implementations of ustar.
* Fixed duplicate test names in "commit-graph: consolidate
fill_commit_graph_info".
* Swapped the order "commit-graph: return 64-bit generation number",
"commit-graph: add a slab to store topological levels" to minimize lines
changed.
* Fixed the mismerge in "commit-graph: return 64-bit generation number"
* Clarified the preparatory steps are for the larger goal of implementing
generation number v2 in "commit-graph: return 64-bit generation number".
* Moved the rename of "run_three_modes()" to "run_all_modes()" into a new
patch "t6600-test-reach: generalize *_three_modes".
* Explained and removed the checks for GENERATION_NUMBER_INFINITY that can
never be true in "commit-graph: add a slab to store topological levels".
* Fixed incorrect logic for verifying commit-graph in "commit-graph:
implement corrected commit date".
* Added minor improvements to commit message of "commit-graph: implement
generation data chunk".
* Added '--date ' option to test_commit() in 'test-lib-functions.sh' in
"commit-graph: implement generation data chunk".
* Improved coding style (also in tests) for "commit-graph: use generation
v2 only if entire chain does".
* Simplified test repository structure in "commit-graph: use generation v2
only if entire chain does" as only the number of commits in a split
commit-graph layer are relevant.
* Added a new test in "commit-graph: use generation v2 only if entire chain
does" to check if the layers are merged correctly.
* Explicitly mentioned commit "091f4cf3" in the commit-message of
"commit-graph: use corrected commit dates in paint_down_to_common()".
* Minor corrections to documentation in "doc: add corrected commit date
info".
* Minor corrections to coding style.
Changes in version 4:
* Added GDOV to handle overflows in generation data.
* Added a test for writing tip graph for a generation number v2 graph chain
in t5324-split-commit-graph.sh
* Added a section on how mixed generation number chains are handled in
Documentation/technical/commit-graph-format.txt
* Reverted unimportant whitespace, style changes in commit-graph.c
* Added header comments about the order of comparison for
compare_commits_by_gen_then_commit_date in commit.h,
compare_commits_by_gen in commit-graph.h
* Elaborated on why t6404 fails with corrected commit date and must be run
with GIT_TEST_COMMIT_GRAPH=0 in the commit "commit-reach: use corrected
commit dates in paint_down_to_common()"
* Elaborated on write behavior for mixed generation number chains in the
commit "commit-graph: use generation v2 only if entire chain does"
* Added notes about adding the topo_level slab to struct
write_commit_graph_context as well as struct commit_graph.
* Clarified commit message for "commit-graph: consolidate
fill_commit_graph_info"
* Removed the claim "GDAT can store future generation numbers" because it
hasn't been tested yet.
Changes in version 3:
* Reordered patches as discussed in 2
[https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/].
* Split "implement corrected commit date" into two patches - one
introducing the topo level slab and other implementing corrected commit
dates.
* Extended split-commit-graph tests to verify at the end of test.
* Use topological levels as generation number if any of split commit-graph
files do not have generation data chunk.
Changes in version 2:
* Add tests for generation data chunk.
* Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
generation data chunk.
* Compare commits with corrected commit dates if present in
paint_down_to_common().
* Update technical documentation.
* Handle mixed generation commit chains.
* Improve commit messages for "commit-graph: fix regression when computing
bloom filter" and "commit-graph: consolidate fill_commit_graph_info".
* Revert unnecessary whitespace changes.
* Split uint32_t -> timestamp_t change into a new commit.
Abhishek Kumar (11):
commit-graph: fix regression when computing Bloom filters
revision: parse parent in indegree_walk_step()
commit-graph: consolidate fill_commit_graph_info
t6600-test-reach: generalize *_three_modes
commit-graph: add a slab to store topological levels
commit-graph: return 64-bit generation number
commit-graph: implement corrected commit date
commit-graph: implement generation data chunk
commit-graph: use generation v2 only if entire chain does
commit-reach: use corrected commit dates in paint_down_to_common()
doc: add corrected commit date info
.../technical/commit-graph-format.txt | 28 +-
Documentation/technical/commit-graph.txt | 77 +++++-
commit-graph.c | 251 ++++++++++++++----
commit-graph.h | 15 +-
commit-reach.c | 38 +--
commit-reach.h | 2 +-
commit.c | 4 +-
commit.h | 5 +-
revision.c | 13 +-
t/README | 3 +
t/helper/test-read-graph.c | 4 +
t/t4216-log-bloom.sh | 4 +-
t/t5000-tar-tree.sh | 24 +-
t/t5318-commit-graph.sh | 79 +++++-
t/t5324-split-commit-graph.sh | 193 +++++++++++++-
t/t6404-recursive-merge.sh | 5 +-
t/t6600-test-reach.sh | 68 ++---
t/test-lib-functions.sh | 6 +
upload-pack.c | 2 +-
19 files changed, 667 insertions(+), 154 deletions(-)
base-commit: 4151fdb1c76c1a190ac9241b67223efd19f3e478
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v6
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v6
Pull-Request: https://github.com/gitgitgadget/git/pull/676
Range-diff vs v5:
1: c4e817abf7d ! 1: 4d8eb415578 commit-graph: fix regression when computing Bloom filters
@@ Metadata
## Commit message ##
commit-graph: fix regression when computing Bloom filters
- Before computing Bloom fitlers, the commit-graph machinery uses
+ Before computing Bloom filters, the commit-graph machinery uses
commit_gen_cmp to sort commits by generation order for improved diff
performance. 3d11275505 (commit-graph: examine commits by generation
number, 2020-03-30) claims that this sort can reduce the time spent to
@@ Commit message
'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
while writing.
- Not all hope is lost, though: 'commit_graph_generation()' falls back to
+ Not all hope is lost, though: 'commit_gen_cmp()' falls back to
comparing commits by their date when they have equal generation number,
- and so since c49c82aa4c is purely a date comparision function. This
+ and so since c49c82aa4c is purely a date comparison function. This
heuristic is good enough that we don't seem to lose appreciable
- performance while computing Bloom filters. Applying this patch (compared
- with v2.29.1) speeds up computing Bloom filters by around ~4
- seconds.
+ performance while computing Bloom filters.
+
+ Applying this patch (compared with v2.30.0) speeds up computing Bloom
+ filters by factors ranging from 0.40% to 5.19% on various repositories [1].
So, avoid the useless 'commit_graph_generation()' while writing by
instead accessing the slab directly. This returns the newly-computed
generation numbers, and allows us to avoid the heuristic by directly
comparing generation numbers.
+ [1]: https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/
+
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
## commit-graph.c ##
-@@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
+@@ commit-graph.c: static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
+ return data;
+ }
+
++/*
++ * Should be used only while writing commit-graph as it compares
++ * generation value of commits by directly accessing commit-slab.
++ */
+ static int commit_gen_cmp(const void *va, const void *vb)
+ {
const struct commit *a = *(const struct commit **)va;
const struct commit *b = *(const struct commit **)vb;
2: 7645e0bcef0 = 2: 05dcb862818 revision: parse parent in indegree_walk_step()
3: ca646912b2b = 3: dcb9891d819 commit-graph: consolidate fill_commit_graph_info
4: 591935075f1 = 4: 4fbdee7ac90 t6600-test-reach: generalize *_three_modes
5: baae7006764 = 5: fbd8feb5d8c commit-graph: add a slab to store topological levels
6: 26bd6f49100 = 6: 855ff662a44 commit-graph: return 64-bit generation number
7: 859c39eff52 ! 7: 8fbe7486405 commit-graph: implement corrected commit date
@@ Commit message
of GDAT chunk, which is a reduction of around 6% in the size of
commit-graph file.
- However, using offsets be problematic if one of commits is malformed but
- valid and has committerdate of 0 Unix time, as the offset would be the
- same as corrected commit date and thus require 64-bits to be stored
- properly.
+ However, using offsets can be problematic if a commit is malformed but valid
+ and has committer date of 0 Unix time, as the offset would be the same
+ as corrected commit date and thus require 64-bits to be stored properly.
While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
break;
- } else if (level > max_level) {
- max_level = level;
-+ } else {
-+ if (level > max_level)
-+ max_level = level;
-+
-+ if (corrected_commit_date > max_corrected_commit_date)
-+ max_corrected_commit_date = corrected_commit_date;
}
++
++ if (level > max_level)
++ max_level = level;
++
++ if (corrected_commit_date > max_corrected_commit_date)
++ max_corrected_commit_date = corrected_commit_date;
}
+ if (all_parents_computed) {
@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (max_level > GENERATION_NUMBER_V1_MAX - 1)
max_level = GENERATION_NUMBER_V1_MAX - 1;
8: 8403c4d0257 ! 8: 6d0696ae216 commit-graph: implement generation data chunk
@@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
+ offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+
+ if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
++ if (!g->chunk_generation_data_overflow)
++ die(_("commit-graph requires overflow generation data but has none"));
++
+ offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
+ graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
+ } else
9: a3a70a1edd0 ! 9: fba0d7f3dfe commit-graph: use generation v2 only if entire chain does
@@ commit-graph.c: static void split_graph_merge_strategy(struct write_commit_graph
g = g->base_graph;
}
@@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
- struct commit_graph *g = ctx->r->objects->commit_graph;
+ } else
+ ctx->num_commit_graphs_after = 1;
- while (g) {
-+ g->read_generation_data = 1;
- g->topo_levels = &topo_levels;
- g = g->base_graph;
- }
++ validate_mixed_generation_chain(ctx->r->objects->commit_graph);
++
+ compute_generation_numbers(ctx);
+
+ if (ctx->changed_paths)
@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
* also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
* in the following condition.
10: 093101f908b = 10: ba1f2c5555f commit-reach: use corrected commit dates in paint_down_to_common()
11: 20299e57457 = 11: e571f03d8bd doc: add corrected commit date info
--
gitgitgadget
* [PATCH v6 01/11] commit-graph: fix regression when computing Bloom filters
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11 ` Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
` (11 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
SZEDER Gábor, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Before computing Bloom filters, the commit-graph machinery uses
commit_gen_cmp to sort commits by generation order for improved diff
performance. 3d11275505 (commit-graph: examine commits by generation
number, 2020-03-30) claims that this sort can reduce the time spent to
compute Bloom filters by nearly half.
But since c49c82aa4c (commit: move members graph_pos, generation to a
slab, 2020-06-17), this optimization is broken, because calling
'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
while writing.
Not all hope is lost, though: 'commit_gen_cmp()' falls back to
comparing commits by their date when they have equal generation number,
and so since c49c82aa4c is purely a date comparison function. This
heuristic is good enough that we don't seem to lose appreciable
performance while computing Bloom filters.
Applying this patch (compared with v2.30.0) speeds up computing Bloom
filters by factors ranging from 0.40% to 5.19% on various repositories [1].
So, avoid the useless 'commit_graph_generation()' while writing by
instead accessing the slab directly. This returns the newly-computed
generation numbers, and allows us to avoid the heuristic by directly
comparing generation numbers.
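As a rough illustration of the ordering described above, here is a minimal
sketch of a comparator that sorts by generation number first and falls back
to the committer date on ties. The names toy_commit, toy_gen_cmp and
toy_sort are hypothetical stand-ins, not Git's own types or functions; in
Git the generation would come from the commit-slab rather than from a
struct member.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a commit plus its slab entry. */
struct toy_commit {
	uint32_t generation; /* value read directly from the slab while writing */
	uint64_t date;       /* committer date, seconds since the epoch */
};

/* Lower generation first; on equal generations, older date first. */
static int toy_gen_cmp(const void *va, const void *vb)
{
	const struct toy_commit *a = *(const struct toy_commit *const *)va;
	const struct toy_commit *b = *(const struct toy_commit *const *)vb;

	if (a->generation < b->generation)
		return -1;
	if (a->generation > b->generation)
		return 1;

	/* date is only a heuristic when generations are equal */
	if (a->date < b->date)
		return -1;
	if (a->date > b->date)
		return 1;
	return 0;
}

/* Sort an array of commit pointers in place. */
static void toy_sort(struct toy_commit **list, size_t nr)
{
	assert(list || nr == 0);
	qsort(list, nr, sizeof(*list), toy_gen_cmp);
}
```

If every commit reports the same generation (as happens when each lookup
yields GENERATION_NUMBER_INFINITY), the first two branches never fire and
the sketch degenerates into a pure date sort, which is the regression this
patch addresses.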
[1]: https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index e9124d4a412..0267886e76c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -139,13 +139,17 @@ static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
return data;
}
+/*
+ * Should be used only while writing commit-graph as it compares
+ * generation value of commits by directly accessing commit-slab.
+ */
static int commit_gen_cmp(const void *va, const void *vb)
{
const struct commit *a = *(const struct commit **)va;
const struct commit *b = *(const struct commit **)vb;
- uint32_t generation_a = commit_graph_generation(a);
- uint32_t generation_b = commit_graph_generation(b);
+ uint32_t generation_a = commit_graph_data_at(a)->generation;
+ uint32_t generation_b = commit_graph_data_at(b)->generation;
/* lower generation commits first */
if (generation_a < generation_b)
return -1;
--
gitgitgadget
* [PATCH v6 02/11] revision: parse parent in indegree_walk_step()
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11 ` Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
` (10 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
SZEDER Gábor, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
revision.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/revision.c b/revision.c
index 1bb590ece78..be2d828a4cc 100644
--- a/revision.c
+++ b/revision.c
@@ -3397,6 +3397,9 @@ static void indegree_walk_step(struct rev_info *revs)
struct commit *parent = p->item;
int *pi = indegree_slab_at(&info->indegree, parent);
+ if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
+ return;
+
if (*pi)
(*pi)++;
else
--
gitgitgadget
* [PATCH v6 03/11] commit-graph: consolidate fill_commit_graph_info
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11 ` Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
` (9 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
SZEDER Gábor, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().
fill_commit_graph_info() used to not load committer date from commit data
chunk. However, with the upcoming switch to using corrected committer
date as generation number v2, we will have to load committer date to
compute generation number value anyway.
e51217e15 (t5000: test tar files that overflow ustar headers,
2016-06-30) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2^36 + 1) seconds since
EPOCH. The CDAT chunk provides 34 bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.
The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and load committer date from the object
database, generating a tar file with the expected mtime.
However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to lower 34 bits) to populate the generation
number. Thus, Git sets date and generates tar file with the truncated
mtime.
The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.
As a date that overflows 12 octal digits also overflows the CDAT chunk,
while a date that overflows 11 octal digits need not, we split the
existing tests to test both implementations separately and add a new
explicit test for the 11-digit implementation.
To test the 11-octal digit implementation, we create a future commit
with committer date of 2^34 - 1, which overflows 11-octal digits without
overflowing 34-bits of the Commit Data chunk.
To test the 12-octal digit implementation, the smallest committer date
possible is 2^36 + 1, which overflows the CDAT chunk and thus
commit-graph must be disabled for the test.
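The arithmetic behind the two test dates can be checked with a small
sketch. It assumes a layout of 30 bits of generation plus 34 bits of date
packed into two 32-bit words, and it assumes the writer masks the date to
34 bits (so the date is truncated on read-back); byte order is ignored.
The helpers toy_pack, toy_unpack_date and toy_overflows_octal are
hypothetical and only mirror the shifts used in fill_commit_graph_info().

```c
#include <assert.h>
#include <stdint.h>

/* Pack 30 bits of generation and the low 34 bits of date into two words. */
static void toy_pack(uint32_t generation, uint64_t date, uint32_t out[2])
{
	assert(generation < (1u << 30));
	out[0] = (generation << 2) | (uint32_t)((date >> 32) & 0x3);
	out[1] = (uint32_t)(date & 0xffffffff);
}

/* Recover the (possibly truncated) 34-bit date. */
static uint64_t toy_unpack_date(const uint32_t in[2])
{
	uint64_t date_high = in[0] & 0x3;
	uint64_t date_low = in[1];

	return (date_high << 32) | date_low;
}

/* Each octal digit holds 3 bits, so `digits` digits hold 3 * digits bits. */
static int toy_overflows_octal(uint64_t value, unsigned digits)
{
	return (value >> (3 * digits)) != 0;
}
```

With these helpers, 17179869183 (2^34 - 1) needs more than 11 octal digits
yet survives a pack/unpack round trip, while 68719476737 (2^36 + 1) needs
more than 12 octal digits and does not survive it.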
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 27 ++++++++++-----------------
t/t5000-tar-tree.sh | 24 +++++++++++++++++++++---
2 files changed, 31 insertions(+), 20 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 0267886e76c..3d59b8b905d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -753,15 +753,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
const unsigned char *commit_data;
struct commit_graph_data *graph_data;
uint32_t lex_index;
+ uint64_t date_high, date_low;
while (pos < g->num_commits_in_base)
g = g->base_graph;
+ if (pos >= g->num_commits + g->num_commits_in_base)
+ die(_("invalid commit position. commit-graph is likely corrupt"));
+
lex_index = pos - g->num_commits_in_base;
commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
graph_data = commit_graph_data_at(item);
graph_data->graph_pos = pos;
+
+ date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+ date_low = get_be32(commit_data + g->hash_len + 12);
+ item->date = (timestamp_t)((date_high << 32) | date_low);
+
graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
}
@@ -776,38 +785,22 @@ static int fill_commit_in_graph(struct repository *r,
{
uint32_t edge_value;
uint32_t *parent_data_ptr;
- uint64_t date_low, date_high;
struct commit_list **pptr;
- struct commit_graph_data *graph_data;
const unsigned char *commit_data;
uint32_t lex_index;
while (pos < g->num_commits_in_base)
g = g->base_graph;
- if (pos >= g->num_commits + g->num_commits_in_base)
- die(_("invalid commit position. commit-graph is likely corrupt"));
+ fill_commit_graph_info(item, g, pos);
- /*
- * Store the "full" position, but then use the
- * "local" position for the rest of the calculation.
- */
- graph_data = commit_graph_data_at(item);
- graph_data->graph_pos = pos;
lex_index = pos - g->num_commits_in_base;
-
commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
item->object.parsed = 1;
set_commit_tree(item, NULL);
- date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
- date_low = get_be32(commit_data + g->hash_len + 12);
- item->date = (timestamp_t)((date_high << 32) | date_low);
-
- graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
pptr = &item->parents;
edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 3ebb0d3b652..7204799a0b5 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -431,15 +431,33 @@ test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
test_cmp expect actual
'
-test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
+test_expect_success TIME_IS_64BIT 'set up repository with far-future (2^34 - 1) commit' '
+ rm -f .git/index &&
+ echo foo >file &&
+ git add file &&
+ GIT_COMMITTER_DATE="@17179869183 +0000" \
+ git commit -m "tempori parendum"
+'
+
+test_expect_success TIME_IS_64BIT 'generate tar with far-future mtime' '
+ git archive HEAD >future.tar
+'
+
+test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
+ echo 2514 >expect &&
+ tar_info future.tar | cut -d" " -f2 >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success TIME_IS_64BIT 'set up repository with far-far-future (2^36 + 1) commit' '
rm -f .git/index &&
echo content >file &&
git add file &&
- GIT_COMMITTER_DATE="@68719476737 +0000" \
+ GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
git commit -m "tempori parendum"
'
-test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
+test_expect_success TIME_IS_64BIT 'generate tar with far-far-future mtime' '
git archive HEAD >future.tar
'
--
gitgitgadget
* [PATCH v6 04/11] t6600-test-reach: generalize *_three_modes
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
` (2 preceding siblings ...)
2021-01-16 18:11 ` [PATCH v6 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11 ` Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
` (8 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
SZEDER Gábor, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In a preparatory step to implement generation number v2, we add tests to
ensure Git can read and parse commit-graph files without Generation Data
chunk. These files represent commit-graph files written by Old Git and
are necessary for backward compatibility.
We extend run_three_modes() and test_three_modes() to *_all_modes() with
the fourth mode being "commit-graph without generation data chunk".
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
t/t6600-test-reach.sh | 62 +++++++++++++++++++++----------------------
1 file changed, 31 insertions(+), 31 deletions(-)
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index f807276337d..af10f0dc090 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -58,7 +58,7 @@ test_expect_success 'setup' '
git config core.commitGraph true
'
-run_three_modes () {
+run_all_modes () {
test_when_finished rm -rf .git/objects/info/commit-graph &&
"$@" <input >actual &&
test_cmp expect actual &&
@@ -70,8 +70,8 @@ run_three_modes () {
test_cmp expect actual
}
-test_three_modes () {
- run_three_modes test-tool reach "$@"
+test_all_modes () {
+ run_all_modes test-tool reach "$@"
}
test_expect_success 'ref_newer:miss' '
@@ -80,7 +80,7 @@ test_expect_success 'ref_newer:miss' '
B:commit-4-9
EOF
echo "ref_newer(A,B):0" >expect &&
- test_three_modes ref_newer
+ test_all_modes ref_newer
'
test_expect_success 'ref_newer:hit' '
@@ -89,7 +89,7 @@ test_expect_success 'ref_newer:hit' '
B:commit-2-3
EOF
echo "ref_newer(A,B):1" >expect &&
- test_three_modes ref_newer
+ test_all_modes ref_newer
'
test_expect_success 'in_merge_bases:hit' '
@@ -98,7 +98,7 @@ test_expect_success 'in_merge_bases:hit' '
B:commit-8-8
EOF
echo "in_merge_bases(A,B):1" >expect &&
- test_three_modes in_merge_bases
+ test_all_modes in_merge_bases
'
test_expect_success 'in_merge_bases:miss' '
@@ -107,7 +107,7 @@ test_expect_success 'in_merge_bases:miss' '
B:commit-5-9
EOF
echo "in_merge_bases(A,B):0" >expect &&
- test_three_modes in_merge_bases
+ test_all_modes in_merge_bases
'
test_expect_success 'in_merge_bases_many:hit' '
@@ -117,7 +117,7 @@ test_expect_success 'in_merge_bases_many:hit' '
X:commit-5-7
EOF
echo "in_merge_bases_many(A,X):1" >expect &&
- test_three_modes in_merge_bases_many
+ test_all_modes in_merge_bases_many
'
test_expect_success 'in_merge_bases_many:miss' '
@@ -127,7 +127,7 @@ test_expect_success 'in_merge_bases_many:miss' '
X:commit-8-6
EOF
echo "in_merge_bases_many(A,X):0" >expect &&
- test_three_modes in_merge_bases_many
+ test_all_modes in_merge_bases_many
'
test_expect_success 'in_merge_bases_many:miss-heuristic' '
@@ -137,7 +137,7 @@ test_expect_success 'in_merge_bases_many:miss-heuristic' '
X:commit-6-6
EOF
echo "in_merge_bases_many(A,X):0" >expect &&
- test_three_modes in_merge_bases_many
+ test_all_modes in_merge_bases_many
'
test_expect_success 'is_descendant_of:hit' '
@@ -148,7 +148,7 @@ test_expect_success 'is_descendant_of:hit' '
X:commit-1-1
EOF
echo "is_descendant_of(A,X):1" >expect &&
- test_three_modes is_descendant_of
+ test_all_modes is_descendant_of
'
test_expect_success 'is_descendant_of:miss' '
@@ -159,7 +159,7 @@ test_expect_success 'is_descendant_of:miss' '
X:commit-7-6
EOF
echo "is_descendant_of(A,X):0" >expect &&
- test_three_modes is_descendant_of
+ test_all_modes is_descendant_of
'
test_expect_success 'get_merge_bases_many' '
@@ -174,7 +174,7 @@ test_expect_success 'get_merge_bases_many' '
git rev-parse commit-5-6 \
commit-4-7 | sort
} >expect &&
- test_three_modes get_merge_bases_many
+ test_all_modes get_merge_bases_many
'
test_expect_success 'reduce_heads' '
@@ -196,7 +196,7 @@ test_expect_success 'reduce_heads' '
commit-2-8 \
commit-1-10 | sort
} >expect &&
- test_three_modes reduce_heads
+ test_all_modes reduce_heads
'
test_expect_success 'can_all_from_reach:hit' '
@@ -219,7 +219,7 @@ test_expect_success 'can_all_from_reach:hit' '
Y:commit-8-1
EOF
echo "can_all_from_reach(X,Y):1" >expect &&
- test_three_modes can_all_from_reach
+ test_all_modes can_all_from_reach
'
test_expect_success 'can_all_from_reach:miss' '
@@ -241,7 +241,7 @@ test_expect_success 'can_all_from_reach:miss' '
Y:commit-8-5
EOF
echo "can_all_from_reach(X,Y):0" >expect &&
- test_three_modes can_all_from_reach
+ test_all_modes can_all_from_reach
'
test_expect_success 'can_all_from_reach_with_flag: tags case' '
@@ -264,7 +264,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
Y:commit-8-1
EOF
echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
- test_three_modes can_all_from_reach_with_flag
+ test_all_modes can_all_from_reach_with_flag
'
test_expect_success 'commit_contains:hit' '
@@ -280,8 +280,8 @@ test_expect_success 'commit_contains:hit' '
X:commit-9-3
EOF
echo "commit_contains(_,A,X,_):1" >expect &&
- test_three_modes commit_contains &&
- test_three_modes commit_contains --tag
+ test_all_modes commit_contains &&
+ test_all_modes commit_contains --tag
'
test_expect_success 'commit_contains:miss' '
@@ -297,8 +297,8 @@ test_expect_success 'commit_contains:miss' '
X:commit-9-3
EOF
echo "commit_contains(_,A,X,_):0" >expect &&
- test_three_modes commit_contains &&
- test_three_modes commit_contains --tag
+ test_all_modes commit_contains &&
+ test_all_modes commit_contains --tag
'
test_expect_success 'rev-list: basic topo-order' '
@@ -310,7 +310,7 @@ test_expect_success 'rev-list: basic topo-order' '
commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>expect &&
- run_three_modes git rev-list --topo-order commit-6-6
+ run_all_modes git rev-list --topo-order commit-6-6
'
test_expect_success 'rev-list: first-parent topo-order' '
@@ -322,7 +322,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
commit-6-2 \
commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>expect &&
- run_three_modes git rev-list --first-parent --topo-order commit-6-6
+ run_all_modes git rev-list --first-parent --topo-order commit-6-6
'
test_expect_success 'rev-list: range topo-order' '
@@ -334,7 +334,7 @@ test_expect_success 'rev-list: range topo-order' '
commit-6-2 commit-5-2 commit-4-2 \
commit-6-1 commit-5-1 commit-4-1 \
>expect &&
- run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+ run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
'
test_expect_success 'rev-list: range topo-order' '
@@ -346,7 +346,7 @@ test_expect_success 'rev-list: range topo-order' '
commit-6-2 commit-5-2 commit-4-2 \
commit-6-1 commit-5-1 commit-4-1 \
>expect &&
- run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+ run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
'
test_expect_success 'rev-list: first-parent range topo-order' '
@@ -358,7 +358,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
commit-6-2 \
commit-6-1 commit-5-1 commit-4-1 \
>expect &&
- run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+ run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
'
test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -368,7 +368,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
commit-6-3 commit-5-3 commit-4-3 \
>expect &&
- run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+ run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
'
test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -382,7 +382,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
commit-3-8 commit-2-8 commit-1-8 \
commit-3-7 commit-2-7 commit-1-7 \
>expect &&
- run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+ run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
'
test_expect_success 'get_reachable_subset:all' '
@@ -402,7 +402,7 @@ test_expect_success 'get_reachable_subset:all' '
commit-1-7 \
commit-5-6 | sort
) >expect &&
- test_three_modes get_reachable_subset
+ test_all_modes get_reachable_subset
'
test_expect_success 'get_reachable_subset:some' '
@@ -420,7 +420,7 @@ test_expect_success 'get_reachable_subset:some' '
git rev-parse commit-3-3 \
commit-1-7 | sort
) >expect &&
- test_three_modes get_reachable_subset
+ test_all_modes get_reachable_subset
'
test_expect_success 'get_reachable_subset:none' '
@@ -434,7 +434,7 @@ test_expect_success 'get_reachable_subset:none' '
Y:commit-2-8
EOF
echo "get_reachable_subset(X,Y)" >expect &&
- test_three_modes get_reachable_subset
+ test_all_modes get_reachable_subset
'
test_done
--
gitgitgadget
* [PATCH v6 05/11] commit-graph: add a slab to store topological levels
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
` (3 preceding siblings ...)
2021-01-16 18:11 ` [PATCH v6 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11 ` Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
` (7 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
SZEDER Gábor, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In a later commit we will introduce corrected commit date as the
generation number v2. Corrected commit dates will be stored in the new
separate Generation Data chunk. However, to ensure backwards
compatibility with "Old" Git we need to continue to write generation
number v1 (topological levels) to the commit data chunk. Thus, we need
to compute and store both versions of generation numbers to write the
commit-graph file.
Therefore, let's introduce a commit-slab `topo_level_slab` to store
topological levels; corrected commit date will be stored in the member
`generation` of struct commit_graph_data.
The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO`
mark commits not in the commit-graph file and commits written by a
version of Git that did not compute generation numbers respectively.
Generation numbers are computed identically for both kinds of commits.
A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit
is not in the commit-graph file. However, since the slab is
zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`).
Thus, we no longer need to check if the topological level of a commit is
`GENERATION_NUMBER_INFINITY`.
We will add a pointer to the slab in `struct write_commit_graph_context`
and `struct commit_graph` to populate the slab in
`fill_commit_graph_info` if the commit has a pre-computed topological
level as in case of split commit-graphs.
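To make the "slab-miss returns 0" argument concrete, here is a minimal
sketch of a zero-initialized lookup table. It is not Git's commit-slab
implementation; toy_slab, toy_slab_at and TOY_LEVEL_ZERO are hypothetical
names, and the table is indexed by a plain integer instead of a commit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_LEVEL_ZERO 0 /* plays the role of GENERATION_NUMBER_ZERO */

struct toy_slab {
	uint32_t *levels;
	size_t alloc;
};

/*
 * Return a pointer to the entry for `index`, growing the table on demand.
 * New entries are zero-filled, so a lookup that was never written reads
 * back as TOY_LEVEL_ZERO rather than as an "infinity" marker.
 */
static uint32_t *toy_slab_at(struct toy_slab *s, size_t index)
{
	assert(s);
	if (index >= s->alloc) {
		size_t new_alloc = index + 1;
		uint32_t *grown = realloc(s->levels, new_alloc * sizeof(*grown));

		if (!grown)
			abort();
		memset(grown + s->alloc, 0,
		       (new_alloc - s->alloc) * sizeof(*grown));
		s->levels = grown;
		s->alloc = new_alloc;
	}
	return &s->levels[index];
}
```

Because a miss and an uncomputed level both read as zero, a writer using
such a table only has to test against the zero value to decide whether a
level still needs computing.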
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 45 ++++++++++++++++++++++++++++++---------------
commit-graph.h | 1 +
2 files changed, 31 insertions(+), 15 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 3d59b8b905d..3b69c3cc329 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)
+define_commit_slab(topo_level_slab, uint32_t);
+
/* Keep track of the order in which commits are added to our list. */
define_commit_slab(commit_pos, int);
static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -772,6 +774,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
item->date = (timestamp_t)((date_high << 32) | date_low);
graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
+ if (g->topo_levels)
+ *topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
}
static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -960,6 +965,7 @@ struct write_commit_graph_context {
changed_paths:1,
order_by_pack:1;
+ struct topo_level_slab *topo_levels;
const struct commit_graph_opts *opts;
size_t total_bloom_filter_data_size;
const struct bloom_filter_settings *bloom_settings;
@@ -1106,7 +1112,7 @@ static int write_graph_chunk_data(struct hashfile *f,
else
packedDate[0] = 0;
- packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+ packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);
@@ -1336,11 +1342,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
_("Computing commit graph generation numbers"),
ctx->commits.nr);
for (i = 0; i < ctx->commits.nr; i++) {
- uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+ uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
display_progress(ctx->progress, i + 1);
- if (generation != GENERATION_NUMBER_INFINITY &&
- generation != GENERATION_NUMBER_ZERO)
+ if (level != GENERATION_NUMBER_ZERO)
continue;
commit_list_insert(ctx->commits.list[i], &list);
@@ -1348,29 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
struct commit *current = list->item;
struct commit_list *parent;
int all_parents_computed = 1;
- uint32_t max_generation = 0;
+ uint32_t max_level = 0;
for (parent = current->parents; parent; parent = parent->next) {
- generation = commit_graph_data_at(parent->item)->generation;
+ level = *topo_level_slab_at(ctx->topo_levels, parent->item);
- if (generation == GENERATION_NUMBER_INFINITY ||
- generation == GENERATION_NUMBER_ZERO) {
+ if (level == GENERATION_NUMBER_ZERO) {
all_parents_computed = 0;
commit_list_insert(parent->item, &list);
break;
- } else if (generation > max_generation) {
- max_generation = generation;
+ } else if (level > max_level) {
+ max_level = level;
}
}
if (all_parents_computed) {
- struct commit_graph_data *data = commit_graph_data_at(current);
-
- data->generation = max_generation + 1;
pop_commit(&list);
- if (data->generation > GENERATION_NUMBER_MAX)
- data->generation = GENERATION_NUMBER_MAX;
+ if (max_level > GENERATION_NUMBER_MAX - 1)
+ max_level = GENERATION_NUMBER_MAX - 1;
+ *topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
}
}
}
@@ -2106,6 +2108,7 @@ int write_commit_graph(struct object_directory *odb,
int res = 0;
int replace = 0;
struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+ struct topo_level_slab topo_levels;
prepare_repo_settings(the_repository);
if (!the_repository->settings.core_commit_graph) {
@@ -2132,6 +2135,18 @@ int write_commit_graph(struct object_directory *odb,
bloom_settings.max_changed_paths);
ctx->bloom_settings = &bloom_settings;
+ init_topo_level_slab(&topo_levels);
+ ctx->topo_levels = &topo_levels;
+
+ if (ctx->r->objects->commit_graph) {
+ struct commit_graph *g = ctx->r->objects->commit_graph;
+
+ while (g) {
+ g->topo_levels = &topo_levels;
+ g = g->base_graph;
+ }
+ }
+
if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
ctx->changed_paths = 1;
if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/commit-graph.h b/commit-graph.h
index f8e92500c6e..00f00745b79 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -73,6 +73,7 @@ struct commit_graph {
const unsigned char *chunk_bloom_indexes;
const unsigned char *chunk_bloom_data;
+ struct topo_level_slab *topo_levels;
struct bloom_filter_settings *bloom_filter_settings;
};
--
gitgitgadget
* [PATCH v6 06/11] commit-graph: return 64-bit generation number
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
` (4 preceding siblings ...)
2021-01-16 18:11 ` [PATCH v6 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11 ` Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
` (6 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
SZEDER Gábor, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.
We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.
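As a rough sketch of why the topological level has its own maximum:
the level shares a 32-bit word of the commit data chunk with the two
high bits of the commit date, so only 30 bits remain for it. The
helper names below are hypothetical, and the word is shown in host
byte order (the byte swapping done when writing is left out).

```c
#include <stdint.h>

#define TOY_V1_MAX 0x3FFFFFFFu /* same value as GENERATION_NUMBER_V1_MAX */

/* Level of a commit: one more than the largest parent level, clamped. */
static uint32_t toy_topo_level(const uint32_t *parent_levels, int nr_parents)
{
	uint32_t max_level = 0;
	int i;

	for (i = 0; i < nr_parents; i++)
		if (parent_levels[i] > max_level)
			max_level = parent_levels[i];

	/* Clamp before adding one so the result never exceeds TOY_V1_MAX. */
	if (max_level > TOY_V1_MAX - 1)
		max_level = TOY_V1_MAX - 1;
	return max_level + 1;
}

/* The level occupies the upper 30 bits; the low 2 bits hold date bits. */
static uint32_t toy_pack_level(uint32_t level, uint32_t date_high_bits)
{
	return (level << 2) | (date_high_bits & 0x3);
}

static uint32_t toy_unpack_level(uint32_t word)
{
	return word >> 2;
}
```

A root commit (no parents) gets level 1 here, and a commit whose
parent is already at the maximum stays at the maximum.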
With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 22 +++++++++++-----------
commit-graph.h | 4 ++--
commit-reach.c | 36 ++++++++++++++++++------------------
commit-reach.h | 2 +-
commit.c | 4 ++--
commit.h | 4 ++--
revision.c | 10 +++++-----
upload-pack.c | 2 +-
8 files changed, 42 insertions(+), 42 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 3b69c3cc329..6d42e30cd9a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -101,7 +101,7 @@ uint32_t commit_graph_position(const struct commit *c)
return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
}
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
{
struct commit_graph_data *data =
commit_graph_data_slab_peek(&commit_graph_data_slab, c);
@@ -150,8 +150,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
const struct commit *a = *(const struct commit **)va;
const struct commit *b = *(const struct commit **)vb;
- uint32_t generation_a = commit_graph_data_at(a)->generation;
- uint32_t generation_b = commit_graph_data_at(b)->generation;
+ const timestamp_t generation_a = commit_graph_data_at(a)->generation;
+ const timestamp_t generation_b = commit_graph_data_at(b)->generation;
/* lower generation commits first */
if (generation_a < generation_b)
return -1;
@@ -1370,8 +1370,8 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (all_parents_computed) {
pop_commit(&list);
- if (max_level > GENERATION_NUMBER_MAX - 1)
- max_level = GENERATION_NUMBER_MAX - 1;
+ if (max_level > GENERATION_NUMBER_V1_MAX - 1)
+ max_level = GENERATION_NUMBER_V1_MAX - 1;
*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
}
}
@@ -2367,8 +2367,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
for (i = 0; i < g->num_commits; i++) {
struct commit *graph_commit, *odb_commit;
struct commit_list *graph_parents, *odb_parents;
- uint32_t max_generation = 0;
- uint32_t generation;
+ timestamp_t max_generation = 0;
+ timestamp_t generation;
display_progress(progress, i + 1);
hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2432,16 +2432,16 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
continue;
/*
- * If one of our parents has generation GENERATION_NUMBER_MAX, then
- * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
+ * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
+ * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
* extra logic in the following condition.
*/
- if (max_generation == GENERATION_NUMBER_MAX)
+ if (max_generation == GENERATION_NUMBER_V1_MAX)
max_generation--;
generation = commit_graph_generation(graph_commit);
if (generation != max_generation + 1)
- graph_report(_("commit-graph generation for commit %s is %u != %u"),
+ graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
oid_to_hex(&cur_oid),
generation,
max_generation + 1);
diff --git a/commit-graph.h b/commit-graph.h
index 00f00745b79..2e9aa7824ee 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -145,12 +145,12 @@ void disable_commit_graph(struct repository *r);
struct commit_graph_data {
uint32_t graph_pos;
- uint32_t generation;
+ timestamp_t generation;
};
/*
* Commits should be parsed before accessing generation, graph positions.
*/
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
uint32_t commit_graph_position(const struct commit *);
#endif
diff --git a/commit-reach.c b/commit-reach.c
index 50175b159e7..9b24b0378d5 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
static struct commit_list *paint_down_to_common(struct repository *r,
struct commit *one, int n,
struct commit **twos,
- int min_generation)
+ timestamp_t min_generation)
{
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
- uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+ timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
if (!min_generation)
queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
struct commit *commit = prio_queue_get(&queue);
struct commit_list *parents;
int flags;
- uint32_t generation = commit_graph_generation(commit);
+ timestamp_t generation = commit_graph_generation(commit);
if (min_generation && generation > last_gen)
- BUG("bad generation skip %8x > %8x at %s",
+ BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
generation, last_gen,
oid_to_hex(&commit->object.oid));
last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
repo_parse_commit(r, array[i]);
for (i = 0; i < cnt; i++) {
struct commit_list *common;
- uint32_t min_generation = commit_graph_generation(array[i]);
+ timestamp_t min_generation = commit_graph_generation(array[i]);
if (redundant[i])
continue;
for (j = filled = 0; j < cnt; j++) {
- uint32_t curr_generation;
+ timestamp_t curr_generation;
if (i == j || redundant[j])
continue;
filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
{
struct commit_list *bases;
int ret = 0, i;
- uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
+ timestamp_t generation, max_generation = GENERATION_NUMBER_ZERO;
if (repo_parse_commit(r, commit))
return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
static enum contains_result contains_test(struct commit *candidate,
const struct commit_list *want,
struct contains_cache *cache,
- uint32_t cutoff)
+ timestamp_t cutoff)
{
enum contains_result *cached = contains_cache_at(cache, candidate);
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
{
struct contains_stack contains_stack = { 0, 0, NULL };
enum contains_result result;
- uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+ timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
const struct commit_list *p;
for (p = want; p; p = p->next) {
- uint32_t generation;
+ timestamp_t generation;
struct commit *c = p->item;
load_commit_graph_info(the_repository, c);
generation = commit_graph_generation(c);
@@ -566,8 +566,8 @@ static int compare_commits_by_gen(const void *_a, const void *_b)
const struct commit *a = *(const struct commit * const *)_a;
const struct commit *b = *(const struct commit * const *)_b;
- uint32_t generation_a = commit_graph_generation(a);
- uint32_t generation_b = commit_graph_generation(b);
+ timestamp_t generation_a = commit_graph_generation(a);
+ timestamp_t generation_b = commit_graph_generation(b);
if (generation_a < generation_b)
return -1;
@@ -580,7 +580,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
unsigned int with_flag,
unsigned int assign_flag,
time_t min_commit_date,
- uint32_t min_generation)
+ timestamp_t min_generation)
{
struct commit **list = NULL;
int i;
@@ -681,13 +681,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
struct commit_list *from_iter = from, *to_iter = to;
int result;
- uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+ timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
while (from_iter) {
add_object_array(&from_iter->item->object, NULL, &from_objs);
if (!parse_commit(from_iter->item)) {
- uint32_t generation;
+ timestamp_t generation;
if (from_iter->item->date < min_commit_date)
min_commit_date = from_iter->item->date;
@@ -701,7 +701,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
while (to_iter) {
if (!parse_commit(to_iter->item)) {
- uint32_t generation;
+ timestamp_t generation;
if (to_iter->item->date < min_commit_date)
min_commit_date = to_iter->item->date;
@@ -741,13 +741,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
struct commit_list *found_commits = NULL;
struct commit **to_last = to + nr_to;
struct commit **from_last = from + nr_from;
- uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+ timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
int num_to_find = 0;
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
for (item = to; item < to_last; item++) {
- uint32_t generation;
+ timestamp_t generation;
struct commit *c = *item;
parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a317..148b56fea50 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
unsigned int with_flag,
unsigned int assign_flag,
time_t min_commit_date,
- uint32_t min_generation);
+ timestamp_t min_generation);
int can_all_from_reach(struct commit_list *from, struct commit_list *to,
int commit_date_cutoff);
diff --git a/commit.c b/commit.c
index bab8d5ab07c..4c717329ee0 100644
--- a/commit.c
+++ b/commit.c
@@ -753,8 +753,8 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
{
const struct commit *a = a_, *b = b_;
- const uint32_t generation_a = commit_graph_generation(a),
- generation_b = commit_graph_generation(b);
+ const timestamp_t generation_a = commit_graph_generation(a),
+ generation_b = commit_graph_generation(b);
/* newer commits first */
if (generation_a < generation_b)
diff --git a/commit.h b/commit.h
index f4e7b0158e2..742d96c41e8 100644
--- a/commit.h
+++ b/commit.h
@@ -11,8 +11,8 @@
#include "commit-slab.h"
#define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
-#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0
struct commit_list {
diff --git a/revision.c b/revision.c
index be2d828a4cc..31fd3219e65 100644
--- a/revision.c
+++ b/revision.c
@@ -3300,7 +3300,7 @@ define_commit_slab(indegree_slab, int);
define_commit_slab(author_date_slab, timestamp_t);
struct topo_walk_info {
- uint32_t min_generation;
+ timestamp_t min_generation;
struct prio_queue explore_queue;
struct prio_queue indegree_queue;
struct prio_queue topo_queue;
@@ -3368,7 +3368,7 @@ static void explore_walk_step(struct rev_info *revs)
}
static void explore_to_depth(struct rev_info *revs,
- uint32_t gen_cutoff)
+ timestamp_t gen_cutoff)
{
struct topo_walk_info *info = revs->topo_walk_info;
struct commit *c;
@@ -3413,7 +3413,7 @@ static void indegree_walk_step(struct rev_info *revs)
}
static void compute_indegrees_to_depth(struct rev_info *revs,
- uint32_t gen_cutoff)
+ timestamp_t gen_cutoff)
{
struct topo_walk_info *info = revs->topo_walk_info;
struct commit *c;
@@ -3471,7 +3471,7 @@ static void init_topo_walk(struct rev_info *revs)
info->min_generation = GENERATION_NUMBER_INFINITY;
for (list = revs->commits; list; list = list->next) {
struct commit *c = list->item;
- uint32_t generation;
+ timestamp_t generation;
if (repo_parse_commit_gently(revs->repo, c, 1))
continue;
@@ -3539,7 +3539,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
for (p = commit->parents; p; p = p->next) {
struct commit *parent = p->item;
int *pi;
- uint32_t generation;
+ timestamp_t generation;
if (parent->object.flags & UNINTERESTING)
continue;
diff --git a/upload-pack.c b/upload-pack.c
index 3b66bf92ba8..b87607e0dd4 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -500,7 +500,7 @@ static int got_oid(struct upload_pack_data *data,
static int ok_to_give_up(struct upload_pack_data *data)
{
- uint32_t min_generation = GENERATION_NUMBER_ZERO;
+ timestamp_t min_generation = GENERATION_NUMBER_ZERO;
if (!data->have_obj.nr)
return 0;
--
gitgitgadget
* [PATCH v6 07/11] commit-graph: implement corrected commit date
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
` (5 preceding siblings ...)
2021-01-16 18:11 ` [PATCH v6 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11 ` Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
` (5 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
SZEDER Gábor, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
With most of the preparations done, let's implement corrected commit date.
The corrected commit date for a commit is defined as:
* A commit with no parents (a root commit) has corrected commit date
equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
the maximum of its commit date and one more than the largest corrected
commit date among its parents.
As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).
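The definition above, including the special case for a zero timestamp,
can be written as a small standalone function. This is only a sketch:
`toy_timestamp` stands in for `timestamp_t`, and the function takes
the parents' corrected commit dates as a plain array instead of
walking a commit list.

```c
#include <stdint.h>

typedef uint64_t toy_timestamp; /* stand-in for timestamp_t */

/*
 * Corrected commit date of a commit, given its own committer date and
 * the already computed corrected commit dates of its parents.
 */
static toy_timestamp toy_corrected_commit_date(toy_timestamp date,
					       const toy_timestamp *parent_dates,
					       int nr_parents)
{
	toy_timestamp max_parent = 0;
	int i;

	for (i = 0; i < nr_parents; i++)
		if (parent_dates[i] > max_parent)
			max_parent = parent_dates[i];

	/*
	 * max(date, max_parent + 1), written as max(date - 1, max_parent) + 1
	 * so that a root commit with date 0 ends up with the value 1.
	 */
	if (date && date > max_parent)
		max_parent = date - 1;
	return max_parent + 1;
}
```

For example, a commit dated 50 whose parents have corrected commit
dates 100 and 70 gets 101, while a commit dated 200 with the same
parents keeps 200.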
To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.
Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32 bits. This halves the size
of the GDAT chunk, which is a reduction of around 6% in the size of the
commit-graph file.
However, using offsets can be problematic if a commit is malformed but
valid and has a committer date of 0 Unix time, as the offset would be the
same as the corrected commit date and thus require 64 bits to be stored
properly.
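To make the offset arithmetic concrete, here is a minimal sketch with
hypothetical helper names. It assumes 64-bit unsigned timestamps and
only checks whether an offset fits in 32 bits; it does not model how
Git handles the values that do not fit.

```c
#include <stdint.h>

/* Offset stored instead of the full corrected commit date. */
static uint64_t toy_date_offset(uint64_t corrected_date, uint64_t commit_date)
{
	/* corrected_date >= commit_date holds by definition. */
	return corrected_date - commit_date;
}

/* A reader recovers the corrected commit date from date and offset. */
static uint64_t toy_corrected_from_offset(uint64_t commit_date, uint64_t offset)
{
	return commit_date + offset;
}

static int toy_offset_fits_in_32_bits(uint64_t offset)
{
	return offset <= UINT32_MAX;
}
```

With a committer date of 0 the offset equals the corrected commit
date itself, so a large enough parent date makes the offset too wide
for a 32-bit field.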
While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 6d42e30cd9a..a899f429093 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1343,9 +1343,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
ctx->commits.nr);
for (i = 0; i < ctx->commits.nr; i++) {
uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+ timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
display_progress(ctx->progress, i + 1);
- if (level != GENERATION_NUMBER_ZERO)
+ if (level != GENERATION_NUMBER_ZERO &&
+ corrected_commit_date != GENERATION_NUMBER_ZERO)
continue;
commit_list_insert(ctx->commits.list[i], &list);
@@ -1354,17 +1356,24 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
struct commit_list *parent;
int all_parents_computed = 1;
uint32_t max_level = 0;
+ timestamp_t max_corrected_commit_date = 0;
for (parent = current->parents; parent; parent = parent->next) {
level = *topo_level_slab_at(ctx->topo_levels, parent->item);
+ corrected_commit_date = commit_graph_data_at(parent->item)->generation;
- if (level == GENERATION_NUMBER_ZERO) {
+ if (level == GENERATION_NUMBER_ZERO ||
+ corrected_commit_date == GENERATION_NUMBER_ZERO) {
all_parents_computed = 0;
commit_list_insert(parent->item, &list);
break;
- } else if (level > max_level) {
- max_level = level;
}
+
+ if (level > max_level)
+ max_level = level;
+
+ if (corrected_commit_date > max_corrected_commit_date)
+ max_corrected_commit_date = corrected_commit_date;
}
if (all_parents_computed) {
@@ -1373,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (max_level > GENERATION_NUMBER_V1_MAX - 1)
max_level = GENERATION_NUMBER_V1_MAX - 1;
*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+
+ if (current->date && current->date > max_corrected_commit_date)
+ max_corrected_commit_date = current->date - 1;
+ commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
}
}
}
--
gitgitgadget
* [PATCH v6 08/11] commit-graph: implement generation data chunk
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
` (6 preceding siblings ...)
2021-01-16 18:11 ` [PATCH v6 07/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11 ` Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
` (4 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
SZEDER Gábor, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of the
prerequisites for implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.
We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.
Old Git does not understand the GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
the GDAT chunk is missing (as it would happen with a commit-graph
written by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces the commit-graph file to be written without the generation
data chunk, emulating a commit-graph file written by old Git.
To minimize the space required to store corrected commit dates, Git
stores corrected commit date offsets into the commit-graph file, instead
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4 bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:
We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of the
corrected commit date in the GDOV chunk, similar to how the Extra Edge
List is maintained.
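The overflow scheme described above can be sketched as an
encode/decode pair. This follows the prose description (the overflow
array holds full corrected commit dates) rather than reproducing the
patch's write functions. The `toy_` names are hypothetical, the
overflow chunk is modelled as a caller-provided array, and the offset
limit is assumed to be 2^31 - 1, consistent with the statement that
an offset of 2^31 is just large enough to overflow.

```c
#include <stdint.h>

#define TOY_OFFSET_MAX   ((1ULL << 31) - 1) /* assumed offset limit */
#define TOY_OVERFLOW_BIT (1ULL << 31)       /* MSB of the 32-bit field */

/*
 * Return the 32-bit value for the generation data chunk. When the
 * offset is too large, append the corrected commit date to 'overflow'
 * and return the overflow bit combined with its position.
 */
static uint32_t toy_encode_gdat(uint64_t corrected_date, uint64_t commit_date,
				uint64_t *overflow, uint32_t *nr_overflow)
{
	uint64_t offset = corrected_date - commit_date;

	if (offset > TOY_OFFSET_MAX) {
		uint32_t pos = (*nr_overflow)++;

		overflow[pos] = corrected_date;
		return (uint32_t)(TOY_OVERFLOW_BIT | pos);
	}
	return (uint32_t)offset;
}

/* Recover the corrected commit date from the 32-bit field. */
static uint64_t toy_decode_gdat(uint32_t gdat, uint64_t commit_date,
				const uint64_t *overflow)
{
	if (gdat & TOY_OVERFLOW_BIT)
		return overflow[gdat ^ TOY_OVERFLOW_BIT];
	return commit_date + gdat;
}
```

In this sketch a commit dated 0 with a corrected commit date of 2^31
is the smallest case that lands in the overflow array; anything with
a smaller offset stays inline in the 32-bit field.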
We test the overflow-related code with the following repo history:
F - N - U
/ \
U - N - U N
\ /
N - F - N
Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.
The largest offset observed is 2 ^ 31, just large enough to overflow.
[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 114 ++++++++++++++++++++++++++++++----
commit-graph.h | 3 +
commit.h | 1 +
t/README | 3 +
t/helper/test-read-graph.c | 4 ++
t/t4216-log-bloom.sh | 4 +-
t/t5318-commit-graph.sh | 79 +++++++++++++++++++----
t/t5324-split-commit-graph.sh | 12 ++--
t/t6600-test-reach.sh | 6 ++
t/test-lib-functions.sh | 6 ++
10 files changed, 200 insertions(+), 32 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index a899f429093..7365958d9d3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 9
#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
@@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
+#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
+
/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)
@@ -394,6 +398,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
graph->chunk_commit_data = data + chunk_offset;
break;
+ case GRAPH_CHUNKID_GENERATION_DATA:
+ if (graph->chunk_generation_data)
+ chunk_repeated = 1;
+ else
+ graph->chunk_generation_data = data + chunk_offset;
+ break;
+
+ case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
+ if (graph->chunk_generation_data_overflow)
+ chunk_repeated = 1;
+ else
+ graph->chunk_generation_data_overflow = data + chunk_offset;
+ break;
+
case GRAPH_CHUNKID_EXTRAEDGES:
if (graph->chunk_extra_edges)
chunk_repeated = 1;
@@ -754,8 +772,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
{
const unsigned char *commit_data;
struct commit_graph_data *graph_data;
- uint32_t lex_index;
- uint64_t date_high, date_low;
+ uint32_t lex_index, offset_pos;
+ uint64_t date_high, date_low, offset;
while (pos < g->num_commits_in_base)
g = g->base_graph;
@@ -773,7 +791,19 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
- graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+ if (g->chunk_generation_data) {
+ offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+
+ if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
+ if (!g->chunk_generation_data_overflow)
+ die(_("commit-graph requires overflow generation data but has none"));
+
+ offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
+ graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
+ } else
+ graph_data->generation = item->date + offset;
+ } else
+ graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
if (g->topo_levels)
*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
@@ -945,6 +975,7 @@ struct write_commit_graph_context {
struct oid_array oids;
struct packed_commit_list commits;
int num_extra_edges;
+ int num_generation_data_overflows;
unsigned long approx_nr_objects;
struct progress *progress;
int progress_done;
@@ -963,7 +994,8 @@ struct write_commit_graph_context {
report_progress:1,
split:1,
changed_paths:1,
- order_by_pack:1;
+ order_by_pack:1,
+ write_generation_data:1;
struct topo_level_slab *topo_levels;
const struct commit_graph_opts *opts;
@@ -1123,6 +1155,45 @@ static int write_graph_chunk_data(struct hashfile *f,
return 0;
}
+static int write_graph_chunk_generation_data(struct hashfile *f,
+ struct write_commit_graph_context *ctx)
+{
+ int i, num_generation_data_overflows = 0;
+
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+ display_progress(ctx->progress, ++ctx->progress_cnt);
+
+ if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+ offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
+ num_generation_data_overflows++;
+ }
+
+ hashwrite_be32(f, offset);
+ }
+
+ return 0;
+}
+
+static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
+ struct write_commit_graph_context *ctx)
+{
+ int i;
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+ display_progress(ctx->progress, ++ctx->progress_cnt);
+
+ if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+ hashwrite_be32(f, offset >> 32);
+ hashwrite_be32(f, (uint32_t) offset);
+ }
+ }
+
+ return 0;
+}
+
static int write_graph_chunk_extra_edges(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
@@ -1386,6 +1457,9 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (current->date && current->date > max_corrected_commit_date)
max_corrected_commit_date = current->date - 1;
commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
+
+ if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
+ ctx->num_generation_data_overflows++;
}
}
}
@@ -1719,6 +1793,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
chunks[2].id = GRAPH_CHUNKID_DATA;
chunks[2].size = (hashsz + 16) * ctx->commits.nr;
chunks[2].write_fn = write_graph_chunk_data;
+
+ if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
+ ctx->write_generation_data = 0;
+ if (ctx->write_generation_data) {
+ chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
+ chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+ chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
+ num_chunks++;
+ }
+ if (ctx->num_generation_data_overflows) {
+ chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
+ chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
+ chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
+ num_chunks++;
+ }
if (ctx->num_extra_edges) {
chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
chunks[num_chunks].size = 4 * ctx->num_extra_edges;
@@ -2139,6 +2228,8 @@ int write_commit_graph(struct object_directory *odb,
ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
ctx->opts = opts;
ctx->total_bloom_filter_data_size = 0;
+ ctx->write_generation_data = 1;
+ ctx->num_generation_data_overflows = 0;
bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
bloom_settings.bits_per_entry);
@@ -2445,16 +2536,17 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
continue;
/*
- * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
- * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
- * extra logic in the following condition.
+ * If we are using topological level and one of our parents has
+ * generation GENERATION_NUMBER_V1_MAX, then our generation is
+ * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
+ * in the following condition.
*/
- if (max_generation == GENERATION_NUMBER_V1_MAX)
+ if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
max_generation--;
generation = commit_graph_generation(graph_commit);
- if (generation != max_generation + 1)
- graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
+ if (generation < max_generation + 1)
+ graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
oid_to_hex(&cur_oid),
generation,
max_generation + 1);
diff --git a/commit-graph.h b/commit-graph.h
index 2e9aa7824ee..19a02001fde 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,6 +6,7 @@
#include "oidset.h"
#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
@@ -68,6 +69,8 @@ struct commit_graph {
const uint32_t *chunk_oid_fanout;
const unsigned char *chunk_oid_lookup;
const unsigned char *chunk_commit_data;
+ const unsigned char *chunk_generation_data;
+ const unsigned char *chunk_generation_data_overflow;
const unsigned char *chunk_extra_edges;
const unsigned char *chunk_base_graphs;
const unsigned char *chunk_bloom_indexes;
diff --git a/commit.h b/commit.h
index 742d96c41e8..eff94f3f7c2 100644
--- a/commit.h
+++ b/commit.h
@@ -14,6 +14,7 @@
#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0
+#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)
struct commit_list {
struct commit *item;
diff --git a/t/README b/t/README
index c730a707705..8a121487279 100644
--- a/t/README
+++ b/t/README
@@ -393,6 +393,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
be written after every 'git commit' command, and overrides the
'core.commitGraph' setting to true.
+GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
+commit-graph to be written without generation data chunk.
+
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
commit-graph write to compute and write changed path Bloom filters for
every 'git commit-graph write', as if the `--changed-paths` option was
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 5f585a17256..75927b2c81d 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -33,6 +33,10 @@ int cmd__read_graph(int argc, const char **argv)
printf(" oid_lookup");
if (graph->chunk_commit_data)
printf(" commit_metadata");
+ if (graph->chunk_generation_data)
+ printf(" generation_data");
+ if (graph->chunk_generation_data_overflow)
+ printf(" generation_data_overflow");
if (graph->chunk_extra_edges)
printf(" extra_edges");
if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index d11040ce41c..dbde0161882 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -40,11 +40,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
'
graph_read_expect () {
- NUM_CHUNKS=5
+ NUM_CHUNKS=6
cat >expect <<- EOF
header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
num_commits: $1
- chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+ chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
EOF
test-tool read-graph >actual &&
test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2ed0c1544da..fa27df579a5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -76,7 +76,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
graph_read_expect() {
OPTIONAL=""
NUM_CHUNKS=3
- if test ! -z $2
+ if test ! -z "$2"
then
OPTIONAL=" $2"
NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
@@ -103,14 +103,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
# valid commit and tree OID
git rev-parse HEAD HEAD^{tree} >in &&
git commit-graph write --stdin-commits <in &&
- graph_read_expect 3
+ graph_read_expect 3 generation_data
'
test_expect_success 'write graph' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "3"
+ graph_read_expect "3" generation_data
'
test_expect_success POSIXPERM 'write graph has correct permissions' '
@@ -219,7 +219,7 @@ test_expect_success 'write graph with merges' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "10" "extra_edges"
+ graph_read_expect "10" "generation_data extra_edges"
'
graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
@@ -254,7 +254,7 @@ test_expect_success 'write graph with new commit' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -264,7 +264,7 @@ test_expect_success 'write graph with nothing new' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -274,7 +274,7 @@ test_expect_success 'build graph from latest pack with closure' '
cd "$TRASH_DIRECTORY/full" &&
cat new-idx | git commit-graph write --stdin-packs &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "9" "extra_edges"
+ graph_read_expect "9" "generation_data extra_edges"
'
graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
@@ -287,7 +287,7 @@ test_expect_success 'build graph from commits with closure' '
git rev-parse merge/1 >>commits-in &&
cat commits-in | git commit-graph write --stdin-commits &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "6"
+ graph_read_expect "6" "generation_data"
'
graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
@@ -297,7 +297,7 @@ test_expect_success 'build graph from commits with append' '
cd "$TRASH_DIRECTORY/full" &&
git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "10" "extra_edges"
+ graph_read_expect "10" "generation_data extra_edges"
'
graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -307,7 +307,7 @@ test_expect_success 'build graph using --reachable' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write --reachable &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -328,7 +328,7 @@ test_expect_success 'write graph in bare repo' '
cd "$TRASH_DIRECTORY/bare" &&
git commit-graph write &&
test_path_is_file $baredir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
@@ -454,8 +454,9 @@ test_expect_success 'warn on improper hash version' '
test_expect_success 'git commit-graph verify' '
cd "$TRASH_DIRECTORY/full" &&
- git rev-parse commits/8 | git commit-graph write --stdin-commits &&
- git commit-graph verify >output
+ git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
+ git commit-graph verify >output &&
+ graph_read_expect 9 extra_edges
'
NUM_COMMITS=9
@@ -741,4 +742,56 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
)
'
+# We test the overflow-related code with the following repo history:
+#
+# 4:F - 5:N - 6:U
+# / \
+# 1:U - 2:N - 3:U M:N
+# \ /
+# 7:N - 8:F - 9:N
+#
+# Here the commits denoted by U have committer date of zero seconds
+# since Unix epoch, the commits denoted by N have committer date
+# starting from 1112354055 seconds since Unix epoch (default committer
+# date for the test suite), and the commits denoted by F have committer
+# date of (2 ^ 31 - 2) seconds since Unix epoch.
+#
+# The largest offset observed is 2 ^ 31, just large enough to overflow.
+#
+
+test_expect_success 'set up and verify repo with generation data overflow chunk' '
+ objdir=".git/objects" &&
+ UNIX_EPOCH_ZERO="@0 +0000" &&
+ FUTURE_DATE="@2147483646 +0000" &&
+ test_oid_cache <<-EOF &&
+ oid_version sha1:1
+ oid_version sha256:2
+ EOF
+ cd "$TRASH_DIRECTORY" &&
+ mkdir repo &&
+ cd repo &&
+ git init &&
+ test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
+ test_commit 2 &&
+ test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
+ git commit-graph write --reachable &&
+ graph_read_expect 3 generation_data &&
+ test_commit --date "$FUTURE_DATE" 4 &&
+ test_commit 5 &&
+ test_commit --date "$UNIX_EPOCH_ZERO" 6 &&
+ git branch left &&
+ git reset --hard 3 &&
+ test_commit 7 &&
+ test_commit --date "$FUTURE_DATE" 8 &&
+ test_commit 9 &&
+ git branch right &&
+ git reset --hard 3 &&
+ test_merge M left right &&
+ git commit-graph write --reachable &&
+ graph_read_expect 10 "generation_data generation_data_overflow" &&
+ git commit-graph verify
+'
+
+graph_git_behavior 'generation data overflow chunk repo' repo left right
+
test_done
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 4d3842b83b9..587757b62d9 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
infodir=".git/objects/info" &&
graphdir="$infodir/commit-graphs" &&
test_oid_cache <<-EOM
- shallow sha1:1760
- shallow sha256:2064
+ shallow sha1:2132
+ shallow sha256:2436
- base sha1:1376
- base sha256:1496
+ base sha1:1408
+ base sha256:1528
oid_version sha1:1
oid_version sha256:2
@@ -31,9 +31,9 @@ graph_read_expect() {
NUM_BASE=$2
fi
cat >expect <<- EOF
- header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
+ header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
num_commits: $1
- chunks: oid_fanout oid_lookup commit_metadata
+ chunks: oid_fanout oid_lookup commit_metadata generation_data
EOF
test-tool read-graph >output &&
test_cmp expect output
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index af10f0dc090..e2d33a8a4c4 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,6 +55,9 @@ test_expect_success 'setup' '
git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
mv .git/objects/info/commit-graph commit-graph-half &&
chmod u+w commit-graph-half &&
+ GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
+ mv .git/objects/info/commit-graph commit-graph-no-gdat &&
+ chmod u+w commit-graph-no-gdat &&
git config core.commitGraph true
'
@@ -67,6 +70,9 @@ run_all_modes () {
test_cmp expect actual &&
cp commit-graph-half .git/objects/info/commit-graph &&
"$@" <input >actual &&
+ test_cmp expect actual &&
+ cp commit-graph-no-gdat .git/objects/info/commit-graph &&
+ "$@" <input >actual &&
test_cmp expect actual
}
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 999982fe4a9..3ad712c3acc 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -202,6 +202,12 @@ test_commit () {
--signoff)
signoff="$1"
;;
+ --date)
+ notick=yes
+ GIT_COMMITTER_DATE="$2"
+ GIT_AUTHOR_DATE="$2"
+ shift
+ ;;
-C)
indir="$2"
shift
--
gitgitgadget
* [PATCH v6 09/11] commit-graph: use generation v2 only if entire chain does
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
` (7 preceding siblings ...)
2021-01-16 18:11 ` [PATCH v6 08/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11 ` Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
` (3 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
SZEDER Gábor, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:
1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
If each layer of a split commit-graph is treated independently, as was
the case before this commit, with Git inspecting only the current layer
for the chunk_generation_data pointer, commits in the lower layer (one with
GDAT) would have corrected commit dates as their generation numbers,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.
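The violation described above can be illustrated with a minimal sketch.
The struct and function names below (toy_commit, toy_generation_per_layer,
toy_generation_chain_wide) are hypothetical and are not part of Git; the
numbers used in the usage note are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two generation values a commit can have. */
struct toy_commit {
	uint64_t topo_level;     /* generation number v1 */
	uint64_t corrected_date; /* generation number v2 */
};

/*
 * Each layer decides on its own: a layer with GDAT reports corrected
 * commit dates, a layer without GDAT reports topological levels.
 */
uint64_t toy_generation_per_layer(const struct toy_commit *c, int layer_has_gdat)
{
	return layer_has_gdat ? c->corrected_date : c->topo_level;
}

/*
 * One decision for the whole chain: every layer reports the same kind
 * of generation number.
 */
uint64_t toy_generation_chain_wide(const struct toy_commit *c, int read_generation_data)
{
	return read_generation_data ? c->corrected_date : c->topo_level;
}
```

Take a parent in the lower (GDAT) layer with topological level 64 and a
corrected commit date around 1112354055, and a child in the upper (no GDAT)
layer with topological level 65. Read per layer, the parent's generation
(1112354055) is larger than the child's (65). Read chain-wide as topological
levels, the parent (64) is again smaller than the child (65).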
It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.
This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).
Therefore, when writing a new layer in a split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has a GDAT chunk, all lower layers must have a GDAT chunk
as well.
Rewriting layers follows a similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain a GDAT chunk, then the result of the
rewrite does not have a GDAT chunk either.
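The read-side and write-side rules can be sketched roughly as follows. This
is a simplified model, not the code in this patch: toy_layer,
toy_validate_chain and toy_should_write_gdat are hypothetical names, and
has_gdat stands in for a non-NULL chunk_generation_data pointer.

```c
#include <stddef.h>

/* Hypothetical, simplified model of one layer in a split commit-graph. */
struct toy_layer {
	int has_gdat;             /* layer file contains a GDAT chunk */
	int read_generation_data; /* decision shared by the whole chain */
	struct toy_layer *base;   /* next lower layer, NULL at the bottom */
};

/*
 * Read side: the topmost layer decides for the entire chain whether
 * corrected commit dates are used.
 */
void toy_validate_chain(struct toy_layer *top)
{
	int read = top ? top->has_gdat : 0;

	for (; top; top = top->base)
		top->read_generation_data = read;
}

/*
 * Write side: a new or rewritten layer gets a GDAT chunk unless the
 * topmost remaining layer below it exists and lacks one.
 */
int toy_should_write_gdat(const struct toy_layer *topmost_remaining)
{
	return !topmost_remaining || topmost_remaining->has_gdat;
}
```

In this model, replacing the whole chain leaves no remaining layer, so the
result is written with a GDAT chunk, which is the behavior the
--split=replace test in this patch expects.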
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 30 +++++-
commit-graph.h | 1 +
t/t5324-split-commit-graph.sh | 181 ++++++++++++++++++++++++++++++++++
3 files changed, 210 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 7365958d9d3..d32492f3724 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -614,6 +614,21 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
return graph_chain;
}
+static void validate_mixed_generation_chain(struct commit_graph *g)
+{
+ int read_generation_data;
+
+ if (!g)
+ return;
+
+ read_generation_data = !!g->chunk_generation_data;
+
+ while (g) {
+ g->read_generation_data = read_generation_data;
+ g = g->base_graph;
+ }
+}
+
struct commit_graph *read_commit_graph_one(struct repository *r,
struct object_directory *odb)
{
@@ -622,6 +637,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
if (!g)
g = load_commit_graph_chain(r, odb);
+ validate_mixed_generation_chain(g);
+
return g;
}
@@ -791,7 +808,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
- if (g->chunk_generation_data) {
+ if (g->read_generation_data) {
offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
@@ -2019,6 +2036,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
if (i < ctx->num_commit_graphs_after)
ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
+ /*
+ * If the topmost remaining layer has generation data chunk, the
+ * resultant layer also has generation data chunk.
+ */
+ if (i == ctx->num_commit_graphs_after - 2)
+ ctx->write_generation_data = !!g->chunk_generation_data;
+
i--;
g = g->base_graph;
}
@@ -2343,6 +2367,8 @@ int write_commit_graph(struct object_directory *odb,
} else
ctx->num_commit_graphs_after = 1;
+ validate_mixed_generation_chain(ctx->r->objects->commit_graph);
+
compute_generation_numbers(ctx);
if (ctx->changed_paths)
@@ -2541,7 +2567,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
* also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
* in the following condition.
*/
- if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
+ if (!g->read_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
max_generation--;
generation = commit_graph_generation(graph_commit);
diff --git a/commit-graph.h b/commit-graph.h
index 19a02001fde..ad52130883b 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -64,6 +64,7 @@ struct commit_graph {
struct object_directory *odb;
uint32_t num_commits_in_base;
+ unsigned int read_generation_data;
struct commit_graph *base_graph;
const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 587757b62d9..8e90f3423b8 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -453,4 +453,185 @@ test_expect_success 'prevent regression for duplicate commits across layers' '
git -C dup commit-graph verify
'
+NUM_FIRST_LAYER_COMMITS=64
+NUM_SECOND_LAYER_COMMITS=16
+NUM_THIRD_LAYER_COMMITS=7
+NUM_FOURTH_LAYER_COMMITS=8
+NUM_FIFTH_LAYER_COMMITS=16
+SECOND_LAYER_SEQUENCE_START=$(($NUM_FIRST_LAYER_COMMITS + 1))
+SECOND_LAYER_SEQUENCE_END=$(($SECOND_LAYER_SEQUENCE_START + $NUM_SECOND_LAYER_COMMITS - 1))
+THIRD_LAYER_SEQUENCE_START=$(($SECOND_LAYER_SEQUENCE_END + 1))
+THIRD_LAYER_SEQUENCE_END=$(($THIRD_LAYER_SEQUENCE_START + $NUM_THIRD_LAYER_COMMITS - 1))
+FOURTH_LAYER_SEQUENCE_START=$(($THIRD_LAYER_SEQUENCE_END + 1))
+FOURTH_LAYER_SEQUENCE_END=$(($FOURTH_LAYER_SEQUENCE_START + $NUM_FOURTH_LAYER_COMMITS - 1))
+FIFTH_LAYER_SEQUENCE_START=$(($FOURTH_LAYER_SEQUENCE_END + 1))
+FIFTH_LAYER_SEQUENCE_END=$(($FIFTH_LAYER_SEQUENCE_START + $NUM_FIFTH_LAYER_COMMITS - 1))
+
+# Current split graph chain:
+#
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+ graphdir=".git/objects/info/commit-graphs" &&
+ test_oid_cache <<-EOF &&
+ oid_version sha1:1
+ oid_version sha256:2
+ EOF
+ git init mixed &&
+ (
+ cd mixed &&
+ git config core.commitGraph true &&
+ git config gc.writeCommitGraph false &&
+ for i in $(test_seq $NUM_FIRST_LAYER_COMMITS)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split &&
+ graph_read_expect $NUM_FIRST_LAYER_COMMITS &&
+ test_line_count = 1 $graphdir/commit-graph-chain &&
+ for i in $(test_seq $SECOND_LAYER_SEQUENCE_START $SECOND_LAYER_SEQUENCE_END)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+ test_line_count = 2 $graphdir/commit-graph-chain &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 4 1
+ num_commits: $NUM_SECOND_LAYER_COMMITS
+ chunks: oid_fanout oid_lookup commit_metadata
+ EOF
+ test_cmp expect output &&
+ git commit-graph verify &&
+ cat $graphdir/commit-graph-chain
+ )
+'
+
+# The new layer will be added without generation data chunk as it was not
+# present on the layer underneath it.
+#
+# 7 commits (No GDAT)
+# ------------------------
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+test_expect_success 'do not write generation data chunk if not present on existing tip' '
+ git clone mixed mixed-no-gdat &&
+ (
+ cd mixed-no-gdat &&
+ for i in $(test_seq $THIRD_LAYER_SEQUENCE_START $THIRD_LAYER_SEQUENCE_END)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split=no-merge &&
+ test_line_count = 3 $graphdir/commit-graph-chain &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 4 2
+ num_commits: $NUM_THIRD_LAYER_COMMITS
+ chunks: oid_fanout oid_lookup commit_metadata
+ EOF
+ test_cmp expect output &&
+ git commit-graph verify
+ )
+'
+
+# Number of commits in each layer of the split-commit graph before merge:
+#
+# 8 commits (No GDAT)
+# ------------------------
+# 7 commits (No GDAT)
+# ------------------------
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+# The top two layers are merged and do not have generation data chunk as layer below them does
+# not have generation data chunk.
+#
+# 15 commits (No GDAT)
+# ------------------------
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+test_expect_success 'do not write generation data chunk if the topmost remaining layer does not have generation data chunk' '
+ git clone mixed-no-gdat mixed-merge-no-gdat &&
+ (
+ cd mixed-merge-no-gdat &&
+ for i in $(test_seq $FOURTH_LAYER_SEQUENCE_START $FOURTH_LAYER_SEQUENCE_END)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split --size-multiple 1 &&
+ test_line_count = 3 $graphdir/commit-graph-chain &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 4 2
+ num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
+ chunks: oid_fanout oid_lookup commit_metadata
+ EOF
+ test_cmp expect output &&
+ git commit-graph verify
+ )
+'
+
+# Number of commits in each layer of the split-commit graph before merge:
+#
+# 16 commits (No GDAT)
+# ------------------------
+# 15 commits (No GDAT)
+# ------------------------
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+# The top three layers are merged and has generation data chunk as the topmost remaining layer
+# has generation data chunk.
+#
+# 47 commits (GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+test_expect_success 'write generation data chunk if topmost remaining layer has generation data chunk' '
+ git clone mixed-merge-no-gdat mixed-merge-gdat &&
+ (
+ cd mixed-merge-gdat &&
+ for i in $(test_seq $FIFTH_LAYER_SEQUENCE_START $FIFTH_LAYER_SEQUENCE_END)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split --size-multiple 1 &&
+ test_line_count = 2 $graphdir/commit-graph-chain &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 5 1
+ num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
+ chunks: oid_fanout oid_lookup commit_metadata generation_data
+ EOF
+ test_cmp expect output
+ )
+'
+
+test_expect_success 'write generation data chunk when commit-graph chain is replaced' '
+ git clone mixed mixed-replace &&
+ (
+ cd mixed-replace &&
+ git commit-graph write --reachable --split=replace &&
+ test_path_is_file $graphdir/commit-graph-chain &&
+ test_line_count = 1 $graphdir/commit-graph-chain &&
+ verify_chain_files_exist $graphdir &&
+ graph_read_expect $(($NUM_FIRST_LAYER_COMMITS + $NUM_SECOND_LAYER_COMMITS)) &&
+ git commit-graph verify
+ )
+'
+
test_done
--
gitgitgadget
* [PATCH v6 10/11] commit-reach: use corrected commit dates in paint_down_to_common()
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
` (8 preceding siblings ...)
2021-01-16 18:11 ` [PATCH v6 09/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11 ` Abhishek Kumar via GitGitGadget
2021-01-16 18:11 ` [PATCH v6 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
` (2 subsequent siblings)
12 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
SZEDER Gábor, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
091f4cf (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.
For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.
Although Git walks nearly the same number of commits with corrected
commit dates as with commit dates, the process is slower because each
comparison has to access a commit-slab (for the corrected committer
date) instead of a struct member (for the committer date).
This change incidentally broke the fragile t6404-recursive-merge test.
t6404-recursive-merge sets up a unique repository where all commits have
the same committer date without a well-defined merge-base.
While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.
With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.
While the resulting repositories are equivalent, 6404.4 'virtual trees
were processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.
As this has already caused problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable the commit-graph
within t6404-recursive-merge.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 14 ++++++++++++++
commit-graph.h | 6 ++++++
commit-reach.c | 2 +-
t/t6404-recursive-merge.sh | 5 ++++-
4 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index d32492f3724..d3d14601d4d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -714,6 +714,20 @@ int generation_numbers_enabled(struct repository *r)
return !!first_generation;
}
+int corrected_commit_dates_enabled(struct repository *r)
+{
+ struct commit_graph *g;
+ if (!prepare_commit_graph(r))
+ return 0;
+
+ g = r->objects->commit_graph;
+
+ if (!g->num_commits)
+ return 0;
+
+ return g->read_generation_data;
+}
+
struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
{
struct commit_graph *g = r->objects->commit_graph;
diff --git a/commit-graph.h b/commit-graph.h
index ad52130883b..97f3497c279 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -95,6 +95,12 @@ struct commit_graph *parse_commit_graph(struct repository *r,
*/
int generation_numbers_enabled(struct repository *r);
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
+
struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
enum commit_graph_write_flags {
diff --git a/commit-reach.c b/commit-reach.c
index 9b24b0378d5..e38771ca5a1 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
int i;
timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
- if (!min_generation)
+ if (!min_generation && !corrected_commit_dates_enabled(r))
queue.compare = compare_commits_by_commit_date;
one->object.flags |= PARENT1;
diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
index b1c3d4dda49..86f74ae5847 100755
--- a/t/t6404-recursive-merge.sh
+++ b/t/t6404-recursive-merge.sh
@@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
export GIT_COMMITTER_DATE
test_expect_success 'setup tests' '
+ GIT_TEST_COMMIT_GRAPH=0 &&
+ export GIT_TEST_COMMIT_GRAPH &&
echo 1 >a1 &&
git add a1 &&
GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
'
test_expect_success 'combined merge conflicts' '
- test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+ test_must_fail git merge -m final G
'
test_expect_success 'result contains a conflict' '
@@ -82,6 +84,7 @@ test_expect_success 'result contains a conflict' '
'
test_expect_success 'virtual trees were processed' '
+ # TODO: fragile test, relies on ambiguous merge-base resolution
git ls-files --stage >out &&
cat >expect <<-EOF &&
--
gitgitgadget
* [PATCH v6 11/11] doc: add corrected commit date info
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
` (9 preceding siblings ...)
2021-01-16 18:11 ` [PATCH v6 10/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2021-01-16 18:11 ` Abhishek Kumar via GitGitGadget
2021-01-27 0:04 ` SZEDER Gábor
2021-01-18 21:04 ` [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
12 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-01-16 18:11 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar,
SZEDER Gábor, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
With generation data chunk and corrected commit dates implemented, let's
update the technical documentation for commit-graph.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
.../technical/commit-graph-format.txt | 28 +++++--
Documentation/technical/commit-graph.txt | 77 +++++++++++++++----
2 files changed, 86 insertions(+), 19 deletions(-)
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b3b58880b92..b6658eff188 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
The Git commit graph stores a list of commit OIDs and some associated
metadata, including:
-- The generation number of the commit. Commits with no parents have
- generation number 1; commits with parents have generation number
- one more than the maximum generation number of its parents. We
- reserve zero as special, and can be used to mark a generation
- number invalid or as "not computed".
+- The generation number of the commit.
- The root tree OID.
@@ -86,13 +82,33 @@ CHUNK DATA:
position. If there are more than two parents, the second value
has its most-significant bit on and the other bits store an array
position into the Extra Edge List chunk.
- * The next 8 bytes store the generation number of the commit and
+ * The next 8 bytes store the topological level (generation number v1)
+ of the commit and
the commit time in seconds since EPOCH. The generation number
uses the higher 30 bits of the first 4 bytes, while the commit
time uses the 32 bits of the second 4 bytes, along with the lowest
2 bits of the lowest byte, storing the 33rd and 34th bit of the
commit time.
+ Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+ * This list of 4-byte values stores corrected commit date offsets for the
+ commits, arranged in the same order as the Commit Data chunk.
+ * If the corrected commit date offset cannot be stored within 31 bits,
+ the value has its most-significant bit on and the other bits store
+ the position of the corrected commit date in the Generation Data Overflow
+ chunk.
+ * The Generation Data chunk is present only when the commit-graph file is
+ written by a compatible version of Git and, in the case of split
+ commit-graph chains, the topmost layer also has a Generation Data chunk.
+
+ Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
+ * This list of 8-byte values stores the corrected commit date offsets
+ for commits with corrected commit date offsets that cannot be
+ stored within 31 bits.
+ * Generation Data Overflow chunk is present only when Generation Data
+ chunk is present and at least one corrected commit date offset cannot
+ be stored within 31 bits.
+
Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
This list of 4-byte values store the second through nth parents for
all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index f14a7659aa8..f05e7bda1a9 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
Values 1-4 satisfy the requirements of parse_commit_gently().
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates (generation number v2)
+2. Topological levels (generation number v1)
- * A commit with no parents (a root commit) has generation number one.
+Define "corrected committer date" of a commit recursively as follows:
- * A commit with at least one parent has generation number one more than
- the largest generation number among its parents.
+ * A commit with no parents (a root commit) has corrected committer date
+ equal to its committer date.
-Equivalently, the generation number of a commit A is one more than the
+ * A commit with at least one parent has corrected committer date equal to
+ the maximum of its committer date and one more than the largest corrected
+ committer date among its parents.
+
+ * As a special case, a root commit with timestamp zero has corrected commit
+ date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+ (that is, an uncomputed corrected commit date).
+
+Define the "topological level" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has topological level of one.
+
+ * A commit with at least one parent has topological level one more than
+ the largest topological level among its parents.
+
+Equivalently, the topological level of a commit A is one more than the
length of a longest path from A to a root commit. The recursive definition
is easier to use for computation and observing the following property:
@@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
generation numbers, then we always expand the boundary commit with highest
generation number and can easily detect the stopping condition.
+The property applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
This property can be used to significantly reduce the time it takes to
walk commits and determine topological relationships. Without generation
numbers, the general heuristic is the following:
@@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
If A and B are commits with commit time X and Y, respectively, and
X < Y, then A _probably_ cannot reach B.
-This heuristic is currently used whenever the computation is allowed to
+In the absence of corrected commit dates (for example, old versions of Git or
+mixed generation graph chains),
+this heuristic is currently used whenever the computation is allowed to
violate topological relationships due to clock skew (such as "git log"
with default order), but is not used when the topological order is
required (such as merge base calculations, "git log --graph").
@@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
generation number and walk until reaching commits with known generation
number.
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
in the commit-graph file. If a commit-graph file was written by a version
of Git that did not compute generation numbers, then those commits will
have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
walking a few extra commits, but the simplicity in dealing with commits
with generation number *_INFINITY or *_ZERO is valuable.
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
+topological levels (generation number v1) are computed to be at least
+this value. We limit at this value since it is the largest value that
+can be stored in the commit-graph file using the 30 bits available
+to topological levels. This presents another case where a commit can
+have generation number equal to that of a parent.
Design Details
--------------
@@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
number of commits) could be extracted into config settings for full
flexibility.
+## Handling Mixed Generation Number Chains
+
+With the introduction of generation number v2 and generation data chunk, the
+following scenario is possible:
+
+1. "New" Git writes a commit-graph with the corrected commit dates.
+2. "Old" Git writes a split commit-graph on top without corrected commit dates.
+
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
+levels of the higher layer. For this reason, Git inspects the topmost
+layer to see if the layer is missing corrected commit dates. In such a case
+Git only uses topological level for generation numbers.
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
+guarantees that if a layer has corrected commit dates, all lower layers
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
+commit dates. Instead, the new layer will have corrected commit dates if the
+layer below the new layer has corrected commit dates.
+
+While writing or merging layers, if the new layer is the only layer, it will
+have corrected commit dates when written by compatible versions of Git. Thus,
+rewriting split commit-graph as a single file (`--split=replace`) creates a
+single layer with corrected commit dates.
+
## Deleting graph-{hash} files
After a new tip file is written, some `graph-{hash}` files may no longer
--
gitgitgadget
^ permalink raw reply related [flat|nested] 211+ messages in thread
* Re: [PATCH v6 11/11] doc: add corrected commit date info
2021-01-16 18:11 ` [PATCH v6 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2021-01-27 0:04 ` SZEDER Gábor
2021-01-30 5:29 ` Abhishek Kumar
0 siblings, 1 reply; 211+ messages in thread
From: SZEDER Gábor @ 2021-01-27 0:04 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget
Cc: git, Derrick Stolee, Jakub Narębski, Taylor Blau, Abhishek Kumar
On Sat, Jan 16, 2021 at 06:11:18PM +0000, Abhishek Kumar via GitGitGadget wrote:
> With generation data chunk and corrected commit dates implemented, let's
> update the technical documentation for commit-graph.
This patch should come much earlier in this series, before patch 07/11
(commit-graph: implement corrected commit date), or perhaps even
earlier. That way, if someone investigating an issue in this
series checks out one of its commits, the specification
will be right there under 'Documentation/technical/'.
Furthermore, a patch introducing a new chunk format is the right place
to justify the introduction of said new chunk. What problems does a
chunk of corrected commit dates solve? Why does it solve them? Why
do we need corrected commit dates instead of simple commit dates?
What alternatives were considered [1]? Any other design considerations
worth mentioning for the benefit of future readers?
None of the patches' log messages properly explain these, and while
much of this is indeed explained in the cover letter, the cover
letter will not be part of the history. Having to look up mailing
list archives for the justification puts an unnecessary burden on other
developers who might become interested in this feature in the future.
You might want to take
https://public-inbox.org/git/20200529085038.26008-16-szeder.dev@gmail.com/
as an inspiration.
[1] Please remember the following snippet from SubmittingPatches:
"Try to make sure your explanation can be understood without
external resources. Instead of giving a URL to a mailing list
archive, summarize the relevant points of the discussion."
^ permalink raw reply [flat|nested] 211+ messages in thread
* Re: [PATCH v6 11/11] doc: add corrected commit date info
2021-01-27 0:04 ` SZEDER Gábor
@ 2021-01-30 5:29 ` Abhishek Kumar
2021-01-31 1:45 ` Taylor Blau
0 siblings, 1 reply; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-30 5:29 UTC (permalink / raw)
To: SZEDER Gábor
Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me, stolee
On Wed, Jan 27, 2021 at 01:04:54AM +0100, SZEDER Gábor wrote:
> On Sat, Jan 16, 2021 at 06:11:18PM +0000, Abhishek Kumar via GitGitGadget wrote:
> > With generation data chunk and corrected commit dates implemented, let's
> > update the technical documentation for commit-graph.
>
> This patch should come much earlier in this series, before patch 07/11
> (commit-graph: implement corrected commit date), or perhaps even
> earlier. That way, if someone investigating an issue in this
> series checks out one of its commits, the specification
> will be right there under 'Documentation/technical/'.
>
> Furthermore, a patch introducing a new chunk format is the right place
> to justify the introduction of said new chunk. What problems does a
> chunk of corrected commit dates solve? Why does it solve them? Why
> do we need corrected commit dates instead of simple commit dates?
> What alternatives were considered [1]? Any other design considerations
> worth mentioning for the benefit of future readers?
>
> None of the patches' log messages properly explain these, and while
> much of this is indeed explained in the cover letter, the cover
> letter will not be part of the history. Having to look up mailing
> list archives for the justification puts an unnecessary burden on other
> developers who might become interested in this feature in the future.
>
> You might want to take
> https://public-inbox.org/git/20200529085038.26008-16-szeder.dev@gmail.com/
> as an inspiration.
>
Alright, the suggestion makes a lot of sense, and the patch introducing
the documentation is the perfect place to justify the introduction of the
new chunk format.
>
> [1] Please remember the following snippet from SubmittingPatches:
> "Try to make sure your explanation can be understood without
> external resources. Instead of giving a URL to a mailing list
> archive, summarize the relevant points of the discussion."
>
Thanks
- Abhishek
^ permalink raw reply [flat|nested] 211+ messages in thread
* Re: [PATCH v6 11/11] doc: add corrected commit date info
2021-01-30 5:29 ` Abhishek Kumar
@ 2021-01-31 1:45 ` Taylor Blau
0 siblings, 0 replies; 211+ messages in thread
From: Taylor Blau @ 2021-01-31 1:45 UTC (permalink / raw)
To: 20210127000454.GA1440011
Cc: SZEDER Gábor, abhishekkumar8222, git, gitgitgadget, jnareb,
me, stolee
On Sat, Jan 30, 2021 at 10:59:05AM +0530, Abhishek Kumar wrote:
> > You might want to take
> > https://public-inbox.org/git/20200529085038.26008-16-szeder.dev@gmail.com/
> > as an inspiration.
> >
> Alright, the suggestion makes a lot of sense and the patch introducing
> documentation is the perfect place to justify the introduction of new
> chunk format.
I don't have any strong feelings about Gábor's suggestion itself, but
note that there isn't any work for you to do in this series, since the
patches are on track to be merged to master.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 211+ messages in thread
* Re: [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
` (10 preceding siblings ...)
2021-01-16 18:11 ` [PATCH v6 11/11] doc: add corrected commit date info Abhishek Kumar via GitGitGadget
@ 2021-01-18 21:04 ` Derrick Stolee
2021-01-18 22:00 ` Taylor Blau
` (2 more replies)
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
12 siblings, 3 replies; 211+ messages in thread
From: Derrick Stolee @ 2021-01-18 21:04 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget, git
Cc: Jakub Narębski, Taylor Blau, Abhishek Kumar, SZEDER Gábor
On 1/16/2021 1:11 PM, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.
...
> Changes in version 6:
>
> * Fixed typos in commit message for "commit-graph: implement corrected
> commit date".
> * Removed an unnecessary else-block in "commit-graph: implement corrected
> commit date".
> * Validate mixed generation chain correctly while writing in "commit-graph:
> use generation v2 only if the entire chain does".
> * Die if the GDAT chunk indicates data has overflowed but there is no
> generation data overflow chunk.
I checked the range-diff and looked once more through the patch
series. This version is good to go by my standards.
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Thanks, Abhishek!
^ permalink raw reply [flat|nested] 211+ messages in thread
* Re: [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date
2021-01-18 21:04 ` [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
@ 2021-01-18 22:00 ` Taylor Blau
2021-01-23 12:11 ` Abhishek Kumar
2021-01-19 0:02 ` Junio C Hamano
2021-01-23 12:07 ` Abhishek Kumar
2 siblings, 1 reply; 211+ messages in thread
From: Taylor Blau @ 2021-01-18 22:00 UTC (permalink / raw)
To: Derrick Stolee
Cc: Abhishek Kumar via GitGitGadget, git, Jakub Narębski,
Taylor Blau, Abhishek Kumar, SZEDER Gábor
On Mon, Jan 18, 2021 at 04:04:14PM -0500, Derrick Stolee wrote:
> I checked the range-diff and looked once more through the patch
> series. This version is good to go by my standards.
>
> Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
I re-read this series now that it seems to have stabilized, and I agree
with Stolee that it LGTM.
Reviewed-by: Taylor Blau <me@ttaylorr.com>
> Thanks, Abhishek!
Incredible work!
Thanks,
Taylor
^ permalink raw reply [flat|nested] 211+ messages in thread
* Re: [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date
2021-01-18 22:00 ` Taylor Blau
@ 2021-01-23 12:11 ` Abhishek Kumar
0 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-23 12:11 UTC (permalink / raw)
To: Taylor Blau; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb, stolee
On Mon, Jan 18, 2021 at 05:00:41PM -0500, Taylor Blau wrote:
> On Mon, Jan 18, 2021 at 04:04:14PM -0500, Derrick Stolee wrote:
> > I checked the range-diff and looked once more through the patch
> > series. This version is good to go by my standards.
> >
> > Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
>
> I re-read this series now that it seems to have stabilized, and I agree
> with Stolee that it LGTM.
>
> Reviewed-by: Taylor Blau <me@ttaylorr.com>
>
> > Thanks, Abhishek!
>
> Incredible work!
Thanks a lot for the reviews and for the help in identifying the reason
behind the (relatively) minor performance increase when we switched from
useless 'commit_graph_generation()' calls to direct slab calls.
>
> Thanks,
> Taylor
Thanks
- Abhishek
^ permalink raw reply [flat|nested] 211+ messages in thread
* Re: [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date
2021-01-18 21:04 ` [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
2021-01-18 22:00 ` Taylor Blau
@ 2021-01-19 0:02 ` Junio C Hamano
2021-01-23 12:07 ` Abhishek Kumar
2 siblings, 0 replies; 211+ messages in thread
From: Junio C Hamano @ 2021-01-19 0:02 UTC (permalink / raw)
To: Derrick Stolee
Cc: Abhishek Kumar via GitGitGadget, git, Jakub Narębski,
Taylor Blau, Abhishek Kumar, SZEDER Gábor
Derrick Stolee <stolee@gmail.com> writes:
> On 1/16/2021 1:11 PM, Abhishek Kumar via GitGitGadget wrote:
>> This patch series implements the corrected commit date offsets as generation
>> number v2, along with other pre-requisites.
> ...
>> Changes in version 6:
>>
>> * Fixed typos in commit message for "commit-graph: implement corrected
>> commit date".
>> * Removed an unnecessary else-block in "commit-graph: implement corrected
>> commit date".
>> * Validate mixed generation chain correctly while writing in "commit-graph:
>> use generation v2 only if the entire chain does".
>> * Die if the GDAT chunk indicates data has overflowed but there is no
>> generation data overflow chunk.
>
> I checked the range-diff and looked once more through the patch
> series. This version is good to go by my standards.
>
> Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Thanks, both. I'll give it a (hopefully) final read-over after
replacing what we have kept in 'seen'.
^ permalink raw reply [flat|nested] 211+ messages in thread
* Re: [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date
2021-01-18 21:04 ` [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
2021-01-18 22:00 ` Taylor Blau
2021-01-19 0:02 ` Junio C Hamano
@ 2021-01-23 12:07 ` Abhishek Kumar
2 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar @ 2021-01-23 12:07 UTC (permalink / raw)
To: Derrick Stolee; +Cc: abhishekkumar8222, git, gitgitgadget, jnareb, me
On Mon, Jan 18, 2021 at 04:04:14PM -0500, Derrick Stolee wrote:
> On 1/16/2021 1:11 PM, Abhishek Kumar via GitGitGadget wrote:
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
> ...
> > Changes in version 6:
> >
> > * Fixed typos in commit message for "commit-graph: implement corrected
> > commit date".
> > * Removed an unnecessary else-block in "commit-graph: implement corrected
> > commit date".
> > * Validate mixed generation chain correctly while writing in "commit-graph:
> > use generation v2 only if the entire chain does".
> > * Die if the GDAT chunk indicates data has overflowed but there is no
> > generation data overflow chunk.
>
> I checked the range-diff and looked once more through the patch
> series. This version is good to go by my standards.
>
> Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
>
> Thanks, Abhishek!
>
Thanks a lot for the review and continued guidance throughout the
patch series!
- Abhishek
^ permalink raw reply [flat|nested] 211+ messages in thread
* [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date
2021-01-16 18:11 ` [PATCH v6 " Abhishek Kumar via GitGitGadget
` (11 preceding siblings ...)
2021-01-18 21:04 ` [PATCH v6 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
@ 2021-02-01 6:58 ` Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
` (11 more replies)
12 siblings, 12 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01 6:58 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
SZEDER Gábor, Taylor Blau, Abhishek Kumar
This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.
Git uses topological levels in the commit-graph file for commit-graph
traversal operations like 'git log --graph'. Unfortunately, topological
levels can perform worse than committer date when parents of a commit differ
greatly in generation numbers [1]. For example, 'git merge-base v4.8 v4.9'
on the Linux repository walks 635,579 commits using topological levels and
walks 167,468 using committer date. Since 091f4cf3 (commit: don't use
generation numbers if not needed, 2018-08-30), 'git merge-base' uses
committer date heuristic unless there is a cutoff because of the performance
hit.
[1]
https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/
Thus, the need for generation number v2 was born. Because Git used to die
when the graph version it understood differed from the one in the
commit-graph file [2], we needed a way to distinguish between the old and
new generation numbers without incrementing the graph version.
[2] https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
The following candidates were proposed
(https://github.com/derrickstolee/gen-test,
https://github.com/abhishekkumar2718/git/pull/1):
* (Epoch, Date) Pairs.
* Maximum Generation Numbers.
* Corrected Commit Date.
* FELINE Index.
* Corrected Commit Date with Monotonically Increasing Offsets.
Based on performance, local computability, and immutability (along with the
introduction of an additional commit-graph chunk which relieved the
requirement of backwards-compatibility) Corrected Commit Date was chosen as
generation number v2 and is defined as follows:
For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents plus 1. Then
corrected commit date offset is the difference between corrected commit date
of C and commit date of C. As a special case, a root commit with the
timestamp zero has corrected commit date of 1 to distinguish it from
GENERATION_NUMBER_ZERO (that is, an uncomputed generation number).
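The definition above can be sketched in C as follows. This is only an
illustration under stated assumptions: 'struct toy_commit' and both
functions are hypothetical names that do not exist in Git, and the sketch
assumes every parent has already been processed (that is, commits are
visited in topological order).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified commit record; this is not Git's struct commit. */
struct toy_commit {
	uint64_t date;                     /* committer date, seconds since epoch */
	const struct toy_commit **parents; /* parents, already processed */
	size_t nr_parents;
	uint64_t corrected_date;           /* filled in by the function below */
};

/*
 * Compute the corrected commit date of c, assuming corrected_date is
 * already set on every parent.
 */
uint64_t toy_corrected_commit_date(struct toy_commit *c)
{
	uint64_t max_parent = 0;
	size_t i;

	for (i = 0; i < c->nr_parents; i++)
		if (c->parents[i]->corrected_date > max_parent)
			max_parent = c->parents[i]->corrected_date;

	/* max(commit date, largest parent corrected date + 1) */
	c->corrected_date = c->date;
	if (c->nr_parents && max_parent + 1 > c->corrected_date)
		c->corrected_date = max_parent + 1;

	/* special case: a root commit with timestamp zero gets 1, not 0 */
	if (!c->corrected_date)
		c->corrected_date = 1;

	return c->corrected_date;
}

/* The offset that would be stored: corrected date minus commit date. */
uint64_t toy_corrected_date_offset(const struct toy_commit *c)
{
	return c->corrected_date - c->date;
}
```

For a commit whose clock is not skewed relative to its parents, the offset
is zero; the offset grows only where a commit claims a date no later than
the corrected date of one of its parents.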
While it was proposed initially to store corrected commit date offsets
within Commit Data Chunk, storing the offsets in a new chunk did not affect
the performance measurably. The new chunk is "Generation DATa (GDAT) chunk"
and it stores corrected commit date offsets while CDAT chunk stores
topological level. The old versions of Git would ignore GDAT chunk, using
topological levels from CDAT chunk. In contrast, new versions of Git would
use corrected commit dates, falling back to topological level if the
generation data chunk is absent in the commit-graph file.
Storing corrected commit date offsets instead of corrected commit dates
saves us 4 bytes per commit, but the offset can overflow the space
allocated. To handle such cases, we introduce a new chunk, Generation Data
Overflow (GDOV), that stores the corrected commit date. For overflowing
offsets, we set the MSB and store the position within the GDOV chunk
instead, in a mechanism similar to the Extra Edges list chunk.
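A rough sketch of this overflow scheme, assuming a 32-bit stored value
whose most significant bit marks an overflow and whose remaining 31 bits
hold either the offset or an index into the overflow list. All names here
(the TOY_* macros, 'struct toy_overflow_list', and both functions) are
hypothetical and are not the identifiers used in commit-graph.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 31 bits of payload, MSB as the overflow marker. */
#define TOY_OFFSET_MAX   ((UINT32_C(1) << 31) - 1)
#define TOY_OVERFLOW_BIT (UINT32_C(1) << 31)

/* Stand-in for the overflow chunk: a list of full corrected commit dates. */
struct toy_overflow_list {
	uint64_t *dates; /* caller-provided storage */
	size_t nr, alloc;
};

/* Encode one commit; returns 0 on success, -1 if the overflow list is full. */
int toy_encode_generation(uint64_t commit_date, uint64_t corrected_date,
			  struct toy_overflow_list *ov, uint32_t *out)
{
	uint64_t offset = corrected_date - commit_date;

	if (offset <= TOY_OFFSET_MAX) {
		*out = (uint32_t)offset; /* small offsets are stored inline */
		return 0;
	}
	if (ov->nr >= ov->alloc || ov->nr > TOY_OFFSET_MAX)
		return -1;

	/* large offsets: set the MSB and store the overflow-list position */
	*out = TOY_OVERFLOW_BIT | (uint32_t)ov->nr;
	ov->dates[ov->nr++] = corrected_date;
	return 0;
}

/* Decode one commit; returns -1 if the stored value points past the list. */
int toy_decode_generation(uint64_t commit_date, uint32_t stored,
			  const struct toy_overflow_list *ov,
			  uint64_t *corrected_date)
{
	size_t pos;

	if (!(stored & TOY_OVERFLOW_BIT)) {
		*corrected_date = commit_date + stored;
		return 0;
	}
	pos = stored ^ TOY_OVERFLOW_BIT;
	if (pos >= ov->nr)
		return -1; /* overflow flagged, but no matching overflow entry */
	*corrected_date = ov->dates[pos];
	return 0;
}
```

The -1 return in the decoder loosely corresponds to the case where the
stored data indicates an overflow but no overflow entry is available, which
the series treats as a fatal error.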
In a mixed generation number environment (for example, new Git on the
command line and old Git used by a GUI client), we can encounter a
mixed-chain commit-graph (a commit-graph chain where some of the split
commit-graph files have a GDAT chunk and others do not). As backward
compatibility is one of the goals, we define the following behavior:
While reading a mixed-chain commit-graph, we fall back on topological
levels, because corrected commit dates and topological levels cannot be
compared directly.
When adding a new layer to the split commit-graph file, and when merging
some or all layers (replacing them in the latter case), the new layer will
have a GDAT chunk if and only if in the final result there would be no layer
without a GDAT chunk just below it.
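The read and write rules for mixed chains can be sketched as two small
predicates. 'struct toy_layer' and the function names are hypothetical and
only model whether a layer carries a GDAT chunk; the read-side check walks
the whole chain here for clarity, which is an assumption of this sketch
rather than a description of the actual implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one layer in a split commit-graph chain. */
struct toy_layer {
	bool has_gdat;                /* does this layer carry a GDAT chunk? */
	const struct toy_layer *base; /* next lower layer, or NULL */
};

/* Reading: use generation number v2 only if no layer lacks GDAT. */
bool toy_chain_uses_v2(const struct toy_layer *tip)
{
	for (; tip; tip = tip->base)
		if (!tip->has_gdat)
			return false;
	return true;
}

/*
 * Writing or merging: the new layer gets a GDAT chunk if and only if the
 * layer that ends up just below it has one, or if there is no such layer.
 */
bool toy_new_layer_writes_gdat(const struct toy_layer *below)
{
	return !below || below->has_gdat;
}
```

With the write rule in place, a layer with a GDAT chunk never sits on top
of one without it, so rewriting the chain as a single layer always yields
a layer with a GDAT chunk.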
Thanks to Dr. Stolee, Dr. Narębski, Taylor Blau and SZEDER Gábor for their
reviews.
I look forward to everyone's reviews!
Thanks
- Abhishek
----------------------------------------------------------------------------
Improvements left for a future series:
* Save commits with generation data overflow and extra edge commits instead
of looping over all commits. cf. 858sbel67n.fsf@gmail.com
* Verify both topological levels and corrected commit dates when present.
cf. 85pn4tnk8u.fsf@gmail.com
Changes in version 7:
* Moved the documentation patch ahead of "commit-graph: implement corrected
commit date" and elaborated on the introduction of generation number v2.
Changes in version 6:
* Fixed typos in commit message for "commit-graph: implement corrected
commit date".
* Removed an unnecessary else-block in "commit-graph: implement corrected
commit date".
* Validate mixed generation chain correctly while writing in "commit-graph:
use generation v2 only if the entire chain does".
* Die if the GDAT chunk indicates data has overflowed but there is no
generation data overflow chunk.
Changes in version 5:
* Explained a possible reason for no change in performance for
"commit-graph: fix regression when computing bloom-filters"
* Clarified about the addition of a new test for 11-digit octal
implementations of ustar.
* Fixed duplicate test names in "commit-graph: consolidate
fill_commit_graph_info".
* Swapped the order "commit-graph: return 64-bit generation number",
"commit-graph: add a slab to store topological levels" to minimize lines
changed.
* Fixed the mismerge in "commit-graph: return 64-bit generation number"
* Clarified the preparatory steps are for the larger goal of implementing
generation number v2 in "commit-graph: return 64-bit generation number".
* Moved the rename of "run_three_modes()" to "run_all_modes()" into a new
patch "t6600-test-reach: generalize *_three_modes".
* Explained and removed the checks for GENERATION_NUMBER_INFINITY that can
never be true in "commit-graph: add a slab to store topological levels".
* Fixed incorrect logic for verifying commit-graph in "commit-graph:
implement corrected commit date".
* Added minor improvements to commit message of "commit-graph: implement
generation data chunk".
* Added '--date ' option to test_commit() in 'test-lib-functions.sh' in
"commit-graph: implement generation data chunk".
* Improved coding style (also in tests) for "commit-graph: use generation
v2 only if entire chain does".
* Simplified test repository structure in "commit-graph: use generation v2
only if entire chain does" as only the number of commits in a split
commit-graph layer are relevant.
* Added a new test in "commit-graph: use generation v2 only if entire chain
does" to check if the layers are merged correctly.
* Explicitly mentioned commit "091f4cf3" in the commit-message of
"commit-graph: use corrected commit dates in paint_down_to_common()".
* Minor corrections to documentation in "doc: add corrected commit date
info".
* Minor corrections to coding style.
Changes in version 4:
* Added GDOV to handle overflows in generation data.
* Added a test for writing tip graph for a generation number v2 graph chain
in t5324-split-commit-graph.sh
* Added a section on how mixed generation number chains are handled in
Documentation/technical/commit-graph-format.txt
* Reverted unimportant whitespace, style changes in commit-graph.c
* Added header comments about the order of comparison for
compare_commits_by_gen_then_commit_date in commit.h,
compare_commits_by_gen in commit-graph.h
* Elaborated on why t6404 fails with corrected commit date and must be run
with GIT_TEST_COMMIT_GRAPH=1 in the commit "commit-reach: use corrected
commit dates in paint_down_to_common()"
* Elaborated on write behavior for mixed generation number chains in the
commit "commit-graph: use generation v2 only if entire chain does"
* Added notes about adding the topo_level slab to struct
write_commit_graph_context as well as struct commit_graph.
* Clarified commit message for "commit-graph: consolidate
fill_commit_graph_info"
* Removed the claim "GDAT can store future generation numbers" because it
hasn't been tested yet.
Changes in version 3:
* Reordered patches to implement corrected commit date before generation
data chunk [3].
* Split "implement corrected commit date" into two patches - one
introducing the topo level slab and other implementing corrected commit
dates.
* Extended split-commit-graph tests to verify at the end of test.
* Use topological levels as generation number if any of split commit-graph
files do not have generation data chunk.
[3]
https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/
Changes in version 2:
* Add tests for generation data chunk.
* Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
generation data chunk.
* Compare commits with corrected commit dates if present in
paint_down_to_common().
* Update technical documentation.
* Handle mixed generation commit chains.
* Improve commit messages for "commit-graph: fix regression when computing
bloom filter", "commit-graph: consolidate fill_commit_graph_info".
* Revert unnecessary whitespace changes.
* Split uint32_t -> timestamp_t change into a new commit.
Abhishek Kumar (11):
commit-graph: fix regression when computing Bloom filters
revision: parse parent in indegree_walk_step()
commit-graph: consolidate fill_commit_graph_info
t6600-test-reach: generalize *_three_modes
commit-graph: add a slab to store topological levels
commit-graph: return 64-bit generation number
commit-graph: document generation number v2
commit-graph: implement corrected commit date
commit-graph: implement generation data chunk
commit-graph: use generation v2 only if entire chain does
commit-reach: use corrected commit dates in paint_down_to_common()
.../technical/commit-graph-format.txt | 28 +-
Documentation/technical/commit-graph.txt | 77 +++++-
commit-graph.c | 251 ++++++++++++++----
commit-graph.h | 15 +-
commit-reach.c | 38 +--
commit-reach.h | 2 +-
commit.c | 4 +-
commit.h | 5 +-
revision.c | 13 +-
t/README | 3 +
t/helper/test-read-graph.c | 4 +
t/t4216-log-bloom.sh | 4 +-
t/t5000-tar-tree.sh | 24 +-
t/t5318-commit-graph.sh | 79 +++++-
t/t5324-split-commit-graph.sh | 193 +++++++++++++-
t/t6404-recursive-merge.sh | 5 +-
t/t6600-test-reach.sh | 68 ++---
t/test-lib-functions.sh | 6 +
upload-pack.c | 2 +-
19 files changed, 667 insertions(+), 154 deletions(-)
base-commit: e6362826a0409539642a5738db61827e5978e2e4
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v7
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v7
Pull-Request: https://github.com/gitgitgadget/git/pull/676
Range-diff vs v6:
1: 4d8eb415578 = 1: 9ac331b63ee commit-graph: fix regression when computing Bloom filters
2: 05dcb862818 = 2: 90ca0a1fd69 revision: parse parent in indegree_walk_step()
3: dcb9891d819 = 3: b3040696d43 commit-graph: consolidate fill_commit_graph_info
4: 4fbdee7ac90 = 4: 085085a4330 t6600-test-reach: generalize *_three_modes
5: fbd8feb5d8c = 5: 3b1aae4106a commit-graph: add a slab to store topological levels
6: 855ff662a44 = 6: ea32cba16ef commit-graph: return 64-bit generation number
11: e571f03d8bd ! 7: 8647b5d2e38 doc: add corrected commit date info
@@ Metadata
Author: Abhishek Kumar <abhishekkumar8222@gmail.com>
## Commit message ##
- doc: add corrected commit date info
+ commit-graph: document generation number v2
- With generation data chunk and corrected commit dates implemented, let's
- update the technical documentation for commit-graph.
+ Git uses topological levels in the commit-graph file for commit-graph
+ traversal operations like 'git log --graph'. Unfortunately, topological
+ levels can perform worse than committer date when parents of a commit
+ differ greatly in generation numbers [1]. For example, 'git merge-base
+ v4.8 v4.9' on the Linux repository walks 635,579 commits using
+ topological levels and walks 167,468 using committer date. Since
+ 091f4cf3 (commit: don't use generation numbers if not needed,
+ 2018-08-30), 'git merge-base' uses committer date heuristic unless there
+ is a cutoff because of the performance hit.
+
+ [1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/
+
+ Thus, the need for generation number v2 was born. As Git used to die
+ when graph version understood by it and in the commit-graph file are
+ different [2], we needed a way to distinguish between the old and new
+ generation number without incrementing the graph version.
+
+ [2] https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
+
+ The following candidates were proposed (https://github.com/derrickstolee/gen-test,
+ https://github.com/abhishekkumar2718/git/pull/1):
+ - (Epoch, Date) Pairs.
+ - Maximum Generation Numbers.
+ - Corrected Commit Date.
+ - FELINE Index.
+ - Corrected Commit Date with Monotonically Increasing Offsets.
+
+ Based on performance, local computability, and immutability (along with
+ the introduction of an additional commit-graph chunk which relieved the
+ requirement of backwards-compatibility) Corrected Commit Date was chosen
+ as generation number v2 and is defined as follows:
+
+ For a commit C, let its corrected commit date be the maximum of the
+ commit date of C and the corrected commit dates of its parents plus 1.
+ Then corrected commit date offset is the difference between corrected
+ commit date of C and commit date of C. As a special case, a root commit
+ with the timestamp zero has corrected commit date of 1 to distinguish it
+ from GENERATION_NUMBER_ZERO (that is, an uncomputed generation number).
+
+ While it was proposed initially to store corrected commit date offsets
+ within Commit Data Chunk, storing the offsets in a new chunk did not
+ affect the performance measurably. The new chunk is "Generation DATa
+ (GDAT) chunk" and it stores corrected commit date offsets while CDAT
+ chunk stores topological level. The old versions of Git would ignore
+ GDAT chunk, using topological levels from CDAT chunk. In contrast, new
+ versions of Git would use corrected commit dates, falling back to
+ topological level if the generation data chunk is absent in the
+ commit-graph file.
+
+ While storing corrected commit date offsets saves us 4 bytes per commit
+ (as compared with storing corrected commit dates directly), it's however
+ possible for the offset to overflow the space allocated. To handle such
+ cases, we introduce a new chunk, _Generation Data Overflow_ (GDOV) that
+ stores the corrected commit date. For overflowing offsets, we set MSB
+ and store the position into the GDOV chunk, in a mechanism similar to
+ the Extra Edges list chunk.
+
+ For mixed generation number environment (for example new Git on the
+ command line, old Git used by GUI client), we can encounter a
+ mixed-chain commit-graph (a commit-graph chain where some of split
+ commit-graph files have GDAT chunk and others do not). As backward
+ compatibility is one of the goals, we can define the following behavior:
+
+ While reading a mixed-chain commit-graph version, we fall back on
+ topological levels as corrected commit dates and topological levels
+ cannot be compared directly.
+
+ When adding new layer to the split commit-graph file, and when merging
+ some or all layers (replacing them in the latter case), the new layer
+ will have GDAT chunk if and only if in the final result there would be
+ no layer without GDAT chunk just below it.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
7: 8fbe7486405 = 8: ec598f1d500 commit-graph: implement corrected commit date
8: 6d0696ae216 = 9: 71d81518857 commit-graph: implement generation data chunk
9: fba0d7f3dfe = 10: 07a88f1aae6 commit-graph: use generation v2 only if entire chain does
10: ba1f2c5555f = 11: 523e2d4a902 commit-reach: use corrected commit dates in paint_down_to_common()
--
gitgitgadget
^ permalink raw reply [flat|nested] 211+ messages in thread
* [PATCH v7 01/11] commit-graph: fix regression when computing Bloom filters
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
@ 2021-02-01 6:58 ` Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
` (10 subsequent siblings)
11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01 6:58 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Before computing Bloom filters, the commit-graph machinery uses
commit_gen_cmp to sort commits by generation order for improved diff
performance. 3d11275505 (commit-graph: examine commits by generation
number, 2020-03-30) claims that this sort can reduce the time spent to
compute Bloom filters by nearly half.
But since c49c82aa4c (commit: move members graph_pos, generation to a
slab, 2020-06-17), this optimization is broken, since asking for a
'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
while writing.
Not all hope is lost, though: 'commit_gen_cmp()' falls back to
comparing commits by their date when they have equal generation number,
and so, since c49c82aa4c, it is purely a date comparison function. This
heuristic is good enough that we don't seem to lose appreciable
performance while computing Bloom filters.
Applying this patch (compared with v2.30.0) speeds up computing Bloom
filters by 0.40% to 5.19% on various repositories [1].
So, avoid the useless 'commit_graph_generation()' calls while writing by
instead accessing the slab directly. This returns the newly-computed
generation numbers, and allows us to avoid the heuristic by directly
comparing generation numbers.
[1]: https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index f3486ec18f1..78de312ccec 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -139,13 +139,17 @@ static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
return data;
}
+/*
+ * Should be used only while writing commit-graph as it compares
+ * generation value of commits by directly accessing commit-slab.
+ */
static int commit_gen_cmp(const void *va, const void *vb)
{
const struct commit *a = *(const struct commit **)va;
const struct commit *b = *(const struct commit **)vb;
- uint32_t generation_a = commit_graph_generation(a);
- uint32_t generation_b = commit_graph_generation(b);
+ uint32_t generation_a = commit_graph_data_at(a)->generation;
+ uint32_t generation_b = commit_graph_data_at(b)->generation;
/* lower generation commits first */
if (generation_a < generation_b)
return -1;
--
gitgitgadget
^ permalink raw reply related [flat|nested] 211+ messages in thread
* [PATCH v7 02/11] revision: parse parent in indegree_walk_step()
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
@ 2021-02-01 6:58 ` Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
` (9 subsequent siblings)
11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01 6:58 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In indegree_walk_step(), we add unvisited parents to the indegree queue.
However, parents are not guaranteed to be parsed. As the indegree queue
sorts by generation number, let's parse parents before inserting them to
ensure the correct priority order.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
revision.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/revision.c b/revision.c
index 0b5c7231401..5474001331a 100644
--- a/revision.c
+++ b/revision.c
@@ -3399,6 +3399,9 @@ static void indegree_walk_step(struct rev_info *revs)
struct commit *parent = p->item;
int *pi = indegree_slab_at(&info->indegree, parent);
+ if (repo_parse_commit_gently(revs->repo, parent, 1) < 0)
+ return;
+
if (*pi)
(*pi)++;
else
--
gitgitgadget
^ permalink raw reply related [flat|nested] 211+ messages in thread
* [PATCH v7 03/11] commit-graph: consolidate fill_commit_graph_info
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 01/11] commit-graph: fix regression when computing Bloom filters Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 02/11] revision: parse parent in indegree_walk_step() Abhishek Kumar via GitGitGadget
@ 2021-02-01 6:58 ` Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
` (8 subsequent siblings)
11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01 6:58 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().
fill_commit_graph_info() used to not load committer date from commit data
chunk. However, with the upcoming switch to using corrected committer
date as generation number v2, we will have to load committer date to
compute generation number value anyway.
e51217e15 (t5000: test tar files that overflow ustar headers,
2016-06-30) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2^36 + 1) seconds since
EPOCH. The CDAT chunk provides 34 bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.
The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and load committer date from the object
database, generating a tar file with the expected mtime.
However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to the lower 34 bits) to populate the
generation number. Thus, Git sets date and generates tar file with the
truncated mtime.
The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.
As the CDAT chunk is overflowed by a date that needs more than 12 octal
digits but not by one that needs more than 11, we split the existing
tests to test both implementations separately and add a new explicit
test for the 11-digit implementation.
To test the 11-octal-digit implementation, we create a future commit
with a committer date of 2^34 - 1, which overflows 11 octal digits
without overflowing the 34 bits of the Commit Data chunk.
To test the 12-octal digit implementation, the smallest committer date
possible is 2^36 + 1, which overflows the CDAT chunk and thus
commit-graph must be disabled for the test.
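
The arithmetic behind the two test dates can be checked with a short
sketch. The helpers octal_digits() and fits_in_bits() are hypothetical
names introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of octal digits needed to print `value` (at least one). */
static int octal_digits(uint64_t value)
{
	int digits = 1;

	while (value >>= 3)
		digits++;
	return digits;
}

/* Non-zero if `value` can be stored in `bits` bits (0 <= bits < 64). */
static int fits_in_bits(uint64_t value, int bits)
{
	assert(bits >= 0 && bits < 64);
	return (value >> bits) == 0;
}
```

For 2^34 - 1 (17179869183), octal_digits() gives 12 and the value fits
in 34 bits, so it overflows an 11-digit ustar field while still being
representable in the CDAT chunk. For 2^36 + 1 (68719476737),
octal_digits() gives 13 and the value does not fit in 34 bits.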
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 27 ++++++++++-----------------
t/t5000-tar-tree.sh | 24 +++++++++++++++++++++---
2 files changed, 31 insertions(+), 20 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 78de312ccec..955418bd6e5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -753,15 +753,24 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
const unsigned char *commit_data;
struct commit_graph_data *graph_data;
uint32_t lex_index;
+ uint64_t date_high, date_low;
while (pos < g->num_commits_in_base)
g = g->base_graph;
+ if (pos >= g->num_commits + g->num_commits_in_base)
+ die(_("invalid commit position. commit-graph is likely corrupt"));
+
lex_index = pos - g->num_commits_in_base;
commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * lex_index;
graph_data = commit_graph_data_at(item);
graph_data->graph_pos = pos;
+
+ date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
+ date_low = get_be32(commit_data + g->hash_len + 12);
+ item->date = (timestamp_t)((date_high << 32) | date_low);
+
graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
}
@@ -776,38 +785,22 @@ static int fill_commit_in_graph(struct repository *r,
{
uint32_t edge_value;
uint32_t *parent_data_ptr;
- uint64_t date_low, date_high;
struct commit_list **pptr;
- struct commit_graph_data *graph_data;
const unsigned char *commit_data;
uint32_t lex_index;
while (pos < g->num_commits_in_base)
g = g->base_graph;
- if (pos >= g->num_commits + g->num_commits_in_base)
- die(_("invalid commit position. commit-graph is likely corrupt"));
+ fill_commit_graph_info(item, g, pos);
- /*
- * Store the "full" position, but then use the
- * "local" position for the rest of the calculation.
- */
- graph_data = commit_graph_data_at(item);
- graph_data->graph_pos = pos;
lex_index = pos - g->num_commits_in_base;
-
commit_data = g->chunk_commit_data + (g->hash_len + 16) * lex_index;
item->object.parsed = 1;
set_commit_tree(item, NULL);
- date_high = get_be32(commit_data + g->hash_len + 8) & 0x3;
- date_low = get_be32(commit_data + g->hash_len + 12);
- item->date = (timestamp_t)((date_high << 32) | date_low);
-
- graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
-
pptr = &item->parents;
edge_value = get_be32(commit_data + g->hash_len);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 3ebb0d3b652..7204799a0b5 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -431,15 +431,33 @@ test_expect_success TAR_HUGE,LONG_IS_64BIT 'system tar can read our huge size' '
test_cmp expect actual
'
-test_expect_success TIME_IS_64BIT 'set up repository with far-future commit' '
+test_expect_success TIME_IS_64BIT 'set up repository with far-future (2^34 - 1) commit' '
+ rm -f .git/index &&
+ echo foo >file &&
+ git add file &&
+ GIT_COMMITTER_DATE="@17179869183 +0000" \
+ git commit -m "tempori parendum"
+'
+
+test_expect_success TIME_IS_64BIT 'generate tar with far-future mtime' '
+ git archive HEAD >future.tar
+'
+
+test_expect_success TAR_HUGE,TIME_IS_64BIT,TIME_T_IS_64BIT 'system tar can read our future mtime' '
+ echo 2514 >expect &&
+ tar_info future.tar | cut -d" " -f2 >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success TIME_IS_64BIT 'set up repository with far-far-future (2^36 + 1) commit' '
rm -f .git/index &&
echo content >file &&
git add file &&
- GIT_COMMITTER_DATE="@68719476737 +0000" \
+ GIT_TEST_COMMIT_GRAPH=0 GIT_COMMITTER_DATE="@68719476737 +0000" \
git commit -m "tempori parendum"
'
-test_expect_success TIME_IS_64BIT 'generate tar with future mtime' '
+test_expect_success TIME_IS_64BIT 'generate tar with far-far-future mtime' '
git archive HEAD >future.tar
'
--
gitgitgadget
^ permalink raw reply related [flat|nested] 211+ messages in thread
* [PATCH v7 04/11] t6600-test-reach: generalize *_three_modes
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
` (2 preceding siblings ...)
2021-02-01 6:58 ` [PATCH v7 03/11] commit-graph: consolidate fill_commit_graph_info Abhishek Kumar via GitGitGadget
@ 2021-02-01 6:58 ` Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
` (7 subsequent siblings)
11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01 6:58 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In a preparatory step to implement generation number v2, we add tests to
ensure Git can read and parse commit-graph files without the Generation
Data chunk. These files represent commit-graph files written by Old Git
and are necessary for backward compatibility.
We extend run_three_modes() and test_three_modes() to *_all_modes() with
the fourth mode being "commit-graph without generation data chunk".
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
t/t6600-test-reach.sh | 62 +++++++++++++++++++++----------------------
1 file changed, 31 insertions(+), 31 deletions(-)
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index f807276337d..af10f0dc090 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -58,7 +58,7 @@ test_expect_success 'setup' '
git config core.commitGraph true
'
-run_three_modes () {
+run_all_modes () {
test_when_finished rm -rf .git/objects/info/commit-graph &&
"$@" <input >actual &&
test_cmp expect actual &&
@@ -70,8 +70,8 @@ run_three_modes () {
test_cmp expect actual
}
-test_three_modes () {
- run_three_modes test-tool reach "$@"
+test_all_modes () {
+ run_all_modes test-tool reach "$@"
}
test_expect_success 'ref_newer:miss' '
@@ -80,7 +80,7 @@ test_expect_success 'ref_newer:miss' '
B:commit-4-9
EOF
echo "ref_newer(A,B):0" >expect &&
- test_three_modes ref_newer
+ test_all_modes ref_newer
'
test_expect_success 'ref_newer:hit' '
@@ -89,7 +89,7 @@ test_expect_success 'ref_newer:hit' '
B:commit-2-3
EOF
echo "ref_newer(A,B):1" >expect &&
- test_three_modes ref_newer
+ test_all_modes ref_newer
'
test_expect_success 'in_merge_bases:hit' '
@@ -98,7 +98,7 @@ test_expect_success 'in_merge_bases:hit' '
B:commit-8-8
EOF
echo "in_merge_bases(A,B):1" >expect &&
- test_three_modes in_merge_bases
+ test_all_modes in_merge_bases
'
test_expect_success 'in_merge_bases:miss' '
@@ -107,7 +107,7 @@ test_expect_success 'in_merge_bases:miss' '
B:commit-5-9
EOF
echo "in_merge_bases(A,B):0" >expect &&
- test_three_modes in_merge_bases
+ test_all_modes in_merge_bases
'
test_expect_success 'in_merge_bases_many:hit' '
@@ -117,7 +117,7 @@ test_expect_success 'in_merge_bases_many:hit' '
X:commit-5-7
EOF
echo "in_merge_bases_many(A,X):1" >expect &&
- test_three_modes in_merge_bases_many
+ test_all_modes in_merge_bases_many
'
test_expect_success 'in_merge_bases_many:miss' '
@@ -127,7 +127,7 @@ test_expect_success 'in_merge_bases_many:miss' '
X:commit-8-6
EOF
echo "in_merge_bases_many(A,X):0" >expect &&
- test_three_modes in_merge_bases_many
+ test_all_modes in_merge_bases_many
'
test_expect_success 'in_merge_bases_many:miss-heuristic' '
@@ -137,7 +137,7 @@ test_expect_success 'in_merge_bases_many:miss-heuristic' '
X:commit-6-6
EOF
echo "in_merge_bases_many(A,X):0" >expect &&
- test_three_modes in_merge_bases_many
+ test_all_modes in_merge_bases_many
'
test_expect_success 'is_descendant_of:hit' '
@@ -148,7 +148,7 @@ test_expect_success 'is_descendant_of:hit' '
X:commit-1-1
EOF
echo "is_descendant_of(A,X):1" >expect &&
- test_three_modes is_descendant_of
+ test_all_modes is_descendant_of
'
test_expect_success 'is_descendant_of:miss' '
@@ -159,7 +159,7 @@ test_expect_success 'is_descendant_of:miss' '
X:commit-7-6
EOF
echo "is_descendant_of(A,X):0" >expect &&
- test_three_modes is_descendant_of
+ test_all_modes is_descendant_of
'
test_expect_success 'get_merge_bases_many' '
@@ -174,7 +174,7 @@ test_expect_success 'get_merge_bases_many' '
git rev-parse commit-5-6 \
commit-4-7 | sort
} >expect &&
- test_three_modes get_merge_bases_many
+ test_all_modes get_merge_bases_many
'
test_expect_success 'reduce_heads' '
@@ -196,7 +196,7 @@ test_expect_success 'reduce_heads' '
commit-2-8 \
commit-1-10 | sort
} >expect &&
- test_three_modes reduce_heads
+ test_all_modes reduce_heads
'
test_expect_success 'can_all_from_reach:hit' '
@@ -219,7 +219,7 @@ test_expect_success 'can_all_from_reach:hit' '
Y:commit-8-1
EOF
echo "can_all_from_reach(X,Y):1" >expect &&
- test_three_modes can_all_from_reach
+ test_all_modes can_all_from_reach
'
test_expect_success 'can_all_from_reach:miss' '
@@ -241,7 +241,7 @@ test_expect_success 'can_all_from_reach:miss' '
Y:commit-8-5
EOF
echo "can_all_from_reach(X,Y):0" >expect &&
- test_three_modes can_all_from_reach
+ test_all_modes can_all_from_reach
'
test_expect_success 'can_all_from_reach_with_flag: tags case' '
@@ -264,7 +264,7 @@ test_expect_success 'can_all_from_reach_with_flag: tags case' '
Y:commit-8-1
EOF
echo "can_all_from_reach_with_flag(X,_,_,0,0):1" >expect &&
- test_three_modes can_all_from_reach_with_flag
+ test_all_modes can_all_from_reach_with_flag
'
test_expect_success 'commit_contains:hit' '
@@ -280,8 +280,8 @@ test_expect_success 'commit_contains:hit' '
X:commit-9-3
EOF
echo "commit_contains(_,A,X,_):1" >expect &&
- test_three_modes commit_contains &&
- test_three_modes commit_contains --tag
+ test_all_modes commit_contains &&
+ test_all_modes commit_contains --tag
'
test_expect_success 'commit_contains:miss' '
@@ -297,8 +297,8 @@ test_expect_success 'commit_contains:miss' '
X:commit-9-3
EOF
echo "commit_contains(_,A,X,_):0" >expect &&
- test_three_modes commit_contains &&
- test_three_modes commit_contains --tag
+ test_all_modes commit_contains &&
+ test_all_modes commit_contains --tag
'
test_expect_success 'rev-list: basic topo-order' '
@@ -310,7 +310,7 @@ test_expect_success 'rev-list: basic topo-order' '
commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>expect &&
- run_three_modes git rev-list --topo-order commit-6-6
+ run_all_modes git rev-list --topo-order commit-6-6
'
test_expect_success 'rev-list: first-parent topo-order' '
@@ -322,7 +322,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
commit-6-2 \
commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>expect &&
- run_three_modes git rev-list --first-parent --topo-order commit-6-6
+ run_all_modes git rev-list --first-parent --topo-order commit-6-6
'
test_expect_success 'rev-list: range topo-order' '
@@ -334,7 +334,7 @@ test_expect_success 'rev-list: range topo-order' '
commit-6-2 commit-5-2 commit-4-2 \
commit-6-1 commit-5-1 commit-4-1 \
>expect &&
- run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+ run_all_modes git rev-list --topo-order commit-3-3..commit-6-6
'
test_expect_success 'rev-list: range topo-order' '
@@ -346,7 +346,7 @@ test_expect_success 'rev-list: range topo-order' '
commit-6-2 commit-5-2 commit-4-2 \
commit-6-1 commit-5-1 commit-4-1 \
>expect &&
- run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+ run_all_modes git rev-list --topo-order commit-3-8..commit-6-6
'
test_expect_success 'rev-list: first-parent range topo-order' '
@@ -358,7 +358,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
commit-6-2 \
commit-6-1 commit-5-1 commit-4-1 \
>expect &&
- run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+ run_all_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
'
test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -368,7 +368,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
commit-6-3 commit-5-3 commit-4-3 \
>expect &&
- run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+ run_all_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
'
test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -382,7 +382,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
commit-3-8 commit-2-8 commit-1-8 \
commit-3-7 commit-2-7 commit-1-7 \
>expect &&
- run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+ run_all_modes git rev-list --topo-order commit-3-8...commit-6-6
'
test_expect_success 'get_reachable_subset:all' '
@@ -402,7 +402,7 @@ test_expect_success 'get_reachable_subset:all' '
commit-1-7 \
commit-5-6 | sort
) >expect &&
- test_three_modes get_reachable_subset
+ test_all_modes get_reachable_subset
'
test_expect_success 'get_reachable_subset:some' '
@@ -420,7 +420,7 @@ test_expect_success 'get_reachable_subset:some' '
git rev-parse commit-3-3 \
commit-1-7 | sort
) >expect &&
- test_three_modes get_reachable_subset
+ test_all_modes get_reachable_subset
'
test_expect_success 'get_reachable_subset:none' '
@@ -434,7 +434,7 @@ test_expect_success 'get_reachable_subset:none' '
Y:commit-2-8
EOF
echo "get_reachable_subset(X,Y)" >expect &&
- test_three_modes get_reachable_subset
+ test_all_modes get_reachable_subset
'
test_done
--
gitgitgadget
^ permalink raw reply related [flat|nested] 211+ messages in thread
* [PATCH v7 05/11] commit-graph: add a slab to store topological levels
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
` (3 preceding siblings ...)
2021-02-01 6:58 ` [PATCH v7 04/11] t6600-test-reach: generalize *_three_modes Abhishek Kumar via GitGitGadget
@ 2021-02-01 6:58 ` Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
` (6 subsequent siblings)
11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01 6:58 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In a later commit we will introduce corrected commit date as the
generation number v2. Corrected commit dates will be stored in the new
separate Generation Data chunk. However, to ensure backwards
compatibility with "Old" Git we need to continue to write generation
number v1 (topological levels) to the commit data chunk. Thus, we need
to compute and store both versions of generation numbers to write the
commit-graph file.
Therefore, let's introduce a commit-slab `topo_level_slab` to store
topological levels; corrected commit date will be stored in the member
`generation` of struct commit_graph_data.
The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO`
mark commits not in the commit-graph file and commits written by a
version of Git that did not compute generation numbers respectively.
Generation numbers are computed identically for both kinds of commits.
A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit
is not in the commit-graph file. However, since the slab is
zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`).
Thus, we no longer need to check if the topological level of a commit is
`GENERATION_NUMBER_INFINITY`.
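
The slab-miss reasoning above can be illustrated with a toy slab. This
is only a sketch under stated assumptions: struct toy_level_slab and
the toy_* helpers are hypothetical and index commits by a plain array
position, whereas the real commit-slab is keyed by struct commit.
LEVEL_ZERO and LEVEL_V1_MAX stand in for GENERATION_NUMBER_ZERO and
GENERATION_NUMBER_MAX.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LEVEL_ZERO 0u            /* stands in for GENERATION_NUMBER_ZERO */
#define LEVEL_V1_MAX 0x3FFFFFFFu /* stands in for GENERATION_NUMBER_MAX */

/* Toy slab: one zero-initialized uint32_t per commit index. */
struct toy_level_slab {
	uint32_t *levels;
	size_t nr;
};

/* Returns 0 on success, -1 if the allocation failed. */
static int toy_slab_init(struct toy_level_slab *s, size_t nr)
{
	s->levels = calloc(nr, sizeof(*s->levels));
	s->nr = s->levels ? nr : 0;
	return s->levels ? 0 : -1;
}

static void toy_slab_clear(struct toy_level_slab *s)
{
	free(s->levels);
	s->levels = NULL;
	s->nr = 0;
}

/* A never-written slot reads as zero, so one comparison is enough. */
static int toy_level_missing(const struct toy_level_slab *s, size_t commit)
{
	assert(commit < s->nr);
	return s->levels[commit] == LEVEL_ZERO;
}

/* One more than the highest parent level, clamped to LEVEL_V1_MAX. */
static uint32_t toy_level_from_parents(uint32_t max_parent_level)
{
	if (max_parent_level > LEVEL_V1_MAX - 1)
		max_parent_level = LEVEL_V1_MAX - 1;
	return max_parent_level + 1;
}
```

Because a valid level is always at least 1, a stored zero can only mean
"not computed yet", which is why the INFINITY check can be dropped.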
We will add a pointer to the slab in `struct write_commit_graph_context`
and `struct commit_graph` to populate the slab in
`fill_commit_graph_info` if the commit has a pre-computed topological
level, as in the case of split commit-graphs.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 45 ++++++++++++++++++++++++++++++---------------
commit-graph.h | 1 +
2 files changed, 31 insertions(+), 15 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 955418bd6e5..2f344cce151 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -64,6 +64,8 @@ void git_test_write_commit_graph_or_die(void)
/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)
+define_commit_slab(topo_level_slab, uint32_t);
+
/* Keep track of the order in which commits are added to our list. */
define_commit_slab(commit_pos, int);
static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
@@ -772,6 +774,9 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
item->date = (timestamp_t)((date_high << 32) | date_low);
graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
+ if (g->topo_levels)
+ *topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
}
static inline void set_commit_tree(struct commit *c, struct tree *t)
@@ -960,6 +965,7 @@ struct write_commit_graph_context {
changed_paths:1,
order_by_pack:1;
+ struct topo_level_slab *topo_levels;
const struct commit_graph_opts *opts;
size_t total_bloom_filter_data_size;
const struct bloom_filter_settings *bloom_settings;
@@ -1106,7 +1112,7 @@ static int write_graph_chunk_data(struct hashfile *f,
else
packedDate[0] = 0;
- packedDate[0] |= htonl(commit_graph_data_at(*list)->generation << 2);
+ packedDate[0] |= htonl(*topo_level_slab_at(ctx->topo_levels, *list) << 2);
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);
@@ -1336,11 +1342,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
_("Computing commit graph generation numbers"),
ctx->commits.nr);
for (i = 0; i < ctx->commits.nr; i++) {
- uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
+ uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
display_progress(ctx->progress, i + 1);
- if (generation != GENERATION_NUMBER_INFINITY &&
- generation != GENERATION_NUMBER_ZERO)
+ if (level != GENERATION_NUMBER_ZERO)
continue;
commit_list_insert(ctx->commits.list[i], &list);
@@ -1348,29 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
struct commit *current = list->item;
struct commit_list *parent;
int all_parents_computed = 1;
- uint32_t max_generation = 0;
+ uint32_t max_level = 0;
for (parent = current->parents; parent; parent = parent->next) {
- generation = commit_graph_data_at(parent->item)->generation;
+ level = *topo_level_slab_at(ctx->topo_levels, parent->item);
- if (generation == GENERATION_NUMBER_INFINITY ||
- generation == GENERATION_NUMBER_ZERO) {
+ if (level == GENERATION_NUMBER_ZERO) {
all_parents_computed = 0;
commit_list_insert(parent->item, &list);
break;
- } else if (generation > max_generation) {
- max_generation = generation;
+ } else if (level > max_level) {
+ max_level = level;
}
}
if (all_parents_computed) {
- struct commit_graph_data *data = commit_graph_data_at(current);
-
- data->generation = max_generation + 1;
pop_commit(&list);
- if (data->generation > GENERATION_NUMBER_MAX)
- data->generation = GENERATION_NUMBER_MAX;
+ if (max_level > GENERATION_NUMBER_MAX - 1)
+ max_level = GENERATION_NUMBER_MAX - 1;
+ *topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
}
}
}
@@ -2106,6 +2108,7 @@ int write_commit_graph(struct object_directory *odb,
int res = 0;
int replace = 0;
struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+ struct topo_level_slab topo_levels;
prepare_repo_settings(the_repository);
if (!the_repository->settings.core_commit_graph) {
@@ -2132,6 +2135,18 @@ int write_commit_graph(struct object_directory *odb,
bloom_settings.max_changed_paths);
ctx->bloom_settings = &bloom_settings;
+ init_topo_level_slab(&topo_levels);
+ ctx->topo_levels = &topo_levels;
+
+ if (ctx->r->objects->commit_graph) {
+ struct commit_graph *g = ctx->r->objects->commit_graph;
+
+ while (g) {
+ g->topo_levels = &topo_levels;
+ g = g->base_graph;
+ }
+ }
+
if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
ctx->changed_paths = 1;
if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/commit-graph.h b/commit-graph.h
index f8e92500c6e..00f00745b79 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -73,6 +73,7 @@ struct commit_graph {
const unsigned char *chunk_bloom_indexes;
const unsigned char *chunk_bloom_data;
+ struct topo_level_slab *topo_levels;
struct bloom_filter_settings *bloom_filter_settings;
};
--
gitgitgadget
^ permalink raw reply related [flat|nested] 211+ messages in thread
* [PATCH v7 06/11] commit-graph: return 64-bit generation number
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
` (4 preceding siblings ...)
2021-02-01 6:58 ` [PATCH v7 05/11] commit-graph: add a slab to store topological levels Abhishek Kumar via GitGitGadget
@ 2021-02-01 6:58 ` Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 07/11] commit-graph: document generation number v2 Abhishek Kumar via GitGitGadget
` (5 subsequent siblings)
11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01 6:58 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.
We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.
With corrected commit dates implemented, we will have two such *_MAX
macros to denote the largest offset and the largest topological level
that can be stored.
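
Why every local variable has to be widened along with the return type
can be seen in a small sketch. It assumes a 64-bit unsigned
timestamp type; toy_timestamp_t and the toy_* helpers are hypothetical
names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the timestamp type is a 64-bit unsigned integer. */
typedef uint64_t toy_timestamp_t;

#define TOY_GENERATION_INFINITY ((UINT64_C(1) << 63) - 1)
#define TOY_GENERATION_V1_MAX 0x3FFFFFFF

/* Widen a topological level to the new 64-bit generation type. */
static toy_timestamp_t toy_level_to_generation(uint32_t level)
{
	assert(level <= TOY_GENERATION_V1_MAX);
	return level;
}

/* Three-way comparison in the style of commit_gen_cmp(). */
static int toy_gen_cmp(toy_timestamp_t a, toy_timestamp_t b)
{
	if (a < b)
		return -1;
	if (a > b)
		return 1;
	return 0;
}

/* What a leftover uint32_t local would silently do to a generation. */
static uint32_t toy_truncate(toy_timestamp_t generation)
{
	return (uint32_t)generation;
}
```

A generation of 2^32 + 5 compares greater than 6 as a 64-bit value, but
after toy_truncate() it becomes 5 and compares less than 6, so a single
forgotten uint32_t could reverse the order of two commits.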
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 22 +++++++++++-----------
commit-graph.h | 4 ++--
commit-reach.c | 36 ++++++++++++++++++------------------
commit-reach.h | 2 +-
commit.c | 4 ++--
commit.h | 4 ++--
revision.c | 10 +++++-----
upload-pack.c | 2 +-
8 files changed, 42 insertions(+), 42 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 2f344cce151..8f17815021d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -101,7 +101,7 @@ uint32_t commit_graph_position(const struct commit *c)
return data ? data->graph_pos : COMMIT_NOT_FROM_GRAPH;
}
-uint32_t commit_graph_generation(const struct commit *c)
+timestamp_t commit_graph_generation(const struct commit *c)
{
struct commit_graph_data *data =
commit_graph_data_slab_peek(&commit_graph_data_slab, c);
@@ -150,8 +150,8 @@ static int commit_gen_cmp(const void *va, const void *vb)
const struct commit *a = *(const struct commit **)va;
const struct commit *b = *(const struct commit **)vb;
- uint32_t generation_a = commit_graph_data_at(a)->generation;
- uint32_t generation_b = commit_graph_data_at(b)->generation;
+ const timestamp_t generation_a = commit_graph_data_at(a)->generation;
+ const timestamp_t generation_b = commit_graph_data_at(b)->generation;
/* lower generation commits first */
if (generation_a < generation_b)
return -1;
@@ -1370,8 +1370,8 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (all_parents_computed) {
pop_commit(&list);
- if (max_level > GENERATION_NUMBER_MAX - 1)
- max_level = GENERATION_NUMBER_MAX - 1;
+ if (max_level > GENERATION_NUMBER_V1_MAX - 1)
+ max_level = GENERATION_NUMBER_V1_MAX - 1;
*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
}
}
@@ -2367,8 +2367,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
for (i = 0; i < g->num_commits; i++) {
struct commit *graph_commit, *odb_commit;
struct commit_list *graph_parents, *odb_parents;
- uint32_t max_generation = 0;
- uint32_t generation;
+ timestamp_t max_generation = 0;
+ timestamp_t generation;
display_progress(progress, i + 1);
hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2432,16 +2432,16 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
continue;
/*
- * If one of our parents has generation GENERATION_NUMBER_MAX, then
- * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
+ * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
+ * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
* extra logic in the following condition.
*/
- if (max_generation == GENERATION_NUMBER_MAX)
+ if (max_generation == GENERATION_NUMBER_V1_MAX)
max_generation--;
generation = commit_graph_generation(graph_commit);
if (generation != max_generation + 1)
- graph_report(_("commit-graph generation for commit %s is %u != %u"),
+ graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
oid_to_hex(&cur_oid),
generation,
max_generation + 1);
diff --git a/commit-graph.h b/commit-graph.h
index 00f00745b79..2e9aa7824ee 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -145,12 +145,12 @@ void disable_commit_graph(struct repository *r);
struct commit_graph_data {
uint32_t graph_pos;
- uint32_t generation;
+ timestamp_t generation;
};
/*
* Commits should be parsed before accessing generation, graph positions.
*/
-uint32_t commit_graph_generation(const struct commit *);
+timestamp_t commit_graph_generation(const struct commit *);
uint32_t commit_graph_position(const struct commit *);
#endif
diff --git a/commit-reach.c b/commit-reach.c
index 50175b159e7..9b24b0378d5 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -32,12 +32,12 @@ static int queue_has_nonstale(struct prio_queue *queue)
static struct commit_list *paint_down_to_common(struct repository *r,
struct commit *one, int n,
struct commit **twos,
- int min_generation)
+ timestamp_t min_generation)
{
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
- uint32_t last_gen = GENERATION_NUMBER_INFINITY;
+ timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
if (!min_generation)
queue.compare = compare_commits_by_commit_date;
@@ -58,10 +58,10 @@ static struct commit_list *paint_down_to_common(struct repository *r,
struct commit *commit = prio_queue_get(&queue);
struct commit_list *parents;
int flags;
- uint32_t generation = commit_graph_generation(commit);
+ timestamp_t generation = commit_graph_generation(commit);
if (min_generation && generation > last_gen)
- BUG("bad generation skip %8x > %8x at %s",
+ BUG("bad generation skip %"PRItime" > %"PRItime" at %s",
generation, last_gen,
oid_to_hex(&commit->object.oid));
last_gen = generation;
@@ -177,12 +177,12 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
repo_parse_commit(r, array[i]);
for (i = 0; i < cnt; i++) {
struct commit_list *common;
- uint32_t min_generation = commit_graph_generation(array[i]);
+ timestamp_t min_generation = commit_graph_generation(array[i]);
if (redundant[i])
continue;
for (j = filled = 0; j < cnt; j++) {
- uint32_t curr_generation;
+ timestamp_t curr_generation;
if (i == j || redundant[j])
continue;
filled_index[filled] = j;
@@ -321,7 +321,7 @@ int repo_in_merge_bases_many(struct repository *r, struct commit *commit,
{
struct commit_list *bases;
int ret = 0, i;
- uint32_t generation, max_generation = GENERATION_NUMBER_ZERO;
+ timestamp_t generation, max_generation = GENERATION_NUMBER_ZERO;
if (repo_parse_commit(r, commit))
return ret;
@@ -470,7 +470,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
static enum contains_result contains_test(struct commit *candidate,
const struct commit_list *want,
struct contains_cache *cache,
- uint32_t cutoff)
+ timestamp_t cutoff)
{
enum contains_result *cached = contains_cache_at(cache, candidate);
@@ -506,11 +506,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
{
struct contains_stack contains_stack = { 0, 0, NULL };
enum contains_result result;
- uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+ timestamp_t cutoff = GENERATION_NUMBER_INFINITY;
const struct commit_list *p;
for (p = want; p; p = p->next) {
- uint32_t generation;
+ timestamp_t generation;
struct commit *c = p->item;
load_commit_graph_info(the_repository, c);
generation = commit_graph_generation(c);
@@ -566,8 +566,8 @@ static int compare_commits_by_gen(const void *_a, const void *_b)
const struct commit *a = *(const struct commit * const *)_a;
const struct commit *b = *(const struct commit * const *)_b;
- uint32_t generation_a = commit_graph_generation(a);
- uint32_t generation_b = commit_graph_generation(b);
+ timestamp_t generation_a = commit_graph_generation(a);
+ timestamp_t generation_b = commit_graph_generation(b);
if (generation_a < generation_b)
return -1;
@@ -580,7 +580,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
unsigned int with_flag,
unsigned int assign_flag,
time_t min_commit_date,
- uint32_t min_generation)
+ timestamp_t min_generation)
{
struct commit **list = NULL;
int i;
@@ -681,13 +681,13 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
time_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
struct commit_list *from_iter = from, *to_iter = to;
int result;
- uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+ timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
while (from_iter) {
add_object_array(&from_iter->item->object, NULL, &from_objs);
if (!parse_commit(from_iter->item)) {
- uint32_t generation;
+ timestamp_t generation;
if (from_iter->item->date < min_commit_date)
min_commit_date = from_iter->item->date;
@@ -701,7 +701,7 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
while (to_iter) {
if (!parse_commit(to_iter->item)) {
- uint32_t generation;
+ timestamp_t generation;
if (to_iter->item->date < min_commit_date)
min_commit_date = to_iter->item->date;
@@ -741,13 +741,13 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
struct commit_list *found_commits = NULL;
struct commit **to_last = to + nr_to;
struct commit **from_last = from + nr_from;
- uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+ timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
int num_to_find = 0;
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
for (item = to; item < to_last; item++) {
- uint32_t generation;
+ timestamp_t generation;
struct commit *c = *item;
parse_commit(c);
diff --git a/commit-reach.h b/commit-reach.h
index b49ad71a317..148b56fea50 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -87,7 +87,7 @@ int can_all_from_reach_with_flag(struct object_array *from,
unsigned int with_flag,
unsigned int assign_flag,
time_t min_commit_date,
- uint32_t min_generation);
+ timestamp_t min_generation);
int can_all_from_reach(struct commit_list *from, struct commit_list *to,
int commit_date_cutoff);
diff --git a/commit.c b/commit.c
index bab8d5ab07c..4c717329ee0 100644
--- a/commit.c
+++ b/commit.c
@@ -753,8 +753,8 @@ int compare_commits_by_author_date(const void *a_, const void *b_,
int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)
{
const struct commit *a = a_, *b = b_;
- const uint32_t generation_a = commit_graph_generation(a),
- generation_b = commit_graph_generation(b);
+ const timestamp_t generation_a = commit_graph_generation(a),
+ generation_b = commit_graph_generation(b);
/* newer commits first */
if (generation_a < generation_b)
diff --git a/commit.h b/commit.h
index f4e7b0158e2..742d96c41e8 100644
--- a/commit.h
+++ b/commit.h
@@ -11,8 +11,8 @@
#include "commit-slab.h"
#define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
-#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
-#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
+#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0
struct commit_list {
diff --git a/revision.c b/revision.c
index 5474001331a..a54d2bd28df 100644
--- a/revision.c
+++ b/revision.c
@@ -3302,7 +3302,7 @@ define_commit_slab(indegree_slab, int);
define_commit_slab(author_date_slab, timestamp_t);
struct topo_walk_info {
- uint32_t min_generation;
+ timestamp_t min_generation;
struct prio_queue explore_queue;
struct prio_queue indegree_queue;
struct prio_queue topo_queue;
@@ -3370,7 +3370,7 @@ static void explore_walk_step(struct rev_info *revs)
}
static void explore_to_depth(struct rev_info *revs,
- uint32_t gen_cutoff)
+ timestamp_t gen_cutoff)
{
struct topo_walk_info *info = revs->topo_walk_info;
struct commit *c;
@@ -3415,7 +3415,7 @@ static void indegree_walk_step(struct rev_info *revs)
}
static void compute_indegrees_to_depth(struct rev_info *revs,
- uint32_t gen_cutoff)
+ timestamp_t gen_cutoff)
{
struct topo_walk_info *info = revs->topo_walk_info;
struct commit *c;
@@ -3473,7 +3473,7 @@ static void init_topo_walk(struct rev_info *revs)
info->min_generation = GENERATION_NUMBER_INFINITY;
for (list = revs->commits; list; list = list->next) {
struct commit *c = list->item;
- uint32_t generation;
+ timestamp_t generation;
if (repo_parse_commit_gently(revs->repo, c, 1))
continue;
@@ -3541,7 +3541,7 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
for (p = commit->parents; p; p = p->next) {
struct commit *parent = p->item;
int *pi;
- uint32_t generation;
+ timestamp_t generation;
if (parent->object.flags & UNINTERESTING)
continue;
diff --git a/upload-pack.c b/upload-pack.c
index 3b66bf92ba8..b87607e0dd4 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -500,7 +500,7 @@ static int got_oid(struct upload_pack_data *data,
static int ok_to_give_up(struct upload_pack_data *data)
{
- uint32_t min_generation = GENERATION_NUMBER_ZERO;
+ timestamp_t min_generation = GENERATION_NUMBER_ZERO;
if (!data->have_obj.nr)
return 0;
--
gitgitgadget
* [PATCH v7 07/11] commit-graph: document generation number v2
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
` (5 preceding siblings ...)
2021-02-01 6:58 ` [PATCH v7 06/11] commit-graph: return 64-bit generation number Abhishek Kumar via GitGitGadget
@ 2021-02-01 6:58 ` Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 08/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
` (4 subsequent siblings)
11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01 6:58 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Git uses topological levels in the commit-graph file for commit-graph
traversal operations like 'git log --graph'. Unfortunately, topological
levels can perform worse than committer date when parents of a commit
differ greatly in generation numbers [1]. For example, 'git merge-base
v4.8 v4.9' on the Linux repository walks 635,579 commits using
topological levels and walks 167,468 using committer date. Because of
this performance hit, since 091f4cf3 (commit: don't use generation
numbers if not needed, 2018-08-30), 'git merge-base' uses the committer
date heuristic unless there is a cutoff.
[1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/
Thus, the need for generation number v2 was born. Because Git used to die
when the graph version it understands differs from the one in the
commit-graph file [2], we needed a way to distinguish between the old and
new generation numbers without incrementing the graph version.
[2] https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
The following candidates were proposed (https://github.com/derrickstolee/gen-test,
https://github.com/abhishekkumar2718/git/pull/1):
- (Epoch, Date) Pairs.
- Maximum Generation Numbers.
- Corrected Commit Date.
- FELINE Index.
- Corrected Commit Date with Monotonically Increasing Offsets.
Based on performance, local computability, and immutability (along with
the introduction of an additional commit-graph chunk which relieved the
requirement of backwards-compatibility) Corrected Commit Date was chosen
as generation number v2 and is defined as follows:
For a commit C, let its corrected commit date be the maximum of the
commit date of C and the corrected commit dates of its parents plus 1.
The corrected commit date offset is then the difference between the
corrected commit date of C and the commit date of C. As a special case, a
root commit with timestamp zero has a corrected commit date of 1 to
distinguish it from GENERATION_NUMBER_ZERO (that is, an uncomputed
generation number).
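The definition above can be sketched in a few lines of C. This is only an
illustration under stated assumptions: the function names and the plain
array of parent values are hypothetical and are not Git's API, and uint64_t
stands in for timestamp_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Git): compute the corrected commit
 * date of a commit from its own commit date and the already computed
 * corrected commit dates of its parents.
 */
uint64_t sketch_corrected_commit_date(uint64_t commit_date,
				      const uint64_t *parent_corrected,
				      size_t nr_parents)
{
	uint64_t max_parent = 0;
	uint64_t candidate;
	size_t i;

	for (i = 0; i < nr_parents; i++) {
		/* zero means "not computed", so parents must come first */
		assert(parent_corrected[i] != 0);
		if (parent_corrected[i] > max_parent)
			max_parent = parent_corrected[i];
	}

	/*
	 * One more than the largest parent value. For a root commit this
	 * is 1, which also covers the special case of a root commit with
	 * timestamp zero.
	 */
	candidate = max_parent + 1;
	return commit_date > candidate ? commit_date : candidate;
}

/* Hypothetical helper: the offset is the corrected date minus the date. */
uint64_t sketch_corrected_commit_date_offset(uint64_t commit_date,
					     uint64_t corrected)
{
	return corrected - commit_date;
}
```

With these helpers, a root commit with date 0 gets the value 1, and a commit
dated 150 whose parents have corrected dates 100 and 200 gets 201, giving an
offset of 51.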
While it was initially proposed to store corrected commit date offsets
within the Commit Data chunk, storing the offsets in a new chunk did not
affect performance measurably. The new chunk is the "Generation DATa
(GDAT) chunk"; it stores corrected commit date offsets, while the CDAT
chunk stores topological levels. Old versions of Git ignore the GDAT
chunk and use topological levels from the CDAT chunk. In contrast, new
versions of Git use corrected commit dates, falling back to topological
levels if the generation data chunk is absent from the commit-graph
file.
While storing corrected commit date offsets saves us 4 bytes per commit
(compared with storing corrected commit dates directly), it is
nevertheless possible for the offset to overflow the space allocated. To
handle such cases, we introduce a new chunk, _Generation Data Overflow_
(GDOV), that stores the corrected commit date. For overflowing offsets,
we set the MSB and store the position into the GDOV chunk, in a
mechanism similar to the Extra Edges list chunk.
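The MSB scheme described above can be sketched as follows. The macro and
function names are hypothetical and chosen for illustration only; the sketch
covers just the 4-byte GDAT entry and deliberately leaves out what is stored
in the GDOV slot itself.

```c
#include <stdint.h>

/* Hypothetical names; the values mirror the 31-bit limit described above. */
#define SKETCH_OFFSET_MAX   ((UINT64_C(1) << 31) - 1)
#define SKETCH_OVERFLOW_BIT (UINT32_C(1) << 31)

/*
 * Build the 4-byte GDAT entry for one commit. If the offset fits in 31
 * bits it is stored directly. Otherwise the MSB is set and the remaining
 * bits hold overflow_pos, the position of this commit's entry in GDOV.
 */
uint32_t sketch_encode_gdat_entry(uint64_t offset, uint32_t overflow_pos)
{
	if (offset > SKETCH_OFFSET_MAX)
		return SKETCH_OVERFLOW_BIT | overflow_pos;
	return (uint32_t)offset;
}

/* Nonzero if the entry points into GDOV instead of holding an offset. */
int sketch_gdat_entry_overflows(uint32_t entry)
{
	return (entry & SKETCH_OVERFLOW_BIT) != 0;
}

/* Position in GDOV for an entry that has the MSB set. */
uint32_t sketch_gdat_overflow_position(uint32_t entry)
{
	return entry & ~SKETCH_OVERFLOW_BIT;
}
```

A writer following this sketch would keep a running count of overflowing
commits and pass it as overflow_pos, incrementing it each time an entry
overflows.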
In a mixed generation number environment (for example, new Git on the
command line and old Git used by a GUI client), we can encounter a
mixed-chain commit-graph (a commit-graph chain where some of the split
commit-graph files have a GDAT chunk and others do not). As backward
compatibility is one of the goals, we define the following behavior:
While reading a mixed-chain commit-graph, we fall back on topological
levels, as corrected commit dates and topological levels cannot be
compared directly.
When adding a new layer to the split commit-graph file, and when merging
some or all layers (replacing them in the latter case), the new layer
will have a GDAT chunk if and only if, in the final result, there would
be no layer without a GDAT chunk just below it.
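The two rules above can be written down as a small sketch. The
representation is an assumption made for illustration: the chain is modeled
as an array of booleans ordered from the bottom layer to the top layer, and
the function names are hypothetical and do not exist in Git.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Reading: corrected commit dates are usable only if no layer in the
 * chain lacks a GDAT chunk. Otherwise fall back to topological levels.
 */
bool sketch_chain_uses_corrected_dates(const bool *layer_has_gdat,
				       size_t nr_layers)
{
	size_t i;

	for (i = 0; i < nr_layers; i++)
		if (!layer_has_gdat[i])
			return false;
	return true;
}

/*
 * Writing or merging: nr_kept is the number of layers that remain below
 * the new layer in the final result. The new layer gets a GDAT chunk if
 * it is the only layer, or if the layer just below it has one.
 */
bool sketch_new_layer_writes_gdat(const bool *layer_has_gdat,
				  size_t nr_kept)
{
	return nr_kept == 0 || layer_has_gdat[nr_kept - 1];
}
```

Applied consistently, the write rule keeps layers with a GDAT chunk
contiguous at the bottom of the chain, so a layer with GDAT never sits on
top of a layer without it.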
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
.../technical/commit-graph-format.txt | 28 +++++--
Documentation/technical/commit-graph.txt | 77 +++++++++++++++----
2 files changed, 86 insertions(+), 19 deletions(-)
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b3b58880b92..b6658eff188 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
The Git commit graph stores a list of commit OIDs and some associated
metadata, including:
-- The generation number of the commit. Commits with no parents have
- generation number 1; commits with parents have generation number
- one more than the maximum generation number of its parents. We
- reserve zero as special, and can be used to mark a generation
- number invalid or as "not computed".
+- The generation number of the commit.
- The root tree OID.
@@ -86,13 +82,33 @@ CHUNK DATA:
position. If there are more than two parents, the second value
has its most-significant bit on and the other bits store an array
position into the Extra Edge List chunk.
- * The next 8 bytes store the generation number of the commit and
+ * The next 8 bytes store the topological level (generation number v1)
+ of the commit and
the commit time in seconds since EPOCH. The generation number
uses the higher 30 bits of the first 4 bytes, while the commit
time uses the 32 bits of the second 4 bytes, along with the lowest
2 bits of the lowest byte, storing the 33rd and 34th bit of the
commit time.
+ Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+ * This list of 4-byte values store corrected commit date offsets for the
+ commits, arranged in the same order as commit data chunk.
+ * If the corrected commit date offset cannot be stored within 31 bits,
+ the value has its most-significant bit on and the other bits store
+ the position of corrected commit date into the Generation Data Overflow
+ chunk.
+ * Generation Data chunk is present only when commit-graph file is written
+ by compatible versions of Git and in case of split commit-graph chains,
+ the topmost layer also has Generation Data chunk.
+
+ Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
+ * This list of 8-byte values stores the corrected commit date offsets
+ for commits with corrected commit date offsets that cannot be
+ stored within 31 bits.
+ * Generation Data Overflow chunk is present only when Generation Data
+ chunk is present and at least one corrected commit date offset cannot
+ be stored within 31 bits.
+
Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
This list of 4-byte values store the second through nth parents for
all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index f14a7659aa8..f05e7bda1a9 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
Values 1-4 satisfy the requirements of parse_commit_gently().
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates (generation number v2)
+2. Topological levels (generation number v1)
- * A commit with no parents (a root commit) has generation number one.
+Define "corrected committer date" of a commit recursively as follows:
- * A commit with at least one parent has generation number one more than
- the largest generation number among its parents.
+ * A commit with no parents (a root commit) has corrected committer date
+ equal to its committer date.
-Equivalently, the generation number of a commit A is one more than the
+ * A commit with at least one parent has corrected committer date equal to
+ the maximum of its committer date and one more than the largest corrected
+ committer date among its parents.
+
+ * As a special case, a root commit with timestamp zero has corrected commit
+ date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+ (that is, an uncomputed corrected commit date).
+
+Define the "topological level" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has topological level of one.
+
+ * A commit with at least one parent has topological level one more than
+ the largest topological level among its parents.
+
+Equivalently, the topological level of a commit A is one more than the
length of a longest path from A to a root commit. The recursive definition
is easier to use for computation and observing the following property:
@@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
generation numbers, then we always expand the boundary commit with highest
generation number and can easily detect the stopping condition.
+The property applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
This property can be used to significantly reduce the time it takes to
walk commits and determine topological relationships. Without generation
numbers, the general heuristic is the following:
@@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
If A and B are commits with commit time X and Y, respectively, and
X < Y, then A _probably_ cannot reach B.
-This heuristic is currently used whenever the computation is allowed to
+In absence of corrected commit dates (for example, old versions of Git or
+mixed generation graph chains),
+this heuristic is currently used whenever the computation is allowed to
violate topological relationships due to clock skew (such as "git log"
with default order), but is not used when the topological order is
required (such as merge base calculations, "git log --graph").
@@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
generation number and walk until reaching commits with known generation
number.
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
in the commit-graph file. If a commit-graph file was written by a version
of Git that did not compute generation numbers, then those commits will
have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
walking a few extra commits, but the simplicity in dealing with commits
with generation number *_INFINITY or *_ZERO is valuable.
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
+topological levels (generation number v1) are computed to be at least
+this value. We limit at this value since it is the largest value that
+can be stored in the commit-graph file using the 30 bits available
+to topological levels. This presents another case where a commit can
+have generation number equal to that of a parent.
Design Details
--------------
@@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
number of commits) could be extracted into config settings for full
flexibility.
+## Handling Mixed Generation Number Chains
+
+With the introduction of generation number v2 and generation data chunk, the
+following scenario is possible:
+
+1. "New" Git writes a commit-graph with the corrected commit dates.
+2. "Old" Git writes a split commit-graph on top without corrected commit dates.
+
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
+levels of the higher layer. For this reason, Git inspects the topmost
+layer to see if the layer is missing corrected commit dates. In such a case
+Git only uses topological level for generation numbers.
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
+guarantees that if a layer has corrected commit dates, all lower layers
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
+commit dates. Instead, the new layer will have corrected commit dates if the
+layer below the new layer has corrected commit dates.
+
+While writing or merging layers, if the new layer is the only layer, it will
+have corrected commit dates when written by compatible versions of Git. Thus,
+rewriting split commit-graph as a single file (`--split=replace`) creates a
+single layer with corrected commit dates.
+
## Deleting graph-{hash} files
After a new tip file is written, some `graph-{hash}` files may no longer
--
gitgitgadget
* [PATCH v7 08/11] commit-graph: implement corrected commit date
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
` (6 preceding siblings ...)
2021-02-01 6:58 ` [PATCH v7 07/11] commit-graph: document generation number v2 Abhishek Kumar via GitGitGadget
@ 2021-02-01 6:58 ` Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 09/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
` (3 subsequent siblings)
11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01 6:58 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
With most of the preparations done, let's implement corrected commit date.
The corrected commit date for a commit is defined as:
* A commit with no parents (a root commit) has corrected commit date
equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
the maximum of its commit date and one more than the largest corrected
commit date among its parents.
As a special case, a root commit with a timestamp of zero (01.01.1970
00:00:00Z) has a corrected commit date of one, to distinguish it from
GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date).
To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.
Storing corrected commit dates requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32 bits. This halves the size
of the GDAT chunk, which is a reduction of around 6% in the size of the
commit-graph file.
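One way to arrive at a figure of that order is a back-of-the-envelope
calculation, sketched below. It rests on assumptions that are not stated in
the message: a 20-byte hash, and a file size dominated by the per-commit
entries of the OID Lookup, Commit Data and GDAT chunks, ignoring the header,
fanout, extra edges and Bloom filter chunks. The function names are
hypothetical.

```c
#include <stddef.h>

/*
 * Per-commit bytes, counting only OID Lookup, Commit Data and GDAT.
 * The Commit Data width of hash_len + 16 matches GRAPH_DATA_WIDTH in
 * commit-graph.c.
 */
size_t sketch_bytes_per_commit(size_t hash_len, size_t gdat_entry_width)
{
	size_t oid_lookup = hash_len;
	size_t commit_data = hash_len + 16;

	return oid_lookup + commit_data + gdat_entry_width;
}

/*
 * Saving from 4-byte instead of 8-byte GDAT entries, in permille,
 * rounded down.
 */
size_t sketch_saving_permille(size_t hash_len)
{
	size_t wide = sketch_bytes_per_commit(hash_len, 8);
	size_t narrow = sketch_bytes_per_commit(hash_len, 4);

	return 1000 * (wide - narrow) / wide;
}
```

Under these assumptions each commit takes 64 bytes with 8-byte entries and
60 bytes with 4-byte entries, a saving of 4 bytes out of 64, or 6.25%.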
However, using offsets can be problematic if a commit is malformed but
valid and has a committer date of 0 Unix time, as the offset would be the
same as the corrected commit date and thus require 64 bits to be stored
properly.
While Git does not write out offsets at this stage, it stores the
corrected commit dates in the 'generation' member of struct
commit_graph_data. It will begin writing commit date offsets with the
introduction of the generation data chunk.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 8f17815021d..d1e6ced8647 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1343,9 +1343,11 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
ctx->commits.nr);
for (i = 0; i < ctx->commits.nr; i++) {
uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+ timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
display_progress(ctx->progress, i + 1);
- if (level != GENERATION_NUMBER_ZERO)
+ if (level != GENERATION_NUMBER_ZERO &&
+ corrected_commit_date != GENERATION_NUMBER_ZERO)
continue;
commit_list_insert(ctx->commits.list[i], &list);
@@ -1354,17 +1356,24 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
struct commit_list *parent;
int all_parents_computed = 1;
uint32_t max_level = 0;
+ timestamp_t max_corrected_commit_date = 0;
for (parent = current->parents; parent; parent = parent->next) {
level = *topo_level_slab_at(ctx->topo_levels, parent->item);
+ corrected_commit_date = commit_graph_data_at(parent->item)->generation;
- if (level == GENERATION_NUMBER_ZERO) {
+ if (level == GENERATION_NUMBER_ZERO ||
+ corrected_commit_date == GENERATION_NUMBER_ZERO) {
all_parents_computed = 0;
commit_list_insert(parent->item, &list);
break;
- } else if (level > max_level) {
- max_level = level;
}
+
+ if (level > max_level)
+ max_level = level;
+
+ if (corrected_commit_date > max_corrected_commit_date)
+ max_corrected_commit_date = corrected_commit_date;
}
if (all_parents_computed) {
@@ -1373,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (max_level > GENERATION_NUMBER_V1_MAX - 1)
max_level = GENERATION_NUMBER_V1_MAX - 1;
*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+
+ if (current->date && current->date > max_corrected_commit_date)
+ max_corrected_commit_date = current->date - 1;
+ commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
}
}
}
--
gitgitgadget
* [PATCH v7 09/11] commit-graph: implement generation data chunk
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
` (7 preceding siblings ...)
2021-02-01 6:58 ` [PATCH v7 08/11] commit-graph: implement corrected commit date Abhishek Kumar via GitGitGadget
@ 2021-02-01 6:58 ` Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 10/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
` (2 subsequent siblings)
11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01 6:58 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
As discovered by Ævar, we cannot increment the graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of the
prerequisites before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.
We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.
Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).
We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.
To minimize the space required to store corrected commit dates, Git
stores corrected commit date offsets in the commit-graph file, instead
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it is possible for the offset to
overflow the 4 bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:
We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.
If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.
We test the overflow-related code with the following repo history:
F - N - U
/ \
U - N - U N
\ /
N - F - N
Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.
The largest offset observed is 2 ^ 31, just large enough to overflow.
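The figure of 2 ^ 31 can be reproduced by applying the definition of
corrected commit date to the history above. The sketch below is an
illustration only: the struct, the array and the function name are
hypothetical, the commits are listed by hand in an order where parents come
before children, and uint64_t stands in for timestamp_t.

```c
#include <stddef.h>
#include <stdint.h>

#define DATE_U UINT64_C(0)                 /* zero seconds since epoch */
#define DATE_N UINT64_C(1112354055)        /* test suite default date */
#define DATE_F ((UINT64_C(1) << 31) - 2)   /* 2 ^ 31 - 2 */
#define NO_PARENT (-1)

struct sketch_commit {
	uint64_t date;
	int parent[2];   /* indexes into the array below, or NO_PARENT */
};

/* The test history, parents listed before their children. */
static const struct sketch_commit sketch_history[] = {
	{ DATE_U, { NO_PARENT, NO_PARENT } },  /* 0: root U */
	{ DATE_N, { 0, NO_PARENT } },          /* 1: N */
	{ DATE_U, { 1, NO_PARENT } },          /* 2: U, fork point */
	{ DATE_F, { 2, NO_PARENT } },          /* 3: upper branch F */
	{ DATE_N, { 3, NO_PARENT } },          /* 4: upper branch N */
	{ DATE_U, { 4, NO_PARENT } },          /* 5: upper branch U */
	{ DATE_N, { 2, NO_PARENT } },          /* 6: lower branch N */
	{ DATE_F, { 6, NO_PARENT } },          /* 7: lower branch F */
	{ DATE_N, { 7, NO_PARENT } },          /* 8: lower branch N */
	{ DATE_N, { 5, 8 } },                  /* 9: merge N */
};

#define SKETCH_NR_COMMITS (sizeof(sketch_history) / sizeof(sketch_history[0]))

/* Compute every corrected commit date and return the largest offset. */
uint64_t sketch_largest_offset(void)
{
	uint64_t corrected[SKETCH_NR_COMMITS];
	uint64_t largest = 0;
	size_t i;
	int p;

	for (i = 0; i < SKETCH_NR_COMMITS; i++) {
		uint64_t candidate = 1;   /* floor for root commits */
		uint64_t offset;

		for (p = 0; p < 2; p++) {
			int parent = sketch_history[i].parent[p];

			if (parent != NO_PARENT &&
			    corrected[parent] + 1 > candidate)
				candidate = corrected[parent] + 1;
		}

		corrected[i] = sketch_history[i].date > candidate ?
			       sketch_history[i].date : candidate;
		offset = corrected[i] - sketch_history[i].date;
		if (offset > largest)
			largest = offset;
	}

	return largest;
}
```

In this sketch the largest offset belongs to the U commit on the upper
branch: its parent N has corrected commit date 2 ^ 31 - 1, so the U commit
gets 2 ^ 31, and with a committer date of zero its offset is 2 ^ 31 as well,
one more than the largest value that fits in 31 bits.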
[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 114 ++++++++++++++++++++++++++++++----
commit-graph.h | 3 +
commit.h | 1 +
t/README | 3 +
t/helper/test-read-graph.c | 4 ++
t/t4216-log-bloom.sh | 4 +-
t/t5318-commit-graph.sh | 79 +++++++++++++++++++----
t/t5324-split-commit-graph.sh | 12 ++--
t/t6600-test-reach.sh | 6 ++
t/test-lib-functions.sh | 6 ++
10 files changed, 200 insertions(+), 32 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index d1e6ced8647..d2afcc83283 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,11 +38,13 @@ void git_test_write_commit_graph_or_die(void)
#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA 0x47444154 /* "GDAT" */
+#define GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW 0x47444f56 /* "GDOV" */
#define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
#define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 9
#define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
@@ -61,6 +63,8 @@ void git_test_write_commit_graph_or_die(void)
#define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
+ GRAPH_FANOUT_SIZE + the_hash_algo->rawsz)
+#define CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW (1ULL << 31)
+
/* Remember to update object flag allocation in object.h */
#define REACHABLE (1u<<15)
@@ -394,6 +398,20 @@ struct commit_graph *parse_commit_graph(struct repository *r,
graph->chunk_commit_data = data + chunk_offset;
break;
+ case GRAPH_CHUNKID_GENERATION_DATA:
+ if (graph->chunk_generation_data)
+ chunk_repeated = 1;
+ else
+ graph->chunk_generation_data = data + chunk_offset;
+ break;
+
+ case GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW:
+ if (graph->chunk_generation_data_overflow)
+ chunk_repeated = 1;
+ else
+ graph->chunk_generation_data_overflow = data + chunk_offset;
+ break;
+
case GRAPH_CHUNKID_EXTRAEDGES:
if (graph->chunk_extra_edges)
chunk_repeated = 1;
@@ -754,8 +772,8 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
{
const unsigned char *commit_data;
struct commit_graph_data *graph_data;
- uint32_t lex_index;
- uint64_t date_high, date_low;
+ uint32_t lex_index, offset_pos;
+ uint64_t date_high, date_low, offset;
while (pos < g->num_commits_in_base)
g = g->base_graph;
@@ -773,7 +791,19 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
- graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+ if (g->chunk_generation_data) {
+ offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
+
+ if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
+ if (!g->chunk_generation_data_overflow)
+ die(_("commit-graph requires overflow generation data but has none"));
+
+ offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW;
+ graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos);
+ } else
+ graph_data->generation = item->date + offset;
+ } else
+ graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
if (g->topo_levels)
*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
@@ -945,6 +975,7 @@ struct write_commit_graph_context {
struct oid_array oids;
struct packed_commit_list commits;
int num_extra_edges;
+ int num_generation_data_overflows;
unsigned long approx_nr_objects;
struct progress *progress;
int progress_done;
@@ -963,7 +994,8 @@ struct write_commit_graph_context {
report_progress:1,
split:1,
changed_paths:1,
- order_by_pack:1;
+ order_by_pack:1,
+ write_generation_data:1;
struct topo_level_slab *topo_levels;
const struct commit_graph_opts *opts;
@@ -1123,6 +1155,45 @@ static int write_graph_chunk_data(struct hashfile *f,
return 0;
}
+static int write_graph_chunk_generation_data(struct hashfile *f,
+ struct write_commit_graph_context *ctx)
+{
+ int i, num_generation_data_overflows = 0;
+
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+ display_progress(ctx->progress, ++ctx->progress_cnt);
+
+ if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+ offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
+ num_generation_data_overflows++;
+ }
+
+ hashwrite_be32(f, offset);
+ }
+
+ return 0;
+}
+
+static int write_graph_chunk_generation_data_overflow(struct hashfile *f,
+ struct write_commit_graph_context *ctx)
+{
+ int i;
+ for (i = 0; i < ctx->commits.nr; i++) {
+ struct commit *c = ctx->commits.list[i];
+ timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
+ display_progress(ctx->progress, ++ctx->progress_cnt);
+
+ if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) {
+ hashwrite_be32(f, offset >> 32);
+ hashwrite_be32(f, (uint32_t) offset);
+ }
+ }
+
+ return 0;
+}
+
static int write_graph_chunk_extra_edges(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
@@ -1386,6 +1457,9 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
if (current->date && current->date > max_corrected_commit_date)
max_corrected_commit_date = current->date - 1;
commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
+
+ if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
+ ctx->num_generation_data_overflows++;
}
}
}
@@ -1719,6 +1793,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
chunks[2].id = GRAPH_CHUNKID_DATA;
chunks[2].size = (hashsz + 16) * ctx->commits.nr;
chunks[2].write_fn = write_graph_chunk_data;
+
+ if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
+ ctx->write_generation_data = 0;
+ if (ctx->write_generation_data) {
+ chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
+ chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
+ chunks[num_chunks].write_fn = write_graph_chunk_generation_data;
+ num_chunks++;
+ }
+ if (ctx->num_generation_data_overflows) {
+ chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW;
+ chunks[num_chunks].size = sizeof(timestamp_t) * ctx->num_generation_data_overflows;
+ chunks[num_chunks].write_fn = write_graph_chunk_generation_data_overflow;
+ num_chunks++;
+ }
if (ctx->num_extra_edges) {
chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
chunks[num_chunks].size = 4 * ctx->num_extra_edges;
@@ -2139,6 +2228,8 @@ int write_commit_graph(struct object_directory *odb,
ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
ctx->opts = opts;
ctx->total_bloom_filter_data_size = 0;
+ ctx->write_generation_data = 1;
+ ctx->num_generation_data_overflows = 0;
bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
bloom_settings.bits_per_entry);
@@ -2445,16 +2536,17 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
continue;
/*
- * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
- * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
- * extra logic in the following condition.
+ * If we are using topological level and one of our parents has
+ * generation GENERATION_NUMBER_V1_MAX, then our generation is
+ * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
+ * in the following condition.
*/
- if (max_generation == GENERATION_NUMBER_V1_MAX)
+ if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
max_generation--;
generation = commit_graph_generation(graph_commit);
- if (generation != max_generation + 1)
- graph_report(_("commit-graph generation for commit %s is %"PRItime" != %"PRItime),
+ if (generation < max_generation + 1)
+ graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
oid_to_hex(&cur_oid),
generation,
max_generation + 1);
diff --git a/commit-graph.h b/commit-graph.h
index 2e9aa7824ee..19a02001fde 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -6,6 +6,7 @@
#include "oidset.h"
#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
+#define GIT_TEST_COMMIT_GRAPH_NO_GDAT "GIT_TEST_COMMIT_GRAPH_NO_GDAT"
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
@@ -68,6 +69,8 @@ struct commit_graph {
const uint32_t *chunk_oid_fanout;
const unsigned char *chunk_oid_lookup;
const unsigned char *chunk_commit_data;
+ const unsigned char *chunk_generation_data;
+ const unsigned char *chunk_generation_data_overflow;
const unsigned char *chunk_extra_edges;
const unsigned char *chunk_base_graphs;
const unsigned char *chunk_bloom_indexes;
diff --git a/commit.h b/commit.h
index 742d96c41e8..eff94f3f7c2 100644
--- a/commit.h
+++ b/commit.h
@@ -14,6 +14,7 @@
#define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
#define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0
+#define GENERATION_NUMBER_V2_OFFSET_MAX ((1ULL << 31) - 1)
struct commit_list {
struct commit *item;
diff --git a/t/README b/t/README
index c730a707705..8a121487279 100644
--- a/t/README
+++ b/t/README
@@ -393,6 +393,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
be written after every 'git commit' command, and overrides the
'core.commitGraph' setting to true.
+GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
+commit-graph to be written without generation data chunk.
+
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
commit-graph write to compute and write changed path Bloom filters for
every 'git commit-graph write', as if the `--changed-paths` option was
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 5f585a17256..75927b2c81d 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -33,6 +33,10 @@ int cmd__read_graph(int argc, const char **argv)
printf(" oid_lookup");
if (graph->chunk_commit_data)
printf(" commit_metadata");
+ if (graph->chunk_generation_data)
+ printf(" generation_data");
+ if (graph->chunk_generation_data_overflow)
+ printf(" generation_data_overflow");
if (graph->chunk_extra_edges)
printf(" extra_edges");
if (graph->chunk_bloom_indexes)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 0f16c4b9d52..50f206db550 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -43,11 +43,11 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
'
graph_read_expect () {
- NUM_CHUNKS=5
+ NUM_CHUNKS=6
cat >expect <<- EOF
header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
num_commits: $1
- chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+ chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
EOF
test-tool read-graph >actual &&
test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 2ed0c1544da..fa27df579a5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -76,7 +76,7 @@ graph_git_behavior 'no graph' full commits/3 commits/1
graph_read_expect() {
OPTIONAL=""
NUM_CHUNKS=3
- if test ! -z $2
+ if test ! -z "$2"
then
OPTIONAL=" $2"
NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
@@ -103,14 +103,14 @@ test_expect_success 'exit with correct error on bad input to --stdin-commits' '
# valid commit and tree OID
git rev-parse HEAD HEAD^{tree} >in &&
git commit-graph write --stdin-commits <in &&
- graph_read_expect 3
+ graph_read_expect 3 generation_data
'
test_expect_success 'write graph' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "3"
+ graph_read_expect "3" generation_data
'
test_expect_success POSIXPERM 'write graph has correct permissions' '
@@ -219,7 +219,7 @@ test_expect_success 'write graph with merges' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "10" "extra_edges"
+ graph_read_expect "10" "generation_data extra_edges"
'
graph_git_behavior 'merge 1 vs 2' full merge/1 merge/2
@@ -254,7 +254,7 @@ test_expect_success 'write graph with new commit' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'full graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -264,7 +264,7 @@ test_expect_success 'write graph with nothing new' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'cleared graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -274,7 +274,7 @@ test_expect_success 'build graph from latest pack with closure' '
cd "$TRASH_DIRECTORY/full" &&
cat new-idx | git commit-graph write --stdin-packs &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "9" "extra_edges"
+ graph_read_expect "9" "generation_data extra_edges"
'
graph_git_behavior 'graph from pack, commit 8 vs merge 1' full commits/8 merge/1
@@ -287,7 +287,7 @@ test_expect_success 'build graph from commits with closure' '
git rev-parse merge/1 >>commits-in &&
cat commits-in | git commit-graph write --stdin-commits &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "6"
+ graph_read_expect "6" "generation_data"
'
graph_git_behavior 'graph from commits, commit 8 vs merge 1' full commits/8 merge/1
@@ -297,7 +297,7 @@ test_expect_success 'build graph from commits with append' '
cd "$TRASH_DIRECTORY/full" &&
git rev-parse merge/3 | git commit-graph write --stdin-commits --append &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "10" "extra_edges"
+ graph_read_expect "10" "generation_data extra_edges"
'
graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -307,7 +307,7 @@ test_expect_success 'build graph using --reachable' '
cd "$TRASH_DIRECTORY/full" &&
git commit-graph write --reachable &&
test_path_is_file $objdir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'append graph, commit 8 vs merge 1' full commits/8 merge/1
@@ -328,7 +328,7 @@ test_expect_success 'write graph in bare repo' '
cd "$TRASH_DIRECTORY/bare" &&
git commit-graph write &&
test_path_is_file $baredir/info/commit-graph &&
- graph_read_expect "11" "extra_edges"
+ graph_read_expect "11" "generation_data extra_edges"
'
graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1
@@ -454,8 +454,9 @@ test_expect_success 'warn on improper hash version' '
test_expect_success 'git commit-graph verify' '
cd "$TRASH_DIRECTORY/full" &&
- git rev-parse commits/8 | git commit-graph write --stdin-commits &&
- git commit-graph verify >output
+ git rev-parse commits/8 | GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --stdin-commits &&
+ git commit-graph verify >output &&
+ graph_read_expect 9 extra_edges
'
NUM_COMMITS=9
@@ -741,4 +742,56 @@ test_expect_success 'corrupt commit-graph write (missing tree)' '
)
'
+# We test the overflow-related code with the following repo history:
+#
+# 4:F - 5:N - 6:U
+# / \
+# 1:U - 2:N - 3:U M:N
+# \ /
+# 7:N - 8:F - 9:N
+#
+# Here the commits denoted by U have committer date of zero seconds
+# since Unix epoch, the commits denoted by N have committer date
+# starting from 1112354055 seconds since Unix epoch (default committer
+# date for the test suite), and the commits denoted by F have committer
+# date of (2 ^ 31 - 2) seconds since Unix epoch.
+#
+# The largest offset observed is 2 ^ 31, just large enough to overflow.
+#
+
+test_expect_success 'set up and verify repo with generation data overflow chunk' '
+ objdir=".git/objects" &&
+ UNIX_EPOCH_ZERO="@0 +0000" &&
+ FUTURE_DATE="@2147483646 +0000" &&
+ test_oid_cache <<-EOF &&
+ oid_version sha1:1
+ oid_version sha256:2
+ EOF
+ cd "$TRASH_DIRECTORY" &&
+ mkdir repo &&
+ cd repo &&
+ git init &&
+ test_commit --date "$UNIX_EPOCH_ZERO" 1 &&
+ test_commit 2 &&
+ test_commit --date "$UNIX_EPOCH_ZERO" 3 &&
+ git commit-graph write --reachable &&
+ graph_read_expect 3 generation_data &&
+ test_commit --date "$FUTURE_DATE" 4 &&
+ test_commit 5 &&
+ test_commit --date "$UNIX_EPOCH_ZERO" 6 &&
+ git branch left &&
+ git reset --hard 3 &&
+ test_commit 7 &&
+ test_commit --date "$FUTURE_DATE" 8 &&
+ test_commit 9 &&
+ git branch right &&
+ git reset --hard 3 &&
+ test_merge M left right &&
+ git commit-graph write --reachable &&
+ graph_read_expect 10 "generation_data generation_data_overflow" &&
+ git commit-graph verify
+'
+
+graph_git_behavior 'generation data overflow chunk repo' repo left right
+
test_done
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 4d3842b83b9..587757b62d9 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -13,11 +13,11 @@ test_expect_success 'setup repo' '
infodir=".git/objects/info" &&
graphdir="$infodir/commit-graphs" &&
test_oid_cache <<-EOM
- shallow sha1:1760
- shallow sha256:2064
+ shallow sha1:2132
+ shallow sha256:2436
- base sha1:1376
- base sha256:1496
+ base sha1:1408
+ base sha256:1528
oid_version sha1:1
oid_version sha256:2
@@ -31,9 +31,9 @@ graph_read_expect() {
NUM_BASE=$2
fi
cat >expect <<- EOF
- header: 43475048 1 $(test_oid oid_version) 3 $NUM_BASE
+ header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
num_commits: $1
- chunks: oid_fanout oid_lookup commit_metadata
+ chunks: oid_fanout oid_lookup commit_metadata generation_data
EOF
test-tool read-graph >output &&
test_cmp expect output
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index af10f0dc090..e2d33a8a4c4 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,6 +55,9 @@ test_expect_success 'setup' '
git show-ref -s commit-5-5 | git commit-graph write --stdin-commits &&
mv .git/objects/info/commit-graph commit-graph-half &&
chmod u+w commit-graph-half &&
+ GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable &&
+ mv .git/objects/info/commit-graph commit-graph-no-gdat &&
+ chmod u+w commit-graph-no-gdat &&
git config core.commitGraph true
'
@@ -67,6 +70,9 @@ run_all_modes () {
test_cmp expect actual &&
cp commit-graph-half .git/objects/info/commit-graph &&
"$@" <input >actual &&
+ test_cmp expect actual &&
+ cp commit-graph-no-gdat .git/objects/info/commit-graph &&
+ "$@" <input >actual &&
test_cmp expect actual
}
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 6bca0023168..df5bba07295 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -218,6 +218,12 @@ test_commit () {
--signoff)
signoff="$1"
;;
+ --date)
+ notick=yes
+ GIT_COMMITTER_DATE="$2"
+ GIT_AUTHOR_DATE="$2"
+ shift
+ ;;
-C)
indir="$2"
shift
--
gitgitgadget
* [PATCH v7 10/11] commit-graph: use generation v2 only if entire chain does
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
` (8 preceding siblings ...)
2021-02-01 6:58 ` [PATCH v7 09/11] commit-graph: implement generation data chunk Abhishek Kumar via GitGitGadget
@ 2021-02-01 6:58 ` Abhishek Kumar via GitGitGadget
2021-02-01 6:58 ` [PATCH v7 11/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
2021-02-01 13:14 ` [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01 6:58 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:
1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
If each layer of a split commit-graph is treated independently, as was
the case before this commit, with Git inspecting only the current layer
for the chunk_generation_data pointer, commits in the lower layer (one with
GDAT) would have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.
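
A minimal sketch of the inversion described above. The `struct toy_commit`
type and the function name are hypothetical stand-ins for illustration, not
Git's own data structures; the sketch only checks the invariant that a
parent's generation must be smaller than its child's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a commit; this is not Git's struct commit. */
struct toy_commit {
	uint64_t generation;             /* value read from the layer's chunk */
	const struct toy_commit *parent; /* single parent, NULL for a root */
};

/*
 * Return 1 if generation(parent) < generation(child) holds for this
 * commit and its parent, and 0 if the invariant is violated.
 */
static int toy_generation_invariant_holds(const struct toy_commit *c)
{
	assert(c != NULL);
	if (!c->parent)
		return 1; /* a root commit has nothing to compare against */
	return c->parent->generation < c->generation;
}
```

With a lower-layer parent whose generation is a corrected commit date such
as 1112354055 and an upper-layer child whose generation is a topological
level such as 65, the check returns 0, which is the violation this commit
guards against.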
It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" against any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.
This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).
Therefore, when writing a new layer in a split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has a GDAT chunk, all lower layers must have a GDAT chunk
as well.
Rewriting layers follows a similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain a GDAT chunk, then the result of the
rewrite does not have a GDAT chunk either.
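
The two rules above can be sketched on a simplified model of the chain. The
`struct toy_layer` type and both function names are hypothetical and exist
only for illustration; the first function mirrors the shape of
validate_mixed_generation_chain() in the diff below, and the second models
the write-side decision while ignoring GIT_TEST_COMMIT_GRAPH_NO_GDAT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one layer of a split commit-graph chain. */
struct toy_layer {
	int has_gdat;             /* layer contains a GDAT chunk */
	int read_generation_data; /* readers should use generation v2 */
	struct toy_layer *base;   /* next lower layer, NULL at the bottom */
};

/*
 * Reading: the topmost layer decides for the whole chain. If it lacks
 * GDAT, every layer falls back to topological levels.
 */
static void toy_validate_mixed_chain(struct toy_layer *top)
{
	int use_v2;

	if (!top)
		return;
	use_v2 = !!top->has_gdat;
	for (; top; top = top->base)
		top->read_generation_data = use_v2;
}

/*
 * Writing: a new or merged layer gets GDAT only if the topmost layer
 * that remains underneath it has GDAT. With no layer underneath
 * (a full rewrite), GDAT is written.
 */
static int toy_should_write_gdat(const struct toy_layer *topmost_remaining)
{
	if (!topmost_remaining)
		return 1;
	return !!topmost_remaining->has_gdat;
}
```

For a chain whose bottom layer has GDAT and whose top layer does not, the
first function marks both layers as not using generation v2, and the second
returns 0 for a new layer stacked on top.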
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 30 +++++-
commit-graph.h | 1 +
t/t5324-split-commit-graph.sh | 181 ++++++++++++++++++++++++++++++++++
3 files changed, 210 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index d2afcc83283..77fef5a240e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -614,6 +614,21 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
return graph_chain;
}
+static void validate_mixed_generation_chain(struct commit_graph *g)
+{
+ int read_generation_data;
+
+ if (!g)
+ return;
+
+ read_generation_data = !!g->chunk_generation_data;
+
+ while (g) {
+ g->read_generation_data = read_generation_data;
+ g = g->base_graph;
+ }
+}
+
struct commit_graph *read_commit_graph_one(struct repository *r,
struct object_directory *odb)
{
@@ -622,6 +637,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
if (!g)
g = load_commit_graph_chain(r, odb);
+ validate_mixed_generation_chain(g);
+
return g;
}
@@ -791,7 +808,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
- if (g->chunk_generation_data) {
+ if (g->read_generation_data) {
offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
@@ -2019,6 +2036,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
if (i < ctx->num_commit_graphs_after)
ctx->commit_graph_hash_after[i] = xstrdup(oid_to_hex(&g->oid));
+ /*
+ * If the topmost remaining layer has generation data chunk, the
+ * resultant layer also has generation data chunk.
+ */
+ if (i == ctx->num_commit_graphs_after - 2)
+ ctx->write_generation_data = !!g->chunk_generation_data;
+
i--;
g = g->base_graph;
}
@@ -2343,6 +2367,8 @@ int write_commit_graph(struct object_directory *odb,
} else
ctx->num_commit_graphs_after = 1;
+ validate_mixed_generation_chain(ctx->r->objects->commit_graph);
+
compute_generation_numbers(ctx);
if (ctx->changed_paths)
@@ -2541,7 +2567,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
* also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic
* in the following condition.
*/
- if (!g->chunk_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
+ if (!g->read_generation_data && max_generation == GENERATION_NUMBER_V1_MAX)
max_generation--;
generation = commit_graph_generation(graph_commit);
diff --git a/commit-graph.h b/commit-graph.h
index 19a02001fde..ad52130883b 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -64,6 +64,7 @@ struct commit_graph {
struct object_directory *odb;
uint32_t num_commits_in_base;
+ unsigned int read_generation_data;
struct commit_graph *base_graph;
const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 587757b62d9..8e90f3423b8 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -453,4 +453,185 @@ test_expect_success 'prevent regression for duplicate commits across layers' '
git -C dup commit-graph verify
'
+NUM_FIRST_LAYER_COMMITS=64
+NUM_SECOND_LAYER_COMMITS=16
+NUM_THIRD_LAYER_COMMITS=7
+NUM_FOURTH_LAYER_COMMITS=8
+NUM_FIFTH_LAYER_COMMITS=16
+SECOND_LAYER_SEQUENCE_START=$(($NUM_FIRST_LAYER_COMMITS + 1))
+SECOND_LAYER_SEQUENCE_END=$(($SECOND_LAYER_SEQUENCE_START + $NUM_SECOND_LAYER_COMMITS - 1))
+THIRD_LAYER_SEQUENCE_START=$(($SECOND_LAYER_SEQUENCE_END + 1))
+THIRD_LAYER_SEQUENCE_END=$(($THIRD_LAYER_SEQUENCE_START + $NUM_THIRD_LAYER_COMMITS - 1))
+FOURTH_LAYER_SEQUENCE_START=$(($THIRD_LAYER_SEQUENCE_END + 1))
+FOURTH_LAYER_SEQUENCE_END=$(($FOURTH_LAYER_SEQUENCE_START + $NUM_FOURTH_LAYER_COMMITS - 1))
+FIFTH_LAYER_SEQUENCE_START=$(($FOURTH_LAYER_SEQUENCE_END + 1))
+FIFTH_LAYER_SEQUENCE_END=$(($FIFTH_LAYER_SEQUENCE_START + $NUM_FIFTH_LAYER_COMMITS - 1))
+
+# Current split graph chain:
+#
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+ graphdir=".git/objects/info/commit-graphs" &&
+ test_oid_cache <<-EOF &&
+ oid_version sha1:1
+ oid_version sha256:2
+ EOF
+ git init mixed &&
+ (
+ cd mixed &&
+ git config core.commitGraph true &&
+ git config gc.writeCommitGraph false &&
+ for i in $(test_seq $NUM_FIRST_LAYER_COMMITS)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split &&
+ graph_read_expect $NUM_FIRST_LAYER_COMMITS &&
+ test_line_count = 1 $graphdir/commit-graph-chain &&
+ for i in $(test_seq $SECOND_LAYER_SEQUENCE_START $SECOND_LAYER_SEQUENCE_END)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+ test_line_count = 2 $graphdir/commit-graph-chain &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 4 1
+ num_commits: $NUM_SECOND_LAYER_COMMITS
+ chunks: oid_fanout oid_lookup commit_metadata
+ EOF
+ test_cmp expect output &&
+ git commit-graph verify &&
+ cat $graphdir/commit-graph-chain
+ )
+'
+
+# The new layer will be added without generation data chunk as it was not
+# present on the layer underneath it.
+#
+# 7 commits (No GDAT)
+# ------------------------
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+test_expect_success 'do not write generation data chunk if not present on existing tip' '
+ git clone mixed mixed-no-gdat &&
+ (
+ cd mixed-no-gdat &&
+ for i in $(test_seq $THIRD_LAYER_SEQUENCE_START $THIRD_LAYER_SEQUENCE_END)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split=no-merge &&
+ test_line_count = 3 $graphdir/commit-graph-chain &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 4 2
+ num_commits: $NUM_THIRD_LAYER_COMMITS
+ chunks: oid_fanout oid_lookup commit_metadata
+ EOF
+ test_cmp expect output &&
+ git commit-graph verify
+ )
+'
+
+# Number of commits in each layer of the split-commit graph before merge:
+#
+# 8 commits (No GDAT)
+# ------------------------
+# 7 commits (No GDAT)
+# ------------------------
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+# The top two layers are merged and do not have generation data chunk as layer below them does
+# not have generation data chunk.
+#
+# 15 commits (No GDAT)
+# ------------------------
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+test_expect_success 'do not write generation data chunk if the topmost remaining layer does not have generation data chunk' '
+ git clone mixed-no-gdat mixed-merge-no-gdat &&
+ (
+ cd mixed-merge-no-gdat &&
+ for i in $(test_seq $FOURTH_LAYER_SEQUENCE_START $FOURTH_LAYER_SEQUENCE_END)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split --size-multiple 1 &&
+ test_line_count = 3 $graphdir/commit-graph-chain &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 4 2
+ num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
+ chunks: oid_fanout oid_lookup commit_metadata
+ EOF
+ test_cmp expect output &&
+ git commit-graph verify
+ )
+'
+
+# Number of commits in each layer of the split-commit graph before merge:
+#
+# 16 commits (No GDAT)
+# ------------------------
+# 15 commits (No GDAT)
+# ------------------------
+# 16 commits (No GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+# The top three layers are merged and has generation data chunk as the topmost remaining layer
+# has generation data chunk.
+#
+# 47 commits (GDAT)
+# ------------------------
+# 64 commits (GDAT)
+#
+test_expect_success 'write generation data chunk if topmost remaining layer has generation data chunk' '
+ git clone mixed-merge-no-gdat mixed-merge-gdat &&
+ (
+ cd mixed-merge-gdat &&
+ for i in $(test_seq $FIFTH_LAYER_SEQUENCE_START $FIFTH_LAYER_SEQUENCE_END)
+ do
+ test_commit $i &&
+ git branch commits/$i || return 1
+ done &&
+ git commit-graph write --reachable --split --size-multiple 1 &&
+ test_line_count = 2 $graphdir/commit-graph-chain &&
+ test-tool read-graph >output &&
+ cat >expect <<-EOF &&
+ header: 43475048 1 $(test_oid oid_version) 5 1
+ num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
+ chunks: oid_fanout oid_lookup commit_metadata generation_data
+ EOF
+ test_cmp expect output
+ )
+'
+
+test_expect_success 'write generation data chunk when commit-graph chain is replaced' '
+ git clone mixed mixed-replace &&
+ (
+ cd mixed-replace &&
+ git commit-graph write --reachable --split=replace &&
+ test_path_is_file $graphdir/commit-graph-chain &&
+ test_line_count = 1 $graphdir/commit-graph-chain &&
+ verify_chain_files_exist $graphdir &&
+ graph_read_expect $(($NUM_FIRST_LAYER_COMMITS + $NUM_SECOND_LAYER_COMMITS)) &&
+ git commit-graph verify
+ )
+'
+
test_done
--
gitgitgadget
* [PATCH v7 11/11] commit-reach: use corrected commit dates in paint_down_to_common()
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
` (9 preceding siblings ...)
2021-02-01 6:58 ` [PATCH v7 10/11] commit-graph: use generation v2 only if entire chain does Abhishek Kumar via GitGitGadget
@ 2021-02-01 6:58 ` Abhishek Kumar via GitGitGadget
2021-02-01 13:14 ` [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
11 siblings, 0 replies; 211+ messages in thread
From: Abhishek Kumar via GitGitGadget @ 2021-02-01 6:58 UTC (permalink / raw)
To: git
Cc: Derrick Stolee, Jakub Narębski, Abhishek Kumar,
SZEDER Gábor, Taylor Blau, Abhishek Kumar, Abhishek Kumar
From: Abhishek Kumar <abhishekkumar8222@gmail.com>
091f4cf (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.
For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits in 0.135s when ordering by committer
date, and 167496 commits in 0.157s when ordering by corrected committer
date.
Although Git walks nearly the same number of commits with corrected
commit dates as with commit dates, the process is slower because each
comparison has to access a commit-slab (for the corrected committer
date) instead of a struct member (for the committer date).
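
For reference, a minimal sketch of how a corrected commit date follows from
the definition given in this series' documentation patch: the maximum of the
commit's own date and one more than each parent's corrected commit date,
with a root commit at timestamp zero getting 1. The function names and the
array-based parent representation are assumptions made for illustration and
are not the code in commit-graph.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Corrected commit date of a commit, given its own commit date and the
 * already computed corrected commit dates of its parents.
 */
static uint64_t toy_corrected_commit_date(uint64_t commit_date,
					  const uint64_t *parent_corrected,
					  size_t nr_parents)
{
	uint64_t result = commit_date;
	size_t i;

	assert(nr_parents == 0 || parent_corrected != NULL);
	for (i = 0; i < nr_parents; i++)
		if (parent_corrected[i] + 1 > result)
			result = parent_corrected[i] + 1;
	/* Keep the result distinct from GENERATION_NUMBER_ZERO. */
	if (!result)
		result = 1;
	return result;
}

/* Offset of the corrected commit date from the commit date itself. */
static uint64_t toy_corrected_commit_date_offset(uint64_t commit_date,
						 const uint64_t *parent_corrected,
						 size_t nr_parents)
{
	return toy_corrected_commit_date(commit_date, parent_corrected,
					 nr_parents) - commit_date;
}
```

A commit dated 100 whose parents have corrected commit dates 150 and 120
gets corrected commit date 151 and offset 51, while a commit dated 200 with
the same parents keeps 200 and has offset 0.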
This change incidentally broke the fragile t6404-recursive-merge test.
t6404-recursive-merge sets up a unique repository where all commits have
the same committer date without a well-defined merge-base.
While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.
With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.
While the resulting repositories are equivalent, 6404.4 'virtual trees
were processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.
As this has already caused problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable the commit-graph
within t6404-recursive-merge.
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
commit-graph.c | 14 ++++++++++++++
commit-graph.h | 6 ++++++
commit-reach.c | 2 +-
t/t6404-recursive-merge.sh | 5 ++++-
4 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 77fef5a240e..bf735fac4ea 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -714,6 +714,20 @@ int generation_numbers_enabled(struct repository *r)
return !!first_generation;
}
+int corrected_commit_dates_enabled(struct repository *r)
+{
+ struct commit_graph *g;
+ if (!prepare_commit_graph(r))
+ return 0;
+
+ g = r->objects->commit_graph;
+
+ if (!g->num_commits)
+ return 0;
+
+ return g->read_generation_data;
+}
+
struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
{
struct commit_graph *g = r->objects->commit_graph;
diff --git a/commit-graph.h b/commit-graph.h
index ad52130883b..97f3497c279 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -95,6 +95,12 @@ struct commit_graph *parse_commit_graph(struct repository *r,
*/
int generation_numbers_enabled(struct repository *r);
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
+
struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
enum commit_graph_write_flags {
diff --git a/commit-reach.c b/commit-reach.c
index 9b24b0378d5..e38771ca5a1 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
int i;
timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
- if (!min_generation)
+ if (!min_generation && !corrected_commit_dates_enabled(r))
queue.compare = compare_commits_by_commit_date;
one->object.flags |= PARENT1;
diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
index c7ab7048f58..eaf48e941e2 100755
--- a/t/t6404-recursive-merge.sh
+++ b/t/t6404-recursive-merge.sh
@@ -18,6 +18,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
export GIT_COMMITTER_DATE
test_expect_success 'setup tests' '
+ GIT_TEST_COMMIT_GRAPH=0 &&
+ export GIT_TEST_COMMIT_GRAPH &&
echo 1 >a1 &&
git add a1 &&
GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -69,7 +71,7 @@ test_expect_success 'setup tests' '
'
test_expect_success 'combined merge conflicts' '
- test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+ test_must_fail git merge -m final G
'
test_expect_success 'result contains a conflict' '
@@ -85,6 +87,7 @@ test_expect_success 'result contains a conflict' '
'
test_expect_success 'virtual trees were processed' '
+ # TODO: fragile test, relies on ambiguous merge-base resolution
git ls-files --stage >out &&
cat >expect <<-EOF &&
--
gitgitgadget
* Re: [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date
2021-02-01 6:58 ` [PATCH v7 " Abhishek Kumar via GitGitGadget
` (10 preceding siblings ...)
2021-02-01 6:58 ` [PATCH v7 11/11] commit-reach: use corrected commit dates in paint_down_to_common() Abhishek Kumar via GitGitGadget
@ 2021-02-01 13:14 ` Derrick Stolee
2021-02-01 18:26 ` Junio C Hamano
11 siblings, 1 reply; 211+ messages in thread
From: Derrick Stolee @ 2021-02-01 13:14 UTC (permalink / raw)
To: Abhishek Kumar via GitGitGadget, git
Cc: Jakub Narębski, Abhishek Kumar, SZEDER Gábor,
Taylor Blau, Junio C Hamano
On 2/1/2021 1:58 AM, Abhishek Kumar via GitGitGadget wrote:
> Changes in version 7:
>
> * Moved the documentation patch ahead of "commit-graph: implement corrected
> commit date" and elaborated on the introduction of generation number v2.
The only change in this version is this commit message:
> 11: e571f03d8bd ! 7: 8647b5d2e38 doc: add corrected commit date info
> @@ Metadata
> Author: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> ## Commit message ##
> - doc: add corrected commit date info
> + commit-graph: document generation number v2
>
> - With generation data chunk and corrected commit dates implemented, let's
> - update the technical documentation for commit-graph.
> + Git uses topological levels in the commit-graph file for commit-graph
> + traversal operations like 'git log --graph'. Unfortunately, topological
> + levels can perform worse than committer date when parents of a commit
> + differ greatly in generation numbers [1]. For example, 'git merge-base
> + v4.8 v4.9' on the Linux repository walks 635,579 commits using
> + topological levels and walks 167,468 using committer date. Since
> + 091f4cf3 (commit: don't use generation numbers if not needed,
> + 2018-08-30), 'git merge-base' uses committer date heuristic unless there
> + is a cutoff because of the performance hit.
> +
> + [1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/
> +
> + Thus, the need for generation number v2 was born. As Git used to die
> + when graph version understood by it and in the commit-graph file are
> + different [2], we needed a way to distinguish between the old and new
> + generation number without incrementing the graph version.
> +
> + [2] https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/
> +
> + The following candidates were proposed (https://github.com/derrickstolee/gen-test,
> + https://github.com/abhishekkumar2718/git/pull/1):
> + - (Epoch, Date) Pairs.
> + - Maximum Generation Numbers.
> + - Corrected Commit Date.
> + - FELINE Index.
> + - Corrected Commit Date with Monotonically Increasing Offsets.
> +
> + Based on performance, local computability, and immutability (along with
> + the introduction of an additional commit-graph chunk which relieved the
> + requirement of backwards-compatibility) Corrected Commit Date was chosen
> + as generation number v2 and is defined as follows:
> +
> + For a commit C, let its corrected commit date be the maximum of the
> + commit date of C and the corrected commit dates of its parents plus 1.
> + Then corrected commit date offset is the difference between corrected
> + commit date of C and commit date of C. As a special case, a root commit
> + with the timestamp zero has corrected commit date of 1 to distinguish it
> + from GENERATION_NUMBER_ZERO (that is, an uncomputed generation number).
> +
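The definition quoted above can be read as a small recurrence. What follows is only a sketch of that reading, not the code from the series: the function names and the flat array of parent values are hypothetical, and it assumes the parents' corrected commit dates have already been computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not Git's API): corrected commit date of a commit,
 * given its own commit date and the already-computed corrected commit
 * dates of its parents.
 */
static uint64_t corrected_commit_date(uint64_t commit_date,
				      const uint64_t *parent_corrected,
				      size_t nr_parents)
{
	uint64_t result = commit_date;
	size_t i;

	/* Must be strictly greater than every parent's corrected date. */
	for (i = 0; i < nr_parents; i++)
		if (parent_corrected[i] + 1 > result)
			result = parent_corrected[i] + 1;

	/* Special case: a root commit with timestamp zero gets 1, not 0. */
	if (!result)
		result = 1;
	return result;
}

/* Offset that would be stored instead of the full corrected date. */
static uint64_t corrected_commit_date_offset(uint64_t corrected,
					     uint64_t commit_date)
{
	assert(corrected >= commit_date);
	return corrected - commit_date;
}
```

For example, a commit dated 50 whose parents have corrected dates 100 and 70 gets corrected date 101 and offset 51, while a commit dated 200 with a single parent at 100 keeps 200 with offset 0.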
> + While it was proposed initially to store corrected commit date offsets
> + within Commit Data Chunk, storing the offsets in a new chunk did not
> + affect the performance measurably. The new chunk is "Generation DATa
> + (GDAT) chunk" and it stores corrected commit date offsets while CDAT
> + chunk stores topological level. The old versions of Git would ignore
> + GDAT chunk, using topological levels from CDAT chunk. In contrast, new
> + versions of Git would use corrected commit dates, falling back to
> + topological level if the generation data chunk is absent in the
> + commit-graph file.
> +
> + While storing corrected commit date offsets saves us 4 bytes per commit
> + (as compared with storing corrected commit dates directly), the offset
> + can overflow the space allocated. To handle such cases, we introduce a
> + new chunk, _Generation Data Overflow_ (GDOV), that stores the corrected
> + commit date. For overflowing offsets, we set the MSB and store the
> + position into the GDOV chunk, in a mechanism similar to the Extra Edges
> + list chunk.
> +
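A rough sketch of the overflow mechanism described above, under stated assumptions: the GDAT slot is taken to be 32 bits wide with 31 usable bits, the overflow slot holds the full corrected commit date as the text says, and the macro and function names are made up for illustration. The on-disk format in the series may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; assumes a 32-bit GDAT slot with the MSB as a flag. */
#define OFFSET_MAX          ((UINT32_C(1) << 31) - 1)
#define OFFSET_OVERFLOW_BIT (UINT32_C(1) << 31)

/*
 * Return the 32-bit value for the GDAT slot. If the offset does not fit
 * in 31 bits, append the corrected date to 'gdov' and return its position
 * with the MSB set. The caller must provide enough room in 'gdov'.
 */
static uint32_t encode_gdat_slot(uint64_t corrected, uint64_t commit_date,
				 uint64_t *gdov, size_t *gdov_nr)
{
	uint64_t offset;
	uint32_t pos;

	assert(corrected >= commit_date);
	offset = corrected - commit_date;
	if (offset <= OFFSET_MAX)
		return (uint32_t)offset;

	pos = (uint32_t)*gdov_nr;
	gdov[pos] = corrected;
	(*gdov_nr)++;
	return OFFSET_OVERFLOW_BIT | pos;
}

/* Recover the corrected commit date from a GDAT slot value. */
static uint64_t decode_gdat_slot(uint32_t slot, uint64_t commit_date,
				 const uint64_t *gdov)
{
	if (slot & OFFSET_OVERFLOW_BIT)
		return gdov[slot & OFFSET_MAX];
	return commit_date + slot;
}
```

The common case costs nothing extra: only commits whose offset exceeds 31 bits touch the overflow list, which mirrors how octopus merges are the only commits that consult the Extra Edges list.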
> + In a mixed generation number environment (for example, new Git on the
> + command line and old Git used by a GUI client), we can encounter a
> + mixed-chain commit-graph (a commit-graph chain where some of the split
> + commit-graph files have a GDAT chunk and others do not). As backward
> + compatibility is one of the goals, we define the following behavior:
> +
> + While reading a mixed-chain commit-graph, we fall back on topological
> + levels, because corrected commit dates and topological levels cannot
> + be compared directly.
> +
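The read-side rule above amounts to an all-or-nothing check over the chain. A minimal sketch, assuming each layer is reduced to a flag saying whether it carries a GDAT chunk (the enum, the function, and that representation are all hypothetical, not the series' data structures):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the two kinds of generation number. */
enum generation_kind {
	GENERATION_TOPO_LEVEL = 1,
	GENERATION_CORRECTED_DATE = 2
};

/*
 * Decide which generation number to use for a whole chain. A single
 * layer without GDAT forces topological levels everywhere, since the
 * two kinds of value cannot be compared with each other.
 */
static enum generation_kind chain_generation_kind(const int *layer_has_gdat,
						  size_t nr_layers)
{
	size_t i;

	assert(nr_layers > 0);
	for (i = 0; i < nr_layers; i++)
		if (!layer_has_gdat[i])
			return GENERATION_TOPO_LEVEL;
	return GENERATION_CORRECTED_DATE;
}
```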
> + When adding new layer to the split commit-graph file, and when merging
> + some or all layers (replacing them in the latter case), the new layer
> + will have GDAT chunk if and only if in the final result there would be
> + no layer without GDAT chunk just below it.
While that is a quality message, v6 has landed in 'next' and I've begun
working off of that version. As Taylor attempted to say [1], this topic
should be considered final and updates should be follow-ups on top.
[1] https://lore.kernel.org/git/YBYLwpKdUfxCNwaz@nand.local/
(Of course, if Junio says differently, then listen to him.)
Thanks,
-Stolee
* Re: [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date
2021-02-01 13:14 ` [PATCH v7 00/11] [GSoC] Implement Corrected Commit Date Derrick Stolee
@ 2021-02-01 18:26 ` Junio C Hamano
0 siblings, 0 replies; 211+ messages in thread
From: Junio C Hamano @ 2021-02-01 18:26 UTC (permalink / raw)
To: Derrick Stolee
Cc: Abhishek Kumar via GitGitGadget, git, Jakub Narębski,
Abhishek Kumar, SZEDER Gábor, Taylor Blau
Derrick Stolee <stolee@gmail.com> writes:
>> + When adding new layer to the split commit-graph file, and when merging
>> + some or all layers (replacing them in the latter case), the new layer
>> + will have GDAT chunk if and only if in the final result there would be
>> + no layer without GDAT chunk just below it.
>
> While that is a quality message, v6 has landed in 'next' and I've begun
> working off of that version. As Taylor attempted to say [1], this topic
> should be considered final and updates should be follow-ups on top.
>
> [1] https://lore.kernel.org/git/YBYLwpKdUfxCNwaz@nand.local/
Sounds sensible, modulo s/final/solid enough/ ;-)
I would imagine that the "quality message" has something of value to
keep to help future developers, and if that is the case, a follow-up
patch to add to the Documentation/technical/ would be appropriate.
Thanks all, for a quality series.