git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] commit-graph: ignore duplicates when merging layers
@ 2020-10-08 13:56 Derrick Stolee via GitGitGadget
  2020-10-08 14:15 ` Taylor Blau
  2020-10-08 14:59 ` [PATCH v2] " Derrick Stolee via GitGitGadget
  0 siblings, 2 replies; 14+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-10-08 13:56 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Thomas reported [1] that a "git fetch" command was failing with an error
saying "unexpected duplicate commit id". The root cause is that they had
fetch.writeCommitGraph enabled which generates commit-graph chains, and
this instance was merging two layers that both contained the same commit
ID.

[1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/

The initial assumption is that Git would not write a commit ID into a
commit-graph layer if it already exists in a lower commit-graph layer.
Somehow, this specific case did get into that situation, leading to this
error.

While unexpected, this isn't actually invalid (as long as the two layers
agree on the metadata for the commit). When we parse a commit that does
not have a graph_pos in the commit_graph_data_slab, we use binary search
in the commit-graph layers to find the commit and set graph_pos. That
position is never used again in this case. However, when we parse a
commit from the commit-graph file, we load its parents from the
commit-graph and assign graph_pos at that point. If those parents were
already parsed from the commit-graph, then nothing needs to be done.
Otherwise, this graph_pos is a valid position in the commit-graph so we
can parse the parents, when necessary.

Thus, this die() is too aggignoring the duplicates.

This leads to some additional complication that we did no have before:
if we only ignore the duplicates, then we will produce a commit-graph
that has identical commit IDs listed in adjacent positions. This excess
data will never be removed from the commit-graph, which could cascade
into significantly bloated file sizes.

Begrudgingly, the best way to fix this is to copy the commit pointers
into a new list that only contains de-duplicated commit IDs. This adds a
slight memory overhead, but it is small compared to having all of these
commits parsed in memory, so it should be an acceptable cost for
avoiding these failures.

Since the root cause for producing commit-graph layers with these
duplicate commits is currently unknown, it is difficult to create a test
for this scenario. For now, we must rely on testing the example data
graciously provided in [1]. My local test successfully merged layers,
and 'git commit-graph verify' passed.

Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
    commit-graph: ignore duplicates when merging layers
    
    This wasn't quite as simple as what Peff had posted, since we really
    don't want to keep duplicate commits around in the new merged layer.
    
    I still don't have a grasp on how this happened in the first place, but
    will keep looking.
    
    Thanks, -Stolee

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-747%2Fderrickstolee%2Fcommit-graph-dup-commits-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-747/derrickstolee/commit-graph-dup-commits-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/747

 commit-graph.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..29bac78dc3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -2009,6 +2009,7 @@ static int commit_compare(const void *_a, const void *_b)
 static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
+	struct packed_commit_list deduped_commits = { NULL, 0, 0 };
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -2016,6 +2017,8 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 					ctx->commits.nr);
 
 	QSORT(ctx->commits.list, ctx->commits.nr, commit_compare);
+	deduped_commits.alloc = ctx->commits.nr;
+	ALLOC_ARRAY(deduped_commits.list, deduped_commits.alloc);
 
 	ctx->num_extra_edges = 0;
 	for (i = 0; i < ctx->commits.nr; i++) {
@@ -2023,17 +2026,30 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 
 		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
 			  &ctx->commits.list[i]->object.oid)) {
-			die(_("unexpected duplicate commit id %s"),
-			    oid_to_hex(&ctx->commits.list[i]->object.oid));
+			/*
+			 * Silently ignore duplicates. These were likely
+			 * created due to a commit appearing in multiple
+			 * layers of the chain, which is unexpected but
+			 * not invalid. We should make sure there is a
+			 * unique copy in the new layer.
+			 */
 		} else {
 			unsigned int num_parents;
 
+			deduped_commits.list[deduped_commits.nr] = ctx->commits.list[i];
+			deduped_commits.nr++;
+
 			num_parents = commit_list_count(ctx->commits.list[i]->parents);
 			if (num_parents > 2)
 				ctx->num_extra_edges += num_parents - 1;
 		}
 	}
 
+	free(ctx->commits.list);
+	ctx->commits.list = deduped_commits.list;
+	ctx->commits.nr = deduped_commits.nr;
+	ctx->commits.alloc = deduped_commits.alloc;
+
 	stop_progress(&ctx->progress);
 }
 

base-commit: d98273ba77e1ab9ec755576bc86c716a97bf59d7
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] commit-graph: ignore duplicates when merging layers
  2020-10-08 13:56 [PATCH] commit-graph: ignore duplicates when merging layers Derrick Stolee via GitGitGadget
@ 2020-10-08 14:15 ` Taylor Blau
  2020-10-08 14:29   ` Derrick Stolee
  2020-10-08 14:59 ` [PATCH v2] " Derrick Stolee via GitGitGadget
  1 sibling, 1 reply; 14+ messages in thread
From: Taylor Blau @ 2020-10-08 14:15 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, Derrick Stolee, Derrick Stolee

On Thu, Oct 08, 2020 at 01:56:52PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Thomas reported [1] that a "git fetch" command was failing with an error
> saying "unexpected duplicate commit id". The root cause is that they had
> fetch.writeCommitGraph enabled which generates commit-graph chains, and
> this instance was merging two layers that both contained the same commit
> ID.
>
> [1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/
>
> The initial assumption is that Git would not write a commit ID into a
> commit-graph layer if it already exists in a lower commit-graph layer.
> Somehow, this specific case did get into that situation, leading to this
> error.
>
> While unexpected, this isn't actually invalid (as long as the two layers
> agree on the metadata for the commit). When we parse a commit that does
> not have a graph_pos in the commit_graph_data_slab, we use binary search
> in the commit-graph layers to find the commit and set graph_pos. That
> position is never used again in this case. However, when we parse a
> commit from the commit-graph file, we load its parents from the
> commit-graph and assign graph_pos at that point. If those parents were
> already parsed from the commit-graph, then nothing needs to be done.
> Otherwise, this graph_pos is a valid position in the commit-graph so we
> can parse the parents, when necessary.
>
> Thus, this die() is too aggignoring the duplicates.

s/aggignoring/aggressively ignoring ?

>
> This leads to some additional complication that we did no have before:

s/no/not, but I am more wondering about what "This" is. I think what
you're saying is: "Suppose we didn't die on duplicates, what would
happen? Well, there'd be some additional problems, but here's a way that
we can fix them (storing the de-duplicated OIDs separately)".

> if we only ignore the duplicates, then we will produce a commit-graph
> that has identical commit IDs listed in adjacent positions. This excess
> data will never be removed from the commit-graph, which could cascade
> into significantly bloated file sizes.
>
> Begrudgingly, the best way to fix this is to copy the commit pointers
> into a new list that only contains de-duplicated commit IDs. This adds a
> slight memory overhead, but it is small compared to having all of these
> commits parsed in memory, so it should be an acceptable cost for
> avoiding these failures.
>
> Since the root cause for producing commit-graph layers with these
> duplicate commits is currently unknown, it is difficult to create a test
> for this scenario. For now, we must rely on testing the example data
> graciously provided in [1]. My local test successfully merged layers,
> and 'git commit-graph verify' passed.
>
> Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
> Co-authored-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>     commit-graph: ignore duplicates when merging layers
>
>     This wasn't quite as simple as what Peff had posted, since we really
>     don't want to keep duplicate commits around in the new merged layer.
>
>     I still don't have a grasp on how this happened in the first place, but
>     will keep looking.

I'm looking as well, but I haven't found any smoking guns yet. I could
imagine that this is a problem that existed before 0bd52e27e3
(commit-graph.h: store an odb in 'struct write_commit_graph_context',
2020-02-03), and simply couldn't be tickled because of how brittle
comparing ODB paths is. I could equally imagine that 0bd52e27e3 did
introduce this problem.
>
>     Thanks, -Stolee
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-747%2Fderrickstolee%2Fcommit-graph-dup-commits-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-747/derrickstolee/commit-graph-dup-commits-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/747
>
>  commit-graph.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index cb042bdba8..29bac78dc3 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -2009,6 +2009,7 @@ static int commit_compare(const void *_a, const void *_b)
>  static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
>  {
>  	uint32_t i;
> +	struct packed_commit_list deduped_commits = { NULL, 0, 0 };
>
>  	if (ctx->report_progress)
>  		ctx->progress = start_delayed_progress(
> @@ -2016,6 +2017,8 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
>  					ctx->commits.nr);
>
>  	QSORT(ctx->commits.list, ctx->commits.nr, commit_compare);
> +	deduped_commits.alloc = ctx->commits.nr;
> +	ALLOC_ARRAY(deduped_commits.list, deduped_commits.alloc);

I'm not sure that this deduped_commits list is actually necessary.

It would be nice for this caller if ctx->commits were a linked list
since it would make deleting duplicates easy, but I think that it would
be a burden for other callers. So, that's a dead end.

But what about marking the duplicate positions by NULL-ing them out, and
then taking another pass over the list to (1) compact it (i.e., push
everything down so that all of the NULLs occur at the end), and then (2)
truncate the length to the number of unique commits.

I could imagine that something like that is a little trickier, but it
seems worth it to avoid doubling the memory cost of this function.

>  	ctx->num_extra_edges = 0;
>  	for (i = 0; i < ctx->commits.nr; i++) {
> @@ -2023,17 +2026,30 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
>
>  		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
>  			  &ctx->commits.list[i]->object.oid)) {
> -			die(_("unexpected duplicate commit id %s"),
> -			    oid_to_hex(&ctx->commits.list[i]->object.oid));
> +			/*
> +			 * Silently ignore duplicates. These were likely
> +			 * created due to a commit appearing in multiple
> +			 * layers of the chain, which is unexpected but
> +			 * not invalid. We should make sure there is a
> +			 * unique copy in the new layer.
> +			 */
>  		} else {
>  			unsigned int num_parents;
>
> +			deduped_commits.list[deduped_commits.nr] = ctx->commits.list[i];
> +			deduped_commits.nr++;
> +
>  			num_parents = commit_list_count(ctx->commits.list[i]->parents);
>  			if (num_parents > 2)
>  				ctx->num_extra_edges += num_parents - 1;
>  		}
>  	}
>
> +	free(ctx->commits.list);
> +	ctx->commits.list = deduped_commits.list;
> +	ctx->commits.nr = deduped_commits.nr;
> +	ctx->commits.alloc = deduped_commits.alloc;
> +
>  	stop_progress(&ctx->progress);
>  }
>
>
> base-commit: d98273ba77e1ab9ec755576bc86c716a97bf59d7
> --
> gitgitgadget

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] commit-graph: ignore duplicates when merging layers
  2020-10-08 14:15 ` Taylor Blau
@ 2020-10-08 14:29   ` Derrick Stolee
  0 siblings, 0 replies; 14+ messages in thread
From: Derrick Stolee @ 2020-10-08 14:29 UTC (permalink / raw)
  To: Taylor Blau, Derrick Stolee via GitGitGadget
  Cc: git, Derrick Stolee, Derrick Stolee

On 10/8/2020 10:15 AM, Taylor Blau wrote:
> On Thu, Oct 08, 2020 at 01:56:52PM +0000, Derrick Stolee via GitGitGadget wrote:
>> Thus, this die() is too aggignoring the duplicates.
> 
> s/aggignoring/aggressively ignoring ?
> 
>>
>> This leads to some additional complication that we did no have before:
> 
> s/no/not, but I am more wondering about what "This" is. I think what
> you're saying is: "Suppose we didn't die on duplicates, what would
> happen? Well, there'd be some additional problems, but here's a way that
> we can fix them (storing the de-duplicated OIDs separately)".

Thanks. The message will be edited to fix these brain farts.

>>     I still don't have a grasp on how this happened in the first place, but
>>     will keep looking.
> 
> I'm looking as well, but I haven't found any smoking guns yet. I could
> imagine that this is a problem that existed before 0bd52e27e3
> (commit-graph.h: store an odb in 'struct write_commit_graph_context',
> 2020-02-03), and simply couldn't be tickled because of how brittle
> comparing ODB paths is. I could equally imagine that 0bd52e27e3 did
> introduce this problem.

Thanks.

>> +	ALLOC_ARRAY(deduped_commits.list, deduped_commits.alloc);
> 
> I'm not sure that this deduped_commits list is actually necessary.
> 
> It would be nice for this caller if ctx->commits were a linked list
> since it would make deleting duplicates easy, but I think that it would
> be a burden for other callers. So, that's a dead end.
> 
> But what about marking the duplicate positions by NULL-ing them out, and
> then taking another pass over the list to (1) compact it (i.e., push
> everything down so that all of the NULLs occur at the end), and then (2)
> truncate the length to the number of unique commits.
> 
> I could imagine that something like that is a little trickier, but it
> seems worth it to avoid doubling the memory cost of this function.

You are correct that we can just re-use the commits.list by "collapsing"
the list on top of duplicate entries. I'll send a new version that does
exactly that.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2] commit-graph: ignore duplicates when merging layers
  2020-10-08 13:56 [PATCH] commit-graph: ignore duplicates when merging layers Derrick Stolee via GitGitGadget
  2020-10-08 14:15 ` Taylor Blau
@ 2020-10-08 14:59 ` Derrick Stolee via GitGitGadget
  2020-10-08 15:04   ` [PATCH v3] " Derrick Stolee via GitGitGadget
  1 sibling, 1 reply; 14+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-10-08 14:59 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Thomas reported [1] that a "git fetch" command was failing with an error
saying "unexpected duplicate commit id". The root cause is that they had
fetch.writeCommitGraph enabled which generates commit-graph chains, and
this instance was merging two layers that both contained the same commit
ID.

[1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/

The initial assumption is that Git would not write a commit ID into a
commit-graph layer if it already exists in a lower commit-graph layer.
Somehow, this specific case did get into that situation, leading to this
error.

While unexpected, this isn't actually invalid (as long as the two layers
agree on the metadata for the commit). When we parse a commit that does
not have a graph_pos in the commit_graph_data_slab, we use binary search
in the commit-graph layers to find the commit and set graph_pos. That
position is never used again in this case. However, when we parse a
commit from the commit-graph file, we load its parents from the
commit-graph and assign graph_pos at that point. If those parents were
already parsed from the commit-graph, then nothing needs to be done.
Otherwise, this graph_pos is a valid position in the commit-graph so we
can parse the parents, when necessary.

Thus, this die() is too aggressive. The easiest thing to do would be to
ignore the duplicates.

If we only ignore the duplicates, then we will produce a commit-graph
that has identical commit IDs listed in adjacent positions. This excess
data will never be removed from the commit-graph, which could cascade
into significantly bloated file sizes.

Thankfully, we can collapse the list to erase the duplicate commit
pointers. This allows us to get the end result we want without extra
memory costs and minimal CPU time.

Since the root cause for producing commit-graph layers with these
duplicate commits is currently unknown, it is difficult to create a test
for this scenario. For now, we must rely on testing the example data
graciously provided in [1]. My local test successfully merged layers,
and 'git commit-graph verify' passed.

Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Helped-by: Taylor Blau <me@ttaylorr.com>
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
    commit-graph: ignore duplicates when merging layers
    
    This wasn't quite as simple as what Peff had posted, since we really
    don't want to keep duplicate commits around in the new merged layer.
    
    I still don't have a grasp on how this happened in the first place, but
    will keep looking.
    
    Thanks, -Stolee

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-747%2Fderrickstolee%2Fcommit-graph-dup-commits-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-747/derrickstolee/commit-graph-dup-commits-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/747

Range-diff vs v1:

 1:  b3342403f1 ! 1:  85f4e578b8 commit-graph: ignore duplicates when merging layers
     @@ Commit message
          Otherwise, this graph_pos is a valid position in the commit-graph so we
          can parse the parents, when necessary.
      
     -    Thus, this die() is too aggignoring the duplicates.
     +    Thus, this die() is too aggressive. The easiest thing to do would be to
     +    ignore the duplicates.
      
     -    This leads to some additional complication that we did no have before:
     -    if we only ignore the duplicates, then we will produce a commit-graph
     +    If we only ignore the duplicates, then we will produce a commit-graph
          that has identical commit IDs listed in adjacent positions. This excess
          data will never be removed from the commit-graph, which could cascade
          into significantly bloated file sizes.
      
     -    Begrudgingly, the best way to fix this is to copy the commit pointers
     -    into a new list that only contains de-duplicated commit IDs. This adds a
     -    slight memory overhead, but it is small compared to having all of these
     -    commits parsed in memory, so it should be an acceptable cost for
     -    avoiding these failures.
     +    Thankfully, we can collapse the list to erase the duplicate commit
     +    pointers. This allows us to get the end result we want without extra
     +    memory costs and minimal CPU time.
      
          Since the root cause for producing commit-graph layers with these
          duplicate commits is currently unknown, it is difficult to create a test
     @@ Commit message
          and 'git commit-graph verify' passed.
      
          Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
     +    Helped-by: Taylor Blau <me@ttaylorr.com>
          Co-authored-by: Jeff King <peff@peff.net>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      


 commit-graph.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..29bac78dc3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -2009,6 +2009,7 @@ static int commit_compare(const void *_a, const void *_b)
 static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
+	struct packed_commit_list deduped_commits = { NULL, 0, 0 };
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -2016,6 +2017,8 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 					ctx->commits.nr);
 
 	QSORT(ctx->commits.list, ctx->commits.nr, commit_compare);
+	deduped_commits.alloc = ctx->commits.nr;
+	ALLOC_ARRAY(deduped_commits.list, deduped_commits.alloc);
 
 	ctx->num_extra_edges = 0;
 	for (i = 0; i < ctx->commits.nr; i++) {
@@ -2023,17 +2026,30 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 
 		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
 			  &ctx->commits.list[i]->object.oid)) {
-			die(_("unexpected duplicate commit id %s"),
-			    oid_to_hex(&ctx->commits.list[i]->object.oid));
+			/*
+			 * Silently ignore duplicates. These were likely
+			 * created due to a commit appearing in multiple
+			 * layers of the chain, which is unexpected but
+			 * not invalid. We should make sure there is a
+			 * unique copy in the new layer.
+			 */
 		} else {
 			unsigned int num_parents;
 
+			deduped_commits.list[deduped_commits.nr] = ctx->commits.list[i];
+			deduped_commits.nr++;
+
 			num_parents = commit_list_count(ctx->commits.list[i]->parents);
 			if (num_parents > 2)
 				ctx->num_extra_edges += num_parents - 1;
 		}
 	}
 
+	free(ctx->commits.list);
+	ctx->commits.list = deduped_commits.list;
+	ctx->commits.nr = deduped_commits.nr;
+	ctx->commits.alloc = deduped_commits.alloc;
+
 	stop_progress(&ctx->progress);
 }
 

base-commit: d98273ba77e1ab9ec755576bc86c716a97bf59d7
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3] commit-graph: ignore duplicates when merging layers
  2020-10-08 14:59 ` [PATCH v2] " Derrick Stolee via GitGitGadget
@ 2020-10-08 15:04   ` Derrick Stolee via GitGitGadget
  2020-10-08 15:53     ` Jeff King
  2020-10-09 20:53     ` [PATCH v4 0/2] " Derrick Stolee via GitGitGadget
  0 siblings, 2 replies; 14+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-10-08 15:04 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Thomas reported [1] that a "git fetch" command was failing with an error
saying "unexpected duplicate commit id". The root cause is that they had
fetch.writeCommitGraph enabled which generates commit-graph chains, and
this instance was merging two layers that both contained the same commit
ID.

[1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/

The initial assumption is that Git would not write a commit ID into a
commit-graph layer if it already exists in a lower commit-graph layer.
Somehow, this specific case did get into that situation, leading to this
error.

While unexpected, this isn't actually invalid (as long as the two layers
agree on the metadata for the commit). When we parse a commit that does
not have a graph_pos in the commit_graph_data_slab, we use binary search
in the commit-graph layers to find the commit and set graph_pos. That
position is never used again in this case. However, when we parse a
commit from the commit-graph file, we load its parents from the
commit-graph and assign graph_pos at that point. If those parents were
already parsed from the commit-graph, then nothing needs to be done.
Otherwise, this graph_pos is a valid position in the commit-graph so we
can parse the parents, when necessary.

Thus, this die() is too aggressive. The easiest thing to do would be to
ignore the duplicates.

If we only ignore the duplicates, then we will produce a commit-graph
that has identical commit IDs listed in adjacent positions. This excess
data will never be removed from the commit-graph, which could cascade
into significantly bloated file sizes.

Thankfully, we can collapse the list to erase the duplicate commit
pointers. This allows us to get the end result we want without extra
memory costs and minimal CPU time.

Since the root cause for producing commit-graph layers with these
duplicate commits is currently unknown, it is difficult to create a test
for this scenario. For now, we must rely on testing the example data
graciously provided in [1]. My local test successfully merged layers,
and 'git commit-graph verify' passed.

Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Helped-by: Taylor Blau <me@ttaylorr.com>
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
    commit-graph: ignore duplicates when merging layers
    
    This wasn't quite as simple as what Peff had posted, since we really
    don't want to keep duplicate commits around in the new merged layer.
    
    I still don't have a grasp on how this happened in the first place, but
    will keep looking.
    
    Thanks, -Stolee
    
    APOLOGIES: v2 accidentally only changed the commit message, not the
    patch contents. Please ignore v2 and go straight to v3.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-747%2Fderrickstolee%2Fcommit-graph-dup-commits-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-747/derrickstolee/commit-graph-dup-commits-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/747

Range-diff vs v2:

 1:  85f4e578b8 ! 1:  9e760f07ac commit-graph: ignore duplicates when merging layers
     @@ Commit message
      
       ## commit-graph.c ##
      @@ commit-graph.c: static int commit_compare(const void *_a, const void *_b)
     + 
       static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
       {
     - 	uint32_t i;
     -+	struct packed_commit_list deduped_commits = { NULL, 0, 0 };
     +-	uint32_t i;
     ++	uint32_t i, dedup_i = 0;
       
       	if (ctx->report_progress)
       		ctx->progress = start_delayed_progress(
      @@ commit-graph.c: static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
     - 					ctx->commits.nr);
     - 
     - 	QSORT(ctx->commits.list, ctx->commits.nr, commit_compare);
     -+	deduped_commits.alloc = ctx->commits.nr;
     -+	ALLOC_ARRAY(deduped_commits.list, deduped_commits.alloc);
     - 
     - 	ctx->num_extra_edges = 0;
     - 	for (i = 0; i < ctx->commits.nr; i++) {
     -@@ commit-graph.c: static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
       
       		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
       			  &ctx->commits.list[i]->object.oid)) {
     @@ commit-graph.c: static void sort_and_scan_merged_commits(struct write_commit_gra
       		} else {
       			unsigned int num_parents;
       
     -+			deduped_commits.list[deduped_commits.nr] = ctx->commits.list[i];
     -+			deduped_commits.nr++;
     ++			ctx->commits.list[dedup_i] = ctx->commits.list[i];
     ++			dedup_i++;
      +
       			num_parents = commit_list_count(ctx->commits.list[i]->parents);
       			if (num_parents > 2)
     @@ commit-graph.c: static void sort_and_scan_merged_commits(struct write_commit_gra
       		}
       	}
       
     -+	free(ctx->commits.list);
     -+	ctx->commits.list = deduped_commits.list;
     -+	ctx->commits.nr = deduped_commits.nr;
     -+	ctx->commits.alloc = deduped_commits.alloc;
     ++	ctx->commits.nr = dedup_i;
      +
       	stop_progress(&ctx->progress);
       }


 commit-graph.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..0280dcb2ce 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -2008,7 +2008,7 @@ static int commit_compare(const void *_a, const void *_b)
 
 static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 {
-	uint32_t i;
+	uint32_t i, dedup_i = 0;
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -2023,17 +2023,27 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 
 		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
 			  &ctx->commits.list[i]->object.oid)) {
-			die(_("unexpected duplicate commit id %s"),
-			    oid_to_hex(&ctx->commits.list[i]->object.oid));
+			/*
+			 * Silently ignore duplicates. These were likely
+			 * created due to a commit appearing in multiple
+			 * layers of the chain, which is unexpected but
+			 * not invalid. We should make sure there is a
+			 * unique copy in the new layer.
+			 */
 		} else {
 			unsigned int num_parents;
 
+			ctx->commits.list[dedup_i] = ctx->commits.list[i];
+			dedup_i++;
+
 			num_parents = commit_list_count(ctx->commits.list[i]->parents);
 			if (num_parents > 2)
 				ctx->num_extra_edges += num_parents - 1;
 		}
 	}
 
+	ctx->commits.nr = dedup_i;
+
 	stop_progress(&ctx->progress);
 }
 

base-commit: d98273ba77e1ab9ec755576bc86c716a97bf59d7
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3] commit-graph: ignore duplicates when merging layers
  2020-10-08 15:04   ` [PATCH v3] " Derrick Stolee via GitGitGadget
@ 2020-10-08 15:53     ` Jeff King
  2020-10-08 16:26       ` Derrick Stolee
  2020-10-09 20:53     ` [PATCH v4 0/2] " Derrick Stolee via GitGitGadget
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff King @ 2020-10-08 15:53 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Taylor Blau, Derrick Stolee, Derrick Stolee, Derrick Stolee

On Thu, Oct 08, 2020 at 03:04:39PM +0000, Derrick Stolee via GitGitGadget wrote:

> Since the root cause for producing commit-graph layers with these
> duplicate commits is currently unknown, it is difficult to create a test
> for this scenario. For now, we must rely on testing the example data
> graciously provided in [1]. My local test successfully merged layers,
> and 'git commit-graph verify' passed.

Yeah, that is unfortunate. We could synthetically create such a graph
file, but I'm not sure if it's worth the trouble.

> @@ -2023,17 +2023,27 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
>  
>  		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
>  			  &ctx->commits.list[i]->object.oid)) {
> -			die(_("unexpected duplicate commit id %s"),
> -			    oid_to_hex(&ctx->commits.list[i]->object.oid));
> +			/*
> +			 * Silently ignore duplicates. These were likely
> +			 * created due to a commit appearing in multiple
> +			 * layers of the chain, which is unexpected but
> +			 * not invalid. We should make sure there is a
> +			 * unique copy in the new layer.
> +			 */

You mentioned earlier checking tha the metadata for the duplicates was
identical. How hard would that be to do here?

>  		} else {
>  			unsigned int num_parents;
>  
> +			ctx->commits.list[dedup_i] = ctx->commits.list[i];
> +			dedup_i++;
> +

This in-place de-duping is much nicer than what was in v1. There's still
a slight cost to the common case when we have no duplicates, but it's
minor (just an extra noop self-assignment of each index).

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3] commit-graph: ignore duplicates when merging layers
  2020-10-08 15:53     ` Jeff King
@ 2020-10-08 16:26       ` Derrick Stolee
  2020-10-08 16:42         ` Taylor Blau
  2020-10-08 16:43         ` Jeff King
  0 siblings, 2 replies; 14+ messages in thread
From: Derrick Stolee @ 2020-10-08 16:26 UTC (permalink / raw)
  To: Jeff King, Derrick Stolee via GitGitGadget
  Cc: git, Taylor Blau, Derrick Stolee, Derrick Stolee

On 10/8/2020 11:53 AM, Jeff King wrote:
> On Thu, Oct 08, 2020 at 03:04:39PM +0000, Derrick Stolee via GitGitGadget wrote:
>> @@ -2023,17 +2023,27 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
>>  
>>  		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
>>  			  &ctx->commits.list[i]->object.oid)) {
>> -			die(_("unexpected duplicate commit id %s"),
>> -			    oid_to_hex(&ctx->commits.list[i]->object.oid));
>> +			/*
>> +			 * Silently ignore duplicates. These were likely
>> +			 * created due to a commit appearing in multiple
>> +			 * layers of the chain, which is unexpected but
>> +			 * not invalid. We should make sure there is a
>> +			 * unique copy in the new layer.
>> +			 */
> 
> You mentioned earlier checking tha the metadata for the duplicates was
> identical. How hard would that be to do here?

I do think it is a bit tricky, since we would need to identify
from these duplicates which commit-graph layers they live in,
then compare the binary data in each row (for tree, date, generation)
and also logical data (convert parent int-ids into oids). One way
to do this would be to create distinct 'struct commit' objects (do
not use lookup_commit()) but finding the two positions within the
layers is the hard part.

At this point, any disagreement between rows would be corrupt data
in one or the other, and it should be caught by the 'verify'
subcommand. It definitely would be caught by 'verify' in the merged
layer after the 'write' completes.

At this point, we don't have any evidence that whatever causes the
duplicate rows could possibly write the wrong data to the duplicate
rows. I'll keep it in mind as we look for that root cause.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3] commit-graph: ignore duplicates when merging layers
  2020-10-08 16:26       ` Derrick Stolee
@ 2020-10-08 16:42         ` Taylor Blau
  2020-10-08 16:43         ` Jeff King
  1 sibling, 0 replies; 14+ messages in thread
From: Taylor Blau @ 2020-10-08 16:42 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jeff King, Derrick Stolee via GitGitGadget, git, Taylor Blau,
	Derrick Stolee, Derrick Stolee

On Thu, Oct 08, 2020 at 12:26:29PM -0400, Derrick Stolee wrote:
> At this point, any disagreement between rows would be corrupt data
> in one or the other, and it should be caught by the 'verify'
> subcommand. It definitely would be caught by 'verify' in the merged
> layer after the 'write' completes.

Yeah, I'm fine with assuming that this data is correct here, since we
would have already "checked" it after we wrote it.

Of course, that means that if we find another commit-graph bug that
writes bad data and fix it in a future version, old commit-graphs with
duplicate objects have a chance to persist their data.

But, we again have 'git commit-graph verify' as a last resort there, so
I think it's OK.

> At this point, we don't have any evidence that whatever causes the
> duplicate rows could possibly write the wrong data to the duplicate
> rows. I'll keep it in mind as we look for that root cause.

Thanks.

Taylor

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3] commit-graph: ignore duplicates when merging layers
  2020-10-08 16:26       ` Derrick Stolee
  2020-10-08 16:42         ` Taylor Blau
@ 2020-10-08 16:43         ` Jeff King
  1 sibling, 0 replies; 14+ messages in thread
From: Jeff King @ 2020-10-08 16:43 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Taylor Blau,
	Derrick Stolee, Derrick Stolee

On Thu, Oct 08, 2020 at 12:26:29PM -0400, Derrick Stolee wrote:

> >> +			/*
> >> +			 * Silently ignore duplicates. These were likely
> >> +			 * created due to a commit appearing in multiple
> >> +			 * layers of the chain, which is unexpected but
> >> +			 * not invalid. We should make sure there is a
> >> +			 * unique copy in the new layer.
> >> +			 */
> > 
> > You mentioned earlier checking tha the metadata for the duplicates was
> > identical. How hard would that be to do here?
> 
> I do think it is a bit tricky, since we would need to identify
> from these duplicates which commit-graph layers they live in,
> then compare the binary data in each row (for tree, date, generation)
> and also logical data (convert parent int-ids into oids). One way
> to do this would be to create distinct 'struct commit' objects (do
> not use lookup_commit()) but finding the two positions within the
> layers is the hard part.

OK, that sounds sufficiently hard that it isn't worth doing. I wondered
if there was easy access since we had the commit_graph handles here. But
I guess it really depends on which chunks are even available.

> At this point, any disagreement between rows would be corrupt data
> in one or the other, and it should be caught by the 'verify'
> subcommand. It definitely would be caught by 'verify' in the merged
> layer after the 'write' completes.
> 
> At this point, we don't have any evidence that whatever causes the
> duplicate rows could possibly write the wrong data to the duplicate
> rows. I'll keep it in mind as we look for that root cause.

That makes sense. I wonder if it is worth tipping the user off that
something funny is going on, and they may want to run "verify". I.e.,
should we be downgrading the die() to a warning(), rather than silently
skipping the duplicate.

I guess it depends on how often we expect this to happen. If the root
cause turns out to be some race that's unusual but may come up from time
to time, then the warning would unnecessarily alarm people, and/or be
annoying. But we don't know yet if that's the case.

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v4 0/2] commit-graph: ignore duplicates when merging layers
  2020-10-08 15:04   ` [PATCH v3] " Derrick Stolee via GitGitGadget
  2020-10-08 15:53     ` Jeff King
@ 2020-10-09 20:53     ` Derrick Stolee via GitGitGadget
  2020-10-09 20:53       ` [PATCH v4 1/2] " Derrick Stolee via GitGitGadget
  2020-10-09 20:53       ` [PATCH v4 2/2] commit-graph: don't write commit-graph when disabled Derrick Stolee via GitGitGadget
  1 sibling, 2 replies; 14+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-10-09 20:53 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau, Derrick Stolee, Jeff King, Derrick Stolee

Here is v4 of a series that was previously only one patch.

Thanks to the efforts of Thomas, Peff, and Taylor, we have a full
understanding of what went wrong here. Details are in the patches
themselves, but generally when writing a commit-graph chain we rely on the
commit-graph parsing to know which commits are already in the lower layers
of the chain. When 'core.commitGraph' is disabled, then we don't do that
parsing, and then we add all reachable commits to the top layer and merge
down. This causes us to hit the previous die() statement.

This fixes the problem by first handling any duplicates we see during a
merge (this is important for handling any other data out there in this
situation) and also to disable commit-graph writes when 'core.commitGraph'
is disabled.

Thanks, -Stolee

Derrick Stolee (2):
  commit-graph: ignore duplicates when merging layers
  commit-graph: don't write commit-graph when disabled

 Documentation/git-commit-graph.txt |  4 +++-
 commit-graph.c                     | 21 ++++++++++++++++++---
 t/t5324-split-commit-graph.sh      | 13 +++++++++++++
 3 files changed, 34 insertions(+), 4 deletions(-)


base-commit: d98273ba77e1ab9ec755576bc86c716a97bf59d7
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-747%2Fderrickstolee%2Fcommit-graph-dup-commits-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-747/derrickstolee/commit-graph-dup-commits-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/747

Range-diff vs v3:

 1:  9e760f07ac ! 1:  9279aea3ef commit-graph: ignore duplicates when merging layers
     @@ Commit message
          pointers. This allows us to get the end result we want without extra
          memory costs and minimal CPU time.
      
     -    Since the root cause for producing commit-graph layers with these
     -    duplicate commits is currently unknown, it is difficult to create a test
     -    for this scenario. For now, we must rely on testing the example data
     -    graciously provided in [1]. My local test successfully merged layers,
     -    and 'git commit-graph verify' passed.
     +    The root cause is due to disabling core.commitGraph, which prevents
     +    parsing commits from the lower layers during a 'git commit-graph write
     +    --split' command. Since we use the 'graph_pos' value to determine
     +    whether a commit is in a lower layer, we never discover that those
     +    commits are already in the commit-graph chain and add them to the top
     +    layer. This layer is then merged down, creating duplicates.
     +
     +    The test added in t5324-split-commit-graph.sh fails without this change.
     +    However, we still have not completely removed the need for this
     +    duplicate check. That will come in a follow-up change.
      
          Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
          Helped-by: Taylor Blau <me@ttaylorr.com>
     @@ commit-graph.c: static void sort_and_scan_merged_commits(struct write_commit_gra
       	stop_progress(&ctx->progress);
       }
       
     +
     + ## t/t5324-split-commit-graph.sh ##
     +@@ t/t5324-split-commit-graph.sh: test_expect_success '--split=replace with partial Bloom data' '
     + 	verify_chain_files_exist $graphdir
     + '
     + 
     ++test_expect_success 'prevent regression for duplicate commits across layers' '
     ++	git init dup &&
     ++	git -C dup config core.commitGraph false &&
     ++	git -C dup commit --allow-empty -m one &&
     ++	git -C dup commit-graph write --split=no-merge --reachable &&
     ++	git -C dup commit --allow-empty -m two &&
     ++	git -C dup commit-graph write --split=no-merge --reachable &&
     ++	git -C dup commit --allow-empty -m three &&
     ++	git -C dup commit-graph write --split --reachable &&
     ++	git -C dup commit-graph verify
     ++'
     ++
     + test_done
 -:  ---------- > 2:  4439e8ae8f commit-graph: don't write commit-graph when disabled

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v4 1/2] commit-graph: ignore duplicates when merging layers
  2020-10-09 20:53     ` [PATCH v4 0/2] " Derrick Stolee via GitGitGadget
@ 2020-10-09 20:53       ` Derrick Stolee via GitGitGadget
  2020-10-09 20:53       ` [PATCH v4 2/2] commit-graph: don't write commit-graph when disabled Derrick Stolee via GitGitGadget
  1 sibling, 0 replies; 14+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-10-09 20:53 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Derrick Stolee, Jeff King, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Thomas reported [1] that a "git fetch" command was failing with an error
saying "unexpected duplicate commit id". The root cause is that they had
fetch.writeCommitGraph enabled which generates commit-graph chains, and
this instance was merging two layers that both contained the same commit
ID.

[1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/

The initial assumption is that Git would not write a commit ID into a
commit-graph layer if it already exists in a lower commit-graph layer.
Somehow, this specific case did get into that situation, leading to this
error.

While unexpected, this isn't actually invalid (as long as the two layers
agree on the metadata for the commit). When we parse a commit that does
not have a graph_pos in the commit_graph_data_slab, we use binary search
in the commit-graph layers to find the commit and set graph_pos. That
position is never used again in this case. However, when we parse a
commit from the commit-graph file, we load its parents from the
commit-graph and assign graph_pos at that point. If those parents were
already parsed from the commit-graph, then nothing needs to be done.
Otherwise, this graph_pos is a valid position in the commit-graph so we
can parse the parents, when necessary.

Thus, this die() is too aggressive. The easiest thing to do would be to
ignore the duplicates.

If we only ignore the duplicates, then we will produce a commit-graph
that has identical commit IDs listed in adjacent positions. This excess
data will never be removed from the commit-graph, which could cascade
into significantly bloated file sizes.

Thankfully, we can collapse the list to erase the duplicate commit
pointers. This allows us to get the end result we want without extra
memory costs and minimal CPU time.

The root cause is due to disabling core.commitGraph, which prevents
parsing commits from the lower layers during a 'git commit-graph write
--split' command. Since we use the 'graph_pos' value to determine
whether a commit is in a lower layer, we never discover that those
commits are already in the commit-graph chain and add them to the top
layer. This layer is then merged down, creating duplicates.

The test added in t5324-split-commit-graph.sh fails without this change.
However, we still have not completely removed the need for this
duplicate check. That will come in a follow-up change.

Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Helped-by: Taylor Blau <me@ttaylorr.com>
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c                | 16 +++++++++++++---
 t/t5324-split-commit-graph.sh | 12 ++++++++++++
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..0280dcb2ce 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -2008,7 +2008,7 @@ static int commit_compare(const void *_a, const void *_b)
 
 static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 {
-	uint32_t i;
+	uint32_t i, dedup_i = 0;
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -2023,17 +2023,27 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 
 		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
 			  &ctx->commits.list[i]->object.oid)) {
-			die(_("unexpected duplicate commit id %s"),
-			    oid_to_hex(&ctx->commits.list[i]->object.oid));
+			/*
+			 * Silently ignore duplicates. These were likely
+			 * created due to a commit appearing in multiple
+			 * layers of the chain, which is unexpected but
+			 * not invalid. We should make sure there is a
+			 * unique copy in the new layer.
+			 */
 		} else {
 			unsigned int num_parents;
 
+			ctx->commits.list[dedup_i] = ctx->commits.list[i];
+			dedup_i++;
+
 			num_parents = commit_list_count(ctx->commits.list[i]->parents);
 			if (num_parents > 2)
 				ctx->num_extra_edges += num_parents - 1;
 		}
 	}
 
+	ctx->commits.nr = dedup_i;
+
 	stop_progress(&ctx->progress);
 }
 
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index c334ee9155..a314ce0368 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -440,4 +440,16 @@ test_expect_success '--split=replace with partial Bloom data' '
 	verify_chain_files_exist $graphdir
 '
 
+test_expect_success 'prevent regression for duplicate commits across layers' '
+	git init dup &&
+	git -C dup config core.commitGraph false &&
+	git -C dup commit --allow-empty -m one &&
+	git -C dup commit-graph write --split=no-merge --reachable &&
+	git -C dup commit --allow-empty -m two &&
+	git -C dup commit-graph write --split=no-merge --reachable &&
+	git -C dup commit --allow-empty -m three &&
+	git -C dup commit-graph write --split --reachable &&
+	git -C dup commit-graph verify
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v4 2/2] commit-graph: don't write commit-graph when disabled
  2020-10-09 20:53     ` [PATCH v4 0/2] " Derrick Stolee via GitGitGadget
  2020-10-09 20:53       ` [PATCH v4 1/2] " Derrick Stolee via GitGitGadget
@ 2020-10-09 20:53       ` Derrick Stolee via GitGitGadget
  2020-10-09 21:12         ` Junio C Hamano
  2020-10-09 21:17         ` Taylor Blau
  1 sibling, 2 replies; 14+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-10-09 20:53 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Derrick Stolee, Jeff King, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The core.commitGraph config setting can be set to 'false' to prevent
parsing commits from the commit-graph file(s). This causes an issue when
trying to write with "--split" which needs to distinguish between
commits that are in the existing commit-graph layers and commits that
are not. The existing mechanism uses parse_commit() and follows by
checking if there is a 'graph_pos' that shows the commit was parsed from
the commit-graph file.

When core.commitGraph=false, we do not parse the commits from the
commit-graph and 'graph_pos' indicates that no commits are in the
existing file. The --split logic moves forward creating a new layer on
top that holds all reachable commits, then possibly merges down into
those layers, resulting in duplicate commits. The previous change makes
that merging process more robust to such a situation in case it happens
in the written commit-graph data.

The easy answer here is to avoid writing a commit-graph if reading the
commit-graph is disabled. Since the resulting commit-graph will would not
be read by subsequent Git processes. This is more natural than forcing
core.commitGraph to be true for the 'write' process.

Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-commit-graph.txt | 4 +++-
 commit-graph.c                     | 5 +++++
 t/t5324-split-commit-graph.sh      | 3 ++-
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index de6b6de230..e1f48c95b3 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -39,7 +39,9 @@ COMMANDS
 --------
 'write'::
 
-Write a commit-graph file based on the commits found in packfiles.
+Write a commit-graph file based on the commits found in packfiles. If
+the config option `core.commitGraph` is disabled, then this command will
+output a warning, then return success without writing a commit-graph file.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified pack-indexes. (Cannot be combined
diff --git a/commit-graph.c b/commit-graph.c
index 0280dcb2ce..6f62a07313 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -2160,6 +2160,11 @@ int write_commit_graph(struct object_directory *odb,
 	int replace = 0;
 	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
+	prepare_repo_settings(the_repository);
+	if (!the_repository->settings.core_commit_graph) {
+		warning(_("attempting to write a commit-graph, but 'core.commitGraph' is disabled"));
+		return 0;
+	}
 	if (!commit_graph_compatible(the_repository))
 		return 0;
 
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index a314ce0368..4d3842b83b 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -442,8 +442,9 @@ test_expect_success '--split=replace with partial Bloom data' '
 
 test_expect_success 'prevent regression for duplicate commits across layers' '
 	git init dup &&
-	git -C dup config core.commitGraph false &&
 	git -C dup commit --allow-empty -m one &&
+	git -C dup -c core.commitGraph=false commit-graph write --split=no-merge --reachable 2>err &&
+	test_i18ngrep "attempting to write a commit-graph" err &&
 	git -C dup commit-graph write --split=no-merge --reachable &&
 	git -C dup commit --allow-empty -m two &&
 	git -C dup commit-graph write --split=no-merge --reachable &&
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/2] commit-graph: don't write commit-graph when disabled
  2020-10-09 20:53       ` [PATCH v4 2/2] commit-graph: don't write commit-graph when disabled Derrick Stolee via GitGitGadget
@ 2020-10-09 21:12         ` Junio C Hamano
  2020-10-09 21:17         ` Taylor Blau
  1 sibling, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2020-10-09 21:12 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Taylor Blau, Derrick Stolee, Jeff King, Derrick Stolee,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +	if (!the_repository->settings.core_commit_graph) {
> +		warning(_("attempting to write a commit-graph, but 'core.commitGraph' is disabled"));
> +		return 0;
> +	}

Makes sense.  We probably would want to short-circuit invocation of
commit-graph related tasks in the maintenance by checking the
feature, if we are not already doing so.

Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v4 2/2] commit-graph: don't write commit-graph when disabled
  2020-10-09 20:53       ` [PATCH v4 2/2] commit-graph: don't write commit-graph when disabled Derrick Stolee via GitGitGadget
  2020-10-09 21:12         ` Junio C Hamano
@ 2020-10-09 21:17         ` Taylor Blau
  1 sibling, 0 replies; 14+ messages in thread
From: Taylor Blau @ 2020-10-09 21:17 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Taylor Blau, Derrick Stolee, Jeff King, Derrick Stolee,
	Derrick Stolee

On Fri, Oct 09, 2020 at 08:53:52PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The core.commitGraph config setting can be set to 'false' to prevent
> parsing commits from the commit-graph file(s). This causes an issue when
> trying to write with "--split" which needs to distinguish between
> commits that are in the existing commit-graph layers and commits that
> are not. The existing mechanism uses parse_commit() and follows by
> checking if there is a 'graph_pos' that shows the commit was parsed from
> the commit-graph file.
>
> When core.commitGraph=false, we do not parse the commits from the
> commit-graph and 'graph_pos' indicates that no commits are in the
> existing file. The --split logic moves forward creating a new layer on
> top that holds all reachable commits, then possibly merges down into
> those layers, resulting in duplicate commits. The previous change makes
> that merging process more robust to such a situation in case it happens
> in the written commit-graph data.

You're noting something interesting here which is that I actually think
setting 'core.commitGraph' _would_ be OK for non-split writes, and
'--split=replace' (along with any other split that happens to write a
single layer).

But, I think that actually enforcing that rule (i.e., "if you have
core.commitGraph set to false, you can't run `git commit-graph write`
except in X Y Z certain situations") is overly-complex and confusing to
users. So, I like what you have here a lot.

> The easy answer here is to avoid writing a commit-graph if reading the
> commit-graph is disabled. Since the resulting commit-graph will would not
> be read by subsequent Git processes. This is more natural than forcing
> core.commitGraph to be true for the 'write' process.
>
> Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
> Helped-by: Jeff King <peff@peff.net>
> Helped-by: Taylor Blau <me@ttaylorr.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

You can add my:

  Signed-off-by: Taylor Blau <me@ttaylorr.com>

to the patch below, too, unless you want to take my suggestion below...

> ---
>  Documentation/git-commit-graph.txt | 4 +++-
>  commit-graph.c                     | 5 +++++
>  t/t5324-split-commit-graph.sh      | 3 ++-
>  3 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index de6b6de230..e1f48c95b3 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -39,7 +39,9 @@ COMMANDS
>  --------
>  'write'::
>
> -Write a commit-graph file based on the commits found in packfiles.
> +Write a commit-graph file based on the commits found in packfiles. If
> +the config option `core.commitGraph` is disabled, then this command will
> +output a warning, then return success without writing a commit-graph file.
>  +
>  With the `--stdin-packs` option, generate the new commit graph by
>  walking objects only in the specified pack-indexes. (Cannot be combined
> diff --git a/commit-graph.c b/commit-graph.c
> index 0280dcb2ce..6f62a07313 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -2160,6 +2160,11 @@ int write_commit_graph(struct object_directory *odb,
>  	int replace = 0;
>  	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>
> +	prepare_repo_settings(the_repository);
> +	if (!the_repository->settings.core_commit_graph) {
> +		warning(_("attempting to write a commit-graph, but 'core.commitGraph' is disabled"));
> +		return 0;
> +	}

Should this check be folded into 'commit_graph_compatible()'? Maybe in
'prepare_commit_graph()' which itself calls 'commit_graph_compatible()'?
I admit that I find this chain of callers to be confusing.

I suppose one argument for checking it here _before_ calling
'commit_graph_compatible()' is that it allows you to issue a specific
warning before returning from this function, so I'm OK with it.

I also don't have a concrete suggestion of where a better place for this
hunk might be, so I'm fine with what you wrote.

>  	if (!commit_graph_compatible(the_repository))
>  		return 0;
>
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index a314ce0368..4d3842b83b 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -442,8 +442,9 @@ test_expect_success '--split=replace with partial Bloom data' '
>
>  test_expect_success 'prevent regression for duplicate commits across layers' '
>  	git init dup &&
> -	git -C dup config core.commitGraph false &&
>  	git -C dup commit --allow-empty -m one &&
> +	git -C dup -c core.commitGraph=false commit-graph write --split=no-merge --reachable 2>err &&
> +	test_i18ngrep "attempting to write a commit-graph" err &&
>  	git -C dup commit-graph write --split=no-merge --reachable &&
>  	git -C dup commit --allow-empty -m two &&
>  	git -C dup commit-graph write --split=no-merge --reachable &&

Hmm. I would have preferred to see a new test here. Unless I'm wrong, I
believe the patched version of this test _doesn't_ have a duplicate
commit across multiple layers:

  - We try to write a layer with 'one', but don't (because
    'core.commitGraph' is set to false).

  - Then we write a layer for 'one' with 'core.commitGraph' unset.

  - Then we write a layer for 'two' (and only 'two'), since we read the
    below layer containing 'one'.

But, I'm not sure of a better way to test this, either. You fixed the
bug that this is trying to exercise, so it's no longer being exercised
here, but then again neither is the new code that is supposed to handle
it.

I wonder if it is maybe worth having some sample commit-graphs laying
around in a t5324 directory that _would_ demonstrate this problem. OTOH,
maybe that is just me being overly pedantic and worrying about something
that isn't actually a problem.

I trust your judgement, so whatever you feel like is fine with me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-10-09 21:17 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-08 13:56 [PATCH] commit-graph: ignore duplicates when merging layers Derrick Stolee via GitGitGadget
2020-10-08 14:15 ` Taylor Blau
2020-10-08 14:29   ` Derrick Stolee
2020-10-08 14:59 ` [PATCH v2] " Derrick Stolee via GitGitGadget
2020-10-08 15:04   ` [PATCH v3] " Derrick Stolee via GitGitGadget
2020-10-08 15:53     ` Jeff King
2020-10-08 16:26       ` Derrick Stolee
2020-10-08 16:42         ` Taylor Blau
2020-10-08 16:43         ` Jeff King
2020-10-09 20:53     ` [PATCH v4 0/2] " Derrick Stolee via GitGitGadget
2020-10-09 20:53       ` [PATCH v4 1/2] " Derrick Stolee via GitGitGadget
2020-10-09 20:53       ` [PATCH v4 2/2] commit-graph: don't write commit-graph when disabled Derrick Stolee via GitGitGadget
2020-10-09 21:12         ` Junio C Hamano
2020-10-09 21:17         ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).