* [PATCH 0/6] Compute and consume generation numbers
@ 2018-04-03 16:51 Derrick Stolee
  0 siblings, 10 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-03 16:51 UTC
To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

This is the first of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a constant-time check in
queue_has_nonstale() instead of the previous linear-time check.

This does not have a significant performance benefit in repositories
of normal size, but in the Windows repository, some merge-base
calculations improve from 3.1s to 2.9s. A modest speedup, but provides
an actual consumer of generation numbers as a starting point.

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series depends on v7 of ds/commit-graph.

Derrick Stolee (6):
  object.c: parse commit in graph first
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: sort by generation number in paint_down_to_common()
  commit.c: use generation number to stop merge-base walks
  commit-graph.txt: update design doc with generation numbers

 Documentation/technical/commit-graph.txt |  7 +---
 alloc.c                                  |  1 +
 commit-graph.c                           | 48 +++++++++++++++++++++
 commit.c                                 | 53 ++++++++++++++++++++----
 commit.h                                 |  7 +++-
 object.c                                 |  4 +-
 6 files changed, 104 insertions(+), 16 deletions(-)

-- 
2.17.0.20.g9f30ba16e1

* [PATCH 1/6] object.c: parse commit in graph first
From: Derrick Stolee @ 2018-04-03 16:51 UTC
To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

Before adding generation numbers to the commit-graph, we need to ensure
that any commit that exists in the graph is loaded from the graph, so
check parse_commit_in_graph() before calling parse_commit_buffer().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 object.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/object.c b/object.c
index e6ad3f61f0..4cd3e98e04 100644
--- a/object.c
+++ b/object.c
@@ -3,6 +3,7 @@
 #include "blob.h"
 #include "tree.h"
 #include "commit.h"
+#include "commit-graph.h"
 #include "tag.h"
 
 static struct object **obj_hash;
@@ -207,7 +208,8 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
 	} else if (type == OBJ_COMMIT) {
 		struct commit *commit = lookup_commit(oid);
 		if (commit) {
-			if (parse_commit_buffer(commit, buffer, size))
+			if (!parse_commit_in_graph(commit) &&
+			    parse_commit_buffer(commit, buffer, size))
 				return NULL;
 			if (!get_cached_commit_buffer(commit, NULL)) {
 				set_commit_buffer(commit, buffer, size);
-- 
2.17.0.rc0

* Re: [PATCH 1/6] object.c: parse commit in graph first
From: Jonathan Tan @ 2018-04-03 18:21 UTC
To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue, 3 Apr 2018 12:51:38 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> Most code paths load commits using lookup_commit() and then
> parse_commit(). In some cases, including some branch lookups, the commit
> is parsed using parse_object_buffer() which side-steps parse_commit() in
> favor of parse_commit_buffer().
>
> Before adding generation numbers to the commit-graph, we need to ensure
> that any commit that exists in the graph is loaded from the graph, so
> check parse_commit_in_graph() before calling parse_commit_buffer().
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Modifying parse_object_buffer() is the most pragmatic way to accomplish
this, but this also means that parse_object_buffer() now potentially
reads from the local object store (instead of only relying on what's in
memory and what's in the provided buffer). parse_object_buffer() is
called by several callers including in builtin/fsck.c. I would feel more
comfortable if the relevant [1] caller to parse_object_buffer() was
modified instead of parse_object_buffer(), but I'll let others give
their opinions too.

[1] The caller which, if modified, will result in the speedup to the
merge-base calculations in the Windows repository you describe in your
cover letter.

* Re: [PATCH 1/6] object.c: parse commit in graph first
From: Jeff King @ 2018-04-03 18:28 UTC
To: Jonathan Tan; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On Tue, Apr 03, 2018 at 11:21:36AM -0700, Jonathan Tan wrote:

> On Tue, 3 Apr 2018 12:51:38 -0400
> Derrick Stolee <dstolee@microsoft.com> wrote:
>
> > Most code paths load commits using lookup_commit() and then
> > parse_commit(). In some cases, including some branch lookups, the commit
> > is parsed using parse_object_buffer() which side-steps parse_commit() in
> > favor of parse_commit_buffer().
> >
> > Before adding generation numbers to the commit-graph, we need to ensure
> > that any commit that exists in the graph is loaded from the graph, so
> > check parse_commit_in_graph() before calling parse_commit_buffer().
> >
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>
> Modifying parse_object_buffer() is the most pragmatic way to accomplish
> this, but this also means that parse_object_buffer() now potentially
> reads from the local object store (instead of only relying on what's in
> memory and what's in the provided buffer). parse_object_buffer() is
> called by several callers including in builtin/fsck.c. I would feel more
> comfortable if the relevant [1] caller to parse_object_buffer() was
> modified instead of parse_object_buffer(), but I'll let others give
> their opinions too.

It's not just you. This seems like a really odd place to put it.
Especially because if we have the buffer to pass to this function, then
we'd already have incurred the cost to inflate the object.

-Peff

* Re: [PATCH 1/6] object.c: parse commit in graph first
From: Derrick Stolee @ 2018-04-03 18:32 UTC
To: Jeff King, Jonathan Tan
Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On 4/3/2018 2:28 PM, Jeff King wrote:
> On Tue, Apr 03, 2018 at 11:21:36AM -0700, Jonathan Tan wrote:
>
>> On Tue, 3 Apr 2018 12:51:38 -0400
>> Derrick Stolee <dstolee@microsoft.com> wrote:
>>
>>> Most code paths load commits using lookup_commit() and then
>>> parse_commit(). In some cases, including some branch lookups, the commit
>>> is parsed using parse_object_buffer() which side-steps parse_commit() in
>>> favor of parse_commit_buffer().
>>>
>>> Before adding generation numbers to the commit-graph, we need to ensure
>>> that any commit that exists in the graph is loaded from the graph, so
>>> check parse_commit_in_graph() before calling parse_commit_buffer().
>>>
>>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> Modifying parse_object_buffer() is the most pragmatic way to accomplish
>> this, but this also means that parse_object_buffer() now potentially
>> reads from the local object store (instead of only relying on what's in
>> memory and what's in the provided buffer). parse_object_buffer() is
>> called by several callers including in builtin/fsck.c. I would feel more
>> comfortable if the relevant [1] caller to parse_object_buffer() was
>> modified instead of parse_object_buffer(), but I'll let others give
>> their opinions too.
> It's not just you. This seems like a really odd place to put it.
> Especially because if we have the buffer to pass to this function, then
> we'd already have incurred the cost to inflate the object.
>

OK. Thanks. I'll try to find the better place to put this check.

-Stolee

* [PATCH 2/6] commit: add generation number to struct commmit
From: Derrick Stolee @ 2018-04-03 16:51 UTC
To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use two special values to signal
the generation number is invalid:

GENERATION_NUMBER_UNDEF 0xFFFFFFFF
GENERATION_NUMBER_NONE 0

The first (_UNDEF) means the generation number has not been loaded or
computed. The second (_NONE) means the generation number was loaded
from a commit graph file that was stored before generation numbers
were computed.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 alloc.c        | 1 +
 commit-graph.c | 2 ++
 commit.h       | 3 +++
 3 files changed, 6 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..1a62e85ac3 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
 	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+	c->generation = GENERATION_NUMBER_UNDEF;
 	return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 1fc63d541b..d24b947525 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index e57ae4b583..3cadd386f3 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,8 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
+#define GENERATION_NUMBER_NONE 0
 
 struct commit_list {
 	struct commit *item;
@@ -24,6 +26,7 @@ struct commit {
 	struct commit_list *parents;
 	struct tree *tree;
 	uint32_t graph_pos;
+	uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.rc0

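[Editor's illustration] The recursive definition in the commit message above can be sketched on a toy DAG. This is a minimal sketch, not Git code: struct toy_commit, TOY_MAX_PARENTS and toy_generation() are hypothetical names introduced here, and the plain recursion is only meant to mirror the two bullet points (the series itself does not necessarily compute it this way).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PARENTS 2 /* hypothetical limit, for the sketch only */

/* Hypothetical stand-in for a commit: just its parent pointers. */
struct toy_commit {
	const struct toy_commit *parents[TOY_MAX_PARENTS];
	size_t nr_parents;
};

/*
 * Direct transcription of the definition:
 *  - no parents: generation is one;
 *  - otherwise: one more than the maximum generation among the parents.
 */
static uint32_t toy_generation(const struct toy_commit *c)
{
	uint32_t max_gen = 0;
	size_t i;

	assert(c != NULL && c->nr_parents <= TOY_MAX_PARENTS);

	for (i = 0; i < c->nr_parents; i++) {
		uint32_t g = toy_generation(c->parents[i]);
		if (g > max_gen)
			max_gen = g;
	}
	/* With no parents max_gen stays 0, so a root commit gets 1. */
	return max_gen + 1;
}
```

For a history root <- a <- c, root <- b, and a merge m of c and b, this gives 1 for root, 2 for a and b, 3 for c and 4 for m. Plain recursion revisits shared ancestors and can recurse deeply on long histories, so it is only suitable as an illustration of the definition.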
* Re: [PATCH 2/6] commit: add generation number to struct commmit
From: Brandon Williams @ 2018-04-03 18:05 UTC
To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On 04/03, Derrick Stolee wrote:
> The generation number of a commit is defined recursively as follows:
>
> * If a commit A has no parents, then the generation number of A is one.
> * If a commit A has parents, then the generation number of A is one
>   more than the maximum generation number among the parents of A.
>
> Add a uint32_t generation field to struct commit so we can pass this

Is there any reason to believe this would be too small of a value in the
future? Or is a 32 bit unsigned good enough?

> information to revision walks. We use two special values to signal
> the generation number is invalid:
>
> GENERATION_NUMBER_UNDEF 0xFFFFFFFF
> GENERATION_NUMBER_NONE 0
>
> The first (_UNDEF) means the generation number has not been loaded or
> computed. The second (_NONE) means the generation number was loaded
> from a commit graph file that was stored before generation numbers
> were computed.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  alloc.c        | 1 +
>  commit-graph.c | 2 ++
>  commit.h       | 3 +++
>  3 files changed, 6 insertions(+)
>
> diff --git a/alloc.c b/alloc.c
> index cf4f8b61e1..1a62e85ac3 100644
> --- a/alloc.c
> +++ b/alloc.c
> @@ -94,6 +94,7 @@ void *alloc_commit_node(void)
>  	c->object.type = OBJ_COMMIT;
>  	c->index = alloc_commit_index();
>  	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
> +	c->generation = GENERATION_NUMBER_UNDEF;
>  	return c;
>  }
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 1fc63d541b..d24b947525 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>
> +	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> +
>  	pptr = &item->parents;
>
>  	edge_value = get_be32(commit_data + g->hash_len);
> diff --git a/commit.h b/commit.h
> index e57ae4b583..3cadd386f3 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -10,6 +10,8 @@
>  #include "pretty.h"
>
>  #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
> +#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
> +#define GENERATION_NUMBER_NONE 0
>
>  struct commit_list {
>  	struct commit *item;
> @@ -24,6 +26,7 @@ struct commit {
>  	struct commit_list *parents;
>  	struct tree *tree;
>  	uint32_t graph_pos;
> +	uint32_t generation;
>  };
>
>  extern int save_commit_buffer;
> --
> 2.17.0.rc0
>

-- 
Brandon Williams

* Re: [PATCH 2/6] commit: add generation number to struct commmit
From: Jeff King @ 2018-04-03 18:28 UTC
To: Brandon Williams; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:

> On 04/03, Derrick Stolee wrote:
> > The generation number of a commit is defined recursively as follows:
> >
> > * If a commit A has no parents, then the generation number of A is one.
> > * If a commit A has parents, then the generation number of A is one
> >   more than the maximum generation number among the parents of A.
> >
> > Add a uint32_t generation field to struct commit so we can pass this
>
> Is there any reason to believe this would be too small of a value in the
> future? Or is a 32 bit unsigned good enough?

The linux kernel took ~10 years to produce 500k commits. Even assuming
those were all linear (and they're not), that gives us ~80,000 years of
leeway. So even if the pace of development speeds up or we have a
quicker project, it still seems we have a pretty reasonable safety
margin.

-Peff

* Re: [PATCH 2/6] commit: add generation number to struct commmit
From: Derrick Stolee @ 2018-04-03 18:31 UTC
To: Jeff King, Brandon Williams
Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On 4/3/2018 2:28 PM, Jeff King wrote:
> On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:
>
>> On 04/03, Derrick Stolee wrote:
>>> The generation number of a commit is defined recursively as follows:
>>>
>>> * If a commit A has no parents, then the generation number of A is one.
>>> * If a commit A has parents, then the generation number of A is one
>>>   more than the maximum generation number among the parents of A.
>>>
>>> Add a uint32_t generation field to struct commit so we can pass this
>> Is there any reason to believe this would be too small of a value in the
>> future? Or is a 32 bit unsigned good enough?
> The linux kernel took ~10 years to produce 500k commits. Even assuming
> those were all linear (and they're not), that gives us ~80,000 years of
> leeway. So even if the pace of development speeds up or we have a
> quicker project, it still seems we have a pretty reasonable safety
> margin.

That, and larger projects do not have linear histories. Despite having
almost 2 million reachable commits, the Windows repository has maximum
generation number ~100,000.

Thanks,
-Stolee

* Re: [PATCH 2/6] commit: add generation number to struct commmit
From: Brandon Williams @ 2018-04-03 18:32 UTC
To: Jeff King; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On 04/03, Jeff King wrote:
> On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:
>
> > On 04/03, Derrick Stolee wrote:
> > > The generation number of a commit is defined recursively as follows:
> > >
> > > * If a commit A has no parents, then the generation number of A is one.
> > > * If a commit A has parents, then the generation number of A is one
> > >   more than the maximum generation number among the parents of A.
> > >
> > > Add a uint32_t generation field to struct commit so we can pass this
> >
> > Is there any reason to believe this would be too small of a value in the
> > future? Or is a 32 bit unsigned good enough?
>
> The linux kernel took ~10 years to produce 500k commits. Even assuming
> those were all linear (and they're not), that gives us ~80,000 years of
> leeway. So even if the pace of development speeds up or we have a
> quicker project, it still seems we have a pretty reasonable safety
> margin.
>
> -Peff

I figured as much, but just wanted to check since the Windows folks seem
to produce commits pretty quickly.

-- 
Brandon Williams

* Re: [PATCH 2/6] commit: add generation number to struct commmit
From: Stefan Beller @ 2018-04-03 18:44 UTC
To: Jeff King
Cc: Brandon Williams, Derrick Stolee, git, Ævar Arnfjörð Bjarmason, Lars Schneider

On Tue, Apr 3, 2018 at 11:28 AM, Jeff King <peff@peff.net> wrote:
> On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:
>
>> On 04/03, Derrick Stolee wrote:
>> > The generation number of a commit is defined recursively as follows:
>> >
>> > * If a commit A has no parents, then the generation number of A is one.
>> > * If a commit A has parents, then the generation number of A is one
>> >   more than the maximum generation number among the parents of A.
>> >
>> > Add a uint32_t generation field to struct commit so we can pass this
>>
>> Is there any reason to believe this would be too small of a value in the
>> future? Or is a 32 bit unsigned good enough?
>
> The linux kernel took ~10 years to produce 500k commits. Even assuming
> those were all linear (and they're not),

... which you meant in terms of DAG, where a linear history is the worst
case for generation numbers. I first read it the other way round, as the
best case w.r.t. timing.

~/linux$ git log --oneline |wc -l
721223
$ git log --oneline --since 2012 |wc -l
421853
$ git log --oneline --since 2011 |wc -l
477155

The number of commits is growing exponentially, though the exponential
part is very small and the YoY growth can be estimated using linear
interpolation.

In linux, the release is a natural synchronization point IIUC as well as
on a regular schedule. So an interesting question to ask there would be
whether the delta in generation number goes up over time, or if the DAG
just gets wider (=more parallel)

> that gives us ~80,000 years of
> leeway. So even if the pace of development speeds up or we have a
> quicker project, it still seems we have a pretty reasonable safety
> margin.

Thanks for the estimate.

Stefan

* Re: [PATCH 2/6] commit: add generation number to struct commmit
From: Ramsay Jones @ 2018-04-03 23:17 UTC
To: Jeff King, Brandon Williams
Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider

On 03/04/18 19:28, Jeff King wrote:
> On Tue, Apr 03, 2018 at 11:05:36AM -0700, Brandon Williams wrote:
>
>> On 04/03, Derrick Stolee wrote:
>>> The generation number of a commit is defined recursively as follows:
>>>
>>> * If a commit A has no parents, then the generation number of A is one.
>>> * If a commit A has parents, then the generation number of A is one
>>>   more than the maximum generation number among the parents of A.
>>>
>>> Add a uint32_t generation field to struct commit so we can pass this
>>
>> Is there any reason to believe this would be too small of a value in the
>> future? Or is a 32 bit unsigned good enough?
>
> The linux kernel took ~10 years to produce 500k commits. Even assuming
> those were all linear (and they're not), that gives us ~80,000 years of
> leeway. So even if the pace of development speeds up or we have a
> quicker project, it still seems we have a pretty reasonable safety
> margin.

I didn't read the patches closely, but isn't it ~20,000 years?

Given that '#define GENERATION_NUMBER_MAX 0x3FFFFFFF', that is. ;-)

ATB,
Ramsay Jones

* Re: [PATCH 2/6] commit: add generation number to struct commmit
From: Jeff King @ 2018-04-03 23:19 UTC
To: Ramsay Jones
Cc: Brandon Williams, Derrick Stolee, git, avarab, sbeller, larsxschneider

On Wed, Apr 04, 2018 at 12:17:06AM +0100, Ramsay Jones wrote:

> >> Is there any reason to believe this would be too small of a value in the
> >> future? Or is a 32 bit unsigned good enough?
> >
> > The linux kernel took ~10 years to produce 500k commits. Even assuming
> > those were all linear (and they're not), that gives us ~80,000 years of
> > leeway. So even if the pace of development speeds up or we have a
> > quicker project, it still seems we have a pretty reasonable safety
> > margin.
>
> I didn't read the patches closely, but isn't it ~20,000 years?
>
> Given that '#define GENERATION_NUMBER_MAX 0x3FFFFFFF', that is. ;-)

What, I'm supposed to read the patches before responding? Heresy.

-Peff

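[Editor's illustration] The two estimates traded above come from the same arithmetic with a different ceiling. A minimal sketch, assuming the rate quoted in the thread (500k commits in about 10 years, so roughly 50,000 commits per year) and the worst case of a fully linear history; toy_years_of_headroom() is a hypothetical helper, not part of Git.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Years until a linear history growing at commits_per_year would
 * reach max_generation. Hypothetical helper for this estimate only.
 */
static uint64_t toy_years_of_headroom(uint32_t max_generation,
				      uint32_t commits_per_year)
{
	assert(commits_per_year > 0);
	return (uint64_t)max_generation / commits_per_year;
}
```

With 50,000 commits per year, a full 32-bit ceiling of 0xFFFFFFFF gives 85,899 years (the "~80,000" figure), while the 30-bit GENERATION_NUMBER_MAX of 0x3FFFFFFF gives 21,474 years (the "~20,000" figure).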
* Re: [PATCH 2/6] commit: add generation number to struct commmit
From: Jonathan Tan @ 2018-04-03 18:24 UTC
To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue, 3 Apr 2018 12:51:39 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> The generation number of a commit is defined recursively as follows:
>
> * If a commit A has no parents, then the generation number of A is one.
> * If a commit A has parents, then the generation number of A is one
>   more than the maximum generation number among the parents of A.
>
> Add a uint32_t generation field to struct commit so we can pass this
> information to revision walks. We use two special values to signal
> the generation number is invalid:
>
> GENERATION_NUMBER_UNDEF 0xFFFFFFFF
> GENERATION_NUMBER_NONE 0
>
> The first (_UNDEF) means the generation number has not been loaded or
> computed. The second (_NONE) means the generation number was loaded
> from a commit graph file that was stored before generation numbers
> were computed.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

This looks straightforward and correct, thanks. I think some of the
description above should appear as code comments.

> +#define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
> +#define GENERATION_NUMBER_NONE 0

I would include the description above here as documentation, and would
replace "was stored before generation numbers were computed" by "was
written by a version of Git that did not support generation numbers".

* [PATCH 3/6] commit-graph: compute generation numbers
From: Derrick Stolee @ 2018-04-03 16:51 UTC
To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee

While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 commit.h       |  1 +
 2 files changed, 47 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index d24b947525..b80c8ad80e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -419,6 +419,13 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
+		if ((*list)->generation != GENERATION_NUMBER_UNDEF) {
+			if ((*list)->generation > GENERATION_NUMBER_MAX)
+				die("generation number %u is too large to store in commit-graph",
+				    (*list)->generation);
+			packedDate[0] |= htonl((*list)->generation << 2);
+		}
+
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
 
@@ -551,6 +558,43 @@ static void close_reachable(struct packed_oid_list *oids)
 	}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+				       int nr_commits)
+{
+	int i;
+	struct commit_list *list = NULL;
+
+	for (i = 0; i < nr_commits; i++) {
+		if (commits[i]->generation != GENERATION_NUMBER_UNDEF &&
+		    commits[i]->generation != GENERATION_NUMBER_NONE)
+			continue;
+
+		commit_list_insert(commits[i], &list);
+		while (list) {
+			struct commit *current = list->item;
+			struct commit_list *parent;
+			int all_parents_computed = 1;
+			uint32_t max_generation = 0;
+
+			for (parent = current->parents; parent; parent = parent->next) {
+				if (parent->item->generation == GENERATION_NUMBER_UNDEF ||
+				    parent->item->generation == GENERATION_NUMBER_NONE) {
+					all_parents_computed = 0;
+					commit_list_insert(parent->item, &list);
+					break;
+				} else if (parent->item->generation > max_generation) {
+					max_generation = parent->item->generation;
+				}
+			}
+
+			if (all_parents_computed) {
+				current->generation = max_generation + 1;
+				pop_commit(&list);
+			}
+		}
+	}
+}
+
 void write_commit_graph(const char *obj_dir,
 			const char **pack_indexes,
 			int nr_packs,
@@ -674,6 +718,8 @@ void write_commit_graph(const char *obj_dir,
 	if (commits.nr >= GRAPH_PARENT_MISSING)
 		die(_("too many commits to write graph"));
 
+	compute_generation_numbers(commits.list, commits.nr);
+
 	graph_name = get_commit_graph_filename(obj_dir);
 	fd = hold_lock_file_for_update(&lk, graph_name, 0);
 
diff --git a/commit.h b/commit.h
index 3cadd386f3..bc7a3186c5 100644
--- a/commit.h
+++ b/commit.h
@@ -11,6 +11,7 @@
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
 #define GENERATION_NUMBER_UNDEF 0xFFFFFFFF
+#define GENERATION_NUMBER_MAX 0x3FFFFFFF
 #define GENERATION_NUMBER_NONE 0
 
 struct commit_list {
-- 
2.17.0.rc0

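[Editor's illustration] The depth-first strategy in the patch above can be tried outside of Git on a toy DAG. This is a sketch under stated assumptions, not the patch itself: it replaces struct commit and commit_list with an array of hypothetical toy_node entries whose parents are array indices, and uses an explicit index stack. The control flow (push the first uncomputed parent, assign max + 1 once all parents are known, skip nodes that already have a number) follows the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_GEN_UNDEF 0xFFFFFFFFu /* mirrors GENERATION_NUMBER_UNDEF */
#define TOY_GEN_NONE  0u          /* mirrors GENERATION_NUMBER_NONE */

/* Hypothetical node: up to two parents by index, -1 means "no parent". */
struct toy_node {
	int parents[2];
	uint32_t generation;
};

static int toy_needs_generation(const struct toy_node *n)
{
	return n->generation == TOY_GEN_UNDEF ||
	       n->generation == TOY_GEN_NONE;
}

/* Returns 0 on success, -1 if the work stack cannot be allocated. */
static int toy_compute_generations(struct toy_node *nodes, int nr)
{
	int *stack, top, i;

	if (nr <= 0)
		return 0;
	/* The stack holds a parent chain in a DAG, so depth is at most nr. */
	stack = malloc(sizeof(*stack) * (size_t)nr);
	if (!stack)
		return -1;

	for (i = 0; i < nr; i++) {
		if (!toy_needs_generation(&nodes[i]))
			continue; /* precomputed: not walked again */

		top = 0;
		stack[top++] = i;
		while (top > 0) {
			struct toy_node *cur = &nodes[stack[top - 1]];
			int all_parents_computed = 1;
			uint32_t max_generation = 0;
			int p;

			for (p = 0; p < 2; p++) {
				int parent = cur->parents[p];
				if (parent < 0)
					continue;
				if (toy_needs_generation(&nodes[parent])) {
					/* Resolve this parent first, then revisit cur. */
					all_parents_computed = 0;
					stack[top++] = parent;
					break;
				}
				if (nodes[parent].generation > max_generation)
					max_generation = nodes[parent].generation;
			}

			if (all_parents_computed) {
				cur->generation = max_generation + 1;
				top--;
			}
		}
	}
	free(stack);
	return 0;
}
```

Because nodes that already carry a number are skipped both as starting points and as parents, the walk only touches nodes without a generation, which is the cost property the commit message describes. The sketch assumes the input really is acyclic and that parent indices are in range.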
* Re: [PATCH 3/6] commit-graph: compute generation numbers
From: Jonathan Tan @ 2018-04-03 18:30 UTC
To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff

On Tue, 3 Apr 2018 12:51:40 -0400
Derrick Stolee <dstolee@microsoft.com> wrote:

> +		if ((*list)->generation != GENERATION_NUMBER_UNDEF) {
> +			if ((*list)->generation > GENERATION_NUMBER_MAX)
> +				die("generation number %u is too large to store in commit-graph",
> +				    (*list)->generation);
> +			packedDate[0] |= htonl((*list)->generation << 2);
> +		}

The die() should have "BUG:" if you agree with my comment below.

> +static void compute_generation_numbers(struct commit** commits,
> +				       int nr_commits)

Style: space before **, not after.

> +			if (all_parents_computed) {
> +				current->generation = max_generation + 1;
> +				pop_commit(&list);
> +			}

I think the current->generation should be clamped to _MAX here. If we do, then
the die() I mentioned in my first comment will have "BUG:", since we are never
meant to write any number larger than _MAX in ->generation.

* Re: [PATCH 3/6] commit-graph: compute generation numbers
From: Stefan Beller @ 2018-04-03 18:49 UTC
To: Jonathan Tan
Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason, Lars Schneider, Jeff King

On Tue, Apr 3, 2018 at 11:30 AM, Jonathan Tan <jonathantanmy@google.com> wrote:
> On Tue, 3 Apr 2018 12:51:40 -0400
> Derrick Stolee <dstolee@microsoft.com> wrote:
>
>> +		if ((*list)->generation != GENERATION_NUMBER_UNDEF) {
>> +			if ((*list)->generation > GENERATION_NUMBER_MAX)
>> +				die("generation number %u is too large to store in commit-graph",
>> +				    (*list)->generation);
>> +			packedDate[0] |= htonl((*list)->generation << 2);
>> +		}
>
> The die() should have "BUG:" if you agree with my comment below.

I would remove the BUG/die() altogether and keep going.
(But do not write it out, i.e. warn and skip the next line)
A degraded commit graph with partial generation numbers is better
than Git refusing to write any part of the commit graph (which later on
will be part of many maintenance operations I would think, leading
to more immediate headache rather than "working but slightly slower")

>
>> +static void compute_generation_numbers(struct commit** commits,
>> +				       int nr_commits)
>
> Style: space before **, not after.
>
>> +			if (all_parents_computed) {
>> +				current->generation = max_generation + 1;
>> +				pop_commit(&list);
>> +			}
>
> I think the current->generation should be clamped to _MAX here. If we do, then
> the die() I mentioned in my first comment will have "BUG:", since we are never
> meant to write any number larger than _MAX in ->generation.

When we clamp here, we'd have to treat the _MAX specially in all
our use cases or we'd encounter funny bugs due to misordered commits
later?

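[Editor's illustration] The clamping suggestion and the 30-bit limit discussed above can be sketched together. This is a hedged sketch, not what the series does: the toy_* names are hypothetical, words are kept in host byte order (the patch writes them with htonl() and reads them with get_be32()), and the assumption that the low two bits of the first word carry the high bits of the commit date is inferred from the `<< 2` and `>> 2` shifts in the patches.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_GENERATION_MAX 0x3FFFFFFFu /* mirrors GENERATION_NUMBER_MAX */

/* Clamp as suggested in review, so callers never exceed the 30-bit field. */
static uint32_t toy_clamp_generation(uint32_t generation)
{
	return generation > TOY_GENERATION_MAX ? TOY_GENERATION_MAX : generation;
}

/*
 * Pack the first 32-bit word: generation in the upper 30 bits,
 * bits 32..33 of the date in the low 2 bits (assumed layout).
 */
static uint32_t toy_pack_word(uint32_t generation, uint64_t date)
{
	/* With clamping upstream, a larger value here would be a bug. */
	assert(generation <= TOY_GENERATION_MAX);
	return (generation << 2) | (uint32_t)((date >> 32) & 0x3);
}

/* Inverse of the generation part, matching the ">> 2" on the read side. */
static uint32_t toy_unpack_generation(uint32_t word)
{
	return word >> 2;
}
```

A value above 0x3FFFFFFF would lose its top bits when shifted left by two, which is why the writer needs either a range check or clamping. If clamping is used, every commit at or beyond the limit ends up with the same number, so any consumer comparing generations needs a fallback for ties at that value, which is the concern raised in the reply above.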
* [PATCH 4/6] commit: use generations in paint_down_to_common() 2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee ` (2 preceding siblings ...) 2018-04-03 16:51 ` [PATCH 3/6] commit-graph: compute generation numbers Derrick Stolee @ 2018-04-03 16:51 ` Derrick Stolee 2018-04-03 18:31 ` Stefan Beller 2018-04-03 18:31 ` Jonathan Tan 2018-04-03 16:51 ` [PATCH 5/6] commit.c: use generation to halt paint walk Derrick Stolee ` (5 subsequent siblings) 9 siblings, 2 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw) To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_UNDEF. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 19 ++++++++++++++++++- commit.h | 1 + 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 3e39c86abf..95ae7e13a3 100644 --- a/commit.c +++ b/commit.c @@ -624,6 +624,23 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* newer commits with larger date first */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -773,7 +790,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index bc7a3186c5..cb97b7636a 100644 --- a/commit.h +++ b/commit.h @@ -332,6 +332,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0.rc0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 4/6] commit: use generations in paint_down_to_common() 2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee @ 2018-04-03 18:31 ` Stefan Beller 2018-04-03 18:31 ` Jonathan Tan 1 sibling, 0 replies; 162+ messages in thread From: Stefan Beller @ 2018-04-03 18:31 UTC (permalink / raw) To: Derrick Stolee Cc: git, Ævar Arnfjörð Bjarmason, Lars Schneider, Jeff King On Tue, Apr 3, 2018 at 9:51 AM, Derrick Stolee <dstolee@microsoft.com> wrote: > Define compare_commits_by_gen_then_commit_date(), which uses generation > numbers as a primary comparison and commit date to break ties (or as a > comparison when both commits do not have computed generation numbers). > > Since the commit-graph file is closed under reachability, we know that > all commits in the file have generation at most GENERATION_NUMBER_MAX > which is less than GENERATION_NUMBER_UNDEF. > > This change does not affect the number of commits that are walked during > the execution of paint_down_to_common(), only the order that those > commits are inspected. In the case that commit dates violate topological > order (i.e. a parent is "newer" than a child), the previous code could > walk a commit twice: if a commit is reached with the PARENT1 bit, but > later is re-visited with the PARENT2 bit, then that PARENT2 bit must be > propagated to its parents. Using generation numbers avoids this extra > effort, even if it is somewhat rare. This patch (or later in this series) may want to touch Documentation/technical/commit-graph.txt, that mentions this in the section of Future Work: - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. 
The following operations are important candidates: - paint_down_to_common() - 'log --topo-order' The paint down to common is only internal, not exposed to the user for ordering, i.e. the topological ordering is still ordering commits in a branch adjacent? Thanks, Stefan ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 4/6] commit: use generations in paint_down_to_common() 2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee 2018-04-03 18:31 ` Stefan Beller @ 2018-04-03 18:31 ` Jonathan Tan 1 sibling, 0 replies; 162+ messages in thread From: Jonathan Tan @ 2018-04-03 18:31 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff On Tue, 3 Apr 2018 12:51:41 -0400 Derrick Stolee <dstolee@microsoft.com> wrote: > +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) > +{ > + const struct commit *a = a_, *b = b_; > + > + if (a->generation < b->generation) > + return 1; > + else if (a->generation > b->generation) > + return -1; > + > + /* newer commits with larger date first */ > + if (a->date < b->date) > + return 1; > + else if (a->date > b->date) > + return -1; > + return 0; > +} I think it would be clearer if you commented above the first block "newer commits first", then on the second block, "use date as a heuristic to determine newer commit". Other than that, this looks good. ^ permalink raw reply [flat|nested] 162+ messages in thread
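As a rough illustration of the ordering discussed above (a sketch, not code from the patch), the comparator below sorts a toy array with qsort() instead of feeding Git's prio_queue, so it takes two arguments rather than three. The names toy_commit, toy_cmp_gen_then_date and toy_sort_commits are hypothetical. The comments follow the wording Jonathan Tan proposes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy type; Git's struct commit has many more fields. */
struct toy_commit {
	uint32_t generation;
	uint64_t date;
};

static int toy_cmp_gen_then_date(const void *a_, const void *b_)
{
	const struct toy_commit *a = a_, *b = b_;

	/*
	 * newer commits first: a commit with a larger generation
	 * can never be an ancestor of one with a smaller generation
	 */
	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;

	/* equal generation: use date as a heuristic to determine the newer commit */
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}

/* Sort so that the commit to inspect first ends up at index 0. */
static void toy_sort_commits(struct toy_commit *commits, size_t nr)
{
	assert(nr == 0 || commits != NULL);
	qsort(commits, nr, sizeof(*commits), toy_cmp_gen_then_date);
}
```

With clock skew, a parent (generation 2) may carry a later date than its child (generation 3). A date-only order would visit the parent first; this order visits the child first regardless of the dates.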
* [PATCH 5/6] commit.c: use generation to halt paint walk 2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee ` (3 preceding siblings ...) 2018-04-03 16:51 ` [PATCH 4/6] commit: use generations in paint_down_to_common() Derrick Stolee @ 2018-04-03 16:51 ` Derrick Stolee 2018-04-03 19:01 ` Jonathan Tan 2018-04-03 16:51 ` [PATCH 6/6] commit-graph.txt: update future work Derrick Stolee ` (4 subsequent siblings) 9 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw) To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee In paint_down_to_common(), the walk is halted when the queue contains only stale commits. The queue_has_nonstale() method iterates over the entire queue looking for a nonstale commit. In a wide commit graph where the two sides share many commits in common, but have deep sets of different commits, this method may inspect many elements before finding a nonstale commit. In the worst case, this can give quadratic performance in paint_down_to_common(). Convert queue_has_nonstale() to use generation numbers for an O(1) termination condition. To properly take advantage of this condition, track the minimum generation number of a commit that enters the queue with nonstale status. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 37 ++++++++++++++++++++++++++++++------- 1 file changed, 30 insertions(+), 7 deletions(-) diff --git a/commit.c b/commit.c index 95ae7e13a3..858f4fdbc9 100644 --- a/commit.c +++ b/commit.c @@ -776,14 +776,22 @@ void sort_in_topological_order(struct commit_list **list, enum rev_sort_order so static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT); -static int queue_has_nonstale(struct prio_queue *queue) +static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen) { - int i; - for (i = 0; i < queue->nr; i++) { - struct commit *commit = queue->array[i].data; - if (!(commit->object.flags & STALE)) - return 1; + if (min_gen != GENERATION_NUMBER_UNDEF) { + if (queue->nr > 0) { + struct commit *commit = queue->array[0].data; + return commit->generation >= min_gen; + } + } else { + int i; + for (i = 0; i < queue->nr; i++) { + struct commit *commit = queue->array[i].data; + if (!(commit->object.flags & STALE)) + return 1; + } } + return 0; } @@ -793,6 +801,8 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; + uint32_t last_gen = GENERATION_NUMBER_UNDEF; + uint32_t min_nonstale_gen = GENERATION_NUMBER_UNDEF; one->object.flags |= PARENT1; if (!n) { @@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc return result; } prio_queue_put(&queue, one); + if (one->generation < min_nonstale_gen) + min_nonstale_gen = one->generation; for (i = 0; i < n; i++) { twos[i]->object.flags |= PARENT2; prio_queue_put(&queue, twos[i]); + if (twos[i]->generation < min_nonstale_gen) + min_nonstale_gen = twos[i]->generation; } - while (queue_has_nonstale(&queue)) { + while (queue_has_nonstale(&queue, min_nonstale_gen)) { struct commit *commit = prio_queue_get(&queue); struct commit_list *parents; int flags; + 
if (commit->generation > last_gen) + BUG("bad generation skip"); + + last_gen = commit->generation; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc return NULL; p->object.flags |= flags; prio_queue_put(&queue, p); + + if (!(flags & STALE) && + p->generation < min_nonstale_gen) + min_nonstale_gen = p->generation; } } -- 2.17.0.rc0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 5/6] commit.c: use generation to halt paint walk 2018-04-03 16:51 ` [PATCH 5/6] commit.c: use generation to halt paint walk Derrick Stolee @ 2018-04-03 19:01 ` Jonathan Tan 0 siblings, 0 replies; 162+ messages in thread From: Jonathan Tan @ 2018-04-03 19:01 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff On Tue, 3 Apr 2018 12:51:42 -0400 Derrick Stolee <dstolee@microsoft.com> wrote: > -static int queue_has_nonstale(struct prio_queue *queue) > +static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen) > { > - int i; > - for (i = 0; i < queue->nr; i++) { > - struct commit *commit = queue->array[i].data; > - if (!(commit->object.flags & STALE)) > - return 1; > + if (min_gen != GENERATION_NUMBER_UNDEF) { > + if (queue->nr > 0) { > + struct commit *commit = queue->array[0].data; > + return commit->generation >= min_gen; > + } This only works if the prio_queue has compare_commits_by_gen_then_commit_date. Also, I don't think that the min_gen != GENERATION_NUMBER_UNDEF check is necessary. So I would write this as: if (queue->compare == compare_commits_by_gen_then_commit_date && queue->nr) { struct commit *commit = queue->array[0].data; return commit->generation >= min_gen; } for (i = 0 ... If you'd rather not perform the comparison to compare_commits_by_gen_then_commit_date every time you invoke queue_has_nonstale(), that's fine with me too, but document somewhere that queue_has_nonstale() only works if this comparison function is used. > + if (commit->generation > last_gen) > + BUG("bad generation skip"); > + > + last_gen = commit->generation; last_gen seems to only be used to ensure that the priority queue returns elements in the correct order - I think we can generally trust the queue, and if we need to test it, we can do it elsewhere. ^ permalink raw reply [flat|nested] 162+ messages in thread
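The difference between the two termination checks can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: the queue is modelled as a plain array whose element 0 has the largest generation (which, as noted above, only holds when the generation-based comparator is in use), and toy_commit, TOY_STALE and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_STALE 1u /* hypothetical flag bit */

struct toy_commit {
	uint32_t generation;
	unsigned flags;
};

/* Linear check: look at every queued commit for one that is not stale. */
static int toy_has_nonstale_scan(struct toy_commit *const *queue, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!(queue[i]->flags & TOY_STALE))
			return 1;
	return 0;
}

/*
 * Constant-time check. Assumes queue[0] holds the largest generation
 * in the queue and that min_nonstale_gen is no larger than the
 * generation of any commit that was queued while non-stale. If the
 * head is below that bound, every queued commit is below it too, so
 * nothing non-stale can remain.
 */
static int toy_has_nonstale_head(struct toy_commit *const *queue, size_t nr,
				 uint32_t min_nonstale_gen)
{
	assert(nr == 0 || queue != NULL);
	return nr > 0 && queue[0]->generation >= min_nonstale_gen;
}
```

Under those assumptions the constant-time check never reports 0 while a non-stale commit is still queued, but it can keep reporting 1 for a few extra iterations when only stale commits at or above the bound remain.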
* [PATCH 6/6] commit-graph.txt: update future work 2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee ` (4 preceding siblings ...) 2018-04-03 16:51 ` [PATCH 5/6] commit.c: use generation to halt paint walk Derrick Stolee @ 2018-04-03 16:51 ` Derrick Stolee 2018-04-03 19:04 ` Jonathan Tan 2018-04-03 16:56 ` [PATCH 0/6] Compute and consume generation numbers Derrick Stolee ` (3 subsequent siblings) 9 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-03 16:51 UTC (permalink / raw) To: git; +Cc: avarab, sbeller, larsxschneider, peff, Derrick Stolee We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/technical/commit-graph.txt | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..be68bee43d 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -98,17 +98,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. 
The following - operations are important candidates: + operation is an important candidate: - - paint_down_to_common() - 'log --topo-order' - Currently, parse_commit_gently() requires filling in the root tree -- 2.17.0.rc0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 6/6] commit-graph.txt: update future work 2018-04-03 16:51 ` [PATCH 6/6] commit-graph.txt: update future work Derrick Stolee @ 2018-04-03 19:04 ` Jonathan Tan 0 siblings, 0 replies; 162+ messages in thread From: Jonathan Tan @ 2018-04-03 19:04 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff On Tue, 3 Apr 2018 12:51:43 -0400 Derrick Stolee <dstolee@microsoft.com> wrote: > We now calculate generation numbers in the commit-graph file and use > them in paint_down_to_common(). For completeness, I'll mention that I don't see any issues with this patch, of course. Thanks for this series. ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee ` (5 preceding siblings ...) 2018-04-03 16:51 ` [PATCH 6/6] commit-graph.txt: update future work Derrick Stolee @ 2018-04-03 16:56 ` Derrick Stolee 2018-04-03 18:03 ` Brandon Williams ` (2 subsequent siblings) 9 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-03 16:56 UTC (permalink / raw) To: Derrick Stolee, git; +Cc: avarab, sbeller, larsxschneider, peff On 4/3/2018 12:51 PM, Derrick Stolee wrote: > This is the first of several "small" patches that follow the serialized > Git commit graph patch (ds/commit-graph). > > As described in Documentation/technical/commit-graph.txt, the generation > number of a commit is one more than the maximum generation number among > its parents (trivially, a commit with no parents has generation number > one). > > This series makes the computation of generation numbers part of the > commit-graph write process. > > Finally, generation numbers are used to order commits in the priority > queue in paint_down_to_common(). This allows a constant-time check in > queue_has_nonstale() instead of the previous linear-time check. > > This does not have a significant performance benefit in repositories > of normal size, but in the Windows repository, some merge-base > calculations improve from 3.1s to 2.9s. A modest speedup, but provides > an actual consumer of generation numbers as a starting point. > > A more substantial refactoring of revision.c is required before making > 'git log --graph' use generation numbers effectively. > > This patch series depends on v7 of ds/commit-graph. 
> > Derrick Stolee (6): > object.c: parse commit in graph first > commit: add generation number to struct commmit > commit-graph: compute generation numbers > commit: sort by generation number in paint_down_to_common() > commit.c: use generation number to stop merge-base walks > commit-graph.txt: update design doc with generation numbers This patch is also available as a GitHub pull request [1] [1] https://github.com/derrickstolee/git/pull/5 ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee ` (6 preceding siblings ...) 2018-04-03 16:56 ` [PATCH 0/6] Compute and consume generation numbers Derrick Stolee @ 2018-04-03 18:03 ` Brandon Williams 2018-04-03 18:29 ` Derrick Stolee 2018-04-07 16:55 ` Jakub Narebski 2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee 9 siblings, 1 reply; 162+ messages in thread From: Brandon Williams @ 2018-04-03 18:03 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, peff On 04/03, Derrick Stolee wrote: > This is the first of several "small" patches that follow the serialized > Git commit graph patch (ds/commit-graph). > > As described in Documentation/technical/commit-graph.txt, the generation > number of a commit is one more than the maximum generation number among > its parents (trivially, a commit with no parents has generation number > one). Thanks for ensuring that this is defined and documented somewhere :) > > This series makes the computation of generation numbers part of the > commit-graph write process. > > Finally, generation numbers are used to order commits in the priority > queue in paint_down_to_common(). This allows a constant-time check in > queue_has_nonstale() instead of the previous linear-time check. > > This does not have a significant performance benefit in repositories > of normal size, but in the Windows repository, some merge-base > calculations improve from 3.1s to 2.9s. A modest speedup, but provides > an actual consumer of generation numbers as a starting point. > > A more substantial refactoring of revision.c is required before making > 'git log --graph' use generation numbers effectively. log --graph should benefit a lot more from this correct? I know we've talked a bit about negotiation and I wonder if these generation numbers should be able to help out a little bit with that some day. 
> > This patch series depends on v7 of ds/commit-graph. > > Derrick Stolee (6): > object.c: parse commit in graph first > commit: add generation number to struct commmit > commit-graph: compute generation numbers > commit: sort by generation number in paint_down_to_common() > commit.c: use generation number to stop merge-base walks > commit-graph.txt: update design doc with generation numbers > > Documentation/technical/commit-graph.txt | 7 +--- > alloc.c | 1 + > commit-graph.c | 48 +++++++++++++++++++++ > commit.c | 53 ++++++++++++++++++++---- > commit.h | 7 +++- > object.c | 4 +- > 6 files changed, 104 insertions(+), 16 deletions(-) > > -- > 2.17.0.20.g9f30ba16e1 > -- Brandon Williams ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-03 18:03 ` Brandon Williams @ 2018-04-03 18:29 ` Derrick Stolee 2018-04-03 18:47 ` Jeff King 2018-04-07 17:09 ` [PATCH 0/6] Compute and consume generation numbers Jakub Narebski 0 siblings, 2 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-03 18:29 UTC (permalink / raw) To: Brandon Williams, Derrick Stolee Cc: git, avarab, sbeller, larsxschneider, peff On 4/3/2018 2:03 PM, Brandon Williams wrote: > On 04/03, Derrick Stolee wrote: >> This is the first of several "small" patches that follow the serialized >> Git commit graph patch (ds/commit-graph). >> >> As described in Documentation/technical/commit-graph.txt, the generation >> number of a commit is one more than the maximum generation number among >> its parents (trivially, a commit with no parents has generation number >> one). > Thanks for ensuring that this is defined and documented somewhere :) > >> This series makes the computation of generation numbers part of the >> commit-graph write process. >> >> Finally, generation numbers are used to order commits in the priority >> queue in paint_down_to_common(). This allows a constant-time check in >> queue_has_nonstale() instead of the previous linear-time check. >> >> This does not have a significant performance benefit in repositories >> of normal size, but in the Windows repository, some merge-base >> calculations improve from 3.1s to 2.9s. A modest speedup, but provides >> an actual consumer of generation numbers as a starting point. >> >> A more substantial refactoring of revision.c is required before making >> 'git log --graph' use generation numbers effectively. > log --graph should benefit a lot more from this correct? I know we've > talked a bit about negotiation and I wonder if these generation numbers > should be able to help out a little bit with that some day. 'log --graph' should be a HUGE speedup, when it is refactored. 
Since the topo-order can "stream" commits to the pager, it can be very responsive to return the graph in almost all conditions. (The case where generation numbers are not enough is when filters reduce the set of displayed commits to be very sparse, so many commits are walked anyway.) If we have generic "can X reach Y?" queries, then we can also use generation numbers there to great effect (by not walking commits Z with gen(Z) <= gen(Y)). Perhaps I should look at that "git branch --contains" thread for ideas. For negotiation, there are some things we can do here. VSTS uses generation numbers as a heuristic for determining "all wants connected to haves" which is a condition for halting negotiation. The idea is very simple, and I'd be happy to discuss it on a separate thread. Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
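A generic "can X reach Y?" walk with the pruning described above might look like the sketch below. It is not from the thread and assumes every commit has a computed generation number. The toy_commit struct, the mark field and the query_id parameter are hypothetical. The identity test comes before the generation test, since a commit trivially reaches itself even though its generation is not larger than its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy type for illustration only. */
struct toy_commit {
	struct toy_commit **parents;
	size_t nr_parents;
	uint32_t generation; /* assumed computed for every commit */
	unsigned mark;       /* id of the last query that visited this commit */
};

/*
 * Return 1 if 'to' is reachable from 'from' by following parents.
 * Each call chain must use a fresh, nonzero query_id so that marks
 * left behind by earlier queries are ignored.
 */
static int toy_can_reach(struct toy_commit *from, const struct toy_commit *to,
			 unsigned query_id)
{
	size_t i;

	assert(from && to && query_id != 0);

	if (from == to)
		return 1;
	/* A commit can only reach commits of strictly smaller generation. */
	if (from->generation <= to->generation)
		return 0;
	if (from->mark == query_id)
		return 0; /* already explored during this query */
	from->mark = query_id;

	for (i = 0; i < from->nr_parents; i++)
		if (toy_can_reach(from->parents[i], to, query_id))
			return 1;
	return 0;
}
```

The recursion keeps the sketch short; a production walk over deep history would more likely use an explicit stack or queue.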
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-03 18:29 ` Derrick Stolee @ 2018-04-03 18:47 ` Jeff King 2018-04-03 19:05 ` Jeff King 2018-04-07 17:09 ` [PATCH 0/6] Compute and consume generation numbers Jakub Narebski 1 sibling, 1 reply; 162+ messages in thread From: Jeff King @ 2018-04-03 18:47 UTC (permalink / raw) To: Derrick Stolee Cc: Brandon Williams, Derrick Stolee, git, avarab, sbeller, larsxschneider On Tue, Apr 03, 2018 at 02:29:01PM -0400, Derrick Stolee wrote: > If we have generic "can X reach Y?" queries, then we can also use generation > numbers there to great effect (by not walking commits Z with gen(Z) <= > gen(Y)). Perhaps I should look at that "git branch --contains" thread for > ideas. I think the gist of it is the patch below. Which I hastily adapted from the patch we run at GitHub that uses timestamps as a proxy. So it's possible I completely flubbed the logic. I'm assuming unavailable generation numbers are set to 0; the logic is actually a bit simpler if they end up as (uint32_t)-1. Assuming it works, that would cover for-each-ref and tag. You'd probably want to drop the "with_commit_tag_algo" flag in ref-filter.h, and just use always use it by default (and that would cover "git branch"). 
--- diff --git a/ref-filter.c b/ref-filter.c index 45fc56216a..6bea6173d1 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate, return CONTAINS_YES; } - /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation && candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = -1; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation && c->generation < cutoff ) + cutoff = c->generation; + } + if (cutoff == -1) + cutoff = 0; + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. 
*/ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-03 18:47 ` Jeff King @ 2018-04-03 19:05 ` Jeff King 2018-04-04 15:45 ` [PATCH 7/6] ref-filter: use generation number for --contains Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Jeff King @ 2018-04-03 19:05 UTC (permalink / raw) To: Derrick Stolee Cc: Brandon Williams, Derrick Stolee, git, avarab, sbeller, larsxschneider On Tue, Apr 03, 2018 at 02:47:27PM -0400, Jeff King wrote: > On Tue, Apr 03, 2018 at 02:29:01PM -0400, Derrick Stolee wrote: > > > If we have generic "can X reach Y?" queries, then we can also use generation > > numbers there to great effect (by not walking commits Z with gen(Z) <= > > gen(Y)). Perhaps I should look at that "git branch --contains" thread for > > ideas. > > I think the gist of it is the patch below. Which I hastily adapted from > the patch we run at GitHub that uses timestamps as a proxy. So it's > possible I completely flubbed the logic. I'm assuming unavailable > generation numbers are set to 0; the logic is actually a bit simpler if > they end up as (uint32_t)-1. Oh indeed, that is already the value of your UNDEF. 
So the patch is more like this: diff --git a/ref-filter.c b/ref-filter.c index 45fc56216a..b147b1d0ee 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate, return CONTAINS_YES; } - /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_UNDEF; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + if (cutoff == GENERATION_NUMBER_UNDEF) + cutoff = GENERATION_NUMBER_NONE; + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. 
*/ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, ^ permalink raw reply related [flat|nested] 162+ messages in thread
* [PATCH 7/6] ref-filter: use generation number for --contains 2018-04-03 19:05 ` Jeff King @ 2018-04-04 15:45 ` Derrick Stolee 2018-04-04 15:45 ` [PATCH 8/6] commit: use generation numbers for in_merge_bases() Derrick Stolee 2018-04-04 18:22 ` [PATCH 7/6] ref-filter: use generation number for --contains Jeff King 0 siblings, 2 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-04 15:45 UTC (permalink / raw) To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee A commit A can reach a commit B only if the generation number of A is strictly larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for '--contains' type queries. On a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following performance improvement: Before: 0.81s After: 0.04s Rel %: -95% Helped-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- ref-filter.c | 26 +++++++++++++++++++++----- 1 file changed, 21 insertions(+), 5 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index 45fc56216a..b147b1d0ee 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate, return CONTAINS_YES; } - /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate, 
struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_UNDEF; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + if (cutoff == GENERATION_NUMBER_UNDEF) + cutoff = GENERATION_NUMBER_NONE; + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1634,7 +1650,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1648,7 +1664,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, -- 2.17.0.rc0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
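The effect of the cutoff in the patch above can be illustrated outside of Git. What follows is a minimal sketch, not the ref-filter.c code: toy_commit, toy_contains() and toy_demo() are hypothetical names, the generation numbers are assigned by hand, and the walk is a plain recursive search rather than the explicit stack used by contains_tag_algo().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy commit: up to two parents, generation set by hand. */
struct toy_commit {
	uint32_t generation;
	struct toy_commit *parent[2];
	int seen;
};

static int toy_expanded; /* commits whose parents were examined */

/*
 * Return 1 if `want` is reachable from `c`. A commit whose generation
 * is strictly below `cutoff` cannot reach `want`, so the walk stops
 * there; a cutoff of 0 disables the pruning.
 */
int toy_contains(struct toy_commit *c, const struct toy_commit *want,
		 uint32_t cutoff)
{
	size_t i;

	assert(want != NULL);
	if (!c)
		return 0;
	if (c == want)
		return 1;
	if (c->generation < cutoff)
		return 0;
	if (c->seen)
		return 0; /* already explored without finding `want` */
	c->seen = 1;
	toy_expanded++;
	for (i = 0; i < 2; i++)
		if (toy_contains(c->parent[i], want, cutoff))
			return 1;
	return 0;
}

/*
 * Linear history a <- b <- c <- d with generations 1..4. Ask whether
 * the old tip `b` contains `d`, with or without the cutoff, and report
 * how many commits were expanded.
 */
int toy_demo(int use_cutoff, int *expanded)
{
	struct toy_commit a = { 1, { NULL, NULL }, 0 };
	struct toy_commit b = { 2, { &a, NULL }, 0 };
	struct toy_commit c = { 3, { &b, NULL }, 0 };
	struct toy_commit d = { 4, { &c, NULL }, 0 };
	int found;

	toy_expanded = 0;
	found = toy_contains(&b, &d, use_cutoff ? d.generation : 0);
	*expanded = toy_expanded;
	return found;
}
```

In this toy history the answer is "no" either way, but without the cutoff the walk expands both b and a before giving up, while with the cutoff set to the generation of d it stops at b immediately. That is the same shape as an old tag being discarded without walking its history.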
* [PATCH 8/6] commit: use generation numbers for in_merge_bases() 2018-04-04 15:45 ` [PATCH 7/6] ref-filter: use generation number for --contains Derrick Stolee @ 2018-04-04 15:45 ` Derrick Stolee 2018-04-04 15:48 ` Derrick Stolee 2018-04-04 18:22 ` [PATCH 7/6] ref-filter: use generation number for --contains Jeff King 1 sibling, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-04 15:45 UTC (permalink / raw) To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the reference commits. Performance tests were run on a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag. 
Also, all tags were copied to branches and 'git branch --contains' was tested: Before: 60.0s After: 0.4s Rel %: -99.3% Reported-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 858f4fdbc9..2566cba79f 100644 --- a/commit.c +++ b/commit.c @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_UNDEF; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (min_generation > reference[i]->generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return 0; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) -- 2.17.0.rc0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases() 2018-04-04 15:45 ` [PATCH 8/6] commit: use generation numbers for in_merge_bases() Derrick Stolee @ 2018-04-04 15:48 ` Derrick Stolee 2018-04-04 17:01 ` Brandon Williams 2018-04-04 18:24 ` Jeff King 0 siblings, 2 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-04 15:48 UTC (permalink / raw) To: Derrick Stolee, git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill On 4/4/2018 11:45 AM, Derrick Stolee wrote: > The containment algorithm for 'git branch --contains' is different > from that for 'git tag --contains' in that it uses is_descendant_of() > instead of contains_tag_algo(). The expensive portion of the branch > algorithm is computing merge bases. > > When a commit-graph file exists with generation numbers computed, > we can avoid this merge-base calculation when the target commit has > a larger generation number than the target commits. > > Performance tests were run on a copy of the Linux repository where > HEAD is contained in v4.13 but no earlier tag. 
Also, all tags were > copied to branches and 'git branch --contains' was tested: > > Before: 60.0s > After: 0.4s > Rel %: -99.3% > > Reported-by: Jeff King <peff@peff.net> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > commit.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/commit.c b/commit.c > index 858f4fdbc9..2566cba79f 100644 > --- a/commit.c > +++ b/commit.c > @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * > { > struct commit_list *bases; > int ret = 0, i; > + uint32_t min_generation = GENERATION_NUMBER_UNDEF; > > if (parse_commit(commit)) > return ret; > - for (i = 0; i < nr_reference; i++) > + for (i = 0; i < nr_reference; i++) { > if (parse_commit(reference[i])) > return ret; > + if (min_generation > reference[i]->generation) > + min_generation = reference[i]->generation; > + } > + > + if (commit->generation > min_generation) > + return 0; > > bases = paint_down_to_common(commit, nr_reference, reference); > if (commit->object.flags & PARENT2) This patch may suffice to speed up 'git branch --contains' instead of needing to always use the 'git tag --contains' algorithm as considered in [1]. Thanks, -Stolee [1] https://public-inbox.org/git/20180303051516.GE27689@sigill.intra.peff.net/ Re: [PATCH 0/4] Speed up git tag --contains ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases() 2018-04-04 15:48 ` Derrick Stolee @ 2018-04-04 17:01 ` Brandon Williams 2018-04-04 18:24 ` Jeff King 1 sibling, 0 replies; 162+ messages in thread From: Brandon Williams @ 2018-04-04 17:01 UTC (permalink / raw) To: Derrick Stolee; +Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider On 04/04, Derrick Stolee wrote: > On 4/4/2018 11:45 AM, Derrick Stolee wrote: > > The containment algorithm for 'git branch --contains' is different > > from that for 'git tag --contains' in that it uses is_descendant_of() > > instead of contains_tag_algo(). The expensive portion of the branch > > algorithm is computing merge bases. > > > > When a commit-graph file exists with generation numbers computed, > > we can avoid this merge-base calculation when the target commit has > > a larger generation number than the target commits. > > > > Performance tests were run on a copy of the Linux repository where > > HEAD is contained in v4.13 but no earlier tag. Also, all tags were > > copied to branches and 'git branch --contains' was tested: > > > > Before: 60.0s > > After: 0.4s > > Rel %: -99.3% Now that is an impressive speedup. -- Brandon Williams ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases() 2018-04-04 15:48 ` Derrick Stolee 2018-04-04 17:01 ` Brandon Williams @ 2018-04-04 18:24 ` Jeff King 2018-04-04 18:53 ` Derrick Stolee 1 sibling, 1 reply; 162+ messages in thread From: Jeff King @ 2018-04-04 18:24 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill On Wed, Apr 04, 2018 at 11:48:42AM -0400, Derrick Stolee wrote: > > diff --git a/commit.c b/commit.c > > index 858f4fdbc9..2566cba79f 100644 > > --- a/commit.c > > +++ b/commit.c > > @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * > > { > > struct commit_list *bases; > > int ret = 0, i; > > + uint32_t min_generation = GENERATION_NUMBER_UNDEF; > > if (parse_commit(commit)) > > return ret; > > - for (i = 0; i < nr_reference; i++) > > + for (i = 0; i < nr_reference; i++) { > > if (parse_commit(reference[i])) > > return ret; > > + if (min_generation > reference[i]->generation) > > + min_generation = reference[i]->generation; > > + } > > + > > + if (commit->generation > min_generation) > > + return 0; > > bases = paint_down_to_common(commit, nr_reference, reference); > > if (commit->object.flags & PARENT2) > > This patch may suffice to speed up 'git branch --contains' instead of > needing to always use the 'git tag --contains' algorithm as considered in > [1]. I'd have to do some timings, but I suspect we may want to switch to the "tag --contains" algorithm anyway. This still does N independent merge-base operations, one per ref. So with enough refs, you're still better off throwing it all into one big traversal. -Peff ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases() 2018-04-04 18:24 ` Jeff King @ 2018-04-04 18:53 ` Derrick Stolee 2018-04-04 18:59 ` Jeff King 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-04 18:53 UTC (permalink / raw) To: Jeff King; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill On 4/4/2018 2:24 PM, Jeff King wrote: > On Wed, Apr 04, 2018 at 11:48:42AM -0400, Derrick Stolee wrote: > >>> diff --git a/commit.c b/commit.c >>> index 858f4fdbc9..2566cba79f 100644 >>> --- a/commit.c >>> +++ b/commit.c >>> @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * >>> { >>> struct commit_list *bases; >>> int ret = 0, i; >>> + uint32_t min_generation = GENERATION_NUMBER_UNDEF; >>> if (parse_commit(commit)) >>> return ret; >>> - for (i = 0; i < nr_reference; i++) >>> + for (i = 0; i < nr_reference; i++) { >>> if (parse_commit(reference[i])) >>> return ret; >>> + if (min_generation > reference[i]->generation) >>> + min_generation = reference[i]->generation; >>> + } >>> + >>> + if (commit->generation > min_generation) >>> + return 0; >>> bases = paint_down_to_common(commit, nr_reference, reference); >>> if (commit->object.flags & PARENT2) >> This patch may suffice to speed up 'git branch --contains' instead of >> needing to always use the 'git tag --contains' algorithm as considered in >> [1]. I guess I want to specify: the only reason to NOT switch to the tags algorithm is because it _may_ hurt existing cases in certain data shapes... > I'd have to do some timings, but I suspect we may want to switch to the > "tag --contains" algorithm anyway. This still does N independent > merge-base operations, one per ref. So with enough refs, you're still > better off throwing it all into one big traversal. ...and I suppose your timings are to find out if there are data shapes where the branch algorithm is faster. 
Perhaps that is impossible now that we have the generation number cutoff for the tag algorithm. Since the branch algorithm checks generation numbers before triggering paint_down_to_common(), we will do N independent merge-base calculations, where N is the number of branches with large enough generation numbers (which is why my test does so well: most are below the target generation number). This doesn't help at all if none of the refs are in the graph. The other thing to do is add a minimum generation for the walk in paint_down_to_common() so even if commit->generation <= min_generation we still only walk down to commit->generation instead of all merge bases. This is something we could change in a later patch. Patches 7 and 8 seem to me like simple changes with no downside UNLESS we are deciding instead to delete the code I'm changing. Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
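The count N mentioned above can be made concrete with a small sketch. It assumes each branch tip is checked on its own (a single reference per check), and the names toy_needs_walk() and toy_count_walks() are hypothetical; nothing here is taken from commit.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical pre-check: `commit` can only be an ancestor of `tip`
 * if its generation number is not larger than the tip's.
 */
int toy_needs_walk(uint32_t commit_gen, uint32_t tip_gen)
{
	return commit_gen <= tip_gen;
}

/* Count the branch tips that still require a merge-base walk. */
size_t toy_count_walks(uint32_t commit_gen, const uint32_t *tip_gens,
		       size_t nr)
{
	size_t i, walks = 0;

	assert(tip_gens != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		if (toy_needs_walk(commit_gen, tip_gens[i]))
			walks++;
	return walks;
}
```

With a target at generation 1000 and tips at 10, 250, 999, 1000 and 1200, only the last two tips need a walk. If every tip carried the "not in the graph" value (assumed here to be the largest uint32_t), every tip would need one, which matches the remark that the check does not help when none of the refs are in the graph.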
* Re: [PATCH 8/6] commit: use generation numbers for in_merge_bases() 2018-04-04 18:53 ` Derrick Stolee @ 2018-04-04 18:59 ` Jeff King 0 siblings, 0 replies; 162+ messages in thread From: Jeff King @ 2018-04-04 18:59 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill On Wed, Apr 04, 2018 at 02:53:45PM -0400, Derrick Stolee wrote: > > I'd have to do some timings, but I suspect we may want to switch to the > > "tag --contains" algorithm anyway. This still does N independent > > merge-base operations, one per ref. So with enough refs, you're still > > better off throwing it all into one big traversal. > > ...and I suppose your timings are to find out if there are data shapes where > the branch algorithm is faster. Perhaps that is impossible now that we have > the generation number cutoff for the tag algorithm. Well, I wanted to show the opposite: that the branch algorithm can still perform quite poorly. :) I think with generation numbers that the tag algorithm should always perform better, since you can't walk past a merge base when using a cutoff. But it could definitely perform worse in a case where you don't have generation numbers. > Patches 7 and 8 seem to me like simple changes with no downside UNLESS we > are deciding instead to delete the code I'm changing. Yeah, I think they are strict improvements modulo the inverted UNDEF logic I mentioned. -Peff ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 7/6] ref-filter: use generation number for --contains 2018-04-04 15:45 ` [PATCH 7/6] ref-filter: use generation number for --contains Derrick Stolee 2018-04-04 15:45 ` [PATCH 8/6] commit: use generation numbers for in_merge_bases() Derrick Stolee @ 2018-04-04 18:22 ` Jeff King 2018-04-04 19:06 ` Derrick Stolee 1 sibling, 1 reply; 162+ messages in thread From: Jeff King @ 2018-04-04 18:22 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, bmwill On Wed, Apr 04, 2018 at 11:45:53AM -0400, Derrick Stolee wrote: > @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate, > struct contains_cache *cache) > { > struct contains_stack contains_stack = { 0, 0, NULL }; > - enum contains_result result = contains_test(candidate, want, cache); > + enum contains_result result; > + uint32_t cutoff = GENERATION_NUMBER_UNDEF; > + const struct commit_list *p; > + > + for (p = want; p; p = p->next) { > + struct commit *c = p->item; > + parse_commit_or_die(c); > + if (c->generation < cutoff) > + cutoff = c->generation; > + } > + if (cutoff == GENERATION_NUMBER_UNDEF) > + cutoff = GENERATION_NUMBER_NONE; Hmm, on reflection, I'm not sure if this is right in the face of multiple "want" commits, only some of which have generation numbers. We probably want to disable the cutoff if _any_ "want" commit doesn't have a number. There's also an obvious corner case where this won't kick in, and you'd really like it to: recently added commits. E.g., if I do this: git gc ;# imagine this writes generation numbers git pull git tag --contains HEAD then HEAD isn't going to have a generation number. But this is the case where we have the most to gain, since we could throw away all of the ancient tags immediately upon seeing that their generation numbers are way less than that of HEAD. I wonder to what degree it's worth traversing to come up with a generation number for the "want" commits. 
If we walked, say, 50 commits to do it, you'd probably save a lot of work (since the alternative is walking thousands of commits until you realize that some ancient "v1.0" tag is not useful). I'd actually go so far as to say that any amount of traversal is generally going to be worth it to come up with the correct generation cutoff here. You can come up with pathological cases where you only have one really recent tag or something, but in practice every repository where performance is a concern is going to end up with refs much further back than it would take to reach the cutoff condition. -Peff ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 7/6] ref-filter: use generation number for --contains 2018-04-04 18:22 ` [PATCH 7/6] ref-filter: use generation number for --contains Jeff King @ 2018-04-04 19:06 ` Derrick Stolee 2018-04-04 19:16 ` Jeff King 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-04 19:06 UTC (permalink / raw) To: Jeff King, Derrick Stolee; +Cc: git, avarab, sbeller, larsxschneider, bmwill On 4/4/2018 2:22 PM, Jeff King wrote: > On Wed, Apr 04, 2018 at 11:45:53AM -0400, Derrick Stolee wrote: > >> @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate, >> struct contains_cache *cache) >> { >> struct contains_stack contains_stack = { 0, 0, NULL }; >> - enum contains_result result = contains_test(candidate, want, cache); >> + enum contains_result result; >> + uint32_t cutoff = GENERATION_NUMBER_UNDEF; >> + const struct commit_list *p; >> + >> + for (p = want; p; p = p->next) { >> + struct commit *c = p->item; >> + parse_commit_or_die(c); >> + if (c->generation < cutoff) >> + cutoff = c->generation; >> + } Now that you mention it, let me split out the portion you are probably talking about as incorrect: >> + if (cutoff == GENERATION_NUMBER_UNDEF) >> + cutoff = GENERATION_NUMBER_NONE; You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we get no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then our walk will be limited to commits NOT in the commit-graph (which we hope is small if proper hygiene is followed). > Hmm, on reflection, I'm not sure if this is right in the face of > multiple "want" commits, only some of which have generation numbers. We > probably want to disable the cutoff if _any_ "want" commit doesn't have > a number. > > There's also an obvious corner case where this won't kick in, and you'd > really like it to: recently added commits. E.g,. 
if I do this: > > git gc ;# imagine this writes generation numbers > git pull > git tag --contains HEAD > > then HEAD isn't going to have a generation number. But this is the case > where we have the most to gain, since we could throw away all of the > ancient tags immediately upon seeing that their generation numbers are > way less than that of HEAD. > > I wonder to what degree it's worth traversing to come up with a > generation number for the "want" commits. If we walked, say, 50 commits > to do it, you'd probably save a lot of work (since the alternative is > walking thousands of commits until you realize that some ancient "v1.0" > tag is not useful). > > I'd actually go so far as to say that any amount of traversal is > generally going to be worth it to come up with the correct generation > cutoff here. You can come up with pathological cases where you only have > one really recent tag or something, but in practice every repository > where performance is a concern is going to end up with refs much further > back than it would take to reach the cutoff condition. Perhaps there is some value in walking to find the correct cutoff value, but it is difficult to determine how far we are from commits with correct generation numbers _a priori_. I'd rather rely on the commit-graph being in a good state, not too far behind the refs. An added complexity of computing generation numbers dynamically is that we would need to add a dependence on the commit-graph file's existence at all. Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 7/6] ref-filter: use generation number for --contains 2018-04-04 19:06 ` Derrick Stolee @ 2018-04-04 19:16 ` Jeff King 2018-04-04 19:22 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Jeff King @ 2018-04-04 19:16 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill On Wed, Apr 04, 2018 at 03:06:26PM -0400, Derrick Stolee wrote: > > > @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate, > > > struct contains_cache *cache) > > > { > > > struct contains_stack contains_stack = { 0, 0, NULL }; > > > - enum contains_result result = contains_test(candidate, want, cache); > > > + enum contains_result result; > > > + uint32_t cutoff = GENERATION_NUMBER_UNDEF; > > > + const struct commit_list *p; > > > + > > > + for (p = want; p; p = p->next) { > > > + struct commit *c = p->item; > > > + parse_commit_or_die(c); > > > + if (c->generation < cutoff) > > > + cutoff = c->generation; > > > + } > > Now that you mention it, let me split out the portion you are probably > talking about as incorrect: > > > > + if (cutoff == GENERATION_NUMBER_UNDEF) > > > + cutoff = GENERATION_NUMBER_NONE; > > You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we get > no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then our walk > will be limited to commits NOT in the commit-graph (which we hope is small > if proper hygiene is followed). I think it's more than that. If we leave it at UNDEF, that's wrong, because contains_test() compares: candidate->generation < cutoff which would _always_ be true. In other words, we're saying that our "want" has an insanely high generation number, and traversing can never find it. Which is clearly wrong. So we have to put it at "0", to say "you should always traverse, we can't tell you that this is a dead end". So that part of the logic is currently correct. 
But what I was getting at is that the loop behavior can't just pick the min cutoff. The min is effectively "0" if there's even a single ref for which we don't have a generation number, because we cannot ever stop traversing (we might get to that commit if we kept going). (It's also possible I'm confused about how UNDEF and NONE are used; I'm assuming commits for which we don't have a generation number available would get UNDEF in their commit->generation field). If you could make the assumption that when we have a generation for commit X, then we have a generation for all of its ancestors, things get easier. Because then if you hit commit X with a generation number and want to compare it to a cutoff, you know that either: 1. The cutoff is defined, in which case you can stop traversing if we've gone past the cutoff. 2. The cutoff is undefined, in which case we cannot possibly reach our "want" by traversing. Even if it has a smaller generation number than us, it's on an unrelated line of development. I don't know that the reachability property is explicitly promised by your work, but it seems like it would be a natural fallout (after all, you have to know the generation of each ancestor in order to compute the later ones, so you're really just promising that you've actually stored all the ones you've computed). > > I wonder to what degree it's worth traversing to come up with a > > generation number for the "want" commits. If we walked, say, 50 commits > > to do it, you'd probably save a lot of work (since the alternative is > > walking thousands of commits until you realize that some ancient "v1.0" > > tag is not useful). > > > > I'd actually go so far as to say that any amount of traversal is > > generally going to be worth it to come up with the correct generation > > cutoff here. 
You can come up with pathological cases where you only have > > one really recent tag or something, but in practice every repository > > where performance is a concern is going to end up with refs much further > > back than it would take to reach the cutoff condition. > > Perhaps there is some value in walking to find the correct cutoff value, but > it is difficult to determine how far we are from commits with correct > generation numbers _a priori_. I'd rather rely on the commit-graph being in > a good state, not too far behind the refs. An added complexity of computing > generation numbers dynamically is that we would need to add a dependence on > the commit-graph file's existence at all. If you could make the reachability assumption, I think this question just goes away. As soon as you hit a commit with _any_ generation number, you could quit traversing down that path. -Peff ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 7/6] ref-filter: use generation number for --contains 2018-04-04 19:16 ` Jeff King @ 2018-04-04 19:22 ` Derrick Stolee 2018-04-04 19:42 ` Jeff King 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-04 19:22 UTC (permalink / raw) To: Jeff King; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill On 4/4/2018 3:16 PM, Jeff King wrote: > On Wed, Apr 04, 2018 at 03:06:26PM -0400, Derrick Stolee wrote: > >>>> @@ -1615,8 +1619,20 @@ static enum contains_result contains_tag_algo(struct commit *candidate, >>>> struct contains_cache *cache) >>>> { >>>> struct contains_stack contains_stack = { 0, 0, NULL }; >>>> - enum contains_result result = contains_test(candidate, want, cache); >>>> + enum contains_result result; >>>> + uint32_t cutoff = GENERATION_NUMBER_UNDEF; >>>> + const struct commit_list *p; >>>> + >>>> + for (p = want; p; p = p->next) { >>>> + struct commit *c = p->item; >>>> + parse_commit_or_die(c); >>>> + if (c->generation < cutoff) >>>> + cutoff = c->generation; >>>> + } >> Now that you mention it, let me split out the portion you are probably >> talking about as incorrect: >> >>>> + if (cutoff == GENERATION_NUMBER_UNDEF) >>>> + cutoff = GENERATION_NUMBER_NONE; >> You're right, we don't want this. Since GENERATION_NUMBER_NONE == 0, we get >> no benefit from this. If we keep it GENERATION_NUMBER_UNDEF, then our walk >> will be limited to commits NOT in the commit-graph (which we hope is small >> if proper hygiene is followed). > I think it's more than that. If we leave it at UNDEF, that's wrong, > because contains_test() compares: > > candidate->generation < cutoff > > which would _always_ be true. In other words, we're saying that our > "want" has an insanely high generation number, and traversing can never > find it. Which is clearly wrong. That condition is not always true (which is why we use strict comparison instead of <=). 
If a commit is not in the commit-graph file, then its generation is equal to GENERATION_NUMBER_UNDEF, as shown in alloc.c: void *alloc_commit_node(void) { struct commit *c = alloc_node(&commit_state, sizeof(struct commit)); c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; c->generation = GENERATION_NUMBER_UNDEF; return c; } > So we have to put it at "0", to say "you should always traverse, we > can't tell you that this is a dead end". So that part of the logic is > currently correct. > > But what I was getting at is that the loop behavior can't just pick the > min cutoff. The min is effectively "0" if there's even a single ref for > which we don't have a generation number, because we cannot ever stop > traversing (we might get to that commit if we kept going). > > (It's also possible I'm confused about how UNDEF and NONE are used; I'm > assuming commits for which we don't have a generation number available > would get UNDEF in their commit->generation field). I think it is this case. > If you could make the assumption that when we have a generation for > commit X, then we have a generation for all of its ancestors, things get > easier. Because then if you hit commit X with a generation number and > want to compare it to a cutoff, you know that either: > > 1. The cutoff is defined, in which case you can stop traversing if > we've gone past the cutoff. > > 2. The cutoff is undefined, in which case we cannot possibly reach > our "want" by traversing. Even if it has a smaller generation > number than us, it's on an unrelated line of development. > > I don't know that the reachability property is explicitly promised by > your work, but it seems like it would be a natural fallout (after all, > you have to know the generation of each ancestor in order to compute the > later ones, so you're really just promising that you've actually stored > all the ones you've computed). 
The commit-graph is closed under reachability, so if a commit has a generation number then it is in the graph and so are all its ancestors. The reason for GENERATION_NUMBER_NONE is that the commit-graph file stores "0" for generation number until this patch. It still satisfies the condition that gen(A) < gen(B) if B can reach A, but also gives us a condition for "this commit still needs its generation number computed". > >>> I wonder to what degree it's worth traversing to come up with a >>> generation number for the "want" commits. If we walked, say, 50 commits >>> to do it, you'd probably save a lot of work (since the alternative is >>> walking thousands of commits until you realize that some ancient "v1.0" >>> tag is not useful). >>> >>> I'd actually go so far as to say that any amount of traversal is >>> generally going to be worth it to come up with the correct generation >>> cutoff here. You can come up with pathological cases where you only have >>> one really recent tag or something, but in practice every repository >>> where performance is a concern is going to end up with refs much further >>> back than it would take to reach the cutoff condition. >> Perhaps there is some value in walking to find the correct cutoff value, but >> it is difficult to determine how far we are from commits with correct >> generation numbers _a priori_. I'd rather rely on the commit-graph being in >> a good state, not too far behind the refs. An added complexity of computing >> generation numbers dynamically is that we would need to add a dependence on >> the commit-graph file's existence at all. > If you could make the reachability assumption, I think this question > just goes away. As soon as you hit a commit with _any_ generation > number, you could quit traversing down that path. That is the idea. I should make this clearer in all of my commit messages. Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
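The role of the stored "0" described above can be sketched as follows. This only illustrates the definition from the cover letter (one more than the maximum generation number among the parents, with parentless commits at one); it is not the commit-graph.c implementation. toy_node, toy_compute_generation() and TOY_GEN_NONE are hypothetical names, and the recursion is only suitable for small histories.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_GEN_NONE 0u /* stands in for GENERATION_NUMBER_NONE */

struct toy_node {
	uint32_t generation; /* TOY_GEN_NONE until computed */
	struct toy_node *parent[2];
};

/*
 * Fill in and return the generation number of `n`: one more than the
 * largest generation among its parents, or 1 for a root commit.
 */
uint32_t toy_compute_generation(struct toy_node *n)
{
	uint32_t max_parent = 0;
	size_t i;

	assert(n != NULL);
	if (n->generation != TOY_GEN_NONE)
		return n->generation; /* already computed */
	for (i = 0; i < 2; i++) {
		if (n->parent[i]) {
			uint32_t g = toy_compute_generation(n->parent[i]);
			if (g > max_parent)
				max_parent = g;
		}
	}
	n->generation = max_parent + 1;
	return n->generation;
}
```

Because a computed value is never zero, a zero field doubles as the "still needs its generation number computed" marker, and asking for the generation of one commit fills in all of its ancestors along the way.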
* Re: [PATCH 7/6] ref-filter: use generation number for --contains 2018-04-04 19:22 ` Derrick Stolee @ 2018-04-04 19:42 ` Jeff King 2018-04-04 19:45 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Jeff King @ 2018-04-04 19:42 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill On Wed, Apr 04, 2018 at 03:22:01PM -0400, Derrick Stolee wrote: > > I don't know that the reachability property is explicitly promised by > > your work, but it seems like it would be a natural fallout (after all, > > you have to know the generation of each ancestor in order to compute the > > later ones, so you're really just promising that you've actually stored > > all the ones you've computed). > > The commit-graph is closed under reachability, so if a commit has a > generation number then it is in the graph and so are all its ancestors. OK, if we assume that it's closed, then I think we can effectively ignore the UNDEF cases. They'll just work out. And then yes I'd agree that the: if (cutoff == UNDEF) cutoff = NONE; code is wrong. We'd want to keep it at UNDEF so we stop traversing at any generation number. > The reason for GENERATION_NUMBER_NONE is that the commit-graph file stores > "0" for generation number until this patch. It still satisfies the condition > that gen(A) < gen(B) if B can reach A, but also gives us a condition for > "this commit still needs its generation number computed". OK. I thought at first that would yield wrong results when comparing UNDEF to NONE, but I think for this kind of --contains traversal, it's still OK (NONE is less than UNDEF, but we know that the UNDEF thing cannot be found by traversing from a NONE). > > If you could make the reachability assumption, I think this question > > just goes away. As soon as you hit a commit with _any_ generation > > number, you could quit traversing down that path. > That is the idea. I should make this clearer in all of my commit messages. 
Yes, please. :) And maybe in the documentation of the file format, if it's not there (I didn't check). It's a very useful property, and we want to make sure people making use of the graph know they can depend on it. -Peff ^ permalink raw reply [flat|nested] 162+ messages in thread
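The cutoff rule discussed above can be sketched in a few lines of Python. This is only an illustration of the idea, not the ref-filter.c code: the names `contains`, `UNDEF` and `NONE` and the dict-based graph are assumptions made for the example, and `float("inf")` merely stands in for "larger than any real generation number".

```python
from typing import Dict, List

UNDEF = float("inf")  # stand-in: commit is not in the commit-graph file
NONE = 0              # stand-in: commit is in the file, generation not computed


def contains(parents: Dict[str, List[str]], gen: Dict[str, int],
             tip: str, want: str) -> bool:
    """Return True if `tip` can reach `want` by following parent links.

    `gen` only has entries for commits in the (hypothetical) graph file;
    every other commit is treated as UNDEF.
    """
    cutoff = gen.get(want, UNDEF)  # kept at UNDEF when `want` is not in the file
    stack, seen = [tip], set()
    while stack:
        commit = stack.pop()
        if commit == want:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        # A commit whose generation is below the cutoff cannot reach `want`.
        # With cutoff == UNDEF this stops at any commit found in the file,
        # which is valid because the file is closed under reachability.
        if gen.get(commit, UNDEF) < cutoff:
            continue
        stack.extend(parents.get(commit, []))
    return False


# Toy history: A <- B <- C are in the graph file; D, E and F are newer commits.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"], "F": ["B"]}
gen = {"A": 1, "B": 2, "C": 3}
```

With an old file that stores NONE for every commit the cutoff for an in-file `want` is 0, so nothing is pruned and the walk is still correct, just slower.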
* Re: [PATCH 7/6] ref-filter: use generation number for --contains 2018-04-04 19:42 ` Jeff King @ 2018-04-04 19:45 ` Derrick Stolee 2018-04-04 19:46 ` Jeff King 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-04 19:45 UTC (permalink / raw) To: Jeff King; +Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill On 4/4/2018 3:42 PM, Jeff King wrote: > On Wed, Apr 04, 2018 at 03:22:01PM -0400, Derrick Stolee wrote: > >> That is the idea. I should make this clearer in all of my commit messages. > Yes, please. :) And maybe in the documentation of the file format, if > it's not there (I didn't check). It's a very useful property, and we > want to make sure people making use of the graph know they can depend on > it. For v2, I'll expand on the roles of _UNDEF and _NONE in the discussion of generation numbers in Documentation/technical/commit-graph.txt (the design doc instead of the file format). Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 7/6] ref-filter: use generation number for --contains 2018-04-04 19:45 ` Derrick Stolee @ 2018-04-04 19:46 ` Jeff King 0 siblings, 0 replies; 162+ messages in thread From: Jeff King @ 2018-04-04 19:46 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, avarab, sbeller, larsxschneider, bmwill On Wed, Apr 04, 2018 at 03:45:30PM -0400, Derrick Stolee wrote: > On 4/4/2018 3:42 PM, Jeff King wrote: > > On Wed, Apr 04, 2018 at 03:22:01PM -0400, Derrick Stolee wrote: > > > > > That is the idea. I should make this clearer in all of my commit messages. > > Yes, please. :) And maybe in the documentation of the file format, if > > it's not there (I didn't check). It's a very useful property, and we > > want to make sure people making use of the graph know they can depend on > > it. > > For v2, I'll expand on the roles of _UNDEF and _NONE in the discussion of > generation numbers in Documentation/technical/commit-graph.txt (the design > doc instead of the file format). Yeah, that makes sense. Thanks, and thanks for a thoughtful discussion. The performance numbers are very exciting. -Peff ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-03 18:29 ` Derrick Stolee 2018-04-03 18:47 ` Jeff King @ 2018-04-07 17:09 ` Jakub Narebski 1 sibling, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-04-07 17:09 UTC (permalink / raw) To: Derrick Stolee Cc: Brandon Williams, Derrick Stolee, git, avarab, sbeller, larsxschneider, peff Derrick Stolee <stolee@gmail.com> writes: > On 4/3/2018 2:03 PM, Brandon Williams wrote: >> On 04/03, Derrick Stolee wrote: >>> This is the first of several "small" patches that follow the serialized >>> Git commit graph patch (ds/commit-graph). >>> >>> As described in Documentation/technical/commit-graph.txt, the generation >>> number of a commit is one more than the maximum generation number among >>> its parents (trivially, a commit with no parents has generation number >>> one). [...] >>> A more substantial refactoring of revision.c is required before making >>> 'git log --graph' use generation numbers effectively. >> >> log --graph should benefit a lot more from this correct? I know we've >> talked a bit about negotiation and I wonder if these generation numbers >> should be able to help out a little bit with that some day. > > 'log --graph' should be a HUGE speedup, when it is refactored. Since > the topo-order can "stream" commits to the pager, it can be very > responsive to return the graph in almost all conditions. (The case > where generation numbers are not enough is when filters reduce the set > of displayed commits to be very sparse, so many commits are walked > anyway.) I wonder if next big speedup would be to store [some] topological ordering of commits in the commit graph... It could be done for example in two chunks: a mapping to position in topological order, and list of commits sorted in topological order. 
Note also that FELINE index uses (or can use -- but it is supposedly the optimal choice) position of vertex/node in topological order as one of the two values in the pair that composes FELINE index. > If we have generic "can X reach Y?" queries, then we can also use > generation numbers there to great effect (by not walking commits Z > with gen(Z) <= gen(Y)). Perhaps I should look at that "git branch > --contains" thread for ideas. This is something that is shown in the Google Colab [Jupyter] Notebook I have mentioned: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing > For negotiation, there are some things we can do here. VSTS uses > generation numbers as a heuristic for determining "all wants connected > to haves" which is a condition for halting negotiation. The idea is > very simple, and I'd be happy to discuss it on a separate thread. Nice. How much speedup it gives? Best regards, -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee ` (7 preceding siblings ...) 2018-04-03 18:03 ` Brandon Williams @ 2018-04-07 16:55 ` Jakub Narebski 2018-04-08 1:06 ` Derrick Stolee 2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee 9 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-07 16:55 UTC (permalink / raw) To: Derrick Stolee Cc: git, Ævar Arnfjörð Bjarmason, Stefan Beller, Lars Schneider, Jeff King Hello, Derrick Stolee <dstolee@microsoft.com> writes: > This is the first of several "small" patches that follow the serialized > Git commit graph patch (ds/commit-graph). > > As described in Documentation/technical/commit-graph.txt, the generation > number of a commit is one more than the maximum generation number among > its parents (trivially, a commit with no parents has generation number > one). > > This series makes the computation of generation numbers part of the > commit-graph write process. > > Finally, generation numbers are used [...]. > > This does not have a significant performance benefit in repositories > of normal size, but in the Windows repository, some merge-base > calculations improve from 3.1s to 2.9s. A modest speedup, but provides > an actual consumer of generation numbers as a starting point. > > A more substantial refactoring of revision.c is required before making > 'git log --graph' use generation numbers effectively. I have started working on Jupyter Notebook on Google Colaboratory to find out how much speedup we can get using generation numbers (level negative-cut filter), FELINE index (negative-cut filter) and min-post intervals in some spanning tree (positive-cut filter, if I understand it correctly the base of GRAIL method) in commit graphs. Currently I am at the stage of reproducing results in FELINE paper: "Reachability Queries in Very Large Graphs: A Fast Refined Online Search Approach" by Renê R. 
Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed J. Zaki (2014). This paper is
available in PDF form at https://openproceedings.org/EDBT/2014/paper_166.pdf

The Jupyter Notebook (which runs on Google cloud, but can also be run
locally) uses a Python kernel, the NetworkX library for graph manipulation,
and matplotlib (via NetworkX) for display.

Available at:
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

I hope that could be of help, or at least interesting
--
Jakub Narębski

^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-07 16:55 ` Jakub Narebski @ 2018-04-08 1:06 ` Derrick Stolee 2018-04-11 19:32 ` Jakub Narebski 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-08 1:06 UTC (permalink / raw) To: Jakub Narebski, Derrick Stolee Cc: git, Ævar Arnfjörð Bjarmason, Stefan Beller, Lars Schneider, Jeff King On 4/7/2018 12:55 PM, Jakub Narebski wrote: > Currently I am at the stage of reproducing results in FELINE paper: > "Reachability Queries in Very Large Graphs: A Fast Refined Online Search > Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed > J. Zaki (2014). This paper is available in the PDF form at > https://openproceedings.org/EDBT/2014/paper_166.pdf > > The Jupyter Notebook (which runs on Google cloud, but can be also run > locally) uses Python kernel, NetworkX librabry for graph manipulation, > and matplotlib (via NetworkX) for display. > > Available at: > https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg > https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing > > I hope that could be of help, or at least interesting Let me know when you can give numbers (either raw performance or # of commits walked) for real-world Git commit graphs. The Linux repo is a good example to use for benchmarking, but I also use the Kotlin repo sometimes as it has over a million objects and over 250K commits. Of course, the only important statistic at the end of the day is the end-to-end time of a 'git ...' command. Your investigations should inform whether it is worth prototyping the feature in the git codebase. Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-08 1:06 ` Derrick Stolee @ 2018-04-11 19:32 ` Jakub Narebski 2018-04-11 19:58 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread
From: Jakub Narebski @ 2018-04-11 19:32 UTC (permalink / raw)
To: Derrick Stolee
Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason, Stefan Beller, Lars Schneider, Jeff King

Derrick Stolee <stolee@gmail.com> writes:

> On 4/7/2018 12:55 PM, Jakub Narebski wrote:
>> Currently I am at the stage of reproducing results in FELINE paper:
>> "Reachability Queries in Very Large Graphs: A Fast Refined Online Search
>> Approach" by Renê R. Veloso, Loïc Cerf, Wagner Meira Jr and Mohammed
>> J. Zaki (2014). This paper is available in the PDF form at
>> https://openproceedings.org/EDBT/2014/paper_166.pdf
>>
>> The Jupyter Notebook (which runs on Google cloud, but can be also run
>> locally) uses Python kernel, NetworkX library for graph manipulation,
>> and matplotlib (via NetworkX) for display.
>>
>> Available at:
>> https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
>> https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing
>>
>> I hope that could be of help, or at least interesting
>
> Let me know when you can give numbers (either raw performance or # of
> commits walked) for real-world Git commit graphs. The Linux repo is a
> good example to use for benchmarking, but I also use the Kotlin repo
> sometimes as it has over a million objects and over 250K commits.

As I am currently converting the git repository into a commit graph, the
number of objects doesn't matter. Kotlin does sit nicely in the largish size
set, though: not as large as the Linux kernel, which has 750K commits, but
much larger than git.git with 65K commits.

> Of course, the only important statistic at the end of the day is the
> end-to-end time of a 'git ...' command. Your investigations should
> inform whether it is worth prototyping the feature in the git
> codebase.

What would you suggest as a good test that could imply performance? The
Google Colab notebook linked to above includes a function to count the
number of commits (nodes / vertices in the commit graph) walked,
currently in the worst case scenario.

I have tried finding the number of false positives for the level (generation
number) filter and for the FELINE index, and the number of false negatives for
min-post intervals in the spanning tree (for the DFS tree) for 10000
randomly selected pairs of commits... but I don't think this is a good
benchmark.

In the Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git),
which have 750832 nodes and 811733 edges, and 563747941392 possible
directed pairs, we have for 10000 randomly selected pairs of commits:

  level-filter has 91 = 0.91% [all] false positives
  FELINE index has 78 = 0.78% [all] false positives
  FELINE index has 1.16667 times fewer false positives than level filter

  min-post spanning-tree intervals has 3641 = 36.41% [all] false
  negatives

For the git.git repository (https://github.com/git/git.git), which has 52950
nodes and 65887 edges, the numbers are slightly more in the FELINE index's
favor (also out of 10000 random pairs):

  level-filter has 504 = 9.11% false positives
  FELINE index has 125 = 2.26% false positives
  FELINE index has 4.032 times fewer false positives than level filter

This is for FELINE which does not use the level / generation-numbers filter.

Regards,
--
Jakub Narębski

^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-11 19:32 ` Jakub Narebski @ 2018-04-11 19:58 ` Derrick Stolee 2018-04-14 16:52 ` Jakub Narebski 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-11 19:58 UTC (permalink / raw) To: Jakub Narebski Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason, Stefan Beller, Lars Schneider, Jeff King On 4/11/2018 3:32 PM, Jakub Narebski wrote: > What would you suggest as a good test that could imply performance? The > Google Colab notebook linked to above includes a function to count > number of commits (nodes / vertices in the commit graph) walked, > currently in the worst case scenario. The two main questions to consider are: 1. Can X reach Y? 2. What is the set of merge-bases between X and Y? And the thing to measure is a commit count. If possible, it would be good to count commits walked (commits whose parent list is enumerated) and commits inspected (commits that were listed as a parent of some walked commit). Walked commits require a commit parse -- albeit from the commit-graph instead of the ODB now -- while inspected commits only check the in-memory cache. For git.git and Linux, I like to use the release tags as tests. They provide a realistic view of the linear history, and maintenance releases have their own history from the major releases. > I have tried finding number of false positives for level (generation > number) filter and for FELINE index, and number of false negatives for > min-post intervals in the spanning tree (for DFS tree) for 10000 > randomly selected pairs of commits... but I don't think this is a good > benchmark. What is a false-positive? A case where gen(X) < gen(Y) but Y cannot reach X? I do not think that is a great benchmark, but I guess it is something to measure. 
> I Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) > that has 750832 nodes and 811733 edges, and 563747941392 possible > directed pairs, we have for 10000 randomly selected pairs of commits: > > level-filter has 91 = 0.91% [all] false positives > FELINE index has 78 = 0.78% [all] false positives > FELINE index has 1.16667 less false positives than level filter > > min-post spanning-tree intervals has 3641 = 36.41% [all] false > negatives Perhaps something you can do instead of sampling from N^2 commits in total is to select a pair of generations (say, G = 20000, G' = 20100) or regions of generations ( 20000 <= G <= 20050, 20100 <= G' <= 20150) and see how many false positives you see by testing all pairs (one from each level). The delta between the generations may need to be smaller to actually have a large proportion of unreachable pairs. Try different levels, since major version releases tend to "pinch" the commit graph to a common history. > For git.git repository (https://github.com/git/git.git) that has 52950 > nodes and 65887 edges the numbers are slighly more in FELINE index > favor (also out of 10000 random pairs): > > level-filter has 504 = 9.11% false positives > FELINE index has 125 = 2.26% false positives > FELINE index has 4.032 less false positives than level filter > > This is for FELINE which does not use level / generatio-numbers filter. Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
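The band-sampling experiment suggested above could be set up roughly as follows. This is a hedged sketch, not code from the notebook: `can_reach`, `level_false_positives` and the dict-of-parents graph are hypothetical names chosen for the example, and a "false positive" is taken to mean gen(X) < gen(Y) while Y cannot reach X, as defined in the message.

```python
from itertools import product
from typing import Dict, List, Tuple


def can_reach(parents: Dict[str, List[str]], src: str, dst: str) -> bool:
    """Plain iterative DFS over parent links, with a visited set."""
    stack, seen = [src], set()
    while stack:
        commit = stack.pop()
        if commit == dst:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return False


def level_false_positives(parents: Dict[str, List[str]], gen: Dict[str, int],
                          low: Tuple[int, int],
                          high: Tuple[int, int]) -> Tuple[int, int]:
    """Test all pairs (X, Y) with gen(X) in `low` and gen(Y) in `high`.

    Returns (false_positives, pairs_tested). A pair is a false positive
    when the level filter cannot rule it out (gen(X) < gen(Y)) although
    Y does not actually reach X.
    """
    xs = [c for c, g in gen.items() if low[0] <= g <= low[1]]
    ys = [c for c, g in gen.items() if high[0] <= g <= high[1]]
    false_positives = 0
    for x, y in product(xs, ys):
        if gen[x] < gen[y] and not can_reach(parents, y, x):
            false_positives += 1
    return false_positives, len(xs) * len(ys)


# Toy graph with two parallel branches off the root A.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B"], "E": ["C"]}
gen = {"A": 1, "B": 2, "C": 2, "D": 3, "E": 3}
fp, total = level_false_positives(parents, gen, low=(2, 2), high=(3, 3))
```

On a real repository the two bands would be ranges such as 20000..20050 and 20100..20150, and the full reachability check is the expensive part, so the bands need to stay narrow.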
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-11 19:58 ` Derrick Stolee @ 2018-04-14 16:52 ` Jakub Narebski 2018-04-21 20:44 ` Jakub Narebski 0 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-14 16:52 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason, Stefan Beller, Lars Schneider, Jeff King Derrick Stolee <stolee@gmail.com> writes: > On 4/11/2018 3:32 PM, Jakub Narebski wrote: >> What would you suggest as a good test that could imply performance? The >> Google Colab notebook linked to above includes a function to count >> number of commits (nodes / vertices in the commit graph) walked, >> currently in the worst case scenario. > > The two main questions to consider are: > > 1. Can X reach Y? That is easy to do. The function generic_is_reachable() does that... though using direct translation of the pseudocode for "Algorithm 3: Reachable" from FELINE paper, which is recursive and doesn't check if vertex was already visited was not good idea for large graphs such as Linux kernel commit graph, oops. That is why generic_is_reachable_large() was created. > 2. What is the set of merge-bases between X and Y? I don't have an algorithm for that in the Google Colaboratory notebook. Though I see that there exist algorithms for calculating lowest common ancestors in DAGs... I'll have to take a look how Git does that. > > And the thing to measure is a commit count. If possible, it would be > good to count commits walked (commits whose parent list is enumerated) > and commits inspected (commits that were listed as a parent of some > walked commit). Walked commits require a commit parse -- albeit from > the commit-graph instead of the ODB now -- while inspected commits > only check the in-memory cache. I don't quite see the distinction. 
Whether we access the generation number of a commit (information about the
level of a vertex in the graph), or a parent list (vertex successors /
neighbours), both need to access the commit-graph; well, accessing parents may
be more costly for octopus merges (due to having to go through the EDGE chunk).

I can easily return the set of visited commits (vertices), or just the size of
said set.

>
> For git.git and Linux, I like to use the release tags as tests. They
> provide a realistic view of the linear history, and maintenance
> releases have their own history from the major releases.

Hmmm... testing for v4.9-rc5..v4.9 in the Linux kernel commit graph, the
FELINE index does not bring any improvement over using just the level
(generation number) filter. But that may be caused by the narrowing of the
commit DAG around releases.

I will try to do the same between commits in a wide part, with many commits
with the same level (same generation number) both for the source and for the
target commit. This may be unfair to the level filter, though...

Note however that the FELINE index is not unambiguous, like generation
numbers are (modulo the decision whether to start at 0 or at 1); it depends
on the topological ordering chosen for the X elements.

>> I have tried finding number of false positives for level (generation
>> number) filter and for FELINE index, and number of false negatives for
>> min-post intervals in the spanning tree (for DFS tree) for 10000
>> randomly selected pairs of commits... but I don't think this is a good
>> benchmark.
>
> What is a false-positive? A case where gen(X) < gen(Y) but Y cannot
> reach X?

Yes. (And the equivalent for the FELINE index, which is a pair of integers).

> I do not think that is a great benchmark, but I guess it is
> something to measure.

I have simply used it to have something to compare.
>> I Linux kernel sources (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) >> that has 750832 nodes and 811733 edges, and 563747941392 possible >> directed pairs, we have for 10000 randomly selected pairs of commits: >> >> level-filter has 91 = 0.91% [all] false positives >> FELINE index has 78 = 0.78% [all] false positives >> FELINE index has 1.16667 less false positives than level filter >> >> min-post spanning-tree intervals has 3641 = 36.41% [all] false >> negatives > > Perhaps something you can do instead of sampling from N^2 commits in > total is to select a pair of generations (say, G = 20000, G' = 20100) > or regions of generations ( 20000 <= G <= 20050, 20100 <= G' <= 20150) > and see how many false positives you see by testing all pairs (one > from each level). The delta between the generations may need to be > smaller to actually have a large proportion of unreachable pairs. Try > different levels, since major version releases tend to "pinch" the > commit graph to a common history. That's a good idea. >> For git.git repository (https://github.com/git/git.git) that has 52950 >> nodes and 65887 edges the numbers are slighly more in FELINE index >> favor (also out of 10000 random pairs): >> >> level-filter has 504 = 9.11% false positives >> FELINE index has 125 = 2.26% false positives >> FELINE index has 4.032 less false positives than level filter >> >> This is for FELINE which does not use level / generatio-numbers filter. Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
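To make the walked / inspected distinction from this exchange concrete, here is a minimal sketch of an iterative reachability check with a visited set and an optional level filter. It is not the notebook's generic_is_reachable_large(); `reach_with_counts` and its return shape are assumptions for illustration. "Walked" means the commit's parent list was enumerated; "inspected" means the commit was seen as a parent of some walked commit.

```python
from typing import Dict, List, Tuple


def reach_with_counts(parents: Dict[str, List[str]], gen: Dict[str, int],
                      src: str, dst: str,
                      use_level_filter: bool = True) -> Tuple[bool, int, int]:
    """Return (reachable, commits_walked, commits_inspected)."""
    walked, inspected = set(), set()
    stack = [src]
    while stack:
        commit = stack.pop()
        if commit == dst:
            return True, len(walked), len(inspected)
        if commit in walked:
            continue  # the visited check that a naive recursive version lacks
        walked.add(commit)
        for parent in parents[commit]:
            inspected.add(parent)
            # Negative cut: a parent below dst's level cannot reach dst.
            if use_level_filter and gen[parent] < gen[dst]:
                continue
            if parent not in walked:
                stack.append(parent)
    return False, len(walked), len(inspected)


# Toy history: main line A <- B <- C <- D, side branch A <- X <- Y, merge M.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"],
           "X": ["A"], "Y": ["X"], "M": ["D", "Y"]}
gen = {"A": 1, "B": 2, "C": 3, "D": 4, "X": 2, "Y": 3, "M": 5}
```

In this toy graph, asking whether M reaches D walks one commit with the filter and four without it, which is the kind of difference the 'walk' column in a benchmark table would capture.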
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-14 16:52 ` Jakub Narebski @ 2018-04-21 20:44 ` Jakub Narebski 2018-04-23 13:54 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-21 20:44 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason, Stefan Beller, Lars Schneider, Jeff King Jakub Narebski <jnareb@gmail.com> writes: > Derrick Stolee <stolee@gmail.com> writes: >> On 4/11/2018 3:32 PM, Jakub Narebski wrote: > >>> What would you suggest as a good test that could imply performance? The >>> Google Colab notebook linked to above includes a function to count >>> number of commits (nodes / vertices in the commit graph) walked, >>> currently in the worst case scenario. >> >> The two main questions to consider are: >> >> 1. Can X reach Y? > > That is easy to do. The function generic_is_reachable() does > that... though using direct translation of the pseudocode for > "Algorithm 3: Reachable" from FELINE paper, which is recursive and > doesn't check if vertex was already visited was not good idea for large > graphs such as Linux kernel commit graph, oops. That is why > generic_is_reachable_large() was created. [...] >> And the thing to measure is a commit count. If possible, it would be >> good to count commits walked (commits whose parent list is enumerated) >> and commits inspected (commits that were listed as a parent of some >> walked commit). Walked commits require a commit parse -- albeit from >> the commit-graph instead of the ODB now -- while inspected commits >> only check the in-memory cache. [...] >> >> For git.git and Linux, I like to use the release tags as tests. They >> provide a realistic view of the linear history, and maintenance >> releases have their own history from the major releases. > > Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the > FELINE index does not bring any improvements over using just level > (generation number) filter. 
But that may be caused by narrowing od > commit DAG around releases. > > I try do do the same between commits in wide part, with many commits > with the same level (same generation number) both for source and for > target commit. Though this may be unfair to level filter, though... > > > Note however that FELINE index is not unabiguous, like generation > numbers are (modulo decision whether to start at 0 or at 1); it depends > on the topological ordering chosen for the X elements. One can now test reachability on git.git repository; there is a form where one can plug source and destination revisions at https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK&line=2&uniqifier=1 I have tried the case that is quite unfair to the generation numbers filter, namely the check between one of recent tags, and one commit that shares generation number among largest number of other commits. Here level = generation number-1 (as it starts at 0 for root commit, not 1). 
The results are:
 * src = 468165c1d = v2.17.0
 * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec

 * 468165c1d has level 18418 which it shares with 6 commits
 * 66d2e04ec has level 14776 which it shares with 93 commits
 * gen(468165c1d) - gen(66d2e04ec) = 3642

 algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
 -----------+---------+--------+----------+---------+----------+-----------+
 naive      | 48865   | 39599  | 244      | 9200    |          |           |
 level      | 3086    | 2492   | 113      | 528     | 285      |           |
 FELINE     | 283     | 216    | 68       | 0       |          | 25        |
 lev+FELINE | 282     | 215    | 68       | 0       | 5        | 24        |
 -----------+---------+--------+----------+---------+----------+-----------+
 lev+FEL+mpi| 79      | 59     | 21       | 0       | 0        | 0         |

Here we have:
 * 'naive' implementation means simple DFS walk, without any filters (cut-offs)
 * 'level' means using levels / generation numbers based negative-cut filter
 * 'FELINE' means using FELINE index based negative-cut filter
 * 'lev+FELINE' means combining generation numbers filter with FELINE filter
 * 'mpi' means min-post [spanning-tree] intervals for positive-cut filter;
   note that the code does not walk the path after cut, but it is easy to do

The stats have the following meaning:
 * 'access' means accessing the node
 * 'walk' is actual walking the node
 * 'maxdepth' is maximum depth of the stack used for DFS
 * 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
   were used for negative-cut; note that those are not disjoint; node can
   be rejected by both level filter and FELINE filter

For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE's favor:
the results are the same, with 5 commits accessed and 6 walked compared
to 61574 accessed in naive algorithm.

The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
(different child-less commits) and 9 roots, and has average clustering
coefficient 0.000409217.

P.S. Would it be better to move the discussion about possible extensions to
the commit-graph in the form of new chunks (topological order, FELINE index,
min-post intervals, bloom filter for changed files, etc.) into a separate
thread?

--
Jakub Narębski

^ permalink raw reply [flat|nested] 162+ messages in thread
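For readers unfamiliar with the FELINE-style negative cut compared in the table above, the following sketch shows the underlying idea: give every commit a position in two different topological orders and rule out reachability whenever either position disagrees. This is an illustration under stated assumptions, not the paper's algorithm and not the notebook's code. The second order here is just "a different tie-break", whereas the paper uses its own heuristic to choose it, and `topo_positions`, `build_index` and `may_reach` are hypothetical names.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def topo_positions(parents: Dict[str, List[str]],
                   key: Callable[[str], int]) -> Dict[str, int]:
    """Kahn's algorithm; parents are numbered before their children.

    `key` breaks ties among commits that are ready at the same time.
    """
    children: Dict[str, List[str]] = {c: [] for c in parents}
    missing = {c: len(ps) for c, ps in parents.items()}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    ready = [(key(c), c) for c, n in missing.items() if n == 0]
    heapq.heapify(ready)
    pos: Dict[str, int] = {}
    while ready:
        _, commit = heapq.heappop(ready)
        pos[commit] = len(pos)
        for child in children[commit]:
            missing[child] -= 1
            if missing[child] == 0:
                heapq.heappush(ready, (key(child), child))
    return pos


def build_index(parents: Dict[str, List[str]]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Two topological orders that differ only in their tie-breaking."""
    rank = {c: i for i, c in enumerate(sorted(parents))}
    x = topo_positions(parents, key=lambda c: rank[c])
    y = topo_positions(parents, key=lambda c: -rank[c])
    return x, y


def may_reach(x: Dict[str, int], y: Dict[str, int], src: str, dst: str) -> bool:
    """False means src definitely cannot reach dst; True means 'maybe'."""
    return x[dst] <= x[src] and y[dst] <= y[src]


parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B"], "E": ["C"]}
x, y = build_index(parents)
```

In this toy graph E cannot reach B even though B has the smaller generation number (2 versus 3), so the level filter alone would keep walking; the two-order check rejects the pair immediately. It still reports "maybe" for D and C, which shows that it remains a negative-cut filter with false positives of its own.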
* Re: [PATCH 0/6] Compute and consume generation numbers 2018-04-21 20:44 ` Jakub Narebski @ 2018-04-23 13:54 ` Derrick Stolee 0 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-23 13:54 UTC (permalink / raw) To: Jakub Narebski Cc: Derrick Stolee, git, Ævar Arnfjörð Bjarmason, Stefan Beller, Lars Schneider, Jeff King On 4/21/2018 4:44 PM, Jakub Narebski wrote: > Jakub Narebski <jnareb@gmail.com> writes: >> Derrick Stolee <stolee@gmail.com> writes: >>> On 4/11/2018 3:32 PM, Jakub Narebski wrote: >>>> What would you suggest as a good test that could imply performance? The >>>> Google Colab notebook linked to above includes a function to count >>>> number of commits (nodes / vertices in the commit graph) walked, >>>> currently in the worst case scenario. >>> The two main questions to consider are: >>> >>> 1. Can X reach Y? >> That is easy to do. The function generic_is_reachable() does >> that... though using direct translation of the pseudocode for >> "Algorithm 3: Reachable" from FELINE paper, which is recursive and >> doesn't check if vertex was already visited was not good idea for large >> graphs such as Linux kernel commit graph, oops. That is why >> generic_is_reachable_large() was created. > [...] > >>> And the thing to measure is a commit count. If possible, it would be >>> good to count commits walked (commits whose parent list is enumerated) >>> and commits inspected (commits that were listed as a parent of some >>> walked commit). Walked commits require a commit parse -- albeit from >>> the commit-graph instead of the ODB now -- while inspected commits >>> only check the in-memory cache. > [...] >>> For git.git and Linux, I like to use the release tags as tests. They >>> provide a realistic view of the linear history, and maintenance >>> releases have their own history from the major releases. >> Hmmm... 
testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the >> FELINE index does not bring any improvements over using just level >> (generation number) filter. But that may be caused by narrowing od >> commit DAG around releases. >> >> I try do do the same between commits in wide part, with many commits >> with the same level (same generation number) both for source and for >> target commit. Though this may be unfair to level filter, though... >> >> >> Note however that FELINE index is not unabiguous, like generation >> numbers are (modulo decision whether to start at 0 or at 1); it depends >> on the topological ordering chosen for the X elements. > One can now test reachability on git.git repository; there is a form > where one can plug source and destination revisions at > https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK&line=2&uniqifier=1 > > I have tried the case that is quite unfair to the generation numbers > filter, namely the check between one of recent tags, and one commit that > shares generation number among largest number of other commits. > > Here level = generation number-1 (as it starts at 0 for root commit, not > 1). 
>
> The results are:
> * src = 468165c1d = v2.17.0
> * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec
>
> * 468165c1d has level 18418 which it shares with 6 commits
> * 66d2e04ec has level 14776 which it shares with 93 commits
> * gen(468165c1d) - gen(66d2e04ec) = 3642
>
>  algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
>  -----------+---------+--------+----------+---------+----------+-----------+
>  naive      | 48865   | 39599  | 244      | 9200    |          |           |
>  level      | 3086    | 2492   | 113      | 528     | 285      |           |
>  FELINE     | 283     | 216    | 68       | 0       |          | 25        |
>  lev+FELINE | 282     | 215    | 68       | 0       | 5        | 24        |
>  -----------+---------+--------+----------+---------+----------+-----------+
>  lev+FEL+mpi| 79      | 59     | 21       | 0       | 0        | 0         |
>
> Here we have:
> * 'naive' implementation means simple DFS walk, without any filters (cut-offs)
> * 'level' means using levels / generation numbers based negative-cut filter
> * 'FELINE' means using FELINE index based negative-cut filter
> * 'lev+FELINE' means combining generation numbers filter with FELINE filter
> * 'mpi' means min-post [spanning-tree] intervals for positive-cut filter;
>   note that the code does not walk the path after cut, but it is easy to do
>
> The stats have the following meaning:
> * 'access' means accessing the node
> * 'walk' is actual walking the node
> * 'maxdepth' is maximum depth of the stack used for DFS
> * 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
>   were used for negative-cut; note that those are not disjoint; node can
>   be rejected by both level filter and FELINE filter
>
> For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor:
> the results are the same, with 5 commits accessed and 6 walked compared
> to 61574 accessed in naive algorithm.
>
> The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
> (different child-less commits) and 9 roots, and has average clustering
> coefficient 0.000409217.

Thanks for these results. Now, write a patch.
I'm sticking to generation numbers for my patch because of the simplified computation, but you can contribute a FELINE implementation. > P.S. Would it be better to move the discussion about possible extensions > to the commit-graph in the form of new chunks (topological order, FELINE > index, min-post intervals, bloom filter for changed files, etc.) be > moved into separate thread? Yes. I think we've exhausted this thought experiment and future discussion should revolve around actual implementations in Git with end-to-end performance times. The computation time for computing the FELINE index should be included in that discussion. Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH v2 00/10] Compute and consume generation numbers
  2018-04-03 16:51 [PATCH 0/6] Compute and consume generation numbers Derrick Stolee
                   ` (8 preceding siblings ...)
  2018-04-07 16:55 ` Jakub Narebski
@ 2018-04-09 16:41 ` Derrick Stolee
  2018-04-09 16:41   ` [PATCH v2 01/10] object.c: parse commit in graph first Derrick Stolee
                     ` (10 more replies)
  9 siblings, 11 replies; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:41 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

Thanks for the lively discussion of this patch series in v1! I've
incorporated the feedback from the previous round, added patches [7/6]
and [8/6], expanded the discussion of generation numbers in the design
document, and added another speedup for 'git branch --contains'.

One major difference: I renamed the macros from _UNDEF to _INFINITY and
_NONE to _ZERO. This communicates their value more clearly, since the
previous names were unclear about which was larger than the "real"
generation numbers.

Patch 2 includes a change to builtin/merge.c and a new test in
t5318-commit-graph.sh that exposes a problem I found when testing the
previous patch series on my box. The "BUG: bad generation skip" message
from "commit.c: use generation to halt paint walk" would halt a
fast-forward merge since the HEAD commit was loaded before the
core.commitGraph config setting was loaded. It is crucial that all
commits that exist in the commit-graph file are loaded from that file
or else we will lose our expected inequalities of generation numbers.

Thanks,
-Stolee

-- >8 --

This is one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one).
This section is expanded to describe the interaction with special generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph file) and *_ZERO (commits in a commit-graph file written before generation numbers were implemented). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a constant-time check in queue_has_nonstale() instead of the previous linear-time check. Further, use generation numbers for '--contains' queries in 'git tag' and 'git branch', providing a significant speedup (at least 95% for some cases). A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series depends on v7 of ds/commit-graph. Derrick Stolee (10): object.c: parse commit in graph first merge: check config before loading commits commit: add generation number to struct commmit commit-graph: compute generation numbers commit: use generations in paint_down_to_common() commit.c: use generation to halt paint walk commit-graph.txt: update future work ref-filter: use generation number for --contains commit: use generation numbers for in_merge_bases() commit: add short-circuit to paint_down_to_common() Documentation/technical/commit-graph.txt | 50 +++++++++++++-- alloc.c | 1 + builtin/merge.c | 5 +- commit-graph.c | 48 +++++++++++++++ commit.c | 78 ++++++++++++++++++++---- commit.h | 5 ++ object.c | 4 +- ref-filter.c | 24 ++++++-- t/t5318-commit-graph.sh | 9 +++ 9 files changed, 197 insertions(+), 27 deletions(-) -- 2.17.0 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH v2 01/10] object.c: parse commit in graph first 2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee @ 2018-04-09 16:41 ` Derrick Stolee 2018-04-09 16:41 ` [PATCH v2 02/10] merge: check config before loading commits Derrick Stolee ` (9 subsequent siblings) 10 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-09 16:41 UTC (permalink / raw) To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). Before adding generation numbers to the commit-graph, we need to ensure that any commit that exists in the graph is loaded from the graph, so check parse_commit_in_graph() before calling parse_commit_buffer(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- object.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/object.c b/object.c index e6ad3f61f0..4cd3e98e04 100644 --- a/object.c +++ b/object.c @@ -3,6 +3,7 @@ #include "blob.h" #include "tree.h" #include "commit.h" +#include "commit-graph.h" #include "tag.h" static struct object **obj_hash; @@ -207,7 +208,8 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type } else if (type == OBJ_COMMIT) { struct commit *commit = lookup_commit(oid); if (commit) { - if (parse_commit_buffer(commit, buffer, size)) + if (!parse_commit_in_graph(commit) && + parse_commit_buffer(commit, buffer, size)) return NULL; if (!get_cached_commit_buffer(commit, NULL)) { set_commit_buffer(commit, buffer, size); -- 2.17.0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
* [PATCH v2 02/10] merge: check config before loading commits 2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee 2018-04-09 16:41 ` [PATCH v2 01/10] object.c: parse commit in graph first Derrick Stolee @ 2018-04-09 16:41 ` Derrick Stolee 2018-04-11 2:12 ` Junio C Hamano 2018-04-09 16:42 ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee ` (8 subsequent siblings) 10 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-09 16:41 UTC (permalink / raw) To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee In anticipation of using generation numbers from the commit-graph, we must ensure that all commits that exist in the commit-graph are loaded from that file instead of from the object database. Since the commit-graph file is only checked if core.commitGraph is true, we must check the default config before we load any commits. In the merge builtin, the config was checked after loading the HEAD commit. This was due to the use of the global 'branch' when checking merge-specific config settings. Move the config load to be between the initialization of 'branch' and the commit lookup. Also add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- builtin/merge.c | 5 +++-- t/t5318-commit-graph.sh | 9 +++++++++ 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/builtin/merge.c b/builtin/merge.c index ee050a47f3..20897f8223 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", &branch); + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(&head_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(&head_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index a380419b65..77d85aefe7 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 +test_expect_success 'perform fast-forward merge in full repo' ' + cd "$TRASH_DIRECTORY/full" && + git checkout -b merge-5-to-8 commits/5 && + git merge commits/8 && + git show-ref -s merge-5-to-8 >output && + git show-ref -s commits/8 >expect && + test_cmp expect output +' + test_done -- 2.17.0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH v2 02/10] merge: check config before loading commits 2018-04-09 16:41 ` [PATCH v2 02/10] merge: check config before loading commits Derrick Stolee @ 2018-04-11 2:12 ` Junio C Hamano 2018-04-11 12:49 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Junio C Hamano @ 2018-04-11 2:12 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, peff, avarab, sbeller, larsxschneider, bmwill Derrick Stolee <dstolee@microsoft.com> writes: > diff --git a/builtin/merge.c b/builtin/merge.c > index ee050a47f3..20897f8223 100644 > --- a/builtin/merge.c > +++ b/builtin/merge.c > @@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix) > branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL); > if (branch) > skip_prefix(branch, "refs/heads/", &branch); > + init_diff_ui_defaults(); > + git_config(git_merge_config, NULL); > + > if (!branch || is_null_oid(&head_oid)) > head_commit = NULL; > else > head_commit = lookup_commit_or_die(&head_oid, "HEAD"); > > - init_diff_ui_defaults(); > - git_config(git_merge_config, NULL); Wow, that's tricky. git_merge_config() wants to know which "branch" we are on, and this place is as early as we can move the call to without breaking things. Is this to allow parse_object() called in lookup_commit_reference_gently() to know if we can rely on the data cached in the commit-graph data? > Move the config load to be between the initialization of 'branch' > and the commit lookup. Also add a test to t5318-commit-graph.sh > that exercises this code path to prevent a regression. It is not clear to me how a successful merge of commits/8 demonstrates that reading the config earlier than before is regression free. 
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh > index a380419b65..77d85aefe7 100755 > --- a/t/t5318-commit-graph.sh > +++ b/t/t5318-commit-graph.sh > @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' > graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 > graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 > > +test_expect_success 'perform fast-forward merge in full repo' ' > + cd "$TRASH_DIRECTORY/full" && > + git checkout -b merge-5-to-8 commits/5 && > + git merge commits/8 && > + git show-ref -s merge-5-to-8 >output && > + git show-ref -s commits/8 >expect && > + test_cmp expect output > +' > + > test_done ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v2 02/10] merge: check config before loading commits 2018-04-11 2:12 ` Junio C Hamano @ 2018-04-11 12:49 ` Derrick Stolee 0 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-11 12:49 UTC (permalink / raw) To: Junio C Hamano, Derrick Stolee Cc: git, peff, avarab, sbeller, larsxschneider, bmwill On 4/10/2018 10:12 PM, Junio C Hamano wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> diff --git a/builtin/merge.c b/builtin/merge.c >> index ee050a47f3..20897f8223 100644 >> --- a/builtin/merge.c >> +++ b/builtin/merge.c >> @@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix) >> branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL); >> if (branch) >> skip_prefix(branch, "refs/heads/", &branch); >> + init_diff_ui_defaults(); >> + git_config(git_merge_config, NULL); >> + >> if (!branch || is_null_oid(&head_oid)) >> head_commit = NULL; >> else >> head_commit = lookup_commit_or_die(&head_oid, "HEAD"); >> >> - init_diff_ui_defaults(); >> - git_config(git_merge_config, NULL); > Wow, that's tricky. git_merge_config() wants to know which "branch" > we are on, and this place is as early as we can move the call to > without breaking things. Is this to allow parse_object() called > in lookup_commit_reference_gently() to know if we can rely on the > data cached in the commit-graph data? When I saw the bug on my machine, I tracked the issue down to a call to parse_commit_in_graph() that skipped the graph check since core_commit_graph was not set. The call stack from this call is as follows: * lookup_commit_or_die() * lookup_commit_reference() * lookup_commit_reference_gently() * parse_object() * parse_object_buffer() * parse_commit_in_graph() [as introduced in PATCH 01/10] > >> Move the config load to be between the initialization of 'branch' >> and the commit lookup. Also add a test to t5318-commit-graph.sh >> that exercises this code path to prevent a regression. 
> It is not clear to me how a successful merge of commits/8 > demonstrates that reading the config earlier than before is > regression free. I didn't want to introduce commits in an order that led to a commit failing tests, but if you drop the change to builtin/merge.c from this series, the tip commit will fail this test with "BUG: bad generation skip". The reason for this failure is that commits/5 is loaded from HEAD from the object database, so its generation is marked as GENERATION_NUMBER_INFINITY, and the commit is marked as parsed. Later, the commit at merges/3 is loaded from the graph with generation 4. This triggers the BUG statement in paint_down_to_common(). That is why it is important to check a fast-forward merge. In the 'graph_git_behavior' steps of t5318-commit-graph.sh, we were already testing 'git merge-base' to check the commit walk logic. Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH v2 03/10] commit: add generation number to struct commmit 2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee 2018-04-09 16:41 ` [PATCH v2 01/10] object.c: parse commit in graph first Derrick Stolee 2018-04-09 16:41 ` [PATCH v2 02/10] merge: check config before loading commits Derrick Stolee @ 2018-04-09 16:42 ` Derrick Stolee 2018-04-09 17:59 ` Stefan Beller 2018-04-11 2:31 ` Junio C Hamano 2018-04-09 16:42 ` [PATCH v2 04/10] commit-graph: compute generation numbers Derrick Stolee ` (7 subsequent siblings) 10 siblings, 2 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use two special values to signal the generation number is invalid: GENERATION_NUMBER_ININITY 0xFFFFFFFF GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_ZERO) means the generation number was loaded from a commit graph file that was stored before generation numbers were computed. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- alloc.c | 1 + commit-graph.c | 2 ++ commit.h | 4 ++++ 3 files changed, 7 insertions(+) diff --git a/alloc.c b/alloc.c index cf4f8b61e1..e8ab14f4a1 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_INFINITY; return c; } diff --git a/commit-graph.c b/commit-graph.c index 1fc63d541b..d24b947525 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -264,6 +264,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + pptr = &item->parents; edge_value = get_be32(commit_data + g->hash_len); diff --git a/commit.h b/commit.h index e57ae4b583..b91df315c5 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,9 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF +#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF +#define GENERATION_NUMBER_MAX 0x3FFFFFFF +#define GENERATION_NUMBER_ZERO 0 struct commit_list { struct commit *item; @@ -24,6 +27,7 @@ struct commit { struct commit_list *parents; struct tree *tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; -- 2.17.0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
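[Editor's illustration, not part of the patch.] The `>> 2` in fill_commit_in_graph() reflects the on-disk layout used by this series: the 32-bit word at offset hash_len + 8 carries the generation number in its upper 30 bits, and its low two bits hold the high bits of the commit date. The sketch below packs and unpacks such an 8-byte field. The helper names (read_be32, write_be32, pack_generation_and_date, and the unpack functions) are hypothetical stand-ins, and the `& 0x3` mask for the date bits is an assumption inferred from the 30-bit generation width rather than something shown in this patch.

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_MAX 0x3FFFFFFF

/* Hypothetical stand-in for Git's get_be32(): read 4 big-endian bytes. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: store a 32-bit value as 4 big-endian bytes. */
static void write_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/*
 * Pack a 30-bit generation number and a 34-bit commit date into 8 bytes:
 * first word = generation << 2 | top two date bits, second word = low
 * 32 date bits.
 */
static void pack_generation_and_date(unsigned char out[8],
				     uint32_t generation, uint64_t date)
{
	assert(generation <= GENERATION_NUMBER_MAX);
	write_be32(out, (generation << 2) | (uint32_t)((date >> 32) & 0x3));
	write_be32(out + 4, (uint32_t)(date & 0xFFFFFFFFu));
}

/* Mirrors "get_be32(...) >> 2" from the patch. */
static uint32_t unpack_generation(const unsigned char in[8])
{
	return read_be32(in) >> 2;
}

/* Mirrors "(date_high << 32) | date_low", assuming date_high is masked to 2 bits. */
static uint64_t unpack_date(const unsigned char in[8])
{
	uint64_t date_high = read_be32(in) & 0x3;
	uint64_t date_low = read_be32(in + 4);

	return (date_high << 32) | date_low;
}
```

Because only 30 bits are available, the largest value that survives a round trip is GENERATION_NUMBER_MAX (0x3FFFFFFF), which is why that macro is defined separately from GENERATION_NUMBER_INFINITY.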
* Re: [PATCH v2 03/10] commit: add generation number to struct commmit
  2018-04-09 16:42   ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee
@ 2018-04-09 17:59     ` Stefan Beller
  2018-04-11  2:31     ` Junio C Hamano
  1 sibling, 0 replies; 162+ messages in thread
From: Stefan Beller @ 2018-04-09 17:59 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, peff, avarab, larsxschneider, bmwill

On Mon, Apr 9, 2018 at 9:42 AM, Derrick Stolee <dstolee@microsoft.com> wrote:
> The generation number of a commit is defined recursively as follows:
>
> * If a commit A has no parents, then the generation number of A is one.
> * If a commit A has parents, then the generation number of A is one
>   more than the maximum generation number among the parents of A.
>
> Add a uint32_t generation field to struct commit so we can pass this
> information to revision walks. We use two special values to signal
> the generation number is invalid:
>
> GENERATION_NUMBER_ININITY 0xFFFFFFFF

GENERATION_NUMBER_INFINITY

On disk we currently only store up to 2^30-1 (2 bits fewer than
MAX_UINT_32), but here we just take the maximum value that a uint32_t
can store. That mismatch should not be a problem, other than
aesthetically.

Once we run into scaling problems, we can just move up to uint64_t in
the code, and defer the on-disk solution to a new file format.

With both _ZERO and _INFINITY we are at the border of uint wrap-around,
so we have to be very careful not to add/subtract one and then compare.
Just something to watch out for when reviewing.

^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v2 03/10] commit: add generation number to struct commmit 2018-04-09 16:42 ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee 2018-04-09 17:59 ` Stefan Beller @ 2018-04-11 2:31 ` Junio C Hamano 2018-04-11 12:57 ` Derrick Stolee 1 sibling, 1 reply; 162+ messages in thread From: Junio C Hamano @ 2018-04-11 2:31 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, peff, avarab, sbeller, larsxschneider, bmwill Derrick Stolee <dstolee@microsoft.com> writes: > The generation number of a commit is defined recursively as follows: > > * If a commit A has no parents, then the generation number of A is one. > * If a commit A has parents, then the generation number of A is one > more than the maximum generation number among the parents of A. > > Add a uint32_t generation field to struct commit so we can pass this > information to revision walks. We use two special values to signal > the generation number is invalid: > > GENERATION_NUMBER_ININITY 0xFFFFFFFF > GENERATION_NUMBER_ZERO 0 > > The first (_INFINITY) means the generation number has not been loaded or > computed. The second (_ZERO) means the generation number was loaded > from a commit graph file that was stored before generation numbers > were computed. Should it also be possible for a caller to tell if a given commit has too deep a history, i.e. we do not know its generation number exactly, but we know it is larger than 1<<30? It seems that we only have a 30-bit field in the file, so wouldn't we need a special value defined in (e.g. "0") so that we can tell that the commit has such a large generation number? E.g. > + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; if (!item->generation) item->generation = GENERATION_NUMBER_OVERFLOW; when we read it from the file? 
We obviously need to do something similar when assigning a
generation number to a child commit, perhaps like

	#define GENERATION_NUMBER_OVERFLOW (GENERATION_NUMBER_MAX + 1)

	commit->generation = 1; /* assume no parent */
	for (p = commit->parents; p; p = p->next) {
		uint32_t gen = p->item->generation + 1;

		if (gen >= GENERATION_NUMBER_OVERFLOW) {
			commit->generation = GENERATION_NUMBER_OVERFLOW;
			break;
		} else if (commit->generation < gen)
			commit->generation = gen;
	}

or something? And then on the writing side you'd encode too large a
generation as '0'.

^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v2 03/10] commit: add generation number to struct commmit 2018-04-11 2:31 ` Junio C Hamano @ 2018-04-11 12:57 ` Derrick Stolee 2018-04-11 23:28 ` Junio C Hamano 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-11 12:57 UTC (permalink / raw) To: Junio C Hamano, Derrick Stolee Cc: git, peff, avarab, sbeller, larsxschneider, bmwill On 4/10/2018 10:31 PM, Junio C Hamano wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> The generation number of a commit is defined recursively as follows: >> >> * If a commit A has no parents, then the generation number of A is one. >> * If a commit A has parents, then the generation number of A is one >> more than the maximum generation number among the parents of A. >> >> Add a uint32_t generation field to struct commit so we can pass this >> information to revision walks. We use two special values to signal >> the generation number is invalid: >> >> GENERATION_NUMBER_ININITY 0xFFFFFFFF >> GENERATION_NUMBER_ZERO 0 >> >> The first (_INFINITY) means the generation number has not been loaded or >> computed. The second (_ZERO) means the generation number was loaded >> from a commit graph file that was stored before generation numbers >> were computed. > Should it also be possible for a caller to tell if a given commit > has too deep a history, i.e. we do not know its generation number > exactly, but we know it is larger than 1<<30? > > It seems that we only have a 30-bit field in the file, so wouldn't > we need a special value defined in (e.g. "0") so that we can tell > that the commit has such a large generation number? E.g. > >> + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; > if (!item->generation) > item->generation = GENERATION_NUMBER_OVERFLOW; > > when we read it from the file? 
>
> We obviously need to do something similar when assigning a
> generation number to a child commit, perhaps like
>
> 	#define GENERATION_NUMBER_OVERFLOW (GENERATION_NUMBER_MAX + 1)
>
> 	commit->generation = 1; /* assume no parent */
> 	for (p = commit->parents; p; p = p->next) {
> 		uint32_t gen = p->item->generation + 1;
>
> 		if (gen >= GENERATION_NUMBER_OVERFLOW) {
> 			commit->generation = GENERATION_NUMBER_OVERFLOW;
> 			break;
> 		} else if (commit->generation < gen)
> 			commit->generation = gen;
> 	}
>
> or something? And then on the writing side you'd encode too large a
> generation as '0'.

You raise a very good point. How about we do a slightly different
arrangement for these overflow commits?

Instead of storing the commits in the commit-graph file as "0" (which
currently means "written by a version of git that did not compute
generation numbers") we could let GENERATION_NUMBER_MAX be the maximum
generation of a commit in the commit-graph, and if a commit would have
larger generation, we collapse it down to that value.

It slightly complicates the diagram I made in
Documentation/technical/commit-graph.txt, but it was already a bit of a
simplification. Here is an updated diagram, but likely we will want to
limit discussion of the special-case GENERATION_NUMBER_MAX to the prose,
since it is not a practical situation at the moment.

    +-----------------------------------------+
    | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF |
    +-----------------------------------------+
      |            |      |      ^
      |            |      |      |
      |            |      +------+
      |            | [gen(A) = gen(B)]
      |            V
      |  +------------------------------------+
      |  | GENERATION_NUMBER_MAX = 0x3FFFFFFF |
      |  +------------------------------------+
      |            |      |      ^
      |            |      |      |
      |            |      +------+
      |            | [gen(A) = gen(B)]
      V            V
    +-------------------------------------+
    | 0 < commit->generation < 0x3FFFFFFF |
    +-------------------------------------+
                |      |      ^
                |      |      |
                |      +------+
                | [gen(A) > gen(B)]
                V
    +-------------------------------------+
    | GENERATION_NUMBER_ZERO = 0          |
    +-------------------------------------+
                |      ^
                |      |
                +------+
            [gen(A) = gen(B)]

Thanks,
-Stolee

^ permalink raw reply [flat|nested] 162+ messages in thread
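[Editor's illustration, not part of the thread.] One way to read the diagram: an arrow from A to B means "A may reach B", so reachability never goes from a smaller generation to a larger one, and two distinct commits with the same generation can only be related when both sit in one of the special classes (_INFINITY, _MAX, _ZERO). The sketch below encodes that reading as a negative-cut test. The function names may_reach() and generation_is_exact() are hypothetical, and the rule is an interpretation of the diagram rather than code from the series.

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/* True for "real" generation numbers, i.e. the 0 < gen < 0x3FFFFFFF box. */
static int generation_is_exact(uint32_t gen)
{
	return gen > GENERATION_NUMBER_ZERO && gen < GENERATION_NUMBER_MAX;
}

/*
 * Negative-cut filter for two DISTINCT commits A and B.
 * Returns 0 when the generation numbers alone prove that A cannot
 * reach B, and 1 when a walk is still needed to decide.
 */
static int may_reach(uint32_t gen_a, uint32_t gen_b)
{
	if (gen_a < gen_b)
		return 0;	/* reachability never climbs the diagram */
	if (gen_a == gen_b && generation_is_exact(gen_a))
		return 0;	/* exact generations must strictly decrease */
	return 1;
}
```

Note that the comparisons never add or subtract one from a generation number, which sidesteps the wrap-around hazard near 0 and 0xFFFFFFFF that was raised earlier in the thread.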
* Re: [PATCH v2 03/10] commit: add generation number to struct commmit 2018-04-11 12:57 ` Derrick Stolee @ 2018-04-11 23:28 ` Junio C Hamano 0 siblings, 0 replies; 162+ messages in thread From: Junio C Hamano @ 2018-04-11 23:28 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider, bmwill Derrick Stolee <stolee@gmail.com> writes: > How about we do a slightly different > arrangement for these overflow commits? > > Instead of storing the commits in the commit-graph file as "0" (which > currently means "written by a version of git that did not compute > generation numbers") we could let GENERATION_NUMBER_MAX be the maximum > generation of a commit in the commit-graph, and if a commit would have > larger generation, we collapse it down to that value. Sure. Any value we can tell that it is special is fine. Thanks. ^ permalink raw reply [flat|nested] 162+ messages in thread
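[Editor's illustration, not part of the thread.] The arrangement agreed on here, collapsing any generation that would exceed GENERATION_NUMBER_MAX down to that value, amounts to a saturating "max of parents plus one". A minimal sketch follows, assuming every parent already has a computed generation in the range 1..GENERATION_NUMBER_MAX. The function name child_generation() and its array-based signature are hypothetical and chosen only to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_MAX 0x3FFFFFFF

/*
 * Generation of a commit given its parents' generations.
 * A commit with no parents gets generation 1; otherwise the result is
 * one more than the largest parent generation, saturating at
 * GENERATION_NUMBER_MAX instead of overflowing the 30-bit field.
 */
static uint32_t child_generation(const uint32_t *parent_gens, size_t nr_parents)
{
	uint32_t max_parent = 0;
	size_t i;

	for (i = 0; i < nr_parents; i++) {
		/* Assumption: parents carry computed, storable generations. */
		assert(parent_gens[i] >= 1 &&
		       parent_gens[i] <= GENERATION_NUMBER_MAX);
		if (parent_gens[i] > max_parent)
			max_parent = parent_gens[i];
	}

	/* Compare before adding so the sum can never exceed the cap. */
	if (max_parent >= GENERATION_NUMBER_MAX)
		return GENERATION_NUMBER_MAX;
	return max_parent + 1;
}
```

With this scheme a chain of commits beyond the cap all share generation GENERATION_NUMBER_MAX, so an equal-generation comparison at the cap cannot rule out reachability and a walk is still required there.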
* [PATCH v2 04/10] commit-graph: compute generation numbers
  2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee
                     ` (2 preceding siblings ...)
  2018-04-09 16:42   ` [PATCH v2 03/10] commit: add generation number to struct commmit Derrick Stolee
@ 2018-04-09 16:42   ` Derrick Stolee
  2018-04-11  2:51     ` Junio C Hamano
  2018-04-09 16:42   ` [PATCH v2 05/10] commit: use generations in paint_down_to_common() Derrick Stolee
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 162+ messages in thread
From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee

While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index d24b947525..5fd63acc31 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -419,6 +419,13 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		else
 			packedDate[0] = 0;
 
+		if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+			if ((*list)->generation > GENERATION_NUMBER_MAX)
+				die("generation number %u is too large to store in commit-graph",
+				    (*list)->generation);
+			packedDate[0] |= htonl((*list)->generation << 2);
+		}
+
 		packedDate[1] = htonl((*list)->date);
 		hashwrite(f, packedDate, 8);
 
@@ -551,6 +558,43 @@ static void close_reachable(struct packed_oid_list *oids)
 	}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+				       int nr_commits)
+{
+	int i;
+	struct commit_list *list = NULL;
+
+	for (i = 0; i < nr_commits; i++) {
+		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+		    commits[i]->generation != GENERATION_NUMBER_ZERO)
+			continue;
+
+		commit_list_insert(commits[i], &list);
+		while (list) {
+			struct commit *current = list->item;
+			struct commit_list *parent;
+			int all_parents_computed = 1;
+			uint32_t max_generation = 0;
+
+			for (parent = current->parents; parent; parent = parent->next) {
+				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
+				    parent->item->generation == GENERATION_NUMBER_ZERO) {
+					all_parents_computed = 0;
+					commit_list_insert(parent->item, &list);
+					break;
+				} else if (parent->item->generation > max_generation) {
+					max_generation = parent->item->generation;
+				}
+			}
+
+			if (all_parents_computed) {
+				current->generation = max_generation + 1;
+				pop_commit(&list);
+			}
+		}
+	}
+}
+
 void write_commit_graph(const char *obj_dir,
 			const char **pack_indexes,
 			int nr_packs,
@@ -674,6 +718,8 @@ void write_commit_graph(const char *obj_dir,
 	if (commits.nr >= GRAPH_PARENT_MISSING)
 		die(_("too many commits to write graph"));
 
+	compute_generation_numbers(commits.list, commits.nr);
+
 	graph_name = get_commit_graph_filename(obj_dir);
 	fd = hold_lock_file_for_update(&lk, graph_name, 0);
 
-- 
2.17.0

^ permalink raw reply related [flat|nested] 162+ messages in thread
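[Editor's illustration, not part of the patch.] The depth-first strategy in compute_generation_numbers() can be tried outside Git on a toy DAG. The sketch below follows the same shape: keep an explicit stack, push the first parent that has no generation yet, and only assign and pop once every parent is computed. Everything here is hypothetical scaffolding: struct toy_commit, index-based parent links, a limit of two parents, and a single GEN_UNSET marker standing in for both GENERATION_NUMBER_INFINITY and GENERATION_NUMBER_ZERO.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GEN_UNSET 0u	/* toy marker: generation not computed yet */

/* Toy commit: parents are indexes into the same array (at most two). */
struct toy_commit {
	int nr_parents;
	int parents[2];
	uint32_t generation;
};

/*
 * Assign generation = 1 + max(parent generations) to every commit,
 * using an explicit stack instead of recursion. The input must be
 * acyclic. Returns 0 on success, -1 if the stack cannot be allocated.
 */
static int compute_toy_generations(struct toy_commit *commits, int nr)
{
	int *stack;
	int i;

	if (nr <= 0)
		return 0;
	/* A DFS path in a DAG never repeats a node, so depth <= nr. */
	stack = malloc((size_t)nr * sizeof(*stack));
	if (!stack)
		return -1;

	for (i = 0; i < nr; i++) {
		int top = 0;

		if (commits[i].generation != GEN_UNSET)
			continue;	/* already computed: nothing to walk */

		stack[top++] = i;
		while (top > 0) {
			struct toy_commit *current = &commits[stack[top - 1]];
			uint32_t max_generation = 0;
			int all_parents_computed = 1;
			int p;

			for (p = 0; p < current->nr_parents; p++) {
				int parent = current->parents[p];

				if (commits[parent].generation == GEN_UNSET) {
					/* Visit this parent first, then retry. */
					all_parents_computed = 0;
					stack[top++] = parent;
					break;
				}
				if (commits[parent].generation > max_generation)
					max_generation = commits[parent].generation;
			}

			if (all_parents_computed) {
				current->generation = max_generation + 1;
				top--;
			}
		}
	}

	free(stack);
	return 0;
}
```

For a four-commit history where commit 0 merges commits 1 and 2, and both of those have the root commit 3 as their only parent, the function assigns generations 3, 2, 2 and 1 respectively. Each commit is assigned exactly once, and already-computed generations act as the termination condition for the walk.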
* Re: [PATCH v2 04/10] commit-graph: compute generation numbers 2018-04-09 16:42 ` [PATCH v2 04/10] commit-graph: compute generation numbers Derrick Stolee @ 2018-04-11 2:51 ` Junio C Hamano 2018-04-11 13:02 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Junio C Hamano @ 2018-04-11 2:51 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, peff, avarab, sbeller, larsxschneider, bmwill Derrick Stolee <dstolee@microsoft.com> writes: > + if ((*list)->generation != GENERATION_NUMBER_INFINITY) { > + if ((*list)->generation > GENERATION_NUMBER_MAX) > + die("generation number %u is too large to store in commit-graph", > + (*list)->generation); > + packedDate[0] |= htonl((*list)->generation << 2); > + } How serious do we want this feature to be? On one extreme, we could be irresponsible and say it will be a problem for our descendants in the future if their repositories have more than billion pearls on a single strand, and the above certainly is a reasonable way to punt. Those who actually encounter the problem will notice by Git dying somewhere rather deep in the callchain. Or we could say Git actually does support a history that is arbitrarily long, even though such a deep portion of history will not benefit from having generation numbers in commit-graph. I've been assuming that our stance is the latter and that is why I made noises about overflowing 30-bit generation field in my review of the previous step. In case we want to do the "we know this is very large, but we do not know the exact value", we may actually want a mode where we can pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and make sure that the code to handle overflow behaves sensibly. > + for (i = 0; i < nr_commits; i++) { > + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && > + commits[i]->generation != GENERATION_NUMBER_ZERO) > + continue; > + > + commit_list_insert(commits[i], &list); > + while (list) { > +... 
> + } > + } So we go over the list of commits just _once_ and make sure each of them gets the generation assigned correctly by (conceptually recursively but iteratively in implementation by using a commit list) making sure that all its parents have generation assigned and compute the generation for the commit, before moving to the next one. Which sounds correct. ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v2 04/10] commit-graph: compute generation numbers 2018-04-11 2:51 ` Junio C Hamano @ 2018-04-11 13:02 ` Derrick Stolee 2018-04-11 18:49 ` Stefan Beller 2018-04-11 19:26 ` Eric Sunshine 0 siblings, 2 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-11 13:02 UTC (permalink / raw) To: Junio C Hamano, Derrick Stolee Cc: git, peff, avarab, sbeller, larsxschneider, bmwill On 4/10/2018 10:51 PM, Junio C Hamano wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> + if ((*list)->generation != GENERATION_NUMBER_INFINITY) { >> + if ((*list)->generation > GENERATION_NUMBER_MAX) >> + die("generation number %u is too large to store in commit-graph", >> + (*list)->generation); >> + packedDate[0] |= htonl((*list)->generation << 2); >> + } > > How serious do we want this feature to be? On one extreme, we could > be irresponsible and say it will be a problem for our descendants in > the future if their repositories have more than billion pearls on a > single strand, and the above certainly is a reasonable way to punt. > Those who actually encounter the problem will notice by Git dying > somewhere rather deep in the callchain. > > Or we could say Git actually does support a history that is > arbitrarily long, even though such a deep portion of history will > not benefit from having generation numbers in commit-graph. > > I've been assuming that our stance is the latter and that is why I > made noises about overflowing 30-bit generation field in my review > of the previous step. > > In case we want to do the "we know this is very large, but we do not > know the exact value", we may actually want a mode where we can > pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and > make sure that the code to handle overflow behaves sensibly. I agree. I wonder how we can effectively expose this value into a test. It's probably not sufficient to manually test using compiler flags ("-D GENERATION_NUMBER_MAX=8"). 
> >> + for (i = 0; i < nr_commits; i++) { >> + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && >> + commits[i]->generation != GENERATION_NUMBER_ZERO) >> + continue; >> + >> + commit_list_insert(commits[i], &list); >> + while (list) { >> +... >> + } >> + } > So we go over the list of commits just _once_ and make sure each of > them gets the generation assigned correctly by (conceptually > recursively but iteratively in implementation by using a commit > list) making sure that all its parents have generation assigned and > compute the generation for the commit, before moving to the next > one. Which sounds correct. Yes, we compute the generation number of a commit exactly once. We use the list as a stack so we do not have recursion limits during our depth-first search (DFS). We rely on the object cache to ensure we store the computed generation numbers, and computed generation numbers provide termination conditions to the DFS. Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
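To make the "list as a stack" description above concrete, here is a minimal sketch of the same idea outside of Git. It assumes a hypothetical toy_commit struct and a hypothetical compute_generation() helper; neither is the code from the patch, and the overflow handling discussed in this thread is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GEN_ZERO 0u	/* "not computed yet", in the spirit of GENERATION_NUMBER_ZERO */

/* Hypothetical toy commit; the real struct commit looks different. */
struct toy_commit {
	uint32_t generation;
	size_t nr_parents;
	struct toy_commit **parents;
};

/*
 * Assign generation numbers to 'start' and to every ancestor that does
 * not have one yet.  An explicit stack replaces recursion, so the depth
 * of history is limited by memory only, not by the call stack.
 */
void compute_generation(struct toy_commit *start)
{
	size_t nr = 0, alloc = 16;
	struct toy_commit **stack;

	assert(start != NULL);
	if (start->generation != GEN_ZERO)
		return;	/* already computed: nothing to walk */

	stack = malloc(alloc * sizeof(*stack));
	if (!stack)
		abort();
	stack[nr++] = start;

	while (nr) {
		struct toy_commit *cur = stack[nr - 1];
		uint32_t max_gen = 0;
		int all_known = 1;
		size_t i;

		for (i = 0; i < cur->nr_parents; i++) {
			struct toy_commit *p = cur->parents[i];

			if (p->generation == GEN_ZERO) {
				/* visit this parent first, then come back */
				all_known = 0;
				if (nr == alloc) {
					alloc *= 2;
					stack = realloc(stack, alloc * sizeof(*stack));
					if (!stack)
						abort();
				}
				stack[nr++] = p;
				break;
			}
			if (p->generation > max_gen)
				max_gen = p->generation;
		}

		if (all_known) {
			/* one more than the maximum among the parents */
			cur->generation = max_gen + 1;
			nr--;
		}
	}
	free(stack);
}
```

Because a commit is popped only once all of its parents carry a generation number, each commit gets its value exactly once, and an already-computed value acts as the termination condition for the walk.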
* Re: [PATCH v2 04/10] commit-graph: compute generation numbers 2018-04-11 13:02 ` Derrick Stolee @ 2018-04-11 18:49 ` Stefan Beller 2018-04-11 19:26 ` Eric Sunshine 1 sibling, 0 replies; 162+ messages in thread From: Stefan Beller @ 2018-04-11 18:49 UTC (permalink / raw) To: Derrick Stolee Cc: Junio C Hamano, Derrick Stolee, git, peff, avarab, larsxschneider, bmwill On Wed, Apr 11, 2018 at 6:02 AM, Derrick Stolee <stolee@gmail.com> wrote: > On 4/10/2018 10:51 PM, Junio C Hamano wrote: >> >> Derrick Stolee <dstolee@microsoft.com> writes: >> >>> + if ((*list)->generation != GENERATION_NUMBER_INFINITY) { >>> + if ((*list)->generation > GENERATION_NUMBER_MAX) >>> + die("generation number %u is too large to >>> store in commit-graph", >>> + (*list)->generation); >>> + packedDate[0] |= htonl((*list)->generation << 2); >>> + } >> >> >> How serious do we want this feature to be? On one extreme, we could >> be irresponsible and say it will be a problem for our descendants in >> the future if their repositories have more than billion pearls on a >> single strand, and the above certainly is a reasonable way to punt. >> Those who actually encounter the problem will notice by Git dying >> somewhere rather deep in the callchain. >> >> Or we could say Git actually does support a history that is >> arbitrarily long, even though such a deep portion of history will >> not benefit from having generation numbers in commit-graph. >> >> I've been assuming that our stance is the latter and that is why I >> made noises about overflowing 30-bit generation field in my review >> of the previous step. >> >> In case we want to do the "we know this is very large, but we do not >> know the exact value", we may actually want a mode where we can >> pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and >> make sure that the code to handle overflow behaves sensibly. > > > I agree. I wonder how we can effectively expose this value into a test. 
It's > probably not sufficient to manually test using compiler flags ("-D > GENERATION_NUMBER_MAX=8"). Would using an environment variable for this testing purpose be a good idea? If we allow a user to pass in an arbitrary maximum, then we'd have to care about generation numbers that are stored in the commit graph file larger than that user specific maximum, though. Looking through the output of "git grep getenv" we only have two instances with _DEBUG, both in transport. ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v2 04/10] commit-graph: compute generation numbers 2018-04-11 13:02 ` Derrick Stolee 2018-04-11 18:49 ` Stefan Beller @ 2018-04-11 19:26 ` Eric Sunshine 1 sibling, 0 replies; 162+ messages in thread From: Eric Sunshine @ 2018-04-11 19:26 UTC (permalink / raw) To: Derrick Stolee Cc: Junio C Hamano, Derrick Stolee, git, peff, avarab, sbeller, larsxschneider, bmwill On Wed, Apr 11, 2018 at 9:02 AM, Derrick Stolee <stolee@gmail.com> wrote: > On 4/10/2018 10:51 PM, Junio C Hamano wrote: >> In case we want to do the "we know this is very large, but we do not >> know the exact value", we may actually want a mode where we can >> pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and >> make sure that the code to handle overflow behaves sensibly. > > I agree. I wonder how we can effectively expose this value into a test. It's > probably not sufficient to manually test using compiler flags ("-D > GENERATION_NUMBER_MAX=8"). A few similar cases of tests needing to tweak some behavior do so by environment variable. See, for instance, GIT_GETTEXT_POISON and GIT_FSMONITOR_TEST. ^ permalink raw reply [flat|nested] 162+ messages in thread
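As a rough illustration of the environment-variable idea, a test-only override could be parsed along these lines. This is only a sketch: parse_generation_max() is a hypothetical helper, the variable name in the usage note is made up, and 0x3FFFFFFF is assumed as the built-in limit because the generation field is 30 bits wide.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define GENERATION_NUMBER_MAX 0x3FFFFFFFu	/* assumed 30-bit limit */

/*
 * Hypothetical helper: turn the text of a test-only override into the
 * maximum generation number to use.  Anything missing, malformed, zero,
 * or above the real limit falls back to the built-in maximum.
 */
uint32_t parse_generation_max(const char *value)
{
	char *end;
	unsigned long n;

	if (!value || !*value)
		return GENERATION_NUMBER_MAX;

	errno = 0;
	n = strtoul(value, &end, 10);
	if (errno || *end || n == 0 || n > GENERATION_NUMBER_MAX)
		return GENERATION_NUMBER_MAX;
	return (uint32_t)n;
}
```

A caller would pass in something like getenv("GIT_TEST_GENERATION_NUMBER_MAX") (a hypothetical name). Capping the override at the real limit keeps the writer honest, but it does not address the concern raised earlier in the thread: a commit-graph file may already contain generation numbers larger than the lowered maximum.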
* [PATCH v2 05/10] commit: use generations in paint_down_to_common() 2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee ` (3 preceding siblings ...) 2018-04-09 16:42 ` [PATCH v2 04/10] commit-graph: compute generation numbers Derrick Stolee @ 2018-04-09 16:42 ` Derrick Stolee 2018-04-09 16:42 ` [PATCH v2 06/10] commit.c: use generation to halt paint walk Derrick Stolee ` (5 subsequent siblings) 10 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 19 ++++++++++++++++++- commit.h | 1 + 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 3e39c86abf..95ae7e13a3 100644 --- a/commit.c +++ b/commit.c @@ -624,6 +624,23 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* newer commits with larger date first */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -773,7 +790,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index b91df315c5..c440f56bf9 100644 --- a/commit.h +++ b/commit.h @@ -332,6 +332,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
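The effect of the new ordering can be tried in isolation with a small sketch. It assumes a hypothetical two-field toy_commit and a two-argument comparator suitable for qsort(); the comparator in the patch takes a third, unused argument because it is meant for prio_queue.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy commit: only the two fields the comparison needs. */
struct toy_commit {
	uint32_t generation;
	uint64_t date;
};

/*
 * Same ordering as compare_commits_by_gen_then_commit_date(): higher
 * generation first, then newer commit date to break ties.
 */
int toy_cmp_gen_then_date(const void *a_, const void *b_)
{
	const struct toy_commit *a = a_, *b = b_;

	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;

	/* newer commits with larger date first */
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}
```

Sorting { {2, 500}, {3, 100}, {2, 900}, {0xFFFFFFFF, 50} } with qsort() and this comparator puts the commit with infinite generation first, then generation 3 (despite its old date), then the two generation-2 commits by date. That is the clock-skew case from the commit message: a child with an older date than its parent is still inspected before the parent.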
* [PATCH v2 06/10] commit.c: use generation to halt paint walk 2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee ` (4 preceding siblings ...) 2018-04-09 16:42 ` [PATCH v2 05/10] commit: use generations in paint_down_to_common() Derrick Stolee @ 2018-04-09 16:42 ` Derrick Stolee 2018-04-11 3:02 ` Junio C Hamano 2018-04-09 16:42 ` [PATCH v2 07/10] commit-graph.txt: update future work Derrick Stolee ` (4 subsequent siblings) 10 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee In paint_down_to_common(), the walk is halted when the queue contains only stale commits. The queue_has_nonstale() method iterates over the entire queue looking for a nonstale commit. In a wide commit graph where the two sides share many commits in common, but have deep sets of different commits, this method may inspect many elements before finding a nonstale commit. In the worst case, this can give quadratic performance in paint_down_to_common(). Convert queue_has_nonstale() to use generation numbers for an O(1) termination condition. To properly take advantage of this condition, track the minimum generation number of a commit that enters the queue with nonstale status. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 37 ++++++++++++++++++++++++++++++------- 1 file changed, 30 insertions(+), 7 deletions(-) diff --git a/commit.c b/commit.c index 95ae7e13a3..00bdc2ab21 100644 --- a/commit.c +++ b/commit.c @@ -776,14 +776,22 @@ void sort_in_topological_order(struct commit_list **list, enum rev_sort_order so static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT); -static int queue_has_nonstale(struct prio_queue *queue) +static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen) { - int i; - for (i = 0; i < queue->nr; i++) { - struct commit *commit = queue->array[i].data; - if (!(commit->object.flags & STALE)) - return 1; + if (min_gen != GENERATION_NUMBER_INFINITY) { + if (queue->nr > 0) { + struct commit *commit = queue->array[0].data; + return commit->generation >= min_gen; + } + } else { + int i; + for (i = 0; i < queue->nr; i++) { + struct commit *commit = queue->array[i].data; + if (!(commit->object.flags & STALE)) + return 1; + } } + return 0; } @@ -793,6 +801,8 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; + uint32_t last_gen = GENERATION_NUMBER_INFINITY; + uint32_t min_nonstale_gen = GENERATION_NUMBER_INFINITY; one->object.flags |= PARENT1; if (!n) { @@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc return result; } prio_queue_put(&queue, one); + if (one->generation < min_nonstale_gen) + min_nonstale_gen = one->generation; for (i = 0; i < n; i++) { twos[i]->object.flags |= PARENT2; prio_queue_put(&queue, twos[i]); + if (twos[i]->generation < min_nonstale_gen) + min_nonstale_gen = twos[i]->generation; } - while (queue_has_nonstale(&queue)) { + while (queue_has_nonstale(&queue, min_nonstale_gen)) { struct commit *commit = prio_queue_get(&queue); struct commit_list *parents; int 
flags; + if (commit->generation > last_gen) + BUG("bad generation skip"); + + last_gen = commit->generation; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc return NULL; p->object.flags |= flags; prio_queue_put(&queue, p); + + if (!(flags & STALE) && + p->generation < min_nonstale_gen) + min_nonstale_gen = p->generation; } } -- 2.17.0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH v2 06/10] commit.c: use generation to halt paint walk 2018-04-09 16:42 ` [PATCH v2 06/10] commit.c: use generation to halt paint walk Derrick Stolee @ 2018-04-11 3:02 ` Junio C Hamano 2018-04-11 13:24 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Junio C Hamano @ 2018-04-11 3:02 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, peff, avarab, sbeller, larsxschneider, bmwill Derrick Stolee <dstolee@microsoft.com> writes: > @@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc > return result; > } > prio_queue_put(&queue, one); > + if (one->generation < min_nonstale_gen) > + min_nonstale_gen = one->generation; > > for (i = 0; i < n; i++) { > twos[i]->object.flags |= PARENT2; > prio_queue_put(&queue, twos[i]); > + if (twos[i]->generation < min_nonstale_gen) > + min_nonstale_gen = twos[i]->generation; > } > > - while (queue_has_nonstale(&queue)) { > + while (queue_has_nonstale(&queue, min_nonstale_gen)) { > struct commit *commit = prio_queue_get(&queue); > struct commit_list *parents; > int flags; > > + if (commit->generation > last_gen) > + BUG("bad generation skip"); > + > + last_gen = commit->generation; > + > flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); > if (flags == (PARENT1 | PARENT2)) { > if (!(commit->object.flags & RESULT)) { > @@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc > return NULL; > p->object.flags |= flags; Hmph. Can a commit that used to be not stale (and contributed to the current value of min_nonstale_gen) become stale here by getting visited twice, invalidating the value in min_nonstale_gen? > prio_queue_put(&queue, p); > + > + if (!(flags & STALE) && > + p->generation < min_nonstale_gen) > + min_nonstale_gen = p->generation; > } > } ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v2 06/10] commit.c: use generation to halt paint walk 2018-04-11 3:02 ` Junio C Hamano @ 2018-04-11 13:24 ` Derrick Stolee 0 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-11 13:24 UTC (permalink / raw) To: Junio C Hamano, Derrick Stolee Cc: git, peff, avarab, sbeller, larsxschneider, bmwill On 4/10/2018 11:02 PM, Junio C Hamano wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> @@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc >> return result; >> } >> prio_queue_put(&queue, one); >> + if (one->generation < min_nonstale_gen) >> + min_nonstale_gen = one->generation; >> >> for (i = 0; i < n; i++) { >> twos[i]->object.flags |= PARENT2; >> prio_queue_put(&queue, twos[i]); >> + if (twos[i]->generation < min_nonstale_gen) >> + min_nonstale_gen = twos[i]->generation; >> } >> >> - while (queue_has_nonstale(&queue)) { >> + while (queue_has_nonstale(&queue, min_nonstale_gen)) { >> struct commit *commit = prio_queue_get(&queue); >> struct commit_list *parents; >> int flags; >> >> + if (commit->generation > last_gen) >> + BUG("bad generation skip"); >> + >> + last_gen = commit->generation; >> + >> flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); >> if (flags == (PARENT1 | PARENT2)) { >> if (!(commit->object.flags & RESULT)) { >> @@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc >> return NULL; >> p->object.flags |= flags; > Hmph. Can a commit that used to be not stale (and contributed to > the current value of min_nonstale_gen) become stale here by getting > visited twice, invalidating the value in min_nonstale_gen? min_nonstale_gen can be "wrong" in the way you say, but fits the definition from the commit message: "To properly take advantage of this condition, track the minimum generation number of a commit that **enters the queue** with nonstale status." 
(Emphasis added)

You make an excellent point about how this can be problematic. I was confused by the lack of clear performance benefits here, but I think that whatever benefits making queue_has_nonstale() be O(1) were removed by walking more commits than necessary.

Consider the following commit graph, where M is a parent of both A and B, S is a parent of M and B, and there is a large set of commits reachable from M with generation number larger than gen(S).

    A    B
    | __/|
    |/   |
    M    |
    |\   |
    . |  |
    . |  |
    . |_/
    |/
    S

Between A and B, the true merge base is M. Anything reachable from M is marked as stale. When S is added to the queue, it is only reachable from B, so it is non-stale. However, it is marked stale after M is walked. The old code would detect this as a termination condition, but the new code would not.

I think this data shape is actually common (not exactly, as it may be that some ancestor of M provides a second path to S) especially in the world of pull requests and users merging master into their topic branches.

I'll remove this commit in the next version, but use the new prototype for queue_has_nonstale() in "commit: add short-circuit to paint_down_to_common()" using the given 'min_generation' instead of 'min_nonstale_gen'.

Thanks,
-Stolee

^ permalink raw reply [flat|nested] 162+ messages in thread
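The disagreement between the old and the new termination check can be reproduced with a minimal sketch. It assumes a hypothetical toy_entry in place of the real queue entries and an array whose first element stands for the head of the priority queue; the generation numbers used in the note below are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_STALE 1u

/* Hypothetical queue entry: generation number plus a STALE bit. */
struct toy_entry {
	uint32_t generation;
	unsigned flags;
};

/* Old check: scan the whole queue for a commit that is not stale. */
int has_nonstale_scan(const struct toy_entry *q, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!(q[i].flags & TOY_STALE))
			return 1;
	return 0;
}

/*
 * Check from this patch: look only at the head of the queue (the entry
 * with the largest generation) and compare it against the smallest
 * generation that ever entered the queue as non-stale.
 */
int has_nonstale_by_gen(const struct toy_entry *q, size_t nr, uint32_t min_gen)
{
	return nr > 0 && q[0].generation >= min_gen;
}
```

Suppose gen(S) = 1 and the commits between M and S have generation 4 and below. After M is walked, the queue holds only stale entries, for example { {4, TOY_STALE}, {1, TOY_STALE} }, while the tracked minimum is still 1 because S entered the queue as non-stale. The scan returns 0 and the walk stops; the generation-based check returns 1 and the walk continues through every commit between M and S.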
* [PATCH v2 07/10] commit-graph.txt: update future work 2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee ` (5 preceding siblings ...) 2018-04-09 16:42 ` [PATCH v2 06/10] commit.c: use generation to halt paint walk Derrick Stolee @ 2018-04-09 16:42 ` Derrick Stolee 2018-04-12 9:12 ` Junio C Hamano 2018-04-09 16:42 ` [PATCH v2 08/10] ref-filter: use generation number for --contains Derrick Stolee ` (4 subsequent siblings) 10 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Expand the section on generation numbers to discuss how the two "special" generation numbers GENERATION_NUMBER_INFINITY and *_ZERO interact with other generation numbers. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/technical/commit-graph.txt | 50 +++++++++++++++++++++--- 1 file changed, 44 insertions(+), 6 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..a8df0ae9db 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -77,6 +77,49 @@ in the commit graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + + If A and B are commits with generation numbers N and M, respectively, + and N < M, then A cannot reach B.
+ +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +Here is a diagram to visualize the shape of the full commit graph, and +how different generation numbers relate: + + +-----------------------------------------+ + | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF | + +-----------------------------------------+ + | | ^ + | | | + | +------+ + | [gen(A) = gen(B)] + V + +-------------------------------------+ + | 0 < commit->generation < 0x40000000 | + +-------------------------------------+ + | | ^ + | | | + | +------+ + | [gen(A) > gen(B)] + V + +-------------------------------------+ + | GENERATION_NUMBER_ZERO = 0 | + +-------------------------------------+ + | ^ + | | + +------+ + [gen(A) = gen(B)] + Design Details -------------- @@ -98,17 +141,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: - - paint_down_to_common() - 'log --topo-order' - Currently, parse_commit_gently() requires filling in the root tree -- 2.17.0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH v2 07/10] commit-graph.txt: update future work 2018-04-09 16:42 ` [PATCH v2 07/10] commit-graph.txt: update future work Derrick Stolee @ 2018-04-12 9:12 ` Junio C Hamano 2018-04-12 11:35 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Junio C Hamano @ 2018-04-12 9:12 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, peff, avarab, sbeller, larsxschneider, bmwill Derrick Stolee <dstolee@microsoft.com> writes: > +Here is a diagram to visualize the shape of the full commit graph, and > +how different generation numbers relate: > + > + +-----------------------------------------+ > + | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF | > + +-----------------------------------------+ > + | | ^ > + | | | > + | +------+ > + | [gen(A) = gen(B)] > + V > + +-------------------------------------+ > + | 0 < commit->generation < 0x40000000 | > + +-------------------------------------+ > + | | ^ > + | | | > + | +------+ > + | [gen(A) > gen(B)] > + V > + +-------------------------------------+ > + | GENERATION_NUMBER_ZERO = 0 | > + +-------------------------------------+ > + | ^ > + | | > + +------+ > + [gen(A) = gen(B)] It may be just me but all I can read out of the above is that commit->generation may store 0xFFFFFFFF, a value between 0 and 0x40000000, or 0. I cannot quite tell what the notation [gen(A) <cmp> gen(B)] is trying to say. I am guessing "Two generation numbers within the 'valid' range can be compared" is what the second one is trying to say, but it is much less interesting to know that two infinities compare equal than how generation numbers from different classes compare, which cannot be depicted in the above notation, I am afraid. For example, don't we want to say that a commit with INF can never be reached by a commit with a valid generation number, or something like that? 
> Design Details > -------------- > > @@ -98,17 +141,12 @@ Future Work > - The 'commit-graph' subcommand does not have a "verify" mode that is > necessary for integration with fsck. > > -- The file format includes room for precomputed generation numbers. These > - are not currently computed, so all generation numbers will be marked as > - 0 (or "uncomputed"). A later patch will include this calculation. > - > - After computing and storing generation numbers, we must make graph > walks aware of generation numbers to gain the performance benefits they > enable. This will mostly be accomplished by swapping a commit-date-ordered > priority queue with one ordered by generation number. The following > - operations are important candidates: > + operation is an important candidate: Good. ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v2 07/10] commit-graph.txt: update future work 2018-04-12 9:12 ` Junio C Hamano @ 2018-04-12 11:35 ` Derrick Stolee 2018-04-13 9:53 ` Jakub Narebski 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-12 11:35 UTC (permalink / raw) To: Junio C Hamano, Derrick Stolee Cc: git, peff, avarab, sbeller, larsxschneider, bmwill On 4/12/2018 5:12 AM, Junio C Hamano wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> +Here is a diagram to visualize the shape of the full commit graph, and >> +how different generation numbers relate: >> + >> + +-----------------------------------------+ >> + | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF | >> + +-----------------------------------------+ >> + | | ^ >> + | | | >> + | +------+ >> + | [gen(A) = gen(B)] >> + V >> + +-------------------------------------+ >> + | 0 < commit->generation < 0x40000000 | >> + +-------------------------------------+ >> + | | ^ >> + | | | >> + | +------+ >> + | [gen(A) > gen(B)] >> + V >> + +-------------------------------------+ >> + | GENERATION_NUMBER_ZERO = 0 | >> + +-------------------------------------+ >> + | ^ >> + | | >> + +------+ >> + [gen(A) = gen(B)] > It may be just me but all I can read out of the above is that > commit->generation may store 0xFFFFFFFF, a value between 0 and > 0x40000000, or 0. I cannot quite tell what the notation [gen(A) > <cmp> gen(B)] is trying to say. I am guessing "Two generation > numbers within the 'valid' range can be compared" is what the second > one is trying to say, but it is much less interesting to know that > two infinities compare equal than how generation numbers from > different classes compare, which cannot be depicted in the above > notation, I am afraid. For example, don't we want to say that a > commit with INF can never be reached by a commit with a valid > generation number, or something like that? 
My intention with the arrows was to demonstrate where parent relationships can go, and the generation-number relation between a commit A with parent B. Clearly, this diagram is less than helpful. > >> Design Details >> -------------- >> >> @@ -98,17 +141,12 @@ Future Work >> - The 'commit-graph' subcommand does not have a "verify" mode that is >> necessary for integration with fsck. >> >> -- The file format includes room for precomputed generation numbers. These >> - are not currently computed, so all generation numbers will be marked as >> - 0 (or "uncomputed"). A later patch will include this calculation. >> - >> - After computing and storing generation numbers, we must make graph >> walks aware of generation numbers to gain the performance benefits they >> enable. This will mostly be accomplished by swapping a commit-date-ordered >> priority queue with one ordered by generation number. The following >> - operations are important candidates: >> + operation is an important candidate: > Good. ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v2 07/10] commit-graph.txt: update future work 2018-04-12 11:35 ` Derrick Stolee @ 2018-04-13 9:53 ` Jakub Narebski 0 siblings, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-04-13 9:53 UTC (permalink / raw) To: Derrick Stolee Cc: Junio C Hamano, Derrick Stolee, git, Jeff King, Ævar Arnfjörð Bjarmason, Stefan Beller, Lars Schneider, Brandon Williams Derrick Stolee <stolee@gmail.com> writes: > On 4/12/2018 5:12 AM, Junio C Hamano wrote: >> Derrick Stolee <dstolee@microsoft.com> writes: >> >>> +Here is a diagram to visualize the shape of the full commit graph, and >>> +how different generation numbers relate: >>> + >>> + +-----------------------------------------+ >>> + | GENERATION_NUMBER_INFINITY = 0xFFFFFFFF | >>> + +-----------------------------------------+ >>> + | | ^ >>> + | | | >>> + | +------+ >>> + | [gen(A) = gen(B)] >>> + V >>> + +-------------------------------------+ >>> + | 0 < commit->generation < 0x40000000 | >>> + +-------------------------------------+ >>> + | | ^ >>> + | | | >>> + | +------+ >>> + | [gen(A) > gen(B)] >>> + V >>> + +-------------------------------------+ >>> + | GENERATION_NUMBER_ZERO = 0 | >>> + +-------------------------------------+ >>> + | ^ >>> + | | >>> + +------+ >>> + [gen(A) = gen(B)] >> >> It may be just me but all I can read out of the above is that It's not just you. >> commit->generation may store 0xFFFFFFFF, a value between 0 and >> 0x40000000, or 0. I cannot quite tell what the notation [gen(A) >> <cmp> gen(B)] is trying to say. I am guessing "Two generation >> numbers within the 'valid' range can be compared" is what the second >> one is trying to say, but it is much less interesting to know that >> two infinities compare equal than how generation numbers from >> different classes compare, which cannot be depicted in the above >> notation, I am afraid. 
For example, don't we want to say that a >> commit with INF can never be reached by a commit with a valid >> generation number, or something like that? > > My intention with the arrows was to demonstrate where parent > relationships can go, and the generation-number relation between a > commit A with parent B. Clearly, this diagram is less than helpful.

Perhaps the following table would make the information clearer (perhaps in addition to the above graph, but without "gen(A) {cmp} gen(B)" arrows).

I assume that it is possible to have both GENERATION_NUMBER_ZERO and non-zero generation numbers in one repo, perhaps via alternates. I also assume that A != B, and that generation numbers (both set, and 0s) are transitively closed under reachability.

  gen(A) \     commit B -> |                     gen(B)
          \------\         |
  commit A        \        | 0xFFFFFFFF | larger   | smaller  | 0x00000000
  -----------------\-------+------------+----------+----------+------------
  0xFFFFFFFF               |  =            >          >          >
  0 < larger < 0x40000000  |  < N          = n        >          >
  0 < smaller < 0x40000000 |  < N          < N        = n        >
  0x00000000               |  < N          < N        < N        =

The "<", "=", ">" denote the result of the comparison between gen(A) and gen(B). Generation numbers create a negative-cut filter: "N" and "n" denote situations where we know from gen(A) and gen(B) that B is not reachable from A.

As can be seen, if we use gen(A) < gen(B) as cutoff, we don't need to treat "infinity" and "zero" in a special way.

Best,
--
Jakub Narębski

^ permalink raw reply [flat|nested] 162+ messages in thread
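The conclusion of the table can be written down as a one-line predicate. This is a minimal sketch with a hypothetical function name, relying on the same assumptions as the table (A != B, generation numbers consistent under reachability).

```c
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFFu
#define GENERATION_NUMBER_ZERO     0u

/*
 * Hypothetical helper: returns 1 when the generation numbers alone show
 * that A cannot reach B (the "< N" cells of the table), 0 when a walk
 * is still needed.  No special case for INFINITY or ZERO is required.
 */
int gen_rules_out_reach(uint32_t gen_a, uint32_t gen_b)
{
	return gen_a < gen_b;
}
```

With gen_a = 5 the predicate returns 1 for gen_b = GENERATION_NUMBER_INFINITY and for gen_b = 7, and 0 for gen_b = 5, gen_b = 3 and gen_b = GENERATION_NUMBER_ZERO. The strict cutoff misses the "= n" cells (equal, computed generation numbers of two distinct commits), which costs a few extra commits walked but keeps the check uniform.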
* [PATCH v2 08/10] ref-filter: use generation number for --contains 2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee ` (6 preceding siblings ...) 2018-04-09 16:42 ` [PATCH v2 07/10] commit-graph.txt: update future work Derrick Stolee @ 2018-04-09 16:42 ` Derrick Stolee 2018-04-09 16:42 ` [PATCH v2 09/10] commit: use generation numbers for in_merge_bases() Derrick Stolee ` (2 subsequent siblings) 10 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee A commit A can reach a commit B only if the generation number of A is strictly larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for '--contains' type queries. On a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following performance improvement: Before: 0.81s After: 0.04s Rel %: -95% Helped-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- ref-filter.c | 24 +++++++++++++++++++----- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index 45fc56216a..2f5e79b5de 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1584,7 +1584,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1598,8 +1599,11 @@ static enum contains_result contains_test(struct commit *candidate, return CONTAINS_YES; } - /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1615,8
+1619,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_INFINITY; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1634,7 +1648,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1648,7 +1662,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, -- 2.17.0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
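The cutoff logic in the patch above can be sketched in isolation. The following is only an illustration over plain arrays of generation numbers; min_generation() and below_cutoff() are hypothetical names, not functions from ref-filter.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/*
 * Hypothetical helper: smallest generation number among the wanted
 * commits. Starts at INFINITY, like the cutoff in the patch.
 */
static uint32_t min_generation(const uint32_t *want_gen, size_t nr)
{
	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
	size_t i;

	for (i = 0; i < nr; i++)
		if (want_gen[i] < cutoff)
			cutoff = want_gen[i];
	return cutoff;
}

/*
 * Hypothetical helper: 1 when the candidate sits strictly below every
 * wanted commit and therefore cannot contain any of them.
 */
static int below_cutoff(uint32_t candidate_gen, uint32_t cutoff)
{
	return candidate_gen < cutoff;
}
```

Taking the minimum is what makes a single comparison sufficient: a candidate below the lowest wanted generation is below all of them, so the walk can stop without visiting its parents.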
* [PATCH v2 09/10] commit: use generation numbers for in_merge_bases() 2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee ` (7 preceding siblings ...) 2018-04-09 16:42 ` [PATCH v2 08/10] ref-filter: use generation number for --contains Derrick Stolee @ 2018-04-09 16:42 ` Derrick Stolee 2018-04-09 16:42 ` [PATCH v2 10/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee 10 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the reference commits. Performance tests were run on a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag. 
Also, all tags were copied to branches and 'git branch --contains' was tested: Before: 60.0s After: 0.4s Rel %: -99.3% Reported-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 00bdc2ab21..0b155dece8 100644 --- a/commit.c +++ b/commit.c @@ -1059,12 +1059,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (min_generation > reference[i]->generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return 0; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) -- 2.17.0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
* [PATCH v2 10/10] commit: add short-circuit to paint_down_to_common() 2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee ` (8 preceding siblings ...) 2018-04-09 16:42 ` [PATCH v2 09/10] commit: use generation numbers for in_merge_bases() Derrick Stolee @ 2018-04-09 16:42 ` Derrick Stolee 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee 10 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-09 16:42 UTC (permalink / raw) To: git; +Cc: peff, avarab, sbeller, larsxschneider, bmwill, Derrick Stolee When running 'git branch --contains', the in_merge_bases_many() method calls paint_down_to_common() to discover if a specific commit is reachable from a set of branches. Commits with lower generation number are not needed to correctly answer the containment query of in_merge_bases_many(). Add a new parameter, min_generation, to paint_down_to_common() that prevents walking commits with generation number strictly less than min_generation. If 0 is given, then there is no functional change. For in_merge_bases_many(), we can pass commit->generation as the cutoff, and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index 0b155dece8..7348075e38 100644 --- a/commit.c +++ b/commit.c @@ -796,7 +796,9 @@ static int queue_has_nonstale(struct prio_queue *queue, uint32_t min_gen) } /* all input commits in one and twos[] must have been parsed! 
*/ -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) +static struct commit_list *paint_down_to_common(struct commit *one, int n, + struct commit **twos, + int min_generation) { struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; @@ -830,6 +832,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc last_gen = commit->generation; + if (commit->generation < min_generation) + break; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -882,7 +887,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(&list); @@ -949,7 +954,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1073,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return 0; - bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) ret = 1; clear_commit_marks(commit, all_flags); -- 2.17.0 ^ permalink raw reply related [flat|nested] 162+ messages in thread
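The reason a plain "break" is enough in the hunk above is that the queue hands out commits in non-increasing generation order. A minimal sketch of that idea, assuming the queue order is already given as an array; walk_with_cutoff() is a hypothetical stand-in for the loop in paint_down_to_common(), not the real function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * gens[] stands in for the order in which a generation-ordered
 * priority queue would hand out commits: non-increasing generation.
 * Returns how many entries are processed before the cutoff ends
 * the walk.
 */
static size_t walk_with_cutoff(const uint32_t *gens, size_t nr,
			       uint32_t min_generation)
{
	size_t visited = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (gens[i] < min_generation)
			break; /* nothing later in the order is larger */
		visited++;
	}
	return visited;
}
```

Passing 0 as min_generation never triggers the break, which matches the statement in the commit message that 0 means no functional change.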
* [PATCH v3 0/9] Compute and consume generation numbers 2018-04-09 16:41 ` [PATCH v2 00/10] " Derrick Stolee ` (9 preceding siblings ...) 2018-04-09 16:42 ` [PATCH v2 10/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee @ 2018-04-17 17:00 ` Derrick Stolee 2018-04-17 17:00 ` [PATCH v3 1/9] commit: add generation number to struct commmit Derrick Stolee ` (10 more replies) 10 siblings, 11 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw) To: git Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy, Derrick Stolee Thanks for all the help on v2. Here are a few changes between versions: * Removed the constant-time check in queue_has_nonstale() due to the possibility of a performance hit and no evidence of a performance benefit in typical cases. * Reordered the commits about loading commits from the commit-graph. This way it is easier to demonstrate the incorrect checks. On my machine, every commit compiles and the test suite passes, but patches 6-8 have the bug that is fixed in patch 9 "merge: check config before loading commits". * The interaction with parse_commit_in_graph() from parse_object() is replaced with a new 'check_graph' parameter in parse_commit_buffer(). This allows us to fill in the graph_pos and generation values for commits that are parsed directly from a buffer. This keeps the existing behavior that a commit parsed this way should match its buffer. * There was discussion about making GENERATION_NUMBER_MAX assignable by an environment variable so we could add tests that exercise the behavior of capping a generation at that value. Perhaps the code around this is simple enough that we do not need to add that complexity. Thanks, -Stolee -- >8 -- This is one of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph) and lazy-loading trees (ds/lazy-load-trees). 
As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This section is expanded to describe the interaction with special generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph file) and *_ZERO (commits in a commit-graph file written before generation numbers were implemented). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a short-circuit mechanism to improve performance of `git branch --contains`. Further, use generation numbers for 'git tag --contains', providing a significant speedup (at least 95% for some cases). A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. This patch series is built on ds/lazy-load-trees. Derrick Stolee (9): commit: add generation number to struct commmit commit-graph: compute generation numbers commit: use generations in paint_down_to_common() commit-graph.txt: update design document ref-filter: use generation number for --contains commit: use generation numbers for in_merge_bases() commit: add short-circuit to paint_down_to_common() commit-graph: always load commit-graph information merge: check config before loading commits Documentation/technical/commit-graph.txt | 30 +++++-- alloc.c | 1 + builtin/merge.c | 5 +- commit-graph.c | 99 +++++++++++++++++++----- commit-graph.h | 8 ++ commit.c | 54 +++++++++++-- commit.h | 7 +- object.c | 2 +- ref-filter.c | 23 +++++- sha1_file.c | 2 +- t/t5318-commit-graph.sh | 9 +++ 11 files changed, 199 insertions(+), 41 deletions(-) base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707 -- 2.17.0.39.g685157f7fb ^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH v3 1/9] commit: add generation number to struct commmit 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee @ 2018-04-17 17:00 ` Derrick Stolee 2018-04-17 17:00 ` [PATCH v3 2/9] commit-graph: compute generation numbers Derrick Stolee ` (9 subsequent siblings) 10 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw) To: git Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy, Derrick Stolee The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use three special values to signal the generation number is invalid: GENERATION_NUMBER_INFINITY 0xFFFFFFFF GENERATION_NUMBER_MAX 0x3FFFFFFF GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_MAX) means the generation number is too large to store in the commit-graph file. The third (_ZERO) means the generation number was loaded from a commit graph file that was written by a version of git that did not support generation numbers. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- alloc.c | 1 + commit-graph.c | 2 ++ commit.h | 4 ++++ 3 files changed, 7 insertions(+) diff --git a/alloc.c b/alloc.c index cf4f8b61e1..e8ab14f4a1 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_INFINITY; return c; } diff --git a/commit-graph.c b/commit-graph.c index 70fa1b25fd..9ad21c3ffb 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + pptr = &item->parents; edge_value = get_be32(commit_data + g->hash_len); diff --git a/commit.h b/commit.h index 23a3f364ed..aac3b8c56f 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,9 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF +#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF +#define GENERATION_NUMBER_MAX 0x3FFFFFFF +#define GENERATION_NUMBER_ZERO 0 struct commit_list { struct commit *item; @@ -30,6 +33,7 @@ struct commit { */ struct tree *maybe_tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
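The ">> 2" in fill_commit_in_graph() above reflects a 32-bit word that carries the generation number in its top 30 bits. A small sketch of that packing on host-order integers follows; the helper names are hypothetical, and the assumption that the low two bits hold the high bits of the commit date is inferred from the date_high handling visible in the hunk context, not taken from the format specification.

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_MAX 0x3FFFFFFF

/*
 * Hypothetical helper: place a generation number (at most 30 bits)
 * above two bits assumed to hold the high part of the commit date.
 */
static uint32_t pack_word(uint32_t generation, uint32_t date_high_bits)
{
	return (generation << 2) | (date_high_bits & 0x3);
}

/* Recover the generation number, as the ">> 2" in the patch does. */
static uint32_t unpack_generation(uint32_t word)
{
	return word >> 2;
}

/* Recover the two low bits (assumed to belong to the date). */
static uint32_t unpack_date_high(uint32_t word)
{
	return word & 0x3;
}
```

This also shows why GENERATION_NUMBER_MAX is 0x3FFFFFFF: it is the largest value that survives a round trip through the 30 available bits.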
* [PATCH v3 2/9] commit-graph: compute generation numbers 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee 2018-04-17 17:00 ` [PATCH v3 1/9] commit: add generation number to struct commmit Derrick Stolee @ 2018-04-17 17:00 ` Derrick Stolee 2018-04-17 17:00 ` [PATCH v3 3/9] commit: use generations in paint_down_to_common() Derrick Stolee ` (8 subsequent siblings) 10 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw) To: git Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy, Derrick Stolee While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy. The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file. If a computed generation number would exceed GENERATION_NUMBER_MAX, then use GENERATION_NUMBER_MAX instead. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit-graph.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 46 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 9ad21c3ffb..688d5b1801 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -439,6 +439,10 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + if ((*list)->generation != GENERATION_NUMBER_INFINITY) { + packedDate[0] |= htonl((*list)->generation << 2); + } + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); @@ -571,6 +575,46 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct commit** commits, + int nr_commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits[i], &list); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, &list); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(&list); + } + + if (current->generation > GENERATION_NUMBER_MAX) + current->generation = GENERATION_NUMBER_MAX; + } + } +} + void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, @@ -694,6 +738,8 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); + 
compute_generation_numbers(commits.list, commits.nr); + graph_name = get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(&lk, graph_name, 0); -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
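The depth-first strategy from this patch can be tried out on a toy graph. The sketch below is an illustration under simplifying assumptions, not the code from commit-graph.c: commits live in an array, parents are array indexes, each commit has at most two parents, and only INFINITY is treated as "not yet computed" (the patch also recomputes ZERO). struct toy_commit and compute_generations() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define MAX_PARENTS 2
#define NO_PARENT (-1)

struct toy_commit {
	int parents[MAX_PARENTS]; /* indexes into the array, or NO_PARENT */
	uint32_t generation;
};

/*
 * Assign generation numbers with an explicit stack instead of
 * recursion. stack[] must have room for nr entries; in an acyclic
 * graph no commit is on the stack twice.
 */
static void compute_generations(struct toy_commit *commits, int nr, int *stack)
{
	int i;

	for (i = 0; i < nr; i++) {
		int top = 0;

		if (commits[i].generation != GENERATION_NUMBER_INFINITY)
			continue;

		stack[top++] = i;
		while (top > 0) {
			struct toy_commit *cur = &commits[stack[top - 1]];
			uint32_t max_generation = 0;
			int all_parents_computed = 1;
			int p;

			for (p = 0; p < MAX_PARENTS; p++) {
				int parent = cur->parents[p];

				if (parent == NO_PARENT)
					continue;
				if (commits[parent].generation == GENERATION_NUMBER_INFINITY) {
					/* visit this parent first, then come back */
					all_parents_computed = 0;
					stack[top++] = parent;
					break;
				}
				if (commits[parent].generation > max_generation)
					max_generation = commits[parent].generation;
			}

			if (all_parents_computed) {
				cur->generation = max_generation + 1;
				if (cur->generation > GENERATION_NUMBER_MAX)
					cur->generation = GENERATION_NUMBER_MAX;
				top--;
			}
		}
	}
}
```

Commits that already carry a generation number are skipped both as starting points and as parents, which is why the cost is tied to the number of new commits. The sketch applies the cap only after all parents are known.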
* [PATCH v3 3/9] commit: use generations in paint_down_to_common() 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee 2018-04-17 17:00 ` [PATCH v3 1/9] commit: add generation number to struct commmit Derrick Stolee 2018-04-17 17:00 ` [PATCH v3 2/9] commit-graph: compute generation numbers Derrick Stolee @ 2018-04-17 17:00 ` Derrick Stolee 2018-04-18 14:31 ` Jakub Narebski 2018-04-17 17:00 ` [PATCH v3 4/9] commit-graph.txt: update design document Derrick Stolee ` (7 subsequent siblings) 10 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw) To: git Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy, Derrick Stolee Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 20 +++++++++++++++++++- commit.h | 1 + 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 711f674c18..a44899c733 100644 --- a/commit.c +++ b/commit.c @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + /* newer commits first */ + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* use date as a heuristic when generataions are equal */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index aac3b8c56f..64436ff44e 100644 --- a/commit.h +++ b/commit.h @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
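To see the ordering this comparator produces, the same logic can be run through qsort() on a few hand-made entries. This is only a sketch: struct dated_commit and cmp_gen_then_date() are hypothetical, and the comparator uses the two-argument qsort() signature rather than the three-argument one that prio_queue expects.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

struct dated_commit {
	uint32_t generation;
	uint64_t date;
};

/* qsort() comparator: higher generation first, then newer date first. */
static int cmp_gen_then_date(const void *a_, const void *b_)
{
	const struct dated_commit *a = a_, *b = b_;

	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;

	/* generations are equal: fall back to the commit date */
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}
```

A commit with generation 1 sorts last even if its date is the newest in the set, which is the clock-skew case the commit message describes. Commits at INFINITY sort ahead of everything with a computed generation.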
* Re: [PATCH v3 3/9] commit: use generations in paint_down_to_common() 2018-04-17 17:00 ` [PATCH v3 3/9] commit: use generations in paint_down_to_common() Derrick Stolee @ 2018-04-18 14:31 ` Jakub Narebski 2018-04-18 14:46 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-18 14:31 UTC (permalink / raw) To: Derrick Stolee Cc: git, peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy Derrick Stolee <dstolee@microsoft.com> writes: > Define compare_commits_by_gen_then_commit_date(), which uses generation > numbers as a primary comparison and commit date to break ties (or as a > comparison when both commits do not have computed generation numbers). > > Since the commit-graph file is closed under reachability, we know that > all commits in the file have generation at most GENERATION_NUMBER_MAX > which is less than GENERATION_NUMBER_INFINITY. > > This change does not affect the number of commits that are walked during > the execution of paint_down_to_common(), only the order that those > commits are inspected. In the case that commit dates violate topological > order (i.e. a parent is "newer" than a child), the previous code could > walk a commit twice: if a commit is reached with the PARENT1 bit, but > later is re-visited with the PARENT2 bit, then that PARENT2 bit must be > propagated to its parents. Using generation numbers avoids this extra > effort, even if it is somewhat rare. Does it mean that it gives no measureable performance improvements for typical test cases? 
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > commit.c | 20 +++++++++++++++++++- > commit.h | 1 + > 2 files changed, 20 insertions(+), 1 deletion(-) > > diff --git a/commit.c b/commit.c > index 711f674c18..a44899c733 100644 > --- a/commit.c > +++ b/commit.c > @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, > return 0; > } > > +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) > +{ > + const struct commit *a = a_, *b = b_; > + > + /* newer commits first */ > + if (a->generation < b->generation) > + return 1; > + else if (a->generation > b->generation) > + return -1; > + > + /* use date as a heuristic when generataions are equal */ Very minor typo in above comment: s/generataions/generations/ > + if (a->date < b->date) > + return 1; > + else if (a->date > b->date) > + return -1; > + return 0; > +} > + > int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) > { > const struct commit *a = a_, *b = b_; > @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) > /* all input commits in one and twos[] must have been parsed! 
*/ > static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) > { > - struct prio_queue queue = { compare_commits_by_commit_date }; > + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; > struct commit_list *result = NULL; > int i; > > diff --git a/commit.h b/commit.h > index aac3b8c56f..64436ff44e 100644 > --- a/commit.h > +++ b/commit.h > @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); > extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); > > int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); > +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); > > LAST_ARG_MUST_BE_NULL > extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v3 3/9] commit: use generations in paint_down_to_common() 2018-04-18 14:31 ` Jakub Narebski @ 2018-04-18 14:46 ` Derrick Stolee 0 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-18 14:46 UTC (permalink / raw) To: Jakub Narebski, Derrick Stolee Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy On 4/18/2018 10:31 AM, Jakub Narebski wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> Define compare_commits_by_gen_then_commit_date(), which uses generation >> numbers as a primary comparison and commit date to break ties (or as a >> comparison when both commits do not have computed generation numbers). >> >> Since the commit-graph file is closed under reachability, we know that >> all commits in the file have generation at most GENERATION_NUMBER_MAX >> which is less than GENERATION_NUMBER_INFINITY. >> >> This change does not affect the number of commits that are walked during >> the execution of paint_down_to_common(), only the order that those >> commits are inspected. In the case that commit dates violate topological >> order (i.e. a parent is "newer" than a child), the previous code could >> walk a commit twice: if a commit is reached with the PARENT1 bit, but >> later is re-visited with the PARENT2 bit, then that PARENT2 bit must be >> propagated to its parents. Using generation numbers avoids this extra >> effort, even if it is somewhat rare. > Does it mean that it gives no measureable performance improvements for > typical test cases? Not in this commit. When we add the `min_generation` parameter in a later commit, we do get a significant performance boost (when we can supply a non-zero value to `min_generation`). This step of using generation numbers for the priority is important for that commit, but on its own has limited value outside of the clock-skew case mentioned above. 
> >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> >> --- >> commit.c | 20 +++++++++++++++++++- >> commit.h | 1 + >> 2 files changed, 20 insertions(+), 1 deletion(-) >> >> diff --git a/commit.c b/commit.c >> index 711f674c18..a44899c733 100644 >> --- a/commit.c >> +++ b/commit.c >> @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, >> return 0; >> } >> >> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) >> +{ >> + const struct commit *a = a_, *b = b_; >> + >> + /* newer commits first */ >> + if (a->generation < b->generation) >> + return 1; >> + else if (a->generation > b->generation) >> + return -1; >> + >> + /* use date as a heuristic when generataions are equal */ > Very minor typo in above comment: > > s/generataions/generations/ Good catch! > >> + if (a->date < b->date) >> + return 1; >> + else if (a->date > b->date) >> + return -1; >> + return 0; >> +} >> + >> int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) >> { >> const struct commit *a = a_, *b = b_; >> @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) >> /* all input commits in one and twos[] must have been parsed! 
*/ >> static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) >> { >> - struct prio_queue queue = { compare_commits_by_commit_date }; >> + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; >> struct commit_list *result = NULL; >> int i; >> >> diff --git a/commit.h b/commit.h >> index aac3b8c56f..64436ff44e 100644 >> --- a/commit.h >> +++ b/commit.h >> @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); >> extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); >> >> int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); >> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); >> >> LAST_ARG_MUST_BE_NULL >> extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); ^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH v3 4/9] commit-graph.txt: update design document 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee ` (2 preceding siblings ...) 2018-04-17 17:00 ` [PATCH v3 3/9] commit: use generations in paint_down_to_common() Derrick Stolee @ 2018-04-17 17:00 ` Derrick Stolee 2018-04-18 19:47 ` Jakub Narebski 2018-04-17 17:00 ` [PATCH v3 5/9] ref-filter: use generation number for --contains Derrick Stolee ` (6 subsequent siblings) 10 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw) To: git Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy, Derrick Stolee We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Expand the section on generation numbers to discuss how the three special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and _MAX interact with other generation numbers. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++----- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..d9f2713efa 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. 
+ +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + + If A and B are commits with generation numbers N amd M, respectively, + and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. + Design Details -------------- @@ -98,17 +121,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: - - paint_down_to_common() - 'log --topo-order' - Currently, parse_commit_gently() requires filling in the root tree -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH v3 4/9] commit-graph.txt: update design document
From: Jakub Narebski @ 2018-04-18 19:47 UTC
To: Derrick Stolee
Cc: git, peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy

Derrick Stolee <dstolee@microsoft.com> writes:

> We now calculate generation numbers in the commit-graph file and use
> them in paint_down_to_common().

All right.

>
> Expand the section on generation numbers to discuss how the three
> special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
> _MAX interact with other generation numbers.

Very good.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++-----
>  1 file changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index 0550c6d0dc..d9f2713efa 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite"
>  generation number and walk until reaching commits with known generation
>  number.
>
> +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> +in the commit-graph file. If a commit-graph file was written by a version
> +of Git that did not compute generation numbers, then those commits will
> +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.

I have to wonder if there would be any released Git that does not compute
generation numbers...

On the other hand, if the user-visible view of the project history
changes, be it because a shallow clone is shortened or deepened, or the
grafts file is edited, or a commit object is replaced with another with
different parents, we can still use the "commit-graph" data and just
pretend that generation numbers (which are invalid in the altered
history) are all zero.  (I'll write about this idea in comments to a
later series.)

Also, with GENERATION_NUMBER_ZERO this series of patches is
self-contained and bisectable.

> +
> +Since the commit-graph file is closed under reachability, we can guarantee
> +the following weaker condition on all commits:

I had to look up the contents of the whole file, but it turns out that
it is all right: "weaker condition" refers to the earlier "N <= M".

Minor sidenote: if one were extremely pedantic, one could say that the
previous condition is incorrect, because it doesn't state explicitly
that commit A != commit B. ;-)

> +
> +    If A and B are commits with generation numbers N amd M, respectively,
> +    and N < M, then A cannot reach B.
> +
> +Note how the strict inequality differs from the inequality when we have
> +fully-computed generation numbers. Using strict inequality may result in
> +walking a few extra commits, but the simplicity in dealing with commits
> +with generation number *_INFINITY or *_ZERO is valuable.
> +
> +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> +generation numbers are computed to be at least this value. We limit at
> +this value since it is the largest value that can be stored in the
> +commit-graph file using the 30 bits available to generation numbers. This
> +presents another case where a commit can have generation number equal to
> +that of a parent.

I wonder if something like the table I proposed in the v2 version of
this patch [1] would make it easier or harder to understand.

[1]: https://public-inbox.org/git/86a7u7mnzi.fsf@gmail.com/

Something like the following:

             |                      gen(B)
gen(A)       | _INFINITY | _MAX     | larger   | smaller  | _ZERO
-------------+-----------+----------+----------+----------+--------
_INFINITY    | =         | >        | >        | >        | >
_MAX         | < N       | =        | >        | >        | >
larger       | < N       | < N      | = n      | >        | >
smaller      | < N       | < N      | < N      | = n      | >
_ZERO        | < N       | < N      | < N      | < N      | =

Here "n" denotes the stronger condition, and "N" denotes the weaker
condition.  We have _INFINITY > _MAX > larger > smaller > _ZERO.

> +
>  Design Details
>  --------------
>
> @@ -98,17 +121,12 @@ Future Work
>  - The 'commit-graph' subcommand does not have a "verify" mode that is
>    necessary for integration with fsck.
>
> -- The file format includes room for precomputed generation numbers. These
> -  are not currently computed, so all generation numbers will be marked as
> -  0 (or "uncomputed"). A later patch will include this calculation.
> -
>  - After computing and storing generation numbers, we must make graph
>    walks aware of generation numbers to gain the performance benefits they
>    enable. This will mostly be accomplished by swapping a commit-date-ordered
>    priority queue with one ordered by generation number. The following
> -  operations are important candidates:
> +  operation is an important candidate:
>
> -    - paint_down_to_common()
>      - 'log --topo-order'
>
>  - Currently, parse_commit_gently() requires filling in the root tree

Looks good.
* [PATCH v3 5/9] ref-filter: use generation number for --contains 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee ` (3 preceding siblings ...) 2018-04-17 17:00 ` [PATCH v3 4/9] commit-graph.txt: update design document Derrick Stolee @ 2018-04-17 17:00 ` Derrick Stolee 2018-04-18 21:02 ` Jakub Narebski 2018-04-17 17:00 ` [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() Derrick Stolee ` (5 subsequent siblings) 10 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw) To: git Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy, Derrick Stolee A commit A can reach a commit B only if the generation number of A is larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for 'git tag --contains' queries. On a copy of the Linux repository where HEAD is containd in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following peformance improvement: Before: 0.81s After: 0.04s Rel %: -95% Helped-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- ref-filter.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index cffd8bf3ce..e2fea6d635 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate, /* Otherwise, we don't know; prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < 
cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_INFINITY; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + parse_commit_or_die(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
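The cutoff logic in this patch can be tried in isolation. The sketch below uses toy stand-ins for Git's commit and commit list (struct toy_commit, struct toy_commit_list, min_generation_of() and below_cutoff() are all hypothetical names, not Git's types or functions) and leaves out parsing and the contains cache entirely. It shows only the two steps the diff adds: take the smallest generation number among the wanted commits, then reject any candidate whose generation number is strictly below it.

```c
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/* Toy stand-ins for struct commit and struct commit_list. */
struct toy_commit {
	uint32_t generation;
};

struct toy_commit_list {
	struct toy_commit *item;
	struct toy_commit_list *next;
};

/*
 * Smallest generation number among the wanted commits.
 * An empty list leaves the cutoff at INFINITY.
 */
static uint32_t min_generation_of(const struct toy_commit_list *want)
{
	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
	const struct toy_commit_list *p;

	for (p = want; p; p = p->next)
		if (p->item->generation < cutoff)
			cutoff = p->item->generation;
	return cutoff;
}

/* 1 if the candidate provably cannot reach any wanted commit. */
static int below_cutoff(const struct toy_commit *candidate, uint32_t cutoff)
{
	return candidate->generation < cutoff;
}
```

If every wanted commit is outside the commit-graph file, each has generation _INFINITY, the cutoff stays at _INFINITY, and any candidate with a known generation number is rejected immediately. If the candidates are also at _INFINITY, the comparison never fires and the walk proceeds as before.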
* Re: [PATCH v3 5/9] ref-filter: use generation number for --contains 2018-04-17 17:00 ` [PATCH v3 5/9] ref-filter: use generation number for --contains Derrick Stolee @ 2018-04-18 21:02 ` Jakub Narebski 2018-04-23 14:22 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-18 21:02 UTC (permalink / raw) To: Derrick Stolee Cc: git, peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy Here I can offer only the cursory examination, as I don't know this area of code in question. Derrick Stolee <dstolee@microsoft.com> writes: > A commit A can reach a commit B only if the generation number of A > is larger than the generation number of B. This condition allows > significantly short-circuiting commit-graph walks. > > Use generation number for 'git tag --contains' queries. > > On a copy of the Linux repository where HEAD is containd in v4.13 > but no earlier tag, the command 'git tag --contains HEAD' had the > following peformance improvement: > > Before: 0.81s > After: 0.04s > Rel %: -95% A question: what is the performance after if the "commit-graph" feature is disabled, or there is no commit-graph file? Is there performance regression in this case, or is the difference negligible? > > Helped-by: Jeff King <peff@peff.net> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > ref-filter.c | 23 +++++++++++++++++++---- > 1 file changed, 19 insertions(+), 4 deletions(-) > > diff --git a/ref-filter.c b/ref-filter.c > index cffd8bf3ce..e2fea6d635 100644 > --- a/ref-filter.c > +++ b/ref-filter.c > @@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) > /* > * Test whether the candidate or one of its parents is contained in the list. 
                                 ^^^^^^^^^^^^^^^^^^^^^

Sidenote: when examining the code after the change, I noticed that
the above part of the comment header for the contains_test() function is
no longer entirely correct, as the function only checks the candidate
commit and nowhere accesses its parents.

But that is not your problem.

>   * Do not recurse to find out, though, but return -1 if inconclusive.
>   */
>  static enum contains_result contains_test(struct commit *candidate,
>  					  const struct commit_list *want,
> -					  struct contains_cache *cache)
> +					  struct contains_cache *cache,
> +					  uint32_t cutoff)
>  {
>  	enum contains_result *cached = contains_cache_at(cache, candidate);
>
> @@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate,
>
>  	/* Otherwise, we don't know; prepare to recurse */
>  	parse_commit_or_die(candidate);
> +
> +	if (candidate->generation < cutoff)
> +		return CONTAINS_NO;
> +

Looks good to me.

The only [minor] question may be whether to define a separate type for
generation numbers, and whether to future-proof the tests, though the
latter would almost certainly be overengineering, and the former
probably too.

>  	return CONTAINS_UNKNOWN;
>  }
>
> @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  					      struct contains_cache *cache)
>  {
>  	struct contains_stack contains_stack = { 0, 0, NULL };
> -	enum contains_result result = contains_test(candidate, want, cache);
> +	enum contains_result result;
> +	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
> +	const struct commit_list *p;
> +
> +	for (p = want; p; p = p->next) {
> +		struct commit *c = p->item;
> +		parse_commit_or_die(c);
> +		if (c->generation < cutoff)
> +			cutoff = c->generation;
> +	}

Shouldn't the above be made conditional on the ability to get generation
numbers from the commit-graph file (feature is turned on and file
exists)?  Otherwise, after the change, contains_tag_algo() now parses
each commit in 'want', which I think was not done previously.

With a commit-graph file, parsing is [probably] cheap.  Without it, not
necessarily.

But I might be worrying about nothing.

>
> +	result = contains_test(candidate, want, cache, cutoff);

Other than the question about a possible performance regression if
commit-graph data is not available, it looks good to me.

>  	if (result != CONTAINS_UNKNOWN)
>  		return result;
>
> @@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  		 * If we just popped the stack, parents->item has been marked,
>  		 * therefore contains_test will return a meaningful yes/no.
>  		 */
> -		else switch (contains_test(parents->item, want, cache)) {
> +		else switch (contains_test(parents->item, want, cache, cutoff)) {
>  		case CONTAINS_YES:
>  			*contains_cache_at(cache, commit) = CONTAINS_YES;
>  			contains_stack.nr--;
> @@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  		}
>  	}
>  	free(contains_stack.contains_stack);
> -	return contains_test(candidate, want, cache);
> +	return contains_test(candidate, want, cache, cutoff);

Simple change.  It looks good to me.

>  }
>
>  static int commit_contains(struct ref_filter *filter, struct commit *commit,
* Re: [PATCH v3 5/9] ref-filter: use generation number for --contains 2018-04-18 21:02 ` Jakub Narebski @ 2018-04-23 14:22 ` Derrick Stolee 2018-04-24 18:56 ` Jakub Narebski 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-23 14:22 UTC (permalink / raw) To: Jakub Narebski, Derrick Stolee Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy On 4/18/2018 5:02 PM, Jakub Narebski wrote: > Here I can offer only the cursory examination, as I don't know this area > of code in question. > > Derrick Stolee <dstolee@microsoft.com> writes: > >> A commit A can reach a commit B only if the generation number of A >> is larger than the generation number of B. This condition allows >> significantly short-circuiting commit-graph walks. >> >> Use generation number for 'git tag --contains' queries. >> >> On a copy of the Linux repository where HEAD is containd in v4.13 >> but no earlier tag, the command 'git tag --contains HEAD' had the >> following peformance improvement: >> >> Before: 0.81s >> After: 0.04s >> Rel %: -95% > A question: what is the performance after if the "commit-graph" feature > is disabled, or there is no commit-graph file? Is there performance > regression in this case, or is the difference negligible? Negligible, since we are adding a small number of integer comparisons and the main cost is in commit parsing. More on commit parsing in response to your comments below. > >> Helped-by: Jeff King <peff@peff.net> >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> >> --- >> ref-filter.c | 23 +++++++++++++++++++---- >> 1 file changed, 19 insertions(+), 4 deletions(-) >> >> diff --git a/ref-filter.c b/ref-filter.c >> index cffd8bf3ce..e2fea6d635 100644 >> --- a/ref-filter.c >> +++ b/ref-filter.c >> @@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) >> /* >> * Test whether the candidate or one of its parents is contained in the list. 
> ^^^^^^^^^^^^^^^^^^^^^ > > Sidenote: when examining the code after the change, I have noticed that > the above part of commit header for the comtains_test() function is no > longer entirely correct, as the function only checks the candidate > commit, and in no place it access its parents. > > But that is not your problem. I'll add a commit in the next version that fixes this comment before I make any changes to the method. > >> * Do not recurse to find out, though, but return -1 if inconclusive. >> */ >> static enum contains_result contains_test(struct commit *candidate, >> const struct commit_list *want, >> - struct contains_cache *cache) >> + struct contains_cache *cache, >> + uint32_t cutoff) >> { >> enum contains_result *cached = contains_cache_at(cache, candidate); >> >> @@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate, >> >> /* Otherwise, we don't know; prepare to recurse */ >> parse_commit_or_die(candidate); >> + >> + if (candidate->generation < cutoff) >> + return CONTAINS_NO; >> + > Looks good to me. > > The only [minor] question may be whether to define separate type for > generation numbers, and whether to future proof the tests - though the > latter would be almost certainly overengineering, and the former > probablt too. If we have multiple notions of generation, then we can refactor all references to the "generation" member. 
> >> return CONTAINS_UNKNOWN; >> } >> >> @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, >> struct contains_cache *cache) >> { >> struct contains_stack contains_stack = { 0, 0, NULL }; >> - enum contains_result result = contains_test(candidate, want, cache); >> + enum contains_result result; >> + uint32_t cutoff = GENERATION_NUMBER_INFINITY; >> + const struct commit_list *p; >> + >> + for (p = want; p; p = p->next) { >> + struct commit *c = p->item; >> + parse_commit_or_die(c); >> + if (c->generation < cutoff) >> + cutoff = c->generation; >> + } > Sholdn't the above be made conditional on the ability to get generation > numbers from the commit-graph file (feature is turned on and file > exists)? Otherwise here after the change contains_tag_algo() now parses > each commit in 'want', which I think was not done previously. > > With commit-graph file parsing is [probably] cheap. Without it, not > necessary. > > But I might be worrying about nothing. Not nothing. This parses the "wants" when we previously did not parse the wants. Further: this parsing happens before we do the simple check of comparing the OID of the candidate against the wants. The question is: are these parsed commits significant compared to the walk that will parse many more commits? It is certainly possible. One way to fix this is to call 'prepare_commit_graph()' directly and then test that 'commit_graph' is non-null before performing any parses. I'm not thrilled with how that couples the commit-graph implementation to this feature, but that may be necessary to avoid regressions in the non-commit-graph case. > >> >> + result = contains_test(candidate, want, cache, cutoff); > Other than the question about possible performace regression if > commit-graph data is not available, it looks good to me. 
> >> if (result != CONTAINS_UNKNOWN) >> return result; >> >> @@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, >> * If we just popped the stack, parents->item has been marked, >> * therefore contains_test will return a meaningful yes/no. >> */ >> - else switch (contains_test(parents->item, want, cache)) { >> + else switch (contains_test(parents->item, want, cache, cutoff)) { >> case CONTAINS_YES: >> *contains_cache_at(cache, commit) = CONTAINS_YES; >> contains_stack.nr--; >> @@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, >> } >> } >> free(contains_stack.contains_stack); >> - return contains_test(candidate, want, cache); >> + return contains_test(candidate, want, cache, cutoff); > Simple change. It looks good to me. > >> } >> >> static int commit_contains(struct ref_filter *filter, struct commit *commit, ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v3 5/9] ref-filter: use generation number for --contains
From: Jakub Narebski @ 2018-04-24 18:56 UTC
To: Derrick Stolee
Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy

Derrick Stolee <stolee@gmail.com> writes:
> On 4/18/2018 5:02 PM, Jakub Narebski wrote:
>> Derrick Stolee <dstolee@microsoft.com> writes:
>>
>>> A commit A can reach a commit B only if the generation number of A
>>> is larger than the generation number of B. This condition allows
>>> significantly short-circuiting commit-graph walks.
>>>
>>> Use generation number for 'git tag --contains' queries.
>>>
>>> On a copy of the Linux repository where HEAD is containd in v4.13
>>> but no earlier tag, the command 'git tag --contains HEAD' had the
>>> following peformance improvement:
>>>
>>> Before: 0.81s
>>> After:  0.04s
>>> Rel %:  -95%
>>
>> A question: what is the performance after if the "commit-graph" feature
>> is disabled, or there is no commit-graph file?  Is there performance
>> regression in this case, or is the difference negligible?
>
> Negligible, since we are adding a small number of integer comparisons
> and the main cost is in commit parsing. More on commit parsing in
> response to your comments below.

If it is proven to be always negligible, then it's all right.  If it is
unlikely to be non-negligible, well, still O.K.  But I wonder if maybe
there is some situation where the cost of extra parsing is
non-negligible.

[...]
>>> @@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, >>> struct contains_cache *cache) >>> { >>> struct contains_stack contains_stack = { 0, 0, NULL }; >>> - enum contains_result result = contains_test(candidate, want, cache); >>> + enum contains_result result; >>> + uint32_t cutoff = GENERATION_NUMBER_INFINITY; >>> + const struct commit_list *p; >>> + >>> + for (p = want; p; p = p->next) { >>> + struct commit *c = p->item; >>> + parse_commit_or_die(c); >>> + if (c->generation < cutoff) >>> + cutoff = c->generation; >>> + } >> Sholdn't the above be made conditional on the ability to get generation >> numbers from the commit-graph file (feature is turned on and file >> exists)? Otherwise here after the change contains_tag_algo() now parses >> each commit in 'want', which I think was not done previously. >> >> With commit-graph file parsing is [probably] cheap. Without it, not >> necessary. >> >> But I might be worrying about nothing. > > Not nothing. This parses the "wants" when we previously did not parse > the wants. Further: this parsing happens before we do the simple check > of comparing the OID of the candidate against the wants. > > The question is: are these parsed commits significant compared to the > walk that will parse many more commits? It is certainly possible. > > One way to fix this is to call 'prepare_commit_graph()' directly and > then test that 'commit_graph' is non-null before performing any > parses. I'm not thrilled with how that couples the commit-graph > implementation to this feature, but that may be necessary to avoid > regressions in the non-commit-graph case. Another possible solution (not sure if better or worse) would be to change the signature of contains_tag_algo() function to take parameter or flag that would decide whether to parse "wants". Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v3 5/9] ref-filter: use generation number for --contains
From: Derrick Stolee @ 2018-04-25 14:11 UTC
To: Jakub Narebski
Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy

On 4/24/2018 2:56 PM, Jakub Narebski wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>> One way to fix this is to call 'prepare_commit_graph()' directly and
>> then test that 'commit_graph' is non-null before performing any
>> parses. I'm not thrilled with how that couples the commit-graph
>> implementation to this feature, but that may be necessary to avoid
>> regressions in the non-commit-graph case.
> Another possible solution (not sure if better or worse) would be to
> change the signature of contains_tag_algo() function to take parameter
> or flag that would decide whether to parse "wants".

If I reorder commits so "commit-graph: always load commit-graph
information" is before this one, then we can call
load_commit_graph_info(), which just fills in the generation and graph_pos
information. This will keep the coupling very light, instead of needing
to call prepare_commit_graph() or check the config setting.

Thanks,
-Stolee
* [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee ` (4 preceding siblings ...) 2018-04-17 17:00 ` [PATCH v3 5/9] ref-filter: use generation number for --contains Derrick Stolee @ 2018-04-17 17:00 ` Derrick Stolee 2018-04-18 22:15 ` Jakub Narebski 2018-04-17 17:00 ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee ` (4 subsequent siblings) 10 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw) To: git Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy, Derrick Stolee The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the target commits. Performance tests were run on a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag. 
Also, all tags were copied to branches and 'git branch --contains' was tested: Before: 60.0s After: 0.4s Rel %: -99.3% Reported-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index a44899c733..bceb79c419 100644 --- a/commit.c +++ b/commit.c @@ -1053,12 +1053,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (min_generation > reference[i]->generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return 0; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() 2018-04-17 17:00 ` [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() Derrick Stolee @ 2018-04-18 22:15 ` Jakub Narebski 2018-04-23 14:31 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-18 22:15 UTC (permalink / raw) To: Derrick Stolee Cc: git, peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy Derrick Stolee <dstolee@microsoft.com> writes: > The containment algorithm for 'git branch --contains' is different > from that for 'git tag --contains' in that it uses is_descendant_of() > instead of contains_tag_algo(). The expensive portion of the branch > algorithm is computing merge bases. > > When a commit-graph file exists with generation numbers computed, > we can avoid this merge-base calculation when the target commit has > a larger generation number than the target commits. You have "target" twice in above paragraph; one of those should probably be something else. > > Performance tests were run on a copy of the Linux repository where > HEAD is contained in v4.13 but no earlier tag. Also, all tags were > copied to branches and 'git branch --contains' was tested: > > Before: 60.0s > After: 0.4s > Rel %: -99.3% Nice... > > Reported-by: Jeff King <peff@peff.net> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > commit.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) ...especially for so small changes. 
> > diff --git a/commit.c b/commit.c > index a44899c733..bceb79c419 100644 > --- a/commit.c > +++ b/commit.c > @@ -1053,12 +1053,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * > { > struct commit_list *bases; > int ret = 0, i; > + uint32_t min_generation = GENERATION_NUMBER_INFINITY; > > if (parse_commit(commit)) > return ret; > - for (i = 0; i < nr_reference; i++) > + for (i = 0; i < nr_reference; i++) { > if (parse_commit(reference[i])) > return ret; > + if (min_generation > reference[i]->generation) > + min_generation = reference[i]->generation; > + } > + > + if (commit->generation > min_generation) > + return 0; Why not use "return ret;" instead of "return 0;", like the rest of the code [cryptically] does, that is: + if (commit->generation > min_generation) + return ret; > > bases = paint_down_to_common(commit, nr_reference, reference); > if (commit->object.flags & PARENT2) Otherwise, it looks good to me. ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v3 6/9] commit: use generation numbers for in_merge_bases()
From: Derrick Stolee @ 2018-04-23 14:31 UTC
To: Jakub Narebski, Derrick Stolee
Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy

On 4/18/2018 6:15 PM, Jakub Narebski wrote:
> Derrick Stolee <dstolee@microsoft.com> writes:
>
>> The containment algorithm for 'git branch --contains' is different
>> from that for 'git tag --contains' in that it uses is_descendant_of()
>> instead of contains_tag_algo(). The expensive portion of the branch
>> algorithm is computing merge bases.
>>
>> When a commit-graph file exists with generation numbers computed,
>> we can avoid this merge-base calculation when the target commit has
>> a larger generation number than the target commits.
> You have "target" twice in above paragraph; one of those should probably
> be something else.

Thanks. Second "target" should be "initial".

> [...]
>> +
>> +	if (commit->generation > min_generation)
>> +		return 0;
> Why not use "return ret;" instead of "return 0;", like the rest of the
> code [cryptically] does, that is:
>
>   +	if (commit->generation > min_generation)
>   +		return ret;

Sure.

Thanks,
-Stolee
* [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee ` (5 preceding siblings ...) 2018-04-17 17:00 ` [PATCH v3 6/9] commit: use generation numbers for in_merge_bases() Derrick Stolee @ 2018-04-17 17:00 ` Derrick Stolee 2018-04-18 23:19 ` Jakub Narebski 2018-04-19 8:32 ` Jakub Narebski 2018-04-17 17:00 ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee ` (3 subsequent siblings) 10 siblings, 2 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw) To: git Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy, Derrick Stolee When running 'git branch --contains', the in_merge_bases_many() method calls paint_down_to_common() to discover if a specific commit is reachable from a set of branches. Commits with lower generation number are not needed to correctly answer the containment query of in_merge_bases_many(). Add a new parameter, min_generation, to paint_down_to_common() that prevents walking commits with generation number strictly less than min_generation. If 0 is given, then there is no functional change. For in_merge_bases_many(), we can pass commit->generation as the cutoff, and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. 
For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index bceb79c419..a70f120878 100644 --- a/commit.c +++ b/commit.c @@ -805,11 +805,14 @@ static int queue_has_nonstale(struct prio_queue *queue) } /* all input commits in one and twos[] must have been parsed! */ -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) +static struct commit_list *paint_down_to_common(struct commit *one, int n, + struct commit **twos, + int min_generation) { struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; + uint32_t last_gen = GENERATION_NUMBER_INFINITY; one->object.flags |= PARENT1; if (!n) { @@ -828,6 +831,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc struct commit_list *parents; int flags; + if (commit->generation > last_gen) + BUG("bad generation skip"); + last_gen = commit->generation; + + if (commit->generation < min_generation) + break; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(&list); @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if 
(array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return 0; - bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) ret = 1; clear_commit_marks(commit, all_flags); -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
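Editor's sketch (not part of the patch): the effect of the min_generation cutoff can be illustrated on a toy graph. Everything here is hypothetical and simplified: struct toy_commit, toy_reaches() and toy_demo() are illustration-only names, the walk is a plain one-sided reachability check rather than the two-sided PARENT1/PARENT2 painting done by paint_down_to_common(), and a linear scan stands in for Git's prio_queue. It only assumes what the commit message states: commits are popped in decreasing generation order, and the walk stops at the first commit whose generation is strictly below the cutoff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PARENTS 2
#define TOY_MAX_QUEUE 16

/* Hypothetical toy commit; this is not Git's struct commit. */
struct toy_commit {
	uint32_t generation;                        /* 1 + max generation of parents */
	struct toy_commit *parents[TOY_MAX_PARENTS];
	int queued;                                 /* set once the commit was queued */
};

/* Remove and return the queued commit with the largest generation. */
static struct toy_commit *toy_pop_max(struct toy_commit **queue, size_t *nr)
{
	size_t i, best = 0;
	struct toy_commit *c;

	for (i = 1; i < *nr; i++)
		if (queue[i]->generation > queue[best]->generation)
			best = i;
	c = queue[best];
	queue[best] = queue[--(*nr)];
	return c;
}

/* Queue a commit once; NULL parents are ignored. */
static void toy_push(struct toy_commit **queue, size_t *nr, struct toy_commit *c)
{
	if (c && !c->queued) {
		assert(*nr < TOY_MAX_QUEUE);
		c->queued = 1;
		queue[(*nr)++] = c;
	}
}

/*
 * Walk down from tip in decreasing generation order and report whether
 * target is reachable. The walk stops as soon as the highest remaining
 * generation is strictly below min_generation; a cutoff of 0 never
 * triggers. *visited counts the commits actually examined.
 */
static int toy_reaches(struct toy_commit *target, struct toy_commit *tip,
		       uint32_t min_generation, size_t *visited)
{
	struct toy_commit *queue[TOY_MAX_QUEUE];
	size_t nr = 0;
	int i;

	*visited = 0;
	toy_push(queue, &nr, tip);
	while (nr) {
		struct toy_commit *c = toy_pop_max(queue, &nr);

		if (c->generation < min_generation)
			break;
		(*visited)++;
		if (c == target)
			return 1;
		for (i = 0; i < TOY_MAX_PARENTS; i++)
			toy_push(queue, &nr, c->parents[i]);
	}
	return 0;
}

/*
 * History: root(1) <- a(2) <- tip(3) and root(1) <- side(2).
 * Asks whether side is reachable from tip (it is not).
 */
static int toy_demo(uint32_t min_generation, size_t *visited)
{
	struct toy_commit root = { 1, { NULL, NULL }, 0 };
	struct toy_commit a = { 2, { &root, NULL }, 0 };
	struct toy_commit tip = { 3, { &a, NULL }, 0 };
	struct toy_commit side = { 2, { &root, NULL }, 0 };

	return toy_reaches(&side, &tip, min_generation, visited);
}
```

In this toy history the answer is "not reachable" either way, but with the cutoff set to the target's generation the walk never examines the root commit, which is the kind of walking "around" the commit that the patch aims to avoid.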
* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() 2018-04-17 17:00 ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee @ 2018-04-18 23:19 ` Jakub Narebski 2018-04-23 14:40 ` Derrick Stolee 2018-04-19 8:32 ` Jakub Narebski 1 sibling, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-18 23:19 UTC (permalink / raw) To: Derrick Stolee Cc: git, peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy Derrick Stolee <dstolee@microsoft.com> writes: > When running 'git branch --contains', the in_merge_bases_many() > method calls paint_down_to_common() to discover if a specific > commit is reachable from a set of branches. Commits with lower > generation number are not needed to correctly answer the > containment query of in_merge_bases_many(). Right. This description is not entirely clear to me, but I don't have a better proposal. Good enough, I guess. > > Add a new parameter, min_generation, to paint_down_to_common() that > prevents walking commits with generation number strictly less than > min_generation. If 0 is given, then there is no functional change. Is this new parameter really needed, i.e. do you really need to change the signature of this function? See below for details. > > For in_merge_bases_many(), we can pass commit->generation as the > cutoff,... This is the only callsite that uses min_generation with a non-zero value, and it uses commit->generation to fill it... while commit itself is one of the existing parameters. > [...], and this saves time during 'git branch --contains' queries > that would otherwise walk "around" the commit we are inspecting. If I understand the code properly, what happens is that we can now short-circuit if all commits that are left are lower than the target commit. 
This is because a max-order priority queue is used: if the commit with the maximum generation number is below the generation number of the target commit, then the target commit is not reachable from any commit in the priority queue (all of which have generation numbers less than or equal to that of the commit at the head of the queue, i.e. all are at the same level or deeper); compare what I have written in [1] [1]: https://public-inbox.org/git/866052dkju.fsf@gmail.com/ Do I have that right? If so, it looks all right to me. > > For a copy of the Linux repository, where HEAD is checked out at > v4.13~100, we get the following performance improvement for > 'git branch --contains' over the previous commit: > > Before: 0.21s > After: 0.13s > Rel %: -38% Nice. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > commit.c | 18 ++++++++++++++---- > 1 file changed, 14 insertions(+), 4 deletions(-) > > diff --git a/commit.c b/commit.c > index bceb79c419..a70f120878 100644 > --- a/commit.c > +++ b/commit.c > @@ -805,11 +805,14 @@ static int queue_has_nonstale(struct prio_queue *queue) > } > > /* all input commits in one and twos[] must have been parsed! */ > -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) > +static struct commit_list *paint_down_to_common(struct commit *one, int n, > + struct commit **twos, > + int min_generation) > { > struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; > struct commit_list *result = NULL; > int i; > + uint32_t last_gen = GENERATION_NUMBER_INFINITY; Do we really need to change the signature of paint_down_to_common(), or would it be enough to create a local variable min_generation set initially to one->generation? 
+ uint32_t min_generation = one->generation; + uint32_t last_gen = GENERATION_NUMBER_INFINITY; > > one->object.flags |= PARENT1; > if (!n) { > @@ -828,6 +831,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc > struct commit_list *parents; > int flags; > > + if (commit->generation > last_gen) > + BUG("bad generation skip"); > + last_gen = commit->generation; > + > + if (commit->generation < min_generation) > + break; > + I think, after looking at the whole post-image code, that it is all right. > flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); > if (flags == (PARENT1 | PARENT2)) { > if (!(commit->object.flags & RESULT)) { > @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co > return NULL; > } > > - list = paint_down_to_common(one, n, twos); > + list = paint_down_to_common(one, n, twos, 0); > > while (list) { > struct commit *commit = pop_commit(&list); > @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) > filled_index[filled] = j; > work[filled++] = array[j]; > } > - common = paint_down_to_common(array[i], filled, work); > + common = paint_down_to_common(array[i], filled, work, 0); > if (array[i]->object.flags & PARENT2) > redundant[i] = 1; > for (j = 0; j < filled; j++) > @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * > if (commit->generation > min_generation) > return 0; > > - bases = paint_down_to_common(commit, nr_reference, reference); > + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); Is it the only case where we would call paint_down_to_common() with non-zero last parameter? Would we always use commit->generation where commit is the first parameter of paint_down_to_common()? If both are true and will remain true, then in my humble opinion it is not necessary to change the signature of this function. 
> if (commit->object.flags & PARENT2) > ret = 1; > clear_commit_marks(commit, all_flags); ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() 2018-04-18 23:19 ` Jakub Narebski @ 2018-04-23 14:40 ` Derrick Stolee 2018-04-23 21:38 ` Jakub Narebski 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-23 14:40 UTC (permalink / raw) To: Jakub Narebski, Derrick Stolee Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy On 4/18/2018 7:19 PM, Jakub Narebski wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > [...] >> [...], and this saves time during 'git branch --contains' queries >> that would otherwise walk "around" the commit we are inspecting. > If I understand the code properly, what happens is that we can now > short-circuit if all commits that are left are lower than the target > commit. > > This is because max-order priority queue is used: if the commit with > maximum generation number is below generation number of target commit, > then target commit is not reachable from any commit in the priority > queue (all of which has generation number less or equal than the commit > at head of queue, i.e. all are same level or deeper); compare what I > have written in [1] > > [1]: https://public-inbox.org/git/866052dkju.fsf@gmail.com/ > > Do I have that right? If so, it looks all right to me. Yes, the priority queue needs to compare via generation number first or there will be errors. This is why we could not use commit time before. > >> For a copy of the Linux repository, where HEAD is checked out at >> v4.13~100, we get the following performance improvement for >> 'git branch --contains' over the previous commit: >> >> Before: 0.21s >> After: 0.13s >> Rel %: -38% > [...] 
>> flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); >> if (flags == (PARENT1 | PARENT2)) { >> if (!(commit->object.flags & RESULT)) { >> @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co >> return NULL; >> } >> >> - list = paint_down_to_common(one, n, twos); >> + list = paint_down_to_common(one, n, twos, 0); >> >> while (list) { >> struct commit *commit = pop_commit(&list); >> @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) >> filled_index[filled] = j; >> work[filled++] = array[j]; >> } >> - common = paint_down_to_common(array[i], filled, work); >> + common = paint_down_to_common(array[i], filled, work, 0); >> if (array[i]->object.flags & PARENT2) >> redundant[i] = 1; >> for (j = 0; j < filled; j++) >> @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * >> if (commit->generation > min_generation) >> return 0; >> >> - bases = paint_down_to_common(commit, nr_reference, reference); >> + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); > Is it the only case where we would call paint_down_to_common() with > non-zero last parameter? Would we always use commit->generation where > commit is the first parameter of paint_down_to_common()? > > If both are true and will remain true, then in my humble opinion it is > not necessary to change the signature of this function. We need to change the signature some way, but maybe the way I chose is not the best. To elaborate: paint_down_to_common() is used for multiple purposes. The caller here that supplies 'commit->generation' is used only to compute reachability (by testing if the flag PARENT2 exists on the commit, then clears all flags). The other callers expect the full walk down to the common commits, and keeps those PARENT1, PARENT2, and STALE flags for future use (such as reporting merge bases). 
Usually the call to paint_down_to_common() is followed by a revision walk that only halts when reaching root commits or commits with both PARENT1 and PARENT2 flags on, so always short-circuiting on generations would break the functionality; this is confirmed by t5318-commit-graph.sh. An alternative to the signature change is to add a boolean parameter, "use_cutoff" or something similar, that specifies "don't walk beyond the commit". This may give a clearer description of what it will do with the generation value, but since we are already performing generation comparisons before calling paint_down_to_common() I find this simple enough. Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() 2018-04-23 14:40 ` Derrick Stolee @ 2018-04-23 21:38 ` Jakub Narebski 2018-04-24 12:31 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-23 21:38 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy Derrick Stolee <stolee@gmail.com> writes: > On 4/18/2018 7:19 PM, Jakub Narebski wrote: >> Derrick Stolee <dstolee@microsoft.com> writes: >> > [...] >>> [...], and this saves time during 'git branch --contains' queries >>> that would otherwise walk "around" the commit we are inspecting. >>> >> If I understand the code properly, what happens is that we can now >> short-circuit if all commits that are left are lower than the target >> commit. >> >> This is because max-order priority queue is used: if the commit with >> maximum generation number is below generation number of target commit, >> then target commit is not reachable from any commit in the priority >> queue (all of which has generation number less or equal than the commit >> at head of queue, i.e. all are same level or deeper); compare what I >> have written in [1] >> >> [1]: https://public-inbox.org/git/866052dkju.fsf@gmail.com/ >> >> Do I have that right? If so, it looks all right to me. > > Yes, the priority queue needs to compare via generation number first > or there will be errors. This is why we could not use commit time > before. I was more concerned about getting right the order in the priority queue (does it return minimal or maximal generation number). I understand that the cutoff could not be used without generation numbers because of the possibility of clock skew - using cutoff on dates could lead to wrong results. 
>>> For a copy of the Linux repository, where HEAD is checked out at >>> v4.13~100, we get the following performance improvement for >>> 'git branch --contains' over the previous commit: >>> >>> Before: 0.21s >>> After: 0.13s >>> Rel %: -38% >> [...] >>> flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); >>> if (flags == (PARENT1 | PARENT2)) { >>> if (!(commit->object.flags & RESULT)) { >>> @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co >>> return NULL; >>> } >>> - list = paint_down_to_common(one, n, twos); >>> + list = paint_down_to_common(one, n, twos, 0); >>> while (list) { >>> struct commit *commit = pop_commit(&list); >>> @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) >>> filled_index[filled] = j; >>> work[filled++] = array[j]; >>> } >>> - common = paint_down_to_common(array[i], filled, work); >>> + common = paint_down_to_common(array[i], filled, work, 0); >>> if (array[i]->object.flags & PARENT2) >>> redundant[i] = 1; >>> for (j = 0; j < filled; j++) >>> @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * >>> if (commit->generation > min_generation) >>> return 0; >>> - bases = paint_down_to_common(commit, nr_reference, reference); >>> + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); >> >> Is it the only case where we would call paint_down_to_common() with >> non-zero last parameter? Would we always use commit->generation where >> commit is the first parameter of paint_down_to_common()? >> >> If both are true and will remain true, then in my humble opinion it is >> not necessary to change the signature of this function. > > We need to change the signature some way, but maybe the way I chose is > not the best. No, after taking longer I think the new signature is a good choice. > To elaborate: paint_down_to_common() is used for multiple > purposes. 
The caller here that supplies 'commit->generation' is used > only to compute reachability (by testing if the flag PARENT2 exists on > the commit, then clears all flags). The other callers expect the full > walk down to the common commits, and keeps those PARENT1, PARENT2, and > STALE flags for future use (such as reporting merge bases). Usually > the call to paint_down_to_common() is followed by a revision walk that > only halts when reaching root commits or commits with both PARENT1 and > PARENT2 flags on, so always short-circuiting on generations would > break the functionality; this is confirmed by the > t5318-commit-graph.sh. Right. I realized that just after sending the email. I'm sorry about this. > > An alternative to the signature change is to add a boolean parameter > "use_cutoff" or something, that specifies "don't walk beyond the > commit". This may give a more of a clear description of what it will > do with the generation value, but since we are already performing > generation comparisons before calling paint_down_to_common() I find > this simple enough. Two things: 1. The signature proposed in the patch is more generic. The cutoff does not need to be equal to the generation number of the commit, though currently it always is (in the one place where the new mechanism is used). So now I think the new signature of paint_down_to_common() is all right as it is proposed here. 2. Given the way generation numbers are defined (with 0 being a special case, and generation numbers starting from 1 for parent-less commits), and the way they are compared (using strict comparison, to avoid having to special-case _ZERO, _MAX and _INFINITY generation numbers), a cutoff of 0 means no cutoff. On the other hand, a cutoff of 0 can also be understood as a special case meaning "no cutoff". It could be made clearer by using (as I proposed elsewhere in this thread) a symbolic name for this no-cutoff case, via a preprocessor constant or enum, e.g. 
GENERATION_NO_CUTOFF: @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, GENERATION_NO_CUTOFF); while (list) { struct commit *commit = pop_commit(&list); @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, GENERATION_NO_CUTOFF); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) But whether it makes code more readable, or less readable, is a matter of opinion and taste. Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() 2018-04-23 21:38 ` Jakub Narebski @ 2018-04-24 12:31 ` Derrick Stolee 0 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-24 12:31 UTC (permalink / raw) To: Jakub Narebski Cc: Derrick Stolee, git, peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy On 4/23/2018 5:38 PM, Jakub Narebski wrote: > Derrick Stolee <stolee@gmail.com> writes: > >> On 4/18/2018 7:19 PM, Jakub Narebski wrote: >>> Derrick Stolee <dstolee@microsoft.com> writes: >>> >> [...] >>>> [...], and this saves time during 'git branch --contains' queries >>>> that would otherwise walk "around" the commit we are inspecting. >>>> >>> If I understand the code properly, what happens is that we can now >>> short-circuit if all commits that are left are lower than the target >>> commit. >>> >>> This is because max-order priority queue is used: if the commit with >>> maximum generation number is below generation number of target commit, >>> then target commit is not reachable from any commit in the priority >>> queue (all of which has generation number less or equal than the commit >>> at head of queue, i.e. all are same level or deeper); compare what I >>> have written in [1] >>> >>> [1]: https://public-inbox.org/git/866052dkju.fsf@gmail.com/ >>> >>> Do I have that right? If so, it looks all right to me. >> Yes, the priority queue needs to compare via generation number first >> or there will be errors. This is why we could not use commit time >> before. > I was more concerned about getting right the order in the priority queue > (does it return minimal or maximal generation number). > > I understand that the cutoff could not be used without generation > numbers because of the possibility of clock skew - using cutoff on dates > could lead to wrong results. 
Maximal generation number is important so we do not visit commits multiple times (say, once with PARENT1 set, and a second time when PARENT2 is set). A minimal generation number order would create a DFS order and walk until the cutoff every time. In cases without clock skew, maximal generation number order will walk the same set of commits as maximal commit time; the order may differ, but only between incomparable commits. >>>> For a copy of the Linux repository, where HEAD is checked out at >>>> v4.13~100, we get the following performance improvement for >>>> 'git branch --contains' over the previous commit: >>>> >>>> Before: 0.21s >>>> After: 0.13s >>>> Rel %: -38% >>> [...] >>>> flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); >>>> if (flags == (PARENT1 | PARENT2)) { >>>> if (!(commit->object.flags & RESULT)) { >>>> @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co >>>> return NULL; >>>> } >>>> - list = paint_down_to_common(one, n, twos); >>>> + list = paint_down_to_common(one, n, twos, 0); >>>> while (list) { >>>> struct commit *commit = pop_commit(&list); >>>> @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) >>>> filled_index[filled] = j; >>>> work[filled++] = array[j]; >>>> } >>>> - common = paint_down_to_common(array[i], filled, work); >>>> + common = paint_down_to_common(array[i], filled, work, 0); >>>> if (array[i]->object.flags & PARENT2) >>>> redundant[i] = 1; >>>> for (j = 0; j < filled; j++) >>>> @@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * >>>> if (commit->generation > min_generation) >>>> return 0; >>>> - bases = paint_down_to_common(commit, nr_reference, reference); >>>> + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); >>> Is it the only case where we would call paint_down_to_common() with >>> non-zero last parameter? 
Would we always use commit->generation where >>> commit is the first parameter of paint_down_to_common()? >>> >>> If both are true and will remain true, then in my humble opinion it is >>> not necessary to change the signature of this function. >> We need to change the signature some way, but maybe the way I chose is >> not the best. > No, after taking longer I think the new signature is a good choice. > >> To elaborate: paint_down_to_common() is used for multiple >> purposes. The caller here that supplies 'commit->generation' is used >> only to compute reachability (by testing if the flag PARENT2 exists on >> the commit, then clears all flags). The other callers expect the full >> walk down to the common commits, and keeps those PARENT1, PARENT2, and >> STALE flags for future use (such as reporting merge bases). Usually >> the call to paint_down_to_common() is followed by a revision walk that >> only halts when reaching root commits or commits with both PARENT1 and >> PARENT2 flags on, so always short-circuiting on generations would >> break the functionality; this is confirmed by the >> t5318-commit-graph.sh. > Right. > > I have realized that just after sending the email. I'm sorry about this. > >> An alternative to the signature change is to add a boolean parameter >> "use_cutoff" or something, that specifies "don't walk beyond the >> commit". This may give a more of a clear description of what it will >> do with the generation value, but since we are already performing >> generation comparisons before calling paint_down_to_common() I find >> this simple enough. > Two things: > > 1. The signature proposed in the patch is more generic. The cutoff does > not need to be equal to the generation number of the commit, though > currently it always (all of one time the new mechanism is used) is. > > So now I think the new signature of paint_down_to_common() is all > right as it is proposed here. > > 2. 
The way generation numbers are defined (with 0 being a special case, > and generation numbers starting from 1 for parent-less commits), and > the way they are compared (using strict comparison, to avoid having > to special-case _ZERO, _MAX and _INFINITY generation numbers) the > cutoff of 0 means no cutoff. > > On the other hand cutoff of 0 can be understood as meaning no cutoff > as a special case. > > It could be made more clear to use (as I proposed elsewhere in this > thread) symbolic name for this no-cutoff case via preprocessor > constants or enums, e.g. GENERATION_NO_CUTOFF: > > @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co > return NULL; > } > - list = paint_down_to_common(one, n, twos); > + list = paint_down_to_common(one, n, twos, GENERATION_NO_CUTOFF); > while (list) { > struct commit *commit = pop_commit(&list); > @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) > filled_index[filled] = j; > work[filled++] = array[j]; > } > - common = paint_down_to_common(array[i], filled, work); > + common = paint_down_to_common(array[i], filled, work, GENERATION_NO_CUTOFF); > if (array[i]->object.flags & PARENT2) > redundant[i] = 1; > for (j = 0; j < filled; j++) > > > But whether it makes code more readable, or less readable, is a > matter of opinion and taste. > Since paint_down_to_common() is static to this file, I think 0 is cleaner. If the method was external and used by other .c files, then I would use this macro trick to clarify "what does this zero parameter mean?". Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
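Editor's sketch (not part of the thread): the ordering discussed in this reply, maximal generation first with commit date only as a tie-breaker, can be written as an ordinary comparator. The patch names the real comparator compare_commits_by_gen_then_commit_date, but its body is not shown in this part of the thread, so what follows is an assumption based on that name. struct demo_commit, demo_compare_gen_then_date() and demo_sort_for_walk() are hypothetical, and qsort() stands in for Git's priority queue.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the fields the ordering looks at. */
struct demo_commit {
	const char *name;
	uint32_t generation;
	uint64_t date;
};

/*
 * Order by generation number, highest first; break ties by commit
 * date, newest first. Returns <0 when a should be visited before b.
 */
static int demo_compare_gen_then_date(const void *a_, const void *b_)
{
	const struct demo_commit *a = a_;
	const struct demo_commit *b = b_;

	if (a->generation > b->generation)
		return -1;
	if (a->generation < b->generation)
		return 1;
	if (a->date > b->date)
		return -1;
	if (a->date < b->date)
		return 1;
	return 0;
}

/* Sort commits into the order a max-generation-first walk would pop them. */
static void demo_sort_for_walk(struct demo_commit *commits, size_t nr)
{
	qsort(commits, nr, sizeof(*commits), demo_compare_gen_then_date);
}
```

With clock skew, a child commit can carry an older date than its parent. Because the generation number is compared first, the child still sorts ahead of the parent, which is the property the cutoff relies on and the reason a date-only ordering could give wrong results.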
* Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() 2018-04-17 17:00 ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee 2018-04-18 23:19 ` Jakub Narebski @ 2018-04-19 8:32 ` Jakub Narebski 1 sibling, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-04-19 8:32 UTC (permalink / raw) To: Derrick Stolee Cc: git, peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy Derrick Stolee <dstolee@microsoft.com> writes: > @@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co > return NULL; > } > > - list = paint_down_to_common(one, n, twos); > + list = paint_down_to_common(one, n, twos, 0); > > while (list) { > struct commit *commit = pop_commit(&list); > @@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt) > filled_index[filled] = j; > work[filled++] = array[j]; > } > - common = paint_down_to_common(array[i], filled, work); > + common = paint_down_to_common(array[i], filled, work, 0); > if (array[i]->object.flags & PARENT2) > redundant[i] = 1; > for (j = 0; j < filled; j++) Wouldn't it be better and more readable to create a symbolic name for this 0, for example: - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, GENERATION_NO_CUTOFF); Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH v3 8/9] commit-graph: always load commit-graph information 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee ` (6 preceding siblings ...) 2018-04-17 17:00 ` [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common() Derrick Stolee @ 2018-04-17 17:00 ` Derrick Stolee 2018-04-17 17:50 ` Derrick Stolee 2018-04-19 0:02 ` Jakub Narebski 2018-04-17 17:00 ` [PATCH v3 9/9] merge: check config before loading commits Derrick Stolee ` (2 subsequent siblings) 10 siblings, 2 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw) To: git Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy, Derrick Stolee Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. This avoids duplicate work when we already checked the graph in parse_commit_gently() or when simply checking the buffer contents in check_commit(). 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit-graph.c | 51 ++++++++++++++++++++++++++++++++------------------ commit-graph.h | 8 ++++++++ commit.c | 7 +++++-- commit.h | 2 +- object.c | 2 +- sha1_file.c | 2 +- 6 files changed, 49 insertions(+), 23 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 688d5b1801..21e853c21a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return &commit_list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} + static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { uint32_t edge_value; uint32_t *parent_data_ptr; uint64_t date_low, date_high; struct commit_list **pptr; - const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; item->object.parsed = 1; item->graph_pos = pos; @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin return 1; } +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) +{ + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { + *pos = item->graph_pos; + return 1; + } else { + return bsearch_graph(commit_graph, &(item->object.oid), pos); + } +} + int parse_commit_in_graph(struct commit *item) { + uint32_t pos; + + if (item->object.parsed) + return 0; if (!core_commit_graph) return 0; - if (item->object.parsed) - return 1; - prepare_commit_graph(); - if (commit_graph) { - uint32_t pos; - int found; - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { - pos = item->graph_pos; - found = 1; - } else { - found = bsearch_graph(commit_graph, 
&(item->object.oid), &pos); - } - - if (found) - return fill_commit_in_graph(item, commit_graph, pos); - } - + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) + return fill_commit_in_graph(item, commit_graph, pos); return 0; } +void load_commit_graph_info(struct commit *item) +{ + uint32_t pos; + if (!core_commit_graph) + return; + prepare_commit_graph(); + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) + fill_commit_graph_info(item, commit_graph, pos); +} + static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) { struct object_id oid; diff --git a/commit-graph.h b/commit-graph.h index 260a468e73..96cccb10f3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly + * checked and filled. Fill the graph_pos and generation members of + * the given commit. 
+ */ +void load_commit_graph_info(struct commit *item); + struct tree *get_commit_tree_in_graph(const struct commit *c); struct commit_graph { diff --git a/commit.c b/commit.c index a70f120878..9ef6f699bd 100644 --- a/commit.c +++ b/commit.c @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) return ret; } -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size) +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph) { const char *tail = buffer; const char *bufptr = buffer; @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s } item->date = parse_commit_date(bufptr, tail); + if (check_graph) + load_commit_graph_info(item); + return 0; } @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return error("Object %s not a commit", oid_to_hex(&item->object.oid)); } - ret = parse_commit_buffer(item, buffer, size); + ret = parse_commit_buffer(item, buffer, size, 0); if (save_commit_buffer && !ret) { set_commit_buffer(item, buffer, size); return 0; diff --git a/commit.h b/commit.h index 64436ff44e..b5afde1ae9 100644 --- a/commit.h +++ b/commit.h @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); */ struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size); +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); int parse_commit_gently(struct commit *item, int quiet_on_missing); static inline int parse_commit(struct commit *item) { diff --git a/object.c b/object.c index e6ad3f61f0..efe4871325 100644 --- a/object.c +++ b/object.c @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type } else if (type == OBJ_COMMIT) { struct commit 
*commit = lookup_commit(oid); if (commit) { - if (parse_commit_buffer(commit, buffer, size)) + if (parse_commit_buffer(commit, buffer, size, 1)) return NULL; if (!get_cached_commit_buffer(commit, NULL)) { set_commit_buffer(commit, buffer, size); diff --git a/sha1_file.c b/sha1_file.c index 1b94f39c4c..0fd4f0b8b6 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size) { struct commit c; memset(&c, 0, sizeof(c)); - if (parse_commit_buffer(&c, buf, size)) + if (parse_commit_buffer(&c, buf, size, 0)) die("corrupt commit"); } -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
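For readers following the fill_commit_graph_info() hunk above, here is a minimal standalone sketch of the decoding step it performs. It is not the patch's code: read_be32() stands in for Git's get_be32(), sketch_generation() is a hypothetical name, and the row layout (tree OID, two 4-byte parent positions, then a 4-byte word whose upper 30 bits hold the generation) is assumed from the offset `hash_len + 8` and the `>> 2` shift used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value; a stand-in for Git's get_be32(). */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: decode the generation number from one row of
 * commit data. Assumed layout: hash_len bytes of tree OID, 8 bytes of
 * parent positions, then a 4-byte word whose low 2 bits belong to the
 * commit date and whose upper 30 bits are the generation number.
 */
static uint32_t sketch_generation(const unsigned char *row, size_t hash_len)
{
	assert(row != NULL);
	return read_be32(row + hash_len + 8) >> 2;
}
```

With a 20-byte hash, the word sits at bytes 28..31 of a 36-byte row, which matches the `hash_len + 16` row width that the patch replaces with GRAPH_DATA_WIDTH.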
* Re: [PATCH v3 8/9] commit-graph: always load commit-graph information 2018-04-17 17:00 ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee @ 2018-04-17 17:50 ` Derrick Stolee 2018-04-19 0:02 ` Jakub Narebski 1 sibling, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-17 17:50 UTC (permalink / raw) To: Derrick Stolee, git Cc: peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy On 4/17/2018 1:00 PM, Derrick Stolee wrote: > Most code paths load commits using lookup_commit() and then > parse_commit(). In some cases, including some branch lookups, the commit > is parsed using parse_object_buffer() which side-steps parse_commit() in > favor of parse_commit_buffer(). > > With generation numbers in the commit-graph, we need to ensure that any > commit that exists in the commit-graph file has its generation number > loaded. > > Create new load_commit_graph_info() method to fill in the information > for a commit that exists only in the commit-graph file. Call it from > parse_commit_buffer() after loading the other commit information from > the given buffer. Only fill this information when specified by the > 'check_graph' parameter. This avoids duplicate work when we already > checked the graph in parse_commit_gently() or when simply checking the > buffer contents in check_commit(). 
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > commit-graph.c | 51 ++++++++++++++++++++++++++++++++------------------ > commit-graph.h | 8 ++++++++ > commit.c | 7 +++++-- > commit.h | 2 +- > object.c | 2 +- > sha1_file.c | 2 +- > 6 files changed, 49 insertions(+), 23 deletions(-) > > diff --git a/commit-graph.c b/commit-graph.c > index 688d5b1801..21e853c21a 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, > return &commit_list_insert(c, pptr)->next; > } > > +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) > +{ > + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; > + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; > +} > + > static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) > { > uint32_t edge_value; > uint32_t *parent_data_ptr; > uint64_t date_low, date_high; > struct commit_list **pptr; > - const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; > + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; > > item->object.parsed = 1; > item->graph_pos = pos; > @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin > return 1; > } > > +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) > +{ > + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { > + *pos = item->graph_pos; > + return 1; > + } else { > + return bsearch_graph(commit_graph, &(item->object.oid), pos); The reference to 'commit_graph' in the above line should be 'g'. Sorry! 
> + } > +} > + > int parse_commit_in_graph(struct commit *item) > { > + uint32_t pos; > + > + if (item->object.parsed) > + return 0; > if (!core_commit_graph) > return 0; > - if (item->object.parsed) > - return 1; > - > prepare_commit_graph(); > - if (commit_graph) { > - uint32_t pos; > - int found; > - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { > - pos = item->graph_pos; > - found = 1; > - } else { > - found = bsearch_graph(commit_graph, &(item->object.oid), &pos); > - } > - > - if (found) > - return fill_commit_in_graph(item, commit_graph, pos); > - } > - > + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) > + return fill_commit_in_graph(item, commit_graph, pos); > return 0; > } > > +void load_commit_graph_info(struct commit *item) > +{ > + uint32_t pos; > + if (!core_commit_graph) > + return; > + prepare_commit_graph(); > + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) > + fill_commit_graph_info(item, commit_graph, pos); > +} > + > static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) > { > struct object_id oid; > diff --git a/commit-graph.h b/commit-graph.h > index 260a468e73..96cccb10f3 100644 > --- a/commit-graph.h > +++ b/commit-graph.h > @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); > */ > int parse_commit_in_graph(struct commit *item); > > +/* > + * It is possible that we loaded commit contents from the commit buffer, > + * but we also want to ensure the commit-graph content is correctly > + * checked and filled. Fill the graph_pos and generation members of > + * the given commit. 
> + */ > +void load_commit_graph_info(struct commit *item); > + > struct tree *get_commit_tree_in_graph(const struct commit *c); > > struct commit_graph { > diff --git a/commit.c b/commit.c > index a70f120878..9ef6f699bd 100644 > --- a/commit.c > +++ b/commit.c > @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) > return ret; > } > > -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size) > +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph) > { > const char *tail = buffer; > const char *bufptr = buffer; > @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s > } > item->date = parse_commit_date(bufptr, tail); > > + if (check_graph) > + load_commit_graph_info(item); > + > return 0; > } > > @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) > return error("Object %s not a commit", > oid_to_hex(&item->object.oid)); > } > - ret = parse_commit_buffer(item, buffer, size); > + ret = parse_commit_buffer(item, buffer, size, 0); > if (save_commit_buffer && !ret) { > set_commit_buffer(item, buffer, size); > return 0; > diff --git a/commit.h b/commit.h > index 64436ff44e..b5afde1ae9 100644 > --- a/commit.h > +++ b/commit.h > @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); > */ > struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); > > -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size); > +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); > int parse_commit_gently(struct commit *item, int quiet_on_missing); > static inline int parse_commit(struct commit *item) > { > diff --git a/object.c b/object.c > index e6ad3f61f0..efe4871325 100644 > --- a/object.c > +++ b/object.c > @@ -207,7 +207,7 @@ struct object 
*parse_object_buffer(const struct object_id *oid, enum object_type > } else if (type == OBJ_COMMIT) { > struct commit *commit = lookup_commit(oid); > if (commit) { > - if (parse_commit_buffer(commit, buffer, size)) > + if (parse_commit_buffer(commit, buffer, size, 1)) > return NULL; > if (!get_cached_commit_buffer(commit, NULL)) { > set_commit_buffer(commit, buffer, size); > diff --git a/sha1_file.c b/sha1_file.c > index 1b94f39c4c..0fd4f0b8b6 100644 > --- a/sha1_file.c > +++ b/sha1_file.c > @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size) > { > struct commit c; > memset(&c, 0, sizeof(c)); > - if (parse_commit_buffer(&c, buf, size)) > + if (parse_commit_buffer(&c, buf, size, 0)) > die("corrupt commit"); > } > ^ permalink raw reply [flat|nested] 162+ messages in thread
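To make the correction above concrete, here is a small standalone sketch of the lookup helper using its `g` parameter rather than a global. All names (sketch_graph, sketch_commit, sketch_find_commit, the 4-byte hash) are hypothetical stand-ins, not Git's types; the sketch only assumes what the patch shows, namely that a cached graph position is reused and that a binary search over sorted OIDs is the fallback.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NOT_FROM_GRAPH 0xFFFFFFFFu
#define SKETCH_HASH_LEN 4

/* Hypothetical, simplified graph: num_commits OIDs in sorted order. */
struct sketch_graph {
	const unsigned char *oids;
	uint32_t num_commits;
};

struct sketch_commit {
	unsigned char oid[SKETCH_HASH_LEN];
	uint32_t graph_pos;
};

/* Binary search over the sorted OID table of the graph passed in. */
static int sketch_bsearch(const struct sketch_graph *g,
			  const unsigned char *oid, uint32_t *pos)
{
	uint32_t lo = 0, hi = g->num_commits;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, g->oids + (size_t)mid * SKETCH_HASH_LEN,
				 SKETCH_HASH_LEN);
		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}

/* Reuse a cached position if present; otherwise search the graph 'g'. */
static int sketch_find_commit(const struct sketch_commit *item,
			      const struct sketch_graph *g, uint32_t *pos)
{
	assert(item && g && pos);
	if (item->graph_pos != SKETCH_NOT_FROM_GRAPH) {
		*pos = item->graph_pos;
		return 1;
	}
	return sketch_bsearch(g, item->oid, pos);
}
```

Passing the graph explicitly keeps the helper usable with any graph the caller holds, which is presumably why the signature takes `g` in the first place.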
* Re: [PATCH v3 8/9] commit-graph: always load commit-graph information 2018-04-17 17:00 ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee 2018-04-17 17:50 ` Derrick Stolee @ 2018-04-19 0:02 ` Jakub Narebski 2018-04-23 14:49 ` Derrick Stolee 1 sibling, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-19 0:02 UTC (permalink / raw) To: Derrick Stolee Cc: git, peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy Derrick Stolee <dstolee@microsoft.com> writes: > Most code paths load commits using lookup_commit() and then > parse_commit(). In some cases, including some branch lookups, the commit > is parsed using parse_object_buffer() which side-steps parse_commit() in > favor of parse_commit_buffer(). > > With generation numbers in the commit-graph, we need to ensure that any > commit that exists in the commit-graph file has its generation number > loaded. All right, that is nice explanation of the why behind this change. > > Create new load_commit_graph_info() method to fill in the information > for a commit that exists only in the commit-graph file. Call it from > parse_commit_buffer() after loading the other commit information from > the given buffer. Only fill this information when specified by the > 'check_graph' parameter. This avoids duplicate work when we already > checked the graph in parse_commit_gently() or when simply checking the > buffer contents in check_commit(). Couldn't this 'check_graph' parameter be a global variable similar to the 'commit_graph' variable? Maybe I am not understanding it. 
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > commit-graph.c | 51 ++++++++++++++++++++++++++++++++------------------ > commit-graph.h | 8 ++++++++ > commit.c | 7 +++++-- > commit.h | 2 +- > object.c | 2 +- > sha1_file.c | 2 +- > 6 files changed, 49 insertions(+), 23 deletions(-) > > diff --git a/commit-graph.c b/commit-graph.c > index 688d5b1801..21e853c21a 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, > return &commit_list_insert(c, pptr)->next; > } > > +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) > +{ > + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; > + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; > +} > + > static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) > { > uint32_t edge_value; > uint32_t *parent_data_ptr; > uint64_t date_low, date_high; > struct commit_list **pptr; > - const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; > + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; I'm probably wrong, but isn't it unrelated change? > > item->object.parsed = 1; > item->graph_pos = pos; > @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin > return 1; > } > > +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) > +{ > + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { > + *pos = item->graph_pos; > + return 1; > + } else { > + return bsearch_graph(commit_graph, &(item->object.oid), pos); > + } > +} All right (after the fix). > + > int parse_commit_in_graph(struct commit *item) > { > + uint32_t pos; > + > + if (item->object.parsed) > + return 0; > if (!core_commit_graph) > return 0; > - if (item->object.parsed) > - return 1; Hmmm... 
previously the function returned 1 if item->object.parsed, now it returns 0 for this situation. I don't understand this change. > - > prepare_commit_graph(); > - if (commit_graph) { > - uint32_t pos; > - int found; > - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { > - pos = item->graph_pos; > - found = 1; > - } else { > - found = bsearch_graph(commit_graph, &(item->object.oid), &pos); > - } > - > - if (found) > - return fill_commit_in_graph(item, commit_graph, pos); > - } > - > + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) > + return fill_commit_in_graph(item, commit_graph, pos); Nice refactoring. > return 0; > } > > +void load_commit_graph_info(struct commit *item) > +{ > + uint32_t pos; > + if (!core_commit_graph) > + return; > + prepare_commit_graph(); > + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) > + fill_commit_graph_info(item, commit_graph, pos); > +} And the reason for the refactoring. > + > static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) > { > struct object_id oid; > diff --git a/commit-graph.h b/commit-graph.h > index 260a468e73..96cccb10f3 100644 > --- a/commit-graph.h > +++ b/commit-graph.h > @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); > */ > int parse_commit_in_graph(struct commit *item); > > +/* > + * It is possible that we loaded commit contents from the commit buffer, > + * but we also want to ensure the commit-graph content is correctly > + * checked and filled. Fill the graph_pos and generation members of > + * the given commit. 
> + */ > +void load_commit_graph_info(struct commit *item); > + > struct tree *get_commit_tree_in_graph(const struct commit *c); > > struct commit_graph { > diff --git a/commit.c b/commit.c > index a70f120878..9ef6f699bd 100644 > --- a/commit.c > +++ b/commit.c > @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) > return ret; > } > > -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size) > +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph) > { > const char *tail = buffer; > const char *bufptr = buffer; > @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s > } > item->date = parse_commit_date(bufptr, tail); > > + if (check_graph) > + load_commit_graph_info(item); > + > return 0; > } > > @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) > return error("Object %s not a commit", > oid_to_hex(&item->object.oid)); > } > - ret = parse_commit_buffer(item, buffer, size); > + ret = parse_commit_buffer(item, buffer, size, 0); > if (save_commit_buffer && !ret) { > set_commit_buffer(item, buffer, size); > return 0; > diff --git a/commit.h b/commit.h > index 64436ff44e..b5afde1ae9 100644 > --- a/commit.h > +++ b/commit.h > @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); > */ > struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); > > -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size); > +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); > int parse_commit_gently(struct commit *item, int quiet_on_missing); > static inline int parse_commit(struct commit *item) > { > diff --git a/object.c b/object.c > index e6ad3f61f0..efe4871325 100644 > --- a/object.c > +++ b/object.c > @@ -207,7 +207,7 @@ struct object 
*parse_object_buffer(const struct object_id *oid, enum object_type > } else if (type == OBJ_COMMIT) { > struct commit *commit = lookup_commit(oid); > if (commit) { > - if (parse_commit_buffer(commit, buffer, size)) > + if (parse_commit_buffer(commit, buffer, size, 1)) > return NULL; > if (!get_cached_commit_buffer(commit, NULL)) { > set_commit_buffer(commit, buffer, size); > diff --git a/sha1_file.c b/sha1_file.c > index 1b94f39c4c..0fd4f0b8b6 100644 > --- a/sha1_file.c > +++ b/sha1_file.c > @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size) > { > struct commit c; > memset(&c, 0, sizeof(c)); > - if (parse_commit_buffer(&c, buf, size)) > + if (parse_commit_buffer(&c, buf, size, 0)) > die("corrupt commit"); > } ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v3 8/9] commit-graph: always load commit-graph information 2018-04-19 0:02 ` Jakub Narebski @ 2018-04-23 14:49 ` Derrick Stolee 0 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-23 14:49 UTC (permalink / raw) To: Jakub Narebski, Derrick Stolee Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy On 4/18/2018 8:02 PM, Jakub Narebski wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> Most code paths load commits using lookup_commit() and then >> parse_commit(). In some cases, including some branch lookups, the commit >> is parsed using parse_object_buffer() which side-steps parse_commit() in >> favor of parse_commit_buffer(). >> >> With generation numbers in the commit-graph, we need to ensure that any >> commit that exists in the commit-graph file has its generation number >> loaded. > All right, that is nice explanation of the why behind this change. > >> Create new load_commit_graph_info() method to fill in the information >> for a commit that exists only in the commit-graph file. Call it from >> parse_commit_buffer() after loading the other commit information from >> the given buffer. Only fill this information when specified by the >> 'check_graph' parameter. This avoids duplicate work when we already >> checked the graph in parse_commit_gently() or when simply checking the >> buffer contents in check_commit(). > Couldn't this 'check_graph' parameter be a global variable similar to > the 'commit_graph' variable? Maybe I am not understanding it. See the two callers at the bottom of the patch. They have different purposes: one needs to fill in a valid commit struct, the other needs to check the commit buffer is valid (then throws away the struct). They have different values for 'check_graph'. 
Also, in parse_commit_gently() we check parse_commit_in_graph() before we call parse_commit_buffer, so we do not want to repeat work; in the case of a valid commit-graph file, but the commit is not in the commit-graph, we would repeat our binary search for the same commit. > >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> >> --- >> commit-graph.c | 51 ++++++++++++++++++++++++++++++++------------------ >> commit-graph.h | 8 ++++++++ >> commit.c | 7 +++++-- >> commit.h | 2 +- >> object.c | 2 +- >> sha1_file.c | 2 +- >> 6 files changed, 49 insertions(+), 23 deletions(-) >> >> diff --git a/commit-graph.c b/commit-graph.c >> index 688d5b1801..21e853c21a 100644 >> --- a/commit-graph.c >> +++ b/commit-graph.c >> @@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, >> return &commit_list_insert(c, pptr)->next; >> } >> >> +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) >> +{ >> + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; >> + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; >> +} >> + >> static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) >> { >> uint32_t edge_value; >> uint32_t *parent_data_ptr; >> uint64_t date_low, date_high; >> struct commit_list **pptr; >> - const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; >> + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; > I'm probably wrong, but isn't it unrelated change? You're right. I saw this while I was in here, and there was a similar comment on this change in a different patch. Probably best to keep these cleanup things in a separate commit. 
>> item->object.parsed = 1; >> item->graph_pos = pos; >> @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin >> return 1; >> } >> >> +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) >> +{ >> + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { >> + *pos = item->graph_pos; >> + return 1; >> + } else { >> + return bsearch_graph(commit_graph, &(item->object.oid), pos); >> + } >> +} > All right (after the fix). > >> + >> int parse_commit_in_graph(struct commit *item) >> { >> + uint32_t pos; >> + >> + if (item->object.parsed) >> + return 0; >> if (!core_commit_graph) >> return 0; >> - if (item->object.parsed) >> - return 1; > Hmmm... previously the function returned 1 if item->object.parsed, now > it returns 0 for this situation. I don't understand this change. The good news is that this change is unimportant (the only caller is parse_commit_gently() which checks item->object.parsed before calling parse_commit_in_graph()). I wonder why I reordered those things, anyway. I'll revert to simplify the patch. > >> - >> prepare_commit_graph(); >> - if (commit_graph) { >> - uint32_t pos; >> - int found; >> - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { >> - pos = item->graph_pos; >> - found = 1; >> - } else { >> - found = bsearch_graph(commit_graph, &(item->object.oid), &pos); >> - } >> - >> - if (found) >> - return fill_commit_in_graph(item, commit_graph, pos); >> - } >> - >> + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) >> + return fill_commit_in_graph(item, commit_graph, pos); > Nice refactoring. > >> return 0; >> } >> >> +void load_commit_graph_info(struct commit *item) >> +{ >> + uint32_t pos; >> + if (!core_commit_graph) >> + return; >> + prepare_commit_graph(); >> + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) >> + fill_commit_graph_info(item, commit_graph, pos); >> +} > And the reason for the refactoring. 
> >> + >> static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) >> { >> struct object_id oid; >> diff --git a/commit-graph.h b/commit-graph.h >> index 260a468e73..96cccb10f3 100644 >> --- a/commit-graph.h >> +++ b/commit-graph.h >> @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); >> */ >> int parse_commit_in_graph(struct commit *item); >> >> +/* >> + * It is possible that we loaded commit contents from the commit buffer, >> + * but we also want to ensure the commit-graph content is correctly >> + * checked and filled. Fill the graph_pos and generation members of >> + * the given commit. >> + */ >> +void load_commit_graph_info(struct commit *item); >> + >> struct tree *get_commit_tree_in_graph(const struct commit *c); >> >> struct commit_graph { >> diff --git a/commit.c b/commit.c >> index a70f120878..9ef6f699bd 100644 >> --- a/commit.c >> +++ b/commit.c >> @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) >> return ret; >> } >> >> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size) >> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph) >> { >> const char *tail = buffer; >> const char *bufptr = buffer; >> @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s >> } >> item->date = parse_commit_date(bufptr, tail); >> >> + if (check_graph) >> + load_commit_graph_info(item); >> + >> return 0; >> } >> >> @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) >> return error("Object %s not a commit", >> oid_to_hex(&item->object.oid)); >> } >> - ret = parse_commit_buffer(item, buffer, size); >> + ret = parse_commit_buffer(item, buffer, size, 0); >> if (save_commit_buffer && !ret) { >> set_commit_buffer(item, buffer, size); >> return 0; >> diff --git a/commit.h b/commit.h >> index 64436ff44e..b5afde1ae9 100644 >> 
--- a/commit.h >> +++ b/commit.h >> @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); >> */ >> struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); >> >> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size); >> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); >> int parse_commit_gently(struct commit *item, int quiet_on_missing); >> static inline int parse_commit(struct commit *item) >> { >> diff --git a/object.c b/object.c >> index e6ad3f61f0..efe4871325 100644 >> --- a/object.c >> +++ b/object.c >> @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type >> } else if (type == OBJ_COMMIT) { >> struct commit *commit = lookup_commit(oid); >> if (commit) { >> - if (parse_commit_buffer(commit, buffer, size)) >> + if (parse_commit_buffer(commit, buffer, size, 1)) >> return NULL; >> if (!get_cached_commit_buffer(commit, NULL)) { >> set_commit_buffer(commit, buffer, size); >> diff --git a/sha1_file.c b/sha1_file.c >> index 1b94f39c4c..0fd4f0b8b6 100644 >> --- a/sha1_file.c >> +++ b/sha1_file.c >> @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size) >> { >> struct commit c; >> memset(&c, 0, sizeof(c)); >> - if (parse_commit_buffer(&c, buf, size)) >> + if (parse_commit_buffer(&c, buf, size, 0)) >> die("corrupt commit"); >> } ^ permalink raw reply [flat|nested] 162+ messages in thread
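The argument for a per-call 'check_graph' parameter can be illustrated with a toy model that only counts graph lookups. Everything here is a hypothetical stand-in (the sketch_* functions do no real parsing, and the lookup always misses); it assumes just the call structure described in this reply: parse_commit_gently() consults the graph first and then passes 0, while parse_object_buffer() passes 1.

```c
#include <assert.h>

static int sketch_graph_lookups; /* counts simulated commit-graph searches */

/* Stand-in for a commit-graph search that does not find the commit. */
static int sketch_lookup_in_graph(void)
{
	sketch_graph_lookups++;
	return 0;
}

/* Stand-in for parse_commit_buffer(): consult the graph only if asked. */
static int sketch_parse_buffer(int check_graph)
{
	assert(check_graph == 0 || check_graph == 1);
	if (check_graph)
		sketch_lookup_in_graph();
	return 0;
}

/* Stand-in for parse_commit_gently(): graph first, then the buffer. */
static int sketch_parse_gently(void)
{
	if (sketch_lookup_in_graph())
		return 0;
	/* The graph was already searched, so do not search it again. */
	return sketch_parse_buffer(0);
}

/* Stand-in for parse_object_buffer(): the buffer path is the only path. */
static int sketch_parse_object(void)
{
	return sketch_parse_buffer(1);
}
```

In this model each entry point performs exactly one lookup for a commit missing from the graph; a single global flag set to "check" would make the sketch_parse_gently() path search twice.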
* [PATCH v3 9/9] merge: check config before loading commits 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee ` (7 preceding siblings ...) 2018-04-17 17:00 ` [PATCH v3 8/9] commit-graph: always load commit-graph information Derrick Stolee @ 2018-04-17 17:00 ` Derrick Stolee 2018-04-19 0:04 ` [PATCH v3 0/9] Compute and consume generation numbers Jakub Narebski 2018-04-25 14:37 ` [PATCH v4 00/10] " Derrick Stolee 10 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-17 17:00 UTC (permalink / raw) To: git Cc: peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy, Derrick Stolee Now that we use generation numbers from the commit-graph, we must ensure that all commits that exist in the commit-graph are loaded from that file instead of from the object database. Since the commit-graph file is only checked if core.commitGraph is true, we must check the default config before we load any commits. In the merge builtin, the config was checked after loading the HEAD commit. This was due to the use of the global 'branch' when checking merge-specific config settings. Move the config load to be between the initialization of 'branch' and the commit lookup. Without this change, a fast-forward merge would hit a BUG("bad generation skip") statement in commit.c during paint_down_to_common(). This is because the HEAD commit would be loaded with "infinite" generation but then reached by commits with "finite" generation numbers. Add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- builtin/merge.c | 5 +++-- t/t5318-commit-graph.sh | 9 +++++++++ 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/builtin/merge.c b/builtin/merge.c index 5e5e4497e3..7e1da6c6ea 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1148,13 +1148,14 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", &branch); + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(&head_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(&head_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index a380419b65..77d85aefe7 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 +test_expect_success 'perform fast-forward merge in full repo' ' + cd "$TRASH_DIRECTORY/full" && + git checkout -b merge-5-to-8 commits/5 && + git merge commits/8 && + git show-ref -s merge-5-to-8 >output && + git show-ref -s commits/8 >expect && + test_cmp expect output +' + test_done -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
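A simplified picture of the failure described in this commit message: if a walk is expected to visit commits in non-increasing generation order, a commit loaded without the graph (and so carrying the "infinite" generation) that is reached after commits with finite generations breaks that expectation. The sketch below is only an illustration of such an invariant check, not the code in commit.c; sketch_first_bad_skip() and SKETCH_GENERATION_INFINITY are hypothetical names, and the value chosen for infinity is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_GENERATION_INFINITY 0xFFFFFFFFu

/*
 * Given generation numbers in the order a walk visits commits, return
 * the index of the first commit whose generation is larger than its
 * predecessor's, or -1 if the order is non-increasing throughout.
 */
static int sketch_first_bad_skip(const uint32_t *gens, size_t n)
{
	uint32_t last = SKETCH_GENERATION_INFINITY;
	size_t i;

	assert(gens != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (gens[i] > last)
			return (int)i;
		last = gens[i];
	}
	return -1;
}
```

For example, a visit order with generations 8, 7, 6 followed by an "infinite" entry is flagged at index 3, which is the shape of the fast-forward case the new test exercises.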
* Re: [PATCH v3 0/9] Compute and consume generation numbers 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee ` (8 preceding siblings ...) 2018-04-17 17:00 ` [PATCH v3 9/9] merge: check config before loading commits Derrick Stolee @ 2018-04-19 0:04 ` Jakub Narebski 2018-04-23 14:54 ` Derrick Stolee 2018-04-25 14:37 ` [PATCH v4 00/10] " Derrick Stolee 10 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-19 0:04 UTC (permalink / raw) To: Derrick Stolee Cc: git, peff, stolee, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy Derrick Stolee <dstolee@microsoft.com> writes: > -- >8 -- > > This is the one of several "small" patches that follow the serialized > Git commit graph patch (ds/commit-graph) and lazy-loading trees > (ds/lazy-load-trees). > > As described in Documentation/technical/commit-graph.txt, the generation > number of a commit is one more than the maximum generation number among > its parents (trivially, a commit with no parents has generation number > one). This section is expanded to describe the interaction with special > generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph > file) and *_ZERO (commits in a commit-graph file written before generation > numbers were implemented). > > This series makes the computation of generation numbers part of the > commit-graph write process. > > Finally, generation numbers are used to order commits in the priority > queue in paint_down_to_common(). This allows a short-circuit mechanism > to improve performance of `git branch --contains`. > > Further, use generation numbers for 'git tag --contains', providing a > significant speedup (at least 95% for some cases). > > A more substantial refactoring of revision.c is required before making > 'git log --graph' use generation numbers effectively. > > This patch series is build on ds/lazy-load-trees. 
> > Derrick Stolee (9): > commit: add generation number to struct commmit Nice and short patch. Looks good to me. > commit-graph: compute generation numbers Another quite easy to understand patch. LGTM. > commit: use generations in paint_down_to_common() Nice and short patch; minor typo in a code comment. Otherwise it looks good to me. > commit-graph.txt: update design document I see that the diagram got removed in this version; maybe it could be replaced with a relationship table? Anyway, it looks good to me. > ref-filter: use generation number for --contains A question: what does performance look like after the change if the commit-graph is not available? > commit: use generation numbers for in_merge_bases() Possible typo in the commit message, and a stylistic inconsistency in in_merge_bases() - though it is actually clearer than the existing code. Short, simple, and gives good performance improvements. > commit: add short-circuit to paint_down_to_common() Looks good to me; ignore [mostly] what I have written in response to the patch in question. > commit-graph: always load commit-graph information Looks all right; question: parameter or one more global variable? > merge: check config before loading commits This looks good to me. > > Documentation/technical/commit-graph.txt | 30 +++++-- > alloc.c | 1 + > builtin/merge.c | 5 +- > commit-graph.c | 99 +++++++++++++++++++----- > commit-graph.h | 8 ++ > commit.c | 54 +++++++++++-- > commit.h | 7 +- > object.c | 2 +- > ref-filter.c | 23 +++++- > sha1_file.c | 2 +- > t/t5318-commit-graph.sh | 9 +++ > 11 files changed, 199 insertions(+), 41 deletions(-) > > > base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707 ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v3 0/9] Compute and consume generation numbers 2018-04-19 0:04 ` [PATCH v3 0/9] Compute and consume generation numbers Jakub Narebski @ 2018-04-23 14:54 ` Derrick Stolee 0 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-23 14:54 UTC (permalink / raw) To: Jakub Narebski, Derrick Stolee Cc: git, peff, avarab, sbeller, larsxschneider, bmwill, gitster, sunshine, jonathantanmy On 4/18/2018 8:04 PM, Jakub Narebski wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> -- >8 -- >> >> This is the one of several "small" patches that follow the serialized >> Git commit graph patch (ds/commit-graph) and lazy-loading trees >> (ds/lazy-load-trees). >> >> As described in Documentation/technical/commit-graph.txt, the generation >> number of a commit is one more than the maximum generation number among >> its parents (trivially, a commit with no parents has generation number >> one). This section is expanded to describe the interaction with special >> generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph >> file) and *_ZERO (commits in a commit-graph file written before generation >> numbers were implemented). >> >> This series makes the computation of generation numbers part of the >> commit-graph write process. >> >> Finally, generation numbers are used to order commits in the priority >> queue in paint_down_to_common(). This allows a short-circuit mechanism >> to improve performance of `git branch --contains`. >> >> Further, use generation numbers for 'git tag --contains', providing a >> significant speedup (at least 95% for some cases). >> >> A more substantial refactoring of revision.c is required before making >> 'git log --graph' use generation numbers effectively. >> >> This patch series is build on ds/lazy-load-trees. >> >> Derrick Stolee (9): >> commit: add generation number to struct commmit > Nice and short patch. Looks good to me. 
> >> commit-graph: compute generation numbers > Another quite easy to understand patch. LGTM. > >> commit: use generations in paint_down_to_common() > Nice and short patch; minor typo in a code comment. > Otherwise it looks good to me. > >> commit-graph.txt: update design document > I see that the diagram got removed in this version; maybe it could be > replaced with a relationship table? > > Anyway, it looks good to me. The diagrams and tables seemed to cause more confusion than clarity. I think the reader should build their own mental model from the definitions and description, and we should avoid trying to make a summary. > >> ref-filter: use generation number for --contains > A question: what does performance look like after the change if the commit-graph > is not available? The performance issue is minor, but will be fixed in v4. > >> commit: use generation numbers for in_merge_bases() > Possible typo in the commit message, and a stylistic inconsistency in > in_merge_bases() - though it is actually clearer than the existing code. > > Short, simple, and gives good performance improvements. > >> commit: add short-circuit to paint_down_to_common() > Looks good to me; ignore [mostly] what I have written in response to the > patch in question. > >> commit-graph: always load commit-graph information > Looks all right; question: parameter or one more global variable? I responded to say that the global variable approach is incorrect. The parameter is important to both functionality and performance. > >> merge: check config before loading commits > This looks good to me. 
> >> Documentation/technical/commit-graph.txt | 30 +++++-- >> alloc.c | 1 + >> builtin/merge.c | 5 +- >> commit-graph.c | 99 +++++++++++++++++++----- >> commit-graph.h | 8 ++ >> commit.c | 54 +++++++++++-- >> commit.h | 7 +- >> object.c | 2 +- >> ref-filter.c | 23 +++++- >> sha1_file.c | 2 +- >> t/t5318-commit-graph.sh | 9 +++ >> 11 files changed, 199 insertions(+), 41 deletions(-) >> >> >> base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707 ^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH v4 00/10] Compute and consume generation numbers 2018-04-17 17:00 ` [PATCH v3 0/9] Compute and consume generation numbers Derrick Stolee ` (9 preceding siblings ...) 2018-04-19 0:04 ` [PATCH v3 0/9] Compute and consume generation numbers Jakub Narebski @ 2018-04-25 14:37 ` Derrick Stolee 2018-04-25 14:37 ` [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list Derrick Stolee ` (11 more replies) 10 siblings, 12 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw) To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee Thanks for the feedback on the previous version. I think this series is stabilizing nicely. I'll reply to this message with an inter-diff, as it is small enough to share but would clutter this cover letter. Thanks, -Stolee -- >8 -- This is one of several "small" patches that follow the serialized Git commit graph patch (ds/commit-graph) and lazy-loading trees (ds/lazy-load-trees). As described in Documentation/technical/commit-graph.txt, the generation number of a commit is one more than the maximum generation number among its parents (trivially, a commit with no parents has generation number one). This section is expanded to describe the interaction with special generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph file) and *_ZERO (commits in a commit-graph file written before generation numbers were implemented). This series makes the computation of generation numbers part of the commit-graph write process. Finally, generation numbers are used to order commits in the priority queue in paint_down_to_common(). This allows a short-circuit mechanism to improve performance of `git branch --contains`. Further, use generation numbers for 'git tag --contains', providing a significant speedup (at least 95% for some cases). A more substantial refactoring of revision.c is required before making 'git log --graph' use generation numbers effectively. 
This patch series is built on ds/lazy-load-trees. Derrick Stolee (10): ref-filter: fix outdated comment on in_commit_list commit: add generation number to struct commmit commit-graph: compute generation numbers commit: use generations in paint_down_to_common() commit-graph: always load commit-graph information ref-filter: use generation number for --contains commit: use generation numbers for in_merge_bases() commit: add short-circuit to paint_down_to_common() merge: check config before loading commits commit-graph.txt: update design document Documentation/technical/commit-graph.txt | 30 ++++++-- alloc.c | 1 + builtin/merge.c | 7 +- commit-graph.c | 92 ++++++++++++++++++++---- commit-graph.h | 8 +++ commit.c | 54 +++++++++++--- commit.h | 7 +- object.c | 2 +- ref-filter.c | 26 +++++-- sha1_file.c | 2 +- t/t5318-commit-graph.sh | 9 +++ 11 files changed, 198 insertions(+), 40 deletions(-) base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707 -- 2.17.0.39.g685157f7fb ^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list 2018-04-25 14:37 ` [PATCH v4 00/10] " Derrick Stolee @ 2018-04-25 14:37 ` Derrick Stolee 2018-04-28 17:54 ` Jakub Narebski 2018-04-25 14:37 ` [PATCH v4 02/10] commit: add generation number to struct commmit Derrick Stolee ` (10 subsequent siblings) 11 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw) To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee The in_commit_list() method does not check the parents of the candidate for containment in the list. Fix the comment that incorrectly states that it does. Reported-by: Jakub Narebski <jnareb@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- ref-filter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ref-filter.c b/ref-filter.c index cffd8bf3ce..aff24d93be 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1582,7 +1582,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) } /* - * Test whether the candidate or one of its parents is contained in the list. + * Test whether the candidate is contained in the list. * Do not recurse to find out, though, but return -1 if inconclusive. */ static enum contains_result contains_test(struct commit *candidate, -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list 2018-04-25 14:37 ` [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list Derrick Stolee @ 2018-04-28 17:54 ` Jakub Narebski 0 siblings, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-04-28 17:54 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, gitster, peff, avarab Derrick Stolee <dstolee@microsoft.com> writes: > The in_commit_list() method does not check the parents of > the candidate for containment in the list. Fix the comment > that incorrectly states that it does. > > Reported-by: Jakub Narebski <jnareb@gmail.com> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > ref-filter.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/ref-filter.c b/ref-filter.c > index cffd8bf3ce..aff24d93be 100644 > --- a/ref-filter.c > +++ b/ref-filter.c > @@ -1582,7 +1582,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) > } > > /* > - * Test whether the candidate or one of its parents is contained in the list. > + * Test whether the candidate is contained in the list. > * Do not recurse to find out, though, but return -1 if inconclusive. > */ > static enum contains_result contains_test(struct commit *candidate, All right. Always good to have the comment and code match. FYI: the contains_test() function described in this comment checks only the candidate, and never accesses the candidate commit's parents. All recursion, which naturally includes checking parents, is in contains_tag_algo(). I guess that the code was refactored, but the comment was not updated. -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH v4 02/10] commit: add generation number to struct commmit 2018-04-25 14:37 ` [PATCH v4 00/10] " Derrick Stolee 2018-04-25 14:37 ` [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list Derrick Stolee @ 2018-04-25 14:37 ` Derrick Stolee 2018-04-28 22:35 ` Jakub Narebski 2018-04-25 14:37 ` [PATCH v4 03/10] commit-graph: compute generation numbers Derrick Stolee ` (9 subsequent siblings) 11 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw) To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use three special values to signal the generation number is invalid: GENERATION_NUMBER_INFINITY 0xFFFFFFFF GENERATION_NUMBER_MAX 0x3FFFFFFF GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_MAX) means the generation number is too large to store in the commit-graph file. The third (_ZERO) means the generation number was loaded from a commit graph file that was written by a version of git that did not support generation numbers. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- alloc.c | 1 + commit-graph.c | 2 ++ commit.h | 4 ++++ 3 files changed, 7 insertions(+) diff --git a/alloc.c b/alloc.c index cf4f8b61e1..e8ab14f4a1 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_INFINITY; return c; } diff --git a/commit-graph.c b/commit-graph.c index 70fa1b25fd..9ad21c3ffb 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + pptr = &item->parents; edge_value = get_be32(commit_data + g->hash_len); diff --git a/commit.h b/commit.h index 23a3f364ed..aac3b8c56f 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,9 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF +#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF +#define GENERATION_NUMBER_MAX 0x3FFFFFFF +#define GENERATION_NUMBER_ZERO 0 struct commit_list { struct commit *item; @@ -30,6 +33,7 @@ struct commit { */ struct tree *maybe_tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
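The recursive definition and the three special values from the commit message above can be sketched in isolation. This is only an illustration, not code from the series: the helper names generation_is_computed() and generation_from_parents() are hypothetical, and parent generations are passed as a plain array instead of being read from struct commit. Only the three GENERATION_NUMBER_* values are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the commit.h hunk in this patch. */
#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/*
 * Hypothetical helper: true when the value is a usable generation
 * number, i.e. neither "not loaded or computed" (_INFINITY) nor
 * "written by a git without generation numbers" (_ZERO).
 */
static int generation_is_computed(uint32_t generation)
{
	return generation != GENERATION_NUMBER_INFINITY &&
	       generation != GENERATION_NUMBER_ZERO;
}

/*
 * Hypothetical helper: apply the definition once. A commit with no
 * parents gets one; otherwise one more than the largest parent value.
 * Capping at GENERATION_NUMBER_MAX is a separate step and is not
 * done here.
 */
static uint32_t generation_from_parents(const uint32_t *parent_generations,
					size_t nr_parents)
{
	uint32_t max_generation = 0;
	size_t i;

	for (i = 0; i < nr_parents; i++) {
		/* The definition only applies once every parent is known. */
		assert(generation_is_computed(parent_generations[i]));
		if (parent_generations[i] > max_generation)
			max_generation = parent_generations[i];
	}
	return max_generation + 1;
}
```

With parent generations 3 and 7, generation_from_parents() yields 8, and an empty parent list yields 1, matching the two bullet points of the definition.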
* Re: [PATCH v4 02/10] commit: add generation number to struct commmit 2018-04-25 14:37 ` [PATCH v4 02/10] commit: add generation number to struct commmit Derrick Stolee @ 2018-04-28 22:35 ` Jakub Narebski 2018-04-30 12:05 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-28 22:35 UTC (permalink / raw) To: Derrick Stolee Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason Derrick Stolee <dstolee@microsoft.com> writes: > The generation number of a commit is defined recursively as follows: > > * If a commit A has no parents, then the generation number of A is one. > * If a commit A has parents, then the generation number of A is one > more than the maximum generation number among the parents of A. Very minor nitpick: it would be more readable wrapped differently: * If a commit A has parents, then the generation number of A is one more than the maximum generation number among parents of A. Very minor nitpick: possibly "parents", not "the parents", but I am not native English speaker. > > Add a uint32_t generation field to struct commit so we can pass this > information to revision walks. We use three special values to signal > the generation number is invalid: > > GENERATION_NUMBER_INFINITY 0xFFFFFFFF > GENERATION_NUMBER_MAX 0x3FFFFFFF > GENERATION_NUMBER_ZERO 0 > > The first (_INFINITY) means the generation number has not been loaded or > computed. The second (_MAX) means the generation number is too large to > store in the commit-graph file. The third (_ZERO) means the generation > number was loaded from a commit graph file that was written by a version > of git that did not support generation numbers. Good explanation; I wonder if we want to have it in some shortened form also in comments, and not only in the commit message. 
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > alloc.c | 1 + > commit-graph.c | 2 ++ > commit.h | 4 ++++ > 3 files changed, 7 insertions(+) I have reordered patches to make it easier to review. > diff --git a/commit.h b/commit.h > index 23a3f364ed..aac3b8c56f 100644 > --- a/commit.h > +++ b/commit.h > @@ -10,6 +10,9 @@ > #include "pretty.h" > > #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF > +#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF > +#define GENERATION_NUMBER_MAX 0x3FFFFFFF > +#define GENERATION_NUMBER_ZERO 0 I wonder if it wouldn't be good to have some short in-line comments explaining those constants, or a block comment above them. > > struct commit_list { > struct commit *item; > @@ -30,6 +33,7 @@ struct commit { > */ > struct tree *maybe_tree; > uint32_t graph_pos; > + uint32_t generation; > }; > > extern int save_commit_buffer; All right, simple addition of the new field. Nothing to go wrong here. Sidenote: With 0x7FFFFFFF being (if I am not wrong) maximum graph_pos and maximum number of nodes in commit graph, we won't hit 0x3FFFFFFF generation number limit for all except very, very linear histories. > > diff --git a/alloc.c b/alloc.c > index cf4f8b61e1..e8ab14f4a1 100644 > --- a/alloc.c > +++ b/alloc.c > @@ -94,6 +94,7 @@ void *alloc_commit_node(void) > c->object.type = OBJ_COMMIT; > c->index = alloc_commit_index(); > c->graph_pos = COMMIT_NOT_FROM_GRAPH; > + c->generation = GENERATION_NUMBER_INFINITY; > return c; > } All right, start with initializing it with "not from commit-graph" value after allocation. 
> > diff --git a/commit-graph.c b/commit-graph.c > index 70fa1b25fd..9ad21c3ffb 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin > date_low = get_be32(commit_data + g->hash_len + 12); > item->date = (timestamp_t)((date_high << 32) | date_low); > > + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; > + I guess we should not worry about these "magical constants" sprinkled here, like "+ 8" above. Let's examine how it goes, taking a look at commit-graph-format.txt in Documentation/technical/commit-graph-format.txt * The first H (g->hash_len) bytes are for the OID of the root tree. * The next 8 bytes are for the positions of the first two parents [...] So 'commit_data + g->hash_len + 8' is our offset from the start of commit data. All right. * The next 8 bytes store the generation number of the commit and the commit time in seconds since EPOCH. The generation number uses the higher 30 bits of the first 4 bytes. [...] The higher 30 bits of the 4 bytes, which is 32 bits, means that we need to shift 32-bit value 2 bits right, so that we get lower 30 bits of 32-bit value. All right. All 4-byte numbers are in network order. Shouldn't it be ntohl() to convert from network order to host order, and not get_be32()? I guess they are the same (network order is big-endian order), and get_be32() is what rest of git uses... Looks all right. > pptr = &item->parents; > > edge_value = get_be32(commit_data + g->hash_len); ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v4 02/10] commit: add generation number to struct commmit 2018-04-28 22:35 ` Jakub Narebski @ 2018-04-30 12:05 ` Derrick Stolee 0 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-30 12:05 UTC (permalink / raw) To: Jakub Narebski, Derrick Stolee Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason On 4/28/2018 6:35 PM, Jakub Narebski wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> The generation number of a commit is defined recursively as follows: >> >> * If a commit A has no parents, then the generation number of A is one. >> * If a commit A has parents, then the generation number of A is one >> more than the maximum generation number among the parents of A. > Very minor nitpick: it would be more readable wrapped differently: > > * If a commit A has parents, then the generation number of A is > one more than the maximum generation number among parents of A. > > Very minor nitpick: possibly "parents", not "the parents", but I am > not native English speaker. > >> Add a uint32_t generation field to struct commit so we can pass this >> information to revision walks. We use three special values to signal >> the generation number is invalid: >> >> GENERATION_NUMBER_INFINITY 0xFFFFFFFF >> GENERATION_NUMBER_MAX 0x3FFFFFFF >> GENERATION_NUMBER_ZERO 0 >> >> The first (_INFINITY) means the generation number has not been loaded or >> computed. The second (_MAX) means the generation number is too large to >> store in the commit-graph file. The third (_ZERO) means the generation >> number was loaded from a commit graph file that was written by a version >> of git that did not support generation numbers. > Good explanation; I wonder if we want to have it in some shortened form > also in comments, and not only in the commit message. 
> >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> >> --- >> alloc.c | 1 + >> commit-graph.c | 2 ++ >> commit.h | 4 ++++ >> 3 files changed, 7 insertions(+) > I have reordered patches to make it easier to review. > >> diff --git a/commit.h b/commit.h >> index 23a3f364ed..aac3b8c56f 100644 >> --- a/commit.h >> +++ b/commit.h >> @@ -10,6 +10,9 @@ >> #include "pretty.h" >> >> #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF >> +#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF >> +#define GENERATION_NUMBER_MAX 0x3FFFFFFF >> +#define GENERATION_NUMBER_ZERO 0 > I wonder if it wouldn't be good to have some short in-line comments > explaining those constants, or a block comment above them. > >> >> struct commit_list { >> struct commit *item; >> @@ -30,6 +33,7 @@ struct commit { >> */ >> struct tree *maybe_tree; >> uint32_t graph_pos; >> + uint32_t generation; >> }; >> >> extern int save_commit_buffer; > All right, simple addition of the new field. Nothing to go wrong here. > > Sidenote: With 0x7FFFFFFF being (if I am not wrong) maximum graph_pos > and maximum number of nodes in commit graph, we won't hit 0x3FFFFFFF > generation number limit for all except very, very linear histories. Both of these limits are far away from being realistic. But we could extend the maximum graph_pos independently from the maximum generation number now that we have the "capped" logic. > >> diff --git a/alloc.c b/alloc.c >> index cf4f8b61e1..e8ab14f4a1 100644 >> --- a/alloc.c >> +++ b/alloc.c >> @@ -94,6 +94,7 @@ void *alloc_commit_node(void) >> c->object.type = OBJ_COMMIT; >> c->index = alloc_commit_index(); >> c->graph_pos = COMMIT_NOT_FROM_GRAPH; >> + c->generation = GENERATION_NUMBER_INFINITY; >> return c; >> } > All right, start with initializing it with "not from commit-graph" value > after allocation. 
> >> >> diff --git a/commit-graph.c b/commit-graph.c >> index 70fa1b25fd..9ad21c3ffb 100644 >> --- a/commit-graph.c >> +++ b/commit-graph.c >> @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin >> date_low = get_be32(commit_data + g->hash_len + 12); >> item->date = (timestamp_t)((date_high << 32) | date_low); >> >> + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; >> + > I guess we should not worry about these "magical constants" sprinkled > here, like "+ 8" above. > > Let's examine how it goes, taking a look at commit-graph-format.txt > in Documentation/technical/commit-graph-format.txt > > * The first H (g->hash_len) bytes are for the OID of the root tree. > * The next 8 bytes are for the positions of the first two parents [...] > > So 'commit_data + g->hash_len + 8' is our offset from the start of > commit data. All right. > > * The next 8 bytes store the generation number of the commit and > the commit time in seconds since EPOCH. The generation number > uses the higher 30 bits of the first 4 bytes. [...] > > The higher 30 bits of the 4 bytes, which is 32 bits, means that we need > to shift 32-bit value 2 bits right, so that we get lower 30 bits of > 32-bit value. All right. > > All 4-byte numbers are in network order. > > Shouldn't it be ntohl() to convert from network order to host order, and > not get_be32()? I guess they are the same (network order is big-endian > order), and get_be32() is what rest of git uses... ntohl() takes a 32-bit value, while get_be32() takes a pointer. This makes pulling network-bytes out of streams much cleaner with get_be32(), so I try to use that whenever possible. ^ permalink raw reply [flat|nested] 162+ messages in thread
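The difference between the two approaches discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not git's implementation: read_be32(), read_be32_ntohl() and decode_generation() are hypothetical names, with read_be32() standing in for get_be32(). The offset (hash_len + 8) and the shift by 2 are the ones used in the quoted hunk.

```c
#include <arpa/inet.h>	/* ntohl(); POSIX */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for get_be32(): takes a pointer into the byte
 * stream and assembles the big-endian (network order) value directly.
 */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * The same read via ntohl(), which takes a 32-bit value rather than a
 * pointer, so the bytes must first be copied into a uint32_t.
 */
static uint32_t read_be32_ntohl(const unsigned char *p)
{
	uint32_t raw;

	memcpy(&raw, p, sizeof(raw));
	return ntohl(raw);
}

/*
 * The generation number occupies the upper 30 bits of the 4-byte word
 * that follows the root tree OID and the two parent positions, so the
 * word is shifted right by 2 to drop the low two bits.
 */
static uint32_t decode_generation(const unsigned char *commit_data,
				  size_t hash_len)
{
	return read_be32(commit_data + hash_len + 8) >> 2;
}
```

For a 20-byte hash, a record whose bytes 28..31 are 00 00 00 15 decodes to generation 5: the word is 0x15, and shifting out the low two bits leaves 5.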
* [PATCH v4 03/10] commit-graph: compute generation numbers 2018-04-25 14:37 ` [PATCH v4 00/10] " Derrick Stolee 2018-04-25 14:37 ` [PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list Derrick Stolee 2018-04-25 14:37 ` [PATCH v4 02/10] commit: add generation number to struct commmit Derrick Stolee @ 2018-04-25 14:37 ` Derrick Stolee 2018-04-26 2:35 ` Junio C Hamano 2018-04-29 9:08 ` Jakub Narebski 2018-04-25 14:37 ` [PATCH v4 04/10] commit: use generations in paint_down_to_common() Derrick Stolee ` (8 subsequent siblings) 11 siblings, 2 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw) To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy. The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file. If a computed generation number would exceed GENERATION_NUMBER_MAX, then use GENERATION_NUMBER_MAX instead. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit-graph.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 9ad21c3ffb..047fa9fca5 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + if ((*list)->generation != GENERATION_NUMBER_INFINITY) + packedDate[0] |= htonl((*list)->generation << 2); + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct commit** commits, + int nr_commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < nr_commits; i++) { + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && + commits[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits[i], &list); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, &list); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(&list); + } + + if (current->generation > GENERATION_NUMBER_MAX) + current->generation = GENERATION_NUMBER_MAX; + } + } +} + void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); + 
compute_generation_numbers(commits.list, commits.nr); + graph_name = get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(&lk, graph_name, 0); -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH v4 03/10] commit-graph: compute generation numbers 2018-04-25 14:37 ` [PATCH v4 03/10] commit-graph: compute generation numbers Derrick Stolee @ 2018-04-26 2:35 ` Junio C Hamano 2018-04-26 12:58 ` Derrick Stolee 2018-04-29 9:08 ` Jakub Narebski 1 sibling, 1 reply; 162+ messages in thread From: Junio C Hamano @ 2018-04-26 2:35 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, peff, jnareb, avarab Derrick Stolee <dstolee@microsoft.com> writes: > @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, > else > packedDate[0] = 0; > > + if ((*list)->generation != GENERATION_NUMBER_INFINITY) > + packedDate[0] |= htonl((*list)->generation << 2); > + > packedDate[1] = htonl((*list)->date); > hashwrite(f, packedDate, 8); The ones that have infinity are written as zero here. The code that reads the generation field off of a file in fill_commit_graph_info() and fill_commit_in_graph() both leave such a record in file as-is, so the reader of what we write out will think it is _ZERO, not _INF. Not that it matters, as it seems that most of the code being added by this series treat _ZERO and _INF more or less interchangeably. But it does raise another question, i.e. do we need both _ZERO and _INF, or is it sufficient to have just a single _UNKNOWN? 
> @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids) > } > } > > +static void compute_generation_numbers(struct commit** commits, > + int nr_commits) > +{ > + int i; > + struct commit_list *list = NULL; > + > + for (i = 0; i < nr_commits; i++) { > + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && > + commits[i]->generation != GENERATION_NUMBER_ZERO) > + continue; > + > + commit_list_insert(commits[i], &list); > + while (list) { > + struct commit *current = list->item; > + struct commit_list *parent; > + int all_parents_computed = 1; > + uint32_t max_generation = 0; > + > + for (parent = current->parents; parent; parent = parent->next) { > + if (parent->item->generation == GENERATION_NUMBER_INFINITY || > + parent->item->generation == GENERATION_NUMBER_ZERO) { > + all_parents_computed = 0; > + commit_list_insert(parent->item, &list); > + break; > + } else if (parent->item->generation > max_generation) { > + max_generation = parent->item->generation; > + } > + } > + > + if (all_parents_computed) { > + current->generation = max_generation + 1; > + pop_commit(&list); > + } If we haven't computed all parents' generations yet, current->generation is undefined (or at least "left as initialized"), so it does not make much sense to attempt to clip it at _MAX at this point. At least not yet. IOW, shouldn't the following two lines be inside the "we now know genno of all parents, so we can compute genno for commit" block above? 
> + if (current->generation > GENERATION_NUMBER_MAX) > + current->generation = GENERATION_NUMBER_MAX; > + } > + } > +} > + > void write_commit_graph(const char *obj_dir, > const char **pack_indexes, > int nr_packs, > @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir, > if (commits.nr >= GRAPH_PARENT_MISSING) > die(_("too many commits to write graph")); > > + compute_generation_numbers(commits.list, commits.nr); > + > graph_name = get_commit_graph_filename(obj_dir); > fd = hold_lock_file_for_update(&lk, graph_name, 0); ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v4 03/10] commit-graph: compute generation numbers 2018-04-26 2:35 ` Junio C Hamano @ 2018-04-26 12:58 ` Derrick Stolee 2018-04-26 13:49 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-26 12:58 UTC (permalink / raw) To: Junio C Hamano, Derrick Stolee; +Cc: git, peff, jnareb, avarab On 4/25/2018 10:35 PM, Junio C Hamano wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, >> else >> packedDate[0] = 0; >> >> + if ((*list)->generation != GENERATION_NUMBER_INFINITY) >> + packedDate[0] |= htonl((*list)->generation << 2); >> + >> packedDate[1] = htonl((*list)->date); >> hashwrite(f, packedDate, 8); > The ones that have infinity are written as zero here. The code that > reads the generation field off of a file in fill_commit_graph_info() > and fill_commit_in_graph() both leave such a record in file as-is, > so the reader of what we write out will think it is _ZERO, not _INF. > > Not that it matters, as it seems that most of the code being added > by this series treat _ZERO and _INF more or less interchangeably. > But it does raise another question, i.e. do we need both _ZERO and > _INF, or is it sufficient to have just a single _UNKNOWN? This code is confusing. The 'if' condition is useless, since at this point every commit should have a finite generation number (we computed generation numbers for all of them). We should just write the value always. For the sake of discussion, the value _INFINITY means not in the graph and _ZERO means in the graph without a computed generation number. It's a small distinction, but it gives a single boundary to use for reachability queries that test generation number. 
>
>> @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)
>>  	}
>>  }
>>
>> +static void compute_generation_numbers(struct commit** commits,
>> +				       int nr_commits)
>> +{
>> +	int i;
>> +	struct commit_list *list = NULL;
>> +
>> +	for (i = 0; i < nr_commits; i++) {
>> +		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
>> +		    commits[i]->generation != GENERATION_NUMBER_ZERO)
>> +			continue;
>> +
>> +		commit_list_insert(commits[i], &list);
>> +		while (list) {
>> +			struct commit *current = list->item;
>> +			struct commit_list *parent;
>> +			int all_parents_computed = 1;
>> +			uint32_t max_generation = 0;
>> +
>> +			for (parent = current->parents; parent; parent = parent->next) {
>> +				if (parent->item->generation == GENERATION_NUMBER_INFINITY ||
>> +				    parent->item->generation == GENERATION_NUMBER_ZERO) {
>> +					all_parents_computed = 0;
>> +					commit_list_insert(parent->item, &list);
>> +					break;
>> +				} else if (parent->item->generation > max_generation) {
>> +					max_generation = parent->item->generation;
>> +				}
>> +			}
>> +
>> +			if (all_parents_computed) {
>> +				current->generation = max_generation + 1;
>> +				pop_commit(&list);
>> +			}
> If we haven't computed all parents' generations yet,
> current->generation is undefined (or at least "left as
> initialized"), so it does not make much sense to attempt to clip it
> at _MAX at this point.  At least not yet.
>
> IOW, shouldn't the following two lines be inside the "we now know
> genno of all parents, so we can compute genno for commit" block
> above?

You're right! Good catch. This code sets every merge commit to _MAX. It should be in the block above.

> >> + if (current->generation > GENERATION_NUMBER_MAX) >> + current->generation = GENERATION_NUMBER_MAX; >> + } >> + } >> +} >> + >> void write_commit_graph(const char *obj_dir, >> const char **pack_indexes, >> int nr_packs, >> @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir, >> if (commits.nr >= GRAPH_PARENT_MISSING) >> die(_("too many commits to write graph")); >> >> + compute_generation_numbers(commits.list, commits.nr); >> + >> graph_name = get_commit_graph_filename(obj_dir); >> fd = hold_lock_file_for_update(&lk, graph_name, 0); ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v4 03/10] commit-graph: compute generation numbers 2018-04-26 12:58 ` Derrick Stolee @ 2018-04-26 13:49 ` Derrick Stolee 0 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-26 13:49 UTC (permalink / raw) To: Junio C Hamano, Derrick Stolee; +Cc: git, peff, jnareb, avarab On 4/26/2018 8:58 AM, Derrick Stolee wrote: > n 4/25/2018 10:35 PM, Junio C Hamano wrote: >> Derrick Stolee <dstolee@microsoft.com> writes: >> >>> @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct >>> hashfile *f, int hash_len, >>> else >>> packedDate[0] = 0; >>> + if ((*list)->generation != GENERATION_NUMBER_INFINITY) >>> + packedDate[0] |= htonl((*list)->generation << 2); >>> + >>> packedDate[1] = htonl((*list)->date); >>> hashwrite(f, packedDate, 8); >> The ones that have infinity are written as zero here. The code that >> reads the generation field off of a file in fill_commit_graph_info() >> and fill_commit_in_graph() both leave such a record in file as-is, >> so the reader of what we write out will think it is _ZERO, not _INF. >> >> Not that it matters, as it seems that most of the code being added >> by this series treat _ZERO and _INF more or less interchangeably. >> But it does raise another question, i.e. do we need both _ZERO and >> _INF, or is it sufficient to have just a single _UNKNOWN? > > This code is confusing. The 'if' condition is useless, since at this > point every commit should be finite (since we computed generation > numbers for everyone). We should just write the value always. > > For the sake of discussion, the value _INFINITY means not in the graph > and _ZERO means in the graph without a computed generation number. > It's a small distinction, but it gives a single boundary to use for > reachability queries that test generation number. 
> >> >>> @@ -571,6 +574,46 @@ static void close_reachable(struct >>> packed_oid_list *oids) >>> } >>> } >>> +static void compute_generation_numbers(struct commit** commits, >>> + int nr_commits) >>> +{ >>> + int i; >>> + struct commit_list *list = NULL; >>> + >>> + for (i = 0; i < nr_commits; i++) { >>> + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && >>> + commits[i]->generation != GENERATION_NUMBER_ZERO) >>> + continue; >>> + >>> + commit_list_insert(commits[i], &list); >>> + while (list) { >>> + struct commit *current = list->item; >>> + struct commit_list *parent; >>> + int all_parents_computed = 1; >>> + uint32_t max_generation = 0; >>> + >>> + for (parent = current->parents; parent; parent = >>> parent->next) { >>> + if (parent->item->generation == >>> GENERATION_NUMBER_INFINITY || >>> + parent->item->generation == >>> GENERATION_NUMBER_ZERO) { >>> + all_parents_computed = 0; >>> + commit_list_insert(parent->item, &list); >>> + break; >>> + } else if (parent->item->generation > >>> max_generation) { >>> + max_generation = parent->item->generation; >>> + } >>> + } >>> + >>> + if (all_parents_computed) { >>> + current->generation = max_generation + 1; >>> + pop_commit(&list); >>> + } >> If we haven't computed all parents' generations yet, >> current->generation is undefined (or at least "left as >> initialized"), so it does not make much sense to attempt to clip it >> at _MAX at this point. At leat not yet. >> >> IOW, shouldn't the following two lines be inside the "we now know >> genno of all parents, so we can compute genno for commit" block >> above? > > You're right! Good catch. This code sets every merge commit to _MAX. > It should be in the block above. > >> >>> + if (current->generation > GENERATION_NUMBER_MAX) >>> + current->generation = GENERATION_NUMBER_MAX; >>> + } >>> + } This bothered me: why didn't I catch a bug here? I rebased my "fsck" RFC onto this branch and it succeeded. 
Then I realized that this does not actually write incorrect values, since we revisit this commit after we pop the stack back down to it. However, there is a window in the middle where the in-memory generation is set incorrectly, and a later change could easily turn that into a real bug.

I'll move the _MAX check into the 'if' block above to prevent confusion.

Thanks,
-Stolee
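The fix agreed on above can be sketched in standalone C. This is only an illustration under stated assumptions: struct node, push(), pop(), and the GEN_* constants are hypothetical stand-ins for struct commit, commit_list_insert(), pop_commit(), and the GENERATION_NUMBER_* macros, and the numeric values chosen for GEN_MAX and GEN_INFINITY are assumed rather than taken from this thread.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed values; hypothetical stand-ins for GENERATION_NUMBER_*. */
#define GEN_INFINITY 0xFFFFFFFFu
#define GEN_MAX      0x3FFFFFFFu
#define GEN_ZERO     0u

/* Hypothetical, simplified replacement for struct commit. */
struct node {
	uint32_t generation;
	int nr_parents;
	struct node **parents;
};

struct stack_entry {
	struct node *item;
	struct stack_entry *next;
};

static void push(struct stack_entry **stack, struct node *item)
{
	struct stack_entry *e = malloc(sizeof(*e));
	if (!e)
		abort();
	e->item = item;
	e->next = *stack;
	*stack = e;
}

static void pop(struct stack_entry **stack)
{
	struct stack_entry *e = *stack;
	*stack = e->next;
	free(e);
}

static int is_uncomputed(const struct node *n)
{
	return n->generation == GEN_INFINITY || n->generation == GEN_ZERO;
}

static void compute_generations(struct node **nodes, int nr)
{
	struct stack_entry *stack = NULL;
	int i, j;

	for (i = 0; i < nr; i++) {
		if (!is_uncomputed(nodes[i]))
			continue;

		push(&stack, nodes[i]);
		while (stack) {
			struct node *current = stack->item;
			int all_parents_computed = 1;
			uint32_t max_generation = 0;

			for (j = 0; j < current->nr_parents; j++) {
				struct node *p = current->parents[j];
				if (is_uncomputed(p)) {
					/* Visit this parent first, then come back. */
					all_parents_computed = 0;
					push(&stack, p);
					break;
				} else if (p->generation > max_generation) {
					max_generation = p->generation;
				}
			}

			if (all_parents_computed) {
				current->generation = max_generation + 1;
				/* Clamp only once the value is actually known. */
				if (current->generation > GEN_MAX)
					current->generation = GEN_MAX;
				pop(&stack);
			}
		}
	}
}
```

With the clamp inside the block, a commit that is still waiting on a parent keeps its "uncomputed" marker the whole time, so no intermediate state ever looks like a computed _MAX.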
* Re: [PATCH v4 03/10] commit-graph: compute generation numbers 2018-04-25 14:37 ` [PATCH v4 03/10] commit-graph: compute generation numbers Derrick Stolee 2018-04-26 2:35 ` Junio C Hamano @ 2018-04-29 9:08 ` Jakub Narebski 2018-05-01 12:10 ` Derrick Stolee 1 sibling, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-29 9:08 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, gitster, peff, avarab Derrick Stolee <dstolee@microsoft.com> writes: > While preparing commits to be written into a commit-graph file, compute > the generation numbers using a depth-first strategy. Sidenote: for generation numbers it does not matter if we use depth-first or breadth-first strategy, but it is more natural to use depth-first search because generation numbers need post-order processing (parents before child). > > The only commits that are walked in this depth-first search are those > without a precomputed generation number. Thus, computation time will be > relative to the number of new commits to the commit-graph file. A question: what happens if the existing commit graph is from older version of git and has _ZERO for generation numbers? Answer: I see that we treat both _INFINITY (not in commit-graph) and _ZERO (in commit graph but not computed) as not computed generation numbers. All right. > > If a computed generation number would exceed GENERATION_NUMBER_MAX, then > use GENERATION_NUMBER_MAX instead. All right, though I guess this would remain theoretical for a long while. We don't have any way of testing this, at least not without recompiling Git with lower value of GENERATION_NUMBER_MAX -- which means not automatically, isn't it? 
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 9ad21c3ffb..047fa9fca5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  		else
>  			packedDate[0] = 0;
>
> +		if ((*list)->generation != GENERATION_NUMBER_INFINITY)
> +			packedDate[0] |= htonl((*list)->generation << 2);
> +

If we stumble upon a commit marked as "not in commit-graph" while writing the commit graph, it is a BUG(), isn't it?

(Problem noticed by Junio.)

It is a bit strange to me that the code uses get_be32 for reading, but htonl for writing. Is Git tested on non-little-endian machines, like big-endian ppc64 or s390x, or on mixed-endian machines (or selectable-endian machines with data endianness set to non-little-endian, like ia64)? If not, could we use for example openSUSE Build Service (https://build.opensuse.org/) for this?

>  		packedDate[1] = htonl((*list)->date);
>  		hashwrite(f, packedDate, 8);
>
> @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)
>  	}
>  }
>
> +static void compute_generation_numbers(struct commit** commits,
> +				       int nr_commits)
> +{
> +	int i;
> +	struct commit_list *list = NULL;

All right, commit_list will work as a stack.

> +
> +	for (i = 0; i < nr_commits; i++) {
> +		if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
> +		    commits[i]->generation != GENERATION_NUMBER_ZERO)
> +			continue;

All right, we consider _INFINITY and _ZERO as not computed. If the generation number is already computed (by 'recursion' or from the commit graph), we (re)use it. This means that generation number calculation is incremental, as intended -- good.

> +
> +		commit_list_insert(commits[i], &list);

Start depth-first walks from the commits given.

> + while (list) { > + struct commit *current = list->item; > + struct commit_list *parent; > + int all_parents_computed = 1; Here all_parents_computed is a boolean flag. I see that it is easier to start with assumption that all parents will have computed generation numbers. > + uint32_t max_generation = 0; The generation number value of 0 functions as sentinel; generation numbers start from 1. Not that it matters much, as lowest possible generation number is 1, and we could have started from that value. > + > + for (parent = current->parents; parent; parent = parent->next) { > + if (parent->item->generation == GENERATION_NUMBER_INFINITY || > + parent->item->generation == GENERATION_NUMBER_ZERO) { > + all_parents_computed = 0; > + commit_list_insert(parent->item, &list); > + break; If some parent doesn't have generation number calculated, we add it to stack (and break out of loop because it is depth-first walk), and mark this situation. All right. > + } else if (parent->item->generation > max_generation) { > + max_generation = parent->item->generation; Otherwise, update max_generation. All right. > + } > + } > + > + if (all_parents_computed) { > + current->generation = max_generation + 1; > + pop_commit(&list); > + } > + > + if (current->generation > GENERATION_NUMBER_MAX) > + current->generation = GENERATION_NUMBER_MAX; This conditional should be inside all_parents_computed test, for example like this: + if (all_parents_computed) { + current->generation = max_generation + 1; + if (current->generation > GENERATION_NUMBER_MAX) + current->generation = GENERATION_NUMBER_MAX; + + pop_commit(&list); + } (Noticed by Junio.) Sidenote: when we revisit the commit, returning from depth-first walk of one of its parents, we calculate max_generation from scratch again. This does not matter for performance, as it's just data access and calculating maximum - any workaround to not restart those calculations would take more time and memory. And it's simple. 
> + } > + } > +} > + > void write_commit_graph(const char *obj_dir, > const char **pack_indexes, > int nr_packs, > @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir, > if (commits.nr >= GRAPH_PARENT_MISSING) > die(_("too many commits to write graph")); > > + compute_generation_numbers(commits.list, commits.nr); > + Nice and simple. All right. I guess that we do not pass "struct packed_commit_list commits" as argument to compute_generation_numbers instead of "struct commit** commits.list" and "int commits.nr" to compute_generation_numbers() to keep the latter nice and generic? > graph_name = get_commit_graph_filename(obj_dir); > fd = hold_lock_file_for_update(&lk, graph_name, 0); Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v4 03/10] commit-graph: compute generation numbers 2018-04-29 9:08 ` Jakub Narebski @ 2018-05-01 12:10 ` Derrick Stolee 2018-05-02 16:15 ` Jakub Narebski 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:10 UTC (permalink / raw) To: Jakub Narebski, Derrick Stolee; +Cc: git, gitster, peff, avarab On 4/29/2018 5:08 AM, Jakub Narebski wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> While preparing commits to be written into a commit-graph file, compute >> the generation numbers using a depth-first strategy. > Sidenote: for generation numbers it does not matter if we use > depth-first or breadth-first strategy, but it is more natural to use > depth-first search because generation numbers need post-order processing > (parents before child). > >> The only commits that are walked in this depth-first search are those >> without a precomputed generation number. Thus, computation time will be >> relative to the number of new commits to the commit-graph file. > A question: what happens if the existing commit graph is from older > version of git and has _ZERO for generation numbers? > > Answer: I see that we treat both _INFINITY (not in commit-graph) and > _ZERO (in commit graph but not computed) as not computed generation > numbers. All right. > >> If a computed generation number would exceed GENERATION_NUMBER_MAX, then >> use GENERATION_NUMBER_MAX instead. > All right, though I guess this would remain theoretical for a long > while. > > We don't have any way of testing this, at least not without recompiling > Git with lower value of GENERATION_NUMBER_MAX -- which means not > automatically, isn't it? 
> >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> >> --- >> commit-graph.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 45 insertions(+) >> >> diff --git a/commit-graph.c b/commit-graph.c >> index 9ad21c3ffb..047fa9fca5 100644 >> --- a/commit-graph.c >> +++ b/commit-graph.c >> @@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, >> else >> packedDate[0] = 0; >> >> + if ((*list)->generation != GENERATION_NUMBER_INFINITY) >> + packedDate[0] |= htonl((*list)->generation << 2); >> + > If we stumble upon commit marked as "not in commit-graph" while writing > commit graph, it is a BUG(), isn't it? > > (Problem noticed by Junio.) Since we are computing the values for all commits in the list, this condition is not important and will be removed. > > It is a bit strange to me that the code uses get_be32 for reading, but > htonl for writing. Is Git tested on non little-endian machines, like > big-endian ppc64 or s390x, or on mixed-endian machines (or > selectable-endian machines with data endianness set to non > little-endian, like ia64)? If not, could we use for example openSUSE > Build Service (https://build.opensuse.org/) for this? Since we are packing two values into 64 bits, I am using htonl() here to arrange the 30-bit generation number alongside the 34-bit commit date value, then writing with hashwrite(). The other 32-bit integers are written with hashwrite_be32() to avoid translating this data in-memory. > >> packedDate[1] = htonl((*list)->date); >> hashwrite(f, packedDate, 8); >> >> @@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids) >> } >> } >> >> +static void compute_generation_numbers(struct commit** commits, >> + int nr_commits) >> +{ >> + int i; >> + struct commit_list *list = NULL; > All right, commit_list will work as stack. 
> >> + >> + for (i = 0; i < nr_commits; i++) { >> + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && >> + commits[i]->generation != GENERATION_NUMBER_ZERO) >> + continue; > All right, we consider _INFINITY and _SERO as not computed. If > generation number is computed (by 'recursion' or from commit graph), we > (re)use it. This means that generation number calculation is > incremental, as intended -- good. > >> + >> + commit_list_insert(commits[i], &list); > Start depth-first walks from commits given. > >> + while (list) { >> + struct commit *current = list->item; >> + struct commit_list *parent; >> + int all_parents_computed = 1; > Here all_parents_computed is a boolean flag. I see that it is easier to > start with assumption that all parents will have computed generation > numbers. > >> + uint32_t max_generation = 0; > The generation number value of 0 functions as sentinel; generation > numbers start from 1. Not that it matters much, as lowest possible > generation number is 1, and we could have started from that value. Except that for a commit with no parents, we want it to receive generation number max_generation + 1 = 1, so this value of 0 is important. > >> + >> + for (parent = current->parents; parent; parent = parent->next) { >> + if (parent->item->generation == GENERATION_NUMBER_INFINITY || >> + parent->item->generation == GENERATION_NUMBER_ZERO) { >> + all_parents_computed = 0; >> + commit_list_insert(parent->item, &list); >> + break; > If some parent doesn't have generation number calculated, we add it to > stack (and break out of loop because it is depth-first walk), and mark > this situation. All right. > >> + } else if (parent->item->generation > max_generation) { >> + max_generation = parent->item->generation; > Otherwise, update max_generation. All right. 
> >> + } >> + } >> + >> + if (all_parents_computed) { >> + current->generation = max_generation + 1; >> + pop_commit(&list); >> + } >> + >> + if (current->generation > GENERATION_NUMBER_MAX) >> + current->generation = GENERATION_NUMBER_MAX; > This conditional should be inside all_parents_computed test, for example > like this: > > + if (all_parents_computed) { > + current->generation = max_generation + 1; > + if (current->generation > GENERATION_NUMBER_MAX) > + current->generation = GENERATION_NUMBER_MAX; > + > + pop_commit(&list); > + } > > (Noticed by Junio.) > > Sidenote: when we revisit the commit, returning from depth-first walk of > one of its parents, we calculate max_generation from scratch again. > This does not matter for performance, as it's just data access and > calculating maximum - any workaround to not restart those calculations > would take more time and memory. And it's simple. > >> + } >> + } >> +} >> + >> void write_commit_graph(const char *obj_dir, >> const char **pack_indexes, >> int nr_packs, >> @@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir, >> if (commits.nr >= GRAPH_PARENT_MISSING) >> die(_("too many commits to write graph")); >> >> + compute_generation_numbers(commits.list, commits.nr); >> + > Nice and simple. All right. > > I guess that we do not pass "struct packed_commit_list commits" as > argument to compute_generation_numbers instead of "struct commit** > commits.list" and "int commits.nr" to compute_generation_numbers() to > keep the latter nice and generic? Good catch. There is no reason to not use packed_commit_list here. > >> graph_name = get_commit_graph_filename(obj_dir); >> fd = hold_lock_file_for_update(&lk, graph_name, 0); > Best, ^ permalink raw reply [flat|nested] 162+ messages in thread
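Returning to the htonl() question earlier in this reply: the packing of a 30-bit generation number next to a 34-bit commit date can be sketched as below. This is a hedged illustration, not the code from the patch. The function names are hypothetical, and placing the top two bits of the date in the low two bits of the first word is an assumption, since the thread only shows the generation being shifted left by two. The reader uses a byte-wise big-endian load (named read_be32() here, standing in for get_be32) to show that it agrees with an htonl()-based writer on any host.

```c
#include <arpa/inet.h>  /* htonl(); POSIX */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack generation and date into 8 on-disk bytes. */
static void pack_gen_and_date(unsigned char out[8],
			      uint32_t generation, uint64_t date)
{
	uint32_t packed[2];

	/* Assumed layout: top 2 bits of the 34-bit date in the low bits. */
	packed[0] = htonl((uint32_t)((date >> 32) & 0x3));
	/* The 30-bit generation occupies the remaining upper 30 bits. */
	packed[0] |= htonl(generation << 2);
	/* Low 32 bits of the date fill the second word. */
	packed[1] = htonl((uint32_t)(date & 0xFFFFFFFFu));

	memcpy(out, packed, sizeof(packed));
}

/* Byte-wise big-endian load; independent of host endianness. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: the inverse of pack_gen_and_date(). */
static void unpack_gen_and_date(const unsigned char in[8],
				uint32_t *generation, uint64_t *date)
{
	uint32_t hi = read_be32(in);
	uint32_t lo = read_be32(in + 4);

	*generation = hi >> 2;
	*date = ((uint64_t)(hi & 0x3) << 32) | lo;
}
```

Because htonl() only permutes bytes, OR-ing two converted values gives the same result as converting the OR of the two values, which is why the writer can combine the fields after conversion.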
* Re: [PATCH v4 03/10] commit-graph: compute generation numbers 2018-05-01 12:10 ` Derrick Stolee @ 2018-05-02 16:15 ` Jakub Narebski 0 siblings, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-05-02 16:15 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason Derrick Stolee <stolee@gmail.com> writes: > On 4/29/2018 5:08 AM, Jakub Narebski wrote: >> Derrick Stolee <dstolee@microsoft.com> writes: [...] >> It is a bit strange to me that the code uses get_be32 for reading, but >> htonl for writing. Is Git tested on non little-endian machines, like >> big-endian ppc64 or s390x, or on mixed-endian machines (or >> selectable-endian machines with data endianness set to non >> little-endian, like ia64)? If not, could we use for example openSUSE >> Build Service (https://build.opensuse.org/) for this? > > Since we are packing two values into 64 bits, I am using htonl() here > to arrange the 30-bit generation number alongside the 34-bit commit > date value, then writing with hashwrite(). The other 32-bit integers > are written with hashwrite_be32() to avoid translating this data > in-memory. O.K., so you are using what is more effective and easier to use. Nice to know, thanks for the information. [...] >>> +static void compute_generation_numbers(struct commit** commits, >>> + int nr_commits) >>> +{ [...] >>> + for (i = 0; i < nr_commits; i++) { >>> + if (commits[i]->generation != GENERATION_NUMBER_INFINITY && >>> + commits[i]->generation != GENERATION_NUMBER_ZERO) >>> + continue; [...] >>> + compute_generation_numbers(commits.list, commits.nr); >>> + >> Nice and simple. All right. >> >> I guess that we do not pass "struct packed_commit_list commits" as >> argument to compute_generation_numbers instead of "struct commit** >> commits.list" and "int commits.nr" to compute_generation_numbers() to >> keep the latter nice and generic? > > Good catch. There is no reason to not use packed_commit_list here. 
Actually, now that v5 shows what using packed_commit_list looks like, in my opinion it looks uglier, and it might be easier to make a mistake. Also, depending on how well the compiler is able to optimize it, the version passing packed_commit_list as an argument has one more indirection (following two pointers) in the loop.

Best,
-- 
Jakub Narębski
* [PATCH v4 04/10] commit: use generations in paint_down_to_common() 2018-04-25 14:37 ` [PATCH v4 00/10] " Derrick Stolee ` (2 preceding siblings ...) 2018-04-25 14:37 ` [PATCH v4 03/10] commit-graph: compute generation numbers Derrick Stolee @ 2018-04-25 14:37 ` Derrick Stolee 2018-04-26 3:22 ` Junio C Hamano 2018-04-29 15:40 ` Jakub Narebski 2018-04-25 14:37 ` [PATCH v4 06/10] ref-filter: use generation number for --contains Derrick Stolee ` (7 subsequent siblings) 11 siblings, 2 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw) To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 20 +++++++++++++++++++- commit.h | 1 + 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 711f674c18..4d00b0a1d6 100644 --- a/commit.c +++ b/commit.c @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + /* newer commits first */ + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* use date as a heuristic when generations are equal */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index aac3b8c56f..64436ff44e 100644 --- a/commit.h +++ b/commit.h @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
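To see how the comparison in this patch orders commits, including the special generation values, here is a small sketch. It is an illustration only: it sorts a hypothetical struct toy_commit with qsort() instead of feeding Git's prio_queue, it drops the unused third argument, and the numeric values of GEN_INFINITY and GEN_ZERO are assumed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed values; hypothetical stand-ins for GENERATION_NUMBER_*. */
#define GEN_INFINITY 0xFFFFFFFFu
#define GEN_ZERO     0u

/* Hypothetical, simplified replacement for struct commit. */
struct toy_commit {
	const char *name;
	uint32_t generation;
	uint64_t date;
};

/* Same ordering as the patch: higher generation first, then newer date. */
static int cmp_gen_then_date(const void *a_, const void *b_)
{
	const struct toy_commit *a = a_, *b = b_;

	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;

	/* Dates only break ties between equal generations. */
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}

/* Sort so that the commit to be visited first ends up at index 0. */
static void sort_for_walk(struct toy_commit *commits, size_t nr)
{
	qsort(commits, nr, sizeof(*commits), cmp_gen_then_date);
}
```

In this ordering a commit outside the graph (generation _INFINITY) sorts first, a child with an older date still sorts before its parent with a newer date, and commits with _ZERO sort last.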
* Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common() 2018-04-25 14:37 ` [PATCH v4 04/10] commit: use generations in paint_down_to_common() Derrick Stolee @ 2018-04-26 3:22 ` Junio C Hamano 2018-04-26 9:02 ` Jakub Narebski 2018-04-29 15:40 ` Jakub Narebski 1 sibling, 1 reply; 162+ messages in thread From: Junio C Hamano @ 2018-04-26 3:22 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, peff, jnareb, avarab Derrick Stolee <dstolee@microsoft.com> writes: > Define compare_commits_by_gen_then_commit_date(), which uses generation > numbers as a primary comparison and commit date to break ties (or as a > comparison when both commits do not have computed generation numbers). > > Since the commit-graph file is closed under reachability, we know that > all commits in the file have generation at most GENERATION_NUMBER_MAX > which is less than GENERATION_NUMBER_INFINITY. I suspect that my puzzlement may be coming from my not "getting" what you meant by "closed under reachability", but could you also explain how _INF and _ZERO interact with commits with normal generation numbers? I've always assumed that genno will be used only when comparing two commits with valid genno and otherwise we'd fall back to the traditional date based one, but... > +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) > +{ > + const struct commit *a = a_, *b = b_; > + > + /* newer commits first */ > + if (a->generation < b->generation) > + return 1; > + else if (a->generation > b->generation) > + return -1; ... this does not check if a->generation is _ZERO or _INF. Both being _MAX is OK (the control will fall through and use the dates below). One being _MAX and the other being a normal value is also OK (the above comparisons will declare the commit with _MAX is farther than less-than-max one from a root). 
Or is the assumption that if one has _ZERO, that must have come from an ancient commit-graph file and none of the commits have anything but _ZERO? > + /* use date as a heuristic when generations are equal */ > + if (a->date < b->date) > + return 1; > + else if (a->date > b->date) > + return -1; > + return 0; > +} > + > int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) > { > const struct commit *a = a_, *b = b_; > @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) > /* all input commits in one and twos[] must have been parsed! */ > static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) > { > - struct prio_queue queue = { compare_commits_by_commit_date }; > + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; > struct commit_list *result = NULL; > int i; > > diff --git a/commit.h b/commit.h > index aac3b8c56f..64436ff44e 100644 > --- a/commit.h > +++ b/commit.h > @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); > extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); > > int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); > +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); > > LAST_ARG_MUST_BE_NULL > extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common() 2018-04-26 3:22 ` Junio C Hamano @ 2018-04-26 9:02 ` Jakub Narebski 2018-04-28 14:38 ` Jakub Narebski 0 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-26 9:02 UTC (permalink / raw) To: Junio C Hamano Cc: Derrick Stolee, git, Jeff King, Ævar Arnfjörð Bjarmason Junio C Hamano <gitster@pobox.com> writes: > Derrick Stolee <dstolee@microsoft.com> writes: > >> Define compare_commits_by_gen_then_commit_date(), which uses generation >> numbers as a primary comparison and commit date to break ties (or as a >> comparison when both commits do not have computed generation numbers). >> >> Since the commit-graph file is closed under reachability, we know that >> all commits in the file have generation at most GENERATION_NUMBER_MAX >> which is less than GENERATION_NUMBER_INFINITY. > > I suspect that my puzzlement may be coming from my not "getting" > what you meant by "closed under reachability", It means that if commit A is in the commit graph, then all of its ancestors (all commits reachable from A) are also in the commit graph. > but could you also > explain how _INF and _ZERO interact with commits with normal > generation numbers? I've always assumed that genno will be used > only when comparing two commits with valid genno and otherwise we'd > fall back to the traditional date based one, but... > >> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) >> +{ >> + const struct commit *a = a_, *b = b_; >> + >> + /* newer commits first */ >> + if (a->generation < b->generation) >> + return 1; >> + else if (a->generation > b->generation) >> + return -1; > > ... this does not check if a->generation is _ZERO or _INF. > > Both being _MAX is OK (the control will fall through and use the > dates below). 
One being _MAX and the other being a normal value is > also OK (the above comparisons will declare the commit with _MAX is > farther than less-than-max one from a root). > > Or is the assumption that if one has _ZERO, that must have come from > an ancient commit-graph file and none of the commits have anything > but _ZERO? There is stronger and weaker version of the negative-cut criteria based on generation numbers. The strong criteria: if A != B and gen(A) <= gen(B), then A cannot reach B The weaker criteria: if gen(A) < gen(B), then A cannot reach B Because commit-graph is closed under reachability, this means that if A is in commit graph, and B is outside of it, then A cannot reach B If A is in commit graph, then either _MAX >= gen(A) >= 1, or gen(A) == _ZERO. Because _INFINITY > _MAX > _ZERO, then we have if _MAX >= gen(A) >= 1 || gen(A) == 0, and gen(B) == _INFINITY then A cannot reach B which also fullfils the weaker criteria if gen(A) < gen(B), then A cannot reach B If both A and B are outside commit-graph, i.e. gen(A) = gen(B) = _INFINITY, or if both A and B have gen(A) = gen(B) = _MAX, or if both A and B come from old commit graph with gen(A) = gen(B) =_ZERO, then we cannot say anything about reachability... and weak criteria also does not say anything about reachability. Maybe the following ASCII table would make it clear. | gen(B) | ................................ ::::::: gen(A) | _INFINITY | _MAX | larger | smaller | _ZERO -------------+-----------+----------+----------+----------+-------- _INFINITY | = | > | > | > | > _MAX | < Nn | = | > | > | > larger | < Nn | < Nn | = n | > | > smaller | < Nn | < Nn | < Nn | = n | > _ZERO | < Nn | < Nn | < Nn | < Nn | = Here "n" denotes stronger condition, and "N" denotes weaker condition. We have _INFINITY > _MAX > larger > smaller > _ZERO. NOTE however that it is a *tradeoff*. 
Using the weaker criterion, with
strict inequality, means that we don't need to handle the _INFINITY, _MAX
and _ZERO corner cases in a special way; but it also means that we would
walk slightly more commits than if we used the stronger criterion, with
less-than-or-equal.

For the Linux kernel public repository commit graph[1] we have a maximum of 512
commits sharing the same level, 5.43 commits sharing the same level on average,
and 50% of the time only 2 commits sharing the same level (median, or 2nd
quartile, or 50th percentile).  This is roughly the number of extra commits
we walk with the weaker cut-off condition.

[1]: with 750k commits, but which is not the largest commit graph any more :-0

>> +	/* use date as a heuristic when generations are equal */
>> +	if (a->date < b->date)
>> +		return 1;
>> +	else if (a->date > b->date)
>> +		return -1;
>> +	return 0;
>> +}

HTH
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 162+ messages in thread
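The weaker criterion from the message above can be sketched as a standalone C helper. This is only an illustration, not code from the patch series: the function name cannot_reach_weak() is hypothetical, and the constant values are assumed to match the ones the series defines in commit.h.

```c
#include <stdint.h>

/* Assumed values of the special generation numbers (not taken from this thread). */
#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX      0x3FFFFFFF
#define GENERATION_NUMBER_ZERO     0

/*
 * Weaker negative-cut criterion (hypothetical helper).
 * Returns 1 when gen(A) < gen(B), i.e. A definitely cannot reach B,
 * and 0 when generation numbers alone cannot decide.
 * Because the test is a strict inequality, the special values need
 * no extra branches: they simply sort as _INFINITY > _MAX > ... > _ZERO.
 */
int cannot_reach_weak(uint32_t gen_a, uint32_t gen_b)
{
	return gen_a < gen_b;
}
```

Each "<" cell of the table corresponds to a call that returns 1; every "=" and ">" cell returns 0, meaning "unknown from generation numbers alone".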
* Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common() 2018-04-26 9:02 ` Jakub Narebski @ 2018-04-28 14:38 ` Jakub Narebski 0 siblings, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-04-28 14:38 UTC (permalink / raw) To: Junio C Hamano Cc: Derrick Stolee, git, Jeff King, Ævar Arnfjörð Bjarmason Jakub Narebski <jnareb@gmail.com> writes: > Junio C Hamano <gitster@pobox.com> writes: >> Derrick Stolee <dstolee@microsoft.com> writes: [...] >>> +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) >>> +{ >>> + const struct commit *a = a_, *b = b_; >>> + >>> + /* newer commits first */ >>> + if (a->generation < b->generation) >>> + return 1; >>> + else if (a->generation > b->generation) >>> + return -1; >> >> ... this does not check if a->generation is _ZERO or _INF. >> >> Both being _MAX is OK (the control will fall through and use the >> dates below). One being _MAX and the other being a normal value is >> also OK (the above comparisons will declare the commit with _MAX is >> farther than less-than-max one from a root). >> >> Or is the assumption that if one has _ZERO, that must have come from >> an ancient commit-graph file and none of the commits have anything >> but _ZERO? > > There is stronger and weaker version of the negative-cut criteria based > on generation numbers. > > The strong criteria: > > if A != B and gen(A) <= gen(B), then A cannot reach B > > The weaker criteria: > > if gen(A) < gen(B), then A cannot reach B > > > Because commit-graph is closed under reachability, this means that > > if A is in commit graph, and B is outside of it, then A cannot reach B > > If A is in commit graph, then either _MAX >= gen(A) >= 1, > or gen(A) == _ZERO. 
Because _INFINITY > _MAX > _ZERO, then we have > > if _MAX >= gen(A) >= 1 || gen(A) == 0, and gen(B) == _INFINITY > then A cannot reach B > > which also fullfils the weaker criteria > > if gen(A) < gen(B), then A cannot reach B > > > If both A and B are outside commit-graph, i.e. gen(A) = gen(B) = _INFINITY, > or if both A and B have gen(A) = gen(B) = _MAX, > or if both A and B come from old commit graph with gen(A) = gen(B) =_ZERO, > then we cannot say anything about reachability... and weak criteria > also does not say anything about reachability. > > > Maybe the following ASCII table would make it clear. > > | gen(B) > | ................................ ::::::: > gen(A) | _INFINITY | _MAX | larger | smaller | _ZERO > -------------+-----------+----------+----------+----------+-------- > _INFINITY | = | > | > | > | > > _MAX | < Nn | = | > | > | > > larger | < Nn | < Nn | = n | > | > > smaller | < Nn | < Nn | < Nn | = n | > > _ZERO | < Nn | < Nn | < Nn | < Nn | = > > Here "n" denotes stronger condition, and "N" denotes weaker condition. > We have _INFINITY > _MAX > larger > smaller > _ZERO. > > > NOTE however that it is a *tradeoff*. Using weaker criteria, with > strict inequality, means that we don't need to handle _INFINITY, _MAX > and _ZERO corner-cases in a special way; but it also means that we would > walk slightly more commits than if we used stronger criteria, with less > or equals. Actually, if we look at the table above, it turns out that we can use the stronger version of negative-cut criteria without special-casing all the possible combinations. Just use stronger criteria on normal range, weaker criteria if any of generation numbers is special generation number. 
  if _MAX > gen(A) > _ZERO and
     _MAX > gen(B) > _ZERO then

    if A != B and gen(A) <= gen(B) then
      A cannot reach B
    else
      A may reach B

  else /* at least one special case */

    if gen(A) < gen(B) then
      A cannot reach B
    else
      A may reach B


NOTE that this specifically does not matter for the
compare_commits_by_gen_then_commit_date() created here, as sorting requires
strict inequality anyway; using the weaker criterion explains why we don't
need any special cases in the code here.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 162+ messages in thread
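The combined rule from the message above (stronger criterion on the normal range, weaker criterion as soon as a special value is involved) could look roughly like this in C. It is a sketch under assumptions: gen_is_normal() and cannot_reach() are hypothetical names, the constant values are assumed, and a return value of 0 means "not decided by generation numbers", not "A reaches B".

```c
#include <stdint.h>

/* Assumed values of the special generation numbers. */
#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX      0x3FFFFFFF
#define GENERATION_NUMBER_ZERO     0

/* A generation number is "normal" when it lies strictly between _ZERO and _MAX. */
int gen_is_normal(uint32_t gen)
{
	return gen > GENERATION_NUMBER_ZERO && gen < GENERATION_NUMBER_MAX;
}

/*
 * Returns 1 if A definitely cannot reach B, 0 if undecided.
 * same_commit is nonzero when A and B are the same commit.
 */
int cannot_reach(uint32_t gen_a, uint32_t gen_b, int same_commit)
{
	/* Both normal: the stronger criterion (<=) is safe. */
	if (gen_is_normal(gen_a) && gen_is_normal(gen_b))
		return !same_commit && gen_a <= gen_b;

	/* At least one special value: fall back to the weaker criterion (<). */
	return gen_a < gen_b;
}
```

Two distinct commits with equal normal generation numbers are cut by the stronger branch, while two commits that both sit at _MAX or _INFINITY stay undecided.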
* Re: [PATCH v4 04/10] commit: use generations in paint_down_to_common() 2018-04-25 14:37 ` [PATCH v4 04/10] commit: use generations in paint_down_to_common() Derrick Stolee 2018-04-26 3:22 ` Junio C Hamano @ 2018-04-29 15:40 ` Jakub Narebski 1 sibling, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-04-29 15:40 UTC (permalink / raw) To: Derrick Stolee Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason Derrick Stolee <dstolee@microsoft.com> writes: > Define compare_commits_by_gen_then_commit_date(), which uses generation > numbers as a primary comparison and commit date to break ties (or as a > comparison when both commits do not have computed generation numbers). All right, this looks like a reasonable thing to do when we have access to commit generation numbers. > Since the commit-graph file is closed under reachability, we know that > all commits in the file have generation at most GENERATION_NUMBER_MAX > which is less than GENERATION_NUMBER_INFINITY. Thus the condition that if B is reachable from A then gen(A) >= gen(B) holds even if they have generation numbers _INFINITY, _MAX or _ZERO. We use generation numbers, if possible, to choose the closest commit; if not, we use dates. > > This change does not affect the number of commits that are walked during > the execution of paint_down_to_common(), only the order that those > commits are inspected. In the case that commit dates violate topological > order (i.e. a parent is "newer" than a child), the previous code could > walk a commit twice: if a commit is reached with the PARENT1 bit, but > later is re-visited with the PARENT2 bit, then that PARENT2 bit must be > propagated to its parents. Using generation numbers avoids this extra > effort, even if it is somewhat rare. Actually the ordering of commits walked does not affect the correctness of the result. 
Better ordering means that commits do not need to be walked twice; I think it would be possible to craft a repository in which unlucky clock skew would lead to a depth-first walk of commits that a later part of the walk would mark STALE. Pedantry aside, I think it is a good description of the analysis of the results of the change. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > commit.c | 20 +++++++++++++++++++- > commit.h | 1 + > 2 files changed, 20 insertions(+), 1 deletion(-) > > diff --git a/commit.c b/commit.c > index 711f674c18..4d00b0a1d6 100644 > --- a/commit.c > +++ b/commit.c > @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, > return 0; > } > > +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) > +{ > + const struct commit *a = a_, *b = b_; > + > + /* newer commits first */ To be pedantic, a larger generation number does not necessarily mean that the commit was created later (is newer), only that it is on a longer chain since the common ancestor or root commit. > + if (a->generation < b->generation) > + return 1; > + else if (a->generation > b->generation) > + return -1; If the commit-graph feature is not available, or is disabled, all commits would have the same generation number (_INFINITY), so this block would practically always be a no-op. This is not very costly: 2 accesses to data which should be in cache, and 1 to 2 comparison operations. But I wonder if we wouldn't want to avoid this nano-cost if possible... > + > + /* use date as a heuristic when generations are equal */ > + if (a->date < b->date) > + return 1; > + else if (a->date > b->date) > + return -1; > + return 0; The above is the same code as in compare_commits_by_commit_date(), but there it is with "newer commits with larger date first" as comment instead. All right: we need inlining for speed. 
> +} > + > int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) > { > const struct commit *a = a_, *b = b_; > @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) > /* all input commits in one and twos[] must have been parsed! */ > static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) > { > - struct prio_queue queue = { compare_commits_by_commit_date }; > + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; I wonder if it would be worth it to avoid comparing by generation numbers without commit graph data: + struct prio_queue queue; [...] + if (commit_graph) + queue.compare = compare_commits_by_gen_then_commit_date; + else + queue.compare = compare_commits_by_commit_date; Or something like that. But perhaps this nano-optimization is not worth it (it is not that complicated, though). Sidenote: when I searched for compare_commits_by_commit_date use, I have noticed that it is used, I think as heuristics, for packfile creation in upload-pack.c and fetch-pack.c. Would they similarly improve with compare_commits_by_gen_then_commit_date? This is of course not something that this commit, or this patch series, needs to address now. > struct commit_list *result = NULL; > int i; > > diff --git a/commit.h b/commit.h > index aac3b8c56f..64436ff44e 100644 > --- a/commit.h > +++ b/commit.h > @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); > extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); > > int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); > +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); All right. > > LAST_ARG_MUST_BE_NULL > extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); ^ permalink raw reply [flat|nested] 162+ messages in thread
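The suggestion in the review above, picking the comparator once depending on whether commit-graph data is available, can be sketched outside of Git as follows. Everything here is a toy: struct toy_commit, the toy_* comparators and toy_pick_compare() are hypothetical names, and only the comparison logic mirrors the quoted patch.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy stand-in for struct commit; only the fields the comparators read. */
struct toy_commit {
	uint32_t generation;
	uint64_t date;
};

typedef int (*toy_compare_fn)(const void *a_, const void *b_, void *unused);

/* Larger date sorts first. */
int toy_compare_by_date(const void *a_, const void *b_, void *unused)
{
	const struct toy_commit *a = a_, *b = b_;

	(void)unused;
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}

/* Larger generation number sorts first; the date only breaks ties. */
int toy_compare_by_gen_then_date(const void *a_, const void *b_, void *unused)
{
	const struct toy_commit *a = a_, *b = b_;

	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;
	return toy_compare_by_date(a_, b_, unused);
}

/* Choose the comparator once, instead of paying for a useless check per call. */
toy_compare_fn toy_pick_compare(int have_commit_graph)
{
	return have_commit_graph ? toy_compare_by_gen_then_date
				 : toy_compare_by_date;
}
```

With a child of generation 6 whose date is older than its parent of generation 5 (clock skew), the generation-aware comparator still puts the child first, while the date-only comparator puts the parent first.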
* [PATCH v4 06/10] ref-filter: use generation number for --contains 2018-04-25 14:37 ` [PATCH v4 00/10] " Derrick Stolee ` (3 preceding siblings ...) 2018-04-25 14:37 ` [PATCH v4 04/10] commit: use generations in paint_down_to_common() Derrick Stolee @ 2018-04-25 14:37 ` Derrick Stolee 2018-04-30 16:34 ` Jakub Narebski 2018-04-25 14:37 ` [PATCH v4 05/10] commit-graph: always load commit-graph information Derrick Stolee ` (6 subsequent siblings) 11 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw) To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee A commit A can reach a commit B only if the generation number of A is strictly larger than the generation number of B. This condition allows significantly short-circuiting commit-graph walks. Use generation number for '--contains' type queries. On a copy of the Linux repository where HEAD is containd in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following peformance improvement: Before: 0.81s After: 0.04s Rel %: -95% Helped-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- ref-filter.c | 24 ++++++++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index aff24d93be..fb35067fc9 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -16,6 +16,7 @@ #include "trailer.h" #include "wt-status.h" #include "commit-slab.h" +#include "commit-graph.h" static struct ref_msg { const char *gone; @@ -1587,7 +1588,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1603,6 +1605,10 @@ static enum contains_result contains_test(struct commit *candidate, /* Otherwise, we don't know; 
prepare to recurse */ parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1618,8 +1624,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_INFINITY; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + load_commit_graph_info(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1637,7 +1653,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1651,7 +1667,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
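The cutoff logic in the patch above boils down to two small steps: take the minimum generation number over all wanted commits, then reject any candidate strictly below it. A minimal sketch on plain integers, with hypothetical helper names (min_generation(), below_cutoff()) and an assumed value for GENERATION_NUMBER_INFINITY:

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed value; commits outside the commit-graph carry this generation. */
#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/* Smallest generation number among the wanted commits. */
uint32_t min_generation(const uint32_t *want_gens, size_t nr)
{
	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
	size_t i;

	for (i = 0; i < nr; i++)
		if (want_gens[i] < cutoff)
			cutoff = want_gens[i];
	return cutoff;
}

/*
 * Returns 1 when the candidate is strictly below the cutoff and therefore
 * cannot reach any wanted commit; 0 means the walk has to continue.
 */
int below_cutoff(uint32_t candidate_gen, uint32_t cutoff)
{
	return candidate_gen < cutoff;
}
```

When no commit-graph is loaded, every generation number (and hence the cutoff) is _INFINITY, so below_cutoff() never fires and the walk behaves as before.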
* Re: [PATCH v4 06/10] ref-filter: use generation number for --contains 2018-04-25 14:37 ` [PATCH v4 06/10] ref-filter: use generation number for --contains Derrick Stolee @ 2018-04-30 16:34 ` Jakub Narebski 0 siblings, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-04-30 16:34 UTC (permalink / raw) To: Derrick Stolee Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason Derrick Stolee <dstolee@microsoft.com> writes: > A commit A can reach a commit B only if the generation number of A > is strictly larger than the generation number of B. This condition > allows significantly short-circuiting commit-graph walks. > > Use generation number for '--contains' type queries. > > On a copy of the Linux repository where HEAD is containd in v4.13 Minor typo: containd -> contained. > but no earlier tag, the command 'git tag --contains HEAD' had the > following peformance improvement: > > Before: 0.81s > After: 0.04s > Rel %: -95% Very nice. I guess that any performance change when the commit-graph feature is not available is negligible / not measurable. Rel % = (before - after)/before * 100%, isn't it? Good. 
> > Helped-by: Jeff King <peff@peff.net> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > ref-filter.c | 24 ++++++++++++++++++++---- > 1 file changed, 20 insertions(+), 4 deletions(-) > > diff --git a/ref-filter.c b/ref-filter.c > index aff24d93be..fb35067fc9 100644 > --- a/ref-filter.c > +++ b/ref-filter.c > @@ -16,6 +16,7 @@ > #include "trailer.h" > #include "wt-status.h" > #include "commit-slab.h" > +#include "commit-graph.h" > > static struct ref_msg { > const char *gone; > @@ -1587,7 +1588,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) > */ > static enum contains_result contains_test(struct commit *candidate, > const struct commit_list *want, > - struct contains_cache *cache) > + struct contains_cache *cache, > + uint32_t cutoff) > { > enum contains_result *cached = contains_cache_at(cache, candidate); > > @@ -1603,6 +1605,10 @@ static enum contains_result contains_test(struct commit *candidate, > > /* Otherwise, we don't know; prepare to recurse */ > parse_commit_or_die(candidate); > + > + if (candidate->generation < cutoff) > + return CONTAINS_NO; > + We use here weaker negative-cut criteria, which has the advantage of simply automatic handling of special values: _INFINITY, _MAX, _ZERO. Stronger version: if A != B and A ---> B, then gen(A) > gen(B) if gen(A) <= gen(B) and A != B, then A -/-> B Weaker version: if gen(A) < gen(B), then A -/-> B If commit-graph feature is not available, then all generation numbers would be _INFINITY, and cutoff would also be _INFINITY - which means this operation is practically no-op. One memory access (probably from cache) and one comparison is very cheap. All right. 
> return CONTAINS_UNKNOWN; > } > > @@ -1618,8 +1624,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, > struct contains_cache *cache) > { > struct contains_stack contains_stack = { 0, 0, NULL }; > - enum contains_result result = contains_test(candidate, want, cache); > + enum contains_result result; > + uint32_t cutoff = GENERATION_NUMBER_INFINITY; > + const struct commit_list *p; > + > + for (p = want; p; p = p->next) { > + struct commit *c = p->item; > + load_commit_graph_info(c); > + if (c->generation < cutoff) > + cutoff = c->generation; > + } For each commit in wants, load generation numbers if needed and find the lowest one. Anything lower cannot reach any of wants. All right. If the commit-graph feature is not available, this is practically a no-op. It is fast, as it only accesses memory - it does not access disk, nor does it need to do any decompression, un-deltafication or parsing. All right. > > + result = contains_test(candidate, want, cache, cutoff); > if (result != CONTAINS_UNKNOWN) > return result; > > @@ -1637,7 +1653,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, > * If we just popped the stack, parents->item has been marked, > * therefore contains_test will return a meaningful yes/no. > */ > - else switch (contains_test(parents->item, want, cache)) { > + else switch (contains_test(parents->item, want, cache, cutoff)) { > case CONTAINS_YES: > *contains_cache_at(cache, commit) = CONTAINS_YES; > contains_stack.nr--; > @@ -1651,7 +1667,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, > } > } > free(contains_stack.contains_stack); > - return contains_test(candidate, want, cache); > + return contains_test(candidate, want, cache, cutoff); Those two just update the call sites to the new signature. All right. > } > > static int commit_contains(struct ref_filter *filter, struct commit *commit, ^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH v4 05/10] commit-graph: always load commit-graph information 2018-04-25 14:37 ` [PATCH v4 00/10] " Derrick Stolee ` (4 preceding siblings ...) 2018-04-25 14:37 ` [PATCH v4 06/10] ref-filter: use generation number for --contains Derrick Stolee @ 2018-04-25 14:37 ` Derrick Stolee 2018-04-29 22:14 ` Jakub Narebski 2018-04-29 22:18 ` Jakub Narebski 2018-04-25 14:37 ` [PATCH v4 07/10] commit: use generation numbers for in_merge_bases() Derrick Stolee ` (5 subsequent siblings) 11 siblings, 2 replies; 162+ messages in thread From: Derrick Stolee @ 2018-04-25 14:37 UTC (permalink / raw) To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit-graph.c | 45 ++++++++++++++++++++++++++++++--------------- commit-graph.h | 8 ++++++++ commit.c | 7 +++++-- commit.h | 2 +- object.c | 2 +- sha1_file.c | 2 +- 6 files changed, 46 insertions(+), 20 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 047fa9fca5..aebd242def 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return &commit_list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} + static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { uint32_t edge_value; @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin return 1; } +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) +{ + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { + *pos = item->graph_pos; + return 1; + } else { + return bsearch_graph(g, &(item->object.oid), pos); + } +} + int parse_commit_in_graph(struct commit *item) { + uint32_t pos; + if (!core_commit_graph) return 0; if (item->object.parsed) return 1; - prepare_commit_graph(); - if (commit_graph) { - uint32_t pos; - int found; - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { - pos = item->graph_pos; - found = 1; - } else { - found = bsearch_graph(commit_graph, &(item->object.oid), &pos); - } - - if (found) - return fill_commit_in_graph(item, commit_graph, pos); - } - + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) + return fill_commit_in_graph(item, commit_graph, pos); return 0; } +void load_commit_graph_info(struct commit *item) +{ + uint32_t pos; + if (!core_commit_graph) + return; + 
prepare_commit_graph(); + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) + fill_commit_graph_info(item, commit_graph, pos); +} + static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) { struct object_id oid; diff --git a/commit-graph.h b/commit-graph.h index 260a468e73..96cccb10f3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly + * checked and filled. Fill the graph_pos and generation members of + * the given commit. + */ +void load_commit_graph_info(struct commit *item); + struct tree *get_commit_tree_in_graph(const struct commit *c); struct commit_graph { diff --git a/commit.c b/commit.c index 4d00b0a1d6..39a3749abd 100644 --- a/commit.c +++ b/commit.c @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) return ret; } -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size) +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph) { const char *tail = buffer; const char *bufptr = buffer; @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s } item->date = parse_commit_date(bufptr, tail); + if (check_graph) + load_commit_graph_info(item); + return 0; } @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return error("Object %s not a commit", oid_to_hex(&item->object.oid)); } - ret = parse_commit_buffer(item, buffer, size); + ret = parse_commit_buffer(item, buffer, size, 0); if (save_commit_buffer && !ret) { set_commit_buffer(item, buffer, size); return 0; diff --git a/commit.h b/commit.h index 64436ff44e..b5afde1ae9 100644 --- a/commit.h +++ b/commit.h 
@@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); */ struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size); +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); int parse_commit_gently(struct commit *item, int quiet_on_missing); static inline int parse_commit(struct commit *item) { diff --git a/object.c b/object.c index e6ad3f61f0..efe4871325 100644 --- a/object.c +++ b/object.c @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type } else if (type == OBJ_COMMIT) { struct commit *commit = lookup_commit(oid); if (commit) { - if (parse_commit_buffer(commit, buffer, size)) + if (parse_commit_buffer(commit, buffer, size, 1)) return NULL; if (!get_cached_commit_buffer(commit, NULL)) { set_commit_buffer(commit, buffer, size); diff --git a/sha1_file.c b/sha1_file.c index 1b94f39c4c..0fd4f0b8b6 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size) { struct commit c; memset(&c, 0, sizeof(c)); - if (parse_commit_buffer(&c, buf, size)) + if (parse_commit_buffer(&c, buf, size, 0)) die("corrupt commit"); } -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
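The lookup pattern that the patch above factors out into find_commit_in_graph() is "use the cached position if there is one, otherwise binary-search the sorted table". A toy version on plain integers, where every name (struct toy_item, toy_bsearch(), toy_find_in_graph(), TOY_NOT_FROM_GRAPH) is a hypothetical stand-in and integer keys replace object IDs:

```c
#include <stdint.h>

/* Hypothetical stand-in for COMMIT_NOT_FROM_GRAPH. */
#define TOY_NOT_FROM_GRAPH 0xFFFFFFFF

struct toy_item {
	uint32_t key;        /* stands in for the object ID */
	uint32_t graph_pos;  /* cached position, or TOY_NOT_FROM_GRAPH */
};

/* Binary search over sorted keys; returns 1 and sets *pos when found. */
int toy_bsearch(const uint32_t *keys, uint32_t nr, uint32_t key, uint32_t *pos)
{
	uint32_t lo = 0, hi = nr;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;

		if (keys[mid] == key) {
			*pos = mid;
			return 1;
		}
		if (keys[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/* Prefer the cached position; search only when the item has none. */
int toy_find_in_graph(const struct toy_item *item, const uint32_t *keys,
		      uint32_t nr, uint32_t *pos)
{
	if (item->graph_pos != TOY_NOT_FROM_GRAPH) {
		*pos = item->graph_pos;
		return 1;
	}
	return toy_bsearch(keys, nr, item->key, pos);
}
```

Both callers in the patch, parse_commit_in_graph() and load_commit_graph_info(), share this lookup and differ only in what they fill in once the position is known.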
* Re: [PATCH v4 05/10] commit-graph: always load commit-graph information 2018-04-25 14:37 ` [PATCH v4 05/10] commit-graph: always load commit-graph information Derrick Stolee @ 2018-04-29 22:14 ` Jakub Narebski 2018-05-01 12:19 ` Derrick Stolee 2018-04-29 22:18 ` Jakub Narebski 1 sibling, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-29 22:14 UTC (permalink / raw) To: Derrick Stolee Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason Derrick Stolee <dstolee@microsoft.com> writes: > Most code paths load commits using lookup_commit() and then > parse_commit(). And this automatically loads commit graph if needed, thanks to changes in parse_commit_gently(), which parse_commit() uses. > In some cases, including some branch lookups, the commit > is parsed using parse_object_buffer() which side-steps parse_commit() in > favor of parse_commit_buffer(). I guess the problem is that we cannot just add parse_commit_in_graph() like we did in parse_commit_gently(), for some reason? Like for example that parse_commit_gently() uses parse_commit_buffer() - which could have been handled by moving parse_commit_in_graph() down the call chain from parse_commit_gently() to parse_commit_buffer()... if not the fact that check_commit() also uses parse_commit_buffer(), but it does not want to load commit graph. Am I right? > > With generation numbers in the commit-graph, we need to ensure that any > commit that exists in the commit-graph file has its generation number > loaded. Is it generation number, or generation number and position in commit graph? > > Create new load_commit_graph_info() method to fill in the information > for a commit that exists only in the commit-graph file. Call it from > parse_commit_buffer() after loading the other commit information from > the given buffer. Only fill this information when specified by the > 'check_graph' parameter. 
I think this commit would be easier to review if the pure refactoring part (extracting fill_commit_graph_info() and find_commit_in_graph()) were split into its own patch. On the other hand the refactoring was needed to reduce code duplication between the existing parse_commit_in_graph() and the new load_commit_graph_info() functions. I guess that the difference between parse_commit_in_graph() and load_commit_graph_info() is that the former cares only about having just enough information that is needed for parse_commit_gently() - and does not load graph data if the commit is already parsed, while the latter is about loading commit-graph data like generation numbers. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > commit-graph.c | 45 ++++++++++++++++++++++++++++++--------------- > commit-graph.h | 8 ++++++++ > commit.c | 7 +++++-- > commit.h | 2 +- > object.c | 2 +- > sha1_file.c | 2 +- > 6 files changed, 46 insertions(+), 20 deletions(-) I wonder if it would be possible to add tests for this feature, for example that the commit-graph is read when it should be (including those branch lookups), and is not read when the feature should be disabled. But the only way to test it I can think of is a stupid one: create an invalid commit graph, and check that git fails as expected (trying to read said malformed file), and does not fail if the commit graph feature is disabled. > Let me reorder files (BTW, is there a way for Git to put *.h files before *.c files in diff?) for easier review: > diff --git a/commit-graph.h b/commit-graph.h > index 260a468e73..96cccb10f3 100644 > --- a/commit-graph.h > +++ b/commit-graph.h > @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); > */ > int parse_commit_in_graph(struct commit *item); > > +/* > + * It is possible that we loaded commit contents from the commit buffer, > + * but we also want to ensure the commit-graph content is correctly > + * checked and filled. Fill the graph_pos and generation members of > + * the given commit. 
> + */ > +void load_commit_graph_info(struct commit *item); > + > struct tree *get_commit_tree_in_graph(const struct commit *c); > > struct commit_graph { > diff --git a/commit-graph.c b/commit-graph.c > index 047fa9fca5..aebd242def 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, > return &commit_list_insert(c, pptr)->next; > } > > +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) > +{ > + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; > + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; > +} The comment in the header file commit-graph.h talks about filling graph_pos and generation members of the given commit, but I don't see filling graph_pos member here. Sidenote: it is a tiny little bit strange to see symbolic constants like GRAPH_DATA_WIDTH near using magic values such as 8 and 2. > + > static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) > { > uint32_t edge_value; > @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin > return 1; > } > > +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) > +{ > + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { > + *pos = item->graph_pos; > + return 1; > + } else { > + return bsearch_graph(g, &(item->object.oid), pos); > + } > +} Nice refactoring here. 
> + > int parse_commit_in_graph(struct commit *item) > { > + uint32_t pos; > + > if (!core_commit_graph) > return 0; > if (item->object.parsed) > return 1; > - > prepare_commit_graph(); > - if (commit_graph) { > - uint32_t pos; > - int found; > - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { > - pos = item->graph_pos; > - found = 1; > - } else { > - found = bsearch_graph(commit_graph, &(item->object.oid), &pos); > - } > - > - if (found) > - return fill_commit_in_graph(item, commit_graph, pos); > - } > - > + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) > + return fill_commit_in_graph(item, commit_graph, pos); > return 0; > } > > +void load_commit_graph_info(struct commit *item) > +{ > + uint32_t pos; > + if (!core_commit_graph) > + return; > + prepare_commit_graph(); > + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) > + fill_commit_graph_info(item, commit_graph, pos); > +} Similar functions, different goals (as the names imply). > + > static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) > { > struct object_id oid; > diff --git a/commit.c b/commit.c > index 4d00b0a1d6..39a3749abd 100644 > --- a/commit.c > +++ b/commit.c > @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) > return ret; > } > > -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size) > +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph) > { > const char *tail = buffer; > const char *bufptr = buffer; > @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s > } > item->date = parse_commit_date(bufptr, tail); > > + if (check_graph) > + load_commit_graph_info(item); > + All right, read commit-graph specific data after parsing commit itself. 
It is at the end because commit object needs to be parsed sequentially, and it includes more info that is contained in commit-graph CDAT+EDGE data. > return 0; > } > > @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) > return error("Object %s not a commit", > oid_to_hex(&item->object.oid)); > } > - ret = parse_commit_buffer(item, buffer, size); > + ret = parse_commit_buffer(item, buffer, size, 0); The parse_commit_gently() contract is that it provides only bare minimum of information, from commit-graph if possible, and does read object from disk and parses it only when it could not avoid it. If it needs to parse it, it doesn't need to fill commit-graph specific data again. All right. > if (save_commit_buffer && !ret) { > set_commit_buffer(item, buffer, size); > return 0; > diff --git a/commit.h b/commit.h > index 64436ff44e..b5afde1ae9 100644 > --- a/commit.h > +++ b/commit.h > @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); > */ > struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); > > -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size); > +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); > int parse_commit_gently(struct commit *item, int quiet_on_missing); > static inline int parse_commit(struct commit *item) > { > diff --git a/object.c b/object.c > index e6ad3f61f0..efe4871325 100644 > --- a/object.c > +++ b/object.c > @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type > } else if (type == OBJ_COMMIT) { > struct commit *commit = lookup_commit(oid); > if (commit) { > - if (parse_commit_buffer(commit, buffer, size)) > + if (parse_commit_buffer(commit, buffer, size, 1)) All that rigamarole was needed because of DS> In some cases, including some branch lookups, the commit DS> is parsed using parse_object_buffer() which 
side-steps parse_commit() in DS> favor of parse_commit_buffer(). Here we want parse_object_buffer() to get also commit-graph specific data, if available. All right. > return NULL; > if (!get_cached_commit_buffer(commit, NULL)) { > set_commit_buffer(commit, buffer, size); > diff --git a/sha1_file.c b/sha1_file.c > index 1b94f39c4c..0fd4f0b8b6 100644 > --- a/sha1_file.c > +++ b/sha1_file.c > @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size) > { > struct commit c; > memset(&c, 0, sizeof(c)); > - if (parse_commit_buffer(&c, buf, size)) > + if (parse_commit_buffer(&c, buf, size, 0)) For check we don't need commit graph data. Looks all right. > die("corrupt commit"); > } Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v4 05/10] commit-graph: always load commit-graph information 2018-04-29 22:14 ` Jakub Narebski @ 2018-05-01 12:19 ` Derrick Stolee 0 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:19 UTC (permalink / raw) To: Jakub Narebski, Derrick Stolee Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason On 4/29/2018 6:14 PM, Jakub Narebski wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> Most code paths load commits using lookup_commit() and then >> parse_commit(). > And this automatically loads commit graph if needed, thanks to changes > in parse_commit_gently(), which parse_commit() uses. > >> In some cases, including some branch lookups, the commit >> is parsed using parse_object_buffer() which side-steps parse_commit() in >> favor of parse_commit_buffer(). > I guess the problem is that we cannot just add parse_commit_in_graph() > like we did in parse_commit_gently(), for some reason? Like for example > that parse_commit_gently() uses parse_commit_buffer() - which could have > been handled by moving parse_commit_in_graph() down the call chain from > parse_commit_gently() to parse_commit_buffer()... if not the fact that > check_commit() also uses parse_commit_buffer(), but it does not want to > load commit graph. Am I right? If a caller uses parse_commit_buffer() directly, then we will guarantee that all values in the struct commit that would be loaded from the buffer are loaded from the buffer. This means we do NOT load the root tree id or commit date from the commit-graph file. We do still need to load the data that is not available in the buffer, such as graph_pos and generation. > >> With generation numbers in the commit-graph, we need to ensure that any >> commit that exists in the commit-graph file has its generation number >> loaded. > Is it generation number, or generation number and position in commit > graph? 
We don't need to ensure the graph_pos (the commit will never be re-parsed, so we will not try to find it in the commit-graph file again), but we DO need to ensure the generation (or our commit walks will be incorrect). We get the graph_pos as a side-effect. > >> Create new load_commit_graph_info() method to fill in the information >> for a commit that exists only in the commit-graph file. Call it from >> parse_commit_buffer() after loading the other commit information from >> the given buffer. Only fill this information when specified by the >> 'check_graph' parameter. > I think this commit would be easier to review if it was split into pure > refactoring part (extracting fill_commit_graph_info() and > find_commit_in_graph()). On the other hand the refactoring was needed > to reduce code duplication between existing parse_commit_in_graph() and > new load_commit_graph_info() functions. > > I guess that the difference between parse_commit_in_graph() and > load_commit_graph_info() is that the former cares only about having just > enough information that is needed for parse_commit_gently() - and does > not load graph data if commit is parsed, while the latter is about > loading commit-graph data like generation numbers. > >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> >> --- >> commit-graph.c | 45 ++++++++++++++++++++++++++++++--------------- >> commit-graph.h | 8 ++++++++ >> commit.c | 7 +++++-- >> commit.h | 2 +- >> object.c | 2 +- >> sha1_file.c | 2 +- >> 6 files changed, 46 insertions(+), 20 deletions(-) > I wonder if it would be possible to add tests for this feature, for > example that commit-graph is read when it should (including those branch > lookups), and is not read when the feature should be disabled. > > But the only way to test it I can think of is a stupid one: create > invalid commit graph, and check that git fails as expected (trying to > read said malformed file), and does not fail if commit graph feature is > disabled.
> > Let me reorder files (BTW, is there a way for Git to put *.h files > before *.c files in diff?) for easier review: > >> diff --git a/commit-graph.h b/commit-graph.h >> index 260a468e73..96cccb10f3 100644 >> --- a/commit-graph.h >> +++ b/commit-graph.h >> @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); >> */ >> int parse_commit_in_graph(struct commit *item); >> >> +/* >> + * It is possible that we loaded commit contents from the commit buffer, >> + * but we also want to ensure the commit-graph content is correctly >> + * checked and filled. Fill the graph_pos and generation members of >> + * the given commit. >> + */ >> +void load_commit_graph_info(struct commit *item); >> + >> struct tree *get_commit_tree_in_graph(const struct commit *c); >> >> struct commit_graph { >> diff --git a/commit-graph.c b/commit-graph.c >> index 047fa9fca5..aebd242def 100644 >> --- a/commit-graph.c >> +++ b/commit-graph.c >> @@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, >> return &commit_list_insert(c, pptr)->next; >> } >> >> +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) >> +{ >> + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; >> + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; >> +} > The comment in the header file commit-graph.h talks about filling > graph_pos and generation members of the given commit, but I don't see > filling graph_pos member here. We are missing the following line: + item->graph_pos = pos; I will add it for v5. The equivalent line exists in fill_commit_in_graph(). > > Sidenote: it is a tiny little bit strange to see symbolic constants like > GRAPH_DATA_WIDTH near using magic values such as 8 and 2. There needs to be some boundary between abstraction and concreteness when dealing directly with a binary file format. 
GRAPH_DATA_WIDTH helps us navigate to the correct "row" in the chunk, while we use the constants 8 and 2 to get the correct "column" out of that row. > >> + >> static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) >> { >> uint32_t edge_value; >> @@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin >> return 1; >> } >> >> +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) >> +{ >> + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { >> + *pos = item->graph_pos; >> + return 1; >> + } else { >> + return bsearch_graph(g, &(item->object.oid), pos); >> + } >> +} > Nice refactoring here. > >> + >> int parse_commit_in_graph(struct commit *item) >> { >> + uint32_t pos; >> + >> if (!core_commit_graph) >> return 0; >> if (item->object.parsed) >> return 1; >> - >> prepare_commit_graph(); >> - if (commit_graph) { >> - uint32_t pos; >> - int found; >> - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { >> - pos = item->graph_pos; >> - found = 1; >> - } else { >> - found = bsearch_graph(commit_graph, &(item->object.oid), &pos); >> - } >> - >> - if (found) >> - return fill_commit_in_graph(item, commit_graph, pos); >> - } >> - >> + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) >> + return fill_commit_in_graph(item, commit_graph, pos); >> return 0; >> } >> >> +void load_commit_graph_info(struct commit *item) >> +{ >> + uint32_t pos; >> + if (!core_commit_graph) >> + return; >> + prepare_commit_graph(); >> + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) >> + fill_commit_graph_info(item, commit_graph, pos); >> +} > Similar functions, different goals (as the names imply). 
> >> + >> static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) >> { >> struct object_id oid; >> diff --git a/commit.c b/commit.c >> index 4d00b0a1d6..39a3749abd 100644 >> --- a/commit.c >> +++ b/commit.c >> @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) >> return ret; >> } >> >> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size) >> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph) >> { >> const char *tail = buffer; >> const char *bufptr = buffer; >> @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s >> } >> item->date = parse_commit_date(bufptr, tail); >> >> + if (check_graph) >> + load_commit_graph_info(item); >> + > All right, read commit-graph specific data after parsing commit itself. > It is at the end because commit object needs to be parsed sequentially, > and it includes more info that is contained in commit-graph CDAT+EDGE > data. > >> return 0; >> } >> >> @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) >> return error("Object %s not a commit", >> oid_to_hex(&item->object.oid)); >> } >> - ret = parse_commit_buffer(item, buffer, size); >> + ret = parse_commit_buffer(item, buffer, size, 0); > The parse_commit_gently() contract is that it provides only bare minimum > of information, from commit-graph if possible, and does read object from > disk and parses it only when it could not avoid it. If it needs to > parse it, it doesn't need to fill commit-graph specific data again. > > All right. 
> >> if (save_commit_buffer && !ret) { >> set_commit_buffer(item, buffer, size); >> return 0; >> diff --git a/commit.h b/commit.h >> index 64436ff44e..b5afde1ae9 100644 >> --- a/commit.h >> +++ b/commit.h >> @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); >> */ >> struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); >> >> -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size); >> +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); >> int parse_commit_gently(struct commit *item, int quiet_on_missing); >> static inline int parse_commit(struct commit *item) >> { >> diff --git a/object.c b/object.c >> index e6ad3f61f0..efe4871325 100644 >> --- a/object.c >> +++ b/object.c >> @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type >> } else if (type == OBJ_COMMIT) { >> struct commit *commit = lookup_commit(oid); >> if (commit) { >> - if (parse_commit_buffer(commit, buffer, size)) >> + if (parse_commit_buffer(commit, buffer, size, 1)) > All that rigamarole was needed because of > > DS> In some cases, including some branch lookups, the commit > DS> is parsed using parse_object_buffer() which side-steps parse_commit() in > DS> favor of parse_commit_buffer(). > > Here we want parse_object_buffer() to get also commit-graph specific > data, if available. All right. > >> return NULL; >> if (!get_cached_commit_buffer(commit, NULL)) { >> set_commit_buffer(commit, buffer, size); >> diff --git a/sha1_file.c b/sha1_file.c >> index 1b94f39c4c..0fd4f0b8b6 100644 >> --- a/sha1_file.c >> +++ b/sha1_file.c >> @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size) >> { >> struct commit c; >> memset(&c, 0, sizeof(c)); >> - if (parse_commit_buffer(&c, buf, size)) >> + if (parse_commit_buffer(&c, buf, size, 0)) > For check we don't need commit graph data. Looks all right. 
> >> die("corrupt commit"); >> } > Best, ^ permalink raw reply [flat|nested] 162+ messages in thread
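The "row" and "column" description in the reply above can be illustrated with a small standalone sketch. It assumes a SHA-1 repository (20-byte hashes) and a commit data row made of the root tree hash, two 4-byte parent positions, and an 8-byte field whose top 30 bits hold the generation number. The names `HASH_LEN`, `read_be32()` and `row_generation()` are hypothetical stand-ins chosen for this example, not the identifiers used in the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_LEN 20				/* assumption: SHA-1 hashes */
#define GRAPH_DATA_WIDTH (HASH_LEN + 16)	/* tree hash + 2 parents + 8 bytes */

/* Read a 32-bit big-endian value; a stand-in for git's get_be32(). */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * GRAPH_DATA_WIDTH * pos selects the row for the commit at position pos.
 * HASH_LEN + 8 skips the tree hash and both parent positions, which
 * selects the "column" holding the generation number. The low two bits
 * of that word belong to the commit date, so they are shifted away.
 */
static uint32_t row_generation(const unsigned char *chunk, uint32_t pos)
{
	const unsigned char *row = chunk + (size_t)GRAPH_DATA_WIDTH * pos;

	return read_be32(row + HASH_LEN + 8) >> 2;
}
```

With this layout, a row whose word at offset `HASH_LEN + 8` holds the big-endian value 21 (binary 10101) decodes to generation 5.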
* Re: [PATCH v4 05/10] commit-graph: always load commit-graph information
From: Jakub Narebski @ 2018-04-29 22:18 UTC
To: Derrick Stolee
Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

[Forgot about one thing]

Derrick Stolee <dstolee@microsoft.com> writes:

> Create new load_commit_graph_info() method to fill in the information
> for a commit that exists only in the commit-graph file.

The above sentence is a bit hard to parse because of ambiguity: is it
"the information" that exists only in the commit-graph file, or "a
commit" that exists only in the commit-graph file?

> Call it from
> parse_commit_buffer() after loading the other commit information from
> the given buffer. Only fill this information when specified by the
> 'check_graph' parameter.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

--
Jakub Narębski
* [PATCH v4 07/10] commit: use generation numbers for in_merge_bases()
From: Derrick Stolee @ 2018-04-25 14:37 UTC
To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the initial commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 39a3749abd..7bb007f56a 100644
--- a/commit.c
+++ b/commit.c
@@ -1056,12 +1056,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 {
 	struct commit_list *bases;
 	int ret = 0, i;
+	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
 	if (parse_commit(commit))
 		return ret;
-	for (i = 0; i < nr_reference; i++)
+	for (i = 0; i < nr_reference; i++) {
 		if (parse_commit(reference[i]))
 			return ret;
+		if (min_generation > reference[i]->generation)
+			min_generation = reference[i]->generation;
+	}
+
+	if (commit->generation > min_generation)
+		return ret;
 
 	bases = paint_down_to_common(commit, nr_reference, reference);
 	if (commit->object.flags & PARENT2)
-- 
2.17.0.39.g685157f7fb
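The negative cut described in the patch above can be sketched outside of Git as follows. This is a minimal illustration, not the patch's code: `struct toy_commit` and `can_skip_walk()` are hypothetical names, and the sketch assumes every generation number comes from one consistent commit-graph file. Note that the sketch compares against the largest reference generation, which differs from the posted patch (it tracks the smallest); a commit can only be ruled out as an ancestor of every reference when its generation exceeds all of theirs.

```c
#include <stdint.h>

struct toy_commit {
	uint32_t generation;	/* 1 + max generation of parents; roots get 1 */
};

/*
 * Return 1 when generation numbers alone prove that "commit" is not an
 * ancestor of any commit in refs[], so the merge-base walk can be skipped.
 * Return 0 when the numbers are inconclusive and a walk is still needed.
 */
static int can_skip_walk(const struct toy_commit *commit,
			 const struct toy_commit **refs, int nr)
{
	uint32_t max_generation = 0;
	int i;

	for (i = 0; i < nr; i++)
		if (refs[i]->generation > max_generation)
			max_generation = refs[i]->generation;

	/* A proper ancestor always has a strictly smaller generation. */
	return commit->generation > max_generation;
}
```

For a single reference the two bounds coincide, which is the case exercised by the 'git branch --contains' timing quoted in the commit message.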
* Re: [PATCH v4 07/10] commit: use generation numbers for in_merge_bases()
From: Jakub Narebski @ 2018-04-30 17:05 UTC
To: Derrick Stolee
Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

Derrick Stolee <dstolee@microsoft.com> writes:

> The containment algorithm for 'git branch --contains' is different
> from that for 'git tag --contains' in that it uses is_descendant_of()
> instead of contains_tag_algo(). The expensive portion of the branch
> algorithm is computing merge bases.
>
> When a commit-graph file exists with generation numbers computed,
> we can avoid this merge-base calculation when the target commit has
> a larger generation number than the initial commits.

Right.

>
> Performance tests were run on a copy of the Linux repository where
> HEAD is contained in v4.13 but no earlier tag. Also, all tags were
> copied to branches and 'git branch --contains' was tested:

I guess that it is equivalent of 'git tag --contains' setup from
previous commit, just for 'git branch --contains', isn't it?

>
> Before: 60.0s
> After:   0.4s
> Rel %: -99.3%

Very nice.

Sidenote: an alternative to using "Rel %" of -99.3% (which is calculated
as (after-before)/before) would be to use "Speedup" of 150x (calculated
as before/after). On one hand it might be more readable, on the other
hand it might be a bit misleading.

Yet another alternative would be to use a chart like the following:

          time    Before    After
  Before  60.0s   --        -99.3%
  After    0.4s   +14900%   --

Anyway, consistency in presentation in patch series is good. So I am
for keeping your notation throughout the series.

>
> Reported-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index 39a3749abd..7bb007f56a 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -1056,12 +1056,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *

Let's give it a bit more context:

  /*
   * Is "commit" an ancestor of one of the "references"?
   */
  int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit **reference)

>  {
>  	struct commit_list *bases;
>  	int ret = 0, i;
> +	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
>
>  	if (parse_commit(commit))
>  		return ret;
> -	for (i = 0; i < nr_reference; i++)
> +	for (i = 0; i < nr_reference; i++) {
>  		if (parse_commit(reference[i]))
>  			return ret;

We use parse_commit(), so there is no need for calling
load_commit_graph_info(), like in previous patch. All right.

> +		if (min_generation > reference[i]->generation)

At first glance, I thought it was wrong; but it is the same as the
following, it is just a matter of taste (which feels more natural):

  +		if (reference[i]->generation < min_generation)

> +			min_generation = reference[i]->generation;
> +	}
> +
> +	if (commit->generation > min_generation)
> +		return ret;

All right, using weak version of generation numbers based negative-cut
nicely handles automatically all corner-cases, including the case where
commit-graph feature is turned off.

If commit-graph feature is not available, it costs only a few more
memory accesses and comparisons than before, and performance is
dominated by something else anyway. Negligible and possibly
unnoticeable change, I guess.

Good.

>
>  	bases = paint_down_to_common(commit, nr_reference, reference);
>  	if (commit->object.flags & PARENT2)
* [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()
From: Derrick Stolee @ 2018-04-25 14:38 UTC
To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee

When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with lower
generation number are not needed to correctly answer the
containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 7bb007f56a..e2e16ea1a7 100644
--- a/commit.c
+++ b/commit.c
@@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+						struct commit **twos,
+						int min_generation)
 {
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 	struct commit_list *result = NULL;
 	int i;
+	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
 
 	one->object.flags |= PARENT1;
 	if (!n) {
@@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
 		struct commit_list *parents;
 		int flags;
 
+		if (commit->generation > last_gen)
+			BUG("bad generation skip");
+		last_gen = commit->generation;
+
+		if (commit->generation < min_generation)
+			break;
+
 		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
 		if (flags == (PARENT1 | PARENT2)) {
 			if (!(commit->object.flags & RESULT)) {
@@ -879,7 +889,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
 			return NULL;
 	}
 
-	list = paint_down_to_common(one, n, twos);
+	list = paint_down_to_common(one, n, twos, 0);
 
 	while (list) {
 		struct commit *commit = pop_commit(&list);
@@ -946,7 +956,7 @@ static int remove_redundant(struct commit **array, int cnt)
 			filled_index[filled] = j;
 			work[filled++] = array[j];
 		}
-		common = paint_down_to_common(array[i], filled, work);
+		common = paint_down_to_common(array[i], filled, work, 0);
 		if (array[i]->object.flags & PARENT2)
 			redundant[i] = 1;
 		for (j = 0; j < filled; j++)
@@ -1070,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
 	if (commit->generation > min_generation)
 		return ret;
 
-	bases = paint_down_to_common(commit, nr_reference, reference);
+	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
 	if (commit->object.flags & PARENT2)
 		ret = 1;
 	clear_commit_marks(commit, all_flags);
-- 
2.17.0.39.g685157f7fb
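The effect of the min_generation cutoff in the patch above can be seen on a toy history. The following is a simplified sketch under stated assumptions, not the paint_down_to_common() implementation: it does no flag painting, it uses a linear scan in place of a priority queue, and `struct toy_commit`, `pop_max_generation()` and `reaches()` are hypothetical names. It only shows why a walk ordered by decreasing generation may stop at the first commit below the cutoff.

```c
#include <stdint.h>

#define MAX_COMMITS 16
#define MAX_PARENTS 2
#define NO_PARENT (-1)

struct toy_commit {
	uint32_t generation;
	int parents[MAX_PARENTS];	/* indices into the commit array */
};

/* Remove and return the queued commit with the largest generation, or -1. */
static int pop_max_generation(const struct toy_commit *c, int *queued, int n)
{
	int best = -1, i;

	for (i = 0; i < n; i++)
		if (queued[i] && (best < 0 || c[i].generation > c[best].generation))
			best = i;
	if (best >= 0)
		queued[best] = 0;
	return best;
}

/*
 * Return 1 if "target" is reachable from "start", 0 if not, -1 on bad input.
 * Commits are visited in order of decreasing generation, so once a popped
 * commit falls below min_generation every remaining one does too and the
 * walk can stop. *walked counts the commits actually examined.
 */
static int reaches(const struct toy_commit *c, int n, int start, int target,
		   uint32_t min_generation, int *walked)
{
	int queued[MAX_COMMITS] = {0}, seen[MAX_COMMITS] = {0};
	int cur, k;

	*walked = 0;
	if (n > MAX_COMMITS || start < 0 || start >= n)
		return -1;
	queued[start] = seen[start] = 1;

	while ((cur = pop_max_generation(c, queued, n)) >= 0) {
		if (c[cur].generation < min_generation)
			break;
		(*walked)++;
		if (cur == target)
			return 1;
		for (k = 0; k < MAX_PARENTS; k++) {
			int p = c[cur].parents[k];
			if (p != NO_PARENT && !seen[p])
				queued[p] = seen[p] = 1;
		}
	}
	return 0;
}
```

Passing 0 as min_generation disables the cutoff, mirroring the "no functional change" case in the commit message; passing the target's own generation mirrors the in_merge_bases_many() call site.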
* Re: [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()
From: Jakub Narebski @ 2018-04-30 22:19 UTC
To: Derrick Stolee
Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

Derrick Stolee <dstolee@microsoft.com> writes:

> When running 'git branch --contains', the in_merge_bases_many()
> method calls paint_down_to_common() to discover if a specific
> commit is reachable from a set of branches. Commits with lower
> generation number are not needed to correctly answer the
> containment query of in_merge_bases_many().
>
> Add a new parameter, min_generation, to paint_down_to_common() that
> prevents walking commits with generation number strictly less than
> min_generation. If 0 is given, then there is no functional change.

This is thanks to the fact that generation numbers start at zero (as
special case, though, with _ZERO), and we use strict inequality to avoid
handling _ZERO etc. in a special way.

As you wrote in response in previous version of this series, because
paint_down_to_common() is file-local, there is no need to come up with
symbolic name for GENERATION_NO_CUTOFF case.

All right.

>
> For in_merge_bases_many(), we can pass commit->generation as the
> cutoff, and this saves time during 'git branch --contains' queries
> that would otherwise walk "around" the commit we are inspecting.

All right, and when using paint_down_to_common() to actually find merge
bases, and not only check for containment, we cannot use cutoff.
Therefore at least one call site needs to run it without functional
change... which we can do. Good.

>
> For a copy of the Linux repository, where HEAD is checked out at
> v4.13~100, we get the following performance improvement for
> 'git branch --contains' over the previous commit:
>
> Before: 0.21s
> After:  0.13s
> Rel %: -38%

Nice.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)

Let me reorder chunks a bit to make it easier to review.

>
> diff --git a/commit.c b/commit.c
> index 7bb007f56a..e2e16ea1a7 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -1070,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit *
>  	if (commit->generation > min_generation)
>  		return ret;
>
> -	bases = paint_down_to_common(commit, nr_reference, reference);
> +	bases = paint_down_to_common(commit, nr_reference, reference, commit->generation);
>  	if (commit->object.flags & PARENT2)
>  		ret = 1;
>  	clear_commit_marks(commit, all_flags);
> @@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
>  }
>
>  /* all input commits in one and twos[] must have been parsed! */
> -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos)
> +static struct commit_list *paint_down_to_common(struct commit *one, int n,
> +						struct commit **twos,
> +						int min_generation)
>  {
>  	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>  	struct commit_list *result = NULL;
>  	int i;
> +	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
>
>  	one->object.flags |= PARENT1;
>  	if (!n) {
> @@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc
>  		struct commit_list *parents;
>  		int flags;
>
> +		if (commit->generation > last_gen)
> +			BUG("bad generation skip");
> +		last_gen = commit->generation;

Shouldn't we provide more information about where the problem is to the
user, to make it easier to debug the repository / commit-graph data?

Good to have this sanity check here.

> +
> +		if (commit->generation < min_generation)
> +			break;

So the reasoning for this, as far as I understand, is the following.
Please correct me if I am wrong.

The callsite with non-zero min_generation, in_merge_bases_many(), tries
to find out if "commit" is an ancestor of one of the "references". At
least one of "references" is above "commit", so in_merge_bases_many()
uses paint_down_to_common() - but is interested only if "commit" was
painted as reachable from one of "references".

Thus we can interrupt the walk if we know that none of [considered]
commits in the queue can reach "commit"/"one" - as if they were all
STALE.

The search is done using priority queue (a bit like in Dijkstra's
algorithm), with newer commits - with larger generation numbers -
considered first. Thus if current commit has generation number less
than min_generation cutoff, i.e. if it is below "commit", then all
remaining commits in the queue are below cutoff.

Good.

> +
>  		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
>  		if (flags == (PARENT1 | PARENT2)) {
>  			if (!(commit->object.flags & RESULT)) {
> @@ -879,7 +889,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co
>  			return NULL;
>  	}
>
> -	list = paint_down_to_common(one, n, twos);
> +	list = paint_down_to_common(one, n, twos, 0);

When calculating merge bases there is no such possibility of an early
return due to generation number cutoff. All right then.

>
>  	while (list) {
>  		struct commit *commit = pop_commit(&list);
> @@ -946,7 +956,7 @@ static int remove_redundant(struct commit **array, int cnt)
>  			filled_index[filled] = j;
>  			work[filled++] = array[j];
>  		}
> -		common = paint_down_to_common(array[i], filled, work);
> +		common = paint_down_to_common(array[i], filled, work, 0);

Here we are interested not only if "one"/array[i] is reachable from
"twos"/work, but also if "twos" is reachable from "one". Simple cutoff
only works in one way, though I wonder if we couldn't use cutoff being
minimum generation number of "one" and "twos" together.

But that may be left for a separate commit (after checking that the
above is correct).

Not as simple and obvious as paint_down_to_common() used in
in_merge_bases_many(), so it is all right.

>  		if (array[i]->object.flags & PARENT2)
>  			redundant[i] = 1;
>  		for (j = 0; j < filled; j++)
* Re: [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common() 2018-04-30 22:19 ` Jakub Narebski @ 2018-05-01 11:47 ` Derrick Stolee 2018-05-02 13:05 ` Jakub Narebski 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 11:47 UTC (permalink / raw) To: Jakub Narebski, Derrick Stolee Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason On 4/30/2018 6:19 PM, Jakub Narebski wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> When running 'git branch --contains', the in_merge_bases_many() >> method calls paint_down_to_common() to discover if a specific >> commit is reachable from a set of branches. Commits with lower >> generation number are not needed to correctly answer the >> containment query of in_merge_bases_many(). >> >> Add a new parameter, min_generation, to paint_down_to_common() that >> prevents walking commits with generation number strictly less than >> min_generation. If 0 is given, then there is no functional change. > This is thanks to the fact that generation numbers start at zero (as > special case, though, with _ZERO), and we use strict inequality to avoid > handling _ZERO etc. in a special way. > > As you wrote in response in previous version of this series, because > paint_down_to_common() is file-local, there is no need to come up with > symbolic name for GENERATION_NO_CUTOFF case. > > All right. > >> For in_merge_bases_many(), we can pass commit->generation as the >> cutoff, and this saves time during 'git branch --contains' queries >> that would otherwise walk "around" the commit we are inspecting. > All right, and when using paint_down_to_common() to actually find merge > bases, and not only check for containment, we cannot use cutoff. > Therefore at least one call site needs to run it without functional > change... which we can do. Good. 
> >> For a copy of the Linux repository, where HEAD is checked out at >> v4.13~100, we get the following performance improvement for >> 'git branch --contains' over the previous commit: >> >> Before: 0.21s >> After: 0.13s >> Rel %: -38% > Nice. > >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> >> --- >> commit.c | 18 ++++++++++++++---- >> 1 file changed, 14 insertions(+), 4 deletions(-) > Let me reorder chunks a bit to make it easier to review. > >> diff --git a/commit.c b/commit.c >> index 7bb007f56a..e2e16ea1a7 100644 >> --- a/commit.c >> +++ b/commit.c >> @@ -1070,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * >> if (commit->generation > min_generation) >> return ret; >> >> - bases = paint_down_to_common(commit, nr_reference, reference); >> + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); >> if (commit->object.flags & PARENT2) >> ret = 1; >> clear_commit_marks(commit, all_flags); >> @@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue) >> } >> >> /* all input commits in one and twos[] must have been parsed! 
*/ >> -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) >> +static struct commit_list *paint_down_to_common(struct commit *one, int n, >> + struct commit **twos, >> + int min_generation) >> { >> struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; >> struct commit_list *result = NULL; >> int i; >> + uint32_t last_gen = GENERATION_NUMBER_INFINITY; >> >> one->object.flags |= PARENT1; >> if (!n) { >> @@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc >> struct commit_list *parents; >> int flags; >> >> + if (commit->generation > last_gen) >> + BUG("bad generation skip"); >> + last_gen = commit->generation; > Shouldn't we provide more information about where the problem is to the > user, to make it easier to debug the repository / commit-graph data? > > Good to have this sanity check here. This BUG() _should_ only be seen by developers who add callers which do not load commits from the commit-graph file. There is a chance that there are cases not covered by this patch and the added tests, though. Hopefully we catch them all by dogfooding the feature before turning it on by default. I can add the following to help debug these bad situations: + BUG("bad generation skip %d > %d at %s", + commit->generation, last_gen, + oid_to_hex(&commit->object.oid)); > >> + >> + if (commit->generation < min_generation) >> + break; > So the reasoning for this, as far as I understand, is the following. > Please correct me if I am wrong. > > The callsite with non-zero min_generation, in_merge_bases_many(), tries > to find out if "commit" is an ancestor of one of the "references". At > least one of "references" is above "commit", so in_merge_bases_many() > uses paint_down_to_common() - but is interested only if "commit" was > painted as reachable from one of "references". 
> > Thus we can interrupt the walk if we know that none of [considered] > commits in the queue can reach "commit"/"one" - as if they were all > STALE. > > The search is done using priority queue (a bit like in Dijkstra > algorithm), with newer commits - with larger generation numbers - > considered first. Thus if current commit has generation number less > than min_generation cutoff, i.e. if it is below "commit", then all > remaining commits in the queue are below cutoff. > > Good. > >> + >> flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); >> if (flags == (PARENT1 | PARENT2)) { >> if (!(commit->object.flags & RESULT)) { >> @@ -879,7 +889,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co >> return NULL; >> } >> >> - list = paint_down_to_common(one, n, twos); >> + list = paint_down_to_common(one, n, twos, 0); > When calculating merge bases there is no such possibility of an early > return due to generation number cutoff. All right then. > >> >> while (list) { >> struct commit *commit = pop_commit(&list); >> @@ -946,7 +956,7 @@ static int remove_redundant(struct commit **array, int cnt) >> filled_index[filled] = j; >> work[filled++] = array[j]; >> } >> - common = paint_down_to_common(array[i], filled, work); >> + common = paint_down_to_common(array[i], filled, work, 0); > Here we are interested not only if "one"/array[i] is reachable from > "twos"/work, but also if "twos" is reachable from "one". Simple cutoff > only works in one way, though I wonder if we couldn't use cutoff being > minimum generation number of "one" and "twos" together. > > But that may be left for a separate commit (after checking that the > above is correct). > > Not as simple and obvious as paint_down_to_common() used in > in_merge_bases_any(), so it is all right. Thanks for reporting this. Since we are only concerned about reachability in this method, it is a good candidate to use min_generation. 
It is also subtle enough that we should leave it as a separate commit. Also, we can measure performance improvements separately, as I will mention in my commit message (but I'll copy it here): For a copy of the Linux repository, we measured the following performance improvements: git merge-base v3.3 v4.5 Before: 234 ms After: 208 ms Rel %: -11% git merge-base v4.3 v4.5 Before: 102 ms After: 83 ms Rel %: -19% The experiments above were chosen to demonstrate that we are improving the filtering of the merge-base set. In the first example, more time is spent walking the history to find the set of merge bases before the remove_redundant() call. The starting commits are closer together in the second example, therefore more time is spent in remove_redundant(). The relative change in performance differs as expected. > >> if (array[i]->object.flags & PARENT2) >> redundant[i] = 1; >> for (j = 0; j < filled; j++) ^ permalink raw reply [flat|nested] 162+ messages in thread
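The cutoff discussed in this message can be sketched on a toy graph. The following is a minimal illustration, not the code in commit.c: struct toy_commit, toy_reaches() and the fixed-size arrays are hypothetical names invented for this sketch, and it only models the one-sided "is target reachable from start" question rather than the two-sided painting that paint_down_to_common() performs. It pops commits in descending generation order and stops as soon as the popped commit falls below min_generation, because everything still queued then has an even lower or equal generation.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_COMMITS 64
#define TOY_NO_PARENT (-1)

/* Hypothetical stand-in for git's struct commit; parents are array indexes. */
struct toy_commit {
	uint32_t generation;
	int parents[2];
};

/*
 * Return 1 if commits[target] is reachable from commits[start], else 0.
 * Commits are popped in descending generation order. *visited is set to
 * the number of commits popped, so the effect of the cutoff is observable.
 * Passing min_generation == 0 disables the cutoff.
 */
static int toy_reaches(const struct toy_commit *commits, size_t nr,
		       int start, int target, uint32_t min_generation,
		       size_t *visited)
{
	unsigned char queued[TOY_MAX_COMMITS] = { 0 };
	unsigned char seen[TOY_MAX_COMMITS] = { 0 };
	size_t i;

	*visited = 0;
	if (nr > TOY_MAX_COMMITS)
		return 0;
	queued[start] = seen[start] = 1;

	for (;;) {
		int best = -1;

		/* Linear scan stands in for a real priority queue. */
		for (i = 0; i < nr; i++)
			if (queued[i] && (best < 0 ||
			    commits[i].generation > commits[best].generation))
				best = (int)i;
		if (best < 0)
			break;
		queued[best] = 0;
		(*visited)++;

		/* Everything left in the queue is at or below this generation. */
		if (commits[best].generation < min_generation)
			break;
		if (best == target)
			return 1;

		for (i = 0; i < 2; i++) {
			int p = commits[best].parents[i];

			if (p != TOY_NO_PARENT && !seen[p])
				queued[p] = seen[p] = 1;
		}
	}
	return 0;
}
```

With the target's own generation passed as min_generation, a walk starting from a branch that does not contain the target stops once it drops below the target, instead of continuing down to the root.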
* Re: [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common() 2018-05-01 11:47 ` Derrick Stolee @ 2018-05-02 13:05 ` Jakub Narebski 2018-05-02 13:42 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-05-02 13:05 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason Derrick Stolee <stolee@gmail.com> writes: > On 4/30/2018 6:19 PM, Jakub Narebski wrote: >> Derrick Stolee <dstolee@microsoft.com> writes: [...] >>> @@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc >>> struct commit_list *parents; >>> int flags; >>> + if (commit->generation > last_gen) >>> + BUG("bad generation skip"); >>> + last_gen = commit->generation; >> Shouldn't we provide more information about where the problem is to the >> user, to make it easier to debug the repository / commit-graph data? >> >> Good to have this sanity check here. > > This BUG() _should_ only be seen by developers who add callers which > do not load commits from the commit-graph file. There is a chance that > there are cases not covered by this patch and the added tests, > though. Hopefully we catch them all by dogfooding the feature before > turning it on by default. > > I can add the following to help debug these bad situations: > > + BUG("bad generation skip %d > %d at %s", > + commit->generation, last_gen, > + oid_to_hex(&commit->object.oid)); On one hand, after thinking about this a bit, I agree that this BUG() is more about catching the errors in Git code, rather than in the repository. On the other hand, the more detailed information could help determine what the problem is (e.g. that "at <hex>" is at HEAD). Hopefully we won't see which is which, as it would mean bugs in Git ;)) [...] 
>>> @@ -946,7 +956,7 @@ static int remove_redundant(struct commit **array, int cnt) >>> filled_index[filled] = j; >>> work[filled++] = array[j]; >>> } >>> - common = paint_down_to_common(array[i], filled, work); >>> + common = paint_down_to_common(array[i], filled, work, 0); >> >> Here we are interested not only if "one"/array[i] is reachable from >> "twos"/work, but also if "twos" is reachable from "one". Simple cutoff >> only works in one way, though I wonder if we couldn't use cutoff being >> minimum generation number of "one" and "twos" together. >> >> But that may be left for a separate commit (after checking that the >> above is correct). >> >> Not as simple and obvious as paint_down_to_common() used in >> in_merge_bases_any(), so it is all right. > > Thanks for reporting this. Since we are only concerned about > reachability in this method, it is a good candidate to use > min_generation. It is also subtle enough that we should leave it as a > separate commit. Thanks for checking this, and for the followup. > Also, we can measure performance improvements > separately, as I will mention in my commit message (but I'll copy it > here): > > For a copy of the Linux repository, we measured the following > performance improvements: > > git merge-base v3.3 v4.5 > > Before: 234 ms > After: 208 ms > Rel %: -11% > > git merge-base v4.3 v4.5 > > Before: 102 ms > After: 83 ms > Rel %: -19% > > The experiments above were chosen to demonstrate that we are > improving the filtering of the merge-base set. In the first > example, more time is spent walking the history to find the > set of merge bases before the remove_redundant() call. The > starting commits are closer together in the second example, > therefore more time is spent in remove_redundant(). The relative > change in performance differs as expected. Nice. 
I was not expecting as much of a performance improvement as we got for the --contains tests, because remove_redundant() is the final step in a longer process dominated by the main calculations. Still, nothing to sneeze at. Best regards, -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common() 2018-05-02 13:05 ` Jakub Narebski @ 2018-05-02 13:42 ` Derrick Stolee 0 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-02 13:42 UTC (permalink / raw) To: Jakub Narebski Cc: Derrick Stolee, git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason On 5/2/2018 9:05 AM, Jakub Narebski wrote: > Derrick Stolee <stolee@gmail.com> writes: >> For a copy of the Linux repository, we measured the following >> performance improvements: >> >> git merge-base v3.3 v4.5 >> >> Before: 234 ms >> After: 208 ms >> Rel %: -11% >> >> git merge-base v4.3 v4.5 >> >> Before: 102 ms >> After: 83 ms >> Rel %: -19% >> >> The experiments above were chosen to demonstrate that we are >> improving the filtering of the merge-base set. In the first >> example, more time is spent walking the history to find the >> set of merge bases before the remove_redundant() call. The >> starting commits are closer together in the second example, >> therefore more time is spent in remove_redundant(). The relative >> change in performance differs as expected. > Nice. > > I was not expecting as much performance improvements as we got for > --contains tests because remove_redundant() is a final step in longer > process, dominated by man calculations. Still, nothing to sneeze about. One reason these numbers are not too surprising is that remove_redundant() can demonstrate quadratic behavior. It is calculating pair-wise reachability by starting a walk at each of the candidates (in the worst case). In typical cases, the first walk marks many of the other candidates as redundant and we don't need to start walks from those commits. A possible optimization could be to sort the candidates by descending generation so we find the first walk is likely to mark the rest as redundant. But this may already be the case if the candidates are added to the list in order of "discovery" which is already simulating this behavior. 
Thanks, -Stolee ^ permalink raw reply [flat|nested] 162+ messages in thread
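The ordering idea floated at the end of this message (start the walks from the highest-generation candidates first) could look roughly like the sketch below. It is only an illustration under stated assumptions: struct toy_candidate and sort_candidates_desc() are hypothetical names, not anything in commit.c, and the thread itself notes the candidates may already arrive in a similar order.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical candidate record; "id" stands in for a commit pointer. */
struct toy_candidate {
	uint32_t generation;
	int id;
};

/* qsort() comparator: larger generation numbers sort first. */
static int cmp_generation_desc(const void *a_, const void *b_)
{
	const struct toy_candidate *a = a_;
	const struct toy_candidate *b = b_;

	if (a->generation > b->generation)
		return -1;
	if (a->generation < b->generation)
		return 1;
	return 0;
}

/* Order candidates so the first walk starts from the "highest" commit. */
static void sort_candidates_desc(struct toy_candidate *cands, size_t nr)
{
	qsort(cands, nr, sizeof(*cands), cmp_generation_desc);
}
```

The comparator avoids subtracting the two unsigned values, since a difference of uint32_t numbers cannot be safely narrowed to the int that qsort() expects.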
* [PATCH v4 09/10] merge: check config before loading commits 2018-04-25 14:37 ` [PATCH v4 00/10] " Derrick Stolee ` (7 preceding siblings ...) 2018-04-25 14:38 ` [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common() Derrick Stolee @ 2018-04-25 14:38 ` Derrick Stolee 2018-04-30 22:54 ` Jakub Narebski 2018-04-25 14:38 ` [PATCH v4 10/10] commit-graph.txt: update design document Derrick Stolee ` (2 subsequent siblings) 11 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-25 14:38 UTC (permalink / raw) To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee Now that we use generation numbers from the commit-graph, we must ensure that all commits that exist in the commit-graph are loaded from that file instead of from the object database. Since the commit-graph file is only checked if core.commitGraph is true, we must check the default config before we load any commits. In the merge builtin, the config was checked after loading the HEAD commit. This was due to the use of the global 'branch' when checking merge-specific config settings. Move the config load to be between the initialization of 'branch' and the commit lookup. Without this change, a fast-forward merge would hit a BUG("bad generation skip") statement in commit.c during paint_down_to_common(). This is because the HEAD commit would be loaded with "infinite" generation but then reached by commits with "finite" generation numbers. Add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- builtin/merge.c | 7 ++++--- t/t5318-commit-graph.sh | 9 +++++++++ 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/builtin/merge.c b/builtin/merge.c index 5e5e4497e3..b819756946 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", &branch); + + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(&head_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(&head_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); - if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); argc = parse_options(argc, argv, prefix, builtin_merge_options, diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index a380419b65..77d85aefe7 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 +test_expect_success 'perform fast-forward merge in full repo' ' + cd "$TRASH_DIRECTORY/full" && + git checkout -b merge-5-to-8 commits/5 && + git merge commits/8 && + git show-ref -s merge-5-to-8 >output && + git show-ref -s commits/8 >expect && + test_cmp expect output +' + test_done -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH v4 09/10] merge: check config before loading commits 2018-04-25 14:38 ` [PATCH v4 09/10] merge: check config before loading commits Derrick Stolee @ 2018-04-30 22:54 ` Jakub Narebski 2018-05-01 11:52 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-30 22:54 UTC (permalink / raw) To: Derrick Stolee Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason Derrick Stolee <dstolee@microsoft.com> writes: > Now that we use generation numbers from the commit-graph, we must > ensure that all commits that exist in the commit-graph are loaded > from that file instead of from the object database. Since the > commit-graph file is only checked if core.commitGraph is true, we > must check the default config before we load any commits. > > In the merge builtin, the config was checked after loading the HEAD > commit. This was due to the use of the global 'branch' when checking > merge-specific config settings. > > Move the config load to be between the initialization of 'branch' and > the commit lookup. Sidenote: I wonder why reading config was postponed to later in the command lifetime... I guess it was to avoid having to read config if HEAD was invalid. > > Without this change, a fast-forward merge would hit a BUG("bad > generation skip") statement in commit.c during paint_down_to_common(). > This is because the HEAD commit would be loaded with "infinite" > generation but then reached by commits with "finite" generation > numbers. I guess this is because we avoid re-parsing objects at all costs; we want to avoid re-reading commit graph too. > > Add a test to t5318-commit-graph.sh that exercises this code path to > prevent a regression. I would prefer if this commit was put earlier in the series, to avoid having broken Git (and thus a possibility of problems when bisecting) in between those two commits. 
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > builtin/merge.c | 7 ++++--- > t/t5318-commit-graph.sh | 9 +++++++++ > 2 files changed, 13 insertions(+), 3 deletions(-) > > diff --git a/builtin/merge.c b/builtin/merge.c > index 5e5e4497e3..b819756946 100644 > --- a/builtin/merge.c > +++ b/builtin/merge.c > @@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char *prefix) > branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL); > if (branch) > skip_prefix(branch, "refs/heads/", &branch); > + > + init_diff_ui_defaults(); > + git_config(git_merge_config, NULL); > + > if (!branch || is_null_oid(&head_oid)) > head_commit = NULL; > else > head_commit = lookup_commit_or_die(&head_oid, "HEAD"); > > - init_diff_ui_defaults(); > - git_config(git_merge_config, NULL); > - Good. > if (branch_mergeoptions) > parse_branch_merge_options(branch_mergeoptions); > argc = parse_options(argc, argv, prefix, builtin_merge_options, > diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh > index a380419b65..77d85aefe7 100755 > --- a/t/t5318-commit-graph.sh > +++ b/t/t5318-commit-graph.sh > @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' > graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 > graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 > > +test_expect_success 'perform fast-forward merge in full repo' ' > + cd "$TRASH_DIRECTORY/full" && > + git checkout -b merge-5-to-8 commits/5 && > + git merge commits/8 && > + git show-ref -s merge-5-to-8 >output && > + git show-ref -s commits/8 >expect && > + test_cmp expect output > +' All right. (though I wonder if this tests catches all problems where BUG("bad generation skip") could have been encountered. > + > test_done Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v4 09/10] merge: check config before loading commits 2018-04-30 22:54 ` Jakub Narebski @ 2018-05-01 11:52 ` Derrick Stolee 2018-05-02 11:41 ` Jakub Narebski 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 11:52 UTC (permalink / raw) To: Jakub Narebski, Derrick Stolee Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason On 4/30/2018 6:54 PM, Jakub Narebski wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> Now that we use generation numbers from the commit-graph, we must >> ensure that all commits that exist in the commit-graph are loaded >> from that file instead of from the object database. Since the >> commit-graph file is only checked if core.commitGraph is true, we >> must check the default config before we load any commits. >> >> In the merge builtin, the config was checked after loading the HEAD >> commit. This was due to the use of the global 'branch' when checking >> merge-specific config settings. >> >> Move the config load to be between the initialization of 'branch' and >> the commit lookup. > Sidenote: I wonder why reading config was postponed to later in the > command lifetime... I guess it was to avoid having to read config if > HEAD was invalid. The 'branch' does need to be loaded before the call to git_config (as I found out after moving the config call too early), so I suppose it was natural to pair that with resolving head_commit. > >> Without this change, a fast-forward merge would hit a BUG("bad >> generation skip") statement in commit.c during paint_down_to_common(). >> This is because the HEAD commit would be loaded with "infinite" >> generation but then reached by commits with "finite" generation >> numbers. > I guess this is because we avoid re-parsing objects at all costs; we > want to avoid re-reading commit graph too. > >> Add a test to t5318-commit-graph.sh that exercises this code path to >> prevent a regression. 
> I would prefer if this commit was put earlier in the series, to avoid > having broken Git (and thus a possibility of problems when bisecting) in > between those two commits. > >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> >> --- >> builtin/merge.c | 7 ++++--- >> t/t5318-commit-graph.sh | 9 +++++++++ >> 2 files changed, 13 insertions(+), 3 deletions(-) >> >> diff --git a/builtin/merge.c b/builtin/merge.c >> index 5e5e4497e3..b819756946 100644 >> --- a/builtin/merge.c >> +++ b/builtin/merge.c >> @@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char *prefix) >> branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL); >> if (branch) >> skip_prefix(branch, "refs/heads/", &branch); >> + >> + init_diff_ui_defaults(); >> + git_config(git_merge_config, NULL); >> + >> if (!branch || is_null_oid(&head_oid)) >> head_commit = NULL; >> else >> head_commit = lookup_commit_or_die(&head_oid, "HEAD"); >> >> - init_diff_ui_defaults(); >> - git_config(git_merge_config, NULL); >> - > Good. > >> if (branch_mergeoptions) >> parse_branch_merge_options(branch_mergeoptions); >> argc = parse_options(argc, argv, prefix, builtin_merge_options, >> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh >> index a380419b65..77d85aefe7 100755 >> --- a/t/t5318-commit-graph.sh >> +++ b/t/t5318-commit-graph.sh >> @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' >> graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 >> graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 >> >> +test_expect_success 'perform fast-forward merge in full repo' ' >> + cd "$TRASH_DIRECTORY/full" && >> + git checkout -b merge-5-to-8 commits/5 && >> + git merge commits/8 && >> + git show-ref -s merge-5-to-8 >output && >> + git show-ref -s commits/8 >expect && >> + test_cmp expect output >> +' > All right. 
(though I wonder if this tests catches all problems where > BUG("bad generation skip") could have been encountered. We will never know until we have this series running in the wild (and even then, some features are very obscure) and enough people turn on the config setting. One goal of the "fsck and gc" series is to get this feature running during the rest of the test suite as much as possible, so we can get additional coverage. Also to get more experience from the community dogfooding the feature. > >> + >> test_done > Best, ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v4 09/10] merge: check config before loading commits 2018-05-01 11:52 ` Derrick Stolee @ 2018-05-02 11:41 ` Jakub Narebski 0 siblings, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-05-02 11:41 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason Derrick Stolee <stolee@gmail.com> writes: > On 4/30/2018 6:54 PM, Jakub Narebski wrote: >> Derrick Stolee <dstolee@microsoft.com> writes: >> >>> Now that we use generation numbers from the commit-graph, we must >>> ensure that all commits that exist in the commit-graph are loaded >>> from that file instead of from the object database. Since the >>> commit-graph file is only checked if core.commitGraph is true, we >>> must check the default config before we load any commits. >>> >>> In the merge builtin, the config was checked after loading the HEAD >>> commit. This was due to the use of the global 'branch' when checking >>> merge-specific config settings. >>> >>> Move the config load to be between the initialization of 'branch' and >>> the commit lookup. >> >> Sidenote: I wonder why reading config was postponed to later in the >> command lifetime... I guess it was to avoid having to read config if >> HEAD was invalid. > > The 'branch' does need to be loaded before the call to git_config (as > I found out after moving the config call too early), so I suppose it > was natural to pair that with resolving head_commit. Right, so there was only a limited number of places where call to git_config could be put correctly. Now I wonder no more. [...] 
>>> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh >>> index a380419b65..77d85aefe7 100755 >>> --- a/t/t5318-commit-graph.sh >>> +++ b/t/t5318-commit-graph.sh >>> @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' >>> graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 >>> graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 >>> +test_expect_success 'perform fast-forward merge in full repo' ' >>> + cd "$TRASH_DIRECTORY/full" && >>> + git checkout -b merge-5-to-8 commits/5 && >>> + git merge commits/8 && >>> + git show-ref -s merge-5-to-8 >output && >>> + git show-ref -s commits/8 >expect && >>> + test_cmp expect output >>> +' >> All right. (though I wonder if this tests catches all problems where >> BUG("bad generation skip") could have been encountered. > > We will never know until we have this series running in the wild (and > even then, some features are very obscure) and enough people turn on > the config setting. > > One goal of the "fsck and gc" series is to get this feature running > during the rest of the test suite as much as possible, so we can get > additional coverage. Also to get more experience from the community > dogfooding the feature. Sidenote: for two out of three features that change the view of history we could also update commit-graph automatically: * the shortening or deepening of shallow clone could also re-calculate the commit graph (or invalidate it) * git-replace could check if the replacement modifies history, and if so, recalculate the commit graph (or invalidate it/check its validity) * there is no such possibility for grafts, but they are deprecated anyway -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
* [PATCH v4 10/10] commit-graph.txt: update design document 2018-04-25 14:37 ` [PATCH v4 00/10] " Derrick Stolee ` (8 preceding siblings ...) 2018-04-25 14:38 ` [PATCH v4 09/10] merge: check config before loading commits Derrick Stolee @ 2018-04-25 14:38 ` Derrick Stolee 2018-04-30 23:32 ` Jakub Narebski 2018-04-25 14:40 ` [PATCH v4 00/10] Compute and consume generation numbers Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee 11 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-25 14:38 UTC (permalink / raw) To: git; +Cc: gitster, peff, jnareb, avarab, Derrick Stolee We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Expand the section on generation numbers to discuss how the three special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and _MAX interact with other generation numbers. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++----- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..d9f2713efa 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. 
+ +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + + If A and B are commits with generation numbers N amd M, respectively, + and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. + Design Details -------------- @@ -98,17 +121,12 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operations are important candidates: + operation is an important candidate: - - paint_down_to_common() - 'log --topo-order' - Currently, parse_commit_gently() requires filling in the root tree -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
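The rules in this documentation patch can be written down as a small sketch. The three macro values are taken from the patch text above; the helper names toy_generation_from_parents() and toy_cannot_reach() are hypothetical, and the first helper assumes every parent already has a computed generation number (that is, none of them is _ZERO or _INFINITY).

```c
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/*
 * One more than the largest parent generation, clamped to
 * GENERATION_NUMBER_MAX. A commit with no parents gets generation 1.
 * Assumes all parent generations are already computed.
 */
static uint32_t toy_generation_from_parents(const uint32_t *parent_gens,
					    size_t nr)
{
	uint32_t max = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (parent_gens[i] > max)
			max = parent_gens[i];
	if (max >= GENERATION_NUMBER_MAX)
		return GENERATION_NUMBER_MAX;
	return max + 1;
}

/*
 * The weaker condition from the document: if gen(A) < gen(B), then A
 * cannot reach B. A return value of 0 means "unknown", not "reachable".
 */
static int toy_cannot_reach(uint32_t gen_a, uint32_t gen_b)
{
	return gen_a < gen_b;
}
```

Because the comparison is strict, two commits that both carry _ZERO, _MAX or _INFINITY never rule each other out, which is what lets the special values share one code path with ordinary generation numbers.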
* Re: [PATCH v4 10/10] commit-graph.txt: update design document 2018-04-25 14:38 ` [PATCH v4 10/10] commit-graph.txt: update design document Derrick Stolee @ 2018-04-30 23:32 ` Jakub Narebski 2018-05-01 12:00 ` Derrick Stolee 0 siblings, 1 reply; 162+ messages in thread From: Jakub Narebski @ 2018-04-30 23:32 UTC (permalink / raw) To: Derrick Stolee Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason Derrick Stolee <dstolee@microsoft.com> writes: > We now calculate generation numbers in the commit-graph file and use > them in paint_down_to_common(). > > Expand the section on generation numbers to discuss how the three > special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and > _MAX interact with other generation numbers. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Looks good. > --- > Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++----- > 1 file changed, 24 insertions(+), 6 deletions(-) > > diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt > index 0550c6d0dc..d9f2713efa 100644 > --- a/Documentation/technical/commit-graph.txt > +++ b/Documentation/technical/commit-graph.txt > @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite" > generation number and walk until reaching commits with known generation > number. > > +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not > +in the commit-graph file. If a commit-graph file was written by a version > +of Git that did not compute generation numbers, then those commits will > +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. > + > +Since the commit-graph file is closed under reachability, we can guarantee > +the following weaker condition on all commits: > + > + If A and B are commits with generation numbers N amd M, respectively, > + and N < M, then A cannot reach B. 
> +
> +Note how the strict inequality differs from the inequality when we have
> +fully-computed generation numbers. Using strict inequality may result in
> +walking a few extra commits,

The Linux kernel commit graph has a maximum of 513 commits sharing the
same generation number, but on average it is 5.43 commits sharing the
same generation number, with a standard deviation of 10.70; the median is
even lower: it is 2, with a median absolute deviation (MAD) of 5.35.

So on average it would be a few extra commits. Right.

> but the simplicity in dealing with commits
> +with generation number *_INFINITY or *_ZERO is valuable.

As I wrote before, handling those corner cases is more complicated, but
not that complicated. We could simply use the stronger condition if both
generation numbers are ordinary generation numbers, and the weaker
condition when at least one generation number has one of those special
values.

> +
> +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> +generation numbers are computed to be at least this value. We limit at
> +this value since it is the largest value that can be stored in the
> +commit-graph file using the 30 bits available to generation numbers. This
> +presents another case where a commit can have generation number equal to
> +that of a parent.

Ordinary generation numbers, where the stronger condition holds, are those
with GENERATION_NUMBER_ZERO < gen(C) < GENERATION_NUMBER_MAX.

> +
> Design Details
> --------------
>
> @@ -98,17 +121,12 @@ Future Work
> - The 'commit-graph' subcommand does not have a "verify" mode that is
> necessary for integration with fsck.
>
> -- The file format includes room for precomputed generation numbers. These
> - are not currently computed, so all generation numbers will be marked as
> - 0 (or "uncomputed"). A later patch will include this calculation.
> -

Good.
> - After computing and storing generation numbers, we must make graph
> walks aware of generation numbers to gain the performance benefits they
> enable. This will mostly be accomplished by swapping a commit-date-ordered
> priority queue with one ordered by generation number. The following
> - operations are important candidates:
> + operation is an important candidate:
>
> - - paint_down_to_common()
> - 'log --topo-order'

Other possible candidates:

 - remove_redundant() - see comment in previous patch
 - still_interesting() - where Git uses date slop to stop walking
   too far

>
> - Currently, parse_commit_gently() requires filling in the root tree

One important issue left is handling features that change the view of
project history, and their interaction with the commit-graph feature.

What would happen if we turn on the commit-graph feature, generate a
commit-graph file, and then:

 * use a graft file or remove graft entries to cut history, or remove a
   cut, or join two [independent] histories
 * use the git-replace mechanism to do the same
 * in a shallow clone, deepen or shorten the clone

What would happen if, without re-generating the commit-graph file
(assuming that Git wouldn't do it for us), we run some feature that makes
use of commit-graph data:

 - git branch --contains
 - git tag --contains
 - git rev-list A..B

Best,
-- 
Jakub Narębski
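The review above suggests applying the stronger condition only when both generation numbers are ordinary, and falling back to the weaker (strict) condition otherwise. A minimal sketch of that idea follows; it is not what the series implements. The macro values are taken from the patch, while `gen_is_ordinary()` and `gen_rules_out_reach()` are hypothetical helper names introduced only for illustration.

```c
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/* Hypothetical helper: a fully computed, unsaturated generation number. */
static int gen_is_ordinary(uint32_t gen)
{
	return gen > GENERATION_NUMBER_ZERO && gen < GENERATION_NUMBER_MAX;
}

/*
 * Hypothetical helper: returns 1 when the generation numbers alone prove
 * that commit A (generation gen_a) cannot reach a different commit B
 * (generation gen_b), and 0 when they are inconclusive.
 */
static int gen_rules_out_reach(uint32_t gen_a, uint32_t gen_b)
{
	/* Stronger condition: both ordinary, so N <= M is enough. */
	if (gen_is_ordinary(gen_a) && gen_is_ordinary(gen_b))
		return gen_a <= gen_b;

	/* Weaker condition: a special value is involved, require N < M. */
	return gen_a < gen_b;
}
```

With this split, two distinct commits that both have generation 5 are ruled out immediately, while two commits that both sit at GENERATION_NUMBER_MAX (or both at _ZERO) still require a walk.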
* Re: [PATCH v4 10/10] commit-graph.txt: update design document 2018-04-30 23:32 ` Jakub Narebski @ 2018-05-01 12:00 ` Derrick Stolee 2018-05-02 7:57 ` Jakub Narebski 0 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:00 UTC (permalink / raw) To: Jakub Narebski, Derrick Stolee Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason On 4/30/2018 7:32 PM, Jakub Narebski wrote: > Derrick Stolee <dstolee@microsoft.com> writes: > >> We now calculate generation numbers in the commit-graph file and use >> them in paint_down_to_common(). >> >> Expand the section on generation numbers to discuss how the three >> special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and >> _MAX interact with other generation numbers. >> >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > Looks good. > >> --- >> Documentation/technical/commit-graph.txt | 30 +++++++++++++++++++----- >> 1 file changed, 24 insertions(+), 6 deletions(-) >> >> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt >> index 0550c6d0dc..d9f2713efa 100644 >> --- a/Documentation/technical/commit-graph.txt >> +++ b/Documentation/technical/commit-graph.txt >> @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite" >> generation number and walk until reaching commits with known generation >> number. >> >> +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not >> +in the commit-graph file. If a commit-graph file was written by a version >> +of Git that did not compute generation numbers, then those commits will >> +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. >> + >> +Since the commit-graph file is closed under reachability, we can guarantee >> +the following weaker condition on all commits: >> + >> + If A and B are commits with generation numbers N amd M, respectively, >> + and N < M, then A cannot reach B. 
>> + >> +Note how the strict inequality differs from the inequality when we have >> +fully-computed generation numbers. Using strict inequality may result in >> +walking a few extra commits, > The linux kernel commit graph has maximum of 513 commits sharing the > same generation number, but is is 5.43 commits sharing the same > generation number on average, with standard deviation 10.70; median is > even lower: it is 2, with 5.35 median absolute deviation (MAD). > > So on average it would be a few extra commits. Right. > >> but the simplicity in dealing with commits >> +with generation number *_INFINITY or *_ZERO is valuable. > As I wrote before, handling those corner cases in more complicated, but > not that complicated. We could simply use stronger condition if both > generation numbers are ordinary generation numbers, and weaker condition > when at least one generation number has one of those special values. > >> + >> +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose >> +generation numbers are computed to be at least this value. We limit at >> +this value since it is the largest value that can be stored in the >> +commit-graph file using the 30 bits available to generation numbers. This >> +presents another case where a commit can have generation number equal to >> +that of a parent. > Ordinary generation numbers, where stronger condition holds, are those > between GENERATION_NUMBER_ZERO < gen(C) < GENERATION_NUMBER_MAX. > >> + >> Design Details >> -------------- >> >> @@ -98,17 +121,12 @@ Future Work >> - The 'commit-graph' subcommand does not have a "verify" mode that is >> necessary for integration with fsck. >> >> -- The file format includes room for precomputed generation numbers. These >> - are not currently computed, so all generation numbers will be marked as >> - 0 (or "uncomputed"). A later patch will include this calculation. >> - > Good. 
> >> - After computing and storing generation numbers, we must make graph >> walks aware of generation numbers to gain the performance benefits they >> enable. This will mostly be accomplished by swapping a commit-date-ordered >> priority queue with one ordered by generation number. The following >> - operations are important candidates: >> + operation is an important candidate: >> >> - - paint_down_to_common() >> - 'log --topo-order' > Another possible candidates: > > - remove_redundant() - see comment in previous patch > - still_interesting() - where Git uses date slop to stop walking > too far remove_redundant() will be included in v5, thanks. Instead of "still_interesting()" I'll add "git tag --merged" as the candidate to consider, as discussed in [1]. [1] https://public-inbox.org/git/87fu3g67ry.fsf@lant.ki.iif.hu/t/#u "branch --contains / tag --merged inconsistency" > >> >> - Currently, parse_commit_gently() requires filling in the root tree > One important issue left is handling features that change view of > project history, and their interaction with commit-graph feature. > > What would happen, if we turn on commit-graph feature, generate commit > graph file, and then: > > * use graft file or remove graft entries to cut history, or remove cut > or join two [independent] histories. > * use git-replace mechanims to do the same > * in shallow clone, deepen or shorten the clone > > What would happen if without re-generating commit-graph file (assuming > tha Git wouldn't do it for us), we run some feature that makes use of > commit-graph data: > > - git branch --contains > - git tag --contains > - git rev-list A..B > The commit-graph is not supported in these scenarios (yet). grafts are specifically mentioned in the future work section. I'm not particularly interested in supporting these features, so they are good venues for other contributors to get involved in the commit-graph feature. 
Eventually, they will be blockers to making the commit-graph feature a
"default" feature. That is when I will pay attention to these situations.
For now, a user must opt in to having a commit-graph file (and that same
user has possibly opted in to these history modifying features).

Thanks,
-Stolee
* Re: [PATCH v4 10/10] commit-graph.txt: update design document 2018-05-01 12:00 ` Derrick Stolee @ 2018-05-02 7:57 ` Jakub Narebski 0 siblings, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-05-02 7:57 UTC (permalink / raw) To: Derrick Stolee Cc: Derrick Stolee, git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason Derrick Stolee <stolee@gmail.com> writes: > On 4/30/2018 7:32 PM, Jakub Narebski wrote: >> Derrick Stolee <dstolee@microsoft.com> writes: [...] >>> - After computing and storing generation numbers, we must make graph >>> walks aware of generation numbers to gain the performance benefits they >>> enable. This will mostly be accomplished by swapping a commit-date-ordered >>> priority queue with one ordered by generation number. The following >>> - operations are important candidates: >>> + operation is an important candidate: >>> - - paint_down_to_common() >>> - 'log --topo-order' >> >> Another possible candidates: >> >> - remove_redundant() - see comment in previous patch >> - still_interesting() - where Git uses date slop to stop walking >> too far > > remove_redundant() will be included in v5, thanks. Oh. Nice. I'll try to review the new patch in detail soon. > Instead of "still_interesting()" I'll add "git tag --merged" as the > candidate to consider, as discussed in [1]. > > [1] https://public-inbox.org/git/87fu3g67ry.fsf@lant.ki.iif.hu/t/#u > "branch --contains / tag --merged inconsistency" All right. I have mentioned still_interesting() as a hint where possible additional generation numbers based optimization may lurk (because that's where heuristic based on dates is used - similarly to how it was done in this series with paint_down_to_common()). [...] >> One important issue left is handling features that change view of >> project history, and their interaction with commit-graph feature. 
>>
>> What would happen, if we turn on commit-graph feature, generate commit
>> graph file, and then:
>>
>> * use graft file or remove graft entries to cut history, or remove cut
>> or join two [independent] histories.
>> * use git-replace mechanims to do the same
>> * in shallow clone, deepen or shorten the clone
>>
>> What would happen if without re-generating commit-graph file (assuming
>> tha Git wouldn't do it for us), we run some feature that makes use of
>> commit-graph data:
>>
>> - git branch --contains
>> - git tag --contains
>> - git rev-list A..B
>>
>
> The commit-graph is not supported in these scenarios (yet). grafts are
> specifically mentioned in the future work section.
>
> I'm not particularly interested in supporting these features, so they
> are good venues for other contributors to get involved in the
> commit-graph feature. Eventually, they will be blockers to making the
> commit-graph feature a "default" feature. That is when I will pay
> attention to these situations. For now, a user must opt-in to having a
> commit-graph file (and that same user has possibly opted in to these
> history modifying features).

Well, that is a sensible approach. Get the commit-graph feature into
working condition, and worry about being able to turn it on by default
later.

Nice to have it clarified. I'll stop nagging about that, then ;-P

One issue: 'grafts' are mentioned in the future work section of the
technical documentation, but we don't have *any* warning about
commit-graph limitations in the user-facing documentation, that is, the
git-commit-graph(1) manpage.

Best,
-- 
Jakub Narębski
* Re: [PATCH v4 00/10] Compute and consume generation numbers 2018-04-25 14:37 ` [PATCH v4 00/10] " Derrick Stolee ` (9 preceding siblings ...) 2018-04-25 14:38 ` [PATCH v4 10/10] commit-graph.txt: update design document Derrick Stolee @ 2018-04-25 14:40 ` Derrick Stolee 2018-04-28 17:28 ` Jakub Narebski 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee 11 siblings, 1 reply; 162+ messages in thread From: Derrick Stolee @ 2018-04-25 14:40 UTC (permalink / raw) To: Derrick Stolee, git; +Cc: gitster, peff, jnareb, avarab As promised, here is the diff from v3. Thanks, -Stolee -- >8 -- diff --git a/builtin/merge.c b/builtin/merge.c index 7e1da6c6ea..b819756946 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1148,6 +1148,7 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", &branch); + init_diff_ui_defaults(); git_config(git_merge_config, NULL); @@ -1156,7 +1157,6 @@ int cmd_merge(int argc, const char **argv, const char *prefix) else head_commit = lookup_commit_or_die(&head_oid, "HEAD"); - if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); argc = parse_options(argc, argv, prefix, builtin_merge_options, diff --git a/commit-graph.c b/commit-graph.c index 21e853c21a..aebd242def 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -257,7 +257,7 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin uint32_t *parent_data_ptr; uint64_t date_low, date_high; struct commit_list **pptr; - const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * pos; item->object.parsed = 1; item->graph_pos = pos; @@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin *pos = item->graph_pos; return 1; } else { - return bsearch_graph(commit_graph, 
&(item->object.oid), pos); + return bsearch_graph(g, &(item->object.oid), pos); } } @@ -312,10 +312,10 @@ int parse_commit_in_graph(struct commit *item) { uint32_t pos; - if (item->object.parsed) - return 0; if (!core_commit_graph) return 0; + if (item->object.parsed) + return 1; prepare_commit_graph(); if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) return fill_commit_in_graph(item, commit_graph, pos); @@ -454,9 +454,8 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; - if ((*list)->generation != GENERATION_NUMBER_INFINITY) { + if ((*list)->generation != GENERATION_NUMBER_INFINITY) packedDate[0] |= htonl((*list)->generation << 2); - } packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); diff --git a/commit.c b/commit.c index 9ef6f699bd..e2e16ea1a7 100644 --- a/commit.c +++ b/commit.c @@ -653,7 +653,7 @@ int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void else if (a->generation > b->generation) return -1; - /* use date as a heuristic when generataions are equal */ + /* use date as a heuristic when generations are equal */ if (a->date < b->date) return 1; else if (a->date > b->date) @@ -1078,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * } if (commit->generation > min_generation) - return 0; + return ret; bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) diff --git a/ref-filter.c b/ref-filter.c index e2fea6d635..fb35067fc9 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -16,6 +16,7 @@ #include "trailer.h" #include "wt-status.h" #include "commit-slab.h" +#include "commit-graph.h" static struct ref_msg { const char *gone; @@ -1582,7 +1583,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) } /* - * Test whether the candidate or one of its parents is contained in the list. 
+ * Test whether the candidate is contained in the list. * Do not recurse to find out, though, but return -1 if inconclusive. */ static enum contains_result contains_test(struct commit *candidate, @@ -1629,7 +1630,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, for (p = want; p; p = p->next) { struct commit *c = p->item; - parse_commit_or_die(c); + load_commit_graph_info(c); if (c->generation < cutoff) cutoff = c->generation; } ^ permalink raw reply related [flat|nested] 162+ messages in thread
* Re: [PATCH v4 00/10] Compute and consume generation numbers 2018-04-25 14:40 ` [PATCH v4 00/10] Compute and consume generation numbers Derrick Stolee @ 2018-04-28 17:28 ` Jakub Narebski 0 siblings, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-04-28 17:28 UTC (permalink / raw) To: Derrick Stolee; +Cc: Derrick Stolee, git, gitster, peff, avarab Derrick Stolee <stolee@gmail.com> writes: > As promised, here is the diff from v3. What is this strange string " " in place of tabs in the interdiff? " " here is Unicode Character 'NO-BREAK SPACE' (U+00A0). Though it doesn't matter for viewing, my newsreader (Gnus from GNU Emacs) thinks that it is worth notifying about when replying. Also, it looks like at least in one place the diff got line-wrapped. > Thanks, > -Stolee > > -- >8 -- > > diff --git a/builtin/merge.c b/builtin/merge.c > index 7e1da6c6ea..b819756946 100644 > --- a/builtin/merge.c > +++ b/builtin/merge.c > @@ -1148,6 +1148,7 @@ int cmd_merge(int argc, const char **argv, const > char *prefix) > branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, > NULL); > if (branch) > skip_prefix(branch, "refs/heads/", &branch); > + > init_diff_ui_defaults(); > git_config(git_merge_config, NULL); > > @@ -1156,7 +1157,6 @@ int cmd_merge(int argc, const char **argv, const > char *prefix) > else > head_commit = lookup_commit_or_die(&head_oid, "HEAD"); > > - > if (branch_mergeoptions) > parse_branch_merge_options(branch_mergeoptions); > argc = parse_options(argc, argv, prefix, builtin_merge_options, Whitespace fixes, all right. 
> diff --git a/commit-graph.c b/commit-graph.c > index 21e853c21a..aebd242def 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -257,7 +257,7 @@ static int fill_commit_in_graph(struct commit > *item, struct commit_graph *g, uin > uint32_t *parent_data_ptr; > uint64_t date_low, date_high; > struct commit_list **pptr; > - const unsigned char *commit_data = g->chunk_commit_data + > GRAPH_DATA_WIDTH * pos; > + const unsigned char *commit_data = g->chunk_commit_data + > (g->hash_len + 16) * pos; > > item->object.parsed = 1; > item->graph_pos = pos; This was accidental change in v3 (unrelated to the changes in commit it were in). Though I wonder if the symbolic constant route is not better - though as separate standalone commit. > @@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit > *item, struct commit_graph *g, uin > *pos = item->graph_pos; > return 1; > } else { > - return bsearch_graph(commit_graph, > &(item->object.oid), pos); > + return bsearch_graph(g, &(item->object.oid), pos); > } > } > Fixup for a commit, that was sent in separate fixup email in v3. All right. Though I wonder if it wouldn't be better to call global variable 'the_commit_graph' to avoid such errors in the future... > @@ -312,10 +312,10 @@ int parse_commit_in_graph(struct commit *item) > { > uint32_t pos; > > - if (item->object.parsed) > - return 0; > if (!core_commit_graph) > return 0; > + if (item->object.parsed) > + return 1; > prepare_commit_graph(); > if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) > return fill_commit_in_graph(item, commit_graph, pos); Fixed accidental flip-flopping about return value when item->object.parsed. I'd have to take a look at actual commits to say whether I think it is all right or not. 
> @@ -454,9 +454,8 @@ static void write_graph_chunk_data(struct hashfile > *f, int hash_len, > else > packedDate[0] = 0; > > - if ((*list)->generation != GENERATION_NUMBER_INFINITY) { > + if ((*list)->generation != GENERATION_NUMBER_INFINITY) > packedDate[0] |= htonl((*list)->generation << 2); > - } > > packedDate[1] = htonl((*list)->date); > hashwrite(f, packedDate, 8); Coding style change, to be more in line with CodingGuidelines, namely that we usually do not use block for single-command in conditionals. All right. > diff --git a/commit.c b/commit.c > index 9ef6f699bd..e2e16ea1a7 100644 > --- a/commit.c > +++ b/commit.c > @@ -653,7 +653,7 @@ int compare_commits_by_gen_then_commit_date(const > void *a_, const void *b_, void > else if (a->generation > b->generation) > return -1; > > - /* use date as a heuristic when generataions are equal */ > + /* use date as a heuristic when generations are equal */ > if (a->date < b->date) > return 1; > else if (a->date > b->date) Fixed typo in comment. All right. > @@ -1078,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit, > int nr_reference, struct commit * > } > > if (commit->generation > min_generation) > - return 0; > + return ret; > > bases = paint_down_to_common(commit, nr_reference, reference, > commit->generation); > if (commit->object.flags & PARENT2) Unifying way of returning result (to one that was used before this commit in this fragment of the git code). Looks all right, from what I remember. 
> diff --git a/ref-filter.c b/ref-filter.c > index e2fea6d635..fb35067fc9 100644 > --- a/ref-filter.c > +++ b/ref-filter.c > @@ -16,6 +16,7 @@ > #include "trailer.h" > #include "wt-status.h" > #include "commit-slab.h" > +#include "commit-graph.h" > > static struct ref_msg { > const char *gone; > @@ -1629,7 +1630,7 @@ static enum contains_result > contains_tag_algo(struct commit *candidate, > > for (p = want; p; p = p->next) { > struct commit *c = p->item; > - parse_commit_or_die(c); > + load_commit_graph_info(c); > if (c->generation < cutoff) > cutoff = c->generation; > } Avoiding performance penalty when not using commit-graph feature (or when it is turned off). Looks good on first glance. > @@ -1582,7 +1583,7 @@ static int in_commit_list(const struct > commit_list *want, struct commit *c) > } > > /* > - * Test whether the candidate or one of its parents is contained in > the list. > + * Test whether the candidate is contained in the list. > * Do not recurse to find out, though, but return -1 if inconclusive. > */ > static enum contains_result contains_test(struct commit *candidate, Bringing comment in line with the function it is about. Good. Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
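The commit.c hunk reviewed above touches a comparator that orders commits by generation number first and uses the commit date only to break ties. A rough standalone sketch of that ordering is shown here. `struct toy_commit` and `toy_compare_by_gen_then_date()` are illustrative names, not Git's; the sketch also drops the third `void *` argument that the real comparator's signature carries, so that it can be passed to qsort().

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for the two fields the comparator looks at. */
struct toy_commit {
	uint32_t generation;
	uint64_t date;
};

/* Sorts higher generation first; newer date first among equal generations. */
static int toy_compare_by_gen_then_date(const void *a_, const void *b_)
{
	const struct toy_commit *a = a_, *b = b_;

	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;

	/* use date as a heuristic when generations are equal */
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}
```

Sorting an array of `struct toy_commit` with `qsort(q, n, sizeof(q[0]), toy_compare_by_gen_then_date)` puts the commit with the largest generation number at index 0, which mirrors the order in which a priority queue using this comparator would hand out commits.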
* [PATCH v5 00/11] Compute and consume generation numbers 2018-04-25 14:37 ` [PATCH v4 00/10] " Derrick Stolee ` (10 preceding siblings ...) 2018-04-25 14:40 ` [PATCH v4 00/10] Compute and consume generation numbers Derrick Stolee @ 2018-05-01 12:47 ` Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 01/11] ref-filter: fix outdated comment on in_commit_list Derrick Stolee ` (11 more replies) 11 siblings, 12 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw) To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee Most of the changes from v4 are cosmetic, but there is one new commit: commit: use generation number in remove_redundant() Other changes are non-functional, but do clarify things. Inter-diff from v4: diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index d9f2713efa..e1a883eb46 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -125,9 +125,10 @@ Future Work walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following - operation is an important candidate: + operations are important candidates: - 'log --topo-order' + - 'tag --merged' - Currently, parse_commit_gently() requires filling in the root tree object for a commit. 
This passes through lookup_tree() and consequently diff --git a/commit-graph.c b/commit-graph.c index aebd242def..a8c337dd77 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -248,6 +248,7 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) { const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->graph_pos = pos; item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; } @@ -454,8 +455,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; - if ((*list)->generation != GENERATION_NUMBER_INFINITY) - packedDate[0] |= htonl((*list)->generation << 2); + packedDate[0] |= htonl((*list)->generation << 2); packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); @@ -589,18 +589,17 @@ static void close_reachable(struct packed_oid_list *oids) } } -static void compute_generation_numbers(struct commit** commits, - int nr_commits) +static void compute_generation_numbers(struct packed_commit_list* commits) { int i; struct commit_list *list = NULL; - for (i = 0; i < nr_commits; i++) { - if (commits[i]->generation != GENERATION_NUMBER_INFINITY && - commits[i]->generation != GENERATION_NUMBER_ZERO) + for (i = 0; i < commits->nr; i++) { + if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY && + commits->list[i]->generation != GENERATION_NUMBER_ZERO) continue; - commit_list_insert(commits[i], &list); + commit_list_insert(commits->list[i], &list); while (list) { struct commit *current = list->item; struct commit_list *parent; @@ -621,10 +620,10 @@ static void compute_generation_numbers(struct commit** commits, if (all_parents_computed) { current->generation = max_generation + 1; pop_commit(&list); - } - if (current->generation > GENERATION_NUMBER_MAX) - current->generation = GENERATION_NUMBER_MAX; + if (current->generation > GENERATION_NUMBER_MAX) + 
current->generation = GENERATION_NUMBER_MAX; + } } } } @@ -752,7 +751,7 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); - compute_generation_numbers(commits.list, commits.nr); + compute_generation_numbers(&commits); graph_name = get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(&lk, graph_name, 0); diff --git a/commit.c b/commit.c index e2e16ea1a7..5064db4e61 100644 --- a/commit.c +++ b/commit.c @@ -835,7 +835,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, int flags; if (commit->generation > last_gen) - BUG("bad generation skip"); + BUG("bad generation skip %8x > %8x at %s", + commit->generation, last_gen, + oid_to_hex(&commit->object.oid)); last_gen = commit->generation; if (commit->generation < min_generation) @@ -947,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt) parse_commit(array[i]); for (i = 0; i < cnt; i++) { struct commit_list *common; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (redundant[i]) continue; @@ -955,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt) continue; filled_index[filled] = j; work[filled++] = array[j]; + + if (array[j]->generation < min_generation) + min_generation = array[j]->generation; } - common = paint_down_to_common(array[i], filled, work, 0); + common = paint_down_to_common(array[i], filled, work, + min_generation); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1073,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; - if (min_generation > reference[i]->generation) + if (reference[i]->generation < min_generation) min_generation = reference[i]->generation; } -- >8 -- Derrick Stolee (11): ref-filter: fix outdated comment on in_commit_list commit: add generation number to 
struct commmit commit-graph: compute generation numbers commit: use generations in paint_down_to_common() commit-graph: always load commit-graph information ref-filter: use generation number for --contains commit: use generation numbers for in_merge_bases() commit: add short-circuit to paint_down_to_common() commit: use generation number in remove_redundant() merge: check config before loading commits commit-graph.txt: update design document Documentation/technical/commit-graph.txt | 30 ++++++-- alloc.c | 1 + builtin/merge.c | 7 +- commit-graph.c | 91 ++++++++++++++++++++---- commit-graph.h | 8 +++ commit.c | 61 +++++++++++++--- commit.h | 7 +- object.c | 2 +- ref-filter.c | 26 +++++-- sha1_file.c | 2 +- t/t5318-commit-graph.sh | 9 +++ 11 files changed, 204 insertions(+), 40 deletions(-) base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707 -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
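The compute_generation_numbers() hunks in the inter-diff above can be read in isolation as an iterative depth-first walk. The sketch below follows the same shape on a toy graph, under stated assumptions: `struct toy_commit`, `toy_push()` and `toy_compute_generations()` are hypothetical names, an explicit array stack replaces Git's commit_list, and parents are plain pointers rather than entries in a commit-graph file.

```c
#include <stdint.h>
#include <stdlib.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/* Toy stand-in for struct commit; not Git's real layout. */
struct toy_commit {
	uint32_t generation;
	int nr_parents;
	struct toy_commit **parents;
};

static int gen_is_uncomputed(const struct toy_commit *c)
{
	return c->generation == GENERATION_NUMBER_INFINITY ||
	       c->generation == GENERATION_NUMBER_ZERO;
}

/* Push onto a growable stack; returns -1 on allocation failure. */
static int toy_push(struct toy_commit ***stack, size_t *depth,
		    size_t *alloc, struct toy_commit *c)
{
	if (*depth == *alloc) {
		size_t new_alloc = *alloc ? *alloc * 2 : 16;
		struct toy_commit **tmp = realloc(*stack, new_alloc * sizeof(*tmp));
		if (!tmp)
			return -1;
		*stack = tmp;
		*alloc = new_alloc;
	}
	(*stack)[(*depth)++] = c;
	return 0;
}

/* Returns 0 on success, -1 on allocation failure. */
static int toy_compute_generations(struct toy_commit **commits, int nr)
{
	struct toy_commit **stack = NULL;
	size_t depth = 0, alloc = 0;
	int i, j;

	for (i = 0; i < nr; i++) {
		/* Only walk from commits without a usable generation number. */
		if (!gen_is_uncomputed(commits[i]))
			continue;
		if (toy_push(&stack, &depth, &alloc, commits[i]))
			goto fail;

		while (depth) {
			struct toy_commit *current = stack[depth - 1];
			uint32_t max_generation = 0;
			int all_parents_computed = 1;

			for (j = 0; j < current->nr_parents; j++) {
				struct toy_commit *p = current->parents[j];
				if (gen_is_uncomputed(p)) {
					/* Descend into the parent first. */
					all_parents_computed = 0;
					if (toy_push(&stack, &depth, &alloc, p))
						goto fail;
					break;
				}
				if (p->generation > max_generation)
					max_generation = p->generation;
			}

			if (all_parents_computed) {
				current->generation = max_generation + 1;
				/* Saturate instead of overflowing the 30-bit field. */
				if (current->generation > GENERATION_NUMBER_MAX)
					current->generation = GENERATION_NUMBER_MAX;
				depth--;
			}
		}
	}
	free(stack);
	return 0;
fail:
	free(stack);
	return -1;
}
```

As in the v5 hunk, the clamp to GENERATION_NUMBER_MAX sits inside the branch that assigns the generation number, so it only runs once a value has actually been computed for `current`.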
* [PATCH v5 01/11] ref-filter: fix outdated comment on in_commit_list 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee @ 2018-05-01 12:47 ` Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 02/11] commit: add generation number to struct commmit Derrick Stolee ` (10 subsequent siblings) 11 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw) To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee The in_commit_list() method does not check the parents of the candidate for containment in the list. Fix the comment that incorrectly states that it does. Reported-by: Jakub Narebski <jnareb@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- ref-filter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ref-filter.c b/ref-filter.c index cffd8bf3ce..aff24d93be 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1582,7 +1582,7 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) } /* - * Test whether the candidate or one of its parents is contained in the list. + * Test whether the candidate is contained in the list. * Do not recurse to find out, though, but return -1 if inconclusive. */ static enum contains_result contains_test(struct commit *candidate, -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
* [PATCH v5 02/11] commit: add generation number to struct commmit 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 01/11] ref-filter: fix outdated comment on in_commit_list Derrick Stolee @ 2018-05-01 12:47 ` Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 03/11] commit-graph: compute generation numbers Derrick Stolee ` (9 subsequent siblings) 11 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw) To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee The generation number of a commit is defined recursively as follows: * If a commit A has no parents, then the generation number of A is one. * If a commit A has parents, then the generation number of A is one more than the maximum generation number among the parents of A. Add a uint32_t generation field to struct commit so we can pass this information to revision walks. We use three special values to signal the generation number is invalid: GENERATION_NUMBER_INFINITY 0xFFFFFFFF GENERATION_NUMBER_MAX 0x3FFFFFFF GENERATION_NUMBER_ZERO 0 The first (_INFINITY) means the generation number has not been loaded or computed. The second (_MAX) means the generation number is too large to store in the commit-graph file. The third (_ZERO) means the generation number was loaded from a commit graph file that was written by a version of git that did not support generation numbers. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- alloc.c | 1 + commit-graph.c | 2 ++ commit.h | 4 ++++ 3 files changed, 7 insertions(+) diff --git a/alloc.c b/alloc.c index cf4f8b61e1..e8ab14f4a1 100644 --- a/alloc.c +++ b/alloc.c @@ -94,6 +94,7 @@ void *alloc_commit_node(void) c->object.type = OBJ_COMMIT; c->index = alloc_commit_index(); c->graph_pos = COMMIT_NOT_FROM_GRAPH; + c->generation = GENERATION_NUMBER_INFINITY; return c; } diff --git a/commit-graph.c b/commit-graph.c index 70fa1b25fd..9ad21c3ffb 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin date_low = get_be32(commit_data + g->hash_len + 12); item->date = (timestamp_t)((date_high << 32) | date_low); + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; + pptr = &item->parents; edge_value = get_be32(commit_data + g->hash_len); diff --git a/commit.h b/commit.h index 23a3f364ed..aac3b8c56f 100644 --- a/commit.h +++ b/commit.h @@ -10,6 +10,9 @@ #include "pretty.h" #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF +#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF +#define GENERATION_NUMBER_MAX 0x3FFFFFFF +#define GENERATION_NUMBER_ZERO 0 struct commit_list { struct commit *item; @@ -30,6 +33,7 @@ struct commit { */ struct tree *maybe_tree; uint32_t graph_pos; + uint32_t generation; }; extern int save_commit_buffer; -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
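To illustrate the shift by two in fill_commit_in_graph(), here is a minimal standalone sketch. It assumes the generation number occupies the upper 30 bits of a 32-bit word whose lowest two bits carry the high bits of the commit date, which would also explain why GENERATION_NUMBER_MAX is 0x3FFFFFFF. The names pack_gen_word() and unpack_generation() are hypothetical and are not git functions. Byte order (get_be32/htonl in the patch) is left out on purpose.

```c
#include <stdint.h>

#define GENERATION_NUMBER_MAX 0x3FFFFFFF

/*
 * Hypothetical helper: build the 32-bit word that holds the generation
 * number in its upper 30 bits and two date bits in its lowest 2 bits.
 */
static uint32_t pack_gen_word(uint32_t generation, uint32_t date_high_bits)
{
	/* Anything larger than 30 bits cannot be stored, so clamp it. */
	if (generation > GENERATION_NUMBER_MAX)
		generation = GENERATION_NUMBER_MAX;
	return (generation << 2) | (date_high_bits & 0x3);
}

/* Hypothetical helper: mirrors the ">> 2" used when reading the word. */
static uint32_t unpack_generation(uint32_t word)
{
	return word >> 2;
}
```

A word of all zero bits unpacks to generation 0, which matches the role of GENERATION_NUMBER_ZERO for files written without generation numbers.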
* [PATCH v5 03/11] commit-graph: compute generation numbers 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 01/11] ref-filter: fix outdated comment on in_commit_list Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 02/11] commit: add generation number to struct commmit Derrick Stolee @ 2018-05-01 12:47 ` Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 04/11] commit: use generations in paint_down_to_common() Derrick Stolee ` (8 subsequent siblings) 11 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw) To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee While preparing commits to be written into a commit-graph file, compute the generation numbers using a depth-first strategy. The only commits that are walked in this depth-first search are those without a precomputed generation number. Thus, computation time will be relative to the number of new commits to the commit-graph file. If a computed generation number would exceed GENERATION_NUMBER_MAX, then use GENERATION_NUMBER_MAX instead. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit-graph.c | 43 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/commit-graph.c b/commit-graph.c index 9ad21c3ffb..36d765e10a 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -439,6 +439,8 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, else packedDate[0] = 0; + packedDate[0] |= htonl((*list)->generation << 2); + packedDate[1] = htonl((*list)->date); hashwrite(f, packedDate, 8); @@ -571,6 +573,45 @@ static void close_reachable(struct packed_oid_list *oids) } } +static void compute_generation_numbers(struct packed_commit_list* commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < commits->nr; i++) { + if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY && + commits->list[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits->list[i], &list); + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, &list); + break; + } else if (parent->item->generation > max_generation) { + max_generation = parent->item->generation; + } + } + + if (all_parents_computed) { + current->generation = max_generation + 1; + pop_commit(&list); + + if (current->generation > GENERATION_NUMBER_MAX) + current->generation = GENERATION_NUMBER_MAX; + } + } + } +} + void write_commit_graph(const char *obj_dir, const char **pack_indexes, int nr_packs, @@ -694,6 +735,8 @@ void write_commit_graph(const char *obj_dir, if (commits.nr >= GRAPH_PARENT_MISSING) die(_("too many commits to write graph")); + compute_generation_numbers(&commits); + graph_name = 
get_commit_graph_filename(obj_dir); fd = hold_lock_file_for_update(&lk, graph_name, 0); -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
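The depth-first strategy described in this patch can be tried outside of git with a toy graph. The sketch below is only an approximation under stated assumptions: commits are array entries, parents are array indices (-1 for none), an explicit int stack replaces the commit_list, and GEN_UNKNOWN stands in for both "infinity" and "zero". struct toy_commit and toy_compute_generations() are made-up names.

```c
#include <stdint.h>

#define GEN_UNKNOWN 0          /* stand-in for "not computed yet" in this sketch */
#define GEN_MAX     0x3FFFFFFF /* same cap as GENERATION_NUMBER_MAX */
#define MAX_PARENTS 2
#define MAX_NODES   64

/* Toy commit: parents are indices into the array, -1 means "no parent". */
struct toy_commit {
	int parents[MAX_PARENTS];
	uint32_t generation;
};

/*
 * Iterative depth-first walk. A commit stays on the stack until all of
 * its parents have a generation number; then it gets max(parents) + 1.
 * The caller must ensure nr <= MAX_NODES and that the graph is acyclic.
 */
static void toy_compute_generations(struct toy_commit *commits, int nr)
{
	int stack[MAX_NODES];
	int i;

	for (i = 0; i < nr; i++) {
		int top = 0;

		/* Only commits without a generation number start a walk. */
		if (commits[i].generation != GEN_UNKNOWN)
			continue;

		stack[top++] = i;
		while (top > 0) {
			struct toy_commit *current = &commits[stack[top - 1]];
			uint32_t max_generation = 0;
			int all_parents_computed = 1;
			int p;

			for (p = 0; p < MAX_PARENTS; p++) {
				int parent = current->parents[p];

				if (parent < 0)
					continue;
				if (commits[parent].generation == GEN_UNKNOWN) {
					/* Visit this parent first, then retry. */
					all_parents_computed = 0;
					stack[top++] = parent;
					break;
				}
				if (commits[parent].generation > max_generation)
					max_generation = commits[parent].generation;
			}

			if (all_parents_computed) {
				current->generation = max_generation + 1;
				if (current->generation > GEN_MAX)
					current->generation = GEN_MAX;
				top--;
			}
		}
	}
}
```

Because commits that already carry a number are skipped, a second call on the same array does no walking at all, which is the property the commit message relies on for incremental writes.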
* [PATCH v5 04/11] commit: use generations in paint_down_to_common() 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee ` (2 preceding siblings ...) 2018-05-01 12:47 ` [PATCH v5 03/11] commit-graph: compute generation numbers Derrick Stolee @ 2018-05-01 12:47 ` Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 05/11] commit-graph: always load commit-graph information Derrick Stolee ` (7 subsequent siblings) 11 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw) To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee Define compare_commits_by_gen_then_commit_date(), which uses generation numbers as a primary comparison and commit date to break ties (or as a comparison when both commits do not have computed generation numbers). Since the commit-graph file is closed under reachability, we know that all commits in the file have generation at most GENERATION_NUMBER_MAX which is less than GENERATION_NUMBER_INFINITY. This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. In the case that commit dates violate topological order (i.e. a parent is "newer" than a child), the previous code could walk a commit twice: if a commit is reached with the PARENT1 bit, but later is re-visited with the PARENT2 bit, then that PARENT2 bit must be propagated to its parents. Using generation numbers avoids this extra effort, even if it is somewhat rare. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 20 +++++++++++++++++++- commit.h | 1 + 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 711f674c18..4d00b0a1d6 100644 --- a/commit.c +++ b/commit.c @@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused) +{ + const struct commit *a = a_, *b = b_; + + /* newer commits first */ + if (a->generation < b->generation) + return 1; + else if (a->generation > b->generation) + return -1; + + /* use date as a heuristic when generations are equal */ + if (a->date < b->date) + return 1; + else if (a->date > b->date) + return -1; + return 0; +} + int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; @@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue) /* all input commits in one and twos[] must have been parsed! */ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) { - struct prio_queue queue = { compare_commits_by_commit_date }; + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; diff --git a/commit.h b/commit.h index aac3b8c56f..64436ff44e 100644 --- a/commit.h +++ b/commit.h @@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf); extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc); int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); +int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused); LAST_ARG_MUST_BE_NULL extern int run_commit_hook(int editor_is_used, const char *index_file, const char *name, ...); -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
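The effect of the new ordering is easy to see with plain qsort(). This is a sketch, not the git comparator: it drops the unused third argument that prio_queue passes, and struct sort_commit is a hypothetical stand-in that keeps only the two fields being compared.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two fields of struct commit used here. */
struct sort_commit {
	uint32_t generation;
	uint64_t date;
};

/* Higher generation first; the commit date only breaks ties. */
static int cmp_gen_then_date(const void *a_, const void *b_)
{
	const struct sort_commit *a = a_, *b = b_;

	if (a->generation < b->generation)
		return 1;
	else if (a->generation > b->generation)
		return -1;

	/* Generations are equal: fall back to the date heuristic. */
	if (a->date < b->date)
		return 1;
	else if (a->date > b->date)
		return -1;
	return 0;
}
```

Sorting the entries (gen 2, date 100), (gen 0xFFFFFFFF, date 50), (gen 3, date 10), (gen 3, date 20) with this comparator puts the "infinite" commit first and the gen 2 commit last, even though the latter has the newest date. That is the clock-skew case the commit message mentions.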
* [PATCH v5 05/11] commit-graph: always load commit-graph information 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee ` (3 preceding siblings ...) 2018-05-01 12:47 ` [PATCH v5 04/11] commit: use generations in paint_down_to_common() Derrick Stolee @ 2018-05-01 12:47 ` Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 06/11] ref-filter: use generation number for --contains Derrick Stolee ` (6 subsequent siblings) 11 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw) To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee Most code paths load commits using lookup_commit() and then parse_commit(). In some cases, including some branch lookups, the commit is parsed using parse_object_buffer() which side-steps parse_commit() in favor of parse_commit_buffer(). With generation numbers in the commit-graph, we need to ensure that any commit that exists in the commit-graph file has its generation number loaded. Create new load_commit_graph_info() method to fill in the information for a commit that exists only in the commit-graph file. Call it from parse_commit_buffer() after loading the other commit information from the given buffer. Only fill this information when specified by the 'check_graph' parameter. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit-graph.c | 46 +++++++++++++++++++++++++++++++--------------- commit-graph.h | 8 ++++++++ commit.c | 7 +++++-- commit.h | 2 +- object.c | 2 +- sha1_file.c | 2 +- 6 files changed, 47 insertions(+), 20 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 36d765e10a..a8c337dd77 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -245,6 +245,13 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, return &commit_list_insert(c, pptr)->next; } +static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) +{ + const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; + item->graph_pos = pos; + item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; +} + static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos) { uint32_t edge_value; @@ -292,31 +299,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin return 1; } +static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos) +{ + if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { + *pos = item->graph_pos; + return 1; + } else { + return bsearch_graph(g, &(item->object.oid), pos); + } +} + int parse_commit_in_graph(struct commit *item) { + uint32_t pos; + if (!core_commit_graph) return 0; if (item->object.parsed) return 1; - prepare_commit_graph(); - if (commit_graph) { - uint32_t pos; - int found; - if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) { - pos = item->graph_pos; - found = 1; - } else { - found = bsearch_graph(commit_graph, &(item->object.oid), &pos); - } - - if (found) - return fill_commit_in_graph(item, commit_graph, pos); - } - + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) + return fill_commit_in_graph(item, commit_graph, pos); return 0; } +void load_commit_graph_info(struct commit *item) +{ + uint32_t pos; + if (!core_commit_graph) + 
return; + prepare_commit_graph(); + if (commit_graph && find_commit_in_graph(item, commit_graph, &pos)) + fill_commit_graph_info(item, commit_graph, pos); +} + static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c) { struct object_id oid; diff --git a/commit-graph.h b/commit-graph.h index 260a468e73..96cccb10f3 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir); */ int parse_commit_in_graph(struct commit *item); +/* + * It is possible that we loaded commit contents from the commit buffer, + * but we also want to ensure the commit-graph content is correctly + * checked and filled. Fill the graph_pos and generation members of + * the given commit. + */ +void load_commit_graph_info(struct commit *item); + struct tree *get_commit_tree_in_graph(const struct commit *c); struct commit_graph { diff --git a/commit.c b/commit.c index 4d00b0a1d6..39a3749abd 100644 --- a/commit.c +++ b/commit.c @@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep) return ret; } -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size) +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph) { const char *tail = buffer; const char *bufptr = buffer; @@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s } item->date = parse_commit_date(bufptr, tail); + if (check_graph) + load_commit_graph_info(item); + return 0; } @@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing) return error("Object %s not a commit", oid_to_hex(&item->object.oid)); } - ret = parse_commit_buffer(item, buffer, size); + ret = parse_commit_buffer(item, buffer, size, 0); if (save_commit_buffer && !ret) { set_commit_buffer(item, buffer, size); return 0; diff --git a/commit.h b/commit.h index 64436ff44e..b5afde1ae9 100644 --- a/commit.h +++ 
b/commit.h @@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name); */ struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name); -int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size); +int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph); int parse_commit_gently(struct commit *item, int quiet_on_missing); static inline int parse_commit(struct commit *item) { diff --git a/object.c b/object.c index e6ad3f61f0..efe4871325 100644 --- a/object.c +++ b/object.c @@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type } else if (type == OBJ_COMMIT) { struct commit *commit = lookup_commit(oid); if (commit) { - if (parse_commit_buffer(commit, buffer, size)) + if (parse_commit_buffer(commit, buffer, size, 1)) return NULL; if (!get_cached_commit_buffer(commit, NULL)) { set_commit_buffer(commit, buffer, size); diff --git a/sha1_file.c b/sha1_file.c index 1b94f39c4c..0fd4f0b8b6 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size) { struct commit c; memset(&c, 0, sizeof(c)); - if (parse_commit_buffer(&c, buf, size)) + if (parse_commit_buffer(&c, buf, size, 0)) die("corrupt commit"); } -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
* [PATCH v5 06/11] ref-filter: use generation number for --contains 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee ` (4 preceding siblings ...) 2018-05-01 12:47 ` [PATCH v5 05/11] commit-graph: always load commit-graph information Derrick Stolee @ 2018-05-01 12:47 ` Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 07/11] commit: use generation numbers for in_merge_bases() Derrick Stolee ` (5 subsequent siblings) 11 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw) To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee A commit A can reach a commit B only if the generation number of A is strictly larger than the generation number of B. This condition allows us to short-circuit commit-graph walks significantly. Use generation number for '--contains' type queries. On a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag, the command 'git tag --contains HEAD' had the following performance improvement: Before: 0.81s After: 0.04s Rel %: -95% Helped-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- ref-filter.c | 24 ++++++++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/ref-filter.c b/ref-filter.c index aff24d93be..fb35067fc9 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -16,6 +16,7 @@ #include "trailer.h" #include "wt-status.h" #include "commit-slab.h" +#include "commit-graph.h" static struct ref_msg { const char *gone; @@ -1587,7 +1588,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c) */ static enum contains_result contains_test(struct commit *candidate, const struct commit_list *want, - struct contains_cache *cache) + struct contains_cache *cache, + uint32_t cutoff) { enum contains_result *cached = contains_cache_at(cache, candidate); @@ -1603,6 +1605,10 @@ static enum contains_result contains_test(struct commit *candidate, /* Otherwise, we don't know; prepare to recurse */ 
parse_commit_or_die(candidate); + + if (candidate->generation < cutoff) + return CONTAINS_NO; + return CONTAINS_UNKNOWN; } @@ -1618,8 +1624,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate, struct contains_cache *cache) { struct contains_stack contains_stack = { 0, 0, NULL }; - enum contains_result result = contains_test(candidate, want, cache); + enum contains_result result; + uint32_t cutoff = GENERATION_NUMBER_INFINITY; + const struct commit_list *p; + + for (p = want; p; p = p->next) { + struct commit *c = p->item; + load_commit_graph_info(c); + if (c->generation < cutoff) + cutoff = c->generation; + } + result = contains_test(candidate, want, cache, cutoff); if (result != CONTAINS_UNKNOWN) return result; @@ -1637,7 +1653,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, * If we just popped the stack, parents->item has been marked, * therefore contains_test will return a meaningful yes/no. */ - else switch (contains_test(parents->item, want, cache)) { + else switch (contains_test(parents->item, want, cache, cutoff)) { case CONTAINS_YES: *contains_cache_at(cache, commit) = CONTAINS_YES; contains_stack.nr--; @@ -1651,7 +1667,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate, } } free(contains_stack.contains_stack); - return contains_test(candidate, want, cache); + return contains_test(candidate, want, cache, cutoff); } static int commit_contains(struct ref_filter *filter, struct commit *commit, -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
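The cutoff logic in this patch boils down to two small steps, sketched here on bare generation numbers. The names min_wanted_generation(), quick_contains_test() and enum toy_contains are hypothetical; the real code works on struct commit and also consults the contains cache, which this sketch leaves out.

```c
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF

/* Hypothetical two-valued result: a definite "no" or "keep walking". */
enum toy_contains {
	TOY_CONTAINS_UNKNOWN = -1,
	TOY_CONTAINS_NO = 0
};

/* The cutoff is the smallest generation number among the wanted commits. */
static uint32_t min_wanted_generation(const uint32_t *want_gens, size_t nr)
{
	uint32_t cutoff = GENERATION_NUMBER_INFINITY;
	size_t i;

	for (i = 0; i < nr; i++)
		if (want_gens[i] < cutoff)
			cutoff = want_gens[i];
	return cutoff;
}

/*
 * A candidate below the cutoff cannot reach any wanted commit, so the
 * walk can stop there. Otherwise nothing is known yet.
 */
static enum toy_contains quick_contains_test(uint32_t candidate_gen,
					     uint32_t cutoff)
{
	if (candidate_gen < cutoff)
		return TOY_CONTAINS_NO;
	return TOY_CONTAINS_UNKNOWN;
}
```

Note that a candidate whose generation equals the cutoff is still "unknown": it may be the wanted commit itself. A candidate with GENERATION_NUMBER_INFINITY is never cut off, so commits outside the commit-graph keep the old behavior.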
* [PATCH v5 07/11] commit: use generation numbers for in_merge_bases() 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee ` (5 preceding siblings ...) 2018-05-01 12:47 ` [PATCH v5 06/11] ref-filter: use generation number for --contains Derrick Stolee @ 2018-05-01 12:47 ` Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 08/11] commit: add short-circuit to paint_down_to_common() Derrick Stolee ` (4 subsequent siblings) 11 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw) To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee The containment algorithm for 'git branch --contains' is different from that for 'git tag --contains' in that it uses is_descendant_of() instead of contains_tag_algo(). The expensive portion of the branch algorithm is computing merge bases. When a commit-graph file exists with generation numbers computed, we can avoid this merge-base calculation when the target commit has a larger generation number than the initial commits. Performance tests were run on a copy of the Linux repository where HEAD is contained in v4.13 but no earlier tag. 
Also, all tags were copied to branches and 'git branch --contains' was tested: Before: 60.0s After: 0.4s Rel %: -99.3% Reported-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 39a3749abd..3ecdc13356 100644 --- a/commit.c +++ b/commit.c @@ -1056,12 +1056,19 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * { struct commit_list *bases; int ret = 0, i; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (parse_commit(commit)) return ret; - for (i = 0; i < nr_reference; i++) + for (i = 0; i < nr_reference; i++) { if (parse_commit(reference[i])) return ret; + if (reference[i]->generation < min_generation) + min_generation = reference[i]->generation; + } + + if (commit->generation > min_generation) + return ret; bases = paint_down_to_common(commit, nr_reference, reference); if (commit->object.flags & PARENT2) -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
* [PATCH v5 08/11] commit: add short-circuit to paint_down_to_common() 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee ` (6 preceding siblings ...) 2018-05-01 12:47 ` [PATCH v5 07/11] commit: use generation numbers for in_merge_bases() Derrick Stolee @ 2018-05-01 12:47 ` Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 09/11] commit: use generation number in remove_redundant() Derrick Stolee ` (3 subsequent siblings) 11 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw) To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee When running 'git branch --contains', the in_merge_bases_many() method calls paint_down_to_common() to discover if a specific commit is reachable from a set of branches. Commits with lower generation number are not needed to correctly answer the containment query of in_merge_bases_many(). Add a new parameter, min_generation, to paint_down_to_common() that prevents walking commits with generation number strictly less than min_generation. If 0 is given, then there is no functional change. For in_merge_bases_many(), we can pass commit->generation as the cutoff, and this saves time during 'git branch --contains' queries that would otherwise walk "around" the commit we are inspecting. For a copy of the Linux repository, where HEAD is checked out at v4.13~100, we get the following performance improvement for 'git branch --contains' over the previous commit: Before: 0.21s After: 0.13s Rel %: -38% Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/commit.c b/commit.c index 3ecdc13356..9875feec01 100644 --- a/commit.c +++ b/commit.c @@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue) } /* all input commits in one and twos[] must have been parsed! 
*/ -static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit **twos) +static struct commit_list *paint_down_to_common(struct commit *one, int n, + struct commit **twos, + int min_generation) { struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; struct commit_list *result = NULL; int i; + uint32_t last_gen = GENERATION_NUMBER_INFINITY; one->object.flags |= PARENT1; if (!n) { @@ -831,6 +834,15 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struc struct commit_list *parents; int flags; + if (commit->generation > last_gen) + BUG("bad generation skip %8x > %8x at %s", + commit->generation, last_gen, + oid_to_hex(&commit->object.oid)); + last_gen = commit->generation; + + if (commit->generation < min_generation) + break; + flags = commit->object.flags & (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit->object.flags & RESULT)) { @@ -879,7 +891,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co return NULL; } - list = paint_down_to_common(one, n, twos); + list = paint_down_to_common(one, n, twos, 0); while (list) { struct commit *commit = pop_commit(&list); @@ -946,7 +958,7 @@ static int remove_redundant(struct commit **array, int cnt) filled_index[filled] = j; work[filled++] = array[j]; } - common = paint_down_to_common(array[i], filled, work); + common = paint_down_to_common(array[i], filled, work, 0); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) @@ -1070,7 +1082,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * if (commit->generation > min_generation) return ret; - bases = paint_down_to_common(commit, nr_reference, reference); + bases = paint_down_to_common(commit, nr_reference, reference, commit->generation); if (commit->object.flags & PARENT2) ret = 1; clear_commit_marks(commit, all_flags); -- 2.17.0.39.g685157f7fb ^ permalink raw reply related 
[flat|nested] 162+ messages in thread
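To see why a generation cutoff saves work when walking "around" a commit, consider this simplified reachability walk. It is not paint_down_to_common(): there is no priority queue and no PARENT1/PARENT2 painting, and because the stack here is unsorted the sketch skips low-generation commits with 'continue' where the patch can 'break'. struct walk_commit and toy_reachable() are hypothetical names, and parents are array indices with -1 for none.

```c
#include <stdint.h>

#define WALK_MAX_NODES   64
#define WALK_MAX_PARENTS 2

/* Toy commit: parents are indices into the array, -1 means "no parent". */
struct walk_commit {
	int parents[WALK_MAX_PARENTS];
	uint32_t generation;
};

/*
 * Return 1 if 'target' is reachable from 'tip'. Commits whose generation
 * is below 'min_generation' are not walked. '*walked' receives the number
 * of commits that were actually inspected. The array must hold at most
 * WALK_MAX_NODES commits.
 */
static int toy_reachable(const struct walk_commit *commits, int tip,
			 int target, uint32_t min_generation, int *walked)
{
	int stack[WALK_MAX_NODES];
	unsigned char seen[WALK_MAX_NODES] = { 0 };
	int top = 0;

	*walked = 0;
	stack[top++] = tip;
	seen[tip] = 1;

	while (top > 0) {
		int cur = stack[--top];
		int p;

		/* Too old to matter for this query: do not descend. */
		if (commits[cur].generation < min_generation)
			continue;

		(*walked)++;
		if (cur == target)
			return 1;

		for (p = 0; p < WALK_MAX_PARENTS; p++) {
			int parent = commits[cur].parents[p];

			if (parent < 0 || seen[parent])
				continue;
			seen[parent] = 1;
			stack[top++] = parent;
		}
	}
	return 0;
}
```

Passing 0 as min_generation walks everything reachable from the tip, matching the "no functional change" case in the commit message. Passing the target's own generation is safe because any commit that can reach the target has a generation at least that large.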
* [PATCH v5 09/11] commit: use generation number in remove_redundant() 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee ` (7 preceding siblings ...) 2018-05-01 12:47 ` [PATCH v5 08/11] commit: add short-circuit to paint_down_to_common() Derrick Stolee @ 2018-05-01 12:47 ` Derrick Stolee 2018-05-01 15:37 ` Derrick Stolee 2018-05-03 18:45 ` Jakub Narebski 2018-05-01 12:47 ` [PATCH v5 10/11] merge: check config before loading commits Derrick Stolee ` (2 subsequent siblings) 11 siblings, 2 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw) To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee The static remove_redundant() method is used to filter a list of commits by removing those that are reachable from another commit in the list. This is used to remove all possible merge-bases except a maximal, mutually independent set. To determine whether these commits are independent, we use a number of paint_down_to_common() walks and use the PARENT1, PARENT2 flags to determine reachability. Since we only care about reachability and not the full set of merge-bases between 'one' and 'twos', we can use the 'min_generation' parameter to short-circuit the walk. When no commit-graph exists, there is no change in behavior. For a copy of the Linux repository, we measured the following performance improvements: git merge-base v3.3 v4.5 Before: 234 ms After: 208 ms Rel %: -11% git merge-base v4.3 v4.5 Before: 102 ms After: 83 ms Rel %: -19% The experiments above were chosen to demonstrate that we are improving the filtering of the merge-base set. In the first example, more time is spent walking the history to find the set of merge bases before the remove_redundant() call. The starting commits are closer together in the second example, therefore more time is spent in remove_redundant(). The relative change in performance differs as expected. 
Reported-by: Jakub Narebski <jnareb@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 9875feec01..5064db4e61 100644 --- a/commit.c +++ b/commit.c @@ -949,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt) parse_commit(array[i]); for (i = 0; i < cnt; i++) { struct commit_list *common; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; if (redundant[i]) continue; @@ -957,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt) continue; filled_index[filled] = j; work[filled++] = array[j]; + + if (array[j]->generation < min_generation) + min_generation = array[j]->generation; } - common = paint_down_to_common(array[i], filled, work, 0); + common = paint_down_to_common(array[i], filled, work, + min_generation); if (array[i]->object.flags & PARENT2) redundant[i] = 1; for (j = 0; j < filled; j++) -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
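The cutoff passed to paint_down_to_common() here is a minimum over generation numbers, and the follow-up replies in this thread note that the starting value should be array[i]->generation rather than GENERATION_NUMBER_INFINITY, since the inner loop skips the i-th commit. A minimal sketch of that corrected computation on bare generation numbers follows. redundant_walk_cutoff() is a hypothetical name, and the sketch ignores the redundant[] bookkeeping of the real loop.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the smallest generation among all 'cnt' commits,
 * seeded with the i-th commit so that it is included even though the
 * loop below skips index i (as the loop over j does in the patch).
 */
static uint32_t redundant_walk_cutoff(const uint32_t *gens, int cnt, int i)
{
	uint32_t min_generation = gens[i];
	int j;

	for (j = 0; j < cnt; j++) {
		if (j == i)
			continue;
		if (gens[j] < min_generation)
			min_generation = gens[j];
	}
	return min_generation;
}
```

With generations {3, 10, 8} and i = 0, this yields 3. Seeding with infinity instead would yield 8, and a walk cut off at 8 could stop before reaching the i-th commit from one of the others.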
* Re: [PATCH v5 09/11] commit: use generation number in remove_redundant() 2018-05-01 12:47 ` [PATCH v5 09/11] commit: use generation number in remove_redundant() Derrick Stolee @ 2018-05-01 15:37 ` Derrick Stolee 2018-05-03 18:45 ` Jakub Narebski 1 sibling, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 15:37 UTC (permalink / raw) To: Derrick Stolee, git; +Cc: gitster, peff, jnareb, avarab On 5/1/2018 8:47 AM, Derrick Stolee wrote: > The static remove_redundant() method is used to filter a list > of commits by removing those that are reachable from another > commit in the list. This is used to remove all possible merge- > bases except a maximal, mutually independent set. > > To determine these commits are independent, we use a number of > paint_down_to_common() walks and use the PARENT1, PARENT2 flags > to determine reachability. Since we only care about reachability > and not the full set of merge-bases between 'one' and 'twos', we > can use the 'min_generation' parameter to short-circuit the walk. > > When no commit-graph exists, there is no change in behavior. > > For a copy of the Linux repository, we measured the following > performance improvements: > > git merge-base v3.3 v4.5 > > Before: 234 ms > After: 208 ms > Rel %: -11% > > git merge-base v4.3 v4.5 > > Before: 102 ms > After: 83 ms > Rel %: -19% > > The experiments above were chosen to demonstrate that we are > improving the filtering of the merge-base set. In the first > example, more time is spent walking the history to find the > set of merge bases before the remove_redundant() call. The > starting commits are closer together in the second example, > therefore more time is spent in remove_redundant(). The relative > change in performance differs as expected. 
> > Reported-by: Jakub Narebski <jnareb@gmail.com> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > commit.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/commit.c b/commit.c > index 9875feec01..5064db4e61 100644 > --- a/commit.c > +++ b/commit.c > @@ -949,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt) > parse_commit(array[i]); > for (i = 0; i < cnt; i++) { > struct commit_list *common; > + uint32_t min_generation = GENERATION_NUMBER_INFINITY; This initialization should be uint32_t min_generation = array[i]->generation; since the assignment (using j) below skips the ith commit. > > if (redundant[i]) > continue; > @@ -957,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt) > continue; > filled_index[filled] = j; > work[filled++] = array[j]; > + > + if (array[j]->generation < min_generation) > + min_generation = array[j]->generation; > } > - common = paint_down_to_common(array[i], filled, work, 0); > + common = paint_down_to_common(array[i], filled, work, > + min_generation); > if (array[i]->object.flags & PARENT2) > redundant[i] = 1; > for (j = 0; j < filled; j++) ^ permalink raw reply [flat|nested] 162+ messages in thread
* Re: [PATCH v5 09/11] commit: use generation number in remove_redundant() 2018-05-01 12:47 ` [PATCH v5 09/11] commit: use generation number in remove_redundant() Derrick Stolee 2018-05-01 15:37 ` Derrick Stolee @ 2018-05-03 18:45 ` Jakub Narebski 1 sibling, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-05-03 18:45 UTC (permalink / raw) To: Derrick Stolee Cc: git, Junio C Hamano, Derrick Stolee, Jeff King, Ævar Arnfjörð Bjarmason Derrick Stolee <dstolee@microsoft.com> writes: > The static remove_redundant() method is used to filter a list > of commits by removing those that are reachable from another > commit in the list. This is used to remove all possible merge- > bases except a maximal, mutually independent set. > > To determine these commits are independent, we use a number of > paint_down_to_common() walks and use the PARENT1, PARENT2 flags > to determine reachability. Since we only care about reachability > and not the full set of merge-bases between 'one' and 'twos', we > can use the 'min_generation' parameter to short-circuit the walk. > > When no commit-graph exists, there is no change in behavior. > > For a copy of the Linux repository, we measured the following > performance improvements: > > git merge-base v3.3 v4.5 > > Before: 234 ms > After: 208 ms > Rel %: -11% > > git merge-base v4.3 v4.5 > > Before: 102 ms > After: 83 ms > Rel %: -19% > > The experiments above were chosen to demonstrate that we are > improving the filtering of the merge-base set. In the first > example, more time is spent walking the history to find the > set of merge bases before the remove_redundant() call. The > starting commits are closer together in the second example, > therefore more time is spent in remove_redundant(). The relative > change in performance differs as expected. > > Reported-by: Jakub Narebski <jnareb@gmail.com> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Good description. 
> --- > commit.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > Let me extend context a bit to make it easier to review. > diff --git a/commit.c b/commit.c > index 9875feec01..5064db4e61 100644 > --- a/commit.c > +++ b/commit.c > @@ -949,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt) > parse_commit(array[i]); > for (i = 0; i < cnt; i++) { > struct commit_list *common; > + uint32_t min_generation = GENERATION_NUMBER_INFINITY; As you have noticed, and as is already fixed in 'pu', it should be + uint32_t min_generation = array[i]->generation; > > if (redundant[i]) > continue; > @@ -957,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt) > continue; > filled_index[filled] = j; > work[filled++] = array[j]; > + > + if (array[j]->generation < min_generation) > + min_generation = array[j]->generation; remove_redundant() checks whether the i-th commit is reachable from commits i+1..cnt, and vice versa, by checking the PARENT1 and PARENT2 flags, respectively. As you have noticed, this means that the min_generation cutoff should be the minimum of array[i]->generation and all of array[j]->generation for j=i+1..cnt. There is no reason to go further down if we are interested only in reachability, and not actually in merge bases. > } > - common = paint_down_to_common(array[i], filled, work, 0); > + common = paint_down_to_common(array[i], filled, work, > + min_generation); > if (array[i]->object.flags & PARENT2) > redundant[i] = 1; > for (j = 0; j < filled; j++) if (work[j]->object.flags & PARENT1) redundant[filled_index[j]] = 1; Besides this issue, a nice and simple speedup. Good. Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
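The cutoff discussed in the review above can be sketched in isolation. This is a minimal illustration, not the code in commit.c: `struct toy_commit` and `min_generation_cutoff()` are hypothetical names, and the loop is assumed to skip the i-th commit and any commit already marked redundant, as the quoted hunk suggests.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for git's struct commit; only the field used here. */
struct toy_commit {
	uint32_t generation;
};

/*
 * Return the smallest generation number among array[i] and every
 * non-redundant array[j] with j != i. Starting from array[i]->generation
 * (rather than "infinity") matters because the loop never visits j == i.
 */
static uint32_t min_generation_cutoff(struct toy_commit **array,
				      const int *redundant,
				      size_t cnt, size_t i)
{
	uint32_t min_generation = array[i]->generation;
	size_t j;

	for (j = 0; j < cnt; j++) {
		if (i == j || redundant[j])
			continue;
		if (array[j]->generation < min_generation)
			min_generation = array[j]->generation;
	}
	return min_generation;
}
```

With generations {5, 3, 9} and i = 1, the cutoff is 3. An "infinity" start would have produced 5, so the walk could stop before it reaches the generation of array[i] itself.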
* [PATCH v5 10/11] merge: check config before loading commits 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee ` (8 preceding siblings ...) 2018-05-01 12:47 ` [PATCH v5 09/11] commit: use generation number in remove_redundant() Derrick Stolee @ 2018-05-01 12:47 ` Derrick Stolee 2018-05-01 12:47 ` [PATCH v5 11/11] commit-graph.txt: update design document Derrick Stolee 2018-05-03 11:18 ` [PATCH v5 00/11] Compute and consume generation numbers Jakub Narebski 11 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw) To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee Now that we use generation numbers from the commit-graph, we must ensure that all commits that exist in the commit-graph are loaded from that file instead of from the object database. Since the commit-graph file is only checked if core.commitGraph is true, we must check the default config before we load any commits. In the merge builtin, the config was checked after loading the HEAD commit. This was due to the use of the global 'branch' when checking merge-specific config settings. Move the config load to be between the initialization of 'branch' and the commit lookup. Without this change, a fast-forward merge would hit a BUG("bad generation skip") statement in commit.c during paint_down_to_common(). This is because the HEAD commit would be loaded with "infinite" generation but then reached by commits with "finite" generation numbers. Add a test to t5318-commit-graph.sh that exercises this code path to prevent a regression. 
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- builtin/merge.c | 7 ++++--- t/t5318-commit-graph.sh | 9 +++++++++ 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/builtin/merge.c b/builtin/merge.c index 5e5e4497e3..b819756946 100644 --- a/builtin/merge.c +++ b/builtin/merge.c @@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char *prefix) branch = branch_to_free = resolve_refdup("HEAD", 0, &head_oid, NULL); if (branch) skip_prefix(branch, "refs/heads/", &branch); + + init_diff_ui_defaults(); + git_config(git_merge_config, NULL); + if (!branch || is_null_oid(&head_oid)) head_commit = NULL; else head_commit = lookup_commit_or_die(&head_oid, "HEAD"); - init_diff_ui_defaults(); - git_config(git_merge_config, NULL); - if (branch_mergeoptions) parse_branch_merge_options(branch_mergeoptions); argc = parse_options(argc, argv, prefix, builtin_merge_options, diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh index a380419b65..77d85aefe7 100755 --- a/t/t5318-commit-graph.sh +++ b/t/t5318-commit-graph.sh @@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' ' graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 merge/1 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 merge/2 +test_expect_success 'perform fast-forward merge in full repo' ' + cd "$TRASH_DIRECTORY/full" && + git checkout -b merge-5-to-8 commits/5 && + git merge commits/8 && + git show-ref -s merge-5-to-8 >output && + git show-ref -s commits/8 >expect && + test_cmp expect output +' + test_done -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
* [PATCH v5 11/11] commit-graph.txt: update design document 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee ` (9 preceding siblings ...) 2018-05-01 12:47 ` [PATCH v5 10/11] merge: check config before loading commits Derrick Stolee @ 2018-05-01 12:47 ` Derrick Stolee 2018-05-03 11:18 ` [PATCH v5 00/11] Compute and consume generation numbers Jakub Narebski 11 siblings, 0 replies; 162+ messages in thread From: Derrick Stolee @ 2018-05-01 12:47 UTC (permalink / raw) To: git; +Cc: gitster, stolee, peff, jnareb, avarab, Derrick Stolee We now calculate generation numbers in the commit-graph file and use them in paint_down_to_common(). Expand the section on generation numbers to discuss how the three special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and _MAX interact with other generation numbers. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Documentation/technical/commit-graph.txt | 29 ++++++++++++++++++++---- 1 file changed, 24 insertions(+), 5 deletions(-) diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt index 0550c6d0dc..e1a883eb46 100644 --- a/Documentation/technical/commit-graph.txt +++ b/Documentation/technical/commit-graph.txt @@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having "infinite" generation number and walk until reaching commits with known generation number. +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + + If A and B are commits with generation numbers N and M, respectively, + and N < M, then A cannot reach B. 
+ +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. + Design Details -------------- @@ -98,18 +121,14 @@ Future Work - The 'commit-graph' subcommand does not have a "verify" mode that is necessary for integration with fsck. -- The file format includes room for precomputed generation numbers. These - are not currently computed, so all generation numbers will be marked as - 0 (or "uncomputed"). A later patch will include this calculation. - - After computing and storing generation numbers, we must make graph walks aware of generation numbers to gain the performance benefits they enable. This will mostly be accomplished by swapping a commit-date-ordered priority queue with one ordered by generation number. The following operations are important candidates: - - paint_down_to_common() - 'log --topo-order' + - 'tag --merged' - Currently, parse_commit_gently() requires filling in the root tree object for a commit. This passes through lookup_tree() and consequently -- 2.17.0.39.g685157f7fb ^ permalink raw reply related [flat|nested] 162+ messages in thread
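The weaker condition in the design document above can be sketched as a pair of small helpers. The three macro values are taken from the patch text; the helper names `cannot_reach_by_generation()` and `clamp_generation()` are hypothetical and only illustrate the rule, they are not functions in git.

```c
#include <stdint.h>

/* Values as stated in the design document patch. */
#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/*
 * Hypothetical helper: return 1 when the generation numbers alone prove
 * that commit A cannot reach commit B (strictly smaller generation),
 * and 0 when a walk is still needed. Equal values prove nothing, which
 * keeps the rule valid for _INFINITY, _ZERO and clamped _MAX commits.
 */
static int cannot_reach_by_generation(uint32_t gen_a, uint32_t gen_b)
{
	return gen_a < gen_b;
}

/*
 * Hypothetical helper: clamp a computed generation number to the largest
 * value that fits in the 30 bits available in the commit-graph file.
 */
static uint32_t clamp_generation(uint32_t computed)
{
	return computed > GENERATION_NUMBER_MAX ? GENERATION_NUMBER_MAX : computed;
}
```

A commit outside the commit-graph file (generation _INFINITY) never yields a proof when it is A, so the caller falls back to walking, which matches the "walk until reaching commits with known generation number" behavior described earlier in the document.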
* Re: [PATCH v5 00/11] Compute and consume generation numbers 2018-05-01 12:47 ` [PATCH v5 00/11] " Derrick Stolee ` (10 preceding siblings ...) 2018-05-01 12:47 ` [PATCH v5 11/11] commit-graph.txt: update design document Derrick Stolee @ 2018-05-03 11:18 ` Jakub Narebski 11 siblings, 0 replies; 162+ messages in thread From: Jakub Narebski @ 2018-05-03 11:18 UTC (permalink / raw) To: Derrick Stolee Cc: git, Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason, Derrick Stolee Derrick Stolee <dstolee@microsoft.com> writes: > Most of the changes from v4 are cosmetic, but there is one new commit: > > commit: use generation number in remove_redundant() > > Other changes are non-functional, but do clarify things. I wonder if our perf framework in t/perf could help here to show performance gains for the whole series. Though it may not include operations that are most helped by this one. For the commit-graph feature it would be nice, if feasible, to see changes in performance relative to the previous version, checking both the state where the feature is enabled, to see the gains, and the state where the feature is disabled, to see that there are no performance regressions. > > Inter-diff from v4: O.K., now to commenting on inter-changes. > diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt > index d9f2713efa..e1a883eb46 100644 > --- a/Documentation/technical/commit-graph.txt > +++ b/Documentation/technical/commit-graph.txt > @@ -125,9 +125,10 @@ Future Work > walks aware of generation numbers to gain the performance benefits they > enable. This will mostly be accomplished by swapping a commit-date-ordered > priority queue with one ordered by generation number. The following > - operation is an important candidate: > + operations are important candidates: > > - 'log --topo-order' > + - 'tag --merged' > > - Currently, parse_commit_gently() requires filling in the root tree > object for a commit. 
This passes through lookup_tree() and consequently O.K., this is about discussion in "branch --contains / tag --merged inconsistency" thread: https://public-inbox.org/git/87fu3g67ry.fsf@lant.ki.iif.hu/t/#u > diff --git a/commit-graph.c b/commit-graph.c > index aebd242def..a8c337dd77 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -248,6 +248,7 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g, > static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos) > { > const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos; > + item->graph_pos = pos; > item->generation = get_be32(commit_data + g->hash_len + 8) >> 2; > } > Minor bugfix. > @@ -454,8 +455,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, > else > packedDate[0] = 0; > > - if ((*list)->generation != GENERATION_NUMBER_INFINITY) > - packedDate[0] |= htonl((*list)->generation << 2); > + packedDate[0] |= htonl((*list)->generation << 2); > > packedDate[1] = htonl((*list)->date); > hashwrite(f, packedDate, 8); Minor bugfix. 
> @@ -589,18 +589,17 @@ static void close_reachable(struct packed_oid_list *oids) > } > } > > -static void compute_generation_numbers(struct commit** commits, > - int nr_commits) > +static void compute_generation_numbers(struct packed_commit_list* commits) > { > int i; > struct commit_list *list = NULL; > > - for (i = 0; i < nr_commits; i++) { > - if (commits[i]->generation != GENERATION_NUMBER_INFINITY && > - commits[i]->generation != GENERATION_NUMBER_ZERO) > + for (i = 0; i < commits->nr; i++) { > + if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY && > + commits->list[i]->generation != GENERATION_NUMBER_ZERO) > continue; > > - commit_list_insert(commits[i], &list); > + commit_list_insert(commits->list[i], &list); > while (list) { > struct commit *current = list->item; > struct commit_list *parent; Refactoring: signature change from a pair of struct commit** + int to struct packed_commit_list*. I think that it makes the code a bit uglier for no gain, but that is just my personal opinion; it is a matter of taste. > @@ -621,10 +620,10 @@ static void compute_generation_numbers(struct commit** commits, > if (all_parents_computed) { > current->generation = max_generation + 1; > pop_commit(&list); > - } > > - if (current->generation > GENERATION_NUMBER_MAX) > - current->generation = GENERATION_NUMBER_MAX; > + if (current->generation > GENERATION_NUMBER_MAX) > + current->generation = GENERATION_NUMBER_MAX; > + } > } > } > } Bugfix (though it didn't result in wrong information being written out, just in an inconsistent state in the middle of the computation). > @@ -752,7 +751,7 @@ void write_commit_graph(const char *obj_dir, > if (commits.nr >= GRAPH_PARENT_MISSING) > die(_("too many commits to write graph")); > > - compute_generation_numbers(commits.list, commits.nr); > + compute_generation_numbers(&commits); > > graph_name = get_commit_graph_filename(obj_dir); > fd = hold_lock_file_for_update(&lk, graph_name, 0); The other side of signature change. 
> diff --git a/commit.c b/commit.c > index e2e16ea1a7..5064db4e61 100644 > --- a/commit.c > +++ b/commit.c > @@ -835,7 +835,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, > int flags; > > if (commit->generation > last_gen) > - BUG("bad generation skip"); > + BUG("bad generation skip %8x > %8x at %s", > + commit->generation, last_gen, > + oid_to_hex(&commit->object.oid)); > last_gen = commit->generation; > > if (commit->generation < min_generation) More detailed BUG() message, always nice to have. > @@ -947,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt) > parse_commit(array[i]); > for (i = 0; i < cnt; i++) { > struct commit_list *common; > + uint32_t min_generation = GENERATION_NUMBER_INFINITY; > > if (redundant[i]) > continue; > @@ -955,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt) > continue; > filled_index[filled] = j; > work[filled++] = array[j]; > + > + if (array[j]->generation < min_generation) > + min_generation = array[j]->generation; > } > - common = paint_down_to_common(array[i], filled, work, 0); > + common = paint_down_to_common(array[i], filled, work, > + min_generation); > if (array[i]->object.flags & PARENT2) > redundant[i] = 1; > for (j = 0; j < filled; j++) New commit in series. Change looks quite short, gives measurable performance gains (in appropriate case). > @@ -1073,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int nr_reference, struct commit * > for (i = 0; i < nr_reference; i++) { > if (parse_commit(reference[i])) > return ret; > - if (min_generation > reference[i]->generation) > + if (reference[i]->generation < min_generation) > min_generation = reference[i]->generation; > } > > Style change. 
> -- >8 -- > > Derrick Stolee (11): > ref-filter: fix outdated comment on in_commit_list > commit: add generation number to struct commmit > commit-graph: compute generation numbers > commit: use generations in paint_down_to_common() > commit-graph: always load commit-graph information > ref-filter: use generation number for --contains > commit: use generation numbers for in_merge_bases() > commit: add short-circuit to paint_down_to_common() > commit: use generation number in remove_redundant() > merge: check config before loading commits > commit-graph.txt: update design document It looks like the series is maturing nicely. Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 162+ messages in thread
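The generation-number computation reviewed in the inter-diff above (one more than the maximum among the parents, clamped to GENERATION_NUMBER_MAX, with the clamp applied only once all parents are known) can be sketched on a toy graph. Everything named `toy_*` or `TOY_*` is hypothetical; the sketch uses a fixed-size array as the stack instead of git's commit_list, and assumes no parent chain is longer than TOY_MAX_COMMITS.

```c
#include <stddef.h>
#include <stdint.h>

#define GENERATION_NUMBER_ZERO 0
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define TOY_MAX_PARENTS 2   /* assumption: at most two parents per commit */
#define TOY_MAX_COMMITS 64  /* assumption: no parent chain is longer than this */

/* Hypothetical stand-in for git's struct commit. */
struct toy_commit {
	uint32_t generation; /* GENERATION_NUMBER_ZERO means "not computed yet" */
	size_t nr_parents;
	struct toy_commit *parents[TOY_MAX_PARENTS];
};

/*
 * Compute generation numbers for 'start' and every ancestor that lacks
 * one, using an explicit stack so deep histories do not recurse.
 */
static void toy_compute_generation(struct toy_commit *start)
{
	struct toy_commit *stack[TOY_MAX_COMMITS];
	size_t depth = 0;

	if (start->generation != GENERATION_NUMBER_ZERO)
		return;
	stack[depth++] = start;

	while (depth) {
		struct toy_commit *current = stack[depth - 1];
		uint32_t max_generation = 0;
		int all_parents_computed = 1;
		size_t p;

		for (p = 0; p < current->nr_parents; p++) {
			struct toy_commit *parent = current->parents[p];

			if (parent->generation == GENERATION_NUMBER_ZERO) {
				/* Handle this parent first, then revisit 'current'. */
				all_parents_computed = 0;
				stack[depth++] = parent;
				break;
			}
			if (parent->generation > max_generation)
				max_generation = parent->generation;
		}

		if (all_parents_computed) {
			current->generation = max_generation + 1;
			/* Clamp only after the value is final, as in the v5 fix. */
			if (current->generation > GENERATION_NUMBER_MAX)
				current->generation = GENERATION_NUMBER_MAX;
			depth--;
		}
	}
}
```

For a root commit the loop over parents is empty, so it gets generation 1, which agrees with the definition in the cover letter. A child of a commit that already sits at GENERATION_NUMBER_MAX also ends up at GENERATION_NUMBER_MAX, which is the "generation number equal to that of a parent" case the design document mentions.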