* [PATCH 0/6] Use generation numbers for --topo-order
@ 2018-08-27 20:41 Derrick Stolee via GitGitGadget
  2018-08-27 20:41 ` [PATCH 1/6] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
                   ` (7 more replies)
  0 siblings, 8 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-08-27 20:41 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

This patch series performs a decently-sized refactoring of the revision-walk
machinery. Well, "refactoring" is probably the wrong word, as I don't
actually remove the old code. Instead, when we see certain options in the
'rev_info' struct, we redirect the commit-walk logic to a new set of methods
that distribute the workload differently. By using generation numbers in the
commit-graph, we can significantly improve 'git log --graph' commands (and
the underlying 'git rev-list --topo-order').

On the Linux repository, I got the following performance results when
comparing to the previous version with or without a commit-graph:

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

If you want to read this series but are unfamiliar with the commit-graph and
generation numbers, then I recommend reading
Documentation/technical/commit-graph.txt or a blog post [1] I wrote on the
subject. In particular, the three-part walk described in "revision.c:
refactor basic topo-order logic" is shown (but underexplained) in an
animated PNG [2].

Since revision.c is an incredibly important (and old) portion of the
codebase -- and because there are so many orthogonal options in 'struct
rev_info' -- I consider this submission to be "RFC quality". That is, I am
not confident that I am not missing anything, or that my solution is the
best it can be. I did merge this branch with ds/commit-graph-with-grafts and
the "DO-NOT-MERGE: write and read commit-graph always" commit that computes
a commit-graph with every 'git commit' command. The test suite passed with
that change, available on GitHub [3]. To ensure that I cover at least the
cases I think are interesting, I added tests to t6600-test-reach.sh to verify
the walks report the correct results for the three cases there (no
commit-graph, full commit-graph, and a partial commit-graph so the walk
starts at GENERATION_NUMBER_INFINITY).

One notable case that is not included in this series is the case of a
history comparison such as 'git rev-list --topo-order A..B'. The existing
code in limit_list() has ways to cut the walk short when all pending commits
are UNINTERESTING. Since this code depends on commit_list instead of the
prio_queue we are using here, I chose to leave it untouched for now. We can
revisit it in a separate series later. Since handle_commit() turns on
revs->limited when a commit is UNINTERESTING, we do not hit the new code in
this case. Removing this 'revs->limited = 1;' line yields correct results,
but the performance is worse.

This series is based on ds/reachable.

Thanks, -Stolee

[1] 
https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/
Supercharging the Git Commit Graph III: Generations and Graph Algorithms

[2] 
https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png
Animation showing three-part walk

[3] 
https://github.com/derrickstolee/git/tree/topo-order/test
A branch containing this series along with commits to compute commit-graph
in entire test suite.

Derrick Stolee (6):
  prio-queue: add 'peek' operation
  test-reach: add run_three_modes method
  test-reach: add rev-list tests
  revision.c: begin refactoring --topo-order logic
  commit/revisions: bookkeeping before refactoring
  revision.c: refactor basic topo-order logic

 commit.c                   |  11 +-
 commit.h                   |   8 ++
 object.h                   |   4 +-
 prio-queue.c               |   9 ++
 prio-queue.h               |   6 +
 revision.c                 | 232 ++++++++++++++++++++++++++++++++++++-
 revision.h                 |   6 +
 t/helper/test-prio-queue.c |  10 +-
 t/t6600-test-reach.sh      |  98 +++++++++++++++-
 9 files changed, 361 insertions(+), 23 deletions(-)


base-commit: 6cc017431c1c48f80d1c6512fdcc9866cf4b7f55
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-25%2Fderrickstolee%2Ftopo-order%2Fprogress-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-25/derrickstolee/topo-order/progress-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/25
-- 
gitgitgadget


* [PATCH 1/6] prio-queue: add 'peek' operation
  2018-08-27 20:41 [PATCH 0/6] Use generation numbers for --topo-order Derrick Stolee via GitGitGadget
@ 2018-08-27 20:41 ` Derrick Stolee via GitGitGadget
  2018-08-27 20:41 ` [PATCH 2/6] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-08-27 20:41 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When consuming a priority queue, it can be convenient to inspect
the next object that will be dequeued without actually dequeueing
it. Our existing library did not have such a 'peek' operation, so
add it as prio_queue_peek().

Add a reference-level comparison in t/helper/test-prio-queue.c
so this method is exercised by t0009-prio-queue.sh.
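
To illustrate the contract (peek returns exactly what the next get would
return, and leaves the queue unchanged), here is a minimal sketch on a toy
binary min-heap of ints. The names toy_heap, toy_put, toy_peek and toy_get
are hypothetical and exist only for this example; this is not Git's
struct prio_queue, which stores void pointers and also supports a LIFO mode
when no compare function is given.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CAP 16

/* Toy min-heap of ints (hypothetical; not Git's struct prio_queue). */
struct toy_heap {
	int data[TOY_CAP];
	size_t nr;
};

static void toy_swap(int *a, int *b)
{
	int t = *a;
	*a = *b;
	*b = t;
}

/* Insert a value and sift it up; returns 0 on success, -1 if full. */
static int toy_put(struct toy_heap *h, int v)
{
	size_t i;

	if (h->nr == TOY_CAP)
		return -1;
	i = h->nr++;
	h->data[i] = v;
	while (i && h->data[(i - 1) / 2] > h->data[i]) {
		toy_swap(&h->data[(i - 1) / 2], &h->data[i]);
		i = (i - 1) / 2;
	}
	return 0;
}

/* Return a pointer to the smallest element without removing it. */
static int *toy_peek(struct toy_heap *h)
{
	return h->nr ? &h->data[0] : NULL;
}

/* Remove the smallest element into *out; returns -1 if the heap is empty. */
static int toy_get(struct toy_heap *h, int *out)
{
	size_t i = 0;

	if (!h->nr)
		return -1;
	*out = h->data[0];
	h->data[0] = h->data[--h->nr];
	for (;;) {
		size_t l = 2 * i + 1, r = l + 1, m = i;

		if (l < h->nr && h->data[l] < h->data[m])
			m = l;
		if (r < h->nr && h->data[r] < h->data[m])
			m = r;
		if (m == i)
			break;
		toy_swap(&h->data[i], &h->data[m]);
		i = m;
	}
	return 0;
}
```

After putting 5, 2 and 9 into an empty toy_heap, toy_peek() points at 2 and
the element count is still three; a following toy_get() yields that same 2.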

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 prio-queue.c               |  9 +++++++++
 prio-queue.h               |  6 ++++++
 t/helper/test-prio-queue.c | 10 +++++++---
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/prio-queue.c b/prio-queue.c
index a078451872..d3f488cb05 100644
--- a/prio-queue.c
+++ b/prio-queue.c
@@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue)
 	}
 	return result;
 }
+
+void *prio_queue_peek(struct prio_queue *queue)
+{
+	if (!queue->nr)
+		return NULL;
+	if (!queue->compare)
+		return queue->array[queue->nr - 1].data;
+	return queue->array[0].data;
+}
diff --git a/prio-queue.h b/prio-queue.h
index d030ec9dd6..682e51867a 100644
--- a/prio-queue.h
+++ b/prio-queue.h
@@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing);
  */
 extern void *prio_queue_get(struct prio_queue *);
 
+/*
+ * Gain access to the "thing" that would be returned by
+ * prio_queue_get, but do not remove it from the queue.
+ */
+extern void *prio_queue_peek(struct prio_queue *);
+
 extern void clear_prio_queue(struct prio_queue *);
 
 /* Reverse the LIFO elements */
diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c
index 9807b649b1..e817bbf464 100644
--- a/t/helper/test-prio-queue.c
+++ b/t/helper/test-prio-queue.c
@@ -22,9 +22,13 @@ int cmd__prio_queue(int argc, const char **argv)
 	struct prio_queue pq = { intcmp };
 
 	while (*++argv) {
-		if (!strcmp(*argv, "get"))
-			show(prio_queue_get(&pq));
-		else if (!strcmp(*argv, "dump")) {
+		if (!strcmp(*argv, "get")) {
+			void *peek = prio_queue_peek(&pq);
+			void *get = prio_queue_get(&pq);
+			if (peek != get)
+				BUG("peek and get results do not match");
+			show(get);
+		} else if (!strcmp(*argv, "dump")) {
 			int *v;
 			while ((v = prio_queue_get(&pq)))
 			       show(v);
-- 
gitgitgadget



* [PATCH 2/6] test-reach: add run_three_modes method
  2018-08-27 20:41 [PATCH 0/6] Use generation numbers for --topo-order Derrick Stolee via GitGitGadget
  2018-08-27 20:41 ` [PATCH 1/6] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
@ 2018-08-27 20:41 ` Derrick Stolee via GitGitGadget
  2018-08-27 20:41 ` [PATCH 3/6] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-08-27 20:41 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'test_three_modes' method assumes we are using the 'test-tool
reach' command for our test. However, we may want to use the data
shape of our commit graph and the three modes (no commit-graph,
full commit-graph, partial commit-graph) for other git commands.

Split test_three_modes into a thin wrapper around a more general
run_three_modes method that executes the given command and compares
the actual output to the expected output.

While inspecting this code, I realized that the final test for
'commit_contains --tag' is silently dropping the '--tag' argument.
It should be quoted to include both.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t6600-test-reach.sh | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index d139a00d1d..1b18e12a4e 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -53,18 +53,22 @@ test_expect_success 'setup' '
 	git config core.commitGraph true
 '
 
-test_three_modes () {
+run_three_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	$1 <input >actual &&
 	test_cmp expect actual &&
 	cp commit-graph-full .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	$1 <input >actual &&
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	$1 <input >actual &&
 	test_cmp expect actual
 }
 
+test_three_modes () {
+	run_three_modes "test-tool reach $1"
+}
+
 test_expect_success 'ref_newer:miss' '
 	cat >input <<-\EOF &&
 	A:commit-5-7
@@ -219,7 +223,7 @@ test_expect_success 'commit_contains:hit' '
 	EOF
 	echo "commit_contains(_,A,X,_):1" >expect &&
 	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_three_modes "commit_contains --tag"
 '
 
 test_expect_success 'commit_contains:miss' '
-- 
gitgitgadget



* [PATCH 3/6] test-reach: add rev-list tests
  2018-08-27 20:41 [PATCH 0/6] Use generation numbers for --topo-order Derrick Stolee via GitGitGadget
  2018-08-27 20:41 ` [PATCH 1/6] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
  2018-08-27 20:41 ` [PATCH 2/6] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
@ 2018-08-27 20:41 ` Derrick Stolee via GitGitGadget
  2018-08-27 20:41 ` [PATCH 4/6] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-08-27 20:41 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The rev-list command is critical to Git's functionality. Ensure it
works in the three commit-graph environments constructed in
t6600-test-reach.sh. Here are a few important types of rev-list
operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t6600-test-reach.sh | 84 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 1b18e12a4e..2fcaa39077 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' '
 	test_three_modes commit_contains --tag
 '
 
+test_expect_success 'rev-list: basic topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+		commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \
+		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
+		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+	>expect &&
+	run_three_modes "git rev-list --topo-order commit-6-6"
+'
+
+test_expect_success 'rev-list: first-parent topo-order' '
+	git rev-parse \
+		commit-6-6 \
+		commit-6-5 \
+		commit-6-4 \
+		commit-6-3 \
+		commit-6-2 \
+		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+	>expect &&
+	run_three_modes "git rev-list --first-parent --topo-order commit-6-6"
+'
+
+test_expect_success 'rev-list: range topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes "git rev-list --topo-order commit-3-3..commit-6-6"
+'
+
+test_expect_success 'rev-list: range topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 \
+		commit-6-5 commit-5-5 commit-4-5 \
+		commit-6-4 commit-5-4 commit-4-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes "git rev-list --topo-order commit-3-8..commit-6-6"
+'
+
+test_expect_success 'rev-list: first-parent range topo-order' '
+	git rev-parse \
+		commit-6-6 \
+		commit-6-5 \
+		commit-6-4 \
+		commit-6-3 \
+		commit-6-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes "git rev-list --first-parent --topo-order commit-3-8..commit-6-6"
+'
+
+test_expect_success 'rev-list: ancestry-path topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+	>expect &&
+	run_three_modes "git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6"
+'
+
+test_expect_success 'rev-list: symmetric difference topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 \
+		commit-6-5 commit-5-5 commit-4-5 \
+		commit-6-4 commit-5-4 commit-4-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+		commit-3-8 commit-2-8 commit-1-8 \
+		commit-3-7 commit-2-7 commit-1-7 \
+	>expect &&
+	run_three_modes "git rev-list --topo-order commit-3-8...commit-6-6"
+'
+
 test_done
-- 
gitgitgadget



* [PATCH 4/6] revision.c: begin refactoring --topo-order logic
  2018-08-27 20:41 [PATCH 0/6] Use generation numbers for --topo-order Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2018-08-27 20:41 ` [PATCH 3/6] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
@ 2018-08-27 20:41 ` Derrick Stolee via GitGitGadget
  2018-08-27 20:41 ` [PATCH 5/6] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-08-27 20:41 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here, such as if we
  perform a range query like 'git rev-list COMPARE..HEAD' and we
  can stop limit_list() when all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 42 ++++++++++++++++++++++++++++++++++++++----
 revision.h |  4 ++++
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index 3205a3947a..1db70dc951 100644
--- a/revision.c
+++ b/revision.c
@@ -25,6 +25,7 @@
 #include "worktree.h"
 #include "argv-array.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2451,7 +2452,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 	if (revs->diffopt.objfind)
 		revs->simplify_history = 0;
 
-	if (revs->topo_order)
+	if (revs->topo_order && !generation_numbers_enabled(the_repository))
 		revs->limited = 1;
 
 	if (revs->prune_data.nr) {
@@ -2889,6 +2890,33 @@ static int mark_uninteresting(const struct object_id *oid,
 	return 0;
 }
 
+struct topo_walk_info {};
+
+static void init_topo_walk(struct rev_info *revs)
+{
+	struct topo_walk_info *info;
+	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+	info = revs->topo_walk_info;
+	memset(info, 0, sizeof(struct topo_walk_info));
+
+	limit_list(revs);
+	sort_in_topological_order(&revs->commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+	return pop_commit(&revs->commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+		if (!revs->ignore_missing_links)
+			die("Failed to traverse parents of commit %s",
+			    oid_to_hex(&commit->object.oid));
+	}
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
 	int i;
@@ -2925,11 +2953,13 @@ int prepare_revision_walk(struct rev_info *revs)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
 		return 0;
-	if (revs->limited)
+	if (revs->limited) {
 		if (limit_list(revs) < 0)
 			return -1;
-	if (revs->topo_order)
-		sort_in_topological_order(&revs->commits, revs->sort_order);
+		if (revs->topo_order)
+			sort_in_topological_order(&revs->commits, revs->sort_order);
+	} else if (revs->topo_order)
+		init_topo_walk(revs);
 	if (revs->line_level_traverse)
 		line_log_filter(revs);
 	if (revs->simplify_merges)
@@ -3254,6 +3284,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
 
 		if (revs->reflog_info)
 			commit = next_reflog_entry(revs->reflog_info);
+		else if (revs->topo_walk_info)
+			commit = next_topo_commit(revs);
 		else
 			commit = pop_commit(&revs->commits);
 
@@ -3275,6 +3307,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
 
 			if (revs->reflog_info)
 				try_to_simplify_commit(revs, commit);
+			else if (revs->topo_walk_info)
+				expand_topo_walk(revs, commit);
 			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
 				if (!revs->ignore_missing_links)
 					die("Failed to traverse parents of commit %s",
diff --git a/revision.h b/revision.h
index bf2239f876..e48181673d 100644
--- a/revision.h
+++ b/revision.h
@@ -54,6 +54,8 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct topo_walk_info;
+
 struct rev_info {
 	/* Starting list */
 	struct commit_list *commits;
@@ -227,6 +229,8 @@ struct rev_info {
 	const char *break_bar;
 
 	struct revision_sources *sources;
+
+	struct topo_walk_info *topo_walk_info;
 };
 
 extern int ref_excluded(struct string_list *, const char *path);
-- 
gitgitgadget



* [PATCH 5/6] commit/revisions: bookkeeping before refactoring
  2018-08-27 20:41 [PATCH 0/6] Use generation numbers for --topo-order Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2018-08-27 20:41 ` [PATCH 4/6] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
@ 2018-08-27 20:41 ` Derrick Stolee via GitGitGadget
  2018-08-27 20:41 ` [PATCH 6/6] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-08-27 20:41 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

There are a few things that need to move around a little before
making a big refactoring in the topo-order logic:

1. We need access to record_author_date() and
   compare_commits_by_author_date() in revision.c. These are used
   currently by sort_in_topological_order() in commit.c.

2. Moving these methods to commit.h requires adding the author_slab
   definition to commit.h.

3. The add_parents_to_list() method in revision.c performs logic
   around the UNINTERESTING flag and other special cases depending
   on the struct rev_info. Allow this method to ignore a NULL 'list'
   parameter, as we will not be populating the list for our walk.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c   | 11 ++++-------
 commit.h   |  8 ++++++++
 revision.c |  6 ++++--
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/commit.c b/commit.c
index 32d1234bd7..2dbe187b8c 100644
--- a/commit.c
+++ b/commit.c
@@ -655,11 +655,8 @@ struct commit *pop_commit(struct commit_list **stack)
 /* count number of children that have not been emitted */
 define_commit_slab(indegree_slab, int);
 
-/* record author-date for each commit object */
-define_commit_slab(author_date_slab, unsigned long);
-
-static void record_author_date(struct author_date_slab *author_date,
-			       struct commit *commit)
+void record_author_date(struct author_date_slab *author_date,
+			struct commit *commit)
 {
 	const char *buffer = get_commit_buffer(commit, NULL);
 	struct ident_split ident;
@@ -684,8 +681,8 @@ fail_exit:
 	unuse_commit_buffer(commit, buffer);
 }
 
-static int compare_commits_by_author_date(const void *a_, const void *b_,
-					  void *cb_data)
+int compare_commits_by_author_date(const void *a_, const void *b_,
+				   void *cb_data)
 {
 	const struct commit *a = a_, *b = b_;
 	struct author_date_slab *author_date = cb_data;
diff --git a/commit.h b/commit.h
index e2c99d9b04..51de10e698 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,7 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 #include "pretty.h"
+#include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
 #define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
@@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf);
  */
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
+/* record author-date for each commit object */
+define_commit_slab(author_date_slab, timestamp_t);
+
+void record_author_date(struct author_date_slab *author_date,
+			struct commit *commit);
+
+int compare_commits_by_author_date(const void *a_, const void *b_, void *unused);
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
diff --git a/revision.c b/revision.c
index 1db70dc951..565f903e46 100644
--- a/revision.c
+++ b/revision.c
@@ -804,7 +804,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
 			if (p->object.flags & SEEN)
 				continue;
 			p->object.flags |= SEEN;
-			commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
+			if (list)
+				commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
 		}
 		return 0;
 	}
@@ -843,7 +844,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
 		p->object.flags |= left_flag;
 		if (!(p->object.flags & SEEN)) {
 			p->object.flags |= SEEN;
-			commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
+			if (list)
+				commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
 		}
 		if (revs->first_parent_only)
 			break;
-- 
gitgitgadget



* [PATCH 6/6] revision.c: refactor basic topo-order logic
  2018-08-27 20:41 [PATCH 0/6] Use generation numbers for --topo-order Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2018-08-27 20:41 ` [PATCH 5/6] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
@ 2018-08-27 20:41 ` Derrick Stolee via GitGitGadget
  2018-08-27 21:23 ` [PATCH 0/6] Use generation numbers for --topo-order Junio C Hamano
  2018-09-18  4:08 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  7 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-08-27 20:41 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
   adds them to a linked list, and distributes UNINTERESTING
   flags. If all unprocessed commits are UNINTERESTING, then
   it may terminate without walking all reachable commits.
   This does not occur if we do not specify UNINTERESTING
   commits.

2. Run sort_in_topological_order(), which is an implementation
   of Kahn's algorithm. It first iterates through the entire
   set of important commits and computes the in-degree of each
   (plus one, as we use 'zero' as a special value here). Then,
   we walk the commits in priority order, adding them to the
   priority queue if and only if their in-degree is one. As
   we remove commits from this priority queue, we decrement the
   in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
   uses pop_commit on the full list of commits computed by
   sort_in_topological_order().
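
As a rough illustration of step 2, the following sketch runs Kahn's
algorithm on a toy DAG using the same "in-degree plus one" convention.
Everything here is an assumption made for the example: commits are array
indices, struct toy_commit and toy_topo_sort() are hypothetical names, and
a plain LIFO stack stands in for the priority queue. It is not the code in
commit.c.

```c
#include <assert.h>

#define MAX_COMMITS 8
#define MAX_PARENTS 2

/* Toy commit: indices of its parents, -1 meaning "no parent". */
struct toy_commit {
	int parent[MAX_PARENTS];
};

/*
 * Write all n commits to out[] so that every commit appears before its
 * parents. Returns the number of commits written, or -1 if n is too big.
 */
static int toy_topo_sort(const struct toy_commit *c, int n, int *out)
{
	int indegree[MAX_COMMITS], stack[MAX_COMMITS];
	int i, j, top = 0, emitted = 0;

	if (n > MAX_COMMITS)
		return -1;

	/* In-degree is stored plus one, so 0 stays free as a special value. */
	for (i = 0; i < n; i++)
		indegree[i] = 1;
	for (i = 0; i < n; i++)
		for (j = 0; j < MAX_PARENTS; j++)
			if (c[i].parent[j] >= 0)
				indegree[c[i].parent[j]]++;

	/* Commits with no children (stored value 1) are ready for output. */
	for (i = 0; i < n; i++)
		if (indegree[i] == 1)
			stack[top++] = i;

	while (top) {
		int cur = stack[--top];

		out[emitted++] = cur;
		indegree[cur] = 0;	/* mark as emitted */
		for (j = 0; j < MAX_PARENTS; j++) {
			int p = c[cur].parent[j];

			/* A parent is ready once all its children are out. */
			if (p >= 0 && --indegree[p] == 1)
				stack[top++] = p;
		}
	}
	return emitted;
}
```

Note that the first two loops touch every commit before anything is
emitted, which is the cost the new algorithm tries to avoid.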

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
  number is one more than the maximum generation number among
  its parents.

* If the commit has no parent, then the generation number is one.
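
The two rules above can be sketched as a small recursive function. This is
only an illustration under simplifying assumptions: commits are array
indices, struct gen_commit and toy_generation() are hypothetical names, and
the value 0 is used to mean "not yet computed". It is not how the
commit-graph writer computes these numbers.

```c
#include <assert.h>
#include <stdint.h>

#define GEN_MAX_PARENTS 2

/* Toy commit: parent indices (-1 = none) and a cached generation number. */
struct gen_commit {
	int parent[GEN_MAX_PARENTS];
	uint32_t generation;	/* 0 means "not computed yet" in this sketch */
};

/* Return the generation number of commit i, computing it on demand. */
static uint32_t toy_generation(struct gen_commit *c, int i)
{
	uint32_t max_parent = 0;
	int j;

	if (c[i].generation)
		return c[i].generation;

	for (j = 0; j < GEN_MAX_PARENTS; j++) {
		int p = c[i].parent[j];

		if (p >= 0) {
			uint32_t g = toy_generation(c, p);

			if (g > max_parent)
				max_parent = g;
		}
	}

	/* No parents: 0 + 1 = 1. Otherwise: one more than the maximum. */
	c[i].generation = max_parent + 1;
	return c[i].generation;
}
```

Recursion keeps the sketch short; on a deep history an explicit stack would
be needed to avoid exhausting the call stack.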

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
  indicates that the commit is not stored in the commit-graph and
  the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
  to say that the commit-graph was generated by a version of Git
  that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY means we can
only rely on the following statement, which is weaker than what we
usually expect from generation numbers:

    If A and B are commits with generation numbers gen(A) and
    gen(B) and gen(A) < gen(B), then A cannot reach B.

Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
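
To show how the weaker statement is still enough to stop a walk early,
here is a small sketch of a reachability check that prunes on generation
numbers. It is a toy under stated assumptions: commits are array indices,
struct reach_commit, TOY_GEN_INFINITY and toy_can_reach() are hypothetical
names, and there is no "seen" flag, so shared history may be revisited.
The optional counter only exists to make the pruning observable.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_GEN_INFINITY 0xFFFFFFFFu
#define REACH_MAX_PARENTS 2

/* Toy commit: parent indices (-1 = none) and a generation number. */
struct reach_commit {
	int parent[REACH_MAX_PARENTS];
	uint32_t generation;
};

/*
 * Return 1 if 'to' is reachable from 'from' by following parents.
 * If 'visited' is non-NULL, it is incremented once per commit examined.
 */
static int toy_can_reach(const struct reach_commit *c, int from, int to,
			 int *visited)
{
	int j;

	if (visited)
		(*visited)++;
	if (from == to)
		return 1;

	/* gen(from) < gen(to) implies 'from' cannot reach 'to': stop here. */
	if (c[from].generation < c[to].generation)
		return 0;

	for (j = 0; j < REACH_MAX_PARENTS; j++) {
		int p = c[from].parent[j];

		if (p >= 0 && toy_can_reach(c, p, to, visited))
			return 1;
	}
	return 0;
}
```

Only a strictly lower generation number prunes the walk. Two distinct
commits that both carry the infinite value cannot be ordered this way, so
the sketch keeps walking in that case.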

The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
   maximizing the generation number), parse each reachable
   commit until all commits in the queue have generation
   number strictly lower than needed. During this walk, update
   the UNINTERESTING flags as necessary.

2. INDEGREE: using the indegree_queue priority queue (ordered
   by maximizing the generation number), add one to the in-
   degree of each parent for each commit that is walked. Since
   we walk in order of decreasing generation number, we know
   that discovering an in-degree value of 0 means the value for
   that commit was not initialized, so should be initialized to
   two. (Recall that in-degree value "1" is what we use to say a
   commit is ready for output.) As we iterate the parents of a
   commit during this walk, ensure the EXPLORE walk has walked
   beyond their generation numbers.

3. TOPO: using the topo_queue priority queue (ordered based on
   the sort_order given, which could be commit-date, author-
   date, or typical topo-order which treats the queue as a LIFO
   stack), remove a commit from the queue and decrement the
   in-degree of each parent. If a parent has an in-degree of
   one, then we add it to the topo_queue. Before we decrement
   the in-degree, however, ensure the INDEGREE walk has walked
   beyond that generation number.

The implementations of these walks are in the following methods:

* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk

These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.

One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled for comparisons, such as
'git rev-list --topo-order A..B'. This can be updated in the
future.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.

For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, we get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 object.h   |   4 +-
 revision.c | 196 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 revision.h |   2 +
 3 files changed, 194 insertions(+), 8 deletions(-)

diff --git a/object.h b/object.h
index b132944c51..a84eea61d2 100644
--- a/object.h
+++ b/object.h
@@ -57,7 +57,7 @@ struct object_array {
 
 /*
  * object flag allocation:
- * revision.h:               0---------10                                26
+ * revision.h:               0---------10                                26--28
  * fetch-pack.c:             0----5
  * walker.c:                 0-2
  * upload-pack.c:                4       11-----14  16-----19
@@ -75,7 +75,7 @@ struct object_array {
  * builtin/show-branch.c:    0-------------------------------------------26
  * builtin/unpack-objects.c:                                 2021
  */
-#define FLAG_BITS  27
+#define FLAG_BITS  29
 
 /*
  * The object type is stored in 3 bits.
diff --git a/revision.c b/revision.c
index 565f903e46..7b4beb9978 100644
--- a/revision.c
+++ b/revision.c
@@ -26,6 +26,7 @@
 #include "argv-array.h"
 #include "commit-reach.h"
 #include "commit-graph.h"
+#include "prio-queue.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2892,30 +2893,213 @@ static int mark_uninteresting(const struct object_id *oid,
 	return 0;
 }
 
-struct topo_walk_info {};
+define_commit_slab(indegree_slab, int);
+
+struct topo_walk_info {
+	uint32_t min_generation;
+	struct prio_queue explore_queue;
+	struct prio_queue indegree_queue;
+	struct prio_queue topo_queue;
+	struct indegree_slab indegree;
+	struct author_date_slab author_date;
+};
+
+static inline void test_flag_and_insert(struct prio_queue *q, struct commit *c, int flag)
+{
+	if (c->object.flags & flag)
+		return;
+
+	c->object.flags |= flag;
+	prio_queue_put(q, c);
+}
+
+static void explore_walk_step(struct rev_info *revs)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit_list *p;
+	struct commit *c = prio_queue_get(&info->explore_queue);
+
+	if (!c)
+		return;
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	if (revs->max_age != -1 && (c->date < revs->max_age))
+		c->object.flags |= UNINTERESTING;
+
+	if (add_parents_to_list(revs, c, NULL, NULL) < 0)
+		return;
+
+	if (c->object.flags & UNINTERESTING)
+		mark_parents_uninteresting(c);
+
+	for (p = c->parents; p; p = p->next)
+		test_flag_and_insert(&info->explore_queue, p->item, TOPO_WALK_EXPLORED);
+}
+
+static void explore_to_depth(struct rev_info *revs,
+			     uint32_t gen)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c;
+	while ((c = prio_queue_peek(&info->explore_queue)) &&
+	       c->generation >= gen)
+		explore_walk_step(revs);
+}
+
+static void indegree_walk_step(struct rev_info *revs)
+{
+	struct commit_list *p;
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c = prio_queue_get(&info->indegree_queue);
+
+	if (!c)
+		return;
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	explore_to_depth(revs, c->generation);
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	for (p = c->parents; p; p = p->next) {
+		struct commit *parent = p->item;
+		int *pi = indegree_slab_at(&info->indegree, parent);
+
+		if (*pi)
+			(*pi)++;
+		else
+			*pi = 2;
+
+		test_flag_and_insert(&info->indegree_queue, parent, TOPO_WALK_INDEGREE);
+
+		if (revs->first_parent_only)
+			return;
+	}
+}
+
+static void compute_indegrees_to_depth(struct rev_info *revs)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c;
+	while ((c = prio_queue_peek(&info->indegree_queue)) &&
+	       c->generation >= info->min_generation)
+		indegree_walk_step(revs);
+}
 
 static void init_topo_walk(struct rev_info *revs)
 {
 	struct topo_walk_info *info;
+	struct commit_list *list;
 	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
 	info = revs->topo_walk_info;
 	memset(info, 0, sizeof(struct topo_walk_info));
 
-	limit_list(revs);
-	sort_in_topological_order(&revs->commits, revs->sort_order);
+	init_indegree_slab(&info->indegree);
+	memset(&info->explore_queue, '\0', sizeof(info->explore_queue));
+	memset(&info->indegree_queue, '\0', sizeof(info->indegree_queue));
+	memset(&info->topo_queue, '\0', sizeof(info->topo_queue));
+
+	switch (revs->sort_order) {
+	default: /* REV_SORT_IN_GRAPH_ORDER */
+		info->topo_queue.compare = NULL;
+		break;
+	case REV_SORT_BY_COMMIT_DATE:
+		info->topo_queue.compare = compare_commits_by_commit_date;
+		break;
+	case REV_SORT_BY_AUTHOR_DATE:
+		init_author_date_slab(&info->author_date);
+		info->topo_queue.compare = compare_commits_by_author_date;
+		info->topo_queue.cb_data = &info->author_date;
+		break;
+	}
+
+	info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
+	info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;
+
+	info->min_generation = GENERATION_NUMBER_INFINITY;
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+		test_flag_and_insert(&info->explore_queue, c, TOPO_WALK_EXPLORED);
+		test_flag_and_insert(&info->indegree_queue, c, TOPO_WALK_INDEGREE);
+
+		if (parse_commit_gently(c, 1))
+			continue;
+		if (c->generation < info->min_generation)
+			info->min_generation = c->generation;
+	}
+
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+		*(indegree_slab_at(&info->indegree, c)) = 1;
+
+		if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
+			record_author_date(&info->author_date, c);
+	}
+	compute_indegrees_to_depth(revs);
+
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+
+		if (*(indegree_slab_at(&info->indegree, c)) == 1)
+			prio_queue_put(&info->topo_queue, c);
+	}
+
+	/*
+	 * This is unfortunate; the initial tips need to be shown
+	 * in the order given from the revision traversal machinery.
+	 */
+	if (revs->sort_order == REV_SORT_IN_GRAPH_ORDER)
+		prio_queue_reverse(&info->topo_queue);
 }
 
 static struct commit *next_topo_commit(struct rev_info *revs)
 {
-	return pop_commit(&revs->commits);
+	struct commit *c;
+	struct topo_walk_info *info = revs->topo_walk_info;
+
+	/* pop next off of topo_queue */
+	c = prio_queue_get(&info->topo_queue);
+
+	if (c)
+		*(indegree_slab_at(&info->indegree, c)) = 0;
+
+	return c;
 }
 
 static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 {
-	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+	struct commit_list *p;
+	struct topo_walk_info *info = revs->topo_walk_info;
+	if (add_parents_to_list(revs, commit, NULL, NULL) < 0) {
 		if (!revs->ignore_missing_links)
 			die("Failed to traverse parents of commit %s",
-			    oid_to_hex(&commit->object.oid));
+				oid_to_hex(&commit->object.oid));
+	}
+
+	for (p = commit->parents; p; p = p->next) {
+		struct commit *parent = p->item;
+		int *pi;
+
+		if (parse_commit_gently(parent, 1) < 0)
+			continue;
+
+		if (parent->generation < info->min_generation) {
+			info->min_generation = parent->generation;
+			compute_indegrees_to_depth(revs);
+		}
+
+		pi = indegree_slab_at(&info->indegree, parent);
+
+		(*pi)--;
+		if (*pi == 1)
+			prio_queue_put(&info->topo_queue, parent);
+
+		if (revs->first_parent_only)
+			return;
 	}
 }
 
diff --git a/revision.h b/revision.h
index e48181673d..ca10392021 100644
--- a/revision.h
+++ b/revision.h
@@ -21,6 +21,8 @@
 #define PATCHSAME	(1u<<9)
 #define BOTTOM		(1u<<10)
 #define TRACK_LINEAR	(1u<<26)
+#define TOPO_WALK_EXPLORED (1u<<27)
+#define TOPO_WALK_INDEGREE (1u<<28)
 #define ALL_REV_FLAGS	(((1u<<11)-1) | TRACK_LINEAR)
 
 #define DECORATE_SHORT_REFS	1
-- 
gitgitgadget


* Re: [PATCH 0/6] Use generation numbers for --topo-order
  2018-08-27 20:41 [PATCH 0/6] Use generation numbers for --topo-order Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2018-08-27 20:41 ` [PATCH 6/6] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
@ 2018-08-27 21:23 ` Junio C Hamano
  2018-09-18  4:08 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  7 siblings, 0 replies; 87+ messages in thread
From: Junio C Hamano @ 2018-08-27 21:23 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This patch series performs a decently-sized refactoring of the revision-walk
> machinery. Well, "refactoring" is probably the wrong word, as I don't
> actually remove the old code. Instead, when we see certain options in the
> 'rev_info' struct, we redirect the commit-walk logic to a new set of methods
> that distribute the workload differently. By using generation numbers in the
> commit-graph, we can significantly improve 'git log --graph' commands (and
> the underlying 'git rev-list --topo-order').

Finally ;-).


* [PATCH v2 0/6] Use generation numbers for --topo-order
  2018-08-27 20:41 [PATCH 0/6] Use generation numbers for --topo-order Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2018-08-27 21:23 ` [PATCH 0/6] Use generation numbers for --topo-order Junio C Hamano
@ 2018-09-18  4:08 ` Derrick Stolee via GitGitGadget
  2018-09-18  4:08   ` [PATCH v2 1/6] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
                     ` (7 more replies)
  7 siblings, 8 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-18  4:08 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

This patch series performs a decently-sized refactoring of the revision-walk
machinery. Well, "refactoring" is probably the wrong word, as I don't
actually remove the old code. Instead, when we see certain options in the
'rev_info' struct, we redirect the commit-walk logic to a new set of methods
that distribute the workload differently. By using generation numbers in the
commit-graph, we can significantly improve 'git log --graph' commands (and
the underlying 'git rev-list --topo-order').

On the Linux repository, I got the following performance results when
comparing to the previous version with or without a commit-graph:

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

If you want to read this series but are unfamiliar with the commit-graph and
generation numbers, then I recommend reading 
Documentation/technical/commit-graph.txt or a blog post [1] I wrote on the
subject. In particular, the three-part walk described in "revision.c:
refactor basic topo-order logic" is present (but underexplained) as an
animated PNG [2].

Since revision.c is an incredibly important (and old) portion of the
codebase -- and because there are so many orthogonal options in 'struct
rev_info' -- I consider this submission to be "RFC quality". That is, I am
not confident that I am not missing anything, or that my solution is the
best it can be. I did merge this branch with ds/commit-graph-with-grafts and
the "DO-NOT-MERGE: write and read commit-graph always" commit that computes
a commit-graph with every 'git commit' command. The test suite passed with
that change, available on GitHub [3]. To ensure that I cover at least the
cases I think are interesting, I added tests to t6600-test-reach.sh to verify
the walks report the correct results for the three cases there (no
commit-graph, full commit-graph, and a partial commit-graph so the walk
starts at GENERATION_NUMBER_INFINITY).

One notable case that is not included in this series is the case of a
history comparison such as 'git rev-list --topo-order A..B'. The existing
code in limit_list() has ways to cut the walk short when all pending commits
are UNINTERESTING. Since this code depends on commit_list instead of the
prio_queue we are using here, I chose to leave it untouched for now. We can
revisit it in a separate series later. Since handle_commit() turns on
revs->limited when a commit is UNINTERESTING, we do not hit the new code in
this case. Removing this 'revs->limited = 1;' line yields correct results,
but the performance is worse.

This series was based on ds/reachable, but is now based on 'master' to not
conflict with 182070 "commit: use timestamp_t for author_date_slab". There
is a small conflict with md/filter-trees, because it renamed a flag in
revision.h in the line before I add new flags. Hopefully this conflict is
not too difficult to resolve.

Thanks, -Stolee

[1] 
https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/
Supercharging the Git Commit Graph III: Generations and Graph Algorithms

[2] 
https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png
Animation showing three-part walk

[3] 
https://github.com/derrickstolee/git/tree/topo-order/test
A branch containing this series along with commits to compute commit-graph
in entire test suite.

Derrick Stolee (6):
  prio-queue: add 'peek' operation
  test-reach: add run_three_modes method
  test-reach: add rev-list tests
  revision.c: begin refactoring --topo-order logic
  commit/revisions: bookkeeping before refactoring
  revision.c: refactor basic topo-order logic

 commit.c                   |  11 +-
 commit.h                   |   8 ++
 object.h                   |   4 +-
 prio-queue.c               |   9 ++
 prio-queue.h               |   6 +
 revision.c                 | 232 ++++++++++++++++++++++++++++++++++++-
 revision.h                 |   6 +
 t/helper/test-prio-queue.c |  10 +-
 t/t6600-test-reach.sh      |  98 +++++++++++++++-
 9 files changed, 361 insertions(+), 23 deletions(-)


base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-25%2Fderrickstolee%2Ftopo-order%2Fprogress-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-25/derrickstolee/topo-order/progress-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/25

Range-diff vs v1:

 1:  5e55669f4d = 1:  cc1ec4c270 prio-queue: add 'peek' operation
 2:  9628396af1 = 2:  404c918608 test-reach: add run_three_modes method
 3:  708b4550a1 = 3:  30dee58c61 test-reach: add rev-list tests
 4:  908442417d ! 4:  a74ae13d4e revision.c: begin refactoring --topo-order logic
     @@ -168,4 +168,4 @@
      +	struct topo_walk_info *topo_walk_info;
       };
       
     - extern int ref_excluded(struct string_list *, const char *path);
     + int ref_excluded(struct string_list *, const char *path);
 5:  a7272f2799 ! 5:  0e64fc144c commit/revisions: bookkeeping before refactoring
     @@ -27,7 +27,7 @@
       define_commit_slab(indegree_slab, int);
       
      -/* record author-date for each commit object */
     --define_commit_slab(author_date_slab, unsigned long);
     +-define_commit_slab(author_date_slab, timestamp_t);
      -
      -static void record_author_date(struct author_date_slab *author_date,
      -			       struct commit *commit)
 6:  73713bcbee ! 6:  3b185ac3b1 revision.c: refactor basic topo-order logic
     @@ -153,11 +153,11 @@
       
       /*
        * object flag allocation:
     -- * revision.h:               0---------10                                26
     -+ * revision.h:               0---------10                                26--28
     -  * fetch-pack.c:             0----5
     +- * revision.h:               0---------10                              2526
     ++ * revision.h:               0---------10                              25----28
     +  * fetch-pack.c:             01
     +  * negotiator/default.c:       2--5
        * walker.c:                 0-2
     -  * upload-pack.c:                4       11-----14  16-----19
      @@
        * builtin/show-branch.c:    0-------------------------------------------26
        * builtin/unpack-objects.c:                                 2021
     @@ -404,11 +404,11 @@
      --- a/revision.h
      +++ b/revision.h
      @@
     - #define PATCHSAME	(1u<<9)
     - #define BOTTOM		(1u<<10)
     + #define USER_GIVEN	(1u<<25) /* given directly by the user */
       #define TRACK_LINEAR	(1u<<26)
     + #define ALL_REV_FLAGS	(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
      +#define TOPO_WALK_EXPLORED (1u<<27)
      +#define TOPO_WALK_INDEGREE (1u<<28)
     - #define ALL_REV_FLAGS	(((1u<<11)-1) | TRACK_LINEAR)
       
       #define DECORATE_SHORT_REFS	1
     + #define DECORATE_FULL_REFS	2

-- 
gitgitgadget


* [PATCH v2 1/6] prio-queue: add 'peek' operation
  2018-09-18  4:08 ` [PATCH v2 " Derrick Stolee via GitGitGadget
@ 2018-09-18  4:08   ` Derrick Stolee via GitGitGadget
  2018-09-18  4:08   ` [PATCH v2 2/6] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-18  4:08 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When consuming a priority queue, it can be convenient to inspect
the next object that will be dequeued without actually dequeueing
it. Our existing library did not have such a 'peek' operation, so
add it as prio_queue_peek().

Add a reference-level comparison in t/helper/test-prio-queue.c
so this method is exercised by t0009-prio-queue.sh.
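
To show why a peek is convenient for the walks later in this
series, here is a minimal sketch of a "walk until the top of the
queue drops below a generation" loop. It is only an illustration
under assumptions: struct toy_queue and the toy_* functions are
hypothetical, the queue is a small sorted array of generation
numbers rather than Git's prio_queue, and the "step" merely
removes the top element.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 16   /* assumed capacity of the toy queue */

/* Hypothetical max-queue of generation numbers, kept sorted ascending. */
struct toy_queue {
	unsigned int gen[TOY_QUEUE_MAX];
	size_t nr;
};

/* Insert g, shifting larger entries up to keep the array sorted. */
void toy_queue_put(struct toy_queue *q, unsigned int g)
{
	size_t i;

	assert(q->nr < TOY_QUEUE_MAX);
	i = q->nr++;
	while (i > 0 && q->gen[i - 1] > g) {
		q->gen[i] = q->gen[i - 1];
		i--;
	}
	q->gen[i] = g;
}

/* Look at the largest entry without removing it; NULL when empty. */
const unsigned int *toy_queue_peek(const struct toy_queue *q)
{
	return q->nr ? &q->gen[q->nr - 1] : NULL;
}

/* Remove the largest entry into *g; returns 0 when the queue is empty. */
int toy_queue_get(struct toy_queue *q, unsigned int *g)
{
	if (!q->nr)
		return 0;
	*g = q->gen[--q->nr];
	return 1;
}

/*
 * Keep stepping while the next entry has generation >= gen. The peek
 * lets us test the condition before committing to removing the entry.
 * Returns the number of steps taken.
 */
size_t toy_walk_to_depth(struct toy_queue *q, unsigned int gen)
{
	const unsigned int *top;
	unsigned int g;
	size_t steps = 0;

	while ((top = toy_queue_peek(q)) && *top >= gen) {
		toy_queue_get(q, &g);
		steps++;
	}
	return steps;
}
```

Without a peek, the loop would have to dequeue the entry, test it,
and put it back when it falls below the cutoff.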

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 prio-queue.c               |  9 +++++++++
 prio-queue.h               |  6 ++++++
 t/helper/test-prio-queue.c | 10 +++++++---
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/prio-queue.c b/prio-queue.c
index a078451872..d3f488cb05 100644
--- a/prio-queue.c
+++ b/prio-queue.c
@@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue)
 	}
 	return result;
 }
+
+void *prio_queue_peek(struct prio_queue *queue)
+{
+	if (!queue->nr)
+		return NULL;
+	if (!queue->compare)
+		return queue->array[queue->nr - 1].data;
+	return queue->array[0].data;
+}
diff --git a/prio-queue.h b/prio-queue.h
index d030ec9dd6..682e51867a 100644
--- a/prio-queue.h
+++ b/prio-queue.h
@@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing);
  */
 extern void *prio_queue_get(struct prio_queue *);
 
+/*
+ * Gain access to the "thing" that would be returned by
+ * prio_queue_get, but do not remove it from the queue.
+ */
+extern void *prio_queue_peek(struct prio_queue *);
+
 extern void clear_prio_queue(struct prio_queue *);
 
 /* Reverse the LIFO elements */
diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c
index 9807b649b1..e817bbf464 100644
--- a/t/helper/test-prio-queue.c
+++ b/t/helper/test-prio-queue.c
@@ -22,9 +22,13 @@ int cmd__prio_queue(int argc, const char **argv)
 	struct prio_queue pq = { intcmp };
 
 	while (*++argv) {
-		if (!strcmp(*argv, "get"))
-			show(prio_queue_get(&pq));
-		else if (!strcmp(*argv, "dump")) {
+		if (!strcmp(*argv, "get")) {
+			void *peek = prio_queue_peek(&pq);
+			void *get = prio_queue_get(&pq);
+			if (peek != get)
+				BUG("peek and get results do not match");
+			show(get);
+		} else if (!strcmp(*argv, "dump")) {
 			int *v;
 			while ((v = prio_queue_get(&pq)))
 			       show(v);
-- 
gitgitgadget



* [PATCH v2 2/6] test-reach: add run_three_modes method
  2018-09-18  4:08 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2018-09-18  4:08   ` [PATCH v2 1/6] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
@ 2018-09-18  4:08   ` Derrick Stolee via GitGitGadget
  2018-09-18 18:02     ` SZEDER Gábor
  2018-09-18  4:08   ` [PATCH v2 3/6] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-18  4:08 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'test_three_modes' method assumes we are using the 'test-tool
reach' command for our test. However, we may want to use the data
shape of our commit graph and the three modes (no commit-graph,
full commit-graph, partial commit-graph) for other git commands.

Split test_three_modes into a thin wrapper around a more general
run_three_modes method that executes the given command and compares
the actual output to the expected output.

While inspecting this code, I realized that the final test for
'commit_contains --tag' is silently dropping the '--tag' argument.
It should be quoted to include both.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t6600-test-reach.sh | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index d139a00d1d..1b18e12a4e 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -53,18 +53,22 @@ test_expect_success 'setup' '
 	git config core.commitGraph true
 '
 
-test_three_modes () {
+run_three_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	$1 <input >actual &&
 	test_cmp expect actual &&
 	cp commit-graph-full .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	$1 <input >actual &&
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	$1 <input >actual &&
 	test_cmp expect actual
 }
 
+test_three_modes () {
+	run_three_modes "test-tool reach $1"
+}
+
 test_expect_success 'ref_newer:miss' '
 	cat >input <<-\EOF &&
 	A:commit-5-7
@@ -219,7 +223,7 @@ test_expect_success 'commit_contains:hit' '
 	EOF
 	echo "commit_contains(_,A,X,_):1" >expect &&
 	test_three_modes commit_contains &&
-	test_three_modes commit_contains --tag
+	test_three_modes "commit_contains --tag"
 '
 
 test_expect_success 'commit_contains:miss' '
-- 
gitgitgadget



* [PATCH v2 3/6] test-reach: add rev-list tests
  2018-09-18  4:08 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2018-09-18  4:08   ` [PATCH v2 1/6] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
  2018-09-18  4:08   ` [PATCH v2 2/6] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
@ 2018-09-18  4:08   ` Derrick Stolee via GitGitGadget
  2018-09-18  4:08   ` [PATCH v2 4/6] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-18  4:08 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The rev-list command is critical to Git's functionality. Ensure it
works in the three commit-graph environments constructed in
t6600-test-reach.sh. Here are a few important types of rev-list
operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t6600-test-reach.sh | 84 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 1b18e12a4e..2fcaa39077 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' '
 	test_three_modes commit_contains --tag
 '
 
+test_expect_success 'rev-list: basic topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+		commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \
+		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
+		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+	>expect &&
+	run_three_modes "git rev-list --topo-order commit-6-6"
+'
+
+test_expect_success 'rev-list: first-parent topo-order' '
+	git rev-parse \
+		commit-6-6 \
+		commit-6-5 \
+		commit-6-4 \
+		commit-6-3 \
+		commit-6-2 \
+		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+	>expect &&
+	run_three_modes "git rev-list --first-parent --topo-order commit-6-6"
+'
+
+test_expect_success 'rev-list: range topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes "git rev-list --topo-order commit-3-3..commit-6-6"
+'
+
+test_expect_success 'rev-list: range topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 \
+		commit-6-5 commit-5-5 commit-4-5 \
+		commit-6-4 commit-5-4 commit-4-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes "git rev-list --topo-order commit-3-8..commit-6-6"
+'
+
+test_expect_success 'rev-list: first-parent range topo-order' '
+	git rev-parse \
+		commit-6-6 \
+		commit-6-5 \
+		commit-6-4 \
+		commit-6-3 \
+		commit-6-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes "git rev-list --first-parent --topo-order commit-3-8..commit-6-6"
+'
+
+test_expect_success 'rev-list: ancestry-path topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+	>expect &&
+	run_three_modes "git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6"
+'
+
+test_expect_success 'rev-list: symmetric difference topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 \
+		commit-6-5 commit-5-5 commit-4-5 \
+		commit-6-4 commit-5-4 commit-4-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+		commit-3-8 commit-2-8 commit-1-8 \
+		commit-3-7 commit-2-7 commit-1-7 \
+	>expect &&
+	run_three_modes "git rev-list --topo-order commit-3-8...commit-6-6"
+'
+
 test_done
-- 
gitgitgadget



* [PATCH v2 4/6] revision.c: begin refactoring --topo-order logic
  2018-09-18  4:08 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2018-09-18  4:08   ` [PATCH v2 3/6] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
@ 2018-09-18  4:08   ` Derrick Stolee via GitGitGadget
  2018-09-18  4:08   ` [PATCH v2 5/6] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-18  4:08 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here, such as if we
  perform a range query like 'git rev-list COMPARE..HEAD' and we
  can stop limit_list() when all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 42 ++++++++++++++++++++++++++++++++++++++----
 revision.h |  4 ++++
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index e18bd530e4..2dcde8a8ac 100644
--- a/revision.c
+++ b/revision.c
@@ -25,6 +25,7 @@
 #include "worktree.h"
 #include "argv-array.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 	if (revs->diffopt.objfind)
 		revs->simplify_history = 0;
 
-	if (revs->topo_order)
+	if (revs->topo_order && !generation_numbers_enabled(the_repository))
 		revs->limited = 1;
 
 	if (revs->prune_data.nr) {
@@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid,
 	return 0;
 }
 
+struct topo_walk_info {};
+
+static void init_topo_walk(struct rev_info *revs)
+{
+	struct topo_walk_info *info;
+	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+	info = revs->topo_walk_info;
+	memset(info, 0, sizeof(struct topo_walk_info));
+
+	limit_list(revs);
+	sort_in_topological_order(&revs->commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+	return pop_commit(&revs->commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+		if (!revs->ignore_missing_links)
+			die("Failed to traverse parents of commit %s",
+			    oid_to_hex(&commit->object.oid));
+	}
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
 	int i;
@@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
 		return 0;
-	if (revs->limited)
+	if (revs->limited) {
 		if (limit_list(revs) < 0)
 			return -1;
-	if (revs->topo_order)
-		sort_in_topological_order(&revs->commits, revs->sort_order);
+		if (revs->topo_order)
+			sort_in_topological_order(&revs->commits, revs->sort_order);
+	} else if (revs->topo_order)
+		init_topo_walk(revs);
 	if (revs->line_level_traverse)
 		line_log_filter(revs);
 	if (revs->simplify_merges)
@@ -3257,6 +3287,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
 
 		if (revs->reflog_info)
 			commit = next_reflog_entry(revs->reflog_info);
+		else if (revs->topo_walk_info)
+			commit = next_topo_commit(revs);
 		else
 			commit = pop_commit(&revs->commits);
 
@@ -3278,6 +3310,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
 
 			if (revs->reflog_info)
 				try_to_simplify_commit(revs, commit);
+			else if (revs->topo_walk_info)
+				expand_topo_walk(revs, commit);
 			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
 				if (!revs->ignore_missing_links)
 					die("Failed to traverse parents of commit %s",
diff --git a/revision.h b/revision.h
index 2b30ac270d..fd4154ff75 100644
--- a/revision.h
+++ b/revision.h
@@ -56,6 +56,8 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct topo_walk_info;
+
 struct rev_info {
 	/* Starting list */
 	struct commit_list *commits;
@@ -245,6 +247,8 @@ struct rev_info {
 	const char *break_bar;
 
 	struct revision_sources *sources;
+
+	struct topo_walk_info *topo_walk_info;
 };
 
 int ref_excluded(struct string_list *, const char *path);
-- 
gitgitgadget



* [PATCH v2 5/6] commit/revisions: bookkeeping before refactoring
  2018-09-18  4:08 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2018-09-18  4:08   ` [PATCH v2 4/6] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
@ 2018-09-18  4:08   ` Derrick Stolee via GitGitGadget
  2018-09-18  4:08   ` [PATCH v2 6/6] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-18  4:08 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

There are a few things that need to move around a little before
making a big refactoring in the topo-order logic:

1. We need access to record_author_date() and
   compare_commits_by_author_date() in revision.c. These are used
   currently by sort_in_topological_order() in commit.c.

2. Moving these methods to commit.h requires adding the author_slab
   definition to commit.h.

3. The add_parents_to_list() method in revision.c performs logic
   around the UNINTERESTING flag and other special cases depending
   on the struct rev_info. Allow this method to ignore a NULL 'list'
   parameter, as we will not be populating the list for our walk.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c   | 11 ++++-------
 commit.h   |  8 ++++++++
 revision.c |  6 ++++--
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/commit.c b/commit.c
index d0f199e122..f68e04b2f1 100644
--- a/commit.c
+++ b/commit.c
@@ -655,11 +655,8 @@ struct commit *pop_commit(struct commit_list **stack)
 /* count number of children that have not been emitted */
 define_commit_slab(indegree_slab, int);
 
-/* record author-date for each commit object */
-define_commit_slab(author_date_slab, timestamp_t);
-
-static void record_author_date(struct author_date_slab *author_date,
-			       struct commit *commit)
+void record_author_date(struct author_date_slab *author_date,
+			struct commit *commit)
 {
 	const char *buffer = get_commit_buffer(commit, NULL);
 	struct ident_split ident;
@@ -684,8 +681,8 @@ fail_exit:
 	unuse_commit_buffer(commit, buffer);
 }
 
-static int compare_commits_by_author_date(const void *a_, const void *b_,
-					  void *cb_data)
+int compare_commits_by_author_date(const void *a_, const void *b_,
+				   void *cb_data)
 {
 	const struct commit *a = a_, *b = b_;
 	struct author_date_slab *author_date = cb_data;
diff --git a/commit.h b/commit.h
index 2b1a734388..ff0eb5f8ef 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,7 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 #include "pretty.h"
+#include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
 #define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
@@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf);
  */
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
+/* record author-date for each commit object */
+define_commit_slab(author_date_slab, timestamp_t);
+
+void record_author_date(struct author_date_slab *author_date,
+			struct commit *commit);
+
+int compare_commits_by_author_date(const void *a_, const void *b_, void *unused);
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
diff --git a/revision.c b/revision.c
index 2dcde8a8ac..92012d5f45 100644
--- a/revision.c
+++ b/revision.c
@@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
 			if (p->object.flags & SEEN)
 				continue;
 			p->object.flags |= SEEN;
-			commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
+			if (list)
+				commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
 		}
 		return 0;
 	}
@@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
 		p->object.flags |= left_flag;
 		if (!(p->object.flags & SEEN)) {
 			p->object.flags |= SEEN;
-			commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
+			if (list)
+				commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
 		}
 		if (revs->first_parent_only)
 			break;
-- 
gitgitgadget



* [PATCH v2 6/6] revision.c: refactor basic topo-order logic
  2018-09-18  4:08 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2018-09-18  4:08   ` [PATCH v2 5/6] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
@ 2018-09-18  4:08   ` Derrick Stolee via GitGitGadget
  2018-09-18  5:51     ` Ævar Arnfjörð Bjarmason
  2018-09-18  6:05   ` [PATCH v2 0/6] Use generation numbers for --topo-order Ævar Arnfjörð Bjarmason
  2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
  7 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-18  4:08 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
   adds them to a linked list, and distributes UNINTERESTING
   flags. If all unprocessed commits are UNINTERESTING, then
   it may terminate without walking all reachable commits.
   This does not occur if we do not specify UNINTERESTING
   commits.

2. Run sort_in_topological_order(), which is an implementation
   of Kahn's algorithm. It first iterates through the entire
   set of important commits and computes the in-degree of each
   (plus one, as we use 'zero' as a special value here). Then,
   we walk the commits in priority order, adding them to the
   priority queue if and only if their in-degree is one. As
   we remove commits from this priority queue, we decrement the
   in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
   uses pop_commit on the full list of commits computed by
   sort_in_topological_order().
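
To illustrate step 2, here is a minimal, self-contained sketch of
Kahn's algorithm with the "in-degree plus one" convention. It is
not the code in commit.c: the toy_commit struct, the array-index
parents, the fixed TOY_MAX_COMMITS bound, and the plain LIFO stack
(instead of a priority queue) are all assumptions made for brevity.

```c
#include <assert.h>

#define TOY_MAX_COMMITS 16
#define TOY_NONE (-1)

/* Hypothetical toy commit: up to two parents, stored as array indices. */
struct toy_commit {
	int parent[2];
};

/*
 * Kahn's algorithm using "in-degree plus one":
 *   0 = not in the walk, or already emitted
 *   1 = every child has been emitted, so the commit is ready
 * Writes a children-before-parents order into out[] and returns the
 * number of commits written.
 */
int toy_topo_sort(const struct toy_commit *commits, int nr, int *out)
{
	int indegree[TOY_MAX_COMMITS] = { 0 };
	int stack[TOY_MAX_COMMITS];
	int top = 0, written = 0, i, j;

	assert(nr >= 0 && nr <= TOY_MAX_COMMITS);

	/* Every commit in the set starts at one, not zero. */
	for (i = 0; i < nr; i++)
		indegree[i] = 1;

	/* Count one more for each child that points at a parent. */
	for (i = 0; i < nr; i++)
		for (j = 0; j < 2; j++)
			if (commits[i].parent[j] != TOY_NONE)
				indegree[commits[i].parent[j]]++;

	/* Tips are the commits nobody points at. */
	for (i = 0; i < nr; i++)
		if (indegree[i] == 1)
			stack[top++] = i;

	while (top > 0) {
		int c = stack[--top];

		for (j = 0; j < 2; j++) {
			int p = commits[c].parent[j];

			if (p == TOY_NONE || !indegree[p])
				continue;
			/* A parent becomes ready when its count drops to one. */
			if (--indegree[p] == 1)
				stack[top++] = p;
		}
		indegree[c] = 0;
		out[written++] = c;
	}
	return written;
}
```

The important property is that both passes touch every commit in
the set before the first commit can be emitted, which is the cost
the new algorithm tries to avoid.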

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
  number is one more than the maximum generation number among
  its parents.

* If the commit has no parent, then the generation number is one.
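
As a small sketch of the two rules above, the following computes
generation numbers for a toy history. The toy_gen_commit struct
and toy_compute_generation() are hypothetical names used only
for illustration, and this is not how the commit-graph writer is
implemented. Recursion keeps the sketch short; very deep
histories would want an explicit stack instead.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NO_PARENT (-1)

/* Hypothetical toy commit: parents are indices into the same array. */
struct toy_gen_commit {
	int parent[2];
	uint32_t generation; /* 0 means "not computed yet" in this sketch */
};

/*
 * Returns the generation number of commits[idx], computing and caching
 * it (and those of its ancestors) on demand:
 *   no parents -> 1
 *   otherwise  -> 1 + max(generation of each parent)
 */
uint32_t toy_compute_generation(struct toy_gen_commit *commits, int idx)
{
	struct toy_gen_commit *c;
	uint32_t max_parent = 0;
	int j;

	assert(idx >= 0);
	c = &commits[idx];

	if (c->generation)
		return c->generation;

	for (j = 0; j < 2; j++) {
		if (c->parent[j] != TOY_NO_PARENT) {
			uint32_t g = toy_compute_generation(commits, c->parent[j]);

			if (g > max_parent)
				max_parent = g;
		}
	}

	/* A root has max_parent == 0, so it gets generation one. */
	c->generation = max_parent + 1;
	return c->generation;
}
```

With a merge whose two sides have different lengths, the merge
commit ends up one above the longer side.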

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
  indicates that the commit is not stored in the commit-graph and
  the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
  to say that the commit-graph was generated by a version of Git
  that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the one we usually expect from
generation numbers:

    If A and B are commits with generation numbers gen(A) and
    gen(B) and gen(A) < gen(B), then A cannot reach B.
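
Expressed as a predicate, the statement looks like the sketch
below. toy_cannot_reach() and TOY_GENERATION_INFINITY are
hypothetical names for illustration; the constant mirrors the
0xffffffff value described above.

```c
#include <stdint.h>

#define TOY_GENERATION_INFINITY 0xFFFFFFFFu

/*
 * Returns 1 when the generation numbers alone prove that A cannot
 * reach B, and 0 when they prove nothing (A may or may not reach B).
 */
int toy_cannot_reach(uint32_t gen_a, uint32_t gen_b)
{
	return gen_a < gen_b;
}
```

Note that the comparison is strict. Two commits that both carry
the INFINITY value compare as equal, so nothing can be concluded
about them, and a commit with the INFINITY value is never ruled
out as a starting point.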

Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.

The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
   maximizing the generation number), parse each reachable
   commit until all commits in the queue have generation
   number strictly lower than needed. During this walk, update
   the UNINTERESTING flags as necessary.

2. INDEGREE: using the indegree_queue priority queue (ordered
   by maximizing the generation number), add one to the in-
   degree of each parent for each commit that is walked. Since
   we walk in order of decreasing generation number, we know
   that discovering an in-degree value of 0 means the value for
   that commit was not initialized, so should be initialized to
   two. (Recall that in-degree value "1" is what we use to say a
   commit is ready for output.) As we iterate the parents of a
   commit during this walk, ensure the EXPLORE walk has walked
   beyond their generation numbers.

3. TOPO: using the topo_queue priority queue (ordered based on
   the sort_order given, which could be commit-date, author-
   date, or typical topo-order which treats the queue as a LIFO
   stack), remove a commit from the queue and decrement the
   in-degree of each parent. If a parent has an in-degree of
   one, then we add it to the topo_queue. Before we decrement
   the in-degree, however, ensure the INDEGREE walk has walked
   beyond that generation number.
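
The in-degree bookkeeping shared by the INDEGREE and TOPO walks
can be sketched in isolation. The helper names below are
hypothetical; the patch performs the same arithmetic inline on
indegree_slab entries rather than through helpers like these.

```c
#include <assert.h>

/*
 * INDEGREE walk: record one more child for a parent. A slot that is
 * still zero has never been touched, so it jumps straight to two
 * (the "plus one" base value plus this first child).
 */
void toy_indegree_add_child(int *slot)
{
	if (*slot)
		(*slot)++;
	else
		*slot = 2;
}

/*
 * TOPO walk: one child of this parent has been emitted. Returns 1
 * when the parent has no unemitted children left and is therefore
 * ready to be put on the topo queue.
 */
int toy_indegree_child_emitted(int *slot)
{
	/* The INDEGREE walk must have counted this child already. */
	assert(*slot >= 2);

	(*slot)--;
	return *slot == 1;
}
```

The assertion in the second helper states the ordering
requirement between the walks: the INDEGREE walk has to pass a
parent's generation number before the TOPO walk decrements that
parent's slot.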

The implementations of these walks are in the following methods:

* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk

These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.

One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisons, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.

For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, we get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 object.h   |   4 +-
 revision.c | 196 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 revision.h |   2 +
 3 files changed, 194 insertions(+), 8 deletions(-)

diff --git a/object.h b/object.h
index 0feb90ae61..796792cb32 100644
--- a/object.h
+++ b/object.h
@@ -59,7 +59,7 @@ struct object_array {
 
 /*
  * object flag allocation:
- * revision.h:               0---------10                              2526
+ * revision.h:               0---------10                              25----28
  * fetch-pack.c:             01
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
@@ -78,7 +78,7 @@ struct object_array {
  * builtin/show-branch.c:    0-------------------------------------------26
  * builtin/unpack-objects.c:                                 2021
  */
-#define FLAG_BITS  27
+#define FLAG_BITS  29
 
 /*
  * The object type is stored in 3 bits.
diff --git a/revision.c b/revision.c
index 92012d5f45..c5d0cb6599 100644
--- a/revision.c
+++ b/revision.c
@@ -26,6 +26,7 @@
 #include "argv-array.h"
 #include "commit-reach.h"
 #include "commit-graph.h"
+#include "prio-queue.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2895,30 +2896,213 @@ static int mark_uninteresting(const struct object_id *oid,
 	return 0;
 }
 
-struct topo_walk_info {};
+define_commit_slab(indegree_slab, int);
+
+struct topo_walk_info {
+	uint32_t min_generation;
+	struct prio_queue explore_queue;
+	struct prio_queue indegree_queue;
+	struct prio_queue topo_queue;
+	struct indegree_slab indegree;
+	struct author_date_slab author_date;
+};
+
+static inline void test_flag_and_insert(struct prio_queue *q, struct commit *c, int flag)
+{
+	if (c->object.flags & flag)
+		return;
+
+	c->object.flags |= flag;
+	prio_queue_put(q, c);
+}
+
+static void explore_walk_step(struct rev_info *revs)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit_list *p;
+	struct commit *c = prio_queue_get(&info->explore_queue);
+
+	if (!c)
+		return;
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	if (revs->max_age != -1 && (c->date < revs->max_age))
+		c->object.flags |= UNINTERESTING;
+
+	if (add_parents_to_list(revs, c, NULL, NULL) < 0)
+		return;
+
+	if (c->object.flags & UNINTERESTING)
+		mark_parents_uninteresting(c);
+
+	for (p = c->parents; p; p = p->next)
+		test_flag_and_insert(&info->explore_queue, p->item, TOPO_WALK_EXPLORED);
+}
+
+static void explore_to_depth(struct rev_info *revs,
+			     uint32_t gen)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c;
+	while ((c = prio_queue_peek(&info->explore_queue)) &&
+	       c->generation >= gen)
+		explore_walk_step(revs);
+}
+
+static void indegree_walk_step(struct rev_info *revs)
+{
+	struct commit_list *p;
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c = prio_queue_get(&info->indegree_queue);
+
+	if (!c)
+		return;
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	explore_to_depth(revs, c->generation);
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	for (p = c->parents; p; p = p->next) {
+		struct commit *parent = p->item;
+		int *pi = indegree_slab_at(&info->indegree, parent);
+
+		if (*pi)
+			(*pi)++;
+		else
+			*pi = 2;
+
+		test_flag_and_insert(&info->indegree_queue, parent, TOPO_WALK_INDEGREE);
+
+		if (revs->first_parent_only)
+			return;
+	}
+}
+
+static void compute_indegrees_to_depth(struct rev_info *revs)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c;
+	while ((c = prio_queue_peek(&info->indegree_queue)) &&
+	       c->generation >= info->min_generation)
+		indegree_walk_step(revs);
+}
 
 static void init_topo_walk(struct rev_info *revs)
 {
 	struct topo_walk_info *info;
+	struct commit_list *list;
 	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
 	info = revs->topo_walk_info;
 	memset(info, 0, sizeof(struct topo_walk_info));
 
-	limit_list(revs);
-	sort_in_topological_order(&revs->commits, revs->sort_order);
+	init_indegree_slab(&info->indegree);
+	memset(&info->explore_queue, '\0', sizeof(info->explore_queue));
+	memset(&info->indegree_queue, '\0', sizeof(info->indegree_queue));
+	memset(&info->topo_queue, '\0', sizeof(info->topo_queue));
+
+	switch (revs->sort_order) {
+	default: /* REV_SORT_IN_GRAPH_ORDER */
+		info->topo_queue.compare = NULL;
+		break;
+	case REV_SORT_BY_COMMIT_DATE:
+		info->topo_queue.compare = compare_commits_by_commit_date;
+		break;
+	case REV_SORT_BY_AUTHOR_DATE:
+		init_author_date_slab(&info->author_date);
+		info->topo_queue.compare = compare_commits_by_author_date;
+		info->topo_queue.cb_data = &info->author_date;
+		break;
+	}
+
+	info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
+	info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;
+
+	info->min_generation = GENERATION_NUMBER_INFINITY;
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+		test_flag_and_insert(&info->explore_queue, c, TOPO_WALK_EXPLORED);
+		test_flag_and_insert(&info->indegree_queue, c, TOPO_WALK_INDEGREE);
+
+		if (parse_commit_gently(c, 1))
+			continue;
+		if (c->generation < info->min_generation)
+			info->min_generation = c->generation;
+	}
+
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+		*(indegree_slab_at(&info->indegree, c)) = 1;
+
+		if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
+			record_author_date(&info->author_date, c);
+	}
+	compute_indegrees_to_depth(revs);
+
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+
+		if (*(indegree_slab_at(&info->indegree, c)) == 1)
+			prio_queue_put(&info->topo_queue, c);
+	}
+
+	/*
+	 * This is unfortunate; the initial tips need to be shown
+	 * in the order given from the revision traversal machinery.
+	 */
+	if (revs->sort_order == REV_SORT_IN_GRAPH_ORDER)
+		prio_queue_reverse(&info->topo_queue);
 }
 
 static struct commit *next_topo_commit(struct rev_info *revs)
 {
-	return pop_commit(&revs->commits);
+	struct commit *c;
+	struct topo_walk_info *info = revs->topo_walk_info;
+
+	/* pop next off of topo_queue */
+	c = prio_queue_get(&info->topo_queue);
+
+	if (c)
+		*(indegree_slab_at(&info->indegree, c)) = 0;
+
+	return c;
 }
 
 static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 {
-	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+	struct commit_list *p;
+	struct topo_walk_info *info = revs->topo_walk_info;
+	if (add_parents_to_list(revs, commit, NULL, NULL) < 0) {
 		if (!revs->ignore_missing_links)
 			die("Failed to traverse parents of commit %s",
-			    oid_to_hex(&commit->object.oid));
+				oid_to_hex(&commit->object.oid));
+	}
+
+	for (p = commit->parents; p; p = p->next) {
+		struct commit *parent = p->item;
+		int *pi;
+
+		if (parse_commit_gently(parent, 1) < 0)
+			continue;
+
+		if (parent->generation < info->min_generation) {
+			info->min_generation = parent->generation;
+			compute_indegrees_to_depth(revs);
+		}
+
+		pi = indegree_slab_at(&info->indegree, parent);
+
+		(*pi)--;
+		if (*pi == 1)
+			prio_queue_put(&info->topo_queue, parent);
+
+		if (revs->first_parent_only)
+			return;
 	}
 }
 
diff --git a/revision.h b/revision.h
index fd4154ff75..b20c16c0e0 100644
--- a/revision.h
+++ b/revision.h
@@ -24,6 +24,8 @@
 #define USER_GIVEN	(1u<<25) /* given directly by the user */
 #define TRACK_LINEAR	(1u<<26)
 #define ALL_REV_FLAGS	(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
+#define TOPO_WALK_EXPLORED (1u<<27)
+#define TOPO_WALK_INDEGREE (1u<<28)
 
 #define DECORATE_SHORT_REFS	1
 #define DECORATE_FULL_REFS	2
-- 
gitgitgadget


* Re: [PATCH v2 6/6] revision.c: refactor basic topo-order logic
  2018-09-18  4:08   ` [PATCH v2 6/6] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
@ 2018-09-18  5:51     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-09-18  5:51 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano, Derrick Stolee


On Tue, Sep 18 2018, Derrick Stolee via GitGitGadget wrote:

> diff --git a/revision.h b/revision.h
> index fd4154ff75..b20c16c0e0 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -24,6 +24,8 @@
>  #define USER_GIVEN	(1u<<25) /* given directly by the user */
>  #define TRACK_LINEAR	(1u<<26)
>  #define ALL_REV_FLAGS	(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
> +#define TOPO_WALK_EXPLORED (1u<<27)
> +#define TOPO_WALK_INDEGREE (1u<<28)

Maybe lead with a commit to indent these bitfield defines so this change
doesn't end up making these two new flags (due to the length of the
name) misaligned.


* Re: [PATCH v2 0/6] Use generation numbers for --topo-order
  2018-09-18  4:08 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2018-09-18  4:08   ` [PATCH v2 6/6] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
@ 2018-09-18  6:05   ` Ævar Arnfjörð Bjarmason
  2018-09-21 15:47     ` Derrick Stolee
  2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
  7 siblings, 1 reply; 87+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-09-18  6:05 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano


On Tue, Sep 18 2018, Derrick Stolee via GitGitGadget wrote:

Thanks. Good to see the commit graph used for more stuff.

> On the Linux repository, I got the following performance results when
> comparing to the previous version with or without a commit-graph:
>
> Test: git rev-list --topo-order -100 HEAD
> HEAD~1, no commit-graph: 6.80 s
> HEAD~1, w/ commit-graph: 0.77 s
>   HEAD, w/ commit-graph: 0.02 s
>
> Test: git rev-list --topo-order -100 HEAD -- tools
> HEAD~1, no commit-graph: 9.63 s
> HEAD~1, w/ commit-graph: 6.06 s
>   HEAD, w/ commit-graph: 0.06 s

It would be great if this were made into a t/perf/ test shipped with
this series, that would be later quoted in a commit, as in
e.g. 3b41fb0cb2 ("fsck: use oidset instead of oid_array for skipList",
2018-09-03).

Generalizing that "-- tools" part (i.e. finding a candidate dir)
will require some heuristic, but it would make the test useful when
running this against other repos.

> If you want to read this series but are unfamiliar with the commit-graph and
> generation numbers, then I recommend reading
> Documentation/technical/commit-graph.txt or a blog post [1] I wrote on the
> subject. In particular, the three-part walk described in "revision.c:
> refactor basic topo-order logic" is present (but underexplained) as an
> animated PNG [2].

We discussed some of this in private E-Mail, and this isn't really
feedback on *this* series in particular, just on the general
commit-graph work.

Right now git-config(1) just matter-of-factly says how to enable it, and
points to git-commit-graph(1) for further info, which just shows how to
run the tool. But nothing's describing what stuff is sped up, and those
sorts of docs aren't being updated as new optimizations (e.g. this
--topo-order walk) are added.

For that you need to scour a combination of your blogpost & commits in
git.git (with quoted perf numbers).


* Re: [PATCH v2 2/6] test-reach: add run_three_modes method
  2018-09-18  4:08   ` [PATCH v2 2/6] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
@ 2018-09-18 18:02     ` SZEDER Gábor
  2018-09-19 19:31       ` Junio C Hamano
  0 siblings, 1 reply; 87+ messages in thread
From: SZEDER Gábor @ 2018-09-18 18:02 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano, Derrick Stolee

On Mon, Sep 17, 2018 at 09:08:44PM -0700, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The 'test_three_modes' method assumes we are using the 'test-tool
> reach' command for our test. However, we may want to use the data
> shape of our commit graph and the three modes (no commit-graph,
> full commit-graph, partial commit-graph) for other git commands.
> 
> Split test_three_modes to be a simple translation on a more general
> run_three_modes method that executes the given command and tests
> the actual output to the expected output.
> 
> While inspecting this code, I realized that the final test for
> 'commit_contains --tag' is silently dropping the '--tag' argument.
> It should be quoted to include both.

Nit: while quoting the function's arguments does fix the issue, it
leaves the tests prone to the same issue in the future.  Wouldn't it
be better to use $@ inside the function to refer to all its arguments?


> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/t6600-test-reach.sh | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
> index d139a00d1d..1b18e12a4e 100755
> --- a/t/t6600-test-reach.sh
> +++ b/t/t6600-test-reach.sh
> @@ -53,18 +53,22 @@ test_expect_success 'setup' '
>  	git config core.commitGraph true
>  '
>  
> -test_three_modes () {
> +run_three_modes () {
>  	test_when_finished rm -rf .git/objects/info/commit-graph &&
> -	test-tool reach $1 <input >actual &&
> +	$1 <input >actual &&
>  	test_cmp expect actual &&
>  	cp commit-graph-full .git/objects/info/commit-graph &&
> -	test-tool reach $1 <input >actual &&
> +	$1 <input >actual &&
>  	test_cmp expect actual &&
>  	cp commit-graph-half .git/objects/info/commit-graph &&
> -	test-tool reach $1 <input >actual &&
> +	$1 <input >actual &&
>  	test_cmp expect actual
>  }
>  
> +test_three_modes () {
> +	run_three_modes "test-tool reach $1"
> +}
> +
>  test_expect_success 'ref_newer:miss' '
>  	cat >input <<-\EOF &&
>  	A:commit-5-7
> @@ -219,7 +223,7 @@ test_expect_success 'commit_contains:hit' '
>  	EOF
>  	echo "commit_contains(_,A,X,_):1" >expect &&
>  	test_three_modes commit_contains &&
> -	test_three_modes commit_contains --tag
> +	test_three_modes "commit_contains --tag"
>  '
>  
>  test_expect_success 'commit_contains:miss' '
> -- 
> gitgitgadget
> 


* Re: [PATCH v2 2/6] test-reach: add run_three_modes method
  2018-09-18 18:02     ` SZEDER Gábor
@ 2018-09-19 19:31       ` Junio C Hamano
  2018-09-19 19:38         ` Junio C Hamano
  0 siblings, 1 reply; 87+ messages in thread
From: Junio C Hamano @ 2018-09-19 19:31 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Derrick Stolee via GitGitGadget, git, Derrick Stolee

SZEDER Gábor <szeder.dev@gmail.com> writes:

>> While inspecting this code, I realized that the final test for
>> 'commit_contains --tag' is silently dropping the '--tag' argument.
>> It should be quoted to include both.
>
> Nit: while quoting the function's arguments does fix the issue, it
> leaves the tests prone to the same issue in the future.  Wouldn't it
> be better to use $@ inside the function to refer to all its arguments?

IOW, do it more like this?

>> -test_three_modes () {
>> +run_three_modes () {
>>  	test_when_finished rm -rf .git/objects/info/commit-graph &&
>> -	test-tool reach $1 <input >actual &&
>> +	$1 <input >actual &&

	"$@" <input >actual

i.e. treat each parameter as separate things without further getting
split at $IFS and ...

>> +test_three_modes () {
>> +	run_three_modes "test-tool reach $1"

	run_three_modes test-tool reach "$1"

... make sure these three things are sent as separate, by quoting
"$1" inside dq.

I think that makes sense.



* Re: [PATCH v2 2/6] test-reach: add run_three_modes method
  2018-09-19 19:31       ` Junio C Hamano
@ 2018-09-19 19:38         ` Junio C Hamano
  2018-09-20 21:18           ` Junio C Hamano
  0 siblings, 1 reply; 87+ messages in thread
From: Junio C Hamano @ 2018-09-19 19:38 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Derrick Stolee via GitGitGadget, git, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> SZEDER Gábor <szeder.dev@gmail.com> writes:
>
>>> While inspecting this code, I realized that the final test for
>>> 'commit_contains --tag' is silently dropping the '--tag' argument.
>>> It should be quoted to include both.
>>
>> Nit: while quoting the function's arguments does fix the issue, it
>> leaves the tests prone to the same issue in the future.  Wouldn't it
>> be better to use $@ inside the function to refer to all its arguments?
>
> IOW, do it more like this?
>
>>> -test_three_modes () {
>>> +run_three_modes () {
>>>  	test_when_finished rm -rf .git/objects/info/commit-graph &&
>>> -	test-tool reach $1 <input >actual &&
>>> +	$1 <input >actual &&
>
> 	"$@" <input >actual
>
> i.e. treat each parameter as separate things without further getting
> split at $IFS and ...
>
>>> +test_three_modes () {
>>> +	run_three_modes "test-tool reach $1"
>
> 	run_three_modes test-tool reach "$1"
>
> ... make sure these three things are sent as separate, by quoting
> "$1" inside dq.
>
> I think that makes sense.

I also noticed that 2/6 made "commit_contains --tag" enclosed in dq
pair for one test, but the next test after it has the identical one.

Here is what I queued in the meantime.

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 1b18e12a4e..1377849bf8 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -55,18 +55,18 @@ test_expect_success 'setup' '
 
 run_three_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
-	$1 <input >actual &&
+	"$@" <input >actual &&
 	test_cmp expect actual &&
 	cp commit-graph-full .git/objects/info/commit-graph &&
-	$1 <input >actual &&
+	"$@" <input >actual &&
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
-	$1 <input >actual &&
+	"$@" <input >actual &&
 	test_cmp expect actual
 }
 
 test_three_modes () {
-	run_three_modes "test-tool reach $1"
+	run_three_modes test-tool reach "$1"
 }
 
 test_expect_success 'ref_newer:miss' '
@@ -223,7 +223,7 @@ test_expect_success 'commit_contains:hit' '
 	EOF
 	echo "commit_contains(_,A,X,_):1" >expect &&
 	test_three_modes commit_contains &&
-	test_three_modes "commit_contains --tag"
+	test_three_modes commit_contains --tag
 '
 
 test_expect_success 'commit_contains:miss' '
-- 
2.19.0-216-g2d3b1c576c


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH v2 2/6] test-reach: add run_three_modes method
  2018-09-19 19:38         ` Junio C Hamano
@ 2018-09-20 21:18           ` Junio C Hamano
  0 siblings, 0 replies; 87+ messages in thread
From: Junio C Hamano @ 2018-09-20 21:18 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Derrick Stolee via GitGitGadget, git, Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> I also noticed that 2/6 made "commit_contains --tag" enclosed in dq
> pair for one test, but the next test after it has the identical one.
>
> Here is what I queued in the meantime.
> ...

And of course, I find out that 3/6 needs a matching update after
I've almost finished the day's integration cycle, and need to redo the
whole thing X-<.

Here is a squashable update for 3/6 to match the proposed change.

-- >8 --
Subject: fixup! test-reach: add rev-list tests

 t/t6600-test-reach.sh | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 990ab56e7a..cf9179bdb8 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -252,7 +252,7 @@ test_expect_success 'rev-list: basic topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes "git rev-list --topo-order commit-6-6"
+	run_three_modes git rev-list --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent topo-order' '
@@ -264,7 +264,7 @@ test_expect_success 'rev-list: first-parent topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
 	>expect &&
-	run_three_modes "git rev-list --first-parent --topo-order commit-6-6"
+	run_three_modes git rev-list --first-parent --topo-order commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -276,7 +276,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes "git rev-list --topo-order commit-3-3..commit-6-6"
+	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: range topo-order' '
@@ -288,7 +288,7 @@ test_expect_success 'rev-list: range topo-order' '
 		commit-6-2 commit-5-2 commit-4-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes "git rev-list --topo-order commit-3-8..commit-6-6"
+	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: first-parent range topo-order' '
@@ -300,7 +300,7 @@ test_expect_success 'rev-list: first-parent range topo-order' '
 		commit-6-2 \
 		commit-6-1 commit-5-1 commit-4-1 \
 	>expect &&
-	run_three_modes "git rev-list --first-parent --topo-order commit-3-8..commit-6-6"
+	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
 '
 
 test_expect_success 'rev-list: ancestry-path topo-order' '
@@ -310,7 +310,7 @@ test_expect_success 'rev-list: ancestry-path topo-order' '
 		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
 		commit-6-3 commit-5-3 commit-4-3 \
 	>expect &&
-	run_three_modes "git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6"
+	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
 '
 
 test_expect_success 'rev-list: symmetric difference topo-order' '
@@ -324,7 +324,7 @@ test_expect_success 'rev-list: symmetric difference topo-order' '
 		commit-3-8 commit-2-8 commit-1-8 \
 		commit-3-7 commit-2-7 commit-1-7 \
 	>expect &&
-	run_three_modes "git rev-list --topo-order commit-3-8...commit-6-6"
+	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
 '
 
 test_done

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH v2 0/6] Use generation numbers for --topo-order
  2018-09-18  6:05   ` [PATCH v2 0/6] Use generation numbers for --topo-order Ævar Arnfjörð Bjarmason
@ 2018-09-21 15:47     ` Derrick Stolee
  0 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-09-21 15:47 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, Junio C Hamano

On 9/18/2018 2:05 AM, Ævar Arnfjörð Bjarmason wrote:
> On Tue, Sep 18 2018, Derrick Stolee via GitGitGadget wrote:
>
> Thanks. Good to see the commit graph used for more stuff.
>
>> On the Linux repository, I got the following performance results when
>> comparing to the previous version with or without a commit-graph:
>>
>> Test: git rev-list --topo-order -100 HEAD
>> HEAD~1, no commit-graph: 6.80 s
>> HEAD~1, w/ commit-graph: 0.77 s
>>    HEAD, w/ commit-graph: 0.02 s
>>
>> Test: git rev-list --topo-order -100 HEAD -- tools
>> HEAD~1, no commit-graph: 9.63 s
>> HEAD~1, w/ commit-graph: 6.06 s
>>    HEAD, w/ commit-graph: 0.06 s
> It would be great if this were made into a t/perf/ test shipped with
> this series, that would be later quoted in a commit, as in
> e.g. 3b41fb0cb2 ("fsck: use oidset instead of oid_array for skipList",
> 2018-09-03).
>
> Generalizing that "-- tools" part (i.e. finding a candidate dir) will
> require some heuristic, but would make it useful when running this
> against other repos.

t/perf/p4211-line-log.sh has the following test:


     test_perf 'git log --oneline --raw --parents -1000' '
             git log --oneline --raw --parents -1000 >/dev/null
     '

We could add the following to the end of that script to get similar 
values, since it already selects a file randomly at the top of the script:

     test_perf 'git log --oneline --raw --parents -1000 -- <file>' '
             git log --oneline --raw --parents -1000 -- $file >/dev/null
     '

>
>> If you want to read this series but are unfamiliar with the commit-graph and
>> generation numbers, then I recommend reading
>> Documentation/technical/commit-graph.txt or a blog post [1] I wrote on the
>> subject. In particular, the three-part walk described in "revision.c:
>> refactor basic topo-order logic" is present (but underexplained) as an
>> animated PNG [2].
> We discussed some of this in private E-Mail, and this isn't really
> feedback on *this* series in particular, just on the general
> commit-graph work.
>
> Right now git-config(1) just matter-of-factly says how to enable it, and
> points to git-commit-graph(1) for further info, which just shows how to
> run the tool. But nothing's describing what stuff is sped up, and those
> sorts of docs aren't being updated as new optimizations (e.g. this
> --topo-order walk) are added.
>
> For that you need to scour a combination of your blogpost & commits in
> git.git (with quoted perf numbers).

Thanks for reminding me. I have this on my list of TODOs.

-Stolee


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 0/7] Use generation numbers for --topo-order
  2018-09-18  4:08 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (6 preceding siblings ...)
  2018-09-18  6:05   ` [PATCH v2 0/6] Use generation numbers for --topo-order Ævar Arnfjörð Bjarmason
@ 2018-09-21 17:39   ` Derrick Stolee via GitGitGadget
  2018-09-21 17:39     ` [PATCH v3 1/7] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
                       ` (8 more replies)
  7 siblings, 9 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-21 17:39 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano

This patch series performs a decently-sized refactoring of the revision-walk
machinery. Well, "refactoring" is probably the wrong word, as I don't
actually remove the old code. Instead, when we see certain options in the
'rev_info' struct, we redirect the commit-walk logic to a new set of methods
that distribute the workload differently. By using generation numbers in the
commit-graph, we can significantly improve 'git log --graph' commands (and
the underlying 'git rev-list --topo-order').

On the Linux repository, I got the following performance results when
comparing to the previous version with or without a commit-graph:

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

If you want to read this series but are unfamiliar with the commit-graph and
generation numbers, then I recommend reading 
Documentation/technical/commit-graph.txt or a blog post [1] I wrote on the
subject. In particular, the three-part walk described in "revision.c:
refactor basic topo-order logic" is present (but underexplained) as an
animated PNG [2].

Since revision.c is an incredibly important (and old) portion of the
codebase -- and because there are so many orthogonal options in 'struct
rev_info' -- I consider this submission to be "RFC quality". That is, I am
not confident that I am not missing anything, or that my solution is the
best it can be. I did merge this branch with ds/commit-graph-with-grafts and
the "DO-NOT-MERGE: write and read commit-graph always" commit that computes
a commit-graph with every 'git commit' command. The test suite passed with
that change, available on GitHub [3]. To ensure that I cover at least the
cases I think are interesting, I added tests to t6600-test-reach.sh to verify
the walks report the correct results for the three cases there (no
commit-graph, full commit-graph, and a partial commit-graph so the walk
starts at GENERATION_NUMBER_INFINITY).

One notable case that is not included in this series is the case of a
history comparison such as 'git rev-list --topo-order A..B'. The existing
code in limit_list() has ways to cut the walk short when all pending commits
are UNINTERESTING. Since this code depends on commit_list instead of the
prio_queue we are using here, I chose to leave it untouched for now. We can
revisit it in a separate series later. Since handle_commit() turns on
revs->limited when a commit is UNINTERESTING, we do not hit the new code in
this case. Removing this 'revs->limited = 1;' line yields correct results,
but the performance is worse.

This series was based on ds/reachable, but is now based on 'master' to not
conflict with 182070 "commit: use timestamp_t for author_date_slab". There
is a small conflict with md/filter-trees, because it renamed a flag in
revision.h in the line before I add new flags. Hopefully this conflict is
not too difficult to resolve.

Changes in V3: I added a new patch that updates the tab-alignment for flags
in revision.h before adding new ones (Thanks, Ævar!). Also, I squashed the
recommended changes to run_three_modes and test_three_modes from Szeder and
Junio. Thanks!

Thanks,
-Stolee

[1] 
https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/
Supercharging the Git Commit Graph III: Generations and Graph Algorithms

[2] 
https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png
Animation showing three-part walk

[3] 
https://github.com/derrickstolee/git/tree/topo-order/test
A branch containing this series along with commits to compute commit-graph
in entire test suite.

Cc: avarab@gmail.com
Cc: szeder.dev@gmail.com

Derrick Stolee (7):
  prio-queue: add 'peek' operation
  test-reach: add run_three_modes method
  test-reach: add rev-list tests
  revision.c: begin refactoring --topo-order logic
  commit/revisions: bookkeeping before refactoring
  revision.h: add whitespace in flag definitions
  revision.c: refactor basic topo-order logic

 commit.c                   |  11 +-
 commit.h                   |   8 ++
 object.h                   |   4 +-
 prio-queue.c               |   9 ++
 prio-queue.h               |   6 +
 revision.c                 | 232 ++++++++++++++++++++++++++++++++++++-
 revision.h                 |  34 +++---
 t/helper/test-prio-queue.c |  10 +-
 t/t6600-test-reach.sh      |  96 ++++++++++++++-
 9 files changed, 374 insertions(+), 36 deletions(-)


base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-25%2Fderrickstolee%2Ftopo-order%2Fprogress-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-25/derrickstolee/topo-order/progress-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/25

Range-diff vs v2:

 1:  cc1ec4c270 = 1:  cc1ec4c270 prio-queue: add 'peek' operation
 2:  404c918608 ! 2:  b2a1ade148 test-reach: add run_three_modes method
     @@ -11,10 +11,6 @@
          run_three_modes method that executes the given command and tests
          the actual output to the expected output.
      
     -    While inspecting this code, I realized that the final test for
     -    'commit_contains --tag' is silently dropping the '--tag' argument.
     -    It should be quoted to include both.
     -
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
      diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
     @@ -28,31 +24,22 @@
      +run_three_modes () {
       	test_when_finished rm -rf .git/objects/info/commit-graph &&
      -	test-tool reach $1 <input >actual &&
     -+	$1 <input >actual &&
     ++	"$@" <input >actual &&
       	test_cmp expect actual &&
       	cp commit-graph-full .git/objects/info/commit-graph &&
      -	test-tool reach $1 <input >actual &&
     -+	$1 <input >actual &&
     ++	"$@" <input >actual &&
       	test_cmp expect actual &&
       	cp commit-graph-half .git/objects/info/commit-graph &&
      -	test-tool reach $1 <input >actual &&
     -+	$1 <input >actual &&
     ++	"$@" <input >actual &&
       	test_cmp expect actual
       }
       
      +test_three_modes () {
     -+	run_three_modes "test-tool reach $1"
     ++	run_three_modes test-tool reach "$@"
      +}
      +
       test_expect_success 'ref_newer:miss' '
       	cat >input <<-\EOF &&
       	A:commit-5-7
     -@@
     - 	EOF
     - 	echo "commit_contains(_,A,X,_):1" >expect &&
     - 	test_three_modes commit_contains &&
     --	test_three_modes commit_contains --tag
     -+	test_three_modes "commit_contains --tag"
     - '
     - 
     - test_expect_success 'commit_contains:miss' '
 3:  30dee58c61 ! 3:  b0ceb96076 test-reach: add rev-list tests
     @@ -30,7 +30,7 @@
      +		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
      +		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
      +	>expect &&
     -+	run_three_modes "git rev-list --topo-order commit-6-6"
     ++	run_three_modes git rev-list --topo-order commit-6-6
      +'
      +
      +test_expect_success 'rev-list: first-parent topo-order' '
     @@ -42,7 +42,7 @@
      +		commit-6-2 \
      +		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
      +	>expect &&
     -+	run_three_modes "git rev-list --first-parent --topo-order commit-6-6"
     ++	run_three_modes git rev-list --first-parent --topo-order commit-6-6
      +'
      +
      +test_expect_success 'rev-list: range topo-order' '
     @@ -54,7 +54,7 @@
      +		commit-6-2 commit-5-2 commit-4-2 \
      +		commit-6-1 commit-5-1 commit-4-1 \
      +	>expect &&
     -+	run_three_modes "git rev-list --topo-order commit-3-3..commit-6-6"
     ++	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
      +'
      +
      +test_expect_success 'rev-list: range topo-order' '
     @@ -66,7 +66,7 @@
      +		commit-6-2 commit-5-2 commit-4-2 \
      +		commit-6-1 commit-5-1 commit-4-1 \
      +	>expect &&
     -+	run_three_modes "git rev-list --topo-order commit-3-8..commit-6-6"
     ++	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
      +'
      +
      +test_expect_success 'rev-list: first-parent range topo-order' '
     @@ -78,7 +78,7 @@
      +		commit-6-2 \
      +		commit-6-1 commit-5-1 commit-4-1 \
      +	>expect &&
     -+	run_three_modes "git rev-list --first-parent --topo-order commit-3-8..commit-6-6"
     ++	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
      +'
      +
      +test_expect_success 'rev-list: ancestry-path topo-order' '
     @@ -88,7 +88,7 @@
      +		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
      +		commit-6-3 commit-5-3 commit-4-3 \
      +	>expect &&
     -+	run_three_modes "git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6"
     ++	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
      +'
      +
      +test_expect_success 'rev-list: symmetric difference topo-order' '
     @@ -102,7 +102,7 @@
      +		commit-3-8 commit-2-8 commit-1-8 \
      +		commit-3-7 commit-2-7 commit-1-7 \
      +	>expect &&
     -+	run_three_modes "git rev-list --topo-order commit-3-8...commit-6-6"
     ++	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
      +'
      +
       test_done
 4:  a74ae13d4e = 4:  fd1a0ab7cd revision.c: begin refactoring --topo-order logic
 5:  0e64fc144c = 5:  e86f304082 commit/revisions: bookkeeping before refactoring
 -:  ---------- > 6:  fa6d5ef152 revision.h: add whitespace in flag definitions
 6:  3b185ac3b1 ! 7:  020b2f50c5 revision.c: refactor basic topo-order logic
     @@ -404,11 +404,11 @@
      --- a/revision.h
      +++ b/revision.h
      @@
     - #define USER_GIVEN	(1u<<25) /* given directly by the user */
     - #define TRACK_LINEAR	(1u<<26)
     - #define ALL_REV_FLAGS	(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
     -+#define TOPO_WALK_EXPLORED (1u<<27)
     -+#define TOPO_WALK_INDEGREE (1u<<28)
     + #define USER_GIVEN		(1u<<25) /* given directly by the user */
     + #define TRACK_LINEAR		(1u<<26)
     + #define ALL_REV_FLAGS		(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
     ++#define TOPO_WALK_EXPLORED	(1u<<27)
     ++#define TOPO_WALK_INDEGREE	(1u<<28)
       
       #define DECORATE_SHORT_REFS	1
       #define DECORATE_FULL_REFS	2

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 1/7] prio-queue: add 'peek' operation
  2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
@ 2018-09-21 17:39     ` Derrick Stolee via GitGitGadget
  2018-09-26 19:15       ` Derrick Stolee
  2018-10-11 13:54       ` Jeff King
  2018-09-21 17:39     ` [PATCH v3 2/7] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
                       ` (7 subsequent siblings)
  8 siblings, 2 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-21 17:39 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When consuming a priority queue, it can be convenient to inspect
the next object that will be dequeued without actually dequeueing
it. Our existing library did not have such a 'peek' operation, so
add it as prio_queue_peek().

Add a reference-level comparison in t/helper/test-prio-queue.c
so this method is exercised by t0009-prio-queue.sh.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 prio-queue.c               |  9 +++++++++
 prio-queue.h               |  6 ++++++
 t/helper/test-prio-queue.c | 10 +++++++---
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/prio-queue.c b/prio-queue.c
index a078451872..d3f488cb05 100644
--- a/prio-queue.c
+++ b/prio-queue.c
@@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue)
 	}
 	return result;
 }
+
+void *prio_queue_peek(struct prio_queue *queue)
+{
+	if (!queue->nr)
+		return NULL;
+	if (!queue->compare)
+		return queue->array[queue->nr - 1].data;
+	return queue->array[0].data;
+}
diff --git a/prio-queue.h b/prio-queue.h
index d030ec9dd6..682e51867a 100644
--- a/prio-queue.h
+++ b/prio-queue.h
@@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing);
  */
 extern void *prio_queue_get(struct prio_queue *);
 
+/*
+ * Gain access to the "thing" that would be returned by
+ * prio_queue_get, but do not remove it from the queue.
+ */
+extern void *prio_queue_peek(struct prio_queue *);
+
 extern void clear_prio_queue(struct prio_queue *);
 
 /* Reverse the LIFO elements */
diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c
index 9807b649b1..e817bbf464 100644
--- a/t/helper/test-prio-queue.c
+++ b/t/helper/test-prio-queue.c
@@ -22,9 +22,13 @@ int cmd__prio_queue(int argc, const char **argv)
 	struct prio_queue pq = { intcmp };
 
 	while (*++argv) {
-		if (!strcmp(*argv, "get"))
-			show(prio_queue_get(&pq));
-		else if (!strcmp(*argv, "dump")) {
+		if (!strcmp(*argv, "get")) {
+			void *peek = prio_queue_peek(&pq);
+			void *get = prio_queue_get(&pq);
+			if (peek != get)
+				BUG("peek and get results do not match");
+			show(get);
+		} else if (!strcmp(*argv, "dump")) {
 			int *v;
 			while ((v = prio_queue_get(&pq)))
 			       show(v);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 2/7] test-reach: add run_three_modes method
  2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
  2018-09-21 17:39     ` [PATCH v3 1/7] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
@ 2018-09-21 17:39     ` Derrick Stolee via GitGitGadget
  2018-10-11 13:57       ` Jeff King
  2018-09-21 17:39     ` [PATCH v3 3/7] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
                       ` (6 subsequent siblings)
  8 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-21 17:39 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'test_three_modes' method assumes we are using the 'test-tool
reach' command for our test. However, we may want to use the data
shape of our commit graph and the three modes (no commit-graph,
full commit-graph, partial commit-graph) for other git commands.

Split test_three_modes into a thin wrapper around a more general
run_three_modes method that executes the given command and compares
the actual output against the expected output.
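
As a rough illustration of this split, the sketch below uses hypothetical
names (run_in_modes, show_in_modes, show_args) and a plain "mode" variable
in place of the commit-graph files that the real helper copies around. It
only shows the shape of the pattern: a general runner that takes a whole
command line, and a thin wrapper that fixes the command prefix.

```shell
#!/bin/sh
# General runner: execute the given command once per mode and require
# that its output equals $expect every time. The mode names are
# illustrative; the real helper swaps commit-graph files between runs.
run_in_modes () {
	for mode in none full half
	do
		actual=$("$@") || return 1
		test "$actual" = "$expect" || return 1
	done
}

# Example command (hypothetical): print each argument in brackets.
show_args () {
	printf '[%s]' "$@"
}

# Thin wrapper: fix the command prefix and forward all remaining
# arguments untouched.
show_in_modes () {
	run_in_modes show_args "$@"
}

expect='[a b][c]'
show_in_modes "a b" c && echo "all modes agree"
```

Because the runner receives the command as separate words, any command can
be passed, not only the one the wrapper hard-codes.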

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t6600-test-reach.sh | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index d139a00d1d..9d65b8b946 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -53,18 +53,22 @@ test_expect_success 'setup' '
 	git config core.commitGraph true
 '
 
-test_three_modes () {
+run_three_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	"$@" <input >actual &&
 	test_cmp expect actual &&
 	cp commit-graph-full .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	"$@" <input >actual &&
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	"$@" <input >actual &&
 	test_cmp expect actual
 }
 
+test_three_modes () {
+	run_three_modes test-tool reach "$@"
+}
+
 test_expect_success 'ref_newer:miss' '
 	cat >input <<-\EOF &&
 	A:commit-5-7
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 3/7] test-reach: add rev-list tests
  2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
  2018-09-21 17:39     ` [PATCH v3 1/7] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
  2018-09-21 17:39     ` [PATCH v3 2/7] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
@ 2018-09-21 17:39     ` Derrick Stolee via GitGitGadget
  2018-10-11 13:58       ` Jeff King
  2018-09-21 17:39     ` [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
                       ` (5 subsequent siblings)
  8 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-21 17:39 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The rev-list command is critical to Git's functionality. Ensure it
works in the three commit-graph environments constructed in
t6600-test-reach.sh. Here are a few important types of rev-list
operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t6600-test-reach.sh | 84 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 9d65b8b946..288f703b7b 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' '
 	test_three_modes commit_contains --tag
 '
 
+test_expect_success 'rev-list: basic topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+		commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \
+		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
+		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent topo-order' '
+	git rev-parse \
+		commit-6-6 \
+		commit-6-5 \
+		commit-6-4 \
+		commit-6-3 \
+		commit-6-2 \
+		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+	>expect &&
+	run_three_modes git rev-list --first-parent --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 \
+		commit-6-5 commit-5-5 commit-4-5 \
+		commit-6-4 commit-5-4 commit-4-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent range topo-order' '
+	git rev-parse \
+		commit-6-6 \
+		commit-6-5 \
+		commit-6-4 \
+		commit-6-3 \
+		commit-6-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: ancestry-path topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+	>expect &&
+	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: symmetric difference topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 \
+		commit-6-5 commit-5-5 commit-4-5 \
+		commit-6-4 commit-5-4 commit-4-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+		commit-3-8 commit-2-8 commit-1-8 \
+		commit-3-7 commit-2-7 commit-1-7 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic
  2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2018-09-21 17:39     ` [PATCH v3 3/7] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
@ 2018-09-21 17:39     ` Derrick Stolee via GitGitGadget
  2018-10-11 14:06       ` Jeff King
  2018-10-12  6:33       ` Junio C Hamano
  2018-09-21 17:39     ` [PATCH v3 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
                       ` (4 subsequent siblings)
  8 siblings, 2 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-21 17:39 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here, such as if we
  perform a range query like 'git rev-list COMPARE..HEAD' and we
  can stop limit_list() when all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 42 ++++++++++++++++++++++++++++++++++++++----
 revision.h |  4 ++++
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index e18bd530e4..2dcde8a8ac 100644
--- a/revision.c
+++ b/revision.c
@@ -25,6 +25,7 @@
 #include "worktree.h"
 #include "argv-array.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 	if (revs->diffopt.objfind)
 		revs->simplify_history = 0;
 
-	if (revs->topo_order)
+	if (revs->topo_order && !generation_numbers_enabled(the_repository))
 		revs->limited = 1;
 
 	if (revs->prune_data.nr) {
@@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid,
 	return 0;
 }
 
+struct topo_walk_info {};
+
+static void init_topo_walk(struct rev_info *revs)
+{
+	struct topo_walk_info *info;
+	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+	info = revs->topo_walk_info;
+	memset(info, 0, sizeof(struct topo_walk_info));
+
+	limit_list(revs);
+	sort_in_topological_order(&revs->commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+	return pop_commit(&revs->commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+		if (!revs->ignore_missing_links)
+			die("Failed to traverse parents of commit %s",
+			    oid_to_hex(&commit->object.oid));
+	}
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
 	int i;
@@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
 		return 0;
-	if (revs->limited)
+	if (revs->limited) {
 		if (limit_list(revs) < 0)
 			return -1;
-	if (revs->topo_order)
-		sort_in_topological_order(&revs->commits, revs->sort_order);
+		if (revs->topo_order)
+			sort_in_topological_order(&revs->commits, revs->sort_order);
+	} else if (revs->topo_order)
+		init_topo_walk(revs);
 	if (revs->line_level_traverse)
 		line_log_filter(revs);
 	if (revs->simplify_merges)
@@ -3257,6 +3287,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
 
 		if (revs->reflog_info)
 			commit = next_reflog_entry(revs->reflog_info);
+		else if (revs->topo_walk_info)
+			commit = next_topo_commit(revs);
 		else
 			commit = pop_commit(&revs->commits);
 
@@ -3278,6 +3310,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
 
 			if (revs->reflog_info)
 				try_to_simplify_commit(revs, commit);
+			else if (revs->topo_walk_info)
+				expand_topo_walk(revs, commit);
 			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
 				if (!revs->ignore_missing_links)
 					die("Failed to traverse parents of commit %s",
diff --git a/revision.h b/revision.h
index 2b30ac270d..fd4154ff75 100644
--- a/revision.h
+++ b/revision.h
@@ -56,6 +56,8 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct topo_walk_info;
+
 struct rev_info {
 	/* Starting list */
 	struct commit_list *commits;
@@ -245,6 +247,8 @@ struct rev_info {
 	const char *break_bar;
 
 	struct revision_sources *sources;
+
+	struct topo_walk_info *topo_walk_info;
 };
 
 int ref_excluded(struct string_list *, const char *path);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 5/7] commit/revisions: bookkeeping before refactoring
  2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2018-09-21 17:39     ` [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
@ 2018-09-21 17:39     ` Derrick Stolee via GitGitGadget
  2018-10-11 14:21       ` Jeff King
  2018-09-21 17:39     ` [PATCH v3 6/7] revision.h: add whitespace in flag definitions Derrick Stolee via GitGitGadget
                       ` (3 subsequent siblings)
  8 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-21 17:39 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

There are a few things that need to move around a little before
making a big refactoring in the topo-order logic:

1. We need access to record_author_date() and
   compare_commits_by_author_date() in revision.c. These are used
   currently by sort_in_topological_order() in commit.c.

2. Moving these methods to commit.h requires adding the author_slab
   definition to commit.h.

3. The add_parents_to_list() method in revision.c performs logic
   around the UNINTERESTING flag and other special cases depending
   on the struct rev_info. Allow this method to ignore a NULL 'list'
   parameter, as we will not be populating the list for our walk.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c   | 11 ++++-------
 commit.h   |  8 ++++++++
 revision.c |  6 ++++--
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/commit.c b/commit.c
index d0f199e122..f68e04b2f1 100644
--- a/commit.c
+++ b/commit.c
@@ -655,11 +655,8 @@ struct commit *pop_commit(struct commit_list **stack)
 /* count number of children that have not been emitted */
 define_commit_slab(indegree_slab, int);
 
-/* record author-date for each commit object */
-define_commit_slab(author_date_slab, timestamp_t);
-
-static void record_author_date(struct author_date_slab *author_date,
-			       struct commit *commit)
+void record_author_date(struct author_date_slab *author_date,
+			struct commit *commit)
 {
 	const char *buffer = get_commit_buffer(commit, NULL);
 	struct ident_split ident;
@@ -684,8 +681,8 @@ fail_exit:
 	unuse_commit_buffer(commit, buffer);
 }
 
-static int compare_commits_by_author_date(const void *a_, const void *b_,
-					  void *cb_data)
+int compare_commits_by_author_date(const void *a_, const void *b_,
+				   void *cb_data)
 {
 	const struct commit *a = a_, *b = b_;
 	struct author_date_slab *author_date = cb_data;
diff --git a/commit.h b/commit.h
index 2b1a734388..ff0eb5f8ef 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,7 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 #include "pretty.h"
+#include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
 #define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
@@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf);
  */
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
+/* record author-date for each commit object */
+define_commit_slab(author_date_slab, timestamp_t);
+
+void record_author_date(struct author_date_slab *author_date,
+			struct commit *commit);
+
+int compare_commits_by_author_date(const void *a_, const void *b_, void *unused);
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
diff --git a/revision.c b/revision.c
index 2dcde8a8ac..92012d5f45 100644
--- a/revision.c
+++ b/revision.c
@@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
 			if (p->object.flags & SEEN)
 				continue;
 			p->object.flags |= SEEN;
-			commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
+			if (list)
+				commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
 		}
 		return 0;
 	}
@@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
 		p->object.flags |= left_flag;
 		if (!(p->object.flags & SEEN)) {
 			p->object.flags |= SEEN;
-			commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
+			if (list)
+				commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
 		}
 		if (revs->first_parent_only)
 			break;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 6/7] revision.h: add whitespace in flag definitions
  2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
                       ` (4 preceding siblings ...)
  2018-09-21 17:39     ` [PATCH v3 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
@ 2018-09-21 17:39     ` Derrick Stolee via GitGitGadget
  2018-09-21 17:39     ` [PATCH v3 7/7] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
                       ` (2 subsequent siblings)
  8 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-21 17:39 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In anticipation of adding longer flag names in the next change, add
an extra tab to each flag definition in revision.h.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.h | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/revision.h b/revision.h
index fd4154ff75..e7bd059d80 100644
--- a/revision.h
+++ b/revision.h
@@ -10,20 +10,20 @@
 #include "commit-slab-decl.h"
 
 /* Remember to update object flag allocation in object.h */
-#define SEEN		(1u<<0)
-#define UNINTERESTING   (1u<<1)
-#define TREESAME	(1u<<2)
-#define SHOWN		(1u<<3)
-#define TMP_MARK	(1u<<4) /* for isolated cases; clean after use */
-#define BOUNDARY	(1u<<5)
-#define CHILD_SHOWN	(1u<<6)
-#define ADDED		(1u<<7)	/* Parents already parsed and added? */
-#define SYMMETRIC_LEFT	(1u<<8)
-#define PATCHSAME	(1u<<9)
-#define BOTTOM		(1u<<10)
-#define USER_GIVEN	(1u<<25) /* given directly by the user */
-#define TRACK_LINEAR	(1u<<26)
-#define ALL_REV_FLAGS	(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
+#define SEEN			(1u<<0)
+#define UNINTERESTING		(1u<<1)
+#define TREESAME		(1u<<2)
+#define SHOWN			(1u<<3)
+#define TMP_MARK		(1u<<4) /* for isolated cases; clean after use */
+#define BOUNDARY		(1u<<5)
+#define CHILD_SHOWN		(1u<<6)
+#define ADDED			(1u<<7)	/* Parents already parsed and added? */
+#define SYMMETRIC_LEFT		(1u<<8)
+#define PATCHSAME		(1u<<9)
+#define BOTTOM			(1u<<10)
+#define USER_GIVEN		(1u<<25) /* given directly by the user */
+#define TRACK_LINEAR		(1u<<26)
+#define ALL_REV_FLAGS		(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
 
 #define DECORATE_SHORT_REFS	1
 #define DECORATE_FULL_REFS	2
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v3 7/7] revision.c: refactor basic topo-order logic
  2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
                       ` (5 preceding siblings ...)
  2018-09-21 17:39     ` [PATCH v3 6/7] revision.h: add whitespace in flag definitions Derrick Stolee via GitGitGadget
@ 2018-09-21 17:39     ` Derrick Stolee via GitGitGadget
  2018-09-27 17:57       ` Derrick Stolee
                         ` (2 more replies)
  2018-09-21 21:22     ` [PATCH v3 0/7] Use generation numbers for --topo-order Junio C Hamano
  2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  8 siblings, 3 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-09-21 17:39 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
   adds them to a linked list, and distributes UNINTERESTING
   flags. If all unprocessed commits are UNINTERESTING, then
   it may terminate without walking all reachable commits.
   This does not occur if we do not specify UNINTERESTING
   commits.

2. Run sort_in_topological_order(), which is an implementation
   of Kahn's algorithm. It first iterates through the entire
   set of important commits and computes the in-degree of each
   (plus one, as we use 'zero' as a special value here). Then,
   we walk the commits in priority order, adding them to the
   priority queue if and only if their in-degree is one. As
   we remove commits from this priority queue, we decrement the
   in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
   uses pop_commit on the full list of commits computed by
   sort_in_topological_order().
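
As a rough illustration of step 2 (not part of this patch), here is a
minimal sketch of Kahn's algorithm using the same "in-degree plus one"
convention, where the value 1 means "ready for output" and 0 means
"already emitted". The names toy_commit, toy_topo_sort and TOY_MAX are
hypothetical, and a plain LIFO stack stands in for the priority queue,
which roughly corresponds to the default graph order:

```c
#include <stddef.h>

#define TOY_MAX 16 /* hypothetical upper bound on the toy graph size */

/* Hypothetical toy commit: parents are indices into the same array. */
struct toy_commit {
	size_t nr_parents;
	size_t parents[2];
};

/*
 * Write a children-before-parents order of all 'nr' commits into 'out'
 * and return the number of commits emitted (0 if nr exceeds TOY_MAX).
 */
static size_t toy_topo_sort(const struct toy_commit *c, size_t nr,
			    size_t *out)
{
	int indegree[TOY_MAX];
	size_t stack[TOY_MAX];
	size_t sp = 0, emitted = 0, i, j;

	if (nr > TOY_MAX)
		return 0;

	/* Start every commit at 1, then add one per child. */
	for (i = 0; i < nr; i++)
		indegree[i] = 1;
	for (i = 0; i < nr; i++)
		for (j = 0; j < c[i].nr_parents; j++)
			indegree[c[i].parents[j]]++;

	/* Commits with no children are ready immediately. */
	for (i = 0; i < nr; i++)
		if (indegree[i] == 1)
			stack[sp++] = i;

	while (sp) {
		size_t cur = stack[--sp];

		out[emitted++] = cur;
		indegree[cur] = 0; /* mark as emitted */
		for (j = 0; j < c[cur].nr_parents; j++) {
			size_t p = c[cur].parents[j];

			/* A parent is ready once its last child is emitted. */
			if (--indegree[p] == 1)
				stack[sp++] = p;
		}
	}
	return emitted;
}
```

Note that the first two loops touch every commit before anything can be
emitted, which is the full up-front pass described above.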

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
  number is one more than the maximum generation number among
  its parents.

* If the commit has no parent, then the generation number is one.
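
To make the two rules above concrete, here is a small sketch (not code
from this series) that computes generation numbers for a toy history.
The toy_commit struct and toy_compute_generations() are hypothetical
names, and the sketch assumes every parent is stored at a smaller array
index than its children:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy commit: parents are indices into the same array. */
struct toy_commit {
	size_t nr_parents;
	size_t parents[2];
};

/*
 * Fill gen[i] for every commit. Assumes each parent has a smaller
 * index than its child, so parents are always computed first.
 */
static void toy_compute_generations(const struct toy_commit *commits,
				    size_t nr, uint32_t *gen)
{
	size_t i, j;

	for (i = 0; i < nr; i++) {
		uint32_t max_parent = 0;

		for (j = 0; j < commits[i].nr_parents; j++) {
			size_t p = commits[i].parents[j];

			assert(p < i);
			if (gen[p] > max_parent)
				max_parent = gen[p];
		}
		/* A root has max_parent == 0, so it gets generation 1. */
		gen[i] = max_parent + 1;
	}
}
```

For a root, two children of that root, a merge of those two children,
and one commit on top of the merge, this assigns generations 1, 2, 2,
3 and 4 respectively.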

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
  indicates that the commit is not stored in the commit-graph and
  the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
  to say that the commit-graph was generated by a version of Git
  that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following statement, which is weaker than the one we usually expect
from generation numbers:

    If A and B are commits with generation numbers gen(A) and
    gen(B) and gen(A) < gen(B), then A cannot reach B.
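
The statement above can be read as a cheap negative test. A minimal
sketch, using the hypothetical names toy_might_reach() and
TOY_GENERATION_INFINITY (these are illustrations, not functions from
this series):

```c
#include <stdint.h>

/* Same value as GENERATION_NUMBER_INFINITY; the name is hypothetical. */
#define TOY_GENERATION_INFINITY 0xFFFFFFFFu

/*
 * Return 0 when the generation numbers alone prove that A cannot
 * reach B, and 1 when an actual walk is still needed to decide.
 */
static int toy_might_reach(uint32_t gen_a, uint32_t gen_b)
{
	/*
	 * Equal values are treated as "unknown": two distinct commits
	 * may both carry TOY_GENERATION_INFINITY and still be related.
	 */
	return gen_a >= gen_b;
}
```

A commit with generation TOY_GENERATION_INFINITY is never ruled out as
a starting point, which is why only the strict inequality is usable.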

Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.

The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
   maximizing the generation number), parse each reachable
   commit until all commits in the queue have generation
   number strictly lower than needed. During this walk, update
   the UNINTERESTING flags as necessary.

2. INDEGREE: using the indegree_queue priority queue (ordered
   by maximizing the generation number), add one to the in-
   degree of each parent for each commit that is walked. Since
   we walk in order of decreasing generation number, we know
   that discovering an in-degree value of 0 means the value for
   that commit was not initialized, so should be initialized to
   two. (Recall that in-degree value "1" is what we use to say a
   commit is ready for output.) As we iterate the parents of a
   commit during this walk, ensure the EXPLORE walk has walked
   beyond their generation numbers.

3. TOPO: using the topo_queue priority queue (ordered based on
   the sort_order given, which could be commit-date, author-
   date, or typical topo-order which treats the queue as a LIFO
   stack), remove a commit from the queue and decrement the
   in-degree of each parent. If a parent has an in-degree of
   one, then we add it to the topo_queue. Before we decrement
   the in-degree, however, ensure the INDEGREE walk has walked
   beyond that generation number.
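
The interaction between the INDEGREE and TOPO walks can be sketched on
a toy graph as follows. This is a simplified illustration under stated
assumptions, not the code in this patch: it has a single starting
commit, no UNINTERESTING commits (so the EXPLORE walk is omitted), a
linear scan in place of a real priority queue, and a LIFO stack for the
topo queue. All toy_* names are hypothetical, and the caller must keep
the graph at or below TOY_MAX commits:

```c
#include <stdint.h>
#include <string.h>

#define TOY_MAX 16 /* hypothetical upper bound on the toy graph size */

struct toy_commit {
	uint32_t generation;
	int nr_parents;
	int parents[2]; /* indices into the commit array */
};

struct toy_walk {
	const struct toy_commit *commits;
	int nr;
	int indegree[TOY_MAX];         /* 0 = unset or emitted, 1 = ready */
	unsigned char queued[TOY_MAX]; /* currently in the indegree queue */
	unsigned char seen[TOY_MAX];   /* ever added to the indegree queue */
	int stack[TOY_MAX];            /* topo queue, used as a LIFO stack */
	int sp;
	uint32_t min_generation;
};

/* Return the queued commit with the highest generation, or -1. */
static int toy_indegree_peek(const struct toy_walk *w)
{
	int i, best = -1;

	for (i = 0; i < w->nr; i++)
		if (w->queued[i] &&
		    (best < 0 || w->commits[i].generation >
				 w->commits[best].generation))
			best = i;
	return best;
}

/* Advance the INDEGREE walk only down to min_generation. */
static void toy_compute_indegrees_to_depth(struct toy_walk *w)
{
	int c, j;

	while ((c = toy_indegree_peek(w)) >= 0 &&
	       w->commits[c].generation >= w->min_generation) {
		w->queued[c] = 0;
		for (j = 0; j < w->commits[c].nr_parents; j++) {
			int p = w->commits[c].parents[j];

			/* An unset value (0) starts at 2: base 1 plus one child. */
			w->indegree[p] = w->indegree[p] ? w->indegree[p] + 1 : 2;
			if (!w->seen[p])
				w->seen[p] = w->queued[p] = 1;
		}
	}
}

static void toy_walk_init(struct toy_walk *w,
			  const struct toy_commit *commits, int nr, int tip)
{
	memset(w, 0, sizeof(*w));
	w->commits = commits;
	w->nr = nr;
	w->indegree[tip] = 1;
	w->seen[tip] = w->queued[tip] = 1;
	w->min_generation = commits[tip].generation;
	toy_compute_indegrees_to_depth(w);
	w->stack[w->sp++] = tip;
}

/* TOPO walk: return the next commit index, or -1 when finished. */
static int toy_walk_next(struct toy_walk *w)
{
	int c, j;

	if (!w->sp)
		return -1;
	c = w->stack[--w->sp];
	w->indegree[c] = 0;
	for (j = 0; j < w->commits[c].nr_parents; j++) {
		int p = w->commits[c].parents[j];

		/* Make sure every child of p has been counted first. */
		if (w->commits[p].generation < w->min_generation) {
			w->min_generation = w->commits[p].generation;
			toy_compute_indegrees_to_depth(w);
		}
		if (--w->indegree[p] == 1)
			w->stack[w->sp++] = p;
	}
	return c;
}
```

The point of the sketch is that the first call to toy_walk_next() can
return a commit while commits far below it have never been visited by
the in-degree walk.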

The implementations of these walks are in the following methods:

* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk

These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.

One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled for comparisons, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks a number of commits on the order of
the number of results output (subject to some branching structure
expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.

For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, we get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 object.h   |   4 +-
 revision.c | 196 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 revision.h |   2 +
 3 files changed, 194 insertions(+), 8 deletions(-)

diff --git a/object.h b/object.h
index 0feb90ae61..796792cb32 100644
--- a/object.h
+++ b/object.h
@@ -59,7 +59,7 @@ struct object_array {
 
 /*
  * object flag allocation:
- * revision.h:               0---------10                              2526
+ * revision.h:               0---------10                              25----28
  * fetch-pack.c:             01
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
@@ -78,7 +78,7 @@ struct object_array {
  * builtin/show-branch.c:    0-------------------------------------------26
  * builtin/unpack-objects.c:                                 2021
  */
-#define FLAG_BITS  27
+#define FLAG_BITS  29
 
 /*
  * The object type is stored in 3 bits.
diff --git a/revision.c b/revision.c
index 92012d5f45..c5d0cb6599 100644
--- a/revision.c
+++ b/revision.c
@@ -26,6 +26,7 @@
 #include "argv-array.h"
 #include "commit-reach.h"
 #include "commit-graph.h"
+#include "prio-queue.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2895,30 +2896,213 @@ static int mark_uninteresting(const struct object_id *oid,
 	return 0;
 }
 
-struct topo_walk_info {};
+define_commit_slab(indegree_slab, int);
+
+struct topo_walk_info {
+	uint32_t min_generation;
+	struct prio_queue explore_queue;
+	struct prio_queue indegree_queue;
+	struct prio_queue topo_queue;
+	struct indegree_slab indegree;
+	struct author_date_slab author_date;
+};
+
+static inline void test_flag_and_insert(struct prio_queue *q, struct commit *c, int flag)
+{
+	if (c->object.flags & flag)
+		return;
+
+	c->object.flags |= flag;
+	prio_queue_put(q, c);
+}
+
+static void explore_walk_step(struct rev_info *revs)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit_list *p;
+	struct commit *c = prio_queue_get(&info->explore_queue);
+
+	if (!c)
+		return;
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	if (revs->max_age != -1 && (c->date < revs->max_age))
+		c->object.flags |= UNINTERESTING;
+
+	if (add_parents_to_list(revs, c, NULL, NULL) < 0)
+		return;
+
+	if (c->object.flags & UNINTERESTING)
+		mark_parents_uninteresting(c);
+
+	for (p = c->parents; p; p = p->next)
+		test_flag_and_insert(&info->explore_queue, p->item, TOPO_WALK_EXPLORED);
+}
+
+static void explore_to_depth(struct rev_info *revs,
+			     uint32_t gen)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c;
+	while ((c = prio_queue_peek(&info->explore_queue)) &&
+	       c->generation >= gen)
+		explore_walk_step(revs);
+}
+
+static void indegree_walk_step(struct rev_info *revs)
+{
+	struct commit_list *p;
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c = prio_queue_get(&info->indegree_queue);
+
+	if (!c)
+		return;
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	explore_to_depth(revs, c->generation);
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	for (p = c->parents; p; p = p->next) {
+		struct commit *parent = p->item;
+		int *pi = indegree_slab_at(&info->indegree, parent);
+
+		if (*pi)
+			(*pi)++;
+		else
+			*pi = 2;
+
+		test_flag_and_insert(&info->indegree_queue, parent, TOPO_WALK_INDEGREE);
+
+		if (revs->first_parent_only)
+			return;
+	}
+}
+
+static void compute_indegrees_to_depth(struct rev_info *revs)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c;
+	while ((c = prio_queue_peek(&info->indegree_queue)) &&
+	       c->generation >= info->min_generation)
+		indegree_walk_step(revs);
+}
 
 static void init_topo_walk(struct rev_info *revs)
 {
 	struct topo_walk_info *info;
+	struct commit_list *list;
 	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
 	info = revs->topo_walk_info;
 	memset(info, 0, sizeof(struct topo_walk_info));
 
-	limit_list(revs);
-	sort_in_topological_order(&revs->commits, revs->sort_order);
+	init_indegree_slab(&info->indegree);
+	memset(&info->explore_queue, '\0', sizeof(info->explore_queue));
+	memset(&info->indegree_queue, '\0', sizeof(info->indegree_queue));
+	memset(&info->topo_queue, '\0', sizeof(info->topo_queue));
+
+	switch (revs->sort_order) {
+	default: /* REV_SORT_IN_GRAPH_ORDER */
+		info->topo_queue.compare = NULL;
+		break;
+	case REV_SORT_BY_COMMIT_DATE:
+		info->topo_queue.compare = compare_commits_by_commit_date;
+		break;
+	case REV_SORT_BY_AUTHOR_DATE:
+		init_author_date_slab(&info->author_date);
+		info->topo_queue.compare = compare_commits_by_author_date;
+		info->topo_queue.cb_data = &info->author_date;
+		break;
+	}
+
+	info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
+	info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;
+
+	info->min_generation = GENERATION_NUMBER_INFINITY;
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+		test_flag_and_insert(&info->explore_queue, c, TOPO_WALK_EXPLORED);
+		test_flag_and_insert(&info->indegree_queue, c, TOPO_WALK_INDEGREE);
+
+		if (parse_commit_gently(c, 1))
+			continue;
+		if (c->generation < info->min_generation)
+			info->min_generation = c->generation;
+	}
+
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+		*(indegree_slab_at(&info->indegree, c)) = 1;
+
+		if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
+			record_author_date(&info->author_date, c);
+	}
+	compute_indegrees_to_depth(revs);
+
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+
+		if (*(indegree_slab_at(&info->indegree, c)) == 1)
+			prio_queue_put(&info->topo_queue, c);
+	}
+
+	/*
+	 * This is unfortunate; the initial tips need to be shown
+	 * in the order given from the revision traversal machinery.
+	 */
+	if (revs->sort_order == REV_SORT_IN_GRAPH_ORDER)
+		prio_queue_reverse(&info->topo_queue);
 }
 
 static struct commit *next_topo_commit(struct rev_info *revs)
 {
-	return pop_commit(&revs->commits);
+	struct commit *c;
+	struct topo_walk_info *info = revs->topo_walk_info;
+
+	/* pop next off of topo_queue */
+	c = prio_queue_get(&info->topo_queue);
+
+	if (c)
+		*(indegree_slab_at(&info->indegree, c)) = 0;
+
+	return c;
 }
 
 static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 {
-	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+	struct commit_list *p;
+	struct topo_walk_info *info = revs->topo_walk_info;
+	if (add_parents_to_list(revs, commit, NULL, NULL) < 0) {
 		if (!revs->ignore_missing_links)
 			die("Failed to traverse parents of commit %s",
-			    oid_to_hex(&commit->object.oid));
+				oid_to_hex(&commit->object.oid));
+	}
+
+	for (p = commit->parents; p; p = p->next) {
+		struct commit *parent = p->item;
+		int *pi;
+
+		if (parse_commit_gently(parent, 1) < 0)
+			continue;
+
+		if (parent->generation < info->min_generation) {
+			info->min_generation = parent->generation;
+			compute_indegrees_to_depth(revs);
+		}
+
+		pi = indegree_slab_at(&info->indegree, parent);
+
+		(*pi)--;
+		if (*pi == 1)
+			prio_queue_put(&info->topo_queue, parent);
+
+		if (revs->first_parent_only)
+			return;
 	}
 }
 
diff --git a/revision.h b/revision.h
index e7bd059d80..7cc3bf5fc0 100644
--- a/revision.h
+++ b/revision.h
@@ -24,6 +24,8 @@
 #define USER_GIVEN		(1u<<25) /* given directly by the user */
 #define TRACK_LINEAR		(1u<<26)
 #define ALL_REV_FLAGS		(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
+#define TOPO_WALK_EXPLORED	(1u<<27)
+#define TOPO_WALK_INDEGREE	(1u<<28)
 
 #define DECORATE_SHORT_REFS	1
 #define DECORATE_FULL_REFS	2
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 0/7] Use generation numbers for --topo-order
  2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
                       ` (6 preceding siblings ...)
  2018-09-21 17:39     ` [PATCH v3 7/7] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
@ 2018-09-21 21:22     ` Junio C Hamano
  2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  8 siblings, 0 replies; 87+ messages in thread
From: Junio C Hamano @ 2018-09-21 21:22 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, peff

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Changes in V3: I added a new patch that updates the tab-alignment for flags
> in revision.h before adding new ones (Thanks, Ævar!).

This is most unwelcome while other topics are in flight that caused
unnecessary conflict.  It would have been very welcomed if the
codebase was dormant, though.

I'll live, and there is no need to resend, but this change may not
appear in today's pushout (I'll have to push out the result of
integration before I saw this new reroll with all the other topics).



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/7] prio-queue: add 'peek' operation
  2018-09-21 17:39     ` [PATCH v3 1/7] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
@ 2018-09-26 19:15       ` Derrick Stolee
  2018-10-11 13:54       ` Jeff King
  1 sibling, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-09-26 19:15 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git; +Cc: peff, Junio C Hamano, Derrick Stolee

On 9/21/2018 1:39 PM, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When consuming a priority queue, it can be convenient to inspect
> the next object that will be dequeued without actually dequeueing
> it. Our existing library did not have such a 'peek' operation, so
> add it as prio_queue_peek().
>
> Add a reference-level comparison in t/helper/test-prio-queue.c
> so this method is exercised by t0009-prio-queue.sh.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   prio-queue.c               |  9 +++++++++
>   prio-queue.h               |  6 ++++++
>   t/helper/test-prio-queue.c | 10 +++++++---
>   3 files changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/prio-queue.c b/prio-queue.c
> index a078451872..d3f488cb05 100644
> --- a/prio-queue.c
> +++ b/prio-queue.c
> @@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue)
>   	}
>   	return result;
>   }
> +
> +void *prio_queue_peek(struct prio_queue *queue)
> +{
> +	if (!queue->nr)
> +		return NULL;
> +	if (!queue->compare)
> +		return queue->array[queue->nr - 1].data;
> +	return queue->array[0].data;
> +}

The second branch here is never run by the test suite, as the only
consumers never have compare == NULL. I'll add an ability to test this
"stack" behavior into t0009-prio-queue.sh.

-Stolee


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic
  2018-09-21 17:39     ` [PATCH v3 7/7] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
@ 2018-09-27 17:57       ` Derrick Stolee
  2018-10-06 16:56         ` Jakub Narebski
  2018-10-11 15:35       ` Jeff King
  2018-10-11 22:32       ` Stefan Beller
  2 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee @ 2018-09-27 17:57 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: peff, Junio C Hamano, Derrick Stolee, Jakub Narębski

On 9/21/2018 1:39 PM, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When running a command like 'git rev-list --topo-order HEAD',
> Git performed the following steps:
>
> 1. Run limit_list(), which parses all reachable commits,
>     adds them to a linked list, and distributes UNINTERESTING
>     flags. If all unprocessed commits are UNINTERESTING, then
>     it may terminate without walking all reachable commits.
>     This does not occur if we do not specify UNINTERESTING
>     commits.
>
> 2. Run sort_in_topological_order(), which is an implementation
>     of Kahn's algorithm. It first iterates through the entire
>     set of important commits and computes the in-degree of each
>     (plus one, as we use 'zero' as a special value here). Then,
>     we walk the commits in priority order, adding them to the
>     priority queue if and only if their in-degree is one. As
>     we remove commits from this priority queue, we decrement the
>     in-degree of their parents.
>
> 3. While we are peeling commits for output, get_revision_1()
>     uses pop_commit on the full list of commits computed by
>     sort_in_topological_order().
>
> In the new algorithm, these three steps correspond to three
> different commit walks. We run these walks simultaneously,
> and advance each only as far as necessary to satisfy the
> requirements of the 'higher order' walk. We know when we can
> pause each walk by using generation numbers from the commit-
> graph feature.
Hello, Git contributors.

I understand that this commit message and patch are pretty daunting. 
There is a lot to read and digest. I would like to see if anyone is 
willing to put the work in to review this patch, as I quite like what it 
does, and the performance numbers below.
> In my local testing, I used the following Git commands on the
> Linux repository in three modes: HEAD~1 with no commit-graph,
> HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
> allows comparing the benefits we get from parsing commits from
> the commit-graph and then again the benefits we get by
> restricting the set of commits we walk.
>
> Test: git rev-list --topo-order -100 HEAD
> HEAD~1, no commit-graph: 6.80 s
> HEAD~1, w/ commit-graph: 0.77 s
>    HEAD, w/ commit-graph: 0.02 s
>
> Test: git rev-list --topo-order -100 HEAD -- tools
> HEAD~1, no commit-graph: 9.63 s
> HEAD~1, w/ commit-graph: 6.06 s
>    HEAD, w/ commit-graph: 0.06 s

If there is something I can do to make this easier to review, then 
please let me know.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic
  2018-09-27 17:57       ` Derrick Stolee
@ 2018-10-06 16:56         ` Jakub Narebski
  0 siblings, 0 replies; 87+ messages in thread
From: Jakub Narebski @ 2018-10-06 16:56 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Jeff King, Junio C Hamano,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> On 9/21/2018 1:39 PM, Derrick Stolee via GitGitGadget wrote:

> Hello, Git contributors.
>
> I understand that this commit message and patch are pretty
> daunting. There is a lot to read and digest. I would like to see if
> anyone is willing to put the work in to review this patch, as I quite
> like what it does, and the performance numbers below.

I'll try to find time to review v3 of this patch series this week.

>> In my local testing, I used the following Git commands on the
>> Linux repository in three modes: HEAD~1 with no commit-graph,
>> HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
>> allows comparing the benefits we get from parsing commits from
>> the commit-graph and then again the benefits we get by
>> restricting the set of commits we walk.
>>
>> Test: git rev-list --topo-order -100 HEAD
>> HEAD~1, no commit-graph: 6.80 s
>> HEAD~1, w/ commit-graph: 0.77 s
>>    HEAD, w/ commit-graph: 0.02 s
>>
>> Test: git rev-list --topo-order -100 HEAD -- tools
>> HEAD~1, no commit-graph: 9.63 s
>> HEAD~1, w/ commit-graph: 6.06 s
>>    HEAD, w/ commit-graph: 0.06 s
>
> If there is something I can do to make this easier to review, then
> please let me know.
>
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/7] prio-queue: add 'peek' operation
  2018-09-21 17:39     ` [PATCH v3 1/7] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
  2018-09-26 19:15       ` Derrick Stolee
@ 2018-10-11 13:54       ` Jeff King
  1 sibling, 0 replies; 87+ messages in thread
From: Jeff King @ 2018-10-11 13:54 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano, Derrick Stolee

On Fri, Sep 21, 2018 at 10:39:27AM -0700, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When consuming a priority queue, it can be convenient to inspect
> the next object that will be dequeued without actually dequeueing
> it. Our existing library did not have such a 'peek' operation, so
> add it as prio_queue_peek().

Makes sense.

> +void *prio_queue_peek(struct prio_queue *queue)
> +{
> +	if (!queue->nr)
> +		return NULL;
> +	if (!queue->compare)
> +		return queue->array[queue->nr - 1].data;
> +	return queue->array[0].data;

The non-compare version of get() treats this like a LIFO, and you do the
same here. Looks good.

In theory get() could be implemented in terms of peek(), but the result
is not actually shorter because we have to check those same conditions
to decide how to remove the item anyway.
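
To make the LIFO-versus-ordered distinction concrete, here is a minimal
sketch. It is not the code in prio-queue.c: struct toy_queue and the
toy_*() helpers are hypothetical names, and the ordered mode keeps a
sorted array instead of a heap to stay short. It shows get() written on
top of peek(), which still has to branch on 'compare' to remove the
item.

```c
#include <stddef.h>

#define TOY_CAP 16

/* Hypothetical stand-in for struct prio_queue; not Git's real layout. */
struct toy_queue {
	int (*compare)(const void *, const void *); /* NULL means LIFO */
	size_t nr;
	void *array[TOY_CAP];
};

/* Insert; in compare mode keep the array sorted so array[0] is next. */
static void toy_put(struct toy_queue *q, void *item)
{
	size_t i;

	if (q->nr == TOY_CAP)
		return; /* sketch only: silently drop on overflow */
	i = q->nr++;
	if (q->compare) {
		while (i > 0 && q->compare(item, q->array[i - 1]) < 0) {
			q->array[i] = q->array[i - 1];
			i--;
		}
	}
	q->array[i] = item;
}

/* Return the item that toy_get() would return, without removing it. */
static void *toy_peek(struct toy_queue *q)
{
	if (!q->nr)
		return NULL;
	if (!q->compare)
		return q->array[q->nr - 1]; /* LIFO: newest item */
	return q->array[0];                 /* ordered: smallest item */
}

/* get() in terms of peek(): the removal still depends on the mode. */
static void *toy_get(struct toy_queue *q)
{
	void *result = toy_peek(q);
	size_t i;

	if (!result)
		return NULL;
	q->nr--;
	if (q->compare) {
		for (i = 0; i < q->nr; i++)
			q->array[i] = q->array[i + 1];
	}
	return result;
}

/* Example comparison function over pointers to int. */
static int toy_intcmp(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;
	return (x > y) - (x < y);
}
```

The test helper in the patch relies on the same property: whatever
peek() returns must be the pointer the next get() returns.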

> diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c
> index 9807b649b1..e817bbf464 100644
> --- a/t/helper/test-prio-queue.c
> +++ b/t/helper/test-prio-queue.c
> @@ -22,9 +22,13 @@ int cmd__prio_queue(int argc, const char **argv)
>  	struct prio_queue pq = { intcmp };
>  
>  	while (*++argv) {
> -		if (!strcmp(*argv, "get"))
> -			show(prio_queue_get(&pq));
> -		else if (!strcmp(*argv, "dump")) {
> +		if (!strcmp(*argv, "get")) {
> +			void *peek = prio_queue_peek(&pq);
> +			void *get = prio_queue_get(&pq);
> +			if (peek != get)
> +				BUG("peek and get results do not match");
> +			show(get);
> +		} else if (!strcmp(*argv, "dump")) {

This is a nice cheap way of piggy-backing on the existing get tests.

-Peff

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 2/7] test-reach: add run_three_modes method
  2018-09-21 17:39     ` [PATCH v3 2/7] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
@ 2018-10-11 13:57       ` Jeff King
  0 siblings, 0 replies; 87+ messages in thread
From: Jeff King @ 2018-10-11 13:57 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano, Derrick Stolee

On Fri, Sep 21, 2018 at 10:39:29AM -0700, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The 'test_three_modes' method assumes we are using the 'test-tool
> reach' command for our test. However, we may want to use the data
> shape of our commit graph and the three modes (no commit-graph,
> full commit-graph, partial commit-graph) for other git commands.
> 
> Split test_three_modes to be a simple translation on a more general
> run_three_modes method that executes the given command and tests
> the actual output to the expected output.
>
> [...]
> +test_three_modes () {
> +	run_three_modes test-tool reach "$@"
> +}

Makes sense. Sometimes in the test suite we want to be able to pass a
whole shell snippet to eval, but unless we specifically need that for
this series, running "$@" directly is simpler.

-Peff

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 3/7] test-reach: add rev-list tests
  2018-09-21 17:39     ` [PATCH v3 3/7] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
@ 2018-10-11 13:58       ` Jeff King
  2018-10-12  4:34         ` Junio C Hamano
  0 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-10-11 13:58 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano, Derrick Stolee

On Fri, Sep 21, 2018 at 10:39:30AM -0700, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The rev-list command is critical to Git's functionality. Ensure it
> works in the three commit-graph environments constructed in
> t6600-test-reach.sh. Here are a few important types of rev-list
> operations:
> 
> * Basic: git rev-list --topo-order HEAD
> * Range: git rev-list --topo-order compare..HEAD
> * Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
> * Symmetric Difference: git rev-list --topo-order compare...HEAD

Makes sense. I'll assume you filled out all those "expect" blocks
correctly.  ;)

-Peff

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic
  2018-09-21 17:39     ` [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
@ 2018-10-11 14:06       ` Jeff King
  2018-10-12  6:33       ` Junio C Hamano
  1 sibling, 0 replies; 87+ messages in thread
From: Jeff King @ 2018-10-11 14:06 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano, Derrick Stolee

On Fri, Sep 21, 2018 at 10:39:32AM -0700, Derrick Stolee via GitGitGadget wrote:

> [..]
> When setting revs->limited only because revs->topo_order is true,
> only do so if generation numbers are not available. There is no
> reason to use the new logic as it will behave similarly when all
> generation numbers are INFINITY or ZERO.
> 
> In prepare_revision_walk(), if we have revs->topo_order but not
> revs->limited, then we trigger the new logic. It breaks the logic
> into three pieces, to fit with the existing framework:

Nicely explained. Your abstracted init/next/expand API seems sane, but
of course the real test will be reading the later patches that make use
of it. :)

The patch matches my understanding of your explanation.

-Peff

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 5/7] commit/revisions: bookkeeping before refactoring
  2018-09-21 17:39     ` [PATCH v3 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
@ 2018-10-11 14:21       ` Jeff King
  0 siblings, 0 replies; 87+ messages in thread
From: Jeff King @ 2018-10-11 14:21 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano, Derrick Stolee

On Fri, Sep 21, 2018 at 10:39:33AM -0700, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> There are a few things that need to move around a little before
> making a big refactoring in the topo-order logic:
> 
> 1. We need access to record_author_date() and
>    compare_commits_by_author_date() in revision.c. These are used
>    currently by sort_in_topological_order() in commit.c.
> 
> 2. Moving these methods to commit.h requires adding the author_slab
>    definition to commit.h.

The overall goal makes sense. Do we really need to define the whole slab
in the header file? We're going to end up with multiple copies of the
functions, since they're declared static in each file that includes
commit.h.

From what's here, I think you could get away with just:

  struct author_date_slab;
  void record_author_date(struct author_date_slab *author_date,
                          struct commit *commit);

in the header file. But presumably callers would eventually want to
allocate their own author dates. If that's all we need, then these days
you can do:

  declare_commit_slab(author_date, timestamp_t);

to get the type declaration.

If they really do need the functions accessible outside of commit.c,
then perhaps:

  define_shared_commit_slab(author_date, timestamp_t);

in commit.h, and:

  implement_shared_commit_slab(author_date, timestamp_t);

in commit.c (the type repetition is not too bad, as the compiler would
catch any mistakes).

The only downside of this approach is that we're less likely to be able
to inline element access (though "peek" is big enough that I'm not sure
it ends up inlined anyway).

> 3. The add_parents_to_list() method in revision.c performs logic
>    around the UNINTERESTING flag and other special cases depending
>    on the struct rev_info. Allow this method to ignore a NULL 'list'
>    parameter, as we will not be populating the list for our walk.

So now you can add_parents_to_list() without a list? That sounds
confusing. :)

Is it possible to split the function into two? Some
handle_uninteresting_parents() logic, and then an add_parents_to_list()
that calls that, but also adds to the list?

A cursory look at the function suggests it's actually kind of tricky.
Perhaps as an alternative, add_parents_to_list() could just get a more
descriptive name?

> ---
>  commit.c   | 11 ++++-------
>  commit.h   |  8 ++++++++
>  revision.c |  6 ++++--
>  3 files changed, 16 insertions(+), 9 deletions(-)

The patch itself seems straight-forward based on those explanations.

-Peff

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic
  2018-09-21 17:39     ` [PATCH v3 7/7] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
  2018-09-27 17:57       ` Derrick Stolee
@ 2018-10-11 15:35       ` Jeff King
  2018-10-11 16:21         ` Derrick Stolee
  2018-10-11 22:32       ` Stefan Beller
  2 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-10-11 15:35 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano, Derrick Stolee

On Fri, Sep 21, 2018 at 10:39:36AM -0700, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When running a command like 'git rev-list --topo-order HEAD',
> Git performed the following steps:
> [...]
> In the new algorithm, these three steps correspond to three
> different commit walks. We run these walks simultaneously,

A minor nit, but this commit message doesn't mention the most basic
thing up front: that its main purpose is to introduce a new algorithm
for topo-order. ;)

It's obvious in the context of reviewing the series, but somebody
reading "git log" later may want a bit more. Perhaps:

  revision.c: implement generation-based topo-order algorithm

as a subject, and/or an introductory paragraph like:

  The current --topo-order algorithm requires walking all commits we
  are going to output up front, topo-sorting them, all before
  outputting the first value. This patch introduces a new algorithm
  which uses stored generation numbers to incrementally walk in
  topo-order, outputting commits as we go.

Other than that, I find this to be a wonderfully explanatory commit
message. :)

> The walks are as follows:
> 
> 1. EXPLORE: using the explore_queue priority queue (ordered by
>    maximizing the generation number), parse each reachable
>    commit until all commits in the queue have generation
>    number strictly lower than needed. During this walk, update
>    the UNINTERESTING flags as necessary.

OK, this makes sense. If we know that everybody else in our queue is at
generation X, then it is safe to output a commit at generation greater
than X.

I think this by itself would allow us to implement "show no parents
before all of its children are shown", right? But --topo-order promises
a bit more: "avoid showing commits on multiple lines of history
intermixed".

I guess also INFINITY generation numbers need more. For a real
generation number, we know that "gen(A) == gen(B)" implies that there is
no ancestry relationship between the two. But not so for INFINITY.

> 2. INDEGREE: using the indegree_queue priority queue (ordered
>    by maximizing the generation number), add one to the in-
>    degree of each parent for each commit that is walked. Since
>    we walk in order of decreasing generation number, we know
>    that discovering an in-degree value of 0 means the value for
>    that commit was not initialized, so should be initialized to
>    two. (Recall that in-degree value "1" is what we use to say a
>    commit is ready for output.) As we iterate the parents of a
>    commit during this walk, ensure the EXPLORE walk has walked
>    beyond their generation numbers.

I wondered how this would work for INFINITY. We can't know the order of
a bunch of INFINITY nodes at all, so we never know when their in-degree
values are "done". But if I understand the EXPLORE walk, we'd basically
walk all of INFINITY down to something with a real generation number. Is
that right?

But after that, I'm not totally clear on why we need this INDEGREE walk.

> 3. TOPO: using the topo_queue priority queue (ordered based on
>    the sort_order given, which could be commit-date, author-
>    date, or typical topo-order which treats the queue as a LIFO
>    stack), remove a commit from the queue and decrement the
>    in-degree of each parent. If a parent has an in-degree of
>    one, then we add it to the topo_queue. Before we decrement
>    the in-degree, however, ensure the INDEGREE walk has walked
>    beyond that generation number.

OK, this makes sense to make --author-date-order, etc, work. Potentially
those numbers might have no relationship at all with the graph
structure, but we promise "no parent before its children are shown", so
this is really just a tie-breaker after the topo-sort anyway. As long as
steps 1 and 2 are correct and produce a complete set of commits for one
"layer", this should be OK.

I guess I'm not 100% convinced that we don't have a case where we
haven't yet parsed or considered some commit that we know cannot have an
ancestry relationship with commits we are outputting. But it may have an
author-date-order relationship.

(I'm not at all convinced that this _is_ a problem, and I suspect it
isn't; I'm only suggesting I haven't fully grokked the proof).

> ---
>  object.h   |   4 +-
>  revision.c | 196 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  revision.h |   2 +
>  3 files changed, 194 insertions(+), 8 deletions(-)

I'll pause here on evaluating the actual code. It looks sane from a
cursory read, but there's no point in digging further until I'm sure I
fully understand the algorithm. I think that needs a little more brain
power from me, and hopefully discussion around my comments above will
help trigger that.

-Peff

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic
  2018-10-11 15:35       ` Jeff King
@ 2018-10-11 16:21         ` Derrick Stolee
  2018-10-25  9:43           ` Jeff King
  0 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee @ 2018-10-11 16:21 UTC (permalink / raw)
  To: Jeff King, Derrick Stolee via GitGitGadget
  Cc: git, Junio C Hamano, Derrick Stolee

On 10/11/2018 11:35 AM, Jeff King wrote:
> On Fri, Sep 21, 2018 at 10:39:36AM -0700, Derrick Stolee via GitGitGadget wrote:
>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> When running a command like 'git rev-list --topo-order HEAD',
>> Git performed the following steps:
>> [...]
>> In the new algorithm, these three steps correspond to three
>> different commit walks. We run these walks simultaneously,
> A minor nit, but this commit message doesn't mention the most basic
> thing up front: that its main purpose is to introduce a new algorithm
> for topo-order. ;)
>
> It's obvious in the context of reviewing the series, but somebody
> reading "git log" later may want a bit more. Perhaps:
>
>    revision.c: implement generation-based topo-order algorithm
>
> as a subject, and/or an introductory paragraph like:
>
>    The current --topo-order algorithm requires walking all commits we
>    are going to output up front, topo-sorting them, all before
>    outputting the first value. This patch introduces a new algorithm
>    which uses stored generation numbers to incrementally walk in
>    topo-order, outputting commits as we go.
>
> Other than that, I find this to be a wonderfully explanatory commit
> message. :)

Good idea. I'll make that change.

>
>> The walks are as follows:
>>
>> 1. EXPLORE: using the explore_queue priority queue (ordered by
>>     maximizing the generation number), parse each reachable
>>     commit until all commits in the queue have generation
>>     number strictly lower than needed. During this walk, update
>>     the UNINTERESTING flags as necessary.
> OK, this makes sense. If we know that everybody else in our queue is at
> generation X, then it is safe to output a commit at generation greater
> than X.
>
> I think this by itself would allow us to implement "show no parents
> before all of its children are shown", right? But --topo-order promises
> a bit more: "avoid showing commits on multiple lines of history
> intermixed".
>
> I guess also INFINITY generation numbers need more. For a real
> generation number, we know that "gen(A) == gen(B)" implies that there is
> no ancestry relationship between the two. But not so for INFINITY.

Yeah, to deal with INFINITY (and ZERO, but that won't happen if 
generation_numbers_enabled() returns true), we treat gen(A) == gen(B) as 
a "no information" state. So, to output a commit at generation X, we 
need to have our maximum generation number in the unexplored area to be 
at most X - 1. You'll see strict inequality when checking generations.
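
The strict-inequality rule can be sketched as a small predicate. The
sentinel macros and safe_to_output() below are hypothetical names
chosen for this illustration, not the identifiers used in revision.c,
and the sentinel values are assumptions.

```c
#include <stdint.h>

/* Assumed sentinel values for this sketch. */
#define TOY_GEN_INFINITY UINT32_MAX
#define TOY_GEN_ZERO 0

/*
 * Return 1 if a commit at generation 'gen' can be treated as fully
 * explored, given the largest generation number still waiting in the
 * unexplored queue. Equal generations carry no information, so the
 * comparison must be strict.
 */
static int safe_to_output(uint32_t gen, uint32_t max_unexplored_gen)
{
	return max_unexplored_gen < gen;
}
```

With two commits both at the INFINITY sentinel, the predicate is false,
so the walk keeps exploring until the queue drops below INFINITY.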


>> 2. INDEGREE: using the indegree_queue priority queue (ordered
>>     by maximizing the generation number), add one to the in-
>>     degree of each parent for each commit that is walked. Since
>>     we walk in order of decreasing generation number, we know
>>     that discovering an in-degree value of 0 means the value for
>>     that commit was not initialized, so should be initialized to
>>     two. (Recall that in-degree value "1" is what we use to say a
>>     commit is ready for output.) As we iterate the parents of a
>>     commit during this walk, ensure the EXPLORE walk has walked
>>     beyond their generation numbers.
> I wondered how this would work for INFINITY. We can't know the order of
> a bunch of INFINITY nodes at all, so we never know when their in-degree
> values are "done". But if I understand the EXPLORE walk, we'd basically
> walk all of INFINITY down to something with a real generation number. Is
> that right?
>
> But after that, I'm not totally clear on why we need this INDEGREE walk.

The INDEGREE walk is an important element for Kahn's algorithm. The 
final output order is dictated by peeling commits of "indegree zero" to 
ensure all children are output before their parents. (Note: since we use 
literal 0 to mean "uninitialized", we peel commits when the indegree 
slab has value 1.)

This walk replaces the indegree logic from sort_in_topological_order(). 
That method performs one walk that fills the indegree slab, then another 
walk that peels the commits with indegree 0 and inserts them into a list.
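
As a rough illustration of that two-pass shape, here is a minimal
sketch of Kahn's algorithm using the "stored value is in-degree plus
one" convention. It is not the code in commit.c: struct toy_commit and
toy_topo_sort() are hypothetical, commits are plain array indexes, and
a LIFO stack stands in for the priority queue.

```c
#include <stddef.h>

#define MAX_NODES 8
#define MAX_PARENTS 2

/* Hypothetical toy commit: parents are array indexes, -1 means none. */
struct toy_commit {
	int parent[MAX_PARENTS];
};

/*
 * Order commits[0..n-1] so that every child precedes its parents.
 * Writes the order into out[] and returns how many entries were written.
 * indegree[] stores "real in-degree + 1"; 0 means "not seen yet".
 */
static size_t toy_topo_sort(const struct toy_commit *commits, size_t n,
			    int *out)
{
	unsigned indegree[MAX_NODES] = { 0 };
	int stack[MAX_NODES];
	size_t i, nr_stack = 0, nr_out = 0;
	int p;

	if (n > MAX_NODES)
		return 0;

	/* Pass 1: mark every commit as seen, then count its children. */
	for (i = 0; i < n; i++)
		indegree[i] = 1;
	for (i = 0; i < n; i++)
		for (p = 0; p < MAX_PARENTS; p++)
			if (commits[i].parent[p] >= 0)
				indegree[commits[i].parent[p]]++;

	/* Pass 2: start from commits with no children (stored value 1). */
	for (i = 0; i < n; i++)
		if (indegree[i] == 1)
			stack[nr_stack++] = (int)i;

	while (nr_stack) {
		int c = stack[--nr_stack];

		out[nr_out++] = c;
		for (p = 0; p < MAX_PARENTS; p++) {
			int parent = commits[c].parent[p];

			/* A parent is ready once all children are output. */
			if (parent >= 0 && --indegree[parent] == 1)
				stack[nr_stack++] = parent;
		}
	}
	return nr_out;
}
```

Pass 1 touches every commit before pass 2 can emit anything; that
up-front cost is what the INDEGREE walk spreads out.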

>> 3. TOPO: using the topo_queue priority queue (ordered based on
>>     the sort_order given, which could be commit-date, author-
>>     date, or typical topo-order which treats the queue as a LIFO
>>     stack), remove a commit from the queue and decrement the
>>     in-degree of each parent. If a parent has an in-degree of
>>     one, then we add it to the topo_queue. Before we decrement
>>     the in-degree, however, ensure the INDEGREE walk has walked
>>     beyond that generation number.
> OK, this makes sense to make --author-date-order, etc, work. Potentially
> those numbers might have no relationship at all with the graph
> structure, but we promise "no parent before its children are shown", so
> this is really just a tie-breaker after the topo-sort anyway. As long as
> steps 1 and 2 are correct and produce a complete set of commits for one
> "layer", this should be OK.
>
> I guess I'm not 100% convinced that we don't have a case where we
> haven't yet parsed or considered some commit that we know cannot have an
> ancestry relationship with commits we are outputting. But it may have an
> author-date-order relationship.
>
> (I'm not at all convinced that this _is_ a problem, and I suspect it
> isn't; I'm only suggesting I haven't fully grokked the proof).
The INDEGREE walk should not stop until it has explored at least to the 
point that all indegree 0 commits are exposed (relative to the current 
state of the walk).

At initialization, we walk from all starting positions until the maximum 
generation number in our queue is less than the minimum generation of a 
starting commit. The starting positions that have indegree 0 are then 
added to the topo_queue, and the sort order dictates which is the best. 
From this point on, we can only create a new "indegree 0" commit by
removing a commit from the topo_queue and decrementing the indegree of
its parents. Those parents with indegree 0 are inserted into topo_queue
and compared to all other indegree 0 commits. Thus, we will always
explore enough to make the right choice relative to our sort order.

>
>> ---
>>   object.h   |   4 +-
>>   revision.c | 196 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>>   revision.h |   2 +
>>   3 files changed, 194 insertions(+), 8 deletions(-)
> I'll pause here on evaluating the actual code. It looks sane from a
> cursory read, but there's no point in digging further until I'm sure I
> fully understand the algorithm. I think that needs a little more brain
> power from me, and hopefully discussion around my comments above will
> help trigger that.
Thanks for reading! I understand that reading the code is useless 
without understanding the high-level concepts. I'm happy to iterate on 
this. If I can find a better way to explain the algorithm in the commit 
message to avoid the "huh?" moments above, then I will.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic
  2018-09-21 17:39     ` [PATCH v3 7/7] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
  2018-09-27 17:57       ` Derrick Stolee
  2018-10-11 15:35       ` Jeff King
@ 2018-10-11 22:32       ` Stefan Beller
  2 siblings, 0 replies; 87+ messages in thread
From: Stefan Beller @ 2018-10-11 22:32 UTC (permalink / raw)
  To: gitgitgadget; +Cc: git, Jeff King, Junio C Hamano, Derrick Stolee

On Fri, Sep 21, 2018 at 10:39 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
[...]
> For the test above, I specifically selected a path that is changed
> frequently, including by merge commits. A less-frequently-changed
> path (such as 'README') has similar end-to-end time since we need
> to walk the same number of commits (before determining we do not
> have 100 hits). However, get get the benefit that the output is

"get get"

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 3/7] test-reach: add rev-list tests
  2018-10-11 13:58       ` Jeff King
@ 2018-10-12  4:34         ` Junio C Hamano
  0 siblings, 0 replies; 87+ messages in thread
From: Junio C Hamano @ 2018-10-12  4:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee via GitGitGadget, git, Derrick Stolee

Jeff King <peff@peff.net> writes:

> On Fri, Sep 21, 2018 at 10:39:30AM -0700, Derrick Stolee via GitGitGadget wrote:
>
>> From: Derrick Stolee <dstolee@microsoft.com>
>> 
>> The rev-list command is critical to Git's functionality. Ensure it
>> works in the three commit-graph environments constructed in
>> t6600-test-reach.sh. Here are a few important types of rev-list
>> operations:
>> 
>> * Basic: git rev-list --topo-order HEAD
>> * Range: git rev-list --topo-order compare..HEAD
>> * Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
>> * Symmetric Difference: git rev-list --topo-order compare...HEAD
>
> Makes sense. I'll assume you filled out all those "expect" blocks
> correctly.  ;)

Well, otherwise three-modes test would barf at least when it is
running in its "no graph" mode, so I'd assume we are covered.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic
  2018-09-21 17:39     ` [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
  2018-10-11 14:06       ` Jeff King
@ 2018-10-12  6:33       ` Junio C Hamano
  2018-10-12 12:32         ` Derrick Stolee
  2018-10-12 16:15         ` Johannes Sixt
  1 sibling, 2 replies; 87+ messages in thread
From: Junio C Hamano @ 2018-10-12  6:33 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, peff, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> * revs->limited implies we run limit_list() to walk the entire
>   reachable set. There are some short-cuts here, such as if we
>   perform a range query like 'git rev-list COMPARE..HEAD' and we
>   can stop limit_list() when all queued commits are uninteresting.
>
> * revs->topo_order implies we run sort_in_topological_order(). See
>   the implementation of that method in commit.c. It implies that
>   the full set of commits to order is in the given commit_list.
>
> These two methods imply that a 'git rev-list --topo-order HEAD'
> command must walk the entire reachable set of commits _twice_ before
> returning a single result.

With or without "--topo-order", running rev-list without any
negative commit means we must dig down to the roots that can be
reached from the positive commits we have.

I am not sure if having to run the "sort" of order N counts as "walk
the entire reachable set once" (in addition to the enumeration that
must be done to prepare that N commits, performed in limit_list()).

> In v2.18.0, the commit-graph file contains zero-valued bytes in the
> positions where the generation number is stored in v2.19.0 and later.
> Thus, we use generation_numbers_enabled() to check if the commit-graph
> is available and has non-zero generation numbers.
>
> When setting revs->limited only because revs->topo_order is true,
> only do so if generation numbers are not available. There is no
> reason to use the new logic as it will behave similarly when all
> generation numbers are INFINITY or ZERO.

> In prepare_revision_walk(), if we have revs->topo_order but not
> revs->limited, then we trigger the new logic. It breaks the logic
> into three pieces, to fit with the existing framework:
>
> 1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
>    struct. We use the presence of this struct as a signal to use the
>    new methods during our walk. In this patch, this method simply
>    calls limit_list() and sort_in_topological_order(). In the future,
>    this method will set up a new data structure to perform that logic
>    in-line.
>
> 2. next_topo_commit() provides get_revision_1() with the next topo-
>    ordered commit in the list. Currently, this simply pops the commit
>    from revs->commits.

... because everything is already done in #1 above.  Which makes sense.

> 3. expand_topo_walk() provides get_revision_1() with a way to signal
>    walking beyond the latest commit. Currently, this calls
>    add_parents_to_list() exactly like the old logic.

"latest"?  We dig down the history from newer to older, so at some
point we hit an old commit and need to find the parents to keep
walking towards even older parts of the history.  Did you mean
"earliest" instead?

> While this commit presents method redirection for performing the
> exact same logic as before, it allows the next commit to focus only
> on the new logic.

OK.

> diff --git a/revision.c b/revision.c
> index e18bd530e4..2dcde8a8ac 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -25,6 +25,7 @@
>  #include "worktree.h"
>  #include "argv-array.h"
>  #include "commit-reach.h"
> +#include "commit-graph.h"
>  
>  volatile show_early_output_fn_t show_early_output;
>  
> @@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
>  	if (revs->diffopt.objfind)
>  		revs->simplify_history = 0;
>  
> -	if (revs->topo_order)
> +	if (revs->topo_order && !generation_numbers_enabled(the_repository))
>  		revs->limited = 1;

Are we expecting that this is always a bool?  Can there be new
commits for which generation numbers are not computed and stored
while all the old, stable and packed commits have generation
numbers?

> @@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid,
>  	return 0;
>  }
>  
> +struct topo_walk_info {};
> +
> +static void init_topo_walk(struct rev_info *revs)
> +{
> +	struct topo_walk_info *info;
> +	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
> +	info = revs->topo_walk_info;
> +	memset(info, 0, sizeof(struct topo_walk_info));

There is no member in the struct at this point.  Are we sure this is
safe?  Just being curious.  I know xmalloc() gives us at least one
byte and info won't be NULL.  I just do not know offhand if we have
a guarantee that memset() acts sensibly to fill the first 0 bytes.
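
For reference, the C standard allows a length of zero for the string.h
functions as long as the pointer arguments are still valid; in that
case memset() writes nothing. (An empty struct is itself a GNU C
extension, where its size is 0.) A minimal sketch of the situation,
with alloc_and_clear() as a hypothetical helper that uses plain
malloc() rather than xmalloc():

```c
#include <stdlib.h>
#include <string.h>

/*
 * Allocate at least one byte and zero the first 'len' bytes.
 * With len == 0, memset() is still called with a valid pointer and
 * writes nothing, so the marker byte survives.
 */
static unsigned char *alloc_and_clear(size_t len)
{
	unsigned char *buf = malloc(len ? len : 1);

	if (!buf)
		return NULL;
	buf[0] = 0xAB;        /* marker so the len == 0 case is observable */
	memset(buf, 0, len);  /* no-op when len is 0 */
	return buf;
}
```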

> +	limit_list(revs);
> +	sort_in_topological_order(&revs->commits, revs->sort_order);
> +}
> +
> +static struct commit *next_topo_commit(struct rev_info *revs)
> +{
> +	return pop_commit(&revs->commits);
> +}
> +
> +static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
> +{
> +	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
> +		if (!revs->ignore_missing_links)
> +			die("Failed to traverse parents of commit %s",
> +			    oid_to_hex(&commit->object.oid));
> +	}
> +}
> +
>  int prepare_revision_walk(struct rev_info *revs)
>  {
>  	int i;
> @@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
>  		commit_list_sort_by_date(&revs->commits);
>  	if (revs->no_walk)
>  		return 0;
> -	if (revs->limited)
> +	if (revs->limited) {
>  		if (limit_list(revs) < 0)
>  			return -1;
> -	if (revs->topo_order)
> -		sort_in_topological_order(&revs->commits, revs->sort_order);
> +		if (revs->topo_order)
> +			sort_in_topological_order(&revs->commits, revs->sort_order);
> +	} else if (revs->topo_order)
> +		init_topo_walk(revs);
>  	if (revs->line_level_traverse)
>  		line_log_filter(revs);
>  	if (revs->simplify_merges)

The diff is a bit hard to grok around here, but 

 - when limited *and* topo_order, we do the sort here, as we know we
   already have called limit_list(), i.e. we behave identically as
   the code before this patch in that case.

 - when not limited but topo_order, then we do init_topo_walk();
   currently we do limit_list() and sort_in_topological_order(),
   which means we do the same as above.

As long as limit_list() and sort_in_topological_order() do not
look at revs->limited bit, this patch cannot cause any regression.
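
The case analysis above can be sketched as a small truth-table check. This
is only a toy model, not code from revision.c: the names steps_before(),
steps_after() and same_steps_everywhere() are made up for illustration, and
it assumes the setup_revisions() hunk of this patch, which forces
revs->limited for --topo-order only when generation numbers are not enabled.

```c
#include <assert.h>

/* Hypothetical bit flags for the two preparation steps. */
enum { STEP_LIMIT = 1, STEP_SORT = 2 };

/* Before the patch: --topo-order always forces revs->limited. */
static int steps_before(int limited, int topo_order)
{
	int steps = 0;
	if (topo_order)
		limited = 1;
	if (limited)
		steps |= STEP_LIMIT;
	if (topo_order)
		steps |= STEP_SORT;
	return steps;
}

/* After the patch: limited is forced only without generation numbers. */
static int steps_after(int limited, int topo_order, int have_generations)
{
	int steps = 0;
	if (topo_order && !have_generations)
		limited = 1;
	if (limited) {
		steps |= STEP_LIMIT;
		if (topo_order)
			steps |= STEP_SORT;
	} else if (topo_order) {
		/* init_topo_walk() currently performs both steps itself */
		steps |= STEP_LIMIT | STEP_SORT;
	}
	return steps;
}

/* Returns 1 when both versions agree for every combination of inputs. */
static int same_steps_everywhere(void)
{
	int limited, topo, gen;
	for (limited = 0; limited <= 1; limited++)
		for (topo = 0; topo <= 1; topo++)
			for (gen = 0; gen <= 1; gen++)
				if (steps_before(limited, topo) !=
				    steps_after(limited, topo, gen))
					return 0;
	return 1;
}
```

The model ignores the return value of limit_list(), which the old path
checks and init_topo_walk() does not.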

> @@ -3257,6 +3287,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
>  
>  		if (revs->reflog_info)
>  			commit = next_reflog_entry(revs->reflog_info);
> +		else if (revs->topo_walk_info)
> +			commit = next_topo_commit(revs);
>  		else
>  			commit = pop_commit(&revs->commits);

So this get_revision_1() always grabs the commit from next_topo_commit()
when topo-order is in effect.

> @@ -3278,6 +3310,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
>  
>  			if (revs->reflog_info)
>  				try_to_simplify_commit(revs, commit);
> +			else if (revs->topo_walk_info)
> +				expand_topo_walk(revs, commit);
>  			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
>  				if (!revs->ignore_missing_links)
>  					die("Failed to traverse parents of commit %s",

And this add-parents-or-barf is replicated in expand_topo_walk() at
this step, so there is no change in behaviour.

Looks like a cleanly done preparation that is a no-op.

> diff --git a/revision.h b/revision.h
> index 2b30ac270d..fd4154ff75 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -56,6 +56,8 @@ struct rev_cmdline_info {
>  #define REVISION_WALK_NO_WALK_SORTED 1
>  #define REVISION_WALK_NO_WALK_UNSORTED 2
>  
> +struct topo_walk_info;
> +
>  struct rev_info {
>  	/* Starting list */
>  	struct commit_list *commits;
> @@ -245,6 +247,8 @@ struct rev_info {
>  	const char *break_bar;
>  
>  	struct revision_sources *sources;
> +
> +	struct topo_walk_info *topo_walk_info;
>  };
>  
>  int ref_excluded(struct string_list *, const char *path);

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic
  2018-10-12  6:33       ` Junio C Hamano
@ 2018-10-12 12:32         ` Derrick Stolee
  2018-10-12 16:15         ` Johannes Sixt
  1 sibling, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-10-12 12:32 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget; +Cc: git, peff, Derrick Stolee

On 10/12/2018 2:33 AM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> * revs->limited implies we run limit_list() to walk the entire
>>    reachable set. There are some short-cuts here, such as if we
>>    perform a range query like 'git rev-list COMPARE..HEAD' and we
>>    can stop limit_list() when all queued commits are uninteresting.
>>
>> * revs->topo_order implies we run sort_in_topological_order(). See
>>    the implementation of that method in commit.c. It implies that
>>    the full set of commits to order is in the given commit_list.
>>
>> These two methods imply that a 'git rev-list --topo-order HEAD'
>> command must walk the entire reachable set of commits _twice_ before
>> returning a single result.
> With or without "--topo-order", running rev-list without any
> negative commit means we must dig down to the roots that can be
> reached from the positive commits we have.

If we use default order in 'git log', we don't walk all the way to the
root commits, and instead trust the commit-date. (This is different from
--date-order, which does guarantee parents after children.) In this
case, revs->limited is false.

> I am not sure if having to run the "sort" of order N counts as "walk
> the entire reachable set once" (in addition to the enumeration that
> must be done to prepare those N commits, performed in limit_list()).

sort_in_topological_order() does actually _two_ walks (the in-degree 
computation plus the walk that peels commits of in-degree zero), but 
those walks are cheaper because we've already parsed the commits in 
limit_list().
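
For readers who have not looked at that method, the two walks can be
sketched roughly as follows. This is a toy version over an array of
commits, not the code in commit.c: struct toy_commit, toy_topo_sort() and
the fixed-size limits are assumptions made for the example, and the real
method also honors revs->sort_order when choosing among ready commits.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COMMITS 16
#define MAX_PARENTS 2

/* Hypothetical toy commit: parents are array indices, -1 if absent. */
struct toy_commit {
	int parent[MAX_PARENTS];
};

/*
 * Write a children-before-parents order of commits[0..n) into out[].
 * Returns the number of commits emitted (n when the graph is acyclic).
 */
static size_t toy_topo_sort(const struct toy_commit *commits, size_t n, int *out)
{
	int indegree[MAX_COMMITS] = { 0 };
	int stack[MAX_COMMITS];
	size_t top = 0, emitted = 0, i;
	int k;

	assert(n <= MAX_COMMITS);

	/* Walk 1: in-degree = number of children that name this commit. */
	for (i = 0; i < n; i++)
		for (k = 0; k < MAX_PARENTS; k++)
			if (commits[i].parent[k] >= 0)
				indegree[commits[i].parent[k]]++;

	/* Seed with commits that no other commit names as a parent. */
	for (i = 0; i < n; i++)
		if (!indegree[i])
			stack[top++] = (int)i;

	/* Walk 2: emit a zero in-degree commit, then release its parents. */
	while (top) {
		int c = stack[--top];
		out[emitted++] = c;
		for (k = 0; k < MAX_PARENTS; k++) {
			int p = commits[c].parent[k];
			if (p >= 0 && !--indegree[p])
				stack[top++] = p;
		}
	}
	return emitted;
}
```

Both walks need the full commit set up front, which is why the first
result cannot be shown before limit_list() has finished.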
>> 3. expand_topo_walk() provides get_revision_1() with a way to signal
>>     walking beyond the latest commit. Currently, this calls
>>     add_parents_to_list() exactly like the old logic.
> "latest"?  We dig down the history from newer to older, so at some
> point we hit an old commit and need to find the parents to keep
> walking towards even older parts of the history.  Did you mean
> "earliest" instead?

I mean "latest" in terms of the algorithm, so "the commit that was
returned by get_revision_1() most recently". This could use some
rewriting for clarity.

>>
>> @@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
>>   	if (revs->diffopt.objfind)
>>   		revs->simplify_history = 0;
>>   
>> -	if (revs->topo_order)
>> +	if (revs->topo_order && !generation_numbers_enabled(the_repository))
>>   		revs->limited = 1;
> Are we expecting that this is always a bool?  Can there be new
> commits for which generation numbers are not computed and stored
> while all the old, stable and packed commits have generation
> numbers?

For this algorithm to work, we only care that _some_ commits have 
generation numbers. We expect that if a commit-graph file exists with 
generation numbers, then the majority of commits have generation 
numbers. The commits that were added or fetched since the commit-graph 
was written will have generation number INFINITY, but the topo-order 
algorithm will still work and be efficient in those cases. (This is also 
why we have the "half graph" case in test_three_modes.)
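
As a rough illustration of why a mix of finite and INFINITY generation
numbers is still usable: a generation number is one more than the largest
generation among a commit's parents, so comparing two numbers can rule out
reachability without walking. The sketch below is a toy, not Git's code;
toy_may_reach() and the TOY_* constants are names invented for this
example, and treating zero as "no information" is an assumption of the
sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for the constants discussed in this thread. */
#define TOY_GENERATION_INFINITY UINT32_MAX /* commit not in the commit-graph */
#define TOY_GENERATION_ZERO 0u             /* in the file, but not computed */

/*
 * Returns 0 when generation numbers alone prove that a commit with
 * generation 'from' cannot reach a different commit with generation
 * 'to'; returns 1 when a real walk is still needed.
 */
static int toy_may_reach(uint32_t from, uint32_t to)
{
	if (from == TOY_GENERATION_ZERO || to == TOY_GENERATION_ZERO)
		return 1; /* no usable information */
	if (from == TOY_GENERATION_INFINITY)
		return 1; /* unwritten commits may reach anything */
	return from > to; /* ancestors always have strictly smaller numbers */
}
```

Commits at INFINITY simply never get ruled out early, so they cost extra
walking but do not affect correctness.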

>> @@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid,
>>   	return 0;
>>   }
>>   
>> +struct topo_walk_info {};
>> +
>> +static void init_topo_walk(struct rev_info *revs)
>> +{
>> +	struct topo_walk_info *info;
>> +	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
>> +	info = revs->topo_walk_info;
>> +	memset(info, 0, sizeof(struct topo_walk_info));
> There is no member in the struct at this point.  Are we sure this is
> safe?  Just being curious.  I know xmalloc() gives us at least one
> byte and info won't be NULL.  I just do not know offhand if we have
> a guarantee that memset() acts sensibly to fill the first 0 bytes.

This is a good question. It seems to work for me when I check out your
version of this commit (6c04ff30 "revision.c: begin refactoring
--topo-order logic") and run all tests.

>> +	limit_list(revs);
>> +	sort_in_topological_order(&revs->commits, revs->sort_order);
>> +}
>> +
>> +static struct commit *next_topo_commit(struct rev_info *revs)
>> +{
>> +	return pop_commit(&revs->commits);
>> +}
>> +
>> +static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
>> +{
>> +	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
>> +		if (!revs->ignore_missing_links)
>> +			die("Failed to traverse parents of commit %s",
>> +			    oid_to_hex(&commit->object.oid));
>> +	}
>> +}
>> +
>>   int prepare_revision_walk(struct rev_info *revs)
>>   {
>>   	int i;
>> @@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
>>   		commit_list_sort_by_date(&revs->commits);
>>   	if (revs->no_walk)
>>   		return 0;
>> -	if (revs->limited)
>> +	if (revs->limited) {
>>   		if (limit_list(revs) < 0)
>>   			return -1;
>> -	if (revs->topo_order)
>> -		sort_in_topological_order(&revs->commits, revs->sort_order);
>> +		if (revs->topo_order)
>> +			sort_in_topological_order(&revs->commits, revs->sort_order);
>> +	} else if (revs->topo_order)
>> +		init_topo_walk(revs);
>>   	if (revs->line_level_traverse)
>>   		line_log_filter(revs);
>>   	if (revs->simplify_merges)
> The diff is a bit hard to grok around here, but
>
>   - when limited *and* topo_order, we do the sort here, as we know we
>     already have called limit_list(), i.e. we behave identically as
>     the code before this patch in that case.
>
>   - when not limited but topo_order, then we do init_topo_walk();
>     currently we do limit_list() and sort_in_topological_order(),
>     which means we do the same as above.
>
> As long as limit_list() and sort_in_topological_order() do not
> look at revs->limited bit, this patch cannot cause any regression.
>
>> @@ -3257,6 +3287,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
>>   
>>   		if (revs->reflog_info)
>>   			commit = next_reflog_entry(revs->reflog_info);
>> +		else if (revs->topo_walk_info)
>> +			commit = next_topo_commit(revs);
>>   		else
>>   			commit = pop_commit(&revs->commits);
> So this get_revision_1() always grabs the commit from next_topo_commit()
> when topo-order is in effect.

And specifically, when the conditions for our new topo-walk algorithm
are in effect. If the commit-graph doesn't exist, the old logic will
still go through for "git log --topo-order".

Thanks for the careful look!
-Stolee

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic
  2018-10-12  6:33       ` Junio C Hamano
  2018-10-12 12:32         ` Derrick Stolee
@ 2018-10-12 16:15         ` Johannes Sixt
  2018-10-13  8:05           ` Junio C Hamano
  1 sibling, 1 reply; 87+ messages in thread
From: Johannes Sixt @ 2018-10-12 16:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Derrick Stolee via GitGitGadget, git, peff, Derrick Stolee

Am 12.10.18 um 08:33 schrieb Junio C Hamano:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> +struct topo_walk_info {};
>> +
>> +static void init_topo_walk(struct rev_info *revs)
>> +{
>> +	struct topo_walk_info *info;
>> +	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
>> +	info = revs->topo_walk_info;
>> +	memset(info, 0, sizeof(struct topo_walk_info));
> 
> There is no member in the struct at this point.  Are we sure this is
> safe?  Just being curious.

sizeof cannot return 0. sizeof(struct topo_walk_info) will be 1 here.
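
Whatever sizeof yields for a member-less struct on a given compiler, the
memset() half of the question can be checked on its own: the C standard
allows a length of zero for the string functions as long as the pointer
arguments are still valid, and in that case nothing is written. A small
sketch, with zero_prefix() and zero_length_memset_is_noop() being names
made up for this example:

```c
#include <assert.h>
#include <string.h>

/*
 * Zero the first n bytes of buf. With n == 0 this writes nothing,
 * provided buf is still a valid pointer.
 */
static void zero_prefix(unsigned char *buf, size_t n)
{
	assert(buf != NULL);
	memset(buf, 0, n);
}

/* Returns 1 if a zero-length clear left every byte untouched. */
static int zero_length_memset_is_noop(void)
{
	unsigned char buf[4] = { 0xAA, 0xBB, 0xCC, 0xDD };

	zero_prefix(buf, 0);
	return buf[0] == 0xAA && buf[1] == 0xBB &&
	       buf[2] == 0xCC && buf[3] == 0xDD;
}
```

Since xmalloc() never returns NULL, the pointer handed to memset() in
init_topo_walk() is valid in either case.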

-- Hannes

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic
  2018-10-12 16:15         ` Johannes Sixt
@ 2018-10-13  8:05           ` Junio C Hamano
  0 siblings, 0 replies; 87+ messages in thread
From: Junio C Hamano @ 2018-10-13  8:05 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Derrick Stolee via GitGitGadget, git, peff, Derrick Stolee

Johannes Sixt <j6t@kdbg.org> writes:

> Am 12.10.18 um 08:33 schrieb Junio C Hamano:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>> +struct topo_walk_info {};
>>> +
>>> +static void init_topo_walk(struct rev_info *revs)
>>> +{
>>> +	struct topo_walk_info *info;
>>> +	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
>>> +	info = revs->topo_walk_info;
>>> +	memset(info, 0, sizeof(struct topo_walk_info));
>>
>> There is no member in the struct at this point.  Are we sure this is
>> safe?  Just being curious.
>
> sizeof cannot return 0. sizeof(struct topo_walk_info) will be 1 here.

Thanks.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v4 0/7] Use generation numbers for --topo-order
  2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
                       ` (7 preceding siblings ...)
  2018-09-21 21:22     ` [PATCH v3 0/7] Use generation numbers for --topo-order Junio C Hamano
@ 2018-10-16 22:36     ` Derrick Stolee via GitGitGadget
  2018-10-16 22:36       ` [PATCH v4 1/7] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
                         ` (9 more replies)
  8 siblings, 10 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-10-16 22:36 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano

This patch series performs a decently-sized refactoring of the revision-walk
machinery. Well, "refactoring" is probably the wrong word, as I don't
actually remove the old code. Instead, when we see certain options in the
'rev_info' struct, we redirect the commit-walk logic to a new set of methods
that distribute the workload differently. By using generation numbers in the
commit-graph, we can significantly improve 'git log --graph' commands (and
the underlying 'git rev-list --topo-order').

On the Linux repository, I got the following performance results when
comparing to the previous version with or without a commit-graph:

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

If you want to read this series but are unfamiliar with the commit-graph and
generation numbers, then I recommend reading 
Documentation/technical/commit-graph.txt or a blog post [1] I wrote on the
subject. In particular, the three-part walk described in "revision.c:
refactor basic topo-order logic" is present (but underexplained) as an
animated PNG [2].

Since revision.c is an incredibly important (and old) portion of the
codebase -- and because there are so many orthogonal options in 'struct
rev_info' -- I consider this submission to be "RFC quality". That is, I am
not confident that I am not missing anything, or that my solution is the
best it can be. I did merge this branch with ds/commit-graph-with-grafts and
the "DO-NOT-MERGE: write and read commit-graph always" commit that computes
a commit-graph with every 'git commit' command. The test suite passed with
that change, available on GitHub [3]. To ensure that I cover at least the
cases I think are interesting, I added tests to t6600-test-reach.sh to verify
the walks report the correct results for the three cases there (no
commit-graph, full commit-graph, and a partial commit-graph so the walk
starts at GENERATION_NUMBER_INFINITY).

One notable case that is not included in this series is the case of a
history comparison such as 'git rev-list --topo-order A..B'. The existing
code in limit_list() has ways to cut the walk short when all pending commits
are UNINTERESTING. Since this code depends on commit_list instead of the
prio_queue we are using here, I chose to leave it untouched for now. We can
revisit it in a separate series later. Since handle_commit() turns on
revs->limited when a commit is UNINTERESTING, we do not hit the new code in
this case. Removing this 'revs->limited = 1;' line yields correct results,
but the performance is worse.

This series was based on ds/reachable, but is now based on 'master' to not
conflict with 182070 "commit: use timestamp_t for author_date_slab". There
is a small conflict with md/filter-trees, because it renamed a flag in
revision.h in the line before I add new flags. Hopefully this conflict is
not too difficult to resolve.

Changes in V3: I added a new patch that updates the tab-alignment for flags
in revision.h before adding new ones (Thanks, Ævar!). Also, I squashed the
recommended changes to run_three_modes and test_three_modes from Szeder and
Junio. Thanks!

Changes in V4: I'm sending a V4 to respond to the feedback so far. Still
looking forward to more on the really big commit!

 * Removed the whitespace changes to the flags in revision.h that caused
   merge pain.

 * The prio-queue peek function is now covered by tests when in "stack"
   mode.

 * The "add_parents_to_list()" function is now renamed to
   "process_parents()".

 * Added a new commit that expands test coverage with alternate orders and
   file history (use GIT_TEST_COMMIT_GRAPH to have
   t6012-rev-list-simplify.sh cover the new logic). These tests found a
   problem with author dates (I forgot to record them during the explore
   walk).

 * Commit message edits.

Thanks, -Stolee

[1] 
https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/
Supercharging the Git Commit Graph III: Generations and Graph Algorithms

[2] 
https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png
Animation showing three-part walk

[3] https://github.com/derrickstolee/git/tree/topo-order/test
A branch containing this series along with commits to compute commit-graph
in entire test suite.

Cc: avarab@gmail.com
Cc: szeder.dev@gmail.com

Derrick Stolee (7):
  prio-queue: add 'peek' operation
  test-reach: add run_three_modes method
  test-reach: add rev-list tests
  revision.c: begin refactoring --topo-order logic
  commit/revisions: bookkeeping before refactoring
  revision.c: generation-based topo-order algorithm
  t6012: make rev-list tests more interesting

 commit.c                     |  11 +-
 commit.h                     |   8 ++
 object.h                     |   4 +-
 prio-queue.c                 |   9 ++
 prio-queue.h                 |   6 +
 revision.c                   | 245 +++++++++++++++++++++++++++++++++--
 revision.h                   |   6 +
 t/helper/test-prio-queue.c   |  26 ++--
 t/t0009-prio-queue.sh        |  14 ++
 t/t6012-rev-list-simplify.sh |  45 +++++--
 t/t6600-test-reach.sh        |  96 +++++++++++++-
 11 files changed, 430 insertions(+), 40 deletions(-)


base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-25%2Fderrickstolee%2Ftopo-order%2Fprogress-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-25/derrickstolee/topo-order/progress-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/25

Range-diff vs v3:

 1:  cc1ec4c270 ! 1:  2358cfd5ed prio-queue: add 'peek' operation
     @@ -8,7 +8,9 @@
          add it as prio_queue_peek().
      
          Add a reference-level comparison in t/helper/test-prio-queue.c
     -    so this method is exercised by t0009-prio-queue.sh.
     +    so this method is exercised by t0009-prio-queue.sh. Further, add
     +    a test that checks the behavior when the compare function is NULL
     +    (i.e. the queue becomes a stack).
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ -56,6 +58,11 @@
      -		if (!strcmp(*argv, "get"))
      -			show(prio_queue_get(&pq));
      -		else if (!strcmp(*argv, "dump")) {
     +-			int *v;
     +-			while ((v = prio_queue_get(&pq)))
     +-			       show(v);
     +-		}
     +-		else {
      +		if (!strcmp(*argv, "get")) {
      +			void *peek = prio_queue_peek(&pq);
      +			void *get = prio_queue_get(&pq);
     @@ -63,6 +70,40 @@
      +				BUG("peek and get results do not match");
      +			show(get);
      +		} else if (!strcmp(*argv, "dump")) {
     - 			int *v;
     - 			while ((v = prio_queue_get(&pq)))
     - 			       show(v);
     ++			void *peek;
     ++			void *get;
     ++			while ((peek = prio_queue_peek(&pq))) {
     ++				get = prio_queue_get(&pq);
     ++				if (peek != get)
     ++					BUG("peek and get results do not match");
     ++				show(get);
     ++			}
     ++		} else if (!strcmp(*argv, "stack")) {
     ++			pq.compare = NULL;
     ++		} else {
     + 			int *v = malloc(sizeof(*v));
     + 			*v = atoi(*argv);
     + 			prio_queue_put(&pq, v);
     +
     +diff --git a/t/t0009-prio-queue.sh b/t/t0009-prio-queue.sh
     +--- a/t/t0009-prio-queue.sh
     ++++ b/t/t0009-prio-queue.sh
     +@@
     + 	test_cmp expect actual
     + '
     + 
     ++cat >expect <<'EOF'
     ++3
     ++2
     ++6
     ++4
     ++5
     ++1
     ++8
     ++EOF
     ++test_expect_success 'stack order' '
     ++	test-tool prio-queue stack 8 1 5 4 6 2 3 dump >actual &&
     ++	test_cmp expect actual
     ++'
     ++
     + test_done
 2:  b2a1ade148 = 2:  3a4b68e479 test-reach: add run_three_modes method
 3:  b0ceb96076 = 3:  12a3f6d367 test-reach: add rev-list tests
 4:  fd1a0ab7cd = 4:  cd9eef9688 revision.c: begin refactoring --topo-order logic
 5:  e86f304082 ! 5:  f3e291665d commit/revisions: bookkeeping before refactoring
     @@ -16,7 +16,11 @@
             around the UNINTERESTING flag and other special cases depending
             on the struct rev_info. Allow this method to ignore a NULL 'list'
             parameter, as we will not be populating the list for our walk.
     +       Also rename the method to the slightly more generic name
     +       process_parents() to make clear that this method does more than
     +       add to a list (and no list is required anymore).
      
     +    Helped-by: Jeff King <peff@peff.net>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
      diff --git a/commit.c b/commit.c
     @@ -28,7 +32,8 @@
       
      -/* record author-date for each commit object */
      -define_commit_slab(author_date_slab, timestamp_t);
     --
     ++implement_shared_commit_slab(author_date_slab, timestamp_t);
     + 
      -static void record_author_date(struct author_date_slab *author_date,
      -			       struct commit *commit)
      +void record_author_date(struct author_date_slab *author_date,
     @@ -64,7 +69,7 @@
       extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
       
      +/* record author-date for each commit object */
     -+define_commit_slab(author_date_slab, timestamp_t);
     ++define_shared_commit_slab(author_date_slab, timestamp_t);
      +
      +void record_author_date(struct author_date_slab *author_date,
      +			struct commit *commit);
     @@ -77,6 +82,17 @@
      diff --git a/revision.c b/revision.c
      --- a/revision.c
      +++ b/revision.c
     +@@
     + 		*cache = new_entry;
     + }
     + 
     +-static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
     +-		    struct commit_list **list, struct commit_list **cache_ptr)
     ++static int process_parents(struct rev_info *revs, struct commit *commit,
     ++			   struct commit_list **list, struct commit_list **cache_ptr)
     + {
     + 	struct commit_list *parent = commit->parents;
     + 	unsigned left_flag;
      @@
       			if (p->object.flags & SEEN)
       				continue;
     @@ -97,3 +113,39 @@
       		}
       		if (revs->first_parent_only)
       			break;
     +@@
     + 
     + 		if (revs->max_age != -1 && (commit->date < revs->max_age))
     + 			obj->flags |= UNINTERESTING;
     +-		if (add_parents_to_list(revs, commit, &list, NULL) < 0)
     ++		if (process_parents(revs, commit, &list, NULL) < 0)
     + 			return -1;
     + 		if (obj->flags & UNINTERESTING) {
     + 			mark_parents_uninteresting(commit);
     +@@
     + 
     + static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
     + {
     +-	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
     ++	if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
     + 		if (!revs->ignore_missing_links)
     + 			die("Failed to traverse parents of commit %s",
     + 			    oid_to_hex(&commit->object.oid));
     +@@
     + 	for (;;) {
     + 		struct commit *p = *pp;
     + 		if (!revs->limited)
     +-			if (add_parents_to_list(revs, p, &revs->commits, &cache) < 0)
     ++			if (process_parents(revs, p, &revs->commits, &cache) < 0)
     + 				return rewrite_one_error;
     + 		if (p->object.flags & UNINTERESTING)
     + 			return rewrite_one_ok;
     +@@
     + 				try_to_simplify_commit(revs, commit);
     + 			else if (revs->topo_walk_info)
     + 				expand_topo_walk(revs, commit);
     +-			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
     ++			else if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
     + 				if (!revs->ignore_missing_links)
     + 					die("Failed to traverse parents of commit %s",
     + 						oid_to_hex(&commit->object.oid));
 6:  fa6d5ef152 < -:  ---------- revision.h: add whitespace in flag definitions
 7:  020b2f50c5 ! 6:  aa0bb2221d revision.c: refactor basic topo-order logic
     @@ -1,6 +1,15 @@
      Author: Derrick Stolee <dstolee@microsoft.com>
      
     -    revision.c: refactor basic topo-order logic
     +    revision.c: generation-based topo-order algorithm
     +
     +    The current --topo-order algorithm requires walking all
     +    reachable commits up front, topo-sorting them, all before
     +    outputting the first value. This patch introduces a new
     +    algorithm which uses stored generation numbers to
     +    incrementally walk in topo-order, outputting commits as
     +    we go. This can dramatically reduce the computation time
     +    to write a fixed number of commits, such as when limiting
     +    with "-n <N>" or filling the first page of a pager.
      
          When running a command like 'git rev-list --topo-order HEAD',
          Git performed the following steps:
     @@ -139,11 +148,12 @@
          frequently, including by merge commits. A less-frequently-changed
          path (such as 'README') has similar end-to-end time since we need
          to walk the same number of commits (before determining we do not
     -    have 100 hits). However, get get the benefit that the output is
     +    have 100 hits). However, get the benefit that the output is
          presented to the user as it is discovered, much the same as a
          normal 'git log' command (no '--topo-order'). This is an improved
          user experience, even if the command has the same runtime.
      
     +    Helped-by: Jeff King <peff@peff.net>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
      diff --git a/object.h b/object.h
     @@ -216,10 +226,13 @@
      +	if (parse_commit_gently(c, 1) < 0)
      +		return;
      +
     ++	if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
     ++		record_author_date(&info->author_date, c);
     ++
      +	if (revs->max_age != -1 && (c->date < revs->max_age))
      +		c->object.flags |= UNINTERESTING;
      +
     -+	if (add_parents_to_list(revs, c, NULL, NULL) < 0)
     ++	if (process_parents(revs, c, NULL, NULL) < 0)
      +		return;
      +
      +	if (c->object.flags & UNINTERESTING)
     @@ -366,10 +379,10 @@
       
       static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
       {
     --	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
     +-	if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
      +	struct commit_list *p;
      +	struct topo_walk_info *info = revs->topo_walk_info;
     -+	if (add_parents_to_list(revs, commit, NULL, NULL) < 0) {
     ++	if (process_parents(revs, commit, NULL, NULL) < 0) {
       		if (!revs->ignore_missing_links)
       			die("Failed to traverse parents of commit %s",
      -			    oid_to_hex(&commit->object.oid));
     @@ -404,9 +417,9 @@
      --- a/revision.h
      +++ b/revision.h
      @@
     - #define USER_GIVEN		(1u<<25) /* given directly by the user */
     - #define TRACK_LINEAR		(1u<<26)
     - #define ALL_REV_FLAGS		(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
     + #define USER_GIVEN	(1u<<25) /* given directly by the user */
     + #define TRACK_LINEAR	(1u<<26)
     + #define ALL_REV_FLAGS	(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
      +#define TOPO_WALK_EXPLORED	(1u<<27)
      +#define TOPO_WALK_INDEGREE	(1u<<28)
       
 -:  ---------- > 7:  a21febe112 t6012: make rev-list tests more interesting

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v4 1/7] prio-queue: add 'peek' operation
  2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
@ 2018-10-16 22:36       ` Derrick Stolee via GitGitGadget
  2018-10-16 22:36       ` [PATCH v4 2/7] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
                         ` (8 subsequent siblings)
  9 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-10-16 22:36 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When consuming a priority queue, it can be convenient to inspect
the next object that will be dequeued without actually dequeueing
it. Our existing library did not have such a 'peek' operation, so
add it as prio_queue_peek().

Add a reference-level comparison in t/helper/test-prio-queue.c
so this method is exercised by t0009-prio-queue.sh. Further, add
a test that checks the behavior when the compare function is NULL
(i.e. the queue becomes a stack).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 prio-queue.c               |  9 +++++++++
 prio-queue.h               |  6 ++++++
 t/helper/test-prio-queue.c | 26 ++++++++++++++++++--------
 t/t0009-prio-queue.sh      | 14 ++++++++++++++
 4 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/prio-queue.c b/prio-queue.c
index a078451872..d3f488cb05 100644
--- a/prio-queue.c
+++ b/prio-queue.c
@@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue)
 	}
 	return result;
 }
+
+void *prio_queue_peek(struct prio_queue *queue)
+{
+	if (!queue->nr)
+		return NULL;
+	if (!queue->compare)
+		return queue->array[queue->nr - 1].data;
+	return queue->array[0].data;
+}
diff --git a/prio-queue.h b/prio-queue.h
index d030ec9dd6..682e51867a 100644
--- a/prio-queue.h
+++ b/prio-queue.h
@@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing);
  */
 extern void *prio_queue_get(struct prio_queue *);
 
+/*
+ * Gain access to the "thing" that would be returned by
+ * prio_queue_get, but do not remove it from the queue.
+ */
+extern void *prio_queue_peek(struct prio_queue *);
+
 extern void clear_prio_queue(struct prio_queue *);
 
 /* Reverse the LIFO elements */
diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c
index 9807b649b1..5bc9c46ea5 100644
--- a/t/helper/test-prio-queue.c
+++ b/t/helper/test-prio-queue.c
@@ -22,14 +22,24 @@ int cmd__prio_queue(int argc, const char **argv)
 	struct prio_queue pq = { intcmp };
 
 	while (*++argv) {
-		if (!strcmp(*argv, "get"))
-			show(prio_queue_get(&pq));
-		else if (!strcmp(*argv, "dump")) {
-			int *v;
-			while ((v = prio_queue_get(&pq)))
-			       show(v);
-		}
-		else {
+		if (!strcmp(*argv, "get")) {
+			void *peek = prio_queue_peek(&pq);
+			void *get = prio_queue_get(&pq);
+			if (peek != get)
+				BUG("peek and get results do not match");
+			show(get);
+		} else if (!strcmp(*argv, "dump")) {
+			void *peek;
+			void *get;
+			while ((peek = prio_queue_peek(&pq))) {
+				get = prio_queue_get(&pq);
+				if (peek != get)
+					BUG("peek and get results do not match");
+				show(get);
+			}
+		} else if (!strcmp(*argv, "stack")) {
+			pq.compare = NULL;
+		} else {
 			int *v = malloc(sizeof(*v));
 			*v = atoi(*argv);
 			prio_queue_put(&pq, v);
diff --git a/t/t0009-prio-queue.sh b/t/t0009-prio-queue.sh
index e56dfce668..3941ad2528 100755
--- a/t/t0009-prio-queue.sh
+++ b/t/t0009-prio-queue.sh
@@ -47,4 +47,18 @@ test_expect_success 'notice empty queue' '
 	test_cmp expect actual
 '
 
+cat >expect <<'EOF'
+3
+2
+6
+4
+5
+1
+8
+EOF
+test_expect_success 'stack order' '
+	test-tool prio-queue stack 8 1 5 4 6 2 3 dump >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v4 2/7] test-reach: add run_three_modes method
  2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  2018-10-16 22:36       ` [PATCH v4 1/7] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
@ 2018-10-16 22:36       ` Derrick Stolee via GitGitGadget
  2018-10-16 22:36       ` [PATCH v4 3/7] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
                         ` (7 subsequent siblings)
  9 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-10-16 22:36 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'test_three_modes' method assumes we are using the 'test-tool
reach' command for our test. However, we may want to use the data
shape of our commit graph and the three modes (no commit-graph,
full commit-graph, partial commit-graph) for other git commands.

Split test_three_modes into a thin wrapper around a more general
run_three_modes method that executes the given command and compares
the actual output to the expected output.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t6600-test-reach.sh | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index d139a00d1d..9d65b8b946 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -53,18 +53,22 @@ test_expect_success 'setup' '
 	git config core.commitGraph true
 '
 
-test_three_modes () {
+run_three_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	"$@" <input >actual &&
 	test_cmp expect actual &&
 	cp commit-graph-full .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	"$@" <input >actual &&
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	"$@" <input >actual &&
 	test_cmp expect actual
 }
 
+test_three_modes () {
+	run_three_modes test-tool reach "$@"
+}
+
 test_expect_success 'ref_newer:miss' '
 	cat >input <<-\EOF &&
 	A:commit-5-7
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v4 3/7] test-reach: add rev-list tests
  2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  2018-10-16 22:36       ` [PATCH v4 1/7] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
  2018-10-16 22:36       ` [PATCH v4 2/7] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
@ 2018-10-16 22:36       ` Derrick Stolee via GitGitGadget
  2018-10-21 10:21         ` Jakub Narebski
  2018-10-16 22:36       ` [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
                         ` (6 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-10-16 22:36 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The rev-list command is critical to Git's functionality. Ensure it
works in the three commit-graph environments constructed in
t6600-test-reach.sh. Here are a few important types of rev-list
operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t6600-test-reach.sh | 84 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 9d65b8b946..288f703b7b 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' '
 	test_three_modes commit_contains --tag
 '
 
+test_expect_success 'rev-list: basic topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+		commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \
+		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
+		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent topo-order' '
+	git rev-parse \
+		commit-6-6 \
+		commit-6-5 \
+		commit-6-4 \
+		commit-6-3 \
+		commit-6-2 \
+		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+	>expect &&
+	run_three_modes git rev-list --first-parent --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 \
+		commit-6-5 commit-5-5 commit-4-5 \
+		commit-6-4 commit-5-4 commit-4-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent range topo-order' '
+	git rev-parse \
+		commit-6-6 \
+		commit-6-5 \
+		commit-6-4 \
+		commit-6-3 \
+		commit-6-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: ancestry-path topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+	>expect &&
+	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: symmetric difference topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 \
+		commit-6-5 commit-5-5 commit-4-5 \
+		commit-6-4 commit-5-4 commit-4-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+		commit-3-8 commit-2-8 commit-1-8 \
+		commit-3-7 commit-2-7 commit-1-7 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic
  2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (2 preceding siblings ...)
  2018-10-16 22:36       ` [PATCH v4 3/7] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
@ 2018-10-16 22:36       ` Derrick Stolee via GitGitGadget
  2018-10-21 15:55         ` Jakub Narebski
  2018-10-16 22:36       ` [PATCH v4 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
                         ` (5 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-10-16 22:36 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here: for example, in a
  range query like 'git rev-list COMPARE..HEAD' we can stop
  limit_list() once all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. Without them,
there is no reason to use the new logic, as it will behave similarly
to the old logic when all generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 42 ++++++++++++++++++++++++++++++++++++++----
 revision.h |  4 ++++
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index e18bd530e4..2dcde8a8ac 100644
--- a/revision.c
+++ b/revision.c
@@ -25,6 +25,7 @@
 #include "worktree.h"
 #include "argv-array.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 	if (revs->diffopt.objfind)
 		revs->simplify_history = 0;
 
-	if (revs->topo_order)
+	if (revs->topo_order && !generation_numbers_enabled(the_repository))
 		revs->limited = 1;
 
 	if (revs->prune_data.nr) {
@@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid,
 	return 0;
 }
 
+struct topo_walk_info {};
+
+static void init_topo_walk(struct rev_info *revs)
+{
+	struct topo_walk_info *info;
+	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+	info = revs->topo_walk_info;
+	memset(info, 0, sizeof(struct topo_walk_info));
+
+	limit_list(revs);
+	sort_in_topological_order(&revs->commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+	return pop_commit(&revs->commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+		if (!revs->ignore_missing_links)
+			die("Failed to traverse parents of commit %s",
+			    oid_to_hex(&commit->object.oid));
+	}
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
 	int i;
@@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
 		return 0;
-	if (revs->limited)
+	if (revs->limited) {
 		if (limit_list(revs) < 0)
 			return -1;
-	if (revs->topo_order)
-		sort_in_topological_order(&revs->commits, revs->sort_order);
+		if (revs->topo_order)
+			sort_in_topological_order(&revs->commits, revs->sort_order);
+	} else if (revs->topo_order)
+		init_topo_walk(revs);
 	if (revs->line_level_traverse)
 		line_log_filter(revs);
 	if (revs->simplify_merges)
@@ -3257,6 +3287,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
 
 		if (revs->reflog_info)
 			commit = next_reflog_entry(revs->reflog_info);
+		else if (revs->topo_walk_info)
+			commit = next_topo_commit(revs);
 		else
 			commit = pop_commit(&revs->commits);
 
@@ -3278,6 +3310,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
 
 			if (revs->reflog_info)
 				try_to_simplify_commit(revs, commit);
+			else if (revs->topo_walk_info)
+				expand_topo_walk(revs, commit);
 			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
 				if (!revs->ignore_missing_links)
 					die("Failed to traverse parents of commit %s",
diff --git a/revision.h b/revision.h
index 2b30ac270d..fd4154ff75 100644
--- a/revision.h
+++ b/revision.h
@@ -56,6 +56,8 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct topo_walk_info;
+
 struct rev_info {
 	/* Starting list */
 	struct commit_list *commits;
@@ -245,6 +247,8 @@ struct rev_info {
 	const char *break_bar;
 
 	struct revision_sources *sources;
+
+	struct topo_walk_info *topo_walk_info;
 };
 
 int ref_excluded(struct string_list *, const char *path);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v4 5/7] commit/revisions: bookkeeping before refactoring
  2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (3 preceding siblings ...)
  2018-10-16 22:36       ` [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
@ 2018-10-16 22:36       ` Derrick Stolee via GitGitGadget
  2018-10-21 21:17         ` Jakub Narebski
  2018-10-16 22:36       ` [PATCH v4 6/7] revision.c: generation-based topo-order algorithm Derrick Stolee via GitGitGadget
                         ` (4 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-10-16 22:36 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

There are a few things that need to move around a little before
making a big refactoring in the topo-order logic:

1. We need access to record_author_date() and
   compare_commits_by_author_date() in revision.c. These are used
   currently by sort_in_topological_order() in commit.c.

2. Moving these methods to commit.h requires adding the author_slab
   definition to commit.h.

3. The add_parents_to_list() method in revision.c performs logic
   around the UNINTERESTING flag and other special cases depending
   on the struct rev_info. Allow this method to ignore a NULL 'list'
   parameter, as we will not be populating the list for our walk.
   Also rename the method to the slightly more generic name
   process_parents() to make clear that this method does more than
   add to a list (and no list is required anymore).

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c   | 11 +++++------
 commit.h   |  8 ++++++++
 revision.c | 18 ++++++++++--------
 3 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/commit.c b/commit.c
index d0f199e122..861a485e93 100644
--- a/commit.c
+++ b/commit.c
@@ -655,11 +655,10 @@ struct commit *pop_commit(struct commit_list **stack)
 /* count number of children that have not been emitted */
 define_commit_slab(indegree_slab, int);
 
-/* record author-date for each commit object */
-define_commit_slab(author_date_slab, timestamp_t);
+implement_shared_commit_slab(author_date_slab, timestamp_t);
 
-static void record_author_date(struct author_date_slab *author_date,
-			       struct commit *commit)
+void record_author_date(struct author_date_slab *author_date,
+			struct commit *commit)
 {
 	const char *buffer = get_commit_buffer(commit, NULL);
 	struct ident_split ident;
@@ -684,8 +683,8 @@ fail_exit:
 	unuse_commit_buffer(commit, buffer);
 }
 
-static int compare_commits_by_author_date(const void *a_, const void *b_,
-					  void *cb_data)
+int compare_commits_by_author_date(const void *a_, const void *b_,
+				   void *cb_data)
 {
 	const struct commit *a = a_, *b = b_;
 	struct author_date_slab *author_date = cb_data;
diff --git a/commit.h b/commit.h
index 2b1a734388..977d397356 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,7 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 #include "pretty.h"
+#include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
 #define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
@@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf);
  */
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
+/* record author-date for each commit object */
+define_shared_commit_slab(author_date_slab, timestamp_t);
+
+void record_author_date(struct author_date_slab *author_date,
+			struct commit *commit);
+
+int compare_commits_by_author_date(const void *a_, const void *b_, void *unused);
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
diff --git a/revision.c b/revision.c
index 2dcde8a8ac..36458265a0 100644
--- a/revision.c
+++ b/revision.c
@@ -768,8 +768,8 @@ static void commit_list_insert_by_date_cached(struct commit *p, struct commit_li
 		*cache = new_entry;
 }
 
-static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
-		    struct commit_list **list, struct commit_list **cache_ptr)
+static int process_parents(struct rev_info *revs, struct commit *commit,
+			   struct commit_list **list, struct commit_list **cache_ptr)
 {
 	struct commit_list *parent = commit->parents;
 	unsigned left_flag;
@@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
 			if (p->object.flags & SEEN)
 				continue;
 			p->object.flags |= SEEN;
-			commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
+			if (list)
+				commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
 		}
 		return 0;
 	}
@@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
 		p->object.flags |= left_flag;
 		if (!(p->object.flags & SEEN)) {
 			p->object.flags |= SEEN;
-			commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
+			if (list)
+				commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
 		}
 		if (revs->first_parent_only)
 			break;
@@ -1091,7 +1093,7 @@ static int limit_list(struct rev_info *revs)
 
 		if (revs->max_age != -1 && (commit->date < revs->max_age))
 			obj->flags |= UNINTERESTING;
-		if (add_parents_to_list(revs, commit, &list, NULL) < 0)
+		if (process_parents(revs, commit, &list, NULL) < 0)
 			return -1;
 		if (obj->flags & UNINTERESTING) {
 			mark_parents_uninteresting(commit);
@@ -2913,7 +2915,7 @@ static struct commit *next_topo_commit(struct rev_info *revs)
 
 static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 {
-	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+	if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
 		if (!revs->ignore_missing_links)
 			die("Failed to traverse parents of commit %s",
 			    oid_to_hex(&commit->object.oid));
@@ -2979,7 +2981,7 @@ static enum rewrite_result rewrite_one(struct rev_info *revs, struct commit **pp
 	for (;;) {
 		struct commit *p = *pp;
 		if (!revs->limited)
-			if (add_parents_to_list(revs, p, &revs->commits, &cache) < 0)
+			if (process_parents(revs, p, &revs->commits, &cache) < 0)
 				return rewrite_one_error;
 		if (p->object.flags & UNINTERESTING)
 			return rewrite_one_ok;
@@ -3312,7 +3314,7 @@ static struct commit *get_revision_1(struct rev_info *revs)
 				try_to_simplify_commit(revs, commit);
 			else if (revs->topo_walk_info)
 				expand_topo_walk(revs, commit);
-			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+			else if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
 				if (!revs->ignore_missing_links)
 					die("Failed to traverse parents of commit %s",
 						oid_to_hex(&commit->object.oid));
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 87+ messages in thread

* [PATCH v4 6/7] revision.c: generation-based topo-order algorithm
  2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (4 preceding siblings ...)
  2018-10-16 22:36       ` [PATCH v4 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
@ 2018-10-16 22:36       ` Derrick Stolee via GitGitGadget
  2018-10-22 13:37         ` Jakub Narebski
  2018-10-16 22:36       ` [PATCH v4 7/7] t6012: make rev-list tests more interesting Derrick Stolee via GitGitGadget
                         ` (3 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-10-16 22:36 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
   adds them to a linked list, and distributes UNINTERESTING
   flags. If all unprocessed commits are UNINTERESTING, then
   it may terminate without walking all reachable commits.
   This does not occur if we do not specify UNINTERESTING
   commits.

2. Run sort_in_topological_order(), which is an implementation
   of Kahn's algorithm. It first iterates through the entire
   set of important commits and computes the in-degree of each
   (plus one, as we use 'zero' as a special value here). Then,
   we walk the commits in priority order, adding them to the
   priority queue if and only if their in-degree is one. As
   we remove commits from this priority queue, we decrement the
   in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
   uses pop_commit on the full list of commits computed by
   sort_in_topological_order().
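
As a rough illustration of step 2 (not part of this patch), here is a
minimal sketch of Kahn's algorithm with the "in-degree plus one"
convention. The toy_commit struct, the array-index parents, and the
use of a plain LIFO stack in place of a priority queue are all
assumptions made for this example; the real code works on struct
commit and a commit slab.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 16
#define TOY_MAX_PARENTS 2

/* Hypothetical toy commit: parents are array indices, -1 when unused. */
struct toy_commit {
	int parents[TOY_MAX_PARENTS];
};

/*
 * Kahn's algorithm with in-degree stored "plus one":
 * 0 means "not part of the walk", 1 means "ready for output".
 * Writes the emitted order to out[] and returns the number emitted.
 * A LIFO stack stands in for the priority queue here.
 */
static size_t toy_topo_sort(const struct toy_commit *c, size_t n, int *out)
{
	int indegree[TOY_MAX_NODES] = { 0 };
	int stack[TOY_MAX_NODES];
	size_t top = 0, emitted = 0, i;
	int k;

	assert(n <= TOY_MAX_NODES);

	/* Every commit starts at 1; each child adds 1 to each parent. */
	for (i = 0; i < n; i++)
		indegree[i] = 1;
	for (i = 0; i < n; i++)
		for (k = 0; k < TOY_MAX_PARENTS; k++)
			if (c[i].parents[k] >= 0)
				indegree[c[i].parents[k]]++;

	/* Seed with commits that no other commit lists as a parent. */
	for (i = 0; i < n; i++)
		if (indegree[i] == 1)
			stack[top++] = (int)i;

	while (top) {
		int cur = stack[--top];

		out[emitted++] = cur;
		indegree[cur] = 0;
		for (k = 0; k < TOY_MAX_PARENTS; k++) {
			int p = c[cur].parents[k];

			/* A parent is ready once all children were emitted. */
			if (p >= 0 && --indegree[p] == 1)
				stack[top++] = p;
		}
	}
	return emitted;
}
```

Note that the first two loops touch every commit before anything can
be emitted, which is the up-front cost the new algorithm avoids.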

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
  number is one more than the maximum generation number among
  its parents.

* If the commit has no parent, then the generation number is one.

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
  indicates that the commit is not stored in the commit-graph and
  the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
  to say that the commit-graph was generated by a version of Git
  that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the one we usually expect from
generation numbers:

    If A and B are commits with generation numbers gen(A) and
    gen(B) and gen(A) < gen(B), then A cannot reach B.

Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.
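
To make the definition and the reachability statement concrete, here
is a small sketch (not part of this patch). The gen_node struct, the
GEN_TOY_INFINITY constant, and the requirement that parents are stored
at lower array indices than their children are assumptions of this
example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GEN_TOY_INFINITY 0xFFFFFFFFu
#define GEN_TOY_MAX_PARENTS 2

/* Hypothetical toy node: parents are array indices, -1 when unused. */
struct gen_node {
	int parents[GEN_TOY_MAX_PARENTS];
	uint32_t generation;
};

/*
 * Assign generation numbers in one pass. Assumes every parent is
 * stored at a lower index than its children, so parents are always
 * computed first. A node without parents gets generation one.
 */
static void gen_compute(struct gen_node *nodes, size_t n)
{
	size_t i;
	int k;

	for (i = 0; i < n; i++) {
		uint32_t max_parent = 0;

		for (k = 0; k < GEN_TOY_MAX_PARENTS; k++) {
			int p = nodes[i].parents[k];

			if (p < 0)
				continue;
			assert((size_t)p < i);
			if (nodes[p].generation > max_parent)
				max_parent = nodes[p].generation;
		}
		nodes[i].generation = max_parent + 1;
	}
}

/*
 * Returns 1 when generation numbers alone prove that a cannot reach b.
 * Returns 0 when they prove nothing, for example when both values
 * are GEN_TOY_INFINITY.
 */
static int gen_cannot_reach(const struct gen_node *a,
			    const struct gen_node *b)
{
	return a->generation < b->generation;
}
```

The check is one-directional: a return value of 0 does not mean that a
reaches b, only that a walk is still needed to find out.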

The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
   maximizing the generation number), parse each reachable
   commit until all commits in the queue have generation
   number strictly lower than needed. During this walk, update
   the UNINTERESTING flags as necessary.

2. INDEGREE: using the indegree_queue priority queue (ordered
   by maximizing the generation number), add one to the in-
   degree of each parent for each commit that is walked. Since
   we walk in order of decreasing generation number, we know
   that discovering an in-degree value of 0 means the value for
   that commit was not initialized, so should be initialized to
   two. (Recall that in-degree value "1" is what we use to say a
   commit is ready for output.) As we iterate the parents of a
   commit during this walk, ensure the EXPLORE walk has walked
   beyond their generation numbers.

3. TOPO: using the topo_queue priority queue (ordered based on
   the sort_order given, which could be commit-date, author-
   date, or typical topo-order which treats the queue as a LIFO
   stack), remove a commit from the queue and decrement the
   in-degree of each parent. If a parent has an in-degree of
   one, then we add it to the topo_queue. Before we decrement
   the in-degree, however, ensure the INDEGREE walk has walked
   beyond that generation number.
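
The "advance only as far as necessary" idea in the INDEGREE walk can
be sketched as follows (not part of this patch). The lazy_commit and
lazy_walk types, the fixed-size arrays, and the linear scan standing
in for a priority queue are assumptions of this example; the EXPLORE
walk and the UNINTERESTING flags are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LAZY_MAX 16

/* Hypothetical toy commit: parents are array indices, -1 when unused. */
struct lazy_commit {
	int parents[2];
	uint32_t generation;
};

struct lazy_walk {
	const struct lazy_commit *commits;
	int queue[LAZY_MAX];             /* unordered; scanned for the max */
	size_t nr;
	int indegree[LAZY_MAX];          /* 0 = unset, 1 = ready for output */
	unsigned char queued[LAZY_MAX];  /* plays the role of a walk flag */
};

/* Queue a commit at most once over the lifetime of the walk. */
static void lazy_insert(struct lazy_walk *w, int c)
{
	if (w->queued[c])
		return;
	w->queued[c] = 1;
	assert(w->nr < LAZY_MAX);
	w->queue[w->nr++] = c;
}

/* Position of the queued commit with the highest generation (nr > 0). */
static size_t lazy_peek(const struct lazy_walk *w)
{
	size_t best = 0, i;

	for (i = 1; i < w->nr; i++)
		if (w->commits[w->queue[i]].generation >
		    w->commits[w->queue[best]].generation)
			best = i;
	return best;
}

/* Walk until every queued commit has generation below min_generation. */
static void lazy_indegrees_to_depth(struct lazy_walk *w,
				    uint32_t min_generation)
{
	while (w->nr) {
		size_t at = lazy_peek(w);
		int c = w->queue[at], k;

		if (w->commits[c].generation < min_generation)
			break;
		w->queue[at] = w->queue[--w->nr]; /* remove c */

		for (k = 0; k < 2; k++) {
			int p = w->commits[c].parents[k];

			if (p < 0)
				continue;
			/* 0 means "never seen": start at 1 + this child. */
			if (w->indegree[p])
				w->indegree[p]++;
			else
				w->indegree[p] = 2;
			lazy_insert(w, p);
		}
	}
}
```

On a linear chain with generations 5 down to 1, asking for depth 4
leaves the in-degree of the generation-2 commit untouched, which is
the saving the three-walk design is after.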

The implementations of these walks are in the following methods:

* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk

These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.

One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisons, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on the order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.

For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, we get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 object.h   |   4 +-
 revision.c | 199 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 revision.h |   2 +
 3 files changed, 197 insertions(+), 8 deletions(-)

diff --git a/object.h b/object.h
index 0feb90ae61..796792cb32 100644
--- a/object.h
+++ b/object.h
@@ -59,7 +59,7 @@ struct object_array {
 
 /*
  * object flag allocation:
- * revision.h:               0---------10                              2526
+ * revision.h:               0---------10                              25----28
  * fetch-pack.c:             01
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
@@ -78,7 +78,7 @@ struct object_array {
  * builtin/show-branch.c:    0-------------------------------------------26
  * builtin/unpack-objects.c:                                 2021
  */
-#define FLAG_BITS  27
+#define FLAG_BITS  29
 
 /*
  * The object type is stored in 3 bits.
diff --git a/revision.c b/revision.c
index 36458265a0..472f3994e3 100644
--- a/revision.c
+++ b/revision.c
@@ -26,6 +26,7 @@
 #include "argv-array.h"
 #include "commit-reach.h"
 #include "commit-graph.h"
+#include "prio-queue.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2895,30 +2896,216 @@ static int mark_uninteresting(const struct object_id *oid,
 	return 0;
 }
 
-struct topo_walk_info {};
+define_commit_slab(indegree_slab, int);
+
+struct topo_walk_info {
+	uint32_t min_generation;
+	struct prio_queue explore_queue;
+	struct prio_queue indegree_queue;
+	struct prio_queue topo_queue;
+	struct indegree_slab indegree;
+	struct author_date_slab author_date;
+};
+
+static inline void test_flag_and_insert(struct prio_queue *q, struct commit *c, int flag)
+{
+	if (c->object.flags & flag)
+		return;
+
+	c->object.flags |= flag;
+	prio_queue_put(q, c);
+}
+
+static void explore_walk_step(struct rev_info *revs)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit_list *p;
+	struct commit *c = prio_queue_get(&info->explore_queue);
+
+	if (!c)
+		return;
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
+		record_author_date(&info->author_date, c);
+
+	if (revs->max_age != -1 && (c->date < revs->max_age))
+		c->object.flags |= UNINTERESTING;
+
+	if (process_parents(revs, c, NULL, NULL) < 0)
+		return;
+
+	if (c->object.flags & UNINTERESTING)
+		mark_parents_uninteresting(c);
+
+	for (p = c->parents; p; p = p->next)
+		test_flag_and_insert(&info->explore_queue, p->item, TOPO_WALK_EXPLORED);
+}
+
+static void explore_to_depth(struct rev_info *revs,
+			     uint32_t gen)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c;
+	while ((c = prio_queue_peek(&info->explore_queue)) &&
+	       c->generation >= gen)
+		explore_walk_step(revs);
+}
+
+static void indegree_walk_step(struct rev_info *revs)
+{
+	struct commit_list *p;
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c = prio_queue_get(&info->indegree_queue);
+
+	if (!c)
+		return;
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	explore_to_depth(revs, c->generation);
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	for (p = c->parents; p; p = p->next) {
+		struct commit *parent = p->item;
+		int *pi = indegree_slab_at(&info->indegree, parent);
+
+		if (*pi)
+			(*pi)++;
+		else
+			*pi = 2;
+
+		test_flag_and_insert(&info->indegree_queue, parent, TOPO_WALK_INDEGREE);
+
+		if (revs->first_parent_only)
+			return;
+	}
+}
+
+static void compute_indegrees_to_depth(struct rev_info *revs)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c;
+	while ((c = prio_queue_peek(&info->indegree_queue)) &&
+	       c->generation >= info->min_generation)
+		indegree_walk_step(revs);
+}
 
 static void init_topo_walk(struct rev_info *revs)
 {
 	struct topo_walk_info *info;
+	struct commit_list *list;
 	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
 	info = revs->topo_walk_info;
 	memset(info, 0, sizeof(struct topo_walk_info));
 
-	limit_list(revs);
-	sort_in_topological_order(&revs->commits, revs->sort_order);
+	init_indegree_slab(&info->indegree);
+	memset(&info->explore_queue, '\0', sizeof(info->explore_queue));
+	memset(&info->indegree_queue, '\0', sizeof(info->indegree_queue));
+	memset(&info->topo_queue, '\0', sizeof(info->topo_queue));
+
+	switch (revs->sort_order) {
+	default: /* REV_SORT_IN_GRAPH_ORDER */
+		info->topo_queue.compare = NULL;
+		break;
+	case REV_SORT_BY_COMMIT_DATE:
+		info->topo_queue.compare = compare_commits_by_commit_date;
+		break;
+	case REV_SORT_BY_AUTHOR_DATE:
+		init_author_date_slab(&info->author_date);
+		info->topo_queue.compare = compare_commits_by_author_date;
+		info->topo_queue.cb_data = &info->author_date;
+		break;
+	}
+
+	info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
+	info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;
+
+	info->min_generation = GENERATION_NUMBER_INFINITY;
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+		test_flag_and_insert(&info->explore_queue, c, TOPO_WALK_EXPLORED);
+		test_flag_and_insert(&info->indegree_queue, c, TOPO_WALK_INDEGREE);
+
+		if (parse_commit_gently(c, 1))
+			continue;
+		if (c->generation < info->min_generation)
+			info->min_generation = c->generation;
+	}
+
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+		*(indegree_slab_at(&info->indegree, c)) = 1;
+
+		if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
+			record_author_date(&info->author_date, c);
+	}
+	compute_indegrees_to_depth(revs);
+
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+
+		if (*(indegree_slab_at(&info->indegree, c)) == 1)
+			prio_queue_put(&info->topo_queue, c);
+	}
+
+	/*
+	 * This is unfortunate; the initial tips need to be shown
+	 * in the order given from the revision traversal machinery.
+	 */
+	if (revs->sort_order == REV_SORT_IN_GRAPH_ORDER)
+		prio_queue_reverse(&info->topo_queue);
 }
 
 static struct commit *next_topo_commit(struct rev_info *revs)
 {
-	return pop_commit(&revs->commits);
+	struct commit *c;
+	struct topo_walk_info *info = revs->topo_walk_info;
+
+	/* pop next off of topo_queue */
+	c = prio_queue_get(&info->topo_queue);
+
+	if (c)
+		*(indegree_slab_at(&info->indegree, c)) = 0;
+
+	return c;
 }
 
 static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 {
-	if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
+	struct commit_list *p;
+	struct topo_walk_info *info = revs->topo_walk_info;
+	if (process_parents(revs, commit, NULL, NULL) < 0) {
 		if (!revs->ignore_missing_links)
 			die("Failed to traverse parents of commit %s",
-			    oid_to_hex(&commit->object.oid));
+				oid_to_hex(&commit->object.oid));
+	}
+
+	for (p = commit->parents; p; p = p->next) {
+		struct commit *parent = p->item;
+		int *pi;
+
+		if (parse_commit_gently(parent, 1) < 0)
+			continue;
+
+		if (parent->generation < info->min_generation) {
+			info->min_generation = parent->generation;
+			compute_indegrees_to_depth(revs);
+		}
+
+		pi = indegree_slab_at(&info->indegree, parent);
+
+		(*pi)--;
+		if (*pi == 1)
+			prio_queue_put(&info->topo_queue, parent);
+
+		if (revs->first_parent_only)
+			return;
 	}
 }
 
diff --git a/revision.h b/revision.h
index fd4154ff75..b0b3bb8025 100644
--- a/revision.h
+++ b/revision.h
@@ -24,6 +24,8 @@
 #define USER_GIVEN	(1u<<25) /* given directly by the user */
 #define TRACK_LINEAR	(1u<<26)
 #define ALL_REV_FLAGS	(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
+#define TOPO_WALK_EXPLORED	(1u<<27)
+#define TOPO_WALK_INDEGREE	(1u<<28)
 
 #define DECORATE_SHORT_REFS	1
 #define DECORATE_FULL_REFS	2
-- 
gitgitgadget



* [PATCH v4 7/7] t6012: make rev-list tests more interesting
  2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (5 preceding siblings ...)
  2018-10-16 22:36       ` [PATCH v4 6/7] revision.c: generation-based topo-order algorithm Derrick Stolee via GitGitGadget
@ 2018-10-16 22:36       ` Derrick Stolee via GitGitGadget
  2018-10-23 15:48         ` Jakub Narebski
  2018-10-21 12:57       ` [PATCH v4 0/7] Use generation numbers for --topo-order Jakub Narebski
                         ` (2 subsequent siblings)
  9 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-10-16 22:36 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we are working to rewrite some of the revision-walk machinery,
there could easily be some interesting interactions between the
options that force topological constraints (--topo-order,
--date-order, and --author-date-order) along with specifying a
path.

Add extra tests to t6012-rev-list-simplify.sh to add coverage of
these interactions. To ensure interesting things occur, alter the
repo data shape to have different orders depending on topo-, date-,
or author-date-order.

When testing using GIT_TEST_COMMIT_GRAPH, this assists in covering
the new logic for topo-order walks using generation numbers. The
extra tests can be added independently.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t6012-rev-list-simplify.sh | 45 ++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 9 deletions(-)

diff --git a/t/t6012-rev-list-simplify.sh b/t/t6012-rev-list-simplify.sh
index b5a1190ffe..a10f0df02b 100755
--- a/t/t6012-rev-list-simplify.sh
+++ b/t/t6012-rev-list-simplify.sh
@@ -12,6 +12,22 @@ unnote () {
 	git name-rev --tags --stdin | sed -e "s|$OID_REGEX (tags/\([^)]*\)) |\1 |g"
 }
 
+#
+# Create a test repo with interesting commit graph:
+#
+# A--B----------G--H--I--K--L
+#  \  \           /     /
+#   \  \         /     /
+#    C------E---F     J
+#        \_/
+#
+# The commits are laid out from left-to-right starting with
+# the root commit A and terminating at the tip commit L.
+#
+# There are a few places where we adjust the commit date or
+# author date to make the --topo-order, --date-order, and
+# --author-date-order flags produce different output.
+
 test_expect_success setup '
 	echo "Hi there" >file &&
 	echo "initial" >lost &&
@@ -21,10 +37,18 @@ test_expect_success setup '
 
 	git branch other-branch &&
 
+	git symbolic-ref HEAD refs/heads/unrelated &&
+	git rm -f "*" &&
+	echo "Unrelated branch" >side &&
+	git add side &&
+	test_tick && git commit -m "Side root" &&
+	note J &&
+	git checkout master &&
+
 	echo "Hello" >file &&
 	echo "second" >lost &&
 	git add file lost &&
-	test_tick && git commit -m "Modified file and lost" &&
+	test_tick && GIT_AUTHOR_DATE=$(($test_tick + 120)) git commit -m "Modified file and lost" &&
 	note B &&
 
 	git checkout other-branch &&
@@ -63,13 +87,6 @@ test_expect_success setup '
 	test_tick && git commit -a -m "Final change" &&
 	note I &&
 
-	git symbolic-ref HEAD refs/heads/unrelated &&
-	git rm -f "*" &&
-	echo "Unrelated branch" >side &&
-	git add side &&
-	test_tick && git commit -m "Side root" &&
-	note J &&
-
 	git checkout master &&
 	test_tick && git merge --allow-unrelated-histories -m "Coolest" unrelated &&
 	note K &&
@@ -103,14 +120,24 @@ check_result () {
 	check_outcome success "$@"
 }
 
-check_result 'L K J I H G F E D C B A' --full-history
+check_result 'L K J I H F E D C G B A' --full-history --topo-order
+check_result 'L K I H G F E D C B J A' --full-history
+check_result 'L K I H G F E D C B J A' --full-history --date-order
+check_result 'L K I H G F E D B C J A' --full-history --author-date-order
 check_result 'K I H E C B A' --full-history -- file
 check_result 'K I H E C B A' --full-history --topo-order -- file
 check_result 'K I H E C B A' --full-history --date-order -- file
+check_result 'K I H E B C A' --full-history --author-date-order -- file
 check_result 'I E C B A' --simplify-merges -- file
+check_result 'I E C B A' --simplify-merges --topo-order -- file
+check_result 'I E C B A' --simplify-merges --date-order -- file
+check_result 'I E B C A' --simplify-merges --author-date-order -- file
 check_result 'I B A' -- file
 check_result 'I B A' --topo-order -- file
+check_result 'I B A' --date-order -- file
+check_result 'I B A' --author-date-order -- file
 check_result 'H' --first-parent -- another-file
+check_result 'H' --first-parent --topo-order -- another-file
 
 check_result 'E C B A' --full-history E -- lost
 test_expect_success 'full history simplification without parent' '
-- 
gitgitgadget


* Re: [PATCH v4 3/7] test-reach: add rev-list tests
  2018-10-16 22:36       ` [PATCH v4 3/7] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
@ 2018-10-21 10:21         ` Jakub Narebski
  2018-10-21 15:28           ` Derrick Stolee
  0 siblings, 1 reply; 87+ messages in thread
From: Jakub Narebski @ 2018-10-21 10:21 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Jeff King, Junio C Hamano, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> The rev-list command is critical to Git's functionality. Ensure it
> works in the three commit-graph environments constructed in
> t6600-test-reach.sh. Here are a few important types of rev-list
> operations:
>
> * Basic: git rev-list --topo-order HEAD
> * Range: git rev-list --topo-order compare..HEAD
> * Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
> * Symmetric Difference: git rev-list --topo-order compare...HEAD

Could you remind us here which of those operations will be using
generation numbers after this patch series?

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/t6600-test-reach.sh | 84 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 84 insertions(+)
>
> diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
> index 9d65b8b946..288f703b7b 100755
> --- a/t/t6600-test-reach.sh
> +++ b/t/t6600-test-reach.sh
> @@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' '
>  	test_three_modes commit_contains --tag
>  '
>  
> +test_expect_success 'rev-list: basic topo-order' '
> +	git rev-parse \
> +		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
> +		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
> +		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
> +		commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \
> +		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
> +		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
> +	>expect &&
> +	run_three_modes git rev-list --topo-order commit-6-6
> +'

I wonder if this test could be made easier to write and less
error-prone, e.g. by generating it from ASCII-art graphics.

But it is good enough.

[...]

--
Jakub Narębski


* Re: [PATCH v4 0/7] Use generation numbers for --topo-order
  2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (6 preceding siblings ...)
  2018-10-16 22:36       ` [PATCH v4 7/7] t6012: make rev-list tests more interesting Derrick Stolee via GitGitGadget
@ 2018-10-21 12:57       ` Jakub Narebski
  2018-11-01  5:21       ` Junio C Hamano
  2018-11-01 13:46       ` [PATCH v5 " Derrick Stolee
  9 siblings, 0 replies; 87+ messages in thread
From: Jakub Narebski @ 2018-10-21 12:57 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, peff, Junio C Hamano

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This patch series performs a decently-sized refactoring of the revision-walk
> machinery. Well, "refactoring" is probably the wrong word, as I don't
> actually remove the old code. Instead, when we see certain options in the
> 'rev_info' struct, we redirect the commit-walk logic to a new set of methods
> that distribute the workload differently. By using generation numbers in the
> commit-graph, we can significantly improve 'git log --graph' commands (and
> the underlying 'git rev-list --topo-order').
>
> On the Linux repository, I got the following performance results when
> comparing to the previous version with or without a commit-graph:
>
> Test: git rev-list --topo-order -100 HEAD
> HEAD~1, no commit-graph: 6.80 s
> HEAD~1, w/ commit-graph: 0.77 s
>   HEAD, w/ commit-graph: 0.02 s
>
> Test: git rev-list --topo-order -100 HEAD -- tools
> HEAD~1, no commit-graph: 9.63 s
> HEAD~1, w/ commit-graph: 6.06 s
>   HEAD, w/ commit-graph: 0.06 s

I wonder if we could make use of existing infrastructure in 't/perf/' to
perform those benchmarks for us (perhaps augmented with large
repository, and only if requested -- similarly to how long tests are
handled).

--
Jakub Narębski


* Re: [PATCH v4 3/7] test-reach: add rev-list tests
  2018-10-21 10:21         ` Jakub Narebski
@ 2018-10-21 15:28           ` Derrick Stolee
  0 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-10-21 15:28 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee via GitGitGadget
  Cc: git, Jeff King, Junio C Hamano, Derrick Stolee

On 10/21/2018 6:21 AM, Jakub Narebski wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The rev-list command is critical to Git's functionality. Ensure it
>> works in the three commit-graph environments constructed in
>> t6600-test-reach.sh. Here are a few important types of rev-list
>> operations:
>>
>> * Basic: git rev-list --topo-order HEAD
>> * Range: git rev-list --topo-order compare..HEAD
>> * Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
>> * Symmetric Difference: git rev-list --topo-order compare...HEAD
> Could you remind us here which of those operations will be using
> generation numbers after this patch series?

For this series, we are focused only on the --topo-order with a single 
start position. The versions that use a compare branch still use the old 
logic. In the future, I would like to use the new logic for these other 
modes.

>>   
>> +test_expect_success 'rev-list: basic topo-order' '
>> +	git rev-parse \
>> +		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
>> +		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
>> +		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
>> +		commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \
>> +		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
>> +		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
>> +	>expect &&
>> +	run_three_modes git rev-list --topo-order commit-6-6
>> +'
> I wonder if this test could be made easier to write and less
> error-prone, e.g. by generating it from ASCII-art graphics.
>
> But it is good enough.

I did lay out the branch names in a grid layout similar to the 
commit-graph layout. It's easier to see the purposeful layout in the 
comparison sections where some commits don't appear in the output.

Thanks,

-Stolee



* Re: [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic
  2018-10-16 22:36       ` [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
@ 2018-10-21 15:55         ` Jakub Narebski
  2018-10-22  1:12           ` Junio C Hamano
  0 siblings, 1 reply; 87+ messages in thread
From: Jakub Narebski @ 2018-10-21 15:55 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Jeff King, Junio C Hamano, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When running 'git rev-list --topo-order' and its kin, the topo_order
> setting in struct rev_info implies the limited setting. This means
> that the following things happen during prepare_revision_walk():
>
> * revs->limited implies we run limit_list() to walk the entire
>   reachable set. There are some short-cuts here, such as if we
>   perform a range query like 'git rev-list COMPARE..HEAD' and we
>   can stop limit_list() when all queued commits are uninteresting.

And if revs->topo_order is set, then (with the current implementation)
we need limit_list() to run in order to generate the commit_list of
commits to be topologically sorted, which is done by setting
revs->limited.

In short, with the current code revs->topo_order implies revs->limited.

>
> * revs->topo_order implies we run sort_in_topological_order(). See
>   the implementation of that method in commit.c. It implies that
>   the full set of commits to order is in the given commit_list.

So the current code uses a "generate the list of commits, then sort it"
approach...

>
> These two methods imply that a 'git rev-list --topo-order HEAD'
> command must walk the entire reachable set of commits _twice_ before
> returning a single result.
>
> If we have a commit-graph file with generation numbers computed, then
> there is a better way.

...instead of generating commits in topological order as you go.
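
For readers following along, the "as you go" idea can be sketched on a
toy graph.  This is only an illustration under simplifying assumptions,
not the code from the series: all toy_* names are made up, a linear scan
stands in for the priority queue, there is a single starting tip, at
most 64 commits, and the explore walk, UNINTERESTING handling, sort
orders and --first-parent are all left out.  It does keep the patch's
convention that an in-degree value of 0 means "not seen yet" and k + 1
means "k children not yet emitted".

```c
#include <stddef.h>

#define TOY_MAX 64

/* Hypothetical toy commit; parents are array indices, -1 means "none". */
struct toy_commit {
	int parent[2];
	unsigned gen;	/* generation number: 1 + max over the parents */
};

struct toy_walk {
	const struct toy_commit *c;
	size_t n;		/* number of commits, assumed <= TOY_MAX */
	unsigned min_gen;	/* in-degrees are final at or above this */
	int indegree[TOY_MAX];	/* 0 = unseen, k + 1 = k unemitted children */
	char queued[TOY_MAX];	/* waiting to have its parents counted */
	char counted[TOY_MAX];	/* parents already counted */
	int ready[TOY_MAX];	/* stack of commits with no unemitted child */
	size_t nr_ready;
};

/* Count parents of every queued commit whose generation is >= min_gen. */
void toy_count_to_depth(struct toy_walk *w)
{
	for (;;) {
		int best = -1, k;
		size_t i;

		/* pick the queued, uncounted commit of highest generation */
		for (i = 0; i < w->n; i++)
			if (w->queued[i] && !w->counted[i] &&
			    (best < 0 || w->c[i].gen > w->c[best].gen))
				best = (int)i;
		if (best < 0 || w->c[best].gen < w->min_gen)
			return;

		w->counted[best] = 1;
		for (k = 0; k < 2; k++) {
			int p = w->c[best].parent[k];
			if (p < 0)
				continue;
			w->indegree[p] = w->indegree[p] ? w->indegree[p] + 1 : 2;
			w->queued[p] = 1;
		}
	}
}

void toy_walk_init(struct toy_walk *w, const struct toy_commit *c,
		   size_t n, int tip)
{
	struct toy_walk zero = { 0 };

	*w = zero;
	w->c = c;
	w->n = n;
	w->min_gen = c[tip].gen;
	w->indegree[tip] = 1;	/* the tip has no unemitted children */
	w->queued[tip] = 1;
	toy_count_to_depth(w);
	w->ready[w->nr_ready++] = tip;
}

/* Return the next commit in topological order, or -1 when done. */
int toy_walk_next(struct toy_walk *w)
{
	int cur, k;

	if (!w->nr_ready)
		return -1;
	cur = w->ready[--w->nr_ready];

	for (k = 0; k < 2; k++) {
		int p = w->c[cur].parent[k];
		if (p < 0)
			continue;
		if (w->c[p].gen < w->min_gen) {
			/* deepen the in-degree walk only as far as needed */
			w->min_gen = w->c[p].gen;
			toy_count_to_depth(w);
		}
		if (--w->indegree[p] == 1)
			w->ready[w->nr_ready++] = p;
	}
	return cur;
}
```

The point of the sketch is that toy_walk_next() can return the tip after
counting in-degrees only down to the generation of its parents, because
every child of a commit has a strictly larger generation number.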

>                        This patch introduces some necessary logic
> redirection when we are in this situation.

O.K., this should make the main commit smaller.  All right.

> In v2.18.0, the commit-graph file contains zero-valued bytes in the
> positions where the generation number is stored in v2.19.0 and later.
> Thus, we use generation_numbers_enabled() to check if the commit-graph
> is available and has non-zero generation numbers.
>
> When setting revs->limited only because revs->topo_order is true,
> only do so if generation numbers are not available. There is no
> reason to use the new logic as it will behave similarly when all
> generation numbers are INFINITY or ZERO.

O.K., we will be using the new algorithm only when there actually are
some generation numbers.


> In prepare_revision_walk(), if we have revs->topo_order but not
> revs->limited, then we trigger the new logic. It breaks the logic
> into three pieces, to fit with the existing framework:

So if revs->limited is set (but not because revs->topo_order is set),
which means A..B queries, we will still be using the old algorithm.
All right, though I wonder if it could be improved in the future
(perhaps with the help of other graph labelling / indices than
generation numbers, maybe a positive-cut index).

Do you have an idea why there is no improvement with the new code in
this case?

> 1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
>    struct. We use the presence of this struct as a signal to use the
>    new methods during our walk. In this patch, this method simply
>    calls limit_list() and sort_in_topological_order(). In the future,
>    this method will set up a new data structure to perform that logic
>    in-line.
>
> 2. next_topo_commit() provides get_revision_1() with the next topo-
>    ordered commit in the list. Currently, this simply pops the commit
>    from revs->commits.
>
> 3. expand_topo_walk() provides get_revision_1() with a way to signal
>    walking beyond the latest commit. Currently, this calls
>    add_parents_to_list() exactly like the old logic.

So all three new functions should behave exactly like the old logic,
shouldn't they?

> While this commit presents method redirection for performing the
> exact same logic as before, it allows the next commit to focus only
> on the new logic.

All right, it's logical.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  revision.c | 42 ++++++++++++++++++++++++++++++++++++++----
>  revision.h |  4 ++++
>  2 files changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index e18bd530e4..2dcde8a8ac 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -25,6 +25,7 @@
>  #include "worktree.h"
>  #include "argv-array.h"
>  #include "commit-reach.h"
> +#include "commit-graph.h"
>  
>  volatile show_early_output_fn_t show_early_output;
>  
> @@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
>  	if (revs->diffopt.objfind)
>  		revs->simplify_history = 0;
>  
> -	if (revs->topo_order)
> +	if (revs->topo_order && !generation_numbers_enabled(the_repository))
>  		revs->limited = 1;

All right: with --topo-order and generation numbers present, we do not
force the revs->limited code path (i.e. the explicit, unwrapped call to
limit_list()).

So with --topo-order and A..B, we have revs->limited set, with
--topo-order and no generation numbers we have revs->limited set.

>  
>  	if (revs->prune_data.nr) {
> @@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid,
>  	return 0;
>  }
>  
> +struct topo_walk_info {};

Nice trick: the NULL-ness of the pointer to the (currently empty)
struct serves as a boolean flag denoting whether to use the new,
generation-number-based algorithm for topological sorting.
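
To spell the pattern out, here is a minimal standalone sketch; the
toy_* names are hypothetical and not taken from the patch.  One
assumption worth flagging: a struct with no members is not valid ISO C
(GCC and Clang accept it as an extension), so the sketch carries a
placeholder member.  It also uses calloc() plus abort() where Git would
use its own allocation wrapper.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the (for now contentless) walk state. */
struct toy_topo_walk_info {
	int placeholder;	/* keeps the struct valid in ISO C */
};

struct toy_rev_info {
	/* NULL: use the old walk; non-NULL: use the new topo walk */
	struct toy_topo_walk_info *topo_walk_info;
};

void toy_init_topo_walk(struct toy_rev_info *revs)
{
	revs->topo_walk_info = calloc(1, sizeof(*revs->topo_walk_info));
	if (!revs->topo_walk_info)
		abort();	/* sketch only: no graceful recovery */
}

/* The pointer doubles as the "new algorithm enabled" flag. */
int toy_uses_topo_walk(const struct toy_rev_info *revs)
{
	return revs->topo_walk_info != NULL;
}
```

The upside of this choice is that no separate flag can get out of sync
with the presence of the walk state.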

> +
> +static void init_topo_walk(struct rev_info *revs)
> +{
> +	struct topo_walk_info *info;

I guess this helper variable is here for the following patches, as we
could have done without it...

> +	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
> +	info = revs->topo_walk_info;
> +	memset(info, 0, sizeof(struct topo_walk_info));

...by using

  +	memset(revs->topo_walk_info, 0, sizeof(struct topo_walk_info));

> +
> +	limit_list(revs);
> +	sort_in_topological_order(&revs->commits, revs->sort_order);

This is not exactly identical to the old code, which has

	if (limit_list(revs) < 0)
		return -1;
	if (revs->topo_order)
		sort_in_topological_order(&revs->commits, revs->sort_order);

We know that init_topo_walk() would be invoked, as the name implies,
only when revs->topo_order is set, but do we know that limit_list()
would not return an error?

> +}
> +
> +static struct commit *next_topo_commit(struct rev_info *revs)
> +{
> +	return pop_commit(&revs->commits);
> +}

All right, identical to the old code.

> +
> +static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
> +{
> +	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
> +		if (!revs->ignore_missing_links)
> +			die("Failed to traverse parents of commit %s",
> +			    oid_to_hex(&commit->object.oid));
> +	}
> +}

All right, identical to the old code.

While at it, should this message be marked up for translation, or is it
something so low-level (and rare) that should be kept untranslated?  But
this would be better left for separate commit series, to not entangle
this one with spurious changes.

> +
>  int prepare_revision_walk(struct rev_info *revs)
>  {
>  	int i;
> @@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
>  		commit_list_sort_by_date(&revs->commits);
>  	if (revs->no_walk)
>  		return 0;
> -	if (revs->limited)
> +	if (revs->limited) {
>  		if (limit_list(revs) < 0)
>  			return -1;
> -	if (revs->topo_order)
> -		sort_in_topological_order(&revs->commits, revs->sort_order);
> +		if (revs->topo_order)
> +			sort_in_topological_order(&revs->commits, revs->sort_order);
> +	} else if (revs->topo_order)
> +		init_topo_walk(revs);

Previously, when revs->topo_order was set, Git called
sort_in_topological_order(), because revs->limited was always set to a
truthy value if revs->topo_order was true.

Now running sort_in_topological_order() is done only if revs->limited is
set (because of A..B); if it is not, init_topo_walk() is called.

All right, identical to the old code, up to checking the return value of
limit_list(), see previous comments.
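
Putting the setup_revisions() and prepare_revision_walk() hunks
together, my reading of the resulting dispatch is roughly the following.
This is a sketch with made-up toy_* names, where the
have_generation_numbers argument stands in for the result of
generation_numbers_enabled(); it is not the actual code.

```c
enum toy_walk_kind {
	TOY_WALK_STREAMING,		/* neither limited nor topo-order */
	TOY_WALK_LIMITED,		/* limit_list() only */
	TOY_WALK_LIMITED_SORTED,	/* limit_list() + sort_in_topological_order() */
	TOY_WALK_TOPO_INCREMENTAL	/* init_topo_walk() and friends */
};

/* Hypothetical, heavily reduced stand-in for struct rev_info. */
struct toy_revs {
	int topo_order;
	int limited;	/* may already be set, e.g. by an A..B range */
};

/* Mirrors the setup_revisions() hunk. */
void toy_setup(struct toy_revs *revs, int have_generation_numbers)
{
	if (revs->topo_order && !have_generation_numbers)
		revs->limited = 1;
}

/* Mirrors the prepare_revision_walk() hunk. */
enum toy_walk_kind toy_pick_walk(const struct toy_revs *revs)
{
	if (revs->limited)
		return revs->topo_order ? TOY_WALK_LIMITED_SORTED
					: TOY_WALK_LIMITED;
	if (revs->topo_order)
		return TOY_WALK_TOPO_INCREMENTAL;
	return TOY_WALK_STREAMING;
}
```

So the new path is taken only for --topo-order with generation numbers
available and without anything else having set revs->limited.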

>  	if (revs->line_level_traverse)
>  		line_log_filter(revs);
>  	if (revs->simplify_merges)
> @@ -3257,6 +3287,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
>  
>  		if (revs->reflog_info)
>  			commit = next_reflog_entry(revs->reflog_info);
> +		else if (revs->topo_walk_info)
> +			commit = next_topo_commit(revs);
>  		else
>  			commit = pop_commit(&revs->commits);

All right, identical to the old code.

> @@ -3278,6 +3310,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
>  
>  			if (revs->reflog_info)
>  				try_to_simplify_commit(revs, commit);
> +			else if (revs->topo_walk_info)
> +				expand_topo_walk(revs, commit);
>  			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
>  				if (!revs->ignore_missing_links)
>  					die("Failed to traverse parents of commit %s",

All right, identical to the old code.

> diff --git a/revision.h b/revision.h
> index 2b30ac270d..fd4154ff75 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -56,6 +56,8 @@ struct rev_cmdline_info {
>  #define REVISION_WALK_NO_WALK_SORTED 1
>  #define REVISION_WALK_NO_WALK_UNSORTED 2
>  
> +struct topo_walk_info;
> +
>  struct rev_info {
>  	/* Starting list */
>  	struct commit_list *commits;
> @@ -245,6 +247,8 @@ struct rev_info {
>  	const char *break_bar;
>  
>  	struct revision_sources *sources;
> +
> +	struct topo_walk_info *topo_walk_info;
>  };
>  
>  int ref_excluded(struct string_list *, const char *path);


* Re: [PATCH v4 5/7] commit/revisions: bookkeeping before refactoring
  2018-10-16 22:36       ` [PATCH v4 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
@ 2018-10-21 21:17         ` Jakub Narebski
  0 siblings, 0 replies; 87+ messages in thread
From: Jakub Narebski @ 2018-10-21 21:17 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Jeff King, Junio C Hamano, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> There are a few things that need to move around a little before
> making a big refactoring in the topo-order logic:
>
> 1. We need access to record_author_date() and
>    compare_commits_by_author_date() in revision.c. These are used
>    currently by sort_in_topological_order() in commit.c.
>
> 2. Moving these methods to commit.h requires adding the author_slab
>    definition to commit.h.

Those two changes are connected, and must be kept together.

> 3. The add_parents_to_list() method in revision.c performs logic
>    around the UNINTERESTING flag and other special cases depending
>    on the struct rev_info. Allow this method to ignore a NULL 'list'
>    parameter, as we will not be populating the list for our walk.
>    Also rename the method to the slightly more generic name
>    process_parents() to make clear that this method does more than
>    add to a list (and no list is required anymore).

But as far as I can understand, this change is independent, and it could
be put into a separate commit.

The change of function name to process_parents() and allowing for 'list'
parameter to be NULL are related, though.

>
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

No need to split, unless there is going to be a v5 anyway, in my opinion.

> ---
>  commit.c   | 11 +++++------
>  commit.h   |  8 ++++++++
>  revision.c | 18 ++++++++++--------
>  3 files changed, 23 insertions(+), 14 deletions(-)
>
> diff --git a/commit.c b/commit.c
> index d0f199e122..861a485e93 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -655,11 +655,10 @@ struct commit *pop_commit(struct commit_list **stack)
>  /* count number of children that have not been emitted */
>  define_commit_slab(indegree_slab, int);
>  
> -/* record author-date for each commit object */
> -define_commit_slab(author_date_slab, timestamp_t);
> +implement_shared_commit_slab(author_date_slab, timestamp_t);

I see that the comment got moved to the site with
define_shared_commit_slab(), i.e. to commit.h, instead of duplicating
it.  All right.

Sidenote: Ugh, small_caps preprocessor macros [trickery].

>  
> -static void record_author_date(struct author_date_slab *author_date,
> -			       struct commit *commit)
> +void record_author_date(struct author_date_slab *author_date,
> +			struct commit *commit)
>  {
>  	const char *buffer = get_commit_buffer(commit, NULL);
>  	struct ident_split ident;
> @@ -684,8 +683,8 @@ fail_exit:
>  	unuse_commit_buffer(commit, buffer);
>  }
>  
> -static int compare_commits_by_author_date(const void *a_, const void *b_,
> -					  void *cb_data)
> +int compare_commits_by_author_date(const void *a_, const void *b_,
> +				   void *cb_data)

All right, this is a straightforward change of record_author_date() and
compare_commits_by_author_date() from static (file-local) functions to
exported functions.

>  {
>  	const struct commit *a = a_, *b = b_;
>  	struct author_date_slab *author_date = cb_data;
> diff --git a/commit.h b/commit.h
> index 2b1a734388..977d397356 100644
> --- a/commit.h
> +++ b/commit.h
> @@ -8,6 +8,7 @@
>  #include "gpg-interface.h"
>  #include "string-list.h"
>  #include "pretty.h"
> +#include "commit-slab.h"
>  
>  #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
>  #define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
> @@ -328,6 +329,13 @@ extern int remove_signature(struct strbuf *buf);
>   */
>  extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
>  
> +/* record author-date for each commit object */
> +define_shared_commit_slab(author_date_slab, timestamp_t);

All right, this is needed for record_author_date() function, which is
now exported.

> +
> +void record_author_date(struct author_date_slab *author_date,
> +			struct commit *commit);
> +
> +int compare_commits_by_author_date(const void *a_, const void *b_, void *unused);

O.K., this is simply exporting previously static functions.

>  int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
>  int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
>  
> diff --git a/revision.c b/revision.c
> index 2dcde8a8ac..36458265a0 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -768,8 +768,8 @@ static void commit_list_insert_by_date_cached(struct commit *p, struct commit_li
>  		*cache = new_entry;
>  }
>  
> -static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
> -		    struct commit_list **list, struct commit_list **cache_ptr)
> +static int process_parents(struct rev_info *revs, struct commit *commit,
> +			   struct commit_list **list, struct commit_list **cache_ptr)

All right, straightforward rename.

>  {
>  	struct commit_list *parent = commit->parents;
>  	unsigned left_flag;
> @@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
>  			if (p->object.flags & SEEN)
>  				continue;
>  			p->object.flags |= SEEN;
> -			commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
> +			if (list)
> +				commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
>  		}
>  		return 0;
>  	}
> @@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
>  		p->object.flags |= left_flag;
>  		if (!(p->object.flags & SEEN)) {
>  			p->object.flags |= SEEN;
> -			commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
> +			if (list)
> +				commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);

All right, both of those are about allowing the 'list' parameter to be
NULL, and invoking commit_list_insert_by_date_cached() only if it's not NULL.
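
In isolation, the "optional output list" convention looks something like
the sketch below.  The names (toy_process_parents and so on) and the
fixed-size list are hypothetical simplifications; the real function does
considerably more flag handling than this.

```c
#include <stddef.h>

#define TOY_SEEN 1u
#define TOY_LIST_MAX 8

struct toy_commit {
	unsigned flags;
	struct toy_commit *parent[2];	/* NULL means "no parent" */
};

struct toy_list {
	struct toy_commit *item[TOY_LIST_MAX];
	size_t nr;
};

/*
 * Mark the not-yet-seen parents of 'commit' as SEEN.  If 'list' is
 * non-NULL, also append them to it; a NULL 'list' means the caller
 * only wants the flag updates.  Returns the number of parents that
 * were newly marked.
 */
int toy_process_parents(struct toy_commit *commit, struct toy_list *list)
{
	int newly_seen = 0, k;

	for (k = 0; k < 2; k++) {
		struct toy_commit *p = commit->parent[k];

		if (!p || (p->flags & TOY_SEEN))
			continue;
		p->flags |= TOY_SEEN;
		newly_seen++;
		if (list && list->nr < TOY_LIST_MAX)
			list->item[list->nr++] = p;
	}
	return newly_seen;
}
```

With this shape, the flag side effects happen in either case, and only
the list insertion is skipped for callers that pass NULL.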

>  		}
>  		if (revs->first_parent_only)
>  			break;
> @@ -1091,7 +1093,7 @@ static int limit_list(struct rev_info *revs)
>  
>  		if (revs->max_age != -1 && (commit->date < revs->max_age))
>  			obj->flags |= UNINTERESTING;
> -		if (add_parents_to_list(revs, commit, &list, NULL) < 0)
> +		if (process_parents(revs, commit, &list, NULL) < 0)
>  			return -1;
>  		if (obj->flags & UNINTERESTING) {
>  			mark_parents_uninteresting(commit);
> @@ -2913,7 +2915,7 @@ static struct commit *next_topo_commit(struct rev_info *revs)
>  
>  static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
>  {
> -	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
> +	if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
>  		if (!revs->ignore_missing_links)
>  			die("Failed to traverse parents of commit %s",
>  			    oid_to_hex(&commit->object.oid));
> @@ -2979,7 +2981,7 @@ static enum rewrite_result rewrite_one(struct rev_info *revs, struct commit **pp
>  	for (;;) {
>  		struct commit *p = *pp;
>  		if (!revs->limited)
> -			if (add_parents_to_list(revs, p, &revs->commits, &cache) < 0)
> +			if (process_parents(revs, p, &revs->commits, &cache) < 0)
>  				return rewrite_one_error;
>  		if (p->object.flags & UNINTERESTING)
>  			return rewrite_one_ok;
> @@ -3312,7 +3314,7 @@ static struct commit *get_revision_1(struct rev_info *revs)
>  				try_to_simplify_commit(revs, commit);
>  			else if (revs->topo_walk_info)
>  				expand_topo_walk(revs, commit);
> -			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
> +			else if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
>  				if (!revs->ignore_missing_links)
>  					die("Failed to traverse parents of commit %s",
>  						oid_to_hex(&commit->object.oid));

All of those just update the call sites for the function
rename.

(I wonder if such a simple refactoring could have been done via a Coccinelle
patch).


Anyway, looks good to me.
--
Jakub Narębski


* Re: [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic
  2018-10-21 15:55         ` Jakub Narebski
@ 2018-10-22  1:12           ` Junio C Hamano
  2018-10-22  1:51             ` Derrick Stolee
  0 siblings, 1 reply; 87+ messages in thread
From: Junio C Hamano @ 2018-10-22  1:12 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Derrick Stolee via GitGitGadget, git, Jeff King, Derrick Stolee

Jakub Narebski <jnareb@gmail.com> writes:

> So if revs->limited is set (but not because revs->topo_order is set),
> which means A..B queries, we will be still using the old algorithm.
> All right, though I wonder if it could be improved in the future
> (perhaps with the help of other graph labelling / indices than
> generation numbers, maybe a positive-cut index).
>
> Do you have an idea why there is no improvement with the new code in
> this case?

I didn't get the impression that it would not be possible to improve
the "--topo A..B" case by using generation numbers from this series.
Isn't it just because the necessary code has not been written yet?
In addition to what is needed for "--topo P1 P2 P3..." (all
positive), limited walk needs to notice the bottom boundary and stop
traversal.  Having generation numbers would make it slightly easier
than without, as you know that a positive commit you have will not
be marked UNINTERESTING due to a negative commit whose ancestors
have not been explored, as long as that negative commit has a higher
generation number.  But you still need to adjust the traversal logic
to properly terminate upon hitting UNINTERESTING one, and also
propagate the bit down the history, which is not needed at all if
you only want to support the "positive only" case.



* Re: [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic
  2018-10-22  1:12           ` Junio C Hamano
@ 2018-10-22  1:51             ` Derrick Stolee
  2018-10-22  1:55               ` [RFC PATCH] revision.c: use new algorithm in A..B case Derrick Stolee
  2018-10-25  8:28               ` [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic Junio C Hamano
  0 siblings, 2 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-10-22  1:51 UTC (permalink / raw)
  To: Junio C Hamano, Jakub Narebski
  Cc: Derrick Stolee via GitGitGadget, git, Jeff King, Derrick Stolee

On 10/21/2018 9:12 PM, Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
>
>> So if revs->limited is set (but not because revs->topo_order is set),
>> which means A..B queries, we will be still using the old algorithm.
>> All right, though I wonder if it could be improved in the future
>> (perhaps with the help of other graph labelling / indices than
>> generation numbers, maybe a positive-cut index).
>>
>> Do you have an idea why there is no improvement with the new code in
>> this case?
> I didn't get the impression that it would not be possible to improve
> the "--topo A..B" case by using generation numbers from this series.
> Isn't it just because the necessary code has not been written yet?
> In addition to what is needed for "--topo P1 P2 P3..." (all
> positive), limited walk needs to notice the bottom boundary and stop
> traversal.  Having generation numbers would make it slightly easier
> than without, as you know that a positive commit you have will not
> be marked UNINTERESTING due to a negative commit whose ancestors
> have not been explored, as long as that negative commit has a higher
> generation number.  But you still need to adjust the traversal logic
> to properly terminate upon hitting UNINTERESTING one, and also
> propagate the bit down the history, which is not needed at all if
> you only want to support the "positive only" case.

Actually, the code has been written, but the problem is the same as the 
performance issue when I made paint_down_to_common() use generation 
numbers: the algorithm for deciding what is in the set "reachable from A 
but not reachable from B" uses commit-date order as a heuristic to avoid 
walking the entire graph. Yes, the revision parameters specify "limited" 
in order to call "limit_list()", but it uses the same algorithm to 
determine the reachable set difference.

You can test this yourself! Run the following two commands in the Git 
repository using v2.19.1:

     time git log --topo-order -10 master >/dev/null

     time git log --topo-order -10 maint..master >/dev/null

I get 0.39s for the first call and 0.01s for the second. (Note: I 
specified "-10" to ensure we are only writing 10 commits and the output 
size does not factor into the time.) This is because the first walks the 
entire history, while the second uses the heuristic walk to identify a 
much smaller subgraph that the topo-order algorithm uses.

Just as before, by using this algorithm for the B..A case, we are adding 
an extra restriction on the algorithm: always be correct. This results 
in us walking a larger set (everything reachable from B or A with 
generation number at least the smallest generation of a commit reachable 
from only one).

I believe this can be handled by using a smarter generation number (one 
that relies on commit date as a heuristic, but still has enough 
information to guarantee topological relationships), and I've already 
started testing a few of these directions. It is possible now that we 
have concrete graph algorithms to use on real repositories. I hope to 
share a report on my findings in a couple weeks. I'll include how using 
this algorithm compares to the old algorithm in the B..A case.

Thanks,

-Stolee



* [RFC PATCH] revision.c: use new algorithm in A..B case
  2018-10-22  1:51             ` Derrick Stolee
@ 2018-10-22  1:55               ` Derrick Stolee
  2018-10-25  8:28               ` [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic Junio C Hamano
  1 sibling, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-10-22  1:55 UTC (permalink / raw)
  To: git; +Cc: dstolee, jnareb, gitster

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---

I just wanted to mention that in order to use the new logic for 'git log
--topo-order A..B', we just need the following patch. It removes an extra
place that sets 'revs->limited' to 1, which triggers the old logic.

You can use this for comparison purposes, but I'm not ready to do this
until more performance testing is done for this case. Since these
comparison commands are already pretty fast when the diff is small,
there is less urgency to improve performance here.

Thanks,
-Stolee

 revision.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/revision.c b/revision.c
index 472f3994e3..8e5656f7b4 100644
--- a/revision.c
+++ b/revision.c
@@ -278,10 +278,8 @@ static struct commit *handle_commit(struct rev_info *revs,
 
 		if (parse_commit(commit) < 0)
 			die("unable to parse commit %s", name);
-		if (flags & UNINTERESTING) {
+		if (flags & UNINTERESTING)
 			mark_parents_uninteresting(commit);
-			revs->limited = 1;
-		}
 		if (revs->sources) {
 			char **slot = revision_sources_at(revs->sources, commit);
 
-- 
2.19.1



* Re: [PATCH v4 6/7] revision.c: generation-based topo-order algorithm
  2018-10-16 22:36       ` [PATCH v4 6/7] revision.c: generation-based topo-order algorithm Derrick Stolee via GitGitGadget
@ 2018-10-22 13:37         ` Jakub Narebski
  2018-10-23 13:54           ` Derrick Stolee
  0 siblings, 1 reply; 87+ messages in thread
From: Jakub Narebski @ 2018-10-22 13:37 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Jeff King, Junio C Hamano, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> The current --topo-order algorithm requires walking all
> reachable commits up front, topo-sorting them, all before
> outputting the first value. This patch introduces a new
> algorithm which uses stored generation numbers to
> incrementally walk in topo-order, outputting commits as
> we go. This can dramatically reduce the computation time
> to write a fixed number of commits, such as when limiting
> with "-n <N>" or filling the first page of a pager.
>
> When running a command like 'git rev-list --topo-order HEAD',
> Git performed the following steps:
>
> 1. Run limit_list(), which parses all reachable commits,
>    adds them to a linked list, and distributes UNINTERESTING
>    flags. If all unprocessed commits are UNINTERESTING, then
>    it may terminate without walking all reachable commits.
>    This does not occur if we do not specify UNINTERESTING
>    commits.
>
> 2. Run sort_in_topological_order(), which is an implementation
>    of Kahn's algorithm. It first iterates through the entire
>    set of important commits and computes the in-degree of each
>    (plus one, as we use 'zero' as a special value here). Then,
>    we walk the commits in priority order, adding them to the
>    priority queue if and only if their in-degree is one. As
>    we remove commits from this priority queue, we decrement the
>    in-degree of their parents.

Because in-degree has very specific defined meaning of number of
children, i.e. the number of _incoming_ edges, I would say "if and only
if their in-degree-plus-one is one".  It is more exact, even if it looks
a bit funny.

> 3. While we are peeling commits for output, get_revision_1()
>    uses pop_commit on the full list of commits computed by
>    sort_in_topological_order().

All right, so those are separate steps (separate walks): prepare and
parse commits, topologically sort list of commits from previous step,
output sorted list of commits from previous step.

> In the new algorithm, these three steps correspond to three
> different commit walks. We run these walks simultaneously,
> and advance each only as far as necessary to satisfy the
> requirements of the 'higher order' walk.

What does 'higher order' walk mean: steps 3, 2, 1, in this order,
i.e. output being the highest order, or something different?

Sidenote: the new algorithm looks a bit like Unix pipeline, where each
step of pipeline does not output much more than next step needs /
requests.

>                                          We know when we can
> pause each walk by using generation numbers from the commit-
> graph feature.

Do I understand it correctly that this is mainly used in Kahn's
algorithm to find out through the negative-cut index of generation
number which commits in the to-be-sorted list cannot have an in-degree
of zero (or otherwise cannot be the next commit to be shown in the output)?

> Recall that the generation number of a commit satisfies:
>
> * If the commit has at least one parent, then the generation
>   number is one more than the maximum generation number among
>   its parents.
>
> * If the commit has no parent, then the generation number is one.
>
> There are two special generation numbers:
>
> * GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
>   indicates that the commit is not stored in the commit-graph and
>   the generation number was not previously calculated.
>
> * GENERATION_NUMBER_ZERO: this value (0) is a special indicator
>   to say that the commit-graph was generated by a version of Git
>   that does not compute generation numbers (such as v2.18.0).
>
> Since we use generation_numbers_enabled() before using the new
> algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
> However, the existence of GENERATION_NUMBER_INFINITY implies the
> following weaker statement than the usual we expect from
> generation numbers:
>
>     If A and B are commits with generation numbers gen(A) and
>     gen(B) and gen(A) < gen(B), then A cannot reach B.
>
> Thus, we will walk in each of our stages until the "maximum
> unexpanded generation number" is strictly lower than the
> generation number of a commit we are about to use.

And this "maximum unexpanded generation number" must be greater or equal
to 1, thanks to assuming generation_numbers_enabled().
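
The generation-number rules recalled above, and the reachability
statement derived from them, can be sketched on a toy DAG. This is only
an illustration: `struct toy_commit`, `compute_generation` and
`can_reach` are hypothetical names, parents are array indices rather
than Git objects, and the special INFINITY/ZERO values are left out.

```c
#include <stdint.h>

#define MAX_PARENTS 2

/* Hypothetical toy commit: parents are indices into an array, -1 = none. */
struct toy_commit {
	int parents[MAX_PARENTS];
	uint32_t generation;	/* 0 = not yet computed */
};

/* gen(c) = 1 for a root, else 1 + max(gen(parent)). */
static uint32_t compute_generation(struct toy_commit *commits, int idx)
{
	uint32_t max_parent = 0;
	int i;

	if (commits[idx].generation)
		return commits[idx].generation;
	for (i = 0; i < MAX_PARENTS; i++) {
		int p = commits[idx].parents[i];
		if (p >= 0) {
			uint32_t g = compute_generation(commits, p);
			if (g > max_parent)
				max_parent = g;
		}
	}
	commits[idx].generation = max_parent + 1;
	return commits[idx].generation;
}

/*
 * Nonzero if commit 'from' can reach commit 'to' by following parents.
 * Assumes generations were already computed for all reachable commits.
 */
static int can_reach(const struct toy_commit *commits, int from, int to)
{
	int i;

	if (from == to)
		return 1;
	/* Negative cut: a lower generation can never reach a higher one. */
	if (commits[from].generation < commits[to].generation)
		return 0;
	for (i = 0; i < MAX_PARENTS; i++) {
		int p = commits[from].parents[i];
		if (p >= 0 && can_reach(commits, p, to))
			return 1;
	}
	return 0;
}
```

Since every computed value is at least one, a cutoff derived from real
commits never drops to the reserved value zero.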


Let's start by writing down the original version of Kahn's algorithm
(which is not the only way to calculate topological ordering; another
method is to use depth-first searches).

  L ← Empty list that will contain the sorted elements
  S ← Set of all nodes with no incoming edge
  while S is non-empty do
      remove a node n from S
      add n to tail of L
      for each node m with an edge e from n to m do
          remove edge e from the graph
          if m has no other incoming edges then
              insert m into S
  if graph has edges then
      return error   _(graph has at least one cycle)_
  else 
      return L       _(a topologically sorted order)_

In the case of Git, we display only commits reachable from the starting
commits, so only those starting commits can have no incoming edge, by
the definition of the reachable commit (note that some starting commits
can be reachable from other starting commits).

Note that in Git by construction we cannot have cycles in the objects
graph, and that 'remove edge e [= n -> m] from the graph' simply means
decreasing the [effective] in-degree of node m.
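
As a rough C rendering of the pseudocode above, under the assumption of
a toy graph where each node stores at most two parent indices (the names
`struct toy_graph` and `kahn_sort` are made up for this sketch), "remove
edge e" becomes a decrement of the parent's in-degree:

```c
#include <string.h>

#define MAX_NODES 16
#define MAX_PARENTS 2

/* Hypothetical toy graph: parents[i][k] is a parent index of node i, or -1. */
struct toy_graph {
	int nr;
	int parents[MAX_NODES][MAX_PARENTS];
};

/*
 * Kahn's algorithm. Writes a topological order (children before parents)
 * into 'out' and returns the number of nodes written; a result smaller
 * than g->nr would mean the graph has a cycle.
 */
static int kahn_sort(const struct toy_graph *g, int *out)
{
	int indegree[MAX_NODES];
	int stack[MAX_NODES];	/* the set S, used as a LIFO stack */
	int top = 0, written = 0;
	int i, k;

	/* In-degree of a node = number of children pointing at it. */
	memset(indegree, 0, sizeof(indegree));
	for (i = 0; i < g->nr; i++)
		for (k = 0; k < MAX_PARENTS; k++)
			if (g->parents[i][k] >= 0)
				indegree[g->parents[i][k]]++;

	for (i = 0; i < g->nr; i++)
		if (!indegree[i])
			stack[top++] = i;

	while (top) {
		int n = stack[--top];
		out[written++] = n;
		for (k = 0; k < MAX_PARENTS; k++) {
			int m = g->parents[n][k];
			/* "Remove edge n -> m" by decrementing m's in-degree. */
			if (m >= 0 && !--indegree[m])
				stack[top++] = m;
		}
	}
	return written;
}
```

Using a stack for S is one possible choice; swapping it for a priority
queue changes which valid topological order comes out, not its validity.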

> The walks are as follows:
>
> 1. EXPLORE: using the explore_queue priority queue (ordered by
>    maximizing the generation number), parse each reachable
>    commit until all commits in the queue have generation
>    number strictly lower than needed. During this walk, update
>    the UNINTERESTING flags as necessary.

All right, that looks sensible.  Parse commits and update the
UNINTERESTING flags only up to what might be needed.

Though I would add, for each walk, what the post-conditions are, i.e. what
requirements the list of returned commits fulfills.  In the case of the
EXPLORE walk it would be that the commits are in the "reachable from start
commits" set, parsed, and not UNINTERESTING.  And that all such commits
with generation number greater than or equal to the needed one are there
(and their parents).


Wouldn't this though make the output always start at the commit with
maximal generation number (such commit or commits would need to have an
in-degree of zero, i.e. no incoming edges), instead of whatever order is
requested (if date order contradicts generation number order) or in the
command line arguments order?

> 2. INDEGREE: using the indegree_queue priority queue (ordered
>    by maximizing the generation number), add one to the in-
>    degree of each parent for each commit that is walked. Since
>    we walk in order of decreasing generation number, we know
>    that discovering an in-degree value of 0 means the value for
>    that commit was not initialized, so should be initialized to
>    two. (Recall that in-degree value "1" is what we use to say a
>    commit is ready for output.)

The post-condition is that all returned commits have their in-degree
plus one calculated.

Mixing actual in-degree (number of incoming edges, zero means no
incoming edge and candidate for the next commit in topological order),
and details of implementation (using value of zero for uninitialized,
and thus actually storing in-degree plus one) makes this description a
bit hard to follow.
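
One way to keep the two notions apart is to name the encoding
explicitly. The sketch below is hypothetical (the constants and helper
names are not from the patch): the stored value is the real in-degree
plus one, and zero is reserved for "not initialized yet".

```c
/*
 * Hypothetical encoding mirroring the description: the stored value is
 * "in-degree plus one", and 0 means "not initialized yet".
 */
enum { INDEGREE_UNSET = 0, INDEGREE_READY = 1 };

/* Record one more incoming edge (one more child) for a commit. */
static void add_incoming_edge(int *stored)
{
	if (*stored == INDEGREE_UNSET)
		*stored = INDEGREE_READY + 1;	/* first child seen */
	else
		(*stored)++;
}

/*
 * Remove one incoming edge; returns 1 when the commit becomes ready.
 * Assumes the stored value was initialized before.
 */
static int remove_incoming_edge(int *stored)
{
	(*stored)--;
	return *stored == INDEGREE_READY;
}

/* Translate the stored value back to the real in-degree, or -1 if unset. */
static int real_indegree(int stored)
{
	return stored == INDEGREE_UNSET ? -1 : stored - 1;
}
```

With names like these, "in-degree of one" in the description would read
as INDEGREE_READY, i.e. a real in-degree of zero.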

Note that the additional complication is that if commits have generation
number INFINITY, then we cannot say anything about reachability among
commits with this special generation number.  That means that until we
process all commits with generation number INFINITY, we don't know which
ones have no incoming edges (a real in-degree of zero).  If they are not
INFINITY, the stronger version of the reachability condition for
generation number holds, and thus we know that if we pop the commit and
it has uninitialized in-degree and generation number not INFINITY, then
it has no-incoming edges (in-degree of zero).

That does not matter much, but for the fact that before outputting list
of commits / returning from the function we need to ensure that all out
commits have defined in-degree value.  All that have in-degree undefined
when indegree_queue is empty, because of reachability and generation
numbers constraints, actually have an in-degree of zero (no incoming
edges).

>                                  As we iterate the parents of a
>    commit during this walk, ensure the EXPLORE walk has walked
>    beyond their generation numbers.

All right, looks sensible from the point of view of trying to do
streaming of sorted commits.

>
> 3. TOPO: using the topo_queue priority queue (ordered based on
>    the sort_order given, which could be commit-date, author-
>    date, or typical topo-order which treats the queue as a LIFO
>    stack), remove a commit from the queue and decrement the
>    in-degree of each parent. If a parent has an in-degree of
>    one, then we add it to the topo_queue. Before we decrement
>    the in-degree, however, ensure the INDEGREE walk has walked
>    beyond that generation number.

This description misses an important constraint, namely that all commits
in the topo_queue have a real in-degree of zero, i.e. no incoming
edges.  The topo_queue is the set S in Kahn's algorithm.

Also, we need to know how to populate the topo_queue at the start with at
least one commit.  How is it initially populated?  That is a very
important question.

>
> The implementations of these walks are in the following methods:
>
> * explore_walk_step and explore_to_depth
> * indegree_walk_step and compute_indegrees_to_depth
> * next_topo_commit and expand_topo_walk

All right, on one hand: good calling convention.  On the other hand:
why the difference in naming?

>
> These methods have some patterns that may seem strange at first,
> but they are probably carry-overs from their equivalents in
> limit_list and sort_in_topological_order.
>
> One thing that is missing from this implementation is a proper
> way to stop walking when the entire queue is UNINTERESTING, so
> this implementation is not enabled by comparisions, such as in
> 'git rev-list --topo-order A..B'. This can be updated in the
> future.

All right, let's start with the easier step.

>
> In my local testing, I used the following Git commands on the
> Linux repository in three modes: HEAD~1 with no commit-graph,
> HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
> allows comparing the benefits we get from parsing commits from
> the commit-graph and then again the benefits we get by
> restricting the set of commits we walk.
>
> Test: git rev-list --topo-order -100 HEAD
> HEAD~1, no commit-graph: 6.80 s
> HEAD~1, w/ commit-graph: 0.77 s
>   HEAD, w/ commit-graph: 0.02 s
>
> Test: git rev-list --topo-order -100 HEAD -- tools
> HEAD~1, no commit-graph: 9.63 s
> HEAD~1, w/ commit-graph: 6.06 s
>   HEAD, w/ commit-graph: 0.06 s
>
> This speedup is due to a few things. First, the new generation-
> number-enabled algorithm walks commits on order of the number of
> results output (subject to some branching structure expectations).
> Since we limit to 100 results, we are running a query similar to
> filling a single page of results. Second, when specifying a path,
> we must parse the root tree object for each commit we walk. The
> previous benefits from the commit-graph are entirely from reading
> the commit-graph instead of parsing commits. Since we need to
> parse trees for the same number of commits as before, we slow
> down significantly from the non-path-based query.
>
> For the test above, I specifically selected a path that is changed
> frequently, including by merge commits. A less-frequently-changed
> path (such as 'README') has similar end-to-end time since we need
> to walk the same number of commits (before determining we do not
> have 100 hits). However, get the benefit that the output is
> presented to the user as it is discovered, much the same as a
> normal 'git log' command (no '--topo-order'). This is an improved
> user experience, even if the command has the same runtime.

First, do I understand it correctly that in the first case the gains from
the new algorithm are so slim because with the commit-graph file and no
path limiting we don't hit the repository anyway; we walk fewer commits,
but reading commit data from the commit-graph file is fast?

Second, I wonder if there is some easy way to perform automatic latency
tests, i.e. how fast does Git show the first page of output...

> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  object.h   |   4 +-
>  revision.c | 199 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  revision.h |   2 +
>  3 files changed, 197 insertions(+), 8 deletions(-)

Daunting change to review.

> diff --git a/object.h b/object.h
> index 0feb90ae61..796792cb32 100644
> --- a/object.h
> +++ b/object.h
> @@ -59,7 +59,7 @@ struct object_array {
>  
>  /*
>   * object flag allocation:
> - * revision.h:               0---------10                              2526
> + * revision.h:               0---------10                              25----28
>   * fetch-pack.c:             01
>   * negotiator/default.c:       2--5
>   * walker.c:                 0-2
> @@ -78,7 +78,7 @@ struct object_array {
>   * builtin/show-branch.c:    0-------------------------------------------26
>   * builtin/unpack-objects.c:                                 2021
>   */
> -#define FLAG_BITS  27
> +#define FLAG_BITS  29

What are those two additional object flags needed for revision.h /
revision.c after this change?

Ah, those are TOPO_WALK_EXPLORED and TOPO_WALK_INDEGREE.
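
A small sketch of why FLAG_BITS has to grow. The bit positions below are
an assumption inferred from the "25----28" allocation comment (the two
new flags taking bits 27 and 28); the helper `fits_in_flag_bits` is
hypothetical.

```c
/* Assumed bit positions, inferred from the "25----28" allocation comment. */
#define TOPO_WALK_EXPLORED (1u << 27)
#define TOPO_WALK_INDEGREE (1u << 28)

/* Nonzero when every bit in 'mask' fits in a 'bits'-wide flags field. */
static int fits_in_flag_bits(unsigned mask, unsigned bits)
{
	return (mask >> bits) == 0;
}
```

Under that assumption the two flags fit in a 29-bit field but not in the
previous 27-bit one.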

>  
>  /*
>   * The object type is stored in 3 bits.
> diff --git a/revision.c b/revision.c
> index 36458265a0..472f3994e3 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -26,6 +26,7 @@
>  #include "argv-array.h"
>  #include "commit-reach.h"
>  #include "commit-graph.h"
> +#include "prio-queue.h"
>  
>  volatile show_early_output_fn_t show_early_output;
>  
> @@ -2895,30 +2896,216 @@ static int mark_uninteresting(const struct object_id *oid,
>  	return 0;
>  }
>  
> -struct topo_walk_info {};
> +define_commit_slab(indegree_slab, int);
> +
> +struct topo_walk_info {
> +	uint32_t min_generation;
> +	struct prio_queue explore_queue;
> +	struct prio_queue indegree_queue;
> +	struct prio_queue topo_queue;
> +	struct indegree_slab indegree;

All right.

> +	struct author_date_slab author_date;

Why this slab is needed in topo_walk_info struct?

> +};
> +
> +static inline void test_flag_and_insert(struct prio_queue *q, struct commit *c, int flag)
> +{
> +	if (c->object.flags & flag)
> +		return;
> +
> +	c->object.flags |= flag;
> +	prio_queue_put(q, c);
> +}

This is an independent change, though I see that it is quite specific
(as opposed to quite generic prio_queue_peek() operation added earlier
in this series), so it does not make much sense as standalone change.

It inserts commit into priority queue only if it didn't have flags set,
and sets the flag (so we won't add it to the queue again, not without
unsetting the flag), am I correct?
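
A toy version of that behaviour, with a fixed-size array standing in for
the priority queue. All names here (`struct fake_commit`,
`struct fake_queue`, `flag_and_insert`, `WALK_FLAG`) are hypothetical,
and the return value is added only to make the effect observable.

```c
#include <stddef.h>

#define QUEUE_CAP 8
#define WALK_FLAG (1u << 0)	/* hypothetical stand-in for TOPO_WALK_EXPLORED */

struct fake_commit { unsigned flags; };
struct fake_queue { struct fake_commit *items[QUEUE_CAP]; size_t nr; };

/* Queue 'c' unless 'flag' shows it was queued before; returns 1 if queued. */
static int flag_and_insert(struct fake_queue *q, struct fake_commit *c,
			   unsigned flag)
{
	if (c->flags & flag)
		return 0;
	if (q->nr == QUEUE_CAP)
		return 0;	/* toy queue is full; a real queue would grow */
	c->flags |= flag;
	q->items[q->nr++] = c;
	return 1;
}
```

Clearing the flag by hand makes the same commit eligible again, which
matches the "not without unsetting the flag" reading.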

> +
> +static void explore_walk_step(struct rev_info *revs)
> +{
> +	struct topo_walk_info *info = revs->topo_walk_info;
> +	struct commit_list *p;
> +	struct commit *c = prio_queue_get(&info->explore_queue);
> +
> +	if (!c)
> +		return;
> +
> +	if (parse_commit_gently(c, 1) < 0)
> +		return;

All right, all commits taken out of explore_queue are parsed.  This is
used to ensure that all commits with generation number greater than some
set cutoff are parsed.

> +
> +	if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
> +		record_author_date(&info->author_date, c);
> +
> +	if (revs->max_age != -1 && (c->date < revs->max_age))
> +		c->object.flags |= UNINTERESTING;

These two conditionals look a bit strange to me; they are hardcoded
specific cases of query.  But that might be just me...

> +
> +	if (process_parents(revs, c, NULL, NULL) < 0)
> +		return;

I see that we are using process_parents() (called add_parents_to_list()
before patch 5/7) with a NULL 'list' parameter for the first
time.

> +
> +	if (c->object.flags & UNINTERESTING)
> +		mark_parents_uninteresting(c);
> +
> +	for (p = c->parents; p; p = p->next)
> +		test_flag_and_insert(&info->explore_queue, p->item, TOPO_WALK_EXPLORED);

Do we need to insert parents to the queue even if they were marked
UNINTERESTING?

I guess that we use test_flag_and_insert() instead of prio_queue_put()
to avoid duplicate entries in the queue.  I think the queue is initially
populated with the starting commits, but those need not be
unreachable from each other, and walking down the parents we can encounter
a starting commit already in the queue.  Am I correct?

> +}

Let's compare this new function with the limit_list() used in the old
algorithm for --topo-order walk (and even now for A..B walks), or to be
more exact with the contents of the while loop.

1. limit_list() doesn't have the check if the commit exists, and
   does not use parse_commit_gently().  Why the difference, i.e. where
   revs->commits gets parsed, and why explore_walk_step() cannot rely on
   this?

   I get that the goal is to not have to parse commits if not needed, so
   it is good that it is moved to explore_walk_step().

2. limit_list() is also missing running record_author_date() when
   sorting output by author date.  I guess that explore_walk_step()
   needs this because commit-graph file does not include this
   information.

3. Handling of revs->max_age by marking commit as UNINTERESTING if
   needed is the same in limit_list() and in explore_walk_step().

4. limit_list() but not explore_walk_step() handles revs->min_age near
   the end of the loop by terminating the loop.  I guess for this case
   we have revs->limited set, and we use the old algorithm, don't we?

   Something to remember when adding A..B handling to new algorithm.

5. add_parents_to_list() / process_parents() is nearly the same in
   limit_list() and in explore_walk_step(), but for the fact that the
   new function doesn't use 'list' parameter.

6. Both limit_list() and explore_walk_step() use
   mark_parents_uninteresting() on uninteresting commits.

   However, limit_list() breaks out of the loop, and uses
   interesting_cache with slop.  I guess that those two facts are
   connected, right?

7. Then explore_walk_step() inserts parents into the priority queue if
   they are not present there already, with test_flag_and_insert(),
   whose rough equivalent in limit_list() would be using
   commit_list_insert().

8. limit_list() has also some code for show_early_output, which I guess
   explore_walk_step() does not need to handle.

> +
> +static void explore_to_depth(struct rev_info *revs,
> +			     uint32_t gen)
> +{
> +	struct topo_walk_info *info = revs->topo_walk_info;
> +	struct commit *c;
> +	while ((c = prio_queue_peek(&info->explore_queue)) &&
> +	       c->generation >= gen)

I have originally thought that if we extract prio_queue_get() and
test_flag_and_insert() / prio_queue_put() out of explore_walk_step() and
put it into this loop, i.e. into the calling function, we could avoid
code duplication between explore_walk_step() and limit_list()... but I
guess that is impossible anyway.

> +		explore_walk_step(revs);
> +}

Nice, tight, and easy to understand function.  Though perhaps 'gen'
could be called 'gen_cutoff' or 'min_gen', or 'min_gen_cutoff'.

> +
> +static void indegree_walk_step(struct rev_info *revs)
> +{
> +	struct commit_list *p;
> +	struct topo_walk_info *info = revs->topo_walk_info;
> +	struct commit *c = prio_queue_get(&info->indegree_queue);
> +
> +	if (!c)
> +		return;
> +
> +	if (parse_commit_gently(c, 1) < 0)
> +		return;

All right, we need to parse commit 'c' to have its generation number,
and we need to do the same in explore_walk_step() because we walk
possibly unparsed parents.

> +
> +	explore_to_depth(revs, c->generation);

If we walk everything up to current commit depth, then we have walked
all commits that can affect in-degree of current commit.  Good.

> +
> +	if (parse_commit_gently(c, 1) < 0)
> +		return;

Why do we parse the same commit again???

> +
> +	for (p = c->parents; p; p = p->next) {
> +		struct commit *parent = p->item;
> +		int *pi = indegree_slab_at(&info->indegree, parent);

Sidenote: I would call this 'indegree_plus_one', not 'indegree'.  But
maybe I am too pedantic here.

> +
> +		if (*pi)
> +			(*pi)++;

If in-degree of parent is defined, then increase it.

> +		else
> +			*pi = 2;

If in-degree of parent is not defined, then it is first incoming edge,
and in-degree plus one is thus 2 (i.e. 1 + INDEGREE_ZERO).

> +
> +		test_flag_and_insert(&info->indegree_queue, parent, TOPO_WALK_INDEGREE);
> +
> +		if (revs->first_parent_only)
> +			return;
> +	}

This loop looks all right to me: we insert the parents if they do not
exist in the queue, and we handle --first-parent correctly.

> +}
> +
> +static void compute_indegrees_to_depth(struct rev_info *revs)
> +{
> +	struct topo_walk_info *info = revs->topo_walk_info;
> +	struct commit *c;
> +	while ((c = prio_queue_peek(&info->indegree_queue)) &&
> +	       c->generation >= info->min_generation)
> +		indegree_walk_step(revs);
> +}

All right, this looks correct.  It is identical with explore_to_depth(),
but for the change of queue member of topo_walk_info and step function.

Sidenote: if C had true macros (higher-order functions), then it might
be worth encoding this structure in a macro.  Preprocessor macros though
would make the code more obscure, not less.
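
For comparison, the shared loop could also be expressed with a function
pointer rather than a preprocessor macro. This is only a sketch of that
alternative, not something proposed in the patch: `struct gen_queue` is
a toy sorted array instead of a prio_queue, and `walk_step_fn`,
`walk_to_depth` and `count_step` are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>

#define CAP 16

struct gen_queue {
	uint32_t gens[CAP];	/* kept sorted, highest generation last */
	size_t nr;
};

/* Hypothetical step callback: must remove the top element of 'q'. */
typedef void (*walk_step_fn)(struct gen_queue *q, void *data);

/* Shared loop: step while the highest queued generation is >= cutoff. */
static void walk_to_depth(struct gen_queue *q, uint32_t cutoff,
			  walk_step_fn step, void *data)
{
	while (q->nr && q->gens[q->nr - 1] >= cutoff)
		step(q, data);
}

/* Example step: pop the top element and count it. */
static void count_step(struct gen_queue *q, void *data)
{
	q->nr--;
	(*(int *)data)++;
}
```

The real step functions also push parents back into the queue, which the
toy step omits; whether the indirection is clearer than two four-line
loops is a matter of taste.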

>  
>  static void init_topo_walk(struct rev_info *revs)
>  {
>  	struct topo_walk_info *info;
> +	struct commit_list *list;

Hmmm, I wonder what do we need this 'list' for.

>  	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
>  	info = revs->topo_walk_info;
>  	memset(info, 0, sizeof(struct topo_walk_info));
>  
> -	limit_list(revs);
> -	sort_in_topological_order(&revs->commits, revs->sort_order);
> +	init_indegree_slab(&info->indegree);
> +	memset(&info->explore_queue, '\0', sizeof(info->explore_queue));
> +	memset(&info->indegree_queue, '\0', sizeof(info->indegree_queue));
> +	memset(&info->topo_queue, '\0', sizeof(info->topo_queue));

Why does this memset use '\0' as the filler value and not 0?  The queues
are not strings.
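
For what it is worth, in C the character constant '\0' has type int and
value 0, so the two spellings produce the same bytes and the choice is
purely stylistic. A minimal check, using a hypothetical
`struct sample_queue` rather than the real prio_queue:

```c
#include <string.h>

struct sample_queue { void *array; int nr, alloc; };

/* Returns 1 when clearing with '\0' and with 0 yields identical bytes. */
static int zero_fill_is_equivalent(void)
{
	struct sample_queue a, b;

	memset(&a, '\0', sizeof(a));
	memset(&b, 0, sizeof(b));
	return !memcmp(&a, &b, sizeof(a));
}
```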

> +
> +	switch (revs->sort_order) {
> +	default: /* REV_SORT_IN_GRAPH_ORDER */
> +		info->topo_queue.compare = NULL;
> +		break;
> +	case REV_SORT_BY_COMMIT_DATE:
> +		info->topo_queue.compare = compare_commits_by_commit_date;
> +		break;
> +	case REV_SORT_BY_AUTHOR_DATE:
> +		init_author_date_slab(&info->author_date);
> +		info->topo_queue.compare = compare_commits_by_author_date;
> +		info->topo_queue.cb_data = &info->author_date;
> +		break;
> +	}

O.K., those are all possible values for revs->sort_order (all possible
values of the rev_sort_order enum).

> +
> +	info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
> +	info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;

All right, those lower level priority queues are sorted by generation
number (with commit date as tie breaker).
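
A minimal sketch of such a two-key comparison, assuming the usual
convention that a negative result means "a comes out of the queue
first".  'toy_dated_commit' and 'toy_compare_gen_then_date' are
hypothetical stand-ins, not the real struct commit or comparator.

```c
#include <stdint.h>

/* Hypothetical stand-in holding only the two keys used for ordering. */
struct toy_dated_commit {
	uint32_t generation;
	uint64_t date;
};

/*
 * Negative when a should be popped before b: higher generation number
 * first, and for equal generation numbers the newer commit date first.
 */
static int toy_compare_gen_then_date(const struct toy_dated_commit *a,
				     const struct toy_dated_commit *b)
{
	if (a->generation != b->generation)
		return a->generation > b->generation ? -1 : 1;
	if (a->date != b->date)
		return a->date > b->date ? -1 : 1;
	return 0;
}
```

Note that the generation number dominates: an older commit with a higher
generation number still sorts ahead of a newer one with a lower number.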

> +
> +	info->min_generation = GENERATION_NUMBER_INFINITY;
> +	for (list = revs->commits; list; list = list->next) {

This loop goes over all starting commits, doesn't it?

> +		struct commit *c = list->item;
> +		test_flag_and_insert(&info->explore_queue, c, TOPO_WALK_EXPLORED);
> +		test_flag_and_insert(&info->indegree_queue, c, TOPO_WALK_INDEGREE);
> +
> +		if (parse_commit_gently(c, 1))
> +			continue;

Why do we insert commits that cannot be parsed to those two queues?

> +		if (c->generation < info->min_generation)
> +			info->min_generation = c->generation;

All right, we have parsed commit 'c' so we know its generation number.

> +	}

Here all starting commits are inserted into both explore_queue (for
parsing and walk) and indegree_queue (for in-degree calculations).
All right.

> +
> +	for (list = revs->commits; list; list = list->next) {
> +		struct commit *c = list->item;
> +		*(indegree_slab_at(&info->indegree, c)) = 1;
> +
> +		if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
> +			record_author_date(&info->author_date, c);
> +	}

This is a separate loop to initialize and possibly record data in slabs
for indegree and author_date info.

I wonder why it is in a separate loop.  Is it to make code cleaner, to
separate different concerns into separate loops?

> +	compute_indegrees_to_depth(revs);

It looks a bit strange that depth is not passed as a parameter, but its
value is embedded inside revs structure, but I guess it is done this way
to keep it in sync.

Though it is a bit *inconsistent* to have explore_to_depth() having
'gen' parameter, but compute_indegrees_to_depth() not having it.  There
is '_to_depth()' in a name, and there is no 'depth' parameter...


Here we have computed indegrees of all starting commits, walking the
commit graph if necessary.

> +
> +	for (list = revs->commits; list; list = list->next) {
> +		struct commit *c = list->item;
> +
> +		if (*(indegree_slab_at(&info->indegree, c)) == 1)
> +			prio_queue_put(&info->topo_queue, c);
> +	}

And here we add all commits with no incoming edges, i.e. with a real
in-degree of zero, and "indegree plus one" equal to 1, or INDEGREE_ZERO.

This is the starting point of Kahn's algorithm (assuming that in-degrees
will be calculated correctly while running it).  All right.
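
To make the "in-degree plus one" convention concrete, here is a minimal
sketch of Kahn's algorithm on a toy graph.  It is not the patch's code:
'toy_graph', 'toy_topo_sort' and the INDEGREE_* names are hypothetical,
every in-degree is computed up front (unlike the incremental walk), and
a plain stack plays the role of the set S.

```c
#define MAX_NODES 8
#define INDEGREE_UNINITIALIZED 0	/* hypothetical names for the two */
#define INDEGREE_ZERO 1			/* special slab values */

struct toy_graph {
	int nr;				/* number of nodes, at most MAX_NODES */
	int nr_parents[MAX_NODES];
	int parents[MAX_NODES][2];	/* edges point from child to parent */
};

/*
 * Kahn's algorithm, storing "in-degree plus one" for each node.
 * Writes the topological order into out[] and returns its length.
 */
static int toy_topo_sort(const struct toy_graph *g, int *out)
{
	int indegree[MAX_NODES] = { 0 };
	int stack[MAX_NODES];
	int top = 0, emitted = 0;
	int i, j;

	/* every node starts out with no known incoming edge */
	for (i = 0; i < g->nr; i++)
		indegree[i] = INDEGREE_ZERO;
	/* each child adds one incoming edge to each of its parents */
	for (i = 0; i < g->nr; i++)
		for (j = 0; j < g->nr_parents[i]; j++)
			indegree[g->parents[i][j]]++;

	/* seed S with the nodes that have no children */
	for (i = 0; i < g->nr; i++)
		if (indegree[i] == INDEGREE_ZERO)
			stack[top++] = i;

	while (top > 0) {
		int n = stack[--top];

		out[emitted++] = n;
		indegree[n] = INDEGREE_UNINITIALIZED;
		for (j = 0; j < g->nr_parents[n]; j++) {
			int m = g->parents[n][j];

			/* "remove edge e"; queue m once no edge is left */
			if (--indegree[m] == INDEGREE_ZERO)
				stack[top++] = m;
		}
	}
	return emitted;
}
```

For a diamond (tip 0 with parents 1 and 2, both with parent 3), the tip
is the only seed and the root is emitted only after both of its children.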

> +
> +	/*
> +	 * This is unfortunate; the initial tips need to be shown
> +	 * in the order given from the revision traversal machinery.
> +	 */
> +	if (revs->sort_order == REV_SORT_IN_GRAPH_ORDER)
> +		prio_queue_reverse(&info->topo_queue);

Right, with REV_SORT_IN_GRAPH_ORDER the priority queue is actually a
stack, and access through this stack reverses the order of commits as it
was originally in the list (last commit was added last, and stack is
LIFO structure, last added element is retrieved first).

I think that here some sort of complication with regard to
REV_SORT_IN_GRAPH_ORDER is unavoidable, unless priority queue is
enhanced to work as an ordinary FIFO queue in addition to making it work
as LIFO stack.
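
The effect of the reversal can be seen on a toy stack.  This is only a
sketch with hypothetical names ('toy_stack', 'toy_push', 'toy_pop',
'toy_reverse'), not the prio-queue implementation: after reversing the
stored items, pops return them in their original insertion order.

```c
#include <stddef.h>

/* Hypothetical fixed-size LIFO stack of ints. */
struct toy_stack {
	int item[8];
	size_t nr;
};

static void toy_push(struct toy_stack *s, int v)
{
	s->item[s->nr++] = v;
}

static int toy_pop(struct toy_stack *s)
{
	return s->item[--s->nr];
}

/* Reverse the stored items in place, so pops follow insertion order. */
static void toy_reverse(struct toy_stack *s)
{
	size_t i, j;

	if (!s->nr)
		return;
	for (i = 0, j = s->nr - 1; i < j; i++, j--) {
		int tmp = s->item[i];

		s->item[i] = s->item[j];
		s->item[j] = tmp;
	}
}
```

Pushing 10, 20, 30 and popping yields 30, 20, 10; with one toy_reverse()
call in between, the pops yield 10, 20, 30 instead.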

>  }
>  
>  static struct commit *next_topo_commit(struct rev_info *revs)
>  {
> -	return pop_commit(&revs->commits);
> +	struct commit *c;
> +	struct topo_walk_info *info = revs->topo_walk_info;
> +
> +	/* pop next off of topo_queue */
> +	c = prio_queue_get(&info->topo_queue);

All right, pop_commit() transforms straightforwardly to
prio_queue_get().

> +
> +	if (c)
> +		*(indegree_slab_at(&info->indegree, c)) = 0;

Why do we need to mark indegree of commit to be returned as undefined
here (INDEGREE_UNINITIALIZED)?

> +
> +	return c;
>  }
>

Before the change, expand_topo_walk() simply added parents to the list,
and actual sorting was done by sort_in_topological_order().

>  static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
>  {
> -	if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
> +	struct commit_list *p;
> +	struct topo_walk_info *info = revs->topo_walk_info;
> +	if (process_parents(revs, commit, NULL, NULL) < 0) {

All right, here we remove storing commits in revs->commits list, the
third parameter changed from &revs->commits to NULL.

>  		if (!revs->ignore_missing_links)
>  			die("Failed to traverse parents of commit %s",
> -			    oid_to_hex(&commit->object.oid));
> +				oid_to_hex(&commit->object.oid));

The above looks like a spurious, accidental whitespace change, doesn't
it?

> +	}
> +

All right, the loop below looks like the inner loop of Kahn's
algorithm, i.e.:

      for each node m with an edge e from n to m do
          remove edge e from the graph
          if m has no other incoming edges then
              insert m into S


> +	for (p = commit->parents; p; p = p->next) {
> +		struct commit *parent = p->item;
> +		int *pi;
> +
> +		if (parse_commit_gently(parent, 1) < 0)
> +			continue;

All right, we need to parse parent commit to ensure that we can access
its generation number.

> +
> +		if (parent->generation < info->min_generation) {
> +			info->min_generation = parent->generation;
> +			compute_indegrees_to_depth(revs);
> +		}

The above ensures that the parent will have correctly calculated
in-degree.  Looks all right.

> +
> +		pi = indegree_slab_at(&info->indegree, parent);
> +
> +		(*pi)--;

          remove edge e from the graph

> +		if (*pi == 1)
> +			prio_queue_put(&info->topo_queue, parent);

If parent has no incoming edges (indegree == 1 == INDEGREE_ZERO), then
insert it into topo_queue.

          if m has no other incoming edges then
              insert m into S

> +
> +		if (revs->first_parent_only)
> +			return;
>  	}
>  }

Looks all right.

>  
> diff --git a/revision.h b/revision.h
> index fd4154ff75..b0b3bb8025 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -24,6 +24,8 @@
>  #define USER_GIVEN	(1u<<25) /* given directly by the user */
>  #define TRACK_LINEAR	(1u<<26)
>  #define ALL_REV_FLAGS	(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
> +#define TOPO_WALK_EXPLORED	(1u<<27)
> +#define TOPO_WALK_INDEGREE	(1u<<28)

To be more exact, this flag does not mean that the commit has been
explored, or has its in-degree calculated, but that it was added to the
queue for exploring, or for having its in-degree calculated.

Current names of those two new preprocessor constants might be
considered mildly misleading, absent context.

>  
>  #define DECORATE_SHORT_REFS	1
>  #define DECORATE_FULL_REFS	2

--
Jakub Narębski


* Re: [PATCH v4 6/7] revision.c: generation-based topo-order algorithm
  2018-10-22 13:37         ` Jakub Narebski
@ 2018-10-23 13:54           ` Derrick Stolee
  2018-10-26 16:55             ` Jakub Narebski
  0 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee @ 2018-10-23 13:54 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee via GitGitGadget
  Cc: git, Jeff King, Junio C Hamano, Derrick Stolee

On 10/22/2018 9:37 AM, Jakub Narebski wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The current --topo-order algorithm requires walking all
>> reachable commits up front, topo-sorting them, all before
>> outputting the first value. This patch introduces a new
>> algorithm which uses stored generation numbers to
>> incrementally walk in topo-order, outputting commits as
>> we go. This can dramatically reduce the computation time
>> to write a fixed number of commits, such as when limiting
>> with "-n <N>" or filling the first page of a pager.
>>
>> When running a command like 'git rev-list --topo-order HEAD',
>> Git performed the following steps:
>>
>> 1. Run limit_list(), which parses all reachable commits,
>>     adds them to a linked list, and distributes UNINTERESTING
>>     flags. If all unprocessed commits are UNINTERESTING, then
>>     it may terminate without walking all reachable commits.
>>     This does not occur if we do not specify UNINTERESTING
>>     commits.
>>
>> 2. Run sort_in_topological_order(), which is an implementation
>>     of Kahn's algorithm. It first iterates through the entire
>>     set of important commits and computes the in-degree of each
>>     (plus one, as we use 'zero' as a special value here). Then,
>>     we walk the commits in priority order, adding them to the
>>     priority queue if and only if their in-degree is one. As
>>     we remove commits from this priority queue, we decrement the
>>     in-degree of their parents.
> Because in-degree has very specific defined meaning of number of
> children, i.e. the number of _incoming_ edges, I would say "if and only
> if their in-degree-plus-one is one".  It is more exact, even if it looks
> a bit funny.
>
>> 3. While we are peeling commits for output, get_revision_1()
>>     uses pop_commit on the full list of commits computed by
>>     sort_in_topological_order().
> All right, so those are separate steps (separate walks): prepare and
> parse commits, topologically sort list of commits from previous step,
> output sorted list of commits from previous step.

I would rephrase your explanation above as: prepare and parse commits, 
compute in-degrees, and peel commits of in-degree zero.

>> In the new algorithm, these three steps correspond to three
>> different commit walks. We run these walks simultaneously,
>> and advance each only as far as necessary to satisfy the
>> requirements of the 'higher order' walk.
> What does 'higher order' walk mean: steps 3, 2, 1, in this order,
> i.e. output being the highest order, or something different?

Yes. We only walk "level 2" in order to satisfy how far we are in "level 3".

> Sidenote: the new algorithm looks a bit like Unix pipeline, where each
> step of pipeline does not output much more than next step needs /
> requests.

That's essentially the idea.

>>                                           We know when we can
>> pause each walk by using generation numbers from the commit-
>> graph feature.
> Do I understand it correctly that this is mainly used in Kahn's
> algorithm to find out through the negative-cut index of generation
> number which commits in the to-be-sorted list cannot have an in-degree
> of zero (or otherwise cannot be next commit to be shown in output)?

In each step of the algorithm, we operate under the assumption that 
certain vertices have "all necessary information".

In the case of "level 3", we need to know that all descendants were 
walked and our in-degree calculation is correct. We guarantee this by 
ensuring that "level 2" has walked beyond that commit's generation number.

In the case of "level 2", we need to know that we have parsed all 
descendants and determined their simplifications (if necessary, such as 
in file-history) and if they are UNINTERESTING. We guarantee this by 
ensuring that "level 1" has walked beyond that commit's generation number.

In the previous algorithm, these guarantees were handled by doing each 
step on all reachable commits before moving to the next level.

>> Recall that the generation number of a commit satisfies:
>>
>> * If the commit has at least one parent, then the generation
>>    number is one more than the maximum generation number among
>>    its parents.
>>
>> * If the commit has no parent, then the generation number is one.
>>
>> There are two special generation numbers:
>>
>> * GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
>>    indicates that the commit is not stored in the commit-graph and
>>    the generation number was not previously calculated.
>>
>> * GENERATION_NUMBER_ZERO: this value (0) is a special indicator
>>    to say that the commit-graph was generated by a version of Git
>>    that does not compute generation numbers (such as v2.18.0).
>>
>> Since we use generation_numbers_enabled() before using the new
>> algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
>> However, the existence of GENERATION_NUMBER_INFINITY implies the
>> following weaker statement than the usual we expect from
>> generation numbers:
>>
>>      If A and B are commits with generation numbers gen(A) and
>>      gen(B) and gen(A) < gen(B), then A cannot reach B.
>>
>> Thus, we will walk in each of our stages until the "maximum
>> unexpanded generation number" is strictly lower than the
>> generation number of a commit we are about to use.
> And this "maximum unexpanded generation number" must be greater or equal
> to 1, thanks to assuming generation_numbers_enabled().
>
>
> Let's start by writing down the original version of Kahn's algorithm
> (which is not the only way to calculate topological ordering; another
> method is to use depth-first searches).
>
>    L ← Empty list that will contain the sorted elements
>    S ← Set of all nodes with no incoming edge
>    while S is non-empty do
>        remove a node n from S
>        add n to tail of L
>        for each node m with an edge e from n to m do
>            remove edge e from the graph
>            if m has no other incoming edges then
>                insert m into S
>    if graph has edges then
>        return error   _(graph has at least one cycle)_
>    else
>        return L       _(a topologically sorted order)_
>
> In the case of Git, we display only commits reachable from the starting
> commits, so only those starting commits can have no incoming edge, by
> the definition of the reachable commit (note that some starting commits
> can be reachable from other starting commits).
>
> Note that in Git by construction we cannot have cycles in the objects
> graph, and that 'remove edge e [= n -> m] from the graph' simply means
> decreasing the [effective] in-degree of node m.
>
>> The walks are as follows:
>>
>> 1. EXPLORE: using the explore_queue priority queue (ordered by
>>     maximizing the generation number), parse each reachable
>>     commit until all commits in the queue have generation
>>     number strictly lower than needed. During this walk, update
>>     the UNINTERESTING flags as necessary.
> All right, that looks sensible.  Parse commits and update the
> UNINTERESTING flags only up to what might be needed.
>
> Though I would add for each walk what the post-conditions are, i.e.
> what requirements the list of returned commits fulfills.  In the case of
> the EXPLORE walk it would be that commits are in the "reachable from
> start commits" set, parsed and not UNINTERESTING.  And that all such
> commits with generation number greater than or equal to what is needed
> are there (and their parents).
>
>
> Wouldn't this though make the output always start at the commit with
> maximal generation number (such commit or commits would need to have an
> in-degree of zero, i.e. no incoming edges), instead of whatever order is
> requested (if date order contradicts generation number order) or in the
> command line arguments order?

The final order is prioritized in the "level 3" walk, which either uses 
an incrementing counter (--topo-order), commit date (--date-order), or 
author date (--author-date-order) as the priority.

>> 2. INDEGREE: using the indegree_queue priority queue (ordered
>>     by maximizing the generation number), add one to the in-
>>     degree of each parent for each commit that is walked. Since
>>     we walk in order of decreasing generation number, we know
>>     that discovering an in-degree value of 0 means the value for
>>     that commit was not initialized, so should be initialized to
>>     two. (Recall that in-degree value "1" is what we use to say a
>>     commit is ready for output.)
> The post-condition is that all returned commits have their in-degree
> plus one calculated.
>
> Mixing actual in-degree (number of incoming edges, zero means no
> incoming edge and candidate for the next commit in topological order),
> and details of implementation (using value of zero for uninitialized,
> and thus actually storing in-degree plus one) makes this description a
> bit hard to follow.
>
> Note that the additional complication is that if commits have generation
> number INFINITY, then we cannot say anything about reachability among
> commits with this special generation number.  That means that until we
> process all commits with generation number INFINITY, we don't know which
> ones have no incoming edges (a real in-degree of zero).  If they are not
> INFINITY, the stronger version of the reachability condition for
> generation number holds, and thus we know that if we pop the commit and
> it has uninitialized in-degree and generation number not INFINITY, then
> it has no-incoming edges (in-degree of zero).
>
> That does not matter much, but for the fact that before outputting list
> of commits / returning from the function we need to ensure that all out
> commits have defined in-degree value.  All that have in-degree undefined
> when indegree_queue is empty, because of reachability and generation
> numbers constraints, actually have an in-degree of zero (no incoming
> edges).

This is why we walk until exploring _beyond_ a given generation number. 
Generation number INFINITY is not special with that restriction, as made 
clear in the earlier discussion of generation numbers.

We expect there to be commits with generation number INFINITY, because 
users will not be updating their commit-graph with every single 'git 
commit' command. This mode is covered in our test cases.
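
The weaker statement quoted earlier ("if gen(A) < gen(B), then A cannot
reach B") can be written as a one-line check.  A minimal sketch with
hypothetical names ('toy_may_reach', 'TOY_GENERATION_INFINITY'), not a
function from the series; it shows that INFINITY needs no special case,
it merely never lets us rule anything out from that side.

```c
#include <stdint.h>

/* Same value the commit message gives for GENERATION_NUMBER_INFINITY. */
#define TOY_GENERATION_INFINITY 0xffffffffu

/*
 * Returns 0 when A definitely cannot reach B, and 1 when generation
 * numbers alone cannot rule it out (the walk has to find out).
 */
static int toy_may_reach(uint32_t gen_a, uint32_t gen_b)
{
	return gen_a >= gen_b;
}
```

Two commits that both have generation number INFINITY always land in the
"cannot rule it out" case, which is why the walks continue until they
are strictly beyond the generation number in question.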

>
>>                                   As we iterate the parents of a
>>     commit during this walk, ensure the EXPLORE walk has walked
>>     beyond their generation numbers.
> All right. looks sensible from the point of view of trying to do
> streaming of sorted commits.
>
>> 3. TOPO: using the topo_queue priority queue (ordered based on
>>     the sort_order given, which could be commit-date, author-
>>     date, or typical topo-order which treats the queue as a LIFO
>>     stack), remove a commit from the queue and decrement the
>>     in-degree of each parent. If a parent has an in-degree of
>>     one, then we add it to the topo_queue. Before we decrement
>>     the in-degree, however, ensure the INDEGREE walk has walked
>>     beyond that generation number.
> This description missed an important constraint, namely that all commits
> in the topo_order queue have real in-degree of zero, i.e. no incoming
> edges.  The topo_order queue is set S in the Kahn's algorithm.
>
> Also, we need to know how to populate the topo_queue at start with at
> least one commit.  How is it initially populated?  That is a very
> important question.
>
>> The implementations of these walks are in the following methods:
>>
>> * explore_walk_step and explore_to_depth
>> * indegree_walk_step and compute_indegrees_to_depth
>> * next_topo_commit and expand_topo_walk
> All right, on one hand: good calling convention.  On the other hand:
> why the difference in naming?
>
>> These methods have some patterns that may seem strange at first,
>> but they are probably carry-overs from their equivalents in
>> limit_list and sort_in_topological_order.
>>
>> One thing that is missing from this implementation is a proper
>> way to stop walking when the entire queue is UNINTERESTING, so
>> this implementation is not enabled by comparisons, such as in
>> 'git rev-list --topo-order A..B'. This can be updated in the
>> future.
> All right, lets start with easier step.
>
>> In my local testing, I used the following Git commands on the
>> Linux repository in three modes: HEAD~1 with no commit-graph,
>> HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
>> allows comparing the benefits we get from parsing commits from
>> the commit-graph and then again the benefits we get by
>> restricting the set of commits we walk.
>>
>> Test: git rev-list --topo-order -100 HEAD
>> HEAD~1, no commit-graph: 6.80 s
>> HEAD~1, w/ commit-graph: 0.77 s
>>    HEAD, w/ commit-graph: 0.02 s
>>
>> Test: git rev-list --topo-order -100 HEAD -- tools
>> HEAD~1, no commit-graph: 9.63 s
>> HEAD~1, w/ commit-graph: 6.06 s
>>    HEAD, w/ commit-graph: 0.06 s
>>
>> This speedup is due to a few things. First, the new generation-
>> number-enabled algorithm walks commits on order of the number of
>> results output (subject to some branching structure expectations).
>> Since we limit to 100 results, we are running a query similar to
>> filling a single page of results. Second, when specifying a path,
>> we must parse the root tree object for each commit we walk. The
>> previous benefits from the commit-graph are entirely from reading
>> the commit-graph instead of parsing commits. Since we need to
>> parse trees for the same number of commits as before, we slow
>> down significantly from the non-path-based query.
>>
>> For the test above, I specifically selected a path that is changed
>> frequently, including by merge commits. A less-frequently-changed
>> path (such as 'README') has similar end-to-end time since we need
>> to walk the same number of commits (before determining we do not
>> have 100 hits). However, we get the benefit that the output is
>> presented to the user as it is discovered, much the same as a
>> normal 'git log' command (no '--topo-order'). This is an improved
>> user experience, even if the command has the same runtime.
> First, do I understand it correctly that in the first case the gains
> from the new algorithm are so slim because with commit-graph file and no
> path limiting we don't hit repository anyway; we walk fewer commits, but
> reading commit data from commit-graph file is fast?

If you mean 0.77s to 0.02s is "slim" then yes, it is because the 
commit-graph command already made a full walk of the commit history 
faster. (I'm only poking at this because the _relative_ improvement is 
significant, even if the command was already sub-second.)

> Second, I wonder if there is some easy way to perform automatic latency
> tests, i.e. how fast does Git show the first page of output...

I have talked with Jeff Hostetler about this, to see if we can have a 
"time to first page" traced with trace2, but we don't seem to have 
access to that information within Git. We would need to insert it into 
the pager. The "-100" is used instead.

>
>> Helped-by: Jeff King <peff@peff.net>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   object.h   |   4 +-
>>   revision.c | 199 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>>   revision.h |   2 +
>>   3 files changed, 197 insertions(+), 8 deletions(-)
> Daunting change to review.
>
>> diff --git a/object.h b/object.h
>> index 0feb90ae61..796792cb32 100644
>> --- a/object.h
>> +++ b/object.h
>> @@ -59,7 +59,7 @@ struct object_array {
>>   
>>   /*
>>    * object flag allocation:
>> - * revision.h:               0---------10                              2526
>> + * revision.h:               0---------10                              25----28
>>    * fetch-pack.c:             01
>>    * negotiator/default.c:       2--5
>>    * walker.c:                 0-2
>> @@ -78,7 +78,7 @@ struct object_array {
>>    * builtin/show-branch.c:    0-------------------------------------------26
>>    * builtin/unpack-objects.c:                                 2021
>>    */
>> -#define FLAG_BITS  27
>> +#define FLAG_BITS  29
> What are those two additional object flags needed for revision.h /
> revision.c after this change?
>
> Ah, those are TOPO_WALK_EXPLORED and TOPO_WALK_INDEGREE.
>
>>   
>>   /*
>>    * The object type is stored in 3 bits.
>> diff --git a/revision.c b/revision.c
>> index 36458265a0..472f3994e3 100644
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -26,6 +26,7 @@
>>   #include "argv-array.h"
>>   #include "commit-reach.h"
>>   #include "commit-graph.h"
>> +#include "prio-queue.h"
>>   
>>   volatile show_early_output_fn_t show_early_output;
>>   
>> @@ -2895,30 +2896,216 @@ static int mark_uninteresting(const struct object_id *oid,
>>   	return 0;
>>   }
>>   
>> -struct topo_walk_info {};
>> +define_commit_slab(indegree_slab, int);
>> +
>> +struct topo_walk_info {
>> +	uint32_t min_generation;
>> +	struct prio_queue explore_queue;
>> +	struct prio_queue indegree_queue;
>> +	struct prio_queue topo_queue;
>> +	struct indegree_slab indegree;
> All right.
>
>> +	struct author_date_slab author_date;
> Why this slab is needed in topo_walk_info struct?
>
>> +};
>> +
>> +static inline void test_flag_and_insert(struct prio_queue *q, struct commit *c, int flag)
>> +{
>> +	if (c->object.flags & flag)
>> +		return;
>> +
>> +	c->object.flags |= flag;
>> +	prio_queue_put(q, c);
>> +}
> This is an independent change, though I see that it is quite specific
> (as opposed to quite generic prio_queue_peek() operation added earlier
> in this series), so it does not make much sense as standalone change.
>
> It inserts commit into priority queue only if it didn't have flags set,
> and sets the flag (so we won't add it to the queue again, not without
> unsetting the flag), am I correct?

Yes, this pattern of using a flag to avoid duplicate entries in the 
priority queue appears in multiple walks. It wasn't needed before. We 
call it four times in the code below.
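
A stripped-down sketch of that pattern, for illustration only:
'toy_flagged_commit', 'dedup_queue', 'TOY_QUEUED' and
'toy_flag_and_insert' are hypothetical names, and an append-only array
stands in for the priority queue.

```c
#include <stddef.h>

#define TOY_QUEUED (1u << 0)	/* hypothetical "was queued" flag bit */

struct toy_flagged_commit {
	unsigned flags;
};

/* Hypothetical append-only queue standing in for prio_queue. */
struct dedup_queue {
	struct toy_flagged_commit *item[8];
	size_t nr;
};

/*
 * Append c unless its flag says it was queued before.
 * Returns 1 if the commit was inserted, 0 if it was skipped.
 */
static int toy_flag_and_insert(struct dedup_queue *q,
			       struct toy_flagged_commit *c, unsigned flag)
{
	if (c->flags & flag)
		return 0;
	c->flags |= flag;
	q->item[q->nr++] = c;
	return 1;
}
```

A commit reached through two different children is therefore queued only
once; the flag stays set after the commit leaves the queue, so it is not
queued again later either.
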
>> +
>> +static void explore_walk_step(struct rev_info *revs)
>> +{
>> +	struct topo_walk_info *info = revs->topo_walk_info;
>> +	struct commit_list *p;
>> +	struct commit *c = prio_queue_get(&info->explore_queue);
>> +
>> +	if (!c)
>> +		return;
>> +
>> +	if (parse_commit_gently(c, 1) < 0)
>> +		return;
> All right, all commits taken out of explore_queue are parsed.  This is
> used to ensure that all commits with generation number greater than some
> set cutoff are parsed.
>
>> +
>> +	if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
>> +		record_author_date(&info->author_date, c);
>> +
>> +	if (revs->max_age != -1 && (c->date < revs->max_age))
>> +		c->object.flags |= UNINTERESTING;
> These two conditionals look a bit strange to me; they are hardcoded
> specific cases of query.  But that might be just me...

These special cases are important for making all of the different option 
flags to rev-list work with the algorithm. They are pulled directly from 
limit_list().

>
>> +
>> +	if (process_parents(revs, c, NULL, NULL) < 0)
>> +		return;
> I see that we are using process_parents(), formerly i.e. before patch
> 5/7 add_parents_to_list(), with NULL 'list' parameter for the first
> time.
>> +
>> +	if (c->object.flags & UNINTERESTING)
>> +		mark_parents_uninteresting(c);
>> +
>> +	for (p = c->parents; p; p = p->next)
>> +		test_flag_and_insert(&info->explore_queue, p->item, TOPO_WALK_EXPLORED);
> Do we need to insert parents to the queue even if they were marked
> UNINTERESTING?

We need to propagate the UNINTERESTING flag to our parents. That 
propagation happens in process_parents().

>
> I guess that we use test_flag_and_insert() instead of prio_queue_put()
> to avoid duplicate entries in the queue.  I think the queue is initially
> populated with the starting commits, but those need not to be
> unreachable from each other, and walking down parents we can encounter
> starting commit already in the queue.  Am I correct?

We can also reach commits in multiple ways, so the initial conditions 
are not the only ways to insert duplicates.

>> +}
> Let's compare this new function with the limit_list() used in the old
> algorithm for --topo-order walk (and even now for A..B walks), or to be
> more exact with the contents of the while loop.
>
> 1. limit_list() doesn't have the check if the commit exists, and
>     does not use parse_commit_gently().  Why the difference, i.e. where
>     revs->commits gets parsed, and why explore_walk_step() cannot rely on
>     this?
>
>     I get that the goal is to not have parse commits if not needed, so it
>     is good that it is moved to explore_walk_step().
>
> 2. limit_list() is also missing running record_author_date() when
>     sorting output by author date.  I guess that explore_walk_step()
>     needs this because commit-graph file does not include this
>     information.
>
> 3. Handling of revs->max_age by marking commit as UNINTERESTING if
>     needed is the same in limit_list() and in explore_walk_step().
>
> 4. limit_list() but not explore_walk_step() handles revs->min_age near
>     the end of the loop by terminating the loop.  I guess for this case
>     we have revs->limited set, and we use the old algorithm, don't we?
>
>     Something to remember when adding A..B handling to new algorithm.
>
> 5. add_parents_to_list() / process_parents() is nearly the same in
>     limit_list() and in explore_walk_step(), but for the fact that the
>     new function doesn't use 'list' parameter.
>
> 6. Both limit_list() and explore_walk_step() use
>     mark_parents_uninteresting() on uninteresting commits.
>
>     However, limit_list() breaks out of the loop, and uses
>     interesting_cache with slop.  I guess that those two facts are
>     connected, right?
>
> 7. Then explore_walk_step() inserts parents to the priority queue if
>     they are not present there already, with test_flag_and_insert(),
>     whose rough equivalent in limit_list() would be using
>     commit_list_insert().
>
> 8. limit_list() has also some code for show_early_output, which I guess
>     explore_walk_step() does not need to handle.
>
>> +
>> +static void explore_to_depth(struct rev_info *revs,
>> +			     uint32_t gen)
>> +{
>> +	struct topo_walk_info *info = revs->topo_walk_info;
>> +	struct commit *c;
>> +	while ((c = prio_queue_peek(&info->explore_queue)) &&
>> +	       c->generation >= gen)
> I have originally thought that if we extract prio_queue_get() and
> test_flag_and_insert() / prio_queue_put() out of explore_walk_step() and
> put it into this loop, i.e. into the calling function, we could avoid
> code duplication between explore_walk_step() and limit_list()... but I
> guess that is impossible anyway.
>
>> +		explore_walk_step(revs);
>> +}
> Nice, tight, and easy to understand function.  Though perhaps 'gen'
> could be called 'gen_cutoff' or 'min_gen', or 'min_gen_cutoff'.
>
>> +
>> +static void indegree_walk_step(struct rev_info *revs)
>> +{
>> +	struct commit_list *p;
>> +	struct topo_walk_info *info = revs->topo_walk_info;
>> +	struct commit *c = prio_queue_get(&info->indegree_queue);
>> +
>> +	if (!c)
>> +		return;
>> +
>> +	if (parse_commit_gently(c, 1) < 0)
>> +		return;
> All right, we need to parse commit 'c' to have its generation number,
> and we need to do the same in explore_walk_step() because we walk
> possibly unparsed parents.
>
>> +
>> +	explore_to_depth(revs, c->generation);
> If we walk everything up to current commit depth, then we have walked
> all commits that can affect in-degree of current commit.  Good.
>
>> +
>> +	if (parse_commit_gently(c, 1) < 0)
>> +		return;
> Why do we parse the same commit again???

Good point! Accidental duplicate lines.

>> +
>> +	for (p = c->parents; p; p = p->next) {
>> +		struct commit *parent = p->item;
>> +		int *pi = indegree_slab_at(&info->indegree, parent);
> Sidenote: I would call this 'indegree_plus_one', not 'indegree'.  But
> maybe I am too pedantic here.
>
>> +
>> +		if (*pi)
>> +			(*pi)++;
> If in-degree of parent is defined, then increase it.
>
>> +		else
>> +			*pi = 2;
> If in-degree of parent is not defined, then it is first incoming edge,
> and in-degree plus one is thus 2 (i.e. 1 + INDEGREE_ZERO).
>
>> +
>> +		test_flag_and_insert(&info->indegree_queue, parent, TOPO_WALK_INDEGREE);
>> +
>> +		if (revs->first_parent_only)
>> +			return;
>> +	}
> This loop looks all right to me: we insert the parents if they do not
> exist in the queue, and we handle --first-parent correctly.
>
>> +}
>> +
>> +static void compute_indegrees_to_depth(struct rev_info *revs)
>> +{
>> +	struct topo_walk_info *info = revs->topo_walk_info;
>> +	struct commit *c;
>> +	while ((c = prio_queue_peek(&info->indegree_queue)) &&
>> +	       c->generation >= info->min_generation)
>> +		indegree_walk_step(revs);
>> +}
> All right, this looks correct.  It is identical with explore_to_depth(),
> but for the change of queue member of topo_walk_info and step function.
>
> Sidenote: if C had true macros (higher-order functions), then it might
> be worth encoding this structure in a macro.  Preprocessor macros though
> would make the code more obscure, not less.
>
>>   
>>   static void init_topo_walk(struct rev_info *revs)
>>   {
>>   	struct topo_walk_info *info;
>> +	struct commit_list *list;
> Hmmm, I wonder what we need this 'list' for.
>
>>   	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
>>   	info = revs->topo_walk_info;
>>   	memset(info, 0, sizeof(struct topo_walk_info));
>>   
>> -	limit_list(revs);
>> -	sort_in_topological_order(&revs->commits, revs->sort_order);
>> +	init_indegree_slab(&info->indegree);
>> +	memset(&info->explore_queue, '\0', sizeof(info->explore_queue));
>> +	memset(&info->indegree_queue, '\0', sizeof(info->indegree_queue));
>> +	memset(&info->topo_queue, '\0', sizeof(info->topo_queue));
> Why does this memset use '\0' as a filler value and not 0?  The queues are
> not strings.
>
>> +
>> +	switch (revs->sort_order) {
>> +	default: /* REV_SORT_IN_GRAPH_ORDER */
>> +		info->topo_queue.compare = NULL;
>> +		break;
>> +	case REV_SORT_BY_COMMIT_DATE:
>> +		info->topo_queue.compare = compare_commits_by_commit_date;
>> +		break;
>> +	case REV_SORT_BY_AUTHOR_DATE:
>> +		init_author_date_slab(&info->author_date);
>> +		info->topo_queue.compare = compare_commits_by_author_date;
>> +		info->topo_queue.cb_data = &info->author_date;
>> +		break;
>> +	}
> O.K., those are all the possible values for revs->sort_order (all possible
> values of the rev_sort_order enum).
>
>> +
>> +	info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
>> +	info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;
> All right, those lower level priority queues are sorted by generation
> number (with commit date as tie breaker).
>
>> +
>> +	info->min_generation = GENERATION_NUMBER_INFINITY;
>> +	for (list = revs->commits; list; list = list->next) {
> This loop iterates over all starting commits, doesn't it?
>
>> +		struct commit *c = list->item;
>> +		test_flag_and_insert(&info->explore_queue, c, TOPO_WALK_EXPLORED);
>> +		test_flag_and_insert(&info->indegree_queue, c, TOPO_WALK_INDEGREE);
>> +
>> +		if (parse_commit_gently(c, 1))
>> +			continue;
> Why do we insert commits that cannot be parsed to those two queues?
>
>> +		if (c->generation < info->min_generation)
>> +			info->min_generation = c->generation;
> All right, we have parsed commit 'c' so we know its generation numbers.
>
>> +	}
> Here all starting commits are inserted into both explore_queue (for
> parsing and walk) and indegree_queue (for in-degree calculations).
> All right.
>
>> +
>> +	for (list = revs->commits; list; list = list->next) {
>> +		struct commit *c = list->item;
>> +		*(indegree_slab_at(&info->indegree, c)) = 1;
>> +
>> +		if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
>> +			record_author_date(&info->author_date, c);
>> +	}
> This is a separate loop to initialize and possibly record data in slabs
> for indegree and author_date info.
>
> I wonder why it is in a separate loop.  Is it to make code cleaner, to
> separate different concerns into separate loops?
>
>> +	compute_indegrees_to_depth(revs);
> It looks a bit strange that depth is not passed as a parameter, but its
> value is embedded inside revs structure, but I guess it is done this way
> to keep it in sync.
>
> Though it is a bit *inconsistent* to have explore_to_depth() having
> 'gen' parameter, but compute_indegrees_to_depth() not having it.  There
> is '_to_depth()' in a name, and there is no 'depth' parameter...
>
>
> Here we have computed indegrees of all starting commits, walking the
> commit graph if necessary.
>
>> +
>> +	for (list = revs->commits; list; list = list->next) {
>> +		struct commit *c = list->item;
>> +
>> +		if (*(indegree_slab_at(&info->indegree, c)) == 1)
>> +			prio_queue_put(&info->topo_queue, c);
>> +	}
> And here we add all commits with no incoming edges, i.e. with real
> in-degree of zero, and "indegree plus one" equal 1, or INDEGREE_ZERO.
>
> This is the starting point of Kahn's algorithm (assuming that in-degrees
> will be calculated correctly while running it).  All right.
>
>> +
>> +	/*
>> +	 * This is unfortunate; the initial tips need to be shown
>> +	 * in the order given from the revision traversal machinery.
>> +	 */
>> +	if (revs->sort_order == REV_SORT_IN_GRAPH_ORDER)
>> +		prio_queue_reverse(&info->topo_queue);
> Right, with REV_SORT_IN_GRAPH_ORDER the priority queue is actually a
> stack, and access through this stack reverses the order of commits as it
> was originally in the list (last commit was added last, and stack is
> LIFO structure, last added element is retrieved first).
>
> I think that some sort of complication with regard to
> REV_SORT_IN_GRAPH_ORDER is unavoidable here, unless the priority queue is
> enhanced to work as an ordinary FIFO queue in addition to working
> as a LIFO stack.
>
>>   }
>>   
>>   static struct commit *next_topo_commit(struct rev_info *revs)
>>   {
>> -	return pop_commit(&revs->commits);
>> +	struct commit *c;
>> +	struct topo_walk_info *info = revs->topo_walk_info;
>> +
>> +	/* pop next off of topo_queue */
>> +	c = prio_queue_get(&info->topo_queue);
> All right, pop_commit() transforms straightforwardly to
> prio_queue_get().
>
>> +
>> +	if (c)
>> +		*(indegree_slab_at(&info->indegree, c)) = 0;
> Why do we need to mark indegree of commit to be returned as undefined
> here (INDEGREE_UNINITIALIZED)?
>
>> +
>> +	return c;
>>   }
>>
> Before the change, expand_topo_walk() simply added parents to the list,
> and actual sorting was done by sort_in_topological_order().
>
>>   static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
>>   {
>> -	if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
>> +	struct commit_list *p;
>> +	struct topo_walk_info *info = revs->topo_walk_info;
>> +	if (process_parents(revs, commit, NULL, NULL) < 0) {
> All right, here we remove storing commits in revs->commits list, the
> third parameter changed from &revs->commits to NULL.
>
>>   		if (!revs->ignore_missing_links)
>>   			die("Failed to traverse parents of commit %s",
>> -			    oid_to_hex(&commit->object.oid));
>> +				oid_to_hex(&commit->object.oid));
> The above looks like a spurious and accidental whitespace change, doesn't
> it?

Correct. Thanks for finding it.

>
>> +	}
>> +
> All right, the loop below looks like the inner loop of the Kahn's
> algorithm, i.e.:
>
>        for each node m with an edge e from n to m do
>            remove edge e from the graph
>            if m has no other incoming edges then
>                insert m into S
>
>
>> +	for (p = commit->parents; p; p = p->next) {
>> +		struct commit *parent = p->item;
>> +		int *pi;
>> +
>> +		if (parse_commit_gently(parent, 1) < 0)
>> +			continue;
> All right, we need to parse parent commit to ensure that we can access
> its generation number.
>
>> +
>> +		if (parent->generation < info->min_generation) {
>> +			info->min_generation = parent->generation;
>> +			compute_indegrees_to_depth(revs);
>> +		}
> The above ensures that the parent will have correctly calculated
> in-degree.  Looks all right.
>
>> +
>> +		pi = indegree_slab_at(&info->indegree, parent);
>> +
>> +		(*pi)--;
>            remove edge e from the graph
>
>> +		if (*pi == 1)
>> +			prio_queue_put(&info->topo_queue, parent);
> If parent has no incoming edges (indegree == 1 == INDEGREE_ZERO), then
> insert it into topo_queue.
>
>            if m has no other incoming edges then
>                insert m into S
>
>> +
>> +		if (revs->first_parent_only)
>> +			return;
>>   	}
>>   }
> Looks all right.
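
For readers following the review above, the two loops (counting in-degrees, then peeling) can be sketched in isolation. This is only a minimal sketch, not the code in revision.c: the graph is a small fixed-size array, the names `struct node`, `kahn_peel`, `MAX_NODES` and `MAX_PARENTS` are hypothetical, and all in-degrees are computed up front instead of lazily by generation number. It does keep the "in-degree plus one" encoding discussed above, where 0 means uninitialized and 1 means ready for output.

```c
#include <assert.h>

#define MAX_NODES 16
#define MAX_PARENTS 2

/* Hypothetical stand-in for a commit: parents are stored as array indices. */
struct node {
	int nr_parents;
	int parents[MAX_PARENTS];
};

/*
 * Emit every node reachable from 'tip' so that each child precedes all
 * of its parents.  indegree[] holds "in-degree plus one":
 * 0 = not seen yet, 1 = no unvisited children left.
 * Returns the number of nodes written to out[].
 */
static int kahn_peel(const struct node *g, int n, int tip, int *out)
{
	int indegree[MAX_NODES] = { 0 };
	int stack[MAX_NODES];
	int top = 0, nr_out = 0, i;

	assert(n <= MAX_NODES && tip >= 0 && tip < n);

	/* Pass 1: count incoming edges of everything reachable from tip. */
	indegree[tip] = 1;
	stack[top++] = tip;
	while (top) {
		int c = stack[--top];
		for (i = 0; i < g[c].nr_parents; i++) {
			int p = g[c].parents[i];
			if (indegree[p]) {
				indegree[p]++;
			} else {
				indegree[p] = 2;	/* first incoming edge */
				stack[top++] = p;	/* first visit: walk its parents too */
			}
		}
	}

	/* Pass 2: peel nodes whose stored value has dropped to 1. */
	stack[top++] = tip;
	while (top) {
		int c = stack[--top];
		out[nr_out++] = c;
		indegree[c] = 0;	/* mark as already output */
		for (i = 0; i < g[c].nr_parents; i++) {
			int p = g[c].parents[i];
			if (--indegree[p] == 1)	/* "remove edge"; last child gone? */
				stack[top++] = p;
		}
	}
	return nr_out;
}
```

Because the ready set in this sketch is a plain LIFO stack, a merge's second parent is peeled before its first parent, which is roughly the behaviour the NULL-compare topo_queue is after; the real code differs in how it fills the in-degree slab incrementally.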

Thanks for taking the time on this huge patch!

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v4 7/7] t6012: make rev-list tests more interesting
  2018-10-16 22:36       ` [PATCH v4 7/7] t6012: make rev-list tests more interesting Derrick Stolee via GitGitGadget
@ 2018-10-23 15:48         ` Jakub Narebski
  0 siblings, 0 replies; 87+ messages in thread
From: Jakub Narebski @ 2018-10-23 15:48 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Jeff King, Junio C Hamano, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> As we are working to rewrite some of the revision-walk machinery,
> there could easily be some interesting interactions between the
> options that force topological constraints (--topo-order,
> --date-order, and --author-date-order) along with specifying a
> path.
>
> Add extra tests to t6012-rev-list-simplify.sh to add coverage of
> these interactions. To ensure interesting things occur, alter the
> repo data shape to have different orders depending on topo-, date-,
> or author-date-order.

Very nice, though I have noticed (please correct me if I am wrong) that
in all cases a path-limited query gives the same result for
--topo-order and for --date-order, as opposed to the three different results
that the three revision sorting modes give for a path-less query.

>
> When testing using GIT_TEST_COMMIT_GRAPH, this assists in covering
> the new logic for topo-order walks using generation numbers. The
> extra tests can be added independently.

Good.  I guess we are mainly interested in tests without limits and
exclusions, i.e. A or A B and not A..B or A...B walks.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/t6012-rev-list-simplify.sh | 45 ++++++++++++++++++++++++++++--------
>  1 file changed, 36 insertions(+), 9 deletions(-)
>
> diff --git a/t/t6012-rev-list-simplify.sh b/t/t6012-rev-list-simplify.sh
> index b5a1190ffe..a10f0df02b 100755
> --- a/t/t6012-rev-list-simplify.sh
> +++ b/t/t6012-rev-list-simplify.sh
> @@ -12,6 +12,22 @@ unnote () {
>  	git name-rev --tags --stdin | sed -e "s|$OID_REGEX (tags/\([^)]*\)) |\1 |g"
>  }
>  
> +#
> +# Create a test repo with interesting commit graph:
> +#
> +# A--B----------G--H--I--K--L
> +#  \  \           /     /
> +#   \  \         /     /
> +#    C------E---F     J
> +#        \_/
> +#
> +# The commits are laid out from left-to-right starting with
> +# the root commit A and terminating at the tip commit L.

Do I understand correctly that this is a visualization of the history
created by the existing code (which is very nice to have)?

> +#
> +# There are a few places where we adjust the commit date or
> +# author date to make the --topo-order, --date-order, and
> +# --author-date-order flags produce different output.

Sidenote: it looks like "a few places" is "one place" for now...

> +
>  test_expect_success setup '
>  	echo "Hi there" >file &&
>  	echo "initial" >lost &&
> @@ -21,10 +37,18 @@ test_expect_success setup '
>  
>  	git branch other-branch &&
>  
> +	git symbolic-ref HEAD refs/heads/unrelated &&
> +	git rm -f "*" &&
> +	echo "Unrelated branch" >side &&
> +	git add side &&
> +	test_tick && git commit -m "Side root" &&
> +	note J &&
> +	git checkout master &&

I see that this fragment is moved earlier, but I don't know what
consequences that has.

> +
>  	echo "Hello" >file &&
>  	echo "second" >lost &&
>  	git add file lost &&
> -	test_tick && git commit -m "Modified file and lost" &&
> +	test_tick && GIT_AUTHOR_DATE=$(($test_tick + 120)) git commit -m "Modified file and lost" &&
>  	note B &&

Nice trick, though I think it produces slightly unrealistic history (at
least in the absence of clock skew).  Author dates are ordinarily
earlier than or equal to commit dates, and commits can be authored in
a different order than they were committed.

>  
>  	git checkout other-branch &&
> @@ -63,13 +87,6 @@ test_expect_success setup '
>  	test_tick && git commit -a -m "Final change" &&
>  	note I &&
>  
> -	git symbolic-ref HEAD refs/heads/unrelated &&
> -	git rm -f "*" &&
> -	echo "Unrelated branch" >side &&
> -	git add side &&
> -	test_tick && git commit -m "Side root" &&
> -	note J &&
> -
>  	git checkout master &&
>  	test_tick && git merge --allow-unrelated-histories -m "Coolest" unrelated &&
>  	note K &&
> @@ -103,14 +120,24 @@ check_result () {
>  	check_outcome success "$@"
>  }
>  
> -check_result 'L K J I H G F E D C B A' --full-history
> +check_result 'L K J I H F E D C G B A' --full-history --topo-order
> +check_result 'L K I H G F E D C B J A' --full-history
> +check_result 'L K I H G F E D C B J A' --full-history --date-order
> +check_result 'L K I H G F E D B C J A' --full-history --author-date-order
>  check_result 'K I H E C B A' --full-history -- file
>  check_result 'K I H E C B A' --full-history --topo-order -- file
>  check_result 'K I H E C B A' --full-history --date-order -- file
> +check_result 'K I H E B C A' --full-history --author-date-order -- file
>  check_result 'I E C B A' --simplify-merges -- file
> +check_result 'I E C B A' --simplify-merges --topo-order -- file
> +check_result 'I E C B A' --simplify-merges --date-order -- file
> +check_result 'I E B C A' --simplify-merges --author-date-order -- file
>  check_result 'I B A' -- file
>  check_result 'I B A' --topo-order -- file
> +check_result 'I B A' --date-order -- file
> +check_result 'I B A' --author-date-order -- file
>  check_result 'H' --first-parent -- another-file
> +check_result 'H' --first-parent --topo-order -- another-file
>  
>  check_result 'E C B A' --full-history E -- lost
>  test_expect_success 'full history simplification without parent' '

More tests, looks good.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic
  2018-10-22  1:51             ` Derrick Stolee
  2018-10-22  1:55               ` [RFC PATCH] revision.c: use new algorithm in A..B case Derrick Stolee
@ 2018-10-25  8:28               ` Junio C Hamano
  2018-10-26 20:56                 ` Jakub Narebski
  1 sibling, 1 reply; 87+ messages in thread
From: Junio C Hamano @ 2018-10-25  8:28 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jakub Narebski, Derrick Stolee via GitGitGadget, git, Jeff King,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>     time git log --topo-order -10 master >/dev/null
>
>     time git log --topo-order -10 maint..master >/dev/null
>
> I get 0.39s for the first call and 0.01s for the second. (Note: I
> specified "-10" to ensure we are only writing 10 commits and the
> output size does not factor into the time.) This is because the first
> walks the entire history, while the second uses the heuristic walk to
> identify a much smaller subgraph that the topo-order algorithm uses.

The algorithm can be fooled by skewed timestamps (i.e. that SLOP
thing tries to work around), but is helped by being able to leave
early, and it will give us the correct answer as long as there is no
timestamp inversion.

But monotonically increasing "timestamp" without inversion is what
we invented "generation numbers" for, no?  When there is no
timestamp inversion, would a walk based on commit timestamps walk
smaller set than a walk based on commit generation numbers?

> Just as before, by using this algorithm for the B..A case, we are
> adding an extra restriction on the algorithm: always be correct. This
> results in us walking a larger set (everything reachable from B or A
> with generation number at least the smallest generation of a commit
> reachable from only one).
>
> I believe this can be handled by using a smarter generation number
> (one that relies on commit date as a heuristic, but still have enough
> information to guarantee topological relationships), and I've already
> started testing a few of these directions.

Good to hear.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic
  2018-10-11 16:21         ` Derrick Stolee
@ 2018-10-25  9:43           ` Jeff King
  2018-10-25 13:00             ` Derrick Stolee
  0 siblings, 1 reply; 87+ messages in thread
From: Jeff King @ 2018-10-25  9:43 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Junio C Hamano, Derrick Stolee

On Thu, Oct 11, 2018 at 12:21:44PM -0400, Derrick Stolee wrote:

> > > 2. INDEGREE: using the indegree_queue priority queue (ordered
> > >     by maximizing the generation number), add one to the in-
> > >     degree of each parent for each commit that is walked. Since
> > >     we walk in order of decreasing generation number, we know
> > >     that discovering an in-degree value of 0 means the value for
> > >     that commit was not initialized, so should be initialized to
> > >     two. (Recall that in-degree value "1" is what we use to say a
> > >     commit is ready for output.) As we iterate the parents of a
> > >     commit during this walk, ensure the EXPLORE walk has walked
> > >     beyond their generation numbers.
> > I wondered how this would work for INFINITY. We can't know the order of
> > a bunch of INFINITY nodes at all, so we never know when their in-degree
> > values are "done". But if I understand the EXPLORE walk, we'd basically
> > walk all of INFINITY down to something with a real generation number. Is
> > that right?
> > 
> > But after that, I'm not totally clear on why we need this INDEGREE walk.
> 
> The INDEGREE walk is an important element for Kahn's algorithm. The final
> output order is dictated by peeling commits of "indegree zero" to ensure all
> children are output before their parents. (Note: since we use literal 0 to
> mean "uninitialized", we peel commits when the indegree slab has value 1.)
> 
> This walk replaces the indegree logic from sort_in_topological_order(). That
> method performs one walk that fills the indegree slab, then another walk
> that peels the commits with indegree 0 and inserts them into a list.

I guess my big question here was: if we have generation numbers, do we
need Kahn's algorithm? That is, in a fully populated set of generation
numbers (i.e., no INFINITY), we could always just pick a commit with the
highest generation number to show.

So if we EXPLORE down to a real generation number in phase 1, why do we
need to care about INDEGREE anymore? Or am I wrong that we always have a
real generation number (i.e., not INFINITY) after EXPLORE? (And if so,
why is exploring to a real generation number a bad idea; presumably
it's due to a worst-case that goes deeper than we'd otherwise need to
here).

> [...]

Everything else you said here made perfect sense.

-Peff

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic
  2018-10-25  9:43           ` Jeff King
@ 2018-10-25 13:00             ` Derrick Stolee
  0 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-10-25 13:00 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee via GitGitGadget, git, Junio C Hamano, Derrick Stolee

On 10/25/2018 5:43 AM, Jeff King wrote:
> On Thu, Oct 11, 2018 at 12:21:44PM -0400, Derrick Stolee wrote:
>
>>>> 2. INDEGREE: using the indegree_queue priority queue (ordered
>>>>      by maximizing the generation number), add one to the in-
>>>>      degree of each parent for each commit that is walked. Since
>>>>      we walk in order of decreasing generation number, we know
>>>>      that discovering an in-degree value of 0 means the value for
>>>>      that commit was not initialized, so should be initialized to
>>>>      two. (Recall that in-degree value "1" is what we use to say a
>>>>      commit is ready for output.) As we iterate the parents of a
>>>>      commit during this walk, ensure the EXPLORE walk has walked
>>>>      beyond their generation numbers.
>>> I wondered how this would work for INFINITY. We can't know the order of
>>> a bunch of INFINITY nodes at all, so we never know when their in-degree
>>> values are "done". But if I understand the EXPLORE walk, we'd basically
>>> walk all of INFINITY down to something with a real generation number. Is
>>> that right?
>>>
>>> But after that, I'm not totally clear on why we need this INDEGREE walk.
>> The INDEGREE walk is an important element for Kahn's algorithm. The final
>> output order is dictated by peeling commits of "indegree zero" to ensure all
>> children are output before their parents. (Note: since we use literal 0 to
>> mean "uninitialized", we peel commits when the indegree slab has value 1.)
>>
>> This walk replaces the indegree logic from sort_in_topological_order(). That
>> method performs one walk that fills the indegree slab, then another walk
>> that peels the commits with indegree 0 and inserts them into a list.
> I guess my big question here was: if we have generation numbers, do we
> need Kahn's algorithm? That is, in a fully populated set of generation
> numbers (i.e., no INFINITY), we could always just pick a commit with the
> highest generation number to show.
>
> So if we EXPLORE down to a real generation number in phase 1, why do we
> need to care about INDEGREE anymore? Or am I wrong that we always have a
> real generation number (i.e., not INFINITY) after EXPLORE? (And if so,
> why is exploring to a real generation number a bad idea; presumably
> it's due to a worst-case that goes deeper than we'd otherwise need to
> here).

The issue is that our final order (used by level 3) is unrelated to
generation number. Yes, if we prioritized by generation number then we
could output the commits in _some_ order that doesn't violate
topological constraints. However, we are asking for a priority that
differs from the generation number priority.

In the case of "--topo-order", we want to output the commits reachable 
from the second parent of a merge before the commits reachable from the 
first parent. However, in most cases the generation number of the first 
parent is higher than the second parent (there are more things in the 
merge chain than in a small topic that got merged). The INDEGREE is what 
allows us to know when we can peel these commits while still respecting 
the priority we want at the end.
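
To make the point about generation numbers concrete, here is a minimal sketch that computes them for a toy history. It assumes the definition used by the commit-graph (a commit with no parents has generation 1; otherwise one more than the largest generation among its parents). The array layout and the names `struct commit_node` and `compute_generations` are hypothetical, chosen only for illustration.

```c
#include <assert.h>

/*
 * Hypothetical toy commit: parents are array indices, and every parent
 * index is larger than the child's own index (tips first, roots last).
 */
struct commit_node {
	int nr_parents;
	int parents[2];
};

/* gen[i] = 1 for a root, else 1 + max(gen[parent]) over all parents. */
static void compute_generations(const struct commit_node *g, int n, int *gen)
{
	int i, j;

	/* Walk from the roots (high indices) toward the tips (low indices). */
	for (i = n - 1; i >= 0; i--) {
		int max = 0;
		for (j = 0; j < g[i].nr_parents; j++) {
			int p = g[i].parents[j];
			assert(p > i && p < n);	/* parent must already be computed */
			if (gen[p] > max)
				max = gen[p];
		}
		gen[i] = max + 1;
	}
}
```

Take a merge (index 0) whose first parent heads a chain 1, 2, 3 and whose second parent is a one-commit topic (index 4), all sitting on a shared root (index 5). The first parent gets generation 4 and the topic commit gets generation 2, so a walk ordered purely by generation would show the first-parent chain before the topic, which is not the order --topo-order wants.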

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v4 6/7] revision.c: generation-based topo-order algorithm
  2018-10-23 13:54           ` Derrick Stolee
@ 2018-10-26 16:55             ` Jakub Narebski
  0 siblings, 0 replies; 87+ messages in thread
From: Jakub Narebski @ 2018-10-26 16:55 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Jeff King, Junio C Hamano,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:
> On 10/22/2018 9:37 AM, Jakub Narebski wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]
>> Sidenote: the new algorithm looks a bit like Unix pipeline, where each
>> step of pipeline does not output much more than next step needs /
>> requests.
>
> That's essentially the idea.

Some of the newer languages have built-in support for a similar kind of
pipeline for connecting processes, be it channels in Go or supplies and
suppliers in Perl6.  I wonder if there exists some library implementing
this kind of construct in C.

That aside, I wonder whether, once there is support for more
reachability indices than generation numbers, it wouldn't be better
to pass a commit as a limiter (up to this commit) rather than a specific
index, as is done now with the generation number.  Just food for thought...

[...]
>>> In my local testing, I used the following Git commands on the
>>> Linux repository in three modes: HEAD~1 with no commit-graph,
>>> HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
>>> allows comparing the benefits we get from parsing commits from
>>> the commit-graph and then again the benefits we get by
>>> restricting the set of commits we walk.
>>>
>>> Test: git rev-list --topo-order -100 HEAD
>>> HEAD~1, no commit-graph: 6.80 s
>>> HEAD~1, w/ commit-graph: 0.77 s
>>>    HEAD, w/ commit-graph: 0.02 s
>>>
>>> Test: git rev-list --topo-order -100 HEAD -- tools
>>> HEAD~1, no commit-graph: 9.63 s
>>> HEAD~1, w/ commit-graph: 6.06 s
>>>    HEAD, w/ commit-graph: 0.06 s
>>>
>>> This speedup is due to a few things. First, the new generation-
>>> number-enabled algorithm walks commits on order of the number of
>>> results output (subject to some branching structure expectations).
>>> Since we limit to 100 results, we are running a query similar to
>>> filling a single page of results. Second, when specifying a path,
>>> we must parse the root tree object for each commit we walk. The
>>> previous benefits from the commit-graph are entirely from reading
>>> the commit-graph instead of parsing commits. Since we need to
>>> parse trees for the same number of commits as before, we slow
>>> down significantly from the non-path-based query.
>>>
>>> For the test above, I specifically selected a path that is changed
>>> frequently, including by merge commits. A less-frequently-changed
>>> path (such as 'README') has similar end-to-end time since we need
>>> to walk the same number of commits (before determining we do not
>>> have 100 hits). However, we get the benefit that the output is
>>> presented to the user as it is discovered, much the same as a
>>> normal 'git log' command (no '--topo-order'). This is an improved
>>> user experience, even if the command has the same runtime.
>>>
>> First, do I understand correctly that in the first case the gains from
>> the new algorithm are so slim because with a commit-graph file and no path
>> limiting we don't hit the repository anyway; we walk fewer commits, but
>> reading commit data from the commit-graph file is fast?
>
> If you mean 0.77s to 0.02s is "slim" then yes, it is because the
> commit-graph command already made a full walk of the commit history
> faster. (I'm only poking at this because the _relative_ improvement is
> significant, even if the command was already sub-second.)

First, you didn't provide us with percentages, i.e. relative improvement
(and I am lazy).  Second, 0.02s can be within the margin of error,
depending on how it is measured, and how stable this measurement is.

>> Second, I wonder if there is some easy way to perform automatic latency
>> tests, i.e. how fast does Git show the first page of output...
>
> I have talked with Jeff Hostetler about this, to see if we can have a
> "time to first page" traced with trace2, but we don't seem to have
> access to that information within Git. We would need to insert it into
> the pager. The "-100" is used instead.

Perhaps another solution to the problem of "first page of output"
latency tests could be feasible, namely create a helper test-pager-1p
"pager" program that would automatically quit after first page of
output; or perhaps even one that benchmarks each page of output
automatically.

There exists 'pv' (pipe viewer) program for pipes, so I think it would
be possible to do equivalent, but as a pager.

[...]
>>> +static inline void test_flag_and_insert(struct prio_queue *q, struct commit *c, int flag)
>>> +{
>>> +	if (c->object.flags & flag)
>>> +		return;
>>> +
>>> +	c->object.flags |= flag;
>>> +	prio_queue_put(q, c);
>>> +}
>>
>> This is an independent change, though I see that it is quite specific
>> (as opposed to quite generic prio_queue_peek() operation added earlier
>> in this series), so it does not make much sense as standalone change.
>>
>> It inserts commit into priority queue only if it didn't have flags set,
>> and sets the flag (so we won't add it to the queue again, not without
>> unsetting the flag), am I correct?
>
> Yes, this pattern of using a flag to avoid duplicate entries in the
> priority queue appears in multiple walks. It wasn't needed before. We
> call it four times in the code below.

>> I guess that we use test_flag_and_insert() instead of prio_queue_put()
>> to avoid duplicate entries in the queue.  I think the queue is initially
>> populated with the starting commits, but those need not to be
>> unreachable from each other, and walking down parents we can encounter
>> starting commit already in the queue.  Am I correct?
>
> We can also reach commits in multiple ways, so the initial conditions
> are not the only ways to insert duplicates.

Right.
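
The flag-guarded insert pattern discussed above can be shown in a self-contained form. This is a minimal sketch under stated assumptions, not Git's implementation: the fixed-size `struct queue`, `struct item`, and the flag names `SEEN_EXPLORE` and `SEEN_INDEGREE` are hypothetical stand-ins for `struct prio_queue`, `struct commit`, and the TOPO_WALK_* flags.

```c
#include <assert.h>

#define QUEUE_MAX 32
#define SEEN_EXPLORE  (1u << 0)	/* hypothetical stand-in for TOPO_WALK_EXPLORED */
#define SEEN_INDEGREE (1u << 1)	/* hypothetical stand-in for TOPO_WALK_INDEGREE */

struct item {
	unsigned flags;
	int id;
};

struct queue {
	struct item *entries[QUEUE_MAX];
	int nr;
};

/*
 * Append 'it' to 'q' unless 'flag' shows it was queued there before.
 * Returns 1 if the item was added, 0 if it was skipped as a duplicate.
 */
static int flag_and_insert(struct queue *q, struct item *it, unsigned flag)
{
	if (it->flags & flag)
		return 0;

	assert(q->nr < QUEUE_MAX);
	it->flags |= flag;
	q->entries[q->nr++] = it;
	return 1;
}
```

Using one flag bit per queue lets the same item enter each queue exactly once, no matter how many paths reach it.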

[...]
>>> +	if (c->object.flags & UNINTERESTING)
>>> +		mark_parents_uninteresting(c);
>>> +
>>> +	for (p = c->parents; p; p = p->next)
>>> +		test_flag_and_insert(&info->explore_queue, p->item, TOPO_WALK_EXPLORED);
>>
>> Do we need to insert parents to the queue even if they were marked
>> UNINTERESTING?
>
> We need to propagate the UNINTERESTING flag to our parents. That
> propagation happens in process_parents().

I think I understand.  We need to propagate the UNINTERESTING flag down the
chain, don't we?

[...]
>>>     static void init_topo_walk(struct rev_info *revs)
>>>   {
>>>   	struct topo_walk_info *info;
>>> +	struct commit_list *list;
>>>   	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
>>>   	info = revs->topo_walk_info;
>>>   	memset(info, 0, sizeof(struct topo_walk_info));
>>>   -	limit_list(revs);
>>> -	sort_in_topological_order(&revs->commits, revs->sort_order);
>>> +	init_indegree_slab(&info->indegree);
>>> +	memset(&info->explore_queue, '\0', sizeof(info->explore_queue));
>>> +	memset(&info->indegree_queue, '\0', sizeof(info->indegree_queue));
>>> +	memset(&info->topo_queue, '\0', sizeof(info->topo_queue));
>>
>> Why does this memset use '\0' as a filler value and not 0?  The queues are
>> not strings [and you use 0 in other places].

I think you missed answering about this issue.

[...]
>> Looks all right.
>
> Thanks for taking the time on this huge patch!

You are welcome.

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic
  2018-10-25  8:28               ` [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic Junio C Hamano
@ 2018-10-26 20:56                 ` Jakub Narebski
  0 siblings, 0 replies; 87+ messages in thread
From: Jakub Narebski @ 2018-10-26 20:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Derrick Stolee via GitGitGadget, git, Jeff King,
	Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

> Derrick Stolee <stolee@gmail.com> writes:
>
>>     time git log --topo-order -10 master >/dev/null
>>
>>     time git log --topo-order -10 maint..master >/dev/null
>>
>> I get 0.39s for the first call and 0.01s for the second. (Note: I
>> specified "-10" to ensure we are only writing 10 commits and the
>> output size does not factor into the time.) This is because the first
>> walks the entire history, while the second uses the heuristic walk to
>> identify a much smaller subgraph that the topo-order algorithm uses.
>
> The algorithm can be fooled by skewed timestamps (i.e. that SLOP
> thing tries to work around), but is helped by being able to leave
> early, and it will give us the correct answer as long as there is no
> timestamp inversion.
>
> But monotonically increasing "timestamp" without inversion is what
> we invented "generation numbers" for, no?  When there is no
> timestamp inversion, would a walk based on commit timestamps walk
> smaller set than a walk based on commit generation numbers?

The problem, as far as I understand it, is in the B..A case, or more
exactly in the walk from the exclusion set.  There can be cases where
there are two paths from a commit in the exclusion set, and sorting by
generation number will always walk the longer one, while the heuristic
walk based on date order traverses a smaller subgraph.

This problem was described in "[PATCH 1/1] commit: don't use generation
numbers if not needed" [1]; relevant fragment cited below:

DS> For instance, computing the merge-base between consecutive versions of
DS> the Linux kernel has no effect for versions after v4.9, but 'git
DS> merge-base v4.8 v4.9' presents a performance regression [...]
DS> 
DS> The topology of this case can be described in a simplified way
DS> here:
DS> 
DS>   v4.9
DS>    |  \
DS>    |   \
DS>   v4.8  \
DS>    | \   \
DS>    |  \   |
DS>   ...  A  B
DS>    |  /  /
DS>    | /  /
DS>    |/__/
DS>    C
DS> 
DS> Here, the "..." means "a very long line of commits". By generation
DS> number, A and B have generation one more than C. However, A and B
DS> have commit date higher than most of the commits reachable from
DS> v4.8. When the walk reaches v4.8, we realize that it has PARENT1
DS> and PARENT2 flags, so everything it can reach is marked as STALE,
DS> including A. B has only the PARENT1 flag, so is not STALE.
DS> 
DS> When paint_down_to_common() is run using
DS> compare_commits_by_commit_date, A and B are removed from the queue
DS> early and C is inserted into the queue. At this point, C and the
DS> rest of the queue entries are marked as STALE. The loop then
DS> terminates.
DS> 
DS> When paint_down_to_common() is run using
DS> compare_commits_by_gen_then_commit_date, B is removed from the
DS> queue only after the many commits reachable from v4.8 are explored.
DS> This causes the loop to run longer. The reason for this regression
DS> is simple: the queue order is intended to not explore a commit
DS> until everything that _could_ reach that commit is explored. From
DS> the information gathered by the original ordering, we have no
DS> guarantee that there is not a commit D reachable from v4.8 that
DS> can also reach B. We gained absolute correctness in exchange for
DS> a performance regression.

[1]: https://public-inbox.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/T/#u

>> Just as before, by using this algorithm for the B..A case, we are
>> adding an extra restriction on the algorithm: always be correct. This
>> results in us walking a larger set (everything reachable from B or A
>> with generation number at least the smallest generation of a commit
>> reachable from only one).
>>
>> I believe this can be handled by using a smarter generation number
>> (one that relies on commit date as a heuristic, but still has enough
>> information to guarantee topological relationships), and I've already
>> started testing a few of these directions.
>
> Good to hear.

I'm not sure if you can enhance generation numbers, or rather the use /
sorting of generation numbers, in such a way.

With generation numbers, if you have two possible paths to the merge
commit, and one is much longer than the other, then the commits on this
longer path will have larger generation numbers.


        0     1     2     3     4     5     6     7       = gen(c) - gen(B)
    --- B<----*<----*<----*<----*<----*<----*<----M ---
        ^                                        /
         \--------------------------x<--------x</         = gen(c) - gen(B)
                                    1         2

    ===================== commit date ======================>

Using a priority queue that selects the max generation number would then
always walk the longer path (or walk the longer path first).
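
The offsets in the diagram can be reproduced with a minimal sketch,
assuming the usual definition gen(c) = 1 + max(gen(parents)) with roots
at generation 1.  The toy_* names are hypothetical and only model the
diagram; this is not Git's struct commit or its commit-graph code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PARENTS 2

/* Hypothetical toy type for illustration; not Git's struct commit. */
struct toy_commit {
	struct toy_commit *parents[TOY_MAX_PARENTS];
	size_t nr_parents;
	uint32_t generation; /* 0 means "not computed yet" */
};

/* gen(c) = 1 + max over parents of gen(p); a root gets generation 1. */
uint32_t toy_generation(struct toy_commit *c)
{
	uint32_t max_parent = 0;
	size_t i;

	if (c->generation)
		return c->generation;
	assert(c->nr_parents <= TOY_MAX_PARENTS);
	for (i = 0; i < c->nr_parents; i++) {
		uint32_t g = toy_generation(c->parents[i]);
		if (g > max_parent)
			max_parent = g;
	}
	c->generation = max_parent + 1;
	return c->generation;
}

/* Initialize a linear chain of nr commits on top of base; return its tip. */
static struct toy_commit *toy_chain(struct toy_commit *chain, size_t nr,
				    struct toy_commit *base)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		chain[i].parents[0] = i ? &chain[i - 1] : base;
		chain[i].parents[1] = NULL;
		chain[i].nr_parents = 1;
		chain[i].generation = 0;
	}
	return &chain[nr - 1];
}

/*
 * Build the diagram (long path of 6, shortcut of 2, merge M) and report
 * gen(c) - gen(B) for both parents of M and for M itself.
 */
void toy_diagram(uint32_t *long_tip, uint32_t *short_tip, uint32_t *merge)
{
	struct toy_commit b = { { NULL, NULL }, 0, 0 };
	struct toy_commit long_path[6], shortcut[2], m;

	m.parents[0] = toy_chain(long_path, 6, &b);
	m.parents[1] = toy_chain(shortcut, 2, &b);
	m.nr_parents = 2;
	m.generation = 0;

	toy_generation(&m);
	*long_tip = m.parents[0]->generation - b.generation;
	*short_tip = m.parents[1]->generation - b.generation;
	*merge = m.generation - b.generation;
}
```

Calling toy_diagram() yields offsets 6 and 2 for the two parents of M
and 7 for M itself, so a max-generation queue holding both parents
would hand out the long-path tip first.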

This does not matter for the walk from the inclusive set (A in B..A), as
we want to walk all commits anyway; we walk both paths, it does not
matter which we walk first (it may change the unimportant parts of
topological sort, though).

For the walk from the exclusive set (B in B..A), any path is good; we
don't need and do not want to walk all paths.  Sorting by generation
numbers will always choose the longer path, while sorting by commit date
may walk the "shortcut" first, marking commits reachable from the
inclusive subset (from A) STALE earlier, so that we notice the all-stale
condition and end the walk sooner.
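
The difference between the two orderings can be sketched with toy
comparators, loosely modeled on compare_commits_by_commit_date and
compare_commits_by_gen_then_commit_date.  Everything named toy_* is
hypothetical, and the dates are made-up values chosen so that the
shortcut tip is the newer commit, as the date axis in the diagram
suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queue entry; generation is the offset from the diagram. */
struct toy_entry {
	const char *name;
	uint32_t generation;
	int64_t date;
};

typedef int (*toy_cmp_fn)(const struct toy_entry *a,
			  const struct toy_entry *b);

/* Negative when a should leave the queue before b: newer date first. */
int toy_by_date(const struct toy_entry *a, const struct toy_entry *b)
{
	if (a->date > b->date)
		return -1;
	if (a->date < b->date)
		return 1;
	return 0;
}

/* Higher generation first; fall back to date only on a tie. */
int toy_by_gen_then_date(const struct toy_entry *a,
			 const struct toy_entry *b)
{
	if (a->generation > b->generation)
		return -1;
	if (a->generation < b->generation)
		return 1;
	return toy_by_date(a, b);
}

/* Return the entry that a priority queue using cmp would hand out first. */
const struct toy_entry *toy_first(const struct toy_entry *queue, size_t nr,
				  toy_cmp_fn cmp)
{
	const struct toy_entry *best;
	size_t i;

	assert(nr > 0);
	best = &queue[0];
	for (i = 1; i < nr; i++)
		if (cmp(&queue[i], best) < 0)
			best = &queue[i];
	return best;
}
```

With a queue holding { "long-tip", 6, 100 } and { "shortcut-tip", 2, 105 },
toy_first() picks the long-path tip under toy_by_gen_then_date and the
shortcut tip under toy_by_date.  Only the date ordering gets a chance to
reach the already-painted commits through the short path first.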

I think that in the case of walking from the exclusion subset, using a
positive-cut reachability index would be more useful than trying to come
up with a "smarter generation number" (though I am not saying that it
would be impossible).  For example, using reachability bitmap indices, or
post-order in a spanning tree (the latter described in [2]) to mark more
commits stale than just parents, may help more.

Actually, this is something independent from "smarter generation
numbers", and can only help...

[2]: "[RFC] Other chunks for commit-graph, part 2 - reachability indexes"
     https://public-inbox.org/git/86muxcuyod.fsf@gmail.com/

-- 
Jakub Narębski


* Re: [PATCH v4 0/7] Use generation numbers for --topo-order
  2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (7 preceding siblings ...)
  2018-10-21 12:57       ` [PATCH v4 0/7] Use generation numbers for --topo-order Jakub Narebski
@ 2018-11-01  5:21       ` Junio C Hamano
  2018-11-01 13:49         ` Derrick Stolee
  2018-11-01 13:46       ` [PATCH v5 " Derrick Stolee
  9 siblings, 1 reply; 87+ messages in thread
From: Junio C Hamano @ 2018-11-01  5:21 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, peff

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This patch series performs a decently-sized refactoring of the revision-walk
> machinery. Well, "refactoring" is probably the wrong word, as I don't
> actually remove the old code. Instead, when we see certain options in the
> 'rev_info' struct, we redirect the commit-walk logic to a new set of methods
> that distribute the workload differently. By using generation numbers in the
> commit-graph, we can significantly improve 'git log --graph' commands (and
> the underlying 'git rev-list --topo-order').

Review discussions seem to have petered out.  Would we merge this to
'next' and start cooking, perhaps for the remainder of this cycle?


* [PATCH v5 0/7] Use generation numbers for --topo-order
  2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
                         ` (8 preceding siblings ...)
  2018-11-01  5:21       ` Junio C Hamano
@ 2018-11-01 13:46       ` Derrick Stolee
  2018-11-01 13:46         ` [PATCH v5 1/7] prio-queue: add 'peek' operation Derrick Stolee
                           ` (6 more replies)
  9 siblings, 7 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-01 13:46 UTC (permalink / raw)
  To: git; +Cc: gitster, avarab, szeder.dev, peff, jnareb, Derrick Stolee

This patch series performs a decently-sized refactoring of the
revision-walk machinery. Well, "refactoring" is probably the wrong word,
as I don't actually remove the old code. Instead, when we see certain
options in the 'rev_info' struct, we redirect the commit-walk logic to
a new set of methods that distribute the workload differently. By using
generation numbers in the commit-graph, we can significantly improve
'git log --graph' commands (and the underlying 'git rev-list --topo-order').

On the Linux repository, I got the following performance results when
comparing to the previous version with or without a commit-graph:

    Test: git rev-list --topo-order -100 HEAD
    HEAD~1, no commit-graph: 6.80 s
    HEAD~1, w/ commit-graph: 0.77 s
      HEAD, w/ commit-graph: 0.02 s

    Test: git rev-list --topo-order -100 HEAD -- tools
    HEAD~1, no commit-graph: 9.63 s
    HEAD~1, w/ commit-graph: 6.06 s
      HEAD, w/ commit-graph: 0.06 s

If you want to read this series but are unfamiliar with the commit-graph
and generation numbers, then I recommend reading
`Documentation/technical/commit-graph.txt` or a blog post [1] I wrote on
the subject. In particular, the three-part walk described in "revision.c:
refactor basic topo-order logic" is present (but underexplained) as an
animated PNG [2].

**UPDATED** Now that we have had some review and some dogfooding, I'm
removing the paragraph I had here about "RFC quality". I think this is
ready to merge!

One notable case that is not included in this series is the case of a
history comparison such as 'git rev-list --topo-order A..B'. The existing
code in limit_list() has ways to cut the walk short when all pending
commits are UNINTERESTING. Since this code depends on commit_list instead
of the prio_queue we are using here, I chose to leave it untouched for now.
We can revisit it in a separate series later. Since handle_commit() turns
on revs->limited when a commit is UNINTERESTING, we do not hit the new
code in this case. Removing this 'revs->limited = 1;' line yields correct
results, but the performance can be worse.

**UPDATED** See the discussion about Generation Number V2 [4] for more
on this topic.

Changes in V5: Thanks Jakub for feedback on the huge commit! I think
I've responded to all the code feedback. See the range-diff at the
end of this cover-page.

Thanks,
-Stolee

[1] https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/
    Supercharging the Git Commit Graph III: Generations and Graph Algorithms

[2] https://msdnshared.blob.core.windows.net/media/2018/06/commit-graph-topo-order-b-a.png
    Animation showing three-part walk

[3] https://github.com/derrickstolee/git/tree/topo-order/test
    A branch containing this series along with commits to compute commit-graph in entire test suite.

[4] https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@gmail.com/
    [RFC] Generation Number v2

Note: I'm not submitting this version via GitGitGadget because it's
currently struggling with how to handle a PR in a conflict state.
The new flags in revision.h have a conflict with recent changes in
master.

Derrick Stolee (7):
  prio-queue: add 'peek' operation
  test-reach: add run_three_modes method
  test-reach: add rev-list tests
  revision.c: begin refactoring --topo-order logic
  commit/revisions: bookkeeping before refactoring
  revision.c: generation-based topo-order algorithm
  t6012: make rev-list tests more interesting

 commit.c                     |   9 +-
 commit.h                     |   7 +
 object.h                     |   4 +-
 prio-queue.c                 |   9 ++
 prio-queue.h                 |   6 +
 revision.c                   | 243 +++++++++++++++++++++++++++++++++--
 revision.h                   |   6 +
 t/helper/test-prio-queue.c   |  26 ++--
 t/t0009-prio-queue.sh        |  14 ++
 t/t6012-rev-list-simplify.sh |  45 +++++--
 t/t6600-test-reach.sh        |  96 +++++++++++++-
 11 files changed, 426 insertions(+), 39 deletions(-)


base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303
-- 
2.19.1.542.gc4df23f792

-->8--

1:  2358cfd5ed = 1:  7c75a56505 prio-queue: add 'peek' operation
2:  3a4b68e479 = 2:  686c4370de test-reach: add run_three_modes method
3:  12a3f6d367 = 3:  7410c00982 test-reach: add rev-list tests
4:  cd9eef9688 = 4:  5439e11e37 revision.c: begin refactoring --topo-order logic
5:  f3e291665d ! 5:  71554deb9b commit/revisions: bookkeeping before refactoring
    @@ -9,8 +9,8 @@
            compare_commits_by_author_date() in revision.c. These are used
            currently by sort_in_topological_order() in commit.c.
     
    -    2. Moving these methods to commit.h requires adding the author_slab
    -       definition to commit.h.
    +    2. Moving these methods to commit.h requires adding an author_date_slab
    +       declaration to commit.h. Consumers will need their own implementation.
     
         3. The add_parents_to_list() method in revision.c performs logic
            around the UNINTERESTING flag and other special cases depending
    @@ -31,8 +31,7 @@
      define_commit_slab(indegree_slab, int);
      
     -/* record author-date for each commit object */
    --define_commit_slab(author_date_slab, timestamp_t);
    -+implement_shared_commit_slab(author_date_slab, timestamp_t);
    + define_commit_slab(author_date_slab, timestamp_t);
      
     -static void record_author_date(struct author_date_slab *author_date,
     -			       struct commit *commit)
    @@ -69,8 +68,7 @@
      extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
      
     +/* record author-date for each commit object */
    -+define_shared_commit_slab(author_date_slab, timestamp_t);
    -+
    ++struct author_date_slab;
     +void record_author_date(struct author_date_slab *author_date,
     +			struct commit *commit);
     +
6:  aa0bb2221d ! 6:  84c142e0bc revision.c: generation-based topo-order algorithm
    @@ -195,6 +195,7 @@
      
     -struct topo_walk_info {};
     +define_commit_slab(indegree_slab, int);
    ++define_commit_slab(author_date_slab, timestamp_t);
     +
     +struct topo_walk_info {
     +	uint32_t min_generation;
    @@ -243,12 +244,12 @@
     +}
     +
     +static void explore_to_depth(struct rev_info *revs,
    -+			     uint32_t gen)
    ++			     uint32_t gen_cutoff)
     +{
     +	struct topo_walk_info *info = revs->topo_walk_info;
     +	struct commit *c;
     +	while ((c = prio_queue_peek(&info->explore_queue)) &&
    -+	       c->generation >= gen)
    ++	       c->generation >= gen_cutoff)
     +		explore_walk_step(revs);
     +}
     +
    @@ -266,9 +267,6 @@
     +
     +	explore_to_depth(revs, c->generation);
     +
    -+	if (parse_commit_gently(c, 1) < 0)
    -+		return;
    -+
     +	for (p = c->parents; p; p = p->next) {
     +		struct commit *parent = p->item;
     +		int *pi = indegree_slab_at(&info->indegree, parent);
    @@ -285,12 +283,13 @@
     +	}
     +}
     +
    -+static void compute_indegrees_to_depth(struct rev_info *revs)
    ++static void compute_indegrees_to_depth(struct rev_info *revs,
    ++				       uint32_t gen_cutoff)
     +{
     +	struct topo_walk_info *info = revs->topo_walk_info;
     +	struct commit *c;
     +	while ((c = prio_queue_peek(&info->indegree_queue)) &&
    -+	       c->generation >= info->min_generation)
    ++	       c->generation >= gen_cutoff)
     +		indegree_walk_step(revs);
     +}
      
    @@ -305,9 +304,9 @@
     -	limit_list(revs);
     -	sort_in_topological_order(&revs->commits, revs->sort_order);
     +	init_indegree_slab(&info->indegree);
    -+	memset(&info->explore_queue, '\0', sizeof(info->explore_queue));
    -+	memset(&info->indegree_queue, '\0', sizeof(info->indegree_queue));
    -+	memset(&info->topo_queue, '\0', sizeof(info->topo_queue));
    ++	memset(&info->explore_queue, 0, sizeof(info->explore_queue));
    ++	memset(&info->indegree_queue, 0, sizeof(info->indegree_queue));
    ++	memset(&info->topo_queue, 0, sizeof(info->topo_queue));
     +
     +	switch (revs->sort_order) {
     +	default: /* REV_SORT_IN_GRAPH_ORDER */
    @@ -329,23 +328,22 @@
     +	info->min_generation = GENERATION_NUMBER_INFINITY;
     +	for (list = revs->commits; list; list = list->next) {
     +		struct commit *c = list->item;
    -+		test_flag_and_insert(&info->explore_queue, c, TOPO_WALK_EXPLORED);
    -+		test_flag_and_insert(&info->indegree_queue, c, TOPO_WALK_INDEGREE);
     +
     +		if (parse_commit_gently(c, 1))
     +			continue;
    ++
    ++		test_flag_and_insert(&info->explore_queue, c, TOPO_WALK_EXPLORED);
    ++		test_flag_and_insert(&info->indegree_queue, c, TOPO_WALK_INDEGREE);
    ++
     +		if (c->generation < info->min_generation)
     +			info->min_generation = c->generation;
    -+	}
     +
    -+	for (list = revs->commits; list; list = list->next) {
    -+		struct commit *c = list->item;
     +		*(indegree_slab_at(&info->indegree, c)) = 1;
     +
     +		if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
     +			record_author_date(&info->author_date, c);
     +	}
    -+	compute_indegrees_to_depth(revs);
    ++	compute_indegrees_to_depth(revs, info->min_generation);
     +
     +	for (list = revs->commits; list; list = list->next) {
     +		struct commit *c = list->item;
    @@ -385,9 +383,8 @@
     +	if (process_parents(revs, commit, NULL, NULL) < 0) {
      		if (!revs->ignore_missing_links)
      			die("Failed to traverse parents of commit %s",
    --			    oid_to_hex(&commit->object.oid));
    -+				oid_to_hex(&commit->object.oid));
    -+	}
    + 			    oid_to_hex(&commit->object.oid));
    + 	}
     +
     +	for (p = commit->parents; p; p = p->next) {
     +		struct commit *parent = p->item;
    @@ -398,7 +395,7 @@
     +
     +		if (parent->generation < info->min_generation) {
     +			info->min_generation = parent->generation;
    -+			compute_indegrees_to_depth(revs);
    ++			compute_indegrees_to_depth(revs, info->min_generation);
     +		}
     +
     +		pi = indegree_slab_at(&info->indegree, parent);
    @@ -409,9 +406,10 @@
     +
     +		if (revs->first_parent_only)
     +			return;
    - 	}
    ++	}
      }
      
    + int prepare_revision_walk(struct rev_info *revs)
     
      diff --git a/revision.h b/revision.h
      --- a/revision.h
7:  a21febe112 = 7:  5479087812 t6012: make rev-list tests more interesting


* [PATCH v5 1/7] prio-queue: add 'peek' operation
  2018-11-01 13:46       ` [PATCH v5 " Derrick Stolee
@ 2018-11-01 13:46         ` Derrick Stolee
  2018-11-01 13:46         ` [PATCH v5 2/7] test-reach: add run_three_modes method Derrick Stolee
                           ` (5 subsequent siblings)
  6 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-01 13:46 UTC (permalink / raw)
  To: git; +Cc: gitster, avarab, szeder.dev, peff, jnareb, Derrick Stolee

When consuming a priority queue, it can be convenient to inspect
the next object that will be dequeued without actually dequeueing
it. Our existing library did not have such a 'peek' operation, so
add it as prio_queue_peek().

Add a reference-level comparison in t/helper/test-prio-queue.c
so this method is exercised by t0009-prio-queue.sh. Further, add
a test that checks the behavior when the compare function is NULL
(i.e. the queue becomes a stack).
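
As an illustration of why a peek is convenient, here is a minimal
sketch of a "dequeue only while the next item passes a cutoff" loop.
It assumes a hypothetical fixed-size toy_queue of ints with a linear
scan instead of a heap; it is not the prio_queue implementation, only
a model of its NULL-compare (stack) and compare (priority) modes.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CAP 16

/* Hypothetical fixed-size stand-in for struct prio_queue. */
struct toy_queue {
	int (*compare)(int a, int b); /* NULL: behave as a LIFO stack */
	int items[TOY_CAP];
	size_t nr;
};

/* Index of the item that would be dequeued next; queue must be non-empty. */
static size_t toy_next_index(const struct toy_queue *q)
{
	size_t i, best = 0;

	if (!q->compare)
		return q->nr - 1;
	for (i = 1; i < q->nr; i++)
		if (q->compare(q->items[i], q->items[best]) < 0)
			best = i;
	return best;
}

void toy_put(struct toy_queue *q, int v)
{
	assert(q->nr < TOY_CAP);
	q->items[q->nr++] = v;
}

/* Peek: report the next item without removing it. Returns 0 if empty. */
int toy_peek(const struct toy_queue *q, int *out)
{
	if (!q->nr)
		return 0;
	*out = q->items[toy_next_index(q)];
	return 1;
}

/* Get: remove and report the next item. Returns 0 if empty. */
int toy_get(struct toy_queue *q, int *out)
{
	size_t i;

	if (!q->nr)
		return 0;
	i = toy_next_index(q);
	*out = q->items[i];
	/* Close the gap while keeping insertion order for stack mode. */
	for (; i + 1 < q->nr; i++)
		q->items[i] = q->items[i + 1];
	q->nr--;
	return 1;
}

/* Comparator: larger values leave the queue first. */
int toy_max_first(int a, int b)
{
	return (a < b) - (a > b);
}

/* Dequeue only while the next item is at least cutoff; return the count. */
size_t toy_drain_to_cutoff(struct toy_queue *q, int cutoff)
{
	size_t removed = 0;
	int v;

	while (toy_peek(q, &v) && v >= cutoff) {
		toy_get(q, &v);
		removed++;
	}
	return removed;
}
```

Without a peek, the loop in toy_drain_to_cutoff() would have to dequeue
the first item below the cutoff and then put it back.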

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 prio-queue.c               |  9 +++++++++
 prio-queue.h               |  6 ++++++
 t/helper/test-prio-queue.c | 26 ++++++++++++++++++--------
 t/t0009-prio-queue.sh      | 14 ++++++++++++++
 4 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/prio-queue.c b/prio-queue.c
index a078451872..d3f488cb05 100644
--- a/prio-queue.c
+++ b/prio-queue.c
@@ -85,3 +85,12 @@ void *prio_queue_get(struct prio_queue *queue)
 	}
 	return result;
 }
+
+void *prio_queue_peek(struct prio_queue *queue)
+{
+	if (!queue->nr)
+		return NULL;
+	if (!queue->compare)
+		return queue->array[queue->nr - 1].data;
+	return queue->array[0].data;
+}
diff --git a/prio-queue.h b/prio-queue.h
index d030ec9dd6..682e51867a 100644
--- a/prio-queue.h
+++ b/prio-queue.h
@@ -46,6 +46,12 @@ extern void prio_queue_put(struct prio_queue *, void *thing);
  */
 extern void *prio_queue_get(struct prio_queue *);
 
+/*
+ * Gain access to the "thing" that would be returned by
+ * prio_queue_get, but do not remove it from the queue.
+ */
+extern void *prio_queue_peek(struct prio_queue *);
+
 extern void clear_prio_queue(struct prio_queue *);
 
 /* Reverse the LIFO elements */
diff --git a/t/helper/test-prio-queue.c b/t/helper/test-prio-queue.c
index 9807b649b1..5bc9c46ea5 100644
--- a/t/helper/test-prio-queue.c
+++ b/t/helper/test-prio-queue.c
@@ -22,14 +22,24 @@ int cmd__prio_queue(int argc, const char **argv)
 	struct prio_queue pq = { intcmp };
 
 	while (*++argv) {
-		if (!strcmp(*argv, "get"))
-			show(prio_queue_get(&pq));
-		else if (!strcmp(*argv, "dump")) {
-			int *v;
-			while ((v = prio_queue_get(&pq)))
-			       show(v);
-		}
-		else {
+		if (!strcmp(*argv, "get")) {
+			void *peek = prio_queue_peek(&pq);
+			void *get = prio_queue_get(&pq);
+			if (peek != get)
+				BUG("peek and get results do not match");
+			show(get);
+		} else if (!strcmp(*argv, "dump")) {
+			void *peek;
+			void *get;
+			while ((peek = prio_queue_peek(&pq))) {
+				get = prio_queue_get(&pq);
+				if (peek != get)
+					BUG("peek and get results do not match");
+				show(get);
+			}
+		} else if (!strcmp(*argv, "stack")) {
+			pq.compare = NULL;
+		} else {
 			int *v = malloc(sizeof(*v));
 			*v = atoi(*argv);
 			prio_queue_put(&pq, v);
diff --git a/t/t0009-prio-queue.sh b/t/t0009-prio-queue.sh
index e56dfce668..3941ad2528 100755
--- a/t/t0009-prio-queue.sh
+++ b/t/t0009-prio-queue.sh
@@ -47,4 +47,18 @@ test_expect_success 'notice empty queue' '
 	test_cmp expect actual
 '
 
+cat >expect <<'EOF'
+3
+2
+6
+4
+5
+1
+8
+EOF
+test_expect_success 'stack order' '
+	test-tool prio-queue stack 8 1 5 4 6 2 3 dump >actual &&
+	test_cmp expect actual
+'
+
 test_done

base-commit: 2d3b1c576c85b7f5db1f418907af00ab88e0c303
-- 
2.19.1.542.gc4df23f792



* [PATCH v5 2/7] test-reach: add run_three_modes method
  2018-11-01 13:46       ` [PATCH v5 " Derrick Stolee
  2018-11-01 13:46         ` [PATCH v5 1/7] prio-queue: add 'peek' operation Derrick Stolee
@ 2018-11-01 13:46         ` Derrick Stolee
  2018-11-01 13:46         ` [PATCH v5 3/7] test-reach: add rev-list tests Derrick Stolee
                           ` (4 subsequent siblings)
  6 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-01 13:46 UTC (permalink / raw)
  To: git; +Cc: gitster, avarab, szeder.dev, peff, jnareb, Derrick Stolee

The 'test_three_modes' method assumes we are using the 'test-tool
reach' command for our test. However, we may want to use the data
shape of our commit graph and the three modes (no commit-graph,
full commit-graph, partial commit-graph) for other git commands.

Split test_three_modes to be a thin wrapper around a more general
run_three_modes method that executes the given command and compares
the actual output against the expected output.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t6600-test-reach.sh | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index d139a00d1d..9d65b8b946 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -53,18 +53,22 @@ test_expect_success 'setup' '
 	git config core.commitGraph true
 '
 
-test_three_modes () {
+run_three_modes () {
 	test_when_finished rm -rf .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	"$@" <input >actual &&
 	test_cmp expect actual &&
 	cp commit-graph-full .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	"$@" <input >actual &&
 	test_cmp expect actual &&
 	cp commit-graph-half .git/objects/info/commit-graph &&
-	test-tool reach $1 <input >actual &&
+	"$@" <input >actual &&
 	test_cmp expect actual
 }
 
+test_three_modes () {
+	run_three_modes test-tool reach "$@"
+}
+
 test_expect_success 'ref_newer:miss' '
 	cat >input <<-\EOF &&
 	A:commit-5-7
-- 
2.19.1.542.gc4df23f792



* [PATCH v5 3/7] test-reach: add rev-list tests
  2018-11-01 13:46       ` [PATCH v5 " Derrick Stolee
  2018-11-01 13:46         ` [PATCH v5 1/7] prio-queue: add 'peek' operation Derrick Stolee
  2018-11-01 13:46         ` [PATCH v5 2/7] test-reach: add run_three_modes method Derrick Stolee
@ 2018-11-01 13:46         ` Derrick Stolee
  2018-11-01 13:46         ` [PATCH v5 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee
                           ` (3 subsequent siblings)
  6 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-01 13:46 UTC (permalink / raw)
  To: git; +Cc: gitster, avarab, szeder.dev, peff, jnareb, Derrick Stolee

The rev-list command is critical to Git's functionality. Ensure it
works in the three commit-graph environments constructed in
t6600-test-reach.sh. Here are a few important types of rev-list
operations:

* Basic: git rev-list --topo-order HEAD
* Range: git rev-list --topo-order compare..HEAD
* Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
* Symmetric Difference: git rev-list --topo-order compare...HEAD

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t6600-test-reach.sh | 84 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index 9d65b8b946..288f703b7b 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -243,4 +243,88 @@ test_expect_success 'commit_contains:miss' '
 	test_three_modes commit_contains --tag
 '
 
+test_expect_success 'rev-list: basic topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+		commit-6-3 commit-5-3 commit-4-3 commit-3-3 commit-2-3 commit-1-3 \
+		commit-6-2 commit-5-2 commit-4-2 commit-3-2 commit-2-2 commit-1-2 \
+		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent topo-order' '
+	git rev-parse \
+		commit-6-6 \
+		commit-6-5 \
+		commit-6-4 \
+		commit-6-3 \
+		commit-6-2 \
+		commit-6-1 commit-5-1 commit-4-1 commit-3-1 commit-2-1 commit-1-1 \
+	>expect &&
+	run_three_modes git rev-list --first-parent --topo-order commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 commit-2-6 commit-1-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 commit-2-5 commit-1-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 commit-2-4 commit-1-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: range topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 \
+		commit-6-5 commit-5-5 commit-4-5 \
+		commit-6-4 commit-5-4 commit-4-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: first-parent range topo-order' '
+	git rev-parse \
+		commit-6-6 \
+		commit-6-5 \
+		commit-6-4 \
+		commit-6-3 \
+		commit-6-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+	>expect &&
+	run_three_modes git rev-list --first-parent --topo-order commit-3-8..commit-6-6
+'
+
+test_expect_success 'rev-list: ancestry-path topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 commit-3-6 \
+		commit-6-5 commit-5-5 commit-4-5 commit-3-5 \
+		commit-6-4 commit-5-4 commit-4-4 commit-3-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+	>expect &&
+	run_three_modes git rev-list --topo-order --ancestry-path commit-3-3..commit-6-6
+'
+
+test_expect_success 'rev-list: symmetric difference topo-order' '
+	git rev-parse \
+		commit-6-6 commit-5-6 commit-4-6 \
+		commit-6-5 commit-5-5 commit-4-5 \
+		commit-6-4 commit-5-4 commit-4-4 \
+		commit-6-3 commit-5-3 commit-4-3 \
+		commit-6-2 commit-5-2 commit-4-2 \
+		commit-6-1 commit-5-1 commit-4-1 \
+		commit-3-8 commit-2-8 commit-1-8 \
+		commit-3-7 commit-2-7 commit-1-7 \
+	>expect &&
+	run_three_modes git rev-list --topo-order commit-3-8...commit-6-6
+'
+
 test_done
-- 
2.19.1.542.gc4df23f792



* [PATCH v5 4/7] revision.c: begin refactoring --topo-order logic
  2018-11-01 13:46       ` [PATCH v5 " Derrick Stolee
                           ` (2 preceding siblings ...)
  2018-11-01 13:46         ` [PATCH v5 3/7] test-reach: add rev-list tests Derrick Stolee
@ 2018-11-01 13:46         ` Derrick Stolee
  2018-11-01 13:46         ` [PATCH v5 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee
                           ` (2 subsequent siblings)
  6 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-01 13:46 UTC (permalink / raw)
  To: git; +Cc: gitster, avarab, szeder.dev, peff, jnareb, Derrick Stolee

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here, such as if we
  perform a range query like 'git rev-list COMPARE..HEAD' and we
  can stop limit_list() when all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.
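
The redirection described above can be sketched in isolation as
follows. This is a minimal model, assuming hypothetical toy_* types
with ints standing in for commits; it only shows the "presence of the
info struct selects the new methods" pattern, not the real
get_revision_1().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real structures. */
struct toy_walk_info {
	int calls; /* how often the redirected path was taken */
};

struct toy_revs {
	const int *commits; /* pending "commits", already in output order */
	size_t nr, pos;
	struct toy_walk_info *topo_walk_info; /* non-NULL selects new path */
};

/* Old path: pop the next pending item, or -1 when exhausted. */
static int toy_pop(struct toy_revs *revs)
{
	return revs->pos < revs->nr ? revs->commits[revs->pos++] : -1;
}

/* New path: for now it performs exactly the same pop as the old one. */
static int toy_next_topo(struct toy_revs *revs)
{
	assert(revs->topo_walk_info);
	revs->topo_walk_info->calls++;
	return toy_pop(revs);
}

/* Dispatch on the presence of the info struct. */
int toy_get_revision(struct toy_revs *revs)
{
	if (revs->topo_walk_info)
		return toy_next_topo(revs);
	return toy_pop(revs);
}
```

Both paths return the same sequence here, which mirrors the intent of
this patch: the call sites change, the behavior does not.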

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 42 ++++++++++++++++++++++++++++++++++++++----
 revision.h |  4 ++++
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index e18bd530e4..2dcde8a8ac 100644
--- a/revision.c
+++ b/revision.c
@@ -25,6 +25,7 @@
 #include "worktree.h"
 #include "argv-array.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 	if (revs->diffopt.objfind)
 		revs->simplify_history = 0;
 
-	if (revs->topo_order)
+	if (revs->topo_order && !generation_numbers_enabled(the_repository))
 		revs->limited = 1;
 
 	if (revs->prune_data.nr) {
@@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid,
 	return 0;
 }
 
+struct topo_walk_info {};
+
+static void init_topo_walk(struct rev_info *revs)
+{
+	struct topo_walk_info *info;
+	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+	info = revs->topo_walk_info;
+	memset(info, 0, sizeof(struct topo_walk_info));
+
+	limit_list(revs);
+	sort_in_topological_order(&revs->commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+	return pop_commit(&revs->commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+		if (!revs->ignore_missing_links)
+			die("Failed to traverse parents of commit %s",
+			    oid_to_hex(&commit->object.oid));
+	}
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
 	int i;
@@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
 		return 0;
-	if (revs->limited)
+	if (revs->limited) {
 		if (limit_list(revs) < 0)
 			return -1;
-	if (revs->topo_order)
-		sort_in_topological_order(&revs->commits, revs->sort_order);
+		if (revs->topo_order)
+			sort_in_topological_order(&revs->commits, revs->sort_order);
+	} else if (revs->topo_order)
+		init_topo_walk(revs);
 	if (revs->line_level_traverse)
 		line_log_filter(revs);
 	if (revs->simplify_merges)
@@ -3257,6 +3287,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
 
 		if (revs->reflog_info)
 			commit = next_reflog_entry(revs->reflog_info);
+		else if (revs->topo_walk_info)
+			commit = next_topo_commit(revs);
 		else
 			commit = pop_commit(&revs->commits);
 
@@ -3278,6 +3310,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
 
 			if (revs->reflog_info)
 				try_to_simplify_commit(revs, commit);
+			else if (revs->topo_walk_info)
+				expand_topo_walk(revs, commit);
 			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
 				if (!revs->ignore_missing_links)
 					die("Failed to traverse parents of commit %s",
diff --git a/revision.h b/revision.h
index 2b30ac270d..fd4154ff75 100644
--- a/revision.h
+++ b/revision.h
@@ -56,6 +56,8 @@ struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct topo_walk_info;
+
 struct rev_info {
 	/* Starting list */
 	struct commit_list *commits;
@@ -245,6 +247,8 @@ struct rev_info {
 	const char *break_bar;
 
 	struct revision_sources *sources;
+
+	struct topo_walk_info *topo_walk_info;
 };
 
 int ref_excluded(struct string_list *, const char *path);
-- 
2.19.1.542.gc4df23f792



* [PATCH v5 5/7] commit/revisions: bookkeeping before refactoring
  2018-11-01 13:46       ` [PATCH v5 " Derrick Stolee
                           ` (3 preceding siblings ...)
  2018-11-01 13:46         ` [PATCH v5 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee
@ 2018-11-01 13:46         ` Derrick Stolee
  2018-11-01 13:46         ` [PATCH v5 6/7] revision.c: generation-based topo-order algorithm Derrick Stolee
  2018-11-01 13:46         ` [PATCH v5 7/7] t6012: make rev-list tests more interesting Derrick Stolee
  6 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-01 13:46 UTC (permalink / raw)
  To: git; +Cc: gitster, avarab, szeder.dev, peff, jnareb, Derrick Stolee

There are a few things that need to move around a little before
making a big refactoring in the topo-order logic:

1. We need access to record_author_date() and
   compare_commits_by_author_date() in revision.c. These are used
   currently by sort_in_topological_order() in commit.c.

2. Moving these methods to commit.h requires adding an author_date_slab
   declaration to commit.h. Consumers will need their own implementation.

3. The add_parents_to_list() method in revision.c performs logic
   around the UNINTERESTING flag and other special cases depending
   on the struct rev_info. Allow this method to ignore a NULL 'list'
   parameter, as we will not be populating the list for our walk.
   Also rename the method to the slightly more generic name
   process_parents() to make clear that this method does more than
   add to a list (and no list is required anymore).

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c   |  9 ++++-----
 commit.h   |  7 +++++++
 revision.c | 18 ++++++++++--------
 3 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/commit.c b/commit.c
index d0f199e122..a025a0db60 100644
--- a/commit.c
+++ b/commit.c
@@ -655,11 +655,10 @@ struct commit *pop_commit(struct commit_list **stack)
 /* count number of children that have not been emitted */
 define_commit_slab(indegree_slab, int);
 
-/* record author-date for each commit object */
 define_commit_slab(author_date_slab, timestamp_t);
 
-static void record_author_date(struct author_date_slab *author_date,
-			       struct commit *commit)
+void record_author_date(struct author_date_slab *author_date,
+			struct commit *commit)
 {
 	const char *buffer = get_commit_buffer(commit, NULL);
 	struct ident_split ident;
@@ -684,8 +683,8 @@ static void record_author_date(struct author_date_slab *author_date,
 	unuse_commit_buffer(commit, buffer);
 }
 
-static int compare_commits_by_author_date(const void *a_, const void *b_,
-					  void *cb_data)
+int compare_commits_by_author_date(const void *a_, const void *b_,
+				   void *cb_data)
 {
 	const struct commit *a = a_, *b = b_;
 	struct author_date_slab *author_date = cb_data;
diff --git a/commit.h b/commit.h
index 2b1a734388..ec5b9093ad 100644
--- a/commit.h
+++ b/commit.h
@@ -8,6 +8,7 @@
 #include "gpg-interface.h"
 #include "string-list.h"
 #include "pretty.h"
+#include "commit-slab.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
 #define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
@@ -328,6 +329,12 @@ extern int remove_signature(struct strbuf *buf);
  */
 extern int check_commit_signature(const struct commit *commit, struct signature_check *sigc);
 
+/* record author-date for each commit object */
+struct author_date_slab;
+void record_author_date(struct author_date_slab *author_date,
+			struct commit *commit);
+
+int compare_commits_by_author_date(const void *a_, const void *b_, void *unused);
 int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
 int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused);
 
diff --git a/revision.c b/revision.c
index 2dcde8a8ac..36458265a0 100644
--- a/revision.c
+++ b/revision.c
@@ -768,8 +768,8 @@ static void commit_list_insert_by_date_cached(struct commit *p, struct commit_li
 		*cache = new_entry;
 }
 
-static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
-		    struct commit_list **list, struct commit_list **cache_ptr)
+static int process_parents(struct rev_info *revs, struct commit *commit,
+			   struct commit_list **list, struct commit_list **cache_ptr)
 {
 	struct commit_list *parent = commit->parents;
 	unsigned left_flag;
@@ -808,7 +808,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
 			if (p->object.flags & SEEN)
 				continue;
 			p->object.flags |= SEEN;
-			commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
+			if (list)
+				commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
 		}
 		return 0;
 	}
@@ -847,7 +848,8 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
 		p->object.flags |= left_flag;
 		if (!(p->object.flags & SEEN)) {
 			p->object.flags |= SEEN;
-			commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
+			if (list)
+				commit_list_insert_by_date_cached(p, list, cached_base, cache_ptr);
 		}
 		if (revs->first_parent_only)
 			break;
@@ -1091,7 +1093,7 @@ static int limit_list(struct rev_info *revs)
 
 		if (revs->max_age != -1 && (commit->date < revs->max_age))
 			obj->flags |= UNINTERESTING;
-		if (add_parents_to_list(revs, commit, &list, NULL) < 0)
+		if (process_parents(revs, commit, &list, NULL) < 0)
 			return -1;
 		if (obj->flags & UNINTERESTING) {
 			mark_parents_uninteresting(commit);
@@ -2913,7 +2915,7 @@ static struct commit *next_topo_commit(struct rev_info *revs)
 
 static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 {
-	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+	if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
 		if (!revs->ignore_missing_links)
 			die("Failed to traverse parents of commit %s",
 			    oid_to_hex(&commit->object.oid));
@@ -2979,7 +2981,7 @@ static enum rewrite_result rewrite_one(struct rev_info *revs, struct commit **pp
 	for (;;) {
 		struct commit *p = *pp;
 		if (!revs->limited)
-			if (add_parents_to_list(revs, p, &revs->commits, &cache) < 0)
+			if (process_parents(revs, p, &revs->commits, &cache) < 0)
 				return rewrite_one_error;
 		if (p->object.flags & UNINTERESTING)
 			return rewrite_one_ok;
@@ -3312,7 +3314,7 @@ static struct commit *get_revision_1(struct rev_info *revs)
 				try_to_simplify_commit(revs, commit);
 			else if (revs->topo_walk_info)
 				expand_topo_walk(revs, commit);
-			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+			else if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
 				if (!revs->ignore_missing_links)
 					die("Failed to traverse parents of commit %s",
 						oid_to_hex(&commit->object.oid));
-- 
2.19.1.542.gc4df23f792



* [PATCH v5 6/7] revision.c: generation-based topo-order algorithm
  2018-11-01 13:46       ` [PATCH v5 " Derrick Stolee
                           ` (4 preceding siblings ...)
  2018-11-01 13:46         ` [PATCH v5 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee
@ 2018-11-01 13:46         ` Derrick Stolee
  2018-11-01 15:48           ` SZEDER Gábor
  2019-11-08  2:50           ` Mike Hommey
  2018-11-01 13:46         ` [PATCH v5 7/7] t6012: make rev-list tests more interesting Derrick Stolee
  6 siblings, 2 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-01 13:46 UTC (permalink / raw)
  To: git; +Cc: gitster, avarab, szeder.dev, peff, jnareb, Derrick Stolee

The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
   adds them to a linked list, and distributes UNINTERESTING
   flags. If all unprocessed commits are UNINTERESTING, then
   it may terminate without walking all reachable commits.
   This does not occur if we do not specify UNINTERESTING
   commits.

2. Run sort_in_topological_order(), which is an implementation
   of Kahn's algorithm. It first iterates through the entire
   set of important commits and computes the in-degree of each
   (plus one, as we use 'zero' as a special value here). Then,
   we walk the commits in priority order, adding them to the
   priority queue if and only if their in-degree is one. As
   we remove commits from this priority queue, we decrement the
   in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
   uses pop_commit on the full list of commits computed by
   sort_in_topological_order().
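
As a rough illustration of step 2 (this is not the code in commit.c),
the sketch below runs Kahn's algorithm on a toy graph using the same
"in-degree plus one" convention, where 1 means "ready for output" and
0 means "already emitted or never counted". The names toy_graph,
TOY_MAX, and toy_topo_sort() are hypothetical and exist only for this
example, and a plain LIFO stack stands in for the priority queue.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX 16

/* Hypothetical toy graph: commits are indices, each with up to 2 parents. */
struct toy_graph {
	size_t nr;
	size_t nr_parents[TOY_MAX];
	size_t parents[TOY_MAX][2];
};

/*
 * Write the commits to 'out' so that every child appears before its
 * parents, and return how many commits were written.
 */
static size_t toy_topo_sort(const struct toy_graph *g, size_t *out)
{
	int indegree[TOY_MAX] = {0};
	size_t stack[TOY_MAX];
	size_t top = 0, nr_out = 0, i, j;

	assert(g->nr <= TOY_MAX);

	/* Every commit in the set starts at 1 (the "plus one"). */
	for (i = 0; i < g->nr; i++)
		indegree[i] = 1;

	/* Each child adds one to each of its parents. */
	for (i = 0; i < g->nr; i++)
		for (j = 0; j < g->nr_parents[i]; j++)
			indegree[g->parents[i][j]]++;

	/* Commits with no children are ready immediately. */
	for (i = 0; i < g->nr; i++)
		if (indegree[i] == 1)
			stack[top++] = i;

	while (top) {
		size_t c = stack[--top];

		out[nr_out++] = c;
		indegree[c] = 0;	/* mark as emitted */

		/* A parent becomes ready once all its children are out. */
		for (j = 0; j < g->nr_parents[c]; j++) {
			size_t p = g->parents[c][j];

			if (--indegree[p] == 1)
				stack[top++] = p;
		}
	}
	return nr_out;
}
```

Each commit is pushed at most once, so the fixed-size stack cannot
overflow. The whole set must be scanned before the first commit can be
emitted, which is the cost the new algorithm tries to avoid.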

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
  number is one more than the maximum generation number among
  its parents.

* If the commit has no parent, then the generation number is one.
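
The two rules above can be sketched as a small recursive function.
This is only an illustration under stated assumptions: toy_commit,
TOY_MAX_PARENTS, and toy_generation() are hypothetical names, not
Git's 'struct commit' or any commit-graph API, and using 0 for "not
computed yet" is a convention of this sketch alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PARENTS 2

/* Hypothetical toy commit used only for this example. */
struct toy_commit {
	size_t nr_parents;
	struct toy_commit *parents[TOY_MAX_PARENTS];
	uint32_t generation;	/* 0 here means "not computed yet" */
};

/* Return the generation number of 'c', computing and caching it. */
static uint32_t toy_generation(struct toy_commit *c)
{
	uint32_t max_parent = 0;
	size_t i;

	if (c->generation)
		return c->generation;

	assert(c->nr_parents <= TOY_MAX_PARENTS);
	for (i = 0; i < c->nr_parents; i++) {
		uint32_t g = toy_generation(c->parents[i]);

		if (g > max_parent)
			max_parent = g;
	}

	/* A root has max_parent == 0, so its generation is one. */
	c->generation = max_parent + 1;
	return c->generation;
}
```

The recursion is fine for a toy graph; a real history would need an
iterative walk to avoid deep call stacks.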

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
  indicates that the commit is not stored in the commit-graph and
  the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
  to say that the commit-graph was generated by a version of Git
  that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following statement, which is weaker than the one we usually expect
from generation numbers:

    If A and B are commits with generation numbers gen(A) and
    gen(B) and gen(A) < gen(B), then A cannot reach B.

Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.

The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
   maximizing the generation number), parse each reachable
   commit until all commits in the queue have generation
   number strictly lower than needed. During this walk, update
   the UNINTERESTING flags as necessary.

2. INDEGREE: using the indegree_queue priority queue (ordered
   by maximizing the generation number), add one to the in-
   degree of each parent for each commit that is walked. Since
   we walk in order of decreasing generation number, we know
   that discovering an in-degree value of 0 means the value for
   that commit was not initialized, so should be initialized to
   two. (Recall that in-degree value "1" is what we use to say a
   commit is ready for output.) As we iterate the parents of a
   commit during this walk, ensure the EXPLORE walk has walked
   beyond their generation numbers.

3. TOPO: using the topo_queue priority queue (ordered based on
   the sort_order given, which could be commit-date, author-
   date, or typical topo-order which treats the queue as a LIFO
   stack), remove a commit from the queue and decrement the
   in-degree of each parent. If a parent has an in-degree of
   one, then we add it to the topo_queue. Before we decrement
   the in-degree, however, ensure the INDEGREE walk has walked
   beyond that generation number.

The implementations of these walks are in the following methods:

* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk

These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.

One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled for comparisons, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on the order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.

For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, we get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 object.h   |   4 +-
 revision.c | 195 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 revision.h |   2 +
 3 files changed, 194 insertions(+), 7 deletions(-)

diff --git a/object.h b/object.h
index 0feb90ae61..796792cb32 100644
--- a/object.h
+++ b/object.h
@@ -59,7 +59,7 @@ struct object_array {
 
 /*
  * object flag allocation:
- * revision.h:               0---------10                              2526
+ * revision.h:               0---------10                              25----28
  * fetch-pack.c:             01
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
@@ -78,7 +78,7 @@ struct object_array {
  * builtin/show-branch.c:    0-------------------------------------------26
  * builtin/unpack-objects.c:                                 2021
  */
-#define FLAG_BITS  27
+#define FLAG_BITS  29
 
 /*
  * The object type is stored in 3 bits.
diff --git a/revision.c b/revision.c
index 36458265a0..4ef47d2fb4 100644
--- a/revision.c
+++ b/revision.c
@@ -26,6 +26,7 @@
 #include "argv-array.h"
 #include "commit-reach.h"
 #include "commit-graph.h"
+#include "prio-queue.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2895,31 +2896,215 @@ static int mark_uninteresting(const struct object_id *oid,
 	return 0;
 }
 
-struct topo_walk_info {};
+define_commit_slab(indegree_slab, int);
+define_commit_slab(author_date_slab, timestamp_t);
+
+struct topo_walk_info {
+	uint32_t min_generation;
+	struct prio_queue explore_queue;
+	struct prio_queue indegree_queue;
+	struct prio_queue topo_queue;
+	struct indegree_slab indegree;
+	struct author_date_slab author_date;
+};
+
+static inline void test_flag_and_insert(struct prio_queue *q, struct commit *c, int flag)
+{
+	if (c->object.flags & flag)
+		return;
+
+	c->object.flags |= flag;
+	prio_queue_put(q, c);
+}
+
+static void explore_walk_step(struct rev_info *revs)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit_list *p;
+	struct commit *c = prio_queue_get(&info->explore_queue);
+
+	if (!c)
+		return;
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
+		record_author_date(&info->author_date, c);
+
+	if (revs->max_age != -1 && (c->date < revs->max_age))
+		c->object.flags |= UNINTERESTING;
+
+	if (process_parents(revs, c, NULL, NULL) < 0)
+		return;
+
+	if (c->object.flags & UNINTERESTING)
+		mark_parents_uninteresting(c);
+
+	for (p = c->parents; p; p = p->next)
+		test_flag_and_insert(&info->explore_queue, p->item, TOPO_WALK_EXPLORED);
+}
+
+static void explore_to_depth(struct rev_info *revs,
+			     uint32_t gen_cutoff)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c;
+	while ((c = prio_queue_peek(&info->explore_queue)) &&
+	       c->generation >= gen_cutoff)
+		explore_walk_step(revs);
+}
+
+static void indegree_walk_step(struct rev_info *revs)
+{
+	struct commit_list *p;
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c = prio_queue_get(&info->indegree_queue);
+
+	if (!c)
+		return;
+
+	if (parse_commit_gently(c, 1) < 0)
+		return;
+
+	explore_to_depth(revs, c->generation);
+
+	for (p = c->parents; p; p = p->next) {
+		struct commit *parent = p->item;
+		int *pi = indegree_slab_at(&info->indegree, parent);
+
+		if (*pi)
+			(*pi)++;
+		else
+			*pi = 2;
+
+		test_flag_and_insert(&info->indegree_queue, parent, TOPO_WALK_INDEGREE);
+
+		if (revs->first_parent_only)
+			return;
+	}
+}
+
+static void compute_indegrees_to_depth(struct rev_info *revs,
+				       uint32_t gen_cutoff)
+{
+	struct topo_walk_info *info = revs->topo_walk_info;
+	struct commit *c;
+	while ((c = prio_queue_peek(&info->indegree_queue)) &&
+	       c->generation >= gen_cutoff)
+		indegree_walk_step(revs);
+}
 
 static void init_topo_walk(struct rev_info *revs)
 {
 	struct topo_walk_info *info;
+	struct commit_list *list;
 	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
 	info = revs->topo_walk_info;
 	memset(info, 0, sizeof(struct topo_walk_info));
 
-	limit_list(revs);
-	sort_in_topological_order(&revs->commits, revs->sort_order);
+	init_indegree_slab(&info->indegree);
+	memset(&info->explore_queue, 0, sizeof(info->explore_queue));
+	memset(&info->indegree_queue, 0, sizeof(info->indegree_queue));
+	memset(&info->topo_queue, 0, sizeof(info->topo_queue));
+
+	switch (revs->sort_order) {
+	default: /* REV_SORT_IN_GRAPH_ORDER */
+		info->topo_queue.compare = NULL;
+		break;
+	case REV_SORT_BY_COMMIT_DATE:
+		info->topo_queue.compare = compare_commits_by_commit_date;
+		break;
+	case REV_SORT_BY_AUTHOR_DATE:
+		init_author_date_slab(&info->author_date);
+		info->topo_queue.compare = compare_commits_by_author_date;
+		info->topo_queue.cb_data = &info->author_date;
+		break;
+	}
+
+	info->explore_queue.compare = compare_commits_by_gen_then_commit_date;
+	info->indegree_queue.compare = compare_commits_by_gen_then_commit_date;
+
+	info->min_generation = GENERATION_NUMBER_INFINITY;
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+
+		if (parse_commit_gently(c, 1))
+			continue;
+
+		test_flag_and_insert(&info->explore_queue, c, TOPO_WALK_EXPLORED);
+		test_flag_and_insert(&info->indegree_queue, c, TOPO_WALK_INDEGREE);
+
+		if (c->generation < info->min_generation)
+			info->min_generation = c->generation;
+
+		*(indegree_slab_at(&info->indegree, c)) = 1;
+
+		if (revs->sort_order == REV_SORT_BY_AUTHOR_DATE)
+			record_author_date(&info->author_date, c);
+	}
+	compute_indegrees_to_depth(revs, info->min_generation);
+
+	for (list = revs->commits; list; list = list->next) {
+		struct commit *c = list->item;
+
+		if (*(indegree_slab_at(&info->indegree, c)) == 1)
+			prio_queue_put(&info->topo_queue, c);
+	}
+
+	/*
+	 * This is unfortunate; the initial tips need to be shown
+	 * in the order given from the revision traversal machinery.
+	 */
+	if (revs->sort_order == REV_SORT_IN_GRAPH_ORDER)
+		prio_queue_reverse(&info->topo_queue);
 }
 
 static struct commit *next_topo_commit(struct rev_info *revs)
 {
-	return pop_commit(&revs->commits);
+	struct commit *c;
+	struct topo_walk_info *info = revs->topo_walk_info;
+
+	/* pop next off of topo_queue */
+	c = prio_queue_get(&info->topo_queue);
+
+	if (c)
+		*(indegree_slab_at(&info->indegree, c)) = 0;
+
+	return c;
 }
 
 static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 {
-	if (process_parents(revs, commit, &revs->commits, NULL) < 0) {
+	struct commit_list *p;
+	struct topo_walk_info *info = revs->topo_walk_info;
+	if (process_parents(revs, commit, NULL, NULL) < 0) {
 		if (!revs->ignore_missing_links)
 			die("Failed to traverse parents of commit %s",
 			    oid_to_hex(&commit->object.oid));
 	}
+
+	for (p = commit->parents; p; p = p->next) {
+		struct commit *parent = p->item;
+		int *pi;
+
+		if (parse_commit_gently(parent, 1) < 0)
+			continue;
+
+		if (parent->generation < info->min_generation) {
+			info->min_generation = parent->generation;
+			compute_indegrees_to_depth(revs, info->min_generation);
+		}
+
+		pi = indegree_slab_at(&info->indegree, parent);
+
+		(*pi)--;
+		if (*pi == 1)
+			prio_queue_put(&info->topo_queue, parent);
+
+		if (revs->first_parent_only)
+			return;
+	}
 }
 
 int prepare_revision_walk(struct rev_info *revs)
diff --git a/revision.h b/revision.h
index fd4154ff75..b0b3bb8025 100644
--- a/revision.h
+++ b/revision.h
@@ -24,6 +24,8 @@
 #define USER_GIVEN	(1u<<25) /* given directly by the user */
 #define TRACK_LINEAR	(1u<<26)
 #define ALL_REV_FLAGS	(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
+#define TOPO_WALK_EXPLORED	(1u<<27)
+#define TOPO_WALK_INDEGREE	(1u<<28)
 
 #define DECORATE_SHORT_REFS	1
 #define DECORATE_FULL_REFS	2
-- 
2.19.1.542.gc4df23f792



* [PATCH v5 7/7] t6012: make rev-list tests more interesting
  2018-11-01 13:46       ` [PATCH v5 " Derrick Stolee
                           ` (5 preceding siblings ...)
  2018-11-01 13:46         ` [PATCH v5 6/7] revision.c: generation-based topo-order algorithm Derrick Stolee
@ 2018-11-01 13:46         ` Derrick Stolee
  6 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-01 13:46 UTC (permalink / raw)
  To: git; +Cc: gitster, avarab, szeder.dev, peff, jnareb, Derrick Stolee

As we are working to rewrite some of the revision-walk machinery,
there could easily be some interesting interactions between the
options that force topological constraints (--topo-order,
--date-order, and --author-date-order) along with specifying a
path.

Add extra tests to t6012-rev-list-simplify.sh to add coverage of
these interactions. To ensure interesting things occur, alter the
repo data shape to have different orders depending on topo-, date-,
or author-date-order.

When testing using GIT_TEST_COMMIT_GRAPH, this assists in covering
the new logic for topo-order walks using generation numbers. The
extra tests can be added independently.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t6012-rev-list-simplify.sh | 45 ++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 9 deletions(-)

diff --git a/t/t6012-rev-list-simplify.sh b/t/t6012-rev-list-simplify.sh
index b5a1190ffe..a10f0df02b 100755
--- a/t/t6012-rev-list-simplify.sh
+++ b/t/t6012-rev-list-simplify.sh
@@ -12,6 +12,22 @@ unnote () {
 	git name-rev --tags --stdin | sed -e "s|$OID_REGEX (tags/\([^)]*\)) |\1 |g"
 }
 
+#
+# Create a test repo with interesting commit graph:
+#
+# A--B----------G--H--I--K--L
+#  \  \           /     /
+#   \  \         /     /
+#    C------E---F     J
+#        \_/
+#
+# The commits are laid out from left-to-right starting with
+# the root commit A and terminating at the tip commit L.
+#
+# There are a few places where we adjust the commit date or
+# author date to make the --topo-order, --date-order, and
+# --author-date-order flags produce different output.
+
 test_expect_success setup '
 	echo "Hi there" >file &&
 	echo "initial" >lost &&
@@ -21,10 +37,18 @@ test_expect_success setup '
 
 	git branch other-branch &&
 
+	git symbolic-ref HEAD refs/heads/unrelated &&
+	git rm -f "*" &&
+	echo "Unrelated branch" >side &&
+	git add side &&
+	test_tick && git commit -m "Side root" &&
+	note J &&
+	git checkout master &&
+
 	echo "Hello" >file &&
 	echo "second" >lost &&
 	git add file lost &&
-	test_tick && git commit -m "Modified file and lost" &&
+	test_tick && GIT_AUTHOR_DATE=$(($test_tick + 120)) git commit -m "Modified file and lost" &&
 	note B &&
 
 	git checkout other-branch &&
@@ -63,13 +87,6 @@ test_expect_success setup '
 	test_tick && git commit -a -m "Final change" &&
 	note I &&
 
-	git symbolic-ref HEAD refs/heads/unrelated &&
-	git rm -f "*" &&
-	echo "Unrelated branch" >side &&
-	git add side &&
-	test_tick && git commit -m "Side root" &&
-	note J &&
-
 	git checkout master &&
 	test_tick && git merge --allow-unrelated-histories -m "Coolest" unrelated &&
 	note K &&
@@ -103,14 +120,24 @@ check_result () {
 	check_outcome success "$@"
 }
 
-check_result 'L K J I H G F E D C B A' --full-history
+check_result 'L K J I H F E D C G B A' --full-history --topo-order
+check_result 'L K I H G F E D C B J A' --full-history
+check_result 'L K I H G F E D C B J A' --full-history --date-order
+check_result 'L K I H G F E D B C J A' --full-history --author-date-order
 check_result 'K I H E C B A' --full-history -- file
 check_result 'K I H E C B A' --full-history --topo-order -- file
 check_result 'K I H E C B A' --full-history --date-order -- file
+check_result 'K I H E B C A' --full-history --author-date-order -- file
 check_result 'I E C B A' --simplify-merges -- file
+check_result 'I E C B A' --simplify-merges --topo-order -- file
+check_result 'I E C B A' --simplify-merges --date-order -- file
+check_result 'I E B C A' --simplify-merges --author-date-order -- file
 check_result 'I B A' -- file
 check_result 'I B A' --topo-order -- file
+check_result 'I B A' --date-order -- file
+check_result 'I B A' --author-date-order -- file
 check_result 'H' --first-parent -- another-file
+check_result 'H' --first-parent --topo-order -- another-file
 
 check_result 'E C B A' --full-history E -- lost
 test_expect_success 'full history simplification without parent' '
-- 
2.19.1.542.gc4df23f792



* Re: [PATCH v4 0/7] Use generation numbers for --topo-order
  2018-11-01  5:21       ` Junio C Hamano
@ 2018-11-01 13:49         ` Derrick Stolee
  2018-11-01 23:54           ` Junio C Hamano
  0 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee @ 2018-11-01 13:49 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget; +Cc: git, peff

On 11/1/2018 1:21 AM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> This patch series performs a decently-sized refactoring of the revision-walk
>> machinery. Well, "refactoring" is probably the wrong word, as I don't
>> actually remove the old code. Instead, when we see certain options in the
>> 'rev_info' struct, we redirect the commit-walk logic to a new set of methods
>> that distribute the workload differently. By using generation numbers in the
>> commit-graph, we can significantly improve 'git log --graph' commands (and
>> the underlying 'git rev-list --topo-order').
> Review discussions seem to have petered out.  Would we merge this to
> 'next' and start cooking, perhaps for the remainder of this cycle?

Thanks, but I've just sent a v5 responding to Jakub's feedback on v4. [1]

I'd be happy to let it sit in next until you feel it has cooked long 
enough. I'm available to respond to feedback in the form of new topics.

Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20181101134623.84055-1-dstolee@microsoft.com/T/#u


* Re: [PATCH v5 6/7] revision.c: generation-based topo-order algorithm
  2018-11-01 13:46         ` [PATCH v5 6/7] revision.c: generation-based topo-order algorithm Derrick Stolee
@ 2018-11-01 15:48           ` SZEDER Gábor
  2018-11-01 16:12             ` Derrick Stolee
  2019-11-08  2:50           ` Mike Hommey
  1 sibling, 1 reply; 87+ messages in thread
From: SZEDER Gábor @ 2018-11-01 15:48 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, avarab, peff, jnareb, Derrick Stolee

On Thu, Nov 01, 2018 at 01:46:22PM +0000, Derrick Stolee wrote:
> 1. EXPLORE: using the explore_queue priority queue (ordered by
>    maximizing the generation number)

> 2. INDEGREE: using the indegree_queue priority queue (ordered
>    by maximizing the generation number)

Nit: I've been pondering for a while what exactly does "order by
maximizing ..." mean.  Highest to lowest or lowest to highest?  If I
understand the rest of the descriptions (that I snipped) correctly,
then it's the former, but I find that phrase in itself too ambiguous.



* Re: [PATCH v5 6/7] revision.c: generation-based topo-order algorithm
  2018-11-01 15:48           ` SZEDER Gábor
@ 2018-11-01 16:12             ` Derrick Stolee
  0 siblings, 0 replies; 87+ messages in thread
From: Derrick Stolee @ 2018-11-01 16:12 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, gitster, avarab, peff, jnareb, Derrick Stolee

On 11/1/2018 11:48 AM, SZEDER Gábor wrote:
> On Thu, Nov 01, 2018 at 01:46:22PM +0000, Derrick Stolee wrote:
>> 1. EXPLORE: using the explore_queue priority queue (ordered by
>>     maximizing the generation number)
>> 2. INDEGREE: using the indegree_queue priority queue (ordered
>>     by maximizing the generation number)
> Nit: I've been pondering for a while what exactly does "order by
> maximizing ..." mean.  Highest to lowest or lowest to highest?  If I
> understand the rest of the descriptions (that I snipped) correctly,
> then it's the former, but I find that phrase in itself too ambiguous.

It means that our priority-queue "get" operation selects the item in the
queue that is largest by our comparison function (first generation number,
then commit-date for ties). This means we walk commits that have a high
generation number before those with a lower generation number, guaranteeing
that we walk all children of a commit before walking that commit.
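
As a minimal sketch of that ordering (this is not the prio-queue code
from the series; 'toy_commit', 'toy_compare' and 'toy_get' are
hypothetical names, and the linear scan stands in for a real heap),
"get" always removes the item with the highest generation number,
using the commit date only to break ties:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a commit; not git's 'struct commit'. */
struct toy_commit {
	unsigned int generation;
	unsigned long date;
};

/* Positive if 'a' should be walked before 'b': generation first, then date. */
static int toy_compare(const struct toy_commit *a, const struct toy_commit *b)
{
	if (a->generation != b->generation)
		return a->generation > b->generation ? 1 : -1;
	if (a->date != b->date)
		return a->date > b->date ? 1 : -1;
	return 0;
}

/*
 * "get": remove and return the largest item, or NULL if the queue is
 * empty.  A linear scan keeps the sketch short; a heap would do the
 * same selection in logarithmic time.
 */
static struct toy_commit *toy_get(struct toy_commit **queue, size_t *nr)
{
	size_t i, best = 0;
	struct toy_commit *ret;

	assert(queue && nr);
	if (!*nr)
		return NULL;
	for (i = 1; i < *nr; i++)
		if (toy_compare(queue[i], queue[best]) > 0)
			best = i;
	ret = queue[best];
	queue[best] = queue[--*nr];	/* fill the hole with the last item */
	return ret;
}
```

With generations {3, 5, 5, 1} in the queue, the two generation-5
commits come out first (newer date first), then 3, then 1.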

Thanks,
-Stolee


* Re: [PATCH v4 0/7] Use generation numbers for --topo-order
  2018-11-01 13:49         ` Derrick Stolee
@ 2018-11-01 23:54           ` Junio C Hamano
  0 siblings, 0 replies; 87+ messages in thread
From: Junio C Hamano @ 2018-11-01 23:54 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, peff, Jakub Narębski

Derrick Stolee <stolee@gmail.com> writes:

>> Review discussions seem to have petered out.  Would we merge this to
>> 'next' and start cooking, perhaps for the remainder of this cycle?
>
> Thanks, but I've just sent a v5 responding to Jakub's feedback on v4. [1]
>
> I'd be happy to let it sit in next until you feel it has cooked long
> enough. I'm available to respond to feedback in the form of new
> topics.

OK.  I'm quite happy to see this round of review helped greatly by
Jakub, by the way.

Thanks, both.


* Re: [PATCH v5 6/7] revision.c: generation-based topo-order algorithm
  2018-11-01 13:46         ` [PATCH v5 6/7] revision.c: generation-based topo-order algorithm Derrick Stolee
  2018-11-01 15:48           ` SZEDER Gábor
@ 2019-11-08  2:50           ` Mike Hommey
  2019-11-11  1:07             ` Derrick Stolee
  1 sibling, 1 reply; 87+ messages in thread
From: Mike Hommey @ 2019-11-08  2:50 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, gitster, avarab, szeder.dev, peff, jnareb, Derrick Stolee

Replying to this old thread because I have questions regarding the
patch, in the context of problems I had downstream, in git-cinnabar.

On Thu, Nov 01, 2018 at 01:46:22PM +0000, Derrick Stolee wrote:
>  static void init_topo_walk(struct rev_info *revs)
>  {
>  	struct topo_walk_info *info;
> +	struct commit_list *list;
>  	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));

Not directly from this patch, but there's nothing that frees this memory
AFAICS, but then, that's also true for most of the things in struct
rev_info.

> diff --git a/revision.h b/revision.h
> index fd4154ff75..b0b3bb8025 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -24,6 +24,8 @@
>  #define USER_GIVEN	(1u<<25) /* given directly by the user */
>  #define TRACK_LINEAR	(1u<<26)
>  #define ALL_REV_FLAGS	(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
> +#define TOPO_WALK_EXPLORED	(1u<<27)
> +#define TOPO_WALK_INDEGREE	(1u<<28)

Should these two flags be included in ALL_REV_FLAGS?
Should they be reset by reset_revision_walk?

At least for the latter, I'd say yes, otherwise you can end up with
missing revs in a subsequent topo-order revwalk.
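
To make the concern concrete, here is a small standalone sketch.  The
macro values are copied from the hunk quoted above; 'toy_object',
'toy_clear_flags' and 'toy_has_stale_topo_bits' are hypothetical and
only model the idea of clearing flags with a mask, not what
reset_revision_walk actually does:

```c
#include <assert.h>

/* Values copied from the quoted revision.h hunk. */
#define USER_GIVEN		(1u<<25)
#define TRACK_LINEAR		(1u<<26)
#define ALL_REV_FLAGS		(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
#define TOPO_WALK_EXPLORED	(1u<<27)
#define TOPO_WALK_INDEGREE	(1u<<28)

/* Hypothetical object with a flags word; not git's 'struct object'. */
struct toy_object {
	unsigned int flags;
};

/* Clear only the bits named in 'mask'; every other bit is left alone. */
static void toy_clear_flags(struct toy_object *obj, unsigned int mask)
{
	assert(obj);
	obj->flags &= ~mask;
}

/* Nonzero if a topo-walk bit from an earlier walk is still set. */
static int toy_has_stale_topo_bits(const struct toy_object *obj)
{
	return (obj->flags & (TOPO_WALK_EXPLORED | TOPO_WALK_INDEGREE)) != 0;
}
```

Clearing with ALL_REV_FLAGS as defined above leaves both TOPO_WALK_*
bits set, so a second walk would see the object as already explored;
clearing with the mask extended by those two bits does not.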

Mike


* Re: [PATCH v5 6/7] revision.c: generation-based topo-order algorithm
  2019-11-08  2:50           ` Mike Hommey
@ 2019-11-11  1:07             ` Derrick Stolee
  2019-11-18 23:04               ` SZEDER Gábor
  0 siblings, 1 reply; 87+ messages in thread
From: Derrick Stolee @ 2019-11-11  1:07 UTC (permalink / raw)
  To: Mike Hommey
  Cc: git, gitster, avarab, szeder.dev, peff, jnareb, Derrick Stolee

On 11/7/2019 9:50 PM, Mike Hommey wrote:
> Replying to this old thread because I have questions regarding the
> patch, in the context of problems I had downstream, in git-cinnabar.
> 
> On Thu, Nov 01, 2018 at 01:46:22PM +0000, Derrick Stolee wrote:
>>  static void init_topo_walk(struct rev_info *revs)
>>  {
>>  	struct topo_walk_info *info;
>> +	struct commit_list *list;
>>  	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
> 
> Not directly from this patch, but there's nothing that frees this memory
> AFAICS, but then, that's also true for most of the things in struct
> rev_info.

This is true, the 'struct rev_info' doesn't get cleaned up at the end.
It is probably a lot of work to find all the consumers and get them to
clean everything up, and the value is rather low. I believe the expectation
is that each process will only run a revision walk at most once.

>> diff --git a/revision.h b/revision.h
>> index fd4154ff75..b0b3bb8025 100644
>> --- a/revision.h
>> +++ b/revision.h
>> @@ -24,6 +24,8 @@
>>  #define USER_GIVEN	(1u<<25) /* given directly by the user */
>>  #define TRACK_LINEAR	(1u<<26)
>>  #define ALL_REV_FLAGS	(((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
>> +#define TOPO_WALK_EXPLORED	(1u<<27)
>> +#define TOPO_WALK_INDEGREE	(1u<<28)
> 
> Should these two flags be included in ALL_REV_FLAGS?
> Should they be reset by reset_revision_walk?
> 
> At least for the latter, I'd say yes, otherwise you can end up with
> missing revs in a subsequent topo-order revwalk.

This is probably true. Sounds like a quick contribution could
be in order?

Thanks,
-Stolee



* Re: [PATCH v5 6/7] revision.c: generation-based topo-order algorithm
  2019-11-11  1:07             ` Derrick Stolee
@ 2019-11-18 23:04               ` SZEDER Gábor
  0 siblings, 0 replies; 87+ messages in thread
From: SZEDER Gábor @ 2019-11-18 23:04 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Mike Hommey, git, gitster, avarab, peff, jnareb, Derrick Stolee

On Sun, Nov 10, 2019 at 08:07:31PM -0500, Derrick Stolee wrote:
> On 11/7/2019 9:50 PM, Mike Hommey wrote:
> > Replying to this old thread because I have questions regarding the
> > patch, in the context of problems I had downstream, in git-cinnabar.
> > 
> > On Thu, Nov 01, 2018 at 01:46:22PM +0000, Derrick Stolee wrote:
> >>  static void init_topo_walk(struct rev_info *revs)
> >>  {
> >>  	struct topo_walk_info *info;
> >> +	struct commit_list *list;
> >>  	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
> > 
> > Not directly from this patch, but there's nothing that frees this memory
> > AFAICS, but then, that's also true for most of the things in struct
> > rev_info.
> 
> This is true, the 'struct rev_info' doesn't get cleaned up at the end.
> It is probably a lot of work to find all the consumers and get them to
> clean everything up, and the value is rather low. I believe the expectation
> is that each process will only run a revision walk at most once.

I don't think that's a valid expectation.

Several commands must do multiple revision walks in a single process,
e.g. 'describe' or 'name-rev', but they tend to do so by rolling their
own low-level revision walking (e.g. by putting all ~SEEN parents into
a 'commit_list' and iterating until the list becomes empty) instead of
a higher-level 'while ((commit = get_revision(revs)))' loop.
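
A rough sketch of that hand-rolled pattern, for illustration only:
'toy_commit', 'TOY_SEEN' and 'toy_walk' are hypothetical, and a
fixed-size array is used as the worklist where the real commands use a
'commit_list'.  It is not taken from 'describe' or 'name-rev':

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SEEN	(1u<<0)
#define TOY_MAX_PARENTS	2

/* Hypothetical commit with a flags word and up to two parents. */
struct toy_commit {
	unsigned int flags;
	size_t nr_parents;
	struct toy_commit *parents[TOY_MAX_PARENTS];
};

/*
 * Visit everything reachable from 'tip' and return the number of
 * commits visited.  A commit is marked TOY_SEEN when it is queued, so
 * it is queued at most once and 'cap' only needs to be the number of
 * commits in the graph.  The order is arbitrary, not topological.
 */
static size_t toy_walk(struct toy_commit *tip,
		       struct toy_commit **stack, size_t cap)
{
	size_t nr = 0, visited = 0, i;

	assert(tip && stack && cap > 0);
	tip->flags |= TOY_SEEN;
	stack[nr++] = tip;
	while (nr) {
		struct toy_commit *c = stack[--nr];

		visited++;
		for (i = 0; i < c->nr_parents; i++) {
			struct toy_commit *p = c->parents[i];

			if (p->flags & TOY_SEEN)
				continue;	/* only queue ~SEEN parents */
			p->flags |= TOY_SEEN;
			assert(nr < cap);
			stack[nr++] = p;
		}
	}
	return visited;
}
```

On a four-commit diamond the first call visits all four commits.  A
second call without clearing the flags visits only the tip, which is
the same kind of stale-state problem that matters once a process runs
more than one walk.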

Alas, some of those commands are buggy, or at least 'git describe' is
[1], and AFAICT the only way to fix that bug is to walk the history in
topo-order.  And of course we should not roll its own topo-order
revision walk for each of those commands, but rather should convert
them to use get_revision(), so they can all rely on the magic of the
commit-graph-based on-the-fly topo-order, especially since the
commit-graph is now enabled by default.  However, all this means a lot
of separate get_revision()-based revision walks in a single process.

[1]  https://public-inbox.org/git/20191008123156.GG11529@szeder.dev/



end of thread, other threads:[~2019-11-18 23:04 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-27 20:41 [PATCH 0/6] Use generation numbers for --topo-order Derrick Stolee via GitGitGadget
2018-08-27 20:41 ` [PATCH 1/6] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
2018-08-27 20:41 ` [PATCH 2/6] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
2018-08-27 20:41 ` [PATCH 3/6] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
2018-08-27 20:41 ` [PATCH 4/6] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
2018-08-27 20:41 ` [PATCH 5/6] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
2018-08-27 20:41 ` [PATCH 6/6] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
2018-08-27 21:23 ` [PATCH 0/6] Use generation numbers for --topo-order Junio C Hamano
2018-09-18  4:08 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2018-09-18  4:08   ` [PATCH v2 1/6] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
2018-09-18  4:08   ` [PATCH v2 2/6] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
2018-09-18 18:02     ` SZEDER Gábor
2018-09-19 19:31       ` Junio C Hamano
2018-09-19 19:38         ` Junio C Hamano
2018-09-20 21:18           ` Junio C Hamano
2018-09-18  4:08   ` [PATCH v2 3/6] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
2018-09-18  4:08   ` [PATCH v2 4/6] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
2018-09-18  4:08   ` [PATCH v2 5/6] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
2018-09-18  4:08   ` [PATCH v2 6/6] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
2018-09-18  5:51     ` Ævar Arnfjörð Bjarmason
2018-09-18  6:05   ` [PATCH v2 0/6] Use generation numbers for --topo-order Ævar Arnfjörð Bjarmason
2018-09-21 15:47     ` Derrick Stolee
2018-09-21 17:39   ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
2018-09-21 17:39     ` [PATCH v3 1/7] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
2018-09-26 19:15       ` Derrick Stolee
2018-10-11 13:54       ` Jeff King
2018-09-21 17:39     ` [PATCH v3 2/7] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
2018-10-11 13:57       ` Jeff King
2018-09-21 17:39     ` [PATCH v3 3/7] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
2018-10-11 13:58       ` Jeff King
2018-10-12  4:34         ` Junio C Hamano
2018-09-21 17:39     ` [PATCH v3 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
2018-10-11 14:06       ` Jeff King
2018-10-12  6:33       ` Junio C Hamano
2018-10-12 12:32         ` Derrick Stolee
2018-10-12 16:15         ` Johannes Sixt
2018-10-13  8:05           ` Junio C Hamano
2018-09-21 17:39     ` [PATCH v3 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
2018-10-11 14:21       ` Jeff King
2018-09-21 17:39     ` [PATCH v3 6/7] revision.h: add whitespace in flag definitions Derrick Stolee via GitGitGadget
2018-09-21 17:39     ` [PATCH v3 7/7] revision.c: refactor basic topo-order logic Derrick Stolee via GitGitGadget
2018-09-27 17:57       ` Derrick Stolee
2018-10-06 16:56         ` Jakub Narebski
2018-10-11 15:35       ` Jeff King
2018-10-11 16:21         ` Derrick Stolee
2018-10-25  9:43           ` Jeff King
2018-10-25 13:00             ` Derrick Stolee
2018-10-11 22:32       ` Stefan Beller
2018-09-21 21:22     ` [PATCH v3 0/7] Use generation numbers for --topo-order Junio C Hamano
2018-10-16 22:36     ` [PATCH v4 " Derrick Stolee via GitGitGadget
2018-10-16 22:36       ` [PATCH v4 1/7] prio-queue: add 'peek' operation Derrick Stolee via GitGitGadget
2018-10-16 22:36       ` [PATCH v4 2/7] test-reach: add run_three_modes method Derrick Stolee via GitGitGadget
2018-10-16 22:36       ` [PATCH v4 3/7] test-reach: add rev-list tests Derrick Stolee via GitGitGadget
2018-10-21 10:21         ` Jakub Narebski
2018-10-21 15:28           ` Derrick Stolee
2018-10-16 22:36       ` [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee via GitGitGadget
2018-10-21 15:55         ` Jakub Narebski
2018-10-22  1:12           ` Junio C Hamano
2018-10-22  1:51             ` Derrick Stolee
2018-10-22  1:55               ` [RFC PATCH] revision.c: use new algorithm in A..B case Derrick Stolee
2018-10-25  8:28               ` [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic Junio C Hamano
2018-10-26 20:56                 ` Jakub Narebski
2018-10-16 22:36       ` [PATCH v4 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee via GitGitGadget
2018-10-21 21:17         ` Jakub Narebski
2018-10-16 22:36       ` [PATCH v4 6/7] revision.c: generation-based topo-order algorithm Derrick Stolee via GitGitGadget
2018-10-22 13:37         ` Jakub Narebski
2018-10-23 13:54           ` Derrick Stolee
2018-10-26 16:55             ` Jakub Narebski
2018-10-16 22:36       ` [PATCH v4 7/7] t6012: make rev-list tests more interesting Derrick Stolee via GitGitGadget
2018-10-23 15:48         ` Jakub Narebski
2018-10-21 12:57       ` [PATCH v4 0/7] Use generation numbers for --topo-order Jakub Narebski
2018-11-01  5:21       ` Junio C Hamano
2018-11-01 13:49         ` Derrick Stolee
2018-11-01 23:54           ` Junio C Hamano
2018-11-01 13:46       ` [PATCH v5 " Derrick Stolee
2018-11-01 13:46         ` [PATCH v5 1/7] prio-queue: add 'peek' operation Derrick Stolee
2018-11-01 13:46         ` [PATCH v5 2/7] test-reach: add run_three_modes method Derrick Stolee
2018-11-01 13:46         ` [PATCH v5 3/7] test-reach: add rev-list tests Derrick Stolee
2018-11-01 13:46         ` [PATCH v5 4/7] revision.c: begin refactoring --topo-order logic Derrick Stolee
2018-11-01 13:46         ` [PATCH v5 5/7] commit/revisions: bookkeeping before refactoring Derrick Stolee
2018-11-01 13:46         ` [PATCH v5 6/7] revision.c: generation-based topo-order algorithm Derrick Stolee
2018-11-01 15:48           ` SZEDER Gábor
2018-11-01 16:12             ` Derrick Stolee
2019-11-08  2:50           ` Mike Hommey
2019-11-11  1:07             ` Derrick Stolee
2019-11-18 23:04               ` SZEDER Gábor
2018-11-01 13:46         ` [PATCH v5 7/7] t6012: make rev-list tests more interesting Derrick Stolee
