All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] fetch: avoid repeated commits in mark_complete
@ 2011-05-19 20:48 Jeff King
  2011-05-19 21:23 ` [RFC/PATCH 0/4] commit lists as priority queues Jeff King
  2011-05-20  1:42 ` [PATCH] fetch: avoid repeated commits in mark_complete Junio C Hamano
  0 siblings, 2 replies; 12+ messages in thread
From: Jeff King @ 2011-05-19 20:48 UTC (permalink / raw)
  To: git; +Cc: git-dev

We add every local ref to a list so that we can mark them
and all of their ancestors back to a certain cutoff point.
However, if some refs point to the same commit, we will end
up adding them to the list many times.

Furthermore, since commit_lists are stored as linked lists,
we must do an O(n) traversal of the list in order to find
the right place to insert each commit. This makes building
the list O(n^2) in the number of refs.

For normal repositories, this isn't a big deal. We have a
few hundreds refs at most, and most of them are unique. But
consider an "alternates" repo that serves as an object
database for many other similar repos. For reachability, it
needs to keep a copy of the refs in each child repo. This
means it may have a large number of refs, many of which
point to the same commits.

By noting commits we have already added to the list, we can
shrink the size of "n" in such a repo to the number of
unique commits, which is on the order of what a normal repo
would contain (it's actually more than a normal repo, since child repos
may have branches at different states, but in practice it tends
to be much smaller than the list with duplicates).

Here are the results on one particular giant repo
(containing objects for all Rails forks on GitHub):

  $ git for-each-ref | wc -l
  112514

  [before]
  $ git fetch --no-tags ../remote.git
  63.52user 0.12system 1:03.68elapsed 99%CPU (0avgtext+0avgdata 137648maxresident)k
  1856inputs+48outputs (11major+19603minor)pagefaults 0swaps

  $ git fetch --no-tags ../remote.git
  6.15user 0.08system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 123856maxresident)k
  0inputs+40outputs (0major+18872minor)pagefaults 0swaps

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fetch-pack.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 85aff02..56c0b4a 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -472,8 +472,10 @@ static int mark_complete(const char *path, const unsigned char *sha1, int flag,
 	}
 	if (o && o->type == OBJ_COMMIT) {
 		struct commit *commit = (struct commit *)o;
-		commit->object.flags |= COMPLETE;
-		commit_list_insert_by_date(commit, &complete);
+		if (!(commit->object.flags & COMPLETE)) {
+			commit->object.flags |= COMPLETE;
+			commit_list_insert_by_date(commit, &complete);
+		}
 	}
 	return 0;
 }
-- 
1.7.5.8.ga95ab

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC/PATCH 0/4] commit lists as priority queues
  2011-05-19 20:48 [PATCH] fetch: avoid repeated commits in mark_complete Jeff King
@ 2011-05-19 21:23 ` Jeff King
  2011-05-19 21:24   ` [PATCH 1/4] Makefile: sort TEST_PROGRAMS list Jeff King
                     ` (4 more replies)
  2011-05-20  1:42 ` [PATCH] fetch: avoid repeated commits in mark_complete Junio C Hamano
  1 sibling, 5 replies; 12+ messages in thread
From: Jeff King @ 2011-05-19 21:23 UTC (permalink / raw)
  To: git

On Thu, May 19, 2011 at 04:48:51PM -0400, Jeff King wrote:

> By noting commits we have already added to the list, we can
> shrink the size of "n" in such a repo to the number of
> unique commits, which is on the order of what a normal repo
> would contain (it's actually more than a normal repo, since child repos
> may have branches at different states, but in practice it tends
> to be much smaller than the list with duplicates).
> 
> Here are the results on one particular giant repo
> (containing objects for all Rails forks on GitHub):
> 
>   $ git for-each-ref | wc -l
>   112514

So this seems to work pretty well for us in practice, because our
giant-refs use case is a big alternates repo, and of those ~112K refs,
there are only ~5K or so unique ones. Which is small enough that in this
timing:

>   [before]
>   $ git fetch --no-tags ../remote.git
>   63.52user 0.12system 1:03.68elapsed 99%CPU (0avgtext+0avgdata 137648maxresident)k
>   1856inputs+48outputs (11major+19603minor)pagefaults 0swaps
> 
>   $ git fetch --no-tags ../remote.git
>   6.15user 0.08system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 123856maxresident)k
>   0inputs+40outputs (0major+18872minor)pagefaults 0swaps

The 6 seconds are not spent doing anything related to mark_complete, so
it's effectively fast enough (i.e., we made "n" small enough that the
O(n^2) behavior doesn't matter). But obviously there's still a
worst-case where you get bad behavior if you have a lot of unique refs.
I don't know if anybody has that situation in practice (it might be
something like creating a ref for every commit).

So I also considered a different approach, which is to use a better data
structure to hold the sorted commit list (a priority queue). That gives
a similar speedup:

  $ git fetch --no-tags ../remote.git
  6.20user 0.03system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 123712maxresident)k
  0inputs+40outputs (0major+18885minor)pagefaults 0swaps

and actually has good asymptotic complexity. I also had a feeling we
could use a faster commit_list in other places, since get_revision is
all based around the same pop-and-push-the-parents workflow. But after
spending a few hours on it, it seems that _lots_ of places in the code
work on the assumption of a linked list (or at least doing very simple
in-order traversal), and the changes got very ugly very quickly. And
performance-wise, it doesn't seem to be a huge problem, mainly because:

  1. We don't tend to put more items in a commit_list than the width of
     the history graph. So our "n" tends to be in the dozens at most.

  2. We have the hack in fce87ae (Fix quadratic performance in
     rewrite_one., 2008-07-12) which works OK in practice.

So it's a lot of extra code, and I haven't found a case where it
actually improves performance. But here's that series, just for
reference. We may want to take 1/4, which is an unrelated cleanup.

  [1/4]: Makefile: sort TEST_PROGRAMS list
  [2/4]: basic priority queue implementation
  [3/4]: commit: add infrastructure for priority queues of commits
  [4/4]: fetch-pack: use priority queue for mark_complete

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/4] Makefile: sort TEST_PROGRAMS list
  2011-05-19 21:23 ` [RFC/PATCH 0/4] commit lists as priority queues Jeff King
@ 2011-05-19 21:24   ` Jeff King
  2011-05-19 21:24   ` [PATCH 2/4] basic priority queue implementation Jeff King
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2011-05-19 21:24 UTC (permalink / raw)
  To: git

We usually keep these lists in sorted order, but the last
few entries were just tacked on the end.

Signed-off-by: Jeff King <peff@peff.net>
---
 Makefile |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 5379aaa..d09ee70 100644
--- a/Makefile
+++ b/Makefile
@@ -415,8 +415,10 @@ TEST_PROGRAMS_NEED_X += test-date
 TEST_PROGRAMS_NEED_X += test-delta
 TEST_PROGRAMS_NEED_X += test-dump-cache-tree
 TEST_PROGRAMS_NEED_X += test-genrandom
+TEST_PROGRAMS_NEED_X += test-index-version
 TEST_PROGRAMS_NEED_X += test-line-buffer
 TEST_PROGRAMS_NEED_X += test-match-trees
+TEST_PROGRAMS_NEED_X += test-mktemp
 TEST_PROGRAMS_NEED_X += test-obj-pool
 TEST_PROGRAMS_NEED_X += test-parse-options
 TEST_PROGRAMS_NEED_X += test-path-utils
@@ -427,8 +429,6 @@ TEST_PROGRAMS_NEED_X += test-string-pool
 TEST_PROGRAMS_NEED_X += test-subprocess
 TEST_PROGRAMS_NEED_X += test-svn-fe
 TEST_PROGRAMS_NEED_X += test-treap
-TEST_PROGRAMS_NEED_X += test-index-version
-TEST_PROGRAMS_NEED_X += test-mktemp
 
 TEST_PROGRAMS = $(patsubst %,%$X,$(TEST_PROGRAMS_NEED_X))
 
-- 
1.7.5.8.ga95ab

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/4] basic priority queue implementation
  2011-05-19 21:23 ` [RFC/PATCH 0/4] commit lists as priority queues Jeff King
  2011-05-19 21:24   ` [PATCH 1/4] Makefile: sort TEST_PROGRAMS list Jeff King
@ 2011-05-19 21:24   ` Jeff King
  2011-05-20  0:47     ` Thiago Farina
  2011-05-19 21:25   ` [PATCH 3/4] commit: add infrastructure for priority queues of commits Jeff King
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2011-05-19 21:24 UTC (permalink / raw)
  To: git

This can provide better algorithmic complexity for some of
the date-sorted commit list uses, among other things.

The queue is stored as a heap, allowing O(log) insertion and
top-element removal, and O(1) peeking at the top element.
Current commit lists are sorted linked lists, giving O(1)
peeking and top-element removal, but O(n^2) insertion.

Signed-off-by: Jeff King <peff@peff.net>
---
 .gitignore       |    1 +
 Makefile         |    3 ++
 queue.c          |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 queue.h          |   17 +++++++++++++
 t/t0007-queue.sh |   50 ++++++++++++++++++++++++++++++++++++++
 test-queue.c     |   39 ++++++++++++++++++++++++++++++
 6 files changed, 180 insertions(+), 0 deletions(-)
 create mode 100644 queue.c
 create mode 100644 queue.h
 create mode 100755 t/t0007-queue.sh
 create mode 100644 test-queue.c

diff --git a/.gitignore b/.gitignore
index 711fce7..179483a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -174,6 +174,7 @@
 /test-obj-pool
 /test-parse-options
 /test-path-utils
+/test-queue
 /test-run-command
 /test-sha1
 /test-sigchain
diff --git a/Makefile b/Makefile
index d09ee70..737b2d5 100644
--- a/Makefile
+++ b/Makefile
@@ -422,6 +422,7 @@ TEST_PROGRAMS_NEED_X += test-mktemp
 TEST_PROGRAMS_NEED_X += test-obj-pool
 TEST_PROGRAMS_NEED_X += test-parse-options
 TEST_PROGRAMS_NEED_X += test-path-utils
+TEST_PROGRAMS_NEED_X += test-queue
 TEST_PROGRAMS_NEED_X += test-run-command
 TEST_PROGRAMS_NEED_X += test-sha1
 TEST_PROGRAMS_NEED_X += test-sigchain
@@ -532,6 +533,7 @@ LIB_H += parse-options.h
 LIB_H += patch-ids.h
 LIB_H += pkt-line.h
 LIB_H += progress.h
+LIB_H += queue.h
 LIB_H += quote.h
 LIB_H += reflog-walk.h
 LIB_H += refs.h
@@ -629,6 +631,7 @@ LIB_OBJS += pkt-line.o
 LIB_OBJS += preload-index.o
 LIB_OBJS += pretty.o
 LIB_OBJS += progress.o
+LIB_OBJS += queue.o
 LIB_OBJS += quote.o
 LIB_OBJS += reachable.o
 LIB_OBJS += read-cache.o
diff --git a/queue.c b/queue.c
new file mode 100644
index 0000000..068b559
--- /dev/null
+++ b/queue.c
@@ -0,0 +1,70 @@
+#include "queue.h"
+#include "cache.h"
+
+static inline void queue_swap(struct queue *pq, unsigned i, unsigned j)
+{
+	void *tmp = pq->items[i];
+	pq->items[i] = pq->items[j];
+	pq->items[j] = tmp;
+}
+
+static void queue_heapify_up(struct queue *pq)
+{
+	unsigned i = pq->nr - 1;
+	while (i > 0) {
+		int parent = (i-1)/2;
+
+		if (pq->cmp(pq->items[i], pq->items[parent]) <= 0)
+			return;
+
+		queue_swap(pq, i, parent);
+		i = parent;
+	}
+}
+
+void queue_insert(struct queue *pq, void *item)
+{
+	ALLOC_GROW(pq->items, pq->nr + 1, pq->alloc);
+	pq->items[pq->nr++] = item;
+	queue_heapify_up(pq);
+}
+
+static void queue_heapify_down(struct queue *pq)
+{
+	unsigned i = 0;
+	while (1) {
+		int largest = i, left = 2*i + 1, right = 2*i + 2;
+
+		if (left < pq->nr &&
+		    pq->cmp(pq->items[left], pq->items[largest]) > 0)
+			largest = left;
+		if (right < pq->nr &&
+		    pq->cmp(pq->items[right], pq->items[largest]) > 0)
+			largest = right;
+
+		if (largest == i)
+			return;
+
+		queue_swap(pq, i, largest);
+		i = largest;
+	}
+}
+
+void *queue_peek(struct queue *pq)
+{
+	return pq->nr ? pq->items[0] : NULL;
+}
+
+void *queue_pop(struct queue *pq)
+{
+	void *ret;
+
+	if (!pq->nr)
+		return NULL;
+	ret = pq->items[0];
+
+	pq->items[0] = pq->items[--pq->nr];
+	queue_heapify_down(pq);
+
+	return ret;
+}
diff --git a/queue.h b/queue.h
new file mode 100644
index 0000000..fee5a51
--- /dev/null
+++ b/queue.h
@@ -0,0 +1,17 @@
+#ifndef QUEUE_H
+#define QUEUE_H
+
+typedef int (*queue_comparison_func_t)(const void *, const void *);
+
+struct queue {
+	queue_comparison_func_t cmp;
+	void **items;
+	unsigned nr;
+	unsigned alloc;
+};
+
+void queue_insert(struct queue *pq, void *item);
+void *queue_peek(struct queue *pq);
+void *queue_pop(struct queue *pq);
+
+#endif /* QUEUE_H */
diff --git a/t/t0007-queue.sh b/t/t0007-queue.sh
new file mode 100755
index 0000000..ee357bb
--- /dev/null
+++ b/t/t0007-queue.sh
@@ -0,0 +1,50 @@
+#!/bin/sh
+
+test_description='basic tests for priority queue implementation'
+. ./test-lib.sh
+
+cat >expect <<'EOF'
+10
+9
+8
+7
+6
+5
+5
+4
+3
+2
+1
+EOF
+test_expect_success 'basic ordering' '
+	test-queue 2 6 3 10 9 5 7 4 5 8 1 dump >actual &&
+	test_cmp expect actual
+'
+
+cat >expect <<'EOF'
+3
+5
+4
+6
+2
+1
+EOF
+test_expect_success 'mixed insert and pop' '
+	test-queue 1 2 3 pop 4 5 pop pop 6 dump >actual &&
+	test_cmp expect actual
+'
+
+cat >expect <<'EOF'
+2
+1
+NULL
+2
+1
+NULL
+EOF
+test_expect_success 'notice empty queue' '
+	test-queue 1 2 pop pop pop 1 2 pop pop pop >actual &&
+	test_cmp expect actual
+'
+
+test_done
diff --git a/test-queue.c b/test-queue.c
new file mode 100644
index 0000000..a1e92e2
--- /dev/null
+++ b/test-queue.c
@@ -0,0 +1,39 @@
+#include "cache.h"
+#include "queue.h"
+
+static int intcmp(const void *va, const void *vb)
+{
+	const int *a = va, *b = vb;
+	return *a - *b;
+}
+
+static void show(int *v)
+{
+	if (!v)
+		printf("NULL\n");
+	else
+		printf("%d\n", *v);
+	free(v);
+}
+
+int main(int argc, char **argv)
+{
+	struct queue pq = { intcmp };
+
+	while (*++argv) {
+		if (!strcmp(*argv, "pop"))
+			show(queue_pop(&pq));
+		else if (!strcmp(*argv, "dump")) {
+			int *v;
+			while ((v = queue_pop(&pq)))
+			       show(v);
+		}
+		else {
+			int *v = malloc(sizeof(*v));
+			*v = atoi(*argv);
+			queue_insert(&pq, v);
+		}
+	}
+
+	return 0;
+}
-- 
1.7.5.8.ga95ab

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/4] commit: add infrastructure for priority queues of commits
  2011-05-19 21:23 ` [RFC/PATCH 0/4] commit lists as priority queues Jeff King
  2011-05-19 21:24   ` [PATCH 1/4] Makefile: sort TEST_PROGRAMS list Jeff King
  2011-05-19 21:24   ` [PATCH 2/4] basic priority queue implementation Jeff King
@ 2011-05-19 21:25   ` Jeff King
  2011-05-19 21:26   ` [PATCH 4/4] fetch-pack: use priority queue for mark_complete Jeff King
  2011-05-20 22:03   ` [RFC/PATCH 0/4] commit lists as priority queues Junio C Hamano
  4 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2011-05-19 21:25 UTC (permalink / raw)
  To: git

Using a priority queue to store date-ordered commits can be
algorithmically faster than the current implementation
(which uses sorted linked lists).

The previous commit introduced a generic priority queue.
This commit adds some infrastructure for using it with
commits. Specifically:

  1. A date-comparison function and queue-initializer, so
     you can create a queue like:

       struct queue pq = COMMIT_QUEUE_INIT;

  2. A function to pop the most recent commit from the
     queue, adding its parents to the queue unless a
     particular mark is seen on them. This is equivalent to
     the current pop_most_recent_commit, which operates on
     commit_list structs.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit.c |   27 +++++++++++++++++++++++++++
 commit.h |    5 +++++
 2 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/commit.c b/commit.c
index ac337c7..9d0a182 100644
--- a/commit.c
+++ b/commit.c
@@ -419,6 +419,33 @@ struct commit *pop_most_recent_commit(struct commit_list **list,
 	return ret;
 }
 
+int commit_datecmp(const void *va, const void *vb)
+{
+	const struct commit *a = va, *b = vb;
+	if (a->date < b->date)
+		return -1;
+	else if (a->date > b->date)
+		return 1;
+	return 0;
+}
+
+struct commit *pop_commit_from_queue(struct queue *pq, unsigned int mark)
+{
+	struct commit *ret = queue_pop(pq);
+	struct commit_list *parents = ret ? ret->parents : NULL;
+
+	while (parents) {
+		struct commit *commit = parents->item;
+		if (!parse_commit(commit) && !(commit->object.flags & mark)) {
+			commit->object.flags |= mark;
+			queue_insert(pq, commit);
+		}
+		parents = parents->next;
+	}
+
+	return ret;
+}
+
 void clear_commit_marks(struct commit *commit, unsigned int mark)
 {
 	while (commit) {
diff --git a/commit.h b/commit.h
index 43940e2..87c9b4a 100644
--- a/commit.h
+++ b/commit.h
@@ -5,6 +5,7 @@
 #include "tree.h"
 #include "strbuf.h"
 #include "decorate.h"
+#include "queue.h"
 
 struct commit_list {
 	struct commit *item;
@@ -121,6 +122,10 @@ void pp_remainder(enum cmit_fmt fmt,
 struct commit *pop_most_recent_commit(struct commit_list **list,
 				      unsigned int mark);
 
+#define COMMIT_QUEUE_INIT { commit_datecmp }
+int commit_datecmp(const void *a, const void *b);
+struct commit *pop_commit_from_queue(struct queue *pq, unsigned int mark);
+
 struct commit *pop_commit(struct commit_list **stack);
 
 void clear_commit_marks(struct commit *commit, unsigned int mark);
-- 
1.7.5.8.ga95ab

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 4/4] fetch-pack: use priority queue for mark_complete
  2011-05-19 21:23 ` [RFC/PATCH 0/4] commit lists as priority queues Jeff King
                     ` (2 preceding siblings ...)
  2011-05-19 21:25   ` [PATCH 3/4] commit: add infrastructure for priority queues of commits Jeff King
@ 2011-05-19 21:26   ` Jeff King
  2011-05-20 22:03   ` [RFC/PATCH 0/4] commit lists as priority queues Junio C Hamano
  4 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2011-05-19 21:26 UTC (permalink / raw)
  To: git

The regular commit_list has quadratic insertion complexity.
Fetch-pack will insert an item for each ref we have.  For a
"normal" number of refs, this is not a big deal, but it is
quite inefficient when you have tens or hundreds of
thousands of refs.

Instead, let's use a priority queue with better insertion
performance. For example, timings from one repo (which
stores alternates for all of the Rails forks at GitHub, with
mirrored refs from each individual repo to manage object
pruning):

  $ git for-each-ref | wc -l
  112514

  [before]
  $ time git fetch --no-tags ../remote.git
  63.52user 0.12system 1:03.68elapsed 99%CPU (0avgtext+0avgdata 137648maxresident)k
  1856inputs+48outputs (11major+19603minor)pagefaults 0swaps

  [after]
  $ time git fetch --no-tags ../remote.git
  6.20user 0.03system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 123712maxresident)k
  0inputs+40outputs (0major+18885minor)pagefaults 0swaps

Signed-off-by: Jeff King <peff@peff.net>
---
So this is the big performance-win patch. But if you didn't read my
other "fetch: avoid repeated commits in mark_complete", go do that. It's
a much simpler solution that yields the same speedup in practice.

 builtin/fetch-pack.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 85aff02..9a951f3 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -457,7 +457,7 @@ done:
 	return count ? retval : 0;
 }
 
-static struct commit_list *complete;
+static struct queue complete = COMMIT_QUEUE_INIT;
 
 static int mark_complete(const char *path, const unsigned char *sha1, int flag, void *cb_data)
 {
@@ -473,18 +473,19 @@ static int mark_complete(const char *path, const unsigned char *sha1, int flag,
 	if (o && o->type == OBJ_COMMIT) {
 		struct commit *commit = (struct commit *)o;
 		commit->object.flags |= COMPLETE;
-		commit_list_insert_by_date(commit, &complete);
+		queue_insert(&complete, commit);
 	}
 	return 0;
 }
 
 static void mark_recent_complete_commits(unsigned long cutoff)
 {
-	while (complete && cutoff <= complete->item->date) {
+	struct commit *c;
+	while ((c = queue_peek(&complete)) && cutoff <= c->date) {
 		if (args.verbose)
 			fprintf(stderr, "Marking %s as complete\n",
-				sha1_to_hex(complete->item->object.sha1));
-		pop_most_recent_commit(&complete, COMPLETE);
+				sha1_to_hex(c->object.sha1));
+		pop_commit_from_queue(&complete, COMPLETE);
 	}
 }
 
-- 
1.7.5.8.ga95ab

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] basic priority queue implementation
  2011-05-19 21:24   ` [PATCH 2/4] basic priority queue implementation Jeff King
@ 2011-05-20  0:47     ` Thiago Farina
  2011-05-20  7:38       ` Jeff King
  0 siblings, 1 reply; 12+ messages in thread
From: Thiago Farina @ 2011-05-20  0:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, May 19, 2011 at 6:24 PM, Jeff King <peff@peff.net> wrote:
> This can provide better algorithmic complexity for some of
> the date-sorted commit list uses, among other things.
>
> The queue is stored as a heap, allowing O(log) insertion and
> top-element removal, and O(1) peeking at the top element.
> Current commit lists are sorted linked lists, giving O(1)
> peeking and top-element removal, but O(n^2) insertion.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  .gitignore       |    1 +
>  Makefile         |    3 ++
>  queue.c          |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  queue.h          |   17 +++++++++++++
>  t/t0007-queue.sh |   50 ++++++++++++++++++++++++++++++++++++++
>  test-queue.c     |   39 ++++++++++++++++++++++++++++++
>  6 files changed, 180 insertions(+), 0 deletions(-)
>  create mode 100644 queue.c
>  create mode 100644 queue.h
>  create mode 100755 t/t0007-queue.sh
>  create mode 100644 test-queue.c
>
> diff --git a/.gitignore b/.gitignore
> index 711fce7..179483a 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -174,6 +174,7 @@
>  /test-obj-pool
>  /test-parse-options
>  /test-path-utils
> +/test-queue
>  /test-run-command
>  /test-sha1
>  /test-sigchain
> diff --git a/Makefile b/Makefile
> index d09ee70..737b2d5 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -422,6 +422,7 @@ TEST_PROGRAMS_NEED_X += test-mktemp
>  TEST_PROGRAMS_NEED_X += test-obj-pool
>  TEST_PROGRAMS_NEED_X += test-parse-options
>  TEST_PROGRAMS_NEED_X += test-path-utils
> +TEST_PROGRAMS_NEED_X += test-queue
>  TEST_PROGRAMS_NEED_X += test-run-command
>  TEST_PROGRAMS_NEED_X += test-sha1
>  TEST_PROGRAMS_NEED_X += test-sigchain
> @@ -532,6 +533,7 @@ LIB_H += parse-options.h
>  LIB_H += patch-ids.h
>  LIB_H += pkt-line.h
>  LIB_H += progress.h
> +LIB_H += queue.h
>  LIB_H += quote.h
>  LIB_H += reflog-walk.h
>  LIB_H += refs.h
> @@ -629,6 +631,7 @@ LIB_OBJS += pkt-line.o
>  LIB_OBJS += preload-index.o
>  LIB_OBJS += pretty.o
>  LIB_OBJS += progress.o
> +LIB_OBJS += queue.o
>  LIB_OBJS += quote.o
>  LIB_OBJS += reachable.o
>  LIB_OBJS += read-cache.o
> diff --git a/queue.c b/queue.c
> new file mode 100644
> index 0000000..068b559
> --- /dev/null
> +++ b/queue.c
> @@ -0,0 +1,70 @@
> +#include "queue.h"
> +#include "cache.h"
> +
> +static inline void queue_swap(struct queue *pq, unsigned i, unsigned j)
> +{
> +       void *tmp = pq->items[i];
> +       pq->items[i] = pq->items[j];
> +       pq->items[j] = tmp;
> +}
> +
> +static void queue_heapify_up(struct queue *pq)
> +{
> +       unsigned i = pq->nr - 1;
> +       while (i > 0) {
> +               int parent = (i-1)/2;
> +
> +               if (pq->cmp(pq->items[i], pq->items[parent]) <= 0)
> +                       return;
> +
> +               queue_swap(pq, i, parent);
> +               i = parent;
> +       }
> +}
> +
> +void queue_insert(struct queue *pq, void *item)
> +{
> +       ALLOC_GROW(pq->items, pq->nr + 1, pq->alloc);
> +       pq->items[pq->nr++] = item;
> +       queue_heapify_up(pq);
> +}
> +
> +static void queue_heapify_down(struct queue *pq)
> +{
> +       unsigned i = 0;
> +       while (1) {
> +               int largest = i, left = 2*i + 1, right = 2*i + 2;
> +
> +               if (left < pq->nr &&
> +                   pq->cmp(pq->items[left], pq->items[largest]) > 0)
> +                       largest = left;
> +               if (right < pq->nr &&
> +                   pq->cmp(pq->items[right], pq->items[largest]) > 0)
> +                       largest = right;
> +
> +               if (largest == i)
> +                       return;
> +
> +               queue_swap(pq, i, largest);
> +               i = largest;
> +       }
> +}
> +
> +void *queue_peek(struct queue *pq)
> +{
> +       return pq->nr ? pq->items[0] : NULL;
> +}
> +
> +void *queue_pop(struct queue *pq)
> +{
> +       void *ret;
> +
> +       if (!pq->nr)
> +               return NULL;
> +       ret = pq->items[0];
> +
> +       pq->items[0] = pq->items[--pq->nr];
> +       queue_heapify_down(pq);
> +
> +       return ret;
> +}
> diff --git a/queue.h b/queue.h
> new file mode 100644
> index 0000000..fee5a51
> --- /dev/null
> +++ b/queue.h
> @@ -0,0 +1,17 @@
> +#ifndef QUEUE_H
> +#define QUEUE_H
> +
> +typedef int (*queue_comparison_func_t)(const void *, const void *);
> +
> +struct queue {
> +       queue_comparison_func_t cmp;
> +       void **items;
> +       unsigned nr;
> +       unsigned alloc;
> +};
> +
> +void queue_insert(struct queue *pq, void *item);
I'd rename this to queue_append as we add |item| to the end of the
array (like you did for sha1_array_append), opposed of inserting it at
some position/index.

> +void *queue_peek(struct queue *pq);
> +void *queue_pop(struct queue *pq);
> +
> +#endif /* QUEUE_H */
> diff --git a/t/t0007-queue.sh b/t/t0007-queue.sh
> new file mode 100755
> index 0000000..ee357bb
> --- /dev/null
> +++ b/t/t0007-queue.sh
> @@ -0,0 +1,50 @@
> +#!/bin/sh
> +
> +test_description='basic tests for priority queue implementation'
> +. ./test-lib.sh
> +
> +cat >expect <<'EOF'
> +10
> +9
> +8
> +7
> +6
> +5
> +5
> +4
> +3
> +2
> +1
> +EOF
> +test_expect_success 'basic ordering' '
> +       test-queue 2 6 3 10 9 5 7 4 5 8 1 dump >actual &&
> +       test_cmp expect actual
> +'
> +
> +cat >expect <<'EOF'
> +3
> +5
> +4
> +6
> +2
> +1
> +EOF
> +test_expect_success 'mixed insert and pop' '
> +       test-queue 1 2 3 pop 4 5 pop pop 6 dump >actual &&
> +       test_cmp expect actual
> +'
> +
> +cat >expect <<'EOF'
> +2
> +1
> +NULL
> +2
> +1
> +NULL
> +EOF
> +test_expect_success 'notice empty queue' '
> +       test-queue 1 2 pop pop pop 1 2 pop pop pop >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_done
> diff --git a/test-queue.c b/test-queue.c
> new file mode 100644
> index 0000000..a1e92e2
> --- /dev/null
> +++ b/test-queue.c
> @@ -0,0 +1,39 @@
> +#include "cache.h"
> +#include "queue.h"
> +
> +static int intcmp(const void *va, const void *vb)
> +{
> +       const int *a = va, *b = vb;
> +       return *a - *b;
> +}
> +
> +static void show(int *v)
> +{
> +       if (!v)
> +               printf("NULL\n");
> +       else
> +               printf("%d\n", *v);
> +       free(v);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +       struct queue pq = { intcmp };
> +
> +       while (*++argv) {
> +               if (!strcmp(*argv, "pop"))
> +                       show(queue_pop(&pq));
> +               else if (!strcmp(*argv, "dump")) {
> +                       int *v;
> +                       while ((v = queue_pop(&pq)))
> +                              show(v);
> +               }
> +               else {
> +                       int *v = malloc(sizeof(*v));
> +                       *v = atoi(*argv);
> +                       queue_insert(&pq, v);
> +               }
> +       }
> +
> +       return 0;
> +}
> --
> 1.7.5.8.ga95ab
>
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] fetch: avoid repeated commits in mark_complete
  2011-05-19 20:48 [PATCH] fetch: avoid repeated commits in mark_complete Jeff King
  2011-05-19 21:23 ` [RFC/PATCH 0/4] commit lists as priority queues Jeff King
@ 2011-05-20  1:42 ` Junio C Hamano
  1 sibling, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2011-05-20  1:42 UTC (permalink / raw)
  To: Jeff King; +Cc: git, git-dev

Trivially correct and the right thing to do.  I love this kind of patches
;-)

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] basic priority queue implementation
  2011-05-20  0:47     ` Thiago Farina
@ 2011-05-20  7:38       ` Jeff King
  2011-05-20 13:13         ` Thiago Farina
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2011-05-20  7:38 UTC (permalink / raw)
  To: Thiago Farina; +Cc: git

On Thu, May 19, 2011 at 09:47:38PM -0300, Thiago Farina wrote:

> > +void queue_insert(struct queue *pq, void *item);
> I'd rename this to queue_append as we add |item| to the end of the
> array (like you did for sha1_array_append), opposed of inserting it at
> some position/index.

It's definitely not an append. The data structure is a priority queue,
so the element is inserted within the heap at the proper position
according to the comparison function (notice that we stick at the end,
but then heapify_up).

Speaking of naming, though, the real problem is that this data structure
should be called "pqueue" or something similar to indicate that it is
not a simple FIFO. Unfortunately, the short-and-sweet "pqueue" is taken
by openssl, which pollutes all over the global namespace.

-Peff

PS If you don't mind, please try to trim your quoted text a bit. Finding
your 3-line paragraph amid 200 lines of quoted text was a little
challenging. :)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] basic priority queue implementation
  2011-05-20  7:38       ` Jeff King
@ 2011-05-20 13:13         ` Thiago Farina
  2011-05-20 13:23           ` Jeff King
  0 siblings, 1 reply; 12+ messages in thread
From: Thiago Farina @ 2011-05-20 13:13 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Fri, May 20, 2011 at 4:38 AM, Jeff King <peff@peff.net> wrote:
> On Thu, May 19, 2011 at 09:47:38PM -0300, Thiago Farina wrote:
>
>> > +void queue_insert(struct queue *pq, void *item);
>> I'd rename this to queue_append as we add |item| to the end of the
>> array (like you did for sha1_array_append), opposed of inserting it at
>> some position/index.
>
> It's definitely not an append. The data structure is a priority queue,
> so the element is inserted within the heap at the proper position
> according to the comparison function (notice that we stick at the end,
> but then heapify_up).
>
OK, sorry. I didn't read the heapify_up part :(

> Speaking of naming, though, the real problem is that this data structure
> should be called "pqueue" or something similar to indicate that it is
> not a simple FIFO. Unfortunately, the short-and-sweet "pqueue" is taken
> by openssl, which pollutes all over the global namespace.
>
Hum, yeah when I read the commit message I though about the name of
the structure, but I didn't want to bother you with that. Probably you
have considered the option of naming it 'priority_queue' too. Haven't
you chose priority_queue, because you consider the name longer than
necessary?

> -Peff
>
> PS If you don't mind, please try to trim your quoted text a bit. Finding
> your 3-line paragraph amid 200 lines of quoted text was a little
> challenging. :)
>
OK. ;)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] basic priority queue implementation
  2011-05-20 13:13         ` Thiago Farina
@ 2011-05-20 13:23           ` Jeff King
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2011-05-20 13:23 UTC (permalink / raw)
  To: Thiago Farina; +Cc: git

On Fri, May 20, 2011 at 10:13:10AM -0300, Thiago Farina wrote:

> > Speaking of naming, though, the real problem is that this data structure
> > should be called "pqueue" or something similar to indicate that it is
> > not a simple FIFO. Unfortunately, the short-and-sweet "pqueue" is taken
> > by openssl, which pollutes all over the global namespace.
> >
> Hum, yeah when I read the commit message I though about the name of
> the structure, but I didn't want to bother you with that. Probably you
> have considered the option of naming it 'priority_queue' too. Haven't
> you chose priority_queue, because you consider the name longer than
> necessary?

Exactly. Though unless somebody can come up with a practical case where
using the priority queue speeds up some operation, I don't think it is
worth applying anyway. So the naming question is somewhat academic. :)

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC/PATCH 0/4] commit lists as priority queues
  2011-05-19 21:23 ` [RFC/PATCH 0/4] commit lists as priority queues Jeff King
                     ` (3 preceding siblings ...)
  2011-05-19 21:26   ` [PATCH 4/4] fetch-pack: use priority queue for mark_complete Jeff King
@ 2011-05-20 22:03   ` Junio C Hamano
  4 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2011-05-20 22:03 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> ... I also had a feeling we
> could use a faster commit_list in other places, since get_revision is
> all based around the same pop-and-push-the-parents workflow. But after
> spending a few hours on it, it seems that _lots_ of places in the code
> work on the assumption of a linked list (or at least doing very simple
> in-order traversal), and the changes got very ugly very quickly.

I recall looking at the same codepaths a few years ago for different
reasons. get_revision() callchain does a _lot_ of alloc/free, so I
naturally wanted to catch all the allocations and frees to replace them
with a custom allocator that slices from a larger slab. It involved too
many places so I decided it wasn't really worth doing only that change
without other benefit (e.g. lifting the "listness" assumption), and
stopped.

If somebody is interested to tackle it, we would probably need a two step
conversion process. First introduce an abstracted interface that passes
around the "list-as-a-whole" object with a few selected methods (e.g. drop
an element into it at the sorted location and get a "cursor" that points
at it, given a "cursor", grab an element from the list, increment the
"cursor" to point at the next object, etc.), choosing a set just
sufficient to express the patterns of accesses the current codebase makes,
and implement the API over the existing linked-list implementation to
convert all the callers. After making sure this works, update the API
implementation with the queue in this patch.

Or something like that.

> ... And
> performance-wise, it doesn't seem to be a huge problem, mainly because:
>
>   1. We don't tend to put more items in a commit_list than the width of
>      the history graph. So our "n" tends to be in the dozens at most.

I was worried that you may end up with many concurrent traversal going in
a very bushy history (think: traversing from the tip of "next"), but I
agree that it would be in the order of dozens, not thousands.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2011-05-20 22:03 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-19 20:48 [PATCH] fetch: avoid repeated commits in mark_complete Jeff King
2011-05-19 21:23 ` [RFC/PATCH 0/4] commit lists as priority queues Jeff King
2011-05-19 21:24   ` [PATCH 1/4] Makefile: sort TEST_PROGRAMS list Jeff King
2011-05-19 21:24   ` [PATCH 2/4] basic priority queue implementation Jeff King
2011-05-20  0:47     ` Thiago Farina
2011-05-20  7:38       ` Jeff King
2011-05-20 13:13         ` Thiago Farina
2011-05-20 13:23           ` Jeff King
2011-05-19 21:25   ` [PATCH 3/4] commit: add infrastructure for priority queues of commits Jeff King
2011-05-19 21:26   ` [PATCH 4/4] fetch-pack: use priority queue for mark_complete Jeff King
2011-05-20 22:03   ` [RFC/PATCH 0/4] commit lists as priority queues Junio C Hamano
2011-05-20  1:42 ` [PATCH] fetch: avoid repeated commits in mark_complete Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.