All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Subject: [RFC/PATCH 0/4] commit lists as priority queues
Date: Thu, 19 May 2011 17:23:49 -0400	[thread overview]
Message-ID: <20110519212349.GA28589@sigill.intra.peff.net> (raw)
In-Reply-To: <20110519204851.GA28574@sigill.intra.peff.net>

On Thu, May 19, 2011 at 04:48:51PM -0400, Jeff King wrote:

> By noting commits we have already added to the list, we can
> shrink the size of "n" in such a repo to the number of
> unique commits, which is on the order of what a normal repo
> would contain (it's actually more than a normal repo, since child repos
> may have branches at different states, but in practice it tends
> to be much smaller than the list with duplicates).
> 
> Here are the results on one particular giant repo
> (containing objects for all Rails forks on GitHub):
> 
>   $ git for-each-ref | wc -l
>   112514

So this seems to work pretty well for us in practice, because our
giant-refs use case is a big alternates repo, and of those ~112K refs,
there are only ~5K or so unique ones. Which is small enough that in this
timing:

>   [before]
>   $ git fetch --no-tags ../remote.git
>   63.52user 0.12system 1:03.68elapsed 99%CPU (0avgtext+0avgdata 137648maxresident)k
>   1856inputs+48outputs (11major+19603minor)pagefaults 0swaps
> 
>   $ git fetch --no-tags ../remote.git
>   6.15user 0.08system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 123856maxresident)k
>   0inputs+40outputs (0major+18872minor)pagefaults 0swaps

The 6 seconds are not spent doing anything related to mark_complete, so
it's effectively fast enough (i.e., we made "n" small enough that the
O(n^2) behavior doesn't matter). But obviously there's still a
worst-case where you get bad behavior if you have a lot of unique refs.
I don't know if anybody has that situation in practice (it might be
something like creating a ref for every commit).

So I also considered a different approach, which is to use a better data
structure to hold the sorted commit list (a priority queue). That gives
a similar speedup:

  $ git fetch --no-tags ../remote.git
  6.20user 0.03system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 123712maxresident)k
  0inputs+40outputs (0major+18885minor)pagefaults 0swaps

and actually has good asymptotic complexity. I also had a feeling we
could use a faster commit_list in other places, since get_revision is
all based around the same pop-and-push-the-parents workflow. But after
spending a few hours on it, it seems that _lots_ of places in the code
work on the assumption of a linked list (or at least doing very simple
in-order traversal), and the changes got very ugly very quickly. And
performance-wise, it doesn't seem to be a huge problem, mainly because:

  1. We don't tend to put more items in a commit_list than the width of
     the history graph. So our "n" tends to be in the dozens at most.

  2. We have the hack in fce87ae (Fix quadratic performance in
     rewrite_one., 2008-07-12) which works OK in practice.

So it's a lot of extra code, and I haven't found a case where it
actually improves performance. But here's that series, just for
reference. We may want to take 1/4, which is an unrelated cleanup.

  [1/4]: Makefile: sort TEST_PROGRAMS list
  [2/4]: basic priority queue implementation
  [3/4]: commit: add infrastructure for priority queues of commits
  [4/4]: fetch-pack: use priority queue for mark_complete

-Peff

  reply	other threads:[~2011-05-19 21:23 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-05-19 20:48 [PATCH] fetch: avoid repeated commits in mark_complete Jeff King
2011-05-19 21:23 ` Jeff King [this message]
2011-05-19 21:24   ` [PATCH 1/4] Makefile: sort TEST_PROGRAMS list Jeff King
2011-05-19 21:24   ` [PATCH 2/4] basic priority queue implementation Jeff King
2011-05-20  0:47     ` Thiago Farina
2011-05-20  7:38       ` Jeff King
2011-05-20 13:13         ` Thiago Farina
2011-05-20 13:23           ` Jeff King
2011-05-19 21:25   ` [PATCH 3/4] commit: add infrastructure for priority queues of commits Jeff King
2011-05-19 21:26   ` [PATCH 4/4] fetch-pack: use priority queue for mark_complete Jeff King
2011-05-20 22:03   ` [RFC/PATCH 0/4] commit lists as priority queues Junio C Hamano
2011-05-20  1:42 ` [PATCH] fetch: avoid repeated commits in mark_complete Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110519212349.GA28589@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.