[RFC] log(n)-transmissions common commit handshake

From: Thomas Rast <trast@student.ethz.ch>
To: git@vger.kernel.org
Subject: [RFC] log(n)-transmissions common commit handshake
Date: Thu, 18 Sep 2008 01:00:30 +0200	[thread overview]
Message-ID: <200809180100.32626.trast@student.ethz.ch> (raw)

[-- Attachment #1: Type: text/plain, Size: 8601 bytes --]

Hi *

This is a fairly long mail, so here's a brief abstract:

  Using two passes, first exponential stride through history, then
  bisection on the "right" windows, we implement the find_common
  handshake in log(history_size) many transmissions.  The followup
  patch partially achieves that, but a few points are open and strict
  backwards compatibility is missing (but its absence is Mostly
  Harmless).

Terminology
===========

The sending side (running upload-pack) is called server, and the
receiving side (running fetch-pack) is the client.  Commits fall into
the following categories:

'want':
	Requested by the client for transmission.

'have':
	In the client's object storage.

'common':
	In both client's and server's object storage.

'private':
	have, but not in the server's storage.

'required':
	Part of the history graph "between" commons and wants; minimal
	subset that needs to be sent over the wire as part of the
	transmission.

("Verbing weirds language." -- Calvin)

When I say "a commit is not _yet_ X", I mean "it is not yet known to
be X", usually from the client's POV.

Motivation
==========

This spawned out of the O(#haves*#wants*#history) analysis late last
week.  As a brief summary, the server uses the fairly expensive test
ok_to_give_up() to check whether it can cut the handshake short since
it knows enough commons to cover all wants.  The thread is here:

  http://article.gmane.org/gmane.comp.version-control.git/95787

(Be sure to read the follow-up regarding the first '*' bullet point.)

It does this shortcut because the client risks getting side-tracked on
a private chunk of history leading to a private root.  (According to
the commit message.  I'm not sure how effective it really is.)

This check by far dominates the rest of the server's work, which is
mostly lookup_object(): it only has to take each "have" line sha1, see
if it is present, and if so, ACK it.  The client announces when it
considers the handshake finished, so there is no other requirement to
test for it.

A more efficient handshake could -- in addition to being more
efficient :-) -- solve this problem simply by hitting the private root
very quickly and proceeding elsewhere, avoiding the server-side check.

Outline of current algorithm
============================

The client sends all its 'have' sha1s, ordered by date, in chunks of
32.  The server replies with an ACK for each 'common', i.e., for any
sha1 it finds in its own object store.

[This is complicated a bit by the exchange of refs, see "Open
questions".]

As it goes, the client marks the commons, and as soon as it runs out
of unmarket commits, it sends a "done".  The server then knows the
entire set of commons too, and can proceed to generate a pack file of
the requireds.

There is a hard limit of 256 commits that go un-ACKed until the client
gives up and declares that there are no common commits.

Idea
====

The proposed algorithm scans the haves in a different order, using two
stages: exponential stride and bisection.

(1) Exponential stride:
-----------------------

Emit the first commit, then the 2nd from that, then the 4th from that,
and so on.  As a boundary case, always emit the root.

If the history has "bounded width" (the meaning should be fairly
obvious), then this stage emits O(log(n)) commits up to the root.

(2) Bisection:
--------------

Call a pair of (A,B) where A is an ancestor of B "interesting" if B is
not yet common and A is not yet private.

Then the bisection step is: between any interesting pair of commits
(A,B) emitted so far, emit the one in the middle M=M(A,B).

["Cut points" between last common and first private commits clearly
lie only between such pairs.]

Assuming for a moment that the server responds immediately, this also
takes only O(log(n)) commits since M must be either common or private,
and thus A..M or M..B is no longer interesting, respectively.

To deal with the delayed responses in the actual protocol, at each
pass 'i' emit M(A,B) for all (A,B) in pairs(i) only if the pair is
still interesting.  Add both (A,M) and (M,B) to pairs(i+1).  Iterate.
pairs(0) is constructed from pass (1).

Intuitively, this means we emit more than log(n) only if we have
nothing else left to do.

----

Between the two stages, we achieve O(log(n)) complexity.  (This
notably means that we hit private roots in far less transmitted sha1s
than the chunk size of 32.)

Unfortunately, this ordering sometimes conflicts with the server's
ok_to_give_up() shortcut mentioned in "Motivation": Consider a linear
branch 'master', and suppose the only 'common' sent out in the first
batch was the root.  ok_to_give_up() detects that it has a common on
the only ancestry line of the want, and trigger a fake ACK for all
further commits.  This becomes a problem when we later emit, say,
'master~1' and the server ACKs it even though it is private.  (It does
not affect correctness, but can increase the depth of history
transmitted by a factor of 2 at worst.)

RFC Implementation
==================

I'll just highlight a few design decisions.  It's my first serious C
coding in years, so warm up the flamethrowers...

* The work_list (replacing rev_list) is the "recursion stack" for
  get_rev().  New "frames" are inserted such that the order of emitted
  commits is
   (a) refs
   (b) stage (1) in date order
   (b) stage (2) in bisection depth order

  get_rev() is split across functions for each case.

  Refs are moved to front because the current code emits them anyway
  (see "Open questions" below), mostly to help fill the first 64
  "have" lines with things we are going to send anyway, as opposed to
  the "real" walk where we make choices.

* The bisection does not actually remember pairs, but flags commits
  with the directions it has already started a bisection from them.
  Combined with the existing POPPED flag for commits that were
  emitted, the pair (A,B) can be determined from (A,direction) by a
  parent/children search through the history.  This simplifies the
  implementation.

  The idea is then to scan with two pointers 'half' and 'full' in
  parallel, where the first one makes only half as many steps, until
  we hit a POPPED commit.  This must then be the interval we want to
  bisect, and we can check whether it is still interesting, emit M,
  and queue it for recursion at one level deeper.

* We maintain a list 'outstanding' of all commits that have been sent,
  so that an "implied not-ACK" can be inferred from the next ACK.
  This is used to do a child-recursive NOTCOMMON flagging, analogous
  to the parent-recursive COMMON flagging on an ACK.  The bisection
  uses these flags to determine ranges.

* Yes, commit_info is a pointless level of indirection.  I had more
  tracking information in it before the last draft, and left it in for
  the time being.

[Hint: when playing with the patch, command lines such as

   git fetch-pack -v -k <url> <head> 2>&1 | git name-rev --stdin

help see the effects.]

Open Problems
=============

* Effect of branch/merge points

  I do not see a better solution to deal with branch/merge points than
  recursing into them.  There may be examples of history where this
  extra level of recursion spoils the log(n) bound.  Carefully
  choosing which side to scan first might help.

* Impact/clever use of refs

  For some reason, current git sends all refs to the server, even if
  the server should already know about lots of them.  For example, in
  git.git, emitting v1.6.0.2 covers almost all tags in the repository
  by simple ancestor scanning.

  Is there a reason for this behaviour?  Otherwise it would be better
  to emit them in date order and intelligently handle commons.  (In
  fact this does not depend on the discussed change.)

* Backwards compatibility

  As mentioned above, ok_to_give_up() is not really compatible with
  this algorithm.  I suppose we should introduce a new capability
  identifier that has the client enable the log(n) walker, and the
  server disable ok_to_give_up()?  Is there a better approach, or are
  there problems with it?  (I haven't worked with the protocol enough
  to say, though it seems harmless.)

... and of course everything that I missed. ;-)

I'd greatly appreciate comments, especially from people who know the
protocol side of git.

- Thomas

-- 
Thomas Rast
trast@student.ethz.ch

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]