Re: [PATCH 14/16] bisect: use a modified breadth-first search to find relevant weights

From: Stephan Beyer <s-beyer@gmx.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Christian Couder <chriscool@tuxfamily.org>
Subject: Re: [PATCH 14/16] bisect: use a modified breadth-first search to find relevant weights
Date: Fri, 26 Feb 2016 21:55:16 +0100	[thread overview]
Message-ID: <56D0BBB4.9020305@gmx.net> (raw)
In-Reply-To: <xmqqr3g0kvzx.fsf@gitster.mtv.corp.google.com>

Hi,

On 02/26/2016 04:09 AM, Junio C Hamano wrote:
> Interesting.  So you walk from the bottom commits, incrementing the
> weight while walking on a part of the graph that is single strand of
> pearls, or doing the "count the reachable ones the hard way" when
> you hit a merge [*1*], but either way, once you know the commit you
> are looking at is above the mid-way, i.e. reaches more than half of
> the graph, then stop walking because its children will never be
> better than that commit?

Exactly. Maybe that's an even better description for the commit message l-)

> I haven't studied the code carefully, but your mention of BFS makes
> me wonder if you can "narrow" down the search space even more.

I had an idea that aimed at accomplishing this by finding the
highest-distance merge commit using binary search. (The BFS collected
all merge commits, in that order, then you could do binary search.)
So you have O(log(#mergecommits)) "do it the hard way" weight computations.
However, in the worst case (or even in every case, I did not thoroughly
think about the cases that can occur), it could happen that it also had
to do these "do it the hard way" computations for all merge commits with
smaller weight... That's why I dropped this idea in favor of the more
simple approach that I sent to the list.

> In an earlier step of the series, you introduced a work queue to
> replace the recursion.  The recursion forces the algorithm to work
> along the topology of the history, but you have more latitude in
> selecting the commit to dig further with a work queue.
> 
> I wonder if you can sort the elements on the work queue based on
> their distance (i.e. by using a priority queue).  You know the total
> upfront, so you may find a commit whose weight is exactly half of it
> before traversing all the bottom half of the history and it may turn
> out to be even more efficient.  After all, you are only looking for
> just one such commit, not trying to find all the commits with the
> best weight.

For the compute_weight() function (the "doing it the hard way"
function), this would not make sense since all parent commits (of a
merge commit we call compute_weight() on) will have smaller distance.

However, your idea can help:
I just noticed that it's not important that I use a *B*FS because of the
acyclity. So I could also use a DFS that is "more greedy" going towards
high numbers.
Note that the traversal is always on trees because the merge commits are
cut out. So whenever a branching (a commit with more than one child)
occurs, the DFS should follow the branch of maximum length to its leaf.
(These maximum lengths can be saved during build_reversed_dag().)
Then, if there is a halfway commit, we stop. If not, we can just go on
with the remaining DFS...
I think I only rethought and rephrased your idea. :) Sounds good to me.
I'm going to add commits for that.

> [Footnote]
> 
> *1* A merge between a commit that reaches N commits and another that
> reaches M commits may have weight smaller than the sum of N and M,
> and that is why I left the original "truly stupid algorithm" Linus
> wrote as-is when I did the obvious optimization for the linear parts
> of the history.

Yes. In fact, the "truly stupid algorithm" is not truly stupid. I'm not
quite sure, but I think it's still the best algorithm known so far for
that problem. (But maybe the problem is just not very interesting in
science.)

> *2* Whenever I revisited the bisection in my head, I thought about
> ways to improve that "truly stupid" counting, but never thought
> about an approach to simply reduce the number of times the "truly
> stupid" counting has to be done.

Well, I also improved that "truly stupid" counting a little in a former
commit (patch 7). But it's just the implementation that I improved
(time/memory trade-off), not the algorithm. :)

Cheers,
  Stephan