Re: [RFC] Possible idea for GSoC 2020

From: "Jakub Narębski" <jnareb@gmail.com>
To: Abhishek Kumar <abhishekkumar8222@gmail.com>
Cc: git <git@vger.kernel.org>, Junio C Hamano <gitster@pobox.com>,
	Derrick Stolee <stolee@gmail.com>,
	Christian Couder <christian.couder@gmail.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Emily Shaffer <emilyshaffer@google.com>,
	Heba Waly <heba.waly@gmail.com>
Subject: Re: [RFC] Possible idea for GSoC 2020
Date: Fri, 27 Mar 2020 19:31:28 +0100
Message-ID: <CANQwDwcu9JvcD-BDyfVdRe5LY16OKd+-bw5Lj35GgEq7Scez7A@mail.gmail.com>
In-Reply-To: <CAHk66ftQqFqP-4kd4-8cHtCMEofSUvbeSQ24pcCCrkz7+2JG1w@mail.gmail.com>

Hello Abhishek,

Somehow I have missed replying to this email.

On Wed, 18 Mar 2020 at 17:46, Abhishek Kumar
<abhishekkumar8222@gmail.com> wrote:
[...]
> >>> My _prediction_ is that the two-dimensional system will be more
> >>> complicated to write and use, and will not have any measurable
> >>> difference. I'd be happy to be wrong, but I also would not send
> >>> anyone down this direction only to find out I'm right and that
> >>> effort was wasted.
> >>
> >> Agreed. I have been through the papers of the involved variants and on graphs
> >> comparable to some of the largest git repositories, the performance improves by
> >> fifty nanoseconds for a random query.
> >
> > I would recommend extending results for other types of large graphs to
> > the commit graphs with care.  The characteristics of those graphs are
> > quite different from characteristics of commit graph: they usually are
> > scale-free graphs, with low maximum level, and low connectivity: the
> > probability of two random nodes being connected is on the order of 10^-3 or
> > 10^-4; see e.g. https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=99
>
> > The last one, called R-ratio, means that testing on random query
> > actually tests mainly negative-cut filters.  That is why some papers
> > provide either separate numbers for negative and for positive queries,
> > or separate numbers for random and for balanced queries.
>
> I do agree that we should be careful while extending results but am skeptical
> about the performance difference. If anything, general results could help us
> eyeball the sort of improvement we can expect.

In fact, the authors of the FELINE paper, for example, add a min-post
positive-cut filter to their own index to improve performance for positive
or balanced queries.

"Reachability Queries in Very Large Graphs: A Fast Refined Online
Search Approach" (2014)
http://openproceedings.org/EDBT/2014/paper_166.pdf

> Of course, there is only one definitive source of truth -
> implementing indices and benchmarking performance.

Right.

> Speaking of special characteristics, are there any indexes designed for maximum
> performance with such graphs?

I was not able to find any reachability labelings that are intended for
commit graphs; even papers examining the characteristics of commit graphs
as graphs are scarce. One that I have found is

Marco Biazzini, Martin Monperrus, Benoit Baudry
"On Analyzing the Topology of Commit Histories in
Decentralized Version Control Systems" (2014)
https://hal.archives-ouvertes.fr/hal-01063789

> > > Additionally:
> > > 1. They require significantly more space per commit.
>
> > This depends on the type of the algorithm: is it Label-Only (answering
> > reachability queries without consulting graph), or Label+Graph
> > (augmented online search algorithms):
> > https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=78
> >
> > The CDAT chunk in current version of commit-graph format takes H+16
> > bytes per commit (where H is the size of object id hash).  From those
> > H+16 bytes 30 bits (slightly less than 4 bytes) are used for current
> > reachability label: the topological level aka generation number.
> > https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=45
> >
> > The proposed min-post interval label would take 8 bytes per commit, that
> > is 4 bytes per single number in interval.  That is not much, provided
> > that we get visible performance improvements for at least some often
> > used git commands.
>
> Agreed. Both min-post and GRAIL use 8 bytes each.

Actually GRAIL uses k*8 bytes, where k is the number of random
spanning trees, and thus the number of (negative-cut) intervals per node;
the recommended value is k=5 for dense graphs and k=2 for sparse graphs.

GRAIL = Graph Reachability indexing via rAndomized Interval
Labeling.

> Even FERRARI's
> performance would reach the flat end of diminishing returns if we
> assign three intervals (or 25 bytes), i.e. six interval ends and three bits
> for recording whether intervals are exact or approximate.

Actually the current commit-graph format can store at most
(1 << 30) + (1 << 29) + (1 << 28) - 1 (around 1.88 billion) commits,
so a commit position always fits in 31 bits.
We can use the most significant bit of one end of an interval to record
whether the interval is exact or approximate (in the current format, in
CDAT, it is used to record whether there are more than 2 parents).
This means that k intervals for FERRARI also take k*8 bytes.
Again, they recommend values of k=5 for dense graphs, and k=3 or k=2
for sparse ones. There is also the FERRARI-G variant, with k intervals
on average (a global limit), though I am not sure if it is something
that we can use.
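
To make the bit-packing idea concrete, here is a minimal sketch in Python.
It is only an illustration: the names pack_interval() and unpack_interval()
are hypothetical, the flag is placed (arbitrarily) on the start of the
interval, and big-endian 32-bit words are assumed for the on-disk layout.

```python
import struct

# Largest number of commits the current format can hold, as stated above.
MAX_COMMITS = (1 << 30) + (1 << 29) + (1 << 28) - 1

# Assumption: the top bit of the interval start carries the "exact" flag.
EXACT_FLAG = 1 << 31
VALUE_MASK = EXACT_FLAG - 1


def pack_interval(start, end, exact):
    """Pack one interval into 8 bytes (two big-endian 32-bit words)."""
    if not 0 <= start <= end <= MAX_COMMITS:
        raise ValueError("interval ends must be valid commit positions")
    flagged_start = start | (EXACT_FLAG if exact else 0)
    return struct.pack(">II", flagged_start, end)


def unpack_interval(buf):
    """Return (start, end, exact) from an 8-byte packed interval."""
    flagged_start, end = struct.unpack(">II", buf)
    return flagged_start & VALUE_MASK, end, bool(flagged_start & EXACT_FLAG)


packed = pack_interval(3, 17, exact=True)
print(len(packed), unpack_interval(packed))
```

Because every valid position is below 1 << 31, the flag never collides
with a real value, so k intervals fit in exactly k*8 bytes.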

>
> But PReaCH requires 8m + 65 bytes for each commit, which is a huge ask.

Let's take into account only "Pruning Based on DFS Numbering" from
there - I don't think reachability contraction hierarchies labeling
would be a good fit for commit-graphs, because they assume bidi-BFS.
The reverse graph is not something that can be incrementally updated.

The maximum number of 4-byte numbers (one word for each end of an interval,
and also one word for the position of a node) would be 5 or 6, depending
on whether we would store only p_tree, or the [min(p_tree), post(p_tree)]
interval directly. Even 6*4 bytes per commit is not that huge of an ask,
given that the current CDAT is H + 2*4 + 8, where H is the hash length.
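
The per-commit sizes discussed in this thread can be compared with a few
lines of arithmetic. This is a rough sketch: the helper names are made up
for illustration, and only the figures quoted above (H + 2*4 + 8 for CDAT,
k*8 for k intervals, up to 6 words for the PReaCH DFS labels) are used.

```python
def cdat_bytes(hash_len):
    """Size of one CDAT entry: tree OID, 2 parent slots, 8 bytes for
    generation number and commit date."""
    return hash_len + 2 * 4 + 8


def interval_label_bytes(k):
    """k intervals, each made of two 4-byte ends."""
    return k * 2 * 4


SHA1_LEN = 20  # hash length in bytes for SHA-1 repositories
base = cdat_bytes(SHA1_LEN)

extras = {
    "min-post (1 interval)": interval_label_bytes(1),
    "GRAIL / FERRARI, k=2": interval_label_bytes(2),
    "GRAIL / FERRARI, k=5": interval_label_bytes(5),
    "PReaCH DFS labels (6 words)": 6 * 4,
}

for name, extra in extras.items():
    share = 100 * extra / base
    print(f"{name:28s} +{extra:2d} bytes ({share:.0f}% of a CDAT entry)")
```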

> [An additional 4m bytes from current commit-graph chunk format since we
> do not store children nodes needed for the bi-directional nature of CHs.]
>
> >> 2. They require significantly more preprocessing time.
> >
> > This again depends on the type of algorithm: Label-Only or Label+G.
> >
> > In the case of min-post interval labels, they can be computed together
> > with generation number, during the same commit-graph walk. The amount
> > of calculations required to compute min-post interval is not much.
> > Therefore I think it would not be an unreasonable cost.
>
> Also agreed. I do consider min-post interval labels and GRAIL to be some of
> more reasonable choices.
>
> But FERRARI would have marginally better performance than GRAIL and the
> five DFS passes made by PReaCH during preprocessing make it unsuitable.

Actually all five PReaCH DFS-labels can be computed during a
_single_ DFS pass; this is implemented in Colaboratory:
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=4l0Ld0Jklq1o
as find_dfs_intervals_extra().
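
As a much smaller illustration of the single-pass idea, the sketch below
computes only the min-post interval [min(v), post(v)] in one iterative DFS
over the parent links. It is not the Colaboratory code and does not compute
the other PReaCH labels; the function names and the toy history are
assumptions made for the example.

```python
def min_post_labels(parents, tips):
    """One DFS pass; returns {commit: (min, post)}.

    post(v) is the post-visit number of v, min(v) is the smallest post
    number among all commits reachable from v (v included).
    """
    post, low, visited = {}, {}, set()
    counter = 0
    for tip in tips:
        if tip in visited:
            continue
        visited.add(tip)
        stack = [(tip, iter(parents[tip]))]
        while stack:
            node, remaining = stack[-1]
            for parent in remaining:
                if parent not in visited:
                    visited.add(parent)
                    stack.append((parent, iter(parents[parent])))
                    break
            else:
                # All parents are finished, so their labels are known.
                stack.pop()
                counter += 1
                post[node] = counter
                low[node] = min([counter] + [low[p] for p in parents[node]])
    return {v: (low[v], post[v]) for v in post}


def may_reach(labels, u, v):
    """Negative-cut filter: False means u definitely cannot reach v."""
    (min_u, post_u), (min_v, post_v) = labels[u], labels[v]
    return min_u <= min_v and post_v <= post_u


# Toy history: two roots (A, X), a merge D of B and C, a merge E of C and X.
history = {"A": [], "X": [], "B": ["A"], "C": ["A"],
           "D": ["B", "C"], "E": ["C", "X"]}
labels = min_post_labels(history, tips=["D", "E"])
print(labels)
print(may_reach(labels, "X", "A"), may_reach(labels, "E", "A"))
```

A False answer is definitive, while a True answer still has to be confirmed
by a walk (or by another filter), which is why this label alone mainly helps
negative queries.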

[...]
> >> My recommendation is that a GSoC student update the
> >> generation number to "v2" based on the definition you made in [1].
> >> That proposal is also more likely to be effective in Git because
> >> it makes use of extra heuristic information (commit date) to
> >> assist the types of algorithms we care about.
>
> Hear me out on this but topological levels can be
> considered a special case of all corrected commit
> dates occurring one time unit apart.

Right.

> Storing corrected dates instead of topological levels
> for min-post interval labels might actually have the
> best performance of all.

But you cannot use corrected dates for post(v);
topological levels and post-visit order in DFS
are different beasts.
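
The difference is easy to see on a toy history. In the sketch below (helper
names and the example graph are made up for illustration), the topological
level depends only on the shape of the graph, while the post-visit number
changes as soon as the parents of a merge are walked in a different order.
The recursive helpers are meant for small examples only.

```python
def topological_levels(parents):
    """level(v) = 1 + max level of parents; root commits get level 1."""
    levels = {}

    def level(v):
        if v not in levels:
            levels[v] = 1 + max((level(p) for p in parents[v]), default=0)
        return levels[v]

    for v in parents:
        level(v)
    return levels


def post_numbers(parents, tips):
    """Post-visit numbers of a DFS that follows parents in listed order."""
    post, seen = {}, set()

    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for p in parents[v]:
            visit(p)
        post[v] = len(post) + 1

    for tip in tips:
        visit(tip)
    return post


history = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}
reordered = dict(history, D=["C", "B"])  # same graph, merge parents swapped

print(topological_levels(history) == topological_levels(reordered))
print(post_numbers(history, ["D", "E"]))
print(post_numbers(reordered, ["D", "E"]))
```

Levels may repeat (B and C share one), whereas post-visit numbers are a
permutation of 1..n tied to one particular traversal.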

Best,
-- 
Jakub Narębski