Re: [RFC] Possible idea for GSoC 2020

From: Christian Couder <christian.couder@gmail.com>
To: Jakub Narebski <jnareb@gmail.com>
Cc: Derrick Stolee <stolee@gmail.com>, git <git@vger.kernel.org>,
	Heba Waly <heba.waly@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Emily Shaffer <emilyshaffer@google.com>,
	Abhishek Kumar <abhishekkumar8222@gmail.com>
Subject: Re: [RFC] Possible idea for GSoC 2020
Date: Tue, 17 Mar 2020 18:04:23 +0100	[thread overview]
Message-ID: <CAP8UFD1ihR1PtM2y24HTKa2woGXMVFq9MoVSj1cHVZCNWSH7EA@mail.gmail.com> (raw)
In-Reply-To: <867dzj6p4y.fsf@gmail.com>

On Tue, Mar 17, 2020 at 3:18 PM Jakub Narebski <jnareb@gmail.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
> > On Tue, Mar 17, 2020 at 4:13 AM Jakub Narebski <jnareb@gmail.com> wrote:

> >>>> If for each commit 'v' we would compute and store in the commit-graph
> >>>> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
> >>>> that were visited during the part of depth-first search that started
> >>>> from 'v' (which is the minimum of post-order number for subtree of a
> >>>> spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
> >>>> Then the following condition is true:
> >>>>
> >>>>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
> >>>
> >>> How many places in Git do we ask "can v reach u?" and how many would
> >>> return immediately without needing a walk in this new approach? My
> >>> guess is that we will have a very narrow window where this query
> >>> returns a positive result.
> >>
> >> As I wrote below, such positive-cut filter would be directly helpful in
> >> performing the following commands:
> >>
> >>  - `git merge-base --is-ancestor`
> >>  - `git branch --contains`
> >>  - `git tag --contains`
> >>  - `git branch --merged`
> >>  - `git tag --merged`
> >>
> >> It would be also useful for tag autofollow in git-fetch; is is N-to-M
> >> equivalent to 1-to-N / N-to-1 `--contains` queries.
> >>
> >> I am quite sure that positive-cut filter would make `--ancestry-path`
> >> walk faster.
> >>
> >> I think, but I am not sure, that positive-cut filter can make parts of
> >> topological sort and merge base algorithms at least a tiny bit faster.
> >
> > Is there an easy way to check that it would provide significant
> > performance improvements at least in some cases? Can we ask the
> > student to do that at the beginning of the GSoC?
>
> The "Reachability labels for version control graphs.ipynb" Jupyter
> Notebook on Google Colaboratory was created to answer this question
> (originally for the FELINE reachability index).  Among others it can
> min-post interval labels and topological levels (generation numbers),
> use them for reachability queries, and load Linux kernel
> commit-graph. The exploration didn't get finished, but it would be not
> that difficult, I think, to at least find the amount of false negatives
> for min-post interval labeling for git.git or Linux kernel repo.
>
>   https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
>
> As Jupyter Notebook, it is run in the web browser.  It can either use
> local runtime, or run on Google Cloud runtime.  On the other hand it
> requires at least some knowledge of Python...

Ok, if that's a possible approach to the project, then please add it
to the description.

> >> I think it would be possible to compute post(v) and min_tree(v) using
> >> incremental updates, and to make it compatibile with incremental
> >> commit-graph format (with the commit-graph chain).  But I have not
> >> proven it.
> >
> > Would it be difficult to prove? What would be required? And again can
> > we ask the student to do that at the beginning of the GSoC?
>
> Formal mathematical proof by induction would be not that difficult.  The
> problem is, I think, in finding all possible classes of initial spanning
> forest arrangements, and all possible classes of commit-graph growth by
> one commit -- and examining how this affect the spanning forest.

Ok.

> >>> The point of generation number v2 [1] was to allow moving to "exact"
> >>> algorithms for things like merge-base where we still use commit time
> >>> as a heuristic, and could be wrong because of special data shapes.
> >>> We don't use generation number in these examples because using only
> >>> generation number can lead to a large increase in number of commits
> >>> walked. The example we saw in the Linux kernel repository was a bug
> >>> fix created on top of a very old commit, so there was a commit of
> >>> low generation with very high commit-date that caused extra walking.
> >>> (See [2] for a detailed description of the data shape.)
> >>>
> >>> My _prediction_ is that the two-dimensional system will be more
> >>> complicated to write and use, and will not have any measurable
> >>> difference. I'd be happy to be wrong, but I also would not send
> >>> anyone down this direction only to find out I'm right and that
> >>> effort was wasted.
> >>
> >> That might be a problem.
> >>
> >> This is a bit of a "moonshot" / research project, moreso than others.
> >> Though it would be still valuable, in my opionion, even if the code
> >> wouldn't ultimately get merged and added into Git.
> >
> > I agree that it feels like a "moonshot" / research project.
> >
> >>> My recommendation is that a GSoC student update the
> >>> generation number to "v2" based on the definition you made in [1].
> >>> That proposal is also more likely to be effective in Git because
> >>> it makes use of extra heuristic information (commit date) to
> >>> assist the types of algorithms we care about.
> >>>
> >>> In that case, the "difficult" part is moving the "generation"
> >>> member of struct commit into a slab before making it a 64-bit
> >>> value. (This is likely necessary for your plan, anyway.) Updating
> >>> the generation number to v2 is relatively straight-forward after
> >>> that, as someone can follow all places that reference or compute
> >>> generation numbers and apply a diff
> >>
> >> Good idea!  Though I am not sure if it is not too late to add it to the
> >> https://git.github.io/SoC-2020-Ideas/ as the self imposed deadline of
> >> March 16 (where students can start submitting proposals to GSoC) has
> >> just passed.  Christian, what do you think?
> >
> > Would that be a different project idea or part of your "Graph labeling
> > for speeding up git commands" project idea?
> >
> > I am very reluctant to add new project ideas at that time. I don't
> > think student will have time to properly research it and get it
> > reviewed.
> >
> > It could be part of your research project though, to check if that
> > approach is better or good enough compared to what you suggest in the
> > current version of your project.
> >
> >> Would you agree, Stolee, to be a _possible_ mentor or co-mentor for
> >> "Generation number v2" project?
> >
> > At this point I think it might be best if you are both willing to
> > co-mentor a "moonshot" / research project to find what is the best way
> > forward by bench-marking the different approaches that you both
> > suggest for different commands/use cases.
>
> So perhaps we should expand "Commit graph labeling for speeding up git
> commands" idea, splitting it into two possible ways of working in this
> project: the more technical 'Generation number v2', and 'Interval labels
> for commit graph' which is more of research project?

Ok for that.

>  Which should be
> put first, then?

Which ever you prefer. If you say that any part could be done
separately, that doesn't matter much.

Best,
Christian.