git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Christian Couder <christian.couder@gmail.com>
To: Jakub Narebski <jnareb@gmail.com>
Cc: Derrick Stolee <stolee@gmail.com>, git <git@vger.kernel.org>,
	Heba Waly <heba.waly@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Emily Shaffer <emilyshaffer@google.com>,
	Abhishek Kumar <abhishekkumar8222@gmail.com>
Subject: Re: [RFC] Possible idea for GSoC 2020
Date: Tue, 17 Mar 2020 18:04:23 +0100	[thread overview]
Message-ID: <CAP8UFD1ihR1PtM2y24HTKa2woGXMVFq9MoVSj1cHVZCNWSH7EA@mail.gmail.com> (raw)
In-Reply-To: <867dzj6p4y.fsf@gmail.com>

On Tue, Mar 17, 2020 at 3:18 PM Jakub Narebski <jnareb@gmail.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
> > On Tue, Mar 17, 2020 at 4:13 AM Jakub Narebski <jnareb@gmail.com> wrote:

> >>>> If for each commit 'v' we would compute and store in the commit-graph
> >>>> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
> >>>> that were visited during the part of depth-first search that started
> >>>> from 'v' (which is the minimum of post-order number for subtree of a
> >>>> spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
> >>>> Then the following condition is true:
> >>>>
> >>>>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
> >>>
> >>> How many places in Git do we ask "can v reach u?" and how many would
> >>> return immediately without needing a walk in this new approach? My
> >>> guess is that we will have a very narrow window where this query
> >>> returns a positive result.
> >>
> >> As I wrote below, such positive-cut filter would be directly helpful in
> >> performing the following commands:
> >>
> >>  - `git merge-base --is-ancestor`
> >>  - `git branch --contains`
> >>  - `git tag --contains`
> >>  - `git branch --merged`
> >>  - `git tag --merged`
> >>
> >> It would be also useful for tag autofollow in git-fetch; is is N-to-M
> >> equivalent to 1-to-N / N-to-1 `--contains` queries.
> >>
> >> I am quite sure that positive-cut filter would make `--ancestry-path`
> >> walk faster.
> >>
> >> I think, but I am not sure, that positive-cut filter can make parts of
> >> topological sort and merge base algorithms at least a tiny bit faster.
> >
> > Is there an easy way to check that it would provide significant
> > performance improvements at least in some cases? Can we ask the
> > student to do that at the beginning of the GSoC?
>
> The "Reachability labels for version control graphs.ipynb" Jupyter
> Notebook on Google Colaboratory was created to answer this question
> (originally for the FELINE reachability index).  Among others it can
> min-post interval labels and topological levels (generation numbers),
> use them for reachability queries, and load Linux kernel
> commit-graph. The exploration didn't get finished, but it would be not
> that difficult, I think, to at least find the amount of false negatives
> for min-post interval labeling for git.git or Linux kernel repo.
>
>   https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
>
> As Jupyter Notebook, it is run in the web browser.  It can either use
> local runtime, or run on Google Cloud runtime.  On the other hand it
> requires at least some knowledge of Python...

Ok, if that's a possible approach to the project, then please add it
to the description.

> >> I think it would be possible to compute post(v) and min_tree(v) using
> >> incremental updates, and to make it compatibile with incremental
> >> commit-graph format (with the commit-graph chain).  But I have not
> >> proven it.
> >
> > Would it be difficult to prove? What would be required? And again can
> > we ask the student to do that at the beginning of the GSoC?
>
> Formal mathematical proof by induction would be not that difficult.  The
> problem is, I think, in finding all possible classes of initial spanning
> forest arrangements, and all possible classes of commit-graph growth by
> one commit -- and examining how this affect the spanning forest.

Ok.

> >>> The point of generation number v2 [1] was to allow moving to "exact"
> >>> algorithms for things like merge-base where we still use commit time
> >>> as a heuristic, and could be wrong because of special data shapes.
> >>> We don't use generation number in these examples because using only
> >>> generation number can lead to a large increase in number of commits
> >>> walked. The example we saw in the Linux kernel repository was a bug
> >>> fix created on top of a very old commit, so there was a commit of
> >>> low generation with very high commit-date that caused extra walking.
> >>> (See [2] for a detailed description of the data shape.)
> >>>
> >>> My _prediction_ is that the two-dimensional system will be more
> >>> complicated to write and use, and will not have any measurable
> >>> difference. I'd be happy to be wrong, but I also would not send
> >>> anyone down this direction only to find out I'm right and that
> >>> effort was wasted.
> >>
> >> That might be a problem.
> >>
> >> This is a bit of a "moonshot" / research project, moreso than others.
> >> Though it would be still valuable, in my opionion, even if the code
> >> wouldn't ultimately get merged and added into Git.
> >
> > I agree that it feels like a "moonshot" / research project.
> >
> >>> My recommendation is that a GSoC student update the
> >>> generation number to "v2" based on the definition you made in [1].
> >>> That proposal is also more likely to be effective in Git because
> >>> it makes use of extra heuristic information (commit date) to
> >>> assist the types of algorithms we care about.
> >>>
> >>> In that case, the "difficult" part is moving the "generation"
> >>> member of struct commit into a slab before making it a 64-bit
> >>> value. (This is likely necessary for your plan, anyway.) Updating
> >>> the generation number to v2 is relatively straight-forward after
> >>> that, as someone can follow all places that reference or compute
> >>> generation numbers and apply a diff
> >>
> >> Good idea!  Though I am not sure if it is not too late to add it to the
> >> https://git.github.io/SoC-2020-Ideas/ as the self imposed deadline of
> >> March 16 (where students can start submitting proposals to GSoC) has
> >> just passed.  Christian, what do you think?
> >
> > Would that be a different project idea or part of your "Graph labeling
> > for speeding up git commands" project idea?
> >
> > I am very reluctant to add new project ideas at that time. I don't
> > think student will have time to properly research it and get it
> > reviewed.
> >
> > It could be part of your research project though, to check if that
> > approach is better or good enough compared to what you suggest in the
> > current version of your project.
> >
> >> Would you agree, Stolee, to be a _possible_ mentor or co-mentor for
> >> "Generation number v2" project?
> >
> > At this point I think it might be best if you are both willing to
> > co-mentor a "moonshot" / research project to find what is the best way
> > forward by bench-marking the different approaches that you both
> > suggest for different commands/use cases.
>
> So perhaps we should expand "Commit graph labeling for speeding up git
> commands" idea, splitting it into two possible ways of working in this
> project: the more technical 'Generation number v2', and 'Interval labels
> for commit graph' which is more of research project?

Ok for that.

>  Which should be
> put first, then?

Which ever you prefer. If you say that any part could be done
separately, that doesn't matter much.

Best,
Christian.

  reply	other threads:[~2020-03-17 17:04 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-10 14:50 [RFC] Possible idea for GSoC 2020 Jakub Narebski
2020-03-11 19:03 ` Junio C Hamano
2020-03-13 10:56   ` Jakub Narebski
2020-03-15 14:26     ` Jakub Narebski
2020-03-17 12:24       ` Kaartic Sivaraam
2020-03-17 12:39         ` Kaartic Sivaraam
2020-03-17 14:22         ` Jakub Narebski
2020-03-11 20:29 ` Christian Couder
2020-03-11 21:30   ` Jakub Narebski
2020-03-11 21:48     ` Christian Couder
2020-03-12 12:17       ` Jakub Narebski
2020-03-12 13:08         ` Jakub Narebski
2020-03-13 10:59           ` Jakub Narebski
2020-03-13 13:08 ` Philip Oakley
2020-03-13 14:34   ` Jakub Narebski
2020-03-15 18:57     ` Philip Oakley
2020-03-15 21:14       ` Jakub Narebski
2020-03-16 14:47         ` Philip Oakley
2020-03-16 12:44 ` Derrick Stolee
2020-03-17  3:13   ` Jakub Narebski
2020-03-17  7:24     ` Christian Couder
2020-03-17 11:49       ` Derrick Stolee
2020-03-17 14:18       ` Jakub Narebski
2020-03-17 17:04         ` Christian Couder [this message]
2020-03-18 13:55           ` Jakub Narebski
2020-03-18 15:25             ` Derrick Stolee
2020-03-19 12:52               ` Jakub Narebski
2020-03-13 17:30 Abhishek Kumar
2020-03-17 17:00 Abhishek Kumar
2020-03-17 18:05 ` Jakub Narebski
2020-03-17 18:00 Abhishek Kumar
2020-03-19 12:50 ` Jakub Narebski
     [not found] <CAHk66ftQqFqP-4kd4-8cHtCMEofSUvbeSQ24pcCCrkz7+2JG1w@mail.gmail.com>
2020-03-27 18:31 ` Jakub Narębski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAP8UFD1ihR1PtM2y24HTKa2woGXMVFq9MoVSj1cHVZCNWSH7EA@mail.gmail.com \
    --to=christian.couder@gmail.com \
    --cc=abhishekkumar8222@gmail.com \
    --cc=emilyshaffer@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=heba.waly@gmail.com \
    --cc=jnareb@gmail.com \
    --cc=jonathantanmy@google.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).