git.vger.kernel.org archive mirror
* [Outreachy][Proposal] Accelerate rename detection and the range-diff
From: Sangeeta NB @ 2020-10-26  7:49 UTC
  To: Kaartic Sivaraam, christian.couder, Git List

Hey Everyone,

I would love to participate in Outreachy this year with Git, on the
project "Accelerate rename detection and the range-diff command in
Git". I have contributed to the microproject "Unify the meaning of
dirty between diff and describe"[1], which is still under review;
through that process I have become familiar with the mailing list
and the patch review system. I am also contributing to another
issue[2], about `git bisect` and `git rebase`, which is still under
discussion[3].

[1] https://lore.kernel.org/git/pull.751.git.1602781723670.gitgitgadget@gmail.com
[2] https://github.com/gitgitgadget/git/issues/486
[3] https://lore.kernel.org/git/pull.765.git.1603271344522.gitgitgadget@gmail.com/

Coming to the project, I have read more about it[4] and have created
the initial version for the timeline. I would really love to have
comments on it.

[4] https://github.com/gitgitgadget/git/issues/519

Also, there's a column for community-specific questions in the final
application. Is there anything specific that I should fill in there?

Please let me know if I missed anything.

Looking forward to working and learning with you all.

Thanks and Regards,
Sangeeta

=================================================

Link to docs: https://docs.google.com/document/d/15mgqy4id1fXZWE1NvBEERWvET9zy-ZEfhp4x0NNv_d4/edit?usp=sharing

=================================================

## Accelerate rename detection and the range-diff command in Git

# Timeline

## Nov 23 - Dec 1 (Before the internship officially starts)

* Getting to know the mentors.
* Bonding with the community.
* Understanding the structure of the code and familiarizing myself
with the requirements during the internship period.
* Creating a concrete workflow for Outreachy tasks.


## Dec 1 - Dec 20

* Study various Approximate Nearest Neighbor (ANN) search algorithms.
* Review existing comparisons of ANN algorithms, such as:
  * [ANN benchmarks](http://ann-benchmarks.com/)
  * [How to benchmark ANN algorithms](https://medium.com/gsi-technology/how-to-benchmark-ann-algorithms-a9f1cef6be08)
* Compare the algorithms and narrow them down to the one or two best
suited to our use case.

## Dec 11: Initial point of feedback

* Take feedback from the mentors and ask about the expectations that
the mentors and the community have of me.

## Dec 21 - Jan 05

* Study how Locality Sensitive Hashing (data-independent) or
Locality Preserving Hashing (data-dependent) could improve accuracy
(or even reduce complexity).
* Study various hashing algorithms and how to combine them with our
nearest neighbor search algorithm.
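
To make the hashing idea above concrete, here is a minimal MinHash sketch in Python. It only illustrates one locality-sensitive scheme and is not code from Git or a decision about the algorithm to use. The function names, the 4-character shingles, and the choice of 64 BLAKE2b-based hash functions are all assumptions made for this example.

```python
import hashlib


def shingles(text, k=4):
    """Return the set of overlapping k-character substrings of text."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash value per seeded hash function.

    Two sets agree on a given position with probability equal to their
    Jaccard similarity, so similar sets share many positions.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # hypothetical seeding scheme
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


old = "static int count_renames(struct diff_queue *q) { return q->nr; }"
new = "static int total_renames(struct diff_queue *q) { return q->nr; }"
print(estimated_similarity(minhash_signature(shingles(old)),
                           minhash_signature(shingles(new))))
```

Signatures have a fixed size regardless of file size, so candidate pairs could be compared (or bucketed) without reading both files in full each time. Whether that pays off for rename detection is exactly what this phase would evaluate.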

## Jan 06 - Jan 20
* Study whether a pre-trained Support Vector Machine (SVM) can add
something to our use case.
* Study how other projects (e.g. Gerrit) decide whether two commits
are similar.
* SVMs have an accuracy disadvantage compared to nearest neighbor
algorithms. Therefore, I would look into whether a hybrid algorithm
that combines SVMs with nearest neighbor search can achieve better
accuracy. There are some research papers on this; I would study them
and finalize the algorithm after discussion with the mentors and the
community.

## Jan 12: Midpoint feedback
* Take feedback from the mentors and ask where I can improve and
where I have been lagging.

## Jan 21 - Feb 15
* Implement the finalized algorithm.
* Benchmark its accuracy and complexity against existing methods.
* Use it for the rename detection and for commit matching in `git range-diff`.
* Update the documentation for the same.
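
One possible shape for the accuracy benchmark above is to treat an exhaustive all-pairs comparison as the baseline and measure how often a faster candidate picks the same match. The sketch below is only an assumption about how that could be set up: the function names are hypothetical, and Jaccard similarity over token sets is a stand-in, not Git's actual similarity score.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def exhaustive_matches(sources, targets):
    """Baseline: compare every source with every target, keep the best index."""
    return [
        max(range(len(targets)), key=lambda j: jaccard(src, targets[j]))
        for src in sources
    ]


def recall(baseline, candidate):
    """Fraction of sources for which the candidate agrees with the baseline."""
    agree = sum(b == c for b, c in zip(baseline, candidate))
    return agree / len(baseline)


sources = [{"a", "b", "c"}, {"x", "y"}]
targets = [{"x", "y", "z"}, {"a", "b"}]
baseline = exhaustive_matches(sources, targets)  # best target per source
approximate = [1, 1]  # pretend output of a faster, approximate matcher
print(baseline, recall(baseline, approximate))
```

The baseline costs one comparison per (source, target) pair, so the same harness can also record how many comparisons the candidate avoids.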


## Feb 16 - Mar 02 (Wrap up)
* Buffer period for incomplete work.
* Wrap up the code.
* Address the review comments and suggestions given by the mentors.
* Write documentation for the code if required.
* Get my patches merged.


## Mar 02: Final feedback
* Take the final feedback from the mentors and ask where I could
have done better.
* Talk about ways to stay connected after the Outreachy period.


## Post-Outreachy
* I intend to keep contributing even after the Outreachy period ends.
* Would love to co-mentor (if possible) in the next Outreachy and GSoC rounds.
* Would love to review patches of other contributors and take part in
the mailing list discussions.


# Other Involvements
* Blogging is an important part of Outreachy, so I would love to
write a blog post every weekend or every fortnight, as agreed with
the mentors, summarizing the work done so far, what I learned, and
my experience.
* I would also be glad to help other contributors and users solve
their issues and to help the maintainers review patches during the
Outreachy period and after it.

* Re: [Outreachy][Proposal] Accelerate rename detection and the range-diff
From: Elijah Newren @ 2020-10-26 16:52 UTC
  To: Sangeeta NB; +Cc: Kaartic Sivaraam, Christian Couder, Git List

Hi and welcome!

On Mon, Oct 26, 2020 at 1:44 AM Sangeeta NB <sangunb09@gmail.com> wrote:
>
> Hey Everyone,
>
> I would love to participate in outreachy this year with Git in the
> project "Accelerate rename detection and the range-diff command in
> Git". I have contributed to the microproject "Unify the meaning of
> dirty between diff and describe"[1] which is still under review, but
> through the process, I have got myself familiar with the mailing list
> and patch review system. I am also contributing to another issue[2]
> which is still under discussion[3] about `git bisect` and `git
> rebase`.
>
> [1] https://lore.kernel.org/git/pull.751.git.1602781723670.gitgitgadget@gmail.com
> [2] https://github.com/gitgitgadget/git/issues/486
> [3] https://lore.kernel.org/git/pull.765.git.1603271344522.gitgitgadget@gmail.com/
>
> Coming to the project, I have read more about it[4] and have created
> the initial version for the timeline. I would really love to have
> comments on it.
>
> [4] https://github.com/gitgitgadget/git/issues/519

I might be the bearer of some bad or concerning news.  This email is
directed more to the mentors and others on the git mailing list, but
obviously may affect you as well:

I apologize for not stating my concerns more forcefully earlier, but I
didn't have as many details at the time or have an idea how fast
merge-ort could be upstreamed.  Anyway, I'm still concerned that this
might not be a good project for Outreachy due to two factors: unclear
benefit, and conflicts:

1) I've got merges down to the point where even if there is a massive
rename of 26000 files (e.g. renaming "drivers/" to "pilots/" in the
linux kernel), rename detection is NOT the long tent pole in a merge.
So although this project is interesting, it's not clear that this
project will help us much.  It might be better to get my changes
merged down and see if there's enough need for additional
optimizations first.

2) Ignoring what I've already submitted, the remaining diffstat for
merge-ort is about 5500 lines....
  2a) If I break that ~5500 lines into patches with 50 lines each,
that's 110 patches.  If I assume I can send 10-20 patches per week
without overwhelming folks, that's 6-11 weeks, pulling us somewhere
into mid-December or mid-January.  10-20 patches per week might be
over-optimistic on reviewer fatigue, which would push it out even
further.
  2b) Work is going to soon rotate me onto other non-git projects,
meaning even if the mailing list can review my changes aggressively,
there's a chance I might not be able to keep up on feeding them to the
list.
  2c) diffcore-rename.c is only ~700 lines right now.  My 5500 lines
of changes includes over 1000 new lines for diffcore-rename.c and
about 150 line removals for it.  These changes are spread all over the
file; only four small functions remain untouched.  In fact, I even
made big changes to struct diff_rename_dst too, so any new uses of it
would almost certainly have textual conflicts.
  2d) My diffcore-rename.c changes probably do not make logical sense
to submit first.  They should come after some groundwork is laid for
merge-ort.

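The estimate in 2a can be reproduced with a few lines of Python. This is a rough sketch only: it assumes patches start going out on the date of this message (2020-10-26) and that partial weeks round up, neither of which is stated above.

```python
import math
from datetime import date, timedelta

remaining_lines = 5500   # remaining merge-ort diffstat, from 2)
lines_per_patch = 50     # patch size assumed in 2a

patches = math.ceil(remaining_lines / lines_per_patch)


def weeks_needed(patch_count, patches_per_week):
    """Whole weeks needed to send patch_count patches at a steady rate."""
    return math.ceil(patch_count / patches_per_week)


start = date(2020, 10, 26)          # assumption: date of this message
fast_weeks = weeks_needed(patches, 20)
slow_weeks = weeks_needed(patches, 10)

print(patches, "patches")
print("fast:", fast_weeks, "weeks, done", start + timedelta(weeks=fast_weeks))
print("slow:", slow_weeks, "weeks, done", start + timedelta(weeks=slow_weeks))
```

Any slowdown from reviewer fatigue or from 2b pushes both end dates later.
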
Even though at a high level this project is complementary to the
optimizations I made in my 'merge-ort' work, I fear there will be LOTS
of intermediate conflicts as we both make changes to the same areas
during the same time and make a mess of things.

If you all think this is still a good project to have an intern work
on, I'll defer to you, but I am concerned.


Elijah

* Re: [Outreachy][Proposal] Accelerate rename detection and the range-diff
From: Kaartic Sivaraam @ 2020-10-30  9:02 UTC
  To: Elijah Newren; +Cc: Sangeeta NB, Christian Couder, Git List

Hi Elijah,

On 26/10/20 10:22 pm, Elijah Newren wrote:
> 
> On Mon, Oct 26, 2020 at 1:44 AM Sangeeta NB <sangunb09@gmail.com> wrote:
>>
>> I would love to participate in outreachy this year with Git in the
>> project "Accelerate rename detection and the range-diff command in
>> Git". I have contributed to the microproject "Unify the meaning of
>> dirty between diff and describe"[1] which is still under review, but
>> through the process, I have got myself familiar with the mailing list
>> and patch review system. I am also contributing to another issue[2]
>> which is still under discussion[3] about `git bisect` and `git
>> rebase`.
>>
>> [1] https://lore.kernel.org/git/pull.751.git.1602781723670.gitgitgadget@gmail.com
>> [2] https://github.com/gitgitgadget/git/issues/486
>> [3] https://lore.kernel.org/git/pull.765.git.1603271344522.gitgitgadget@gmail.com/
>>
>> Coming to the project, I have read more about it[4] and have created
>> the initial version for the timeline. I would really love to have
>> comments on it.
>>
>> [4] https://github.com/gitgitgadget/git/issues/519
> 
> I might be the bearer of some bad or concerning news.  This email is
> directed more to the mentors and others on the git mailing list, but
> obviously may affect you as well:
> 
> I apologize for not stating my concerns more forcefully earlier, but I
> didn't have as many details at the time or have an idea how fast
> merge-ort could be upstreamed.  Anyway, I'm still concerned that this
> might not be a good project for Outreachy due to two factors: unclear
> benefit, and conflicts:
> 
> 1) I've got merges down to the point where even if there is a massive
> rename of 26000 files (e.g. renaming "drivers/" to "pilots/" in the
> linux kernel), rename detection is NOT the long tent pole in a merge.
> So although this project is interesting, it's not clear that this
> project will help us much.  It might be better to get my changes
> merged down and see if there's enough need for additional
> optimizations first.
> 
> 2) Ignoring what I've already submitted, the remaining diffstat for
> merge-ort is about 5500 lines....
>    2a) If I break that ~5500 lines into patches with 50 lines each,
> that's 110 patches.  If I assume I can send 10-20 patches per week
> without overwhelming folks, that's 6-11 weeks, pulling us somewhere
> into mid-December or mid-January.  10-20 patches per week might be
> over-optimistic on reviewer fatigue, which would push it out even
> further.
>    2b) Work is going to soon rotate me onto other non-git projects,
> meaning even if the mailing list can review my changes aggressively,
> there's a chance I might not be able to keep up on feeding them to the
> list.
>    2c) diffcore-rename.c is only ~700 lines right now.  My 5500 lines
> of changes includes over 1000 new lines for diffcore-rename.c and
> about 150 line removals for it.  These changes are spread all over the
> file; only four small functions remain untouched.  In fact, I even
> made big changes to struct diff_rename_dst too, so any new uses of it
> would almost certainly have textual conflicts.
>    2d) My diffcore-rename.c changes probably do not make logical sense
> to submit first.  They should come after some groundwork is laid for
> merge-ort.
> 
> Even though at a high level this project is complementary to the
> optimizations I made in my 'merge-ort' work, I fear there will be LOTS
> of intermediate conflicts as we both make changes to the same areas
> during the same time and make a mess of things.
> 

Thanks for the detailed concerns. Some thoughts:

- Given that a major portion of the project would be evaluating
   various algorithms and identifying the most suitable one, I believe
   implementation conflict shouldn't be a problem as it's expected to
   start only by late-January. Also, as Christian pointed out elsewhere
   it might be a good learning experience.

- I do have a concern about one thing, though. For evaluating the
   algorithm in the context of Git, we might need to do some experimental
   implementations to get some metrics which would serve as the data that
   we could use to identify the optimal algorithm. I'm wondering whether
   your planned changes might affect that. In the sense that, is there a
   chance for the evaluation to become obsolete as a consequence of those
   changes? If yes, what could we do to overcome that? Any thoughts on
   this would be helpful.

> If you all think this is still a good project to have an intern work
> on, I'll defer to you, but I am concerned.
> 

-- 
Sivaraam

* Re: [Outreachy][Proposal] Accelerate rename detection and the range-diff
From: Elijah Newren @ 2020-10-31 20:31 UTC
  To: Kaartic Sivaraam; +Cc: Sangeeta NB, Christian Couder, Git List

Hi,

On Fri, Oct 30, 2020 at 2:02 AM Kaartic Sivaraam
<kaartic.sivaraam@gmail.com> wrote:
>
> Hi Elijah,
>
> On 26/10/20 10:22 pm, Elijah Newren wrote:
> >
> > On Mon, Oct 26, 2020 at 1:44 AM Sangeeta NB <sangunb09@gmail.com> wrote:
> >>
> >> I would love to participate in outreachy this year with Git in the
> >> project "Accelerate rename detection and the range-diff command in
> >> Git". I have contributed to the microproject "Unify the meaning of
> >> dirty between diff and describe"[1] which is still under review, but
> >> through the process, I have got myself familiar with the mailing list
> >> and patch review system. I am also contributing to another issue[2]
> >> which is still under discussion[3] about `git bisect` and `git
> >> rebase`.
> >>
> >> [1] https://lore.kernel.org/git/pull.751.git.1602781723670.gitgitgadget@gmail.com
> >> [2] https://github.com/gitgitgadget/git/issues/486
> >> [3] https://lore.kernel.org/git/pull.765.git.1603271344522.gitgitgadget@gmail.com/
> >>
> >> Coming to the project, I have read more about it[4] and have created
> >> the initial version for the timeline. I would really love to have
> >> comments on it.
> >>
> >> [4] https://github.com/gitgitgadget/git/issues/519
> >
> > I might be the bearer of some bad or concerning news.  This email is
> > directed more to the mentors and others on the git mailing list, but
> > obviously may affect you as well:
> >
> > I apologize for not stating my concerns more forcefully earlier, but I
> > didn't have as many details at the time or have an idea how fast
> > merge-ort could be upstreamed.  Anyway, I'm still concerned that this
> > might not be a good project for Outreachy due to two factors: unclear
> > benefit, and conflicts:
> >
> > 1) I've got merges down to the point where even if there is a massive
> > rename of 26000 files (e.g. renaming "drivers/" to "pilots/" in the
> > linux kernel), rename detection is NOT the long tent pole in a merge.
> > So although this project is interesting, it's not clear that this
> > project will help us much.  It might be better to get my changes
> > merged down and see if there's enough need for additional
> > optimizations first.
> >
> > 2) Ignoring what I've already submitted, the remaining diffstat for
> > merge-ort is about 5500 lines....
> >    2a) If I break that ~5500 lines into patches with 50 lines each,
> > that's 110 patches.  If I assume I can send 10-20 patches per week
> > without overwhelming folks, that's 6-11 weeks, pulling us somewhere
> > into mid-December or mid-January.  10-20 patches per week might be
> > over-optimistic on reviewer fatigue, which would push it out even
> > further.
> >    2b) Work is going to soon rotate me onto other non-git projects,
> > meaning even if the mailing list can review my changes aggressively,
> > there's a chance I might not be able to keep up on feeding them to the
> > list.
> >    2c) diffcore-rename.c is only ~700 lines right now.  My 5500 lines
> > of changes includes over 1000 new lines for diffcore-rename.c and
> > about 150 line removals for it.  These changes are spread all over the
> > file; only four small functions remain untouched.  In fact, I even
> > made big changes to struct diff_rename_dst too, so any new uses of it
> > would almost certainly have textual conflicts.
> >    2d) My diffcore-rename.c changes probably do not make logical sense
> > to submit first.  They should come after some groundwork is laid for
> > merge-ort.
> >
> > Even though at a high level this project is complementary to the
> > optimizations I made in my 'merge-ort' work, I fear there will be LOTS
> > of intermediate conflicts as we both make changes to the same areas
> > during the same time and make a mess of things.
> >
>
> Thanks for the detailed concerns. Some thoughts:
>
> - Given that a major portion of the project would be to evaluate
>    various algorithms and identifying the most suitable one, I believe
>    implementation conflict shouldn't be a problem as it's expected to
>    start only by late-January. Also, as Christian pointed out elsewhere
>    it might be a good learning experience.

"late-January" _might_ be okay, but I'm worried that relying on
optimistic timelines is a bad idea.  However, if the primary purpose
is a good learning experience, or if the primary purpose is to
evaluate different algorithms (i.e. we're not relying on the timelines
to avoid conflict, it's just a bonus if they don't), then sure, no
problem there.

> - I do have a concern about one thing, though. For evaluating the
>    algorithm in the context of Git, we might need to do some experimental
>    implementations to get some metrics which would serve as the data that
>    we could use to identify the optimal algorithm. I'm  wondering whether
>    your planned changes might affect that. In the sense that, is there a
>    chance for the evaluation to become obsolete as a consequence of those
>    changes? If yes, what could we do to overcome that? Any thoughts on
>    this would be helpful.

That is certainly a possibility, yes.  One way to address that concern
is for me to freeze some branch (likely some version that I deploy
internally at $DAYJOB for testing), and for you to build on that.  If
all the new merge backend code gets reviewed and upstreamed fast
enough, and the areas you depend on don't change too drastically based
on reviewer comments, then building on merge-ort creates no
impediments for the Outreachy project to get upstreamed at the normal
time.  I can understand, though, if that plan seems worrisome due to
worries about how fast the new backend will be upstreamed or how much
it needs to change in the process; that is, after all, why I raised my
concerns in the first place.

* Re: [Outreachy][Proposal] Accelerate rename detection and the range-diff
From: Kaartic Sivaraam @ 2020-11-02 18:35 UTC
  To: Elijah Newren; +Cc: Sangeeta NB, Christian Couder, Git List

Hi Elijah,

On 01/11/20 2:01 am, Elijah Newren wrote:
> 
> On Fri, Oct 30, 2020 at 2:02 AM Kaartic Sivaraam
> <kaartic.sivaraam@gmail.com> wrote:
>>
>> Thanks for the detailed concerns. Some thoughts:
>>
>> - Given that a major portion of the project would be to evaluate
>>     various algorithms and identifying the most suitable one, I believe
>>     implementation conflict shouldn't be a problem as it's expected to
>>     start only by late-January. Also, as Christian pointed out elsewhere
>>     it might be a good learning experience.
> 
> "late-January" _might_ be okay, but I'm worried that relying on
> optimistic timelines is a bad idea.  However, if the primary purpose
> is a good learning experience, or if the primary purpose is to
> evaluate different algorithms (i.e. we're not relying on the timelines
> to avoid conflict, it's just a bonus if they don't), then sure, no
> problem there.
> 

Yeah. I believe a good part of this project would be evaluating the 
various algorithms. Implementation would be a part of it, sure. I don't 
think it would be too time sensitive, though. So, I hope we can work 
through the timelines as the project and your work progress.

>> - I do have a concern about one thing, though. For evaluating the
>>     algorithm in the context of Git, we might need to do some experimental
>>     implementations to get some metrics which would serve as the data that
>>     we could use to identify the optimal algorithm. I'm  wondering whether
>>     your planned changes might affect that. In the sense that, is there a
>>     chance for the evaluation to become obsolete as a consequence of those
>>     changes? If yes, what could we do to overcome that? Any thoughts on
>>     this would be helpful.
> 
> That is certainly a possibility, yes.  One way to address that concern
> is for me to freeze some branch (likely some version that I deploy
> internally at $DAYJOB for testing), and for you to build on that.  If
> all the new merge backend code gets reviewed and upstreamed fast
> enough, and the areas you depend on don't change too drastically based
> on reviewer comments, then building on merge-ort creates no
> impediments for the Outreachy project to get upstreamed at the normal
> time.

Thanks. That does sound like a good way to overcome that problem. We can 
discuss more about that once the intern is selected and their internship 
period begins.

> I can understand, though, if that plan seems worrisome due to
> worries about how fast the new backend will be upstreamed or how much
> it needs to change in the process; that is, after all, why I raised my
> concerns in the first place.
> 

Which indeed is very helpful for planning the project. Thanks for that! 
It's pretty clear now that closely following your work and adapting the 
timeline accordingly as time progresses is a part of the project. That 
might indeed be an interesting experience in and of itself for the 
intern who would be working on this project.

-- 
Sivaraam
