* [GSOC] Blog about weeks 4, 5
       [not found] <20200706182213.GA51227@Abhishek-Arch>
@ 2020-07-07  2:24 ` Abhishek Kumar
  2020-07-13 20:00   ` Jakub Narębski
  0 siblings, 1 reply; 3+ messages in thread
From: Abhishek Kumar @ 2020-07-07  2:24 UTC (permalink / raw)
To: abhishekkumar8222; +Cc: git, stolee, jnareb

Hello everyone!

Over the last two weeks, I have worked on refining the performance
report on generation numbers. Here are our conclusions:

- Corrected Commit Date with Monotonic Offset (i.e. generation number
  v5) performs better than topological levels, but still walks too many
  commits when compared with Corrected Commit Date.

  Number of commits walked (git merge-base v4.8 v4.9, on the Linux
  repository):

  Topological Level                          : 635579
  Corrected Commit Date                      : 167468
  Corrected Commit Date with Monotonic Offset: 506577

  As such, I expect that we will store Corrected Commit Date in an
  additional chunk (called the "generation data chunk") and store
  topological levels in CDAT. Thus, old Git clients can operate as
  expected, while new Git clients use the better generation number.

- Using a new chunk does affect locality of reference, but did not
  impact performance appreciably.

- This does increase the size of the commit-graph file by nearly 5%.

You can read more in my report [1] and the pull request with
instructions to replicate the results [2].

[1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/T/#mda33f6e13873df55901768e8fd6d774282002146
[2]: https://github.com/abhishekkumar2718/git/pull/1

In the blog post, I talk a bit more about a patch I worked on to
improve the performance of commit-graph writes using buffers, which
ultimately did not work and was dropped. Up next is actually
implementing the new generation number and taking care of all the
little details.

https://abhishekkumar2718.github.io/programming/2020/07/05/gsoc-weeks-4-5.html

Feedback and suggestions welcome!
Thanks
- Abhishek

--------

Re-sending this email as I forgot to cc git@vger.kernel.org

^ permalink raw reply	[flat|nested] 3+ messages in thread
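The corrected commit date rule discussed above — each commit's value is the maximum of its own committer date and every parent's value plus one — can be sketched in a few lines. This is an illustrative toy, not Git's actual commit-graph code; the history and timestamps are made up:

```python
def corrected_dates(parents, date):
    """parents: dict commit -> list of parent commits (a DAG);
    date: dict commit -> committer date (integer timestamp).
    Returns dict commit -> corrected commit date."""
    memo = {}

    def gen(c):
        if c not in memo:
            # A commit's corrected date is strictly greater than each
            # parent's, and never less than its own committer date.
            bumps = [gen(p) + 1 for p in parents[c]]
            memo[c] = max([date[c]] + bumps)
        return memo[c]

    for c in parents:
        gen(c)
    return memo

# Toy history: A <- B <- D and A <- C <- D; C's clock is skewed backwards.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
date = {"A": 100, "B": 200, "C": 50, "D": 300}
print(corrected_dates(parents, date))
# C's corrected date becomes 101 (A's date plus one), repairing the skew
```

Because each value is strictly greater than every parent's, these numbers work as a negative-cut reachability filter: if X's corrected date is not greater than Y's (and X is not Y), X cannot reach Y.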
* Re: [GSOC] Blog about weeks 4, 5
  2020-07-07  2:24 ` [GSOC] Blog about weeks 4, 5 Abhishek Kumar
@ 2020-07-13 20:00   ` Jakub Narębski
  2020-07-14  6:23     ` Abhishek Kumar
  0 siblings, 1 reply; 3+ messages in thread
From: Jakub Narębski @ 2020-07-13 20:00 UTC (permalink / raw)
To: Abhishek Kumar; +Cc: git, Derrick Stolee

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:

> Hello everyone!
>
> Over the last two weeks, I have worked on refining the performance
> report on generation numbers. Here are our conclusions:
>
> - Corrected Commit Date with Monotonic Offset (i.e. generation number
>   v5) performs better than topological levels, but still walks too many
>   commits when compared with Corrected Commit Date.

Thank you for your work examining different approaches to introducing
generation number v2.

>   Number of commits walked (git merge-base v4.8 v4.9, on the Linux
>   repository):
>
>   Topological Level                          : 635579
>   Corrected Commit Date                      : 167468
>   Corrected Commit Date with Monotonic Offset: 506577

It is a bit strange that requiring monotonic offsets leads to so much
of a difference in performance (in commits walked).

>   As such, I expect that we will store Corrected Commit Date in an
>   additional chunk (called the "generation data chunk") and store
>   topological levels in CDAT. Thus, old Git clients can operate as
>   expected, while new Git clients use the better generation number.
>
> - Using a new chunk does affect locality of reference, but did not
>   impact performance appreciably.
>
> - This does increase the size of the commit-graph file by nearly 5%.

All right, it seems like it is the way to go.

> You can read more in my report [1] and the pull request with
> instructions to replicate the results [2].
>
> [1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/T/#mda33f6e13873df55901768e8fd6d774282002146
> [2]: https://github.com/abhishekkumar2718/git/pull/1
>
> In the blog post, I talk a bit more about a patch I worked on to
> improve the performance of commit-graph writes using buffers, which
> ultimately did not work and was dropped. Up next is actually
> implementing the new generation number and taking care of all the
> little details.
>
> https://abhishekkumar2718.github.io/programming/2020/07/05/gsoc-weeks-4-5.html
>
> Feedback and suggestions welcome!

Some comments about the blog entry contents:

AK> Dr. Stolee pointed out ... [to] use the number of commits as a
AK> metric instead of wall clock timing (which can be influenced by other
AK> factors like CPU usage at the time).

There are a few factors here. If we compare similar algorithms, that
might be a good decision.

First, one can try to reduce the influence of random factors on wall
clock timing by using statistics. For example, one can detect and
remove outliers using robust statistical measures, as tools like
Dumbbench [3], hyperfine [4], or bench [5] do. After warmup, one
approach is to compute a robust estimate of the value, e.g. the median,
and a robust estimate of dispersion, e.g. MAD (median absolute
deviation), and use those to detect outliers: rescale the MAD and mark
as outliers (and remove) entries that lie more than "three sigma" of
robust dispersion away from the robust estimate of the value.
Dumbbench [3] has a good explanation.

[3]: https://metacpan.org/pod/Dumbbench#HOW-IT-WORKS-AND-WHY-IT-DOESN'T
[4]: https://github.com/sharkdp/hyperfine
[5]: https://github.com/Gabriel439/bench

Second, because of peculiarities of current processor architectures
(caches, data prefetching, branch prediction), performing more
operations might, in admittedly rare cases, be faster than doing fewer
operations.
One such example can be found in the CppCon 2019 talk by Andrei
Alexandrescu, "Speed Is Found In The Minds of People" [6][7], about
'small sort', where doing more operations results in, on average, a
faster sort. This of course can only happen if the difference in the
number of operations is small enough... nevertheless, it might be a
good idea to at least check that the wall clock time agrees with the
conclusions drawn from the number of commits walked, for at least a few
examples.

[6]: https://www.youtube.com/watch?v=FJJTYQYB1JQ
[7]: https://github.com/CppCon/CppCon2019/blob/master/Presentations/speed_is_found_in_the_minds_of_people/speed_is_found_in_the_minds_of_people__andrei_alexandrescu__cppcon_2019.pdf

AK> With the second report, storing corrected commit date in GDAT as
AK> well as computing topological levels seems like a no-brainer. I have
AK> started working on the patch and will push to the mailing list after
AK> some discussion on the report.

Do you have any numbers on how much providing backward compatibility
costs at `git commit-graph write` time, that is, how much more time it
takes to compute topological levels during the computation of corrected
committer dates compared to storing GENERATION_NUMBER_MAX in place of
the topological level? And does having topological level (as a
tie-breaker) help Git performance when using the commit-graph for
querying? Does having topological levels as a tie-breaker or as a
secondary negative-cut reachability index help at all?

Thank you for your work and for the report.

P.S. Would it be possible to put GSoC entries into a separate 'GSoC'
category instead of the generic 'Programming' one, or add a 'GSoC' tag?

Best,
--
Jakub Narębski
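The MAD-based "three sigma" outlier rejection described above can be sketched in a few lines of Python. The 1.4826 scale factor (the standard constant that makes MAD estimate the standard deviation for normally distributed data) and the sample timings are illustrative assumptions, not details taken from any of the tools mentioned:

```python
import statistics

def drop_outliers(samples, k=3.0):
    """Remove samples more than k 'robust sigmas' from the median.

    Uses MAD (median absolute deviation) rescaled by 1.4826 so that,
    for normal data, it approximates the standard deviation.
    """
    med = statistics.median(samples)
    mad = statistics.median(abs(x - med) for x in samples)
    sigma = 1.4826 * mad
    if sigma == 0:
        # Degenerate case: more than half the samples equal the median.
        return [x for x in samples if x == med]
    return [x for x in samples if abs(x - med) <= k * sigma]

# Benchmark timings in seconds; the 9.8s run is a scheduling hiccup.
timings = [1.02, 0.98, 1.01, 0.99, 1.03, 9.8, 1.00]
print(drop_outliers(timings))  # the 9.8s outlier is removed
```

Note how the median and MAD are unaffected by the single 9.8s sample, whereas a mean-and-standard-deviation filter would be dragged toward the outlier it is supposed to reject.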
* Re: [GSOC] Blog about weeks 4, 5
  2020-07-13 20:00 ` Jakub Narębski
@ 2020-07-14  6:23   ` Abhishek Kumar
  0 siblings, 0 replies; 3+ messages in thread
From: Abhishek Kumar @ 2020-07-14  6:23 UTC (permalink / raw)
To: Jakub Narębski; +Cc: git, stolee, abhishekkumar8222

On Mon, Jul 13, 2020 at 10:00:03PM +0200, Jakub Narębski wrote:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>
> > Hello everyone!
> >
> > Over the last two weeks, I have worked on refining the performance
> > report on generation numbers. Here are our conclusions:
> >
> > - Corrected Commit Date with Monotonic Offset (i.e. generation number
> >   v5) performs better than topological levels, but still walks too many
> >   commits when compared with Corrected Commit Date.
>
> Thank you for your work examining different approaches to introducing
> generation number v2.
>
> >   Number of commits walked (git merge-base v4.8 v4.9, on the Linux
> >   repository):
> >
> >   Topological Level                          : 635579
> >   Corrected Commit Date                      : 167468
> >   Corrected Commit Date with Monotonic Offset: 506577
>
> It is a bit strange that requiring monotonic offsets leads to so much
> of a difference in performance (in commits walked).
>
> >   As such, I expect that we will store Corrected Commit Date in an
> >   additional chunk (called the "generation data chunk") and store
> >   topological levels in CDAT. Thus, old Git clients can operate as
> >   expected, while new Git clients use the better generation number.
> >
> > - Using a new chunk does affect locality of reference, but did not
> >   impact performance appreciably.
> >
> > - This does increase the size of the commit-graph file by nearly 5%.
>
> All right, it seems like it is the way to go.
>
> > You can read more in my report [1] and the pull request with
> > instructions to replicate the results [2].
> >
> > [1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/T/#mda33f6e13873df55901768e8fd6d774282002146
> > [2]: https://github.com/abhishekkumar2718/git/pull/1
> >
> > In the blog post, I talk a bit more about a patch I worked on to
> > improve the performance of commit-graph writes using buffers, which
> > ultimately did not work and was dropped. Up next is actually
> > implementing the new generation number and taking care of all the
> > little details.
> >
> > https://abhishekkumar2718.github.io/programming/2020/07/05/gsoc-weeks-4-5.html
> >
> > Feedback and suggestions welcome!
>
> Some comments about the blog entry contents:
>
> AK> Dr. Stolee pointed out ... [to] use the number of commits as a
> AK> metric instead of wall clock timing (which can be influenced by other
> AK> factors like CPU usage at the time).
>
> There are a few factors here. If we compare similar algorithms, that
> might be a good decision.
>
> First, one can try to reduce the influence of random factors on wall
> clock timing by using statistics. For example, one can detect and
> remove outliers using robust statistical measures, as tools like
> Dumbbench [3], hyperfine [4], or bench [5] do. After warmup, one
> approach is to compute a robust estimate of the value, e.g. the
> median, and a robust estimate of dispersion, e.g. MAD (median absolute
> deviation), and use those to detect outliers: rescale the MAD and mark
> as outliers (and remove) entries that lie more than "three sigma" of
> robust dispersion away from the robust estimate of the value.
> Dumbbench [3] has a good explanation.
>
> [3]: https://metacpan.org/pod/Dumbbench#HOW-IT-WORKS-AND-WHY-IT-DOESN'T
> [4]: https://github.com/sharkdp/hyperfine
> [5]: https://github.com/Gabriel439/bench

That's interesting. When you think about it, the median is a better
measure than the average precisely because it is robust to outliers.
>
> Second, because of peculiarities of current processor architectures
> (caches, data prefetching, branch prediction), performing more
> operations might, in admittedly rare cases, be faster than doing
> fewer operations. One such example can be found in the CppCon 2019
> talk by Andrei Alexandrescu, "Speed Is Found In The Minds of People"
> [6][7], about 'small sort', where doing more operations results in,
> on average, a faster sort. This of course can only happen if the
> difference in the number of operations is small enough...
> nevertheless, it might be a good idea to at least check that the wall
> clock time agrees with the conclusions drawn from the number of
> commits walked, for at least a few examples.
>
> [6]: https://www.youtube.com/watch?v=FJJTYQYB1JQ
> [7]: https://github.com/CppCon/CppCon2019/blob/master/Presentations/speed_is_found_in_the_minds_of_people/speed_is_found_in_the_minds_of_people__andrei_alexandrescu__cppcon_2019.pdf
>
> AK> With the second report, storing corrected commit date in GDAT as
> AK> well as computing topological levels seems like a no-brainer. I have
> AK> started working on the patch and will push to the mailing list after
> AK> some discussion on the report.
>
> Do you have any numbers on how much providing backward compatibility
> costs at `git commit-graph write` time, that is, how much more time
> it takes to compute topological levels during the computation of
> corrected committer dates compared to storing GENERATION_NUMBER_MAX
> in place of the topological level? And does having topological level
> (as a tie-breaker) help Git performance when using the commit-graph
> for querying? Does having topological levels as a tie-breaker or as a
> secondary negative-cut reachability index help at all?
>

We do have timings comparing the time to compute topological levels
with the time to store GENERATION_NUMBER_MAX in their place [1]:

Writing GENERATION_NUMBER_MAX to commit-graph: 14.175s
Writing topological levels to commit-graph   : 14.331s

That's around 160ms, or about 1%, slower. I do think there's a case to
be made for GENERATION_NUMBER_MAX, because the performance degradation
it imposes on old Git clients would encourage faster adoption of new
Git (Junio was in favor of this the last time we discussed alternatives
[2]). It is a double-edged sword, though, as we would force people who
cannot upgrade Git into worse performance.

I do not have anything yet on using topological level as a tie-breaker.
I will benchmark and get back to you.

[1]: https://lore.kernel.org/git/20200630150056.GA4111@Abhishek-Arch/
[2]: https://lore.kernel.org/git/xmqq8sjp1mnz.fsf@gitster.c.googlers.com/

> Thank you for your work and for the report.
>
> P.S. Would it be possible to put GSoC entries into a separate 'GSoC'
> category instead of the generic 'Programming' one, or add a 'GSoC'
> tag?

Great idea! Try this out: https://abhishekkumar2718.github.io/gsoc/

> Best,
> --
> Jakub Narębski
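The overhead quoted above is straightforward to check; a quick sanity-check on the reported figures (the two timings come from the measurements above, the rest is plain arithmetic):

```python
# Reported wall clock times for `git commit-graph write`, in seconds.
gen_max = 14.175  # writing GENERATION_NUMBER_MAX in place of levels
topo = 14.331     # also computing topological levels

overhead = topo - gen_max            # absolute extra cost, seconds
percent = 100 * overhead / gen_max   # relative extra cost
print(f"{overhead * 1000:.0f} ms slower ({percent:.1f}%)")
# -> 156 ms slower (1.1%)
```

So the backward-compatible write is roughly 156 ms (about 1.1%) slower, consistent with the "around 160ms" figure, and the slowdown is in the write path, not a speedup.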