* [INFO] Some preliminary performance data
@ 2020-05-02 23:20 Aleksandar Markovic
  2020-05-02 23:24 ` Aleksandar Markovic
  0 siblings, 1 reply; 11+ messages in thread
From: Aleksandar Markovic @ 2020-05-02 23:20 UTC (permalink / raw)
  To: QEMU Developers, Richard Henderson, Alex Bennée,
	Peter Maydell, Stefan Hajnoczi, Lukáš Doktor, craxel

[-- Attachment #1: Type: text/plain, Size: 3117 bytes --]

Hi, all.

I just want to share with you some bits and pieces of data that I gathered
while doing some preliminary experimentation for the GSoC project "TCG
Continuous Benchmarking", which Ahmed Karaman, a final-year Electrical
Engineering student in Cairo, will execute.

*User Mode*

   * As expected, for any program doing any substantial floating-point
calculation, the softfloat library will be the heaviest consumer of CPU
cycles (a rough sketch of how that share can be quantified is included below).
   * We plan to examine the performance behaviour of non-FP programs
(integer arithmetic), or even non-numeric programs (sorting strings, for
example).

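For illustration, here is a minimal sketch of how that softfloat share could
be quantified for a single user-mode run. The qemu-mips and guest binary
names, and the crude function-name filter, are placeholders for illustration,
not the exact setup behind the statement above:

  #!/usr/bin/env python3
  # Rough sketch: estimate how much of a user-mode emulation run lands in
  # softfloat by summing callgrind instruction-fetch counts ("Ir") of
  # functions whose names look softfloat-related.  Binary names and the
  # name filter are illustrative placeholders.
  import subprocess

  OUT = "callgrind.out.fp"
  subprocess.run(["valgrind", "--tool=callgrind",
                  f"--callgrind-out-file={OUT}",
                  "qemu-mips", "./fp_bench"], check=True)

  annotated = subprocess.run(["callgrind_annotate", OUT],
                             capture_output=True, text=True,
                             check=True).stdout

  # callgrind_annotate lists only the hottest functions by default, so the
  # computed share is a lower bound.
  total = soft = 0
  for line in annotated.splitlines():
      parts = line.split()
      if not parts or not parts[0].replace(",", "").isdigit():
          continue
      count = int(parts[0].replace(",", ""))
      if "PROGRAM TOTALS" in line:
          total = count
      elif "float" in line:          # crude softfloat name filter
          soft += count

  if total:
      print(f"softfloat-ish share of instruction fetches: "
            f"{100.0 * soft / total:.1f}%")
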
*System Mode*

   * I profiled the boot of several machines using a tool called
callgrind (a part of valgrind). The tool offers a plethora of information;
however, it appears to be a little confused by the usage of coroutines, and
that makes some of its reports look very illogical, or plain ugly. Still, it
seems valid data can be extracted from it. Without going into details, here
is what it says for one machine (bear in mind that results may vary to a
great extent between machines; a sketch of the kind of harness that can
produce such per-thread numbers follows the list below):
     ** The boot involved six threads: one for display handling, one for
emulation, and four more. The last four did almost nothing during boot,
sitting idle almost the entire time, waiting for something. In terms of
"Total Instruction Fetch Count" (the main measure used in callgrind), the
counts were distributed in a proportion of roughly 1:3 between the display
thread and the emulation thread (the remaining threads were negligible);
interestingly enough, for another machine that proportion was 1:20.
     ** The display thread is dominated by the vga_update_display() function
(21.5% "self" time, 51.6% "self + callees" time, called almost 40000
times). Other functions worth mentioning are
cpu_physical_memory_snapshot_get_dirty() and
memory_region_snapshot_get_dirty(), which are very small functions, but are
both invoked over 26,000,000 times and together contribute over 20% of the
display thread's instruction fetch count.
     ** Focusing now on the emulation thread, the "Total Instruction Fetch
Counts" were roughly distributed this way:
           - 15.7%: execution of JIT-ed code from the translation block buffer
           - 39.9%: execution of helpers
           - 44.4%: the code translation stage, including some coroutine
activities
        Top two among helpers:
          - helper_le_stl_memory()
          - helper_lookup_tb_ptr() (this one is invoked a whopping 36,000,000
times)
        Single largest instruction consumer in code translation:
          - liveness_pass_1(), which accounts for 21.5% of the entire
emulation thread's consumption, or, put another way, almost half of the code
translation stage (which sits at 44.4%)

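For reference, below is a minimal sketch of the kind of harness that can
produce such per-thread numbers; the machine type, kernel and boot window are
placeholders, not the exact configuration profiled above. Callgrind's
--separate-threads=yes option writes one output file per thread:

  #!/usr/bin/env python3
  # Rough sketch: boot a machine under callgrind with per-thread profiling
  # and print the instruction-fetch ("Ir") total of every QEMU thread.
  # Machine options, kernel and the boot window are placeholder assumptions.
  import glob
  import subprocess
  import time

  cmd = ["valgrind", "--tool=callgrind", "--separate-threads=yes",
         "--callgrind-out-file=callgrind.out.boot",
         "qemu-system-mips", "-M", "malta", "-m", "128",
         "-kernel", "vmlinux"]

  qemu = subprocess.Popen(cmd)
  time.sleep(300)                      # let the guest boot (callgrind is slow)
  subprocess.run(["callgrind_control", "-d"])    # force a profile dump
  qemu.terminate()
  qemu.wait()

  # With --separate-threads=yes each thread gets its own output file (it is
  # given a "-<threadID>" suffix); the totals/summary line carries that
  # thread's Ir count (the exact keyword may vary with the valgrind version).
  for path in sorted(glob.glob("callgrind.out.boot*")):
      with open(path) as f:
          for line in f:
              if line.startswith(("totals:", "summary:")):
                  print(f"{path}: Ir = {line.split()[1]}")
                  break
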
Please take all this with a grain of salt, since these results are only
preliminary.

I would like to use this opportunity to welcome Ahmed Karaman, a talented
young man from Egypt, into the QEMU development community; he will work on
the "TCG Continuous Benchmarking" project this summer. Please do help him in
his first steps as our colleague. Best of luck to Ahmed!

Thanks,
Aleksandar

[-- Attachment #2: Type: text/html, Size: 3665 bytes --]


* Re: [INFO] Some preliminary performance data
  2020-05-02 23:20 [INFO] Some preliminary performance data Aleksandar Markovic
@ 2020-05-02 23:24 ` Aleksandar Markovic
  2020-05-03  6:47   ` Ahmed Karaman
  2020-05-06 11:26   ` Alex Bennée
  0 siblings, 2 replies; 11+ messages in thread
From: Aleksandar Markovic @ 2020-05-02 23:24 UTC (permalink / raw)
  To: QEMU Developers, Richard Henderson, Alex Bennée,
	Peter Maydell, Stefan Hajnoczi, Lukáš Doktor, kraxel,
	ahmedkhaledkaraman

[-- Attachment #1: Type: text/plain, Size: 3465 bytes --]

[correcting some email addresses]

On Sun, May 3, 2020 at 01:20, Aleksandar Markovic
<aleksandar.qemu.devel@gmail.com> wrote:

> [...]

[-- Attachment #2: Type: text/html, Size: 4109 bytes --]


* Re: [INFO] Some preliminary performance data
  2020-05-02 23:24 ` Aleksandar Markovic
@ 2020-05-03  6:47   ` Ahmed Karaman
  2020-05-06 11:26   ` Alex Bennée
  1 sibling, 0 replies; 11+ messages in thread
From: Ahmed Karaman @ 2020-05-03  6:47 UTC (permalink / raw)
  To: Aleksandar Markovic
  Cc: Lukáš Doktor, Peter Maydell, Stefan Hajnoczi,
	Richard Henderson, QEMU Developers, kraxel, Alex Bennée

[-- Attachment #1: Type: text/plain, Size: 3811 bytes --]

Thanks, Mr. Aleksandar, for the introduction.
I'm really looking forward to working with the QEMU developer community
this summer.
Wishing all of you health and safety.


On Sun, May 3, 2020, 1:25 AM Aleksandar Markovic <
aleksandar.qemu.devel@gmail.com> wrote:

> [...]

[-- Attachment #2: Type: text/html, Size: 4778 bytes --]


* Re: [INFO] Some preliminary performance data
  2020-05-02 23:24 ` Aleksandar Markovic
  2020-05-03  6:47   ` Ahmed Karaman
@ 2020-05-06 11:26   ` Alex Bennée
  2020-05-09 10:16     ` Aleksandar Markovic
  1 sibling, 1 reply; 11+ messages in thread
From: Alex Bennée @ 2020-05-06 11:26 UTC (permalink / raw)
  To: Aleksandar Markovic
  Cc: Lukáš Doktor, Peter Maydell, Stefan Hajnoczi,
	Richard Henderson, QEMU Developers, ahmedkhaledkaraman,
	Emilio G . Cota, kraxel


Aleksandar Markovic <aleksandar.qemu.devel@gmail.com> writes:

Some preliminary thoughts....

>> Hi, all.
>>
>> I just want to share with you some bits and pieces of data that I gathered
>> while doing some preliminary experimentation for the GSoC project "TCG
>> Continuous Benchmarking", which Ahmed Karaman, a final-year Electrical
>> Engineering student in Cairo, will execute.
>>
>> *User Mode*
>>
>>    * As expected, for any program doing any substantial
>> floating-point calculation, the softfloat library will be the heaviest
>> consumer of CPU cycles.
>>    * We plan to examine the performance behaviour of non-FP programs
>> (integer arithmetic), or even non-numeric programs (sorting strings, for
>> example).

Emilio was the last person to do extensive benchmarking on TCG, and he
used a mild fork of the venerable nbench:

  https://github.com/cota/dbt-bench

As the hot code is fairly small, it offers a good way of testing the quality
of the output. Larger programs will differ, as they can involve more code
generation.

>>
>> *System Mode*
>>
>>    * I profiled the boot of several machines using a tool called
>> callgrind (a part of valgrind). The tool offers a plethora of information;
>> however, it appears to be a little confused by the usage of coroutines, and
>> that makes some of its reports look very illogical, or plain ugly.

Doesn't running through valgrind inherently serialise execution anyway?
If you are looking for latency caused by locks, we have support for the
QEMU sync profiler built into the code. See "help sync-profile" on the HMP.

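For what it's worth, the same profiler can also be driven without an
interactive console by passing the HMP commands through QMP's
human-monitor-command. A minimal sketch, assuming QEMU was started with
"-qmp unix:/tmp/qmp.sock,server,nowait" (the socket path is just an example):

  #!/usr/bin/env python3
  # Minimal QMP client sketch: enable the sync profiler, then fetch its
  # report.  Assumes -qmp unix:/tmp/qmp.sock,server,nowait on the QEMU
  # command line; error handling and QMP events are ignored for brevity.
  import json
  import socket

  def hmp(f, command_line):
      # human-monitor-command passes an HMP command through QMP
      f.write(json.dumps({"execute": "human-monitor-command",
                          "arguments": {"command-line": command_line}}) + "\n")
      f.flush()
      return json.loads(f.readline()).get("return", "")

  s = socket.socket(socket.AF_UNIX)
  s.connect("/tmp/qmp.sock")
  f = s.makefile("rw")

  json.loads(f.readline())                                  # QMP greeting
  f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
  f.flush()
  json.loads(f.readline())                                  # handshake reply

  hmp(f, "sync-profile on")       # start collecting lock/condvar statistics
  input("profiling... press Enter to print the report ")
  print(hmp(f, "info sync-profile"))
  hmp(f, "sync-profile off")
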
>> Still, it
>> seems valid data can be extracted from it. Without going into details, here
>> is what it says for one machine (bear in mind that results may vary to a
>> great extent between machines):

You can also use perf's sampling to find hot spots in the code.
One of last year's GSoC students wrote some patches that included the
ability to dump a jit info file for perf to consume. We never got it
merged in the end, but it might be worth having a go at pulling the
relevant bits out from:

  Subject: [PATCH  v9 00/13] TCG code quality tracking and perf integration
  Date: Mon,  7 Oct 2019 16:28:26 +0100
  Message-Id: <20191007152839.30804-1-alex.bennee@linaro.org>

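On the perf side the flow is the standard jitdump processing. A rough sketch,
assuming a QEMU built with those (unmerged) patches so that the generated
code is actually described in a jit dump file; binary and guest program
names are placeholders:

  #!/usr/bin/env python3
  # Rough sketch of the usual perf jitdump flow.  Only useful if the QEMU
  # being profiled emits jitdump records (i.e. with the patches above);
  # binary names are placeholders.
  import subprocess

  guest = ["qemu-x86_64", "./guest_prog"]

  # -k 1 records timestamps with CLOCK_MONOTONIC so they match the jitdump.
  subprocess.run(["perf", "record", "-k", "1", "-o", "perf.data", "--"] + guest,
                 check=True)
  # Fold the jit-<pid>.dump records into the profile as synthetic DSOs.
  subprocess.run(["perf", "inject", "--jit", "-i", "perf.data",
                  "-o", "perf.data.jitted"], check=True)
  # JIT-ed translation blocks now show up with symbols in the report.
  subprocess.run(["perf", "report", "-i", "perf.data.jitted"], check=True)
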
>>      ** The boot involved six threads: one for display handling, one
>> for emulation, and four more. The last four did almost nothing during
>> boot, sitting idle almost the entire time, waiting for something. In terms
>> of "Total Instruction Fetch Count" (the main measure used in callgrind),
>> the counts were distributed in a proportion of roughly 1:3 between the
>> display thread and the emulation thread (the remaining threads were
>> negligible); interestingly enough, for another machine that proportion was 1:20.
>>      ** The display thread is dominated by the vga_update_display() function
>> (21.5% "self" time, 51.6% "self + callees" time, called almost 40000
>> times). Other functions worth mentioning are
>> cpu_physical_memory_snapshot_get_dirty() and
>> memory_region_snapshot_get_dirty(), which are very small functions, but are
>> both invoked over 26,000,000 times and together contribute over 20% of the
>> display thread's instruction fetch count.

The memory region tracking code will end up forcing the slow path for a
lot of memory accesses to video memory via softmmu. You may want to
measure whether there is a difference using one of the virtio-based graphics
displays.

>>      ** Focusing now on the emulation thread, the "Total Instruction Fetch
>> Counts" were roughly distributed this way:
>>            - 15.7%: execution of JIT-ed code from the translation block
>> buffer
>>            - 39.9%: execution of helpers
>>            - 44.4%: the code translation stage, including some coroutine
>> activities
>>         Top two among helpers:
>>           - helper_le_stl_memory()

I assume that is the MMU slow-path being called from the generated code.

>>           - helper_lookup_tb_ptr() (this one is invoked a whopping
>> 36,000,000 times)

This is an optimisation to avoid exiting the run-loop to find the next
block. From memory I think the two main cases you'll see are:

 - computed jumps (i.e. target not known at JIT time)
 - jumps outside of the current page

>>         Single largest instruction consumer in code translation:
>>           - liveness_pass_1(), which accounts for 21.5% of the entire
>> emulation thread's consumption, or, put another way, almost half of the
>> code translation stage (which sits at 44.4%)

This is very much driven by how much code generation vs running you see.
In most of my personal benchmarks I never really notice code generation
because I give my machines large amounts of RAM, so code tends to stay
resident and not need to be re-translated. When the optimiser shows up
it's usually accompanied by high TB flush and invalidate counts in "info
jit" because we are doing more translation than we usually do.

I'll also mention my foray into tracking down the performance regression
of DOSBox Doom:

  https://diasp.eu/posts/8659062

It presented a very nice demonstration of the increasing complexity (and
run time) of the optimiser, which was completely wasted due to
self-modifying code causing us to regenerate code all the time.

>>
>> Please take all this with a grain of salt, since these results are only
>> preliminary.
>>
>> I would like to use this opportunity to welcome Ahmed Karaman, a talented
>> young man from Egypt, into the QEMU development community; he will work on
>> the "TCG Continuous Benchmarking" project this summer. Please do help him
>> in his first steps as our colleague. Best of luck to Ahmed!

Welcome to the QEMU community, Ahmed. Feel free to CC me on TCG
performance-related patches. I like to see things go faster ;-)

-- 
Alex Bennée



* Re: [INFO] Some preliminary performance data
  2020-05-06 11:26   ` Alex Bennée
@ 2020-05-09 10:16     ` Aleksandar Markovic
  2020-05-09 10:26       ` Aleksandar Markovic
  2020-05-09 11:36       ` Laurent Desnogues
  0 siblings, 2 replies; 11+ messages in thread
From: Aleksandar Markovic @ 2020-05-09 10:16 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Lukáš Doktor, Peter Maydell, Stefan Hajnoczi,
	Richard Henderson, QEMU Developers, ahmedkhaledkaraman,
	Emilio G . Cota, Gerd Hoffmann


[-- Attachment #1.1.1: Type: text/plain, Size: 7010 bytes --]

On Wed, May 6, 2020 at 13:26, Alex Bennée <alex.bennee@linaro.org>
wrote:
>
>
> Aleksandar Markovic <aleksandar.qemu.devel@gmail.com> writes:
>
> Some preliminary thoughts....
>

Alex, many thanks for all your thoughts and hints - they are truly helpful!

I will most likely respond to all of them in a future email, but for now
I will comment on just one.

> [...]
>
> This is very much driven by how much code generation vs running you see.
> In most of my personal benchmarks I never really notice code generation
> because I give my machines large amounts of RAM so code tends to stay
> resident and not need to be re-translated. When the optimiser shows up
> it's usually accompanied by high TB flush and invalidate counts in "info
> jit" because we are doing more translation than we usually do.
>

Yes, I think the machine was set up with only 128 MB of RAM.

That would actually be an interesting experiment for Ahmed - to
measure the impact of the amount of guest RAM on performance.

But it looks like, at least for machines with small RAM, the translation
phase will take a significant percentage.

I am attaching the call graph for the translation phase of "Hello World"
built for mips and emulated by QEMU (tb_gen_code() and its callees).

(I am also attaching the pic in case it is not displayed well inline.)

[image: tb_gen_code.png]


> [...]

[-- Attachment #1.1.2: Type: text/html, Size: 8860 bytes --]

[-- Attachment #1.2: tb_gen_code.png --]
[-- Type: image/png, Size: 96587 bytes --]

[-- Attachment #2: tb_gen_code.png --]
[-- Type: image/png, Size: 96587 bytes --]


* Re: [INFO] Some preliminary performance data
  2020-05-09 10:16     ` Aleksandar Markovic
@ 2020-05-09 10:26       ` Aleksandar Markovic
  2020-05-09 11:36       ` Laurent Desnogues
  1 sibling, 0 replies; 11+ messages in thread
From: Aleksandar Markovic @ 2020-05-09 10:26 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Lukáš Doktor, Peter Maydell, Stefan Hajnoczi,
	Richard Henderson, QEMU Developers, ahmedkhaledkaraman,
	Emilio G . Cota, Gerd Hoffmann

[-- Attachment #1: Type: text/plain, Size: 7661 bytes --]

On Sat, May 9, 2020 at 12:16, Aleksandar Markovic
<aleksandar.qemu.devel@gmail.com> wrote:
>
>
>
> On Wed, May 6, 2020 at 13:26, Alex Bennée <alex.bennee@linaro.org>
> wrote:
> >
> >
> > Aleksandar Markovic <aleksandar.qemu.devel@gmail.com> writes:
> >
> > Some preliminary thoughts....
> >
>
> Alex, many thanks for all your thoughts and hints - they are truly helpful!
>
> I will most likely respond to all of them in a future email, but for now
> I will comment on just one.
>

It looks like right-click and "View Image" works for HTML mails with
embedded images - it displays the image in its original resolution.
So, there is no need for attachments. Good to know for Ahmed's potential
reports with images.

Aleksandar

> [...]

[-- Attachment #2: Type: text/html, Size: 9883 bytes --]


* Re: [INFO] Some preliminary performance data
  2020-05-09 10:16     ` Aleksandar Markovic
  2020-05-09 10:26       ` Aleksandar Markovic
@ 2020-05-09 11:36       ` Laurent Desnogues
  2020-05-09 12:37         ` Aleksandar Markovic
  1 sibling, 1 reply; 11+ messages in thread
From: Laurent Desnogues @ 2020-05-09 11:36 UTC (permalink / raw)
  To: Aleksandar Markovic
  Cc: Lukáš Doktor, Peter Maydell, Stefan Hajnoczi,
	Richard Henderson, QEMU Developers, ahmedkhaledkaraman,
	Emilio G . Cota, Gerd Hoffmann, Alex Bennée

On Sat, May 9, 2020 at 12:17 PM Aleksandar Markovic
<aleksandar.qemu.devel@gmail.com> wrote:
>  On Wed, May 6, 2020 at 13:26, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> > This is very much driven by how much code generation vs running you see.
> > In most of my personal benchmarks I never really notice code generation
> > because I give my machines large amounts of RAM so code tends to stay
> > resident and not need to be re-translated. When the optimiser shows up
> > it's usually accompanied by high TB flush and invalidate counts in "info
> > jit" because we are doing more translation than we usually do.
> >
>
> Yes, I think the machine was set up with only 128 MB of RAM.
>
> That would actually be an interesting experiment for Ahmed - to
> measure the impact of the amount of guest RAM on performance.
>
> But it looks like, at least for machines with small RAM, the translation
> phase will take a significant percentage.
>
> I am attaching the call graph for the translation phase of "Hello World"
> built for mips and emulated by QEMU (tb_gen_code() and its callees).

Sorry if I'm stating the obvious, but both "Hello World" and a
Linux boot will exhibit similar behaviors with low reuse of
translated blocks, which means translation will show up in
profiles as a lot of time is spent in translating blocks that
will run once.  If you push in that direction you might reach
the conclusion that a non-JIT simulator is faster than QEMU.

You will have to carefully select the tests you run: you need
a large spectrum, from a Linux boot and "Hello World" up to synthetic
benchmarks.

Again sorry if that was too trivial :-)

Laurent



* Re: [INFO] Some preliminary performance data
  2020-05-09 11:36       ` Laurent Desnogues
@ 2020-05-09 12:37         ` Aleksandar Markovic
  2020-05-09 12:50           ` Laurent Desnogues
  0 siblings, 1 reply; 11+ messages in thread
From: Aleksandar Markovic @ 2020-05-09 12:37 UTC (permalink / raw)
  To: Laurent Desnogues
  Cc: Lukáš Doktor, Peter Maydell, Stefan Hajnoczi,
	Richard Henderson, QEMU Developers, ahmedkhaledkaraman,
	Emilio G . Cota, Gerd Hoffmann, Alex Bennée

On Sat, May 9, 2020 at 13:37, Laurent Desnogues
<laurent.desnogues@gmail.com> wrote:
> [...]

Hi, Laurent,

"Hello world" was taken as an example where code generation is
dominant. It was taken to illustrate how performance-wise code
generation overhead is distributed (illustrating dominance of a
single function).

While "Hello world" by itself is not a significant example, it conveys
a useful information: it says how much is the overhead of QEMU
linux-user executable initialization, and code generation spent on
emulation of loading target executable and printing a simple
message. This can be roughly deducted from the result for
a meaningful benchmark.

Booting of a virtual machine is a legitimate scenario for measuring
performance, and perhaps even attempting improving it.

Everything should be measured - code generation, JIT-ed code
execution, and helpers execution - in all cases, and checked
whether it departs from expected behavior.

Let's say that we emulate a benchmark that basically runs some
code in a loop, or an algorithm - one would expect that, as the
number of iterations of the loop or the size of the data in the
algorithm grows, code generation becomes less and less significant,
converging to zero. Well, this should be confirmed with an
experiment, and not taken for granted; a small sketch of such an
experiment is below.

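Just to illustrate what such an experiment could look like, here is a rough
sketch; the guest binary, the iteration counts, and the use of tb_gen_code()'s
inclusive cost as a proxy for "code generation" are all assumptions made for
illustration:

  #!/usr/bin/env python3
  # Rough sketch: does the share of code generation really shrink as the
  # guest workload runs longer?  tb_gen_code()'s inclusive Ir is used as a
  # proxy for "translation"; binary names and counts are placeholders.
  import re
  import subprocess

  def inclusive_ir(annotated, needle):
      for line in annotated.splitlines():
          if needle in line:
              m = re.match(r"\s*([\d,]+)", line)
              if m:
                  return int(m.group(1).replace(",", ""))
      return 0

  for iters in ("1000", "100000", "10000000"):
      out = f"callgrind.out.{iters}"
      subprocess.run(["valgrind", "--tool=callgrind",
                      f"--callgrind-out-file={out}",
                      "qemu-mips", "./loop_bench", iters], check=True)
      ann = subprocess.run(["callgrind_annotate", "--inclusive=yes", out],
                           capture_output=True, text=True, check=True).stdout
      total = inclusive_ir(ann, "PROGRAM TOTALS")
      tbgen = inclusive_ir(ann, "tb_gen_code")
      if total:
          print(f"{iters:>9} iterations: tb_gen_code() = "
                f"{100.0 * tbgen / total:.1f}% of all instruction fetches")
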
I think limiting measurements to only, let's say, the execution of
JIT-ed code (if that is what you implied) is a logical mistake.
The right conclusions should be drawn from the complete
picture, shouldn't they?

Yours,
Aleksandar

> Sorry if I'm stating the obvious, but both "Hello World" and a
> Linux boot will exhibit similar behaviors with low reuse of
> translated blocks, which means translation will show up in
> profiles as a lot of time is spent in translating blocks that
> will run once.  If you push in that direction you might reach
> the conclusion that a non-JIT simulator is faster than QEMU.
>
> You will have to carefully select the tests you run: you need
> a large spectrum, from a Linux boot and "Hello World" up to synthetic
> benchmarks.
>
> Again sorry if that was too trivial :-)
>
> Laurent



* Re: [INFO] Some preliminary performance data
  2020-05-09 12:37         ` Aleksandar Markovic
@ 2020-05-09 12:50           ` Laurent Desnogues
  2020-05-09 12:55             ` Aleksandar Markovic
  2020-05-09 16:49             ` Alex Bennée
  0 siblings, 2 replies; 11+ messages in thread
From: Laurent Desnogues @ 2020-05-09 12:50 UTC (permalink / raw)
  To: Aleksandar Markovic
  Cc: Lukáš Doktor, Peter Maydell, Stefan Hajnoczi,
	Richard Henderson, QEMU Developers, ahmedkhaledkaraman,
	Emilio G . Cota, Gerd Hoffmann, Alex Bennée

On Sat, May 9, 2020 at 2:38 PM Aleksandar Markovic
<aleksandar.qemu.devel@gmail.com> wrote:
>
> On Sat, May 9, 2020 at 13:37, Laurent Desnogues
> <laurent.desnogues@gmail.com> wrote:
> [...]
>
> I think limiting measurements to only, let's say, the execution of
> JIT-ed code (if that is what you implied) is a logical mistake.
> The right conclusions should be drawn from the complete
> picture, shouldn't they?

I explicitly wrote that you should consider a wide spectrum of
programs, so I think we're in violent agreement ;-)

Thanks,

Laurent




* Re: [INFO] Some preliminary performance data
  2020-05-09 12:50           ` Laurent Desnogues
@ 2020-05-09 12:55             ` Aleksandar Markovic
  2020-05-09 16:49             ` Alex Bennée
  1 sibling, 0 replies; 11+ messages in thread
From: Aleksandar Markovic @ 2020-05-09 12:55 UTC (permalink / raw)
  To: Laurent Desnogues
  Cc: Lukáš Doktor, Peter Maydell, Stefan Hajnoczi,
	Richard Henderson, QEMU Developers, ahmedkhaledkaraman,
	Emilio G . Cota, Gerd Hoffmann, Alex Bennée

On Sat, May 9, 2020 at 14:50, Laurent Desnogues
<laurent.desnogues@gmail.com> wrote:
>
> On Sat, May 9, 2020 at 2:38 PM Aleksandar Markovic
> <aleksandar.qemu.devel@gmail.com> wrote:
> > [...]
>
> I explicitly wrote that you should consider a wide spectrum of
> programs, so I think we're in violent agreement ;-)
>

lol, I will write down "violent agreement" in my mental notebook
of useful phrases. :))

Yours,
Aleksandar




* Re: [INFO] Some preliminary performance data
  2020-05-09 12:50           ` Laurent Desnogues
  2020-05-09 12:55             ` Aleksandar Markovic
@ 2020-05-09 16:49             ` Alex Bennée
  1 sibling, 0 replies; 11+ messages in thread
From: Alex Bennée @ 2020-05-09 16:49 UTC (permalink / raw)
  To: Laurent Desnogues
  Cc: Lukáš Doktor, Peter Maydell, Stefan Hajnoczi,
	Richard Henderson, QEMU Developers, ahmedkhaledkaraman,
	Aleksandar Markovic, Emilio G . Cota, Gerd Hoffmann


Laurent Desnogues <laurent.desnogues@gmail.com> writes:

> [...]
>
> I explicitly wrote that you should consider a wide spectrum of
> programs, so I think we're in violent agreement ;-)

If you want a good example of a real-world use case where we could
improve things, then I suggest looking at compilers.

They are frequently instantiated once per compilation unit, and once done
all the JIT translations are thrown away. While the code path taken by a
compiler may be different for every unit it compiles, I bet there are
savings we could make by caching compilation. The first step would be
identifying how similar the profiles of the generated code are.

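One cheap way to get a first estimate of that similarity, sketched below, is
to log the guest code QEMU translates with "-d in_asm" for two compilation
units and compare the resulting sets of translated blocks. The compiler
invocation, file names and block hashing are illustrative assumptions, not a
worked-out methodology:

  #!/usr/bin/env python3
  # Rough sketch: how similar is the code QEMU translates for two compile
  # jobs?  Log every translated block with -d in_asm, hash each block
  # (guest addresses stripped, since layout may differ between runs), and
  # compare the two sets.  The guest compiler and file names are placeholders.
  import hashlib
  import subprocess

  def digest(lines):
      return hashlib.sha1("".join(lines).encode()).hexdigest()

  def translated_blocks(logfile, guest_cmd):
      subprocess.run(["qemu-x86_64", "-d", "in_asm", "-D", logfile] + guest_cmd,
                     check=True)
      blocks, current = set(), []
      with open(logfile) as f:
          for line in f:
              if line.startswith("IN:"):      # marks the start of a new block
                  if current:
                      blocks.add(digest(current))
                  current = []
              elif line.strip():
                  # keep only the instruction text, not the guest address
                  current.append(line.split(":", 1)[-1])
      if current:
          blocks.add(digest(current))
      return blocks

  # cc1 (the compiler proper) run directly on two preprocessed units
  a = translated_blocks("tb_a.log", ["./cc1", "a.i", "-o", "a.s"])
  b = translated_blocks("tb_b.log", ["./cc1", "b.i", "-o", "b.s"])
  shared = len(a & b)
  print(f"unit A: {len(a)} blocks, unit B: {len(b)} blocks, shared: {shared} "
        f"({100.0 * shared / len(a | b):.1f}% of the union)")
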
-- 
Alex Bennée


