[INFO] Some preliminary performance data

* [INFO] Some preliminary performance data
@ 2020-05-02 23:20 Aleksandar Markovic
  2020-05-02 23:24 ` Aleksandar Markovic
  0 siblings, 1 reply; 11+ messages in thread
From: Aleksandar Markovic @ 2020-05-02 23:20 UTC (permalink / raw)
  To: QEMU Developers, Richard Henderson, Alex Bennée,
	Peter Maydell, Stefan Hajnoczi, Lukáš Doktor, craxel

[-- Attachment #1: Type: text/plain, Size: 3117 bytes --]

Hi, all.

I just want to share with you some bits and pieces of data that I got while
doing some preliminary experimentation for the GSoC project "TCG Continuous
Benchmarking", that Ahmed Karaman, a student of the fourth final year of
Electical Engineering Faculty in Cairo, will execute.

*User Mode*

   * As expected, for any program dealing with any substantional
floating-point calculation, softfloat library will be the the heaviest CPU
cycles consumer.
   * We plan to examine the performance behaviour of non-FP programs
(integer arithmetic), or even non-numeric programs (sorting strings, for
example).

*System Mode*

   * I did profiling of booting several machines using a tool called
callgrind (a part of valgrind). The tool offers pletora of information,
however it looks it is little confused by usage of coroutines, and that
makes some of its reports look very illogical, or plain ugly. Still, it
seems valid data can be extracted from it. Without going into details, here
is what it says for one machine (bear in mind that results may vary to a
great extent between machines):
     ** The booting involved six threads, one for display handling, one for
emulations, and four more. The last four did almost nothing during boot,
just almost entire time siting idle, waiting for something. As far as
"Total Instruction Fetch Count" (this is the main measure used in
callgrind), they were distributed in proportion 1:3 between display thread
and emulation thread (the rest of threads were negligible) (but,
interestingly enough, for another machine that proportion was 1:20).
     ** The display thread is dominated by vga_update_display() function
(21.5% "self" time, and 51.6% "self + callees" time, called almost 40000
times). Other functions worth mentioning are
cpu_physical_memory_snapshot_get_dirty() and
memory_region_snapshot_get_dirty(), which are very small functions, but are
both invoked over 26 000 000 times, and contribute with over 20% of display
thread instruction fetch count together.
     ** Focusing now on emulation thread, "Total Instruction Fetch Counts"
were roughly distributed this way:
           - 15.7% is execution of GIT-ed code from translation block buffer
           - 39.9% is execution of helpers
           - 44.4% is code translation stage, including some coroutine
activities
        Top two among helpers:
          - helper_le_stl_memory()
          - helper_lookup_tb_ptr() (this one is invoked whopping 36 000 000
times)
        Single largest instruction consumer of code translation:
          - liveness_pass_1(), that constitutes 21.5% of the entire
"emulation thread" consumption, or, in other way, almost half of code
translation stage (that sits at 44.4%)

Please take all this with a little grain of salt, since these results are
just of preliminary nature.

I would like to use this opportunity to welcome Ahmed Karaman, a talented
young man from Egypt, into QEMU development community, that'll work on "TCG
Continuous Benchmarking" project this summer. Please do help them in his
first steps as our colleague. Best luck to Ahmed!

Thanks,
Aleksandar

[-- Attachment #2: Type: text/html, Size: 3665 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread