* [Qemu-devel] Benchmarking linux-user performance
@ 2017-03-10  1:23 Emilio G. Cota
  2017-03-10 11:45 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 8+ messages in thread
From: Emilio G. Cota @ 2017-03-10  1:23 UTC (permalink / raw)
  To: Richard Henderson, Laurent Vivier, Peter Maydell, Paolo Bonzini,
	Alex Bennée
  Cc: qemu-devel

Hi all,

Inspired by SimBench[1], I have written a set of scripts ("DBT-bench")
to easily obtain and plot performance numbers for linux-user.

The (Perl) scripts are available here:
  https://github.com/cota/dbt-bench
[ It's better to clone with --recursive because the benchmarks
(NBench) are pulled as a submodule. ]

I'm using NBench because (1) it's just a few files and they take
very little time to run (~5min per QEMU version, if performance
on the host machine is stable), (2) AFAICT its sources are in the
public domain (whereas SPEC's sources cannot be redistributed),
and (3) with NBench I get results similar to SPEC's.

Here are linux-user performance numbers from v1.0 to v2.8 (higher
is better):

                        x86_64 NBench Integer Performance
                 Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz                
                                                                               
  36 +-+-+---+---+---+--+---+---+---+---+---+---+---+---+--+---+---+---+-+-+   
     |   +   +   +   +  +   +   +   +   +   +   +   +   +  +   +   +  ***  |   
  34 +-+                                                             #*A*+-+   
     |                                                            *A*      |   
  32 +-+                                                          #      +-+   
  30 +-+                                                          #      +-+   
     |                                                           #         |   
  28 +-+                                                        #        +-+   
     |                                 *A*#*A*#*A*#*A*#*A*#     #          |   
  26 +-+                   *A*#*A*#***#    ***         ******#*A*        +-+   
     |                     #       *A*                    *A* ***          |   
  24 +-+                  #                                              +-+   
  22 +-+                 #                                               +-+   
     |             #*A**A*                                                 |   
  20 +-+       #*A*                                                      +-+   
     |  *A*#*A*  +   +  +   +   +   +   +   +   +   +   +  +   +   +   +   |   
  18 +-+-+---+---+---+--+---+---+---+---+---+---+---+---+--+---+---+---+-+-+   
       v1.v1.1v1.2v1.v1.4v1.5v1.6v1.7v2.0v2.1v2.2v2.3v2.v2.5v2.6v2.7v2.8.0     
                                  QEMU version                                 


                     x86_64 NBench Floating Point Performance                  
                  Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz               
                                                                               
  1.88 +-+-+---+--+---+---+---+--+---+---+---+---+--+---+---+---+--+---+-+-+   
       |   +   +  +  *A*#*A*  +  +   +   +   +   +  +   +   +   +  +   +   |   
  1.86 +-+           *** ***                                             +-+   
       |            #       #   *A*#***                                    |   
       |      *A*# #         # ##   *A*                                    |   
  1.84 +-+    #  *A*         *A*      #                                  +-+   
       |      #                        #                              *A*  |   
  1.82 +-+   #                          #                            ##  +-+   
       |     #                          *A*#                        #      |   
   1.8 +-+  #                               #  #*A*               *A*    +-+   
       |    #                               *A*   #                #       |   
  1.78 +-+*A*                                      #       *A*    #      +-+   
       |                                           #   ***#  #    #        |   
       |                                           *A*#*A*    #  #         |   
  1.76 +-+                                         ***         # #       +-+   
       |   +   +  +   +   +   +  +   +   +   +   +  +   +   +  *A* +   +   |   
  1.74 +-+-+---+--+---+---+---+--+---+---+---+---+--+---+---+---+--+---+-+-+   
         v1.v1.v1.2v1.3v1.4v1.v1.6v1.7v2.0v2.1v2.v2.3v2.4v2.5v2.v2.7v2.8.0     
                                   QEMU version                                

Same plots, in PNG: http://imgur.com/a/nF7Ls

These plots are obtained simply by running
	$ QEMU_PATH=path/to/qemu QEMU_ARCH=x86_64 make -j
from dbt-bench, although note that some user intervention was needed
to compile old QEMU versions.

I think having some well-defined, easy-to-run benchmarks (even
if far from perfect, like these) to aid development is better
than not having any. My hope is that having these will encourage
future performance improvements to the emulation loop and TCG -- or
at least serve as a warning when performance regresses excessively :-)

Let me know if you find this work useful.

Thanks,

		Emilio

[1] https://bitbucket.org/simbench/simbench
SimBench's authors have a paper on it, although it is not publicly
available yet (will be presented at the ISPASS'17 conference in April).
The abstract can be accessed here though: http://tinyurl.com/hahb4yj


* Re: [Qemu-devel] Benchmarking linux-user performance
  2017-03-10  1:23 [Qemu-devel] Benchmarking linux-user performance Emilio G. Cota
@ 2017-03-10 11:45 ` Dr. David Alan Gilbert
  2017-03-10 11:48   ` Peter Maydell
  2017-03-11  2:18   ` Emilio G. Cota
  0 siblings, 2 replies; 8+ messages in thread
From: Dr. David Alan Gilbert @ 2017-03-10 11:45 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: Richard Henderson, Laurent Vivier, Peter Maydell, Paolo Bonzini,
	Alex Bennée, qemu-devel

* Emilio G. Cota (cota@braap.org) wrote:
> Hi all,
> 
> Inspired by SimBench[1], I have written a set of scripts ("DBT-bench")
> to easily obtain and plot performance numbers for linux-user.
> 
> The (Perl) scripts are available here:
>   https://github.com/cota/dbt-bench
> [ It's better to clone with --recursive because the benchmarks
> (NBench) are pulled as a submodule. ]
> 
> I'm using NBench because (1) it's just a few files and they take
> very little time to run (~5min per QEMU version, if performance
> on the host machine is stable), (2) AFAICT its sources are in the
> public domain (whereas SPEC's sources cannot be redistributed),
> and (3) with NBench I get results similar to SPEC's.

Does NBench include anything with lots of small processes, or a large
chunk of code?  Using benchmarks with small code tends to skew DBT optimisations
towards very heavy block optimisation that doesn't work in real applications,
where the cost of translation can hurt if it's too high.

> Here are linux-user performance numbers from v1.0 to v2.8 (higher
> is better):
> 
>                         x86_64 NBench Integer Performance
>                  Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz                
>                                                                                
>   36 +-+-+---+---+---+--+---+---+---+---+---+---+---+---+--+---+---+---+-+-+   
>      |   +   +   +   +  +   +   +   +   +   +   +   +   +  +   +   +  ***  |   
>   34 +-+                                                             #*A*+-+   
>      |                                                            *A*      |   
>   32 +-+                                                          #      +-+   
>   30 +-+                                                          #      +-+   
>      |                                                           #         |   
>   28 +-+                                                        #        +-+   
>      |                                 *A*#*A*#*A*#*A*#*A*#     #          |   
>   26 +-+                   *A*#*A*#***#    ***         ******#*A*        +-+   
>      |                     #       *A*                    *A* ***          |   
>   24 +-+                  #                                              +-+   
>   22 +-+                 #                                               +-+   
>      |             #*A**A*                                                 |   
>   20 +-+       #*A*                                                      +-+   
>      |  *A*#*A*  +   +  +   +   +   +   +   +   +   +   +  +   +   +   +   |   
>   18 +-+-+---+---+---+--+---+---+---+---+---+---+---+---+--+---+---+---+-+-+   
>        v1.v1.1v1.2v1.v1.4v1.5v1.6v1.7v2.0v2.1v2.2v2.3v2.v2.5v2.6v2.7v2.8.0     
>                                   QEMU version                                 

Nice, there was someone on list complaining about 2.6 being slower for them.

>                      x86_64 NBench Floating Point Performance                  
>                   Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz               
>                                                                                
>   1.88 +-+-+---+--+---+---+---+--+---+---+---+---+--+---+---+---+--+---+-+-+   
>        |   +   +  +  *A*#*A*  +  +   +   +   +   +  +   +   +   +  +   +   |   
>   1.86 +-+           *** ***                                             +-+   
>        |            #       #   *A*#***                                    |   
>        |      *A*# #         # ##   *A*                                    |   
>   1.84 +-+    #  *A*         *A*      #                                  +-+   
>        |      #                        #                              *A*  |   
>   1.82 +-+   #                          #                            ##  +-+   
>        |     #                          *A*#                        #      |   
>    1.8 +-+  #                               #  #*A*               *A*    +-+   
>        |    #                               *A*   #                #       |   
>   1.78 +-+*A*                                      #       *A*    #      +-+   
>        |                                           #   ***#  #    #        |   
>        |                                           *A*#*A*    #  #         |   
>   1.76 +-+                                         ***         # #       +-+   
>        |   +   +  +   +   +   +  +   +   +   +   +  +   +   +  *A* +   +   |   
>   1.74 +-+-+---+--+---+---+---+--+---+---+---+---+--+---+---+---+--+---+-+-+   
>          v1.v1.v1.2v1.3v1.4v1.v1.6v1.7v2.0v2.1v2.v2.3v2.4v2.5v2.v2.7v2.8.0     
>                                    QEMU version                                

I'm assuming the dips are where QEMU fixed something and cared about corner
cases/accuracy?

Dave

> Same plots, in PNG: http://imgur.com/a/nF7Ls
> 
> These plots are obtained simply by running
> 	$ QEMU_PATH=path/to/qemu QEMU_ARCH=x86_64 make -j
> from dbt-bench, although note that some user intervention was needed
> to compile old QEMU versions.
> 
> I think having some well-defined, easy-to-run benchmarks (even
> if far from perfect, like these) to aid development is better
> than not having any. My hope is that having these will encourage
> future performance improvements to the emulation loop and TCG -- or
> at least serve as a warning when performance regresses excessively :-)
> 
> Let me know if you find this work useful.
> 
> Thanks,
> 
> 		Emilio
> 
> [1] https://bitbucket.org/simbench/simbench
> SimBench's authors have a paper on it, although it is not publicly
> available yet (will be presented at the ISPASS'17 conference in April).
> The abstract can be accessed here though: http://tinyurl.com/hahb4yj
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] Benchmarking linux-user performance
  2017-03-10 11:45 ` Dr. David Alan Gilbert
@ 2017-03-10 11:48   ` Peter Maydell
  2017-03-11  2:25     ` Emilio G. Cota
  2017-03-11  2:18   ` Emilio G. Cota
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Maydell @ 2017-03-10 11:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Emilio G. Cota, Richard Henderson, Laurent Vivier, Paolo Bonzini,
	Alex Bennée, qemu-devel

On 10 March 2017 at 12:45, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> * Emilio G. Cota (cota@braap.org) wrote:
>>                      x86_64 NBench Floating Point Performance
>>                   Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
>>
>>   1.88 +-+-+---+--+---+---+---+--+---+---+---+---+--+---+---+---+--+---+-+-+
>>        |   +   +  +  *A*#*A*  +  +   +   +   +   +  +   +   +   +  +   +   |
>>   1.86 +-+           *** ***                                             +-+
>>        |            #       #   *A*#***                                    |
>>        |      *A*# #         # ##   *A*                                    |
>>   1.84 +-+    #  *A*         *A*      #                                  +-+
>>        |      #                        #                              *A*  |
>>   1.82 +-+   #                          #                            ##  +-+
>>        |     #                          *A*#                        #      |
>>    1.8 +-+  #                               #  #*A*               *A*    +-+
>>        |    #                               *A*   #                #       |
>>   1.78 +-+*A*                                      #       *A*    #      +-+
>>        |                                           #   ***#  #    #        |
>>        |                                           *A*#*A*    #  #         |
>>   1.76 +-+                                         ***         # #       +-+
>>        |   +   +  +   +   +   +  +   +   +   +   +  +   +   +  *A* +   +   |
>>   1.74 +-+-+---+--+---+---+---+--+---+---+---+---+--+---+---+---+--+---+-+-+
>>          v1.v1.v1.2v1.3v1.4v1.v1.6v1.7v2.0v2.1v2.v2.3v2.4v2.5v2.v2.7v2.8.0
>>                                    QEMU version
>
> I'm assuming the dips are where QEMU fixed something and cared about corner
> cases/accuracy?

Given the scale on the LHS is from 1.74 to 1.88 my guess is that the
variation is in large part noise and the major thing is "our fp
performance is bounded by softfloat, which doesn't change and is
always very slow".

thanks
-- PMM


* Re: [Qemu-devel] Benchmarking linux-user performance
  2017-03-10 11:45 ` Dr. David Alan Gilbert
  2017-03-10 11:48   ` Peter Maydell
@ 2017-03-11  2:18   ` Emilio G. Cota
  2017-03-14 17:06     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 8+ messages in thread
From: Emilio G. Cota @ 2017-03-11  2:18 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Richard Henderson, Laurent Vivier, Peter Maydell, Paolo Bonzini,
	Alex Bennée, qemu-devel

On Fri, Mar 10, 2017 at 11:45:33 +0000, Dr. David Alan Gilbert wrote:
> * Emilio G. Cota (cota@braap.org) wrote:
> >   https://github.com/cota/dbt-bench
> > I'm using NBench because (1) it's just a few files and they take
> > very little time to run (~5min per QEMU version, if performance
> > on the host machine is stable), (2) AFAICT its sources are in the
> > public domain (whereas SPEC's sources cannot be redistributed),
> > and (3) with NBench I get results similar to SPEC's.
> 
> Does NBench include anything with lots of small processes, or a large
> chunk of code?  Using benchmarks with small code tends to skew DBT optimisations
> towards very heavy block optimisation that doesn't work in real applications,
> where the cost of translation can hurt if it's too high.

Yes, this is a valid point.

I haven't looked at the NBench code in detail, but I'd expect all programs
in the suite to be small and have hotspots (this is consistent with
the fact that performance doesn't change even if the TB hash table
isn't used, i.e. the loops are small enough to remain in tb_jmp_cache.)
IOW, we'd be mostly measuring the quality of the translated code,
not the translation overhead.
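
The cache behaviour described above can be illustrated with a toy model
(plain Python, loosely inspired by tb_jmp_cache; this is NOT QEMU code, and
all names and sizes are made up): a small direct-mapped lookup table indexed
by guest PC. A tight loop touches only a few distinct PCs, so after the
first iteration every lookup hits, and the main TB hash table is never needed.

```python
# Toy direct-mapped translation-block lookup cache (illustrative only).

CACHE_BITS = 5                      # tiny cache: 32 entries
CACHE_SIZE = 1 << CACHE_BITS

class ToyJmpCache:
    def __init__(self):
        self.entries = [None] * CACHE_SIZE   # slot -> cached guest pc
        self.hits = 0
        self.misses = 0

    def lookup(self, pc):
        idx = pc & (CACHE_SIZE - 1)          # direct-mapped: low bits of pc
        if self.entries[idx] == pc:          # tag check: stored pc must match
            self.hits += 1
            return pc
        # Miss: in QEMU this would fall back to the TB hash table or
        # trigger translation; here we just fill the slot.
        self.misses += 1
        self.entries[idx] = pc
        return pc

cache = ToyJmpCache()
loop_pcs = [0x1000, 0x1004, 0x1008, 0x100c]  # 4 "blocks" of a hot loop
for _ in range(1000):                        # run the loop 1000 times
    for pc in loop_pcs:
        cache.lookup(pc)

# Only the first iteration misses; everything after that hits the cache.
print(cache.misses)   # 4
print(cache.hits)     # 3996
```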

It seems that a good benchmark to take translation overhead into account
would be gcc/perlbench from SPEC (see [1]; ~20% of exec time is spent
on translation). Unfortunately, none of them can be redistributed.

I'll keep looking at other options. For instance, today I looked at using
golang's compilation tests, but they crash under qemu-user. The requirement
is to have something that is easy to build (i.e. gcc is not an option) and
that runs fast.

A hack that one can do to measure code translation as opposed to execution
is to disable caching with a 2-liner to avoid insertions to the TB hash
table and tb_jmp_cache. The problem is that then we basically just
measure code translation performance, which isn't really realistic
either.

In any case, note that most efforts I've seen to compile very good code
(with QEMU or other cross-ISA DBT) do some sort of profiling so that
only hot blocks are optimized -- see for example [1] and [2].

[1] "Characterization of Dynamic Binary Translation Overhead".
    Edson Borin and Youfeng Wu. IISWC 2009.
    http://amas-bt.cs.virginia.edu/2008proceedings/AmasBT2008.pdf#page=4

[2] "HQEMU: a multi-threaded and retargetable dynamic binary translator
    on multicores".
    Ding-Yong Hong, Chun-Chen Hsu, Pen-Chung Yew, Jan-Jan Wu, Wei-Chung Hsu
    Pangfeng Liu, Chien-Min Wang and Yeh-Ching Chung. CGO 2012.
    http://www.iis.sinica.edu.tw/papers/dyhong/18239-F.pdf
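
The hot-block approach in [1] and [2] boils down to counting executions per
block and paying the heavy optimisation cost only once a block crosses a
hotness threshold. A minimal sketch of that policy (illustrative only; the
threshold value and all names are invented, not taken from QEMU or HQEMU):

```python
# Sketch of threshold-triggered block optimisation in a tiered DBT.

HOT_THRESHOLD = 50   # hypothetical: re-optimise after 50 executions

exec_count = {}      # block pc -> times executed
optimized = set()    # blocks that have been (expensively) re-optimised

def execute_block(pc):
    """Execute one block, upgrading it to optimised code once it is hot."""
    exec_count[pc] = exec_count.get(pc, 0) + 1
    if pc not in optimized and exec_count[pc] >= HOT_THRESHOLD:
        optimized.add(pc)        # pay the heavy optimisation cost once
    return pc in optimized       # True when running the optimised version

# A hot loop block crosses the threshold; a cold one-off block never does,
# so translation effort concentrates where execution time actually goes.
for _ in range(1000):
    execute_block(0x4000)        # hot
execute_block(0x9000)            # cold

print(0x4000 in optimized)  # True
print(0x9000 in optimized)  # False
```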


> > Here are linux-user performance numbers from v1.0 to v2.8 (higher
> > is better):
> > 
> >                         x86_64 NBench Integer Performance
> >                  Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz                
> >                                                                                
> >   36 +-+-+---+---+---+--+---+---+---+---+---+---+---+---+--+---+---+---+-+-+   
> >      |   +   +   +   +  +   +   +   +   +   +   +   +   +  +   +   +  ***  |   
> >   34 +-+                                                             #*A*+-+   
> >      |                                                            *A*      |   
> >   32 +-+                                                          #      +-+   
> >   30 +-+                                                          #      +-+   
> >      |                                                           #         |   
> >   28 +-+                                                        #        +-+   
> >      |                                 *A*#*A*#*A*#*A*#*A*#     #          |   
> >   26 +-+                   *A*#*A*#***#    ***         ******#*A*        +-+   
> >      |                     #       *A*                    *A* ***          |   
> >   24 +-+                  #                                              +-+   
> >   22 +-+                 #                                               +-+   
> >      |             #*A**A*                                                 |   
> >   20 +-+       #*A*                                                      +-+   
> >      |  *A*#*A*  +   +  +   +   +   +   +   +   +   +   +  +   +   +   +   |   
> >   18 +-+-+---+---+---+--+---+---+---+---+---+---+---+---+--+---+---+---+-+-+   
> >        v1.v1.1v1.2v1.v1.4v1.5v1.6v1.7v2.0v2.1v2.2v2.3v2.v2.5v2.6v2.7v2.8.0     
> >                                   QEMU version                                 
> 
> Nice, there was someone on list complaining about 2.6 being slower for them.
> 
> >                      x86_64 NBench Floating Point Performance                  
> >                   Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz               
> >                                                                                
> >   1.88 +-+-+---+--+---+---+---+--+---+---+---+---+--+---+---+---+--+---+-+-+   
> >        |   +   +  +  *A*#*A*  +  +   +   +   +   +  +   +   +   +  +   +   |   
> >   1.86 +-+           *** ***                                             +-+   
> >        |            #       #   *A*#***                                    |   
> >        |      *A*# #         # ##   *A*                                    |   
> >   1.84 +-+    #  *A*         *A*      #                                  +-+   
> >        |      #                        #                              *A*  |   
> >   1.82 +-+   #                          #                            ##  +-+   
> >        |     #                          *A*#                        #      |   
> >    1.8 +-+  #                               #  #*A*               *A*    +-+   
> >        |    #                               *A*   #                #       |   
> >   1.78 +-+*A*                                      #       *A*    #      +-+   
> >        |                                           #   ***#  #    #        |   
> >        |                                           *A*#*A*    #  #         |   
> >   1.76 +-+                                         ***         # #       +-+   
> >        |   +   +  +   +   +   +  +   +   +   +   +  +   +   +  *A* +   +   |   
> >   1.74 +-+-+---+--+---+---+---+--+---+---+---+---+--+---+---+---+--+---+-+-+   
> >          v1.v1.v1.2v1.3v1.4v1.v1.6v1.7v2.0v2.1v2.v2.3v2.4v2.5v2.v2.7v2.8.0     
> >                                    QEMU version                                
> 
> I'm assuming the dips are where QEMU fixed something and cared about corner
> cases/accuracy?

It'd be hard to say why the numbers vary across versions without running
a profiler and git bisect. I only know the reason for v2.7, where most if not all
of the improvement is due to the removal of tb_lock() when executing
code in qemu-user thanks to the QHT work.

		E.


* Re: [Qemu-devel] Benchmarking linux-user performance
  2017-03-10 11:48   ` Peter Maydell
@ 2017-03-11  2:25     ` Emilio G. Cota
  2017-03-11 15:02       ` Peter Maydell
  0 siblings, 1 reply; 8+ messages in thread
From: Emilio G. Cota @ 2017-03-11  2:25 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Dr. David Alan Gilbert, Richard Henderson, Laurent Vivier,
	Paolo Bonzini, Alex Bennée, qemu-devel

On Fri, Mar 10, 2017 at 12:48:31 +0100, Peter Maydell wrote:
> On 10 March 2017 at 12:45, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > * Emilio G. Cota (cota@braap.org) wrote:
> >>                      x86_64 NBench Floating Point Performance
> >>                   Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
> >>
> >>   1.88 +-+-+---+--+---+---+---+--+---+---+---+---+--+---+---+---+--+---+-+-+
> >>        |   +   +  +  *A*#*A*  +  +   +   +   +   +  +   +   +   +  +   +   |
> >>   1.86 +-+           *** ***                                             +-+
> >>        |            #       #   *A*#***                                    |
> >>        |      *A*# #         # ##   *A*                                    |
> >>   1.84 +-+    #  *A*         *A*      #                                  +-+
> >>        |      #                        #                              *A*  |
> >>   1.82 +-+   #                          #                            ##  +-+
> >>        |     #                          *A*#                        #      |
> >>    1.8 +-+  #                               #  #*A*               *A*    +-+
> >>        |    #                               *A*   #                #       |
> >>   1.78 +-+*A*                                      #       *A*    #      +-+
> >>        |                                           #   ***#  #    #        |
> >>        |                                           *A*#*A*    #  #         |
> >>   1.76 +-+                                         ***         # #       +-+
> >>        |   +   +  +   +   +   +  +   +   +   +   +  +   +   +  *A* +   +   |
> >>   1.74 +-+-+---+--+---+---+---+--+---+---+---+---+--+---+---+---+--+---+-+-+
> >>          v1.v1.v1.2v1.3v1.4v1.v1.6v1.7v2.0v2.1v2.v2.3v2.4v2.5v2.v2.7v2.8.0
> >>                                    QEMU version
> >
> > I'm assuming the dips are where QEMU fixed something and cared about corner
> > cases/accuracy?
> 
> Given the scale on the LHS is from 1.74 to 1.88 my guess is that the
> variation is in large part noise and the major thing is "our fp
> performance is bounded by softfloat, which doesn't change and is
> always very slow".

It isn't "measurement noise" -- if you look at the PNGs the measurements
are very stable (all points have error bars): http://imgur.com/a/nF7Ls

It's true that performance here varies very little. This is just the
result of Amdahl's law, as you point out. (upon re-reading your message,
I see that perhaps what you meant by "noise" is exactly this.)
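
The Amdahl's-law bound is easy to make concrete: if a fraction p of run time
sits in softfloat, which no release speeds up, then speeding up everything
else by a factor s yields an overall speedup of 1 / (p + (1 - p) / s). The
fraction below is purely illustrative, not a measurement:

```python
# Amdahl's law: overall speedup when a fraction p of run time cannot
# be improved and the remaining (1 - p) is sped up by a factor s.
def amdahl_speedup(p, s):
    return 1.0 / (p + (1.0 - p) / s)

# Hypothetical: if 90% of FP run time is softfloat (p = 0.9), doubling
# the speed of everything else gains only ~5% overall, and even an
# infinite speedup of the other 10% caps the gain at 1/0.9 ~ 1.11x --
# consistent with the narrow 1.74-1.88 range in the plot.
print(round(amdahl_speedup(0.9, 2.0), 3))   # 1.053
print(round(1.0 / 0.9, 3))                  # 1.111 (limit as s -> infinity)
```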

		E.


* Re: [Qemu-devel] Benchmarking linux-user performance
  2017-03-11  2:25     ` Emilio G. Cota
@ 2017-03-11 15:02       ` Peter Maydell
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Maydell @ 2017-03-11 15:02 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: Dr. David Alan Gilbert, Richard Henderson, Laurent Vivier,
	Paolo Bonzini, Alex Bennée, qemu-devel

On 11 March 2017 at 03:25, Emilio G. Cota <cota@braap.org> wrote:
> On Fri, Mar 10, 2017 at 12:48:31 +0100, Peter Maydell wrote:
>> Given the scale on the LHS is from 1.74 to 1.88 my guess is that the
>> variation is in large part noise and the major thing is "our fp
>> performance is bounded by softfloat, which doesn't change and is
>> always very slow".
>
> It isn't "measurement noise" -- if you look at the PNGs the measurements
> are very stable (all points have error bars): http://imgur.com/a/nF7Ls
>
> It's true that performance here varies very little. This is just the
> result of Amdahl's law, as you point out. (upon re-reading your message,
> I see that perhaps what you meant by "noise" is exactly this.)

Yes, sorry, I wasn't really using the right terminology there.
I just meant that the release-to-release variation is not as
significant as it appears from the graph, because the LHS axis
scale is covering such a small range.

thanks
-- PMM


* Re: [Qemu-devel] Benchmarking linux-user performance
  2017-03-11  2:18   ` Emilio G. Cota
@ 2017-03-14 17:06     ` Dr. David Alan Gilbert
  2017-03-16 17:13       ` Emilio G. Cota
  0 siblings, 1 reply; 8+ messages in thread
From: Dr. David Alan Gilbert @ 2017-03-14 17:06 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: Richard Henderson, Laurent Vivier, Peter Maydell, Paolo Bonzini,
	Alex Bennée, qemu-devel

* Emilio G. Cota (cota@braap.org) wrote:
> On Fri, Mar 10, 2017 at 11:45:33 +0000, Dr. David Alan Gilbert wrote:
> > * Emilio G. Cota (cota@braap.org) wrote:
> > >   https://github.com/cota/dbt-bench
> > > I'm using NBench because (1) it's just a few files and they take
> > > very little time to run (~5min per QEMU version, if performance
> > > on the host machine is stable), (2) AFAICT its sources are in the
> > > public domain (whereas SPEC's sources cannot be redistributed),
> > > and (3) with NBench I get results similar to SPEC's.
> > 
> > Does NBench include anything with lots of small processes, or a large
> > chunk of code?  Using benchmarks with small code tends to skew DBT optimisations
> > towards very heavy block optimisation that doesn't work in real applications,
> > where the cost of translation can hurt if it's too high.
> 
> Yes, this is a valid point.
> 
> I haven't looked at the NBench code in detail, but I'd expect all programs
> in the suite to be small and have hotspots (this is consistent with
> the fact that performance doesn't change even if the TB hash table
> isn't used, i.e. the loops are small enough to remain in tb_jmp_cache.)
> IOW, we'd be mostly measuring the quality of the translated code,
> not the translation overhead.
> 
> It seems that a good benchmark to take translation overhead into account
> would be gcc/perlbench from SPEC (see [1]; ~20% of exec time is spent
> on translation). Unfortunately, none of them can be redistributed.
> 
> I'll keep looking at other options. For instance, today I looked at using
> golang's compilation tests, but they crash under qemu-user. The requirement
> is to have something that is easy to build (i.e. gcc is not an option) and
> that runs fast.

Yes, it needs to be self-contained but large enough to be interesting.
Isn't SPEC's perlbench just a variant of a standard free benchmark
that can be used?
(Select alternative preferred language).

> A hack that one can do to measure code translation as opposed to execution
> is to disable caching with a 2-liner to avoid insertions to the TB hash
> table and tb_jmp_cache. The problem is that then we basically just
> measure code translation performance, which isn't really realistic
> either.
> 
> In any case, note that most efforts I've seen to compile very good code
> (with QEMU or other cross-ISA DBT) do some sort of profiling so that
> only hot blocks are optimized -- see for example [1] and [2].

Right, and often there's a trade-off between an interpretation step and one
or more translation/optimisation steps, and you have to pick thresholds etc.

Dave

> [1] "Characterization of Dynamic Binary Translation Overhead".
>     Edson Borin and Youfeng Wu. IISWC 2009.
>     http://amas-bt.cs.virginia.edu/2008proceedings/AmasBT2008.pdf#page=4
> 
> [2] "HQEMU: a multi-threaded and retargetable dynamic binary translator
>     on multicores".
>     Ding-Yong Hong, Chun-Chen Hsu, Pen-Chung Yew, Jan-Jan Wu, Wei-Chung Hsu
>     Pangfeng Liu, Chien-Min Wang and Yeh-Ching Chung. CGO 2012.
>     http://www.iis.sinica.edu.tw/papers/dyhong/18239-F.pdf
> 
> 
> > > Here are linux-user performance numbers from v1.0 to v2.8 (higher
> > > is better):
> > > 
> > >                         x86_64 NBench Integer Performance
> > >                  Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz                
> > >                                                                                
> > >   36 +-+-+---+---+---+--+---+---+---+---+---+---+---+---+--+---+---+---+-+-+   
> > >      |   +   +   +   +  +   +   +   +   +   +   +   +   +  +   +   +  ***  |   
> > >   34 +-+                                                             #*A*+-+   
> > >      |                                                            *A*      |   
> > >   32 +-+                                                          #      +-+   
> > >   30 +-+                                                          #      +-+   
> > >      |                                                           #         |   
> > >   28 +-+                                                        #        +-+   
> > >      |                                 *A*#*A*#*A*#*A*#*A*#     #          |   
> > >   26 +-+                   *A*#*A*#***#    ***         ******#*A*        +-+   
> > >      |                     #       *A*                    *A* ***          |   
> > >   24 +-+                  #                                              +-+   
> > >   22 +-+                 #                                               +-+   
> > >      |             #*A**A*                                                 |   
> > >   20 +-+       #*A*                                                      +-+   
> > >      |  *A*#*A*  +   +  +   +   +   +   +   +   +   +   +  +   +   +   +   |   
> > >   18 +-+-+---+---+---+--+---+---+---+---+---+---+---+---+--+---+---+---+-+-+   
> > >        v1.v1.1v1.2v1.v1.4v1.5v1.6v1.7v2.0v2.1v2.2v2.3v2.v2.5v2.6v2.7v2.8.0     
> > >                                   QEMU version                                 
> > 
> > Nice, there was someone on list complaining about 2.6 being slower for them.
> > 
> > >                      x86_64 NBench Floating Point Performance                  
> > >                   Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz               
> > >                                                                                
> > >   1.88 +-+-+---+--+---+---+---+--+---+---+---+---+--+---+---+---+--+---+-+-+   
> > >        |   +   +  +  *A*#*A*  +  +   +   +   +   +  +   +   +   +  +   +   |   
> > >   1.86 +-+           *** ***                                             +-+   
> > >        |            #       #   *A*#***                                    |   
> > >        |      *A*# #         # ##   *A*                                    |   
> > >   1.84 +-+    #  *A*         *A*      #                                  +-+   
> > >        |      #                        #                              *A*  |   
> > >   1.82 +-+   #                          #                            ##  +-+   
> > >        |     #                          *A*#                        #      |   
> > >    1.8 +-+  #                               #  #*A*               *A*    +-+   
> > >        |    #                               *A*   #                #       |   
> > >   1.78 +-+*A*                                      #       *A*    #      +-+   
> > >        |                                           #   ***#  #    #        |   
> > >        |                                           *A*#*A*    #  #         |   
> > >   1.76 +-+                                         ***         # #       +-+   
> > >        |   +   +  +   +   +   +  +   +   +   +   +  +   +   +  *A* +   +   |   
> > >   1.74 +-+-+---+--+---+---+---+--+---+---+---+---+--+---+---+---+--+---+-+-+   
> > >          v1.v1.v1.2v1.3v1.4v1.v1.6v1.7v2.0v2.1v2.v2.3v2.4v2.5v2.v2.7v2.8.0     
> > >                                    QEMU version                                
> > 
> > I'm assuming the dips are where QEMU fixed something and cared about corner
> > cases/accuracy?
> 
> It'd be hard to say why the numbers vary across versions without running
> a profiler and git bisect. I only know the reason for v2.7, where most if not all
> of the improvement is due to the removal of tb_lock() when executing
> code in qemu-user thanks to the QHT work.
> 
> 		E.
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] Benchmarking linux-user performance
  2017-03-14 17:06     ` Dr. David Alan Gilbert
@ 2017-03-16 17:13       ` Emilio G. Cota
  0 siblings, 0 replies; 8+ messages in thread
From: Emilio G. Cota @ 2017-03-16 17:13 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Richard Henderson, Laurent Vivier, Peter Maydell, Paolo Bonzini,
	Alex Bennée, qemu-devel

On Tue, Mar 14, 2017 at 17:06:57 +0000, Dr. David Alan Gilbert wrote:
> * Emilio G. Cota (cota@braap.org) wrote:
> > It seems that a good benchmark to take translation overhead into account
> > would be gcc/perlbench from SPEC (see [1]; ~20% of exec time is spent
> > on translation). Unfortunately, none of them can be redistributed.
> > 
> > I'll consider other options. For instance, I looked today at using golang's
> > compilation tests, but they crash under qemu-user. I'll keep looking
> > at other options -- the requirement is to have something that is easy
> > to build (i.e. gcc is not an option) and that runs fast.
> 
> Yes, needs to be self contained but large enough to be interesting.
> Isn't SPECs perlbench just a variant of a standard free benchmark
> that can be used?
> (Select alternative preferred language).

SPEC takes an old Perl distribution and a few standard Perl benchmarks.
These sources (with SPEC's modifications) are of course redistributable.
However, SPEC also adds scripts that are proprietary.

What I've ended up doing is selecting a small subset of the tests in the
Perl distribution with a profile under QEMU similar to that of
SPEC's perlbench (see patch below). This requires building (and testing)
Perl, which takes a few minutes on a modern machine (ouch) but fortunately
it is only done once. After that, the tests themselves take only a
few seconds.

The bummer is that cross-compiling the Perl distro is not officially
supported. Still, at least we now have an easy-to-run "compiler-like"
benchmark, if only for the host's ISA.

I updated the README with profile data -- I'm pasting that update below.
Grab the changes from https://github.com/cota/dbt-bench

Here are the numbers for the Perl benchmark, from QEMU v1.7 -> v2.8.
The Y axis is Execution Time in seconds, so lower is better:

                       x86_64 Perl Compilation Performance                     
                 Host: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz                
                                                                               
   10 +-+---+------+-----+-----+-----+------+-----+----***----+------+---+-+   
      |     +      +     +     +     +      +     +     *     +      +     |   
  9.8 +-+                                              #A                +-+   
      |                                          *** ## *#                 |   
  9.6 +-+                                         *##  ***#              +-+   
  9.4 +-+                                         A        #             +-+   
      |                                          #*         #***           |   
  9.2 +-+                                       #***         #*          +-+   
      |                                        #              A##          |   
    9 +-+  ***          ***         ***        #              *  #       +-+   
      |     A#####***    *    ***    *     ***#              ***  #        |   
  8.8 +-+   *     #*  ###A#####A#####*      *#                     #***  +-+   
  8.6 +-+  ***     A##   *     *     A######A                        *   +-+   
      |           ***   ***   ***    *     ***                       A     |   
  8.4 +-+                            *                               *   +-+   
      |     +      +     +     +    ***     +     +     +     +     ***    |   
  8.2 +-+---+------+-----+-----+-----+------+-----+-----+-----+------+---+-+   
         v1.7.0 v2.0.0v2.1.0v2.2.0v2.3.0 v2.4.0v2.5.0v2.6.0v2.7.0 v2.8.0       
                                  QEMU version 
PNGs for Perl + NBench here: http://imgur.com/a/LlpxE
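
As a side note, the "hotspot" vs "translation-bound" distinction that the
profiles in the README update illustrate can be quantified with a short
script that buckets perf-report symbols by category. A minimal sketch --
the symbol groupings below are my own illustrative choice, not part of
dbt-bench or perf:

```python
import re

# Illustrative buckets: symbols spending time translating guest code
# (TCG) vs symbols executing guest code (softfloat, helpers, the
# generated-code regions that perf attributes to perf-*.map files).
TRANSLATION = ("tcg_", "liveness_pass", "disas_insn", "tb_cmp")
EXECUTION = ("float64_", "roundAndPackFloat64", "subFloat64Sigs",
             "addFloat64Sigs", "helper_", "perf-")

def bucket_overheads(perf_report: str) -> dict:
    """Sum perf-report 'Overhead' percentages per bucket."""
    totals = {"translation": 0.0, "execution": 0.0, "other": 0.0}
    for line in perf_report.splitlines():
        # Matches lines like: "  6.26%  qemu-x86_64  qemu-x86_64  [.] float64_mul"
        m = re.match(r"\s*([\d.]+)%\s+\S+\s+\S+\s+\[[.k]\]\s+(\S+)", line)
        if not m:
            continue
        pct, sym = float(m.group(1)), m.group(2)
        if any(sym.startswith(p) for p in TRANSLATION):
            totals["translation"] += pct
        elif any(sym.startswith(p) for p in EXECUTION):
            totals["execution"] += pct
        else:
            totals["other"] += pct
    return totals

# Two-line excerpt in the format of the profiles in this thread:
nbench = """\
     6.26%  qemu-x86_64   qemu-x86_64          [.] float64_mul
     5.69%  qemu-x86_64   qemu-x86_64          [.] tcg_gen_code
"""
print(bucket_overheads(nbench))
```

Fed the full NBench and Perl profiles, the "execution" bucket dominates the
former and the "translation" bucket the latter, matching the README's
characterization.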

Thanks,

		Emilio

commit f4ca2537bffe544779aa3f1814cec9d66dd9a17e
Author: Emilio G. Cota <cota@braap.org>
Date:   Thu Mar 16 12:48:44 2017 -0400

    README: document and quantify the difference between NBench and Perl
    
    While at it, also show how Perl's perf is very similar to SPEC06's perlbench.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>

diff --git a/README.md b/README.md
index b6d4037..b4578d6 100644
--- a/README.md
+++ b/README.md
@@ -61,3 +61,111 @@ Other output formats are possible, see `Makefile`.
   valuable files that were never meant to be committed (e.g. scripts). For
   this reason it is best to just clone a fresh QEMU repo to be used with
   DBT-bench rather than using your development tree.
+
+## What is the difference between the benchmarks?
+
+NBench programs are small, with execution time dominated by small code loops. Thus,
+when run under a DBT engine, the resulting performance depends almost entirely
+on the quality of the output code.
+
+The Perl benchmarks compile Perl code. As is common for compilation workloads,
+they execute large amounts of code and show no particular code execution
+hotspots. Thus, the resulting DBT performance depends largely on code
+translation speed.
+
+Quantitatively, the differences can be clearly seen under a profiler. For QEMU
+v2.8.0, we get:
+
+* NBench:
+
+```
+# Samples: 1M of event 'cycles:pp'
+# Event count (approx.): 1111661663176
+#
+# Overhead  Command       Shared Object        Symbol
+# ........  ............  ...................  .........................................
+#
+     6.26%  qemu-x86_64   qemu-x86_64          [.] float64_mul
+     6.24%  qemu-x86_64   qemu-x86_64          [.] roundAndPackFloat64
+     4.18%  qemu-x86_64   qemu-x86_64          [.] subFloat64Sigs
+     2.72%  qemu-x86_64   qemu-x86_64          [.] addFloat64Sigs
+     2.29%  qemu-x86_64   qemu-x86_64          [.] cpu_exec
+     1.29%  qemu-x86_64   qemu-x86_64          [.] float64_add
+     1.12%  qemu-x86_64   qemu-x86_64          [.] float64_sub
+     0.79%  qemu-x86_64   qemu-x86_64          [.] object_class_dynamic_cast_assert
+     0.71%  qemu-x86_64   qemu-x86_64          [.] helper_mulsd
+     0.66%  qemu-x86_64   perf-23090.map       [.] 0x000055afd37d0b8a
+     0.64%  qemu-x86_64   perf-23090.map       [.] 0x000055afd377cd8f
+     0.59%  qemu-x86_64   perf-23090.map       [.] 0x000055afd37d019a
+     [...]
+```
+
+* Perl:
+
+```
+# Samples: 90K of event 'cycles:pp'
+# Event count (approx.): 97757063053
+#
+# Overhead  Command       Shared Object            Symbol
+# ........  ............  .......................  ...........................................
+#
+   22.93%  qemu-x86_64   [kernel.kallsyms]        [k] isolate_freepages_block
+    9.38%  qemu-x86_64   qemu-x86_64              [.] cpu_exec
+    5.69%  qemu-x86_64   qemu-x86_64              [.] tcg_gen_code
+    5.30%  qemu-x86_64   qemu-x86_64              [.] tcg_optimize
+    3.45%  qemu-x86_64   qemu-x86_64              [.] liveness_pass_1
+    3.24%  qemu-x86_64   [kernel.kallsyms]        [k] isolate_migratepages_block
+    2.39%  qemu-x86_64   qemu-x86_64              [.] object_class_dynamic_cast_assert
+    1.48%  qemu-x86_64   [kernel.kallsyms]        [k] unlock_page
+    1.29%  qemu-x86_64   [kernel.kallsyms]        [k] pageblock_pfn_to_page
+    1.29%  qemu-x86_64   qemu-x86_64              [.] tcg_out_opc.isra.13
+    1.11%  qemu-x86_64   qemu-x86_64              [.] tcg_gen_op2
+    0.98%  qemu-x86_64   [kernel.kallsyms]        [k] migrate_pages
+    0.87%  qemu-x86_64   qemu-x86_64              [.] qht_lookup
+    0.83%  qemu-x86_64   qemu-x86_64              [.] tcg_temp_new_internal
+    0.77%  qemu-x86_64   qemu-x86_64              [.] tcg_out_modrm_sib_offset.constprop.37
+    0.76%  qemu-x86_64   qemu-x86_64              [.] disas_insn.isra.49
+    0.70%  qemu-x86_64   [kernel.kallsyms]        [k] __wake_up_bit
+    0.55%  qemu-x86_64   [kernel.kallsyms]        [k] __reset_isolation_suitable
+    0.47%  qemu-x86_64   qemu-x86_64              [.] tcg_opt_gen_mov
+    [...]
+```
+
+### Why don't you just run SPEC06?
+
+SPEC's source code cannot be redistributed. Some of its benchmarks are based
+on free software, but the SPEC authors added non-free code (usually scripts)
+on top, which cannot be redistributed.
+
+For this reason we use here benchmarks that are freely redistributable,
+while capturing different performance profiles: NBench represents "hotspot
+code" and Perl represents a typical "compiler" workload. In fact, Perl's
+performance profile under QEMU is very similar to that of SPEC06's perlbench;
+compare Perl's profile above with SPEC06 perlbench's below:
+
+```
+# Samples: 14K of event 'cycles:pp'
+# Event count (approx.): 15657871399
+#
+# Overhead  Command      Shared Object            Symbol
+# ........  ...........  .......................  ...........................................
+#
+   16.93%  qemu-x86_64  qemu-x86_64              [.] cpu_exec
+    9.16%  qemu-x86_64  [kernel.kallsyms]        [k] isolate_freepages_block
+    5.47%  qemu-x86_64  qemu-x86_64              [.] tcg_gen_code
+    4.82%  qemu-x86_64  qemu-x86_64              [.] tcg_optimize
+    4.15%  qemu-x86_64  qemu-x86_64              [.] object_class_dynamic_cast_assert
+    3.25%  qemu-x86_64  qemu-x86_64              [.] liveness_pass_1
+    1.55%  qemu-x86_64  qemu-x86_64              [.] qht_lookup
+    1.23%  qemu-x86_64  qemu-x86_64              [.] tcg_gen_op2
+    1.04%  qemu-x86_64  [kernel.kallsyms]        [k] copy_page
+    1.00%  qemu-x86_64  qemu-x86_64              [.] tcg_out_opc.isra.13
+    0.82%  qemu-x86_64  qemu-x86_64              [.] tcg_temp_new_internal
+    0.78%  qemu-x86_64  qemu-x86_64              [.] tcg_out_modrm_sib_offset.constprop.37
+    0.72%  qemu-x86_64  qemu-x86_64              [.] tb_cmp
+    0.69%  qemu-x86_64  [kernel.kallsyms]        [k] isolate_migratepages_block
+    0.67%  qemu-x86_64  qemu-x86_64              [.] disas_insn.isra.49
+    0.53%  qemu-x86_64  qemu-x86_64              [.] object_get_class
+    0.52%  qemu-x86_64  [kernel.kallsyms]        [k] __wake_up_bit
+    [...]
+```
--
2.7.4


end of thread, other threads:[~2017-03-16 17:13 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-10  1:23 [Qemu-devel] Benchmarking linux-user performance Emilio G. Cota
2017-03-10 11:45 ` Dr. David Alan Gilbert
2017-03-10 11:48   ` Peter Maydell
2017-03-11  2:25     ` Emilio G. Cota
2017-03-11 15:02       ` Peter Maydell
2017-03-11  2:18   ` Emilio G. Cota
2017-03-14 17:06     ` Dr. David Alan Gilbert
2017-03-16 17:13       ` Emilio G. Cota
