* Re: [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp
@ 2015-03-29 21:52 Richard Henderson
  2015-03-30  5:33 ` Stefan Weil
  2015-03-30  5:43 ` Stefan Weil
  0 siblings, 2 replies; 7+ messages in thread
From: Richard Henderson @ 2015-03-29 21:52 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-trivial, Stefan Weil, Alex Bennée, qemu-devel

On Mar 27, 2015 14:09, "Emilio G. Cota" <cota@braap.org> wrote:
>
> On Fri, Mar 27, 2015 at 09:55:03 +0000, Alex Bennée wrote: 
> > Have you been able to measure any performance improvement with these new 
> > structures? In theory, if aligned with cache lines, performance should 
> > improve but real numbers would be nice. 
>
> I haven't benchmarked anything, which makes me very uneasy. All 
> I've checked is that the system boots, and FWIW I appreciate no 
> difference in boot time.

No decrease in boot time is good. We /know/ we're saving memory, after all.
 
>
> Is there a benchmark suite to test TCG changes? 

No, sorry. 


r~

* Re: [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp
  2015-03-29 21:52 [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp Richard Henderson
@ 2015-03-30  5:33 ` Stefan Weil
  2015-03-30  5:43 ` Stefan Weil
  1 sibling, 0 replies; 7+ messages in thread
From: Stefan Weil @ 2015-03-30  5:33 UTC (permalink / raw)
  To: Richard Henderson, Emilio G. Cota
  Cc: qemu-trivial, Alex Bennée, qemu-devel

On 29.03.2015 at 23:52, Richard Henderson wrote:
> On Mar 27, 2015 14:09, "Emilio G. Cota" <cota@braap.org> wrote:
>> On Fri, Mar 27, 2015 at 09:55:03 +0000, Alex Bennée wrote:
>>> Have you been able to measure any performance improvement with these new
>>> structures? In theory, if aligned with cache lines, performance should
>>> improve but real numbers would be nice.
>> I haven't benchmarked anything, which makes me very uneasy. All
>> I've checked is that the system boots, and FWIW I appreciate no
>> difference in boot time.
> No decrease in boot time is good. We /know/ we're saving memory, after all.
>   
>> Is there a benchmark suite to test TCG changes?
> No, sorry.
>
>
> r~

Benchmarking TCG with QEMU's system emulation is nearly impossible
because operating systems usually contain lots of timer-based operations.
The TCG interpreter, for example, is really slow, but a BIOS will boot
faster than expected with it.

User mode emulation is much better for benchmarks.
Run a command-line Linux application which mainly does
computations (not file I/O) using user mode emulation on Linux.
The OpenSSL package contains bntest, which can be used
as a benchmark for TCG. Redirect all output to /dev/null when
you run it.

Binaries for i386 and x86_64 are available from
http://qemu.weilnetz.de/test/user/.

Stefan

* Re: [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp
  2015-03-29 21:52 [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp Richard Henderson
  2015-03-30  5:33 ` Stefan Weil
@ 2015-03-30  5:43 ` Stefan Weil
  2015-04-03  0:07   ` [Qemu-devel] [PATCH v2] " Emilio G. Cota
  1 sibling, 1 reply; 7+ messages in thread
From: Stefan Weil @ 2015-03-30  5:43 UTC (permalink / raw)
  To: Richard Henderson, Emilio G. Cota
  Cc: qemu-trivial, Alex Bennée, qemu-devel

On 29.03.2015 at 23:52, Richard Henderson wrote:
> No decrease in boot time is good. We /know/ we're saving memory, after all.

Well, I would not mind a decrease in boot time, either.
The more it decreases, the better. :-)

To be honest: in my version I only used 1-bit bit-field entries for
boolean values, but 8-bit values (aligned on byte boundaries)
for other values because, as far as I know, most (all?) CPU
architectures need more time to extract some bits from
a machine word than to extract a byte.

I have no idea whether this makes a difference in performance
as I did not run any runtime benchmark.
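
The difference between the two styles can be sketched in a few lines of C.
This is only an illustration under assumptions: the struct names, field
names and bit positions below are invented for the example and are not
taken from QEMU, and whether byte access is really cheaper depends on the
compiler and the CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout A: every field is a tightly packed bit-field.
 * Reading 'kind' usually compiles to a load plus shift and mask. */
struct packed_bits {
    unsigned int reg:6;
    unsigned int kind:2;
    unsigned int flag:1;
};

/* Hypothetical layout B: multi-valued fields get a whole byte each,
 * so they can be read with a plain byte load; only the boolean
 * stays a 1-bit bit-field. */
struct byte_fields {
    uint8_t reg;
    uint8_t kind;
    unsigned int flag:1;
};

/* Hand-written equivalent of a 2-bit extraction, assuming (for
 * illustration only) that the field sits in bits 6..7 of a word. */
static inline unsigned extract_kind(uint32_t word)
{
    return (word >> 6) & 0x3u;
}

/* Hand-written equivalent of storing into that 2-bit field. */
static inline uint32_t deposit_kind(uint32_t word, unsigned kind)
{
    assert(kind <= 0x3u);   /* the value must fit in two bits */
    return (word & ~(UINT32_C(0x3) << 6)) | ((uint32_t)kind << 6);
}

/* Both layouts hold the same values; only the access cost may differ. */
static inline int layouts_agree(void)
{
    struct packed_bits a = { .reg = 5, .kind = 2, .flag = 1 };
    struct byte_fields b = { .reg = 5, .kind = 2, .flag = 1 };
    return a.reg == b.reg && a.kind == b.kind && a.flag == b.flag;
}
```

Comparing the compiler output for reads of packed_bits.kind and
byte_fields.kind (for example with -S) is one way to see whether the
extra shift and mask actually appears on a given host.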

Stefan

* [Qemu-devel] [PATCH v2] tcg: optimise memory layout of TCGTemp
  2015-03-30  5:43 ` Stefan Weil
@ 2015-04-03  0:07   ` Emilio G. Cota
  2015-04-03  8:13     ` Stefan Weil
                       ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Emilio G. Cota @ 2015-04-03  0:07 UTC (permalink / raw)
  To: Stefan Weil
  Cc: qemu-trivial, Laurent Desnogues, Alex Bennée, qemu-devel,
	Richard Henderson

This brings down the size of the struct from 56 to 32 bytes on 64-bit,
and to 20 bytes on 32-bit. This leads to memory savings:

Before:
$ find . -name 'tcg.o' | xargs size
   text    data     bss     dec     hex filename
  41131   29800      88   71019   1156b ./aarch64-softmmu/tcg/tcg.o
  37969   29416      96   67481   10799 ./x86_64-linux-user/tcg/tcg.o
  39354   28816      96   68266   10aaa ./arm-linux-user/tcg/tcg.o
  40802   29096      88   69986   11162 ./arm-softmmu/tcg/tcg.o
  39417   29672      88   69177   10e39 ./x86_64-softmmu/tcg/tcg.o

After:
$ find . -name 'tcg.o' | xargs size
   text    data     bss     dec     hex filename
  40883   29800      88   70771   11473 ./aarch64-softmmu/tcg/tcg.o
  37473   29416      96   66985   105a9 ./x86_64-linux-user/tcg/tcg.o
  38858   28816      96   67770   108ba ./arm-linux-user/tcg/tcg.o
  40554   29096      88   69738   1106a ./arm-softmmu/tcg/tcg.o
  39169   29672      88   68929   10d41 ./x86_64-softmmu/tcg/tcg.o

Note that using an entire byte for some enums that need less than
that wastes a few bits (noticeable in 32 bits, where we use
20 bytes instead of 16) but avoids extraction code, which overall
is a win--I've tested several variations of the patch, and the appended
is the best performer for OpenSSL's bntest by a very small margin:

Before:
$ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
[...]
 Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):

      10538.479833 task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.38% )
               772 context-switches          #    0.073 K/sec                    ( +-  2.03% )
                 0 cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
             2,207 page-faults               #    0.209 K/sec                    ( +-  0.08% )
      10.552871687 seconds time elapsed                                          ( +-  0.39% )

After:
$ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
 Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):

      10459.968847 task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.30% )
               739 context-switches          #    0.071 K/sec                    ( +-  1.71% )
                 0 cpu-migrations            #    0.000 K/sec                    ( +- 68.14% )
             2,204 page-faults               #    0.211 K/sec                    ( +-  0.10% )
      10.473900411 seconds time elapsed                                          ( +-  0.30% )

Suggested-by: Stefan Weil <sw@weilnetz.de>
Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index add7f75..7f95132 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -417,20 +417,19 @@ static inline TCGCond tcg_high_cond(TCGCond c)
     }
 }
 
-#define TEMP_VAL_DEAD  0
-#define TEMP_VAL_REG   1
-#define TEMP_VAL_MEM   2
-#define TEMP_VAL_CONST 3
+typedef enum TCGTempVal {
+    TEMP_VAL_DEAD,
+    TEMP_VAL_REG,
+    TEMP_VAL_MEM,
+    TEMP_VAL_CONST,
+} TCGTempVal;
 
-/* XXX: optimize memory layout */
 typedef struct TCGTemp {
-    TCGType base_type;
-    TCGType type;
-    int val_type;
-    int reg;
-    tcg_target_long val;
-    int mem_reg;
-    intptr_t mem_offset;
+    unsigned int reg:8;
+    unsigned int mem_reg:8;
+    TCGTempVal val_type:8;
+    TCGType base_type:8;
+    TCGType type:8;
     unsigned int fixed_reg:1;
     unsigned int mem_coherent:1;
     unsigned int mem_allocated:1;
@@ -438,6 +437,9 @@ typedef struct TCGTemp {
                                   basic blocks. Otherwise, it is not
                                   preserved across basic blocks. */
     unsigned int temp_allocated:1; /* never used for code gen */
+
+    tcg_target_long val;
+    intptr_t mem_offset;
     const char *name;
 } TCGTemp;
 
-- 
1.9.1
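
The struct sizes quoted in the commit message can be checked with a
stand-alone sketch like the one below. It is not the QEMU header: the
struct names are made up, intptr_t stands in for tcg_target_long, plain
int / unsigned int stand in for the TCGType and TCGTempVal enums, and the
temp_local flag (which falls between the two hunks of the diff) is assumed
to be a fifth 1-bit field. Bit-field layout is implementation-defined, so
the numbers are only expected to match on common GCC/Clang ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout before the patch, with stand-in types (see assumptions above). */
struct temp_before {
    int base_type;                 /* stands in for TCGType */
    int type;                      /* stands in for TCGType */
    int val_type;
    int reg;
    intptr_t val;                  /* stands in for tcg_target_long */
    int mem_reg;
    intptr_t mem_offset;
    unsigned int fixed_reg:1;
    unsigned int mem_coherent:1;
    unsigned int mem_allocated:1;
    unsigned int temp_local:1;     /* assumed, not visible in the hunks */
    unsigned int temp_allocated:1;
    const char *name;
};

/* Layout after the patch: the small fields are grouped first as
 * byte-wide bit-fields, the pointer-sized members follow. */
struct temp_after {
    unsigned int reg:8;
    unsigned int mem_reg:8;
    unsigned int val_type:8;       /* TCGTempVal in the patch */
    unsigned int base_type:8;      /* TCGType in the patch */
    unsigned int type:8;           /* TCGType in the patch */
    unsigned int fixed_reg:1;
    unsigned int mem_coherent:1;
    unsigned int mem_allocated:1;
    unsigned int temp_local:1;     /* assumed, not visible in the hunks */
    unsigned int temp_allocated:1;
    intptr_t val;                  /* stands in for tcg_target_long */
    intptr_t mem_offset;
    const char *name;
};

static inline size_t size_before(void) { return sizeof(struct temp_before); }
static inline size_t size_after(void)  { return sizeof(struct temp_after); }

/* Bytes saved per temp by the reordering on the current host. */
static inline size_t bytes_saved(void)
{
    assert(size_before() >= size_after());
    return size_before() - size_after();
}
```

Most of the saving in this sketch comes from removing alignment padding
around the pointer-sized members rather than from the narrower fields
themselves, which is why simply reordering members already helps.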

* Re: [Qemu-devel] [PATCH v2] tcg: optimise memory layout of TCGTemp
  2015-04-03  0:07   ` [Qemu-devel] [PATCH v2] " Emilio G. Cota
@ 2015-04-03  8:13     ` Stefan Weil
  2015-04-03 14:17     ` Richard Henderson
  2015-04-07 14:59     ` Alex Bennée
  2 siblings, 0 replies; 7+ messages in thread
From: Stefan Weil @ 2015-04-03  8:13 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: qemu-trivial, Laurent Desnogues, Alex Bennée, qemu-devel,
	Richard Henderson

On 03.04.2015 at 02:07, Emilio G. Cota wrote:
> This brings down the size of the struct from 56 to 32 bytes on 64-bit,
> and to 20 bytes on 32-bit. This leads to memory savings:
>
> Before:
> $ find . -name 'tcg.o' | xargs size
>     text    data     bss     dec     hex filename
>    41131   29800      88   71019   1156b ./aarch64-softmmu/tcg/tcg.o
>    37969   29416      96   67481   10799 ./x86_64-linux-user/tcg/tcg.o
>    39354   28816      96   68266   10aaa ./arm-linux-user/tcg/tcg.o
>    40802   29096      88   69986   11162 ./arm-softmmu/tcg/tcg.o
>    39417   29672      88   69177   10e39 ./x86_64-softmmu/tcg/tcg.o
>
> After:
> $ find . -name 'tcg.o' | xargs size
>     text    data     bss     dec     hex filename
>    40883   29800      88   70771   11473 ./aarch64-softmmu/tcg/tcg.o
>    37473   29416      96   66985   105a9 ./x86_64-linux-user/tcg/tcg.o
>    38858   28816      96   67770   108ba ./arm-linux-user/tcg/tcg.o
>    40554   29096      88   69738   1106a ./arm-softmmu/tcg/tcg.o
>    39169   29672      88   68929   10d41 ./x86_64-softmmu/tcg/tcg.o
>
> Note that using an entire byte for some enums that need less than
> that wastes a few bits (noticeable in 32 bits, where we use
> 20 bytes instead of 16) but avoids extraction code, which overall
> is a win--I've tested several variations of the patch, and the appended
> is the best performer for OpenSSL's bntest by a very small margin:
>
> Before:
> $ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
> [...]
>   Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
>
>        10538.479833 task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.38% )
>                 772 context-switches          #    0.073 K/sec                    ( +-  2.03% )
>                   0 cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
>               2,207 page-faults               #    0.209 K/sec                    ( +-  0.08% )
>        10.552871687 seconds time elapsed                                          ( +-  0.39% )
>
> After:
> $ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
>   Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
>
>        10459.968847 task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.30% )
>                 739 context-switches          #    0.071 K/sec                    ( +-  1.71% )
>                   0 cpu-migrations            #    0.000 K/sec                    ( +- 68.14% )
>               2,204 page-faults               #    0.211 K/sec                    ( +-  0.10% )
>        10.473900411 seconds time elapsed                                          ( +-  0.30% )
>
> Suggested-by: Stefan Weil <sw@weilnetz.de>
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>   tcg/tcg.h | 26 ++++++++++++++------------
>   1 file changed, 14 insertions(+), 12 deletions(-)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index add7f75..7f95132 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -417,20 +417,19 @@ static inline TCGCond tcg_high_cond(TCGCond c)
>       }
>   }
>   
> -#define TEMP_VAL_DEAD  0
> -#define TEMP_VAL_REG   1
> -#define TEMP_VAL_MEM   2
> -#define TEMP_VAL_CONST 3
> +typedef enum TCGTempVal {
> +    TEMP_VAL_DEAD,
> +    TEMP_VAL_REG,
> +    TEMP_VAL_MEM,
> +    TEMP_VAL_CONST,
> +} TCGTempVal;
>   
> -/* XXX: optimize memory layout */
>   typedef struct TCGTemp {
> -    TCGType base_type;
> -    TCGType type;
> -    int val_type;
> -    int reg;
> -    tcg_target_long val;
> -    int mem_reg;
> -    intptr_t mem_offset;
> +    unsigned int reg:8;
> +    unsigned int mem_reg:8;
> +    TCGTempVal val_type:8;
> +    TCGType base_type:8;
> +    TCGType type:8;
>       unsigned int fixed_reg:1;
>       unsigned int mem_coherent:1;
>       unsigned int mem_allocated:1;
> @@ -438,6 +437,9 @@ typedef struct TCGTemp {
>                                     basic blocks. Otherwise, it is not
>                                     preserved across basic blocks. */
>       unsigned int temp_allocated:1; /* never used for code gen */
> +
> +    tcg_target_long val;
> +    intptr_t mem_offset;
>       const char *name;
>   } TCGTemp;

Thanks for doing those tests. There are some minor cosmetic changes
which could be made, too (uint8_t instead of unsigned int for the 8-bit
fields, bool for the boolean bit values), but I think your patch is a real gain.
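
As a rough sketch of what that cosmetic variant could look like, the code
below puts the two forms side by side. It is an assumption-laden
illustration, not a proposed patch: the struct names are invented,
intptr_t stands in for tcg_target_long, the enum-typed fields are modelled
as plain integers, and temp_local is assumed to be one of the five flags.
On common GCC/Clang ABIs both forms would be expected to have the same
size, but bit-field layout is implementation-defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Form used in the patch: 8-bit and 1-bit unsigned int bit-fields. */
struct temp_bitfields {
    unsigned int reg:8;
    unsigned int mem_reg:8;
    unsigned int val_type:8;
    unsigned int base_type:8;
    unsigned int type:8;
    unsigned int fixed_reg:1;
    unsigned int mem_coherent:1;
    unsigned int mem_allocated:1;
    unsigned int temp_local:1;     /* assumed fifth flag */
    unsigned int temp_allocated:1;
    intptr_t val;                  /* stands in for tcg_target_long */
    intptr_t mem_offset;
    const char *name;
};

/* Cosmetic variant: uint8_t for byte-wide fields, bool for the flags. */
struct temp_typed {
    uint8_t reg;
    uint8_t mem_reg;
    uint8_t val_type;
    uint8_t base_type;
    uint8_t type;
    bool fixed_reg:1;
    bool mem_coherent:1;
    bool mem_allocated:1;
    bool temp_local:1;             /* assumed fifth flag */
    bool temp_allocated:1;
    intptr_t val;                  /* stands in for tcg_target_long */
    intptr_t mem_offset;
    const char *name;
};

/* Store and read back a few values through the variant layout. */
static inline bool typed_roundtrip(void)
{
    struct temp_typed t = { 0 };
    t.reg = 200;
    t.val_type = 3;
    t.temp_local = true;
    return t.reg == 200 && t.val_type == 3 && t.temp_local && !t.fixed_reg;
}

/* True when the variant is no larger than the bit-field form. */
static inline bool typed_not_larger(void)
{
    return sizeof(struct temp_typed) <= sizeof(struct temp_bitfields);
}
```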

Reviewed-by: Stefan Weil <sw@weilnetz.de>

* Re: [Qemu-devel] [PATCH v2] tcg: optimise memory layout of TCGTemp
  2015-04-03  0:07   ` [Qemu-devel] [PATCH v2] " Emilio G. Cota
  2015-04-03  8:13     ` Stefan Weil
@ 2015-04-03 14:17     ` Richard Henderson
  2015-04-07 14:59     ` Alex Bennée
  2 siblings, 0 replies; 7+ messages in thread
From: Richard Henderson @ 2015-04-03 14:17 UTC (permalink / raw)
  To: Emilio G. Cota, Stefan Weil
  Cc: qemu-trivial, Laurent Desnogues, Alex Bennée, qemu-devel

On 04/02/2015 05:07 PM, Emilio G. Cota wrote:
> After:
> $ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
>  Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
> 
>       10459.968847 task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.30% )
>                739 context-switches          #    0.071 K/sec                    ( +-  1.71% )
>                  0 cpu-migrations            #    0.000 K/sec                    ( +- 68.14% )
>              2,204 page-faults               #    0.211 K/sec                    ( +-  0.10% )
>       10.473900411 seconds time elapsed                                          ( +-  0.30% )
> 
> Suggested-by: Stefan Weil <sw@weilnetz.de>
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  tcg/tcg.h | 26 ++++++++++++++------------
>  1 file changed, 14 insertions(+), 12 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>

I'll put this in a queue for 2.4.


r~

* Re: [Qemu-devel] [PATCH v2] tcg: optimise memory layout of TCGTemp
  2015-04-03  0:07   ` [Qemu-devel] [PATCH v2] " Emilio G. Cota
  2015-04-03  8:13     ` Stefan Weil
  2015-04-03 14:17     ` Richard Henderson
@ 2015-04-07 14:59     ` Alex Bennée
  2 siblings, 0 replies; 7+ messages in thread
From: Alex Bennée @ 2015-04-07 14:59 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: qemu-trivial, Stefan Weil, Laurent Desnogues, qemu-devel,
	Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> This brings down the size of the struct from 56 to 32 bytes on 64-bit,
> and to 20 bytes on 32-bit. This leads to memory savings:
>
> Before:
> $ find . -name 'tcg.o' | xargs size
>    text    data     bss     dec     hex filename
>   41131   29800      88   71019   1156b ./aarch64-softmmu/tcg/tcg.o
>   37969   29416      96   67481   10799 ./x86_64-linux-user/tcg/tcg.o
>   39354   28816      96   68266   10aaa ./arm-linux-user/tcg/tcg.o
>   40802   29096      88   69986   11162 ./arm-softmmu/tcg/tcg.o
>   39417   29672      88   69177   10e39 ./x86_64-softmmu/tcg/tcg.o
>
> After:
> $ find . -name 'tcg.o' | xargs size
>    text    data     bss     dec     hex filename
>   40883   29800      88   70771   11473 ./aarch64-softmmu/tcg/tcg.o
>   37473   29416      96   66985   105a9 ./x86_64-linux-user/tcg/tcg.o
>   38858   28816      96   67770   108ba ./arm-linux-user/tcg/tcg.o
>   40554   29096      88   69738   1106a ./arm-softmmu/tcg/tcg.o
>   39169   29672      88   68929   10d41 ./x86_64-softmmu/tcg/tcg.o
>
> Note that using an entire byte for some enums that need less than
> that wastes a few bits (noticeable in 32 bits, where we use
> 20 bytes instead of 16) but avoids extraction code, which overall
> is a win--I've tested several variations of the patch, and the appended
> is the best performer for OpenSSL's bntest by a very small margin:
>
> Before:
> $ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
> [...]
>  Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
>
>       10538.479833 task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.38% )
>                772 context-switches          #    0.073 K/sec                    ( +-  2.03% )
>                  0 cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
>              2,207 page-faults               #    0.209 K/sec                    ( +-  0.08% )
>       10.552871687 seconds time elapsed                                          ( +-  0.39% )
>
> After:
> $ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
>  Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
>
>       10459.968847 task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.30% )
>                739 context-switches          #    0.071 K/sec                    ( +-  1.71% )
>                  0 cpu-migrations            #    0.000 K/sec                    ( +- 68.14% )
>              2,204 page-faults               #    0.211 K/sec                    ( +-  0.10% )
>       10.473900411 seconds time elapsed                                          ( +-  0.30% )

I'll take that as a win condition ;-)

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
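
For a sense of scale, the two "seconds time elapsed" figures quoted above
can be compared with a few lines of C. The helper names are hypothetical
and the sketch only does the arithmetic; it does not interpret what perf's
"+-" column means statistically.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from 'before' to 'after' in percent;
 * a positive result means 'after' is faster. */
static inline double speedup_percent(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}

/* Absolute width in seconds that a "+- x%" figure corresponds to. */
static inline double spread_seconds(double mean, double percent)
{
    return mean * percent / 100.0;
}

/* Print the comparison for the elapsed times reported in the thread. */
static inline void print_summary(void)
{
    const double before = 10.552871687;   /* seconds, +- 0.39% */
    const double after  = 10.473900411;   /* seconds, +- 0.30% */

    printf("difference: %.3f s (%.2f%%)\n",
           before - after, speedup_percent(before, after));
    printf("reported spread: %.3f s before, %.3f s after\n",
           spread_seconds(before, 0.39), spread_seconds(after, 0.30));
}
```

The improvement works out to roughly three quarters of a percent, which is
the same order of magnitude as the reported run-to-run variation and fits
the "very small margin" wording in the commit message.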

-- 
Alex Bennée
