* OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
@ 2009-06-28 18:19 Filip Navara
  2009-06-28 21:24 ` Laurent Desnogues
  0 siblings, 1 reply; 5+ messages in thread
From: Filip Navara @ 2009-06-28 18:19 UTC (permalink / raw)
  To: Blue Swirl
  Cc: Anthony Liguori, ehabkost, jan.kiszka, dlaor, qemu-devel,
	Luiz Capitulino, Avi Kivity

On Sun, Jun 28, 2009 at 7:51 PM, Blue Swirl<blauwirbel@gmail.com> wrote:
> On 6/28/09, Filip Navara <filip.navara@gmail.com> wrote:
>> On Sun, Jun 28, 2009 at 5:52 PM, Avi Kivity<avi@redhat.com> wrote:
>>  > It really isn't very complicated, and
>>  > the thread only got so long because the topic is relatively simple.  Post an
>>  > RFC and a mile-long patchset about changing TCG to SSA form, and see how you
>>  > get no replies.
>>
>>
>> I wouldn't even dare to push the SSA patch... Mile-long doesn't
>>  describe it precisely enough. Imagine it was applied to all the
>>  targets.

Just to be perfectly clear, this was meant as a joke. I don't have any
working SSA patch and neither am I working on one right now, but I am
interested in the topic. Main reason for my interest is this:

http://www.info.uni-karlsruhe.de/lehre/2006SS/uebau2/folien/08-RA_v1_4.pdf
http://www.info.uni-karlsruhe.de/~hack/ra_ssa.pdf

I'd like to know whether the register allocation can be improved. I
don't believe SSA would help much with anything else, since the input
code to the translators was already compiled with an optimizing
compiler, so most of the SSA-based optimizations would be redundant.

A profiling run on several ARM demo programs showed that most of
the generated code was doing load/store operations to the machine
registers (in CPU_env). A sample run of FreeRTOS looked like this (op
counts):

movi_i32 1603
ld_i32 1305
st_i32 1174
add_i32 530
...

If something could be done to keep the guest registers in host
registers, even temporarily, it would certainly help the guests that
I'm dealing with.
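
To put a rough number on the counts above, here is a minimal C sketch
that computes what share of the four listed ops are ld_i32/st_i32. The
struct and function names are made up for illustration, and the elided
tail of the profile is not included, so the real share over all ops is
lower than what this prints.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct op_count {
    const char *name;
    unsigned count;
};

/* The four counts quoted above; the elided "..." tail is not included. */
static const struct op_count listed_ops[] = {
    { "movi_i32", 1603 },
    { "ld_i32",   1305 },
    { "st_i32",   1174 },
    { "add_i32",   530 },
};
static const size_t num_listed_ops = sizeof(listed_ops) / sizeof(listed_ops[0]);

/* Count recorded for one op name, or 0 if the op is not in the table. */
static unsigned count_of(const char *name)
{
    for (size_t i = 0; i < num_listed_ops; i++) {
        if (strcmp(listed_ops[i].name, name) == 0)
            return listed_ops[i].count;
    }
    return 0;
}

/* Sum over every listed op. */
static unsigned total_listed(void)
{
    unsigned total = 0;
    for (size_t i = 0; i < num_listed_ops; i++)
        total += listed_ops[i].count;
    return total;
}

/* Percentage of the listed ops that are ld_i32 or st_i32. */
static double env_traffic_percent(void)
{
    unsigned total = total_listed();
    assert(total > 0);
    return 100.0 * (count_of("ld_i32") + count_of("st_i32")) / total;
}
```

Calling env_traffic_percent() gives a little under 54% for these four
entries (2479 of 4612 ops).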

Best regards,
Filip Navara


* Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
  2009-06-28 18:19 OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command) Filip Navara
@ 2009-06-28 21:24 ` Laurent Desnogues
  2009-06-28 23:19   ` Filip Navara
  0 siblings, 1 reply; 5+ messages in thread
From: Laurent Desnogues @ 2009-06-28 21:24 UTC (permalink / raw)
  To: Filip Navara
  Cc: Anthony Liguori, ehabkost, jan.kiszka, dlaor, qemu-devel,
	Luiz Capitulino, Blue Swirl, Avi Kivity

On Sun, Jun 28, 2009 at 8:19 PM, Filip Navara<filip.navara@gmail.com> wrote:
> Doing a profiling run on several ARM demo programs showed that most of
> the generated code was doing load/store operations to the machine
> registers (in CPU_env). Sample run of FreeRTOS looked like this (OP
> counts):
>
> movi_i32 1603
> ld_i32 1305
> st_i32 1174
> add_i32 530
> ...
>
> If something could be done to keep the guest registers in host
> registers, even temporarily, it would certainly help the guests that
> I'm dealing with.

TCG does a good job of register allocation.

The problem you have here is that the ARM translator
isn't using tcg_global_mem_new_i32 for ARM registers.

Here's an example of the op counts I see when using
tcg_global_mem_new_i32:

exit_tb 4991
add_i32 7945
st_i32 8257
movi_i32 26812
mov_i32 38369

And with the trunk:

exit_tb 4957
add_i32 8165
st_i32 20281
ld_i32 21926
movi_i32 25083


Laurent


* Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
  2009-06-28 21:24 ` Laurent Desnogues
@ 2009-06-28 23:19   ` Filip Navara
  2009-06-28 23:35     ` Filip Navara
  0 siblings, 1 reply; 5+ messages in thread
From: Filip Navara @ 2009-06-28 23:19 UTC (permalink / raw)
  To: Laurent Desnogues; +Cc: Blue Swirl, Anthony Liguori, qemu-devel, Avi Kivity

[-- Attachment #1: Type: text/plain, Size: 1623 bytes --]

On Sun, Jun 28, 2009 at 11:24 PM, Laurent
Desnogues<laurent.desnogues@gmail.com> wrote:
> On Sun, Jun 28, 2009 at 8:19 PM, Filip Navara<filip.navara@gmail.com> wrote:
>> Doing a profiling run on several ARM demo programs showed that most of
>> the generated code was doing load/store operations to the machine
>> registers (in CPU_env). Sample run of FreeRTOS looked like this (OP
>> counts):
>>
>> movi_i32 1603
>> ld_i32 1305
>> st_i32 1174
>> add_i32 530
>> ...
>>
>> If something could be done to keep the guest registers in host
>> registers, even temporarily, it would certainly help the guests that
>> I'm dealing with.
>
> TCG does a good job for register allocation.
>
> The problem you have here is that the ARM translator
> isn't using tcg_global_mem_new_i32 for ARM registers.

Interesting, thanks for the tip. I have been trying to achieve the
same effect using tcg_global_reg_new_i32; no wonder it felt so hard.
:)

> Here's an example of number of ops I see when using
> tcg_global_mem_new_i32:
>
> exit_tb 4991
> add_i32 7945
> st_i32 8257
> movi_i32 26812
> mov_i32 38369
>
> And with the trunk:
>
> exit_tb 4957
> add_i32 8165
> st_i32 20281
> ld_i32 21926
> movi_i32 25083
>
>
> Laurent
>

Attached is a proof-of-concept ARM patch that uses
tcg_global_mem_new_i32. I haven't had much time to test it yet, but on
a synthetic benchmark it improved performance by 13 DMIPS, to a total
of 216 DMIPS, which is roughly a 6% improvement. On an x86 host the
register allocation still looks very pathetic; I will post a follow-up
soon.
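
For reference, the percentage works out as follows. This is a small C
sketch with a hypothetical helper name; it assumes the 216 DMIPS figure
is the score after the patch, so the baseline is 216 - 13 = 203 DMIPS.

```c
#include <assert.h>

/* Relative gain in percent, given the score after the change and the
 * absolute gain. The baseline is derived as after - gain. */
static double percent_gain(double after, double gain)
{
    double before = after - gain;
    assert(before > 0.0);
    return 100.0 * gain / before;
}
```

percent_gain(216.0, 13.0) evaluates 100 * 13 / 203, which is about 6.4.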

Best regards,
Filip Navara

[-- Attachment #2: 0001-First-try-at-using-tcg_global_mem_new_i32.patch.txt --]
[-- Type: text/plain, Size: 3869 bytes --]

From 4feddee0e7e02e1daab764dbbf9d694277b1e00a Mon Sep 17 00:00:00 2001
From: Filip Navara <filip.navara@gmail.com>
Date: Mon, 29 Jun 2009 01:13:42 +0200
Subject: [PATCH] First try at using tcg_global_mem_new_i32.

---
 target-arm/translate.c |   40 +++++++++++++++++++++++-----------------
 1 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/target-arm/translate.c b/target-arm/translate.c
index 62c9eff..9a39536 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -77,6 +77,7 @@ typedef struct DisasContext {
 static TCGv_ptr cpu_env;
 /* We reuse the same 64-bit temporaries for efficiency.  */
 static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
+static TCGv_i32 cpu_R[16];
 
 /* FIXME:  These should be removed.  */
 static TCGv cpu_T[2];
@@ -86,14 +87,26 @@ static TCGv_i64 cpu_F0d, cpu_F1d;
 #define ICOUNT_TEMP cpu_T[0]
 #include "gen-icount.h"
 
+static const char *regnames[] =
+    { "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
+      "r8", "r9", "r10", "r11", "r12", "r13", "r14", "pc" };
+
 /* initialize TCG globals.  */
 void arm_translate_init(void)
 {
+    int i;
+
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
 
     cpu_T[0] = tcg_global_reg_new_i32(TCG_AREG1, "T0");
     cpu_T[1] = tcg_global_reg_new_i32(TCG_AREG2, "T1");
 
+    for (i = 0; i < 16; i++) {
+        cpu_R[i] = tcg_global_mem_new_i32(TCG_AREG0,
+                                          offsetof(CPUState, regs[i]),
+                                          regnames[i]);
+    }
+
 #define GEN_HELPER 2
 #include "helpers.h"
 }
@@ -168,7 +181,7 @@ static void load_reg_var(DisasContext *s, TCGv var, int reg)
             addr = (long)s->pc + 4;
         tcg_gen_movi_i32(var, addr);
     } else {
-        tcg_gen_ld_i32(var, cpu_env, offsetof(CPUState, regs[reg]));
+        tcg_gen_mov_i32(var, cpu_R[reg]);
     }
 }
 
@@ -188,7 +201,7 @@ static void store_reg(DisasContext *s, int reg, TCGv var)
         tcg_gen_andi_i32(var, var, ~1);
         s->is_jmp = DISAS_JUMP;
     }
-    tcg_gen_st_i32(var, cpu_env, offsetof(CPUState, regs[reg]));
+    tcg_gen_mov_i32(cpu_R[reg], var);
     dead_tmp(var);
 }
 
@@ -790,27 +803,22 @@ static inline void gen_bx_im(DisasContext *s, uint32_t addr)
     TCGv tmp;
 
     s->is_jmp = DISAS_UPDATE;
-    tmp = new_tmp();
     if (s->thumb != (addr & 1)) {
+        tmp = new_tmp();
         tcg_gen_movi_i32(tmp, addr & 1);
         tcg_gen_st_i32(tmp, cpu_env, offsetof(CPUState, thumb));
+        dead_tmp(tmp);
     }
-    tcg_gen_movi_i32(tmp, addr & ~1);
-    tcg_gen_st_i32(tmp, cpu_env, offsetof(CPUState, regs[15]));
-    dead_tmp(tmp);
+    tcg_gen_movi_i32(cpu_R[15], addr & ~1);
 }
 
 /* Set PC and Thumb state from var.  var is marked as dead.  */
 static inline void gen_bx(DisasContext *s, TCGv var)
 {
-    TCGv tmp;
-
     s->is_jmp = DISAS_UPDATE;
-    tmp = new_tmp();
-    tcg_gen_andi_i32(tmp, var, 1);
-    store_cpu_field(tmp, thumb);
-    tcg_gen_andi_i32(var, var, ~1);
-    store_cpu_field(var, regs[15]);
+    tcg_gen_andi_i32(cpu_R[15], var, ~1);
+    tcg_gen_andi_i32(var, var, 1);
+    store_cpu_field(var, thumb);
 }
 
 /* Variant of store_reg which uses branch&exchange logic when storing
@@ -889,9 +897,7 @@ static inline void gen_movl_T2_reg(DisasContext *s, int reg)
 
 static inline void gen_set_pc_im(uint32_t val)
 {
-    TCGv tmp = new_tmp();
-    tcg_gen_movi_i32(tmp, val);
-    store_cpu_field(tmp, regs[15]);
+    tcg_gen_movi_i32(cpu_R[15], val);
 }
 
 static inline void gen_movl_reg_TN(DisasContext *s, int reg, int t)
@@ -903,7 +909,7 @@ static inline void gen_movl_reg_TN(DisasContext *s, int reg, int t)
     } else {
         tmp = cpu_T[t];
     }
-    tcg_gen_st_i32(tmp, cpu_env, offsetof(CPUState, regs[reg]));
+    tcg_gen_mov_i32(cpu_R[reg], tmp);
     if (reg == 15) {
         dead_tmp(tmp);
         s->is_jmp = DISAS_JUMP;
-- 
1.6.3.msysgit.0



* Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
  2009-06-28 23:19   ` Filip Navara
@ 2009-06-28 23:35     ` Filip Navara
  2009-06-29  6:39       ` Laurent Desnogues
  0 siblings, 1 reply; 5+ messages in thread
From: Filip Navara @ 2009-06-28 23:35 UTC (permalink / raw)
  To: Laurent Desnogues; +Cc: Blue Swirl, Anthony Liguori, qemu-devel, Avi Kivity

On Mon, Jun 29, 2009 at 1:19 AM, Filip Navara<filip.navara@gmail.com> wrote:
> On x86 host the register allocation still looks very pathetic, I will post a follow-up
> soon.

Let's look at the code generated for the guest's very first two instructions:

----------------
IN:
0x00200070:  ldr	r0, [pc, #108]	; 0x2000e4
0x00200074:  ldr	pc, [pc, #108]	; 0x2000e8

OP:
 movi_i32 tmp8,$0x200078
 movi_i32 tmp9,$0x6c
 add_i32 tmp8,tmp8,tmp9
 qemu_ld32u tmp9,tmp8,$0x0
 mov_i32 r0,tmp9
 movi_i32 tmp9,$0x20007c
 movi_i32 tmp10,$0x6c
 add_i32 tmp9,tmp9,tmp10
 qemu_ld32u tmp8,tmp9,$0x0
 movi_i32 tmp10,$0xfffffffe
 and_i32 tmp8,tmp8,tmp10
 mov_i32 pc,tmp8
 exit_tb $0x0

OUT: [size=128]
0x03230020:  mov    $0x200078,%eax
0x03230025:  add    $0x6c,%eax
0x03230028:  mov    %eax,%ecx
0x0323002a:  mov    %ecx,%edx
0x0323002c:  mov    %ecx,%eax

-- this instruction sets %eax to the value that it already has

0x0323002e:  shr    $0x6,%edx
0x03230031:  and    $0xfffffc03,%eax
0x03230037:  and    $0xff0,%edx
0x0323003d:  lea    0x540(%edx,%ebp,1),%edx
0x03230044:  cmp    (%edx),%eax
0x03230046:  mov    %ecx,%eax
0x03230048:  je     0x3230053
0x0323004a:  xor    %edx,%edx
0x0323004c:  call   0x55cbc0
0x03230051:  jmp    0x3230058
0x03230053:  add    0xc(%edx),%eax
0x03230056:  mov    (%eax),%eax
0x03230058:  mov    $0x20007c,%edx
0x0323005d:  add    $0x6c,%edx
0x03230060:  mov    %edx,%ecx
0x03230062:  mov    %eax,0x0(%ebp)
0x03230065:  mov    %ecx,%edx

-- same here

0x03230067:  mov    %ecx,%eax
0x03230069:  shr    $0x6,%edx
0x0323006c:  and    $0xfffffc03,%eax
0x03230072:  and    $0xff0,%edx
0x03230078:  lea    0x540(%edx,%ebp,1),%edx
0x0323007f:  cmp    (%edx),%eax
0x03230081:  mov    %ecx,%eax
0x03230083:  je     0x323008e
0x03230085:  xor    %edx,%edx
0x03230087:  call   0x55cbc0
0x0323008c:  jmp    0x3230093
0x0323008e:  add    0xc(%edx),%eax
0x03230091:  mov    (%eax),%eax
0x03230093:  and    $0xfffffffe,%eax
0x03230096:  mov    %eax,0x3c(%ebp)
0x03230099:  xor    %eax,%eax
0x0323009b:  jmp    0x7ec928

If someone can explain to me why the redundant mov instructions are
generated, I'd be very happy. Thanks.

Best regards,
Filip Navara


* Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
  2009-06-28 23:35     ` Filip Navara
@ 2009-06-29  6:39       ` Laurent Desnogues
  0 siblings, 0 replies; 5+ messages in thread
From: Laurent Desnogues @ 2009-06-29  6:39 UTC (permalink / raw)
  To: Filip Navara; +Cc: qemu-devel

On Mon, Jun 29, 2009 at 1:35 AM, Filip Navara<filip.navara@gmail.com> wrote:
> On Mon, Jun 29, 2009 at 1:19 AM, Filip Navara<filip.navara@gmail.com> wrote:
>> On x86 host the register allocation still looks very pathetic, I will post a follow-up
>> soon.
>
> Let's look at the very first two instructions generated by the guest:
>
> ----------------
> IN:
> 0x00200070:  ldr        r0, [pc, #108]  ; 0x2000e4
> 0x00200074:  ldr        pc, [pc, #108]  ; 0x2000e8
>
> OP:
>  movi_i32 tmp8,$0x200078
>  movi_i32 tmp9,$0x6c
>  add_i32 tmp8,tmp8,tmp9
>  qemu_ld32u tmp9,tmp8,$0x0
>  mov_i32 r0,tmp9
>  movi_i32 tmp9,$0x20007c
>  movi_i32 tmp10,$0x6c
>  add_i32 tmp9,tmp9,tmp10
>  qemu_ld32u tmp8,tmp9,$0x0
>  movi_i32 tmp10,$0xfffffffe
>  and_i32 tmp8,tmp8,tmp10
>  mov_i32 pc,tmp8
>  exit_tb $0x0
>
> OUT: [size=128]
> 0x03230020:  mov    $0x200078,%eax
> 0x03230025:  add    $0x6c,%eax
> 0x03230028:  mov    %eax,%ecx
> 0x0323002a:  mov    %ecx,%edx
> 0x0323002c:  mov    %ecx,%eax
>
> -- this instruction sets %eax to value that it already has
>
> 0x0323002e:  shr    $0x6,%edx
> 0x03230031:  and    $0xfffffc03,%eax
> 0x03230037:  and    $0xff0,%edx
> 0x0323003d:  lea    0x540(%edx,%ebp,1),%edx
> 0x03230044:  cmp    (%edx),%eax
> 0x03230046:  mov    %ecx,%eax
> 0x03230048:  je     0x3230053
> 0x0323004a:  xor    %edx,%edx
> 0x0323004c:  call   0x55cbc0
> 0x03230051:  jmp    0x3230058
> 0x03230053:  add    0xc(%edx),%eax
> 0x03230056:  mov    (%eax),%eax
> 0x03230058:  mov    $0x20007c,%edx
> 0x0323005d:  add    $0x6c,%edx
> 0x03230060:  mov    %edx,%ecx
> 0x03230062:  mov    %eax,0x0(%ebp)
> 0x03230065:  mov    %ecx,%edx
>
> -- same here
>
> 0x03230067:  mov    %ecx,%eax
> 0x03230069:  shr    $0x6,%edx
> 0x0323006c:  and    $0xfffffc03,%eax
> 0x03230072:  and    $0xff0,%edx
> 0x03230078:  lea    0x540(%edx,%ebp,1),%edx
> 0x0323007f:  cmp    (%edx),%eax
> 0x03230081:  mov    %ecx,%eax
> 0x03230083:  je     0x323008e
> 0x03230085:  xor    %edx,%edx
> 0x03230087:  call   0x55cbc0
> 0x0323008c:  jmp    0x3230093
> 0x0323008e:  add    0xc(%edx),%eax
> 0x03230091:  mov    (%eax),%eax
> 0x03230093:  and    $0xfffffffe,%eax
> 0x03230096:  mov    %eax,0x3c(%ebp)
> 0x03230099:  xor    %eax,%eax
> 0x0323009b:  jmp    0x7ec928
>
> If someone can explain to me why the redundant mov instructions are
> generated, I'd be very happy. Thanks.

What you see here comes from the hard-coded sequence of host
instructions emitted for a guest load; cf. tcg_out_qemu_ld in
tcg-target.c.
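
Reading the quoted x86 output, that fixed sequence looks like a
software TLB lookup: the address is copied into scratch registers, one
copy is turned into a table offset (shr $0x6 / and $0xff0), another is
masked with 0xfffffc03 and compared against the entry, and a helper is
called on a mismatch. Because the register use is fixed, the copies are
emitted whether or not the value is already in place. Below is a rough
C sketch of that lookup. The struct layout, names and constants are
assumptions derived from the disassembly, not the real QEMU
definitions, and it keeps a host page pointer where the generated code
adds an addend (add 0xc(%edx),%eax).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* 1 KiB pages and 256 entries, as implied by the 0xfffffc03 and 0xff0 masks. */
enum { PAGE_BITS = 10, TLB_BITS = 8, TLB_SIZE = 1 << TLB_BITS };
#define PAGE_MASK (~(uint32_t)((1u << PAGE_BITS) - 1))

struct tlb_entry {
    uint32_t addr_read;        /* page-aligned guest address, or all ones if invalid */
    const uint8_t *host_page;  /* host memory backing that guest page */
};

/* Fast path of a 32-bit guest load. Returns false on a miss, in which
 * case the caller would fall back to the slow-path helper. */
static bool tlb_load32(const struct tlb_entry *tlb, uint32_t addr, uint32_t *out)
{
    const struct tlb_entry *e = &tlb[(addr >> PAGE_BITS) & (TLB_SIZE - 1)];

    /* Keeping the low two bits in the comparison makes unaligned
     * accesses miss, since addr_read always has them clear. */
    if (e->addr_read != (addr & (PAGE_MASK | 3u)))
        return false;

    memcpy(out, e->host_page + (addr & ~PAGE_MASK), sizeof(*out));
    return true;
}

/* Demo: map one guest page and read a word back through the fast path. */
static int tlb_demo(void)
{
    static struct tlb_entry tlb[TLB_SIZE];
    static uint8_t page[1 << PAGE_BITS];
    uint32_t guest = 0x2000e4, want = 0xdeadbeef, got = 0;

    for (int i = 0; i < TLB_SIZE; i++)
        tlb[i].addr_read = UINT32_MAX;   /* mark every entry invalid */

    memcpy(page + (guest & ~PAGE_MASK), &want, sizeof(want));
    tlb[(guest >> PAGE_BITS) & (TLB_SIZE - 1)] =
        (struct tlb_entry){ guest & PAGE_MASK, page };

    assert(tlb_load32(tlb, guest, &got) && got == want);        /* aligned hit */
    assert(!tlb_load32(tlb, guest + 1, &got));                  /* unaligned: miss */
    assert(!tlb_load32(tlb, guest + (1u << PAGE_BITS), &got));  /* unmapped page */
    return 1;
}
```

Note that (addr >> 6) & 0xff0 in the generated code is the same index
scaled by a 16-byte entry size, i.e. ((addr >> 10) & 0xff) << 4.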


Laurent

