* OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
@ 2009-06-28 18:19 Filip Navara
2009-06-28 21:24 ` Laurent Desnogues
0 siblings, 1 reply; 5+ messages in thread
From: Filip Navara @ 2009-06-28 18:19 UTC (permalink / raw)
To: Blue Swirl
Cc: Anthony Liguori, ehabkost, jan.kiszka, dlaor, qemu-devel,
Luiz Capitulino, Avi Kivity
On Sun, Jun 28, 2009 at 7:51 PM, Blue Swirl<blauwirbel@gmail.com> wrote:
> On 6/28/09, Filip Navara <filip.navara@gmail.com> wrote:
>> On Sun, Jun 28, 2009 at 5:52 PM, Avi Kivity<avi@redhat.com> wrote:
>> > It really isn't very complicated, and
>> > the thread only got so long because the topic is relatively simple. Post an
>> > RFC and a mile-long patchset about changing TCG to SSA form, and see how you
>> > get no replies.
>>
>>
>> I wouldn't even dare to push the SSA patch... Mile-long doesn't
>> describe it precisely enough. Imagine it was applied to all the
>> targets.
Just to be perfectly clear, this was meant as a joke. I don't have any
working SSA patch and neither am I working on one right now, but I am
interested in the topic. Main reason for my interest is this:
http://www.info.uni-karlsruhe.de/lehre/2006SS/uebau2/folien/08-RA_v1_4.pdf
http://www.info.uni-karlsruhe.de/~hack/ra_ssa.pdf
I'd like to know whether register allocation can be improved. I don't
believe SSA would help much with anything else, since the input code to
the translators was already compiled with an optimizing compiler, so most
of the SSA-based optimizations would be redundant.
Doing a profiling run on several ARM demo programs showed that most of
the generated code was doing load/store operations to the machine
registers (in CPU_env). Sample run of FreeRTOS looked like this (OP
counts):
movi_i32 1603
ld_i32 1305
st_i32 1174
add_i32 530
...
If something could be done to allow the guest registers to be kept in
host registers, even temporarily, it would certainly help the guests
that I'm dealing with.
Best regards,
Filip Navara
* Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
2009-06-28 18:19 OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command) Filip Navara
@ 2009-06-28 21:24 ` Laurent Desnogues
2009-06-28 23:19 ` Filip Navara
0 siblings, 1 reply; 5+ messages in thread
From: Laurent Desnogues @ 2009-06-28 21:24 UTC (permalink / raw)
To: Filip Navara
Cc: Anthony Liguori, ehabkost, jan.kiszka, dlaor, qemu-devel,
Luiz Capitulino, Blue Swirl, Avi Kivity
On Sun, Jun 28, 2009 at 8:19 PM, Filip Navara<filip.navara@gmail.com> wrote:
> Doing a profiling run on several ARM demo programs showed that most of
> the generated code was doing load/store operations to the machine
> registers (in CPU_env). Sample run of FreeRTOS looked like this (OP
> counts):
>
> movi_i32 1603
> ld_i32 1305
> st_i32 1174
> add_i32 530
> ...
>
> If there could be done something that would allow the guest registers
> to be stored in host registers, even if for a temporary amount of time
> it would certainly help the guests that I'm dealing with.
TCG does a good job for register allocation.
The problem you have here is that the ARM translator
isn't using tcg_global_mem_new_i32 for ARM registers.
Here's an example of number of ops I see when using
tcg_global_mem_new_i32:
exit_tb 4991
add_i32 7945
st_i32 8257
movi_i32 26812
mov_i32 38369
And with the trunk:
exit_tb 4957
add_i32 8165
st_i32 20281
ld_i32 21926
movi_i32 25083
Laurent
* Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
2009-06-28 21:24 ` Laurent Desnogues
@ 2009-06-28 23:19 ` Filip Navara
2009-06-28 23:35 ` Filip Navara
0 siblings, 1 reply; 5+ messages in thread
From: Filip Navara @ 2009-06-28 23:19 UTC (permalink / raw)
To: Laurent Desnogues; +Cc: Blue Swirl, Anthony Liguori, qemu-devel, Avi Kivity
[-- Attachment #1: Type: text/plain, Size: 1623 bytes --]
On Sun, Jun 28, 2009 at 11:24 PM, Laurent
Desnogues<laurent.desnogues@gmail.com> wrote:
> On Sun, Jun 28, 2009 at 8:19 PM, Filip Navara<filip.navara@gmail.com> wrote:
>> Doing a profiling run on several ARM demo programs showed that most of
>> the generated code was doing load/store operations to the machine
>> registers (in CPU_env). Sample run of FreeRTOS looked like this (OP
>> counts):
>>
>> movi_i32 1603
>> ld_i32 1305
>> st_i32 1174
>> add_i32 530
>> ...
>>
>> If there could be done something that would allow the guest registers
>> to be stored in host registers, even if for a temporary amount of time
>> it would certainly help the guests that I'm dealing with.
>
> TCG does a good job for register allocation.
>
> The problem you have here is that the ARM translator
> isn't using tcg_global_mem_new_i32 for ARM registers.
Interesting, thanks for the tip. I have been trying to achieve the
same effect using tcg_global_reg_new_i32, no wonder it felt so hard.
:)
> Here's an example of number of ops I see when using
> tcg_global_mem_new_i32:
>
> exit_tb 4991
> add_i32 7945
> st_i32 8257
> movi_i32 26812
> mov_i32 38369
>
> And with the trunk:
>
> exit_tb 4957
> add_i32 8165
> st_i32 20281
> ld_i32 21926
> movi_i32 25083
>
>
> Laurent
>
Attached is a proof-of-concept ARM patch that uses
tcg_global_mem_new_i32. I haven't had much time to test it yet, but on
a synthetic benchmark it improved performance by 13 DMIPS to a
total of 216 DMIPS, which is roughly a 6% improvement. On x86 host the
register allocation still looks very pathetic, I will post a follow-up
soon.
Best regards,
Filip Navara
[-- Attachment #2: 0001-First-try-at-using-tcg_global_mem_new_i32.patch.txt --]
[-- Type: text/plain, Size: 3869 bytes --]
From 4feddee0e7e02e1daab764dbbf9d694277b1e00a Mon Sep 17 00:00:00 2001
From: Filip Navara <filip.navara@gmail.com>
Date: Mon, 29 Jun 2009 01:13:42 +0200
Subject: [PATCH] First try at using tcg_global_mem_new_i32.
---
target-arm/translate.c | 40 +++++++++++++++++++++++-----------------
1 files changed, 23 insertions(+), 17 deletions(-)
diff --git a/target-arm/translate.c b/target-arm/translate.c
index 62c9eff..9a39536 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -77,6 +77,7 @@ typedef struct DisasContext {
static TCGv_ptr cpu_env;
/* We reuse the same 64-bit temporaries for efficiency. */
static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
+static TCGv_i32 cpu_R[16];
/* FIXME: These should be removed. */
static TCGv cpu_T[2];
@@ -86,14 +87,26 @@ static TCGv_i64 cpu_F0d, cpu_F1d;
#define ICOUNT_TEMP cpu_T[0]
#include "gen-icount.h"
+static const char *regnames[] =
+ { "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
+ "r8", "r9", "r10", "r11", "r12", "r13", "r14", "pc" };
+
/* initialize TCG globals. */
void arm_translate_init(void)
{
+ int i;
+
cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
cpu_T[0] = tcg_global_reg_new_i32(TCG_AREG1, "T0");
cpu_T[1] = tcg_global_reg_new_i32(TCG_AREG2, "T1");
+ for (i = 0; i < 16; i++) {
+ cpu_R[i] = tcg_global_mem_new_i32(TCG_AREG0,
+ offsetof(CPUState, regs[i]),
+ regnames[i]);
+ }
+
#define GEN_HELPER 2
#include "helpers.h"
}
@@ -168,7 +181,7 @@ static void load_reg_var(DisasContext *s, TCGv var, int reg)
addr = (long)s->pc + 4;
tcg_gen_movi_i32(var, addr);
} else {
- tcg_gen_ld_i32(var, cpu_env, offsetof(CPUState, regs[reg]));
+ tcg_gen_mov_i32(var, cpu_R[reg]);
}
}
@@ -188,7 +201,7 @@ static void store_reg(DisasContext *s, int reg, TCGv var)
tcg_gen_andi_i32(var, var, ~1);
s->is_jmp = DISAS_JUMP;
}
- tcg_gen_st_i32(var, cpu_env, offsetof(CPUState, regs[reg]));
+ tcg_gen_mov_i32(cpu_R[reg], var);
dead_tmp(var);
}
@@ -790,27 +803,22 @@ static inline void gen_bx_im(DisasContext *s, uint32_t addr)
TCGv tmp;
s->is_jmp = DISAS_UPDATE;
- tmp = new_tmp();
if (s->thumb != (addr & 1)) {
+ tmp = new_tmp();
tcg_gen_movi_i32(tmp, addr & 1);
tcg_gen_st_i32(tmp, cpu_env, offsetof(CPUState, thumb));
+ dead_tmp(tmp);
}
- tcg_gen_movi_i32(tmp, addr & ~1);
- tcg_gen_st_i32(tmp, cpu_env, offsetof(CPUState, regs[15]));
- dead_tmp(tmp);
+ tcg_gen_movi_i32(cpu_R[15], addr & ~1);
}
/* Set PC and Thumb state from var. var is marked as dead. */
static inline void gen_bx(DisasContext *s, TCGv var)
{
- TCGv tmp;
-
s->is_jmp = DISAS_UPDATE;
- tmp = new_tmp();
- tcg_gen_andi_i32(tmp, var, 1);
- store_cpu_field(tmp, thumb);
- tcg_gen_andi_i32(var, var, ~1);
- store_cpu_field(var, regs[15]);
+ tcg_gen_andi_i32(cpu_R[15], var, ~1);
+ tcg_gen_andi_i32(var, var, 1);
+ store_cpu_field(var, thumb);
}
/* Variant of store_reg which uses branch&exchange logic when storing
@@ -889,9 +897,7 @@ static inline void gen_movl_T2_reg(DisasContext *s, int reg)
static inline void gen_set_pc_im(uint32_t val)
{
- TCGv tmp = new_tmp();
- tcg_gen_movi_i32(tmp, val);
- store_cpu_field(tmp, regs[15]);
+ tcg_gen_movi_i32(cpu_R[15], val);
}
static inline void gen_movl_reg_TN(DisasContext *s, int reg, int t)
@@ -903,7 +909,7 @@ static inline void gen_movl_reg_TN(DisasContext *s, int reg, int t)
} else {
tmp = cpu_T[t];
}
- tcg_gen_st_i32(tmp, cpu_env, offsetof(CPUState, regs[reg]));
+ tcg_gen_mov_i32(cpu_R[reg], tmp);
if (reg == 15) {
dead_tmp(tmp);
s->is_jmp = DISAS_JUMP;
--
1.6.3.msysgit.0
* Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
2009-06-28 23:19 ` Filip Navara
@ 2009-06-28 23:35 ` Filip Navara
2009-06-29 6:39 ` Laurent Desnogues
0 siblings, 1 reply; 5+ messages in thread
From: Filip Navara @ 2009-06-28 23:35 UTC (permalink / raw)
To: Laurent Desnogues; +Cc: Blue Swirl, Anthony Liguori, qemu-devel, Avi Kivity
On Mon, Jun 29, 2009 at 1:19 AM, Filip Navara<filip.navara@gmail.com> wrote:
> On x86 host the register allocation still looks very pathetic, I will post a follow-up
> soon.
Let's look at the very first two instructions generated by the guest:
----------------
IN:
0x00200070: ldr r0, [pc, #108] ; 0x2000e4
0x00200074: ldr pc, [pc, #108] ; 0x2000e8
OP:
movi_i32 tmp8,$0x200078
movi_i32 tmp9,$0x6c
add_i32 tmp8,tmp8,tmp9
qemu_ld32u tmp9,tmp8,$0x0
mov_i32 r0,tmp9
movi_i32 tmp9,$0x20007c
movi_i32 tmp10,$0x6c
add_i32 tmp9,tmp9,tmp10
qemu_ld32u tmp8,tmp9,$0x0
movi_i32 tmp10,$0xfffffffe
and_i32 tmp8,tmp8,tmp10
mov_i32 pc,tmp8
exit_tb $0x0
OUT: [size=128]
0x03230020: mov $0x200078,%eax
0x03230025: add $0x6c,%eax
0x03230028: mov %eax,%ecx
0x0323002a: mov %ecx,%edx
0x0323002c: mov %ecx,%eax
-- this instruction sets %eax to value that it already has
0x0323002e: shr $0x6,%edx
0x03230031: and $0xfffffc03,%eax
0x03230037: and $0xff0,%edx
0x0323003d: lea 0x540(%edx,%ebp,1),%edx
0x03230044: cmp (%edx),%eax
0x03230046: mov %ecx,%eax
0x03230048: je 0x3230053
0x0323004a: xor %edx,%edx
0x0323004c: call 0x55cbc0
0x03230051: jmp 0x3230058
0x03230053: add 0xc(%edx),%eax
0x03230056: mov (%eax),%eax
0x03230058: mov $0x20007c,%edx
0x0323005d: add $0x6c,%edx
0x03230060: mov %edx,%ecx
0x03230062: mov %eax,0x0(%ebp)
0x03230065: mov %ecx,%edx
-- same here
0x03230067: mov %ecx,%eax
0x03230069: shr $0x6,%edx
0x0323006c: and $0xfffffc03,%eax
0x03230072: and $0xff0,%edx
0x03230078: lea 0x540(%edx,%ebp,1),%edx
0x0323007f: cmp (%edx),%eax
0x03230081: mov %ecx,%eax
0x03230083: je 0x323008e
0x03230085: xor %edx,%edx
0x03230087: call 0x55cbc0
0x0323008c: jmp 0x3230093
0x0323008e: add 0xc(%edx),%eax
0x03230091: mov (%eax),%eax
0x03230093: and $0xfffffffe,%eax
0x03230096: mov %eax,0x3c(%ebp)
0x03230099: xor %eax,%eax
0x0323009b: jmp 0x7ec928
If someone can explain to me why the redundant mov instructions are
generated, I'd be very happy. Thanks.
Best regards,
Filip Navara
* Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
2009-06-28 23:35 ` Filip Navara
@ 2009-06-29 6:39 ` Laurent Desnogues
0 siblings, 0 replies; 5+ messages in thread
From: Laurent Desnogues @ 2009-06-29 6:39 UTC (permalink / raw)
To: Filip Navara; +Cc: qemu-devel
On Mon, Jun 29, 2009 at 1:35 AM, Filip Navara<filip.navara@gmail.com> wrote:
> On Mon, Jun 29, 2009 at 1:19 AM, Filip Navara<filip.navara@gmail.com> wrote:
>> On x86 host the register allocation still looks very pathetic, I will post a follow-up
>> soon.
>
> Let's look at the very first two instructions generated by the guest:
>
> ----------------
> IN:
> 0x00200070: ldr r0, [pc, #108] ; 0x2000e4
> 0x00200074: ldr pc, [pc, #108] ; 0x2000e8
>
> OP:
> movi_i32 tmp8,$0x200078
> movi_i32 tmp9,$0x6c
> add_i32 tmp8,tmp8,tmp9
> qemu_ld32u tmp9,tmp8,$0x0
> mov_i32 r0,tmp9
> movi_i32 tmp9,$0x20007c
> movi_i32 tmp10,$0x6c
> add_i32 tmp9,tmp9,tmp10
> qemu_ld32u tmp8,tmp9,$0x0
> movi_i32 tmp10,$0xfffffffe
> and_i32 tmp8,tmp8,tmp10
> mov_i32 pc,tmp8
> exit_tb $0x0
>
> OUT: [size=128]
> 0x03230020: mov $0x200078,%eax
> 0x03230025: add $0x6c,%eax
> 0x03230028: mov %eax,%ecx
> 0x0323002a: mov %ecx,%edx
> 0x0323002c: mov %ecx,%eax
>
> -- this instruction sets %eax to value that it already has
>
> 0x0323002e: shr $0x6,%edx
> 0x03230031: and $0xfffffc03,%eax
> 0x03230037: and $0xff0,%edx
> 0x0323003d: lea 0x540(%edx,%ebp,1),%edx
> 0x03230044: cmp (%edx),%eax
> 0x03230046: mov %ecx,%eax
> 0x03230048: je 0x3230053
> 0x0323004a: xor %edx,%edx
> 0x0323004c: call 0x55cbc0
> 0x03230051: jmp 0x3230058
> 0x03230053: add 0xc(%edx),%eax
> 0x03230056: mov (%eax),%eax
> 0x03230058: mov $0x20007c,%edx
> 0x0323005d: add $0x6c,%edx
> 0x03230060: mov %edx,%ecx
> 0x03230062: mov %eax,0x0(%ebp)
> 0x03230065: mov %ecx,%edx
>
> -- same here
>
> 0x03230067: mov %ecx,%eax
> 0x03230069: shr $0x6,%edx
> 0x0323006c: and $0xfffffc03,%eax
> 0x03230072: and $0xff0,%edx
> 0x03230078: lea 0x540(%edx,%ebp,1),%edx
> 0x0323007f: cmp (%edx),%eax
> 0x03230081: mov %ecx,%eax
> 0x03230083: je 0x323008e
> 0x03230085: xor %edx,%edx
> 0x03230087: call 0x55cbc0
> 0x0323008c: jmp 0x3230093
> 0x0323008e: add 0xc(%edx),%eax
> 0x03230091: mov (%eax),%eax
> 0x03230093: and $0xfffffffe,%eax
> 0x03230096: mov %eax,0x3c(%ebp)
> 0x03230099: xor %eax,%eax
> 0x0323009b: jmp 0x7ec928
>
> If someone can explain me why the redundant mov instructions are
> generated I'd be very happy. Thanks.
What you see here is due to the hard-coded assembly instructions
used to perform a load; see tcg_out_qemu_ld in tcg-target.c.
Laurent