All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 1/3] tcg/arm: fix branch target change during code retranslation
@ 2011-01-06 21:54 Aurelien Jarno
  2011-01-06 21:54 ` [Qemu-devel] [PATCH 2/3] tcg/arm: fix qemu_st64 for big endian targets Aurelien Jarno
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Aurelien Jarno @ 2011-01-06 21:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Andrzej Zaborowski, Aurelien Jarno

QEMU uses code retranslation to restore the CPU state when an exception
happens. For it to work the retranslation must not modify the generated
code. This is what is currently implemented in ARM TCG.

However on CPU that don't have icache/dcache/memory synchronised like
ARM, this requirement is stronger and code retranslation must not modify
the generated code "atomically", as the cache line might be flushed
at any moment (interrupt, exception, task switching), even if not
triggered by QEMU. The probability for this to happen is very low, and
depends on cache size and associativiy, machine load, interrupts, so the
symptoms are might happen randomly.

This requirement is currently not followed in tcg/arm, for the
load/store code, which basically has the following structure:
  1) tlb access code is written
  2) conditional fast path code is written
  3) branch is written with a temporary target
  4) slow path code is written
  5) branch target is updated
The cache lines corresponding to the retranslated code is not flushed
after code retranslation as the generated code is supposed to be the
same. However if the cache line corresponding to the branch instruction
is flushed between step 3 and 5, and is not flushed again before the
code is executed again, the branch target is wrong. In the guest, the
symptoms are MMU page fault at a random addresses, which leads to
kernel page fault or segmentation faults.

The patch fixes this issue by avoiding writing the branch target until
it is known, that is by writing only the branch instruction first, and
later only the offset.

This fixes booting linux guests on ARM hosts (tested: arm, i386, mips,
mipsel, sh4, sparc).

Cc: Andrzej Zaborowski <balrog@zabor.org>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 tcg/arm/tcg-target.c |   28 ++++++++++++++++++++--------
 1 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index a3af5b2..9def2e5 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -113,12 +113,25 @@ static const int tcg_target_call_oarg_regs[2] = {
     TCG_REG_R0, TCG_REG_R1
 };
 
+static inline void reloc_abs32(void *code_ptr, tcg_target_long target)
+{
+    *(uint32_t *) code_ptr = target;
+}
+
+static inline void reloc_pc24(void *code_ptr, tcg_target_long target)
+{
+    uint32_t offset = ((target - ((tcg_target_long) code_ptr + 8)) >> 2);
+
+    *(uint32_t *) code_ptr = ((*(uint32_t *) code_ptr) & ~0xffffff)
+                             | (offset & 0xffffff);
+}
+
 static void patch_reloc(uint8_t *code_ptr, int type,
                 tcg_target_long value, tcg_target_long addend)
 {
     switch (type) {
     case R_ARM_ABS32:
-        *(uint32_t *) code_ptr = value;
+        reloc_abs32(code_ptr, value);
         break;
 
     case R_ARM_CALL:
@@ -127,8 +140,7 @@ static void patch_reloc(uint8_t *code_ptr, int type,
         tcg_abort();
 
     case R_ARM_PC24:
-        *(uint32_t *) code_ptr = ((*(uint32_t *) code_ptr) & 0xff000000) |
-                (((value - ((tcg_target_long) code_ptr + 8)) >> 2) & 0xffffff);
+        reloc_pc24(code_ptr, value);
         break;
     }
 }
@@ -1031,7 +1043,7 @@ static inline void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
     }
 
     label_ptr = (void *) s->code_ptr;
-    tcg_out_b(s, COND_EQ, 8);
+    tcg_out_b_noaddr(s, COND_EQ);
 
     /* TODO: move this code to where the constants pool will be */
     if (addr_reg != TCG_REG_R0) {
@@ -1076,7 +1088,7 @@ static inline void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
         break;
     }
 
-    *label_ptr += ((void *) s->code_ptr - (void *) label_ptr - 8) >> 2;
+    reloc_pc24(label_ptr, (tcg_target_long)s->code_ptr);
 #else /* !CONFIG_SOFTMMU */
     if (GUEST_BASE) {
         uint32_t offset = GUEST_BASE;
@@ -1245,7 +1257,7 @@ static inline void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
     }
 
     label_ptr = (void *) s->code_ptr;
-    tcg_out_b(s, COND_EQ, 8);
+    tcg_out_b_noaddr(s, COND_EQ);
 
     /* TODO: move this code to where the constants pool will be */
     tcg_out_dat_reg(s, COND_AL, ARITH_MOV,
@@ -1317,7 +1329,7 @@ static inline void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
     if (opc == 3)
         tcg_out_dat_imm(s, COND_AL, ARITH_ADD, TCG_REG_R13, TCG_REG_R13, 0x10);
 
-    *label_ptr += ((void *) s->code_ptr - (void *) label_ptr - 8) >> 2;
+    reloc_pc24(label_ptr, (tcg_target_long)s->code_ptr);
 #else /* !CONFIG_SOFTMMU */
     if (GUEST_BASE) {
         uint32_t offset = GUEST_BASE;
@@ -1399,7 +1411,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
             /* Direct jump method */
 #if defined(USE_DIRECT_JUMP)
             s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf;
-            tcg_out_b(s, COND_AL, 8);
+            tcg_out_b_noaddr(s, COND_AL);
 #else
             tcg_out_ld32_12(s, COND_AL, TCG_REG_PC, TCG_REG_PC, -4);
             s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf;
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH 2/3] tcg/arm: fix qemu_st64 for big endian targets
  2011-01-06 21:54 [Qemu-devel] [PATCH 1/3] tcg/arm: fix branch target change during code retranslation Aurelien Jarno
@ 2011-01-06 21:54 ` Aurelien Jarno
  2011-01-07 12:37   ` andrzej zaborowski
  2011-01-06 21:54 ` [Qemu-devel] [PATCH 3/3] tcg/arm: improve constant loading Aurelien Jarno
  2011-01-07  9:12 ` [Qemu-devel] [PATCH 1/3] tcg/arm: fix branch target change during code retranslation Edgar E. Iglesias
  2 siblings, 1 reply; 12+ messages in thread
From: Aurelien Jarno @ 2011-01-06 21:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Andrzej Zaborowski, Aurelien Jarno

Due to a typo, qemu_st64 doesn't properly byteswap the 32-bit low word of
a 64 bit word before saving it. This patch fixes that.

Cc: Andrzej Zaborowski <balrog@zabor.org>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 tcg/arm/tcg-target.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index 9def2e5..08c44c1 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -1248,7 +1248,7 @@ static inline void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
             tcg_out_bswap32(s, COND_EQ, TCG_REG_R0, data_reg2);
             tcg_out_st32_rwb(s, COND_EQ, TCG_REG_R0, TCG_REG_R1, addr_reg);
             tcg_out_bswap32(s, COND_EQ, TCG_REG_R0, data_reg);
-            tcg_out_st32_12(s, COND_EQ, data_reg, TCG_REG_R1, 4);
+            tcg_out_st32_12(s, COND_EQ, TCG_REG_R0, TCG_REG_R1, 4);
         } else {
             tcg_out_st32_rwb(s, COND_EQ, data_reg, TCG_REG_R1, addr_reg);
             tcg_out_st32_12(s, COND_EQ, data_reg2, TCG_REG_R1, 4);
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH 3/3] tcg/arm: improve constant loading
  2011-01-06 21:54 [Qemu-devel] [PATCH 1/3] tcg/arm: fix branch target change during code retranslation Aurelien Jarno
  2011-01-06 21:54 ` [Qemu-devel] [PATCH 2/3] tcg/arm: fix qemu_st64 for big endian targets Aurelien Jarno
@ 2011-01-06 21:54 ` Aurelien Jarno
  2011-01-07 12:52   ` andrzej zaborowski
  2011-01-07  9:12 ` [Qemu-devel] [PATCH 1/3] tcg/arm: fix branch target change during code retranslation Edgar E. Iglesias
  2 siblings, 1 reply; 12+ messages in thread
From: Aurelien Jarno @ 2011-01-06 21:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Andrzej Zaborowski, Aurelien Jarno

Improve constant loading in two ways:
- On all ARM versions, it's possible to load 0xffffff00 = -0x100 using
  the mvn rd, #0. Fix the conditions.
- On <= ARMv6 versions, where movw and movt are not available, load the
  constants using mov and orr with rotations depending on the constant
  to load. This is very useful for example to load constants where the
  low byte is 0. This reduce the generated code size by about 7%.

Also fix the coding style at the same time.

Cc: Andrzej Zaborowski <balrog@zabor.org>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 tcg/arm/tcg-target.c |   39 +++++++++++++++++++++------------------
 1 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index 08c44c1..1eb5605 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -406,35 +406,38 @@ static inline void tcg_out_dat_imm(TCGContext *s,
 }
 
 static inline void tcg_out_movi32(TCGContext *s,
-                int cond, int rd, int32_t arg)
+                int cond, int rd, uint32_t arg)
 {
     /* TODO: This is very suboptimal, we can easily have a constant
      * pool somewhere after all the instructions.  */
-
-    if (arg < 0 && arg > -0x100)
-        return tcg_out_dat_imm(s, cond, ARITH_MVN, rd, 0, (~arg) & 0xff);
-
-    if (use_armv7_instructions) {
+    if ((int)arg < 0 && (int)arg >= -0x100) {
+        tcg_out_dat_imm(s, cond, ARITH_MVN, rd, 0, (~arg) & 0xff);
+    } else if (use_armv7_instructions) {
         /* use movw/movt */
         /* movw */
         tcg_out32(s, (cond << 28) | 0x03000000 | (rd << 12)
                   | ((arg << 4) & 0x000f0000) | (arg & 0xfff));
-        if (arg & 0xffff0000)
+        if (arg & 0xffff0000) {
             /* movt */
             tcg_out32(s, (cond << 28) | 0x03400000 | (rd << 12)
                       | ((arg >> 12) & 0x000f0000) | ((arg >> 16) & 0xfff));
-    } else {
-        tcg_out_dat_imm(s, cond, ARITH_MOV, rd, 0, arg & 0xff);
-        if (arg & 0x0000ff00)
-            tcg_out_dat_imm(s, cond, ARITH_ORR, rd, rd,
-                            ((arg >>  8) & 0xff) | 0xc00);
-        if (arg & 0x00ff0000)
-            tcg_out_dat_imm(s, cond, ARITH_ORR, rd, rd,
-                            ((arg >> 16) & 0xff) | 0x800);
-        if (arg & 0xff000000)
-            tcg_out_dat_imm(s, cond, ARITH_ORR, rd, rd,
-                            ((arg >> 24) & 0xff) | 0x400);
         }
+    } else {
+        int opc = ARITH_MOV;
+        int rn = 0;
+
+        do {
+            int i, rot;
+
+            i = ctz32(arg) & ~1;
+            rot = ((32 - i) << 7) & 0xf00;
+            tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot);
+            arg &= ~(0xff << i);
+
+            opc = ARITH_ORR;
+            rn = rd;
+        } while (arg);
+    }
 }
 
 static inline void tcg_out_mul32(TCGContext *s,
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] tcg/arm: fix branch target change during code retranslation
  2011-01-06 21:54 [Qemu-devel] [PATCH 1/3] tcg/arm: fix branch target change during code retranslation Aurelien Jarno
  2011-01-06 21:54 ` [Qemu-devel] [PATCH 2/3] tcg/arm: fix qemu_st64 for big endian targets Aurelien Jarno
  2011-01-06 21:54 ` [Qemu-devel] [PATCH 3/3] tcg/arm: improve constant loading Aurelien Jarno
@ 2011-01-07  9:12 ` Edgar E. Iglesias
  2 siblings, 0 replies; 12+ messages in thread
From: Edgar E. Iglesias @ 2011-01-07  9:12 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: Andrzej Zaborowski, qemu-devel

On Thu, Jan 06, 2011 at 10:54:32PM +0100, Aurelien Jarno wrote:
> QEMU uses code retranslation to restore the CPU state when an exception
> happens. For it to work the retranslation must not modify the generated
> code. This is what is currently implemented in ARM TCG.
> 
> However on CPU that don't have icache/dcache/memory synchronised like
> ARM, this requirement is stronger and code retranslation must not modify
> the generated code "atomically", as the cache line might be flushed
> at any moment (interrupt, exception, task switching), even if not
> triggered by QEMU. The probability for this to happen is very low, and
> depends on cache size and associativiy, machine load, interrupts, so the
> symptoms are might happen randomly.
> 
> This requirement is currently not followed in tcg/arm, for the
> load/store code, which basically has the following structure:
>   1) tlb access code is written
>   2) conditional fast path code is written
>   3) branch is written with a temporary target
>   4) slow path code is written
>   5) branch target is updated
> The cache lines corresponding to the retranslated code is not flushed
> after code retranslation as the generated code is supposed to be the
> same. However if the cache line corresponding to the branch instruction
> is flushed between step 3 and 5, and is not flushed again before the
> code is executed again, the branch target is wrong. In the guest, the
> symptoms are MMU page fault at a random addresses, which leads to
> kernel page fault or segmentation faults.
> 
> The patch fixes this issue by avoiding writing the branch target until
> it is known, that is by writing only the branch instruction first, and
> later only the offset.
> 
> This fixes booting linux guests on ARM hosts (tested: arm, i386, mips,
> mipsel, sh4, sparc).
> 
> Cc: Andrzej Zaborowski <balrog@zabor.org>
> Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>

Good catch!

The patch looks good to me.

Acked-by: Edgar E. Iglesias <edgar.iglesias@gmail.com>


> ---
>  tcg/arm/tcg-target.c |   28 ++++++++++++++++++++--------
>  1 files changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
> index a3af5b2..9def2e5 100644
> --- a/tcg/arm/tcg-target.c
> +++ b/tcg/arm/tcg-target.c
> @@ -113,12 +113,25 @@ static const int tcg_target_call_oarg_regs[2] = {
>      TCG_REG_R0, TCG_REG_R1
>  };
>  
> +static inline void reloc_abs32(void *code_ptr, tcg_target_long target)
> +{
> +    *(uint32_t *) code_ptr = target;
> +}
> +
> +static inline void reloc_pc24(void *code_ptr, tcg_target_long target)
> +{
> +    uint32_t offset = ((target - ((tcg_target_long) code_ptr + 8)) >> 2);
> +
> +    *(uint32_t *) code_ptr = ((*(uint32_t *) code_ptr) & ~0xffffff)
> +                             | (offset & 0xffffff);
> +}
> +
>  static void patch_reloc(uint8_t *code_ptr, int type,
>                  tcg_target_long value, tcg_target_long addend)
>  {
>      switch (type) {
>      case R_ARM_ABS32:
> -        *(uint32_t *) code_ptr = value;
> +        reloc_abs32(code_ptr, value);
>          break;
>  
>      case R_ARM_CALL:
> @@ -127,8 +140,7 @@ static void patch_reloc(uint8_t *code_ptr, int type,
>          tcg_abort();
>  
>      case R_ARM_PC24:
> -        *(uint32_t *) code_ptr = ((*(uint32_t *) code_ptr) & 0xff000000) |
> -                (((value - ((tcg_target_long) code_ptr + 8)) >> 2) & 0xffffff);
> +        reloc_pc24(code_ptr, value);
>          break;
>      }
>  }
> @@ -1031,7 +1043,7 @@ static inline void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
>      }
>  
>      label_ptr = (void *) s->code_ptr;
> -    tcg_out_b(s, COND_EQ, 8);
> +    tcg_out_b_noaddr(s, COND_EQ);
>  
>      /* TODO: move this code to where the constants pool will be */
>      if (addr_reg != TCG_REG_R0) {
> @@ -1076,7 +1088,7 @@ static inline void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
>          break;
>      }
>  
> -    *label_ptr += ((void *) s->code_ptr - (void *) label_ptr - 8) >> 2;
> +    reloc_pc24(label_ptr, (tcg_target_long)s->code_ptr);
>  #else /* !CONFIG_SOFTMMU */
>      if (GUEST_BASE) {
>          uint32_t offset = GUEST_BASE;
> @@ -1245,7 +1257,7 @@ static inline void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
>      }
>  
>      label_ptr = (void *) s->code_ptr;
> -    tcg_out_b(s, COND_EQ, 8);
> +    tcg_out_b_noaddr(s, COND_EQ);
>  
>      /* TODO: move this code to where the constants pool will be */
>      tcg_out_dat_reg(s, COND_AL, ARITH_MOV,
> @@ -1317,7 +1329,7 @@ static inline void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
>      if (opc == 3)
>          tcg_out_dat_imm(s, COND_AL, ARITH_ADD, TCG_REG_R13, TCG_REG_R13, 0x10);
>  
> -    *label_ptr += ((void *) s->code_ptr - (void *) label_ptr - 8) >> 2;
> +    reloc_pc24(label_ptr, (tcg_target_long)s->code_ptr);
>  #else /* !CONFIG_SOFTMMU */
>      if (GUEST_BASE) {
>          uint32_t offset = GUEST_BASE;
> @@ -1399,7 +1411,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>              /* Direct jump method */
>  #if defined(USE_DIRECT_JUMP)
>              s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf;
> -            tcg_out_b(s, COND_AL, 8);
> +            tcg_out_b_noaddr(s, COND_AL);
>  #else
>              tcg_out_ld32_12(s, COND_AL, TCG_REG_PC, TCG_REG_PC, -4);
>              s->tb_jmp_offset[args[0]] = s->code_ptr - s->code_buf;
> -- 
> 1.7.2.3
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] tcg/arm: fix qemu_st64 for big endian targets
  2011-01-06 21:54 ` [Qemu-devel] [PATCH 2/3] tcg/arm: fix qemu_st64 for big endian targets Aurelien Jarno
@ 2011-01-07 12:37   ` andrzej zaborowski
  0 siblings, 0 replies; 12+ messages in thread
From: andrzej zaborowski @ 2011-01-07 12:37 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: qemu-devel

On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
> Due to a typo, qemu_st64 doesn't properly byteswap the 32-bit low word of
> a 64 bit word before saving it. This patch fixes that.

This is a good catch.
Acked-by: Andrzej Zaborowski <balrogg@gmail.com>

Cheers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/arm: improve constant loading
  2011-01-06 21:54 ` [Qemu-devel] [PATCH 3/3] tcg/arm: improve constant loading Aurelien Jarno
@ 2011-01-07 12:52   ` andrzej zaborowski
  2011-01-07 12:55     ` andrzej zaborowski
  2011-01-07 14:40     ` Aurelien Jarno
  0 siblings, 2 replies; 12+ messages in thread
From: andrzej zaborowski @ 2011-01-07 12:52 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: qemu-devel

Hi,

On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
> Improve constant loading in two ways:
> - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using
>  the mvn rd, #0. Fix the conditions.
> - On <= ARMv6 versions, where movw and movt are not available, load the
>  constants using mov and orr with rotations depending on the constant
>  to load. This is very useful for example to load constants where the
>  low byte is 0. This reduce the generated code size by about 7%.

That's a nice improvement.  For some instructions using MVN and AND
could yield even shorter code and I think with that the optimisation
options (except loading from a constant pool) would be exhausted :)

...
>         }
> +    } else {
> +        int opc = ARITH_MOV;
> +        int rn = 0;
> +
> +        do {
> +            int i, rot;
> +
> +            i = ctz32(arg) & ~1;
> +            rot = ((32 - i) << 7) & 0xf00;
> +            tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot);
> +            arg &= ~(0xff << i);
> +
> +            opc = ARITH_ORR;
> +            rn = rd;

I think you could get rid of rn and just use rd from the start of the
loop.  Otherwise acked by me too.

Best regards

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/arm: improve constant loading
  2011-01-07 12:52   ` andrzej zaborowski
@ 2011-01-07 12:55     ` andrzej zaborowski
  2011-01-07 14:40     ` Aurelien Jarno
  1 sibling, 0 replies; 12+ messages in thread
From: andrzej zaborowski @ 2011-01-07 12:55 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: qemu-devel

On 7 January 2011 13:52, andrzej zaborowski <balrog@zabor.org> wrote:
> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
>> Improve constant loading in two ways:
>> - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using
>>  the mvn rd, #0. Fix the conditions.
>> - On <= ARMv6 versions, where movw and movt are not available, load the
>>  constants using mov and orr with rotations depending on the constant
>>  to load. This is very useful for example to load constants where the
>>  low byte is 0. This reduce the generated code size by about 7%.
>
> That's a nice improvement.  For some instructions using MVN and AND

Oops, I mean for some *values*.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/arm: improve constant loading
  2011-01-07 12:52   ` andrzej zaborowski
  2011-01-07 12:55     ` andrzej zaborowski
@ 2011-01-07 14:40     ` Aurelien Jarno
  2011-01-07 15:56       ` andrzej zaborowski
  1 sibling, 1 reply; 12+ messages in thread
From: Aurelien Jarno @ 2011-01-07 14:40 UTC (permalink / raw)
  To: balrogg; +Cc: qemu-devel

On Fri, Jan 07, 2011 at 01:52:25PM +0100, andrzej zaborowski wrote:
> Hi,
> 
> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
> > Improve constant loading in two ways:
> > - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using
> >  the mvn rd, #0. Fix the conditions.
> > - On <= ARMv6 versions, where movw and movt are not available, load the
> >  constants using mov and orr with rotations depending on the constant
> >  to load. This is very useful for example to load constants where the
> >  low byte is 0. This reduce the generated code size by about 7%.
> 
> That's a nice improvement.  For some instructions using MVN and AND
> could yield even shorter code and I think with that the optimisation
> options (except loading from a constant pool) would be exhausted :)

I also did something with MVN and BIC, it works well, but the problem is
to find the right heuristic to choose between MOV/ORR and MVN/BIC. In my
tries, it was making the code bigger.

> ...
> >         }
> > +    } else {
> > +        int opc = ARITH_MOV;
> > +        int rn = 0;
> > +
> > +        do {
> > +            int i, rot;
> > +
> > +            i = ctz32(arg) & ~1;
> > +            rot = ((32 - i) << 7) & 0xf00;
> > +            tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot);
> > +            arg &= ~(0xff << i);
> > +
> > +            opc = ARITH_ORR;
> > +            rn = rd;
> 
> I think you could get rid of rn and just use rd from the start of the
> loop.  Otherwise acked by me too.
> 

What do you mean exactly? rn has to be 0 when opc is ARITH_MOV in order
to generate a correct ARM instruction.

-- 
Aurelien Jarno	                        GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/arm: improve constant loading
  2011-01-07 14:40     ` Aurelien Jarno
@ 2011-01-07 15:56       ` andrzej zaborowski
  2011-01-09 22:40         ` Aurelien Jarno
  0 siblings, 1 reply; 12+ messages in thread
From: andrzej zaborowski @ 2011-01-07 15:56 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: qemu-devel

On 7 January 2011 15:40, Aurelien Jarno <aurelien@aurel32.net> wrote:
> On Fri, Jan 07, 2011 at 01:52:25PM +0100, andrzej zaborowski wrote:
>> Hi,
>>
>> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
>> > Improve constant loading in two ways:
>> > - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using
>> >  the mvn rd, #0. Fix the conditions.
>> > - On <= ARMv6 versions, where movw and movt are not available, load the
>> >  constants using mov and orr with rotations depending on the constant
>> >  to load. This is very useful for example to load constants where the
>> >  low byte is 0. This reduce the generated code size by about 7%.
>>
>> That's a nice improvement.  For some instructions using MVN and AND
>> could yield even shorter code and I think with that the optimisation
>> options (except loading from a constant pool) would be exhausted :)
>
> I also did something with MVN and BIC, it works well, but the problem is
> to find the right heuristic to choose between MOV/ORR and MVN/BIC. In my
> tries, it was making the code bigger.

I was thinking of running both without writing the instructions, then
comparing the lengths and then running the better method.  It's
possible that the cost of this outweights the shorter code advantage
though.

>
>> ...
>> >         }
>> > +    } else {
>> > +        int opc = ARITH_MOV;
>> > +        int rn = 0;
>> > +
>> > +        do {
>> > +            int i, rot;
>> > +
>> > +            i = ctz32(arg) & ~1;
>> > +            rot = ((32 - i) << 7) & 0xf00;
>> > +            tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot);
>> > +            arg &= ~(0xff << i);
>> > +
>> > +            opc = ARITH_ORR;
>> > +            rn = rd;
>>
>> I think you could get rid of rn and just use rd from the start of the
>> loop.  Otherwise acked by me too.
>>
>
> What do you mean exactly? rn has to be 0 when opc is ARITH_MOV in order
> to generate a correct ARM instruction.

According to my ARM926 manual rn is ignored for MOV/MVN, perhaps it's
different in later revisions.

Cheers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/arm: improve constant loading
  2011-01-07 15:56       ` andrzej zaborowski
@ 2011-01-09 22:40         ` Aurelien Jarno
  2011-01-09 23:33           ` andrzej zaborowski
  0 siblings, 1 reply; 12+ messages in thread
From: Aurelien Jarno @ 2011-01-09 22:40 UTC (permalink / raw)
  To: andrzej zaborowski; +Cc: qemu-devel

On Fri, Jan 07, 2011 at 04:56:32PM +0100, andrzej zaborowski wrote:
> On 7 January 2011 15:40, Aurelien Jarno <aurelien@aurel32.net> wrote:
> > On Fri, Jan 07, 2011 at 01:52:25PM +0100, andrzej zaborowski wrote:
> >> Hi,
> >>
> >> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
> >> > Improve constant loading in two ways:
> >> > - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using
> >> >  the mvn rd, #0. Fix the conditions.
> >> > - On <= ARMv6 versions, where movw and movt are not available, load the
> >> >  constants using mov and orr with rotations depending on the constant
> >> >  to load. This is very useful for example to load constants where the
> >> >  low byte is 0. This reduce the generated code size by about 7%.
> >>
> >> That's a nice improvement.  For some instructions using MVN and AND
> >> could yield even shorter code and I think with that the optimisation
> >> options (except loading from a constant pool) would be exhausted :)
> >
> > I also did something with MVN and BIC, it works well, but the problem is
> > to find the right heuristic to choose between MOV/ORR and MVN/BIC. In my
> > tries, it was making the code bigger.
> 
> I was thinking of running both without writing the instructions, then
> comparing the lengths and then running the better method.  It's
> possible that the cost of this outweights the shorter code advantage
> though.
> 
> >
> >> ...
> >> >         }
> >> > +    } else {
> >> > +        int opc = ARITH_MOV;
> >> > +        int rn = 0;
> >> > +
> >> > +        do {
> >> > +            int i, rot;
> >> > +
> >> > +            i = ctz32(arg) & ~1;
> >> > +            rot = ((32 - i) << 7) & 0xf00;
> >> > +            tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot);
> >> > +            arg &= ~(0xff << i);
> >> > +
> >> > +            opc = ARITH_ORR;
> >> > +            rn = rd;
> >>
> >> I think you could get rid of rn and just use rd from the start of the
> >> loop.  Otherwise acked by me too.
> >>
> >
> > What do you mean exactly? rn has to be 0 when opc is ARITH_MOV in order
> > to generate a correct ARM instruction.
> 
> According to my ARM926 manual rn is ignored for MOV/MVN, perhaps it's
> different in later revisions.
> 

I have just tried, and it actually works (tried on ARMv5 and ARMv7). 
Note that binutils is not able to disassemble such an instruction and
outputs in qemu.log something like:
| 0x01000008:  e3aa50ff  undefined instruction 0xe3aa50ff

However what worries me the most is that the "ARM Architecture Reference
Manual ARMv7-A and ARMv7-R edition" defines this opcode with the rn field
as "(0)(0)(0)(0)". Looking at what it means:

| An instruction is UNPREDICTABLE if:
| [...]
| * the pseudocode for that encoding does not indicate that a different
|   special case applies, and a bit marked (0) or (1) in the encoding 
| diagram of an instruction is not 0 or 1 respectively.

In short is it still going to work on newer CPUs?

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/arm: improve constant loading
  2011-01-09 22:40         ` Aurelien Jarno
@ 2011-01-09 23:33           ` andrzej zaborowski
  2011-01-10  3:41             ` Peter Maydell
  0 siblings, 1 reply; 12+ messages in thread
From: andrzej zaborowski @ 2011-01-09 23:33 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: qemu-devel

On 9 January 2011 23:40, Aurelien Jarno <aurelien@aurel32.net> wrote:
> On Fri, Jan 07, 2011 at 04:56:32PM +0100, andrzej zaborowski wrote:
>> On 7 January 2011 15:40, Aurelien Jarno <aurelien@aurel32.net> wrote:
>> > On Fri, Jan 07, 2011 at 01:52:25PM +0100, andrzej zaborowski wrote:
>> >> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote:
>> >> ...
>> >> >         }
>> >> > +    } else {
>> >> > +        int opc = ARITH_MOV;
>> >> > +        int rn = 0;
>> >> > +
>> >> > +        do {
>> >> > +            int i, rot;
>> >> > +
>> >> > +            i = ctz32(arg) & ~1;
>> >> > +            rot = ((32 - i) << 7) & 0xf00;
>> >> > +            tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot);
>> >> > +            arg &= ~(0xff << i);
>> >> > +
>> >> > +            opc = ARITH_ORR;
>> >> > +            rn = rd;
>> >>
>> >> I think you could get rid of rn and just use rd from the start of the
>> >> loop.  Otherwise acked by me too.
>> >>
>> >
>> > What do you mean exactly? rn has to be 0 when opc is ARITH_MOV in order
>> > to generate a correct ARM instruction.
>>
>> According to my ARM926 manual rn is ignored for MOV/MVN, perhaps it's
>> different in later revisions.
>>
>
> I have just tried, and it actually works (tried on ARMv5 and ARMv7).

Also works under qemu-arm :)

> Note that binutils is not able to disassemble such an instruction and
> outputs in qemu.log something like:
> | 0x01000008:  e3aa50ff  undefined instruction 0xe3aa50ff
>
> However what worries me the most is that the "ARM Architecture Reference
> Manual ARMv7-A and ARMv7-R edition" defines this opcode with the rn field
> as "(0)(0)(0)(0)". Looking at what it means:
>
> | An instruction is UNPREDICTABLE if:
> | [...]
> | * the pseudocode for that encoding does not indicate that a different
> |   special case applies, and a bit marked (0) or (1) in the encoding
> | diagram of an instruction is not 0 or 1 respectively.
>
> In short is it still going to work on newer CPUs?

Perhaps let's be on the safe side and use your version with rn = 0.

I think it *should* work on the new ARM ISAs because of backwards
compatibility: x works under ARMv4 & ARMv5 and x is not listed under
the differences between new and old ISA, thus it needs to work under a
new ISA.

Cheers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] tcg/arm: improve constant loading
  2011-01-09 23:33           ` andrzej zaborowski
@ 2011-01-10  3:41             ` Peter Maydell
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Maydell @ 2011-01-10  3:41 UTC (permalink / raw)
  To: andrzej zaborowski; +Cc: qemu-devel, Aurelien Jarno

On 9 January 2011 23:33, andrzej zaborowski <balrogg@gmail.com> wrote:
> On 9 January 2011 23:40, Aurelien Jarno <aurelien@aurel32.net> wrote:
>> Note that binutils is not able to disassemble such an instruction and
>> outputs in qemu.log something like:
>> | 0x01000008:  e3aa50ff  undefined instruction 0xe3aa50ff
>>
>> However what worries me the most is that the "ARM Architecture Reference
>> Manual ARMv7-A and ARMv7-R edition" defines this opcode with the rn field
>> as "(0)(0)(0)(0)". Looking at what it means:
>>
>> | An instruction is UNPREDICTABLE if:
>> | [...]
>> | * the pseudocode for that encoding does not indicate that a different
>> |   special case applies, and a bit marked (0) or (1) in the encoding
>> | diagram of an instruction is not 0 or 1 respectively.
>>
>> In short is it still going to work on newer CPUs?

It might not work on existing CPUs, never mind newer ones. We
mean it about UNPREDICTABLE :-) Some cores choose to make
patterns which fail these "should be zero/one" checks cause an
UNDEF exception. Some don't.

> I think it *should* work on the new ARM ISAs because of backwards
> compatibility: x works under ARMv4 & ARMv5 and x is not listed under
> the differences between new and old ISA, thus it needs to work under a
> new ISA.

I went back and checked the ARM ARM for ARMv4 (that's ARM
document DUI0100B, dated 1996). It says that for MOV and MVN
bits 19..16 are "SBZ", ie "Should Be Zero", meaning that non-zero
is UNPREDICTABLE. So this isn't a change in behaviour -- the
ISA has always been clear that you should not do it.

[Note for the unwary: UNPREDICTABLE in ARM docs doesn't
mean totally unpredictable -- an implementation isn't allowed to
permit it to be a security hole or to hang the processor, for instance.
But you can't rely on it doing anything useful or consistent.]

-- PMM

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2011-01-10  3:41 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-06 21:54 [Qemu-devel] [PATCH 1/3] tcg/arm: fix branch target change during code retranslation Aurelien Jarno
2011-01-06 21:54 ` [Qemu-devel] [PATCH 2/3] tcg/arm: fix qemu_st64 for big endian targets Aurelien Jarno
2011-01-07 12:37   ` andrzej zaborowski
2011-01-06 21:54 ` [Qemu-devel] [PATCH 3/3] tcg/arm: improve constant loading Aurelien Jarno
2011-01-07 12:52   ` andrzej zaborowski
2011-01-07 12:55     ` andrzej zaborowski
2011-01-07 14:40     ` Aurelien Jarno
2011-01-07 15:56       ` andrzej zaborowski
2011-01-09 22:40         ` Aurelien Jarno
2011-01-09 23:33           ` andrzej zaborowski
2011-01-10  3:41             ` Peter Maydell
2011-01-07  9:12 ` [Qemu-devel] [PATCH 1/3] tcg/arm: fix branch target change during code retranslation Edgar E. Iglesias

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.