* [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG tlb fast lookup
From: Jani Kokkonen @ 2013-05-31 17:51 UTC
  To: Peter Maydell
  Cc: Laurent Desnogues, Claudio Fontana, qemu-devel, Richard Henderson

From: Jani Kokkonen <jani.kokkonen@huawei.com>

This series implements the TCG tlb fast lookup in tcg_out_qemu_ld/st
for the aarch64 TCG target.

It requires the reviewed, but not yet committed, series
"[PATCH v4 0/3] ARM aarch64 TCG target" at:

http://lists.nongnu.org/archive/html/qemu-devel/2013-05/msg04200.html
https://github.com/hw-claudio/qemu/tree/tcg-aarch64-current

Limitations of this initial implementation:
 * CONFIG_SOFTMMU only

Tested on an x86-64 physical machine running the Foundation v8 model,
with a minimal Linux 3.2.0 host system whose user space is based on
Linaro v8 image build 0.8.4423.

Tested guests: ARMv5, PPC64, and i386 Linux test images, as well as
an x86-64/Linux guest built with buildroot.

Claudio Fontana (3):
  tcg/aarch64: more low level ops in preparation of tlb lookup
  tcg/aarch64: implement byte swap operations
  tcg/aarch64: implement sign/zero extend operations

Jani Kokkonen (1):
  tcg/aarch64: implement tlb lookup fast path

 tcg/aarch64/tcg-target.c | 333 ++++++++++++++++++++++++++++++++++++++++++++---
 tcg/aarch64/tcg-target.h |  30 ++---
 2 files changed, 328 insertions(+), 35 deletions(-)

-- 
1.8.1

* [Qemu-devel] [PATCH 1/4] tcg/aarch64: more low level ops in preparation of tlb lookup
From: Jani Kokkonen @ 2013-05-31 17:57 UTC
  To: Peter Maydell
  Cc: Laurent Desnogues, Claudio Fontana, qemu-devel, Richard Henderson

From: Claudio Fontana <claudio.fontana@huawei.com>

For arithmetic operations, add SUBS and a shift parameter so that all
arithmetic instructions can make use of shifted registers.
Also add functions to TST/AND a register against immediate bit patterns.
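
By way of illustration, a few calls using the convention visible in the
encoding below (an editorial sketch, not part of the patch): a positive
shift_imm encodes LSR #shift_imm, a negative one encodes LSL #-shift_imm,
and 0 means no shift.

    /* ADD X2, X2, X0, LSL #3 */
    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2, TCG_REG_X0, -3);
    /* CMP X3, TMP, LSR #12 */
    tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, 12);
    /* TST X0, #0x7: M = 3 set bits, rotation R = 0 */
    tcg_out_tst(s, 1, TCG_REG_X0, 3, 0);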

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
 tcg/aarch64/tcg-target.c | 72 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 58 insertions(+), 14 deletions(-)

diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
index ff626eb..1343d49 100644
--- a/tcg/aarch64/tcg-target.c
+++ b/tcg/aarch64/tcg-target.c
@@ -188,6 +188,7 @@ enum aarch64_ldst_op_type { /* type of operation */
 enum aarch64_arith_opc {
     ARITH_ADD = 0x0b,
     ARITH_SUB = 0x4b,
+    ARITH_SUBS = 0x6b,
     ARITH_AND = 0x0a,
     ARITH_OR = 0x2a,
     ARITH_XOR = 0x4a
@@ -394,12 +395,20 @@ static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
 }
 
 static inline void tcg_out_arith(TCGContext *s, enum aarch64_arith_opc opc,
-                                 int ext, TCGReg rd, TCGReg rn, TCGReg rm)
+                                 int ext, TCGReg rd, TCGReg rn, TCGReg rm,
+                                 int shift_imm)
 {
     /* Using shifted register arithmetic operations */
     /* if extended registry operation (64bit) just OR with 0x80 << 24 */
-    unsigned int base = ext ? (0x80 | opc) << 24 : opc << 24;
-    tcg_out32(s, base | rm << 16 | rn << 5 | rd);
+    unsigned int shift, base = ext ? (0x80 | opc) << 24 : opc << 24;
+    if (shift_imm == 0) {
+        shift = 0;
+    } else if (shift_imm > 0) {
+        shift = shift_imm << 10 | 1 << 22;
+    } else /* (shift_imm < 0) */ {
+        shift = (-shift_imm) << 10;
+    }
+    tcg_out32(s, base | rm << 16 | shift | rn << 5 | rd);
 }
 
 static inline void tcg_out_mul(TCGContext *s, int ext,
@@ -482,11 +491,11 @@ static inline void tcg_out_rotl(TCGContext *s, int ext,
     tcg_out_extr(s, ext, rd, rn, rn, bits - (m & max));
 }
 
-static inline void tcg_out_cmp(TCGContext *s, int ext, TCGReg rn, TCGReg rm)
+static inline void tcg_out_cmp(TCGContext *s, int ext, TCGReg rn, TCGReg rm,
+                               int shift_imm)
 {
     /* Using CMP alias SUBS wzr, Wn, Wm */
-    unsigned int base = ext ? 0xeb00001f : 0x6b00001f;
-    tcg_out32(s, base | rm << 16 | rn << 5);
+    tcg_out_arith(s, ARITH_SUBS, ext, TCG_REG_XZR, rn, rm, shift_imm);
 }
 
 static inline void tcg_out_cset(TCGContext *s, int ext, TCGReg rd, TCGCond c)
@@ -569,6 +578,40 @@ static inline void tcg_out_call(TCGContext *s, tcg_target_long target)
     }
 }
 
+/* encode a logical immediate, mapping user parameter
+   M=set bits pattern length to S=M-1 */
+static inline unsigned int
+aarch64_limm(unsigned int m, unsigned int r)
+{
+    assert(m > 0);
+    return r << 16 | (m - 1) << 10;
+}
+
+/* test a register against an immediate bit pattern made of
+   M set bits rotated right by R.
+   Examples:
+   to test a 32/64 reg against 0x00000007, pass M = 3,  R = 0.
+   to test a 32/64 reg against 0x000000ff, pass M = 8,  R = 0.
+   to test a 32bit reg against 0xff000000, pass M = 8,  R = 8.
+   to test a 32bit reg against 0xff0000ff, pass M = 16, R = 8.
+ */
+static inline void tcg_out_tst(TCGContext *s, int ext, TCGReg rn,
+                               unsigned int m, unsigned int r)
+{
+    /* using TST alias of ANDS XZR, Xn,#bimm64 0x7200001f */
+    unsigned int base = ext ? 0xf240001f : 0x7200001f;
+    tcg_out32(s, base | aarch64_limm(m, r) | rn << 5);
+}
+
+/* and a register with a bit pattern, similarly to TST, no flags change */
+static inline void tcg_out_andi(TCGContext *s, int ext, TCGReg rd, TCGReg rn,
+                                unsigned int m, unsigned int r)
+{
+    /* using AND 0x12000000 */
+    unsigned int base = ext ? 0x92400000 : 0x12000000;
+    tcg_out32(s, base | aarch64_limm(m, r) | rn << 5 | rd);
+}
+
 static inline void tcg_out_ret(TCGContext *s)
 {
     /* emit RET { LR } */
@@ -830,31 +873,31 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_add_i64:
         ext = 1; /* fall through */
     case INDEX_op_add_i32:
-        tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2]);
+        tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2], 0);
         break;
 
     case INDEX_op_sub_i64:
         ext = 1; /* fall through */
     case INDEX_op_sub_i32:
-        tcg_out_arith(s, ARITH_SUB, ext, args[0], args[1], args[2]);
+        tcg_out_arith(s, ARITH_SUB, ext, args[0], args[1], args[2], 0);
         break;
 
     case INDEX_op_and_i64:
         ext = 1; /* fall through */
     case INDEX_op_and_i32:
-        tcg_out_arith(s, ARITH_AND, ext, args[0], args[1], args[2]);
+        tcg_out_arith(s, ARITH_AND, ext, args[0], args[1], args[2], 0);
         break;
 
     case INDEX_op_or_i64:
         ext = 1; /* fall through */
     case INDEX_op_or_i32:
-        tcg_out_arith(s, ARITH_OR, ext, args[0], args[1], args[2]);
+        tcg_out_arith(s, ARITH_OR, ext, args[0], args[1], args[2], 0);
         break;
 
     case INDEX_op_xor_i64:
         ext = 1; /* fall through */
     case INDEX_op_xor_i32:
-        tcg_out_arith(s, ARITH_XOR, ext, args[0], args[1], args[2]);
+        tcg_out_arith(s, ARITH_XOR, ext, args[0], args[1], args[2], 0);
         break;
 
     case INDEX_op_mul_i64:
@@ -909,7 +952,8 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
         if (const_args[2]) {    /* ROR / EXTR Wd, Wm, Wm, 32 - m */
             tcg_out_rotl(s, ext, args[0], args[1], args[2]);
         } else {
-            tcg_out_arith(s, ARITH_SUB, 0, TCG_REG_TMP, TCG_REG_XZR, args[2]);
+            tcg_out_arith(s, ARITH_SUB, 0,
+                          TCG_REG_TMP, TCG_REG_XZR, args[2], 0);
             tcg_out_shiftrot_reg(s, SRR_ROR, ext,
                                  args[0], args[1], TCG_REG_TMP);
         }
@@ -918,14 +962,14 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_brcond_i64:
         ext = 1; /* fall through */
     case INDEX_op_brcond_i32: /* CMP 0, 1, cond(2), label 3 */
-        tcg_out_cmp(s, ext, args[0], args[1]);
+        tcg_out_cmp(s, ext, args[0], args[1], 0);
         tcg_out_goto_label_cond(s, args[2], args[3]);
         break;
 
     case INDEX_op_setcond_i64:
         ext = 1; /* fall through */
     case INDEX_op_setcond_i32:
-        tcg_out_cmp(s, ext, args[1], args[2]);
+        tcg_out_cmp(s, ext, args[1], args[2], 0);
         tcg_out_cset(s, 0, args[0], args[3]);
         break;
 
-- 
1.8.1

* [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement byte swap operations
From: Jani Kokkonen @ 2013-05-31 18:01 UTC
  To: Peter Maydell
  Cc: Laurent Desnogues, Claudio Fontana, qemu-devel, Richard Henderson

From: Claudio Fontana <claudio.fontana@huawei.com>

implement the optional byte swap operations with the dedicated
aarch64 instructions.
These instructions are also needed for the tlb lookup.
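
For reference, the byte-order semantics these instructions provide, as a
host-side C sketch (an editorial illustration using GCC builtins, not
part of the patch):

    #include <stdint.h>

    /* REV Wd, Wn / REV Xd, Xn: reverse all bytes of the register */
    static uint32_t rev_w(uint32_t x) { return __builtin_bswap32(x); }
    static uint64_t rev_x(uint64_t x) { return __builtin_bswap64(x); }

    /* REV16: swap the two bytes within each 16-bit halfword */
    static uint32_t rev16_w(uint32_t x)
    {
        return ((x & 0x00ff00ffu) << 8) | ((x & 0xff00ff00u) >> 8);
    }

    /* REV32 Xd, Xn: reverse the bytes within each 32-bit word */
    static uint64_t rev32_x(uint64_t x)
    {
        return ((uint64_t)__builtin_bswap32((uint32_t)(x >> 32)) << 32)
             | __builtin_bswap32((uint32_t)x);
    }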

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
 tcg/aarch64/tcg-target.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 tcg/aarch64/tcg-target.h | 10 +++++-----
 2 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
index 1343d49..716c987 100644
--- a/tcg/aarch64/tcg-target.c
+++ b/tcg/aarch64/tcg-target.c
@@ -658,6 +658,27 @@ static inline void tcg_out_goto_label_cond(TCGContext *s,
     }
 }
 
+static inline void tcg_out_rev(TCGContext *s, int ext, TCGReg rd, TCGReg rm)
+{
+    /* using REV 0x5ac00800 */
+    unsigned int base = ext ? 0xdac00c00 : 0x5ac00800;
+    tcg_out32(s, base | rm << 5 | rd);
+}
+
+static inline void tcg_out_rev16(TCGContext *s, int ext, TCGReg rd, TCGReg rm)
+{
+    /* using REV16 0x5ac00400 */
+    unsigned int base = ext ? 0xdac00400 : 0x5ac00400;
+    tcg_out32(s, base | rm << 5 | rd);
+}
+
+static inline void tcg_out_rev32(TCGContext *s, TCGReg rd, TCGReg rm)
+{
+    /* using REV32 0xdac00800 */
+    unsigned int base = 0xdac00800;
+    tcg_out32(s, base | rm << 5 | rd);
+}
+
 #ifdef CONFIG_SOFTMMU
 #include "exec/softmmu_defs.h"
 
@@ -1010,6 +1031,20 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
         tcg_out_qemu_st(s, args, 3);
         break;
 
+    case INDEX_op_bswap64_i64:
+        ext = 1; /* fall through */
+    case INDEX_op_bswap32_i32:
+        tcg_out_rev(s, ext, args[0], args[1]);
+        break;
+    case INDEX_op_bswap16_i64:
+        ext = 1; /* fall through */
+    case INDEX_op_bswap16_i32:
+        tcg_out_rev16(s, ext, args[0], args[1]);
+        break;
+    case INDEX_op_bswap32_i64:
+        tcg_out_rev32(s, args[0], args[1]);
+        break;
+
     default:
         tcg_abort(); /* opcode not implemented */
     }
@@ -1091,6 +1126,13 @@ static const TCGTargetOpDef aarch64_op_defs[] = {
     { INDEX_op_qemu_st16, { "l", "l" } },
     { INDEX_op_qemu_st32, { "l", "l" } },
     { INDEX_op_qemu_st64, { "l", "l" } },
+
+    { INDEX_op_bswap16_i32, { "r", "r" } },
+    { INDEX_op_bswap32_i32, { "r", "r" } },
+    { INDEX_op_bswap16_i64, { "r", "r" } },
+    { INDEX_op_bswap32_i64, { "r", "r" } },
+    { INDEX_op_bswap64_i64, { "r", "r" } },
+
     { -1 },
 };
 
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 075ab2a..247ef43 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -44,8 +44,8 @@ typedef enum {
 #define TCG_TARGET_HAS_ext16s_i32       0
 #define TCG_TARGET_HAS_ext8u_i32        0
 #define TCG_TARGET_HAS_ext16u_i32       0
-#define TCG_TARGET_HAS_bswap16_i32      0
-#define TCG_TARGET_HAS_bswap32_i32      0
+#define TCG_TARGET_HAS_bswap16_i32      1
+#define TCG_TARGET_HAS_bswap32_i32      1
 #define TCG_TARGET_HAS_not_i32          0
 #define TCG_TARGET_HAS_neg_i32          0
 #define TCG_TARGET_HAS_rot_i32          1
@@ -68,9 +68,9 @@ typedef enum {
 #define TCG_TARGET_HAS_ext8u_i64        0
 #define TCG_TARGET_HAS_ext16u_i64       0
 #define TCG_TARGET_HAS_ext32u_i64       0
-#define TCG_TARGET_HAS_bswap16_i64      0
-#define TCG_TARGET_HAS_bswap32_i64      0
-#define TCG_TARGET_HAS_bswap64_i64      0
+#define TCG_TARGET_HAS_bswap16_i64      1
+#define TCG_TARGET_HAS_bswap32_i64      1
+#define TCG_TARGET_HAS_bswap64_i64      1
 #define TCG_TARGET_HAS_not_i64          0
 #define TCG_TARGET_HAS_neg_i64          0
 #define TCG_TARGET_HAS_rot_i64          1
-- 
1.8.1

* [Qemu-devel] [PATCH 3/4] tcg/aarch64: implement sign/zero extend operations
From: Jani Kokkonen @ 2013-05-31 18:05 UTC
  To: Peter Maydell
  Cc: Laurent Desnogues, Claudio Fontana, qemu-devel, Richard Henderson

From: Claudio Fontana <claudio.fontana@huawei.com>

implement the optional sign/zero extend operations with the dedicated
aarch64 instructions.
These instructions are also needed for the tlb lookup.
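
For reference, the extension semantics these instructions provide, as a
host-side C sketch (an editorial illustration, not part of the patch;
s_bits selects the operand size as in the code below: 0 = byte,
1 = halfword, 2 = word):

    #include <stdint.h>

    /* SXTB/SXTH/SXTW: sign-extend the low 8 * (1 << s_bits) bits */
    static int64_t sxt(uint64_t x, int s_bits)
    {
        int bits = 8 * (1 << s_bits);
        uint64_t sign = 1ull << (bits - 1);
        x &= (1ull << bits) - 1;              /* keep only the low bits */
        return (int64_t)((x ^ sign) - sign);  /* propagate the sign bit */
    }

    /* UXTB/UXTH: zero-extend the low 8 * (1 << s_bits) bits */
    static uint64_t uxt(uint64_t x, int s_bits)
    {
        int bits = 8 * (1 << s_bits);
        return x & ((1ull << bits) - 1);
    }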

Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
 tcg/aarch64/tcg-target.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++--
 tcg/aarch64/tcg-target.h | 20 ++++++++---------
 2 files changed, 66 insertions(+), 12 deletions(-)

diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
index 716c987..24b2862 100644
--- a/tcg/aarch64/tcg-target.c
+++ b/tcg/aarch64/tcg-target.c
@@ -679,6 +679,24 @@ static inline void tcg_out_rev32(TCGContext *s, TCGReg rd, TCGReg rm)
     tcg_out32(s, base | rm << 5 | rd);
 }
 
+static inline void tcg_out_sxt(TCGContext *s, int ext, int s_bits,
+                               TCGReg rd, TCGReg rn)
+{
+    /* using ALIASes SXTB 0x13001c00, SXTH 0x13003c00, SXTW 0x93407c00
+       of SBFM Xd, Xn, #0, #7|15|31 */
+    int bits = 8 * (1 << s_bits) - 1;
+    tcg_out_sbfm(s, ext, rd, rn, 0, bits);
+}
+
+static inline void tcg_out_uxt(TCGContext *s, int s_bits,
+                               TCGReg rd, TCGReg rn)
+{
+    /* using ALIASes UXTB 0x53001c00, UXTH 0x53003c00
+       of UBFM Wd, Wn, #0, #7|15 and mov */
+    int bits = 8 * (1 << s_bits) - 1;
+    tcg_out_ubfm(s, 0, rd, rn, 0, bits);
+}
+
 #ifdef CONFIG_SOFTMMU
 #include "exec/softmmu_defs.h"
 
@@ -726,8 +744,7 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
     tcg_out_callr(s, TCG_REG_TMP);
 
     if (opc & 0x04) { /* sign extend */
-        unsigned int bits = 8 * (1 << s_bits) - 1;
-        tcg_out_sbfm(s, 1, data_reg, TCG_REG_X0, 0, bits); /* 7|15|31 */
+        tcg_out_sxt(s, 1, s_bits, data_reg, TCG_REG_X0);
     } else {
         tcg_out_movr(s, 1, data_reg, TCG_REG_X0);
     }
@@ -1045,6 +1062,31 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
         tcg_out_rev32(s, args[0], args[1]);
         break;
 
+    case INDEX_op_ext8s_i64:
+        ext = 1; /* fall through */
+    case INDEX_op_ext8s_i32:
+        tcg_out_sxt(s, ext, 0, args[0], args[1]);
+        break;
+    case INDEX_op_ext16s_i64:
+        ext = 1; /* fall through */
+    case INDEX_op_ext16s_i32:
+        tcg_out_sxt(s, ext, 1, args[0], args[1]);
+        break;
+    case INDEX_op_ext32s_i64:
+        tcg_out_sxt(s, 1, 2, args[0], args[1]);
+        break;
+    case INDEX_op_ext8u_i64:
+    case INDEX_op_ext8u_i32:
+        tcg_out_uxt(s, 0, args[0], args[1]);
+        break;
+    case INDEX_op_ext16u_i64:
+    case INDEX_op_ext16u_i32:
+        tcg_out_uxt(s, 1, args[0], args[1]);
+        break;
+    case INDEX_op_ext32u_i64:
+        tcg_out_movr(s, 0, args[0], args[1]);
+        break;
+
     default:
         tcg_abort(); /* opcode not implemented */
     }
@@ -1133,6 +1175,18 @@ static const TCGTargetOpDef aarch64_op_defs[] = {
     { INDEX_op_bswap32_i64, { "r", "r" } },
     { INDEX_op_bswap64_i64, { "r", "r" } },
 
+    { INDEX_op_ext8s_i32, { "r", "r" } },
+    { INDEX_op_ext16s_i32, { "r", "r" } },
+    { INDEX_op_ext8u_i32, { "r", "r" } },
+    { INDEX_op_ext16u_i32, { "r", "r" } },
+
+    { INDEX_op_ext8s_i64, { "r", "r" } },
+    { INDEX_op_ext16s_i64, { "r", "r" } },
+    { INDEX_op_ext32s_i64, { "r", "r" } },
+    { INDEX_op_ext8u_i64, { "r", "r" } },
+    { INDEX_op_ext16u_i64, { "r", "r" } },
+    { INDEX_op_ext32u_i64, { "r", "r" } },
+
     { -1 },
 };
 
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 247ef43..97e4a5b 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -40,10 +40,10 @@ typedef enum {
 
 /* optional instructions */
 #define TCG_TARGET_HAS_div_i32          0
-#define TCG_TARGET_HAS_ext8s_i32        0
-#define TCG_TARGET_HAS_ext16s_i32       0
-#define TCG_TARGET_HAS_ext8u_i32        0
-#define TCG_TARGET_HAS_ext16u_i32       0
+#define TCG_TARGET_HAS_ext8s_i32        1
+#define TCG_TARGET_HAS_ext16s_i32       1
+#define TCG_TARGET_HAS_ext8u_i32        1
+#define TCG_TARGET_HAS_ext16u_i32       1
 #define TCG_TARGET_HAS_bswap16_i32      1
 #define TCG_TARGET_HAS_bswap32_i32      1
 #define TCG_TARGET_HAS_not_i32          0
@@ -62,12 +62,12 @@ typedef enum {
 #define TCG_TARGET_HAS_muls2_i32        0
 
 #define TCG_TARGET_HAS_div_i64          0
-#define TCG_TARGET_HAS_ext8s_i64        0
-#define TCG_TARGET_HAS_ext16s_i64       0
-#define TCG_TARGET_HAS_ext32s_i64       0
-#define TCG_TARGET_HAS_ext8u_i64        0
-#define TCG_TARGET_HAS_ext16u_i64       0
-#define TCG_TARGET_HAS_ext32u_i64       0
+#define TCG_TARGET_HAS_ext8s_i64        1
+#define TCG_TARGET_HAS_ext16s_i64       1
+#define TCG_TARGET_HAS_ext32s_i64       1
+#define TCG_TARGET_HAS_ext8u_i64        1
+#define TCG_TARGET_HAS_ext16u_i64       1
+#define TCG_TARGET_HAS_ext32u_i64       1
 #define TCG_TARGET_HAS_bswap16_i64      1
 #define TCG_TARGET_HAS_bswap32_i64      1
 #define TCG_TARGET_HAS_bswap64_i64      1
-- 
1.8.1

* [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
From: Jani Kokkonen @ 2013-05-31 18:07 UTC
  To: Peter Maydell
  Cc: Laurent Desnogues, Claudio Fontana, qemu-devel, Richard Henderson

From: Jani Kokkonen <jani.kokkonen@huawei.com>

implement the fast path for tcg_out_qemu_ld/st.
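
For orientation, here is the lookup this fast path performs, written as
a C model (an editorial sketch of the softmmu scheme; it ignores TLB
flag bits, and returns NULL where the generated code branches to the
slow-path helper call):

    /* conceptual C model of the generated fast path, for a load */
    static inline uint8_t *tlb_fast_path(CPUArchState *env,
                                         target_ulong addr,
                                         int mem_index, int s_bits)
    {
        target_ulong page = addr >> TARGET_PAGE_BITS;
        unsigned idx = page & ((1 << CPU_TLB_BITS) - 1);
        CPUTLBEntry *entry = &env->tlb_table[mem_index][idx];

        if (addr & ((1 << s_bits) - 1)) {
            return NULL;                      /* unaligned access */
        }
        if (entry->addr_read != (page << TARGET_PAGE_BITS)) {
            return NULL;                      /* TLB miss */
        }
        return (uint8_t *)(uintptr_t)(addr + entry->addend);
    }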

Signed-off-by: Jani Kokkonen <jani.kokkonen@huawei.com>
---
 tcg/aarch64/tcg-target.c | 161 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 157 insertions(+), 4 deletions(-)

diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
index 24b2862..47ec4a7 100644
--- a/tcg/aarch64/tcg-target.c
+++ b/tcg/aarch64/tcg-target.c
@@ -700,6 +700,36 @@ static inline void tcg_out_uxt(TCGContext *s, int s_bits,
 #ifdef CONFIG_SOFTMMU
 #include "exec/softmmu_defs.h"
 
+/* Load and compare a TLB entry, leaving the flags set.  Leaves X2 pointing
+   to the tlb entry.  Clobbers X0,X1,X2,X3 and TMP.  */
+
+static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
+                             int s_bits, uint8_t **label_ptr, int tlb_offset)
+{
+    TCGReg base = TCG_AREG0;
+
+    tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
+    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
+    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
+    tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
+    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
+                             TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
+#if TARGET_LONG_BITS == 64
+    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
+#else
+    tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
+#endif
+    /* check alignment */
+    if (s_bits) {
+        tcg_out_tst(s, 1, addr_reg, s_bits, 0);
+        label_ptr[0] = s->code_ptr;
+        tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
+    }
+    tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
+    label_ptr[1] = s->code_ptr;
+    tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
+}
+
 /* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
    int mmu_idx) */
 static const void * const qemu_ld_helpers[4] = {
@@ -723,18 +753,85 @@ static const void * const qemu_st_helpers[4] = {
 static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
 {
     TCGReg addr_reg, data_reg;
+    bool bswap;
 #ifdef CONFIG_SOFTMMU
     int mem_index, s_bits;
+    int i;
+    uint8_t *label_ptr[2] = { NULL };
+    uint8_t *label_ptr2;
 #endif
     data_reg = args[0];
     addr_reg = args[1];
+#ifdef TARGET_WORDS_BIGENDIAN
+    bswap = 1;
+#else
+    bswap = 0;
+#endif
 
 #ifdef CONFIG_SOFTMMU
     mem_index = args[2];
     s_bits = opc & 3;
 
-    /* TODO: insert TLB lookup here */
+    tcg_out_tlb_read(s, addr_reg, s_bits, label_ptr,
+                     offsetof(CPUArchState, tlb_table[mem_index][0].addr_read));
 
+    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X1, TCG_REG_X2,
+             offsetof(CPUTLBEntry, addend) - offsetof(CPUTLBEntry, addr_read));
+    switch (opc) {
+    case 0:
+        tcg_out_ldst_r(s, LDST_8, LDST_LD, data_reg, addr_reg, TCG_REG_X1);
+        break;
+    case 0 | 4:
+        tcg_out_ldst_r(s, LDST_8, LDST_LD_S_X, data_reg, addr_reg, TCG_REG_X1);
+        break;
+    case 1:
+        tcg_out_ldst_r(s, LDST_16, LDST_LD, data_reg, addr_reg, TCG_REG_X1);
+        if (bswap) {
+            tcg_out_rev16(s, 1, data_reg, data_reg);
+        }
+        break;
+    case 1 | 4:
+        if (bswap) {
+            tcg_out_ldst_r(s, LDST_16, LDST_LD, data_reg, addr_reg, TCG_REG_X1);
+            tcg_out_rev16(s, 1, data_reg, data_reg);
+            tcg_out_sxt(s, 1, s_bits, data_reg, data_reg);
+        } else {
+            tcg_out_ldst_r(s, LDST_16, LDST_LD_S_X,
+                           data_reg, addr_reg, TCG_REG_X1);
+        }
+        break;
+    case 2:
+        tcg_out_ldst_r(s, LDST_32, LDST_LD, data_reg, addr_reg, TCG_REG_X1);
+        if (bswap) {
+            tcg_out_rev32(s, data_reg, data_reg);
+        }
+        break;
+    case 2 | 4:
+        if (bswap) {
+            tcg_out_ldst_r(s, LDST_32, LDST_LD, data_reg, addr_reg, TCG_REG_X1);
+            tcg_out_rev32(s, data_reg, data_reg);
+            tcg_out_sxt(s, 1, s_bits, data_reg, data_reg);
+        } else {
+            tcg_out_ldst_r(s, LDST_32, LDST_LD_S_X,
+                           data_reg, addr_reg, TCG_REG_X1);
+        }
+        break;
+    case 3:
+        tcg_out_ldst_r(s, LDST_64, LDST_LD, data_reg, addr_reg, TCG_REG_X1);
+        if (bswap) {
+            tcg_out_rev(s, 1, data_reg, data_reg);
+        }
+        break;
+    default:
+        tcg_abort();
+    }
+    label_ptr2 = s->code_ptr;
+    tcg_out_goto_noaddr(s);
+    for (i = 0; i < 2; i++) {
+        if (label_ptr[i]) {
+            reloc_pc19(label_ptr[i], (tcg_target_long)s->code_ptr);
+        }
+    }
     /* all arguments passed via registers */
     tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
     tcg_out_movr(s, (TARGET_LONG_BITS == 64), TCG_REG_X1, addr_reg);
@@ -748,7 +845,7 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
     } else {
         tcg_out_movr(s, 1, data_reg, TCG_REG_X0);
     }
-
+    reloc_pc26(label_ptr2, (tcg_target_long)s->code_ptr);
 #else /* !CONFIG_SOFTMMU */
     tcg_abort(); /* TODO */
 #endif
@@ -757,8 +854,17 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
 static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
 {
     TCGReg addr_reg, data_reg;
+    bool bswap;
 #ifdef CONFIG_SOFTMMU
     int mem_index, s_bits;
+    int i;
+    uint8_t *label_ptr[2] = { NULL };
+    uint8_t *label_ptr2;
+#endif
+#ifdef TARGET_WORDS_BIGENDIAN
+    bswap = 1;
+#else
+    bswap = 0;
 #endif
     data_reg = args[0];
     addr_reg = args[1];
@@ -767,8 +873,55 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
     mem_index = args[2];
     s_bits = opc & 3;
 
-    /* TODO: insert TLB lookup here */
+    tcg_out_tlb_read(s, addr_reg, s_bits, label_ptr,
+                  offsetof(CPUArchState, tlb_table[mem_index][0].addr_write));
 
+    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X1, TCG_REG_X2,
+           offsetof(CPUTLBEntry, addend) - offsetof(CPUTLBEntry, addr_write));
+    switch (opc) {
+    case 0:
+        tcg_out_ldst_r(s, LDST_8, LDST_ST, data_reg, addr_reg, TCG_REG_X1);
+        break;
+    case 1:
+        if (bswap) {
+            tcg_out_rev16(s, 1, TCG_REG_X0, data_reg);
+            tcg_out_ldst_r(s, LDST_16, LDST_ST, TCG_REG_X0,
+                      addr_reg, TCG_REG_X1);
+        } else {
+            tcg_out_ldst_r(s, LDST_16, LDST_ST, data_reg,
+                      addr_reg, TCG_REG_X1);
+        }
+        break;
+    case 2:
+        if (bswap) {
+            tcg_out_rev32(s, TCG_REG_X0, data_reg);
+            tcg_out_ldst_r(s, LDST_32, LDST_ST, TCG_REG_X0,
+                       addr_reg, TCG_REG_X1);
+        } else {
+            tcg_out_ldst_r(s, LDST_32, LDST_ST, data_reg,
+                       addr_reg, TCG_REG_X1);
+        }
+        break;
+    case 3:
+        if (bswap) {
+            tcg_out_rev(s, 1, TCG_REG_X0, data_reg);
+            tcg_out_ldst_r(s, LDST_64, LDST_ST, TCG_REG_X0,
+                       addr_reg, TCG_REG_X1);
+        } else {
+            tcg_out_ldst_r(s, LDST_64, LDST_ST, data_reg,
+                       addr_reg, TCG_REG_X1);
+        }
+        break;
+    default:
+        tcg_abort();
+    }
+    label_ptr2 = s->code_ptr;
+    tcg_out_goto_noaddr(s);
+    for (i = 0; i < 2; i++) {
+        if (label_ptr[i]) {
+            reloc_pc19(label_ptr[i], (tcg_target_long)s->code_ptr);
+        }
+    }
     /* all arguments passed via registers */
     tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
     tcg_out_movr(s, (TARGET_LONG_BITS == 64), TCG_REG_X1, addr_reg);
@@ -777,7 +930,7 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
     tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP,
                  (tcg_target_long)qemu_st_helpers[s_bits]);
     tcg_out_callr(s, TCG_REG_TMP);
-
+    reloc_pc26(label_ptr2, (tcg_target_long)s->code_ptr);
 #else /* !CONFIG_SOFTMMU */
     tcg_abort(); /* TODO */
 #endif
-- 
1.8.1

* Re: [Qemu-devel] [PATCH 1/4] tcg/aarch64: more low level ops in preparation of tlb lookup
From: Richard Henderson @ 2013-05-31 19:07 UTC
  To: Jani Kokkonen
  Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel

On 05/31/2013 10:57 AM, Jani Kokkonen wrote:
> +    ARITH_SUBS = 0x6b,

Any reason you're adding SUBS here, but not ANDS?

> +/* encode a logical immediate, mapping user parameter
> +   M=set bits pattern length to S=M-1 */
> +static inline unsigned int
> +aarch64_limm(unsigned int m, unsigned int r)
> +{
> +    assert(m > 0);
> +    return r << 16 | (m - 1) << 10;
> +}
> +
> +/* test a register against an immediate bit pattern made of
> +   M set bits rotated right by R.
> +   Examples:
> +   to test a 32/64 reg against 0x00000007, pass M = 3,  R = 0.
> +   to test a 32/64 reg against 0x000000ff, pass M = 8,  R = 0.
> +   to test a 32bit reg against 0xff000000, pass M = 8,  R = 8.
> +   to test a 32bit reg against 0xff0000ff, pass M = 16, R = 8.
> + */
> +static inline void tcg_out_tst(TCGContext *s, int ext, TCGReg rn,
> +                               unsigned int m, unsigned int r)
> +{
> +    /* using TST alias of ANDS XZR, Xn,#bimm64 0x7200001f */
> +    unsigned int base = ext ? 0xf240001f : 0x7200001f;
> +    tcg_out32(s, base | aarch64_limm(m, r) | rn << 5);
> +}
> +
> +/* and a register with a bit pattern, similarly to TST, no flags change */
> +static inline void tcg_out_andi(TCGContext *s, int ext, TCGReg rd, TCGReg rn,
> +                                unsigned int m, unsigned int r)
> +{
> +    /* using AND 0x12000000 */
> +    unsigned int base = ext ? 0x92400000 : 0x12000000;
> +    tcg_out32(s, base | aarch64_limm(m, r) | rn << 5 | rd);
> +}
> +

This should be a separate patch, since it's not related to the tcg_out_arith
change.


r~

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement byte swap operations
From: Richard Henderson @ 2013-05-31 19:11 UTC
  To: Jani Kokkonen
  Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel

On 05/31/2013 11:01 AM, Jani Kokkonen wrote:
> +static inline void tcg_out_rev(TCGContext *s, int ext, TCGReg rd, TCGReg rm)
> +{
> +    /* using REV 0x5ac00800 */
> +    unsigned int base = ext ? 0xdac00c00 : 0x5ac00800;
> +    tcg_out32(s, base | rm << 5 | rd);
> +}
> +
> +static inline void tcg_out_rev16(TCGContext *s, int ext, TCGReg rd, TCGReg rm)
> +{
> +    /* using REV16 0x5ac00400 */
> +    unsigned int base = ext ? 0xdac00400 : 0x5ac00400;
> +    tcg_out32(s, base | rm << 5 | rd);
> +}
> +
> +static inline void tcg_out_rev32(TCGContext *s, TCGReg rd, TCGReg rm)
> +{
> +    /* using REV32 0xdac00800 */
> +    unsigned int base = 0xdac00800;
> +    tcg_out32(s, base | rm << 5 | rd);
> +}

You don't actually need rev32.

> * bswap32_i32/i64 t0, t1
> 
> 32 bit byte swap on a 32/64 bit value. With a 64 bit value, it assumes that
> the four high order bytes are set to zero.

The fact that the high order bytes are known to be zero means that you
can always use tcg_out_rev with ext=0.

    case INDEX_op_bswap64_i64:
        ext = 1;
        /* FALLTHRU */
    case INDEX_op_bswap32_i64:
    case INDEX_op_bswap32_i32:
        tcg_out_rev(s, ext, args[0], args[1]);
        break;
    case INDEX_op_bswap16_i64:
    case INDEX_op_bswap16_i32:
        tcg_out_rev16(s, 0, args[0], args[1]);
        break;


r~

* Re: [Qemu-devel] [PATCH 3/4] tcg/aarch64: implement sign/zero extend operations
From: Richard Henderson @ 2013-05-31 19:13 UTC
  To: Jani Kokkonen
  Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel

On 05/31/2013 11:05 AM, Jani Kokkonen wrote:
> +static inline void tcg_out_uxt(TCGContext *s, int s_bits,
> +                               TCGReg rd, TCGReg rn)
> +{
> +    /* using ALIASes UXTB 0x53001c00, UXTH 0x53003c00
> +       of UBFM Wd, Wn, #0, #7|15 and mov */
> +    int bits = 8 * (1 << s_bits) - 1;
> +    tcg_out_ubfm(s, 0, rd, rn, 0, bits);
> +}

Err, ubfm never generates mov, does it?

Yes, you do that later,

> +    case INDEX_op_ext32u_i64:
> +        tcg_out_movr(s, 0, args[0], args[1]);
> +        break;

but the comment isn't actually correct in tcg_out_uxt, surely?
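
For context, a C model of UBFM's bitfield-extract form (an editorial
sketch, not code from the thread; valid for immr <= imms):

    /* UBFM Wd, Wn, #immr, #imms with immr <= imms: extract
       imms - immr + 1 bits starting at bit immr, zero-extended */
    static uint64_t ubfm_extract(uint64_t x, unsigned immr, unsigned imms)
    {
        return (x >> immr) & ((2ull << (imms - immr)) - 1);
    }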


r~

* Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
From: Richard Henderson @ 2013-05-31 20:25 UTC
  To: Jani Kokkonen
  Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel

On 05/31/2013 11:07 AM, Jani Kokkonen wrote:
> +/* Load and compare a TLB entry, leaving the flags set.  Leaves X2 pointing
> +   to the tlb entry.  Clobbers X0,X1,X2,X3 and TMP.  */
> +
> +static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
> +                             int s_bits, uint8_t **label_ptr, int tlb_offset)
> +{

You copied the comment from ARM, and it isn't correct.  You generate branches.

> +    TCGReg base = TCG_AREG0;
> +
> +    tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
> +    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
> +    tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
> +                             TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
> +#if TARGET_LONG_BITS == 64
> +    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
> +#else
> +    tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
> +#endif
> +    /* check alignment */
> +    if (s_bits) {
> +        tcg_out_tst(s, 1, addr_reg, s_bits, 0);
> +        label_ptr[0] = s->code_ptr;
> +        tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
> +    }
> +    tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
> +    label_ptr[1] = s->code_ptr;
> +    tcg_out_goto_cond_noaddr(s, TCG_COND_NE);

I'm positive that the branch predictor would be happier with a single branch
rather than the two you generate here.  It ought to be possible to use a
different set of insns to do this in one go.

How about something like

	@ extract the tlb index from the address
	ubfm	w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS

	@ add any "high bits" from the tlb offset
	@ noting that env will be much smaller than 24 bits.
	add	x1, env, tlb_offset & 0xfff000

	@ zap the tlb index from the address for compare
	@ this is all high bits plus 0-3 low bits set, so this
	@ should match a logical immediate.
	and	w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1)

	@ merge the tlb index into the env+tlb_offset
	add	x1, x1, x0, lsl #3

	@ load the tlb comparator.  the 12-bit scaled offset
	@ form will fit the bits remaining from above, given that
	@ we're loading an aligned object, and so the low 2/3 bits
	@ will be clear.
	ldr	w/x0, [x1, tlb_offset & 0xfff]

	@ load the tlb addend.  do this early to avoid stalling.
	@ the addend_offset differs from tlb_offset by 1-3 words.
	@ given that we've got overlap between the scaled 12-bit
	@ value and the 12-bit shifted value above, this also ought
	@ to always be representable.
	ldr	x3, [x1, (tlb_offset & 0xfff) + (addend_offset - tlb_offset)]

	@ perform the comparison
	cmp	w/x0, w/x2

	@ generate the complete host address in parallel with the cmp.
	add	x3, x3, addr_reg		@ 64-bit guest
	add	x3, x3, addr_reg, uxtw		@ 32-bit guest

	bne	miss_label

Note that the w/x above indicates the ext setting that ought to be used,
depending on the address size of the guest.

This is at least 2 insns shorter than your sequence.

Have you looked at doing the out-of-line tlb miss sequence right from the
very beginning?  It's not that much more difficult to accomplish than the
inline tlb miss.

See CONFIG_QEMU_LDST_OPTIMIZATION, and the implementation in tcg/arm.
You won't need two nops after the call; aarch64 can do all the required
extensions and data movement operations in a single insn.
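
A rough sketch of the bookkeeping that scheme needs, assuming the shape
used by the existing backends (field names here are illustrative, not
the exact tcg/arm definitions): the fast path records one label per
qemu_ld/st, the slow paths are emitted out of line at the end of the
translation block, and each one jumps back to raddr when done.

    /* one record per qemu_ld/st whose TLB check may miss */
    typedef struct QemuLdstLabel {
        bool is_ld;              /* load or store */
        int opc;                 /* size and signedness of the access */
        TCGReg addr_reg;         /* guest virtual address register */
        TCGReg data_reg;         /* data value register */
        int mem_index;           /* softmmu TLB index */
        uint8_t *label_ptr[2];   /* fast-path branches to patch */
        uint8_t *raddr;          /* fast-path address to return to */
    } QemuLdstLabel;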


r~

* Re: [Qemu-devel] [PATCH 1/4] tcg/aarch64: more low level ops in preparation of tlb lookup
From: Claudio Fontana @ 2013-06-03  9:43 UTC
  To: Richard Henderson
  Cc: Laurent Desnogues, Peter Maydell, Jani Kokkonen, qemu-devel

On 31.05.2013 21:07, Richard Henderson wrote:
> On 05/31/2013 10:57 AM, Jani Kokkonen wrote:
>> +    ARITH_SUBS = 0x6b,
> 
> Any reason you're adding SUBS here, but not ANDS?

I also forgot ANDS, I'll add them and reorder.
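
Presumably along these lines, taking the opcode byte from the A64
shifted-register logical encoding (a sketch, not the posted follow-up):

    ARITH_ANDS = 0x6a,  /* as with the others, the 64-bit form ORs in 0x80 */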

>> +/* encode a logical immediate, mapping user parameter
>> +   M=set bits pattern length to S=M-1 */
>> +static inline unsigned int
>> +aarch64_limm(unsigned int m, unsigned int r)
>> +{
>> +    assert(m > 0);
>> +    return r << 16 | (m - 1) << 10;
>> +}
>> +
>> +/* test a register against an immediate bit pattern made of
>> +   M set bits rotated right by R.
>> +   Examples:
>> +   to test a 32/64 reg against 0x00000007, pass M = 3,  R = 0.
>> +   to test a 32/64 reg against 0x000000ff, pass M = 8,  R = 0.
>> +   to test a 32bit reg against 0xff000000, pass M = 8,  R = 8.
>> +   to test a 32bit reg against 0xff0000ff, pass M = 16, R = 8.
>> + */
>> +static inline void tcg_out_tst(TCGContext *s, int ext, TCGReg rn,
>> +                               unsigned int m, unsigned int r)
>> +{
>> +    /* using TST alias of ANDS XZR, Xn,#bimm64 0x7200001f */
>> +    unsigned int base = ext ? 0xf240001f : 0x7200001f;
>> +    tcg_out32(s, base | aarch64_limm(m, r) | rn << 5);
>> +}
>> +
>> +/* and a register with a bit pattern, similarly to TST, no flags change */
>> +static inline void tcg_out_andi(TCGContext *s, int ext, TCGReg rd, TCGReg rn,
>> +                                unsigned int m, unsigned int r)
>> +{
>> +    /* using AND 0x12000000 */
>> +    unsigned int base = ext ? 0x92400000 : 0x12000000;
>> +    tcg_out32(s, base | aarch64_limm(m, r) | rn << 5 | rd);
>> +}
>> +
> 
> This should be a separate patch, since it's not related to the tcg_out_arith
> change.
> 

Agreed.

Claudio

* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement byte swap operations
From: Claudio Fontana @ 2013-06-03  9:44 UTC
  To: Richard Henderson
  Cc: Laurent Desnogues, Peter Maydell, Jani Kokkonen, qemu-devel

On 31.05.2013 21:11, Richard Henderson wrote:
> On 05/31/2013 11:01 AM, Jani Kokkonen wrote:
>> +static inline void tcg_out_rev(TCGContext *s, int ext, TCGReg rd, TCGReg rm)
>> +{
>> +    /* using REV 0x5ac00800 */
>> +    unsigned int base = ext ? 0xdac00c00 : 0x5ac00800;
>> +    tcg_out32(s, base | rm << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_rev16(TCGContext *s, int ext, TCGReg rd, TCGReg rm)
>> +{
>> +    /* using REV16 0x5ac00400 */
>> +    unsigned int base = ext ? 0xdac00400 : 0x5ac00400;
>> +    tcg_out32(s, base | rm << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_rev32(TCGContext *s, TCGReg rd, TCGReg rm)
>> +{
>> +    /* using REV32 0xdac00800 */
>> +    unsigned int base = 0xdac00800;
>> +    tcg_out32(s, base | rm << 5 | rd);
>> +}
> 
> You don't actually need rev32.
> 
>> * bswap32_i32/i64 t0, t1
>>
>> 32 bit byte swap on a 32/64 bit value. With a 64 bit value, it assumes that
>> the four high order bytes are set to zero.
> 
> The fact that the high order bytes are known to be zero means that you
> can always use tcg_out_rev with ext=0.
> 
>     case INDEX_op_bswap64_i64:
>         ext = 1;
>         /* FALLTHRU */
>     case INDEX_op_bswap32_i64:
>     case INDEX_op_bswap32_i32:
>         tcg_out_rev(s, ext, args[0], args[1]);
>         break;
>     case INDEX_op_bswap16_i64:
>     case INDEX_op_bswap16_i32:
>         tcg_out_rev16(s, 0, args[0], args[1]);
>         break;
> 
> 
> r~
> 

ACK.

* Re: [Qemu-devel] [PATCH 3/4] tcg/aarch64: implement sign/zero extend operations
From: Claudio Fontana @ 2013-06-03  9:48 UTC
  To: Richard Henderson
  Cc: Laurent Desnogues, Peter Maydell, Jani Kokkonen, qemu-devel

On 31.05.2013 21:13, Richard Henderson wrote:
> On 05/31/2013 11:05 AM, Jani Kokkonen wrote:
>> +static inline void tcg_out_uxt(TCGContext *s, int s_bits,
>> +                               TCGReg rd, TCGReg rn)
>> +{
>> +    /* using ALIASes UXTB 0x53001c00, UXTH 0x53003c00
>> +       of UBFM Wd, Wn, #0, #7|15 and mov */
>> +    int bits = 8 * (1 << s_bits) - 1;
>> +    tcg_out_ubfm(s, 0, rd, rn, 0, bits);
>> +}
> 
> Err, ubfm never generates mov, does it?

No, the comment is a leftover, it's wrong.

> Yes, you do that later,
> 
>> +    case INDEX_op_ext32u_i64:
>> +        tcg_out_movr(s, 0, args[0], args[1]);
>> +        break;
> 
> but the comment isn't actually correct in tcg_out_uxt, surely?
> 

right.

I was also thinking about INDEX_op_ext16u_i64 and INDEX_op_ext16u_i32;
I think I can just use ext = 0 for both when doing the UXT,
consistent with what we discussed before about trying to use
ext = 0 whenever possible.

Claudio

* Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
From: Jani Kokkonen @ 2013-06-03 11:21 UTC
  To: Richard Henderson
  Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel

On 5/31/2013 10:25 PM, Richard Henderson wrote:
> On 05/31/2013 11:07 AM, Jani Kokkonen wrote:
>> +/* Load and compare a TLB entry, leaving the flags set.  Leaves X2 pointing
>> +   to the tlb entry.  Clobbers X0,X1,X2,X3 and TMP.  */
>> +
>> +static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
>> +                             int s_bits, uint8_t **label_ptr, int tlb_offset)
>> +{
> 
> You copied the comment from ARM, and it isn't correct.  You generate branches.

I will fix the comment.
> 
>> +    TCGReg base = TCG_AREG0;
>> +
>> +    tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
>> +    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
>> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
>> +    tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
>> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
>> +                             TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
>> +#if TARGET_LONG_BITS == 64
>> +    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
>> +#else
>> +    tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
>> +#endif
>> +    /* check alignment */
>> +    if (s_bits) {
>> +        tcg_out_tst(s, 1, addr_reg, s_bits, 0);
>> +        label_ptr[0] = s->code_ptr;
>> +        tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
>> +    }
>> +    tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
>> +    label_ptr[1] = s->code_ptr;
>> +    tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
> 
> I'm positive that the branch predictor would be happier with a single branch
> rather than the two you generate here.  It ought to be possible to use a
> different set of insns to do this in one go.
> 
> How about something like
> 
> 	@ extract the tlb index from the address
> 	ubfm	w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS
> 
> 	@ add any "high bits" from the tlb offset
> 	@ noting that env will be much smaller than 24 bits.
> 	add	x1, env, tlb_offset & 0xfff000
> 
> 	@ zap the tlb index from the address for compare
> 	@ this is all high bits plus 0-3 low bits set, so this
> 	@ should match a logical immediate.
> 	and	w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1)
> 
> 	@ merge the tlb index into the env+tlb_offset
> 	add	x1, x1, x0, lsl #3
> 
> 	@ load the tlb comparator.  the 12-bit scaled offset
> 	@ form will fit the bits remaining from above, given that
> 	@ we're loading an aligned object, and so the low 2/3 bits
> 	@ will be clear.
> 	ldr	w/x0, [x1, tlb_offset & 0xfff]
> 
> 	@ load the tlb addend.  do this early to avoid stalling.
> 	@ the addend_offset differs from tlb_offset by 1-3 words.
> 	@ given that we've got overlap between the scaled 12-bit
> 	@ value and the 12-bit shifted value above, this also ought
> 	@ to always be representable.
> 	ldr	x3, [x1, (tlb_offset & 0xfff) + (addend_offset - tlb_offset)]
> 
> 	@ perform the comparison
> 	cmp	w/x0, w/x2
> 
> 	@ generate the complete host address in parallel with the cmp.
> 	add	x3, x3, addr_reg		@ 64-bit guest
> 	add	x3, x3, addr_reg, uxtw		@ 32-bit guest
> 
> 	bne	miss_label
> 
> Note that the w/x above indicates the ext setting that ought to be used,
> depending on the address size of the guest.
> 
> This is at least 2 insns shorter than your sequence.

Ok, thanks. The ubfm instruction will be added, and I will modify the implementation based on your comments.
> 
> Have you looked at doing the out-of-line tlb miss sequence right from the
> very beginning?  It's not that much more difficult to accomplish than the
> inline tlb miss.

I have to look into this one.
> 
> See CONFIG_QEMU_LDST_OPTIMIZATION, and the implementation in tcg/arm.
> You won't need two nops after the call; aarch64 can do all the required
> extensions and data movement operations in a single insn.
> 
> 

I will also take this into account.

> r~
> 

-Jani

* Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
From: Richard Henderson @ 2013-06-03 15:52 UTC
  To: Jani Kokkonen
  Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel

On 06/03/2013 04:21 AM, Jani Kokkonen wrote:
>> 	@ merge the tlb index into the env+tlb_offset
>> 	add	x1, x1, x0, lsl #3

For the record, oops.  3 should be CPU_TLB_ENTRY_BITS.


r~
