* [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG tlb fast lookup
@ 2013-05-31 17:51 Jani Kokkonen
2013-05-31 17:57 ` [Qemu-devel] [PATCH 1/4] tcg/aarch64: more low level ops in preparation of tlb lookup Jani Kokkonen
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Jani Kokkonen @ 2013-05-31 17:51 UTC (permalink / raw)
To: Peter Maydell
Cc: Laurent Desnogues, Claudio Fontana, qemu-devel, Richard Henderson
From: Jani Kokkonen <jani.kokkonen@huawei.com>
This series implements the TCG tlb fast lookup in tcg_out_qemu_ld/st
for the aarch64 TCG target.
It requires the reviewed but not yet committed series
"[PATCH v4 0/3] ARM aarch64 TCG target" at:
http://lists.nongnu.org/archive/html/qemu-devel/2013-05/msg04200.html
https://github.com/hw-claudio/qemu/tree/tcg-aarch64-current
Limitations of this initial implementation:
* CONFIG_SOFTMMU only
Tested on an x86-64 physical machine running the Foundation v8 model,
with a minimal Linux 3.2.0 host system based on the Linaro v8
image build 0.8.4423 for user space.
Tested guests: ARM v5, PPC64 and i386 Linux test images.
Also tested an x86-64/Linux image built with Buildroot.
Claudio Fontana (3):
tcg/aarch64: more low level ops in preparation of tlb lookup
tcg/aarch64: implement byte swap operations
tcg/aarch64: implement sign/zero extend operations
Jani Kokkonen (1):
tcg/aarch64: implement tlb lookup fast path
tcg/aarch64/tcg-target.c | 333 ++++++++++++++++++++++++++++++++++++++++++++---
tcg/aarch64/tcg-target.h | 30 ++---
2 files changed, 328 insertions(+), 35 deletions(-)
--
1.8.1
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Qemu-devel] [PATCH 1/4] tcg/aarch64: more low level ops in preparation of tlb lookup
2013-05-31 17:51 [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG tlb fast lookup Jani Kokkonen
@ 2013-05-31 17:57 ` Jani Kokkonen
2013-05-31 19:07 ` Richard Henderson
2013-05-31 18:01 ` [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement byte swap operations Jani Kokkonen
` (2 subsequent siblings)
3 siblings, 1 reply; 14+ messages in thread
From: Jani Kokkonen @ 2013-05-31 17:57 UTC (permalink / raw)
To: Peter Maydell
Cc: Laurent Desnogues, Claudio Fontana, qemu-devel, Richard Henderson
From: Claudio Fontana <claudio.fontana@huawei.com>
For the arithmetic operations, add SUBS and a shift parameter
so that all arithmetic instructions can make use of shifted registers.
Also add functions to TEST/AND registers against immediate bit patterns.
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
tcg/aarch64/tcg-target.c | 72 ++++++++++++++++++++++++++++++++++++++----------
1 file changed, 58 insertions(+), 14 deletions(-)
diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
index ff626eb..1343d49 100644
--- a/tcg/aarch64/tcg-target.c
+++ b/tcg/aarch64/tcg-target.c
@@ -188,6 +188,7 @@ enum aarch64_ldst_op_type { /* type of operation */
enum aarch64_arith_opc {
ARITH_ADD = 0x0b,
ARITH_SUB = 0x4b,
+ ARITH_SUBS = 0x6b,
ARITH_AND = 0x0a,
ARITH_OR = 0x2a,
ARITH_XOR = 0x4a
@@ -394,12 +395,20 @@ static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
}
static inline void tcg_out_arith(TCGContext *s, enum aarch64_arith_opc opc,
- int ext, TCGReg rd, TCGReg rn, TCGReg rm)
+ int ext, TCGReg rd, TCGReg rn, TCGReg rm,
+ int shift_imm)
{
/* Using shifted register arithmetic operations */
/* if extended registry operation (64bit) just OR with 0x80 << 24 */
- unsigned int base = ext ? (0x80 | opc) << 24 : opc << 24;
- tcg_out32(s, base | rm << 16 | rn << 5 | rd);
+ unsigned int shift, base = ext ? (0x80 | opc) << 24 : opc << 24;
+ if (shift_imm == 0) {
+ shift = 0;
+ } else if (shift_imm > 0) {
+ shift = shift_imm << 10 | 1 << 22;
+ } else /* (shift_imm < 0) */ {
+ shift = (-shift_imm) << 10;
+ }
+ tcg_out32(s, base | rm << 16 | shift | rn << 5 | rd);
}
static inline void tcg_out_mul(TCGContext *s, int ext,
@@ -482,11 +491,11 @@ static inline void tcg_out_rotl(TCGContext *s, int ext,
tcg_out_extr(s, ext, rd, rn, rn, bits - (m & max));
}
-static inline void tcg_out_cmp(TCGContext *s, int ext, TCGReg rn, TCGReg rm)
+static inline void tcg_out_cmp(TCGContext *s, int ext, TCGReg rn, TCGReg rm,
+ int shift_imm)
{
/* Using CMP alias SUBS wzr, Wn, Wm */
- unsigned int base = ext ? 0xeb00001f : 0x6b00001f;
- tcg_out32(s, base | rm << 16 | rn << 5);
+ tcg_out_arith(s, ARITH_SUBS, ext, TCG_REG_XZR, rn, rm, shift_imm);
}
static inline void tcg_out_cset(TCGContext *s, int ext, TCGReg rd, TCGCond c)
@@ -569,6 +578,40 @@ static inline void tcg_out_call(TCGContext *s, tcg_target_long target)
}
}
+/* encode a logical immediate, mapping user parameter
+ M=set bits pattern length to S=M-1 */
+static inline unsigned int
+aarch64_limm(unsigned int m, unsigned int r)
+{
+ assert(m > 0);
+ return r << 16 | (m - 1) << 10;
+}
+
+/* test a register against an immediate bit pattern made of
+ M set bits rotated right by R.
+ Examples:
+ to test a 32/64 reg against 0x00000007, pass M = 3, R = 0.
+ to test a 32/64 reg against 0x000000ff, pass M = 8, R = 0.
+ to test a 32bit reg against 0xff000000, pass M = 8, R = 8.
+ to test a 32bit reg against 0xff0000ff, pass M = 16, R = 8.
+ */
+static inline void tcg_out_tst(TCGContext *s, int ext, TCGReg rn,
+ unsigned int m, unsigned int r)
+{
+ /* using TST alias of ANDS XZR, Xn,#bimm64 0x7200001f */
+ unsigned int base = ext ? 0xf240001f : 0x7200001f;
+ tcg_out32(s, base | aarch64_limm(m, r) | rn << 5);
+}
+
+/* and a register with a bit pattern, similarly to TST, no flags change */
+static inline void tcg_out_andi(TCGContext *s, int ext, TCGReg rd, TCGReg rn,
+ unsigned int m, unsigned int r)
+{
+ /* using AND 0x12000000 */
+ unsigned int base = ext ? 0x92400000 : 0x12000000;
+ tcg_out32(s, base | aarch64_limm(m, r) | rn << 5 | rd);
+}
+
static inline void tcg_out_ret(TCGContext *s)
{
/* emit RET { LR } */
@@ -830,31 +873,31 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
case INDEX_op_add_i64:
ext = 1; /* fall through */
case INDEX_op_add_i32:
- tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2]);
+ tcg_out_arith(s, ARITH_ADD, ext, args[0], args[1], args[2], 0);
break;
case INDEX_op_sub_i64:
ext = 1; /* fall through */
case INDEX_op_sub_i32:
- tcg_out_arith(s, ARITH_SUB, ext, args[0], args[1], args[2]);
+ tcg_out_arith(s, ARITH_SUB, ext, args[0], args[1], args[2], 0);
break;
case INDEX_op_and_i64:
ext = 1; /* fall through */
case INDEX_op_and_i32:
- tcg_out_arith(s, ARITH_AND, ext, args[0], args[1], args[2]);
+ tcg_out_arith(s, ARITH_AND, ext, args[0], args[1], args[2], 0);
break;
case INDEX_op_or_i64:
ext = 1; /* fall through */
case INDEX_op_or_i32:
- tcg_out_arith(s, ARITH_OR, ext, args[0], args[1], args[2]);
+ tcg_out_arith(s, ARITH_OR, ext, args[0], args[1], args[2], 0);
break;
case INDEX_op_xor_i64:
ext = 1; /* fall through */
case INDEX_op_xor_i32:
- tcg_out_arith(s, ARITH_XOR, ext, args[0], args[1], args[2]);
+ tcg_out_arith(s, ARITH_XOR, ext, args[0], args[1], args[2], 0);
break;
case INDEX_op_mul_i64:
@@ -909,7 +952,8 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
if (const_args[2]) { /* ROR / EXTR Wd, Wm, Wm, 32 - m */
tcg_out_rotl(s, ext, args[0], args[1], args[2]);
} else {
- tcg_out_arith(s, ARITH_SUB, 0, TCG_REG_TMP, TCG_REG_XZR, args[2]);
+ tcg_out_arith(s, ARITH_SUB, 0,
+ TCG_REG_TMP, TCG_REG_XZR, args[2], 0);
tcg_out_shiftrot_reg(s, SRR_ROR, ext,
args[0], args[1], TCG_REG_TMP);
}
@@ -918,14 +962,14 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
case INDEX_op_brcond_i64:
ext = 1; /* fall through */
case INDEX_op_brcond_i32: /* CMP 0, 1, cond(2), label 3 */
- tcg_out_cmp(s, ext, args[0], args[1]);
+ tcg_out_cmp(s, ext, args[0], args[1], 0);
tcg_out_goto_label_cond(s, args[2], args[3]);
break;
case INDEX_op_setcond_i64:
ext = 1; /* fall through */
case INDEX_op_setcond_i32:
- tcg_out_cmp(s, ext, args[1], args[2]);
+ tcg_out_cmp(s, ext, args[1], args[2], 0);
tcg_out_cset(s, 0, args[0], args[3]);
break;
--
1.8.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement byte swap operations
2013-05-31 17:51 [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG tlb fast lookup Jani Kokkonen
2013-05-31 17:57 ` [Qemu-devel] [PATCH 1/4] tcg/aarch64: more low level ops in preparation of tlb lookup Jani Kokkonen
@ 2013-05-31 18:01 ` Jani Kokkonen
2013-05-31 19:11 ` Richard Henderson
2013-05-31 18:05 ` [Qemu-devel] [PATCH 3/4] tcg/aarch64: implement sign/zero extend operations Jani Kokkonen
2013-05-31 18:07 ` [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path Jani Kokkonen
3 siblings, 1 reply; 14+ messages in thread
From: Jani Kokkonen @ 2013-05-31 18:01 UTC (permalink / raw)
To: Peter Maydell
Cc: Laurent Desnogues, Claudio Fontana, qemu-devel, Richard Henderson
From: Claudio Fontana <claudio.fontana@huawei.com>
Implement the optional byte swap operations with the dedicated
aarch64 instructions.
These instructions are also needed for the tlb lookup.
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
tcg/aarch64/tcg-target.c | 42 ++++++++++++++++++++++++++++++++++++++++++
tcg/aarch64/tcg-target.h | 10 +++++-----
2 files changed, 47 insertions(+), 5 deletions(-)
diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
index 1343d49..716c987 100644
--- a/tcg/aarch64/tcg-target.c
+++ b/tcg/aarch64/tcg-target.c
@@ -658,6 +658,27 @@ static inline void tcg_out_goto_label_cond(TCGContext *s,
}
}
+static inline void tcg_out_rev(TCGContext *s, int ext, TCGReg rd, TCGReg rm)
+{
+ /* using REV 0x5ac00800 */
+ unsigned int base = ext ? 0xdac00c00 : 0x5ac00800;
+ tcg_out32(s, base | rm << 5 | rd);
+}
+
+static inline void tcg_out_rev16(TCGContext *s, int ext, TCGReg rd, TCGReg rm)
+{
+ /* using REV16 0x5ac00400 */
+ unsigned int base = ext ? 0xdac00400 : 0x5ac00400;
+ tcg_out32(s, base | rm << 5 | rd);
+}
+
+static inline void tcg_out_rev32(TCGContext *s, TCGReg rd, TCGReg rm)
+{
+ /* using REV32 0xdac00800 */
+ unsigned int base = 0xdac00800;
+ tcg_out32(s, base | rm << 5 | rd);
+}
+
#ifdef CONFIG_SOFTMMU
#include "exec/softmmu_defs.h"
@@ -1010,6 +1031,20 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
tcg_out_qemu_st(s, args, 3);
break;
+ case INDEX_op_bswap64_i64:
+ ext = 1; /* fall through */
+ case INDEX_op_bswap32_i32:
+ tcg_out_rev(s, ext, args[0], args[1]);
+ break;
+ case INDEX_op_bswap16_i64:
+ ext = 1; /* fall through */
+ case INDEX_op_bswap16_i32:
+ tcg_out_rev16(s, ext, args[0], args[1]);
+ break;
+ case INDEX_op_bswap32_i64:
+ tcg_out_rev32(s, args[0], args[1]);
+ break;
+
default:
tcg_abort(); /* opcode not implemented */
}
@@ -1091,6 +1126,13 @@ static const TCGTargetOpDef aarch64_op_defs[] = {
{ INDEX_op_qemu_st16, { "l", "l" } },
{ INDEX_op_qemu_st32, { "l", "l" } },
{ INDEX_op_qemu_st64, { "l", "l" } },
+
+ { INDEX_op_bswap16_i32, { "r", "r" } },
+ { INDEX_op_bswap32_i32, { "r", "r" } },
+ { INDEX_op_bswap16_i64, { "r", "r" } },
+ { INDEX_op_bswap32_i64, { "r", "r" } },
+ { INDEX_op_bswap64_i64, { "r", "r" } },
+
{ -1 },
};
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 075ab2a..247ef43 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -44,8 +44,8 @@ typedef enum {
#define TCG_TARGET_HAS_ext16s_i32 0
#define TCG_TARGET_HAS_ext8u_i32 0
#define TCG_TARGET_HAS_ext16u_i32 0
-#define TCG_TARGET_HAS_bswap16_i32 0
-#define TCG_TARGET_HAS_bswap32_i32 0
+#define TCG_TARGET_HAS_bswap16_i32 1
+#define TCG_TARGET_HAS_bswap32_i32 1
#define TCG_TARGET_HAS_not_i32 0
#define TCG_TARGET_HAS_neg_i32 0
#define TCG_TARGET_HAS_rot_i32 1
@@ -68,9 +68,9 @@ typedef enum {
#define TCG_TARGET_HAS_ext8u_i64 0
#define TCG_TARGET_HAS_ext16u_i64 0
#define TCG_TARGET_HAS_ext32u_i64 0
-#define TCG_TARGET_HAS_bswap16_i64 0
-#define TCG_TARGET_HAS_bswap32_i64 0
-#define TCG_TARGET_HAS_bswap64_i64 0
+#define TCG_TARGET_HAS_bswap16_i64 1
+#define TCG_TARGET_HAS_bswap32_i64 1
+#define TCG_TARGET_HAS_bswap64_i64 1
#define TCG_TARGET_HAS_not_i64 0
#define TCG_TARGET_HAS_neg_i64 0
#define TCG_TARGET_HAS_rot_i64 1
--
1.8.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [Qemu-devel] [PATCH 3/4] tcg/aarch64: implement sign/zero extend operations
2013-05-31 17:51 [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG tlb fast lookup Jani Kokkonen
2013-05-31 17:57 ` [Qemu-devel] [PATCH 1/4] tcg/aarch64: more low level ops in preparation of tlb lookup Jani Kokkonen
2013-05-31 18:01 ` [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement byte swap operations Jani Kokkonen
@ 2013-05-31 18:05 ` Jani Kokkonen
2013-05-31 19:13 ` Richard Henderson
2013-05-31 18:07 ` [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path Jani Kokkonen
3 siblings, 1 reply; 14+ messages in thread
From: Jani Kokkonen @ 2013-05-31 18:05 UTC (permalink / raw)
To: Peter Maydell
Cc: Laurent Desnogues, Claudio Fontana, qemu-devel, Richard Henderson
From: Claudio Fontana <claudio.fontana@huawei.com>
Implement the optional sign/zero extend operations with the dedicated
aarch64 instructions.
These instructions are also needed for the tlb lookup.
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
---
tcg/aarch64/tcg-target.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++--
tcg/aarch64/tcg-target.h | 20 ++++++++---------
2 files changed, 66 insertions(+), 12 deletions(-)
diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
index 716c987..24b2862 100644
--- a/tcg/aarch64/tcg-target.c
+++ b/tcg/aarch64/tcg-target.c
@@ -679,6 +679,24 @@ static inline void tcg_out_rev32(TCGContext *s, TCGReg rd, TCGReg rm)
tcg_out32(s, base | rm << 5 | rd);
}
+static inline void tcg_out_sxt(TCGContext *s, int ext, int s_bits,
+ TCGReg rd, TCGReg rn)
+{
+ /* using ALIASes SXTB 0x13001c00, SXTH 0x13003c00, SXTW 0x93407c00
+ of SBFM Xd, Xn, #0, #7|15|31 */
+ int bits = 8 * (1 << s_bits) - 1;
+ tcg_out_sbfm(s, ext, rd, rn, 0, bits);
+}
+
+static inline void tcg_out_uxt(TCGContext *s, int s_bits,
+ TCGReg rd, TCGReg rn)
+{
+ /* using ALIASes UXTB 0x53001c00, UXTH 0x53003c00
+ of UBFM Wd, Wn, #0, #7|15 and mov */
+ int bits = 8 * (1 << s_bits) - 1;
+ tcg_out_ubfm(s, 0, rd, rn, 0, bits);
+}
+
#ifdef CONFIG_SOFTMMU
#include "exec/softmmu_defs.h"
@@ -726,8 +744,7 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
tcg_out_callr(s, TCG_REG_TMP);
if (opc & 0x04) { /* sign extend */
- unsigned int bits = 8 * (1 << s_bits) - 1;
- tcg_out_sbfm(s, 1, data_reg, TCG_REG_X0, 0, bits); /* 7|15|31 */
+ tcg_out_sxt(s, 1, s_bits, data_reg, TCG_REG_X0);
} else {
tcg_out_movr(s, 1, data_reg, TCG_REG_X0);
}
@@ -1045,6 +1062,31 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
tcg_out_rev32(s, args[0], args[1]);
break;
+ case INDEX_op_ext8s_i64:
+ ext = 1; /* fall through */
+ case INDEX_op_ext8s_i32:
+ tcg_out_sxt(s, ext, 0, args[0], args[1]);
+ break;
+ case INDEX_op_ext16s_i64:
+ ext = 1; /* fall through */
+ case INDEX_op_ext16s_i32:
+ tcg_out_sxt(s, ext, 1, args[0], args[1]);
+ break;
+ case INDEX_op_ext32s_i64:
+ tcg_out_sxt(s, 1, 2, args[0], args[1]);
+ break;
+ case INDEX_op_ext8u_i64:
+ case INDEX_op_ext8u_i32:
+ tcg_out_uxt(s, 0, args[0], args[1]);
+ break;
+ case INDEX_op_ext16u_i64:
+ case INDEX_op_ext16u_i32:
+ tcg_out_uxt(s, 1, args[0], args[1]);
+ break;
+ case INDEX_op_ext32u_i64:
+ tcg_out_movr(s, 0, args[0], args[1]);
+ break;
+
default:
tcg_abort(); /* opcode not implemented */
}
@@ -1133,6 +1175,18 @@ static const TCGTargetOpDef aarch64_op_defs[] = {
{ INDEX_op_bswap32_i64, { "r", "r" } },
{ INDEX_op_bswap64_i64, { "r", "r" } },
+ { INDEX_op_ext8s_i32, { "r", "r" } },
+ { INDEX_op_ext16s_i32, { "r", "r" } },
+ { INDEX_op_ext8u_i32, { "r", "r" } },
+ { INDEX_op_ext16u_i32, { "r", "r" } },
+
+ { INDEX_op_ext8s_i64, { "r", "r" } },
+ { INDEX_op_ext16s_i64, { "r", "r" } },
+ { INDEX_op_ext32s_i64, { "r", "r" } },
+ { INDEX_op_ext8u_i64, { "r", "r" } },
+ { INDEX_op_ext16u_i64, { "r", "r" } },
+ { INDEX_op_ext32u_i64, { "r", "r" } },
+
{ -1 },
};
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 247ef43..97e4a5b 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -40,10 +40,10 @@ typedef enum {
/* optional instructions */
#define TCG_TARGET_HAS_div_i32 0
-#define TCG_TARGET_HAS_ext8s_i32 0
-#define TCG_TARGET_HAS_ext16s_i32 0
-#define TCG_TARGET_HAS_ext8u_i32 0
-#define TCG_TARGET_HAS_ext16u_i32 0
+#define TCG_TARGET_HAS_ext8s_i32 1
+#define TCG_TARGET_HAS_ext16s_i32 1
+#define TCG_TARGET_HAS_ext8u_i32 1
+#define TCG_TARGET_HAS_ext16u_i32 1
#define TCG_TARGET_HAS_bswap16_i32 1
#define TCG_TARGET_HAS_bswap32_i32 1
#define TCG_TARGET_HAS_not_i32 0
@@ -62,12 +62,12 @@ typedef enum {
#define TCG_TARGET_HAS_muls2_i32 0
#define TCG_TARGET_HAS_div_i64 0
-#define TCG_TARGET_HAS_ext8s_i64 0
-#define TCG_TARGET_HAS_ext16s_i64 0
-#define TCG_TARGET_HAS_ext32s_i64 0
-#define TCG_TARGET_HAS_ext8u_i64 0
-#define TCG_TARGET_HAS_ext16u_i64 0
-#define TCG_TARGET_HAS_ext32u_i64 0
+#define TCG_TARGET_HAS_ext8s_i64 1
+#define TCG_TARGET_HAS_ext16s_i64 1
+#define TCG_TARGET_HAS_ext32s_i64 1
+#define TCG_TARGET_HAS_ext8u_i64 1
+#define TCG_TARGET_HAS_ext16u_i64 1
+#define TCG_TARGET_HAS_ext32u_i64 1
#define TCG_TARGET_HAS_bswap16_i64 1
#define TCG_TARGET_HAS_bswap32_i64 1
#define TCG_TARGET_HAS_bswap64_i64 1
--
1.8.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
2013-05-31 17:51 [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG tlb fast lookup Jani Kokkonen
` (2 preceding siblings ...)
2013-05-31 18:05 ` [Qemu-devel] [PATCH 3/4] tcg/aarch64: implement sign/zero extend operations Jani Kokkonen
@ 2013-05-31 18:07 ` Jani Kokkonen
2013-05-31 20:25 ` Richard Henderson
3 siblings, 1 reply; 14+ messages in thread
From: Jani Kokkonen @ 2013-05-31 18:07 UTC (permalink / raw)
To: Peter Maydell
Cc: Laurent Desnogues, Claudio Fontana, qemu-devel, Richard Henderson
From: Jani Kokkonen <jani.kokkonen@huawei.com>
Implement the fast path for tcg_out_qemu_ld/st.
Signed-off-by: Jani Kokkonen <jani.kokkonen@huawei.com>
---
tcg/aarch64/tcg-target.c | 161 +++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 157 insertions(+), 4 deletions(-)
diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
index 24b2862..47ec4a7 100644
--- a/tcg/aarch64/tcg-target.c
+++ b/tcg/aarch64/tcg-target.c
@@ -700,6 +700,36 @@ static inline void tcg_out_uxt(TCGContext *s, int s_bits,
#ifdef CONFIG_SOFTMMU
#include "exec/softmmu_defs.h"
+/* Load and compare a TLB entry, leaving the flags set. Leaves X2 pointing
+ to the tlb entry. Clobbers X0,X1,X2,X3 and TMP. */
+
+static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
+ int s_bits, uint8_t **label_ptr, int tlb_offset)
+{
+ TCGReg base = TCG_AREG0;
+
+ tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
+ tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
+ tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
+ tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
+ tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
+ TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
+#if TARGET_LONG_BITS == 64
+ tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
+#else
+ tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
+#endif
+ /* check alignment */
+ if (s_bits) {
+ tcg_out_tst(s, 1, addr_reg, s_bits, 0);
+ label_ptr[0] = s->code_ptr;
+ tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
+ }
+ tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
+ label_ptr[1] = s->code_ptr;
+ tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
+}
+
/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
int mmu_idx) */
static const void * const qemu_ld_helpers[4] = {
@@ -723,18 +753,85 @@ static const void * const qemu_st_helpers[4] = {
static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
{
TCGReg addr_reg, data_reg;
+ bool bswap;
#ifdef CONFIG_SOFTMMU
int mem_index, s_bits;
+ int i;
+ uint8_t *label_ptr[2] = { NULL };
+ uint8_t *label_ptr2;
#endif
data_reg = args[0];
addr_reg = args[1];
+#ifdef TARGET_WORDS_BIGENDIAN
+ bswap = 1;
+#else
+ bswap = 0;
+#endif
#ifdef CONFIG_SOFTMMU
mem_index = args[2];
s_bits = opc & 3;
- /* TODO: insert TLB lookup here */
+ tcg_out_tlb_read(s, addr_reg, s_bits, label_ptr,
+ offsetof(CPUArchState, tlb_table[mem_index][0].addr_read));
+ tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X1, TCG_REG_X2,
+ offsetof(CPUTLBEntry, addend) - offsetof(CPUTLBEntry, addr_read));
+ switch (opc) {
+ case 0:
+ tcg_out_ldst_r(s, LDST_8, LDST_LD, data_reg, addr_reg, TCG_REG_X1);
+ break;
+ case 0 | 4:
+ tcg_out_ldst_r(s, LDST_8, LDST_LD_S_X, data_reg, addr_reg, TCG_REG_X1);
+ break;
+ case 1:
+ tcg_out_ldst_r(s, LDST_16, LDST_LD, data_reg, addr_reg, TCG_REG_X1);
+ if (bswap) {
+ tcg_out_rev16(s, 1, data_reg, data_reg);
+ }
+ break;
+ case 1 | 4:
+ if (bswap) {
+ tcg_out_ldst_r(s, LDST_16, LDST_LD, data_reg, addr_reg, TCG_REG_X1);
+ tcg_out_rev16(s, 1, data_reg, data_reg);
+ tcg_out_sxt(s, 1, s_bits, data_reg, data_reg);
+ } else {
+ tcg_out_ldst_r(s, LDST_16, LDST_LD_S_X,
+ data_reg, addr_reg, TCG_REG_X1);
+ }
+ break;
+ case 2:
+ tcg_out_ldst_r(s, LDST_32, LDST_LD, data_reg, addr_reg, TCG_REG_X1);
+ if (bswap) {
+ tcg_out_rev32(s, data_reg, data_reg);
+ }
+ break;
+ case 2 | 4:
+ if (bswap) {
+ tcg_out_ldst_r(s, LDST_32, LDST_LD, data_reg, addr_reg, TCG_REG_X1);
+ tcg_out_rev32(s, data_reg, data_reg);
+ tcg_out_sxt(s, 1, s_bits, data_reg, data_reg);
+ } else {
+ tcg_out_ldst_r(s, LDST_32, LDST_LD_S_X,
+ data_reg, addr_reg, TCG_REG_X1);
+ }
+ break;
+ case 3:
+ tcg_out_ldst_r(s, LDST_64, LDST_LD, data_reg, addr_reg, TCG_REG_X1);
+ if (bswap) {
+ tcg_out_rev(s, 1, data_reg, data_reg);
+ }
+ break;
+ default:
+ tcg_abort();
+ }
+ label_ptr2 = s->code_ptr;
+ tcg_out_goto_noaddr(s);
+ for (i = 0; i < 2; i++) {
+ if (label_ptr[i]) {
+ reloc_pc19(label_ptr[i], (tcg_target_long)s->code_ptr);
+ }
+ }
/* all arguments passed via registers */
tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
tcg_out_movr(s, (TARGET_LONG_BITS == 64), TCG_REG_X1, addr_reg);
@@ -748,7 +845,7 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
} else {
tcg_out_movr(s, 1, data_reg, TCG_REG_X0);
}
-
+ reloc_pc26(label_ptr2, (tcg_target_long)s->code_ptr);
#else /* !CONFIG_SOFTMMU */
tcg_abort(); /* TODO */
#endif
@@ -757,8 +854,17 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, int opc)
static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
{
TCGReg addr_reg, data_reg;
+ bool bswap;
#ifdef CONFIG_SOFTMMU
int mem_index, s_bits;
+ int i;
+ uint8_t *label_ptr[2] = { NULL };
+ uint8_t *label_ptr2;
+#endif
+#ifdef TARGET_WORDS_BIGENDIAN
+ bswap = 1;
+#else
+ bswap = 0;
#endif
data_reg = args[0];
addr_reg = args[1];
@@ -767,8 +873,55 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
mem_index = args[2];
s_bits = opc & 3;
- /* TODO: insert TLB lookup here */
+ tcg_out_tlb_read(s, addr_reg, s_bits, label_ptr,
+ offsetof(CPUArchState, tlb_table[mem_index][0].addr_write));
+ tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X1, TCG_REG_X2,
+ offsetof(CPUTLBEntry, addend) - offsetof(CPUTLBEntry, addr_write));
+ switch (opc) {
+ case 0:
+ tcg_out_ldst_r(s, LDST_8, LDST_ST, data_reg, addr_reg, TCG_REG_X1);
+ break;
+ case 1:
+ if (bswap) {
+ tcg_out_rev16(s, 1, TCG_REG_X0, data_reg);
+ tcg_out_ldst_r(s, LDST_16, LDST_ST, TCG_REG_X0,
+ addr_reg, TCG_REG_X1);
+ } else {
+ tcg_out_ldst_r(s, LDST_16, LDST_ST, data_reg,
+ addr_reg, TCG_REG_X1);
+ }
+ break;
+ case 2:
+ if (bswap) {
+ tcg_out_rev32(s, TCG_REG_X0, data_reg);
+ tcg_out_ldst_r(s, LDST_32, LDST_ST, TCG_REG_X0,
+ addr_reg, TCG_REG_X1);
+ } else {
+ tcg_out_ldst_r(s, LDST_32, LDST_ST, data_reg,
+ addr_reg, TCG_REG_X1);
+ }
+ break;
+ case 3:
+ if (bswap) {
+ tcg_out_rev(s, 1, TCG_REG_X0, data_reg);
+ tcg_out_ldst_r(s, LDST_64, LDST_ST, TCG_REG_X0,
+ addr_reg, TCG_REG_X1);
+ } else {
+ tcg_out_ldst_r(s, LDST_64, LDST_ST, data_reg,
+ addr_reg, TCG_REG_X1);
+ }
+ break;
+ default:
+ tcg_abort();
+ }
+ label_ptr2 = s->code_ptr;
+ tcg_out_goto_noaddr(s);
+ for (i = 0; i < 2; i++) {
+ if (label_ptr[i]) {
+ reloc_pc19(label_ptr[i], (tcg_target_long)s->code_ptr);
+ }
+ }
/* all arguments passed via registers */
tcg_out_movr(s, 1, TCG_REG_X0, TCG_AREG0);
tcg_out_movr(s, (TARGET_LONG_BITS == 64), TCG_REG_X1, addr_reg);
@@ -777,7 +930,7 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, int opc)
tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP,
(tcg_target_long)qemu_st_helpers[s_bits]);
tcg_out_callr(s, TCG_REG_TMP);
-
+ reloc_pc26(label_ptr2, (tcg_target_long)s->code_ptr);
#else /* !CONFIG_SOFTMMU */
tcg_abort(); /* TODO */
#endif
--
1.8.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH 1/4] tcg/aarch64: more low level ops in preparation of tlb lookup
2013-05-31 17:57 ` [Qemu-devel] [PATCH 1/4] tcg/aarch64: more low level ops in preparation of tlb lookup Jani Kokkonen
@ 2013-05-31 19:07 ` Richard Henderson
2013-06-03 9:43 ` Claudio Fontana
0 siblings, 1 reply; 14+ messages in thread
From: Richard Henderson @ 2013-05-31 19:07 UTC (permalink / raw)
To: Jani Kokkonen
Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel
On 05/31/2013 10:57 AM, Jani Kokkonen wrote:
> + ARITH_SUBS = 0x6b,
Any reason you're adding SUBS here, but not ANDS?
> +/* encode a logical immediate, mapping user parameter
> + M=set bits pattern length to S=M-1 */
> +static inline unsigned int
> +aarch64_limm(unsigned int m, unsigned int r)
> +{
> + assert(m > 0);
> + return r << 16 | (m - 1) << 10;
> +}
> +
> +/* test a register against an immediate bit pattern made of
> + M set bits rotated right by R.
> + Examples:
> + to test a 32/64 reg against 0x00000007, pass M = 3, R = 0.
> + to test a 32/64 reg against 0x000000ff, pass M = 8, R = 0.
> + to test a 32bit reg against 0xff000000, pass M = 8, R = 8.
> + to test a 32bit reg against 0xff0000ff, pass M = 16, R = 8.
> + */
> +static inline void tcg_out_tst(TCGContext *s, int ext, TCGReg rn,
> + unsigned int m, unsigned int r)
> +{
> + /* using TST alias of ANDS XZR, Xn,#bimm64 0x7200001f */
> + unsigned int base = ext ? 0xf240001f : 0x7200001f;
> + tcg_out32(s, base | aarch64_limm(m, r) | rn << 5);
> +}
> +
> +/* and a register with a bit pattern, similarly to TST, no flags change */
> +static inline void tcg_out_andi(TCGContext *s, int ext, TCGReg rd, TCGReg rn,
> + unsigned int m, unsigned int r)
> +{
> + /* using AND 0x12000000 */
> + unsigned int base = ext ? 0x92400000 : 0x12000000;
> + tcg_out32(s, base | aarch64_limm(m, r) | rn << 5 | rd);
> +}
> +
This should be a separate patch, since it's not related to the tcg_out_arith
change.
r~
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement byte swap operations
2013-05-31 18:01 ` [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement byte swap operations Jani Kokkonen
@ 2013-05-31 19:11 ` Richard Henderson
2013-06-03 9:44 ` Claudio Fontana
0 siblings, 1 reply; 14+ messages in thread
From: Richard Henderson @ 2013-05-31 19:11 UTC (permalink / raw)
To: Jani Kokkonen
Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel
On 05/31/2013 11:01 AM, Jani Kokkonen wrote:
> +static inline void tcg_out_rev(TCGContext *s, int ext, TCGReg rd, TCGReg rm)
> +{
> + /* using REV 0x5ac00800 */
> + unsigned int base = ext ? 0xdac00c00 : 0x5ac00800;
> + tcg_out32(s, base | rm << 5 | rd);
> +}
> +
> +static inline void tcg_out_rev16(TCGContext *s, int ext, TCGReg rd, TCGReg rm)
> +{
> + /* using REV16 0x5ac00400 */
> + unsigned int base = ext ? 0xdac00400 : 0x5ac00400;
> + tcg_out32(s, base | rm << 5 | rd);
> +}
> +
> +static inline void tcg_out_rev32(TCGContext *s, TCGReg rd, TCGReg rm)
> +{
> + /* using REV32 0xdac00800 */
> + unsigned int base = 0xdac00800;
> + tcg_out32(s, base | rm << 5 | rd);
> +}
You don't actually need rev32.
> * bswap32_i32/i64 t0, t1
>
> 32 bit byte swap on a 32/64 bit value. With a 64 bit value, it assumes that
> the four high order bytes are set to zero.
The fact that the high order bytes are known to be zero means that you
can always use tcg_out_rev with ext=0.
case INDEX_op_bswap64_i64:
ext = 1;
/* FALLTHRU */
case INDEX_op_bswap32_i64:
case INDEX_op_bswap32_i32:
tcg_out_rev(s, ext, args[0], args[1]);
break;
case INDEX_op_bswap16_i64:
case INDEX_op_bswap16_i32:
tcg_out_rev16(s, 0, args[0], args[1]);
break;
r~
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH 3/4] tcg/aarch64: implement sign/zero extend operations
2013-05-31 18:05 ` [Qemu-devel] [PATCH 3/4] tcg/aarch64: implement sign/zero extend operations Jani Kokkonen
@ 2013-05-31 19:13 ` Richard Henderson
2013-06-03 9:48 ` Claudio Fontana
0 siblings, 1 reply; 14+ messages in thread
From: Richard Henderson @ 2013-05-31 19:13 UTC (permalink / raw)
To: Jani Kokkonen
Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel
On 05/31/2013 11:05 AM, Jani Kokkonen wrote:
> +static inline void tcg_out_uxt(TCGContext *s, int s_bits,
> + TCGReg rd, TCGReg rn)
> +{
> + /* using ALIASes UXTB 0x53001c00, UXTH 0x53003c00
> + of UBFM Wd, Wn, #0, #7|15 and mov */
> + int bits = 8 * (1 << s_bits) - 1;
> + tcg_out_ubfm(s, 0, rd, rn, 0, bits);
> +}
Err, ubfm never generates mov, does it?
Yes, you do that later,
> + case INDEX_op_ext32u_i64:
> + tcg_out_movr(s, 0, args[0], args[1]);
> + break;
but the comment isn't actually correct in tcg_out_uxt, surely?
r~
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
2013-05-31 18:07 ` [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path Jani Kokkonen
@ 2013-05-31 20:25 ` Richard Henderson
2013-06-03 11:21 ` Jani Kokkonen
0 siblings, 1 reply; 14+ messages in thread
From: Richard Henderson @ 2013-05-31 20:25 UTC (permalink / raw)
To: Jani Kokkonen
Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel
On 05/31/2013 11:07 AM, Jani Kokkonen wrote:
> +/* Load and compare a TLB entry, leaving the flags set. Leaves X2 pointing
> + to the tlb entry. Clobbers X0,X1,X2,X3 and TMP. */
> +
> +static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
> + int s_bits, uint8_t **label_ptr, int tlb_offset)
> +{
You copied the comment from ARM, and it isn't correct. You generate branches.
> + TCGReg base = TCG_AREG0;
> +
> + tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
> + tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
> + tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
> + tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
> + tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
> + TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
> +#if TARGET_LONG_BITS == 64
> + tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
> +#else
> + tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
> +#endif
> + /* check alignment */
> + if (s_bits) {
> + tcg_out_tst(s, 1, addr_reg, s_bits, 0);
> + label_ptr[0] = s->code_ptr;
> + tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
> + }
> + tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
> + label_ptr[1] = s->code_ptr;
> + tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
I'm positive that the branch predictor would be happier with a single branch
rather than the two you generate here. It ought to be possible to use a
different set of insns to do this in one go.
How about something like
@ extract the tlb index from the address
ubfm w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS
@ add any "high bits" from the tlb offset
@ noting that env will be much smaller than 24 bits.
add x1, env, tlb_offset & 0xfff000
@ zap the tlb index from the address for compare
@ this is all high bits plus 0-3 low bits set, so this
@ should match a logical immediate.
and w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1)
@ merge the tlb index into the env+tlb_offset
add x1, x1, x0, lsl #3
@ load the tlb comparator. the 12-bit scaled offset
@ form will fit the bits remaining from above, given that
@ we're loading an aligned object, and so the low 2/3 bits
@ will be clear.
ldr w/x0, [x1, tlb_offset & 0xfff]
@ load the tlb addend. do this early to avoid stalling.
@ the addend_offset differs from tlb_offset by 1-3 words.
@ given that we've got overlap between the scaled 12-bit
@ value and the 12-bit shifted value above, this also ought
@ to always be representable.
ldr x3, [x1, (tlb_offset & 0xfff) + (addend_offset - tlb_offset)]
@ perform the comparison
cmp w/x0, w/x2
@ generate the complete host address in parallel with the cmp.
add x3, x3, addr_reg @ 64-bit guest
add x3, x3, addr_reg, uxtw @ 32-bit guest
bne miss_label
Note that the w/x above indicates the ext setting that ought to be used,
depending on the address size of the guest.
This is at least 2 insns shorter than your sequence.
Have you looked at doing the out-of-line tlb miss sequence right from the
very beginning? It's not that much more difficult to accomplish than the
inline tlb miss.
See CONFIG_QEMU_LDST_OPTIMIZATION, and the implementation in tcg/arm.
You won't need two nops after the call; aarch64 can do all the required
extensions and data movement operations in a single insn.
r~
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH 1/4] tcg/aarch64: more low level ops in preparation of tlb, lookup
2013-05-31 19:07 ` Richard Henderson
@ 2013-06-03 9:43 ` Claudio Fontana
0 siblings, 0 replies; 14+ messages in thread
From: Claudio Fontana @ 2013-06-03 9:43 UTC (permalink / raw)
To: Richard Henderson
Cc: Laurent Desnogues, Peter Maydell, Jani Kokkonen, qemu-devel
On 31.05.2013 21:07, Richard Henderson wrote:
> On 05/31/2013 10:57 AM, Jani Kokkonen wrote:
>> + ARITH_SUBS = 0x6b,
>
> Any reason you're adding SUBS here, but not ANDS?
I also forgot ANDS; I'll add them and reorder.
>> +/* encode a logical immediate, mapping user parameter
>> + M=set bits pattern length to S=M-1 */
>> +static inline unsigned int
>> +aarch64_limm(unsigned int m, unsigned int r)
>> +{
>> + assert(m > 0);
>> + return r << 16 | (m - 1) << 10;
>> +}
>> +
>> +/* test a register against an immediate bit pattern made of
>> + M set bits rotated right by R.
>> + Examples:
>> + to test a 32/64 reg against 0x00000007, pass M = 3, R = 0.
>> + to test a 32/64 reg against 0x000000ff, pass M = 8, R = 0.
>> + to test a 32bit reg against 0xff000000, pass M = 8, R = 8.
>> + to test a 32bit reg against 0xff0000ff, pass M = 16, R = 8.
>> + */
>> +static inline void tcg_out_tst(TCGContext *s, int ext, TCGReg rn,
>> + unsigned int m, unsigned int r)
>> +{
>> + /* using TST alias of ANDS XZR, Xn,#bimm64 0x7200001f */
>> + unsigned int base = ext ? 0xf240001f : 0x7200001f;
>> + tcg_out32(s, base | aarch64_limm(m, r) | rn << 5);
>> +}
>> +
>> +/* and a register with a bit pattern, similarly to TST, no flags change */
>> +static inline void tcg_out_andi(TCGContext *s, int ext, TCGReg rd, TCGReg rn,
>> + unsigned int m, unsigned int r)
>> +{
>> + /* using AND 0x12000000 */
>> + unsigned int base = ext ? 0x92400000 : 0x12000000;
>> + tcg_out32(s, base | aarch64_limm(m, r) | rn << 5 | rd);
>> +}
>> +
>
> This should be a separate patch, since it's not related to the tcg_out_arith
> change.
>
Agreed.
Claudio
* Re: [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement byte swap operations
2013-05-31 19:11 ` Richard Henderson
@ 2013-06-03 9:44 ` Claudio Fontana
0 siblings, 0 replies; 14+ messages in thread
From: Claudio Fontana @ 2013-06-03 9:44 UTC (permalink / raw)
To: Richard Henderson
Cc: Laurent Desnogues, Peter Maydell, Jani Kokkonen, qemu-devel
On 31.05.2013 21:11, Richard Henderson wrote:
> On 05/31/2013 11:01 AM, Jani Kokkonen wrote:
>> +static inline void tcg_out_rev(TCGContext *s, int ext, TCGReg rd, TCGReg rm)
>> +{
>> + /* using REV 0x5ac00800 */
>> + unsigned int base = ext ? 0xdac00c00 : 0x5ac00800;
>> + tcg_out32(s, base | rm << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_rev16(TCGContext *s, int ext, TCGReg rd, TCGReg rm)
>> +{
>> + /* using REV16 0x5ac00400 */
>> + unsigned int base = ext ? 0xdac00400 : 0x5ac00400;
>> + tcg_out32(s, base | rm << 5 | rd);
>> +}
>> +
>> +static inline void tcg_out_rev32(TCGContext *s, TCGReg rd, TCGReg rm)
>> +{
>> + /* using REV32 0xdac00800 */
>> + unsigned int base = 0xdac00800;
>> + tcg_out32(s, base | rm << 5 | rd);
>> +}
>
> You don't actually need rev32.
>
>> * bswap32_i32/i64 t0, t1
>>
>> 32 bit byte swap on a 32/64 bit value. With a 64 bit value, it assumes that
>> the four high order bytes are set to zero.
>
> The fact that the high order bytes are known to be zero means that you
> can always use tcg_out_rev with ext=0.
>
> case INDEX_op_bswap64_i64:
> ext = 1;
> /* FALLTHRU */
> case INDEX_op_bswap32_i64:
> case INDEX_op_bswap32_i32:
> tcg_out_rev(s, ext, args[0], args[1]);
> break;
> case INDEX_op_bswap16_i64:
> case INDEX_op_bswap16_i32:
> tcg_out_rev16(s, 0, args[0], args[1]);
> break;
>
>
> r~
>
ACK.
* Re: [Qemu-devel] [PATCH 3/4] tcg/aarch64: implement sign/zero extend operations
2013-05-31 19:13 ` Richard Henderson
@ 2013-06-03 9:48 ` Claudio Fontana
0 siblings, 0 replies; 14+ messages in thread
From: Claudio Fontana @ 2013-06-03 9:48 UTC (permalink / raw)
To: Richard Henderson
Cc: Laurent Desnogues, Peter Maydell, Jani Kokkonen, qemu-devel
On 31.05.2013 21:13, Richard Henderson wrote:
> On 05/31/2013 11:05 AM, Jani Kokkonen wrote:
>> +static inline void tcg_out_uxt(TCGContext *s, int s_bits,
>> + TCGReg rd, TCGReg rn)
>> +{
>> + /* using ALIASes UXTB 0x53001c00, UXTH 0x53003c00
>> + of UBFM Wd, Wn, #0, #7|15 and mov */
>> + int bits = 8 * (1 << s_bits) - 1;
>> + tcg_out_ubfm(s, 0, rd, rn, 0, bits);
>> +}
>
> Err, ubfm never generates mov, does it?
No, the comment is a leftover, it's wrong.
> Yes, you do that later,
>
>> + case INDEX_op_ext32u_i64:
>> + tcg_out_movr(s, 0, args[0], args[1]);
>> + break;
>
> but the comment isn't actually correct in tcg_out_uxt, surely?
>
right.
I was also thinking about
INDEX_op_ext16u_i64 and INDEX_op_ext16u_i32;
I think I can just use ext = 0 for both when doing the UXT,
consistent with what we discussed before about trying to use
ext = 0 whenever possible.
Claudio
* Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
2013-05-31 20:25 ` Richard Henderson
@ 2013-06-03 11:21 ` Jani Kokkonen
2013-06-03 15:52 ` Richard Henderson
0 siblings, 1 reply; 14+ messages in thread
From: Jani Kokkonen @ 2013-06-03 11:21 UTC (permalink / raw)
To: Richard Henderson
Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel
On 5/31/2013 10:25 PM, Richard Henderson wrote:
> On 05/31/2013 11:07 AM, Jani Kokkonen wrote:
>> +/* Load and compare a TLB entry, leaving the flags set. Leaves X2 pointing
>> + to the tlb entry. Clobbers X0,X1,X2,X3 and TMP. */
>> +
>> +static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
>> + int s_bits, uint8_t **label_ptr, int tlb_offset)
>> +{
>
> You copied the comment from ARM, and it isn't correct. You generate branches.
I will fix the comment.
>
>> + TCGReg base = TCG_AREG0;
>> +
>> + tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
>> + tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
>> + tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
>> + tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
>> + tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
>> + TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
>> +#if TARGET_LONG_BITS == 64
>> + tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
>> +#else
>> + tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
>> +#endif
>> + /* check alignment */
>> + if (s_bits) {
>> + tcg_out_tst(s, 1, addr_reg, s_bits, 0);
>> + label_ptr[0] = s->code_ptr;
>> + tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
>> + }
>> + tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
>> + label_ptr[1] = s->code_ptr;
>> + tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
>
> I'm positive that the branch predictor would be happier with a single branch
> rather than the two you generate here. It ought to be possible to use a
> different set of insns to do this in one go.
>
> How about something like
>
> @ extract the tlb index from the address
> ubfm w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS
>
> @ add any "high bits" from the tlb offset
> @ noting that env will be much smaller than 24 bits.
> add x1, env, tlb_offset & 0xfff000
>
> @ zap the tlb index from the address for compare
> @ this is all high bits plus 0-3 low bits set, so this
> @ should match a logical immediate.
> and w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1)
>
> @ merge the tlb index into the env+tlb_offset
> add x1, x1, x0, lsl #3
>
> @ load the tlb comparator. the 12-bit scaled offset
> @ form will fit the bits remaining from above, given that
> @ we're loading an aligned object, and so the low 2/3 bits
> @ will be clear.
> ldr w/x0, [x1, tlb_offset & 0xfff]
>
> @ load the tlb addend. do this early to avoid stalling.
> @ the addend_offset differs from tlb_offset by 1-3 words.
> @ given that we've got overlap between the scaled 12-bit
> @ value and the 12-bit shifted value above, this also ought
> @ to always be representable.
> ldr x3, [x1, (tlb_offset & 0xfff) + (addend_offset - tlb_offset)]
>
> @ perform the comparison
> cmp w/x0, w/x2
>
> @ generate the complete host address in parallel with the cmp.
> add x3, x3, addr_reg @ 64-bit guest
> add x3, x3, addr_reg, uxtw @ 32-bit guest
>
> bne miss_label
>
> Note that the w/x above indicates the ext setting that ought to be used,
> depending on the address size of the guest.
>
> This is at least 2 insns shorter than your sequence.
Ok, thanks. The ubfm instruction will be added, and I will modify the implementation based on your comments.
>
> Have you looked at doing the out-of-line tlb miss sequence right from the
> very beginning? It's not that much more difficult to accomplish than the
> inline tlb miss.
I have to look into this one.
>
> See CONFIG_QEMU_LDST_OPTIMIZATION, and the implementation in tcg/arm.
> You won't need two nops after the call; aarch64 can do all the required
> extensions and data movement operations in a single insn.
>
>
I will also take this into account.
> r~
>
-Jani
* Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
2013-06-03 11:21 ` Jani Kokkonen
@ 2013-06-03 15:52 ` Richard Henderson
0 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2013-06-03 15:52 UTC (permalink / raw)
To: Jani Kokkonen
Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel
On 06/03/2013 04:21 AM, Jani Kokkonen wrote:
>> @ merge the tlb index into the env+tlb_offset
>> add x1, x1, x0, lsl #3
For the record, oops. 3 should be CPU_TLB_ENTRY_BITS.
r~
Thread overview: 14+ messages
2013-05-31 17:51 [Qemu-devel] [PATCH 0/4] ARM aarch64 TCG tlb fast lookup Jani Kokkonen
2013-05-31 17:57 ` [Qemu-devel] [PATCH 1/4] tcg/aarch64: more low level ops in preparation of tlb, lookup Jani Kokkonen
2013-05-31 19:07 ` Richard Henderson
2013-06-03 9:43 ` Claudio Fontana
2013-05-31 18:01 ` [Qemu-devel] [PATCH 2/4] tcg/aarch64: implement byte swap operations Jani Kokkonen
2013-05-31 19:11 ` Richard Henderson
2013-06-03 9:44 ` Claudio Fontana
2013-05-31 18:05 ` [Qemu-devel] [PATCH 3/4] tcg/aarch64: implement sign/zero extend operations Jani Kokkonen
2013-05-31 19:13 ` Richard Henderson
2013-06-03 9:48 ` Claudio Fontana
2013-05-31 18:07 ` [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path Jani Kokkonen
2013-05-31 20:25 ` Richard Henderson
2013-06-03 11:21 ` Jani Kokkonen
2013-06-03 15:52 ` Richard Henderson