* [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations
From: Kirill Batuzov @ 2017-01-17  9:07 UTC
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

The goal of this patch series is to set up an infrastructure to emulate
guest vector operations using host vector operations. Preliminary
experiments show that simply translating loads and stores increases the
performance of the x264 video codec by 10%. The performance of a
GCC-vectorized for loop increased 2x.

To be able to emulate guest vector operations using host vector operations,
several things need to be done.

1. Corresponding vector types should be added to TCG. This series adds
TCG_TYPE_V128 and TCG_TYPE_V64. I have made TCG_TYPE_V64 a type distinct
from TCG_TYPE_I64 because it usually needs to be allocated to different
registers and supports different operations.

2. Load/store operations for these new types need to be implemented.

3. For a seamless transition from the current model to the new one, we need
to handle cases where memory occupied by a global variable can also be
accessed via a pointer to the CPUArchState structure. A very simple
conservative alias analysis has been added for this purpose. It tracks
memory loads and stores that overlap with fields of CPUArchState and
provides this information to the register allocator. The allocator then
spills and reloads affected globals when needed.
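
For illustration, the core of such an analysis is a plain interval test (a
minimal sketch, not the exact code from these patches; patches 04 and 05
contain the real logic):

    /* Byte ranges [a_off, a_off + a_size) and [b_off, b_off + b_size)
       overlap iff each range starts before the other one ends. */
    static int ranges_overlap(intptr_t a_off, intptr_t a_size,
                              intptr_t b_off, intptr_t b_size)
    {
        return a_off < b_off + b_size && b_off < a_off + a_size;
    }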

4. Allow overlapping globals. For scalar registers this is a rare case, and
overlapping registers can be handled as a single one (ah, al, ax, eax,
rax). On ARM, however, every Q register consists of two D registers, each
consisting of two S registers. Handling four S registers as one because
they are parts of the same Q register would be far too inefficient.
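
As an illustration of the ARM case (assuming the usual NEON register
aliasing; patch 08 wires this up):

    /* Q0 = D1:D0 = S3:S2:S1:S0
     * A write to Q0 must kill any cached values of D0 and D1, and a
     * write to D1 must force a cached Q0 to be synced back to memory. */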

5. Add a new memory addressing mode to the MMU code for large accesses and
create the needed helpers. Only 128-bit vectors have been handled for now.

6. Create TCG opcodes for vector operations. Only addition has been handled
in this series. Each operation has a wrapper that checks whether the
backend supports the corresponding opcode: if it does, the vector opcode is
generated; otherwise the operation is emulated with scalar operations. The
emulation code is generated inline for performance reasons (there is a huge
performance difference between inline generation and calling a helper). As
a positive side effect, this will eventually allow similar emulation code
for vector instructions from different frontends to be merged into a
target-independent implementation.
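
A minimal sketch of one such wrapper, using the names introduced in patch
07 (the real code generates these through the GEN_VECT_WRAPPER macro):

    static void gen_add_i32x4(TCGv_v128 res, TCGv_v128 a, TCGv_v128 b)
    {
        if (TCG_TARGET_HAS_add_i32x4) {
            /* The backend supports it: emit the vector opcode. */
            tcg_gen_op3_v128(INDEX_op_add_i32x4, res, a, b);
        } else {
            /* Fallback: spill a and b to memory and add the four
               32-bit lanes with tcg_gen_add_i32, generated inline. */
        }
    }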

7. Use the new operations in the frontend (ARM was used in this series).

8. Support the new operations in the backend (x86_64 was used in this series).

For experiments I used an ARM guest on an x86_64 host. I wanted a pair of
different architectures that both have vector extensions, and the
ARM/x86_64 pair fits well.

Kirill Batuzov (18):
  tcg: add support for 128bit vector type
  tcg: add support for 64bit vector type
  tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes
  tcg: add simple alias analysis
  tcg: use results of alias analysis in liveness analysis
  tcg: allow globals to overlap
  tcg: add vector addition operations
  target/arm: support access to vector guest registers as globals
  target/arm: use vector opcode to handle vadd.<size> instruction
  tcg/i386: add support for vector opcodes
  tcg/i386: support 64-bit vector operations
  tcg/i386: support remaining vector addition operations
  tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend
  tcg: introduce new TCGMemOp - MO_128
  tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes
  softmmu: create helpers for vector loads
  tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops
  target/arm: load two consecutive 64-bit vector regs as a 128-bit
    vector reg

 cputlb.c                     |   4 +
 softmmu_template_vector.h    | 266 +++++++++++++++++++++++++++++++++++++++++++
 target/arm/translate.c       |  89 ++++++++++++++-
 tcg/aarch64/tcg-target.inc.c |   4 +-
 tcg/arm/tcg-target.inc.c     |   4 +-
 tcg/i386/tcg-target.h        |  35 +++++-
 tcg/i386/tcg-target.inc.c    | 245 ++++++++++++++++++++++++++++++++++++---
 tcg/mips/tcg-target.inc.c    |   4 +-
 tcg/optimize.c               | 146 ++++++++++++++++++++++++
 tcg/ppc/tcg-target.inc.c     |   4 +-
 tcg/s390/tcg-target.inc.c    |   4 +-
 tcg/sparc/tcg-target.inc.c   |  12 +-
 tcg/tcg-op.c                 |  20 +++-
 tcg/tcg-op.h                 | 262 ++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg-opc.h                |  34 ++++++
 tcg/tcg.c                    | 146 ++++++++++++++++++++++++
 tcg/tcg.h                    | 147 +++++++++++++++++++++++-
 17 files changed, 1385 insertions(+), 41 deletions(-)
 create mode 100644 softmmu_template_vector.h

-- 
2.1.4


* [Qemu-devel] [PATCH 01/18] tcg: add support for 128bit vector type
From: Kirill Batuzov @ 2017-01-17  9:07 UTC
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

Introduce TCG_TYPE_V128 and the corresponding TCGv_v128 for TCG temps. Add
helper functions that work with temps of this new type.
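
A minimal usage sketch of the new helpers (hypothetical, for illustration
only):

    TCGv_v128 t = tcg_temp_new_v128();
    /* ... emit 128-bit vector ops on t ... */
    tcg_temp_free_v128(t);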

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg-op.h | 24 ++++++++++++++++++++++++
 tcg/tcg.c    | 13 +++++++++++++
 tcg/tcg.h    | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 71 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 6d044b7..df077d6 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -248,6 +248,23 @@ static inline void tcg_gen_op6ii_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
                 GET_TCGV_I64(a3), GET_TCGV_I64(a4), a5, a6);
 }
 
+static inline void tcg_gen_op1_v128(TCGOpcode opc, TCGv_v128 a1)
+{
+    tcg_gen_op1(&tcg_ctx, opc, GET_TCGV_V128(a1));
+}
+
+static inline void tcg_gen_op2_v128(TCGOpcode opc, TCGv_v128 a1,
+                                    TCGv_v128 a2)
+{
+    tcg_gen_op2(&tcg_ctx, opc, GET_TCGV_V128(a1), GET_TCGV_V128(a2));
+}
+
+static inline void tcg_gen_op3_v128(TCGOpcode opc, TCGv_v128 a1,
+                                    TCGv_v128 a2, TCGv_v128 a3)
+{
+    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_V128(a1), GET_TCGV_V128(a2),
+                GET_TCGV_V128(a3));
+}
 
 /* Generic ops.  */
 
@@ -442,6 +459,13 @@ static inline void tcg_gen_not_i32(TCGv_i32 ret, TCGv_i32 arg)
     }
 }
 
+/* Vector ops */
+
+static inline void tcg_gen_discard_v128(TCGv_v128 arg)
+{
+    tcg_gen_op1_v128(INDEX_op_discard, arg);
+}
+
 /* 64 bit ops */
 
 void tcg_gen_addi_i64(TCGv_i64 ret, TCGv_i64 arg1, int64_t arg2);
diff --git a/tcg/tcg.c b/tcg/tcg.c
index aabf94f..b20a044 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -637,6 +637,14 @@ TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
     return MAKE_TCGV_I64(idx);
 }
 
+TCGv_v128 tcg_temp_new_internal_v128(int temp_local)
+{
+    int idx;
+
+    idx = tcg_temp_new_internal(TCG_TYPE_V128, temp_local);
+    return MAKE_TCGV_V128(idx);
+}
+
 static void tcg_temp_free_internal(int idx)
 {
     TCGContext *s = &tcg_ctx;
@@ -669,6 +677,11 @@ void tcg_temp_free_i64(TCGv_i64 arg)
     tcg_temp_free_internal(GET_TCGV_I64(arg));
 }
 
+void tcg_temp_free_v128(TCGv_v128 arg)
+{
+    tcg_temp_free_internal(GET_TCGV_V128(arg));
+}
+
 TCGv_i32 tcg_const_i32(int32_t val)
 {
     TCGv_i32 t0;
diff --git a/tcg/tcg.h b/tcg/tcg.h
index a35e4c4..b9aa56b 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -235,6 +235,7 @@ typedef struct TCGPool {
 typedef enum TCGType {
     TCG_TYPE_I32,
     TCG_TYPE_I64,
+    TCG_TYPE_V128,
     TCG_TYPE_COUNT, /* number of different types */
 
     /* An alias for the size of the host register.  */
@@ -410,6 +411,7 @@ typedef tcg_target_ulong TCGArg;
 typedef struct TCGv_i32_d *TCGv_i32;
 typedef struct TCGv_i64_d *TCGv_i64;
 typedef struct TCGv_ptr_d *TCGv_ptr;
+typedef struct TCGv_v128_d *TCGv_v128;
 typedef TCGv_ptr TCGv_env;
 #if TARGET_LONG_BITS == 32
 #define TCGv TCGv_i32
@@ -434,6 +436,11 @@ static inline TCGv_ptr QEMU_ARTIFICIAL MAKE_TCGV_PTR(intptr_t i)
     return (TCGv_ptr)i;
 }
 
+static inline TCGv_v128 QEMU_ARTIFICIAL MAKE_TCGV_V128(intptr_t i)
+{
+    return (TCGv_v128)i;
+}
+
 static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_I32(TCGv_i32 t)
 {
     return (intptr_t)t;
@@ -449,6 +456,11 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
     return (intptr_t)t;
 }
 
+static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_V128(TCGv_v128 t)
+{
+    return (intptr_t)t;
+}
+
 #if TCG_TARGET_REG_BITS == 32
 #define TCGV_LOW(t) MAKE_TCGV_I32(GET_TCGV_I64(t))
 #define TCGV_HIGH(t) MAKE_TCGV_I32(GET_TCGV_I64(t) + 1)
@@ -456,15 +468,18 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
 
 #define TCGV_EQUAL_I32(a, b) (GET_TCGV_I32(a) == GET_TCGV_I32(b))
 #define TCGV_EQUAL_I64(a, b) (GET_TCGV_I64(a) == GET_TCGV_I64(b))
+#define TCGV_EQUAL_V128(a, b) (GET_TCGV_V128(a) == GET_TCGV_V128(b))
 #define TCGV_EQUAL_PTR(a, b) (GET_TCGV_PTR(a) == GET_TCGV_PTR(b))
 
 /* Dummy definition to avoid compiler warnings.  */
 #define TCGV_UNUSED_I32(x) x = MAKE_TCGV_I32(-1)
 #define TCGV_UNUSED_I64(x) x = MAKE_TCGV_I64(-1)
+#define TCGV_UNUSED_V128(x) x = MAKE_TCGV_V128(-1)
 #define TCGV_UNUSED_PTR(x) x = MAKE_TCGV_PTR(-1)
 
 #define TCGV_IS_UNUSED_I32(x) (GET_TCGV_I32(x) == -1)
 #define TCGV_IS_UNUSED_I64(x) (GET_TCGV_I64(x) == -1)
+#define TCGV_IS_UNUSED_V128(x) (GET_TCGV_V128(x) == -1)
 #define TCGV_IS_UNUSED_PTR(x) (GET_TCGV_PTR(x) == -1)
 
 /* call flags */
@@ -787,9 +802,11 @@ TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name);
 
 TCGv_i32 tcg_temp_new_internal_i32(int temp_local);
 TCGv_i64 tcg_temp_new_internal_i64(int temp_local);
+TCGv_v128 tcg_temp_new_internal_v128(int temp_local);
 
 void tcg_temp_free_i32(TCGv_i32 arg);
 void tcg_temp_free_i64(TCGv_i64 arg);
+void tcg_temp_free_v128(TCGv_v128 arg);
 
 static inline TCGv_i32 tcg_global_mem_new_i32(TCGv_ptr reg, intptr_t offset,
                                               const char *name)
@@ -825,6 +842,23 @@ static inline TCGv_i64 tcg_temp_local_new_i64(void)
     return tcg_temp_new_internal_i64(1);
 }
 
+static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t offset,
+                                                const char *name)
+{
+    int idx = tcg_global_mem_new_internal(TCG_TYPE_V128, reg, offset, name);
+    return MAKE_TCGV_V128(idx);
+}
+
+static inline TCGv_v128 tcg_temp_new_v128(void)
+{
+    return tcg_temp_new_internal_v128(0);
+}
+
+static inline TCGv_v128 tcg_temp_local_new_v128(void)
+{
+    return tcg_temp_new_internal_v128(1);
+}
+
 #if defined(CONFIG_DEBUG_TCG)
 /* If you call tcg_clear_temp_count() at the start of a section of
  * code which is not supposed to leak any TCG temporaries, then
-- 
2.1.4


* [Qemu-devel] [PATCH 02/18] tcg: add support for 64bit vector type
From: Kirill Batuzov @ 2017-01-17  9:07 UTC
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

Introduce TCG_TYPE_V64 and the corresponding TCGv_v64 for TCG temps. Add
helper functions that work with temps of this new type.
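
For illustration, a 64-bit vector global could be created like this
(hypothetical offset; patch 08 does this for the ARM D registers):

    TCGv_v64 d0 = tcg_global_mem_new_v64(cpu_env,
                                         offsetof(CPUARMState, vfp.regs[0]),
                                         "d0");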

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg-op.h | 23 +++++++++++++++++++++++
 tcg/tcg.c    | 13 +++++++++++++
 tcg/tcg.h    | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 70 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index df077d6..173fb24 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -266,6 +266,24 @@ static inline void tcg_gen_op3_v128(TCGOpcode opc, TCGv_v128 a1,
                 GET_TCGV_V128(a3));
 }
 
+static inline void tcg_gen_op1_v64(TCGOpcode opc, TCGv_v64 a1)
+{
+    tcg_gen_op1(&tcg_ctx, opc, GET_TCGV_V64(a1));
+}
+
+static inline void tcg_gen_op2_v64(TCGOpcode opc, TCGv_v64 a1,
+                                    TCGv_v64 a2)
+{
+    tcg_gen_op2(&tcg_ctx, opc, GET_TCGV_V64(a1), GET_TCGV_V64(a2));
+}
+
+static inline void tcg_gen_op3_v64(TCGOpcode opc, TCGv_v64 a1,
+                                    TCGv_v64 a2, TCGv_v64 a3)
+{
+    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_V64(a1), GET_TCGV_V64(a2),
+                GET_TCGV_V64(a3));
+}
+
 /* Generic ops.  */
 
 static inline void gen_set_label(TCGLabel *l)
@@ -466,6 +484,11 @@ static inline void tcg_gen_discard_v128(TCGv_v128 arg)
     tcg_gen_op1_v128(INDEX_op_discard, arg);
 }
 
+static inline void tcg_gen_discard_v64(TCGv_v64 arg)
+{
+    tcg_gen_op1_v64(INDEX_op_discard, arg);
+}
+
 /* 64 bit ops */
 
 void tcg_gen_addi_i64(TCGv_i64 ret, TCGv_i64 arg1, int64_t arg2);
diff --git a/tcg/tcg.c b/tcg/tcg.c
index b20a044..e81d1c4 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -637,6 +637,14 @@ TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
     return MAKE_TCGV_I64(idx);
 }
 
+TCGv_v64 tcg_temp_new_internal_v64(int temp_local)
+{
+    int idx;
+
+    idx = tcg_temp_new_internal(TCG_TYPE_V64, temp_local);
+    return MAKE_TCGV_V64(idx);
+}
+
 TCGv_v128 tcg_temp_new_internal_v128(int temp_local)
 {
     int idx;
@@ -677,6 +685,11 @@ void tcg_temp_free_i64(TCGv_i64 arg)
     tcg_temp_free_internal(GET_TCGV_I64(arg));
 }
 
+void tcg_temp_free_v64(TCGv_v64 arg)
+{
+    tcg_temp_free_internal(GET_TCGV_V64(arg));
+}
+
 void tcg_temp_free_v128(TCGv_v128 arg)
 {
     tcg_temp_free_internal(GET_TCGV_V128(arg));
diff --git a/tcg/tcg.h b/tcg/tcg.h
index b9aa56b..397ba86 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -235,6 +235,7 @@ typedef struct TCGPool {
 typedef enum TCGType {
     TCG_TYPE_I32,
     TCG_TYPE_I64,
+    TCG_TYPE_V64,
     TCG_TYPE_V128,
     TCG_TYPE_COUNT, /* number of different types */
 
@@ -411,6 +412,7 @@ typedef tcg_target_ulong TCGArg;
 typedef struct TCGv_i32_d *TCGv_i32;
 typedef struct TCGv_i64_d *TCGv_i64;
 typedef struct TCGv_ptr_d *TCGv_ptr;
+typedef struct TCGv_v64_d *TCGv_v64;
 typedef struct TCGv_v128_d *TCGv_v128;
 typedef TCGv_ptr TCGv_env;
 #if TARGET_LONG_BITS == 32
@@ -436,6 +438,11 @@ static inline TCGv_ptr QEMU_ARTIFICIAL MAKE_TCGV_PTR(intptr_t i)
     return (TCGv_ptr)i;
 }
 
+static inline TCGv_v64 QEMU_ARTIFICIAL MAKE_TCGV_V64(intptr_t i)
+{
+    return (TCGv_v64)i;
+}
+
 static inline TCGv_v128 QEMU_ARTIFICIAL MAKE_TCGV_V128(intptr_t i)
 {
     return (TCGv_v128)i;
@@ -456,6 +463,11 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
     return (intptr_t)t;
 }
 
+static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_V64(TCGv_v64 t)
+{
+    return (intptr_t)t;
+}
+
 static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_V128(TCGv_v128 t)
 {
     return (intptr_t)t;
@@ -468,17 +480,20 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_V128(TCGv_v128 t)
 
 #define TCGV_EQUAL_I32(a, b) (GET_TCGV_I32(a) == GET_TCGV_I32(b))
 #define TCGV_EQUAL_I64(a, b) (GET_TCGV_I64(a) == GET_TCGV_I64(b))
+#define TCGV_EQUAL_V64(a, b) (GET_TCGV_V64(a) == GET_TCGV_V64(b))
 #define TCGV_EQUAL_V128(a, b) (GET_TCGV_V128(a) == GET_TCGV_V128(b))
 #define TCGV_EQUAL_PTR(a, b) (GET_TCGV_PTR(a) == GET_TCGV_PTR(b))
 
 /* Dummy definition to avoid compiler warnings.  */
 #define TCGV_UNUSED_I32(x) x = MAKE_TCGV_I32(-1)
 #define TCGV_UNUSED_I64(x) x = MAKE_TCGV_I64(-1)
+#define TCGV_UNUSED_V64(x) x = MAKE_TCGV_V64(-1)
 #define TCGV_UNUSED_V128(x) x = MAKE_TCGV_V128(-1)
 #define TCGV_UNUSED_PTR(x) x = MAKE_TCGV_PTR(-1)
 
 #define TCGV_IS_UNUSED_I32(x) (GET_TCGV_I32(x) == -1)
 #define TCGV_IS_UNUSED_I64(x) (GET_TCGV_I64(x) == -1)
+#define TCGV_IS_UNUSED_V64(x) (GET_TCGV_V64(x) == -1)
 #define TCGV_IS_UNUSED_V128(x) (GET_TCGV_V128(x) == -1)
 #define TCGV_IS_UNUSED_PTR(x) (GET_TCGV_PTR(x) == -1)
 
@@ -802,10 +817,12 @@ TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name);
 
 TCGv_i32 tcg_temp_new_internal_i32(int temp_local);
 TCGv_i64 tcg_temp_new_internal_i64(int temp_local);
+TCGv_v64 tcg_temp_new_internal_v64(int temp_local);
 TCGv_v128 tcg_temp_new_internal_v128(int temp_local);
 
 void tcg_temp_free_i32(TCGv_i32 arg);
 void tcg_temp_free_i64(TCGv_i64 arg);
+void tcg_temp_free_v64(TCGv_v64 arg);
 void tcg_temp_free_v128(TCGv_v128 arg);
 
 static inline TCGv_i32 tcg_global_mem_new_i32(TCGv_ptr reg, intptr_t offset,
@@ -842,6 +859,23 @@ static inline TCGv_i64 tcg_temp_local_new_i64(void)
     return tcg_temp_new_internal_i64(1);
 }
 
+static inline TCGv_v64 tcg_global_mem_new_v64(TCGv_ptr reg, intptr_t offset,
+                                              const char *name)
+{
+    int idx = tcg_global_mem_new_internal(TCG_TYPE_V64, reg, offset, name);
+    return MAKE_TCGV_V64(idx);
+}
+
+static inline TCGv_v64 tcg_temp_new_v64(void)
+{
+    return tcg_temp_new_internal_v64(0);
+}
+
+static inline TCGv_v64 tcg_temp_local_new_v64(void)
+{
+    return tcg_temp_new_internal_v64(1);
+}
+
 static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t offset,
                                                 const char *name)
 {
-- 
2.1.4


* [Qemu-devel] [PATCH 03/18] tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes
From: Kirill Batuzov @ 2017-01-17  9:07 UTC
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov
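
A short usage sketch of the new load/store ops (hypothetical offset, for
illustration only):

    TCGv_v128 t = tcg_temp_new_v128();
    /* Load a guest Q register into a 128-bit temp, then store it back. */
    tcg_gen_ld_v128(t, cpu_env, offsetof(CPUARMState, vfp.regs[0]));
    tcg_gen_st_v128(t, cpu_env, offsetof(CPUARMState, vfp.regs[0]));
    tcg_temp_free_v128(t);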

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg-op.h  | 38 ++++++++++++++++++++++++++++++++++++++
 tcg/tcg-opc.h | 18 ++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 173fb24..c469ea3 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -489,6 +489,44 @@ static inline void tcg_gen_discard_v64(TCGv_v64 arg)
     tcg_gen_op1_v64(INDEX_op_discard, arg);
 }
 
+static inline void tcg_gen_ldst_op_v128(TCGOpcode opc, TCGv_v128 val,
+                                       TCGv_ptr base, TCGArg offset)
+{
+    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_V128(val), GET_TCGV_PTR(base),
+                offset);
+}
+
+static inline void tcg_gen_st_v128(TCGv_v128 arg1, TCGv_ptr arg2,
+                                   tcg_target_long offset)
+{
+    tcg_gen_ldst_op_v128(INDEX_op_st_v128, arg1, arg2, offset);
+}
+
+static inline void tcg_gen_ld_v128(TCGv_v128 ret, TCGv_ptr arg2,
+                                   tcg_target_long offset)
+{
+    tcg_gen_ldst_op_v128(INDEX_op_ld_v128, ret, arg2, offset);
+}
+
+static inline void tcg_gen_ldst_op_v64(TCGOpcode opc, TCGv_v64 val,
+                                       TCGv_ptr base, TCGArg offset)
+{
+    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_V64(val), GET_TCGV_PTR(base),
+                offset);
+}
+
+static inline void tcg_gen_st_v64(TCGv_v64 arg1, TCGv_ptr arg2,
+                                  tcg_target_long offset)
+{
+    tcg_gen_ldst_op_v64(INDEX_op_st_v64, arg1, arg2, offset);
+}
+
+static inline void tcg_gen_ld_v64(TCGv_v64 ret, TCGv_ptr arg2,
+                                  tcg_target_long offset)
+{
+    tcg_gen_ldst_op_v64(INDEX_op_ld_v64, ret, arg2, offset);
+}
+
 /* 64 bit ops */
 
 void tcg_gen_addi_i64(TCGv_i64 ret, TCGv_i64 arg1, int64_t arg2);
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 45528d2..d622592 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -42,6 +42,18 @@ DEF(br, 0, 0, 1, TCG_OPF_BB_END)
 # define IMPL64  TCG_OPF_64BIT
 #endif
 
+#ifdef TCG_TARGET_HAS_REG128
+# define IMPL128 0
+#else
+# define IMPL128 TCG_OPF_NOT_PRESENT
+#endif
+
+#ifdef TCG_TARGET_HAS_REGV64
+# define IMPLV64 0
+#else
+# define IMPLV64 TCG_OPF_NOT_PRESENT
+#endif
+
 DEF(mb, 0, 0, 1, 0)
 
 DEF(mov_i32, 1, 1, 0, TCG_OPF_NOT_PRESENT)
@@ -178,6 +190,12 @@ DEF(mulsh_i64, 1, 2, 0, IMPL(TCG_TARGET_HAS_mulsh_i64))
 #define TLADDR_ARGS  (TARGET_LONG_BITS <= TCG_TARGET_REG_BITS ? 1 : 2)
 #define DATA64_ARGS  (TCG_TARGET_REG_BITS == 64 ? 1 : 2)
 
+/* load/store */
+DEF(st_v128, 0, 2, 1, IMPL128)
+DEF(ld_v128, 1, 1, 1, IMPL128)
+DEF(st_v64, 0, 2, 1, IMPLV64)
+DEF(ld_v64, 1, 1, 1, IMPLV64)
+
 /* QEMU specific */
 DEF(insn_start, 0, 0, TLADDR_ARGS * TARGET_INSN_START_WORDS,
     TCG_OPF_NOT_PRESENT)
-- 
2.1.4


* [Qemu-devel] [PATCH 04/18] tcg: add simple alias analysis
From: Kirill Batuzov @ 2017-01-17  9:07 UTC
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

Add a simple alias analysis to TCG which finds memory loads and stores
that overlap with CPUArchState. This information can be used later in
liveness analysis to ensure the correctness of register allocation. In
particular, if a load or store overlaps with the memory location of some
global variable, that variable should be spilled and reloaded at
appropriate times.

Previously no such analysis was performed, and for correctness it was
required that no load/store operation overlap with the memory location of
any global variable.
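
As an example of what gets recorded (hypothetical op; the fields are those
of TCGAliasInfo below):

    /* For an op like
     *     st_i32 tmp, env, $offsetof(CPUARMState, regs[0])
     * the analysis records roughly
     *     { .alias_type = TCG_ALIAS_WRITE, .fixed_offset = true,
     *       .offset = offsetof(CPUARMState, regs[0]), .size = 4 }
     * so liveness analysis can treat the global backed by regs[0] as
     * overwritten. */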

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---

checkpatch complains here, but I believe this to be a false positive.

---
 tcg/optimize.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h      |  17 +++++++
 2 files changed, 163 insertions(+)

diff --git a/tcg/optimize.c b/tcg/optimize.c
index 0f13490..1d0eac2 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -34,6 +34,7 @@
 
 struct tcg_temp_info {
     bool is_const;
+    bool is_base;
     uint16_t prev_copy;
     uint16_t next_copy;
     tcg_target_ulong val;
@@ -61,6 +62,7 @@ static void reset_temp(TCGArg temp)
     temps[temp].next_copy = temp;
     temps[temp].prev_copy = temp;
     temps[temp].is_const = false;
+    temps[temp].is_base = false;
     temps[temp].mask = -1;
 }
 
@@ -1335,3 +1337,147 @@ void tcg_optimize(TCGContext *s)
         }
     }
 }
+
+/* Simple alias analysis. It finds out which load/store operations overlap
+   with CPUArchState. The result is stored in TCGContext and can be used
+   during liveness analysis and register allocation. */
+void tcg_alias_analysis(TCGContext *s)
+{
+    int oi, oi_next;
+
+    reset_all_temps(s->nb_temps);
+    temps[GET_TCGV_PTR(s->tcg_env)].is_base = true;
+    temps[GET_TCGV_PTR(s->tcg_env)].val = 0;
+
+    for (oi = s->gen_op_buf[0].next; oi != 0; oi = oi_next) {
+        int nb_oargs, i;
+        int size;
+        TCGAliasType tp;
+
+        TCGOp * const op = &s->gen_op_buf[oi];
+        TCGArg * const args = &s->gen_opparam_buf[op->args];
+        TCGOpcode opc = op->opc;
+        const TCGOpDef *def = &tcg_op_defs[opc];
+
+        oi_next = op->next;
+
+        if (opc == INDEX_op_call) {
+            nb_oargs = op->callo;
+        } else {
+            nb_oargs = def->nb_oargs;
+        }
+
+        s->alias_info[oi] = (TCGAliasInfo){
+                TCG_NOT_ALIAS,
+                false,
+                0,
+                0
+            };
+
+        switch (opc) {
+        CASE_OP_32_64(movi):
+            temps[args[0]].is_const = 1;
+            temps[args[0]].val = args[1];
+            break;
+        CASE_OP_32_64(mov):
+            temps[args[0]].is_const = temps[args[1]].is_const;
+            temps[args[0]].is_base = temps[args[1]].is_base;
+            temps[args[0]].val = temps[args[1]].val;
+            break;
+        CASE_OP_32_64(add):
+        CASE_OP_32_64(sub):
+            if (temps[args[1]].is_base && temps[args[2]].is_const) {
+                temps[args[0]].is_base = true;
+                temps[args[0]].is_const = false;
+                temps[args[0]].val =
+                    do_constant_folding(opc, temps[args[1]].val,
+                                        temps[args[2]].val);
+            } else {
+                reset_temp(args[0]);
+            }
+            break;
+        CASE_OP_32_64(ld8s):
+        CASE_OP_32_64(ld8u):
+            size = 1;
+            tp = TCG_ALIAS_READ;
+            goto do_ldst;
+        CASE_OP_32_64(ld16s):
+        CASE_OP_32_64(ld16u):
+            size = 2;
+            tp = TCG_ALIAS_READ;
+            goto do_ldst;
+        case INDEX_op_ld_i32:
+        case INDEX_op_ld32s_i64:
+        case INDEX_op_ld32u_i64:
+            size = 4;
+            tp = TCG_ALIAS_READ;
+            goto do_ldst;
+        case INDEX_op_ld_i64:
+            size = 8;
+            tp = TCG_ALIAS_READ;
+            goto do_ldst;
+        case INDEX_op_ld_v128:
+            size = 16;
+            tp = TCG_ALIAS_READ;
+            goto do_ldst;
+        CASE_OP_32_64(st8):
+            size = 1;
+            tp = TCG_ALIAS_WRITE;
+            goto do_ldst;
+        CASE_OP_32_64(st16):
+            size = 2;
+            tp = TCG_ALIAS_WRITE;
+            goto do_ldst;
+        case INDEX_op_st_i32:
+        case INDEX_op_st32_i64:
+            size = 4;
+            tp = TCG_ALIAS_WRITE;
+            goto do_ldst;
+        case INDEX_op_st_i64:
+            size = 8;
+            tp = TCG_ALIAS_WRITE;
+            goto do_ldst;
+        case INDEX_op_st_v128:
+            size = 16;
+            tp = TCG_ALIAS_WRITE;
+            goto do_ldst;
+        do_ldst:
+            if (temps[args[1]].is_base) {
+                TCGArg val;
+#if TCG_TARGET_REG_BITS == 32
+                val = do_constant_folding(INDEX_op_add_i32,
+                                          temps[args[1]].val,
+                                          args[2]);
+#else
+                val = do_constant_folding(INDEX_op_add_i64,
+                                          temps[args[1]].val,
+                                          args[2]);
+#endif
+                if ((tcg_target_long)val < sizeof(CPUArchState) &&
+                    (tcg_target_long)val + size > 0) {
+                    s->alias_info[oi].alias_type = tp;
+                    s->alias_info[oi].fixed_offset = true;
+                    s->alias_info[oi].offset = val;
+                    s->alias_info[oi].size = size;
+                } else {
+                    s->alias_info[oi].alias_type = TCG_NOT_ALIAS;
+                }
+            } else {
+                s->alias_info[oi].alias_type = tp;
+                s->alias_info[oi].fixed_offset = false;
+            }
+            goto do_reset_output;
+        default:
+            if (def->flags & TCG_OPF_BB_END) {
+                reset_all_temps(s->nb_temps);
+                temps[GET_TCGV_PTR(s->tcg_env)].is_base = true;
+                temps[GET_TCGV_PTR(s->tcg_env)].val = 0;
+            } else {
+        do_reset_output:
+                for (i = 0; i < nb_oargs; i++) {
+                    reset_temp(args[i]);
+                }
+            }
+            break;
+        }
+    }
+}
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 397ba86..921892f 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -667,6 +667,20 @@ QEMU_BUILD_BUG_ON(OPPARAM_BUF_SIZE > (1 << 14));
 /* Make sure that we don't overflow 64 bits without noticing.  */
 QEMU_BUILD_BUG_ON(sizeof(TCGOp) > 8);
 
+typedef enum TCGAliasType {
+    TCG_NOT_ALIAS = 0,
+    TCG_ALIAS_READ = 1,
+    TCG_ALIAS_WRITE = 2,
+    TCG_ALIAS_RW = TCG_ALIAS_READ | TCG_ALIAS_WRITE
+} TCGAliasType;
+
+typedef struct TCGAliasInfo {
+    TCGAliasType alias_type;
+    bool fixed_offset;
+    tcg_target_long offset;
+    tcg_target_long size;
+} TCGAliasInfo;
+
 struct TCGContext {
     uint8_t *pool_cur, *pool_end;
     TCGPool *pool_first, *pool_current, *pool_first_large;
@@ -751,6 +765,8 @@ struct TCGContext {
     TCGOp gen_op_buf[OPC_BUF_SIZE];
     TCGArg gen_opparam_buf[OPPARAM_BUF_SIZE];
 
+    TCGAliasInfo alias_info[OPC_BUF_SIZE];
+
     uint16_t gen_insn_end_off[TCG_MAX_INSNS];
     target_ulong gen_insn_data[TCG_MAX_INSNS][TARGET_INSN_START_WORDS];
 };
@@ -999,6 +1015,7 @@ TCGOp *tcg_op_insert_before(TCGContext *s, TCGOp *op, TCGOpcode opc, int narg);
 TCGOp *tcg_op_insert_after(TCGContext *s, TCGOp *op, TCGOpcode opc, int narg);
 
 void tcg_optimize(TCGContext *s);
+void tcg_alias_analysis(TCGContext *s);
 
 /* only used for debugging purposes */
 void tcg_dump_ops(TCGContext *s);
-- 
2.1.4


* [Qemu-devel] [PATCH 05/18] tcg: use results of alias analysis in liveness analysis
From: Kirill Batuzov @ 2017-01-17  9:07 UTC
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov
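
To illustrate the two predicates added below (hypothetical offsets):

    /* A 16-byte store at env offset 32 covers bytes [32, 48).
     *   - An 8-byte global at offset 40 ([40, 48)) lies fully inside:
     *     tcg_temp_overwrite() holds, so the global is simply dead.
     *   - An 8-byte global at offset 44 ([44, 52)) overlaps only
     *     partially: tcg_temp_overlap() holds, so the global must be
     *     synced to memory. */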

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index e81d1c4..2f97c13 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -1448,6 +1448,58 @@ static inline void tcg_la_bb_end(TCGContext *s, uint8_t *temp_state)
     }
 }
 
+static intptr_t tcg_temp_size(const TCGTemp *tmp)
+{
+    switch (tmp->base_type) {
+    case TCG_TYPE_I32:
+        return 4;
+    case TCG_TYPE_I64:
+    case TCG_TYPE_V64:
+        return 8;
+    case TCG_TYPE_V128:
+        return 16;
+    default:
+        tcg_abort();
+    }
+}
+
+/* Check if memory write completely overwrites temp's memory location.
+   If this is the case then the temp can be considered dead. */
+static int tcg_temp_overwrite(TCGContext *s, const TCGTemp *tmp,
+                               const TCGAliasInfo *ai)
+{
+    if (!(ai->alias_type & TCG_ALIAS_WRITE) || !ai->fixed_offset) {
+        return 0;
+    }
+    if (tmp->mem_base != &s->temps[GET_TCGV_PTR(s->tcg_env)]) {
+        return 0;
+    }
+    if (ai->offset > tmp->mem_offset
+        || ai->offset + ai->size < tmp->mem_offset + tcg_temp_size(tmp)) {
+            return 0;
+    }
+    return 1;
+}
+
+/* Check if memory read or write overlaps with temp's memory location.
+   If this is the case then the temp must be synced to memory. */
+static int tcg_temp_overlap(TCGContext *s, const TCGTemp *tmp,
+                            const TCGAliasInfo *ai)
+{
+    if (!ai->fixed_offset || tmp->fixed_reg) {
+        return 0;
+    }
+    if (tmp->mem_base != &s->temps[GET_TCGV_PTR(s->tcg_env)]) {
+        return 1;
+    }
+    if (ai->offset >= tmp->mem_offset + tcg_temp_size(tmp)
+        || ai->offset + ai->size <= tmp->mem_offset) {
+            return 0;
+    } else {
+        return 1;
+    }
+}
+
 /* Liveness analysis : update the opc_arg_life array to tell if a
    given input arguments is dead. Instructions updating dead
    temporaries are removed. */
@@ -1650,6 +1702,23 @@ static void liveness_pass_1(TCGContext *s, uint8_t *temp_state)
                     temp_state[arg] = TS_DEAD;
                 }
 
+                /* record if the operation uses some globals' memory location */
+                if (s->alias_info[oi].alias_type != TCG_NOT_ALIAS) {
+                    for (i = 0; i < s->nb_globals; i++) {
+                        if (tcg_temp_overwrite(s, &s->temps[i],
+                                               &s->alias_info[oi])) {
+                            temp_state[i] = TS_DEAD;
+                        } else if (tcg_temp_overlap(s, &s->temps[i],
+                                                    &s->alias_info[oi])) {
+                            if (s->alias_info[oi].alias_type & TCG_ALIAS_READ) {
+                                temp_state[i] = TS_MEM | TS_DEAD;
+                            } else if (!(temp_state[i] & TS_DEAD)) {
+                                temp_state[i] |= TS_MEM;
+                            }
+                        }
+                    }
+                }
+
                 /* if end of basic block, update */
                 if (def->flags & TCG_OPF_BB_END) {
                     tcg_la_bb_end(s, temp_state);
@@ -2591,6 +2660,8 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
     s->la_time -= profile_getclock();
 #endif
 
+    tcg_alias_analysis(s);
+
     {
         uint8_t *temp_state = tcg_malloc(s->nb_temps + s->nb_indirects);
 
-- 
2.1.4


* [Qemu-devel] [PATCH 06/18] tcg: allow globals to overlap
From: Kirill Batuzov @ 2017-01-17  9:07 UTC
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

Sometimes the target architecture may allow some parts of a register to be
accessed as a different register. If both of these registers are
implemented as globals in QEMU, then their contents will overlap, and a
change to one global will also change the value of the other. To handle
such situations properly, some fixes are needed in the register allocator
and liveness analysis.
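
A sketch of the two relations for the ARM case (as wired up in patch 08):

    /* Q0 contains D0 and D1:
     *   sub_temps(Q0)     = { D0, D1, -1 }  writing Q0 also kills D0/D1
     *   overlap_temps(D0) = { Q0, -1 }      writing D0 forces Q0 to be
     *                                       spilled/reloaded via memory */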

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h | 18 ++++++++++++++++++
 2 files changed, 67 insertions(+)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index 2f97c13..330a1c0 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -572,6 +572,8 @@ int tcg_global_mem_new_internal(TCGType type, TCGv_ptr base,
         ts->mem_offset = offset;
         ts->name = name;
     }
+    ts->sub_temps = NULL;
+    ts->overlap_temps = NULL;
     return temp_idx(s, ts);
 }
 
@@ -1500,6 +1502,35 @@ static int tcg_temp_overlap(TCGContext *s, const TCGTemp *tmp,
     }
 }
 
+static void tcg_temp_arr_apply(const TCGArg *arr, uint8_t *temp_state,
+                               uint8_t temp_val)
+{
+    TCGArg i;
+    if (!arr) {
+        return;
+    }
+    for (i = 0; arr[i] != (TCGArg)-1; i++) {
+        temp_state[arr[i]] = temp_val;
+    }
+}
+
+static void tcg_sub_temps_dead(TCGContext *s, TCGArg tmp, uint8_t *temp_state)
+{
+    tcg_temp_arr_apply(s->temps[tmp].sub_temps, temp_state, TS_DEAD);
+}
+
+static void tcg_sub_temps_sync(TCGContext *s, TCGArg tmp, uint8_t *temp_state)
+{
+    tcg_temp_arr_apply(s->temps[tmp].sub_temps, temp_state, TS_MEM | TS_DEAD);
+}
+
+static void tcg_overlap_temps_sync(TCGContext *s, TCGArg tmp,
+                                   uint8_t *temp_state)
+{
+    tcg_temp_arr_apply(s->temps[tmp].overlap_temps, temp_state,
+                       TS_MEM | TS_DEAD);
+}
+
 /* Liveness analysis : update the opc_arg_life array to tell if a
    given input arguments is dead. Instructions updating dead
    temporaries are removed. */
@@ -1554,6 +1585,11 @@ static void liveness_pass_1(TCGContext *s, uint8_t *temp_state)
                         if (temp_state[arg] & TS_MEM) {
                             arg_life |= SYNC_ARG << i;
                         }
+                        /* sub_temps are also dead */
+                        tcg_sub_temps_dead(&tcg_ctx, arg, temp_state);
+                        /* overlap_temps need to go to memory */
+                        tcg_overlap_temps_sync(&tcg_ctx, arg, temp_state);
+
                         temp_state[arg] = TS_DEAD;
                     }
 
@@ -1581,6 +1617,11 @@ static void liveness_pass_1(TCGContext *s, uint8_t *temp_state)
                     for (i = nb_oargs; i < nb_iargs + nb_oargs; i++) {
                         arg = args[i];
                         if (arg != TCG_CALL_DUMMY_ARG) {
+                            /* both sub_temps and overlap_temps need to go
+                               to memory */
+                            tcg_sub_temps_sync(&tcg_ctx, arg, temp_state);
+                            tcg_overlap_temps_sync(&tcg_ctx, arg, temp_state);
+
                             temp_state[arg] &= ~TS_DEAD;
                         }
                     }
@@ -1699,6 +1740,11 @@ static void liveness_pass_1(TCGContext *s, uint8_t *temp_state)
                     if (temp_state[arg] & TS_MEM) {
                         arg_life |= SYNC_ARG << i;
                     }
+                    /* sub_temps are also dead */
+                    tcg_sub_temps_dead(&tcg_ctx, arg, temp_state);
+                    /* overlap_temps need to go to memory */
+                    tcg_overlap_temps_sync(&tcg_ctx, arg, temp_state);
+
                     temp_state[arg] = TS_DEAD;
                 }
 
@@ -1739,6 +1785,9 @@ static void liveness_pass_1(TCGContext *s, uint8_t *temp_state)
                 /* input arguments are live for preceding opcodes */
                 for (i = nb_oargs; i < nb_oargs + nb_iargs; i++) {
                     temp_state[args[i]] &= ~TS_DEAD;
+                    /* both sub_temps and overlap_temps need to go to memory */
+                    tcg_sub_temps_sync(&tcg_ctx, args[i], temp_state);
+                    tcg_overlap_temps_sync(&tcg_ctx, args[i], temp_state);
                 }
             }
             break;
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 921892f..6473228 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -623,6 +623,14 @@ typedef struct TCGTemp {
     struct TCGTemp *mem_base;
     intptr_t mem_offset;
     const char *name;
+
+    /* -1 terminated array of temps that are parts of this temp.
+       All bits of them are part of this temp. */
+    const TCGArg *sub_temps;
+    /* -1 terminated array of temps that overlap with this temp.
+       Some bits of them are part of this temp, but some are not. sub_temps
+       are not included here. */
+    const TCGArg *overlap_temps;
 } TCGTemp;
 
 typedef struct TCGContext TCGContext;
@@ -826,6 +834,16 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb);
 
 void tcg_set_frame(TCGContext *s, TCGReg reg, intptr_t start, intptr_t size);
 
+static inline void tcg_temp_set_sub_temps(TCGArg temp, const TCGArg *arr)
+{
+    tcg_ctx.temps[temp].sub_temps = arr;
+}
+
+static inline void tcg_temp_set_overlap_temps(TCGArg temp, const TCGArg *arr)
+{
+    tcg_ctx.temps[temp].overlap_temps = arr;
+}
+
 int tcg_global_mem_new_internal(TCGType, TCGv_ptr, intptr_t, const char *);
 
 TCGv_i32 tcg_global_reg_new_i32(TCGReg reg, const char *name);
-- 
2.1.4


* [Qemu-devel] [PATCH 07/18] tcg: add vector addition operations
From: Kirill Batuzov @ 2017-01-17  9:07 UTC
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov
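
With the wrappers in place, a lane-wise add can be emitted like this
(hypothetical usage; cpu_Q comes from patch 08):

    /* Four 32-bit lane additions: expands to the add_i32x4 opcode if the
       backend supports it, or to inline scalar code otherwise. */
    tcg_gen_add_i32x4(cpu_Q[0], cpu_Q[1], cpu_Q[2]);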

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg-op.h  | 169 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg-opc.h |  12 +++++
 tcg/tcg.h     |  29 ++++++++++
 3 files changed, 210 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index c469ea3..5de74d3 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -1153,6 +1153,8 @@ void tcg_gen_atomic_xor_fetch_i64(TCGv_i64, TCGv, TCGv_i64, TCGArg, TCGMemOp);
     tcg_gen_add_i32(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), TCGV_PTR_TO_NAT(B))
 # define tcg_gen_addi_ptr(R, A, B) \
     tcg_gen_addi_i32(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), (B))
+# define tcg_gen_movi_ptr(R, B) \
+    tcg_gen_movi_i32(TCGV_PTR_TO_NAT(R), (B))
 # define tcg_gen_ext_i32_ptr(R, A) \
     tcg_gen_mov_i32(TCGV_PTR_TO_NAT(R), (A))
 #else
@@ -1164,6 +1166,173 @@ void tcg_gen_atomic_xor_fetch_i64(TCGv_i64, TCGv, TCGv_i64, TCGArg, TCGMemOp);
     tcg_gen_add_i64(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), TCGV_PTR_TO_NAT(B))
 # define tcg_gen_addi_ptr(R, A, B) \
     tcg_gen_addi_i64(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), (B))
+# define tcg_gen_movi_ptr(R, B) \
+    tcg_gen_movi_i64(TCGV_PTR_TO_NAT(R), (B))
 # define tcg_gen_ext_i32_ptr(R, A) \
     tcg_gen_ext_i32_i64(TCGV_PTR_TO_NAT(R), (A))
 #endif /* UINTPTR_MAX == UINT32_MAX */
+
+/***************************************/
+/* 64-bit and 128-bit vector arithmetic. */
+
+static inline void *tcg_v128_swap_slot(int n)
+{
+    return &tcg_ctx.v128_swap[n * 16];
+}
+
+/* Find a memory location for 128-bit TCG variable. */
+static inline void tcg_v128_to_ptr(TCGv_v128 tmp, TCGv_ptr base, int slot,
+                                   TCGv_ptr *real_base, intptr_t *real_offset,
+                                   int is_read)
+{
+    int idx = GET_TCGV_V128(tmp);
+    assert(idx >= 0 && idx < tcg_ctx.nb_temps);
+    if (idx < tcg_ctx.nb_globals) {
+        /* Globals use their locations within CPUArchState. */
+        int env = GET_TCGV_PTR(tcg_ctx.tcg_env);
+        TCGTemp *ts_env = &tcg_ctx.temps[env];
+        TCGTemp *ts_arg = &tcg_ctx.temps[idx];
+
+        /* Sanity checks: global's memory locations must be addressed
+           relative to ENV. */
+        assert(ts_env->val_type == TEMP_VAL_REG &&
+               ts_env == ts_arg->mem_base &&
+               ts_arg->mem_allocated);
+
+        *real_base = tcg_ctx.tcg_env;
+        *real_offset = ts_arg->mem_offset;
+    } else {
+        /* Temporaries use swap space in TCGContext. Since we already have
+           a 128-bit temporary we'll assume that the target supports 128-bit
+           loads and stores. */
+        *real_base = base;
+        *real_offset = slot * 16;
+        if (is_read) {
+            tcg_gen_st_v128(tmp, base, slot * 16);
+        }
+    }
+}
+
+/* Find a memory location for 64-bit vector TCG variable. */
+static inline void tcg_v64_to_ptr(TCGv_v64 tmp, TCGv_ptr base, int slot,
+                                  TCGv_ptr *real_base, intptr_t *real_offset,
+                                  int is_read)
+{
+    int idx = GET_TCGV_V64(tmp);
+    assert(idx >= 0 && idx < tcg_ctx.nb_temps);
+    if (idx < tcg_ctx.nb_globals) {
+        /* Globals use their locations within CPUArchState. */
+        int env = GET_TCGV_PTR(tcg_ctx.tcg_env);
+        TCGTemp *ts_env = &tcg_ctx.temps[env];
+        TCGTemp *ts_arg = &tcg_ctx.temps[idx];
+
+        /* Sanity checks: global's memory locations must be addressed
+           relative to ENV. */
+        assert(ts_env->val_type == TEMP_VAL_REG &&
+               ts_env == ts_arg->mem_base &&
+               ts_arg->mem_allocated);
+
+        *real_base = tcg_ctx.tcg_env;
+        *real_offset = ts_arg->mem_offset;
+    } else {
+        /* Temporaries use swap space in TCGContext. Since we already have
+           a 64-bit vector temporary we'll assume that the target supports
+           64-bit vector loads and stores. */
+        *real_base = base;
+        *real_offset = slot * 16;
+        if (is_read) {
+            tcg_gen_st_v64(tmp, base, slot * 16);
+        }
+    }
+}
+
+#define GEN_VECT_WRAPPER(name, type, func)                                   \
+    static inline void glue(tcg_gen_, name)(glue(TCGv_, type) res,           \
+                                            glue(TCGv_, type) arg1,          \
+                                            glue(TCGv_, type) arg2)          \
+    {                                                                        \
+        if (glue(TCG_TARGET_HAS_, name)) {                                   \
+            glue(tcg_gen_op3_, type)(glue(INDEX_op_, name), res, arg1,       \
+                                     arg2);                                  \
+        } else {                                                             \
+            TCGv_ptr base = tcg_temp_new_ptr();                              \
+            TCGv_ptr t1 = tcg_temp_new_ptr();                                \
+            TCGv_ptr t2 = tcg_temp_new_ptr();                                \
+            TCGv_ptr t3 = tcg_temp_new_ptr();                                \
+            TCGv_ptr arg1p, arg2p, resp;                                     \
+            intptr_t arg1of, arg2of, resof;                                  \
+                                                                             \
+            tcg_gen_movi_ptr(base, (unsigned long)&tcg_ctx.v128_swap[0]);    \
+                                                                             \
+            glue(glue(tcg_, type), _to_ptr)(arg1, base, 1,                   \
+                                            &arg1p, &arg1of, 1);             \
+            glue(glue(tcg_, type), _to_ptr)(arg2, base, 2,                   \
+                                            &arg2p, &arg2of, 1);             \
+            glue(glue(tcg_, type), _to_ptr)(res, base, 0, &resp, &resof, 0); \
+                                                                             \
+            tcg_gen_addi_ptr(t1, resp, resof);                               \
+            tcg_gen_addi_ptr(t2, arg1p, arg1of);                             \
+            tcg_gen_addi_ptr(t3, arg2p, arg2of);                             \
+            func(t1, t2, t3);                                                \
+                                                                             \
+            if ((intptr_t)res >= tcg_ctx.nb_globals) {                       \
+                glue(tcg_gen_ld_, type)(res, base, 0);                       \
+            }                                                                \
+                                                                             \
+            tcg_temp_free_ptr(base);                                         \
+            tcg_temp_free_ptr(t1);                                           \
+            tcg_temp_free_ptr(t2);                                           \
+            tcg_temp_free_ptr(t3);                                           \
+        }                                                                    \
+    }
+
+#define TCG_INTERNAL_OP(name, N, size, ld, st, op, type)                     \
+    static inline void glue(tcg_internal_, name)(TCGv_ptr resp,              \
+                                                 TCGv_ptr arg1p,             \
+                                                 TCGv_ptr arg2p)             \
+    {                                                                        \
+        int i;                                                               \
+        glue(TCGv_, type) tmp1, tmp2;                                        \
+                                                                             \
+        tmp1 = glue(tcg_temp_new_, type)();                                  \
+        tmp2 = glue(tcg_temp_new_, type)();                                  \
+                                                                             \
+        for (i = 0; i < N; i++) {                                            \
+            glue(tcg_gen_, ld)(tmp1, arg1p, i * size);                       \
+            glue(tcg_gen_, ld)(tmp2, arg2p, i * size);                       \
+            glue(tcg_gen_, op)(tmp1, tmp1, tmp2);                            \
+            glue(tcg_gen_, st)(tmp1, resp, i * size);                        \
+        }                                                                    \
+                                                                             \
+        glue(tcg_temp_free_, type)(tmp1);                                    \
+        glue(tcg_temp_free_, type)(tmp2);                                    \
+    }
+
+#define TCG_INTERNAL_OP_8(name, N, op) \
+    TCG_INTERNAL_OP(name, N, 1, ld8u_i32, st8_i32, op, i32)
+#define TCG_INTERNAL_OP_16(name, N, op) \
+    TCG_INTERNAL_OP(name, N, 2, ld16u_i32, st16_i32, op, i32)
+#define TCG_INTERNAL_OP_32(name, N, op) \
+    TCG_INTERNAL_OP(name, N, 4, ld_i32, st_i32, op, i32)
+#define TCG_INTERNAL_OP_64(name, N, op) \
+    TCG_INTERNAL_OP(name, N, 8, ld_i64, st_i64, op, i64)
+
+TCG_INTERNAL_OP_8(add_i8x16, 16, add_i32)
+TCG_INTERNAL_OP_16(add_i16x8, 8, add_i32)
+TCG_INTERNAL_OP_32(add_i32x4, 4, add_i32)
+TCG_INTERNAL_OP_64(add_i64x2, 2, add_i64)
+
+TCG_INTERNAL_OP_8(add_i8x8, 8, add_i32)
+TCG_INTERNAL_OP_16(add_i16x4, 4, add_i32)
+TCG_INTERNAL_OP_32(add_i32x2, 2, add_i32)
+TCG_INTERNAL_OP_64(add_i64x1, 1, add_i64)
+
+GEN_VECT_WRAPPER(add_i8x16, v128, tcg_internal_add_i8x16)
+GEN_VECT_WRAPPER(add_i16x8, v128, tcg_internal_add_i16x8)
+GEN_VECT_WRAPPER(add_i32x4, v128, tcg_internal_add_i32x4)
+GEN_VECT_WRAPPER(add_i64x2, v128, tcg_internal_add_i64x2)
+
+GEN_VECT_WRAPPER(add_i8x8, v64, tcg_internal_add_i8x8)
+GEN_VECT_WRAPPER(add_i16x4, v64, tcg_internal_add_i16x4)
+GEN_VECT_WRAPPER(add_i32x2, v64, tcg_internal_add_i32x2)
+GEN_VECT_WRAPPER(add_i64x1, v64, tcg_internal_add_i64x1)
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index d622592..0022535 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -196,6 +196,18 @@ DEF(ld_v128, 1, 1, 1, IMPL128)
 DEF(st_v64, 0, 2, 1, IMPLV64)
 DEF(ld_v64, 1, 1, 1, IMPLV64)
 
+/* 128-bit vector arith */
+DEF(add_i8x16, 1, 2, 0, IMPL128 | IMPL(TCG_TARGET_HAS_add_i8x16))
+DEF(add_i16x8, 1, 2, 0, IMPL128 | IMPL(TCG_TARGET_HAS_add_i16x8))
+DEF(add_i32x4, 1, 2, 0, IMPL128 | IMPL(TCG_TARGET_HAS_add_i32x4))
+DEF(add_i64x2, 1, 2, 0, IMPL128 | IMPL(TCG_TARGET_HAS_add_i64x2))
+
+/* 64-bit vector arith */
+DEF(add_i8x8, 1, 2, 0, IMPLV64 | IMPL(TCG_TARGET_HAS_add_i8x8))
+DEF(add_i16x4, 1, 2, 0, IMPLV64 | IMPL(TCG_TARGET_HAS_add_i16x4))
+DEF(add_i32x2, 1, 2, 0, IMPLV64 | IMPL(TCG_TARGET_HAS_add_i32x2))
+DEF(add_i64x1, 1, 2, 0, IMPLV64 | IMPL(TCG_TARGET_HAS_add_i64x1))
+
 /* QEMU specific */
 DEF(insn_start, 0, 0, TLADDR_ARGS * TARGET_INSN_START_WORDS,
     TCG_OPF_NOT_PRESENT)
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 6473228..6f4d0e7 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -145,6 +145,34 @@ typedef uint64_t TCGRegSet;
 #define TCG_TARGET_HAS_rem_i64          0
 #endif
 
+/* 64-bit vector */
+#ifndef TCG_TARGET_HAS_add_i8x8
+#define TCG_TARGET_HAS_add_i8x8         0
+#endif
+#ifndef TCG_TARGET_HAS_add_i16x4
+#define TCG_TARGET_HAS_add_i16x4        0
+#endif
+#ifndef TCG_TARGET_HAS_add_i32x2
+#define TCG_TARGET_HAS_add_i32x2        0
+#endif
+#ifndef TCG_TARGET_HAS_add_i64x1
+#define TCG_TARGET_HAS_add_i64x1        0
+#endif
+
+/* 128-bit vector */
+#ifndef TCG_TARGET_HAS_add_i8x16
+#define TCG_TARGET_HAS_add_i8x16        0
+#endif
+#ifndef TCG_TARGET_HAS_add_i16x8
+#define TCG_TARGET_HAS_add_i16x8        0
+#endif
+#ifndef TCG_TARGET_HAS_add_i32x4
+#define TCG_TARGET_HAS_add_i32x4        0
+#endif
+#ifndef TCG_TARGET_HAS_add_i64x2
+#define TCG_TARGET_HAS_add_i64x2        0
+#endif
+
 /* For 32-bit targets, some sort of unsigned widening multiply is required.  */
 #if TCG_TARGET_REG_BITS == 32 \
     && !(defined(TCG_TARGET_HAS_mulu2_i32) \
@@ -750,6 +778,7 @@ struct TCGContext {
     void *code_gen_buffer;
     size_t code_gen_buffer_size;
     void *code_gen_ptr;
+    uint8_t v128_swap[16 * 3];
 
     /* Threshold to flush the translated code buffer.  */
     void *code_gen_highwater;
-- 
2.1.4


* [Qemu-devel] [PATCH 08/18] target/arm: support access to vector guest registers as globals
From: Kirill Batuzov @ 2017-01-17  9:07 UTC
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

To support vector guest registers as globals we need to do two things:

1) create corresponding globals;
2) mark which globals can overlap (see the layout sketch below).
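
The vfp.regs[] layout assumed below (one uint64_t slot per D register), as
a sketch:

    /* Q[i] occupies vfp.regs[2*i] and vfp.regs[2*i+1];
     * D[j] occupies vfp.regs[j];
     * hence Q[i] fully contains D[2*i] and D[2*i+1] (its sub_temps),
     * while each D register is partially covered by its Q register
     * (its overlap_temps). */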

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---

I've declared regnames for the new globals the same way they were already
declared for the scalar regs. checkpatch complains about it. Should I move
'{' to the same line for all three arrays?

---
 target/arm/translate.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/target/arm/translate.c b/target/arm/translate.c
index 0ad9070..2b81b5d 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -65,6 +65,12 @@ static TCGv_i32 cpu_R[16];
 TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
 TCGv_i64 cpu_exclusive_addr;
 TCGv_i64 cpu_exclusive_val;
+static TCGv_v128 cpu_Q[16];
+static TCGv_v64 cpu_D[32];
+#ifdef CONFIG_USER_ONLY
+TCGv_i64 cpu_exclusive_test;
+TCGv_i32 cpu_exclusive_info;
+#endif
 
 /* FIXME:  These should be removed.  */
 static TCGv_i32 cpu_F0s, cpu_F1s;
@@ -72,14 +78,26 @@ static TCGv_i64 cpu_F0d, cpu_F1d;
 
 #include "exec/gen-icount.h"
 
-static const char *regnames[] =
+static const char *regnames_r[] =
     { "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
       "r8", "r9", "r10", "r11", "r12", "r13", "r14", "pc" };
 
+static const char *regnames_q[] =
+    { "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7",
+      "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15" };
+
+static const char *regnames_d[] =
+    { "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
+      "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15",
+      "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23",
+      "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31" };
+
 /* initialize TCG globals.  */
 void arm_translate_init(void)
 {
     int i;
+    static TCGArg overlap_temps[16][2];
+    static TCGArg sub_temps[16][3];
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
     tcg_ctx.tcg_env = cpu_env;
@@ -87,7 +105,30 @@ void arm_translate_init(void)
     for (i = 0; i < 16; i++) {
         cpu_R[i] = tcg_global_mem_new_i32(cpu_env,
                                           offsetof(CPUARMState, regs[i]),
-                                          regnames[i]);
+                                          regnames_r[i]);
+    }
+    for (i = 0; i < 16; i++) {
+        cpu_Q[i] = tcg_global_mem_new_v128(cpu_env,
+                                           offsetof(CPUARMState,
+                                                    vfp.regs[2 * i]),
+                                           regnames_q[i]);
+    }
+    for (i = 0; i < 32; i++) {
+        cpu_D[i] = tcg_global_mem_new_v64(cpu_env,
+                                          offsetof(CPUARMState, vfp.regs[i]),
+                                          regnames_d[i]);
+    }
+    for (i = 0; i < 16; i++) {
+        overlap_temps[i][0] = GET_TCGV_V128(cpu_Q[i]);
+        overlap_temps[i][1] = (TCGArg)-1;
+        sub_temps[i][0] = GET_TCGV_V64(cpu_D[i * 2]);
+        sub_temps[i][1] = GET_TCGV_V64(cpu_D[i * 2 + 1]);
+        sub_temps[i][2] = (TCGArg)-1;
+        tcg_temp_set_overlap_temps(GET_TCGV_V64(cpu_D[i * 2]),
+                                   overlap_temps[i]);
+        tcg_temp_set_overlap_temps(GET_TCGV_V64(cpu_D[i * 2 + 1]),
+                                   overlap_temps[i]);
+        tcg_temp_set_sub_temps(GET_TCGV_V128(cpu_Q[i]), sub_temps[i]);
     }
     cpu_CF = tcg_global_mem_new_i32(cpu_env, offsetof(CPUARMState, CF), "CF");
     cpu_NF = tcg_global_mem_new_i32(cpu_env, offsetof(CPUARMState, NF), "NF");
-- 
2.1.4


* [Qemu-devel] [PATCH 09/18] target/arm: use vector opcode to handle vadd.<size> instruction
  2017-01-17  9:07 [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (7 preceding siblings ...)
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 08/18] target/arm: support access to vector guest registers as globals Kirill Batuzov
@ 2017-01-17  9:07 ` Kirill Batuzov
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 10/18] tcg/i386: add support for vector opcodes Kirill Batuzov
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-17  9:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 target/arm/translate.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/target/arm/translate.c b/target/arm/translate.c
index 2b81b5d..4378d44 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -5666,6 +5666,37 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
             return 1;
         }
 
+        /* Use vector ops to handle what we can */
+        switch (op) {
+        case NEON_3R_VADD_VSUB:
+            if (!u) {
+                void (* const gen_add_v128[])(TCGv_v128, TCGv_v128,
+                                             TCGv_v128) = {
+                    tcg_gen_add_i8x16,
+                    tcg_gen_add_i16x8,
+                    tcg_gen_add_i32x4,
+                    tcg_gen_add_i64x2
+                };
+                void (* const gen_add_v64[])(TCGv_v64, TCGv_v64,
+                                             TCGv_v64) = {
+                    tcg_gen_add_i8x8,
+                    tcg_gen_add_i16x4,
+                    tcg_gen_add_i32x2,
+                    tcg_gen_add_i64x1
+                };
+                if (q) {
+                    gen_add_v128[size](cpu_Q[rd >> 1], cpu_Q[rn >> 1],
+                                       cpu_Q[rm >> 1]);
+                } else {
+                    gen_add_v64[size](cpu_D[rd], cpu_D[rn], cpu_D[rm]);
+                }
+                return 0;
+            }
+            break;
+        default:
+            break;
+        }
+
         for (pass = 0; pass < (q ? 4 : 2); pass++) {
 
         if (pairwise) {
-- 
2.1.4


* [Qemu-devel] [PATCH 10/18] tcg/i386: add support for vector opcodes
  2017-01-17  9:07 [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (8 preceding siblings ...)
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 09/18] target/arm: use vector opcode to handle vadd.<size> instruction Kirill Batuzov
@ 2017-01-17  9:07 ` Kirill Batuzov
  2017-01-17 20:19   ` Richard Henderson
  2017-01-27 14:51   ` Alex Bennée
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 11/18] tcg/i386: support 64-bit vector operations Kirill Batuzov
                   ` (8 subsequent siblings)
  18 siblings, 2 replies; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-17  9:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

To be able to generate vector operations in a TCG backend we need to do
several things.

1. We need to tell the register allocator about the target's vector
   registers. In the case of x86_64 we'll use xmm0..xmm7. xmm7 is
   designated as a scratch register; the others can be used by the
   register allocator.

2. We need a new constraint to indicate where to use vector registers. In
   this commit the 'V' constraint is introduced.

3. We need to be able to generate the bare minimum: loads, stores and
   reg-to-reg moves. MOVDQU is used for loads and stores; MOVDQA is used
   for reg-to-reg moves.

4. Finally we need to support any other opcodes we want. INDEX_op_add_i32x4
   is the only one for now. The PADDD instruction handles it perfectly (a
   usage sketch follows below).
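
With the above in place a frontend can emit a 128-bit addition and have
it lowered to a single SSE instruction. A minimal usage sketch (assuming
TCGv_v128 temporaries t0..t2 have already been created):

    /* 4 x i32 addition; with the { "V", "0", "V" } constraints below,
       the output is tied to the first input, so this becomes one
       PADDD on the emitted code path.  */
    tcg_gen_add_i32x4(t0, t1, t2);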

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/i386/tcg-target.h     |  24 +++++++++-
 tcg/i386/tcg-target.inc.c | 109 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 125 insertions(+), 8 deletions(-)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 524cfc6..974a58b 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -29,8 +29,14 @@
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 31
 
 #ifdef __x86_64__
-# define TCG_TARGET_REG_BITS  64
-# define TCG_TARGET_NB_REGS   16
+# define TCG_TARGET_HAS_REG128 1
+# ifdef TCG_TARGET_HAS_REG128
+#  define TCG_TARGET_REG_BITS  64
+#  define TCG_TARGET_NB_REGS   24
+# else
+#  define TCG_TARGET_REG_BITS  64
+#  define TCG_TARGET_NB_REGS   16
+# endif
 #else
 # define TCG_TARGET_REG_BITS  32
 # define TCG_TARGET_NB_REGS    8
@@ -56,6 +62,16 @@ typedef enum {
     TCG_REG_R13,
     TCG_REG_R14,
     TCG_REG_R15,
+#ifdef TCG_TARGET_HAS_REG128
+    TCG_REG_XMM0,
+    TCG_REG_XMM1,
+    TCG_REG_XMM2,
+    TCG_REG_XMM3,
+    TCG_REG_XMM4,
+    TCG_REG_XMM5,
+    TCG_REG_XMM6,
+    TCG_REG_XMM7,
+#endif
     TCG_REG_RAX = TCG_REG_EAX,
     TCG_REG_RCX = TCG_REG_ECX,
     TCG_REG_RDX = TCG_REG_EDX,
@@ -133,6 +149,10 @@ extern bool have_bmi1;
 #define TCG_TARGET_HAS_mulsh_i64        0
 #endif
 
+#ifdef TCG_TARGET_HAS_REG128
+#define TCG_TARGET_HAS_add_i32x4        1
+#endif
+
 #define TCG_TARGET_deposit_i32_valid(ofs, len) \
     (((ofs) == 0 && (len) == 8) || ((ofs) == 8 && (len) == 8) || \
      ((ofs) == 0 && (len) == 16))
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index eeb1777..69e3198 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -32,6 +32,9 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
 #else
     "%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi",
 #endif
+#ifdef TCG_TARGET_HAS_REG128
+    "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "%xmm7",
+#endif
 };
 #endif
 
@@ -61,6 +64,16 @@ static const int tcg_target_reg_alloc_order[] = {
     TCG_REG_EDX,
     TCG_REG_EAX,
 #endif
+#ifdef TCG_TARGET_HAS_REG128
+    TCG_REG_XMM0,
+    TCG_REG_XMM1,
+    TCG_REG_XMM2,
+    TCG_REG_XMM3,
+    TCG_REG_XMM4,
+    TCG_REG_XMM5,
+    TCG_REG_XMM6,
+/*  TCG_REG_XMM7, <- scratch register */
+#endif
 };
 
 static const int tcg_target_call_iarg_regs[] = {
@@ -247,6 +260,10 @@ static int target_parse_constraint(TCGArgConstraint *ct, const char **pct_str)
     case 'I':
         ct->ct |= TCG_CT_CONST_I32;
         break;
+    case 'V':
+        ct->ct |= TCG_CT_REG;
+        tcg_regset_set32(ct->u.regs, 0, 0xff0000);
+        break;
 
     default:
         return -1;
@@ -301,6 +318,9 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define P_SIMDF3        0x10000         /* 0xf3 opcode prefix */
 #define P_SIMDF2        0x20000         /* 0xf2 opcode prefix */
 
+#define P_SSE_660F      (P_DATA16 | P_EXT)
+#define P_SSE_F30F      (P_SIMDF3 | P_EXT)
+
 #define OPC_ARITH_EvIz	(0x81)
 #define OPC_ARITH_EvIb	(0x83)
 #define OPC_ARITH_GvEv	(0x03)		/* ... plus (ARITH_FOO << 3) */
@@ -351,6 +371,11 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define OPC_GRP3_Ev	(0xf7)
 #define OPC_GRP5	(0xff)
 
+#define OPC_MOVDQU_M2R  (0x6f | P_SSE_F30F)  /* load 128-bit value */
+#define OPC_MOVDQU_R2M  (0x7f | P_SSE_F30F)  /* store 128-bit value */
+#define OPC_MOVDQA_R2R  (0x6f | P_SSE_660F)  /* reg-to-reg 128-bit mov */
+#define OPC_PADDD       (0xfe | P_SSE_660F)
+
 /* Group 1 opcode extensions for 0x80-0x83.
    These are also used as modifiers for OPC_ARITH.  */
 #define ARITH_ADD 0
@@ -428,6 +453,9 @@ static void tcg_out_opc(TCGContext *s, int opc, int r, int rm, int x)
         tcg_debug_assert((opc & P_REXW) == 0);
         tcg_out8(s, 0x66);
     }
+    if (opc & P_SIMDF3) {
+        tcg_out8(s, 0xf3);
+    }
     if (opc & P_ADDR32) {
         tcg_out8(s, 0x67);
     }
@@ -634,9 +662,24 @@ static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
 static inline void tcg_out_mov(TCGContext *s, TCGType type,
                                TCGReg ret, TCGReg arg)
 {
+    int opc;
     if (arg != ret) {
-        int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
-        tcg_out_modrm(s, opc, ret, arg);
+        switch (type) {
+#ifdef TCG_TARGET_HAS_REG128
+        case TCG_TYPE_V128:
+            ret -= TCG_REG_XMM0;
+            arg -= TCG_REG_XMM0;
+            tcg_out_modrm(s, OPC_MOVDQA_R2R, ret, arg);
+            break;
+#endif
+        case TCG_TYPE_I32:
+        case TCG_TYPE_I64:
+            opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
+            tcg_out_modrm(s, opc, ret, arg);
+            break;
+        default:
+            assert(0);
+        }
     }
 }
 
@@ -711,15 +754,43 @@ static inline void tcg_out_pop(TCGContext *s, int reg)
 static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
                               TCGReg arg1, intptr_t arg2)
 {
-    int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
-    tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
+    int opc;
+    switch (type) {
+#ifdef TCG_TARGET_HAS_REG128
+    case TCG_TYPE_V128:
+        ret -= TCG_REG_XMM0;
+        tcg_out_modrm_offset(s, OPC_MOVDQU_M2R, ret, arg1, arg2);
+        break;
+#endif
+    case TCG_TYPE_I32:
+    case TCG_TYPE_I64:
+        opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
+        tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
+        break;
+    default:
+        assert(0);
+    }
 }
 
 static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
                               TCGReg arg1, intptr_t arg2)
 {
-    int opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
-    tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
+    int opc;
+    switch (type) {
+#ifdef TCG_TARGET_HAS_REG128
+    case TCG_TYPE_V128:
+        arg -= TCG_REG_XMM0;
+        tcg_out_modrm_offset(s, OPC_MOVDQU_R2M, arg, arg1, arg2);
+        break;
+#endif
+    case TCG_TYPE_I32:
+    case TCG_TYPE_I64:
+        opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
+        tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
+        break;
+    default:
+        assert(0);
+    }
 }
 
 static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
@@ -1856,6 +1927,11 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_ld_i32:
         tcg_out_ld(s, TCG_TYPE_I32, args[0], args[1], args[2]);
         break;
+#ifdef TCG_TARGET_HAS_REG128
+    case INDEX_op_ld_v128:
+        tcg_out_ld(s, TCG_TYPE_V128, args[0], args[1], args[2]);
+        break;
+#endif
 
     OP_32_64(st8):
         if (const_args[0]) {
@@ -1888,6 +1964,11 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
             tcg_out_st(s, TCG_TYPE_I32, args[0], args[1], args[2]);
         }
         break;
+#ifdef TCG_TARGET_HAS_REG128
+    case INDEX_op_st_v128:
+        tcg_out_st(s, TCG_TYPE_V128, args[0], args[1], args[2]);
+        break;
+#endif
 
     OP_32_64(add):
         /* For 3-operand addition, use LEA.  */
@@ -2146,6 +2227,13 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_mb:
         tcg_out_mb(s, args[0]);
         break;
+
+#ifdef TCG_TARGET_HAS_REG128
+    case INDEX_op_add_i32x4:
+        tcg_out_modrm(s, OPC_PADDD, args[0], args[2]);
+        break;
+#endif
+
     case INDEX_op_mov_i32:  /* Always emitted via tcg_out_mov.  */
     case INDEX_op_mov_i64:
     case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi.  */
@@ -2171,6 +2259,11 @@ static const TCGTargetOpDef x86_op_defs[] = {
     { INDEX_op_st16_i32, { "ri", "r" } },
     { INDEX_op_st_i32, { "ri", "r" } },
 
+#ifdef TCG_TARGET_HAS_REG128
+    { INDEX_op_ld_v128, { "V", "r" } },
+    { INDEX_op_st_v128, { "V", "r" } },
+#endif
+
     { INDEX_op_add_i32, { "r", "r", "ri" } },
     { INDEX_op_sub_i32, { "r", "0", "ri" } },
     { INDEX_op_mul_i32, { "r", "0", "ri" } },
@@ -2289,6 +2382,10 @@ static const TCGTargetOpDef x86_op_defs[] = {
     { INDEX_op_qemu_ld_i64, { "r", "r", "L", "L" } },
     { INDEX_op_qemu_st_i64, { "L", "L", "L", "L" } },
 #endif
+
+#ifdef TCG_TARGET_HAS_REG128
+    { INDEX_op_add_i32x4, { "V", "0", "V" } },
+#endif
     { -1 },
 };
 
-- 
2.1.4


* [Qemu-devel] [PATCH 11/18] tcg/i386: support 64-bit vector operations
  2017-01-17  9:07 [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (9 preceding siblings ...)
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 10/18] tcg/i386: add support for vector opcodes Kirill Batuzov
@ 2017-01-17  9:07 ` Kirill Batuzov
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 12/18] tcg/i386: support remaining vector addition operations Kirill Batuzov
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-17  9:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/i386/tcg-target.h     |  1 +
 tcg/i386/tcg-target.inc.c | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 974a58b..849b339 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -30,6 +30,7 @@
 
 #ifdef __x86_64__
 # define TCG_TARGET_HAS_REG128 1
+# define TCG_TARGET_HAS_REGV64 1
 # ifdef TCG_TARGET_HAS_REG128
 #  define TCG_TARGET_REG_BITS  64
 #  define TCG_TARGET_NB_REGS   24
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 69e3198..a2d5e09 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -374,6 +374,9 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define OPC_MOVDQU_M2R  (0x6f | P_SSE_F30F)  /* load 128-bit value */
 #define OPC_MOVDQU_R2M  (0x7f | P_SSE_F30F)  /* store 128-bit value */
 #define OPC_MOVDQA_R2R  (0x6f | P_SSE_660F)  /* reg-to-reg 128-bit mov */
+#define OPC_MOVQ_M2R    (0x7e | P_SSE_F30F)
+#define OPC_MOVQ_R2M    (0xd6 | P_SSE_660F)
+#define OPC_MOVQ_R2R    (0xd6 | P_SSE_660F)
 #define OPC_PADDD       (0xfe | P_SSE_660F)
 
 /* Group 1 opcode extensions for 0x80-0x83.
@@ -672,6 +675,13 @@ static inline void tcg_out_mov(TCGContext *s, TCGType type,
             tcg_out_modrm(s, OPC_MOVDQA_R2R, ret, arg);
             break;
 #endif
+#ifdef TCG_TARGET_HAS_REGV64
+        case TCG_TYPE_V64:
+            ret -= TCG_REG_XMM0;
+            arg -= TCG_REG_XMM0;
+            tcg_out_modrm(s, OPC_MOVQ_R2R, ret, arg);
+            break;
+#endif
         case TCG_TYPE_I32:
         case TCG_TYPE_I64:
             opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
@@ -762,6 +772,12 @@ static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
         tcg_out_modrm_offset(s, OPC_MOVDQU_M2R, ret, arg1, arg2);
         break;
 #endif
+#ifdef TCG_TARGET_HAS_REGV64
+    case TCG_TYPE_V64:
+        ret -= TCG_REG_XMM0;
+        tcg_out_modrm_offset(s, OPC_MOVQ_M2R, ret, arg1, arg2);
+        break;
+#endif
     case TCG_TYPE_I32:
     case TCG_TYPE_I64:
         opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
@@ -783,6 +799,12 @@ static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
         tcg_out_modrm_offset(s, OPC_MOVDQU_R2M, arg, arg1, arg2);
         break;
 #endif
+#ifdef TCG_TARGET_HAS_REGV64
+    case TCG_TYPE_V64:
+        arg -= TCG_REG_XMM0;
+        tcg_out_modrm_offset(s, OPC_MOVQ_R2M, arg, arg1, arg2);
+        break;
+#endif
     case TCG_TYPE_I32:
     case TCG_TYPE_I64:
         opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
@@ -2264,6 +2286,11 @@ static const TCGTargetOpDef x86_op_defs[] = {
     { INDEX_op_st_v128, { "V", "r" } },
 #endif
 
+#ifdef TCG_TARGET_HAS_REGV64
+    { INDEX_op_ld_v64, { "V", "r" } },
+    { INDEX_op_st_v64, { "V", "r" } },
+#endif
+
     { INDEX_op_add_i32, { "r", "r", "ri" } },
     { INDEX_op_sub_i32, { "r", "0", "ri" } },
     { INDEX_op_mul_i32, { "r", "0", "ri" } },
-- 
2.1.4


* [Qemu-devel] [PATCH 12/18] tcg/i386: support remaining vector addition operations
  2017-01-17  9:07 [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (10 preceding siblings ...)
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 11/18] tcg/i386: support 64-bit vector operations Kirill Batuzov
@ 2017-01-17  9:07 ` Kirill Batuzov
  2017-01-17 21:49   ` Richard Henderson
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 13/18] tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend Kirill Batuzov
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-17  9:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/i386/tcg-target.h     | 10 ++++++++++
 tcg/i386/tcg-target.inc.c | 37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 849b339..5deb08e 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -151,7 +151,17 @@ extern bool have_bmi1;
 #endif
 
 #ifdef TCG_TARGET_HAS_REG128
+#define TCG_TARGET_HAS_add_i8x16        1
+#define TCG_TARGET_HAS_add_i16x8        1
 #define TCG_TARGET_HAS_add_i32x4        1
+#define TCG_TARGET_HAS_add_i64x2        1
+#endif
+
+#ifdef TCG_TARGET_HAS_REGV64
+#define TCG_TARGET_HAS_add_i8x8         1
+#define TCG_TARGET_HAS_add_i16x4        1
+#define TCG_TARGET_HAS_add_i32x2        1
+#define TCG_TARGET_HAS_add_i64x1        1
 #endif
 
 #define TCG_TARGET_deposit_i32_valid(ofs, len) \
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index a2d5e09..d00bd12 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -377,7 +377,10 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define OPC_MOVQ_M2R    (0x7e | P_SSE_F30F)
 #define OPC_MOVQ_R2M    (0xd6 | P_SSE_660F)
 #define OPC_MOVQ_R2R    (0xd6 | P_SSE_660F)
+#define OPC_PADDB       (0xfc | P_SSE_660F)
+#define OPC_PADDW       (0xfd | P_SSE_660F)
 #define OPC_PADDD       (0xfe | P_SSE_660F)
+#define OPC_PADDQ       (0xd4 | P_SSE_660F)
 
 /* Group 1 opcode extensions for 0x80-0x83.
    These are also used as modifiers for OPC_ARITH.  */
@@ -2251,9 +2254,33 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         break;
 
 #ifdef TCG_TARGET_HAS_REG128
+    case INDEX_op_add_i8x16:
+        tcg_out_modrm(s, OPC_PADDB, args[0], args[2]);
+        break;
+    case INDEX_op_add_i16x8:
+        tcg_out_modrm(s, OPC_PADDW, args[0], args[2]);
+        break;
     case INDEX_op_add_i32x4:
         tcg_out_modrm(s, OPC_PADDD, args[0], args[2]);
         break;
+    case INDEX_op_add_i64x2:
+        tcg_out_modrm(s, OPC_PADDQ, args[0], args[2]);
+        break;
+#endif
+
+#ifdef TCG_TARGET_HAS_REGV64
+    case INDEX_op_add_i8x8:
+        tcg_out_modrm(s, OPC_PADDB, args[0], args[2]);
+        break;
+    case INDEX_op_add_i16x4:
+        tcg_out_modrm(s, OPC_PADDW, args[0], args[2]);
+        break;
+    case INDEX_op_add_i32x2:
+        tcg_out_modrm(s, OPC_PADDD, args[0], args[2]);
+        break;
+    case INDEX_op_add_i64x1:
+        tcg_out_modrm(s, OPC_PADDQ, args[0], args[2]);
+        break;
 #endif
 
     case INDEX_op_mov_i32:  /* Always emitted via tcg_out_mov.  */
@@ -2411,7 +2438,17 @@ static const TCGTargetOpDef x86_op_defs[] = {
 #endif
 
 #ifdef TCG_TARGET_HAS_REG128
+    { INDEX_op_add_i8x16, { "V", "0", "V" } },
+    { INDEX_op_add_i16x8, { "V", "0", "V" } },
     { INDEX_op_add_i32x4, { "V", "0", "V" } },
+    { INDEX_op_add_i64x2, { "V", "0", "V" } },
+#endif
+
+#ifdef TCG_TARGET_HAS_REGV64
+    { INDEX_op_add_i8x8, { "V", "0", "V" } },
+    { INDEX_op_add_i16x4, { "V", "0", "V" } },
+    { INDEX_op_add_i32x2, { "V", "0", "V" } },
+    { INDEX_op_add_i64x1, { "V", "0", "V" } },
 #endif
     { -1 },
 };
-- 
2.1.4


* [Qemu-devel] [PATCH 13/18] tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend
  2017-01-17  9:07 [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (11 preceding siblings ...)
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 12/18] tcg/i386: support remaining vector addition operations Kirill Batuzov
@ 2017-01-17  9:07 ` Kirill Batuzov
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 14/18] tcg: introduce new TCGMemOp - MO_128 Kirill Batuzov
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-17  9:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/aarch64/tcg-target.inc.c |  4 ++--
 tcg/arm/tcg-target.inc.c     |  4 ++--
 tcg/i386/tcg-target.inc.c    |  4 ++--
 tcg/mips/tcg-target.inc.c    |  4 ++--
 tcg/ppc/tcg-target.inc.c     |  4 ++--
 tcg/s390/tcg-target.inc.c    |  4 ++--
 tcg/sparc/tcg-target.inc.c   | 12 ++++++------
 tcg/tcg-op.c                 |  4 ++--
 tcg/tcg.h                    |  1 +
 9 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/tcg/aarch64/tcg-target.inc.c b/tcg/aarch64/tcg-target.inc.c
index 1939d35..a3314e3 100644
--- a/tcg/aarch64/tcg-target.inc.c
+++ b/tcg/aarch64/tcg-target.inc.c
@@ -1002,7 +1002,7 @@ static inline void tcg_out_mb(TCGContext *s, TCGArg a0)
 /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  *                                     TCGMemOpIdx oi, uintptr_t ra)
  */
-static void * const qemu_ld_helpers[16] = {
+static void * const qemu_ld_helpers[] = {
     [MO_UB]   = helper_ret_ldub_mmu,
     [MO_LEUW] = helper_le_lduw_mmu,
     [MO_LEUL] = helper_le_ldul_mmu,
@@ -1016,7 +1016,7 @@ static void * const qemu_ld_helpers[16] = {
  *                                     uintxx_t val, TCGMemOpIdx oi,
  *                                     uintptr_t ra)
  */
-static void * const qemu_st_helpers[16] = {
+static void * const qemu_st_helpers[] = {
     [MO_UB]   = helper_ret_stb_mmu,
     [MO_LEUW] = helper_le_stw_mmu,
     [MO_LEUL] = helper_le_stl_mmu,
diff --git a/tcg/arm/tcg-target.inc.c b/tcg/arm/tcg-target.inc.c
index ffa0d40..c685785 100644
--- a/tcg/arm/tcg-target.inc.c
+++ b/tcg/arm/tcg-target.inc.c
@@ -1083,7 +1083,7 @@ static inline void tcg_out_mb(TCGContext *s, TCGArg a0)
 /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  *                                     int mmu_idx, uintptr_t ra)
  */
-static void * const qemu_ld_helpers[16] = {
+static void * const qemu_ld_helpers[] = {
     [MO_UB]   = helper_ret_ldub_mmu,
     [MO_SB]   = helper_ret_ldsb_mmu,
 
@@ -1103,7 +1103,7 @@ static void * const qemu_ld_helpers[16] = {
 /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
  *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
  */
-static void * const qemu_st_helpers[16] = {
+static void * const qemu_st_helpers[] = {
     [MO_UB]   = helper_ret_stb_mmu,
     [MO_LEUW] = helper_le_stw_mmu,
     [MO_LEUL] = helper_le_stl_mmu,
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index d00bd12..cd9de4d 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -1257,7 +1257,7 @@ static void tcg_out_nopn(TCGContext *s, int n)
 /* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
  *                                     int mmu_idx, uintptr_t ra)
  */
-static void * const qemu_ld_helpers[16] = {
+static void * const qemu_ld_helpers[] = {
     [MO_UB]   = helper_ret_ldub_mmu,
     [MO_LEUW] = helper_le_lduw_mmu,
     [MO_LEUL] = helper_le_ldul_mmu,
@@ -1270,7 +1270,7 @@ static void * const qemu_ld_helpers[16] = {
 /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
  *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
  */
-static void * const qemu_st_helpers[16] = {
+static void * const qemu_st_helpers[] = {
     [MO_UB]   = helper_ret_stb_mmu,
     [MO_LEUW] = helper_le_stw_mmu,
     [MO_LEUL] = helper_le_stl_mmu,
diff --git a/tcg/mips/tcg-target.inc.c b/tcg/mips/tcg-target.inc.c
index 5b2fe98..f9c02c9 100644
--- a/tcg/mips/tcg-target.inc.c
+++ b/tcg/mips/tcg-target.inc.c
@@ -1101,7 +1101,7 @@ static void tcg_out_call(TCGContext *s, tcg_insn_unit *arg)
 }
 
 #if defined(CONFIG_SOFTMMU)
-static void * const qemu_ld_helpers[16] = {
+static void * const qemu_ld_helpers[] = {
     [MO_UB]   = helper_ret_ldub_mmu,
     [MO_SB]   = helper_ret_ldsb_mmu,
     [MO_LEUW] = helper_le_lduw_mmu,
@@ -1118,7 +1118,7 @@ static void * const qemu_ld_helpers[16] = {
 #endif
 };
 
-static void * const qemu_st_helpers[16] = {
+static void * const qemu_st_helpers[] = {
     [MO_UB]   = helper_ret_stb_mmu,
     [MO_LEUW] = helper_le_stw_mmu,
     [MO_LEUL] = helper_le_stl_mmu,
diff --git a/tcg/ppc/tcg-target.inc.c b/tcg/ppc/tcg-target.inc.c
index a3262cf..b3fde1e 100644
--- a/tcg/ppc/tcg-target.inc.c
+++ b/tcg/ppc/tcg-target.inc.c
@@ -1383,7 +1383,7 @@ static const uint32_t qemu_exts_opc[4] = {
 /* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
  *                                 int mmu_idx, uintptr_t ra)
  */
-static void * const qemu_ld_helpers[16] = {
+static void * const qemu_ld_helpers[] = {
     [MO_UB]   = helper_ret_ldub_mmu,
     [MO_LEUW] = helper_le_lduw_mmu,
     [MO_LEUL] = helper_le_ldul_mmu,
@@ -1396,7 +1396,7 @@ static void * const qemu_ld_helpers[16] = {
 /* helper signature: helper_st_mmu(CPUState *env, target_ulong addr,
  *                                 uintxx_t val, int mmu_idx, uintptr_t ra)
  */
-static void * const qemu_st_helpers[16] = {
+static void * const qemu_st_helpers[] = {
     [MO_UB]   = helper_ret_stb_mmu,
     [MO_LEUW] = helper_le_stw_mmu,
     [MO_LEUL] = helper_le_stl_mmu,
diff --git a/tcg/s390/tcg-target.inc.c b/tcg/s390/tcg-target.inc.c
index 8d5d2bd..75c5e56 100644
--- a/tcg/s390/tcg-target.inc.c
+++ b/tcg/s390/tcg-target.inc.c
@@ -307,7 +307,7 @@ static const uint8_t tcg_cond_to_ltr_cond[] = {
 };
 
 #ifdef CONFIG_SOFTMMU
-static void * const qemu_ld_helpers[16] = {
+static void * const qemu_ld_helpers[] = {
     [MO_UB]   = helper_ret_ldub_mmu,
     [MO_SB]   = helper_ret_ldsb_mmu,
     [MO_LEUW] = helper_le_lduw_mmu,
@@ -322,7 +322,7 @@ static void * const qemu_ld_helpers[16] = {
     [MO_BEQ]  = helper_be_ldq_mmu,
 };
 
-static void * const qemu_st_helpers[16] = {
+static void * const qemu_st_helpers[] = {
     [MO_UB]   = helper_ret_stb_mmu,
     [MO_LEUW] = helper_le_stw_mmu,
     [MO_LEUL] = helper_le_stl_mmu,
diff --git a/tcg/sparc/tcg-target.inc.c b/tcg/sparc/tcg-target.inc.c
index 700c434..6ccc949 100644
--- a/tcg/sparc/tcg-target.inc.c
+++ b/tcg/sparc/tcg-target.inc.c
@@ -844,12 +844,12 @@ static void tcg_out_mb(TCGContext *s, TCGArg a0)
 }
 
 #ifdef CONFIG_SOFTMMU
-static tcg_insn_unit *qemu_ld_trampoline[16];
-static tcg_insn_unit *qemu_st_trampoline[16];
+static tcg_insn_unit *qemu_ld_trampoline[MO_ALL];
+static tcg_insn_unit *qemu_st_trampoline[MO_ALL];
 
 static void build_trampolines(TCGContext *s)
 {
-    static void * const qemu_ld_helpers[16] = {
+    static void * const qemu_ld_helpers[MO_ALL] = {
         [MO_UB]   = helper_ret_ldub_mmu,
         [MO_SB]   = helper_ret_ldsb_mmu,
         [MO_LEUW] = helper_le_lduw_mmu,
@@ -861,7 +861,7 @@ static void build_trampolines(TCGContext *s)
         [MO_BEUL] = helper_be_ldul_mmu,
         [MO_BEQ]  = helper_be_ldq_mmu,
     };
-    static void * const qemu_st_helpers[16] = {
+    static void * const qemu_st_helpers[MO_ALL] = {
         [MO_UB]   = helper_ret_stb_mmu,
         [MO_LEUW] = helper_le_stw_mmu,
         [MO_LEUL] = helper_le_stl_mmu,
@@ -874,7 +874,7 @@ static void build_trampolines(TCGContext *s)
     int i;
     TCGReg ra;
 
-    for (i = 0; i < 16; ++i) {
+    for (i = 0; i < MO_ALL; ++i) {
         if (qemu_ld_helpers[i] == NULL) {
             continue;
         }
@@ -902,7 +902,7 @@ static void build_trampolines(TCGContext *s)
         tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O7, ra);
     }
 
-    for (i = 0; i < 16; ++i) {
+    for (i = 0; i < MO_ALL; ++i) {
         if (qemu_st_helpers[i] == NULL) {
             continue;
         }
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 6e2fb35..0925fab 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -2079,7 +2079,7 @@ typedef void (*gen_atomic_op_i64)(TCGv_i64, TCGv_env, TCGv, TCGv_i64);
 # define WITH_ATOMIC64(X)
 #endif
 
-static void * const table_cmpxchg[16] = {
+static void * const table_cmpxchg[] = {
     [MO_8] = gen_helper_atomic_cmpxchgb,
     [MO_16 | MO_LE] = gen_helper_atomic_cmpxchgw_le,
     [MO_16 | MO_BE] = gen_helper_atomic_cmpxchgw_be,
@@ -2297,7 +2297,7 @@ static void do_atomic_op_i64(TCGv_i64 ret, TCGv addr, TCGv_i64 val,
 }
 
 #define GEN_ATOMIC_HELPER(NAME, OP, NEW)                                \
-static void * const table_##NAME[16] = {                                \
+static void * const table_##NAME[] = {                                  \
     [MO_8] = gen_helper_atomic_##NAME##b,                               \
     [MO_16 | MO_LE] = gen_helper_atomic_##NAME##w_le,                   \
     [MO_16 | MO_BE] = gen_helper_atomic_##NAME##w_be,                   \
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 6f4d0e7..cb672f2 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -375,6 +375,7 @@ typedef enum TCGMemOp {
     MO_TEQ   = MO_TE | MO_Q,
 
     MO_SSIZE = MO_SIZE | MO_SIGN,
+    MO_ALL   = MO_SIZE | MO_SIGN | MO_BSWAP | MO_AMASK,  /* All bits above.  */
 } TCGMemOp;
 
 /**
-- 
2.1.4


* [Qemu-devel] [PATCH 14/18] tcg: introduce new TCGMemOp - MO_128
  2017-01-17  9:07 [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (12 preceding siblings ...)
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 13/18] tcg: do not rely on exact values of MO_BSWAP or MO_SIGN in backend Kirill Batuzov
@ 2017-01-17  9:07 ` Kirill Batuzov
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 15/18] tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes Kirill Batuzov
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-17  9:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/tcg.h | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index cb672f2..f205c6b 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -295,11 +295,12 @@ typedef enum TCGMemOp {
     MO_16    = 1,
     MO_32    = 2,
     MO_64    = 3,
-    MO_SIZE  = 3,   /* Mask for the above.  */
+    MO_128   = 4,
+    MO_SIZE  = 7,   /* Mask for the above.  */
 
-    MO_SIGN  = 4,   /* Sign-extended, otherwise zero-extended.  */
+    MO_SIGN  = 8,   /* Sign-extended, otherwise zero-extended.  */
 
-    MO_BSWAP = 8,   /* Host reverse endian.  */
+    MO_BSWAP = 16,   /* Host reverse endian.  */
 #ifdef HOST_WORDS_BIGENDIAN
     MO_LE    = MO_BSWAP,
     MO_BE    = 0,
@@ -331,7 +332,7 @@ typedef enum TCGMemOp {
      * - an alignment to a specified size, which may be more or less than
      *   the access size (MO_ALIGN_x where 'x' is a size in bytes);
      */
-    MO_ASHIFT = 4,
+    MO_ASHIFT = 5,
     MO_AMASK = 7 << MO_ASHIFT,
 #ifdef ALIGNED_ONLY
     MO_ALIGN = 0,
-- 
2.1.4


* [Qemu-devel] [PATCH 15/18] tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes
  2017-01-17  9:07 [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (13 preceding siblings ...)
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 14/18] tcg: introduce new TCGMemOp - MO_128 Kirill Batuzov
@ 2017-01-17  9:07 ` Kirill Batuzov
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 16/18] softmmu: create helpers for vector loads Kirill Batuzov
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-17  9:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/i386/tcg-target.inc.c |  5 +++++
 tcg/tcg-op.c              | 16 ++++++++++++++++
 tcg/tcg-op.h              |  8 ++++++++
 tcg/tcg-opc.h             |  4 ++++
 4 files changed, 33 insertions(+)

diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index cd9de4d..c28fd09 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -2438,6 +2438,11 @@ static const TCGTargetOpDef x86_op_defs[] = {
 #endif
 
 #ifdef TCG_TARGET_HAS_REG128
+    { INDEX_op_qemu_ld_v128, { "V", "L" } },
+    { INDEX_op_qemu_st_v128, { "V", "L" } },
+#endif
+
+#ifdef TCG_TARGET_HAS_REG128
     { INDEX_op_add_i8x16, { "V", "0", "V" } },
     { INDEX_op_add_i16x8, { "V", "0", "V" } },
     { INDEX_op_add_i32x4, { "V", "0", "V" } },
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 0925fab..dd92e71 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -2350,3 +2350,19 @@ static void tcg_gen_mov2_i64(TCGv_i64 r, TCGv_i64 a, TCGv_i64 b)
 GEN_ATOMIC_HELPER(xchg, mov2, 0)
 
 #undef GEN_ATOMIC_HELPER
+
+void tcg_gen_qemu_ld_v128(TCGv_v128 val, TCGv addr, TCGArg idx,
+                          TCGMemOp memop)
+{
+    TCGMemOpIdx oi = make_memop_idx(memop, idx);
+    assert((memop & MO_BSWAP) == MO_TE);
+    tcg_gen_op3si_v128(INDEX_op_qemu_ld_v128, val, addr, oi);
+}
+
+void tcg_gen_qemu_st_v128(TCGv_v128 val, TCGv addr, TCGArg idx,
+                          TCGMemOp memop)
+{
+    TCGMemOpIdx oi = make_memop_idx(memop, idx);
+    assert((memop & MO_BSWAP) == MO_TE);
+    tcg_gen_op3si_v128(INDEX_op_qemu_st_v128, val, addr, oi);
+}
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 5de74d3..4646f87 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -266,6 +266,12 @@ static inline void tcg_gen_op3_v128(TCGOpcode opc, TCGv_v128 a1,
                 GET_TCGV_V128(a3));
 }
 
+static inline void tcg_gen_op3si_v128(TCGOpcode opc, TCGv_v128 a1,
+                                     TCGv_i32 a2, TCGArg a3)
+{
+    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_V128(a1), GET_TCGV_I32(a2), a3);
+}
+
 static inline void tcg_gen_op1_v64(TCGOpcode opc, TCGv_v64 a1)
 {
     tcg_gen_op1(&tcg_ctx, opc, GET_TCGV_V64(a1));
@@ -885,6 +891,8 @@ void tcg_gen_qemu_ld_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
 void tcg_gen_qemu_st_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
 void tcg_gen_qemu_ld_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
 void tcg_gen_qemu_st_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
+void tcg_gen_qemu_ld_v128(TCGv_v128, TCGv, TCGArg, TCGMemOp);
+void tcg_gen_qemu_st_v128(TCGv_v128, TCGv, TCGArg, TCGMemOp);
 
 static inline void tcg_gen_qemu_ld8u(TCGv ret, TCGv addr, int mem_index)
 {
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 0022535..8ff1416 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -222,6 +222,10 @@ DEF(qemu_ld_i64, DATA64_ARGS, TLADDR_ARGS, 1,
     TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT)
 DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
     TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT)
+DEF(qemu_ld_v128, 1, 1, 1,
+    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | IMPL128)
+DEF(qemu_st_v128, 0, 2, 1,
+    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | IMPL128)
 
 #undef TLADDR_ARGS
 #undef DATA64_ARGS
-- 
2.1.4


* [Qemu-devel] [PATCH 16/18] softmmu: create helpers for vector loads
  2017-01-17  9:07 [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (14 preceding siblings ...)
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 15/18] tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes Kirill Batuzov
@ 2017-01-17  9:07 ` Kirill Batuzov
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 17/18] tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops Kirill Batuzov
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-17  9:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 cputlb.c                  |   4 +
 softmmu_template_vector.h | 266 ++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h                 |   5 +
 3 files changed, 275 insertions(+)
 create mode 100644 softmmu_template_vector.h

diff --git a/cputlb.c b/cputlb.c
index 813279f..e174773 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -669,6 +669,10 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
 #define DATA_SIZE 8
 #include "softmmu_template.h"
 
+#define SHIFT 4
+#include "softmmu_template_vector.h"
+#undef MMUSUFFIX
+
 /* First set of helpers allows passing in of OI and RETADDR.  This makes
    them callable from other helpers.  */
 
diff --git a/softmmu_template_vector.h b/softmmu_template_vector.h
new file mode 100644
index 0000000..b286d65
--- /dev/null
+++ b/softmmu_template_vector.h
@@ -0,0 +1,266 @@
+/*
+ *  Software MMU support
+ *
+ * Generate helpers used by TCG for qemu_ld/st vector ops and code
+ * load functions.
+ *
+ * Included from target op helpers and exec.c.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include "qemu/timer.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+
+#define DATA_SIZE (1 << SHIFT)
+
+#if DATA_SIZE == 16
+#define SUFFIX v128
+#else
+#error unsupported data size
+#endif
+
+
+#ifdef SOFTMMU_CODE_ACCESS
+#define READ_ACCESS_TYPE MMU_INST_FETCH
+#define ADDR_READ addr_code
+#else
+#define READ_ACCESS_TYPE MMU_DATA_LOAD
+#define ADDR_READ addr_read
+#endif
+
+#define helper_te_ld_name  glue(glue(helper_te_ld, SUFFIX), MMUSUFFIX)
+#define helper_te_st_name  glue(glue(helper_te_st, SUFFIX), MMUSUFFIX)
+
+#ifndef SOFTMMU_CODE_ACCESS
+static inline void glue(io_read, SUFFIX)(CPUArchState *env,
+                                         CPUIOTLBEntry *iotlbentry,
+                                         target_ulong addr,
+                                         uintptr_t retaddr,
+                                         uint8_t *res)
+{
+    CPUState *cpu = ENV_GET_CPU(env);
+    hwaddr physaddr = iotlbentry->addr;
+    MemoryRegion *mr = iotlb_to_region(cpu, physaddr, iotlbentry->attrs);
+    int i;
+
+    assert(0); /* Needs testing */
+
+    physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
+    cpu->mem_io_pc = retaddr;
+    if (mr != &io_mem_rom && mr != &io_mem_notdirty && !cpu->can_do_io) {
+        cpu_io_recompile(cpu, retaddr);
+    }
+
+    cpu->mem_io_vaddr = addr;
+    for (i = 0; i < (1 << SHIFT); i += 8) {
+        memory_region_dispatch_read(mr, physaddr + i, (uint64_t *)(res + i),
+                                    8, iotlbentry->attrs);
+    }
+}
+#endif
+
+void helper_te_ld_name(CPUArchState *env, target_ulong addr,
+                       TCGMemOpIdx oi, uintptr_t retaddr, uint8_t *res)
+{
+    unsigned mmu_idx = get_mmuidx(oi);
+    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
+    uintptr_t haddr;
+    int i;
+
+    /* Adjust the given return address.  */
+    retaddr -= GETPC_ADJ;
+
+    /* If the TLB entry is for a different page, reload and try again.  */
+    if ((addr & TARGET_PAGE_MASK)
+         != (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
+        if ((addr & (DATA_SIZE - 1)) != 0
+            && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+            cpu_unaligned_access(ENV_GET_CPU(env), addr, READ_ACCESS_TYPE,
+                                 mmu_idx, retaddr);
+        }
+        if (!VICTIM_TLB_HIT(ADDR_READ, addr)) {
+            tlb_fill(ENV_GET_CPU(env), addr, READ_ACCESS_TYPE,
+                     mmu_idx, retaddr);
+        }
+        tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
+    }
+
+    /* Handle an IO access.  */
+    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
+        CPUIOTLBEntry *iotlbentry;
+        if ((addr & (DATA_SIZE - 1)) != 0) {
+            goto do_unaligned_access;
+        }
+        iotlbentry = &env->iotlb[mmu_idx][index];
+
+        /* ??? Note that the io helpers always read data in the target
+           byte ordering.  We should push the LE/BE request down into io.  */
+        glue(io_read, SUFFIX)(env, iotlbentry, addr, retaddr, res);
+        return;
+    }
+
+    /* Handle slow unaligned access (it spans two pages or IO).  */
+    if (DATA_SIZE > 1
+        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+                    >= TARGET_PAGE_SIZE)) {
+        target_ulong addr1, addr2;
+        uint8_t res1[DATA_SIZE * 2];
+        unsigned shift;
+    do_unaligned_access:
+        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+            cpu_unaligned_access(ENV_GET_CPU(env), addr, READ_ACCESS_TYPE,
+                                 mmu_idx, retaddr);
+        }
+        addr1 = addr & ~(DATA_SIZE - 1);
+        addr2 = addr1 + DATA_SIZE;
+        /* Note the adjustment at the beginning of the function.
+           Undo that for the recursion.  */
+        helper_te_ld_name(env, addr1, oi, retaddr + GETPC_ADJ, res1);
+        helper_te_ld_name(env, addr2, oi, retaddr + GETPC_ADJ,
+                          res1 + DATA_SIZE);
+        shift = addr & (DATA_SIZE - 1);
+
+        for (i = 0; i < DATA_SIZE; i++) {
+            res[i] = res1[i + shift];
+        }
+        return;
+    }
+
+    /* Handle aligned access or unaligned access in the same page.  */
+    if ((addr & (DATA_SIZE - 1)) != 0
+        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+        cpu_unaligned_access(ENV_GET_CPU(env), addr, READ_ACCESS_TYPE,
+                             mmu_idx, retaddr);
+    }
+
+    haddr = addr + env->tlb_table[mmu_idx][index].addend;
+    for (i = 0; i < DATA_SIZE; i++) {
+        res[i] = ((uint8_t *)haddr)[i];
+    }
+}
+
+#ifndef SOFTMMU_CODE_ACCESS
+
+static inline void glue(io_write, SUFFIX)(CPUArchState *env,
+                                          CPUIOTLBEntry *iotlbentry,
+                                          uint8_t *val,
+                                          target_ulong addr,
+                                          uintptr_t retaddr)
+{
+    CPUState *cpu = ENV_GET_CPU(env);
+    hwaddr physaddr = iotlbentry->addr;
+    MemoryRegion *mr = iotlb_to_region(cpu, physaddr, iotlbentry->attrs);
+    int i;
+
+    assert(0); /* Needs testing */
+
+    physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
+    if (mr != &io_mem_rom && mr != &io_mem_notdirty && !cpu->can_do_io) {
+        cpu_io_recompile(cpu, retaddr);
+    }
+
+    cpu->mem_io_vaddr = addr;
+    cpu->mem_io_pc = retaddr;
+    for (i = 0; i < (1 << SHIFT); i += 8) {
+        memory_region_dispatch_write(mr, physaddr + i, *(uint64_t *)(val + i),
+                                     8, iotlbentry->attrs);
+    }
+}
+
+void helper_te_st_name(CPUArchState *env, target_ulong addr, uint8_t *val,
+                       TCGMemOpIdx oi, uintptr_t retaddr)
+{
+    unsigned mmu_idx = get_mmuidx(oi);
+    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+    uintptr_t haddr;
+    int i;
+
+    /* Adjust the given return address.  */
+    retaddr -= GETPC_ADJ;
+
+    /* If the TLB entry is for a different page, reload and try again.  */
+    if ((addr & TARGET_PAGE_MASK)
+        != (tlb_addr & (TARGET_PAGE_MASK | TLB_INVALID_MASK))) {
+        if ((addr & (DATA_SIZE - 1)) != 0
+            && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                                 mmu_idx, retaddr);
+        }
+        if (!VICTIM_TLB_HIT(addr_write, addr)) {
+            tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
+        }
+        tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+    }
+
+    /* Handle an IO access.  */
+    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
+        CPUIOTLBEntry *iotlbentry;
+        if ((addr & (DATA_SIZE - 1)) != 0) {
+            goto do_unaligned_access;
+        }
+        iotlbentry = &env->iotlb[mmu_idx][index];
+
+        /* ??? Note that the io helpers always read data in the target
+           byte ordering.  We should push the LE/BE request down into io.  */
+        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+        return;
+    }
+
+    /* Handle slow unaligned access (it spans two pages or IO).  */
+    if (DATA_SIZE > 1
+        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+                     >= TARGET_PAGE_SIZE)) {
+        int i;
+    do_unaligned_access:
+        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                                 mmu_idx, retaddr);
+        }
+        /* XXX: not efficient, but simple */
+        /* Note: relies on the fact that tlb_fill() does not remove the
+         * previous page from the TLB cache.  */
+        for (i = DATA_SIZE - 1; i >= 0; i--) {
+            /* Note the adjustment at the beginning of the function.
+               Undo that for the recursion.  */
+            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val[i],
+                                            oi, retaddr + GETPC_ADJ);
+        }
+        return;
+    }
+
+    /* Handle aligned access or unaligned access in the same page.  */
+    if ((addr & (DATA_SIZE - 1)) != 0
+        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                             mmu_idx, retaddr);
+    }
+
+    haddr = addr + env->tlb_table[mmu_idx][index].addend;
+    for (i = 0; i < DATA_SIZE; i++) {
+        ((uint8_t *)haddr)[i] = val[i];
+    }
+}
+
+#endif /* !defined(SOFTMMU_CODE_ACCESS) */
+
+#undef READ_ACCESS_TYPE
+#undef SHIFT
+#undef SUFFIX
+#undef DATA_SIZE
+#undef ADDR_READ
+#undef helper_te_ld_name
+#undef helper_te_st_name
diff --git a/tcg/tcg.h b/tcg/tcg.h
index f205c6b..c64a2bf 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -1306,6 +1306,11 @@ uint32_t helper_be_ldl_cmmu(CPUArchState *env, target_ulong addr,
 uint64_t helper_be_ldq_cmmu(CPUArchState *env, target_ulong addr,
                             TCGMemOpIdx oi, uintptr_t retaddr);
 
+void helper_te_ldv128_mmu(CPUArchState *env, target_ulong addr,
+                          TCGMemOpIdx oi, uintptr_t retaddr, uint8_t *res);
+void helper_te_stv128_mmu(CPUArchState *env, target_ulong addr, uint8_t *val,
+                          TCGMemOpIdx oi, uintptr_t retaddr);
+
 /* Temporary aliases until backends are converted.  */
 #ifdef TARGET_WORDS_BIGENDIAN
 # define helper_ret_ldsw_mmu  helper_be_ldsw_mmu
-- 
2.1.4


* [Qemu-devel] [PATCH 17/18] tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops
  2017-01-17  9:07 [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (15 preceding siblings ...)
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 16/18] softmmu: create helpers for vector loads Kirill Batuzov
@ 2017-01-17  9:07 ` Kirill Batuzov
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 18/18] target/arm: load two consecutive 64-bits vector regs as a 128-bit vector reg Kirill Batuzov
  2017-01-27 14:55 ` [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Alex Bennée
  18 siblings, 0 replies; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-17  9:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 tcg/i386/tcg-target.inc.c | 63 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 56 insertions(+), 7 deletions(-)

diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index c28fd09..a48da20 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -1265,6 +1265,7 @@ static void * const qemu_ld_helpers[] = {
     [MO_BEUW] = helper_be_lduw_mmu,
     [MO_BEUL] = helper_be_ldul_mmu,
     [MO_BEQ]  = helper_be_ldq_mmu,
+    [MO_128]  = helper_te_ldv128_mmu,
 };
 
 /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
@@ -1278,6 +1279,7 @@ static void * const qemu_st_helpers[] = {
     [MO_BEUW] = helper_be_stw_mmu,
     [MO_BEUL] = helper_be_stl_mmu,
     [MO_BEQ]  = helper_be_stq_mmu,
+    [MO_128]  = helper_te_stv128_mmu,
 };
 
 /* Perform the TLB load and compare.
@@ -1444,12 +1446,27 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
         ofs += 4;
 
         tcg_out_sti(s, TCG_TYPE_PTR, (uintptr_t)l->raddr, TCG_REG_ESP, ofs);
+
+        if ((opc & MO_SSIZE) == MO_128) {
+            ofs += 4;
+            tcg_out_sti(s, TCG_TYPE_PTR, (uintptr_t)s->v128_swap,
+                        TCG_REG_ESP, ofs);
+        }
     } else {
         tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
         /* The second argument is already loaded with addrlo.  */
         tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2], oi);
         tcg_out_movi(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[3],
                      (uintptr_t)l->raddr);
+        if ((opc & MO_SSIZE) == MO_128) {
+            if (ARRAY_SIZE(tcg_target_call_iarg_regs) > 4) {
+                tcg_out_movi(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[4],
+                             (uintptr_t)s->v128_swap);
+            } else {
+                tcg_out_sti(s, TCG_TYPE_PTR, (uintptr_t)s->v128_swap,
+                            TCG_REG_ESP, TCG_TARGET_CALL_STACK_OFFSET);
+            }
+        }
     }
 
     tcg_out_call(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
@@ -1485,6 +1502,10 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
             tcg_out_mov(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_EDX);
         }
         break;
+    case MO_128:
+        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_EAX, (uintptr_t)s->v128_swap);
+        tcg_out_ld(s, TCG_TYPE_V128, l->datalo_reg, TCG_REG_EAX, 0);
+        break;
     default:
         tcg_abort();
     }
@@ -1524,12 +1545,19 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
             ofs += 4;
         }
 
-        tcg_out_st(s, TCG_TYPE_I32, l->datalo_reg, TCG_REG_ESP, ofs);
-        ofs += 4;
-
-        if (s_bits == MO_64) {
-            tcg_out_st(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_ESP, ofs);
+        if (s_bits == MO_128) {
+            tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_EAX, (uintptr_t)s->v128_swap);
+            tcg_out_st(s, TCG_TYPE_V128, l->datalo_reg, TCG_REG_EAX, 0);
+            tcg_out_st(s, TCG_TYPE_PTR, TCG_REG_EAX, TCG_REG_ESP, ofs);
+            ofs += 4;
+        } else {
+            tcg_out_st(s, TCG_TYPE_I32, l->datalo_reg, TCG_REG_ESP, ofs);
             ofs += 4;
+
+            if (s_bits == MO_64) {
+                tcg_out_st(s, TCG_TYPE_I32, l->datahi_reg, TCG_REG_ESP, ofs);
+                ofs += 4;
+            }
         }
 
         tcg_out_sti(s, TCG_TYPE_I32, oi, TCG_REG_ESP, ofs);
@@ -1541,8 +1569,16 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
     } else {
         tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
         /* The second argument is already loaded with addrlo.  */
-        tcg_out_mov(s, (s_bits == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32),
-                    tcg_target_call_iarg_regs[2], l->datalo_reg);
+        if (s_bits == MO_128) {
+            tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_RAX,
+                         (uintptr_t)s->v128_swap);
+            tcg_out_st(s, TCG_TYPE_V128, l->datalo_reg, TCG_REG_RAX, 0);
+            tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[2],
+                        TCG_REG_RAX);
+        } else {
+            tcg_out_mov(s, (s_bits == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32),
+                        tcg_target_call_iarg_regs[2], l->datalo_reg);
+        }
         tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[3], oi);
 
         if (ARRAY_SIZE(tcg_target_call_iarg_regs) > 4) {
@@ -1674,6 +1710,10 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
             }
         }
         break;
+    case MO_128:
+        tcg_out_modrm_sib_offset(s, OPC_MOVDQU_M2R + seg, datalo,
+                                 base, index, 0, ofs);
+        break;
     default:
         tcg_abort();
     }
@@ -1817,6 +1857,9 @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
             tcg_out_modrm_offset(s, movop + seg, datahi, base, ofs+4);
         }
         break;
+    case MO_128:
+        tcg_out_modrm_offset(s, OPC_MOVDQU_R2M + seg, datalo, base, ofs);
+        break;
     default:
         tcg_abort();
     }
@@ -2145,12 +2188,18 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_qemu_ld_i64:
         tcg_out_qemu_ld(s, args, 1);
         break;
+    case INDEX_op_qemu_ld_v128:
+        tcg_out_qemu_ld(s, args, 0);
+        break;
     case INDEX_op_qemu_st_i32:
         tcg_out_qemu_st(s, args, 0);
         break;
     case INDEX_op_qemu_st_i64:
         tcg_out_qemu_st(s, args, 1);
         break;
+    case INDEX_op_qemu_st_v128:
+        tcg_out_qemu_st(s, args, 0);
+        break;
 
     OP_32_64(mulu2):
         tcg_out_modrm(s, OPC_GRP3_Ev + rexw, EXT3_MUL, args[3]);
-- 
2.1.4

* [Qemu-devel] [PATCH 18/18] target/arm: load two consecutive 64-bits vector regs as a 128-bit vector reg
  2017-01-17  9:07 [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (16 preceding siblings ...)
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 17/18] tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops Kirill Batuzov
@ 2017-01-17  9:07 ` Kirill Batuzov
  2017-01-27 14:55 ` [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Alex Bennée
  18 siblings, 0 replies; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-17  9:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Andrzej Zaborowski, Kirill Batuzov

The ARM instruction set does not have loads to a 128-bit vector register (q-reg).
Instead it can load several consecutive 64-bit vector registers (d-regs),
which is what GCC uses to load 128-bit registers from memory.

For vector operations to work we need to detect such loads and transform them
into 128-bit loads to 128-bit temporaries.

Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
---
 target/arm/translate.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/target/arm/translate.c b/target/arm/translate.c
index 4378d44..8b28f77 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -4748,6 +4748,19 @@ static int disas_neon_ls_insn(DisasContext *s, uint32_t insn)
                 tcg_gen_addi_i32(addr, addr, 1 << size);
             }
             if (size == 3) {
+#ifdef TCG_TARGET_HAS_REG128
+                if (rd % 2 == 0 && nregs == 2) {
+                    /* 128-bit load */
+                    if (load) {
+                        tcg_gen_qemu_ld_v128(cpu_Q[rd / 2], addr,
+                                             get_mem_index(s), MO_LE | MO_128);
+                    } else {
+                        tcg_gen_qemu_st_v128(cpu_Q[rd / 2], addr,
+                                             get_mem_index(s), MO_LE | MO_128);
+                    }
+                    break;
+                }
+#endif
                 tmp64 = tcg_temp_new_i64();
                 if (load) {
                     gen_aa32_ld64(s, tmp64, addr, get_mem_index(s));
-- 
2.1.4

* Re: [Qemu-devel] [PATCH 06/18] tcg: allow globals to overlap
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 06/18] tcg: allow globals to overlap Kirill Batuzov
@ 2017-01-17 19:50   ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-01-17 19:50 UTC (permalink / raw)
  To: Kirill Batuzov, qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Peter Maydell, Andrzej Zaborowski

On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
> Sometimes the target architecture may allow some parts of a register to be
> accessed as a different register. If both of these registers are
> implemented as globals in QEMU, then their content will overlap and the
> change to one global will also change the value of the other. To handle
> such situation properly, some fixes are needed in the register allocator
> and liveness analysis.

You need to handle the overlap during optimization as well.  Otherwise you'll 
propagate values that you shouldn't.
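
For example (a sketch using the cpu_D/cpu_Q globals from patch 8;
tcg_gen_mov_v64 is assumed here by analogy with tcg_gen_mov_i64):

    tcg_gen_mov_v64(tmp, cpu_D[0]);     /* optimizer: tmp is a copy of d0 */
    tcg_gen_add_i32x4(cpu_Q[0], cpu_Q[0], cpu_Q[1]);    /* clobbers d0,
                                           the low half of q0 */
    tcg_gen_mov_v64(cpu_D[2], tmp);     /* must not be turned back into
                                           a copy from cpu_D[0] */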


r~

* Re: [Qemu-devel] [PATCH 08/18] target/arm: support access to vector guest registers as globals
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 08/18] target/arm: support access to vector guest registers as globals Kirill Batuzov
@ 2017-01-17 20:07   ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-01-17 20:07 UTC (permalink / raw)
  To: Kirill Batuzov, qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Peter Maydell, Andrzej Zaborowski

On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
> +    for (i = 0; i < 16; i++) {
> +        overlap_temps[i][0] = GET_TCGV_V128(cpu_Q[i]);
> +        overlap_temps[i][1] = (TCGArg)-1;
> +        sub_temps[i][0] = GET_TCGV_V64(cpu_D[i * 2]);
> +        sub_temps[i][1] = GET_TCGV_V64(cpu_D[i * 2 + 1]);
> +        sub_temps[i][2] = (TCGArg)-1;
> +        tcg_temp_set_overlap_temps(GET_TCGV_V64(cpu_D[i * 2]),
> +                                   overlap_temps[i]);
> +        tcg_temp_set_overlap_temps(GET_TCGV_V64(cpu_D[i * 2 + 1]),
> +                                   overlap_temps[i]);
> +        tcg_temp_set_sub_temps(GET_TCGV_V128(cpu_Q[i]), sub_temps[i]);
>      }

Should we simply detect this generically as the registers are declared?  This 
seems tedious to do for each target.
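
For instance (a rough sketch, not tested; temps_overlap() and
temp_size() are made-up helpers), tcg_global_mem_new_internal could
compare the new global against the globals already declared:

    /* Two fixed-base globals overlap iff they share a base temp and
       their [mem_offset, mem_offset + size) ranges intersect.  */
    static bool temps_overlap(TCGTemp *t1, TCGTemp *t2)
    {
        intptr_t a0 = t1->mem_offset, a1 = a0 + temp_size(t1);
        intptr_t b0 = t2->mem_offset, b1 = b0 + temp_size(t2);

        return t1->mem_base == t2->mem_base && a0 < b1 && b0 < a1;
    }

and fill in the overlap/sub-temp lists automatically instead of by hand
in every target.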


r~

* Re: [Qemu-devel] [PATCH 10/18] tcg/i386: add support for vector opcodes
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 10/18] tcg/i386: add support for vector opcodes Kirill Batuzov
@ 2017-01-17 20:19   ` Richard Henderson
  2017-01-18 13:05     ` Kirill Batuzov
  2017-01-27 14:51   ` Alex Bennée
  1 sibling, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-01-17 20:19 UTC (permalink / raw)
  To: Kirill Batuzov, qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Peter Maydell, Andrzej Zaborowski

On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
> To be able to generate vector operations in a TCG backend we need to do
> several things.
>
> 1. We need to tell the register allocator about the target's vector
>    registers. In the case of x86 we'll use xmm0..xmm7. xmm7 is designated
>    as a scratch register; the others can be used by the register allocator.
>
> 2. We need a new constraint to indicate where to use vector registers. In
>    this commit the 'V' constraint is introduced.
>
> 3. We need to be able to generate the bare minimum: load, store and
>    reg-to-reg move. MOVDQU is used for loads and stores. MOVDQA is used
>    for reg-to-reg moves.
>
> 4. Finally we need to support any other opcodes we want. INDEX_op_add_i32x4
>    is the only one for now. The PADDD instruction handles it perfectly.
>
> Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
> ---
>  tcg/i386/tcg-target.h     |  24 +++++++++-
>  tcg/i386/tcg-target.inc.c | 109 +++++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 125 insertions(+), 8 deletions(-)
>
> diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
> index 524cfc6..974a58b 100644
> --- a/tcg/i386/tcg-target.h
> +++ b/tcg/i386/tcg-target.h
> @@ -29,8 +29,14 @@
>  #define TCG_TARGET_TLB_DISPLACEMENT_BITS 31
>
>  #ifdef __x86_64__
> -# define TCG_TARGET_REG_BITS  64
> -# define TCG_TARGET_NB_REGS   16
> +# define TCG_TARGET_HAS_REG128 1
> +# ifdef TCG_TARGET_HAS_REG128
> +#  define TCG_TARGET_REG_BITS  64
> +#  define TCG_TARGET_NB_REGS   24
> +# else
> +#  define TCG_TARGET_REG_BITS  64
> +#  define TCG_TARGET_NB_REGS   16
> +# endif
>  #else
>  # define TCG_TARGET_REG_BITS  32
>  # define TCG_TARGET_NB_REGS    8
> @@ -56,6 +62,16 @@ typedef enum {
>      TCG_REG_R13,
>      TCG_REG_R14,
>      TCG_REG_R15,
> +#ifdef TCG_TARGET_HAS_REG128
> +    TCG_REG_XMM0,
> +    TCG_REG_XMM1,
> +    TCG_REG_XMM2,
> +    TCG_REG_XMM3,
> +    TCG_REG_XMM4,
> +    TCG_REG_XMM5,
> +    TCG_REG_XMM6,
> +    TCG_REG_XMM7,
> +#endif

There's no need to conditionalize this.  The registers can always be defined
even if they're not used.  We really really really want to keep ifdefs to an 
absolute minimum.

Why are you not defining xmm8-15?

> @@ -634,9 +662,24 @@ static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
>  static inline void tcg_out_mov(TCGContext *s, TCGType type,
>                                 TCGReg ret, TCGReg arg)
>  {
> +    int opc;
>      if (arg != ret) {
> -        int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> -        tcg_out_modrm(s, opc, ret, arg);
> +        switch (type) {
> +#ifdef TCG_TARGET_HAS_REG128
> +        case TCG_TYPE_V128:
> +            ret -= TCG_REG_XMM0;
> +            arg -= TCG_REG_XMM0;
> +            tcg_out_modrm(s, OPC_MOVDQA_R2R, ret, arg);
> +            break;
> +#endif
> +        case TCG_TYPE_I32:
> +        case TCG_TYPE_I64:
> +            opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> +            tcg_out_modrm(s, opc, ret, arg);
> +            break;
> +        default:
> +            assert(0);

g_assert_not_reached().

Again, no ifdefs.

We probably want to generate avx1 code when the cpu supports it, to avoid mode 
switches in the vector registers.  In this case, simply issue the same opcode, 
vex encoded.
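
Concretely, something like this in tcg_out_opc would do for the
two-byte VEX form (a sketch; the P_VEX flag is made up, and only the
no-third-operand case is shown):

    if (opc & P_VEX) {
        tcg_out8(s, 0xc5);              /* two-byte VEX escape */
        tcg_out8(s, ((~r & 8) << 4)     /* VEX.R: inverted reg bit 3 */
                    | 0x78              /* VEX.vvvv = 1111: unused */
                    | 0x01);            /* VEX.L = 0 (128-bit), pp = 66 */
    }

(pp would be 2 for the 0xf3-prefixed movdqu forms, and vvvv would carry
the second source for three-operand arithmetic like vpaddd.)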

> +#ifdef TCG_TARGET_HAS_REG128
> +    { INDEX_op_add_i32x4, { "V", "0", "V" } },
> +#endif

And, clearly, you need to rebase.


r~

* Re: [Qemu-devel] [PATCH 12/18] tcg/i386: support remaining vector addition operations
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 12/18] tcg/i386: support remaining vector addition operations Kirill Batuzov
@ 2017-01-17 21:49   ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-01-17 21:49 UTC (permalink / raw)
  To: Kirill Batuzov, qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Peter Maydell, Andrzej Zaborowski

On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
>  #ifdef TCG_TARGET_HAS_REG128
> +    case INDEX_op_add_i8x16:
> +        tcg_out_modrm(s, OPC_PADDB, args[0], args[2]);
> +        break;
> +    case INDEX_op_add_i16x8:
> +        tcg_out_modrm(s, OPC_PADDW, args[0], args[2]);
> +        break;
>      case INDEX_op_add_i32x4:
>          tcg_out_modrm(s, OPC_PADDD, args[0], args[2]);
>          break;
> +    case INDEX_op_add_i64x2:
> +        tcg_out_modrm(s, OPC_PADDQ, args[0], args[2]);
> +        break;
> +#endif
> +
> +#ifdef TCG_TARGET_HAS_REGV64
> +    case INDEX_op_add_i8x8:
> +        tcg_out_modrm(s, OPC_PADDB, args[0], args[2]);
> +        break;
> +    case INDEX_op_add_i16x4:
> +        tcg_out_modrm(s, OPC_PADDW, args[0], args[2]);
> +        break;
> +    case INDEX_op_add_i32x2:
> +        tcg_out_modrm(s, OPC_PADDD, args[0], args[2]);
> +        break;
> +    case INDEX_op_add_i64x1:
> +        tcg_out_modrm(s, OPC_PADDQ, args[0], args[2]);
> +        break;
>  #endif

Once you drop the ifdefs, combine the cases.  Also: avx1 vpadd*.


r~

* Re: [Qemu-devel] [PATCH 07/18] tcg: add vector addition operations
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 07/18] tcg: add vector addition operations Kirill Batuzov
@ 2017-01-17 21:56   ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-01-17 21:56 UTC (permalink / raw)
  To: Kirill Batuzov, qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Peter Maydell, Andrzej Zaborowski

On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
> +/***************************************/
> +/* 64-bit and 128-bit vector arithmetic.          */
> +
> +static inline void *tcg_v128_swap_slot(int n)
> +{
> +    return &tcg_ctx.v128_swap[n * 16];
> +}
> +
> +/* Find a memory location for 128-bit TCG variable. */
> +static inline void tcg_v128_to_ptr(TCGv_v128 tmp, TCGv_ptr base, int slot,
> +                                   TCGv_ptr *real_base, intptr_t *real_offset,
> +                                   int is_read)

None of this needs to be inline in tcg-op.h.  All of it should be out-of-line 
in tcg-op.c.


> @@ -750,6 +778,7 @@ struct TCGContext {
>      void *code_gen_buffer;
>      size_t code_gen_buffer_size;
>      void *code_gen_ptr;
> +    uint8_t v128_swap[16 * 3];

This is not thread-safe.
Shouldn't use space in TCGContext; should use space on stack.

Since there is no function call that is live, you can re-use the space for 
on-stack arguments.  There are TCG_STATIC_CALL_ARGS_SIZE (128) bytes allocated
for that, which should be more than enough.
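
In the i386 qemu_ld/st slow paths from patch 17, for instance, the
vector could be spilled straight to that area instead of going through
s->v128_swap, e.g. (a sketch, assuming no outgoing stack arguments
occupy those bytes at that point):

    tcg_out_st(s, TCG_TYPE_V128, l->datalo_reg, TCG_REG_ESP, 0);
    tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[2], TCG_REG_ESP);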


r~

* Re: [Qemu-devel] [PATCH 10/18] tcg/i386: add support for vector opcodes
  2017-01-17 20:19   ` Richard Henderson
@ 2017-01-18 13:05     ` Kirill Batuzov
  2017-01-18 18:22       ` Richard Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-18 13:05 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, Paolo Bonzini, Peter Crosthwaite, Peter Maydell,
	Andrzej Zaborowski

On Tue, 17 Jan 2017, Richard Henderson wrote:

> On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
> > To be able to generate vector operations in a TCG backend we need to do
> > several things.
> > 
> > 1. We need to tell the register allocator about the target's vector
> >    registers. In the case of x86 we'll use xmm0..xmm7. xmm7 is designated
> >    as a scratch register; the others can be used by the register
> >    allocator.
> > 
> > 2. We need a new constraint to indicate where to use vector registers. In
> >    this commit the 'V' constraint is introduced.
> > 
> > 3. We need to be able to generate the bare minimum: load, store and
> >    reg-to-reg move. MOVDQU is used for loads and stores. MOVDQA is used
> >    for reg-to-reg moves.
> > 
> > 4. Finally we need to support any other opcodes we want. INDEX_op_add_i32x4
> >    is the only one for now. The PADDD instruction handles it perfectly.
> > 
> > Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>
> > ---
> >  tcg/i386/tcg-target.h     |  24 +++++++++-
> >  tcg/i386/tcg-target.inc.c | 109
> > +++++++++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 125 insertions(+), 8 deletions(-)
> > 
> > diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
> > index 524cfc6..974a58b 100644
> > --- a/tcg/i386/tcg-target.h
> > +++ b/tcg/i386/tcg-target.h
> > @@ -29,8 +29,14 @@
> >  #define TCG_TARGET_TLB_DISPLACEMENT_BITS 31
> > 
> >  #ifdef __x86_64__
> > -# define TCG_TARGET_REG_BITS  64
> > -# define TCG_TARGET_NB_REGS   16
> > +# define TCG_TARGET_HAS_REG128 1
> > +# ifdef TCG_TARGET_HAS_REG128
> > +#  define TCG_TARGET_REG_BITS  64
> > +#  define TCG_TARGET_NB_REGS   24
> > +# else
> > +#  define TCG_TARGET_REG_BITS  64
> > +#  define TCG_TARGET_NB_REGS   16
> > +# endif
> >  #else
> >  # define TCG_TARGET_REG_BITS  32
> >  # define TCG_TARGET_NB_REGS    8
> > @@ -56,6 +62,16 @@ typedef enum {
> >      TCG_REG_R13,
> >      TCG_REG_R14,
> >      TCG_REG_R15,
> > +#ifdef TCG_TARGET_HAS_REG128
> > +    TCG_REG_XMM0,
> > +    TCG_REG_XMM1,
> > +    TCG_REG_XMM2,
> > +    TCG_REG_XMM3,
> > +    TCG_REG_XMM4,
> > +    TCG_REG_XMM5,
> > +    TCG_REG_XMM6,
> > +    TCG_REG_XMM7,
> > +#endif
> 
> There's no need to conditionalize this.  The registers can always be defined
> even if they're not used.  We really really really want to keep ifdefs to an
> absolute minimum.
> 
> Why are you not defining xmm8-15?

At first I thought about supporting both x86_64 and i386 targets, but
set this idea aside (at least for the time being). Since defining xmm8-15
does not conflict with anything (as I see it now), I'll add them too.

> 
> > @@ -634,9 +662,24 @@ static inline void tgen_arithr(TCGContext *s, int
> > subop, int dest, int src)
> >  static inline void tcg_out_mov(TCGContext *s, TCGType type,
> >                                 TCGReg ret, TCGReg arg)
> >  {
> > +    int opc;
> >      if (arg != ret) {
> > -        int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> > -        tcg_out_modrm(s, opc, ret, arg);
> > +        switch (type) {
> > +#ifdef TCG_TARGET_HAS_REG128
> > +        case TCG_TYPE_V128:
> > +            ret -= TCG_REG_XMM0;
> > +            arg -= TCG_REG_XMM0;
> > +            tcg_out_modrm(s, OPC_MOVDQA_R2R, ret, arg);
> > +            break;
> > +#endif
> > +        case TCG_TYPE_I32:
> > +        case TCG_TYPE_I64:
> > +            opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> > +            tcg_out_modrm(s, opc, ret, arg);
> > +            break;
> > +        default:
> > +            assert(0);
> 
> g_assert_not_reached().
> 
> Again, no ifdefs.
> 
> We probably want to generate avx1 code when the cpu supports it, to avoid mode
> switches in the vector registers.  In this case, simply issue the same opcode,
> vex encoded.
> 
> > +#ifdef TCG_TARGET_HAS_REG128
> > +    { INDEX_op_add_i32x4, { "V", "0", "V" } },
> > +#endif
> 
> And, clearly, you need to rebase.
> 

I noticed too late that a conflicting tcg-related pull had hit
master after my last rebase. Sorry. v2 will be rebased.

-- 
Kirill

* Re: [Qemu-devel] [PATCH 10/18] tcg/i386: add support for vector opcodes
  2017-01-18 13:05     ` Kirill Batuzov
@ 2017-01-18 18:22       ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-01-18 18:22 UTC (permalink / raw)
  To: Kirill Batuzov
  Cc: Peter Maydell, Paolo Bonzini, qemu-devel, Peter Crosthwaite

On 01/18/2017 05:05 AM, Kirill Batuzov wrote:
>> Why are you not defining xmm8-15?
>
> At first I thought about supporting both x86_64 and i386 targets, but
> set this idea aside (at least for the time being). Since defining xmm8-15
> does not conflict with anything (as I see it now), I'll add them too.

Thanks.  Although (potentially) all you need to do to support i386 is to make
sure that TCG_TARGET_HAS_add_* are properly conditionalized on a runtime 
have_sse2 check.  There are other examples of how such runtime checks should be 
done.
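
E.g. (a sketch following the existing have_bmi1 pattern; the have_sse2
plumbing shown here is illustrative):

    extern bool have_sse2;
    #define TCG_TARGET_HAS_add_i32x4  have_sse2

    /* in tcg_target_init(), next to the existing __cpuid(1, ...) bits: */
    have_sse2 = (d & bit_SSE2) != 0;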

(That said, I can imagine there might be other issues with respect to i64 vs 
v64 that might turn out to be complicated.  It wouldn't bother me if we 
restricted vector support to 64-bit hosts.)


r~

* Re: [Qemu-devel] [PATCH 01/18] tcg: add support for 128bit vector type
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 01/18] tcg: add support for 128bit vector type Kirill Batuzov
@ 2017-01-18 18:29   ` Richard Henderson
  2017-01-19 13:04     ` Kirill Batuzov
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-01-18 18:29 UTC (permalink / raw)
  To: Kirill Batuzov, qemu-devel
  Cc: Peter Maydell, Peter Crosthwaite, Paolo Bonzini

On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
> +static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t offset,
> +                                                const char *name)
> +{
> +    int idx = tcg_global_mem_new_internal(TCG_TYPE_V128, reg, offset, name);
> +    return MAKE_TCGV_V128(idx);
> +}

You shouldn't allow a v128 type to be created if the host doesn't support it.

You may want to treat v128 as a pair of v64 if the host supports that. 
Although there's limited applicability there, since only minor hosts (MIPS, 
Sparc, ia64) have 64-bit-only vector extensions.

That said, treating v128 as 2 x v64 scales nicely when we add v256.  Which, if 
we've already gone this far, is clearly how avx2 guest support should be 
implemented.

For hosts that have had no vector support added, you may want to represent v128 
as 2 x i64, for the purpose of intermediate expansion.
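
For the 2 x v64 representation the add wrapper could then emit, e.g.
(a sketch; the _lo/_hi halves are hypothetical, and add_i32x2 is the
64-bit opcode added later in this series):

    tcg_gen_add_i32x2(out_lo, a_lo, b_lo);
    tcg_gen_add_i32x2(out_hi, a_hi, b_hi);

so the halves stay in v64 host registers instead of bouncing through
memory.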


r~

* Re: [Qemu-devel] [PATCH 01/18] tcg: add support for 128bit vector type
  2017-01-18 18:29   ` Richard Henderson
@ 2017-01-19 13:04     ` Kirill Batuzov
  2017-01-19 15:09       ` Richard Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-19 13:04 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, Peter Maydell, Peter Crosthwaite, Paolo Bonzini

On Wed, 18 Jan 2017, Richard Henderson wrote:

> On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
> > +static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t
> > offset,
> > +                                                const char *name)
> > +{
> > +    int idx = tcg_global_mem_new_internal(TCG_TYPE_V128, reg, offset,
> > name);
> > +    return MAKE_TCGV_V128(idx);
> > +}
> 
> You shouldn't allow a v128 type to be created if the host doesn't support it.

The idea here was to create it either way, but make sure no operation
will ever be issued if the host does not support it (the tcg_gen_* wrappers
take care of this).

> 
> You may want to treat v128 as a pair of v64 if the host supports that.
> Although there's limited applicability there, since only minor hosts (MIPS,
> Sparc, ia64) have 64-bit-only vector extensions.
> 
> That said, treating v128 as 2 x v64 scales nicely when we add v256.  Which, if
> we've already gone this far, is clearly how avx2 guest support should be
> implemented.
> 
> For hosts that have had no vector support added, you may want to represent
> v128 as 2 x i64, for the purpose of intermediate expansion.
>

I'm not sure about this last part. The host may not have i64, so there
should be another case - 4 x i32. So we'll get 4 cases for v128:

v128
2 x v64
2 x i64
4 x i32

3 cases will need to be added to tcg_temp_new_internal and
tcg_global_mem_new_internal, two of which are rather useless (2 x i64,
4 x i32). Introduction of v256 will add 4 more cases, two of which will
be useless again. This sounds like too much code that serves no purpose
to me.

Maybe we can support only the 2 x v64 case (and later 2 x v128 and maybe
4 x v64) and just generate a v128 temp that'll never be used if none of
these worked?

-- 
Kirill

* Re: [Qemu-devel] [PATCH 01/18] tcg: add support for 128bit vector type
  2017-01-19 13:04     ` Kirill Batuzov
@ 2017-01-19 15:09       ` Richard Henderson
  2017-01-19 16:54         ` Kirill Batuzov
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-01-19 15:09 UTC (permalink / raw)
  To: Kirill Batuzov
  Cc: qemu-devel, Peter Maydell, Peter Crosthwaite, Paolo Bonzini

On 01/19/2017 05:04 AM, Kirill Batuzov wrote:
> On Wed, 18 Jan 2017, Richard Henderson wrote:
>
>> On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
>>> +static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t
>>> offset,
>>> +                                                const char *name)
>>> +{
>>> +    int idx = tcg_global_mem_new_internal(TCG_TYPE_V128, reg, offset,
>>> name);
>>> +    return MAKE_TCGV_V128(idx);
>>> +}
>>
>> You shouldn't allow a v128 type to be created if the host doesn't support it.
>
> The idea here was to create it either way, but make sure no operation
> will ever be issued if the host does not support it (the tcg_gen_* wrappers
> take care of this).

Huh?  If you issue *no* operation, then how is the operation being emulated?

> I'm not sure about this last part. The host may not have i64, so there
> should be another case - 4 x i32. So we'll get 4 cases for v128:

Recursively you'd get 4 x i32, but at least they'll be tagged TCG_TYPE_I64, and
be handled by the rest of the tcg code generator as they should be.

>
> v128
> 2 x v64
> 2 x i64
> 4 x i32
>
> 3 cases will need to be added to tcg_temp_new_internal and
> tcg_global_mem_new_internal, two of which are rather useless (2 x i64, 4 x i32).
> Introduction of v256 will add 4 more cases, two of which will be useless
> again. This sounds like too much code that serves no purpose to me.

Useless?  Surely you mean "used by hosts that don't implement v128".

I think one of us is very confused about how you intend to generate fallback 
code.  Perhaps any future patchset revision should update tcg/README first.


r~

* Re: [Qemu-devel] [PATCH 01/18] tcg: add support for 128bit vector type
  2017-01-19 15:09       ` Richard Henderson
@ 2017-01-19 16:54         ` Kirill Batuzov
  2017-01-22  7:00           ` Richard Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-19 16:54 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, Peter Maydell, Peter Crosthwaite, Paolo Bonzini

On 19.01.2017 18:09, Richard Henderson wrote:
> On 01/19/2017 05:04 AM, Kirill Batuzov wrote:
>> On Wed, 18 Jan 2017, Richard Henderson wrote:
>>
>>> On 01/17/2017 01:07 AM, Kirill Batuzov wrote:
>>>> +static inline TCGv_v128 tcg_global_mem_new_v128(TCGv_ptr reg, intptr_t
>>>> offset,
>>>> +                                                const char *name)
>>>> +{
>>>> +    int idx = tcg_global_mem_new_internal(TCG_TYPE_V128, reg, offset,
>>>> name);
>>>> +    return MAKE_TCGV_V128(idx);
>>>> +}
>>>
>>> You shouldn't allow a v128 type to be created if the host doesn't
>>> support it.
>>
>> The idea here was to create it either way, but make sure no operation
>> will ever be issued if the host does not support it (the tcg_gen_* wrappers
>> take care of this).
>
> Huh?  If you issue *no* operation, then how is the operation being
> emulated?

Wrappers issue emulation code instead of the operation if it is not
supported by the host.

tcg_gen_add_i32x4 looks like this:

if (TCG_TARGET_HAS_add_i32x4) {
     tcg_gen_op3_v128(INDEX_op_add_i32x4, args[0], args[1], args[2]);
} else {
     for (i = 0; i < 4; i++) {
         tcg_gen_ld_i32(...);
         tcg_gen_ld_i32(...);
         tcg_gen_add_i32(...);
         tcg_gen_st_i32(...);
     }
}

So no operation working directly with a TCGv_v128 temp should appear
anywhere in the intermediate representation unless the host claims it
supports it (in which case it must support the 128-bit type as well).
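
Fully expanded, with illustrative names for the i32 temps and for the
env offsets backing the three v128 globals, the else branch is roughly:

    for (i = 0; i < 4; i++) {
        tcg_gen_ld_i32(t0, cpu_env, a_off + i * 4);
        tcg_gen_ld_i32(t1, cpu_env, b_off + i * 4);
        tcg_gen_add_i32(t0, t0, t1);
        tcg_gen_st_i32(t0, cpu_env, d_off + i * 4);
    }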

>
>> I'm not sure about this last part. The host may not have i64, so there
>> should be another case - 4 x i32. So we'll get 4 cases for v128:
>
> Recursively you'd get 4 x i32, but at least they'll be tagged
> TCG_TYPE_I64, and be handled by the rest of the tcg code generator as
> they should be.
>
>>
>> v128
>> 2 x v64
>> 2 x i64
>> 4 x i32
>>
>> 3 cases will need to be added to tcg_temp_new_internal and
>> tcg_global_mem_new_internal, two of which are rather useless (2 x i64,
>> 4 x i32).
>> Introduction of v256 will add 4 more cases, two of which will be useless
>> again. This sounds like too much code that serves no purpose to me.
>
> Useless?  Surely you mean "used by hosts that don't implement v128".

I meant that a host that doesn't support the v128 type will not use these
variables. It'll use their memory locations instead, so it does not
matter how we represent them. The only TCG code that'll see them is the
tcg_gen_<vector_op> wrappers, which know how to deal with them.

2 x v64 is a different story. We can generate much better emulation code
if we represent a v128 variable as a pair of v64 variables and work with
them as variables.

>
> I think one of us is very confused about how you intend to generate
> fallback code.  Perhaps any future patchset revision should update
> tcg/README first.
>

Sure, I'll do it in v2.

-- 
Kirill

* Re: [Qemu-devel] [PATCH 01/18] tcg: add support for 128bit vector type
  2017-01-19 16:54         ` Kirill Batuzov
@ 2017-01-22  7:00           ` Richard Henderson
  2017-01-23 10:30             ` Kirill Batuzov
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-01-22  7:00 UTC (permalink / raw)
  To: Kirill Batuzov
  Cc: Peter Maydell, Paolo Bonzini, qemu-devel, Peter Crosthwaite

On 01/19/2017 08:54 AM, Kirill Batuzov wrote:
>
> Wrappers issue emulation code instead of the operation if it is not supported by the host.
>
> tcg_gen_add_i32x4 looks like this:
>
> if (TCG_TARGET_HAS_add_i32x4) {
>     tcg_gen_op3_v128(INDEX_op_add_i32x4, args[0], args[1], args[2]);
> } else {
>     for (i = 0; i < 4; i++) {
>         tcg_gen_ld_i32(...);
>         tcg_gen_ld_i32(...);
>         tcg_gen_add_i32(...);
>         tcg_gen_st_i32(...);
>     }
> }

To me that begs the question of why you wouldn't issue 4 adds on 4 i32 
registers instead.


r~

* Re: [Qemu-devel] [PATCH 01/18] tcg: add support for 128bit vector type
  2017-01-22  7:00           ` Richard Henderson
@ 2017-01-23 10:30             ` Kirill Batuzov
  2017-01-23 18:43               ` Richard Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-23 10:30 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Maydell, Paolo Bonzini, qemu-devel, Peter Crosthwaite

On Sat, 21 Jan 2017, Richard Henderson wrote:

> On 01/19/2017 08:54 AM, Kirill Batuzov wrote:
> > 
> > Wrappers issue emulation code instead of the operation if it is not
> > supported by the host.
> > 
> > tcg_gen_add_i32x4 looks like this:
> > 
> > if (TCG_TARGET_HAS_add_i32x4) {
> >     tcg_gen_op3_v128(INDEX_op_add_i32x4, args[0], args[1], args[2]);
> > } else {
> >     for (i = 0; i < 4; i++) {
> >         tcg_gen_ld_i32(...);
> >         tcg_gen_ld_i32(...);
> >         tcg_gen_add_i32(...);
> >         tcg_gen_st_i32(...);
> >     }
> > }
> 
> To me that begs the question of why you wouldn't issue 4 adds on 4 i32
> registers instead.
>

Because 4 adds on 4 i32 registers work well only when the size of the
vector elements matches the size of the scalar variables we use to
represent the vector. add_i16x8 will not be that great if we use
4 i32 variables: each will need to be split into two values, processed
independently, and merged back afterwards. And when we create a variable
we do not know which operations will be performed on it.

Scalar variables lack primitives to work with them as vectors of shorter
values. This is one of the reasons I added the v64 type instead of using
i64 for 64-bit vector operations. And this is the reason I'm so opposed to
using them to represent vector types if vector registers are not
supported by the host. Handling vector operations with an element size
that does not match the representation will be complicated, may require
special handling for different operations, and will produce a lot of ifs
in the code.

The method I'm proposing can handle any operation regardless of
representation. This includes handling the situation where the host
supports vector registers but does not support the required operation
(for example, SSE/AVX does not support multiplication of vectors of
8-bit values).

-- 
Kirill

* Re: [Qemu-devel] [PATCH 01/18] tcg: add support for 128bit vector type
  2017-01-23 10:30             ` Kirill Batuzov
@ 2017-01-23 18:43               ` Richard Henderson
  2017-01-24 14:29                 ` Kirill Batuzov
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-01-23 18:43 UTC (permalink / raw)
  To: Kirill Batuzov
  Cc: Peter Maydell, Peter Crosthwaite, qemu-devel, Paolo Bonzini

On 01/23/2017 02:30 AM, Kirill Batuzov wrote:
> Because 4 adds on 4 i32 registers work well only when the size of the
> vector elements matches the size of the scalar variables we use to
> represent the vector. add_i16x8 will not be that great if we use
> 4 i32 variables: each will need to be split into two values, processed
> independently, and merged back afterwards.

Certainly.  But that's pretty much exactly how they are processed now.  Usually
via a helper function that accepts an i64 input as a pair of i32 arguments.

> Scalar variables lack primitives to work with them as vectors of shorter
> values. This is one of the reasons I added the v64 type instead of using
> i64 for 64-bit vector operations. And this is the reason I'm so opposed to
> using them to represent vector types if vector registers are not
> supported by the host. Handling vector operations with an element size
> that does not match the representation will be complicated, may require
> special handling for different operations, and will produce a lot of ifs
> in the code.

A lot of if's?  I've no idea what you're talking about.

A v64 type makes sense because generally we're going to allocate them to a
different register set than i64.  That said, i64 is perfectly adequate for
implementing add_i8x8:

  t0  = (in0 & 0x7f7f7f7f7f7f7f7f) + (in1 & 0x7f7f7f7f7f7f7f7f)
  t1  = (in0 ^ in1) & 0x8080808080808080
  out = t0 ^ t1

This is less expensive than addition by pieces if there are at least 4 pieces.
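
As a self-contained C sketch of the same trick (the function name is
illustrative):

    #include <stdint.h>

    /* SWAR byte add: mask off the high bit of each lane so carries
       cannot cross lane boundaries, then patch the high bits back in
       with xor.  */
    static uint64_t add_i8x8(uint64_t in0, uint64_t in1)
    {
        uint64_t sum = (in0 & 0x7f7f7f7f7f7f7f7full)
                     + (in1 & 0x7f7f7f7f7f7f7f7full);
        return sum ^ ((in0 ^ in1) & 0x8080808080808080ull);
    }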

> The method I'm proposing can handle any operation regardless of
> representation. This includes handling the situation where the host
> supports vector registers but does not support the required operation
> (for example, SSE/AVX does not support multiplication of vectors of
> 8-bit values).

Not for nothing but it's trivial to expand with punpcklbw, punpckhbw, pmullw,
pand, packuswb.  That said, if an expansion gets too complicated, it's still
better to move it into a helper than expand 16 * (load, op, store).
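
For reference, the same expansion written with SSE2 intrinsics (a
sketch of the idea; the backend would emit the instructions directly):

    #include <emmintrin.h>

    /* 16 x u8 multiply: widen bytes to words, multiply with pmullw,
       keep the low byte of each product, pack back with packuswb.  */
    static __m128i mul_i8x16(__m128i a, __m128i b)
    {
        __m128i z = _mm_setzero_si128();
        __m128i m = _mm_set1_epi16(0x00ff);
        __m128i lo = _mm_mullo_epi16(_mm_unpacklo_epi8(a, z),
                                     _mm_unpacklo_epi8(b, z));
        __m128i hi = _mm_mullo_epi16(_mm_unpackhi_epi8(a, z),
                                     _mm_unpackhi_epi8(b, z));
        return _mm_packus_epi16(_mm_and_si128(lo, m), _mm_and_si128(hi, m));
    }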


r~

* Re: [Qemu-devel] [PATCH 01/18] tcg: add support for 128bit vector type
  2017-01-23 18:43               ` Richard Henderson
@ 2017-01-24 14:29                 ` Kirill Batuzov
  0 siblings, 0 replies; 36+ messages in thread
From: Kirill Batuzov @ 2017-01-24 14:29 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Maydell, Peter Crosthwaite, qemu-devel, Paolo Bonzini

On Mon, 23 Jan 2017, Richard Henderson wrote:

> On 01/23/2017 02:30 AM, Kirill Batuzov wrote:
> > Because 4 adds on 4 i32 registers work well only when the size of the
> > vector elements matches the size of the scalar variables we use to
> > represent the vector. add_i16x8 will not be that great if we use
> > 4 i32 variables: each will need to be split into two values, processed
> > independently, and merged back afterwards.
> 
> Certainly.  But that's pretty much exactly how they are processed now.  Usually
> via a helper function that accepts an i64 input as a pair of i32 arguments.
> 
> > Scalar variables lack primitives to work with them as vectors of shorter
> > values. This is one of the reasons I added the v64 type instead of using
> > i64 for 64-bit vector operations. And this is the reason I'm so opposed to
> > using them to represent vector types if vector registers are not
> > supported by the host. Handling vector operations with an element size
> > that does not match the representation will be complicated, may require
> > special handling for different operations, and will produce a lot of ifs
> > in the code.
> 
> A lot of if's?  I've no idea what you're talking about.
> 
> A v64 type makes sense because generally we're going to allocate them to a
> different register set than i64.  That said, i64 is perfectly adequate for
> implementing add_i8x8:
> 
>   t0  = (in0 & 0x7f7f7f7f7f7f7f7f) + (in1 & 0x7f7f7f7f7f7f7f7f)
>   t1  = (in0 ^ in1) & 0x8080808080808080
>   out = t0 ^ t1
> 
> This is less expensive than addition by pieces if there are at least 4 pieces.
> 
> > The method I'm proposing can handle any operation regardless of
> > representation. This includes handling the situation where the host
> > supports vector registers but does not support the required operation
> > (for example, SSE/AVX does not support multiplication of vectors of
> > 8-bit values).
> 
> Not for nothing but it's trivial to expand with punpcklbw, punpckhbw, pmullw,
> pand, packuswb.  That said, if an expansion gets too complicated, it's still
> better to move it into a helper than expand 16 * (load, op, store).
>

I'm a bit lost in the discussion so let me try to summarise. As far as I
understand there is only one major point on which we disagree: is it
worth representing vector variables as sequences of scalar ones?

Pros:
1. We will not get phantom variables of an unsupported type like we do in
my current implementation.

2. If we manage to efficiently emulate a large enough number of vector
operations using scalar types, we'll get some performance benefits. In
this case scalar variables can be allocated to registers and stay there
across several consecutive guest instructions.

I personally doubt that first "if": logical operations will be fine,
addition and subtraction can be implemented, maybe shifts, but
everything else will end up as helpers (and they are expensive
from a performance perspective).

Cons:
1. Additional cases for each possible representation in
tcg_global_mem_new_internal and tcg_temp_new_internal. I do not see how I
can use the existing i64 as a pair of i32 recursively. TCG supports only
one level of indirection: there is a "type" of a variable, and a
"base_type" it is used to represent. i64 code does not check "base_type"
explicitly, so if I pass two consecutive i32 variables to these functions
they will work, but this sounds like a dirty hack to me.

2. Additional cases for each possible representation in the
tcg_gen_<vector_op> wrappers. We need to generate adequate expansion
code for each representation. That is, if we do not default to the
memory location every time (in which case why bother with a different
representation to begin with).

3. TCG variable exhaustion: to (potentially) represent AVX-512
registers with 32-bit variables we'll need 512 of them (32 registers of
512 bits each). TCG_MAX_TEMP is 512. Sure, it can be increased.

Making something a global variable is only beneficial when we can carry
its value in a register from one operation to another (so we'll get
ld+op1+op2+st instead of ld+op1+st+ld+op2+st). I'm not sure that the
subset of operations we can effectively emulate is large enough for this
to happen often, but my experience with vector operations is limited so
it might be.

Let's do the following: in v2 I'll add a representation of v128 as a pair
of v64 and update the tcg_gen_<vector_op> wrappers. We'll see how this
works out and decide whether it is worth following up with a
representation of v128 as a sequence of scalar types.

-- 
Kirill

* Re: [Qemu-devel] [PATCH 10/18] tcg/i386: add support for vector opcodes
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 10/18] tcg/i386: add support for vector opcodes Kirill Batuzov
  2017-01-17 20:19   ` Richard Henderson
@ 2017-01-27 14:51   ` Alex Bennée
  1 sibling, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-01-27 14:51 UTC (permalink / raw)
  To: Kirill Batuzov
  Cc: qemu-devel, Peter Maydell, Peter Crosthwaite, Paolo Bonzini,
	Richard Henderson


Kirill Batuzov <batuzovk@ispras.ru> writes:

> To be able to generate vector operations in a TCG backend we need to do
> several things.
>
> 1. We need to tell the register allocator about the target's vector
>    registers. In the case of x86 we'll use xmm0..xmm7. xmm7 is designated
>    as a scratch register; the others can be used by the register allocator.
>
> 2. We need a new constraint to indicate where to use vector registers. In
>    this commit the 'V' constraint is introduced.
>
> 3. We need to be able to generate the bare minimum: load, store and
>    reg-to-reg move. MOVDQU is used for loads and stores. MOVDQA is used
>    for reg-to-reg moves.
>
> 4. Finally we need to support any other opcodes we want. INDEX_op_add_i32x4
>    is the only one for now. The PADDD instruction handles it perfectly.
>
> Signed-off-by: Kirill Batuzov <batuzovk@ispras.ru>

This currently fails to apply cleanly to master because of other updates;
however, I see you have changes to make, so I assume you'll re-base then ;-)

> ---
>  tcg/i386/tcg-target.h     |  24 +++++++++-
>  tcg/i386/tcg-target.inc.c | 109 +++++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 125 insertions(+), 8 deletions(-)
>
> diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
> index 524cfc6..974a58b 100644
> --- a/tcg/i386/tcg-target.h
> +++ b/tcg/i386/tcg-target.h
> @@ -29,8 +29,14 @@
>  #define TCG_TARGET_TLB_DISPLACEMENT_BITS 31
>
>  #ifdef __x86_64__
> -# define TCG_TARGET_REG_BITS  64
> -# define TCG_TARGET_NB_REGS   16
> +# define TCG_TARGET_HAS_REG128 1
> +# ifdef TCG_TARGET_HAS_REG128
> +#  define TCG_TARGET_REG_BITS  64
> +#  define TCG_TARGET_NB_REGS   24
> +# else
> +#  define TCG_TARGET_REG_BITS  64
> +#  define TCG_TARGET_NB_REGS   16
> +# endif
>  #else
>  # define TCG_TARGET_REG_BITS  32
>  # define TCG_TARGET_NB_REGS    8
> @@ -56,6 +62,16 @@ typedef enum {
>      TCG_REG_R13,
>      TCG_REG_R14,
>      TCG_REG_R15,
> +#ifdef TCG_TARGET_HAS_REG128
> +    TCG_REG_XMM0,
> +    TCG_REG_XMM1,
> +    TCG_REG_XMM2,
> +    TCG_REG_XMM3,
> +    TCG_REG_XMM4,
> +    TCG_REG_XMM5,
> +    TCG_REG_XMM6,
> +    TCG_REG_XMM7,
> +#endif
>      TCG_REG_RAX = TCG_REG_EAX,
>      TCG_REG_RCX = TCG_REG_ECX,
>      TCG_REG_RDX = TCG_REG_EDX,
> @@ -133,6 +149,10 @@ extern bool have_bmi1;
>  #define TCG_TARGET_HAS_mulsh_i64        0
>  #endif
>
> +#ifdef TCG_TARGET_HAS_REG128
> +#define TCG_TARGET_HAS_add_i32x4        1
> +#endif
> +
>  #define TCG_TARGET_deposit_i32_valid(ofs, len) \
>      (((ofs) == 0 && (len) == 8) || ((ofs) == 8 && (len) == 8) || \
>       ((ofs) == 0 && (len) == 16))
> diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
> index eeb1777..69e3198 100644
> --- a/tcg/i386/tcg-target.inc.c
> +++ b/tcg/i386/tcg-target.inc.c
> @@ -32,6 +32,9 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
>  #else
>      "%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi",
>  #endif
> +#ifdef TCG_TARGET_HAS_REG128
> +    "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "%xmm7",
> +#endif
>  };
>  #endif
>
> @@ -61,6 +64,16 @@ static const int tcg_target_reg_alloc_order[] = {
>      TCG_REG_EDX,
>      TCG_REG_EAX,
>  #endif
> +#ifdef TCG_TARGET_HAS_REG128
> +    TCG_REG_XMM0,
> +    TCG_REG_XMM1,
> +    TCG_REG_XMM2,
> +    TCG_REG_XMM3,
> +    TCG_REG_XMM4,
> +    TCG_REG_XMM5,
> +    TCG_REG_XMM6,
> +/*  TCG_REG_XMM7, <- scratch register */
> +#endif
>  };
>
>  static const int tcg_target_call_iarg_regs[] = {
> @@ -247,6 +260,10 @@ static int target_parse_constraint(TCGArgConstraint *ct, const char **pct_str)
>      case 'I':
>          ct->ct |= TCG_CT_CONST_I32;
>          break;
> +    case 'V':
> +        ct->ct |= TCG_CT_REG;
> +        tcg_regset_set32(ct->u.regs, 0, 0xff0000);
> +        break;
>
>      default:
>          return -1;
> @@ -301,6 +318,9 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
>  #define P_SIMDF3        0x10000         /* 0xf3 opcode prefix */
>  #define P_SIMDF2        0x20000         /* 0xf2 opcode prefix */
>
> +#define P_SSE_660F      (P_DATA16 | P_EXT)
> +#define P_SSE_F30F      (P_SIMDF3 | P_EXT)
> +
>  #define OPC_ARITH_EvIz	(0x81)
>  #define OPC_ARITH_EvIb	(0x83)
>  #define OPC_ARITH_GvEv	(0x03)		/* ... plus (ARITH_FOO << 3) */
> @@ -351,6 +371,11 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
>  #define OPC_GRP3_Ev	(0xf7)
>  #define OPC_GRP5	(0xff)
>
> +#define OPC_MOVDQU_M2R  (0x6f | P_SSE_F30F)  /* load 128-bit value */
> +#define OPC_MOVDQU_R2M  (0x7f | P_SSE_F30F)  /* store 128-bit value */
> +#define OPC_MOVDQA_R2R  (0x6f | P_SSE_660F)  /* reg-to-reg 128-bit mov */
> +#define OPC_PADDD       (0xfe | P_SSE_660F)
> +
>  /* Group 1 opcode extensions for 0x80-0x83.
>     These are also used as modifiers for OPC_ARITH.  */
>  #define ARITH_ADD 0
> @@ -428,6 +453,9 @@ static void tcg_out_opc(TCGContext *s, int opc, int r, int rm, int x)
>          tcg_debug_assert((opc & P_REXW) == 0);
>          tcg_out8(s, 0x66);
>      }
> +    if (opc & P_SIMDF3) {
> +        tcg_out8(s, 0xf3);
> +    }
>      if (opc & P_ADDR32) {
>          tcg_out8(s, 0x67);
>      }
> @@ -634,9 +662,24 @@ static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
>  static inline void tcg_out_mov(TCGContext *s, TCGType type,
>                                 TCGReg ret, TCGReg arg)
>  {
> +    int opc;
>      if (arg != ret) {
> -        int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> -        tcg_out_modrm(s, opc, ret, arg);
> +        switch (type) {
> +#ifdef TCG_TARGET_HAS_REG128
> +        case TCG_TYPE_V128:
> +            ret -= TCG_REG_XMM0;
> +            arg -= TCG_REG_XMM0;
> +            tcg_out_modrm(s, OPC_MOVDQA_R2R, ret, arg);
> +            break;
> +#endif
> +        case TCG_TYPE_I32:
> +        case TCG_TYPE_I64:
> +            opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> +            tcg_out_modrm(s, opc, ret, arg);
> +            break;
> +        default:
> +            assert(0);
> +        }
>      }
>  }
>
> @@ -711,15 +754,43 @@ static inline void tcg_out_pop(TCGContext *s, int reg)
>  static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
>                                TCGReg arg1, intptr_t arg2)
>  {
> -    int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> -    tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
> +    int opc;
> +    switch (type) {
> +#ifdef TCG_TARGET_HAS_REG128
> +    case TCG_TYPE_V128:
> +        ret -= TCG_REG_XMM0;
> +        tcg_out_modrm_offset(s, OPC_MOVDQU_M2R, ret, arg1, arg2);
> +        break;
> +#endif
> +    case TCG_TYPE_I32:
> +    case TCG_TYPE_I64:
> +        opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> +        tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
> +        break;
> +    default:
> +        assert(0);
> +    }
>  }
>
>  static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
>                                TCGReg arg1, intptr_t arg2)
>  {
> -    int opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> -    tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
> +    int opc;
> +    switch (type) {
> +#ifdef TCG_TARGET_HAS_REG128
> +    case TCG_TYPE_V128:
> +        arg -= TCG_REG_XMM0;
> +        tcg_out_modrm_offset(s, OPC_MOVDQU_R2M, arg, arg1, arg2);
> +        break;
> +#endif
> +    case TCG_TYPE_I32:
> +    case TCG_TYPE_I64:
> +        opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> +        tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
> +        break;
> +    default:
> +        assert(0);
> +    }
>  }
>
>  static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
> @@ -1856,6 +1927,11 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>      case INDEX_op_ld_i32:
>          tcg_out_ld(s, TCG_TYPE_I32, args[0], args[1], args[2]);
>          break;
> +#ifdef TCG_TARGET_HAS_REG128
> +    case INDEX_op_ld_v128:
> +        tcg_out_ld(s, TCG_TYPE_V128, args[0], args[1], args[2]);
> +        break;
> +#endif
>
>      OP_32_64(st8):
>          if (const_args[0]) {
> @@ -1888,6 +1964,11 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>              tcg_out_st(s, TCG_TYPE_I32, args[0], args[1], args[2]);
>          }
>          break;
> +#ifdef TCG_TARGET_HAS_REG128
> +    case INDEX_op_st_v128:
> +        tcg_out_st(s, TCG_TYPE_V128, args[0], args[1], args[2]);
> +        break;
> +#endif
>
>      OP_32_64(add):
>          /* For 3-operand addition, use LEA.  */
> @@ -2146,6 +2227,13 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>      case INDEX_op_mb:
>          tcg_out_mb(s, args[0]);
>          break;
> +
> +#ifdef TCG_TARGET_HAS_REG128
> +    case INDEX_op_add_i32x4:
> +        tcg_out_modrm(s, OPC_PADDD, args[0], args[2]);
> +        break;
> +#endif
> +
>      case INDEX_op_mov_i32:  /* Always emitted via tcg_out_mov.  */
>      case INDEX_op_mov_i64:
>      case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi.  */
> @@ -2171,6 +2259,11 @@ static const TCGTargetOpDef x86_op_defs[] = {
>      { INDEX_op_st16_i32, { "ri", "r" } },
>      { INDEX_op_st_i32, { "ri", "r" } },
>
> +#ifdef TCG_TARGET_HAS_REG128
> +    { INDEX_op_ld_v128, { "V", "r" } },
> +    { INDEX_op_st_v128, { "V", "r" } },
> +#endif
> +
>      { INDEX_op_add_i32, { "r", "r", "ri" } },
>      { INDEX_op_sub_i32, { "r", "0", "ri" } },
>      { INDEX_op_mul_i32, { "r", "0", "ri" } },
> @@ -2289,6 +2382,10 @@ static const TCGTargetOpDef x86_op_defs[] = {
>      { INDEX_op_qemu_ld_i64, { "r", "r", "L", "L" } },
>      { INDEX_op_qemu_st_i64, { "L", "L", "L", "L" } },
>  #endif
> +
> +#ifdef TCG_TARGET_HAS_REG128
> +    { INDEX_op_add_i32x4, { "V", "0", "V" } },
> +#endif
>      { -1 },
>  };


--
Alex Bennée

* Re: [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations
  2017-01-17  9:07 [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations Kirill Batuzov
                   ` (17 preceding siblings ...)
  2017-01-17  9:07 ` [Qemu-devel] [PATCH 18/18] target/arm: load two consecutive 64-bits vector regs as a 128-bit vector reg Kirill Batuzov
@ 2017-01-27 14:55 ` Alex Bennée
  18 siblings, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-01-27 14:55 UTC (permalink / raw)
  To: Kirill Batuzov
  Cc: qemu-devel, Peter Maydell, Peter Crosthwaite, Paolo Bonzini,
	Richard Henderson


Kirill Batuzov <batuzovk@ispras.ru> writes:

> The goal of these patch series is to set up an infrastructure to emulate
> guest vector operations using host vector operations. Preliminary
> experiments show that simply translating loads and stores increases
> performance of x264 video codec by 10%. The performance of a gcc vectorized
> for loop increased 2x.
>
> To be able to emulate guest vector operations using host vector operations,
> several things need to be done.

I see rth has already done a bunch of review so I'll pass on this cycle
but please feel free to add me to the CC list next iteration.

>
> 1. Corresponding vector types should be added to TCG. These series add
> TCG_v128 and TCG_v64. I've made TCG_v64 a different type than TCG_i64
> because it usually needs to be allocated to different registers and
> supports different operations.
>
> 2. Load/store operations for these new types need to be implemented.
>
> 3. For seamless transition from current model to a new one we need to
> handle cases where memory occupied by global variable can be accessed via
> pointer to the CPUArchState structure. A very simple conservative alias
> analysis has been added to do it. This analysis tracks memory loads and
> stores that overlap with fields of CPUArchState and provides this
> information to the register allocator. The allocator then spills and
> reloads affected globals when needed.
>
> 4. Allow overlapping globals. For scalar registers this is a rare case, and
> overlapping registers can ba handled as a single one (ah, al, ax, eax,
> rax). In ARM every Q-register consists of two D-register each consisting of
> two S-registers. Handling 4 S-registers as one because they are parts of
> the same Q-register is way too inefficient.
>
> 5. Add new memory addressing mode to MMU code for large accesses and create
> needed helpers. Only 128-bit vectors have been handled for now.
>
> 6. Create TCG opcodes for vector operations. Only addition has beed handled
> in these series. Each operation has a wrapper that checks if the backend
> supports the corresponding operation or not. In one case the vector opcode
> is generated, in the other the operation is emulated with scalar
> operations. The emulation code is generated inline for performance reasons
> (there is a huge performance difference between inline generation
> and calling a helper). As a positive side effect this will eventually allow
>  to merge similar emulation code for vector instructions from different
> frontends to target-independent implementation.
>
> 7. Use new operations in the frontend (ARM was used in these series).
>
> 8. Support new operations in the backend (x86_64 was used in these series).
>
> For experiments I have used ARM guest on x86_64 host. I wanted some pair of
> different architectures with vector extensions both. ARM and x86_64 pair
> fits well.
>
> Kirill Batuzov (18):
>   tcg: add support for 128bit vector type
>   tcg: add support for 64bit vector type
>   tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes
>   tcg: add simple alias analysis
>   tcg: use results of alias analysis in liveness analysis
>   tcg: allow globals to overlap
>   tcg: add vector addition operations
>   target/arm: support access to vector guest registers as globals
>   target/arm: use vector opcode to handle vadd.<size> instruction
>   tcg/i386: add support for vector opcodes
>   tcg/i386: support 64-bit vector operations
>   tcg/i386: support remaining vector addition operations
>   tcg: do not relay on exact values of MO_BSWAP or MO_SIGN in backend
>   tcg: introduce new TCGMemOp - MO_128
>   tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes
>   softmmu: create helpers for vector loads
>   tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops
>   target/arm: load two consecutive 64-bits vector regs as a 128-bit
>     vector reg
>
>  cputlb.c                     |   4 +
>  softmmu_template_vector.h    | 266 +++++++++++++++++++++++++++++++++++++++++++
>  target/arm/translate.c       |  89 ++++++++++++++-
>  tcg/aarch64/tcg-target.inc.c |   4 +-
>  tcg/arm/tcg-target.inc.c     |   4 +-
>  tcg/i386/tcg-target.h        |  35 +++++-
>  tcg/i386/tcg-target.inc.c    | 245 ++++++++++++++++++++++++++++++++++++---
>  tcg/mips/tcg-target.inc.c    |   4 +-
>  tcg/optimize.c               | 146 ++++++++++++++++++++++++
>  tcg/ppc/tcg-target.inc.c     |   4 +-
>  tcg/s390/tcg-target.inc.c    |   4 +-
>  tcg/sparc/tcg-target.inc.c   |  12 +-
>  tcg/tcg-op.c                 |  20 +++-
>  tcg/tcg-op.h                 | 262 ++++++++++++++++++++++++++++++++++++++++++
>  tcg/tcg-opc.h                |  34 ++++++
>  tcg/tcg.c                    | 146 ++++++++++++++++++++++++
>  tcg/tcg.h                    | 147 +++++++++++++++++++++++-
>  17 files changed, 1385 insertions(+), 41 deletions(-)
>  create mode 100644 softmmu_template_vector.h


--
Alex Bennée
