* [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion
@ 2017-08-17 23:01 Richard Henderson
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic Richard Henderson
                   ` (8 more replies)
  0 siblings, 9 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-arm, alex.bennee

When Alex and I started discussing this topic, this was the direction
I had in mind.  The primary difference from Alex's version is that the
interface on the target/cpu/ side uses offsets rather than a faux temp.
The secondary difference is that, for smaller vector sizes at least, I
expand to inline host vector operations.  The use of explicit offsets
aids that.
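
To give a feel for the interface, here is a sketch of what the aa64
"add v2.4s, v3.4s, v4.4s" case from the test below expands to, using
the expanders from patch 1 and the offset helpers added in patch 2
(all names are the ones introduced by this series):

    /* add v2.4s, v3.4s, v4.4s: operate on 16 bytes, clearing up to
       the full AdvSIMD register size.  */
    tcg_gen_gvec_add32(vec_full_reg_offset(s, 2),      /* dofs: Q2 */
                       vec_full_reg_offset(s, 3),      /* aofs: Q3 */
                       vec_full_reg_offset(s, 4),      /* bofs: Q4 */
                       16, vec_full_reg_size(s));

Since everything is expressed as byte offsets into CPUARMState, the
expander is free to choose between inline 32/64-bit lanes, host vector
operations, or the out-of-line helper.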

There are still a number of things missing from the host vector support,
including register spill/fill.  But in this example conversion we never
have more than two vector registers live at any point, so we do not run
into those issues.

Some of this infrastructure cannot be exercised by existing front ends.
Getting there will require writing ARM SVE support, or adding AVX2/AVX512
support to target/i386.  ;-)

Unfortunately, the built-in disassembler is too old to handle AVX, so
for testing purposes I disabled it and ran the output assembly through
an external objdump instead.

For a trivial test case via aarch64-linux-user:

IN: 
0x0000000000400078:  4e208400      add v0.16b, v0.16b, v0.16b
0x000000000040007c:  4e648462      add v2.8h, v3.8h, v4.8h
0x0000000000400080:  4ea48462      add v2.4s, v3.4s, v4.4s
0x0000000000400084:  4ee48462      add v2.2d, v3.2d, v4.2d
0x0000000000400088:  0ea28462      add v2.2s, v3.2s, v2.2s
0x000000000040008c:  00000000      unallocated (Unallocated)

OP after optimization and liveness analysis:
 ld_i32 tmp0,env,$0xffffffffffffffec              dead: 1
 movi_i32 tmp1,$0x0
 brcond_i32 tmp0,tmp1,lt,$L0                      dead: 0 1

 ---- 0000000000400078 0000000000000000 0000000000000000
 ld_v128 tmp2,env,$0x850
 add8_v128 tmp2,tmp2,tmp2                         dead: 1 2
 st_v128 tmp2,env,$0x850                          dead: 0

 ---- 000000000040007c 0000000000000000 0000000000000000
 ld_v128 tmp2,env,$0x880
 ld_v128 tmp3,env,$0x890
 add16_v128 tmp2,tmp2,tmp3                        dead: 1 2
 st_v128 tmp2,env,$0x870                          dead: 0

 ---- 0000000000400080 0000000000000000 0000000000000000
 ld_v128 tmp2,env,$0x880
 ld_v128 tmp3,env,$0x890
 add32_v128 tmp2,tmp2,tmp3                        dead: 1 2
 st_v128 tmp2,env,$0x870                          dead: 0

 ---- 0000000000400084 0000000000000000 0000000000000000
 ld_v128 tmp2,env,$0x880
 ld_v128 tmp3,env,$0x890
 add64_v128 tmp2,tmp2,tmp3                        dead: 1 2
 st_v128 tmp2,env,$0x870                          dead: 0

 ---- 0000000000400088 0000000000000000 0000000000000000
 ld_v64 tmp4,env,$0x880
 ld_v64 tmp5,env,$0x870
 add32_v64 tmp4,tmp4,tmp5                         dead: 1 2
 st_v64 tmp4,env,$0x870                           dead: 0
 movi_i64 tmp6,$0x0
 st_i64 tmp6,env,$0x878                           dead: 0

 ---- 000000000040008c 0000000000000000 0000000000000000
 movi_i64 pc,$0x40008c                            sync: 0  dead: 0
 movi_i32 tmp0,$0x1
 movi_i32 tmp1,$0x2000000
 movi_i32 tmp7,$0x1
 call exception_with_syndrome,$0x0,$0,env,tmp0,tmp1,tmp7  dead: 0 1 2 3
 set_label $L0
 exit_tb $0x521c86683

OUT: [size=220]
   521c86740:	41 8b 6e ec          		mov    -0x14(%r14),%ebp
   521c86744:	85 ed                		test   %ebp,%ebp
   521c86746:	0f 8c c4 00 00 00    		jl     0x521c86810
   521c8674c:	c4 c1 7a 6f 86 50 08 00 00 	vmovdqu 0x850(%r14),%xmm0
   521c86755:	c4 e1 79 fc c0       		vpaddb %xmm0,%xmm0,%xmm0
   521c8675a:	c4 c1 7a 7f 86 50 08 00 00	vmovdqu %xmm0,0x850(%r14)
   521c86763:	c4 c1 7a 6f 86 80 08 00 00 	vmovdqu 0x880(%r14),%xmm0
   521c8676c:	c4 c1 7a 6f 8e 90 08 00 00 	vmovdqu 0x890(%r14),%xmm1
   521c86775:	c4 e1 79 fd c1       		vpaddw %xmm1,%xmm0,%xmm0
   521c8677a:	c4 c1 7a 7f 86 70 08 00 00 	vmovdqu %xmm0,0x870(%r14)
   521c86783:	c4 c1 7a 6f 86 80 08 00 00 	vmovdqu 0x880(%r14),%xmm0
   521c8678c:	c4 c1 7a 6f 8e 90 08 00 00 	vmovdqu 0x890(%r14),%xmm1
   521c86795:	c4 e1 79 fe c1       		vpaddd %xmm1,%xmm0,%xmm0
   521c8679a:	c4 c1 7a 7f 86 70 08 00 00 	vmovdqu %xmm0,0x870(%r14)
   521c867a3:	c4 c1 7a 6f 86 80 08 00 00 	vmovdqu 0x880(%r14),%xmm0
   521c867ac:	c4 c1 7a 6f 8e 90 08 00 00 	vmovdqu 0x890(%r14),%xmm1
   521c867b5:	c4 e1 79 d4 c1       		vpaddq %xmm1,%xmm0,%xmm0
   521c867ba:	c4 c1 7a 7f 86 70 08 00 00 	vmovdqu %xmm0,0x870(%r14)
   521c867c3:	c4 c1 7a 7e 86 80 08 00 00 	vmovq  0x880(%r14),%xmm0
   521c867cc:	c4 c1 7a 7e 8e 70 08 00 00 	vmovq  0x870(%r14),%xmm1
   521c867d5:	c4 e1 79 fe c1       		vpaddd %xmm1,%xmm0,%xmm0
   521c867da:	c4 c1 79 d6 86 70 08 00 00 	vmovq  %xmm0,0x870(%r14)
   521c867e3:	49 c7 86 78 08 00 00 		movq   $0x0,0x878(%r14)
   521c867ea:	00 00 00 00 
   521c867ee:	49 c7 86 40 01 00 00 		movq   $0x40008c,0x140(%r14)
   521c867f5:	8c 00 40 00 
   521c867f9:	49 8b fe             		mov    %r14,%rdi
   521c867fc:	be 01 00 00 00       		mov    $0x1,%esi
   521c86801:	ba 00 00 00 02       		mov    $0x2000000,%edx
   521c86806:	b9 01 00 00 00       		mov    $0x1,%ecx
   521c8680b:	e8 90 40 c9 ff       		callq  0x52191a8a0
   521c86810:	48 8d 05 6c fe ff ff 		lea    -0x194(%rip),%rax
   521c86817:	e9 3c fe ff ff       		jmpq   0x521c86658

Because I already had some pending fixes to tcg/i386/ wrt VEX encoding,
I've based this on an existing tree.  The complete tree can be found at

    git://github.com/rth7680/qemu.git native-vector-registers-2


r~


Richard Henderson (8):
  tcg: Add generic vector infrastructure and ops for add/sub/logic
  target/arm: Use generic vector infrastructure for aa64 add/sub/logic
  tcg: Add types for host vectors
  tcg: Add operations for host vectors
  tcg: Add tcg_op_supported
  tcg: Add INDEX_op_invalid
  tcg: Expand target vector ops with host vector ops
  tcg/i386: Add vector operations

 Makefile.target            |   5 +-
 tcg/i386/tcg-target.h      |  46 +++-
 tcg/tcg-op-gvec.h          |  92 +++++++
 tcg/tcg-opc.h              |  91 +++++++
 tcg/tcg-runtime.h          |  16 ++
 tcg/tcg.h                  |  37 ++-
 target/arm/translate-a64.c | 137 +++++++----
 tcg/i386/tcg-target.inc.c  | 382 ++++++++++++++++++++++++++---
 tcg/tcg-op-gvec.c          | 583 +++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg-runtime-gvec.c     | 199 ++++++++++++++++
 tcg/tcg.c                  | 323 ++++++++++++++++++++++++-
 11 files changed, 1817 insertions(+), 94 deletions(-)
 create mode 100644 tcg/tcg-op-gvec.h
 create mode 100644 tcg/tcg-op-gvec.c
 create mode 100644 tcg/tcg-runtime-gvec.c

-- 
2.13.5


* [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic
  2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
  2017-08-30  1:31   ` Philippe Mathieu-Daudé
  2017-09-07 16:34   ` Alex Bennée
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic Richard Henderson
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-arm, alex.bennee

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 Makefile.target        |   5 +-
 tcg/tcg-op-gvec.h      |  88 ++++++++++
 tcg/tcg-runtime.h      |  16 ++
 tcg/tcg-op-gvec.c      | 443 +++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg-runtime-gvec.c | 199 ++++++++++++++++++++++
 5 files changed, 749 insertions(+), 2 deletions(-)
 create mode 100644 tcg/tcg-op-gvec.h
 create mode 100644 tcg/tcg-op-gvec.c
 create mode 100644 tcg/tcg-runtime-gvec.c

diff --git a/Makefile.target b/Makefile.target
index 7f42c45db8..9ae3e904f7 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -93,8 +93,9 @@ all: $(PROGS) stap
 # cpu emulator library
 obj-y += exec.o
 obj-y += accel/
-obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-op.o tcg/optimize.o
-obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/tcg-runtime.o
+obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-common.o tcg/optimize.o
+obj-$(CONFIG_TCG) += tcg/tcg-op.o tcg/tcg-op-gvec.o
+obj-$(CONFIG_TCG) += tcg/tcg-runtime.o tcg/tcg-runtime-gvec.o
 obj-$(CONFIG_TCG_INTERPRETER) += tcg/tci.o
 obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
 obj-y += fpu/softfloat.o
diff --git a/tcg/tcg-op-gvec.h b/tcg/tcg-op-gvec.h
new file mode 100644
index 0000000000..10db3599a5
--- /dev/null
+++ b/tcg/tcg-op-gvec.h
@@ -0,0 +1,88 @@
+/*
+ *  Generic vector operation expansion
+ *
+ *  Copyright (c) 2017 Linaro
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * "Generic" vectors.  All operands are given as offsets from ENV,
+ * and therefore cannot also be allocated via tcg_global_mem_new_*.
+ * OPSZ is the byte size of the vector upon which the operation is performed.
+ * CLSZ is the byte size of the full vector; bytes beyond OPSZ are cleared.
+ *
+ * All sizes must be 8 or any multiple of 16.
+ * When OPSZ is 8, the alignment may be 8, otherwise must be 16.
+ * Operands may completely, but not partially, overlap.
+ */
+
+/* Fundamental operation expanders.  These are exposed to the front ends
+   so that target-specific SIMD operations can be handled similarly to
+   the standard SIMD operations.  */
+
+typedef struct {
+    /* "Small" sizes: expand inline as a 64-bit or 32-bit lane.
+       Generally only one of these will be non-NULL.  */
+    void (*fni8)(TCGv_i64, TCGv_i64, TCGv_i64);
+    void (*fni4)(TCGv_i32, TCGv_i32, TCGv_i32);
+    /* Similarly, but load up a constant and re-use across lanes.  */
+    void (*fni8x)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
+    uint64_t extra_value;
+    /* Larger sizes: expand out-of-line helper w/size descriptor.  */
+    void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);
+} GVecGen3;
+
+void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                    uint32_t opsz, uint32_t clsz, const GVecGen3 *);
+
+#define DEF_GVEC_2(X) \
+    void tcg_gen_gvec_##X(uint32_t dofs, uint32_t aofs, uint32_t bofs, \
+                          uint32_t opsz, uint32_t clsz)
+
+DEF_GVEC_2(add8);
+DEF_GVEC_2(add16);
+DEF_GVEC_2(add32);
+DEF_GVEC_2(add64);
+
+DEF_GVEC_2(sub8);
+DEF_GVEC_2(sub16);
+DEF_GVEC_2(sub32);
+DEF_GVEC_2(sub64);
+
+DEF_GVEC_2(and8);
+DEF_GVEC_2(or8);
+DEF_GVEC_2(xor8);
+DEF_GVEC_2(andc8);
+DEF_GVEC_2(orc8);
+
+#undef DEF_GVEC_2
+
+/*
+ * 64-bit vector operations.  Use these when the register has been
+ * allocated with tcg_global_mem_new_i64.  OPSZ = CLSZ = 8.
+ */
+
+#define DEF_VEC8_2(X) \
+    void tcg_gen_vec8_##X(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+
+DEF_VEC8_2(add8);
+DEF_VEC8_2(add16);
+DEF_VEC8_2(add32);
+
+DEF_VEC8_2(sub8);
+DEF_VEC8_2(sub16);
+DEF_VEC8_2(sub32);
+
+#undef DEF_VEC8_2
diff --git a/tcg/tcg-runtime.h b/tcg/tcg-runtime.h
index c41d38a557..f8d07090f8 100644
--- a/tcg/tcg-runtime.h
+++ b/tcg/tcg-runtime.h
@@ -134,3 +134,19 @@ GEN_ATOMIC_HELPERS(xor_fetch)
 GEN_ATOMIC_HELPERS(xchg)
 
 #undef GEN_ATOMIC_HELPERS
+
+DEF_HELPER_FLAGS_4(gvec_add8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_add16, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_add32, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_add64, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_4(gvec_sub8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_sub16, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_sub32, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_sub64, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_4(gvec_and8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_or8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_xor8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_andc8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_orc8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
new file mode 100644
index 0000000000..6de49dc07f
--- /dev/null
+++ b/tcg/tcg-op-gvec.c
@@ -0,0 +1,443 @@
+/*
+ *  Generic vector operation expansion
+ *
+ *  Copyright (c) 2017 Linaro
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "cpu.h"
+#include "exec/exec-all.h"
+#include "tcg.h"
+#include "tcg-op.h"
+#include "tcg-op-gvec.h"
+#include "trace-tcg.h"
+#include "trace/mem.h"
+
+#define REP8(x)    ((x) * 0x0101010101010101ull)
+#define REP16(x)   ((x) * 0x0001000100010001ull)
+
+#define MAX_INLINE 16
+
+static inline void check_size_s(uint32_t opsz, uint32_t clsz)
+{
+    tcg_debug_assert(opsz % 8 == 0);
+    tcg_debug_assert(clsz % 8 == 0);
+    tcg_debug_assert(opsz <= clsz);
+}
+
+static inline void check_align_s_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
+{
+    tcg_debug_assert(dofs % 8 == 0);
+    tcg_debug_assert(aofs % 8 == 0);
+    tcg_debug_assert(bofs % 8 == 0);
+}
+
+static inline void check_size_l(uint32_t opsz, uint32_t clsz)
+{
+    tcg_debug_assert(opsz % 16 == 0);
+    tcg_debug_assert(clsz % 16 == 0);
+    tcg_debug_assert(opsz <= clsz);
+}
+
+static inline void check_align_l_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
+{
+    tcg_debug_assert(dofs % 16 == 0);
+    tcg_debug_assert(aofs % 16 == 0);
+    tcg_debug_assert(bofs % 16 == 0);
+}
+
+static inline void check_overlap_3(uint32_t d, uint32_t a,
+                                   uint32_t b, uint32_t s)
+{
+    tcg_debug_assert(d == a || d + s <= a || a + s <= d);
+    tcg_debug_assert(d == b || d + s <= b || b + s <= d);
+    tcg_debug_assert(a == b || a + s <= b || b + s <= a);
+}
+
+static void expand_clr(uint32_t dofs, uint32_t opsz, uint32_t clsz)
+{
+    if (clsz > opsz) {
+        TCGv_i64 zero = tcg_const_i64(0);
+        uint32_t i;
+
+        for (i = opsz; i < clsz; i += 8) {
+            tcg_gen_st_i64(zero, tcg_ctx.tcg_env, dofs + i);
+        }
+        tcg_temp_free_i64(zero);
+    }
+}
+
+static TCGv_i32 make_desc(uint32_t opsz, uint32_t clsz)
+{
+    tcg_debug_assert(opsz >= 16 && opsz <= 255 * 16 && opsz % 16 == 0);
+    tcg_debug_assert(clsz >= 16 && clsz <= 255 * 16 && clsz % 16 == 0);
+    opsz /= 16;
+    clsz /= 16;
+    opsz -= 1;
+    clsz -= 1;
+    return tcg_const_i32(deposit32(opsz, 8, 8, clsz));
+}
+
+static void expand_3_o(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                       uint32_t opsz, uint32_t clsz,
+                       void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32))
+{
+    TCGv_ptr d = tcg_temp_new_ptr();
+    TCGv_ptr a = tcg_temp_new_ptr();
+    TCGv_ptr b = tcg_temp_new_ptr();
+    TCGv_i32 desc = make_desc(opsz, clsz);
+
+    tcg_gen_addi_ptr(d, tcg_ctx.tcg_env, dofs);
+    tcg_gen_addi_ptr(a, tcg_ctx.tcg_env, aofs);
+    tcg_gen_addi_ptr(b, tcg_ctx.tcg_env, bofs);
+    fno(d, a, b, desc);
+
+    tcg_temp_free_ptr(d);
+    tcg_temp_free_ptr(a);
+    tcg_temp_free_ptr(b);
+    tcg_temp_free_i32(desc);
+}
+
+static void expand_3x4(uint32_t dofs, uint32_t aofs,
+                       uint32_t bofs, uint32_t opsz,
+                       void (*fni)(TCGv_i32, TCGv_i32, TCGv_i32))
+{
+    TCGv_i32 t0 = tcg_temp_new_i32();
+    uint32_t i;
+
+    if (aofs == bofs) {
+        for (i = 0; i < opsz; i += 4) {
+            tcg_gen_ld_i32(t0, tcg_ctx.tcg_env, aofs + i);
+            fni(t0, t0, t0);
+            tcg_gen_st_i32(t0, tcg_ctx.tcg_env, dofs + i);
+        }
+    } else {
+        TCGv_i32 t1 = tcg_temp_new_i32();
+        for (i = 0; i < opsz; i += 4) {
+            tcg_gen_ld_i32(t0, tcg_ctx.tcg_env, aofs + i);
+            tcg_gen_ld_i32(t1, tcg_ctx.tcg_env, bofs + i);
+            fni(t0, t0, t1);
+            tcg_gen_st_i32(t0, tcg_ctx.tcg_env, dofs + i);
+        }
+        tcg_temp_free_i32(t1);
+    }
+    tcg_temp_free_i32(t0);
+}
+
+static void expand_3x8(uint32_t dofs, uint32_t aofs,
+                       uint32_t bofs, uint32_t opsz,
+                       void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64))
+{
+    TCGv_i64 t0 = tcg_temp_new_i64();
+    uint32_t i;
+
+    if (aofs == bofs) {
+        for (i = 0; i < opsz; i += 8) {
+            tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
+            fni(t0, t0, t0);
+            tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
+        }
+    } else {
+        TCGv_i64 t1 = tcg_temp_new_i64();
+        for (i = 0; i < opsz; i += 8) {
+            tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
+            tcg_gen_ld_i64(t1, tcg_ctx.tcg_env, bofs + i);
+            fni(t0, t0, t1);
+            tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
+        }
+        tcg_temp_free_i64(t1);
+    }
+    tcg_temp_free_i64(t0);
+}
+
+static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                         uint32_t opsz, uint64_t data,
+                         void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64))
+{
+    TCGv_i64 t0 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_const_i64(data);
+    uint32_t i;
+
+    if (aofs == bofs) {
+        for (i = 0; i < opsz; i += 8) {
+            tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
+            fni(t0, t0, t0, t2);
+            tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
+        }
+    } else {
+        TCGv_i64 t1 = tcg_temp_new_i64();
+        for (i = 0; i < opsz; i += 8) {
+            tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
+            tcg_gen_ld_i64(t1, tcg_ctx.tcg_env, bofs + i);
+            fni(t0, t0, t1, t2);
+            tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
+        }
+        tcg_temp_free_i64(t1);
+    }
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t2);
+}
+
+void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                    uint32_t opsz, uint32_t clsz, const GVecGen3 *g)
+{
+    check_overlap_3(dofs, aofs, bofs, clsz);
+    if (opsz <= MAX_INLINE) {
+        check_size_s(opsz, clsz);
+        check_align_s_3(dofs, aofs, bofs);
+        if (g->fni8) {
+            expand_3x8(dofs, aofs, bofs, opsz, g->fni8);
+        } else if (g->fni4) {
+            expand_3x4(dofs, aofs, bofs, opsz, g->fni4);
+        } else if (g->fni8x) {
+            expand_3x8p1(dofs, aofs, bofs, opsz, g->extra_value, g->fni8x);
+        } else {
+            g_assert_not_reached();
+        }
+        expand_clr(dofs, opsz, clsz);
+    } else {
+        check_size_l(opsz, clsz);
+        check_align_l_3(dofs, aofs, bofs);
+        expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
+    }
+}
+
+static void gen_addv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    TCGv_i64 t3 = tcg_temp_new_i64();
+
+    tcg_gen_andc_i64(t1, a, m);
+    tcg_gen_andc_i64(t2, b, m);
+    tcg_gen_xor_i64(t3, a, b);
+    tcg_gen_add_i64(d, t1, t2);
+    tcg_gen_and_i64(t3, t3, m);
+    tcg_gen_xor_i64(d, d, t3);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+    tcg_temp_free_i64(t3);
+}
+
+void tcg_gen_gvec_add8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                       uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .extra_value = REP8(0x80),
+        .fni8x = gen_addv_mask,
+        .fno = gen_helper_gvec_add8,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_add16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                        uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .extra_value = REP16(0x8000),
+        .fni8x = gen_addv_mask,
+        .fno = gen_helper_gvec_add16,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_add32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                        uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .fni4 = tcg_gen_add_i32,
+        .fno = gen_helper_gvec_add32,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_add64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                        uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .fni8 = tcg_gen_add_i64,
+        .fno = gen_helper_gvec_add64,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_vec8_add8(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+    TCGv_i64 m = tcg_const_i64(REP8(0x80));
+    gen_addv_mask(d, a, b, m);
+    tcg_temp_free_i64(m);
+}
+
+void tcg_gen_vec8_add16(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+    TCGv_i64 m = tcg_const_i64(REP16(0x8000));
+    gen_addv_mask(d, a, b, m);
+    tcg_temp_free_i64(m);
+}
+
+void tcg_gen_vec8_add32(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+
+    tcg_gen_andi_i64(t1, a, ~0xffffffffull);
+    tcg_gen_add_i64(t2, a, b);
+    tcg_gen_add_i64(t1, t1, b);
+    tcg_gen_deposit_i64(d, t1, t2, 0, 32);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+static void gen_subv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    TCGv_i64 t3 = tcg_temp_new_i64();
+
+    tcg_gen_or_i64(t1, a, m);
+    tcg_gen_andc_i64(t2, b, m);
+    tcg_gen_eqv_i64(t3, a, b);
+    tcg_gen_sub_i64(d, t1, t2);
+    tcg_gen_and_i64(t3, t3, m);
+    tcg_gen_xor_i64(d, d, t3);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+    tcg_temp_free_i64(t3);
+}
+
+void tcg_gen_gvec_sub8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                       uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .extra_value = REP8(0x80),
+        .fni8x = gen_subv_mask,
+        .fno = gen_helper_gvec_sub8,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_sub16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                        uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .extra_value = REP16(0x8000),
+        .fni8x = gen_subv_mask,
+        .fno = gen_helper_gvec_sub16,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_sub32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                        uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .fni4 = tcg_gen_sub_i32,
+        .fno = gen_helper_gvec_sub32,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_sub64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                        uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .fni8 = tcg_gen_sub_i64,
+        .fno = gen_helper_gvec_sub64,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_vec8_sub8(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+    TCGv_i64 m = tcg_const_i64(REP8(0x80));
+    gen_subv_mask(d, a, b, m);
+    tcg_temp_free_i64(m);
+}
+
+void tcg_gen_vec8_sub16(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+    TCGv_i64 m = tcg_const_i64(REP16(0x8000));
+    gen_subv_mask(d, a, b, m);
+    tcg_temp_free_i64(m);
+}
+
+void tcg_gen_vec8_sub32(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+
+    tcg_gen_andi_i64(t1, b, ~0xffffffffull);
+    tcg_gen_sub_i64(t2, a, b);
+    tcg_gen_sub_i64(t1, a, t1);
+    tcg_gen_deposit_i64(d, t1, t2, 0, 32);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+void tcg_gen_gvec_and8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                       uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .fni8 = tcg_gen_and_i64,
+        .fno = gen_helper_gvec_and8,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_or8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                      uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .fni8 = tcg_gen_or_i64,
+        .fno = gen_helper_gvec_or8,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_xor8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                       uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .fni8 = tcg_gen_xor_i64,
+        .fno = gen_helper_gvec_xor8,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_andc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                        uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .fni8 = tcg_gen_andc_i64,
+        .fno = gen_helper_gvec_andc8,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_orc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                       uint32_t opsz, uint32_t clsz)
+{
+    static const GVecGen3 g = {
+        .fni8 = tcg_gen_orc_i64,
+        .fno = gen_helper_gvec_orc8,
+    };
+    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
diff --git a/tcg/tcg-runtime-gvec.c b/tcg/tcg-runtime-gvec.c
new file mode 100644
index 0000000000..9a37ce07a2
--- /dev/null
+++ b/tcg/tcg-runtime-gvec.c
@@ -0,0 +1,199 @@
+/*
+ *  Generic vectorized operation runtime
+ *
+ *  Copyright (c) 2017 Linaro
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/host-utils.h"
+#include "cpu.h"
+#include "exec/helper-proto.h"
+
+/* Virtually all hosts support 16-byte vectors.  Those that don't
+   can emulate them via GCC's generic vector extension.
+
+   In tcg-op-gvec.c, we asserted that both the size and alignment
+   of the data are multiples of 16.  */
+
+typedef uint8_t vec8 __attribute__((vector_size(16)));
+typedef uint16_t vec16 __attribute__((vector_size(16)));
+typedef uint32_t vec32 __attribute__((vector_size(16)));
+typedef uint64_t vec64 __attribute__((vector_size(16)));
+
+static inline intptr_t extract_opsz(uint32_t desc)
+{
+    return ((desc & 0xff) + 1) * 16;
+}
+
+static inline intptr_t extract_clsz(uint32_t desc)
+{
+    return (((desc >> 8) & 0xff) + 1) * 16;
+}
+
+static inline void clear_high(void *d, intptr_t opsz, uint32_t desc)
+{
+    intptr_t clsz = extract_clsz(desc);
+    intptr_t i;
+
+    if (unlikely(clsz > opsz)) {
+        for (i = opsz; i < clsz; i += sizeof(vec64)) {
+            *(vec64 *)(d + i) = (vec64){ 0 };
+        }
+    }
+}
+
+void HELPER(gvec_add8)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec8)) {
+        *(vec8 *)(d + i) = *(vec8 *)(a + i) + *(vec8 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_add16)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec16)) {
+        *(vec16 *)(d + i) = *(vec16 *)(a + i) + *(vec16 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_add32)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec32)) {
+        *(vec32 *)(d + i) = *(vec32 *)(a + i) + *(vec32 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_add64)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec64)) {
+        *(vec64 *)(d + i) = *(vec64 *)(a + i) + *(vec64 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_sub8)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec8)) {
+        *(vec8 *)(d + i) = *(vec8 *)(a + i) - *(vec8 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_sub16)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec16)) {
+        *(vec16 *)(d + i) = *(vec16 *)(a + i) - *(vec16 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_sub32)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec32)) {
+        *(vec32 *)(d + i) = *(vec32 *)(a + i) - *(vec32 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_sub64)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec64)) {
+        *(vec64 *)(d + i) = *(vec64 *)(a + i) - *(vec64 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_and8)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec64)) {
+        *(vec64 *)(d + i) = *(vec64 *)(a + i) & *(vec64 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_or8)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec64)) {
+        *(vec64 *)(d + i) = *(vec64 *)(a + i) | *(vec64 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_xor8)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec64)) {
+        *(vec64 *)(d + i) = *(vec64 *)(a + i) ^ *(vec64 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_andc8)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec64)) {
+        *(vec64 *)(d + i) = *(vec64 *)(a + i) &~ *(vec64 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_orc8)(void *d, void *a, void *b, uint32_t desc)
+{
+    intptr_t opsz = extract_opsz(desc);
+    intptr_t i;
+
+    for (i = 0; i < opsz; i += sizeof(vec64)) {
+        *(vec64 *)(d + i) = *(vec64 *)(a + i) |~ *(vec64 *)(b + i);
+    }
+    clear_high(d, opsz, desc);
+}
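
As a worked example of the size descriptor used above (an editor's
sketch, not part of the patch; it reuses deposit32() from qemu/bitops.h
and the extract_* helpers defined in this file), opsz = 32 and
clsz = 64 round-trip as:

    /* make_desc(32, 64) packs (32/16 - 1) into bits [7:0] and
       (64/16 - 1) into bits [15:8].  */
    uint32_t desc = deposit32(32 / 16 - 1, 8, 8, 64 / 16 - 1);  /* 0x0301 */

    /* extract_opsz(desc) == ((0x0301 & 0xff) + 1) * 16 == 32         */
    /* extract_clsz(desc) == (((0x0301 >> 8) & 0xff) + 1) * 16 == 64  */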
-- 
2.13.5


* [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic
  2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
  2017-09-07 16:58   ` Alex Bennée
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors Richard Henderson
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-arm, alex.bennee

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/arm/translate-a64.c | 137 ++++++++++++++++++++++++++++-----------------
 1 file changed, 87 insertions(+), 50 deletions(-)

diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 2200e25be0..025354f983 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -21,6 +21,7 @@
 #include "cpu.h"
 #include "exec/exec-all.h"
 #include "tcg-op.h"
+#include "tcg-op-gvec.h"
 #include "qemu/log.h"
 #include "arm_ldst.h"
 #include "translate.h"
@@ -82,6 +83,7 @@ typedef void NeonGenTwoDoubleOPFn(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_ptr);
 typedef void NeonGenOneOpFn(TCGv_i64, TCGv_i64);
 typedef void CryptoTwoOpEnvFn(TCGv_ptr, TCGv_i32, TCGv_i32);
 typedef void CryptoThreeOpEnvFn(TCGv_ptr, TCGv_i32, TCGv_i32, TCGv_i32);
+typedef void GVecGenTwoFn(uint32_t, uint32_t, uint32_t, uint32_t, uint32_t);
 
 /* initialize TCG globals.  */
 void a64_translate_init(void)
@@ -537,6 +539,21 @@ static inline int vec_reg_offset(DisasContext *s, int regno,
     return offs;
 }
 
+/* Return the offset into CPUARMState of the "whole" vector register Qn.  */
+static inline int vec_full_reg_offset(DisasContext *s, int regno)
+{
+    assert_fp_access_checked(s);
+    return offsetof(CPUARMState, vfp.regs[regno * 2]);
+}
+
+/* Return the byte size of the "whole" vector register, VL / 8.  */
+static inline int vec_full_reg_size(DisasContext *s)
+{
+    /* FIXME SVE: We should put the composite ZCR_EL* value into tb->flags.
+       In the meantime this is just the AdvSIMD length of 128.  */
+    return 128 / 8;
+}
+
 /* Return the offset into CPUARMState of a slice (from
  * the least significant end) of FP register Qn (ie
  * Dn, Sn, Hn or Bn).
@@ -9042,11 +9059,38 @@ static void disas_simd_3same_logic(DisasContext *s, uint32_t insn)
     bool is_q = extract32(insn, 30, 1);
     TCGv_i64 tcg_op1, tcg_op2, tcg_res[2];
     int pass;
+    GVecGenTwoFn *gvec_op;
 
     if (!fp_access_check(s)) {
         return;
     }
 
+    switch (size + 4 * is_u) {
+    case 0: /* AND */
+        gvec_op = tcg_gen_gvec_and8;
+        goto do_gvec;
+    case 1: /* BIC */
+        gvec_op = tcg_gen_gvec_andc8;
+        goto do_gvec;
+    case 2: /* ORR */
+        gvec_op = tcg_gen_gvec_or8;
+        goto do_gvec;
+    case 3: /* ORN */
+        gvec_op = tcg_gen_gvec_orc8;
+        goto do_gvec;
+    case 4: /* EOR */
+        gvec_op = tcg_gen_gvec_xor8;
+        goto do_gvec;
+    do_gvec:
+        gvec_op(vec_full_reg_offset(s, rd),
+                vec_full_reg_offset(s, rn),
+                vec_full_reg_offset(s, rm),
+                is_q ? 16 : 8, vec_full_reg_size(s));
+        return;
+    }
+
+    /* Note that we've now eliminated all !is_u.  */
+
     tcg_op1 = tcg_temp_new_i64();
     tcg_op2 = tcg_temp_new_i64();
     tcg_res[0] = tcg_temp_new_i64();
@@ -9056,47 +9100,27 @@ static void disas_simd_3same_logic(DisasContext *s, uint32_t insn)
         read_vec_element(s, tcg_op1, rn, pass, MO_64);
         read_vec_element(s, tcg_op2, rm, pass, MO_64);
 
-        if (!is_u) {
-            switch (size) {
-            case 0: /* AND */
-                tcg_gen_and_i64(tcg_res[pass], tcg_op1, tcg_op2);
-                break;
-            case 1: /* BIC */
-                tcg_gen_andc_i64(tcg_res[pass], tcg_op1, tcg_op2);
-                break;
-            case 2: /* ORR */
-                tcg_gen_or_i64(tcg_res[pass], tcg_op1, tcg_op2);
-                break;
-            case 3: /* ORN */
-                tcg_gen_orc_i64(tcg_res[pass], tcg_op1, tcg_op2);
-                break;
-            }
-        } else {
-            if (size != 0) {
-                /* B* ops need res loaded to operate on */
-                read_vec_element(s, tcg_res[pass], rd, pass, MO_64);
-            }
+        /* B* ops need res loaded to operate on */
+        read_vec_element(s, tcg_res[pass], rd, pass, MO_64);
 
-            switch (size) {
-            case 0: /* EOR */
-                tcg_gen_xor_i64(tcg_res[pass], tcg_op1, tcg_op2);
-                break;
-            case 1: /* BSL bitwise select */
-                tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_op2);
-                tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_res[pass]);
-                tcg_gen_xor_i64(tcg_res[pass], tcg_op2, tcg_op1);
-                break;
-            case 2: /* BIT, bitwise insert if true */
-                tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
-                tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_op2);
-                tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
-                break;
-            case 3: /* BIF, bitwise insert if false */
-                tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
-                tcg_gen_andc_i64(tcg_op1, tcg_op1, tcg_op2);
-                tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
-                break;
-            }
+        switch (size) {
+        case 1: /* BSL bitwise select */
+            tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_op2);
+            tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_res[pass]);
+            tcg_gen_xor_i64(tcg_res[pass], tcg_op2, tcg_op1);
+            break;
+        case 2: /* BIT, bitwise insert if true */
+            tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
+            tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_op2);
+            tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
+            break;
+        case 3: /* BIF, bitwise insert if false */
+            tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
+            tcg_gen_andc_i64(tcg_op1, tcg_op1, tcg_op2);
+            tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
+            break;
+        default:
+            g_assert_not_reached();
         }
     }
 
@@ -9370,6 +9394,7 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
     int rn = extract32(insn, 5, 5);
     int rd = extract32(insn, 0, 5);
     int pass;
+    GVecGenTwoFn *gvec_op;
 
     switch (opcode) {
     case 0x13: /* MUL, PMUL */
@@ -9409,6 +9434,28 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
         return;
     }
 
+    switch (opcode) {
+    case 0x10: /* ADD, SUB */
+        {
+            static GVecGenTwoFn * const fns[4][2] = {
+                { tcg_gen_gvec_add8, tcg_gen_gvec_sub8 },
+                { tcg_gen_gvec_add16, tcg_gen_gvec_sub16 },
+                { tcg_gen_gvec_add32, tcg_gen_gvec_sub32 },
+                { tcg_gen_gvec_add64, tcg_gen_gvec_sub64 },
+            };
+            gvec_op = fns[size][u];
+            goto do_gvec;
+        }
+        break;
+
+    do_gvec:
+        gvec_op(vec_full_reg_offset(s, rd),
+                vec_full_reg_offset(s, rn),
+                vec_full_reg_offset(s, rm),
+                is_q ? 16 : 8, vec_full_reg_size(s));
+        return;
+    }
+
     if (size == 3) {
         assert(is_q);
         for (pass = 0; pass < 2; pass++) {
@@ -9581,16 +9628,6 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
                 genfn = fns[size][u];
                 break;
             }
-            case 0x10: /* ADD, SUB */
-            {
-                static NeonGenTwoOpFn * const fns[3][2] = {
-                    { gen_helper_neon_add_u8, gen_helper_neon_sub_u8 },
-                    { gen_helper_neon_add_u16, gen_helper_neon_sub_u16 },
-                    { tcg_gen_add_i32, tcg_gen_sub_i32 },
-                };
-                genfn = fns[size][u];
-                break;
-            }
             case 0x11: /* CMTST, CMEQ */
             {
                 static NeonGenTwoOpFn * const fns[3][2] = {
-- 
2.13.5


* [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors
  2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic Richard Henderson
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
  2017-08-17 23:46   ` Philippe Mathieu-Daudé
  2017-09-07 18:18   ` Alex Bennée
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 4/8] tcg: Add operations " Richard Henderson
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-arm, alex.bennee

Nothing uses or enables them yet.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/tcg.h | 5 +++++
 tcg/tcg.c | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index dd97095af5..1277caed3d 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -256,6 +256,11 @@ typedef struct TCGPool {
 typedef enum TCGType {
     TCG_TYPE_I32,
     TCG_TYPE_I64,
+
+    TCG_TYPE_V64,
+    TCG_TYPE_V128,
+    TCG_TYPE_V256,
+
     TCG_TYPE_COUNT, /* number of different types */
 
     /* An alias for the size of the host register.  */
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 787c8ba0f7..ea78d47fad 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -118,7 +118,7 @@ static TCGReg tcg_reg_alloc_new(TCGContext *s, TCGType t)
 static bool tcg_out_ldst_finalize(TCGContext *s);
 #endif
 
-static TCGRegSet tcg_target_available_regs[2];
+static TCGRegSet tcg_target_available_regs[TCG_TYPE_COUNT];
 static TCGRegSet tcg_target_call_clobber_regs;
 
 #if TCG_TARGET_INSN_UNIT_SIZE == 1
-- 
2.13.5


* [Qemu-devel] [PATCH 4/8] tcg: Add operations for host vectors
  2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
                   ` (2 preceding siblings ...)
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
  2017-08-30  1:34   ` Philippe Mathieu-Daudé
  2017-09-07 19:00   ` Alex Bennée
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported Richard Henderson
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-arm, alex.bennee

Nothing uses or implements them yet.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/tcg-opc.h | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h     | 24 ++++++++++++++++
 2 files changed, 113 insertions(+)

diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 956fb1e9f3..9162125fac 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -206,6 +206,95 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
 
 #undef TLADDR_ARGS
 #undef DATA64_ARGS
+
+/* Host integer vector operations.  */
+/* These opcodes are required whenever the base vector size is enabled.  */
+
+DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
+DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
+DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
+DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
+DEF(ld_v256, 1, 1, 1, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(st_v64, 0, 2, 1, IMPL(TCG_TARGET_HAS_v64))
+DEF(st_v128, 0, 2, 1, IMPL(TCG_TARGET_HAS_v128))
+DEF(st_v256, 0, 2, 1, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(and_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(and_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(and_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(or_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(or_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(or_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(xor_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(xor_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(xor_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(add8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(add16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(add32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+
+DEF(add8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(add16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(add32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(add64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+
+DEF(add8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(add16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(add32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(add64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(sub8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(sub16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(sub32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+
+DEF(sub8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(sub16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(sub32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(sub64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+
+DEF(sub8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(sub16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(sub32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(sub64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+
+/* These opcodes are optional.
+   All element counts must be supported if any are.  */
+
+DEF(not_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v64))
+DEF(not_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v128))
+DEF(not_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v256))
+
+DEF(andc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v64))
+DEF(andc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v128))
+DEF(andc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v256))
+
+DEF(orc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v64))
+DEF(orc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v128))
+DEF(orc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v256))
+
+DEF(neg8_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
+DEF(neg16_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
+DEF(neg32_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
+
+DEF(neg8_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
+DEF(neg16_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
+DEF(neg32_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
+DEF(neg64_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
+
+DEF(neg8_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
+DEF(neg16_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
+DEF(neg32_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
+DEF(neg64_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
+
 #undef IMPL
 #undef IMPL64
 #undef DEF
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 1277caed3d..b9e15da13b 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -166,6 +166,30 @@ typedef uint64_t TCGRegSet;
 #define TCG_TARGET_HAS_rem_i64          0
 #endif
 
+#ifndef TCG_TARGET_HAS_v64
+#define TCG_TARGET_HAS_v64              0
+#define TCG_TARGET_HAS_andc_v64         0
+#define TCG_TARGET_HAS_orc_v64          0
+#define TCG_TARGET_HAS_not_v64          0
+#define TCG_TARGET_HAS_neg_v64          0
+#endif
+
+#ifndef TCG_TARGET_HAS_v128
+#define TCG_TARGET_HAS_v128             0
+#define TCG_TARGET_HAS_andc_v128        0
+#define TCG_TARGET_HAS_orc_v128         0
+#define TCG_TARGET_HAS_not_v128         0
+#define TCG_TARGET_HAS_neg_v128         0
+#endif
+
+#ifndef TCG_TARGET_HAS_v256
+#define TCG_TARGET_HAS_v256             0
+#define TCG_TARGET_HAS_andc_v256        0
+#define TCG_TARGET_HAS_orc_v256         0
+#define TCG_TARGET_HAS_not_v256         0
+#define TCG_TARGET_HAS_neg_v256         0
+#endif
+
 /* For 32-bit targets, some sort of unsigned widening multiply is required.  */
 #if TCG_TARGET_REG_BITS == 32 \
     && !(defined(TCG_TARGET_HAS_mulu2_i32) \
-- 
2.13.5


* [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported
  2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
                   ` (3 preceding siblings ...)
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 4/8] tcg: Add operations " Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
  2017-08-17 23:44   ` Philippe Mathieu-Daudé
  2017-09-07 19:02   ` Alex Bennée
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid Richard Henderson
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-arm, alex.bennee

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/tcg.h |   2 +
 tcg/tcg.c | 310 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 312 insertions(+)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index b9e15da13b..b443143b21 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -962,6 +962,8 @@ do {\
 #define tcg_temp_free_ptr(T) tcg_temp_free_i64(TCGV_PTR_TO_NAT(T))
 #endif
 
+bool tcg_op_supported(TCGOpcode op);
+
 void tcg_gen_callN(TCGContext *s, void *func,
                    TCGArg ret, int nargs, TCGArg *args);
 
diff --git a/tcg/tcg.c b/tcg/tcg.c
index ea78d47fad..3c3cdda938 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -751,6 +751,316 @@ int tcg_check_temp_count(void)
 }
 #endif
 
+/* Return true if OP may appear in the opcode stream.
+   Test the runtime variable that controls each opcode.  */
+bool tcg_op_supported(TCGOpcode op)
+{
+    switch (op) {
+    case INDEX_op_discard:
+    case INDEX_op_set_label:
+    case INDEX_op_call:
+    case INDEX_op_br:
+    case INDEX_op_mb:
+    case INDEX_op_insn_start:
+    case INDEX_op_exit_tb:
+    case INDEX_op_goto_tb:
+    case INDEX_op_qemu_ld_i32:
+    case INDEX_op_qemu_st_i32:
+    case INDEX_op_qemu_ld_i64:
+    case INDEX_op_qemu_st_i64:
+        return true;
+
+    case INDEX_op_goto_ptr:
+        return TCG_TARGET_HAS_goto_ptr;
+
+    case INDEX_op_mov_i32:
+    case INDEX_op_movi_i32:
+    case INDEX_op_setcond_i32:
+    case INDEX_op_brcond_i32:
+    case INDEX_op_ld8u_i32:
+    case INDEX_op_ld8s_i32:
+    case INDEX_op_ld16u_i32:
+    case INDEX_op_ld16s_i32:
+    case INDEX_op_ld_i32:
+    case INDEX_op_st8_i32:
+    case INDEX_op_st16_i32:
+    case INDEX_op_st_i32:
+    case INDEX_op_add_i32:
+    case INDEX_op_sub_i32:
+    case INDEX_op_mul_i32:
+    case INDEX_op_and_i32:
+    case INDEX_op_or_i32:
+    case INDEX_op_xor_i32:
+    case INDEX_op_shl_i32:
+    case INDEX_op_shr_i32:
+    case INDEX_op_sar_i32:
+        return true;
+
+    case INDEX_op_movcond_i32:
+        return TCG_TARGET_HAS_movcond_i32;
+    case INDEX_op_div_i32:
+    case INDEX_op_divu_i32:
+        return TCG_TARGET_HAS_div_i32;
+    case INDEX_op_rem_i32:
+    case INDEX_op_remu_i32:
+        return TCG_TARGET_HAS_rem_i32;
+    case INDEX_op_div2_i32:
+    case INDEX_op_divu2_i32:
+        return TCG_TARGET_HAS_div2_i32;
+    case INDEX_op_rotl_i32:
+    case INDEX_op_rotr_i32:
+        return TCG_TARGET_HAS_rot_i32;
+    case INDEX_op_deposit_i32:
+        return TCG_TARGET_HAS_deposit_i32;
+    case INDEX_op_extract_i32:
+        return TCG_TARGET_HAS_extract_i32;
+    case INDEX_op_sextract_i32:
+        return TCG_TARGET_HAS_sextract_i32;
+    case INDEX_op_add2_i32:
+        return TCG_TARGET_HAS_add2_i32;
+    case INDEX_op_sub2_i32:
+        return TCG_TARGET_HAS_sub2_i32;
+    case INDEX_op_mulu2_i32:
+        return TCG_TARGET_HAS_mulu2_i32;
+    case INDEX_op_muls2_i32:
+        return TCG_TARGET_HAS_muls2_i32;
+    case INDEX_op_muluh_i32:
+        return TCG_TARGET_HAS_muluh_i32;
+    case INDEX_op_mulsh_i32:
+        return TCG_TARGET_HAS_mulsh_i32;
+    case INDEX_op_ext8s_i32:
+        return TCG_TARGET_HAS_ext8s_i32;
+    case INDEX_op_ext16s_i32:
+        return TCG_TARGET_HAS_ext16s_i32;
+    case INDEX_op_ext8u_i32:
+        return TCG_TARGET_HAS_ext8u_i32;
+    case INDEX_op_ext16u_i32:
+        return TCG_TARGET_HAS_ext16u_i32;
+    case INDEX_op_bswap16_i32:
+        return TCG_TARGET_HAS_bswap16_i32;
+    case INDEX_op_bswap32_i32:
+        return TCG_TARGET_HAS_bswap32_i32;
+    case INDEX_op_not_i32:
+        return TCG_TARGET_HAS_not_i32;
+    case INDEX_op_neg_i32:
+        return TCG_TARGET_HAS_neg_i32;
+    case INDEX_op_andc_i32:
+        return TCG_TARGET_HAS_andc_i32;
+    case INDEX_op_orc_i32:
+        return TCG_TARGET_HAS_orc_i32;
+    case INDEX_op_eqv_i32:
+        return TCG_TARGET_HAS_eqv_i32;
+    case INDEX_op_nand_i32:
+        return TCG_TARGET_HAS_nand_i32;
+    case INDEX_op_nor_i32:
+        return TCG_TARGET_HAS_nor_i32;
+    case INDEX_op_clz_i32:
+        return TCG_TARGET_HAS_clz_i32;
+    case INDEX_op_ctz_i32:
+        return TCG_TARGET_HAS_ctz_i32;
+    case INDEX_op_ctpop_i32:
+        return TCG_TARGET_HAS_ctpop_i32;
+
+    case INDEX_op_brcond2_i32:
+    case INDEX_op_setcond2_i32:
+        return TCG_TARGET_REG_BITS == 32;
+
+    case INDEX_op_mov_i64:
+    case INDEX_op_movi_i64:
+    case INDEX_op_setcond_i64:
+    case INDEX_op_brcond_i64:
+    case INDEX_op_ld8u_i64:
+    case INDEX_op_ld8s_i64:
+    case INDEX_op_ld16u_i64:
+    case INDEX_op_ld16s_i64:
+    case INDEX_op_ld32u_i64:
+    case INDEX_op_ld32s_i64:
+    case INDEX_op_ld_i64:
+    case INDEX_op_st8_i64:
+    case INDEX_op_st16_i64:
+    case INDEX_op_st32_i64:
+    case INDEX_op_st_i64:
+    case INDEX_op_add_i64:
+    case INDEX_op_sub_i64:
+    case INDEX_op_mul_i64:
+    case INDEX_op_and_i64:
+    case INDEX_op_or_i64:
+    case INDEX_op_xor_i64:
+    case INDEX_op_shl_i64:
+    case INDEX_op_shr_i64:
+    case INDEX_op_sar_i64:
+    case INDEX_op_ext_i32_i64:
+    case INDEX_op_extu_i32_i64:
+        return TCG_TARGET_REG_BITS == 64;
+
+    case INDEX_op_movcond_i64:
+        return TCG_TARGET_HAS_movcond_i64;
+    case INDEX_op_div_i64:
+    case INDEX_op_divu_i64:
+        return TCG_TARGET_HAS_div_i64;
+    case INDEX_op_rem_i64:
+    case INDEX_op_remu_i64:
+        return TCG_TARGET_HAS_rem_i64;
+    case INDEX_op_div2_i64:
+    case INDEX_op_divu2_i64:
+        return TCG_TARGET_HAS_div2_i64;
+    case INDEX_op_rotl_i64:
+    case INDEX_op_rotr_i64:
+        return TCG_TARGET_HAS_rot_i64;
+    case INDEX_op_deposit_i64:
+        return TCG_TARGET_HAS_deposit_i64;
+    case INDEX_op_extract_i64:
+        return TCG_TARGET_HAS_extract_i64;
+    case INDEX_op_sextract_i64:
+        return TCG_TARGET_HAS_sextract_i64;
+    case INDEX_op_extrl_i64_i32:
+        return TCG_TARGET_HAS_extrl_i64_i32;
+    case INDEX_op_extrh_i64_i32:
+        return TCG_TARGET_HAS_extrh_i64_i32;
+    case INDEX_op_ext8s_i64:
+        return TCG_TARGET_HAS_ext8s_i64;
+    case INDEX_op_ext16s_i64:
+        return TCG_TARGET_HAS_ext16s_i64;
+    case INDEX_op_ext32s_i64:
+        return TCG_TARGET_HAS_ext32s_i64;
+    case INDEX_op_ext8u_i64:
+        return TCG_TARGET_HAS_ext8u_i64;
+    case INDEX_op_ext16u_i64:
+        return TCG_TARGET_HAS_ext16u_i64;
+    case INDEX_op_ext32u_i64:
+        return TCG_TARGET_HAS_ext32u_i64;
+    case INDEX_op_bswap16_i64:
+        return TCG_TARGET_HAS_bswap16_i64;
+    case INDEX_op_bswap32_i64:
+        return TCG_TARGET_HAS_bswap32_i64;
+    case INDEX_op_bswap64_i64:
+        return TCG_TARGET_HAS_bswap64_i64;
+    case INDEX_op_not_i64:
+        return TCG_TARGET_HAS_not_i64;
+    case INDEX_op_neg_i64:
+        return TCG_TARGET_HAS_neg_i64;
+    case INDEX_op_andc_i64:
+        return TCG_TARGET_HAS_andc_i64;
+    case INDEX_op_orc_i64:
+        return TCG_TARGET_HAS_orc_i64;
+    case INDEX_op_eqv_i64:
+        return TCG_TARGET_HAS_eqv_i64;
+    case INDEX_op_nand_i64:
+        return TCG_TARGET_HAS_nand_i64;
+    case INDEX_op_nor_i64:
+        return TCG_TARGET_HAS_nor_i64;
+    case INDEX_op_clz_i64:
+        return TCG_TARGET_HAS_clz_i64;
+    case INDEX_op_ctz_i64:
+        return TCG_TARGET_HAS_ctz_i64;
+    case INDEX_op_ctpop_i64:
+        return TCG_TARGET_HAS_ctpop_i64;
+    case INDEX_op_add2_i64:
+        return TCG_TARGET_HAS_add2_i64;
+    case INDEX_op_sub2_i64:
+        return TCG_TARGET_HAS_sub2_i64;
+    case INDEX_op_mulu2_i64:
+        return TCG_TARGET_HAS_mulu2_i64;
+    case INDEX_op_muls2_i64:
+        return TCG_TARGET_HAS_muls2_i64;
+    case INDEX_op_muluh_i64:
+        return TCG_TARGET_HAS_muluh_i64;
+    case INDEX_op_mulsh_i64:
+        return TCG_TARGET_HAS_mulsh_i64;
+
+    case INDEX_op_mov_v64:
+    case INDEX_op_movi_v64:
+    case INDEX_op_ld_v64:
+    case INDEX_op_st_v64:
+    case INDEX_op_and_v64:
+    case INDEX_op_or_v64:
+    case INDEX_op_xor_v64:
+    case INDEX_op_add8_v64:
+    case INDEX_op_add16_v64:
+    case INDEX_op_add32_v64:
+    case INDEX_op_sub8_v64:
+    case INDEX_op_sub16_v64:
+    case INDEX_op_sub32_v64:
+        return TCG_TARGET_HAS_v64;
+
+    case INDEX_op_mov_v128:
+    case INDEX_op_movi_v128:
+    case INDEX_op_ld_v128:
+    case INDEX_op_st_v128:
+    case INDEX_op_and_v128:
+    case INDEX_op_or_v128:
+    case INDEX_op_xor_v128:
+    case INDEX_op_add8_v128:
+    case INDEX_op_add16_v128:
+    case INDEX_op_add32_v128:
+    case INDEX_op_add64_v128:
+    case INDEX_op_sub8_v128:
+    case INDEX_op_sub16_v128:
+    case INDEX_op_sub32_v128:
+    case INDEX_op_sub64_v128:
+        return TCG_TARGET_HAS_v128;
+
+    case INDEX_op_mov_v256:
+    case INDEX_op_movi_v256:
+    case INDEX_op_ld_v256:
+    case INDEX_op_st_v256:
+    case INDEX_op_and_v256:
+    case INDEX_op_or_v256:
+    case INDEX_op_xor_v256:
+    case INDEX_op_add8_v256:
+    case INDEX_op_add16_v256:
+    case INDEX_op_add32_v256:
+    case INDEX_op_add64_v256:
+    case INDEX_op_sub8_v256:
+    case INDEX_op_sub16_v256:
+    case INDEX_op_sub32_v256:
+    case INDEX_op_sub64_v256:
+        return TCG_TARGET_HAS_v256;
+
+    case INDEX_op_not_v64:
+        return TCG_TARGET_HAS_not_v64;
+    case INDEX_op_not_v128:
+        return TCG_TARGET_HAS_not_v128;
+    case INDEX_op_not_v256:
+        return TCG_TARGET_HAS_not_v256;
+
+    case INDEX_op_andc_v64:
+        return TCG_TARGET_HAS_andc_v64;
+    case INDEX_op_andc_v128:
+        return TCG_TARGET_HAS_andc_v128;
+    case INDEX_op_andc_v256:
+        return TCG_TARGET_HAS_andc_v256;
+
+    case INDEX_op_orc_v64:
+        return TCG_TARGET_HAS_orc_v64;
+    case INDEX_op_orc_v128:
+        return TCG_TARGET_HAS_orc_v128;
+    case INDEX_op_orc_v256:
+        return TCG_TARGET_HAS_orc_v256;
+
+    case INDEX_op_neg8_v64:
+    case INDEX_op_neg16_v64:
+    case INDEX_op_neg32_v64:
+        return TCG_TARGET_HAS_neg_v64;
+
+    case INDEX_op_neg8_v128:
+    case INDEX_op_neg16_v128:
+    case INDEX_op_neg32_v128:
+    case INDEX_op_neg64_v128:
+        return TCG_TARGET_HAS_neg_v128;
+
+    case INDEX_op_neg8_v256:
+    case INDEX_op_neg16_v256:
+    case INDEX_op_neg32_v256:
+    case INDEX_op_neg64_v256:
+        return TCG_TARGET_HAS_neg_v256;
+
+    case NB_OPS:
+        break;
+    }
+    g_assert_not_reached();
+}
+
 /* Note: we convert the 64 bit args to 32 bit and do some alignment
    and endian swap. Maybe it would be better to do the alignment
    and endian swap in tcg_reg_alloc_call(). */
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid
  2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
                   ` (4 preceding siblings ...)
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
  2017-08-17 23:45   ` Philippe Mathieu-Daudé
  2017-09-08  9:30   ` Alex Bennée
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 7/8] tcg: Expand target vector ops with host vector ops Richard Henderson
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-arm, alex.bennee

Add it with value 0 so that zero-initializing a structure can
indicate that an opcode field is not present.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/tcg-opc.h | 2 ++
 tcg/tcg.c     | 3 +++
 2 files changed, 5 insertions(+)
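
[Not part of the patch: a minimal standalone sketch of why value 0 matters.
The enum and struct below are simplified stand-ins for the real TCGOpcode
and for the GVecGen3 descriptor that a later patch in this series extends;
only the zero-initialization behaviour is the point.]

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for the real definitions in tcg/tcg-opc.h and
   tcg/tcg-op-gvec.h; only zero initialization matters here.  */
typedef enum {
    INDEX_op_invalid = 0,   /* value 0, so it is what zero init produces */
    INDEX_op_add8_v64,
    /* ... */
} TCGOpcode;

typedef struct {
    TCGOpcode op_v64;
} GVecGen3;

static bool tcg_op_supported(TCGOpcode op)
{
    /* the real function also consults the TCG_TARGET_HAS_* flags */
    return op != INDEX_op_invalid;
}

int main(void)
{
    static const GVecGen3 g = { .op_v64 = INDEX_op_add8_v64 };
    static const GVecGen3 h = { 0 };    /* op_v64 left to zero init */

    printf("%d %d\n", tcg_op_supported(g.op_v64),   /* 1: use vector op */
                      tcg_op_supported(h.op_v64));  /* 0: fall back */
    return 0;
}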

diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 9162125fac..b1445a4c24 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -26,6 +26,8 @@
  * DEF(name, oargs, iargs, cargs, flags)
  */
 
+DEF(invalid, 0, 0, 0, TCG_OPF_NOT_PRESENT)
+
 /* predefined ops */
 DEF(discard, 1, 0, 0, TCG_OPF_NOT_PRESENT)
 DEF(set_label, 0, 0, 1, TCG_OPF_BB_END | TCG_OPF_NOT_PRESENT)
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 3c3cdda938..879b29e81f 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -756,6 +756,9 @@ int tcg_check_temp_count(void)
 bool tcg_op_supported(TCGOpcode op)
 {
     switch (op) {
+    case INDEX_op_invalid:
+        return false;
+
     case INDEX_op_discard:
     case INDEX_op_set_label:
     case INDEX_op_call:
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH 7/8] tcg: Expand target vector ops with host vector ops
  2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
                   ` (5 preceding siblings ...)
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
  2017-09-08  9:34   ` Alex Bennée
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations Richard Henderson
  2017-09-08 13:49 ` [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Alex Bennée
  8 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-arm, alex.bennee

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/tcg-op-gvec.h |   4 +
 tcg/tcg.h         |   6 +-
 tcg/tcg-op-gvec.c | 230 +++++++++++++++++++++++++++++++++++++++++++-----------
 tcg/tcg.c         |   8 +-
 4 files changed, 197 insertions(+), 51 deletions(-)
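
[Not part of the patch: a toy sketch of the size-splitting strategy used by
tcg_gen_gvec_3() below, mirroring the comment in tcg-op-gvec.c about ARM SVE
vector sizes that are not a power of 2.  QEMU_ALIGN_DOWN matches the real
macro; the loop and the printed breakdown are illustration only, assuming
256-bit and 128-bit host vectors are available.]

#include <stdio.h>
#include <stdint.h>

#define QEMU_ALIGN_DOWN(n, m)  ((n) / (m) * (m))

int main(void)
{
    /* e.g. an 80-byte operation is expanded as 2x32 bytes + 1x16 bytes */
    uint32_t opsz = 80;
    static const uint32_t lnsz[] = { 32, 16, 8, 4 };

    for (unsigned i = 0; i < 4 && opsz != 0; i++) {
        uint32_t done = QEMU_ALIGN_DOWN(opsz, lnsz[i]);
        if (done) {
            printf("%u x %u-byte host op(s)\n", done / lnsz[i], lnsz[i]);
            opsz -= done;
        }
    }
    return 0;
}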

diff --git a/tcg/tcg-op-gvec.h b/tcg/tcg-op-gvec.h
index 10db3599a5..99f36d208e 100644
--- a/tcg/tcg-op-gvec.h
+++ b/tcg/tcg-op-gvec.h
@@ -40,6 +40,10 @@ typedef struct {
     /* Similarly, but load up a constant and re-use across lanes.  */
     void (*fni8x)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
     uint64_t extra_value;
+    /* Operations with host vector ops.  */
+    TCGOpcode op_v256;
+    TCGOpcode op_v128;
+    TCGOpcode op_v64;
     /* Larger sizes: expand out-of-line helper w/size descriptor.  */
     void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);
 } GVecGen3;
diff --git a/tcg/tcg.h b/tcg/tcg.h
index b443143b21..7f10501d31 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -825,9 +825,11 @@ int tcg_global_mem_new_internal(TCGType, TCGv_ptr, intptr_t, const char *);
 TCGv_i32 tcg_global_reg_new_i32(TCGReg reg, const char *name);
 TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name);
 
-TCGv_i32 tcg_temp_new_internal_i32(int temp_local);
-TCGv_i64 tcg_temp_new_internal_i64(int temp_local);
+int tcg_temp_new_internal(TCGType type, bool temp_local);
+TCGv_i32 tcg_temp_new_internal_i32(bool temp_local);
+TCGv_i64 tcg_temp_new_internal_i64(bool temp_local);
 
+void tcg_temp_free_internal(int arg);
 void tcg_temp_free_i32(TCGv_i32 arg);
 void tcg_temp_free_i64(TCGv_i64 arg);
 
diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
index 6de49dc07f..3aca565dc0 100644
--- a/tcg/tcg-op-gvec.c
+++ b/tcg/tcg-op-gvec.c
@@ -30,54 +30,73 @@
 #define REP8(x)    ((x) * 0x0101010101010101ull)
 #define REP16(x)   ((x) * 0x0001000100010001ull)
 
-#define MAX_INLINE 16
+#define MAX_UNROLL  4
 
-static inline void check_size_s(uint32_t opsz, uint32_t clsz)
+static inline void check_size_align(uint32_t opsz, uint32_t clsz, uint32_t ofs)
 {
-    tcg_debug_assert(opsz % 8 == 0);
-    tcg_debug_assert(clsz % 8 == 0);
+    uint32_t align = clsz > 16 || opsz >= 16 ? 15 : 7;
+    tcg_debug_assert(opsz > 0);
     tcg_debug_assert(opsz <= clsz);
+    tcg_debug_assert((opsz & align) == 0);
+    tcg_debug_assert((clsz & align) == 0);
+    tcg_debug_assert((ofs & align) == 0);
 }
 
-static inline void check_align_s_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
+static inline void check_overlap_3(uint32_t d, uint32_t a,
+                                   uint32_t b, uint32_t s)
 {
-    tcg_debug_assert(dofs % 8 == 0);
-    tcg_debug_assert(aofs % 8 == 0);
-    tcg_debug_assert(bofs % 8 == 0);
+    tcg_debug_assert(d == a || d + s <= a || a + s <= d);
+    tcg_debug_assert(d == b || d + s <= b || b + s <= d);
+    tcg_debug_assert(a == b || a + s <= b || b + s <= a);
 }
 
-static inline void check_size_l(uint32_t opsz, uint32_t clsz)
+static inline bool check_size_impl(uint32_t opsz, uint32_t lnsz)
 {
-    tcg_debug_assert(opsz % 16 == 0);
-    tcg_debug_assert(clsz % 16 == 0);
-    tcg_debug_assert(opsz <= clsz);
+    uint32_t lnct = opsz / lnsz;
+    return lnct >= 1 && lnct <= MAX_UNROLL;
 }
 
-static inline void check_align_l_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
+static void expand_clr_v(uint32_t dofs, uint32_t clsz, uint32_t lnsz,
+                         TCGType type, TCGOpcode opc_mv, TCGOpcode opc_st)
 {
-    tcg_debug_assert(dofs % 16 == 0);
-    tcg_debug_assert(aofs % 16 == 0);
-    tcg_debug_assert(bofs % 16 == 0);
-}
+    TCGArg t0 = tcg_temp_new_internal(type, 0);
+    TCGArg env = GET_TCGV_PTR(tcg_ctx.tcg_env);
+    uint32_t i;
 
-static inline void check_overlap_3(uint32_t d, uint32_t a,
-                                   uint32_t b, uint32_t s)
-{
-    tcg_debug_assert(d == a || d + s <= a || a + s <= d);
-    tcg_debug_assert(d == b || d + s <= b || b + s <= d);
-    tcg_debug_assert(a == b || a + s <= b || b + s <= a);
+    tcg_gen_op2(&tcg_ctx, opc_mv, t0, 0);
+    for (i = 0; i < clsz; i += lnsz) {
+        tcg_gen_op3(&tcg_ctx, opc_st, t0, env, dofs + i);
+    }
+    tcg_temp_free_internal(t0);
 }
 
-static void expand_clr(uint32_t dofs, uint32_t opsz, uint32_t clsz)
+static void expand_clr(uint32_t dofs, uint32_t clsz)
 {
-    if (clsz > opsz) {
-        TCGv_i64 zero = tcg_const_i64(0);
-        uint32_t i;
+    if (clsz >= 32 && TCG_TARGET_HAS_v256) {
+        uint32_t done = QEMU_ALIGN_DOWN(clsz, 32);
+        expand_clr_v(dofs, done, 32, TCG_TYPE_V256,
+                     INDEX_op_movi_v256, INDEX_op_st_v256);
+        dofs += done;
+        clsz -= done;
+    }
 
-        for (i = opsz; i < clsz; i += 8) {
-            tcg_gen_st_i64(zero, tcg_ctx.tcg_env, dofs + i);
-        }
-        tcg_temp_free_i64(zero);
+    if (clsz >= 16 && TCG_TARGET_HAS_v128) {
+        uint32_t done = QEMU_ALIGN_DOWN(clsz, 16);
+        expand_clr_v(dofs, done, 16, TCG_TYPE_V128,
+                     INDEX_op_movi_v128, INDEX_op_st_v128);
+        dofs += done;
+        clsz -= done;
+    }
+
+    if (TCG_TARGET_REG_BITS == 64) {
+        expand_clr_v(dofs, clsz, 8, TCG_TYPE_I64,
+                     INDEX_op_movi_i64, INDEX_op_st_i64);
+    } else if (TCG_TARGET_HAS_v64) {
+        expand_clr_v(dofs, clsz, 8, TCG_TYPE_V64,
+                     INDEX_op_movi_v64, INDEX_op_st_v64);
+    } else {
+        expand_clr_v(dofs, clsz, 4, TCG_TYPE_I32,
+                     INDEX_op_movi_i32, INDEX_op_st_i32);
     }
 }
 
@@ -164,6 +183,7 @@ static void expand_3x8(uint32_t dofs, uint32_t aofs,
     tcg_temp_free_i64(t0);
 }
 
+/* FIXME: add CSE for constants and we can eliminate this.  */
 static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
                          uint32_t opsz, uint64_t data,
                          void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64))
@@ -192,28 +212,111 @@ static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
     tcg_temp_free_i64(t2);
 }
 
+static void expand_3_v(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+                       uint32_t opsz, uint32_t lnsz, TCGType type,
+                       TCGOpcode opc_op, TCGOpcode opc_ld, TCGOpcode opc_st)
+{
+    TCGArg t0 = tcg_temp_new_internal(type, 0);
+    TCGArg env = GET_TCGV_PTR(tcg_ctx.tcg_env);
+    uint32_t i;
+
+    if (aofs == bofs) {
+        for (i = 0; i < opsz; i += lnsz) {
+            tcg_gen_op3(&tcg_ctx, opc_ld, t0, env, aofs + i);
+            tcg_gen_op3(&tcg_ctx, opc_op, t0, t0, t0);
+            tcg_gen_op3(&tcg_ctx, opc_st, t0, env, dofs + i);
+        }
+    } else {
+        TCGArg t1 = tcg_temp_new_internal(type, 0);
+        for (i = 0; i < opsz; i += lnsz) {
+            tcg_gen_op3(&tcg_ctx, opc_ld, t0, env, aofs + i);
+            tcg_gen_op3(&tcg_ctx, opc_ld, t1, env, bofs + i);
+            tcg_gen_op3(&tcg_ctx, opc_op, t0, t0, t1);
+            tcg_gen_op3(&tcg_ctx, opc_st, t0, env, dofs + i);
+        }
+        tcg_temp_free_internal(t1);
+    }
+    tcg_temp_free_internal(t0);
+}
+
 void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
                     uint32_t opsz, uint32_t clsz, const GVecGen3 *g)
 {
+    check_size_align(opsz, clsz, dofs | aofs | bofs);
     check_overlap_3(dofs, aofs, bofs, clsz);
-    if (opsz <= MAX_INLINE) {
-        check_size_s(opsz, clsz);
-        check_align_s_3(dofs, aofs, bofs);
-        if (g->fni8) {
-            expand_3x8(dofs, aofs, bofs, opsz, g->fni8);
-        } else if (g->fni4) {
-            expand_3x4(dofs, aofs, bofs, opsz, g->fni4);
+
+    if (opsz > MAX_UNROLL * 32 || clsz > MAX_UNROLL * 32) {
+        goto do_ool;
+    }
+
+    /* Recall that ARM SVE allows vector sizes that are not a power of 2.
+       Expand with successively smaller host vector sizes.  The intent is
+       that e.g. opsz == 80 would be expanded with 2x32 + 1x16.  */
+    /* ??? For clsz > opsz, the host may be able to use an op-sized
+       operation, zeroing the balance of the register.  We can then
+       use a cl-sized store to implement the clearing without an extra
+       store operation.  This is true for aarch64 and x86_64 hosts.  */
+
+    if (check_size_impl(opsz, 32) && tcg_op_supported(g->op_v256)) {
+        uint32_t done = QEMU_ALIGN_DOWN(opsz, 32);
+        expand_3_v(dofs, aofs, bofs, done, 32, TCG_TYPE_V256,
+                   g->op_v256, INDEX_op_ld_v256, INDEX_op_st_v256);
+        dofs += done;
+        aofs += done;
+        bofs += done;
+        opsz -= done;
+        clsz -= done;
+    }
+
+    if (check_size_impl(opsz, 16) && tcg_op_supported(g->op_v128)) {
+        uint32_t done = QEMU_ALIGN_DOWN(opsz, 16);
+        expand_3_v(dofs, aofs, bofs, done, 16, TCG_TYPE_V128,
+                   g->op_v128, INDEX_op_ld_v128, INDEX_op_st_v128);
+        dofs += done;
+        aofs += done;
+        bofs += done;
+        opsz -= done;
+        clsz -= done;
+    }
+
+    if (check_size_impl(opsz, 8)) {
+        uint32_t done = QEMU_ALIGN_DOWN(opsz, 8);
+        if (tcg_op_supported(g->op_v64)) {
+            expand_3_v(dofs, aofs, bofs, done, 8, TCG_TYPE_V64,
+                       g->op_v64, INDEX_op_ld_v64, INDEX_op_st_v64);
+        } else if (g->fni8) {
+            expand_3x8(dofs, aofs, bofs, done, g->fni8);
         } else if (g->fni8x) {
-            expand_3x8p1(dofs, aofs, bofs, opsz, g->extra_value, g->fni8x);
+            expand_3x8p1(dofs, aofs, bofs, done, g->extra_value, g->fni8x);
         } else {
-            g_assert_not_reached();
+            done = 0;
         }
-        expand_clr(dofs, opsz, clsz);
-    } else {
-        check_size_l(opsz, clsz);
-        check_align_l_3(dofs, aofs, bofs);
-        expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
+        dofs += done;
+        aofs += done;
+        bofs += done;
+        opsz -= done;
+        clsz -= done;
     }
+
+    if (check_size_impl(opsz, 4)) {
+        uint32_t done = QEMU_ALIGN_DOWN(opsz, 4);
+        expand_3x4(dofs, aofs, bofs, done, g->fni4);
+        dofs += done;
+        aofs += done;
+        bofs += done;
+        opsz -= done;
+        clsz -= done;
+    }
+
+    if (opsz == 0) {
+        if (clsz != 0) {
+            expand_clr(dofs, clsz);
+        }
+        return;
+    }
+
+ do_ool:
+    expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
 }
 
 static void gen_addv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
@@ -240,6 +343,9 @@ void tcg_gen_gvec_add8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
     static const GVecGen3 g = {
         .extra_value = REP8(0x80),
         .fni8x = gen_addv_mask,
+        .op_v256 = INDEX_op_add8_v256,
+        .op_v128 = INDEX_op_add8_v128,
+        .op_v64 = INDEX_op_add8_v64,
         .fno = gen_helper_gvec_add8,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -251,6 +357,9 @@ void tcg_gen_gvec_add16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
     static const GVecGen3 g = {
         .extra_value = REP16(0x8000),
         .fni8x = gen_addv_mask,
+        .op_v256 = INDEX_op_add16_v256,
+        .op_v128 = INDEX_op_add16_v128,
+        .op_v64 = INDEX_op_add16_v64,
         .fno = gen_helper_gvec_add16,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -261,6 +370,9 @@ void tcg_gen_gvec_add32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
 {
     static const GVecGen3 g = {
         .fni4 = tcg_gen_add_i32,
+        .op_v256 = INDEX_op_add32_v256,
+        .op_v128 = INDEX_op_add32_v128,
+        .op_v64 = INDEX_op_add32_v64,
         .fno = gen_helper_gvec_add32,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -271,6 +383,8 @@ void tcg_gen_gvec_add64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
 {
     static const GVecGen3 g = {
         .fni8 = tcg_gen_add_i64,
+        .op_v256 = INDEX_op_add64_v256,
+        .op_v128 = INDEX_op_add64_v128,
         .fno = gen_helper_gvec_add64,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -328,6 +442,9 @@ void tcg_gen_gvec_sub8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
     static const GVecGen3 g = {
         .extra_value = REP8(0x80),
         .fni8x = gen_subv_mask,
+        .op_v256 = INDEX_op_sub8_v256,
+        .op_v128 = INDEX_op_sub8_v128,
+        .op_v64 = INDEX_op_sub8_v64,
         .fno = gen_helper_gvec_sub8,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -339,6 +456,9 @@ void tcg_gen_gvec_sub16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
     static const GVecGen3 g = {
         .extra_value = REP16(0x8000),
         .fni8x = gen_subv_mask,
+        .op_v256 = INDEX_op_sub16_v256,
+        .op_v128 = INDEX_op_sub16_v128,
+        .op_v64 = INDEX_op_sub16_v64,
         .fno = gen_helper_gvec_sub16,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -349,6 +469,9 @@ void tcg_gen_gvec_sub32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
 {
     static const GVecGen3 g = {
         .fni4 = tcg_gen_sub_i32,
+        .op_v256 = INDEX_op_sub32_v256,
+        .op_v128 = INDEX_op_sub32_v128,
+        .op_v64 = INDEX_op_sub32_v64,
         .fno = gen_helper_gvec_sub32,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -359,6 +482,8 @@ void tcg_gen_gvec_sub64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
 {
     static const GVecGen3 g = {
         .fni8 = tcg_gen_sub_i64,
+        .op_v256 = INDEX_op_sub64_v256,
+        .op_v128 = INDEX_op_sub64_v128,
         .fno = gen_helper_gvec_sub64,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -397,6 +522,9 @@ void tcg_gen_gvec_and8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
 {
     static const GVecGen3 g = {
         .fni8 = tcg_gen_and_i64,
+        .op_v256 = INDEX_op_and_v256,
+        .op_v128 = INDEX_op_and_v128,
+        .op_v64 = INDEX_op_and_v64,
         .fno = gen_helper_gvec_and8,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -407,6 +535,9 @@ void tcg_gen_gvec_or8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
 {
     static const GVecGen3 g = {
         .fni8 = tcg_gen_or_i64,
+        .op_v256 = INDEX_op_or_v256,
+        .op_v128 = INDEX_op_or_v128,
+        .op_v64 = INDEX_op_or_v64,
         .fno = gen_helper_gvec_or8,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -417,6 +548,9 @@ void tcg_gen_gvec_xor8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
 {
     static const GVecGen3 g = {
         .fni8 = tcg_gen_xor_i64,
+        .op_v256 = INDEX_op_xor_v256,
+        .op_v128 = INDEX_op_xor_v128,
+        .op_v64 = INDEX_op_xor_v64,
         .fno = gen_helper_gvec_xor8,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -427,6 +561,9 @@ void tcg_gen_gvec_andc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
 {
     static const GVecGen3 g = {
         .fni8 = tcg_gen_andc_i64,
+        .op_v256 = INDEX_op_andc_v256,
+        .op_v128 = INDEX_op_andc_v128,
+        .op_v64 = INDEX_op_andc_v64,
         .fno = gen_helper_gvec_andc8,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -437,6 +574,9 @@ void tcg_gen_gvec_orc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
 {
     static const GVecGen3 g = {
         .fni8 = tcg_gen_orc_i64,
+        .op_v256 = INDEX_op_orc_v256,
+        .op_v128 = INDEX_op_orc_v128,
+        .op_v64 = INDEX_op_orc_v64,
         .fno = gen_helper_gvec_orc8,
     };
     tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 879b29e81f..86eb4214b0 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -604,7 +604,7 @@ int tcg_global_mem_new_internal(TCGType type, TCGv_ptr base,
     return temp_idx(s, ts);
 }
 
-static int tcg_temp_new_internal(TCGType type, int temp_local)
+int tcg_temp_new_internal(TCGType type, bool temp_local)
 {
     TCGContext *s = &tcg_ctx;
     TCGTemp *ts;
@@ -650,7 +650,7 @@ static int tcg_temp_new_internal(TCGType type, int temp_local)
     return idx;
 }
 
-TCGv_i32 tcg_temp_new_internal_i32(int temp_local)
+TCGv_i32 tcg_temp_new_internal_i32(bool temp_local)
 {
     int idx;
 
@@ -658,7 +658,7 @@ TCGv_i32 tcg_temp_new_internal_i32(int temp_local)
     return MAKE_TCGV_I32(idx);
 }
 
-TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
+TCGv_i64 tcg_temp_new_internal_i64(bool temp_local)
 {
     int idx;
 
@@ -666,7 +666,7 @@ TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
     return MAKE_TCGV_I64(idx);
 }
 
-static void tcg_temp_free_internal(int idx)
+void tcg_temp_free_internal(int idx)
 {
     TCGContext *s = &tcg_ctx;
     TCGTemp *ts;
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
  2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
                   ` (6 preceding siblings ...)
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 7/8] tcg: Expand target vector ops with host vector ops Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
  2017-08-22 13:15   ` Alex Bennée
  2017-09-08 10:13   ` Alex Bennée
  2017-09-08 13:49 ` [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Alex Bennée
  8 siblings, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-arm, alex.bennee

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/i386/tcg-target.h     |  46 +++++-
 tcg/tcg-opc.h             |  12 +-
 tcg/i386/tcg-target.inc.c | 382 ++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 399 insertions(+), 41 deletions(-)
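
[Not part of the patch: a standalone sketch of the runtime check that makes
AVX safe to use, mirroring the tcg_target_init() hunk at the end of this
patch.  bit_OSXSAVE and bit_AVX come from GCC's cpuid.h; XCR0 bits 1 and 2
must both be set, i.e. the OS saves and restores XMM and YMM state.]

#include <stdbool.h>
#include <cpuid.h>

static bool cpu_and_os_support_avx(void)
{
    unsigned a, b, c, d;

    if (!__get_cpuid(1, &a, &b, &c, &d)) {
        return false;
    }
    if (!(c & bit_OSXSAVE) || !(c & bit_AVX)) {
        return false;
    }
    /* XGETBV with ECX=0 reads XCR0; SSE (bit 1) and AVX (bit 2) state
       must both be enabled.  Note the parentheses: '==' binds tighter
       than '&'.  */
    unsigned xcrl, xcrh;
    asm volatile("xgetbv" : "=a"(xcrl), "=d"(xcrh) : "c"(0));
    return (xcrl & 6) == 6;
}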

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index e512648c95..147f82062b 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -30,11 +30,10 @@
 
 #ifdef __x86_64__
 # define TCG_TARGET_REG_BITS  64
-# define TCG_TARGET_NB_REGS   16
 #else
 # define TCG_TARGET_REG_BITS  32
-# define TCG_TARGET_NB_REGS    8
 #endif
+# define TCG_TARGET_NB_REGS   24
 
 typedef enum {
     TCG_REG_EAX = 0,
@@ -56,6 +55,19 @@ typedef enum {
     TCG_REG_R13,
     TCG_REG_R14,
     TCG_REG_R15,
+
+    /* SSE registers; 64-bit has access to 8 more, but we won't
+       need more than a few and using only the first 8 minimizes
+       the need for a rex prefix on the sse instructions.  */
+    TCG_REG_XMM0,
+    TCG_REG_XMM1,
+    TCG_REG_XMM2,
+    TCG_REG_XMM3,
+    TCG_REG_XMM4,
+    TCG_REG_XMM5,
+    TCG_REG_XMM6,
+    TCG_REG_XMM7,
+
     TCG_REG_RAX = TCG_REG_EAX,
     TCG_REG_RCX = TCG_REG_ECX,
     TCG_REG_RDX = TCG_REG_EDX,
@@ -79,6 +91,17 @@ extern bool have_bmi1;
 extern bool have_bmi2;
 extern bool have_popcnt;
 
+#ifdef __SSE2__
+#define have_sse2  true
+#else
+extern bool have_sse2;
+#endif
+#ifdef __AVX2__
+#define have_avx2  true
+#else
+extern bool have_avx2;
+#endif
+
 /* optional instructions */
 #define TCG_TARGET_HAS_div2_i32         1
 #define TCG_TARGET_HAS_rot_i32          1
@@ -147,6 +170,25 @@ extern bool have_popcnt;
 #define TCG_TARGET_HAS_mulsh_i64        0
 #endif
 
+#define TCG_TARGET_HAS_v64              have_sse2
+#define TCG_TARGET_HAS_v128             have_sse2
+#define TCG_TARGET_HAS_v256             have_avx2
+
+#define TCG_TARGET_HAS_andc_v64         TCG_TARGET_HAS_v64
+#define TCG_TARGET_HAS_orc_v64          0
+#define TCG_TARGET_HAS_not_v64          0
+#define TCG_TARGET_HAS_neg_v64          0
+
+#define TCG_TARGET_HAS_andc_v128        TCG_TARGET_HAS_v128
+#define TCG_TARGET_HAS_orc_v128         0
+#define TCG_TARGET_HAS_not_v128         0
+#define TCG_TARGET_HAS_neg_v128         0
+
+#define TCG_TARGET_HAS_andc_v256        TCG_TARGET_HAS_v256
+#define TCG_TARGET_HAS_orc_v256         0
+#define TCG_TARGET_HAS_not_v256         0
+#define TCG_TARGET_HAS_neg_v256         0
+
 #define TCG_TARGET_deposit_i32_valid(ofs, len) \
     (have_bmi2 ||                              \
      ((ofs) == 0 && (len) == 8) ||             \
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index b1445a4c24..b84cd584fb 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -212,13 +212,13 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
 /* Host integer vector operations.  */
 /* These opcodes are required whenever the base vector size is enabled.  */
 
-DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
-DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
-DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(mov_v64, 1, 1, 0, TCG_OPF_NOT_PRESENT)
+DEF(mov_v128, 1, 1, 0, TCG_OPF_NOT_PRESENT)
+DEF(mov_v256, 1, 1, 0, TCG_OPF_NOT_PRESENT)
 
-DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
-DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
-DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
+DEF(movi_v64, 1, 0, 1, TCG_OPF_NOT_PRESENT)
+DEF(movi_v128, 1, 0, 1, TCG_OPF_NOT_PRESENT)
+DEF(movi_v256, 1, 0, 1, TCG_OPF_NOT_PRESENT)
 
 DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
 DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index aeefb72aa0..0e01b54aa0 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -31,7 +31,9 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
     "%r8",  "%r9",  "%r10", "%r11", "%r12", "%r13", "%r14", "%r15",
 #else
     "%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi",
+    NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 #endif
+    "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "%xmm7",
 };
 #endif
 
@@ -61,6 +63,14 @@ static const int tcg_target_reg_alloc_order[] = {
     TCG_REG_EDX,
     TCG_REG_EAX,
 #endif
+    TCG_REG_XMM0,
+    TCG_REG_XMM1,
+    TCG_REG_XMM2,
+    TCG_REG_XMM3,
+    TCG_REG_XMM4,
+    TCG_REG_XMM5,
+    TCG_REG_XMM6,
+    TCG_REG_XMM7,
 };
 
 static const int tcg_target_call_iarg_regs[] = {
@@ -94,7 +104,7 @@ static const int tcg_target_call_oarg_regs[] = {
 #define TCG_CT_CONST_I32 0x400
 #define TCG_CT_CONST_WSZ 0x800
 
-/* Registers used with L constraint, which are the first argument 
+/* Registers used with L constraint, which are the first argument
    registers on x86_64, and two random call clobbered registers on
    i386. */
 #if TCG_TARGET_REG_BITS == 64
@@ -127,6 +137,16 @@ bool have_bmi1;
 bool have_bmi2;
 bool have_popcnt;
 
+#ifndef have_sse2
+bool have_sse2;
+#endif
+#ifdef have_avx2
+#define have_avx1  have_avx2
+#else
+static bool have_avx1;
+bool have_avx2;
+#endif
+
 #ifdef CONFIG_CPUID_H
 static bool have_movbe;
 static bool have_lzcnt;
@@ -215,6 +235,10 @@ static const char *target_parse_constraint(TCGArgConstraint *ct,
         /* With TZCNT/LZCNT, we can have operand-size as an input.  */
         ct->ct |= TCG_CT_CONST_WSZ;
         break;
+    case 'x':
+        ct->ct |= TCG_CT_REG;
+        tcg_regset_set32(ct->u.regs, 0, 0xff0000);
+        break;
 
         /* qemu_ld/st address constraint */
     case 'L':
@@ -292,6 +316,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #endif
 #define P_SIMDF3        0x20000         /* 0xf3 opcode prefix */
 #define P_SIMDF2        0x40000         /* 0xf2 opcode prefix */
+#define P_VEXL          0x80000         /* Set VEX.L = 1 */
 
 #define OPC_ARITH_EvIz	(0x81)
 #define OPC_ARITH_EvIb	(0x83)
@@ -324,13 +349,31 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define OPC_MOVL_Iv     (0xb8)
 #define OPC_MOVBE_GyMy  (0xf0 | P_EXT38)
 #define OPC_MOVBE_MyGy  (0xf1 | P_EXT38)
+#define OPC_MOVDQA_GyMy (0x6f | P_EXT | P_DATA16)
+#define OPC_MOVDQA_MyGy (0x7f | P_EXT | P_DATA16)
+#define OPC_MOVDQU_GyMy (0x6f | P_EXT | P_SIMDF3)
+#define OPC_MOVDQU_MyGy (0x7f | P_EXT | P_SIMDF3)
+#define OPC_MOVQ_GyMy   (0x7e | P_EXT | P_SIMDF3)
+#define OPC_MOVQ_MyGy   (0xd6 | P_EXT | P_DATA16)
 #define OPC_MOVSBL	(0xbe | P_EXT)
 #define OPC_MOVSWL	(0xbf | P_EXT)
 #define OPC_MOVSLQ	(0x63 | P_REXW)
 #define OPC_MOVZBL	(0xb6 | P_EXT)
 #define OPC_MOVZWL	(0xb7 | P_EXT)
+#define OPC_PADDB       (0xfc | P_EXT | P_DATA16)
+#define OPC_PADDW       (0xfd | P_EXT | P_DATA16)
+#define OPC_PADDD       (0xfe | P_EXT | P_DATA16)
+#define OPC_PADDQ       (0xd4 | P_EXT | P_DATA16)
+#define OPC_PAND        (0xdb | P_EXT | P_DATA16)
+#define OPC_PANDN       (0xdf | P_EXT | P_DATA16)
 #define OPC_PDEP        (0xf5 | P_EXT38 | P_SIMDF2)
 #define OPC_PEXT        (0xf5 | P_EXT38 | P_SIMDF3)
+#define OPC_POR         (0xeb | P_EXT | P_DATA16)
+#define OPC_PSUBB       (0xf8 | P_EXT | P_DATA16)
+#define OPC_PSUBW       (0xf9 | P_EXT | P_DATA16)
+#define OPC_PSUBD       (0xfa | P_EXT | P_DATA16)
+#define OPC_PSUBQ       (0xfb | P_EXT | P_DATA16)
+#define OPC_PXOR        (0xef | P_EXT | P_DATA16)
 #define OPC_POP_r32	(0x58)
 #define OPC_POPCNT      (0xb8 | P_EXT | P_SIMDF3)
 #define OPC_PUSH_r32	(0x50)
@@ -500,7 +543,8 @@ static void tcg_out_modrm(TCGContext *s, int opc, int r, int rm)
     tcg_out8(s, 0xc0 | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
 }
 
-static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
+static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v,
+                                int rm, int index)
 {
     int tmp;
 
@@ -515,14 +559,16 @@ static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
     } else if (opc & P_EXT) {
         tmp = 1;
     } else {
-        tcg_abort();
+        g_assert_not_reached();
     }
-    tmp |= 0x40;                           /* VEX.X */
     tmp |= (r & 8 ? 0 : 0x80);             /* VEX.R */
+    tmp |= (index & 8 ? 0 : 0x40);         /* VEX.X */
     tmp |= (rm & 8 ? 0 : 0x20);            /* VEX.B */
     tcg_out8(s, tmp);
 
     tmp = (opc & P_REXW ? 0x80 : 0);       /* VEX.W */
+    tmp |= (opc & P_VEXL ? 0x04 : 0);      /* VEX.L */
+
     /* VEX.pp */
     if (opc & P_DATA16) {
         tmp |= 1;                          /* 0x66 */
@@ -538,7 +584,7 @@ static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
 
 static void tcg_out_vex_modrm(TCGContext *s, int opc, int r, int v, int rm)
 {
-    tcg_out_vex_pfx_opc(s, opc, r, v, rm);
+    tcg_out_vex_pfx_opc(s, opc, r, v, rm, 0);
     tcg_out8(s, 0xc0 | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
 }
 
@@ -565,7 +611,7 @@ static void tcg_out_opc_pool_imm(TCGContext *s, int opc, int r,
 static void tcg_out_vex_pool_imm(TCGContext *s, int opc, int r, int v,
                                  tcg_target_ulong data)
 {
-    tcg_out_vex_pfx_opc(s, opc, r, v, 0);
+    tcg_out_vex_pfx_opc(s, opc, r, v, 0, 0);
     tcg_out_sfx_pool_imm(s, r, data);
 }
 
@@ -574,8 +620,8 @@ static void tcg_out_vex_pool_imm(TCGContext *s, int opc, int r, int v,
    mode for absolute addresses, ~RM is the size of the immediate operand
    that will follow the instruction.  */
 
-static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
-                                     int index, int shift, intptr_t offset)
+static void tcg_out_sib_offset(TCGContext *s, int r, int rm, int index,
+                               int shift, intptr_t offset)
 {
     int mod, len;
 
@@ -586,7 +632,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
             intptr_t pc = (intptr_t)s->code_ptr + 5 + ~rm;
             intptr_t disp = offset - pc;
             if (disp == (int32_t)disp) {
-                tcg_out_opc(s, opc, r, 0, 0);
                 tcg_out8(s, (LOWREGMASK(r) << 3) | 5);
                 tcg_out32(s, disp);
                 return;
@@ -596,7 +641,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
                use of the MODRM+SIB encoding and is therefore larger than
                rip-relative addressing.  */
             if (offset == (int32_t)offset) {
-                tcg_out_opc(s, opc, r, 0, 0);
                 tcg_out8(s, (LOWREGMASK(r) << 3) | 4);
                 tcg_out8(s, (4 << 3) | 5);
                 tcg_out32(s, offset);
@@ -604,10 +648,9 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
             }
 
             /* ??? The memory isn't directly addressable.  */
-            tcg_abort();
+            g_assert_not_reached();
         } else {
             /* Absolute address.  */
-            tcg_out_opc(s, opc, r, 0, 0);
             tcg_out8(s, (r << 3) | 5);
             tcg_out32(s, offset);
             return;
@@ -630,7 +673,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
        that would be used for %esp is the escape to the two byte form.  */
     if (index < 0 && LOWREGMASK(rm) != TCG_REG_ESP) {
         /* Single byte MODRM format.  */
-        tcg_out_opc(s, opc, r, rm, 0);
         tcg_out8(s, mod | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
     } else {
         /* Two byte MODRM+SIB format.  */
@@ -644,7 +686,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
             tcg_debug_assert(index != TCG_REG_ESP);
         }
 
-        tcg_out_opc(s, opc, r, rm, index);
         tcg_out8(s, mod | (LOWREGMASK(r) << 3) | 4);
         tcg_out8(s, (shift << 6) | (LOWREGMASK(index) << 3) | LOWREGMASK(rm));
     }
@@ -656,6 +697,21 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
     }
 }
 
+static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
+                                     int index, int shift, intptr_t offset)
+{
+    tcg_out_opc(s, opc, r, rm < 0 ? 0 : rm, index < 0 ? 0 : index);
+    tcg_out_sib_offset(s, r, rm, index, shift, offset);
+}
+
+static void tcg_out_vex_modrm_sib_offset(TCGContext *s, int opc, int r, int v,
+                                         int rm, int index, int shift,
+                                         intptr_t offset)
+{
+    tcg_out_vex_pfx_opc(s, opc, r, v, rm < 0 ? 0 : rm, index < 0 ? 0 : index);
+    tcg_out_sib_offset(s, r, rm, index, shift, offset);
+}
+
 /* A simplification of the above with no index or shift.  */
 static inline void tcg_out_modrm_offset(TCGContext *s, int opc, int r,
                                         int rm, intptr_t offset)
@@ -663,6 +719,31 @@ static inline void tcg_out_modrm_offset(TCGContext *s, int opc, int r,
     tcg_out_modrm_sib_offset(s, opc, r, rm, -1, 0, offset);
 }
 
+static inline void tcg_out_vex_modrm_offset(TCGContext *s, int opc, int r,
+                                            int v, int rm, intptr_t offset)
+{
+    tcg_out_vex_modrm_sib_offset(s, opc, r, v, rm, -1, 0, offset);
+}
+
+static void tcg_out_maybe_vex_modrm(TCGContext *s, int opc, int r, int rm)
+{
+    if (have_avx1) {
+        tcg_out_vex_modrm(s, opc, r, 0, rm);
+    } else {
+        tcg_out_modrm(s, opc, r, rm);
+    }
+}
+
+static void tcg_out_maybe_vex_modrm_offset(TCGContext *s, int opc, int r,
+                                           int rm, intptr_t offset)
+{
+    if (have_avx1) {
+        tcg_out_vex_modrm_offset(s, opc, r, 0, rm, offset);
+    } else {
+        tcg_out_modrm_offset(s, opc, r, rm, offset);
+    }
+}
+
 /* Generate dest op= src.  Uses the same ARITH_* codes as tgen_arithi.  */
 static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
 {
@@ -673,12 +754,32 @@ static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
     tcg_out_modrm(s, OPC_ARITH_GvEv + (subop << 3) + ext, dest, src);
 }
 
-static inline void tcg_out_mov(TCGContext *s, TCGType type,
-                               TCGReg ret, TCGReg arg)
+static void tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
 {
     if (arg != ret) {
-        int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
-        tcg_out_modrm(s, opc, ret, arg);
+        int opc = 0;
+
+        switch (type) {
+        case TCG_TYPE_I64:
+            opc = P_REXW;
+            /* fallthru */
+        case TCG_TYPE_I32:
+            opc |= OPC_MOVL_GvEv;
+            tcg_out_modrm(s, opc, ret, arg);
+            break;
+
+        case TCG_TYPE_V256:
+            opc = P_VEXL;
+            /* fallthru */
+        case TCG_TYPE_V128:
+        case TCG_TYPE_V64:
+            opc |= OPC_MOVDQA_GyMy;
+            tcg_out_maybe_vex_modrm(s, opc, ret, arg);
+            break;
+
+        default:
+            g_assert_not_reached();
+        }
     }
 }
 
@@ -687,6 +788,27 @@ static void tcg_out_movi(TCGContext *s, TCGType type,
 {
     tcg_target_long diff;
 
+    switch (type) {
+    case TCG_TYPE_I32:
+    case TCG_TYPE_I64:
+        break;
+
+    case TCG_TYPE_V64:
+    case TCG_TYPE_V128:
+    case TCG_TYPE_V256:
+        /* ??? Revisit this as the implementation progresses.  */
+        tcg_debug_assert(arg == 0);
+        if (have_avx1) {
+            tcg_out_vex_modrm(s, OPC_PXOR, ret, ret, ret);
+        } else {
+            tcg_out_modrm(s, OPC_PXOR, ret, ret);
+        }
+        return;
+
+    default:
+        g_assert_not_reached();
+    }
+
     if (arg == 0) {
         tgen_arithr(s, ARITH_XOR, ret, ret);
         return;
@@ -750,18 +872,54 @@ static inline void tcg_out_pop(TCGContext *s, int reg)
     tcg_out_opc(s, OPC_POP_r32 + LOWREGMASK(reg), 0, reg, 0);
 }
 
-static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
-                              TCGReg arg1, intptr_t arg2)
+static void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
+                       TCGReg arg1, intptr_t arg2)
 {
-    int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
-    tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
+    switch (type) {
+    case TCG_TYPE_I64:
+        tcg_out_modrm_offset(s, OPC_MOVL_GvEv | P_REXW, ret, arg1, arg2);
+        break;
+    case TCG_TYPE_I32:
+        tcg_out_modrm_offset(s, OPC_MOVL_GvEv, ret, arg1, arg2);
+        break;
+    case TCG_TYPE_V64:
+        tcg_out_maybe_vex_modrm_offset(s, OPC_MOVQ_GyMy, ret, arg1, arg2);
+        break;
+    case TCG_TYPE_V128:
+        tcg_out_maybe_vex_modrm_offset(s, OPC_MOVDQU_GyMy, ret, arg1, arg2);
+        break;
+    case TCG_TYPE_V256:
+        tcg_out_vex_modrm_offset(s, OPC_MOVDQU_GyMy | P_VEXL,
+                                 ret, 0, arg1, arg2);
+        break;
+    default:
+        g_assert_not_reached();
+    }
 }
 
-static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
-                              TCGReg arg1, intptr_t arg2)
+static void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
+                       TCGReg arg1, intptr_t arg2)
 {
-    int opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
-    tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
+    switch (type) {
+    case TCG_TYPE_I64:
+        tcg_out_modrm_offset(s, OPC_MOVL_EvGv | P_REXW, arg, arg1, arg2);
+        break;
+    case TCG_TYPE_I32:
+        tcg_out_modrm_offset(s, OPC_MOVL_EvGv, arg, arg1, arg2);
+        break;
+    case TCG_TYPE_V64:
+        tcg_out_maybe_vex_modrm_offset(s, OPC_MOVQ_MyGy, arg, arg1, arg2);
+        break;
+    case TCG_TYPE_V128:
+        tcg_out_maybe_vex_modrm_offset(s, OPC_MOVDQU_MyGy, arg, arg1, arg2);
+        break;
+    case TCG_TYPE_V256:
+        tcg_out_vex_modrm_offset(s, OPC_MOVDQU_MyGy | P_VEXL,
+                                 arg, 0, arg1, arg2);
+        break;
+    default:
+        g_assert_not_reached();
+    }
 }
 
 static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
@@ -773,6 +931,8 @@ static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
             return false;
         }
         rexw = P_REXW;
+    } else if (type != TCG_TYPE_I32) {
+        return false;
     }
     tcg_out_modrm_offset(s, OPC_MOVL_EvIz | rexw, 0, base, ofs);
     tcg_out32(s, val);
@@ -1914,6 +2074,15 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         case glue(glue(INDEX_op_, x), _i32)
 #endif
 
+#define OP_128_256(x) \
+        case glue(glue(INDEX_op_, x), _v256): \
+            rexw = P_VEXL; /* FALLTHRU */     \
+        case glue(glue(INDEX_op_, x), _v128)
+
+#define OP_64_128_256(x) \
+        OP_128_256(x):   \
+        case glue(glue(INDEX_op_, x), _v64)
+
     /* Hoist the loads of the most common arguments.  */
     a0 = args[0];
     a1 = args[1];
@@ -2379,19 +2548,94 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         }
         break;
 
+    OP_64_128_256(add8):
+        c = OPC_PADDB;
+        goto gen_simd;
+    OP_64_128_256(add16):
+        c = OPC_PADDW;
+        goto gen_simd;
+    OP_64_128_256(add32):
+        c = OPC_PADDD;
+        goto gen_simd;
+    OP_128_256(add64):
+        c = OPC_PADDQ;
+        goto gen_simd;
+    OP_64_128_256(sub8):
+        c = OPC_PSUBB;
+        goto gen_simd;
+    OP_64_128_256(sub16):
+        c = OPC_PSUBW;
+        goto gen_simd;
+    OP_64_128_256(sub32):
+        c = OPC_PSUBD;
+        goto gen_simd;
+    OP_128_256(sub64):
+        c = OPC_PSUBQ;
+        goto gen_simd;
+    OP_64_128_256(and):
+        c = OPC_PAND;
+        goto gen_simd;
+    OP_64_128_256(andc):
+        c = OPC_PANDN;
+        goto gen_simd;
+    OP_64_128_256(or):
+        c = OPC_POR;
+        goto gen_simd;
+    OP_64_128_256(xor):
+        c = OPC_PXOR;
+    gen_simd:
+        if (have_avx1) {
+            tcg_out_vex_modrm(s, c | rexw, a0, a1, a2);
+        } else {
+            tcg_out_modrm(s, c, a0, a2);
+        }
+        break;
+
+    case INDEX_op_ld_v64:
+        c = TCG_TYPE_V64;
+        goto gen_simd_ld;
+    case INDEX_op_ld_v128:
+        c = TCG_TYPE_V128;
+        goto gen_simd_ld;
+    case INDEX_op_ld_v256:
+        c = TCG_TYPE_V256;
+    gen_simd_ld:
+        tcg_out_ld(s, c, a0, a1, a2);
+        break;
+
+    case INDEX_op_st_v64:
+        c = TCG_TYPE_V64;
+        goto gen_simd_st;
+    case INDEX_op_st_v128:
+        c = TCG_TYPE_V128;
+        goto gen_simd_st;
+    case INDEX_op_st_v256:
+        c = TCG_TYPE_V256;
+    gen_simd_st:
+        tcg_out_st(s, c, a0, a1, a2);
+        break;
+
     case INDEX_op_mb:
         tcg_out_mb(s, a0);
         break;
     case INDEX_op_mov_i32:  /* Always emitted via tcg_out_mov.  */
     case INDEX_op_mov_i64:
+    case INDEX_op_mov_v64:
+    case INDEX_op_mov_v128:
+    case INDEX_op_mov_v256:
     case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi.  */
     case INDEX_op_movi_i64:
+    case INDEX_op_movi_v64:
+    case INDEX_op_movi_v128:
+    case INDEX_op_movi_v256:
     case INDEX_op_call:     /* Always emitted via tcg_out_call.  */
     default:
         tcg_abort();
     }
 
 #undef OP_32_64
+#undef OP_128_256
+#undef OP_64_128_256
 }
 
 static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
@@ -2417,6 +2661,9 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
         = { .args_ct_str = { "r", "r", "L", "L" } };
     static const TCGTargetOpDef L_L_L_L
         = { .args_ct_str = { "L", "L", "L", "L" } };
+    static const TCGTargetOpDef x_0_x = { .args_ct_str = { "x", "0", "x" } };
+    static const TCGTargetOpDef x_x_x = { .args_ct_str = { "x", "x", "x" } };
+    static const TCGTargetOpDef x_r = { .args_ct_str = { "x", "r" } };
 
     switch (op) {
     case INDEX_op_goto_ptr:
@@ -2620,6 +2867,52 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
             return &s2;
         }
 
+    case INDEX_op_ld_v64:
+    case INDEX_op_ld_v128:
+    case INDEX_op_ld_v256:
+    case INDEX_op_st_v64:
+    case INDEX_op_st_v128:
+    case INDEX_op_st_v256:
+        return &x_r;
+
+    case INDEX_op_add8_v64:
+    case INDEX_op_add8_v128:
+    case INDEX_op_add16_v64:
+    case INDEX_op_add16_v128:
+    case INDEX_op_add32_v64:
+    case INDEX_op_add32_v128:
+    case INDEX_op_add64_v128:
+    case INDEX_op_sub8_v64:
+    case INDEX_op_sub8_v128:
+    case INDEX_op_sub16_v64:
+    case INDEX_op_sub16_v128:
+    case INDEX_op_sub32_v64:
+    case INDEX_op_sub32_v128:
+    case INDEX_op_sub64_v128:
+    case INDEX_op_and_v64:
+    case INDEX_op_and_v128:
+    case INDEX_op_andc_v64:
+    case INDEX_op_andc_v128:
+    case INDEX_op_or_v64:
+    case INDEX_op_or_v128:
+    case INDEX_op_xor_v64:
+    case INDEX_op_xor_v128:
+        return have_avx1 ? &x_x_x : &x_0_x;
+
+    case INDEX_op_add8_v256:
+    case INDEX_op_add16_v256:
+    case INDEX_op_add32_v256:
+    case INDEX_op_add64_v256:
+    case INDEX_op_sub8_v256:
+    case INDEX_op_sub16_v256:
+    case INDEX_op_sub32_v256:
+    case INDEX_op_sub64_v256:
+    case INDEX_op_and_v256:
+    case INDEX_op_andc_v256:
+    case INDEX_op_or_v256:
+    case INDEX_op_xor_v256:
+        return &x_x_x;
+
     default:
         break;
     }
@@ -2725,9 +3018,16 @@ static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
 static void tcg_target_init(TCGContext *s)
 {
 #ifdef CONFIG_CPUID_H
-    unsigned a, b, c, d;
+    unsigned a, b, c, d, b7 = 0;
     int max = __get_cpuid_max(0, 0);
 
+    if (max >= 7) {
+        /* BMI1 is available on AMD Piledriver and Intel Haswell CPUs.  */
+        __cpuid_count(7, 0, a, b7, c, d);
+        have_bmi1 = (b7 & bit_BMI) != 0;
+        have_bmi2 = (b7 & bit_BMI2) != 0;
+    }
+
     if (max >= 1) {
         __cpuid(1, a, b, c, d);
 #ifndef have_cmov
@@ -2736,17 +3036,26 @@ static void tcg_target_init(TCGContext *s)
            available, we'll use a small forward branch.  */
         have_cmov = (d & bit_CMOV) != 0;
 #endif
+#ifndef have_sse2
+        have_sse2 = (d & bit_SSE2) != 0;
+#endif
         /* MOVBE is only available on Intel Atom and Haswell CPUs, so we
            need to probe for it.  */
         have_movbe = (c & bit_MOVBE) != 0;
         have_popcnt = (c & bit_POPCNT) != 0;
-    }
 
-    if (max >= 7) {
-        /* BMI1 is available on AMD Piledriver and Intel Haswell CPUs.  */
-        __cpuid_count(7, 0, a, b, c, d);
-        have_bmi1 = (b & bit_BMI) != 0;
-        have_bmi2 = (b & bit_BMI2) != 0;
+#ifndef have_avx2
+        /* There are a number of things we must check before we can be
+           sure of not hitting invalid opcode.  */
+        if (c & bit_OSXSAVE) {
+            unsigned xcrl, xcrh;
+            asm ("xgetbv" : "=a" (xcrl), "=d" (xcrh) : "c" (0));
+            if ((xcrl & 6) == 6) {
+                have_avx1 = (c & bit_AVX) != 0;
+                have_avx2 = (b7 & bit_AVX2) != 0;
+            }
+        }
+#endif
     }
 
     max = __get_cpuid_max(0x8000000, 0);
@@ -2763,6 +3072,13 @@ static void tcg_target_init(TCGContext *s)
     } else {
         tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xff);
     }
+    if (have_sse2) {
+        tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V64], 0, 0xff0000);
+        tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V128], 0, 0xff0000);
+    }
+    if (have_avx2) {
+        tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V256], 0, 0xff0000);
+    }
 
     tcg_regset_clear(tcg_target_call_clobber_regs);
     tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_EAX);
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported Richard Henderson
@ 2017-08-17 23:44   ` Philippe Mathieu-Daudé
  2017-09-07 19:02   ` Alex Bennée
  1 sibling, 0 replies; 36+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-08-17 23:44 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: qemu-arm, alex.bennee

On 08/17/2017 08:01 PM, Richard Henderson wrote:
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

> ---
>   tcg/tcg.h |   2 +
>   tcg/tcg.c | 310 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 312 insertions(+)
> 
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index b9e15da13b..b443143b21 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -962,6 +962,8 @@ do {\
>   #define tcg_temp_free_ptr(T) tcg_temp_free_i64(TCGV_PTR_TO_NAT(T))
>   #endif
>   
> +bool tcg_op_supported(TCGOpcode op);
> +
>   void tcg_gen_callN(TCGContext *s, void *func,
>                      TCGArg ret, int nargs, TCGArg *args);
>   
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index ea78d47fad..3c3cdda938 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -751,6 +751,316 @@ int tcg_check_temp_count(void)
>   }
>   #endif
>   
> +/* Return true if OP may appear in the opcode stream.
> +   Test the runtime variable that controls each opcode.  */
> +bool tcg_op_supported(TCGOpcode op)
> +{
> +    switch (op) {
> +    case INDEX_op_discard:
> +    case INDEX_op_set_label:
> +    case INDEX_op_call:
> +    case INDEX_op_br:
> +    case INDEX_op_mb:
> +    case INDEX_op_insn_start:
> +    case INDEX_op_exit_tb:
> +    case INDEX_op_goto_tb:
> +    case INDEX_op_qemu_ld_i32:
> +    case INDEX_op_qemu_st_i32:
> +    case INDEX_op_qemu_ld_i64:
> +    case INDEX_op_qemu_st_i64:
> +        return true;
> +
> +    case INDEX_op_goto_ptr:
> +        return TCG_TARGET_HAS_goto_ptr;
> +
> +    case INDEX_op_mov_i32:
> +    case INDEX_op_movi_i32:
> +    case INDEX_op_setcond_i32:
> +    case INDEX_op_brcond_i32:
> +    case INDEX_op_ld8u_i32:
> +    case INDEX_op_ld8s_i32:
> +    case INDEX_op_ld16u_i32:
> +    case INDEX_op_ld16s_i32:
> +    case INDEX_op_ld_i32:
> +    case INDEX_op_st8_i32:
> +    case INDEX_op_st16_i32:
> +    case INDEX_op_st_i32:
> +    case INDEX_op_add_i32:
> +    case INDEX_op_sub_i32:
> +    case INDEX_op_mul_i32:
> +    case INDEX_op_and_i32:
> +    case INDEX_op_or_i32:
> +    case INDEX_op_xor_i32:
> +    case INDEX_op_shl_i32:
> +    case INDEX_op_shr_i32:
> +    case INDEX_op_sar_i32:
> +        return true;
> +
> +    case INDEX_op_movcond_i32:
> +        return TCG_TARGET_HAS_movcond_i32;
> +    case INDEX_op_div_i32:
> +    case INDEX_op_divu_i32:
> +        return TCG_TARGET_HAS_div_i32;
> +    case INDEX_op_rem_i32:
> +    case INDEX_op_remu_i32:
> +        return TCG_TARGET_HAS_rem_i32;
> +    case INDEX_op_div2_i32:
> +    case INDEX_op_divu2_i32:
> +        return TCG_TARGET_HAS_div2_i32;
> +    case INDEX_op_rotl_i32:
> +    case INDEX_op_rotr_i32:
> +        return TCG_TARGET_HAS_rot_i32;
> +    case INDEX_op_deposit_i32:
> +        return TCG_TARGET_HAS_deposit_i32;
> +    case INDEX_op_extract_i32:
> +        return TCG_TARGET_HAS_extract_i32;
> +    case INDEX_op_sextract_i32:
> +        return TCG_TARGET_HAS_sextract_i32;
> +    case INDEX_op_add2_i32:
> +        return TCG_TARGET_HAS_add2_i32;
> +    case INDEX_op_sub2_i32:
> +        return TCG_TARGET_HAS_sub2_i32;
> +    case INDEX_op_mulu2_i32:
> +        return TCG_TARGET_HAS_mulu2_i32;
> +    case INDEX_op_muls2_i32:
> +        return TCG_TARGET_HAS_muls2_i32;
> +    case INDEX_op_muluh_i32:
> +        return TCG_TARGET_HAS_muluh_i32;
> +    case INDEX_op_mulsh_i32:
> +        return TCG_TARGET_HAS_mulsh_i32;
> +    case INDEX_op_ext8s_i32:
> +        return TCG_TARGET_HAS_ext8s_i32;
> +    case INDEX_op_ext16s_i32:
> +        return TCG_TARGET_HAS_ext16s_i32;
> +    case INDEX_op_ext8u_i32:
> +        return TCG_TARGET_HAS_ext8u_i32;
> +    case INDEX_op_ext16u_i32:
> +        return TCG_TARGET_HAS_ext16u_i32;
> +    case INDEX_op_bswap16_i32:
> +        return TCG_TARGET_HAS_bswap16_i32;
> +    case INDEX_op_bswap32_i32:
> +        return TCG_TARGET_HAS_bswap32_i32;
> +    case INDEX_op_not_i32:
> +        return TCG_TARGET_HAS_not_i32;
> +    case INDEX_op_neg_i32:
> +        return TCG_TARGET_HAS_neg_i32;
> +    case INDEX_op_andc_i32:
> +        return TCG_TARGET_HAS_andc_i32;
> +    case INDEX_op_orc_i32:
> +        return TCG_TARGET_HAS_orc_i32;
> +    case INDEX_op_eqv_i32:
> +        return TCG_TARGET_HAS_eqv_i32;
> +    case INDEX_op_nand_i32:
> +        return TCG_TARGET_HAS_nand_i32;
> +    case INDEX_op_nor_i32:
> +        return TCG_TARGET_HAS_nor_i32;
> +    case INDEX_op_clz_i32:
> +        return TCG_TARGET_HAS_clz_i32;
> +    case INDEX_op_ctz_i32:
> +        return TCG_TARGET_HAS_ctz_i32;
> +    case INDEX_op_ctpop_i32:
> +        return TCG_TARGET_HAS_ctpop_i32;
> +
> +    case INDEX_op_brcond2_i32:
> +    case INDEX_op_setcond2_i32:
> +        return TCG_TARGET_REG_BITS == 32;
> +
> +    case INDEX_op_mov_i64:
> +    case INDEX_op_movi_i64:
> +    case INDEX_op_setcond_i64:
> +    case INDEX_op_brcond_i64:
> +    case INDEX_op_ld8u_i64:
> +    case INDEX_op_ld8s_i64:
> +    case INDEX_op_ld16u_i64:
> +    case INDEX_op_ld16s_i64:
> +    case INDEX_op_ld32u_i64:
> +    case INDEX_op_ld32s_i64:
> +    case INDEX_op_ld_i64:
> +    case INDEX_op_st8_i64:
> +    case INDEX_op_st16_i64:
> +    case INDEX_op_st32_i64:
> +    case INDEX_op_st_i64:
> +    case INDEX_op_add_i64:
> +    case INDEX_op_sub_i64:
> +    case INDEX_op_mul_i64:
> +    case INDEX_op_and_i64:
> +    case INDEX_op_or_i64:
> +    case INDEX_op_xor_i64:
> +    case INDEX_op_shl_i64:
> +    case INDEX_op_shr_i64:
> +    case INDEX_op_sar_i64:
> +    case INDEX_op_ext_i32_i64:
> +    case INDEX_op_extu_i32_i64:
> +        return TCG_TARGET_REG_BITS == 64;
> +
> +    case INDEX_op_movcond_i64:
> +        return TCG_TARGET_HAS_movcond_i64;
> +    case INDEX_op_div_i64:
> +    case INDEX_op_divu_i64:
> +        return TCG_TARGET_HAS_div_i64;
> +    case INDEX_op_rem_i64:
> +    case INDEX_op_remu_i64:
> +        return TCG_TARGET_HAS_rem_i64;
> +    case INDEX_op_div2_i64:
> +    case INDEX_op_divu2_i64:
> +        return TCG_TARGET_HAS_div2_i64;
> +    case INDEX_op_rotl_i64:
> +    case INDEX_op_rotr_i64:
> +        return TCG_TARGET_HAS_rot_i64;
> +    case INDEX_op_deposit_i64:
> +        return TCG_TARGET_HAS_deposit_i64;
> +    case INDEX_op_extract_i64:
> +        return TCG_TARGET_HAS_extract_i64;
> +    case INDEX_op_sextract_i64:
> +        return TCG_TARGET_HAS_sextract_i64;
> +    case INDEX_op_extrl_i64_i32:
> +        return TCG_TARGET_HAS_extrl_i64_i32;
> +    case INDEX_op_extrh_i64_i32:
> +        return TCG_TARGET_HAS_extrh_i64_i32;
> +    case INDEX_op_ext8s_i64:
> +        return TCG_TARGET_HAS_ext8s_i64;
> +    case INDEX_op_ext16s_i64:
> +        return TCG_TARGET_HAS_ext16s_i64;
> +    case INDEX_op_ext32s_i64:
> +        return TCG_TARGET_HAS_ext32s_i64;
> +    case INDEX_op_ext8u_i64:
> +        return TCG_TARGET_HAS_ext8u_i64;
> +    case INDEX_op_ext16u_i64:
> +        return TCG_TARGET_HAS_ext16u_i64;
> +    case INDEX_op_ext32u_i64:
> +        return TCG_TARGET_HAS_ext32u_i64;
> +    case INDEX_op_bswap16_i64:
> +        return TCG_TARGET_HAS_bswap16_i64;
> +    case INDEX_op_bswap32_i64:
> +        return TCG_TARGET_HAS_bswap32_i64;
> +    case INDEX_op_bswap64_i64:
> +        return TCG_TARGET_HAS_bswap64_i64;
> +    case INDEX_op_not_i64:
> +        return TCG_TARGET_HAS_not_i64;
> +    case INDEX_op_neg_i64:
> +        return TCG_TARGET_HAS_neg_i64;
> +    case INDEX_op_andc_i64:
> +        return TCG_TARGET_HAS_andc_i64;
> +    case INDEX_op_orc_i64:
> +        return TCG_TARGET_HAS_orc_i64;
> +    case INDEX_op_eqv_i64:
> +        return TCG_TARGET_HAS_eqv_i64;
> +    case INDEX_op_nand_i64:
> +        return TCG_TARGET_HAS_nand_i64;
> +    case INDEX_op_nor_i64:
> +        return TCG_TARGET_HAS_nor_i64;
> +    case INDEX_op_clz_i64:
> +        return TCG_TARGET_HAS_clz_i64;
> +    case INDEX_op_ctz_i64:
> +        return TCG_TARGET_HAS_ctz_i64;
> +    case INDEX_op_ctpop_i64:
> +        return TCG_TARGET_HAS_ctpop_i64;
> +    case INDEX_op_add2_i64:
> +        return TCG_TARGET_HAS_add2_i64;
> +    case INDEX_op_sub2_i64:
> +        return TCG_TARGET_HAS_sub2_i64;
> +    case INDEX_op_mulu2_i64:
> +        return TCG_TARGET_HAS_mulu2_i64;
> +    case INDEX_op_muls2_i64:
> +        return TCG_TARGET_HAS_muls2_i64;
> +    case INDEX_op_muluh_i64:
> +        return TCG_TARGET_HAS_muluh_i64;
> +    case INDEX_op_mulsh_i64:
> +        return TCG_TARGET_HAS_mulsh_i64;
> +
> +    case INDEX_op_mov_v64:
> +    case INDEX_op_movi_v64:
> +    case INDEX_op_ld_v64:
> +    case INDEX_op_st_v64:
> +    case INDEX_op_and_v64:
> +    case INDEX_op_or_v64:
> +    case INDEX_op_xor_v64:
> +    case INDEX_op_add8_v64:
> +    case INDEX_op_add16_v64:
> +    case INDEX_op_add32_v64:
> +    case INDEX_op_sub8_v64:
> +    case INDEX_op_sub16_v64:
> +    case INDEX_op_sub32_v64:
> +        return TCG_TARGET_HAS_v64;
> +
> +    case INDEX_op_mov_v128:
> +    case INDEX_op_movi_v128:
> +    case INDEX_op_ld_v128:
> +    case INDEX_op_st_v128:
> +    case INDEX_op_and_v128:
> +    case INDEX_op_or_v128:
> +    case INDEX_op_xor_v128:
> +    case INDEX_op_add8_v128:
> +    case INDEX_op_add16_v128:
> +    case INDEX_op_add32_v128:
> +    case INDEX_op_add64_v128:
> +    case INDEX_op_sub8_v128:
> +    case INDEX_op_sub16_v128:
> +    case INDEX_op_sub32_v128:
> +    case INDEX_op_sub64_v128:
> +        return TCG_TARGET_HAS_v128;
> +
> +    case INDEX_op_mov_v256:
> +    case INDEX_op_movi_v256:
> +    case INDEX_op_ld_v256:
> +    case INDEX_op_st_v256:
> +    case INDEX_op_and_v256:
> +    case INDEX_op_or_v256:
> +    case INDEX_op_xor_v256:
> +    case INDEX_op_add8_v256:
> +    case INDEX_op_add16_v256:
> +    case INDEX_op_add32_v256:
> +    case INDEX_op_add64_v256:
> +    case INDEX_op_sub8_v256:
> +    case INDEX_op_sub16_v256:
> +    case INDEX_op_sub32_v256:
> +    case INDEX_op_sub64_v256:
> +        return TCG_TARGET_HAS_v256;
> +
> +    case INDEX_op_not_v64:
> +        return TCG_TARGET_HAS_not_v64;
> +    case INDEX_op_not_v128:
> +        return TCG_TARGET_HAS_not_v128;
> +    case INDEX_op_not_v256:
> +        return TCG_TARGET_HAS_not_v256;
> +
> +    case INDEX_op_andc_v64:
> +        return TCG_TARGET_HAS_andc_v64;
> +    case INDEX_op_andc_v128:
> +        return TCG_TARGET_HAS_andc_v128;
> +    case INDEX_op_andc_v256:
> +        return TCG_TARGET_HAS_andc_v256;
> +
> +    case INDEX_op_orc_v64:
> +        return TCG_TARGET_HAS_orc_v64;
> +    case INDEX_op_orc_v128:
> +        return TCG_TARGET_HAS_orc_v128;
> +    case INDEX_op_orc_v256:
> +        return TCG_TARGET_HAS_orc_v256;
> +
> +    case INDEX_op_neg8_v64:
> +    case INDEX_op_neg16_v64:
> +    case INDEX_op_neg32_v64:
> +        return TCG_TARGET_HAS_neg_v64;
> +
> +    case INDEX_op_neg8_v128:
> +    case INDEX_op_neg16_v128:
> +    case INDEX_op_neg32_v128:
> +    case INDEX_op_neg64_v128:
> +        return TCG_TARGET_HAS_neg_v128;
> +
> +    case INDEX_op_neg8_v256:
> +    case INDEX_op_neg16_v256:
> +    case INDEX_op_neg32_v256:
> +    case INDEX_op_neg64_v256:
> +        return TCG_TARGET_HAS_neg_v256;
> +
> +    case NB_OPS:
> +        break;
> +    }
> +    g_assert_not_reached();
> +}
> +
>   /* Note: we convert the 64 bit args to 32 bit and do some alignment
>      and endian swap. Maybe it would be better to do the alignment
>      and endian swap in tcg_reg_alloc_call(). */
> 
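
As a minimal usage sketch (not from the patch itself), a front end could
consult this predicate before emitting one of the vector opcodes defined
above and fall back to scalar expansion otherwise:

    if (tcg_op_supported(INDEX_op_add32_v128)) {
        /* emit the 128-bit vector add directly */
    } else {
        /* expand the operation lane by lane with i32/i64 ops */
    }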


* Re: [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid Richard Henderson
@ 2017-08-17 23:45   ` Philippe Mathieu-Daudé
  2017-09-08  9:30   ` Alex Bennée
  1 sibling, 0 replies; 36+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-08-17 23:45 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: qemu-arm, alex.bennee

On 08/17/2017 08:01 PM, Richard Henderson wrote:
> Add with value 0 so that structure zero initialization can
> indicate that the field is not present.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

> ---
>   tcg/tcg-opc.h | 2 ++
>   tcg/tcg.c     | 3 +++
>   2 files changed, 5 insertions(+)
> 
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index 9162125fac..b1445a4c24 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -26,6 +26,8 @@
>    * DEF(name, oargs, iargs, cargs, flags)
>    */
>   
> +DEF(invalid, 0, 0, 0, TCG_OPF_NOT_PRESENT)
> +
>   /* predefined ops */
>   DEF(discard, 1, 0, 0, TCG_OPF_NOT_PRESENT)
>   DEF(set_label, 0, 0, 1, TCG_OPF_BB_END | TCG_OPF_NOT_PRESENT)
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 3c3cdda938..879b29e81f 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -756,6 +756,9 @@ int tcg_check_temp_count(void)
>   bool tcg_op_supported(TCGOpcode op)
>   {
>       switch (op) {
> +    case INDEX_op_invalid:
> +        return false;
> +
>       case INDEX_op_discard:
>       case INDEX_op_set_label:
>       case INDEX_op_call:
> 


* Re: [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors Richard Henderson
@ 2017-08-17 23:46   ` Philippe Mathieu-Daudé
  2017-09-07 18:18   ` Alex Bennée
  1 sibling, 0 replies; 36+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-08-17 23:46 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: qemu-arm, alex.bennee

On 08/17/2017 08:01 PM, Richard Henderson wrote:
> Nothing uses or enables them yet.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

> ---
>   tcg/tcg.h | 5 +++++
>   tcg/tcg.c | 2 +-
>   2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index dd97095af5..1277caed3d 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -256,6 +256,11 @@ typedef struct TCGPool {
>   typedef enum TCGType {
>       TCG_TYPE_I32,
>       TCG_TYPE_I64,
> +
> +    TCG_TYPE_V64,
> +    TCG_TYPE_V128,
> +    TCG_TYPE_V256,
> +
>       TCG_TYPE_COUNT, /* number of different types */
>   
>       /* An alias for the size of the host register.  */
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 787c8ba0f7..ea78d47fad 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -118,7 +118,7 @@ static TCGReg tcg_reg_alloc_new(TCGContext *s, TCGType t)
>   static bool tcg_out_ldst_finalize(TCGContext *s);
>   #endif
>   
> -static TCGRegSet tcg_target_available_regs[2];
> +static TCGRegSet tcg_target_available_regs[TCG_TYPE_COUNT];
>   static TCGRegSet tcg_target_call_clobber_regs;
>   
>   #if TCG_TARGET_INSN_UNIT_SIZE == 1
> 


* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations Richard Henderson
@ 2017-08-22 13:15   ` Alex Bennée
  2017-08-23 19:02     ` Richard Henderson
  2017-09-08 10:13   ` Alex Bennée
  1 sibling, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-08-22 13:15 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/i386/tcg-target.h     |  46 +++++-
>  tcg/tcg-opc.h             |  12 +-
>  tcg/i386/tcg-target.inc.c | 382 ++++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 399 insertions(+), 41 deletions(-)
>
> diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
> index e512648c95..147f82062b 100644
> --- a/tcg/i386/tcg-target.h
> +++ b/tcg/i386/tcg-target.h
> @@ -30,11 +30,10 @@
>
>  #ifdef __x86_64__
>  # define TCG_TARGET_REG_BITS  64
> -# define TCG_TARGET_NB_REGS   16
>  #else
>  # define TCG_TARGET_REG_BITS  32
> -# define TCG_TARGET_NB_REGS    8
>  #endif
> +# define TCG_TARGET_NB_REGS   24
>
>  typedef enum {
>      TCG_REG_EAX = 0,
> @@ -56,6 +55,19 @@ typedef enum {
>      TCG_REG_R13,
>      TCG_REG_R14,
>      TCG_REG_R15,
> +
> +    /* SSE registers; 64-bit has access to 8 more, but we won't
> +       need more than a few and using only the first 8 minimizes
> +       the need for a rex prefix on the sse instructions.  */
> +    TCG_REG_XMM0,
> +    TCG_REG_XMM1,
> +    TCG_REG_XMM2,
> +    TCG_REG_XMM3,
> +    TCG_REG_XMM4,
> +    TCG_REG_XMM5,
> +    TCG_REG_XMM6,
> +    TCG_REG_XMM7,
> +
>      TCG_REG_RAX = TCG_REG_EAX,
>      TCG_REG_RCX = TCG_REG_ECX,
>      TCG_REG_RDX = TCG_REG_EDX,
> @@ -79,6 +91,17 @@ extern bool have_bmi1;
>  extern bool have_bmi2;
>  extern bool have_popcnt;
>
> +#ifdef __SSE2__
> +#define have_sse2  true
> +#else
> +extern bool have_sse2;
> +#endif
> +#ifdef __AVX2__
> +#define have_avx2  true
> +#else
> +extern bool have_avx2;
> +#endif
> +
>  /* optional instructions */
>  #define TCG_TARGET_HAS_div2_i32         1
>  #define TCG_TARGET_HAS_rot_i32          1
> @@ -147,6 +170,25 @@ extern bool have_popcnt;
>  #define TCG_TARGET_HAS_mulsh_i64        0
>  #endif
>
> +#define TCG_TARGET_HAS_v64              have_sse2
> +#define TCG_TARGET_HAS_v128             have_sse2
> +#define TCG_TARGET_HAS_v256             have_avx2
> +
> +#define TCG_TARGET_HAS_andc_v64         TCG_TARGET_HAS_v64
> +#define TCG_TARGET_HAS_orc_v64          0
> +#define TCG_TARGET_HAS_not_v64          0
> +#define TCG_TARGET_HAS_neg_v64          0
> +
> +#define TCG_TARGET_HAS_andc_v128        TCG_TARGET_HAS_v128
> +#define TCG_TARGET_HAS_orc_v128         0
> +#define TCG_TARGET_HAS_not_v128         0
> +#define TCG_TARGET_HAS_neg_v128         0
> +
> +#define TCG_TARGET_HAS_andc_v256        TCG_TARGET_HAS_v256
> +#define TCG_TARGET_HAS_orc_v256         0
> +#define TCG_TARGET_HAS_not_v256         0
> +#define TCG_TARGET_HAS_neg_v256         0
> +
>  #define TCG_TARGET_deposit_i32_valid(ofs, len) \
>      (have_bmi2 ||                              \
>       ((ofs) == 0 && (len) == 8) ||             \
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index b1445a4c24..b84cd584fb 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -212,13 +212,13 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
>  /* Host integer vector operations.  */
>  /* These opcodes are required whenever the base vector size is enabled.  */
>
> -DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
> -DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
> -DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(mov_v64, 1, 1, 0, TCG_OPF_NOT_PRESENT)
> +DEF(mov_v128, 1, 1, 0, TCG_OPF_NOT_PRESENT)
> +DEF(mov_v256, 1, 1, 0, TCG_OPF_NOT_PRESENT)
>
> -DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
> -DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
> -DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
> +DEF(movi_v64, 1, 0, 1, TCG_OPF_NOT_PRESENT)
> +DEF(movi_v128, 1, 0, 1, TCG_OPF_NOT_PRESENT)
> +DEF(movi_v256, 1, 0, 1, TCG_OPF_NOT_PRESENT)
>
>  DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
>  DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
> diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
> index aeefb72aa0..0e01b54aa0 100644
> --- a/tcg/i386/tcg-target.inc.c
> +++ b/tcg/i386/tcg-target.inc.c
> @@ -31,7 +31,9 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
>      "%r8",  "%r9",  "%r10", "%r11", "%r12", "%r13", "%r14", "%r15",
>  #else
>      "%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi",
> +    NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
>  #endif
> +    "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "%xmm7",
>  };
>  #endif
>
> @@ -61,6 +63,14 @@ static const int tcg_target_reg_alloc_order[] = {
>      TCG_REG_EDX,
>      TCG_REG_EAX,
>  #endif
> +    TCG_REG_XMM0,
> +    TCG_REG_XMM1,
> +    TCG_REG_XMM2,
> +    TCG_REG_XMM3,
> +    TCG_REG_XMM4,
> +    TCG_REG_XMM5,
> +    TCG_REG_XMM6,
> +    TCG_REG_XMM7,
>  };
>
>  static const int tcg_target_call_iarg_regs[] = {
> @@ -94,7 +104,7 @@ static const int tcg_target_call_oarg_regs[] = {
>  #define TCG_CT_CONST_I32 0x400
>  #define TCG_CT_CONST_WSZ 0x800
>
> -/* Registers used with L constraint, which are the first argument
> +/* Registers used with L constraint, which are the first argument
>     registers on x86_64, and two random call clobbered registers on
>     i386. */
>  #if TCG_TARGET_REG_BITS == 64
> @@ -127,6 +137,16 @@ bool have_bmi1;
>  bool have_bmi2;
>  bool have_popcnt;
>
> +#ifndef have_sse2
> +bool have_sse2;
> +#endif
> +#ifdef have_avx2
> +#define have_avx1  have_avx2
> +#else
> +static bool have_avx1;
> +bool have_avx2;
> +#endif
> +
>  #ifdef CONFIG_CPUID_H
>  static bool have_movbe;
>  static bool have_lzcnt;
> @@ -215,6 +235,10 @@ static const char *target_parse_constraint(TCGArgConstraint *ct,
>          /* With TZCNT/LZCNT, we can have operand-size as an input.  */
>          ct->ct |= TCG_CT_CONST_WSZ;
>          break;
> +    case 'x':
> +        ct->ct |= TCG_CT_REG;
> +        tcg_regset_set32(ct->u.regs, 0, 0xff0000);
> +        break;
>
>          /* qemu_ld/st address constraint */
>      case 'L':
> @@ -292,6 +316,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
>  #endif
>  #define P_SIMDF3        0x20000         /* 0xf3 opcode prefix */
>  #define P_SIMDF2        0x40000         /* 0xf2 opcode prefix */
> +#define P_VEXL          0x80000         /* Set VEX.L = 1 */
>
>  #define OPC_ARITH_EvIz	(0x81)
>  #define OPC_ARITH_EvIb	(0x83)
> @@ -324,13 +349,31 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
>  #define OPC_MOVL_Iv     (0xb8)
>  #define OPC_MOVBE_GyMy  (0xf0 | P_EXT38)
>  #define OPC_MOVBE_MyGy  (0xf1 | P_EXT38)
> +#define OPC_MOVDQA_GyMy (0x6f | P_EXT | P_DATA16)
> +#define OPC_MOVDQA_MyGy (0x7f | P_EXT | P_DATA16)
> +#define OPC_MOVDQU_GyMy (0x6f | P_EXT | P_SIMDF3)
> +#define OPC_MOVDQU_MyGy (0x7f | P_EXT | P_SIMDF3)
> +#define OPC_MOVQ_GyMy   (0x7e | P_EXT | P_SIMDF3)
> +#define OPC_MOVQ_MyGy   (0xd6 | P_EXT | P_DATA16)
>  #define OPC_MOVSBL	(0xbe | P_EXT)
>  #define OPC_MOVSWL	(0xbf | P_EXT)
>  #define OPC_MOVSLQ	(0x63 | P_REXW)
>  #define OPC_MOVZBL	(0xb6 | P_EXT)
>  #define OPC_MOVZWL	(0xb7 | P_EXT)
> +#define OPC_PADDB       (0xfc | P_EXT | P_DATA16)
> +#define OPC_PADDW       (0xfd | P_EXT | P_DATA16)
> +#define OPC_PADDD       (0xfe | P_EXT | P_DATA16)
> +#define OPC_PADDQ       (0xd4 | P_EXT | P_DATA16)
> +#define OPC_PAND        (0xdb | P_EXT | P_DATA16)
> +#define OPC_PANDN       (0xdf | P_EXT | P_DATA16)
>  #define OPC_PDEP        (0xf5 | P_EXT38 | P_SIMDF2)
>  #define OPC_PEXT        (0xf5 | P_EXT38 | P_SIMDF3)
> +#define OPC_POR         (0xeb | P_EXT | P_DATA16)
> +#define OPC_PSUBB       (0xf8 | P_EXT | P_DATA16)
> +#define OPC_PSUBW       (0xf9 | P_EXT | P_DATA16)
> +#define OPC_PSUBD       (0xfa | P_EXT | P_DATA16)
> +#define OPC_PSUBQ       (0xfb | P_EXT | P_DATA16)
> +#define OPC_PXOR        (0xef | P_EXT | P_DATA16)
>  #define OPC_POP_r32	(0x58)
>  #define OPC_POPCNT      (0xb8 | P_EXT | P_SIMDF3)
>  #define OPC_PUSH_r32	(0x50)
> @@ -500,7 +543,8 @@ static void tcg_out_modrm(TCGContext *s, int opc, int r, int rm)
>      tcg_out8(s, 0xc0 | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
>  }
>
> -static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
> +static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v,
> +                                int rm, int index)
>  {
>      int tmp;
>
> @@ -515,14 +559,16 @@ static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
>      } else if (opc & P_EXT) {
>          tmp = 1;
>      } else {
> -        tcg_abort();
> +        g_assert_not_reached();
>      }
> -    tmp |= 0x40;                           /* VEX.X */
>      tmp |= (r & 8 ? 0 : 0x80);             /* VEX.R */
> +    tmp |= (index & 8 ? 0 : 0x40);         /* VEX.X */
>      tmp |= (rm & 8 ? 0 : 0x20);            /* VEX.B */
>      tcg_out8(s, tmp);
>
>      tmp = (opc & P_REXW ? 0x80 : 0);       /* VEX.W */
> +    tmp |= (opc & P_VEXL ? 0x04 : 0);      /* VEX.L */
> +
>      /* VEX.pp */
>      if (opc & P_DATA16) {
>          tmp |= 1;                          /* 0x66 */
> @@ -538,7 +584,7 @@ static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
>
>  static void tcg_out_vex_modrm(TCGContext *s, int opc, int r, int v, int rm)
>  {
> -    tcg_out_vex_pfx_opc(s, opc, r, v, rm);
> +    tcg_out_vex_pfx_opc(s, opc, r, v, rm, 0);
>      tcg_out8(s, 0xc0 | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
>  }
>
> @@ -565,7 +611,7 @@ static void tcg_out_opc_pool_imm(TCGContext *s, int opc, int r,
>  static void tcg_out_vex_pool_imm(TCGContext *s, int opc, int r, int v,
>                                   tcg_target_ulong data)
>  {
> -    tcg_out_vex_pfx_opc(s, opc, r, v, 0);
> +    tcg_out_vex_pfx_opc(s, opc, r, v, 0, 0);
>      tcg_out_sfx_pool_imm(s, r, data);
>  }
>
> @@ -574,8 +620,8 @@ static void tcg_out_vex_pool_imm(TCGContext *s, int opc, int r, int v,
>     mode for absolute addresses, ~RM is the size of the immediate operand
>     that will follow the instruction.  */
>
> -static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> -                                     int index, int shift, intptr_t offset)
> +static void tcg_out_sib_offset(TCGContext *s, int r, int rm, int index,
> +                               int shift, intptr_t offset)
>  {
>      int mod, len;
>
> @@ -586,7 +632,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
>              intptr_t pc = (intptr_t)s->code_ptr + 5 + ~rm;
>              intptr_t disp = offset - pc;
>              if (disp == (int32_t)disp) {
> -                tcg_out_opc(s, opc, r, 0, 0);
>                  tcg_out8(s, (LOWREGMASK(r) << 3) | 5);
>                  tcg_out32(s, disp);
>                  return;
> @@ -596,7 +641,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
>                 use of the MODRM+SIB encoding and is therefore larger than
>                 rip-relative addressing.  */
>              if (offset == (int32_t)offset) {
> -                tcg_out_opc(s, opc, r, 0, 0);
>                  tcg_out8(s, (LOWREGMASK(r) << 3) | 4);
>                  tcg_out8(s, (4 << 3) | 5);
>                  tcg_out32(s, offset);
> @@ -604,10 +648,9 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
>              }
>
>              /* ??? The memory isn't directly addressable.  */
> -            tcg_abort();
> +            g_assert_not_reached();
>          } else {
>              /* Absolute address.  */
> -            tcg_out_opc(s, opc, r, 0, 0);
>              tcg_out8(s, (r << 3) | 5);
>              tcg_out32(s, offset);
>              return;
> @@ -630,7 +673,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
>         that would be used for %esp is the escape to the two byte form.  */
>      if (index < 0 && LOWREGMASK(rm) != TCG_REG_ESP) {
>          /* Single byte MODRM format.  */
> -        tcg_out_opc(s, opc, r, rm, 0);
>          tcg_out8(s, mod | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
>      } else {
>          /* Two byte MODRM+SIB format.  */
> @@ -644,7 +686,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
>              tcg_debug_assert(index != TCG_REG_ESP);
>          }
>
> -        tcg_out_opc(s, opc, r, rm, index);
>          tcg_out8(s, mod | (LOWREGMASK(r) << 3) | 4);
>          tcg_out8(s, (shift << 6) | (LOWREGMASK(index) << 3) | LOWREGMASK(rm));
>      }
> @@ -656,6 +697,21 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
>      }
>  }
>
> +static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> +                                     int index, int shift, intptr_t offset)
> +{
> +    tcg_out_opc(s, opc, r, rm < 0 ? 0 : rm, index < 0 ? 0 : index);
> +    tcg_out_sib_offset(s, r, rm, index, shift, offset);
> +}
> +
> +static void tcg_out_vex_modrm_sib_offset(TCGContext *s, int opc, int r, int v,
> +                                         int rm, int index, int shift,
> +                                         intptr_t offset)
> +{
> +    tcg_out_vex_pfx_opc(s, opc, r, v, rm < 0 ? 0 : rm, index < 0 ? 0 : index);
> +    tcg_out_sib_offset(s, r, rm, index, shift, offset);
> +}
> +
>  /* A simplification of the above with no index or shift.  */
>  static inline void tcg_out_modrm_offset(TCGContext *s, int opc, int r,
>                                          int rm, intptr_t offset)
> @@ -663,6 +719,31 @@ static inline void tcg_out_modrm_offset(TCGContext *s, int opc, int r,
>      tcg_out_modrm_sib_offset(s, opc, r, rm, -1, 0, offset);
>  }
>
> +static inline void tcg_out_vex_modrm_offset(TCGContext *s, int opc, int r,
> +                                            int v, int rm, intptr_t offset)
> +{
> +    tcg_out_vex_modrm_sib_offset(s, opc, r, v, rm, -1, 0, offset);
> +}
> +
> +static void tcg_out_maybe_vex_modrm(TCGContext *s, int opc, int r, int rm)
> +{
> +    if (have_avx1) {
> +        tcg_out_vex_modrm(s, opc, r, 0, rm);
> +    } else {
> +        tcg_out_modrm(s, opc, r, rm);
> +    }
> +}
> +
> +static void tcg_out_maybe_vex_modrm_offset(TCGContext *s, int opc, int r,
> +                                           int rm, intptr_t offset)
> +{
> +    if (have_avx1) {
> +        tcg_out_vex_modrm_offset(s, opc, r, 0, rm, offset);
> +    } else {
> +        tcg_out_modrm_offset(s, opc, r, rm, offset);
> +    }
> +}
> +
>  /* Generate dest op= src.  Uses the same ARITH_* codes as tgen_arithi.  */
>  static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
>  {
> @@ -673,12 +754,32 @@ static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
>      tcg_out_modrm(s, OPC_ARITH_GvEv + (subop << 3) + ext, dest, src);
>  }
>
> -static inline void tcg_out_mov(TCGContext *s, TCGType type,
> -                               TCGReg ret, TCGReg arg)
> +static void tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
>  {
>      if (arg != ret) {
> -        int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> -        tcg_out_modrm(s, opc, ret, arg);
> +        int opc = 0;
> +
> +        switch (type) {
> +        case TCG_TYPE_I64:
> +            opc = P_REXW;
> +            /* fallthru */
> +        case TCG_TYPE_I32:
> +            opc |= OPC_MOVL_GvEv;
> +            tcg_out_modrm(s, opc, ret, arg);
> +            break;
> +
> +        case TCG_TYPE_V256:
> +            opc = P_VEXL;
> +            /* fallthru */
> +        case TCG_TYPE_V128:
> +        case TCG_TYPE_V64:
> +            opc |= OPC_MOVDQA_GyMy;
> +            tcg_out_maybe_vex_modrm(s, opc, ret, arg);
> +            break;
> +
> +        default:
> +            g_assert_not_reached();
> +        }
>      }
>  }
>
> @@ -687,6 +788,27 @@ static void tcg_out_movi(TCGContext *s, TCGType type,
>  {
>      tcg_target_long diff;
>
> +    switch (type) {
> +    case TCG_TYPE_I32:
> +    case TCG_TYPE_I64:
> +        break;
> +
> +    case TCG_TYPE_V64:
> +    case TCG_TYPE_V128:
> +    case TCG_TYPE_V256:
> +        /* ??? Revisit this as the implementation progresses.  */
> +        tcg_debug_assert(arg == 0);
> +        if (have_avx1) {
> +            tcg_out_vex_modrm(s, OPC_PXOR, ret, ret, ret);
> +        } else {
> +            tcg_out_modrm(s, OPC_PXOR, ret, ret);
> +        }
> +        return;
> +
> +    default:
> +        g_assert_not_reached();
> +    }
> +
>      if (arg == 0) {
>          tgen_arithr(s, ARITH_XOR, ret, ret);
>          return;
> @@ -750,18 +872,54 @@ static inline void tcg_out_pop(TCGContext *s, int reg)
>      tcg_out_opc(s, OPC_POP_r32 + LOWREGMASK(reg), 0, reg, 0);
>  }
>
> -static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
> -                              TCGReg arg1, intptr_t arg2)
> +static void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
> +                       TCGReg arg1, intptr_t arg2)
>  {
> -    int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> -    tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
> +    switch (type) {
> +    case TCG_TYPE_I64:
> +        tcg_out_modrm_offset(s, OPC_MOVL_GvEv | P_REXW, ret, arg1, arg2);
> +        break;
> +    case TCG_TYPE_I32:
> +        tcg_out_modrm_offset(s, OPC_MOVL_GvEv, ret, arg1, arg2);
> +        break;
> +    case TCG_TYPE_V64:
> +        tcg_out_maybe_vex_modrm_offset(s, OPC_MOVQ_GyMy, ret, arg1, arg2);
> +        break;
> +    case TCG_TYPE_V128:
> +        tcg_out_maybe_vex_modrm_offset(s, OPC_MOVDQU_GyMy, ret, arg1, arg2);
> +        break;
> +    case TCG_TYPE_V256:
> +        tcg_out_vex_modrm_offset(s, OPC_MOVDQU_GyMy | P_VEXL,
> +                                 ret, 0, arg1, arg2);
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
>  }
>
> -static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
> -                              TCGReg arg1, intptr_t arg2)
> +static void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
> +                       TCGReg arg1, intptr_t arg2)
>  {
> -    int opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> -    tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
> +    switch (type) {
> +    case TCG_TYPE_I64:
> +        tcg_out_modrm_offset(s, OPC_MOVL_EvGv | P_REXW, arg, arg1, arg2);
> +        break;
> +    case TCG_TYPE_I32:
> +        tcg_out_modrm_offset(s, OPC_MOVL_EvGv, arg, arg1, arg2);
> +        break;
> +    case TCG_TYPE_V64:
> +        tcg_out_maybe_vex_modrm_offset(s, OPC_MOVQ_MyGy, arg, arg1, arg2);
> +        break;
> +    case TCG_TYPE_V128:
> +        tcg_out_maybe_vex_modrm_offset(s, OPC_MOVDQU_MyGy, arg, arg1, arg2);
> +        break;
> +    case TCG_TYPE_V256:
> +        tcg_out_vex_modrm_offset(s, OPC_MOVDQU_MyGy | P_VEXL,
> +                                 arg, 0, arg1, arg2);
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
>  }
>
>  static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
> @@ -773,6 +931,8 @@ static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
>              return false;
>          }
>          rexw = P_REXW;
> +    } else if (type != TCG_TYPE_I32) {
> +        return false;
>      }
>      tcg_out_modrm_offset(s, OPC_MOVL_EvIz | rexw, 0, base, ofs);
>      tcg_out32(s, val);
> @@ -1914,6 +2074,15 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>          case glue(glue(INDEX_op_, x), _i32)
>  #endif
>
> +#define OP_128_256(x) \
> +        case glue(glue(INDEX_op_, x), _v256): \
> +            rexw = P_VEXL; /* FALLTHRU */     \
> +        case glue(glue(INDEX_op_, x), _v128)
> +
> +#define OP_64_128_256(x) \
> +        OP_128_256(x):   \
> +        case glue(glue(INDEX_op_, x), _v64)
> +
>      /* Hoist the loads of the most common arguments.  */
>      a0 = args[0];
>      a1 = args[1];
> @@ -2379,19 +2548,94 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>          }
>          break;
>
> +    OP_64_128_256(add8):
> +        c = OPC_PADDB;
> +        goto gen_simd;
> +    OP_64_128_256(add16):
> +        c = OPC_PADDW;
> +        goto gen_simd;
> +    OP_64_128_256(add32):
> +        c = OPC_PADDD;
> +        goto gen_simd;
> +    OP_128_256(add64):
> +        c = OPC_PADDQ;
> +        goto gen_simd;
> +    OP_64_128_256(sub8):
> +        c = OPC_PSUBB;
> +        goto gen_simd;
> +    OP_64_128_256(sub16):
> +        c = OPC_PSUBW;
> +        goto gen_simd;
> +    OP_64_128_256(sub32):
> +        c = OPC_PSUBD;
> +        goto gen_simd;
> +    OP_128_256(sub64):
> +        c = OPC_PSUBQ;
> +        goto gen_simd;
> +    OP_64_128_256(and):
> +        c = OPC_PAND;
> +        goto gen_simd;
> +    OP_64_128_256(andc):
> +        c = OPC_PANDN;
> +        goto gen_simd;
> +    OP_64_128_256(or):
> +        c = OPC_POR;
> +        goto gen_simd;
> +    OP_64_128_256(xor):
> +        c = OPC_PXOR;
> +    gen_simd:
> +        if (have_avx1) {
> +            tcg_out_vex_modrm(s, c, a0, a1, a2);
> +        } else {
> +            tcg_out_modrm(s, c, a0, a2);
> +        }
> +        break;
> +
> +    case INDEX_op_ld_v64:
> +        c = TCG_TYPE_V64;
> +        goto gen_simd_ld;
> +    case INDEX_op_ld_v128:
> +        c = TCG_TYPE_V128;
> +        goto gen_simd_ld;
> +    case INDEX_op_ld_v256:
> +        c = TCG_TYPE_V256;
> +    gen_simd_ld:
> +        tcg_out_ld(s, c, a0, a1, a2);
> +        break;
> +
> +    case INDEX_op_st_v64:
> +        c = TCG_TYPE_V64;
> +        goto gen_simd_st;
> +    case INDEX_op_st_v128:
> +        c = TCG_TYPE_V128;
> +        goto gen_simd_st;
> +    case INDEX_op_st_v256:
> +        c = TCG_TYPE_V256;
> +    gen_simd_st:
> +        tcg_out_st(s, c, a0, a1, a2);
> +        break;
> +
>      case INDEX_op_mb:
>          tcg_out_mb(s, a0);
>          break;
>      case INDEX_op_mov_i32:  /* Always emitted via tcg_out_mov.  */
>      case INDEX_op_mov_i64:
> +    case INDEX_op_mov_v64:
> +    case INDEX_op_mov_v128:
> +    case INDEX_op_mov_v256:
>      case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi.  */
>      case INDEX_op_movi_i64:
> +    case INDEX_op_movi_v64:
> +    case INDEX_op_movi_v128:
> +    case INDEX_op_movi_v256:
>      case INDEX_op_call:     /* Always emitted via tcg_out_call.  */
>      default:
>          tcg_abort();
>      }
>
>  #undef OP_32_64
> +#undef OP_128_256
> +#undef OP_64_128_256
>  }
>
>  static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
> @@ -2417,6 +2661,9 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
>          = { .args_ct_str = { "r", "r", "L", "L" } };
>      static const TCGTargetOpDef L_L_L_L
>          = { .args_ct_str = { "L", "L", "L", "L" } };
> +    static const TCGTargetOpDef x_0_x = { .args_ct_str = { "x", "0", "x" } };
> +    static const TCGTargetOpDef x_x_x = { .args_ct_str = { "x", "x", "x" } };
> +    static const TCGTargetOpDef x_r = { .args_ct_str = { "x", "r" } };
>
>      switch (op) {
>      case INDEX_op_goto_ptr:
> @@ -2620,6 +2867,52 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
>              return &s2;
>          }
>
> +    case INDEX_op_ld_v64:
> +    case INDEX_op_ld_v128:
> +    case INDEX_op_ld_v256:
> +    case INDEX_op_st_v64:
> +    case INDEX_op_st_v128:
> +    case INDEX_op_st_v256:
> +        return &x_r;
> +
> +    case INDEX_op_add8_v64:
> +    case INDEX_op_add8_v128:
> +    case INDEX_op_add16_v64:
> +    case INDEX_op_add16_v128:
> +    case INDEX_op_add32_v64:
> +    case INDEX_op_add32_v128:
> +    case INDEX_op_add64_v128:
> +    case INDEX_op_sub8_v64:
> +    case INDEX_op_sub8_v128:
> +    case INDEX_op_sub16_v64:
> +    case INDEX_op_sub16_v128:
> +    case INDEX_op_sub32_v64:
> +    case INDEX_op_sub32_v128:
> +    case INDEX_op_sub64_v128:
> +    case INDEX_op_and_v64:
> +    case INDEX_op_and_v128:
> +    case INDEX_op_andc_v64:
> +    case INDEX_op_andc_v128:
> +    case INDEX_op_or_v64:
> +    case INDEX_op_or_v128:
> +    case INDEX_op_xor_v64:
> +    case INDEX_op_xor_v128:
> +        return have_avx1 ? &x_x_x : &x_0_x;
> +
> +    case INDEX_op_add8_v256:
> +    case INDEX_op_add16_v256:
> +    case INDEX_op_add32_v256:
> +    case INDEX_op_add64_v256:
> +    case INDEX_op_sub8_v256:
> +    case INDEX_op_sub16_v256:
> +    case INDEX_op_sub32_v256:
> +    case INDEX_op_sub64_v256:
> +    case INDEX_op_and_v256:
> +    case INDEX_op_andc_v256:
> +    case INDEX_op_or_v256:
> +    case INDEX_op_xor_v256:
> +        return &x_x_x;
> +
>      default:
>          break;
>      }
> @@ -2725,9 +3018,16 @@ static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
>  static void tcg_target_init(TCGContext *s)
>  {
>  #ifdef CONFIG_CPUID_H
> -    unsigned a, b, c, d;
> +    unsigned a, b, c, d, b7 = 0;
>      int max = __get_cpuid_max(0, 0);
>
> +    if (max >= 7) {
> +        /* BMI1 is available on AMD Piledriver and Intel Haswell CPUs.  */
> +        __cpuid_count(7, 0, a, b7, c, d);
> +        have_bmi1 = (b7 & bit_BMI) != 0;
> +        have_bmi2 = (b7 & bit_BMI2) != 0;
> +    }
> +
>      if (max >= 1) {
>          __cpuid(1, a, b, c, d);
>  #ifndef have_cmov
> @@ -2736,17 +3036,26 @@ static void tcg_target_init(TCGContext *s)
>             available, we'll use a small forward branch.  */
>          have_cmov = (d & bit_CMOV) != 0;
>  #endif
> +#ifndef have_sse2
> +        have_sse2 = (d & bit_SSE2) != 0;
> +#endif
>          /* MOVBE is only available on Intel Atom and Haswell CPUs, so we
>             need to probe for it.  */
>          have_movbe = (c & bit_MOVBE) != 0;
>          have_popcnt = (c & bit_POPCNT) != 0;
> -    }
>
> -    if (max >= 7) {
> -        /* BMI1 is available on AMD Piledriver and Intel Haswell CPUs.  */
> -        __cpuid_count(7, 0, a, b, c, d);
> -        have_bmi1 = (b & bit_BMI) != 0;
> -        have_bmi2 = (b & bit_BMI2) != 0;
> +#ifndef have_avx2
> +        /* There are a number of things we must check before we can be
> +           sure of not hitting invalid opcode.  */
> +        if (c & bit_OSXSAVE) {
> +            unsigned xcrl, xcrh;
> +            asm ("xgetbv" : "=a" (xcrl), "=d" (xcrh) : "c" (0));
> +            if (xcrl & 6 == 6) {

My picky compiler complains:

/home/alex/lsrc/qemu/qemu.git/tcg/i386/tcg-target.inc.c: In function ‘tcg_target_init’:
/home/alex/lsrc/qemu/qemu.git/tcg/i386/tcg-target.inc.c:3053:22: error: suggest parentheses around comparison in operand of ‘&’ [-Werror=parentheses]
             if (xcrl & 6 == 6) {

> +                have_avx1 = (c & bit_AVX) != 0;
> +                have_avx2 = (b7 & bit_AVX2) != 0;
> +            }
> +        }
> +#endif
>      }
>
>      max = __get_cpuid_max(0x8000000, 0);
> @@ -2763,6 +3072,13 @@ static void tcg_target_init(TCGContext *s)
>      } else {
>          tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xff);
>      }
> +    if (have_sse2) {
> +        tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V64], 0, 0xff0000);
> +        tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V128], 0, 0xff0000);
> +    }
> +    if (have_avx2) {
> +        tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V256], 0, 0xff0000);
> +    }
>
>      tcg_regset_clear(tcg_target_call_clobber_regs);
>      tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_EAX);


--
Alex Bennée


* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
  2017-08-22 13:15   ` Alex Bennée
@ 2017-08-23 19:02     ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-23 19:02 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, qemu-arm

On 08/22/2017 06:15 AM, Alex Bennée wrote:
>> +#ifndef have_avx2
>> +        /* There are a number of things we must check before we can be
>> +           sure of not hitting invalid opcode.  */
>> +        if (c & bit_OSXSAVE) {
>> +            unsigned xcrl, xcrh;
>> +            asm ("xgetbv" : "=a" (xcrl), "=d" (xcrh) : "c" (0));
>> +            if (xcrl & 6 == 6) {
> 
> My picky compiler complains:
> 
> /home/alex/lsrc/qemu/qemu.git/tcg/i386/tcg-target.inc.c: In function ‘tcg_target_init’:
> /home/alex/lsrc/qemu/qemu.git/tcg/i386/tcg-target.inc.c:3053:22: error: suggest parentheses around comparison in operand of ‘&’ [-Werror=parentheses]
>              if (xcrl & 6 == 6) {


Bah.  I forgot that my default build uses -march=native, and my laptop has
AVX2, so this bit wouldn't have been compile tested at all.

Fixed on the branch.
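
For reference, the warning is about C operator precedence: == binds tighter
than &, so the test as written evaluates as xcrl & (6 == 6), i.e. xcrl & 1.
A sketch of the presumably intended check (both XMM and YMM state enabled
in XCR0) is simply:

    if ((xcrl & 6) == 6) {
        have_avx1 = (c & bit_AVX) != 0;
        have_avx2 = (b7 & bit_AVX2) != 0;
    }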


r~


* Re: [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic Richard Henderson
@ 2017-08-30  1:31   ` Philippe Mathieu-Daudé
  2017-09-01 20:38     ` Richard Henderson
  2017-09-07 16:34   ` Alex Bennée
  1 sibling, 1 reply; 36+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-08-30  1:31 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: qemu-arm, alex.bennee

Hi Richard,

I can't find anything to say about this patch... Hardcore stuff.
Some parts could be a bit more verbose, but after focusing on it for a while it
makes sense.
I wonder how long it took you to write this :) "roughly 2h"

On 08/17/2017 08:01 PM, Richard Henderson wrote:
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Hoping I didn't miss anything:

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

> ---
>   Makefile.target        |   5 +-
>   tcg/tcg-op-gvec.h      |  88 ++++++++++
>   tcg/tcg-runtime.h      |  16 ++
>   tcg/tcg-op-gvec.c      | 443 +++++++++++++++++++++++++++++++++++++++++++++++++
>   tcg/tcg-runtime-gvec.c | 199 ++++++++++++++++++++++
>   5 files changed, 749 insertions(+), 2 deletions(-)
>   create mode 100644 tcg/tcg-op-gvec.h
>   create mode 100644 tcg/tcg-op-gvec.c
>   create mode 100644 tcg/tcg-runtime-gvec.c
> 
> diff --git a/Makefile.target b/Makefile.target
> index 7f42c45db8..9ae3e904f7 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -93,8 +93,9 @@ all: $(PROGS) stap
>   # cpu emulator library
>   obj-y += exec.o
>   obj-y += accel/
> -obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-op.o tcg/optimize.o
> -obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/tcg-runtime.o
> +obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-common.o tcg/optimize.o
> +obj-$(CONFIG_TCG) += tcg/tcg-op.o tcg/tcg-op-gvec.o
> +obj-$(CONFIG_TCG) += tcg/tcg-runtime.o tcg/tcg-runtime-gvec.o
>   obj-$(CONFIG_TCG_INTERPRETER) += tcg/tci.o
>   obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
>   obj-y += fpu/softfloat.o
> diff --git a/tcg/tcg-op-gvec.h b/tcg/tcg-op-gvec.h
> new file mode 100644
> index 0000000000..10db3599a5
> --- /dev/null
> +++ b/tcg/tcg-op-gvec.h
> @@ -0,0 +1,88 @@
> +/*
> + *  Generic vector operation expansion
> + *
> + *  Copyright (c) 2017 Linaro
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +/*
> + * "Generic" vectors.  All operands are given as offsets from ENV,
> + * and therefore cannot also be allocated via tcg_global_mem_new_*.
> + * OPSZ is the byte size of the vector upon which the operation is performed.
> + * CLSZ is the byte size of the full vector; bytes beyond OPSZ are cleared.
> + *
> + * All sizes must be 8 or any multiple of 16.
> + * When OPSZ is 8, the alignment may be 8, otherwise must be 16.
> + * Operands may completely, but not partially, overlap.
> + */
> +
> +/* Fundamental operation expanders.  These are exposed to the front ends
> +   so that target-specific SIMD operations can be handled similarly to
> +   the standard SIMD operations.  */
> +
> +typedef struct {
> +    /* "Small" sizes: expand inline as a 64-bit or 32-bit lane.
> +       Generally only one of these will be non-NULL.  */
> +    void (*fni8)(TCGv_i64, TCGv_i64, TCGv_i64);
> +    void (*fni4)(TCGv_i32, TCGv_i32, TCGv_i32);
> +    /* Similarly, but load up a constant and re-use across lanes.  */
> +    void (*fni8x)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
> +    uint64_t extra_value;
> +    /* Larger sizes: expand out-of-line helper w/size descriptor.  */
> +    void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);
> +} GVecGen3;
> +
> +void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                    uint32_t opsz, uint32_t clsz, const GVecGen3 *);
> +
> +#define DEF_GVEC_2(X) \
> +    void tcg_gen_gvec_##X(uint32_t dofs, uint32_t aofs, uint32_t bofs, \
> +                          uint32_t opsz, uint32_t clsz)
> +
> +DEF_GVEC_2(add8);
> +DEF_GVEC_2(add16);
> +DEF_GVEC_2(add32);
> +DEF_GVEC_2(add64);
> +
> +DEF_GVEC_2(sub8);
> +DEF_GVEC_2(sub16);
> +DEF_GVEC_2(sub32);
> +DEF_GVEC_2(sub64);
> +
> +DEF_GVEC_2(and8);
> +DEF_GVEC_2(or8);
> +DEF_GVEC_2(xor8);
> +DEF_GVEC_2(andc8);
> +DEF_GVEC_2(orc8);
> +
> +#undef DEF_GVEC_2
> +
> +/*
> + * 64-bit vector operations.  Use these when the register has been
> + * allocated with tcg_global_mem_new_i64.  OPSZ = CLSZ = 8.
> + */
> +
> +#define DEF_VEC8_2(X) \
> +    void tcg_gen_vec8_##X(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +
> +DEF_VEC8_2(add8);
> +DEF_VEC8_2(add16);
> +DEF_VEC8_2(add32);
> +
> +DEF_VEC8_2(sub8);
> +DEF_VEC8_2(sub16);
> +DEF_VEC8_2(sub32);
> +
> +#undef DEF_VEC8_2
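
For illustration, a minimal call into the interface declared above, where
d_ofs, a_ofs and b_ofs are hypothetical env offsets of three 16-byte,
16-byte-aligned vector registers, and opsz == clsz == 16 so no high bytes
need clearing:

    tcg_gen_gvec_add32(d_ofs, a_ofs, b_ofs, 16, 16);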
> diff --git a/tcg/tcg-runtime.h b/tcg/tcg-runtime.h
> index c41d38a557..f8d07090f8 100644
> --- a/tcg/tcg-runtime.h
> +++ b/tcg/tcg-runtime.h
> @@ -134,3 +134,19 @@ GEN_ATOMIC_HELPERS(xor_fetch)
>   GEN_ATOMIC_HELPERS(xchg)
>   
>   #undef GEN_ATOMIC_HELPERS
> +
> +DEF_HELPER_FLAGS_4(gvec_add8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_add16, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_add32, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_add64, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +
> +DEF_HELPER_FLAGS_4(gvec_sub8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sub16, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sub32, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sub64, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +
> +DEF_HELPER_FLAGS_4(gvec_and8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_or8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_xor8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_andc8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_orc8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
> new file mode 100644
> index 0000000000..6de49dc07f
> --- /dev/null
> +++ b/tcg/tcg-op-gvec.c
> @@ -0,0 +1,443 @@
> +/*
> + *  Generic vector operation expansion
> + *
> + *  Copyright (c) 2017 Linaro
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "cpu.h"
> +#include "exec/exec-all.h"
> +#include "tcg.h"
> +#include "tcg-op.h"
> +#include "tcg-op-gvec.h"
> +#include "trace-tcg.h"
> +#include "trace/mem.h"
> +
> +#define REP8(x)    ((x) * 0x0101010101010101ull)
> +#define REP16(x)   ((x) * 0x0001000100010001ull)
> +
> +#define MAX_INLINE 16
> +
> +static inline void check_size_s(uint32_t opsz, uint32_t clsz)
> +{
> +    tcg_debug_assert(opsz % 8 == 0);
> +    tcg_debug_assert(clsz % 8 == 0);
> +    tcg_debug_assert(opsz <= clsz);
> +}
> +
> +static inline void check_align_s_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
> +{
> +    tcg_debug_assert(dofs % 8 == 0);
> +    tcg_debug_assert(aofs % 8 == 0);
> +    tcg_debug_assert(bofs % 8 == 0);
> +}
> +
> +static inline void check_size_l(uint32_t opsz, uint32_t clsz)
> +{
> +    tcg_debug_assert(opsz % 16 == 0);
> +    tcg_debug_assert(clsz % 16 == 0);
> +    tcg_debug_assert(opsz <= clsz);
> +}
> +
> +static inline void check_align_l_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
> +{
> +    tcg_debug_assert(dofs % 16 == 0);
> +    tcg_debug_assert(aofs % 16 == 0);
> +    tcg_debug_assert(bofs % 16 == 0);
> +}
> +
> +static inline void check_overlap_3(uint32_t d, uint32_t a,
> +                                   uint32_t b, uint32_t s)
> +{
> +    tcg_debug_assert(d == a || d + s <= a || a + s <= d);
> +    tcg_debug_assert(d == b || d + s <= b || b + s <= d);
> +    tcg_debug_assert(a == b || a + s <= b || b + s <= a);
> +}
> +
> +static void expand_clr(uint32_t dofs, uint32_t opsz, uint32_t clsz)
> +{
> +    if (clsz > opsz) {
> +        TCGv_i64 zero = tcg_const_i64(0);
> +        uint32_t i;
> +
> +        for (i = opsz; i < clsz; i += 8) {
> +            tcg_gen_st_i64(zero, tcg_ctx.tcg_env, dofs + i);
> +        }
> +        tcg_temp_free_i64(zero);
> +    }
> +}
> +
> +static TCGv_i32 make_desc(uint32_t opsz, uint32_t clsz)
> +{
> +    tcg_debug_assert(opsz >= 16 && opsz <= 255 * 16 && opsz % 16 == 0);
> +    tcg_debug_assert(clsz >= 16 && clsz <= 255 * 16 && clsz % 16 == 0);
> +    opsz /= 16;
> +    clsz /= 16;
> +    opsz -= 1;
> +    clsz -= 1;
> +    return tcg_const_i32(deposit32(opsz, 8, 8, clsz));
> +}
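
A worked example of this encoding, assuming opsz = 16 and clsz = 32: both
are divided by 16 and biased by -1, giving fields 0 and 1, so the result is
deposit32(0, 8, 8, 1) = 0x100; extract_opsz() and extract_clsz() in
tcg-runtime-gvec.c below invert this to recover 16 and 32.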
> +
> +static void expand_3_o(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t clsz,
> +                       void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32))
> +{
> +    TCGv_ptr d = tcg_temp_new_ptr();
> +    TCGv_ptr a = tcg_temp_new_ptr();
> +    TCGv_ptr b = tcg_temp_new_ptr();
> +    TCGv_i32 desc = make_desc(opsz, clsz);
> +
> +    tcg_gen_addi_ptr(d, tcg_ctx.tcg_env, dofs);
> +    tcg_gen_addi_ptr(a, tcg_ctx.tcg_env, aofs);
> +    tcg_gen_addi_ptr(b, tcg_ctx.tcg_env, bofs);
> +    fno(d, a, b, desc);
> +
> +    tcg_temp_free_ptr(d);
> +    tcg_temp_free_ptr(a);
> +    tcg_temp_free_ptr(b);
> +    tcg_temp_free_i32(desc);
> +}
> +
> +static void expand_3x4(uint32_t dofs, uint32_t aofs,
> +                       uint32_t bofs, uint32_t opsz,
> +                       void (*fni)(TCGv_i32, TCGv_i32, TCGv_i32))
> +{
> +    TCGv_i32 t0 = tcg_temp_new_i32();
> +    uint32_t i;
> +
> +    if (aofs == bofs) {
> +        for (i = 0; i < opsz; i += 4) {
> +            tcg_gen_ld_i32(t0, tcg_ctx.tcg_env, aofs + i);
> +            fni(t0, t0, t0);
> +            tcg_gen_st_i32(t0, tcg_ctx.tcg_env, dofs + i);
> +        }
> +    } else {
> +        TCGv_i32 t1 = tcg_temp_new_i32();
> +        for (i = 0; i < opsz; i += 4) {
> +            tcg_gen_ld_i32(t0, tcg_ctx.tcg_env, aofs + i);
> +            tcg_gen_ld_i32(t1, tcg_ctx.tcg_env, bofs + i);
> +            fni(t0, t0, t1);
> +            tcg_gen_st_i32(t0, tcg_ctx.tcg_env, dofs + i);
> +        }
> +        tcg_temp_free_i32(t1);
> +    }
> +    tcg_temp_free_i32(t0);
> +}
> +
> +static void expand_3x8(uint32_t dofs, uint32_t aofs,
> +                       uint32_t bofs, uint32_t opsz,
> +                       void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64))
> +{
> +    TCGv_i64 t0 = tcg_temp_new_i64();
> +    uint32_t i;
> +
> +    if (aofs == bofs) {
> +        for (i = 0; i < opsz; i += 8) {
> +            tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> +            fni(t0, t0, t0);
> +            tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> +        }
> +    } else {
> +        TCGv_i64 t1 = tcg_temp_new_i64();
> +        for (i = 0; i < opsz; i += 8) {
> +            tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> +            tcg_gen_ld_i64(t1, tcg_ctx.tcg_env, bofs + i);
> +            fni(t0, t0, t1);
> +            tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> +        }
> +        tcg_temp_free_i64(t1);
> +    }
> +    tcg_temp_free_i64(t0);
> +}
> +
> +static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                         uint32_t opsz, uint64_t data,
> +                         void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64))
> +{
> +    TCGv_i64 t0 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_const_i64(data);
> +    uint32_t i;
> +
> +    if (aofs == bofs) {
> +        for (i = 0; i < opsz; i += 8) {
> +            tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> +            fni(t0, t0, t0, t2);
> +            tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> +        }
> +    } else {
> +        TCGv_i64 t1 = tcg_temp_new_i64();
> +        for (i = 0; i < opsz; i += 8) {
> +            tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> +            tcg_gen_ld_i64(t1, tcg_ctx.tcg_env, bofs + i);
> +            fni(t0, t0, t1, t2);
> +            tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> +        }
> +        tcg_temp_free_i64(t1);
> +    }
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                    uint32_t opsz, uint32_t clsz, const GVecGen3 *g)
> +{
> +    check_overlap_3(dofs, aofs, bofs, clsz);
> +    if (opsz <= MAX_INLINE) {
> +        check_size_s(opsz, clsz);
> +        check_align_s_3(dofs, aofs, bofs);
> +        if (g->fni8) {
> +            expand_3x8(dofs, aofs, bofs, opsz, g->fni8);
> +        } else if (g->fni4) {
> +            expand_3x4(dofs, aofs, bofs, opsz, g->fni4);
> +        } else if (g->fni8x) {
> +            expand_3x8p1(dofs, aofs, bofs, opsz, g->extra_value, g->fni8x);
> +        } else {
> +            g_assert_not_reached();
> +        }
> +        expand_clr(dofs, opsz, clsz);
> +    } else {
> +        check_size_l(opsz, clsz);
> +        check_align_l_3(dofs, aofs, bofs);
> +        expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
> +    }
> +}
> +
> +static void gen_addv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 t3 = tcg_temp_new_i64();
> +
> +    tcg_gen_andc_i64(t1, a, m);
> +    tcg_gen_andc_i64(t2, b, m);
> +    tcg_gen_xor_i64(t3, a, b);
> +    tcg_gen_add_i64(d, t1, t2);
> +    tcg_gen_and_i64(t3, t3, m);
> +    tcg_gen_xor_i64(d, d, t3);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +    tcg_temp_free_i64(t3);
> +}
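
The mask m here holds the top bit of each lane.  Adding with those bits
cleared means no carry can cross a lane boundary, and the correct per-lane
top bits are then restored by xor, which cannot generate a carry.  For
8-bit lanes (m = REP8(0x80)), a byte pair 0xff + 0x01 becomes
0x7f + 0x01 = 0x80, xored with (0xff ^ 0x01) & 0x80 = 0x80, giving 0x00
without touching the neighbouring byte.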
> +
> +void tcg_gen_gvec_add8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .extra_value = REP8(0x80),
> +        .fni8x = gen_addv_mask,
> +        .fno = gen_helper_gvec_add8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_add16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .extra_value = REP16(0x8000),
> +        .fni8x = gen_addv_mask,
> +        .fno = gen_helper_gvec_add16,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_add32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni4 = tcg_gen_add_i32,
> +        .fno = gen_helper_gvec_add32,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_add64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_add_i64,
> +        .fno = gen_helper_gvec_add64,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_vec8_add8(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 m = tcg_const_i64(REP8(0x80));
> +    gen_addv_mask(d, a, b, m);
> +    tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_add16(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 m = tcg_const_i64(REP16(0x8000));
> +    gen_addv_mask(d, a, b, m);
> +    tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_add32(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +
> +    tcg_gen_andi_i64(t1, a, ~0xffffffffull);
> +    tcg_gen_add_i64(t2, a, b);
> +    tcg_gen_add_i64(t1, t1, b);
> +    tcg_gen_deposit_i64(d, t1, t2, 0, 32);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +static void gen_subv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 t3 = tcg_temp_new_i64();
> +
> +    tcg_gen_or_i64(t1, a, m);
> +    tcg_gen_andc_i64(t2, b, m);
> +    tcg_gen_eqv_i64(t3, a, b);
> +    tcg_gen_sub_i64(d, t1, t2);
> +    tcg_gen_and_i64(t3, t3, m);
> +    tcg_gen_xor_i64(d, d, t3);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +    tcg_temp_free_i64(t3);
> +}
> +
> +void tcg_gen_gvec_sub8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .extra_value = REP8(0x80),
> +        .fni8x = gen_subv_mask,
> +        .fno = gen_helper_gvec_sub8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_sub16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .extra_value = REP16(0x8000),
> +        .fni8x = gen_subv_mask,
> +        .fno = gen_helper_gvec_sub16,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_sub32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni4 = tcg_gen_sub_i32,
> +        .fno = gen_helper_gvec_sub32,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_sub64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_sub_i64,
> +        .fno = gen_helper_gvec_sub64,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_vec8_sub8(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 m = tcg_const_i64(REP8(0x80));
> +    gen_subv_mask(d, a, b, m);
> +    tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_sub16(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 m = tcg_const_i64(REP16(0x8000));
> +    gen_subv_mask(d, a, b, m);
> +    tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_sub32(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +
> +    tcg_gen_andi_i64(t1, b, ~0xffffffffull);
> +    tcg_gen_sub_i64(t2, a, b);
> +    tcg_gen_sub_i64(t1, a, t1);
> +    tcg_gen_deposit_i64(d, t1, t2, 0, 32);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +void tcg_gen_gvec_and8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_and_i64,
> +        .fno = gen_helper_gvec_and8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_or8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                      uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_or_i64,
> +        .fno = gen_helper_gvec_or8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_xor8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_xor_i64,
> +        .fno = gen_helper_gvec_xor8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_andc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_andc_i64,
> +        .fno = gen_helper_gvec_andc8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_orc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_orc_i64,
> +        .fno = gen_helper_gvec_orc8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> diff --git a/tcg/tcg-runtime-gvec.c b/tcg/tcg-runtime-gvec.c
> new file mode 100644
> index 0000000000..9a37ce07a2
> --- /dev/null
> +++ b/tcg/tcg-runtime-gvec.c
> @@ -0,0 +1,199 @@
> +/*
> + *  Generic vectorized operation runtime
> + *
> + *  Copyright (c) 2017 Linaro
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/host-utils.h"
> +#include "cpu.h"
> +#include "exec/helper-proto.h"
> +
> +/* Virtually all hosts support 16-byte vectors.  Those that don't
> +   can emulate them via GCC's generic vector extension.
> +
> +   In tcg-op-gvec.c, we asserted that both the size and alignment
> +   of the data are multiples of 16.  */
> +
> +typedef uint8_t vec8 __attribute__((vector_size(16)));
> +typedef uint16_t vec16 __attribute__((vector_size(16)));
> +typedef uint32_t vec32 __attribute__((vector_size(16)));
> +typedef uint64_t vec64 __attribute__((vector_size(16)));
> +
> +static inline intptr_t extract_opsz(uint32_t desc)
> +{
> +    return ((desc & 0xff) + 1) * 16;
> +}
> +
> +static inline intptr_t extract_clsz(uint32_t desc)
> +{
> +    return (((desc >> 8) & 0xff) + 1) * 16;
> +}
> +
> +static inline void clear_high(void *d, intptr_t opsz, uint32_t desc)
> +{
> +    intptr_t clsz = extract_clsz(desc);
> +    intptr_t i;
> +
> +    if (unlikely(clsz > opsz)) {
> +        for (i = opsz; i < clsz; i += sizeof(vec64)) {
> +            *(vec64 *)(d + i) = (vec64){ 0 };
> +        }
> +    }
> +}
> +
> +void HELPER(gvec_add8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec8)) {
> +        *(vec8 *)(d + i) = *(vec8 *)(a + i) + *(vec8 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_add16)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec16)) {
> +        *(vec16 *)(d + i) = *(vec16 *)(a + i) + *(vec16 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_add32)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec32)) {
> +        *(vec32 *)(d + i) = *(vec32 *)(a + i) + *(vec32 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_add64)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) + *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec8)) {
> +        *(vec8 *)(d + i) = *(vec8 *)(a + i) - *(vec8 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub16)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec16)) {
> +        *(vec16 *)(d + i) = *(vec16 *)(a + i) - *(vec16 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub32)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec32)) {
> +        *(vec32 *)(d + i) = *(vec32 *)(a + i) - *(vec32 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub64)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) - *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_and8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) & *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_or8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) | *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_xor8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) ^ *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_andc8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) &~ *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_orc8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) |~ *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> 


* Re: [Qemu-devel] [PATCH 4/8] tcg: Add operations for host vectors
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 4/8] tcg: Add operations " Richard Henderson
@ 2017-08-30  1:34   ` Philippe Mathieu-Daudé
  2017-09-07 19:00   ` Alex Bennée
  1 sibling, 0 replies; 36+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-08-30  1:34 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: qemu-arm, alex.bennee

On 08/17/2017 08:01 PM, Richard Henderson wrote:
> Nothing uses or implements them yet.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

> ---
>   tcg/tcg-opc.h | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   tcg/tcg.h     | 24 ++++++++++++++++
>   2 files changed, 113 insertions(+)
> 
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index 956fb1e9f3..9162125fac 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -206,6 +206,95 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
>   
>   #undef TLADDR_ARGS
>   #undef DATA64_ARGS
> +
> +/* Host integer vector operations.  */
> +/* These opcodes are required whenever the base vector size is enabled.  */
> +
> +DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
> +DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
> +DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
> +DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
> +DEF(ld_v256, 1, 1, 1, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(st_v64, 0, 2, 1, IMPL(TCG_TARGET_HAS_v64))
> +DEF(st_v128, 0, 2, 1, IMPL(TCG_TARGET_HAS_v128))
> +DEF(st_v256, 0, 2, 1, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(and_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(and_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(and_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(or_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(or_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(or_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(xor_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(xor_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(xor_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(add8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(add16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(add32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +
> +DEF(add8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(add16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(add32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(add64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +
> +DEF(add8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(add16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(add32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(add64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(sub8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(sub16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(sub32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +
> +DEF(sub8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(sub16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(sub32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(sub64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +
> +DEF(sub8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(sub16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(sub32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(sub64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +/* These opcodes are optional.
> +   All element counts must be supported if any are.  */
> +
> +DEF(not_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v64))
> +DEF(not_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v128))
> +DEF(not_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v256))
> +
> +DEF(andc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v64))
> +DEF(andc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v128))
> +DEF(andc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v256))
> +
> +DEF(orc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v64))
> +DEF(orc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v128))
> +DEF(orc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v256))
> +
> +DEF(neg8_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
> +DEF(neg16_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
> +DEF(neg32_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
> +
> +DEF(neg8_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +DEF(neg16_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +DEF(neg32_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +DEF(neg64_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +
> +DEF(neg8_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +DEF(neg16_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +DEF(neg32_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +DEF(neg64_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +
>   #undef IMPL
>   #undef IMPL64
>   #undef DEF
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 1277caed3d..b9e15da13b 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -166,6 +166,30 @@ typedef uint64_t TCGRegSet;
>   #define TCG_TARGET_HAS_rem_i64          0
>   #endif
>   
> +#ifndef TCG_TARGET_HAS_v64
> +#define TCG_TARGET_HAS_v64              0
> +#define TCG_TARGET_HAS_andc_v64         0
> +#define TCG_TARGET_HAS_orc_v64          0
> +#define TCG_TARGET_HAS_not_v64          0
> +#define TCG_TARGET_HAS_neg_v64          0
> +#endif
> +
> +#ifndef TCG_TARGET_HAS_v128
> +#define TCG_TARGET_HAS_v128             0
> +#define TCG_TARGET_HAS_andc_v128        0
> +#define TCG_TARGET_HAS_orc_v128         0
> +#define TCG_TARGET_HAS_not_v128         0
> +#define TCG_TARGET_HAS_neg_v128         0
> +#endif
> +
> +#ifndef TCG_TARGET_HAS_v256
> +#define TCG_TARGET_HAS_v256             0
> +#define TCG_TARGET_HAS_andc_v256        0
> +#define TCG_TARGET_HAS_orc_v256         0
> +#define TCG_TARGET_HAS_not_v256         0
> +#define TCG_TARGET_HAS_neg_v256         0
> +#endif
> +
>   /* For 32-bit targets, some sort of unsigned widening multiply is required.  */
>   #if TCG_TARGET_REG_BITS == 32 \
>       && !(defined(TCG_TARGET_HAS_mulu2_i32) \
> 


* Re: [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic
  2017-08-30  1:31   ` Philippe Mathieu-Daudé
@ 2017-09-01 20:38     ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-09-01 20:38 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel; +Cc: qemu-arm, alex.bennee

On 08/29/2017 06:31 PM, Philippe Mathieu-Daudé wrote:
> Hi Richard,
> 
> I can't find anything to say about this patch... Hardcore stuff.
> Some parts could be a bit more verbose, but after focusing on it for a
> while it makes sense.
> I wonder how long it took you to write this :) "roughly 2h"

Not quite that quickly.  ;-)
You're absolutely right that it needs lots more documentation.
I'll improve that when it comes to round 2.


r~


* Re: [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic Richard Henderson
  2017-08-30  1:31   ` Philippe Mathieu-Daudé
@ 2017-09-07 16:34   ` Alex Bennée
  1 sibling, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-09-07 16:34 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  Makefile.target        |   5 +-
>  tcg/tcg-op-gvec.h      |  88 ++++++++++
>  tcg/tcg-runtime.h      |  16 ++
>  tcg/tcg-op-gvec.c      | 443 +++++++++++++++++++++++++++++++++++++++++++++++++
>  tcg/tcg-runtime-gvec.c | 199 ++++++++++++++++++++++
>  5 files changed, 749 insertions(+), 2 deletions(-)
>  create mode 100644 tcg/tcg-op-gvec.h
>  create mode 100644 tcg/tcg-op-gvec.c
>  create mode 100644 tcg/tcg-runtime-gvec.c
>
> diff --git a/Makefile.target b/Makefile.target
> index 7f42c45db8..9ae3e904f7 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -93,8 +93,9 @@ all: $(PROGS) stap
>  # cpu emulator library
>  obj-y += exec.o
>  obj-y += accel/
> -obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-op.o tcg/optimize.o
> -obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/tcg-runtime.o
> +obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-common.o tcg/optimize.o
> +obj-$(CONFIG_TCG) += tcg/tcg-op.o tcg/tcg-op-gvec.o
> +obj-$(CONFIG_TCG) += tcg/tcg-runtime.o tcg/tcg-runtime-gvec.o
>  obj-$(CONFIG_TCG_INTERPRETER) += tcg/tci.o
>  obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
>  obj-y += fpu/softfloat.o
> diff --git a/tcg/tcg-op-gvec.h b/tcg/tcg-op-gvec.h
> new file mode 100644
> index 0000000000..10db3599a5
> --- /dev/null
> +++ b/tcg/tcg-op-gvec.h
> @@ -0,0 +1,88 @@
> +/*
> + *  Generic vector operation expansion
> + *
> + *  Copyright (c) 2017 Linaro
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +/*
> + * "Generic" vectors.  All operands are given as offsets from ENV,
> + * and therefore cannot also be allocated via tcg_global_mem_new_*.
> + * OPSZ is the byte size of the vector upon which the operation is performed.
> + * CLSZ is the byte size of the full vector; bytes beyond OPSZ are cleared.
> + *
> + * All sizes must be 8 or any multiple of 16.
> + * When OPSZ is 8, the alignment may be 8, otherwise must be 16.
> + * Operands may completely, but not partially, overlap.

Isn't this going to be a problem for narrowing/widening Rn->Rn
operations?  Should we say so explicitly here?

> + */
> +
> +/* Fundamental operation expanders.  These are exposed to the front ends
> +   so that target-specific SIMD operations can be handled similarly to
> +   the standard SIMD operations.  */
> +
> +typedef struct {
> +    /* "Small" sizes: expand inline as a 64-bit or 32-bit lane.
> +       Generally only one of these will be non-NULL.  */

Generally or always?  After all, we go through these in a fixed order
and expand the first one that is defined.

> +    void (*fni8)(TCGv_i64, TCGv_i64, TCGv_i64);
> +    void (*fni4)(TCGv_i32, TCGv_i32, TCGv_i32);
> +    /* Similarly, but load up a constant and re-use across lanes.  */
> +    void (*fni8x)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
> +    uint64_t extra_value;

Probably personal preference, but I'd put extra_value and any other
non-function-pointer members at the end of the structure for cleaner
readability.

> +    /* Larger sizes: expand out-of-line helper w/size descriptor.  */
> +    void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);
> +} GVecGen3;
> +
> +void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                    uint32_t opsz, uint32_t clsz, const GVecGen3 *);
> +

Why GVecGen3 and tcg_gen_gvec_3? It seems a little arbitrary.

> +#define DEF_GVEC_2(X) \
> +    void tcg_gen_gvec_##X(uint32_t dofs, uint32_t aofs, uint32_t bofs, \
> +                          uint32_t opsz, uint32_t clsz)
> +
> +DEF_GVEC_2(add8);
> +DEF_GVEC_2(add16);
> +DEF_GVEC_2(add32);
> +DEF_GVEC_2(add64);
> +
> +DEF_GVEC_2(sub8);
> +DEF_GVEC_2(sub16);
> +DEF_GVEC_2(sub32);
> +DEF_GVEC_2(sub64);
> +
> +DEF_GVEC_2(and8);
> +DEF_GVEC_2(or8);
> +DEF_GVEC_2(xor8);
> +DEF_GVEC_2(andc8);
> +DEF_GVEC_2(orc8);
> +
> +#undef DEF_GVEC_2
> +
> +/*
> + * 64-bit vector operations.  Use these when the register has been
> + * allocated with tcg_global_mem_new_i64.  OPSZ = CLSZ = 8.
> + */
> +
> +#define DEF_VEC8_2(X) \
> +    void tcg_gen_vec8_##X(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +
> +DEF_VEC8_2(add8);
> +DEF_VEC8_2(add16);
> +DEF_VEC8_2(add32);
> +
> +DEF_VEC8_2(sub8);
> +DEF_VEC8_2(sub16);
> +DEF_VEC8_2(sub32);
> +
> +#undef DEF_VEC8_2

Again GVEC_2 and VEC8_2 don't tell me much.

> diff --git a/tcg/tcg-runtime.h b/tcg/tcg-runtime.h
> index c41d38a557..f8d07090f8 100644
> --- a/tcg/tcg-runtime.h
> +++ b/tcg/tcg-runtime.h
> @@ -134,3 +134,19 @@ GEN_ATOMIC_HELPERS(xor_fetch)
>  GEN_ATOMIC_HELPERS(xchg)
>
>  #undef GEN_ATOMIC_HELPERS
> +
> +DEF_HELPER_FLAGS_4(gvec_add8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_add16, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_add32, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_add64, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +
> +DEF_HELPER_FLAGS_4(gvec_sub8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sub16, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sub32, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sub64, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +
> +DEF_HELPER_FLAGS_4(gvec_and8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_or8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_xor8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_andc8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_orc8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
> new file mode 100644
> index 0000000000..6de49dc07f
> --- /dev/null
> +++ b/tcg/tcg-op-gvec.c
> @@ -0,0 +1,443 @@
> +/*
> + *  Generic vector operation expansion
> + *
> + *  Copyright (c) 2017 Linaro
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "cpu.h"
> +#include "exec/exec-all.h"
> +#include "tcg.h"
> +#include "tcg-op.h"
> +#include "tcg-op-gvec.h"
> +#include "trace-tcg.h"
> +#include "trace/mem.h"
> +
> +#define REP8(x)    ((x) * 0x0101010101010101ull)
> +#define REP16(x)   ((x) * 0x0001000100010001ull)
> +
> +#define MAX_INLINE 16
> +
> +static inline void check_size_s(uint32_t opsz, uint32_t clsz)
> +{
> +    tcg_debug_assert(opsz % 8 == 0);
> +    tcg_debug_assert(clsz % 8 == 0);
> +    tcg_debug_assert(opsz <= clsz);
> +}
> +
> +static inline void check_align_s_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
> +{
> +    tcg_debug_assert(dofs % 8 == 0);
> +    tcg_debug_assert(aofs % 8 == 0);
> +    tcg_debug_assert(bofs % 8 == 0);
> +}
> +
> +static inline void check_size_l(uint32_t opsz, uint32_t clsz)
> +{
> +    tcg_debug_assert(opsz % 16 == 0);
> +    tcg_debug_assert(clsz % 16 == 0);
> +    tcg_debug_assert(opsz <= clsz);
> +}
> +
> +static inline void check_align_l_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
> +{
> +    tcg_debug_assert(dofs % 16 == 0);
> +    tcg_debug_assert(aofs % 16 == 0);
> +    tcg_debug_assert(bofs % 16 == 0);
> +}
> +
> +static inline void check_overlap_3(uint32_t d, uint32_t a,
> +                                   uint32_t b, uint32_t s)
> +{
> +    tcg_debug_assert(d == a || d + s <= a || a + s <= d);
> +    tcg_debug_assert(d == b || d + s <= b || b + s <= d);
> +    tcg_debug_assert(a == b || a + s <= b || b + s <= a);
> +}
> +
> +static void expand_clr(uint32_t dofs, uint32_t opsz, uint32_t clsz)
> +{
> +    if (clsz > opsz) {
> +        TCGv_i64 zero = tcg_const_i64(0);
> +        uint32_t i;
> +
> +        for (i = opsz; i < clsz; i += 8) {
> +            tcg_gen_st_i64(zero, tcg_ctx.tcg_env, dofs + i);
> +        }
> +        tcg_temp_free_i64(zero);
> +    }
> +}
> +
> +static TCGv_i32 make_desc(uint32_t opsz, uint32_t clsz)

A comment about the encoding of opdata into the constant probably
wouldn't go amiss. Should we have some inline helpers to extract the
data for the actual implementations?
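
On the first point, something like this above make_desc would already
help (only a sketch; I read the field layout off the deposit32() call
below, so double-check me):

    /* Pack the two vector sizes into a 32-bit descriptor:
     *   bits [7:0]   (opsz / 16) - 1
     *   bits [15:8]  (clsz / 16) - 1
     * The out-of-line helpers in tcg-runtime-gvec.c decode this again.
     */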

> +{
> +    tcg_debug_assert(opsz >= 16 && opsz <= 255 * 16 && opsz % 16 == 0);
> +    tcg_debug_assert(clsz >= 16 && clsz <= 255 * 16 && clsz % 16 == 0);
> +    opsz /= 16;
> +    clsz /= 16;
> +    opsz -= 1;
> +    clsz -= 1;
> +    return tcg_const_i32(deposit32(opsz, 8, 8, clsz));
> +}
> +
> +static void expand_3_o(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t clsz,
> +                       void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32))

Hmm, this copies the function pointer definition; maybe these should be
typedefs, declared with comments in tcg-op-gvec.h?
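
E.g. something along these lines (names invented, just to show what I
mean):

    typedef void GVecGenFn3_i32(TCGv_i32, TCGv_i32, TCGv_i32);
    typedef void GVecGenFn3_i64(TCGv_i64, TCGv_i64, TCGv_i64);
    typedef void GVecGenFn3_i64x(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
    typedef void GVecGenFn3_ool(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);

Then the GVecGen3 fields and the expand_* prototypes could both be
spelled e.g. "GVecGenFn3_ool *fno" and would stay in sync.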

> +{
> +    TCGv_ptr d = tcg_temp_new_ptr();
> +    TCGv_ptr a = tcg_temp_new_ptr();
> +    TCGv_ptr b = tcg_temp_new_ptr();
> +    TCGv_i32 desc = make_desc(opsz, clsz);
> +
> +    tcg_gen_addi_ptr(d, tcg_ctx.tcg_env, dofs);
> +    tcg_gen_addi_ptr(a, tcg_ctx.tcg_env, aofs);
> +    tcg_gen_addi_ptr(b, tcg_ctx.tcg_env, bofs);
> +    fno(d, a, b, desc);
> +
> +    tcg_temp_free_ptr(d);
> +    tcg_temp_free_ptr(a);
> +    tcg_temp_free_ptr(b);
> +    tcg_temp_free_i32(desc);
> +}
> +
> +static void expand_3x4(uint32_t dofs, uint32_t aofs,
> +                       uint32_t bofs, uint32_t opsz,
> +                       void (*fni)(TCGv_i32, TCGv_i32, TCGv_i32))

Ditto typedef?

> +{
> +    TCGv_i32 t0 = tcg_temp_new_i32();
> +    uint32_t i;
> +
> +    if (aofs == bofs) {
> +        for (i = 0; i < opsz; i += 4) {
> +            tcg_gen_ld_i32(t0, tcg_ctx.tcg_env, aofs + i);
> +            fni(t0, t0, t0);
> +            tcg_gen_st_i32(t0, tcg_ctx.tcg_env, dofs + i);
> +        }
> +    } else {
> +        TCGv_i32 t1 = tcg_temp_new_i32();
> +        for (i = 0; i < opsz; i += 4) {
> +            tcg_gen_ld_i32(t0, tcg_ctx.tcg_env, aofs + i);
> +            tcg_gen_ld_i32(t1, tcg_ctx.tcg_env, bofs + i);
> +            fni(t0, t0, t1);
> +            tcg_gen_st_i32(t0, tcg_ctx.tcg_env, dofs + i);
> +        }
> +        tcg_temp_free_i32(t1);
> +    }
> +    tcg_temp_free_i32(t0);
> +}
> +
> +static void expand_3x8(uint32_t dofs, uint32_t aofs,
> +                       uint32_t bofs, uint32_t opsz,
> +                       void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64))
> +{
> +    TCGv_i64 t0 = tcg_temp_new_i64();
> +    uint32_t i;
> +
> +    if (aofs == bofs) {
> +        for (i = 0; i < opsz; i += 8) {
> +            tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> +            fni(t0, t0, t0);
> +            tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> +        }
> +    } else {
> +        TCGv_i64 t1 = tcg_temp_new_i64();
> +        for (i = 0; i < opsz; i += 8) {
> +            tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> +            tcg_gen_ld_i64(t1, tcg_ctx.tcg_env, bofs + i);
> +            fni(t0, t0, t1);
> +            tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> +        }
> +        tcg_temp_free_i64(t1);
> +    }
> +    tcg_temp_free_i64(t0);
> +}
> +
> +static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                         uint32_t opsz, uint64_t data,
> +                         void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64))

Again typedef

I don't quite follow the suffixes of the expanders.  I guess _o is for
offset, but what is p1?  Either we need a mini comment for each
expander or a more obvious suffix scheme...
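
If I've guessed the meanings right (_o = out-of-line helper, p1 = plus
one constant operand), then even a strawman like this would have helped
me:

    /* expand_3_ool  - expand via out-of-line helper, sized by descriptor
     * expand_3x4    - expand inline as 32-bit lanes
     * expand_3x8    - expand inline as 64-bit lanes
     * expand_3x8_c  - expand inline as 64-bit lanes plus a constant
     */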

> +{
> +    TCGv_i64 t0 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_const_i64(data);
> +    uint32_t i;
> +
> +    if (aofs == bofs) {
> +        for (i = 0; i < opsz; i += 8) {
> +            tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> +            fni(t0, t0, t0, t2);
> +            tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> +        }
> +    } else {
> +        TCGv_i64 t1 = tcg_temp_new_i64();
> +        for (i = 0; i < opsz; i += 8) {
> +            tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> +            tcg_gen_ld_i64(t1, tcg_ctx.tcg_env, bofs + i);
> +            fni(t0, t0, t1, t2);
> +            tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> +        }
> +        tcg_temp_free_i64(t1);
> +    }
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                    uint32_t opsz, uint32_t clsz, const GVecGen3 *g)
> +{
> +    check_overlap_3(dofs, aofs, bofs, clsz);
> +    if (opsz <= MAX_INLINE) {
> +        check_size_s(opsz, clsz);
> +        check_align_s_3(dofs, aofs, bofs);
> +        if (g->fni8) {
> +            expand_3x8(dofs, aofs, bofs, opsz, g->fni8);
> +        } else if (g->fni4) {
> +            expand_3x4(dofs, aofs, bofs, opsz, g->fni4);
> +        } else if (g->fni8x) {
> +            expand_3x8p1(dofs, aofs, bofs, opsz, g->extra_value, g->fni8x);
> +        } else {
> +            g_assert_not_reached();
> +        }
> +        expand_clr(dofs, opsz, clsz);
> +    } else {
> +        check_size_l(opsz, clsz);
> +        check_align_l_3(dofs, aofs, bofs);
> +        expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
> +    }
> +}
> +
> +static void gen_addv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 t3 = tcg_temp_new_i64();
> +
> +    tcg_gen_andc_i64(t1, a, m);
> +    tcg_gen_andc_i64(t2, b, m);
> +    tcg_gen_xor_i64(t3, a, b);
> +    tcg_gen_add_i64(d, t1, t2);
> +    tcg_gen_and_i64(t3, t3, m);
> +    tcg_gen_xor_i64(d, d, t3);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +    tcg_temp_free_i64(t3);
> +}
> +
> +void tcg_gen_gvec_add8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .extra_value = REP8(0x80),
> +        .fni8x = gen_addv_mask,
> +        .fno = gen_helper_gvec_add8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_add16(uint32_t dofs, uint32_t aofs, uint32_t bofs,

> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .extra_value = REP16(0x8000),
> +        .fni8x = gen_addv_mask,
> +        .fno = gen_helper_gvec_add16,

OK, now I'm confused - we have two functions here, but tcg_gen_gvec_3
expands only one of them depending on which leg the opsz check takes.
One is a mask function and the other uses plain adds?

> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_add32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni4 = tcg_gen_add_i32,
> +        .fno = gen_helper_gvec_add32,

Ahh, OK, I see here: use the native add_i32 for small sizes and pass to
the generic helper for larger vectors.  Still confused about the
previous expander though...
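
If I'm reading gen_addv_mask right, the scalar equivalent of the trick
is roughly this (my own paraphrase, not taken from the patch, so
correct me if I have it backwards):

    /* Eight 8-bit lanes added inside one uint64_t, m = REP8(0x80).  */
    static uint64_t swar_add8(uint64_t a, uint64_t b)
    {
        uint64_t m = 0x8080808080808080ull;
        /* Clear bit 7 of every byte so the per-byte sums cannot carry
           across lane boundaries...  */
        uint64_t lo = (a & ~m) + (b & ~m);
        /* ...then fix up bit 7 of each byte: its true value is
           a ^ b ^ carry-in, and 'lo' already holds that carry-in.  */
        return lo ^ ((a ^ b) & m);
    }

i.e. a full 64-bit add with the cross-lane carries suppressed, which is
presumably why the same fni8x expander serves both the 8-bit and 16-bit
cases with different masks.  Is that the idea?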

> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_add64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_add_i64,
> +        .fno = gen_helper_gvec_add64,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_vec8_add8(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 m = tcg_const_i64(REP8(0x80));
> +    gen_addv_mask(d, a, b, m);
> +    tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_add16(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 m = tcg_const_i64(REP16(0x8000));
> +    gen_addv_mask(d, a, b, m);
> +    tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_add32(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +
> +    tcg_gen_andi_i64(t1, a, ~0xffffffffull);
> +    tcg_gen_add_i64(t2, a, b);
> +    tcg_gen_add_i64(t1, t1, b);
> +    tcg_gen_deposit_i64(d, t1, t2, 0, 32);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +static void gen_subv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 t3 = tcg_temp_new_i64();
> +
> +    tcg_gen_or_i64(t1, a, m);
> +    tcg_gen_andc_i64(t2, b, m);
> +    tcg_gen_eqv_i64(t3, a, b);
> +    tcg_gen_sub_i64(d, t1, t2);
> +    tcg_gen_and_i64(t3, t3, m);
> +    tcg_gen_xor_i64(d, d, t3);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +    tcg_temp_free_i64(t3);
> +}
> +
> +void tcg_gen_gvec_sub8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .extra_value = REP8(0x80),
> +        .fni8x = gen_subv_mask,
> +        .fno = gen_helper_gvec_sub8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_sub16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .extra_value = REP16(0x8000),
> +        .fni8x = gen_subv_mask,
> +        .fno = gen_helper_gvec_sub16,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_sub32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni4 = tcg_gen_sub_i32,
> +        .fno = gen_helper_gvec_sub32,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_sub64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_sub_i64,
> +        .fno = gen_helper_gvec_sub64,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_vec8_sub8(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 m = tcg_const_i64(REP8(0x80));
> +    gen_subv_mask(d, a, b, m);
> +    tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_sub16(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 m = tcg_const_i64(REP16(0x8000));
> +    gen_subv_mask(d, a, b, m);
> +    tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_sub32(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +
> +    tcg_gen_andi_i64(t1, b, ~0xffffffffull);
> +    tcg_gen_sub_i64(t2, a, b);
> +    tcg_gen_sub_i64(t1, a, t1);
> +    tcg_gen_deposit_i64(d, t1, t2, 0, 32);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +void tcg_gen_gvec_and8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_and_i64,
> +        .fno = gen_helper_gvec_and8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_or8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                      uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_or_i64,
> +        .fno = gen_helper_gvec_or8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_xor8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_xor_i64,
> +        .fno = gen_helper_gvec_xor8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_andc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                        uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_andc_i64,
> +        .fno = gen_helper_gvec_andc8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_orc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t clsz)
> +{
> +    static const GVecGen3 g = {
> +        .fni8 = tcg_gen_orc_i64,
> +        .fno = gen_helper_gvec_orc8,
> +    };
> +    tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> diff --git a/tcg/tcg-runtime-gvec.c b/tcg/tcg-runtime-gvec.c
> new file mode 100644
> index 0000000000..9a37ce07a2
> --- /dev/null
> +++ b/tcg/tcg-runtime-gvec.c
> @@ -0,0 +1,199 @@
> +/*
> + *  Generic vectorized operation runtime
> + *
> + *  Copyright (c) 2017 Linaro
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/host-utils.h"
> +#include "cpu.h"
> +#include "exec/helper-proto.h"
> +
> +/* Virtually all hosts support 16-byte vectors.  Those that don't
> +   can emulate them via GCC's generic vector extension.
> +
> +   In tcg-op-gvec.c, we asserted that both the size and alignment
> +   of the data are multiples of 16.  */
> +
> +typedef uint8_t vec8 __attribute__((vector_size(16)));
> +typedef uint16_t vec16 __attribute__((vector_size(16)));
> +typedef uint32_t vec32 __attribute__((vector_size(16)));
> +typedef uint64_t vec64 __attribute__((vector_size(16)));
> +
> +static inline intptr_t extract_opsz(uint32_t desc)
> +{
> +    return ((desc & 0xff) + 1) * 16;
> +}
> +
> +static inline intptr_t extract_clsz(uint32_t desc)
> +{
> +    return (((desc >> 8) & 0xff) + 1) * 16;
> +}

Ahh, the data helpers.  Any reason we don't use extract32() here,
whereas we used deposit32() at the other end?  It should generate the
most efficient code, right?
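
I was thinking of something along these lines, ideally shared from a
header so both ends agree on the layout (helper names invented):

    #include "qemu/bitops.h"  /* extract32 */

    static inline intptr_t simd_opsz(uint32_t desc)
    {
        /* bits [7:0] hold (opsz / 16) - 1 */
        return (extract32(desc, 0, 8) + 1) * 16;
    }

    static inline intptr_t simd_clsz(uint32_t desc)
    {
        /* bits [15:8] hold (clsz / 16) - 1 */
        return (extract32(desc, 8, 8) + 1) * 16;
    }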

> +
> +static inline void clear_high(void *d, intptr_t opsz, uint32_t desc)
> +{
> +    intptr_t clsz = extract_clsz(desc);
> +    intptr_t i;
> +
> +    if (unlikely(clsz > opsz)) {
> +        for (i = opsz; i < clsz; i += sizeof(vec64)) {
> +            *(vec64 *)(d + i) = (vec64){ 0 };
> +        }
> +    }
> +}
> +
> +void HELPER(gvec_add8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec8)) {
> +        *(vec8 *)(d + i) = *(vec8 *)(a + i) + *(vec8 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_add16)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec16)) {
> +        *(vec16 *)(d + i) = *(vec16 *)(a + i) + *(vec16 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_add32)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec32)) {
> +        *(vec32 *)(d + i) = *(vec32 *)(a + i) + *(vec32 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_add64)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) + *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec8)) {
> +        *(vec8 *)(d + i) = *(vec8 *)(a + i) - *(vec8 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub16)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec16)) {
> +        *(vec16 *)(d + i) = *(vec16 *)(a + i) - *(vec16 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub32)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec32)) {
> +        *(vec32 *)(d + i) = *(vec32 *)(a + i) - *(vec32 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub64)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) - *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_and8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) & *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_or8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) | *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_xor8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) ^ *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_andc8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) &~ *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_orc8)(void *d, void *a, void *b, uint32_t desc)
> +{
> +    intptr_t opsz = extract_opsz(desc);
> +    intptr_t i;
> +
> +    for (i = 0; i < opsz; i += sizeof(vec64)) {
> +        *(vec64 *)(d + i) = *(vec64 *)(a + i) |~ *(vec64 *)(b + i);
> +    }
> +    clear_high(d, opsz, desc);
> +}

OK I can follow the helpers easily enough. I think the generators just
need to be a little clearer for non-authors to follow ;-)

--
Alex Bennée


* Re: [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic Richard Henderson
@ 2017-09-07 16:58   ` Alex Bennée
  2017-09-10  1:43     ` Richard Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-07 16:58 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  target/arm/translate-a64.c | 137 ++++++++++++++++++++++++++++-----------------
>  1 file changed, 87 insertions(+), 50 deletions(-)
>
> diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
> index 2200e25be0..025354f983 100644
> --- a/target/arm/translate-a64.c
> +++ b/target/arm/translate-a64.c
> @@ -21,6 +21,7 @@
>  #include "cpu.h"
>  #include "exec/exec-all.h"
>  #include "tcg-op.h"
> +#include "tcg-op-gvec.h"
>  #include "qemu/log.h"
>  #include "arm_ldst.h"
>  #include "translate.h"
> @@ -82,6 +83,7 @@ typedef void NeonGenTwoDoubleOPFn(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_ptr);
>  typedef void NeonGenOneOpFn(TCGv_i64, TCGv_i64);
>  typedef void CryptoTwoOpEnvFn(TCGv_ptr, TCGv_i32, TCGv_i32);
>  typedef void CryptoThreeOpEnvFn(TCGv_ptr, TCGv_i32, TCGv_i32, TCGv_i32);
> +typedef void GVecGenTwoFn(uint32_t, uint32_t, uint32_t, uint32_t, uint32_t);
>
>  /* initialize TCG globals.  */
>  void a64_translate_init(void)
> @@ -537,6 +539,21 @@ static inline int vec_reg_offset(DisasContext *s, int regno,
>      return offs;
>  }
>
> +/* Return the offset info CPUARMState of the "whole" vector register Qn.  */
> +static inline int vec_full_reg_offset(DisasContext *s, int regno)
> +{
> +    assert_fp_access_checked(s);
> +    return offsetof(CPUARMState, vfp.regs[regno * 2]);
> +}
> +
> +/* Return the byte size of the "whole" vector register, VL / 8.  */
> +static inline int vec_full_reg_size(DisasContext *s)
> +{
> +    /* FIXME SVE: We should put the composite ZCR_EL* value into tb->flags.
> +       In the meantime this is just the AdvSIMD length of 128.  */
> +    return 128 / 8;
> +}
> +
>  /* Return the offset into CPUARMState of a slice (from
>   * the least significant end) of FP register Qn (ie
>   * Dn, Sn, Hn or Bn).
> @@ -9042,11 +9059,38 @@ static void disas_simd_3same_logic(DisasContext *s, uint32_t insn)
>      bool is_q = extract32(insn, 30, 1);
>      TCGv_i64 tcg_op1, tcg_op2, tcg_res[2];
>      int pass;
> +    GVecGenTwoFn *gvec_op;
>
>      if (!fp_access_check(s)) {
>          return;
>      }
>
> +    switch (size + 4 * is_u) {

Hmm, I find this switch a little too magical.  Looking at the manual I
can see that the encoding abuses the size field to select the final
opcode, but it reads badly.

> +    case 0: /* AND */
> +        gvec_op = tcg_gen_gvec_and8;
> +        goto do_gvec;
> +    case 1: /* BIC */
> +        gvec_op = tcg_gen_gvec_andc8;
> +        goto do_gvec;
> +    case 2: /* ORR */
> +        gvec_op = tcg_gen_gvec_or8;
> +        goto do_gvec;
> +    case 3: /* ORN */
> +        gvec_op = tcg_gen_gvec_orc8;
> +        goto do_gvec;
> +    case 4: /* EOR */
> +        gvec_op = tcg_gen_gvec_xor8;
> +        goto do_gvec;
> +    do_gvec:
> +        gvec_op(vec_full_reg_offset(s, rd),
> +                vec_full_reg_offset(s, rn),
> +                vec_full_reg_offset(s, rm),
> +                is_q ? 16 : 8, vec_full_reg_size(s));
> +        return;

There is no default case (although I guess we just fall through).
What's wrong with just having a !is_u test with gvec_op = tbl[size] and
skipping all the goto stuff?
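
Just to make that concrete - a rough sketch reusing the expanders from
patch 1 (EOR is the only is_u case handled here, so it still needs its
own line):

    static GVecGenTwoFn * const logic_fns[4] = {
        tcg_gen_gvec_and8,   /* AND */
        tcg_gen_gvec_andc8,  /* BIC */
        tcg_gen_gvec_or8,    /* ORR */
        tcg_gen_gvec_orc8,   /* ORN */
    };

    if (!is_u) {
        gvec_op = logic_fns[size];
    } else if (size == 0) {
        gvec_op = tcg_gen_gvec_xor8;  /* EOR */
    } else {
        gvec_op = NULL;
    }
    if (gvec_op) {
        gvec_op(vec_full_reg_offset(s, rd),
                vec_full_reg_offset(s, rn),
                vec_full_reg_offset(s, rm),
                is_q ? 16 : 8, vec_full_reg_size(s));
        return;
    }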

> +    }
> +
> +    /* Note that we've now eliminated all !is_u.  */
> +
>      tcg_op1 = tcg_temp_new_i64();
>      tcg_op2 = tcg_temp_new_i64();
>      tcg_res[0] = tcg_temp_new_i64();
> @@ -9056,47 +9100,27 @@ static void disas_simd_3same_logic(DisasContext *s, uint32_t insn)
>          read_vec_element(s, tcg_op1, rn, pass, MO_64);
>          read_vec_element(s, tcg_op2, rm, pass, MO_64);
>
> -        if (!is_u) {
> -            switch (size) {
> -            case 0: /* AND */
> -                tcg_gen_and_i64(tcg_res[pass], tcg_op1, tcg_op2);
> -                break;
> -            case 1: /* BIC */
> -                tcg_gen_andc_i64(tcg_res[pass], tcg_op1, tcg_op2);
> -                break;
> -            case 2: /* ORR */
> -                tcg_gen_or_i64(tcg_res[pass], tcg_op1, tcg_op2);
> -                break;
> -            case 3: /* ORN */
> -                tcg_gen_orc_i64(tcg_res[pass], tcg_op1, tcg_op2);
> -                break;
> -            }
> -        } else {
> -            if (size != 0) {
> -                /* B* ops need res loaded to operate on */
> -                read_vec_element(s, tcg_res[pass], rd, pass, MO_64);
> -            }
> +        /* B* ops need res loaded to operate on */
> +        read_vec_element(s, tcg_res[pass], rd, pass, MO_64);
>
> -            switch (size) {
> -            case 0: /* EOR */
> -                tcg_gen_xor_i64(tcg_res[pass], tcg_op1, tcg_op2);
> -                break;
> -            case 1: /* BSL bitwise select */
> -                tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_op2);
> -                tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_res[pass]);
> -                tcg_gen_xor_i64(tcg_res[pass], tcg_op2, tcg_op1);
> -                break;
> -            case 2: /* BIT, bitwise insert if true */
> -                tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
> -                tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_op2);
> -                tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
> -                break;
> -            case 3: /* BIF, bitwise insert if false */
> -                tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
> -                tcg_gen_andc_i64(tcg_op1, tcg_op1, tcg_op2);
> -                tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
> -                break;
> -            }
> +        switch (size) {
> +        case 1: /* BSL bitwise select */
> +            tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_op2);
> +            tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_res[pass]);
> +            tcg_gen_xor_i64(tcg_res[pass], tcg_op2, tcg_op1);
> +            break;
> +        case 2: /* BIT, bitwise insert if true */
> +            tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
> +            tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_op2);
> +            tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
> +            break;
> +        case 3: /* BIF, bitwise insert if false */
> +            tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
> +            tcg_gen_andc_i64(tcg_op1, tcg_op1, tcg_op2);
> +            tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
> +            break;
> +        default:
> +            g_assert_not_reached();
>          }
>      }
>
> @@ -9370,6 +9394,7 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
>      int rn = extract32(insn, 5, 5);
>      int rd = extract32(insn, 0, 5);
>      int pass;
> +    GVecGenTwoFn *gvec_op;
>
>      switch (opcode) {
>      case 0x13: /* MUL, PMUL */
> @@ -9409,6 +9434,28 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
>          return;
>      }
>
> +    switch (opcode) {
> +    case 0x10: /* ADD, SUB */
> +        {
> +            static GVecGenTwoFn * const fns[4][2] = {
> +                { tcg_gen_gvec_add8, tcg_gen_gvec_sub8 },
> +                { tcg_gen_gvec_add16, tcg_gen_gvec_sub16 },
> +                { tcg_gen_gvec_add32, tcg_gen_gvec_sub32 },
> +                { tcg_gen_gvec_add64, tcg_gen_gvec_sub64 },
> +            };
> +            gvec_op = fns[size][u];
> +            goto do_gvec;
> +        }
> +        break;
> +
> +    do_gvec:
> +        gvec_op(vec_full_reg_offset(s, rd),
> +                vec_full_reg_offset(s, rn),
> +                vec_full_reg_offset(s, rm),
> +                is_q ? 16 : 8, vec_full_reg_size(s));
> +        return;
> +    }
> +
>      if (size == 3) {
>          assert(is_q);
>          for (pass = 0; pass < 2; pass++) {
> @@ -9581,16 +9628,6 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
>                  genfn = fns[size][u];
>                  break;
>              }
> -            case 0x10: /* ADD, SUB */
> -            {
> -                static NeonGenTwoOpFn * const fns[3][2] = {
> -                    { gen_helper_neon_add_u8, gen_helper_neon_sub_u8 },
> -                    { gen_helper_neon_add_u16, gen_helper_neon_sub_u16 },
> -                    { tcg_gen_add_i32, tcg_gen_sub_i32 },
> -                };
> -                genfn = fns[size][u];
> -                break;
> -            }
>              case 0x11: /* CMTST, CMEQ */
>              {
>                  static NeonGenTwoOpFn * const fns[3][2] = {

Other than the comments on the switch, the rest looks good to me.

--
Alex Bennée


* Re: [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors Richard Henderson
  2017-08-17 23:46   ` Philippe Mathieu-Daudé
@ 2017-09-07 18:18   ` Alex Bennée
  1 sibling, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-09-07 18:18 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> Nothing uses or enables them yet.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  tcg/tcg.h | 5 +++++
>  tcg/tcg.c | 2 +-
>  2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index dd97095af5..1277caed3d 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -256,6 +256,11 @@ typedef struct TCGPool {
>  typedef enum TCGType {
>      TCG_TYPE_I32,
>      TCG_TYPE_I64,
> +
> +    TCG_TYPE_V64,
> +    TCG_TYPE_V128,
> +    TCG_TYPE_V256,
> +
>      TCG_TYPE_COUNT, /* number of different types */
>
>      /* An alias for the size of the host register.  */
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 787c8ba0f7..ea78d47fad 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -118,7 +118,7 @@ static TCGReg tcg_reg_alloc_new(TCGContext *s, TCGType t)
>  static bool tcg_out_ldst_finalize(TCGContext *s);
>  #endif
>
> -static TCGRegSet tcg_target_available_regs[2];
> +static TCGRegSet tcg_target_available_regs[TCG_TYPE_COUNT];
>  static TCGRegSet tcg_target_call_clobber_regs;
>
>  #if TCG_TARGET_INSN_UNIT_SIZE == 1


--
Alex Bennée


* Re: [Qemu-devel] [PATCH 4/8] tcg: Add operations for host vectors
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 4/8] tcg: Add operations " Richard Henderson
  2017-08-30  1:34   ` Philippe Mathieu-Daudé
@ 2017-09-07 19:00   ` Alex Bennée
  2017-09-07 19:02     ` Richard Henderson
  1 sibling, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-07 19:00 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> Nothing uses or implements them yet.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/tcg-opc.h | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tcg/tcg.h     | 24 ++++++++++++++++
>  2 files changed, 113 insertions(+)
>
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index 956fb1e9f3..9162125fac 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -206,6 +206,95 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
>
>  #undef TLADDR_ARGS
>  #undef DATA64_ARGS
> +
> +/* Host integer vector operations.  */
> +/* These opcodes are required whenever the base vector size is enabled.  */
> +
> +DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
> +DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
> +DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
> +DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
> +DEF(ld_v256, 1, 1, 1, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(st_v64, 0, 2, 1, IMPL(TCG_TARGET_HAS_v64))
> +DEF(st_v128, 0, 2, 1, IMPL(TCG_TARGET_HAS_v128))
> +DEF(st_v256, 0, 2, 1, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(and_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(and_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(and_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(or_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(or_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(or_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(xor_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(xor_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(xor_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(add8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(add16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(add32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +
> +DEF(add8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(add16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(add32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(add64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +
> +DEF(add8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(add16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(add32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(add64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(sub8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(sub16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(sub32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +
> +DEF(sub8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(sub16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(sub32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(sub64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +
> +DEF(sub8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(sub16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(sub32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(sub64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +/* These opcodes are optional.
> +   All element counts must be supported if any are.  */
> +
> +DEF(not_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v64))
> +DEF(not_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v128))
> +DEF(not_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v256))
> +
> +DEF(andc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v64))
> +DEF(andc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v128))
> +DEF(andc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v256))
> +
> +DEF(orc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v64))
> +DEF(orc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v128))
> +DEF(orc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v256))
> +
> +DEF(neg8_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
> +DEF(neg16_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
> +DEF(neg32_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
> +
> +DEF(neg8_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +DEF(neg16_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +DEF(neg32_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +DEF(neg64_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +
> +DEF(neg8_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +DEF(neg16_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +DEF(neg32_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +DEF(neg64_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +
>  #undef IMPL
>  #undef IMPL64
>  #undef DEF
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 1277caed3d..b9e15da13b 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -166,6 +166,30 @@ typedef uint64_t TCGRegSet;
>  #define TCG_TARGET_HAS_rem_i64          0
>  #endif
>
> +#ifndef TCG_TARGET_HAS_v64
> +#define TCG_TARGET_HAS_v64              0
> +#define TCG_TARGET_HAS_andc_v64         0
> +#define TCG_TARGET_HAS_orc_v64          0
> +#define TCG_TARGET_HAS_not_v64          0
> +#define TCG_TARGET_HAS_neg_v64          0
> +#endif
> +
> +#ifndef TCG_TARGET_HAS_v128
> +#define TCG_TARGET_HAS_v128             0
> +#define TCG_TARGET_HAS_andc_v128        0
> +#define TCG_TARGET_HAS_orc_v128         0
> +#define TCG_TARGET_HAS_not_v128         0
> +#define TCG_TARGET_HAS_neg_v128         0
> +#endif
> +
> +#ifndef TCG_TARGET_HAS_v256
> +#define TCG_TARGET_HAS_v256             0
> +#define TCG_TARGET_HAS_andc_v256        0
> +#define TCG_TARGET_HAS_orc_v256         0
> +#define TCG_TARGET_HAS_not_v256         0
> +#define TCG_TARGET_HAS_neg_v256         0
> +#endif

Is it possible to use the DEF expanders to avoid manually defining all
the TCG_TARGET_HAS_op for each vector size?

> +
>  /* For 32-bit targets, some sort of unsigned widening multiply is required.  */
>  #if TCG_TARGET_REG_BITS == 32 \
>      && !(defined(TCG_TARGET_HAS_mulu2_i32) \


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported Richard Henderson
  2017-08-17 23:44   ` Philippe Mathieu-Daudé
@ 2017-09-07 19:02   ` Alex Bennée
  1 sibling, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-09-07 19:02 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  tcg/tcg.h |   2 +
>  tcg/tcg.c | 310 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 312 insertions(+)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index b9e15da13b..b443143b21 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -962,6 +962,8 @@ do {\
>  #define tcg_temp_free_ptr(T) tcg_temp_free_i64(TCGV_PTR_TO_NAT(T))
>  #endif
>
> +bool tcg_op_supported(TCGOpcode op);
> +
>  void tcg_gen_callN(TCGContext *s, void *func,
>                     TCGArg ret, int nargs, TCGArg *args);
>
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index ea78d47fad..3c3cdda938 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -751,6 +751,316 @@ int tcg_check_temp_count(void)
>  }
>  #endif
>
> +/* Return true if OP may appear in the opcode stream.
> +   Test the runtime variable that controls each opcode.  */
> +bool tcg_op_supported(TCGOpcode op)
> +{
> +    switch (op) {
> +    case INDEX_op_discard:
> +    case INDEX_op_set_label:
> +    case INDEX_op_call:
> +    case INDEX_op_br:
> +    case INDEX_op_mb:
> +    case INDEX_op_insn_start:
> +    case INDEX_op_exit_tb:
> +    case INDEX_op_goto_tb:
> +    case INDEX_op_qemu_ld_i32:
> +    case INDEX_op_qemu_st_i32:
> +    case INDEX_op_qemu_ld_i64:
> +    case INDEX_op_qemu_st_i64:
> +        return true;
> +
> +    case INDEX_op_goto_ptr:
> +        return TCG_TARGET_HAS_goto_ptr;
> +
> +    case INDEX_op_mov_i32:
> +    case INDEX_op_movi_i32:
> +    case INDEX_op_setcond_i32:
> +    case INDEX_op_brcond_i32:
> +    case INDEX_op_ld8u_i32:
> +    case INDEX_op_ld8s_i32:
> +    case INDEX_op_ld16u_i32:
> +    case INDEX_op_ld16s_i32:
> +    case INDEX_op_ld_i32:
> +    case INDEX_op_st8_i32:
> +    case INDEX_op_st16_i32:
> +    case INDEX_op_st_i32:
> +    case INDEX_op_add_i32:
> +    case INDEX_op_sub_i32:
> +    case INDEX_op_mul_i32:
> +    case INDEX_op_and_i32:
> +    case INDEX_op_or_i32:
> +    case INDEX_op_xor_i32:
> +    case INDEX_op_shl_i32:
> +    case INDEX_op_shr_i32:
> +    case INDEX_op_sar_i32:
> +        return true;
> +
> +    case INDEX_op_movcond_i32:
> +        return TCG_TARGET_HAS_movcond_i32;
> +    case INDEX_op_div_i32:
> +    case INDEX_op_divu_i32:
> +        return TCG_TARGET_HAS_div_i32;
> +    case INDEX_op_rem_i32:
> +    case INDEX_op_remu_i32:
> +        return TCG_TARGET_HAS_rem_i32;
> +    case INDEX_op_div2_i32:
> +    case INDEX_op_divu2_i32:
> +        return TCG_TARGET_HAS_div2_i32;
> +    case INDEX_op_rotl_i32:
> +    case INDEX_op_rotr_i32:
> +        return TCG_TARGET_HAS_rot_i32;
> +    case INDEX_op_deposit_i32:
> +        return TCG_TARGET_HAS_deposit_i32;
> +    case INDEX_op_extract_i32:
> +        return TCG_TARGET_HAS_extract_i32;
> +    case INDEX_op_sextract_i32:
> +        return TCG_TARGET_HAS_sextract_i32;
> +    case INDEX_op_add2_i32:
> +        return TCG_TARGET_HAS_add2_i32;
> +    case INDEX_op_sub2_i32:
> +        return TCG_TARGET_HAS_sub2_i32;
> +    case INDEX_op_mulu2_i32:
> +        return TCG_TARGET_HAS_mulu2_i32;
> +    case INDEX_op_muls2_i32:
> +        return TCG_TARGET_HAS_muls2_i32;
> +    case INDEX_op_muluh_i32:
> +        return TCG_TARGET_HAS_muluh_i32;
> +    case INDEX_op_mulsh_i32:
> +        return TCG_TARGET_HAS_mulsh_i32;
> +    case INDEX_op_ext8s_i32:
> +        return TCG_TARGET_HAS_ext8s_i32;
> +    case INDEX_op_ext16s_i32:
> +        return TCG_TARGET_HAS_ext16s_i32;
> +    case INDEX_op_ext8u_i32:
> +        return TCG_TARGET_HAS_ext8u_i32;
> +    case INDEX_op_ext16u_i32:
> +        return TCG_TARGET_HAS_ext16u_i32;
> +    case INDEX_op_bswap16_i32:
> +        return TCG_TARGET_HAS_bswap16_i32;
> +    case INDEX_op_bswap32_i32:
> +        return TCG_TARGET_HAS_bswap32_i32;
> +    case INDEX_op_not_i32:
> +        return TCG_TARGET_HAS_not_i32;
> +    case INDEX_op_neg_i32:
> +        return TCG_TARGET_HAS_neg_i32;
> +    case INDEX_op_andc_i32:
> +        return TCG_TARGET_HAS_andc_i32;
> +    case INDEX_op_orc_i32:
> +        return TCG_TARGET_HAS_orc_i32;
> +    case INDEX_op_eqv_i32:
> +        return TCG_TARGET_HAS_eqv_i32;
> +    case INDEX_op_nand_i32:
> +        return TCG_TARGET_HAS_nand_i32;
> +    case INDEX_op_nor_i32:
> +        return TCG_TARGET_HAS_nor_i32;
> +    case INDEX_op_clz_i32:
> +        return TCG_TARGET_HAS_clz_i32;
> +    case INDEX_op_ctz_i32:
> +        return TCG_TARGET_HAS_ctz_i32;
> +    case INDEX_op_ctpop_i32:
> +        return TCG_TARGET_HAS_ctpop_i32;
> +
> +    case INDEX_op_brcond2_i32:
> +    case INDEX_op_setcond2_i32:
> +        return TCG_TARGET_REG_BITS == 32;
> +
> +    case INDEX_op_mov_i64:
> +    case INDEX_op_movi_i64:
> +    case INDEX_op_setcond_i64:
> +    case INDEX_op_brcond_i64:
> +    case INDEX_op_ld8u_i64:
> +    case INDEX_op_ld8s_i64:
> +    case INDEX_op_ld16u_i64:
> +    case INDEX_op_ld16s_i64:
> +    case INDEX_op_ld32u_i64:
> +    case INDEX_op_ld32s_i64:
> +    case INDEX_op_ld_i64:
> +    case INDEX_op_st8_i64:
> +    case INDEX_op_st16_i64:
> +    case INDEX_op_st32_i64:
> +    case INDEX_op_st_i64:
> +    case INDEX_op_add_i64:
> +    case INDEX_op_sub_i64:
> +    case INDEX_op_mul_i64:
> +    case INDEX_op_and_i64:
> +    case INDEX_op_or_i64:
> +    case INDEX_op_xor_i64:
> +    case INDEX_op_shl_i64:
> +    case INDEX_op_shr_i64:
> +    case INDEX_op_sar_i64:
> +    case INDEX_op_ext_i32_i64:
> +    case INDEX_op_extu_i32_i64:
> +        return TCG_TARGET_REG_BITS == 64;
> +
> +    case INDEX_op_movcond_i64:
> +        return TCG_TARGET_HAS_movcond_i64;
> +    case INDEX_op_div_i64:
> +    case INDEX_op_divu_i64:
> +        return TCG_TARGET_HAS_div_i64;
> +    case INDEX_op_rem_i64:
> +    case INDEX_op_remu_i64:
> +        return TCG_TARGET_HAS_rem_i64;
> +    case INDEX_op_div2_i64:
> +    case INDEX_op_divu2_i64:
> +        return TCG_TARGET_HAS_div2_i64;
> +    case INDEX_op_rotl_i64:
> +    case INDEX_op_rotr_i64:
> +        return TCG_TARGET_HAS_rot_i64;
> +    case INDEX_op_deposit_i64:
> +        return TCG_TARGET_HAS_deposit_i64;
> +    case INDEX_op_extract_i64:
> +        return TCG_TARGET_HAS_extract_i64;
> +    case INDEX_op_sextract_i64:
> +        return TCG_TARGET_HAS_sextract_i64;
> +    case INDEX_op_extrl_i64_i32:
> +        return TCG_TARGET_HAS_extrl_i64_i32;
> +    case INDEX_op_extrh_i64_i32:
> +        return TCG_TARGET_HAS_extrh_i64_i32;
> +    case INDEX_op_ext8s_i64:
> +        return TCG_TARGET_HAS_ext8s_i64;
> +    case INDEX_op_ext16s_i64:
> +        return TCG_TARGET_HAS_ext16s_i64;
> +    case INDEX_op_ext32s_i64:
> +        return TCG_TARGET_HAS_ext32s_i64;
> +    case INDEX_op_ext8u_i64:
> +        return TCG_TARGET_HAS_ext8u_i64;
> +    case INDEX_op_ext16u_i64:
> +        return TCG_TARGET_HAS_ext16u_i64;
> +    case INDEX_op_ext32u_i64:
> +        return TCG_TARGET_HAS_ext32u_i64;
> +    case INDEX_op_bswap16_i64:
> +        return TCG_TARGET_HAS_bswap16_i64;
> +    case INDEX_op_bswap32_i64:
> +        return TCG_TARGET_HAS_bswap32_i64;
> +    case INDEX_op_bswap64_i64:
> +        return TCG_TARGET_HAS_bswap64_i64;
> +    case INDEX_op_not_i64:
> +        return TCG_TARGET_HAS_not_i64;
> +    case INDEX_op_neg_i64:
> +        return TCG_TARGET_HAS_neg_i64;
> +    case INDEX_op_andc_i64:
> +        return TCG_TARGET_HAS_andc_i64;
> +    case INDEX_op_orc_i64:
> +        return TCG_TARGET_HAS_orc_i64;
> +    case INDEX_op_eqv_i64:
> +        return TCG_TARGET_HAS_eqv_i64;
> +    case INDEX_op_nand_i64:
> +        return TCG_TARGET_HAS_nand_i64;
> +    case INDEX_op_nor_i64:
> +        return TCG_TARGET_HAS_nor_i64;
> +    case INDEX_op_clz_i64:
> +        return TCG_TARGET_HAS_clz_i64;
> +    case INDEX_op_ctz_i64:
> +        return TCG_TARGET_HAS_ctz_i64;
> +    case INDEX_op_ctpop_i64:
> +        return TCG_TARGET_HAS_ctpop_i64;
> +    case INDEX_op_add2_i64:
> +        return TCG_TARGET_HAS_add2_i64;
> +    case INDEX_op_sub2_i64:
> +        return TCG_TARGET_HAS_sub2_i64;
> +    case INDEX_op_mulu2_i64:
> +        return TCG_TARGET_HAS_mulu2_i64;
> +    case INDEX_op_muls2_i64:
> +        return TCG_TARGET_HAS_muls2_i64;
> +    case INDEX_op_muluh_i64:
> +        return TCG_TARGET_HAS_muluh_i64;
> +    case INDEX_op_mulsh_i64:
> +        return TCG_TARGET_HAS_mulsh_i64;
> +
> +    case INDEX_op_mov_v64:
> +    case INDEX_op_movi_v64:
> +    case INDEX_op_ld_v64:
> +    case INDEX_op_st_v64:
> +    case INDEX_op_and_v64:
> +    case INDEX_op_or_v64:
> +    case INDEX_op_xor_v64:
> +    case INDEX_op_add8_v64:
> +    case INDEX_op_add16_v64:
> +    case INDEX_op_add32_v64:
> +    case INDEX_op_sub8_v64:
> +    case INDEX_op_sub16_v64:
> +    case INDEX_op_sub32_v64:
> +        return TCG_TARGET_HAS_v64;
> +
> +    case INDEX_op_mov_v128:
> +    case INDEX_op_movi_v128:
> +    case INDEX_op_ld_v128:
> +    case INDEX_op_st_v128:
> +    case INDEX_op_and_v128:
> +    case INDEX_op_or_v128:
> +    case INDEX_op_xor_v128:
> +    case INDEX_op_add8_v128:
> +    case INDEX_op_add16_v128:
> +    case INDEX_op_add32_v128:
> +    case INDEX_op_add64_v128:
> +    case INDEX_op_sub8_v128:
> +    case INDEX_op_sub16_v128:
> +    case INDEX_op_sub32_v128:
> +    case INDEX_op_sub64_v128:
> +        return TCG_TARGET_HAS_v128;
> +
> +    case INDEX_op_mov_v256:
> +    case INDEX_op_movi_v256:
> +    case INDEX_op_ld_v256:
> +    case INDEX_op_st_v256:
> +    case INDEX_op_and_v256:
> +    case INDEX_op_or_v256:
> +    case INDEX_op_xor_v256:
> +    case INDEX_op_add8_v256:
> +    case INDEX_op_add16_v256:
> +    case INDEX_op_add32_v256:
> +    case INDEX_op_add64_v256:
> +    case INDEX_op_sub8_v256:
> +    case INDEX_op_sub16_v256:
> +    case INDEX_op_sub32_v256:
> +    case INDEX_op_sub64_v256:
> +        return TCG_TARGET_HAS_v256;
> +
> +    case INDEX_op_not_v64:
> +        return TCG_TARGET_HAS_not_v64;
> +    case INDEX_op_not_v128:
> +        return TCG_TARGET_HAS_not_v128;
> +    case INDEX_op_not_v256:
> +        return TCG_TARGET_HAS_not_v256;
> +
> +    case INDEX_op_andc_v64:
> +        return TCG_TARGET_HAS_andc_v64;
> +    case INDEX_op_andc_v128:
> +        return TCG_TARGET_HAS_andc_v128;
> +    case INDEX_op_andc_v256:
> +        return TCG_TARGET_HAS_andc_v256;
> +
> +    case INDEX_op_orc_v64:
> +        return TCG_TARGET_HAS_orc_v64;
> +    case INDEX_op_orc_v128:
> +        return TCG_TARGET_HAS_orc_v128;
> +    case INDEX_op_orc_v256:
> +        return TCG_TARGET_HAS_orc_v256;
> +
> +    case INDEX_op_neg8_v64:
> +    case INDEX_op_neg16_v64:
> +    case INDEX_op_neg32_v64:
> +        return TCG_TARGET_HAS_neg_v64;
> +
> +    case INDEX_op_neg8_v128:
> +    case INDEX_op_neg16_v128:
> +    case INDEX_op_neg32_v128:
> +    case INDEX_op_neg64_v128:
> +        return TCG_TARGET_HAS_neg_v128;
> +
> +    case INDEX_op_neg8_v256:
> +    case INDEX_op_neg16_v256:
> +    case INDEX_op_neg32_v256:
> +    case INDEX_op_neg64_v256:
> +        return TCG_TARGET_HAS_neg_v256;
> +
> +    case NB_OPS:
> +        break;
> +    }
> +    g_assert_not_reached();
> +}
> +
>  /* Note: we convert the 64 bit args to 32 bit and do some alignment
>     and endian swap. Maybe it would be better to do the alignment
>     and endian swap in tcg_reg_alloc_call(). */
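
As a rough sketch of how this predicate is intended to be used (the code
below mirrors the generic expander quoted later in this thread in PATCH
7/8, so the names come from that patch rather than from this one):

    /* Only emit a host vector opcode if the backend enabled it; otherwise
       fall through to a smaller element size or the out-of-line helper.  */
    if (check_size_impl(opsz, 16) && tcg_op_supported(g->op_v128)) {
        uint32_t done = QEMU_ALIGN_DOWN(opsz, 16);
        expand_3_v(dofs, aofs, bofs, done, 16, TCG_TYPE_V128,
                   g->op_v128, INDEX_op_ld_v128, INDEX_op_st_v128);
        /* ... then advance the offsets and reduce opsz by done ... */
    }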


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH 4/8] tcg: Add operations for host vectors
  2017-09-07 19:00   ` Alex Bennée
@ 2017-09-07 19:02     ` Richard Henderson
  2017-09-08  9:28       ` Alex Bennée
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-09-07 19:02 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, qemu-arm

On 09/07/2017 12:00 PM, Alex Bennée wrote:
> 
> Richard Henderson <richard.henderson@linaro.org> writes:
> 
>> Nothing uses or implements them yet.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>  tcg/tcg-opc.h | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  tcg/tcg.h     | 24 ++++++++++++++++
>>  2 files changed, 113 insertions(+)
>>
>> [...]
>> diff --git a/tcg/tcg.h b/tcg/tcg.h
>> index 1277caed3d..b9e15da13b 100644
>> --- a/tcg/tcg.h
>> +++ b/tcg/tcg.h
>> @@ -166,6 +166,30 @@ typedef uint64_t TCGRegSet;
>>  #define TCG_TARGET_HAS_rem_i64          0
>>  #endif
>>
>> +#ifndef TCG_TARGET_HAS_v64
>> +#define TCG_TARGET_HAS_v64              0
>> +#define TCG_TARGET_HAS_andc_v64         0
>> +#define TCG_TARGET_HAS_orc_v64          0
>> +#define TCG_TARGET_HAS_not_v64          0
>> +#define TCG_TARGET_HAS_neg_v64          0
>> +#endif
>> +
>> +#ifndef TCG_TARGET_HAS_v128
>> +#define TCG_TARGET_HAS_v128             0
>> +#define TCG_TARGET_HAS_andc_v128        0
>> +#define TCG_TARGET_HAS_orc_v128         0
>> +#define TCG_TARGET_HAS_not_v128         0
>> +#define TCG_TARGET_HAS_neg_v128         0
>> +#endif
>> +
>> +#ifndef TCG_TARGET_HAS_v256
>> +#define TCG_TARGET_HAS_v256             0
>> +#define TCG_TARGET_HAS_andc_v256        0
>> +#define TCG_TARGET_HAS_orc_v256         0
>> +#define TCG_TARGET_HAS_not_v256         0
>> +#define TCG_TARGET_HAS_neg_v256         0
>> +#endif
> 
> Is it possible to use the DEF expanders to avoid manually defining all
> the TCG_TARGET_HAS_op for each vector size?

No.  The preprocessor doesn't work that way.


r~

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH 4/8] tcg: Add operations for host vectors
  2017-09-07 19:02     ` Richard Henderson
@ 2017-09-08  9:28       ` Alex Bennée
  0 siblings, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-09-08  9:28 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> On 09/07/2017 12:00 PM, Alex Bennée wrote:
>>
>> Richard Henderson <richard.henderson@linaro.org> writes:
>>
>>> Nothing uses or implements them yet.
>>>
>>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>>> ---
>>>  tcg/tcg-opc.h | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  tcg/tcg.h     | 24 ++++++++++++++++
>>>  2 files changed, 113 insertions(+)
>>>
>>> [...]
>>
>> Is it possible to use the DEF expanders to avoid manually defining all
>> the TCG_TARGET_HAS_op for each vector size?
>
> No.  The preprocessor doesn't work that way.

Ahh, I follow now: tcg-target.h defines TCG_TARGET_HAS_foo for all the
ops it supports, and this boilerplate ensures there is a concrete define
for the targets that don't support them (yet).
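
A minimal sketch of that pattern, assuming a hypothetical backend that only
implements 128-bit vectors (the macro names are the ones from the quoted
patch; the backend file is illustrative):

    /* tcg/foo/tcg-target.h: the backend advertises what it implements.  */
    #define TCG_TARGET_HAS_v128             1
    #define TCG_TARGET_HAS_not_v128         1

    /* tcg/tcg.h: anything left undefined falls back to 0, so DEF()/IMPL()
       users can test the macros unconditionally.  */
    #ifndef TCG_TARGET_HAS_v128
    #define TCG_TARGET_HAS_v128             0
    #define TCG_TARGET_HAS_not_v128         0
    #endif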

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid Richard Henderson
  2017-08-17 23:45   ` Philippe Mathieu-Daudé
@ 2017-09-08  9:30   ` Alex Bennée
  1 sibling, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-09-08  9:30 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> Add with value 0 so that structure zero initialization can
> indicate that the field is not present.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  tcg/tcg-opc.h | 2 ++
>  tcg/tcg.c     | 3 +++
>  2 files changed, 5 insertions(+)
>
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index 9162125fac..b1445a4c24 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -26,6 +26,8 @@
>   * DEF(name, oargs, iargs, cargs, flags)
>   */
>
> +DEF(invalid, 0, 0, 0, TCG_OPF_NOT_PRESENT)
> +
>  /* predefined ops */
>  DEF(discard, 1, 0, 0, TCG_OPF_NOT_PRESENT)
>  DEF(set_label, 0, 0, 1, TCG_OPF_BB_END | TCG_OPF_NOT_PRESENT)
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 3c3cdda938..879b29e81f 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -756,6 +756,9 @@ int tcg_check_temp_count(void)
>  bool tcg_op_supported(TCGOpcode op)
>  {
>      switch (op) {
> +    case INDEX_op_invalid:
> +        return false;
> +
>      case INDEX_op_discard:
>      case INDEX_op_set_label:
>      case INDEX_op_call:
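
A small illustration of why the zero value matters, based on how the later
expander patch (PATCH 7/8) fills in GVecGen3 (the struct and field names
are from that patch):

    /* With designated initializers, any opcode field left unset stays 0,
       i.e. INDEX_op_invalid, which tcg_op_supported() reports as false.  */
    static const GVecGen3 g = {
        .fni8 = tcg_gen_add_i64,
        .op_v256 = INDEX_op_add64_v256,
        .op_v128 = INDEX_op_add64_v128,
        /* .op_v64 left unset: there is no add64_v64, so it stays invalid.  */
        .fno = gen_helper_gvec_add64,
    };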


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH 7/8] tcg: Expand target vector ops with host vector ops
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 7/8] tcg: Expand target vector ops with host vector ops Richard Henderson
@ 2017-09-08  9:34   ` Alex Bennée
  0 siblings, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-09-08  9:34 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

I can see where this is going, but I'll defer the review until v2, with
the extra verbosity added to the original expander patch.

> ---
>  tcg/tcg-op-gvec.h |   4 +
>  tcg/tcg.h         |   6 +-
>  tcg/tcg-op-gvec.c | 230 +++++++++++++++++++++++++++++++++++++++++++-----------
>  tcg/tcg.c         |   8 +-
>  4 files changed, 197 insertions(+), 51 deletions(-)
>
> diff --git a/tcg/tcg-op-gvec.h b/tcg/tcg-op-gvec.h
> index 10db3599a5..99f36d208e 100644
> --- a/tcg/tcg-op-gvec.h
> +++ b/tcg/tcg-op-gvec.h
> @@ -40,6 +40,10 @@ typedef struct {
>      /* Similarly, but load up a constant and re-use across lanes.  */
>      void (*fni8x)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
>      uint64_t extra_value;
> +    /* Operations with host vector ops.  */
> +    TCGOpcode op_v256;
> +    TCGOpcode op_v128;
> +    TCGOpcode op_v64;
>      /* Larger sizes: expand out-of-line helper w/size descriptor.  */
>      void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);
>  } GVecGen3;
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index b443143b21..7f10501d31 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -825,9 +825,11 @@ int tcg_global_mem_new_internal(TCGType, TCGv_ptr, intptr_t, const char *);
>  TCGv_i32 tcg_global_reg_new_i32(TCGReg reg, const char *name);
>  TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name);
>
> -TCGv_i32 tcg_temp_new_internal_i32(int temp_local);
> -TCGv_i64 tcg_temp_new_internal_i64(int temp_local);
> +int tcg_temp_new_internal(TCGType type, bool temp_local);
> +TCGv_i32 tcg_temp_new_internal_i32(bool temp_local);
> +TCGv_i64 tcg_temp_new_internal_i64(bool temp_local);
>
> +void tcg_temp_free_internal(int arg);
>  void tcg_temp_free_i32(TCGv_i32 arg);
>  void tcg_temp_free_i64(TCGv_i64 arg);
>
> diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
> index 6de49dc07f..3aca565dc0 100644
> --- a/tcg/tcg-op-gvec.c
> +++ b/tcg/tcg-op-gvec.c
> @@ -30,54 +30,73 @@
>  #define REP8(x)    ((x) * 0x0101010101010101ull)
>  #define REP16(x)   ((x) * 0x0001000100010001ull)
>
> -#define MAX_INLINE 16
> +#define MAX_UNROLL  4
>
> -static inline void check_size_s(uint32_t opsz, uint32_t clsz)
> +static inline void check_size_align(uint32_t opsz, uint32_t clsz, uint32_t ofs)
>  {
> -    tcg_debug_assert(opsz % 8 == 0);
> -    tcg_debug_assert(clsz % 8 == 0);
> +    uint32_t align = clsz > 16 || opsz >= 16 ? 15 : 7;
> +    tcg_debug_assert(opsz > 0);
>      tcg_debug_assert(opsz <= clsz);
> +    tcg_debug_assert((opsz & align) == 0);
> +    tcg_debug_assert((clsz & align) == 0);
> +    tcg_debug_assert((ofs & align) == 0);
>  }
>
> -static inline void check_align_s_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
> +static inline void check_overlap_3(uint32_t d, uint32_t a,
> +                                   uint32_t b, uint32_t s)
>  {
> -    tcg_debug_assert(dofs % 8 == 0);
> -    tcg_debug_assert(aofs % 8 == 0);
> -    tcg_debug_assert(bofs % 8 == 0);
> +    tcg_debug_assert(d == a || d + s <= a || a + s <= d);
> +    tcg_debug_assert(d == b || d + s <= b || b + s <= d);
> +    tcg_debug_assert(a == b || a + s <= b || b + s <= a);
>  }
>
> -static inline void check_size_l(uint32_t opsz, uint32_t clsz)
> +static inline bool check_size_impl(uint32_t opsz, uint32_t lnsz)
>  {
> -    tcg_debug_assert(opsz % 16 == 0);
> -    tcg_debug_assert(clsz % 16 == 0);
> -    tcg_debug_assert(opsz <= clsz);
> +    uint32_t lnct = opsz / lnsz;
> +    return lnct >= 1 && lnct <= MAX_UNROLL;
>  }
>
> -static inline void check_align_l_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
> +static void expand_clr_v(uint32_t dofs, uint32_t clsz, uint32_t lnsz,
> +                         TCGType type, TCGOpcode opc_mv, TCGOpcode opc_st)
>  {
> -    tcg_debug_assert(dofs % 16 == 0);
> -    tcg_debug_assert(aofs % 16 == 0);
> -    tcg_debug_assert(bofs % 16 == 0);
> -}
> +    TCGArg t0 = tcg_temp_new_internal(type, 0);
> +    TCGArg env = GET_TCGV_PTR(tcg_ctx.tcg_env);
> +    uint32_t i;
>
> -static inline void check_overlap_3(uint32_t d, uint32_t a,
> -                                   uint32_t b, uint32_t s)
> -{
> -    tcg_debug_assert(d == a || d + s <= a || a + s <= d);
> -    tcg_debug_assert(d == b || d + s <= b || b + s <= d);
> -    tcg_debug_assert(a == b || a + s <= b || b + s <= a);
> +    tcg_gen_op2(&tcg_ctx, opc_mv, t0, 0);
> +    for (i = 0; i < clsz; i += lnsz) {
> +        tcg_gen_op3(&tcg_ctx, opc_st, t0, env, dofs + i);
> +    }
> +    tcg_temp_free_internal(t0);
>  }
>
> -static void expand_clr(uint32_t dofs, uint32_t opsz, uint32_t clsz)
> +static void expand_clr(uint32_t dofs, uint32_t clsz)
>  {
> -    if (clsz > opsz) {
> -        TCGv_i64 zero = tcg_const_i64(0);
> -        uint32_t i;
> +    if (clsz >= 32 && TCG_TARGET_HAS_v256) {
> +        uint32_t done = QEMU_ALIGN_DOWN(clsz, 32);
> +        expand_clr_v(dofs, done, 32, TCG_TYPE_V256,
> +                     INDEX_op_movi_v256, INDEX_op_st_v256);
> +        dofs += done;
> +        clsz -= done;
> +    }
>
> -        for (i = opsz; i < clsz; i += 8) {
> -            tcg_gen_st_i64(zero, tcg_ctx.tcg_env, dofs + i);
> -        }
> -        tcg_temp_free_i64(zero);
> +    if (clsz >= 16 && TCG_TARGET_HAS_v128) {
> +        uint16_t done = QEMU_ALIGN_DOWN(clsz, 16);
> +        expand_clr_v(dofs, done, 16, TCG_TYPE_V128,
> +                     INDEX_op_movi_v128, INDEX_op_st_v128);
> +        dofs += done;
> +        clsz -= done;
> +    }
> +
> +    if (TCG_TARGET_REG_BITS == 64) {
> +        expand_clr_v(dofs, clsz, 8, TCG_TYPE_I64,
> +                     INDEX_op_movi_i64, INDEX_op_st_i64);
> +    } else if (TCG_TARGET_HAS_v64) {
> +        expand_clr_v(dofs, clsz, 8, TCG_TYPE_V64,
> +                     INDEX_op_movi_v64, INDEX_op_st_v64);
> +    } else {
> +        expand_clr_v(dofs, clsz, 4, TCG_TYPE_I32,
> +                     INDEX_op_movi_i32, INDEX_op_st_i32);
>      }
>  }
>
> @@ -164,6 +183,7 @@ static void expand_3x8(uint32_t dofs, uint32_t aofs,
>      tcg_temp_free_i64(t0);
>  }
>
> +/* FIXME: add CSE for constants and we can eliminate this.  */
>  static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>                           uint32_t opsz, uint64_t data,
>                           void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64))
> @@ -192,28 +212,111 @@ static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>      tcg_temp_free_i64(t2);
>  }
>
> +static void expand_3_v(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> +                       uint32_t opsz, uint32_t lnsz, TCGType type,
> +                       TCGOpcode opc_op, TCGOpcode opc_ld, TCGOpcode opc_st)
> +{
> +    TCGArg t0 = tcg_temp_new_internal(type, 0);
> +    TCGArg env = GET_TCGV_PTR(tcg_ctx.tcg_env);
> +    uint32_t i;
> +
> +    if (aofs == bofs) {
> +        for (i = 0; i < opsz; i += lnsz) {
> +            tcg_gen_op3(&tcg_ctx, opc_ld, t0, env, aofs + i);
> +            tcg_gen_op3(&tcg_ctx, opc_op, t0, t0, t0);
> +            tcg_gen_op3(&tcg_ctx, opc_st, t0, env, dofs + i);
> +        }
> +    } else {
> +        TCGArg t1 = tcg_temp_new_internal(type, 0);
> +        for (i = 0; i < opsz; i += lnsz) {
> +            tcg_gen_op3(&tcg_ctx, opc_ld, t0, env, aofs + i);
> +            tcg_gen_op3(&tcg_ctx, opc_ld, t1, env, bofs + i);
> +            tcg_gen_op3(&tcg_ctx, opc_op, t0, t0, t1);
> +            tcg_gen_op3(&tcg_ctx, opc_st, t0, env, dofs + i);
> +        }
> +        tcg_temp_free_internal(t1);
> +    }
> +    tcg_temp_free_internal(t0);
> +}
> +
>  void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>                      uint32_t opsz, uint32_t clsz, const GVecGen3 *g)
>  {
> +    check_size_align(opsz, clsz, dofs | aofs | bofs);
>      check_overlap_3(dofs, aofs, bofs, clsz);
> -    if (opsz <= MAX_INLINE) {
> -        check_size_s(opsz, clsz);
> -        check_align_s_3(dofs, aofs, bofs);
> -        if (g->fni8) {
> -            expand_3x8(dofs, aofs, bofs, opsz, g->fni8);
> -        } else if (g->fni4) {
> -            expand_3x4(dofs, aofs, bofs, opsz, g->fni4);
> +
> +    if (opsz > MAX_UNROLL * 32 || clsz > MAX_UNROLL * 32) {
> +        goto do_ool;
> +    }
> +
> +    /* Recall that ARM SVE allows vector sizes that are not a power of 2.
> +       Expand with successively smaller host vector sizes.  The intent is
> +       that e.g. opsz == 80 would be expanded with 2x32 + 1x16.  */
> +    /* ??? For clsz > opsz, the host may be able to use an op-sized
> +       operation, zeroing the balance of the register.  We can then
> +       use a cl-sized store to implement the clearing without an extra
> +       store operation.  This is true for aarch64 and x86_64 hosts.  */
> +
> +    if (check_size_impl(opsz, 32) && tcg_op_supported(g->op_v256)) {
> +        uint32_t done = QEMU_ALIGN_DOWN(opsz, 32);
> +        expand_3_v(dofs, aofs, bofs, done, 32, TCG_TYPE_V256,
> +                   g->op_v256, INDEX_op_ld_v256, INDEX_op_st_v256);
> +        dofs += done;
> +        aofs += done;
> +        bofs += done;
> +        opsz -= done;
> +        clsz -= done;
> +    }
> +
> +    if (check_size_impl(opsz, 16) && tcg_op_supported(g->op_v128)) {
> +        uint32_t done = QEMU_ALIGN_DOWN(opsz, 16);
> +        expand_3_v(dofs, aofs, bofs, done, 16, TCG_TYPE_V128,
> +                   g->op_v128, INDEX_op_ld_v128, INDEX_op_st_v128);
> +        dofs += done;
> +        aofs += done;
> +        bofs += done;
> +        opsz -= done;
> +        clsz -= done;
> +    }
> +
> +    if (check_size_impl(opsz, 8)) {
> +        uint32_t done = QEMU_ALIGN_DOWN(opsz, 8);
> +        if (tcg_op_supported(g->op_v64)) {
> +            expand_3_v(dofs, aofs, bofs, done, 8, TCG_TYPE_V64,
> +                       g->op_v64, INDEX_op_ld_v64, INDEX_op_st_v64);
> +        } else if (g->fni8) {
> +            expand_3x8(dofs, aofs, bofs, done, g->fni8);
>          } else if (g->fni8x) {
> -            expand_3x8p1(dofs, aofs, bofs, opsz, g->extra_value, g->fni8x);
> +            expand_3x8p1(dofs, aofs, bofs, done, g->extra_value, g->fni8x);
>          } else {
> -            g_assert_not_reached();
> +            done = 0;
>          }
> -        expand_clr(dofs, opsz, clsz);
> -    } else {
> -        check_size_l(opsz, clsz);
> -        check_align_l_3(dofs, aofs, bofs);
> -        expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
> +        dofs += done;
> +        aofs += done;
> +        bofs += done;
> +        opsz -= done;
> +        clsz -= done;
>      }
> +
> +    if (check_size_impl(opsz, 4)) {
> +        uint32_t done = QEMU_ALIGN_DOWN(opsz, 4);
> +        expand_3x4(dofs, aofs, bofs, done, g->fni4);
> +        dofs += done;
> +        aofs += done;
> +        bofs += done;
> +        opsz -= done;
> +        clsz -= done;
> +    }
> +
> +    if (opsz == 0) {
> +        if (clsz != 0) {
> +            expand_clr(dofs, clsz);
> +        }
> +        return;
> +    }
> +
> + do_ool:
> +    expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
>  }
>
>  static void gen_addv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
> @@ -240,6 +343,9 @@ void tcg_gen_gvec_add8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>      static const GVecGen3 g = {
>          .extra_value = REP8(0x80),
>          .fni8x = gen_addv_mask,
> +        .op_v256 = INDEX_op_add8_v256,
> +        .op_v128 = INDEX_op_add8_v128,
> +        .op_v64 = INDEX_op_add8_v64,
>          .fno = gen_helper_gvec_add8,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -251,6 +357,9 @@ void tcg_gen_gvec_add16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>      static const GVecGen3 g = {
>          .extra_value = REP16(0x8000),
>          .fni8x = gen_addv_mask,
> +        .op_v256 = INDEX_op_add16_v256,
> +        .op_v128 = INDEX_op_add16_v128,
> +        .op_v64 = INDEX_op_add16_v64,
>          .fno = gen_helper_gvec_add16,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -261,6 +370,9 @@ void tcg_gen_gvec_add32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>  {
>      static const GVecGen3 g = {
>          .fni4 = tcg_gen_add_i32,
> +        .op_v256 = INDEX_op_add32_v256,
> +        .op_v128 = INDEX_op_add32_v128,
> +        .op_v64 = INDEX_op_add32_v64,
>          .fno = gen_helper_gvec_add32,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -271,6 +383,8 @@ void tcg_gen_gvec_add64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>  {
>      static const GVecGen3 g = {
>          .fni8 = tcg_gen_add_i64,
> +        .op_v256 = INDEX_op_add64_v256,
> +        .op_v128 = INDEX_op_add64_v128,
>          .fno = gen_helper_gvec_add64,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -328,6 +442,9 @@ void tcg_gen_gvec_sub8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>      static const GVecGen3 g = {
>          .extra_value = REP8(0x80),
>          .fni8x = gen_subv_mask,
> +        .op_v256 = INDEX_op_sub8_v256,
> +        .op_v128 = INDEX_op_sub8_v128,
> +        .op_v64 = INDEX_op_sub8_v64,
>          .fno = gen_helper_gvec_sub8,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -339,6 +456,9 @@ void tcg_gen_gvec_sub16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>      static const GVecGen3 g = {
>          .extra_value = REP16(0x8000),
>          .fni8x = gen_subv_mask,
> +        .op_v256 = INDEX_op_sub16_v256,
> +        .op_v128 = INDEX_op_sub16_v128,
> +        .op_v64 = INDEX_op_sub16_v64,
>          .fno = gen_helper_gvec_sub16,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -349,6 +469,9 @@ void tcg_gen_gvec_sub32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>  {
>      static const GVecGen3 g = {
>          .fni4 = tcg_gen_sub_i32,
> +        .op_v256 = INDEX_op_sub32_v256,
> +        .op_v128 = INDEX_op_sub32_v128,
> +        .op_v64 = INDEX_op_sub32_v64,
>          .fno = gen_helper_gvec_sub32,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -359,6 +482,8 @@ void tcg_gen_gvec_sub64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>  {
>      static const GVecGen3 g = {
>          .fni8 = tcg_gen_sub_i64,
> +        .op_v256 = INDEX_op_sub64_v256,
> +        .op_v128 = INDEX_op_sub64_v128,
>          .fno = gen_helper_gvec_sub64,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -397,6 +522,9 @@ void tcg_gen_gvec_and8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>  {
>      static const GVecGen3 g = {
>          .fni8 = tcg_gen_and_i64,
> +        .op_v256 = INDEX_op_and_v256,
> +        .op_v128 = INDEX_op_and_v128,
> +        .op_v64 = INDEX_op_and_v64,
>          .fno = gen_helper_gvec_and8,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -407,6 +535,9 @@ void tcg_gen_gvec_or8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>  {
>      static const GVecGen3 g = {
>          .fni8 = tcg_gen_or_i64,
> +        .op_v256 = INDEX_op_or_v256,
> +        .op_v128 = INDEX_op_or_v128,
> +        .op_v64 = INDEX_op_or_v64,
>          .fno = gen_helper_gvec_or8,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -417,6 +548,9 @@ void tcg_gen_gvec_xor8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>  {
>      static const GVecGen3 g = {
>          .fni8 = tcg_gen_xor_i64,
> +        .op_v256 = INDEX_op_xor_v256,
> +        .op_v128 = INDEX_op_xor_v128,
> +        .op_v64 = INDEX_op_xor_v64,
>          .fno = gen_helper_gvec_xor8,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -427,6 +561,9 @@ void tcg_gen_gvec_andc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>  {
>      static const GVecGen3 g = {
>          .fni8 = tcg_gen_andc_i64,
> +        .op_v256 = INDEX_op_andc_v256,
> +        .op_v128 = INDEX_op_andc_v128,
> +        .op_v64 = INDEX_op_andc_v64,
>          .fno = gen_helper_gvec_andc8,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -437,6 +574,9 @@ void tcg_gen_gvec_orc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
>  {
>      static const GVecGen3 g = {
>          .fni8 = tcg_gen_orc_i64,
> +        .op_v256 = INDEX_op_orc_v256,
> +        .op_v128 = INDEX_op_orc_v128,
> +        .op_v64 = INDEX_op_orc_v64,
>          .fno = gen_helper_gvec_orc8,
>      };
>      tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 879b29e81f..86eb4214b0 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -604,7 +604,7 @@ int tcg_global_mem_new_internal(TCGType type, TCGv_ptr base,
>      return temp_idx(s, ts);
>  }
>
> -static int tcg_temp_new_internal(TCGType type, int temp_local)
> +int tcg_temp_new_internal(TCGType type, bool temp_local)
>  {
>      TCGContext *s = &tcg_ctx;
>      TCGTemp *ts;
> @@ -650,7 +650,7 @@ static int tcg_temp_new_internal(TCGType type, int temp_local)
>      return idx;
>  }
>
> -TCGv_i32 tcg_temp_new_internal_i32(int temp_local)
> +TCGv_i32 tcg_temp_new_internal_i32(bool temp_local)
>  {
>      int idx;
>
> @@ -658,7 +658,7 @@ TCGv_i32 tcg_temp_new_internal_i32(int temp_local)
>      return MAKE_TCGV_I32(idx);
>  }
>
> -TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
> +TCGv_i64 tcg_temp_new_internal_i64(bool temp_local)
>  {
>      int idx;
>
> @@ -666,7 +666,7 @@ TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
>      return MAKE_TCGV_I64(idx);
>  }
>
> -static void tcg_temp_free_internal(int idx)
> +void tcg_temp_free_internal(int idx)
>  {
>      TCGContext *s = &tcg_ctx;
>      TCGTemp *ts;
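
One aside on the size-splitting comment in tcg_gen_gvec_3 above, worked
through as an example (assuming a host with both v256 and v128 enabled):

    opsz == 80:
      v256 pass: QEMU_ALIGN_DOWN(80, 32) = 64  ->  two 32-byte ops, 16 bytes left
      v128 pass: QEMU_ALIGN_DOWN(16, 16) = 16  ->  one 16-byte op,  0 bytes left

which matches the "2x32 + 1x16" expansion the comment describes.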


--
Alex Bennée

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations Richard Henderson
  2017-08-22 13:15   ` Alex Bennée
@ 2017-09-08 10:13   ` Alex Bennée
  2017-09-08 13:10     ` Alex Bennée
  1 sibling, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-08 10:13 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/i386/tcg-target.h     |  46 +++++-
>  tcg/tcg-opc.h             |  12 +-
>  tcg/i386/tcg-target.inc.c | 382 ++++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 399 insertions(+), 41 deletions(-)
>
> diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
> index e512648c95..147f82062b 100644
> --- a/tcg/i386/tcg-target.h
> +++ b/tcg/i386/tcg-target.h
> @@ -30,11 +30,10 @@
>
>  #ifdef __x86_64__
>  # define TCG_TARGET_REG_BITS  64
> -# define TCG_TARGET_NB_REGS   16
>  #else
>  # define TCG_TARGET_REG_BITS  32
> -# define TCG_TARGET_NB_REGS    8
>  #endif
> +# define TCG_TARGET_NB_REGS   24
>
>  typedef enum {
>      TCG_REG_EAX = 0,
> @@ -56,6 +55,19 @@ typedef enum {
>      TCG_REG_R13,
>      TCG_REG_R14,
>      TCG_REG_R15,
> +
> +    /* SSE registers; 64-bit has access to 8 more, but we won't
> +       need more than a few and using only the first 8 minimizes
> +       the need for a rex prefix on the sse instructions.  */
> +    TCG_REG_XMM0,
> +    TCG_REG_XMM1,
> +    TCG_REG_XMM2,
> +    TCG_REG_XMM3,
> +    TCG_REG_XMM4,
> +    TCG_REG_XMM5,
> +    TCG_REG_XMM6,
> +    TCG_REG_XMM7,
> +
>      TCG_REG_RAX = TCG_REG_EAX,
>      TCG_REG_RCX = TCG_REG_ECX,
>      TCG_REG_RDX = TCG_REG_EDX,
> @@ -79,6 +91,17 @@ extern bool have_bmi1;
>  extern bool have_bmi2;
>  extern bool have_popcnt;
>
> +#ifdef __SSE2__
> +#define have_sse2  true
> +#else
> +extern bool have_sse2;
> +#endif
> +#ifdef __AVX2__
> +#define have_avx2  true
> +#else
> +extern bool have_avx2;
> +#endif
> +
>  /* optional instructions */
>  #define TCG_TARGET_HAS_div2_i32         1
>  #define TCG_TARGET_HAS_rot_i32          1
> @@ -147,6 +170,25 @@ extern bool have_popcnt;
>  #define TCG_TARGET_HAS_mulsh_i64        0
>  #endif
>
> +#define TCG_TARGET_HAS_v64              have_sse2
> +#define TCG_TARGET_HAS_v128             have_sse2
> +#define TCG_TARGET_HAS_v256             have_avx2
> +
> +#define TCG_TARGET_HAS_andc_v64         TCG_TARGET_HAS_v64
> +#define TCG_TARGET_HAS_orc_v64          0
> +#define TCG_TARGET_HAS_not_v64          0
> +#define TCG_TARGET_HAS_neg_v64          0
> +
> +#define TCG_TARGET_HAS_andc_v128        TCG_TARGET_HAS_v128
> +#define TCG_TARGET_HAS_orc_v128         0
> +#define TCG_TARGET_HAS_not_v128         0
> +#define TCG_TARGET_HAS_neg_v128         0
> +
> +#define TCG_TARGET_HAS_andc_v256        TCG_TARGET_HAS_v256
> +#define TCG_TARGET_HAS_orc_v256         0
> +#define TCG_TARGET_HAS_not_v256         0
> +#define TCG_TARGET_HAS_neg_v256         0
> +
>  #define TCG_TARGET_deposit_i32_valid(ofs, len) \
>      (have_bmi2 ||                              \
>       ((ofs) == 0 && (len) == 8) ||             \
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index b1445a4c24..b84cd584fb 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -212,13 +212,13 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
>  /* Host integer vector operations.  */
>  /* These opcodes are required whenever the base vector size is enabled.  */
>
> -DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
> -DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
> -DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(mov_v64, 1, 1, 0, TCG_OPF_NOT_PRESENT)
> +DEF(mov_v128, 1, 1, 0, TCG_OPF_NOT_PRESENT)
> +DEF(mov_v256, 1, 1, 0, TCG_OPF_NOT_PRESENT)
>
> -DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
> -DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
> -DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
> +DEF(movi_v64, 1, 0, 1, TCG_OPF_NOT_PRESENT)
> +DEF(movi_v128, 1, 0, 1, TCG_OPF_NOT_PRESENT)
> +DEF(movi_v256, 1, 0, 1, TCG_OPF_NOT_PRESENT)

I don't follow: isn't the point of IMPL(TCG_TARGET_HAS_foo) to allow the
definition when the backend adds #define TCG_TARGET_HAS_foo 1?

>
>  DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
>  DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
> diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
> index aeefb72aa0..0e01b54aa0 100644
> --- a/tcg/i386/tcg-target.inc.c
> +++ b/tcg/i386/tcg-target.inc.c
> @@ -31,7 +31,9 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
>      "%r8",  "%r9",  "%r10", "%r11", "%r12", "%r13", "%r14", "%r15",
>  #else
>      "%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi",
> +    NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
>  #endif
> +    "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "%xmm7",
>  };
>  #endif
>
> @@ -61,6 +63,14 @@ static const int tcg_target_reg_alloc_order[] = {
>      TCG_REG_EDX,
>      TCG_REG_EAX,
>  #endif
> +    TCG_REG_XMM0,
> +    TCG_REG_XMM1,
> +    TCG_REG_XMM2,
> +    TCG_REG_XMM3,
> +    TCG_REG_XMM4,
> +    TCG_REG_XMM5,
> +    TCG_REG_XMM6,
> +    TCG_REG_XMM7,
>  };
>
>  static const int tcg_target_call_iarg_regs[] = {
> @@ -94,7 +104,7 @@ static const int tcg_target_call_oarg_regs[] = {
>  #define TCG_CT_CONST_I32 0x400
>  #define TCG_CT_CONST_WSZ 0x800
>
> -/* Registers used with L constraint, which are the first argument
> +/* Registers used with L constraint, which are the first argument
>     registers on x86_64, and two random call clobbered registers on
>     i386. */
>  #if TCG_TARGET_REG_BITS == 64
> @@ -127,6 +137,16 @@ bool have_bmi1;
>  bool have_bmi2;
>  bool have_popcnt;
>
> +#ifndef have_sse2
> +bool have_sse2;
> +#endif
> +#ifdef have_avx2
> +#define have_avx1  have_avx2
> +#else
> +static bool have_avx1;
> +bool have_avx2;
> +#endif
> +
>  #ifdef CONFIG_CPUID_H
>  static bool have_movbe;
>  static bool have_lzcnt;
> @@ -215,6 +235,10 @@ static const char *target_parse_constraint(TCGArgConstraint *ct,
>          /* With TZCNT/LZCNT, we can have operand-size as an input.  */
>          ct->ct |= TCG_CT_CONST_WSZ;
>          break;
> +    case 'x':
> +        ct->ct |= TCG_CT_REG;
> +        tcg_regset_set32(ct->u.regs, 0, 0xff0000);
> +        break;

The documentation on constraints in the README is fairly minimal, and we
keep adding target-specific ones, so perhaps a single-line comment here
for clarity?
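
For instance, just a sketch of what such a comment could say (the mask is
taken from the quoted hunk, where 0xff0000 covers register numbers 16..23,
i.e. TCG_REG_XMM0..TCG_REG_XMM7 in the enum added earlier in this patch):

    case 'x':
        /* 'x' constraint: any SSE register; bits 16..23 of the regset
           map to TCG_REG_XMM0..TCG_REG_XMM7.  */
        ct->ct |= TCG_CT_REG;
        tcg_regset_set32(ct->u.regs, 0, 0xff0000);
        break;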

>
>          /* qemu_ld/st address constraint */
>      case 'L':
> @@ -292,6 +316,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
>  #endif
>  #define P_SIMDF3        0x20000         /* 0xf3 opcode prefix */
>  #define P_SIMDF2        0x40000         /* 0xf2 opcode prefix */
> +#define P_VEXL          0x80000         /* Set VEX.L = 1 */
>
>  #define OPC_ARITH_EvIz	(0x81)
>  #define OPC_ARITH_EvIb	(0x83)
> @@ -324,13 +349,31 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
>  #define OPC_MOVL_Iv     (0xb8)
>  #define OPC_MOVBE_GyMy  (0xf0 | P_EXT38)
>  #define OPC_MOVBE_MyGy  (0xf1 | P_EXT38)
> +#define OPC_MOVDQA_GyMy (0x6f | P_EXT | P_DATA16)
> +#define OPC_MOVDQA_MyGy (0x7f | P_EXT | P_DATA16)
> +#define OPC_MOVDQU_GyMy (0x6f | P_EXT | P_SIMDF3)
> +#define OPC_MOVDQU_MyGy (0x7f | P_EXT | P_SIMDF3)
> +#define OPC_MOVQ_GyMy   (0x7e | P_EXT | P_SIMDF3)
> +#define OPC_MOVQ_MyGy   (0xd6 | P_EXT | P_DATA16)
>  #define OPC_MOVSBL	(0xbe | P_EXT)
>  #define OPC_MOVSWL	(0xbf | P_EXT)
>  #define OPC_MOVSLQ	(0x63 | P_REXW)
>  #define OPC_MOVZBL	(0xb6 | P_EXT)
>  #define OPC_MOVZWL	(0xb7 | P_EXT)
> +#define OPC_PADDB       (0xfc | P_EXT | P_DATA16)
> +#define OPC_PADDW       (0xfd | P_EXT | P_DATA16)
> +#define OPC_PADDD       (0xfe | P_EXT | P_DATA16)
> +#define OPC_PADDQ       (0xd4 | P_EXT | P_DATA16)
> +#define OPC_PAND        (0xdb | P_EXT | P_DATA16)
> +#define OPC_PANDN       (0xdf | P_EXT | P_DATA16)
>  #define OPC_PDEP        (0xf5 | P_EXT38 | P_SIMDF2)
>  #define OPC_PEXT        (0xf5 | P_EXT38 | P_SIMDF3)
> +#define OPC_POR         (0xeb | P_EXT | P_DATA16)
> +#define OPC_PSUBB       (0xf8 | P_EXT | P_DATA16)
> +#define OPC_PSUBW       (0xf9 | P_EXT | P_DATA16)
> +#define OPC_PSUBD       (0xfa | P_EXT | P_DATA16)
> +#define OPC_PSUBQ       (0xfb | P_EXT | P_DATA16)
> +#define OPC_PXOR        (0xef | P_EXT | P_DATA16)
>  #define OPC_POP_r32	(0x58)
>  #define OPC_POPCNT      (0xb8 | P_EXT | P_SIMDF3)
>  #define OPC_PUSH_r32	(0x50)
> @@ -500,7 +543,8 @@ static void tcg_out_modrm(TCGContext *s, int opc, int r, int rm)
>      tcg_out8(s, 0xc0 | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
>  }
>
> -static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
> +static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v,
> +                                int rm, int index)
>  {
>      int tmp;
>
> @@ -515,14 +559,16 @@ static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
>      } else if (opc & P_EXT) {
>          tmp = 1;
>      } else {
> -        tcg_abort();
> +        g_assert_not_reached();
>      }
> -    tmp |= 0x40;                           /* VEX.X */
>      tmp |= (r & 8 ? 0 : 0x80);             /* VEX.R */
> +    tmp |= (index & 8 ? 0 : 0x40);         /* VEX.X */
>      tmp |= (rm & 8 ? 0 : 0x20);            /* VEX.B */
>      tcg_out8(s, tmp);
>
>      tmp = (opc & P_REXW ? 0x80 : 0);       /* VEX.W */
> +    tmp |= (opc & P_VEXL ? 0x04 : 0);      /* VEX.L */
> +
>      /* VEX.pp */
>      if (opc & P_DATA16) {
>          tmp |= 1;                          /* 0x66 */
> @@ -538,7 +584,7 @@ static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
>
>  static void tcg_out_vex_modrm(TCGContext *s, int opc, int r, int v, int rm)
>  {
> -    tcg_out_vex_pfx_opc(s, opc, r, v, rm);
> +    tcg_out_vex_pfx_opc(s, opc, r, v, rm, 0);
>      tcg_out8(s, 0xc0 | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
>  }
>
> @@ -565,7 +611,7 @@ static void tcg_out_opc_pool_imm(TCGContext *s, int opc, int r,
>  static void tcg_out_vex_pool_imm(TCGContext *s, int opc, int r, int v,
>                                   tcg_target_ulong data)
>  {
> -    tcg_out_vex_pfx_opc(s, opc, r, v, 0);
> +    tcg_out_vex_pfx_opc(s, opc, r, v, 0, 0);
>      tcg_out_sfx_pool_imm(s, r, data);
>  }
>
> @@ -574,8 +620,8 @@ static void tcg_out_vex_pool_imm(TCGContext *s, int opc, int r, int v,
>     mode for absolute addresses, ~RM is the size of the immediate operand
>     that will follow the instruction.  */
>
> -static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> -                                     int index, int shift, intptr_t offset)
> +static void tcg_out_sib_offset(TCGContext *s, int r, int rm, int index,
> +                               int shift, intptr_t offset)
>  {
>      int mod, len;
>
> @@ -586,7 +632,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
>              intptr_t pc = (intptr_t)s->code_ptr + 5 + ~rm;
>              intptr_t disp = offset - pc;
>              if (disp == (int32_t)disp) {
> -                tcg_out_opc(s, opc, r, 0, 0);
>                  tcg_out8(s, (LOWREGMASK(r) << 3) | 5);
>                  tcg_out32(s, disp);
>                  return;
> @@ -596,7 +641,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
>                 use of the MODRM+SIB encoding and is therefore larger than
>                 rip-relative addressing.  */
>              if (offset == (int32_t)offset) {
> -                tcg_out_opc(s, opc, r, 0, 0);
>                  tcg_out8(s, (LOWREGMASK(r) << 3) | 4);
>                  tcg_out8(s, (4 << 3) | 5);
>                  tcg_out32(s, offset);
> @@ -604,10 +648,9 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
>              }
>
>              /* ??? The memory isn't directly addressable.  */
> -            tcg_abort();
> +            g_assert_not_reached();
>          } else {
>              /* Absolute address.  */
> -            tcg_out_opc(s, opc, r, 0, 0);
>              tcg_out8(s, (r << 3) | 5);
>              tcg_out32(s, offset);
>              return;
> @@ -630,7 +673,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
>         that would be used for %esp is the escape to the two byte form.  */
>      if (index < 0 && LOWREGMASK(rm) != TCG_REG_ESP) {
>          /* Single byte MODRM format.  */
> -        tcg_out_opc(s, opc, r, rm, 0);
>          tcg_out8(s, mod | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
>      } else {
>          /* Two byte MODRM+SIB format.  */
> @@ -644,7 +686,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
>              tcg_debug_assert(index != TCG_REG_ESP);
>          }
>
> -        tcg_out_opc(s, opc, r, rm, index);
>          tcg_out8(s, mod | (LOWREGMASK(r) << 3) | 4);
>          tcg_out8(s, (shift << 6) | (LOWREGMASK(index) << 3) | LOWREGMASK(rm));
>      }
> @@ -656,6 +697,21 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
>      }
>  }
>
> +static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> +                                     int index, int shift, intptr_t offset)
> +{
> +    tcg_out_opc(s, opc, r, rm < 0 ? 0 : rm, index < 0 ? 0 : index);
> +    tcg_out_sib_offset(s, r, rm, index, shift, offset);
> +}
> +
> +static void tcg_out_vex_modrm_sib_offset(TCGContext *s, int opc, int r, int v,
> +                                         int rm, int index, int shift,
> +                                         intptr_t offset)
> +{
> +    tcg_out_vex_pfx_opc(s, opc, r, v, rm < 0 ? 0 : rm, index < 0 ? 0 : index);
> +    tcg_out_sib_offset(s, r, rm, index, shift, offset);
> +}
> +
>  /* A simplification of the above with no index or shift.  */
>  static inline void tcg_out_modrm_offset(TCGContext *s, int opc, int r,
>                                          int rm, intptr_t offset)
> @@ -663,6 +719,31 @@ static inline void tcg_out_modrm_offset(TCGContext *s, int opc, int r,
>      tcg_out_modrm_sib_offset(s, opc, r, rm, -1, 0, offset);
>  }
>
> +static inline void tcg_out_vex_modrm_offset(TCGContext *s, int opc, int r,
> +                                            int v, int rm, intptr_t offset)
> +{
> +    tcg_out_vex_modrm_sib_offset(s, opc, r, v, rm, -1, 0, offset);
> +}
> +
> +static void tcg_out_maybe_vex_modrm(TCGContext *s, int opc, int r, int rm)
> +{
> +    if (have_avx1) {
> +        tcg_out_vex_modrm(s, opc, r, 0, rm);
> +    } else {
> +        tcg_out_modrm(s, opc, r, rm);
> +    }
> +}
> +
> +static void tcg_out_maybe_vex_modrm_offset(TCGContext *s, int opc, int r,
> +                                           int rm, intptr_t offset)
> +{
> +    if (have_avx1) {
> +        tcg_out_vex_modrm_offset(s, opc, r, 0, rm, offset);
> +    } else {
> +        tcg_out_modrm_offset(s, opc, r, rm, offset);
> +    }
> +}
> +
>  /* Generate dest op= src.  Uses the same ARITH_* codes as tgen_arithi.  */
>  static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
>  {
> @@ -673,12 +754,32 @@ static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
>      tcg_out_modrm(s, OPC_ARITH_GvEv + (subop << 3) + ext, dest, src);
>  }
>
> -static inline void tcg_out_mov(TCGContext *s, TCGType type,
> -                               TCGReg ret, TCGReg arg)
> +static void tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
>  {
>      if (arg != ret) {
> -        int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> -        tcg_out_modrm(s, opc, ret, arg);
> +        int opc = 0;
> +
> +        switch (type) {
> +        case TCG_TYPE_I64:
> +            opc = P_REXW;
> +            /* fallthru */
> +        case TCG_TYPE_I32:
> +            opc |= OPC_MOVL_GvEv;
> +            tcg_out_modrm(s, opc, ret, arg);
> +            break;
> +
> +        case TCG_TYPE_V256:
> +            opc = P_VEXL;
> +            /* fallthru */
> +        case TCG_TYPE_V128:
> +        case TCG_TYPE_V64:
> +            opc |= OPC_MOVDQA_GyMy;
> +            tcg_out_maybe_vex_modrm(s, opc, ret, arg);
> +            break;
> +
> +        default:
> +            g_assert_not_reached();
> +        }
>      }
>  }
>
> @@ -687,6 +788,27 @@ static void tcg_out_movi(TCGContext *s, TCGType type,
>  {
>      tcg_target_long diff;
>
> +    switch (type) {
> +    case TCG_TYPE_I32:
> +    case TCG_TYPE_I64:
> +        break;
> +
> +    case TCG_TYPE_V64:
> +    case TCG_TYPE_V128:
> +    case TCG_TYPE_V256:
> +        /* ??? Revisit this as the implementation progresses.  */
> +        tcg_debug_assert(arg == 0);
> +        if (have_avx1) {
> +            tcg_out_vex_modrm(s, OPC_PXOR, ret, ret, ret);
> +        } else {
> +            tcg_out_modrm(s, OPC_PXOR, ret, ret);
> +        }
> +        return;
> +
> +    default:
> +        g_assert_not_reached();
> +    }
> +
>      if (arg == 0) {
>          tgen_arithr(s, ARITH_XOR, ret, ret);
>          return;
> @@ -750,18 +872,54 @@ static inline void tcg_out_pop(TCGContext *s, int reg)
>      tcg_out_opc(s, OPC_POP_r32 + LOWREGMASK(reg), 0, reg, 0);
>  }
>
> -static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
> -                              TCGReg arg1, intptr_t arg2)
> +static void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
> +                       TCGReg arg1, intptr_t arg2)
>  {
> -    int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> -    tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
> +    switch (type) {
> +    case TCG_TYPE_I64:
> +        tcg_out_modrm_offset(s, OPC_MOVL_GvEv | P_REXW, ret, arg1, arg2);
> +        break;
> +    case TCG_TYPE_I32:
> +        tcg_out_modrm_offset(s, OPC_MOVL_GvEv, ret, arg1, arg2);
> +        break;
> +    case TCG_TYPE_V64:
> +        tcg_out_maybe_vex_modrm_offset(s, OPC_MOVQ_GyMy, ret, arg1, arg2);
> +        break;
> +    case TCG_TYPE_V128:
> +        tcg_out_maybe_vex_modrm_offset(s, OPC_MOVDQU_GyMy, ret, arg1, arg2);
> +        break;
> +    case TCG_TYPE_V256:
> +        tcg_out_vex_modrm_offset(s, OPC_MOVDQU_GyMy | P_VEXL,
> +                                 ret, 0, arg1, arg2);
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
>  }
>
> -static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
> -                              TCGReg arg1, intptr_t arg2)
> +static void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
> +                       TCGReg arg1, intptr_t arg2)
>  {
> -    int opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> -    tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
> +    switch (type) {
> +    case TCG_TYPE_I64:
> +        tcg_out_modrm_offset(s, OPC_MOVL_EvGv | P_REXW, arg, arg1, arg2);
> +        break;
> +    case TCG_TYPE_I32:
> +        tcg_out_modrm_offset(s, OPC_MOVL_EvGv, arg, arg1, arg2);
> +        break;
> +    case TCG_TYPE_V64:
> +        tcg_out_maybe_vex_modrm_offset(s, OPC_MOVQ_MyGy, arg, arg1, arg2);
> +        break;
> +    case TCG_TYPE_V128:
> +        tcg_out_maybe_vex_modrm_offset(s, OPC_MOVDQU_MyGy, arg, arg1, arg2);
> +        break;
> +    case TCG_TYPE_V256:
> +        tcg_out_vex_modrm_offset(s, OPC_MOVDQU_MyGy | P_VEXL,
> +                                 arg, 0, arg1, arg2);
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
>  }
>
>  static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
> @@ -773,6 +931,8 @@ static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
>              return false;
>          }
>          rexw = P_REXW;
> +    } else if (type != TCG_TYPE_I32) {
> +        return false;
>      }
>      tcg_out_modrm_offset(s, OPC_MOVL_EvIz | rexw, 0, base, ofs);
>      tcg_out32(s, val);
> @@ -1914,6 +2074,15 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>          case glue(glue(INDEX_op_, x), _i32)
>  #endif
>
> +#define OP_128_256(x) \
> +        case glue(glue(INDEX_op_, x), _v256): \
> +            rexw = P_VEXL; /* FALLTHRU */     \
> +        case glue(glue(INDEX_op_, x), _v128)
> +
> +#define OP_64_128_256(x) \
> +        OP_128_256(x):   \
> +        case glue(glue(INDEX_op_, x), _v64)
> +
>      /* Hoist the loads of the most common arguments.  */
>      a0 = args[0];
>      a1 = args[1];
> @@ -2379,19 +2548,94 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>          }
>          break;
>
> +    OP_64_128_256(add8):
> +        c = OPC_PADDB;
> +        goto gen_simd;
> +    OP_64_128_256(add16):
> +        c = OPC_PADDW;
> +        goto gen_simd;
> +    OP_64_128_256(add32):
> +        c = OPC_PADDD;
> +        goto gen_simd;
> +    OP_128_256(add64):
> +        c = OPC_PADDQ;
> +        goto gen_simd;
> +    OP_64_128_256(sub8):
> +        c = OPC_PSUBB;
> +        goto gen_simd;
> +    OP_64_128_256(sub16):
> +        c = OPC_PSUBW;
> +        goto gen_simd;
> +    OP_64_128_256(sub32):
> +        c = OPC_PSUBD;
> +        goto gen_simd;
> +    OP_128_256(sub64):
> +        c = OPC_PSUBQ;
> +        goto gen_simd;
> +    OP_64_128_256(and):
> +        c = OPC_PAND;
> +        goto gen_simd;
> +    OP_64_128_256(andc):
> +        c = OPC_PANDN;
> +        goto gen_simd;
> +    OP_64_128_256(or):
> +        c = OPC_POR;
> +        goto gen_simd;
> +    OP_64_128_256(xor):
> +        c = OPC_PXOR;
> +    gen_simd:
> +        if (have_avx1) {
> +            tcg_out_vex_modrm(s, c, a0, a1, a2);
> +        } else {
> +            tcg_out_modrm(s, c, a0, a2);
> +        }
> +        break;
> +
> +    case INDEX_op_ld_v64:
> +        c = TCG_TYPE_V64;
> +        goto gen_simd_ld;
> +    case INDEX_op_ld_v128:
> +        c = TCG_TYPE_V128;
> +        goto gen_simd_ld;
> +    case INDEX_op_ld_v256:
> +        c = TCG_TYPE_V256;
> +    gen_simd_ld:
> +        tcg_out_ld(s, c, a0, a1, a2);
> +        break;
> +
> +    case INDEX_op_st_v64:
> +        c = TCG_TYPE_V64;
> +        goto gen_simd_st;
> +    case INDEX_op_st_v128:
> +        c = TCG_TYPE_V128;
> +        goto gen_simd_st;
> +    case INDEX_op_st_v256:
> +        c = TCG_TYPE_V256;
> +    gen_simd_st:
> +        tcg_out_st(s, c, a0, a1, a2);
> +        break;
> +
>      case INDEX_op_mb:
>          tcg_out_mb(s, a0);
>          break;
>      case INDEX_op_mov_i32:  /* Always emitted via tcg_out_mov.  */
>      case INDEX_op_mov_i64:
> +    case INDEX_op_mov_v64:
> +    case INDEX_op_mov_v128:
> +    case INDEX_op_mov_v256:
>      case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi.  */
>      case INDEX_op_movi_i64:
> +    case INDEX_op_movi_v64:
> +    case INDEX_op_movi_v128:
> +    case INDEX_op_movi_v256:
>      case INDEX_op_call:     /* Always emitted via tcg_out_call.  */
>      default:
>          tcg_abort();
>      }
>
>  #undef OP_32_64
> +#undef OP_128_256
> +#undef OP_64_128_256
>  }
>
>  static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
> @@ -2417,6 +2661,9 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
>          = { .args_ct_str = { "r", "r", "L", "L" } };
>      static const TCGTargetOpDef L_L_L_L
>          = { .args_ct_str = { "L", "L", "L", "L" } };
> +    static const TCGTargetOpDef x_0_x = { .args_ct_str = { "x", "0", "x" } };
> +    static const TCGTargetOpDef x_x_x = { .args_ct_str = { "x", "x", "x" } };
> +    static const TCGTargetOpDef x_r = { .args_ct_str = { "x", "r" } };
>
>      switch (op) {
>      case INDEX_op_goto_ptr:
> @@ -2620,6 +2867,52 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
>              return &s2;
>          }
>
> +    case INDEX_op_ld_v64:
> +    case INDEX_op_ld_v128:
> +    case INDEX_op_ld_v256:
> +    case INDEX_op_st_v64:
> +    case INDEX_op_st_v128:
> +    case INDEX_op_st_v256:
> +        return &x_r;
> +
> +    case INDEX_op_add8_v64:
> +    case INDEX_op_add8_v128:
> +    case INDEX_op_add16_v64:
> +    case INDEX_op_add16_v128:
> +    case INDEX_op_add32_v64:
> +    case INDEX_op_add32_v128:
> +    case INDEX_op_add64_v128:
> +    case INDEX_op_sub8_v64:
> +    case INDEX_op_sub8_v128:
> +    case INDEX_op_sub16_v64:
> +    case INDEX_op_sub16_v128:
> +    case INDEX_op_sub32_v64:
> +    case INDEX_op_sub32_v128:
> +    case INDEX_op_sub64_v128:
> +    case INDEX_op_and_v64:
> +    case INDEX_op_and_v128:
> +    case INDEX_op_andc_v64:
> +    case INDEX_op_andc_v128:
> +    case INDEX_op_or_v64:
> +    case INDEX_op_or_v128:
> +    case INDEX_op_xor_v64:
> +    case INDEX_op_xor_v128:
> +        return have_avx1 ? &x_x_x : &x_0_x;
> +
> +    case INDEX_op_add8_v256:
> +    case INDEX_op_add16_v256:
> +    case INDEX_op_add32_v256:
> +    case INDEX_op_add64_v256:
> +    case INDEX_op_sub8_v256:
> +    case INDEX_op_sub16_v256:
> +    case INDEX_op_sub32_v256:
> +    case INDEX_op_sub64_v256:
> +    case INDEX_op_and_v256:
> +    case INDEX_op_andc_v256:
> +    case INDEX_op_or_v256:
> +    case INDEX_op_xor_v256:
> +        return &x_x_x;
> +
>      default:
>          break;
>      }
> @@ -2725,9 +3018,16 @@ static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
>  static void tcg_target_init(TCGContext *s)
>  {
>  #ifdef CONFIG_CPUID_H
> -    unsigned a, b, c, d;
> +    unsigned a, b, c, d, b7 = 0;
>      int max = __get_cpuid_max(0, 0);
>
> +    if (max >= 7) {
> +        /* BMI1 is available on AMD Piledriver and Intel Haswell CPUs.  */
> +        __cpuid_count(7, 0, a, b7, c, d);
> +        have_bmi1 = (b7 & bit_BMI) != 0;
> +        have_bmi2 = (b7 & bit_BMI2) != 0;
> +    }
> +
>      if (max >= 1) {
>          __cpuid(1, a, b, c, d);
>  #ifndef have_cmov
> @@ -2736,17 +3036,26 @@ static void tcg_target_init(TCGContext *s)
>             available, we'll use a small forward branch.  */
>          have_cmov = (d & bit_CMOV) != 0;
>  #endif
> +#ifndef have_sse2
> +        have_sse2 = (d & bit_SSE2) != 0;
> +#endif
>          /* MOVBE is only available on Intel Atom and Haswell CPUs, so we
>             need to probe for it.  */
>          have_movbe = (c & bit_MOVBE) != 0;
>          have_popcnt = (c & bit_POPCNT) != 0;
> -    }
>
> -    if (max >= 7) {
> -        /* BMI1 is available on AMD Piledriver and Intel Haswell CPUs.  */
> -        __cpuid_count(7, 0, a, b, c, d);
> -        have_bmi1 = (b & bit_BMI) != 0;
> -        have_bmi2 = (b & bit_BMI2) != 0;
> +#ifndef have_avx2
> +        /* There are a number of things we must check before we can be
> +           sure of not hitting invalid opcode.  */
> +        if (c & bit_OSXSAVE) {
> +            unsigned xcrl, xcrh;
> +            asm ("xgetbv" : "=a" (xcrl), "=d" (xcrh) : "c" (0));
> +            if ((xcrl & 6) == 6) {
> +                have_avx1 = (c & bit_AVX) != 0;
> +                have_avx2 = (b7 & bit_AVX2) != 0;
> +            }
> +        }
> +#endif
>      }
>
>      max = __get_cpuid_max(0x8000000, 0);
> @@ -2763,6 +3072,13 @@ static void tcg_target_init(TCGContext *s)
>      } else {
>          tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xff);
>      }
> +    if (have_sse2) {
> +        tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V64], 0, 0xff0000);
> +        tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V128], 0, 0xff0000);
> +    }
> +    if (have_avx2) {
> +        tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V256], 0, 0xff0000);
> +    }
>
>      tcg_regset_clear(tcg_target_call_clobber_regs);
>      tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_EAX);
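
For reference, the AVX availability check added above boils down to the
following standalone sketch.  It is illustrative only: cpu_has_avx() is a
made-up helper, while __get_cpuid_max/__cpuid and the bit_* constants are
the standard <cpuid.h> ones already used here.  AVX may only be used once
the OS has enabled XSAVE and XCR0 reports both SSE (bit 1) and AVX (bit 2)
state:

    #include <cpuid.h>
    #include <stdbool.h>

    static bool cpu_has_avx(void)
    {
        unsigned a, b, c, d, xcrl, xcrh;

        if (__get_cpuid_max(0, 0) < 1) {
            return false;
        }
        __cpuid(1, a, b, c, d);
        if (!(c & bit_OSXSAVE) || !(c & bit_AVX)) {
            return false;
        }
        /* XGETBV with ECX=0 reads XCR0.  */
        asm ("xgetbv" : "=a" (xcrl), "=d" (xcrh) : "c" (0));
        /* Parenthesize: == binds tighter than & in C.  */
        return (xcrl & 6) == 6;
    }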


--
Alex Bennée


* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
  2017-09-08 10:13   ` Alex Bennée
@ 2017-09-08 13:10     ` Alex Bennée
  2017-09-10  2:44       ` Richard Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-08 13:10 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Alex Bennée <alex.bennee@linaro.org> writes:

> Richard Henderson <richard.henderson@linaro.org> writes:
>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
<snip>

Also this commit breaks RISU:

 qemu-aarch64 build/aarch64-linux-gnu/risu
    testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin \
    -t testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin.trace

Gives:

mismatch detail (master : apprentice):
  V29   : 000000000000000005388083c1444242 vs 00000000000000002a000e0416a30018

The insn is:

  37c:       6f56a29d        umull2  v29.4s, v20.8h, v6.h[1]

Which is odd because I didn't think we'd touched that.

You can find my bundle of testcases with trace files at:

  http://people.linaro.org/~alex.bennee/testcases/arm64.risu/aarch64-patterns-v8dot0.tar.xz

Which is used in our master RISU tracking job:

  https://validation.linaro.org/results/query/~alex.bennee/master-aarch64-risu-results

--
Alex Bennée


* Re: [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion
  2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
                   ` (7 preceding siblings ...)
  2017-08-17 23:01 ` [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations Richard Henderson
@ 2017-09-08 13:49 ` Alex Bennée
  2017-09-08 16:05   ` Richard Henderson
  8 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-08 13:49 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> When Alex and I started talking about this topic, this is the direction
> I was thinking.  The primary difference from Alex's version is that the
> interface on the target/cpu/ side uses offsets and not a faux temp.  The
> secondary difference is that, for smaller vector sizes at least, I will
> expand to inline host vector operations.  The use of explicit offsets
> aids that.
<snip>

OK I think this is a lot more complete than my pass. I'm done with my
review for now, I look forward to the next version. It looks like most
of the pre-requisites are merged now?

--
Alex Bennée


* Re: [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion
  2017-09-08 13:49 ` [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Alex Bennée
@ 2017-09-08 16:05   ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-09-08 16:05 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, qemu-arm

On 09/08/2017 06:49 AM, Alex Bennée wrote:
> 
> Richard Henderson <richard.henderson@linaro.org> writes:
> 
>> When Alex and I started talking about this topic, this is the direction
>> I was thinking.  The primary difference from Alex's version is that the
>> interface on the target/cpu/ side uses offsets and not a faux temp.  The
>> secondary difference is that, for smaller vector sizes at least, I will
>> expand to inline host vector operations.  The use of explicit offsets
>> aids that.
> <snip>
> 
> OK I think this is a lot more complete than my pass. I'm done with my
> review for now, I look forward to the next version. It looks like most
> of the pre-requisites are merged now?

Yep, all lead-up patches are now in.  Thanks for the review.

As I work on the next version, I'll do aarch64 host as well for comparison.


r~


* Re: [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic
  2017-09-07 16:58   ` Alex Bennée
@ 2017-09-10  1:43     ` Richard Henderson
  2017-09-11  9:12       ` Alex Bennée
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-09-10  1:43 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, qemu-arm

On 09/07/2017 09:58 AM, Alex Bennée wrote:
>> +    switch (size + 4 * is_u) {
> 
> Hmm I find this switch a little too magical. I mean I can see that the
> encoding abuses size for the final opcode when I look at the manual but
> it reads badly.
> 
>> +    case 0: /* AND */
>> +        gvec_op = tcg_gen_gvec_and8;
>> +        goto do_gvec;
>> +    case 1: /* BIC */
>> +        gvec_op = tcg_gen_gvec_andc8;
>> +        goto do_gvec;
>> +    case 2: /* ORR */
>> +        gvec_op = tcg_gen_gvec_or8;
>> +        goto do_gvec;
>> +    case 3: /* ORN */
>> +        gvec_op = tcg_gen_gvec_orc8;
>> +        goto do_gvec;
>> +    case 4: /* EOR */
>> +        gvec_op = tcg_gen_gvec_xor8;
>> +        goto do_gvec;
>> +    do_gvec:
>> +        gvec_op(vec_full_reg_offset(s, rd),
>> +                vec_full_reg_offset(s, rn),
>> +                vec_full_reg_offset(s, rm),
>> +                is_q ? 16 : 8, vec_full_reg_size(s));
>> +        return;
> 
> No default case (although I guess we just fall through). What's wrong
> with just having a !is_u test with gvec_op = tbl[size] and skipping all
> the goto stuff?

Because that would still leave EOR out in the woods.
I do think this is the cleanest way to filter out these 5 operations.
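
Spelling the existing decode out as a table, for reference (nothing new
here, just size:U written out):

    size + 4 * is_u:   0 AND   1 BIC   2 ORR   3 ORN    (U == 0)
                       4 EOR   5 BSL   6 BIT   7 BIF    (U == 1)

so a tbl[size] lookup gated on !is_u would cover AND/BIC/ORR/ORN but not
EOR, which sits in the U == 1 half alongside BSL/BIT/BIF.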


r~


* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
  2017-09-08 13:10     ` Alex Bennée
@ 2017-09-10  2:44       ` Richard Henderson
  2017-09-11  9:07         ` Alex Bennée
  0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-09-10  2:44 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, qemu-arm

On 09/08/2017 06:10 AM, Alex Bennée wrote:
> Also this commit breaks RISU:
> 
>  qemu-aarch64 build/aarch64-linux-gnu/risu
>     testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin \
>     -t testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin.trace
> 
> Gives:
> 
> mismatch detail (master : apprentice):
>   V29   : 000000000000000005388083c1444242 vs 00000000000000002a000e0416a30018
> 
> The insn is:
> 
>   37c:       6f56a29d        umull2  v29.4s, v20.8h, v6.h[1]
> 
> Which is odd because I didn't think we'd touched that.

Indeed we didn't.  Still, I'll check it out next week.


r~


* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
  2017-09-10  2:44       ` Richard Henderson
@ 2017-09-11  9:07         ` Alex Bennée
  2017-09-12 13:52           ` Richard Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-11  9:07 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> On 09/08/2017 06:10 AM, Alex Bennée wrote:
>> Also this commit breaks RISU:
>>
>>  qemu-aarch64 build/aarch64-linux-gnu/risu
>>     testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin \
>>     -t testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin.trace
>>
>> Gives:
>>
>> mismatch detail (master : apprentice):
>>   V29   : 000000000000000005388083c1444242 vs 00000000000000002a000e0416a30018
>>
>> The insn is:
>>
>>   37c:       6f56a29d        umull2  v29.4s, v20.8h, v6.h[1]
>>
>> Which is odd because I didn't think we'd touched that.
>
> Indeed we didn't.  Still, I'll check it out next week.

OK it would help if I had objdumped the right file:

     36c:       0e781fdd        bic     v29.8b, v30.8b, v24.8b
     370:       00005af0        .inst   0x00005af0 ; undefined

--
Alex Bennée


* Re: [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic
  2017-09-10  1:43     ` Richard Henderson
@ 2017-09-11  9:12       ` Alex Bennée
  2017-09-11 18:09         ` Richard Henderson
  0 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-11  9:12 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> On 09/07/2017 09:58 AM, Alex Bennée wrote:
>>> +    switch (size + 4 * is_u) {
>>
>> Hmm I find this switch a little too magical. I mean I can see that the
>> encoding abuses size for the final opcode when I look at the manual but
>> it reads badly.
>>
>>> +    case 0: /* AND */
>>> +        gvec_op = tcg_gen_gvec_and8;
>>> +        goto do_gvec;
>>> +    case 1: /* BIC */
>>> +        gvec_op = tcg_gen_gvec_andc8;
>>> +        goto do_gvec;
>>> +    case 2: /* ORR */
>>> +        gvec_op = tcg_gen_gvec_or8;
>>> +        goto do_gvec;
>>> +    case 3: /* ORN */
>>> +        gvec_op = tcg_gen_gvec_orc8;
>>> +        goto do_gvec;
>>> +    case 4: /* EOR */
>>> +        gvec_op = tcg_gen_gvec_xor8;
>>> +        goto do_gvec;
>>> +    do_gvec:
>>> +        gvec_op(vec_full_reg_offset(s, rd),
>>> +                vec_full_reg_offset(s, rn),
>>> +                vec_full_reg_offset(s, rm),
>>> +                is_q ? 16 : 8, vec_full_reg_size(s));
>>> +        return;
>>
>> No default case (although I guess we just fall through). What's wrong
>> with just having a !is_u test with gvec_op = tbl[size] and skipping all
>> the goto stuff?
>
> Because that would still leave EOR out in the woods.
> I do think this is the cleanest way to filter out these 5 operations.

Is this going to look better if the other operations in this branch of
the decode are converted as well?

--
Alex Bennée


* Re: [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic
  2017-09-11  9:12       ` Alex Bennée
@ 2017-09-11 18:09         ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-09-11 18:09 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, qemu-arm

On 09/11/2017 02:12 AM, Alex Bennée wrote:
> 
> Richard Henderson <richard.henderson@linaro.org> writes:
> 
>> On 09/07/2017 09:58 AM, Alex Bennée wrote:
>>>> +    switch (size + 4 * is_u) {
>>>
>>> Hmm I find this switch a little too magical. I mean I can see that the
>>> encoding abuses size for the final opcode when I look at the manual but
>>> it reads badly.
>>>
>>>> +    case 0: /* AND */
>>>> +        gvec_op = tcg_gen_gvec_and8;
>>>> +        goto do_gvec;
>>>> +    case 1: /* BIC */
>>>> +        gvec_op = tcg_gen_gvec_andc8;
>>>> +        goto do_gvec;
>>>> +    case 2: /* ORR */
>>>> +        gvec_op = tcg_gen_gvec_or8;
>>>> +        goto do_gvec;
>>>> +    case 3: /* ORN */
>>>> +        gvec_op = tcg_gen_gvec_orc8;
>>>> +        goto do_gvec;
>>>> +    case 4: /* EOR */
>>>> +        gvec_op = tcg_gen_gvec_xor8;
>>>> +        goto do_gvec;
>>>> +    do_gvec:
>>>> +        gvec_op(vec_full_reg_offset(s, rd),
>>>> +                vec_full_reg_offset(s, rn),
>>>> +                vec_full_reg_offset(s, rm),
>>>> +                is_q ? 16 : 8, vec_full_reg_size(s));
>>>> +        return;
>>>
>>> No default case (although I guess we just fall through). What's wrong
>>> with just having a !is_u test with gvec_op = tbl[size] and skipping all
>>> the goto stuff?
>>
>> Because that would still leave EOR out in the woods.
>> I do think this is the cleanest way to filter out these 5 operations.
> 
> Is this going to look better if the other operations in this branch of
> the decode are converted as well?

It might do.  I'll have to think about those some more.

Indeed, perhaps that's exactly what I ought to do.  Those are complex logical
operations which certainly will not get their own opcodes, but are "simple" in
that they can be implemented in terms of vector xor+and, so there's no reason
we couldn't expand those inline for all vector-supporting hosts.
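
As a concrete sketch of that idea (not code from this series: gen_bsl_gvec
is a made-up name, and scratch_ofs assumes a spare vector-sized slot in
CPUARMState that does not exist today), BSL would reduce to three of the
generic ops already introduced here:

    static void gen_bsl_gvec(DisasContext *s, int rd, int rn, int rm,
                             bool is_q, int scratch_ofs)
    {
        /* BSL: rd = rm ^ ((rn ^ rm) & rd).  */
        int dofs = vec_full_reg_offset(s, rd);
        int nofs = vec_full_reg_offset(s, rn);
        int mofs = vec_full_reg_offset(s, rm);
        int opr_sz = is_q ? 16 : 8;
        int max_sz = vec_full_reg_size(s);

        tcg_gen_gvec_xor8(scratch_ofs, nofs, mofs, opr_sz, max_sz);
        tcg_gen_gvec_and8(scratch_ofs, scratch_ofs, dofs, opr_sz, max_sz);
        tcg_gen_gvec_xor8(dofs, scratch_ofs, mofs, opr_sz, max_sz);
    }

BIT has the same shape with the operands permuted, and BIF additionally
uses andc in place of and.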


r~


* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
  2017-09-11  9:07         ` Alex Bennée
@ 2017-09-12 13:52           ` Richard Henderson
  0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-09-12 13:52 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, qemu-arm

On 09/11/2017 02:07 AM, Alex Bennée wrote:
> 
> Richard Henderson <richard.henderson@linaro.org> writes:
> 
>> On 09/08/2017 06:10 AM, Alex Bennée wrote:
>>> Also this commit breaks RISU:
>>>
>>>  qemu-aarch64 build/aarch64-linux-gnu/risu
>>>     testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin \
>>>     -t testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin.trace
>>>
>>> Gives:
>>>
>>> mismatch detail (master : apprentice):
>>>   V29   : 000000000000000005388083c1444242 vs 00000000000000002a000e0416a30018
>>>
>>> The insn is:
>>>
>>>   37c:       6f56a29d        umull2  v29.4s, v20.8h, v6.h[1]
>>>
>>> Which is odd because I didn't think we'd touched that.
>>
>> Indeed we didn't.  Still, I'll check it out next week.
> 
> OK it would help if I had objdumped the right file:
> 
>      36c:       0e781fdd        bic     v29.8b, v30.8b, v24.8b
>      370:       00005af0        .inst   0x00005af0 ; undefined

Thanks.  The SSE pandn operand order is ... surprising.
Even though I know that, I still managed to get it wrong.
Fixed for v2.
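
For the record, the surprise is that pandn complements its first source
(the destination for legacy SSE, the vvvv register for VEX) rather than
the second.  A sketch of the issue, where a0/a1/a2 refer to the args in
the gen_simd path of this patch:

    /* Legacy SSE:  pandn  xmm1, xmm2/m128   =>  xmm1 = ~xmm1 & xmm2
       VEX:         vpandn xmm1, xmm2, xmm3  =>  xmm1 = ~xmm2 & xmm3  */

    tcg_out_vex_modrm(s, OPC_PANDN, a0, a1, a2);  /* a0 = ~a1 & a2 */

    /* ...which inverts the wrong input for TCG's andc (a0 = a1 & ~a2),
       so the two sources must be swapped when emitting PANDN/VPANDN.  */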


r~


Thread overview: 36+ messages
2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
2017-08-17 23:01 ` [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic Richard Henderson
2017-08-30  1:31   ` Philippe Mathieu-Daudé
2017-09-01 20:38     ` Richard Henderson
2017-09-07 16:34   ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic Richard Henderson
2017-09-07 16:58   ` Alex Bennée
2017-09-10  1:43     ` Richard Henderson
2017-09-11  9:12       ` Alex Bennée
2017-09-11 18:09         ` Richard Henderson
2017-08-17 23:01 ` [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors Richard Henderson
2017-08-17 23:46   ` Philippe Mathieu-Daudé
2017-09-07 18:18   ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 4/8] tcg: Add operations " Richard Henderson
2017-08-30  1:34   ` Philippe Mathieu-Daudé
2017-09-07 19:00   ` Alex Bennée
2017-09-07 19:02     ` Richard Henderson
2017-09-08  9:28       ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported Richard Henderson
2017-08-17 23:44   ` Philippe Mathieu-Daudé
2017-09-07 19:02   ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid Richard Henderson
2017-08-17 23:45   ` Philippe Mathieu-Daudé
2017-09-08  9:30   ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 7/8] tcg: Expand target vector ops with host vector ops Richard Henderson
2017-09-08  9:34   ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations Richard Henderson
2017-08-22 13:15   ` Alex Bennée
2017-08-23 19:02     ` Richard Henderson
2017-09-08 10:13   ` Alex Bennée
2017-09-08 13:10     ` Alex Bennée
2017-09-10  2:44       ` Richard Henderson
2017-09-11  9:07         ` Alex Bennée
2017-09-12 13:52           ` Richard Henderson
2017-09-08 13:49 ` [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Alex Bennée
2017-09-08 16:05   ` Richard Henderson
